How to Pinpoint Agent Failures in LLM Multi-Agent Systems: A Step-by-Step Automated Attribution Guide

When your LLM-powered multi-agent system flops despite all agents buzzing with activity, you're left wondering: which agent caused the failure, and when did it happen? Manually sifting through endless interaction logs is like searching for a needle in a haystack. Researchers from Penn State University and Duke University, in collaboration with Google DeepMind and others, introduced Automated Failure Attribution to solve this, along with the Who&When benchmark and evaluation methods. This guide walks you through applying that approach to your own system.

What You Need

- Interaction logs from your multi-agent system (agent messages, actions, and their order)
- Access to a strong LLM API (e.g., GPT-4 or Claude) for LLM-based attribution
- The Who&When benchmark and the authors' open-source code from their GitHub repository
- A handful of known failed tasks to validate the attribution results against

Step-by-Step Guide

Step 1: Understand the Problem

Automated Failure Attribution aims to identify two things: who (which agent) and when (at which step in the interaction) the failure originated. The failure might stem from a single agent’s mistake, a miscommunication between agents, or a cascading error. Reading the original paper (accepted as a Spotlight at ICML 2025) will give you the necessary background. Familiarize yourself with the Who&When benchmark—it contains annotated failure cases from various multi-agent tasks.


Step 2: Prepare Your System’s Interaction Logs

Your logs must capture the full chain of agent actions and messages. Structure them as a sequence of steps: each step should include the agent’s id, the timestamp or order, the output (e.g., text, tool call), and any internal reasoning. Convert your logs into the format used by the Who&When dataset (usually JSON). The authors’ GitHub repository provides an example schema. Key fields: task_id, step_index, agent_name, message, action, context.
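The conversion step can be sketched as a small transform function. The field names below (task_id, step_index, agent_name, message, action, context) come from the article; the raw-log field names (agent, text, tool) are illustrative stand-ins for whatever your framework emits, so check the authors' repository for the exact schema.

```python
import json

# Hypothetical raw entries from your own system; adapt the field names
# to whatever your agent framework actually logs.
raw_log = [
    {"agent": "planner", "text": "Break the task into subtasks.", "tool": None},
    {"agent": "coder", "text": "def solve(): ...", "tool": "python_executor"},
]

def to_whowhen_format(task_id, raw_log):
    """Map raw entries onto the key fields named in this guide:
    task_id, step_index, agent_name, message, action, context."""
    steps = []
    for i, entry in enumerate(raw_log):
        steps.append({
            "task_id": task_id,
            "step_index": i,
            "agent_name": entry["agent"],
            "message": entry["text"],
            "action": entry.get("tool"),
            # Fall back to an empty string if your logs carry no extra context.
            "context": entry.get("context", ""),
        })
    return steps

# Serialize to JSON, the format the Who&When dataset uses.
whowhen_json = json.dumps(to_whowhen_format("task_001", raw_log), indent=2)
```

Keep step_index strictly sequential: the "when" half of the attribution is reported as a step number, so gaps or reordering in your log will shift the predictions.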

Step 3: Choose an Attribution Method

The paper evaluates several automated attribution methods, from simple heuristics to advanced LLM-based reasoning. For your first attempt, start with the direct prompting method: feed the entire log to a strong LLM (like GPT-4 or Claude) and ask it to output the failing agent and step. Alternatively, use the timeline-based method that checks each step for consistency. The codebase includes scripts for both. Select the method that best fits your system’s complexity and your available API budget.
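For the direct prompting method, the core is just a prompt builder and a response parser. This is a minimal sketch, not the authors' exact prompt; the JSON-reply convention and the regex-based extraction are assumptions you should adapt to your model's output style.

```python
import json
import re

def build_prompt(log_steps):
    """Render the full log as numbered lines and ask the LLM to name
    the failing agent and step (the 'who' and 'when')."""
    lines = [
        f"[{s['step_index']}] {s['agent_name']}: {s['message']}"
        for s in log_steps
    ]
    return (
        "Below is the full interaction log of a multi-agent system that "
        "failed at its task. Identify which agent caused the failure and "
        "at which step it originated. Reply with a JSON object with keys "
        "'agent' and 'step'.\n\nLog:\n" + "\n".join(lines)
    )

def parse_attribution(response_text):
    """Pull the first JSON object out of the model's reply."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in LLM response")
    return json.loads(match.group(0))
```

You would send `build_prompt(...)` through your provider's client and feed the raw reply to `parse_attribution`. For long logs, watch the model's context window; the whole-log approach is also the most expensive option per task.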

Step 4: Run the Attribution Analysis

Once you’ve prepared your logs and chosen a method, execute the attribution script. For direct prompting, the script will send the log to the LLM and parse the response. For timeline-based, it will scan for anomalies like contradictions or repeated loops. Monitor the outputs for each failed task. If errors occur, check that your log format matches exactly the expected input.

Step 5: Interpret the Results

The script will output a predicted who (agent id) and when (step number) for each failure. Compare these predictions against your own manual inspection for a few cases to gauge accuracy. Look for patterns: does the same agent often cause failures? Are failures clustered at specific interaction steps? Use the Who&When benchmark as a reference; your system’s failure rate and attribution difficulty may differ.
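To spot the patterns mentioned above across many failed tasks, a simple aggregation over the predictions is enough. The prediction dict shape here assumes the `agent`/`step` output format from the earlier steps.

```python
from collections import Counter

def failure_patterns(predictions):
    """Given a list of {'agent': ..., 'step': ...} predictions, return
    (agents ranked by failure count, steps ranked by failure count)."""
    agent_counts = Counter(p["agent"] for p in predictions)
    step_counts = Counter(p["step"] for p in predictions)
    return agent_counts.most_common(), step_counts.most_common()
```

If one agent dominates the ranking, inspect its prompt and tools first; if failures cluster at early steps, the problem is more likely in task decomposition than in execution.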

Step 6: Iterate and Improve

If the attribution is consistently wrong, refine your approach. You might need to reformat your logs to include more context, or switch to a more robust method (e.g., chain-of-thought prompting). The open-source code allows easy modification. Consider implementing a hybrid method: use a rule-based pre-filter to isolate suspicious steps, then apply LLM-based attribution only on those.
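The hybrid idea can be sketched as a pre-filter plus a pluggable model call. The suspicion rules below are illustrative assumptions (error strings, empty output); `llm_fn` is a placeholder for whatever attribution call you settled on in Step 3.

```python
def is_suspicious(step):
    """Cheap rule-based check; tune these rules to your own failure modes."""
    msg = (step.get("message") or "").lower()
    return not msg or "error" in msg or "traceback" in msg

def hybrid_attribute(log_steps, llm_fn):
    """Pre-filter the log with cheap rules, then run LLM-based attribution
    (llm_fn) only on the suspicious steps to save API budget."""
    suspects = [s for s in log_steps if is_suspicious(s)]
    if not suspects:
        suspects = log_steps  # nothing obvious: fall back to the full log
    return llm_fn(suspects)
```

The fallback matters: some failures (e.g., a confidently wrong plan) leave no surface-level error text, so the rules alone would filter the culprit out.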

Tips for Success

- Validate on a few manually diagnosed failures first, so you know the method works before trusting it at scale.
- Keep your log schema exactly aligned with the expected input format; most script errors trace back to mismatched fields.
- Start with the cheapest method (direct prompting on a strong LLM) and only move to hybrid or chain-of-thought variants if accuracy falls short.
- Track API costs per attributed failure; whole-log prompting gets expensive as interaction traces grow.

With these steps, you can move from manual log archaeology to fast, automated failure attribution—saving hours of debugging and accelerating the improvement of your multi-agent systems.
