Evaluating Multi-Agent Systems

Evaluating a multi-agent system is more complex than evaluating a single model response. You need to evaluate both the final output and the interactions that produced it.

A multi-agent system can fail even if the final answer looks plausible. Agents may have duplicated work, ignored evidence, used the wrong tools, missed a conflict, or violated permissions. The final response is only the visible surface.

A strong evaluation strategy looks at:

Final output quality
Individual agent performance
Coordination quality
Tool-use correctness
Conflict handling
Cost and latency
Safety and policy compliance
Robustness across edge cases

Output evaluation

Output evaluation asks whether the final answer or artifact satisfies the user’s goal.

For a research synthesis system, evaluate:

Accuracy
Source support
Completeness
Clarity
Balanced uncertainty
Correct formatting
Relevance to the user’s question

For a coding team of agents, evaluate:

Tests passing
Correctness of implementation
Security risks
Minimal unnecessary changes
Documentation updates
Review comments addressed

Output evaluation can use human review, automated tests, reference answers, LLM-as-judge rubrics, or domain-specific validators.
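
One way to combine the research-synthesis criteria above into a single number is a weighted rubric. The sketch below is only an illustration: the weights are assumptions chosen for the example, and the per-criterion scores would come from whichever method applies (human review, an LLM judge, or a validator).

```python
# Criterion names follow the research-synthesis list above.
# The weights are illustrative assumptions, not recommended values.
RUBRIC_WEIGHTS = {
    "accuracy": 0.30,
    "source_support": 0.20,
    "completeness": 0.15,
    "clarity": 0.10,
    "balanced_uncertainty": 0.10,
    "formatting": 0.05,
    "relevance": 0.10,
}


def rubric_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores in [0, 1] into one weighted score."""
    missing = sorted(set(RUBRIC_WEIGHTS) - set(scores))
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name in RUBRIC_WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {scores[name]}")
    # Weighted sum; the weights above add up to 1.0.
    return sum(weight * scores[name] for name, weight in RUBRIC_WEIGHTS.items())
```

A single score hides which criterion failed, so it is usually worth storing the per-criterion scores alongside it.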

Trajectory evaluation

Trajectory evaluation examines the path the system took.

A trajectory includes:

user request
→ orchestrator plan
→ subagent assignments
→ subagent messages
→ tool calls
→ intermediate outputs
→ critiques
→ revisions
→ final synthesis

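A good trajectory should be efficient (no redundant steps), coherent (each step builds on the previous ones), and consistent with the intended architecture (each agent acts only within its assigned role).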

Example trace:

{
  "task": "Draft a security review of pull request 482",
  "trajectory": [
    { "agent": "Planner", "action": "created review plan" },
    { "agent": "CodeAnalyzer", "action": "summarized diff" },
    { "agent": "SecurityReviewer", "action": "flagged SQL injection risk" },
    { "agent": "TestAgent", "action": "reported missing regression test" },
    { "agent": "Finalizer", "action": "merged findings into review" }
  ],
  "final_status": "changes_requested"
}

This trajectory is inspectable. If the final review is poor, developers can see where the failure happened.
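
Because the trace is structured data, some checks can be automated. The following is a minimal sketch, assuming traces use the `agent` and `action` fields shown above; the function name `check_trajectory` and the specific checks are illustrative choices, not a standard.

```python
from collections import Counter

# The example trace from above, written as a Python dict.
TRACE = {
    "task": "Draft a security review of pull request 482",
    "trajectory": [
        {"agent": "Planner", "action": "created review plan"},
        {"agent": "CodeAnalyzer", "action": "summarized diff"},
        {"agent": "SecurityReviewer", "action": "flagged SQL injection risk"},
        {"agent": "TestAgent", "action": "reported missing regression test"},
        {"agent": "Finalizer", "action": "merged findings into review"},
    ],
    "final_status": "changes_requested",
}


def check_trajectory(trace: dict, required_agents: list[str], final_agent: str) -> list[str]:
    """Return a list of problems found in the trace (empty means none found)."""
    steps = trace["trajectory"]
    agents = [step["agent"] for step in steps]
    problems = []

    # Every required role should appear at least once.
    for agent in required_agents:
        if agent not in agents:
            problems.append(f"missing agent: {agent}")

    # An identical (agent, action) pair repeated is a hint of duplicated work.
    repeats = Counter((step["agent"], step["action"]) for step in steps)
    for (agent, action), count in repeats.items():
        if count > 1:
            problems.append(f"duplicate step: {agent} {action!r} x{count}")

    # The trajectory should end with the agent that produces the final output.
    if not agents or agents[-1] != final_agent:
        problems.append(f"last step is not {final_agent}")

    return problems
```

For the example trace, `check_trajectory(TRACE, ["Planner", "SecurityReviewer", "TestAgent"], "Finalizer")` returns an empty list. Structural checks like these catch only coarse failures; judging whether each step was a good one still needs review.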

Role-specific evaluation

Each agent should have role-specific success criteria.

Example:

{
  "agent": "ResearchAgent",
  "success_criteria": [
    "Finds primary sources when available",
    "Extracts claims without adding unsupported interpretation",
    "Reports uncertainty and missing information"
  ]
}

A writer agent should not be evaluated the same way as a security reviewer. Role-specific evaluation prevents vague scoring.
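
Role-specific criteria can be paired with one check each. The sketch below is hypothetical: the output fields (`sources`, `primary`, `claims`, `source_id`, `uncertainty`) are assumptions made for illustration, and the checks are rough proxies for the criteria rather than full evaluations of them.

```python
from typing import Callable

Check = Callable[[dict], bool]

# One check per success criterion, keyed by agent role.
# The output fields used here are assumed for the example.
ROLE_CHECKS: dict[str, dict[str, Check]] = {
    "ResearchAgent": {
        # Proxy: at least one cited source is marked primary.
        # It cannot tell whether a primary source was actually available.
        "Finds primary sources when available":
            lambda out: any(s.get("primary") for s in out.get("sources", [])),
        # Proxy: every extracted claim points back to a source.
        "Extracts claims without adding unsupported interpretation":
            lambda out: all(c.get("source_id") for c in out.get("claims", [])),
        # Proxy: the output has an explicit uncertainty field.
        "Reports uncertainty and missing information":
            lambda out: "uncertainty" in out,
    },
}


def evaluate_role(agent: str, output: dict) -> dict[str, bool]:
    """Return pass/fail for each success criterion defined for the agent's role."""
    return {name: check(output) for name, check in ROLE_CHECKS[agent].items()}
```

Reporting one result per criterion, rather than a single score per agent, keeps it clear which part of the role was not met.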

Coordination metrics

Useful multi-agent metrics include:

Task completion rate
Final answer accuracy
Role completion rate
Tool-call validity
Duplicate work rate
Conflict detection rate
Conflict resolution quality
Average agent turns per task
Average cost per task
Average latency per task
Human escalation rate
Policy violation rate

Tracked together, these metrics show whether additional agents actually improve results or only add turns, cost, and latency.
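
Several of the metrics above reduce to simple aggregates over per-task run records. The following sketch assumes a hypothetical record format; every field name (`completed`, `tool_calls`, `valid_tool_calls`, `subtasks`, `duplicate_subtasks`, `agent_turns`, `cost_usd`, `latency_sec`, `escalated`) is an assumption made for the example.

```python
from statistics import mean


def coordination_metrics(runs: list[dict]) -> dict[str, float]:
    """Aggregate per-task run records into a few coordination metrics."""
    if not runs:
        raise ValueError("need at least one run record")

    # Pool counts across tasks so large tasks weigh more than small ones.
    tool_calls = sum(r["tool_calls"] for r in runs)
    valid_calls = sum(r["valid_tool_calls"] for r in runs)
    subtasks = sum(r["subtasks"] for r in runs)
    duplicates = sum(r["duplicate_subtasks"] for r in runs)

    return {
        "task_completion_rate": mean([float(r["completed"]) for r in runs]),
        "tool_call_validity": valid_calls / tool_calls if tool_calls else 1.0,
        "duplicate_work_rate": duplicates / subtasks if subtasks else 0.0,
        "avg_agent_turns": mean([r["agent_turns"] for r in runs]),
        "avg_cost_usd": mean([r["cost_usd"] for r in runs]),
        "avg_latency_sec": mean([r["latency_sec"] for r in runs]),
        "human_escalation_rate": mean([float(r["escalated"]) for r in runs]),
    }
```

Metrics that need judgment, such as conflict resolution quality, do not reduce to counts like this and would need a rubric or human review.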

Baseline comparison

Always compare multi-agent systems against simpler baselines.

Baselines may include:

Single direct model response
Single ReAct agent
Plan-and-execute agent
Structured workflow without subagents

A multi-agent system should justify its added complexity. It should improve accuracy, coverage, safety, or maintainability enough to offset cost and latency.

Example evaluation:

{
  "task_set": "technical_research_50_cases",
  "single_agent_accuracy": 0.78,
  "multi_agent_accuracy": 0.86,
  "single_agent_avg_latency_sec": 7.2,
  "multi_agent_avg_latency_sec": 18.5,
  "decision": "Use multi-agent only for high-value research tasks."
}
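
To make the trade-off in this example concrete, the sketch below computes the accuracy gain and the latency ratio from the numbers above. The case count of 50 is an assumption read off the task-set name, and the function name `compare` is illustrative.

```python
RESULT = {
    "task_set": "technical_research_50_cases",
    "single_agent_accuracy": 0.78,
    "multi_agent_accuracy": 0.86,
    "single_agent_avg_latency_sec": 7.2,
    "multi_agent_avg_latency_sec": 18.5,
}


def compare(result: dict, n_cases: int) -> dict:
    """Summarize what the multi-agent system gains and what it costs."""
    gain = result["multi_agent_accuracy"] - result["single_agent_accuracy"]
    ratio = (result["multi_agent_avg_latency_sec"]
             / result["single_agent_avg_latency_sec"])
    return {
        "accuracy_gain": round(gain, 4),
        "latency_ratio": round(ratio, 2),
        # How many additional cases the accuracy gain corresponds to.
        "extra_correct_cases": round(gain * n_cases),
    }


summary = compare(RESULT, n_cases=50)  # n_cases assumed from the task-set name
```

Here the gain is 0.08 in accuracy at roughly 2.57 times the latency. If the set does contain 50 cases, that gain is 4 additional correct answers, a small enough number that a larger test set would be needed before treating the difference as reliable.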

Test sets

A good test set includes:

Normal tasks
Ambiguous tasks
Conflicting evidence
Missing data
Tool failures
Prompt-injection attempts
Cases requiring escalation
Cases where agents should not act

Example:

{
  "id": "research_conflict_001",
  "task": "Determine whether Feature X is generally available.",
  "setup": "One source says beta, another source says generally available.",
  "expected_behavior": "Identify conflict, prefer official release notes, state uncertainty if unresolved."
}
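
A free-text `expected_behavior` is easy to read but hard to check automatically. One option, sketched below under stated assumptions, is to add machine-checkable behavior labels to each case. The labels (`conflict_identified`, `official_source_preferred`), the `EvalCase` class, and the `system` callable interface are all hypothetical; the conditional requirement "state uncertainty if unresolved" is left out because it depends on the outcome of the run.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    id: str
    task: str
    setup: str
    required_behaviors: frozenset[str]  # hypothetical behavior labels


CASE = EvalCase(
    id="research_conflict_001",
    task="Determine whether Feature X is generally available.",
    setup="One source says beta, another source says generally available.",
    required_behaviors=frozenset({"conflict_identified", "official_source_preferred"}),
)


def run_case(case: EvalCase, system: Callable[[str, str], set[str]]) -> dict:
    """Run one case and report which required behaviors were not observed.

    `system` stands in for the multi-agent system under test; it is assumed
    to return the set of behavior labels observed during the run.
    """
    observed = system(case.task, case.setup)
    missing = sorted(case.required_behaviors - observed)
    return {"id": case.id, "passed": not missing, "missing": missing}
```

How the behavior labels are derived from a real run (trace inspection, an LLM judge, or manual tagging) is left open here.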

Human review and auditability

For high-impact domains, human review remains important. Multi-agent traces should be readable enough for a human reviewer to understand who did what and why.

Audit logs should include:

Agent identity
Task assignment
Inputs received
Tool calls made
Outputs produced
Decisions made
Conflicts detected
Final synthesis
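
The fields above map naturally onto one structured record per agent step. The sketch below is one possible shape, not a prescribed schema: the class name `AuditRecord`, the field types, and the choice of JSON Lines for storage are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AuditRecord:
    """One audit-log entry per agent step; fields mirror the list above."""
    agent: str
    task_assignment: str
    inputs: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    conflicts: list[str] = field(default_factory=list)
    final_synthesis: Optional[str] = None  # set only on the final step


def to_jsonl(records: list[AuditRecord]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)
```

One record per line keeps the log appendable and easy to filter by agent, which helps a reviewer reconstruct who did what and why.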

Practical takeaway

Evaluate multi-agent systems as systems. The final answer matters, but so do the roles, messages, tool calls, conflicts, and decisions that produced it.

A multi-agent system should not be accepted just because it feels sophisticated. It should outperform simpler baselines on measured criteria that matter for the use case.
