
Evaluating Multi-Agent Systems
AGAI 301 · Evaluation, Testing, and Safety
Learn how to evaluate outputs, trajectories, coordination quality, role performance, and system-level reliability.
Key terms
- system quality = output + trajectory
- role metrics enable diagnosis
- multi-agent must beat baseline
- traceability supports trust
Learning objectives
- Evaluate final outputs and multi-agent trajectories.
- Define role-specific success criteria for subagents.
- Track coordination metrics such as duplicate work and conflict resolution.
- Compare multi-agent systems against simpler baselines.
Evaluating a multi-agent system is more complex than evaluating a single model response. You need to evaluate both the final output and the interactions that produced it.
A multi-agent system can fail even if the final answer looks plausible. Agents may have duplicated work, ignored evidence, used the wrong tools, missed a conflict, or violated permissions. The final response is only the visible surface.
A strong evaluation strategy looks at:
- Final output quality
- Individual agent performance
- Coordination quality
- Tool-use correctness
- Conflict handling
- Cost and latency
- Safety and policy compliance
- Robustness across edge cases
Output evaluation
Output evaluation asks whether the final answer or artifact satisfies the user’s goal.
For a research synthesis system, evaluate:
- Accuracy
- Source support
- Completeness
- Clarity
- Balanced uncertainty
- Correct formatting
- Relevance to the user’s question
For a coding team of agents, evaluate:
- Tests passing
- Correctness of implementation
- Security risks
- Minimal unnecessary changes
- Documentation updates
- Review comments addressed
Output evaluation can use human review, automated tests, reference answers, LLM-as-judge rubrics, or domain-specific validators.
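One way to turn a rubric like the research-synthesis list into a single number is a weighted average of per-criterion ratings. The sketch below assumes each criterion has already been rated between 0 and 1 by a human reviewer or a judge model. The weights and the names `RUBRIC_WEIGHTS` and `score_output` are illustrative assumptions, not part of the lesson.

```python
from typing import Dict

# Illustrative weights for the research-synthesis criteria listed above.
# The criteria come from the lesson; the weights are assumptions to tune.
RUBRIC_WEIGHTS: Dict[str, float] = {
    "accuracy": 0.30,
    "source_support": 0.20,
    "completeness": 0.15,
    "clarity": 0.10,
    "balanced_uncertainty": 0.10,
    "formatting": 0.05,
    "relevance": 0.10,
}


def score_output(ratings: Dict[str, float],
                 weights: Dict[str, float] = RUBRIC_WEIGHTS) -> float:
    """Combine per-criterion ratings (each in [0, 1]) into a weighted score."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Normalize by the total weight so the weights need not sum to exactly 1.
    return sum(weights[name] * ratings[name] for name in weights) / total_weight


# Usage: a hypothetical answer that is accurate but weakly sourced.
example_ratings = {name: 1.0 for name in RUBRIC_WEIGHTS}
example_ratings["source_support"] = 0.5
example_score = score_output(example_ratings)
```

A single score is convenient for dashboards, but the per-criterion ratings are usually worth keeping, since they show which dimension caused a low score.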
Trajectory evaluation
Trajectory evaluation examines the path the system took.
A trajectory includes:
user request
→ orchestrator plan
→ subagent assignments
→ subagent messages
→ tool calls
→ intermediate outputs
→ critiques
→ revisions
→ final synthesis
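A good trajectory is efficient (no redundant steps), coherent (each step follows from the ones before it), and consistent with the intended architecture (each agent acts within its assigned role).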
Example trace:
{
  "task": "Draft a security review of pull request 482",
  "trajectory": [
    { "agent": "Planner", "action": "created review plan" },
    { "agent": "CodeAnalyzer", "action": "summarized diff" },
    { "agent": "SecurityReviewer", "action": "flagged SQL injection risk" },
    { "agent": "TestAgent", "action": "reported missing regression test" },
    { "agent": "Finalizer", "action": "merged findings into review" }
  ],
  "final_status": "changes_requested"
}
This trajectory is inspectable. If the final review is poor, developers can see where the failure happened.
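A trace in this shape can also be checked automatically. The sketch below reports which expected agents never acted and whether the agents first appeared in the expected order. The helper names `missing_steps` and `in_expected_order` are hypothetical, and the expected order is an assumption read off the example trace rather than a rule stated in the lesson.

```python
from typing import Dict, List

Step = Dict[str, str]


def missing_steps(trajectory: List[Step], expected_agents: List[str]) -> List[str]:
    """Return the expected agents that never appear in the trajectory."""
    seen = {step["agent"] for step in trajectory}
    return [agent for agent in expected_agents if agent not in seen]


def in_expected_order(trajectory: List[Step], expected_agents: List[str]) -> bool:
    """True if the first appearance of each expected agent follows the expected order."""
    first_index: Dict[str, int] = {}
    for i, step in enumerate(trajectory):
        first_index.setdefault(step["agent"], i)
    # Agents that never appear are ignored here; missing_steps reports them.
    positions = [first_index[a] for a in expected_agents if a in first_index]
    return positions == sorted(positions)


# The trajectory from the example trace above.
trace = [
    {"agent": "Planner", "action": "created review plan"},
    {"agent": "CodeAnalyzer", "action": "summarized diff"},
    {"agent": "SecurityReviewer", "action": "flagged SQL injection risk"},
    {"agent": "TestAgent", "action": "reported missing regression test"},
    {"agent": "Finalizer", "action": "merged findings into review"},
]
# Assumed architecture: the order in which roles are expected to act.
expected = ["Planner", "CodeAnalyzer", "SecurityReviewer", "TestAgent", "Finalizer"]

skipped = missing_steps(trace, expected)
ordered = in_expected_order(trace, expected)
```

Checks like these only confirm that the right agents ran in a sensible order. Whether each step was done well is a separate question for role-specific evaluation.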
Role-specific evaluation
Each agent should have role-specific success criteria.
Example:
{
  "agent": "ResearchAgent",
  "success_criteria": [
    "Finds primary sources when available",
    "Extracts claims without adding unsupported interpretation",
    "Reports uncertainty and missing information"
  ]
}
A writer agent should not be evaluated the same way as a security reviewer. Role-specific evaluation prevents vague scoring.
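Some role criteria can be approximated with simple programmatic checks, with the rest left to human or rubric-based review. The sketch below maps each role to its own named checks. The output fields (`sources`, `uncertainty_notes`, `findings`, `severity`) and the names `ROLE_CHECKS` and `evaluate_role` are assumptions made for illustration; a real system would use whatever structure its agents actually emit.

```python
from typing import Any, Callable, Dict

AgentOutput = Dict[str, Any]
Check = Callable[[AgentOutput], bool]

# Hypothetical per-role checks. Each role gets its own criteria, so a
# researcher and a security reviewer are never scored on the same scale.
ROLE_CHECKS: Dict[str, Dict[str, Check]] = {
    "ResearchAgent": {
        "cites_at_least_one_source": lambda out: len(out.get("sources", [])) > 0,
        "reports_uncertainty": lambda out: bool(out.get("uncertainty_notes")),
    },
    "SecurityReviewer": {
        # Vacuously true when there are no findings at all.
        "findings_have_severity": lambda out: all(
            "severity" in finding for finding in out.get("findings", [])
        ),
    },
}


def evaluate_role(role: str, output: AgentOutput) -> Dict[str, bool]:
    """Run every check registered for the role and return pass/fail per check."""
    checks = ROLE_CHECKS[role]  # raises KeyError for an unregistered role
    return {name: check(output) for name, check in checks.items()}


# Usage with a made-up research output.
research_result = evaluate_role(
    "ResearchAgent",
    {"sources": ["official release notes"], "uncertainty_notes": "GA date unconfirmed"},
)
```

Returning one boolean per named check, rather than a single score, keeps the diagnosis specific: a failing `reports_uncertainty` points at one behavior of one agent.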
Coordination metrics
Useful multi-agent metrics include:
- Task completion rate
- Final answer accuracy
- Role completion rate
- Tool-call validity
- Duplicate work rate
- Conflict detection rate
- Conflict resolution quality
- Average agent turns per task
- Average cost per task
- Average latency per task
- Human escalation rate
- Policy violation rate
These metrics help distinguish quality improvements from mere complexity.
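Several of these metrics are simple ratios over logged runs. The sketch below computes four of them from per-task records. The `TaskRun` fields are assumptions about what a trace logger might record, and duplicate work rate is defined here as duplicated subtasks divided by all subtasks, which is one reasonable definition among several.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskRun:
    """One evaluated task; the field names are hypothetical."""
    completed: bool
    agent_turns: int
    subtasks_total: int
    subtasks_duplicated: int
    escalated_to_human: bool


def coordination_metrics(runs: List[TaskRun]) -> Dict[str, float]:
    """Aggregate a few coordination metrics over a list of runs."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    total_subtasks = sum(r.subtasks_total for r in runs)
    duplicated = sum(r.subtasks_duplicated for r in runs)
    return {
        "task_completion_rate": sum(r.completed for r in runs) / n,
        "duplicate_work_rate": duplicated / total_subtasks if total_subtasks else 0.0,
        "avg_agent_turns": sum(r.agent_turns for r in runs) / n,
        "human_escalation_rate": sum(r.escalated_to_human for r in runs) / n,
    }


# Usage with two made-up runs.
runs = [
    TaskRun(completed=True, agent_turns=4, subtasks_total=5,
            subtasks_duplicated=1, escalated_to_human=False),
    TaskRun(completed=False, agent_turns=6, subtasks_total=5,
            subtasks_duplicated=0, escalated_to_human=True),
]
metrics = coordination_metrics(runs)
```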
Baseline comparison
Always compare multi-agent systems against simpler baselines.
Baselines may include:
- Single direct model response
- Single ReAct agent
- Plan-and-execute agent
- Structured workflow without subagents
A multi-agent system should justify its added complexity. It should improve accuracy, coverage, safety, or maintainability enough to offset cost and latency.
Example evaluation:
{
  "task_set": "technical_research_50_cases",
  "single_agent_accuracy": 0.78,
  "multi_agent_accuracy": 0.86,
  "single_agent_avg_latency_sec": 7.2,
  "multi_agent_avg_latency_sec": 18.5,
  "decision": "Use multi-agent only for high-value research tasks."
}
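The trade-off in this example can be made explicit: the multi-agent system gains 0.08 in accuracy (0.86 versus 0.78) while taking roughly 2.6 times as long (18.5 s versus 7.2 s). The sketch below encodes one possible decision rule. The thresholds `min_gain` and `max_latency_ratio` and the decision labels are assumptions chosen for illustration; the right values depend on the use case.

```python
from typing import Dict, Union


def compare_to_baseline(
    baseline_accuracy: float,
    multi_accuracy: float,
    baseline_latency_sec: float,
    multi_latency_sec: float,
    min_gain: float = 0.05,          # assumed: smallest accuracy gain worth paying for
    max_latency_ratio: float = 2.0,  # assumed: slowdown tolerated for routine tasks
) -> Dict[str, Union[float, str]]:
    """Summarize the accuracy gain and latency cost, then pick a decision label."""
    gain = multi_accuracy - baseline_accuracy
    latency_ratio = multi_latency_sec / baseline_latency_sec
    if gain < min_gain:
        decision = "keep_baseline"
    elif latency_ratio > max_latency_ratio:
        decision = "multi_agent_for_high_value_tasks_only"
    else:
        decision = "adopt_multi_agent"
    return {"accuracy_gain": gain, "latency_ratio": latency_ratio, "decision": decision}


# Usage with the numbers from the example evaluation above.
comparison = compare_to_baseline(0.78, 0.86, 7.2, 18.5)
```

With these assumed thresholds the rule lands on the same decision as the example: the gain is large enough to matter, but the slowdown is too large to accept for every task.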
Test sets
A good test set includes:
- Normal tasks
- Ambiguous tasks
- Conflicting evidence
- Missing data
- Tool failures
- Prompt-injection attempts
- Cases requiring escalation
- Cases where agents should not act
Example:
{
  "id": "research_conflict_001",
  "task": "Determine whether Feature X is generally available.",
  "setup": "One source says beta, another source says generally available.",
  "expected_behavior": "Identify conflict, prefer official release notes, state uncertainty if unresolved."
}
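A test case like this becomes executable once the expected behavior is broken into separate checks on the system's result. The sketch below assumes the system returns a structured result with the fields `conflict_detected`, `preferred_source`, `resolved`, and `uncertainty_note`. Those fields and the function `check_conflict_case` are hypothetical; free-text outputs would need a rubric or judge model instead.

```python
from typing import Any, Dict


def check_conflict_case(result: Dict[str, Any]) -> Dict[str, bool]:
    """Check one result against the three expected behaviors of the test case."""
    return {
        "identified_conflict": result.get("conflict_detected") is True,
        "preferred_official_source": result.get("preferred_source") == "official_release_notes",
        # Uncertainty only has to be stated when the conflict was left unresolved.
        "stated_uncertainty_if_unresolved": bool(result.get("resolved"))
        or bool(result.get("uncertainty_note")),
    }


# A made-up result standing in for a real system's output on research_conflict_001.
stub_result = {
    "conflict_detected": True,
    "preferred_source": "official_release_notes",
    "resolved": False,
    "uncertainty_note": "Sources disagree; release notes list Feature X as beta.",
}
case_checks = check_conflict_case(stub_result)
case_passed = all(case_checks.values())
```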
Human review and auditability
For high-impact domains, human review remains important. Multi-agent traces should be readable enough for a human reviewer to understand who did what and why.
Audit logs should include:
- Agent identity
- Task assignment
- Inputs received
- Tool calls made
- Outputs produced
- Decisions made
- Conflicts detected
- Final synthesis
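The fields above map naturally onto a structured record with one entry per agent action. The sketch below is one possible shape, serialized as a JSON line so that logs can be appended and searched. The class name `AuditRecord`, the field names, and the sample values (including the `static_analyzer` tool) are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class AuditRecord:
    """One audit entry per agent action; fields mirror the list above."""
    agent_id: str
    task_assignment: str
    inputs_received: List[str]
    tool_calls: List[Dict[str, Any]]
    outputs_produced: List[str]
    decisions: List[str]
    conflicts_detected: List[str] = field(default_factory=list)
    final_synthesis: str = ""  # filled in only by the agent that writes the final answer


def to_json_line(record: AuditRecord) -> str:
    """Serialize a record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Usage with made-up values based on the security review example.
record = AuditRecord(
    agent_id="SecurityReviewer",
    task_assignment="Review pull request 482 for security risks",
    inputs_received=["diff summary from CodeAnalyzer"],
    tool_calls=[{"tool": "static_analyzer", "status": "ok"}],  # hypothetical tool
    outputs_produced=["flagged SQL injection risk"],
    decisions=["request changes"],
)
line = to_json_line(record)
```

Writing one line per action keeps the log append-only, and a reviewer can reconstruct who did what by filtering on `agent_id`.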
Practical takeaway
Evaluate multi-agent systems as systems. The final answer matters, but so do the roles, messages, tool calls, conflicts, and decisions that produced it.
A multi-agent system should not be accepted just because it feels sophisticated. It should outperform simpler baselines on measured criteria that matter for the use case.