
Evaluating Tool-Augmented Agents

AGAI 201 · Production Tool Use: Reliability, Security, and Evaluation

Learn how to measure whether a tool-using agent selects the right tools, passes valid arguments, handles errors, completes tasks, and stays within policy.

Key terms

agent quality = outcome + trajectory
tool precision + tool recall
golden traces define acceptable process
production logs → new eval cases

Learning objectives

Distinguish outcome evaluation from trajectory evaluation.
Define metrics for tool selection and argument validity.
Build evaluation cases with expected and forbidden tools.
Use logs and regression tests to improve agent reliability.

Evaluating a normal chatbot often focuses on the final answer. Evaluating a tool-using agent requires more. You must evaluate the final answer and the path the agent took to get there.

A tool-augmented agent can fail in several ways:

It may fail to call a needed tool.
It may call an unnecessary tool.
It may choose the wrong tool.
It may pass invalid arguments.
It may misinterpret the tool result.
It may ignore an error.
It may take an unauthorized action.
It may produce a final answer unsupported by the tool results.

This means evaluation must inspect both outcome quality and trajectory quality.

Outcome evaluation

Outcome evaluation asks: did the agent complete the task correctly?

For a support agent, outcome metrics might include:

Correct order status reported
Correct refund eligibility determination
Clear explanation to user
No unsupported claims
Proper escalation when needed

For a coding agent:

Tests pass
Code compiles
Minimal unnecessary changes
Bug actually fixed
No security regression introduced

For a research agent:

Answer is accurate
Sources are relevant
Claims are supported
Uncertainty is stated when needed
Output matches requested format

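Outcome evaluation is necessary but not sufficient. An agent can reach a correct final answer through a flawed path, for example by guessing instead of calling a tool, and outcome checks alone will not catch that.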

Trajectory evaluation

Trajectory evaluation asks: did the agent use tools appropriately?

A trajectory includes the sequence of model messages, tool calls, tool arguments, tool results, retries, and final response.

Example trace:

{
  "user_request": "Is order ORD-7711 eligible for a refund?",
  "steps": [
    {
      "type": "tool_call",
      "tool": "get_order_status",
      "arguments": { "order_id": "ORD-7711" }
    },
    {
      "type": "tool_result",
      "success": true,
      "status": "delivered_late"
    },
    {
      "type": "tool_call",
      "tool": "search_refund_policy",
      "arguments": { "query": "late delivery refund eligibility" }
    },
    {
      "type": "tool_result",
      "success": true,
      "policy": "Refund request allowed if delivery is more than 5 days late."
    }
  ],
  "final_answer": "This order appears eligible for a refund request because it was delivered more than 5 days late."
}

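This trajectory looks reasonable. The agent gathered order facts and policy before answering. A strict evaluator would still check one detail: the final answer says the order was more than 5 days late, but the order lookup only returned "delivered_late", so the trace as shown does not establish the number of days.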

A bad trajectory might skip policy lookup and guess eligibility.
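The contrast between a good and a bad trajectory can be checked mechanically. The sketch below assumes traces follow the JSON shape shown above; the helper names `called_tools` and `skipped_tools`, and the shortened `bad_trace`, are illustrative and not part of any particular framework.

```python
from typing import Any

def called_tools(trace: dict[str, Any]) -> list[str]:
    """Return tool names in the order the agent called them."""
    return [
        step["tool"]
        for step in trace.get("steps", [])
        if step.get("type") == "tool_call"
    ]

def skipped_tools(trace: dict[str, Any], required: list[str]) -> list[str]:
    """Return required tools that never appear in the trajectory."""
    seen = set(called_tools(trace))
    return [tool for tool in required if tool not in seen]

good_trace = {
    "user_request": "Is order ORD-7711 eligible for a refund?",
    "steps": [
        {"type": "tool_call", "tool": "get_order_status",
         "arguments": {"order_id": "ORD-7711"}},
        {"type": "tool_result", "success": True, "status": "delivered_late"},
        {"type": "tool_call", "tool": "search_refund_policy",
         "arguments": {"query": "late delivery refund eligibility"}},
        {"type": "tool_result", "success": True,
         "policy": "Refund request allowed if delivery is more than 5 days late."},
    ],
}

# Hypothetical bad trajectory: the agent looks up the order, then guesses.
bad_trace = {
    "user_request": good_trace["user_request"],
    "steps": good_trace["steps"][:2],
}

required = ["get_order_status", "search_refund_policy"]
print(skipped_tools(good_trace, required))  # []
print(skipped_tools(bad_trace, required))   # ['search_refund_policy']
```

This only detects missing calls. Whether the final answer is actually supported by the tool results needs a separate check.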

Tool selection metrics

Useful tool metrics include:

Tool precision: Of the tools called, how many were appropriate?
Tool recall: Of the tools that should have been called, how many were called?
Argument validity: How often were tool arguments valid?
Tool success rate: How often did tool calls execute successfully?
Unnecessary call rate: How often did the agent call tools it did not need?
Recovery rate: How often did the agent recover from tool failures?
Policy violation rate: How often did it attempt disallowed actions?

These metrics help diagnose problems. Low tool recall suggests the agent is answering without calling the tools it needs, so its answers are not grounded in tool results. Low precision suggests it is calling irrelevant tools. Low argument validity usually points to schema or prompt problems.
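As a minimal sketch, three of these metrics can be computed per task from the list of called tools. Several choices here are assumptions rather than standard definitions: tools are compared as sets (repeat calls are ignored), an empty denominator scores 1.0, and an argument set counts as valid when every required key is present. The `required_args` mapping is hypothetical.

```python
def tool_precision_recall(called: list[str], expected: list[str]) -> tuple[float, float]:
    """Set-based tool precision and recall for a single task.

    Assumption: a called tool is "appropriate" if it is in `expected`.
    """
    called_set, expected_set = set(called), set(expected)
    hits = called_set & expected_set
    # Convention: with nothing called (or nothing expected), score 1.0.
    precision = len(hits) / len(called_set) if called_set else 1.0
    recall = len(hits) / len(expected_set) if expected_set else 1.0
    return precision, recall

def argument_validity(calls: list[dict], required_args: dict[str, set[str]]) -> float:
    """Fraction of calls whose arguments include every required key."""
    if not calls:
        return 1.0
    valid = sum(
        1 for call in calls
        if required_args.get(call["tool"], set()) <= set(call.get("arguments", {}))
    )
    return valid / len(calls)

precision, recall = tool_precision_recall(
    called=["get_order_status", "get_shipping_events"],
    expected=["get_order_status", "search_refund_policy"],
)
print(precision, recall)  # 0.5 0.5

calls = [
    {"tool": "get_order_status", "arguments": {"order_id": "ORD-7711"}},
    {"tool": "search_refund_policy", "arguments": {}},  # missing "query"
]
required_args = {"get_order_status": {"order_id"}, "search_refund_policy": {"query"}}
print(argument_validity(calls, required_args))  # 0.5
```

In practice, argument validity is usually checked against the full tool schema (types, formats, allowed values), not just key presence.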

Building an evaluation set

Create a test set of realistic tasks. Include happy paths and edge cases.

Example evaluation set:

[
  {
    "id": "support_001",
    "user_request": "Where is order ORD-10492?",
    "expected_tools": ["get_order_status"],
    "forbidden_tools": ["create_refund_request"],
    "expected_result": "Reports shipping status and estimated delivery."
  },
  {
    "id": "support_002",
    "user_request": "Refund my order now.",
    "expected_tools": ["get_order_status", "search_refund_policy"],
    "forbidden_tools": ["approve_refund"],
    "expected_result": "Explains eligibility and offers a refund request draft if allowed."
  },
  {
    "id": "support_003",
    "user_request": "My order ID is invalid but I want help.",
    "expected_tools": [],
    "forbidden_tools": ["create_refund_request", "approve_refund"],
    "expected_result": "Asks for a valid order ID."
  }
]

The evaluation should specify expected tools, forbidden tools, and expected final behavior.
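One way to score a case against what the agent actually did is sketched below. It assumes cases use the field names from the example evaluation set and that the list of called tools has already been extracted from a trace; `CaseResult` and `check_case` are illustrative names.

```python
from dataclasses import dataclass, field

@dataclass
class CaseResult:
    case_id: str
    missing_tools: list[str] = field(default_factory=list)
    forbidden_called: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        """A case passes only if nothing is missing and nothing forbidden ran."""
        return not self.missing_tools and not self.forbidden_called

def check_case(case: dict, called: list[str]) -> CaseResult:
    """Compare the tools an agent called with the case's tool expectations."""
    expected = case.get("expected_tools", [])
    forbidden = case.get("forbidden_tools", [])  # absent key: nothing forbidden
    return CaseResult(
        case_id=case["id"],
        missing_tools=[t for t in expected if t not in called],
        forbidden_called=[t for t in called if t in forbidden],
    )

case = {
    "id": "support_002",
    "user_request": "Refund my order now.",
    "expected_tools": ["get_order_status", "search_refund_policy"],
    "forbidden_tools": ["approve_refund"],
}

result = check_case(case, called=["get_order_status", "approve_refund"])
print(result.passed, result.missing_tools, result.forbidden_called)
# False ['search_refund_policy'] ['approve_refund']
```

The `expected_result` field is deliberately not checked here. Free-text expectations usually need a rubric, a human reviewer, or an LLM judge.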

Golden traces

For complex workflows, define golden traces: ideal or acceptable sequences of tool calls.

Example:

Task: Determine refund eligibility for late order.
Acceptable trace:
1. get_order_status
2. get_shipping_events
3. search_refund_policy
4. final answer or create_refund_request_draft with confirmation

Golden traces help evaluate not just whether the final answer is correct, but whether the agent followed the right process.

However, be careful not to make traces too rigid. There may be multiple valid paths. Evaluation should allow acceptable alternatives when appropriate.
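A minimal sketch of flexible matching: accept a trajectory if it contains any one golden trace as an ordered subsequence, so extra calls such as retries do not cause a failure. The second golden trace below is a hypothetical alternative ordering added for illustration. Which orderings are really acceptable is a product decision.

```python
def is_subsequence(required: list[str], actual: list[str]) -> bool:
    """True if `required` appears in `actual` in order, with extra calls allowed."""
    remaining = iter(actual)
    # `tool in remaining` advances the iterator until it finds a match,
    # so each required tool must appear after the previous one.
    return all(tool in remaining for tool in required)

def matches_any_golden(actual: list[str], golden_traces: list[list[str]]) -> bool:
    """True if the actual tool sequence satisfies at least one golden trace."""
    return any(is_subsequence(golden, actual) for golden in golden_traces)

golden_traces = [
    ["get_order_status", "get_shipping_events", "search_refund_policy"],
    # Hypothetical alternative: policy is looked up before shipping events.
    ["get_order_status", "search_refund_policy", "get_shipping_events"],
]

with_retry = ["get_order_status", "get_shipping_events",
              "get_shipping_events", "search_refund_policy"]
skipped_order_lookup = ["search_refund_policy", "get_shipping_events"]

print(matches_any_golden(with_retry, golden_traces))            # True
print(matches_any_golden(skipped_order_lookup, golden_traces))  # False
```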

Automated and human evaluation

Automated checks are useful for objective properties:

Was JSON valid?
Was the expected tool called?
Were forbidden tools avoided?
Did the final answer include required fields?
Did tests pass?
Did execution stay within tool-call limits?
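Several of these checks need only a few lines of code. The sketch below assumes the agent's final output is expected to be a JSON object; the function name, the field names, and the limit of 5 tool calls are placeholders.

```python
import json

def automated_checks(
    raw_output: str,
    required_fields: list[str],
    tool_calls: int,
    max_tool_calls: int,
) -> dict[str, bool]:
    """Run objective pass/fail checks on one agent run."""
    try:
        parsed = json.loads(raw_output)
        valid_object = isinstance(parsed, dict)
    except json.JSONDecodeError:
        parsed, valid_object = {}, False
    return {
        "valid_json_object": valid_object,
        "has_required_fields": valid_object
        and all(name in parsed for name in required_fields),
        "within_tool_call_limit": tool_calls <= max_tool_calls,
    }

good = automated_checks(
    '{"order_id": "ORD-10492", "status": "shipped"}',
    required_fields=["order_id", "status"],
    tool_calls=2,
    max_tool_calls=5,
)
bad = automated_checks(
    "The order has shipped.",
    required_fields=["status"],
    tool_calls=7,
    max_tool_calls=5,
)
print(good)
print(bad)
```

Returning a dictionary of named booleans, rather than a single pass/fail value, makes it easier to see which property failed when results are aggregated across many runs.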

Human evaluation is useful for judgment-heavy properties:

Was the answer clear?
Was the explanation appropriate?
Did the agent handle ambiguity well?
Was the escalation sensible?
Did the response inspire trust?

Many production teams use a combination of unit tests, simulated tasks, LLM-as-judge evaluations, human review, and live monitoring.

Regression testing prompts and tools

Tool-using agents can regress when you change a prompt, model, schema, or tool implementation. A change that improves one task can break another.

Before deployment, run a regression suite:

- Same evaluation tasks
- Same expected tool behavior
- Same security tests
- Same edge cases
- Compare against previous version

Track changes in:

Tool selection accuracy
Argument validity
Completion rate
Average tool calls per task
Latency
Cost
User satisfaction
Safety violations
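Comparing against the previous version can be as simple as diffing two metric dictionaries. This is a sketch under stated assumptions: the metric names, the numbers, and the 2% relative tolerance are all made up for illustration, and each metric must be marked as higher-is-better or lower-is-better.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    higher_is_better: dict[str, bool],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics that got worse by more than `tolerance` (relative).

    The value for each flagged metric is the raw change (candidate - baseline).
    """
    regressions = {}
    for name, better_when_higher in higher_is_better.items():
        delta = candidate[name] - baseline[name]
        worsening = -delta if better_when_higher else delta
        if worsening > tolerance * abs(baseline[name]):
            regressions[name] = delta
    return regressions

# Illustrative numbers only, not measurements.
baseline = {"completion_rate": 0.90, "argument_validity": 0.97,
            "avg_tool_calls": 2.4, "latency_s": 3.0}
candidate = {"completion_rate": 0.92, "argument_validity": 0.91,
             "avg_tool_calls": 3.1, "latency_s": 3.02}
directions = {"completion_rate": True, "argument_validity": True,
              "avg_tool_calls": False, "latency_s": False}

flagged = find_regressions(baseline, candidate, directions)
print(sorted(flagged))  # ['argument_validity', 'avg_tool_calls']
```

Here the candidate improves completion rate but regresses on argument validity and tool-call count, which is the kind of trade-off a regression suite is meant to surface before deployment.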

Monitoring in production

Evaluation does not stop at launch. Monitor real usage.

Important production signals include:

Tool error rates
Repeated retries
High latency traces
User corrections
Escalation frequency
Unexpected tool combinations
Attempts to call forbidden tools
Sensitive data exposure warnings

Use logs and traces to create new evaluation cases. The best evaluation sets grow from real failures.
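One possible workflow is to turn a logged trace into a draft evaluation case that a reviewer then corrects. The sketch assumes production traces use the same shape as the example trace earlier in this lesson; `trace_to_eval_case`, the `needs_review` flag, and the case ID `support_004` are hypothetical.

```python
def trace_to_eval_case(trace: dict, case_id: str, forbidden_tools: list[str]) -> dict:
    """Draft an evaluation case from a logged trace.

    The observed tools are only a starting point: a reviewer must confirm
    which ones were actually appropriate before the case is used.
    """
    observed: list[str] = []
    for step in trace.get("steps", []):
        if step.get("type") == "tool_call" and step["tool"] not in observed:
            observed.append(step["tool"])
    return {
        "id": case_id,
        "user_request": trace["user_request"],
        "expected_tools": [t for t in observed if t not in forbidden_tools],
        "forbidden_tools": list(forbidden_tools),
        "expected_result": "TODO: reviewer describes the correct behavior.",
        "needs_review": True,
    }

# Hypothetical logged failure: the agent tried to approve a refund directly.
logged_trace = {
    "user_request": "Refund my order now.",
    "steps": [
        {"type": "tool_call", "tool": "get_order_status",
         "arguments": {"order_id": "ORD-10492"}},
        {"type": "tool_result", "success": True},
        {"type": "tool_call", "tool": "approve_refund",
         "arguments": {"order_id": "ORD-10492"}},
    ],
}

draft = trace_to_eval_case(logged_trace, "support_004", ["approve_refund"])
print(draft["expected_tools"])   # ['get_order_status']
print(draft["forbidden_tools"])  # ['approve_refund']
```

Real production logs may contain personal or sensitive data, so requests and arguments typically need to be anonymized before they enter a shared evaluation set.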

Practical takeaway

A tool-using agent should be evaluated as a system, not just as a model. The final answer matters, but so does the route: tool choice, arguments, errors, retries, permissions, and evidence.

Strong evaluation makes agent development safer and faster. It tells you whether failures come from the prompt, schema, model, tool implementation, orchestration logic, or missing guardrails. Without evaluation, tool-using agents remain impressive demos. With evaluation, they can become reliable products.
