
Why AI Evaluation Is Different
AGAI 401 · Evaluation for Production Agents
Learn why production AI systems require probabilistic, behavioral, and trajectory-based evaluation rather than only deterministic pass/fail tests.
Key terms
- evals = behavior under test cases
- agent quality = output + trajectory
- deterministic tests still matter
- one good run ≠ reliability
Learning objectives
- Explain why AI evaluation differs from traditional deterministic testing.
- Distinguish output evaluation from trajectory evaluation.
- Identify deterministic components that still need unit tests.
- Define core metrics for evaluating production agents.
Traditional software testing assumes that the same input should produce the same output. If a function adds 2 and 2, it should return 4. If an API receives an invalid token, it should return an authorization error. Most software tests are built around deterministic expectations like these.
AI agents are different. A language model may produce different valid outputs for the same input. A tool-using agent may take different paths depending on retrieved context, tool latency, model sampling, conversation history, or external system state. This does not mean AI systems cannot be tested. It means they require a broader evaluation strategy.
A production agent should be evaluated at multiple levels:
- Final output quality
- Tool selection quality
- Argument validity
- Trajectory quality
- Policy compliance
- Latency and cost
- User experience
- Failure handling
The key shift is from testing exact outputs to evaluating behavior against criteria.
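To make this shift concrete, the sketch below contrasts an exact-match assertion with a criteria-based check on two differently worded answers. The helper `meets_criteria` and the phrase lists are hypothetical and chosen only for illustration; real criteria are usually richer than substring checks.

```python
from typing import List


def meets_criteria(answer: str, required: List[str], forbidden: List[str]) -> bool:
    """Return True if every required phrase appears and no forbidden phrase does."""
    text = answer.lower()
    has_required = all(phrase.lower() in text for phrase in required)
    has_forbidden = any(phrase.lower() in text for phrase in forbidden)
    return has_required and not has_forbidden


# Two differently worded answers that convey the same facts.
answer_a = "Your order ORD-10492 has shipped and should arrive soon."
answer_b = "Good news: ORD-10492 shipped yesterday."

# An exact-match test accepts only one wording.
expected_text = "Your order ORD-10492 has shipped and should arrive soon."
exact_match_a = answer_a == expected_text
exact_match_b = answer_b == expected_text

# A criteria check accepts both wordings and still rejects a wrong answer.
required = ["ORD-10492", "shipped"]
forbidden = ["refund approved"]
criteria_a = meets_criteria(answer_a, required, forbidden)
criteria_b = meets_criteria(answer_b, required, forbidden)
criteria_bad = meets_criteria("Your refund approved for ORD-10492.", required, forbidden)
```

Here the exact-match test rejects the second answer even though it is acceptable, while the criteria check passes both and fails the answer that breaks the rules.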
Deterministic tests still matter
AI evaluation does not replace traditional testing. You still need normal software tests for deterministic components:
- Tool implementations
- API clients
- Authentication and authorization
- Database queries
- Schema validation
- Parsers and serializers
- Prompt assembly functions
- Cost calculators
- Retry logic
For example, a function that validates tool arguments should have ordinary unit tests:
def validate_order_id(order_id: str) -> bool:
    return isinstance(order_id, str) and order_id.startswith("ORD-")


def test_validate_order_id():
    assert validate_order_id("ORD-12345") is True
    assert validate_order_id("12345") is False
    assert validate_order_id(None) is False
These tests are stable and should run in CI like any other test suite.
Behavioral evaluation
The model-driven parts of an agent need behavioral evaluation. Instead of expecting exact text, you define what good behavior looks like.
Example task:
User: My order is late. Can I get a refund? Order ID ORD-7711.
Expected behavior:
- Look up order status.
- Retrieve refund policy.
- Do not approve refund automatically.
- Explain eligibility clearly.
- Offer to create a refund request draft if appropriate.
The exact wording can vary. The required behavior should not.
A simple eval case might be represented as JSON:
{
  "id": "support_refund_001",
  "input": "My order is late. Can I get a refund? Order ID ORD-7711.",
  "expected_tools": ["get_order_status", "search_refund_policy"],
  "forbidden_tools": ["approve_refund"],
  "rubric": [
    "Checks order status before answering",
    "Uses refund policy before discussing eligibility",
    "Does not claim refund is approved",
    "Explains next step clearly"
  ]
}
This is closer to acceptance testing than snapshot testing.
Trajectory evaluation
For agents, the path matters. A final answer may look correct even if the agent used the wrong evidence, skipped a required tool, or attempted an unauthorized action before recovering.
A trace might look like:
{
  "input": "Where is order ORD-10492?",
  "steps": [
    { "type": "tool_call", "tool": "get_order_status", "args": { "order_id": "ORD-10492" } },
    { "type": "tool_result", "success": true, "status": "shipped" },
    { "type": "final", "content": "Your order has shipped..." }
  ]
}
Trajectory evaluation asks:
- Did the agent call the right tool?
- Were arguments valid?
- Was the result interpreted correctly?
- Were unnecessary calls avoided?
- Were unsafe calls blocked?
- Did the agent stop at the right time?
This is especially important for tool-using and multi-agent systems.
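Several of these questions can be checked mechanically. The sketch below assumes the trace format shown above (a `steps` list whose entries have `type`, `tool`, and `args`). The function `check_trajectory`, its `allowed_tools` and `max_tool_calls` parameters, and the rule that order IDs start with "ORD-" are assumptions made for illustration, not a prescribed API.

```python
import json
from typing import Dict, List


def check_trajectory(trace: Dict, allowed_tools: List[str], max_tool_calls: int) -> List[str]:
    """Return structural problems found in a trace (an empty list means none found)."""
    problems: List[str] = []
    steps = trace.get("steps", [])
    tool_calls = [s for s in steps if s.get("type") == "tool_call"]

    seen = set()
    for call in tool_calls:
        tool = call.get("tool")
        args = call.get("args", {})

        # Did the agent call a tool it is allowed to use?
        if tool not in allowed_tools:
            problems.append(f"Disallowed tool called: {tool}")

        # Were arguments valid? (Assumed rule: order IDs start with "ORD-".)
        order_id = args.get("order_id")
        if order_id is not None and not str(order_id).startswith("ORD-"):
            problems.append(f"Invalid order_id for {tool}: {order_id}")

        # Were unnecessary calls avoided? Flag exact repeats of the same call.
        key = (tool, json.dumps(args, sort_keys=True))
        if key in seen:
            problems.append(f"Duplicate call: {tool}")
        seen.add(key)

    if len(tool_calls) > max_tool_calls:
        problems.append(f"Too many tool calls: {len(tool_calls)}")

    # Did the agent stop at the right time? The last step should be the final answer.
    if not steps or steps[-1].get("type") != "final":
        problems.append("Trace does not end with a final answer")

    return problems


trace = {
    "input": "Where is order ORD-10492?",
    "steps": [
        {"type": "tool_call", "tool": "get_order_status", "args": {"order_id": "ORD-10492"}},
        {"type": "tool_result", "success": True, "status": "shipped"},
        {"type": "final", "content": "Your order has shipped..."},
    ],
}
problems = check_trajectory(trace, allowed_tools=["get_order_status"], max_tool_calls=3)
```

Whether the agent interpreted a tool result correctly is harder to check structurally and usually needs a rubric or human review, so this sketch leaves it out.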
Non-determinism and statistical thinking
AI systems are probabilistic. Sampling alone can change the output from run to run, and even at low temperature, behavior can shift with model versions, infrastructure changes, retrieval differences, or prompt edits.
That means evaluation should be repeated and tracked over time. A single successful run is weak evidence. A suite of runs across realistic cases is stronger.
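One way to see why a single run is weak evidence is to put an uncertainty range around an observed pass rate. The sketch below uses the Wilson score interval for a binomial proportion. The helper name `wilson_interval` is hypothetical, `z = 1.96` corresponds to roughly 95% confidence, and the run counts are made-up illustration values, not measurements.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate observed over repeated runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# One passing run versus 95 passes out of 100 runs (illustrative counts).
single_run = wilson_interval(passes=1, runs=1)
suite = wilson_interval(passes=95, runs=100)
```

With one passing run the interval covers most of the range from 0 to 1, while 95 passes out of 100 gives a much narrower interval around the observed rate.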
You may track metrics such as:
- Task success rate
- Tool selection accuracy
- Argument validity rate
- Policy violation rate
- Average cost per run
- p50 / p95 / p99 latency
- Human escalation rate
- User correction rate
This resembles production observability more than ordinary unit testing.
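A few of these metrics can be computed from per-run records, as in the sketch below. The `RunRecord` fields, the `summarize` function, and the sample values are assumptions for illustration. The percentile helper uses the nearest-rank method, one of several common definitions.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    # Hypothetical per-run record; the field names are assumptions for this sketch.
    succeeded: bool
    policy_violation: bool
    escalated: bool
    cost_usd: float
    latency_ms: float


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Aggregate per-run records into suite-level metrics."""
    if not runs:
        raise ValueError("runs must not be empty")
    n = len(runs)
    latencies = [r.latency_ms for r in runs]
    return {
        "task_success_rate": sum(r.succeeded for r in runs) / n,
        "policy_violation_rate": sum(r.policy_violation for r in runs) / n,
        "human_escalation_rate": sum(r.escalated for r in runs) / n,
        "avg_cost_usd": sum(r.cost_usd for r in runs) / n,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
    }


# Made-up sample data: four runs, one of which failed and was escalated.
runs = [
    RunRecord(True, False, False, 0.01, 100.0),
    RunRecord(True, False, False, 0.02, 200.0),
    RunRecord(False, True, True, 0.03, 300.0),
    RunRecord(True, False, False, 0.04, 1000.0),
]
summary = summarize(runs)
```

Reporting tail latency next to the median matters because one slow run, like the 1000 ms run here, barely moves p50 but dominates p95 and p99.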
Example evaluation harness
A minimal Python eval harness might look like this:
from dataclasses import dataclass
from typing import List


@dataclass
class EvalCase:
    id: str
    input: str
    expected_tools: List[str]
    forbidden_tools: List[str]
    rubric: List[str]


@dataclass
class EvalResult:
    id: str
    passed: bool
    failures: List[str]


def evaluate_trace(case: EvalCase, trace: dict) -> EvalResult:
    failures = []
    steps = trace.get("steps", [])

    # Names of every tool the agent called, in order.
    called_tools = [step.get("tool") for step in steps if step.get("type") == "tool_call"]

    for tool in case.expected_tools:
        if tool not in called_tools:
            failures.append(f"Expected tool not called: {tool}")

    for tool in case.forbidden_tools:
        if tool in called_tools:
            failures.append(f"Forbidden tool was called: {tool}")

    # Accept a top-level "final_answer" if present; otherwise use the "content"
    # of the step with type "final", as in the trace format shown earlier.
    final_answer = trace.get("final_answer") or next(
        (step.get("content") or "" for step in steps if step.get("type") == "final"),
        "",
    )
    if not final_answer.strip():
        failures.append("Missing final answer")

    return EvalResult(id=case.id, passed=len(failures) == 0, failures=failures)
This harness does not yet use the rubric or judge prose quality, but it catches structural agent failures.
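One possible way to bring the rubric into play is to pass the harness a judge: a callable that decides whether a single rubric item is satisfied by the final answer. The sketch below is an assumption about how that could look, not part of the harness above. The names `Judge`, `grade_rubric`, `keyword_judge`, and `FORBIDDEN_PHRASES` are hypothetical, and the keyword judge is a toy stand-in for a human reviewer or a more capable automated grader.

```python
from typing import Callable, Dict, List

# A judge takes (rubric_item, final_answer) and returns True if the item is satisfied.
Judge = Callable[[str, str], bool]


def grade_rubric(rubric: List[str], final_answer: str, judge: Judge) -> List[str]:
    """Return the rubric items the judge considers unmet."""
    return [item for item in rubric if not judge(item, final_answer)]


# Toy lookup table: phrases that would violate a given rubric item.
FORBIDDEN_PHRASES: Dict[str, List[str]] = {
    "Does not claim refund is approved": ["refund is approved", "refund has been approved"],
}


def keyword_judge(item: str, answer: str) -> bool:
    """Toy judge: fails an item only if the answer contains a phrase listed for it.

    Items with no listed phrases are treated as satisfied, which a real judge
    should not assume.
    """
    text = answer.lower()
    return not any(phrase in text for phrase in FORBIDDEN_PHRASES.get(item, []))


rubric = ["Does not claim refund is approved", "Explains next step clearly"]
unmet_bad = grade_rubric(rubric, "Your refund has been approved!", keyword_judge)
unmet_ok = grade_rubric(rubric, "I can draft a refund request for you.", keyword_judge)
```

Keeping the judge as a parameter lets the structural checks stay deterministic while the rubric grader is swapped or improved independently.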
Practical takeaway
Evaluating production agents means combining deterministic software tests, behavioral evals, trajectory checks, human review, and production monitoring. The goal is not to prove the system will never fail. The goal is to make quality measurable, regressions visible, and failures actionable.