Why AI Evaluation Is Different

Traditional software testing assumes that the same input produces the same output. If a function adds 2 and 2, it should return 4. If an API receives an invalid token, it should return an authorization error. Most software tests are built around these deterministic expectations.

AI agents are different. A language model may produce different valid outputs for the same input. A tool-using agent may take different paths depending on retrieved context, tool latency, model sampling, conversation history, or external system state. This does not mean AI systems cannot be tested. It means they require a broader evaluation strategy.

A production agent should be evaluated at multiple levels:

- Final output quality
- Tool selection quality
- Argument validity
- Trajectory quality
- Policy compliance
- Latency and cost
- User experience
- Failure handling

The key shift is from testing exact outputs to evaluating behavior against criteria.
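
To make that shift concrete, here is a minimal sketch of a criteria-based check. The function name `meets_criteria`, the order ID, and the three criteria are illustrative assumptions, not part of any particular framework.

```python
from typing import Callable, Dict, List

# Each criterion is a named predicate over the agent's final text.
# The names and rules below are hypothetical examples.
Criterion = Callable[[str], bool]

CRITERIA: Dict[str, Criterion] = {
    "mentions_order_id": lambda text: "ORD-1001" in text,
    "does_not_promise_refund": lambda text: "refund has been approved" not in text.lower(),
    "is_not_empty": lambda text: bool(text.strip()),
}


def meets_criteria(text: str, criteria: Dict[str, Criterion]) -> List[str]:
    """Return the names of the criteria that the text fails."""
    return [name for name, check in criteria.items() if not check(text)]


# Two differently worded answers can satisfy the same criteria,
# whereas an exact-match assertion could accept at most one of them.
answer_a = "Order ORD-1001 is delayed. I can draft a refund request for you."
answer_b = "I checked ORD-1001: it is running late, so you may be eligible for a refund."

assert answer_a != answer_b
assert meets_criteria(answer_a, CRITERIA) == []
assert meets_criteria(answer_b, CRITERIA) == []
```

Simple string predicates like these only cover mechanical criteria. Judging clarity or tone needs a rubric and a reviewer, human or model.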

Deterministic tests still matter

AI evaluation does not replace traditional testing. You still need normal software tests for deterministic components:

- Tool implementations
- API clients
- Authentication and authorization
- Database queries
- Schema validation
- Parsers and serializers
- Prompt assembly functions
- Cost calculators
- Retry logic

For example, a function that validates tool arguments should have ordinary unit tests:

def validate_order_id(order_id: object) -> bool:
    # Accept only strings with the "ORD-" prefix; anything else, including None, is invalid.
    return isinstance(order_id, str) and order_id.startswith("ORD-")

def test_validate_order_id():
    assert validate_order_id("ORD-12345") is True
    assert validate_order_id("12345") is False
    assert validate_order_id(None) is False

These tests are stable and should run in CI like any other test suite.

Behavioral evaluation

The model-driven parts of an agent need behavioral evaluation. Instead of expecting exact text, you define what good behavior looks like.

Example task:

User: My order is late. Can I get a refund? Order ID ORD-7711.

Expected behavior:

- Look up order status.
- Retrieve refund policy.
- Do not approve refund automatically.
- Explain eligibility clearly.
- Offer to create a refund request draft if appropriate.

The exact wording can vary. The required behavior should not.

A simple eval case might be represented as JSON:

{
  "id": "support_refund_001",
  "input": "My order is late. Can I get a refund? Order ID ORD-7711.",
  "expected_tools": ["get_order_status", "search_refund_policy"],
  "forbidden_tools": ["approve_refund"],
  "rubric": [
    "Checks order status before answering",
    "Uses refund policy before discussing eligibility",
    "Does not claim refund is approved",
    "Explains next step clearly"
  ]
}

This is closer to acceptance testing than snapshot testing.
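
Because eval cases are data, it helps to validate them before running anything. The sketch below loads the case shown above and rejects malformed definitions. The function name `load_eval_case` and the specific validation rules are assumptions chosen for illustration.

```python
import json
from typing import Any, Dict

REQUIRED_FIELDS = ("id", "input", "expected_tools", "forbidden_tools", "rubric")


def load_eval_case(raw: str) -> Dict[str, Any]:
    """Parse one eval case from JSON and reject malformed definitions."""
    case = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in case]
    if missing:
        raise ValueError(f"Eval case is missing fields: {missing}")
    # A tool cannot be both required and forbidden in the same case.
    overlap = set(case["expected_tools"]) & set(case["forbidden_tools"])
    if overlap:
        raise ValueError(f"Tools both expected and forbidden: {sorted(overlap)}")
    if not case["rubric"]:
        raise ValueError("Eval case needs at least one rubric item")
    return case


RAW_CASE = """
{
  "id": "support_refund_001",
  "input": "My order is late. Can I get a refund? Order ID ORD-7711.",
  "expected_tools": ["get_order_status", "search_refund_policy"],
  "forbidden_tools": ["approve_refund"],
  "rubric": [
    "Checks order status before answering",
    "Uses refund policy before discussing eligibility",
    "Does not claim refund is approved",
    "Explains next step clearly"
  ]
}
"""

case = load_eval_case(RAW_CASE)
print(case["id"], len(case["rubric"]))
```

Catching a contradictory or incomplete case at load time keeps a broken test definition from being misread as an agent failure.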

Trajectory evaluation

For agents, the path matters. A final answer may look correct even if the agent used the wrong evidence, skipped a required tool, or attempted an unauthorized action before recovering.

A trace might look like:

{
  "input": "Where is order ORD-10492?",
  "steps": [
    { "type": "tool_call", "tool": "get_order_status", "args": { "order_id": "ORD-10492" } },
    { "type": "tool_result", "success": true, "status": "shipped" },
    { "type": "final", "content": "Your order has shipped..." }
  ]
}

Trajectory evaluation asks:

- Did the agent call the right tool?
- Were arguments valid?
- Was the result interpreted correctly?
- Were unnecessary calls avoided?
- Were unsafe calls blocked?
- Did the agent stop at the right time?

This is especially important for tool-using and multi-agent systems.
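
Some of these questions can be checked mechanically against a trace in the format shown above. The following is a sketch under stated assumptions: `TOOL_SCHEMAS` is a hypothetical registry of required argument names, an identical repeated call is treated as unnecessary, and "stopping at the right time" is reduced to ending with exactly one final step.

```python
import json
from typing import Any, Dict, List

# Hypothetical registry: tool name -> argument names the tool requires.
TOOL_SCHEMAS: Dict[str, List[str]] = {
    "get_order_status": ["order_id"],
}


def check_trajectory(trace: Dict[str, Any]) -> List[str]:
    """Return a list of structural problems found in a trace."""
    problems: List[str] = []
    steps = trace.get("steps", [])
    seen_calls = set()

    for index, step in enumerate(steps):
        if step.get("type") != "tool_call":
            continue
        tool = step.get("tool")
        args = step.get("args", {})
        if tool not in TOOL_SCHEMAS:
            problems.append(f"step {index}: unknown tool {tool!r}")
            continue
        for name in TOOL_SCHEMAS[tool]:
            if name not in args:
                problems.append(f"step {index}: {tool} missing argument {name!r}")
        # An identical repeated call is treated as unnecessary in this sketch.
        signature = (tool, json.dumps(args, sort_keys=True))
        if signature in seen_calls:
            problems.append(f"step {index}: duplicate call to {tool}")
        seen_calls.add(signature)

    # The agent should stop with exactly one final step, and it should be last.
    final_positions = [i for i, s in enumerate(steps) if s.get("type") == "final"]
    if final_positions != [len(steps) - 1]:
        problems.append("trace does not end with a single final step")
    return problems


trace = {
    "input": "Where is order ORD-10492?",
    "steps": [
        {"type": "tool_call", "tool": "get_order_status", "args": {"order_id": "ORD-10492"}},
        {"type": "tool_result", "success": True, "status": "shipped"},
        {"type": "final", "content": "Your order has shipped..."},
    ],
}
print(check_trajectory(trace))  # prints []
```

Whether the result was interpreted correctly is not covered here. That question needs a comparison between the tool result and the final content, which usually calls for a rubric or a reviewer.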

Non-determinism and statistical thinking

AI systems are probabilistic. Even at low temperature, the same input can yield different outputs, and behavior can shift with model versions, infrastructure changes, retrieval differences, or prompt edits.

That means evaluation should be repeated and tracked over time. A single successful run is weak evidence. A suite of runs across realistic cases is stronger.

You may track metrics such as:

- Task success rate
- Tool selection accuracy
- Argument validity rate
- Policy violation rate
- Average cost per run
- p50 / p95 / p99 latency
- Human escalation rate
- User correction rate

This resembles production observability more than ordinary unit testing.
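
As a rough illustration of the statistical view, the sketch below computes a confidence interval for a pass rate and nearest-rank latency percentiles. The Wilson score interval and the nearest-rank method are choices made for this example, and the latency values are made up.

```python
import math
from typing import List, Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile, with q in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# One passing run out of one says very little about the true pass rate.
print(wilson_interval(1, 1))
# Ninety passes out of a hundred narrows the range considerably.
print(wilson_interval(90, 100))

# Illustrative latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 80, 300, 95, 110, 2500, 130, 105, 90, 100]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```

With a single passing run the lower bound of the interval is only about 0.21, which is one way to quantify why a single success is weak evidence. The outlier barely moves the median but dominates p95, which is why tail percentiles are tracked separately from averages.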

Example evaluation harness

A minimal Python eval harness might look like this:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    id: str
    input: str
    expected_tools: List[str]
    forbidden_tools: List[str]
    rubric: List[str]

@dataclass
class EvalResult:
    id: str
    passed: bool
    failures: List[str]


def evaluate_trace(case: EvalCase, trace: dict) -> EvalResult:
    failures = []
    called_tools = [step.get("tool") for step in trace["steps"] if step.get("type") == "tool_call"]

    for tool in case.expected_tools:
        if tool not in called_tools:
            failures.append(f"Expected tool not called: {tool}")

    for tool in case.forbidden_tools:
        if tool in called_tools:
            failures.append(f"Forbidden tool was called: {tool}")

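    # The final answer is the content of the last "final" step, matching the trace format shown earlier.
    final_steps = [step for step in trace["steps"] if step.get("type") == "final"]
    final_answer = (final_steps[-1].get("content") or "") if final_steps else ""
    if not final_answer.strip():
        failures.append("Missing final answer")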

    return EvalResult(id=case.id, passed=len(failures) == 0, failures=failures)

This harness does not score the rubric or judge prose quality yet, but it catches structural agent failures: a missing expected tool, a forbidden tool call, or an empty final answer.
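
One way to extend the harness toward rubric scoring is to pass in a judge as a callable. This is a sketch only: the `Judge` signature, `score_rubric`, `rubric_pass_rate`, and the keyword-based stub are all hypothetical, and a real judge would typically wrap a model call or a human review step.

```python
from typing import Callable, Dict, List

# A judge receives (rubric_item, final_answer) and returns True if the item is satisfied.
# How the judge reaches its verdict is deliberately left open in this sketch.
Judge = Callable[[str, str], bool]


def score_rubric(rubric: List[str], final_answer: str, judge: Judge) -> Dict[str, bool]:
    """Apply the judge to every rubric item and record each verdict."""
    return {item: judge(item, final_answer) for item in rubric}


def rubric_pass_rate(verdicts: Dict[str, bool]) -> float:
    """Fraction of rubric items satisfied; 0.0 for an empty rubric."""
    if not verdicts:
        return 0.0
    return sum(verdicts.values()) / len(verdicts)


def keyword_stub_judge(item: str, answer: str) -> bool:
    """Toy stand-in for a real judge: it only understands one rubric item."""
    if item == "Does not claim refund is approved":
        return "approved" not in answer.lower()
    # Every other item is waved through, which a real judge must not do.
    return True


rubric = ["Does not claim refund is approved", "Explains next step clearly"]
verdicts = score_rubric(rubric, "Your refund is approved!", keyword_stub_judge)
print(rubric_pass_rate(verdicts))  # prints 0.5
```

Keeping the judge behind a plain callable means the structural checks stay deterministic, while the subjective part can be swapped out or audited separately.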

Practical takeaway

Evaluating production agents means combining deterministic software tests, behavioral evals, trajectory checks, human review, and production monitoring. The goal is not to prove the system will never fail. The goal is to make quality measurable, regressions visible, and failures actionable.

Key terms

Learning objectives