Tracing and Observability

Observability is the ability to understand a system's internal state from the signals it emits, such as logs, metrics, and traces. For production agents, it is not optional: when an agent fails, you need to know whether the problem came from the prompt, model, retrieval, tool selection, tool execution, memory, orchestration, or final response generation.

A trace is a full record of one agent execution. It captures the sequence of model calls, tool calls, retrieved documents, intermediate decisions, errors, and final output.

A useful phrase is:

trace = full agent execution record

Without traces, agent debugging becomes guesswork.

What a trace contains

A good agent trace includes:

Request ID
User input
System prompt version
Model name and version
Model parameters
Retrieved context
Tool calls and arguments
Tool results
Intermediate agent state
Errors and retries
Token usage
Latency by step
Final response
Evaluator scores

For privacy, traces should redact secrets and follow data retention policies. You need enough detail to debug without storing unnecessary sensitive content.
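One way to apply the redaction advice is to scrub each payload before it is written to the trace store. The sketch below is a minimal illustration, not a complete privacy solution: the `SENSITIVE_KEYS` set, the email pattern, and the `redact` helper are all assumptions made for this example, and a real deployment would derive them from its own data classification policy.

```python
import re
from typing import Any

# Hypothetical key names treated as sensitive; adjust to your own schema.
SENSITIVE_KEYS = {"api_key", "authorization", "password", "token", "secret"}

# Rough email pattern, used here only as an illustration of value-level masking.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(value: Any) -> Any:
    """Return a copy of a trace payload with sensitive values masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return EMAIL_PATTERN.sub("[EMAIL]", value)
    # Numbers, booleans, and None pass through unchanged.
    return value
```

Because `redact` builds a new structure instead of mutating its input, the agent can keep using the original values while only the masked copy is stored.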

Spans and nested operations

Observability systems often represent work as spans. A trace contains spans, and each span represents one operation.

Example:

Trace: support-agent-request-123
  Span: build_prompt
  Span: retrieve_policy_docs
  Span: model_call_decide_tools
  Span: tool_call_get_order_status
  Span: model_call_final_answer
  Span: run_evaluators

Breaking a request into spans shows where time is spent and which step failed.
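To make the trace-and-span relationship concrete, here is a small standard-library sketch of a span tree. It is not how any particular observability library is implemented; the `Span`, `Trace`, and `walk` names are invented for this illustration. Each span records its own timing and any error, and nested `with` blocks produce nested spans.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Span:
    name: str
    start: float = 0.0
    end: float = 0.0
    error: Optional[str] = None
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


class Trace:
    """Hypothetical in-memory trace: a root span plus a stack of open spans."""

    def __init__(self, trace_id: str) -> None:
        self.root = Span(name=trace_id)
        self._stack = [self.root]

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        current = Span(name=name, start=time.perf_counter())
        self._stack[-1].children.append(current)  # attach to the open parent
        self._stack.append(current)
        try:
            yield current
        except Exception as exc:
            current.error = repr(exc)  # record the failure, then re-raise
            raise
        finally:
            current.end = time.perf_counter()
            self._stack.pop()


def walk(span: Span, depth: int = 0) -> Iterator[Tuple[int, Span]]:
    """Yield (depth, span) pairs in execution order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)
```

Walking the tree after a request gives per-step durations and the list of spans whose `error` field is set, which is the information a trace viewer renders as a waterfall.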

OpenTelemetry-style instrumentation

OpenTelemetry is a common standard for traces, metrics, and logs. Even if you use AI-specific tools like LangSmith, Langfuse, Braintrust, or Weights & Biases Weave, the conceptual model is similar: capture structured spans and metadata.

A simplified Python example:

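from opentelemetry import trace

# Without a configured TracerProvider, these calls are no-ops.
tracer = trace.get_tracer("support-agent")


def retrieve_policy_docs(query: str) -> list[str]:
    # Placeholder for illustration: replace with your retrieval call.
    return []


def call_model(user_input: str, docs: list[str]) -> str:
    # Placeholder for illustration: replace with your model client call.
    return ""


def run_agent(user_input: str) -> str:
    # Parent span covering the whole agent run.
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("agent.name", "support-agent")
        # Record the input length rather than the raw text.
        span.set_attribute("input.length", len(user_input))

        # Child span: retrieval step.
        with tracer.start_as_current_span("retrieval.policy_docs") as rspan:
            docs = retrieve_policy_docs(user_input)
            rspan.set_attribute("retrieval.count", len(docs))

        # Child span: final model call.
        with tracer.start_as_current_span("model.final_answer") as mspan:
            # Set the model name first so it is recorded even if the call fails.
            mspan.set_attribute("model.name", "gpt-4.1-mini")
            answer = call_model(user_input, docs)
            mspan.set_attribute("output.length", len(answer))

        return answer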

In production, you would also record token usage, cost, errors, prompt version, and tool metadata as span attributes or events.

AI observability platforms

Several tools support AI-specific tracing and evaluation workflows:

LangSmith: tracing, datasets, evals, and debugging for LangChain and LangGraph workflows.
Langfuse: open-source LLM observability with traces, prompt management, metrics, and evaluations.
Braintrust: evals, logging, experiments, and prompt/model comparisons.
Weights & Biases Weave: tracing, evaluations, and experiment tracking for AI applications.
PromptLayer: prompt management, logging, and analytics.

The right tool depends on your stack, hosting needs, privacy requirements, and whether you want open-source or managed infrastructure.

Logging tool calls

Tool calls deserve special attention. Log:

Tool name
Arguments after validation
User or agent authorization context
Start and end time
Success/failure
Error code
Result summary
Retry count

Example log event:

{
  "trace_id": "trace_abc123",
  "span": "tool_call",
  "tool": "get_order_status",
  "arguments": { "order_id": "ORD-7711" },
  "success": true,
  "duration_ms": 142,
  "result_summary": "delivered_late"
}

Do not log raw secrets. Redact or hash sensitive values.
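A wrapper around tool execution is one way to produce events of this shape consistently. The following is a minimal sketch using only the standard library; `call_tool_with_logging` and its `emit` parameter are hypothetical names, and the arguments are assumed to be validated and redacted before they reach the wrapper.

```python
import json
import time
from typing import Any, Callable, Dict


def call_tool_with_logging(
    trace_id: str,
    tool_name: str,
    tool_fn: Callable[..., Any],
    arguments: Dict[str, Any],
    emit: Callable[[str], Any] = print,
) -> Any:
    """Run a tool and emit one structured JSON log event, success or failure."""
    event: Dict[str, Any] = {
        "trace_id": trace_id,
        "span": "tool_call",
        "tool": tool_name,
        "arguments": arguments,  # assumed already validated and redacted
        "success": False,
    }
    start = time.perf_counter()
    try:
        result = tool_fn(**arguments)
        event["success"] = True
        # Store a short summary, not the full result payload.
        event["result_summary"] = str(result)[:80]
        return result
    except Exception as exc:
        event["error_code"] = type(exc).__name__
        raise
    finally:
        # Runs on both paths, so every call produces exactly one event.
        event["duration_ms"] = round((time.perf_counter() - start) * 1000)
        emit(json.dumps(event, default=str))
```

Emitting the event in `finally` means failed calls are logged with the same fields as successful ones, which keeps error-rate calculations simple.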

Debugging with traces

Suppose an agent gives the wrong refund answer. The trace helps you ask:

Did retrieval return the correct policy?
Did the model call the order-status tool?
Did the tool return correct data?
Did the model misread the policy?
Did the final answer contradict the tool result?
Was the wrong prompt version deployed?

Each question maps to a span or a recorded attribute in the trace, so you can check them one at a time instead of guessing.
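Some of these checks can be automated as a first-pass triage. The sketch below assumes a hypothetical trace layout: a dictionary with a `prompt_version` field and a `spans` list whose entries carry `name`, `doc_count`, and `success` keys. Questions about whether the model misread a policy still need a human or an evaluator to read the span contents.

```python
from typing import Any, Dict, List


def triage(trace: Dict[str, Any], expected_prompt_version: str) -> List[str]:
    """Return findings for the checks that can be answered mechanically."""
    findings: List[str] = []
    spans = {span["name"]: span for span in trace.get("spans", [])}

    # Was the wrong prompt version deployed?
    if trace.get("prompt_version") != expected_prompt_version:
        findings.append("unexpected prompt version: %s" % trace.get("prompt_version"))

    # Did retrieval return any policy documents?
    retrieval = spans.get("retrieve_policy_docs")
    if retrieval is None or retrieval.get("doc_count", 0) == 0:
        findings.append("retrieval returned no policy documents")

    # Did the model call the order-status tool, and did the call succeed?
    tool = spans.get("tool_call_get_order_status")
    if tool is None:
        findings.append("order-status tool was never called")
    elif not tool.get("success", False):
        findings.append("order-status tool call failed")

    return findings
```

An empty result does not prove the answer was right; it only rules out the mechanical failures, so attention can go to the model's reasoning.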

Metrics from traces

Aggregated traces produce metrics:

Tool error rate
Average model calls per task
Average retrieved chunks per task
Token usage per request
Cost per successful task
p50, p95, p99 latency
Retry rate
Escalation rate
Eval pass rate

These metrics inform optimization and incident response.
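As a rough illustration of how a few of these metrics fall out of trace data, the sketch below aggregates a list of per-request summaries. The record fields (`latency_ms`, `tool_calls`, `tool_errors`, `cost_usd`, `success`) are assumed for this example, and the percentile uses the simple nearest-rank method; production metrics backends typically use histograms or other estimators.

```python
import math
from typing import Any, Dict, List, Optional


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces: List[Dict[str, Any]]) -> Dict[str, Optional[float]]:
    """Aggregate per-request trace summaries into fleet-level metrics."""
    latencies = [t["latency_ms"] for t in traces]
    tool_calls = sum(t["tool_calls"] for t in traces)
    tool_errors = sum(t["tool_errors"] for t in traces)
    successes = sum(1 for t in traces if t["success"])
    total_cost = sum(t["cost_usd"] for t in traces)
    return {
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
        "tool_error_rate": tool_errors / tool_calls if tool_calls else 0.0,
        # Failed tasks still cost money, so divide total cost by successes only.
        "cost_per_successful_task": total_cost / successes if successes else None,
    }
```

Dividing total cost by successful tasks, not by all tasks, is what makes the metric sensitive to reliability: a drop in success rate raises it even when per-request cost is flat.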

Practical takeaway

Production agents must be observable. Traces reveal the path from user input to final output. Logs capture important events. Metrics reveal trends. Together, they let you debug failures, compare versions, control cost, and improve reliability.

If you cannot inspect what your agent did, you cannot responsibly operate it in production.
