
Tracing and Observability
AGAI 401 · Observability and Reliability
Learn how traces, spans, logs, and metrics reveal what happened inside a model-driven agent execution.
Key terms
- trace = full agent execution record
- span = one operation in a trace
- logs explain events
- metrics reveal trends
Learning objectives
- Define traces, spans, logs, and metrics for agent systems.
- Identify the key fields to capture in an agent trace.
- Instrument a basic agent workflow with tracing concepts.
- Use traces to debug model, retrieval, and tool failures.
Observability is the ability to understand what a system is doing internally from the signals it emits, such as traces, logs, and metrics. For production agents, observability is not optional. When an agent fails, you need to know whether the problem came from the prompt, model, retrieval, tool selection, tool execution, memory, orchestration, or final response generation.
A trace is a full record of one agent execution. It captures the sequence of model calls, tool calls, retrieved documents, intermediate decisions, errors, and final output.
A useful phrase is:
trace = full agent execution record
Without traces, agent debugging becomes guesswork.
What a trace contains
A good agent trace includes:
- Request ID
- User input
- System prompt version
- Model name and version
- Model parameters
- Retrieved context
- Tool calls and arguments
- Tool results
- Intermediate agent state
- Errors and retries
- Token usage
- Latency by step
- Final response
- Evaluator scores
For privacy, traces should redact secrets and follow data retention policies. You need enough detail to debug without storing unnecessary sensitive content.
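The fields listed above can be collected into one structured record per request. The following is a minimal sketch, not a standard schema: the class names `AgentTrace` and `ToolCall`, the field names, and the `SENSITIVE_KEYS` set are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field, asdict
from typing import Any, Optional

# Hypothetical key names treated as sensitive; adjust to your own data model.
SENSITIVE_KEYS = {"api_key", "password", "authorization", "email"}


def redact(value: Any) -> Any:
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    result: Any = None
    error: Optional[str] = None
    retries: int = 0


@dataclass
class AgentTrace:
    request_id: str
    user_input: str
    prompt_version: str
    model_name: str
    model_params: dict[str, Any] = field(default_factory=dict)
    retrieved_context: list[str] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)
    intermediate_state: list[dict[str, Any]] = field(default_factory=list)
    token_usage: dict[str, int] = field(default_factory=dict)
    latency_ms_by_step: dict[str, float] = field(default_factory=dict)
    final_response: str = ""
    evaluator_scores: dict[str, float] = field(default_factory=dict)

    def to_record(self) -> dict[str, Any]:
        """Return a redacted dict intended for a trace store."""
        return redact(asdict(self))


# Usage: the secret in the tool arguments never reaches the stored record.
agent_trace = AgentTrace(
    request_id="req-123",
    user_input="Where is my order?",
    prompt_version="v7",
    model_name="example-model",
)
agent_trace.tool_calls.append(
    ToolCall("get_order_status", {"order_id": "ORD-7711", "api_key": "sk-secret"})
)
record = agent_trace.to_record()
print(record["tool_calls"][0]["arguments"])
```

Key-based redaction like this only catches secrets stored under known keys. Free-text fields such as the user input may need separate scrubbing, depending on your retention policy.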
Spans and nested operations
Observability systems often represent work as spans. A trace contains spans, and each span represents one operation.
Example:
Trace: support-agent-request-123
  Span: build_prompt
  Span: retrieve_policy_docs
  Span: model_call_decide_tools
  Span: tool_call_get_order_status
  Span: model_call_final_answer
  Span: run_evaluators
This makes latency and failure points visible.
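To make the trace-and-span relationship concrete, here is a toy tracer built only on the Python standard library. It is a sketch for illustration, not a replacement for a real tracing SDK: `MiniTracer`, `Span`, and `render` are hypothetical names, and the `http_request` child span is invented to show nesting.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    name: str
    start: float = 0.0
    end: float = 0.0
    error: Optional[str] = None
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


class MiniTracer:
    """Toy tracer: a stack makes spans opened inside another span nest under it."""

    def __init__(self, trace_name: str) -> None:
        self.root = Span(trace_name, start=time.perf_counter())
        self._stack = [self.root]

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        current = Span(name, start=time.perf_counter())
        self._stack[-1].children.append(current)
        self._stack.append(current)
        try:
            yield current
        except Exception as exc:
            current.error = repr(exc)  # mark the failure point, then re-raise
            raise
        finally:
            current.end = time.perf_counter()
            self.root.end = current.end  # root ends when the latest span ends
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list[str]:
    """Flatten the span tree into indented lines with durations."""
    status = f" ERROR {span.error}" if span.error else ""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:.1f} ms){status}"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = MiniTracer("support-agent-request-123")
with tracer.span("retrieve_policy_docs"):
    time.sleep(0.01)  # stand-in for real work
with tracer.span("tool_call_get_order_status"):
    with tracer.span("http_request"):  # hypothetical nested operation
        time.sleep(0.01)
print("\n".join(render(tracer.root)))
```

The printed tree shows one line per span with its duration, indented by nesting depth, so the slowest or failing step stands out.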
OpenTelemetry-style instrumentation
OpenTelemetry is a common standard for traces, metrics, and logs. Even if you use AI-specific tools like LangSmith, Langfuse, Braintrust, or Weights & Biases Weave, the conceptual model is similar: capture structured spans and metadata.
A simplified Python example:
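from opentelemetry import trace

# retrieve_policy_docs() and call_model() are application-specific helpers
# that are assumed to be defined elsewhere in your codebase.
tracer = trace.get_tracer("support-agent")


def run_agent(user_input: str) -> str:
    # Root span: covers the whole agent execution.
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("agent.name", "support-agent")
        span.set_attribute("input.length", len(user_input))

        # Child span: retrieval step.
        with tracer.start_as_current_span("retrieval.policy_docs") as rspan:
            docs = retrieve_policy_docs(user_input)
            rspan.set_attribute("retrieval.count", len(docs))

        # Child span: final model call.
        with tracer.start_as_current_span("model.final_answer") as mspan:
            # Set the model name first so it is recorded even if the call raises.
            mspan.set_attribute("model.name", "gpt-4.1-mini")
            answer = call_model(user_input, docs)
            mspan.set_attribute("output.length", len(answer))

        return answer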
In production, you would add token usage, cost, errors, prompt version, and tool metadata.
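As one example of that extra metadata, the sketch below turns token counts into flat key/value attributes that could be attached to a model-call span. The attribute names, the `Pricing` class, and the prices are assumptions for illustration; they are not official OpenTelemetry semantic conventions or real provider rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in USD per 1M tokens; replace with your provider's rates.
    input_per_million: float
    output_per_million: float


def usage_attributes(
    prompt_tokens: int,
    completion_tokens: int,
    pricing: Pricing,
    prompt_version: str,
) -> dict[str, object]:
    """Build flat attributes describing token usage and estimated cost."""
    cost = (
        prompt_tokens * pricing.input_per_million
        + completion_tokens * pricing.output_per_million
    ) / 1_000_000
    return {
        "prompt.version": prompt_version,
        "tokens.prompt": prompt_tokens,
        "tokens.completion": completion_tokens,
        "tokens.total": prompt_tokens + completion_tokens,
        "cost.usd": round(cost, 6),
    }


attrs = usage_attributes(1200, 300, Pricing(0.5, 1.5), prompt_version="v7")
for key, value in attrs.items():
    # With a real span you would call: mspan.set_attribute(key, value)
    print(key, value)
```

Keeping the attributes flat (strings and numbers only) matters because span attributes in most tracing backends do not accept nested objects.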
AI observability platforms
Several tools support AI-specific tracing and evaluation workflows:
- LangSmith: tracing, datasets, evals, and debugging for LangChain and LangGraph workflows.
- Langfuse: open-source LLM observability with traces, prompt management, metrics, and evaluations.
- Braintrust: evals, logging, experiments, and prompt/model comparisons.
- Weights & Biases Weave: tracing, evaluations, and experiment tracking for AI applications.
- PromptLayer: prompt management, logging, and analytics.
The right tool depends on your stack, hosting needs, privacy requirements, and whether you want open-source or managed infrastructure.
Logging tool calls
Tool calls deserve special attention. Log:
- Tool name
- Arguments after validation
- User or agent authorization context
- Start and end time
- Success/failure
- Error code
- Result summary
- Retry count
Example log event:
{
  "trace_id": "trace_abc123",
  "span": "tool_call",
  "tool": "get_order_status",
  "arguments": { "order_id": "ORD-7711" },
  "success": true,
  "duration_ms": 142,
  "result_summary": "delivered_late"
}
Do not log raw secrets. Redact or hash sensitive values.
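A wrapper around tool execution is one way to produce such events consistently. The sketch below is an assumption-laden illustration: `call_tool_logged`, `SENSITIVE_ARGS`, and the placeholder `HASH_KEY` are hypothetical, and a keyed hash (HMAC) is used rather than a plain hash because low-entropy values such as emails are otherwise easy to guess by brute force.

```python
import hashlib
import hmac
import json
import time
from typing import Any, Callable

# Hypothetical argument names to hash before logging.
SENSITIVE_ARGS = {"email", "card_number", "api_key"}
# Placeholder key for illustration only; load a real key from a secret store.
HASH_KEY = b"replace-me"


def _keyed_hash(value: Any) -> str:
    digest = hmac.new(HASH_KEY, str(value).encode("utf-8"), hashlib.sha256)
    return f"hmac-sha256:{digest.hexdigest()[:16]}"


def call_tool_logged(
    trace_id: str,
    tool: str,
    arguments: dict[str, Any],
    fn: Callable[..., Any],
    emit: Callable[[str], None] = print,
) -> Any:
    """Run fn(**arguments), emit one JSON log line, and re-raise any error."""
    safe_args = {
        k: _keyed_hash(v) if k in SENSITIVE_ARGS else v
        for k, v in arguments.items()
    }
    event: dict[str, Any] = {
        "trace_id": trace_id,
        "span": "tool_call",
        "tool": tool,
        "arguments": safe_args,
    }
    start = time.perf_counter()
    try:
        result = fn(**arguments)
        event["success"] = True
        event["result_summary"] = str(result)[:80]  # summary, not the full payload
        return result
    except Exception as exc:
        event["success"] = False
        event["error_code"] = type(exc).__name__
        raise
    finally:
        # Runs on both success and failure, so every call leaves a log line.
        event["duration_ms"] = round((time.perf_counter() - start) * 1000)
        emit(json.dumps(event))


def get_order_status(order_id: str, email: str) -> str:
    return "delivered_late"  # stand-in for a real lookup


call_tool_logged(
    "trace_abc123",
    "get_order_status",
    {"order_id": "ORD-7711", "email": "customer@example.com"},
    get_order_status,
)
```

The tool still receives the raw arguments; only the logged copy is hashed. Retry count and authorization context are left out to keep the sketch short.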
Debugging with traces
Suppose an agent gives the wrong refund answer. The trace helps you ask:
- Did retrieval return the correct policy?
- Did the model call the order-status tool?
- Did the tool return correct data?
- Did the model misread the policy?
- Did the final answer contradict the tool result?
- Was the wrong prompt version deployed?
Each question maps to a specific span or attribute in the trace, so you can check them one at a time instead of guessing.
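Several of these questions can be checked mechanically against a stored trace. The sketch below assumes a hypothetical trace layout (a dict with `prompt_version` and a list of `spans` carrying `doc_ids`, `result`, `error`, and `output` fields); real platforms expose different schemas. Whether the model misread the policy usually needs a human or an evaluator, so that question is not automated here, and the substring check on the final answer is only a crude stand-in.

```python
from typing import Any, Optional


def first_suspect(
    trace: dict[str, Any],
    expected_prompt_version: str,
    expected_doc_id: str,
) -> Optional[str]:
    """Run the checks in pipeline order; return the first failure, or None."""
    spans = {s["name"]: s for s in trace["spans"]}

    if trace.get("prompt_version") != expected_prompt_version:
        return "wrong prompt version deployed"

    retrieval = spans.get("retrieve_policy_docs", {})
    if expected_doc_id not in retrieval.get("doc_ids", []):
        return "retrieval missed the correct policy"

    tool = spans.get("tool_call_get_order_status")
    if tool is None:
        return "model never called the order-status tool"
    if tool.get("error"):
        return "tool call failed"

    final = spans.get("model_call_final_answer", {})
    if str(tool.get("result")) not in final.get("output", ""):
        return "final answer does not reflect the tool result"

    return None


example_trace = {
    "prompt_version": "v7",
    "spans": [
        {"name": "retrieve_policy_docs", "doc_ids": ["shipping-faq"]},
        {"name": "tool_call_get_order_status", "result": "delivered_late", "error": None},
        {"name": "model_call_final_answer", "output": "No refund is available."},
    ],
}
print(first_suspect(example_trace, "v7", "refund-policy"))
```

In this example the retrieval span never returned the refund policy document, so the function points at retrieval before looking at later steps.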
Metrics from traces
Aggregated traces produce metrics:
- Tool error rate
- Average model calls per task
- Average retrieved chunks per task
- Token usage per request
- Cost per successful task
- p50, p95, p99 latency
- Retry rate
- Escalation rate
- Eval pass rate
These metrics inform optimization and incident response.
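A few of the metrics above can be computed directly from a batch of trace summaries. This is a minimal sketch under assumptions: the per-trace fields (`success`, `latency_ms`, `cost_usd`, `model_calls`, `tool_calls`) are hypothetical, and percentiles use the simple nearest-rank method, which may differ slightly from the interpolation your metrics backend applies.

```python
import math
from typing import Any


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces: list[dict[str, Any]]) -> dict[str, float]:
    """Aggregate per-trace summaries into a handful of fleet-level metrics."""
    if not traces:
        raise ValueError("summarize() needs at least one trace")
    tool_calls = [call for t in traces for call in t["tool_calls"]]
    failed_calls = sum(1 for call in tool_calls if not call["ok"])
    successes = sum(1 for t in traces if t["success"])
    latencies = [t["latency_ms"] for t in traces]
    # Failed tasks still cost money, so total cost is divided by successes only.
    total_cost = sum(t["cost_usd"] for t in traces)
    return {
        "tool_error_rate": failed_calls / len(tool_calls) if tool_calls else 0.0,
        "avg_model_calls": sum(t["model_calls"] for t in traces) / len(traces),
        "cost_per_successful_task": total_cost / successes if successes else float("inf"),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
    }


sample = [
    {"success": True, "latency_ms": 100, "cost_usd": 0.01, "model_calls": 2,
     "tool_calls": [{"ok": True}]},
    {"success": True, "latency_ms": 200, "cost_usd": 0.01, "model_calls": 2,
     "tool_calls": [{"ok": True}]},
    {"success": False, "latency_ms": 300, "cost_usd": 0.01, "model_calls": 3,
     "tool_calls": [{"ok": False}]},
    {"success": True, "latency_ms": 400, "cost_usd": 0.01, "model_calls": 1,
     "tool_calls": [{"ok": True}]},
]
for name, value in summarize(sample).items():
    print(f"{name}: {value}")
```

With only a handful of traces, p95 and p99 collapse to the maximum; tail percentiles become meaningful once you aggregate hundreds or thousands of requests.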
Practical takeaway
Production agents must be observable. Traces reveal the path from user input to final output. Logs capture important events. Metrics reveal trends. Together, they let you debug failures, compare versions, control cost, and improve reliability.
If you cannot inspect what your agent did, you cannot responsibly operate it in production.