
Monitoring, Alerting, and Incident Response

AGAI 401 · Deployment, Operations, and Optimization

Learn how to monitor live agent behavior, alert on quality and safety risks, investigate incidents, and continuously improve production systems.

Key terms

monitoring = evals for live traffic
alerts must be actionable
incident → eval case → fix
drift requires continuous measurement

Learning objectives

Identify production metrics for system health, AI quality, cost, and safety.
Design actionable alerts for agent failures.
Create incident response runbooks for AI-specific failures.
Use production traces to improve eval suites and system reliability.

Production monitoring answers a simple question: is the agent behaving acceptably for real users right now?

Offline evals are necessary, but they cannot cover every real-world input. Once deployed, your agent will encounter new users, unexpected phrasing, tool failures, stale documents, adversarial prompts, and changing business conditions.

Monitoring closes the loop between production behavior and system improvement.

What to monitor

Monitor at several layers.

System health:

Request volume
Error rate
Timeout rate
Model provider errors
Tool API failures
Queue depth
Rate-limit events

AI behavior:

Task success proxy
Tool-call frequency
Unexpected tool combinations
Invalid output rate
Refusal rate
Escalation rate
Retry rate
Policy violation rate
User correction rate

Cost and latency:

Tokens per request
Cost per request
Cost per successful task
p50 / p95 / p99 latency
Model calls per trace
Tool calls per trace

Quality signals:

Thumbs up/down
Human reviewer scores
LLM judge scores on sampled traces
Support tickets about AI behavior
User rephrasing or abandonment
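
As a rough illustration of how a few of the metrics above could be derived from raw trace records, here is a minimal Python sketch. The `Trace` fields and the `summarize` helper are hypothetical names chosen for this example, not part of any particular tracing library, and the nearest-rank percentile is only one of several common definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical trace record; real tracing systems expose richer fields.
    latency_s: float
    cost_usd: float
    succeeded: bool       # task success proxy
    error: bool = False   # request-level error


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces):
    """Aggregate a non-empty batch of traces into health, cost, and latency metrics."""
    n = len(traces)
    successes = sum(t.succeeded for t in traces)
    total_cost = sum(t.cost_usd for t in traces)
    latencies = [t.latency_s for t in traces]
    return {
        "request_volume": n,
        "error_rate": sum(t.error for t in traces) / n,
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "cost_per_request": total_cost / n,
        # Undefined when nothing succeeded, so report None instead of dividing by zero.
        "cost_per_successful_task": total_cost / successes if successes else None,
    }
```

Cost per successful task divides total spend by successes only, so it rises when quality drops even if cost per request stays flat.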

Alerts

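Alerts should be actionable: each one should name a specific metric, a threshold, and a time window so the responder knows what to check. Too many vague or noisy alerts cause alert fatigue, and real problems get ignored.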

Good alerts:

Policy violation rate > 0 in last 30 minutes
Tool error rate for get_order_status > 10% for 10 minutes
p95 latency > 15 seconds for 15 minutes
Average cost per request doubled compared with 7-day baseline
Invalid JSON rate > 5% after prompt release
Prompt-injection detector triggered on high-risk tool flow

Weak alerts:

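Something may be wrong with AI quality.

This alert is weak because it names no metric, threshold, time window, or affected component, so the responder has nothing to act on.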

Each alert should have an owner and a runbook.
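
One way to make the owner and runbook part of the alert itself is to attach them to the rule definition. The sketch below evaluates a windowed tool error rate, in the spirit of the `get_order_status` example above. The `AlertRule` class, the `check` function, and the owner and runbook values used with them are hypothetical, not a real alerting API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float      # fire when the error rate is strictly above this
    window: timedelta     # how far back to look
    owner: str            # hypothetical on-call rotation name
    runbook: str          # hypothetical runbook location


def tool_error_rate(calls, now, window):
    """calls: list of (timestamp, succeeded) pairs. Returns None if the window is empty."""
    recent = [ok for ts, ok in calls if now - window <= ts <= now]
    if not recent:
        return None
    return sum(1 for ok in recent if not ok) / len(recent)


def check(rule, calls, now):
    """Return an alert dict when the rule fires, otherwise None."""
    rate = tool_error_rate(calls, now, rule.window)
    if rate is None or rate <= rule.threshold:
        return None
    return {
        "alert": rule.name,
        "value": rate,
        "threshold": rule.threshold,
        "owner": rule.owner,
        "runbook": rule.runbook,
    }
```

Because the owner and runbook travel with the alert payload, whoever receives it does not have to look up who is responsible or what to do first.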

Example monitoring event

{
  "timestamp": "2025-11-18T14:22:00Z",
  "service": "support-agent",
  "metric": "tool_error_rate",
  "tool": "get_order_status",
  "value": 0.18,
  "threshold": 0.10,
  "window": "10m",
  "severity": "high"
}

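This event tells the on-call engineer where to look: in the support-agent service, the get_order_status tool is failing on 18% of calls over a 10-minute window, above the 10% threshold.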

Incident response

AI incidents may involve wrong answers, unsafe tool calls, privacy leaks, cost spikes, bad prompt releases, retrieval failures, or model provider outages.

A basic incident process:

1. Detect issue through alert, user report, or review.
2. Triage severity and scope.
3. Mitigate: rollback, disable tool, switch model, or reduce permissions.
4. Preserve traces and logs.
5. Identify root cause.
6. Add eval or monitor to prevent recurrence.
7. Document timeline and corrective actions.

For high-risk agents, include a kill switch: a single control that immediately stops the agent or disables its high-impact tools.
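
A kill switch can be as simple as a check that every tool call must pass before it runs. The sketch below keeps the state in memory for illustration only; in practice the state would more likely live in a feature-flag service or shared config store so that it takes effect across all instances. `KillSwitch`, `ToolDisabledError`, and `guarded_call` are hypothetical names.

```python
class ToolDisabledError(RuntimeError):
    """Raised when a tool call is blocked by the kill switch."""


class KillSwitch:
    # In-memory state for illustration; assume a shared flag store in production.
    def __init__(self):
        self._disabled_tools = set()
        self._agent_halted = False

    def disable_tool(self, tool_name):
        self._disabled_tools.add(tool_name)

    def halt_agent(self):
        self._agent_halted = True

    def allows(self, tool_name):
        return not self._agent_halted and tool_name not in self._disabled_tools


def guarded_call(switch, tool_name, fn, *args, **kwargs):
    """Run fn only if the switch currently allows this tool."""
    if not switch.allows(tool_name):
        raise ToolDisabledError(f"tool '{tool_name}' is disabled")
    return fn(*args, **kwargs)
```

Routing every tool call through one guard means the mitigation step (disable a tool, or halt the agent entirely) is a state change rather than a redeploy.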

Runbooks

Runbooks make incidents less chaotic by giving the responder an ordered checklist to follow under pressure.

Example runbook for invalid JSON spike:

1. Check whether prompt version changed.
2. Check whether model version changed.
3. Inspect recent traces with invalid output.
4. Roll back prompt if issue started after release.
5. Enable repair fallback if safe.
6. Add representative invalid-output cases to eval suite.
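
Two steps in this runbook lend themselves to a short sketch: comparing the invalid-output rate across prompt versions, and a conservative repair fallback. The code below is one possible approach under stated assumptions. The repair heuristic (keep the text between the first `{` and the last `}`) is only an example, and `parse_agent_output` and `invalid_rate_by_prompt_version` are hypothetical helper names. Both functions assume the raw output is a string.

```python
import json
from collections import defaultdict


def parse_agent_output(raw):
    """Return (parsed, status), where status is 'valid', 'repaired', or 'invalid'."""
    try:
        return json.loads(raw), "valid"
    except json.JSONDecodeError:
        pass
    # Repair attempt: strip any prose wrapped around a JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(raw[start:end + 1]), "repaired"
        except json.JSONDecodeError:
            pass
    return None, "invalid"


def invalid_rate_by_prompt_version(records):
    """records: iterable of (prompt_version, raw_output) pairs.

    Counts strictly invalid output (before any repair) so that the
    fallback does not hide a regression introduced by a release.
    """
    totals = defaultdict(int)
    invalid = defaultdict(int)
    for version, raw in records:
        totals[version] += 1
        try:
            json.loads(raw)
        except json.JSONDecodeError:
            invalid[version] += 1
    return {version: invalid[version] / totals[version] for version in totals}
```

If the rate jumps for the newest prompt version only, that supports rolling back the prompt; if it rises across all versions at once, a model-side change is the more likely cause.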

Example runbook for tool misuse:

1. Disable high-impact tool through feature flag.
2. Inspect authorization logs.
3. Identify prompts or inputs that triggered misuse.
4. Verify whether action was executed or blocked.
5. Notify affected stakeholders if needed.
6. Add safety eval and permission test.

Sampling and review

Not every trace can be reviewed by a human. Use sampling.

Sample more heavily from:

High-risk workflows
Low-confidence outputs
User downvotes
New prompt versions
New model versions
Unusual tool traces
Escalations
Long or expensive traces

A reviewer can label traces and create new eval cases from failures.
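
The weighting described above can be sketched as a per-trace review probability. The trace keys (`high_risk`, `escalated`, `user_downvote`, `new_prompt_version`, `new_model_version`, `confidence`) and the specific probabilities are assumptions made for this example and would need tuning against real review capacity.

```python
import random


def review_probability(trace, base_rate=0.01):
    """Probability that a trace (a dict) is sent to human review."""
    # Always review the riskiest categories.
    if trace.get("high_risk") or trace.get("escalated") or trace.get("user_downvote"):
        return 1.0
    p = base_rate
    # Oversample traffic from freshly released prompts or models.
    if trace.get("new_prompt_version") or trace.get("new_model_version"):
        p = max(p, 0.25)
    # Oversample low-confidence outputs; missing confidence is treated as high.
    if trace.get("confidence", 1.0) < 0.5:
        p = max(p, 0.5)
    return p


def sample_for_review(traces, base_rate=0.01, rng=None):
    """Return the subset of traces selected for review."""
    if rng is None:
        rng = random.Random()
    return [t for t in traces if rng.random() < review_probability(t, base_rate)]
```

Passing a seeded `random.Random` as `rng` makes the selection reproducible, which helps when auditing why a given trace was or was not reviewed.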

Drift monitoring

AI behavior can drift even if your code does not change. Causes include:

Upstream model updates
Document corpus changes
User behavior changes
New adversarial patterns
Business policy changes
Retrieval index updates

Track metrics over time. Compare current behavior to historical baselines.
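
A baseline comparison can start as a relative-change check on a single metric, such as refusal rate. The sketch below is deliberately simple: the 20% tolerance is an arbitrary assumption, `drift_report` is a hypothetical helper, and a real system might prefer a statistical test over a fixed tolerance.

```python
from statistics import mean


def relative_change(current, baseline):
    """Fractional change of current relative to baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline


def drift_report(baseline_values, current_values, tolerance=0.2):
    """Compare the mean of recent values against the mean of a historical baseline."""
    base = mean(baseline_values)
    cur = mean(current_values)
    change = relative_change(cur, base)
    return {
        "baseline": base,
        "current": cur,
        "relative_change": change,
        # Flag moves in either direction; a sudden drop can matter as much as a rise.
        "drifted": abs(change) > tolerance,
    }
```

For example, a refusal rate that averaged 5% over the baseline period and now sits at 8% is a 60% relative increase, which this check would flag.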

Closing the improvement loop

Monitoring should feed back into development:

Production trace → failure analysis → eval case → fix → CI eval → deployment → monitoring

This loop is how production agents mature.
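
The "production trace → eval case" step in this loop might look like the following sketch. The trace keys (`trace_id`, `user_input`, `agent_output`) and the eval-case schema are hypothetical; adapt them to whatever your tracing and eval tooling actually store.

```python
import json


def trace_to_eval_case(trace, expected_behavior, tags=()):
    """Turn a reviewed production failure into a regression eval case."""
    return {
        "id": f"prod-{trace['trace_id']}",
        "input": trace["user_input"],
        # Keep the bad output for context; the eval asserts on expected_behavior.
        "observed_output": trace["agent_output"],
        "expected_behavior": expected_behavior,
        "tags": sorted(tags),
        "source": "production",
    }


def to_jsonl_line(case):
    """Serialize one eval case as a single JSON Lines record."""
    return json.dumps(case, sort_keys=True)
```

Writing cases in a line-oriented format keeps each new case a one-line addition to the suite, which is easy to review before it starts running in CI.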

Practical takeaway

Operating production agents requires monitoring, alerting, incident response, and continuous improvement. Evals catch known risks before launch. Monitoring catches unknown risks after launch.

A production AI system is never simply “done.” It is a living system that must be measured, updated, and governed over time.
