
Error Handling and Graceful Degradation
AGAI 401 · Observability and Reliability
Design agents that handle model failures, tool errors, retrieval misses, invalid outputs, and latency problems without collapsing user experience.
Key terms
- reliability = predictable failure handling
- retry safety depends on idempotency
- fallback path must be evaluated
- graceful degradation preserves trust
Learning objectives
- Classify common model, tool, retrieval, and orchestration failures.
- Design structured error responses and retry policies.
- Apply fallback strategies for models and retrieval failures.
- Write user-facing failure messages that preserve useful progress.
Production agents fail in many ways. The model may return invalid JSON. A tool may time out. Retrieval may find no relevant documents. A user may provide incomplete information. A model provider may be unavailable. A downstream API may rate-limit requests.
Reliability does not mean failures never happen. Reliability means the system handles failures predictably and safely.
Graceful degradation means the agent provides the best safe result it can when the ideal path fails.
Common failure types
Production agent failures include:
Model failures:
- invalid structured output
- refusal when not needed
- hallucinated claim
- excessive verbosity
- unsafe compliance
Tool failures:
- timeout
- permission denied
- invalid arguments
- service unavailable
- rate limit
Retrieval failures:
- no relevant documents
- stale documents
- unauthorized documents
- too much irrelevant context
Orchestration failures:
- loop limit reached
- wrong agent selected
- missing state
- repeated retries
Each failure type needs a recovery policy.
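One way to make "each failure type needs a recovery policy" concrete is a lookup table from failure name to a first recovery action. The sketch below is illustrative only: the failure names, the `Recovery` enum, and the specific pairings are assumptions rather than a prescribed policy, and a real system would tune them per tool and per risk level.

```python
from enum import Enum


class Recovery(Enum):
    RETRY = "retry"
    REPAIR = "repair_output"
    EXPLAIN = "explain_to_user"
    ESCALATE = "escalate_to_human"


# Hypothetical mapping from failure name to the first recovery action to try.
RECOVERY_POLICY = {
    "invalid_structured_output": Recovery.REPAIR,
    "tool_timeout": Recovery.RETRY,
    "rate_limit": Recovery.RETRY,
    "permission_denied": Recovery.EXPLAIN,
    "no_relevant_documents": Recovery.EXPLAIN,
    "loop_limit_reached": Recovery.ESCALATE,
}


def recovery_for(failure: str) -> Recovery:
    # Unknown failures take the most conservative path instead of a blind retry.
    return RECOVERY_POLICY.get(failure, Recovery.ESCALATE)
```

Defaulting unknown failures to escalation is a design choice: a failure nobody planned for is a poor candidate for automatic retries.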
Structured errors
Tools should return structured errors, not vague strings.
{
  "success": false,
  "error_code": "RATE_LIMITED",
  "message": "The shipping API is rate limited.",
  "retryable": true,
  "retry_after_seconds": 30
}
The agent or orchestration layer can use this information to decide whether to retry, fall back, or explain the issue.
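As a rough sketch of how an orchestration layer might consume such an error, the function below maps the structured fields to one of three actions. The function name `next_action`, the returned dictionary shape, and the default wait of one second are assumptions for illustration. Whether a retry is actually safe for a given tool is a separate check, discussed under retry policy.

```python
def next_action(error: dict, attempt: int, max_attempts: int = 3) -> dict:
    """Choose retry, fallback, or explain from a structured tool error."""
    if error.get("success", False):
        return {"action": "continue"}
    if error.get("retryable") and attempt < max_attempts:
        # Honor the tool's own hint about how long to wait, if it gave one.
        return {"action": "retry", "wait_seconds": error.get("retry_after_seconds", 1)}
    if error.get("retryable"):
        # Retry budget is spent: switch to a degraded path.
        return {"action": "fallback"}
    return {"action": "explain", "message": error.get("message", "The tool failed.")}


rate_limited = {
    "success": False,
    "error_code": "RATE_LIMITED",
    "message": "The shipping API is rate limited.",
    "retryable": True,
    "retry_after_seconds": 30,
}
decision = next_action(rate_limited, attempt=1)
```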
Retry policy
Retries should be deliberate. Retrying a read-only lookup after a timeout is usually safe. Retrying an email send may send duplicate emails unless the operation is idempotent.
Example retry logic:
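# Error codes that describe transient conditions worth retrying.
RETRYABLE = {"TIMEOUT", "RATE_LIMITED", "SERVICE_UNAVAILABLE"}


def should_retry(error, attempt, is_idempotent):
    # Stop once the retry budget is spent.
    if attempt >= 3:
        return False
    # Permanent errors (bad arguments, permission denied) will not fix themselves.
    if error["error_code"] not in RETRYABLE:
        return False
    # Repeating a non-idempotent action could duplicate its side effects.
    if not is_idempotent:
        return False
    return True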
Idempotency matters. An action tool should support idempotency keys where possible.
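To illustrate what an idempotency key buys you, here is a minimal in-memory sketch. The `EmailSender` class, its `send` signature, and the result shape are all hypothetical. A real tool would persist keys in durable storage and expire them, which this example leaves out.

```python
import uuid


class EmailSender:
    """Hypothetical action tool that deduplicates sends by idempotency key."""

    def __init__(self):
        self.sent = []       # messages actually delivered
        self._results = {}   # idempotency key -> stored result

    def send(self, to: str, body: str, idempotency_key: str) -> dict:
        # A repeated key returns the stored result instead of sending again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.sent.append((to, body))
        result = {"success": True, "message_id": len(self.sent)}
        self._results[idempotency_key] = result
        return result


sender = EmailSender()
key = str(uuid.uuid4())  # generated once per logical action, reused on every retry
first = sender.send("user@example.com", "Your refund was approved.", key)
retry = sender.send("user@example.com", "Your refund was approved.", key)
```

The key must be created by the caller before the first attempt. Generating a fresh key on each retry would defeat the deduplication.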
Fallback models
A fallback model can help when the primary model is unavailable, too slow, or too expensive for a degraded path.
Example strategy:
def call_with_fallback(messages):
    try:
        return call_model("frontier-model", messages, timeout=10)
    except TimeoutError:
        return call_model("smaller-fast-model", messages, timeout=5)
But fallback models may have different capabilities. Do not silently use a weaker model for high-risk tasks unless it has passed evals for that path.
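One possible way to enforce that rule is to gate the fallback on an allowlist of task types that have passed evals on the smaller model. In this sketch the allowlist contents, the `FallbackNotApproved` exception, and the injected `call_model(model_name, messages, timeout)` callable are assumptions made for illustration.

```python
# Hypothetical: task types for which the fallback model has passed evals.
FALLBACK_APPROVED_TASKS = {"summarize_ticket", "classify_intent"}


class FallbackNotApproved(Exception):
    """Raised when the primary model fails and no evaluated fallback exists."""


def call_with_gated_fallback(task_type, messages, call_model):
    """call_model is an injected callable: call_model(model_name, messages, timeout)."""
    try:
        output = call_model("frontier-model", messages, 10)
        return {"model": "frontier-model", "output": output, "degraded": False}
    except TimeoutError:
        # Never downgrade silently on a path the fallback was not evaluated for.
        if task_type not in FALLBACK_APPROVED_TASKS:
            raise FallbackNotApproved(task_type)
        output = call_model("smaller-fast-model", messages, 5)
        return {"model": "smaller-fast-model", "output": output, "degraded": True}
```

Returning the model name and a `degraded` flag lets downstream code and logs see that the answer came from the weaker path.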
Retrieval fallback
If retrieval fails, the agent should not invent an answer.
Bad:
I couldn't find the policy, but refunds are usually available after 30 days.
Better:
I could not retrieve the refund policy, so I cannot determine eligibility from approved sources. I can try again or escalate this to a support specialist.
For low-risk cases, the agent can provide general guidance while clearly labeling it as general.
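The branching described here could look roughly like the following. The function name, the `"low"`/`"high"` risk labels, and the response fields are hypothetical. The point of the sketch is only that the no-documents path never produces an ungrounded answer without a label.

```python
def answer_from_retrieval(documents, risk, general_guidance=None):
    """documents: retrieved text snippets; risk: "low" or "high" (hypothetical labels)."""
    if documents:
        return {"grounded": True, "text": f"Based on approved sources: {documents[0]}"}
    if risk == "low" and general_guidance:
        # Low-risk only: offer general guidance, clearly labeled as such.
        return {
            "grounded": False,
            "text": f"General guidance (not from approved sources): {general_guidance}",
        }
    # High-risk or nothing safe to say: decline and offer next steps.
    return {
        "grounded": False,
        "escalate": True,
        "text": (
            "I could not retrieve an approved source for this, so I cannot "
            "answer it reliably. I can try again or escalate to a specialist."
        ),
    }
```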
Structured output repair
If the model returns invalid JSON, you can attempt repair.
def repair_json(model, invalid_output, validation_error):
    prompt = f"""
The following output is invalid JSON.
Error: {validation_error}
Output:
{invalid_output}
Return corrected JSON only. Do not add explanation.
"""
    return model.generate(prompt)
Limit repair attempts. If the system cannot produce valid output after one or two repairs, fail safely.
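A bounded repair loop might be sketched as follows. The `repair` argument is an injected callable taking the invalid output and the validation error (for example, a wrapper around a repair prompt), and the failure result shape with `INVALID_OUTPUT` is an assumption, not a standard.

```python
import json


def parse_with_repair(raw, repair, max_repairs=2):
    """repair(invalid_output, validation_error) -> str is an injected callable."""
    candidate = raw
    for attempt in range(max_repairs + 1):
        try:
            return {"ok": True, "data": json.loads(candidate), "repairs": attempt}
        except json.JSONDecodeError as exc:
            if attempt == max_repairs:
                break  # repair budget spent; stop instead of looping forever
            candidate = repair(candidate, str(exc))
    # Fail safely with a structured error the orchestrator can act on.
    return {"ok": False, "error_code": "INVALID_OUTPUT", "repairs": max_repairs}
```

With `max_repairs=2` the output is parsed at most three times: once as received and once after each repair.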
User-facing graceful degradation
A user-facing failure message should include:
- What failed
- What partial progress was made
- Whether the user should retry
- What alternative path exists
Example:
I found the order, but the shipping-status service is temporarily unavailable. The order exists and was placed on May 29, but I cannot confirm its latest carrier scan right now. You can try again later, or I can create a support ticket draft with the information available.
This is better than a generic error.
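If the orchestration layer tracks those four pieces separately, assembling the message can be mechanical. The helper below is a small sketch under that assumption. The function name and the fixed retry sentences are illustrative, and a production system would likely have a model or template library phrase the final text.

```python
def failure_message(what_failed, progress, can_retry, alternative):
    """Join the four components of a user-facing failure message."""
    parts = [what_failed]
    if progress:
        parts.append(progress)
    parts.append("You can try again later." if can_retry else "Retrying is unlikely to help.")
    if alternative:
        parts.append(alternative)
    return " ".join(parts)


msg = failure_message(
    what_failed="The shipping-status service is temporarily unavailable.",
    progress="I found the order, placed on May 29.",
    can_retry=True,
    alternative="I can also draft a support ticket.",
)
```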
Circuit breakers
If a dependency is failing repeatedly, stop hammering it. Circuit breakers prevent cascading failures.
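class CircuitBreaker:
    def __init__(self, failure_threshold=5):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.open = False

    def record_failure(self):
        # Open the circuit once failures reach the threshold.
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.open = True

    def record_success(self):
        # A successful call resets the count and closes the circuit.
        self.failures = 0
        self.open = False

    def can_call(self):
        return not self.open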
In production, use a mature resilience library or infrastructure pattern.
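Mature implementations usually add a cooldown so that an open circuit eventually allows a trial call (often called the half-open state). The sketch below shows that idea with only the standard library. The class name, the `clock` parameter (injected to make testing easier), and the default values are assumptions, and it is deliberately simplified: it does not limit how many trial calls pass through once the cooldown has elapsed.

```python
import time


class CooldownCircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cooldown.
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def can_call(self):
        if self.opened_at is None:
            return True
        # After the cooldown, allow a trial call to probe the dependency.
        return self.clock() - self.opened_at >= self.cooldown_seconds
```

A failed trial call re-opens the circuit because the failure count is still at or above the threshold, while a successful one closes it through `record_success`.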
Practical takeaway
Production agents must be designed for failure. Use structured errors, safe retries, idempotency, fallback models, retrieval fallback, output repair, circuit breakers, and user-facing explanations.
The best agents do not pretend everything worked. They preserve trust by failing clearly, safely, and usefully.