Error Handling and Graceful Degradation

AGAI 401 · Observability and Reliability

Design agents that handle model failures, tool errors, retrieval misses, invalid outputs, and latency problems without collapsing the user experience.

Key terms

reliability = predictable failure handling
retry safety depends on idempotency
fallback path must be evaluated
graceful degradation preserves trust

Learning objectives

Classify common model, tool, retrieval, and orchestration failures.
Design structured error responses and retry policies.
Apply fallback strategies for model and retrieval failures.
Write user-facing failure messages that preserve useful progress.

Production agents fail in many ways. The model may return invalid JSON. A tool may time out. Retrieval may find no relevant documents. A user may provide incomplete information. A model provider may be unavailable. A downstream API may rate-limit requests.

Reliability does not mean failures never happen. Reliability means the system handles failures predictably and safely.

Graceful degradation means the agent provides the best safe result it can when the ideal path fails.

Common failure types

Production agent failures include:

Model failures:
- invalid structured output
- refusal when not needed
- hallucinated claim
- excessive verbosity
- unsafe compliance

Tool failures:
- timeout
- permission denied
- invalid arguments
- service unavailable
- rate limit

Retrieval failures:
- no relevant documents
- stale documents
- unauthorized documents
- too much irrelevant context

Orchestration failures:
- loop limit reached
- wrong agent selected
- missing state
- repeated retries

Each failure type needs a recovery policy.
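
One way to make "each failure type needs a recovery policy" concrete is a lookup table that the orchestration layer consults. The sketch below is illustrative only: the failure names, the `Recovery` enum, and the policy chosen for each failure are assumptions, not a prescribed mapping.

```python
from enum import Enum


class Recovery(Enum):
    RETRY = "retry"
    REPAIR = "repair"
    EXPLAIN = "explain"
    ESCALATE = "escalate"


# Hypothetical mapping from failure type to recovery policy.
RECOVERY_POLICY = {
    "invalid_structured_output": Recovery.REPAIR,
    "tool_timeout": Recovery.RETRY,
    "rate_limit": Recovery.RETRY,
    "permission_denied": Recovery.EXPLAIN,
    "no_relevant_documents": Recovery.EXPLAIN,
    "loop_limit_reached": Recovery.ESCALATE,
}


def recovery_for(failure_type: str) -> Recovery:
    # Unknown failures escalate instead of being retried blindly.
    return RECOVERY_POLICY.get(failure_type, Recovery.ESCALATE)
```

Defaulting unknown failures to escalation is a design choice here: a failure nobody classified is exactly the one that should not be retried automatically.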

Structured errors

Tools should return structured errors, not vague strings.

{
  "success": false,
  "error_code": "RATE_LIMITED",
  "message": "The shipping API is rate limited.",
  "retryable": true,
  "retry_after_seconds": 30
}

The agent or orchestration layer can use this information to decide whether to retry, fall back, or explain the issue.
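
A minimal sketch of that decision, assuming errors follow the shape shown above. The function name `next_action`, the `has_fallback` flag, and the returned action dictionaries are hypothetical and only illustrate the branching.

```python
def next_action(error, attempt, max_attempts=3, has_fallback=False):
    """Choose retry, fallback, or explain from a structured tool error."""
    if error.get("retryable") and attempt < max_attempts:
        # Honor the server's hint about how long to wait, if present.
        return {
            "action": "retry",
            "wait_seconds": error.get("retry_after_seconds", 0),
        }
    if has_fallback:
        return {"action": "fallback"}
    # Nothing left to try: surface the tool's own message to the user-facing layer.
    return {"action": "explain", "message": error.get("message", "The tool failed.")}
```

Because the error carries `retryable` and `retry_after_seconds` as fields, this logic needs no string matching on the message text.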

Retry policy

Retries should be deliberate. Retrying a read-only lookup after a timeout is usually safe. Retrying an email send may send duplicate emails unless the operation is idempotent.

Example retry logic:

RETRYABLE = {"TIMEOUT", "RATE_LIMITED", "SERVICE_UNAVAILABLE"}


def should_retry(error, attempt, is_idempotent):
    if attempt >= 3:
        return False
    if error["error_code"] not in RETRYABLE:
        return False
    if not is_idempotent:
        return False
    return True

Idempotency matters. An action tool should support idempotency keys where possible.
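
To illustrate what an idempotency key buys you, here is a small sketch of an in-memory email tool. `EmailTool` and its fields are hypothetical. A real service would persist the keys and expire them, which this example does not attempt.

```python
import uuid


class EmailTool:
    """Hypothetical email tool that deduplicates sends by idempotency key."""

    def __init__(self):
        self.sent = []       # emails actually delivered
        self._results = {}   # idempotency key -> result of the first send

    def send(self, to, body, idempotency_key):
        # A repeated key returns the stored result without sending again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.sent.append((to, body))
        result = {"success": True, "message_id": len(self.sent)}
        self._results[idempotency_key] = result
        return result


tool = EmailTool()
key = str(uuid.uuid4())  # generated once per logical action, reused on retry
first = tool.send("a@example.com", "Your order shipped.", key)
retry = tool.send("a@example.com", "Your order shipped.", key)
```

The caller generates the key once for the logical action and reuses it on every retry, so a timeout followed by a retry cannot produce a second email.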

Fallback models

A fallback model can help when the primary model is unavailable, too slow, or too expensive for a degraded path.

Example strategy:

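def call_with_fallback(messages):
    # Assumption: call_model raises TimeoutError when the timeout elapses
    # and ConnectionError when the provider is unreachable.
    try:
        return call_model("frontier-model", messages, timeout=10)
    except (TimeoutError, ConnectionError):
        # Degraded path: a smaller, faster model with a tighter timeout.
        return call_model("smaller-fast-model", messages, timeout=5)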

But fallback models may have different capabilities. Do not silently use a weaker model for high-risk tasks unless it has passed evals for that path.
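
One way to enforce that rule is to gate the fallback on an allowlist of task and model pairs that have passed evals. This is a sketch under assumptions: the task names, the `EVAL_APPROVED` registry, and the `FallbackNotAllowed` exception are all hypothetical.

```python
# Hypothetical registry of (task_type, model) pairs that have passed evals.
EVAL_APPROVED = {("faq_answer", "smaller-fast-model")}

# Hypothetical set of task types treated as high-risk.
HIGH_RISK_TASKS = {"refund_decision"}


class FallbackNotAllowed(Exception):
    """Raised when a fallback model is not approved for a high-risk task."""


def choose_fallback(task_type, fallback_model):
    approved = (task_type, fallback_model) in EVAL_APPROVED
    if task_type in HIGH_RISK_TASKS and not approved:
        # Fail loudly so the caller can escalate instead of degrading silently.
        raise FallbackNotAllowed(
            f"{fallback_model} has not passed evals for {task_type}"
        )
    return fallback_model
```

Raising an exception, instead of returning the weaker model, forces the caller to pick an explicit degraded path such as escalation.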

Retrieval fallback

If retrieval fails, the agent should not invent an answer.

Bad:

I couldn't find the policy, but refunds are usually available after 30 days.

Better:

I could not retrieve the refund policy, so I cannot determine eligibility from approved sources. I can try again or escalate this to a support specialist.

For low-risk cases, the agent can provide general guidance while clearly labeling it as general.
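
The branching described in this section can be sketched as follows. The function `respond`, the `risk_level` values, and the `grounded` flag are hypothetical names chosen for illustration. The refusal text reuses the "Better" example above.

```python
NO_POLICY_MESSAGE = (
    "I could not retrieve the refund policy, so I cannot determine eligibility "
    "from approved sources. I can try again or escalate this to a support specialist."
)


def respond(documents, risk_level, general_guidance=None):
    """Answer from retrieved documents, or degrade without inventing facts."""
    if documents:
        return {"grounded": True, "text": documents[0]}
    if risk_level == "low" and general_guidance:
        # Low-risk only: general guidance, clearly labeled as such.
        return {
            "grounded": False,
            "text": "General guidance, not from approved sources: " + general_guidance,
        }
    return {"grounded": False, "text": NO_POLICY_MESSAGE}
```

The `grounded` flag lets downstream code, logging, or evals distinguish answers backed by retrieved sources from degraded ones.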

Structured output repair

If the model returns invalid JSON, you can attempt repair.

def repair_json(model, invalid_output, validation_error):
    prompt = f"""
    The following output is invalid JSON.
    Error: {validation_error}

    Output:
    {invalid_output}

    Return corrected JSON only. Do not add explanation.
    """
    return model.generate(prompt)

Limit repair attempts. If the system cannot produce valid output after one or two repairs, fail safely.
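
A bounded repair loop might look like the sketch below. It assumes a caller-supplied `repair_fn(invalid_output, validation_error)` that returns a new candidate string, for example a wrapper around a model call. The name `parse_with_repair` and the shape of the returned dictionaries are hypothetical.

```python
import json


def parse_with_repair(raw_output, repair_fn, max_repairs=2):
    """Parse JSON, asking repair_fn for a fix at most max_repairs times."""
    candidate = raw_output
    for attempt in range(max_repairs + 1):
        try:
            data = json.loads(candidate)
            return {"success": True, "data": data, "repairs": attempt}
        except json.JSONDecodeError as exc:
            if attempt == max_repairs:
                break  # repair budget exhausted
            candidate = repair_fn(candidate, str(exc))
    # Fail safely with a structured, non-retryable error.
    return {"success": False, "error_code": "INVALID_OUTPUT", "retryable": False}
```

The repaired text is always re-validated with `json.loads`, so a repair that is itself invalid counts against the budget and is never trusted.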

User-facing graceful degradation

A user-facing failure message should include:

What failed
What partial progress was made
Whether the user should retry
What alternative path exists

Example:

I found the order, but the shipping-status service is temporarily unavailable. The order exists and was placed on May 29, but I cannot confirm its latest carrier scan right now. You can try again later, or I can create a support ticket draft with the information available.

This is better than a generic error because the user learns what is known, what is not, and what to do next.
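
To keep such messages consistent, the four parts can be carried in a small structure and joined at the end. This is only a sketch: `FailureReport` and its field names are hypothetical, and the sample text is adapted from the example above.

```python
from dataclasses import dataclass


@dataclass
class FailureReport:
    what_failed: str
    partial_progress: str
    retry_advice: str
    alternative: str

    def to_message(self) -> str:
        # Join the four parts in the order a user needs them.
        parts = [
            self.what_failed,
            self.partial_progress,
            self.retry_advice,
            self.alternative,
        ]
        return " ".join(parts)


report = FailureReport(
    what_failed="The shipping-status service is temporarily unavailable.",
    partial_progress="I found the order, which was placed on May 29.",
    retry_advice="You can try again later.",
    alternative="I can also create a support ticket draft with the information available.",
)
message = report.to_message()
```

Making all four fields required means a tool author cannot produce a message that omits partial progress or the alternative path.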

Circuit breakers

If a dependency is failing repeatedly, stop hammering it. Circuit breakers prevent cascading failures.

class CircuitBreaker:
    def __init__(self, failure_threshold=5):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.open = False

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.open = True

    def can_call(self):
        return not self.open

The sketch above omits recovery: once open, the breaker never closes again. In production, use a mature resilience library or infrastructure pattern.
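
A breaker also needs a way to close again once the dependency recovers. The following is a minimal sketch, assuming a fixed cool-down after which a trial call is allowed (often called the half-open state). The class name, the `cooldown_seconds` parameter, and the injectable `clock` are assumptions for illustration, not a replacement for a tested library.

```python
import time


class CooldownCircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock      # injectable so tests can control time
        self.failures = 0
        self.opened_at = None   # None means the breaker is closed

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open the breaker, or re-open it after a failed trial call.
            self.opened_at = self.clock()

    def record_success(self):
        # Any success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None

    def can_call(self):
        if self.opened_at is None:
            return True
        # After the cool-down, allow a trial call to probe the dependency.
        return self.clock() - self.opened_at >= self.cooldown_seconds
```

If the trial call fails, `record_failure` restarts the cool-down. If it succeeds, `record_success` closes the breaker. This version is not thread-safe and lets every caller through once the cool-down has passed.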

Practical takeaway

Production agents must be designed for failure. Use structured errors, safe retries, idempotency, fallback models, retrieval fallback, output repair, circuit breakers, and user-facing explanations.

The best agents do not pretend everything worked. They preserve trust by failing clearly, safely, and usefully.

Up next · Module 3

Deployment, Operations, and Optimization

Move agent systems into production with cost controls, latency budgets, CI/CD, monitoring, alerting, and incident response. This module focuses on the operational practices required to keep AI systems reliable and maintainable after launch.
