
Error Handling and Retries

AGAI 201 · Production Tool Use: Reliability, Security, and Evaluation

Learn how to classify tool failures, design retry policies, return useful error messages, and help agents recover gracefully from broken workflows.

Key terms

  • error type → recovery strategy
  • retry only safe operations
  • idempotency determines retry risk
  • graceful failure builds trust

Learning objectives

  • Classify common tool failure types.
  • Design structured error responses for agents.
  • Implement safe retry policies using idempotency.
  • Create tests for failure and recovery behavior.

Tool calls fail. APIs time out, databases return no records, permissions are missing, rate limits are exceeded, arguments are malformed, and external services change behavior. A production agent must expect failure and handle it gracefully.

Error handling is not just a backend concern. Tool errors become part of the agent’s reasoning context. If the error is vague, the model may retry blindly or produce a bad answer. If the error is structured and informative, the model and application can choose a sensible next step.

A robust tool-using system distinguishes between different types of failure.

Categories of tool errors

Common error categories include:

Validation errors occur when the model supplies invalid arguments. Example: a missing order_id, an invalid enum value, or a malformed date.

Not found errors occur when the requested resource does not exist. Example: no customer exists with that ID.

Permission errors occur when the user or agent is not allowed to perform the action.

Rate limit errors occur when a service receives too many requests.

Timeout errors occur when a tool does not respond quickly enough.

Dependency errors occur when an external API, database, or service is unavailable.

Business rule errors occur when the request is valid but not allowed under policy. Example: a refund window has expired.

Each category should produce a different recovery strategy.
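As a rough illustration, the categories above can be mapped to default recovery strategies in a lookup table. The strategy names here are hypothetical labels chosen for this sketch, not part of any library, and a real system would likely refine them per tool.

```python
from enum import Enum


class ErrorCategory(Enum):
    VALIDATION = "validation"
    NOT_FOUND = "not_found"
    PERMISSION = "permission"
    RATE_LIMIT = "rate_limit"
    TIMEOUT = "timeout"
    DEPENDENCY = "dependency"
    BUSINESS_RULE = "business_rule"


# Hypothetical strategy labels (assumptions for illustration only).
RECOVERY_STRATEGY = {
    ErrorCategory.VALIDATION: "repair_arguments_or_ask_user",
    ErrorCategory.NOT_FOUND: "ask_user_to_verify_identifier",
    ErrorCategory.PERMISSION: "offer_allowed_alternative",
    ErrorCategory.RATE_LIMIT: "retry_with_backoff",
    ErrorCategory.TIMEOUT: "retry_if_idempotent",
    ErrorCategory.DEPENDENCY: "retry_then_degrade_gracefully",
    ErrorCategory.BUSINESS_RULE: "explain_policy_to_user",
}


def recovery_for(category: ErrorCategory) -> str:
    """Look up the default recovery strategy for an error category."""
    return RECOVERY_STRATEGY[category]
```

Keeping the mapping in one table makes it easy to check that every category has a deliberate strategy rather than falling through to a generic retry.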

Structured error results

Do not return raw stack traces to the model. Return structured, safe, actionable errors.

Example:

{
  "success": false,
  "error_code": "INVALID_DATE_FORMAT",
  "message": "The date must be in YYYY-MM-DD format.",
  "retryable": true,
  "field": "start_date"
}

For a permission error:

{
  "success": false,
  "error_code": "PERMISSION_DENIED",
  "message": "The current user is not allowed to approve refunds.",
  "retryable": false,
  "allowed_alternative": "create_refund_request_draft"
}

The second result helps the agent recover by offering an allowed alternative.
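One way to produce results like these is to convert exceptions into structured errors at the tool boundary. The sketch below assumes a hypothetical ToolError exception class and an INTERNAL_ERROR fallback code, neither of which comes from the lesson. Unexpected exceptions are logged for developers, while the model only sees a generic, safe message.

```python
import logging

logger = logging.getLogger("tools")


class ToolError(Exception):
    """Hypothetical exception carrying fields that are safe to show the model."""

    def __init__(self, error_code: str, message: str, retryable: bool, **extra):
        super().__init__(message)
        self.error_code = error_code
        self.message = message
        self.retryable = retryable
        self.extra = extra  # e.g. field="start_date" or allowed_alternative="..."


def to_error_result(exc: Exception) -> dict:
    """Convert an exception into a structured error result for the model."""
    if isinstance(exc, ToolError):
        return {
            "success": False,
            "error_code": exc.error_code,
            "message": exc.message,
            "retryable": exc.retryable,
            **exc.extra,
        }
    # Unknown failure: keep the traceback in the logs, never in the result.
    logger.error("Unexpected tool failure", exc_info=exc)
    return {
        "success": False,
        "error_code": "INTERNAL_ERROR",  # assumed fallback code
        "message": "The tool failed unexpectedly.",
        "retryable": False,
    }
```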

Retry policies

Retries help with temporary failures, but they are wasteful for permanent failures and dangerous when the operation has side effects. A retry policy should consider error type, number of attempts, delay, and whether the tool is idempotent.

An idempotent operation has the same effect whether it runs once or several times, so it can be repeated safely. Reading an order status is usually idempotent. Sending an email is not, unless the system deduplicates requests.
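Deduplication is commonly implemented with an idempotency key: the caller attaches the same key to every retry of one logical request, and the tool returns the stored result instead of repeating the side effect. The following is a minimal in-memory sketch. The EmailSender class and its fields are hypothetical, and a production system would persist the keys in shared storage.

```python
class EmailSender:
    """Hypothetical email tool that deduplicates by idempotency key."""

    def __init__(self):
        self._results = {}       # idempotency_key -> result of the first send
        self.delivery_count = 0  # how many emails were actually delivered

    def send(self, idempotency_key: str, to: str, body: str) -> dict:
        # A repeated key means this is a retry: return the original result.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.delivery_count += 1  # stands in for the real delivery call
        result = {"success": True, "message_id": f"msg-{self.delivery_count}"}
        self._results[idempotency_key] = result
        return result
```

With this in place, retrying a send after a timeout cannot deliver the same message twice as long as the retry reuses the original key.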

A simple retry policy:

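# Error codes that usually indicate a temporary failure.
RETRYABLE_ERRORS = {"TIMEOUT", "RATE_LIMIT", "SERVICE_UNAVAILABLE"}

MAX_RETRIES = 3

def should_retry(error_code: str, attempt: int, idempotent: bool) -> bool:
    """Decide whether a failed tool call may be retried."""
    if attempt >= MAX_RETRIES:
        return False  # attempt budget is used up
    if error_code not in RETRYABLE_ERRORS:
        return False  # permanent failure: retrying will not help
    if not idempotent:
        return False  # repeating the call could duplicate side effects
    return True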

For rate limits, use exponential backoff:

Attempt 1: wait 1 second
Attempt 2: wait 2 seconds
Attempt 3: wait 4 seconds
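The schedule above doubles the wait after each attempt. A minimal sketch of that calculation follows. The cap and the optional random jitter are additions assumed here, not part of the lesson: a cap keeps waits bounded, and jitter is a common way to stop many clients from retrying at the same instant.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  jitter: bool = False) -> float:
    """Seconds to wait before the given retry attempt (1-indexed)."""
    delay = min(cap, base * 2 ** (attempt - 1))
    if jitter:
        # Assumed addition: pick a random wait between 0 and the full delay.
        delay = random.uniform(0, delay)
    return delay


# Attempts 1, 2, 3 give waits of 1, 2, and 4 seconds without jitter.
delays = [backoff_delay(n) for n in (1, 2, 3)]
```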

In interactive applications, long retries may frustrate users. Sometimes it is better to report the temporary issue and suggest trying again.

Letting the model repair arguments

If a tool call fails due to invalid arguments, the model may be able to repair the call.

Example error:

{
  "success": false,
  "error_code": "MISSING_REQUIRED_FIELD",
  "message": "The field 'timezone' is required.",
  "retryable": true,
  "field": "timezone"
}

If the user’s location or timezone is available in context, the model can retry with the missing field. If not, it should ask the user.

A good system instruction might say:

If a tool returns a validation error, retry only when the missing or invalid value can be confidently inferred from context. Otherwise, ask the user for the missing information.

This prevents reckless guessing.
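The same rule can also be enforced in application code rather than left entirely to the prompt. The sketch below assumes a hypothetical context dictionary of values the application already knows about the user. It retries only when the missing field is present there, and otherwise asks the user.

```python
def next_step(error: dict, context: dict) -> dict:
    """Choose between repairing the call and asking the user.

    `context` is a hypothetical dict of known user facts, for example
    {"timezone": "Europe/Berlin"}.
    """
    field = error.get("field")
    if (
        error.get("error_code") == "MISSING_REQUIRED_FIELD"
        and field is not None
        and context.get(field) is not None
    ):
        # The value is known, so the call can be repaired without guessing.
        return {"action": "retry", "extra_args": {field: context[field]}}
    return {"action": "ask_user", "field": field}
```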

Graceful degradation

Graceful degradation means the agent provides useful partial help when it cannot fully complete the task.

Suppose a research agent can search documents but the fetch tool is temporarily unavailable. A poor response would be:

I can't do that.

A better response:

I found three likely relevant documents, but I could not fetch their full text because the document service timed out. Based on the search snippets, the most relevant result appears to be the API Authentication Guide. I can try again, or you can provide the document text directly.

The user receives partial progress and a clear explanation.
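In code, this pattern amounts to keeping the results of the steps that succeeded and reporting which step failed. The sketch below assumes hypothetical search and fetch callables passed in by the caller, and assumes the fetch tool signals a timeout by raising Python's built-in TimeoutError.

```python
from typing import Callable


def research(query: str,
             search: Callable[[str], list],
             fetch: Callable[[str], str]) -> dict:
    """Search, then fetch full text, keeping partial progress on failure."""
    hits = search(query)
    documents = []
    for hit in hits:
        try:
            documents.append(fetch(hit["id"]))
        except TimeoutError:
            # Degrade gracefully: return search snippets instead of nothing.
            return {
                "complete": False,
                "failed_step": "fetch",
                "reason": "The document service timed out.",
                "partial": [
                    {"title": h["title"], "snippet": h["snippet"]} for h in hits
                ],
                "next_options": ["retry_fetch", "ask_user_for_document_text"],
            }
    return {"complete": True, "documents": documents}
```

The partial result gives the model enough material to write a response like the better one above, including what failed and what the user can do next.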

Error handling in agent prompts

System prompts should define error behavior. Example:

When a tool fails:
- Read the error_code and retryable fields.
- Retry only if the error is retryable and the operation is safe to repeat.
- Do not invent missing tool results.
- If the task cannot be completed, explain what failed and what partial progress was made.
- If a safer alternative is provided, offer it to the user.

This gives the model a policy for failure.

Avoiding infinite loops

Agents can get stuck. They may retry the same failing tool call, alternate between tools, or keep searching without reaching an answer.

Prevent this with:

  • Maximum tool-call count
  • Maximum retry count per tool
  • Duplicate-call detection
  • Time limits
  • Required escalation after repeated failure
  • State tracking of previous attempts

Example:

{
  "max_total_tool_calls": 10,
  "max_retries_per_tool_call": 2,
  "stop_if_same_error_repeats": true,
  "require_user_confirmation_after_failure": true
}
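A configuration like this only helps if the agent loop enforces it. The following sketch shows one possible guard covering the first two settings plus duplicate-call detection. The LoopGuard class is hypothetical, and it treats two calls as duplicates when the tool name and JSON-serialized arguments match.

```python
import json
from collections import Counter


class LoopGuard:
    """Hypothetical guard enforcing call budgets in an agent loop."""

    def __init__(self, max_total_tool_calls: int = 10,
                 max_retries_per_tool_call: int = 2):
        self.max_total = max_total_tool_calls
        self.max_retries = max_retries_per_tool_call
        self.total = 0
        self.attempts = Counter()  # (tool, serialized args) -> calls made

    def allow(self, tool_name: str, args: dict) -> bool:
        """Return True and record the call if it is within budget."""
        key = (tool_name, json.dumps(args, sort_keys=True))
        if self.total >= self.max_total:
            return False  # overall tool-call budget exhausted
        if self.attempts[key] > self.max_retries:
            return False  # first call plus allowed retries already used
        self.total += 1
        self.attempts[key] += 1
        return True
```

When allow returns False, the loop should stop calling tools and escalate, for example by summarizing what was tried and asking the user how to proceed.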

Testing failure modes

Do not test only successful tool calls. Create failure tests:

[
  {
    "case": "missing order id",
    "expected": "ask user for order ID"
  },
  {
    "case": "order not found",
    "expected": "explain not found and ask user to verify ID"
  },
  {
    "case": "shipping API timeout",
    "expected": "retry once, then report temporary issue"
  },
  {
    "case": "refund permission denied",
    "expected": "offer to create refund request draft"
  }
]

Failure testing reveals whether your agent is robust or merely impressive in happy-path demos.
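Cases like these can be run as a table-driven test. The sketch below tests a simplified, hypothetical decide function standing in for the agent's recovery policy. In a real evaluation, the function under test would be the agent itself, run against tools stubbed to return each error.

```python
def decide(error: dict, attempt: int) -> str:
    """Hypothetical recovery policy; `attempt` counts retries already made."""
    code = error["error_code"]
    if code == "MISSING_REQUIRED_FIELD":
        return "ask_user"
    if code == "NOT_FOUND":
        return "ask_user_to_verify"
    if code == "TIMEOUT":
        return "retry" if attempt < 1 else "report_temporary_issue"
    if code == "PERMISSION_DENIED" and "allowed_alternative" in error:
        return "offer_alternative"
    return "explain_failure"


# (case name, injected error, retries already made, expected action)
FAILURE_CASES = [
    ("missing order id",
     {"error_code": "MISSING_REQUIRED_FIELD", "field": "order_id"}, 0, "ask_user"),
    ("order not found", {"error_code": "NOT_FOUND"}, 0, "ask_user_to_verify"),
    ("shipping API timeout, first failure", {"error_code": "TIMEOUT"}, 0, "retry"),
    ("shipping API timeout, after one retry",
     {"error_code": "TIMEOUT"}, 1, "report_temporary_issue"),
    ("refund permission denied",
     {"error_code": "PERMISSION_DENIED",
      "allowed_alternative": "create_refund_request_draft"}, 0, "offer_alternative"),
]


def run_failure_cases() -> list:
    """Return the names of the cases whose outcome differs from expected."""
    return [
        name
        for name, error, attempt, expected in FAILURE_CASES
        if decide(error, attempt) != expected
    ]
```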

Practical takeaway

Production agents must treat errors as normal events, not exceptional surprises. Good error handling requires structured error results, retry policies, idempotency awareness, graceful degradation, and explicit stopping conditions.

A reliable agent does not always succeed. It knows how to fail safely, explain clearly, and preserve useful progress.
