Cost and Latency Optimization

AGAI 401 · Deployment, Operations, and Optimization

Learn how to measure, budget, and optimize token cost, tool cost, model-call count, and latency for production agents.

Key terms

cost = input tokens + output tokens + tool usage
cost per successful task > cost per request
p50 latency ≠ p99 latency
budget limits control runaway agents

Learning objectives

  • Identify major cost and latency drivers in production agents.
  • Calculate approximate token-based LLM cost.
  • Use latency percentiles to understand user experience.
  • Apply routing, caching, prompt reduction, and budgets to optimize systems.

Production agents can become expensive and slow if cost and latency are not designed into the architecture. A prototype may work well for ten internal users, then become unusable when traffic grows or when agent loops call models and tools repeatedly.

Cost and latency are product quality attributes. Users notice slow responses. Teams notice unpredictable bills.

Understand cost drivers

Common cost drivers include:

Input tokens
Output tokens
Number of model calls
Model choice
Embedding calls
Vector database queries
Reranking calls
Tool/API usage
Retries
Reflection or critique loops
Multi-agent parallel calls
Long conversation history

Agentic systems often cost more than simple chat because they may perform multiple model calls per user request.

Basic token cost calculation

A simple cost estimator:

def estimate_llm_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Estimate the USD cost of one model call from token counts and per-million-token prices."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_million
    output_cost = (output_tokens / 1_000_000) * output_price_per_million
    return input_cost + output_cost

# The prices below are placeholders, not real provider rates.
cost = estimate_llm_cost(
    input_tokens=12000,
    output_tokens=900,
    input_price_per_million=0.50,
    output_price_per_million=1.50,
)
print(f"Estimated cost: ${cost:.4f}")

Use current provider pricing in your actual implementation. Prices change, so keep pricing data configurable.
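Because an agent may make several model calls per user request, the per-call estimate usually needs to be summed over the whole request. The following is a minimal sketch, assuming a hypothetical price table (`PRICES_PER_MILLION`) and a hypothetical list of call records. The model names and prices are placeholders, not real provider rates.

```python
# Hypothetical price table in USD per million tokens.
# Replace with your provider's current rates and load it from configuration.
PRICES_PER_MILLION = {
    "small-fast-model": {"input": 0.50, "output": 1.50},
    "stronger-model": {"input": 5.00, "output": 15.00},
}

def estimate_call_cost(model, input_tokens, output_tokens, prices=PRICES_PER_MILLION):
    """Estimate the USD cost of a single model call."""
    price = prices[model]
    input_cost = (input_tokens / 1_000_000) * price["input"]
    output_cost = (output_tokens / 1_000_000) * price["output"]
    return input_cost + output_cost

def estimate_request_cost(calls, prices=PRICES_PER_MILLION):
    """Sum the cost of every model call made while serving one user request."""
    return sum(
        estimate_call_cost(c["model"], c["input_tokens"], c["output_tokens"], prices)
        for c in calls
    )

# One user request that triggered three model calls (illustrative numbers).
calls = [
    {"model": "small-fast-model", "input_tokens": 2_000, "output_tokens": 100},  # routing step
    {"model": "stronger-model", "input_tokens": 12_000, "output_tokens": 900},   # main answer
    {"model": "stronger-model", "input_tokens": 13_000, "output_tokens": 300},   # critique pass
]
request_cost = estimate_request_cost(calls)
print(f"Estimated request cost: ${request_cost:.4f}")
```

Keeping the price table separate from the arithmetic means a price change only requires a configuration update.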

Cost per successful task

Cost per request is useful, but cost per successful task is a better measure because it also accounts for failures, retries, and human correction.

cost per successful task = total cost / successful completed tasks

If a cheaper model fails often and requires retries or human correction, it may be more expensive in practice than a stronger model.
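A small worked comparison makes this concrete. The sketch below uses made-up numbers: the per-attempt prices, failure counts, and human-correction cost are all assumptions chosen for illustration, and it assumes every failed task is eventually fixed by a human.

```python
def cost_per_successful_task(total_cost_usd, successful_tasks):
    """Total spend divided by the number of tasks that actually succeeded."""
    if successful_tasks == 0:
        return float("inf")  # nothing succeeded, so no finite cost per success
    return total_cost_usd / successful_tasks

TASKS = 1_000
HUMAN_CORRECTION_USD = 0.05  # assumed cost of fixing one failed task by hand

# Cheaper model: $0.002 per attempt, but 400 of 1,000 tasks need human correction.
cheap_total = TASKS * 0.002 + 400 * HUMAN_CORRECTION_USD
cheap = cost_per_successful_task(cheap_total, TASKS)

# Stronger model: $0.010 per attempt, and only 50 tasks need human correction.
strong_total = TASKS * 0.010 + 50 * HUMAN_CORRECTION_USD
strong = cost_per_successful_task(strong_total, TASKS)

print(f"cheap model:    ${cheap:.4f} per successful task")
print(f"stronger model: ${strong:.4f} per successful task")
```

Under these assumptions the cheaper model costs five times less per request but more per successful task, because correction costs dominate its total.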

Track:

Average cost per request
Average cost per successful task
Cost by route or feature
Cost by customer or tenant
Cost by model version
Cost caused by retries
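These breakdowns can be computed from per-call trace records. The sketch below assumes a hypothetical record shape (`route`, `tenant`, `model`, `is_retry`, `cost_usd`). Real tracing systems will use their own field names.

```python
from collections import defaultdict

def cost_by(records, dimension):
    """Sum cost_usd across trace records, grouped by one field such as 'route' or 'tenant'."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["cost_usd"]
    return dict(totals)

# Illustrative trace records, one per model call.
records = [
    {"route": "search", "tenant": "acme", "model": "small-fast-model", "is_retry": False, "cost_usd": 0.002},
    {"route": "search", "tenant": "acme", "model": "small-fast-model", "is_retry": True, "cost_usd": 0.002},
    {"route": "report", "tenant": "globex", "model": "stronger-model", "is_retry": False, "cost_usd": 0.040},
]

print(cost_by(records, "route"))
print(cost_by(records, "tenant"))

# Cost caused by retries is the sum over records flagged as retries.
retry_cost = sum(r["cost_usd"] for r in records if r["is_retry"])
print(f"retry cost: ${retry_cost:.4f}")
```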

Latency percentiles

Average latency is not enough. Track percentiles.

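p50 latency: the median; half of requests finish faster than this
p95 latency: the slowest 5% of requests take at least this long
p99 latency: tail behavior; the slowest 1% of requests take at least this long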

A system with good average latency may still have terrible p99 latency due to slow tools, retries, or large contexts.
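The gap between the average and the tail is easy to see with a small calculation. The sketch below uses the nearest-rank percentile definition (one of several common definitions) on made-up latency samples.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# 100 illustrative request latencies in milliseconds:
# most are fast, a few hit a slow tool, and a few hit retries.
latencies_ms = [800] * 90 + [2_500] * 7 + [12_000] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean_ms:.0f} ms")
print(f"p50:  {percentile(latencies_ms, 50)} ms")
print(f"p95:  {percentile(latencies_ms, 95)} ms")
print(f"p99:  {percentile(latencies_ms, 99)} ms")
```

In this sample the mean is 1,255 ms and the p50 is 800 ms, while the p99 is 12,000 ms. The average hides the fact that some users wait more than ten times longer than the typical user.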

Important phrase:

p50 latency ≠ p99 latency

Reducing latency

Strategies include:

  • Use smaller models for simple tasks.
  • Route by task complexity.
  • Reduce prompt size.
  • Retrieve fewer but better chunks.
  • Cache frequent retrieval results.
  • Run independent tool calls in parallel.
  • Stream responses when appropriate.
  • Avoid unnecessary reflection loops.
  • Set timeouts on slow tools.
  • Precompute embeddings and summaries.
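Two of these strategies, running independent tool calls in parallel and setting timeouts on slow tools, can be sketched with Python's standard `asyncio` module. The tool names and delays are hypothetical, and `call_tool` is a stand-in that sleeps instead of calling a real service.

```python
import asyncio

async def call_tool(name, delay_seconds):
    """Stand-in for a real tool call; sleeps to simulate network latency."""
    await asyncio.sleep(delay_seconds)
    return f"{name}: ok"

async def call_with_timeout(name, delay_seconds, timeout_seconds):
    """Run one tool call, returning a fallback string if it exceeds its timeout."""
    try:
        return await asyncio.wait_for(call_tool(name, delay_seconds), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return f"{name}: timed out"

async def run_tools_in_parallel():
    # gather() runs the independent calls concurrently and preserves argument order.
    return await asyncio.gather(
        call_with_timeout("inventory", 0.05, timeout_seconds=0.5),
        call_with_timeout("pricing", 0.05, timeout_seconds=0.5),
        call_with_timeout("slow_search", 2.0, timeout_seconds=0.2),
    )

results = asyncio.run(run_tools_in_parallel())
print(results)
```

Total wall-clock time is bounded by the slowest call or its timeout rather than by the sum of all calls. Whether a timed-out tool should produce a fallback, a retry, or an error depends on the product.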

Example task router:

def choose_model(task):
    if task.type == "classification" and task.risk == "low":
        return "small-fast-model"
    if task.type == "high_risk_policy" or task.requires_reasoning:
        return "stronger-model"
    return "default-model"

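When tasks are classified accurately, routing can reduce cost and latency with little or no loss in quality, because only the requests that need a stronger model pay for one.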
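The cost effect of routing can be estimated as a traffic-weighted average. The sketch below assumes hypothetical per-request costs and traffic shares for the three model names used in the router. None of these numbers come from a real system.

```python
def blended_cost_per_request(traffic_share, cost_per_request):
    """Traffic-weighted average cost per request across models."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1.0")
    return sum(share * cost_per_request[model] for model, share in traffic_share.items())

# Assumed average cost per request for each model (USD).
cost_per_request = {
    "small-fast-model": 0.001,
    "default-model": 0.004,
    "stronger-model": 0.020,
}

# Baseline: every request goes to the strongest model.
all_strong = blended_cost_per_request({"stronger-model": 1.0}, cost_per_request)

# With routing: most traffic is simple enough for cheaper models.
routed = blended_cost_per_request(
    {"small-fast-model": 0.6, "default-model": 0.3, "stronger-model": 0.1},
    cost_per_request,
)

print(f"all requests on stronger model: ${all_strong:.4f} per request")
print(f"with routing:                   ${routed:.4f} per request")
```

This estimate ignores quality. It should be read alongside cost per successful task, since a router that sends hard tasks to a weak model moves cost into retries and corrections.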

Caching

Caching can help when inputs repeat or retrieval results are stable.

Cache candidates:

Embeddings for documents
Retrieval results for common queries
Policy document summaries
Tool results with safe TTLs
Model outputs for deterministic low-risk tasks

Example TTL cache:

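import time

# In-memory cache: key -> (value, time the entry was stored)
cache = {}

def get_cached(key, ttl_seconds):
    """Return the cached value, or None if it is missing or older than ttl_seconds."""
    item = cache.get(key)
    if item is None:
        return None
    value, created_at = item
    # time.monotonic() is unaffected by system clock adjustments
    if time.monotonic() - created_at > ttl_seconds:
        del cache[key]  # evict the expired entry
        return None
    return value

def set_cached(key, value):
    """Store a value together with the time it was cached."""
    cache[key] = (value, time.monotonic())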

Do not cache sensitive or user-specific outputs without careful controls.
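One such control is to scope cache keys so that an entry cached for one tenant can never be served to another. The sketch below is one possible approach, not a complete privacy solution. The `make_cache_key` helper, its `namespace` parameter, and the key layout are assumptions made for illustration.

```python
import hashlib

def make_cache_key(tenant_id, query, namespace="retrieval"):
    """Build a cache key that is scoped to one tenant.

    The query is normalized (lowercased, whitespace collapsed) so trivially
    different phrasings share an entry, then hashed so raw user text is not
    stored in the key itself.
    """
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{namespace}:{tenant_id}:{digest}"

key_a = make_cache_key("tenant-a", "Refund policy")
key_a_again = make_cache_key("tenant-a", "  refund   POLICY ")
key_b = make_cache_key("tenant-b", "Refund policy")

print(key_a == key_a_again)  # same tenant, equivalent query
print(key_a == key_b)        # different tenant
```

Here the first comparison is true and the second is false, so equivalent queries from the same tenant share an entry while other tenants never do. Shared, non-sensitive data such as public policy documents could use a common tenant value instead.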

Budget limits

Agents should have budgets:

{
  "max_model_calls": 6,
  "max_tool_calls": 10,
  "max_input_tokens": 50000,
  "max_cost_usd": 0.25,
  "max_runtime_seconds": 30
}

If a task exceeds budget, the agent should summarize partial progress and ask whether to continue, or escalate based on product design.
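A budget like the one above only helps if something enforces it during the agent loop. The following is a minimal sketch, assuming a hypothetical `AgentBudget` tracker and `BudgetExceeded` exception. It checks limits after each recorded call, so the call that crosses a limit has already been made by the time the exception is raised.

```python
import time
from dataclasses import dataclass, field

class BudgetExceeded(Exception):
    """Raised when an agent run goes over one of its configured limits."""

@dataclass
class AgentBudget:
    # Limits (defaults mirror the example budget).
    max_model_calls: int = 6
    max_tool_calls: int = 10
    max_input_tokens: int = 50_000
    max_cost_usd: float = 0.25
    max_runtime_seconds: float = 30.0
    # Usage counters, updated as the agent runs.
    model_calls: int = 0
    tool_calls: int = 0
    input_tokens: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def record_model_call(self, input_tokens, cost_usd):
        self.model_calls += 1
        self.input_tokens += input_tokens
        self.cost_usd += cost_usd
        self.check()

    def record_tool_call(self, cost_usd=0.0):
        self.tool_calls += 1
        self.cost_usd += cost_usd
        self.check()

    def check(self):
        """Raise BudgetExceeded if any counter is over its limit."""
        usage = [
            ("max_model_calls", self.model_calls, self.max_model_calls),
            ("max_tool_calls", self.tool_calls, self.max_tool_calls),
            ("max_input_tokens", self.input_tokens, self.max_input_tokens),
            ("max_cost_usd", self.cost_usd, self.max_cost_usd),
            ("max_runtime_seconds", time.monotonic() - self.started_at, self.max_runtime_seconds),
        ]
        for name, used, limit in usage:
            if used > limit:
                raise BudgetExceeded(f"{name} exceeded: {used} > {limit}")

# Usage: a loop that would make five model calls under a two-call limit.
budget = AgentBudget(max_model_calls=2)
stopped_reason = None
try:
    for _ in range(5):
        budget.record_model_call(input_tokens=1_000, cost_usd=0.01)
except BudgetExceeded as exc:
    stopped_reason = str(exc)  # here the agent would summarize progress or escalate

print(stopped_reason)
```

Catching the exception at the top of the agent loop gives one place to summarize partial progress, ask the user whether to continue, or escalate.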

Practical takeaway

Cost and latency are architectural concerns. Measure them per trace, optimize the biggest drivers, and set explicit budgets. The goal is not always to use the cheapest model. The goal is to deliver reliable task completion within acceptable cost and latency.
