Cost and Latency Optimization

Production agents can become expensive and slow if cost and latency are not designed into the architecture. A prototype may work well for ten internal users, then become unusable when traffic grows or when agent loops call models and tools repeatedly.

Cost and latency are product quality attributes. Users notice slow responses. Teams notice unpredictable bills.

Understand cost drivers

Common cost drivers include:

Input tokens
Output tokens
Number of model calls
Model choice
Embedding calls
Vector database queries
Reranking calls
Tool/API usage
Retries
Reflection or critique loops
Multi-agent parallel calls
Long conversation history

Agentic systems often cost more than simple chat because they may perform multiple model calls per user request.
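To make the multiplier concrete, the sketch below totals tokens across every model call in one request. The ModelCall record, the step names, and all token counts are made up for illustration; real numbers would come from your own traces.

```python
from dataclasses import dataclass

@dataclass
class ModelCall:
    """One model call inside a single user request (hypothetical record shape)."""
    step: str
    input_tokens: int
    output_tokens: int

def request_token_totals(calls):
    """Sum input and output tokens over every model call in one request."""
    total_in = sum(c.input_tokens for c in calls)
    total_out = sum(c.output_tokens for c in calls)
    return total_in, total_out

# Illustrative numbers only: a plain chat turn vs. an agent loop for the same question.
chat_calls = [ModelCall("answer", 1_200, 300)]
agent_calls = [
    ModelCall("plan", 1_200, 150),
    ModelCall("tool_selection", 1_500, 80),
    ModelCall("draft_answer", 4_000, 400),
    ModelCall("critique", 4_600, 200),
]

chat_in, chat_out = request_token_totals(chat_calls)
agent_in, agent_out = request_token_totals(agent_calls)
print(f"chat:  {len(chat_calls)} call(s), {chat_in} input / {chat_out} output tokens")
print(f"agent: {len(agent_calls)} call(s), {agent_in} input / {agent_out} output tokens")
```

Input tokens tend to grow fastest in a loop like this, because each later step usually carries the earlier context forward.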

Basic token cost calculation

A simple cost estimator:

def estimate_llm_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Estimate the cost of one model call; prices are per million tokens."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_million
    output_cost = (output_tokens / 1_000_000) * output_price_per_million
    return input_cost + output_cost

cost = estimate_llm_cost(
    input_tokens=12000,
    output_tokens=900,
    input_price_per_million=0.50,
    output_price_per_million=1.50
)
print(f"Estimated cost: ${cost:.4f}")

Use current provider pricing in your actual implementation. Prices change, so keep pricing data configurable.
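One way to keep pricing configurable is to load it from data instead of hard-coding it. The sketch below assumes a small JSON document keyed by model name; the key names and the prices are placeholders, not real provider pricing.

```python
import json

# Placeholder prices in USD per million tokens; NOT real provider pricing.
# In practice this would live in a config file or settings service.
PRICING_JSON = """
{
  "small-fast-model": {"input_per_million": 0.50, "output_per_million": 1.50},
  "stronger-model":   {"input_per_million": 5.00, "output_per_million": 15.00}
}
"""

def load_pricing(raw_json):
    """Parse the pricing table into a dict keyed by model name."""
    return json.loads(raw_json)

def cost_for_call(pricing, model, input_tokens, output_tokens):
    """Price one call; fail loudly if the model has no configured price."""
    if model not in pricing:
        raise ValueError(f"No pricing configured for model {model!r}")
    price = pricing[model]
    input_cost = (input_tokens / 1_000_000) * price["input_per_million"]
    output_cost = (output_tokens / 1_000_000) * price["output_per_million"]
    return input_cost + output_cost

pricing = load_pricing(PRICING_JSON)
example_cost = cost_for_call(pricing, "small-fast-model", 12_000, 900)
print(f"Estimated cost: ${example_cost:.5f}")
```

Raising on an unknown model is a deliberate choice here: silently pricing a call at zero would hide spend from the reports.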

Cost per successful task

Cost per request is useful, but cost per successful task is a better measure because it also counts the cost of failures, retries, and rework.

cost per successful task = total cost / successful completed tasks

If a cheaper model fails often and requires retries or human correction, it may be more expensive in practice than a stronger model.
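A small worked comparison shows how this can happen. All figures below are invented for illustration, and the sketch assumes every failed attempt triggers a fixed-cost human correction, which is a simplification.

```python
def cost_per_successful_task(total_cost, successful_tasks):
    """total cost / successfully completed tasks; infinite if nothing succeeded."""
    if successful_tasks == 0:
        return float("inf")
    return total_cost / successful_tasks

def total_cost_with_corrections(attempts, successes, cost_per_attempt, correction_cost):
    """Model spend for all attempts plus human correction spend for the failures."""
    failures = attempts - successes
    return attempts * cost_per_attempt + failures * correction_cost

# Hypothetical scenario: 1,000 attempts per model, USD 0.05 per human correction.
cheap_total = total_cost_with_corrections(1000, 600, 0.002, 0.05)
strong_total = total_cost_with_corrections(1000, 950, 0.010, 0.05)

cheap_cps = cost_per_successful_task(cheap_total, 600)
strong_cps = cost_per_successful_task(strong_total, 950)

print(f"cheap model:  total ${cheap_total:.2f}, per successful task ${cheap_cps:.4f}")
print(f"strong model: total ${strong_total:.2f}, per successful task ${strong_cps:.4f}")
```

Under these assumptions the model that is five times more expensive per attempt ends up cheaper per successful task, because it fails far less often.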

Track:

Average cost per request
Average cost per successful task
Cost by route or feature
Cost by customer or tenant
Cost by model version
Cost caused by retries
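The metrics above can be derived from per-request trace records. The sketch below assumes each record is a dict with route, cost_usd, success, and is_retry fields; that shape is hypothetical, so adapt it to whatever your tracing system emits.

```python
def summarize_by(records, key):
    """Group trace records by `key` and total cost, successes, and retry cost."""
    summary = {}
    for record in records:
        bucket = summary.setdefault(
            record[key],
            {"requests": 0, "successes": 0, "cost_usd": 0.0, "retry_cost_usd": 0.0},
        )
        bucket["requests"] += 1
        bucket["successes"] += 1 if record["success"] else 0
        bucket["cost_usd"] += record["cost_usd"]
        if record["is_retry"]:
            bucket["retry_cost_usd"] += record["cost_usd"]
    return summary

# Made-up records for illustration.
records = [
    {"route": "faq", "cost_usd": 0.002, "success": True, "is_retry": False},
    {"route": "faq", "cost_usd": 0.002, "success": True, "is_retry": False},
    {"route": "refund_policy", "cost_usd": 0.030, "success": False, "is_retry": False},
    {"route": "refund_policy", "cost_usd": 0.030, "success": True, "is_retry": True},
]

by_route = summarize_by(records, "route")
for route, stats in by_route.items():
    per_success = stats["cost_usd"] / stats["successes"] if stats["successes"] else float("inf")
    print(f"{route}: ${stats['cost_usd']:.3f} total, "
          f"${per_success:.3f} per success, ${stats['retry_cost_usd']:.3f} from retries")
```

The same function can group by tenant or model version if the records carry those fields.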

Latency percentiles

Average latency is not enough. Track percentiles.

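p50 latency: the median; half of requests are faster than this, so it reflects the typical user experience
p95 latency: 95% of requests are at least this fast; the slowest 5% are slower
p99 latency: 99% of requests are at least this fast; this exposes worst-case tail behavior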

A system with good average latency may still have terrible p99 latency due to slow tools, retries, or large contexts.

Key point:

p50 latency ≠ p99 latency
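The gap between the average and the tail is easy to see with a few numbers. The sketch below uses the nearest-rank percentile definition (one of several common definitions) on 100 made-up latency samples.

```python
import math
import statistics

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up latencies in milliseconds: 90 fast, 8 slow, 2 very slow requests.
latencies_ms = [800] * 90 + [3_000] * 8 + [20_000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p95={p95_ms} ms, p99={p99_ms} ms")
```

In this sample the mean is 1,360 ms and the median is 800 ms, yet the p99 is 20,000 ms, which is the experience a dashboard of averages would hide.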

Reducing latency

Strategies include:

Use smaller models for simple tasks.
Route by task complexity.
Reduce prompt size.
Retrieve fewer but better chunks.
Cache frequent retrieval results.
Run independent tool calls in parallel.
Stream responses when appropriate.
Avoid unnecessary reflection loops.
Set timeouts on slow tools.
Precompute embeddings and summaries.
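Two of these strategies, running independent tool calls in parallel and setting timeouts on slow tools, can be combined with Python's asyncio. In the sketch below, fake_tool and the tool names are stand-ins that sleep instead of performing real I/O.

```python
import asyncio

async def fake_tool(name, delay_s):
    """Stand-in for a real tool call; sleeps instead of doing network I/O."""
    await asyncio.sleep(delay_s)
    return f"{name}: ok"

async def call_with_timeout(name, delay_s, timeout_s):
    """Run one tool call, giving up once timeout_s has elapsed."""
    try:
        return await asyncio.wait_for(fake_tool(name, delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return f"{name}: timed out"

async def run_tools():
    # gather() runs the independent calls concurrently and keeps result order.
    return await asyncio.gather(
        call_with_timeout("search", 0.05, 0.5),
        call_with_timeout("crm_lookup", 0.05, 0.5),
        call_with_timeout("slow_report", 2.0, 0.2),
    )

results = asyncio.run(run_tools())
print(results)
```

Total wall-clock time is bounded by the slowest call or its timeout rather than by the sum of all calls. Whether a timed-out tool should fail the whole task or just degrade the answer is a product decision.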

Example task router:

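def choose_model(task):
    # Cheap path: simple, low-risk classification.
    if task.type == "classification" and task.risk == "low":
        return "small-fast-model"
    # Expensive path: high-risk policy questions or tasks that need multi-step reasoning.
    if task.type == "high_risk_policy" or task.requires_reasoning:
        return "stronger-model"
    # Everything else uses the general-purpose default.
    return "default-model"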

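Routing can reduce cost and latency with little or no loss in quality, provided the routing rules are evaluated so that hard tasks are not sent to a model that is too weak for them.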

Caching

Caching can help when inputs repeat or retrieval results are stable.

Cache candidates:

Embeddings for documents
Retrieval results for common queries
Policy document summaries
Tool results with safe TTLs
Model outputs for deterministic low-risk tasks

Example TTL cache:

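import time

# In-memory cache: key -> (value, stored_at). Not shared across processes.
cache = {}

def get_cached(key, ttl_seconds):
    """Return the cached value, or None if the key is missing or expired."""
    item = cache.get(key)
    if item is None:
        return None
    value, stored_at = item
    # time.monotonic() is unaffected by system clock adjustments.
    if time.monotonic() - stored_at > ttl_seconds:
        del cache[key]  # evict the expired entry
        return None
    return value

def set_cached(key, value):
    """Store the value together with the time it was cached."""
    cache[key] = (value, time.monotonic())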

Do not cache sensitive or user-specific outputs without careful controls, such as scoping cache keys by user or tenant and keeping TTLs short.
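One such control is to build the scope into the cache key itself. The sketch below is an assumption about how keys might be constructed: make_cache_key, the namespace names, and the tenant IDs are all hypothetical.

```python
import hashlib
import json

def make_cache_key(namespace, payload, tenant_id=None):
    """Build a deterministic cache key.

    Including tenant_id in the key means one tenant's entries can never be
    returned to another. Use tenant_id=None only for data that is safe to share.
    """
    # sort_keys makes the key independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    scope = tenant_id if tenant_id is not None else "shared"
    return f"{namespace}:{scope}:{digest}"

key_a = make_cache_key("retrieval", {"query": "refund policy", "top_k": 5}, tenant_id="tenant-a")
key_a_reordered = make_cache_key("retrieval", {"top_k": 5, "query": "refund policy"}, tenant_id="tenant-a")
key_b = make_cache_key("retrieval", {"query": "refund policy", "top_k": 5}, tenant_id="tenant-b")

print(key_a == key_a_reordered)  # same tenant, same request
print(key_a == key_b)            # different tenants
```

Hashing the payload also keeps raw query text out of the key, although the cached value itself still needs the same access controls as the original data.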

Budget limits

Agents should have explicit per-task budgets, for example:

{
  "max_model_calls": 6,
  "max_tool_calls": 10,
  "max_input_tokens": 50000,
  "max_cost_usd": 0.25,
  "max_runtime_seconds": 30
}

If a task exceeds budget, the agent should summarize partial progress and ask whether to continue, or escalate based on product design.
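A budget only helps if something enforces it. The sketch below is one possible way to check the limits before each model call; Budget, BudgetTracker, and BudgetExceeded are hypothetical names, and the default limits simply mirror the example values above.

```python
import time
from dataclasses import dataclass, field

class BudgetExceeded(Exception):
    """Raised when the next step would run past a configured limit."""

@dataclass
class Budget:
    max_model_calls: int = 6
    max_tool_calls: int = 10
    max_input_tokens: int = 50_000
    max_cost_usd: float = 0.25
    max_runtime_seconds: float = 30.0

@dataclass
class BudgetTracker:
    budget: Budget
    model_calls: int = 0
    tool_calls: int = 0
    input_tokens: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def check_before_model_call(self):
        """Raise if any limit is already used up, before spending more."""
        b = self.budget
        if self.model_calls >= b.max_model_calls:
            raise BudgetExceeded("max_model_calls")
        if self.input_tokens >= b.max_input_tokens:
            raise BudgetExceeded("max_input_tokens")
        if self.cost_usd >= b.max_cost_usd:
            raise BudgetExceeded("max_cost_usd")
        if time.monotonic() - self.started_at >= b.max_runtime_seconds:
            raise BudgetExceeded("max_runtime_seconds")

    def record_model_call(self, input_tokens, cost_usd):
        """Record usage after a model call completes."""
        self.model_calls += 1
        self.input_tokens += input_tokens
        self.cost_usd += cost_usd

# Usage: a tight budget of 3 model calls against a 5-step plan.
tracker = BudgetTracker(Budget(max_model_calls=3))
completed_steps = []
stop_reason = None
try:
    for step in ["plan", "search", "draft", "critique", "revise"]:
        tracker.check_before_model_call()
        # ... the real model call would happen here ...
        tracker.record_model_call(input_tokens=2_000, cost_usd=0.01)
        completed_steps.append(step)
except BudgetExceeded as exc:
    stop_reason = str(exc)

print(f"Completed {completed_steps}; stopped because of: {stop_reason}")
```

Checking before the call, not after, avoids paying for the call that breaks the budget. A tool-call check would follow the same pattern, and the caught exception is the natural place to summarize partial progress or escalate.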

Practical takeaway

Cost and latency are architectural concerns. Measure them per trace, optimize the biggest drivers, and set explicit budgets. The goal is not always to use the cheapest model. The goal is to deliver reliable task completion within acceptable cost and latency.
