
Cost and Latency Optimization
AGAI 401 · Deployment, Operations, and Optimization
Learn how to measure, budget, and optimize token cost, tool cost, model-call count, and latency for production agents.
Key terms
cost = input tokens + output tokens + tool usage
cost per successful task > cost per request
p50 latency ≠ p99 latency
budget limits control runaway agents
Learning objectives
- Identify major cost and latency drivers in production agents.
- Calculate approximate token-based LLM cost.
- Use latency percentiles to understand user experience.
- Apply routing, caching, prompt reduction, and budgets to optimize systems.
Production agents can become expensive and slow if cost and latency are not designed into the architecture. A prototype may work well for ten internal users, then become unusable when traffic grows or when agent loops call models and tools repeatedly.
Cost and latency are product quality attributes. Users notice slow responses. Teams notice unpredictable bills.
Understand cost drivers
Common cost drivers include:
- Input tokens
- Output tokens
- Number of model calls
- Model choice
- Embedding calls
- Vector database queries
- Reranking calls
- Tool/API usage
- Retries
- Reflection or critique loops
- Multi-agent parallel calls
- Long conversation history
Agentic systems often cost more than simple chat because they may perform multiple model calls per user request.
Basic token cost calculation
A simple cost estimator:
def estimate_llm_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Return the estimated USD cost of one model call; prices are per million tokens."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_million
    output_cost = (output_tokens / 1_000_000) * output_price_per_million
    return input_cost + output_cost


# Example prices only; substitute your provider's current rates.
cost = estimate_llm_cost(
    input_tokens=12000,
    output_tokens=900,
    input_price_per_million=0.50,
    output_price_per_million=1.50,
)
print(f"Estimated cost: ${cost:.4f}")
Use current provider pricing in your actual implementation. Prices change, so keep pricing data configurable.
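Because an agent may make several model calls per user request, it can help to sum the estimate across every call in a request. The following is a minimal sketch of that idea with the pricing kept in a configurable table. The model names, the prices in `PRICING`, and the helper names `call_cost` and `request_cost` are placeholders chosen for illustration, not real provider pricing or a real API.

```python
# Hypothetical price table in USD per million tokens.
# Model names and prices are placeholders, not real provider pricing.
PRICING = {
    "small-fast-model": {"input": 0.50, "output": 1.50},
    "stronger-model": {"input": 5.00, "output": 15.00},
}


def call_cost(model, input_tokens, output_tokens, pricing=PRICING):
    """Cost in USD of one model call, using the per-million-token price table."""
    price = pricing[model]
    return (
        (input_tokens / 1_000_000) * price["input"]
        + (output_tokens / 1_000_000) * price["output"]
    )


def request_cost(calls, pricing=PRICING):
    """Total cost of one user request that triggered several model calls."""
    return sum(
        call_cost(c["model"], c["input_tokens"], c["output_tokens"], pricing)
        for c in calls
    )


# One user request that fanned out into three model calls.
calls = [
    {"model": "small-fast-model", "input_tokens": 2_000, "output_tokens": 100},
    {"model": "stronger-model", "input_tokens": 12_000, "output_tokens": 900},
    {"model": "stronger-model", "input_tokens": 14_000, "output_tokens": 600},
]
total = request_cost(calls)
print(f"Model calls: {len(calls)}, estimated cost: ${total:.4f}")
```

Keeping the price table separate from the arithmetic means a price change is a configuration update rather than a code change.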
Cost per successful task
Cost per request is useful, but cost per successful task is a better measure because it also counts the money spent on attempts that failed.
cost per successful task = total cost / successfully completed tasks
If a cheaper model fails often and requires retries or human correction, it may be more expensive in practice than a stronger model.
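A short worked example may make the comparison concrete. The numbers below are invented for illustration, and the function name `cost_per_successful_task` is a hypothetical helper. The sketch counts only model spend, so it understates the gap whenever failures also trigger retries or human correction.

```python
def cost_per_successful_task(cost_per_attempt, attempts, successes):
    """Total spend across all attempts divided by the tasks that succeeded."""
    if successes == 0:
        raise ValueError("no successful tasks; cost per successful task is undefined")
    return (cost_per_attempt * attempts) / successes


# Illustrative numbers only: 1,000 attempts per model.
# The cheaper model costs half as much per attempt but succeeds far less often.
cheap = cost_per_successful_task(cost_per_attempt=0.010, attempts=1000, successes=400)
strong = cost_per_successful_task(cost_per_attempt=0.020, attempts=1000, successes=950)

print(f"cheaper model:  ${cheap:.4f} per successful task")
print(f"stronger model: ${strong:.4f} per successful task")
```

With these assumed success rates, the model that is cheaper per request ends up more expensive per successful task.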
Track:
- Average cost per request
- Average cost per successful task
- Cost by route or feature
- Cost by customer or tenant
- Cost by model version
- Cost caused by retries
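One way to produce the breakdowns listed above is to log one record per task and group by whichever dimension is of interest. This is a sketch under assumptions: the record fields (`route`, `cost_usd`, `success`) and the function `summarize_costs` are hypothetical, and a real system would likely read these records from its tracing or billing store.

```python
from collections import defaultdict


def summarize_costs(records, key):
    """Group task records by `key` and report cost per request and per success."""
    totals = defaultdict(lambda: {"cost": 0.0, "requests": 0, "successes": 0})
    for r in records:
        bucket = totals[r[key]]
        bucket["cost"] += r["cost_usd"]
        bucket["requests"] += 1
        bucket["successes"] += 1 if r["success"] else 0

    summary = {}
    for group, t in totals.items():
        summary[group] = {
            "cost_per_request": t["cost"] / t["requests"],
            # None when nothing succeeded, since the ratio is undefined.
            "cost_per_success": t["cost"] / t["successes"] if t["successes"] else None,
        }
    return summary


# Illustrative records; field names and values are assumptions.
records = [
    {"route": "faq", "cost_usd": 0.01, "success": True},
    {"route": "faq", "cost_usd": 0.01, "success": True},
    {"route": "refund", "cost_usd": 0.08, "success": True},
    {"route": "refund", "cost_usd": 0.12, "success": False},
]
summary = summarize_costs(records, key="route")
for route, stats in summary.items():
    print(route, stats)
```

The same function can group by tenant or model version by adding that field to each record and passing a different `key`.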
Latency percentiles
Average latency is not enough. Track percentiles.
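- p50 latency: the median; half of all requests finish faster than this, so it reflects the typical user experience
- p95 latency: 95% of requests finish within this time; it shows what the slowest 5% of requests experience
- p99 latency: 99% of requests finish within this time; it exposes the worst tail behavior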
A system with good average latency may still have terrible p99 latency due to slow tools, retries, or large contexts.
Important phrase:
p50 latency ≠ p99 latency
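The gap between the average and the tail can be seen with a few lines of code. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions, and an invented latency sample. The function name `percentile` is a local helper, not a library call.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 100 illustrative latencies in seconds: most requests are fast, a few are very slow.
latencies = [0.8] * 90 + [2.5] * 8 + [12.0] * 2

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)

print(f"mean={mean:.2f}s p50={p50}s p95={p95}s p99={p99}s")
```

In this sample the mean is about 1.16 seconds and the p50 is 0.8 seconds, yet the p99 is 12 seconds, which is the experience a dashboard showing only averages would hide.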
Reducing latency
Strategies include:
- Use smaller models for simple tasks.
- Route by task complexity.
- Reduce prompt size.
- Retrieve fewer but better chunks.
- Cache frequent retrieval results.
- Run independent tool calls in parallel.
- Stream responses when appropriate.
- Avoid unnecessary reflection loops.
- Set timeouts on slow tools.
- Precompute embeddings and summaries.
Example task router:
def choose_model(task):
    if task.type == "classification" and task.risk == "low":
        return "small-fast-model"
    if task.type == "high_risk_policy" or task.requires_reasoning:
        return "stronger-model"
    return "default-model"
Routing can reduce cost and latency without sacrificing quality, provided the tasks sent to the smaller model are ones it handles reliably.
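Two other strategies from the list, running independent tool calls in parallel and setting timeouts on slow tools, can be sketched together with `asyncio`. The tool functions `fetch_order` and `fetch_recommendations` are hypothetical stand-ins that simulate work with `asyncio.sleep`, and the timeout values are arbitrary.

```python
import asyncio


async def fetch_order(order_id):
    """Hypothetical fast tool."""
    await asyncio.sleep(0.05)
    return {"order_id": order_id, "status": "shipped"}


async def fetch_recommendations(user_id):
    """Hypothetical slow tool that will exceed its timeout."""
    await asyncio.sleep(1.0)
    return ["item-1", "item-2"]


async def call_with_timeout(coro, timeout_seconds, fallback=None):
    """Await a tool call, returning `fallback` if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return fallback


async def gather_context():
    # The two calls do not depend on each other, so they run concurrently.
    return await asyncio.gather(
        call_with_timeout(fetch_order("A-100"), timeout_seconds=0.5),
        call_with_timeout(fetch_recommendations("u-7"), timeout_seconds=0.2, fallback=[]),
    )


order, recommendations = asyncio.run(gather_context())
print(order, recommendations)
```

Here the slow tool is cut off at its timeout and replaced by an empty fallback, so total wall time is bounded by the longest timeout rather than by the sum of both calls. Whether a fallback is acceptable depends on how essential the tool's result is to the task.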
Caching
Caching can help when inputs repeat or retrieval results are stable.
Cache candidates:
- Embeddings for documents
- Retrieval results for common queries
- Policy document summaries
- Tool results with safe TTLs
- Model outputs for deterministic low-risk tasks
Example TTL cache:
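import time

# Maps key -> (value, created_at timestamp)
cache = {}


def get_cached(key, ttl_seconds):
    """Return the cached value, or None if the key is missing or older than the TTL."""
    item = cache.get(key)
    if item is None:
        return None
    value, created_at = item
    if time.time() - created_at > ttl_seconds:
        # Entry has expired: evict it and report a miss.
        del cache[key]
        return None
    return value


def set_cached(key, value):
    """Store the value together with the time it was cached."""
    cache[key] = (value, time.time())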
Do not cache sensitive or user-specific outputs without careful controls.
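A usage sketch may help show where such a cache sits in a retrieval path. The TTL helpers are repeated so the block runs on its own. `search_documents` is a hypothetical stand-in for an expensive retrieval call, and the key normalization in `cached_search` is an assumption about which queries should count as the same.

```python
import time

cache = {}


def get_cached(key, ttl_seconds):
    item = cache.get(key)
    if item is None:
        return None
    value, created_at = item
    if time.time() - created_at > ttl_seconds:
        del cache[key]
        return None
    return value


def set_cached(key, value):
    cache[key] = (value, time.time())


backend_calls = 0


def search_documents(query):
    """Hypothetical stand-in for an expensive retrieval call."""
    global backend_calls
    backend_calls += 1
    return [f"chunk about {query}"]


def cached_search(query, ttl_seconds=300):
    # Normalize the query so trivially different spellings share one entry.
    key = ("search", query.strip().lower())
    hit = get_cached(key, ttl_seconds)
    if hit is not None:
        return hit
    result = search_documents(query)
    set_cached(key, result)
    return result


first = cached_search("Refund policy")
second = cached_search("  refund policy ")
print(f"backend calls: {backend_calls}")
```

The second lookup is served from the cache, so the backend is called once. For anything user-specific, a tenant or user identifier would need to be part of the key so that one user's results are never returned to another.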
Budget limits
Agents should have budgets:
{
  "max_model_calls": 6,
  "max_tool_calls": 10,
  "max_input_tokens": 50000,
  "max_cost_usd": 0.25,
  "max_runtime_seconds": 30
}
If a task exceeds its budget, the agent should stop, summarize partial progress, and either ask whether to continue or escalate, depending on the product design.
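A budget like this only helps if something checks it during the run. Below is a minimal sketch of one way to enforce it. The `Budget` class, the `BudgetExceeded` exception, and the simulated loop are all hypothetical, and the per-call token and cost figures are invented.

```python
import time


class BudgetExceeded(Exception):
    """Raised when an agent run goes past one of its limits."""


class Budget:
    def __init__(self, limits):
        self.limits = limits
        self.model_calls = 0
        self.tool_calls = 0
        self.input_tokens = 0
        self.cost_usd = 0.0
        self.started_at = time.monotonic()

    def record_model_call(self, input_tokens, cost_usd):
        self.model_calls += 1
        self.input_tokens += input_tokens
        self.cost_usd += cost_usd
        self.check()

    def record_tool_call(self):
        self.tool_calls += 1
        self.check()

    def check(self):
        """Compare current usage against every limit and raise on the first breach."""
        usage = {
            "max_model_calls": self.model_calls,
            "max_tool_calls": self.tool_calls,
            "max_input_tokens": self.input_tokens,
            "max_cost_usd": self.cost_usd,
            "max_runtime_seconds": time.monotonic() - self.started_at,
        }
        for name, used in usage.items():
            if used > self.limits[name]:
                raise BudgetExceeded(f"{name} exceeded: {used} > {self.limits[name]}")


limits = {
    "max_model_calls": 6,
    "max_tool_calls": 10,
    "max_input_tokens": 50000,
    "max_cost_usd": 0.25,
    "max_runtime_seconds": 30,
}

budget = Budget(limits)
steps_completed = 0
stop_reason = None
try:
    # Simulated runaway loop: each step makes one model call.
    for _ in range(20):
        budget.record_model_call(input_tokens=4000, cost_usd=0.01)
        steps_completed += 1
except BudgetExceeded as exc:
    # This is where an agent would summarize partial progress or escalate.
    stop_reason = str(exc)

print(steps_completed, stop_reason)
```

This sketch checks after recording, so the call that crosses a limit has already been paid for. Checking before each call is stricter, at the price of needing an estimate of what the next call will cost.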
Practical takeaway
Cost and latency are architectural concerns. Measure them per trace, optimize the biggest drivers, and set explicit budgets. The goal is not always to use the cheapest model. The goal is to deliver reliable task completion within acceptable cost and latency.