
CI/CD for AI Systems
AGAI 401 · Deployment, Operations, and Optimization
Learn how to adapt continuous integration and deployment practices for prompts, evals, model changes, retrieval indexes, and agent orchestration logic.
Key terms
- AI release = code + prompts + models + indexes
- eval thresholds gate deployment
- baseline comparison reveals tradeoffs
- canary rollout reduces release risk
Learning objectives
- Design a CI pipeline for prompt and agent changes.
- Define evaluation thresholds for release gates.
- Version retrieval indexes and model configurations.
- Use canary deployment and rollback for AI behavior changes.
CI/CD for AI systems extends normal software delivery practices to model-driven behavior. You still build, test, review, deploy, and monitor code. But now you must also manage prompts, eval datasets, model versions, retrieval indexes, tool schemas, and safety policies.
A production AI release is not just a code release. It may include:
- Application code
- Prompt versions
- Tool schema versions
- Model configuration
- Retrieval index version
- Evaluator version
- Safety policy version
- Feature flags
If any of these change, behavior can change.
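One way to make "any of these can change behavior" concrete is to treat the whole set as a single release manifest and derive a fingerprint from it. The sketch below is illustrative only: the field names and version strings are hypothetical, not part of any particular tool.

```python
import hashlib
import json


def release_fingerprint(manifest: dict) -> str:
    """Return a short, stable hash covering every versioned component."""
    # sort_keys and fixed separators make the hash independent of key order.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical manifest; every value here is a placeholder.
manifest = {
    "app_code": "git:3f2a9c1",
    "prompt_version": "support_prompt_v7",
    "tool_schema_version": "tools_v3",
    "model_config": {"model": "example-model", "temperature": 0.2},
    "retrieval_index_version": "2025-11-18",
    "evaluator_version": "evals_v5",
    "safety_policy_version": "policy_v2",
    "feature_flags": {"new_refund_flow": False},
}

# Changing only the prompt version yields a different release fingerprint.
changed = {**manifest, "prompt_version": "support_prompt_v8"}

if __name__ == "__main__":
    print(release_fingerprint(manifest))
    print(release_fingerprint(changed))
```

Logging the fingerprint alongside each eval report and each production request makes it easier to tell which combination of artifacts produced a given behavior.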
AI-specific CI checks
A CI pipeline for an agent might include:
1. Run unit tests for deterministic code.
2. Validate prompt templates render correctly.
3. Validate tool schemas.
4. Run offline eval suite.
5. Run safety eval suite.
6. Compare against baseline metrics.
7. Check cost and latency budgets.
8. Produce evaluation report for review.
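Step 2 in the list above, checking that prompt templates render, can be approximated with the standard library. This is a minimal sketch that assumes templates use Python `str.format`-style placeholders; the function names are hypothetical, and a real project would adapt the idea to its own templating system.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Collect the named placeholders in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def validate_prompt(template: str, sample_vars: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the template is usable."""
    problems = []
    fields = template_fields(template)
    missing = fields - set(sample_vars)
    if missing:
        problems.append(f"missing variables: {sorted(missing)}")
    unused = set(sample_vars) - fields
    if unused:
        problems.append(f"unused variables: {sorted(unused)}")
    return problems


if __name__ == "__main__":
    template = "Hello {customer_name}, your order {order_id} has shipped."
    print(validate_prompt(template, {"customer_name": "Ada"}))
```

A check like this catches renamed or dropped variables before any model call is made, which keeps the more expensive eval steps for changes that at least render.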
Example GitHub Actions-style outline:
name: agent-ci
on:
  pull_request:
jobs:
  test-and-evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/unit
      - name: Validate prompts
        run: python scripts/validate_prompts.py
      - name: Validate tool schemas
        run: python scripts/validate_tools.py
      - name: Run eval suite
        run: python evals/run_evals.py --suite support_core --output eval_report.json
      - name: Check eval thresholds
        run: python evals/check_thresholds.py eval_report.json
Threshold checks prevent obvious regressions from merging.
Evaluation thresholds
Define release gates:
{
  "min_task_success_rate": 0.90,
  "max_policy_violation_rate": 0.00,
  "max_avg_cost_usd": 0.05,
  "max_p95_latency_ms": 8000,
  "min_tool_selection_accuracy": 0.95
}
Not every metric must block release, but safety-critical metrics should.
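A threshold check along these lines might separate blocking failures from warnings. The sketch below assumes the eval report uses the same metric names as the thresholds minus the `min_`/`max_` prefix; which metrics are blocking is a project decision, and the `BLOCKING` set here is only an example.

```python
THRESHOLDS = {
    "min_task_success_rate": 0.90,
    "max_policy_violation_rate": 0.00,
    "max_avg_cost_usd": 0.05,
    "max_p95_latency_ms": 8000,
    "min_tool_selection_accuracy": 0.95,
}

# Assumption: only these gates block a merge; the rest produce warnings.
BLOCKING = {"min_task_success_rate", "max_policy_violation_rate"}


def check_thresholds(report: dict, thresholds: dict, blocking: set) -> tuple[list, list]:
    """Return (failures, warnings) for every threshold the report violates."""
    failures, warnings = [], []
    for key, limit in thresholds.items():
        bound, metric = key.split("_", 1)  # e.g. "min", "task_success_rate"
        value = report[metric]
        ok = value >= limit if bound == "min" else value <= limit
        if not ok:
            message = f"{metric}={value} violates {bound} {limit}"
            (failures if key in blocking else warnings).append(message)
    return failures, warnings


# Hypothetical eval report for illustration.
example_report = {
    "task_success_rate": 0.93,
    "policy_violation_rate": 0.0,
    "avg_cost_usd": 0.061,
    "p95_latency_ms": 7200,
    "tool_selection_accuracy": 0.96,
}

if __name__ == "__main__":
    failures, warnings = check_thresholds(example_report, THRESHOLDS, BLOCKING)
    print("failures:", failures)
    print("warnings:", warnings)
    raise SystemExit(1 if failures else 0)
```

Exiting non-zero only on blocking failures lets the CI job fail on safety or success regressions while still surfacing cost and latency warnings in the report.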
Comparing against baseline
Absolute thresholds catch outright failures, but comparing against a baseline is often more informative because it shows what changed. If a prompt change improves task success from 90% to 93% but doubles cost, you need to decide whether the tradeoff is acceptable.
Example comparison output:
{
  "baseline": "support_agent_v1.4.1",
  "candidate": "support_agent_v1.4.2",
  "task_success_rate": { "baseline": 0.91, "candidate": 0.94 },
  "policy_violations": { "baseline": 0, "candidate": 0 },
  "avg_cost_usd": { "baseline": 0.032, "candidate": 0.041 },
  "p95_latency_ms": { "baseline": 6100, "candidate": 6900 }
}
CI should surface tradeoffs, not hide them.
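To surface tradeoffs, a report can include the relative change for each metric. The following sketch reuses the comparison output above; the function name is hypothetical, and metrics with a zero baseline are reported as `None` rather than dividing by zero.

```python
comparison = {
    "baseline": "support_agent_v1.4.1",
    "candidate": "support_agent_v1.4.2",
    "task_success_rate": {"baseline": 0.91, "candidate": 0.94},
    "policy_violations": {"baseline": 0, "candidate": 0},
    "avg_cost_usd": {"baseline": 0.032, "candidate": 0.041},
    "p95_latency_ms": {"baseline": 6100, "candidate": 6900},
}


def relative_changes(comparison: dict) -> dict:
    """Map each metric to (candidate - baseline) / baseline."""
    changes = {}
    for metric, values in comparison.items():
        if not isinstance(values, dict):
            continue  # skip the version labels
        base, cand = values["baseline"], values["candidate"]
        # A zero baseline has no meaningful relative change.
        changes[metric] = None if base == 0 else (cand - base) / base
    return changes


if __name__ == "__main__":
    for metric, change in relative_changes(comparison).items():
        label = "n/a" if change is None else f"{change:+.1%}"
        print(f"{metric}: {label}")
```

For this example the candidate gains about 3% in task success while costing about 28% more per task and adding about 13% to p95 latency, which is the kind of tradeoff a reviewer should see explicitly.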
Retrieval index deployment
RAG systems have another deployable artifact: the index. Changing chunking, embedding model, document set, or metadata can change answers.
Version retrieval indexes:
{
  "index_name": "policy_docs",
  "version": "2025-11-18",
  "embedding_model": "text-embedding-3-small",
  "chunking_strategy": "markdown_heading_800_overlap_100",
  "document_count": 482,
  "chunk_count": 3910
}
Before promoting a new index, run retrieval evals and answer-level evals.
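A retrieval eval can be as simple as recall@k over a labeled query set, compared between the current and candidate index. The sketch below is one possible promotion check; the data layout, the `tolerance` parameter, and the function names are assumptions for illustration, and answer-level evals would run separately.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(results: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in results) / len(results)


def may_promote(baseline: float, candidate: float, tolerance: float = 0.01) -> bool:
    """Allow promotion if the candidate is not worse than baseline by more than tolerance."""
    return candidate >= baseline - tolerance


if __name__ == "__main__":
    # Hypothetical labeled queries: (retrieved ids, relevant ids).
    current_index = [(["d1", "d3", "d2"], {"d1", "d2"}), (["d5", "d6", "d7"], {"d4"})]
    new_index = [(["d1", "d2", "d3"], {"d1", "d2"}), (["d4", "d6", "d7"], {"d4"})]
    base = mean_recall_at_k(current_index, k=3)
    cand = mean_recall_at_k(new_index, k=3)
    print(base, cand, may_promote(base, cand))
```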
Canary deployments
For production rollout, use canaries or feature flags.
1% traffic → monitor
5% traffic → monitor
25% traffic → monitor
100% traffic → full release
Monitor quality, cost, latency, errors, user feedback, and safety events during rollout.
It should be possible to roll back a bad prompt release without redeploying the entire application.
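One common way to get both staged rollout and fast rollback is to route each user to a prompt version based on a stable hash and a percentage stored in runtime configuration. The sketch below assumes such a config exists; its keys and the version names are hypothetical.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def prompt_version_for(user_id: str, config: dict) -> str:
    """Send users whose bucket falls below canary_percent to the candidate."""
    if bucket(user_id) < config["canary_percent"]:
        return config["candidate"]
    return config["stable"]


# Hypothetical runtime config, e.g. loaded from a feature-flag service.
rollout = {
    "stable": "support_prompt_v7",
    "candidate": "support_prompt_v8",
    "canary_percent": 5,
}

if __name__ == "__main__":
    print(prompt_version_for("user-123", rollout))
    # Rollback is a config change, not a redeploy.
    rollout["canary_percent"] = 0
    print(prompt_version_for("user-123", rollout))
```

Because the bucket depends only on the user ID, a user who sees the candidate at 5% keeps seeing it at 25%, and setting `canary_percent` to 0 returns everyone to the stable version.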
Environment separation
Use separate environments:
- local development
- staging with synthetic data
- internal dogfood
- limited beta
- production
Do not test agent tool permissions for the first time in production. Staging should include realistic tool mocks or sandboxed integrations.
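Tool access per environment can be written down as data so that staging exercises the same permission checks as production. This is a rough sketch: the environment names follow the list above, while the tool names and modes are invented for illustration.

```python
# Hypothetical policy: which tools exist in each environment and how they run.
TOOL_POLICY = {
    "staging": {"lookup_order": "sandbox", "issue_refund": "mock"},
    "production": {"lookup_order": "live", "issue_refund": "live"},
}


def resolve_tool_mode(environment: str, tool: str) -> str:
    """Return how a tool should run in an environment, or refuse if unlisted."""
    try:
        return TOOL_POLICY[environment][tool]
    except KeyError:
        raise PermissionError(f"{tool!r} is not permitted in {environment!r}") from None


if __name__ == "__main__":
    print(resolve_tool_mode("staging", "issue_refund"))
    print(resolve_tool_mode("production", "issue_refund"))
```

Denying anything not listed means a newly added tool fails closed until someone decides how it should behave in each environment.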
Practical takeaway
CI/CD for AI systems is ordinary software delivery plus behavior evaluation. Prompts, models, tools, indexes, and policies are all versioned artifacts. Evals become release gates. Canary rollouts and rollback plans protect production users.
The more autonomy your agent has, the more disciplined your release process must be.