
CI/CD for AI Systems

AGAI 401 · Deployment, Operations, and Optimization

Learn how to adapt continuous integration and deployment practices for prompts, evals, model changes, retrieval indexes, and agent orchestration logic.

Key terms

AI release = code + prompts + models + indexes
eval thresholds gate deployment
baseline comparison reveals tradeoffs
canary rollout reduces release risk

Learning objectives

Design a CI pipeline for prompt and agent changes.
Define evaluation thresholds for release gates.
Version retrieval indexes and model configurations.
Use canary deployment and rollback for AI behavior changes.

CI/CD for AI systems extends normal software delivery practices to model-driven behavior. You still build, test, review, deploy, and monitor code. But now you must also manage prompts, eval datasets, model versions, retrieval indexes, tool schemas, and safety policies.

A production AI release is not just a code release. It may include:

Application code
Prompt versions
Tool schema versions
Model configuration
Retrieval index version
Evaluator version
Safety policy version
Feature flags

If any of these change, behavior can change.
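One way to make that concrete is to record every artifact in a single release manifest, so two releases can be compared field by field. The sketch below is illustrative only: the `ReleaseManifest` class, its field names, and the version strings are assumptions, not part of any particular tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ReleaseManifest:
    """One record of everything that can change agent behavior."""
    app_commit: str
    prompt_version: str
    tool_schema_version: str
    model_config_version: str
    retrieval_index_version: str
    evaluator_version: str
    safety_policy_version: str
    feature_flags: tuple = ()  # sorted (flag_name, enabled) pairs

    def fingerprint(self) -> str:
        """Stable short hash: identical manifests give identical fingerprints."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def changed_fields(old: ReleaseManifest, new: ReleaseManifest) -> list[str]:
    """List which artifacts differ between two releases."""
    old_d, new_d = asdict(old), asdict(new)
    return [name for name in old_d if old_d[name] != new_d[name]]


# Hypothetical version identifiers, for illustration only.
baseline = ReleaseManifest("abc1234", "p-12", "t-3", "m-7", "2025-11-18", "e-2", "s-5")
candidate = replace(baseline, prompt_version="p-13")
prompt_only_change = changed_fields(baseline, candidate)  # only the prompt differs
```

Logging the fingerprint with each request makes it possible to tie a behavior change seen in production back to the exact combination of artifacts that produced it.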

AI-specific CI checks

A CI pipeline for an agent might include:

1. Run unit tests for deterministic code.
2. Validate prompt templates render correctly.
3. Validate tool schemas.
4. Run offline eval suite.
5. Run safety eval suite.
6. Compare against baseline metrics.
7. Check cost and latency budgets.
8. Produce evaluation report for review.

Example GitHub Actions-style outline:

name: agent-ci

on:
  pull_request:

jobs:
  test-and-evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/unit
      - name: Validate prompts
        run: python scripts/validate_prompts.py
      - name: Validate tool schemas
        run: python scripts/validate_tools.py
      - name: Run eval suite
        run: python evals/run_evals.py --suite support_core --output eval_report.json
      - name: Check eval thresholds
        run: python evals/check_thresholds.py eval_report.json

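The threshold step should exit with a nonzero status when a gate is missed. That fails the job, which keeps obvious regressions from merging when the job is a required check.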
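The scripts named in the outline are not shown in this lesson. As one illustration, the "Validate prompts" step could check that each template parses and uses exactly the placeholders the application supplies. This is a minimal sketch that assumes prompts are `str.format`-style templates; the function names and the example placeholders are hypothetical.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Return the named placeholders used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def validate_prompt(template: str, required: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the template is valid."""
    try:
        found = template_fields(template)
    except ValueError as exc:  # e.g. an unbalanced "{" in the template
        return [f"template does not parse: {exc}"]
    problems = []
    for name in sorted(required - found):
        problems.append(f"missing placeholder: {name}")
    for name in sorted(found - required):
        problems.append(f"unexpected placeholder: {name}")
    return problems


# Hypothetical template and variables, for illustration only.
problems = validate_prompt(
    "You are a support agent. Customer: {customer_name}.",
    required={"customer_name", "ticket_id"},
)
```

A CI script built on this would exit nonzero whenever any template returns a non-empty problem list. Templates that contain literal braces, such as embedded JSON, need them doubled (`{{` and `}}`) for this style of check to pass.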

Evaluation thresholds

Define release gates:

{
  "min_task_success_rate": 0.90,
  "max_policy_violation_rate": 0.00,
  "max_avg_cost_usd": 0.05,
  "max_p95_latency_ms": 8000,
  "min_tool_selection_accuracy": 0.95
}

Not every metric must block release, but safety-critical metrics should.
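A threshold checker along these lines could separate blocking failures from warnings. The sketch assumes the eval report uses the metric names obtained by stripping the `min_` or `max_` prefix from each threshold key, and the choice of which metrics are blocking is an assumption made for illustration.

```python
# Metrics whose failure always blocks the release; the rest only warn.
# This split is an assumption for the sketch, not a recommendation.
BLOCKING = {"policy_violation_rate", "task_success_rate"}


def check_thresholds(metrics: dict, thresholds: dict) -> tuple[list[str], list[str]]:
    """Compare metrics to min_/max_ thresholds; return (blockers, warnings)."""
    blockers, warnings = [], []
    for key, limit in thresholds.items():
        # "min_task_success_rate" -> ("min", "task_success_rate")
        bound, _, metric = key.partition("_")
        if bound not in ("min", "max"):
            raise ValueError(f"threshold key must start with min_ or max_: {key}")
        if metric not in metrics:
            blockers.append(f"{metric}: missing from eval report")
            continue
        value = metrics[metric]
        failed = value < limit if bound == "min" else value > limit
        if failed:
            message = f"{metric}: {value} violates {key}={limit}"
            (blockers if metric in BLOCKING else warnings).append(message)
    return blockers, warnings


thresholds = {
    "min_task_success_rate": 0.90,
    "max_policy_violation_rate": 0.00,
    "max_avg_cost_usd": 0.05,
    "max_p95_latency_ms": 8000,
    "min_tool_selection_accuracy": 0.95,
}
# Hypothetical eval results: cost is over budget, everything else passes.
metrics = {
    "task_success_rate": 0.93,
    "policy_violation_rate": 0.0,
    "avg_cost_usd": 0.06,
    "p95_latency_ms": 7000,
    "tool_selection_accuracy": 0.97,
}
blockers, warnings = check_thresholds(metrics, thresholds)
```

In CI, the script would print both lists and call `sys.exit(1)` only when `blockers` is non-empty. A metric missing from the report is treated as a blocker so that a broken eval run cannot pass silently.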

Comparing against baseline

Absolute thresholds are useful, but comparing against a baseline is often more informative because it shows what a change costs as well as what it gains. If a prompt change improves task success from 90% to 93% but doubles cost, you need to decide whether the tradeoff is acceptable.

Example comparison output:

{
  "baseline": "support_agent_v1.4.1",
  "candidate": "support_agent_v1.4.2",
  "task_success_rate": { "baseline": 0.91, "candidate": 0.94 },
  "policy_violations": { "baseline": 0, "candidate": 0 },
  "avg_cost_usd": { "baseline": 0.032, "candidate": 0.041 },
  "p95_latency_ms": { "baseline": 6100, "candidate": 6900 }
}

CI should surface tradeoffs, not hide them.
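A small helper can turn a comparison report like the one above into per-metric deltas for reviewers. This is a sketch that assumes the report shape shown in the example; the function name is hypothetical.

```python
def compare_to_baseline(report: dict) -> dict:
    """Return absolute and relative change per metric in a comparison report."""
    changes = {}
    for name, pair in report.items():
        if not isinstance(pair, dict):
            continue  # skip the "baseline"/"candidate" version labels
        base, cand = pair["baseline"], pair["candidate"]
        delta = cand - base
        # Relative change is undefined when the baseline value is 0.
        relative = round(delta / base, 4) if base else None
        changes[name] = {"delta": round(delta, 6), "relative": relative}
    return changes


report = {
    "baseline": "support_agent_v1.4.1",
    "candidate": "support_agent_v1.4.2",
    "task_success_rate": {"baseline": 0.91, "candidate": 0.94},
    "policy_violations": {"baseline": 0, "candidate": 0},
    "avg_cost_usd": {"baseline": 0.032, "candidate": 0.041},
    "p95_latency_ms": {"baseline": 6100, "candidate": 6900},
}
changes = compare_to_baseline(report)
```

For the example report, this shows a gain of three percentage points in task success alongside roughly 28% higher average cost and roughly 13% higher p95 latency. Those are the numbers a reviewer needs side by side to judge the tradeoff.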

Retrieval index deployment

RAG systems have another deployable artifact: the index. Changing the chunking strategy, the embedding model, the document set, or the metadata can change answers.

Version retrieval indexes:

{
  "index_name": "policy_docs",
  "version": "2025-11-18",
  "embedding_model": "text-embedding-3-small",
  "chunking_strategy": "markdown_heading_800_overlap_100",
  "document_count": 482,
  "chunk_count": 3910
}

Before promoting a new index, run retrieval evals and answer-level evals.
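As one example of a retrieval eval, recall@k measures how many of the chunks known to be relevant appear in the top k results. The sketch below assumes each eval case lists retrieved chunk IDs in rank order plus a labeled set of relevant IDs. The promotion rule and its tolerance are illustrative assumptions, and this covers only the retrieval side, not answer-level evals.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        raise ValueError("each eval case needs at least one relevant chunk")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(cases: list[dict], k: int = 5) -> float:
    """Average recall@k over cases shaped like {"retrieved": [...], "relevant": {...}}."""
    if not cases:
        raise ValueError("no eval cases provided")
    return sum(recall_at_k(c["retrieved"], c["relevant"], k) for c in cases) / len(cases)


def should_promote(candidate_recall: float, current_recall: float,
                   tolerance: float = 0.01) -> bool:
    """Promote only if the candidate index is not worse beyond the tolerance."""
    return candidate_recall >= current_recall - tolerance


# Hypothetical eval cases with made-up chunk IDs.
cases = [
    {"retrieved": ["c1", "c7", "c3"], "relevant": {"c1", "c3"}},
    {"retrieved": ["c9", "c2", "c4"], "relevant": {"c2", "c8"}},
]
candidate_recall = mean_recall_at_k(cases, k=3)
```

Running the same cases against the live index and the candidate index gives the two numbers that `should_promote` compares.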

Canary deployments

For production rollout, use canaries or feature flags.

1% traffic → monitor
5% traffic → monitor
25% traffic → monitor
100% traffic → full release

Monitor quality, cost, latency, errors, user feedback, and safety events during rollout.
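A common way to implement the traffic split is deterministic bucketing: hash a stable identifier so each user lands in the same bucket on every request. The sketch below assumes a string user ID is available. The salt value and function names are hypothetical, and real feature-flag services provide their own equivalents.

```python
import hashlib

ROLLOUT_STAGES = [1, 5, 25, 100]  # percent of traffic, from the schedule above


def bucket(user_id: str, salt: str = "support_agent_v1.4.2") -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_canary(user_id: str, percent: int) -> bool:
    """True if this user should get the candidate release at the given stage."""
    return bucket(user_id) < percent


# Count how many of 1,000 hypothetical users fall into each stage.
users = [f"user-{i}" for i in range(1000)]
stage_sizes = {p: sum(in_canary(u, p) for u in users) for p in ROLLOUT_STAGES}
```

Because a user's bucket never changes for a given salt, anyone included at the 1% stage stays included at 5% and 25%, so individual users do not flip between old and new behavior as the rollout widens. Setting the percentage back to 0 removes everyone from the canary.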

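You should be able to roll back a bad prompt release without redeploying the entire application. In practice, that means prompts are versioned and selected at runtime rather than hard-coded into the application.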
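One pattern that supports this is keeping prompt versions immutable and moving an "active version" pointer, so a rollback is a pointer change rather than a deploy. The in-memory `PromptRegistry` below is a hypothetical stand-in for whatever store holds prompts in a real system.

```python
class PromptRegistry:
    """In-memory stand-in for a prompt store with an active-version pointer."""

    def __init__(self) -> None:
        self._versions: dict[str, str] = {}
        self._history: list[str] = []  # activation order, newest last

    def publish(self, version: str, template: str) -> None:
        """Store a new version; published versions are never overwritten."""
        if version in self._versions:
            raise ValueError(f"version {version} already exists")
        self._versions[version] = template

    def activate(self, version: str) -> None:
        """Point live traffic at an already published version."""
        if version not in self._versions:
            raise KeyError(version)
        self._history.append(version)

    def rollback(self) -> str:
        """Reactivate the previously active version and return its name."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    def active_template(self) -> str:
        """Return the template the application should use right now."""
        if not self._history:
            raise RuntimeError("no version has been activated")
        return self._versions[self._history[-1]]


registry = PromptRegistry()
registry.publish("v1", "Answer using only the policy documents.")
registry.publish("v2", "Answer briefly using only the policy documents.")
registry.activate("v1")
registry.activate("v2")
restored = registry.rollback()  # live traffic is back on the earlier prompt
```

The application reads `active_template()` on each request, or on a short cache interval, so the rollback takes effect without a new build.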

Environment separation

Use separate environments:

local development
staging with synthetic data
internal dogfood
limited beta
production

Do not test agent tool permissions for the first time in production. Staging should include realistic tool mocks or sandboxed integrations.
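One way to enforce this is to resolve each tool's mode from the environment and fail closed when the environment is unknown. The mapping below is an assumption for illustration, including the environment names and which stages get live access. `lookup_order` is a hypothetical tool.

```python
from enum import Enum


class ToolMode(Enum):
    MOCK = "mock"        # canned responses, no network calls
    SANDBOX = "sandbox"  # real integration against a test account
    LIVE = "live"        # real integration against production systems


# Hypothetical policy: which tool mode each environment uses.
ENV_TOOL_MODE = {
    "local": ToolMode.MOCK,
    "staging": ToolMode.SANDBOX,
    "dogfood": ToolMode.LIVE,
    "beta": ToolMode.LIVE,
    "production": ToolMode.LIVE,
}


def resolve_tool_mode(environment: str) -> ToolMode:
    """Fail closed: an unknown environment never gets sandbox or live access."""
    return ENV_TOOL_MODE.get(environment, ToolMode.MOCK)


def lookup_order(order_id: str, environment: str) -> dict:
    """Hypothetical tool that returns canned data in mock mode."""
    mode = resolve_tool_mode(environment)
    if mode is ToolMode.MOCK:
        return {"order_id": order_id, "status": "shipped", "source": "mock"}
    # Sandbox and live clients are out of scope for this sketch.
    raise NotImplementedError(f"wire up the {mode.value} integration here")


local_result = lookup_order("A-1001", "local")
```

Keeping the mode decision in one place means the agent code and its permission checks run unchanged in every environment, and only the backing integration differs.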

Practical takeaway

CI/CD for AI systems is ordinary software delivery plus behavior evaluation. Prompts, models, tools, indexes, and policies are all versioned artifacts. Evals become release gates. Canary rollouts and rollback plans protect production users.

The more autonomy your agent has, the more disciplined your release process must be.
