
LLM-as-Judge and Human Evaluation
AGAI 401 · Evaluation for Production Agents
Learn when to use model-based judging, how to write evaluator rubrics, where human review remains necessary, and how tools like Braintrust, LangSmith, and Weights & Biases support evaluation workflows.
Key terms
- LLM-as-judge = scalable qualitative scoring
- rubric quality → judge quality
- pairwise eval compares versions
- human review calibrates automation
Learning objectives
- Explain when LLM-as-judge evaluation is appropriate.
- Write a structured evaluator rubric for agent outputs.
- Identify common LLM judge failure modes.
- Combine deterministic, model-based, and human evaluation.
Some agent behavior can be checked deterministically. Other behavior requires judgment. Was the answer clear? Did it apply the policy correctly? Did it acknowledge uncertainty? Did it resolve the user’s issue? These questions often require human-like evaluation.
Two common approaches are LLM-as-judge and human evaluation.
LLM-as-judge means using a language model to evaluate another model’s output. Human evaluation means trained people review outputs or traces using a rubric. Both are useful, and both have limitations.
When to use LLM-as-judge
LLM judges are useful when:
- You need scalable qualitative evaluation.
- The task has a clear rubric.
- Human review would be too slow or expensive for every run.
- You want quick comparisons between prompt or model versions.
- The stakes are moderate and failures can be audited.
Example judge prompt:
You are evaluating an AI support assistant.
Score the response from 1 to 5 on each criterion:
1. Correctness: Does it answer according to the provided policy?
2. Grounding: Are claims supported by the provided context?
3. Safety: Does it avoid unauthorized actions?
4. Clarity: Is it understandable to a customer?
Return JSON only with scores and brief reasons.
Structured output matters. It makes judge results easier to aggregate.
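As a minimal sketch of why structured output helps, the snippet below validates judge replies and averages the scores per criterion. It assumes a flat JSON shape (one integer per criterion plus a `reasons` string); the prompt above does not fix the shape, so `CRITERIA`, `parse_judge_output`, and `mean_scores` are hypothetical names for illustration.

```python
import json
from statistics import mean

# Assumed criterion names, mirroring the example rubric above.
CRITERIA = ("correctness", "grounding", "safety", "clarity")

def parse_judge_output(raw):
    """Parse one judge reply and check every score is an integer from 1 to 5."""
    data = json.loads(raw)
    for name in CRITERIA:
        score = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid score for {name}: {score!r}")
    return data

def mean_scores(results):
    """Average each criterion across a list of parsed judge results."""
    return {name: mean(r[name] for r in results) for name in CRITERIA}
```

Rejecting malformed replies before aggregation keeps a single bad judge output from silently skewing the averages.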
Example judge call
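import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_response(case, agent_answer, context):
    """Score one agent answer against the rubric and return the parsed JSON."""
    prompt = f"""
Evaluate the assistant answer using the rubric below.

User task:
{case['input']}

Context:
{context}

Assistant answer:
{agent_answer}

Rubric:
- correctness: 1-5
- grounding: 1-5
- safety: 1-5
- clarity: 1-5

Return JSON with keys: correctness, grounding, safety, clarity, overall_pass, reasons.
"""
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        # JSON mode asks the API for a syntactically valid JSON object,
        # so json.loads is less likely to fail on stray prose or code fences.
        response_format={"type": "json_object"},
    )
    # JSON mode does not guarantee the expected keys are present; validate before use.
    return json.loads(response.choices[0].message.content)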
This is a starting point, not a complete evaluation system. You should test the judge itself against human labels.
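One way to test the judge against human labels is to compare pass/fail decisions on the same cases. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance. It assumes both label lists are aligned by case and use the same label values (for example the `overall_pass` boolean); the function names are hypothetical.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of cases where the judge and the human gave the same label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge_labels)
    observed = agreement_rate(judge_labels, human_labels)
    labels = set(judge_labels) | set(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa is worth reporting alongside raw agreement because a judge that passes nearly everything can agree with humans often on an easy eval set while adding little signal.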
LLM judge failure modes
LLM judges can be biased. They may prefer fluent answers over correct ones. They may be too lenient. They may be influenced by answer length. They may fail to notice subtle policy violations. They may share blind spots with the model being evaluated.
Common mitigations:
- Use explicit rubrics.
- Provide reference context.
- Ask for structured scores.
- Calibrate against human judgments.
- Use pairwise comparison for model A vs. model B.
- Use deterministic checks for hard constraints.
- Sample judged outputs, both passes and failures, for human audit.
Never use an LLM judge as the only safety layer for high-impact decisions.
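A rough check for the length effect mentioned above is to correlate answer length with judge score across a batch of judged outputs. This is a sketch under assumptions: each item is a dict with hypothetical `answer` and `score` keys, and length is measured in whitespace-separated words.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

def length_score_correlation(judged):
    """Correlate answer word count with the judge's score."""
    lengths = [len(item["answer"].split()) for item in judged]
    scores = [item["score"] for item in judged]
    return pearson(lengths, scores)
```

A strong positive correlation is a prompt for investigation rather than proof of bias, since longer answers can be genuinely better. Comparing the same correlation on human scores helps separate the two.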
Pairwise evaluation
Pairwise evaluation compares two outputs and asks which is better. This is often easier than assigning absolute scores.
Given the user request, context, and two answers, choose which answer is better according to correctness, grounding, and safety. Return A, B, or Tie with reasons.
Pairwise evaluation is useful for comparing two versions of a system:
- Prompt v1 vs Prompt v2
- Model A vs Model B
- RAG strategy A vs RAG strategy B
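Pairwise verdicts can be summarized as a win rate. The sketch below tallies `A`, `B`, and `Tie` verdicts, matching the return values in the prompt above, and reports how often B won among decided comparisons. The function name and the choice to exclude ties from the denominator are assumptions for illustration.

```python
from collections import Counter

VALID_VERDICTS = {"A", "B", "Tie"}

def summarize_pairwise(verdicts):
    """Count verdicts and compute B's win rate over non-tied comparisons."""
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    decided = counts["A"] + counts["B"]
    return {
        "A": counts["A"],
        "B": counts["B"],
        "Tie": counts["Tie"],
        # None when every comparison was a tie, so there is nothing to divide by.
        "b_win_rate": counts["B"] / decided if decided else None,
    }
```

For example, `summarize_pairwise(["A", "B", "B", "Tie", "B"])` reports one win for A, three for B, one tie, and a B win rate of 0.75.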
Tools such as Braintrust, LangSmith, Weights & Biases Weave, Langfuse, and PromptLayer can help manage runs, scores, comparisons, and traces.
Human evaluation
Human evaluation remains important when:
- The domain is high-stakes.
- The rubric requires domain expertise.
- Legal, medical, financial, or safety implications exist.
- The system is new and judge calibration is weak.
- You need to understand user experience deeply.
Human evaluators should use clear rubrics. Vague feedback like “good” or “bad” is hard to use.
Example human rubric item:
Grounding score:
5 = Every factual claim is supported by provided context.
3 = Most claims are supported, but one minor claim is unsupported.
1 = Major claims are unsupported or contradict context.
Combining evaluation methods
A strong evaluation pipeline often combines:
- Deterministic checks for hard constraints
- LLM judges for scalable qualitative scoring
- Human review for calibration and high-risk cases
- Production monitoring for real-world failures
Example:
1. Run agent on eval suite.
2. Check expected/forbidden tools deterministically.
3. Validate structured output schema.
4. Use LLM judge for clarity and grounding.
5. Send a 10% sample and all high-risk failures to human review.
6. Compare results against previous version.
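Steps 2 through 5 above can be sketched for a single eval case as follows. Everything here is an assumption for illustration: the `case` and `run` dict shapes, the key names, the pass threshold of 3, and the `judge` callable, which stands in for any function that returns a dict of 1-5 scores.

```python
import random

def evaluate_case(case, run, judge, sample_rate=0.10, rng=random):
    """Apply deterministic checks, then the judge, then route to human review."""
    result = {"id": case["id"], "failures": []}

    # Step 2: expected and forbidden tools are checked deterministically.
    called = set(run["tools_called"])
    missing = set(case.get("expected_tools", [])) - called
    forbidden = set(case.get("forbidden_tools", [])) & called
    if missing:
        result["failures"].append(f"missing tools: {sorted(missing)}")
    if forbidden:
        result["failures"].append(f"forbidden tools: {sorted(forbidden)}")

    # Step 3: a minimal schema check on the structured output.
    absent = [k for k in case.get("required_keys", []) if k not in run["output"]]
    if absent:
        result["failures"].append(f"missing output keys: {absent}")

    # Step 4: only spend a judge call when the hard constraints passed.
    if not result["failures"]:
        result["scores"] = judge(case, run)
        if min(result["scores"].values()) < 3:  # assumed pass threshold
            result["failures"].append("judge score below threshold")

    # Step 5: all high-risk failures plus a random sample go to humans.
    high_risk_failure = case.get("high_risk", False) and bool(result["failures"])
    result["needs_human_review"] = high_risk_failure or rng.random() < sample_rate
    return result
```

Running the deterministic checks first follows the principle of using the cheapest reliable evaluator for each criterion. Step 6, comparing against the previous version, would operate on the collected results and is left out of this sketch.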
Practical takeaway
LLM-as-judge is powerful, but it is not magic. It works best with clear rubrics, structured outputs, calibration, and deterministic checks around it. Human evaluation remains essential for high-risk domains, judge calibration, and nuanced product quality.
Production evaluation is a layered system. Use the cheapest reliable evaluator for each criterion, and escalate when judgment or risk requires it.