LLM Evals Academy
Build evaluation systems that catch regressions before your users do. From simple assertions to LLM-as-judge, from eval-driven development to CI/CD integration — with interactive code examples.
Why Evals Matter
The foundation of reliable AI products
LLM evaluations are systematic methods for measuring the quality, reliability, and safety of LLM outputs. They are the difference between shipping a demo and shipping a product.
Traditional software has type systems, unit tests, and integration tests. LLM outputs are non-deterministic, subjective, and context-dependent. Evals are the testing discipline designed for this reality — they give you confidence that your system works, a safety net when you make changes, and data to drive decisions.
LLMs are stochastic systems. The same input can produce different outputs. A prompt that works today may fail tomorrow after a model update. Without evals, every change is a leap of faith — you have no way to know if your system is improving, degrading, or breaking. Users become your QA team, and user complaints become your monitoring system.
Key Insight
Evals are not a quality-assurance step you add at the end. They are the foundation of the entire development workflow. The best AI engineering teams write evals before they write prompts — the same way the best software teams write tests before they write code.
The Cost of Shipping Without Evals
Prompt Regression
A prompt change that improves one use case silently breaks three others.
Impact
Users see degraded quality for days before anyone notices.
Model Provider Update
Your LLM provider ships a new model version. Your app behavior changes overnight.
Impact
No way to quantify the impact or decide whether to pin the old version.
Temperature Tweak
Someone changes temperature from 0.7 to 0.3 to reduce hallucinations.
Impact
Hallucinations drop but creative tasks become robotic. No data to balance the tradeoff.
RAG Index Update
New documents are indexed into the retrieval system with different formatting.
Impact
Retrieval quality shifts but nothing in the pipeline detects the change.
The Eval Mindset
Every prompt change is a hypothesis
Evals turn subjective 'I think this is better' into measurable 'this scores 4.2 vs 3.8 on our rubric.'
Evals compound over time
Each production failure you add to your eval suite makes the next deployment safer. Your eval suite is your institutional knowledge of what can go wrong.
Measure what matters to users
Don't eval for perplexity or BLEU score if your users care about helpfulness and accuracy. Eval metrics should map to user satisfaction.
Automate relentlessly
If a human has to remember to run evals, they won't. Put evals in CI, block merges on regressions, and post results on every PR.
“Evals are the most underinvested part of the AI stack. If you don't have evals, you don't have a product — you have a demo.”
Hamel Husain
AI Engineering Consultant
“The best teams I've seen treat evals as a first-class artifact. They write the eval before writing the prompt, the same way TDD works for software.”
Eugene Yan
Senior Applied Scientist, Amazon
“Without evals, every prompt change is a leap of faith. With evals, it's an experiment with measurable outcomes.”
Jason Wei
Research Scientist, OpenAI
“If you can't measure it, you can't improve it. The teams that ship the most reliable LLM products are the ones with the strongest eval suites.”
Simon Willison
Creator of Datasette, AI Blogger
Types of Evaluations
Functional, quality, safety, and regression evals
Not all evals are created equal. Each type serves a different purpose, runs at a different cadence, and catches a different class of failure. A robust eval strategy layers multiple types to cover the full spectrum of quality dimensions.
The Eval Pyramid — Layer Your Evaluations
Functional evals form the base — they are fast, cheap, and catch the most common failures. Each layer above adds more coverage at higher cost and latency.
Functional Evals
Verify that the output meets hard requirements — correct format, contains required fields, follows instructions. These are the LLM equivalent of unit tests.
Examples
- Output is valid JSON matching a schema
- Response contains all required sections
- Generated SQL query is syntactically valid
- Classification output is one of the allowed labels
When to Use
Always. Functional evals are the minimum baseline for any LLM application. They're fast, deterministic, and catch the most obvious failures.
Code Pattern
// Functional eval: check format and content
expect(output).toMatchSchema(responseSchema);
expect(output.sections).toContain("summary");
expect(output.length).toBeLessThan(maxTokens);

Quality Evals
Assess subjective quality dimensions — helpfulness, clarity, accuracy, tone, completeness. These typically use LLM-as-judge or human reviewers with rubrics.
Examples
- Is the summary accurate and complete?
- Does the response match the requested tone?
- Are the generated instructions easy to follow?
- Does the answer cite relevant sources?
When to Use
For any user-facing output where quality matters. Use LLM-as-judge for fast iteration, human eval for calibration and high-stakes decisions.
Code Pattern
// Quality eval: LLM-as-judge with rubric
const score = await judge.evaluate({
  criteria: ["accuracy", "helpfulness", "clarity"],
  rubric: qualityRubric,
  response: output,
  reference: goldenAnswer,
});

Safety Evals
Test for harmful, biased, or inappropriate outputs. Includes adversarial testing (jailbreaks, prompt injection), bias detection, PII leakage, and content policy compliance.
Examples
- Response doesn't contain PII from training data
- Output doesn't generate harmful instructions
- System resists prompt injection attempts
- Responses don't exhibit demographic bias
When to Use
Before any public deployment. Run adversarial evals during development and continuously in production. Required for regulated industries.
Code Pattern
// Safety eval: adversarial test suite
const adversarialInputs = loadAdversarialSuite();
for (const input of adversarialInputs) {
  const output = await model.generate(input);
  expect(output).not.toMatch(harmfulPatterns);
  expect(output).toPassContentPolicy();
}

Regression Evals
Compare current output quality against a known baseline. Detect when changes to prompts, models, or retrieval systems cause quality to drop on previously-passing cases.
Examples
- Score on existing test suite after prompt change
- Before/after comparison on model version upgrade
- Quality delta after RAG index rebuild
- Performance check after temperature adjustment
When to Use
On every change to any component of the LLM pipeline — prompts, models, retrieval, post-processing. Gate deployments on regression results.
Code Pattern
// Regression eval: compare against baseline
const baseline = await loadBaseline("v2.3");
const current = await runEvalSuite(newPrompt);
const regressions = findRegressions(baseline, current);
assert(regressions.length < threshold);

Comparative Evals
Evaluate two or more variants side-by-side. Used for A/B testing prompts, comparing models, or evaluating system architecture changes. Requires careful methodology to avoid position bias.
Examples
- Prompt A vs Prompt B on the same test suite
- GPT-4o vs Claude Sonnet for a specific use case
- RAG pipeline v1 vs v2 on retrieval quality
- Single-shot vs chain-of-thought on reasoning tasks
When to Use
When choosing between alternatives. Run before committing to a prompt change, model switch, or architecture decision. Randomize ordering to avoid position bias.
Code Pattern
// Comparative eval: head-to-head comparison
const results = await compareVariants({
  variants: [promptA, promptB],
  dataset: evalCases,
  judge: llmJudge,
  randomizeOrder: true, // Avoid position bias
});
console.log(results.winner, results.pValue);

Choosing Your Eval Mix
Start with functional evals — they are the fastest to build and catch the most common failures. Add regression evals as soon as you have a baseline. Layer in quality evals for user-facing outputs. Add safety evals before public deployment. Use comparative evals when making architectural decisions.
Building Eval Datasets
Golden datasets, synthetic data, production sampling
Your evals are only as good as your dataset. A well-constructed eval dataset is the most valuable artifact in your LLM engineering workflow — it encodes your team's knowledge of what “good” looks like and what can go wrong.
Three Sources of Eval Data
Manual Curation
Hand-crafted test cases written by domain experts who understand what good output looks like. The highest quality source but the least scalable.
Pros
- Highest quality and precision
- Tests specific requirements
- Captures domain nuance
Cons
- Expensive and slow to create
- Limited scale
- Author bias
Tip
Start here. 20 well-crafted manual cases beat 500 auto-generated ones for establishing your eval baseline.
Synthetic Generation
Use an LLM to generate test cases from seed examples, templates, or category descriptions. Good for coverage and edge cases, but requires human review.
Pros
- Scales to hundreds of cases quickly
- Good for edge case generation
- Covers categories systematically
Cons
- Quality varies — needs human review
- Can miss real-world patterns
- May have blind spots matching the generator model
Tip
Generate in batches by category, then have a human review and filter. A 50% acceptance rate is typical and acceptable.
Production Sampling
Sample real user inputs and outputs from production logs. The most representative source of how your system is actually used.
Pros
- Reflects real usage patterns
- Captures edge cases you'd never imagine
- Automatically tracks distribution shifts
Cons
- Requires production traffic first
- May contain PII (needs scrubbing)
- Needs labeling after collection
Tip
Stratified sampling is key — sample across user types, input categories, and outcome types (success/failure). Over-sample failures.
Generating Synthetic Test Cases
// Generate synthetic eval cases from seed examples
import { z } from "zod";

const EvalCase = z.object({
  input: z.string(),
  expectedBehavior: z.string(),
  category: z.string(),
  difficulty: z.enum(["easy", "medium", "hard", "adversarial"]),
});

async function generateSyntheticCases(
  seedCases: z.infer<typeof EvalCase>[],
  categories: string[],
  countPerCategory: number,
): Promise<z.infer<typeof EvalCase>[]> {
  const allCases: z.infer<typeof EvalCase>[] = [];
  for (const category of categories) {
    const seeds = seedCases.filter((c) => c.category === category);
    const generated = await llm.generate({
      system: `You are an eval dataset generator. Given seed examples,
generate ${countPerCategory} new test cases for the "${category}" category.

Requirements:
- Each case must be meaningfully different from seeds
- Include a mix of difficulties: easy, medium, hard, adversarial
- Adversarial cases should test edge cases and failure modes
- Expected behavior should describe what a GOOD response does

Respond as a JSON array matching the schema.`,
      messages: [
        {
          role: "user",
          content: `Seed examples:\n${JSON.stringify(seeds, null, 2)}`,
        },
      ],
    });
    const parsed = z.array(EvalCase).parse(JSON.parse(generated));
    allCases.push(...parsed);
  }
  return allCases;
}

Use an LLM to generate test cases from seed examples. Specify categories and difficulty levels. Always have a human review generated cases before adding them to the eval suite — a 50% acceptance rate is typical.
Sampling from Production
// Sample production logs for eval dataset
interface ProductionLog {
  id: string;
  input: string;
  output: string;
  timestamp: string;
  userFeedback?: "positive" | "negative";
  latencyMs: number;
  model: string;
}

async function sampleForEvals(
  logs: ProductionLog[],
  config: {
    totalSamples: number;
    failureOversampling: number; // e.g., 3x
    stratifyBy: "category" | "date" | "model";
  },
): Promise<ProductionLog[]> {
  // Separate successes and failures
  const failures = logs.filter((l) => l.userFeedback === "negative");
  const successes = logs.filter((l) => l.userFeedback !== "negative");

  // Over-sample failures (they're more valuable as eval cases)
  const failureSamples = stratifiedSample(
    failures,
    Math.min(
      failures.length,
      Math.floor(config.totalSamples * 0.4 * config.failureOversampling),
    ),
    config.stratifyBy,
  );
  const successSamples = stratifiedSample(
    successes,
    config.totalSamples - failureSamples.length,
    config.stratifyBy,
  );
  return [...failureSamples, ...successSamples];
}

Over-sample failures — they are far more valuable as eval cases than successes. Use stratified sampling to ensure coverage across user types, input categories, and time periods.
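The sampler above relies on a `stratifiedSample` helper that is not shown. A minimal sketch of one, grouping records by the stratification key and taking an even share from each stratum (deterministic here for illustration; a real implementation would shuffle within each stratum before slicing):

```typescript
// Hypothetical helper assumed by sampleForEvals: group records into strata
// by a key, then take an even share from each stratum.
function stratifiedSample<T extends Record<string, unknown>>(
  items: T[],
  count: number,
  stratifyBy: string,
): T[] {
  // Group items by the value of the stratification key
  const strata = new Map<unknown, T[]>();
  for (const item of items) {
    const key = item[stratifyBy];
    const bucket = strata.get(key) ?? [];
    bucket.push(item);
    strata.set(key, bucket);
  }
  // Take an even share from each stratum (deterministic for illustration;
  // a production version would randomize within each stratum first)
  const perStratum = Math.ceil(count / Math.max(strata.size, 1));
  const sampled: T[] = [];
  for (const bucket of strata.values()) {
    sampled.push(...bucket.slice(0, perStratum));
  }
  return sampled.slice(0, count);
}
```

Without stratification, a naive sample can come entirely from the dominant category; splitting first guarantees each stratum is represented.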
Versioning Your Datasets
// Version and track eval datasets
interface DatasetVersion {
  version: string;
  createdAt: string;
  cases: EvalCase[];
  metadata: {
    totalCases: number;
    bySource: Record<string, number>;
    byCategory: Record<string, number>;
    byDifficulty: Record<string, number>;
  };
  changelog: string;
}

async function saveDatasetVersion(
  cases: EvalCase[],
  changelog: string,
): Promise<DatasetVersion> {
  const version: DatasetVersion = {
    version: `v${Date.now()}`,
    createdAt: new Date().toISOString(),
    cases,
    metadata: {
      totalCases: cases.length,
      bySource: countBy(cases, "source"),
      byCategory: countBy(cases, "category"),
      byDifficulty: countBy(cases, "difficulty"),
    },
    changelog,
  };

  // Store in version control alongside code
  await writeFile(
    `evals/datasets/${version.version}.json`,
    JSON.stringify(version, null, 2),
  );
  // Update the "latest" symlink
  await writeFile(
    "evals/datasets/latest.json",
    JSON.stringify(version, null, 2),
  );
  return version;
}

Edge Case Checklist
The Golden Rule
Your eval dataset should be a living document, not a static artifact. Feed production failures back in weekly. Regenerate synthetic cases quarterly. Audit for coverage gaps monthly. The teams with the best AI products are the ones that invest the most in their eval datasets.
Automated Evaluations
LLM-as-judge, assertions, semantic similarity
Automated evaluations are the backbone of a scalable eval system. They run on every change, catch regressions instantly, and provide quantitative signals for prompt iteration. The best eval systems layer multiple methods — assertions for hard requirements, semantic similarity for meaning, and LLM-as-judge for nuanced quality.
Each method has different cost, speed, and reliability tradeoffs. The key is knowing when to use each one and how to combine them into a pipeline that balances thoroughness with speed.
Three Layers of Automated Evaluation
Assertion-Based Evals
Hard checks against LLM output: format validation, keyword presence, length constraints, schema compliance. The fastest and cheapest eval method — runs in milliseconds with zero LLM calls.
Strengths
- Deterministic — same input always gives same result
- Zero cost — no LLM calls needed
- Fast — runs in milliseconds
- Easy to debug when tests fail
Limitations
- Can't assess subjective quality
- Brittle for free-form text
- Misses semantically correct but differently worded responses
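As a concrete sketch, an assertion runner can be a plain function over the output string. The assertion shape here (`contains`, `maxLength`, `validJson`) is illustrative, not any specific framework's API:

```typescript
// Illustrative assertion types — extend with whatever your outputs require
type Assertion =
  | { type: "contains"; value: string }
  | { type: "maxLength"; value: number }
  | { type: "validJson" };

function runAssertions(
  output: string,
  assertions: Assertion[],
): { passed: boolean; failures: string[] } {
  const failures: string[] = [];
  for (const a of assertions) {
    if (a.type === "contains" && !output.includes(a.value)) {
      failures.push(`missing required text: "${a.value}"`);
    } else if (a.type === "maxLength" && output.length > a.value) {
      failures.push(`output exceeds ${a.value} characters`);
    } else if (a.type === "validJson") {
      try {
        JSON.parse(output); // throws on malformed JSON
      } catch {
        failures.push("output is not valid JSON");
      }
    }
  }
  return { passed: failures.length === 0, failures };
}
```

Because these checks are deterministic string operations, they can run on every generation at effectively zero cost.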
Semantic Similarity
Compare output embeddings against reference answer embeddings using cosine similarity. Catches semantically equivalent responses regardless of exact wording. Cheap and fast — one embedding call per evaluation.
Strengths
- Tolerant of rephrasing and word choice variation
- Cheap — only requires embedding calls, not LLM generation
- Good middle ground between exact match and LLM-as-judge
- Quantitative and reproducible
Limitations
- Doesn't understand nuance or factual correctness
- High similarity doesn't guarantee correctness
- Depends on embedding model quality
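The comparison itself is just cosine similarity over embedding vectors. A sketch, where `embed` stands in for whatever embedding API you use (an assumption, not a specific provider's call):

```typescript
// Cosine similarity between two embedding vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical usage — `embed` is an injected call to your embedding provider
async function semanticSimilarityEval(
  output: string,
  reference: string,
  embed: (text: string) => Promise<number[]>,
  threshold = 0.75,
): Promise<{ score: number; passed: boolean }> {
  const [a, b] = await Promise.all([embed(output), embed(reference)]);
  const score = cosineSimilarity(a, b);
  return { score, passed: score >= threshold };
}
```

The threshold is something to calibrate on your own data; 0.75 here is only a placeholder.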
LLM-as-Judge
Use a strong LLM to evaluate another LLM's output against a rubric. The most flexible eval method — can assess nuanced quality dimensions like helpfulness, accuracy, and tone. Requires calibration to be reliable.
Strengths
- Evaluates subjective quality dimensions
- Flexible — adapts to any rubric or criteria
- Correlates well with human judgment when calibrated
- Scales to thousands of evaluations
Limitations
- Has systematic biases (verbosity, position, self-preference)
- Non-deterministic — scores vary between runs
- Expensive — requires LLM call per evaluation
- Requires careful calibration against human labels
Implementing LLM-as-Judge
The key to reliable LLM-as-judge is calibration. Provide concrete examples of what each score level looks like. Set temperature to 0 for reproducibility. Require step-by-step reasoning before scoring — this forces the judge to justify its evaluation and improves consistency.
// LLM-as-Judge with calibration and bias mitigation
import { z } from "zod";

const JudgmentSchema = z.object({
  reasoning: z.string(),
  scores: z.object({
    accuracy: z.number().min(1).max(5),
    helpfulness: z.number().min(1).max(5),
    safety: z.number().min(1).max(5),
  }),
  overall: z.number().min(1).max(5),
});
type Judgment = z.infer<typeof JudgmentSchema>;

async function llmJudge(
  response: string,
  reference: string,
  rubric: string,
  calibrationExamples: { response: string; score: number }[],
): Promise<Judgment> {
  // Include calibration examples to anchor scoring
  const calibration = calibrationExamples
    .map((ex) => `Response: "${ex.response}" → Score: ${ex.score}/5`)
    .join("\n");

  const result = await llm.generate({
    model: "gpt-4o",
    temperature: 0, // Reduce variance between runs
    system: `You are an expert evaluator. Score responses 1-5.

## Rubric
${rubric}

## Calibration Examples (use these to anchor your scale)
${calibration}

## Reference Answer
${reference}

## Instructions
1. Write step-by-step reasoning BEFORE scoring
2. Score each dimension independently
3. Overall = weighted average (accuracy 40%, helpfulness 40%, safety 20%)
4. Output valid JSON matching the schema.`,
    messages: [
      { role: "user", content: `Evaluate this response:\n"${response}"` },
    ],
  });
  return JudgmentSchema.parse(JSON.parse(result));
}

Combined Eval Pipeline
The most effective eval systems layer all three methods. Assertions run first as a fast gate — if they fail, skip the expensive LLM calls. Semantic similarity provides a cheap quality signal. LLM-as-judge adds nuanced evaluation for cases that pass the first two layers.
// Combined eval pipeline: assertions + similarity + judge
interface EvalResult {
  assertions: { passed: boolean; failures: string[] };
  similarity: { score: number; threshold: number };
  judge: { overall: number; reasoning: string };
  finalScore: number;
  passed: boolean;
}

async function evaluate(
  response: string,
  testCase: EvalCase,
): Promise<EvalResult> {
  // Layer 1: Fast assertions (deterministic, free)
  const assertions = runAssertions(response, testCase.assertions);

  // Short-circuit: if assertions fail, skip expensive evals
  if (!assertions.passed) {
    return {
      assertions,
      similarity: { score: 0, threshold: 0.75 },
      judge: { overall: 0, reasoning: "Skipped — assertions failed" },
      finalScore: 0,
      passed: false,
    };
  }

  // Layer 2: Semantic similarity (cheap, fast)
  const similarity = await computeSimilarity(
    response,
    testCase.referenceAnswer,
  );

  // Layer 3: LLM-as-judge (expensive, nuanced)
  const judge = await llmJudge(
    response,
    testCase.referenceAnswer,
    testCase.rubric,
    CALIBRATION_SET,
  );

  // Weighted final score
  const finalScore =
    (assertions.passed ? 0.2 : 0) +
    similarity.score * 0.3 +
    (judge.overall / 5) * 0.5;

  return {
    assertions,
    similarity,
    judge,
    finalScore,
    passed: finalScore >= testCase.threshold,
  };
}

Known LLM Judge Biases
Verbosity Bias
Judges rate longer responses higher regardless of quality. A 500-word response often scores higher than a 100-word response even when the short one is better.
Mitigation
Include 'penalize unnecessary verbosity' in your rubric. Add calibration examples where concise answers score highest.
Position Bias
In A/B comparisons, judges consistently prefer whichever option is presented first. This can skew comparative evaluations by 15-20%.
Mitigation
Randomize option ordering. Run each comparison twice with swapped positions. Average the results.
Self-Preference Bias
GPT-4 judges rate GPT-4 outputs higher. Claude judges prefer Claude outputs. The judge model favors its own family's style.
Mitigation
Use a different model as judge than the one being evaluated. Or use multiple judge models and average.
Anchoring Bias
If the judge sees a reference answer first, it anchors on that specific phrasing and penalizes valid alternatives.
Mitigation
Use rubric-based evaluation instead of reference comparison when possible. Define quality criteria abstractly.
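The position-bias mitigation (swap the order, judge again, average) can be sketched as a wrapper around any pairwise judge. Here `judge` is an injected scoring function returning the probability that its first argument wins, which is an assumption about your judge's interface rather than a specific library API:

```typescript
// Debiased pairwise comparison: run both orderings and average,
// so a judge's preference for the first position cancels out.
function compareDebiased(
  a: string,
  b: string,
  judge: (first: string, second: string) => number, // returns P(first wins)
): { winRateA: number; winner: "A" | "B" | "tie" } {
  const aFirst = judge(a, b); // A occupies the favored first slot
  const bFirst = judge(b, a); // B occupies the favored first slot
  // Average A's win probability across both orderings
  const winRateA = (aFirst + (1 - bFirst)) / 2;
  const winner = winRateA > 0.5 ? "A" : winRateA < 0.5 ? "B" : "tie";
  return { winRateA, winner };
}
```

A useful sanity check: a judge that always prefers whichever option comes first produces winRateA = 0.5 under this scheme, so pure position bias yields a tie instead of a spurious winner.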
The Eval Pipeline Formula
Start with assertions for format and content requirements. Add semantic similarity as a cheap quality check. Layer LLM-as-judge for nuanced evaluation with calibrated rubrics. Short-circuit early — if assertions fail, skip the expensive layers. This gives you thoroughness where it matters and speed where it doesn't.
Human Evaluation
When and how to involve human reviewers
Automated evals scale, but human evaluation is the ground truth. LLM judges have known biases — they prefer verbose responses, exhibit position bias, and rate their own model's outputs higher. Human evals calibrate your automated systems and catch issues that no metric can quantify.
When to Involve Human Reviewers
Calibrating LLM-as-Judge
Before trusting an LLM judge, validate its scores against human labels on 50-100 cases. If correlation is low, the judge prompt needs rework.
Subjective Quality Assessment
Tasks where 'quality' is genuinely subjective — creative writing, tone matching, brand voice. No automated metric captures this well.
Safety-Critical Applications
Medical, legal, financial advice. Automated evals catch format issues but humans catch dangerous misinformation.
New Product Launch
Before the first public release, human reviewers catch issues that no eval suite anticipated. Use this feedback to bootstrap your automated evals.
Adversarial Testing
Red-teaming for safety evals. Creative humans find attack vectors that automated adversarial generators miss.
Periodic Audits
Even with good automated evals, run quarterly human audits on a production sample to catch systematic blind spots.
Writing Annotation Guidelines
The most common failure in human eval is vague guidelines. “Rate quality from 1-5” means different things to different annotators. Good guidelines define every score level with concrete examples and handle edge cases explicitly.
// Annotation guidelines template
interface AnnotationGuideline {
  taskDescription: string;
  dimensions: {
    name: string;
    description: string;
    scale: { value: number; label: string; examples: string }[];
  }[];
  edgeCases: { scenario: string; guidance: string }[];
}

const summaryEvalGuideline: AnnotationGuideline = {
  taskDescription: `Evaluate the quality of an AI-generated summary.
Read the source document, then rate the summary on each dimension.`,
  dimensions: [
    {
      name: "Accuracy",
      description: "Are all facts in the summary correct per the source?",
      scale: [
        { value: 1, label: "Major errors", examples: "States wrong numbers, inverts conclusions" },
        { value: 2, label: "Minor errors", examples: "Slightly wrong dates, imprecise wording" },
        { value: 3, label: "Mostly accurate", examples: "Key facts correct, minor omissions" },
        { value: 4, label: "Accurate", examples: "All facts verifiable against source" },
        { value: 5, label: "Perfectly accurate", examples: "Every claim traceable to source" },
      ],
    },
    {
      name: "Completeness",
      description: "Does the summary cover all key points from the source?",
      scale: [
        { value: 1, label: "Missing critical info", examples: "Main conclusion absent" },
        { value: 2, label: "Significant gaps", examples: "2+ key points missing" },
        { value: 3, label: "Adequate coverage", examples: "Main points present, details missing" },
        { value: 4, label: "Good coverage", examples: "All key points, most details" },
        { value: 5, label: "Comprehensive", examples: "All points covered proportionally" },
      ],
    },
    {
      name: "Conciseness",
      description: "Is the summary appropriately brief without losing meaning?",
      scale: [
        { value: 1, label: "Far too long", examples: "Longer than source, excessive repetition" },
        { value: 2, label: "Too verbose", examples: "Could be 50% shorter without loss" },
        { value: 3, label: "Acceptable", examples: "Some unnecessary content" },
        { value: 4, label: "Concise", examples: "Tight writing, minimal fluff" },
        { value: 5, label: "Perfectly concise", examples: "Every word earns its place" },
      ],
    },
  ],
  edgeCases: [
    {
      scenario: "Summary adds information not in the source",
      guidance: "Score Accuracy as 1 regardless of whether the added info is correct.",
    },
    {
      scenario: "Summary is a single sentence",
      guidance: "Can still score high on Conciseness if it captures the core message.",
    },
  ],
};

Measuring Annotator Agreement
If two annotators disagree frequently, your guidelines are ambiguous. Use Cohen's Kappa to measure inter-rater reliability. A kappa below 0.6 means your guidelines need revision before the annotations are useful.
// Calculate inter-rater reliability (Cohen's Kappa)
function cohensKappa(
  rater1: number[],
  rater2: number[],
): { kappa: number; interpretation: string } {
  const n = rater1.length;
  const categories = [...new Set([...rater1, ...rater2])];

  // Observed agreement
  let agreed = 0;
  for (let i = 0; i < n; i++) {
    if (rater1[i] === rater2[i]) agreed++;
  }
  const observedAgreement = agreed / n;

  // Expected agreement by chance
  let expectedAgreement = 0;
  for (const cat of categories) {
    const p1 = rater1.filter((r) => r === cat).length / n;
    const p2 = rater2.filter((r) => r === cat).length / n;
    expectedAgreement += p1 * p2;
  }

  const kappa = (observedAgreement - expectedAgreement) / (1 - expectedAgreement);
  const interpretation =
    kappa >= 0.81 ? "Almost perfect agreement" :
    kappa >= 0.61 ? "Substantial agreement" :
    kappa >= 0.41 ? "Moderate agreement" :
    kappa >= 0.21 ? "Fair agreement" :
    "Poor agreement — revise guidelines";
  return { kappa, interpretation };
}

// Usage: validate annotator consistency
const rater1Scores = [5, 4, 3, 5, 2, 4, 3, 5, 4, 3];
const rater2Scores = [5, 4, 4, 5, 2, 3, 3, 5, 4, 3];
const { kappa, interpretation } = cohensKappa(rater1Scores, rater2Scores);
// kappa: 0.72 — "Substantial agreement"

Crowd Workers vs. Domain Experts
Crowd Workers
Pros
- Cheap ($0.10-$1 per annotation)
- Fast (hundreds of labels per day)
- Good for general quality, formatting, clarity
Cons
- Can't assess factual accuracy in specialized domains
- Quality varies — need attention checks and redundancy
Domain Experts
Pros
- Can verify factual correctness
- Catch subtle domain-specific errors
- Higher trust for safety-critical evaluations
Cons
- Expensive ($50-$200/hr)
- Limited availability and throughput
Best practice: Use crowd workers for general quality evaluation and formatting checks. Reserve domain experts for factual accuracy verification and safety-critical assessments. Use expert labels to calibrate your LLM-as-judge, then scale with automated evals.
Cost Optimization Strategy
Human eval is expensive. Optimize by using it strategically: label 100 cases with experts to calibrate your LLM judge, then run the LLM judge on 10,000 cases. Periodically spot-check the LLM judge against new human labels to detect drift. This gives you expert-quality evaluation at automated-eval prices.
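The periodic spot-check can be as simple as computing a correlation coefficient over paired judge and human scores on the same cases. A sketch using Pearson's r; the 0.7 recalibration threshold is illustrative, not a standard:

```typescript
// Pearson correlation between judge scores and human labels on the same cases
function pearson(judgeScores: number[], humanScores: number[]): number {
  const n = judgeScores.length;
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / n;
  const mj = mean(judgeScores);
  const mh = mean(humanScores);
  let cov = 0, varJ = 0, varH = 0;
  for (let i = 0; i < n; i++) {
    cov += (judgeScores[i] - mj) * (humanScores[i] - mh);
    varJ += (judgeScores[i] - mj) ** 2;
    varH += (humanScores[i] - mh) ** 2;
  }
  return cov / Math.sqrt(varJ * varH);
}

// Illustrative drift check: flag the judge for recalibration when its
// agreement with fresh human labels drops below a chosen threshold
function judgeNeedsRecalibration(
  judgeScores: number[],
  humanScores: number[],
  threshold = 0.7, // a judgment call, tune per application
): boolean {
  return pearson(judgeScores, humanScores) < threshold;
}
```

When the flag fires, refresh the judge's calibration examples with the new human labels rather than silently trusting the drifted scores.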
Eval Frameworks
promptfoo, Braintrust, LangSmith, custom solutions
You don't have to build your eval system from scratch. Several mature frameworks provide dataset management, scoring functions, CI/CD integration, and result visualization out of the box. The right choice depends on your stack, team size, and specific requirements.
The most important thing is to start evaluating — any framework is better than no framework. You can always migrate later. Pick the tool that has the lowest friction for your team today.
Framework Comparison
promptfoo
CLI-first eval framework that lets you define test cases in YAML, run evals against multiple providers, and compare results side-by-side. Excellent for prompt iteration and CI/CD integration.
Key Features
- YAML-based test case definition
- Multi-provider comparison (OpenAI, Anthropic, local models)
- Built-in assertion types (contains, similar, llm-rubric)
- CI/CD integration with GitHub Actions
- Web UI for viewing results
- Red-teaming and adversarial testing
Best For
Teams that want a fast, config-driven eval workflow. Best for prompt comparison and regression testing. Strongest CI/CD integration.
Pricing
Free and open-source. Cloud dashboard available.
Braintrust
Production-grade eval and observability platform with dataset management, experiment tracking, scoring functions, and real-time logging. Used by Scale AI, Notion, and Stripe.
Key Features
- Managed dataset storage and versioning
- Custom scoring functions in TypeScript/Python
- Experiment tracking with automatic comparison
- Real-time production logging and monitoring
- Human annotation workflows
- SDK for TypeScript, Python, and REST API
Best For
Teams that need end-to-end eval infrastructure. Best for organizations that want managed dataset management and production monitoring alongside evals.
Pricing
Free tier available. Paid plans for teams.
LangSmith
Tracing and evaluation platform from the LangChain team. Deep integration with LangChain/LangGraph, but works with any LLM framework. Strong tracing and debugging capabilities.
Key Features
- End-to-end tracing of LLM chains and agents
- Dataset management with annotation queues
- Built-in and custom evaluators
- Comparison experiments
- Online evaluation (production monitoring)
- Deep LangChain/LangGraph integration
Best For
Teams already using LangChain or LangGraph. Best for complex agent workflows that need detailed tracing alongside evaluation.
Pricing
Free tier for developers. Paid plans for teams.
Custom Solution
Build your own eval framework tailored to your specific needs. Full control over every aspect of the eval pipeline. Best when existing tools don't fit your workflow or you need deep integration with internal systems.
Key Features
- Complete control over eval logic
- Deep integration with your stack
- Custom scoring functions for domain-specific needs
- No vendor lock-in
- Custom reporting and dashboards
- Optimized for your specific use case
Best For
Teams with unique eval requirements, heavy compliance needs, or existing internal tooling. Best when you need full control and can invest engineering time.
Pricing
Engineering time. Typically 2-4 weeks for a basic framework.
promptfoo: Config-Driven Evals
promptfoo uses a YAML config to define prompts, providers, test cases, and assertions. Run with npx promptfoo eval to compare prompts side-by-side across multiple models.
# promptfoo config: promptfooconfig.yaml
prompts:
  - id: current
    raw: "You are a helpful assistant. Answer: {{query}}"
  - id: candidate
    raw: |
      You are a customer support agent for Acme Corp.
      Answer the user's question accurately and concisely.
      If unsure, say so rather than guessing.
      Question: {{query}}

providers:
  - openai:gpt-4o
  - anthropic:messages:claude-sonnet-4-20250514

tests:
  - vars:
      query: "What is your refund policy?"
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "Response accurately describes the refund policy"
      - type: not-contains
        value: "I don't know"
  - vars:
      query: "How do I cancel my subscription?"
    assert:
      - type: contains
        value: "settings"
      - type: similar
        value: "Go to Settings > Subscription > Cancel"
        threshold: 0.7
  - vars:
      query: "Can I get a refund for a digital purchase?"
    assert:
      - type: llm-rubric
        value: "Response correctly states digital purchases are credit-only"
      - type: javascript
        value: "output.length < 500"

Braintrust: Programmatic Evals
Braintrust provides a TypeScript/Python SDK for defining eval tasks with custom scoring functions. Results are tracked as experiments with automatic comparison against previous runs.
// Braintrust eval with custom scoring
import { Eval } from "braintrust";

Eval("customer-support-qa", {
  data: () => loadDataset("datasets/support-evals.jsonl"),
  task: async (input) => {
    const response = await llm.generate({
      system: SYSTEM_PROMPT,
      messages: [{ role: "user", content: input.query }],
    });
    return response;
  },
  scores: [
    // Built-in scorer: factual accuracy via LLM judge
    Factuality,
    // Custom scorer: check required content
    (args) => {
      const hasRequiredInfo = args.input.requiredKeywords
        .every((kw: string) => args.output.toLowerCase().includes(kw));
      return {
        name: "contains_required_info",
        score: hasRequiredInfo ? 1 : 0,
      };
    },
    // Custom scorer: response length check
    (args) => ({
      name: "conciseness",
      score: args.output.length < 500 ? 1 : args.output.length < 800 ? 0.5 : 0,
    }),
  ],
});

Custom: Build What You Need
A custom eval framework can be surprisingly simple — a dataset loader, a set of evaluators, and a report generator. Start minimal and add complexity as your needs grow.
// Custom eval framework — minimal but effective
interface EvalConfig {
  name: string;
  dataset: string;
  evaluators: Evaluator[];
  thresholds: { overall: number; perCategory: Record<string, number> };
}

interface Evaluator {
  name: string;
  weight: number; // weights should sum to 1 so `overall` stays on a 0-1 scale
  evaluate: (input: string, output: string, expected?: string) => Promise<number>;
}

async function runEvalSuite(config: EvalConfig): Promise<EvalReport> {
  const dataset = await loadDataset(config.dataset);
  const results: EvalResult[] = [];
  for (const testCase of dataset) {
    const output = await generateResponse(testCase.input);
    const scores: Record<string, number> = {};
    for (const evaluator of config.evaluators) {
      scores[evaluator.name] = await evaluator.evaluate(
        testCase.input,
        output,
        testCase.expected,
      );
    }
    // Weighted sum of per-evaluator scores
    const overall = config.evaluators.reduce(
      (sum, ev) => sum + scores[ev.name] * ev.weight,
      0,
    );
    results.push({ testCase: testCase.id, scores, overall, output });
  }
  return generateReport(results, config.thresholds);
}

How to Choose
Need fast prompt comparison in CI?
promptfoo — config-driven, CLI-first, excellent CI/CD integration.
Need managed datasets and production monitoring?
Braintrust — end-to-end platform with experiment tracking and logging.
Already using LangChain and need tracing?
LangSmith — deep integration with the LangChain ecosystem plus evaluation.
Have unique requirements or compliance needs?
Custom solution — full control, no vendor lock-in, tailored to your workflow.
Regression Testing for LLMs
Catch quality drops before they ship
Regression testing is the practice of comparing new LLM outputs against a known baseline to detect quality drops before they reach production. It is the most critical eval practice for teams that ship frequently — every prompt change, model update, or parameter tweak can silently degrade quality.
In traditional software, unit tests prevent code regressions. In LLM systems, eval baselines serve the same purpose. Without them, you are flying blind every time you deploy.
What Causes LLM Regressions?
Prompt Changes
Modifying system prompts, few-shot examples, or instruction formatting. The most common source of regressions — fixing one edge case often breaks three others.
Model Version Updates
LLM providers ship new model versions that change behavior. GPT-4o-2024-08-06 may behave differently from GPT-4o-2024-05-13 in subtle ways.
RAG Index Changes
Adding, removing, or re-indexing documents in your retrieval system. Chunk size changes, embedding model updates, or metadata schema changes.
Parameter Tuning
Adjusting temperature, top-p, max tokens, or other inference parameters. Small parameter changes can cascade into large behavior shifts.
Tool/Function Updates
Changing tool descriptions, adding new tools, modifying tool output schemas. Tool changes alter what the model 'sees' and how it decides to act.
Key Insight
Track p5 (5th percentile) in addition to mean score. The mean can improve while worst-case performance silently degrades. A prompt change that improves 80% of cases but makes 5% catastrophically worse is a net negative — and mean score alone won't catch it.
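A toy illustration of this insight, with made-up score arrays: the candidate's mean beats the baseline's, yet its 5th percentile is worse.

```typescript
// Illustrative only: mean improves while p5 (worst-case) degrades.
// Scores are hypothetical eval results on a 0-1 scale.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function p5(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length * 0.05)];
}

const baseline = [0.6, 0.7, 0.7, 0.8, 0.8, 0.8, 0.9, 0.9, 0.9, 0.9];
// Candidate: most cases improved, but one case collapsed.
const candidate = [0.2, 0.8, 0.8, 0.9, 0.9, 0.9, 0.9, 1.0, 1.0, 1.0];

console.log(mean(candidate) > mean(baseline)); // true — mean went up (0.84 vs 0.80)
console.log(p5(candidate) < p5(baseline));     // true — worst case got worse (0.2 vs 0.6)
```

A dashboard that only plots the mean would report this change as a win.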
Creating and Managing Baselines
A baseline is a snapshot of your system's eval scores at a known-good state. Every candidate change is compared against this baseline. When a candidate passes, it becomes the new baseline for future comparisons.
// Baseline management for regression detection
interface Baseline {
  id: string;
  promptVersion: string;
  modelVersion: string;
  timestamp: string;
  results: Map<string, CaseResult>;
  aggregates: {
    mean: number;
    median: number;
    p5: number;  // 5th percentile — worst-case performance
    p95: number; // 95th percentile — best-case performance
    passRate: number;
  };
}

interface CaseResult {
  caseId: string;
  score: number;
  passed: boolean;
  output: string;
  evalDetails: Record<string, number>;
}

async function createBaseline(
  promptVersion: string,
  dataset: EvalCase[],
): Promise<Baseline> {
  const results = new Map<string, CaseResult>();
  for (const testCase of dataset) {
    const output = await generateResponse(testCase.input, promptVersion);
    const evalResult = await evaluate(output, testCase);
    results.set(testCase.id, {
      caseId: testCase.id,
      score: evalResult.score,
      passed: evalResult.passed,
      output,
      evalDetails: evalResult.details,
    });
  }
  const scores = [...results.values()].map((r) => r.score).sort((a, b) => a - b);
  return {
    id: `baseline-${Date.now()}`,
    promptVersion,
    modelVersion: getCurrentModelVersion(),
    timestamp: new Date().toISOString(),
    results,
    aggregates: {
      mean: average(scores),
      median: scores[Math.floor(scores.length / 2)],
      p5: scores[Math.floor(scores.length * 0.05)],
      p95: scores[Math.floor(scores.length * 0.95)],
      passRate: [...results.values()].filter((r) => r.passed).length / results.size,
    },
  };
}

Detecting Regressions
Compare each test case individually against the baseline. Track improvements, regressions, and unchanged cases. Use configurable thresholds to determine what counts as a regression and how many are acceptable before blocking deployment.
// Regression detection: compare candidate against baseline
interface RegressionReport {
  improved: { caseId: string; oldScore: number; newScore: number }[];
  regressed: { caseId: string; oldScore: number; newScore: number }[];
  unchanged: string[];
  newCases: string[]; // Cases in candidate but not in baseline
  aggregateDelta: {
    mean: number;
    p5: number;
    passRate: number;
  };
  verdict: "pass" | "fail" | "review";
}

async function detectRegressions(
  baseline: Baseline,
  candidateResults: Map<string, CaseResult>,
  config: {
    regressionThreshold: number; // Score drop that counts as regression
    maxRegressions: number;      // Max allowed regressions
    minP5: number;               // Minimum acceptable p5 score
  },
): Promise<RegressionReport> {
  const report: RegressionReport = {
    improved: [],
    regressed: [],
    unchanged: [],
    newCases: [],
    aggregateDelta: { mean: 0, p5: 0, passRate: 0 },
    verdict: "pass",
  };
  for (const [caseId, candidateResult] of candidateResults) {
    const baselineResult = baseline.results.get(caseId);
    if (!baselineResult) {
      report.newCases.push(caseId);
      continue;
    }
    const delta = candidateResult.score - baselineResult.score;
    if (delta > config.regressionThreshold) {
      report.improved.push({
        caseId, oldScore: baselineResult.score, newScore: candidateResult.score,
      });
    } else if (delta < -config.regressionThreshold) {
      report.regressed.push({
        caseId, oldScore: baselineResult.score, newScore: candidateResult.score,
      });
    } else {
      report.unchanged.push(caseId);
    }
  }
  // Calculate aggregate deltas
  const candidateScores = [...candidateResults.values()]
    .map((r) => r.score)
    .sort((a, b) => a - b);
  const candidateP5 = candidateScores[Math.floor(candidateScores.length * 0.05)];
  report.aggregateDelta = {
    mean: average(candidateScores) - baseline.aggregates.mean,
    p5: candidateP5 - baseline.aggregates.p5,
    passRate: (candidateResults.size > 0
      ? [...candidateResults.values()].filter((r) => r.passed).length / candidateResults.size
      : 0) - baseline.aggregates.passRate,
  };
  // Determine verdict
  if (report.regressed.length > config.maxRegressions) {
    report.verdict = "fail";
  } else if (candidateP5 < config.minP5) {
    report.verdict = "fail";
  } else if (report.regressed.length > 0) {
    report.verdict = "review"; // Some regressions but within tolerance
  }
  return report;
}

Quality Gate Recommendations
Hard Block
- p5 score drops below threshold
- Any safety eval fails
- Pass rate drops more than 5%
- More than 5 individual regressions
Requires Review
- 1-5 individual regressions
- Mean score drops but p5 holds
- New test cases with low scores
- Significant output style changes
Auto-Approve
- Zero regressions
- Mean and p5 improved or stable
- Pass rate maintained or improved
- All safety evals pass
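The quantifiable rules from these three tiers can be collapsed into a single gate function. This is a sketch, not a prescribed implementation: the thresholds (5% pass-rate drop, 5 regressions) are the ones suggested above, and the field names are illustrative. Judgment-based criteria like "significant style changes" still need a human.

```typescript
// Sketch of the quality gate tiers as a single verdict function.
type Verdict = "block" | "review" | "approve";

interface GateInput {
  p5: number;            // candidate 5th-percentile score
  minP5: number;         // hard floor for p5
  safetyFailures: number;
  passRateDelta: number; // candidate passRate minus baseline passRate
  regressions: number;   // count of individually regressed cases
  meanDelta: number;     // candidate mean minus baseline mean
}

function qualityGate(g: GateInput): Verdict {
  // Hard block: worst-case quality, safety, or pass rate falls too far
  if (g.p5 < g.minP5) return "block";
  if (g.safetyFailures > 0) return "block";
  if (g.passRateDelta < -0.05) return "block";
  if (g.regressions > 5) return "block";
  // Requires review: some regressions, or mean drops while p5 holds
  if (g.regressions > 0) return "review";
  if (g.meanDelta < 0) return "review";
  // Auto-approve: zero regressions, metrics stable or improved
  return "approve";
}
```

Usage: `qualityGate({ p5: 0.82, minP5: 0.7, safetyFailures: 0, passRateDelta: 0.01, regressions: 0, meanDelta: 0.02 })` returns `"approve"`.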
The Regression Testing Workflow
Establish a baseline at your current known-good state
Run the same eval suite against the candidate change
Compare case-by-case and aggregate metrics against the baseline
Block deployment if regressions exceed thresholds
On successful deployment, promote candidate scores as the new baseline
Repeat for every change to prompts, models, retrieval, or parameters
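The last two steps of this workflow can be sketched as a small gate-and-promote helper. All names here (`GateReport`, `BaselineStore`, `finalizeCandidate`) are illustrative; the verdict values mirror the regression report shown earlier. The key property: only a candidate that passed and actually deployed becomes the new baseline.

```typescript
// Sketch of steps 4-5: block on the verdict, deploy, then promote the
// candidate's results as the new baseline for future comparisons.
interface GateReport {
  verdict: "pass" | "fail" | "review";
}

interface BaselineStore {
  promote(candidateId: string): void; // candidate results become the new baseline
}

function finalizeCandidate(
  report: GateReport,
  candidateId: string,
  store: BaselineStore,
  deploy: (id: string) => void,
): boolean {
  if (report.verdict === "fail") return false;   // regressions exceed thresholds: block
  if (report.verdict === "review") return false; // within tolerance, but needs human sign-off
  deploy(candidateId);        // ship the change
  store.promote(candidateId); // only a deployed candidate becomes the baseline
  return true;
}
```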
Eval-Driven Development
Write the eval first, then improve the prompt
Eval-Driven Development (EDD) is the LLM equivalent of Test-Driven Development. The core idea: write the eval before you fix the prompt, then iterate until the eval passes. This prevents goalpost-shifting, captures institutional knowledge, and ensures every improvement is measurable.
Just as TDD transformed software quality by making tests a first-class artifact, EDD transforms LLM product quality by making evals the foundation of the development workflow — not an afterthought.
The EDD Cycle
Identify the Problem
Start with a concrete quality issue — user complaints, production failures, or a new capability requirement. Define what 'fixed' looks like in measurable terms.
Write the Eval Cases
Before touching the prompt, create eval cases that test the specific problem. Include positive examples (what good looks like), negative examples (current failure), and edge cases.
Measure the Baseline
Run the new eval cases against the current prompt to establish a baseline score. This quantifies how bad the problem actually is and sets a target for improvement.
Iterate with Eval Feedback
Modify the prompt and run the eval suite after each change. The eval score is your compass — it tells you if each change helped, hurt, or had no effect.
Run Full Regression Suite
Once the new eval passes, run the full regression suite to ensure you didn't break anything else. Only deploy if both the new eval and all existing evals pass.
Promote to Permanent Suite
Add the new eval cases to your permanent regression suite. They become a safety net that prevents this specific problem from ever recurring.
Key Insight
The most important moment in EDD is Step 2 — writing the eval before touching the prompt. This forces you to define “done” objectively, prevents the goalpost from shifting during iteration, and creates permanent regression tests that protect future changes.
EDD vs. Ad-Hoc Prompt Engineering
| Aspect | Ad-Hoc | Eval-Driven |
|---|---|---|
| Approach | Change prompt, eyeball a few examples | Write eval cases, measure, iterate with data |
| Confidence | "I think it's better" | "Score improved 3.2 → 4.1, zero regressions" |
| Regression Risk | Unknown — discovered by users | Quantified — blocked by automated checks |
| Knowledge Capture | In engineer's head, lost when they leave | In eval suite, permanent institutional knowledge |
| Iteration Speed | Fast at first, slow when things break | Slightly slower at first, compounds over time |
| Scalability | Breaks at 10+ prompt variations | Scales to hundreds of test cases |
The EDD Workflow in Code
// Eval-Driven Development workflow
interface EDDConfig {
  issue: string;
  evalCases: EvalCase[];
  currentPrompt: string;
  maxIterations: number;
  targetScore: number;
}

async function evalDrivenDevelopment(config: EDDConfig) {
  // Step 1: Add new eval cases to the suite
  const fullSuite = await loadEvalDataset("regression-suite");
  const expandedSuite = [...fullSuite, ...config.evalCases];
  // Step 2: Measure baseline on new cases
  const baselineResults = await runEvals(config.currentPrompt, config.evalCases);
  console.log(`Baseline: ${baselineResults.passRate * 100}% pass rate`);
  console.log(`Target: ${config.targetScore * 100}% pass rate`);
  // Step 3: Iterate with eval feedback
  let bestPrompt = config.currentPrompt;
  let bestScore = baselineResults.overallScore;
  const history: { prompt: string; score: number; delta: number }[] = [];
  for (let i = 0; i < config.maxIterations; i++) {
    // Generate a prompt variant (can be manual or LLM-assisted)
    const candidate = await generatePromptVariant(bestPrompt, {
      issue: config.issue,
      failingCases: baselineResults.failures,
      iteration: i,
    });
    // Run eval on new cases
    const candidateResults = await runEvals(candidate, config.evalCases);
    // Run full regression suite to check for side effects
    const regressionResults = await runEvals(candidate, fullSuite);
    const improved = candidateResults.overallScore > bestScore;
    const noRegressions = regressionResults.regressions.length === 0;
    history.push({
      prompt: candidate,
      score: candidateResults.overallScore,
      delta: candidateResults.overallScore - bestScore,
    });
    if (improved && noRegressions) {
      bestPrompt = candidate;
      bestScore = candidateResults.overallScore;
      console.log(`Iteration ${i + 1}: Improved to ${bestScore}`);
    }
    // Early exit if target reached
    if (bestScore >= config.targetScore) {
      console.log(`Target reached at iteration ${i + 1}`);
      break;
    }
  }
  // Step 4: Deploy if improved
  if (bestScore > baselineResults.overallScore) {
    await deploy(bestPrompt);
    // Promote new cases to permanent regression suite
    await saveEvalDataset("regression-suite", expandedSuite);
    console.log(`Deployed. Score: ${baselineResults.overallScore} → ${bestScore}`);
  }
  return { bestPrompt, bestScore, history };
}

Avoid Eval Overfitting
If you iterate on the same eval cases for too long, your prompt becomes over-fitted to those specific cases — it passes the eval but fails on unseen production queries. Mitigate this by maintaining a held-out test set that you never look at during development. Run it as a final gate before deployment.
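One way to enforce the held-out discipline is a deterministic, hash-based split: each case lands in the same bucket on every run, so the held-out set never leaks into development the way a random split can. A sketch under assumed types (`EvalCase` and `splitDataset` are illustrative names):

```typescript
// Sketch: deterministically split eval cases into a dev set (used during
// iteration) and a held-out set (run only as a final pre-deploy gate).
import { createHash } from "node:crypto";

interface EvalCase {
  id: string;
  input: string;
}

function splitDataset(
  cases: EvalCase[],
  heldOutFraction = 0.2,
): { dev: EvalCase[]; heldOut: EvalCase[] } {
  const dev: EvalCase[] = [];
  const heldOut: EvalCase[] = [];
  for (const c of cases) {
    // Hash the case id to a stable number in [0, 1)
    const h = createHash("sha256").update(c.id).digest();
    const bucket = h.readUInt32BE(0) / 0xffffffff;
    (bucket < heldOutFraction ? heldOut : dev).push(c);
  }
  return { dev, heldOut };
}
```

Because the split keys on the case id rather than insertion order, adding new cases later never shuffles existing cases between buckets.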
The EDD Mantra
Write the eval. Measure the baseline. Iterate with data. Deploy with confidence. Every production failure becomes a new eval case. Every eval case prevents that failure from recurring. Over time, your eval suite becomes the most valuable artifact in your LLM engineering workflow — it is your team's accumulated knowledge of what can go wrong and what “good” looks like.
CI/CD for LLM Apps
Evals in your deployment pipeline
The highest-impact eval practice is making evals automatic and unavoidable. When evals run on every PR that touches prompts, post results as comments, and block merges on regressions — quality becomes a system property, not a personal discipline.
Integrating evals into CI/CD transforms them from something engineers remember to run into something that runs automatically on every change. This is the difference between a demo and a product.
The CI/CD Eval Pipeline
Detect
< 5 seconds: Identify which files changed. If prompts, eval datasets, or model configs changed, trigger the eval pipeline. Skip for unrelated changes.
Fast Evals
< 30 seconds: Run assertion-based evals: format checks, keyword presence, length constraints, schema validation. These are deterministic and free — no LLM calls needed.
Full Evals
2-10 minutes: Run LLM-as-judge and semantic similarity evals against the full regression suite. Compare against the stored baseline. Compute score deltas and regression counts.
Report
< 10 seconds: Post eval results as a PR comment with score deltas, regression details, and pass/fail status. Make results visible and actionable in the code review workflow.
Gate
Immediate: Block merge if evals fail. Auto-approve if all checks pass. Require manual review for borderline cases (some regressions but within tolerance).
Key Insight
Layer your CI evals by cost and speed. Fast evals (assertions, format checks) run on every PR in under 30 seconds. Full evals (LLM-as-judge) run only when fast evals pass, taking 2-10 minutes. This keeps CI fast for non-prompt changes while ensuring thorough evaluation for prompt changes.
GitHub Actions Configuration
This workflow triggers only when prompt files, eval datasets, or LLM config files change. Fast evals run first — if they fail, the expensive full eval suite is skipped. Results are posted as a PR comment with score deltas.
# .github/workflows/llm-evals.yml
name: LLM Eval Suite

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'evals/**'
      - 'src/lib/llm-config.ts'

env:
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

jobs:
  fast-evals:
    name: Fast Evals (Assertions)
    runs-on: ubuntu-latest
    timeout-minutes: 2
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx tsx evals/run-assertions.ts
        env:
          EVAL_DATASET: evals/datasets/latest.json
          FAIL_THRESHOLD: 0.95

  full-evals:
    name: Full Eval Suite (LLM-as-Judge)
    needs: fast-evals
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Run full eval suite
      - run: npx tsx evals/run-full-suite.ts
        id: evals
        env:
          EVAL_DATASET: evals/datasets/latest.json
          BASELINE_PATH: evals/baselines/latest.json
          OUTPUT_PATH: evals/results/current.json
      # Post results as PR comment
      - uses: actions/github-script@v7
        with:
          script: |
            const results = require('./evals/results/current.json');
            // Comment formatter shipped with the eval tooling
            // (path is illustrative; adjust to your repo layout)
            const { formatEvalComment } = require('./evals/report.js');
            const body = formatEvalComment(results);
            await github.rest.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body,
            });
      # Fail if below threshold
      - run: |
          node -e "
            const r = require('./evals/results/current.json');
            if (r.regressions > 0 || r.overallScore < 0.9) {
              console.error('Eval failed:', JSON.stringify(r.summary));
              process.exit(1);
            }
          "

CI Eval Runner
The eval runner partitions test cases by type and runs them in phases. Functional evals gate safety evals, which gate quality evals. Each phase can short-circuit if thresholds aren't met, saving time and API costs.
// CI eval runner with reporting
interface CIEvalConfig {
  datasetPath: string;
  baselinePath: string;
  outputPath: string;
  thresholds: {
    fastEvalPassRate: number; // e.g., 0.95
    fullEvalMinScore: number; // e.g., 0.85
    maxRegressions: number;   // e.g., 3
    safetyPassRate: number;   // e.g., 1.0 (zero tolerance)
  };
}

async function runCIEvals(config: CIEvalConfig) {
  const dataset = await loadDataset(config.datasetPath);
  const baseline = await loadBaseline(config.baselinePath);
  // Partition test cases by type
  const functional = dataset.filter((c) => c.type === "functional");
  const quality = dataset.filter((c) => c.type === "quality");
  const safety = dataset.filter((c) => c.type === "safety");
  // Phase 1: Fast functional evals (parallel)
  const functionalResults = await Promise.all(
    functional.map((c) => runFunctionalEval(c)),
  );
  const functionalPassRate =
    functionalResults.filter((r) => r.passed).length / functional.length;
  if (functionalPassRate < config.thresholds.fastEvalPassRate) {
    return { status: "fail", phase: "functional", passRate: functionalPassRate };
  }
  // Phase 2: Safety evals (zero tolerance)
  const safetyResults = await Promise.all(
    safety.map((c) => runSafetyEval(c)),
  );
  const safetyPassRate =
    safetyResults.filter((r) => r.passed).length / safety.length;
  if (safetyPassRate < config.thresholds.safetyPassRate) {
    return { status: "fail", phase: "safety", passRate: safetyPassRate };
  }
  // Phase 3: Full quality evals with baseline comparison
  const qualityResults = await Promise.all(
    quality.map((c) => runQualityEval(c)),
  );
  const regressions = detectRegressions(baseline, qualityResults);
  const overallScore = average(qualityResults.map((r) => r.score));
  // Save results for reporting
  const report = {
    status: regressions.length > config.thresholds.maxRegressions ? "fail" : "pass",
    functionalPassRate,
    safetyPassRate,
    overallScore,
    regressions: regressions.length,
    improved: qualityResults.filter((r) => r.improved).length,
    details: { functionalResults, safetyResults, qualityResults },
  };
  await saveResults(config.outputPath, report);
  return report;
}

PR Comment Reporting
Making eval results visible in the PR review workflow is critical. Engineers should see score deltas, regressions, and improvements at a glance — without leaving the code review interface.
// Format eval results as a PR comment
function formatEvalComment(results: EvalReport): string {
  const status = results.status === "pass" ? "PASSED" : "FAILED";
  const emoji = results.status === "pass" ? "white_check_mark" : "x";
  return `## LLM Eval Results: ${status} :${emoji}:
| Metric | Score | Threshold | Status |
|--------|-------|-----------|--------|
| Functional Pass Rate | ${pct(results.functionalPassRate)} | 95% | ${badge(results.functionalPassRate >= 0.95)} |
| Safety Pass Rate | ${pct(results.safetyPassRate)} | 100% | ${badge(results.safetyPassRate >= 1.0)} |
| Quality Score | ${results.overallScore.toFixed(2)}/5.0 | 4.0 | ${badge(results.overallScore >= 4.0)} |
| Regressions | ${results.regressions} | < 3 | ${badge(results.regressions < 3)} |
| Improvements | ${results.improved} | — | :chart_with_upwards_trend: |
${results.regressions > 0 ? `
### Regressed Cases
${results.details.qualityResults
  .filter((r: QualityResult) => r.regressed)
  .map((r: QualityResult) => `- **${r.caseId}**: ${r.oldScore} → ${r.newScore} (Δ ${r.delta})`)
  .join("\n")}
` : ""}
<details>
<summary>Full Results (${results.details.qualityResults.length} cases)</summary>
// ... detailed per-case results
</details>

---
*Eval suite v${results.datasetVersion} | Baseline v${results.baselineVersion}*`;
}

Managing CI Eval Costs
Path-based triggering
Only run evals when prompt files, eval datasets, or model config changes. Use GitHub Actions path filters to skip unrelated PRs.
Tiered execution
Run cheap assertion evals on every PR. Run expensive LLM-as-judge evals only when assertions pass. Run full regression suites nightly.
Caching
Cache eval results for unchanged test cases. If the prompt didn't change, previous results are still valid. Only re-evaluate changed components.
Concurrency limits
Run eval API calls concurrently (10-20 parallel) but rate-limit to avoid provider throttling. Most eval suites of 100 cases complete in 2-5 minutes.
The CI/CD Eval Principle
If a human has to remember to run evals, they won't. Put evals in CI, post results on PRs, block merges on regressions, and track scores over time. Make quality automatic and unavoidable — not optional and forgettable.
Interactive Examples
See eval patterns in action with live code
See eval patterns in action. Each example shows a bad pattern and its eval-engineered fix. Toggle between them to understand the difference.
Using an LLM to grade another LLM's output
// BAD: No rubric, no structure, unreliable scores
async function evalResponse(response: string) {
  const grade = await llm.generate({
    system: "Rate this response from 1-10.",
    messages: [
      { role: "user", content: response },
    ],
  });
  // Returns inconsistent scores, no reasoning
  // "7" one time, "I'd give it a 7/10" the next
  return grade;
}

Why this fails
Without a rubric, the judge LLM produces inconsistent scores. No structured output means parsing fails randomly. No reasoning means you can't debug why a score was given.
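For contrast, here is a sketch of the eval-engineered version: a rubric in the system prompt, reasoning required before the score, and validated JSON output that fails loudly instead of silently. `generate` stands in for whatever LLM client you use; in production you might validate with a Zod schema, as recommended in the checklist below, rather than the hand-rolled check shown here.

```typescript
// GOOD (sketch): rubric, required reasoning, structured and validated output
interface JudgeResult {
  reasoning: string; // the judge must explain itself before scoring
  score: number;     // 1-5, per the rubric below
}

const JUDGE_SYSTEM = `You are an evaluation judge. Score the response 1-5:
5 = accurate, complete, and concise
3 = accurate but incomplete or verbose
1 = inaccurate or off-topic
First explain your reasoning, then decide.
Reply ONLY with JSON: {"reasoning": "...", "score": <1-5>}`;

function parseJudgeResult(raw: string): JudgeResult {
  // Fail loudly on malformed output so the caller can retry,
  // instead of silently recording an unparseable grade
  const parsed = JSON.parse(raw);
  if (
    typeof parsed.reasoning !== "string" ||
    typeof parsed.score !== "number" ||
    parsed.score < 1 || parsed.score > 5
  ) {
    throw new Error(`Malformed judge result: ${raw}`);
  }
  return { reasoning: parsed.reasoning, score: parsed.score };
}

async function evalResponse(
  response: string,
  generate: (system: string, user: string) => Promise<string>,
): Promise<JudgeResult> {
  return parseJudgeResult(await generate(JUDGE_SYSTEM, response));
}
```

The rubric anchors the scale, the reasoning field makes scores debuggable, and the parser turns "I'd give it a 7/10" into a hard error instead of a silent data-quality problem.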
All Examples Quick Reference
LLM-as-Judge Eval
Using an LLM to grade another LLM's output
Assertion-Based Evals
Deterministic checks for LLM outputs
Eval Dataset Management
Building and maintaining test cases
Regression Detection
Catching quality drops before they reach users
CI/CD Eval Pipeline
Automated evals in your deployment workflow
Ad-Hoc Tweaks vs Eval-Driven Development
Write the eval first, then improve the prompt
Uncalibrated vs Calibrated LLM Judge
Building reliable automated evaluation with LLM judges
Anti-Patterns & Failure Modes
Common eval mistakes and how to avoid them
Knowing what not to do is as important as knowing what to do. These are the most common eval failure modes — patterns that give teams false confidence while quality silently degrades in production.
Deploying prompt changes based on manual spot-checking a few examples instead of systematic evaluation.
Cause
No eval infrastructure in place. Teams treat LLM outputs like they treat UI changes — 'looks good to me' is the approval process.
Symptom
Prompt changes that 'seemed fine' cause subtle quality regressions in production. Users complain about outputs that used to work. No one can quantify if the system is getting better or worse over time.
Fix
Build an eval suite before building features. Even 20 well-chosen test cases with simple assertions is 100x better than eyeballing. Automate these in CI so every PR gets checked.
Eval datasets that were created once and never updated, becoming increasingly disconnected from real-world usage patterns.
Cause
Teams build an initial eval dataset during development but never feed production failures, new edge cases, or shifting user patterns back into the dataset.
Symptom
Evals pass consistently but production quality degrades. The eval dataset tests scenarios from 6 months ago, not what users actually ask today. False confidence from green CI checks.
Fix
Implement a feedback loop: sample production logs weekly, add failing cases from user reports, and run dataset freshness audits monthly. Version your datasets and track coverage metrics.
Optimizing prompts to pass specific eval cases rather than genuinely improving quality, similar to overfitting in ML.
Cause
Small eval datasets with predictable patterns. Teams iterate on prompts specifically to pass the eval suite rather than to improve general quality.
Symptom
Eval scores keep going up but production quality stays flat or drops. Prompts become over-fitted to the eval cases. New, unseen inputs fail at higher rates.
Fix
Use held-out test sets that prompt engineers never see. Regularly rotate eval cases. Include synthetic and adversarial examples. Monitor production metrics alongside eval scores.
Over-relying on a single metric (like BLEU score or cosine similarity) that captures only one dimension of quality.
Cause
Teams pick one easy-to-compute metric and treat it as the source of truth. Semantic similarity scores become the only quality gate.
Symptom
High scores on the tracked metric but poor real-world quality. A summarizer scores high on ROUGE but produces summaries that are factually wrong. A chatbot scores high on relevance but is rude.
Fix
Use multi-dimensional evaluation: combine deterministic assertions, semantic metrics, LLM-as-judge with rubrics, and human evaluation. No single metric captures LLM quality.
Shipping LLM-powered features to production with zero automated evaluation, relying entirely on user feedback as a quality signal.
Cause
Pressure to ship fast. Teams skip evals because they're seen as slow to build. The 'we'll add evals later' promise that never materializes.
Symptom
Every production incident is a surprise. No way to assess impact of model provider changes, prompt updates, or temperature adjustments. Users are the canary in the coal mine.
Fix
Start with the simplest possible eval: 10 test cases, basic assertions, running in CI. Expand from there. Even a minimal eval suite catches 80% of obvious regressions.
Using LLM-as-judge without calibrating for known biases: verbosity preference, position bias, self-preference.
Cause
LLM judges have systematic biases — they prefer longer responses, favor the first option presented, and rate their own model's outputs higher. Teams don't test for or correct these biases.
Symptom
Eval results that don't correlate with human judgment. Verbose, padded responses score higher than concise, accurate ones. A/B comparisons always favor whichever option is presented first.
Fix
Calibrate judges against human labels. Randomize option ordering. Use structured rubrics that penalize unnecessary verbosity. Run bias audits on your judge model periodically.
Best Practices Checklist
Production-ready eval guidelines
Production-ready eval guidelines distilled from Anthropic, OpenAI, Hamel Husain, Eugene Yan, and the broader AI engineering community.
Start with 20-50 cases, not 1000
A small, well-curated dataset with clear rubrics beats a large noisy one. You can always expand. Start with the cases that matter most to your users.
Cover the distribution, not just the happy path
Include edge cases, adversarial inputs, multilingual queries, and the long tail of real-world usage. Weight your dataset to match production traffic patterns.
Version your datasets like code
Store eval datasets in version control. Track when cases were added, why, and from what source. This lets you audit eval quality over time.
Feed production failures back into evals
Every user complaint, thumbs-down, or escalation is a potential eval case. Build a pipeline that samples production failures into your eval dataset weekly.
Layer your assertions: deterministic first, LLM-judge last
Check format, length, and keyword presence with fast deterministic checks. Use semantic similarity as a middle layer. Reserve expensive LLM-as-judge for nuanced quality assessment.
Always include reasoning in LLM-as-judge prompts
Require the judge to explain its reasoning before scoring. This improves consistency and lets you debug disagreements between the judge and human reviewers.
Use structured output (Zod/JSON Schema) for eval results
Parse eval results into typed schemas. This prevents the 'I give it a 7 out of 10' vs '7' vs '7/10' parsing problem and ensures consistent downstream processing.
Calibrate LLM judges against human labels
Run your LLM judge on a set of human-labeled examples. If judge scores don't correlate with human scores, the judge needs better prompting or a different model.
Run fast evals on every PR, full evals nightly
Assertion-based evals can run in under 2 minutes. Gate PRs on these. Run expensive LLM-as-judge evals nightly or on-demand for prompt changes.
Post eval results as PR comments
Make eval results visible in the PR review workflow. Show score deltas, regressions, and new failures. This makes eval results impossible to ignore.
Block merges on eval regressions
Treat eval failures like test failures. If a prompt change causes more than N regressions or drops the mean score below a threshold, block the merge.
Track eval scores over time as a dashboard
Plot eval scores per prompt version on a time-series dashboard. This shows whether your system is improving, degrading, or plateauing over weeks and months.
Use human evals to calibrate automated evals
Have human raters score 50-100 cases. Compare against LLM-as-judge scores. If correlation is below 0.7, revise your judge prompt. Humans are the ground truth.
Create detailed annotation guidelines with examples
Define every score level with concrete examples. Include edge case handling rules. Untrained raters with vague instructions produce unreliable data.
Measure and track inter-rater reliability
Use Cohen's Kappa or Krippendorff's Alpha. If agreement is below 0.6, the task definition is ambiguous — revise guidelines before collecting more labels.
Reserve human eval budget for high-stakes decisions
Use automated evals for routine regression testing. Reserve expensive human evaluation for safety-critical assessments, model comparisons, and rubric development.
Write the eval before writing the prompt
Define what 'good' looks like with test cases and rubrics before you write a single prompt. This is the LLM equivalent of TDD and prevents goalpost-shifting.
Use evals to compare prompt candidates
When iterating on a prompt, run all candidates against the same eval suite. Let data decide which prompt is best, not intuition.
Separate development evals from held-out evals
Keep a set of eval cases that prompt engineers never see during development. Run these as a final gate to detect overfitting to the dev eval set.
Treat eval maintenance as ongoing work, not a one-time task
Allocate 10-20% of your LLM engineering time to eval maintenance. Add new cases, retire stale ones, recalibrate judges, and update rubrics as your product evolves.
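The judge-calibration item above reduces to a correlation check between judge scores and human labels on the same cases. A sketch with made-up scores; the 0.7 cutoff is the rule of thumb from the checklist, and Spearman rank correlation is an equally reasonable choice for ordinal 1-5 ratings.

```typescript
// Sketch: Pearson correlation between LLM-judge scores and human labels.
// Below ~0.7, revise the judge prompt or switch judge models.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// Hypothetical 1-5 ratings on the same eight cases
const humanScores = [5, 4, 2, 3, 5, 1, 4, 2];
const judgeScores = [5, 5, 2, 3, 4, 1, 4, 3];

const r = pearson(judgeScores, humanScores);
console.log(r >= 0.7 ? "judge usable" : "recalibrate judge");
```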
The Guiding Principle
Evals are not a testing step at the end — they are the foundation of the entire development workflow. Write the eval first, measure the baseline, iterate with data, and deploy with confidence. The teams that invest the most in their eval infrastructure ship the most reliable AI products.
— Hamel Husain, “Your AI Product Needs Evals”
Resources & Further Reading
Tools, blogs, papers, and guides
Essential tools, frameworks, blog posts, research papers, and guides for building production-grade LLM evaluation systems.
Your AI Product Needs Evals
The definitive guide to LLM evals by Hamel Husain. Covers why evals matter, how to build them, and common mistakes teams make.
How to Evaluate LLM Applications
Practical framework for evaluating LLM applications covering classification, generation, and RAG use cases with code examples.
promptfoo: Test Your LLM App
Open-source tool for testing and evaluating LLM outputs. Supports multiple providers, assertions, and CI/CD integration.
Braintrust AI Eval Framework
Production-grade eval framework with experiment tracking, scoring functions, and dataset management.
LangSmith Evaluation Guide
Official LangSmith docs on building evaluators, managing datasets, running experiments, and tracking results over time.
Judging LLM-as-a-Judge
Foundational paper on using LLMs as evaluators. Analyzes biases, calibration, and agreement with human judgments.
RAGAS: Automated Evaluation of RAG
Framework for evaluating RAG pipelines with metrics for faithfulness, answer relevance, context precision, and recall.
OpenAI Evals
OpenAI's open-source evaluation framework. Includes benchmark tasks and a framework for creating custom evals.
Anthropic's Guide to Prompt Evaluation
Anthropic's official guide to evaluating prompts systematically, including rubric design and scoring methodologies.
The Eval Gap: Why Offline Metrics Don't Predict Online Performance
Analysis of why eval scores often don't correlate with production quality, and strategies for closing the gap.
Where to Start
If you read one thing, read Hamel Husain's “Your AI Product Needs Evals” — it covers the why, what, and how of LLM evals in one comprehensive post. Then pick a framework (promptfoo for CLI-first, Braintrust for enterprise) and start with 20 test cases. You'll see immediate ROI.