The Complete Guide — Ship with Confidence

LLM Evals Academy

Build evaluation systems that catch regressions before your users do. From simple assertions to LLM-as-judge, from eval-driven development to CI/CD integration — with interactive code examples.

13 Chapters · 7 Code Examples · 6 Anti-Patterns · 10 Resources · 5 Best Practice Categories
1

Why Evals Matter

The foundation of reliable AI products

LLM evaluations are systematic methods for measuring the quality, reliability, and safety of LLM outputs. They are the difference between shipping a demo and shipping a product.

Traditional software has type systems, unit tests, and integration tests. LLM outputs are non-deterministic, subjective, and context-dependent. Evals are the testing discipline designed for this reality — they give you confidence that your system works, a safety net when you make changes, and data to drive decisions.

The Core Problem

LLMs are stochastic systems. The same input can produce different outputs. A prompt that works today may fail tomorrow after a model update. Without evals, every change is a leap of faith — you have no way to know if your system is improving, degrading, or breaking. Users become your QA team, and user complaints become your monitoring system.

Key Insight

Evals are not a quality-assurance step you add at the end. They are the foundation of the entire development workflow. The best AI engineering teams write evals before they write prompts — the same way the best software teams write tests before they write code.

The Cost of Shipping Without Evals

Prompt Regression

A prompt change that improves one use case silently breaks three others.

Impact

Users see degraded quality for days before anyone notices.

Model Provider Update

Your LLM provider ships a new model version. Your app behavior changes overnight.

Impact

No way to quantify the impact or decide whether to pin the old version.

Temperature Tweak

Someone changes temperature from 0.7 to 0.3 to reduce hallucinations.

Impact

Hallucinations drop but creative tasks become robotic. No data to balance the tradeoff.

RAG Index Update

New documents are indexed into the retrieval system with different formatting.

Impact

Retrieval quality shifts but nothing in the pipeline detects the change.

The Eval Mindset

Every prompt change is a hypothesis

Evals turn subjective 'I think this is better' into measurable 'this scores 4.2 vs 3.8 on our rubric.'

Evals compound over time

Each production failure you add to your eval suite makes the next deployment safer. Your eval suite is your institutional knowledge of what can go wrong.

Measure what matters to users

Don't eval for perplexity or BLEU score if your users care about helpfulness and accuracy. Eval metrics should map to user satisfaction.

Automate relentlessly

If a human has to remember to run evals, they won't. Put evals in CI, block merges on regressions, and post results on every PR.

Evals are the most underinvested part of the AI stack. If you don't have evals, you don't have a product — you have a demo.

Hamel Husain

AI Engineering Consultant

The best teams I've seen treat evals as a first-class artifact. They write the eval before writing the prompt, the same way TDD works for software.

Eugene Yan

Senior Applied Scientist, Amazon

Without evals, every prompt change is a leap of faith. With evals, it's an experiment with measurable outcomes.

Jason Wei

Research Scientist, OpenAI

If you can't measure it, you can't improve it. The teams that ship the most reliable LLM products are the ones with the strongest eval suites.

Simon Willison

Creator of Datasette, AI Blogger

2

Types of Evaluations

Functional, quality, safety, and regression evals

Not all evals are created equal. Each type serves a different purpose, runs at a different cadence, and catches a different class of failure. A robust eval strategy layers multiple types to cover the full spectrum of quality dimensions.

The Eval Pyramid — Layer Your Evaluations

Safety Evals
Run on every deployment
Quality Evals (LLM-as-Judge)
Run nightly or per-PR
Regression Evals
Run on every change
Functional Evals (Assertions)
Run on every PR

Functional evals form the base — they are fast, cheap, and catch the most common failures. Each layer above adds more coverage at higher cost and latency.

Functional Evals

Verify that the output meets hard requirements — correct format, contains required fields, follows instructions. These are the LLM equivalent of unit tests.

Examples

  • Output is valid JSON matching a schema
  • Response contains all required sections
  • Generated SQL query is syntactically valid
  • Classification output is one of the allowed labels

When to Use

Always. Functional evals are the minimum baseline for any LLM application. They're fast, deterministic, and catch the most obvious failures.

Code Pattern

// Functional eval: check format and content
expect(output).toMatchSchema(responseSchema);
expect(output.sections).toContain("summary");
expect(output.length).toBeLessThan(maxTokens);
Quality Evals

Assess subjective quality dimensions — helpfulness, clarity, accuracy, tone, completeness. These typically use LLM-as-judge or human reviewers with rubrics.

Examples

  • Is the summary accurate and complete?
  • Does the response match the requested tone?
  • Are the generated instructions easy to follow?
  • Does the answer cite relevant sources?

When to Use

For any user-facing output where quality matters. Use LLM-as-judge for fast iteration, human eval for calibration and high-stakes decisions.

Code Pattern

// Quality eval: LLM-as-judge with rubric
const score = await judge.evaluate({
  criteria: ["accuracy", "helpfulness", "clarity"],
  rubric: qualityRubric,
  response: output,
  reference: goldenAnswer,
});
Safety Evals

Test for harmful, biased, or inappropriate outputs. Includes adversarial testing (jailbreaks, prompt injection), bias detection, PII leakage, and content policy compliance.

Examples

  • Response doesn't contain PII from training data
  • Output doesn't generate harmful instructions
  • System resists prompt injection attempts
  • Responses don't exhibit demographic bias

When to Use

Before any public deployment. Run adversarial evals during development and continuously in production. Required for regulated industries.

Code Pattern

// Safety eval: adversarial test suite
const adversarialInputs = loadAdversarialSuite();
for (const input of adversarialInputs) {
  const output = await model.generate(input);
  expect(output).not.toMatch(harmfulPatterns);
  expect(output).toPassContentPolicy();
}
Regression Evals

Compare current output quality against a known baseline. Detect when changes to prompts, models, or retrieval systems cause quality to drop on previously-passing cases.

Examples

  • Score on existing test suite after prompt change
  • Before/after comparison on model version upgrade
  • Quality delta after RAG index rebuild
  • Performance check after temperature adjustment

When to Use

On every change to any component of the LLM pipeline — prompts, models, retrieval, post-processing. Gate deployments on regression results.

Code Pattern

// Regression eval: compare against baseline
const baseline = await loadBaseline("v2.3");
const current = await runEvalSuite(newPrompt);
const regressions = findRegressions(baseline, current);
assert(regressions.length < threshold);
Comparative Evals

Evaluate two or more variants side-by-side. Used for A/B testing prompts, comparing models, or evaluating system architecture changes. Requires careful methodology to avoid position bias.

Examples

  • Prompt A vs Prompt B on the same test suite
  • GPT-4o vs Claude Sonnet for a specific use case
  • RAG pipeline v1 vs v2 on retrieval quality
  • Single-shot vs chain-of-thought on reasoning tasks

When to Use

When choosing between alternatives. Run before committing to a prompt change, model switch, or architecture decision. Randomize ordering to avoid position bias.

Code Pattern

// Comparative eval: head-to-head comparison
const results = await compareVariants({
  variants: [promptA, promptB],
  dataset: evalCases,
  judge: llmJudge,
  randomizeOrder: true, // Avoid position bias
});
console.log(results.winner, results.pValue);
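A pValue like the one logged above can come from a simple two-sided sign test on per-case wins, with ties dropped. A minimal sketch — signTestPValue is an illustrative helper, not part of any eval framework:

```typescript
// Two-sided sign test: is the win rate significantly different from 50/50?
// Ties are dropped before testing, as is standard for the sign test.
function binomialPmf(n: number, k: number): number {
  // C(n, k) / 2^n, computed in log space to avoid overflow for moderate n
  let logC = 0;
  for (let i = 1; i <= k; i++) {
    logC += Math.log(n - k + i) - Math.log(i);
  }
  return Math.exp(logC - n * Math.LN2);
}

function signTestPValue(wins: number, losses: number): number {
  const n = wins + losses; // ties excluded
  if (n === 0) return 1;
  const extreme = Math.max(wins, losses);
  // P(X >= extreme) under Binomial(n, 0.5), doubled for a two-sided test
  let tail = 0;
  for (let k = extreme; k <= n; k++) tail += binomialPmf(n, k);
  return Math.min(1, 2 * tail);
}

// Example: prompt B wins 8 of 10 decisive cases
const p = signTestPValue(8, 2);
console.log(p.toFixed(4)); // 0.1094
```

With only 10 decisive cases, even an 8-2 split is not significant at p < 0.05 — a reminder to size comparative eval suites accordingly.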

Choosing Your Eval Mix

Start with functional evals — they are the fastest to build and catch the most common failures. Add regression evals as soon as you have a baseline. Layer in quality evals for user-facing outputs. Add safety evals before public deployment. Use comparative evals when making architectural decisions.

3

Building Eval Datasets

Golden datasets, synthetic data, production sampling

Your evals are only as good as your dataset. A well-constructed eval dataset is the most valuable artifact in your LLM engineering workflow — it encodes your team's knowledge of what “good” looks like and what can go wrong.

Three Sources of Eval Data

Manual / Curated

Hand-crafted test cases written by domain experts who understand what good output looks like. The highest quality source but the least scalable.

Pros

  • + Highest quality and precision
  • + Tests specific requirements
  • + Captures domain nuance

Cons

  • - Expensive and slow to create
  • - Limited scale
  • - Author bias

Tip

Start here. 20 well-crafted manual cases beat 500 auto-generated ones for establishing your eval baseline.

Synthetic / LLM-Generated

Use an LLM to generate test cases from seed examples, templates, or category descriptions. Good for coverage and edge cases, but requires human review.

Pros

  • + Scales to hundreds of cases quickly
  • + Good for edge case generation
  • + Covers categories systematically

Cons

  • - Quality varies — needs human review
  • - Can miss real-world patterns
  • - May share the generator model's blind spots

Tip

Generate in batches by category, then have a human review and filter. A 50% acceptance rate is typical and acceptable.

Production Sampling

Sample real user inputs and outputs from production logs. The most representative source of how your system is actually used.

Pros

  • + Reflects real usage patterns
  • + Captures edge cases you'd never imagine
  • + Automatically tracks distribution shifts

Cons

  • - Requires production traffic first
  • - May contain PII (needs scrubbing)
  • - Needs labeling after collection

Tip

Stratified sampling is key — sample across user types, input categories, and outcome types (success/failure). Over-sample failures.

Generating Synthetic Test Cases

synthetic-eval-generation.ts
// Generate synthetic eval cases from seed examples
import { z } from "zod";

const EvalCase = z.object({
  input: z.string(),
  expectedBehavior: z.string(),
  category: z.string(),
  difficulty: z.enum(["easy", "medium", "hard", "adversarial"]),
});

async function generateSyntheticCases(
  seedCases: z.infer<typeof EvalCase>[],
  categories: string[],
  countPerCategory: number,
): Promise<z.infer<typeof EvalCase>[]> {
  const allCases: z.infer<typeof EvalCase>[] = [];

  for (const category of categories) {
    const seeds = seedCases.filter((c) => c.category === category);

    const generated = await llm.generate({
      system: `You are an eval dataset generator. Given seed examples,
generate ${countPerCategory} new test cases for the "${category}" category.

Requirements:
- Each case must be meaningfully different from seeds
- Include a mix of difficulties: easy, medium, hard, adversarial
- Adversarial cases should test edge cases and failure modes
- Expected behavior should describe what a GOOD response does

Respond as a JSON array matching the schema.`,
      messages: [
        {
          role: "user",
          content: `Seed examples:\n${JSON.stringify(seeds, null, 2)}`,
        },
      ],
    });

    const parsed = z.array(EvalCase).parse(JSON.parse(generated));
    allCases.push(...parsed);
  }

  return allCases;
}

Use an LLM to generate test cases from seed examples. Specify categories and difficulty levels. Always have a human review generated cases before adding them to the eval suite — a 50% acceptance rate is typical.

Sampling from Production

production-sampling.ts
// Sample production logs for eval dataset
interface ProductionLog {
  id: string;
  input: string;
  output: string;
  timestamp: string;
  userFeedback?: "positive" | "negative";
  latencyMs: number;
  model: string;
}

async function sampleForEvals(
  logs: ProductionLog[],
  config: {
    totalSamples: number;
    failureOversampling: number; // e.g., 3x
    stratifyBy: "category" | "date" | "model";
  },
): Promise<ProductionLog[]> {
  // Separate successes and failures
  const failures = logs.filter((l) => l.userFeedback === "negative");
  const successes = logs.filter((l) => l.userFeedback !== "negative");

  // Over-sample failures (they're more valuable as eval cases)
  const failureSamples = stratifiedSample(
    failures,
    Math.min(
      failures.length,
      Math.floor(config.totalSamples * 0.4 * config.failureOversampling),
    ),
    config.stratifyBy,
  );

  const successSamples = stratifiedSample(
    successes,
    config.totalSamples - failureSamples.length,
    config.stratifyBy,
  );

  return [...failureSamples, ...successSamples];
}

Over-sample failures — they are far more valuable as eval cases than successes. Use stratified sampling to ensure coverage across user types, input categories, and time periods.
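The stratifiedSample helper used above is assumed; one minimal sketch groups logs by the stratification key and allocates the sample budget proportionally across strata (deterministic here for clarity — shuffle within strata in practice):

```typescript
// Stratified sampling sketch: allocate the sample budget across strata
// in proportion to stratum size, then draw from each stratum.
// Deterministic (takes items in order); shuffle within strata in practice.
function stratifiedSample<T>(
  items: T[],
  count: number,
  stratifyBy: keyof T,
): T[] {
  if (items.length === 0 || count <= 0) return [];

  // Group items by the stratification key
  const strata = new Map<unknown, T[]>();
  for (const item of items) {
    const key = item[stratifyBy];
    const bucket = strata.get(key) ?? [];
    bucket.push(item);
    strata.set(key, bucket);
  }

  // Proportional allocation per stratum (floor), then top up from
  // strata that still have unused items until the budget is met
  const sample: T[] = [];
  for (const [, bucket] of strata) {
    const share = Math.floor((bucket.length / items.length) * count);
    sample.push(...bucket.slice(0, share));
  }
  for (const [, bucket] of strata) {
    if (sample.length >= count) break;
    const taken = Math.floor((bucket.length / items.length) * count);
    if (bucket.length > taken) sample.push(bucket[taken]);
  }
  return sample.slice(0, count);
}
```

Rounding means very small strata can be dropped entirely; if every stratum must be represented, guarantee a minimum of one sample per stratum before the proportional pass.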

Versioning Your Datasets

dataset-versioning.ts
// Version and track eval datasets
interface DatasetVersion {
  version: string;
  createdAt: string;
  cases: EvalCase[];
  metadata: {
    totalCases: number;
    bySource: Record<string, number>;
    byCategory: Record<string, number>;
    byDifficulty: Record<string, number>;
  };
  changelog: string;
}

async function saveDatasetVersion(
  cases: EvalCase[],
  changelog: string,
): Promise<DatasetVersion> {
  const version: DatasetVersion = {
    version: `v${Date.now()}`,
    createdAt: new Date().toISOString(),
    cases,
    metadata: {
      totalCases: cases.length,
      bySource: countBy(cases, "source"),
      byCategory: countBy(cases, "category"),
      byDifficulty: countBy(cases, "difficulty"),
    },
    changelog,
  };

  // Store in version control alongside code
  await writeFile(
    `evals/datasets/${version.version}.json`,
    JSON.stringify(version, null, 2),
  );

  // Update the "latest" symlink
  await writeFile(
    "evals/datasets/latest.json",
    JSON.stringify(version, null, 2),
  );

  return version;
}

Edge Case Checklist

Empty or whitespace-only inputs
Very long inputs (near token limits)
Inputs in unexpected languages
Adversarial / prompt injection attempts
Ambiguous queries with multiple valid answers
Inputs requiring refused responses
Multi-part questions
Inputs with typos or informal language
Domain-specific jargon and abbreviations
Time-sensitive queries with stale data

The Golden Rule

Your eval dataset should be a living document, not a static artifact. Feed production failures back in weekly. Regenerate synthetic cases quarterly. Audit for coverage gaps monthly. The teams with the best AI products are the ones that invest the most in their eval datasets.

4

Automated Evaluations

LLM-as-judge, assertions, semantic similarity

Automated evaluations are the backbone of a scalable eval system. They run on every change, catch regressions instantly, and provide quantitative signals for prompt iteration. The best eval systems layer multiple methods — assertions for hard requirements, semantic similarity for meaning, and LLM-as-judge for nuanced quality.

Each method has different cost, speed, and reliability tradeoffs. The key is knowing when to use each one and how to combine them into a pipeline that balances thoroughness with speed.

Three Layers of Automated Evaluation

Deterministic

Assertion-Based Evals

Hard checks against LLM output: format validation, keyword presence, length constraints, schema compliance. The fastest and cheapest eval method — runs in milliseconds with zero LLM calls.

Strengths

  • + Deterministic — same input always gives same result
  • + Zero cost — no LLM calls needed
  • + Fast — runs in milliseconds
  • + Easy to debug when tests fail

Limitations

  • - Can't assess subjective quality
  • - Brittle for free-form text
  • - Misses semantically correct but differently worded responses
Embedding-Based

Semantic Similarity

Compare output embeddings against reference answer embeddings using cosine similarity. Catches semantically equivalent responses regardless of exact wording. Cheap and fast — one embedding call per evaluation.

Strengths

  • + Tolerant of rephrasing and word choice variation
  • + Cheap — only requires embedding calls, not LLM generation
  • + Good middle ground between exact match and LLM-as-judge
  • + Quantitative and reproducible

Limitations

  • - Doesn't understand nuance or factual correctness
  • - High similarity doesn't guarantee correctness
  • - Depends on embedding model quality
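As a concrete sketch of this layer, the similarity check itself is a few lines of cosine similarity over embedding vectors (the embedding API call — one call per text — is assumed and omitted):

```typescript
// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
// Returns a value in [-1, 1]; natural-language embedding pairs usually
// land well above 0.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector length mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Usage against a pass/fail threshold (vectors here are toy 3-d stand-ins
// for real embedding outputs):
const passes = cosineSimilarity([0.1, 0.8, 0.3], [0.12, 0.79, 0.28]) >= 0.75;
```

The threshold (0.75 here) is application-specific: calibrate it on pairs you know are equivalent and pairs you know are not.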
Model-Graded

LLM-as-Judge

Use a strong LLM to evaluate another LLM's output against a rubric. The most flexible eval method — can assess nuanced quality dimensions like helpfulness, accuracy, and tone. Requires calibration to be reliable.

Strengths

  • + Evaluates subjective quality dimensions
  • + Flexible — adapts to any rubric or criteria
  • + Correlates well with human judgment when calibrated
  • + Scales to thousands of evaluations

Limitations

  • - Has systematic biases (verbosity, position, self-preference)
  • - Non-deterministic — scores vary between runs
  • - Expensive — requires LLM call per evaluation
  • - Requires careful calibration against human labels

Implementing LLM-as-Judge

The key to reliable LLM-as-judge is calibration. Provide concrete examples of what each score level looks like. Set temperature to 0 for reproducibility. Require step-by-step reasoning before scoring — this forces the judge to justify its evaluation and improves consistency.

llm-as-judge.ts
// LLM-as-Judge with calibration and bias mitigation
import { z } from "zod";

const JudgmentSchema = z.object({
  reasoning: z.string(),
  scores: z.object({
    accuracy: z.number().min(1).max(5),
    helpfulness: z.number().min(1).max(5),
    safety: z.number().min(1).max(5),
  }),
  overall: z.number().min(1).max(5),
});

type Judgment = z.infer<typeof JudgmentSchema>;

async function llmJudge(
  response: string,
  reference: string,
  rubric: string,
  calibrationExamples: { response: string; score: number }[],
): Promise<Judgment> {
  // Include calibration examples to anchor scoring
  const calibration = calibrationExamples
    .map((ex) => `Response: "${ex.response}" → Score: ${ex.score}/5`)
    .join("\n");

  const result = await llm.generate({
    model: "gpt-4o",
    temperature: 0, // Reduce variance between runs
    system: `You are an expert evaluator. Score responses 1-5.

## Rubric
${rubric}

## Calibration Examples (use these to anchor your scale)
${calibration}

## Reference Answer
${reference}

## Instructions
1. Write step-by-step reasoning BEFORE scoring
2. Score each dimension independently
3. Overall = weighted average (accuracy 40%, helpfulness 40%, safety 20%)
4. Output valid JSON matching the schema.`,
    messages: [
      { role: "user", content: `Evaluate this response:\n"${response}"` },
    ],
  });

  return JudgmentSchema.parse(JSON.parse(result));
}

Combined Eval Pipeline

The most effective eval systems layer all three methods. Assertions run first as a fast gate — if they fail, skip the expensive LLM calls. Semantic similarity provides a cheap quality signal. LLM-as-judge adds nuanced evaluation for cases that pass the first two layers.

combined-eval-pipeline.ts
// Combined eval pipeline: assertions + similarity + judge
interface EvalResult {
  assertions: { passed: boolean; failures: string[] };
  similarity: { score: number; threshold: number };
  judge: { overall: number; reasoning: string };
  finalScore: number;
  passed: boolean;
}

async function evaluate(
  response: string,
  testCase: EvalCase,
): Promise<EvalResult> {
  // Layer 1: Fast assertions (deterministic, free)
  const assertions = runAssertions(response, testCase.assertions);

  // Short-circuit: if assertions fail, skip expensive evals
  if (!assertions.passed) {
    return {
      assertions,
      similarity: { score: 0, threshold: 0.75 },
      judge: { overall: 0, reasoning: "Skipped — assertions failed" },
      finalScore: 0,
      passed: false,
    };
  }

  // Layer 2: Semantic similarity (cheap, fast)
  const similarity = await computeSimilarity(
    response,
    testCase.referenceAnswer,
  );

  // Layer 3: LLM-as-judge (expensive, nuanced)
  const judge = await llmJudge(
    response,
    testCase.referenceAnswer,
    testCase.rubric,
    CALIBRATION_SET,
  );

  // Weighted final score
  const finalScore =
    (assertions.passed ? 0.2 : 0) +
    similarity.score * 0.3 +
    (judge.overall / 5) * 0.5;

  return {
    assertions,
    similarity,
    judge,
    finalScore,
    passed: finalScore >= testCase.threshold,
  };
}

Known LLM Judge Biases

Verbosity Bias

Judges rate longer responses higher regardless of quality. A 500-word response often scores higher than a 100-word response even when the short one is better.

Mitigation

Include 'penalize unnecessary verbosity' in your rubric. Add calibration examples where concise answers score highest.

Position Bias

In A/B comparisons, judges consistently prefer whichever option is presented first. This can skew comparative evaluations by 15-20%.

Mitigation

Randomize option ordering. Run each comparison twice with swapped positions. Average the results.
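The swap-and-average mitigation can be sketched as follows — judgePair stands in for any pairwise judge and is an assumed interface, not a specific framework API (shown synchronously for brevity; real judge calls are async):

```typescript
// Run each A/B comparison twice with positions swapped, then average,
// so a judge's preference for the first slot cancels out.
type PairwiseJudge = (first: string, second: string) => { first: number; second: number };

function debiasedCompare(
  judgePair: PairwiseJudge,
  responseA: string,
  responseB: string,
): { scoreA: number; scoreB: number; winner: "A" | "B" | "tie" } {
  const pass1 = judgePair(responseA, responseB); // A in first position
  const pass2 = judgePair(responseB, responseA); // positions swapped

  const scoreA = (pass1.first + pass2.second) / 2;
  const scoreB = (pass1.second + pass2.first) / 2;
  const winner = scoreA > scoreB ? "A" : scoreB > scoreA ? "B" : "tie";
  return { scoreA, scoreB, winner };
}

// Demo with a judge that always awards the first position a +1 bonus:
const quality: Record<string, number> = { A: 3, B: 4 };
const biasedJudge: PairwiseJudge = (f, s) => ({ first: quality[f] + 1, second: quality[s] });
const result = debiasedCompare(biasedJudge, "A", "B");
// A single pass would call it a tie (4 vs 4); swapping recovers B as the winner.
```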

Self-Preference Bias

GPT-4 judges rate GPT-4 outputs higher. Claude judges prefer Claude outputs. The judge model favors its own family's style.

Mitigation

Use a different model as judge than the one being evaluated. Or use multiple judge models and average.

Anchoring Bias

If the judge sees a reference answer first, it anchors on that specific phrasing and penalizes valid alternatives.

Mitigation

Use rubric-based evaluation instead of reference comparison when possible. Define quality criteria abstractly.

The Eval Pipeline Formula

Start with assertions for format and content requirements. Add semantic similarity as a cheap quality check. Layer LLM-as-judge for nuanced evaluation with calibrated rubrics. Short-circuit early — if assertions fail, skip the expensive layers. This gives you thoroughness where it matters and speed where it doesn't.

5

Human Evaluation

When and how to involve human reviewers

Automated evals scale, but human evaluation is the ground truth. LLM judges have known biases — they prefer verbose responses, exhibit position bias, and rate their own model's outputs higher. Human evals calibrate your automated systems and catch issues that no metric can quantify.

When to Involve Human Reviewers

critical

Calibrating LLM-as-Judge

Before trusting an LLM judge, validate its scores against human labels on 50-100 cases. If correlation is low, the judge prompt needs rework.

high

Subjective Quality Assessment

Tasks where 'quality' is genuinely subjective — creative writing, tone matching, brand voice. No automated metric captures this well.

critical

Safety-Critical Applications

Medical, legal, financial advice. Automated evals catch format issues but humans catch dangerous misinformation.

high

New Product Launch

Before the first public release, human reviewers catch issues that no eval suite anticipated. Use this feedback to bootstrap your automated evals.

high

Adversarial Testing

Red-teaming for safety evals. Creative humans find attack vectors that automated adversarial generators miss.

medium

Periodic Audits

Even with good automated evals, run quarterly human audits on a production sample to catch systematic blind spots.
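The judge calibration in the first item above comes down to correlating judge scores with human labels on the same cases. A minimal Pearson correlation sketch — the r ≥ 0.8 bar in the comment is an informal rule of thumb, not a standard:

```typescript
// Pearson correlation between judge scores and human labels on paired cases.
// An informal bar: r >= 0.8 before trusting the judge unsupervised.
function pearson(judgeScores: number[], humanScores: number[]): number {
  const n = judgeScores.length;
  if (n !== humanScores.length || n < 2) throw new Error("Need paired scores");
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const mj = mean(judgeScores);
  const mh = mean(humanScores);
  let cov = 0, varJ = 0, varH = 0;
  for (let i = 0; i < n; i++) {
    const dj = judgeScores[i] - mj;
    const dh = humanScores[i] - mh;
    cov += dj * dh;
    varJ += dj * dj;
    varH += dh * dh;
  }
  if (varJ === 0 || varH === 0) return 0;
  return cov / Math.sqrt(varJ * varH);
}
```

For ordinal 1-5 rubric scores, Spearman (Pearson over ranks) is often a better fit; the implementation differs only in a ranking step.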

Writing Annotation Guidelines

The most common failure in human eval is vague guidelines. “Rate quality from 1-5” means different things to different annotators. Good guidelines define every score level with concrete examples and handle edge cases explicitly.

annotation-guidelines.ts
// Annotation guidelines template
interface AnnotationGuideline {
  taskDescription: string;
  dimensions: {
    name: string;
    description: string;
    scale: { value: number; label: string; examples: string }[];
  }[];
  edgeCases: { scenario: string; guidance: string }[];
}

const summaryEvalGuideline: AnnotationGuideline = {
  taskDescription: `Evaluate the quality of an AI-generated summary.
Read the source document, then rate the summary on each dimension.`,
  dimensions: [
    {
      name: "Accuracy",
      description: "Are all facts in the summary correct per the source?",
      scale: [
        { value: 1, label: "Major errors", examples: "States wrong numbers, inverts conclusions" },
        { value: 2, label: "Minor errors", examples: "Slightly wrong dates, imprecise wording" },
        { value: 3, label: "Mostly accurate", examples: "Key facts correct, minor omissions" },
        { value: 4, label: "Accurate", examples: "All facts verifiable against source" },
        { value: 5, label: "Perfectly accurate", examples: "Every claim traceable to source" },
      ],
    },
    {
      name: "Completeness",
      description: "Does the summary cover all key points from the source?",
      scale: [
        { value: 1, label: "Missing critical info", examples: "Main conclusion absent" },
        { value: 2, label: "Significant gaps", examples: "2+ key points missing" },
        { value: 3, label: "Adequate coverage", examples: "Main points present, details missing" },
        { value: 4, label: "Good coverage", examples: "All key points, most details" },
        { value: 5, label: "Comprehensive", examples: "All points covered proportionally" },
      ],
    },
    {
      name: "Conciseness",
      description: "Is the summary appropriately brief without losing meaning?",
      scale: [
        { value: 1, label: "Far too long", examples: "Longer than source, excessive repetition" },
        { value: 2, label: "Too verbose", examples: "Could be 50% shorter without loss" },
        { value: 3, label: "Acceptable", examples: "Some unnecessary content" },
        { value: 4, label: "Concise", examples: "Tight writing, minimal fluff" },
        { value: 5, label: "Perfectly concise", examples: "Every word earns its place" },
      ],
    },
  ],
  edgeCases: [
    {
      scenario: "Summary adds information not in the source",
      guidance: "Score Accuracy as 1 regardless of whether the added info is correct.",
    },
    {
      scenario: "Summary is a single sentence",
      guidance: "Can still score high on Conciseness if it captures the core message.",
    },
  ],
};

Measuring Annotator Agreement

If two annotators disagree frequently, your guidelines are ambiguous. Use Cohen's Kappa to measure inter-rater reliability. A kappa below 0.6 means your guidelines need revision before the annotations are useful.

inter-rater-reliability.ts
// Calculate inter-rater reliability (Cohen's Kappa)
function cohensKappa(
  rater1: number[],
  rater2: number[],
): { kappa: number; interpretation: string } {
  const n = rater1.length;
  const categories = [...new Set([...rater1, ...rater2])];

  // Observed agreement
  let agreed = 0;
  for (let i = 0; i < n; i++) {
    if (rater1[i] === rater2[i]) agreed++;
  }
  const observedAgreement = agreed / n;

  // Expected agreement by chance
  let expectedAgreement = 0;
  for (const cat of categories) {
    const p1 = rater1.filter((r) => r === cat).length / n;
    const p2 = rater2.filter((r) => r === cat).length / n;
    expectedAgreement += p1 * p2;
  }

  const kappa = (observedAgreement - expectedAgreement) / (1 - expectedAgreement);

  const interpretation =
    kappa >= 0.81 ? "Almost perfect agreement" :
    kappa >= 0.61 ? "Substantial agreement" :
    kappa >= 0.41 ? "Moderate agreement" :
    kappa >= 0.21 ? "Fair agreement" :
    "Poor agreement — revise guidelines";

  return { kappa, interpretation };
}

// Usage: validate annotator consistency
const rater1Scores = [5, 4, 3, 5, 2, 4, 3, 5, 4, 3];
const rater2Scores = [5, 4, 4, 5, 2, 3, 3, 5, 4, 3];
const { kappa, interpretation } = cohensKappa(rater1Scores, rater2Scores);
// kappa: 0.72 — "Substantial agreement"

Crowd Workers vs. Domain Experts

Crowd Workers
  • + Cheap ($0.10-$1 per annotation)
  • + Fast (hundreds of labels per day)
  • + Good for general quality, formatting, clarity
  • - Can't assess factual accuracy in specialized domains
  • - Quality varies — need attention checks and redundancy
Domain Experts
  • + Can verify factual correctness
  • + Catch subtle domain-specific errors
  • + Higher trust for safety-critical evaluations
  • - Expensive ($50-$200/hr)
  • - Limited availability and throughput

Best practice: Use crowd workers for general quality evaluation and formatting checks. Reserve domain experts for factual accuracy verification and safety-critical assessments. Use expert labels to calibrate your LLM-as-judge, then scale with automated evals.

Cost Optimization Strategy

Human eval is expensive. Optimize by using it strategically: label 100 cases with experts to calibrate your LLM judge, then run the LLM judge on 10,000 cases. Periodically spot-check the LLM judge against new human labels to detect drift. This gives you expert-quality evaluation at automated-eval prices.
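The periodic spot-check can be as simple as mean absolute error between fresh human labels and judge scores on the same sample — a sketch, with the 0.5 threshold chosen purely for illustration:

```typescript
// Spot-check judge drift: mean absolute error between judge scores and
// fresh human labels on the same cases (both on the same 1-5 scale).
function judgeDrift(
  judgeScores: number[],
  humanScores: number[],
  maxMae = 0.5, // illustrative threshold — tune per application
): { mae: number; drifted: boolean } {
  if (judgeScores.length !== humanScores.length || judgeScores.length === 0) {
    throw new Error("Need paired, non-empty score arrays");
  }
  const mae =
    judgeScores.reduce((sum, s, i) => sum + Math.abs(s - humanScores[i]), 0) /
    judgeScores.length;
  return { mae, drifted: mae > maxMae };
}

// If drifted, re-run judge calibration before trusting new eval results.
```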

6

Eval Frameworks

promptfoo, Braintrust, LangSmith, custom solutions

You don't have to build your eval system from scratch. Several mature frameworks provide dataset management, scoring functions, CI/CD integration, and result visualization out of the box. The right choice depends on your stack, team size, and specific requirements.

The most important thing is to start evaluating — any framework is better than no framework. You can always migrate later. Pick the tool that has the lowest friction for your team today.

Framework Comparison

Open Source

promptfoo

CLI-first eval framework that lets you define test cases in YAML, run evals against multiple providers, and compare results side-by-side. Excellent for prompt iteration and CI/CD integration.

Key Features

  • YAML-based test case definition
  • Multi-provider comparison (OpenAI, Anthropic, local models)
  • Built-in assertion types (contains, similar, llm-rubric)
  • CI/CD integration with GitHub Actions
  • Web UI for viewing results
  • Red-teaming and adversarial testing

Best For

Teams that want a fast, config-driven eval workflow. Best for prompt comparison and regression testing. Strongest CI/CD integration.

Pricing

Free and open-source. Cloud dashboard available.

Enterprise

Braintrust

Production-grade eval and observability platform with dataset management, experiment tracking, scoring functions, and real-time logging. Used by Scale AI, Notion, and Stripe.

Key Features

  • Managed dataset storage and versioning
  • Custom scoring functions in TypeScript/Python
  • Experiment tracking with automatic comparison
  • Real-time production logging and monitoring
  • Human annotation workflows
  • SDK for TypeScript, Python, and REST API

Best For

Teams that need end-to-end eval infrastructure. Best for organizations that want managed dataset management and production monitoring alongside evals.

Pricing

Free tier available. Paid plans for teams.

LangChain Ecosystem

LangSmith

Tracing and evaluation platform from the LangChain team. Deep integration with LangChain/LangGraph, but works with any LLM framework. Strong tracing and debugging capabilities.

Key Features

  • End-to-end tracing of LLM chains and agents
  • Dataset management with annotation queues
  • Built-in and custom evaluators
  • Comparison experiments
  • Online evaluation (production monitoring)
  • Deep LangChain/LangGraph integration

Best For

Teams already using LangChain or LangGraph. Best for complex agent workflows that need detailed tracing alongside evaluation.

Pricing

Free tier for developers. Paid plans for teams.

Build Your Own

Custom Solution

Build your own eval framework tailored to your specific needs. Full control over every aspect of the eval pipeline. Best when existing tools don't fit your workflow or you need deep integration with internal systems.

Key Features

  • Complete control over eval logic
  • Deep integration with your stack
  • Custom scoring functions for domain-specific needs
  • No vendor lock-in
  • Custom reporting and dashboards
  • Optimized for your specific use case

Best For

Teams with unique eval requirements, heavy compliance needs, or existing internal tooling. Best when you need full control and can invest engineering time.

Pricing

Engineering time. Typically 2-4 weeks for a basic framework.

promptfoo: Config-Driven Evals

promptfoo uses a YAML config to define prompts, providers, test cases, and assertions. Run with npx promptfoo eval to compare prompts side-by-side across multiple models.

promptfooconfig.yaml
# promptfoo config: promptfooconfig.yaml
prompts:
  - id: current
    raw: "You are a helpful assistant. Answer: {{query}}"
  - id: candidate
    raw: |
      You are a customer support agent for Acme Corp.
      Answer the user's question accurately and concisely.
      If unsure, say so rather than guessing.
      Question: {{query}}

providers:
  - openai:gpt-4o
  - anthropic:messages:claude-sonnet-4-20250514

tests:
  - vars:
      query: "What is your refund policy?"
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "Response accurately describes the refund policy"
      - type: not-contains
        value: "I don't know"

  - vars:
      query: "How do I cancel my subscription?"
    assert:
      - type: contains
        value: "settings"
      - type: similar
        value: "Go to Settings > Subscription > Cancel"
        threshold: 0.7

  - vars:
      query: "Can I get a refund for a digital purchase?"
    assert:
      - type: llm-rubric
        value: "Response correctly states digital purchases are credit-only"
      - type: javascript
        value: "output.length < 500"

Braintrust: Programmatic Evals

Braintrust provides a TypeScript/Python SDK for defining eval tasks with custom scoring functions. Results are tracked as experiments with automatic comparison against previous runs.

braintrust-eval.ts
// Braintrust eval with custom scoring
// (loadDataset, llm, and SYSTEM_PROMPT are your own helpers)
import { Eval } from "braintrust";
import { Factuality } from "autoevals"; // Braintrust's scorer library

Eval("customer-support-qa", {
  data: () => loadDataset("datasets/support-evals.jsonl"),

  task: async (input) => {
    const response = await llm.generate({
      system: SYSTEM_PROMPT,
      messages: [{ role: "user", content: input.query }],
    });
    return response;
  },

  scores: [
    // Built-in scorer: factual accuracy via LLM judge
    Factuality,

    // Custom scorer: check required content
    (args) => {
      const hasRequiredInfo = args.input.requiredKeywords
        .every((kw: string) => args.output.toLowerCase().includes(kw));
      return {
        name: "contains_required_info",
        score: hasRequiredInfo ? 1 : 0,
      };
    },

    // Custom scorer: response length check
    (args) => ({
      name: "conciseness",
      score: args.output.length < 500 ? 1 : args.output.length < 800 ? 0.5 : 0,
    }),
  ],
});

Custom: Build What You Need

A custom eval framework can be surprisingly simple — a dataset loader, a set of evaluators, and a report generator. Start minimal and add complexity as your needs grow.

custom-eval-framework.ts
// Custom eval framework — minimal but effective
// (loadDataset, generateResponse, and generateReport are your own helpers)

interface EvalConfig {
  name: string;
  dataset: string;
  evaluators: Evaluator[];
  thresholds: { overall: number; perCategory: Record<string, number> };
}

interface Evaluator {
  name: string;
  weight: number;
  evaluate: (input: string, output: string, expected?: string) => Promise<number>;
}

async function runEvalSuite(config: EvalConfig): Promise<EvalReport> {
  const dataset = await loadDataset(config.dataset);
  const results: EvalResult[] = [];

  for (const testCase of dataset) {
    const output = await generateResponse(testCase.input);
    const scores: Record<string, number> = {};

    for (const evaluator of config.evaluators) {
      scores[evaluator.name] = await evaluator.evaluate(
        testCase.input,
        output,
        testCase.expected,
      );
    }

    const overall = config.evaluators.reduce(
      (sum, ev) => sum + scores[ev.name] * ev.weight, 0
    );

    results.push({ testCase: testCase.id, scores, overall, output });
  }

  return generateReport(results, config.thresholds);
}
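To make the skeleton concrete, here is a hypothetical pair of evaluators plugged into that shape: a keyword check and a length check, combined by weight. The evaluator names, weights, and the sample case are invented for illustration, and the `Evaluator` interface is restated so the snippet runs standalone.

```typescript
// Hypothetical evaluators in the Evaluator shape defined above.
interface Evaluator {
  name: string;
  weight: number;
  evaluate: (input: string, output: string, expected?: string) => Promise<number>;
}

const keywordEvaluator: Evaluator = {
  name: "mentions_refund_window",
  weight: 0.7,
  evaluate: async (_input, output) =>
    output.toLowerCase().includes("30 days") ? 1 : 0,
};

const concisenessEvaluator: Evaluator = {
  name: "conciseness",
  weight: 0.3,
  evaluate: async (_input, output) => (output.length < 500 ? 1 : 0),
};

// Weighted combination, as in runEvalSuite above
async function scoreCase(
  evaluators: Evaluator[],
  input: string,
  output: string,
): Promise<{ scores: Record<string, number>; overall: number }> {
  const scores: Record<string, number> = {};
  for (const ev of evaluators) {
    scores[ev.name] = await ev.evaluate(input, output);
  }
  const overall = evaluators.reduce(
    (sum, ev) => sum + scores[ev.name] * ev.weight,
    0,
  );
  return { scores, overall };
}

scoreCase(
  [keywordEvaluator, concisenessEvaluator],
  "What is your refund policy?",
  "Refunds are available within 30 days of purchase.",
).then((r) => console.log(r.overall.toFixed(2))); // "1.00"
```

Evaluators stay async so the same interface covers cheap string checks and LLM-judge calls alike.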

How to Choose


Need fast prompt comparison in CI?

promptfoo — config-driven, CLI-first, excellent CI/CD integration.


Need managed datasets and production monitoring?

Braintrust — end-to-end platform with experiment tracking and logging.


Already using LangChain and need tracing?

LangSmith — deep integration with the LangChain ecosystem plus evaluation.


Have unique requirements or compliance needs?

Custom solution — full control, no vendor lock-in, tailored to your workflow.

7

Regression Testing for LLMs

Catch quality drops before they ship

Regression testing is the practice of comparing new LLM outputs against a known baseline to detect quality drops before they reach production. It is the most critical eval practice for teams that ship frequently — every prompt change, model update, or parameter tweak can silently degrade quality.

In traditional software, unit tests prevent code regressions. In LLM systems, eval baselines serve the same purpose. Without them, you are flying blind every time you deploy.

What Causes LLM Regressions?

Prompt Changes

Modifying system prompts, few-shot examples, or instruction formatting. The most common source of regressions — fixing one edge case often breaks three others.

Every PR that touches prompts

Model Version Updates

LLM providers ship new model versions that change behavior. GPT-4o-2024-08-06 may behave differently from GPT-4o-2024-05-13 in subtle ways.

On every model version change

RAG Index Changes

Adding, removing, or re-indexing documents in your retrieval system. Chunk size changes, embedding model updates, or metadata schema changes.

On every index rebuild

Parameter Tuning

Adjusting temperature, top-p, max tokens, or other inference parameters. Small parameter changes can cascade into large behavior shifts.

On any parameter change

Tool/Function Updates

Changing tool descriptions, adding new tools, modifying tool output schemas. Tool changes alter what the model 'sees' and how it decides to act.

On any tool schema change

Key Insight

Track p5 (5th percentile) in addition to mean score. The mean can improve while worst-case performance silently degrades. A prompt change that improves 80% of cases but makes 5% catastrophically worse is a net negative — and mean score alone won't catch it.
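A quick numeric sketch of this failure mode, using invented score distributions: the candidate's mean rises while its worst cases collapse, and only p5 notices.

```typescript
// Mean vs. p5 on two hypothetical score distributions (20 cases each).
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function p5(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length * 0.05)];
}

// Baseline: every case scores a steady 0.7
const baseline: number[] = Array(20).fill(0.7);

// Candidate: most cases improved to 0.9, but two collapsed to 0.1
const candidate: number[] = [0.1, 0.1, ...Array(18).fill(0.9)];

console.log(mean(baseline).toFixed(2), p5(baseline)); // "0.70" 0.7
console.log(mean(candidate).toFixed(2), p5(candidate)); // "0.82" 0.1
// Mean says "ship it" (+0.12); p5 says a slice of users just got a much worse product.
```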

Creating and Managing Baselines

A baseline is a snapshot of your system's eval scores at a known-good state. Every candidate change is compared against this baseline. When a candidate passes, it becomes the new baseline for future comparisons.

baseline-management.ts
// Baseline management for regression detection
interface Baseline {
  id: string;
  promptVersion: string;
  modelVersion: string;
  timestamp: string;
  results: Map<string, CaseResult>;
  aggregates: {
    mean: number;
    median: number;
    p5: number;   // 5th percentile — worst-case performance
    p95: number;  // 95th percentile — best-case performance
    passRate: number;
  };
}

interface CaseResult {
  caseId: string;
  score: number;
  passed: boolean;
  output: string;
  evalDetails: Record<string, number>;
}

async function createBaseline(
  promptVersion: string,
  dataset: EvalCase[],
): Promise<Baseline> {
  const results = new Map<string, CaseResult>();

  for (const testCase of dataset) {
    const output = await generateResponse(testCase.input, promptVersion);
    const evalResult = await evaluate(output, testCase);

    results.set(testCase.id, {
      caseId: testCase.id,
      score: evalResult.score,
      passed: evalResult.passed,
      output,
      evalDetails: evalResult.details,
    });
  }

  const scores = [...results.values()].map((r) => r.score).sort((a, b) => a - b);

  return {
    id: `baseline-${Date.now()}`,
    promptVersion,
    modelVersion: getCurrentModelVersion(),
    timestamp: new Date().toISOString(),
    results,
    aggregates: {
      mean: average(scores),
      median: scores[Math.floor(scores.length / 2)],
      p5: scores[Math.floor(scores.length * 0.05)],
      p95: scores[Math.floor(scores.length * 0.95)],
      passRate: [...results.values()].filter((r) => r.passed).length / results.size,
    },
  };
}

Detecting Regressions

Compare each test case individually against the baseline. Track improvements, regressions, and unchanged cases. Use configurable thresholds to determine what counts as a regression and how many are acceptable before blocking deployment.

regression-detection.ts
// Regression detection: compare candidate against baseline
interface RegressionReport {
  improved: { caseId: string; oldScore: number; newScore: number }[];
  regressed: { caseId: string; oldScore: number; newScore: number }[];
  unchanged: string[];
  newCases: string[];  // Cases in candidate but not in baseline
  aggregateDelta: {
    mean: number;
    p5: number;
    passRate: number;
  };
  verdict: "pass" | "fail" | "review";
}

async function detectRegressions(
  baseline: Baseline,
  candidateResults: Map<string, CaseResult>,
  config: {
    regressionThreshold: number; // Score drop that counts as regression
    maxRegressions: number;      // Max allowed regressions
    minP5: number;               // Minimum acceptable p5 score
  },
): Promise<RegressionReport> {
  const report: RegressionReport = {
    improved: [],
    regressed: [],
    unchanged: [],
    newCases: [],
    aggregateDelta: { mean: 0, p5: 0, passRate: 0 },
    verdict: "pass",
  };

  for (const [caseId, candidateResult] of candidateResults) {
    const baselineResult = baseline.results.get(caseId);

    if (!baselineResult) {
      report.newCases.push(caseId);
      continue;
    }

    const delta = candidateResult.score - baselineResult.score;

    if (delta > config.regressionThreshold) {
      report.improved.push({
        caseId, oldScore: baselineResult.score, newScore: candidateResult.score,
      });
    } else if (delta < -config.regressionThreshold) {
      report.regressed.push({
        caseId, oldScore: baselineResult.score, newScore: candidateResult.score,
      });
    } else {
      report.unchanged.push(caseId);
    }
  }

  // Calculate aggregate deltas
  const candidateScores = [...candidateResults.values()].map((r) => r.score).sort((a, b) => a - b);
  const candidateP5 = candidateScores[Math.floor(candidateScores.length * 0.05)];

  report.aggregateDelta = {
    mean: average(candidateScores) - baseline.aggregates.mean,
    p5: candidateP5 - baseline.aggregates.p5,
    passRate: (candidateResults.size > 0
      ? [...candidateResults.values()].filter((r) => r.passed).length / candidateResults.size
      : 0) - baseline.aggregates.passRate,
  };

  // Determine verdict
  if (report.regressed.length > config.maxRegressions) {
    report.verdict = "fail";
  } else if (candidateP5 < config.minP5) {
    report.verdict = "fail";
  } else if (report.regressed.length > 0) {
    report.verdict = "review"; // Some regressions but within tolerance
  }

  return report;
}

Quality Gate Recommendations

Hard Block

  • p5 score drops below threshold
  • Any safety eval fails
  • Pass rate drops more than 5%
  • More than 5 individual regressions

Requires Review

  • 1-5 individual regressions
  • Mean score drops but p5 holds
  • New test cases with low scores
  • Significant output style changes

Auto-Approve

  • Zero regressions
  • Mean and p5 improved or stable
  • Pass rate maintained or improved
  • All safety evals pass

The Regression Testing Workflow

1

Establish a baseline at your current known-good state

2

Run the same eval suite against the candidate change

3

Compare case-by-case and aggregate metrics against the baseline

4

Block deployment if regressions exceed thresholds

5

On successful deployment, promote candidate scores as the new baseline

6

Repeat for every change to prompts, models, retrieval, or parameters

8

Eval-Driven Development

Write the eval first, then improve the prompt

Eval-Driven Development (EDD) is the LLM equivalent of Test-Driven Development. The core idea: write the eval before you fix the prompt, then iterate until the eval passes. This prevents goalpost-shifting, captures institutional knowledge, and ensures every improvement is measurable.

Just as TDD transformed software quality by making tests a first-class artifact, EDD transforms LLM product quality by making evals the foundation of the development workflow — not an afterthought.

The EDD Cycle

1

Identify the Problem

Start with a concrete quality issue — user complaints, production failures, or a new capability requirement. Define what 'fixed' looks like in measurable terms.

Example: Users report that the chatbot gives wrong refund amounts for multi-item orders.
2

Write the Eval Cases

Before touching the prompt, create eval cases that test the specific problem. Include positive examples (what good looks like), negative examples (current failure), and edge cases.

Example: Create 10 test cases: 3 multi-item orders, 2 single-item, 2 partial refunds, 3 edge cases (free items, discounts, bundles).
3

Measure the Baseline

Run the new eval cases against the current prompt to establish a baseline score. This quantifies how bad the problem actually is and sets a target for improvement.

Example: Baseline: 3/10 cases pass (30%). Multi-item orders fail consistently. Target: 9/10 (90%).
4

Iterate with Eval Feedback

Modify the prompt and run the eval suite after each change. The eval score is your compass — it tells you if each change helped, hurt, or had no effect.

Example: v1: Add multi-item instructions → 5/10. v2: Add calculation examples → 7/10. v3: Add edge case handling → 9/10.
5

Run Full Regression Suite

Once the new eval passes, run the full regression suite to ensure you didn't break anything else. Only deploy if both the new eval and all existing evals pass.

Example: New eval: 9/10 pass. Full regression suite: 0 regressions detected. Safe to deploy.
6

Promote to Permanent Suite

Add the new eval cases to your permanent regression suite. They become a safety net that prevents this specific problem from ever recurring.

Example: 10 multi-item refund cases added to regression suite. Future prompt changes will be tested against these.

Key Insight

The most important moment in EDD is Step 2 — writing the eval before touching the prompt. This forces you to define “done” objectively, prevents the goalpost from shifting during iteration, and creates permanent regression tests that protect future changes.

EDD vs. Ad-Hoc Prompt Engineering

Aspect | Ad-Hoc | Eval-Driven
Approach | Change prompt, eyeball a few examples | Write eval cases, measure, iterate with data
Confidence | "I think it's better" | "Score improved 3.2 → 4.1, zero regressions"
Regression Risk | Unknown — discovered by users | Quantified — blocked by automated checks
Knowledge Capture | In engineer's head, lost when they leave | In eval suite, permanent institutional knowledge
Iteration Speed | Fast at first, slow when things break | Slightly slower at first, compounds over time
Scalability | Breaks at 10+ prompt variations | Scales to hundreds of test cases

The EDD Workflow in Code

eval-driven-development.ts
// Eval-Driven Development workflow
interface EDDConfig {
  issue: string;
  evalCases: EvalCase[];
  currentPrompt: string;
  maxIterations: number;
  targetScore: number;
}

async function evalDrivenDevelopment(config: EDDConfig) {
  // Step 1: Add new eval cases to the suite
  const fullSuite = await loadEvalDataset("regression-suite");
  const expandedSuite = [...fullSuite, ...config.evalCases];

  // Step 2: Measure baseline on new cases
  const baselineResults = await runEvals(config.currentPrompt, config.evalCases);
  console.log(`Baseline: ${baselineResults.passRate * 100}% pass rate`);
  console.log(`Target: ${config.targetScore * 100}% pass rate`);

  // Step 3: Iterate with eval feedback
  let bestPrompt = config.currentPrompt;
  let bestScore = baselineResults.overallScore;
  const history: { prompt: string; score: number; delta: number }[] = [];

  for (let i = 0; i < config.maxIterations; i++) {
    // Generate a prompt variant (can be manual or LLM-assisted)
    const candidate = await generatePromptVariant(bestPrompt, {
      issue: config.issue,
      failingCases: baselineResults.failures,
      iteration: i,
    });

    // Run eval on new cases
    const candidateResults = await runEvals(candidate, config.evalCases);

    // Run full regression suite to check for side effects
    const regressionResults = await runEvals(candidate, fullSuite);

    const improved = candidateResults.overallScore > bestScore;
    const noRegressions = regressionResults.regressions.length === 0;

    history.push({
      prompt: candidate,
      score: candidateResults.overallScore,
      delta: candidateResults.overallScore - bestScore,
    });

    if (improved && noRegressions) {
      bestPrompt = candidate;
      bestScore = candidateResults.overallScore;
      console.log(`Iteration ${i + 1}: Improved to ${bestScore}`);
    }

    // Early exit if target reached
    if (bestScore >= config.targetScore) {
      console.log(`Target reached at iteration ${i + 1}`);
      break;
    }
  }

  // Step 4: Deploy if improved
  if (bestScore > baselineResults.overallScore) {
    await deploy(bestPrompt);
    // Promote new cases to permanent regression suite
    await saveEvalDataset("regression-suite", expandedSuite);
    console.log(`Deployed. Score: ${baselineResults.overallScore} → ${bestScore}`);
  }

  return { bestPrompt, bestScore, history };
}
Warning

Avoid Eval Overfitting

If you iterate on the same eval cases for too long, your prompt becomes over-fitted to those specific cases — it passes the eval but fails on unseen production queries. Mitigate this by maintaining a held-out test set that you never look at during development. Run it as a final gate before deployment.
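One way to keep the held-out set honest is to split by a hash of the case ID rather than by hand: the split is deterministic across runs, and nobody gets to choose which cases land in the holdout. A sketch (the 20% ratio and the `EvalCase` shape are arbitrary choices):

```typescript
// Deterministic dev/holdout split keyed on a hash of each case ID.
import { createHash } from "node:crypto";

interface EvalCase {
  id: string;
  input: string;
  expected?: string;
}

function splitDataset(
  cases: EvalCase[],
  holdoutRatio = 0.2,
): { dev: EvalCase[]; holdout: EvalCase[] } {
  const dev: EvalCase[] = [];
  const holdout: EvalCase[] = [];
  for (const c of cases) {
    // First byte of md5(id) gives a stable bucket in [0, 256)
    const bucket = createHash("md5").update(c.id).digest()[0];
    (bucket / 256 < holdoutRatio ? holdout : dev).push(c);
  }
  return { dev, holdout };
}

const cases: EvalCase[] = Array.from({ length: 100 }, (_, i) => ({
  id: `case-${i}`,
  input: `query ${i}`,
}));

const { dev, holdout } = splitDataset(cases);
console.log(dev.length + holdout.length); // 100; every case lands in exactly one bucket
```

Because the split depends only on the ID, new cases added later route themselves consistently, and the holdout never needs manual curation.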

The EDD Mantra

Write the eval. Measure the baseline. Iterate with data. Deploy with confidence. Every production failure becomes a new eval case. Every eval case prevents that failure from recurring. Over time, your eval suite becomes the most valuable artifact in your LLM engineering workflow — it is your team's accumulated knowledge of what can go wrong and what “good” looks like.

9

CI/CD for LLM Apps

Evals in your deployment pipeline

The highest-impact eval practice is making evals automatic and unavoidable. When evals run on every PR that touches prompts, post results as comments, and block merges on regressions — quality becomes a system property, not a personal discipline.

Integrating evals into CI/CD transforms them from something engineers remember to run into something that runs automatically on every change. This is the difference between a demo and a product.

The CI/CD Eval Pipeline

1

Detect

< 5 seconds

Identify which files changed. If prompts, eval datasets, or model configs changed, trigger the eval pipeline. Skip for unrelated changes.

2

Fast Evals

< 30 seconds

Run assertion-based evals: format checks, keyword presence, length constraints, schema validation. These are deterministic and free — no LLM calls needed.

3

Full Evals

2-10 minutes

Run LLM-as-judge and semantic similarity evals against the full regression suite. Compare against the stored baseline. Compute score deltas and regression counts.

4

Report

< 10 seconds

Post eval results as a PR comment with score deltas, regression details, and pass/fail status. Make results visible and actionable in the code review workflow.

5

Gate

Immediate

Block merge if evals fail. Auto-approve if all checks pass. Require manual review for borderline cases (some regressions but within tolerance).

Key Insight

Layer your CI evals by cost and speed. Fast evals (assertions, format checks) run on every PR in under 30 seconds. Full evals (LLM-as-judge) run only when fast evals pass, taking 2-10 minutes. This keeps CI fast for non-prompt changes while ensuring thorough evaluation for prompt changes.
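The fast tier can be plain predicates with no LLM calls at all. A sketch of the kind of checks that layer might run; the specific rules here are invented:

```typescript
// Fast, deterministic assertion checks: the free tier of a layered CI pipeline.
interface AssertionResult {
  name: string;
  passed: boolean;
}

function runFastAssertions(output: string): AssertionResult[] {
  return [
    { name: "non_empty", passed: output.trim().length > 0 },
    { name: "max_length", passed: output.length <= 1200 },
    // Unrendered template variables mean a prompt bug upstream
    { name: "no_raw_template_vars", passed: !/\{\{.*?\}\}/.test(output) },
    { name: "no_boilerplate_apology", passed: !/as an ai language model/i.test(output) },
  ];
}

const results = runFastAssertions("Go to Settings > Subscription > Cancel.");
console.log(results.every((r) => r.passed)); // true
```

Checks like these run in milliseconds per case, so they can gate every PR before any judge model is invoked.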

GitHub Actions Configuration

This workflow triggers only when prompt files, eval datasets, or LLM config files change. Fast evals run first — if they fail, the expensive full eval suite is skipped. Results are posted as a PR comment with score deltas.

.github/workflows/llm-evals.yml
# .github/workflows/llm-evals.yml
name: LLM Eval Suite

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'evals/**'
      - 'src/lib/llm-config.ts'

env:
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

jobs:
  fast-evals:
    name: Fast Evals (Assertions)
    runs-on: ubuntu-latest
    timeout-minutes: 2
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20

      - run: npm ci
      - run: npx tsx evals/run-assertions.ts
        env:
          EVAL_DATASET: evals/datasets/latest.json
          FAIL_THRESHOLD: 0.95

  full-evals:
    name: Full Eval Suite (LLM-as-Judge)
    needs: fast-evals
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20

      - run: npm ci

      # Run full eval suite
      - run: npx tsx evals/run-full-suite.ts
        id: evals
        env:
          EVAL_DATASET: evals/datasets/latest.json
          BASELINE_PATH: evals/baselines/latest.json
          OUTPUT_PATH: evals/results/current.json

      # Post results as PR comment
      - uses: actions/github-script@v7
        with:
          script: |
            // formatEvalComment is the helper shown in the PR Comment Reporting section
            const results = require('./evals/results/current.json');
            const body = formatEvalComment(results);
            await github.rest.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body,
            });

      # Fail if below threshold
      - run: |
          node -e "
            const r = require('./evals/results/current.json');
            if (r.regressions > 0 || r.overallScore < 0.9) {
              console.error('Eval failed:', JSON.stringify(r.summary));
              process.exit(1);
            }
          "

CI Eval Runner

The eval runner partitions test cases by type and runs them in phases. Functional evals gate safety evals, which gate quality evals. Each phase can short-circuit if thresholds aren't met, saving time and API costs.

ci-eval-runner.ts
// CI eval runner with reporting
interface CIEvalConfig {
  datasetPath: string;
  baselinePath: string;
  outputPath: string;
  thresholds: {
    fastEvalPassRate: number;    // e.g., 0.95
    fullEvalMinScore: number;    // e.g., 0.85
    maxRegressions: number;      // e.g., 3
    safetyPassRate: number;      // e.g., 1.0 (zero tolerance)
  };
}

async function runCIEvals(config: CIEvalConfig) {
  const dataset = await loadDataset(config.datasetPath);
  const baseline = await loadBaseline(config.baselinePath);

  // Partition test cases by type
  const functional = dataset.filter((c) => c.type === "functional");
  const quality = dataset.filter((c) => c.type === "quality");
  const safety = dataset.filter((c) => c.type === "safety");

  // Phase 1: Fast functional evals (parallel)
  const functionalResults = await Promise.all(
    functional.map((c) => runFunctionalEval(c))
  );
  const functionalPassRate = functionalResults.filter((r) => r.passed).length / functional.length;

  if (functionalPassRate < config.thresholds.fastEvalPassRate) {
    return { status: "fail", phase: "functional", passRate: functionalPassRate };
  }

  // Phase 2: Safety evals (zero tolerance)
  const safetyResults = await Promise.all(
    safety.map((c) => runSafetyEval(c))
  );
  const safetyPassRate = safetyResults.filter((r) => r.passed).length / safety.length;

  if (safetyPassRate < config.thresholds.safetyPassRate) {
    return { status: "fail", phase: "safety", passRate: safetyPassRate };
  }

  // Phase 3: Full quality evals with baseline comparison
  const qualityResults = await Promise.all(
    quality.map((c) => runQualityEval(c))
  );

  const regressions = detectRegressions(baseline, qualityResults);
  const overallScore = average(qualityResults.map((r) => r.score));

  // Save results for reporting
  const report = {
    status: regressions.length > config.thresholds.maxRegressions ? "fail" : "pass",
    functionalPassRate,
    safetyPassRate,
    overallScore,
    regressions: regressions.length,
    improved: qualityResults.filter((r) => r.improved).length,
    details: { functionalResults, safetyResults, qualityResults },
  };

  await saveResults(config.outputPath, report);
  return report;
}

PR Comment Reporting

Making eval results visible in the PR review workflow is critical. Engineers should see score deltas, regressions, and improvements at a glance — without leaving the code review interface.

pr-comment-format.ts
// Format eval results as a PR comment
const pct = (n: number) => `${(n * 100).toFixed(1)}%`;
const badge = (ok: boolean) => (ok ? ":white_check_mark:" : ":x:");

function formatEvalComment(results: EvalReport): string {
  const status = results.status === "pass" ? "PASSED" : "FAILED";
  const emoji = results.status === "pass" ? "white_check_mark" : "x";

  return `## LLM Eval Results: ${status} :${emoji}:

| Metric | Score | Threshold | Status |
|--------|-------|-----------|--------|
| Functional Pass Rate | ${pct(results.functionalPassRate)} | 95% | ${badge(results.functionalPassRate >= 0.95)} |
| Safety Pass Rate | ${pct(results.safetyPassRate)} | 100% | ${badge(results.safetyPassRate >= 1.0)} |
| Quality Score | ${results.overallScore.toFixed(2)}/5.0 | 4.0 | ${badge(results.overallScore >= 4.0)} |
| Regressions | ${results.regressions} | < 3 | ${badge(results.regressions < 3)} |
| Improvements | ${results.improved} | — | :chart_with_upwards_trend: |

${results.regressions > 0 ? `
### Regressed Cases
${results.details.qualityResults
  .filter((r: QualityResult) => r.regressed)
  .map((r: QualityResult) => `- **${r.caseId}**: ${r.oldScore} → ${r.newScore} (Δ ${r.delta})`)
  .join("\n")}
` : ""}

<details>
<summary>Full Results (${results.details.qualityResults.length} cases)</summary>
// ... detailed per-case results
</details>

---
*Eval suite v${results.datasetVersion} | Baseline v${results.baselineVersion}*`;
}

Managing CI Eval Costs

Path-based triggering

Only run evals when prompt files, eval datasets, or model config changes. Use GitHub Actions path filters to skip unrelated PRs.

Tiered execution

Run cheap assertion evals on every PR. Run expensive LLM-as-judge evals only when assertions pass. Run full regression suites nightly.

Caching

Cache eval results for unchanged test cases. If the prompt didn't change, previous results are still valid. Only re-evaluate changed components.
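A workable cache key is a hash over everything that can change the output. A sketch, assuming prompt version, model version, temperature, and case input are the variables your pipeline actually varies:

```typescript
import { createHash } from "node:crypto";

// If none of these inputs changed, the previous eval result is still valid.
interface EvalCacheKeyParts {
  promptVersion: string;
  modelVersion: string;
  temperature: number;
  caseInput: string;
}

function evalCacheKey(p: EvalCacheKeyParts): string {
  return createHash("sha256")
    .update(JSON.stringify([p.promptVersion, p.modelVersion, p.temperature, p.caseInput]))
    .digest("hex");
}

const cache = new Map<string, number>(); // key -> cached score

const key: EvalCacheKeyParts = {
  promptVersion: "v12",
  modelVersion: "gpt-4o-2024-08-06",
  temperature: 0,
  caseInput: "What is your refund policy?",
};

cache.set(evalCacheKey(key), 0.9);
console.log(cache.get(evalCacheKey(key))); // 0.9 (hit: nothing changed)
console.log(cache.get(evalCacheKey({ ...key, promptVersion: "v13" }))); // undefined (miss: prompt changed)
```

Note this only caches the model-generation side; if the judge prompt changes, its version belongs in the key too.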

Concurrency limits

Run eval API calls concurrently (10-20 parallel) but rate-limit to avoid provider throttling. Most eval suites of 100 cases complete in 2-5 minutes.
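A dependency-free concurrency limiter is short enough to inline. A sketch: run at most `limit` eval calls in flight at once, preserving result order:

```typescript
// Map over items with at most `limit` promises in flight at a time.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    // Claiming an index happens synchronously between awaits, so it is race-free
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}

// Example: "evaluate" 50 cases with at most 10 in flight
mapWithConcurrency(
  Array.from({ length: 50 }, (_, i) => i),
  10,
  async (i) => i * 2,
).then((out) => console.log(out.length, out[49])); // 50 98
```

In a real suite, `fn` would wrap the judge API call plus retry/backoff; libraries like p-limit do the same job if you prefer a dependency.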

The CI/CD Eval Principle

If a human has to remember to run evals, they won't. Put evals in CI, post results on PRs, block merges on regressions, and track scores over time. Make quality automatic and unavoidable — not optional and forgettable.

10

Interactive Examples

See eval patterns in action with live code

Each example shows a bad pattern and its eval-engineered fix. Toggle between them to understand the difference.

Automated Eval

LLM-as-Judge Eval

Using an LLM to grade another LLM's output

Vague grading with no rubric
// BAD: No rubric, no structure, unreliable scores
async function evalResponse(response: string) {
  const grade = await llm.generate({
    system: "Rate this response from 1-10.",
    messages: [
      { role: "user", content: response },
    ],
  });

  // Returns inconsistent scores, no reasoning
  // "7" one time, "I'd give it a 7/10" the next
  return grade;
}

Why this fails

Without a rubric, the judge LLM produces inconsistent scores. No structured output means parsing fails randomly. No reasoning means you can't debug why a score was given.
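The fixed version pairs an explicit rubric with machine-parseable output. The judge call itself can't run here, so `llm.generate` stays a placeholder, but the rubric and the strict parser around it are ordinary code:

```typescript
// GOOD (sketch): explicit rubric, JSON-only output, strict parsing with a failure path.
const JUDGE_RUBRIC = `You are grading a customer-support response.
Score each dimension from 1-5:
- accuracy: factually correct per the provided policy
- completeness: addresses every part of the question
- tone: professional and concise
Respond with ONLY a JSON object:
{"accuracy": <1-5>, "completeness": <1-5>, "tone": <1-5>, "reasoning": "<one sentence>"}`;

interface JudgeScore {
  accuracy: number;
  completeness: number;
  tone: number;
  reasoning: string;
}

function parseJudgeOutput(raw: string): JudgeScore | null {
  // Tolerate judges that wrap the JSON in prose or code fences; reject anything else
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    const parsed = JSON.parse(match[0]);
    for (const dim of ["accuracy", "completeness", "tone"] as const) {
      const v = parsed[dim];
      if (typeof v !== "number" || v < 1 || v > 5) return null;
    }
    return parsed as JudgeScore;
  } catch {
    return null;
  }
}

// Usage with your actual client (placeholder):
// const raw = await llm.generate({ system: JUDGE_RUBRIC, messages: [{ role: "user", content: response }] });
// const score = parseJudgeOutput(raw); // null means "judge failed", never a guessed score

console.log(parseJudgeOutput('{"accuracy": 4, "completeness": 5, "tone": 4, "reasoning": "ok"}')?.accuracy); // 4
console.log(parseJudgeOutput("I'd give it a 7/10")); // null
```

The key design choice: a malformed judge response becomes an explicit eval failure rather than a silently mis-parsed score.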

All Examples Quick Reference

Automated Eval

LLM-as-Judge Eval

Using an LLM to grade another LLM's output

Functional Eval

Assertion-Based Evals

Deterministic checks for LLM outputs

Dataset Engineering

Eval Dataset Management

Building and maintaining test cases

Regression Testing

Regression Detection

Catching quality drops before they reach users

CI/CD

CI/CD Eval Pipeline

Automated evals in your deployment workflow

EDD Workflow

Ad-Hoc Tweaks vs Eval-Driven Development

Write the eval first, then improve the prompt

LLM-as-Judge

Uncalibrated vs Calibrated LLM Judge

Building reliable automated evaluation with LLM judges

11

Anti-Patterns & Failure Modes

Common eval mistakes and how to avoid them

Knowing what not to do is as important as knowing what to do. These are the most common eval failure modes — patterns that give teams false confidence while quality silently degrades in production.

Critical: Vibe Check Shipping

Deploying prompt changes based on manual spot-checking a few examples instead of systematic evaluation.

Cause

No eval infrastructure in place. Teams treat LLM outputs like they treat UI changes — 'looks good to me' is the approval process.

Symptom

Prompt changes that 'seemed fine' cause subtle quality regressions in production. Users complain about outputs that used to work. No one can quantify if the system is getting better or worse over time.

Fix

Build an eval suite before building features. Even 20 well-chosen test cases with simple assertions is 100x better than eyeballing. Automate these in CI so every PR gets checked.

Critical: Golden Dataset Rot

Eval datasets that were created once and never updated, becoming increasingly disconnected from real-world usage patterns.

Cause

Teams build an initial eval dataset during development but never feed production failures, new edge cases, or shifting user patterns back into the dataset.

Symptom

Evals pass consistently but production quality degrades. The eval dataset tests scenarios from 6 months ago, not what users actually ask today. False confidence from green CI checks.

Fix

Implement a feedback loop: sample production logs weekly, add failing cases from user reports, and run dataset freshness audits monthly. Version your datasets and track coverage metrics.

High: Eval Gaming

Optimizing prompts to pass specific eval cases rather than genuinely improving quality, similar to overfitting in ML.

Cause

Small eval datasets with predictable patterns. Teams iterate on prompts specifically to pass the eval suite rather than to improve general quality.

Symptom

Eval scores keep going up but production quality stays flat or drops. Prompts become over-fitted to the eval cases. New, unseen inputs fail at higher rates.

Fix

Use held-out test sets that prompt engineers never see. Regularly rotate eval cases. Include synthetic and adversarial examples. Monitor production metrics alongside eval scores.

High: Metric Myopia

Over-relying on a single metric (like BLEU score or cosine similarity) that captures only one dimension of quality.

Cause

Teams pick one easy-to-compute metric and treat it as the source of truth. Semantic similarity scores become the only quality gate.

Symptom

High scores on the tracked metric but poor real-world quality. A summarizer scores high on ROUGE but produces summaries that are factually wrong. A chatbot scores high on relevance but is rude.

Fix

Use multi-dimensional evaluation: combine deterministic assertions, semantic metrics, LLM-as-judge with rubrics, and human evaluation. No single metric captures LLM quality.
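As a rough sketch, a multi-dimensional scorer can combine independent quality checks into one result. The function names here (`eval_output`, `judge_score`) are illustrative, not from any framework, and the judge is stubbed out.

```python
# Sketch: multi-dimensional scoring that combines independent quality checks.
# `judge_score` stands in for an LLM-as-judge call.

def judge_score(output: str) -> float:
    """Placeholder for an LLM-as-judge call; returns a 0-1 quality score."""
    return 0.8  # stubbed for the example

def eval_output(output: str, required_keywords: list[str], max_len: int) -> dict:
    # Dimension 1: deterministic assertions (fast, cheap)
    keywords_ok = all(k.lower() in output.lower() for k in required_keywords)
    length_ok = len(output) <= max_len
    # Dimension 2: nuanced quality via the (stubbed) judge
    quality = judge_score(output)
    return {
        "keywords_ok": keywords_ok,
        "length_ok": length_ok,
        "quality": quality,
        "passed": keywords_ok and length_ok and quality >= 0.7,
    }

result = eval_output("Refunds are processed within 5 days.", ["refund"], max_len=200)
print(result["passed"])  # → True
```

A single boolean verdict is still useful for CI, but keeping each dimension's sub-score lets you see which dimension regressed.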

Critical: Eval-less Deployment

Shipping LLM-powered features to production with zero automated evaluation, relying entirely on user feedback as a quality signal.

Cause

Pressure to ship fast. Teams skip evals because they're seen as slow to build. The 'we'll add evals later' promise that never materializes.

Symptom

Every production incident is a surprise. No way to assess impact of model provider changes, prompt updates, or temperature adjustments. Users are the canary in the coal mine.

Fix

Start with the simplest possible eval: 10 test cases, basic assertions, running in CI. Expand from there. Even a minimal eval suite catches 80% of obvious regressions.
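A minimal suite along those lines fits in one file. This sketch assumes a callable `app(prompt)` wrapping your LLM feature; the cases and helper names are illustrative.

```python
# Sketch of a minimal assertion-based eval suite, assuming a callable
# `app(prompt)` that wraps your real LLM feature.

CASES = [
    {"input": "What is your refund policy?", "must_contain": "30 days"},
    {"input": "Do you ship internationally?", "must_contain": "yes"},
    # ... grow to the 10-20 cases that matter most to your users
]

def app(prompt: str) -> str:
    """Stand-in for your real LLM call."""
    return "We offer refunds within 30 days, and yes, we ship worldwide."

def run_evals() -> int:
    failures = 0
    for case in CASES:
        output = app(case["input"])
        if case["must_contain"].lower() not in output.lower():
            failures += 1
            print(f"FAIL: {case['input']!r}")
    return failures

failures = run_evals()
print(f"{failures} failing cases")
# In CI, exit non-zero when failures > 0 so the job fails the build.
```

No framework required: substring assertions and an exit code are enough to start catching obvious regressions on every PR.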

Medium: Judge Bias Blindspot

Using LLM-as-judge without calibrating for known biases: verbosity preference, position bias, self-preference.

Cause

LLM judges have systematic biases — they prefer longer responses, favor the first option presented, and rate their own model's outputs higher. Teams don't test for or correct these biases.

Symptom

Eval results that don't correlate with human judgment. Verbose, padded responses score higher than concise, accurate ones. A/B comparisons always favor whichever option is presented first.

Fix

Calibrate judges against human labels. Randomize option ordering. Use structured rubrics that penalize unnecessary verbosity. Run bias audits on your judge model periodically.
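Randomized ordering for pairwise comparisons can be sketched as below. `pairwise_judge` is a stub standing in for an LLM judge call; here it is deliberately fully position-biased to show how randomization surfaces the bias.

```python
# Sketch: randomize A/B ordering before each pairwise judge call to
# neutralize position bias, then map the verdict back to the real options.
import random

def pairwise_judge(first: str, second: str) -> str:
    """Stand-in for an LLM judge that returns 'first' or 'second'."""
    return "first"  # this stub is fully position-biased

def compare(a: str, b: str, trials: int = 20) -> dict:
    wins = {"a": 0, "b": 0}
    for _ in range(trials):
        if random.random() < 0.5:
            verdict = pairwise_judge(a, b)
            wins["a" if verdict == "first" else "b"] += 1
        else:  # swap the presentation order
            verdict = pairwise_judge(b, a)
            wins["b" if verdict == "first" else "a"] += 1
    return wins

random.seed(0)
print(compare("answer A", "answer B"))
# With a fully position-biased judge, wins split roughly 50/50,
# revealing that the verdicts carry no real signal.
```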

12

Best Practices Checklist

Production-ready eval guidelines

Production-ready eval guidelines distilled from Anthropic, OpenAI, Hamel Husain, Eugene Yan, and the broader AI engineering community.

Eval Dataset Design

Start with 20-50 cases, not 1000

A small, well-curated dataset with clear rubrics beats a large noisy one. You can always expand. Start with the cases that matter most to your users.

Cover the distribution, not just the happy path

Include edge cases, adversarial inputs, multilingual queries, and the long tail of real-world usage. Weight your dataset to match production traffic patterns.

Version your datasets like code

Store eval datasets in version control. Track when cases were added, why, and from what source. This lets you audit eval quality over time.
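One way to make cases auditable is to attach provenance fields to every entry and store them one JSON object per line, which diffs cleanly in version control. The field names and the example ticket reference below are illustrative, not a prescribed schema.

```python
# Sketch: an eval case with provenance fields so the dataset can live in
# version control and be audited. Field names are illustrative.
from dataclasses import dataclass, asdict
import json

@dataclass
class EvalCase:
    input: str
    expected: str
    added_at: str        # ISO date the case entered the dataset
    source: str          # e.g. "user-report", "production-log", "synthetic"
    reason: str          # why this case matters

case = EvalCase(
    input="Cancel my subscription",
    expected="confirmation with cancellation steps",
    added_at="2024-05-01",
    source="user-report",
    reason="hypothetical ticket: model refused a valid cancellation request",
)

# One JSON object per line (JSONL) diffs cleanly in version control
print(json.dumps(asdict(case)))
```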

Feed production failures back into evals

Every user complaint, thumbs-down, or escalation is a potential eval case. Build a pipeline that samples production failures into your eval dataset weekly.

Automated Evaluation

Layer your assertions: deterministic first, LLM-judge last

Check format, length, and keyword presence with fast deterministic checks. Use semantic similarity as a middle layer. Reserve expensive LLM-as-judge for nuanced quality assessment.
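The layering can be implemented as a short-circuiting cascade: cheap deterministic checks run first, and the expensive judge only runs for outputs that survive them. This is a sketch; `llm_judge` is a stub and the thresholds are illustrative.

```python
# Sketch: a short-circuiting eval cascade. Cheap deterministic checks run
# first; the expensive judge only runs if they all pass.

def llm_judge(output: str) -> float:
    return 0.9  # stand-in for an expensive LLM-as-judge call

def cascade_eval(output: str) -> tuple[bool, str]:
    # Layer 1: deterministic format/length checks (microseconds)
    if not output.strip():
        return False, "empty output"
    if len(output) > 2000:
        return False, "too long"
    # Layer 2: cheap lexical check
    if "as an ai language model" in output.lower():
        return False, "boilerplate refusal"
    # Layer 3: expensive judge, only reached by survivors
    score = llm_judge(output)
    return score >= 0.7, f"judge score {score}"

print(cascade_eval(""))                 # → (False, 'empty output')
print(cascade_eval("Helpful answer."))  # → (True, 'judge score 0.9')
```

Because most regressions fail a cheap layer, the cascade keeps average cost low while reserving judge tokens for genuinely ambiguous cases.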

Always include reasoning in LLM-as-judge prompts

Require the judge to explain its reasoning before scoring. This improves consistency and lets you debug disagreements between the judge and human reviewers.

Use structured output (Zod/JSON Schema) for eval results

Parse eval results into typed schemas. This prevents the 'I give it a 7 out of 10' vs '7' vs '7/10' parsing problem and ensures consistent downstream processing.
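In Python (without Zod), the same idea can be sketched as a strict validator over the judge's JSON response: anything that deviates from the schema raises instead of being silently mis-parsed. The schema here is illustrative.

```python
# Sketch: require the judge to emit JSON matching a fixed schema, then
# validate it, instead of parsing free text like "I give it a 7 out of 10".
import json

def parse_judge_result(raw: str) -> dict:
    """Validate a judge response against a minimal schema:
    reasoning (str), score (int 1-10). Raises ValueError on any deviation."""
    data = json.loads(raw)
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("missing or non-string 'reasoning'")
    score = data.get("score")
    if not isinstance(score, int) or not 1 <= score <= 10:
        raise ValueError("'score' must be an integer from 1 to 10")
    return data

good = '{"reasoning": "Accurate and concise.", "score": 7}'
print(parse_judge_result(good)["score"])  # → 7

try:
    parse_judge_result('{"reasoning": "ok", "score": "7/10"}')
except ValueError as e:
    print(e)  # the free-text variant is rejected, not silently mis-parsed
```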

Calibrate LLM judges against human labels

Run your LLM judge on a set of human-labeled examples. If judge scores don't correlate with human scores, the judge needs better prompting or a different model.

CI/CD Integration

Run fast evals on every PR, full evals nightly

Assertion-based evals can run in under 2 minutes. Gate PRs on these. Run expensive LLM-as-judge evals nightly or on-demand for prompt changes.

Post eval results as PR comments

Make eval results visible in the PR review workflow. Show score deltas, regressions, and new failures. This makes eval results impossible to ignore.

Block merges on eval regressions

Treat eval failures like test failures. If a prompt change causes more than N regressions or drops the mean score below a threshold, block the merge.
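The gate logic can be sketched as a comparison of per-case scores between the base branch and the PR branch; the thresholds below are illustrative.

```python
# Sketch: a merge gate comparing per-case eval scores between the base
# branch and the PR branch. Thresholds are illustrative.

MAX_REGRESSIONS = 2
MIN_MEAN_SCORE = 0.75

def should_block(base: dict[str, float], pr: dict[str, float]) -> bool:
    regressions = sum(1 for case in base if pr[case] < base[case])
    mean_score = sum(pr.values()) / len(pr)
    return regressions > MAX_REGRESSIONS or mean_score < MIN_MEAN_SCORE

base = {"case1": 0.9, "case2": 0.8, "case3": 0.7, "case4": 0.95}
pr   = {"case1": 0.9, "case2": 0.85, "case3": 0.6, "case4": 0.95}

print(should_block(base, pr))  # → False (1 regression, mean 0.825)
```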

Track eval scores over time as a dashboard

Plot eval scores per prompt version on a time-series dashboard. This shows whether your system is improving, degrading, or plateauing over weeks and months.

Human Evaluation

Use human evals to calibrate automated evals

Have human raters score 50-100 cases. Compare against LLM-as-judge scores. If correlation is below 0.7, revise your judge prompt. Humans are the ground truth.

Create detailed annotation guidelines with examples

Define every score level with concrete examples. Include edge case handling rules. Untrained raters with vague instructions produce unreliable data.

Measure and track inter-rater reliability

Use Cohen's Kappa or Krippendorff's Alpha. If agreement is below 0.6, the task definition is ambiguous — revise guidelines before collecting more labels.
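For two raters, Cohen's Kappa corrects observed agreement for the agreement expected by chance. A minimal implementation over binary pass/fail labels (illustrative data):

```python
# Sketch: Cohen's Kappa for two raters over pass/fail labels.

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items where the raters match
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal frequencies
    p_e = sum(
        (rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels
    )
    return (p_o - p_e) / (1 - p_e)

a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "pass"]

print(round(cohens_kappa(a, b), 2))  # → 0.5
```

A kappa of 0.5 on 6/8 raw agreement shows why chance correction matters: half of that agreement would have happened with random labels.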

Reserve human eval budget for high-stakes decisions

Use automated evals for routine regression testing. Reserve expensive human evaluation for safety-critical assessments, model comparisons, and rubric development.

Eval-Driven Development

Write the eval before writing the prompt

Define what 'good' looks like with test cases and rubrics before you write a single prompt. This is the LLM equivalent of TDD and prevents goalpost-shifting.

Use evals to compare prompt candidates

When iterating on a prompt, run all candidates against the same eval suite. Let data decide which prompt is best, not intuition.

Separate development evals from held-out evals

Keep a set of eval cases that prompt engineers never see during development. Run these as a final gate to detect overfitting to the dev eval set.

Treat eval maintenance as ongoing work, not a one-time task

Allocate 10-20% of your LLM engineering time to eval maintenance. Add new cases, retire stale ones, recalibrate judges, and update rubrics as your product evolves.

The Guiding Principle

Evals are not a testing step at the end — they are the foundation of the entire development workflow. Write the eval first, measure the baseline, iterate with data, and deploy with confidence. The teams that invest the most in their eval infrastructure ship the most reliable AI products.

— Hamel Husain, “Your AI Product Needs Evals”

13

Resources & Further Reading

Tools, blogs, papers, and guides

Essential tools, frameworks, blog posts, research papers, and guides for building production-grade LLM evaluation systems.

Blog: Hamel Husain

Your AI Product Needs Evals

The definitive guide to LLM evals by Hamel Husain. Covers why evals matter, how to build them, and common mistakes teams make.

Blog: Eugene Yan

How to Evaluate LLM Applications

Practical framework for evaluating LLM applications covering classification, generation, and RAG use cases with code examples.

Repo: GitHub

promptfoo: Test Your LLM App

Open-source tool for testing and evaluating LLM outputs. Supports multiple providers, assertions, and CI/CD integration.

Guide: Braintrust

Braintrust AI Eval Framework

Production-grade eval framework with experiment tracking, scoring functions, and dataset management.

Guide: LangChain

LangSmith Evaluation Guide

Official LangSmith docs on building evaluators, managing datasets, running experiments, and tracking results over time.

Paper: arXiv

Judging LLM-as-a-Judge

Foundational paper on using LLMs as evaluators. Analyzes biases, calibration, and agreement with human judgments.

Repo: GitHub

RAGAS: Automated Evaluation of RAG

Framework for evaluating RAG pipelines with metrics for faithfulness, answer relevance, context precision, and recall.

Repo: GitHub

OpenAI Evals

OpenAI's open-source evaluation framework. Includes benchmark tasks and a framework for creating custom evals.

Guide: Anthropic

Anthropic's Guide to Prompt Evaluation

Anthropic's official guide to evaluating prompts systematically, including rubric design and scoring methodologies.

Blog: Latent Space

The Eval Gap: Why Offline Metrics Don't Predict Online Performance

Analysis of why eval scores often don't correlate with production quality, and strategies for closing the gap.

Where to Start

If you read one thing, read Hamel Husain's “Your AI Product Needs Evals” — it covers the why, what, and how of LLM evals in one comprehensive post. Then pick a framework (promptfoo for CLI-first, Braintrust for enterprise) and start with 20 test cases. You'll see immediate ROI.