DesignKey Studio
Development
January 12, 2026
10 min read
By Daniel Killyevo

Testing AI Features: Unit Tests, Golden Sets, Property Tests

A practical testing strategy for AI features — unit tests for the deterministic shell, golden sets for the core, property tests for invariants, and evals in CI.

evals · ai-testing · golden-sets · property-testing · ci-cd

Testing AI features is the part of the job most teams still get wrong, and it's the part that quietly kills more integrations than any other. The reason is simple: classical testing practice assumes that the same input produces the same output. AI features don't offer that guarantee. The same prompt can produce different text from one call to the next. The vendor can update the model underneath you. A harmless prompt tweak can blow up a failure mode nobody caught.

You don't fix this by testing harder in the traditional sense. You fix it by combining three complementary testing layers — plus CI integration and a regression-tracking habit — and treating the whole thing as first-class engineering. This post walks through the stack we use on production AI features, with code snippets, and the tradeoffs between each layer.

Why "just add unit tests" doesn't work

A classical unit test is an assertion: expect(output).toEqual(expected). That assertion fails the moment the output shifts by a single token, which in an AI feature will happen constantly and for benign reasons — model reroll, temperature > 0, a minor prompt tweak, a new version of the same model family.

Two bad responses to that reality:

  1. Turn temperature to zero and hope. You get some determinism, but the model vendor can still change behavior without your knowledge. You'll also make the tests brittle in a way that encourages rubber-stamping.
  2. Skip testing altogether. The most common response. Ship the feature, hope for the best, panic when a user reports something obviously wrong.

The correct response is to break the problem apart: test the deterministic parts with unit tests, test the behavior with golden sets and property tests, and run it all in CI. Each layer catches a different class of bug.

Layer 1 — Unit tests for the deterministic shell

Every AI feature has a surrounding shell of boring, deterministic code: input validation, prompt construction, retrieval queries, response parsing, guardrails, cost tracking. All of that is classical software and deserves classical tests.

The rule we use: any function that doesn't call the model should have a normal unit test. These are cheap, fast, and catch regressions in the parts of the system most likely to silently break.

# Python example — prompt construction
# (PROMPT_TEMPLATE and truncate() are assumed to be defined elsewhere in the module.)
import pytest

def build_classification_prompt(ticket: Ticket, tenant_context: str) -> str:
    if not ticket.body:
        raise ValueError("ticket body required")
    return PROMPT_TEMPLATE.format(
        tenant=tenant_context,
        subject=ticket.subject or "(no subject)",
        body=truncate(ticket.body, max_tokens=1500),
    )

def test_build_prompt_rejects_empty_body():
    with pytest.raises(ValueError):
        build_classification_prompt(Ticket(body=""), "acme-corp")

def test_build_prompt_truncates_long_body():
    long_body = "word " * 5000
    prompt = build_classification_prompt(Ticket(body=long_body), "acme-corp")
    assert len(prompt) < 10_000

This pattern — pure functions for the deterministic pieces, unit tests around them — should cover:

  • Prompt template construction and variable substitution
  • Input sanitization and truncation
  • Response parsing (JSON extraction, schema validation, fallbacks)
  • Retrieval query construction
  • Cost estimation and budget checks
  • Guardrail rules (content filters, PII redaction, tenant isolation checks)

If you're writing a test that calls the model, stop — that's the next layer.
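Response parsing is a good example of how far the deterministic shell extends. A sketch, where `parse_label_response` and `FALLBACK` are illustrative names rather than anything from a real codebase:

```python
import json

FALLBACK = {"label": "unknown", "priority": "low"}

def parse_label_response(raw: str) -> dict:
    """Extract the first {...} JSON object from a model response; fall back on garbage."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return dict(FALLBACK)
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return dict(FALLBACK)
    # Schema check: reject payloads missing required keys.
    if "label" not in parsed:
        return dict(FALLBACK)
    return parsed

def test_parse_handles_prose_wrapped_json():
    raw = 'Sure! Here is the result:\n{"label": "billing", "priority": "high"}'
    assert parse_label_response(raw)["label"] == "billing"

def test_parse_falls_back_on_garbage():
    assert parse_label_response("I cannot help with that.") == FALLBACK
```

Note that both tests exercise a pure function: no model call, so they run in microseconds and can be asserted exactly.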

Layer 2 — Golden sets (the core technique)

A golden set is a curated collection of inputs paired with expected outputs or acceptable output properties. You run the whole AI feature against the set, score the results, and track the aggregate score over time.

This is the single highest-leverage testing practice for AI features. Everything else is supporting cast.

Building a golden set

Size matters less than coverage. Our defaults:

  • 30–60 items for a narrow feature (a classifier, a specific extraction task).
  • 80–150 for a broader feature (a chat assistant, a generation feature).
  • Cover three tiers: common cases, edge cases, adversarial cases (inputs that should be refused or handled specially).

Each item has at minimum:

id: classify-ticket-001
category: common
input:
  subject: "Can't log in after password reset"
  body: "I reset my password this morning and now I get an error..."
expected:
  label: "account_access"
  priority: "medium"
notes: "Straightforward auth issue; should not be escalated."

Where possible, write properties instead of exact expected outputs. "Label must be one of {account_access, billing, technical}" is more robust than "label must equal 'account_access'." Properties are the bridge to Layer 3.
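As a hypothetical illustration, here is the same item schema written with property-style expectations; the `any_of` and `judge` rule shapes match the runner sketch later in the post:

```yaml
id: classify-ticket-014
category: edge
input:
  subject: ""
  body: "chargd twice for same invoice?? refund pls"
expected:
  label: { any_of: [billing, account_access] }
  reply: { judge: "Does the reply acknowledge a possible duplicate charge?" }
notes: "Typos and empty subject; either label is defensible."
```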

Scoring

Three scoring strategies, picked per field:

1. Exact match. Works for classification, extraction into a closed schema, and structured outputs. Fast, cheap, reliable.

2. LLM-as-judge. For open-ended outputs, use a stronger model (or a different model) to score the primary model's response against a rubric. This is less reliable than exact match but works well when the rubric is concrete ("does the reply address the user's stated question," "does the reply include a citation to a source doc," "is the tone appropriate for customer support").

3. Rubric-based human scoring. For high-stakes features or ambiguous tasks, periodic human-scored runs of the golden set against a structured rubric. Slower, more expensive, most reliable.

Most real golden-set harnesses use a mix: exact match for the fields that allow it, LLM-as-judge for the rest, human scoring on a monthly or quarterly cadence to calibrate the judge.
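The deterministic halves of an LLM-as-judge scorer — the rubric prompt and the score parsing — can themselves be unit-tested. A hedged sketch; `JUDGE_TEMPLATE`, `parse_judge_score`, and `call_judge_model` are illustrative names, not an established API:

```python
JUDGE_TEMPLATE = """You are grading a support reply against a rubric.
Rubric: {rubric}
Reply: {reply}
Answer with a single integer from 0 to 5 and nothing else."""

def build_judge_prompt(reply: str, rubric: str) -> str:
    return JUDGE_TEMPLATE.format(rubric=rubric, reply=reply)

def parse_judge_score(raw: str, scale: int = 5) -> float:
    """Map the judge's 0..scale integer to a 0.0..1.0 score; treat junk as 0."""
    try:
        value = int(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0
    return max(0, min(value, scale)) / scale

def judge_fn(output: dict, rubric: str, call_judge_model) -> float:
    # call_judge_model is a placeholder for your client, typically pointed at a
    # stronger (or at least different) model than the one under test.
    prompt = build_judge_prompt(str(output), rubric)
    return parse_judge_score(call_judge_model(prompt))
```

Clamping and the junk-as-zero fallback matter in practice: judges occasionally answer with prose or an out-of-range number, and a scoring crash mid-run is worse than a zero.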

A minimal runner

# pseudocode — a small golden-set runner (golden set stored as JSONL, one item per line)
import json, statistics
from pathlib import Path

def run_golden_set(feature_fn, golden_path: Path, judge_fn):
    items = [json.loads(line) for line in golden_path.read_text().splitlines() if line.strip()]
    results = []
    for item in items:
        output = feature_fn(item["input"])
        score = score_item(output, item["expected"], judge_fn)
        results.append({"id": item["id"], "score": score, "output": output})
    mean = statistics.mean(r["score"] for r in results)
    return {"mean_score": mean, "results": results}

def score_item(output, expected, judge_fn):
    scores = []
    for field, rule in expected.items():
        if isinstance(rule, str):                     # exact match
            scores.append(1.0 if output.get(field) == rule else 0.0)
        elif isinstance(rule, dict) and "any_of" in rule:  # property
            scores.append(1.0 if output.get(field) in rule["any_of"] else 0.0)
        elif isinstance(rule, dict) and "judge" in rule:   # LLM-as-judge
            scores.append(judge_fn(output, rule["judge"]))
    return statistics.mean(scores) if scores else 0.0

That's the skeleton. Real production versions add retries, per-item timing, cost tracking, persistence, and diffing against the previous run — all straightforward software.

Layer 3 — Property-based testing

Property-based testing asks: instead of "for this specific input, I expect this specific output," can we say "for any input in this class, the output must have this property?"

Properties are invariants that should hold across the whole input space. Examples from real AI features we've built:

  • A classifier must always return a label from a fixed enum.
  • A JSON-output feature must always produce parseable JSON.
  • A redaction feature must never leak email addresses or phone numbers in its output.
  • A summarizer's output must be shorter than its input.
  • A translation feature must not leave untranslated tokens of the source language (when those tokens are not proper nouns).
  • A tenant-scoped retrieval feature must never return content from a different tenant, regardless of input.

The last one is especially valuable because it's a security property that you can test continuously. Property tests turn "I hope the model respects tenant boundaries" into "we ran 500 adversarial prompts and tenant isolation held every time, and the CI will catch the first time it doesn't."

# Python + hypothesis-style property test
from hypothesis import given, strategies as st

@given(st.text(min_size=1, max_size=2000))
def test_redact_never_leaks_email(input_text):
    output = redaction_feature(input_text + " contact me at test@example.com")
    assert "test@example.com" not in output

@given(st.text(min_size=1, max_size=500))
def test_classifier_returns_valid_label(input_text):
    result = classify_ticket({"subject": "test", "body": input_text})
    assert result["label"] in ALLOWED_LABELS

Property tests are slower and more expensive than unit tests because each case calls the model. We typically run a smaller number of cases (20–50 per property) in CI and a larger, more adversarial run nightly.
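One way to implement that CI/nightly split, assuming the hypothesis package, is a settings profile per environment; the profile names and the HYPOTHESIS_PROFILE variable are our convention, not a standard:

```python
# conftest.py — per-environment Hypothesis run counts (sketch)
import os
from hypothesis import settings

settings.register_profile("pr", max_examples=20)        # fast gate on every PR
settings.register_profile("nightly", max_examples=150)  # broader nightly sweep
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "pr"))
```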

Regression tracking

The whole point of golden sets and property tests is to catch regressions you didn't predict. That requires tracking scores over time, not just running them.

Minimum regression infrastructure:

  1. Store every run. Each golden-set run records the timestamp, model version, prompt version, retrieval index version, per-item scores, and aggregate score.
  2. Diff against previous. After each run, diff the per-item scores against the last baseline. Any item that went from passing to failing gets flagged.
  3. Dashboard the aggregate. A simple chart of mean score over time, with annotations at each model/prompt change, is worth more than an entire test suite nobody looks at.
  4. Fail CI on material regression. We typically fail the build if the aggregate score drops by more than a configured threshold, or if more than a small number of previously-passing items start failing.

The threshold question is real: AI output scores are noisy. A 2% drop run-over-run is probably noise. A 10% drop is a regression. Between those, look at which items changed, not the aggregate. Often a small drop caused by one category of inputs is the canary for a bigger problem.
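A minimal gate implementing those thresholds might look like the following; the run dicts mirror the runner sketch above, and the default thresholds are illustrative, not prescriptive:

```python
def regression_report(baseline: dict, current: dict,
                      max_drop: float = 0.05, max_new_failures: int = 2) -> dict:
    """Compare two golden-set runs; flag items that flipped from passing to failing."""
    prev = {r["id"]: r["score"] for r in baseline["results"]}
    newly_failing = [
        r["id"] for r in current["results"]
        if r["score"] < 1.0 and prev.get(r["id"], 0.0) >= 1.0
    ]
    drop = baseline["mean_score"] - current["mean_score"]
    return {
        "drop": drop,
        "newly_failing": newly_failing,
        # Fail on a material aggregate drop OR too many flipped items, whichever trips first.
        "fail_build": drop > max_drop or len(newly_failing) > max_new_failures,
    }
```

Surfacing `newly_failing` alongside the aggregate is the point: when the mean drops a little, the item list tells you whether one input category is the canary.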

CI integration

Wiring this into CI is more of a workflow question than a technical one. The pattern we use:

  • On every PR: unit tests (fast), a subset of the golden set (~20 items), and property tests with a small run count (~20 cases each).
  • Nightly: the full golden set, property tests with 100+ cases, LLM-as-judge scoring.
  • On model or prompt version change: full golden set + full property runs + explicit diff against the baseline, gated on reviewer approval.

Cost matters here. A full golden set run can cost real money if the feature uses a premium model. Design the CI matrix so the expensive runs happen at the moments where they pay off — not on every typo-fix PR.
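One way to keep the PR-time subset cheap but diffable run-to-run is to select items deterministically rather than sampling at random. A sketch using a content hash of the item id (a production version would likely also stratify by category so all three tiers are represented):

```python
import hashlib

def pr_subset(items: list[dict], size: int = 20) -> list[dict]:
    """Pick a stable subset: same items on every PR until the golden set itself changes."""
    def key(item: dict) -> str:
        return hashlib.sha256(item["id"].encode()).hexdigest()
    return sorted(items, key=key)[:size]
```

Because the ordering depends only on item ids, the subset is independent of file order, and per-item diffs against the previous PR run stay meaningful.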

What this stack actually catches

In the projects where we've built this stack, the bugs it's caught (that nothing else would have caught) include:

  • A prompt change that improved the common case by 3% and broke a specific edge case by 40% — caught on the nightly diff.
  • A silent model vendor update that changed refusal behavior — caught when a property test for "must return a label" started failing on a new class of inputs.
  • A retrieval index rebuild that accidentally re-indexed from a stale dump — caught because golden-set scores dropped uniformly across all categories.
  • A tenant-isolation break when a new filter wasn't propagated to a code path — caught by the property test on cross-tenant leakage.

None of those would have been found by manual testing. All of them would have shipped and caused real damage.

What this does not replace

Worth naming: none of this replaces user testing, production observability, or review of real user interactions. Golden sets prove the system works on inputs you anticipated. Production telemetry shows you the inputs you didn't anticipate. Both matter. A team that has great evals and no production observability has a testing dashboard that says "all green" while users hit novel failure modes.

If you're building AI features and want a hand standing up the eval and testing infrastructure — especially if you're past the demo stage and moving toward production — that's a significant piece of the work in our AI integration practice. Or get in touch and we can talk through what your specific feature needs.

Closing

AI features that last aren't the ones with the best prompts. They're the ones whose teams built the discipline to catch regressions before users do. The three-layer stack — unit tests for the shell, golden sets for the core, property tests for invariants — is not glamorous. It's the part of AI engineering that looks like normal engineering, which is exactly why it works.

Author
Daniel Killyevo

Founder

Building cutting-edge software solutions for businesses worldwide.
