DesignKey Studio
Development
January 12, 2026
10 min read
By Daniel Killyevo

Testing AI Features: Unit Tests, Golden Sets, Property Tests

A practical testing strategy for AI features — unit tests for the deterministic shell, golden sets for the core, property tests for invariants, and evals in CI.

evals · ai-testing · golden-sets · property-testing · ci-cd

Testing AI features is the part of the job most teams still get wrong, and it's the part that quietly kills more integrations than any other. The reason is simple: classical testing practice assumes that the same input produces the same output. AI features don't offer that guarantee. The same prompt can produce different text from one call to the next. The vendor can update the model underneath you. A harmless prompt tweak can blow up a failure mode nobody caught.

You don't fix this by testing harder in the traditional sense. You fix it by combining three complementary testing layers — plus CI integration and a regression-tracking habit — and treating the whole thing as first-class engineering. This post walks through the stack we use on production AI features, with code snippets, and the tradeoffs between each layer.

Why "just add unit tests" doesn't work

A classical unit test is an assertion: expect(output).toEqual(expected). That assertion fails the moment the output shifts by a single token, which in an AI feature will happen constantly and for benign reasons — model reroll, temperature > 0, a minor prompt tweak, a new version of the same model family.

Two bad responses to that reality:

  1. Turn temperature to zero and hope. You get some determinism, but the model vendor can still change behavior without your knowledge. You'll also make the tests brittle in a way that encourages rubber-stamping.
  2. Skip testing altogether. The most common response. Ship the feature, hope for the best, panic when a user reports something obviously wrong.

The correct response is to break the problem apart: test the deterministic parts with unit tests, test the behavior with golden sets and property tests, and run it all in CI. Each layer catches a different class of bug.

Layer 1 — Unit tests for the deterministic shell

Every AI feature has a surrounding shell of boring, deterministic code: input validation, prompt construction, retrieval queries, response parsing, guardrails, cost tracking. All of that is classical software and deserves classical tests.

The rule we use: any function that doesn't call the model should have a normal unit test. These are cheap, fast, and catch regressions in the parts of the system most likely to silently break.

# Python example — prompt construction
# (PROMPT_TEMPLATE and truncate() are assumed to be defined elsewhere in the module.)
import pytest

def build_classification_prompt(ticket: Ticket, tenant_context: str) -> str:
    if not ticket.body:
        raise ValueError("ticket body required")
    return PROMPT_TEMPLATE.format(
        tenant=tenant_context,
        subject=ticket.subject or "(no subject)",
        body=truncate(ticket.body, max_tokens=1500),
    )

def test_build_prompt_rejects_empty_body():
    with pytest.raises(ValueError):
        build_classification_prompt(Ticket(body=""), "acme-corp")

def test_build_prompt_truncates_long_body():
    long_body = "word " * 5000
    prompt = build_classification_prompt(Ticket(body=long_body), "acme-corp")
    assert len(prompt) < 10_000

This pattern — pure functions for the deterministic pieces, unit tests around them — should cover:

  • Prompt template construction and variable substitution
  • Input sanitization and truncation
  • Response parsing (JSON extraction, schema validation, fallbacks)
  • Retrieval query construction
  • Cost estimation and budget checks
  • Guardrail rules (content filters, PII redaction, tenant isolation checks)

If you're writing a test that calls the model, stop — that's the next layer.
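Response parsing is a good example of how far the deterministic shell extends. A sketch, where `parse_label_response` and `FALLBACK` are illustrative names rather than anything from a real codebase:

```python
import json

FALLBACK = {"label": "unknown", "priority": "low"}

def parse_label_response(raw: str) -> dict:
    """Extract the first {...} JSON object from a model response; fall back on garbage."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return dict(FALLBACK)
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return dict(FALLBACK)
    # Schema check: reject payloads missing required keys.
    if "label" not in parsed:
        return dict(FALLBACK)
    return parsed

def test_parse_handles_prose_wrapped_json():
    raw = 'Sure! Here is the result:\n{"label": "billing", "priority": "high"}'
    assert parse_label_response(raw)["label"] == "billing"

def test_parse_falls_back_on_garbage():
    assert parse_label_response("I cannot help with that.") == FALLBACK
```

Note that both tests exercise a pure function: no model call, so they run in microseconds and can be asserted exactly.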

Layer 2 — Golden sets (the core technique)

A golden set is a curated collection of inputs paired with expected outputs or acceptable output properties. You run the whole AI feature against the set, score the results, and track the aggregate score over time.

This is the single highest-leverage testing practice for AI features. Everything else is supporting cast.

Building a golden set

Size matters less than coverage. Our defaults:

  • 30–60 items for a narrow feature (a classifier, a specific extraction task).
  • 80–150 for a broader feature (a chat assistant, a generation feature).
  • Cover three tiers: common cases, edge cases, adversarial cases (inputs that should be refused or handled specially).

Each item has at minimum:

id: classify-ticket-001
category: common
input:
  subject: "Can't log in after password reset"
  body: "I reset my password this morning and now I get an error..."
expected:
  label: "account_access"
  priority: "medium"
notes: "Straightforward auth issue; should not be escalated."

Where possible, write properties instead of exact expected outputs. "Label must be one of {account_access, billing, technical}" is more robust than "label must equal 'account_access'." Properties are the bridge to Layer 3.
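As a hypothetical illustration, here is the same item schema written with property-style expectations; the `any_of` and `judge` rule shapes match the runner sketch later in the post:

```yaml
id: classify-ticket-014
category: edge
input:
  subject: ""
  body: "chargd twice for same invoice?? refund pls"
expected:
  label: { any_of: [billing, account_access] }
  reply: { judge: "Does the reply acknowledge a possible duplicate charge?" }
notes: "Typos and empty subject; either label is defensible."
```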

Scoring

Three scoring strategies, picked per field:

1. Exact match. Works for classification, extraction into a closed schema, and structured outputs. Fast, cheap, reliable.

2. LLM-as-judge. For open-ended outputs, use a stronger model (or a different model) to score the primary model's response against a rubric. This is less reliable than exact match but works well when the rubric is concrete ("does the reply address the user's stated question," "does the reply include a citation to a source doc," "is the tone appropriate for customer support").

3. Rubric-based human scoring. For high-stakes features or ambiguous tasks, periodic human-scored runs of the golden set against a structured rubric. Slower, more expensive, most reliable.

Most real golden-set harnesses use a mix: exact match for the fields that allow it, LLM-as-judge for the rest, human scoring on a monthly or quarterly cadence to calibrate the judge.
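The deterministic halves of an LLM-as-judge scorer — the rubric prompt and the score parsing — can themselves be unit-tested. A hedged sketch; `JUDGE_TEMPLATE`, `parse_judge_score`, and `call_judge_model` are illustrative names, not an established API:

```python
JUDGE_TEMPLATE = """You are grading a support reply against a rubric.
Rubric: {rubric}
Reply: {reply}
Answer with a single integer from 0 to 5 and nothing else."""

def build_judge_prompt(reply: str, rubric: str) -> str:
    return JUDGE_TEMPLATE.format(rubric=rubric, reply=reply)

def parse_judge_score(raw: str, scale: int = 5) -> float:
    """Map the judge's 0..scale integer to a 0.0..1.0 score; treat junk as 0."""
    try:
        value = int(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0
    return max(0, min(value, scale)) / scale

def judge_fn(output: dict, rubric: str, call_judge_model) -> float:
    # call_judge_model is a placeholder for your client, typically pointed at a
    # stronger (or at least different) model than the one under test.
    prompt = build_judge_prompt(str(output), rubric)
    return parse_judge_score(call_judge_model(prompt))
```

Clamping and the junk-as-zero fallback matter in practice: judges occasionally answer with prose or an out-of-range number, and a scoring crash mid-run is worse than a zero.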

A minimal runner

# pseudocode — a small golden-set runner (golden set stored as JSONL, one item per line)
import json, statistics
from pathlib import Path

def run_golden_set(feature_fn, golden_path: Path, judge_fn):
    items = [json.loads(line) for line in golden_path.read_text().splitlines() if line.strip()]
    results = []
    for item in items:
        output = feature_fn(item["input"])
        score = score_item(output, item["expected"], judge_fn)
        results.append({"id": item["id"], "score": score, "output": output})
    mean = statistics.mean(r["score"] for r in results)
    return {"mean_score": mean, "results": results}

def score_item(output, expected, judge_fn):
    scores = []
    for field, rule in expected.items():
        if isinstance(rule, str):                     # exact match
            scores.append(1.0 if output.get(field) == rule else 0.0)
        elif isinstance(rule, dict) and "any_of" in rule:  # property
            scores.append(1.0 if output.get(field) in rule["any_of"] else 0.0)
        elif isinstance(rule, dict) and "judge" in rule:   # LLM-as-judge
            scores.append(judge_fn(output, rule["judge"]))
    return statistics.mean(scores) if scores else 0.0

That's the skeleton. Real production versions add retries, per-item timing, cost tracking, persistence, and diffing against the previous run — all straightforward software.

Layer 3 — Property-based testing

Property-based testing asks: instead of "for this specific input, I expect this specific output," can we say "for any input in this class, the output must have this property?"

Properties are invariants that should hold across the whole input space. Examples from real AI features we've built:

  • A classifier must always return a label from a fixed enum.
  • A JSON-output feature must always produce parseable JSON.
  • A redaction feature must never leak email addresses or phone numbers in its output.
  • A summarizer's output must be shorter than its input.
  • A translation feature must not leave untranslated tokens of the source language (when those tokens are not proper nouns).
  • A tenant-scoped retrieval feature must never return content from a different tenant, regardless of input.

The last one is especially valuable because it's a security property that you can test continuously. Property tests turn "I hope the model respects tenant boundaries" into "we ran 500 adversarial prompts and tenant isolation held every time, and the CI will catch the first time it doesn't."

# Python + hypothesis-style property test
from hypothesis import given, strategies as st

@given(st.text(min_size=1, max_size=2000))
def test_redact_never_leaks_email(input_text):
    output = redaction_feature(input_text + " contact me at test@example.com")
    assert "test@example.com" not in output

@given(st.text(min_size=1, max_size=500))
def test_classifier_returns_valid_label(input_text):
    result = classify_ticket({"subject": "test", "body": input_text})
    assert result["label"] in ALLOWED_LABELS

Property tests are slower and more expensive than unit tests because each case calls the model. We typically run a smaller number of cases (20–50 per property) in CI and a larger, more adversarial run nightly.
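One way to implement that CI/nightly split, assuming the hypothesis package, is a settings profile per environment; the profile names and the HYPOTHESIS_PROFILE variable are our convention, not a standard:

```python
# conftest.py — per-environment Hypothesis run counts (sketch)
import os
from hypothesis import settings

settings.register_profile("pr", max_examples=20)        # fast gate on every PR
settings.register_profile("nightly", max_examples=150)  # broader nightly sweep
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "pr"))
```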

Regression tracking

The whole point of golden sets and property tests is to catch regressions you didn't predict. That requires tracking scores over time, not just running them.

Minimum regression infrastructure:

  1. Store every run. Each golden-set run records the timestamp, model version, prompt version, retrieval index version, per-item scores, and aggregate score.
  2. Diff against previous. After each run, diff the per-item scores against the last baseline. Any item that went from passing to failing gets flagged.
  3. Dashboard the aggregate. A simple chart of mean score over time, with annotations at each model/prompt change, is worth more than an entire test suite nobody looks at.
  4. Fail CI on material regression. We typically fail the build if the aggregate score drops by more than a configured threshold, or if more than a small number of previously-passing items start failing.

The threshold question is real: AI output scores are noisy. A 2% drop run-over-run is probably noise. A 10% drop is a regression. Between those, look at which items changed, not the aggregate. Often a small drop caused by one category of inputs is the canary for a bigger problem.
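A minimal gate implementing those thresholds might look like the following; the run dicts mirror the runner sketch above, and the default thresholds are illustrative, not prescriptive:

```python
def regression_report(baseline: dict, current: dict,
                      max_drop: float = 0.05, max_new_failures: int = 2) -> dict:
    """Compare two golden-set runs; flag items that flipped from passing to failing."""
    prev = {r["id"]: r["score"] for r in baseline["results"]}
    newly_failing = [
        r["id"] for r in current["results"]
        if r["score"] < 1.0 and prev.get(r["id"], 0.0) >= 1.0
    ]
    drop = baseline["mean_score"] - current["mean_score"]
    return {
        "drop": drop,
        "newly_failing": newly_failing,
        # Fail on a material aggregate drop OR too many flipped items, whichever trips first.
        "fail_build": drop > max_drop or len(newly_failing) > max_new_failures,
    }
```

Surfacing `newly_failing` alongside the aggregate is the point: when the mean drops a little, the item list tells you whether one input category is the canary.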

CI integration

Wiring this into CI is more of a workflow question than a technical one. The pattern we use:

  • On every PR: unit tests (fast), a subset of the golden set (~20 items), and property tests with a small run count (~20 cases each).
  • Nightly: the full golden set, property tests with 100+ cases, LLM-as-judge scoring.
  • On model or prompt version change: full golden set + full property runs + explicit diff against the baseline, gated on reviewer approval.

Cost matters here. A full golden set run can cost real money if the feature uses a premium model. Design the CI matrix so the expensive runs happen at the moments where they pay off — not on every typo-fix PR.
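One way to keep the PR-time subset cheap but diffable run-to-run is to select items deterministically rather than sampling at random. A sketch using a content hash of the item id (a production version would likely also stratify by category so all three tiers are represented):

```python
import hashlib

def pr_subset(items: list[dict], size: int = 20) -> list[dict]:
    """Pick a stable subset: same items on every PR until the golden set itself changes."""
    def key(item: dict) -> str:
        return hashlib.sha256(item["id"].encode()).hexdigest()
    return sorted(items, key=key)[:size]
```

Because the ordering depends only on item ids, the subset is independent of file order, and per-item diffs against the previous PR run stay meaningful.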

What this stack actually catches

In the projects where we've built this stack, the bugs it's caught (that nothing else would have caught) include:

  • A prompt change that improved the common case by 3% and broke a specific edge case by 40% — caught on the nightly diff.
  • A silent model vendor update that changed refusal behavior — caught when a property test for "must return a label" started failing on a new class of inputs.
  • A retrieval index rebuild that accidentally re-indexed from a stale dump — caught because golden-set scores dropped uniformly across all categories.
  • A tenant-isolation break when a new filter wasn't propagated to a code path — caught by the property test on cross-tenant leakage.

None of those would have been found by manual testing. All of them would have shipped and caused real damage.

What this does not replace

Worth naming: none of this replaces user testing, production observability, or review of real user interactions. Golden sets prove the system works on inputs you anticipated. Production telemetry shows you the inputs you didn't anticipate. Both matter. A team that has great evals and no production observability has a testing dashboard that says "all green" while users hit novel failure modes.

If you're building AI features and want a hand standing up the eval and testing infrastructure — especially if you're past the demo stage and moving toward production — that's a significant piece of the work in our AI integration practice. Or get in touch and we can talk through what your specific feature needs.

Closing

AI features that last aren't the ones with the best prompts. They're the ones whose teams built the discipline to catch regressions before users do. The three-layer stack — unit tests for the shell, golden sets for the core, property tests for invariants — is not glamorous. It's the part of AI engineering that looks like normal engineering, which is exactly why it works.

Author
Daniel Killyevo

Founder

Building cutting-edge software solutions for businesses worldwide.
