Episode 7

Build a Regression Gate: Test Sets and Pass-Fail Thresholds for LLM Changes

Intro

This episode is for solo operators who ship prompt changes weekly and need a fast, cheap way to prevent regressions. You'll get a working regression gate that catches broken changes before they reach production—no team required, no complex infrastructure, just a twenty-row spreadsheet and some smart automation.

In This Episode

Jordan walks through building a minimal regression gate using Promptfoo, Make, and n8n. Starting with a real story of a two-word prompt change that broke a lead qualification system, he demonstrates how to create a frozen golden set from production logs, implement three types of automated checkers (regex, contains, and LLM rubrics), and wire the whole system into a pass-fail gate that blocks bad deploys. The episode covers cost and time guards, threshold sensitivity with small test sets, and when to graduate to more sophisticated tools like TruLens and MLflow for complex agent workflows.

Key Takeaways

  • A twenty-row frozen test set from real production inputs catches more meaningful regressions than massive synthetic benchmarks
  • Layer three checker types—deterministic regex and contains checks first, then LLM-as-a-judge for semantic validation—to balance cost and reliability
  • Use Promptfoo's exit code 100 and PROMPTFOO_PASS_RATE_THRESHOLD environment variable to create a release gate that blocks deploys when tests fail

Companion Resource

  • Promptfoo Docs – Command Line

    promptfoo.dev

    • `promptfoo eval` returns exit code 100 when at least one test case fails or pass rate falls below `PROMPTFOO_PASS_RATE_THRESHOLD`; exit code 1 is used for other errors; `PROMPTFOO_FAILED_TEST_EXIT_CODE` can override the failed‑tests exit code.
  • Promptfoo Docs – Configuration Reference & Assertions/Metrics

    promptfoo.dev

    • Promptfoo automatically loads `promptfooconfig.*` (yaml/json/js) or accepts `-c path/to/config`; tests can specify per‑test `threshold` and weighted assertions to compute pass/fail.
  • Promptfoo Docs – Assertions & Metrics

    promptfoo.dev

    • Deterministic assertion types include `contains`, `regex`, `is-json`, `javascript`, `python`, and more; model‑assisted metrics include `similar`, `classifier`, `llm‑rubric`, `g‑eval`, `answer‑relevance`, and `context‑faithfulness`.
  • Promptfoo Docs – Looper Integration

    promptfoo.dev

    • Promptfoo Looper integration shows a shell‑level quality gate: parsing successes/failures from `output.json` and `test "$FAIL" -eq 0` to fail builds; includes a pass‑rate example using jq/bc to enforce ≥95%.
  • Promptfoo Docs – Command Line; Assertions & Metrics; Configuration Reference

    promptfoo.dev

    • Promptfoo documentation pages used here list 'Last updated on May 8, 2026'.
  • TruLens Blog – 2026 Archive

    trulens.org

    • TruLens 2.7 (Feb 3, 2026) introduced a unified metric API and first‑class MLflow integration; 2.8 (Apr 14, 2026) added parallel batch evals and schema validation.
  • MLflow Blog – Agent Trace Evaluation with TruLens Scorers

    mlflow.org

    • MLflow blog details TruLens integration adding 10 scorers (4 RAG, 6 agent trace evaluators based on Agent GPA) with thresholded yes/no outputs and rationales.
  • TruLens Docs – MLflow Scorers

    trulens.org

    • TruLens 'Scorers with MLflow' cookbook states TruLens feedback functions are available as first‑class scorers in MLflow GenAI starting with MLflow 3.10.0; includes code to configure thresholds and batch evaluation.
  • DeepEval Docs – Prompt Optimization Introduction

    deepeval.com

    • DeepEval provides a PromptOptimizer that optimizes prompts against a golden set using GEPA and MIPROv2 algorithms and leverages 50+ metrics.
  • Confident AI Docs – LLM Metrics

    confident-ai.com

    • Confident AI/DeepEval docs catalog pre‑built metrics (e.g., Answer Relevancy, Faithfulness, Hallucination, Bias, Toxicity, Tool Correctness) and support custom metrics (G‑Eval and Python code).
  • OpenAI – Evaluation best practices

    platform.openai.com

    • OpenAI evaluation best practices recommend including a pass/fail threshold in addition to numerical scores for deployment decisions.
  • Microsoft Learn – Interpret evaluation scores and assess readiness

    learn.microsoft.com

    • Microsoft Learn advises that with fewer than 30 test cases, one test flipping pass→fail can shift the score by ≥3%, underscoring sensitivity of small frozen sets.
  • arXiv – An Empirical Investigation of Practical LLM‑as‑a‑Judge Improvement Techniques on RewardBench 2

    arxiv.org

    • Research continues to document biases and stability challenges in LLM‑as‑a‑judge settings; recent (2026) work studies practical improvement techniques on RewardBench 2.
  • GitHub – openai/evals

    github.com

    • OpenAI Evals remains an open‑source framework and registry, commonly cited in top 'LLM evals' results and used for CI/CD gating by some teams.
  • Promptfoo Docs: Setting up Promptfoo with Looper

    promptfoo.dev

    • Looper CI integration using Promptfoo as a release gate
    • Shows a concrete CI pattern that fails the build on failed evals and demonstrates a pass‑rate gate shell snippet, which Jordan can replicate in Make/n8n runners.
  • GitHub: promptfoo/promptfoo-action

    github.com

    • Promptfoo GitHub Action for CI gating
    • Shows maintained, first‑party CI integration that posts PR comments and fails on non‑zero exit codes; evidence that small eval suites are commonly run in CI as gates.
  • MLflow Blog: Agent Trace Evaluation with TruLens Scorers in MLflow

    mlflow.org

    • TruLens scorers integrated into MLflow GenAI evaluation
    • Demonstrates a growth path beyond simple substring/regex checks: trace‑aware scorers for agents and RAG, useful when Jordan’s listeners outgrow lightweight gates.
  • TruLens 2.6/2.7 Release Posts

    trulens.org

    • Rapid evolution of TruLens with MLflow integration and unified metric API
    • Corroborates the episode’s context that the ecosystem is maturing in 2026 with viable, automatable checks and integrations.
  • DeepEval Docs: Prompt Optimizer + Metrics

    deepeval.com

    • DeepEval’s PromptOptimizer and broad metric coverage
    • Shows an alternative/tooling tradeoff: in‑code evaluators and automated prompt optimization against a golden set.

Jordan: I got a DM last Tuesday that I haven't been able to stop thinking about. It said — and I'm quoting almost exactly — "I tweaked one line in my system prompt on Friday. Changed the tone instruction from 'professional and concise' to 'warm and helpful.' Shipped it. Didn't test it. By Monday my lead qualification workflow had approved fourteen spam submissions that it had been catching perfectly for three months."

Fourteen. In one weekend. Because of two words.

And the person who sent this — they're not a beginner. They're running a real automation practice. Make scenarios, n8n workflows, LLM calls in production handling actual client data. They knew the change was small. They figured small change, small risk.

That math does not work with language models. A two-word prompt edit can flip the behavior of an entire pipeline. And unless you have something that checks the output before it ships — something automated, something that runs every time — you will not catch it until a client does.

So the question this person actually asked me was simple. "What's the minimum viable way to make sure a prompt change doesn't break what's already working?"

That's what we're building today.

Jordan: When was the last time you changed a prompt in a production workflow and ran it against the exact same inputs it handled last week — before you deployed? Not eyeballed the output. Not sent one test message and said "looks fine." Actually ran the same twenty inputs through the new version and compared. If the answer is never — and for most solo operators it is never — then every prompt change you've shipped has been a guess. This is Headcount Zero. I'm Jordan. And today you're getting a regression gate — a frozen test set, three dead-simple checkers, and a pass-fail threshold that blocks your deploy when something breaks. The whole thing runs inside Promptfoo, triggered from Make or n8n, and it takes about forty-five minutes to stand up from scratch.

Jordan: So here's the trap. When you hear "LLM evals," your brain probably goes to leaderboards. MMLU scores. Chatbot Arena. Some massive benchmark where GPT-4o beats Claude on reasoning but loses on coding. And that stuff is interesting if you're picking a base model, but it tells you absolutely nothing about whether your specific prompt, with your specific instructions, still routes a refund request to the right queue after you edited paragraph three last night.

Those benchmarks are testing the model. You need to test your system. Your prompt, your formatting rules, your routing logic, your output constraints. And the good news is that testing your system is dramatically simpler than testing a model. You don't need thousands of examples. You don't need a team. You need roughly twenty rows in a spreadsheet and about three types of checks.

Let me walk through what those twenty rows actually look like, because this is where most people overthink it.

Jordan: You open a CSV. Or a JSON file — doesn't matter, pick whichever you'll actually maintain. Each row is one real input that your workflow has handled in production. Not synthetic. Not hypothetical. Real inputs you've already seen. A customer email asking for a refund. A lead form submission that's clearly spam. An edge case where the input was in Spanish even though your system prompt says English only. You pull these from your actual logs, your actual Slack notifications, your actual error reports.

Twenty rows. That's it. And you freeze them. Meaning — you do not casually edit this file. It's your golden set. It's the contract between your current prompt and reality. When you change the prompt, you run it against these twenty rows, and if the outputs change in ways you didn't intend, the gate catches it.
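One way to make "frozen" enforceable is to record a fingerprint of the golden set when you freeze it and compare before each run. This is a minimal sketch, not part of Promptfoo; the column names (`id`, `input`, `expected_label`) and the three sample rows are hypothetical placeholders for your own logged inputs.

```python
import csv
import hashlib
import io

# Hypothetical golden-set columns: an id, the raw production input, and the
# label the current prompt is expected to return.
GOLDEN_CSV = """id,input,expected_label
refund-01,"Hi, I was charged twice and want a refund.",escalate
spam-01,"BUY CHEAP FOLLOWERS NOW http://example.com",reject
lang-01,"Hola, quiero informacion sobre sus servicios.",escalate
"""


def load_golden_set(text: str) -> list[dict[str, str]]:
    """Parse the golden set into one dict per test case."""
    return list(csv.DictReader(io.StringIO(text)))


def fingerprint(text: str) -> str:
    """SHA-256 of the file contents; any edit changes the digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_unchanged(text: str, expected_digest: str) -> bool:
    """True when the golden set still matches the recorded digest."""
    return fingerprint(text) == expected_digest


rows = load_golden_set(GOLDEN_CSV)
frozen_digest = fingerprint(GOLDEN_CSV)  # record this when you freeze the set
```

If `is_unchanged` returns False before an eval run, someone edited the golden set, and that edit deserves the same scrutiny as a prompt change.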

Jordan: Now — the checkers. Three types, and they layer on each other.

First — format checks. Regex. Your workflow expects JSON output? Write a regex that confirms the output starts with a curly brace and ends with a curly brace. Your classifier is supposed to return one of three labels — approve, reject, escalate? Write a regex that matches only those three strings. These checks are deterministic. They cost zero API calls. They catch the most catastrophic failures — the ones where the model just stops following your format instructions entirely.
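The two format checks described here can be sketched in plain Python. The label set comes from the episode; the function names are illustrative. The third function is an addition beyond the regex approach: it actually parses the output, which is stricter than checking for braces.

```python
import json
import re

# Anchored so "approve, probably" or "I would reject this" do not match.
LABEL_RE = re.compile(r"(approve|reject|escalate)")
# Loose shape check only: first and last non-space characters are braces.
JSON_SHAPE_RE = re.compile(r"^\s*\{.*\}\s*$", re.DOTALL)


def label_ok(output: str) -> bool:
    """True only when the whole output is exactly one allowed label."""
    return LABEL_RE.fullmatch(output.strip()) is not None


def looks_like_json_object(output: str) -> bool:
    """The brace-to-brace check from the episode; cheap but permissive."""
    return JSON_SHAPE_RE.match(output) is not None


def parses_as_json_object(output: str) -> bool:
    """Stricter than the regex: the text must actually parse as an object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False
```

The brace regex catches a model that wraps its answer in chatty text, but it will accept `{not json}`. If downstream code parses the output, the parse-based check is the safer gate.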

Second — content checks. Contains. Must-include. Your output is supposed to have a "next steps" section? Check that the string "next steps" appears. Your response is supposed to reference the customer's name? Check that the variable you passed in shows up in the output. Again — deterministic. Zero cost. And you'd be amazed how often a prompt tweak causes the model to silently drop a required section.

I actually caught this on my own system two months ago. I updated the tone of a client onboarding email generator — just the tone — and the model stopped including the calendar link. Every single output. The link was in the prompt template. The model just... decided it didn't fit the new tone. A contains check would have caught that in seconds. Instead I caught it when a client said "where do I book my kickoff call?"
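A contains check for the calendar-link story might look like the sketch below. The customer name, the link, and the two sample outputs are made up for illustration; matching is case-insensitive here by choice, which is an assumption about what you want rather than a rule.

```python
def missing_required(output: str, required: list[str]) -> list[str]:
    """Return the required substrings that do not appear (case-insensitive)."""
    lowered = output.lower()
    return [item for item in required if item.lower() not in lowered]


# Hypothetical values for illustration; use the strings your template injects.
customer_name = "Dana"
calendar_link = "https://example.com/book"
required = ["next steps", customer_name, calendar_link]

good = f"Hi {customer_name}! Next steps: book your kickoff at {calendar_link}."
bad = f"Hi {customer_name}! So glad you're here. Next steps are coming soon."

print(missing_required(good, required))
print(missing_required(bad, required))
```

Returning the list of missing items, instead of a bare true or false, gives you something useful to put in the failure notification.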

Third type, and this is where it gets interesting: classifier checks and LLM rubrics. These use a second model call to judge the output. You define a rubric, something like "the response must mention at least three action items and maintain a professional tone," and a judge model scores it. Promptfoo supports this natively. You set a threshold, say 0.8, and if the judge scores below that, the test fails.

Now, I want to be honest about this third type. LLM-as-a-judge has real limitations. Research keeps documenting systematic biases in judge models, and a 2026 paper on RewardBench 2 studies practical ways to reduce them. Judges tend to favor longer outputs, and they tend to favor outputs that match their own training distribution. So you don't lean on judge-based checks alone. You front-load the deterministic stuff, regex and contains, and you use the judge for the semantic layer that deterministic checks can't reach. That's the hierarchy. Cheap and reliable first. Expensive and probabilistic second.
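The "cheap first, expensive second" ordering can be sketched as a short-circuit: run the free checks, and only call the judge when they all pass. This is an illustration of the layering idea, not how Promptfoo schedules assertions internally. `stub_judge` is a stand-in for a real judge-model call, and the 0.8 threshold is the example value from the episode.

```python
from typing import Callable

Check = Callable[[str], bool]
Judge = Callable[[str], float]  # assumed to return a score from 0.0 to 1.0


def evaluate(output: str, checks: list[Check], judge: Judge,
             threshold: float = 0.8) -> tuple[bool, str]:
    """Run free deterministic checks first; pay for the judge only if they pass."""
    for check in checks:
        if not check(output):
            return False, f"deterministic check failed: {check.__name__}"
    score = judge(output)
    if score < threshold:
        return False, f"judge score {score:.2f} below {threshold:.2f}"
    return True, "passed"


def has_next_steps(output: str) -> bool:
    return "next steps" in output.lower()


judge_calls: list[str] = []


def stub_judge(output: str) -> float:
    """Stand-in for a real judge-model call; records that it was invoked."""
    judge_calls.append(output)
    return 0.9
```

When a deterministic check fails, `judge_calls` stays empty, so a badly broken prompt costs nothing to reject.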

Jordan: Okay. You've got your golden set. You've got your checkers defined. Now you need something that actually runs them and blocks a bad deploy. This is where Promptfoo earns its place in the stack.

Promptfoo is an open-source CLI tool. You define your tests in a YAML config file, `promptfooconfig.yaml`, and you run them with one command: `npx promptfoo eval -c path/to/config -j 4 --no-cache`. The `--no-cache` flag makes it hit the model fresh every time. That's the whole invocation.

And here's the part that makes it work as a gate. Promptfoo returns exit code 100 when tests fail or when your pass rate drops below a threshold you set. You set that threshold with an environment variable, `PROMPTFOO_PASS_RATE_THRESHOLD`. Set it to 95, and if fewer than 95 percent of your twenty tests pass, the command exits non-zero. Your runner (Make, n8n, GitHub Actions, whatever) sees that non-zero exit and stops the deploy.

That's the entire release gate. One YAML file. One command. One environment variable. One exit code.
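If your runner can execute Python, the gate logic can be sketched as a thin wrapper around the command from the episode. The exit codes follow the Promptfoo docs summarized in the companion resources (100 for failed tests or a low pass rate, 1 for other errors). The function names and the wording of the decisions are illustrative, and `run_gate` is defined but not called because it needs Node and a real config.

```python
import os
import subprocess

FAILED_TESTS_EXIT_CODE = 100  # Promptfoo's documented default for failed tests


def build_command(config_path: str, concurrency: int = 4) -> list[str]:
    """The invocation from the episode, as an argument list."""
    return ["npx", "promptfoo", "eval", "-c", config_path,
            "-j", str(concurrency), "--no-cache"]


def build_env(pass_rate_threshold: int = 95) -> dict[str, str]:
    """Copy the current environment and add the pass-rate threshold."""
    env = dict(os.environ)
    env["PROMPTFOO_PASS_RATE_THRESHOLD"] = str(pass_rate_threshold)
    return env


def decide(exit_code: int) -> str:
    """Map the eval's exit code to a deploy decision."""
    if exit_code == 0:
        return "deploy"
    if exit_code == FAILED_TESTS_EXIT_CODE:
        return "blocked: eval failures or pass rate below threshold"
    return "blocked: eval did not run cleanly"


def run_gate(config_path: str) -> str:
    """Not called here; requires Node and a Promptfoo config on the runner."""
    result = subprocess.run(build_command(config_path), env=build_env())
    return decide(result.returncode)
```

Treating any other non-zero code as a block is a deliberate choice: an eval that crashed has not proven the change is safe.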

In Make, you wire this as an SSH step that runs the eval script on a small Linux runner — a five-dollar VPS, a container, whatever you've got with Node installed. The scenario triggers on a schedule or a webhook — I trigger mine when I push a prompt change to my repo. The SSH step runs the script. If it exits zero, the scenario continues to the deploy step. If it exits non-zero, it routes to a Slack notification that says "gate failed, here are the failing test IDs" and the deploy never happens.

n8n is the same pattern. Execute Command node or SSH node. Same script. Same exit code logic. The If node checks the status, routes to success or failure.
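For the "here are the failing test IDs" notification, one option is to have the eval write a JSON report and pull the failures out of it. The report shape below (`results.results[]` entries with a `success` flag and a `testCase.description`) is an assumption about Promptfoo's JSON export; check it against a real output file from your version before relying on it. The sample data is invented.

```python
import json

# Assumed shape of a Promptfoo JSON export; verify against your own output file.
SAMPLE = json.loads("""
{"results": {"stats": {"successes": 1, "failures": 1},
 "results": [
   {"success": true,  "testCase": {"description": "refund-01"}},
   {"success": false, "testCase": {"description": "spam-01"}}
 ]}}
""")


def failing_ids(report: dict) -> list[str]:
    """Descriptions of every test row that did not succeed."""
    rows = report.get("results", {}).get("results", [])
    return [row.get("testCase", {}).get("description", "<no description>")
            for row in rows if not row.get("success", False)]


def slack_message(report: dict) -> str:
    """One-line summary suitable for a failure notification."""
    ids = failing_ids(report)
    if not ids:
        return "gate passed"
    return "gate failed, failing test IDs: " + ", ".join(ids)


print(slack_message(SAMPLE))
```

This only works if each golden-set row carries a stable description or ID, which is a good reason to give every row one.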

And the cost of running this — let me do the math. Twenty test cases. Each one makes one model call for the prompt under test, plus maybe one judge call for the rubric checks. So roughly forty API calls. If you're using GPT-4o-mini at fifteen cents per million input tokens... you're looking at maybe three to five cents per eval run. Maybe ten cents if your prompts are long. That's the cost of catching a regression before your client does.
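The arithmetic behind that estimate can be written out. The input price is the figure quoted in the episode. The output price and the token counts per call are assumptions picked for illustration, so substitute current pricing and your own measured prompt sizes.

```python
INPUT_PRICE_PER_M = 0.15   # USD per million input tokens, as quoted in the episode
OUTPUT_PRICE_PER_M = 0.60  # USD per million output tokens: assumed, check current pricing


def eval_run_cost(calls: int, input_tokens_per_call: int,
                  output_tokens_per_call: int) -> float:
    """Total USD cost of one eval run under the prices above."""
    input_cost = calls * input_tokens_per_call * INPUT_PRICE_PER_M / 1_000_000
    output_cost = calls * output_tokens_per_call * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


# 20 prompt calls + 20 judge calls, with assumed token counts per call.
typical = eval_run_cost(calls=40, input_tokens_per_call=4_000,
                        output_tokens_per_call=500)
long_prompts = eval_run_cost(calls=40, input_tokens_per_call=12_000,
                             output_tokens_per_call=1_000)
print(f"typical: ${typical:.3f}, long prompts: ${long_prompts:.3f}")
```

Under these assumed token counts the two scenarios come to roughly 3.6 cents and 9.6 cents per run, in line with the ranges mentioned above. Shorter prompts would cost less.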

Jordan: Now — a few guardrails on the eval itself, because you don't want your safety net to become its own problem.

Concurrency. The `-j` flag in Promptfoo controls how many API calls run in parallel. I set mine to four. If you set it to twenty, you'll burn through rate limits and get throttled, which makes your eval flaky — and a flaky eval is worse than no eval, because you stop trusting it.

Time cap. `PROMPTFOO_MAX_EVAL_TIME_MS`. I set mine to 180,000, which is three minutes. If the eval hasn't finished in three minutes, something is wrong. Maybe the API is down, maybe you accidentally pointed at a model that takes ten seconds per call. Either way, you want a hard stop so your Make scenario doesn't hang forever.
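If the script that launches the eval is itself Python, a second hard stop at the process level is cheap insurance on top of the eval-level time cap. This sketch uses only the standard library's `subprocess` timeout; the function name and return strings are illustrative.

```python
import subprocess
import sys

MAX_EVAL_SECONDS = 180  # mirrors the three-minute cap from the episode


def run_with_hard_stop(command: list[str], timeout_seconds: float) -> str:
    """Run a command and report 'timed_out' instead of hanging the workflow."""
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "timed_out"
    return "ok" if result.returncode == 0 else f"exit_{result.returncode}"


# Harmless demo command; in practice this would be the eval invocation.
demo = run_with_hard_stop([sys.executable, "-c", "pass"], timeout_seconds=10)
```

A `timed_out` result should route to the same failure branch as a failed gate, since an eval that never finished has not cleared anything.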

And then there's the canary pattern. Your golden set of twenty rows runs before deploy. But after deploy, you keep a smaller set — five to ten of your highest-signal cases — and you run those against production on a schedule. Every hour, every six hours, whatever fits your volume. This catches model-side drift. Because even if you didn't change your prompt, the model provider might have updated something on their end. It happens. The canary catches it.
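The canary comparison reduces to one question: which cases passed at deploy time and fail now? A minimal sketch, with hypothetical case IDs and results:

```python
def drifted_cases(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Canary IDs that passed at deploy time but fail (or are missing) now."""
    return sorted(case_id for case_id, passed in baseline.items()
                  if passed and not current.get(case_id, False))


# Hypothetical canary IDs and results, for illustration only.
baseline = {"refund-01": True, "spam-01": True, "lang-01": True}
current = {"refund-01": True, "spam-01": False, "lang-01": True}

print(drifted_cases(baseline, current))
```

A missing case counts as drift here on purpose, so a canary run that silently skipped a case does not read as healthy.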

Jordan: One thing I want to flag, because it matters more than people realize with small test sets. Microsoft's evaluation guidance points out that with fewer than thirty test cases, a single test flipping from pass to fail can swing your overall score by three percent or more. With twenty tests, each case is worth five percentage points. So if your threshold is 95 percent, one failure drops you to 95, right at the edge. Two failures drop you to 90, and you're blocked.

That's actually a feature, not a bug — if you've chosen your twenty cases well. Each one should represent a real category of input your system handles. If even one of those categories breaks, you want to know. But it means you need to be deliberate about which cases make the cut. Don't throw in twenty variations of the same happy path. Include your edge cases. Include the inputs that have broken things before. Include at least one adversarial input — the spam submission, the prompt injection attempt, the input in an unexpected language.
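The small-set arithmetic is worth checking for your own numbers. This sketch assumes the gate passes when the pass rate equals the threshold and blocks only when it falls below, which matches the "falls below" wording in the Promptfoo docs cited in the companion resources.

```python
def pass_rate(total: int, failures: int) -> float:
    """Pass rate as a percentage."""
    return 100 * (total - failures) / total


def max_failures_allowed(total: int, threshold_pct: int) -> int:
    """Largest failure count that keeps the pass rate at or above the threshold."""
    return total * (100 - threshold_pct) // 100


for failures in range(3):
    print(failures, pass_rate(20, failures))
```

With twenty cases and a threshold of 95, `max_failures_allowed` returns 1. Growing the set to thirty cases still allows only one failure at that threshold, so a larger set mostly buys coverage, not slack.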

And if you're thinking "twenty rows feels too small to be meaningful" — Hamel Husain, who's written some of the most cited practical guides on LLM evals, has argued repeatedly that you should design your evals from error analysis on real traces. Start with the failures you've actually seen. That's your golden set. You're not trying to cover every possible input. You're trying to catch the regressions that have actually bitten you — or that would bite you hardest.

OpenAI's own evaluation best practices say the same thing — include a pass-fail threshold, not just a numerical score. A score of 87 means nothing if you don't know whether 87 is good enough to ship. The threshold forces the decision. Ship or don't ship. Green or red.

Jordan: Alright — so this system handles prompt regression testing for single-turn workflows. Input goes in, output comes out, you check the output. That covers a huge percentage of what solo operators are actually running. Lead qualification, email generation, content classification, data extraction.

But some of you are building multi-step agents. RAG pipelines. Workflows where the LLM makes a plan, executes tools, retrieves context, and then generates a response. For those, checking just the final output isn't enough. You need to evaluate the intermediate steps — did it retrieve the right documents? Did it call the right tool? Did the plan make sense?

That's where you graduate to trace-aware evaluation. TruLens shipped version 2.7 in February with first-class MLflow integration and a unified metric API. MLflow added ten TruLens scorers — four for RAG evaluation, six for agent trace evaluation. These score things like context faithfulness, answer relevance, and whether the agent's plan was coherent. And they output thresholded yes-no judgments, so you can wire them into the same kind of gate.

But — and I want to be clear about this — you do not start there. If you're not running multi-turn agents or RAG pipelines today, TruLens and MLflow are overhead you don't need. Start with the twenty-row golden set and Promptfoo. Get the gate working. Get comfortable with the rhythm of running evals before every deploy. Then, when your workflows get more complex, the upgrade path exists.

DeepEval is another option worth knowing about. It covers fifty-plus metrics out of the box — answer relevancy, faithfulness, hallucination, bias, toxicity — and it has a PromptOptimizer that will actually rewrite your prompts against your golden set using optimization algorithms called GEPA and MIPROv2. That's a different use case though. That's not gating — that's improvement. You use the gate to prevent regressions. You use the optimizer to make the prompt better. Same golden set, two different jobs.

Don't let the tool options become stack sprawl. Pick one gate tool. Ship it. Expand later.

Jordan: So — back to that DM. Two words changed in a system prompt. Fourteen spam submissions approved over a weekend. The fix wasn't a better prompt. The fix was knowing the prompt broke before the client did.

That's what a regression gate gives you. Not perfection — awareness. A twenty-row spreadsheet, three types of checkers, one pass-fail threshold, and a deploy that only ships when the gate says green. You can set this up in under an hour. The hard part isn't the tooling. The hard part is pulling those first twenty golden cases from your logs and committing to running them every single time.

If you want to skip the config-from-scratch part, the LLM Regression Gate Starter Kit is on the Resources page. It's got the CSV and JSON golden set templates, a working promptfooconfig YAML with all three checker types wired up, Make and n8n runner templates, and the shell script with the exit code logic. Drop in your own test cases, set your pass rate, and your next prompt change ships only if it clears the gate.

One thing this week. Pull up your most recent prompt change — the last one you shipped without testing. Find twenty real inputs from your logs. Put them in a spreadsheet. That's your golden set. Everything else builds from there.

I'm Jordan. This is Headcount Zero. Go build something that doesn't break on Monday.

LLM evaluation, prompt testing, regression testing, automation, Make.com, n8n, Promptfoo, CI/CD, quality gates, solo operations