LLM Regression Gate Starter Kit (Make/n8n + Promptfoo)
A copy‑ready bundle to stand up a lightweight LLM regression gate: frozen 20‑row goldens, a Promptfoo YAML with regex/contains/classifier examples, Make/n8n runner steps, and CI snippets using pass‑rate thresholds and exit codes. Built for solo operators who ship prompt changes weekly.
Use this template to build a lightweight, shippable regression gate for your LLM prompts or workflows. You’ll maintain a tiny frozen golden set (start with ~20 rows), run three simple checkers (regex, must‑contain, classifier/LLM‑judge), and enforce a pass/fail threshold in CI or from Make/n8n. Replace [BRACKETS] with your details, commit the files, and run the eval script. Aim to get a green gate locally before wiring it into automations.
Bundle structure (drop-in)
Copy this structure into your repo. Keep tests small and versioned.
[PROJECT_ROOT]/
├─ promptfooconfig.yaml
├─ tests/
│ ├─ goldens.csv # 20+ canonical cases you will NOT edit casually
│ ├─ goldens.json # JSON mirror of the CSV (handy for diffs/scripts)
│ └─ canary.csv # 5–10 high-signal cases you run post-deploy
├─ scripts/
│ └─ eval-gate.sh # Runs Promptfoo with pass-rate gate + exit codes
├─ .env.example # Env vars for local and runner usage
└─ README.eval.md # Why these tests, thresholds, and ops notes
.env.example (fill and keep out of git)
Copy to .env or your CI secret store.
# Gate policy
PROMPTFOO_PASS_RATE_THRESHOLD=[95] # % that must pass (0–100; default 100)
PROMPTFOO_MAX_EVAL_TIME_MS=[180000] # Hard cap on total eval time (ms)
PROMPTFOO_FAILED_TEST_EXIT_CODE=[100] # Non-zero exit on failed tests
MAX_CONCURRENCY=[4] # Passed to -j for API cost/latency
# Model providers (example)
OPENAI_API_KEY=[YOUR_OPENAI_KEY]
ANTHROPIC_API_KEY=[YOUR_ANTHROPIC_KEY]
AZURE_OPENAI_API_KEY=[YOUR_AZURE_KEY]
Notes:
- Keep provider keys in CI secrets; never commit real keys.
- With small sets (<30 rows), one failure swings the pass rate by more than 3 points; at 20 rows each failure costs 5 points, so a 95% threshold tolerates exactly one miss. Choose thresholds that reflect the cost of errors in your business.
promptfooconfig.yaml (starter)
This config demonstrates three assertion styles you can automate today: regex (format), contains (must-include), and classifier/LLM‑rubric (semantics). Duplicate the test blocks until you reach 20–30 cases. Keep them stable.
# promptfooconfig.yaml
# Minimal CI-ready config with per-test assertions and optional thresholds.

# 1) Providers — pick one to start; add alternates to compare
providers:
  - id: "[PROVIDER_ID]" # e.g., openai:gpt-4o-mini | anthropic:claude-3-haiku
    config:
      temperature: [0.2]

# 2) Prompt under test — inline here or reference a file
prompts:
  - |
    [YOUR_PROMPT_TEMPLATE]
    ---
    Task input:
    {{input}}

# 3) Tests — keep small, frozen, and representative
# Each test carries concrete assertions. Add weight to must-not-fail checks.
tests:
  - description: "T001-json-format: output must be a single JSON object (no prose)"
    vars:
      input: "[INPUT_EXAMPLE_JSON]"
    assert:
      - type: regex
        value: '^\{[\s\S]*\}$'
        weight: 2

  - description: "T002-contains-next-steps: response must include the required heading"
    vars:
      input: "[INPUT_EXAMPLE_STEPS]"
    assert:
      - type: contains
        value: "[MUST_INCLUDE_PHRASE]" # e.g., "Next steps:"

  - description: "T003-classifier-routing: route to the correct path"
    vars:
      input: "[INPUT_EXAMPLE_ROUTE]"
    assert:
      - type: classifier
        # Needs a classification-capable provider, e.g. huggingface:text-classification:[MODEL].
        # If you want an LLM judge to pick the label instead, use llm-rubric as in T004.
        provider: "[CLASSIFIER_PROVIDER_ID]"
        value: "[EXPECTED_LABEL]" # expected class, e.g. approve | reject | escalate
        threshold: 0.8

  - description: "T004-llm-rubric-thresholded: meets content/rubric bar"
    vars:
      input: "[INPUT_EXAMPLE_RUBRIC]"
    assert:
      - type: llm-rubric
        value: "Mentions [MUST_KEYWORD_A] and provides a numbered list with at least 3 items."
        threshold: 0.8 # per-test semantic pass bar
        provider: "[JUDGE_PROVIDER_ID]" # e.g., openai:gpt-4o-mini

# Optional: write results to a file for CI parsing (the gate script also passes -o on the CLI)
outputPath: "output.json"
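Before wiring the gate into a runner, confirm the suite executes locally. A minimal smoke run (promptfoo fetched on demand via npx):
# Run the suite once and open the local results viewer
npx promptfoo eval -c promptfooconfig.yaml -j 2 --no-cache -o output.json
npx promptfoo view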
Golden set — CSV template (20 rows)
Maintain this by hand or export from your labeling tool. Keep IDs stable.
# tests/goldens.csv
id,input,must_include,format_regex,expected_label
1,"[CUSTOMER_EMAIL_ASKING_FOR_REFUND]","Next steps:","^\\{[\\s\\S]*\\}$","escalate"
2,"[LEAD_FORM_MESSAGE_SIMPLE_QUALIFIED]","Next steps:","^\\{[\\s\\S]*\\}$","approve"
3,"[LEAD_FORM_MESSAGE_SPAMMY]","Next steps:","^\\{[\\s\\S]*\\}$","reject"
...
20,"[EDGE_CASE_INPUT]","Next steps:","^\\{[\\s\\S]*\\}$","[EXPECTED]"
Usage note: Mirror the same cases in goldens.json (below) if you prefer JSON workflows. Your config can be expanded later to programmatically load datasets; start by pasting high-signal cases directly into promptfooconfig.yaml to ship fast.
Golden set — JSON template (mirror)
JSON mirror of the CSV, convenient for scripts and diffs.
// tests/goldens.json
[
{
"id": 1,
"input": "[CUSTOMER_EMAIL_ASKING_FOR_REFUND]",
"must_include": "Next steps:",
"format_regex": "^\\{[\\s\\S]*\\}$",
"expected_label": "escalate"
},
{
"id": 2,
"input": "[LEAD_FORM_MESSAGE_SIMPLE_QUALIFIED]",
"must_include": "Next steps:",
"format_regex": "^\\{[\\s\\S]*\\}$",
"expected_label": "approve"
},
{
"id": 3,
"input": "[LEAD_FORM_MESSAGE_SPAMMY]",
"must_include": "Next steps:",
"format_regex": "^\\{[\\s\\S]*\\}$",
"expected_label": "reject"
}
]
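Hand-maintained mirrors drift. A quick row-count check catches mismatches before they reach CI; this is a sketch assuming jq is installed and the CSV has one row per line (no embedded newlines in quoted fields):
# Sanity check: CSV rows (minus the header) must match the JSON array length
csv_rows=$(( $(wc -l < tests/goldens.csv) - 1 ))
json_rows=$(jq 'length' tests/goldens.json)
[ "$csv_rows" -eq "$json_rows" ] || { echo "goldens.csv ($csv_rows) and goldens.json ($json_rows) are out of sync" >&2; exit 1; }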
scripts/eval-gate.sh (CI gate script)
Runs Promptfoo with a pass-rate gate and clear exit codes you can wire into any runner (Make, n8n, or CI). Make it executable: chmod +x scripts/eval-gate.sh.
#!/usr/bin/env bash
set -eo pipefail

# Load env if present (exports every assignment in .env)
if [ -f .env ]; then
  set -a
  # shellcheck disable=SC1091
  . ./.env
  set +a
fi

: "${PROMPTFOO_PASS_RATE_THRESHOLD:=[95]}"
: "${PROMPTFOO_MAX_EVAL_TIME_MS:=[180000]}"
: "${PROMPTFOO_FAILED_TEST_EXIT_CODE:=[100]}"
: "${MAX_CONCURRENCY:=[4]}"

echo "→ Running evals with pass-rate >= ${PROMPTFOO_PASS_RATE_THRESHOLD}% and -j ${MAX_CONCURRENCY}"

# Main eval (no cache, so regressions are not masked by stale results).
# Capture the exit code with || so set -e does not abort before we can report it.
status=0
PROMPTFOO_PASS_RATE_THRESHOLD="${PROMPTFOO_PASS_RATE_THRESHOLD}" \
PROMPTFOO_MAX_EVAL_TIME_MS="${PROMPTFOO_MAX_EVAL_TIME_MS}" \
PROMPTFOO_FAILED_TEST_EXIT_CODE="${PROMPTFOO_FAILED_TEST_EXIT_CODE}" \
npx promptfoo eval -c promptfooconfig.yaml -j "${MAX_CONCURRENCY}" --no-cache -o output.json || status=$?

if [ "$status" -eq 0 ]; then
  echo "✅ Gate PASSED"
  exit 0
elif [ "$status" -eq "${PROMPTFOO_FAILED_TEST_EXIT_CODE}" ]; then
  echo "❌ Gate FAILED: tests below threshold. Exit $status"
  exit "$status"
else
  echo "⚠️ Eval ERROR: promptfoo exited $status"
  exit "$status"
fi
Tip: Add a second call that runs a 5–10 case canary set (tests/canary.csv) after deploy to watch for drift.
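A minimal sketch of that second call, assuming you keep a separate config for the canary cases (the promptfooconfig.canary.yaml file name is illustrative, not part of the bundle above):
# Post-deploy canary: small, fast, alert-only
npx promptfoo eval -c promptfooconfig.canary.yaml -j 2 --no-cache -o canary-output.json \
  || { echo "⚠️ Canary regressed; check canary-output.json" >&2; exit 1; }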
Make: "Eval Runner" via SSH (template steps)
This scenario runs the gate on a small Linux runner you control (cheap VPS or container with Node installed).
- Trigger: [Scheduler every 10m] or [Webhook → only on PR/merge]
- Step 1 — SSH: Execute command
  - Connection: [SSH_CREDENTIAL]
  - Command: cd [REPO_DIR] && git pull && bash scripts/eval-gate.sh
  - Expected outcome: Non-zero exit halts the scenario (treat as failure).
- Step 2 — Router (optional):
  - If exit = 0 → Notify success (Slack/Email) with a summary from output.json.
  - Else → Create an issue / post a comment on the PR and stop the downstream deploy.
Runner prep checklist (a quick sanity-check sketch follows the list):
- Install Node 18+ and npx on the runner.
- Set environment secrets for provider keys and the pass-rate threshold.
- Ensure the repo contains promptfooconfig.yaml and scripts/eval-gate.sh.
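A minimal sanity check covering the items above (assumes the promptfoo CLI is reachable via npx):
# Runner sanity check: run once on the VPS/container before wiring Make or n8n to it
node --version                          # expect v18 or newer
npx promptfoo --version                 # confirms the CLI can be fetched and run
test -f promptfooconfig.yaml && test -x scripts/eval-gate.sh && echo "repo looks ready"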
n8n: "Eval Runner" (execute on host)
Use the built-in Execute Command (self-hosted) or SSH node.
- Trigger: [When PR labeled "run-evals"] or [Manual]
- Node A — Execute Command (or SSH):
  - Command: cd [REPO_DIR] && git pull && bash scripts/eval-gate.sh
  - On Error: Stop workflow and mark the run as failed.
- Node B — If (status == 0):
  - Send a success message with a link to the run logs and attach output.json.
- Node C — If (status != 0):
  - Post a failure notice to Slack/Email with the top failing test IDs parsed from output.json (see the parsing sketch after this section).
Workflow prep:
- Mount the repo or pull fresh on each run.
- Inject secrets as environment variables on the n8n host.
- Ensure the host can reach your model providers (or use mock providers in dev).
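How you extract those failing IDs depends on the output.json schema of your promptfoo version; the jq sketch below assumes each result entry exposes a success flag and a test description (inspect your own output.json and adjust the paths):
# Sketch: list descriptions of failed cases from output.json (field names vary by promptfoo version)
jq -r '.results.results[] | select(.success == false) | (.testCase.description // .description // "unnamed case")' output.json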
CI gate snippet (drop-in)
Drop this into any CI system (GitHub Actions, GitLab CI, etc.). It honors exit code 0 (pass), 100 (failed tests), and 1 (other errors).
# ci/eval-gate.sh (inline snippet)
set -eo pipefail
export PROMPTFOO_PASS_RATE_THRESHOLD=${PROMPTFOO_PASS_RATE_THRESHOLD:-[95]}
export PROMPTFOO_MAX_EVAL_TIME_MS=${PROMPTFOO_MAX_EVAL_TIME_MS:-[180000]}
export PROMPTFOO_FAILED_TEST_EXIT_CODE=${PROMPTFOO_FAILED_TEST_EXIT_CODE:-[100]}
export MAX_CONCURRENCY=${MAX_CONCURRENCY:-[4]}
status=0
npx promptfoo eval -c promptfooconfig.yaml -j "$MAX_CONCURRENCY" --no-cache -o output.json || status=$?
if [ "$status" -eq 0 ]; then
echo "✅ Gate PASSED"; exit 0
elif [ "$status" -eq "$PROMPTFOO_FAILED_TEST_EXIT_CODE" ]; then
echo "❌ Gate FAILED (below ${PROMPTFOO_PASS_RATE_THRESHOLD}% pass rate)"; exit "$status"
else
echo "⚠️ Eval ERROR (exit $status)"; exit "$status"
fi
Example GitHub Actions job step:
- name: Run LLM regression gate
  run: bash ci/eval-gate.sh
  env:
    PROMPTFOO_PASS_RATE_THRESHOLD: "[95]"
    PROMPTFOO_MAX_EVAL_TIME_MS: "[180000]"
    MAX_CONCURRENCY: "[4]"
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
README template (policy, ownership, and ops)
Give teammates (or future you) the context to maintain the gate without guesswork.
# README.eval.md
## What this gate protects
- System: [SYSTEM/FEATURE]
- Risk we’re catching: [E.G., WRONG ROUTING / MISSING DISCLAIMERS / FORMAT BREAKS]
## Golden set ownership
- Source of truth: tests/goldens.csv (20 rows)
- Update protocol: Propose via PR with rationale + before/after runs
- Change log: [LINK_TO_CHANGELOG]
## Gate policy
- Pass rate: [95]% overall, with weighted must-not-fail checks
- Critical checks: Format regex must pass (weight=2)
- Time/cost guard: `-j [4]`, `PROMPTFOO_MAX_EVAL_TIME_MS=[180000]`
## Canary evals (post-deploy)
- File: tests/canary.csv (5–10 rows)
- Frequency: [HOURLY/DAILY]
- Alerting: [SLACK/EMAIL CHANNEL]
## Upgrade path
- When multi-turn or RAG traces matter, graduate to MLflow + TruLens scorers.
- For improving prompts (not just gating), trial a prompt optimizer against these goldens.