LLM Regression Gate Starter Kit (Make/n8n + Promptfoo)
A copy‑ready bundle to stand up a lightweight LLM regression gate: frozen 20‑row goldens, a Promptfoo YAML with regex/contains/classifier examples, Make/n8n runner steps, and CI snippets using pass‑rate thresholds and exit codes. Built for solo operators who ship prompt changes weekly.
Use this template to build a lightweight, shippable regression gate for your LLM prompts or workflows. You’ll maintain a tiny frozen golden set (start with ~20 rows), run three simple checkers (regex, must‑contain, classifier/LLM‑judge), and enforce a pass/fail threshold in CI or from Make/n8n. Replace [BRACKETS] with your details, commit the files, and run the eval script. Aim to get a green gate locally before wiring it into automations.
Bundle structure (drop-in)
Copy this structure into your repo. Keep tests small and versioned.
[PROJECT_ROOT]/
├─ promptfooconfig.yaml
├─ tests/
│ ├─ goldens.csv # 20+ canonical cases you will NOT edit casually
│ ├─ goldens.json # JSON mirror of the CSV (handy for diffs/scripts)
│ └─ canary.csv # 5–10 high-signal cases you run post-deploy
├─ scripts/
│ └─ eval-gate.sh # Runs Promptfoo with pass-rate gate + exit codes
├─ .env.example # Env vars for local and runner usage
└─ README.eval.md # Why these tests, thresholds, and ops notes
.env.example (fill and keep out of git)
Copy to .env or your CI secret store.
# Gate policy
PROMPTFOO_PASS_RATE_THRESHOLD=[95] # % that must pass (0–100; default 100)
PROMPTFOO_MAX_EVAL_TIME_MS=[180000] # Hard cap on total eval time (ms)
PROMPTFOO_FAILED_TEST_EXIT_CODE=[100] # Non-zero exit on failed tests
MAX_CONCURRENCY=[4] # Passed to -j for API cost/latency
# Model providers (example)
OPENAI_API_KEY=[YOUR_OPENAI_KEY]
ANTHROPIC_API_KEY=[YOUR_ANTHROPIC_KEY]
AZURE_OPENAI_API_KEY=[YOUR_AZURE_KEY]
Notes:
- Keep provider keys in CI secrets; never commit real keys.
- With small sets (<30 rows), one failure swings the pass rate by more than 3 points; at 20 rows each failure costs 5 points, so a 95% threshold tolerates exactly one miss. Choose thresholds that reflect the cost of errors in your business.
promptfooconfig.yaml (starter)
This config demonstrates three assertion styles you can automate today: regex (format), contains (must-include), and classifier/LLM‑rubric (semantics). Duplicate the test blocks until you reach 20–30 cases. Keep them stable.
# promptfooconfig.yaml
# Minimal CI-ready config with per-test assertions and optional thresholds.

# 1) Providers — pick one to start; add alternates to compare
providers:
  - id: "[PROVIDER_ID]" # e.g., openai:gpt-4o-mini | anthropic:claude-3-haiku
    config:
      temperature: [0.2]

# 2) Prompt under test — inline here or reference a file
prompts:
  - |
    [YOUR_PROMPT_TEMPLATE]
    ---
    Task input:
    {{input}}

# 3) Tests — keep small, frozen, and representative
# Each test carries concrete assertions. Add weight to must-not-fail checks.
tests:
  - description: "T001-json-format: output must be a single JSON object (no prose)"
    vars:
      input: "[INPUT_EXAMPLE_JSON]"
    assert:
      - type: regex
        value: '^\{[\s\S]*\}$'
        weight: 2

  - description: "T002-contains-next-steps: response must include the required heading"
    vars:
      input: "[INPUT_EXAMPLE_STEPS]"
    assert:
      - type: contains
        value: "[MUST_INCLUDE_PHRASE]" # e.g., "Next steps:"

  - description: "T003-classifier-routing: route to the correct path"
    vars:
      input: "[INPUT_EXAMPLE_ROUTE]"
    assert:
      - type: classifier
        # Needs a classification-capable provider, e.g. huggingface:text-classification:[MODEL].
        # If you want an LLM judge to pick the label instead, use llm-rubric as in T004.
        provider: "[CLASSIFIER_PROVIDER_ID]"
        value: "[EXPECTED_LABEL]" # expected class, e.g. approve | reject | escalate
        threshold: 0.8

  - description: "T004-llm-rubric-thresholded: meets content/rubric bar"
    vars:
      input: "[INPUT_EXAMPLE_RUBRIC]"
    assert:
      - type: llm-rubric
        value: "Mentions [MUST_KEYWORD_A] and provides a numbered list with at least 3 items."
        threshold: 0.8 # per-test semantic pass bar
        provider: "[JUDGE_PROVIDER_ID]" # e.g., openai:gpt-4o-mini

# Optional: write results to a file for CI parsing (the gate script also passes -o on the CLI)
outputPath: "output.json"
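Before wiring the gate into a runner, confirm the suite executes locally. A minimal smoke run (promptfoo fetched on demand via npx):
# Run the suite once and open the local results viewer
npx promptfoo eval -c promptfooconfig.yaml -j 2 --no-cache -o output.json
npx promptfoo view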
Golden set — CSV template (20 rows)
Maintain this by hand or export from your labeling tool. Keep IDs stable.
# tests/goldens.csv
id,input,must_include,format_regex,expected_label
1,"[CUSTOMER_EMAIL_ASKING_FOR_REFUND]","Next steps:","^\\{[\\s\\S]*\\}$","escalate"
2,"[LEAD_FORM_MESSAGE_SIMPLE_QUALIFIED]","Next steps:","^\\{[\\s\\S]*\\}$","approve"
3,"[LEAD_FORM_MESSAGE_SPAMMY]","Next steps:","^\\{[\\s\\S]*\\}$","reject"
...
20,"[EDGE_CASE_INPUT]","Next steps:","^\\{[\\s\\S]*\\}$","[EXPECTED]"
Usage note: Mirror the same cases in goldens.json (below) if you prefer JSON workflows. Your config can be expanded later to programmatically load datasets; start by pasting high-signal cases directly into promptfooconfig.yaml to ship fast.
Golden set — JSON template (mirror)
JSON mirror of the CSV, convenient for scripts and diffs.
// tests/goldens.json
[
{
"id": 1,
"input": "[CUSTOMER_EMAIL_ASKING_FOR_REFUND]",
"must_include": "Next steps:",
"format_regex": "^\\{[\\s\\S]*\\}$",
"expected_label": "escalate"
},
{
"id": 2,
"input": "[LEAD_FORM_MESSAGE_SIMPLE_QUALIFIED]",
"must_include": "Next steps:",
"format_regex": "^\\{[\\s\\S]*\\}$",
"expected_label": "approve"
},
{
"id": 3,
"input": "[LEAD_FORM_MESSAGE_SPAMMY]",
"must_include": "Next steps:",
"format_regex": "^\\{[\\s\\S]*\\}$",
"expected_label": "reject"
}
]
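Hand-maintained mirrors drift. A quick row-count check catches mismatches before they reach CI; this is a sketch assuming jq is installed and the CSV has one row per line (no embedded newlines in quoted fields):
# Sanity check: CSV rows (minus the header) must match the JSON array length
csv_rows=$(( $(wc -l < tests/goldens.csv) - 1 ))
json_rows=$(jq 'length' tests/goldens.json)
[ "$csv_rows" -eq "$json_rows" ] || { echo "goldens.csv ($csv_rows) and goldens.json ($json_rows) are out of sync" >&2; exit 1; }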
scripts/eval-gate.sh (CI gate script)
Runs Promptfoo with a pass-rate gate and clear exit codes you can wire into any runner (Make, n8n, or CI). Make it executable: chmod +x scripts/eval-gate.sh.
#!/usr/bin/env bash
set -eo pipefail

# Load env if present (exports every assignment in .env)
if [ -f .env ]; then
  set -a
  # shellcheck disable=SC1091
  . ./.env
  set +a
fi

: "${PROMPTFOO_PASS_RATE_THRESHOLD:=[95]}"
: "${PROMPTFOO_MAX_EVAL_TIME_MS:=[180000]}"
: "${PROMPTFOO_FAILED_TEST_EXIT_CODE:=[100]}"
: "${MAX_CONCURRENCY:=[4]}"

echo "→ Running evals with pass-rate >= ${PROMPTFOO_PASS_RATE_THRESHOLD}% and -j ${MAX_CONCURRENCY}"

# Main eval (no cache, so regressions are not masked by stale results).
# Capture the exit code with || so set -e does not abort before we can report it.
status=0
PROMPTFOO_PASS_RATE_THRESHOLD="${PROMPTFOO_PASS_RATE_THRESHOLD}" \
PROMPTFOO_MAX_EVAL_TIME_MS="${PROMPTFOO_MAX_EVAL_TIME_MS}" \
PROMPTFOO_FAILED_TEST_EXIT_CODE="${PROMPTFOO_FAILED_TEST_EXIT_CODE}" \
npx promptfoo eval -c promptfooconfig.yaml -j "${MAX_CONCURRENCY}" --no-cache -o output.json || status=$?

if [ "$status" -eq 0 ]; then
  echo "✅ Gate PASSED"
  exit 0
elif [ "$status" -eq "${PROMPTFOO_FAILED_TEST_EXIT_CODE}" ]; then
  echo "❌ Gate FAILED: tests below threshold. Exit $status"
  exit "$status"
else
  echo "⚠️ Eval ERROR: promptfoo exited $status"
  exit "$status"
fi
Tip: Add a second call that runs a 5–10 case canary set (tests/canary.csv) after deploy to watch for drift.
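A minimal sketch of that second call, assuming you keep a separate config for the canary cases (the promptfooconfig.canary.yaml file name is illustrative, not part of the bundle above):
# Post-deploy canary: small, fast, alert-only
npx promptfoo eval -c promptfooconfig.canary.yaml -j 2 --no-cache -o canary-output.json \
  || { echo "⚠️ Canary regressed; check canary-output.json" >&2; exit 1; }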
Make: "Eval Runner" via SSH (template steps)
This scenario runs the gate on a small Linux runner you control (cheap VPS or container with Node installed).
- Trigger: [Scheduler every 10m] or [Webhook → only on PR/merge]
- Step 1 — SSH: Execute command
  - Connection: [SSH_CREDENTIAL]
  - Command: cd [REPO_DIR] && git pull && bash scripts/eval-gate.sh
  - Expected outcome: Non-zero exit halts the scenario (treat as failure).
- Step 2 — Router (optional):
  - If exit = 0 → Notify success (Slack/Email) with a summary from output.json.
  - Else → Create an issue / post a comment on the PR and stop the downstream deploy.
Runner prep checklist (a quick sanity-check sketch follows the list):
- Install Node 18+ and npx on the runner.
- Set environment secrets for provider keys and the pass-rate threshold.
- Ensure the repo contains promptfooconfig.yaml and scripts/eval-gate.sh.
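A minimal sanity check covering the items above (assumes the promptfoo CLI is reachable via npx):
# Runner sanity check: run once on the VPS/container before wiring Make or n8n to it
node --version                          # expect v18 or newer
npx promptfoo --version                 # confirms the CLI can be fetched and run
test -f promptfooconfig.yaml && test -x scripts/eval-gate.sh && echo "repo looks ready"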
n8n: "Eval Runner" (execute on host)
Use the built-in Execute Command (self-hosted) or SSH node.
- Trigger: [When PR labeled "run-evals"] or [Manual]
- Node A — Execute Command (or SSH):
  - Command: cd [REPO_DIR] && git pull && bash scripts/eval-gate.sh
  - On Error: Stop workflow and mark the run as failed.
- Node B — If (status == 0):
  - Send a success message with a link to the run logs and attach output.json.
- Node C — If (status != 0):
  - Post a failure notice to Slack/Email with the top failing test IDs parsed from output.json (see the parsing sketch after this section).
Workflow prep:
- Mount the repo or pull fresh on each run.
- Inject secrets as environment variables on the n8n host.
- Ensure the host can reach your model providers (or use mock providers in dev).
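How you extract those failing IDs depends on the output.json schema of your promptfoo version; the jq sketch below assumes each result entry exposes a success flag and a test description (inspect your own output.json and adjust the paths):
# Sketch: list descriptions of failed cases from output.json (field names vary by promptfoo version)
jq -r '.results.results[] | select(.success == false) | (.testCase.description // .description // "unnamed case")' output.json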
CI gate snippet (drop-in)
Drop this into any CI system (GitHub Actions, GitLab CI, etc.). It honors exit code 0 (pass), 100 (failed tests), and 1 (other errors).
# ci/eval-gate.sh (inline snippet)
set -eo pipefail
export PROMPTFOO_PASS_RATE_THRESHOLD=${PROMPTFOO_PASS_RATE_THRESHOLD:-[95]}
export PROMPTFOO_MAX_EVAL_TIME_MS=${PROMPTFOO_MAX_EVAL_TIME_MS:-[180000]}
export PROMPTFOO_FAILED_TEST_EXIT_CODE=${PROMPTFOO_FAILED_TEST_EXIT_CODE:-[100]}
export MAX_CONCURRENCY=${MAX_CONCURRENCY:-[4]}
status=0
npx promptfoo eval -c promptfooconfig.yaml -j "$MAX_CONCURRENCY" --no-cache -o output.json || status=$?
if [ "$status" -eq 0 ]; then
echo "✅ Gate PASSED"; exit 0
elif [ "$status" -eq "$PROMPTFOO_FAILED_TEST_EXIT_CODE" ]; then
echo "❌ Gate FAILED (below ${PROMPTFOO_PASS_RATE_THRESHOLD}% pass rate)"; exit "$status"
else
echo "⚠️ Eval ERROR (exit $status)"; exit "$status"
fi
Example GitHub Actions job step:
- name: Run LLM regression gate
  run: bash ci/eval-gate.sh
  env:
    PROMPTFOO_PASS_RATE_THRESHOLD: "[95]"
    PROMPTFOO_MAX_EVAL_TIME_MS: "[180000]"
    MAX_CONCURRENCY: "[4]"
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
README template (policy, ownership, and ops)
Give teammates (or future you) the context to maintain the gate without guesswork.
# README.eval.md
## What this gate protects
- System: [SYSTEM/FEATURE]
- Risk we’re catching: [E.G., WRONG ROUTING / MISSING DISCLAIMERS / FORMAT BREAKS]
## Golden set ownership
- Source of truth: tests/goldens.csv (20 rows)
- Update protocol: Propose via PR with rationale + before/after runs
- Change log: [LINK_TO_CHANGELOG]
## Gate policy
- Pass rate: [95]% overall, with weighted must-not-fail checks
- Critical checks: Format regex must pass (weight=2)
- Time/cost guard: `-j [4]`, `PROMPTFOO_MAX_EVAL_TIME_MS=[180000]`
## Canary evals (post-deploy)
- File: tests/canary.csv (5–10 rows)
- Frequency: [HOURLY/DAILY]
- Alerting: [SLACK/EMAIL CHANNEL]
## Upgrade path
- When multi-turn or RAG traces matter, graduate to MLflow + TruLens scorers.
- For improving prompts (not just gating), trial a prompt optimizer against these goldens.