Template

LLM Regression Gate Starter Kit (Make/n8n + Promptfoo)

A copy‑ready bundle to stand up a lightweight LLM regression gate: frozen 20‑row goldens, a Promptfoo YAML with regex/contains/classifier examples, Make/n8n runner steps, and CI snippets using pass‑rate thresholds and exit codes. Built for solo operators who ship prompt changes weekly.

Use this template to stand up a lightweight, shippable regression gate for your LLM prompts or workflows. You’ll maintain a tiny frozen golden set (start with ~20 rows), run three simple checkers (regex, must‑contain, classifier/LLM‑judge), and enforce a pass/fail threshold in CI or from Make/n8n. Replace [BRACKETS] with your details, commit the files, and run the eval script. Aim to get a green gate locally before wiring it into automations.

Bundle structure (drop-in)

Copy this structure into your repo. Keep tests small and versioned.

[PROJECT_ROOT]/
├─ promptfooconfig.yaml
├─ tests/
│  ├─ goldens.csv              # 20+ canonical cases you will NOT edit casually
│  ├─ goldens.json             # JSON mirror of the CSV (handy for diffs/scripts)
│  └─ canary.csv               # 5–10 high-signal cases you run post-deploy
├─ scripts/
│  └─ eval-gate.sh             # Runs Promptfoo with pass-rate gate + exit codes
├─ .env.example                # Env vars for local and runner usage
└─ README.eval.md              # Why these tests, thresholds, and ops notes

.env.example (fill and keep out of git)

Copy to .env or your CI secret store.

# Gate policy
PROMPTFOO_PASS_RATE_THRESHOLD=[95]           # % that must pass (0–100; default 100)
PROMPTFOO_MAX_EVAL_TIME_MS=[180000]          # Hard cap on total eval time (ms)
PROMPTFOO_FAILED_TEST_EXIT_CODE=[100]        # Non-zero exit on failed tests
MAX_CONCURRENCY=[4]                          # Passed to -j for API cost/latency

# Model providers (example)
OPENAI_API_KEY=[YOUR_OPENAI_KEY]
ANTHROPIC_API_KEY=[YOUR_ANTHROPIC_KEY]
AZURE_OPENAI_API_KEY=[YOUR_AZURE_KEY]

Notes:

  • Keep provider keys in CI secrets; never commit real keys.
  • With small sets, a single failure moves the pass rate sharply: at 20 rows, one failing case drops you from 100% to 95%. Choose thresholds that reflect the cost of errors in your business.

promptfooconfig.yaml (starter)

This config demonstrates three assertion styles you can automate today: regex (format), contains (must-include), and classifier/LLM‑rubric (semantics). Duplicate the test blocks until you reach 20–30 cases. Keep them stable.

# promptfooconfig.yaml
# Minimal CI-ready config with per-test assertions and optional thresholds.

# 1) Providers — pick one to start; add alternates to compare
providers:
  - id: "[PROVIDER_ID]"            # e.g., openai:gpt-4o-mini | anthropic:claude-3-haiku
    config:
      temperature: [0.2]

# 2) Prompt under test — inline or reference a file
prompts:
  - |
    [YOUR_PROMPT_TEMPLATE]
    ---
    Task input:
    {{input}}

# 3) Tests — keep small, frozen, and representative
#    Each test includes concrete assertions. Add weight to must-not-fail checks.
tests:
  - description: T001-json-format
    vars:
      input: "[INPUT_EXAMPLE_JSON]"
    assert:
      # Output must be a single JSON object (no prose)
      - type: regex
        value: '^\{[\s\S]*\}$'
        weight: 2

  - description: T002-contains-next-steps
    vars:
      input: "[INPUT_EXAMPLE_STEPS]"
    assert:
      # Response must include the required heading
      - type: contains
        value: "[MUST_INCLUDE_PHRASE]"   # e.g., "Next steps:"

  - description: T003-classifier-routing
    vars:
      input: "[INPUT_EXAMPLE_ROUTE]"
    assert:
      # Route to the correct path (one of: approve | reject | escalate)
      - type: classifier
        value: "[EXPECTED_LABEL]"
        provider: "[JUDGE_PROVIDER_ID]"  # e.g., openai:gpt-4o-mini

  - description: T004-llm-rubric-thresholded
    vars:
      input: "[INPUT_EXAMPLE_RUBRIC]"
      rubric: "Mentions [MUST_KEYWORD_A] and provides a numbered list with at least 3 items."
    assert:
      # Meets the content/rubric bar
      - type: llm-rubric
        value: "{{rubric}}"
        threshold: 0.8                  # per-test semantic pass bar
        provider: "[JUDGE_PROVIDER_ID]"
# Optional: write results to a file for CI parsing (the gate script also passes -o output.json)
outputPath: "output.json"
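
Sanity-check the config locally before wiring it into automations; the goal is a green gate on your machine first. A minimal sketch (npx fetches promptfoo on first use):

# Run the suite once and write results for inspection
npx promptfoo eval -c promptfooconfig.yaml -j 2 --no-cache -o output.json

# Optional: open promptfoo's local web viewer to browse passes/failures
npx promptfoo view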

Golden set — CSV template (20 rows)

Maintain this by hand or export from your labeling tool. Keep IDs stable.

# tests/goldens.csv
id,input,must_include,format_regex,expected_label
1,"[CUSTOMER_EMAIL_ASKING_FOR_REFUND]","Next steps:","^\{[\s\S]*\}$","escalate"
2,"[LEAD_FORM_MESSAGE_SIMPLE_QUALIFIED]","Next steps:","^\{[\s\S]*\}$","approve"
3,"[LEAD_FORM_MESSAGE_SPAMMY]","Next steps:","^\{[\s\S]*\}$","reject"
...
20,"[EDGE_CASE_INPUT]","Next steps:","^\{[\s\S]*\}$","[EXPECTED]"

Usage note: Mirror the same cases in goldens.json (below) if you prefer JSON workflows. Your config can be expanded later to programmatically load datasets; start by pasting high-signal cases directly into promptfooconfig.yaml to ship fast.
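
For example, row 1 of the CSV above maps onto a test block like the sketch below. It follows the starter config's conventions; the label check uses llm-rubric here as one option (the classifier assertion from the starter config works too if you have a judge configured):

# One goldens.csv row pasted into promptfooconfig.yaml (under tests:)
  - description: G001-refund-escalation
    vars:
      input: "[CUSTOMER_EMAIL_ASKING_FOR_REFUND]"
    assert:
      - type: regex
        value: '^\{[\s\S]*\}$'            # format_regex column
      - type: contains
        value: "Next steps:"              # must_include column
      - type: llm-rubric
        value: "Routes this case to 'escalate'."   # expected_label column, judged semantically
        provider: "[JUDGE_PROVIDER_ID]"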

Golden set — JSON template (mirror)

JSON mirror of the CSV, convenient for scripts and diffs.

// tests/goldens.json
[
  {
    "id": 1,
    "input": "[CUSTOMER_EMAIL_ASKING_FOR_REFUND]",
    "must_include": "Next steps:",
    "format_regex": "^\\{[\\s\\S]*\\}$",
    "expected_label": "escalate"
  },
  {
    "id": 2,
    "input": "[LEAD_FORM_MESSAGE_SIMPLE_QUALIFIED]",
    "must_include": "Next steps:",
    "format_regex": "^\\{[\\s\\S]*\\}$",
    "expected_label": "approve"
  },
  {
    "id": 3,
    "input": "[LEAD_FORM_MESSAGE_SPAMMY]",
    "must_include": "Next steps:",
    "format_regex": "^\\{[\\s\\S]*\\}$",
    "expected_label": "reject"
  }
]
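
To catch the two files drifting apart, here is a quick consistency check: a sketch assuming jq is installed, the CSV has one header row, and no cell contains embedded newlines.

# Compare case counts between the CSV and its JSON mirror
csv_rows=$(($(wc -l < tests/goldens.csv) - 1))
json_rows=$(jq 'length' tests/goldens.json)
[ "$csv_rows" -eq "$json_rows" ] && echo "In sync: $csv_rows cases" || echo "MISMATCH: csv=$csv_rows json=$json_rows"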

scripts/eval-gate.sh (CI gate script)

Runs Promptfoo with a pass-rate gate and clear exit codes you can wire into Make, n8n, or any CI runner. Make it executable: chmod +x scripts/eval-gate.sh.

#!/usr/bin/env bash
set -euo pipefail

# Load env if present (export every assignment in .env)
if [ -f .env ]; then
  set -a
  # shellcheck disable=SC1091
  source .env
  set +a
fi

: "${PROMPTFOO_PASS_RATE_THRESHOLD:=[95]}"
: "${PROMPTFOO_MAX_EVAL_TIME_MS:=[180000]}"
: "${PROMPTFOO_FAILED_TEST_EXIT_CODE:=[100]}"
: "${MAX_CONCURRENCY:=[4]}"

echo "→ Running evals with pass-rate >= ${PROMPTFOO_PASS_RATE_THRESHOLD}% and -j ${MAX_CONCURRENCY}"

# Main eval (no cache, so regressions aren't hidden by cached responses).
# Capture the exit code without tripping set -e.
status=0
PROMPTFOO_PASS_RATE_THRESHOLD="${PROMPTFOO_PASS_RATE_THRESHOLD}" \
PROMPTFOO_MAX_EVAL_TIME_MS="${PROMPTFOO_MAX_EVAL_TIME_MS}" \
PROMPTFOO_FAILED_TEST_EXIT_CODE="${PROMPTFOO_FAILED_TEST_EXIT_CODE}" \
npx promptfoo eval -c promptfooconfig.yaml -j "${MAX_CONCURRENCY}" --no-cache -o output.json || status=$?

if [ "$status" -eq 0 ]; then
  echo "✅ Gate PASSED"
  exit 0
elif [ "$status" -eq "${PROMPTFOO_FAILED_TEST_EXIT_CODE}" ]; then
  echo "❌ Gate FAILED: tests below threshold. Exit $status"
  exit "$status"
else
  echo "⚠️  Eval ERROR: promptfoo exited $status"
  exit "$status"
fi

Tip: Add a second call that runs a 5–10 case canary set (tests/canary.csv) after deploy to watch for drift.
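
One way to wire that up, sketched below, assumes you keep a second config (promptfooconfig.canary.yaml) that reuses the same prompt but lists only the canary cases:

# scripts/canary-gate.sh: post-deploy drift check (assumes a separate canary config)
npx promptfoo eval -c promptfooconfig.canary.yaml -j "${MAX_CONCURRENCY:-[4]}" --no-cache -o canary-output.json \
  || { echo "❌ Canary FAILED: investigate before the next prompt change ships"; exit 1; }
echo "✅ Canary PASSED"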

Make: "Eval Runner" via SSH (template steps)

This scenario runs the gate on a small Linux runner you control (cheap VPS or container with Node installed).

  • Trigger: [Scheduler every 10m] or [Webhook → only on PR/merge]
  • Step 1 — SSH: Execute command
    • Connection: [SSH_CREDENTIAL]
    • Command:
      cd [REPO_DIR] && git pull && bash scripts/eval-gate.sh
      
    • Expected outcome: Non-zero exit halts the scenario (treat as failure).
  • Step 2 — Router (optional):
    • If exit = 0 → Notify success (Slack/Email) with summary from output.json.
    • Else → Create issue / post comment on PR and stop downstream deploy.

Runner prep checklist:

  • Install Node 18+ on the runner (npx ships with npm).
  • Set environment secrets for provider keys and pass-rate threshold.
  • Ensure the repo contains promptfooconfig.yaml and scripts/eval-gate.sh.
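
A quick sanity check for the items above, run once on the runner (a sketch; adjust paths to your setup):

# Verify the runner can execute the gate
node --version                        # expect v18 or newer
npx promptfoo --version               # fetches promptfoo on first use
test -f [REPO_DIR]/promptfooconfig.yaml && test -x [REPO_DIR]/scripts/eval-gate.sh \
  && echo "Runner ready" || echo "Missing config or gate script"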

n8n: "Eval Runner" (execute on host)

Use the built-in Execute Command (self-hosted) or SSH node.

  • Trigger: [When PR labeled "run-evals"] or [Manual]
  • Node A — Execute Command (or SSH):
    • Command:
      cd [REPO_DIR] && git pull && bash scripts/eval-gate.sh
      
    • On Error: Stop workflow and mark run as failed.
  • Node B — If (status == 0):
    • Send success message with a link to the run logs and attach output.json.
  • Node C — If (status != 0):
    • Post failure notice to Slack/Email with top failing test IDs parsed from output.json.
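
To pull failing case names out of output.json for that notice, a jq sketch is below. Treat the field names as assumptions: the exact shape of promptfoo's JSON output varies by version, so inspect one real output file and adjust the path.

# Hypothetical failing-case extraction; verify the .results.results[].success and
# .testCase.description fields against the output.json your promptfoo version writes.
jq -r '.results.results[]? | select(.success == false) | (.testCase.description // (.vars | tojson))' output.json | head -n 5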

Workflow prep:

  • Mount the repo or pull fresh on each run.
  • Inject secrets as environment variables on the n8n host.
  • Ensure the host can reach your model providers (or use mock providers in dev).

CI gate snippet (drop-in)

Drop this into any CI system (GitHub Actions, GitLab CI, etc.). It treats exit code 0 as pass, the configured PROMPTFOO_FAILED_TEST_EXIT_CODE (100 here) as failed tests, and any other non-zero code as an eval error.

# ci/eval-gate.sh (inline snippet)
set -euo pipefail
export PROMPTFOO_PASS_RATE_THRESHOLD=${PROMPTFOO_PASS_RATE_THRESHOLD:-[95]}
export PROMPTFOO_MAX_EVAL_TIME_MS=${PROMPTFOO_MAX_EVAL_TIME_MS:-[180000]}
export PROMPTFOO_FAILED_TEST_EXIT_CODE=${PROMPTFOO_FAILED_TEST_EXIT_CODE:-[100]}
export MAX_CONCURRENCY=${MAX_CONCURRENCY:-[4]}

status=0
npx promptfoo eval -c promptfooconfig.yaml -j "$MAX_CONCURRENCY" --no-cache -o output.json || status=$?

if [ "$status" -eq 0 ]; then
  echo "✅ Gate PASSED"; exit 0
elif [ "$status" -eq "$PROMPTFOO_FAILED_TEST_EXIT_CODE" ]; then
  echo "❌ Gate FAILED (below ${PROMPTFOO_PASS_RATE_THRESHOLD}% pass rate)"; exit "$status"
else
  echo "⚠️  Eval ERROR (exit $status)"; exit "$status"
fi

Example GitHub Actions job step:

- name: Run LLM regression gate
  run: bash ci/eval-gate.sh
  env:
    PROMPTFOO_PASS_RATE_THRESHOLD: [95]
    PROMPTFOO_MAX_EVAL_TIME_MS: [180000]
    MAX_CONCURRENCY: [4]
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
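
In a full workflow, that step needs a checkout and Node setup first. A minimal job sketch, assuming GitHub-hosted Ubuntu runners and that ci/eval-gate.sh is committed to the repo:

# .github/workflows/llm-regression-gate.yml (sketch)
name: LLM regression gate
on:
  pull_request:
    paths:
      - "promptfooconfig.yaml"
      - "tests/**"
      - "scripts/**"
      - "ci/**"

jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - name: Run LLM regression gate
        run: bash ci/eval-gate.sh
        env:
          PROMPTFOO_PASS_RATE_THRESHOLD: [95]
          PROMPTFOO_MAX_EVAL_TIME_MS: [180000]
          MAX_CONCURRENCY: [4]
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}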

README template (policy, ownership, and ops)

Give teammates (or future you) the context to maintain the gate without guesswork.

# README.eval.md

## What this gate protects
- System: [SYSTEM/FEATURE]
- Risk we’re catching: [E.G., WRONG ROUTING / MISSING DISCLAIMERS / FORMAT BREAKS]

## Golden set ownership
- Source of truth: tests/goldens.csv (20 rows)
- Update protocol: Propose via PR with rationale + before/after runs
- Change log: [LINK_TO_CHANGELOG]

## Gate policy
- Pass rate: [95]% overall, with weighted must-not-fail checks
- Critical checks: Format regex must pass (weight=2)
- Time/cost guard: `-j [4]`, `PROMPTFOO_MAX_EVAL_TIME_MS=[180000]`

## Canary evals (post-deploy)
- File: tests/canary.csv (5–10 rows)
- Frequency: [HOURLY/DAILY]
- Alerting: [SLACK/EMAIL CHANNEL]

## Upgrade path
- When multi-turn or RAG traces matter, graduate to MLflow + TruLens scorers.
- For improving prompts (not just gating), trial a prompt optimizer against these goldens.