Checklist · May 25, 2026

Boxed‑Agent Checklist (Solo Operator Edition)

A one‑page pre‑flight and runtime checklist to ship AI agents safely as a solo operator: strict tool allowlists and approvals, timeouts and run caps, budgets and rate limits, checkpoints and reversible actions, schema‑validated I/O, a baseline A/B harness, and a crisp rollback plan.

From Episode: Stop Building Agents by Default

Run this pre‑flight before any agent touches production, then re‑run on every version bump. Bias to deterministic flows first; if you ship an agent, keep it boxed with strict controls and a clean rollback.

1
Gate the decision: do you actually need an agent?
Ship a deterministic flow if inputs are structured and the path is known. Only proceed with an agent if the task needs open‑ended reasoning or tool choice. Write your win criteria now, measured against the deterministic baseline (e.g., failure rate ≤ baseline, cost/run ≤ 1.25× baseline, p95 latency ≤ 2× baseline).
2
Define a hard tool allowlist + approvals
List only the tools the agent may call. Mark destructive tools (email.send, db.write, file.delete, payment.charge) as approval‑required so runs pause for human sign‑off before execution (e.g., approval_mode=always_require). Default all other tools to read‑only where possible.
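A minimal sketch of an allowlist with an approval gate, assuming a hand-rolled dispatcher rather than any particular agent framework. The tool functions, the `ALLOWLIST` registry, and `call_tool` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class ToolNotAllowed(Exception):
    """The agent asked for a tool that is not on the allowlist."""


class ApprovalRequired(Exception):
    """The tool is allowlisted but needs human sign-off first."""


@dataclass(frozen=True)
class ToolPolicy:
    fn: Callable[..., Any]
    requires_approval: bool = False


# Hypothetical tools, stand-ins for real integrations.
def search_docs(query: str) -> str:
    return f"results for {query}"


def send_email(to: str, body: str) -> str:
    return f"sent to {to}"


# Anything not listed here cannot be called at all.
ALLOWLIST: Dict[str, ToolPolicy] = {
    "docs.search": ToolPolicy(search_docs),
    "email.send": ToolPolicy(send_email, requires_approval=True),
}


def call_tool(name: str, args: dict, approved: bool = False) -> Any:
    policy = ALLOWLIST.get(name)
    if policy is None:
        raise ToolNotAllowed(name)
    if policy.requires_approval and not approved:
        # Pause point: the caller surfaces this to a human reviewer.
        raise ApprovalRequired(name)
    return policy.fn(**args)
```

In this sketch the run loop would catch `ApprovalRequired`, persist the pending call, and resume with `approved=True` only after a human signs off.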
3
Lock identities, scopes, and surfaces
Run the agent under least‑privilege service accounts with scoped API keys. Separate staging vs. production credentials. Restrict external surfaces (allowed domains/paths, filesystem sandbox like /tmp only) and block raw shell/HTTP unless explicitly allowlisted.
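One way to enforce the surface restrictions is a pair of guard functions that every HTTP or file tool calls first. This is a sketch under assumptions: the host `api.example.com` and the sandbox root `/tmp/agent-sandbox` are placeholders, and a POSIX filesystem with Python 3.9+ is assumed.

```python
from pathlib import Path
from urllib.parse import urlparse

# Placeholder values; replace with your own hosts and sandbox directory.
ALLOWED_HOSTS = {"api.example.com"}
SANDBOX_ROOT = Path("/tmp/agent-sandbox")


def url_allowed(url: str) -> bool:
    """Allow only HTTPS requests to an exact, allowlisted hostname."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS


def path_allowed(path: str) -> bool:
    """Allow only paths that resolve to a location inside the sandbox."""
    resolved = (SANDBOX_ROOT / path).resolve()
    return resolved.is_relative_to(SANDBOX_ROOT.resolve())
```

Resolving the path before comparing is what blocks `../` traversal and absolute paths; comparing the full hostname is what blocks look-alike domains such as `api.example.com.evil.io`.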
4
Set timeouts per node and a global run cap
Enforce a per‑tool/node timeout (starter defaults: 8–15s) and a wall‑clock cap for the whole run (starter default: 60–120s). Fail fast on timeout with a typed error and route to a safe fallback; for long tasks, use background/continuation patterns rather than stretching timeouts.
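The two timeout layers can be sketched with `asyncio.wait_for`, assuming the tools are coroutines. `ToolTimeout`, `call_with_timeout`, `run_with_cap`, and `slow_tool` are illustrative names, and the timeout values are passed in by the caller rather than fixed here.

```python
import asyncio
from typing import Any, Awaitable, Callable


class ToolTimeout(Exception):
    """Typed error so callers can route timeouts explicitly."""

    def __init__(self, tool: str):
        super().__init__(f"tool timed out: {tool}")
        self.tool = tool


async def call_with_timeout(tool: str, coro: Awaitable[Any], timeout_s: float) -> Any:
    """Per-tool timeout: cancel the call and raise a typed error."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        raise ToolTimeout(tool) from exc


async def run_with_cap(run: Awaitable[Any], fallback: Callable[[], Any], cap_s: float) -> Any:
    """Global wall-clock cap: on any timeout, return the safe fallback."""
    try:
        return await asyncio.wait_for(run, timeout=cap_s)
    except (asyncio.TimeoutError, ToolTimeout):
        return fallback()


# Hypothetical tool used only to demonstrate the behaviour.
async def slow_tool(delay_s: float) -> str:
    await asyncio.sleep(delay_s)
    return "done"
```

For example, `asyncio.run(run_with_cap(slow_tool(0.5), lambda: "fallback", cap_s=0.05))` returns `"fallback"` instead of waiting for the slow call.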
5
Enforce budgets and rate limits
Set per‑run spend caps (e.g., $0.05/run), token caps, max tool calls (e.g., ≤ 10), and a daily budget (e.g., $10). Add concurrency limits (e.g., 2–3 runs) and exponential backoff on 429/5xx. If any cap trips, abort the run and fall back deterministically.
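A per-run budget can be a small counter object that every model and tool call charges against. The sketch below uses the example caps from this item; the token cap of 8,000 and the backoff parameters are assumptions, and `RunBudget` and `backoff_delay` are hypothetical names.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a charge would push the run past one of its caps."""


@dataclass
class RunBudget:
    max_cost_usd: float = 0.05
    max_tool_calls: int = 10
    max_tokens: int = 8_000  # assumed value, not from the checklist
    cost_usd: float = 0.0
    tool_calls: int = 0
    tokens: int = 0

    def charge(self, cost_usd: float = 0.0, tokens: int = 0, tool_calls: int = 0) -> None:
        """Check the projected totals first, then commit the charge."""
        new_cost = self.cost_usd + cost_usd
        new_tokens = self.tokens + tokens
        new_calls = self.tool_calls + tool_calls
        if new_cost > self.max_cost_usd:
            raise BudgetExceeded("cost")
        if new_tokens > self.max_tokens:
            raise BudgetExceeded("tokens")
        if new_calls > self.max_tool_calls:
            raise BudgetExceeded("tool_calls")
        self.cost_usd, self.tokens, self.tool_calls = new_cost, new_tokens, new_calls


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0) -> float:
    """Exponential backoff for 429/5xx: base * 2^attempt, capped."""
    return min(cap_s, base_s * (2 ** attempt))
```

The run loop would treat `BudgetExceeded` as a signal to abort and take the deterministic fallback. Adding random jitter to `backoff_delay` is a common refinement that this sketch omits to keep it deterministic.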
6
Add checkpoints and make side‑effects reversible
Persist state at each super‑step (one full round of node or tool execution) so you can resume or audit. Before any side‑effect, require an idempotency key and a dry‑run/preview. Only commit after approval; record the external system's receipt/ID so you can undo or reconcile if needed.
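The idempotency key, preview, and receipt steps can be sketched as a small in-memory ledger. This is an illustration only: `idempotency_key` and `SideEffectLedger` are hypothetical names, and a real deployment would persist receipts alongside its checkpoints rather than in a dict.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def idempotency_key(run_id: str, step: str, payload: dict) -> str:
    """Derive a stable key from the run, the step, and the canonical payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{run_id}:{step}:{canonical}".encode()).hexdigest()


class SideEffectLedger:
    """Commits each keyed side-effect at most once and keeps its receipt."""

    def __init__(self) -> None:
        self._receipts: Dict[str, dict] = {}

    def execute(self, key: str, action: Callable[[], Any], *,
                approved: bool, dry_run: bool = True) -> dict:
        if key in self._receipts:
            # Replay after a crash or retry: return the stored receipt.
            return self._receipts[key]
        if dry_run or not approved:
            # Preview only; nothing is sent to the external system.
            return {"status": "preview", "key": key}
        receipt = action()
        self._receipts[key] = {"status": "committed", "key": key, "receipt": receipt}
        return self._receipts[key]
```

Because the key is derived from the payload, resuming from a checkpoint and replaying the same step returns the stored receipt instead of repeating the side-effect.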
7
Validate input/output with strict schemas
Define JSON/Pydantic schemas for agent inputs, tool arguments, and tool outputs. Parse in strict mode (no coercion); reject runs on schema violations. Normalize enumerations (status, currency, country) and enforce required fields before proceeding.
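To show what strict, no-coercion parsing means in practice, here is a standard-library sketch for one tool's arguments. It swaps in a hand-written validator in place of Pydantic so the example has no dependencies; the `RefundArgs` schema, its fields, and the two supported currencies are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class SchemaError(ValueError):
    """Raised on any schema violation; the run should be rejected."""


class Currency(str, Enum):
    USD = "USD"
    EUR = "EUR"


@dataclass(frozen=True)
class RefundArgs:
    order_id: str
    amount_cents: int
    currency: Currency


_EXPECTED = {"order_id": str, "amount_cents": int, "currency": str}


def parse_refund_args(raw: dict) -> RefundArgs:
    missing = _EXPECTED.keys() - raw.keys()
    extra = raw.keys() - _EXPECTED.keys()
    if missing or extra:
        raise SchemaError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, expected_type in _EXPECTED.items():
        # Exact type match: "500" and True are both rejected for an int field.
        if type(raw[name]) is not expected_type:
            raise SchemaError(f"{name}: expected {expected_type.__name__}")
    if raw["amount_cents"] <= 0:
        raise SchemaError("amount_cents must be positive")
    try:
        # Normalize the enumeration before matching it.
        currency = Currency(raw["currency"].strip().upper())
    except ValueError as exc:
        raise SchemaError(f"currency: unsupported {raw['currency']!r}") from exc
    return RefundArgs(raw["order_id"], raw["amount_cents"], currency)
```

The same pattern applies to tool outputs: parse them into a typed object before the agent sees them, and treat a `SchemaError` as a failed run rather than something to repair silently.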
8
Wire audit logging and redaction
Log every run with a trace ID, tool calls, inputs/outputs (after PII redaction), token usage, cost, timings, approvals, and errors. Store structured logs in your datastore; add alerts on budget_exceeded, repeated tool errors (>3 in 5 min), or p95 latency spikes (>2× baseline).
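A structured audit record with redaction applied before serialization might look like the sketch below. It only redacts email addresses with a simple regular expression, which is an assumption for illustration; real PII redaction needs broader coverage. The field names and the `too_many_errors` helper are hypothetical.

```python
import json
import re
from typing import Any, List, Optional

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(value: Any) -> Any:
    """Recursively replace email addresses in strings, dicts, and lists."""
    if isinstance(value, str):
        return EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


def audit_record(trace_id: str, tool: str, inputs: Any, outputs: Any,
                 cost_usd: float, duration_ms: int,
                 error: Optional[str] = None) -> str:
    """Build one JSON log line; redaction happens before serialization."""
    return json.dumps({
        "trace_id": trace_id,
        "tool": tool,
        "inputs": redact(inputs),
        "outputs": redact(outputs),
        "cost_usd": cost_usd,
        "duration_ms": duration_ms,
        "error": error,
    }, sort_keys=True)


def too_many_errors(error_times: List[float], now: float,
                    window_s: float = 300.0, threshold: int = 3) -> bool:
    """Alert rule from the checklist: more than 3 tool errors in 5 minutes."""
    recent = [t for t in error_times if now - t <= window_s]
    return len(recent) > threshold
```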
9
Build an A/B harness against your deterministic baseline
Freeze a test set (e.g., 50–100 real inputs). Run both the deterministic flow and the boxed agent. Compare failure classes, p50/p95 latency, and $/run. Ship the agent only if it meets the pre‑written win criteria with guardrails enabled.
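The comparison step can be sketched as two functions: one that summarizes a batch of results and one that applies the win criteria. The default multipliers below are the example values from step 1 (cost ≤ 1.25×, p95 latency ≤ 2×); the `Result` record and the function names are hypothetical, and each batch is assumed to hold at least two results.

```python
from dataclasses import dataclass
from statistics import mean, quantiles
from typing import Dict, List


@dataclass(frozen=True)
class Result:
    ok: bool
    latency_s: float
    cost_usd: float


def summarize(results: List[Result]) -> Dict[str, float]:
    """Failure rate, p50/p95 latency, and mean cost per run for one arm."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = quantiles([r.latency_s for r in results], n=100, method="inclusive")
    return {
        "failure_rate": sum(not r.ok for r in results) / len(results),
        "p50": cuts[49],
        "p95": cuts[94],
        "cost_per_run": mean(r.cost_usd for r in results),
    }


def meets_win_criteria(baseline: Dict[str, float], agent: Dict[str, float],
                       cost_mult: float = 1.25, p95_mult: float = 2.0) -> bool:
    """Apply the pre-written criteria; every condition must hold."""
    return (
        agent["failure_rate"] <= baseline["failure_rate"]
        and agent["cost_per_run"] <= cost_mult * baseline["cost_per_run"]
        and agent["p95"] <= p95_mult * baseline["p95"]
    )
```

Run both arms over the same frozen inputs, call `summarize` on each, and gate the release on `meets_win_criteria` rather than on an after-the-fact judgment.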
10
Handle rate limits, errors, and fallbacks deterministically
Map 429/limit errors to queued retries within caps; map timeouts to a known fallback path; map validation/tool errors to a human‑review queue with the last checkpoint attached. Never let the agent free‑roam after an error.
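The mapping above can be written as a single routing function so that no error path is left to the agent's discretion. This is a sketch: the exception classes, the `Route` enum, and the retry cap of 3 are assumptions, and sending rate-limit errors to the fallback once retries are exhausted follows the "abort and fall back" rule from the budgets step.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    RETRY_QUEUE = "retry_queue"
    FALLBACK = "fallback"
    HUMAN_REVIEW = "human_review"


# Hypothetical typed errors raised by the tool layer.
class RateLimited(Exception):
    pass


class ValidationFailed(Exception):
    pass


def route_error(exc: Exception, attempt: int, max_attempts: int = 3,
                checkpoint_id: Optional[str] = None) -> dict:
    """Map each error class to exactly one deterministic next step."""
    if isinstance(exc, RateLimited):
        if attempt < max_attempts:
            return {"route": Route.RETRY_QUEUE, "attempt": attempt + 1}
        # Retry cap exhausted: stop retrying and take the known path.
        return {"route": Route.FALLBACK}
    if isinstance(exc, TimeoutError):
        return {"route": Route.FALLBACK}
    # Validation errors, tool errors, and anything unrecognized go to a
    # human, with the last checkpoint attached for context.
    return {"route": Route.HUMAN_REVIEW, "checkpoint_id": checkpoint_id}
```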
11
Prepare a crisp rollback and kill switch
Wrap the agent behind a feature flag. Document steps to redirect 100% of traffic to the deterministic baseline, revoke agent credentials, and drain/rehydrate in‑flight work from checkpoints. Practice the rollback once before go‑live.
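The simplest kill switch is a flag checked on every request, with the deterministic baseline as the default. The sketch below assumes an environment variable named `AGENT_ENABLED` (a hypothetical name) and treats a missing or malformed value as "off" so that the safe path wins by default.

```python
import os
from typing import Any, Callable, Mapping, Optional


def agent_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """The flag is on only when explicitly set to 'true'."""
    env = os.environ if env is None else env
    return env.get("AGENT_ENABLED", "false").strip().lower() == "true"


def handle(request: Any, agent: Callable[[Any], Any],
           baseline: Callable[[Any], Any],
           env: Optional[Mapping[str, str]] = None) -> Any:
    """Route to the agent only while the flag is on; otherwise use the baseline."""
    if not agent_enabled(env):
        return baseline(request)
    try:
        return agent(request)
    except Exception:
        # Any agent failure degrades to the deterministic path.
        return baseline(request)
```

Flipping the flag covers the traffic redirect only; revoking credentials and draining in-flight work from checkpoints remain separate, documented steps.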
12
Go live via a canary and review cycle
Start with 10–20% of traffic for 24–48 hours. Watch alerts, compare A/B metrics in real time, and manually review 10 random traces. Promote to 100% only if metrics hold; otherwise roll back, fix, and re‑test.
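A canary split can be made deterministic by hashing a stable request or user ID into one of 100 buckets, so the same ID always lands on the same side. This is a minimal sketch; `in_canary` is a hypothetical helper, and the choice of ID to hash is left to you.

```python
import hashlib


def in_canary(request_id: str, percent: int) -> bool:
    """Return True if this ID falls inside the canary percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash to a bucket in [0, 100); buckets below `percent` are canary.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Raising `percent` only adds IDs to the canary group and never removes any, so promoting from 10% to 100% does not reshuffle who already saw the agent.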