Episode 10 · June 1, 2026

Stop Building Agents by Default

Intro

This episode is for solo automation consultants tempted by agent hype who need reliability in client deliverables. You'll get a practical framework for deciding when agents are justified, plus the exact guardrails to contain them when they are.

In This Episode

Jordan runs a head-to-head comparison between a deterministic Make scenario and a boxed agent handling the same lead-intake workflow. The deterministic flow failed twice out of 1,000 runs while the agent failed 43 times and cost four times as much. He walks through the five essential guardrails for any production agent: tool allowlists with approval gates, hard timeouts with deterministic fallbacks, per-run budget caps, workflow checkpoints for auditability, and reversible actions only. The episode covers the latest framework updates from Microsoft Agent Framework and LangGraph that make these constraints enforceable, plus a systematic A/B testing approach to prove whether an agent actually earns its spot over a deterministic baseline.

Key Takeaways

Use the 80/20 rule: deterministic flows for known-path workflows, boxed agents only for tasks requiring adaptive tool selection based on unpredictable input
Implement five non-negotiable guardrails before shipping any agent: tool allowlists with approval gates, hard timeouts, budget caps, checkpoints, and reversible actions only
A/B test every agent against your deterministic baseline with frozen inputs and predefined win criteria before going live, keeping a rollback plan ready

Companion Resource

Boxed‑Agent Checklist (Solo Operator Edition)

A one‑page pre‑flight and runtime checklist to ship AI agents safely as a solo operator: strict tool allowlists and approvals, timeouts and run caps, budgets and rate limits, checkpoints and reversible actions, schema‑validated I/O, a baseline A/B harness, and a crisp rollback plan.

OpenAI docs: Assistants migration guide; Migrate to Responses API; API reference
platform.openai.com
- OpenAI’s Assistants API is deprecated with a sunset on August 26, 2026; OpenAI advises migrating to the Responses API.
Microsoft Techwiese GA note; Microsoft Learn migration guide; GitHub repo
microsoft.com
- Microsoft Agent Framework (MAF) reached General Availability in April 2026 and unifies prior orchestration efforts including AutoGen and Semantic Kernel; Microsoft provides an official AutoGen→MAF migration guide.
Microsoft Learn: Workflows: Checkpoints; Python API reference
learn.microsoft.com
- MAF supports workflow checkpoints to persist and resume full execution state across supersteps for reliability and audit.
Microsoft Learn: Using function tools with human approvals; Hyperlight CodeAct integration
learn.microsoft.com
- MAF exposes tool-level approvals via approval_mode, enabling human-in-the-loop gates for sensitive functions.
Microsoft Learn: Agent Background Responses
learn.microsoft.com
- MAF includes background responses to handle long‑running operations and resume after timeouts or disconnects.
Microsoft Learn: AG‑UI HTTP service; AG‑UI getting started
learn.microsoft.com
- MAF integration endpoints set explicit request timeouts (default ~60 seconds) in AG‑UI clients; these can be tuned per deployment.
LangGraph docs; GitHub releases
docs.langchain.com
- LangGraph provides first‑class timeout and error‑handling primitives (TimeoutPolicy, typed errors, per‑node timeouts) with May 2026 release activity reflected in the repo.
OpenAI Help Center: ChatGPT agent; API docs
help.openai.com
- OpenAI’s docs and help center indicate agents/products enforce rate/message limits to ensure reliability (e.g., monthly message limits and concurrency caps).
Cloudflare Developers: Agents limits
developers.cloudflare.com
- Cloudflare Agents platform documents hard platform limits and quotas relevant to agent workloads.
Stanford HAI: 2026 AI Index; Inside the AI Index article
hai.stanford.edu
- Stanford HAI’s 2026 AI Index reports large agent capability gains but uneven reliability: OSWorld task success improved to ~66% (still ~34% failure), while other suites like Terminal‑Bench report ~77.3% success.
IEEE Spectrum: Why AI Systems Fail Quietly
spectrum.ieee.org
- IEEE Spectrum highlights ‘behavioral reliability’ risks for agentic systems: plans that look locally reasonable can be globally unsafe, requiring stronger governance and observability.
Axios: AI’s compute wars
axios.com
- Axios reports continued capacity constraints and outages at leading model providers in 2026, reinforcing the need for fail‑safes and deterministic fallbacks for critical paths.
arXiv: Towards a Science of AI Agent Reliability; Judge Agent paper
arxiv.org
- Research proposes structured reliability science for agents and shows specialized ‘judge’ agents can dramatically cut silent‑failure rates on scientific simulations (e.g., from 42% to 1.5% in one study).
Microsoft Learn migration guide; GitHub: autogen README
learn.microsoft.com
- Microsoft provides an official AutoGen→Agent Framework migration path; the AutoGen repo README also points users to MAF.
Microsoft Learn: Using function tools with human-in-the-loop approvals
learn.microsoft.com
- Microsoft Agent Framework tool approvals
- Demonstrates gating risky actions with per-tool approvals so agents cannot execute certain functions without human sign‑off.
LangGraph docs: Fault tolerance: timeouts and error handling
docs.langchain.com
- Per‑node timeouts and error handling in LangGraph
- Shows how to cap execution time and route failures deterministically at node level, a core guardrail when you do use agents.
Stanford HAI: 2026 AI Index (Inside the AI Index: 12 Takeaways)
hai.stanford.edu
- Agent task success improvements but remaining failure rate
- Concrete benchmark context for the baseline-vs‑agent comparison the episode recommends.

Jordan: Stop building agents.

I'm serious. That lead-intake workflow you're about to hand to an agent? The one where it dedupes the contact, enriches it with Clearbit, routes it to the right CRM pipeline? Don't. Build a Make scenario. Build an n8n flow. Wire it up deterministically — step one fires, step two fires, step three fires — and move on with your day.

I know what you're thinking. "Jordan, it's twenty twenty-six. Agents are the whole conversation. Microsoft just shipped the Agent Framework. LangGraph has first-class orchestration. OpenAI is sunsetting the Assistants API to push everyone toward agentic patterns. Why would I go backward?"

Because backward is where the reliability is. Stanford's twenty twenty-six AI Index, their big annual report, tracks agent results across real-world task suites. One of the stronger results, Terminal-Bench? About seventy-seven percent success. OSWorld? Roughly one in three attempts still failed. And those are benchmark conditions. Clean inputs. Controlled environments. Not your client's CRM data at two PM on a Tuesday.

I ran both versions of the same workflow last week. Deterministic flow and a boxed agent. Same frozen inputs. Same task. The flow failed twice out of a thousand runs. The agent failed forty-three times. And it cost four times as much per run.

Forty-three failures versus two. That's not a rounding error. That's a client conversation you don't want to have.

How many of your production workflows actually need an agent? Not "could theoretically benefit from one." Need one. Where the task genuinely requires the model to choose its own tools, reason through ambiguity, and decide its own path — and a deterministic flow can't do the job.

If you're honest, the number is probably one. Maybe zero. This is Headcount Zero. I'm Jordan. And today I'm making the case that about eighty percent of the work you're tempted to hand to an agent belongs in a deterministic flow. For the other twenty percent, I'm showing you exactly how to box that agent so tight it can't hurt you. Tool allowlists, hard timeouts, per-run budgets, checkpoints, and reversible actions, with a kill switch on standby. Plus the A/B harness that proves whether the agent actually earns its spot.

So here's the question you should be asking before you reach for any agent framework. Does this task require the model to choose which tools to call, in what order, based on ambiguous input? Or do I already know the steps?

Because if you know the steps — and you almost always know the steps — then you're adding a reasoning layer on top of a problem that doesn't need reasoning. You're paying for the model to figure out what you already figured out six months ago.

I'll give you the exact example that made this click for me. I have a lead-intake pipeline for a client. New form submission comes in, gets deduped against their CRM, enriched through an API call, scored on three criteria, and routed to one of four pipelines based on the score. Six steps. Every step is known. Every branch is defined. I built it as a Make scenario in about forty-five minutes. It runs at roughly half a cent per execution. Failure rate over the last three months — zero point two percent. Almost all of those are upstream API timeouts from the enrichment provider, not logic errors.
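For readers who think in code rather than Make modules, here is a minimal sketch of what "every step is known, every branch is defined" looks like. It is not Jordan's actual scenario: the CRM set, the enrichment table, the three scoring thresholds, and the four pipeline names are all assumptions made up for illustration, and the flow is condensed to its core steps.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the client's CRM and the enrichment provider.
CRM_EMAILS = {"existing@example.com"}
ENRICHMENT = {
    "new@example.com": {"company_size": 500, "budget_usd": 60_000, "is_decision_maker": True},
}

# Four pipelines, indexed by score 0..3 (names are illustrative).
PIPELINES = ["nurture", "smb", "mid_market", "enterprise"]


@dataclass
class Lead:
    email: str
    company_size: int = 0
    budget_usd: int = 0
    is_decision_maker: bool = False


def score(lead: Lead) -> int:
    """One point per criterion met; the thresholds are assumptions."""
    return sum([
        lead.company_size >= 200,
        lead.budget_usd >= 50_000,
        lead.is_decision_maker,
    ])


def intake(email: str) -> str:
    """Dedupe, enrich, score, route. The order is fixed in code, not chosen at run time."""
    email = email.strip().lower()
    if email in CRM_EMAILS:
        return "duplicate"  # stop before spending an enrichment credit
    lead = Lead(email=email, **ENRICHMENT.get(email, {}))
    return PIPELINES[score(lead)]
```

Because the dedupe check sits above the enrichment lookup in the function body, there is no execution path in which enrichment runs first.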

Now. I rebuilt that same pipeline as an agent. Gave it access to the CRM tool, the enrichment tool, and the routing tool. Told it the goal: "Intake this lead, dedupe, enrich, score, and route." Let it figure out the order.

And it did figure it out. Most of the time. But "most of the time" is not a spec you can hand to a client.

Here's what actually happens when you let an agent run a structured workflow. The failures aren't the kind you're used to. A Make scenario fails predictably — the API times out, the webhook doesn't fire, the data doesn't match the schema. You get an error. You fix the error. Done.

Agent failures are weirder. IEEE Spectrum published a piece in April — "Why AI Systems Fail Quietly" — and they nailed the core problem. They call it behavioral reliability. The agent's plan can look locally reasonable at every single step and still be globally wrong. Each individual decision makes sense. The sequence doesn't.

In my test, the agent occasionally enriched before deduping. Which means it burned an API credit on a contact that was already in the CRM. Not a crash. Not an error. Just wasted money and a duplicate record that now has two enrichment timestamps. Try explaining that to a client's ops team.

Other times it called the routing tool before the scoring was complete — because the model decided it had "enough information" to route. It didn't. The lead went to the wrong pipeline. Again, no error. No alert. Just a silent misroute that nobody catches until the sales rep calls and says "why am I getting enterprise leads?"

And then there's the external dependency problem. Axios reported in April that paying customers at major model providers — OpenAI, Anthropic — are hitting capacity limits and outages. Not hypothetical. Happening now. If your deterministic flow hits a rate limit, it retries or fails with a typed error. If your agent hits a rate limit mid-reasoning, it might hallucinate the tool response, skip the step entirely, or loop until it burns through your token budget.

So the first principle is simple. If the path is known, use a flow. Flows fail predictably. Agents fail creatively. And creative failures are the ones that cost you clients.

Okay. So when do you actually need an agent? I said about eighty percent of work belongs in flows. That's a heuristic, not a law. But the twenty percent is real.

The test I use — does the task require the model to choose which tools to call based on input it hasn't seen before? If a customer sends a support message and the right response might require checking their order status, or pulling up their contract, or escalating to a human, or doing nothing — and you can't predetermine which — that's agent territory. The branching is too complex or too variable to hardcode.

But here's where it gets interesting. The frameworks have gotten dramatically better at letting you box that agent in. And I mean box. Not "give it guidelines." Actual hard constraints.

Let me walk through what a boxed agent looks like in practice, because the tooling that shipped this spring changes the game.

First — tool allowlists. Microsoft's Agent Framework, which hit general availability in April, has a concept called approval mode. You set it per tool. For anything destructive — sending an email, writing to a database, charging a payment — you set approval mode to always require. The agent literally pauses execution and waits for a human to approve before it can call that tool. It emits a request, you approve or deny, and only then does it proceed.

That's not a suggestion in the system prompt. That's a hard gate in the runtime. The agent cannot bypass it.
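The shape of that gate can be sketched without any framework. The snippet below is a framework-agnostic illustration, not the Microsoft Agent Framework API: the ToolGate class, its method names, and the approver callback are hypothetical. What it shows is that the allowlist and the approval check live in the code path that executes tools, so a prompt cannot talk its way around them.

```python
from typing import Any, Callable, Dict, Tuple


class ToolNotAllowed(Exception):
    """Raised when the agent asks for a tool that is not on the allowlist."""


class ApprovalDenied(Exception):
    """Raised when a human (or policy) rejects a gated tool call."""


class ToolGate:
    def __init__(self, approver: Callable[[str, Dict[str, Any]], bool]) -> None:
        self._approver = approver
        self._tools: Dict[str, Tuple[Callable[..., Any], bool]] = {}

    def allow(self, name: str, fn: Callable[..., Any], requires_approval: bool = False) -> None:
        self._tools[name] = (fn, requires_approval)

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise ToolNotAllowed(name)
        fn, requires_approval = self._tools[name]
        if requires_approval and not self._approver(name, kwargs):
            raise ApprovalDenied(name)
        return fn(**kwargs)


# Usage: reads pass straight through, the destructive tool waits on the approver.
sent = []
gate = ToolGate(approver=lambda name, args: args.get("to", "").endswith("@client.example"))
gate.allow("lookup_contact", lambda email: {"email": email, "known": False})
gate.allow("send_email", lambda to, body: sent.append((to, body)) or "sent", requires_approval=True)
```

In a real deployment the approver would block on a human decision rather than a lambda; the lambda here only stands in for that decision.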

Second — timeouts. LangGraph shipped first-class timeout primitives in their May release. You set a timeout policy per node — I use eight to fifteen seconds as a starting default — and a wall-clock cap for the entire run. Sixty to a hundred twenty seconds. If any node exceeds its timeout, it throws a typed error — a NodeError — that you can catch and route deterministically. Fail fast, fall back to your known-good flow.

This is where most people get stuck, so let me be specific. You're not just setting a timeout and hoping. You're defining what happens when the timeout fires. In my setup, a timeout on any tool call triggers a fallback that routes the input back into the deterministic Make scenario. The agent tried. It ran out of time. The flow picks up. The client never knows.
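The "define what happens when the timeout fires" part can be sketched with nothing but the Python standard library. This is not LangGraph's timeout API; it uses asyncio.wait_for, and the slow_agent and deterministic_flow coroutines are hypothetical placeholders for the agent step and the Make scenario.

```python
import asyncio
from typing import Any, Awaitable, Callable

Handler = Callable[[Any], Awaitable[dict]]


async def run_with_fallback(agent_step: Handler, fallback: Handler,
                            payload: Any, timeout_s: float = 10.0) -> dict:
    """Give the agent step a hard deadline; on timeout, hand the input to the flow."""
    try:
        return await asyncio.wait_for(agent_step(payload), timeout=timeout_s)
    except asyncio.TimeoutError:
        return await fallback(payload)


async def slow_agent(payload: Any) -> dict:
    await asyncio.sleep(1.0)  # simulates a stalled model or tool call
    return {"handled_by": "agent", "lead": payload}


async def deterministic_flow(payload: Any) -> dict:
    return {"handled_by": "flow", "lead": payload}


# The agent misses a 50 ms deadline, so the deterministic flow picks the lead up.
result = asyncio.run(run_with_fallback(slow_agent, deterministic_flow, "lead-1", timeout_s=0.05))
```

The same wrapper can be applied once more around the whole run to get the wall-clock cap, with the per-step deadlines nested inside it.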

Third — budgets. Per-run spend caps. I set mine at five cents per run, ten tool calls maximum, and a daily budget of ten dollars across all agent runs. Cloudflare's Agents platform documents hard platform limits too — these aren't optional. If any cap trips, the run aborts and falls back deterministically. No free-roaming. No runaway token spend.
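One way to make those three caps concrete is a small ledger that every tool call must pass through. This is a sketch under assumptions: the class names are hypothetical, costs are tracked in cents, and the defaults mirror the numbers from the episode (five cents per run, ten tool calls, ten dollars a day).

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    pass


@dataclass
class DailyLedger:
    cap_cents: float = 1000.0   # ten dollars across all runs
    spent_cents: float = 0.0


@dataclass
class RunBudget:
    ledger: DailyLedger
    max_cents: float = 5.0      # five cents per run
    max_tool_calls: int = 10
    spent_cents: float = 0.0
    tool_calls: int = 0

    def charge(self, cents: float) -> None:
        """Authorize one tool call; raise without recording it if any cap would trip."""
        if self.tool_calls + 1 > self.max_tool_calls:
            raise BudgetExceeded("tool-call cap")
        if self.spent_cents + cents > self.max_cents:
            raise BudgetExceeded("per-run cap")
        if self.ledger.spent_cents + cents > self.ledger.cap_cents:
            raise BudgetExceeded("daily cap")
        self.tool_calls += 1
        self.spent_cents += cents
        self.ledger.spent_cents += cents
```

The caller catches BudgetExceeded in the same place it catches a timeout and routes the input to the deterministic flow.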

I learned this one the hard way. Early prototype, no budget cap. Left it running overnight on a batch of two hundred test inputs. Woke up to a forty-seven dollar bill. For two hundred leads. That's more than twenty-three cents per lead just in inference costs, on top of the API calls. The Make scenario does the same job for half a cent.

Fourth — checkpoints. The Microsoft Agent Framework persists state at each super-step. If the agent crashes, times out, or gets interrupted, you can resume from the last checkpoint instead of restarting from scratch. More importantly, you can audit every checkpoint after the fact. You can see exactly what the agent decided at each step, what tools it called, what data it passed. When a client asks "what happened to that lead?" you have a trace, not a shrug.
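The idea of persisting state per step and resuming from the last good one can be illustrated in plain Python. This is not the Microsoft Agent Framework checkpoint API; it is a hypothetical sketch that writes a JSON file after each step, and that file doubles as the audit trail.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], Dict]]


def run_with_checkpoints(steps: List[Step], initial: Dict, path: Path) -> Dict:
    """Run steps in order, saving state after each one.

    If a checkpoint file exists, resume from it and skip completed steps
    (the `initial` argument is ignored in that case).
    """
    if path.exists():
        saved = json.loads(path.read_text())
        state, done = saved["state"], saved["done"]
    else:
        state, done = dict(initial), []
    for name, fn in steps:
        if name in done:
            continue
        state = fn(state)  # if this raises, the file still holds the last good step
        done.append(name)
        path.write_text(json.dumps({"state": state, "done": done}))
    return state
```

A production version would also record inputs, tool arguments, and timestamps per step; the sketch keeps only the state and the list of completed step names.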

And fifth — reversible actions only. Before any side-effect, require an idempotency key and a dry-run preview. The agent proposes the action. You — or your approval gate — confirm it. Only then does it commit. And you record the external system's receipt ID so you can undo it if needed.
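A minimal sketch of that propose, confirm, commit, record sequence follows. Everything here is an assumption for illustration: FakeCRM stands in for an external system that honors idempotency keys, and create_lead, the key derivation, and the receipt format are hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict


class FakeCRM:
    """Hypothetical external system that honors idempotency keys."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict] = {}   # receipt_id -> payload
        self._by_key: Dict[str, str] = {}    # idempotency key -> receipt_id
        self._next_id = 1

    def create(self, key: str, payload: Dict) -> str:
        if key in self._by_key:              # replayed request: no second write
            return self._by_key[key]
        receipt = f"rcpt-{self._next_id}"
        self._next_id += 1
        self._by_key[key] = receipt
        self.records[receipt] = payload
        return receipt

    def delete(self, receipt: str) -> None:
        self.records.pop(receipt, None)      # the undo path, keyed by receipt


def idempotency_key(action: str, payload: Dict) -> str:
    blob = json.dumps({"action": action, "payload": payload}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


def create_lead(crm: FakeCRM, payload: Dict,
                approve: Callable[[Dict], bool], dry_run: bool = True) -> Dict:
    """Return a preview by default; write only when dry_run is off and approval is given."""
    key = idempotency_key("create_lead", payload)
    preview = {"action": "create_lead", "payload": payload, "key": key}
    if dry_run or not approve(preview):
        return {"committed": False, "preview": preview}
    return {"committed": True, "receipt": crm.create(key, payload)}
```

Deriving the key from the action and payload means a retried or duplicated agent call maps to the same write, and the stored receipt is what makes the later undo possible.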

Now, I want to be honest about something. Agents are improving fast. The Stanford AI Index shows real gains. Terminal-Bench hit about seventy-seven percent success. And there's a fascinating paper on arXiv this year showing that specialized judge agents, basically a second model that reviews the first model's work, can cut silent-failure rates from forty-two percent down to one and a half percent in certain scientific simulation tasks.

One and a half percent. That's remarkable.

But... that's one domain. Scientific simulations with structured outputs. The judge agent knew exactly what "correct" looked like because the domain had clear ground truth. Your client's lead-routing logic doesn't have that. "Did this lead go to the right pipeline?" is a judgment call that depends on context the judge model doesn't have.

So yes — invest in agent architectures. Learn the frameworks. The Microsoft Agent Framework consolidating AutoGen and Semantic Kernel into one governed stack is genuinely important. OpenAI sunsetting the Assistants API by August twenty-sixth and pushing everyone to the Responses API — that's a signal that the platform layer is maturing.

But maturity doesn't mean reliability. Not yet. Not for your client work. Not without proof.

Which brings me to the part that actually matters. Before you ship any agent to production, you benchmark it against your deterministic baseline. Not in theory. In practice. On your data.

Here's what I do. I freeze a test set — fifty to a hundred real inputs from the last month of production data. I run both the deterministic flow and the boxed agent against that frozen set. Same inputs. Same expected outputs. And I compare three things. Failure rate by class — timeouts, tool errors, hallucinated actions, budget trips. Latency at the fiftieth and ninety-fifth percentile. And cost per run.

The agent ships only if it meets the win criteria I wrote before I started testing. Not after. Before. Because if you define success after you see the results, you will rationalize shipping something that isn't ready.

My default win criteria — the agent's failure rate must be at or below the baseline flow's failure rate. Cost per run can't exceed one-point-two-five times the flow. And ninety-fifth percentile latency can't exceed two times the flow. If it misses any of those, it doesn't ship. Period.
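Those win criteria are simple enough to encode before any test run. The sketch below assumes each run is logged as a success flag, a latency, and a cost; RunResult, summarize, and agent_ships are hypothetical names, the percentile uses the nearest-rank method, and the per-class failure breakdown mentioned in the episode is left out for brevity.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunResult:
    ok: bool
    latency_s: float
    cost_usd: float


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile over a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results: List[RunResult]) -> Dict[str, float]:
    n = len(results)
    latencies = [r.latency_s for r in results]
    return {
        "failure_rate": sum(not r.ok for r in results) / n,
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "cost_per_run": sum(r.cost_usd for r in results) / n,
    }


def agent_ships(flow: Dict[str, float], agent: Dict[str, float],
                max_cost_ratio: float = 1.25, max_p95_ratio: float = 2.0) -> bool:
    """All three criteria must hold; missing any one keeps the agent in staging."""
    return (
        agent["failure_rate"] <= flow["failure_rate"]
        and agent["cost_per_run"] <= max_cost_ratio * flow["cost_per_run"]
        and agent["p95_s"] <= max_p95_ratio * flow["p95_s"]
    )
```

Committing the thresholds as default arguments before running the comparison is one way to enforce the "before, not after" rule: changing them afterward shows up in version control.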

And I'll tell you — most of the time, it doesn't ship. Most of the time, the flow wins. And that's fine. That's the right answer. The agent stays in staging until either the task genuinely needs it or the reliability catches up.

The whole point is that you're not guessing. You're not reading a blog post about how agents are the future and assuming your workflow should be one. You're measuring. On your inputs. With your failure classes. And you're keeping a clean rollback — a feature flag that redirects one hundred percent of traffic back to the deterministic flow if anything goes sideways in production.
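The rollback flag itself can be tiny. A hedged sketch, assuming traffic is split by a stable hash of a lead ID and the percentage lives in a single setting you can change without a deploy; use_agent and its parameters are hypothetical names.

```python
import hashlib


def use_agent(lead_id: str, agent_traffic_pct: int) -> bool:
    """Deterministically assign a lead to the agent or the flow.

    Setting agent_traffic_pct to 0 is the rollback: every lead goes to the flow.
    """
    bucket = int(hashlib.sha256(lead_id.encode()).hexdigest(), 16) % 100
    return bucket < agent_traffic_pct
```

Hashing the ID instead of drawing a random number keeps each lead on the same side across retries, which makes production traces easier to compare.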

So — stop building agents. That's what I said at the top. And I meant it. Stop building agents by default. Stop reaching for the agentic pattern because it feels like the sophisticated choice. The sophisticated choice is the one that doesn't wake you up at two AM because a model decided to skip a step your Make scenario has never skipped in ten thousand runs.

Build the flow first. Measure it. And if — if — you have a task that genuinely needs adaptive reasoning, box that agent until it can't move without your permission. Tool allowlists. Timeouts. Budgets. Checkpoints. Reversible actions. A/B it against the flow. Write your win criteria before you run the test. And keep the kill switch warm.

If you want the full pre-flight — every guardrail, every default, every fallback pattern — grab the Boxed-Agent Checklist on the Resources page. It's the exact list I run before any agent touches client traffic.

One thing to do this week. Pick your most tempting agent candidate — the workflow you've been thinking about handing to a model — and write down the steps it takes. If you can write them down, you don't need an agent. You need a scenario. Ship that instead.

I'm Jordan. This is Headcount Zero. Go build something that works every time.

AI agents, automation workflows, deterministic flows, Microsoft Agent Framework, LangGraph, reliability testing, guardrails, solo consulting, Make.com, n8n, agent reliability, workflow automation