Episode 10

Stop Building Agents by Default

Intro

This episode is for solo automation consultants tempted by agent hype who need reliability in client deliverables. You'll get a practical framework for deciding when agents are justified, plus the exact guardrails to contain them when they are.

In This Episode

Jordan runs a head-to-head comparison between a deterministic Make scenario and a boxed agent handling the same lead-intake workflow. The deterministic flow failed twice out of 1,000 runs while the agent failed 43 times and cost four times as much. He walks through the five essential guardrails for any production agent: tool allowlists with approval gates, hard timeouts with deterministic fallbacks, per-run budget caps, workflow checkpoints for auditability, and reversible actions only. The episode covers the latest framework updates from Microsoft Agent Framework and LangGraph that make these constraints enforceable, plus a systematic A/B testing approach to prove whether an agent actually earns its spot over a deterministic baseline.

Key Takeaways

  • Use the 80/20 rule: deterministic flows for known-path workflows, boxed agents only for tasks requiring adaptive tool selection based on unpredictable input
  • Implement five non-negotiable guardrails before shipping any agent: tool allowlists with approval gates, hard timeouts, budget caps, checkpoints, and reversible actions only
  • A/B test every agent against your deterministic baseline with frozen inputs and predefined win criteria before going live, keeping a rollback plan ready

Jordan: Stop building agents.

I'm serious. That lead-intake workflow you're about to hand to an agent? The one where it dedupes the contact, enriches it with Clearbit, routes it to the right CRM pipeline? Don't. Build a Make scenario. Build an n8n flow. Wire it up deterministically — step one fires, step two fires, step three fires — and move on with your day.

I know what you're thinking. "Jordan, it's twenty twenty-six. Agents are the whole conversation. Microsoft just shipped the Agent Framework. LangGraph has first-class orchestration. OpenAI is sunsetting the Assistants API to push everyone toward agentic patterns. Why would I go backward?"

Because backward is where the reliability is. Stanford's twenty twenty-six AI Index — their big annual report — tested agents across real-world task suites. The best results? About seventy-seven percent success. The worst? One in three attempts failed. And those are benchmark conditions. Clean inputs. Controlled environments. Not your client's CRM data at two PM on a Tuesday.

I ran both versions of the same workflow last week. Deterministic flow and a boxed agent. Same frozen inputs. Same task. The flow failed twice out of a thousand runs. The agent failed forty-three times. And it cost four times as much per run.

Forty-three failures versus two. That's not a rounding error. That's a client conversation you don't want to have.

How many of your production workflows actually need an agent? Not "could theoretically benefit from one." Need one. Where the task genuinely requires the model to choose its own tools, reason through ambiguity, and decide its own path — and a deterministic flow can't do the job.

If you're honest, the number is probably one. Maybe zero. This is Headcount Zero. I'm Jordan. And today I'm making the case that about eighty percent of the work you're tempted to hand to an agent belongs in a deterministic flow. For the other twenty percent, I'm showing you exactly how to box that agent so tight it can't hurt you: tool allowlists, hard timeouts, per-run budgets, checkpoints, and reversible actions, with a kill switch behind all of it. Plus the A/B harness that proves whether the agent actually earns its spot.

So here's the question you should be asking before you reach for any agent framework. Does this task require the model to choose which tools to call, in what order, based on ambiguous input? Or do I already know the steps?

Because if you know the steps — and you almost always know the steps — then you're adding a reasoning layer on top of a problem that doesn't need reasoning. You're paying for the model to figure out what you already figured out six months ago.

I'll give you the exact example that made this click for me. I have a lead-intake pipeline for a client. New form submission comes in, gets deduped against their CRM, enriched through an API call, scored on three criteria, and routed to one of four pipelines based on the score. Six steps. Every step is known. Every branch is defined. I built it as a Make scenario in about forty-five minutes. It runs at roughly half a cent per execution. Failure rate over the last three months — zero point two percent. Almost all of those are upstream API timeouts from the enrichment provider, not logic errors.
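For readers following along in code, here is a minimal sketch of what "every step is known" looks like. It is not the actual Make scenario. The `Lead` fields, the three scoring criteria, and the pipeline names are hypothetical stand-ins, and `enrich` is whatever enrichment call you already have.

```python
from dataclasses import dataclass

@dataclass
class Lead:
    email: str
    company_size: int = 0
    budget_usd: int = 0
    is_decision_maker: bool = False

# Hypothetical pipeline names; a score of 0..3 indexes into this list.
PIPELINES = ["nurture", "smb", "mid_market", "enterprise"]

def is_duplicate(lead, crm_emails):
    # Dedupe first, so no paid enrichment call is spent on a known contact.
    return lead.email.lower() in crm_emails

def score(lead):
    # Three illustrative criteria, one point each (not the real rubric).
    return sum([
        lead.company_size >= 50,
        lead.budget_usd >= 10_000,
        lead.is_decision_maker,
    ])

def intake(lead, crm_emails, enrich):
    if is_duplicate(lead, crm_emails):
        return "duplicate"
    lead = enrich(lead)            # enrichment always runs after dedupe
    return PIPELINES[score(lead)]  # scoring always runs before routing
```

The order of operations lives in the code, not in a model's plan, so "enriched before deduping" is not a failure this version can produce.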

Now. I rebuilt that same pipeline as an agent. Gave it access to the CRM tool, the enrichment tool, and the routing tool. Told it the goal: "Intake this lead, dedupe, enrich, score, and route." Let it figure out the order.

And it did figure it out. Most of the time. But "most of the time" is not a spec you can hand to a client.

Here's what actually happens when you let an agent run a structured workflow. The failures aren't the kind you're used to. A Make scenario fails predictably — the API times out, the webhook doesn't fire, the data doesn't match the schema. You get an error. You fix the error. Done.

Agent failures are weirder. IEEE Spectrum published a piece in April — "Why AI Systems Fail Quietly" — and they nailed the core problem. They call it behavioral reliability. The agent's plan can look locally reasonable at every single step and still be globally wrong. Each individual decision makes sense. The sequence doesn't.

In my test, the agent occasionally enriched before deduping. Which means it burned an API credit on a contact that was already in the CRM. Not a crash. Not an error. Just wasted money and a duplicate record that now has two enrichment timestamps. Try explaining that to a client's ops team.

Other times it called the routing tool before the scoring was complete — because the model decided it had "enough information" to route. It didn't. The lead went to the wrong pipeline. Again, no error. No alert. Just a silent misroute that nobody catches until the sales rep calls and says "why am I getting enterprise leads?"

And then there's the external dependency problem. Axios reported in April that paying customers at major model providers — OpenAI, Anthropic — are hitting capacity limits and outages. Not hypothetical. Happening now. If your deterministic flow hits a rate limit, it retries or fails with a typed error. If your agent hits a rate limit mid-reasoning, it might hallucinate the tool response, skip the step entirely, or loop until it burns through your token budget.

So the first principle is simple. If the path is known, use a flow. Flows fail predictably. Agents fail creatively. And creative failures are the ones that cost you clients.

Okay. So when do you actually need an agent? I said about eighty percent of work belongs in flows. That's a heuristic, not a law. But the twenty percent is real.

The test I use — does the task require the model to choose which tools to call based on input it hasn't seen before? If a customer sends a support message and the right response might require checking their order status, or pulling up their contract, or escalating to a human, or doing nothing — and you can't predetermine which — that's agent territory. The branching is too complex or too variable to hardcode.

But here's where it gets interesting. The frameworks have gotten dramatically better at letting you box that agent in. And I mean box. Not "give it guidelines." Actual hard constraints.

Let me walk through what a boxed agent looks like in practice, because the tooling that shipped this spring changes the game.

First — tool allowlists. Microsoft's Agent Framework, which hit general availability in April, has a concept called approval mode. You set it per tool. For anything destructive — sending an email, writing to a database, charging a payment — you set approval mode to always require. The agent literally pauses execution and waits for a human to approve before it can call that tool. It emits a request, you approve or deny, and only then does it proceed.

That's not a suggestion in the system prompt. That's a hard gate in the runtime. The agent cannot bypass it.
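To make the idea of a runtime gate concrete, here is a minimal sketch in plain Python. It is not the Microsoft Agent Framework API. `ToolGate`, `ToolNotAllowed`, and `ApprovalDenied` are hypothetical names, and the `approver` callback stands in for whatever human approval step you wire up.

```python
class ToolNotAllowed(Exception):
    """Raised when the agent asks for a tool that is not on the allowlist."""

class ApprovalDenied(Exception):
    """Raised when a gated tool call is not approved."""

class ToolGate:
    def __init__(self, approver):
        # approver(name, kwargs) -> bool; in practice this would wait on a human.
        self._approver = approver
        self._tools = {}

    def register(self, name, fn, requires_approval=False):
        self._tools[name] = (fn, requires_approval)

    def call(self, name, **kwargs):
        if name not in self._tools:
            raise ToolNotAllowed(name)
        fn, requires_approval = self._tools[name]
        if requires_approval and not self._approver(name, kwargs):
            raise ApprovalDenied(name)
        return fn(**kwargs)

# Usage: read-only tools pass straight through, destructive ones wait for approval.
gate = ToolGate(approver=lambda name, kwargs: False)  # deny by default
gate.register("lookup_contact", lambda email: {"email": email})
gate.register("send_email", lambda to: f"sent to {to}", requires_approval=True)
```

The point of the design is that the check sits between the model and the tool, so nothing the model writes in its own output can skip it.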

Second — timeouts. LangGraph shipped first-class timeout primitives in their May release. You set a timeout policy per node — I use eight to fifteen seconds as a starting default — and a wall-clock cap for the entire run. Sixty to a hundred twenty seconds. If any node exceeds its timeout, it throws a typed error — a NodeError — that you can catch and route deterministically. Fail fast, fall back to your known-good flow.

This is where most people get stuck, so let me be specific. You're not just setting a timeout and hoping. You're defining what happens when the timeout fires. In my setup, a timeout on any tool call triggers a fallback that routes the input back into the deterministic Make scenario. The agent tried. It ran out of time. The flow picks up. The client never knows.
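The timeout-then-fallback pattern can be sketched with nothing but the standard library. This is a generic illustration, not LangGraph's timeout API. `run_with_fallback`, `agent_step`, and `fallback` are hypothetical names, and the fallback here is any callable that hands the input to your deterministic flow.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def run_with_fallback(agent_step, fallback, payload, timeout_s=10.0):
    """Return (result, source), where source is 'agent' or 'fallback'."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(agent_step, payload)
    try:
        return future.result(timeout=timeout_s), "agent"
    except FutureTimeout:
        # The agent ran out of time, so the known-good flow takes over.
        return fallback(payload), "fallback"
    finally:
        # Do not block on the slow call. Note that the worker thread is not
        # killed; a real setup also needs a timeout on the underlying request.
        pool.shutdown(wait=False)

def slow_agent(payload):
    time.sleep(0.5)  # simulates a stalled model or tool call
    return "agent-result"

result = run_with_fallback(slow_agent, lambda p: "flow-result", {}, timeout_s=0.05)
```

Returning the source alongside the result makes it easy to count how often the fallback fires, which is one of the failure classes worth tracking.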

Third — budgets. Per-run spend caps. I set mine at five cents per run, ten tool calls maximum, and a daily budget of ten dollars across all agent runs. Cloudflare's Agents platform documents hard platform limits too — these aren't optional. If any cap trips, the run aborts and falls back deterministically. No free-roaming. No runaway token spend.
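A per-run cap is a small amount of code. The sketch below uses the defaults mentioned above (five cents and ten tool calls per run). `RunBudget` and `BudgetExceeded` are hypothetical names, and the daily cap is left out; it would be the same counter persisted across runs.

```python
class BudgetExceeded(Exception):
    """Raised when a run would cross its cost or tool-call cap."""

class RunBudget:
    def __init__(self, max_cost_usd=0.05, max_tool_calls=10):
        self.max_cost_usd = max_cost_usd
        self.max_tool_calls = max_tool_calls
        self.spent_usd = 0.0
        self.tool_calls = 0

    def charge(self, cost_usd, tool_call=False):
        # Check before committing, so a rejected charge leaves the totals unchanged.
        new_spent = self.spent_usd + cost_usd
        new_calls = self.tool_calls + (1 if tool_call else 0)
        if new_spent > self.max_cost_usd:
            raise BudgetExceeded("cost cap")
        if new_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call cap")
        self.spent_usd = new_spent
        self.tool_calls = new_calls
```

The caller charges the budget before every model or tool call and treats `BudgetExceeded` like a timeout: abort the run and hand the input to the deterministic flow.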

I learned this one the hard way. Early prototype, no budget cap. Left it running overnight on a batch of two hundred test inputs. Woke up to a forty-seven dollar bill. For two hundred leads. That's about twenty-three and a half cents per lead just in inference costs, on top of the API calls. The Make scenario does the same job for half a cent.

Fourth — checkpoints. The Microsoft Agent Framework persists state at each super-step. If the agent crashes, times out, or gets interrupted, you can resume from the last checkpoint instead of restarting from scratch. More importantly, you can audit every checkpoint after the fact. You can see exactly what the agent decided at each step, what tools it called, what data it passed. When a client asks "what happened to that lead?" you have a trace, not a shrug.
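The checkpoint idea, stripped of any framework, is an append-only log of state after each step. This sketch is an assumption-laden illustration and not how the Microsoft Agent Framework stores state. `CheckpointLog` and `run_steps` are hypothetical names, and state is assumed to be JSON-serializable.

```python
import json
import os

class CheckpointLog:
    def __init__(self, path):
        self.path = path

    def save(self, step, state):
        # One JSON line per completed step; the file doubles as the audit trail.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"step": step, "state": state}) + "\n")

    def history(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def last(self):
        entries = self.history()
        return entries[-1] if entries else None

def run_steps(steps, log, state=None):
    # Resume after the last completed step, or start from scratch.
    last = log.last()
    start = last["step"] + 1 if last else 0
    state = last["state"] if last else (state or {})
    for i in range(start, len(steps)):
        state = steps[i](state)
        log.save(i, state)
    return state
```

If a step raises, the log still holds everything up to the step before it. Calling `run_steps` again with the same log picks up from there, and `history()` answers "what happened to that lead?" line by line.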

And fifth — reversible actions only. Before any side-effect, require an idempotency key and a dry-run preview. The agent proposes the action. You — or your approval gate — confirm it. Only then does it commit. And you record the external system's receipt ID so you can undo it if needed.
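Here is one way the propose, confirm, commit sequence could look. It is a sketch under assumptions: `ActionLedger` is a hypothetical in-memory store, `commit_fn` stands in for the real external call and is assumed to return that system's receipt ID, and the key is simply a hash of the action name and payload.

```python
import hashlib
import json

def idempotency_key(action, payload):
    # The same action with the same payload always produces the same key.
    raw = json.dumps({"action": action, "payload": payload}, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

class ActionLedger:
    def __init__(self):
        self.receipts = {}  # idempotency key -> external receipt ID

    def execute(self, action, payload, commit_fn, dry_run=True):
        key = idempotency_key(action, payload)
        if key in self.receipts:
            # Already committed once: return the stored receipt, do nothing.
            return {"status": "already_done", "receipt": self.receipts[key]}
        if dry_run:
            # Preview only; nothing leaves the building.
            return {"status": "preview", "action": action, "payload": payload, "key": key}
        receipt = commit_fn(payload)
        self.receipts[key] = receipt
        return {"status": "committed", "receipt": receipt}
```

Defaulting `dry_run` to `True` means a commit has to be asked for explicitly, and the stored receipt is what you hand to the external system when you need to undo the action.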

Now, I want to be honest about something. Agents are improving fast. The Stanford AI Index shows real gains. Terminal-Bench hit about seventy-seven percent success. And there's a fascinating paper on arXiv this year showing that specialized judge agents (basically a second model that reviews the first model's work) can cut silent-failure rates from forty-two percent down to one and a half percent in certain scientific simulation tasks.

One and a half percent. That's remarkable.

But... that's one domain. Scientific simulations with structured outputs. The judge agent knew exactly what "correct" looked like because the domain had clear ground truth. Your client's lead-routing logic doesn't have that. "Did this lead go to the right pipeline?" is a judgment call that depends on context the judge model doesn't have.

So yes — invest in agent architectures. Learn the frameworks. The Microsoft Agent Framework consolidating AutoGen and Semantic Kernel into one governed stack is genuinely important. OpenAI sunsetting the Assistants API by August twenty-sixth and pushing everyone to the Responses API — that's a signal that the platform layer is maturing.

But maturity doesn't mean reliability. Not yet. Not for your client work. Not without proof.

Which brings me to the part that actually matters. Before you ship any agent to production, you benchmark it against your deterministic baseline. Not in theory. In practice. On your data.

Here's what I do. I freeze a test set — fifty to a hundred real inputs from the last month of production data. I run both the deterministic flow and the boxed agent against that frozen set. Same inputs. Same expected outputs. And I compare three things. Failure rate by class — timeouts, tool errors, hallucinated actions, budget trips. Latency at the fiftieth and ninety-fifth percentile. And cost per run.
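The three comparisons can be computed from a list of per-run records. This is a sketch, not the harness used in the episode. The record fields (`failure`, `latency_s`, `cost_usd`) and the failure class names are assumptions about how you log each run.

```python
from collections import Counter
from statistics import quantiles

def summarize(runs):
    """runs: dicts with 'failure' (None or a class name), 'latency_s', 'cost_usd'."""
    failures = Counter(r["failure"] for r in runs if r["failure"])
    latencies = [r["latency_s"] for r in runs]
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 94 the 95th.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {
        "failure_rate": sum(failures.values()) / len(runs),
        "failures_by_class": dict(failures),
        "p50_s": cuts[49],
        "p95_s": cuts[94],
        "cost_per_run_usd": sum(r["cost_usd"] for r in runs) / len(runs),
    }
```

Run it once on the flow's records and once on the agent's records for the same frozen inputs, and the two dictionaries line up field by field.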

The agent ships only if it meets the win criteria I wrote before I started testing. Not after. Before. Because if you define success after you see the results, you will rationalize shipping something that isn't ready.

My default win criteria — the agent's failure rate must be at or below the baseline flow's failure rate. Cost per run can't exceed one-point-two-five times the flow. And ninety-fifth percentile latency can't exceed two times the flow. If it misses any of those, it doesn't ship. Period.
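Those criteria are easy to write down as code before any test runs. The sketch below uses the thresholds just stated (1.25x cost, 2x p95 latency). `Metrics` and `agent_ships` are hypothetical names, and the latency values in the example are made up, since the episode gives failure and cost figures but no latency numbers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metrics:
    failure_rate: float
    cost_per_run_usd: float
    p95_latency_s: float

def agent_ships(agent, flow, max_cost_ratio=1.25, max_p95_ratio=2.0):
    """Return (ships, per-criterion results). Every criterion must pass."""
    checks = {
        "failure_rate": agent.failure_rate <= flow.failure_rate,
        "cost": agent.cost_per_run_usd <= max_cost_ratio * flow.cost_per_run_usd,
        "p95_latency": agent.p95_latency_s <= max_p95_ratio * flow.p95_latency_s,
    }
    return all(checks.values()), checks

# Failure and cost figures from the head-to-head test; latencies are illustrative.
flow = Metrics(failure_rate=2 / 1000, cost_per_run_usd=0.005, p95_latency_s=1.0)
agent = Metrics(failure_rate=43 / 1000, cost_per_run_usd=0.020, p95_latency_s=1.5)
ships, checks = agent_ships(agent, flow)
```

With those inputs the agent misses on failure rate and on cost, so `ships` is `False` no matter how the latency comparison comes out.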

And I'll tell you — most of the time, it doesn't ship. Most of the time, the flow wins. And that's fine. That's the right answer. The agent stays in staging until either the task genuinely needs it or the reliability catches up.

The whole point is that you're not guessing. You're not reading a blog post about how agents are the future and assuming your workflow should be one. You're measuring. On your inputs. With your failure classes. And you're keeping a clean rollback — a feature flag that redirects one hundred percent of traffic back to the deterministic flow if anything goes sideways in production.
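The rollback flag can be as simple as one environment variable checked on every request. This is a minimal sketch. The flag name `AGENT_ENABLED` and the `handle` function are hypothetical, and a real setup might read the flag from a feature-flag service instead.

```python
import os

def agent_enabled(env=os.environ):
    # Anything other than an explicit "on" sends traffic to the flow.
    return env.get("AGENT_ENABLED", "off").strip().lower() == "on"

def handle(payload, agent, flow, env=os.environ):
    # The flag is read per request, so flipping it redirects all new traffic.
    if agent_enabled(env):
        return agent(payload)
    return flow(payload)
```

Defaulting to the flow when the flag is missing or misspelled means a configuration mistake fails toward the deterministic path rather than away from it.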

So — stop building agents. That's what I said at the top. And I meant it. Stop building agents by default. Stop reaching for the agentic pattern because it feels like the sophisticated choice. The sophisticated choice is the one that doesn't wake you up at two AM because a model decided to skip a step your Make scenario has never skipped in ten thousand runs.

Build the flow first. Measure it. And if — if — you have a task that genuinely needs adaptive reasoning, box that agent until it can't move without your permission. Tool allowlists. Timeouts. Budgets. Checkpoints. Reversible actions. A/B it against the flow. Write your win criteria before you run the test. And keep the kill switch warm.

If you want the full pre-flight — every guardrail, every default, every fallback pattern — grab the Boxed-Agent Checklist on the Resources page. It's the exact list I run before any agent touches client traffic.

One thing to do this week. Pick your most tempting agent candidate — the workflow you've been thinking about handing to a model — and write down the steps it takes. If you can write them down, you don't need an agent. You need a scenario. Ship that instead.

I'm Jordan. This is Headcount Zero. Go build something that works every time.

AI agents, automation workflows, deterministic flows, Microsoft Agent Framework, LangGraph, reliability testing, guardrails, solo consulting, Make.com, n8n, agent reliability, workflow automation