Jordan: Your automations are not reliable. I know you think they are. You tested them. They ran clean for a week. Your client said "this is amazing." And now you're building the next feature — maybe an AI summarizer, maybe a Slack integration, maybe a fancy dashboard. Stop.
Because here's what's actually happening inside those workflows right now. Stripe is sending the same webhook twice. Your Zap is processing both of them. Your client just got double-charged, or double-notified, or double-entered into a CRM. And you have no idea — because there's no log, no alert, no dead letter queue catching the failure. There's just... silence. Until the client emails you.
March. I had a Make scenario handling Shopify order webhooks for a client doing about two hundred orders a day. Worked perfectly for three weeks. Then Shopify had a brief API hiccup on a Friday night — maybe forty seconds of instability — and my scenario replayed eleven orders. Eleven duplicate fulfillment emails went out. Eleven customers got confused. My client's support inbox exploded over the weekend. And I didn't know until Monday morning because I had zero monitoring on that flow.
Eleven emails. That's what it took to convince me that "it works in testing" is a completely different statement than "it works in production."
Jordan: If you're running ten or more Zaps, scenarios, or n8n workflows across client accounts and you haven't implemented idempotency keys, explicit retry logic, a dead letter queue, and a circuit breaker — you are one API hiccup away from a weekend you can't get back. Not a theoretical weekend. A real one. The kind where you're apologizing to a client at eleven PM because their customers got duplicate charges and you had no way to catch it, no way to replay it cleanly, and no way to prove it won't happen again. Today I'm walking through all four controls — what they are, exactly where to toggle them in Zapier, Make, and n8n, and why you need them in place before you add a single new AI feature to any client workflow.
Jordan: So let's talk about why demo automations break. And I don't mean "break" like they throw an error and stop. I mean break in the way that's actually dangerous — they keep running, they look fine, and they're silently creating problems you won't discover for hours or days.
There are two categories of failure. Transient failures — a timeout, a rate limit, a brief connection drop. And logical failures — a mapping error, a bad field reference, a webhook that fires twice because the provider retried it. The platforms handle these very differently, and if you don't understand the difference, you're going to build on assumptions that will hurt you.
I'll start with the one that bit me. Idempotency. Because this is the gap that no platform fills for you by default.
Jordan: An idempotency key is just a unique fingerprint for each event your workflow processes. Before your automation does anything with side effects — sends an email, charges a card, creates a record — it checks: have I seen this exact event before? If yes, skip. If no, proceed and log the key.
The implementation is simpler than it sounds. You take the stable fields from the incoming payload — the provider's event ID if there is one, like a Stripe event ID or a Shopify order ID — and you hash them into a key. Then you write that key to a store before you do anything else. If the write succeeds, the event is new. If it fails because the key already exists, it's a duplicate. Abort.
In Zapier, you can use Zapier Tables or Storage by Zapier as your key store. In Make, a Data Store works. In n8n, there's actually a Remove Duplicates node that handles per-run deduplication, but for cross-execution idempotency — which is what you actually need — you want Redis, a key-value store, or a simple Postgres table with an upsert. There's a community pattern on the n8n forums from a builder who shared an idempotency gate workflow using event IDs and a persistent store. It's solid. The key insight from that post — and this matters — is that if you're hashing payloads, you need to normalize the data first. Strip timestamps, strip anything that changes between retries. Hash only the fields that identify the event.
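The normalize-then-hash step looks roughly like this. This is a sketch, not a drop-in: the field names "id" and "order_number" are placeholders for whatever stable identifiers your provider actually sends, like a Stripe event ID or a Shopify order ID.

```python
import hashlib
import json

# Illustrative field names; swap in the stable identifiers your
# provider actually sends (Stripe event ID, Shopify order ID, etc.).
STABLE_FIELDS = ("id", "order_number")

def idempotency_key(payload: dict) -> str:
    # Keep only the fields that identify the event; drop timestamps,
    # retry counters, anything that changes between deliveries.
    normalized = {k: payload[k] for k in STABLE_FIELDS if k in payload}
    # Canonical JSON (sorted keys, fixed separators) so the same event
    # always hashes to the same key.
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Two deliveries of the same event, differing only in a retry timestamp, now produce the same key, which is the whole point of the normalization step.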
And here's the part that surprised me — I assumed Postgres was overkill for this. It's not. A single idempotency table with an insert-on-conflict-do-nothing query is maybe six lines of SQL. You can share it across every tool in your stack. One table. Every Zap, every scenario, every n8n workflow checks the same store before writing. That's roughly a five-dollar-a-month Supabase instance replacing what would otherwise be a very expensive Monday morning.
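A runnable sketch of that shared table, under one assumption: sqlite3 stands in here so the example is self-contained, while the Postgres equivalent is shown in the comment. The table and function names are illustrative.

```python
import sqlite3

# Sketch of the shared idempotency table. sqlite3 keeps this runnable
# anywhere; in Postgres the insert would be:
#   INSERT INTO processed_events (key) VALUES (%s) ON CONFLICT (key) DO NOTHING;
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_events ("
    "  key TEXT PRIMARY KEY,"
    "  seen_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)

def claim(key: str) -> bool:
    """True if the event is new (we claimed it), False if it's a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_events (key) VALUES (?)", (key,)
    )
    conn.commit()
    # rowcount is 1 when the insert landed, 0 when the key already existed.
    return cur.rowcount == 1
```

Every workflow calls `claim()` before its first side effect; a `False` means abort, you've already processed this event.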
Jordan: Okay. Control number two — explicit retries. And this is where the platforms get sneaky, because they all have built-in retry behavior, and it's tempting to think that's enough.
Zapier has Autoreplay. It's on by default at the account level, and it will automatically retry failed Zap runs. Sounds great. But — and this is documented in Zapier's own help center, updated just this month — the moment you add a custom error handler to a Zap, Autoreplay gets disabled for that Zap. Gone. And you can't manually replay individual steps on runs that have error handlers either. You can replay the whole run, but not step by step.
So if you're building production workflows — which means you should have error handlers — you need to know that Autoreplay is no longer your safety net. You have to build your own retry logic into the error handler itself. Check if the error is a four twenty-nine or a five hundred, add a delay, retry the step, and if it still fails, route the payload somewhere you can inspect it later. That "somewhere" is your dead letter queue, which we'll get to.
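In code terms, the error-handler logic looks roughly like this. It's a sketch: `step` and `dead_letter` are placeholders for the real action and your DLQ writer, and the retryable status set is an assumption you should tune.

```python
import time

# Status codes treated as transient; adjust for your APIs.
RETRYABLE = {429, 500, 502, 503, 504}

def run_step(step, payload, dead_letter, attempts=3, base_delay=1.0):
    """step(payload) -> (status, body). Retry transient failures with
    exponential backoff; anything still failing goes to the DLQ."""
    for attempt in range(attempts):
        status, body = step(payload)
        if status < 400:
            return body
        if status not in RETRYABLE or attempt == attempts - 1:
            dead_letter(payload, status)  # park it for inspection and replay
            return None
        time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Logical errors (a 400, say) skip the retries entirely and go straight to the queue, because retrying a bad mapping just burns operations.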
Now, Zapier does give you a per-Zap Autoreplay override in Advanced Settings — you can set it to Always Replay, Never Replay, or Use Account Setting. My recommendation: set it deliberately on every production Zap. Don't rely on the account default. Know what each Zap does when it fails.
Make is actually ahead here. Make automatically retries transient errors — connection timeouts, module timeouts, rate limit errors — with exponential backoff. The retry schedule depends on whether you have Incomplete Executions enabled, but the point is: transient failures are handled. Where Make does not help you is logical errors. A bad field mapping, a missing variable, a module that returns unexpected data — those don't retry. They just fail. And if you don't have an error handler attached, they fail silently unless you're checking your scenario history.
For Make, the move is the Break error handler. Right-click any critical module, add an error handler, choose Break. What Break does is shunt the failed bundle into Incomplete Executions — which is essentially Make's built-in dead letter queue. You can inspect the error, fix the mapping, and then replay the run using Make's Run Replay feature, which uses the stored trigger data from the original execution. It consumes credits, but it's safe and auditable.
One thing to watch — you have to enable Store Incomplete Executions in your scenario settings first. It's a checkbox. If it's not checked, Break handlers don't have anywhere to put the failed bundles. I've seen people add Break handlers and wonder why nothing shows up in their incomplete executions list. That checkbox. Every time.
n8n gives you the most granular control, but also the most rope to hang yourself with. Every node has a Settings tab with Retry On Fail — you set the number of attempts and the wait between them. And there's an On Error option: Stop the workflow, Continue, or Continue Using Error Output. That last one is powerful — it lets downstream nodes see the error data and decide what to do with it.
But the real production pattern in n8n is the Error Workflow. You set it at the workflow level — Workflow Settings, Error Workflow field. When any execution fails, the Error Trigger fires in your designated error handler workflow. It receives structured data — the execution ID, the failing node, the error message. You log that to your DLQ store and now you have a complete record of what broke, where, and the payload that caused it.
And if you're running n8n with webhook triggers — which most of us are — there's a critical detail. If your workflow errors before it responds to the webhook, n8n returns a five hundred status code. The provider sees that five hundred and retries the webhook. Now you've got duplicate deliveries on top of the original failure. The fix is the Respond to Webhook node — place it immediately after the Webhook trigger, return a two hundred OK right away, and then do your processing. The provider gets their acknowledgment. If your processing fails, it goes to the Error Workflow, not back to the provider's retry queue.
That one pattern — respond early, process later — probably prevents more duplicate runs than any other single change you can make in n8n.
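Stripped of the n8n specifics, the respond-early shape is just "acknowledge, enqueue, process in the background." A minimal sketch, with `process` and `on_error` as placeholders for your real handler and your Error Workflow logger:

```python
import queue
import threading

# "Respond early, process later": acknowledge the webhook immediately,
# hand the payload to a background worker, and route failures to your
# error handler instead of back to the provider's retry queue.
work = queue.Queue()

def handle_webhook(payload) -> int:
    work.put(payload)   # enqueue before doing any real processing
    return 200          # provider gets its acknowledgment; no automatic retry

def worker(process, on_error):
    while True:
        payload = work.get()
        try:
            process(payload)
        except Exception as exc:
            on_error(payload, exc)  # your Error Workflow / DLQ, not a 500
        finally:
            work.task_done()
```

The provider only ever sees the 200; a processing failure becomes a DLQ entry rather than a retried delivery.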
Jordan: Which brings us to the dead letter queue. I've been mentioning it — now let me be specific about what it actually is and why it matters more than any individual retry.
A DLQ is just a place where failed payloads go to wait for you. Not to disappear. Not to retry endlessly. To wait. With enough context that you can figure out what went wrong, fix it, and replay the event safely.
The minimum viable DLQ record needs six things: the platform, the workflow name, the run ID, the failing step, the error message, and the original payload. If you also store the idempotency key — which you should — you can replay without worrying about duplicates, because the idempotency gate will catch any event that already succeeded on a previous attempt.
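As a sketch, the record is just a small structure you serialize into whatever store you chose. The field names here are illustrative, not a required schema:

```python
import json
import time
from dataclasses import dataclass, asdict, field

# The minimum viable DLQ record from above, plus the idempotency key
# so replays are safe.
@dataclass
class DeadLetter:
    platform: str       # "zapier" | "make" | "n8n"
    workflow: str
    run_id: str
    failing_step: str
    error: str
    payload: dict
    idempotency_key: str = ""
    logged_at: float = field(default_factory=time.time)

def to_record(dl: DeadLetter) -> str:
    """One JSON line, ready to append to whatever store you use."""
    return json.dumps(asdict(dl), sort_keys=True)
```

Whether that line lands in Notion, Airtable, Postgres, or a flat file matters far less than every failure landing in one place.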
In Zapier, the cleanest pattern is Zapier Manager. There's a trigger called New Zap Error — it fires whenever any Zap in your account errors. You connect that to a Webhooks by Zapier action that posts the error context to your DLQ endpoint. Could be a Notion database, could be an Airtable base, could be a Postgres table. The point is: every failure across every Zap lands in one place.
In Make, you already have Incomplete Executions if you followed the Break handler setup. That is your DLQ. And Run Replay is your replay mechanism. In n8n, the Error Workflow logs to whatever store you choose, and you replay by re-triggering the workflow with the captured payload.
Now — I know what some of you are thinking. "Jordan, this is over-engineering. The platforms already handle errors. Autoreplay exists. Make has backoff. I'm adding complexity for edge cases." And honestly? I thought the same thing. For about three weeks. Until the Shopify incident I mentioned at the top.
The counterargument is real — these platforms do have built-in resilience. Make's exponential backoff for transient errors is genuinely good. Zapier's Autoreplay catches a lot of failures automatically. n8n's Retry On Fail handles flaky APIs. But each of those has a boundary. Autoreplay disappears when you add error handlers. Make's backoff doesn't touch logical errors. n8n's retry doesn't track whether the original event already succeeded somewhere downstream. And none of them — none — give you a unified view of what's failing across your entire client portfolio.
When you're running thirty or forty workflows across eight clients, "check each scenario's history individually" is not an operating model. It's a prayer.
Jordan: Last control. The circuit breaker. And I'm going to ground this in Slack because that's where most of us feel the pain first.
Slack enforces a hard limit — roughly one message per second per channel. Exceed that and you get a four twenty-nine with a Retry-After header telling you how long to wait. If your automation doesn't respect that header, it retries immediately, gets another four twenty-nine, retries again — and now you're in a throttling loop that burns operations and delivers nothing.
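Honoring that header is a few lines. A sketch, with `post` as a placeholder for the real HTTP call returning a status and headers:

```python
import time

def post_respecting_retry_after(post, message, max_attempts=5):
    """post(message) -> (status, headers). On a 429, wait out the
    Retry-After value instead of hammering the API again immediately."""
    for _ in range(max_attempts):
        status, headers = post(message)
        if status != 429:
            return status
        # The API tells you exactly how long to wait; fall back to 1s if absent.
        time.sleep(float(headers.get("Retry-After", 1)))
    return 429
```

That single `sleep` is the difference between one deferred message and a throttling loop.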
This spring, Slack also tightened rate limits on read methods like conversations.history and conversations.replies for non-Marketplace commercial apps. The exact numeric caps aren't published, but the direction is clear — Slack wants you to batch, cache, and throttle. Not loop.
A circuit breaker is the pattern that prevents this. You track your error rate or response latency over a rolling window — say five minutes. When the rate crosses a threshold you define — maybe ten percent errors, maybe p95 latency above eight seconds — the breaker trips. It flips a flag in a key-value store or a simple database row. Every workflow checks that flag before making downstream calls. If the breaker is open, the workflow short-circuits — skips the Slack post, skips the API call, logs the intent to the DLQ, and sends one summary alert to a designated channel.
One message. Not forty. One message that says "circuit breaker tripped, non-critical posts suppressed, error rate twelve percent over the last five minutes." That's it. When the error rate drops back below threshold, the breaker closes and normal operations resume.
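The tracking half of the pattern can be sketched as a small class. The thresholds mirror the examples above (ten percent errors over five minutes) and are illustrative; in production the open/closed state would live in a shared key-value store that every workflow checks.

```python
import time
from collections import deque

class CircuitBreaker:
    """Rolling-window error-rate breaker (sketch)."""
    def __init__(self, window_s=300, threshold=0.10, min_calls=10):
        self.window_s = window_s
        self.threshold = threshold
        self.min_calls = min_calls      # don't trip on tiny samples
        self.calls = deque()            # (timestamp, ok) pairs

    def _prune(self, now):
        # Drop anything older than the rolling window.
        while self.calls and self.calls[0][0] < now - self.window_s:
            self.calls.popleft()

    def record(self, ok, now=None):
        now = time.time() if now is None else now
        self.calls.append((now, ok))
        self._prune(now)

    def is_open(self, now=None):
        now = time.time() if now is None else now
        self._prune(now)
        if len(self.calls) < self.min_calls:
            return False
        errors = sum(1 for _, ok in self.calls if not ok)
        return errors / len(self.calls) > self.threshold
```

Every workflow calls `is_open()` before its downstream calls; when it returns true, skip the call, log the intent to the DLQ, and let the one summary alert do the talking.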
The irony is that the circuit breaker is the simplest of the four controls to implement — it's a boolean flag and an if-statement — but it's the one that actually lets you sleep. Because without it, a single downstream outage can cascade through every workflow that touches that service. With it, the blast radius is contained to one alert and a queue of deferred work.
So — four controls. Idempotency keys to prevent duplicates. Explicit retries with backoff to handle transient failures deliberately. A dead letter queue to catch everything that still fails and give you a clean replay path. And a circuit breaker to contain cascading failures before they wake you up at two AM.
None of these require a new tool. None of them require a paid add-on. They require about three to four hours of setup across your stack — and then they run forever.
Jordan: I keep thinking about those eleven duplicate fulfillment emails. Eleven customers who got a message saying their order shipped — twice. And the thing is, every one of those customers probably thought it was a glitch. A minor annoyance. But my client? My client saw it as a sign that the automation couldn't be trusted. And when your client stops trusting the automation, they start checking it manually. And when they're checking it manually, they're not saving time anymore. They're spending more time than before you built anything — because now they're doing the work and auditing the machine.
That's the real cost. Not the eleven emails. The erosion of trust that makes your work invisible again.
So here's what I want you to do this week. Pick one production workflow — the one that handles the most volume or touches the most money — and add the idempotency gate. Just that. One table, one check before the first side effect. It takes roughly twenty minutes. And the next time that webhook fires twice, your workflow will catch it, skip the duplicate, and you'll never know it happened. Which is exactly the point.
If you want the full implementation across all four controls — the SQL for the idempotency table, the Zapier error handler pattern, the Make Break handler setup, the n8n Error Workflow, the circuit breaker flag, all of it — grab the Production-Safe Automations checklist on the Resources page. Every toggle location, every snippet, paste-ready.
Alright. Go make something that doesn't break at two AM. I'll see you Wednesday.