Self‑Healing Automation Pack (Make/Zapier/n8n + Notion/Stripe) — Copy‑Ready Templates
Copy‑paste templates to make your automations self‑healing: DLQ schema, Make/Zapier/n8n error handlers + reprocessors, Notion incident/RCA, Better Stack/Instatus status publishing, Stripe credit automation, and security guardrails.
Use this pack to ship a three‑layer reliability pattern in under an hour: platform error handling, a dead‑letter queue (DLQ) with a reprocessor, and a lightweight status/comms loop. Fill in the [BRACKETS], paste the snippets into your tool of choice, and toggle the options that match your client’s risk profile. Version: v1 (April 2026).
Core DLQ schema (Airtable/Notion)
- Goal: one normalized record per failed run, safe to retry later without double‑writing.
- Storage: Airtable (suggested) or Notion.
- Table name: `DLQ` (aka "Parking Table").

Columns (create exactly these):
- `id` (Primary key, formula): `IF({id_raw}, {id_raw}, RECORD_ID())`
- `id_raw` (Text, optional): native failure/run id from the platform.
- `workflow` (Single line text): [WORKFLOW_NAME]
- `client` (Single line text): [CLIENT_NAME]
- `severity` (Single select): S1 | S2 | S3 | S4
- `status` (Single select): NEW | RETRYING | PARKED | RESOLVED | ESCALATED
- `first_seen` (Created time)
- `last_seen` (Last modified time)
- `attempts` (Number, integer, default 0)
- `error_code` (Text)
- `error_message` (Long text)
- `payload_redacted` (Long text): never store secrets/PII.
- `reprocess_url` (URL or text): optional pointer to a re-run endpoint.
- `notes_internal` (Long text)
- `notified_at` (DateTime, optional)
- `healthcheck_ping` (URL, optional)
Recommended view filters:
- Open: `status` IN (NEW, PARKED) and `attempts` < [MAX_ATTEMPTS]
- Stuck: `attempts` ≥ [MAX_ATTEMPTS]
Retry policy defaults:
- [MAX_ATTEMPTS] = 5
- Backoff with jitter: 1m, 2m, 4m, 8m, 16m ± random(0–30%)
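As a quick sketch, the default schedule can be generated like this (`backoffSchedule` is a hypothetical helper, not part of any platform SDK):

```js
// Hypothetical helper: generate the default backoff schedule in minutes
// with 0–30% jitter applied to each step.
function backoffSchedule(maxAttempts = 5, baseMinutes = 1, jitterFraction = 0.3) {
  return Array.from({ length: maxAttempts }, (_, attempt) =>
    baseMinutes * 2 ** attempt * (1 + Math.random() * jitterFraction)
  );
}
// backoffSchedule() → five delays, roughly 1m, 2m, 4m, 8m, 16m before jitter
```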
Idempotency + redaction mini‑template
Add one of these to every mutation step (API write, email send, CRM create) so retries are safe.
Pattern A — Request header supports idempotency (e.g., Stripe):
- Key: `Idempotency-Key`
- Value: `[CLIENT_ID]-[WORKFLOW_NAME]-[SOURCE_RECORD_ID]-[ATTEMPT]`
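A minimal sketch of assembling that value (the function name and argument names are illustrative, not a vendor API):

```js
// Hypothetical helper: assemble the idempotency key from the bracketed
// placeholders above, joined with hyphens.
function idempotencyKey(clientId, workflowName, sourceRecordId, attempt) {
  return [clientId, workflowName, sourceRecordId, attempt].join('-');
}
```

For example, `idempotencyKey('acme', 'invoice-sync', 'rec123', 0)` yields `'acme-invoice-sync-rec123-0'`.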
Pattern B — DIY dedupe (no native idempotency):
- Build a stable hash from immutable fields.
- Zapier Code step (JS):
```js
const crypto = require('crypto');
const crypto_key = `${inputData.client}|${inputData.workflow}|${inputData.sourceId}`;
return { idem: crypto.createHash('sha256').update(crypto_key).digest('hex') };
```
- Check your store (Storage by Zapier / Airtable / Notion) for `idem`.
- If it exists → skip; if not → write, then store the key with a TTL of [IDEM_TTL_DAYS] days.
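The check-then-write flow, sketched with an in-memory `Map` standing in for the store (TTL expiry omitted for brevity):

```js
// Dedupe sketch: skip the mutation if the idempotency key was seen before.
const seen = new Map(); // stand-in for Storage by Zapier / Airtable

function writeOnce(idem, performWrite) {
  if (seen.has(idem)) return { skipped: true }; // duplicate → skip the write
  const result = performWrite();                // first time → do the write
  seen.set(idem, Date.now());                   // then record the key
  return { skipped: false, result };
}
```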
Redaction helper (use before logging):
```js
// Recursively walk a value; lowercase object keys and replace values of
// sensitive-looking keys with '[REDACTED]'.
function redact(obj) {
  const SENSITIVE = ['password', 'token', 'secret', 'authorization', 'cookie', 'ssn', 'card'];
  const walk = (v) =>
    Array.isArray(v) ? v.map(walk)
    : v && typeof v === 'object'
      ? Object.fromEntries(Object.entries(v).map(([k, val]) => [
          k.toLowerCase(),
          SENSITIVE.some((s) => k.toLowerCase().includes(s)) ? '[REDACTED]' : walk(val),
        ]))
      : v;
  return walk(obj);
}
```
Make.com — error handler + DLQ feeder + reprocessor
Scenario outline:
- Name: `[WORKFLOW_NAME] — Self‑Healing`
- Modules: [TRIGGER] → [TRANSFORM] → [WRITE/API] (+ Error Handler)

Error handler branch on the [WRITE/API] module:
- Directive (choose one per module):
  - Rollback (use for ACID‑labeled modules; aborts transactional writes)
  - Ignore (skip the failed module, continue the main route)
- Handler route steps:
  - Tools > Function (JS) → build `payload_redacted` with the Redaction helper.
  - Airtable > Create record in `DLQ` with fields:
    - `workflow`: [WORKFLOW_NAME]
    - `client`: [CLIENT_NAME]
    - `error_code`: {{error.code}}
    - `error_message`: {{error.message}}
    - `payload_redacted`: redacted JSON
    - `attempts`: {{1}}
    - `status`: NEW
  - Slack > Post message to [SLACK_WEBHOOK_URL] with a short alert.
  - Flow control > Break (optional) to halt; otherwise Continue.
Reprocessor (separate scenario, scheduled every 5–10 minutes):
- Trigger: Airtable `DLQ`, view = Open
- Iterator: each record
- Tools > Sleep: `delay_ms = min(960000, base * 2^attempts * (1 + rand(0..0.3)))`
- Optional circuit breaker: if failures for [UPSTREAM_SERVICE] in the last 5m ≥ [CB_THRESHOLD], set status PARKED and skip.
- Try the main [WRITE/API] with an idempotency key.
- On success: set `status` = RESOLVED.
- On failure: increment `attempts`, update `last_seen`; if `attempts` ≥ [MAX_ATTEMPTS] → ESCALATED.
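The success/failure transitions above can be sketched in plain JS (this is pseudologic for the Make modules, not runnable inside Make as-is; `MAX_ATTEMPTS` mirrors the [MAX_ATTEMPTS] default):

```js
// Hedged sketch of the reprocessor's state machine for one DLQ record.
const MAX_ATTEMPTS = 5;

function reprocess(record, tryWrite) {
  try {
    tryWrite(record); // idempotent main write
    return { ...record, status: 'RESOLVED' };
  } catch (err) {
    const attempts = record.attempts + 1;
    return {
      ...record,
      attempts,
      last_seen: new Date().toISOString(),
      status: attempts >= MAX_ATTEMPTS ? 'ESCALATED' : 'NEW', // escalate on final failure
    };
  }
}
```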
Heartbeat (optional but recommended):
- HTTP > Make a request (GET) to [HEALTHCHECKS_PING_URL] on start and finish. Missing pings will alert you.
Notes:
- Use Commit/Rollback only on ACID‑capable modules (labeled in Make). Others use Ignore/Break.
- Keep Notion touches low; its API averages ~3 rps per integration.
Zapier — Autoreplay vs. custom handler, plus DLQ + reprocessor
Choose one design per Zap (don’t combine — custom Error Handling changes replay behavior on that Zap):
Option A — Autoreplay for transient failures (simple):
- In Zap Settings, enable Autoreplay. Zapier will retry up to 5 times on a backoff schedule (5m, 30m, 1h, 3h, 6h). If it still fails, handle manually or send to DLQ with a follow‑up Zap.
Option B — Custom Error Handling → DLQ (portable):
- On the risky step, open ••• → `On failure` → run a handler path:
  - Code by Zapier (JS) → build `idem` and `payload_redacted`.
  - Airtable → Create record in `DLQ` with fields from the Core schema.
  - Slack/Email → alert.
- Note: enabling custom Error Handling on a Zap disables that Zap's Autoreplay.
Reprocessor Zap (for either option):
- Trigger: New record in Airtable `DLQ` view = Open, OR Schedule every [N] minutes + Find Records.
- Action 1: Delay For — `{{backoff}}` using an attempts‑based lookup (1m, 2m, 4m, 8m, 16m).
- Action 2: Webhooks by Zapier → perform the original write with the `Idempotency-Key: {{idem}}` header (or DIY dedupe via Storage by Zapier).
- Action 3: Airtable → Update record: increment `attempts`; set `status` to RESOLVED, or PARKED/ESCALATED on final failure.
- Optional: Webhooks → GET [HEALTHCHECKS_PING_URL] at start/finish for a heartbeat.
DIY dedupe with Storage by Zapier:
- Before the write, `Get Value` for key `idem:{{idem}}`.
- If it exists → Paths: skip the write.
- If missing → perform the write, then `Set Value` with TTL [IDEM_TTL_DAYS].
n8n — Retry on Fail + Error Trigger + reprocessor
Two artifacts: (1) Node‑level retry/backoff, (2) Global Error workflow.
A) Node‑level Retry on Fail (for HTTP 5xx/429):
- Open the HTTP node → Settings → `Retry on Fail`: ON.
- Set `Max Tries` = 5, `Wait Between Tries` = 60s. Add a `Wait` node before retries if you need custom backoff.
Jitter backoff expression (for a Function or Set node):
```js
const base = 60000;  // 60s
const max = 960000;  // 16m cap
const a = $json.attempts ?? 0; // from DLQ or node context
const delay = Math.min(max, Math.floor(base * Math.pow(2, a) * (1 + Math.random() * 0.3)));
return { delay };
```
B) Error workflow (fires on failure):
- Create a new workflow: `[WORKFLOW_NAME] — Error Handler`.
- Trigger: `Error Trigger` node.
- Steps:
  - Function → build `payload_redacted` (use the Redaction helper from this pack).
  - Airtable/Notion → upsert into `DLQ` with `workflow`, `client`, `error_code`, `error_message`, `attempts` = 1, `status` = NEW.
  - Slack/Email → send an alert with a link to the DLQ record.
Reprocessor (scheduled):
- Trigger: Cron `*/5 * * * *` (every 5 minutes).
- Airtable → list the `Open` view.
- For each item: `Wait` for `{{$json.delay}}` using the jitter formula above → attempt the write with an idempotency header → update `attempts`/`status`.
Simple circuit breaker:
- Use `Workflow Static Data` to store rolling failures per [UPSTREAM_SERVICE].
- If failures ≥ [CB_THRESHOLD] in 5 minutes → set DLQ status PARKED, send a status page update, and stop retries until a half‑open probe succeeds.
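A minimal rolling-window breaker looks like this (a closure stands in for Workflow Static Data; the 5-minute window and threshold of 10 are this pack's defaults):

```js
// Rolling-window circuit breaker: open once CB_THRESHOLD failures
// land inside the last WINDOW_MS milliseconds.
const WINDOW_MS = 5 * 60 * 1000;
const CB_THRESHOLD = 10;

function makeBreaker() {
  const failures = []; // timestamps of recent failures, oldest first
  return {
    recordFailure(now = Date.now()) { failures.push(now); },
    isOpen(now = Date.now()) {
      // drop failures that fell out of the rolling window
      while (failures.length && failures[0] < now - WINDOW_MS) failures.shift();
      return failures.length >= CB_THRESHOLD; // open → park and stop retrying
    },
  };
}
```

A half-open probe then amounts to letting one retry through and calling `recordFailure` only if it fails again.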
Notion Incident Log DB + RCA template
Create a Notion database [INCIDENTS_DB_NAME] as the internal source of truth; it is the single source for client comms and automations.
Properties:
- `incident_id` (Title): [CLIENT]-[WORKFLOW]-[YYYYMMDD]-[SEQ]
- `client` (Select)
- `workflow` (Text)
- `severity` (Select): S1 | S2 | S3 | S4
- `status` (Select): OPEN | MONITORING | RESOLVED
- `impact_summary` (Text, short, client‑safe)
- `start_time` (DateTime)
- `end_time` (DateTime)
- `downtime_minutes` (Formula): `dateBetween(end_time, start_time, "minutes")`
- `root_cause` (Rich text, internal)
- `fix` (Rich text, internal)
- `prevention` (Rich text, internal)
- `external_message` (Rich text, client‑safe)
- `internal_notes` (Rich text)
- `sla_breached` (Checkbox)
- `credit_amount` (Number, USD)
- `credit_type` (Select): CREDIT_NOTE | COUPON
- `stripe_invoice_id` (Text)
- `stripe_customer_id` (Text)
- `status_page_id` (Text)
RCA Page template (paste into a Notion template):
```
Incident [incident_id]
- Client: [CLIENT_NAME]
- Workflow: [WORKFLOW_NAME]
- Severity: [S1–S4]
- Status: [OPEN|MONITORING|RESOLVED]
- Start: [START_TIME]
- End: [END_TIME]
- Impact: [X% of runs failed | Y customers delayed]
Timeline
- [HH:MM] Detection
- [HH:MM] Mitigation started
- [HH:MM] Resolved
Root cause
[DETAIL]
Fix
[DETAIL]
Prevention
[CHECKLIST]
```
Status page API snippets (Better Stack / Instatus)
Better Stack — create/update incident:
- Endpoint: `POST https://uptime.betterstack.com/api/v3/incidents`
- Headers: `Authorization: Bearer [BETTERSTACK_API_KEY]`, `Content-Type: application/json`
- Body:
```json
{
"incident": {
"name": "[CLIENT] – [WORKFLOW] degraded",
"started_at": "[ISO8601_START]",
"status": "investigating",
"impact": "minor",
"message": "[EXTERNAL_MESSAGE]",
"components": ["[COMPONENT_NAME]"]
}
}
```
Instatus — create/update incident:
- Endpoint: `POST https://api.instatus.com/v1/[STATUS_PAGE_ID]/incidents`
- Headers: `Authorization: Bearer [INSTATUS_API_TOKEN]`, `Content-Type: application/json`
- Body:
```json
{
"title": "[CLIENT] – [WORKFLOW] degraded",
"message": "[EXTERNAL_MESSAGE]",
"severity": "minor",
"status": "investigating",
"components": ["[COMPONENT_ID]"]
}
```
Mapping from the Notion Incident to the status body:
- `name`/`title` ← `client` + `workflow` + state
- `message` ← `external_message`
- `started_at` ← `start_time`
- Close the incident when the Notion `status` becomes RESOLVED.
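That mapping can be sketched as one function (property names come from the Incident Log DB above; the `'resolved'`/`'investigating'` status strings and fixed `'minor'` impact are assumptions to verify against your status provider's accepted values):

```js
// Hedged sketch: turn a Notion incident record into a Better Stack-style
// incident body.
function toStatusBody(incident) {
  return {
    incident: {
      name: `${incident.client} – ${incident.workflow} degraded`,
      started_at: incident.start_time,
      status: incident.status === 'RESOLVED' ? 'resolved' : 'investigating',
      impact: 'minor', // assumption: default to minor; adjust per severity
      message: incident.external_message, // client-safe text only
    },
  };
}
```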
Heartbeats for reprocessors (Healthchecks.io / Uptime Kuma)
Add heartbeats to detect silent failures of your reprocessor/nightly jobs.
Healthchecks.io (recommended):
- Get a unique ping URL: [HEALTHCHECKS_PING_URL]
- On job start: `GET [HEALTHCHECKS_PING_URL]/start`
- On success: `GET [HEALTHCHECKS_PING_URL]`
- On failure: `GET [HEALTHCHECKS_PING_URL]/fail`
Uptime Kuma (self‑hosted):
- Push URL: `[UPTIME_KUMA_PUSH_URL]?status=up&msg=ok` (or `status=down&msg=error`)
- Ping at start/finish, same as Healthchecks.
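The start/success/fail pattern can be wrapped around any job like this (a sketch; `ping` is injectable so it stays testable offline, and the default `fetch` requires Node 18+):

```js
// Derive the three Healthchecks.io ping URLs from one check URL.
function heartbeatUrls(pingUrl) {
  return { start: `${pingUrl}/start`, success: pingUrl, fail: `${pingUrl}/fail` };
}

// Wrap a job with start/success/fail pings so silent failures surface.
async function withHeartbeat(pingUrl, job, ping = (url) => fetch(url)) {
  const urls = heartbeatUrls(pingUrl);
  await ping(urls.start); // signal: job started
  try {
    const result = await job();
    await ping(urls.success); // signal: finished cleanly
    return result;
  } catch (err) {
    await ping(urls.fail); // signal: explicit failure
    throw err;
  }
}
```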
SLA credits (Stripe Credit Note or Coupon)
Trigger: Notion Incident where `sla_breached` = true, severity ≥ S2, and `downtime_minutes` > [SLA_THRESHOLD_MINUTES].
Branch A — Post‑invoice adjustment with Credit Note:
- HTTP Request → `POST https://api.stripe.com/v1/credit_notes`
- Headers: `Authorization: Bearer [STRIPE_SECRET_KEY]`, `Idempotency-Key: [INCIDENT_ID]`
- Body (form‑encoded; Stripe's API does not accept JSON bodies):
```
invoice=[INVOICE_ID]
amount=[AMOUNT_IN_CENTS]
reason=service_not_as_described
memo=SLA credit for incident [INCIDENT_ID]
```
- Log `credit_amount`, `credit_type` = CREDIT_NOTE on the Notion record.
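Assembling that request in code looks roughly like this (a sketch: argument names are illustrative, sending is left to your HTTP module, and the `reason` value mirrors the body template above, so verify it against Stripe's allowed credit-note reason enum before use):

```js
// Build (but do not send) the Stripe credit-note request.
function creditNoteRequest(invoiceId, amountInCents, incidentId, apiKey) {
  return {
    url: 'https://api.stripe.com/v1/credit_notes',
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Idempotency-Key': incidentId, // makes retries safe
      'Content-Type': 'application/x-www-form-urlencoded', // Stripe expects form encoding
    },
    body: new URLSearchParams({
      invoice: invoiceId,
      amount: String(amountInCents),
      reason: 'service_not_as_described', // check against Stripe's reason enum
      memo: `SLA credit for incident ${incidentId}`,
    }).toString(),
  };
}
```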
Branch B — Proactive discount via Coupon/Promotion Code:
- Create coupon:
```
POST https://api.stripe.com/v1/coupons
percent_off=[PERCENT]
duration=once|repeating|forever
```
- Optionally create a promotion code and attach to customer.
- Log `credit_type` = COUPON and link the code in Notion.
Safety:
- Always include `Idempotency-Key` on Stripe writes.
- Rotate keys after critical incidents per your security SOP.
Security guardrails for logs and status
- Never log secrets or PII. Redact common fields (`password`, `token`, `authorization`, `cookie`, `ssn`, `card`).
- Keep `payload_redacted` only; do not store raw payloads in the DLQ or Notion.
- Restrict status page messages to impact and current state. No stack traces.
- Store provider keys in a secrets manager; rotate after S1 incidents.
- Notion API: target ≤ 3 requests/second average; batch updates in the reprocessor.
Packaging defaults (offer/SLO/escalation)
- Reliability Add‑On SKU: [SETUP_FEE_USD] one‑time + [MONTHLY_FEE_USD]/mo for monitoring/credits.
- Sample SLO: 99.5% monthly (≈ 3.65 hours of allowed downtime). No guarantees on upstream SaaS availability.
- Escalation rules: S1 pages you immediately; S2 email + auto status post; S3/S4 business hours.
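The SLO figure is simple arithmetic, assuming a 730-hour average month (the function is illustrative):

```js
// Allowed downtime for a given monthly SLO percentage.
function allowedDowntimeHours(sloPercent, hoursInMonth = 730) {
  return (1 - sloPercent / 100) * hoursInMonth;
}
// allowedDowntimeHours(99.5) ≈ 3.65 hours (~219 minutes) per month
```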
Plan limits and v1 defaults
- Zapier Autoreplay tries ~5× at 5m, 30m, 1h, 3h, 6h intervals.
- Zapier custom Error Handling disables Autoreplay on that Zap.
- Make Rollback/Commit apply to ACID‑capable modules only; others use Ignore/Break.
- Notion API averages ~3 rps per integration; add backoff.
- Use versioned comments or a `v` property in Notion/Airtable to track this pack's updates.

v1 assumptions you can change:
- [MAX_ATTEMPTS] = 5, backoff base 60s, jitter 0–30%.
- Circuit breaker threshold [CB_THRESHOLD] = 10 failures/5m per upstream.
- [SLA_THRESHOLD_MINUTES] = 30 for S2+.