Self‑Healing Automation Pack (Make/Zapier/n8n + Notion/Stripe) — Copy‑Ready Templates

Copy‑paste templates to make your automations self‑healing: DLQ schema, Make/Zapier/n8n error handlers + reprocessors, Notion incident/RCA, Better Stack/Instatus status publishing, Stripe credit automation, and security guardrails.

Use this pack to ship a three‑layer reliability pattern in under an hour: platform error handling, a dead‑letter queue (DLQ) with a reprocessor, and a lightweight status/comms loop. Fill in the [BRACKETS], paste the snippets into your tool of choice, and toggle the options that match your client’s risk profile. Version: v1 (April 2026).

Core DLQ schema (Airtable/Notion)

  • Goal: one normalized record per failed run, safe to retry later without double‑writing.
  • Storage: Airtable (suggested) or Notion.
  • Table name: DLQ (aka "Parking Table").

Columns (create exactly these):

  • id (Primary key, formula): IF({id_raw}, {id_raw}, RECORD_ID())
  • id_raw (Text, optional): Native failure/run id from platform.
  • workflow (Single line text): [WORKFLOW_NAME]
  • client (Single line text): [CLIENT_NAME]
  • severity (Single select): S1 | S2 | S3 | S4
  • status (Single select): NEW | RETRYING | PARKED | RESOLVED | ESCALATED
  • first_seen (Created time)
  • last_seen (Last modified time)
  • attempts (Number, integer, default 0)
  • error_code (Text)
  • error_message (Long text)
  • payload_redacted (Long text): never store secrets/PII.
  • reprocess_url (URL or text): optional pointer to re‑run endpoint.
  • notes_internal (Long text)
  • notified_at (DateTime, optional)
  • healthcheck_ping (URL, optional)
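
As a rough illustration of "one normalized record per failed run", the sketch below shapes a failure into a row using the column names above. The helper name `buildDlqRecord` and its input shape are hypothetical, and first_seen/last_seen are omitted because the table fills them in automatically.

```js
// Hypothetical helper: shape a failed run into one DLQ row.
// Field names follow the column list above; the input shape is an assumption.
const SEVERITIES = ['S1', 'S2', 'S3', 'S4'];

function buildDlqRecord({ runId, workflow, client, severity, error, payloadRedacted }) {
  if (!SEVERITIES.includes(severity)) {
    throw new Error(`Unknown severity: ${severity}`);
  }
  return {
    id_raw: runId || '',                 // native run id, if the platform provides one
    workflow,
    client,
    severity,
    status: 'NEW',                       // every record starts as NEW
    attempts: 0,                         // schema default
    error_code: String(error.code ?? ''),
    error_message: String(error.message ?? ''),
    // Store only the already-redacted payload, serialized as text.
    payload_redacted: JSON.stringify(payloadRedacted),
  };
}
```

Pass in a payload that has already been through your redaction step; this helper does not redact anything itself.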

Recommended view filters:

  • Open: status IN (NEW, PARKED) and attempts < [MAX_ATTEMPTS]
  • Stuck: attempts ≥ [MAX_ATTEMPTS]
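
If you filter in code rather than in a saved view, the two filters can be sketched as predicates. This assumes rows are plain objects with `status` and `attempts` fields; `MAX_ATTEMPTS` stands in for the [MAX_ATTEMPTS] placeholder.

```js
// The two view filters as predicates over DLQ rows (illustrative only).
const MAX_ATTEMPTS = 5; // stands in for [MAX_ATTEMPTS]

const isOpenRow = (r) => ['NEW', 'PARKED'].includes(r.status) && r.attempts < MAX_ATTEMPTS;
const isStuckRow = (r) => r.attempts >= MAX_ATTEMPTS;

const sampleRows = [
  { id: 'a', status: 'NEW', attempts: 0 },
  { id: 'b', status: 'PARKED', attempts: 5 },
  { id: 'c', status: 'RESOLVED', attempts: 2 },
];
const openIds = sampleRows.filter(isOpenRow).map((r) => r.id);   // ['a']
const stuckIds = sampleRows.filter(isStuckRow).map((r) => r.id); // ['b']
```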

Retry policy defaults:

  • [MAX_ATTEMPTS] = 5
  • Backoff with jitter: 1m, 2m, 4m, 8m, 16m, each lengthened by a random 0–30%

Idempotency + redaction mini‑template

Add one of these to every mutation step (API write, email send, CRM create) so retries are safe.

Pattern A — Request header supports idempotency (e.g., Stripe):

  • Key: Idempotency-Key
  • Value: [CLIENT_ID]-[WORKFLOW_NAME]-[SOURCE_RECORD_ID]-[ATTEMPT]
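
One way to assemble that header value is sketched below. The helper name and the character clean-up rule are assumptions made for illustration, not part of any provider's API.

```js
// Hypothetical helper: join the four parts from Pattern A into one header value.
function buildIdempotencyKey({ clientId, workflowName, sourceRecordId, attempt }) {
  const parts = [clientId, workflowName, sourceRecordId, attempt];
  if (parts.some((p) => p === undefined || p === null || p === '')) {
    throw new Error('All four key parts are required');
  }
  // Replace anything outside a conservative character set so the key stays header-safe.
  return parts.map((p) => String(p).replace(/[^A-Za-z0-9_.]/g, '_')).join('-');
}

const exampleHeaders = {
  'Idempotency-Key': buildIdempotencyKey({
    clientId: 'acme', workflowName: 'invoice sync', sourceRecordId: 'rec123', attempt: 1,
  }),
};
// exampleHeaders['Idempotency-Key'] === 'acme-invoice_sync-rec123-1'
```

Because the attempt number is part of the key, the provider sees each retry as a distinct request. If you want the provider itself to dedupe across retries, keep that part constant.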

Pattern B — DIY dedupe (no native idempotency):

  1. Build a stable hash from immutable fields.
  • Zapier Code step (JS):

```js
const crypto = require('crypto');
const key = `${inputData.client}|${inputData.workflow}|${inputData.sourceId}`;
return { idem: crypto.createHash('sha256').update(key).digest('hex') };
```

  2. Check your store (Storage by Zapier / Airtable / Notion) for idem.
  3. If it exists → skip; if not → write, then store the key with a TTL of [IDEM_TTL_DAYS] days.
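
The check-then-write steps above can be sketched with an in-memory store. `IdemStore` and `writeOnce` are hypothetical names, and the `Map` is only a stand-in for Storage by Zapier, Airtable, or Notion.

```js
// In-memory stand-in for the real key store (assumption for illustration).
class IdemStore {
  constructor(ttlDays, now = () => Date.now()) {
    this.ttlMs = ttlDays * 24 * 60 * 60 * 1000;
    this.now = now;
    this.seen = new Map(); // key -> expiry timestamp in ms
  }
  has(key) {
    const expiry = this.seen.get(key);
    if (expiry === undefined) return false;
    if (expiry <= this.now()) { this.seen.delete(key); return false; } // expired
    return true;
  }
  remember(key) { this.seen.set(key, this.now() + this.ttlMs); }
}

// Skip the write when the key is known; otherwise write first, then store the key.
function writeOnce(store, idem, write) {
  if (store.has(idem)) return { skipped: true };
  const result = write();
  store.remember(idem);
  return { skipped: false, result };
}
```

The check and the write are not atomic, so two runs that start at the same moment could both pass the check. If that matters for your workflow, serialize the reprocessor or rely on a provider-side idempotency key instead.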

Redaction helper (use before logging):
```js
function redact(obj) {
  // Keys containing any of these substrings are masked; output keys are lowercased.
  const S = ['password', 'token', 'secret', 'authorization', 'cookie', 'ssn', 'card'];
  const walk = (v) => {
    if (Array.isArray(v)) return v.map(walk);
    if (!v || typeof v !== 'object') return v;
    return Object.fromEntries(Object.entries(v).map(([k, val]) => {
      const key = k.toLowerCase();
      return [key, S.some((s) => key.includes(s)) ? '[REDACTED]' : walk(val)];
    }));
  };
  return walk(obj);
}
```

Make.com — error handler + DLQ feeder + reprocessor

Scenario outline:

  • Name: [WORKFLOW_NAME] — Self‑Healing
  • Modules: [TRIGGER] → [TRANSFORM] → [WRITE/API] (+ Error Handler)

Error handler branch on the [WRITE/API] module:

  • Directive: choose one per module
    • Rollback (use for ACID‑labeled modules; aborts transactional writes)
    • Ignore (skip failed module, continue main route)
  • Handler route steps:
    1. Tools > Function (JS) → build payload_redacted with the Redaction helper.
    2. Airtable > Create record in DLQ with fields:
      • workflow: [WORKFLOW_NAME]
      • client: [CLIENT_NAME]
      • error_code: {{error.code}}
      • error_message: {{error.message}}
      • payload_redacted: redacted JSON
      • attempts: {{1}}
      • status: NEW
    3. Slack > Post message to [SLACK_WEBHOOK_URL] with a short alert.
    4. Flow control > Break (optional) to halt; otherwise Continue.

Reprocessor (separate scenario, schedule every 5–10 minutes):

  • Trigger: Airtable DLQ view = Open
  • Iterator: each record
  • Tools > Sleep: delay_ms = min(960000, base * 2^attempts * (1 + rand(0..0.3))), with base = 60000 (60s)
  • Optional circuit breaker: If failures for [UPSTREAM_SERVICE] in last 5m ≥ [CB_THRESHOLD], set status PARKED and skip.
  • Try main [WRITE/API] with idempotency key.
  • On success: set status=RESOLVED.
  • On failure: increment attempts and update last_seen; if attempts ≥ [MAX_ATTEMPTS], set status=ESCALATED.
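
The success/failure bookkeeping in the last two bullets can be sketched as a pure function. `applyAttempt` is a hypothetical name; the sketch leaves the status unchanged on a non-final failure so the record stays in the Open view.

```js
// Sketch of the update the reprocessor applies to a DLQ record after one attempt.
function applyAttempt(record, succeeded, maxAttempts = 5, now = new Date()) {
  if (succeeded) {
    return { ...record, status: 'RESOLVED' };
  }
  const attempts = record.attempts + 1;
  return {
    ...record,
    attempts,
    last_seen: now.toISOString(),
    // Escalate once the retry budget is spent; otherwise keep the current status.
    status: attempts >= maxAttempts ? 'ESCALATED' : record.status,
  };
}
```

Returning a new object instead of mutating the input makes it easier to log the before and after state when a retry misbehaves.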

Heartbeat (optional but recommended):

  • HTTP > Make a request (GET) to [HEALTHCHECKS_PING_URL] on start and finish. Missing pings will alert you.

Notes:

  • Use Commit/Rollback only on ACID‑capable modules (labeled in Make). Others use Ignore/Break.
  • Keep Notion calls to a minimum; its API rate limit averages about 3 requests per second per integration.

Zapier — autoreplay vs. handler, plus DLQ + reprocessor

Choose one design per Zap (don’t combine — custom Error Handling changes replay behavior on that Zap):

Option A — Autoreplay for transient failures (simple):

  • In Zap Settings, enable Autoreplay. Zapier will retry up to 5 times on a backoff schedule (5m, 30m, 1h, 3h, 6h). If it still fails, handle manually or send to DLQ with a follow‑up Zap.

Option B — Custom Error Handling → DLQ (portable):

  • On the risky Step, open ••• → On failure → run handler path:
  1. Code by Zapier (JS) → build idem and payload_redacted.
  2. Airtable → Create record in DLQ with fields from the Core schema.
  3. Slack/Email → alert.
  • Note: Enabling custom Error Handling on a Zap disables that Zap’s Autoreplay.

Reprocessor Zap (for either option):

  • Trigger: New record in Airtable DLQ View = Open OR Scheduled every [N] minutes with Find records.
  • Action 1: Delay For — {{backoff}} using attempts‑based lookup (1m, 2m, 4m, 8m, 16m).
  • Action 2: Webhooks by Zapier → perform the original write with Idempotency-Key: {{idem}} header (or DIY dedupe via Storage by Zapier).
  • Action 3: Airtable → Update record: increment attempts, set status to RESOLVED or PARKED/ESCALATED on final failure.
  • Optional: Webhooks → GET [HEALTHCHECKS_PING_URL] at start/finish for heartbeat.
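
The attempts-based lookup for the Delay For step could be produced by a small Code step along these lines. The function name is hypothetical, and the sketch assumes the attempts value may arrive as a string.

```js
// Attempts-based lookup for the Delay For step (minutes, no jitter).
const BACKOFF_MINUTES = [1, 2, 4, 8, 16];

function backoffMinutes(attempts) {
  const parsed = Number(attempts);
  const n = Number.isFinite(parsed) ? Math.max(0, Math.floor(parsed)) : 0;
  // Clamp to the last entry so extra attempts never wait longer than 16 minutes.
  return BACKOFF_MINUTES[Math.min(n, BACKOFF_MINUTES.length - 1)];
}
```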

DIY dedupe with Storage by Zapier:

  • Before the write, Get Value for key idem:{{idem}}.
  • If exists → Paths: Skip write.
  • If missing → perform write, then Set Value with TTL [IDEM_TTL_DAYS].

n8n — Retry on Fail + Error Trigger + reprocessor

Two artifacts: (1) Node‑level retry/backoff, (2) Global Error workflow.

A) Node‑level Retry on Fail (for HTTP 5xx/429):

  • Open the HTTP node → Settings → Retry on Fail: ON.
  • Set Max tries = 5, Wait = 60s. Add a Wait node before retries if you need custom backoff.

Jitter backoff expression (for a Function or Set node):
```js
const base=60000; // 60s
const max=960000; // 16m
const a=$json.attempts ?? 0; // from DLQ or node context
const delay = Math.min(max, Math.floor(base * Math.pow(2, a) * (1 + Math.random() * 0.3)));
return { delay };
```

B) Error workflow (fires on failure):

  • Create new workflow: [WORKFLOW_NAME] — Error Handler.
  • Trigger: Error Trigger node.
  • Steps:
  1. Function → build payload_redacted (use Redaction helper from this pack).
  2. Airtable/Notion → Upsert into DLQ with workflow, client, error_code, error_message, attempts=1, status=NEW.
  3. Slack/Email → send alert with link to DLQ record.

Reprocessor (scheduled):

  • Trigger: Cron */5 * * * * (every 5 minutes).
  • Airtable → List Open.
  • For each item: Wait for {{$json.delay}} using the jitter formula above → attempt write with idempotency header → update attempts/status.

Simple circuit breaker:

  • Use Workflow Static Data to store rolling failures per [UPSTREAM_SERVICE].
  • If failures ≥ [CB_THRESHOLD] in 5 minutes → set DLQ status PARKED, send status page update, and stop retries until a half‑open probe succeeds.
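
A minimal sketch of the rolling-failure count behind this breaker follows. The `state` argument is a plain object here; in n8n it could be the object returned by `$getWorkflowStaticData('global')`. The function names are hypothetical, and the half-open probe is left out.

```js
// Rolling-window failure counter for a simple circuit breaker (illustrative).
const WINDOW_MS = 5 * 60 * 1000; // 5 minutes

// Keep only failure timestamps that are still inside the window.
function recentFailures(state, service, now) {
  return (state[service] || []).filter((t) => now - t < WINDOW_MS);
}

function recordFailure(state, service, now = Date.now()) {
  const recent = recentFailures(state, service, now);
  recent.push(now);
  state[service] = recent; // prune old entries as a side effect
}

// True when the breaker should trip: park records and stop retrying this upstream.
function breakerOpen(state, service, threshold, now = Date.now()) {
  return recentFailures(state, service, now).length >= threshold;
}
```

Call `recordFailure` whenever a write to the upstream fails, and check `breakerOpen` before each retry.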

Notion Incident Log DB + RCA template

Create a Notion database [INCIDENTS_DB_NAME] for internal truth; this is the single source for client comms and automations.

Properties:

  • incident_id (Title): [CLIENT]-[WORKFLOW]-[YYYYMMDD]-[SEQ]
  • client (Select)
  • workflow (Text)
  • severity (Select): S1 | S2 | S3 | S4
  • status (Select): OPEN | MONITORING | RESOLVED
  • impact_summary (Text, short, client‑safe)
  • start_time (DateTime)
  • end_time (DateTime)
  • downtime_minutes (Formula): dateBetween(end_time, start_time, "minutes")
  • root_cause (Rich text, internal)
  • fix (Rich text, internal)
  • prevention (Rich text, internal)
  • external_message (Rich text, client‑safe)
  • internal_notes (Rich text)
  • sla_breached (Checkbox)
  • credit_amount (Number, USD)
  • credit_type (Select): CREDIT_NOTE | COUPON
  • stripe_invoice_id (Text)
  • stripe_customer_id (Text)
  • status_page_id (Text)
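
If an automation creates these rows, the incident_id pattern and the downtime formula can be mirrored in code roughly as follows. The helper names, the UTC date, and the two-digit zero-padded sequence are assumptions.

```js
// Hypothetical helpers mirroring the incident_id pattern and the downtime formula.
function incidentId(client, workflow, start, seq) {
  const ymd = new Date(start).toISOString().slice(0, 10).replace(/-/g, ''); // UTC date as YYYYMMDD
  return `${client}-${workflow}-${ymd}-${String(seq).padStart(2, '0')}`;
}

function downtimeMinutes(startTime, endTime) {
  // Whole minutes between start and end.
  return Math.floor((new Date(endTime) - new Date(startTime)) / 60000);
}

const exampleId = incidentId('ACME', 'invoice-sync', '2026-04-03T10:15:00Z', 1);
// exampleId === 'ACME-invoice-sync-20260403-01'
const exampleDowntime = downtimeMinutes('2026-04-03T10:15:00Z', '2026-04-03T11:00:30Z');
// exampleDowntime === 45
```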

RCA Page template (paste into a Notion template):
```

Incident [incident_id]

  • Client: [CLIENT_NAME]
  • Workflow: [WORKFLOW_NAME]
  • Severity: [S1–S4]
  • Status: [OPEN|MONITORING|RESOLVED]
  • Start: [START_TIME]
  • End: [END_TIME]
  • Impact: [X% of runs failed | Y customers delayed]

Timeline

  • [HH:MM] Detection
  • [HH:MM] Mitigation started
  • [HH:MM] Resolved

Root cause

[DETAIL]

Fix

[DETAIL]

Prevention

[CHECKLIST]
```

Status page API snippets (Better Stack / Instatus)

Better Stack — create/update incident:

  • Endpoint: POST https://uptime.betterstack.com/api/v3/incidents
  • Headers: Authorization: Bearer [BETTERSTACK_API_KEY], Content-Type: application/json
  • Body:

```json
{
"incident": {
"name": "[CLIENT] – [WORKFLOW] degraded",
"started_at": "[ISO8601_START]",
"status": "investigating",
"impact": "minor",
"message": "[EXTERNAL_MESSAGE]",
"components": ["[COMPONENT_NAME]"]
}
}
```

Instatus — create/update incident:

  • Endpoint: POST https://api.instatus.com/v1/[STATUS_PAGE_ID]/incidents
  • Headers: Authorization: Bearer [INSTATUS_API_TOKEN], Content-Type: application/json
  • Body:

```json
{
"title": "[CLIENT] – [WORKFLOW] degraded",
"message": "[EXTERNAL_MESSAGE]",
"severity": "minor",
"status": "investigating",
"components": ["[COMPONENT_ID]"]
}
```

Mapping from Notion Incident to status body:

  • name/title ← [client] + workflow + state
  • message ← external_message
  • started_at ← start_time
  • Close the incident when Notion status becomes RESOLVED.
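
The mapping can be sketched as two small functions that turn a Notion incident row into the request bodies shown in this section. The field names copy this pack's snippets and should be checked against each provider's current API reference; `incident.state` (for example "degraded") is a hypothetical field.

```js
// Map a Notion incident row to the two status-page bodies used in this pack.
// Body field names copy the snippets above; they are not verified against either API.
function statusTitle(incident) {
  return `${incident.client} – ${incident.workflow} ${incident.state}`;
}

function toBetterStackBody(incident, componentName) {
  return {
    incident: {
      name: statusTitle(incident),
      started_at: incident.start_time,
      status: 'investigating',
      impact: 'minor',
      message: incident.external_message, // client-safe text only
      components: [componentName],
    },
  };
}

function toInstatusBody(incident, componentId) {
  return {
    title: statusTitle(incident),
    message: incident.external_message,
    severity: 'minor',
    status: 'investigating',
    components: [componentId],
  };
}
```

Only external_message is passed through, so root_cause and internal_notes never reach the status page.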

Heartbeats for reprocessors (Healthchecks.io / Uptime Kuma)

Add heartbeats to detect silent failures of your reprocessor/nightly jobs.

Healthchecks.io (recommended):

  • Get unique ping URL: [HEALTHCHECKS_PING_URL]
  • On job start: GET [HEALTHCHECKS_PING_URL]/start
  • On success: GET [HEALTHCHECKS_PING_URL]
  • On failure: GET [HEALTHCHECKS_PING_URL]/fail
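
The start/success/fail sequence can be wrapped around any job, as in this sketch. `withHeartbeat` is a hypothetical name, and `httpGet` is injected (for example `(url) => fetch(url)`) so the sketch does not depend on a particular HTTP client.

```js
// Wrap a job with start / success / fail pings. Ping errors are swallowed
// so a monitoring outage never breaks the job itself.
async function withHeartbeat(pingUrl, job, httpGet) {
  const ping = (suffix) =>
    Promise.resolve().then(() => httpGet(pingUrl + suffix)).catch(() => {});

  await ping('/start');
  let result;
  try {
    result = await job();
  } catch (err) {
    await ping('/fail');
    throw err; // let the caller's own error handling run
  }
  await ping(''); // bare URL signals success
  return result;
}
```

A call might look like `withHeartbeat(pingUrl, runReprocessor, (url) => fetch(url))`, where `runReprocessor` is your own async job function.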

Uptime Kuma (self‑hosted):

  • Push URL: [UPTIME_KUMA_PUSH_URL]?status=up&msg=ok (or status=down&msg=error)
  • Ping at start/finish similar to Healthchecks.

SLA credits (Stripe Credit Note or Coupon)

Trigger: Notion Incident where sla_breached = true, severity is S2 or more severe (S1 or S2), and downtime_minutes > [SLA_THRESHOLD_MINUTES].
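
The trigger condition can be sketched as a predicate, assuming S1 is the most severe level and that the incident object carries the Notion properties under their underscore names.

```js
// Sketch of the credit trigger. Lower rank means more severe (assumption: S1 is worst).
const SEVERITY_RANK = { S1: 1, S2: 2, S3: 3, S4: 4 };

function qualifiesForCredit(incident, slaThresholdMinutes = 30) {
  return incident.sla_breached === true
    && SEVERITY_RANK[incident.severity] <= SEVERITY_RANK.S2 // S1 or S2 only
    && incident.downtime_minutes > slaThresholdMinutes;
}
```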

Branch A — Post‑invoice adjustment with Credit Note:

  • HTTP Request → POST https://api.stripe.com/v1/credit_notes
  • Headers: Authorization: Bearer [STRIPE_SECRET_KEY], Idempotency-Key: [INCIDENT_ID]
  • Body (form‑encoded; Stripe's v1 API does not accept JSON request bodies):

```
invoice=[INVOICE_ID]
amount=[AMOUNT_IN_CENTS]
reason=product_unsatisfactory
memo=SLA credit for incident [INCIDENT_ID]
```

  • Log credit_amount, credit_type=CREDIT_NOTE on the Notion record.
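
A sketch of assembling the Branch A request in code follows. `creditNoteRequest` is a hypothetical helper that only builds the request; the optional reason field is left out, so add it if you need it.

```js
// Build (but do not send) the credit note request described in Branch A.
function creditNoteRequest({ invoiceId, amountUsd, incidentId, secretKey }) {
  const body = new URLSearchParams({
    invoice: invoiceId,
    amount: String(Math.round(amountUsd * 100)), // USD amount converted to cents
    memo: `SLA credit for incident ${incidentId}`,
  });
  return {
    url: 'https://api.stripe.com/v1/credit_notes',
    method: 'POST',
    headers: {
      Authorization: `Bearer ${secretKey}`,
      'Idempotency-Key': incidentId, // one credit per incident, safe to retry
      'Content-Type': 'application/x-www-form-urlencoded',
    },
    body: body.toString(),
  };
}
```

Reusing the incident ID as the idempotency key means a reprocessed incident cannot issue a second credit by accident.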

Branch B — Proactive discount via Coupon/Promotion Code:

  • Create coupon:

```
POST https://api.stripe.com/v1/coupons
percent_off=[PERCENT]
duration=once|repeating|forever
```

  • Optionally create a promotion code and attach to customer.
  • Log credit_type=COUPON and link code in Notion.

Safety:

  • Always include Idempotency-Key on Stripe writes.
  • Rotate keys after critical incidents per your security SOP.

Security guardrails for logs and status

  • Never log secrets or PII. Redact common fields (password, token, authorization, cookie, ssn, card).
  • Keep payload_redacted only. Do not store raw payloads in DLQ or Notion.
  • Restrict status page messages to impact and current state. No stack traces.
  • Store provider keys in a secrets manager; rotate after S1 incidents.
  • Notion API: target ≤ 3 requests/second average; batch updates in the reprocessor.

Packaging defaults (offer/SLO/escalation)

  • Reliability Add‑On SKU: [SETUPFEEUSD] one‑time + [MONTHLYFEEUSD]/mo for monitoring/credits.
  • Sample SLO: 99.5% monthly (≈ 3.65 hours downtime). No guarantees on upstream SaaS availability.
  • Escalation rules: S1: page you immediately; S2: email + auto status post; S3/S4: handle during business hours.
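
The 3.65-hour figure in the sample SLO follows from simple arithmetic, sketched here so you can plug in other targets. The 730-hour month (8760 hours / 12) is an assumption; a 30-day month gives 3.6 hours instead.

```js
// Allowed downtime for an availability target over a period.
function downtimeBudgetHours(sloPercent, periodHours = 730) { // 730 h is an average month
  return periodHours * (1 - sloPercent / 100);
}

const budget = downtimeBudgetHours(99.5); // about 3.65 hours
```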

Plan limits and v1 defaults

  • Zapier Autoreplay retries up to 5 times, at 5m, 30m, 1h, 3h, and 6h intervals.
  • Zapier custom Error Handling disables Autoreplay on that Zap.
  • Make Rollback/Commit apply to ACID‑capable modules only; others use Ignore/Break.
  • Notion API rate limit averages about 3 requests per second per integration; add backoff.
  • Use versioned comments or a v property in Notion/Airtable to track this pack’s updates.

v1 assumptions you can change:

  • [MAX_ATTEMPTS]=5, backoff base 60s, jitter 0–30%.
  • Circuit breaker threshold [CB_THRESHOLD]=10 failures/5m per upstream.
  • [SLA_THRESHOLD_MINUTES]=30 for S2+ (S1 and S2).