Starter Redaction Pack: Presidio + Secrets Regex + Redaction Log Schema (Make/n8n‑Ready)
Copy‑ready configs and subflows to detect→mask→log PII and secrets before any external call. Includes Presidio docker + recognizers, a secrets regex library with tests, Make/n8n subflows, and an auditable Redaction Log schema with dashboard tiles.
Drop this pack in front of any external API/LLM call to detect and mask PII/secrets, then write a single annotated Redaction Log per run. Pick a detector (local Presidio or OpenAI Privacy Filter), wire the subflow, and ship with audit‑ready logs. Replace anything in [BRACKETS] for your stack.
Quick‑start map [copy/paste + fill]
1) Choose your detector: [presidio|openai_privacy_filter|cloud_service].
2) Set your log sink: [BigQuery|Postgres|S3|Logflare].
3) Wire the subflow before every external call and before traces/metrics.
4) Run the tests, then flip to production with fail‑closed routing.
Config map you’ll reuse across sections:
- [WORKSPACE_NAME]: Project/workspace label for logs.
- [RUN_ID]: Unique id per request/task.
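- [PRESIDIO_URL]: e.g., http://presidio-analyzer:3000. The Anonymizer runs as a separate service, so use its own base URL for /anonymize calls unless a reverse proxy fronts both.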
- [PRIVACY_FILTER_URL]: If hosting OpenAI Privacy Filter locally.
- [ANON_SALT]: Secret salt for hashing/pseudonymization.
- [LOG_DESTINATION]: e.g., BigQuery dataset.table or S3 bucket/key prefix.
- [ALERT_CHANNEL]: Email/Slack/Webhook for fail‑closed alerts.
Presidio: docker + tuned recognizers + operator map
Use the local Presidio stack for zero data egress and predictable cost. This sample stands up Analyzer + Anonymizer and adds tuned recognizers for common secrets.
# docker-compose.yaml (minimal)
version: '3.9'
services:
  presidio-analyzer:
    image: mcr.microsoft.com/presidio-analyzer:latest
    environment:
      - PYTHONIOENCODING=utf-8
    ports: ['3000:3000']
  presidio-anonymizer:
    image: mcr.microsoft.com/presidio-anonymizer:latest
    environment:
      - ANON_SALT=[ANON_SALT]
    # Host port 3001 maps to the container's default port 3000.
    ports: ['3001:3000']
Custom recognizers (augment built-ins like EMAIL_ADDRESS, PHONE_NUMBER, US_SSN):
# recognizers.yaml (Presidio pattern-based)
recognizers:
  - name: 'OPENAI_API_KEY'
    supported_entity: 'OPENAI_API_KEY'
    supported_language: 'en'
    patterns:
      - name: 'sk_prefix'
        regex: 'sk-[A-Za-z0-9]{32,48}'
        score: 0.75
    context: ['openai', 'api', 'key', 'secret']
  - name: 'GITHUB_TOKEN'
    supported_entity: 'GITHUB_TOKEN'
    supported_language: 'en'
    patterns:
      - name: 'ghp_token'
        regex: 'ghp_[A-Za-z0-9]{36}'
        score: 0.75
    context: ['github', 'token']
  - name: 'AWS_ACCESS_KEY_ID'
    supported_entity: 'AWS_ACCESS_KEY_ID'
    supported_language: 'en'
    patterns:
      - name: 'akia_prefix'
        regex: '(AKIA|ASIA)[0-9A-Z]{16}'
        score: 0.7
    context: ['aws', 'access', 'key', 'id']
  - name: 'AWS_SECRET_ACCESS_KEY'
    supported_entity: 'AWS_SECRET_ACCESS_KEY'
    supported_language: 'en'
    patterns:
      - name: '40char_secret'
        regex: '(?i)(aws_?secret_?access_?key)\s*[:=]\s*[A-Za-z0-9/+=]{40}'
        score: 0.8
    context: ['aws', 'secret']
  - name: 'IBAN_CODE'
    supported_entity: 'IBAN_CODE'
    supported_language: 'en'
    patterns:
      - name: 'iban_core'
        regex: '\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b'
        score: 0.65
    context: ['iban', 'bank', 'account']
  - name: 'JWT_TOKEN'
    supported_entity: 'JWT_TOKEN'
    supported_language: 'en'
    patterns:
      - name: 'jwt_like'
        regex: '\beyJ[\w-]*\.[\w-]*\.[\w-]*\b'
        score: 0.7
    context: ['jwt', 'bearer', 'authorization']
Anonymizer operator map (entity → action). Choose mask/hash/replace to balance utility vs privacy.
# anonymizer_map.yaml
# Presidio's built-in mask operator expects masking_char, chars_to_mask (an integer),
# and from_end; translate 'all' and unmasked_end_chars before sending this map to /anonymize.
operators:
  EMAIL_ADDRESS: { type: 'mask', masking_char: '*', chars_to_mask: 'all' }
  PHONE_NUMBER: { type: 'mask', masking_char: '•', unmasked_end_chars: 2 }
  US_SSN: { type: 'mask', masking_char: 'X', unmasked_end_chars: 4 }
  IBAN_CODE: { type: 'hash', hash_type: 'sha256', key: '[ANON_SALT]' }
  OPENAI_API_KEY: { type: 'replace', new_value: '[API_KEY_MASKED]' }
  GITHUB_TOKEN: { type: 'replace', new_value: '[GITHUB_TOKEN_MASKED]' }
  AWS_ACCESS_KEY_ID: { type: 'replace', new_value: '[AWS_ACCESS_KEY_ID_MASKED]' }
  AWS_SECRET_ACCESS_KEY: { type: 'replace', new_value: '[AWS_SECRET_ACCESS_KEY_MASKED]' }
  JWT_TOKEN: { type: 'replace', new_value: '[JWT_MASKED]' }
  DEFAULT: { type: 'mask', masking_char: '*', chars_to_mask: 'all' }
Minimal call contract (HTTP):
POST [PRESIDIO_URL]/analyze
body: { "text": "[RAW_TEXT]", "language": "en", "entities": ["EMAIL_ADDRESS","PHONE_NUMBER","US_SSN","IBAN_CODE","OPENAI_API_KEY","GITHUB_TOKEN","AWS_ACCESS_KEY_ID","AWS_SECRET_ACCESS_KEY","JWT_TOKEN"] }
POST [PRESIDIO_URL]/anonymize
body: { "text": "[RAW_TEXT]", "analyzer_results": [..from analyze..], "anonymizers": { ..from anonymizer_map.. } }
Notes:
- Keep [ANON_SALT] secret; rotate it per environment.
- Add domain dictionaries (e.g., known client IDs) via deny/allow lists to cut false positives.
- Image/PDF pipelines: Presidio also ships an image redactor with DICOM support; run it at ingest, then apply text redaction here.
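The call contract above can be sketched as a small Python client. This is a minimal sketch, not a definitive integration: the function names (build_analyze_payload, build_anonymize_payload, post_json, redact) are hypothetical, the two base URLs are passed in separately on the assumption that Analyzer and Anonymizer run as separate services, and the operator map is sent in the anonymizers field on the assumption that you call Presidio's stock REST API.

```python
import json
from urllib import request

# Subset of the entity list used in this pack; extend as needed.
ENTITIES = ["EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"]


def build_analyze_payload(text, entities=ENTITIES, language="en"):
    """Body for POST /analyze."""
    return {"text": text, "language": language, "entities": list(entities)}


def build_anonymize_payload(text, analyzer_results, operators):
    """Body for POST /anonymize; `operators` is the entity -> action map."""
    return {
        "text": text,
        "analyzer_results": analyzer_results,
        "anonymizers": operators,
    }


def post_json(url, payload, timeout=5.0):
    """POST a JSON body and decode the JSON response (raises on errors/timeouts)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def redact(text, analyzer_url, anonymizer_url, operators):
    """Detect, then anonymize; returns (masked_text, analyzer_results)."""
    results = post_json(f"{analyzer_url}/analyze", build_analyze_payload(text))
    out = post_json(
        f"{anonymizer_url}/anonymize",
        build_anonymize_payload(text, results, operators),
    )
    return out["text"], results
```

Let exceptions from redact propagate to the caller so the fail‑closed branch can catch them; swallowing a timeout here would silently forward raw text.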
Regex fallbacks + tests [drop‑in]
Use these when the detector is offline or for simple pre‑filters. The harness below tests a sample of the patterns; add a test string for every pattern you enable. Keep the patterns conservative to avoid masking too much.
# secrets_piiregex.yaml
EMAIL: '(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b'
PHONE_E164: '\+?[1-9]\d{8,14}'
US_SSN: '\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b'
IBAN_SIMPLE: '\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b'
OPENAI_KEY: 'sk-[A-Za-z0-9]{32,48}'
GITHUB_TOKEN: 'ghp_[A-Za-z0-9]{36}'
AWS_AKID: '(AKIA|ASIA)[0-9A-Z]{16}'
AWS_SAK: '(?i)(aws_?secret_?access_?key)\s*[:=]\s*[A-Za-z0-9/+=]{40}'
JWT: '\beyJ[\w-]*\.[\w-]*\.[\w-]*\b'
Tiny test harness (Python + pytest):
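# test_redaction_regex.py
import re

import pytest
import yaml

with open('secrets_piiregex.yaml') as f:
    rx = yaml.safe_load(f)

# One positive sample per pattern; extend with the remaining labels and with negatives.
CASES = {
    'EMAIL': 'Contact me at a.ops+dev@example.io today',
    'OPENAI_KEY': 'token sk-1234567890ABCDEFGHIJKLMNOPQRSTUV',
    'AWS_AKID': 'env AKIAABCDEFGHIJKLMNOP',
    'US_SSN': 'holder 123-45-6789',
}

@pytest.mark.parametrize('label,sample', list(CASES.items()))
def test_pattern_matches(label, sample):
    # No global re.I: patterns that need it carry an inline (?i).
    assert re.search(rx[label], sample)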
Fallback masking (Python):
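# naive_mask.py
import re
import yaml

with open('secrets_piiregex.yaml') as f:
    rx = yaml.safe_load(f)
with open('[INPUT_FILE]') as f:
    text = f.read()

# Secrets first, so a broad PII pattern (e.g., PHONE_E164) cannot split a key.
SECRETS = ['AWS_SAK', 'OPENAI_KEY', 'GITHUB_TOKEN', 'AWS_AKID', 'JWT']
for label in SECRETS + [k for k in rx if k not in SECRETS]:
    # No global re.I: patterns that need it carry an inline (?i).
    text = re.sub(rx[label], f'[{label}_MASKED]', text)

with open('[OUTPUT_FILE]', 'w') as f:
    f.write(text)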
Tip: run regex pre‑filters before model detection to trim obvious secrets and reduce false positives downstream.
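To feed the Redaction Log from the regex pre‑filter, it helps to return counts alongside the masked text. A minimal sketch, assuming an in‑memory subset of the patterns above; the prefilter function and the PATTERNS dict are illustrative names, not part of any library.

```python
import re

# Illustrative subset of secrets_piiregex.yaml; load the full file in practice.
PATTERNS = {
    "OPENAI_KEY": r"sk-[A-Za-z0-9]{32,48}",
    "AWS_AKID": r"(AKIA|ASIA)[0-9A-Z]{16}",
    "US_SSN": r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b",
}


def prefilter(text, patterns=PATTERNS):
    """Mask every pattern hit and return (masked_text, entity_counts)."""
    counts = {}
    for label, pattern in patterns.items():
        # re.subn returns the new string and the number of replacements made.
        text, hits = re.subn(pattern, f"[{label}_MASKED]", text)
        if hits:
            counts[label] = hits
    # Computed before "total" is added, so it sums per-label hits only.
    counts["total"] = sum(counts.values())
    return text, counts


masked, counts = prefilter("env AKIAABCDEFGHIJKLMNOP and SSN 123-45-6789")
print(masked)   # env [AWS_AKID_MASKED] and SSN [US_SSN_MASKED]
print(counts)   # {'AWS_AKID': 1, 'US_SSN': 1, 'total': 2}
```

The counts dict has the same shape as entity_counts in the log schema, so the pre‑filter and the detector can be merged by summing per label.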
Make subflow [detect→mask→annotate]
This template routes: Detect → Anonymize → Emit Redaction Log → Route. Duplicate the subflow before every external call and before trace/log export.
Node list and settings:
- HTTP: Detect
- Name: 'Detect PII (Presidio)'
- URL: [PRESIDIO_URL]/analyze
- Method: POST
- Body (JSON):
{ "text": "{{1.input_text}}", "language": "en", "entities": ["EMAIL_ADDRESS","PHONE_NUMBER","US_SSN","IBAN_CODE","OPENAI_API_KEY","GITHUB_TOKEN","AWS_ACCESS_KEY_ID","AWS_SECRET_ACCESS_KEY","JWT_TOKEN"] }
- HTTP: Anonymize
- Name: 'Anonymize (Presidio)'
- URL: [PRESIDIO_URL]/anonymize
- Method: POST
- Body (JSON):
{ "text": "{{1.input_text}}", "analyzer_results": {{1.body}}, "anonymizers": {{2.anonymizer_map}} }
- JSON: Build Redaction Log
- Name: 'Redaction Log'
- Template object: see "Redaction Log schema" section; fill [RUN_ID],[ENV],[WORKSPACE_NAME].
- Data store/HTTP: Write Log
- Destination: [LOG_DESTINATION]
- Router: Decisions
- If detection/anonymize error → 'Fail‑Closed' path: Queue payload + POST to [ALERT_CHANNEL].
- Else if entity_counts.total > 0 → Proceed with {{masked_text}}.
- Else → Pass original input.
Export/Import tip:
- Save this as a Make subflow and call it as the first module inside any scenario making external calls. Keep [ANON_SALT] and [PRESIDIO_URL] in Make variables per environment.
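The Router decisions above reduce to a small pure function, which is handy for unit‑testing the branch logic outside Make. A sketch under stated assumptions: route and select_payload are hypothetical helper names, and the decision labels are borrowed from the Redaction Log schema.

```python
def route(entity_counts, error=None):
    """Return the decision label for one run."""
    if error is not None:
        # Any detect/anonymize failure blocks the external call.
        return "fail_closed"
    if entity_counts.get("total", 0) > 0:
        return "input_masked"
    return "pass_through"


def select_payload(decision, raw_text, masked_text):
    """Pick what may leave the system; None means nothing is forwarded."""
    if decision == "fail_closed":
        return None
    return masked_text if decision == "input_masked" else raw_text


decision = route({"EMAIL_ADDRESS": 1, "total": 1})
print(decision, select_payload(decision, "a@b.io", "******"))
```

Checking the error case first keeps the branch fail‑closed even when a partial detection result is present.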
n8n subflow [import JSON + set env]
Import this minimal workflow, connect credentials, and set environment variables. Place the Subworkflow node before any external request or trace export.
{
"name": "Redaction Subflow",
"nodes": [
{
"id": "DetectPII",
"name": "Detect PII (Presidio)",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"url": "[PRESIDIO_URL]/analyze",
"method": "POST",
"jsonParameters": true,
"options": {},
"bodyParametersJson": "={\n \"text\": {{ JSON.stringify($json.input_text) }},\n \"language\": \"en\",\n \"entities\": [\"EMAIL_ADDRESS\",\"PHONE_NUMBER\",\"US_SSN\",\"IBAN_CODE\",\"OPENAI_API_KEY\",\"GITHUB_TOKEN\",\"AWS_ACCESS_KEY_ID\",\"AWS_SECRET_ACCESS_KEY\",\"JWT_TOKEN\"]\n}"
}
},
{
"id": "Anonymize",
"name": "Anonymize (Presidio)",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"url": "[PRESIDIO_URL]/anonymize",
"method": "POST",
"jsonParameters": true,
"bodyParametersJson": "={\n \"text\": {{ JSON.stringify($json.input_text) }},\n \"analyzer_results\": {{ JSON.stringify($('DetectPII').item.json.body) }},\n \"anonymizers\": {{ JSON.stringify($json.anonymizer_map) }}\n}"
}
},
{
"id": "BuildLog",
"name": "Build Redaction Log",
"type": "n8n-nodes-base.function",
"parameters": {
"functionCode": "const det = $items('DetectPII')[0].json.body || [];\nconst masked = $items('Anonymize')[0].json.text || '';\nconst counts = det.reduce((m,e)=>{m[e.entity_type]=(m[e.entity_type]||0)+1; m.total=(m.total||0)+1; return m;},{});\nreturn [{ json: {\n run_id: $json.RUN_ID, env: '[ENV]', workspace: '[WORKSPACE_NAME]',\n detector: { name: 'presidio', version: '[DET_VERSION]' },\n entity_counts: counts,\n sample_masked_spans: det.slice(0,5).map(e=>({ entity_type: e.entity_type, start: e.start, end: e.end })),\n decision: counts.total>0 ? 'input_masked' : 'pass_through',\n input_bytes: Buffer.from($json.input_text||'').length,\n output_bytes: Buffer.from(masked||'').length,\n latency_ms: { detect: 0, anonymize: 0 },\n ts: new Date().toISOString()\n}, paired: { masked_text: masked } }];"
}
},
{
"id": "WriteLog",
"name": "Write Log",
"type": "n8n-nodes-base.httpRequest",
"parameters": { "url": "[LOG_ENDPOINT]", "method": "POST", "jsonParameters": true, "bodyParametersJson": "={{$json}}" }
},
{
"id": "Route",
"name": "Router",
"type": "n8n-nodes-base.switch",
"parameters": { "property": "={{$json.entity_counts.total || 0}}", "rules": { "rules": [ { "operation": "larger", "value": 0 } ] } }
}
],
"connections": { "DetectPII": { "main": [[{"node":"Anonymize","type":"main","index":0}]] }, "Anonymize": { "main": [[{"node":"BuildLog","type":"main","index":0}]] }, "BuildLog": { "main": [[{"node":"WriteLog","type":"main","index":0}, {"node":"Route","type":"main","index":0}]] } }
}
Notes:
- Replace [LOG_ENDPOINT] with your sink (HTTP collector, webhook, etc.).
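- Add an Error Trigger node to route exceptions to [ALERT_CHANNEL] and store the raw payload in a quarantine queue.
- n8n resolves "connections" and $('…')/$items('…') references by node name, so after import make each node's name match the strings used in the expressions and in "connections".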
Redaction Log schema + dashboard tiles
Emit one JSON object per run. Keep it small, consistent, and query‑friendly.
Schema (conceptual):
{
"run_id": "[RUN_ID]",
"ts": "2026-05-01T12:00:00Z",
"env": "[ENV]",
"workspace": "[WORKSPACE_NAME]",
"detector": { "name": "presidio|privacy_filter|cloud", "version": "[DET_VERSION]" },
"entity_counts": { "EMAIL_ADDRESS": 2, "OPENAI_API_KEY": 1, "total": 3 },
"sample_masked_spans": [
{ "entity_type": "EMAIL_ADDRESS", "start": 92, "end": 107 },
{ "entity_type": "OPENAI_API_KEY", "start": 144, "end": 185 }
],
"decision": "input_masked|pass_through|fail_closed",
"input_bytes": 1234,
"output_bytes": 1210,
"latency_ms": { "detect": 38, "anonymize": 12 },
"cost_estimate_usd": 0.0000,
"source": { "service": "[SERVICE_NAME]", "operation": "[OP_NAME]" }
}
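A log entry with this shape can be assembled from detector output in a few lines. The sketch below is an assumption-laden illustration: build_redaction_log is a hypothetical helper, detections are assumed to be dicts with entity_type, start, and end keys (the shape Presidio's analyzer returns), and latency, cost, and source fields are left to the caller.

```python
from collections import Counter
from datetime import datetime, timezone


def build_redaction_log(run_id, env, workspace, detections, raw_text, masked_text,
                        detector_name="presidio", detector_version="unknown",
                        max_spans=5):
    """Build one Redaction Log entry; stores offsets only, never raw values."""
    per_type = Counter(d["entity_type"] for d in detections)
    entity_counts = dict(per_type)
    entity_counts["total"] = sum(per_type.values())
    return {
        "run_id": run_id,
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "env": env,
        "workspace": workspace,
        "detector": {"name": detector_name, "version": detector_version},
        "entity_counts": entity_counts,
        # Keep a capped sample of spans; type and offsets are enough for audits.
        "sample_masked_spans": [
            {"entity_type": d["entity_type"], "start": d["start"], "end": d["end"]}
            for d in detections[:max_spans]
        ],
        "decision": "input_masked" if entity_counts["total"] > 0 else "pass_through",
        # Byte sizes use UTF-8, matching what is sent on the wire.
        "input_bytes": len(raw_text.encode("utf-8")),
        "output_bytes": len(masked_text.encode("utf-8")),
    }
```

The fail_closed decision is deliberately not produced here; it belongs to the error path, where no masked text exists.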
Example tiles (adapt to your BI tool):
- Entity counts by type (7d): group by entity_type (explode entity_counts, skipping the total key) and sum values.
- Mask vs pass decisions: count by decision per day.
- Detector/version: group by detector.name, detector.version to track rollouts.
BigQuery helper views (pseudo‑SQL):
-- explode entity_counts
SELECT run_id, ts, env, detector.name AS detector, detector.version AS version,
key AS entity_type, value AS cnt
FROM `[LOG_DATASET].[TABLE]`, UNNEST(JSON_EXTRACT_KEYS(entity_counts)) AS key,
UNNEST([STRUCT(CAST(JSON_VALUE(JSON_EXTRACT(entity_counts, CONCAT('$.', key)) ) AS INT64) AS value)]);
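Before wiring a BI tool, the same tiles can be checked in plain Python against a handful of log entries. A minimal sketch, assuming logs are dicts shaped like the schema above; entity_totals and decisions_per_day are hypothetical helper names. Note that the total key is skipped so per‑type sums are not double‑counted.

```python
from collections import Counter, defaultdict


def entity_totals(logs):
    """Sum entity_counts across runs, ignoring the aggregate 'total' key."""
    totals = Counter()
    for log in logs:
        for entity_type, count in log.get("entity_counts", {}).items():
            if entity_type != "total":
                totals[entity_type] += count
    return dict(totals)


def decisions_per_day(logs):
    """Count decisions per calendar day, using the date part of the ISO timestamp."""
    per_day = defaultdict(Counter)
    for log in logs:
        per_day[log["ts"][:10]][log["decision"]] += 1
    return {day: dict(counts) for day, counts in per_day.items()}


logs = [
    {"ts": "2026-05-01T12:00:00Z", "decision": "input_masked",
     "entity_counts": {"EMAIL_ADDRESS": 2, "OPENAI_API_KEY": 1, "total": 3}},
    {"ts": "2026-05-01T13:00:00Z", "decision": "pass_through",
     "entity_counts": {"total": 0}},
]
print(entity_totals(logs))      # {'EMAIL_ADDRESS': 2, 'OPENAI_API_KEY': 1}
print(decisions_per_day(logs))  # {'2026-05-01': {'input_masked': 1, 'pass_through': 1}}
```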
Tip: store 2–5 masked span examples max; never store raw values.
Detector swap layer [Privacy Filter / cloud DLP]
Switch detectors without changing the rest of the subflow.
Option A — OpenAI Privacy Filter (self‑hosted):
- Endpoint: [PRIVACY_FILTER_URL]/filter
- Request: { "text": "[RAW_TEXT]" }
- Response: { "masked_text": "...", "entities": [{"type":"email","start":..,"end":..}], "version": "[DET_VERSION]" }
- Map entities to the Redaction Log; use masked_text downstream.
Option B — Cloud DLP/PII service (managed):
- Wrap a thin adapter that normalizes results to { masked_text, entities[], version }.
- Expect higher latency + per‑request cost; set cost_estimate_usd in the log from headers/bytes.
Keep the operator map consistent so your downstream behavior doesn’t change as you swap detectors.
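One way to keep the rest of the subflow detector‑agnostic is a pair of small adapters that emit the normalized { masked_text, entities[], version } shape. This is a sketch, not a reference integration: the function names are hypothetical, the Privacy Filter response shape is taken from the example in this section rather than from a published API, and the label map PF_TYPE_MAP is an assumption you would fill in for your detector.

```python
# Hypothetical mapping from detector-specific labels to the pack's entity names.
PF_TYPE_MAP = {"email": "EMAIL_ADDRESS", "phone": "PHONE_NUMBER"}


def from_presidio(analyzer_results, anonymized_text, version="unknown"):
    """Normalize Presidio analyze + anonymize output."""
    return {
        "masked_text": anonymized_text,
        "entities": [
            {"type": r["entity_type"], "start": r["start"], "end": r["end"]}
            for r in analyzer_results
        ],
        "version": version,
    }


def from_privacy_filter(response):
    """Normalize a response shaped like the Option A example."""
    return {
        "masked_text": response["masked_text"],
        "entities": [
            {
                # Unknown labels fall back to an upper-cased version of themselves.
                "type": PF_TYPE_MAP.get(e["type"], e["type"].upper()),
                "start": e["start"],
                "end": e["end"],
            }
            for e in response.get("entities", [])
        ],
        "version": response.get("version", "unknown"),
    }
```

Downstream code then reads only the normalized dict, so swapping detectors changes one adapter call rather than the log builder or router.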
Fail‑closed branch [deterministic + alert]
Use this when redaction fails or the detector times out.
Pseudocode:
try {
detect(); anonymize(); write_log('input_masked'|'pass_through'); forward(masked_or_raw);
} catch (e) {
queue(original_payload, reason=e.message); write_log('fail_closed'); alert([ALERT_CHANNEL]);
}
Checklist:
- Queue: [S3 bucket|KV|DB table] named [WORKSPACE_NAME]-quarantine-[ENV].
- Alert: POST to [ALERT_CHANNEL] with [RUN_ID], reason, and a link to the quarantined item.
- Auto‑retry: backoff with jitter; max [N] attempts.
- Circuit‑breaker: if fail‑closed rate > [THRESHOLD]% over [WINDOW] mins, disable non‑essential external calls and surface a status banner to ops.
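The retry and quarantine items in the checklist can be sketched as follows. This is one possible arrangement under stated assumptions: run_fail_closed and backoff_delays are hypothetical names, the redact, quarantine, and alert callables stand in for your own detector call, queue writer, and [ALERT_CHANNEL] poster, and the backoff uses "full jitter" (a random delay up to an exponentially growing cap), which is a choice made here rather than something the checklist prescribes. The circuit‑breaker is left out for brevity.

```python
import random
import time


def backoff_delays(max_attempts, base=0.5, cap=8.0, rng=random.random):
    """Full-jitter exponential backoff: one delay between each pair of attempts."""
    return [rng() * min(cap, base * (2 ** i)) for i in range(max_attempts - 1)]


def run_fail_closed(redact, payload, quarantine, alert,
                    max_attempts=3, sleep=time.sleep):
    """Try to redact; on repeated failure, quarantine and alert instead of forwarding.

    Returns (ok, output). When ok is False, output is None and nothing
    should be sent to the external service.
    """
    delays = backoff_delays(max_attempts)
    last_error = None
    for attempt in range(max_attempts):
        try:
            return True, redact(payload)
        except Exception as exc:  # any detector failure counts as a failure
            last_error = exc
            if attempt < len(delays):
                sleep(delays[attempt])
    reason = str(last_error)
    quarantine(payload, reason)  # keep the raw payload out of logs and traces
    alert(reason)
    return False, None
```

Write the fail_closed log entry in the caller when ok is False, so the log sink and the quarantine store stay independent.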
Defaults, validation, and rollout [copy this]
Fast defaults for a solo operator:
- Language: 'en' (extend later per client).
- Entities: email, phone, SSN, IBAN, API keys/tokens (OpenAI, GitHub, AWS), JWT.
- Operators: replace secrets with fixed placeholders; pseudonymize IBAN with SHA‑256 + [ANON_SALT]; keep last 2–4 digits for phone/SSN.
- Placement: pre‑call redaction subflow + post‑call output scan (reuse same subflow).
- Logs: one entry per run, no raw values, include detector name/version, entity_counts, decision, latency, and cost.
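The operator defaults above can be illustrated with two small helpers. This is a sketch under stated assumptions, not Presidio's own operators: mask_keep_last and pseudonymize are hypothetical names, the mask preserves separators such as dashes, and the pseudonym uses HMAC‑SHA‑256 keyed with the salt, which is a swapped‑in variant of "SHA‑256 + salt" chosen here because it avoids ad hoc string concatenation.

```python
import hashlib
import hmac


def mask_keep_last(value, keep=4, char="X"):
    """Mask letters and digits except the last `keep`; separators are left as-is."""
    alnum_total = sum(c.isalnum() for c in value)
    to_mask = max(alnum_total - keep, 0)
    out = []
    for c in value:
        if c.isalnum() and to_mask > 0:
            out.append(char)
            to_mask -= 1
        else:
            out.append(c)
    return "".join(out)


def pseudonymize(value, salt):
    """Deterministic pseudonym: HMAC-SHA-256 of the value, keyed with the salt."""
    return hmac.new(salt.encode("utf-8"), value.encode("utf-8"),
                    hashlib.sha256).hexdigest()


print(mask_keep_last("123-45-6789", keep=4, char="X"))  # XXX-XX-6789
```

Because the pseudonym is deterministic per salt, the same IBAN maps to the same token within an environment, which keeps joins possible without exposing the value; rotating [ANON_SALT] breaks that linkage by design.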
Validation sample:
"Email jordan.ops@example.io and use sk-1234567890ABCDEFGHIJKLMNOPQRSTUV. AWS key AKIAABCDEFGHIJKLMNOP. SSN 123-45-6789."
→ "Email ********************* and use [API_KEY_MASKED]. AWS key [AWS_ACCESS_KEY_ID_MASKED]. SSN XXX-XX-6789."
Rollout plan:
- Dev: run regex‑only prefilter + Presidio; verify logs.
- Stage: dual‑scan inputs/outputs; measure mask rate and latency budget.
- Prod: enable fail‑closed; alert on any detector errors; review dashboards weekly.