Checklist · April 13, 2026

Solo Operator Voice Agent Playbook: Twilio + OpenAI Realtime + Calendly (16‑Step Build Checklist)

A field-tested 16‑step checklist to build a low‑latency voice receptionist with Twilio Media Streams, OpenAI Realtime, and Calendly, complete with barge‑in wiring, DTMF/handoff fallbacks, a legal consent flow, and post‑call automation hooks. Built for solo consultants who want booked meetings without hiring.

From Episode: Turn Missed Calls into Booked Consults: Voice Agent with Twilio + Realtime API + Calendly

Use this checklist to ship a working, compliant voice receptionist that answers every call, qualifies callers naturally, and books meetings for you without hiring. Follow the steps in order, instrument latency as you go, and fail gracefully with DTMF menus and timeouts. Where the checklist says “starter repo,” grab the linked Node/Express example in the show notes.

1
Buy a Twilio number and point Voice to your webhook
In Twilio Console, purchase a local/toll‑free number and set the Voice webhook to POST https://[YOUR_DOMAIN]/voice. Disable default recording for now—you’ll start it after consent.
2
Return <Connect><Stream> TwiML for bidirectional audio
Your /voice handler should return TwiML that opens a bidirectional Media Stream to your WebSocket: <Response><Connect><Stream url="wss://[YOUR_DOMAIN]/media" /></Connect></Response>. Use <Connect><Stream> (not <Start><Stream>) for live back‑and‑forth.
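A minimal sketch of the TwiML builder behind the /voice handler. The helper names `escapeXml` and `buildStreamTwiml` are hypothetical (they are not part of any Twilio SDK), and the sketch assumes you build the XML by hand rather than with a helper library.

```javascript
// Escape characters that would break an XML attribute value.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build the <Connect><Stream> TwiML that opens a bidirectional Media Stream.
// `buildStreamTwiml` is an illustrative name, not a Twilio API.
function buildStreamTwiml(streamUrl) {
  if (!/^wss:\/\//.test(streamUrl)) {
    throw new Error('Media Stream URL must use wss://');
  }
  return (
    '<?xml version="1.0" encoding="UTF-8"?>' +
    '<Response><Connect><Stream url="' + escapeXml(streamUrl) + '" /></Connect></Response>'
  );
}
```

In an Express route this could be returned with `res.type('text/xml').send(buildStreamTwiml(process.env.STREAM_WSS))`.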
3
Create .env with all required secrets and config
Add: TWILIO_AUTH_TOKEN, OPENAI_API_KEY, REALTIME_MODEL, STREAM_WSS=wss://[YOUR_DOMAIN]/media, CALENDLY_TOKEN, CALENDLY_OWNER_URI, EVENT_TYPE_URI (e.g., /event_types/[ID]), FORWARD_NUMBER, PUBLIC_BASE_URL.
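One way to catch a missing secret at boot rather than mid-call is a small startup check. This is a sketch only: `loadConfig` is a hypothetical helper, and it assumes the variables are already in the environment (for example, loaded from .env by whatever loader the starter repo uses).

```javascript
// Variable names come from the list above.
const REQUIRED_VARS = [
  'TWILIO_AUTH_TOKEN',
  'OPENAI_API_KEY',
  'REALTIME_MODEL',
  'STREAM_WSS',
  'CALENDLY_TOKEN',
  'CALENDLY_OWNER_URI',
  'EVENT_TYPE_URI',
  'FORWARD_NUMBER',
  'PUBLIC_BASE_URL',
];

// Return a trimmed config object, or throw with the full list of missing names.
function loadConfig(env) {
  const missing = REQUIRED_VARS.filter(
    (name) => typeof env[name] !== 'string' || env[name].trim() === ''
  );
  if (missing.length > 0) {
    throw new Error('Missing required env vars: ' + missing.join(', '));
  }
  const config = {};
  for (const name of REQUIRED_VARS) {
    config[name] = env[name].trim();
  }
  return config;
}
```

Calling `loadConfig(process.env)` once at startup fails fast with every missing name in a single error.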
4
Stand up the Node/Express server with a Media Streams WS endpoint
Expose POST /voice (returns TwiML) and a WS at /media for Twilio Streams. On WS upgrade, validate the X‑Twilio‑Signature to verify the stream is authentic before accepting frames.
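The signature check can be sketched with Node's built-in crypto module. Twilio's documented scheme for webhooks is an HMAC-SHA1 over the request URL followed by the POST parameters (sorted by name, each appended as name then value), keyed with the auth token and base64-encoded. The function names below are hypothetical, and the exact URL string Twilio signs for the WebSocket handshake is an assumption to verify against Twilio's documentation.

```javascript
const crypto = require('node:crypto');

// Compute the expected signature: URL + sorted key/value pairs, HMAC-SHA1, base64.
function computeTwilioSignature(authToken, url, params = {}) {
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return crypto
    .createHmac('sha1', authToken)
    .update(Buffer.from(data, 'utf-8'))
    .digest('base64');
}

// Compare in constant time; a missing or wrong-length header is rejected.
function isValidTwilioSignature(authToken, url, params, headerValue) {
  const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
  const received = Buffer.from(String(headerValue || ''));
  return (
    expected.length === received.length &&
    crypto.timingSafeEqual(expected, received)
  );
}
```

For a WebSocket upgrade there are no POST parameters, so `params` would be an empty object and only the URL is signed.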
5
Handle Twilio Media Streams messages correctly
Parse connected/start/media/mark/stop/dtmf events. Expect audio/x‑mulaw, 8 kHz, mono frames. Buffer in 20–40 ms chunks; send keep‑alives; on stop, flush and close downstream connections cleanly.
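A sketch of the message dispatch, assuming each WebSocket frame is a JSON string shaped like Twilio's Media Streams messages. `handleTwilioMessage`, the `state` object, and the returned action shapes are hypothetical names for illustration.

```javascript
// Turn one raw Twilio Media Streams message into a small action object.
// `state` is a plain per-call object owned by the caller.
function handleTwilioMessage(raw, state) {
  const msg = JSON.parse(raw);
  switch (msg.event) {
    case 'connected':
      return { type: 'noop' };
    case 'start':
      // Remember the stream and call identifiers for later outbound messages.
      state.streamSid = msg.start.streamSid;
      state.callSid = msg.start.callSid;
      return { type: 'started', format: msg.start.mediaFormat };
    case 'media':
      // The payload is base64-encoded mu-law audio.
      return { type: 'audio', audio: Buffer.from(msg.media.payload, 'base64') };
    case 'dtmf':
      return { type: 'dtmf', digit: msg.dtmf.digit };
    case 'mark':
      return { type: 'mark', name: msg.mark.name };
    case 'stop':
      return { type: 'stopped' };
    default:
      return { type: 'unknown', event: msg.event };
  }
}
```

Returning plain action objects keeps the parsing testable without a live socket; the WS handler decides what to do with each action.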
6
Open a low‑latency Realtime connection to the model
From the WS handler, open a server‑side WebSocket to OpenAI Realtime. Initialize the session (voice, instructions, tool schema if any). Forward caller audio to input_audio_buffer.append and commit via VAD or short timers.
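The two messages this step sends most often can be sketched as plain objects. The builder names are hypothetical, and the session field names (`input_audio_format`, `output_audio_format`, `turn_detection`) follow the flat session.update shape from earlier Realtime API versions; treat them as assumptions and check the current API reference, since the session schema has changed between releases.

```javascript
// Build a session.update that asks for mu-law audio in both directions,
// so Twilio frames can be forwarded without transcoding.
// Field names are an assumption; verify against the current Realtime docs.
function buildSessionUpdate({ instructions, voice }) {
  return {
    type: 'session.update',
    session: {
      instructions,
      voice,
      input_audio_format: 'g711_ulaw',
      output_audio_format: 'g711_ulaw',
      turn_detection: { type: 'server_vad' },
    },
  };
}

// Wrap a buffer of caller audio as an input_audio_buffer.append event.
function buildAudioAppend(mulawBuffer) {
  return {
    type: 'input_audio_buffer.append',
    audio: mulawBuffer.toString('base64'),
  };
}
```

Each object would be sent with `JSON.stringify` over the Realtime WebSocket.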
7
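Wire true barge‑in (a clear, a cancel, and a commit)
When the caller interrupts or presses a key, immediately (1) send Twilio a clear message to flush its playback buffer, and (2) stop the model's in-flight speech on the Realtime session. Over the server‑side WebSocket used here, that means response.cancel, optionally followed by conversation.item.truncate so the model's context matches what the caller actually heard; output_audio_buffer.clear is documented for WebRTC connections. Then (3) commit the latest input buffer so the model responds to the new utterance. If server VAD is enabled, the commit happens automatically.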
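The barge-in sequence can be sketched as a pair of message lists, one per socket. This sketch assumes a server-side WebSocket to Realtime and therefore uses response.cancel to stop in-flight speech (output_audio_buffer.clear is documented for WebRTC connections). `buildBargeInMessages` and its `manualCommit` option are hypothetical names.

```javascript
// Build the messages to send on interruption.
// twilio: flush audio already queued for playback on the call.
// realtime: stop the in-flight response, then optionally commit caller audio.
function buildBargeInMessages(streamSid, { manualCommit = true } = {}) {
  const realtime = [{ type: 'response.cancel' }];
  if (manualCommit) {
    // Skip this when server VAD is on, since it commits on its own.
    realtime.push({ type: 'input_audio_buffer.commit' });
  }
  return {
    twilio: [{ event: 'clear', streamSid }],
    realtime,
  };
}
```

Sending the Twilio clear first matters most for perceived latency, because audio already buffered on the call keeps playing until it is flushed.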
8
Stream model speech back to Twilio in the right format
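As the model emits audio deltas, forward them to Twilio as media messages carrying base64 audio/x‑mulaw at 8 kHz. If the Realtime session's output format is set to G.711 mu‑law, the deltas are already in that format and need no transcoding. Keep chunks small and sequential; on interruption, send a clear message before any new audio to prevent talk‑over.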
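A sketch of the outbound path, assuming the model's audio already arrives as base64 mu-law at 8 kHz. At 8,000 one-byte samples per second, 20 ms of audio is 160 bytes, which is the frame size used here. `toTwilioMediaMessages` and `buildMark` are hypothetical helper names.

```javascript
const FRAME_BYTES = 160; // 20 ms of 8 kHz, 8-bit mu-law audio

// Split one base64 audio delta into small, ordered Twilio media messages.
function toTwilioMediaMessages(streamSid, base64Audio, frameBytes = FRAME_BYTES) {
  const audio = Buffer.from(base64Audio, 'base64');
  const messages = [];
  for (let offset = 0; offset < audio.length; offset += frameBytes) {
    const frame = audio.subarray(offset, offset + frameBytes);
    messages.push({
      event: 'media',
      streamSid,
      media: { payload: frame.toString('base64') },
    });
  }
  return messages;
}

// A mark message; Twilio echoes it back once the preceding audio has played.
function buildMark(streamSid, name) {
  return { event: 'mark', streamSid, mark: { name } };
}
```

Sending a mark after each response gives you a playback-complete signal, which is useful when measuring barge-in success.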
9
Add Calendly booking flow (no redirect)
Fetch the host’s event types (GET /event_types), pull availability for the chosen EVENT_TYPE_URI and date range (GET /event_type_available_times), then create the booking with POST /invitees using the caller's name, email, and selected slot. Calendly limits each availability query to a 7‑day window, so split longer date ranges into multiple requests.
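The availability lookup can be sketched as a URL builder. Calendly caps each /event_type_available_times query at a 7-day range, so this splits a longer range into consecutive windows. `buildAvailabilityUrls` is a hypothetical helper and the event type URI in the usage note is a placeholder; the POST /invitees body is left out because its exact shape should be taken from Calendly's reference.

```javascript
const CALENDLY_API = 'https://api.calendly.com';
const MAX_WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // 7-day cap per query

// Return one request URL per window of at most 7 days.
function buildAvailabilityUrls(eventTypeUri, startIso, endIso) {
  const start = new Date(startIso).getTime();
  const end = new Date(endIso).getTime();
  if (!(start < end)) {
    throw new Error('start must be a valid date before end');
  }
  const urls = [];
  for (let from = start; from < end; from += MAX_WINDOW_MS) {
    const to = Math.min(from + MAX_WINDOW_MS, end);
    const query = new URLSearchParams({
      event_type: eventTypeUri,
      start_time: new Date(from).toISOString(),
      end_time: new Date(to).toISOString(),
    });
    urls.push(`${CALENDLY_API}/event_type_available_times?${query}`);
  }
  return urls;
}
```

Each URL would be fetched with an `Authorization: Bearer` header carrying CALENDLY_TOKEN, for example `buildAvailabilityUrls('https://api.calendly.com/event_types/[ID]', start, end)`.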
10
Confirm booking and send artifacts
Read back the date/time, then SMS or email the invitee confirmation URL from the Calendly response. As a fallback path, you can generate a single‑use scheduling link if API booking fails.
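The fallback link can be sketched as a request builder against Calendly's /scheduling_links endpoint, where `max_event_count: 1` makes the link single-use. The helper names and the SMS wording are hypothetical, and reading the link from `resource.booking_url` in the response is an assumption to confirm against Calendly's reference.

```javascript
// Build the fetch arguments for a single-use scheduling link.
function buildSingleUseLinkRequest(calendlyToken, eventTypeUri) {
  return {
    url: 'https://api.calendly.com/scheduling_links',
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${calendlyToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        max_event_count: 1, // the link expires after one booking
        owner: eventTypeUri,
        owner_type: 'EventType',
      }),
    },
  };
}

// Illustrative SMS copy for the fallback path.
function buildFallbackSms(bookingUrl) {
  return `Sorry about that. You can pick a time here: ${bookingUrl}`;
}
```

The request would be sent as `fetch(req.url, req.options)` on Node 18 or later.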
11
Implement guardrails: DTMF + human handoff + voicemail
Handle inbound DTMF from Twilio Streams: 0 = connect to a human at [FORWARD_NUMBER]; 1 = send a scheduling link via SMS; 9 = go to voicemail. Note: Streams support inbound DTMF only—you can’t send DTMF back to Twilio from your server.
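The digit menu above maps naturally to a lookup table. In this sketch the action labels and `routeDtmf` are hypothetical; the handoff branch returns `<Dial>` TwiML on the assumption that you redirect the live call by updating it through Twilio's REST API.

```javascript
// Digit-to-action table from the step above; action names are illustrative.
const DTMF_ACTIONS = new Map([
  ['0', 'handoff'],
  ['1', 'sms_link'],
  ['9', 'voicemail'],
]);

// Decide what to do for one inbound DTMF digit.
function routeDtmf(digit, forwardNumber) {
  const action = DTMF_ACTIONS.get(String(digit)) || 'ignore';
  if (action === 'handoff') {
    // TwiML that connects the caller to a human.
    return { action, twiml: `<Response><Dial>${forwardNumber}</Dial></Response>` };
  }
  return { action };
}
```

Unmapped digits fall through to `ignore`, so a stray key press does not end the conversation.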
12
Record legally: play disclosure, capture consent, then start
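Before recording, play a clear disclosure (e.g., “This call may be recorded for quality and training. Do I have your permission?”). If the caller says yes, start recording and store a consent flag plus the transcript snippet of their answer. Some U.S. states require consent from every party on the call, so check the rules for both your state and the caller's before recording across state lines.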
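A deliberately conservative sketch of the consent gate: any refusal word wins over any agreement word, and anything ambiguous is treated as no consent. The word lists, `classifyConsent`, and `buildConsentRecord` are hypothetical and would need tuning against real transcripts.

```javascript
// Classify the caller's answer to the recording disclosure.
// Refusal is checked first, so mixed answers never count as consent.
function classifyConsent(utterance) {
  const text = String(utterance || '').toLowerCase();
  const refusal = /\b(no|nope|don['’]t|do not|rather not|stop)\b/;
  const agreement = /\b(yes|yeah|yep|sure|okay|ok|go ahead)\b/;
  if (refusal.test(text)) return 'no';
  if (agreement.test(text)) return 'yes';
  return 'unclear';
}

// Build the consent flag and transcript snippet to store with the call.
function buildConsentRecord(callSid, utterance, now = new Date()) {
  const decision = classifyConsent(utterance);
  return {
    callSid,
    consent: decision === 'yes',
    decision,
    snippet: String(utterance || '').slice(0, 200),
    recordedAt: now.toISOString(),
  };
}
```

Only start recording when `consent` is true; an `unclear` result is a cue to ask once more rather than to proceed.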
13
Handle timeouts and errors without dead‑ends
No speech 6–8s → reprompt once; 2nd timeout → voicemail and transcript to your CRM. If Realtime or Calendly errors, apologize and text a single‑use booking link automatically.
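The timeout and error rules above reduce to two small decision functions. The names, action labels, and the 7-second value (picked from the 6–8 s range) are assumptions for illustration.

```javascript
const SILENCE_TIMEOUT_MS = 7000; // within the 6-8 s range from the step above

// Decide what to do when a silence timeout fires.
// previousTimeouts: how many silence timeouts this call has already had.
function nextSilenceAction(previousTimeouts) {
  if (previousTimeouts === 0) {
    return { action: 'reprompt', waitMs: SILENCE_TIMEOUT_MS };
  }
  return { action: 'voicemail', sendTranscriptToCrm: true };
}

// Decide what to do when a downstream service fails.
// service: 'realtime' or 'calendly'.
function onServiceError(service) {
  return {
    action: 'apologize_and_text_link',
    failedService: service,
    linkType: 'single_use',
  };
}
```

Keeping these as pure functions makes every branch easy to cover in the pre-launch test script.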
14
Ship ops glue: summaries and follow‑ups
On call end, post a JSON payload to your Make/Zapier webhook: caller_id, disposition (booked/voicemail/transfer), consent, transcript summary, and Calendly invitee data. Create tasks/notes in your CRM and send a recap email/SMS to the prospect.
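A sketch of the end-of-call payload with the fields listed above. `buildCallSummary` and the snake_case key names are hypothetical; match them to whatever your Make or Zapier scenario expects.

```javascript
const DISPOSITIONS = ['booked', 'voicemail', 'transfer'];

// Assemble the JSON body posted to the automation webhook on call end.
function buildCallSummary({ callerId, disposition, consent, transcriptSummary, invitee = null }) {
  if (!DISPOSITIONS.includes(disposition)) {
    throw new Error(`Unknown disposition: ${disposition}`);
  }
  return {
    caller_id: callerId,
    disposition,
    consent: Boolean(consent),
    transcript_summary: transcriptSummary || '',
    calendly_invitee: invitee, // null when no booking was made
  };
}
```

On Node 18 or later the payload could be sent with `fetch(webhookUrl, { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(summary) })`, where `webhookUrl` is your own hook address.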
15
Instrument latency and quality; set budgets
Log end‑to‑end round‑trip (caller speech → first synthesized byte), barge‑in success rate, booking conversion, and cost per call. Tune chunk sizes, server region, and TTS speed—optimize for natural turn‑taking over raw word rate.
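Averages hide the slow turns that make an agent feel laggy, so a latency budget is easier to enforce on percentiles. This sketch uses the nearest-rank method; `percentile`, `summarizeLatency`, and the budget value are hypothetical.

```javascript
// Nearest-rank percentile over a list of millisecond samples.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Summarize round-trip latency (caller speech to first synthesized byte)
// and compare the p95 against a budget you choose.
function summarizeLatency(samplesMs, budgetMs) {
  const p50 = percentile(samplesMs, 50);
  const p95 = percentile(samplesMs, 95);
  return { p50, p95, withinBudget: p95 !== null && p95 <= budgetMs };
}
```

Feed it one sample per conversational turn and review the summary per day or per deploy.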
16
Security and reliability checklist before go‑live
Verify X‑Twilio‑Signature on webhooks and WS upgrades, rotate API keys, enforce HTTPS/WSS, retry Calendly calls with backoff, and add health checks. Run a 10‑call script with intentional barge‑ins and DTMF to validate all branches.
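The retry guidance for Calendly calls can be sketched as capped exponential backoff. `backoffDelays`, `withRetry`, and the default timings are hypothetical; a production version would likely add jitter and retry only on transient errors.

```javascript
// Delay schedule: base, 2x, 4x, ... capped at capMs.
function backoffDelays(retries, baseMs = 250, capMs = 4000) {
  return Array.from({ length: retries }, (_, i) => Math.min(capMs, baseMs * 2 ** i));
}

// Run an async function, retrying with backoff until attempts are used up.
// `sleep` is injectable so tests do not have to wait on real timers.
async function withRetry(fn, options = {}) {
  const {
    attempts = 4,
    baseMs = 250,
    capMs = 4000,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;
  const delays = backoffDelays(attempts - 1, baseMs, capMs);
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < delays.length) await sleep(delays[i]);
    }
  }
  throw lastError;
}
```

A call would look like `withRetry(() => fetch(url, options))`, with the total attempt count kept low so the caller is not left waiting in silence.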