Solo Operator Voice Agent Playbook: Twilio + OpenAI Realtime + Calendly (16‑Step Build Checklist)
A field-tested 16‑step checklist to build a low‑latency voice receptionist with Twilio Media Streams, OpenAI Realtime, and Calendly, complete with barge‑in wiring, DTMF/handoff fallbacks, legal consent flow, and post‑call automation hooks. Built for solo consultants who want booked meetings without hiring.
Use this checklist to ship a working, compliant voice receptionist that answers every call, qualifies naturally, and books meetings for you—without hiring. Follow the sequence; instrument latency and fail gracefully with DTMF and timeouts. Where the checklist says “starter repo,” grab the linked Node/Express example in the show notes.
- 1
Buy a Twilio number and point Voice to your webhook
In Twilio Console, purchase a local/toll‑free number and set the Voice webhook to POST https://[YOUR_DOMAIN]/voice. Disable default recording for now—you’ll start it after consent.
- 2
Return <Connect><Stream> TwiML for bidirectional audio
Your /voice handler should return TwiML that opens a bidirectional Media Stream to your WebSocket:
<Response><Connect><Stream url="wss://[YOUR_DOMAIN]/media" /></Connect></Response>
Use <Connect><Stream> (not <Start><Stream>) for live back‑and‑forth.
- 3
Create .env with all required secrets and config
Add: TWILIO_AUTH_TOKEN, OPENAI_API_KEY, REALTIME_MODEL, STREAM_WSS=wss://[YOUR_DOMAIN]/media, CALENDLY_TOKEN, CALENDLY_OWNER_URI, EVENT_TYPE_URI (e.g., /event_types/[ID]), FORWARD_NUMBER, PUBLIC_BASE_URL.
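A minimal sketch of a startup check for these settings, so a missing secret fails fast at boot rather than mid-call. The key names mirror the list above; `requireEnv` is a hypothetical helper, not part of any library.

```javascript
// Keys taken from the checklist's .env list.
const REQUIRED_KEYS = [
  'TWILIO_AUTH_TOKEN', 'OPENAI_API_KEY', 'REALTIME_MODEL', 'STREAM_WSS',
  'CALENDLY_TOKEN', 'CALENDLY_OWNER_URI', 'EVENT_TYPE_URI',
  'FORWARD_NUMBER', 'PUBLIC_BASE_URL',
];

// Hypothetical helper: returns the required settings or throws,
// naming every key that is missing or blank.
function requireEnv(env = process.env) {
  const missing = REQUIRED_KEYS.filter((k) => !env[k] || env[k].trim() === '');
  if (missing.length > 0) {
    throw new Error(`Missing required config: ${missing.join(', ')}`);
  }
  return Object.fromEntries(REQUIRED_KEYS.map((k) => [k, env[k]]));
}
```

Call `requireEnv()` once before the server starts listening and pass the returned object around instead of reading `process.env` in handlers.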
- 4
Stand up the Node/Express server with a Media Streams WS endpoint
Expose POST /voice (returns TwiML) and a WS at /media for Twilio Streams. On WS upgrade, validate the X‑Twilio‑Signature to verify the stream is authentic before accepting frames.
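The two pieces this step relies on can be sketched with Node's standard library alone: the TwiML string for /voice and the signature check. The function names are illustrative. The signature scheme shown (HMAC-SHA1 over the URL plus sorted POST parameters, base64-encoded) is Twilio's documented webhook scheme; for a WS upgrade there are no form parameters, so `params` stays empty, and the URL must match exactly what Twilio requested. In production the official `twilio` helper's `validateRequest` is likely the safer choice.

```javascript
const crypto = require('node:crypto');

// TwiML for POST /voice; escapes the URL for use inside an XML attribute.
function voiceTwiml(streamUrl) {
  const escaped = streamUrl
    .replace(/&/g, '&amp;').replace(/"/g, '&quot;').replace(/</g, '&lt;');
  return `<Response><Connect><Stream url="${escaped}" /></Connect></Response>`;
}

// HMAC-SHA1 over the full URL followed by each POST parameter name and
// value (sorted by name), keyed with the auth token, base64-encoded.
function expectedSignature(authToken, url, params = {}) {
  const data = Object.keys(params).sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return crypto.createHmac('sha1', authToken).update(data, 'utf8').digest('base64');
}

// Constant-time comparison against the X-Twilio-Signature header value.
function isValidSignature(authToken, headerValue, url, params = {}) {
  const expected = Buffer.from(expectedSignature(authToken, url, params));
  const received = Buffer.from(String(headerValue || ''));
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}
```

Reject the request (HTTP 403, or destroy the socket on upgrade) whenever `isValidSignature` returns false.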
- 5
Handle Twilio Media Streams messages correctly
Parse start/media/mark/stop/dtmf events. Expect audio/x‑mulaw, 8 kHz, mono frames. Buffer in 20–40 ms chunks; send keep‑alives; on stop, flush and close downstream connections cleanly.
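One way to structure the message handling is a single dispatcher that updates per-call state, sketched below. The state shape and function names are assumptions for illustration; the event names and payload fields follow Twilio's Media Streams message format.

```javascript
// Hypothetical per-call state; one instance per WebSocket connection.
function newCallState() {
  return { streamSid: null, callSid: null, format: null,
           audio: [], digits: '', lastMark: null, stopped: false };
}

// Applies one raw Media Streams message (a JSON string) to the call state.
function handleStreamMessage(state, raw) {
  const msg = JSON.parse(raw);
  switch (msg.event) {
    case 'start':
      state.streamSid = msg.start.streamSid;
      state.callSid = msg.start.callSid;
      state.format = msg.start.mediaFormat; // expect audio/x-mulaw, 8000 Hz, 1 channel
      break;
    case 'media':
      // payload is base64-encoded mu-law audio from the caller
      state.audio.push(Buffer.from(msg.media.payload, 'base64'));
      break;
    case 'dtmf':
      state.digits += msg.dtmf.digit;
      break;
    case 'mark':
      state.lastMark = msg.mark.name;
      break;
    case 'stop':
      state.stopped = true; // caller hung up or stream ended: flush and close downstream
      break;
    default:
      break; // 'connected' and unknown events need no state change
  }
  return state;
}
```

In a real handler the `media` branch would forward the payload downstream immediately rather than accumulate it; the array here only makes the sketch easy to inspect.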
- 6
Open a low‑latency Realtime connection to the model
From the WS handler, open a server‑side WebSocket to OpenAI Realtime. Initialize the session (voice, instructions, tool schema if any). Forward caller audio to input_audio_buffer.append and commit via VAD or short timers.
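The session setup and audio forwarding might look like the event builders below. This is a sketch under assumptions: the field layout follows the Realtime beta WebSocket event shape (`input_audio_format` / `output_audio_format` set to `g711_ulaw`, `turn_detection` with `server_vad`), and newer API versions may nest these settings differently, so check the current reference before relying on it. Requesting mu-law in both directions means Twilio's base64 payloads can be forwarded without transcoding.

```javascript
// Builds the session.update event sent right after the Realtime socket opens.
// Field names assume the beta WebSocket API shape (an assumption; verify).
function sessionUpdateEvent({ voice, instructions }) {
  return {
    type: 'session.update',
    session: {
      voice,
      instructions,
      input_audio_format: 'g711_ulaw',   // matches Twilio's 8 kHz mu-law
      output_audio_format: 'g711_ulaw',  // so replies need no transcoding
      turn_detection: { type: 'server_vad' }, // server decides when a turn ends
    },
  };
}

// Wraps one Twilio media payload (already base64 mu-law) for the model.
function appendAudioEvent(twilioMediaPayload) {
  return { type: 'input_audio_buffer.append', audio: twilioMediaPayload };
}
```

Each returned object would be sent with `JSON.stringify(...)` over the server-side WebSocket.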
- 7
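Wire true barge‑in (clear, cancel, commit)
When the caller interrupts or presses a key, immediately (1) send Twilio a Clear to flush its playback buffer, and (2) stop the in‑flight model reply on the Realtime session with response.cancel (output_audio_buffer.clear applies to WebRTC sessions, not a server‑side WebSocket); then (3) commit the latest input buffer so the model responds to the new utterance. With server VAD enabled, the commit happens automatically.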
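The interruption sequence can be sketched as a function that returns the messages for each socket. This sketch assumes a server-side WebSocket to Realtime, where `response.cancel` is the event that stops the in-flight reply (`output_audio_buffer.clear` is documented for WebRTC sessions). With manual turn handling, a commit alone does not produce a reply, so `response.create` follows it. `bargeInMessages` is a hypothetical name.

```javascript
// Returns the messages to send on interruption, grouped by destination.
function bargeInMessages(streamSid, { serverVad = true } = {}) {
  // 1) Tell Twilio to drop any audio it has buffered for playback.
  const toTwilio = [{ event: 'clear', streamSid }];
  // 2) Stop the model's current reply.
  const toRealtime = [{ type: 'response.cancel' }];
  // 3) Without server VAD, commit the caller's audio and ask for a reply;
  //    with server VAD the server commits and responds on its own.
  if (!serverVad) {
    toRealtime.push({ type: 'input_audio_buffer.commit' }, { type: 'response.create' });
  }
  return { toTwilio, toRealtime };
}
```

Send the Twilio message first: its playback buffer is what the caller actually hears, so flushing it has the most immediate effect on perceived latency.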
- 8
Stream model speech back to Twilio in the right format
As the model emits audio, base64‑encode as audio/x‑mulaw at 8 kHz and send as Twilio media messages. Keep chunks small and sequential; on interruption, send Clear before new audio to prevent talk‑over.
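A small sketch of the chunking step, assuming the model output is already 8 kHz mu-law bytes. At one byte per sample, a 20 ms frame is 8000 × 0.020 = 160 bytes. The function name is illustrative; the message shape follows Twilio's outbound `media` message.

```javascript
const FRAME_BYTES = 160; // 8000 samples/s * 0.020 s * 1 byte per mu-law sample

// Splits a mu-law Buffer into sequential Twilio media messages.
function toTwilioMediaMessages(streamSid, mulawAudio, frameBytes = FRAME_BYTES) {
  const messages = [];
  for (let offset = 0; offset < mulawAudio.length; offset += frameBytes) {
    const frame = mulawAudio.subarray(offset, offset + frameBytes);
    messages.push({
      event: 'media',
      streamSid,
      media: { payload: frame.toString('base64') },
    });
  }
  return messages;
}
```

If the Realtime session already emits base64 mu-law deltas, each delta can usually be forwarded as a payload directly and this splitting is only needed when chunks arrive larger than you want to queue.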
- 9
Add Calendly booking flow (no redirect)
Fetch the host’s event types (GET /event_types), pull availability for the chosen EVENT_TYPE_URI and date range (GET /event_type_available_times), then create the booking with POST /invitees using caller name, email, and selected slot.
- 10
Confirm booking and send artifacts
Read back the date/time, then SMS or email the invitee confirmation URL from the Calendly response. As a fallback path, you can generate a single‑use scheduling link if API booking fails.
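Two of the Calendly calls from the booking and fallback steps can be sketched as plain request descriptors: the availability lookup and the single-use scheduling link. The helper names are hypothetical. The endpoints and fields shown (`event_type`, `start_time`, `end_time`; `max_event_count`, `owner`, `owner_type`) follow Calendly's v2 API as I understand it; the invitee-creation body is left out because its exact schema is not given here.

```javascript
const CALENDLY_API = 'https://api.calendly.com';

// Request for open slots of one event type within a date range (ISO 8601 strings).
function availableTimesRequest(token, eventTypeUri, startIso, endIso) {
  const url = new URL('/event_type_available_times', CALENDLY_API);
  url.searchParams.set('event_type', eventTypeUri);
  url.searchParams.set('start_time', startIso);
  url.searchParams.set('end_time', endIso);
  return { method: 'GET', url: url.toString(),
           headers: { Authorization: `Bearer ${token}` } };
}

// Fallback: request a scheduling link that can be used for one booking only.
function singleUseLinkRequest(token, eventTypeUri) {
  return {
    method: 'POST',
    url: `${CALENDLY_API}/scheduling_links`,
    headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ max_event_count: 1, owner: eventTypeUri, owner_type: 'EventType' }),
  };
}
```

With Node 18 or later, a descriptor can be executed as `const res = await fetch(req.url, req)`.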
- 11
Implement guardrails: DTMF + human handoff + voicemail
Handle inbound DTMF from Twilio Streams: 0 = connect to a human at [FORWARD_NUMBER]; 1 = send a scheduling link via SMS; 9 = go to voicemail. Note: Streams support inbound DTMF only—you can’t send DTMF back to Twilio from your server.
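The key mapping above can be expressed as a small lookup, sketched below with hypothetical names. For the handoff branch it also builds the `<Dial>` TwiML; applying that TwiML to the live call (for example by updating the call through Twilio's REST API) is left out of the sketch.

```javascript
// Digit-to-action table from the checklist: 0 human, 1 SMS link, 9 voicemail.
const DTMF_ACTIONS = new Map([
  ['0', 'handoff'],
  ['1', 'sms_link'],
  ['9', 'voicemail'],
]);

// Maps one inbound digit to an action; unmapped digits are ignored.
function routeDigit(digit, forwardNumber) {
  const action = DTMF_ACTIONS.get(digit) || 'ignore';
  if (action === 'handoff') {
    // TwiML that connects the caller to the forwarding number.
    return { action, twiml: `<Response><Dial>${forwardNumber}</Dial></Response>` };
  }
  return { action };
}
```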
- 12
Record legally: play disclosure, capture consent, then start
Before recording, play a clear disclosure (e.g., “This call may be recorded for quality and training. Do I have your permission?”). If yes, start recording and store a consent flag + transcript snippet. Check your state rules before enabling cross‑state recording.
- 13
Handle timeouts and errors without dead‑ends
If no speech is detected for 6–8 s, reprompt once; on the second timeout, route to voicemail and send the transcript to your CRM. If Realtime or Calendly returns an error, apologize and automatically text a single‑use booking link.
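The timeout and error rules reduce to a tiny decision table, sketched here. The names, the 7-second value (chosen from inside the 6–8 s window), and the action labels are assumptions for illustration.

```javascript
const SILENCE_TIMEOUT_MS = 7000; // within the 6-8 s window from the checklist

// Decides what to do when the silence timer fires.
// timeoutsSoFar counts earlier silence timeouts on this call.
function onSilenceTimeout(timeoutsSoFar) {
  if (timeoutsSoFar === 0) {
    return { action: 'reprompt', timeouts: 1 };
  }
  // Second (or later) timeout: stop prompting, take a message, log to CRM.
  return { action: 'voicemail', sendTranscriptToCrm: true, timeouts: timeoutsSoFar + 1 };
}

// Decides what to do when a dependency (Realtime or Calendly) fails.
function onServiceError(service) {
  return { action: 'sms_single_use_link', apologize: true, service };
}
```

Reset the silence timer whenever caller audio or a DTMF digit arrives, so only genuine dead air advances the counter.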
- 14
Ship ops glue: summaries and follow‑ups
On call end, post a JSON payload to your Make/Zapier webhook: caller_id, disposition (booked/voicemail/transfer), consent, transcript summary, and Calendly invitee data. Create tasks/notes in your CRM and send a recap email/SMS to the prospect.
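A sketch of the end-of-call payload, validating the disposition before it leaves your server. `caller_id`, `disposition`, and `consent` come from the step above; the remaining field names (`transcript_summary`, `calendly_invitee`) and the function name are assumptions, so align them with whatever your Make/Zapier scenario expects.

```javascript
const DISPOSITIONS = ['booked', 'voicemail', 'transfer'];

// Builds the JSON body posted to the automation webhook when a call ends.
function buildCallSummary({ callerId, disposition, consent, transcriptSummary, invitee = null }) {
  if (!DISPOSITIONS.includes(disposition)) {
    throw new Error(`Unknown disposition: ${disposition}`);
  }
  return {
    caller_id: callerId,
    disposition,
    consent: Boolean(consent),          // true only if the caller agreed to recording
    transcript_summary: transcriptSummary,
    calendly_invitee: invitee,          // null unless a booking was created
  };
}
```

Posting it is then a single `fetch(webhookUrl, { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(summary) })`, where `webhookUrl` is your own hook address.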
- 15
Instrument latency and quality; set budgets
Log end‑to‑end round‑trip (caller speech → first synthesized byte), barge‑in success rate, booking conversion, and cost per call. Tune chunk sizes, server region, and TTS speed—optimize for natural turn‑taking over raw word rate.
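To turn logged round-trip times into something you can hold a budget against, a percentile summary is usually enough. The sketch below uses the nearest-rank method; the function names and the idea of a single millisecond budget are assumptions.

```javascript
// Nearest-rank percentile of a list of numbers; null for an empty list.
function percentile(samples, p) {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarizes round-trip latencies (ms) against a budget (ms).
function latencyReport(samplesMs, budgetMs) {
  return {
    p50: percentile(samplesMs, 50),
    p95: percentile(samplesMs, 95),
    overBudget: samplesMs.filter((ms) => ms > budgetMs).length,
  };
}
```

For example, `latencyReport([300, 500, 400, 900, 700], 800)` returns `{ p50: 500, p95: 900, overBudget: 1 }`.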
- 16
Security and reliability checklist before go‑live
Verify X‑Twilio‑Signature on webhooks and WS upgrades, rotate API keys, enforce HTTPS/WSS, retry Calendly calls with backoff, and add health checks. Run a 10‑call script with intentional barge‑ins and DTMF to validate all branches.
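The retry-with-backoff item can be sketched as a delay schedule plus a thin wrapper. The defaults (four attempts, 250 ms base, 4 s cap) and the names are assumptions, and jitter is omitted to keep the example short.

```javascript
// Exponential delays between attempts: base, 2x, 4x, ... capped at capMs.
// maxAttempts tries need maxAttempts - 1 waits.
function backoffDelays(maxAttempts, baseMs = 250, capMs = 4000) {
  return Array.from({ length: maxAttempts - 1 },
    (_, i) => Math.min(capMs, baseMs * 2 ** i));
}

// Runs an async function, retrying on any thrown error until attempts run out.
async function withRetry(fn, { maxAttempts = 4,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)) } = {}) {
  const delays = backoffDelays(maxAttempts);
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= delays.length) throw err; // out of attempts
      await sleep(delays[attempt]);
    }
  }
}
```

In practice you would retry only on transient failures (timeouts, HTTP 429 or 5xx) and let client errors such as 400 or 401 fail immediately.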