Solo Operator Voice Agent Playbook: Twilio + OpenAI Realtime + Calendly (16‑Step Build Checklist)

A field-tested 16‑step checklist to build a low‑latency voice receptionist with Twilio Media Streams, OpenAI Realtime, and Calendly, complete with barge‑in wiring, DTMF/handoff fallbacks, legal consent flow, and post‑call automation hooks. Built for solo consultants who want booked meetings without hiring.

Use this checklist to ship a working, compliant voice receptionist that answers every call, qualifies naturally, and books meetings for you—without hiring. Follow the sequence; instrument latency and fail gracefully with DTMF and timeouts. Where the checklist says “starter repo,” grab the linked Node/Express example in the show notes.

  1. Buy a Twilio number and point Voice to your webhook

    In Twilio Console, purchase a local/toll‑free number and set the Voice webhook to POST https://[YOUR_DOMAIN]/voice. Disable default recording for now—you’ll start it after consent.

  2. Return <Connect><Stream> TwiML for bidirectional audio

    Your /voice handler should return TwiML that opens a bidirectional Media Stream to your WebSocket: <Response><Connect><Stream url="wss://[YOUR_DOMAIN]/media" /></Connect></Response>. Use <Connect><Stream> (not <Start><Stream>) for live back‑and‑forth.
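
    A minimal sketch of the TwiML builder, assuming you assemble the XML by hand rather than with Twilio's helper library. The function names buildVoiceTwiml and escapeXml are hypothetical; in Express you would send the result with a text/xml content type.

    ```javascript
    // Hypothetical helper: escape characters that are unsafe inside an XML attribute.
    function escapeXml(value) {
      return String(value)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
    }

    // Hypothetical helper: build the <Connect><Stream> response for POST /voice.
    function buildVoiceTwiml(streamUrl) {
      return (
        '<?xml version="1.0" encoding="UTF-8"?>' +
        "<Response><Connect>" +
        `<Stream url="${escapeXml(streamUrl)}" />` +
        "</Connect></Response>"
      );
    }

    // Usage inside an Express handler (Express itself is not imported here):
    //   res.type("text/xml").send(buildVoiceTwiml(process.env.STREAM_WSS));
    ```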

  3. Create .env with all required secrets and config

    Add: TWILIO_AUTH_TOKEN, OPENAI_API_KEY, REALTIME_MODEL, STREAM_WSS=wss://[YOUR_DOMAIN]/media, CALENDLY_TOKEN, CALENDLY_OWNER_URI, EVENT_TYPE_URI (the full API URI, e.g., https://api.calendly.com/event_types/[ID]), FORWARD_NUMBER, PUBLIC_BASE_URL.
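
    Failing fast on a missing variable saves a confusing mid-call error later. The sketch below assumes the variables are already loaded into process.env (for example by a dotenv-style loader, not shown); loadConfig is a hypothetical name.

    ```javascript
    // Variable names taken from the checklist above.
    const REQUIRED_KEYS = [
      "TWILIO_AUTH_TOKEN",
      "OPENAI_API_KEY",
      "REALTIME_MODEL",
      "STREAM_WSS",
      "CALENDLY_TOKEN",
      "CALENDLY_OWNER_URI",
      "EVENT_TYPE_URI",
      "FORWARD_NUMBER",
      "PUBLIC_BASE_URL",
    ];

    // Hypothetical helper: return a trimmed config object or throw with the
    // full list of missing keys so they can all be fixed in one pass.
    function loadConfig(env = process.env) {
      const missing = REQUIRED_KEYS.filter(
        (key) => typeof env[key] !== "string" || env[key].trim() === ""
      );
      if (missing.length > 0) {
        throw new Error(`Missing required config: ${missing.join(", ")}`);
      }
      return Object.fromEntries(REQUIRED_KEYS.map((key) => [key, env[key].trim()]));
    }
    ```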

  4. Stand up the Node/Express server with a Media Streams WS endpoint

    Expose POST /voice (returns TwiML) and a WS at /media for Twilio Streams. On WS upgrade, validate the X‑Twilio‑Signature to verify the stream is authentic before accepting frames.
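
    The signature check can be sketched with Node's crypto module alone. This follows Twilio's documented webhook scheme (HMAC-SHA1 over the URL plus sorted POST parameters, keyed with the auth token). The function names are hypothetical, and the exact URL string Twilio signs for the WebSocket handshake is an assumption to verify against Twilio's docs; the handshake carries no POST parameters, so params stays empty there.

    ```javascript
    const { createHmac, timingSafeEqual } = require("node:crypto");

    // HMAC-SHA1 over: full URL + each POST param (sorted by name) as name+value.
    function computeTwilioSignature(authToken, url, params = {}) {
      const data = Object.keys(params)
        .sort()
        .reduce((acc, key) => acc + key + params[key], url);
      return createHmac("sha1", authToken).update(data, "utf8").digest("base64");
    }

    // Compare in constant time; reject a missing or wrong-length header up front.
    function isValidTwilioSignature(authToken, headerSignature, url, params = {}) {
      const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
      const received = Buffer.from(String(headerSignature || ""));
      return expected.length === received.length && timingSafeEqual(expected, received);
    }

    // Usage (webhook):   isValidTwilioSignature(token, req.headers["x-twilio-signature"], fullUrl, req.body)
    // Usage (WS upgrade): isValidTwilioSignature(token, req.headers["x-twilio-signature"], streamUrl)
    ```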

  5. Handle Twilio Media Streams messages correctly

    Parse connected/start/media/mark/stop/dtmf events. Expect audio/x‑mulaw, 8 kHz, mono frames, base64‑encoded in each media message. Buffer in 20–40 ms chunks (160–320 bytes at one byte per sample); send keep‑alives; on stop, flush and close downstream connections cleanly.
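
    A minimal dispatcher for the inbound messages, assuming the message shapes in Twilio's Media Streams reference. handleTwilioMessage and the returned action objects are hypothetical; wire the actions to your own Realtime and DTMF handlers.

    ```javascript
    // Hypothetical dispatcher: parse one WebSocket text frame from Twilio and
    // return a small action object. `state` is a per-call object you own.
    function handleTwilioMessage(raw, state) {
      let msg;
      try {
        msg = JSON.parse(raw);
      } catch {
        return { type: "invalid" };
      }
      switch (msg.event) {
        case "connected":
          return { type: "connected" };
        case "start":
          // Keep the SIDs: streamSid is needed for every message sent back.
          state.streamSid = msg.start.streamSid;
          state.callSid = msg.start.callSid;
          return { type: "start", streamSid: state.streamSid };
        case "media":
          // Payload is base64 mu-law; one byte per sample at 8 kHz.
          return { type: "audio", audio: Buffer.from(msg.media.payload, "base64") };
        case "dtmf":
          return { type: "dtmf", digit: msg.dtmf.digit };
        case "mark":
          return { type: "mark", name: msg.mark.name };
        case "stop":
          return { type: "stop" };
        default:
          return { type: "ignored", event: msg.event };
      }
    }
    ```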

  6. Open a low‑latency Realtime connection to the model

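    From the WS handler, open a server‑side WebSocket to OpenAI Realtime. Initialize the session (voice, instructions, tool schema if any) and set the input and output audio formats to G.711 μ‑law so Twilio frames pass through without transcoding. Forward caller audio with input_audio_buffer.append; let server VAD commit each turn, or send input_audio_buffer.commit on a short timer if you disable it.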
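
    The two messages you send most often can be sketched as plain objects. The builder names are hypothetical, and the session field names below follow the flat, beta-era session shape; newer API versions nest audio settings differently, so treat the session body as an assumption and check the current Realtime reference. The append event shape (type plus base64 audio) is the stable part.

    ```javascript
    // Hypothetical builder for the initial session.update message.
    // ASSUMPTION: flat session fields (input_audio_format, turn_detection, ...).
    function buildSessionUpdate({ instructions, voice }) {
      return {
        type: "session.update",
        session: {
          instructions,
          voice,
          input_audio_format: "g711_ulaw", // matches Twilio's 8 kHz mu-law frames
          output_audio_format: "g711_ulaw",
          turn_detection: { type: "server_vad" }, // server commits turns for you
        },
      };
    }

    // Twilio's media payload is already base64 mu-law, so it can be forwarded
    // unchanged when the session input format is g711_ulaw.
    function buildAudioAppend(twilioPayloadBase64) {
      return { type: "input_audio_buffer.append", audio: twilioPayloadBase64 };
    }

    // Usage: realtimeSocket.send(JSON.stringify(buildAudioAppend(msg.media.payload)));
    ```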

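  7. Wire true barge‑in (clear, cancel, commit)

    When the caller interrupts or presses a key, immediately (1) send Twilio a clear message to flush its playback buffer, and (2) stop the model’s in‑flight audio on the Realtime session. Over a server‑side WebSocket that means response.cancel, optionally followed by conversation.item.truncate so the transcript matches what the caller actually heard; output_audio_buffer.clear applies to WebRTC connections. Then (3) commit the latest input buffer so the model responds to the new utterance. With server VAD enabled, the commit happens automatically.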
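
    A sketch of the barge‑in sequence as a pure function that returns the messages for each socket. Note the swap: on a WebSocket session this sketch cancels model output with response.cancel plus conversation.item.truncate, since output_audio_buffer.clear is, to my knowledge, a WebRTC‑side event. planBargeIn and its parameters are hypothetical; playedMs is your own estimate of how much assistant audio the caller heard.

    ```javascript
    // Hypothetical planner: decide what to send to Twilio and to the Realtime
    // socket when the caller interrupts. Sending is left to the caller.
    function planBargeIn({ streamSid, activeItemId, playedMs = 0, manualCommit = false }) {
      // 1) Flush audio Twilio has buffered but not yet played.
      const toTwilio = [{ event: "clear", streamSid }];

      const toRealtime = [];
      if (activeItemId) {
        // 2) Stop generation and trim the assistant item to what was heard.
        toRealtime.push({ type: "response.cancel" });
        toRealtime.push({
          type: "conversation.item.truncate",
          item_id: activeItemId,
          content_index: 0,
          audio_end_ms: Math.max(0, Math.floor(playedMs)),
        });
      }
      if (manualCommit) {
        // 3) Only needed when server VAD is off: commit, then ask for a reply.
        toRealtime.push({ type: "input_audio_buffer.commit" });
        toRealtime.push({ type: "response.create" });
      }
      return { toTwilio, toRealtime };
    }
    ```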

  8. Stream model speech back to Twilio in the right format

    As the model emits audio, base64‑encode as audio/x‑mulaw at 8 kHz and send as Twilio media messages. Keep chunks small and sequential; on interruption, send Clear before new audio to prevent talk‑over.
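
    One way to keep chunks small is to slice each model audio delta into 20 ms frames before sending. The sketch assumes the session already outputs G.711 μ‑law, so the bytes need no transcoding; toTwilioMediaMessages is a hypothetical name.

    ```javascript
    // 20 ms of 8 kHz mono mu-law at one byte per sample.
    const FRAME_BYTES = 160;

    // Hypothetical helper: turn one base64 audio delta from the model into
    // ordered Twilio "media" messages, ready for socket.send().
    function toTwilioMediaMessages(streamSid, deltaBase64, frameBytes = FRAME_BYTES) {
      const audio = Buffer.from(deltaBase64, "base64");
      const messages = [];
      for (let offset = 0; offset < audio.length; offset += frameBytes) {
        const frame = audio.subarray(offset, offset + frameBytes);
        messages.push(
          JSON.stringify({
            event: "media",
            streamSid,
            media: { payload: frame.toString("base64") },
          })
        );
      }
      return messages;
    }
    ```

    Sending a mark message after the last frame of a response lets you learn when Twilio has finished playing it, which helps estimate how much the caller heard before an interruption.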

  9. Add Calendly booking flow (no redirect)

    Fetch the host’s event types (GET /event_types), pull availability for the chosen EVENT_TYPE_URI and date range (GET /event_type_available_times), then create the booking with POST /invitees using caller name, email, and selected slot.
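
    The availability and booking requests can be prepared as below. Two assumptions to verify against Calendly's API reference: the availability endpoint accepts at most a 7‑day window per request (so longer ranges are split), and the POST /invitees body uses event_type, start_time, and an invitee object. The function names are hypothetical.

    ```javascript
    const CALENDLY_API = "https://api.calendly.com";
    const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

    // Hypothetical helper: one GET /event_type_available_times URL per window
    // of at most 7 days (ASSUMPTION: 7 days is the per-request maximum).
    function availabilityUrls(eventTypeUri, startIso, endIso) {
      const urls = [];
      const end = Date.parse(endIso);
      for (let from = Date.parse(startIso); from < end; from += WEEK_MS) {
        const to = Math.min(from + WEEK_MS, end);
        const url = new URL("/event_type_available_times", CALENDLY_API);
        url.searchParams.set("event_type", eventTypeUri);
        url.searchParams.set("start_time", new Date(from).toISOString());
        url.searchParams.set("end_time", new Date(to).toISOString());
        urls.push(url.toString());
      }
      return urls;
    }

    // Hypothetical helper. ASSUMPTION: body field names for POST /invitees.
    function buildInviteeBody({ eventTypeUri, startTime, name, email, timezone }) {
      return {
        event_type: eventTypeUri,
        start_time: startTime,
        invitee: { name, email, timezone },
      };
    }

    // Usage: fetch(url, { headers: { Authorization: `Bearer ${CALENDLY_TOKEN}` } })
    ```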

  10. Confirm booking and send artifacts

    Read back the date/time, then SMS or email the invitee confirmation URL from the Calendly response. As a fallback path, you can generate a single‑use scheduling link if API booking fails.
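
    For the fallback path, a single‑use link comes from Calendly's scheduling links endpoint with max_event_count set to 1. The sketch only prepares the request; buildSingleUseLinkRequest is a hypothetical name, and reading the link from resource.booking_url in the response is my recollection of the API, so confirm it before relying on it.

    ```javascript
    // Hypothetical helper: request options for POST /scheduling_links.
    function buildSingleUseLinkRequest(eventTypeUri, token) {
      return {
        url: "https://api.calendly.com/scheduling_links",
        method: "POST",
        headers: {
          Authorization: `Bearer ${token}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          max_event_count: 1, // link expires after one booking
          owner: eventTypeUri,
          owner_type: "EventType",
        }),
      };
    }

    // Usage (Node 18+ global fetch):
    //   const { url, ...init } = buildSingleUseLinkRequest(EVENT_TYPE_URI, CALENDLY_TOKEN);
    //   const data = await (await fetch(url, init)).json();
    //   // ASSUMPTION: the link is at data.resource.booking_url
    ```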

  11. Implement guardrails: DTMF + human handoff + voicemail

    Handle inbound DTMF from Twilio Streams: 0 = connect to a human at [FORWARD_NUMBER]; 1 = send a scheduling link via SMS; 9 = go to voicemail. Note: Streams support inbound DTMF only—you can’t send DTMF back to Twilio from your server.
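
    The key map above fits in a small lookup. The action names are hypothetical labels for your own handlers; unknown digits are ignored rather than treated as errors.

    ```javascript
    // Digit-to-action map from the checklist; action names are illustrative.
    const DTMF_ACTIONS = new Map([
      ["0", "transfer_to_human"], // dial FORWARD_NUMBER
      ["1", "sms_scheduling_link"],
      ["9", "voicemail"],
    ]);

    // Hypothetical router: returns the action for a pressed key, or "ignore".
    function routeDtmf(digit) {
      return DTMF_ACTIONS.get(String(digit)) ?? "ignore";
    }
    ```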

  12. Record legally: play disclosure, capture consent, then start

    Before recording, play a clear disclosure (e.g., “This call may be recorded for quality and training. Do I have your permission?”). If yes, start recording and store a consent flag + transcript snippet. Check your state rules before enabling cross‑state recording.
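
    A deliberately conservative consent gate, assuming you have the caller's reply as transcript text. The keyword lists and function names are hypothetical, and this is not legal advice: any negative word wins, and anything ambiguous stays "unclear" so you can ask again instead of recording.

    ```javascript
    const YES_WORDS = new Set(["yes", "yeah", "yep", "sure", "okay", "ok"]);
    const NO_WORDS = new Set(["no", "nope", "not", "don't", "dont", "never"]);

    // Hypothetical classifier: "granted", "denied", or "unclear".
    function classifyConsent(utterance) {
      const words =
        String(utterance)
          .toLowerCase()
          .replace(/\u2019/g, "'") // normalize curly apostrophes
          .match(/[a-z']+/g) || [];
      if (words.some((w) => NO_WORDS.has(w))) return "denied";
      if (words.some((w) => YES_WORDS.has(w))) return "granted";
      return "unclear";
    }

    // Hypothetical record: the consent flag plus a short transcript snippet.
    function consentRecord(callSid, utterance, now = new Date()) {
      const decision = classifyConsent(utterance);
      return {
        callSid,
        consent: decision === "granted",
        decision,
        snippet: String(utterance).slice(0, 200),
        at: now.toISOString(),
      };
    }
    ```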

  13. Handle timeouts and errors without dead‑ends

    No speech 6–8s → reprompt once; 2nd timeout → voicemail and transcript to your CRM. If Realtime or Calendly errors, apologize and text a single‑use booking link automatically.
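
    The reprompt-then-voicemail rule can be kept in a tiny state holder. The names are hypothetical, and one assumption is baked in: the counter tracks consecutive timeouts and resets whenever the caller speaks.

    ```javascript
    // Within the 6–8 s window from the checklist.
    const SILENCE_TIMEOUT_MS = 7000;

    // Hypothetical per-call state.
    function createSilenceState() {
      return { silenceTimeouts: 0 };
    }

    // Call when SILENCE_TIMEOUT_MS passes with no caller speech.
    function onSilenceTimeout(state) {
      state.silenceTimeouts += 1;
      return state.silenceTimeouts === 1
        ? { action: "reprompt" }
        : { action: "voicemail", sendTranscriptToCrm: true };
    }

    // ASSUMPTION: any caller speech resets the consecutive-timeout count.
    function onCallerSpeech(state) {
      state.silenceTimeouts = 0;
    }

    // Realtime or Calendly failure: never leave the caller at a dead end.
    function onServiceError() {
      return { action: "apologize", textSingleUseLink: true };
    }
    ```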

  14. Ship ops glue: summaries and follow‑ups

    On call end, post a JSON payload to your Make/Zapier webhook: caller_id, disposition (booked/voicemail/transfer), consent, transcript summary, and Calendly invitee data. Create tasks/notes in your CRM and send a recap email/SMS to the prospect.
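
    A sketch of the end-of-call payload. caller_id, disposition, and consent come from the list above; the transcript_summary and calendly_invitee key names, and the function name, are assumptions to adapt to your Make/Zapier scenario.

    ```javascript
    const DISPOSITIONS = new Set(["booked", "voicemail", "transfer"]);

    // Hypothetical builder: validate the disposition and shape the JSON body.
    function buildCallSummary({ callerId, disposition, consent, transcriptSummary, invitee = null }) {
      if (!DISPOSITIONS.has(disposition)) {
        throw new Error(`Unknown disposition: ${disposition}`);
      }
      return {
        caller_id: callerId,
        disposition,
        consent: Boolean(consent),
        transcript_summary: transcriptSummary,
        calendly_invitee: invitee, // null unless a booking was created
      };
    }

    // Usage (Node 18+ global fetch; webhookUrl is your own Make/Zapier URL):
    //   await fetch(webhookUrl, {
    //     method: "POST",
    //     headers: { "Content-Type": "application/json" },
    //     body: JSON.stringify(buildCallSummary(call)),
    //   });
    ```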

  15. Instrument latency and quality; set budgets

    Log end‑to‑end round‑trip (caller speech → first synthesized byte), barge‑in success rate, booking conversion, and cost per call. Tune chunk sizes, server region, and TTS speed—optimize for natural turn‑taking over raw word rate.
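
    Averages hide the slow turns callers actually notice, so a budget is easier to enforce on percentiles. The sketch below uses the nearest-rank method on round‑trip samples in milliseconds; the function names and the example budget are hypothetical.

    ```javascript
    // Nearest-rank percentile over a list of numbers; null for an empty list.
    function percentile(samples, p) {
      if (samples.length === 0) return null;
      const sorted = [...samples].sort((a, b) => a - b);
      const rank = Math.ceil((p / 100) * sorted.length);
      return sorted[Math.max(0, rank - 1)];
    }

    // Hypothetical summary: p50/p95 plus how many turns exceeded the budget.
    function summarizeLatency(samplesMs, budgetMs) {
      return {
        p50: percentile(samplesMs, 50),
        p95: percentile(samplesMs, 95),
        overBudget: samplesMs.filter((ms) => ms > budgetMs).length,
      };
    }

    // Usage: summarizeLatency(turnLatenciesMs, 800)
    ```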

  16. Security and reliability checklist before go‑live

    Verify X‑Twilio‑Signature on webhooks and WS upgrades, rotate API keys, enforce HTTPS/WSS, backoff/retry Calendly calls, and add health checks. Run a 10‑call script with intentional barge‑ins and DTMF to validate all branches.
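
    For the backoff/retry item, a capped exponential schedule with a little jitter is a common choice. This is a generic sketch, not something Calendly prescribes; the names, defaults, and the 20% jitter are assumptions.

    ```javascript
    // Delay before each retry: base * 2^i, capped.
    function backoffDelays(retries, baseMs = 250, capMs = 4000) {
      return Array.from({ length: retries }, (_, i) => Math.min(capMs, baseMs * 2 ** i));
    }

    // Hypothetical wrapper: run an async task, retrying on failure.
    // `shouldRetry` lets you skip retries for errors that will not recover
    // (for example, a 4xx validation error).
    async function withRetry(task, { retries = 3, baseMs = 250, capMs = 4000, shouldRetry = () => true } = {}) {
      const delays = backoffDelays(retries, baseMs, capMs);
      for (let attempt = 0; ; attempt += 1) {
        try {
          return await task(attempt);
        } catch (err) {
          if (attempt >= delays.length || !shouldRetry(err)) throw err;
          // Up to 20% random jitter so parallel calls do not retry in lockstep.
          const jitter = Math.random() * delays[attempt] * 0.2;
          await new Promise((resolve) => setTimeout(resolve, delays[attempt] + jitter));
        }
      }
    }

    // Usage: const slots = await withRetry(() => fetchAvailability(url));
    //        (fetchAvailability is your own function, not shown here)
    ```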