Episode 9

Turn Missed Calls into Booked Consults: Voice Agent with Twilio + Realtime API + Calendly

Intro

This episode is for solo consultants and service providers who lose deals to missed calls while doing deep client work. You'll get a complete technical walkthrough of building a voice AI receptionist that qualifies leads and books meetings without human intervention.

In This Episode

Jordan breaks down his complete Twilio voice AI agent build after losing an $8,000 project to a missed call. He covers the technical stack from Twilio Media Streams and OpenAI Realtime API integration to Calendly booking automation, walking through the critical barge-in implementation that handles caller interruptions, compliance requirements for call recording consent, and essential guardrails like DTMF fallbacks and timeout handling. The episode includes specific code patterns, latency optimization strategies, and a Make scenario for post-call automation that creates CRM notes and follow-up workflows.

Key Takeaways

  • Implement proper barge-in by clearing both Twilio's playback buffer and OpenAI's output buffer simultaneously, then committing new input audio to handle natural conversation interruptions
  • Use Calendly's three-step API flow (event types → availability → POST /invitees) to book meetings directly without redirecting callers to a web interface
  • Always play a consent disclosure before recording calls and check state-specific requirements using the RCFP guide, as multiple states require all-party consent beyond federal one-party rules

Jordan: Tuesday afternoon. Two-fifteen. I'm three hours deep into a Make scenario for a client — the kind of flow state where you forget to eat. Phone buzzes. Unknown number. I let it ring. Voicemail notification pops up about forty seconds later. I ignore it. Back to work.

Five-thirty, I finally check. It's a referral from my best client. Guy runs a commercial cleaning company, needs his entire dispatch automated, budget's north of eight thousand dollars. And he says — and I remember this exactly — "I called two other people first. They didn't pick up either, but one of them had this... voice thing that answered. Asked me a few questions, booked me a call for Thursday. So I'm going with them."

Eight thousand dollar project. Gone. Not because my work was worse. Not because my price was higher. Because someone else's phone answered and mine didn't.

That was November. By December, I had a Twilio voice AI agent running on my line — picks up every call, qualifies the lead with a natural conversation, and books them straight into my Calendly. No receptionist. No VA. Just a Node server, OpenAI's Realtime API, and about six hours of build time.

Today I'm walking you through exactly how I built it.

Jordan: Imagine this. You're on a client call — the kind that actually matters, the kind where you're scoping a ten thousand dollar engagement — and your phone lights up with an unknown number. You can't answer. You shouldn't answer. But that unknown number? That's a warm referral with budget and urgency, and they're about to call the next person on their list. Now imagine your line picks up anyway. A voice — natural, conversational — greets them, asks what they need, confirms their budget range, and books a thirty-minute consult on your calendar for tomorrow at two. You finish your client call, check your phone, and there's a Notion note waiting: caller name, company, project scope, booked slot, transcript summary. That's what we're building today. Twilio, OpenAI Realtime, Calendly, and a Make scenario to tie the loose ends. This is Headcount Zero.

Jordan: So let me set the stage on why this matters more than you think. When I lost that cleaning company deal, I went back and audited my call log for the previous quarter. Seventeen missed calls from numbers I didn't recognize. I returned nine of them. Of those nine, four had already hired someone else. The other five went to voicemail and never called back. That's potentially forty, fifty thousand dollars in pipeline that evaporated because I was doing the thing I'm supposed to be doing — deep work for existing clients.

And the usual answer to this is hire a VA, hire a receptionist, use an answering service. But those options are either expensive — a decent virtual receptionist runs eight hundred to fifteen hundred a month — or they can't actually qualify leads. They take a message. They don't ask about budget. They don't check your calendar availability. They definitely don't book the meeting.

So here's the stack. A Twilio phone number — that's your entry point, costs about a dollar a month plus per-minute usage. That number points to a webhook on your server. When a call comes in, your server returns a piece of TwiML — that's Twilio's markup language — that opens a bidirectional Media Stream over WebSocket. Caller audio flows in, your server pipes it to OpenAI's Realtime API, the model responds, and that audio streams back to Twilio and into the caller's ear. The whole round trip happens in real time. Feels like talking to a person.

And then Calendly's API handles the booking piece. The model collects the caller's name, email, what they need — and when they're ready to schedule, it hits Calendly's availability endpoint, offers a couple of slots, and creates the booking via their invitees API. No redirect to a web page. No "I'll have someone call you back." The meeting is confirmed before they hang up.

Okay. Let me get specific about the Twilio side because this is where the first gotcha lives. There are two ways to start Media Streams. You can use Start Stream, which gives you unidirectional audio — good for transcription, not what we want. Or you can use Connect Stream, which opens a bidirectional channel. That's the one. Your TwiML response literally just wraps a Connect tag around a Stream tag pointing to your WebSocket URL. That's it. Four lines of markup.
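
For reference, the four lines of markup could come out of a small helper like this one. It is a minimal sketch: `connectStreamTwiml`, `escapeXml`, and the example WebSocket URL are placeholder names, not part of Jordan's build.

```javascript
// Escape characters that are unsafe inside an XML attribute value.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the TwiML that asks Twilio to open a bidirectional Media Stream.
// `wsUrl` is a placeholder for your own wss:// endpoint.
function connectStreamTwiml(wsUrl) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXml(wsUrl)}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

console.log(connectStreamTwiml("wss://example.com/media"));
```

The voice webhook would return this string with a `text/xml` content type.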

The audio format is locked — mu-law encoding, eight thousand hertz, mono. You don't get to choose this. Twilio sends it, Twilio expects it back. So when you're encoding model audio to send back to the caller, it has to be mu-law at eight K, base sixty-four encoded. Get that wrong and you'll hear static or nothing at all. I spent roughly forty-five minutes debugging silence on my first attempt before I realized I was sending sixteen K audio back to an eight K pipe.

Classic.
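
To make the format concrete, here is a minimal sketch of the return path, assuming you already hold 8 kHz mono 16-bit PCM samples (resampling from a higher rate is out of scope here). The encoder is the standard G.711 mu-law companding; `toTwilioMedia` and the stream SID are illustrative. If the Realtime session is configured to emit mu-law directly, the encoding step may not be needed at all.

```javascript
const BIAS = 0x84;  // standard G.711 mu-law bias
const CLIP = 32635; // largest magnitude that still fits after adding the bias

// Convert one signed 16-bit PCM sample to one mu-law byte.
function linearToMulaw(sample) {
  const sign = sample < 0 ? 0x80 : 0x00;
  const magnitude = Math.min(Math.abs(sample), CLIP) + BIAS;
  // Find the segment (exponent) from the highest set bit, 14 down to 7.
  let exponent = 7;
  for (let mask = 0x4000; (magnitude & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  // mu-law bytes are stored bit-inverted.
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Wrap 8 kHz mono PCM16 samples as a Twilio outbound "media" message.
function toTwilioMedia(streamSid, pcm16) {
  const bytes = Buffer.alloc(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) {
    bytes[i] = linearToMulaw(pcm16[i]);
  }
  return JSON.stringify({
    event: "media",
    streamSid,
    media: { payload: bytes.toString("base64") },
  });
}

const exampleMessage = toTwilioMedia("MZ_example", Int16Array.from([0, 1000, -1000]));
console.log(exampleMessage);
```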

Now — the Realtime API connection. From your WebSocket handler, you open a server-side WebSocket to OpenAI's Realtime endpoint. You initialize the session with your system prompt — mine says something like "You are a scheduling assistant for an automation consultancy. Your job is to understand what the caller needs, confirm their budget range and timeline, and book a consultation." You define the voice, the tools the model can call, and you start forwarding audio frames.
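
A session setup message might look roughly like the sketch below. Treat the field names as assumptions: they follow the flat, beta-era `session.update` shape (`input_audio_format`, `turn_detection`, and so on), and newer API versions may nest the audio settings differently, so check the current reference. The `book_consult` tool, its parameters, and the `alloy` voice choice are illustrative, not taken from Jordan's build.

```javascript
// Hypothetical tool the model can call once it has the caller's details.
const bookConsultTool = {
  type: "function",
  name: "book_consult",
  description: "Book a consultation once name, email, and a slot are confirmed.",
  parameters: {
    type: "object",
    properties: {
      name: { type: "string" },
      email: { type: "string" },
      start_time: { type: "string", description: "ISO 8601 start of the chosen slot" },
    },
    required: ["name", "email", "start_time"],
  },
};

// Build the first message sent after the Realtime WebSocket opens.
// Assumes the beta-style session fields; verify against current docs.
function buildSessionUpdate({ instructions, voice = "alloy", tools = [] }) {
  return {
    type: "session.update",
    session: {
      instructions,
      voice,
      input_audio_format: "g711_ulaw",  // matches Twilio's mu-law at 8 kHz
      output_audio_format: "g711_ulaw", // lets audio pass back without re-encoding
      turn_detection: { type: "server_vad" },
      tools,
    },
  };
}

const sessionUpdate = buildSessionUpdate({
  instructions:
    "You are a scheduling assistant for an automation consultancy. " +
    "Understand what the caller needs, confirm budget range and timeline, " +
    "and book a consultation.",
  tools: [bookConsultTool],
});
console.log(JSON.stringify(sessionUpdate, null, 2));
```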

Caller speaks, you append those frames to the input audio buffer. The model's Server VAD — voice activity detection — handles turn-taking automatically. When the model decides the caller is done talking, it commits the buffer and generates a response. That response streams back as audio chunks, you encode them to mu-law, and send them to Twilio.
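
The inbound half of that loop can be sketched as a small translator from a Twilio stream message to a Realtime client event. This assumes the session's input format is set to mu-law, so the base64 payload can pass through untouched; `twilioToRealtime` is an illustrative name.

```javascript
// Translate one raw Twilio Media Streams message into a Realtime
// "input_audio_buffer.append" event, or null if it carries no audio.
// Assumes the Realtime session accepts mu-law input, so no transcoding.
function twilioToRealtime(rawMessage) {
  const msg = JSON.parse(rawMessage);
  if (msg.event !== "media" || !msg.media || !msg.media.payload) {
    return null; // "connected", "start", "mark", "stop", "dtmf", etc.
  }
  return { type: "input_audio_buffer.append", audio: msg.media.payload };
}

const sampleInbound = JSON.stringify({
  event: "media",
  streamSid: "MZ_example",
  media: { track: "inbound", payload: "/38A" },
});
console.log(twilioToRealtime(sampleInbound));
```

In a WebSocket handler, each non-null result would be serialized with `JSON.stringify` and sent on the Realtime socket.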

Now. This is where most people get stuck, so let's slow down. Barge-in. If the caller interrupts the agent mid-sentence — and they will, because that's how people talk on the phone — you need to handle it at both layers simultaneously. You send a Clear message to Twilio, which flushes its playback buffer so the caller stops hearing the old audio. And you call output audio buffer clear on the Realtime session, which tells the model to stop generating that response. Then you commit the new input so the model responds to what the caller actually just said.

If you only clear one side, you get talk-over. The caller hears the agent keep talking for another half second while also hearing the new response start. It sounds broken. It sounds like a bad IVR from two thousand and eight. Two clears and a commit. That's the pattern. Twilio Clear, Realtime output buffer clear, then input buffer commit.
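
A minimal sketch of the interrupt handling as plain message builders, under some loudly stated assumptions. On the Twilio side it uses the `clear` stream message. On the Realtime side it uses `response.cancel` and `conversation.item.truncate`, which stand in for the output-buffer clear described above: `output_audio_buffer.clear` is, as far as the public docs indicate, aimed at WebRTC sessions, so check which events your connection type accepts. With Server VAD enabled the input buffer is committed automatically, so this sketch sends no explicit commit. `bargeInMessages`, `itemId`, and `playedMs` are illustrative names.

```javascript
// Decide whether a Realtime server event should trigger barge-in handling.
// "input_audio_buffer.speech_started" fires when Server VAD hears the caller.
function isBargeInTrigger(evt, agentIsSpeaking) {
  return Boolean(agentIsSpeaking) && evt.type === "input_audio_buffer.speech_started";
}

// Build the messages for both layers when the caller talks over the agent.
// `itemId` is the assistant item currently being spoken and `playedMs` is how
// much of it the caller has heard; your server would track both itself.
function bargeInMessages({ streamSid, itemId = null, playedMs = 0 }) {
  // Layer 1: flush audio Twilio has buffered but not yet played.
  const toTwilio = [{ event: "clear", streamSid }];

  // Layer 2: stop the in-flight response on the Realtime side.
  const toRealtime = [{ type: "response.cancel" }];
  if (itemId) {
    // Trim the stored assistant turn to what was actually heard.
    toRealtime.push({
      type: "conversation.item.truncate",
      item_id: itemId,
      content_index: 0,
      audio_end_ms: Math.max(0, Math.floor(playedMs)),
    });
  }
  return { toTwilio, toRealtime };
}

console.log(bargeInMessages({ streamSid: "MZ_example", itemId: "item_1", playedMs: 1240.6 }));
```

Sending the Twilio message first is a small design choice: it stops what the caller hears immediately, even if the Realtime side takes a moment to acknowledge.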

The Calendly integration is cleaner than you'd expect. Their developer docs have a specific guide for AI agents — which tells you this use case is not fringe anymore. Three API calls. First, you grab your event types — that's a GET to slash event types. Second, you fetch available times for the event type and date range the caller wants. Third, you create the booking with a POST to slash invitees, passing the caller's name, email, and selected slot.
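
The three calls could be sketched as request builders plus one thin `fetch` wrapper (Node 18 or later for the global `fetch`). The `/event_types` and `/event_type_available_times` paths and their query parameters follow Calendly's public API. The JSON body for `POST /invitees` is an assumption and should be checked against the current docs. All URIs and the token are placeholders.

```javascript
const CALENDLY = "https://api.calendly.com";

// Step 1: list the event types owned by a user.
function listEventTypesRequest(userUri) {
  const url = new URL("/event_types", CALENDLY);
  url.searchParams.set("user", userUri);
  return { method: "GET", url: url.toString() };
}

// Step 2: fetch open slots for one event type within a date range.
function availableTimesRequest(eventTypeUri, startIso, endIso) {
  const url = new URL("/event_type_available_times", CALENDLY);
  url.searchParams.set("event_type", eventTypeUri);
  url.searchParams.set("start_time", startIso);
  url.searchParams.set("end_time", endIso);
  return { method: "GET", url: url.toString() };
}

// Step 3: create the booking. Body field names are an assumption.
function createInviteeRequest({ eventTypeUri, startIso, name, email, timezone }) {
  return {
    method: "POST",
    url: `${CALENDLY}/invitees`,
    body: {
      event_type: eventTypeUri,
      start_time: startIso,
      invitee: { name, email, timezone },
    },
  };
}

// Send any of the requests above with a personal access token.
async function sendCalendly(req, token) {
  const res = await fetch(req.url, {
    method: req.method,
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/json" },
    body: req.body ? JSON.stringify(req.body) : undefined,
  });
  if (!res.ok) {
    throw new Error(`Calendly ${req.method} ${req.url} failed with ${res.status}`);
  }
  return res.json();
}
```

Keeping the builders separate from the network call makes it easy to log or unit-test exactly what the agent is about to send.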

The booking triggers all the normal Calendly workflows — confirmation email, calendar invite, reminders. The caller gets the same experience as if they'd booked through your website. And if the API call fails for whatever reason — rate limit, network blip — you fall back to generating a single-use scheduling link and texting it to the caller. They can book themselves. No dead end.
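
The fallback path could be sketched like this. The request shape for a single-use link follows Calendly's `scheduling_links` endpoint (`max_event_count`, `owner`, `owner_type`). Everything else is an assumption for illustration: `bookOrFallback`, the injected `book`, `createLink`, and `sendSms` functions, and the SMS wording.

```javascript
// Request for a single-use scheduling link tied to one event type.
function singleUseLinkRequest(eventTypeUri) {
  return {
    method: "POST",
    url: "https://api.calendly.com/scheduling_links",
    body: { max_event_count: 1, owner: eventTypeUri, owner_type: "EventType" },
  };
}

// Try to book directly; on any failure, text the caller a one-time link.
// `book`, `createLink`, and `sendSms` are hypothetical async functions that
// your server supplies (for example, wrappers around fetch and an SMS API).
async function bookOrFallback({ book, createLink, sendSms, callerPhone }) {
  try {
    const booking = await book();
    return { disposition: "booked", booking };
  } catch (err) {
    const bookingUrl = await createLink();
    await sendSms(callerPhone, `Here's a link to pick a time: ${bookingUrl}`);
    return {
      disposition: "link_sent",
      bookingUrl,
      reason: String((err && err.message) || err),
    };
  }
}
```

Injecting the three functions keeps the control flow testable without touching the network.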

Recording. You're going to want call recordings for quality, for training the system, for your own CRM notes. But you cannot just start recording. Federal law in the US is one-party consent — meaning you, as one party, can consent to recording your own calls. But multiple states — California, Illinois, Florida, a bunch of others — require all-party consent. Every person on the call has to know and agree.

So before your agent does anything else, it plays a disclosure. Something like: "This call may be recorded for quality and training purposes. Do I have your permission to continue?" If they say yes, you flag consent in your session data, start the recording, and proceed. If they say no, you skip the recording and still handle the call. The Reporters Committee for Freedom of the Press maintains a state-by-state guide — I'll link it in the show notes — and I'd strongly recommend checking it before you go live, especially if you serve clients across state lines.
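
The consent gate itself can be very small. In this sketch the yes/no decision is assumed to arrive from elsewhere (for example, a tool call from the model), and anything other than an explicit yes is treated as no consent, which is a conservative design choice rather than something stated in the episode. `applyConsent` and the session fields are illustrative.

```javascript
const DISCLOSURE =
  "This call may be recorded for quality and training purposes. " +
  "Do I have your permission to continue?";

// Record the caller's answer on the session and decide whether to record.
// Only an explicit `true` counts as consent; unclear answers do not.
function applyConsent(session, granted) {
  const consent = granted === true;
  return { ...session, consent, recording: consent };
}

const callSession = { callSid: "CA_example", consent: false, recording: false };
console.log(DISCLOSURE);
console.log(applyConsent(callSession, true));
console.log(applyConsent(callSession, undefined));
```

Either way the call continues; only the recording step is skipped when consent is missing.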

Phone calls are not chat windows. People get frustrated faster. They expect escape hatches. So you build them in. DTMF — that's the keypad tones. Twilio's bidirectional streams support inbound DTMF, meaning your server can detect when the caller presses a key. I map zero to "connect me to a human" — which triggers a transfer to my actual phone. One sends a scheduling link via SMS. Nine goes to voicemail.

Quick note — bidirectional streams only support inbound DTMF. You can't send tones back to Twilio from your server. So your agent needs to verbally tell the caller their options: "Press zero at any time to reach a person directly."
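
The key mapping could be sketched as a lookup over Twilio's inbound `dtmf` stream message. The action names are placeholders for whatever your server does next (transfer, SMS, voicemail).

```javascript
// Keypad digit to action, mirroring the mapping described above.
const KEY_ACTIONS = new Map([
  ["0", "transfer_to_human"],
  ["1", "send_scheduling_link_sms"],
  ["9", "voicemail"],
]);

// Return the action for a raw Twilio stream message, or null if the
// message is not a DTMF event or the digit is unmapped.
function dtmfAction(rawMessage) {
  const msg = JSON.parse(rawMessage);
  if (msg.event !== "dtmf" || !msg.dtmf) return null;
  return KEY_ACTIONS.get(msg.dtmf.digit) || null;
}

const pressZero = JSON.stringify({
  event: "dtmf",
  streamSid: "MZ_example",
  dtmf: { track: "inbound_track", digit: "0" },
});
console.log(dtmfAction(pressZero));
```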

Timeouts matter too. If the caller goes silent for six to eight seconds, the agent reprompts once. Second timeout — it apologizes, offers to text a booking link, and drops to voicemail. That voicemail gets transcribed and posted to your CRM automatically through a Make scenario. No call ends in a dead silence void.
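
The two-strike silence handling can be written as a tiny state machine. The seven-second value sits inside the six-to-eight-second range mentioned above. Resetting the counter when the caller speaks again is an assumption, and the action names are placeholders.

```javascript
const SILENCE_MS = 7000; // within the 6 to 8 second window

// Called when the silence timer fires. Returns the next state and action.
function onSilence(state) {
  const timeouts = state.timeouts + 1;
  const action = timeouts === 1 ? "reprompt" : "offer_link_then_voicemail";
  return { state: { ...state, timeouts }, action };
}

// Called when the caller speaks; assumed to reset the strike count.
function onSpeech(state) {
  return { ...state, timeouts: 0 };
}

let silenceState = { timeouts: 0 };
const firstStrike = onSilence(silenceState);
const secondStrike = onSilence(firstStrike.state);
console.log(SILENCE_MS, firstStrike.action, secondStrike.action);
```

A real handler would start a `setTimeout(…, SILENCE_MS)` after each agent turn and clear it as soon as caller audio is detected.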

Now — I want to be honest about the biggest weakness of this setup. Latency. Voice AI on the phone network is not the same as a chatbot. You've got eight kilohertz telephony audio, network jitter, buffering at multiple layers. Twilio doesn't publish specific latency targets for Media Streams, and in my experience, the round-trip time — from when the caller finishes speaking to when they hear the first word of the response — varies. Sometimes it's fast enough to feel natural. Sometimes there's a beat that feels just slightly too long.

The mitigations are real though. Keep your server geographically close to Twilio's edge. Stream small audio chunks. Let Server VAD handle turn detection instead of building your own silence timer. And the two-clear-plus-commit pattern for barge-in makes interruptions feel responsive even when the initial response has a slight delay. Phonely — a startup Twilio highlighted in their AI awards — has built production voice agents on this exact stack and specifically cites ultra-low latency as their differentiator. So the ceiling is high. But you'll need to instrument your round-trip times and tune.
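
One way to instrument round-trip times is a small tracker that timestamps the end of caller speech and the first outbound audio chunk. This is a sketch with illustrative names. It measures only the server's view (for example, from the Realtime `input_audio_buffer.speech_stopped` event to the first audio chunk sent back), not the telephony legs on either side.

```javascript
class TurnLatencyTracker {
  // `now` is injectable so the tracker can be tested with a fake clock.
  constructor(now = () => Date.now()) {
    this.now = now;
    this.pendingSince = null;
    this.samples = [];
  }

  // Call when the caller's turn ends.
  callerStopped() {
    this.pendingSince = this.now();
  }

  // Call on the first audio chunk of the response. Returns the latency in ms,
  // or null if no turn was pending (for example, later chunks of the same reply).
  firstAudioOut() {
    if (this.pendingSince === null) return null;
    const ms = this.now() - this.pendingSince;
    this.samples.push(ms);
    this.pendingSince = null;
    return ms;
  }

  // Nearest-rank percentile over all recorded turns.
  percentile(p) {
    if (this.samples.length === 0) return null;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length);
    return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
  }
}
```

Logging the 50th and 95th percentiles per call is usually more useful than an average, since the occasional slow turn is what callers notice.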

The honest assessment? For qualification calls where the conversation is structured — "what do you need, what's your budget, when do you need it, let's book a time" — the latency is fine. For free-flowing, unstructured conversation? You'll notice it. Build for the structured use case first.

Last piece. When the call ends, your server fires a JSON payload to a Make webhook. Caller ID, disposition — booked, voicemail, transferred to human — consent flag, transcript summary, and the Calendly invitee data if a meeting was booked. Make creates a Notion page with the call summary, adds a task if follow-up is needed, and sends a recap SMS to the prospect: "Thanks for calling. Your consultation is confirmed for Thursday at two. Here's what we'll cover." That takes roughly twenty minutes to set up in Make and it closes the loop completely.
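
The end-of-call payload could be assembled like this. The field names and the webhook URL are assumptions for illustration; Make accepts arbitrary JSON on a custom webhook, so the shape is yours to define. Requires Node 18 or later for the global `fetch`.

```javascript
const DISPOSITIONS = ["booked", "voicemail", "transferred_to_human"];

// Assemble the post-call summary sent to the Make scenario.
function buildCallSummary({ callerId, disposition, consent, summary, invitee = null }) {
  if (!DISPOSITIONS.includes(disposition)) {
    throw new Error(`Unknown disposition: ${disposition}`);
  }
  return {
    caller_id: callerId,
    disposition,
    consent: Boolean(consent),
    transcript_summary: summary,
    // Only attach booking data when a meeting was actually booked.
    calendly_invitee: disposition === "booked" ? invitee : null,
    ended_at: new Date().toISOString(),
  };
}

// POST the payload to a Make custom webhook (URL is a placeholder).
async function postToMake(webhookUrl, payload) {
  const res = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!res.ok) {
    throw new Error(`Make webhook returned ${res.status}`);
  }
}

console.log(
  buildCallSummary({
    callerId: "+15550100",
    disposition: "booked",
    consent: true,
    summary: "Needs dispatch automation; wants to start next month.",
    invitee: { name: "Example Caller", email: "caller@example.com" },
  })
);
```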

Jordan: So, remember that voicemail? The cleaning company guy who went with someone else because their phone answered? I actually called him back two months later. Just to check in. He told me the consultant he hired delivered late and over budget. He asked if I was still available.

I was. My voice agent booked him a consult while I was at lunch.

That's the thing about this build. It's not about replacing human connection. It's about making sure the human connection gets a chance to happen. Every call that goes to voicemail is a conversation that never starts. Every lead that bounces to the next name on the list is revenue you earned through reputation and lost through availability.

If you want to build this yourself, the Solo Operator Voice Agent Playbook is on the Resources page. It's the fourteen-step checklist we just walked through — Twilio setup, Realtime wiring, barge-in pattern, Calendly booking flow, consent script, DTMF fallbacks, the Make scenario for post-call summaries. Everything in order, with the env vars and code pointers you need to ship it.

One thing to do this week. Just one. Go buy a Twilio number. It's a dollar. Point the voice webhook to a simple Express endpoint that returns a Connect Stream TwiML. Get audio flowing to a WebSocket. You don't need the Realtime API wired up yet. You don't need Calendly. Just prove to yourself that you can receive a phone call on a server you control. That's step one. Everything else builds from there.

I'm Jordan. This is Headcount Zero. Go build something.

Twilio Voice AI, OpenAI Realtime API, Voice Agent, Calendly Integration, Solo Operator, Call Automation, Lead Qualification, Phone Receptionist, Media Streams, Barge-in Implementation