Engineering · December 31, 2025 · 3 min read

Real conversations aren't strictly turn-based. Ours still mostly are.

Humans backchannel, overlap, and listen while they talk. Full-duplex turn-taking is where voice AI is headed — here's what the realtime providers hand us today and what stays genuinely hard.

Listen to a nurse talk with the resident in 214B and you'll hear something no strict turn-detector models: they overlap. The nurse says mm-hm while the resident is still talking, the resident keeps going, the nurse finishes the resident's sentence and the resident accepts it. There's no clean point where one stops and the other starts. Real conversation is full-duplex — both parties listening and producing at once — and everything we've described so far in this series is, by contrast, half-duplex: one speaker, then a hard endpoint, then the other. Closing that gap is the frontier of turn-taking, and it's worth being honest about how far we've actually gotten.

What strict turn-taking can't do

A half-duplex system makes three things impossible, and all three are things a kind human listener does without thinking:

Backchannels. The little "mm-hm," "I see," "go on" that tell a speaker you're still with them. They're how a listener grants patience without taking the floor, which is exactly the signal an elder mid-word-search needs to hear. A strict endpointer can't emit one, because to it, making a sound is taking a turn.
Graceful overlap. Two people briefly talking at once, with neither treating it as an error. A half-duplex system treats any incoming speech while it's talking as either an interruption to obey or noise to ignore, never as the normal texture of conversation.
Continuous listening while speaking. Hearing "no, that's not what I meant" in the first half-second of its own reply and adjusting, rather than finishing the wrong answer because it had stopped listening the moment it started talking.

What the realtime providers give us

Our Go adapter speaks the OpenAI Realtime event schema and fans out to OpenAI, ElevenLabs ConvAI, and Grok behind it. The realtime providers do hand us real pieces of the full-duplex picture today:

Server-side VAD and turn detection that runs continuously on the inbound stream, so the model is, technically, always listening — even mid-reply.
Barge-in / interruption handling — when the resident starts speaking over Companion, the provider emits a speech-started event, and we cancel the in-flight response and clear the audio already queued on the CoreS3 speaker. That gives us reliable interrupt, which is the most important slice of full-duplex for safety: the resident can always cut in.
Streaming partial transcripts, which let our semantic endpointer and our overlap logic react before a turn is formally closed.

Barge-in is the part we'd call solved. If the resident speaks, Companion stops — fast, every time. That alone covers most of what residents actually need from duplex behavior.
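As a rough sketch of that cancel-and-clear path: the event names `input_audio_buffer.speech_started` and `response.cancel` come from the OpenAI Realtime event schema, but the `BargeIn` type and its `Send` and `ClearAudio` hooks are hypothetical stand-ins for illustration, not the adapter's actual code. It is also not concurrency-safe as written.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// serverEvent decodes only the one field this sketch needs.
type serverEvent struct {
	Type string `json:"type"`
}

// BargeIn is a hypothetical helper, not the adapter's real type.
type BargeIn struct {
	Send       func(payload []byte) error // writes one client event upstream
	ClearAudio func()                     // drops audio already queued on the device
	replying   bool                       // true while a reply is in flight
}

// SetReplying records whether a reply is currently being generated or played.
func (b *BargeIn) SetReplying(v bool) { b.replying = v }

// Handle inspects one raw server event and reports whether it caused a barge-in.
func (b *BargeIn) Handle(raw []byte) (bool, error) {
	var ev serverEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return false, fmt.Errorf("decode event: %w", err)
	}
	// Speech that arrives while nothing is playing is an ordinary turn, not a barge-in.
	if ev.Type != "input_audio_buffer.speech_started" || !b.replying {
		return false, nil
	}
	// Silence the speaker first so playback stops even if the upstream send fails.
	b.ClearAudio()
	b.replying = false
	cancel, err := json.Marshal(map[string]string{"type": "response.cancel"})
	if err != nil {
		return true, fmt.Errorf("encode cancel: %w", err)
	}
	if err := b.Send(cancel); err != nil {
		return true, fmt.Errorf("send cancel: %w", err)
	}
	return true, nil
}
```

The ordering is the design choice worth noting: local audio is cleared before the cancel goes upstream, so the resident hears silence immediately even if the network round trip is slow or fails.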

What's still hard

The genuinely unsolved problems are the cooperative overlaps, not the interruptions:

Telling an interruption apart from a backchannel is the central one. "No, stop" and "mm-hm" are both inbound speech during our reply, but they mean opposite things: "take the floor" versus "keep going, I'm with you." Today we treat almost all overlap as interruption and cancel, which is safe but socially blunt. The reverse direction is blocked too: Companion can't say "mm-hm" to the resident, because emitting a backchannel while listening reopens every echo-suppression and turn-ownership problem we work hard to keep simple. Generating well-timed backchannels is its own hard problem (a late "mm-hm" is worse than none), and doing it on a constrained ESP32-S3 with I2S duplex audio, without the device's own backchannel tripping its own VAD, is not free.
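To make the interruption-versus-backchannel distinction concrete, here is a deliberately naive, text-only sketch that classifies a partial transcript heard during a reply. Everything in it is an assumption for illustration: the `ClassifyOverlap` name, the tiny word list, and the two-word limit are hypothetical, and it ignores the timing and prosody a real classifier would need. It is not what ships today, where almost all overlap is treated as interruption.

```go
package main

import "strings"

// Overlap labels speech that arrives while a reply is playing.
type Overlap int

const (
	Interruption Overlap = iota // yield the floor: cancel the reply
	Backchannel                 // listener acknowledgement: the reply may continue
)

// backchannelWords is an illustrative, hypothetical lexicon.
var backchannelWords = map[string]bool{
	"mm-hm": true, "mhm": true, "uh-huh": true,
	"yeah": true, "right": true, "okay": true, "ok": true,
}

// maxBackchannelWords is an assumed cutoff; longer utterances count as interruptions.
const maxBackchannelWords = 2

// ClassifyOverlap returns Backchannel only when every word is a known
// acknowledgement. Anything empty, long, or unrecognized falls back to
// Interruption, so the safe behavior (stop talking) is the default.
func ClassifyOverlap(partial string) Overlap {
	words := strings.Fields(strings.ToLower(partial))
	if len(words) == 0 || len(words) > maxBackchannelWords {
		return Interruption
	}
	for _, w := range words {
		if !backchannelWords[strings.Trim(w, ".,!?")] {
			return Interruption
		}
	}
	return Backchannel
}
```

Because partial transcripts grow as the resident keeps talking, a check like this would have to be re-run on every partial: "yeah" alone looks like a backchannel, while "yeah, but" does not. Failing toward interruption keeps the guarantee that the resident can always cut in.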

So we've made a deliberate call: ship rock-solid interruption and patient half-duplex endpointing now, and treat true full-duplex as a horizon, not a checkbox. The resident in 214B can always cut in and be heard instantly, and she's never cut off when she pauses. She doesn't get a machine that murmurs mm-hm back to her yet. We'd rather she have a device that always yields the floor than one that talks over her trying to sound human.

Tags: turn-detection, full-duplex, realtime
