Engineering · June 19, 2025 · 3 min read

Letting a resident talk over Companion is the hard part.

Barge-in sounds simple — stop talking when the resident starts. In practice it means cancelling a reply mid-word, flushing buffers across three machines, and not mistaking your own speaker for the resident.

Companion is reading back tomorrow's schedule and the resident cuts in: "No, not the cardiologist, the eye doctor." A real conversation absorbs that instantly. The other person stops talking, drops what they were saying, and listens. For a voice device, that one human reflex is a small distributed-systems problem, and getting it wrong is the difference between a companion and a kiosk that won't shut up.

Cancel is not one action

When our voice activity detector (VAD) picks up the resident's voice during playback, we have to unwind a pipeline that has already run ahead of itself. The model has generated text the resident will never hear. The synthesizer has produced audio chunks that are already queued on the wire. The CoreS3 has PCM frames sitting in its I2S buffer, about to hit the speaker. Stopping cleanly means acting on all three at once.

Firmware: stop the I2S write loop and zero the playback ring buffer so no already-buffered frames sneak out after we 'stopped'.
Go server: send the provider a cancel/truncate event and stop forwarding any further audio deltas downstream.
Conversation state: tell the model how much of its reply was actually heard — truncated at the audio that reached the speaker, not the text it generated — so its memory of the turn matches the resident's.
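
Seen from the Go server, the three steps above might look roughly like the sketch below. It is an illustration under stated assumptions, not Companion's actual code: the `Turn` type and its `SendAudio`, `SendStop`, and `CancelUpstream` callbacks are hypothetical stand-ins for the device link and the provider connection.

```go
package main

import (
	"errors"
	"sync"
)

// Turn tracks one assistant reply in flight. All names here are
// hypothetical; the callbacks stand in for real transport code.
type Turn struct {
	ID             string
	SendAudio      func(pcm []byte) error       // forward one audio delta to the device
	SendStop       func() error                 // ask firmware to halt I2S writes and zero its ring buffer
	CancelUpstream func(responseID string) error // send the provider a cancel/truncate event

	mu        sync.Mutex
	cancelled bool
}

// Forward relays one audio delta to the device unless the turn has been
// cancelled. It reports whether the delta was actually forwarded.
func (t *Turn) Forward(pcm []byte) (bool, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.cancelled {
		return false, nil // late delta after barge-in: drop it
	}
	return true, t.SendAudio(pcm)
}

// Cancel closes the gate first, then tells the device and the provider.
// Because Forward holds the same lock, no delta can slip out after the
// stop message has been sent. Calling Cancel twice is a no-op.
func (t *Turn) Cancel() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.cancelled {
		return nil
	}
	t.cancelled = true
	stopErr := t.SendStop()
	cancelErr := t.CancelUpstream(t.ID)
	return errors.Join(stopErr, cancelErr)
}
```

The design choice worth noting is the ordering: the gate closes before any message goes out, so a delta that arrives from the provider while the cancel is in flight is dropped rather than forwarded.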

That last point is the subtle one. If the model thinks it said a sentence the resident never heard, its next turn references something that, from the bedside, never happened. We truncate the assistant turn at the playback cursor, not the generation cursor.
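
A minimal sketch of truncating at the playback cursor, assuming the server keeps a record of which text each synthesized audio segment carried and how long it lasts. The `Segment` type, `PlayedMs`, and `HeardText` are hypothetical names for illustration; how text actually aligns with audio depends on the provider.

```go
package main

import "strings"

// Segment pairs a piece of generated text with the duration of the
// audio synthesized for it. Hypothetical bookkeeping for this sketch.
type Segment struct {
	Text       string
	DurationMs int
}

// PlayedMs converts a count of samples that reached the speaker into
// milliseconds of audio at the given sample rate.
func PlayedMs(samplesPlayed, sampleRate int) int {
	return samplesPlayed * 1000 / sampleRate
}

// HeardText returns the text of every segment whose audio finished
// playing before the cursor, and whether the reply was cut short.
// A segment that was only partly played is dropped, which errs on the
// side of the model remembering less than it said, never more.
func HeardText(segs []Segment, playedMs int) (heard string, truncated bool) {
	var b strings.Builder
	elapsed := 0
	for _, s := range segs {
		if elapsed+s.DurationMs > playedMs {
			return b.String(), true
		}
		b.WriteString(s.Text)
		elapsed += s.DurationMs
	}
	return b.String(), false
}
```

Dropping the partly played segment is one possible policy. Keeping a proportional prefix of it is another, at the cost of guessing where in the audio each word falls.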

Barge-in versus our own echo

Here is the trap. To allow barge-in, the mic has to be open while the speaker is playing — which is exactly the situation our half-duplex gate normally forbids, because the mic hears Companion's own voice an inch away. Open the mic naively and Companion's TTS triggers its own VAD, the system thinks it's being interrupted, it cancels itself, and the reply stutters into silence. We have watched it happen.

So barge-in only exists when we can tell the resident's voice apart from our own. Our approach is conservative: the playback state is known, so we raise the VAD energy threshold during playback and require a short sustained burst of voice — not a single loud frame — before we treat it as a real interruption. A cough or a TV swell shouldn't kill a reply; a person leaning in to correct us should. It is the same double-talk problem that makes on-device echo cancellation hard, scoped down to one yes-or-no question: is this the resident, or is this us?
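
The gating described above can be sketched as a small per-frame detector. This is an illustration of the idea, written in Go for consistency with the server code, not the firmware itself; the `BargeInDetector` type, its field names, and any threshold values used with it are assumptions.

```go
package main

import "math"

// BargeInDetector decides whether mic energy during playback is a real
// interruption. All fields are hypothetical tuning knobs.
type BargeInDetector struct {
	BaseThreshold   float64 // RMS level that counts as voice when the speaker is silent
	PlaybackBoost   float64 // multiplier applied to the threshold while playing
	MinVoicedFrames int     // consecutive loud frames required before interrupting

	run int // current streak of frames above the raised threshold
}

// RMS returns the root-mean-square amplitude of one PCM frame.
func RMS(frame []int16) float64 {
	if len(frame) == 0 {
		return 0
	}
	var sum float64
	for _, s := range frame {
		v := float64(s)
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(frame)))
}

// Process consumes one mic frame and reports whether to treat it as a
// barge-in. Outside playback there is nothing to interrupt, so the
// streak resets and the result is always false.
func (d *BargeInDetector) Process(frame []int16, playing bool) bool {
	if !playing {
		d.run = 0
		return false
	}
	if RMS(frame) >= d.BaseThreshold*d.PlaybackBoost {
		d.run++
	} else {
		d.run = 0 // a single quiet frame breaks the burst
	}
	return d.run >= d.MinVoicedFrames
}
```

With 20 ms frames and `MinVoicedFrames` set to 10, for example, a cough that lasts a frame or two never reaches the streak, while 200 ms of sustained speech does. Energy alone cannot tell a loud TV from a person, so a detector like this is a conservative filter rather than a full answer to double-talk.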

When it works, you stop noticing it. The resident in 214B says "no, the eye doctor," Companion stops on the word "cardio-", and picks up from her correction. No talkover, no robotic "I'm sorry, I didn't catch that." Just two voices taking turns the way two people do.

conversational-ai · interruption · realtime
