Engineering · April 12, 2025 · 3 min read

A spoken reply has an 800-millisecond budget, and most of it is already spent.

Where the milliseconds actually go between a resident finishing a sentence and Companion starting to talk — and why crossing roughly 800ms makes the whole thing feel broken.

A resident asks Companion a question and then waits. The gap before Companion answers is the single most load-bearing number in the whole system. In normal human turn-taking the gap between speakers is around 200ms — fast enough that we don't notice it. Push it past roughly 800ms and a different feeling kicks in: the resident assumes they weren't heard, repeats themselves, and now two utterances are colliding. The reply hasn't gotten worse. The conversation has.

Where the milliseconds go

The clock that matters is mouth-to-ear: from the resident's last syllable to the first sound of the reply leaving Companion's speaker. It is the sum of a chain, and every link spends real time.

End-of-turn detection (~200–400ms): we have to be sure the resident is done. Our on-device VAD waits out a silence window before closing the utterance. Shrink it and we cut people off mid-thought; grow it and the whole reply slides later.
STT / transcription (~100–200ms): with a streaming Realtime provider this overlaps the resident still talking, so the marginal cost at end-of-turn is small — but it isn't zero.
LLM time-to-first-token (~300–500ms): not the full answer, just the first token. This is usually the largest single line item and the one we control least.
TTS first byte (~100–250ms): the model has to produce enough text for the synthesizer to start a natural-sounding first word.
Network round-trips (~50–150ms): Companion to the Go server to the provider and back, twice. On a facility's cellular fallback link this is the line that quietly blows the budget.
Playback start (~20–60ms): I2S buffering on the CoreS3 before the first PCM frame reaches the speaker.

Add the floors and you are at about 770ms, already near the 800ms line before anything has gone wrong. Add one slow LLM response and a degraded link and you are at 1.5 seconds, which at the bedside reads as a malfunction.
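To make the arithmetic concrete, here is a minimal sketch that sums the ranges listed above as if the chain were fully serial. The stage names and the `serial_total` helper are illustrative, not names from our codebase; the numbers are the approximate ranges from the breakdown.

```python
# Per-stage latency ranges in milliseconds, copied from the breakdown above.
# The dictionary keys are illustrative labels, not real identifiers.
STAGES_MS = {
    "end_of_turn_detection": (200, 400),
    "stt": (100, 200),
    "llm_first_token": (300, 500),
    "tts_first_byte": (100, 250),
    "network_round_trips": (50, 150),
    "playback_start": (20, 60),
}

BUDGET_MS = 800  # rough threshold past which the gap reads as "not heard"


def serial_total(stages, pick):
    """Sum one end of every range (pick=0 for floors, pick=1 for ceilings)."""
    return sum(bounds[pick] for bounds in stages.values())


floor_ms = serial_total(STAGES_MS, 0)
ceiling_ms = serial_total(STAGES_MS, 1)

print(f"floor:   {floor_ms} ms (headroom {BUDGET_MS - floor_ms} ms)")
print(f"ceiling: {ceiling_ms} ms (over budget by {ceiling_ms - BUDGET_MS} ms)")
```

With these ranges the best case leaves about 30ms of headroom, and the case where every stage hits the top of its range lands a little over 1.5 seconds.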

What we actually do about it

The first move is to stop treating the chain as serial. A streaming Realtime session lets STT, generation, and synthesis overlap so we pay for the slowest link, not the sum. The second move is to start playback on the first audio chunk rather than waiting for the full reply — first-byte latency is what the resident feels, total length is not.
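The shape of that overlap can be sketched with plain Python generators. This is a toy model, not the Realtime provider's API: `llm_tokens`, `tts_chunks`, and `play` are hypothetical stand-ins, and the "audio" is just encoded text. What it shows is the ordering: the first chunk is played before the last token has even been generated.

```python
from typing import Iterator, List

events: List[str] = []  # ordered log of what happened when


def llm_tokens() -> Iterator[str]:
    """Stand-in for a streaming LLM: yields placeholder tokens one at a time."""
    for token in ["tok0", "tok1", "tok2", "tok3"]:
        events.append(f"token:{token}")
        yield token


def tts_chunks(tokens: Iterator[str]) -> Iterator[bytes]:
    """Stand-in for streaming TTS: emits audio as soon as a token arrives."""
    for token in tokens:
        events.append(f"audio:{token}")
        yield token.encode("utf-8")  # placeholder for PCM bytes


def play(chunks: Iterator[bytes]) -> int:
    """Stand-in for the speaker: plays each chunk as it arrives."""
    played = 0
    for chunk in chunks:
        events.append(f"play:{chunk.decode('utf-8')}")
        played += 1
    return played


# Generators are lazy, so each stage pulls from the previous one on demand.
played = play(tts_chunks(llm_tokens()))

first_play = events.index("play:tok0")
last_token = events.index("token:tok3")
print(f"first chunk played at step {first_play}, last token generated at step {last_token}")
```

A real synthesizer would likely buffer a few tokens before emitting its first chunk, so treat the one-token-per-chunk mapping here as a simplification.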

The harder move is end-of-turn detection, because it trades directly against being interrupted. We bias toward a slightly faster silence window and recover the difference with a short conversational acknowledgment when the model needs more thinking time. A 250ms "mm-hm" costs almost nothing and resets the resident's clock: it signals "heard you," which is most of what the 800ms threshold is really measuring.
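A minimal sketch of that trade-off, under stated assumptions: VAD frames arrive every 20ms as a speech/no-speech boolean, the silence window is 300ms, and the acknowledgment fires if no reply audio is ready 400ms after end-of-turn. None of these values or names (`EndOfTurnDetector`, `should_acknowledge`) come from the on-device implementation; they are placeholders chosen to sit inside the ranges discussed above.

```python
from dataclasses import dataclass, field

FRAME_MS = 20  # assumed VAD frame duration


@dataclass
class EndOfTurnDetector:
    """Closes the utterance after a continuous run of silence."""

    silence_window_ms: int = 300  # assumed; shorter cuts people off, longer delays the reply
    _silent_ms: int = field(default=0, init=False)
    _heard_speech: bool = field(default=False, init=False)

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD frame; return True once the turn is considered over."""
        if is_speech:
            self._heard_speech = True
            self._silent_ms = 0  # any speech resets the silence run
            return False
        if not self._heard_speech:
            return False  # leading silence is not an end of turn
        self._silent_ms += FRAME_MS
        return self._silent_ms >= self.silence_window_ms


def should_acknowledge(waited_ms: int, first_audio_ready: bool,
                       threshold_ms: int = 400) -> bool:
    """Play a short filler if reply audio is not ready by threshold_ms (assumed value)."""
    return not first_audio_ready and waited_ms >= threshold_ms


# 200ms of speech, a 100ms mid-thought pause, 100ms more speech, then silence.
frames = [True] * 10 + [False] * 5 + [True] * 5 + [False] * 20
detector = EndOfTurnDetector()
fired_at = next(i for i, f in enumerate(frames) if detector.push(f))
print(f"end of turn at frame {fired_at} ({(fired_at + 1) * FRAME_MS} ms into the stream)")
```

The 100ms pause does not close the utterance because it is shorter than the window; the detector only fires once the final silence has run for the full 300ms.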

For a resident in room 214B, none of this is visible. She asks whether her daughter is coming today, and Companion answers on the beat she expects. The work is making the seam between her sentence and ours disappear — because the moment she can feel the seam, she stops trusting the voice on the other side of it.

