Engineering · July 31, 2025 · 3 min read

The first 300 milliseconds of speech are the whole conversation.

Total synthesis time barely matters. Time-to-first-audio dominates whether Companion feels alive or broken. Why we chunk by sentence and start speaking before the reply is done.

Ask Companion a question and there is a gap before it answers. If that gap is short, the resident feels heard. If it stretches past about half a second, something else happens: she assumes it didn't catch her, repeats the question louder, and now the assistant is answering the first ask while she's mid-second-ask. The turn is desynced, and it started with one number — time to first audio.

Why first audio dominates perceived speed

Total synthesis time is the wrong metric. A reply that takes two seconds to fully synthesize but starts playing in 250ms feels instant, because the resident hears speech right away and the rest streams in behind it under cover of her own listening. A reply that synthesizes in one second but plays only once the whole thing is ready feels like a stall. Perceived responsiveness is almost entirely a function of time-to-first-audio (TTFA), not total duration. Humans forgive a long sentence; they do not forgive a long silence before it.
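As a rough worked example of that comparison, the sketch below models a reply as a list of per-chunk synthesis times. The timings and the function name `time_to_first_audio` are illustrative assumptions chosen to match the two cases in the paragraph above, not measurements from Companion.

```python
def time_to_first_audio(chunk_synth_ms, streamed):
    """Milliseconds of silence before any audio plays.

    chunk_synth_ms: synthesis time of each chunk, in order.
    streamed: if True, playback starts as soon as the first chunk is ready;
              if False, playback waits for every chunk to finish.
    """
    if streamed:
        return chunk_synth_ms[0]
    return sum(chunk_synth_ms)


# Hypothetical reply A: two seconds of total synthesis, streamed in three chunks.
reply_a = [250, 600, 1150]
# Hypothetical reply B: one second of synthesis, played only when complete.
reply_b = [1000]

print(sum(reply_a), time_to_first_audio(reply_a, streamed=True))   # 2000 250
print(sum(reply_b), time_to_first_audio(reply_b, streamed=False))  # 1000 1000
```

Reply A does twice the total work yet leaves a quarter of the silence, which is the number the listener actually experiences.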

This is why we never wait for the full model response before synthesizing. The language model is still generating tokens when we start turning the front of the sentence into sound. The bottleneck we optimize is the path from first usable text to first PCM sample out of the speaker.
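A minimal sketch of that overlap, assuming a hypothetical token generator and a stand-in `synthesize` callback (neither is Companion's actual interface). It uses naive terminal-punctuation detection only, and it records the order of events to show that the first sentence reaches the synthesizer before the model has produced the second.

```python
def fake_token_stream(log):
    """Stand-in for a language model that yields text a piece at a time."""
    for tok in ["Dinner ", "is ", "at ", "5:30. ", "Tonight ", "it ", "is ", "soup."]:
        log.append(("token", tok.strip()))
        yield tok


def sentences_from_tokens(tokens):
    """Yield each sentence as soon as its terminal punctuation arrives."""
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((".", "?", "!")):
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush trailing text that never got terminal punctuation
        yield buf.strip()


def speak(tokens, synthesize):
    """Hand each completed sentence to the synthesizer without waiting for the rest."""
    for sentence in sentences_from_tokens(tokens):
        synthesize(sentence)


log = []
speak(fake_token_stream(log), lambda s: log.append(("synth", s)))
for event in log:
    print(event)
```

In this single-threaded sketch, synthesis briefly blocks token consumption. A production pipeline would presumably run the two concurrently, but the ordering it demonstrates is the same.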

Sentence chunking, and where to cut

Streaming TTS means deciding when you have enough text to synthesize a chunk without it sounding wrong. Synthesize too eagerly (say, the first three words) and the synthesizer guesses the wrong intonation because it can't see the end of the clause, so the prosody lands flat or rises like a question that wasn't one. Wait for the full paragraph and you've thrown away the whole point of streaming.

So we chunk at clause and sentence boundaries. The rules in practice:

Cut on terminal punctuation first — periods, question marks — because a full sentence gives the synthesizer correct prosodic shape.
Fall back to clause boundaries (commas, conjunctions) when the first sentence is long, so first-audio doesn't wait on a 30-word opener.
Enforce a minimum chunk length so we never synthesize a two-word fragment that sounds clipped.
Never split mid-name or mid-number — Mrs. Okafor and 2:30 have to stay whole or the pronunciation breaks.
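The four rules above can be sketched as a small word-based chunker. This is an illustration under stated assumptions, not Companion's implementation: the abbreviation list, the `min_words` and `clause_fallback_words` thresholds, and the function name `chunk_text` are all hypothetical, and conjunction boundaries are left out for brevity. Splitting on whitespace keeps tokens like 2:30 whole, and the abbreviation check keeps Mrs. from being read as a sentence end.

```python
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "st."}  # assumed list, extend as needed
TERMINAL = (".", "?", "!")
CLAUSE = (",", ";")


def chunk_text(text, min_words=3, clause_fallback_words=8):
    """Split text into synthesis chunks at sentence or clause boundaries."""
    chunks, buf = [], []
    for word in text.split():
        buf.append(word)
        # Rule 3: never emit a fragment shorter than min_words.
        if len(buf) < min_words:
            continue
        # Rule 1 (and rule 4 for titles): terminal punctuation, unless it is an abbreviation.
        is_sentence_end = word.endswith(TERMINAL) and word.lower() not in ABBREVIATIONS
        # Rule 2: fall back to a clause boundary once the chunk has grown long.
        is_clause_end = word.endswith(CLAUSE) and len(buf) >= clause_fallback_words
        if is_sentence_end or is_clause_end:
            chunks.append(" ".join(buf))
            buf = []
    if buf:
        # Attach a too-short tail to the previous chunk instead of clipping it.
        if chunks and len(buf) < min_words:
            chunks[-1] += " " + " ".join(buf)
        else:
            chunks.append(" ".join(buf))
    return chunks


reply = ("Mrs. Okafor, dinner is at 5:30. Would you like the soup, "
         "which is tomato today, or the chicken with rice?")
for chunk in chunk_text(reply):
    print(chunk)
```

With these thresholds the example reply comes out as three chunks: the opening sentence, the long clause ending at "today,", and the closing question.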

The first chunk is the one that matters most. We keep it short on purpose: a brief opening clause synthesizes fastest and gets sound into the room soonest, and the longer back half of the reply streams in while she's already listening to the front.
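One way to express that first-chunk preference is to cut the opener at the earliest clause or sentence boundary, however short, and leave the remainder to the normal chunking rules. The sketch below is a hypothetical helper, with `split_opener` and its `min_words` default being assumptions for illustration.

```python
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr."}  # assumed list
BOUNDARIES = (",", ";", ".", "?", "!")


def split_opener(text, min_words=3):
    """Return (opener, rest), cutting at the earliest usable boundary."""
    words = text.split()
    for i, word in enumerate(words, start=1):
        # Skip fragments that are too short, and never cut right after a title.
        if i < min_words or word.lower() in ABBREVIATIONS:
            continue
        if word.endswith(BOUNDARIES):
            return " ".join(words[:i]), " ".join(words[i:])
    return " ".join(words), ""  # no boundary found: the whole text is the opener


opener, rest = split_opener(
    "It's almost dinnertime, and tonight the kitchen is serving tomato soup."
)
print(opener)  # It's almost dinnertime,
print(rest)    # and tonight the kitchen is serving tomato soup.
```

The three-word opener can go to the synthesizer immediately, while the longer remainder is chunked and synthesized during playback.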

The cost of the seam

Chunking buys latency but introduces a seam: each chunk is synthesized without full knowledge of the next, so intonation can jump slightly at the boundary. On a natural-voice consumer assistant that seam is audible and annoying. At the bedside, with deliberately slower, more even delivery, the seams mostly disappear into the pacing: our calm prosody and our hard preference for first-audio happen to want the same thing.

The resident in 214B asks if it's almost dinnertime. Before she's finished settling back into the pillow, Companion has already started to answer. She doesn't notice the chunk boundary, doesn't notice the model still writing the second half. She notices that the room answered her, right away, like it was paying attention. Three hundred milliseconds is the difference between that and a device she stops trusting.

tts · streaming · latency

