He hadn't finished his word, and the device had already answered.
Stuttering and disfluency don't just garble the transcript — the silent blocks and repeated sounds wreck turn timing, so a hands-free voice device interrupts the one resident who most needs to be let finish.
One of our residents stutters, and has since childhood. He blocks hardest on the words he cares about most: his daughter's name, the word "help". When he tries to say "I want to c-c-call Carol", a naive voice pipeline does two wrong things at once: it transcribes the repeated onset as garble and, worse, during his silent block it decides he has stopped talking and starts replying over him. The device interrupts a man mid-stutter. There is almost no faster way to make a person stop using something.
Disfluency is two problems wearing one coat
Disfluency hurts both the transcript and the conversation, and engineers tend to think only about the first. Stuttering presents in three classic forms, each with a distinct failure mode:
- Repetitions. c-c-call, I-I-I want. The acoustic model dutifully transcribes every fragment, so the language model has to recover the intended word from a smear of false starts.
- Prolongations. ssssseven — a sound held far longer than the model's duration priors expect, which can be split into multiple tokens or mistaken for a different phoneme entirely.
- Blocks. A complete, silent stoppage mid-word while airflow is held. To a detector that listens only for silence, this is indistinguishable from a finished turn, and that makes it the dangerous one.
The transcript problem is annoying. The timing problem is what actually breaks the product, and it breaks it in the cruelest possible direction.
Why turn-taking is the real failure
A hands-free, no-button device has no signal for "I'm done" except silence. Our voice-activity detection waits for a pause, then yields the turn. For typical speech, an end-of-turn timeout around 700 ms feels natural. For a person who blocks, 700 ms of silence is the middle of a word, not the end of a thought. Tighten the timeout and the device feels snappy for everyone except the resident who stutters; for him it becomes an interrupting machine. Loosen it globally and every conversation feels sluggish.
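To make the failure concrete, here is a minimal sketch of silence-only endpointing. It is not our production detector: the 20 ms frame length, the function names, and the example timings (a 900 ms block between two stretches of speech) are assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence

FRAME_MS = 20  # assumed length of one voice-activity frame


def end_of_turn_ms(is_speech: Sequence[bool], timeout_ms: int,
                   frame_ms: int = FRAME_MS) -> Optional[int]:
    """Return the time (ms) at which the turn is yielded, or None if it never is.

    The turn is yielded once some speech has been heard and `timeout_ms`
    of continuous non-speech has then accumulated.
    """
    heard_speech = False
    silence_ms = 0
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= timeout_ms:
                return (i + 1) * frame_ms
    return None


def frames(ms: int, speech: bool) -> List[bool]:
    """Build `ms` worth of identical frames."""
    return [speech] * (ms // FRAME_MS)


# Illustrative utterance: 600 ms of speech, a 900 ms silent block,
# then 800 ms more speech as the word finally comes out.
utterance = frames(600, True) + frames(900, False) + frames(800, True)

cut_off_at = end_of_turn_ms(utterance, timeout_ms=700)    # fires inside the block
waited_out = end_of_turn_ms(utterance, timeout_ms=1500)   # block never reaches the timeout
```

With the 700 ms timeout the turn is yielded 1300 ms in, which is inside the block and before the second stretch of speech begins. With 1500 ms the block is waited out and the turn ends only after real trailing silence.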
The default end-of-turn timeout is a fluency tax. We refuse to charge it to the residents least able to pay it.
What we do about it
- Per-resident end-of-turn patience. We raise the silence threshold for residents who stutter — closer to 1.5–2s — so blocks read as blocks, not stops. It is a config value in Firestore, scoped to the device.
- Acoustic, not just silence-based, endpointing. A held block often has telltale articulatory tension and residual airflow. Treating silence that follows an incomplete word as not-yet-done keeps the device from leaping in.
- Disfluency-tolerant normalization. Before the transcript reaches the language model, we collapse obvious repetition runs and prolongations, so "c-c-call Carol" resolves to "call Carol" without losing the intent.
- Never punish the pause. Companion will wait out a long block in silence rather than fill it. Letting a person finish is a feature.
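A rough sketch of how the first three measures might fit together. Everything named here is an assumption for illustration: the regular expressions are simple heuristics rather than our actual normalizer, the `end_of_turn_ms` key stands in for whatever the device config document really calls it, a plain dict stands in for the Firestore read, and a trailing hyphen in the partial transcript is a hypothetical stand-in for a real acoustic cue that a word is incomplete.

```python
import re

# Leading fragments that repeat the start of the next word: "c-c-call", "I-I-I".
_REPETITION = re.compile(r"\b([A-Za-z]{1,3})-(?:\1-)*(?=\1)", re.IGNORECASE)
# Three or more of the same letter in a row: "ssssseven".
_PROLONGATION = re.compile(r"([A-Za-z])\1{2,}")

DEFAULT_PATIENCE_MS = 700  # the typical-speech timeout mentioned earlier


def normalize_disfluency(text: str) -> str:
    """Collapse obvious repetition runs and prolongations (heuristic)."""
    text = _REPETITION.sub("", text)
    return _PROLONGATION.sub(r"\1", text)


def ends_mid_word(partial_transcript: str) -> bool:
    """Hypothetical cue: the partial transcript ends on a fragment like 'c-'."""
    return partial_transcript.rstrip().endswith("-")


def should_yield_turn(silence_ms: int, partial_transcript: str,
                      device_config: dict) -> bool:
    """Decide whether the device may take the turn.

    `device_config` stands in for the per-device config document;
    the key name is an assumption.
    """
    patience_ms = device_config.get("end_of_turn_ms", DEFAULT_PATIENCE_MS)
    if ends_mid_word(partial_transcript):
        return False  # silence after an incomplete word is treated as not-yet-done
    return silence_ms >= patience_ms


cleaned = normalize_disfluency("I want to c-c-call Carol")
patient_device = {"end_of_turn_ms": 1800}
may_reply = should_yield_turn(900, "I want to call", patient_device)
```

The heuristics have known blind spots: a legitimately hyphenated word such as "re-read" would be collapsed, and a prolonged double letter would lose one of its letters. A real pipeline would also likely want an upper bound on how long the mid-word hold can last.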
We measure this with an interruption rate, not just WER, because a perfect transcript of a sentence the device talked over is still a failure. For the man who blocks on his daughter's name, the win is not a cleaner string. It is that he gets to finish saying Carol, and the device waits, and then it calls her.
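One plausible way to compute an interruption rate, sketched under assumptions: a turn counts as interrupted when the device starts replying before the resident has actually finished, and the true finish time comes from some annotation step. The `Turn` fields and the sample timings are hypothetical, not real measurements.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Turn:
    user_end_ms: int      # when the resident actually finished speaking (annotated)
    device_start_ms: int  # when the device began its reply


def interruption_rate(turns: Sequence[Turn]) -> float:
    """Fraction of turns in which the device spoke before the user had finished."""
    if not turns:
        return 0.0
    interrupted = sum(1 for t in turns if t.device_start_ms < t.user_end_ms)
    return interrupted / len(turns)


# Made-up timings: only the first turn has the device starting early.
sample = [
    Turn(user_end_ms=2300, device_start_ms=1300),
    Turn(user_end_ms=1000, device_start_ms=1700),
    Turn(user_end_ms=2300, device_start_ms=3800),
    Turn(user_end_ms=500, device_start_ms=1200),
]
rate = interruption_rate(sample)
```

Tracked per resident rather than as a global average, a number like this would show whether the residents who block are the ones being talked over, which a word error rate alone cannot reveal.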