The pause is not the end of the turn.
Post-stroke aphasia breaks the two assumptions every voice stack is built on: that words come out cleanly, and that silence means done talking. Tuning STT and turn detection for the resident who is still searching for the word.
A resident recovering from a stroke wants to ask for his glasses. He gets out 'I need my…' and stops. The word is there (he knows the object, he can picture it), but the path from concept to spoken word is damaged. Five seconds pass. To Companion running stock settings, five seconds of silence means the turn is over, so it answers the half-sentence. He never finishes. He gives up.
The post-stroke aphasia cohort is the one that breaks the voice stack in two places at once: speech-to-text and turn detection. Both fail, and they fail for the same underlying reason — the speech doesn't arrive the way the models assume it does.
Why STT struggles
Aphasic speech carries features that off-the-shelf transcription was not trained on. Paraphasias (saying 'spoon' for 'fork', or a non-word that sounds close) get transcribed literally, so the text says something the resident didn't mean. Word-finding pauses fracture a single thought across long gaps. Output can be slow, halting, and low-energy. The model returns a confident transcript of the wrong thing, and confidence is exactly what we can't trust here.
Why turn detection struggles harder
Turn detection is the sharper failure. Every voice system decides 'the user is done' from a silence window — a stretch of quiet that means hand the turn back. For a fluent speaker that window can be short. For an aphasic speaker, a long silence is the most normal thing in the world. It is the search for the word, not the end of the sentence. A short window guarantees we interrupt; the model talks over the one person who most needs to not be talked over.
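To make the failure concrete, here is a minimal sketch of a silence-window end-of-turn check. It is not Companion's detector: the function name, the 100 ms frame size, and the 700 ms stock window are assumptions chosen for illustration, and the per-frame speech flags stand in for whatever voice-activity signal the stack actually produces.

```python
from typing import Iterable, Optional


def end_of_turn_index(
    speech_frames: Iterable[bool], frame_ms: int, silence_window_ms: int
) -> Optional[int]:
    """Return the frame index at which the turn is closed, or None if it stays open.

    The turn closes once continuous silence after some speech reaches the window.
    """
    heard_speech = False
    silent_ms = 0
    for i, is_speech in enumerate(speech_frames):
        if is_speech:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= silence_window_ms:
                return i
    return None


# Hypothetical utterance in 100 ms frames:
# 'I need my' (1 s), a 2 s word search, then 'glasses' (0.5 s).
frames = [True] * 10 + [False] * 20 + [True] * 5

stock = end_of_turn_index(frames, frame_ms=100, silence_window_ms=700)
patient = end_of_turn_index(frames, frame_ms=100, silence_window_ms=3500)
```

With the short window the turn is closed in the middle of the word search, before 'glasses' is ever spoken. With the long window the same pause is just a pause, and the turn is still open when the audio ends.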
So this cohort gets its own turn policy:
- A much longer end-of-turn silence window — in our pilot around 3.5 seconds, against a sub-second default — so mid-sentence searching never reads as 'finished'.
- No barge-in. Companion will not start speaking the instant it detects a gap; the resident keeps the floor.
- Patience cues instead of answers — we hold a warm silence rather than filling it, because filling it ends his turn for him.
- Confirm before acting. On a paraphasia-prone transcript we read the request back — 'your glasses?' — instead of executing on a guess.
- Ask again sooner: treat more transcripts as too uncertain to act on, and never paraphrase a low-confidence turn into a clinical note.
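The transcript-handling part of that policy can be sketched as a small decision function. This is an illustration under stated assumptions, not the shipped logic: `AphasiaTurnPolicy`, `next_step`, `may_paraphrase_into_note`, and the 0.75 threshold are all hypothetical. Only the 3.5-second window and the no-barge-in rule come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ASK_AGAIN = "ask_again"  # transcript too uncertain to use at all
    CONFIRM = "confirm"      # read the request back before doing anything


@dataclass(frozen=True)
class AphasiaTurnPolicy:
    end_of_turn_silence_s: float = 3.5  # pilot figure from the list above
    barge_in: bool = False              # the resident keeps the floor
    ask_again_below: float = 0.75       # hypothetical threshold, not a measured value


def next_step(transcript: str, confidence: float, policy: AphasiaTurnPolicy) -> Action:
    """Pick the next move for one finished turn. There is no 'just execute' branch."""
    if not transcript.strip() or confidence < policy.ask_again_below:
        return Action.ASK_AGAIN
    return Action.CONFIRM


def may_paraphrase_into_note(confidence: float, policy: AphasiaTurnPolicy) -> bool:
    """Low-confidence turns are never summarized into a clinical note."""
    return confidence >= policy.ask_again_below


policy = AphasiaTurnPolicy()
high = next_step("I need my glasses", 0.90, policy)
low = next_step("I need my spoon", 0.50, policy)
```

The design choice worth noticing is what is missing: even a high-confidence transcript only earns a read-back, because a paraphasia can be transcribed with full confidence and still be the wrong word.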
There's a real tradeoff we accept here: a 3.5-second window makes the device feel slow to anyone outside this cohort, which is exactly why it lives at the cohort level and not in the global config. Speed for the fluent, patience for the aphasic.
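One way to keep that separation is to layer cohort overrides on top of untouched global defaults. The sketch below assumes a plain dictionary config. The key names, the cohort identifier, the 0.7-second default, and barge-in being on by default are hypothetical stand-ins for whatever the real config uses.

```python
from typing import Any, Dict, Optional

# Hypothetical global defaults: tuned for fluent speakers.
GLOBAL_TURN_CONFIG: Dict[str, Any] = {
    "end_of_turn_silence_s": 0.7,  # assumed sub-second default
    "barge_in": True,              # assumed default
}

# Cohort-level overrides: only the keys that differ from the defaults.
COHORT_OVERRIDES: Dict[str, Dict[str, Any]] = {
    "post_stroke_aphasia": {
        "end_of_turn_silence_s": 3.5,  # pilot figure
        "barge_in": False,
    },
}


def turn_config_for(cohort: Optional[str]) -> Dict[str, Any]:
    """Return a new config dict: global defaults with cohort overrides applied on top."""
    return {**GLOBAL_TURN_CONFIG, **COHORT_OVERRIDES.get(cohort, {})}


aphasia_cfg = turn_config_for("post_stroke_aphasia")
default_cfg = turn_config_for(None)
```

Residents outside the cohort resolve to the defaults unchanged, so the longer window never leaks into anyone else's experience.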
The bedside result is the whole point. The resident finishes his own sentence, in his own time, and gets his glasses. He isn't interrupted, isn't second-guessed, isn't handed an answer to a question he hadn't finished asking. For someone relearning how to speak, being waited for is the feature.