The fastest transcript is wrong, and the right transcript is late.
Streaming partial transcripts let Companion react before the resident finishes a sentence — but early partials are guesses the model will revise. How we read them for turn and intent signals without ever acting on text that isn't settled yet.
When a resident speaks, the STT layer does not wait for silence to send us text. Over the same WebSocket the firmware uses for the realtime session, partial transcripts arrive while the words are still being spoken — I, then I want, then I want to call, then I want to call my daughter. These partials are the difference between a device that feels alive and one that waits a beat too long after every sentence. They are also, early in an utterance, frequently wrong. Treating them as if they were finals is the most common way to make a voice product feel erratic.
Partials are predictions, and predictions get revised
A streaming recognizer emits its best guess given the audio so far. Each new frame can change the most likely decoding of everything before it, because acoustic context resolves backward — I feel can become I fell once the next syllable lands, and those two transcripts route to completely different responses. We measure this directly: in our pilot logs, the first partial of an utterance disagrees with the eventual final transcript roughly 30–40% of the time, while a partial captured 300ms before end-of-speech matches the final around 90% of the time. Accuracy is a function of how much audio the partial has seen.
That curve is the whole design problem. Wait for the final and you have thrown away the entire reason to stream. Act on the first partial and you are acting on a coin flip. The accuracy of a partial is not a fixed property — it is a function of how far into the utterance you read it, and the engineering is about reading it at the right moment for the right purpose.
Read partials for shape before you read them for words
The unlock is that not every decision needs the exact words. Long before a partial is accurate enough to act on, it carries signals that are robust even while the specific tokens churn:
- Turn-taking. The arrival of any partial tells us the resident is still talking. Partials drying up — no update for a few hundred milliseconds while the audio also goes quiet — is a far better end-of-turn signal than silence alone, and it lets us close the turn earlier without clipping a thoughtful pause.
- Intent shape. Call / help / hurt / pain surface in partials early and stay stable even as proper nouns around them get rewritten. We can prime the response path on the shape of the request while the details are still settling.
- Confidence trajectory. A partial whose tokens stop changing as audio continues is converging; one that rewrites itself every update is not. The rate of revision is itself a signal about whether the text is safe to trust yet.
Acting only on what has earned it
So we run two clocks. One uses the volatile early partials purely for internal preparation — warming the model, biasing the decoder toward this resident's vocabulary, deciding whether the turn is ending — none of which is visible or irreversible if the text turns out wrong. The other waits for the part of the utterance that has stabilized before anything the resident can perceive happens: a spoken response, a logged event, a confirmation. We never speak a reply built on a partial that is still rewriting itself, because the cost of revising a guess we already acted on is a device that interrupts and contradicts itself.
Because Companion produces nurse-reviewed events rather than raw recordings, the bar for what we commit is high: a logged event has to reflect what the resident actually said, not the recognizer's first draft. Partials make the conversation feel immediate; finals — and the stable prefixes on the way to them — are what we are willing to stand behind.
The resident in 214B experiences a device that starts to respond on the rhythm of her sentence instead of after an awkward gap, and that never once acts on a word she didn't say. The early read makes it feel fast. The discipline about when to trust it makes it feel right.