EngineeringJune 1, 2026·3 min read

Streaming transcription is faster, wrong, and confidently so.

Whisper-family models were built to transcribe whole utterances. Force them to stream and you trade away the future context they need to be accurate. At the bedside, where the last word often changes the first, that tradeoff is sharper than it looks.

Everyone wants the transcript to appear as the resident speaks. It feels responsive, and a hands-free voice device lives or dies on feeling responsive. But the accuracy we depend on at the bedside comes from a model seeing the whole utterance before it commits — and streaming, by definition, asks it to commit before the utterance is over. That is the central, unavoidable tension of real-time ASR, and it is worth being precise about why it exists.

Why batch beats streaming on accuracy

Whisper and its relatives are sequence-to-sequence models trained on complete 30-second windows. Their accuracy comes partly from attending to future audio: the end of a sentence disambiguates the beginning. A resident says I want to take my… — the next word decides whether take was take or make, and whether the object is a pill or a walk. A batch model waits and gets it right. A streaming model has to emit take before it has heard pill.

Lost future context. Streaming sacrifices the right-hand context that resolves ambiguity. The earlier in a word the disambiguating cue arrives, the more it costs you.
Unstable partial hypotheses. Streaming output flickers — the model revises make to take to bake as more audio arrives. Anything downstream consuming partials has to tolerate rewrites.
Worse on exactly our hard cases. Dysarthric, quiet, and accented speech already lean harder on context to decode. Take that context away and the residents who need accuracy most lose the most.

On our internal sets, the same model run in streaming mode lands roughly 2–5 WER points worse than batch, and the gap widens on the impaired-speech slices where the points matter most.

Chunking and look-ahead: buying accuracy with milliseconds

The practical middle ground is to stream in chunks with a look-ahead window — let the model peek a little into the future before committing each segment. Look-ahead is a direct dial between latency and accuracy: every extra millisecond of future audio you wait for buys back some of the context batch would have had, at the cost of a slower reply.

Look-ahead is just latency with a better name. You are paying for accuracy in milliseconds the resident spends waiting.

Get the chunk size wrong in either direction and you lose. Too small and you slice words mid-phoneme and starve the model of context. Too large and the conversation feels laggy and dead. The sweet spot is not universal — it shifts with the speaker and the room.

What we do about it

Latency budget, mouth-to-ear. We hold conversational p95 under roughly 800ms. Past that the turn-taking feels broken regardless of how accurate the transcript is, so streaming earns its place by protecting that budget.
Stream for liveness, finalize for accuracy. Partial hypotheses keep the interaction feeling alive, but the version that reaches the SOAP-note pipeline is the finalized, full-context transcript — never a partial.
Speaker-tuned look-ahead. A resident with dysarthric or breathy speech gets a longer look-ahead window; a clear, brisk speaker gets a shorter one. The latency/accuracy dial is set per resident, in config, not globally.
Never act on a flickering partial. Companion does not commit to a high-stakes action — calling family, logging a symptom — off an unstable hypothesis. The commit waits for the finalized text.

The honest summary is that there is no setting that is both instant and perfectly accurate; there is only the tradeoff, managed deliberately. For the resident in 214B, that means a device that feels alive while she talks and is careful about what it believes she said — fast where speed is free, patient where accuracy is worth the wait.

sttstreaminglatency

Why batch beats streaming on accuracy

Chunking and look-ahead: buying accuracy with milliseconds

What we do about it

30 days. One wing. Your numbers.