Word Error Rate went down and the device got worse.
WER is the metric everyone quotes and the one that predicts bedside quality the least. The numbers that actually correlate with a resident being helped are entity error rate, latency p95, turn accuracy, and escalation recall.
We once shipped an ASR change that dropped Word Error Rate from 11% to 9% and made Companion worse at its job. The new model was better at common words — 'the,' 'and,' 'today' — which are most of WER by volume, and slightly worse at proper nouns and the word 'help.' WER improved because it counts every word equally. The bedside got worse because not every word is equal. A resident does not care that the conjunctions were transcribed perfectly.
Why WER lies at the bedside
WER is a flat edit distance over a transcript. It treats 'help' and 'the' as one error each. But the cost of those two errors to a person in a bed differs by orders of magnitude. WER is useful as a coarse health check and genuinely misleading as a quality target, because optimizing it pushes effort toward the easy, frequent words and away from the rare, load-bearing ones. It's the metric that's easiest to report and worst to chase.
The metrics that actually predict help
We grade on a small panel of metrics chosen because each one maps to a way a resident is or isn't served. None of them is WER.
- Entity Error Rate (EER). Accuracy on the words that carry meaning: names, medications, times, family members, body parts. A 2% EER matters more than a 9% WER, because these are the words a wrong answer hinges on.
- Latency p95, mouth-to-ear. Not the average — the tail. The resident remembers the one slow reply, not the fifty fast ones. We hold p95 under roughly 800ms; past that, the conversation feels broken regardless of content.
- Turn accuracy. Did Companion respond to what the resident actually meant, end to end? This is the only metric that grades the whole pipeline as one thing instead of grading the stages in isolation.
- Escalation precision and recall. Of the turns that should have reached a nurse, how many did (recall); of the ones we escalated, how many were real (precision). These are the two numbers that decide whether the device is safe.
Recall and precision are not symmetric here
Most systems trade precision against recall and pick a balanced point. Escalation is not balanced. A missed escalation — recall failure — means a resident asked for help and nobody came. A false escalation — precision failure — means a nurse walked to a room and found someone fine. Both are costs, but they are not the same cost.
We tune escalation recall toward 1.0 and treat precision as the budget we spend to get there — within reason, because a nurse who's been burned by ten false alarms starts ignoring the eleventh.
In our pilot we target escalation recall above 0.98 and accept precision around 0.85, then work to claw precision back up without ever trading recall for it. That asymmetry is a clinical decision encoded as a metric weight — and it's invisible in WER, which is exactly why WER can't be the thing we steer by.
The discipline is refusing the metric that's easy to report in favor of the ones that predict the moment that matters: a resident in 214 saying something quiet and ambiguous at 3am, and a nurse arriving because the number we actually optimized — recall, not WER — did its job.