Our average accuracy was great. That was the problem.
A single aggregate word error rate (WER) can look healthy while the device quietly fails an entire cohort. The only honest way to find bias is to stop averaging and start slicing (by accent, age, gender, and clinical condition) and to read the worst slice, not the mean.
For a while our headline number looked fine. Aggregate word error rate across the pilot was sitting around 9%, which for bedside voice is respectable. Then we sliced it, and the average dissolved into a population that was being served well and a population that was effectively locked out. The aggregate was not wrong. It was just measuring the wrong thing — the mean instead of the floor.
Why the average lies
Aggregate accuracy is a weighted vote, and the majority cohort wins it. If 80% of your eval utterances come from speakers the model handles easily, you can run a catastrophic 30% WER on the remaining 20% and still report a single-digit average, as long as the majority cohort stays under 5%. The number is real and the failure is real and they coexist. For a consumer gadget that might be acceptable. For a device whose minority cohorts are frail immigrant elders with dysarthria, precisely the people it exists to help, averaging is a way of not seeing them.
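To make the arithmetic concrete, here is a minimal sketch of the weighted vote. The 80/20 split and the 30% minority WER come from the paragraph above; the 4% majority WER is an illustrative assumption, and weighting by utterance share is a simplification (pooled WER is really weighted by reference word count).

```python
def aggregate_wer(cohorts):
    """Combine per-cohort WERs into one headline number.

    `cohorts` is a list of (share, wer) pairs, where share is the cohort's
    fraction of the eval set. Shares must sum to 1.
    """
    total_share = sum(share for share, _ in cohorts)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("cohort shares must sum to 1")
    return sum(share * wer for share, wer in cohorts)


# Assumed figures for illustration only, not measurements.
majority = (0.80, 0.04)  # well-served cohort, 4% WER (assumption)
minority = (0.20, 0.30)  # badly served cohort, 30% WER

headline = aggregate_wer([majority, minority])
print(f"aggregate WER: {headline:.1%}")
```

Under these assumed figures the headline lands near 9%, which reads as respectable even though one listener in five is facing a 30% error rate.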
How we slice
Every utterance in our eval set carries metadata, and we compute WER (plus intent accuracy and false end-of-turn rate) within each slice, then again at the intersections, because bias lives at the intersections.
- Accent / first language — general American versus West African, South Asian, Caribbean, East Asian, and more.
- Age band — under 60, 60–79, and 80-plus, the last being our actual center of mass.
- Voice pitch cohort — low, mid, high, as a proxy that survives the gender blurring of older voices.
- Clinical condition — dysarthria, post-stroke speech, Parkinsonian tremor, hard-of-hearing (which changes how people respond, not just how they sound).
- Acoustic environment — quiet room, roommate TV, overhead paging, HVAC.
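The slicing described above could be sketched roughly as follows. This is not our production harness: the field names (`ref`, `hyp`, `accent`, `age_band`) and the toy utterances are hypothetical, and the sketch covers WER only, not intent accuracy or false end-of-turn rate. Errors and reference words are pooled within each slice before dividing, so long utterances count for more than short ones.

```python
from collections import defaultdict
from itertools import combinations


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, reference word count)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference word dropped
                curr[j - 1] + 1,         # insertion: extra hypothesis word
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1], len(ref)


def sliced_wer(utterances, dims, max_order=2):
    """WER per slice, for single dimensions and intersections up to max_order.

    Keys are tuples of (dimension, value) pairs, e.g.
    (("accent", "west_african"), ("age_band", "80+")).
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for utt in utterances:
        err, n = word_errors(utt["ref"], utt["hyp"])
        for order in range(1, max_order + 1):
            for combo in combinations(dims, order):
                key = tuple((d, utt[d]) for d in combo)
                errors[key] += err
                words[key] += n
    return {k: errors[k] / words[k] for k in words if words[k] > 0}


# Toy data, invented for illustration.
utterances = [
    {"ref": "call the nurse please", "hyp": "call the nurse please",
     "accent": "general_american", "age_band": "60-79"},
    {"ref": "turn off the light", "hyp": "turn off the light",
     "accent": "general_american", "age_band": "80+"},
    {"ref": "call the nurse please", "hyp": "call the nurse",
     "accent": "west_african", "age_band": "60-79"},
    {"ref": "i need my water", "hyp": "i knead water",
     "accent": "west_african", "age_band": "80+"},
]

results = sliced_wer(utterances, dims=["accent", "age_band"])
worst = max(results, key=results.get)
print(worst, f"{results[worst]:.1%}")
```

In this toy set the worst slice is an intersection (West African accent, 80-plus), and it scores worse than either of its single-dimension parents. With real data the intersections get small quickly, so each one deserves a look at its utterance count before anyone reads much into its number.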
We report the worst-performing slice as a first-class release metric, alongside the gap between best and worst. A change that lifts the average but widens the gap is a regression in our scorecard, even when the headline number improves. That single rule reorganizes what we optimize.
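A release check built on that rule might look something like the sketch below. The function names, the `tolerance` parameter, and the example numbers are assumptions for illustration. Treating a worse worst slice as a regression, in addition to a wider gap, is our reading of the rule rather than a quoted spec. The aggregate is deliberately not an input.

```python
def scorecard(slice_wer):
    """Summarize per-slice WER as worst slice, worst value, and best-worst gap."""
    worst_slice = max(slice_wer, key=slice_wer.get)
    worst = slice_wer[worst_slice]
    best = min(slice_wer.values())
    return {"worst_slice": worst_slice, "worst": worst, "gap": worst - best}


def is_regression(baseline, candidate, tolerance=0.0):
    """True if the candidate worsens the worst slice or widens the gap.

    `tolerance` (hypothetical) absorbs run-to-run noise in absolute WER.
    """
    b, c = scorecard(baseline), scorecard(candidate)
    return (c["worst"] > b["worst"] + tolerance
            or c["gap"] > b["gap"] + tolerance)


# Invented numbers: the candidate helps the easy cohort and hurts the hard one.
baseline = {"general_american": 0.05, "west_african": 0.20}
candidate = {"general_american": 0.02, "west_african": 0.21}

print(scorecard(candidate))
print("regression:", is_regression(baseline, candidate))
```

With an 80/20 cohort mix the candidate's average would improve from 8.0% to 5.8%, yet the check still flags it, because the gap grows from 15 to 19 points and the worst slice gets worse.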
What slicing actually catches
Slicing turned vague unease into specific bugs we could fix. It is how we found that our end-of-turn detector was firing early on slow speakers, that one STT provider was strong on accents but weak on tremor while another was the reverse, and that a TTS voice change improved younger-listener preference scores while lowering intelligibility for hard-of-hearing residents. None of those were visible in the aggregate. All of them were obvious in the slice.
The discipline is simple to state and uncomfortable to follow: never ship on the mean, always ship on the worst-served cohort. A bedside device does not get to be good on average. The resident it fails does not experience the average. She experiences her own slice, and that slice's number is the only one that is honest about her.