The transcriber hears a Lagos accent worse than a Los Angeles one, and our residents pay for it.
Speech-to-text word error rates climb sharply for non-mainstream accents — and in a skilled-nursing wing full of immigrant elders, that gap compounds into a device that quietly stops working for the people who need it most.
A resident on one of our pilot units grew up speaking Yoruba and learned English in his thirties. When he asks Companion to call his daughter, the transcript that comes back from the speech-to-text (STT) layer often reads "call my doctor" or simply drops the proper noun entirely. The model is not broken. It is doing exactly what it was trained to do: recognize the accents it saw most. His was not one of them.
The gap is measurable, and it is large
Accent bias in STT is not a vibe — it is a number you can read off an eval. On our internal accent slices, word error rate (WER) for general American English speakers sits around 7–9%. For the same utterances spoken with a strong West African, South Asian, or Caribbean accent, we measure WER in the 18–26% range, and the worst-performing slice we have logged crossed 30%. Roughly one word in three wrong is not a degraded experience. It is a different product.
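To make "a number you can read off an eval" concrete, here is a minimal sketch of per-slice WER: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words, pooled per accent slice. The slice names and transcripts are invented for illustration, and the function names are hypothetical; this is not the eval harness behind the figures above.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming Levenshtein distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1], len(ref)


def wer_by_slice(samples):
    """samples: iterable of (slice_name, reference, hypothesis) triples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for slice_name, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[slice_name] += e
        words[slice_name] += n
    # Pool errors over all words in the slice rather than averaging per utterance.
    return {s: errors[s] / words[s] for s in words if words[s] > 0}


# Toy data, invented for this example only.
samples = [
    ("general_american", "call my daughter", "call my daughter"),
    ("west_african", "call my daughter", "call my doctor"),
    ("west_african", "please call my daughter", "call my daughter"),
]
rates = wer_by_slice(samples)
print(rates)
```

Pooling errors over all words in a slice, instead of averaging per-utterance rates, keeps short utterances from dominating the result.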
The cause is upstream of us. Public training corpora skew heavily toward North American and British English recorded by younger, fluent speakers. An accent that is rare in the training distribution is, by construction, harder for the acoustic model to map to the right phonemes. The model's confidence stays high while its accuracy collapses, which is the dangerous combination: it does not know it is failing.
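One way to see "confidence stays high while accuracy collapses" in an eval is to compare mean reported confidence with observed accuracy on each slice. The sketch below assumes each utterance has been reduced to a confidence score in [0, 1] and a correct/incorrect label; both the record format and the function name are hypothetical.

```python
def calibration_gap(records):
    """records: iterable of (confidence, was_correct) pairs for one slice.

    Returns mean confidence minus observed accuracy. A large positive value
    means the recognizer is overconfident on that slice.
    """
    records = list(records)
    if not records:
        raise ValueError("calibration_gap needs at least one record")
    mean_confidence = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(1 for _, ok in records if ok) / len(records)
    return mean_confidence - accuracy


# Invented example: high confidence on every utterance, only half correct.
overconfident_slice = [(0.9, True), (0.9, False), (0.9, False), (0.9, True)]
gap = calibration_gap(overconfident_slice)
print(round(gap, 2))
```

A gap near zero on one slice and a large positive gap on another is the pattern described above: the confidence score cannot be trusted equally across accents.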
Why it compounds at the bedside
For a healthy adult on a phone, a 20% WER is annoying. For a frail immigrant elder, the same WER stacks on top of every other thing already working against intelligibility:
- Accent plus age. A non-mainstream accent layered on a weaker, breathier aged voice presents the model with two underrepresented features at once. The errors do not add; they multiply.
- Accent plus environment. Bedside audio is full of HVAC hum, a roommate's TV, and an overhead page. Noise robustness is also worst on the accents the model knows least.
- Accent plus stakes. The utterances that matter most ("I'm dizzy", "call my son", a medication name) are exactly the ones with proper nouns and clinical terms the model is least likely to recover.
And there is a feedback loop. After two or three misrecognitions, the resident stops trusting the device and stops talking to it. The accent bias becomes a usage gap, and the usage gap erases that very person from our data, so the next model trains on even less of his speech. Left alone, bias is self-reinforcing.
What we do about it
We treat accent as a first-class cohort, not an afterthought. Our adapter layer lets us route a given device to whichever STT backend (OpenAI Realtime, ElevenLabs, Grok) scores best on that resident's accent slice, and we re-measure per provider rather than trusting one vendor's aggregate number. We bias decoding toward a small per-resident vocabulary of the names and terms that actually matter to them, so "Adeyemi" and "furosemide" are candidates the recognizer is primed to find. And when confidence drops on a high-stakes utterance, Companion confirms out loud instead of acting on a guess.
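The routing and confirmation steps can be sketched roughly as follows. Everything here is an assumption made for illustration: the backend labels, the WER table (placeholder numbers, not measurements), the high-stakes word list, and the 0.85 threshold. Vocabulary biasing is left out because how it is configured differs by vendor.

```python
from dataclasses import dataclass

# Hypothetical per-slice WER table. Numbers are placeholders, not measurements.
SLICE_WER = {
    "west_african": {"backend_a": 0.24, "backend_b": 0.19, "backend_c": 0.22},
    "general_american": {"backend_a": 0.07, "backend_b": 0.09, "backend_c": 0.08},
}

# Hypothetical trigger words for utterances where a wrong guess is costly.
HIGH_STAKES_TERMS = {"call", "dizzy", "pain", "fall", "help"}


def pick_backend(accent_slice, table=SLICE_WER, default="backend_a"):
    """Return the backend with the lowest measured WER on this slice."""
    scores = table.get(accent_slice)
    if not scores:
        return default  # no measurements for this slice yet
    return min(scores, key=scores.get)


@dataclass
class Transcript:
    text: str
    confidence: float  # assumed to be a score in [0, 1] from the backend


def next_action(transcript, resident_vocab, threshold=0.85):
    """Return 'confirm' for a low-confidence, high-stakes utterance, else 'act'."""
    words = set(transcript.text.lower().split())
    watched = HIGH_STAKES_TERMS | {term.lower() for term in resident_vocab}
    high_stakes = bool(words & watched)
    if high_stakes and transcript.confidence < threshold:
        return "confirm"  # read the request back before doing anything
    return "act"


print(pick_backend("west_african"))
print(next_action(Transcript("call my doctor", 0.60), {"Adeyemi", "furosemide"}))
```

Because reported confidence can itself be unreliable on underrepresented accents, a per-slice threshold may be a safer choice than the single global value used here.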
None of this fully closes the gap; honesty requires saying so. But the man in that room should not have to flatten his own voice to be heard by a machine at his bedside. The accent he carried across an ocean is not noise to be cleaned up. It is the signal, and our job is to get better at receiving it.