There are almost no public recordings of the voices we build for.
Open speech corpora are full of fluent younger adults reading clean sentences. They contain almost nothing that sounds like a 91-year-old with post-stroke speech in a noisy nursing wing. You cannot model a distribution you never sampled — so we go and sample it.
Every bias we have written about (accent, age, gender) traces back to the same root cause. A model can only learn the voices it was shown, and the voices it was shown came from a handful of large public corpora. Those corpora are extraordinary engineering achievements, yet they contain almost nobody like our residents. The frail aged voice is not merely underrepresented in the training data. For most practical purposes it is absent.
The hole in the public data
Walk through what the open corpora actually contain and the gap is stark:
- Read, not spontaneous. Most data is people reading prepared sentences in a quiet room. Our residents speak haltingly, off-script, mid-cough, with a roommate's TV behind them.
- Young, not old. Speaker age skews to working adults. Recordings of the 85-plus are rare, and recordings of the 85-plus with a clinical speech condition are nearly nonexistent.
- Healthy, not impaired. Dysarthric and post-stroke speech corpora exist, but they are tiny, often only a few dozen speakers, and are rarely cleared for commercial training.
- The consent wall. The people whose voices we most need to model are a vulnerable, often cognitively impaired population. Collecting their speech responsibly is genuinely hard — which is part of why nobody has, and part of why we will not cut corners doing it.
You cannot measure fairness for a cohort you have no data on, and you certainly cannot fix what you cannot measure. The dataset gap is upstream of every metric. It is the thing that makes the bias structural rather than incidental.
What we actually do about it
We treat data collection as core engineering, not a side project, and we do it carefully.
- Consented, in-facility capture. With explicit, revocable consent from residents or their proxies, and with facility and IRB-style review, we build a small but real corpus of the voices we serve — in the actual acoustic conditions they live in.
- Events, not archives. Consistent with how the whole product works, we keep what is needed to improve recognition and discard raw audio aggressively. The goal is a better model, not a recording library.
- Targeted augmentation. We take the limited real aged speech we have and expand it synthetically, adding tremor, breathiness, reduced f0 (pitch) stability, a slower speaking rate, and realistic bedside noise, so the model sees the failure modes far more often than the raw hours alone would allow.
- Eval first. Before any of it touches training, it goes into the sliced eval set, so we can prove a change helped the worst-served cohort instead of just assuming it did.
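To make "eval first" concrete, here is a minimal sketch of a sliced evaluation: word error rate computed per cohort, plus a check that a change helped the cohort that was worst served before it. This is an illustration under stated assumptions, not our production harness. The cohort labels, the function names, and the choice of word error rate as the metric are all hypothetical.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_cohort(utterances):
    """Return {cohort: word error rate}.

    utterances: iterable of (cohort, reference, hypothesis) strings.
    The cohort label is a hypothetical slicing key, e.g. age band plus condition.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for cohort, ref, hyp in utterances:
        ref_tokens = ref.split()
        errors[cohort] += edit_distance(ref_tokens, hyp.split())
        words[cohort] += len(ref_tokens)
    return {c: errors[c] / words[c] for c in words if words[c] > 0}


def worst_cohort_improved(before, after):
    """True if the cohort with the highest WER before the change got better."""
    worst = max(before, key=before.get)
    return after.get(worst, float("inf")) < before[worst]


# Toy data, invented for illustration only.
baseline = [
    ("85plus_dysarthric", "please call the nurse", "please all the purse"),
    ("working_adult", "turn on the light", "turn on the light"),
]
candidate = [
    ("85plus_dysarthric", "please call the nurse", "please call the purse"),
    ("working_adult", "turn on the light", "turn on the light"),
]

before = wer_by_cohort(baseline)
after = wer_by_cohort(candidate)
print(before, after, worst_cohort_improved(before, after))
```

The point of slicing is that an aggregate score can improve while the worst-served cohort stands still, so the check is keyed to that cohort rather than to the overall average.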
Augmentation is a bridge, not a destination — a synthetic tremor is an approximation of a real one, and we are honest with ourselves about that. But it lets us start closing the gap today instead of waiting for a corpus that the industry has shown no sign of building.
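As a rough illustration of what such augmentation can look like, the sketch below applies three of the simpler transforms to a mono waveform held as a list of floats: a slower rate, an amplitude tremor, and background noise mixed at a chosen signal-to-noise ratio. Everything here is an assumption made for the example: the function names, the parameter ranges, and the deliberately naive signal processing. In particular, the linear-interpolation stretch also lowers pitch, and breathiness and f0 instability are not modeled at all.

```python
import math
import random


def slow_down(samples, factor=1.25):
    """Stretch the signal by linear interpolation (naive: also lowers pitch)."""
    if len(samples) < 2:
        return list(samples)
    out_len = int(round(len(samples) * factor))
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i / factor
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def add_tremor(samples, sample_rate, rate_hz=5.0, depth=0.3):
    """Amplitude-modulate with a slow sinusoid; gain stays in [1 - depth, 1]."""
    out = []
    for n, x in enumerate(samples):
        phase = 2.0 * math.pi * rate_hz * n / sample_rate
        gain = 1.0 - depth * 0.5 * (1.0 + math.sin(phase))
        out.append(x * gain)
    return out


def mix_noise(samples, noise, snr_db):
    """Add noise (looped to length) scaled so the mix has the requested SNR in dB."""
    tiled = [noise[i % len(noise)] for i in range(len(samples))]
    sig_power = sum(x * x for x in samples) / len(samples)
    noise_power = sum(x * x for x in tiled) / len(tiled)
    if sig_power == 0.0 or noise_power == 0.0:
        return list(samples)
    scale = math.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(samples, tiled)]


def augment(samples, sample_rate, noise, rng):
    """Apply all three transforms with randomly drawn, assumed parameter ranges."""
    factor = rng.uniform(1.1, 1.4)       # assumed slowdown range
    rate_hz = rng.uniform(4.0, 8.0)      # assumed tremor rate range
    depth = rng.uniform(0.1, 0.4)        # assumed tremor depth range
    snr_db = rng.uniform(5.0, 20.0)      # assumed bedside-noise SNR range
    out = slow_down(samples, factor)
    out = add_tremor(out, sample_rate, rate_hz, depth)
    return mix_noise(out, noise, snr_db)


# Stand-in signals: a 200 Hz tone for speech, Gaussian noise for the room.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 200.0 * n / sample_rate) for n in range(sample_rate)]
noise_rng = random.Random(0)
room_noise = [noise_rng.gauss(0.0, 1.0) for _ in range(sample_rate)]
augmented = augment(tone, sample_rate, room_noise, random.Random(1))
```

Drawing the parameters at random for each pass means a single real utterance can yield many distinct training examples, which is the sense in which the model sees the failure modes more often than the raw hours would allow.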
The principle underneath is plain. If the people we serve are missing from the data, they will be missing from the product, and the most underserved voices in the building stay underserved. So we go room by room, with consent, and put them back into the distribution — because a model that has never heard a voice like yours was never going to listen to you well.