Engineering · July 14, 2025 · 3 min read

The voices our models understand least are the only voices we serve.

Speech models are trained overwhelmingly on younger speakers. Aged voices — tremor, weaker pitch control, slower rate, more breathiness — fall outside that distribution, which means the demographic Companion exists for is the one the underlying tech serves worst.

Here is the uncomfortable shape of our problem. Companion sits at the bedside of people in their eighties and nineties. The speech recognition and synthesis models we build on were trained on a corpus whose median speaker is decades younger. The single most important demographic feature of our users is the one the underlying technology handles worst. We did not choose that tradeoff. We inherited it, and most of our engineering is spent paying it down.

What aging does to a voice

Aging is not just a quieter version of a younger voice. It changes the signal in ways that map directly onto where models are brittle:

Reduced pitch control. Fundamental frequency (f0) becomes less stable; pitch wanders mid-word. Models that learned crisp, steady contours read the instability as a different phoneme.
Vocal tremor. A 4–8 Hz amplitude and pitch tremor is common after stroke or with Parkinsonism. It smears the acoustic features the recognizer keys on.
Breathiness and lower energy. Weaker glottal closure means more air, less tone, and a lower signal-to-noise ratio before a single bit of room noise is added.
Slower rate and longer pauses. A pause mid-sentence trips end-of-turn detection, so the device replies before the resident has finished their thought.

On our age-stratified eval, word error rate (WER) for speakers under 60 sits near our baseline. For the 80-plus cohort, the actual center of mass of our users, WER roughly doubles, and for residents with diagnosed dysarthria or post-stroke speech it can triple. The aggregate metric a vendor advertises is computed on a population that, for us, barely exists.
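To make the gap between an aggregate score and a per-cohort score concrete, here is a minimal sketch of a stratified WER calculation. It is an illustration under assumptions, not our eval harness: the cohort labels, the `wer_by_cohort` name, and the sample transcripts are all hypothetical, and text normalization is reduced to whitespace splitting.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Row-by-row Levenshtein distance over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_cohort(samples):
    """samples: iterable of (cohort, reference, hypothesis) triples.

    Errors and reference words are pooled per cohort, so each cohort
    gets its own WER instead of vanishing into one aggregate number.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for cohort, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[cohort] += e
        words[cohort] += n
    return {c: errors[c] / words[c] for c in words if words[c] > 0}


# Made-up transcripts, purely for illustration.
samples = [
    ("under_60", "turn on the light", "turn on the light"),
    ("80_plus", "turn on the light", "turn the night"),
]
print(wer_by_cohort(samples))  # {'under_60': 0.0, '80_plus': 0.5}
```

Pooled across both rows, the same data gives a WER of 0.25, which describes neither group. That is the failure mode of a single headline metric.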

It is a synthesis problem too

Bias runs in both directions. TTS voices are tuned to sound natural and pleasant to the younger listeners who audition them, which often means fast, bright, and high. Played to a resident with age-related high-frequency hearing loss and slower auditory processing, that same voice is hard to follow: its brightness sits in the frequencies they hear least, and its pace outruns them. A model optimized to win a preference test with a 30-year-old can fail the only listener who matters.

What we do about it

We stopped treating older speech as a corrupted version of the standard and started treating it as its own distribution to fit. Concretely: we lengthen end-of-turn silence windows for residents whose baseline rate is slow, so a thinking pause is not mistaken for a finished sentence. We tune our synthesized voice down in pitch and rate and shift energy out of the high frequencies that hearing-impaired listeners lose first. We hold per-resident acoustic baselines so tremor and breathiness are modeled as that person's normal, not as error. And our adapter layer lets us pick, per device, whichever backend scores best on that specific resident's voice rather than on a population they are not in.
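Two of those measures, the per-resident end-of-turn window and the per-resident backend choice, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our production logic: `ResidentBaseline`, its fields, the millisecond constants, the percentile rule, and the backend names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical bounds; real values would come from tuning.
DEFAULT_END_OF_TURN_MS = 700   # silence window for a speaker with no baseline yet
MAX_END_OF_TURN_MS = 3000      # cap so the device never stalls indefinitely


@dataclass
class ResidentBaseline:
    """Illustrative per-resident profile (field names are assumptions)."""
    # Observed mid-sentence pauses, in milliseconds.
    pause_ms: list[float] = field(default_factory=list)
    # WER measured on this resident's own speech, keyed by backend name.
    wer_by_backend: dict[str, float] = field(default_factory=dict)


def end_of_turn_window_ms(baseline: ResidentBaseline,
                          percentile: float = 0.95,
                          margin: float = 1.25) -> int:
    """Silence to wait before treating the turn as finished.

    Takes a high percentile of the resident's own mid-sentence pauses,
    adds a safety margin, and clamps the result between the default
    and the cap.
    """
    pauses = sorted(baseline.pause_ms)
    if not pauses:
        return DEFAULT_END_OF_TURN_MS
    idx = min(len(pauses) - 1, int(percentile * len(pauses)))
    window = pauses[idx] * margin
    return int(max(DEFAULT_END_OF_TURN_MS, min(MAX_END_OF_TURN_MS, window)))


def pick_backend(baseline: ResidentBaseline, fallback: str = "default") -> str:
    """Choose the backend with the lowest WER on this resident's voice."""
    if not baseline.wer_by_backend:
        return fallback
    return min(baseline.wer_by_backend, key=baseline.wer_by_backend.get)


# Illustrative values only.
resident = ResidentBaseline(
    pause_ms=[400, 900, 1200, 1600, 2000],
    wer_by_backend={"backend_a": 0.31, "backend_b": 0.22},
)
print(end_of_turn_window_ms(resident))  # 2500
print(pick_backend(resident))           # backend_b
```

The design choice worth noting is that both functions read only the resident's own history. Nothing here consults a population average, which is the point of holding per-resident baselines in the first place.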

The deeper fix is data, which we cover separately — there is almost no public corpus of frail aged speech to train on. But the framing is the point. A voice that has gotten quieter and less steady over ninety years is not a degraded input. It is the input. Building for the median speaker is the one thing we are not allowed to do.


