Engineering · September 20, 2025 · 3 min read

Whose voice is the default, and whose gets misheard?

Gender bias in speech tech is rarely loud. It shows up as a higher error rate on one pitch range, a synthesized voice everyone assumed should be female, and assumptions about who is speaking that the acoustics quietly encode.

Two residents on the same hall, same age, same accent. One is a woman with a high, light voice; the other a man with a low, resonant one. Companion transcribes them at different accuracies, and neither of them can tell you why. Gender bias in speech systems is mostly invisible because it hides inside a number nobody at the bedside ever sees: which pitch range the model was optimized for.

The acoustics underneath

Voice differences associated with gender are not abstract — they are concrete acoustic features the model has to handle, and it handles them unevenly.

Fundamental frequency. Lower-pitched voices and higher-pitched voices land in different parts of the spectrum. If one is overrepresented in training, the recognizer's error rate splits along that axis.
Formant spacing. Vocal-tract length shifts where the formants sit. Models tuned on one distribution slightly mis-map vowels for the other.
Aging interacts with all of it. With age, many women's voices drop in pitch, particularly after menopause, while many men's rise. The neat training-time clusters blur exactly in our age range, so a model that leaned on gendered priors does worse on the very people we serve.
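To make "pitch range" concrete, here is a minimal sketch of estimating fundamental frequency from a short voiced frame and bucketing it into a cohort. It assumes a simple autocorrelation peak-pick, which is a textbook technique and not necessarily what any production recognizer or our own pipeline uses. The function names and the cohort boundaries (`LOW_MAX_HZ`, `HIGH_MIN_HZ`) are hypothetical placeholders, not measured values.

```python
import math


def estimate_f0(samples, sample_rate, f_min=60.0, f_max=400.0):
    """Return the pitch (Hz) whose lag maximizes the raw autocorrelation."""
    lag_min = int(sample_rate / f_max)  # shortest period considered
    lag_max = int(sample_rate / f_min)  # longest period considered
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


# Hypothetical cohort boundaries in Hz: illustrative placeholders only.
LOW_MAX_HZ = 140.0
HIGH_MIN_HZ = 190.0


def pitch_cohort(f0_hz):
    """Map an F0 estimate to a low/mid/high cohort label."""
    if f0_hz < LOW_MAX_HZ:
        return "low"
    if f0_hz >= HIGH_MIN_HZ:
        return "high"
    return "mid"


def sine(freq_hz, sample_rate=8000, seconds=0.1):
    """Synthetic stand-in for a voiced frame: a pure tone."""
    n = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


low_voice = pitch_cohort(estimate_f0(sine(100.0), 8000))
high_voice = pitch_cohort(estimate_f0(sine(200.0), 8000))
```

Real speech is messier than a pure tone: a usable estimator also needs voicing detection, windowing, and protection against octave errors. The point of the sketch is only that cohort membership can be derived from the audio itself and does not require asking anyone their gender.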

On our gender-sliced eval the gap is smaller than our accent or age gaps, but it is real and persistent: a few points of word error rate (WER) separate the cohorts, and crucially the gap does not always favor the same group across providers. One backend is better on lower voices, another on higher. An aggregate metric averages that away.
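A toy illustration of how an aggregate can hide opposite gaps. The sketch below computes word error rate with a standard word-level edit distance, then compares two hypothetical providers on made-up utterances. The provider names, cohort labels, and transcripts are invented for illustration and are not our eval data.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance and the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_cohort(samples):
    """samples: (cohort, reference, hypothesis) tuples -> {cohort: WER}."""
    totals = {}
    for cohort, ref, hyp in samples:
        errs, words = word_errors(ref, hyp)
        e, w = totals.get(cohort, (0, 0))
        totals[cohort] = (e + errs, w + words)
    return {c: e / w for c, (e, w) in totals.items() if w}


def overall_wer(samples):
    """Pooled WER over every sample, ignoring cohort labels."""
    pairs = [word_errors(ref, hyp) for _, ref, hyp in samples]
    return sum(e for e, _ in pairs) / sum(w for _, w in pairs)


# Made-up transcripts: each provider mishears one word, in opposite cohorts.
REF = "please call my daughter"
provider_a = [("low", REF, REF), ("high", REF, "please call my doctor")]
provider_b = [("low", REF, "please call my doctor"), ("high", REF, REF)]

sliced_a = wer_by_cohort(provider_a)
sliced_b = wer_by_cohort(provider_b)
aggregate_a = overall_wer(provider_a)
aggregate_b = overall_wer(provider_b)
```

In this toy case both providers post the same aggregate WER, yet one fails only higher voices and the other only lower ones. The sliced numbers are the only place that difference shows up.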

The default-voice problem

Synthesis carries a different kind of bias. The reflexive choice for an assistant voice — including the assistant on this device — is a warm, female, slightly subservient tone. That is a cultural default, not a clinical one, and it deserves scrutiny in a setting where residents have lifetimes of their own associations with who speaks to them and how. We do not think there is one correct answer, and we are wary of any product that ships one by accident.

What we do about it

We measure recognition separately for low-, mid-, and high-pitch cohorts and treat any provider whose gap exceeds a threshold as failing that slice, regardless of how good its average looks. Our adapter layer then routes per device to the backend that serves that resident's voice best. On the synthesis side, Companion's voice is a configurable preference, not a baked-in default, because the right voice for the woman in 214 who taught grade school for forty years may be nothing like the right voice for the man next door.
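The gate-then-route idea can be sketched in a few lines. This is one reading of the paragraph above, under stated assumptions: a provider "fails a slice" when that cohort's WER exceeds the provider's own best cohort by more than a threshold, and routing only considers providers that pass the resident's slice. The function names, the threshold `MAX_GAP`, and all WER figures are hypothetical and do not describe our actual adapter layer.

```python
def failing_slices(cohort_wer, max_gap):
    """Cohorts whose WER trails this provider's best cohort by > max_gap."""
    best = min(cohort_wer.values())
    return {c for c, wer in cohort_wer.items() if wer - best > max_gap}


def route_backend(provider_wer, resident_cohort, max_gap):
    """Pick the passing provider with the lowest WER for this cohort.

    Returns None when every provider fails the resident's slice, so the
    caller has to choose a fallback explicitly.
    """
    eligible = {
        name: wer[resident_cohort]
        for name, wer in provider_wer.items()
        if resident_cohort not in failing_slices(wer, max_gap)
    }
    if not eligible:
        return None
    return min(eligible, key=eligible.get)


# Illustrative numbers only, not measured results.
MAX_GAP = 0.05
provider_wer = {
    "backend_a": {"low": 0.08, "mid": 0.10, "high": 0.15},
    "backend_b": {"low": 0.16, "mid": 0.11, "high": 0.09},
}

choice_low = route_backend(provider_wer, "low", MAX_GAP)
choice_mid = route_backend(provider_wer, "mid", MAX_GAP)
choice_high = route_backend(provider_wer, "high", MAX_GAP)
```

Returning `None` instead of silently falling back is a deliberate choice in this sketch: if no backend serves a cohort within the threshold, that is a finding worth surfacing, not something to paper over with whichever average looks best.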

The aim is narrow and concrete: the device should not understand you better or worse because of the pitch you were born with or grew into. When the seam between your voice and the machine's is invisible, what is left is just two voices in a quiet room — which is the entire point of putting one there.

Tags: bias, fairness, tts

