The most natural voice was the one she couldn't understand.
TTS leaderboards optimize for naturalness: how human a voice sounds. For a hard-of-hearing 88-year-old, the most natural voice is often the least intelligible. We tune for comprehension instead.
We ran a side-by-side at one of our pilot sites: the same sentence, two voices. One was a state-of-the-art natural voice that sounds indistinguishable from a person: breathy, casual, fast. The other was plainer, slower, more deliberate. The natural one is the kind of voice that wins consumer benchmarks. When we played it, the resident asked us to repeat it. The plain one, she got the first time. Naturalness and intelligibility are not the same axis, and at the bedside they often pull in opposite directions.
Why natural can mean unclear
The tricks that make a synthetic voice sound human are exactly the tricks that hurt a hard-of-hearing listener. Natural speech coarticulates — sounds blur into their neighbors. It reduces unstressed syllables to almost nothing. It speeds up on familiar phrases and uses casual, breathy phonation with weak consonants. A younger ear reconstructs the missing pieces effortlessly. An older ear often can't, for reasons that are physiological, not attentional:
- Presbycusis — age-related high-frequency loss — eats the consonants (s, f, t, th) that carry most of the meaning, while vowels stay audible. Natural voices underarticulate exactly those consonants.
- Slower auditory processing means fast, coarticulated speech arrives faster than it can be decoded.
- A small bedside speaker plus room noise strips the subtle cues a natural voice relies on, leaving only the strong, clear ones — which the natural voice deliberately softened.
What we tune for instead
We optimize for the listener getting it the first time, which is a different target from closing the naturalness gap with human speech. Concretely, our voice config and prosody layer push toward clear-speech characteristics: the way a thoughtful person talks to someone who's straining to hear, without sliding into the slow, loud, patronizing register that elders rightly hate.
- A measurably reduced speaking rate, with the time spent on pauses between clauses rather than on dragging out vowels.
- Crisper consonants — we favor voices and settings that don't swallow word endings, because that's where the meaning lives.
- Real pauses at clause boundaries, giving slower processing time to catch up before the next idea.
- Even, steady prosody rather than the dramatic pitch swings that read as natural but smear intelligibility on a small speaker.
- One idea per turn, so comprehension never has to hold three things at once.
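As a rough illustration of the rate, pause, and one-idea-per-turn items in the list above, here is a minimal sketch that wraps text in standard SSML `<prosody>` and `<break>` elements. The function names, the 85% rate, and the 400 ms pause are assumptions chosen for the example, not Companion's actual configuration. Consonant crispness and even prosody depend mostly on voice selection, so they are not shown.

```python
import re
from xml.sax.saxutils import escape


def split_turns(text):
    """One idea per turn: treat each sentence as its own utterance."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def to_clear_speech_ssml(text, rate="85%", pause_ms=400):
    """Wrap one utterance in SSML with a reduced rate and clause-boundary pauses.

    rate and pause_ms are illustrative defaults (assumptions), not tuned values.
    """
    # Split after clause punctuation so the extra time goes into pauses,
    # not into stretched vowels.
    clauses = re.split(r"(?<=[,;:])\s+", text.strip())
    clauses = [c.strip() for c in clauses if c.strip()]
    pause = f'<break time="{pause_ms}ms"/>'
    body = pause.join(escape(c) for c in clauses)
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'


if __name__ == "__main__":
    message = "The nurse will be by after dinner, so you can rest. Press the button if you need her sooner."
    for turn in split_turns(message):
        print(to_clear_speech_ssml(turn))
```

How a given TTS engine interprets `rate` and `break` varies, so any values like these would need checking by ear on the target speaker.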
There's a real cost. Tuned this way, Companion would lose a naturalness shoot-out against a flagship consumer voice: it sounds a touch more deliberate, a touch less like an offhand human. We accept that trade every time, because the metric we actually care about isn't "does this sound human?" It's "did she understand it without asking us to say it again?" We measure that the only way that counts: with real elder listeners, on the real hardware, in real rooms.
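One way to put a number on "understood without a repeat" is a first-pass comprehension rate per voice: the share of utterances the listener got the first time. The sketch below assumes each trial is logged as a hypothetical (voice, understood_first_time) pair. That record shape and the function name are assumptions for illustration, and the sample trials are placeholders, not pilot data.

```python
from collections import defaultdict


def first_pass_rate(trials):
    """Return {voice: fraction of trials understood without a repeat request}.

    trials: iterable of (voice, understood_first_time) pairs (hypothetical log shape).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for voice, understood in trials:
        totals[voice] += 1
        if understood:
            hits[voice] += 1
    return {voice: hits[voice] / totals[voice] for voice in totals}


if __name__ == "__main__":
    # Placeholder records to show the call shape only; not measured results.
    sample = [
        ("plain", True),
        ("plain", True),
        ("natural", False),
        ("natural", True),
    ]
    print(first_pass_rate(sample))
```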
The resident in 214B has her hearing aids out for the night. Companion tells her the nurse will be by after dinner, clearly enough that she settles back down without pressing for a repeat. The flashier voice would have impressed an engineer and lost her on the consonants. The plainer one, tuned for her ear, did the job. Being understood is the product. Sounding human is just a means, and only when it doesn't get in the way.