Engineering · May 24, 2025 · 3 min read

There is no best TTS. There is only the right one for 214B.

Cloud realtime, neural vocoders, on-device synthesis — the TTS landscape is wide and none of it was built for a hospital room. How we evaluate and choose per facility.

Every TTS demo you have ever heard was recorded in a quiet room with a good speaker and a listener who hears fine. None of those things are true at the bedside. The voice has to come out of a one-inch driver in a plastic enclosure, compete with a TV two beds over, and be understood by someone who is 88 and takes their hearing aids out at night. That is the problem the framework choice actually has to solve, and it is why we don't have a single answer.

Three families, three sets of compromises

The market sorts into three rough families, and each one is a different bet about where you spend latency, money, and control.

Cloud realtime/streaming (OpenAI Realtime, ElevenLabs ConvAI, Grok). Synthesis is fused into the conversation loop and audio streams back as the model generates. Best naturalness and lowest engineering cost, but you pay per minute, you depend on the network, and you get only the prosody controls the vendor exposes.
Neural vocoder pipelines (a separate acoustic model plus a vocoder you host). Maximum control — you own the lexicon, the prosody, the voice — at the cost of running GPUs, owning the latency budget, and maintaining a model nobody else patches for you.
On-device synthesis (a small model or formant engine on the MCU). Zero network dependence and zero marginal cost, but the CoreS3 simply cannot run a good neural vocoder, so quality drops to something robotic.

The axes that actually matter for eldercare

When we score a candidate framework, the consumer-product axes (does it sound like a podcast host?) are near the bottom of the list. The ones at the top are the ones a 30-something engineer never feels:

Time to first audio. A half-second of dead air after a question reads as "the thing is broken" to an elder, who will then repeat themselves and desync the turn.
Intelligibility under loss. How it sounds on a small speaker, in a noisy room, to a presbycusic ear, not how it sounds in headphones.
Prosody control. Can we slow it down, add real pauses, and override a name's pronunciation, or are we stuck with the default voice's cadence?
Failure behavior. When the network drops mid-sentence, does it degrade gracefully or cut to silence?
Cost at fleet scale. Per-minute pricing that is fine in a demo becomes a line item across a wing of always-on devices.
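One way to turn those axes into a comparison is a plain weighted score. The sketch below is illustrative only: the axis keys, the weights, and the 0-to-1 rating scale are assumptions made up for the example, not the numbers we actually use, and the ratings themselves would have to come from measurements on the real device in the real room.

```python
# Hypothetical weights for the five axes above. Illustrative values only;
# a real evaluation would set these per facility.
WEIGHTS = {
    "time_to_first_audio": 0.30,
    "intelligibility": 0.30,
    "prosody_control": 0.15,
    "failure_behavior": 0.15,
    "fleet_cost": 0.10,
}


def score(ratings: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-axis ratings, each on an assumed 0-to-1 scale."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total = sum(weights.values())
    return sum(weights[axis] * ratings[axis] for axis in weights) / total


def rank(candidates: dict[str, dict[str, float]]) -> list[str]:
    """Return candidate names ordered from highest score to lowest."""
    return sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
```

Changing the weights per facility is the point: a site with a lot of hearing loss would push `intelligibility` up, and a site on cellular would push `failure_behavior` up, and the ranking can flip without any vendor changing.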

Why the choice is per-facility, not global

A facility with rock-solid wired networking and a population with a lot of hearing loss wants a cloud framework tuned hard for slow, clear delivery. A rural site on cellular with frequent dropouts wants something that fails soft and leans on shorter, cacheable lines. Our adapter layer already routes each device to a provider from Firestore config; the TTS choice rides the same rail. We pick per device, measure on the real hardware in the real room, and re-pick when a facility's network or population changes.
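A minimal sketch of what per-device routing with a soft failure path could look like. Everything named here is hypothetical: the `tts_provider` field, the `TtsRouter` class, and the provider registry are stand-ins for illustration, and the device config is a plain dict standing in for whatever the Firestore document actually holds. Real providers would stream audio rather than return it in one piece.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

# A synthesizer takes text and returns audio bytes (simplified: no streaming).
Synth = Callable[[str], bytes]


@dataclass
class TtsRouter:
    providers: Mapping[str, Synth]  # hypothetical registry: name -> synth function
    cache: dict[str, bytes]         # pre-rendered audio for short, common lines
    default_provider: str

    def provider_for(self, device_config: Mapping[str, str]) -> str:
        """Pick the provider named in the device's config, else the default."""
        # "tts_provider" is an assumed field name, not a documented one.
        name = device_config.get("tts_provider", self.default_provider)
        return name if name in self.providers else self.default_provider

    def speak(self, device_config: Mapping[str, str], text: str) -> bytes:
        """Synthesize text, falling back to a cached line if the network fails."""
        name = self.provider_for(device_config)
        try:
            return self.providers[name](text)
        except ConnectionError:
            # Fail soft: play a pre-rendered line instead of cutting to silence.
            cached = self.cache.get(text)
            if cached is None:
                raise
            return cached
```

Re-picking for a facility then means changing one config value per device, and the cache is what lets a site with frequent dropouts keep answering with its most common short lines.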

So there is no leaderboard entry that ends this. The resident in 214B does not care which vendor synthesized the sentence. She cares that when she asked whether her daughter had called, the answer came back quickly, clearly, and warm enough that she believed it. Choosing the framework is just the least visible part of making that true.


