The CoreS3 can't say her name nicely. So it doesn't try.
On-device TTS would mean zero latency and zero network dependence. It would also mean a robot voice in a hospital room. Why Companion synthesizes in the cloud, and what we keep on the chip anyway.
The cleanest possible architecture for a bedside device is one that needs nothing else: capture audio, think, speak, all on the chip in the room. No network round trip, no per-minute bill, no dependence on a Wi-Fi access point a maintenance contractor will unplug someday. We wanted that. Then we tried to make the CoreS3 actually speak, and the gap between clean architecture and a voice an 88-year-old will accept turned out to be enormous.
Why a good neural vocoder won't fit
Modern natural TTS is two heavy stages: an acoustic model that turns text into a spectrogram, and a neural vocoder that turns the spectrogram into waveform samples. The vocoder is the expensive part. At a 16 kHz output rate it has to produce 16,000 samples for every second of speech, and it has to produce them at least as fast as they are played back, or the speech stutters. That is hundreds of millions of multiply-accumulates per second of audio, against model weights measured in tens of megabytes.
The ESP32-S3 has a few hundred KB of usable RAM, single-digit megabytes of PSRAM, and no GPU. The weights don't fit in flash, the activations don't fit in RAM, and even if you quantized hard enough to squeeze something in, it would not run faster than real time. What does fit on a microcontroller is the old generation — formant synthesis or tiny concatenative engines. Those work. They also sound like 1998. In a room with someone who is frail and possibly frightened, a robotic voice isn't a quality nit; it reads as cold, and cold is the one thing we cannot ship.
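The arithmetic behind that paragraph can be sketched in a few lines. This is a back-of-envelope sketch, not a benchmark: only the 16,000 samples per second comes from the text above. Every other number is an illustrative placeholder picked from inside the ranges the text gives ("hundreds of millions", "tens of megabytes", "single-digit megabytes"), and `device_macs_per_second` is a hypothetical sustained throughput, not a measured ESP32-S3 figure.

```python
SAMPLE_RATE_HZ = 16_000  # from the text: samples per second of speech


def realtime_factor(macs_per_audio_second, device_macs_per_second):
    """Seconds of compute per second of audio. Above 1.0, playback stutters."""
    return macs_per_audio_second / device_macs_per_second


def fits(needed_bytes, budget_bytes):
    """True if a model component fits inside a memory budget."""
    return needed_bytes <= budget_bytes


# Illustrative placeholders, NOT measurements.
vocoder_macs_per_audio_second = 300e6   # "hundreds of millions"
vocoder_weights_bytes = 20 * 1024**2    # "tens of megabytes"
device_macs_per_second = 100e6          # hypothetical sustained throughput
psram_bytes = 8 * 1024**2               # "single-digit megabytes"

macs_per_sample = vocoder_macs_per_audio_second / SAMPLE_RATE_HZ
rtf = realtime_factor(vocoder_macs_per_audio_second, device_macs_per_second)
weights_fit = fits(vocoder_weights_bytes, psram_bytes)

print(f"MACs per output sample: {macs_per_sample:,.0f}")
print(f"real-time factor: {rtf:.1f} (must stay below 1.0)")
print(f"weights fit in PSRAM: {weights_fit}")
```

Swap in real numbers for a specific vocoder and a measured throughput and the same two checks apply: the real-time factor has to stay under 1.0, and the weights plus activations have to fit in memory at the same time.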
The tradeoff, named honestly
So the choice is real and it has teeth on both sides.
- On-device: no network dependence, no marginal cost, lowest possible latency — but a voice quality floor far below what eldercare needs, and no room for good prosody control.
- Cloud realtime: natural voice, rich prosody, the provider flexibility our adapter already gives us — but it lives and dies on the network, costs per minute, and pays a round-trip latency tax on every turn.
We chose cloud, with eyes open. The voice quality is non-negotiable, and the prosody controls that make Companion kind to elders only exist on the cloud path. The price is that we now own a latency budget and a failure mode we'd love to not have.
What the chip earns anyway
Cloud synthesis doesn't mean the device is dumb. The CoreS3 still owns the parts of the experience that must be instant and must survive a dropout. It does its own voice activity detection and the half-duplex mic gate locally, so listening never waits on the network. It buffers and plays the streamed audio smoothly through a jitter cushion. And for a small, fixed set of lines (an acknowledgment chirp, a "one moment", a connection-lost message) we keep pre-synthesized cloud-quality audio in flash, so even when the network is gone the room is never silent in a way that frightens someone. The synthesis is in the cloud; the responsiveness is on the chip.
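To make the jitter cushion and the canned fallback lines concrete, here is a minimal host-side sketch in Python. It is a model of the logic, not Companion's firmware. The class name, the frame counts, and the `one_moment` clip id are all hypothetical, and the sketch assumes `tick()` is only called while a reply is expected.

```python
from collections import deque


class JitterBuffer:
    """Releases streamed frames only after a cushion builds up.

    Hypothetical sketch: all names and thresholds are assumptions.
    """

    def __init__(self, cushion_frames=5, stall_frames=25):
        self.frames = deque()
        self.cushion_frames = cushion_frames  # frames to collect before playing
        self.stall_frames = stall_frames      # empty ticks before the canned line
        self.playing = False
        self.empty_ticks = 0

    def push(self, frame):
        """Called when an audio frame arrives from the network."""
        self.frames.append(frame)

    def tick(self):
        """Called once per frame period by the audio output."""
        if not self.playing and len(self.frames) >= self.cushion_frames:
            self.playing = True
        if self.playing and self.frames:
            self.empty_ticks = 0
            return ("stream", self.frames.popleft())
        # Either still filling the cushion or an underrun: rebuild before resuming.
        self.playing = False
        self.empty_ticks += 1
        if self.empty_ticks == self.stall_frames:
            # Play the pre-synthesized clip stored in flash, once per stall.
            return ("fallback", "one_moment")
        return ("silence", None)


def mic_gate_open(kind):
    """Half-duplex gate: the mic listens only while the speaker is silent."""
    return kind == "silence"


buf = JitterBuffer(cushion_frames=3, stall_frames=4)
for frame in ("a", "b", "c"):
    buf.push(frame)
kind, payload = buf.tick()
print(kind, payload, "mic open:", mic_gate_open(kind))
```

The design choice worth noticing is that nothing in `tick()` waits on the network. Each frame period it returns streamed audio, silence, or a locally stored line, which is what keeps the room from going quiet when the stream stalls.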
The resident in 214B never hears the seam. She asks a question, the answer comes back in a warm human-sounding voice, and if the Wi-Fi has hiccupped she hears a calm "give me just a moment" instead of dead air. The robot voice that would have fit on the chip never enters the room. That absence is the whole point.