A skilled-nursing ward at 4pm is one of the loudest places we've ever tried to listen in.
TVs, bed alarms, overhead pages, roommates, and hallway carts make a real ward a punishing acoustic environment. Clean-room ASR benchmarks don't survive contact with it. The fix lives on-device and on the server, not in either alone.
The first time we put a measurement mic on a real ward, the numbers were sobering. The demo videos and the benchmark sets are recorded in quiet rooms by cooperative speakers. An actual skilled-nursing wing at four in the afternoon is a wall of sound: a TV at volume in the next bed, a bed alarm somewhere down the hall, an overhead page, a med cart rattling past, two aides talking in the doorway. Our bedside device has to pull one frail resident's voice out of all of it, in real time, from a meter away.
The noise is structured, persistent, and speech-shaped
What makes a ward hard is not loudness alone. It is the kind of noise, and almost all of it is adversarial to ASR in a specific way:
- TVs and roommates produce competing speech. This is the worst case. Noise suppression can attenuate a fan; it cannot easily tell which voice is the resident and which is the anchor on the evening news.
- Alarms and pages are loud, transient, and tonal. A bed alarm sits right in the speech band and can clip the front end, blowing out the very utterance it overlaps.
- HVAC and equipment hum is constant. Low-level, ever-present, and it drags the noise floor up all day, eroding SNR for every quiet resident on the floor.
- Carts, doors, and footsteps are impulsive. Sudden broadband transients that confuse voice-activity detection into starting or ending turns at the wrong moment.
Across our pilot wards, bedside SNR during the busy afternoon block routinely sits in the 3 to 10 dB range, and the resident's own voice is frequently the quietest thing in it. A model that posts 5% WER on a benchmark can post north of 25% on the same words spoken into that room.
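For readers who want to pin down the two numbers quoted above, here is a minimal Python sketch of how SNR in decibels and word error rate are conventionally computed. It is an illustration only, not our evaluation harness: the function names `snr_db` and `wer` are hypothetical, and the example transcript is invented.

```python
import math


def snr_db(speech_rms: float, noise_rms: float) -> float:
    """Signal-to-noise ratio in dB, from RMS amplitudes of speech and noise."""
    return 20.0 * math.log10(speech_rms / noise_rms)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word was dropped
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one wrong word out of four is a WER of 0.25.
print(wer("please call my nurse", "please call the nurse"))
# A voice with twice the RMS amplitude of the room is only about 6 dB above it.
print(round(snr_db(2.0, 1.0), 1))
```

The second line is the sobering part: at the low end of the range above, the resident's voice is barely louder in amplitude than everything else in the room.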
Why neither the device nor the server can fix it alone
We deliberately split the work. The device cannot run a large ASR model — it is an ESP32-S3 with kilobytes to spare, not a GPU. The server cannot fix what the device throws away or fails to capture. So the pipeline is a collaboration: the CoreS3 does cheap, conservative gating and clean capture; the server does the heavy transcription on audio that is already as good as the front end can make it.
The device's job is not to understand the resident. It is to make sure the server gets a fair chance to.
What we do about it
- On-device VAD and gating. Conservative voice-activity detection keeps the device from streaming a roommate's TV all evening, and it keeps the channel quiet so the server isn't transcribing the hallway. We tuned it to favor catching the resident over rejecting noise: a missed call for help costs more than a wasted second of audio.
- Adaptive noise-floor tracking. The device continuously estimates the room's floor and adjusts gating and gain so a quiet afternoon and a chaotic one are handled differently, not with one fixed threshold.
- Server-side robustness and context. The heavy model gets the cleanest stream we can hand it, plus per-resident vocabulary bias and multi-turn context so it can resolve a noisy word from what came before.
- Confidence-gated confirmation. When the room wins a round and confidence drops on a high-stakes turn, Companion asks again instead of guessing into the noise.
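To make the gating ideas above concrete, here is a rough Python sketch of an energy gate whose threshold rides on an adaptive noise-floor estimate, plus a one-line confidence check. It is a simplified illustration under stated assumptions, not the firmware (which does not run Python): the names `NoiseFloorGate`, `frame_level_db`, and `needs_confirmation`, and every constant in it, are hypothetical placeholders rather than tuned values.

```python
import math
from dataclasses import dataclass, field
from typing import Sequence


def frame_level_db(samples: Sequence[float]) -> float:
    """RMS level of one audio frame in dBFS, for samples scaled to [-1, 1]."""
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_sq, 1e-12))  # clamp to avoid log(0)


@dataclass
class NoiseFloorGate:
    """Gate that opens when a frame rises a margin above a tracked noise floor."""
    floor_db: float = -60.0     # initial floor estimate (assumed)
    margin_db: float = 6.0      # how far above the floor counts as speech (assumed)
    rise: float = 0.02          # floor creeps up slowly as the room gets louder
    fall: float = 0.2           # floor drops quickly when the room quiets down
    hangover_frames: int = 10   # stay open briefly so trailing words are kept
    _hang: int = field(default=0, init=False, repr=False)

    def update(self, level_db: float) -> bool:
        """Feed one frame level; return True if the frame should be streamed."""
        # Asymmetric smoothing: follow quiet frames fast, loud frames slowly.
        rate = self.fall if level_db < self.floor_db else self.rise
        self.floor_db += rate * (level_db - self.floor_db)
        if level_db > self.floor_db + self.margin_db:
            self._hang = self.hangover_frames
            return True
        if self._hang > 0:
            self._hang -= 1  # hangover errs toward keeping audio
            return True
        return False


def needs_confirmation(confidence: float, high_stakes: bool,
                       threshold: float = 0.8) -> bool:
    """Ask the resident again only when a high-stakes turn scored low."""
    return high_stakes and confidence < threshold


# The same -40 dBFS frame opens the gate in a quiet room but not a loud one.
quiet_room = NoiseFloorGate(floor_db=-55.0)
loud_room = NoiseFloorGate(floor_db=-30.0)
print(quiet_room.update(-40.0), loud_room.update(-40.0))
```

The small margin and the hangover are where the bias toward catching the resident shows up in a design like this: both trade a little extra streamed noise for fewer clipped or missed utterances.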
We don't pretend we beat the room. We measure WER on real ward audio, not benchmarks, precisely so we never fool ourselves. But the resident in 214B, trying to say something quiet while the TV blares next door, deserves a device that fights to hear her — and that fight is split, on purpose, between the chip on her nightstand and the model in the cloud.