The most important things our residents say are the ones they say quietest.
Weak, breathy, hypophonic voices arrive at the bedside mic as low-SNR whispers. Crank the gain and you amplify the room; leave it low and you lose the resident. The tradeoff has no free lunch.
There is a pattern we noticed early in the pilot: the utterances that matter most arrive quietest. "I'm dizzy." "I think I fell." "It hurts." A resident who is in distress, frail, or frightened does not project. Pair that with Parkinsonian hypophonia or post-stroke vocal weakness, and the single most important sentence of the day reaches the microphone as a breathy near-whisper, well below the level our speech-to-text was trained on.
Quiet is a signal-to-noise problem, not a volume problem
It is tempting to think the fix is to turn up the gain. It isn't, because the bedside mic does not hear the resident in isolation: it hears the resident plus the room. What matters to the decoder is the ratio between the two. Our CoreS3 sits on the nightstand, roughly a meter from the resident's head, and at that distance:
- Speech falls off with distance; the room doesn't. HVAC hum, a roommate's TV, and hallway chatter are roughly uniform across the room. The resident's voice follows the inverse-square law, losing about 6 dB for every doubling of distance in a free field; the noise floor doesn't.
- A breathy voice has little energy where it counts. Breathiness means air leaking past the vocal folds — turbulent noise that masks the periodic, formant-rich part of speech the model actually reads.
- Hypophonia is inconsistent. Volume sags across a sentence, so the start of an utterance may be intelligible while the end falls below the noise floor, and the end is often the verb that carries the meaning.
On our hypophonic-speaker slice we routinely measure input SNR in the 5–12 dB range at bedside distance, against the 20 dB or more that off-the-shelf models implicitly assume. Below roughly 10 dB, word error rate (WER) climbs steeply and, worse, the model stays confident while its errors mount.
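To put rough numbers on the distance effect described above, here is a minimal sketch that assumes free-field inverse-square falloff. The voice and room levels are hypothetical placeholders chosen for illustration, not measurements from our pilot, and the function names are ours. Real rooms add reverberation, so actual falloff is usually gentler than this.

```python
import math

# Hypothetical levels for illustration only (not pilot measurements).
ROOM_NOISE_DB = 45.0         # assumed roughly uniform across the room
QUIET_VOICE_AT_1M_DB = 52.0  # assumed level of a weak voice at 1 m


def speech_level_db(level_at_ref_db: float, distance_m: float,
                    ref_m: float = 1.0) -> float:
    """Inverse-square falloff: the level drops by 20*log10(d/ref) dB,
    which is about 6 dB for every doubling of distance."""
    return level_at_ref_db - 20.0 * math.log10(distance_m / ref_m)


def snr_db(speech_db: float, noise_db: float) -> float:
    """SNR in dB is the difference between the two levels."""
    return speech_db - noise_db


for d in (0.25, 0.5, 1.0, 2.0):
    s = speech_level_db(QUIET_VOICE_AT_1M_DB, d)
    print(f"{d:4.2f} m: speech {s:5.1f} dB, SNR {snr_db(s, ROOM_NOISE_DB):5.1f} dB")
```

Under these assumptions the noise term never changes with distance; only the speech term does, which is why mic placement moves SNR and gain does not.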
The gain tradeoff has no free lunch
Raising the analog and digital gain lifts the quiet voice — and lifts the noise floor and the self-noise of the mic and preamp by the same amount. Past a point you are amplifying hiss, not speech, and you risk clipping the loud moments (a roommate's laugh, an overhead page) into distortion that is far worse for ASR than the original quiet. There is no gain setting that is right for both a whisper and a slammed door.
You cannot amplify your way out of a low SNR. You can only amplify the ratio you already have — and at the bedside that ratio is the whole game.
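The point can be checked numerically. The sketch below uses synthetic signals (a low-level tone standing in for a quiet voice, Gaussian noise standing in for the room, and a loud tone standing in for a slammed door), all of which are assumptions for illustration rather than recorded audio. It applies the same gain to speech and noise, then counts how many samples of the loud transient hit full scale.

```python
import math
import random

SAMPLE_RATE = 16000
N = SAMPLE_RATE  # one second of audio


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech, noise):
    return 20.0 * math.log10(rms(speech) / rms(noise))


def apply_gain(samples, gain_db, full_scale=1.0):
    """Linear gain followed by hard clipping at +/- full scale."""
    g = 10.0 ** (gain_db / 20.0)
    return [max(-full_scale, min(full_scale, s * g)) for s in samples]


rng = random.Random(0)  # seeded so the run is repeatable
# Synthetic stand-ins, not real recordings.
speech = [0.02 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(N)]
noise = [rng.gauss(0.0, 0.005) for _ in range(N)]
door_slam = [0.5 * math.sin(2 * math.pi * 80 * n / SAMPLE_RATE) for n in range(N)]

GAIN_DB = 20.0
before = snr_db(speech, noise)
# Gain is linear, so scaling each part equals scaling the mixture.
after = snr_db(apply_gain(speech, GAIN_DB), apply_gain(noise, GAIN_DB))
clipped = sum(1 for s in apply_gain(door_slam, GAIN_DB) if abs(s) >= 1.0)

print(f"SNR before gain: {before:.2f} dB, after gain: {after:.2f} dB")
print(f"door-slam samples clipped at +{GAIN_DB:.0f} dB: {clipped} of {N}")
```

The two SNR figures match, while the transient that was comfortably inside the range before the gain change is now mostly flattened against full scale.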
What we do about it
- Adaptive gain, not fixed. We track the running noise floor on-device and adjust input gain to keep the quiet voice well above it without clipping the loud transients. The target is a stable working range, not maximum loudness.
- Endpoint on speech, not just energy. A breathy utterance can be quieter than a noisy room. Energy-only voice-activity detection misses it; we use spectral cues so a low-level voice still trips the turn.
- Confidence-gated confirmation on the words that matter. When a soft utterance lands on "dizzy", "fell", or "hurts" with low confidence, Companion asks once rather than discarding it. A missed "I fell" is the failure we refuse to accept.
- Per-resident microphone calibration. A device on a hypophonic resident's nightstand is provisioned with a different gain and VAD profile than one across the hall. One size does not fit the floor.
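Two of the items above, adaptive gain and confidence-gated confirmation, can be sketched in a few lines. This is an illustration under stated assumptions, not our production code: the class, the function, every threshold, and the smoothing rates are hypothetical. Because gain alone cannot change the ratio between voice and room, the sketch interprets "adaptive gain" as placing the estimated noise floor at a fixed level so speech lands in a predictable working range, capped so that recent peaks keep some headroom.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveGain:
    """Hypothetical gain controller; all defaults are illustrative."""
    target_floor_dbfs: float = -45.0  # where we want the noise floor to sit
    headroom_db: float = 6.0          # keep recent peaks this far below 0 dBFS
    max_gain_db: float = 30.0
    floor_dbfs: float = -60.0         # running noise-floor estimate
    rise: float = 0.05                # floor creeps up slowly (ignores speech bursts)
    fall: float = 0.5                 # floor drops quickly when the room goes quiet

    def update_floor(self, frame_dbfs: float) -> None:
        """Asymmetric smoothing that follows the quiet frames."""
        rate = self.fall if frame_dbfs < self.floor_dbfs else self.rise
        self.floor_dbfs += rate * (frame_dbfs - self.floor_dbfs)

    def gain_db(self, recent_peak_dbfs: float) -> float:
        """Gain that lifts the floor to its target without clipping recent peaks."""
        wanted = self.target_floor_dbfs - self.floor_dbfs
        clip_limit = -self.headroom_db - recent_peak_dbfs
        return max(0.0, min(self.max_gain_db, wanted, clip_limit))


# Words named in the text above; the threshold is an assumption.
CRITICAL_WORDS = {"dizzy", "fell", "hurts"}


def needs_confirmation(transcript: str, confidence: float,
                       threshold: float = 0.6) -> bool:
    """Ask once when a critical word arrives with low confidence."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return confidence < threshold and bool(words & CRITICAL_WORDS)


agc = AdaptiveGain()
for _ in range(200):             # a steady room at -55 dBFS
    agc.update_floor(-55.0)
quiet_gain = agc.gain_db(recent_peak_dbfs=-40.0)
loud_gain = agc.gain_db(recent_peak_dbfs=-10.0)  # e.g. after an overhead page
print(f"quiet room gain: {quiet_gain:.1f} dB, after loud peak: {loud_gain:.1f} dB")
print(needs_confirmation("I think I fell.", confidence=0.4))
```

A per-resident profile would then amount to a different set of constructor arguments and a different confirmation threshold for each device, which is one way to read the calibration item above.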
The resident in 214B says "I'm dizzy" the way she says everything, barely above a breath. Our job is to make sure that the quietest sentence she utters all day is the one we are most certain to catch.