Our RMS gate worked until the room changed, and rooms always change.
A fixed energy threshold is the right place to start and the wrong place to stay. Why a single RMS number fails across rooms and residents, and what spectral features and an adaptive noise floor buy us in the next iteration of on-device VAD.
The v1 voice activity detector on Companion is an RMS energy gate: every 20 ms frame of 16 kHz PCM gets an energy estimate, and anything above an RMS threshold of ~300 counts as speech. It is fast, it fits in almost no RAM, and it shipped. It also has one fatal assumption baked into that single number: that the difference between silence and speech is loudness. In a quiet bedroom that holds. Move the device to a room with a window unit cycling on, a roommate's game show, and an overhead page system, and the floor under that 300 moves with it.
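For readers who want the gate in concrete terms, here is a minimal Python sketch of the idea, not the firmware itself. The frame size follows from the numbers above (20 ms at 16 kHz is 320 samples); the function and constant names are illustrative assumptions, and the samples are assumed to be signed 16-bit PCM values.

```python
import math

SAMPLE_RATE_HZ = 16_000
FRAME_MS = 20
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 320 samples per frame
RMS_THRESHOLD = 300.0  # the fixed v1 threshold described in the post


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech_v1(frame, threshold=RMS_THRESHOLD):
    """v1-style decision: the frame is speech if it is loud enough, nothing more."""
    return frame_rms(frame) > threshold
```

Everything the gate knows about a frame is that one scalar, which is why a loud room and a quiet talker can look identical to it.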
One threshold cannot fit every room
We logged frame energy across our pilot units and the spread is the whole story. A quiet single room idles around an RMS of 80–120. A double with an active HVAC unit idles at 250–400, which straddles the 300 speech threshold before anyone says a word. Meanwhile a soft-spoken resident two feet from the mic can produce speech frames that peak around 350. The honest reading of that data is that there is no fixed number that admits the quiet talker and rejects the loud room, because the two distributions overlap. A threshold tuned for room 211 silently breaks in 214B.
The two failure modes are the ones you can hear. Set the gate low and the air conditioner opens a realtime session every few minutes — Companion talks into an empty room and burns audio minutes. Set it high and you miss the resident who asks, barely above a whisper, whether someone is coming. At bedside the false negative is the one that hurts: it reads as being ignored.
Loudness is the wrong axis; structure is the right one
Speech is not just energy, it is structured energy. The next iteration stops asking only how loud a frame is and starts asking what it is shaped like. A few cheap features separate the distributions that raw RMS collapses:
- Zero-crossing rate. Voiced speech crosses zero at a low, fairly steady rate, while broadband HVAC hiss crosses far more often; the sharp click of a bed rail also looks very different on this axis, even at identical energy.
- Spectral flux and a coarse band ratio. Speech energy concentrates in the 300 Hz–3 kHz band and changes frame to frame as phonemes move. A fan is loud, flat, and stationary — high energy, near-zero flux.
- Spectral centroid. Where the energy sits in frequency tells a vowel apart from a TV's compressed wash, which RMS treats as the same loud thing.
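To make the features above concrete, here is a rough Python sketch of each one. It assumes a coarse magnitude spectrum is already available as a short list of bin magnitudes with matching center frequencies; how those bins are computed is left open here. All function names are illustrative, and the flux definition (squared difference of normalized spectra) is one common choice, not necessarily the one the device will use.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (0.0 to 1.0)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def band_ratio(magnitudes, bin_freqs_hz, lo_hz=300.0, hi_hz=3000.0):
    """Share of spectral energy that falls inside the lo_hz to hi_hz band."""
    energy = [m * m for m in magnitudes]
    total = sum(energy)
    if total == 0:
        return 0.0
    in_band = sum(e for e, f in zip(energy, bin_freqs_hz) if lo_hz <= f <= hi_hz)
    return in_band / total


def spectral_flux(prev_magnitudes, magnitudes):
    """Sum of squared differences between the normalized spectra of two frames.

    Normalizing first makes the value reflect a change in shape, not loudness.
    """
    def normalize(mags):
        total = sum(mags)
        return [m / total for m in mags] if total > 0 else [0.0] * len(mags)

    prev, cur = normalize(prev_magnitudes), normalize(magnitudes)
    return sum((c - p) ** 2 for c, p in zip(cur, prev))


def spectral_centroid(magnitudes, bin_freqs_hz):
    """Magnitude-weighted mean frequency in Hz."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(m * f for m, f in zip(magnitudes, bin_freqs_hz)) / total
```

A stationary fan yields nearly the same normalized spectrum frame after frame, so its flux stays near zero however loud it is, while speech moves the spectrum as phonemes change.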
The floor has to move on its own
The bigger change is that the threshold stops being a constant. We track a slow-moving estimate of the room's noise floor — a running minimum of frame energy over a few seconds, updated only during frames that the spectral features say are not speech — and the decision becomes a margin above that floor rather than an absolute value. The window unit clicks on, the floor rises, and the gate rises with it instead of jamming open. A resident who has been silent for ten minutes is measured against the room as it is right now, not as it was at boot.
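A minimal sketch of that moving floor, under several assumptions: the class name, the 3-second window (150 frames of 20 ms), the multiplicative margin, and the starting floor are all placeholders chosen for illustration, not tuned values. The sketch also assumes the gate opens only when the spectral features call the frame speech-like and its energy clears the margin; how the two signals are actually combined is a design choice the text above does not pin down.

```python
from collections import deque


class AdaptiveFloorGate:
    """Tracks a running-minimum noise floor and gates on a margin above it."""

    def __init__(self, window_frames=150, margin=2.0, initial_floor=100.0):
        # Assumed defaults: 150 frames x 20 ms = 3 s of non-speech history.
        self.history = deque(maxlen=window_frames)
        self.margin = margin
        self.initial_floor = initial_floor

    @property
    def floor(self):
        """Minimum RMS over the recent non-speech frames."""
        return min(self.history) if self.history else self.initial_floor

    def update(self, rms, looks_like_speech):
        """Feed one frame; returns True if the gate should open.

        looks_like_speech is the verdict of the spectral features.
        """
        # Decide against the floor as it stood before this frame.
        is_speech = looks_like_speech and rms > self.floor * self.margin
        # Only frames the spectral features reject are allowed to move the floor.
        if not looks_like_speech:
            self.history.append(rms)
        return is_speech
```

Because old frames age out of the window, the floor climbs once the window unit has been running for a few seconds, and it drops again on the first quiet frame after the unit shuts off.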
All of this still has to run inside a 20 ms budget on the CoreS3 next to a live duplex audio path, which rules out anything heavy: no full FFT per frame if a handful of Goertzel bins will do. That constraint is the subject of the next post. The point of this one is narrower: the v1 gate was not wrong, it was incomplete. It answered a yes/no question with a single ruler, in a building where every room measures differently.
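As a rough preview of why a Goertzel bin is cheap, here is the textbook recurrence in Python. It computes the power at one target frequency with a single multiply-accumulate pass over the frame and no FFT. The function name and the choice of snapping the target to the nearest DFT bin are assumptions of this sketch, not a description of the firmware.

```python
import math


def goertzel_power(frame, target_hz, sample_rate_hz=16_000):
    """Squared DFT magnitude of the bin nearest target_hz, via Goertzel."""
    n = len(frame)
    if n == 0:
        return 0.0
    k = round(n * target_hz / sample_rate_hz)  # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = 0.0
    s_prev2 = 0.0
    for x in frame:
        # Second-order recurrence: one multiply and two adds per sample.
        s = x + coeff * s_prev - s_prev2
        s_prev2 = s_prev
        s_prev = s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2
```

Evaluating a handful of these across the 300 Hz to 3 kHz band would give the kind of coarse spectrum that a band ratio, flux, or centroid calculation needs, at a cost that grows with the number of bins rather than with a full transform.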
The resident in 214B does not know any of this changed. She knows that the device used to miss her when the air conditioner ran, and now it doesn't. That is the only metric that ships.