Engineering · August 9, 2025 · 3 min read

We wanted Silero on the CoreS3. Here is the budget that decided otherwise.

A small neural VAD is dramatically better than an energy gate at telling a whisper from a window unit. The question is not whether it is better. It is whether it fits in flash, RAM, and a 20 ms frame on an ESP32-S3, and what we give up by running it server-side instead.

Once you accept that loudness is the wrong axis for voice activity detection, the obvious next move is a learned model. A small neural VAD — the Silero-style RNN that everyone reaches for — was trained on exactly the problem we keep losing: separating speech from noise when the noise is louder than the speech. On our held-out bedside clips it cut false triggers from HVAC and television by more than half versus our spectral-feature gate. The accuracy question was never in doubt. The deployment question was the entire fight.

The budget on the CoreS3 is unforgiving

The CoreS3 is an ESP32-S3 with 512 KB of internal SRAM and 8 MB of PSRAM, and almost none of it is free. The firmware is already running a full-duplex I2S audio path, a WebSocket session, the echo-suppression logic, and the conversation state machine. The audio ring buffers and TLS are the heaviest tenants. So a neural VAD does not get to assume a clean machine. It has to fit in the slack left over after real-time audio has taken its cut.

Flash. A quantized int8 Silero-class model is roughly 1–2 MB of weights. Tolerable against our app partition, but not free.
RAM. The model's tensor arena plus its recurrent hidden state wants a few hundred KB of fast SRAM — and fast SRAM is the resource we have the least of. Spilling the arena to PSRAM is possible but the per-frame access pattern punishes you for it.
Compute. The decision has to land inside the 20 ms between frames, sustained, while echo suppression and the socket also want the CPU. An int8 RNN inference per frame is feasible on the S3's vector instructions, but it eats into a budget that is already accounted for.
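The compute constraint can be turned into a back-of-envelope check. The sketch below is illustrative only: the 20 ms frame comes from the text and 240 MHz is the ESP32-S3's maximum core clock, but the 16 kHz sample rate, the CPU share left for the VAD, and the cycles-per-MAC figure are assumptions to be replaced with measured values.

```python
FRAME_MS = 20             # from the text: one decision per 20 ms frame
SAMPLE_RATE_HZ = 16_000   # assumption: 16 kHz mono capture
CPU_HZ = 240_000_000      # ESP32-S3 maximum core clock


def samples_per_frame(rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """PCM samples that arrive during one frame."""
    return rate_hz * frame_ms // 1000


def cycles_per_frame(cpu_hz=CPU_HZ, frame_ms=FRAME_MS):
    """Total cycles one core offers during one frame."""
    return cpu_hz * frame_ms // 1000


def vad_cycle_budget(cpu_share):
    """Cycles the VAD may spend per frame, given its share of one core.

    cpu_share is hypothetical: whatever is left after the audio path,
    echo suppression, and the socket have taken theirs.
    """
    return int(cycles_per_frame() * cpu_share)


def fits(macs_per_frame, cycles_per_mac, cpu_share):
    """True if the model's per-frame work fits inside the VAD's budget."""
    return macs_per_frame * cycles_per_mac <= vad_cycle_budget(cpu_share)
```

Under these assumptions a frame is 320 samples and 4.8 million cycles. A model needing 500,000 multiply-accumulates at an assumed 2 cycles each fits in a quarter of a core; one needing a million does not.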

Edge versus server is a latency argument, not an accuracy one

The tempting alternative is to skip the squeeze entirely and run the neural VAD server-side in the Go adapter, where flash and RAM are not the constraint. Accuracy would be identical or better. But moving the VAD off the device defeats the reason VAD exists on the device. The whole point of the on-board gate is that we do not stream every frame upstream — every snore and every game show would otherwise open a session, burn audio minutes, and add a network round trip to the start of speech. A server-side VAD means the device is always streaming so the server can decide, which is the cost we built the gate to avoid.
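The cost of always streaming is easy to put a rough number on. This is a sketch under stated assumptions, not a measurement: it assumes uncompressed 16-bit mono PCM at 16 kHz, and the one hour of open stream per day is a made-up figure for comparison.

```python
SAMPLE_RATE_HZ = 16_000         # assumption: 16 kHz capture
BYTES_PER_SAMPLE = 2            # assumption: 16-bit mono PCM, uncompressed
SECONDS_PER_DAY = 24 * 60 * 60


def upstream_bytes(open_seconds):
    """Audio bytes sent upstream while the stream is open."""
    return open_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE


always_on = upstream_bytes(SECONDS_PER_DAY)   # server-side VAD only
gated = upstream_bytes(60 * 60)               # hypothetical: 1 hour open per day
```

With these numbers an always-streaming device sends about 2.76 GB of audio per day, 24 times what a gate that opens for one hour would send. Every one of those extra bytes is also billed as audio time upstream.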

So the real architecture is not edge or server. It is a cheap, conservative gate on the device whose only job is to decide when to open the stream, and a heavier model upstream that refines turn boundaries once audio is already flowing. The edge gate can be biased slightly toward over-triggering, because a false open that the server closes in 200 ms is cheap; a miss on the device is unrecoverable, since audio that was never streamed cannot be reconsidered.
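One way to express that asymmetry is a gate that opens on weak evidence and closes only after sustained quiet. The sketch below is a hypothetical illustration, not our firmware: the class name, the thresholds, and the frame counts are all assumptions, and it takes a per-frame speech score in [0, 1] from whatever model sits in front of it.

```python
class EdgeGate:
    """Hypothetical open/close gate biased toward opening the stream."""

    def __init__(self, open_threshold=0.35, open_frames=2, hangover_frames=25):
        self.open_threshold = open_threshold    # low on purpose: prefer false opens
        self.open_frames = open_frames          # consecutive hits needed to open
        self.hangover_frames = hangover_frames  # quiet frames needed to close
        self.is_open = False
        self._hits = 0
        self._quiet = 0

    def update(self, speech_score):
        """Feed one per-frame score; return True while the stream should be open."""
        if speech_score >= self.open_threshold:
            self._hits += 1
            self._quiet = 0
        else:
            self._hits = 0
            self._quiet += 1

        if not self.is_open and self._hits >= self.open_frames:
            self.is_open = True
        elif self.is_open and self._quiet >= self.hangover_frames:
            self.is_open = False
        return self.is_open

    def server_close(self):
        """The upstream model decided this was a false open; shut the stream."""
        self.is_open = False
        self._hits = 0
        self._quiet = 0


gate = EdgeGate(open_threshold=0.35, open_frames=2, hangover_frames=3)
states = [gate.update(s) for s in [0.1, 0.4, 0.5, 0.1, 0.1, 0.1, 0.1]]
```

Opening takes two frames of modest evidence while closing takes a longer run of quiet, so the gate errs toward the cheap mistake. The `server_close` hook stands in for the upstream model overruling a false open.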

What we can actually fit

Our current answer is a tiny model on the edge: a two-layer GRU that operates on the same band features the spectral gate uses rather than on raw PCM, which shrinks the input and the arena enough to live comfortably in SRAM and run well under 20 ms. It is not full Silero. It does not have to be. It only has to be good enough to keep the stream closed when there is no resident speaking, and to open fast when there is. The full-strength neural VAD lives server-side and does the precise endpointing once the door is open.
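To see why band features shrink the model so much, it helps to count parameters. The sketch below uses the standard GRU layout (three gates, each with input weights, recurrent weights, and one bias vector; some frameworks keep two bias vectors per gate). The feature count and hidden size are illustrative assumptions, not our actual dimensions, and the small output layer that turns the hidden state into a score is left out.

```python
def gru_layer_params(input_size, hidden_size):
    """Return (weights, biases) for one GRU layer with one bias per gate."""
    weights = 3 * hidden_size * (input_size + hidden_size)
    biases = 3 * hidden_size
    return weights, biases


def two_layer_gru_footprint(n_features, hidden_size):
    """Rough size of a two-layer GRU fed n_features band energies per frame."""
    w1, b1 = gru_layer_params(n_features, hidden_size)
    w2, b2 = gru_layer_params(hidden_size, hidden_size)
    return {
        "params": w1 + b1 + w2 + b2,   # about this many bytes at int8
        "macs_per_frame": w1 + w2,     # one multiply-accumulate per weight
        "state_values": 2 * hidden_size,
    }


# Hypothetical dimensions: 16 band features, 32 hidden units per layer.
footprint = two_layer_gru_footprint(n_features=16, hidden_size=32)
```

With those assumed dimensions the model is about 11,000 parameters, roughly 11 KB at int8, against the 1–2 MB of a Silero-class model. The per-frame work is on the order of ten thousand multiply-accumulates.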

For the resident in 214B the device just feels quicker to notice her now, and quieter when the room is loud and she has said nothing. The two-tier split is invisible to her, which is the goal. The engineering luxury of running the best model everywhere is one we cannot afford, so we spend our scarce SRAM on the one decision that cannot be undone.
