Engineering · September 16, 2025 · 2 min read

A conversation isn't a loop. It's a state machine.

Why Companion's voice loop on the CoreS3 is an explicit FSM with guarded transitions, and what that bought us in debugging and resident trust.

The first version of Companion's conversational loop was a tangle of nested callbacks. Mic data came in on one task, vision packets on another, TTS playback on a third, and each one assumed it knew what the device was doing. It mostly worked. Then a resident interrupted Companion mid-sentence, the mic callback fired during playback, half a word got transcribed, and the model answered a question nobody asked. That bug took six hours to reproduce.

So we made the implicit FSM explicit. Companion's firmware now holds exactly one state at a time: IDLE, LISTENING, PROCESSING, or SPEAKING. Every transition is a named function with a guard. You don't move from LISTENING to PROCESSING because a timer fired; you move because VAD reported end-of-utterance and the audio buffer has at least 200 ms of voiced frames. The state itself owns the rule.
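A minimal sketch of one guarded transition, written in Python for brevity rather than in the firmware's own language. The four states and the 200 ms threshold come from the text above; the class name `VoiceLoop`, the method names, and the `start_listening` trigger are hypothetical stand-ins, not Companion's actual API.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    PROCESSING = auto()
    SPEAKING = auto()


MIN_VOICED_MS = 200  # threshold quoted in the post


class VoiceLoop:
    """Holds exactly one state; every transition is a named, guarded method."""

    def __init__(self) -> None:
        self.state = State.IDLE

    def start_listening(self) -> bool:
        # Hypothetical trigger: the post does not say what moves IDLE to LISTENING.
        if self.state is not State.IDLE:
            return False
        self.state = State.LISTENING
        return True

    def end_utterance(self, vad_end: bool, voiced_ms: int) -> bool:
        """LISTENING -> PROCESSING, only if VAD ended and enough voiced audio exists."""
        if self.state is not State.LISTENING:
            return False  # wrong state: the transition is refused
        if not (vad_end and voiced_ms >= MIN_VOICED_MS):
            return False  # guard failed: stay in LISTENING
        self.state = State.PROCESSING
        return True
```

Calling `end_utterance(True, 120)` while LISTENING returns `False` and leaves the state unchanged, because the guard sees fewer than 200 ms of voiced frames. A timer has no way to force the move, since nothing outside the method writes `state`.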

Guards drop frames instead of confusing the model

The win isn't the four states. It's what happens to inputs that arrive in the wrong state. A vision packet that lands during SPEAKING used to race the TTS pipeline and sometimes preempt it. Now it hits the guard, gets logged, and is dropped. A mic frame during PROCESSING doesn't get appended to a stale buffer; it's discarded and a drop counter is incremented. Wrong-state inputs are no longer subtle corruption. They're a number on a dashboard.
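One way to picture the drop-and-count behaviour is a per-state accept table, sketched here in Python. Only two entries are taken from the text (vision is dropped during SPEAKING, mic is dropped during PROCESSING); the rest of the `ACCEPTS` table, the `route` function, and the `dropped` counter are assumptions made for illustration.

```python
from collections import Counter
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    PROCESSING = auto()
    SPEAKING = auto()


# Input kinds each state will accept. Assumed for illustration, except that
# PROCESSING rejects "mic" and SPEAKING rejects "vision", as the post describes.
ACCEPTS = {
    State.IDLE: {"vision"},
    State.LISTENING: {"mic", "vision"},
    State.PROCESSING: set(),
    State.SPEAKING: set(),
}

# One counter per (state, input kind): the "number on a dashboard".
dropped = Counter()


def route(state: State, kind: str) -> bool:
    """Return True if the input may be processed; otherwise count it and drop it."""
    if kind in ACCEPTS[state]:
        return True
    dropped[(state.name, kind)] += 1
    return False
```

With this shape, `route(State.SPEAKING, "vision")` returns `False` and increments `dropped[("SPEAKING", "vision")]`, so a wrong-state input leaves a count behind instead of touching the TTS pipeline.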

Debugging changed shape too. Every transition logs {from, to, reason, ts}. When a field report says "Companion answered weirdly after my mom coughed," we pull the trace and see the exact sequence: LISTENING → PROCESSING (vad_end) → SPEAKING (tts_start). The bug, if there is one, lives in a single transition. Single-frame repro, not six hours of guessing.
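The trace format can be sketched in a few lines of Python. The four field names are the ones listed above; the helper names `log_transition` and `format_trace`, the in-memory `trace` list, and the use of a monotonic clock for `ts` are assumptions.

```python
import time

# In-memory trace; a real device would more likely use a ring buffer or flash log.
trace = []


def log_transition(src: str, dst: str, reason: str, ts=None) -> dict:
    """Record one transition as {from, to, reason, ts}."""
    entry = {
        "from": src,
        "to": dst,
        "reason": reason,
        "ts": time.monotonic() if ts is None else ts,
    }
    trace.append(entry)
    return entry


def format_trace(entries: list) -> str:
    """Render entries as 'A → B (reason) → C (reason)' for a field report."""
    if not entries:
        return ""
    parts = [entries[0]["from"]]
    for e in entries:
        parts.append(f'{e["to"]} ({e["reason"]})')
    return " → ".join(parts)
```

Logging `("LISTENING", "PROCESSING", "vad_end")` followed by `("PROCESSING", "SPEAKING", "tts_start")` and passing the trace to `format_trace` yields the same sequence quoted in the paragraph above.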

Why residents feel the difference

The LED ring and screen are bound to the FSM, not to the audio pipeline. Blue means LISTENING, amber means PROCESSING, soft white means SPEAKING, off means IDLE. Because the visual state and the logical state are the same variable, they can't drift. Residents learn the rhythm within a day — they wait for blue, they speak, they see amber, they hear the answer. Companion finally feels like it's taking turns, because it is.
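A small Python sketch of why the two can't drift: the LED colour is never stored, only computed from the single state variable. The colour mapping is the one given above; the `Device` class and its members are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    PROCESSING = auto()
    SPEAKING = auto()


# Colour mapping as described in the post.
LED_COLOR = {
    State.IDLE: "off",
    State.LISTENING: "blue",
    State.PROCESSING: "amber",
    State.SPEAKING: "soft white",
}


class Device:
    """The LED has no storage of its own; it is a view of the FSM state."""

    def __init__(self) -> None:
        self.state = State.IDLE

    @property
    def led(self) -> str:
        # Derived on every read, so it cannot disagree with self.state.
        return LED_COLOR[self.state]
```

Because `led` is a read-only property, there is no second variable to update and therefore nothing to forget: setting `state` to `State.LISTENING` makes `led` read `"blue"` on the next access.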

state-machine · voice · ux
