Engineering · May 20, 2025 · 2 min read

Exponential backoff for a fleet that has to come back.

Why every Companion waits a little longer between reconnect attempts — and why none of them ever give up.

Facility wifi is a hostile environment. Access points reboot at 3am on a vendor schedule nobody told us about. Basement rooms hold one bar on a good day. Captive portals expire mid-shift. The CoreS3 firmware in every Companion keeps a persistent WebSocket open to the Go API so voice and commands move in realtime — which means we spend a lot of engineering time on what happens when that socket drops.

The naive answer is to reconnect immediately. The naive answer is wrong. If an AP reboots and forty devices on the same subnet all retry at the same instant, you have built a small DDoS against your own infrastructure: the AP, the upstream API, and the realtime provider all get hit at once, and the recovery takes longer than the original outage.

Backoff with a ceiling

So the firmware waits. The reconnect delay doubles from five seconds until it reaches a ceiling: 5s, 10s, 20s, 40s, then 60s for every attempt after that. Each delay gets random jitter so that two devices that booted together do not march in lockstep. The ceiling matters as much as the curve: we cap at sixty seconds and stay there. A device that has been offline for six hours should not be waiting an hour between attempts; it should be knocking on the door every minute, ready the moment the AP comes back.

Draining, not queueing

While the socket is down, the audio pipeline drops frames rather than buffering them. Stale voice is worse than no voice; a resident does not want to hear a question from twenty seconds ago answered now. Commands and state syncs are small and worth retrying. Audio is a realtime stream, and realtime means now or never.
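The split between the two kinds of traffic can be sketched as follows. This is an illustration under assumptions, not the firmware's code: the `Outbox` type, its `send` hook, and the in-memory `pending` slice are hypothetical, and how commands are actually retried is not described here. The only behavior taken from the text is that audio frames are dropped while the socket is down and that commands and state syncs are kept for a retry.

```go
package main

// Kind distinguishes realtime audio from small, retryable messages.
type Kind int

const (
	Audio   Kind = iota // realtime voice frame: now or never
	Command             // command or state sync: small, worth retrying
)

// Message is a hypothetical envelope for anything sent over the socket.
type Message struct {
	Kind    Kind
	Payload []byte
}

// Outbox is a hypothetical sender that sits in front of the WebSocket.
type Outbox struct {
	connected bool
	pending   []Message           // commands and state syncs awaiting retry
	dropped   int                 // audio frames discarded while offline
	send      func(Message) error // transport hook, assumed for this sketch
}

// Submit sends m if the socket is up. While it is down, audio is dropped
// and everything else is held for the next reconnect.
func (o *Outbox) Submit(m Message) {
	if o.connected {
		if err := o.send(m); err == nil {
			return
		}
		o.connected = false // treat a failed send as a dropped socket
	}
	if m.Kind == Audio {
		o.dropped++ // stale voice is worse than no voice
		return
	}
	o.pending = append(o.pending, m)
}

// OnReconnect marks the socket as up and retries held messages in order,
// stopping at the first failure so nothing is lost or reordered.
func (o *Outbox) OnReconnect() {
	o.connected = true
	for len(o.pending) > 0 {
		if err := o.send(o.pending[0]); err != nil {
			o.connected = false
			return
		}
		o.pending = o.pending[1:]
	}
}
```

The `pending` slice is unbounded here to keep the sketch short. On a memory-constrained device it would presumably need a cap, and for state syncs it may be enough to keep only the latest value.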

The product shape of this is simple. The AP reboots at 3am. The resident sleeps through it. At 7am they say good morning to Companion, and Companion says good morning back.

Tags: websocket, reliability, firmware

