We run the new model in production for weeks before a single resident hears it.
Shadow-mode evaluation: a candidate provider or model processes real production conversations in parallel with the live one, and we compare outputs at scale — with zero resident impact, because only the incumbent ever reaches the speaker.
No offline eval set, however carefully built, contains tomorrow's conversations. The ones we'd most like to test against (the resident we haven't met, the accent we haven't seen, the 3am phrasing nobody anticipated) are by definition not in our frozen set. So before we switch Companion from one model or provider to another, we run the candidate against exactly that: live production traffic, in real time, where the only thing it isn't allowed to do is reach a resident.
How shadow mode works in our pipeline
Our Go API already routes per-device to a provider (OpenAI Realtime, ElevenLabs, or Grok) via Firestore config with a Redis cache. Shadow mode adds a second route. The incoming audio stream is fanned out to both the live provider and a shadow provider. The live provider's response goes to the device as always. The shadow provider's response goes nowhere near the speaker: it's captured, scored, and never played.
- The resident only ever hears the incumbent. The shadow output is write-only to our eval store; there is no code path from shadow to the I2S speaker.
- Both see identical input (the same 16 kHz PCM, the same conversational context), so any difference in output comes from the model, not the conditions.
- We log the pair: incumbent response, shadow response, latency for each, and the metadata to find the conversation later if a human needs to look.
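The fan-out described above could be sketched roughly as follows. This is a minimal illustration, not our production handler: `Provider`, `StaticProvider`, `ShadowPair`, `EvalStore`, and `HandleTurn` are hypothetical names, the provider call is reduced to a single request and response rather than a stream, and a real version would thread a `context.Context` through for timeouts. The point it shows is structural: the function's return value, the only thing that can reach the device, is built from the incumbent alone.

```go
package main

import (
	"sync"
	"time"
)

// Provider is a hypothetical stand-in for a realtime voice provider client.
type Provider interface {
	Respond(pcm []byte) (string, error)
}

// StaticProvider is a test double that always returns the same reply.
type StaticProvider struct{ Reply string }

func (p StaticProvider) Respond(pcm []byte) (string, error) { return p.Reply, nil }

// ShadowPair is one logged comparison record (field names are assumptions).
type ShadowPair struct {
	ConversationID   string
	Incumbent        string
	Shadow           string
	IncumbentLatency time.Duration
	ShadowLatency    time.Duration
	ShadowErr        error
}

// EvalStore is an in-memory stand-in for the eval store. The serving path
// only calls Write; Snapshot exists for offline review tooling and tests.
type EvalStore struct {
	mu    sync.Mutex
	pairs []ShadowPair
}

func (s *EvalStore) Write(p ShadowPair) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pairs = append(s.pairs, p)
}

func (s *EvalStore) Snapshot() []ShadowPair {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]ShadowPair(nil), s.pairs...)
}

// HandleTurn sends identical audio to both providers and returns only the
// incumbent's reply. The returned channel closes once the pair is logged.
func HandleTurn(convID string, pcm []byte, live, shadow Provider, store *EvalStore) (string, <-chan struct{}, error) {
	type result struct {
		text    string
		latency time.Duration
		err     error
	}

	// Copy the buffer so the shadow sees exactly the bytes the incumbent does.
	shadowPCM := append([]byte(nil), pcm...)
	shadowCh := make(chan result, 1)
	go func() {
		start := time.Now()
		text, err := shadow.Respond(shadowPCM)
		shadowCh <- result{text: text, latency: time.Since(start), err: err}
	}()

	// The live call never waits on the shadow call.
	start := time.Now()
	reply, liveErr := live.Respond(pcm)
	liveLatency := time.Since(start)

	logged := make(chan struct{})
	go func() {
		defer close(logged)
		s := <-shadowCh
		if liveErr != nil {
			return // nothing to compare against
		}
		store.Write(ShadowPair{
			ConversationID:   convID,
			Incumbent:        reply,
			Shadow:           s.text,
			IncumbentLatency: liveLatency,
			ShadowLatency:    s.latency,
			ShadowErr:        s.err,
		})
	}()

	return reply, logged, liveErr
}
```

A caller would forward `reply` to the device and ignore the `logged` channel in the serving path; it is there so tests and shutdown code can wait for the write.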
The cost is real — we pay for two inferences on every shadowed turn — so we shadow a sampled fraction of devices, weighted toward the wings and times of day where the hard conversations happen. We are buying a preview of the future, and we buy it where the future is most uncertain.
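One way to express that weighted sampling is a deterministic per-device hash compared against a rate that depends on wing and hour. The sketch below is an assumption about shape, not our actual config: `SamplingPolicy`, its fields, and the wing name in the usage note are all hypothetical. Hashing the device ID rather than rolling a random number keeps a device's in-or-out decision stable from turn to turn at a given rate.

```go
package main

import "hash/fnv"

// SamplingPolicy is a hypothetical shadow-sampling config.
type SamplingPolicy struct {
	BaseRate    float64            // default fraction of devices shadowed
	WingWeight  map[string]float64 // multiplier per wing; missing wings use 1
	NightWeight float64            // multiplier applied inside the night window
	NightStart  int                // hour (0-23) the night window opens
	NightEnd    int                // hour (0-23) the night window closes
}

// isNight reports whether hour falls in [start, end), wrapping past midnight.
func isNight(hour, start, end int) bool {
	if start <= end {
		return hour >= start && hour < end
	}
	return hour >= start || hour < end
}

// Rate returns the shadow fraction for a wing and hour, clamped to [0, 1].
func (p SamplingPolicy) Rate(wing string, hour int) float64 {
	rate := p.BaseRate
	if w, ok := p.WingWeight[wing]; ok {
		rate *= w
	}
	if isNight(hour, p.NightStart, p.NightEnd) {
		rate *= p.NightWeight
	}
	if rate < 0 {
		return 0
	}
	if rate > 1 {
		return 1
	}
	return rate
}

// bucket maps a device ID to a stable value in [0, 1) using FNV-1a.
func bucket(deviceID string) float64 {
	h := fnv.New32a()
	h.Write([]byte(deviceID))
	return float64(h.Sum32()) / (1 << 32)
}

// ShouldShadow decides whether this device's turn is shadowed right now.
func (p SamplingPolicy) ShouldShadow(deviceID, wing string, hour int) bool {
	return bucket(deviceID) < p.Rate(wing, hour)
}
```

For example, a policy with `BaseRate: 0.1`, a weight of 3 for a hypothetical `"memory-care"` wing, and `NightWeight: 2` between 22 and 6 would shadow roughly 60% of that wing's devices at 3am and 10% of other devices mid-afternoon.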
Comparing without a human in every loop
Thousands of shadow pairs a day is far more than anyone can read, so we lean on the same panel of metrics and the same calibrated LLM judge we use everywhere else — applied as a comparison rather than an absolute grade.
- Auto-agree pairs pass silently. When incumbent and shadow land on the same intent and escalation decision within latency tolerance, we just count it.
- Disagreements get surfaced. Where the two models diverge — different intent, different escalation call, a big latency gap — the pair is queued for human review.
- Safety divergences jump the queue. Any pair where one model would have escalated and the other wouldn't goes to a nurse-led review first, because that's the difference that can hurt someone.
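The three rules above amount to a small triage function. The following is a sketch under assumptions: `Outcome`, `Verdict`, and `Triage` are hypothetical names, and the intent label and escalation flag are assumed to have already been produced upstream by the metrics panel and the judge. The escalation check runs first so that a safety divergence can never be downgraded to an ordinary disagreement.

```go
package main

import "time"

// Outcome is what the judge and metrics extracted from one model's response
// (a hypothetical shape).
type Outcome struct {
	Intent   string
	Escalate bool
	Latency  time.Duration
}

// Verdict says where a shadow pair goes next.
type Verdict int

const (
	AutoAgree   Verdict = iota // counted, not read
	HumanReview                // queued for ordinary review
	NurseReview                // front of the queue, nurse-led
)

// Triage routes a pair. Escalation mismatches are checked first so they
// always win over intent or latency differences.
func Triage(incumbent, shadow Outcome, latencyTolerance time.Duration) Verdict {
	if incumbent.Escalate != shadow.Escalate {
		return NurseReview
	}
	gap := incumbent.Latency - shadow.Latency
	if gap < 0 {
		gap = -gap
	}
	if incumbent.Intent != shadow.Intent || gap > latencyTolerance {
		return HumanReview
	}
	return AutoAgree
}
```

The latency tolerance is left as a parameter because the right value depends on the providers being compared.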
This turns the upgrade question from a leap of faith into evidence. After a couple of weeks of shadowing, we can say concretely: on real traffic, the candidate matched the incumbent's escalation decisions 99.1% of the time, was 90ms faster at p95, and the divergences a nurse reviewed favored the candidate in most cases. That's a decision we can defend, not a hope.
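Headline numbers like an escalation match rate and a p95 latency difference can be computed from the logged pairs with very little code. The sketch below assumes a hypothetical `PairStat` record and uses the nearest-rank definition of a percentile; other percentile definitions interpolate and would give slightly different values on small samples.

```go
package main

import (
	"sort"
	"time"
)

// PairStat is the slice of a logged pair needed for summary numbers
// (a hypothetical shape).
type PairStat struct {
	IncumbentEscalated bool
	ShadowEscalated    bool
	IncumbentLatency   time.Duration
	ShadowLatency      time.Duration
}

// Summary holds the headline comparison numbers.
type Summary struct {
	EscalationAgreement float64 // fraction of pairs with the same escalation call
	P95Incumbent        time.Duration
	P95Shadow           time.Duration
}

// percentile returns the nearest-rank percentile (pct in 1..100) of d.
func percentile(d []time.Duration, pct int) time.Duration {
	if len(d) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), d...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// Integer ceiling of pct*n/100 avoids floating-point rounding surprises.
	rank := (pct*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// Summarize computes the agreement rate and per-model p95 latency.
func Summarize(pairs []PairStat) Summary {
	if len(pairs) == 0 {
		return Summary{}
	}
	agree := 0
	inc := make([]time.Duration, 0, len(pairs))
	sh := make([]time.Duration, 0, len(pairs))
	for _, p := range pairs {
		if p.IncumbentEscalated == p.ShadowEscalated {
			agree++
		}
		inc = append(inc, p.IncumbentLatency)
		sh = append(sh, p.ShadowLatency)
	}
	return Summary{
		EscalationAgreement: float64(agree) / float64(len(pairs)),
		P95Incumbent:        percentile(inc, 95),
		P95Shadow:           percentile(sh, 95),
	}
}
```

The "faster at p95" figure is then `P95Incumbent - P95Shadow`, reported alongside the agreement rate rather than folded into a single score.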
Why this is the only honest way to switch
Every other method asks a resident to be the first to find out whether the new model is worse. Shadow mode refuses that. The person in room 214 keeps talking to the model we already trust, while the candidate quietly proves itself on her actual conversation — and only earns the speaker once the numbers say it should. The upgrade reaches her after it's been earned, not before. That ordering is the whole point of building evals first: the resident is the beneficiary of the test, never the test itself.