Engineering · September 11, 2025 · 3 min read

The replay says it passed. The bedside says it didn't.

Offline replay evals and online live metrics catch different bugs and miss different bugs. The trap is trusting either one alone — especially when realtime audio refuses to replay deterministically.

We had a change pass every offline eval and then make the device feel worse in the room. The replay graded the words Companion said and they were correct. What the replay couldn't see was that those correct words arrived 400ms late on a real cellular link, after the resident had already started repeating herself. Offline, the answer was right. Online, the conversation was broken. Both statements were true, which is the whole problem.

What offline replay is good at

Offline evals replay our frozen conversation set through a candidate pipeline and grade the output. They're cheap, fast, and run on every commit, and they're unbeatable for one thing: catching content regressions before any resident is exposed. If a prompt change makes Companion stop recognizing a request for help, we want that to fail in CI, not in room 214.

Catches: wrong intents, dropped escalations, unsafe responses, transcription regressions on known-hard audio.
Misses: real-network latency, provider load under concurrency, microphone variation across rooms, anything timing-dependent.
Gives you: a stable, comparable number across model versions — the same inputs every time.
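A minimal sketch of this kind of content gate, assuming a frozen set of labeled cases and a pipeline exposed as a plain callable. Every name here (`Case`, `Result`, `run_offline_gate`, `toy_pipeline`) is hypothetical and stands in for whatever the real harness uses; the toy pipeline is a keyword matcher, not a model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Case:
    """One frozen conversation turn with its expected grading (hypothetical schema)."""
    case_id: str
    utterance: str
    expected_intent: str
    must_escalate: bool


@dataclass(frozen=True)
class Result:
    """What a candidate pipeline produced for one turn (hypothetical schema)."""
    intent: str
    escalated: bool


def run_offline_gate(cases: List[Case], pipeline: Callable[[str], Result]) -> List[str]:
    """Replay every frozen case and return failure descriptions; empty means pass."""
    failures: List[str] = []
    for case in cases:
        result = pipeline(case.utterance)
        if result.intent != case.expected_intent:
            failures.append(
                f"{case.case_id}: intent {result.intent!r} != {case.expected_intent!r}"
            )
        if case.must_escalate and not result.escalated:
            failures.append(f"{case.case_id}: dropped escalation")
    return failures


def toy_pipeline(utterance: str) -> Result:
    """Stand-in for the candidate pipeline: a keyword match, for illustration only."""
    wants_help = "help" in utterance.lower()
    intent = "request_help" if wants_help else "small_talk"
    return Result(intent=intent, escalated=wants_help)


FROZEN_SET = [
    Case("c1", "Can someone help me, please?", "request_help", True),
    Case("c2", "What a nice morning.", "small_talk", False),
]

failures = run_offline_gate(FROZEN_SET, toy_pipeline)
if failures:
    # In CI, a non-empty list would fail the job.
    raise SystemExit("\n".join(failures))
```

Because the inputs never change, a new failure in this list points at the candidate rather than at the day's traffic. Nothing in it measures when the answer arrived.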

What online metrics are good at

Online metrics measure the live system on real conversations: p95 mouth-to-ear latency, barge-in rate, reconnect counts, the fraction of turns where a resident repeats themselves within five seconds, escalation rates by wing. These catch exactly what replay can't, namely the system in its actual operating conditions, but they're noisy, confounded by who happened to be talking that day, and they only tell you about a regression after it's been live. You learn the truth at the cost of having shipped it.
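Two of these metrics are simple enough to sketch. The version below assumes each turn is logged with a start time, a transcript, and a measured mouth-to-ear latency. The `Turn` schema, the nearest-rank percentile, and the "same normalized text as the previous turn" test for a repeat are all illustrative assumptions, not the production definitions.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Turn:
    """One resident turn as it might appear in a session log (hypothetical schema)."""
    started_at_s: float   # seconds since session start
    text: str             # transcript of what the resident said
    latency_ms: float     # measured mouth-to-ear latency for the reply


def p95(values: List[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("p95 of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not hide a repeat."""
    return " ".join(text.lower().split())


def repeat_rate(turns: List[Turn], window_s: float = 5.0) -> float:
    """Fraction of turns that repeat the previous turn within window_s seconds.

    Assumption: a repeat means identical normalized text. Real speech would
    need a fuzzier comparison.
    """
    if len(turns) < 2:
        return 0.0
    repeats = 0
    for prev, cur in zip(turns, turns[1:]):
        close_in_time = cur.started_at_s - prev.started_at_s <= window_s
        if close_in_time and _normalize(prev.text) == _normalize(cur.text):
            repeats += 1
    return repeats / len(turns)


session = [
    Turn(0.0, "Can you help me", 180.0),
    Turn(3.0, "can you  help me", 620.0),   # repeated 3 s later
    Turn(20.0, "Thank you", 210.0),
    Turn(40.0, "Good night", 240.0),
]

latency_p95 = p95([t.latency_ms for t in session])
repeats = repeat_rate(session)
```

On this toy session the slow reply and the repeated request show up together, which is the pattern replay grading cannot see.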

So the division of labor is: offline gates the merge, online watches the deploy. Offline answers 'is the content correct?' Online answers 'is the experience correct under load, on real links, with real voices?' Neither question subsumes the other, and a change has to clear both.

The replay determinism problem

Here's the part that makes realtime audio nastier than text evals: you can't actually replay it deterministically. Stream 16kHz PCM into a Realtime provider twice and you get two different transcripts, because end-of-turn detection depends on chunk arrival timing, the provider's VAD has internal state, and the model samples. Feed the audio faster than real time to speed up the suite and the VAD's silence windows fire differently than they ever would live. The act of replaying changes the thing you're measuring.
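To make the timing effect concrete, here is a small sketch of the arithmetic. It assumes 16-bit mono samples (the bit depth is an assumption; only the 16kHz rate comes from the text above) and a hypothetical detector that measures silence by wall-clock arrival time against a 500 ms threshold. Whether a given provider counts silence in audio time or arrival time is provider-specific, so this illustrates the mismatch rather than describing any real VAD.

```python
from typing import List

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # assumption: 16-bit mono PCM


def chunk_bytes(chunk_ms: int) -> int:
    """Size in bytes of one audio chunk covering chunk_ms of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000


def send_offsets_ms(total_audio_ms: int, chunk_ms: int, speed: float) -> List[float]:
    """Wall-clock offsets at which each chunk is sent when replaying at `speed`x."""
    n_chunks = total_audio_ms // chunk_ms
    return [i * chunk_ms / speed for i in range(n_chunks)]


def wall_clock_ms(audio_ms: float, speed: float) -> float:
    """How long a stretch of audio takes to arrive when replayed at `speed`x."""
    return audio_ms / speed


def arrival_gap_exceeds(pause_ms: float, speed: float, threshold_ms: float = 500.0) -> bool:
    """Would a hypothetical arrival-time detector see this pause as end of turn?"""
    return wall_clock_ms(pause_ms, speed) >= threshold_ms


# A 600 ms pause in the recording:
live_like = arrival_gap_exceeds(600, speed=1.0)   # arrives over 600 ms
sped_up = arrival_gap_exceeds(600, speed=4.0)     # arrives over 150 ms
```

Under these assumptions the same pause ends the turn at 1x and does not at 4x, so the two runs would be graded on differently segmented turns.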

We don't pretend this away with a single golden string. Instead we pin what we can and grade the rest with tolerance.

Pace audio at 1x into the pipeline so VAD and end-of-turn behave like production, even though it makes the suite slower.
Grade with tolerance, not exact match — entity-level correctness and intent, scored against thresholds rather than a single expected transcript.
Run flaky-prone cases N times and gate on the distribution (e.g. intent correct in ≥9 of 10 runs), so one unlucky sample doesn't block a merge and one lucky sample doesn't hide a regression.
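The second and third points can be sketched together: grade each run on intent and entity recall instead of an exact transcript, then gate on how many of N runs pass. The `Expected` and `Observed` types, the recall threshold, and the scripted `make_pipeline` stand-in are hypothetical. A real harness would call the paced pipeline where this sketch calls `run_once`.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Set, Tuple


@dataclass(frozen=True)
class Expected:
    intent: str
    entities: FrozenSet[str]


@dataclass(frozen=True)
class Observed:
    intent: str
    entities: FrozenSet[str]


def entity_recall(expected: Expected, observed: Observed) -> float:
    """Share of expected entities that the run actually produced."""
    if not expected.entities:
        return 1.0
    return len(expected.entities & observed.entities) / len(expected.entities)


def run_passes(expected: Expected, observed: Observed, min_recall: float = 1.0) -> bool:
    """Tolerant grade: right intent and enough entities, transcript wording ignored."""
    return (
        observed.intent == expected.intent
        and entity_recall(expected, observed) >= min_recall
    )


def gate_case(
    run_once: Callable[[], Observed],
    expected: Expected,
    n_runs: int = 10,
    min_passes: int = 9,
) -> Tuple[bool, int]:
    """Run the case n_runs times and gate on the pass count, not on any single run."""
    passes = sum(run_passes(expected, run_once()) for _ in range(n_runs))
    return passes >= min_passes, passes


def make_pipeline(fail_on: Set[int]) -> Callable[[], Observed]:
    """Scripted stand-in for a flaky pipeline: wrong answer on the listed call numbers."""
    calls = 0

    def run_once() -> Observed:
        nonlocal calls
        calls += 1
        if calls in fail_on:
            return Observed("small_talk", frozenset())
        return Observed("request_help", frozenset({"nurse"}))

    return run_once


expected = Expected("request_help", frozenset({"nurse"}))
one_unlucky = gate_case(make_pipeline({4}), expected)       # 9 of 10 pass
two_unlucky = gate_case(make_pipeline({2, 7}), expected)    # 8 of 10 pass
```

With the 9-of-10 threshold from the example above, a single bad sample still merges and two do not. Where to set `n_runs` and `min_passes` is a trade between suite time and how much flakiness the gate should absorb.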

The honest summary we tell ourselves: offline tells us a change is probably safe, online tells us it is working. We need both because the gap between them is exactly where a resident in 118 asks for help and Companion, having passed every test, answers a half-second too late.

evals · methodology · metrics
