We write the eval harness before we write the feature.
You can't tune what you can't measure, and at the bedside the thing you're tuning is a conversation with someone who may not be able to repeat themselves. So the harness comes first.
The first time we changed Companion's end-of-turn timing to feel snappier, three things happened at once: the device got faster, it started talking over residents with slower speech, and we had no way to say whether the trade was worth it. We shipped a feeling and got an argument. That was the day we stopped building features first. Now the harness comes first — the feature is what makes a number we already defined go up.
Why voice punishes vibes-based tuning
Most software changes are visible. You can stare at the screen and see whether the button moved. A voice change is invisible and irreversible in the moment: by the time you've heard that Companion interrupted a resident, the interruption already happened to a person in a bed who can't always rephrase. There's no scroll-back. The feedback loop is slow, emotional, and runs through people we are explicitly trying not to burden. That's the worst possible environment for tuning by ear.
Voice also has a wide, correlated parameter space. Silence-window length, VAD aggressiveness, the LLM's verbosity, TTS speaking rate, the barge-in threshold — they all push on the same felt qualities. Nudge one and two others shift. With a dozen interacting knobs and a human-cost feedback loop, intuition is not a control system. A scoreboard is.
What has to exist before the feature
Our rule is simple and slightly annoying to follow: a change to the conversation pipeline doesn't start until there's a metric it's allowed to move and a way to compute that metric on cases we've already seen. Concretely, before we touch code we need three things.
- A target metric with a direction and a threshold. Not 'feels faster' but p95 mouth-to-ear latency under 800 ms on the replay set, with no regression in barge-in false positives.
- A frozen eval set the change is judged against. Real pilot conversations, de-identified, including the awkward ones: long pauses, hearing-aid feedback, two voices in the room.
- A guardrail metric that must NOT move. The whole point is to catch the silent trade — the thing that got better at the expense of the thing nobody was looking at.
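To make the three requirements concrete, here is a minimal Python sketch of what such a gate could look like. It is an illustration under stated assumptions, not Companion's actual harness: the names `Gate`, `evaluate`, `p95`, and the metric keys are hypothetical, the percentile uses the simple nearest-rank method, and the numbers in the usage example are invented for the demo.

```python
from dataclasses import dataclass


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


@dataclass(frozen=True)
class Gate:
    """Hypothetical gate: one target metric, one guardrail metric."""
    target_name: str
    target_max: float            # target must be strictly under this value
    guardrail_name: str
    guardrail_tolerance: float   # allowed increase over the baseline


def evaluate(gate, baseline, candidate):
    """Compare candidate metrics to the gate; return (passed, reasons)."""
    reasons = []
    target = candidate[gate.target_name]
    if target >= gate.target_max:
        reasons.append(
            f"{gate.target_name}={target} is not under {gate.target_max}"
        )
    limit = baseline[gate.guardrail_name] + gate.guardrail_tolerance
    guard = candidate[gate.guardrail_name]
    if guard > limit:
        reasons.append(
            f"{gate.guardrail_name}={guard} regressed past {limit}"
        )
    return (not reasons, reasons)


if __name__ == "__main__":
    # Made-up numbers, standing in for metrics computed on the frozen eval set.
    gate = Gate(
        target_name="p95_mouth_to_ear_ms",
        target_max=800.0,
        guardrail_name="barge_in_false_positive_rate",
        guardrail_tolerance=0.0,
    )
    baseline = {"p95_mouth_to_ear_ms": 950.0,
                "barge_in_false_positive_rate": 0.02}
    candidate = {"p95_mouth_to_ear_ms": 720.0,
                 "barge_in_false_positive_rate": 0.05}
    passed, reasons = evaluate(gate, baseline, candidate)
    print("PASS" if passed else "FAIL", reasons)
```

In this sketch the candidate hits the latency target but still fails, because the guardrail moved. That is the silent trade the third requirement exists to catch.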
This inverts the usual order. The harness is not a thing we add after a feature works to prove it works. It is the thing that defines what 'works' means, and writing it first forces the honest argument up front — what are we actually optimizing, and what are we willing to spend to get it? — instead of after we've fallen in love with a demo.
The bedside payoff
Eval-first is slower for the first week and faster forever after. Every later change runs against the same scoreboard, so we stop relitigating taste and start moving numbers. More importantly, it changes who absorbs the risk of a bad change. When the harness comes first, the regressions land in a CI report on a Tuesday afternoon. When the feature comes first, they land on a resident in room 118 who asked a question and got talked over. We would rather lose the argument to a spreadsheet than win it at her expense.