A good eval set is mostly the conversations you wish hadn't happened.
Building a labeled conversation eval set from pilot data — consent, de-identification, and the uncomfortable truth that the cases worth grading are the rare, hard ones you have to go hunting for.
Our first conversation eval set was useless, and it was useless because it was representative. We sampled pilot interactions uniformly, which meant 80% of it was Companion handling easy, happy-path turns perfectly. A model that got worse at the 5% of conversations that actually matter — the confused resident, the medical word, the call for help buried in small talk — could still score 96%. The average drowned out the cases we built the device to handle.
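The arithmetic behind that is worth seeing once. The sketch below is illustrative only: the function name is hypothetical, and it assumes every turn outside the hard 5% is handled perfectly while accuracy on the hard 5% drops to 20%.

```python
def blended_accuracy(segments):
    """Traffic-weighted accuracy over (share_of_traffic, accuracy) pairs."""
    total_share = sum(share for share, _ in segments)
    return sum(share * acc for share, acc in segments) / total_share


# Assumed split: 95% easy turns answered perfectly, 5% hard turns.
healthy = blended_accuracy([(0.95, 1.0), (0.05, 1.0)])
regressed = blended_accuracy([(0.95, 1.0), (0.05, 0.2)])  # hard cases mostly fail

print(f"healthy:   {healthy:.2f}")    # 1.00
print(f"regressed: {regressed:.2f}")  # 0.96
```

Under those assumptions, a model that fails four out of five hard conversations still loses only four points on a uniformly sampled set.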
Consent and de-identification come first, not last
Before a single conversation enters the set, it has to be legitimate to use. Companion stores events, not recordings, but a conversation transcript is still about a real person in a clinical setting. Our pipeline strips it down to what an eval actually needs and nothing more.
- Consent scope is checked per facility. Eval use is a distinct purpose from operating the device, and we don't assume one covers the other.
- De-identification runs before storage, not before viewing. Names, room numbers, dates, family names, and named conditions are replaced with stable tokens like [RESIDENT], [ROOM], [FAMILY_1] so coreference still works for graders.
- The clinical content survives, the identity doesn't. 'Has [FAMILY_1] called about [DATE]?' is gradeable. 'Has Diane called about Thursday?' is a privacy incident.
This is harder than redaction in a document because turn-taking carries identity sideways — a resident answers a question that named them, and now the answer is identifying. We de-identify the conversation as a unit, with the same token map across all turns, or the references break and the transcript stops making sense to a grader.
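A minimal sketch of conversation-level de-identification with one shared token map might look like the following. It assumes an upstream step has already found the identifying strings and their categories (the `entities` argument); the class and function names are hypothetical, and for simplicity every token is numbered, including single-instance categories.

```python
import re
from typing import Dict, List


class TokenMap:
    """Issues one stable placeholder per identifying string, per conversation."""

    def __init__(self) -> None:
        self._tokens: Dict[str, str] = {}  # lowercased surface form -> token
        self._counts: Dict[str, int] = {}  # category -> tokens issued so far

    def token_for(self, surface: str, category: str) -> str:
        key = surface.lower()
        if key not in self._tokens:
            n = self._counts.get(category, 0) + 1
            self._counts[category] = n
            self._tokens[key] = f"[{category}_{n}]"
        return self._tokens[key]


def deidentify_conversation(turns: List[str], entities: Dict[str, str]) -> List[str]:
    """Replace every known entity in every turn, sharing one map across turns.

    `entities` maps a surface string (e.g. "Diane") to a category (e.g. "FAMILY").
    Detecting those strings is assumed to happen elsewhere.
    """
    if not entities:
        return list(turns)
    categories = {surface.lower(): cat for surface, cat in entities.items()}
    # Longest first, so a full name wins over a first name it contains.
    alternatives = "|".join(
        re.escape(s) for s in sorted(entities, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    token_map = TokenMap()

    def replace(match: "re.Match[str]") -> str:
        surface = match.group(0)
        return token_map.token_for(surface, categories[surface.lower()])

    return [pattern.sub(replace, turn) for turn in turns]


example_turns = [
    "Has Diane called about Thursday?",
    "Yes, Diane said she would visit on Thursday.",
]
example_entities = {"Diane": "FAMILY", "Thursday": "DATE"}
deidentified = deidentify_conversation(example_turns, example_entities)
```

Because the map lives for the whole conversation rather than a single turn, the second turn's "Diane" resolves to the same token as the first, which is what keeps the transcript readable for a grader.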
Sampling for the hard cases on purpose
Once the data is safe to use, we deliberately over-sample the rare and the broken. We don't want a mirror of production traffic; we want a stress test. We pull conversations flagged by low ASR confidence, long silences, barge-ins, repeated utterances (a sign the resident felt unheard), any turn that triggered or should have triggered an escalation, and anything a nurse later corrected. In our pilot set, those categories are under 8% of raw volume but make up over half the eval set by design. The easy turns are present only as a regression floor.
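The over-sampling itself is a small stratified draw. The sketch below assumes each conversation is a dict carrying boolean flags for the categories above; the flag names, the function names, and the 60% default are illustrative assumptions, not a description of the actual pipeline.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical flag names, one per "hard case" signal described above.
HARD_FLAGS = (
    "low_asr_confidence",
    "long_silence",
    "barge_in",
    "repeated_utterance",
    "escalation_related",
    "nurse_corrected",
)


def is_hard(conversation: Dict) -> bool:
    """A conversation is 'hard' if any hard-case flag is set."""
    return any(conversation.get(flag, False) for flag in HARD_FLAGS)


def build_eval_sample(
    conversations: Sequence[Dict],
    total: int,
    hard_fraction: float = 0.6,  # assumed target; "over half" in the text
    seed: int = 0,
) -> List[Dict]:
    """Draw `total` conversations, over-sampling hard ones to `hard_fraction`."""
    rng = random.Random(seed)  # seeded so the eval set is reproducible
    hard = [c for c in conversations if is_hard(c)]
    easy = [c for c in conversations if not is_hard(c)]
    # Cap each stratum at what is actually available.
    n_hard = min(len(hard), round(total * hard_fraction))
    n_easy = min(len(easy), total - n_hard)
    return rng.sample(hard, n_hard) + rng.sample(easy, n_easy)
```

With 1,000 pilot conversations of which 70 are flagged, `build_eval_sample(conversations, total=50)` returns 30 hard and 20 easy conversations: the hard cases are 7% of the raw volume but 60% of the sample.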
A label schema you can actually agree on
Labels are where eval sets quietly die. 'Was this a good response?' produces inter-rater agreement barely better than a coin flip. We split quality into narrow, separately-labeled dimensions, each with a concrete definition and examples.
{
"turn_id": "c4f1-07",
"asr_correct": true,
"intent": "ask_about_family_visit",
"response_appropriate": true,
"should_escalate": false,
"did_escalate": false,
"interrupted_resident": false,
"notes": "slow speech, 2.1s pause mid-sentence"
}
Each field is a separate, near-binary judgment, which is what makes two nurses and two engineers agree. Aggregate scores come later, computed from these fields, so we can always ask which dimension moved instead of watching a single blended number wobble.
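As a sketch of what "computed from these fields" can mean, the function below turns a list of label records into per-dimension rates. The field names come from the schema above; the function names and the specific metrics chosen (escalation recall, false escalation rate) are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Optional


def rate(values: List[bool]) -> Optional[float]:
    """Fraction of True values, or None when there is nothing to measure."""
    return sum(values) / len(values) if values else None


def dimension_scores(labels: Iterable[Dict]) -> Dict[str, Optional[float]]:
    """Compute one score per labeled dimension instead of a blended number."""
    rows = list(labels)
    needed = [r for r in rows if r["should_escalate"]]
    not_needed = [r for r in rows if not r["should_escalate"]]
    return {
        "asr_accuracy": rate([r["asr_correct"] for r in rows]),
        "response_appropriate_rate": rate([r["response_appropriate"] for r in rows]),
        "interruption_rate": rate([r["interrupted_resident"] for r in rows]),
        # Of the turns that needed escalation, how many got it?
        "escalation_recall": rate([r["did_escalate"] for r in needed]),
        # Of the turns that did not need escalation, how many escalated anyway?
        "false_escalation_rate": rate([r["did_escalate"] for r in not_needed]),
    }


def make_label(asr, appropriate, should, did, interrupted):
    """Helper for building example records with the schema's field names."""
    return {
        "asr_correct": asr,
        "response_appropriate": appropriate,
        "should_escalate": should,
        "did_escalate": did,
        "interrupted_resident": interrupted,
    }


example_labels = [
    make_label(True, True, False, False, False),
    make_label(False, False, True, False, True),  # missed escalation
    make_label(True, True, True, True, False),
    make_label(True, True, False, False, False),
]
scores = dimension_scores(example_labels)
```

On these four example turns, ASR accuracy and response appropriateness both come out at 0.75, but escalation recall is 0.5. That one missed escalation is exactly the kind of signal a single blended score would hide.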
The work is unglamorous: consent paperwork, token maps, fights over what 'appropriate' means. But this set is the bedrock every other eval stands on. When it's built from the conversations we wish hadn't happened, a passing score finally means Companion can handle the night a resident in 118 is frightened and not quite making sense — which is the only night that was ever the point.