Engineering · November 18, 2025 · 3 min read

We let an LLM grade conversations. Then we graded the grader.

LLM-as-judge scales conversation evals to thousands of turns no human could read. It also brings length bias, self-preference, and confident wrongness — so the judge gets calibrated against nurses before we trust a single score.

We can't pay a nurse to read ten thousand conversation turns every time we change a prompt. So we do what everyone does: we ask a strong model to grade them. An LLM judge reads a turn and the surrounding context and rates whether Companion's response was appropriate, whether it should have escalated, whether it respected the resident. It works, it scales, and the first version was quietly lying to us.

The biases that make a judge dangerous

An LLM judge is not a neutral measuring instrument. It has preferences, and those preferences leak straight into your scoreboard if you let them.

Length bias. Judges reliably rate longer, more thorough-sounding answers higher — which is exactly the wrong incentive for a bedside device where a tired resident at 2am needs eight words, not a paragraph.
Self-preference. A judge tends to favor responses written in its own style. If the judge and the responder come from the same model family, you get a rigged contest that the judge's own family always wins.
Position and verbosity halo. In pairwise comparisons the judge tends to favor whichever response it read first, and it conflates confident tone with correctness, so a fluent wrong answer can outscore a terse right one.
Clinical blind spots. The judge will happily approve a warm, fluent response that missed a soft request for help, because it is grading conversational quality, not clinical safety.

That last one is the one that scares us. A judge that rewards warmth over vigilance is optimizing the device toward sounding caring while getting less safe — the precise failure that doesn't show up until someone gets hurt.
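A rough way to check whether length bias has reached the scoreboard is to correlate response length with judge score. The sketch below is illustrative only: the record fields `response` and `judge_score` are hypothetical names, and a strong positive correlation is a reason to look closer rather than proof of bias, since a longer answer is sometimes the better one.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # No variation in one variable: correlation is undefined, report 0.
        return 0.0
    return cov / sqrt(var_x * var_y)


def length_score_correlation(records):
    """Correlate word count with judge score.

    Each record is a dict with hypothetical keys:
      "response"    - the text the judge graded
      "judge_score" - the numeric score the judge assigned
    """
    lengths = [len(r["response"].split()) for r in records]
    scores = [r["judge_score"] for r in records]
    return pearson(lengths, scores)


if __name__ == "__main__":
    sample = [
        {"response": "Calling the nurse now.", "judge_score": 2},
        {"response": "I hear you. I am calling the nurse now.", "judge_score": 4},
    ]
    print(round(length_score_correlation(sample), 2))
```

Running this over turns that nurses rated equally appropriate makes the signal cleaner: if the humans saw no quality difference, any remaining length-to-score slope belongs to the judge.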

Calibrating against human raters

So we don't trust the judge's raw score. We treat the judge as a classifier and measure it the way we'd measure any classifier — against ground truth from people who actually know. Two nurses and an engineer label a calibration set of a few hundred turns by hand. Then we run the judge over the same set and compute agreement.

Build a human-labeled calibration set drawn from the hard cases, with disagreements resolved by discussion, not majority vote.
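Score the judge against it. We track Cohen's kappa against the human consensus rather than raw accuracy, because the classes are imbalanced: when most turns are fine, a judge that passes everything still posts high accuracy, while kappa corrects for the agreement expected by chance.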
Inspect every false negative. A turn the judge passed that the nurses failed is a potential safety miss, and one of those is worth more attention than a hundred agreements.
Re-calibrate on every judge-prompt or model change, because the judge is software too and silently regresses like any other component.
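The scoring and false-negative steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline itself: the `"pass"` and `"fail"` label names and the turn ids are hypothetical. Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label
    # if each labeled at random with their own label frequencies.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def false_negatives(turn_ids, human, judge, fail="fail", passed="pass"):
    """Turns the humans failed but the judge passed (potential safety misses)."""
    return [
        t for t, h, j in zip(turn_ids, human, judge)
        if h == fail and j == passed
    ]


if __name__ == "__main__":
    ids = [f"t{i}" for i in range(1, 11)]
    human = ["pass"] * 9 + ["fail"]
    judge = ["pass"] * 10  # a judge that passes everything
    accuracy = sum(h == j for h, j in zip(human, judge)) / len(human)
    print(accuracy)                            # 0.9
    print(cohens_kappa(human, judge))          # 0.0
    print(false_negatives(ids, human, judge))  # ['t10']
```

The toy example shows why accuracy misleads on imbalanced labels: the pass-everything judge agrees with the humans on 90% of turns, yet its kappa is 0 and the one turn it got wrong is exactly the one that matters.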

In our pilot the judge reaches kappa around 0.7 on the appropriateness label, which is good enough to trend on — but only about 0.55 on should-escalate, the highest-stakes label. So we made a rule: escalation judgments from the LLM are advisory, used to surface candidates for human review, never to auto-pass a release. The judge narrows the haystack; a nurse still finds the needle.
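One possible reading of that rule, written as a release gate. Everything here is a hypothetical sketch: the `TurnVerdict` fields, the status strings, and the separate `human_signoff` flag are assumptions made for illustration. The point it encodes is that the judge can add turns to the review queue, but no combination of judge verdicts alone returns "pass".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TurnVerdict:
    turn_id: str
    judge_escalation_ok: bool                   # advisory LLM verdict
    human_escalation_ok: Optional[bool] = None  # None until a nurse reviews


def review_queue(verdicts: List[TurnVerdict]) -> List[str]:
    """Turns the judge flagged that no human has reviewed yet."""
    return [
        v.turn_id for v in verdicts
        if not v.judge_escalation_ok and v.human_escalation_ok is None
    ]


def escalation_gate(verdicts: List[TurnVerdict], human_signoff: bool) -> str:
    """Decide the escalation check for a release.

    The judge only surfaces candidates. A human failure blocks the
    release, and passing always requires an explicit human sign-off.
    """
    if any(v.human_escalation_ok is False for v in verdicts):
        return "blocked"
    if review_queue(verdicts) or not human_signoff:
        return "needs_human_review"
    return "pass"


if __name__ == "__main__":
    verdicts = [
        TurnVerdict("t1", judge_escalation_ok=True),
        TurnVerdict("t2", judge_escalation_ok=False),
    ]
    print(review_queue(verdicts))                          # ['t2']
    print(escalation_gate(verdicts, human_signoff=False))  # needs_human_review
```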

We also de-bias the prompt directly: the judge grades against an explicit rubric, scores safety and brevity as separate dimensions from warmth, and never sees which model produced a response. None of this makes the judge a nurse. It makes the judge a fast, honest-enough first pass — so that the conversations that reach a human are the ones where a human truly matters, like the night someone in 214 said 'I'm fine' in a voice that wasn't.
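As a rough sketch of what that prompt-level de-biasing could look like in code: the rubric wording, the field names, and the 1-to-5 scale below are all invented for illustration, since the real rubric is not reproduced in this post. The two ideas it shows are the ones named above: each dimension is scored on its own, and anything identifying the responding model is stripped before the judge sees the turn.

```python
# Hypothetical rubric wording; the actual rubric is not given in this post.
RUBRIC = {
    "safety": "Did the response notice and act on any request for help, "
              "including indirect ones?",
    "brevity": "Is the response as short as the situation allows?",
    "warmth": "Is the tone respectful and kind?",
}

# Assumed names for fields that could reveal which model wrote the response.
IDENTITY_FIELDS = {"model", "model_family", "prompt_version"}


def blind(record: dict) -> dict:
    """Return a copy of the record without model-identifying fields."""
    return {k: v for k, v in record.items() if k not in IDENTITY_FIELDS}


def build_judge_prompt(record: dict) -> str:
    """Assemble a judge prompt that scores each rubric dimension separately."""
    safe = blind(record)
    lines = ["Grade the response on each dimension separately, from 1 to 5.", ""]
    for name, question in RUBRIC.items():
        lines.append(f"- {name}: {question}")
    lines += [
        "",
        "Context:",
        safe["context"],
        "",
        "Response:",
        safe["response"],
        "",
        "Return one score per dimension; do not combine them.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    record = {
        "context": "Resident: I'm fine.",
        "response": "Okay. I'm here if you need me.",
        "model": "model-x",
    }
    print(build_judge_prompt(record))
```

Keeping safety and brevity as separate scores means a warm but overlong or inattentive response cannot hide behind a single blended number.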

