An AI summary can quietly decide whose pain counts.
Bias does not stop at the microphone. When a model turns a conversation into a SOAP note, it makes editorial choices about what to foreground and what to soften — and the literature is clear about whose symptoms get downplayed. That is why a nurse reviews every note, and why every line is traceable.
The microphone is only half the system. Once Companion has captured a conversation, our FastAPI backend asks Claude to turn it into a structured SOAP note for the nursing team. That step looks clerical and is not. Summarization is editorial: the model decides what rises into the assessment, what gets a hedge, and what falls out entirely. Those choices are where a second, quieter kind of bias lives — and unlike a transcription error, a softened symptom does not announce itself.
Whose symptoms get downplayed
Clinical documentation has a long, well-studied history of bias, and a model trained on human medical text inherits it. The patterns are specific and they are not subtle:
- Pain minimization. Older patients', women's, and minority patients' reports of pain are historically under-credited in the record. A model can reproduce that by rendering "it really hurts" as "reports mild discomfort."
- Stigmatizing framing. Language like "refuses," "non-compliant," or "pleasant but confused" carries judgment and clusters around particular groups. The model will happily generate it because the training corpus did.
- Confidence laundering. A hesitant, hedged spoken complaint can come out of summarization as a clean clinical assertion — or the reverse, where a clear complaint is hedged into nothing. Either way the resident's own certainty is overwritten.
- Omission. The most dangerous bias is the symptom that simply does not make the note. You cannot see what was left out by reading what remains.
Stack these on top of the upstream recognition bias and the failure compounds: the resident whose accent was hardest to transcribe is now also the one whose complaint is most likely to be smoothed away in the summary. The people the system already serves worst are served worst again at the next layer.
Why nurse review and provenance are not optional
Our answer is structural, not a prompt we hope works. Two commitments hold the line:
- The note is a draft until a nurse signs it. Companion never writes to the clinical record on its own. Every SOAP note is reviewed by a clinician before it counts, and the interface is built to make editing the obvious default rather than a source of friction.
- Every line is traceable. Each statement in the note links back to the specific event it came from, so a reviewer can ask "where did the model get this?" and "what did it leave out?" and answer both. Provenance turns a black-box summary into something a nurse can audit in seconds.
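To make the traceability commitment concrete, here is a minimal sketch of what a provenance check could look like. It is not Companion's actual schema: the names `TranscriptEvent`, `NoteLine`, `source_event_ids`, and `audit_provenance` are all hypothetical, chosen for illustration. The idea is that each note line carries the IDs of the events it was derived from, which lets a reviewer surface both lines with no source and events that no line cites.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TranscriptEvent:
    """One captured utterance or observation (hypothetical shape)."""
    event_id: str
    speaker: str
    text: str


@dataclass(frozen=True)
class NoteLine:
    """One statement in the draft SOAP note (hypothetical shape)."""
    section: str                      # "S", "O", "A", or "P"
    text: str
    source_event_ids: Tuple[str, ...] = ()


def audit_provenance(
    events: List[TranscriptEvent], note_lines: List[NoteLine]
) -> Tuple[List[NoteLine], List[TranscriptEvent]]:
    """Return (untraced_lines, uncited_events).

    untraced_lines: note lines that cite no known event.
    uncited_events: events that no note line cites, i.e. candidates
    for "what did the model leave out".
    """
    known_ids = {e.event_id for e in events}
    cited_ids = set()
    untraced = []
    for line in note_lines:
        valid = [i for i in line.source_event_ids if i in known_ids]
        if not valid:
            untraced.append(line)
        cited_ids.update(valid)
    uncited = [e for e in events if e.event_id not in cited_ids]
    return untraced, uncited


# Illustrative data only.
events = [
    TranscriptEvent("e1", "resident", "My hip really hurts when I stand."),
    TranscriptEvent("e2", "resident", "I did not sleep much last night."),
]
note = [
    NoteLine("S", "Resident reports hip pain on standing.", ("e1",)),
    NoteLine("A", "Mild discomfort, likely positional.", ()),
]

untraced, uncited = audit_provenance(events, note)
for line in untraced:
    print("No source:", line.text)
for event in uncited:
    print("Not in note:", event.text)
```

In this example the assessment line has no source and the sleep complaint never reached the note, so both are put in front of the reviewer. An uncited event is not automatically an error, since small talk should fall out of a clinical note, but listing it makes the omission visible instead of silent.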
We also run the summarizer against our cohort-sliced evals, checking whether pain language, stigmatizing terms, and omissions skew across age, accent, and gender. Aggregate summary quality can look fine while one group's complaints are systematically dimmed — the same averaging trap as everywhere else.
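A cohort-sliced check of this kind might be sketched as follows. This is an illustration under assumptions, not our eval harness: the record fields (`age_band`, `summary`), the helper names, and the sample data are hypothetical, and the term list simply reuses the stigmatizing phrases mentioned earlier. A real check would need a clinically reviewed lexicon and far more data.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Phrases taken from the examples above; a real list would be curated.
STIGMATIZING_TERMS = ("refuses", "non-compliant", "pleasant but confused")


def has_stigmatizing_term(record: dict) -> bool:
    """True if the summary text contains any listed phrase."""
    text = record["summary"].lower()
    return any(term in text for term in STIGMATIZING_TERMS)


def flag_rate_by_cohort(
    records: List[dict], cohort_key: str, flag: Callable[[dict], bool]
) -> Dict[str, float]:
    """Fraction of flagged records within each cohort."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for record in records:
        cohort = record[cohort_key]
        totals[cohort] += 1
        if flag(record):
            flagged[cohort] += 1
    return {cohort: flagged[cohort] / totals[cohort] for cohort in totals}


# Illustrative data only.
records = [
    {"age_band": "65-79", "summary": "Reports knee pain after walking."},
    {"age_band": "65-79", "summary": "Reports poor sleep."},
    {"age_band": "65-79", "summary": "Reports nausea after lunch."},
    {"age_band": "80+", "summary": "Resident refuses medication."},
    {"age_band": "80+", "summary": "Pleasant but confused; reports back pain."},
]

overall = sum(has_stigmatizing_term(r) for r in records) / len(records)
rates = flag_rate_by_cohort(records, "age_band", has_stigmatizing_term)

print("overall:", overall)
print("by cohort:", rates)
```

With this toy data the overall rate is 0.4, while the per-cohort rates are 0.0 and 1.0. The aggregate hides the fact that every flagged summary belongs to one group, which is exactly what slicing is meant to expose. The same function can be reused with a different `flag` callable for pain language or omissions, and a different `cohort_key` for accent or gender.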
The bedside effect is the whole reason for the rigor. When a resident says she hurts, that has to reach the nurse as "she says she hurts," not as a softened paraphrase a model preferred. An AI that decides whose pain counts is not a tool we are willing to ship. An AI that hands a nurse a fast, traceable, editable draft, and then gets out of the way, is.