The hardest decision Companion makes is when to stop listening.
Endpointing — deciding the resident has finished their turn — has exactly two ways to fail, and both are felt at the bedside. Here is how we frame the problem and what we measure.
Companion has no button and no wake word. A resident in 214B just speaks, and somewhere in the next few hundred milliseconds the system has to make a binary call that it cannot take back: is she done, or is she still talking? Get it right and the conversation feels like a conversation. Get it wrong and you have produced one of two specific failures, each of which the resident notices immediately. This decision — endpointing, or end-of-turn detection — is the single hardest piece of the voice stack, and it is hard precisely because the right answer changes from person to person and from sentence to sentence.
Two failure modes, both audible
There are only two ways to be wrong, and they sit on opposite sides of the same threshold.
- Cut-off (endpoint too early). The resident pauses to find a word, the system decides she's finished, and Companion starts talking over her. For an elderly resident who paused because of word-finding or breath, this is the cruelest version: the device punished her for the thing that makes her speech slower. It teaches her not to trust it.
- Awkward wait (endpoint too late). The resident finishes a clear, complete sentence and then nothing happens. One second of dead air, then two. She wonders if it heard her, repeats herself, and now the system is processing a half-finished restart. Less cruel than a cut-off, but it makes Companion feel broken and slow.
Every endpointing design is a choice about how to trade one of these against the other. In an eldercare setting we are not neutral about that trade — a cut-off is far worse than a wait — but you cannot simply slide the dial all the way to wait, or the device feels unresponsive to everyone who actually did finish.
Why a fixed timeout isn't enough
The obvious first design is a fixed silence timeout: stream audio, count silent frames, and after N milliseconds of quiet, end the turn. We ship a version of this on the CoreS3 (20 ms frames of 16 kHz PCM, an RMS gate, a silence counter), and for a fast, fluent speaker it works fine. The problem is that no single N is correct. A consumer assistant uses something like 700 ms. Our residents routinely pause two or three seconds mid-thought. Set N short and you cut off the people who most need not to be cut off; set N long enough to protect them and every fluent speaker eats a multi-second wait after every sentence. A constant cannot serve a population in which one speaker is done after a few hundred milliseconds of silence and another is still mid-thought after three seconds.
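As a rough illustration of the fixed-timeout design described above, here is a minimal Python sketch. It is not the CoreS3 firmware. The function names, the `rms_gate` value of 500, and the rule that the silence counter only starts once speech has been heard are all assumptions made for the example. Only the 20 ms frame size, the 16 kHz sample rate, and the 700 ms figure come from the text.

```python
import math
from typing import Iterable, Optional, Sequence

SAMPLE_RATE_HZ = 16_000
FRAME_MS = 20
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 320 samples per frame


def frame_rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of signed PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def fixed_timeout_endpoint(
    frames: Iterable[Sequence[int]],
    timeout_ms: int = 700,      # the "N" from the text
    rms_gate: float = 500.0,    # hypothetical threshold, would need tuning per mic
) -> Optional[int]:
    """Return the index of the frame at which the turn is closed, or None.

    A frame counts as silent when its RMS is below rms_gate. The turn ends
    once timeout_ms worth of consecutive silent frames follow some speech.
    """
    frames_needed = math.ceil(timeout_ms / FRAME_MS)
    heard_speech = False
    silent_run = 0
    for index, frame in enumerate(frames):
        if frame_rms(frame) >= rms_gate:
            heard_speech = True
            silent_run = 0          # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None                     # audio ended before the timeout elapsed
```

Fed a turn with a two-second pause in the middle, this sketch closes the turn inside the pause when `timeout_ms=700`, which is exactly the cut-off failure. With `timeout_ms=3000` it survives the pause, but every speaker then waits three seconds after finishing.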
So the timeout is a floor, not the whole system. Above it we layer signals that a constant can't capture — prosody, syntactic completeness, and per-resident history — which are the subjects of the next posts in this series. But none of that is worth building if you can't measure whether it helped.
The metrics we track
Because the failures are asymmetric, a single accuracy number would hide the trade-off between them. We track four metrics separately, each against nurse-reviewed event logs:
- Cut-off rate — turns we ended while the resident was still speaking. This is our primary metric and we treat it as a safety number, not a quality number. In our pilot wing we hold it under 1.5% of turns.
- Median and p95 endpoint latency — how long after the resident actually stopped before we closed the turn. Median tells you the common case; p95 catches the awkward-wait tail that a median hides.
- Repair rate — turns where the resident repeated or restarted themselves, a behavioral signal that endpointing felt wrong even when we can't label which way.
- False-trigger rate — turns opened by the TV, a roommate, or a hallway, which corrupt endpointing because the device is now timing silence on speech that was never directed at it.
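To make the four metrics above concrete, here is a hedged Python sketch of how they could be computed from reviewed turn records. The `Turn` record, its field names, the nearest-rank percentile, and the choice of all logged turns as the denominator for every rate are assumptions for illustration, not a description of our actual log schema.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Turn:
    """One reviewed turn. All field names are hypothetical."""
    resident_id: str
    cut_off: bool                 # closed while the resident was still speaking
    latency_ms: Optional[float]   # close time minus true end of speech, if known
    repaired: bool                # resident repeated or restarted
    false_trigger: bool           # opened by TV, roommate, hallway, etc.


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(turns: List[Turn]) -> Dict[str, Optional[float]]:
    """Compute the four metrics over one group of turns."""
    if not turns:
        raise ValueError("no turns to summarize")
    n = len(turns)
    latencies = [t.latency_ms for t in turns if t.latency_ms is not None]
    return {
        "turns": n,
        "cut_off_rate": sum(t.cut_off for t in turns) / n,
        "repair_rate": sum(t.repaired for t in turns) / n,
        "false_trigger_rate": sum(t.false_trigger for t in turns) / n,
        "median_latency_ms": median(latencies) if latencies else None,
        "p95_latency_ms": percentile(latencies, 95) if latencies else None,
    }


def summarize_by_resident(turns: List[Turn]) -> Dict[str, Dict[str, Optional[float]]]:
    """Run the same summary separately for each resident."""
    groups: Dict[str, List[Turn]] = defaultdict(list)
    for turn in turns:
        groups[turn.resident_id].append(turn)
    return {rid: summarize(group) for rid, group in groups.items()}
```

The same `summarize` function runs on the whole wing or on a single resident's turns, so the facility-wide and per-resident numbers are always computed the same way.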
We slice every one of these by resident, because the averages lie. A facility-wide cut-off rate of 1% can hide one resident with aphasia being cut off on a third of her turns. Endpointing isn't a number you tune once. It's the rhythm of the room, and the only honest way to know you got it right is that the resident in 214B finishes her own sentences and never thinks about the device at all.