Engineering · August 17, 2025 · 3 min read

Silence tells you when she stopped, not whether she's finished.

Acoustic endpointing hears a pause; semantic endpointing reads whether the sentence is actually complete. Companion fuses both, and the fusion is where the accuracy comes from.

Two residents each go quiet for one second. One has finished: "Can you tell my daughter I called." The other has not: "Can you tell my daughter..." An acoustic endpointer hears exactly the same thing in both cases, a one-second gap of silence. It has no way to tell them apart, because the difference isn't in the audio. It's in the words. This is the ceiling on any purely acoustic approach, and getting past it means listening to what was said, not just whether sound is present.

Two signals that know different things

We run two endpointing signals in parallel, and they're good at opposite things.

The acoustic signal lives on the CoreS3 and in the realtime provider's own server-side VAD. It's cheap, fast, and operates on raw 20ms PCM frames: energy gating, silence duration, and prosodic cues. A falling pitch contour and a slowing rate at the end of a clause are strong evidence of a real boundary, while flat pitch trailing into a pause often means she is still going. Acoustic signals are excellent at the "when" (precise to the frame) and weak at the "whether".
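As a rough illustration of the energy-gating and silence-duration part of that signal, here is a minimal sketch. The 16-bit mono PCM format, the RMS threshold, and the names `frame_rms` and `SilenceTracker` are assumptions made for the example, not Companion's firmware; prosodic cues are left out entirely.

```python
import math
import struct

FRAME_MS = 20              # frame length stated in the post
ENERGY_THRESHOLD = 500.0   # assumption: RMS gate, would be tuned per microphone


def frame_rms(frame: bytes) -> float:
    """RMS amplitude of one little-endian 16-bit PCM frame (format assumed)."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class SilenceTracker:
    """Tracks how long the input has stayed below the energy gate."""

    def __init__(self, threshold: float = ENERGY_THRESHOLD) -> None:
        self.threshold = threshold
        self.silence_ms = 0

    def push(self, frame: bytes) -> int:
        """Feed one frame; return the current run of silence in milliseconds."""
        if frame_rms(frame) >= self.threshold:
            self.silence_ms = 0          # speech energy resets the run
        else:
            self.silence_ms += FRAME_MS  # one more quiet frame
        return self.silence_ms
```

Because the counter advances one frame at a time, the silence estimate is accurate to 20ms, which is the sense in which the acoustic side knows the "when".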

The semantic signal asks a different question of the running transcript: is this utterance syntactically and semantically complete? "Can you tell my daughter I called" is a complete clause; "Can you tell my daughter" dangles. We run a lightweight completion classifier over the incremental transcript that outputs a probability the turn is finished. It's excellent at the "whether" and useless on its own at the "when", because it only updates when new words arrive, and a thinking pause produces no new words at all.
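The shape of that signal can be sketched as a small wrapper around a scoring function. Everything here is hypothetical: `CompletionTracker`, the `Scorer` type, and the stub scores are illustrations, and the real classifier is not shown in this post.

```python
from typing import Callable

# Hypothetical interface: full transcript so far -> P(turn is finished).
Scorer = Callable[[str], float]


class CompletionTracker:
    """Holds the latest completion probability for the running transcript."""

    def __init__(self, scorer: Scorer) -> None:
        self._scorer = scorer
        self._transcript = ""
        self.probability = 0.0  # no words yet, so nothing can be complete

    def on_words(self, text: str) -> float:
        """Append newly recognised words and re-score the whole transcript."""
        self._transcript = f"{self._transcript} {text}".strip()
        self.probability = self._scorer(self._transcript)
        return self.probability


# Made-up scores standing in for a trained classifier.
stub_scores = {
    "Can you tell my daughter": 0.15,
    "Can you tell my daughter I called": 0.92,
}
tracker = CompletionTracker(lambda t: stub_scores.get(t, 0.5))
```

Note that `probability` only changes inside `on_words`. During a pause nothing calls it, so the value simply sits there, which is why this signal cannot say when a turn ended.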

How we fuse them

Neither signal alone is enough, so we don't pick one — we let the semantic estimate modulate the acoustic silence window. The mechanism is deliberately simple, because this code runs in the latency-critical path and has to be debuggable from a nurse-reviewed event log:

  1. Acoustic VAD detects the resident has gone silent and starts the endpoint timer.
  2. We read the latest semantic completion probability for the transcript so far.
  3. High completion probability (the sentence looks done) → we shorten the required silence toward our floor, around 400ms, so a clear finished sentence gets a snappy reply.
  4. Low completion probability (the sentence dangles) → we stretch the window out toward 2.5s or beyond, betting she's mid-thought and protecting her from a cut-off.
  5. If new words arrive before the timer fires, we reset and re-score. The turn only ends when silence outlasts the window the semantics chose for it.
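The five steps above can be sketched in a few lines. The 400ms floor and the 2.5s figure come from the steps; the linear mapping between them, the hard 2.5s cap, and the names `silence_window_ms` and `Endpointer` are assumptions for illustration, since the steps only describe the two ends of the range.

```python
from dataclasses import dataclass

FLOOR_MS = 400      # from the post: shortest required silence
CEILING_MS = 2500   # from the post: window for a dangling sentence (assumed as a hard cap here)


def silence_window_ms(p_complete: float) -> int:
    """Map completion probability to required silence.

    Linear interpolation is an assumption; only the endpoints are given.
    """
    p = min(1.0, max(0.0, p_complete))
    return round(CEILING_MS - p * (CEILING_MS - FLOOR_MS))


@dataclass
class Endpointer:
    p_complete: float = 0.0   # latest semantic estimate (step 2)
    silence_ms: int = 0       # endpoint timer started by the VAD (step 1)

    def on_words(self, p_complete: float) -> None:
        """New words arrived: reset the timer and re-score (step 5)."""
        self.p_complete = p_complete
        self.silence_ms = 0

    def on_silent_frame(self, frame_ms: int = 20) -> bool:
        """Advance the timer; True once silence outlasts the chosen window."""
        self.silence_ms += frame_ms
        return self.silence_ms >= silence_window_ms(self.p_complete)
```

In this sketch the only method that can end a turn is `on_silent_frame`. The semantic estimate moves the threshold between the floor and the ceiling, but it never fires an endpoint itself.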

The fusion is asymmetric on purpose. A confident "complete" can pull the window in aggressively, because the downside of replying to a finished sentence is small. A confident "incomplete" can only ever extend the wait, never trigger an endpoint: semantics is allowed to grant patience, never to cut someone off. That asymmetry keeps the cruelest failure mode rare even when the classifier is wrong.

Where it still breaks

Fusion narrows the problem; it doesn't close it. Two cases stay hard. First, the completion classifier was trained mostly on fluent grammar, and elder speech is full of disfluency, self-repair, and trailing clauses that look complete but aren't ("I think that's all, well, no..."), so we keep its authority bounded. Second, a long thinking pause on a sentence that already parses as complete still looks, to both signals, like the end of a turn; distinguishing that from a real finish needs per-resident history, which is the next post. For now the win is concrete: feeding semantics into the silence window cut our awkward-wait p95 substantially without moving the cut-off rate. The resident in 214B gets a quick answer when she's plainly done and a patient one when she isn't, and she never has to know which machine was listening for which.

