EngineeringJanuary 25, 2026·3 min read

How do you regression-test something that's allowed to give a slightly different answer every time?

Audio fixtures, golden transcripts, and tolerance thresholds for a voice pipeline that is flaky by nature — because ASR is non-deterministic and a strict golden test would fail on every green build.

Regression testing normal code is easy: same input, same output, assert equal. Run the same 16kHz WAV through our voice pipeline twice and you get 'has my daughter called today' once and 'has my daughter called to day' the next, with the word timings off by a few milliseconds. Both are correct. A strict equality test would fail on a perfectly healthy build, so we'd disable it, and then we'd have no regression test at all. The flakiness isn't a bug in our tests — it's a property of the thing under test.

Fixtures: real audio, frozen forever

The foundation is a library of audio fixtures — actual recordings, captured with consent and de-identified, of the situations we need to keep working. Each fixture is a real WAV plus the metadata that makes it a test case.

  • A canonical hard set: slow speech, hearing-aid whistle, two people in the room, a TV in the background, a request for help spoken quietly mid-sentence.
  • Captured at the device's real format (16kHz mono PCM) so the test exercises the same path the CoreS3 mic produces — not a clean studio file that flatters the ASR.
  • Frozen. Once a fixture is in, the bytes never change. The expected answer can be revised with review; the input is immutable, or the test stops being a regression test.

Golden transcripts with tolerance, not equality

Each fixture has a golden transcript, but we never assert string equality against it. We assert that the result is close enough along the axes we care about, and we define 'close enough' per fixture.

def assert_transcript_ok(actual, golden, *, max_wer=0.10):
    # word-level edit distance, normalized
    assert wer(actual, golden) <= max_wer
    # entities matter more than function words
    assert entities(actual) >= entities(golden)   # no dropped entities
    # the safety-critical phrase must survive verbatim
    if golden.help_phrase:
        assert golden.help_phrase in normalize(actual)

The thresholds encode our priorities. A dropped 'the' is within tolerance. A dropped medication name is not. The word 'help' going missing fails the test outright regardless of WER, because that's the one error that maps to someone not getting reached. Tolerance isn't sloppiness — it's spending our strictness where it matters and relaxing it where it doesn't.

Managing the flake instead of pretending it's gone

Even with tolerance, a single run can get unlucky. So we treat each fixture statistically: run it several times and gate on the distribution, not one sample.

  1. Run each fixture N times (we use 5 in CI, more for the safety-critical set) against the candidate build.
  2. Pass on a quorum — e.g. WER within tolerance on at least 4 of 5 — so one unlucky sample doesn't redden a good build.
  3. Track the pass-rate over time, not just pass/fail. A fixture sliding from 5/5 to 4/5 to 3/5 across releases is a regression in progress, even while it's still 'passing.'

That trend line is the real signal. The binary green check tells us today is fine; the drifting pass-rate tells us a change is quietly eroding recognition on the exact recording where a resident in 118 mumbles 'I think I need help' into a room with the TV on. We'd rather catch that erosion in a graph than in a missed call.

evalsregressiontesting

See it in a wing

30 days. One wing. Your numbers.

Ten Companion units, cellular preconfigured, ready in week one. Weekly outcome reports auto-emailed.

Schedule a 20-minute call →