Engineering · October 7, 2025 · 3 min read

A comma is not a pause. We have to spell it out.

Default TTS prosody is tuned for fast, fluent listeners. Getting rate, pauses, and emphasis right for an 88-year-old means reaching for SSML — and discovering how unevenly providers support it.

Read this sentence out loud the way a TTS engine does by default: brisk, smooth, no real gaps, every clause running into the next. Now imagine hearing it through one hearing aid, after a stroke, at 4pm when you're already tired. The words are all there. The comprehension is not. Default prosody is tuned for a listener who can keep up, and the resident in 214B often can't. So we stop letting the synthesizer decide pacing and start spelling it out.

What SSML buys us

SSML — Speech Synthesis Markup Language — lets us wrap the text in instructions: slow this down, pause here, lift this word. For eldercare the three controls that earn their keep are speaking rate, explicit pauses, and selective emphasis. A reply about a medication time, marked up, looks like this:

<speak>
  <prosody rate="85%">
    Good morning, Mrs. Okafor.
    <break time="500ms"/>
    Your next pill is at
    <emphasis level="moderate"><say-as interpret-as="time">2:30</say-as></emphasis>
    this afternoon.
    <break time="400ms"/>
    I'll remind you again before then.
  </prosody>
</speak>

Every piece of that is doing work. The rate="85%" gives each word room to land. The <break> after her name is the gap a kind nurse leaves so the resident knows she's being addressed before the content arrives. The <emphasis> on the time, plus <say-as interpret-as="time"> so that 2:30 is spoken as "two thirty" and not "two colon thirty", makes the one fact she needs to retain the most salient sound in the sentence.
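In practice the name and the time come from data, so the markup has to be assembled and escaped in code, not typed by hand. A minimal sketch of that, assuming a hypothetical helper called build_reminder_ssml; the function name, its parameters, and the default rate are illustrative and not taken from the adapter itself:

```python
from xml.sax.saxutils import escape, quoteattr


def build_reminder_ssml(name: str, time_text: str, rate_percent: int = 85) -> str:
    """Assemble the reminder markup shown above (hypothetical helper)."""
    # Escape anything that came from data so a stray "&" or "<"
    # cannot break the markup.
    safe_name = escape(name)
    safe_time = escape(time_text)
    rate = quoteattr(f"{rate_percent}%")  # yields a quoted attribute value
    return (
        "<speak>"
        f"<prosody rate={rate}>"
        f"Good morning, {safe_name}."
        '<break time="500ms"/>'
        "Your next pill is at "
        '<emphasis level="moderate">'
        f'<say-as interpret-as="time">{safe_time}</say-as>'
        "</emphasis>"
        " this afternoon."
        '<break time="400ms"/>'
        "I'll remind you again before then."
        "</prosody>"
        "</speak>"
    )
```

Keeping the pause lengths and the rate in one place also makes them easy to tune per resident later, though that is a design choice this sketch only hints at.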

The catch: SSML support is a patchwork

Here is what nobody puts in the docs: SSML is a standard the way breakfast is a standard. Everyone has one; they don't agree. Across the providers our adapter routes to, support ranges from rich to nearly nonexistent. The realtime/streaming endpoints, the ones we need for low time to first audio, tend to support less SSML than the slower batch endpoints, because the markup has to be parsed and applied while audio is already streaming.

Some providers honor <prosody> and <break> fully but quietly ignore <emphasis> on the realtime path.
Some accept the tags and drop them silently — no error, just default prosody — which is the worst case because it looks like it worked.
Some realtime APIs take no SSML at all and only expose a coarse speed knob and natural-language style hints in the prompt.

So our prosody layer can't assume the markup will land. For each provider in the adapter we keep a capability profile: which tags are honored, which are ignored, which break things. When a target doesn't support a control, we fall back — converting a <break> into actual punctuation and inserted phrasing the model will pause on naturally, or pushing the slow-down into the voice config and the prompt instead of the markup. The intent is the same everywhere; the mechanism degrades to whatever the provider actually respects.
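One way to picture the capability profile and the fallback, as a rough sketch and not the adapter's actual code. CapabilityProfile, RenderPlan, and degrade are hypothetical names, the regex handling assumes simple markup like the reminder example with the rate given as a percentage, and the prompt-hint route is left out:

```python
import re
from dataclasses import dataclass
from html import unescape


@dataclass(frozen=True)
class CapabilityProfile:
    """Hypothetical per-provider record of which SSML controls actually land."""
    name: str
    accepts_ssml: bool
    honors_prosody: bool = False
    honors_break: bool = False
    honors_emphasis: bool = False


@dataclass(frozen=True)
class RenderPlan:
    payload: str  # markup or plain text to send to the provider
    speed: float  # out-of-band rate for the voice config; 1.0 = default


PROSODY_OPEN_RE = re.compile(r"<prosody\b([^>]*)>")
RATE_RE = re.compile(r'rate="(\d+)%"')
BREAK_RE = re.compile(r"\s*<break\b[^>]*/>\s*")
EMPHASIS_RE = re.compile(r"</?emphasis\b[^>]*>")
TAG_RE = re.compile(r"<[^>]+>")


def _breaks_to_punctuation(markup: str) -> str:
    """Swap each <break/> for punctuation the voice will pause on."""
    parts = BREAK_RE.split(markup)
    result = parts[0]
    for part in parts[1:]:
        # Simplification: only the raw character before the break is checked.
        # Text that already ends a clause keeps its own punctuation;
        # otherwise a comma goes where the pause used to be.
        sep = " " if result.endswith((".", ",", "!", "?", ";", ":")) else ", "
        result = result + sep + part
    return result


def degrade(ssml: str, profile: CapabilityProfile) -> RenderPlan:
    """Keep the controls the provider honors; reroute or drop the rest."""
    out, speed = ssml, 1.0
    ok = profile.accepts_ssml

    if not (ok and profile.honors_prosody):
        # Move the slow-down out of the markup and into the voice config.
        open_tag = PROSODY_OPEN_RE.search(out)
        rate = RATE_RE.search(open_tag.group(1)) if open_tag else None
        if rate:
            speed = int(rate.group(1)) / 100
        out = PROSODY_OPEN_RE.sub("", out).replace("</prosody>", "")

    if not (ok and profile.honors_break):
        out = _breaks_to_punctuation(out)

    if not (ok and profile.honors_emphasis):
        # Unwrap the tag but keep the words inside it.
        out = EMPHASIS_RE.sub("", out)

    if not ok:
        # No SSML at all: strip remaining tags, decode entities, tidy spaces.
        out = " ".join(unescape(TAG_RE.sub("", out)).split())

    return RenderPlan(payload=out, speed=speed)
```

For the reminder example, a profile with accepts_ssml set to False would come back as plain punctuated text with a speed of 0.85, while a profile that honors prosody and breaks but not emphasis would get the same markup back with only the emphasis tags unwrapped. A provider that accepts tags and silently drops them is modeled here by marking those tags as not honored, so the fallback still applies.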

The resident in 214B hears "Good morning, Mrs. Okafor," her name landing first, then a pause, then, unhurried and clear, the one time she needs to remember. She never sees the markup or knows that three providers would have rendered it three different ways. She just understood, the first time, without asking us to say it again.
