SSML — Speech Synthesis Markup Language
SSML is an XML-based markup language defined by the W3C that gives you fine-grained control over how a TTS engine pronounces text. Instead of sending raw text and hoping the engine makes good choices about emphasis and pauses, SSML lets you specify exactly how each phrase should be spoken.
Basic structure
<speak>
Hello, <break time="200ms"/> welcome to the site.
<prosody rate="slow" pitch="+5%">This part is emphasized.</prosody>
The price is <say-as interpret-as="currency">$22.00</say-as> per month.
</speak>
The <speak> element is the required root of every SSML document. All spoken text and every other SSML element must sit inside it.
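Because SSML is XML, a stray `&` or `<` in the script text makes the whole document invalid. The sketch below is one way to assemble a `<speak>` document and check that it parses before sending it to an engine. The helper names (`build_ssml`, `text`, `is_well_formed`) are hypothetical, and the check only covers XML well-formedness, not whether a given vendor accepts each tag.

```python
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape


def text(raw: str) -> str:
    """Escape plain script text so &, < and > cannot break the markup."""
    return escape(raw)


def build_ssml(*parts: str) -> str:
    """Join already-escaped text and SSML fragments under one <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


def is_well_formed(ssml: str) -> bool:
    """Return True if the string parses as XML and its root is <speak>."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError:
        return False
    return root.tag == "speak"


ssml = build_ssml(
    text("Hello, "),
    '<break time="200ms"/>',
    text(" welcome to Q&A."),
)
print(ssml)
print(is_well_formed(ssml))
```

Running the check on every generated script is cheap and catches unescaped characters before they turn into a failed API request.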
Core SSML elements
<break> — Insert a pause:
<break time="500ms"/> <!-- Half-second pause -->
<break strength="medium"/> <!-- Relative pause (none, x-weak, weak, medium, strong, x-strong) -->
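When pauses are generated from code, it can help to validate the attribute values before building the tag. This is a minimal sketch with a hypothetical `pause` helper. It assumes durations are written as a number followed by `ms` or `s`, and it uses the W3C strength names, which also include `none`. It accepts exactly one of the two attributes for simplicity, although the specification allows both.

```python
import re
from typing import Optional

# Strength keywords from the W3C SSML specification.
STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")
# Assumed duration format: digits, optional decimals, then "ms" or "s".
_DURATION = re.compile(r"\d+(\.\d+)?(ms|s)")


def pause(time: Optional[str] = None, strength: Optional[str] = None) -> str:
    """Return a <break/> tag for an absolute time or a relative strength."""
    if (time is None) == (strength is None):
        raise ValueError("pass exactly one of time= or strength=")
    if time is not None:
        if not _DURATION.fullmatch(time):
            raise ValueError(f"bad duration: {time!r}")
        return f'<break time="{time}"/>'
    if strength not in STRENGTHS:
        raise ValueError(f"unknown strength: {strength!r}")
    return f'<break strength="{strength}"/>'


print(pause(time="500ms"))
print(pause(strength="medium"))
```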
<prosody> — Control rate, pitch, and volume:
<prosody rate="slow" pitch="-10%" volume="loud">
This will be slow, lower-pitched, and louder.
</prosody>
<say-as> — Tell the engine how to interpret text:
<say-as interpret-as="date" format="mdy">5/19/2026</say-as>
<say-as interpret-as="telephone">+44 20 7946 0958</say-as>
<say-as interpret-as="cardinal">42</say-as>
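If values such as dates or prices come from a database, wrapping them programmatically keeps the markup consistent. The following is a sketch with a hypothetical `say_as` helper. Note that the set of accepted `interpret-as` values is defined by each vendor rather than by the W3C specification, so the values passed in should be checked against the target engine's documentation.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(value: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Wrap a value in <say-as>, escaping the text and quoting attributes."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(value)}</say-as>"


print(say_as("5/19/2026", "date", "mdy"))
print(say_as("42", "cardinal"))
```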
<emphasis> — Stress a word:
That is <emphasis level="strong">not</emphasis> what we said.
<phoneme> — Specify pronunciation directly in IPA or x-SAMPA:
<phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>
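For scripts with recurring names or technical terms, a small pronunciation lexicon applied in code avoids hand-tagging every occurrence. The sketch below assumes a hypothetical `LEXICON` dictionary with lowercase keys and IPA values, and a hypothetical `apply_lexicon` helper. It matches whole words case-insensitively and escapes everything else.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon: lowercase word -> IPA transcription.
LEXICON = {"pecan": "pɪˈkɑːn"}


def apply_lexicon(text: str, lexicon: dict) -> str:
    """Wrap each lexicon word in <phoneme>, escaping all other text."""
    if not lexicon:
        return escape(text)
    # Longest words first so longer entries win over shorter prefixes.
    words = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    out = []
    last = 0
    for match in pattern.finditer(text):
        out.append(escape(text[last:match.start()]))
        word = match.group(0)
        ph = lexicon[word.lower()]
        out.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(ph)}>{escape(word)}</phoneme>'
        )
        last = match.end()
    out.append(escape(text[last:]))
    return "".join(out)


print(apply_lexicon("Pecan pie & tea", LEXICON))
```

The result is an SSML fragment, so it still needs to be placed inside a `<speak>` root before synthesis.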
Which TTS vendors support SSML (2026)
| Vendor | SSML support | Notes |
|---|---|---|
| Amazon Polly | Full | Mature implementation of most W3C elements plus Amazon extensions; some tags are unavailable on neural voices |
| Google Cloud TTS | Full | W3C compliant + Google extensions |
| Azure Speech | Full | W3C + Microsoft extensions (pitch contour, etc.) |
| ElevenLabs | Partial | break, prosody, emphasis only — no phoneme |
| Murf AI | Full | Via Studio or API |
| Play.ht | Partial | break, prosody — limited phoneme |
| Speechify | No | Reading app, not SSML-addressable |
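If the same script has to run on engines with different levels of support, one option is to strip the tags a vendor does not accept while keeping the text they enclose. The sketch below is an illustration under assumptions: the `VENDOR_TAGS` map is transcribed from the table above rather than from vendor documentation, the vendor keys and the `degrade` helper are hypothetical, and the input is assumed to carry no XML namespace.

```python
from typing import Iterable
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Assumption: tag sets copied from the table above; verify against vendor docs.
VENDOR_TAGS = {
    "elevenlabs": {"break", "prosody", "emphasis"},
    "playht": {"break", "prosody"},
}


def _render(el: ET.Element, allowed: set) -> str:
    """Serialize an element, unwrapping any tag that is not allowed."""
    inner = escape(el.text or "") + "".join(
        _render(child, allowed) + escape(child.tail or "") for child in el
    )
    if el.tag not in allowed:
        return inner  # drop the tag but keep the text it enclosed
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in el.attrib.items())
    if not inner:
        return f"<{el.tag}{attrs}/>"
    return f"<{el.tag}{attrs}>{inner}</{el.tag}>"


def degrade(ssml: str, allowed: Iterable[str]) -> str:
    """Return SSML containing only the allowed tags (the root is always kept)."""
    return _render(ET.fromstring(ssml), set(allowed) | {"speak"})


source = (
    '<speak>Say <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>'
    '<break time="200ms"/> now.</speak>'
)
print(degrade(source, VENDOR_TAGS["elevenlabs"]))
```

Unwrapping a tag such as `<say-as>` or `<phoneme>` leaves the raw text for the engine to read with its default rules, so the output should still be spot-checked by ear.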
When SSML matters
For consumer content (YouTube narration, podcasts) on a modern neural TTS engine, SSML rarely matters. ElevenLabs Eleven v3 makes good prosody choices automatically, so adding SSML markup to a 500-word script is usually wasted effort.
SSML matters when:
- You’re building an IVR system or an AI voice agent where consistency and predictability across re-generations are critical
- You’re using an older engine or a general-purpose cloud TTS API (Polly, Google) whose default prosody choices are poor
- You need specific pronunciations for technical terms, names, or non-standard formatting (phone numbers, currencies, acronyms)
- You’re in a regulated domain where the phrasing must be exact (legal, medical, financial disclosures)
SSML in the three-axis framework
SSML is a tool for the latency-tolerant batch generation segment. If your use case needs real-time, sub-200ms latency, SSML parsing adds overhead you can’t afford. Engines aimed at that segment, such as Cartesia Sonic 3 and Deepgram Aura 2, offer lightweight control through API parameters rather than full SSML.
Related concepts
- Prosody — the voice quality dimension SSML directly controls
- Neural TTS — the engine that interprets SSML tags
- Voice model — what processes the SSML instructions
See also
- Learn: TTS for video content
- Murf AI review — full SSML support, good for script-heavy projects