SSML

SSML — Speech Synthesis Markup Language

SSML is an XML-based markup language defined by the W3C that gives you fine-grained control over how a TTS engine pronounces text. Instead of sending raw text and hoping the engine makes good choices about emphasis and pauses, SSML lets you specify exactly how each phrase should be spoken.

Basic structure

<speak>
  Hello, <break time="200ms"/> welcome to the site.
  <prosody rate="slow" pitch="+5%">This part is slower and slightly higher-pitched.</prosody>
  The price is <say-as interpret-as="currency">$22.00</say-as> per month.
</speak>

The <speak> element is the root wrapper. Everything inside it is either plain text to be spoken or SSML markup that changes how that text is spoken.
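Because SSML is XML, a quick well-formedness check before sending a document to a TTS API catches unclosed tags early. The sketch below uses only Python's standard library. The `check_ssml` helper name is hypothetical, the check assumes the document carries no XML namespace declaration, and a real engine may validate more strictly than this.

```python
import xml.etree.ElementTree as ET

SSML = """<speak>
  Hello, <break time="200ms"/> welcome to the site.
  The price is <say-as interpret-as="currency">$22.00</say-as> per month.
</speak>"""


def check_ssml(document: str) -> list:
    """Parse an SSML string and return the tag names it uses, in document order.

    Raises ValueError if the XML is malformed or the root is not <speak>.
    Assumes no xmlns attribute on the root (ElementTree would prefix the tag).
    """
    try:
        root = ET.fromstring(document)
    except ET.ParseError as exc:
        raise ValueError(f"SSML is not well-formed XML: {exc}") from exc
    if root.tag != "speak":
        raise ValueError(f"root element must be <speak>, got <{root.tag}>")
    return [element.tag for element in root.iter()]


print(check_ssml(SSML))  # ['speak', 'break', 'say-as']
```

An unclosed element such as `<speak>Hello <break></speak>` raises `ValueError` here instead of failing later inside the vendor's API.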

Core SSML elements

<break> — Insert a pause:

<break time="500ms"/>   <!-- Half-second pause -->
<break strength="medium"/> <!-- Relative pause (none, x-weak, weak, medium, strong, x-strong) -->
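When pauses are generated programmatically, it helps to reject invalid values before they reach the engine. A minimal sketch, assuming a hypothetical `break_tag` helper; the strength names and the `ms`/`s` time units follow the W3C grammar, but individual vendors may cap the maximum pause length.

```python
import re
from typing import Optional

# Strength keywords defined for <break> in the W3C SSML specification.
STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")
# A non-negative number followed by "ms" or "s", e.g. "500ms" or "1.5s".
TIME_PATTERN = re.compile(r"\d+(\.\d+)?(ms|s)")


def break_tag(time: Optional[str] = None, strength: Optional[str] = None) -> str:
    """Return a <break/> element, raising ValueError on malformed attributes."""
    attrs = []
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown break strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if time is not None:
        if not TIME_PATTERN.fullmatch(time):
            raise ValueError(f"break time must look like '500ms' or '2s': {time!r}")
        attrs.append(f'time="{time}"')
    return "<break" + "".join(" " + attr for attr in attrs) + "/>"


print(break_tag(time="500ms"))       # <break time="500ms"/>
print(break_tag(strength="medium"))  # <break strength="medium"/>
```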

<prosody> — Control rate, pitch, and volume:

<prosody rate="slow" pitch="-10%" volume="loud">
  This will be slow, lower-pitched, and louder.
</prosody>

<say-as> — Tell the engine how to interpret text:

<say-as interpret-as="date" format="mdy">5/19/2026</say-as>
<say-as interpret-as="telephone">+44 20 7946 0958</say-as>
<say-as interpret-as="cardinal">42</say-as>
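Text placed inside SSML has to be XML-escaped, otherwise a value such as "AT&T" produces a malformed document. A minimal sketch, assuming a hypothetical `say_as` helper; which `interpret-as` and `format` values are honoured depends on the engine.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Wrap text in a <say-as> element, escaping XML-reserved characters."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


print(say_as("5/19/2026", "date", fmt="mdy"))
# <say-as interpret-as="date" format="mdy">5/19/2026</say-as>
print(say_as("AT&T", "characters"))
# <say-as interpret-as="characters">AT&amp;T</say-as>
```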

<emphasis> — Stress a word:

That is <emphasis level="strong">not</emphasis> what we said.

<phoneme> — Specify pronunciation directly in IPA or x-SAMPA:

<phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>
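When the same terms recur across many scripts, pronunciations are easier to manage in one lookup table than by hand-tagging each occurrence. The sketch below assumes a hypothetical `LEXICON` dictionary and `apply_lexicon` helper; the IPA strings are illustrative and should be checked against the phoneme set your engine accepts.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical pronunciation table: word -> illustrative IPA transcription.
LEXICON = {
    "pecan": "pɪˈkɑːn",
    "nginx": "ˈɛndʒɪnˌɛks",
}


def apply_lexicon(text: str, lexicon: dict) -> str:
    """Escape plain text and wrap whole-word lexicon matches in <phoneme> tags."""
    if not lexicon:
        return "<speak>" + escape(text) + "</speak>"
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in lexicon) + r")\b",
        flags=re.IGNORECASE,
    )
    lowered = {word.lower(): ipa for word, ipa in lexicon.items()}

    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Text between matches is escaped but otherwise left untouched.
        parts.append(escape(text[last:match.start()]))
        ipa = lowered[match.group(0).lower()]
        parts.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
            f"{escape(match.group(0))}</phoneme>"
        )
        last = match.end()
    parts.append(escape(text[last:]))
    return "<speak>" + "".join(parts) + "</speak>"


print(apply_lexicon("Pecan pie & nginx", LEXICON))
```

Matching is case-insensitive and whole-word only, so "Pecan" is tagged but "pecans" is not; a production lexicon would likely need explicit entries for plurals.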

Which TTS vendors support SSML (2026)

Vendor	SSML support	Notes
Amazon Polly	Full	Broad W3C coverage plus Amazon extensions; neural voices support fewer tags
Google Cloud TTS	Full	W3C compliant + Google extensions
Azure Speech	Full	W3C + Microsoft extensions (pitch contour, etc.)
ElevenLabs	Partial	break, prosody, emphasis only — no phoneme
Murf AI	Full	Via Studio or API
Play.ht	Partial	break, prosody — limited phoneme
Speechify	No	Reading app, not SSML-addressable
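If one script has to target several vendors, one approach is to author full SSML once and strip the elements a given engine does not accept, keeping the text they wrapped. A minimal sketch, assuming a hypothetical `SUPPORTED` map transcribed from two rows of the table above (Play.ht's limited phoneme support is treated as unsupported for simplicity); verify it against each vendor's current documentation before relying on it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical capability map based on the table above; not vendor-verified.
SUPPORTED = {
    "elevenlabs": {"speak", "break", "prosody", "emphasis"},
    "playht": {"speak", "break", "prosody"},
}


def strip_unsupported(ssml: str, vendor: str) -> str:
    """Remove tags the vendor does not support, keeping their inner text."""
    allowed = SUPPORTED[vendor]

    def render(element: ET.Element) -> str:
        inner = escape(element.text or "")
        for child in element:
            inner += render(child) + escape(child.tail or "")
        if element.tag not in allowed:
            return inner  # drop the tag itself, keep what it wrapped
        attrs = "".join(
            f" {name}={quoteattr(value)}" for name, value in element.attrib.items()
        )
        if not inner:
            return f"<{element.tag}{attrs}/>"
        return f"<{element.tag}{attrs}>{inner}</{element.tag}>"

    return render(ET.fromstring(ssml))


SOURCE = (
    '<speak>Say <phoneme alphabet="ipa" ph="x">pecan</phoneme> '
    '<emphasis level="strong">now</emphasis><break time="200ms"/></speak>'
)
print(strip_unsupported(SOURCE, "playht"))
# <speak>Say pecan now<break time="200ms"/></speak>
```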

When SSML matters

For consumer content (YouTube narration, podcast) using a modern neural TTS engine: SSML rarely matters. ElevenLabs Eleven v3 makes good prosody choices automatically — adding SSML markup to a 500-word script is usually wasted effort.

SSML matters when:

You’re building IVR or an AI voice agent, where consistency and predictability across re-generations are critical
You’re using an older engine or a cloud TTS API (Polly, Google) whose default prosody choices are poor
You need specific pronunciations for technical terms, names, or non-standard formatting (phone numbers, currencies, acronyms)
You’re in a regulated domain where the phrasing must be exact (legal, medical, financial disclosures)
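For the IVR and regulated cases, prompts are often generated from a fixed template so that every regeneration sends identical markup. The sketch below builds the keyword arguments for Amazon Polly's `synthesize_speech` call in boto3 (`Text`, `TextType`, `OutputFormat`, and `VoiceId` are real parameters). The template wording, the `build_polly_request` helper, and the choice of voice are assumptions, and the network call itself is left commented out.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical IVR prompt; placeholders are filled with XML-escaped values.
PROMPT = Template(
    '<speak>Account ending in <say-as interpret-as="digits">$last_four</say-as>.'
    '<break time="300ms"/> Thank you, $name.</speak>'
)


def build_polly_request(name: str, last_four: str, voice_id: str = "Joanna") -> dict:
    """Return keyword arguments for boto3's polly.synthesize_speech."""
    ssml = PROMPT.substitute(name=escape(name), last_four=escape(last_four))
    return {
        "Text": ssml,
        "TextType": "ssml",  # tells Polly to parse Text as SSML, not plain text
        "OutputFormat": "mp3",
        "VoiceId": voice_id,
    }


request = build_polly_request("Ms. O'Neil & Co", "4821")
print(request["Text"])

# To synthesize (requires boto3 and AWS credentials):
# import boto3
# audio = boto3.client("polly").synthesize_speech(**request)["AudioStream"].read()
```

Keeping the markup in one template means a wording change is reviewed once, which suits disclosures that must be phrased exactly.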

SSML in the three-axis framework

SSML is a tool for the latency-tolerant batch generation segment. If your use case needs real-time sub-200ms latency, SSML parsing adds overhead you can’t afford — Cartesia Sonic 3 and Deepgram Aura 2 work with lightweight control via API parameters, not full SSML.

Related concepts

Prosody — the voice quality dimension SSML directly controls
Neural TTS — the engine that interprets SSML tags
Voice model — what processes the SSML instructions

See also

Go deeper

Choosing AI Voice Software

Full Glossary

ElevenLabs Review
