SSML


SSML — Speech Synthesis Markup Language

SSML is an XML-based markup language defined by the W3C that gives you fine-grained control over how a TTS engine pronounces text. Instead of sending raw text and hoping the engine makes good choices about emphasis and pauses, SSML lets you specify exactly how each phrase should be spoken.

Basic structure

<speak>
  Hello, <break time="200ms"/> welcome to the site.
  <prosody rate="slow" pitch="+5%">This part is spoken slowly and at a slightly higher pitch.</prosody>
  The price is <say-as interpret-as="currency">$22.00</say-as> per month.
</speak>

The <speak> element is the root wrapper. Everything inside it is SSML-tagged text.
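When the text comes from user input or a database, characters such as & and < have to be escaped before they go inside <speak>, otherwise the document stops being well-formed XML. A minimal Python sketch using only the standard library; the helper names to_ssml and is_well_formed are made up for this example:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ssml(text: str) -> str:
    """Wrap plain text in a <speak> root, escaping XML special characters."""
    return f"<speak>{escape(text)}</speak>"


def is_well_formed(ssml: str) -> bool:
    """Return True if the string parses as XML with <speak> as its root.

    Assumes no XML namespace on the root; with a namespace, ElementTree
    reports the tag as "{namespace}speak" and this check would fail.
    """
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError:
        return False
    return root.tag == "speak"


doc = to_ssml("Terms & conditions apply for orders < $10.")
print(doc)
print(is_well_formed(doc))
```

This checks well-formedness only. Whether a given engine accepts every element in the document is a separate question.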

Core SSML elements

<break> — Insert a pause:

<break time="500ms"/>   <!-- Half-second pause -->
<break strength="medium"/> <!-- Relative pause (none, x-weak, weak, medium, strong, x-strong) -->
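If pauses are generated from code, it can help to validate the attribute values before sending them to an engine. The sketch below is one way to do that in Python; break_tag is a hypothetical helper, and the time check is deliberately conservative (digits followed by ms or s):

```python
import re
from typing import Optional

# Relative strengths defined for <break> in SSML 1.1.
STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}
# A non-negative number followed by "ms" or "s", e.g. "500ms" or "1.5s".
TIME_RE = re.compile(r"^\d+(\.\d+)?(ms|s)$")


def break_tag(time: Optional[str] = None, strength: Optional[str] = None) -> str:
    """Build a <break/> element, rejecting malformed attribute values."""
    attrs = []
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if time is not None:
        if not TIME_RE.match(time):
            raise ValueError(f"bad time value: {time!r}")
        attrs.append(f'time="{time}"')
    if not attrs:
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"


print(break_tag(time="500ms"))
print(break_tag(strength="medium"))
```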

<prosody> — Control rate, pitch, and volume:

<prosody rate="slow" pitch="-10%" volume="loud">
  This will be slow, lower-pitched, and louder.
</prosody>
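The same element can be built programmatically. The following Python sketch uses the standard library's ElementTree, which escapes the text content for you; the prosody helper and its attribute whitelist are assumptions made for illustration, and individual engines may accept only a subset of these attributes:

```python
import xml.etree.ElementTree as ET

# Attribute names defined for <prosody> in SSML 1.1.
PROSODY_ATTRS = {"pitch", "contour", "range", "rate", "duration", "volume"}


def prosody(text: str, **attrs: str) -> str:
    """Return a <prosody> element as a string; ElementTree escapes the text."""
    unknown = set(attrs) - PROSODY_ATTRS
    if unknown:
        raise ValueError(f"unsupported prosody attributes: {sorted(unknown)}")
    element = ET.Element("prosody", attrs)
    element.text = text
    return ET.tostring(element, encoding="unicode")


print(prosody("Slow & steady.", rate="slow", pitch="-10%", volume="loud"))
```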

<say-as> — Tell the engine how to interpret text:

<say-as interpret-as="date" format="mdy">5/19/2026</say-as>
<say-as interpret-as="telephone">+44 20 7946 0958</say-as>
<say-as interpret-as="cardinal">42</say-as>

<emphasis> — Stress a word:

That is <emphasis level="strong">not</emphasis> what we said.

<phoneme> — Specify pronunciation directly in IPA or x-SAMPA:

<phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>
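For scripts with many recurring terms, a small pronunciation lexicon can be applied automatically instead of tagging each occurrence by hand. The Python sketch below is one possible approach; LEXICON and apply_lexicon are hypothetical names, keys are assumed to be lowercase, and the IPA string is only an example:

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon: lowercase term -> IPA transcription.
LEXICON = {"pecan": "pɪˈkɑːn"}


def apply_lexicon(text: str, lexicon: dict) -> str:
    """Wrap whole-word lexicon matches in <phoneme>, escaping everything else."""
    if not lexicon:
        return escape(text)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, lexicon)) + r")\b", re.IGNORECASE
    )
    out, pos = [], 0
    for match in pattern.finditer(text):
        # Escape the untouched text between matches.
        out.append(escape(text[pos:match.start()]))
        ipa = lexicon[match.group(0).lower()]
        out.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
            f"{escape(match.group(0))}</phoneme>"
        )
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


print("<speak>" + apply_lexicon("Pecan pie & cream", LEXICON) + "</speak>")
```

Matching runs on the raw text and escaping is applied piece by piece, so a lexicon term can never match inside an entity such as &amp;.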

Which TTS vendors support SSML (2026)

Vendor           | SSML support | Notes
Amazon Polly     | Full         | Best SSML implementation; all W3C elements
Google Cloud TTS | Full         | W3C compliant + Google extensions
Azure Speech     | Full         | W3C + Microsoft extensions (pitch contour, etc.)
ElevenLabs       | Partial      | break, prosody, emphasis only; no phoneme
Murf AI          | Full         | Via Studio or API
Play.ht          | Partial      | break, prosody; limited phoneme
Speechify        | No           | Reading app, not SSML-addressable
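If one script has to run on several engines, a pre-flight check can flag elements that a target vendor is not expected to handle. The Python sketch below transcribes the two "Partial" rows of the table above into sets. Those sets are this article's claims, not values read from any vendor API; "limited phoneme" is treated as unsupported for simplicity, and unsupported_elements is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

# Transcribed from the table above (assumption: not verified against vendor docs).
SUPPORTED = {
    "ElevenLabs": {"speak", "break", "prosody", "emphasis"},
    "Play.ht": {"speak", "break", "prosody"},
}


def unsupported_elements(ssml: str, vendor: str) -> list:
    """Return the sorted element names in the document that the vendor lacks."""
    root = ET.fromstring(ssml)
    allowed = SUPPORTED[vendor]
    return sorted({el.tag for el in root.iter() if el.tag not in allowed})


sample = (
    '<speak>Hi <break time="200ms"/> '
    '<phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme> '
    '<emphasis level="strong">now</emphasis></speak>'
)
print(unsupported_elements(sample, "ElevenLabs"))
print(unsupported_elements(sample, "Play.ht"))
```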

When SSML matters

For consumer content (YouTube narration, podcasts) produced with a modern neural TTS engine, SSML rarely matters. ElevenLabs Eleven v3 makes good prosody choices automatically, so adding SSML markup to a 500-word script is usually wasted effort.

SSML matters when:

  • You’re building an IVR system or an AI voice agent where consistency and predictability across re-generations are critical
  • You’re using an older TTS engine or a cloud TTS API (Polly, Google) whose default prosody choices are poor
  • You need specific pronunciations for technical terms, names, or non-standard formatting (phone numbers, currencies, acronyms)
  • You’re in a regulated domain where the phrasing must be exact (legal, medical, financial disclosures)
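For the non-standard formatting case, tagging can be automated so that every re-generation reads amounts the same way. A minimal Python sketch, assuming US-dollar amounts and an engine that accepts interpret-as="currency" (support for say-as values varies by vendor); CURRENCY_RE and tag_currency are made-up names:

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern: US-dollar amounts such as $22 or $22.00.
CURRENCY_RE = re.compile(r"\$\d+(?:\.\d{2})?")


def tag_currency(text: str) -> str:
    """Escape the text, then wrap dollar amounts in <say-as>.

    Escaping first is safe here because the entities it produces
    (&amp;, &lt;, &gt;) can never match the currency pattern.
    """
    escaped = escape(text)
    tagged = CURRENCY_RE.sub(
        lambda m: f'<say-as interpret-as="currency">{m.group(0)}</say-as>',
        escaped,
    )
    return f"<speak>{tagged}</speak>"


print(tag_currency("The price is $22.00 per month."))
```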

SSML in the three-axis framework

SSML is a tool for the latency-tolerant batch generation segment. If your use case needs real-time sub-200ms latency, SSML parsing adds overhead you can’t afford — Cartesia Sonic 3 and Deepgram Aura 2 work with lightweight control via API parameters, not full SSML.

See also

  • Prosody: the voice quality dimension SSML directly controls
  • Neural TTS: the engine that interprets SSML tags
  • Voice model: what processes the SSML instructions