Prosody

Prosody is the collective term for the suprasegmental features of speech: rhythm, stress, intonation, and tempo. It’s the difference between reading a sentence in a monotone voice and reading it the way a person actually speaks — with emphasis on key words, natural pauses, rising intonation for questions, and falling intonation at the end of declarative sentences.

If MOS (Mean Opinion Score) is the overall quality score for a TTS voice, prosody is the dimension that most determines that score. A voice can be technically clean — no artifacts, clear phoneme articulation — but still score poorly if its prosody is flat or wrong.
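As a rough illustration of where a MOS figure comes from, the sketch below averages listener ratings on the usual 1-to-5 scale. The ratings are made up for the example, and `mean_opinion_score` is a hypothetical helper name, not part of any vendor's tooling.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings on a 1-5 scale (hypothetical helper)."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


# Made-up ratings for two voices. Both are assumed to be artifact-free,
# but listeners mark the second one down for flat delivery.
expressive_voice = [5, 4, 5, 4, 5]
flat_voice = [3, 4, 3, 3, 4]

print(mean_opinion_score(expressive_voice))
print(mean_opinion_score(flat_voice))
```

Real MOS studies use many more listeners and usually report a confidence interval alongside the mean; this only shows the arithmetic.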

The three dimensions of prosody

Pitch (intonation): The rise and fall of fundamental frequency across a sentence. Questions rise; statements fall; lists have a particular melodic pattern. TTS systems that get pitch wrong sound robotic or confused.

Rhythm (tempo): The timing of words, syllables, and phrases. Humans don’t speak at a constant rate — they speed up on less important clauses and slow down for emphasis. Flat-rate TTS sounds unnatural even if every phoneme is correctly pronounced.

Stress: The relative emphasis on words and syllables within words. “I didn’t say she stole the money” — the meaning changes completely depending on which word is stressed. Good TTS infers stress from context; poor TTS stresses every word equally.
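The pitch and rhythm dimensions above can be put into numbers. The sketch below is one simple way to do that, assuming you already have a fundamental-frequency (F0) track in hertz for the voiced frames and a list of syllable durations in seconds. The function names and sample values are invented for illustration; real prosody analysis uses more robust measures.

```python
import math
from statistics import mean, pstdev


def hz_to_semitones(f0_hz, ref_hz):
    """Express a frequency as semitones above (+) or below (-) ref_hz."""
    return 12 * math.log2(f0_hz / ref_hz)


def pitch_spread(f0_track_hz):
    """Standard deviation of an F0 track, in semitones around its mean.

    Assumes every value is a positive F0 from a voiced frame.
    """
    ref = mean(f0_track_hz)
    return pstdev([hz_to_semitones(f, ref) for f in f0_track_hz])


def duration_variability(durations_s):
    """Coefficient of variation of syllable durations (0 = perfectly even)."""
    return pstdev(durations_s) / mean(durations_s)


# Invented F0 tracks: a question that rises at the end, and a monotone read.
question = [180.0, 185.0, 190.0, 210.0, 240.0]
monotone = [200.0] * 5

print(hz_to_semitones(question[-1], question[0]))  # positive: final rise
print(pitch_spread(question), pitch_spread(monotone))

# Invented syllable durations: varied human-like timing vs. a flat rate.
natural = [0.12, 0.18, 0.10, 0.30, 0.15]
even = [0.20] * 5

print(duration_variability(natural), duration_variability(even))
```

Semitones are used rather than raw hertz because pitch perception is roughly logarithmic, so the same spread means about the same thing for a low voice and a high one.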

Where modern TTS stands on prosody (2026)

The MOS plateau (4.4–4.6 for all flagship vendors) is substantially a prosody plateau. All major neural TTS systems now produce clean, artifact-free audio. The remaining gap to human voiceover is mostly in:

Emotional prosody: sarcasm, surprise, genuine laughter, grief. Human speakers use extreme pitch variation, tempo changes, and voice quality shifts. Current TTS systems clip at both ends of the emotional range.
Long-form prosody: beyond 1,000 words, some TTS systems “drift”, meaning the intonation pattern becomes repetitive and the rhythm flattens. ElevenLabs Eleven v3 handles this better than Murf Studio; both are behind a professional VO actor on 5,000-word narration.
Contextual stress: inferring which word in a sentence should be stressed requires understanding semantics, not just syntax. Compare “the red car” spoken neutrally with “the RED car” (as opposed to the blue one): the stress differs, but a TTS system reading raw text can’t reliably make that call.
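One informal way to check for the long-form drift described above is to compare how much pitch varies early in a narration with how much it varies late. The sketch below is an assumption-laden illustration, not an established metric: it takes a hypothetical list of per-sentence mean pitch values (in semitones relative to the speaker's average), splits it into non-overlapping windows, and compares the spread of the last window with the first.

```python
from statistics import pstdev


def windowed_spread(values, window):
    """Standard deviation of each consecutive, non-overlapping window."""
    return [
        pstdev(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]


def flattening_ratio(values, window):
    """Spread of the last window divided by spread of the first.

    A ratio well below 1 suggests the delivery has flattened over time.
    """
    spreads = windowed_spread(values, window)
    if len(spreads) < 2 or spreads[0] == 0:
        raise ValueError("need two or more windows and a non-flat first window")
    return spreads[-1] / spreads[0]


# Invented per-sentence pitch values: lively at the start, flatter later.
early = [0.0, 3.0, -2.0, 4.0]
late = [0.0, 1.0, -1.0, 1.0]

print(flattening_ratio(early + late, window=4))
```

The same windowing could be applied to syllable-duration variability to look for rhythm flattening.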

Controlling prosody

The two main tools:

SSML: Direct prosody control via markup — specify rate, pitch, volume, and break timing explicitly. See SSML glossary entry for implementation. Best for formal, predictable content (IVR, legal disclosures).

Prompt engineering (ElevenLabs): ElevenLabs Eleven v3 responds to cues written into the text itself: a parenthetical such as “(in a hushed tone)” before a phrase, “...” for hesitation, and “!” for exclamation. Less precise than SSML but more natural for conversational content.
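To make the two approaches concrete, the sketch below prepares the same kind of input both ways. `build_ssml` assembles a small SSML fragment using the standard `prosody` and `break` elements; `with_cue` decorates plain text with the parenthetical and ellipsis cues mentioned above. Both function names are hypothetical, no vendor API is called, and vendors differ in which SSML attributes and values they accept, so treat the specific values as placeholders.

```python
import xml.etree.ElementTree as ET


def build_ssml(intro, emphasized, pause_ms=500):
    """Return an SSML string: intro text, a pause, then a slowed, raised phrase."""
    speak = ET.Element("speak")
    speak.text = intro
    # <break time="..."/> inserts an explicit pause.
    ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    # <prosody> sets rate, pitch, and volume for the text it wraps.
    prosody = ET.SubElement(
        speak, "prosody", {"rate": "slow", "pitch": "+2st", "volume": "loud"}
    )
    prosody.text = emphasized
    return ET.tostring(speak, encoding="unicode")


def with_cue(text, cue=None, hesitate=False):
    """Prefix text with a parenthetical cue; optionally trail off with '...'."""
    parts = []
    if cue:
        parts.append(f"({cue})")
    parts.append(text.rstrip() + ("..." if hesitate else ""))
    return " ".join(parts)


print(build_ssml("This call may be recorded.", "Press one to continue."))
print(with_cue("I think we should go", cue="in a hushed tone", hesitate=True))
```

Building the markup with an XML library rather than string concatenation means characters such as `&` or `<` in the input text are escaped automatically.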

Vendor prosody comparison

Vendor	Neutral narration	Conversational	Emotional range	Long-form consistency
ElevenLabs v3	Excellent	Excellent	Good	Good
Cartesia Sonic 3	Excellent	Good	Good	Good
Murf Studio	Good	Fair	Fair	Excellent
Play.ht PlayDialog	Good	Good	Fair	Good
Speechify	Fair	Fair	Poor	Fair
Amazon Polly Neural	Good	Fair	Poor	Good

“Excellent” = indistinguishable from human VO in blinded test. “Fair” = audibly synthetic in focused listening.

See also

Neural TTS: the architecture that generates prosodically aware speech
SSML: the markup language for explicit prosody control
Voice cloning: cloned voices replicate the original speaker’s prosodic patterns
