Prosody
Prosody is the collective term for the suprasegmental features of speech: rhythm, stress, intonation, and tempo. It’s the difference between reading a sentence in a monotone voice and reading it the way a person actually speaks — with emphasis on key words, natural pauses, rising intonation for questions, and falling intonation at the end of declarative sentences.
If MOS (Mean Opinion Score) is the overall quality score for a TTS voice, prosody is the dimension that most determines that score. A voice can be technically clean — no artifacts, clear phoneme articulation — but still score poorly if its prosody is flat or wrong.
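As a rough illustration of what a MOS figure is, the sketch below averages listener ratings on the usual 1 to 5 scale and reports an approximate 95% confidence half-width. The function name `mean_opinion_score` and the normal-approximation interval are assumptions made for this example, not a description of how any vendor computes its published scores.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is an approximate 95% confidence half-width based on the
    normal approximation; it is 0.0 when only one rating is available.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = mean(ratings)
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Hypothetical ratings from four listeners for one voice sample.
mos, half_width = mean_opinion_score([4, 5, 4, 5])
print(f"MOS {mos:.2f} +/- {half_width:.2f}")
```

With only a handful of listeners the interval is wide, which is worth keeping in mind when comparing voices whose scores differ by a tenth of a point.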
The three dimensions of prosody
Pitch (intonation): The rise and fall of fundamental frequency across a sentence. Yes/no questions typically rise; statements typically fall; lists have a particular melodic pattern. TTS systems that get pitch wrong sound robotic or confused.
Rhythm (tempo): The timing of words, syllables, and phrases. Humans don’t speak at a constant rate — they speed up on less important clauses and slow down for emphasis. Flat-rate TTS sounds unnatural even if every phoneme is correctly pronounced.
Stress: The relative emphasis on words and syllables within words. “I didn’t say she stole the money” — the meaning changes completely depending on which word is stressed. Good TTS infers stress from context; poor TTS stresses every word equally.
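To make the pitch dimension concrete, here is a minimal sketch that labels the end of an utterance as rising, falling, or level from a fundamental-frequency (F0) track. The input format (one F0 value in Hz per frame, with 0 or None for unvoiced frames), the function name `final_contour`, and the threshold are all assumptions for illustration. Obtaining the F0 track from audio would need a separate pitch tracker, which is not shown.

```python
def final_contour(f0_hz, tail_frames=10, threshold_hz_per_frame=0.5):
    """Label the final pitch movement as 'rising', 'falling', or 'level'.

    f0_hz: per-frame F0 values in Hz; 0 or None marks an unvoiced frame.
    The label comes from the least-squares slope over the last voiced frames.
    """
    voiced = [f for f in f0_hz if f]  # drop unvoiced frames
    tail = voiced[-tail_frames:]
    n = len(tail)
    if n < 2:
        return "level"
    x_mean = (n - 1) / 2
    y_mean = sum(tail) / n
    # Least-squares slope of F0 against frame index.
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(tail))
    den = sum((i - x_mean) ** 2 for i in range(n))
    slope = num / den
    if slope > threshold_hz_per_frame:
        return "rising"
    if slope < -threshold_hz_per_frame:
        return "falling"
    return "level"


# Hypothetical F0 tracks: a statement-like fall and a question-like rise.
print(final_contour([200, 190, 180, 170, 160]))
print(final_contour([150, 160, 0, 170, 180]))
```

A check like this only captures the final movement of the contour. Rhythm and stress would need duration and energy measurements as well.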
Where modern TTS stands on prosody (2026)
The MOS plateau (4.4–4.6 for all flagship vendors) is substantially a prosody plateau. All major neural TTS systems now produce clean, artifact-free audio. The remaining gap to human voiceover is mostly in:
- Emotional prosody: sarcasm, surprise, genuine laughter, grief. Human speakers use extreme pitch variation, tempo changes, and voice quality shifts. Current TTS systems clip at both ends of the emotional range.
- Long-form prosody: beyond about 1,000 words, some TTS systems “drift”: the intonation pattern becomes repetitive and the rhythm flattens. ElevenLabs Eleven v3 handles this better than Murf Studio; both are behind a professional VO actor on 5,000-word narration.
- Contextual stress: inferring which word in a sentence should be stressed requires understanding semantics, not just syntax. Compare “the red car” with default stress and “the RED car” (as opposed to the blue one). The stress differs, but a TTS system reading raw text can’t reliably make that call.
Controlling prosody
The two main tools:
SSML: Direct prosody control via markup — specify rate, pitch, volume, and break timing explicitly. See SSML glossary entry for implementation. Best for formal, predictable content (IVR, legal disclosures).
Prompt engineering (ElevenLabs): ElevenLabs Eleven v3 responds to parenthetical cues in text — (in a hushed tone) before a phrase, ... for hesitation, ! for exclamation. Less precise than SSML but more natural for conversational content.
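For the SSML route, the sketch below builds a small document using the standard `<prosody>` and `<break>` elements. The helper names (`prosody`, `pause`, `speak`) are hypothetical, and the attribute values are only examples. Which attributes and value formats a given vendor accepts varies, so treat this as an illustration of the markup rather than a vendor-specific recipe.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate=None, pitch=None, volume=None):
    """Wrap text in an SSML <prosody> element with any attributes provided."""
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in (("rate", rate), ("pitch", pitch), ("volume", volume))
        if value is not None
    )
    # escape() protects characters such as & and < inside the spoken text.
    return f"<prosody{attrs}>{escape(text)}</prosody>"


def pause(ms):
    """Insert an explicit pause of the given length in milliseconds."""
    return f'<break time="{int(ms)}ms"/>'


def speak(*parts):
    """Join SSML fragments under a root <speak> element."""
    return "<speak>" + "".join(parts) + "</speak>"


# Example: a slow disclosure line, a pause, then a slightly lower-pitched line.
ssml = speak(
    prosody("This call may be recorded.", rate="slow"),
    pause(400),
    prosody("Terms & conditions apply.", rate="90%", pitch="-2st"),
)
print(ssml)
```

Building the markup through small helpers keeps escaping in one place, which matters for disclosure text that may contain ampersands or angle brackets.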
Vendor prosody comparison
| Vendor | Neutral narration | Conversational | Emotional range | Long-form consistency |
|---|---|---|---|---|
| ElevenLabs v3 | Excellent | Excellent | Good | Good |
| Cartesia Sonic 3 | Excellent | Good | Good | Good |
| Murf Studio | Good | Fair | Fair | Excellent |
| Play.ht PlayDialog | Good | Good | Fair | Good |
| Speechify | Fair | Fair | Poor | Fair |
| Amazon Polly Neural | Good | Fair | Poor | Good |
“Excellent” = indistinguishable from human VO in blinded test. “Fair” = audibly synthetic in focused listening.
Related concepts
- Neural TTS — the architecture that generates prosodically-aware speech
- SSML — the markup language for explicit prosody control
- Voice cloning — cloned voices replicate the original speaker’s prosodic patterns
See also
- ElevenLabs review — the current leader on conversational prosody
- ElevenLabs vs Murf comparison