Prosody

Prosody is the collective term for the suprasegmental features of speech: rhythm, stress, intonation, and tempo. It’s the difference between reading a sentence in a monotone voice and reading it the way a person actually speaks — with emphasis on key words, natural pauses, rising intonation for questions, and falling intonation at the end of declarative sentences.

If MOS (Mean Opinion Score) is the overall quality score for a TTS voice, prosody is the dimension that most determines that score. A voice can be technically clean — no artifacts, clear phoneme articulation — but still score poorly if its prosody is flat or wrong.

The three dimensions of prosody

Pitch (intonation): The rise and fall of fundamental frequency (F0) across a sentence. In English, yes/no questions typically rise, statements and most wh-questions fall, and lists follow their own melodic pattern. TTS systems that get pitch wrong sound robotic or confused.

Rhythm (tempo): The timing of words, syllables, and phrases. Humans don’t speak at a constant rate — they speed up on less important clauses and slow down for emphasis. Flat-rate TTS sounds unnatural even if every phoneme is correctly pronounced.

Stress: The relative emphasis on words and syllables within words. “I didn’t say she stole the money” — the meaning changes completely depending on which word is stressed. Good TTS infers stress from context; poor TTS stresses every word equally.
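The pitch and rhythm dimensions can be made roughly measurable. The sketch below is illustrative only: the function names are hypothetical, and it assumes an F0 track in Hz (with 0 marking unvoiced frames) and a list of syllable durations have already been obtained from a pitch tracker and an aligner, which are not shown. It reports pitch variation in semitones and the variability of syllable timing; both approach zero for flat, constant-rate speech. Stress is not covered here.

```python
import math
from statistics import mean, pstdev


def hz_to_semitones(f0_hz, ref_hz):
    """Convert a frequency in Hz to semitones relative to ref_hz."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def pitch_variation_semitones(f0_track_hz):
    """Standard deviation of voiced F0, in semitones around the mean F0.

    Assumption: frames with F0 <= 0 are unvoiced and are skipped.
    A perfectly monotone voice gives 0.0.
    """
    voiced = [f for f in f0_track_hz if f > 0]
    if len(voiced) < 2:
        return 0.0
    ref = mean(voiced)
    return pstdev([hz_to_semitones(f, ref) for f in voiced])


def rate_variation(syllable_durations_s):
    """Coefficient of variation of syllable durations.

    A strictly constant speaking rate gives 0.0.
    """
    if len(syllable_durations_s) < 2:
        return 0.0
    return pstdev(syllable_durations_s) / mean(syllable_durations_s)


if __name__ == "__main__":
    flat = [120.0, 120.0, 0.0, 120.0]          # monotone, one unvoiced frame
    lively = [110.0, 140.0, 0.0, 180.0, 95.0]  # wider pitch movement
    print(pitch_variation_semitones(flat))
    print(pitch_variation_semitones(lively))
    print(rate_variation([0.18, 0.22, 0.35, 0.15]))
```

Semitones are used rather than raw Hz so that the same melodic movement scores similarly for low and high voices. These numbers describe how much variation there is, not whether it is appropriate: a voice can vary a lot and still stress the wrong word.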

Where modern TTS stands on prosody (2026)

The MOS plateau (4.4–4.6 for all flagship vendors) is substantially a prosody plateau. All major neural TTS systems now produce clean, artifact-free audio. The remaining gap to human voiceover is mostly in:

  1. Emotional prosody: sarcasm, surprise, genuine laughter, grief. Human speakers use extreme pitch variation, tempo changes, and voice quality shifts; current TTS systems clip at both ends of the emotional range.

  2. Long-form prosody: beyond roughly 1,000 words, some TTS systems “drift”. The intonation pattern becomes repetitive and the rhythm flattens. ElevenLabs Eleven v3 handles this better than Murf Studio; both are behind a professional VO actor on 5,000-word narration.

  3. Contextual stress: inferring which word in a sentence should be stressed requires understanding semantics, not just syntax. Compare “the red car” (neutral) with “the RED car” (as opposed to the blue one). The stress differs, but a TTS system reading raw text can’t reliably make that call.

Controlling prosody

The two main tools:

SSML: Direct prosody control via markup. Rate, pitch, volume, and break timing are specified explicitly. See the SSML glossary entry for implementation. Best for formal, predictable content (IVR, legal disclosures).
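As a minimal sketch of what explicit control looks like, the Python helpers below assemble an SSML string using the standard `<prosody>` and `<break>` elements. The helper names (`prosody`, `pause`, `speak`) and the sample sentences are hypothetical, and which attribute values a given vendor accepts varies, so check the vendor's SSML documentation before relying on any of them.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate=None, pitch=None, volume=None):
    """Wrap text in an SSML <prosody> element, emitting only the attributes given."""
    attrs = {"rate": rate, "pitch": pitch, "volume": volume}
    attr_str = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in attrs.items()
        if value is not None
    )
    # escape() protects &, < and > so the text cannot break the markup
    return f"<prosody{attr_str}>{escape(text)}</prosody>"


def pause(ms):
    """Insert an explicit pause of the given length in milliseconds."""
    return f'<break time="{int(ms)}ms"/>'


def speak(*parts):
    """Wrap the assembled parts in the SSML root element."""
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    ssml = speak(
        prosody("This call may be recorded.", rate="slow"),
        pause(400),
        prosody("Press 1 to continue.", pitch="+2st"),
    )
    print(ssml)
```

Building the markup through small functions instead of string concatenation keeps the escaping in one place, which matters when the spoken text comes from user data or a database.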

Prompt engineering (ElevenLabs): ElevenLabs Eleven v3 responds to cues embedded in the text itself: a parenthetical direction such as (in a hushed tone) before a phrase, an ellipsis (...) for hesitation, an exclamation mark (!) for exclamation. This is less precise than SSML but sounds more natural for conversational content.

Vendor prosody comparison

| Vendor | Neutral narration | Conversational | Emotional range | Long-form consistency |
|---|---|---|---|---|
| ElevenLabs v3 | Excellent | Excellent | Good | Good |
| Cartesia Sonic 3 | Excellent | Good | Good | Good |
| Murf Studio | Good | Fair | Fair | Excellent |
| Play.ht PlayDialog | Good | Good | Fair | Good |
| Speechify | Fair | Fair | Poor | Fair |
| Amazon Polly Neural | Good | Fair | Poor | Good |

“Excellent” = indistinguishable from human VO in blinded test. “Fair” = audibly synthetic in focused listening.

See also

  • Neural TTS: the architecture that generates prosodically aware speech
  • SSML: the markup language for explicit prosody control
  • Voice cloning: cloned voices replicate the original speaker’s prosodic patterns