MOS (Mean Opinion Score)

Voice Quality · quality, benchmark, measurement, voice-naturalness

MOS — Mean Opinion Score. A subjective 1-to-5 rating of perceived voice naturalness, averaged across 15–30 blinded listeners per sample.
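As a rough illustration of the averaging described above, here is a minimal Python sketch that computes a MOS and an approximate 95% confidence interval for one sample. The function name `mean_opinion_score`, the ratings, and the normal-approximation interval are all assumptions made for this example, not a description of any particular panel's procedure.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of 1-to-5 listener ratings.

    half_width is an approximate 95% confidence half-width using the
    normal approximation (1.96 * standard error). This is an assumption
    of the sketch; with small panels a t-based interval is slightly wider.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-to-5 scale")
    mos = statistics.fmean(ratings)
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * std_err


# Hypothetical ratings from 20 blinded listeners for a single sample.
ratings = [5, 4, 5, 4, 5, 5, 4, 4, 5, 5, 4, 5, 5, 4, 5, 4, 5, 5, 4, 5]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Reporting the interval alongside the mean makes it easier to judge whether two scores are meaningfully different.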

MOS was the TTS industry’s primary quality metric from 2010 to 2024. Human speech typically scores 4.5–4.8. Early neural TTS scored 3.5–4.0. The race to improve MOS drove the major model releases: WaveNet (2016), Tacotron 2 (2018), ElevenLabs v1 (2023).

Why MOS plateaued

MOS plateaued in 2025. The leading flagship vendors now score between 4.4 and 4.6, which is statistically indistinguishable from human voiceover actors in blinded comparisons:

Vendor                  MOS (our panel, 2026)
ElevenLabs Eleven v3    4.6
Cartesia Sonic 3        4.6
Play.ht PlayDialog      4.4
Murf Studio             4.4
Speechify               4.3
Descript Overdub        4.1

The differences within this cluster are inside sampling noise for most use cases. A blinded listener cannot reliably rank ElevenLabs above Cartesia at MOS 4.6 vs 4.6.
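To make "inside sampling noise" concrete, the sketch below runs a Welch's t-statistic on two hypothetical panels of 20 listeners whose means are 4.6 and 4.4. The ratings, the panel size, and the assumption that the two panels are independent are all illustrative; they are not the data behind the table above.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two independent samples of ratings."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / std_err


# Hypothetical panels: 20 listeners each, ratings on the 1-to-5 scale.
system_a = [5] * 12 + [4] * 8   # mean 4.6
system_b = [5] * 8 + [4] * 12   # mean 4.4

t_stat = welch_t(system_a, system_b)
# As a rule of thumb, |t| below roughly 2 means the gap is not
# significant at the 5% level for panels of this size.
significant = abs(t_stat) > 2.0
print(f"t = {t_stat:.2f}, significant at ~5%: {significant}")
```

Under these assumptions even a 0.2-point gap does not clear the usual significance threshold, which is why small differences between top-cluster vendors are a weak basis for ranking.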

What to rank on instead

We still report MOS on review pages because it’s the language vendors use. But we rank on three other axes:

  1. First-byte latency — 120ms (Deepgram Aura 2) vs 820ms (ElevenLabs Eleven v3). The gap that breaks AI agents.
  2. Character economics at scale — At 5M chars/month, ElevenLabs is roughly $1,500/mo; Inworld TTS-1.5 Max is ~$90/mo.
  3. Locale/dialect coverage — “We support 30 languages” hides that Korean and Cantonese quality can trail Azure by a full MOS point.
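The first two axes reduce to simple arithmetic. The sketch below normalizes the monthly figures quoted in the list to a cost per million characters and computes the latency and cost ratios. The helper name `cost_per_million_chars` is hypothetical, and the inputs are only the rough figures stated above, not a pricing model.

```python
def cost_per_million_chars(monthly_cost_usd, chars_per_month):
    """Normalize a monthly bill to US dollars per one million characters."""
    return monthly_cost_usd * 1_000_000 / chars_per_month


CHARS_PER_MONTH = 5_000_000  # usage level quoted in the list above

# Approximate monthly costs from the list above (USD).
monthly_cost = {"ElevenLabs": 1500.0, "Inworld TTS-1.5 Max": 90.0}
per_million = {
    name: cost_per_million_chars(cost, CHARS_PER_MONTH)
    for name, cost in monthly_cost.items()
}

# First-byte latency figures from the list above (milliseconds).
latency_ms = {"Deepgram Aura 2": 120, "ElevenLabs Eleven v3": 820}

cost_ratio = per_million["ElevenLabs"] / per_million["Inworld TTS-1.5 Max"]
latency_ratio = latency_ms["ElevenLabs Eleven v3"] / latency_ms["Deepgram Aura 2"]

for name, usd in per_million.items():
    print(f"{name}: ${usd:.0f} per 1M chars")
print(f"cost ratio: {cost_ratio:.1f}x, latency ratio: {latency_ratio:.1f}x")
```

Gaps of this size, roughly 17x on cost and 7x on latency, are far larger than the spread in MOS across the same vendors.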

See "The 3 axes that actually matter" for the deep dive.

Go deeper