MOS (Mean Opinion Score)

MOS (Mean Opinion Score) is a subjective 1-to-5 rating of perceived voice naturalness, averaged across 15–30 blinded listeners per sample.

MOS was the TTS industry’s primary quality metric from 2010 to 2024. Human speech typically scores 4.5–4.8. Early neural TTS scored 3.5–4.0. The race to improve MOS drove the major model releases: WaveNet (2016), Tacotron 2 (2018), ElevenLabs v1 (2023).

Why MOS plateaued in 2026

MOS plateaued in 2025. Most flagship vendors now score between 4.4 and 4.6 on our panel, a range that is statistically indistinguishable from human voiceover actors in blinded comparisons:

Vendor	MOS (our panel, 2026)
ElevenLabs Eleven v3	4.6
Cartesia Sonic 3	4.6
Play.ht PlayDialog	4.4
Murf Studio	4.4
Speechify	4.3
Descript Overdub	4.1

The differences at the top of this cluster fall inside sampling noise for most use cases. A blinded listener cannot reliably rank ElevenLabs above Cartesia when both score 4.6.
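As a rough illustration of that sampling noise, the sketch below computes a MOS and an approximate 95% confidence interval for two systems. The ratings are made-up data for 20 hypothetical listeners (inside the 15–30 range above), not our panel's results, and the interval uses a simple normal approximation that treats the 1-to-5 ratings as numeric.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for an approximate 95% confidence interval."""
    n = len(ratings)
    mean = statistics.fmean(ratings)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(n)
    return mean, z * sem


def intervals_overlap(a, b):
    """True if the two (mean, half_width) intervals overlap."""
    (mean_a, hw_a), (mean_b, hw_b) = a, b
    return abs(mean_a - mean_b) <= hw_a + hw_b


# Hypothetical ratings from 20 listeners per system (illustrative only).
system_a = [5, 5, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 4]
system_b = [5, 4, 4, 5, 4, 5, 4, 4, 5, 5, 4, 5, 4, 5, 4, 5, 5, 4, 4, 4]

result_a = mos_with_ci(system_a)
result_b = mos_with_ci(system_b)

print(f"System A: {result_a[0]:.2f} +/- {result_a[1]:.2f}")
print(f"System B: {result_b[0]:.2f} +/- {result_b[1]:.2f}")
print("Intervals overlap:", intervals_overlap(result_a, result_b))
```

With listener counts this small, each interval is roughly 0.2 points wide on either side of the mean, so a 0.2-point gap between two systems leaves the intervals overlapping. Overlap is only a coarse check; a proper comparison would use a paired or rank-based test on the same listeners.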

What to rank on instead

We still report MOS on review pages because it’s the language vendors use. But we rank on three other axes:

First-byte latency: 120 ms (Deepgram Aura 2) vs 820 ms (ElevenLabs Eleven v3). A gap this size breaks AI agents.
Character economics at scale: at 5M chars/month, ElevenLabs is roughly $1,500/mo; Inworld TTS-1.5 Max is ~$90/mo.
Locale/dialect coverage: “We support 30 languages” hides that Korean and Cantonese quality can trail Azure by a full MOS point.
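To make the character-economics comparison concrete, here is a minimal sketch that converts the approximate monthly figures quoted above into a cost per million characters. It assumes pricing scales linearly with volume, which real tiered plans may not do, and the function and variable names are illustrative.

```python
def cost_per_million_chars(monthly_cost_usd, chars_per_month):
    """Convert a monthly bill into USD per 1M characters (assumes linear pricing)."""
    return monthly_cost_usd / (chars_per_month / 1_000_000)


CHARS_PER_MONTH = 5_000_000

# Approximate monthly costs at 5M chars/month, as quoted in the text.
monthly_cost = {
    "ElevenLabs": 1500.0,
    "Inworld TTS-1.5 Max": 90.0,
}

unit_cost = {
    vendor: cost_per_million_chars(cost, CHARS_PER_MONTH)
    for vendor, cost in monthly_cost.items()
}
ratio = monthly_cost["ElevenLabs"] / monthly_cost["Inworld TTS-1.5 Max"]

for vendor, cost in unit_cost.items():
    print(f"{vendor}: ${cost:.2f} per 1M chars")
print(f"Price ratio: {ratio:.1f}x")
```

On these figures the gap is about $300 versus $18 per million characters, a difference of more than an order of magnitude that a 0.1–0.2 MOS spread does not capture.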

See “The 3 axes that actually matter” for the deep dive.

Go deeper

Choosing AI Voice Software

Full Glossary

ElevenLabs Review
