MOS, latency, character economics: the 3 axes that actually matter

Most text-to-speech comparisons still rank tools on MOS (mean opinion score), the 1-to-5 voice-naturalness rating. We stopped using it as a primary axis in 2025 because it plateaued: ElevenLabs v3, Murf, Inworld TTS-1.5 Max, and Cartesia Sonic 3 all cluster between 4.4 and 4.6, a spread small enough to sit inside the rating noise you see when listeners score human voiceover actors.

The real decision in 2026 is on three other axes. Nobody in the SERP top-10 leads with them, because doing so requires publishing data that makes some affiliate programs look worse than others.

Axis 1 — Latency

The question: How many milliseconds from sending your API request to receiving the first audio byte?

Why it matters: For batch content generation (YouTube voiceover, podcast narration), latency is irrelevant — you generate, wait, download. For interactive voice — AI phone agents, IVR systems, conversational AI — first-byte latency is the UX cliff that determines whether your product works.

2026 benchmarks (our testing, 20-run median, streaming endpoints):

Vendor	First-byte latency	Use-case implication
Deepgram Aura 2	~120ms	Best for real-time: short enough that most listeners will not notice a pause
Cartesia Sonic 3	~180ms	Excellent for AI agents and IVR
ElevenLabs Turbo v2	~295ms (warm)	Acceptable for conversational pacing
Play.ht PlayDialog	~320ms	Borderline for interactive
ElevenLabs Eleven v3	~380ms (warm)	Not interactive — use only for batch
Murf API	~480ms	Batch only — not suitable for interactive
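
A minimal sketch of how a first-byte measurement like the one above could be taken, assuming the vendor client exposes its streaming response as an iterable of byte chunks. The names `first_byte_latency_ms`, `median_first_byte_ms`, and `fake_tts_stream` are hypothetical, and `fake_tts_stream` is a stand-in for a real streaming endpoint, not any vendor's API.

```python
import statistics
import time
from typing import Callable, Iterable


def first_byte_latency_ms(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Time from issuing the request to receiving the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:  # skip empty keep-alive chunks
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")


def median_first_byte_ms(
    open_stream: Callable[[], Iterable[bytes]], runs: int = 20
) -> float:
    """Median over several runs, which damps one-off network spikes."""
    return statistics.median(first_byte_latency_ms(open_stream) for _ in range(runs))


def fake_tts_stream() -> Iterable[bytes]:
    """Hypothetical stand-in for a vendor's streaming endpoint."""
    time.sleep(0.05)  # simulated time to first byte
    yield b"\x00" * 320
    yield b"\x00" * 320


if __name__ == "__main__":
    print(f"{median_first_byte_ms(fake_tts_stream, runs=5):.0f} ms")
```

To compare vendors fairly, each real `open_stream` would send the same text from the same region, after a warm-up request where the table notes "warm".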

The no-one-says-it fact: Murf is unusable for AI agents. It is not marginally slower: on first byte it is roughly 2.7x slower than Cartesia and 4x slower than Deepgram (~480ms against ~180ms and ~120ms). If you’re building a voice agent and you pick Murf because their demo sounds good, you’ll rediscover this in production.

Branch logic for your decision: If your use case needs real-time voice (under 500ms total response, of which TTS first-byte latency is only one part), rank Cartesia Sonic 3 and Deepgram Aura 2 first, regardless of how other axes compare. Latency is a hard constraint, not a preference.

Axis 2 — Character economics at scale

The question: What does this tool actually cost when you’re processing real production volume?

Why it matters: Vendor marketing shows entry-tier pricing. The character economics at scale are buried or absent. At high volume the gap is large: in the table below, Play.ht and ElevenLabs cost roughly 5.5x and 17x what Inworld does.

The maths at 5M characters/month:

Vendor	Cost at 5M chars/mo	Per 1M chars
Inworld TTS-1.5 Max	~$90	~$18
Amazon Polly Neural	$80	$16
Azure Speech Neural	$80	$16
Google Cloud TTS Studio	$150	$30
Play.ht Pro	~$500	~$100
ElevenLabs Scale	~$1,500	~$300

ElevenLabs at 5M chars/month costs nearly 17x what Inworld does ($1,500 against $90) for comparable MOS. That’s the line the affiliates who earn commission on ElevenLabs won’t put in their comparisons.
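
The arithmetic behind the table can be sketched as follows. The per-1M rates are copied from the table above. The flat-rate model is a simplifying assumption: it ignores tier minimums, free quotas, and overage rules, so treat the output as a first estimate only. The function names are illustrative.

```python
# Per-1M-character rates in USD, copied from the table above.
RATE_PER_MILLION = {
    "Inworld TTS-1.5 Max": 18,
    "Amazon Polly Neural": 16,
    "Azure Speech Neural": 16,
    "Google Cloud TTS Studio": 30,
    "Play.ht Pro": 100,
    "ElevenLabs Scale": 300,
}


def monthly_cost(vendor: str, chars_per_month: int) -> float:
    """Flat-rate estimate: rate per 1M characters times volume in millions."""
    return RATE_PER_MILLION[vendor] * chars_per_month / 1_000_000


def cost_ratio(vendor_a: str, vendor_b: str) -> float:
    """How many times more vendor_a costs than vendor_b at the same volume."""
    return RATE_PER_MILLION[vendor_a] / RATE_PER_MILLION[vendor_b]


if __name__ == "__main__":
    volume = 5_000_000
    # Print vendors from cheapest to most expensive at this volume.
    for vendor in sorted(RATE_PER_MILLION, key=RATE_PER_MILLION.get):
        print(f"{vendor:<26} ${monthly_cost(vendor, volume):>8,.0f}")
```

Swapping `volume` for your own monthly figure gives a quick first pass before building a full spreadsheet.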

Context on when this matters: The crossover where Inworld/Polly becomes significantly cheaper than ElevenLabs starts around 500K chars/month. Below that, ElevenLabs Creator ($22/mo for 100K chars) is competitive.

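How to estimate your volume: A 10-minute YouTube video script is roughly 8,000 characters, so 1,000 characters is approximately 1.25 minutes of audio at normal speaking pace. An IVR system at 1,000 calls/day (30 seconds average) burns about 4.5M characters/month if the synthesized voice speaks for roughly a third of each call (about 150 characters per call); if it speaks for the full 30 seconds, budget closer to 12M.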
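
A rough estimator for the volume figures above. It assumes about 800 characters per minute of speech, which is the rate implied by the 8,000-character, 10-minute script. Both that default and the `spoken_seconds_per_call` parameter (how long the synthesized voice actually talks on each call) are assumptions to replace with your own measurements.

```python
def chars_for_minutes(minutes: float, chars_per_minute: float = 800.0) -> int:
    """Characters needed for a given duration of synthesized speech."""
    return round(minutes * chars_per_minute)


def monthly_ivr_chars(
    calls_per_day: int,
    spoken_seconds_per_call: float,
    chars_per_minute: float = 800.0,
    days: int = 30,
) -> int:
    """Monthly characters for an IVR, counting only synthesized speech time."""
    spoken_minutes = calls_per_day * days * spoken_seconds_per_call / 60.0
    return round(spoken_minutes * chars_per_minute)


if __name__ == "__main__":
    print(chars_for_minutes(10))            # 10-minute video script
    print(monthly_ivr_chars(1000, 30))      # voice speaks for the whole call
    print(monthly_ivr_chars(1000, 11.25))   # voice speaks ~11 s per call
```

Under these assumptions, the 4.5M characters/month figure corresponds to about 11 seconds of synthesized speech per call, while a voice that talks for the full 30 seconds lands near 12M. The share of each call that is synthesized is therefore the number worth measuring first.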

Branch logic for your decision: If you’re processing above 500K chars/month, run the cost calculation before committing. Above 2M chars/month, the economics actively favour API alternatives over subscription tiers.

Axis 3 — Locale / dialect coverage

The question: Not “how many languages?” — every vendor claims 30+. The real question: does the voice sound genuinely native to a specific dialect, or does it sound like a non-native speaker?

Why it matters: “We support Spanish” hides a meaningful gap between ElevenLabs’ Anglo-strong training data and Azure Speech’s broad multilingual corpus. A Spanish-speaking audience in Mexico City will hear the difference.

2026 honest grade by vendor and language cluster:

Vendor	English	European (ES/FR/DE)	Mandarin/Cantonese	Korean/Japanese	Hindi/Arabic
ElevenLabs	Excellent	Good	Weak	Weak	Weak
Azure Speech	Good	Excellent	Excellent	Excellent	Good
Google Cloud TTS	Good	Good	Good	Good	Good
Murf	Good	Adequate	Weak	Weak	Weak
Play.ht	Good	Good	Adequate	Adequate	Weak

What “weak” means in practice: A Korean-speaking listener can immediately identify the voice as non-native. The prosody, vowel length, and intonation patterns are off in a way that reads as foreign. For internal-use content (corporate training viewed by Korean-speaking employees), it may be acceptable. For consumer-facing content, it erodes trust.

Branch logic for your decision: If your target audience is English-speaking, this axis doesn’t constrain your choice. If you need Asian language content for a consumer audience, deprioritise ElevenLabs and Murf; Azure Speech or Google Cloud TTS should be your default.

Putting the three axes together

The decision tree:

1. Does your use case need real-time voice (IVR, AI agent, live response)?
- Yes → Cartesia Sonic 3 or Deepgram Aura 2. Stop.
- No → continue.
2. Are you processing above 500K characters/month?
- Yes → compare Inworld/Polly/Azure against ElevenLabs on a cost spreadsheet. ElevenLabs likely loses.
- No → continue.
3. Is your target language other than English (or major European)?
- Yes, Asian languages → Azure Speech or Google Cloud TTS. ElevenLabs and Murf are secondary.
- No → continue.
4. Remaining field? Now rank on MOS and voice library depth. ElevenLabs likely wins on English voice quality; Murf wins on team workflow tools.
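
The tree above can be sketched as a first-match function. This illustrates the branch order only and is not the decision wizard's implementation; the `UseCase` fields and the `recommend` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    needs_realtime: bool          # IVR, AI agent, live response
    chars_per_month: int          # expected production volume
    needs_asian_languages: bool   # consumer-facing Asian-language content


def recommend(uc: UseCase) -> str:
    """Return the advice from the first branch of the tree that matches."""
    if uc.needs_realtime:
        return "Cartesia Sonic 3 or Deepgram Aura 2"
    if uc.chars_per_month > 500_000:
        return "Cost-compare Inworld, Polly, and Azure against ElevenLabs"
    if uc.needs_asian_languages:
        return "Azure Speech or Google Cloud TTS"
    return "Rank the remaining field on MOS and voice library depth"


if __name__ == "__main__":
    print(recommend(UseCase(False, 120_000, False)))
```

Because the first matching branch wins, a high-volume Korean-language project gets the cost advice and never reaches the locale check. If you need both, collect every matching branch into a list instead of returning early.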

The decision wizard runs this logic interactively — five questions, three ranked recommendations with honest alternatives.
