MOS, latency, character economics: the 3 axes that actually matter
MOS has plateaued. The real TTS decision in 2026 is on three other axes: first-byte latency, cost per million characters at scale, and locale/dialect coverage. This is the decision framework the SERP top-10 doesn't show you.
Most text-to-speech comparisons still rank tools on MOS (mean opinion score), the 1-to-5 voice-naturalness score. We stopped using it as a primary axis in 2025, because it plateaued: ElevenLabs v3, Murf, Inworld TTS-1.5 Max, and Cartesia Sonic 3 all cluster between 4.4 and 4.6, a spread that sits inside the sampling noise of the scores human voiceover actors receive.
The real decision in 2026 is on three other axes. Nobody in the SERP top-10 leads with them, because doing so requires publishing data that makes some affiliate programs look worse than others.
Axis 1 — Latency
The question: How many milliseconds from sending your API request to receiving the first audio byte?
Why it matters: For batch content generation (YouTube voiceover, podcast narration), latency is irrelevant — you generate, wait, download. For interactive voice — AI phone agents, IVR systems, conversational AI — first-byte latency is the UX cliff that determines whether your product works.
2026 benchmarks (our testing, 20-run median, streaming endpoints):
| Vendor | First-byte latency | Use-case implication |
|---|---|---|
| Deepgram Aura 2 | ~120ms | Best for real-time — below human response perception |
| Cartesia Sonic 3 | ~180ms | Excellent for AI agents and IVR |
| ElevenLabs Turbo v2 | ~295ms (warm) | Acceptable for conversational pacing |
| Play.ht PlayDialog | ~320ms | Borderline for interactive |
| ElevenLabs Eleven v3 | ~380ms (warm) | Not interactive — use only for batch |
| Murf API | ~480ms | Batch only — not suitable for interactive |
The no-one-says-it fact: Murf is unusable for AI agents. Not marginally slower: roughly 2.7x slower than Cartesia on first byte (~480ms vs ~180ms), and 4x slower than Deepgram (~480ms vs ~120ms). If you’re building a voice agent and you pick Murf because their demo sounds good, you’ll rediscover this in production.
Branch logic for your decision: If your use case needs real-time voice (under 500ms end-to-end response, a budget that TTS first-byte latency has to share with every other stage of the pipeline), rank Cartesia Sonic 3 and Deepgram Aura 2 first, regardless of how other axes compare. Latency is a hard constraint, not a preference.
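To reproduce a first-byte measurement on your own account, a minimal sketch along these lines may help. It assumes the vendor client can be wrapped in a zero-argument callable that returns an iterator of audio chunks. `fake_stream` is a hypothetical stand-in, not any vendor's API, and the 20-run median mirrors the methodology described above.

```python
import statistics
import time
from typing import Callable, Iterable


def first_byte_latency_ms(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from issuing the request to the first non-empty audio chunk."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:  # skip empty keep-alive chunks
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")


def median_first_byte_ms(open_stream: Callable[[], Iterable[bytes]], runs: int = 20) -> float:
    """Median over several runs, which damps one-off network spikes."""
    return statistics.median(first_byte_latency_ms(open_stream) for _ in range(runs))


def fake_stream() -> Iterable[bytes]:
    """Hypothetical stand-in for a streaming endpoint: about 50 ms to the first chunk."""
    time.sleep(0.05)
    yield b"\x00" * 320
    yield b"\x00" * 320


print(f"median first byte: {median_first_byte_ms(fake_stream, runs=5):.0f} ms")
```

Swap `fake_stream` for a function that opens a real streaming request and yields its chunks. Run it from the region where your product will be hosted, since network distance is part of the number.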
Axis 2 — Character economics at scale
The question: What does this tool actually cost when you’re processing real production volume?
Why it matters: Vendor marketing shows entry-tier pricing. The character economics at scale are buried or absent. The gap between vendors is 5–16x at high volume.
The maths at 5M characters/month:
| Vendor | Cost at 5M chars/mo | Per 1M chars |
|---|---|---|
| Inworld TTS-1.5 Max | ~$90 | ~$18 |
| Amazon Polly Neural | $80 | $16 |
| Azure Speech Neural | $80 | $16 |
| Google Cloud TTS Studio | $150 | $30 |
| Play.ht Pro | ~$500 | ~$100 |
| ElevenLabs Scale | ~$1,500 | ~$300 |
ElevenLabs at 5M chars/month costs roughly 16x what Inworld does for comparable MOS. That’s the line the affiliates who earn commission on ElevenLabs won’t put in their comparisons.
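The arithmetic behind the table can be sketched as follows. The per-million rates are copied from the table (several are approximate there). The linear scaling with volume is an assumption of this sketch, since real plans have tiers, minimums, and overage rates.

```python
# Per-million-character rates in USD, copied from the table above (approximate).
RATE_PER_MILLION_USD = {
    "Inworld TTS-1.5 Max": 18,
    "Amazon Polly Neural": 16,
    "Azure Speech Neural": 16,
    "Google Cloud TTS Studio": 30,
    "Play.ht Pro": 100,
    "ElevenLabs Scale": 300,
}


def monthly_cost_usd(vendor: str, chars_per_month: int) -> float:
    """Assumes cost scales linearly with volume (no tiers, minimums, or overage)."""
    return RATE_PER_MILLION_USD[vendor] * chars_per_month / 1_000_000


def cost_ratio(vendor_a: str, vendor_b: str) -> float:
    """How many times more vendor_a costs than vendor_b at the same volume."""
    return RATE_PER_MILLION_USD[vendor_a] / RATE_PER_MILLION_USD[vendor_b]


# Print the 5M characters/month column, cheapest first.
for vendor in sorted(RATE_PER_MILLION_USD, key=RATE_PER_MILLION_USD.get):
    print(f"{vendor:<26} ${monthly_cost_usd(vendor, 5_000_000):>8,.0f}")
```

Replace the rates with the figures from your own quotes before relying on the output; the ratio between vendors is what to watch as volume grows.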
Context on when this matters: The crossover where Inworld/Polly becomes significantly cheaper than ElevenLabs starts around 500K chars/month. Below that, ElevenLabs Creator ($22/mo for 100K chars) is competitive.
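How to estimate your volume: 1,000 characters is approximately 1 to 1.25 minutes of audio at normal speaking pace. A 10-minute YouTube video script is roughly 8,000 characters. At that rate, an IVR system at 1,000 calls/day with 30 seconds of synthesized speech per call burns about 12M characters/month (500 minutes/day × 800 characters/minute × 30 days); scale that down if only part of each call is synthesized.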
Branch logic for your decision: If you’re processing above 500K chars/month, run the cost calculation before committing. Above 2M chars/month, the economics actively favour API alternatives over subscription tiers.
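A rough volume estimate and the two thresholds above can be sketched like this. The 800 characters/minute rate is derived from the 10-minute, 8,000-character script figure and is only an approximation. The function names and the 30-day month are assumptions of this sketch.

```python
# Derived from "a 10-minute script is roughly 8,000 characters"; an approximation.
CHARS_PER_MINUTE = 800


def chars_for_minutes(minutes: float, chars_per_minute: int = CHARS_PER_MINUTE) -> int:
    """Estimated characters needed to synthesize the given minutes of audio."""
    return round(minutes * chars_per_minute)


def monthly_chars(minutes_per_day: float, days: int = 30) -> int:
    """Estimated characters per month for a steady daily audio volume."""
    return chars_for_minutes(minutes_per_day * days)


def volume_bracket(chars_per_month: int) -> str:
    """Map a monthly volume onto the 500K and 2M thresholds from the text."""
    if chars_per_month > 2_000_000:
        return "above 2M: compare per-character API pricing first"
    if chars_per_month > 500_000:
        return "above 500K: run the cost calculation before committing"
    return "below 500K: subscription tiers can still be competitive"


# One 10-minute video per day is about 240,000 characters/month.
daily_video = monthly_chars(minutes_per_day=10)
print(daily_video, "->", volume_bracket(daily_video))
```

Measure a sample of your own scripts to replace the default rate; dense technical copy and conversational dialogue can differ noticeably in characters per minute.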
Axis 3 — Locale / dialect coverage
The question: Not “how many languages?” — every vendor claims 30+. The real question: does the voice sound genuinely native to a specific dialect, or does it sound like a non-native speaker?
Why it matters: “We support Spanish” hides a meaningful gap between ElevenLabs’ Anglo-strong training data and Azure Speech’s broad multilingual corpus. A Spanish-speaking audience in Mexico City will hear the difference.
2026 honest grade by vendor and language cluster:
| Vendor | English | European (ES/FR/DE) | Mandarin/Cantonese | Korean/Japanese | Hindi/Arabic |
|---|---|---|---|---|---|
| ElevenLabs | Excellent | Good | Weak | Weak | Weak |
| Azure Speech | Good | Excellent | Excellent | Excellent | Good |
| Google Cloud TTS | Good | Good | Good | Good | Good |
| Murf | Good | Adequate | Weak | Weak | Weak |
| Play.ht | Good | Good | Adequate | Adequate | Weak |
What “weak” means in practice: A Korean-speaking listener can immediately identify the voice as non-native. The prosody, vowel length, and intonation are off in a way that reads as foreign. For internal-use content (corporate training viewed by Korean-speaking employees), it may be acceptable. For consumer-facing content, it erodes trust.
Branch logic for your decision: If your target audience is English-speaking, this axis doesn’t constrain your choice. If you need Asian-language content for a consumer audience, deprioritise ElevenLabs and Murf; Azure Speech or Google Cloud TTS should be your default.
Putting the three axes together
The decision tree:
- Does your use case need real-time voice (IVR, AI agent, live response)?
  - Yes → Cartesia Sonic 3 or Deepgram Aura 2. Stop.
  - No → continue.
- Are you processing above 500K characters/month?
  - Yes → compare Inworld/Polly/Azure against ElevenLabs on a cost spreadsheet. ElevenLabs likely loses.
  - No → continue.
- Is your target language other than English (or major European)?
  - Yes, Asian dialects → Azure Speech or Google Cloud TTS. ElevenLabs and Murf are secondary.
  - No → continue.
- Remaining field? Now rank on MOS and voice library depth. ElevenLabs likely wins on English voice quality; Murf wins on team workflow tools.
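The tree above can be sketched as a single function. This is an illustration, not the wizard's actual code. `UseCase` and `recommend` are hypothetical names. It also assumes every "Yes" ends the walk, which the tree states explicitly only for the first question.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    needs_realtime: bool         # IVR, AI agent, live response
    chars_per_month: int         # expected production volume
    needs_asian_languages: bool  # target language outside English / major European


def recommend(use_case: UseCase) -> str:
    """Walk the questions in order and return the first branch that applies."""
    if use_case.needs_realtime:
        return "Cartesia Sonic 3 or Deepgram Aura 2"
    if use_case.chars_per_month > 500_000:
        return "Compare Inworld/Polly/Azure against ElevenLabs on cost"
    if use_case.needs_asian_languages:
        return "Azure Speech or Google Cloud TTS"
    return "Rank the remaining field on MOS and voice library depth"


# A batch narration workload in English at modest volume falls through to the last step.
print(recommend(UseCase(needs_realtime=False, chars_per_month=120_000, needs_asian_languages=False)))
```

The order of the checks carries the argument: latency is treated as a hard constraint, cost as the next filter, and MOS only as the tie-breaker.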
The decision wizard runs this logic interactively — five questions, three ranked recommendations with honest alternatives.