Stage 1 · 9 min read · Last reviewed 2026-05-19

MOS, latency, character economics: the 3 axes that actually matter

MOS has plateaued. The real TTS decision in 2026 is on three other axes: first-byte latency, cost per million characters at scale, and locale/dialect coverage. This is the decision framework the top-10 search results don't show you.

By Max Yao · FTC: We earn commissions from some links. Disclosure.
creator · developer · enterprise

Most text-to-speech comparisons still rank tools on MOS (mean opinion score), the 1-to-5 voice-naturalness rating. We stopped using it as a primary axis in 2025 because it plateaued: ElevenLabs v3, Murf, Inworld TTS-1.5 Max, and Cartesia Sonic 3 all cluster between 4.4 and 4.6, a spread that sits inside the sampling noise you get when rating human voiceover actors.

The real decision in 2026 is on three other axes. Nobody in the top-10 search results leads with them, because doing so means publishing data that makes some affiliate programs look worse than others.

Axis 1 — Latency

The question: How many milliseconds from sending your API request to receiving the first audio byte?

Why it matters: For batch content generation (YouTube voiceover, podcast narration), latency is irrelevant — you generate, wait, download. For interactive voice — AI phone agents, IVR systems, conversational AI — first-byte latency is the UX cliff that determines whether your product works.

2026 benchmarks (our testing, 20-run median, streaming endpoints):

| Vendor | First-byte latency | Use-case implication |
| --- | --- | --- |
| Deepgram Aura 2 | ~120ms | Best for real-time; below human response perception |
| Cartesia Sonic 3 | ~180ms | Excellent for AI agents and IVR |
| ElevenLabs Turbo v2 | ~295ms (warm) | Acceptable for conversational pacing |
| Play.ht PlayDialog | ~320ms | Borderline for interactive |
| ElevenLabs Eleven v3 | ~380ms (warm) | Not interactive; use only for batch |
| Murf API | ~480ms | Batch only; not suitable for interactive |

The no-one-says-it fact: Murf is unusable for AI agents. Not marginally slower: on first byte it is roughly 2.7x slower than Cartesia (480ms against 180ms) and 4x slower than Deepgram (480ms against 120ms). If you’re building a voice agent and you pick Murf because their demo sounds good, you’ll rediscover this in production.

Branch logic for your decision: If your use case needs real-time voice (under 500ms end-to-end response, of which TTS first byte is only one component), rank Cartesia Sonic 3 and Deepgram Aura 2 first, regardless of how other axes compare. Latency is a hard constraint, not a preference.
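A minimal sketch of how a first-byte figure like the ones in the table could be measured, assuming the vendor's streaming client can be wrapped in a function that returns an iterator of audio chunks. `open_stream` and `fake_stream` are hypothetical names, not any vendor's API, and this is not necessarily the harness behind the published numbers.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def first_byte_latency_ms(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from issuing the request to the first non-empty audio chunk."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:  # ignore empty keep-alive chunks
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")


def median_first_byte_ms(open_stream: Callable[[], Iterable[bytes]], runs: int = 20) -> float:
    """Median over several runs, which damps one-off network spikes."""
    return statistics.median(first_byte_latency_ms(open_stream) for _ in range(runs))


def fake_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a vendor streaming call (assumption, not a real API)."""
    time.sleep(delay_s)  # simulated time until the first audio chunk
    yield b"\x00\x01"
    yield b"\x02\x03"


if __name__ == "__main__":
    print(f"median first byte: {median_first_byte_ms(fake_stream, runs=5):.0f} ms")
```

To measure a real vendor, replace `fake_stream` with a function that opens that vendor's streaming endpoint and yields its audio chunks. Keep the connection state (cold or warm) the same across runs so the medians are comparable.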

Axis 2 — Character economics at scale

The question: What does this tool actually cost when you’re processing real production volume?

Why it matters: Vendor marketing shows entry-tier pricing. The character economics at scale are buried or absent. At high volume, the gap between vendors runs from roughly 5x to more than 16x.

The maths at 5M characters/month:

| Vendor | Cost at 5M chars/mo | Per 1M chars |
| --- | --- | --- |
| Inworld TTS-1.5 Max | ~$90 | ~$18 |
| Amazon Polly Neural | $80 | $16 |
| Azure Speech Neural | $80 | $16 |
| Google Cloud TTS Studio | $150 | $30 |
| Play.ht Pro | ~$500 | ~$100 |
| ElevenLabs Scale | ~$1,500 | ~$300 |

ElevenLabs at 5M chars/month costs roughly 17x what Inworld does (about $1,500 against about $90) for comparable MOS. That’s the line the affiliates who earn commission on ElevenLabs won’t put in their comparisons.
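The per-million figures and the vendor-to-vendor multiples follow directly from the table. A short sketch of that arithmetic, using the table's approximate monthly costs as inputs (the function names are illustrative only):

```python
# Approximate monthly cost in USD at 5M characters/month, copied from the table above.
MONTHLY_COST_USD = {
    "Inworld TTS-1.5 Max": 90,
    "Amazon Polly Neural": 80,
    "Azure Speech Neural": 80,
    "Google Cloud TTS Studio": 150,
    "Play.ht Pro": 500,
    "ElevenLabs Scale": 1500,
}
VOLUME_MILLION_CHARS = 5


def cost_per_million(vendor: str) -> float:
    """Effective USD per 1M characters at the 5M/month volume."""
    return MONTHLY_COST_USD[vendor] / VOLUME_MILLION_CHARS


def cost_multiple(vendor: str, baseline: str) -> float:
    """How many times more `vendor` costs than `baseline` at the same volume."""
    return MONTHLY_COST_USD[vendor] / MONTHLY_COST_USD[baseline]


if __name__ == "__main__":
    for vendor in sorted(MONTHLY_COST_USD, key=MONTHLY_COST_USD.get):
        multiple = cost_multiple(vendor, "Inworld TTS-1.5 Max")
        print(f"{vendor:<26} ${cost_per_million(vendor):>6.0f}/1M  {multiple:5.1f}x Inworld")
```

Swapping in your own quoted prices and monthly volume gives the same comparison for your workload.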

Context on when this matters: The crossover where Inworld/Polly becomes significantly cheaper than ElevenLabs starts around 500K chars/month. Below that, ElevenLabs Creator ($22/mo for 100K chars) is competitive.

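How to estimate your volume: 1,000 characters is approximately 1.25 minutes of audio at normal speaking pace, so a 10-minute YouTube video script is roughly 8,000 characters. An IVR system at 1,000 calls/day burns about 4.5M characters/month if the synthesized voice speaks around 150 characters (roughly 11 seconds) per call; if it talks for the full 30-second average call, budget closer to 12M.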
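A rough estimator for the volume arithmetic, as a sketch under stated assumptions. The default of 800 characters per minute is derived from the 10-minute, 8,000-character script figure. The 150 characters of synthesized speech per call is an assumed value, chosen because it reproduces the 4.5M/month IVR figure. Measure your own scripts before budgeting.

```python
CHARS_PER_MINUTE = 800  # assumption: implied by "10-minute script is roughly 8,000 characters"


def chars_for_audio_minutes(minutes: float, chars_per_minute: int = CHARS_PER_MINUTE) -> int:
    """Characters needed to produce `minutes` of narration."""
    return round(minutes * chars_per_minute)


def monthly_chars(calls_per_day: int, chars_per_call: int, days: int = 30) -> int:
    """Characters synthesized per month by a call-based system."""
    return calls_per_day * chars_per_call * days


if __name__ == "__main__":
    print(chars_for_audio_minutes(10))  # one 10-minute video script
    print(monthly_chars(1_000, 150))    # IVR: 1,000 calls/day, 150 chars spoken per call (assumed)
    # Same IVR if the voice speaks for a full 30 seconds on every call:
    print(monthly_chars(1_000, chars_for_audio_minutes(0.5)))
```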

Branch logic for your decision: If you’re processing above 500K chars/month, run the cost calculation before committing. Above 2M chars/month, the economics actively favour API alternatives over subscription tiers.

Axis 3 — Locale / dialect coverage

The question: Not “how many languages?” — every vendor claims 30+. The real question: does the voice sound genuinely native to a specific dialect, or does it sound like a non-native speaker?

Why it matters: “We support Spanish” hides a meaningful gap between ElevenLabs’ Anglo-strong training data and Azure Speech’s broad multilingual corpus. A Spanish-speaking audience in Mexico City will hear the difference.

2026 honest grade by vendor and language cluster:

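| Vendor | English | European (ES/FR/DE) | Mandarin/Cantonese | Korean/Japanese | Hindi/Arabic |
| --- | --- | --- | --- | --- | --- |
| ElevenLabs | Excellent | Good | Weak | Weak | Weak |
| Azure Speech | Good | Excellent | Excellent | Excellent | Good |
| Google Cloud TTS | Good | Good | Good | Good | Good |
| Murf | Good | Adequate | Weak | Weak | Weak |
| Play.ht | Good | Good | Adequate | Adequate | Weak |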

What “weak” means in practice: A Korean-speaking listener can immediately identify the voice as non-native. The prosody, vowel length, and intonation patterns are off in a way that reads as foreign. For internal-use content (corporate training viewed by Korean-speaking employees), it may be acceptable. For consumer-facing content, it erodes trust.

Branch logic for your decision: If your target audience is English-speaking, this axis doesn’t constrain your choice. If you need Asian language content for a consumer audience, deprioritise ElevenLabs and Murf; Azure Speech or Google Cloud TTS should be your default.
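The grade table can be treated as a lookup. A small sketch that shortlists vendors meeting a minimum grade for one language cluster; the grades are copied from the table, while the ordering Weak < Adequate < Good < Excellent and the function name are assumptions made for illustration.

```python
GRADE_RANK = {"Weak": 0, "Adequate": 1, "Good": 2, "Excellent": 3}

CLUSTERS = ["English", "European", "Mandarin/Cantonese", "Korean/Japanese", "Hindi/Arabic"]

# Rows copied from the grade table above, one grade per cluster in CLUSTERS order.
GRADES = {
    "ElevenLabs": ["Excellent", "Good", "Weak", "Weak", "Weak"],
    "Azure Speech": ["Good", "Excellent", "Excellent", "Excellent", "Good"],
    "Google Cloud TTS": ["Good", "Good", "Good", "Good", "Good"],
    "Murf": ["Good", "Adequate", "Weak", "Weak", "Weak"],
    "Play.ht": ["Good", "Good", "Adequate", "Adequate", "Weak"],
}


def vendors_meeting(cluster: str, minimum: str = "Good") -> list[str]:
    """Vendors graded at least `minimum` for `cluster`, best grade first."""
    col = CLUSTERS.index(cluster)
    floor = GRADE_RANK[minimum]
    passing = [v for v, row in GRADES.items() if GRADE_RANK[row[col]] >= floor]
    # sorted() is stable, so vendors with equal grades keep table order.
    return sorted(passing, key=lambda v: GRADE_RANK[GRADES[v][col]], reverse=True)


if __name__ == "__main__":
    print(vendors_meeting("Korean/Japanese", "Good"))
```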

Putting the three axes together

The decision tree:

  1. Does your use case need real-time voice (IVR, AI agent, live response)?

    • Yes → Cartesia Sonic 3 or Deepgram Aura 2. Stop.
    • No → continue.
  2. Are you processing above 500K characters/month?

    • Yes → compare Inworld/Polly/Azure against ElevenLabs on a cost spreadsheet. ElevenLabs likely loses.
    • No → continue.
  3. Is your target language other than English (or major European)?

    • Yes, Asian dialects → Azure Speech or Google Cloud TTS. ElevenLabs and Murf are secondary.
    • No → continue.
  4. Remaining field? Now rank on MOS and voice library depth. ElevenLabs likely wins on English voice quality; Murf wins on team workflow tools.

The decision wizard runs this logic interactively — five questions, three ranked recommendations with honest alternatives.
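The four-step tree above can also be sketched as a single function. This is an illustration of the logic as written, not the wizard's actual implementation. The parameter names, the cluster labels, and the choice to return at the first matching branch are assumptions.

```python
def recommend(needs_real_time: bool, chars_per_month: int, language_cluster: str) -> str:
    """Walk the three hard-constraint axes in order, then fall back to MOS."""
    # Step 1: latency is a hard constraint for IVR, AI agents, and live response.
    if needs_real_time:
        return "Cartesia Sonic 3 or Deepgram Aura 2"
    # Step 2: above 500K characters/month, cost dominates the decision.
    if chars_per_month > 500_000:
        return "Compare Inworld, Polly, and Azure against ElevenLabs on cost"
    # Step 3: outside English and major European languages, coverage dominates.
    if language_cluster.lower() not in {"english", "european"}:
        return "Azure Speech or Google Cloud TTS"
    # Step 4: only now do MOS and voice library depth decide.
    return "Rank the remaining field on MOS and voice library depth"


if __name__ == "__main__":
    print(recommend(True, 50_000, "english"))
    print(recommend(False, 5_000_000, "english"))
    print(recommend(False, 100_000, "korean/japanese"))
    print(recommend(False, 100_000, "english"))
```

The order of the checks carries the argument: each earlier axis can rule a vendor out entirely, so quality rankings only apply to whatever survives the first three.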

Go deeper