Voice Model
A voice model is the trained neural network artifact that defines the output sound of a TTS system. When ElevenLabs says “Eleven v3” or “Turbo v2,” they’re referring to different model versions with different quality, latency, and capability profiles. The voice you select (Rachel, Adam, custom clone) is a configuration on top of that underlying model.
Model versioning in practice
Most vendors ship multiple named models with different optimization targets:
ElevenLabs:
- Eleven v3 — flagship quality model. Best MOS (4.6), slowest generation (380–820ms). For pre-rendered content.
- Turbo v2 — latency-optimized. MOS 4.4, first-byte 295–420ms. For near-real-time use.
- Multilingual v2 — same quality as Eleven v3 but trained on more language data. For non-English content.
Cartesia:
- Sonic 3 — 2026 flagship. MOS 4.5, first-byte ~180ms. Optimized for streaming, not just quality.
Google Cloud TTS:
- Standard voices — older concatenative. MOS ~3.8. Cheapest.
- WaveNet voices — first-gen neural. MOS ~4.2.
- Neural2 voices — 2022+ neural. MOS ~4.4.
- Studio voices — latest, highest quality. MOS ~4.5. Most expensive.
Amazon Polly:
- Standard — older concatenative. $4/1M chars.
- Neural — 2021+ neural. $16/1M chars.
- Generative — 2024+ latest. $30/1M chars.
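To make the Polly tiers concrete, here is a minimal cost sketch. The per-million rates are the ones listed above; the `estimate_cost` helper and the 250,000-character workload are illustrative assumptions, and an actual bill may differ (free tiers, rounding, regional pricing).

```python
from decimal import Decimal

# USD per 1M characters, taken from the Amazon Polly list above.
POLLY_RATES_PER_MILLION = {
    "standard": Decimal("4"),
    "neural": Decimal("16"),
    "generative": Decimal("30"),
}


def estimate_cost(tier: str, characters: int) -> Decimal:
    """Estimate the USD cost of synthesizing `characters` on a given tier.

    Hypothetical helper: a straight linear rate with no free tier or discounts.
    """
    if characters < 0:
        raise ValueError("characters must be non-negative")
    rate = POLLY_RATES_PER_MILLION[tier]
    return rate * characters / Decimal(1_000_000)


if __name__ == "__main__":
    # Assumed workload: 250,000 characters of script.
    for tier in POLLY_RATES_PER_MILLION:
        print(f"{tier}: ${estimate_cost(tier, 250_000):.2f}")
```

At these rates the same script costs 7.5 times more on Generative than on Standard, so the tier choice matters more as volume grows.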
What a voice model controls
The underlying model determines:
- Quality ceiling — the maximum MOS achievable. A low-quality model can’t be improved by voice selection.
- Latency floor — how fast the first audio byte can arrive. Model architecture sets this.
- Language capability — which languages and dialects the model was trained on.
- Prosody characteristics — how natural the intonation, rhythm, and stress patterns are.
The voice selection (Rachel vs Adam vs custom clone) controls:
- Timbre — the characteristic tone color of the voice
- Pitch range — how high or low the voice sits
- Speaking style — some voices are trained to sound more authoritative, others more conversational
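The split above can be sketched as two layers of configuration: model-level properties that bound quality and latency, and a voice that only changes how the output sounds. The class names, fields, and `pick_model` helper are hypothetical. The MOS and latency figures are the approximate ones quoted earlier on this page (using the upper end of each range), not measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelProfile:
    name: str
    mos: float          # quality ceiling (approximate, as quoted above)
    latency_ms: int     # latency floor, upper end of the quoted range


@dataclass(frozen=True)
class SynthesisConfig:
    model: ModelProfile  # sets quality, latency, languages, prosody
    voice: str           # sets timbre, pitch range, speaking style only


# Illustrative subset of the models listed earlier.
MODELS = [
    ModelProfile("Eleven v3", 4.6, 820),
    ModelProfile("Turbo v2", 4.4, 420),
    ModelProfile("Sonic 3", 4.5, 180),
]


def pick_model(max_latency_ms: int) -> Optional[ModelProfile]:
    """Return the highest-MOS model that fits the latency budget, or None."""
    candidates = [m for m in MODELS if m.latency_ms <= max_latency_ms]
    return max(candidates, key=lambda m: m.mos, default=None)


if __name__ == "__main__":
    model = pick_model(500)
    if model is not None:
        # Changing the voice changes the sound, not the model's bounds.
        print(SynthesisConfig(model=model, voice="Rachel"))
```

Swapping `voice` never changes `mos` or `latency_ms` in this sketch, which mirrors the point that voice selection cannot raise a model's quality ceiling or lower its latency floor.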
The model version upgrade question
When a vendor releases a new model version, you typically need to opt in explicitly. Existing API calls that specify model: "eleven_monolingual_v1" won’t automatically upgrade to v3. For production applications, this is a feature, not a bug: model upgrades change voice output and require regression testing. For manual content generation, staying on the latest model is usually correct.
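One way to make the opt-in explicit in code is to pin the model identifier in a single constant, so that an upgrade is a deliberate, reviewable change. The sketch below only builds a request body and makes no network call. The `model_id` field name is an assumption about the ElevenLabs request format, and `build_tts_request` and `PINNED_MODEL` are hypothetical names; check the vendor's current API reference before relying on any of them.

```python
import json

# Pinned on purpose: editing this string is the explicit opt-in to a new model.
PINNED_MODEL = "eleven_monolingual_v1"


def build_tts_request(text: str, model_id: str = PINNED_MODEL) -> str:
    """Return a JSON request body that always names the model explicitly.

    Assumption: the vendor accepts the model under a "model_id" key.
    """
    if not model_id:
        raise ValueError("model_id must be set; do not rely on a vendor default")
    return json.dumps({"text": text, "model_id": model_id})


if __name__ == "__main__":
    print(build_tts_request("Hello from a pinned model."))
```

Keeping the identifier in one place also gives regression tests a single value to vary when comparing output from the old and new models.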
Which model should I use?
Choose by use case, not marketing:
| Use case | Best model choice |
|---|---|
| YouTube / podcast narration (English) | ElevenLabs Eleven v3 |
| Real-time AI agent | Cartesia Sonic 3 (flow-matching, ~180ms first byte) |
| Multilingual content (Asian languages) | Azure Speech Neural (language depth) |
| Cost-sensitive high-volume | Inworld TTS-1.5 Max |
| API integration / predictable pricing | Amazon Polly Neural |
Related concepts
- Neural TTS — the architecture category these models belong to
- Voice cloning — a voice applied to an underlying model
- Prosody — what model quality primarily affects
- SSML — model-dependent SSML feature support
See also
- ElevenLabs review — most detailed model comparison in the market
- Choosing AI voice software