Voice Model

A voice model is the trained neural network artifact that defines the output sound of a TTS system. When ElevenLabs says “Eleven v3” or “Turbo v2,” they’re referring to different model versions with different quality, latency, and capability profiles. The voice you select (Rachel, Adam, custom clone) is a configuration on top of that underlying model.

Model versioning in practice

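Most vendors ship multiple named models with different optimization targets. In the listings below, MOS is the mean opinion score, a 1 to 5 listener rating of perceived quality: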

ElevenLabs:

Eleven v3 — flagship quality model. Best MOS (4.6), slowest generation (380–820ms). For pre-rendered content.
Turbo v2 — latency-optimized. MOS 4.4, first-byte 295–420ms. For near-real-time use.
Multilingual v2 — same quality as Eleven v3 but trained on more language data. For non-English content.

Cartesia:

Sonic 3 — 2026 flagship. MOS 4.5, first-byte ~180ms. Optimized for streaming, not just quality.

Google Cloud TTS:

Standard voices — older concatenative. MOS ~3.8. Cheapest.
WaveNet voices — first-gen neural. MOS ~4.2.
Neural2 voices — 2022+ neural. MOS ~4.4.
Studio voices — latest, highest quality. MOS ~4.5. Most expensive.

Amazon Polly:

Standard — older concatenative. $4/1M chars.
Neural — 2021+ neural. $16/1M chars.
Generative — 2024+ latest. $30/1M chars.
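The per-character prices above translate directly into a budget estimate. The following is a minimal sketch using only the three Polly prices listed; the function name and the 250,000-character workload are illustrative assumptions, and real invoices may include free tiers or other adjustments not modeled here.

```python
# Prices in USD per 1 million characters, as listed above for Amazon Polly.
PRICE_PER_MILLION_USD = {
    "standard": 4.00,
    "neural": 16.00,
    "generative": 30.00,
}


def estimate_cost_usd(tier: str, characters: int) -> float:
    """Return the estimated synthesis cost in USD for a character count."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return PRICE_PER_MILLION_USD[tier] * characters / 1_000_000


if __name__ == "__main__":
    # 250,000 characters is an illustrative workload, not a figure from the text.
    for tier in PRICE_PER_MILLION_USD:
        print(f"{tier}: ${estimate_cost_usd(tier, 250_000):.2f}")
```

At the same volume, the generative tier costs 7.5 times the standard tier, which is why the tier choice matters more as volume grows.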

What a voice model controls

The underlying model determines:

Quality ceiling — the maximum MOS achievable. A low-quality model can’t be improved by voice selection.
Latency floor — how fast the first audio byte can arrive. Model architecture sets this.
Language capability — which languages and dialects the model was trained on.
Prosody characteristics — how natural the intonation, rhythm, and stress patterns are.

The voice selection (Rachel vs Adam vs custom clone) controls:

Timbre — the characteristic tone color of the voice
Pitch range — how high or low the voice sits
Speaking style — some voices are trained to sound more authoritative, others more conversational
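One way to picture the split between the two lists is as two layers of configuration. The sketch below is only an illustration: the class and field names are hypothetical, the language set is an assumed placeholder, and no vendor exposes exactly this structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Properties fixed by the underlying model."""
    name: str
    mos: float            # quality ceiling
    first_byte_ms: int    # latency floor
    languages: frozenset  # language capability


@dataclass(frozen=True)
class VoiceSettings:
    """Properties chosen per voice, layered on top of a model."""
    voice_name: str
    style: str            # e.g. "conversational" or "authoritative"


@dataclass(frozen=True)
class SynthesisConfig:
    model: ModelProfile
    voice: VoiceSettings

    def with_voice(self, voice: VoiceSettings) -> "SynthesisConfig":
        """Swap the voice while leaving the model layer untouched."""
        return SynthesisConfig(self.model, voice)


# MOS and lower-bound first-byte figures taken from the Turbo v2 entry above;
# the language set is an assumption for illustration.
turbo = ModelProfile("Turbo v2", 4.4, 295, frozenset({"en"}))
config = SynthesisConfig(turbo, VoiceSettings("Rachel", "conversational"))
swapped = config.with_voice(VoiceSettings("Adam", "authoritative"))
```

Swapping Rachel for Adam changes the voice layer only; the MOS and latency figures stay attached to the model.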

The model version upgrade question

When a vendor releases a new model version, you typically need to opt in explicitly. Existing API calls that specify model: "eleven_monolingual_v1" won’t automatically upgrade to v3. For production applications, this is a feature, not a bug: model upgrades change voice output and require regression testing. For manual content generation, staying on the latest model is usually correct.
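In application code, explicit opt-in can be enforced by pinning the model identifier and rejecting anything that has not been through regression testing. The sketch below assumes a hypothetical request shape (the "model", "voice", and "text" keys) and a hypothetical allow-list; it does not call any vendor API.

```python
# Model identifiers that have passed regression testing in this application.
# "eleven_monolingual_v1" is the identifier quoted in the text; the allow-list
# itself is a hypothetical convention, not a vendor feature.
APPROVED_MODELS = {"eleven_monolingual_v1"}


def build_request(text: str, voice: str, model: str) -> dict:
    """Build a request body with an explicit, approved model identifier.

    The key names are assumptions for illustration, not a real vendor schema.
    """
    if model not in APPROVED_MODELS:
        raise ValueError(
            f"model {model!r} is not approved; run regression tests "
            "and add it to APPROVED_MODELS first"
        )
    return {"model": model, "voice": voice, "text": text}


request = build_request("Hello there.", "Rachel", "eleven_monolingual_v1")
```

Upgrading then becomes a deliberate, reviewable change: a new identifier is added to the allow-list only after its output has been compared against the current one.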

Which model should I use?

Choose by use case, not marketing:

Use case	Best model choice
YouTube / podcast narration (English)	ElevenLabs Eleven v3
Real-time AI agent	Cartesia Sonic 3 (flow-matching, 180ms)
Multilingual content (Asian)	Azure Speech Neural (language depth)
Cost-sensitive high-volume	Inworld TTS-1.5 Max
API integration / predictable pricing	Amazon Polly Neural
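Choosing by use case often reduces to a constraint: pick the highest-quality model that fits the latency budget. The sketch below uses the MOS and latency figures quoted earlier for three of the models; taking the upper end of each quoted range as the latency, and treating Eleven v3's generation time as comparable to first-byte time, are simplifying assumptions.

```python
from typing import List, NamedTuple, Optional


class Model(NamedTuple):
    name: str
    mos: float
    latency_ms: int  # upper end of the range quoted above (an assumption)


MODELS = [
    Model("ElevenLabs Eleven v3", 4.6, 820),
    Model("ElevenLabs Turbo v2", 4.4, 420),
    Model("Cartesia Sonic 3", 4.5, 180),
]


def best_within_budget(models: List[Model], max_latency_ms: int) -> Optional[Model]:
    """Return the highest-MOS model within the latency budget, or None."""
    eligible = [m for m in models if m.latency_ms <= max_latency_ms]
    return max(eligible, key=lambda m: m.mos, default=None)


realtime_choice = best_within_budget(MODELS, 300)     # tight budget, e.g. a live agent
prerendered_choice = best_within_budget(MODELS, 1000)  # loose budget, e.g. narration
```

With these figures, a 300 ms budget leaves only Sonic 3, while a 1,000 ms budget admits all three and selects Eleven v3 on quality, which matches the first two rows of the table.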

See also

Neural TTS — the architecture category these models belong to
Voice cloning — a voice applied to an underlying model
Prosody — what model quality primarily affects
SSML — model-dependent SSML feature support
