Neural TTS

Neural TTS (Neural Text-to-Speech)

Neural TTS is the technology that powers modern AI voice generators. It uses deep neural networks to synthesize human-like speech from text, as opposed to the older “concatenative” approach that stitched together pre-recorded audio clips.

The difference matters in practice: older concatenative TTS (think 2010-era GPS voices or the original Siri) sounds robotic because the joins between audio clips are audible. (Stephen Hawking’s speech synthesizer, often grouped with these, used the even older formant synthesis rather than concatenation.) Neural TTS generates audio continuously, with natural transitions, so any remaining robotic quality comes from errors in the model, not from clip boundaries.

A brief history

2016 – WaveNet (DeepMind/Google): The first widely publicized neural system to substantially narrow the gap to human VO quality in controlled listening tests. WaveNet uses autoregressive generation: it predicts each audio sample from the previous samples. The quality was revolutionary, but generation was far too slow for real-time use (on the order of tens of milliseconds of audio per second of compute).
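To make the speed problem concrete, here is a minimal sketch of autoregressive generation. It is not WaveNet: `toy_predictor` is a hypothetical two-tap linear stand-in for the neural network, and the 16 kHz sample rate is an assumption used only for the arithmetic. What it shows is the loop structure, where each sample has to wait for the one before it.

```python
from typing import Callable, List

SAMPLE_RATE_HZ = 16_000  # assumed sample rate, for illustration only


def generate_autoregressive(
    predict_next: Callable[[List[float]], float],
    seed: List[float],
    num_samples: int,
) -> List[float]:
    """Generate samples one at a time; each call sees all earlier samples."""
    samples = list(seed)
    for _ in range(num_samples):
        # The next sample cannot be computed until the previous one exists,
        # so this loop cannot be parallelized across time steps.
        samples.append(predict_next(samples))
    return samples[len(seed):]


def toy_predictor(history: List[float]) -> float:
    """Hypothetical stand-in for a neural network: a fixed linear predictor."""
    return 0.5 * history[-1] + 0.25 * history[-2]


audio = generate_autoregressive(toy_predictor, seed=[0.0, 1.0], num_samples=4)
print(audio)  # [0.5, 0.5, 0.375, 0.3125]

# One second of audio needs one sequential model call per sample.
sequential_calls_per_second = SAMPLE_RATE_HZ * 1
print(sequential_calls_per_second)  # 16000
```

With a real network in place of `toy_predictor`, every one of those sequential calls is a full forward pass, which is why sample-level autoregression struggled to keep up with playback.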

2018 – Tacotron 2 (Google): Pairs a sequence-to-sequence model, which converts text to a mel spectrogram, with a WaveNet vocoder that turns the spectrogram into audio. Faster, and still high quality. Google Cloud TTS still uses Tacotron-family models for many voices.

2021-2023 — ElevenLabs, Murf, Play.ht: Commercial products built on transformer-based TTS architectures. ElevenLabs’ models are proprietary but described in their research as transformer-based, trained on massive multilingual corpora.

2024-2026 – Diffusion models and flow-matching: Cartesia Sonic 3 and the newer Inworld TTS-1.5 Max use diffusion-based approaches (or flow-matching variants) that enable sub-200 ms streaming latency, something earlier autoregressive models couldn’t achieve.

How neural TTS works (simplified pipeline)

1. Text normalization: Convert raw text to a normalized form by expanding abbreviations, converting numbers to words, and handling punctuation.
2. Phoneme conversion: Map normalized text to phoneme sequences (the basic sound units of the language).
3. Acoustic model: Convert phoneme sequences to a mel spectrogram, a frequency-vs-time representation of the audio.
4. Vocoder: Convert the mel spectrogram to an actual audio waveform. HiFi-GAN, BigVGAN, and similar vocoders are used here.
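The first two stages of the pipeline above can be sketched in a few lines. This is a toy under stated assumptions, not how any production engine does it: `ABBREVIATIONS`, `DIGIT_WORDS`, and `TOY_LEXICON` are hypothetical lookup tables, numbers are read digit by digit (a real normalizer would say "forty-two"), and the phoneme entries are ARPAbet-style examples. Real systems typically use a large pronunciation dictionary plus a trained grapheme-to-phoneme model.

```python
import re
from typing import Dict, List

# Hypothetical, tiny lookup tables for illustration only.
ABBREVIATIONS: Dict[str, str] = {"dr.": "doctor", "st.": "street"}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
TOY_LEXICON: Dict[str, List[str]] = {
    "doctor": ["D", "AA1", "K", "T", "ER0"],
    "four": ["F", "AO1", "R"],
    "two": ["T", "UW1"],
}


def normalize(text: str) -> str:
    """Stage 1: expand abbreviations, spell out digits, strip punctuation."""
    words: List[str] = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"[0-9]+", token):
            # Simplification: read each digit separately.
            words.extend(DIGIT_WORDS[int(d)] for d in token)
        else:
            words.append(re.sub(r"[^a-z']", "", token))
    return " ".join(w for w in words if w)


def to_phonemes(normalized: str) -> List[List[str]]:
    """Stage 2: look each word up; unknown words map to a placeholder."""
    return [TOY_LEXICON.get(word, ["<unk>"]) for word in normalized.split()]


normalized = normalize("Dr. Smith lives at 42 Main St.")
print(normalized)  # doctor smith lives at four two main street
print(to_phonemes("doctor four two"))
```

The acoustic model and vocoder stages are neural networks and are not sketched here; their input would be the phoneme sequences this code produces.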

In modern end-to-end systems (like ElevenLabs v3), steps 2 through 4 (phoneme conversion, acoustic model, and vocoder) are merged into a single model that goes directly from text to audio, with no explicit intermediate representation.

MOS and neural TTS

MOS (Mean Opinion Score) measures the perceived quality of the output audio: listeners rate samples on a 1–5 scale and the ratings are averaged. Neural TTS crossed 4.0 MOS (above “good, but clearly synthetic”) with WaveNet in 2016, and plateaued at 4.4–4.6 (statistically indistinguishable from human VO in blinded tests) by 2025. This plateau means voice quality is no longer the primary differentiator between vendors; latency, cost, and locale coverage are.
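As a rough illustration of how a MOS figure is produced, the sketch below averages 1–5 listener ratings and attaches a 95% confidence half-width using a normal approximation. The function name and the sample ratings are made up for this example; formal evaluations follow stricter protocols for listener selection and test design.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_opinion_score(ratings: Sequence[int]) -> Tuple[float, float]:
    """Return (MOS, 95% confidence half-width) for 1-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * std_err  # normal approximation


# Made-up ratings from ten listeners for one synthesized clip.
ratings = [5, 4, 5, 4, 4, 5, 3, 5, 4, 5]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")  # MOS = 4.40 +/- 0.43
```

With only ten listeners the interval is wide, which is one reason small differences between vendors in the 4.4–4.6 range are hard to tell apart.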

Latency characteristics

Architecture	Typical first-byte latency	Real-time capable?
Autoregressive (old)	1000+ ms	No
Transformer-based	300–800 ms	Barely
Diffusion / flow-matching	70–200 ms	Yes

The latency gap is why Cartesia Sonic 3 and Deepgram Aura 2 are a better fit for AI agents and IVR: they use architectures optimized for streaming, while ElevenLabs and Murf use architectures optimized for quality.
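If you want to check first-byte latency yourself rather than rely on vendor figures, a measurement along these lines is one option. This is a sketch under assumptions: `measure_stream` and `simulated_tts_stream` are hypothetical names, the audio is assumed to be 16-bit mono PCM, and the simulated stream stands in for whatever streaming client a real vendor SDK provides.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    chunks: Iterable[bytes],
    sample_rate_hz: int,
    bytes_per_sample: int = 2,  # assumes 16-bit mono PCM
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (first_chunk_latency_s, real_time_factor) for an audio stream."""
    start = clock()
    first_chunk_at = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()
        total_bytes += len(chunk)
    end = clock()
    if first_chunk_at is None or total_bytes == 0:
        raise ValueError("stream produced no audio")
    audio_seconds = total_bytes / (sample_rate_hz * bytes_per_sample)
    # Real-time factor below 1.0 means audio arrives faster than it plays.
    return first_chunk_at - start, (end - start) / audio_seconds


def simulated_tts_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for a vendor's streaming response."""
    time.sleep(0.05)            # pretend the model needs 50 ms to start
    yield b"\x00" * 3200        # 100 ms of 16 kHz 16-bit audio
    time.sleep(0.02)
    yield b"\x00" * 3200


latency_s, rtf = measure_stream(simulated_tts_stream(), sample_rate_hz=16_000)
print(f"first chunk after {latency_s * 1000:.0f} ms, real-time factor {rtf:.2f}")
```

The `clock` parameter exists so the timing logic can be exercised with a fake clock; against a real service you would leave it at the default and include network time in what you measure.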

See also

Prosody: neural TTS has made prosody the primary remaining quality dimension
SSML: markup language that neural TTS engines interpret
Voice cloning: built on neural TTS architectures
Voice model: the trained artifact that produces a specific speaker style
