Neural TTS (Neural Text-to-Speech)
Neural TTS is the technology that powers modern AI voice generators. It uses deep neural networks to synthesize human-like speech from text, as opposed to the older “concatenative” approach that stitched together pre-recorded audio clips.
The difference matters in practice: older concatenative TTS (think 2010-era GPS voices or the original Siri) sounds robotic because the joins between audio clips are audible. (Stephen Hawking’s speech synthesizer, a common point of comparison, used rule-based formant synthesis rather than stitched clips, but it belongs to the same pre-neural era.) Neural TTS generates audio continuously, with natural transitions; when it still sounds robotic, the cause is model error, not clip boundaries.
A brief history
2016 — WaveNet (DeepMind/Google): The first publicly demonstrated neural TTS system to approach human VO quality in controlled listening tests. WaveNet uses autoregressive generation: it predicts each audio sample from the samples before it. The quality was revolutionary, but generation was far too slow for real-time use (DeepMind later reported roughly 0.02 seconds of audio per second of compute for the original model).
2018 — Tacotron 2 (Google): Paired a sequence-to-sequence model, which converts text to a mel spectrogram, with a WaveNet-based vocoder that turns the spectrogram into audio. Faster and still high quality. Google Cloud TTS still uses Tacotron-family models for many voices.
2021-2023 — ElevenLabs, Murf, Play.ht: Commercial products built on transformer-based TTS architectures. ElevenLabs’ models are proprietary but described in their research as transformer-based, trained on massive multilingual corpora.
2024-2026 — Diffusion models and flow-matching: Cartesia Sonic 3 and the newer Inworld TTS-1.5 Max use diffusion-based approaches (or flow-matching variants) that enable sub-200ms streaming latency, which earlier autoregressive systems could not reach.
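To make the autoregressive idea from the WaveNet entry concrete, here is a minimal sketch in which each sample is computed from the samples before it. The predictor is a toy two-tap recurrence that traces a sine wave, not a neural network, and `SAMPLE_RATE`, `TONE_HZ`, and the function names are illustrative assumptions.

```python
import math

SAMPLE_RATE = 16_000  # assumed sample rate in Hz, for illustration only
TONE_HZ = 220.0       # hypothetical tone that the toy "model" reproduces


def predict_next(history, omega):
    """Toy stand-in for a trained network: derive the next sample
    from the two most recent samples."""
    return 2.0 * math.cos(omega) * history[-1] - history[-2]


def generate(num_samples):
    """Generate audio one sample at a time, feeding each output back in."""
    omega = 2.0 * math.pi * TONE_HZ / SAMPLE_RATE
    samples = [0.0, math.sin(omega)]  # seed context for the recurrence
    while len(samples) < num_samples:
        samples.append(predict_next(samples, omega))
    return samples[:num_samples]


def sequential_steps(audio_seconds, sample_rate=SAMPLE_RATE):
    """Number of one-after-another predictions needed for the given duration."""
    return int(audio_seconds * sample_rate)
```

Because every step needs the previous output, the loop cannot be parallelized across time. At the assumed 16 kHz rate, one second of audio takes 16,000 sequential predictions, which is the root of the speed problem described above.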
How neural TTS works (simplified pipeline)
1. Text normalization: Convert raw text to a normalized form by expanding abbreviations, converting numbers to words, and handling punctuation.
2. Phoneme conversion: Map normalized text to phoneme sequences (the basic sound units of the language).
3. Acoustic model: Convert phoneme sequences to a mel spectrogram, a frequency-vs-time representation of the audio.
4. Vocoder: Convert the mel spectrogram to an audio waveform. HiFi-GAN, BigVGAN, and similar vocoders are used here.
In modern end-to-end systems (like ElevenLabs v3), steps 2 through 4 (phoneme conversion, acoustic model, and vocoder) are merged into a single model that goes directly from text to audio, with no explicit intermediate representation.
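The staged pipeline above can be sketched as four functions chained together. This is only a toy showing the data flow and shapes: the lookup tables, constants, and function names are hypothetical, and the acoustic model and vocoder are stubs that emit silence where a real system would run trained networks.

```python
import re

# Hypothetical lookup tables; real systems use large lexicons and learned models.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
TOY_LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "has": ["HH", "AE", "Z"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}
N_MELS = 80             # assumed number of mel bins per frame
FRAMES_PER_PHONEME = 5  # assumed fixed duration; real models predict it
HOP_LENGTH = 256        # assumed audio samples per spectrogram frame


def normalize(text):
    """Text normalization: expand abbreviations, spell out digits, drop punctuation."""
    words = []
    for token in text.lower().split():
        bare = token.strip(".,;:!?\"()")
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif bare.isascii() and bare.isdigit():
            # Spell numbers digit by digit to keep the sketch short.
            words.extend(DIGIT_WORDS[int(d)] for d in bare)
        else:
            cleaned = re.sub(r"[^a-z']", "", bare)
            if cleaned:
                words.append(cleaned)
    return " ".join(words)


def to_phonemes(normalized):
    """Phoneme conversion: look up each word; unknown words fall back to letters."""
    phonemes = []
    for word in normalized.split():
        phonemes.extend(TOY_LEXICON.get(word, list(word.upper())))
    return phonemes


def acoustic_model(phonemes):
    """Acoustic model (stub): a fixed number of silent mel frames per phoneme."""
    return [[0.0] * N_MELS for _ in range(len(phonemes) * FRAMES_PER_PHONEME)]


def vocoder(mel):
    """Vocoder (stub): HOP_LENGTH silent audio samples per mel frame."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(text):
    """Run all four stages in order and return the waveform samples."""
    return vocoder(acoustic_model(to_phonemes(normalize(text))))
```

Calling `synthesize("Dr. Smith has 2 cats.")` passes the text through each stage in turn. An end-to-end model would replace everything after `normalize` with a single learned function.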
MOS and neural TTS
MOS (Mean Opinion Score) measures the perceived quality of the output audio: listeners rate samples on a 1-to-5 scale and the ratings are averaged. Neural TTS crossed 4.0 MOS (above “good, but clearly synthetic”) with WaveNet in 2016, and plateaued at 4.4–4.6 (statistically indistinguishable from human VO in blinded tests) by 2025. This MOS plateau means voice quality is no longer the primary differentiator between vendors; latency, cost, and locale coverage are.
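As a rough illustration of how a MOS figure is produced, the sketch below averages listener ratings on the 1-to-5 scale and attaches a 95% confidence interval. The interval uses a normal approximation (the 1.96 multiplier), which is an assumption that suits larger listener panels, and the function name is hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, (low, high)) for a list of 1-to-5 listener ratings.

    The interval is a 95% confidence interval under a normal approximation.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate an interval")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-to-5 scale")
    mos = statistics.mean(ratings)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    half_width = 1.96 * sem
    return mos, (mos - half_width, mos + half_width)
```

When the intervals for a synthetic voice and a human recording overlap heavily, a listening test cannot reliably tell them apart, which is the sense of "statistically indistinguishable" used here. A formal comparison would use a proper significance test instead of eyeballing the overlap.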
Latency characteristics
| Architecture | Typical first-byte latency | Real-time capable |
|---|---|---|
| Autoregressive (old) | 1000ms+ | No |
| Transformer-based | 300–800ms | Barely |
| Diffusion / flow-matching | 70–200ms | Yes |
The latency gap is why Cartesia Sonic 3 and Deepgram Aura 2 are a strong fit for AI agents and IVR: they use architectures optimized for streaming, while ElevenLabs and Murf use architectures optimized for quality.
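One way to apply the table is to filter architectures against a first-byte latency budget. The sketch below copies the ranges from the table above. The function name and the rule "the whole typical range must fit the budget" are assumptions for illustration, not a vendor recommendation.

```python
# (architecture, low_ms, high_ms) copied from the latency table.
# high_ms is None where the table gives no upper bound ("1000ms+").
FIRST_BYTE_MS = [
    ("Autoregressive (old)", 1000, None),
    ("Transformer-based", 300, 800),
    ("Diffusion / flow-matching", 70, 200),
]


def architectures_within(budget_ms):
    """Return architectures whose entire typical first-byte range fits the budget."""
    return [
        name
        for name, _low, high in FIRST_BYTE_MS
        if high is not None and high <= budget_ms
    ]
```

For a conversational agent with a hypothetical 250 ms budget, `architectures_within(250)` leaves only the diffusion / flow-matching row, which matches the "Real-time capable" column.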
Related concepts
- Prosody — neural TTS has made prosody the primary remaining quality dimension
- SSML — markup language that neural TTS engines interpret
- Voice cloning — built on neural TTS architectures
- Voice model — the trained artifact that produces a specific speaker style
See also
- ElevenLabs review — state of the art in neural TTS quality
- Choosing AI voice software