What is text-to-speech software?
From Stephen Hawking's Equalizer in 1986 to ElevenLabs Eleven v3 in 2026: what text-to-speech software actually is, how the technology changed, and why it matters if you last looked at TTS in 2018.
If you last evaluated text-to-speech software in 2018, you are evaluating a different product category now. The voice quality gap between a human voiceover (VO) actor and a flagship AI TTS tool, which was obvious and jarring in 2018, closed to within statistical noise in 2025. That changes who should use TTS, how they should use it, and what they should pay for it.
The short definition
Text-to-speech (TTS) software converts written text into spoken audio. The input is text; the output is an audio stream or file. That sounds simple. The complexity is in the pipeline that converts one to the other, and specifically in how that pipeline has changed over 40 years.
The history in four phases
Phase 1 — concatenative synthesis (1960s–2010s)
Early TTS stitched pre-recorded phoneme fragments together like an audio jigsaw puzzle. The result sounded robotic because the stitching was audible: flat pitch, no rhythm variation, unnatural pauses at punctuation.
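The stitching described above can be illustrated with a toy sketch. It is not how any real synthesizer was implemented: the "recorded fragments" here are short sine tones, and the unit inventory, labels, and sample rate are all assumptions made up for the example. It only shows the difference between a hard join and a short crossfade at each seam.

```python
import math

SAMPLE_RATE = 16_000  # Hz; assumed for this toy example


def fake_unit(freq_hz, n_samples=800):
    """Stand-in for a pre-recorded speech fragment: a short sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n_samples)]


# Hypothetical unit inventory keyed by phoneme label.
UNIT_INVENTORY = {
    "HH": fake_unit(220.0),
    "AY": fake_unit(330.0),
}


def concatenate(phonemes, crossfade=0):
    """Join stored units end to end.

    With crossfade > 0, the last `crossfade` samples already in the output
    are linearly blended with the first samples of the next unit.
    """
    out = []
    for label in phonemes:
        unit = UNIT_INVENTORY[label]
        overlap = min(crossfade, len(out), len(unit))
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # weight ramps toward the new unit
            idx = len(out) - overlap + i
            out[idx] = (1 - w) * out[idx] + w * unit[i]
        out.extend(unit[overlap:])
    return out


hard = concatenate(["HH", "AY"])                 # audible seam at the join
smooth = concatenate(["HH", "AY"], crossfade=80)  # seam blended over 80 samples
```

Even with crossfading, each unit keeps the pitch and timing it was recorded with, which is one reason the output described above sounds flat.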
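Stephen Hawking’s Equalizer program (1986) is the best-known voice of this era, although it was not strictly concatenative. The distinctive synthesised voice, often called the “Hawking voice”, came from a rule-based formant synthesizer closely related to DECtalk, which generated sound from acoustic rules rather than stitched recordings. It became iconic precisely because it was machine-like.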
Phase 2 — statistical parametric synthesis (2000s–2015)
Hidden Markov models and, later, deep neural networks (DNNs) predicted smooth acoustic parameters rather than stitching samples. Quality improved; the robotic clip-and-stitch artefacts largely disappeared. But the output was still identifiably synthetic: “smooth and wrong” rather than “choppy and wrong.”
Phase 3 — neural TTS (2016–2023)
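DeepMind’s WaveNet (2016) was the inflection point. Instead of predicting acoustic parameters, it generated raw audio waveforms sample by sample with a deep convolutional neural network, conditioned on linguistic features derived from the text. Mean opinion scores (MOS), listener ratings of naturalness on a 1-to-5 scale, jumped from 3.8 to 4.4 in two years.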
Google Tacotron 2 (2018) added an end-to-end architecture that learned prosody from data rather than rules. ElevenLabs launched in 2022 with a diffusion-based approach that finally matched human naturalness on short passages in blinded tests.
Phase 4 — quality plateau + differentiation on other axes (2024–present)
By 2025, ElevenLabs, Murf, Cartesia, and Inworld all clustered between MOS 4.4 and 4.6, within the sampling noise of ratings for human VO actors. The voice quality race effectively ended.
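To make "within the sampling noise" concrete, here is a minimal sketch of how a MOS comparison might be checked. The ratings are invented for illustration and are not data from any vendor or study; the 95% interval uses a simple normal approximation, which is an assumption and is rough for a panel this small.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, low, high): the mean opinion score and an approximate 95% interval."""
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))  # standard error of the mean
    return mean, mean - z * sem, mean + z * sem


def intervals_overlap(a, b):
    """True if two (mean, low, high) results have overlapping intervals."""
    return a[1] <= b[2] and b[1] <= a[2]


# Hypothetical 1-to-5 listener ratings (made up for this example).
human_ratings = [5, 4, 5, 4, 5, 4, 5, 5, 4, 4]
tts_ratings = [4, 5, 4, 5, 4, 4, 5, 4, 5, 4]

human = mos_with_ci(human_ratings)
tts = mos_with_ci(tts_ratings)
indistinguishable = intervals_overlap(human, tts)
```

When the intervals overlap, a panel of this size cannot tell the two sources apart, which is the sense in which a 0.1 to 0.2 MOS difference stops being meaningful.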
The new differentiation is on three axes: first-byte latency, character economics at scale, and locale/dialect coverage. See “The 3 axes that actually matter” for the full analysis.
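Two of these axes are easy to measure yourself. The sketch below is a rough illustration only: `fake_tts_stream` is a hypothetical stand-in for a streaming TTS response, not any vendor's API, and the price per million characters is an invented figure. Replace both with real values before drawing conclusions.

```python
import time
from typing import Iterable, Iterator, Tuple


def first_byte_latency(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Time from starting to read the stream until the first audio chunk arrives."""
    start = time.perf_counter()
    first = next(iter(chunks))
    return time.perf_counter() - start, first


def monthly_cost(characters: int, price_per_million: float) -> float:
    """Character economics: cost for a monthly character volume at a flat per-million rate."""
    return characters / 1_000_000 * price_per_million


def fake_tts_stream(text: str, startup_delay_s: float = 0.02) -> Iterator[bytes]:
    """Hypothetical stream: waits once, then yields one placeholder chunk per word."""
    time.sleep(startup_delay_s)  # simulated model and network startup time
    for word in text.split():
        yield word.encode("utf-8")


latency_s, first_chunk = first_byte_latency(fake_tts_stream("hello world"))
cost = monthly_cost(2_000_000, price_per_million=15.0)  # hypothetical price
```

Against a real service, the timer should start when the request is sent, so that connection setup is included in the measurement.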
Five ways people use TTS in 2026
- Content creation: YouTube voiceover, TikTok narration, podcast production without recording equipment. ElevenLabs and Murf own this segment.
- Accessibility: Reading PDFs, web articles, and Kindle books aloud for people with dyslexia, low vision, or ADHD. Speechify and NaturalReader are purpose-built for this.
- Developer / API integration: Embedding voice output into apps, AI agents, and IVR systems. Cartesia, Deepgram Aura 2, and Amazon Polly target this segment.
- E-learning and corporate training: Generating 200+ training module narrations in multiple languages without voiceover actors. Murf Enterprise and WellSaid Labs focus here.
- Personal reading: Consuming long-form content (articles, research papers, newsletters) as audio during commutes or while doing something else.
Who should use TTS in 2026 vs who should not
TTS is a clear win if: you produce content at volume, you have reading accessibility needs, you’re building a product that needs voice output, or you need consistent narration across many episodes or modules.
TTS is not the right call if: you need a specific real human’s voice for legal or creative reasons, you need emotional performance beyond what current models handle (stand-up comedy, sarcasm-heavy improv), or you’re in a jurisdiction with strict synthetic voice disclosure requirements for your content type.
The decision wizard branches on your specific use case — five questions, three ranked recommendations, honest alternatives included.