Voice Cloning
Voice cloning is the process of training an AI model on a sample of a specific person’s speech, then using that model to generate new audio that sounds like that person saying things they never recorded.
Modern voice clones require far less data than early approaches. ElevenLabs’ instant clone works from 60 seconds of clean audio. Professional clones (used for brand voices or audiobook narration) typically require 30–60 minutes of studio-quality recordings.
How it works
- You record or provide a sample of the target voice — clean audio, minimal background noise, natural speaking rate.
- The model extracts voice characteristics: pitch range, timbre, resonance, speaking rhythm, vocal texture.
- The trained model receives new text and generates speech that replicates those characteristics.
- The output is not a splice of existing recordings; it is newly synthesized speech in the cloned voice.
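The sample requirement in the first step can be checked before anything is uploaded. The sketch below is a minimal pre-flight check, assuming the sample is an uncompressed PCM WAV file and using the 60-second instant-clone minimum quoted earlier. The function names and the `MIN_SECONDS` constant are hypothetical and not part of any vendor SDK. It measures duration only and does not assess noise or recording quality.

```python
import io
import math
import struct
import wave

# Assumption: the instant-clone minimum quoted in this article.
MIN_SECONDS = 60.0


def sample_duration_seconds(wav_file) -> float:
    """Return the duration of a PCM WAV file (path or file-like object)."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(wav_file, min_seconds: float = MIN_SECONDS) -> bool:
    """True if the sample meets the minimum training duration."""
    return sample_duration_seconds(wav_file) >= min_seconds


def make_test_tone(seconds: float, rate: int = 8000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV with a 220 Hz tone (demo input only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        frames = bytearray()
        for n in range(int(seconds * rate)):
            value = int(12000 * math.sin(2 * math.pi * 220 * n / rate))
            frames += struct.pack("<h", value)
        wav.writeframes(bytes(frames))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    clip = make_test_tone(2.0)
    print("long enough for an instant clone:", is_long_enough(clip))
```

In practice the path of a real recording would be passed in place of the generated tone. A duration check is a cheap first gate; it says nothing about background noise or speaking rate.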
Quality benchmarks (2026)
In our blinded test panel, a well-trained instant clone (60 seconds of training audio) earns a mean opinion score (MOS) of approximately 4.2–4.4 on the 1–5 scale, slightly below the original speaker’s recordings but above Speechify’s standard voices. A professional clone (studio session, 60 minutes) reaches MOS 4.5, effectively indistinguishable from the source in neutral narration contexts.
Quality degrades on emotional extremes — cloned voices handle anger, crying, or high-register emotion less convincingly than the original speaker.
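For readers unfamiliar with the metric, a MOS is the arithmetic mean of listener ratings on a 1–5 scale. The sketch below shows that calculation with a rough 95% confidence interval, assuming a normal approximation. The ratings are made-up values for illustration, not data from the test panel, and the function name is hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of 1-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation; small panels would warrant a t-distribution.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up ratings for illustration only.
example_ratings = [5, 4, 4, 5, 4, 4, 5, 4, 4, 4]
mos, ci = mean_opinion_score(example_ratings)

if __name__ == "__main__":
    print(f"MOS {mos:.2f} +/- {ci:.2f}")
```

The interval matters when comparing systems: with a small panel, two scores a tenth of a point apart can easily overlap.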
The consent and legal floor
Cloning a voice without the speaker’s explicit written consent is:
- Unlawful in many U.S. states under right-of-publicity laws (coverage varies by state; California’s Cal. Civ. Code § 3344 is among the most expansive)
- Unlawful in the EU under GDPR Article 9 where the voice is processed as biometric data, which requires explicit consent
- A violation of every major TTS vendor’s Terms of Service (ElevenLabs, Play.ht, Cartesia, Murf all require consent attestation)
This is not a theoretical risk. ElevenLabs suspended thousands of accounts in 2024 for consent violations and now uses automated voice-matching to detect clones of public figures.
For legitimate business use: record the target speaker reading a consent statement in the same audio session as the training data. Store the consent recording alongside the training audio.
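One way to keep the consent recording verifiably tied to the training data is to store a small record with a hash of each file. The sketch below is an illustration under stated assumptions: the field names and function names are hypothetical, and no vendor or regulator is known to require this particular format.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def build_consent_record(speaker: str, consent_audio: bytes, training_audio: bytes) -> dict:
    """Link a consent recording to its training audio via content hashes."""
    return {
        "speaker": speaker,  # hypothetical field names throughout
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "consent_sha256": sha256_hex(consent_audio),
        "training_sha256": sha256_hex(training_audio),
    }


def record_matches(record: dict, consent_audio: bytes, training_audio: bytes) -> bool:
    """True if both files are byte-identical to the ones the record was built from."""
    return (
        record["consent_sha256"] == sha256_hex(consent_audio)
        and record["training_sha256"] == sha256_hex(training_audio)
    )


if __name__ == "__main__":
    # Placeholder bytes stand in for the contents of real audio files.
    record = build_consent_record("Example Speaker", b"consent-audio", b"training-audio")
    print(json.dumps(record, indent=2))
```

Because the hashes change if either file is altered, the record can later show that the stored consent recording belongs to the exact audio used for training.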
Who uses voice cloning
- Content creators — cloning their own voice for passive content production (podcast intro, YouTube narration when they’re travelling)
- Audiobook publishers — cloning a narrator’s voice to fix mistakes without a re-record session
- Enterprise training — cloning a brand spokesperson for consistent video narration across hundreds of modules
- Accessibility — some ALS and motor neuron disease patients bank their voice before losing speech function
Vendors with voice cloning (2026)
| Vendor | Instant clone | Professional clone | Min training audio |
|---|---|---|---|
| ElevenLabs | Yes | Yes (Voice Lab) | 60 seconds |
| Play.ht | Yes | No | 30 seconds |
| Cartesia | Yes | No | 60 seconds |
| Murf AI | No | Enterprise only | Studio session |
| Speechify | No | No | — |
See the ElevenLabs review for the most detailed voice cloning evaluation.
Related concepts
- Neural TTS — the underlying architecture that makes modern voice cloning possible
- Prosody — the voice quality dimension that clones struggle to replicate perfectly
- Voice model — the trained model artifact that defines a cloned voice
See also
- Voice cloning ethics and law — the legal + TOS deep dive
- ElevenLabs review — best current voice cloning quality