Voice Cloning

Voice cloning is the process of training an AI model on a sample of a specific person’s speech, then using that model to generate new audio that sounds like that person saying things they never recorded.

Modern voice cloning requires far less data than early approaches did. ElevenLabs’ instant clone works from 60 seconds of clean audio. Professional clones (used for brand voices or audiobook narration) typically require 30–60 minutes of studio-quality recordings.

How it works

You record or provide a sample of the target voice — clean audio, minimal background noise, natural speaking rate.
The model extracts voice characteristics: pitch range, timbre, resonance, speaking rhythm, vocal texture.
The trained model receives new text and generates speech that replicates those characteristics.
The output isn’t a recording splice — it’s a neural synthesis in the cloned voice style.
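
Before uploading a sample, it can help to confirm the recording is long enough. The sketch below is a minimal pre-flight check, assuming the sample is an uncompressed PCM WAV file and using the 60-second instant-clone figure quoted above as the default. The function names are hypothetical and not part of any vendor's API.

```python
import wave

# Assumption: 60 seconds is the instant-clone minimum quoted in this article.
# Check the vendor's current requirement before relying on it.
MIN_INSTANT_CLONE_SECONDS = 60.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, minimum: float = MIN_INSTANT_CLONE_SECONDS) -> bool:
    """Return True if the recording meets the minimum training length."""
    return wav_duration_seconds(path) >= minimum
```

This only checks length. Background noise and speaking rate still need a listen-through, since duration alone says nothing about how clean the audio is.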

Quality benchmarks (2026)

In our blinded test panel, a well-trained instant clone (60-second training audio) achieves a mean opinion score (MOS) of approximately 4.2–4.4, slightly below the original speaker’s MOS but above Speechify’s standard voices. A professional clone (studio session, 60 minutes) reaches MOS 4.5, effectively indistinguishable from the source in neutral narration contexts.
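
For readers unfamiliar with the metric: a mean opinion score is the arithmetic mean of listener ratings, conventionally on a 1 to 5 scale. The sketch below shows that calculation with a rough normal-approximation confidence interval. The function names and the sample ratings are illustrative assumptions, not data from the test panel described above.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Average listener ratings on the conventional 1-5 MOS scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings must lie between 1 and 5")
    return mean(ratings)


def mos_with_interval(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    mos = mean_opinion_score(ratings)
    if len(ratings) < 2:
        return mos, 0.0  # no spread can be estimated from a single rating
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Made-up ratings from five hypothetical listeners, for illustration only.
example_ratings = [4, 5, 4, 4, 5]
mos, half_width = mos_with_interval(example_ratings)
print(f"MOS {mos:.2f} +/- {half_width:.2f}")
```

With small panels the interval is wide, so a 0.1 to 0.2 gap between two voices may not be meaningful without more listeners.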

Quality degrades on emotional extremes — cloned voices handle anger, crying, or high-register emotion less convincingly than the original speaker.

The consent and legal floor

Cloning a voice without the speaker’s explicit written consent is:

Potentially unlawful in the United States under state right-of-publicity laws (these vary by state; California’s Cal. Civ. Code section 3344 is among the most expansive)
Potentially unlawful in the EU under GDPR Article 9 (voice data used to identify a person can be treated as biometric data, which requires explicit consent)
A violation of every major TTS vendor’s Terms of Service (ElevenLabs, Play.ht, Cartesia, Murf all require consent attestation)

This is not a theoretical risk. ElevenLabs suspended thousands of accounts in 2024 for consent violations and now uses automated voice-matching to detect clones of public figures.

For legitimate business use, record the target speaker reading a consent statement in the same audio session as the training data, and store the consent recording.
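
One way to make a stored consent recording auditable is to keep a small manifest that ties it to the training files by content hash. The sketch below is a minimal illustration under that assumption. The record layout, field names, and function names are hypothetical, and it is not a substitute for legal advice or for a vendor's own consent attestation flow.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large audio files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_consent_record(speaker_name, consent_audio_path, training_audio_paths):
    """Link a consent recording to its training audio by file hash."""
    return {
        "speaker": speaker_name,
        # Time the manifest was created, not the time of the recording.
        "record_created_utc": datetime.now(timezone.utc).isoformat(),
        "consent_audio": {
            "path": consent_audio_path,
            "sha256": sha256_of_file(consent_audio_path),
        },
        "training_audio": [
            {"path": path, "sha256": sha256_of_file(path)}
            for path in training_audio_paths
        ],
    }


def save_consent_record(record, manifest_path):
    """Write the manifest as JSON next to the audio files."""
    with open(manifest_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
```

Because the hashes change if any file is edited, the manifest shows later that the stored consent recording is the one made with that training session.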

Who uses voice cloning

Content creators — cloning their own voice for passive content production (podcast intro, YouTube narration when they’re travelling)
Audiobook publishers — cloning a narrator’s voice to fix mistakes without a re-record session
Enterprise training — cloning a brand spokesperson for consistent video narration across hundreds of modules
Accessibility — some ALS and motor neuron disease patients bank their voice before losing speech function

Vendors with voice cloning (2026)

Vendor	Instant clone	Professional clone	Min training audio
ElevenLabs	Yes	Yes (Voice Lab)	60 seconds
Play.ht	Yes	No	30 seconds
Cartesia	Yes	No	60 seconds
Murf AI	No	Enterprise only	Studio session
Speechify	No	No	—
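
If you want to shortlist vendors programmatically, the table above can be encoded as data and filtered by how much clean audio you have. This is a sketch that simply transcribes the table; the class and function names are hypothetical, and vendor requirements change, so treat the values as a snapshot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vendor:
    name: str
    instant_clone: bool
    professional_clone: str
    # None where the table gives no fixed length (studio session or no cloning).
    min_audio_seconds: Optional[int]


# Values transcribed from the vendor table above.
VENDORS = [
    Vendor("ElevenLabs", True, "Yes (Voice Lab)", 60),
    Vendor("Play.ht", True, "No", 30),
    Vendor("Cartesia", True, "No", 60),
    Vendor("Murf AI", False, "Enterprise only", None),
    Vendor("Speechify", False, "No", None),
]


def instant_clone_options(available_seconds):
    """List vendors whose instant clone accepts this much training audio."""
    return [
        vendor.name
        for vendor in VENDORS
        if vendor.instant_clone
        and vendor.min_audio_seconds is not None
        and vendor.min_audio_seconds <= available_seconds
    ]
```

For example, with 45 seconds of audio only Play.ht qualifies under the table's figures, while 60 seconds or more opens up all three instant-clone vendors.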

See the ElevenLabs review for the most detailed voice cloning evaluation.

Related terms

Neural TTS — the underlying architecture that makes modern voice cloning possible
Prosody — the voice quality dimension that clones struggle to replicate perfectly
Voice model — the trained model artifact that defines a cloned voice
