First-byte latency

First-byte latency is the elapsed time between sending a text-to-speech API request and receiving the first audio byte in the response stream. It determines whether a TTS tool is usable for interactive applications.
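A minimal sketch of how the measurement could be taken, assuming the client exposes the response as an iterator of byte chunks. The `open_stream` callable is a hypothetical stand-in for whatever sends the request in a given vendor's SDK, and `fake_stream` is a simulated endpoint, not a real API.

```python
import time
from typing import Callable, Iterable, Iterator


def first_byte_latency_ms(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Return milliseconds from issuing the request to the first non-empty chunk.

    open_stream is a hypothetical stand-in: calling it should send the TTS
    request and return an iterable of audio byte chunks.
    """
    start = time.perf_counter()  # clock starts just before the request is issued
    for chunk in open_stream():
        if chunk:  # ignore empty chunks; only real bytes count
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any audio bytes arrived")


def fake_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Simulated endpoint used only for demonstration."""
    time.sleep(delay_s)  # pretend network plus synthesis delay
    yield b"\x00\x01"
    yield b"\x02\x03"


if __name__ == "__main__":
    print(f"{first_byte_latency_ms(fake_stream):.0f} ms")
```

The timer stops at the first chunk and ignores the rest of the stream, which is the point: total synthesis time and first-byte latency are different numbers.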

Why it matters

For batch content generation (pre-render a podcast, download a voiceover file), first-byte latency is irrelevant. You queue a job and wait.

For interactive voice — AI phone agents, IVR systems, live captioning, conversational UI — first-byte latency is the UX cliff:

Under 200ms: indistinguishable from a human response delay
200–400ms: perceptible but acceptable for conversational use
400–800ms: noticeably slow; users experience a “loading” pause before every sentence
Over 800ms: breaks conversational flow; usable only for non-interactive reading
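The tiers above can be expressed as a small lookup, sketched here for use in a monitoring script or test. The function name is illustrative, and the handling of the exact 400ms boundary is an assumption, since the list does not say which tier it belongs to.

```python
def latency_tier(ms: float) -> str:
    """Map a first-byte latency in milliseconds to the tiers listed above.

    "Under 200" and "over 800" are strict, as written. Assumption: a value
    of exactly 400ms is assigned to the faster (200-400ms) tier.
    """
    if ms < 200:
        return "indistinguishable from a human response delay"
    if ms <= 400:
        return "perceptible but acceptable"
    if ms <= 800:
        return "noticeably slow"
    return "breaks conversational flow"


if __name__ == "__main__":
    for sample in (120, 295, 480, 820):
        print(sample, "->", latency_tier(sample))
```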

2026 benchmark (our testing, 20-run median, streaming endpoints)

Vendor	Cold start	Warm cache	Mode
Deepgram Aura 2	130ms	120ms	Streaming only
Cartesia Sonic 3	190ms	180ms	Streaming only
ElevenLabs Turbo v2	420ms	295ms	Streaming
ElevenLabs Eleven v3	820ms	380ms	Streaming
Play.ht PlayDialog	420ms	320ms	Streaming
Murf API	620ms	480ms	Batch
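For readers reproducing a "20-run median", here is one way the aggregation might look. This is a sketch rather than the harness behind the table: `measure_once` is a hypothetical callable returning one latency sample in milliseconds, and the sample values are made up for illustration.

```python
import statistics
from typing import Callable


def median_over_runs(measure_once: Callable[[], float], runs: int = 20) -> float:
    """Call measure_once `runs` times and return the median latency in ms."""
    samples = [measure_once() for _ in range(runs)]
    return statistics.median(samples)


if __name__ == "__main__":
    # Made-up samples for illustration only; not data from the benchmark.
    fake_samples = iter([182, 179, 181, 640, 180] * 4)
    print(median_over_runs(lambda: next(fake_samples)))
```

With these made-up samples the median stays at 181ms despite four 640ms outliers, while the mean would be 272.4ms. That robustness to occasional slow runs is the usual reason to report a median for latency.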

What this means for tool selection

The no-one-puts-in-the-comparison-table fact: Murf is unusable for AI agents. Not marginally slower: roughly 3x slower than Cartesia on cold start (620ms vs 190ms). For a voice agent where every response starts with a TTS call, even the 480ms warm delay stacks per turn.
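The ratios and the per-turn stacking can be checked directly against the table figures. A small worked calculation, assuming one TTS call per conversational turn; the 10-turn conversation is a hypothetical example, not a measured workload.

```python
# Figures copied from the benchmark table, in milliseconds.
MURF_COLD, MURF_WARM = 620, 480
CARTESIA_COLD, CARTESIA_WARM = 190, 180
DEEPGRAM_COLD = 130


def ratio(slow_ms: float, fast_ms: float) -> float:
    """How many times slower slow_ms is than fast_ms, to one decimal place."""
    return round(slow_ms / fast_ms, 1)


def added_wait_ms(turns: int, slow_ms: float, fast_ms: float) -> float:
    """Extra total waiting over a conversation, assuming one TTS call per turn."""
    return turns * (slow_ms - fast_ms)


if __name__ == "__main__":
    print(ratio(MURF_COLD, CARTESIA_COLD))   # 3.3
    print(ratio(MURF_COLD, DEEPGRAM_COLD))   # 4.8
    # Hypothetical 10-turn conversation on warm caches.
    print(added_wait_ms(10, MURF_WARM, CARTESIA_WARM))  # 3000
```

On these figures, a 10-turn exchange on Murf accumulates 3 seconds more dead air than the same exchange on Cartesia, even with warm caches.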

ElevenLabs Turbo v2 at 295ms warm is the minimum viable option for interactive use. Cartesia Sonic 3 at 180ms is the safe choice.

If you’re building batch content (YouTube voiceover, e-learning narration), pick on voice quality and cost; latency is irrelevant. If you’re wiring TTS into an app, make latency the first question you answer, before you evaluate any other axis.
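One way to apply latency as the first filter, sketched with the figures from the benchmark table. The `shortlist` helper and the 300ms budget are assumptions for illustration; pick a budget that matches your own tolerance.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    vendor: str
    cold_ms: int
    warm_ms: int
    mode: str


# Rows copied from the benchmark table.
BENCHMARK = [
    Row("Deepgram Aura 2", 130, 120, "Streaming only"),
    Row("Cartesia Sonic 3", 190, 180, "Streaming only"),
    Row("ElevenLabs Turbo v2", 420, 295, "Streaming"),
    Row("ElevenLabs Eleven v3", 820, 380, "Streaming"),
    Row("Play.ht PlayDialog", 420, 320, "Streaming"),
    Row("Murf API", 620, 480, "Batch"),
]


def shortlist(budget_ms: int, use_cold: bool = False) -> List[str]:
    """Vendors whose warm (or cold-start) latency is within budget_ms."""
    return [
        row.vendor
        for row in BENCHMARK
        if (row.cold_ms if use_cold else row.warm_ms) <= budget_ms
    ]


if __name__ == "__main__":
    print(shortlist(300))                 # warm-cache figures
    print(shortlist(300, use_cold=True))  # cold-start figures
```

Filtering on cold-start figures is the stricter test: at a 300ms budget it keeps only Deepgram Aura 2 and Cartesia Sonic 3, while the warm-cache figures also admit ElevenLabs Turbo v2.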

See the decision wizard — Q3 branches the entire recommendation on latency tolerance.

Go deeper

Choosing AI Voice Software

Full Glossary

ElevenLabs Review
