First-byte latency

Category: Performance
Tags: latency, streaming, real-time, api, ivr, ai-agents

First-byte latency is the elapsed time between sending a text-to-speech API request and receiving the first audio byte in the response stream. It determines whether a TTS tool is usable for interactive applications.
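As a minimal sketch of that definition, the timer below starts just before the request is issued and stops when the first non-empty chunk arrives. The `open_stream` callable and the `fake_tts_stream` stand-in are hypothetical names used for illustration. With a real vendor you would pass a function that opens that vendor's streaming response and yields its audio chunks.

```python
import time
from typing import Callable, Iterable, Optional


def first_byte_latency_ms(open_stream: Callable[[], Iterable[bytes]]) -> Optional[float]:
    """Milliseconds from issuing the request to the first non-empty audio chunk.

    open_stream is a hypothetical callable that sends the request and returns
    an iterable of byte chunks. Returns None if the stream ends without data.
    """
    start = time.perf_counter()  # clock starts before the request is sent
    for chunk in open_stream():
        if chunk:  # ignore empty chunks; only real audio bytes count
            return (time.perf_counter() - start) * 1000.0
    return None


def fake_tts_stream() -> Iterable[bytes]:
    """Stand-in for a streaming endpoint: waits 50 ms, then yields audio bytes."""
    time.sleep(0.05)
    yield b"\x00\x01"
    yield b"\x02\x03"


if __name__ == "__main__":
    latency = first_byte_latency_ms(fake_tts_stream)
    print(f"first byte after {latency:.0f} ms")
```

Note that the measurement stops at the first chunk and ignores how long the rest of the audio takes to arrive, which is what separates first-byte latency from total synthesis time.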

Why it matters

For batch content generation (pre-render a podcast, download a voiceover file), first-byte latency is irrelevant. You queue a job and wait.

For interactive voice — AI phone agents, IVR systems, live captioning, conversational UI — first-byte latency is the UX cliff:

  • Under 200ms: indistinguishable from a human response delay
  • 200–400ms: perceptible but acceptable for conversational use
  • 400–800ms: noticeably slow; users experience a “loading” pause before every sentence
  • Over 800ms: breaks conversational flow; usable only for non-interactive reading
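The bands above can be written as a small classifier, which is handy for tagging measurements in a log or dashboard. This is a sketch: the function name and labels are made up for illustration, and because the list leaves the exact boundaries open, the code assumes that 400ms belongs to the "acceptable" band and 800ms to the "noticeably slow" band.

```python
def classify_first_byte_latency(latency_ms: float) -> str:
    """Map a first-byte latency in milliseconds to one of the four bands.

    Boundary handling is an assumption: exactly 400 ms counts as acceptable,
    exactly 800 ms counts as noticeably slow.
    """
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms < 200:
        return "human-like"
    if latency_ms <= 400:
        return "acceptable"
    if latency_ms <= 800:
        return "noticeably slow"
    return "breaks conversational flow"


if __name__ == "__main__":
    for sample_ms in (120, 295, 480, 820):
        print(sample_ms, "->", classify_first_byte_latency(sample_ms))
```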

2026 benchmark (our testing, 20-run median, streaming endpoints)

| Vendor | Cold start | Warm cache | Mode |
|---|---|---|---|
| Deepgram Aura 2 | 130ms | 120ms | Streaming only |
| Cartesia Sonic 3 | 190ms | 180ms | Streaming only |
| ElevenLabs Turbo v2 | 420ms | 295ms | Streaming |
| ElevenLabs Eleven v3 | 820ms | 380ms | Streaming |
| Play.ht PlayDialog | 420ms | 320ms | Streaming |
| Murf API | 620ms | 480ms | Batch |
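A 20-run median is easy to reproduce on your own network. The sketch below is not the harness behind the figures above; it only shows the aggregation step. `measure_once` is a hypothetical callable that performs one request and returns its first-byte latency in milliseconds. How cold and warm runs are separated is left to the caller.

```python
import statistics
from typing import Callable


def median_latency_ms(measure_once: Callable[[], float], runs: int = 20) -> float:
    """Call measure_once `runs` times and return the median latency in ms."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = [measure_once() for _ in range(runs)]
    return statistics.median(samples)


if __name__ == "__main__":
    # Canned samples standing in for real measurements; one is a network spike.
    canned = iter([100.0, 300.0, 120.0, 110.0, 5000.0])
    print(median_latency_ms(lambda: next(canned), runs=5))
```

The median is used rather than the mean so that a single slow outlier, such as the 5000ms sample in the demo, does not distort the reported figure.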

What this means for tool selection

The fact no one puts in the comparison table: Murf is unusable for AI agents. It is not marginally slower; at 620ms against 190ms, it is more than 3x slower than Cartesia on cold start. For a voice agent where every response starts with a TTS call, the 480ms warm delay is paid again on every turn.
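To put a number on the per-turn cost, the sketch below multiplies the warm first-byte latency by the number of agent turns. It assumes one TTS call per turn with nothing overlapping the wait, and the 10-turn conversation length is an illustrative figure, not a measurement.

```python
def cumulative_wait_s(first_byte_ms: float, turns: int) -> float:
    """Total seconds a caller spends waiting for audio to start across a call.

    Assumes exactly one TTS request per agent turn (an illustrative assumption).
    """
    if turns < 0:
        raise ValueError("turns cannot be negative")
    return first_byte_ms * turns / 1000.0


if __name__ == "__main__":
    turns = 10  # hypothetical conversation length
    # Warm-cache figures taken from the benchmark table.
    for vendor, warm_ms in (("Murf API", 480), ("Cartesia Sonic 3", 180)):
        total = cumulative_wait_s(warm_ms, turns)
        print(f"{vendor}: {total:.1f} s of silence over {turns} turns")
```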

ElevenLabs Turbo v2 at 295ms warm is the minimum viable option for interactive use. Cartesia Sonic 3 at 180ms is the safe choice.

If you’re building batch content (YouTube voiceover, e-learning narration), pick on voice quality and cost; latency is irrelevant. If you’re wiring TTS into an app, settle the latency question first, before you evaluate any other axis.

See the decision wizard — Q3 branches the entire recommendation on latency tolerance.

Go deeper