First-byte latency
First-byte latency is the elapsed time between sending a text-to-speech API request and receiving the first audio byte in the response stream. It determines whether a TTS tool is usable for interactive applications.
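A minimal sketch of how this measurement could be taken, assuming the client exposes the response as an iterator of byte chunks. The names `time_to_first_byte` and `open_stream` are hypothetical and not part of any vendor SDK. The clock starts just before the request is issued and stops at the first non-empty chunk.

```python
import time
from typing import Callable, Iterable


def time_to_first_byte(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Return milliseconds from issuing the request to the first non-empty chunk.

    `open_stream` is a hypothetical callable that sends the TTS request and
    returns an iterable of audio chunks (for example, a streaming HTTP body).
    """
    start = time.perf_counter()  # monotonic clock, suited to short intervals
    for chunk in open_stream():
        if chunk:  # ignore empty chunks; only real audio bytes count
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any audio bytes arrived")
```

Passing the request as a callable keeps connection setup inside the timed window, which is what a user actually experiences.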
Why it matters
For batch content generation (pre-render a podcast, download a voiceover file), first-byte latency is irrelevant. You queue a job and wait.
For interactive voice — AI phone agents, IVR systems, live captioning, conversational UI — first-byte latency is the UX cliff:
- Under 200ms: indistinguishable from a human response delay
- 200–400ms: perceptible but acceptable for conversational use
- 400–800ms: noticeably slow; users experience a “loading” pause before every sentence
- Over 800ms: breaks conversational flow; usable only for non-interactive reading
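The bands above can be expressed as a small lookup, sketched here under one assumption: the list does not say which band owns the exact boundary values, so this version treats each lower bound as inclusive and assigns exactly 800ms to the 400–800ms band. The function name and return labels are illustrative.

```python
def classify_latency(first_byte_ms: float) -> str:
    """Map a first-byte latency in milliseconds to one of the four bands."""
    if first_byte_ms < 0:
        raise ValueError("latency cannot be negative")
    if first_byte_ms < 200:
        return "indistinguishable from a human delay"
    if first_byte_ms < 400:  # assumption: 200 is inclusive, 400 is not
        return "perceptible but acceptable"
    if first_byte_ms <= 800:  # assumption: exactly 800 still counts as slow
        return "noticeably slow"
    return "breaks conversational flow"
```

For example, `classify_latency(180)` lands in the first band and `classify_latency(820)` in the last.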
2026 benchmark (our testing, 20-run median, streaming endpoints)
| Vendor | Cold start | Warm cache | Mode |
|---|---|---|---|
| Deepgram Aura 2 | 130ms | 120ms | Streaming only |
| Cartesia Sonic 3 | 190ms | 180ms | Streaming only |
| ElevenLabs Turbo v2 | 420ms | 295ms | Streaming |
| ElevenLabs Eleven v3 | 820ms | 380ms | Streaming |
| Play.ht PlayDialog | 420ms | 320ms | Streaming |
| Murf API | 620ms | 480ms | Batch |
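For readers who want to reproduce figures like these, here is a sketch of taking a 20-run median. It is not the harness behind the table; `measure_once` is a hypothetical callable that performs one request and returns its first-byte latency in milliseconds.

```python
from statistics import median
from typing import Callable


def median_latency_ms(measure_once: Callable[[], float], runs: int = 20) -> float:
    """Call `measure_once` `runs` times and return the median latency in ms."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = [measure_once() for _ in range(runs)]
    return median(samples)
```

The median is used rather than the mean so that a single stalled request does not distort the reported number.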
What this means for tool selection
The fact no one puts in the comparison table: Murf is unusable for AI agents. It is not marginally slower: at 620ms on cold start, it is more than 3x slower than Cartesia (190ms) and nearly 5x slower than Deepgram Aura 2 (130ms). For a voice agent where every response starts with a TTS call, even the 480ms warm delay stacks on every turn.
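To make the per-turn stacking concrete, here is a rough calculation using the warm figures from the table. The 10-turn conversation length is an assumption for illustration, and the function name is hypothetical.

```python
def added_wait_ms(first_byte_ms: int, turns: int) -> int:
    """Total time spent waiting on TTS first bytes across a conversation."""
    return first_byte_ms * turns


TURNS = 10  # assumed conversation length, for illustration only

murf_total = added_wait_ms(480, TURNS)      # Murf API, warm
cartesia_total = added_wait_ms(180, TURNS)  # Cartesia Sonic 3, warm

print(murf_total - cartesia_total)  # prints 3000
```

Under that assumption, the caller spends three extra seconds in silence over the course of one conversation.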
ElevenLabs Turbo v2 at 295ms warm is the minimum viable option for interactive use. Cartesia Sonic 3 at 180ms is the safe choice.
If you’re building batch content (YouTube voiceover, e-learning narration), pick on voice quality and cost; latency is irrelevant. If you’re wiring TTS into an app, make latency the first question you answer, before you evaluate any other axis.
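A latency-first selection can be sketched as a filter applied before any other comparison. The vendor figures are copied from the benchmark table; the 300ms warm budget and the names `Vendor` and `shortlist` are assumptions for illustration.

```python
from typing import List, NamedTuple


class Vendor(NamedTuple):
    name: str
    cold_ms: int
    warm_ms: int


# Figures copied from the benchmark table.
VENDORS = [
    Vendor("Deepgram Aura 2", 130, 120),
    Vendor("Cartesia Sonic 3", 190, 180),
    Vendor("ElevenLabs Turbo v2", 420, 295),
    Vendor("ElevenLabs Eleven v3", 820, 380),
    Vendor("Play.ht PlayDialog", 420, 320),
    Vendor("Murf API", 620, 480),
]


def shortlist(vendors: List[Vendor], warm_budget_ms: int) -> List[Vendor]:
    """Keep vendors whose warm latency fits the budget, fastest first."""
    fits = [v for v in vendors if v.warm_ms <= warm_budget_ms]
    return sorted(fits, key=lambda v: v.warm_ms)


names = [v.name for v in shortlist(VENDORS, warm_budget_ms=300)]
```

Voice quality and cost are then compared only among the vendors that survive the filter.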
See the decision wizard — Q3 branches the entire recommendation on latency tolerance.