Latency and naturalness of an AI voice agent: how to drop under 600 ms
Below 600 ms, the human ear stops hearing an AI. Above it, the agent sounds robotic. Here are the technical levers for getting under the threshold, and why it separates an agent that converts from one that gets hung up on.
In human conversation, the gap between the end of your sentence and the start of mine is around 200-400 ms. Beyond that, the silence becomes uncomfortable. Callers bring the same expectation to an AI voice agent.
Why 600 ms is the right target
Not 1 second, not 300 ms: 600 ms. Below it, the conversation feels natural. Above it, your customer's brain starts running hypotheses: "it didn't understand", "it's glitching", "should I repeat?". You lose the customer mentally before the next word.
The 5 sources of latency
- Voice activity detection (VAD): 100-200 ms when badly tuned, 50 ms when tuned well.
- Speech-to-text (STT): 200-400 ms depending on model and streaming.
- LLM inference: 300-800 ms, often the largest single contributor.
- Text-to-speech (TTS): 100-300 ms depending on voice and streaming.
- Network round-trip: 30-150 ms depending on provider region.
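Adding these ranges up gives a rough latency budget. The sketch below is a simplification: it assumes the five stages run strictly one after another with no overlap, and it treats each quoted range as that stage's full contribution. The stage names and the `sequential_budget` helper are illustrative, not part of any library.

```python
# Per-stage latency ranges in ms, copied from the list above.
# VAD uses the tuned figure (50 ms) as its low end and the
# badly tuned figure (200 ms) as its high end.
STAGES_MS = {
    "vad": (50, 200),
    "stt": (200, 400),
    "llm": (300, 800),
    "tts": (100, 300),
    "network": (30, 150),
}

TARGET_MS = 600


def sequential_budget(stages):
    """Sum best-case and worst-case latency if each stage waits for the previous one."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = sequential_budget(STAGES_MS)
print(f"best case:  {best} ms (target {TARGET_MS} ms)")
print(f"worst case: {worst} ms")
```

Under that assumption, even the best case (680 ms) lands above the 600 ms target. A purely sequential pipeline cannot get there by tuning each stage alone. The stages have to overlap.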
The 4 concrete levers
1. Streaming STT instead of buffered: the agent starts understanding from the first 3 words.
2. Streaming LLM with short default replies.
3. TTS that starts speaking on the first token received.
4. Co-location: LLM and TTS in the same cloud region as your telecom carrier.
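To see why levers 2 and 3 matter, here is a small timing model rather than a real pipeline. It compares a TTS that waits for the complete LLM reply with one that starts on the first token. The function name, the token schedule, and the 100 ms TTS startup are all illustrative assumptions. In practice many TTS engines want a short phrase rather than a single token before they can start.

```python
def time_to_first_audio(token_times_ms, tts_startup_ms, streaming):
    """Return ms from end of user speech until the first audio sample.

    token_times_ms: arrival time of each LLM token, measured from end of speech.
    tts_startup_ms: time the TTS needs between receiving text and emitting audio.
    streaming: if True, TTS starts on the first token; otherwise it waits for the last.
    """
    if not token_times_ms:
        raise ValueError("need at least one token")
    trigger = token_times_ms[0] if streaming else token_times_ms[-1]
    return trigger + tts_startup_ms


# Hypothetical reply: first token at 300 ms, then one token every 40 ms, 20 tokens total.
tokens = [300 + 40 * i for i in range(20)]

buffered = time_to_first_audio(tokens, tts_startup_ms=100, streaming=False)
streamed = time_to_first_audio(tokens, tts_startup_ms=100, streaming=True)
print(f"buffered: {buffered} ms, streamed: {streamed} ms")
```

With these made-up numbers the buffered path reaches first audio at 1160 ms and the streamed path at 400 ms. The model also shows why short default replies help: in the buffered case, every extra token pushes the first audio later.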
The naturalness trap
You can hit 300 ms and still sound robotic, because the agent strings sentences together without taking a breath. True naturalness adds 80 ms of strategic pause, soft "uh"s, and slower delivery on numbers. The agent is deliberately slower so that it feels more human.
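One way to express two of these cues is SSML markup, using the standard `<break>` and `<prosody>` elements. The sketch below assumes the TTS engine accepts SSML, which the text does not confirm for any particular engine. The `to_ssml` helper is hypothetical, and its sentence splitting is deliberately naive. Filler words such as "uh" are not modeled here.

```python
import re
from xml.sax.saxutils import escape

PAUSE_MS = 80  # the strategic pause quoted above

# A run of digits, optionally grouped with spaces, dots or dashes (e.g. "06 12 34").
NUMBER_RE = re.compile(r"\d+(?:[ .\-]\d+)*")


def to_ssml(text, pause_ms=PAUSE_MS):
    """Wrap text in SSML: a short pause between sentences, slower delivery on numbers."""
    # Naive split: sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    parts = []
    for sentence in sentences:
        safe = escape(sentence)  # escape &, < and > so the markup stays valid
        safe = NUMBER_RE.sub(
            lambda m: f'<prosody rate="slow">{m.group(0)}</prosody>', safe
        )
        parts.append(safe)
    pause = f'<break time="{pause_ms}ms"/>'
    return "<speak>" + pause.join(parts) + "</speak>"


print(to_ssml("Your code is 4521. Thank you!"))
```

The pause is inserted only between sentences, never before the first one, so it does not add to the response latency measured at the start of the reply.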
What we hold at VocazAI
p50 at 480 ms, p95 at 620 ms in trilingual production with the Voxtral + faster-whisper + Mistral + Piper cascade. That is below the threshold on most calls, and above it only when the conversation enters a genuinely hard zone (long numbers, aggressive code-switching). First month free to measure on your calls.
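To compute the same p50 and p95 figures on your own calls, a nearest-rank percentile over per-turn latencies is enough. This is a minimal sketch. The sample values are invented for illustration and are not production data, and the nearest-rank method is one common convention among several.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical per-turn latencies in ms (illustrative values only).
latencies_ms = [410, 450, 455, 470, 480, 490, 510, 530, 560, 640]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
share_under = sum(1 for x in latencies_ms if x < 600) / len(latencies_ms)
print(f"p50={p50} ms  p95={p95} ms  under 600 ms: {share_under:.0%}")
```

Measure each turn from the end of the caller's speech to the first audio sample played back. That gap is the one the caller hears.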