7 min read

How an AI voice agent works: from pickup to CRM update

We unpack the full pipeline of an AI voice agent: pickup, transcription, reasoning, voice response, integrations. No jargon, just what really happens in under a second.

When a customer calls your number, the AI voice agent works in the background like an orchestra where each player gets 200 ms. Here's the exact flow, step by step, so you understand where your money is spent and where quality is won or lost.

Step 1: Pickup and routing

Your number is attached to a telecom carrier (Twilio, Vonage, OVH). When the call comes in, the carrier routes it via SIP/WebRTC to the agent platform (VocazAI, Vapi, Retell). This step adds 50–100 ms of latency and costs $0.002–$0.005 per minute in telecom fees.
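To see what the per-minute telecom fee means on a monthly bill, here is a minimal sketch. The rate range comes from the figures above; the 5,000-minute volume and the function name are illustrative assumptions, not data from any carrier.

```python
def telecom_fee_range(minutes: float,
                      low_per_min: float = 0.002,
                      high_per_min: float = 0.005) -> tuple[float, float]:
    """Return the (low, high) telecom cost in dollars for a number of call minutes."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return (round(minutes * low_per_min, 2), round(minutes * high_per_min, 2))


# Hypothetical volume: 5,000 call minutes in a month.
low, high = telecom_fee_range(5_000)
print(f"Telecom fees: ${low:.2f} to ${high:.2f}")
```

At that assumed volume, the telecom leg lands between $10 and $25 a month.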

Step 2: Streaming transcription

The audio stream is sent in real time to a speech recognition engine (Voxtral, Whisper), which transcribes it in 200–300 ms chunks. There is no waiting for the end of the sentence: the agent starts understanding as soon as you have said the first 3 words.
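The chunking can be pictured with a short sketch. It regroups incoming network packets into fixed 250 ms chunks before they are handed to the recognizer. The audio format (16 kHz, 16-bit mono PCM) and the name `chunk_audio` are assumptions for illustration; real platforms may use other codecs and their own buffering.

```python
from typing import Iterable, Iterator

SAMPLE_RATE_HZ = 16_000   # assumed: 16 kHz mono audio
BYTES_PER_SAMPLE = 2      # assumed: 16-bit PCM


def chunk_audio(stream: Iterable[bytes], chunk_ms: int = 250) -> Iterator[bytes]:
    """Regroup arbitrary-sized packets into fixed-duration chunks."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    buffer = bytearray()
    for packet in stream:
        buffer.extend(packet)
        # Emit every full chunk as soon as enough audio has arrived.
        while len(buffer) >= chunk_bytes:
            yield bytes(buffer[:chunk_bytes])
            del buffer[:chunk_bytes]
    if buffer:
        yield bytes(buffer)  # flush the final partial chunk when the stream ends


# One second of silence delivered as fifty 20 ms packets (640 bytes each).
packets = [b"\x00" * 640] * 50
chunks = list(chunk_audio(packets))
print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

Each chunk can be sent to the speech engine the moment it is full, which is what lets transcription start before the caller finishes the sentence.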

Step 3: Reasoning (the LLM)

The transcript goes into an LLM (GPT-4o, Mistral, Claude) with a system prompt describing your business, rules and tools. The model then decides what to do next: answer, ask a question, call a function (check the calendar, book a slot), or transfer the call.
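One way to picture that decision step is as a small structured message the platform validates before acting on it. The JSON shape, the `Decision` class and the action names below are hypothetical. Each LLM provider has its own function-calling format, so treat this as a sketch of the idea rather than any vendor's API.

```python
import json
from dataclasses import dataclass, field

# Hypothetical action names mirroring the four choices described above.
ALLOWED_ACTIONS = {"answer", "ask", "call_function", "transfer"}


@dataclass
class Decision:
    action: str
    text: str = ""
    function: str = ""
    arguments: dict = field(default_factory=dict)


def parse_decision(raw: str) -> Decision:
    """Validate the model's JSON output before the platform acts on it."""
    data = json.loads(raw)
    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    if action == "call_function" and not data.get("function"):
        raise ValueError("call_function needs a function name")
    return Decision(action, data.get("text", ""),
                    data.get("function", ""), data.get("arguments") or {})


def next_step(decision: Decision) -> str:
    """Describe what the platform does with a validated decision."""
    if decision.action in ("answer", "ask"):
        return f"speak: {decision.text}"
    if decision.action == "call_function":
        return f"run tool: {decision.function}"
    return "transfer to a human"


raw = '{"action": "call_function", "function": "check_calendar", "arguments": {"date": "2025-01-15"}}'
print(next_step(parse_decision(raw)))
```

Rejecting anything outside the allowed set keeps a malformed model reply from triggering an action the caller never asked for.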

Step 4: Voice synthesis

  • Response text → TTS engine (Piper, ElevenLabs, OpenAI TTS).
  • One voice per language (Siwis for French, Lessac for English, Kareem for Arabic).
  • Streamed output to cut perceived latency.
  • Step total: 80–200 ms.
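Streamed output usually means the TTS engine starts speaking the first sentence while the LLM is still writing the rest. The sketch below shows that idea only: it cuts a token stream into sentences as they complete. The function name and the regex-based splitting are assumptions. A production splitter would also need to handle abbreviations such as "Dr." that this one would cut on.

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete, without waiting for the full reply."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # whatever remains when the LLM stops


# Hypothetical LLM token stream.
tokens = ["Sure", ", I can", " help.", " Which", " day", " works", " for you?"]
for sentence in sentences_from_tokens(tokens):
    print("send to TTS:", sentence)
```

The caller hears "Sure, I can help." while the second sentence is still being generated, which is where the drop in perceived latency comes from.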

Step 5: Integrations (tool functions)

If the agent needs to act (book, check availability, push a CRM lead), it calls a server-side function. The result comes back as JSON, and the LLM phrases it as a spoken reply. This is where your Google Calendar, HubSpot, Pipedrive or PMS plugs in.
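A server-side tool function can be as small as the sketch below. The registry, the `check_availability` function and its hard-coded calendar are hypothetical stand-ins. A real integration would call the Google Calendar, HubSpot or Pipedrive API at that point.

```python
import json
from typing import Callable

# Registry of functions the agent is allowed to call, keyed by name.
TOOLS: dict[str, Callable[..., dict]] = {}


def tool(fn: Callable[..., dict]) -> Callable[..., dict]:
    """Register a function so the agent can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def check_availability(date: str) -> dict:
    # Hypothetical stand-in for a real calendar lookup.
    booked = {"2025-01-15": ["10:00"]}
    slots = [s for s in ("09:00", "10:00", "11:00") if s not in booked.get(date, [])]
    return {"date": date, "free_slots": slots}


def run_tool(name: str, arguments: dict) -> str:
    """Run a registered tool and return its result as a JSON string for the LLM."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps(TOOLS[name](**arguments))
    except Exception as exc:  # report failures to the LLM instead of dropping the call
        return json.dumps({"error": str(exc)})


print(run_tool("check_availability", {"date": "2025-01-15"}))
```

Returning errors as JSON too lets the LLM tell the caller something went wrong instead of going silent.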

Final tally

Pickup (100 ms) + STT (300 ms) + LLM (400 ms) + TTS (150 ms) + network (50 ms) = ~1000 ms per turn. Tuned well, you sit under 600 ms, the threshold where the agent stops feeling like 'AI'. VocazAI's first month is free, so you can measure this on your real calls.
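The budget is easy to check and to replay with your own measurements. The sketch below uses the stage figures from the tally and the 600 ms target. The dictionary keys and function names are illustrative.

```python
# Per-turn latency budget in milliseconds, taken from the tally above.
BUDGET_MS = {"pickup": 100, "stt": 300, "llm": 400, "tts": 150, "network": 50}
TARGET_MS = 600


def turn_latency(stages: dict[str, int]) -> int:
    """Total latency of one conversational turn."""
    return sum(stages.values())


def excess_over_target(stages: dict[str, int], target_ms: int = TARGET_MS) -> int:
    """Milliseconds that must be trimmed to reach the target (0 if already under)."""
    return max(0, turn_latency(stages) - target_ms)


total = turn_latency(BUDGET_MS)
slowest = max(BUDGET_MS, key=BUDGET_MS.get)
print(f"total: {total} ms, to trim: {excess_over_target(BUDGET_MS)} ms, slowest stage: {slowest}")
```

With these figures, 400 ms has to come out of the pipeline to reach the target, and the LLM stage is the largest single contributor.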