Skip to main content
All articles
Published on7 min read

Voxtral vs Whisper: which transcription engine for a multilingual voice agent

Mistral's Voxtral and OpenAI's Whisper are the two most-used speech recognition engines in 2026. Here's an honest comparison on latency, multilingual quality, price and hosting.

For a voice agent that has to understand French, Arabic and English in real time, the transcription engine matters as much as the LLM. Two options dominate: Voxtral (Mistral) and Whisper (OpenAI). Here's what actually sets them apart.

Latency — the gap is audible#

Voxtral mini-transcribe lands around 200-400 ms end-to-end on short audio. Self-hosted Whisper-large-v3 sits at 300-500 ms on GPU, plus network. In live conversation, 200 ms feels noticeably more fluid. Whisper-large still wins on long-form audio, but on telephony-style turn-taking, Voxtral wins.

Multilingual quality#

  • Standard French: tie, WER below 4 % on both.
  • English: Whisper-large keeps a slight edge (trained on more audio).
  • Standard Arabic: Voxtral mini-2602 better calibrated, fewer wonky transliterations.
  • Arabic dialects: both can choke; a human fallback is wise.
  • Code-switching (FR↔AR↔EN in the same sentence): Voxtral more stable.

Hosting and compliance#

Whisper open-source self-hosts on any recent GPU — great for data sovereignty. Voxtral is by default a Mistral API hosted in EU region — convenient but external dependency. Both are GDPR-defensible; what matters is what your privacy policy commits to.

Price#

Voxtral bills per audio minute ($0.005-$0.01/min depending on model). Self-hosted Whisper costs the GPU (~$0.50-1.50/hour depending on cloud), which becomes cheap past ~80 hours of audio/month. Below that volume, Voxtral is simpler and cheaper all-in.

Our VocazAI setup#

Cascade: Voxtral first (per-language model), faster-whisper medium as a Docker self-hosted fallback. The cascade holds below 600 ms p95 in trilingual production, and stays alive even when the Mistral API hiccups. First month free to test under real call conditions.