Voxtral vs Whisper: which transcription engine for a multilingual voice agent
Mistral's Voxtral and OpenAI's Whisper are the two most-used speech recognition engines in 2026. Here's an honest comparison on latency, multilingual quality, price and hosting.
For a voice agent that has to understand French, Arabic and English in real time, the transcription engine matters as much as the LLM. Two options dominate: Voxtral (Mistral) and Whisper (OpenAI). Here's what actually sets them apart.
Latency: the gap is audible
Voxtral mini-transcribe lands around 200-400 ms end-to-end on short audio. Self-hosted Whisper-large-v3 sits at 300-500 ms on GPU, plus network. In live conversation, a reply that starts 200 ms sooner feels noticeably more fluid. Whisper-large still has the edge on long-form audio, but for telephony-style turn-taking, Voxtral is the faster choice.
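Figures like these depend heavily on clip length, hardware and network path, so it is worth measuring on your own audio. Here is a minimal Python sketch, assuming `transcribe` is any callable that sends one clip to an engine and blocks until the text comes back. The function names are illustrative and not part of either engine's SDK.

```python
import math
import time
from typing import Callable, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def measure_latencies_ms(
    transcribe: Callable[[bytes], object],  # hypothetical wrapper around one engine
    clips: Sequence[bytes],
) -> list[float]:
    """Time each blocking transcription call and return wall-clock latencies in ms."""
    latencies = []
    for clip in clips:
        start = time.perf_counter()
        transcribe(clip)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies
```

Running the same set of short clips through each engine and comparing `percentile(latencies, 50)` with `percentile(latencies, 95)` gives a fairer picture than a single timed request.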
Multilingual quality
- Standard French: tie, with a word error rate (WER) below 4% on both.
- English: Whisper-large keeps a slight edge (trained on more audio).
- Standard Arabic: Voxtral mini-2602 better calibrated, fewer wonky transliterations.
- Arabic dialects: both can choke; a human fallback is wise.
- Code-switching (FR↔AR↔EN in the same sentence): Voxtral more stable.
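To check rankings like these on your own recordings, WER can be computed as the word-level edit distance between a reference transcript and the engine output, divided by the number of reference words. The sketch below is a bare-bones version that splits on whitespace only. It does not normalize case, punctuation or Arabic diacritics, which a real evaluation would usually do first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Scoring each language and each code-switched sample separately keeps a strong result in one language from hiding a weak one in another.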
Hosting and compliance
Open-source Whisper self-hosts on any recent GPU, which is great for data sovereignty. Voxtral is, by default, a Mistral API hosted in an EU region: convenient, but an external dependency. Both are GDPR-defensible; what matters is what your privacy policy commits to.
Price
Voxtral bills per audio minute ($0.005-$0.01/min depending on model, i.e. $0.30-$0.60 per hour of audio). Self-hosted Whisper costs the GPU (~$0.50-$1.50/hour depending on the cloud), and becomes the cheaper option past ~80 hours of audio/month. Below that volume, Voxtral is simpler and cheaper all-in.
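The break-even point moves a lot with how many hours the GPU is actually billed each month, so it helps to redo the arithmetic with your own numbers. This is a rough sketch that treats the GPU bill as a fixed monthly cost and ignores engineering time, storage and egress. The inputs in the usage note are illustrative assumptions, not measured figures.

```python
def api_monthly_cost(audio_hours: float, api_price_per_min: float) -> float:
    """Cost of sending `audio_hours` of audio to a per-minute billed API."""
    return audio_hours * 60 * api_price_per_min


def break_even_audio_hours(
    gpu_price_per_hour: float,
    gpu_billed_hours: float,   # hours the GPU instance is billed per month (assumption)
    api_price_per_min: float,
) -> float:
    """Monthly audio volume at which the fixed GPU bill equals the API bill."""
    gpu_monthly_cost = gpu_price_per_hour * gpu_billed_hours
    return gpu_monthly_cost / (api_price_per_min * 60)
```

For example, a $1.00/hour GPU billed for 48 hours a month against a $0.01/min API breaks even at 80 hours of audio. The same GPU left running around the clock (about 730 hours) would need a far higher volume to pay off, which is why on-demand or shared GPUs change the picture.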
Our VocazAI setup
We run a cascade: Voxtral first (per-language model), with faster-whisper medium as a self-hosted Docker fallback. The cascade holds below 600 ms p95 in trilingual production, and stays alive even when the Mistral API hiccups. The first month is free, so you can test under real call conditions.
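The general shape of such a cascade can be sketched in a few lines of Python. This is not the VocazAI implementation: `primary` and `fallback` are hypothetical callables standing in for the hosted API call and the local faster-whisper model, and the 0.4 s timeout is an illustrative value to tune against your own latency budget.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Transcriber = Callable[[bytes], str]  # hypothetical: audio bytes in, text out


def transcribe_with_fallback(
    audio: bytes,
    primary: Transcriber,
    fallback: Transcriber,
    primary_timeout_s: float = 0.4,  # illustrative budget, not a recommended value
) -> tuple[str, str]:
    """Try the primary engine; on timeout or error, use the fallback.

    Returns (text, engine_used) so callers can log how often the fallback fires.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, audio)
    try:
        return future.result(timeout=primary_timeout_s), "primary"
    except Exception:
        # Covers both a timeout and any error raised by the primary engine.
        return fallback(audio), "fallback"
    finally:
        # Do not block the caller on a primary request that is still running.
        pool.shutdown(wait=False)
```

A production version would more likely reuse one executor, cancel the in-flight request, and track the fallback rate as a health signal for the primary API.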