Mistral releases Voxtral and Voxtral TTS

Mistral AI has expanded its portfolio with two audio models: Voxtral and Voxtral TTS. The release extends the company's reach into speech processing and offers open-weight alternatives to closed, proprietary APIs.

The Voxtral models for speech understanding

Voxtral targets Automatic Speech Recognition (ASR) combined with native semantic understanding. Until now, engineering teams had to choose between open-source models with considerable error rates and expensive, restrictive vendor APIs.

Built on the Mistral Small 3.1 LLM architecture, Voxtral fills this gap by combining accurate transcription with contextual reasoning.

Key features and technical specifications include:

Two dedicated sizes: A flagship 24B variant for heavy-duty production use cases and an edge-friendly 3B version (Voxtral Mini) meant for local deployments.
Long-form context: With a 32k-token context window, the model can transcribe up to 30 minutes of continuous audio, or reason over up to 40 minutes for question answering.
Multilingual capability: Strong out-of-the-box performance in 9 widely spoken languages, with automatic language detection, translation, and summarization in a single pass.
Native function calling: Designed to power voice agents, it can turn spoken intent directly into backend API calls without a separate text-parsing step.
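To make the long-form limit concrete, here is a minimal sketch of how a recording longer than the 30-minute transcription limit could be split into request-sized chunks. The function name, the overlap strategy, and the 5-second default are assumptions for illustration, not part of any Mistral tooling.

```python
from typing import List, Tuple

# Limit quoted in the article: up to 30 minutes of audio per transcription.
MAX_TRANSCRIBE_SECONDS = 30 * 60


def plan_chunks(total_seconds: float,
                max_seconds: float = MAX_TRANSCRIBE_SECONDS,
                overlap_seconds: float = 5.0) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole recording.

    Consecutive chunks overlap slightly so that a word cut at a boundary
    appears whole in at least one chunk. The 5-second overlap is an
    arbitrary illustrative default, not a Mistral recommendation.
    """
    if total_seconds <= 0:
        return []
    if not 0 <= overlap_seconds < max_seconds:
        raise ValueError("overlap must be non-negative and shorter than a chunk")

    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while True:
        end = min(start + max_seconds, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break
        # Step back by the overlap so the boundary region is covered twice.
        start = end - overlap_seconds
    return chunks
```

A recording shorter than the limit comes back as a single chunk. Longer recordings yield several overlapping chunks, and the overlapping transcripts then need to be merged in a separate step that is not shown here.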

According to Mistral's published benchmarks, Voxtral outperforms Whisper large-v3 and models such as Gemini 2.5 Flash on many datasets, at a much lower price than closed alternatives.

Voxtral TTS: low-latency, emotive speech generation

Alongside its speech-to-text offering, Mistral has released Voxtral TTS, a text-to-speech engine built on the Ministral 3B core. The engineering focus is on human-like expressiveness with very low latency.

Standout elements of Voxtral TTS:

A three-stage architecture: A 3.4B semantic transformer, a 390M flow-matching acoustic backbone, and a 300M custom neural audio codec that streams frames at 12.5 Hz.
Very low latency: A Time-to-First-Audio (TTFA) of roughly 70 milliseconds for simple interactions, a critical metric for conversational interfaces that need to feel responsive.
Instant voice adaptation: Expressive voice cloning from a reference clip as short as 3 seconds. It also performs zero-shot cross-lingual adaptation while keeping emotional traits intact.
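The figures above lend themselves to a quick sanity check. The sketch below only does arithmetic on the numbers quoted in this article (12.5 Hz frame rate, the three component sizes). The helper names are made up for illustration.

```python
import math

# Figures quoted in the article; everything derived below is plain arithmetic.
FRAME_RATE_HZ = 12.5
COMPONENT_PARAMS_MILLIONS = {
    "semantic_transformer": 3400,
    "flow_matching_acoustic": 390,
    "neural_audio_codec": 300,
}


def frame_duration_ms(rate_hz: float = FRAME_RATE_HZ) -> float:
    """Milliseconds of audio represented by one codec frame."""
    return 1000.0 / rate_hz


def frames_for(seconds: float, rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of codec frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * rate_hz)


def total_params_billions() -> float:
    """Sum of the three quoted component sizes, in billions of parameters."""
    return sum(COMPONENT_PARAMS_MILLIONS.values()) / 1000
```

At 12.5 Hz each frame represents 80 ms of audio, so ten seconds of speech corresponds to 125 frames. The three quoted components add up to roughly 4.09B parameters. The quoted TTFA of about 70 ms is on the order of a single frame's duration.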

Licensing and access to the models

Mistral distributes these models as open weights via Hugging Face. The Voxtral 24B and 3B speech-understanding models are released under the permissive Apache 2.0 license, while the text-to-speech system (Voxtral TTS) is released under CC BY-NC 4.0, which does not permit commercial use. All functionality is also available commercially through Mistral's hosted API.

Together, Voxtral and Voxtral TTS give developers an end-to-end pipeline for building voice-first agents, which can be entirely self-hosted if needed.
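As a rough illustration of such a pipeline, the sketch below wires one conversational turn together: speech understanding produces a tool call, the tool runs, and the reply is synthesized. Everything here is a hypothetical stand-in. `fake_understand` and `fake_synthesize` are stubs rather than Mistral APIs, and the tool-call shape (a name plus JSON-encoded arguments) is an assumption modeled on common function-calling conventions.

```python
import json
from typing import Callable, Dict


def fake_understand(audio: bytes) -> dict:
    """Stub for a speech-understanding model that emits a tool call."""
    return {"name": "get_weather", "arguments": json.dumps({"city": "Paris"})}


def fake_synthesize(text: str) -> bytes:
    """Stub for TTS: returns the text as bytes instead of real audio."""
    return text.encode("utf-8")


def get_weather(city: str) -> str:
    """Example backend function; a real one would query a weather service."""
    return f"It is sunny in {city}."


# Registry mapping tool names to the backend functions they invoke.
TOOLS: Dict[str, Callable[..., str]] = {"get_weather": get_weather}


def handle_turn(audio: bytes,
                understand: Callable[[bytes], dict] = fake_understand,
                synthesize: Callable[[str], bytes] = fake_synthesize,
                tools: Dict[str, Callable[..., str]] = TOOLS) -> bytes:
    """Run one voice-agent turn: audio in, (stubbed) audio out."""
    call = understand(audio)
    tool = tools.get(call["name"])
    if tool is None:
        reply = "Sorry, I can't help with that."
    else:
        # Arguments arrive as a JSON string and are passed as keyword arguments.
        reply = tool(**json.loads(call["arguments"]))
    return synthesize(reply)
```

Passing `understand` and `synthesize` as parameters keeps the turn logic independent of where the models run, so the stubs can be swapped for calls to a hosted API or to locally served weights.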

FAQ

What license do the Voxtral models use? The Voxtral speech-understanding models are released under the Apache 2.0 license. The Voxtral TTS weights are released under the CC BY-NC 4.0 license.

How much does the Mistral API cost? TTS is priced at $0.016 per 1,000 generated characters, and audio transcription starts at $0.001 per minute.

What languages are fully supported by Voxtral TTS? The model is natively trained to handle English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
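For budgeting, the API prices quoted in this FAQ can be turned into a small estimator. This is a sketch that assumes simple linear pricing at the quoted rates. The transcription rate is described as a starting price, so actual bills may be higher, and the function names are illustrative.

```python
from decimal import Decimal

# Rates quoted in the FAQ (USD). The transcription rate is a starting price.
TTS_USD_PER_1K_CHARS = Decimal("0.016")
TRANSCRIBE_USD_PER_MINUTE = Decimal("0.001")


def tts_cost(characters: int) -> Decimal:
    """Estimated cost in USD of synthesizing `characters` characters."""
    return TTS_USD_PER_1K_CHARS * Decimal(characters) / Decimal(1000)


def transcription_cost(minutes: float) -> Decimal:
    """Estimated minimum cost in USD of transcribing `minutes` of audio."""
    # Convert via str so that float inputs such as 2.5 stay exact.
    return TRANSCRIBE_USD_PER_MINUTE * Decimal(str(minutes))
```

At these rates, synthesizing 10,000 characters comes to $0.16, and transcribing an hour of audio comes to at least $0.06. `Decimal` is used instead of floats to avoid rounding drift when many small amounts are summed.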
