
Audio
Lightweight, high-quality text-to-speech (TTS) model built for fast, natural voice generation.
Small but Capable. A 4B-parameter model designed to deliver strong speech quality while staying cheap and fast to run. It strikes a good balance between latency, cost, and output quality.
Natural Speech Output. Produces clear, expressive audio with good pacing and pronunciation, suitable for real-world applications like assistants, narration, and UI voice.
Low Latency. Optimized for fast inference, making it practical for real-time or near-real-time use cases.
Simple TTS Pipeline. Takes text as input and outputs speech directly - no complex multi-stage pipeline required.
Production Friendly. Lightweight enough to deploy on modest hardware while still delivering consistent, usable voice output.
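To make the single-stage, text-in/audio-out shape concrete, here is a minimal Python sketch. It does not call the model: `placeholder_synthesize` is a hypothetical stand-in that emits a test tone, and the 24 kHz, 16-bit mono format is an assumption rather than a documented property of Voxtral-4B-TTS. Only the overall flow (one call that takes text and returns playable audio) reflects the description above.

```python
import io
import math
import struct
import wave
from typing import Callable

# Assumed output format; check the model's actual sample rate and bit depth.
SAMPLE_RATE = 24_000


def placeholder_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for the real model call.

    Returns 16-bit mono PCM: a 440 Hz tone lasting 50 ms per character,
    so the pipeline can be exercised without a model or network access.
    """
    n_samples = (SAMPLE_RATE // 20) * len(text)  # 50 ms per character
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE))
        for i in range(n_samples)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


def text_to_wav(
    text: str,
    synthesize: Callable[[str], bytes] = placeholder_synthesize,
) -> bytes:
    """Single-stage pipeline: text in, WAV bytes out."""
    if not text.strip():
        raise ValueError("text must not be empty")
    pcm = synthesize(text)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    with open("hello.wav", "wb") as f:
        f.write(text_to_wav("Hello from the TTS sketch."))
```

In a real integration, you would replace `placeholder_synthesize` with whatever client or runtime call your deployment exposes for the model; the rest of the function would stay the same as long as that call returns raw PCM in the assumed format.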
Voxtral-4B-TTS is a solid default for adding voice to your product. It’s fast, cheap, and good enough for most use cases, from powering voice assistants to reading responses aloud. Use it when you need reliable TTS without overengineering or heavy infrastructure.