Voxtral – Frontier open source speech understanding models

156•meetpateltech•6mo ago

Comments

danelski•6mo ago

They claim to undercut competitors of similar quality by half for both models, yet they released both under Apache 2.0 instead of following the "smaller model open, larger model closed" strategy used for their last releases. What's different here?

Havoc•6mo ago

Probably not looking to compete directly in the transcription space.

wmf•6mo ago

They're working on a bunch of features so maybe those will be closed. I guess they're feeling generous on the base model.

halJordan•6mo ago

They didn't release Voxtral Large, so your question doesn't really make sense.

danelski•6mo ago

It's about what their top offering is at the moment, not having Large in name. Mistral Medium 3 is notably not Mistral Large 3, but it was released as API-only.

homarp•6mo ago

Weights: https://huggingface.co/mistralai/Voxtral-Mini-3B-2507 and https://huggingface.co/mistralai/Voxtral-Small-24B-2507

homarp•6mo ago

Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.

Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
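
As a rough sanity check on these figures, here is a minimal sketch that estimates the memory taken by the weights alone. It assumes 2 bytes per parameter (bf16/fp16) and uses the nominal parameter counts from the model names, which may differ from the exact counts; the helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts read off the model names (assumption: the exact
# counts are somewhat different). Quoted totals come from the comment above.
models = [
    ("Voxtral-Mini-3B-2507", 3e9, 9.5),
    ("Voxtral-Small-24B-2507", 24e9, 55.0),
]

for name, params, quoted_gb in models:
    weights_gb = weight_memory_gb(params)
    other_gb = quoted_gb - weights_gb
    print(f"{name}: ~{weights_gb:.0f} GB weights, ~{other_gb:.1f} GB other")
```

The weights account for roughly 6 GB and 48 GB respectively. The remainder of the quoted figures is presumably activations, cache, and runtime overhead, though the thread does not break it down.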

GaggiX•6mo ago

There is also a Voxtral Small 24B model available for download: https://huggingface.co/mistralai/Voxtral-Small-24B-2507

lostmsu•6mo ago

My Whisper v3 Large Turbo is $0.001/min, so their price comparison is not exactly perfect.

ImageXav•6mo ago

How did you achieve that? I was looking into it and $0.006/min is quoted everywhere.

lostmsu•6mo ago

Harvesting idle compute. https://borgcloud.org/speech-to-text
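
To put the two per-minute prices quoted in this sub-thread side by side, a small sketch; only the $0.001/min and $0.006/min figures come from the comments, and the function name `cost_for_audio` is hypothetical.

```python
def cost_for_audio(price_per_minute: float, minutes: float) -> float:
    """Total transcription cost in USD for a given amount of audio."""
    return price_per_minute * minutes


# Per-minute prices as quoted in the comments above.
quoted_prices = {
    "lostmsu's Whisper v3 Large Turbo": 0.001,
    "commonly quoted price": 0.006,
}

hours_of_audio = 100
for label, price in quoted_prices.items():
    total = cost_for_audio(price, hours_of_audio * 60)
    print(f"{label}: ${total:.2f} for {hours_of_audio} hours")
```

At these prices the gap is a factor of six, regardless of how much audio is transcribed.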

BetterWhisper•6mo ago

Do you support speaker recognition?

lostmsu•6mo ago

No. I found models doing that unreliable when there are many speakers.

4b11b4•6mo ago

This is your service?

lostmsu•6mo ago

Yes

lostmsu•6mo ago

Does it support realtime transcription? What is the approximate latency?

rolisz•6mo ago

Unlikely. The small model is much larger than Whisper (which is already hard to use for realtime).

ipsum2•6mo ago

24B is crazy expensive for speech transcription. Conspicuously, there is no comparison with Parakeet, a 600M-parameter model that's currently dominating leaderboards (but only for English).

azinman2•6mo ago

But it also includes world knowledge, can do tool calls, etc. It’s an omni model.

qwertox•6mo ago

Only the Mini is meant for pure transcription. And in the tests I just ran on their API, comparing to Whisper large, it is around three times faster, more accurate, and cheaper.

As the sibling comment says, the 24B is an omni model; it can also do function calling.

sheerun•6mo ago

In the demo they mention that the Polish pronunciation is pretty bad, spoken as if it were the second language of a native English speaker. I wonder if it's the same for other languages. On the other hand, whispered English is hilariously good, especially the different emotions.

Raed667•6mo ago

It is insane how good the "French man speaking English" demo is. It captures a lot of subtleties.

potlee•6mo ago

That’s an actual French man speaking English.

kamranjon•6mo ago

I’m pretty excited to play around with this. I’ve worked with Whisper quite a bit, and it’s awesome to have another model in the same class, and from Mistral, who tend to be very open. I’m sure Unsloth is already working on some GGUF quants. I will probably spin it up tomorrow and try it on some audio.

vivalapomy•6mo ago

I won't comment on the 24B model, as I see no use for it personally, but for pure ASR tasks I honestly can't see Voxtral taking off. For personal use, I've been running a quant of Whisper tiny (for English) as well as Whisper small (for Spanish, my native language), and have never experienced major latency when using them for globally available voice commands. Considering that my machine runs an Ivy Bridge processor with CPU inference, the pricing seems unreasonable.
