When explanations get posted directly in HN comments, I imagine someone somewhere in the world is able to learn in spite of their Internet restrictions/firewalls.
People will also post their own interpretations in response to comments, and quickly find out they missed something.
… But if you try to automate it, like including a summary under every HN post, you encourage laziness too much and pre-chew too heavily. There's a balance here.
[on topic]
(OK I’m done making excuses, time to read the article… thanks for the encouragement!)
I thought this was not explained in the readme directly, but in fact I missed it. I wasn’t going to read Microsoft's entire changelog! But it was substantive; thanks to a sibling commenter:
“2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.”
The 'spontaneous singing' TTS example clip in the repo is creepy as fuck.
Edit: I'm talking purely about speech-to-text (STT). Not sure about the other things this can do.
https://github.com/microsoft/VibeVoice/commit/e73d1e17c3754f...
which is Microsoft for "we removed two dead links". AI innovation knows no limits!
- Cohere Transcribe (self hosted)
- Grok Speech To Text (they provide an API, only $0.10/hr!)
They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech-to-text app?
This ship has sailed. It’s now in the same category as hacker/cracker and the pronunciation of GIF.
Way early on (spring 2023) people tried to stop it, but no luck.
Then there's "Smart" in front of Car, Phone, TV, and so on, each meaning something different.
I do think "Open Weight" should be more commonly used. On the other hand, there are definitely communities that spring up to build training and inference infrastructure around open models.
ElevenLabs in the cloud.
My conclusion back then (which came only from shallow research on the topic and zero real experience, mind you) was that Whisper + Pyannote was the "stable" approach.
Have the VibeVoice, Voxtral, Qwen, or NeMo solutions caught up on segmentation and speaker recognition?
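For context on why Whisper + Pyannote gets called the "stable" approach: neither tool does the other's job, so the pipeline is diarize with pyannote, transcribe with Whisper, then merge the two timelines by time overlap. Here's a minimal sketch of just that merge step — the segment/turn dict shapes are my own assumption for illustration, not either library's actual output format:

```python
# Hypothetical merge step for a Whisper + pyannote pipeline: pyannote yields
# speaker turns (who spoke when), Whisper yields timed transcript segments,
# and we label each segment with the speaker whose turn overlaps it most.

def overlap(a_start, a_end, b_start, b_end):
    """Length (seconds) of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(transcript_segments, diarization_turns):
    """Attach a 'speaker' label to each transcript segment by max overlap."""
    labeled = []
    for seg in transcript_segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in diarization_turns:
            o = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if o > best_overlap:
                best_speaker, best_overlap = turn["speaker"], o
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

# Toy data in the assumed shapes:
segments = [
    {"start": 0.0, "end": 2.5, "text": "Hi, thanks for joining."},
    {"start": 2.6, "end": 5.0, "text": "Glad to be here."},
]
turns = [
    {"start": 0.0, "end": 2.4, "speaker": "SPEAKER_00"},
    {"start": 2.5, "end": 5.2, "speaker": "SPEAKER_01"},
]
print(assign_speakers(segments, turns))
```

The brittle parts in practice are exactly what this sketch glosses over: overlapping speech, segments that straddle a speaker change, and drift between the two tools' timestamps — which is presumably what "caught up in segmentation and speaker recognition" is asking about for the end-to-end models.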
https://cyberpress.org/microsoft-store-app-vibing-exe-accuse...
Why?