A high-quality, affordable TTS model that can consistently nail medical terminology while maintaining an American accent has been frustratingly elusive.
But AFAIK, even with just a few hours of audio containing the specific terminology (and correct pronunciations), fine-tuning on that data will significantly improve performance.
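Roughly, the shape of it (a purely hypothetical sketch, since there's no standard fine-tuning recipe here; `TTSModel` and `TTSDataset` are placeholder names, not a real library):

```python
# Hypothetical fine-tuning sketch: a few hours of (audio, transcript)
# pairs containing the target medical terms, then continued training at
# a low learning rate. TTSModel and TTSDataset are placeholders; every
# open TTS repo names these differently.
import torch
from torch.utils.data import DataLoader

model = TTSModel.from_pretrained("base-tts-checkpoint")  # placeholder API
dataset = TTSDataset(
    audio_dir="medical_clips/",     # clips with correct pronunciations
    transcripts="transcripts.tsv",  # text aligned to each clip
)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR to avoid catastrophic forgetting
model.train()
for epoch in range(3):  # a few passes is usually plenty for a few hours of audio
    for batch in loader:
        loss = model(text=batch["text"], audio=batch["audio"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```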
Also, you don't need to explicitly create and activate a venv if you're using uv - it deals with that nonsense itself. Just `uv sync`.
We're aware that you don't need to create a venv when uv is already installed. We just added it for people spinning up new GPU instances in the cloud. But I'll update the README to make that a bit clearer. Thanks for the feedback :)
The full version of Dia requires around 10GB of VRAM to run.
If you have 16GB of VRAM, I guess you could run a 3B param model alongside it, though realistically probably only a 1B param model with a reasonable context window.
We've seen Bark from Suno go from a 16GB requirement to a 4GB requirement, plus running on CPUs. It won't be too hard; we just need some time to work on it.
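For rough intuition on why that's plausible here too: Dia is a 1.6B-parameter model, and weight memory scales linearly with bytes per parameter (this counts weights only; activations and the audio codec are why the full-precision model lands near 10GB):

```python
# Back-of-envelope weight memory for a 1.6B-parameter model at
# different precisions. Activations, caches, and the audio codec add
# overhead on top of these numbers.
PARAMS = 1.6e9

for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{name:>5}: ~{gib:.1f} GiB of weights")

# fp32: ~6.0 GiB
# fp16: ~3.0 GiB
# int8: ~1.5 GiB
# int4: ~0.7 GiB
```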
Time to first audio is crucial for us to keep latency down. Wondering if Dia works with output streaming?
The Python code snippet seems to imply that the entire audio is generated in one go before anything is returned?
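For context, here's the contrast I mean (assuming the repo's `Dia.from_pretrained`/`generate` API as shown in its README; `generate_stream` is made up):

```python
import soundfile as sf
from dia.model import Dia  # API as shown in the repo README

model = Dia.from_pretrained("nari-labs/Dia-1.6B")
script = "[S1] Hello there. [S2] Hi, how are you?"

# Today: generate() returns the complete waveform only after the whole
# script is synthesized, so time-to-first-audio equals total generation time.
audio = model.generate(script)
sf.write("out.wav", audio, 44100)

# What we'd want: a hypothetical streaming API (generate_stream does
# not exist in Dia) that yields chunks as decoding progresses.
# for chunk in model.generate_stream(script, chunk_ms=200):
#     playback_buffer.write(chunk)
```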
It's way past bedtime where I live, so will be able to get back to you after a few hours. Thanks for the interest :)
I've tried "EPUB to audiobook" tools, but they're miles behind what a real narrator accomplishes, and the result is impossible to engage with.
> [S1] Oh fire! Oh my goodness! What's the procedure? What do we do people? The smoke could be coming through an air duct!
Seriously impressive. Wish I could direct link the audio.
Kudos to the Dia team.
I would absolutely love something like this for practicing Chinese, or even just adding Chinese dialogue to a project.
Insane how much low-hanging fruit there is for audio models right now. A team of two picking things up over a few months can build something that still competes with large players with tons of funding.
toebee•2h ago
Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.
It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.
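Rough usage sketch (check the repo README for exact argument names; `audio_prompt_path` here is illustrative):

```python
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Speaker turns are tagged inline with [S1]/[S2], and the whole
# conversation is generated in one pass rather than per-turn.
script = "[S1] Did you hear the news? [S2] No, what happened? [S1] We open sourced the model!"
audio = model.generate(script)
sf.write("dialogue.wav", audio, 44100)

# Voice/emotion conditioning: pass a reference clip and the model
# continues in that style (argument name is illustrative).
audio = model.generate(script, audio_prompt_path="reference_voice.wav")
```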
Demo page comparing it to ElevenLabs and Sesame-1B: https://yummy-fir-7a4.notion.site/dia
We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast-feel with APIs but it did not sound like human conversations.
So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch: from large-scale training to audio tokenization. It took us a bit over 3 months.
Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.
We’d love to hear what you think! We are a tiny team, so open source contributions are extra welcome. Please feel free to check out the code and share any thoughts or suggestions with us.
new_user_final•1h ago
The example voices seem overly loud and overexcited, like Andrew Tate, Speed, or an advertisement. It's lacking calm, normal conversation or normal podcast-like interaction.