A high-quality, affordable TTS model that can consistently nail medical terminology while maintaining an American accent has been frustratingly elusive.
But AFAIK, even with just a few hours of audio containing the specific terminology (and correct pronunciations), fine-tuning on that data will significantly improve performance.
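Roughly, the shape of it (a purely hypothetical sketch, since there's no standard fine-tuning recipe here; `TTSModel` and `TTSDataset` are placeholder names, not a real library):

```python
# Hypothetical fine-tuning sketch: a few hours of (audio, transcript)
# pairs containing the target medical terms, then continued training at
# a low learning rate. TTSModel and TTSDataset are placeholders; every
# open TTS repo names these differently.
import torch
from torch.utils.data import DataLoader

model = TTSModel.from_pretrained("base-tts-checkpoint")  # placeholder API
dataset = TTSDataset(
    audio_dir="medical_clips/",     # clips with correct pronunciations
    transcripts="transcripts.tsv",  # text aligned to each clip
)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR to avoid catastrophic forgetting
model.train()
for epoch in range(3):  # a few passes is usually plenty for a few hours of audio
    for batch in loader:
        loss = model(text=batch["text"], audio=batch["audio"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```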
Also, you don't need to explicitly create and activate a venv if you're using uv - it deals with that nonsense itself. Just `uv sync`.
We're aware that you don't need to create a venv when uv is already installed. We just added it for people spinning up new GPU instances in the cloud. But I'll update the README to make that a bit clearer. Thanks for the feedback :)
The full version of Dia requires around 10GB of VRAM to run.
If you have 16GB of VRAM, I guess you could run a 3B param model alongside it, though realistically probably only a 1B param model with a reasonable context window.
We've seen Bark from Suno go from a 16GB requirement to a 4GB requirement, plus running on CPUs. It won't be too hard; we just need some time to work on it.
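For rough intuition on why that's plausible here too: Dia is a 1.6B-parameter model, and weight memory scales linearly with bytes per parameter (this counts weights only; activations and the audio codec are why the full-precision model lands near 10GB):

```python
# Back-of-envelope weight memory for a 1.6B-parameter model at
# different precisions. Activations, caches, and the audio codec add
# overhead on top of these numbers.
PARAMS = 1.6e9

for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{name:>5}: ~{gib:.1f} GiB of weights")

# fp32: ~6.0 GiB
# fp16: ~3.0 GiB
# int8: ~1.5 GiB
# int4: ~0.7 GiB
```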
Time to first audio is crucial for us to keep latency down. Wondering if Dia works with output streaming?
The Python code snippet seems to imply that the entire audio is generated in one go before anything is returned?
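For context, here's the contrast I mean (assuming the repo's `Dia.from_pretrained`/`generate` API as shown in its README; `generate_stream` is made up):

```python
import soundfile as sf
from dia.model import Dia  # API as shown in the repo README

model = Dia.from_pretrained("nari-labs/Dia-1.6B")
script = "[S1] Hello there. [S2] Hi, how are you?"

# Today: generate() returns the complete waveform only after the whole
# script is synthesized, so time-to-first-audio equals total generation time.
audio = model.generate(script)
sf.write("out.wav", audio, 44100)

# What we'd want: a hypothetical streaming API (generate_stream does
# not exist in Dia) that yields chunks as decoding progresses.
# for chunk in model.generate_stream(script, chunk_ms=200):
#     playback_buffer.write(chunk)
```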
It's way past bedtime where I live, so will be able to get back to you after a few hours. Thanks for the interest :)
I've tried "EPUB to audiobook" tools, but they're miles behind what a real narrator accomplishes, and the result is impossible to engage with.
> [S1] Oh fire! Oh my goodness! What's the procedure? What do we do people? The smoke could be coming through an air duct!
Seriously impressive. Wish I could direct link the audio.
Kudos to the Dia team.
I would absolutely love something like this for practicing Chinese, or even just adding Chinese dialogue to a project.
Insane how much low-hanging fruit there is for audio models right now. A team of two picking things up over a few months can build something that still competes with large players with tons of funding.
toebee•2h ago
Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.
It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.
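Rough usage sketch (check the repo README for exact argument names; `audio_prompt_path` here is illustrative):

```python
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Speaker turns are tagged inline with [S1]/[S2], and the whole
# conversation is generated in one pass rather than per-turn.
script = "[S1] Did you hear the news? [S2] No, what happened? [S1] We open sourced the model!"
audio = model.generate(script)
sf.write("dialogue.wav", audio, 44100)

# Voice/emotion conditioning: pass a reference clip and the model
# continues in that style (argument name is illustrative).
audio = model.generate(script, audio_prompt_path="reference_voice.wav")
```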
Demo page comparing it to ElevenLabs and Sesame-1B: https://yummy-fir-7a4.notion.site/dia
We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast-feel with APIs but it did not sound like human conversations.
So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch: from large-scale training to audio tokenization. It took us a bit over 3 months.
Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.
We’d love to hear what you think! We are a tiny team, so open source contributions are extra welcome. Please feel free to check out the code and share any thoughts or suggestions with us.
new_user_final•1h ago
The example voices seem overly loud and overexcited, like Andrew Tate, Speed, or an advertisement. It's lacking calm, normal conversation or normal podcast-like interaction.