Show HN: Dia, an open-weights TTS model for generating realistic dialogue

2025-04-21 17:07 · 652 points · 190 comments · github.com

A TTS model capable of generating ultra-realistic dialogue in one pass. - nari-labs/dia

Dia is a 1.6B-parameter text-to-speech model created by Nari Labs.

Dia directly generates highly realistic dialogue from a transcript. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communication such as laughter, coughing, and throat clearing.

To accelerate research, we are providing access to pretrained model checkpoints and inference code. The model weights are hosted on Hugging Face.

We also provide a demo page comparing our model to ElevenLabs Studio and Sesame CSM-1B.

  • Join our Discord server for community support and access to new features.
  • Play with a larger version of Dia: generate fun conversations, remix content, and share with friends. đź”® Join the waitlist for early access.

Running the commands below will open a Gradio UI that you can work with.

git clone https://github.com/nari-labs/dia.git
cd dia
python -m venv .venv
source .venv/bin/activate
pip install uv
uv run app.py
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."

output = model.generate(text)
sf.write("simple.mp3", output, 44100)
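The audio conditioning mentioned above works by pairing a reference clip with its transcript. A minimal sketch of what that might look like; the audio_prompt_path keyword and the convention of prepending the reference transcript are assumptions based on the repository's voice-cloning example, so check that example for the exact interface:

import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Transcript of the reference clip, followed by the new lines to generate.
prompt_transcript = "[S1] This is the reference voice and tone to imitate."
script = " [S2] And this is new dialogue generated in that style. (laughs)"

# audio_prompt_path is an assumed keyword, not confirmed by the README above.
output = model.generate(prompt_transcript + script, audio_prompt_path="reference.mp3")
sf.write("conditioned.mp3", output, 44100)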

A PyPI package and a working CLI tool will be available soon.

Dia has only been tested on GPUs (PyTorch 2.0+, CUDA 12.6). CPU support will be added soon. The initial run will take longer because the Descript Audio Codec also needs to be downloaded.

On enterprise GPUs, Dia can generate audio in real time. On older GPUs, inference will be slower. For reference, on an A4000 GPU, Dia generates roughly 40 tokens/s; since 86 tokens correspond to 1 second of audio, that works out to about 0.47x real time (roughly 2.2 seconds of compute per second of audio). torch.compile will increase speeds on supported GPUs.
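To check the real-time factor on your own hardware, here is a minimal timing sketch using only the generate call from the quickstart above; it assumes the output is a mono waveform at 44.1 kHz, as in that example:

import time
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")
text = "[S1] Quick benchmark line. [S2] How fast does this come out? (laughs)"

start = time.perf_counter()
output = model.generate(text)  # same call as in the quickstart
elapsed = time.perf_counter() - start

audio_seconds = len(output) / 44100  # 44.1 kHz output, as in the quickstart
print(f"{audio_seconds:.1f}s of audio in {elapsed:.1f}s "
      f"-> real-time factor {audio_seconds / elapsed:.2f}x")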

The full version of Dia requires around 10GB of VRAM to run. We will be adding a quantized version in the future.
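Not part of the Dia repo, but a quick pre-flight check in plain PyTorch that a CUDA device with roughly that much memory is available before loading the model:

import torch

# Fail early if no CUDA device is present (CPU support is not available yet).
if not torch.cuda.is_available():
    raise SystemExit("Dia currently requires a CUDA-capable GPU.")

# The full model needs around 10GB of VRAM according to the README above.
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"{torch.cuda.get_device_name(0)}: {total_gb:.1f} GB VRAM")
if total_gb < 10:
    print("Warning: less than ~10 GB of VRAM; the full model may not fit.")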

If you don't have hardware available or if you want to play with bigger versions of our models, join the waitlist here.

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

This project offers a high-fidelity speech generation model intended solely for research and educational use. The following uses are strictly forbidden:

  • Identity Misuse: Do not produce audio resembling real individuals without permission.
  • Deceptive Content: Do not use this model to generate misleading content (e.g., fake news).
  • Illegal or Malicious Use: Do not use this model for activities that are illegal or intended to cause harm.

By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We are not responsible for any misuse and firmly oppose any unethical usage of this technology.

Planned work:

  • Docker support.
  • Optimize inference speed.
  • Add quantization for memory efficiency.

We are a tiny team of one full-time and one part-time research engineer. Contributions are extra welcome! Join our Discord server for discussions.



Comments

  • By sebstefan 2025-04-22 9:07 · 5 replies

    I inserted the non-verbal command "(pauses)" in the middle of a sentence and I think I caused it to have an aneurysm.

    https://i.horizon.pics/4sEVXh8GpI (27s)

    It starts with an intro, too. Really strange

    • By antiraza 2025-04-22 13:25

      That was... amazing.

    • By abrookewood 2025-04-22 12:10

      That's certainly unusual ...

    • By cchance 2025-04-26 2:47

      [pauses] i think i heard demons

    • By throwaway-alpha 2025-04-22 10:47 · 3 replies

      I have a hunch they're pulling data from radio shows to give it that "high quality" vibe. Tried running it through this script and hit some weird bugs too:

          [S1] It really sounds as if they've started using NPR to source TTS models
          [S2] Yeah... yeah... it's kind of disturbing (laughs dejectedly).
          [S3] I really wish, that they would just Stop with this.
      
      https://i.horizon.pics/Tx2PrPTRM3

      • By yahoozoo 2025-04-23 10:44

        The “Yeah…” followed by an uncomfortably long pause then a second “Yeah…” killed me.

      • By degosuke 2025-04-22 16:36 · 2 replies

        It even added an extra f-word at the end. Still veeery impressive

        • By sebstefan 2025-04-23 7:26

          Oh brother imagine using this and noticing 4 months in that it randomly throws f-words for flavor at your users

        • By yencabulator 2025-04-23 21:45

          I think it also said "maaan" at the end of the previous line. And speaks out loud the "dejectedly".

      • By xdfgh1112 2025-04-22 17:38

        Just noticed he says dejectedly too

    • By bt1a 2025-04-22 13:12

      You hittin them balloons again, mate ?

  • By hemloc_io 2025-04-21 19:22 · 3 replies

    Very cool!

    Insane how much low-hanging fruit there is for audio models right now. A team of two picking things up over a few months can build something that still competes with large players with tons of funding.

    • By miki123211 2025-04-22 8:50 · 2 replies

      Yeah, Eleven Labs must be raking it in.

      You can get hours of audio out of it for free with Eleven Reader, which suggests that their inference costs aren't that high. Meanwhile, those same few hours of audio, at the exact same quality, would cost something like $100 when generated through their website or API, a lot more than any other provider out there. Their pricing (and especially API pricing) makes no sense, not unless it's just price discrimination.

      Somebody with slightly deeper pockets than academics or one guy in a garage needs to start competing with them and drive costs down.

      Open TTS models don't even seem to utilize audiobooks or data scraped off the internet, most are still Librivox / LJ Speech. That's like training an LLM on just Wikipedia and expecting great results. That may have worked in 2018, but even in 2020 we knew better, not to mention 2025.

      TTS models never had their "Stable Diffusion moment", it's time we get one. I think all it would take is somebody doing open-weight models applying the lessons we learned from LLMs and image generation to TTS models, namely more data, more scraping, more GPUs, less qualms and less safety. Eleven Labs already did, and they're profiting from it handsomely.

      • By pzo 2025-04-22 10:58 · 2 replies

        Kokoro gives great results, especially when speaking English. The model is small enough to run even on a smartphone, ~3x faster than real time.

        • By miki123211 2025-04-23 9:57 · 1 reply

          Kokoro just proves my point; it's "one guy in a garage", 1000 hours of distilled audio (I think) and ~100m params.

          With the budget one tenth that of Stable Diffusion and less ethical qualms, you could easily 10x or 100x this.

          • By cchance 2025-04-26 2:48

            I'm actually surprised people aren't just using elevenreader to generate solid content from various books for datasets lol

        • By bavell 2025-04-22 11:56

          Another +1 to Kokoro from me, great quality with good speed.

      • By bazlan 2025-04-22 9:31

        [dead]

    • By toebee 2025-04-21 22:54

      Thank you for the kind words <3

    • By kreelman 2025-04-22 2:15 · 1 reply

      This is amazing. Is it possible to build in a chosen voice, a bit like Eleven Labs does? ...This may be on the git summary, being lazy and asking anyway :=) Thanks for your work.

  • By Versipelle 2025-04-21 18:27 · 2 replies

    This is really impressive; we're getting close to a dream of mine: the ability to generate proper audiobooks from EPUBs. Not just a robotic single voice for everything, but different, consistent voices for each protagonist, with the LLM analyzing the text to guess which voice to use and add an appropriate tone, much like a voice actor would do.

    I've tried "EPUB to audiobook" tools, but they are really miles behind what a real narrator accomplishes and make the audiobook impossible to engage with

    • By mclau157 2025-04-21 20:01 · 1 reply

      Realistic voice acting for audio books, realistic images for each page, realistic videos for each page, oh wait I just created a movie, maybe I can change the plot? Oh wait I just created a video game

      • By hleszek 2025-04-22 6:52

        Now do it in VR and make it fully interactive.

    • By azinman2 2025-04-21 20:16 · 6 replies

      Wouldn’t it be more desirable to hear an actual human on an audiobook? Ideally the author?

      • By Versipelle 2025-04-21 21:39

        > Wouldn’t it be more desirable to hear an actual human on an audiobook? Ideally the author?

        Of course, but it's not always available.

        For example, I would love an audiobook for Stanisław Lem's "The Invincible," as I just finished its video game adaptation, yet it simply doesn't exist in my native language.

        It's quite seldom that the author narrates the audiobooks I listen to, and sometimes the narrator does a horrible job, butchering the characters with exaggerated tones.

      • By satvikpendem 2025-04-22 6:50 · 1 reply

        Why a human? There are many cases where I like a book but dislike the audiobook speaker, so I essentially can't listen to that book anymore. With a machine, I can tweak the voice to my heart's content.

        • By iamsaitam 2025-04-22 7:47 · 1 reply

          And get a completely wrong/bland but custom read of the book. Reading is much more than simply transforming text to audio.

          • By satvikpendem 2025-04-22 13:41

            Sometimes, I don't care if it's bland, I just want to listen to the text. There are a lot of Asian light novels for example which never get English audiobooks, and I've listened to many of them with basic TTS, not even an AI model TTS like these more recent ones, and I thoroughly enjoyed these books even still.

      • By ks2048 2025-04-21 23:45

        With 1M+ new books every year, that’s not possible for all but the few most popular.

      • By senordevnyc 2025-04-21 21:09 · 1 reply

        Honestly, I’d say that’s true only for the author. Anyone else is just going to be interpreting the words to understand how to best convey the character / emotion / situation / etc., just like an AI will have to do. If an AI can do that more effectively than a human, why not?

        The author could be better, because they at least have other info beyond the text to rely on, they can go off-script or add little details, etc.

        • By DrSiemer 2025-04-21 21:36 · 1 reply

          As somebody who has listened to hundreds of audiobooks, I can tell you authors are generally not the best choice to voice their own work. They may know every intent, but they are writers, not actors.

          The most skilled readers will make you want to read books _just because they narrated them_. They add a unique quality to the story, that you do not get from reading yourself or from watching a video adaptation.

          Currently I'm in The Age of Madness, read by Steven Pacey. He's fantastic. The late Roy Dotrice is worth a mention as well, for voicing Game of Thrones and claiming the Guinness world record for most distinct voices (224) in one series.

          It will be awesome if we can create readings automatically, but it will be a while before TTS can compete with the best readers out there.

          • By azinman2 2025-04-21 23:27

            I’d suggest even if the TTS sounded good, I’d still rather a human because:

            1. It’s a job that seems worthwhile to support, especially as it’s “practice” that only adds to a lifetime of work and improves their central skill set

            2. A voice actor will bring their own flair, just like any actor does to their job

            3. They (should) prepare for the book, understanding what it’s about in its entirety, and bring that context to the reading

      • By fennecfoxy 2025-04-23 15:48

        It'd be nice if there were mainstream releases on GBC/GBA/PSP again too! But apparently if there's no money in something then people don't really wanna do it.

      • By cchance 2025-04-22 2:53

        You really think people writing these papers actually have good speaking voices? LOL, there's a reason not everyone could be an audiobook maker or podcaster; a lot of people's voices suck for audiobooks.
