Comments

  • By simonw 2026-01-22 17:22 | 9 replies

    If you want to try out the voice cloning yourself you can do that at this Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS - switch to the "Voice Clone" tab, paste in some example text and use the microphone option to record yourself reading that text - then paste in other text and have it generate a version of that read using your voice.

    I shared a recording of audio I generated with that here: https://simonwillison.net/2026/Jan/22/qwen3-tts/
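
    If you'd rather drive the Space from Python, gradio_client can call it programmatically. This is a hypothetical sketch: the api_name and argument order below are guesses, so run client.view_api() to see the real signature of the Voice Clone endpoint.

      from gradio_client import Client, handle_file

      client = Client("Qwen/Qwen3-TTS")
      result = client.predict(
          handle_file("me_reading_sample.wav"),  # reference recording of your voice
          "Text for the cloned voice to read.",  # target text
          api_name="/voice_clone",               # assumed endpoint name, check view_api()
      )
      print(result)  # path to the generated audio file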

    • By javier123454321 2026-01-22 17:38 | 7 replies

      This is terrifying. With this and z-image-turbo, we've crossed a chasm, and a very deep one. We are currently protected by screens: we can, and should, assume everything behind a screen is fake unless rigorously (and systematically, i.e. cryptographically) proven otherwise. We're sleepwalking into this, and not enough people know about it.

      • By rdtsc 2026-01-22 17:59 | 4 replies

        That was my thought too. You’d have “loved ones” calling with their faces and voices asking for money in some emergency. But you’d also have plausible deniability as anything digital can be brushed off as “that’s not evidence, it could be AI generated”.

        • By rpdillon 2026-01-22 22:20 | 1 reply

          Only if you focus on the form instead of the content. For a long time my family has had secret words and phrases we use to identify ourselves to each other over secure, but unauthenticated, channels (i.e. the channel is encrypted, but the source is unknown). The military has had to deal with this for some time, and developed various forms of IFF that allies could use to identify themselves, e.g. for returning aircraft, a sequence of wing movements that identified you as friendly. I think for a small group (in this case, loved ones), this could be one mitigation of that risk. My parents did this with me as a kid, ostensibly as a defense against some other adult saying "My mom sent me to pick you up...". I never did hear of that happening, though.
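
          The family passphrase is essentially a low-tech challenge-response protocol. Here's a minimal sketch of the formalized version, assuming a pre-shared secret (FAMILY_SECRET below is a made-up placeholder): the caller proves they know the secret without ever saying it over the channel, which a cloned voice alone can't do.

            import hmac, hashlib, secrets

            FAMILY_SECRET = b"correct horse battery staple"  # agreed in person, never spoken on a call

            # Verifier: issue a fresh random challenge for this call.
            challenge = secrets.token_hex(8)

            # Caller: answer with a MAC over the challenge; hearing the voice reveals nothing.
            response = hmac.new(FAMILY_SECRET, challenge.encode(), hashlib.sha256).hexdigest()[:8]

            # Verifier: recompute and compare in constant time.
            expected = hmac.new(FAMILY_SECRET, challenge.encode(), hashlib.sha256).hexdigest()[:8]
            assert hmac.compare_digest(response, expected)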

          • By nineteen999 2026-01-25 6:26

            That sounds way too complicated. I get around that by just not having any family any more.

        • By plagiarist 2026-01-23 4:30 | 1 reply

          For now you could ask them to turn away from the camera while keeping their eyes open. If they are a Z-Image they will instantly snap their head to face you.

          • By muggermuch 2026-01-23 8:43

            This scenario is oddly terrifying.

        • By aprilthird2021 2026-01-23 3:26 | 1 reply

          > as anything digital can be brushed off as “that’s not evidence, it could be AI generated”.

          This won't change anything about Western-style courts, which have always required an unbroken chain of custody for evidence to be admissible in court.

          • By cwillu 2026-01-23 5:58 | 1 reply

            Courts account for a vanishingly small proportion of most people's lives.

        • By neevans 2026-01-22 18:14 | 2 replies

          This was already possible with Chatterbox for a long while.

          • By freedomben 2026-01-22 18:33

            Yep, this has been the reality now for years. Scammers have already had access to it. I remember an article years ago about a grandma who wired her life savings to a scammer who claimed to have her granddaughter held hostage in a foreign country. Turns out they had just cloned the granddaughter's voice from Facebook data and knew her schedule, so they timed it for when she would be unreachable by phone.

          • By DANmode 2026-01-22 18:52

            or anyone who refuses to use hearing aids.

      • By oceanplexian 2026-01-22 20:13 | 4 replies

        > This is terrifying.

        Far more terrifying is Big Tech having access to a closed version of the same models, in the hands of powerful people with a history of unethical behavior (i.e. Zuckerberg's "Dumb Fucks" comments). In fact it's a miracle and a bit ironic that the Chinese would be the ones to release a plethora of capable open source models, instead of the scraps like we've seen from Google, Meta, OpenAI, etc.

        • By mrandish 2026-01-23 5:05 | 1 reply

          > Far more terrifying is Big Tech having access to a closed version of the same models

          Agreed. The only thing worse than everyone having access to this tech is only governments, mega corps and highly-motivated bad actors having access. They've had it a while and there's no putting the genii back in the bottle. The best thing the rest of us can do is use it widely so everyone can adapt to this being the new normal.

          • By apitman 2026-01-23 16:44

            I know genii is the plural of genie, but for a second I thought it was a typo of genai and I kind of like that better.

        • By javier123454321 2026-01-22 20:37

          I do strongly agree. Though the societal impact is only mitigated by open models, not curtailed at all.

        • By refulgentis 2026-01-23 18:05

          The really terrifying thing is the next logical step from the instinctual reaction. Eschew miracle; eschew the cognitive bias of feeling warm and fuzzy toward the guy who gives it to you for free.

          Socratic version: how can the Chinese companies afford to make them and give them out for free? Cui bono?

          n.b. it's not because they're making money on the API; e.g. open OpenRouter and see how Moonshot's or DeepSeek's first-party inference speed compares to literally any other provider. Note also that this disadvantage can't just be limited to LLMs, due to GPU export rules.

        • By vonneumannstan 2026-01-23 14:46 | 1 reply

          >Far more terrifying is Big Tech having access to a closed version of the same models, in the hands of powerful people with a history of unethical behavior (i.e. Zuckerberg's "Dumb Fucks" comments).

          Lol what exactly do you think Zuck would do with your voice, drain your bank account??

          • By liamN 2026-01-24 18:25

            More likely sell your family ads while using your voice.

      • By razster 2026-01-23 5:45

        I'd be a bit more worried once Z-Image Edit/Base is released. Flux.2 Klein is out and it's on par with Zit (Z-Image-Turbo), and with some fine-tuning can just about hit Flux.2. Adding on top of that is Qwen Image Edit 2511 for additional refinement. Anything is possible. The folks at r/StableDiffusion are falling over the possible release of Z-Image-Omni-Base, a hold-me-over until the actual base is out. I've heard it's equal to Flux.2. Crazy times.

      • By TacticalCoder 2026-01-23 12:04

        > With this and z-image-turbo, we've crossed a chasm.

        And most of all: they're both local models. The cat is out of the bag and it's never going back in. There's no censoring this, no company that can pull the plug. Anyone with a semi-modern GPU can use these models.

      • By fridder 2026-01-22 21:12 | 2 replies

        Admittedly I haven't dug into it much, but I wonder if we might finally have a use case for NFTs and web3? We need some way to denote that items are person-generated, not AI. It would certainly be easier than trying to determine whether something is AI generated.

        • By grumbel 2026-01-22 21:24

          That's the idea behind C2PA[1]: your camera and your tools put a signature on the media to prove its provenance. That doesn't make manipulation impossible (e.g. you could photograph an AI image displayed on a screen), but it does give you a trail of where a photo came from, and thus an easier way to filter it or look up the original.

          [1] https://c2pa.org/
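
          A toy sketch of the mechanism underneath, assuming the cryptography package (this is not the real C2PA manifest format, just the core sign-then-verify idea): the capture device signs the image bytes, and anyone can later check that signature against the maker's published public key.

            from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
            from cryptography.exceptions import InvalidSignature

            camera_key = Ed25519PrivateKey.generate()  # stands in for a key provisioned at manufacture
            image_bytes = b"...raw sensor data..."

            # The signature travels with the file as provenance metadata.
            signature = camera_key.sign(image_bytes)

            # Any viewer can verify the claim against the maker's public key.
            try:
                camera_key.public_key().verify(signature, image_bytes)
                print("provenance intact")
            except InvalidSignature:
                print("edited, or not from this camera")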

        • By simonw 2026-01-22 22:00 | 1 reply

          How would NFTs/web3 help differentiate between something created by a human and something that a human created with AI and then tagged with their signature using those tools?

          • By _kb 2026-01-23 6:05 | 1 reply

            In a live conversation context you can mention the term NFTs/web3 and if the far end is human they'll wince a little.

            • By disillusioned 2026-01-23 10:04

              This made me laugh far too hard for far too long.

      • By echelon 2026-01-22 18:38 | 6 replies

        We're going to be okay.

        There are far more good and interesting use cases for this technology. Games will let users clone their voices and create virtual avatars and heroes. People will have access to creative tools that let them make movies and shows with their likeness. People that couldn't sing will make music.

        Nothing was more scary than the invention of the nuclear weapon. And we're all still here.

        Life will go on. And there will be incredible benefits that come out of this.

        • By javier123454321 2026-01-22 19:33

          I'm not denigrating the tech, all I'm saying is that we've crossed into new territory and there will be consequences from this that we don't understand. The same way that social media has been particularly detrimental to young people (especially women) in a way we were not ready for. This __smells__ like it could be worse, alongside (or regardless of) the benefits of both.

          I simply think people don't really know that the new world requires a new set of rules of engagement for anything that exists behind a screen (for now).

        • By supern0va 2026-01-22 18:42

          We'll be okay eventually, when society adapts to this and becomes fully aware of the capabilities and the use cases for abuse. But, that may take some time. The parent is right to be concerned about the interim, at the very least.

          That said, I am likewise looking forward to the cool things to come out of this.

        • By cookiengineer 2026-01-23 6:20

          > We're going to be okay.

          > And there will be incredible benefits that come out of this.

          Your username is echelon.

          I just wanted to point that out.

        • By doug713705 2026-01-22 23:48 | 1 reply

          > Nothing was more scary than the invention of the nuclear weapon. And we're all still here.

          Except that building a nuclear weapon was not available to everyone, certainly not to dumb people whose brain have been feeded with social media content.

          • By lynx97 2026-01-23 11:41

            I usually don't correct typos and/or grammar, but you asked for it. Calling random people "dumb" while using an incorrect past tense is pretty funny. It is "fed", not "feeded"...

        • By DANmode 2026-01-22 18:52 | 1 reply

          > People that couldn't sing will make music.

          I was with you, until

          But, yeah. Life will go on.

          • By echelon 2026-01-22 18:55 | 3 replies

            There are plenty of electronic artists who can't sing. Right now they have to hire someone else to do the singing for them, but I'd wager a lot of them would like to own their music end-to-end. I would.

            I'm a filmmaker. I've done photons-on-glass production for fifteen years. Meisner-trained, have performed every role from cast to crew. I'm elated that these tools are going to enable me to do more with a smaller budget. To have more autonomy and creative control.

            • By DANmode 2026-01-22 22:06 | 2 replies

              What happens to lyricless electronica if suddenly every electronic artist has quality vocal-backing?

              Oh no.

              Maybe we did frig this up.

              • By fc417fc802 2026-01-23 8:54 | 1 reply

                On the other hand, maybe we'll get models capable of removing the lyrics from things without damaging the rest of the audio. Or better yet, replacing the lyrics with a new instrument. So it might yet work out in our favor.

                • By DANmode 2026-01-23 14:40

                  This was one of the first things they were doing with neural nets,

                  and there are even a couple SaaS options for it now.

              • By echelon 2026-01-22 23:30 | 1 reply

                More choices for artists is not a bad thing.

                • By DANmode 2026-01-23 14:41

                  Indeed.

                  But it does change who can be an artist in each niche,

                  and that’s been interesting to briefly pause and consider here with the community.

            • By redwall_hp 2026-01-23 6:11 | 1 reply

              We've had Yamaha Vocaloid for over two decades now, and Synthesizer V is probably coming up on a decade too now. They're like any other synth: MIDI (plus phonemes) in, sound out. It's a tool of musical expression, like any other instrument.

              Hatsune Miku (Fujita Saki) is arguably the most prolific singer in the world, if you consider every Vocaloid user and the millions of songs that have come out of it.

              So I don't think there's any uncharted territory...we still have singers, and sampled VST instruments didn't stop instrumentalists from existing; if anything, most of these newcomer generative AI tools are far less flexible or creatively useful than the vast array of synthesis tools musicians already use.

              • By fc417fc802 2026-01-23 8:36

                Miku is neat but not a replacement for a human by any stretch of the imagination. In practice most amateur usage of that lands somewhere in a cringey uncanny valley.

                No one was going to replace voice actors for TV and movie dubs with Miku whereas the cutting edge TTS tools seem to be nearing that point. Presumably human vocal performances will follow that in short order.

            • By javier123454321 2026-01-22 19:41 | 2 replies

              Yes, the flipside of this is that we're eroding the last bit of ability for people to make a living through their art. We are capturing the markets that let people live off of making illustrations, background music, jingles, promotional videos, photographs, and graphic design, and funnelling those earnings to NVIDIA. The question I keep asking is whether we as a society care about people being able to make a living through their art. I think there is a reason to care.

              It's not so much of an issue with art for art's sake aided by AI. It's an issue with artistic work becoming unviable work.

              • By volkercraig 2026-01-22 20:54 | 3 replies

                This feels like one of those tropes that keeps showing up whenever new tech comes out. At the advent of recorded music, I'm sure buskers and performers were complaining that live music was dead forever. Stage actors were probably complaining that film killed plays. Heck, I bet someone even complained that video itself killed the radio star. Yet here we are, hundreds of years later: live music is still desirable, plays still happen, and faceless voices are still around, they're just called v-tubers and podcasters.

                • By fc417fc802 2026-01-23 8:46 | 1 reply

                  > This feels like one of those tropes that keeps showing up whenever new tech comes out.

                  And this itself is another tired trope. Just because you can pattern match and observe that things repeatedly went a certain way in the past, doesn't mean that all future applications of said pattern will play out the same way. On occasion entire industries have been obliterated without a trace by technological advancement.

                  We can also see that there must be some upper ceiling on what humans in general are capable of - hit that and no new jobs will be created because humans simply won't be capable of the new tasks. (Unless we fuse with the machines or genetically engineer our brains or etc but I'm choosing to treat those eventualities as out of scope.)

                  • By Urahandystar 2026-01-23 11:26 | 1 reply

                    Give me one aspect in which that has actually happened? I'm wracking my brains but can't think of one. We are a weird species in that even if we could replace ourselves our fascination with ourselves means that we don't ever do it. Cars and bicycles have replaced our ability to travel at great and small distances and yet we still have track events culminating in the olympics.

                    • By fc417fc802 2026-01-23 12:33

                      Sure, things continue to persist as a hobby, a curiosity, a bespoke luxury, or the like. But that's not at all the same thing as an industry. Only the latter is relevant if we're talking about the economy and employment prospects and making a living and such.

                      It's a bit tricky to come up with concrete examples on the spot, in particular because drawing a line around a given industry or type of work is largely subjective. I could point to blacksmithing and someone could object that we still have metalworkers. But we don't have individual craftsmen hammering out pieces anymore. Someone might still object that an individual babysitting a CNC machine is analogous but somehow it feels materially different to me.

                      Leather workers are another likely example. To my mind that's materially different from a seamstress, a job that itself has had large parts of the tasks automated.

                      Horses might be a good example. Buggies and carriages replaced by the engine. Most of the transportation counterparts still exist but I don't think mechanics are really a valid counterpart to horse tenders and all the (historic) economic activity associated with that. Sure a few rich people keep race horses but that's the sort of luxury I was referring to above. The number of related job positions is a tiny fraction of what it was historically and exists almost solely for the purpose of entertaining rich people.

                      Historically the skill floor only crept up at a fairly slow rate so the vast majority of those displaced found new sectors to work in. But the rate of increase appears to have picked up to an almost unbelievable clip (we're literally in the midst of redefining the roles of software developers of all things, one of the highest skilled "bulk" jobs out there). It should be obvious that if things keep up the way they've been going then we're going to hit a ceiling for humans as a species not so long from now.

                • By redwall_hp 2026-01-23 6:18

                  Tin Pan Alley is the historical industry from before recording: composers sold sheet music and piano rolls to publishers, who sold them to working musicians. The ASCAP/BMI mafia would shake down venues and make sure they were paying licensing fees.

                  Recorded music and radio obviously reduced the demand for performers, which reduced demand for sheets.

                • By javier123454321 2026-01-22 21:07 | 3 replies

                  Umm, I don't know if you've seen the current state of trying to make a living with music, but it's widely accepted as dire. Touring is a loss leader, putting out music for free doesn't pay, per-stream payouts are abysmally low. No one buys songs.

                  All that is before the fact that streaming services are stuffing playlists with AI generated music to further reduce the payouts to artists.

                  > Yet here we are, hundreds of years later, live music is still desirable, plays still happen, and faceless voices are still around...

                  Yes all those things still happen, but it's increasingly untenable to make a living through it.

                  • By cthalupa 2026-01-23 4:42

                    Artists were saying this even before streaming, though, much less AI.

                    I listen pretty exclusively to metal, and a huge chunk of that is bands that are very small. I go to shows where the headliners stick around at the bar and chat with people. Not saying this to be a hipster - I listen to plenty of "mainstream" stuff too - but to show that it's hard to get smaller than this when it comes to people wanting to make a living making music.

                    None of them made any money off of Spotify or whatever before AI. They probably don't notice a difference, because they never paid attention to the "revenue" there either.

                    But they do pay attention to Bandcamp. Because Bandcamp has given them more ability to make money off the actual sale of music than they've had in their history - they don't need to rely on a record deal with a big label. They don't need to hope that the small label can somehow get their name out there.

                    For some genres, some bands, it's more viable than ever before to make a living. For others, yeah, it's getting harder and harder.

                  • By volkercraig 2026-01-23 19:29

                    Is it though? Think about being a musician 200 years ago. In 1826 you needed to essentially be nobility or nobility-adjacent just to be able to touch an instrument, let alone make a living from it. A hundred years later, in 1926, the barrier to entry was still sky-high; nobody could make and distribute recordings without extensive investment. Nowadays it's not uncommon for a 17 year old to download some free composer software, sign up for a few accounts and distribute their music to an audience of millions. It's not easy to do, sure, but there is opportunity that never existed before. If you were to take at random a 20 year old from the general population in 1826, 1923, 1943, 1953, 1973, '83, etc, would you REALLY say that any of them have a BETTER opportunity than today?

                  • By patrickdavey 2026-01-23 2:29

                    But this is different? Wholesale copying of copyrighted works, packaging it up, and allowing it to be generated. It's not remotely reasonable.

              • By lynx97 2026-01-23 11:46

                The number of artists who actually managed to earn enough to pay the rent and bills was already very, very small before AI emerged. I totally agree with you, it's heartbreaking to watch it get even worse, but the music industry was already shuffling the big money to the big players way before AI.

    • By magicalhippo 2026-01-22 19:08 | 2 replies

      The HF demo space was overloaded, but I got the demo working locally easily enough. The voice cloning of the 1.7B model captures the tone of the speaker very well, but I found it failed at reproducing the variation in intonation, so it sounds like a monotonous reading of a boring text.

      I presume this is due to using the base model, and not the one tuned for more expressiveness.

      edit: Or more likely, the demo not exposing the expressiveness controls.

      The 1.7B model was much better at ignoring slight background noise in the reference audio compared to the 0.6B model though. The 0.6B would inject some of that into the generated audio, whereas the 1.7B model would not.

      Also, without FlashAttention it was dog slow on my 5090, running at 0.3X realtime with just 30% GPU usage. Though I guess that's to be expected. No significant difference in generation speed between the two models.

      Overall though, I'm quite impressed. I haven't checked out all the recent TTS models, but I've tried a fair number, and this is certainly one of the better ones I've heard in terms of voice cloning quality.

      • By thedangler 2026-01-22 20:57 | 1 reply

        How did you do this locally? Tools? Language?

      • By dsrtslnd23 2026-01-22 22:53 | 1 reply

        Any idea on the VRAM footprint for the 1.7B model? I guess it fits on consumer cards but I am wondering if it works on edge devices.

        • By magicalhippo 2026-01-22 23:57

          The demo uses 6GB dedicated VRAM on Windows, but keep in mind that it's without FlashAttention. I expect it would drop a bit if I got that working.

          Haven't looked into the demo to see if it could be optimized by moving certain bits to CPU for example.

    • By pseudosavant 2026-01-22 19:07 | 1 reply

      Remarkable tech that is now accessible to almost anyone. My cloned voice sounded exactly like me. The uses for this will range from good to bad and everywhere in between: a deceased grandmother reading "Goodnight Moon" to grandkids, scamming people, the ability to create podcasts in your own voice from just prompts.

    • By parentheses 2026-01-23 8:14

      I got some errors trying to run this on my MBP. Claude was able to one-shot a fix.

      ```
      Loaded speech tokenizer from ~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e/speech_tokenizer
      Fetching 11 files: 100%|| 11/11 [00:00<00:00, 125033.45it/s]
      The tokenizer you are loading from '~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instr.... This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
      ```
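
      Presumably the one-shot fix amounts to passing the flag the warning itself names at the point where the tokenizer gets loaded; something like this, assuming the demo goes through the standard transformers loader:

      ```
      from transformers import AutoTokenizer

      # The warning message suggests this flag; the exact load site in the demo may differ.
      tokenizer = AutoTokenizer.from_pretrained(
          "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
          fix_mistral_regex=True,
      )
      ```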

    • By cristoperb 2026-01-23 4:23 | 2 replies

      I cloned my voice and had it generate audio for a paragraph from something I wrote. It definitely kind of sounds like me, but I like it much better than listening to my real voice. Some kind of uncanny peak.

      • By viraptor 2026-01-23 4:41

        That weirdly makes it a canny peak though :)

      • By bsenftner 2026-01-23 11:52

        You do realize that you don't normally hear your real voice; an individual has to record their voice to hear how others hear it. What you hear when you speak includes your skull resonating, which others do not hear.

    • By mohsen1 2026-01-22 19:12 | 1 reply

      > The requested GPU duration (180s) is larger than the maximum allowed

      What am I doing wrong?

    • By KolmogorovComp 2026-01-22 22:57 | 1 reply

      Hello, the recording you posted doesn't tell us much about the cloning capability without an example of your real voice for comparison.

      • By simonw 2026-01-22 23:41 | 1 reply

        Given how easy voice cloning is with this thing I chickened out of sharing the training audio I recorded!

        That's not really rational considering the internet is full of examples of my voice that anyone could use though. Here's a recent podcast clip: https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3006s

        • By KolmogorovComp 2026-01-23 0:38 | 1 reply

          Thanks, so it’s in the [pretty close but still distinguishable] range.

          • By genewitch 2026-01-23 4:21

            It depends on the medium. A FLAC will be distinguishable (for now); but put it out over low-bandwidth media and you get https://youtube.com/shorts/dpScfg3how8 - and https://www.youtube.com/shorts/zi7BVqVzRx4 is real good, close to what that podcaster's voice sounded like when I made that clone!

            I have several other examples from before my repeater ID voice clone. Newer voice models will have to wait till I recover my NAS tomorrow!

            This is the newest one I have access to: a Dick Powell voice clone off his Richard Diamond persona: https://soundcloud.com/djoutcold/dick-powell-voice-clone-tes...

            I was one-shotting voices years ago that were timbre/tonally identical to the reference voice; however, the issue I had was inflection and subtlety. I find that female voices are much easier to clone, or at least it fools my brain into thinking so.

            This model, if the results weren't too cherry-picked, will be a huge improvement!

    • By kingstnap 2026-01-22 20:54

      It was fun to try out. I wonder if at some point, with a few minutes of me talking, I could make myself read an entire book to myself.

    • By itsTyrion 2026-01-24 0:19

      Well, that isn't concerning at all.

  • By simonw 2026-01-22 23:25 | 4 replies

    I got this running on macOS using mlx-audio thanks to Prince Canuma: https://x.com/Prince_Canuma/status/2014453857019904423

    Here's the script I'm using: https://github.com/simonw/tools/blob/main/python/q3_tts.py

    You can try it with uv (downloads a 4.5GB model on first run) like this:

      uv run https://tools.simonwillison.net/python/q3_tts.py \
        'I am a pirate, give me your gold!' \
        -i 'gruff voice' -o pirate.wav

    • By genewitch 2026-01-23 3:26 | 1 reply

      If I am ever in the same city as you, I'll buy you dinner. I poked around during my free time today trying to figure out how to run these models, and here is the estimable Simon Willison just presenting it on a platter.

      Hopefully I can make this work on Windows (or Linux, I guess).

      Thanks so much.

      • By cube00 2026-01-23 10:57 | 1 reply

        > Hopefully I can make this work on Windows (or Linux, I guess).

        mlx-audio only works on Apple Silicon

        • By bigyabai 2026-01-23 17:45

          The original script supports CPU inference, nonetheless.

    • By rahimnathwani 2026-01-23 20:44

      If you want to do custom voice cloning, record a sample wav file with a sentence or two, and then try this:

        uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow

        python -m mlx_audio.tts.generate \
          --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 \
          --text "Hello, this is a test." \
          --ref_audio path_to_audio.wav \
          --ref_text "Transcript of the reference audio." \
          --play

    • By indigodaddy 2026-01-23 1:36 | 2 replies

      Simon, how do you think this would perform on CPU only? Let's say a Threadripper with 20GB of RAM (voice cloning in particular).

      • By simonw 2026-01-23 3:06

        No idea at all, but my guess is it would work but be a bit slow.

        You'd need to use a different build of the model though, I don't think MLX has a CPU implementation.

      • By genewitch 2026-01-23 3:30

        The old voice cloning and/or TTS models were CPU-only, and they weren't realtime, but no worse than 2:1 - 30 seconds of audio would take roughly 60 seconds to generate. In 2021 one-shot TTS/cloning using GPUs was getting there, and that was close enough to realtime; one could, if one was willing to deal with it, wire microphone audio to the model, speak words, and the model would modify the voice in real time. Phil Hendrie is jealous.

        Anyhow, with faster CPUs and optimizations, you won't be waiting too long. Also, 20GB is overkill for an audio model. Only text models (LLMs) are huge and take infinite memory. SD/FLUX models are under 16GB of RAM usage (uh, mine are, at least!), for instance.

    • By gcr 2026-01-22 23:44

      This is wonderful, thank you. Another win for uv!

  • By TheAceOfHearts 2026-01-22 18:20 | 2 replies

    Interesting model. I've managed to get the 0.6B param model running on my old 1080 and I can generate 200-character chunks safely without going OOM, so I thought that making an audiobook of the Tao Te Ching would be a good test. Unfortunately each snippet varies drastically in quality: sometimes the speaker is clear and coherent, but other times it bursts out laughing or moaning. In a way it feels a bit like magical roulette, never being quite certain of what you're going to get. It does have a bit of charm; when you chain the various snippets together you really don't know what direction it's gonna go.

    Using speaker Ryan seems to be the most consistent, I tried speaker Eric and it sounded like someone putting on a fake exaggerated Chinese accent to mock speakers.

    If it wasn't for the unpredictable level of emotions from each chunk, I'd say this is easily the highest quality TTS model I've tried.
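
    If you want to replicate the chunking, here's a rough sketch of the pack-sentences-and-stitch approach (synthesize() below is a hypothetical stand-in for whatever inference call you're using; the 200-character budget is just what fit on the 1080):

      import re
      import numpy as np

      MAX_CHARS = 200  # what fit in VRAM on the 1080

      def chunk_text(text: str, limit: int = MAX_CHARS) -> list[str]:
          """Greedily pack whole sentences into chunks under the character limit.

          A single sentence longer than the limit still becomes its own chunk.
          """
          sentences = re.split(r"(?<=[.!?])\s+", text.strip())
          chunks, current = [], ""
          for s in sentences:
              if current and len(current) + len(s) + 1 > limit:
                  chunks.append(current)
                  current = s
              else:
                  current = f"{current} {s}".strip()
          if current:
              chunks.append(current)
          return chunks

      # Stitch the generated snippets together (synthesize is hypothetical):
      # audio = np.concatenate([synthesize(chunk) for chunk in chunk_text(tao_te_ching)])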

    • By KaoruAoiShiho 2026-01-22 18:48 | 1 reply

      Have you tried specifying the emotion? There's an option to do so and if it's left empty it wouldn't surprise me if it defaulted to rng instead of bland.

      • By TheAceOfHearts 2026-01-22 19:35 | 1 reply

        For the system prompt I used:

        > Read this in a calm, clear, and wise audiobook tone.

        > Do not rush. Allow the meaning to sink in.

        But maybe I should experiment with something more detailed. Do you have any suggestions?

        • By KaoruAoiShiho 2026-01-23 3:12

          Something like this:

          Character Name: Marcus Cole

          Voice Profile: A bright, agile male voice with a natural upward lift, delivering lines at a brisk, energetic pace. Pitch leans high with spark, volume projects clearly—near-shouting at peaks—to convey urgency and excitement. Speech flows seamlessly, fluently, each word sharply defined, riding a current of dynamic rhythm.

          Background: Longtime broadcast booth announcer for national television, specializing in live interstitials and public engagement spots. His voice bridges segments, rallies action, and keeps momentum alive—from voter drives to entertainment news.

          Presence: Late 50s, neatly groomed, dressed in a crisp shirt under studio lights. Moves with practiced ease, eyes locked on the script, energy coiled and ready.

          Personality: Energetic, precise, inherently engaging. He doesn’t just read—he propels. Behind the speed is intent: to inform fast, to move people to act. Whether it’s “text VOTE to 5703” or a star-studded tease, he makes it feel immediate, vital.

    • By dsrtslnd23 2026-01-22 23:03 | 1 reply

      Do you have the RTF for the 1080? I am trying to figure out if the 0.6B model is viable for real-time inference on edge devices.

      • By TheAceOfHearts 2026-01-23 0:10 | 2 replies

        Yeah, it's not great. I wrote a harness that calculates it as: 3.61s Load Time, 38.78s Gen Time, 18.38s Audio Len, RTF 2.111.

        The Tao Te Ching audiobook came in at 62 mins in length and it ran for 102 mins, which gives an RTF of 1.645.

        I do get a warning about flash-attn not being installed, which says that it'll slow down inference. I'm not sure if that feature can be supported on the 1080 and I wasn't up for tinkering to try.
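
        For anyone unfamiliar, RTF here is generation time divided by the duration of audio produced, so anything above 1.0 is slower than real time. Using the numbers above:

          # RTF = generation time / audio duration
          print(38.78 / 18.38)  # ~2.11, single-chunk harness run
          print(102 / 62)       # ~1.645, full Tao Te Ching audiobook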

        • By storystarling 2026-01-23 3:09

          An RTF above 1 for just 0.6B parameters suggests the bottleneck isn't the GPU, even on a 1080. The raw compute should be much faster. I'd bet it's mostly CPU overhead or an issue with the serving implementation.

        • By genewitch 2026-01-23 3:33

          You can install FlashAttention et al., but if you're on Windows, AFAIK you can't use/run/install Triton kernels, which apparently make audio models scream. Whisper complains every time I start it, and it is pretty slow; so I just batch hundreds of audio files on a machine in the corner with a 3060 instead. Technically I could batch them on a CPU, too, since I don't particularly care when they finish.
