If you want better "open" models, these all sound better for zero shot:
Zeroshot TTS: MaskGCT, MegaTTS3
Zeroshot VC: Seed-VC, MegaTTS3
Granted, only Seed-VC has training/fine-tuning code, but all of these models sound better than Chatterbox. So if you're going to deal with something you can't fine-tune and you need a better zero-shot fit to your voice, use one of these models instead. (Especially ByteDance's MegaTTS3. ByteDance research runs circles around most TTS research teams except for ElevenLabs. They've got way more money and PhD researchers than the smaller labs, plus a copious amount of training data.)
It makes my Australian accent sound very English though, in a posh RP way.
Very natural sounding, but not at all recreating my accent.
Still, amazingly clear and perfect for most TTS uses where you aren't actually impersonating anyone.
https://github.com/nazdridoy/kokoro-tts/blob/main/kokoro-tts
It was important to me that it be 100% private and local, and I wanted it to be a one-time-payment solution. Because it processes your data locally, it can be a one-time-payment text-to-speech app.
If you are interested in creating audiobooks from epubs, check this demo: https://www.youtube.com/watch?v=pOHzo6Oq0lQ

If you are interested in listening while reading with text highlighting, check these demos:

- https://www.youtube.com/watch?v=8yJ-lsbzAuw
- https://www.youtube.com/watch?v=y8wi4d8xmnw
```
watermarked_wav = self.watermarker.apply_watermark(...
```
The whole audiobook business will eventually disappear, probably within the decade. There will only be ebooks, and on-device AI assistants will read them to you on demand.
I imagine it'll go like this: First pre-generated audiobooks as audio files. Next, online service to generate audio on demand with hyper customizable voices which can be downloaded. Next, a new ebook format which embeds instructions for narration and pronunciation to be read on-device. Finally, AI that's good enough to read it like a storyteller instantly without hints.
Honestly I read (or rather, listen to) a lot of books already by getting the epubs onto my phone then using a very basic TTS to read it out. Yes, they're definitely not as lifelike as even the most common AI TTS systems but they're good enough to listen to at high speed. Moon+ Reader is pretty good for Android, not sure about iOS.
I've also found that if your one-shot sample WAV isn't really clean, Chatterbox sometimes produces random unholy whooshing sounds at the end of the generated audio, which is an added bonus if you're recording Dante's Inferno.
This is a good release if they're not too cherry picked!
I say this every time it comes up, and it's not as sexy to work on, but in my experiments voice AI is really held back by transcription, not TTS. Unless that's changed recently.
(I've yet to experiment with giving the LLM alternate transcriptions or confidence levels, but I bet they could make good use of that too)
Even if you can just mark the text as suspicious I think in an interactive application this would give the LLM enough information to confirm what you were saying when a really critical piece of text is low confidence. The LLM doesn't just know what are the most plausible words and phrases for the user to say, but the LLM can also evaluate if the overall gist is high or low confidence, and if the resulting action is high or low risk.
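As a toy illustration of the kind of marking I mean (the threshold and the bracket format are made up; I'm assuming your ASR exposes per-word probabilities, as e.g. faster-whisper does):

```
# Toy sketch: mark low-confidence words so the LLM can see where the
# transcript is shaky. Input is (word, probability) pairs from the ASR.

def annotate_transcript(words, threshold=0.6):
    parts = []
    for word, prob in words:
        # Anything under the threshold gets flagged for the LLM to question
        parts.append(f"[unclear: {word}]" if prob < threshold else word)
    return " ".join(parts)

asr_output = [("transfer", 0.95), ("two", 0.41), ("thousand", 0.92),
              ("dollars", 0.97), ("to", 0.88), ("don", 0.35)]
print(annotate_transcript(asr_output))
# -> transfer [unclear: two] thousand dollars to [unclear: don]
```

With that in the prompt, the LLM can decide on its own whether to ask "did you say two thousand or ten thousand?" before doing anything risky.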
Old ASR systems (even models like wav2vec) were usually combined with a language model. It wasn't a large language model (those didn't exist at the time); it was usually something based on n-grams.
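A toy sketch of the idea, for anyone who hasn't seen it (real systems did beam search over lattices, but the scoring has the same shape: acoustic log-prob plus a weighted n-gram LM log-prob):

```
import math

# Toy bigram LM: log P(word | previous word), hand-filled for the example.
bigram_logprob = {
    ("recognize", "speech"): math.log(0.6),
    ("wreck", "a"): math.log(0.3),
    ("a", "nice"): math.log(0.4),
    ("nice", "beach"): math.log(0.5),
}

def lm_score(words, floor=math.log(1e-4)):
    """Sum bigram log-probs, backing off to a floor for unseen pairs."""
    return sum(bigram_logprob.get(pair, floor)
               for pair in zip(words, words[1:]))

def rescore(hypotheses, alpha=0.8):
    """Combine acoustic and LM scores; keep the best hypothesis."""
    return max(hypotheses,
               key=lambda h: h["acoustic_logprob"] + alpha * lm_score(h["words"]))

hyps = [
    {"words": ["recognize", "speech"], "acoustic_logprob": -4.1},
    {"words": ["wreck", "a", "nice", "beach"], "acoustic_logprob": -3.9},
]
print(rescore(hyps)["words"])  # the LM pulls "recognize speech" ahead
```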
I put together a script a while back which converts any passed audio file (wav, mp3, etc.), normalizes the audio, passes it to ggerganov whisper for transcription, and then forwards to an LLM to clean the text. I've used it with a pretty high rate of success on some of my very old and poorly recorded voice dictation recordings from over a decade ago.
Public gist in case anyone finds it useful:
https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
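For anyone who'd rather roll their own than read the gist, the shape of it is roughly this (binary names and flags are from memory, so treat them as assumptions; whisper.cpp's CLI has been called both `main` and `whisper-cli` depending on the version):

```
import subprocess
from pathlib import Path

def transcribe(audio_path: str, model: str = "models/ggml-large-v3.bin") -> str:
    """Normalize audio with ffmpeg, then transcribe with whisper.cpp."""
    stem = Path(audio_path).stem
    wav = f"{stem}_norm.wav"
    # Loudness-normalize and resample to 16 kHz mono (what whisper expects)
    subprocess.run(["ffmpeg", "-y", "-i", audio_path, "-af", "loudnorm",
                    "-ar", "16000", "-ac", "1", wav], check=True)
    # whisper.cpp CLI: -otxt writes plain text next to the -of base name
    subprocess.run(["whisper-cli", "-m", model, "-f", wav,
                    "-otxt", "-of", stem], check=True)
    return Path(f"{stem}.txt").read_text()

raw = transcribe("old_dictation.mp3")
prompt = ("Clean up this transcription: fix punctuation and obvious "
          "mis-hearings, but don't add or remove content.\n\n" + raw)
# ...then hand `prompt` to whatever LLM you like for the cleanup pass.
```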
git clone <whisper-diarization.git URL>
cd whisper-diarization
python -m venv .
# activate the venv: `source bin/activate` on Linux/macOS,
# or `Scripts\activate.ps1` on Windows [0]
source bin/activate
Your prompt should change to say (whisper-diarization) <your OS prompt>$. Now you can type:
pip install -c constraints.txt -r requirements.txt
python ./diarize.py --no-stem --suppress_numerals --whisper-model large-v3-turbo --device cuda -a <FILE>
Next time you want to use it, you can just do:
cd ~/whisper-diarization
source bin/activate (or the Windows equivalent) [0]
python ./diarize.py [...]
[0] To activate a Python virtual environment created with venv, run `source bin/activate` on Linux or macOS, or `Scripts\activate` on Windows. This will change your terminal prompt to indicate that the virtual environment is active. (That note was 'AI generated' by DDG, but it checks out: since the venv was created in the project root, Linux puts the script in ./bin/activate and Windows puts it in ./Scripts/activate.ps1.)
> Any of you fucking pricks move and I'll execute every motherfucking last one of you.
I'm so tired of the boring old "miss daisy" demos.
People in the indie TTS community often use the Navy Seals copypasta [1, 2]. It's refreshing to see Resemble using swear words themselves.
They know how this will be used.
It can also be configured to use Ollama or an API key from other providers (OpenRouter included), and from what I gather the default prompt can be changed too.
Sadly it's closed source.
It's obviously an AI for playing wargames without having to bother painting all the miniatures, or finding someone with the same weird interest in Balkan engagements during the Napoleonic era.
Listing the issues in case it helps anyone:
- It doesn't work with Python 3.13, luckily `uv` makes it easy to build a venv with 3.12
- It said numpy 1.26.4 doesn't exist. It definitely does, but `uv pip` was searching for it on the pytorch repo. I passed an `--index-strategy` flag so it would check other repos. This could just be a bug in uv, but when I see "numpy 1.26.4 doesn't exist" and numpy is currently on 2.x, my brain starts to cramp up.
- The `pip install chatterbox-tts` version has a bug in CPU-only mode, so I cloned the Git repo
- The version at the tip of main requires `protobuf-compiler` installed on Debian
- I got a weird CMake error that I can't decipher. I think maybe it's complaining that the Python dev headers are not installed. Why would they be, I'm trying to do inference, not compile Python...
I know anger isn't productive but this is my experience almost any time I'm running Somebody Else's Python Project. Hit an issue, back up, hit another issue, back up, after an hour it still doesn't run.
> We developed and tested Chatterbox on Python 3.11 on Debian 11 OS; the versions of the dependencies are pinned in pyproject.toml to ensure consistency.
If something can be run for free but it's cheaper to rent, it voids the DIY aspect of it.
But if the model is any good someone will probably find a way to optimize it to run on even less.
Edit: Got it running on an old Nvidia 2060, I'm seeing ~5 GB VRAM peak.
So out of the box it seems quite beefy consumer hardware will be needed for it to perform reasonably. However it seems like there's significant potential for improvements, though I'm no expert.
Funnily enough, it made my Australian accent sound very English RP. I was suddenly very posh.
Then you could treat the codebook entries as tokens and treat audio generation as a next-token prediction task.
You then take the generated codebook entries and run them through the codec's decoder to yield audio.
It works surprisingly well.
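If that sounds abstract, the skeleton really is this small. A toy version (stub codec, a GRU instead of a transformer, untrained weights so it emits noise, but the shape is the real thing):

```
import torch
import torch.nn as nn

VOCAB = 1024   # codec codebook size
FRAME = 320    # audio samples per codec token (stub value)

class ToyAudioLM(nn.Module):
    """Next-token predictor over codec codebook indices."""
    def __init__(self, dim=256):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # causal by construction
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):          # tokens: (batch, time) int64
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)             # (batch, time, VOCAB) logits

@torch.no_grad()
def generate(model, prompt_tokens, steps=50):
    tokens = prompt_tokens
    for _ in range(steps):
        logits = model(tokens)[:, -1]                    # last position only
        nxt = torch.multinomial(logits.softmax(-1), 1)   # sample next token
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens

def stub_decode(tokens):
    """Stand-in for the codec decoder (tokens -> waveform)."""
    return torch.randn(tokens.shape[1] * FRAME)

model = ToyAudioLM()
codes = generate(model, torch.randint(VOCAB, (1, 8)))  # "prompt" = 8 codec tokens
audio = stub_decode(codes)  # a real codec (e.g. EnCodec) decodes here
```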
Speech-text models (TTS models with an LLM as backbone) are the current meta.
Am I misunderstanding, or can you trivially disable the watermark by simply commenting out the call to the apply_watermark function in tts.py? https://github.com/resemble-ai/chatterbox/blob/master/src/ch...
I thought the point of this sort of watermark was that it was embedded somehow in the model weights, so that it couldn't easily be separated out. If you're going to release an open-source model that adds a watermark as a separate post-processing step, then why bother with the watermark at all?
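For what it's worth, checking for the watermark is a separate post-hoc step too, which rather underlines the point. Something along these lines, adapted from memory of the repo's README (so treat the exact `perth` calls as an assumption):

```
import perth
import librosa

# Check a generated file for Resemble's Perth watermark.
audio, sr = librosa.load("generated.wav", sr=None)
watermarker = perth.PerthImplicitWatermarker()
watermark = watermarker.get_watermark(audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")  # per the README: 0.0 or 1.0
```

Nothing about the weights is involved on either the apply side or the check side.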
2. This is CYA for that. Without watermarking, there will be cries from the media about abuse (from anti-AI outfits like 404Media [2] especially).
[1] This is the right way to do it. Offer source code and weights, offer their own API/fine tuning so developers don't have to deal with the hassle. That's how they win back some market share.
[2] https://www.404media.co/wikipedia-pauses-ai-generated-summar...
https://github.com/resemble-ai/chatterbox/issues/45#issuecom...
> For now, that means we’re not releasing the training code, and fine-tuning will be something we support through our paid API (https://app.resemble.ai). This helps us pay the bills and keep pushing out models that (hopefully) benefit everyone.
Big bummer here, Resemble. This is not at all open.
For everyone stumbling upon this, there are better "open weights" models than Resemble's Chatterbox TTS:
Zeroshot TTS: MaskGCT, MegaTTS3
Zeroshot VC: Seed-VC, MegaTTS3
These are really good robust models that score higher in openness.
Unfortunately only Seed-VC is fully open. But all of the above still beat Resemble's Chatterbox in zero-shot MOS (we tested a lot), especially the mega-OP Chinese models.
(ByteDance slaps with all things AI. Their new secretive video model is better than Veo 3, if you haven't already seen it [2]!)
You can totally ignore this model masquerading as "open". Resemble isn't really being generous at all here, and this is some cheap wool over the eyes trickery. They know they retain all of the cards here, and really - if you're just going to use an API, why not just use ElevenLabs?
Shame on y'all, Resemble. This isn't "open" AI.
The Chinese are going to wipe the floor with TTS. ByteDance released their model in a more open manner than yours, and it sounds way better and generalizes to voices with higher speaker similarity.
Playing with open source is a path forward, but it has to be in good faith. Please do better.
[1] "10/10" open includes: 1. model code, 2. training code, 3. fine tuning code, 4. inference code, 5. raw training data, 6. processed training data, 7. weights, 8. license to outputs, 9. research paper, 10. patents. For something to be a good model, it should have 7/10 or above.
[2] https://artificialanalysis.ai/text-to-video/arena?tab=leader...
If you're going to drop weights on unsuspecting developers (who might not be familiar with TTS) and make them think that they'll fit their use case, that's a bit of a bait-and-switch.
Chatterbox TTS is only available over API for fine tunes. That's an incredibly saturated market, and there are better quality and cheaper models for this.
Chatterbox TTS is equivalent to already-released semi-open weights from ByteDance and other labs, and those models already sound and perform better.
It'd be truly exciting if Chatterbox fine tunes could be done as open weights, similar to how Flux operates. Black Forest Labs has an entire open weights ecosystem built around them. While they do withhold their pro / highest quality variants, they always release open weights with training code for each commercial release. That's a much better model for courting open source developers.
Another company doing "open weights" right is Lightricks with LTX-1. They have a commercial studio, but they release all of their weights and tuning code in the open.
I don't see how this is a carrot for open source. It's an ad for the hosted API.
To make a really poor analogy, this repo is like a version of Linux that you can't cross-compile or port.
To make another really poor (but fitting) analogy, this is like an "open core" SaaS platform that you know you'll never be able to run the features that matter on your own.
This repo scores really low on the "openness" continuum. In this case, you're very limited in what you can do with Chatterbox TTS. You certainly can't improve it or fit it to your data.
> You can fine-tune the weights yourself with your own training code.
This will never be built by anyone, and they know that. If it could be, they'd provide it themselves.
If you're considering Chatterbox TTS, just use MegaTTS3 [1] instead. It's better by all accounts.
This can be cross-compiled/ported in the Linux analogy. The Linux analogy would be more like: a kernel dev wrote code for some part of the Linux kernel using JetBrains' CLion. He used features of CLion that made this process much easier than if he had written the code using `nano`. By your logic, the resulting kernel code is not "open" because the tooling used to create it is not open. This is, of course, nonsense.
I agree that the project as a whole is less open than it could be, but the weights are indeed as open as they can be, no scare quotes required.
I'll up the ante. I'll bet you money that nobody forks this and adds fine tuning for at least a year.
And someone else fine-tuned it for German: https://huggingface.co/SebastianBodza/Kartoffelbox-v0.1
I haven't seen this level of involvement for a lot of the models I'm using, including several text to speech models.
The rapidity of this is also quite shocking. I don't think Resemble anticipated this either, given their wording on the aforementioned ticket.
There's probably a lot more work to do to ensure this works, adjusting learning rates, batching, etc., but it's all clearly being put into place and given attention. Even if this model has some finicky fine tuning behaviors, with this kind of willpower it'll be quickly overcome.
I suppose I owe you, haha.
It is highly amusing that they still believe they can put that genie back in the bottle with their usual crybully bullshit.
A lock need not be infinitely strong to be useful; it just needs to take more resources to crack than the locked thing is worth.
- fast / cheap to run
- can clone voices
- sounds super realistic
From what I can tell, Chatterbox is the first that apparently lets you pick all 3! (I have not tried it myself yet; this is just what I can deduce.)
- the output had some of the qualities of my voice, but wasn't super similar. (Then again, the fact it could even do this from such a tiny snippet was impressive)
- increasing "CFG/pace" (whatever CFG is) even a little bit often just breaks down into total gibberish
- it was very inconsistent whether it would come out with a kind of British accent or an American one. (My accent is Australian...)
- the emotional exaggeration was interesting, but it seemed to vary a lot exactly what kind of emotion would come out
It’s becoming much more likely that the friend who desperately needs a gift card to Walmart isn’t the friend at all. :(
In a real scenario, they'd know the verbal password, so you can authenticate them. Drum it into them that this password will prevent other people from impersonating you in this brave new world of AI voices and even video.
Two-factor authentication through a secure app or a trusted family member is probably also needed, though I haven't tackled this part with them yet.
The problem is that the sort of emergency scenario in which a family member would need help often can't be handled via a secured app. It's often just a telephone call, from a number you don't recognize: imagine getting that call from a police station in the middle of nowhere after being arrested; you don't have access to any of your personal belongings, as they're confiscated. The phone is a landline from the police station!
Therefore, a verbal password is needed, as this scenario is exactly how a scammer would present the emergency in which they need help (usually: wire some dollars to this account for bail).
This is a HN fantasy solution.
Not to mention that it is your responsibility as the technically minded to hammer it into your family members.
I wouldn't assume you're safe just because the tech in your phone can't speak your language.
Interrupting them with "can you make me a poem about x" works reliably. However, the latency is a dead giveaway.
https://github.com/basetenlabs/truss-examples/tree/main/chat...
Still working on streaming
So far the US and China are spearheading AI research, so it makes sense that models optimize for languages spoken there. Spanish is an interesting omission on the US part, but that's probably because most AI researchers in the US speak English even if their native tongue is Spanish.
Also, a deployable model: https://lightning.ai/bhimrajyadav/ai-hub/temp_01jwr0adpqf055...
I do feel bad for pharmacists, their job is challenging in so many ways.
Although, from a risk avoidance point of view, I'd understand if Google wanted to stay as far away from having AI deal with medication as possible. Who knows what it'll do when it starts concocting new information while ordering medicine.
It has a female voice. Any way to set it to a male voice?

On the Huggingface demo, there seems to be no option for it.
I'd love to get real-time generation, if that's in the pipeline. I'd like to use it along with Home Assistant.
I created an API wrapper that also makes installation easier (Dockerized as well) https://github.com/travisvn/chatterbox-tts-api/
Best voice cloning option available locally by far, in my experience.
To be honest, it uses a decently large amount of resources. With a GPU, you can expect about 4-5 GB of memory usage. And given the optimizations for tensors on GPUs, I'm not sure how well things would work CPU-only.
If you try it, let me know. There are some "CPU" Docker builds in the repo you could look at for guidance.
If you want free TTS without using local resources, you could try edge-tts https://github.com/travisvn/openai-edge-tts
> I created an API wrapper that also makes installation easier (Dockerized as well) https://github.com/travisvn/chatterbox-tts-api/
Gave your wrapper a try and, wow, I'm blown away by both Chatterbox TTS and your API wrapper.
Excuse the rudimentary level of what follows.
I was looking for a quick and dirty CLI incantation to specify a local text file instead of the inline `input` object, but couldn't figure it out.
Pointers much appreciated.
A lot of these frontends have an option for using OpenAI's TTS API, and some of them allow you to specify the URL for that endpoint, allowing for "drop-in replacements" like this project.
So the speech generation endpoint in the API is designed to fill that niche. However, its usage is pretty basic and there are curl statements in the README for testing your setup.
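If you want to hit it programmatically instead of via curl, it's just a POST (a quick sketch; port 5123 and the `/v1/audio/speech` route are the wrapper's defaults, so adjust if you've changed them):

```
import requests

# POST text to the wrapper's OpenAI-style speech endpoint, save the audio.
resp = requests.post(
    "http://localhost:5123/v1/audio/speech",
    json={"input": "Hello from Chatterbox."},
    timeout=300,
)
resp.raise_for_status()
with open("speech.wav", "wb") as f:
    f.write(resp.content)
```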
Anyway, to get to your actual question, let me see if I can whip something up. I'll edit this comment with the command if I can swing it.
In the meantime, can I assume your local text files are actual `.txt` files?
To answer your question, yes, my local text files are .txt files.
I'm new to actually commenting on HN as opposed to just lurking, so I hope this formatting works.
cat your_file.txt | python3 -c 'import sys, json; print(json.dumps({"input": sys.stdin.read()}))' | curl -X POST http://localhost:5123/v1/audio/speech \
-H "Content-Type: application/json" \
-d @- \
--output speech.wav
Just replace the `your_file.txt` with... well, you get it. This'll hopefully handle any potential issues you'd have with quotes or other symbols breaking the JSON input.
Let me know how it goes!
Oh and you might want to change `python3` to `python` depending on your setup.
> This'll hopefully handle any potential issues you'd have with quotes or other symbols breaking the JSON input.
> Let me know how it goes!
Wow. I'm humbled and grateful.
I'll update once I'm done with work and back in front of my home machine.
For now, there's just a textarea for input (so you'll have to copy the `.txt` contents) — but it's a lot easier than trying to finagle into a `curl` request
Let me know if you have any issues!
Absolutely blown away.
I fed it the first page of Gibson's "Neuromancer" and your incantation worked like a charm. Thanks for the shell script pipe mojo.
Some other details:
- 3:01 (3 mins, 1 sec) of generated .wav took 4:28 to process
- running on M4 Max with 128GB RAM
- Chatterbox TTS inserted a few strange artifacts which sounded like air venting, machine whirring, and vehicles passing. Very odd and, oddly, apropos for cyberpunk.
- Chatterbox TTS managed to enunciate the dialog _as_ dialog, even going so far as to mimic an Australian accent where the speaker was identified as such. (This might be the effect of wishful listening.)
I am astounded.

> Currently only English.
meh
In the past I've used different samples from the same speaker for this.
There's ElevenLabs, which is quite good but not incredible, and very expensive.
Everything else, all the big AI companies, have TTS systems that are kinda meh.
Everything else in AI has advanced in leaps and bounds; TTS remains deep in the uncanny valley.
If you want to run it without size limits, here's an open-source API wrapper that fixes some of the main headaches with the main repo https://github.com/travisvn/chatterbox-tts-api/
Is it a banger? Yes, I guess so: a full setup ready for indies shipping voice-first products right now.
https://news.ycombinator.com/item?id=44120204
https://news.ycombinator.com/item?id=44144155
https://news.ycombinator.com/item?id=44195105
https://news.ycombinator.com/item?id=44230867
https://news.ycombinator.com/item?id=44172134
https://news.ycombinator.com/item?id=44221910
https://news.ycombinator.com/item?id=44145564