IMO, the best alternative is Chatterbox-TTS-Server [0] (slower, but quite high quality).
I have an RTX 5090, so not exactly what most consumers will have, but still accessible, and it's also very fast: around 2 seconds of audio per 1 second of generation.
Here's an example I just generated (first try, 22 seconds runtime, 14 seconds of generation): https://jumpshare.com/s/Vl92l7Rm0IhiIk0jGors
Here's another one, 20 seconds of generation, 30 seconds of runtime, which clones a voice from a Youtuber (I don't use it for nefarious reasons, it's just for the demo): https://jumpshare.com/s/Y61duHpqvkmNfKr4hGFs with the original source for the voice: https://www.youtube.com/@ArbitorIan
But you included 3 samples on your GitHub video and they all sound extremely robotic and have very bad artifacts?
"The pleasure of Buzby's company is what I most enjoy. He put a tack on Miss Yancy's chair ..."
https://www.youtube.com/watch?v=H2kIN9PgvNo
https://literalminded.wordpress.com/2006/05/05/a-panphonic-p...
Would you consider releasing a more capable version that renders with fewer artifacts (and maybe requires a bit more processing power)?
Chatterbox is my go-to, this could be a nice alternative were it capable of high-fidelity results!
Do you have any plans to publish a blog post on how you did that? What training data, and how much? Your training and ablation methodology, etc.?
I wish there was an open/local TTS model with voice cloning as good as 11l (even for non-English languages).
https://en.wikipedia.org/wiki/Zero-shot_learning
edit: since there seems to be some degree of confusion regarding this definition, I'll break it down more simply:
We are modeling the conditional probability P(Audio|Voice). If the model samples from this distribution for a Voice class not observed during training, it is by definition zero-shot.
"Prediction" here is not a simple classification, but the estimation of this conditional probability distribution for a Voice class not observed during training.
Providing reference audio to a model at inference-time is no different than including an AGENTS.md when interacting with an LLM. You're providing context, not updating the model weights.
If you didn't do that (because you have 100 hours of other people talking), that's zero-shot, no?
No, that would still be zero shot. Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM. Think of it as analogous to an AGENTS.md included in a prompt. You're not retraining the model, you're simply putting the rest of the prompt into context.
If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.
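To make the distinction concrete, here's a toy sketch (my own throwaway code, not this project's or Chatterbox's actual API) of where the reference clip enters in each case:

    import torch
    import torch.nn as nn

    class ToyVoiceCloner(nn.Module):
        """Toy stand-in for a voice-cloning TTS model."""
        def __init__(self, dim=64):
            super().__init__()
            self.speaker_encoder = nn.Linear(80, dim)   # reference mel frames -> speaker embedding
            self.text_encoder = nn.Embedding(256, dim)  # text tokens -> embeddings
            self.decoder = nn.GRU(dim, 80, batch_first=True)

        def forward(self, text_tokens, ref_mel):
            spk = self.speaker_encoder(ref_mel.mean(dim=1))          # condition on the clip
            txt = self.text_encoder(text_tokens) + spk.unsqueeze(1)  # inject speaker identity
            audio, _ = self.decoder(txt)
            return audio

    model = ToyVoiceCloner()
    text = torch.randint(0, 256, (1, 20))
    ref_clip = torch.randn(1, 100, 80)  # mel frames from a speaker never seen in training

    # Zero-shot: weights stay frozen, the clip is only inference-time context.
    with torch.no_grad():
        cloned = model(text, ref_clip)

    # One-shot would instead mean a gradient update on that single clip:
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss = nn.functional.mse_loss(model(text, ref_clip), ref_clip[:, :20, :])
    loss.backward()
    opt.step()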
Right... And you have 0-shot prompts ("give me a list of animals"), 1-shot prompts ("give me a list of animals, for example: a cat"), 2-shot prompts ("give me a list of animals, for example: a cat; a dog"), etc.
The "shot" refers to how many examples are provided to the LLM in the prompt, and have nothing to do with training or tuning, in every context I've ever seen.
It makes perfect sense; you are simply confusing training samples with inference context. "Zero-shot" refers to zero gradient updates (retraining) required to handle a new class. It does not mean "zero input information."
> how would the model know what voice it should sound like
It uses the reference audio just like a text based model uses a prompt.
> unless it's a celebrity voice or similar included in the training data where it's enough to specify a name
If the voice is in the training data, that is literally the opposite of zero-shot. The entire point of zero-shot is that the model has never encountered the speaker before.
Thus if you feed the model a target voice, i.e. an example of the desired output voice, it sure seems like it should be classified as one-shot.
However, it seems "zero-shot" in voice cloning is relative to learning, in contrast to one-shot learning [1].
So it's a bit of an overloaded term causing confusion, from what I can gather.
[1]: https://en.wikipedia.org/wiki/One-shot_learning_(computer_vi...
In voice cloning, the reference audio is simply the input, not a training example. You wouldn't say an image classifier is doing "one-shot learning" just because you fed it one image to classify. That image is the input. Similarly, the reference audio is the input that conditions the generation. It is zero-shot because the model's weights were never optimized for that specific speaker's manifold.
As with other replies, yes this is a silly name.
> If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.
In the olden days of 2023, you didn’t just find instruct-tuned models sitting on every shelf.
You could use a base model that has only undergone pretraining and can only generate text continuations based on the input it receives. If you provided the model with several examples of a question followed by an answer, and then provided a new question followed by a blank for the next answer, the model understood from the context that it needed to answer the question. This is the most primitive use of ICL, and a very basic way to achieve limited instruction following behavior.
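For instance, the raw prompt might have looked something like this (made-up content, just to show the format):

    Q: What is the capital of France?
    A: Paris
    Q: What is the capital of Japan?
    A: Tokyo
    Q: What is the capital of Italy?
    A: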
With this few-shot example, I would call that few-shot ICL. Not zero-shot, even though the model weights are locked.
But I am learning that it is technically called zero-shot, and I will accept this, even if I think it is a confusingly named concept.
Actually, the general definition fits this context perfectly. In machine learning terms, a specific 'speaker' is simply a 'class.' Therefore, a model generating audio for a speaker it never saw during training is the exact definition of the Zero-Shot Learning problem setup: "a learner observes samples from classes which were not observed during training," as I quoted.
Your explanation just rephrases the very definition you dismissed.
> a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to.
That's not what happens in zero-shot voice cloning, which is why I dismissed your definition copied from Wikipedia.
It is exactly what happens. You are confusing the task (classification vs. generation) with the learning paradigm (zero-shot).
In the voice cloning context, the class is the speaker's voice (not observed during training), samples of which are generated by the machine learning model.
The definition applies 1:1. During inference, it is predicting the conditional probability distribution of audio samples that belong to that unseen class. It is "predict[ing] the class that they belong to," which very same class was "not observed during training."
You're getting hung up on the semantics.
Though I suppose, for the use-case I'm thinking of (v-tubers), you don't really need the ability to dynamically change the prompt; so you could also simplify this to a continuous single-stream "speech to speech" model, which gets its target vocal timbre burned into it during an expensive (but one-time) fine-tuning step.
There are two inputs: audio A ("style") and audio B ("content"). The timbre is taken from A, and the content, pronunciation, prosody, accent, etc. are taken from B.
Strictly speaking, voice cloning models like this and Chatterbox are not "TTS" - they're better thought of as "S+STS", that is, speech+style to speech.
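Roughly, the interface looks like this (a toy sketch with placeholder functions, not Chatterbox's or Sopro's actual API):

    import numpy as np

    def extract_timbre(style_audio: np.ndarray) -> np.ndarray:
        """Stand-in for a speaker encoder: who it sounds like (clip A)."""
        return style_audio[:1024]

    def extract_content(content_audio: np.ndarray) -> np.ndarray:
        """Stand-in for a content encoder: words, prosody, accent (clip B)."""
        return content_audio

    def synthesize(timbre: np.ndarray, content: np.ndarray) -> np.ndarray:
        """Stand-in for the decoder/vocoder that recombines both streams."""
        return content * (1.0 + 0.0 * timbre.mean())  # placeholder math; real models use a neural vocoder

    style_clip = np.random.randn(16000)    # A: the voice to clone
    content_clip = np.random.randn(48000)  # B: what is said, and how
    output = synthesize(extract_timbre(style_clip), extract_content(content_clip))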
One constructive use would be making on-demand audiobooks.
I'd be curious to hear why its advocates believe that this is a net win for society.
Don't you ever think about what the balance of good and bad is when you make something like this? What's the upside? What's the downside?
In this particular case I can only see downsides, if there are upsides I'd love to hear about them. All I see is my elderly family members getting 'me' on their phones asking for help, and falling for it.
I've gotten into the habit of waiting for the other person to speak first when I answer the phone now and the number is unknown to me.
People create that technology, thereby imposing their own lack of morals and ethics onto it. That's the part that most humans in the post-digital age seem to ignore, to purposefully deflect and absolve themselves of any responsibility.
Also, companies will always be controlled by humans who have optimized their lives for greed, not by the ones who specialize in philosophical implications.
The inventors of Novichok or the nuclear bomb didn't have "world peace" in mind. They had "world peace through me enforcing my own will onto my enemies" in mind.
I don't know about Novichok, but nuclear bombs have stopped world wars, at least so far.
If nothing else, it's a debate where we'd need to define our terms.
e.g. If I could have a (local!) clone of my own voice, I could get lots of wait-on-the-phone chores done by typing on my desktop to VOIP while accomplishing other things.
It's like saying .kkrieger looks like a bad game, which it does, but then again .kkrieger is only 96kb or whatever.
.kkrieger looks like an amazing game for the mid-90s. It's incomprehensible that it's only 96kb.
Also keep in mind the processing time. The article above used an NVIDIA L4 with 24 GB of VRAM. Sopro claims 7.5 seconds of processing time on CPU for 30 seconds of audio!
If you want really good quality TTS, you should check out elevenlabs.io
Different tools for different goals.
How is it voice cloning if even the sample doesn't sound like any human being...
(Prominent politician or actor or somebody with a distinct speaking tone?)
With this, you give the voice sample as part of the input, and it immediately tries to duplicate the voice.
few-shot is providing a few examples to steer the LLM
multi-shot is a longer cycle of prompts and refinement
or one opportunity
I'm sure it has its uses, but for anything practical I think Vibe Voice is the only real OSS cloning option. F2/E5 are also very good but have plenty of bad runs; you need to keep re-rolling until you get good outputs.
Nice work!