
GLX: A Bash Replacement–Oriented Programming Language for System Scripting

1•danishk-sinha•46s ago•0 comments

AI gig work explainer [video]

https://www.youtube.com/watch?v=JQ5i_zkiZrw
1•RitesofThing•2m ago•0 comments

So, you want to serialize a B-Tree?

https://kerkour.com/btree-serde-sqlite
1•redcannon218•2m ago•0 comments

ICE Is What Happens When America Refuses to Learn from Black History

https://jemartisby.substack.com/p/ice-is-what-happens-when-america
2•TheUtleyPost•3m ago•0 comments

The Realities of Generative AI in Software Engineering

https://medium.com/takealot-engineering/the-realities-of-generative-ai-in-software-engineering-e1...
1•igitur•5m ago•0 comments

HEINEKEN's Digital Transformation: Why Change Management Comes First?

https://virtocommerce.com/blog/heineken-change-management
1•lizzieyo•6m ago•0 comments

Orbital Rocket Simulation

https://www.donutthejedi.com/
2•tgig•6m ago•1 comments

The 1000 Commits Problem

https://davekiss.com/blog/the-1000-commits-problem
1•foltik•7m ago•0 comments

The Value of Technological Progress

https://worksinprogress.co/issue/the-value-of-technological-progress/
1•ortegaygasset•7m ago•0 comments

Apple's John Ternus Could Be Tim Cook's Successor as CEO

https://www.nytimes.com/2026/01/08/technology/apple-ceo-tim-cook-john-ternus.html
1•tosh•8m ago•0 comments

How problematic is resampling audio from 44.1 to 48 kHz?

https://kevinboone.me/sample48.html
1•brewmarche•8m ago•0 comments

Ollee Watch one: Drop-in smart PCB for Casio F‑91W

https://www.olleewatch.com/shop/p/ollee-watch-one-kit
1•Lwrless•8m ago•0 comments

Code Review in the Age of AI

https://addyo.substack.com/p/code-review-in-the-age-of-ai
1•ostenbom•9m ago•0 comments

GLX: A New Programming Language, Replacement for Bash and Other Shell

1•danishk-sinha•11m ago•0 comments

Cloudflare: /cdn-cgi/ Endpoint

https://developers.cloudflare.com/fundamentals/reference/cdn-cgi-endpoint/
1•tosh•12m ago•0 comments

If you think you are good at math, you need to change your major out of STEM [video]

https://www.youtube.com/watch?v=7s8PfNFeKkQ
1•CGMthrowaway•15m ago•1 comments

Show HN: RCS Composer – a visual editor that outputs RBM JSON

1•lukaslukas•15m ago•0 comments

MCP CLI: Dynamically discovering and interacting with MCP servers

https://github.com/philschmid/mcp-cli
1•philschmidxxx•16m ago•1 comments

A Year of MCP: From Internal Experiment to Industry Standard

https://www.pento.ai/blog/a-year-of-mcp-2025-review
1•leopiney•17m ago•0 comments

Ash HN: Excavating Decision Archaeology

2•brihati•18m ago•0 comments

Scroll to Accept? – AI's pull-to-refresh moment

https://ideas.fin.ai/p/scroll-to-accept
1•destraynor•18m ago•0 comments

Automatic TLS Certificates for Common Lisp with pure-TLS/acme

https://atgreen.github.io/repl-yell/posts/pure-tls-acme/
1•todsacerdoti•18m ago•0 comments

You Can't Debug a System by Blaming a Person

https://humansinsystems.com/blog/you-cant-debug-a-systems-by-blaming-a-person
2•yunusozen•19m ago•0 comments

Beating the House for the Love of Math

https://advantage-player.com/blog/from-excel-to-web-blackjack-calculator
1•prolly97•19m ago•1 comments

AngelScript

https://en.wikipedia.org/wiki/AngelScript
2•flykespice•22m ago•0 comments

An alternative to code mode: serverless MCP

https://www.speakeasy.com/blog/how-we-reduced-token-usage-by-100x-dynamic-toolsets-v2
2•ndimares•23m ago•0 comments

Meta Unveils Nuclear-Power Plan to Fuel Its AI Ambitions

https://www.wsj.com/tech/ai/meta-unveils-sweeping-nuclear-power-plan-to-fuel-its-ai-ambitions-65c...
2•fortran77•24m ago•1 comments

EU states' nod on Mercosur trade deal ends 25-year wait

https://www.aljazeera.com/news/2026/1/9/eu-states-nod-on-mercosur-trade-deal-ends-25-year-wait
2•wslh•25m ago•0 comments

Show HN: LiteGPT – Pre-training a 124M LLM from scratch on a single RTX 4090

https://github.com/kmkrofficial/LiteGPT
2•kmkrworks•29m ago•0 comments

Show HN: Free noise evidence generator for tenant complaints

https://noiseevidence.com/
4•countfeng•29m ago•1 comments

Sopro TTS: A 169M model with zero-shot voice cloning that runs on the CPU

https://github.com/samuel-vitorino/sopro
287•sammyyyyyyy•18h ago

Comments

lukebechtel•17h ago
Very cool. I'd love a slightly larger version with hopefully improved voice quality.

Nice work!

sammyyyyyyy•17h ago
Thanks! Yeah, I kept wanting to postpone publishing until it was a bit better, but being a perfectionist, it would never have been published
lukebechtel•14h ago
understood! Glad you shipped.
convivialdingo•17h ago
Impressive! The cloning and voice effect are great. There's a slight warble in the voice on long vowels, but it's not a huge issue. I'll definitely check it out - we could use voice generation for alerting on one of our projects (no GPUs on the hardware).
sammyyyyyyy•17h ago
Cool! Yeah, the voice quality really depends on the reference audio. Also, try messing with the parameters. All feedback is welcome
realityfactchex•16h ago
That's cool and useful.

IMO, the best alternative is Chatterbox-TTS-Server [0] (slower, but quite high quality).

[0] https://github.com/devnen/Chatterbox-TTS-Server

iLoveOncall•1h ago
Chatterbox-TTS has MUCH, MUCH better output quality though. The output from Sopro TTS (based on the video embedded on GitHub) is absolutely terrible and completely unusable for any serious application, while Chatterbox's outputs are incredible.

I have an RTX 5090, so not exactly what most consumers will have, but still accessible. It's also very fast: around 2 seconds of audio per 1 second of generation.

Here's an example I just generated (first try, 22 seconds runtime, 14 seconds of generation): https://jumpshare.com/s/Vl92l7Rm0IhiIk0jGors

Here's another one, 20 seconds of generation, 30 seconds of runtime, which clones a voice from a Youtuber (I don't use it for nefarious reasons, it's just for the demo): https://jumpshare.com/s/Y61duHpqvkmNfKr4hGFs with the original source for the voice: https://www.youtube.com/@ArbitorIan

sammyyyyyyy•1h ago
You should try it! I wouldn’t say it’s the best, far from it. But I also wouldn’t say it’s terrible. If you have a 5090, then yes, you can run much more powerful models in real time. Chatterbox is a great model though
iLoveOncall•41m ago
> But also wouldn’t say it’s terrible.

But you included 3 samples on your GitHub video and they all sound extremely robotic and have very bad artifacts?

kkzz99•52m ago
I've been using Higgs-Audio for a while now as my primary TTS system. How would you say Chatterbox compares to it, if you have experience with both?
iLoveOncall•43m ago
I haven't used it. I compared it with T5Gemma TTS that came out recently and Chatterbox is much better in all aspects, but especially in voice cloning where T5Gemma basically did not work.
blitzar•16h ago
Mission Impossible cloning skills without the long compile time.

"The pleasure of Buzby's company is what I most enjoy. He put a tack on Miss Yancy's chair ..."

https://www.youtube.com/watch?v=H2kIN9PgvNo

https://literalminded.wordpress.com/2006/05/05/a-panphonic-p...

btbuildem•16h ago
It's impressive given the constraints!

Would you consider releasing a more capable version that renders with fewer artifacts (and maybe requires a bit more processing power)?

Chatterbox is my go-to, this could be a nice alternative were it capable of high-fidelity results!

sammyyyyyyy•16h ago
This is my side “hobby”, and compute is quite expensive. But if the community’s response is good, I will definitely think about it! Btw, Chatterbox is a great model and an inspiration
bicepjai•15h ago
Thanks! Can you share details about the compute economics you dealt with?
sammyyyyyyy•14h ago
Yeah, sure. The training cost about $250, which is quite low by today’s standards. And I spent a bit more on ablations and research
littlestymaar•9h ago
Very cool work, especially for a hobby project.

Do you have any plans to publish a blog post on how you did it? What training data, and how much? Your training and ablation methodology, etc.?

elaus•16h ago
Very nice to have done this by yourself, locally.

I wish there was an open/local tts model with voice cloning as good as 11l (for non-english languages even)

sammyyyyyyy•16h ago
Yeah, we are not quite there, but I’m sure we are not far either
SoftTalker•16h ago
What does "zero-shot" mean in this context?
nateb2022•16h ago
> Zero-shot learning (ZSL) is a problem setup in deep learning where, at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to. The name is a play on words based on the earlier concept of one-shot learning, in which classification can be learned from only one, or a few, examples.

https://en.wikipedia.org/wiki/Zero-shot_learning

edit: since there seems to be some degree of confusion regarding this definition, I'll break it down more simply:

We are modeling the conditional probability P(Audio|Voice). If the model samples from this distribution for a Voice class not observed during training, it is by definition zero-shot.

"Prediction" here is not a simple classification, but the estimation of this conditional probability distribution for a Voice class not observed during training.

Providing reference audio to a model at inference-time is no different than including an AGENTS.md when interacting with an LLM. You're providing context, not updating the model weights.
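That distinction can be sketched in a toy example (hypothetical code, not Sopro's actual architecture): the reference audio only conditions a forward pass through frozen weights, so no gradient update ever happens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "model" weights, fixed once training is done.
W = rng.standard_normal((4, 4))
W_before = W.copy()

def embed_reference(ref_audio):
    """Toy stand-in for a speaker encoder: collapse audio frames
    into a single conditioning vector."""
    return np.tanh(ref_audio.mean(axis=0))

def generate(text_features, ref_audio):
    """Zero-shot: the reference voice is just extra input.
    W is never modified, whichever speaker is supplied."""
    speaker = embed_reference(ref_audio)
    return W @ (text_features + speaker)

text = rng.standard_normal(4)
voice_a = rng.standard_normal((10, 4))  # unseen speaker A
voice_b = rng.standard_normal((10, 4))  # unseen speaker B

out_a = generate(text, voice_a)
out_b = generate(text, voice_b)
# Different references yield different outputs, yet W == W_before:
# this is conditioning, not learning. One-shot *learning* would
# instead take a gradient step on W using the reference clip.
```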

woodson•16h ago
This generic answer from Wikipedia is not very helpful in this context. Zero-shot voice cloning in TTS usually means that data of the target speaker you want the generated speech to sound like does not need to be included in the training data used to train the TTS models. In other words, you can provide an audio sample of the target speaker together with the text to be spoken to generate the audio that sounds like it was spoken by that speaker.
coder543•16h ago
Why wouldn’t that be one-shot voice cloning? The concept of calling it zero shot doesn’t really make sense to me.
geocar•16h ago
So if you get your target to record (say) 1 hour of audio, that's a one-shot.

If you didn't do that (because you have 100 hours of other people talking), that's zero-shots, no?

nateb2022•15h ago
> So if you get your target to record (say) 1 hour of audio, that's a one-shot.

No, that would still be zero shot. Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM. Think of it as analogous to an AGENTS.md included in a prompt. You're not retraining the model, you're simply putting the rest of the prompt into context.

If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.

ImPostingOnHN•4m ago
> Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM.

Right... And you have 0-shot prompts ("give me a list of animals"), 1-shot prompts ("give me a list of animals, for example: a cat"), 2-shot prompts ("give me a list of animals, for example: a cat; a dog"), etc.

The "shot" refers to how many examples are provided to the LLM in the prompt, and has nothing to do with training or tuning, in every context I've ever seen.
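A hypothetical helper makes that counting convention concrete: the shot count is simply the number of worked examples packed into the prompt, with the model weights untouched.

```python
def build_prompt(instruction: str, examples: list[str]) -> str:
    """Compose an n-shot prompt: n is simply len(examples)."""
    lines = [instruction]
    if examples:
        lines.append("For example:")
        lines.extend(f"- {ex}" for ex in examples)
    return "\n".join(lines)

zero_shot = build_prompt("Give me a list of animals", [])
one_shot = build_prompt("Give me a list of animals", ["a cat"])
two_shot = build_prompt("Give me a list of animals", ["a cat", "a dog"])
```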

woodson•16h ago
I don't disagree, but that's what people started calling it. Zero-shot doesn't really make sense anyway: how would the model know what voice it should sound like (unless it's a celebrity voice or similar that's included in the training data, where it's enough to specify a name)?
nateb2022•15h ago
> Zero-shot doesn't make sense anyway, as how would the model know what voice it should sound like (unless it's a celebrity voice or similar included in the training data where it's enough to specify a name).

It makes perfect sense; you are simply confusing training samples with inference context. "Zero-shot" refers to zero gradient updates (retraining) required to handle a new class. It does not mean "zero input information."

> how would the model know what voice it should sound like

It uses the reference audio just like a text based model uses a prompt.

> unless it's a celebrity voice or similar included in the training data where it's enough to specify a name

If the voice is in the training data, that is literally the opposite of zero-shot. The entire point of zero-shot is that the model has never encountered the speaker before.

magicalhippo•14h ago
With LLMs I've seen zero-shot used to describe scenarios where there's no example, just "take this and output JSON", while one-shot has the prompt include an example like "take this and output JSON; for this data the JSON should look like this".

Thus if you feed the model a target voice, i.e. an example of the desired output voice, it sure seems like it should be classified as one-shot.

However it seems the zero-shot in voice cloning is relative to learning, and in contrast to one-shot learning[1].

So it's a bit of an overloaded term that causes confusion, from what I can gather.

[1]: https://en.wikipedia.org/wiki/One-shot_learning_(computer_vi...

nateb2022•14h ago
The confusion clears up if you stop conflating contextual conditioning (prompting) with actual Learning (weight updates). For LLMs, "few-shot prompting" is technically a misnomer that stuck; you are just establishing a pattern in the context window, not training the model.

In voice cloning, the reference audio is simply the input, not a training example. You wouldn't say an image classifier is doing "one-shot learning" just because you fed it one image to classify. That image is the input. Similarly, the reference audio is the input that conditions the generation. It is zero-shot because the model's weights were never optimized for that specific speaker's manifold.

ben_w•15h ago
Zero-shot means zero-retraining, so think along the lines of "Do you need to modify the weights? Or can you keep the weights fixed and you only need to supply an example?"

As with other replies, yes this is a silly name.

nateb2022•15h ago
Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM. Think of it as analogous to an AGENTS.md included in a prompt. You're not retraining the model, you're simply putting the rest of the prompt into context.

If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.

coder543•14h ago
To me, a closer analogy is In Context Learning.

In the olden days of 2023, you didn’t just find instruct-tuned models sitting on every shelf.

You could use a base model that has only undergone pretraining and can only generate text continuations based on the input it receives. If you provided the model with several examples of a question followed by an answer, and then provided a new question followed by a blank for the next answer, the model understood from the context that it needed to answer the question. This is the most primitive use of ICL, and a very basic way to achieve limited instruction following behavior.

With this few-shot example, I would call that few-shot ICL. Not zero shot, even though the model weights are locked.

But, I am learning that it is technically called zero shot, and I will accept this, even if I think it is a confusingly named concept.

oofbey•13h ago
It’s nonsensical to call it “zero shot” when a sample of the voice is provided. The term “zero-shot cloning” implies you have some representation of the voice from another domain, e.g. a text description of the voice. What they’re doing is ABSOLUTELY one-shot cloning. I don’t care if lots of TTS folks use the term this way; they’re wrong.
nateb2022•15h ago
> This generic answer from Wikipedia is not very helpful in this context.

Actually, the general definition fits this context perfectly. In machine learning terms, a specific 'speaker' is simply a 'class.' Therefore, a model generating audio for a speaker it never saw during training is the exact definition of the Zero-Shot Learning problem setup: "a learner observes samples from classes which were not observed during training," as I quoted.

Your explanation just rephrases the very definition you dismissed.

woodson•15h ago
From your definition:

> a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to.

That's not what happens in zero-shot voice cloning, which is why I dismissed your definition copied from Wikipedia.

nateb2022•14h ago
> That's not what happens in zero-shot voice cloning

It is exactly what happens. You are confusing the task (classification vs. generation) with the learning paradigm (zero-shot).

In the voice cloning context, the class is the speaker's voice (not observed during training), samples of which are generated by the machine learning model.

The definition applies 1:1. During inference, it is predicting the conditional probability distribution of audio samples that belong to that unseen class. It is "predict[ing] the class that they belong to," which very same class was "not observed during training."

You're getting hung up on the semantics.

woodson•14h ago
Jeez, OP asked what it means in this context (zero-shot voice cloning), where you quoted a generic definition copied from Wikipedia. I defined it concretely for this context. Don't take it as a slight, there is no need to get all argumentative.
numpad0•12h ago
I think the point is it's not zero shot if a sample is needed. A system that require one sample is usually considered one-shot, or few-shot if it needs few, etc etc.
derefr•16h ago
Is there yet any model like this, but which works as a "speech plus speech to speech" voice modulator — i.e. taking a fixed audio sample (the prompt), plus a continuous audio stream (the input), and transforming any speech component of the input to have the tone and timbre of the voice in the prompt, resulting in a continuous audio output stream? (Ideally, while passing through non-speech parts of the input audio stream; but those could also be handled other ways, with traditional source separation techniques, microphone arrays, etc.)

Though I suppose, for the use-case I'm thinking of (v-tubers), you don't really need the ability to dynamically change the prompt; so you could also simplify this to a continuous single-stream "speech to speech" model, which gets its target vocal timbre burned into it during an expensive (but one-time) fine-tuning step.

vunderba•16h ago
I don’t know about open models, but ElevenLabs has had this idea of mapping intonation/emotion/inflections onto a designated TTS voice for a while.

https://elevenlabs.io/blog/speech-to-speech

gcr•15h ago
Chatterbox TTS does this in “voice cloning” mode but you have to implement the streaming part yourself.

There are two inputs: audio A (“style”) and B (“content”). The timbre is taken from A, and the content, pronunciation, prosody, accent, etc. are taken from B.

Strictly speaking, voice cloning models like this and Chatterbox are not “TTS” - they’re better thought of as “S+STS”, that is, speech+style to speech

qingcharles•15h ago
There must be something out there that does this reliably as I often see/hear v-tubers doing it.
lumerios•15h ago
Yes, check out RVC (Retrieval-based Voice Conversion), which I believe is the only good open-source voice changer. Currently there's a bit of a conflict between the original creator and the current developers, so don't use the main fork. I think you'll be able to find a more up-to-date fork that's in English.
nunobrito•16h ago
Very cool. Now the next challenge (for me) is how to convert this to Dart and run it on Android. :-)
sammyyyyyyy•16h ago
Thanks! When (and if) you do that, send me a PM!
woodson•16h ago
Does the 169M include the ~90M params for the Mimi codec? Interesting approach using FiLM for speaker conditioning.
sammyyyyyyy•15h ago
No, it doesn’t.
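For context on the FiLM mention: feature-wise linear modulation conditions a network by predicting a per-channel scale and shift from a conditioning vector (here, a speaker embedding). A minimal numpy sketch, illustrative only and not Sopro's actual code:

```python
import numpy as np

rng = np.random.default_rng(1)
d_feat, d_spk = 8, 4

# Learned projections mapping the speaker embedding to a
# per-channel scale (gamma) and shift (beta).
W_gamma = 0.1 * rng.standard_normal((d_feat, d_spk))
W_beta = 0.1 * rng.standard_normal((d_feat, d_spk))

def film(features, speaker_emb):
    """FiLM: modulate each feature channel with an affine
    transform predicted from the conditioning vector."""
    gamma = 1.0 + W_gamma @ speaker_emb  # scale, centred at 1
    beta = W_beta @ speaker_emb          # shift
    return gamma * features + beta       # broadcasts over time

features = rng.standard_normal((5, d_feat))  # (time, channels)
speaker = rng.standard_normal(d_spk)
out = film(features, speaker)
```

A zero speaker embedding leaves the features unchanged (gamma = 1, beta = 0), which is why centring the scale at 1 is a common choice.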
brikym•16h ago
A scammer's dream.
jacquesm•15h ago
That's exactly how I see it.
soulofmischief•14h ago
Unfortunately, we have to prepare for a future where this kind of stuff is everywhere. We will have to rethink how trust is modeled online and offline.
gosub100•14h ago
unfortunately I think you're right, the cons massively outweigh the pros.

One constructive use would be making on-demand audiobooks.

CoastalCoder•1h ago
I agree.

I'd be curious to hear why its advocates believe that this is a net win for society.

jacquesm•15h ago
What could possibly go wrong...

Don't you ever think about what the balance of good and bad is when you make something like this? What's the upside? What's the downside?

In this particular case I can only see downsides, if there are upsides I'd love to hear about them. All I see is my elderly family members getting 'me' on their phones asking for help, and falling for it.

I've gotten into the habit of waiting for the other person to speak first when I answer the phone now and the number is unknown to me.

sammyyyyyyy•15h ago
Yes, you are right. However, there are many upsides to this kind of technology. For example, it can restore the voices of people affected by various diseases
jacquesm•15h ago
Ok, that's an interesting angle, I had not thought of that, but of course you'd still need a good sample of them from before that happened. Thank you for the explanation.
Alex2037•15h ago
Are you under the impression that this is the first such tool? It's not. It's not even the hundredth. This Pandora's box was opened a long time ago.
idiotsecant•14h ago
There is no such thing as bad technology.
jacquesm•14h ago
That is simply not true. There is lots of bad technology.
idiotsecant•13h ago
Like what? There's no technology that simply by existing causes harm to the world, people do that part.
cookiengineer•13h ago
> Like what? There's no technology that simply by existing causes harm to the world, people do that part.

People create that technology, thereby imprinting their own lack of morals and ethics onto it. That's the part that most humans in the post-digital age seem to ignore, purposefully deflecting and absolving themselves of any responsibility.

Also, companies will always be controlled by humans that optimized their life for greed, not by the ones that specialized on philosophical implications.

The inventors of novichok or the nuclear bomb didn't have "world peace" in mind. They had "world peace through me enforcing my own will onto my enemies" in mind.

CamperBob2•9h ago
> The inventors of novichok or the nuclear bomb didn't have "world peace" in mind. They had "world peace through me enforcing my own will onto my enemies" in mind.

I don't know about Novichok, but nuclear bombs have stopped world wars, at least so far.

numpad0•12h ago
Like that chemical weapon that was specifically designed to react with gas-mask absorbent materials, so that it activates on the protected side and circumvents filtration (long banned, since the end of WWI).
Alex2037•12h ago
who gets to decide which technology must be banned? the same people who decide which books must be burned?
jacquesm•11h ago
Surely that would be you.
CoastalCoder•1h ago
> There is no such thing as bad technology.

If nothing else, it's a debate where we'd need to define our terms.

jbaber•13h ago
I am unhappy about the criminal dimension of voice cloning, too, but there are plenty of use cases.

e.g. If I could have a (local!) clone of my own voice, I could get lots of wait-on-the-phone chores done by typing on my desktop to VOIP while accomplishing other things.

anigbrowl•10h ago
But why do you need it to be a clone of your voice? A generic TTS like Siri or a vocaloid would be sufficient.
sergiotapia•15h ago
It sounds a lot like RFK Jr! Does anyone have any more casual examples?
guerrilla•13h ago
I don't understand the comments here at all. I played the audio and it sounds absolutely horrible, far worse than computer voices sounded fifteen years ago. Not even the most feeble minded person would mistake that as a human. Am I not hearing the same thing everyone else is hearing? It sounds straight up corrupted to me. Tested in different browsers, no difference.
sammyyyyyyy•13h ago
As I said, some reference voices can lead to bad voice quality. But if it sounds that bad, that's probably not the cause. Would love to dig into it if you want
guerrilla•13h ago
I mean I'm talking about the mp4. How could people possibly be worried about scammers after listening to that?
sammyyyyyyy•13h ago
I didn’t specially cherry-pick those examples. You can try it for yourself. But thanks for the feedback anyway
guerrilla•13h ago
No shade on you. It's definitely impressive. I just didn't understand people's reactions.
jrmg•3h ago
It sounds like someone using an electrolarynx to me.
codefreakxff•11h ago
I agree with the comment above. I have not logged into Hacker News in _years_ but did so today just to weigh in here. If people are saying that the audio sounds great, then there is definitely something going on with a subset of users where we are only hearing garbled words with a LOT of distortion. This does not sound like natural speech to me at all. It sounds more like a warped cassette tape. And I do not mean to slight your work at all. I am actually incredibly puzzled trying to understand why my perception of this is so radically different from others'!
guerrilla•10h ago
Thank you for commenting. I wonder if this could be another situation like "the dress" (2015) or maybe something is wrong with our codecs...
Mashimo•7h ago
No, nothing wrong with your codecs. It's sounds shitty. But given the small size and speed it's still impressive.

It's like saying .kkrieger looks like a bad game, which it does, but then again .kkrieger is only 96kb or whatever.

guerrilla•6h ago
How big are TTS models like this usually?

.kkrieger looks like an amazing game for the mid-90s. It's incomprehensible that it's only 96kb.

Mashimo•5h ago
Here is an overview: https://www.inferless.com/learn/comparing-different-text-to-...

Also keep in mind the processing time. The article above used an NVIDIA L4 with 24 GB of VRAM. Sopro claims 7.5 seconds of processing time on the CPU for 30 seconds of audio!

If you want to get real good quality TTS, you should check out elevenlabs.io

Different tools for different goals.
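For reference, the figures quoted above correspond to a real-time factor (processing time divided by audio duration; below 1 means faster than real time):

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = seconds spent generating per second of audio produced."""
    return processing_s / audio_s

# Sopro's claimed CPU figures from the comment above.
rtf = real_time_factor(7.5, 30.0)  # → 0.25, i.e. 4x faster than real time
```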

foolserrandboy•13h ago
I thought it was RFK
serf•13h ago
spasmodic dysphonia as a service.
jackyysq•9h ago
Same here. I tried a few different voices, including my kids' and my own; the generated audio is not similar at all. It's not even a proper voice
wildmXranat•2h ago
Yes, if this selected piece is the best that was available to use as a showcase, it's immediately off-putting in its distortion and mangled pronunciation.
eigenvalue•21m ago
Thank you, I was scrolling and scrolling in utter disbelief. It sounds absolutely dreadful. Would drive me nuts to listen to for more than a minute.
Gathering6678•12h ago
Emm...I played the sample audio and it was...horrible?

How is it voice cloning if even the sample doesn't sound like any human being...

sammyyyyyyy•12h ago
I should have posted the reference audio used with the examples. Honestly it doesn’t sound so different from them. Voice cloning can be from a cartoon too, doesn’t have to be from a human being
nemomarx•12h ago
A before / after with the reference and output seems useful to me, and maybe a range from more generic to more recognizable / celebrity voice samples so people can kinda see how it tackles different ones?

(Prominent politician or actor or somebody with a distinct speaking tone?)

Gathering6678•11h ago
That is probably a good idea. I was so confused listening to the example.
sammyyyyyyy•12h ago
Also, I didn’t want to use known voices as the example, so I ended up using generic ones from the datasets
krunck•12h ago
I just had some amusing results using text with lots of exclamations and turning up the temperature. Good fun.
yamal4321•11h ago
Tried English. There are similarities. Really impressive for such a budget. Also incredibly easy to use, thanks for this
xiconfjs•2h ago
But it's English-only, so what else could you have tried? Asking because I'm interested in a German version :)
VerifiedReports•9h ago
What is "zero-shot" supposed to mean?
carteazy•8h ago
I believe in this case it means that you do not need to provide other voice samples to get a good clone.
spwa4•7h ago
It means there is zero training involved in getting from voice sample to voice duplicate. There used to be models that take a voice sample, run 5 or 10 training iterations (which of course takes 10 mins, or a few hours if you have hardware as shitty as mine), and only then duplicate the voice.

With this, you give the voice sample as part of the input, and it immediately tries to duplicate the voice.

x3haloed•6h ago
Doesn’t NeuTTS work the same way?
onion2k•6h ago
zero-shot is a single prompt (maybe with additional context in the form of files.)

few-shot is providing a few examples to steer the LLM

multi-shot is a longer cycle of prompts and refinement

moffkalast•4h ago
if you had one-shot

or one opportunity

nake89•4h ago
to seize everything you ever wanted in one moment
mikkupikku•1h ago
I've been calling good results from a single prompt "single-shot." Is this not right?
samtheprogram•18m ago
This is one-shot.
LoveMortuus•5h ago
This is very cool! And it'll only get better. I do wonder, if, at least as a patch-up job, they could do some light audio processing to remove the raspiness from the voices.
armcat•4h ago
Super nice! I've been using Kokoro locally, which is 82M parameters and runs (and sounds) amazing! https://huggingface.co/hexgrad/Kokoro-82M
machiaweliczny•2h ago
I tried Kokoro-JS, which I think runs in the browser, and it was way too slow latency-wise; it also didn't support the language I wanted
machiaweliczny•2h ago
BTW, does anyone know of a good assistant voice stack that's open source? I used https://github.com/ricky0123/vad for voice activation (works well), then just the Web Speech API as that's the fastest, and then a commercial TTS for speed, as I couldn't find a good one.
jokethrowaway•3h ago
Sorry but the quality is too bad.

I'm sure it has its uses, but for anything practical I think Vibe Voice is the only real OSS cloning option. F2/E5 are also very good but have plenty of bad runs; you need to keep re-rolling.
