Would probably want to do similar to balance crossfade anyway... having each speaker's input offset from center instead of straight mono.
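A constant-power pan law would do it. A minimal sketch of the idea (numpy only; voice_a/voice_b are stand-ins for two mono TTS outputs):

  import numpy as np

  def pan(mono, position):
      # position in [-1, 1]: -1 hard left, 0 center, +1 hard right.
      theta = (position + 1.0) * np.pi / 4.0    # map to [0, pi/2]
      # Equal-power law keeps perceived loudness constant across the field.
      return np.stack([np.cos(theta) * mono, np.sin(theta) * mono], axis=-1)

  voice_a = np.random.randn(24000) * 0.1        # stand-in mono tracks
  voice_b = np.random.randn(24000) * 0.1
  mix = pan(voice_a, -0.3) + pan(voice_b, 0.3)  # each speaker offset from center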
I generally don't like a lot of the AI-generated slop that's starting to pop up on YouTube these days... I did enjoy some of the Reddit story channels, but have completely stopped with it all now. With the AI stuff, it really becomes apparent with dates/ages and when numbers are spoken. Dates, ages, and timelines are just off in the generated stories and really should be human-tweaked. As for the voice generation, the way it says a year or a measurement is just not how English speakers (US or otherwise) speak.
Most that claim to do a British accent end up sounding like Kelsey Grammer - sort of an American accent pretending to be British.
They could have skipped the singing part; it would be better if the model did not try to do that :)
1. https://music.youtube.com/watch?v=xl8thVrlvjI&si=dU6aIJIPWSs...
If you're in a company and need a model, which one do you think you're getting past compliance & legal - the one that says MIT or the one that says "non-commercial use only"?
From what I understand, it's the more basic models/techniques that undersample, so there is a series of audio pulses which gives it that buzzy quality. Better models produce smoother output.
https://www.perfectcircuit.com/signal/difference-between-wav...
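You can hear the general effect with a toy example (scipy; not what any particular model does internally, just crude versus band-limited reconstruction):

  import numpy as np
  from scipy.signal import resample_poly

  sr_lo, sr_hi = 8000, 48000
  t = np.arange(sr_lo) / sr_lo
  tone = np.sin(2 * np.pi * 440 * t)     # one second of 440 Hz at 8 kHz

  # Zero-order hold: a stair-step of pulses, leaving buzzy spectral
  # images of the tone at multiples of the low rate.
  buzzy = np.repeat(tone, sr_hi // sr_lo)

  # Band-limited (polyphase) resampling: the smooth version.
  smooth = resample_poly(tone, sr_hi, sr_lo)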
Still not at the astonishing level of Google's NotebookLM text-to-speech, which has been out for a while now. I still can't believe how good that one is.
https://techcommunity.microsoft.com/blog/azure-ai-foundry-bl...
So that's a useful next step: for multi-voice TTS models, make them sound like they're in the same room.
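Even convolving every voice with one shared room impulse response gets you a surprising amount of the way there. A rough sketch with a synthetic IR (a real recorded IR would sound much better):

  import numpy as np
  from scipy.signal import fftconvolve

  sr = 24000
  rng = np.random.default_rng(0)

  # Toy impulse response: exponentially decaying noise, ~0.5 s tail.
  n = sr // 2
  ir = rng.standard_normal(n) * np.exp(-6.0 * np.arange(n) / n)
  ir /= np.abs(ir).sum()

  voice_a = rng.standard_normal(sr) * 0.1  # stand-ins for two TTS outputs
  voice_b = rng.standard_normal(sr) * 0.1

  # The same "room" applied to both voices, mixed under the dry signal.
  wet = fftconvolve(voice_a + voice_b, ir)[:sr]
  mix = 0.8 * (voice_a + voice_b) + 0.2 * wet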
The voices are decent, but the intonation is off on almost every phrase, and there is a very clear robotic-sounding modulation. It's generally very impressive compared to many text-to-speech solutions from a few years ago, but for today, I find it very uninspiring. The AI-generated voice you hear all over YouTube Shorts is at least as good as most of the samples on this page.
The only part that seemed impressive to me was the English + (Mandarin?) Chinese sample; that one seemed to switch very seamlessly between the two. But this may well be simply because (1) I'm not familiar with any Chinese language, so I couldn't really judge the pronunciation, and (2) the different character systems make it extremely clear that the model needs to switch between different languages. Peut-être que cela n'aurait pas été si simple ("maybe it would not have been so simple") if it had been switching between two languages using the same writing system - I'm particularly curious how it would have read "simple" in the phrase above (I think it should be read with the French pronunciation, for example).
And, of course, the singing part is painfully bad, I am very curious why they even included it.
For example, I've been looking at models and LoRAs for generating images, and the boards are _full_ of ones that will generate women well or in some particular style. Quite often at least a couple of the preview images for each are hidden behind a button because they contain nudity. Clearly the intent is that they are at least able to generate porn containing women. There's a small handful focused on men, and they're very aware of it; they all have notes lampshading how oddball they are to even exist.
I would expect that the effect is not as pronounced in the speech-generation world, but it must still exist.
Female voices are often rated as being clearer, easier to understand, "warmer", etc.
Why this is the case is still an open question, but it's definitely more complex than just SEX.
> satisfying the sexual desires of
So, "sex" as a reference to "sexual desires". In English, it just so happens that "sex" has other meanings, but those weren't in play at the time.
Women also prefer female voices.
I trust the human scores in the paper. At least my ear aligns with that figure.
With stuff like this coming out in the open, I wonder if ElevenLabs will maintain its huge ARR lead in the field. I really don't see how they can stay ahead when their offering is getting trounced by open models.
https://yummy-fir-7a4.notion.site/dia
I am not sure why, but I find the pacing of the Parakeet-based models (like Dia) to be much more realistic.
There are people – and it does not matter what it's about – who will overstate the progress made (and others who will understate it, case in point). Neither should put a damper on progress. This is the best I personally have heard so far, but I certainly might have missed something.
However Kokoro-82M is an absolute triumph in the small model space. It curbstomps models 10-20x its size in terms of quality while also being runnable on like, a Raspberry Pi. It’s the kind of thing I’m surprised even exists. Its downside is that it isn’t super expressive, but the af_heart voice is extremely clean, and Kokoro is way more reliable than other TTS models: It doesn’t have the common failure mode where you occasionally have a couple extra syllables thrown in because you picked a bad seed.
If you want something that can do convincing voice acting, either pay for ElevenLabs or keep waiting. If you’re trying to build a local AI assistant, Kokoro is perfect, just use that and check the space again in like 6 months to see if something’s beaten it. https://huggingface.co/hexgrad/Kokoro-82M
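For anyone who wants to hear it, a minimal run looks something like this, assuming the kokoro PyPI package's current KPipeline interface (check the model card if it's changed):

  # pip install kokoro soundfile
  from kokoro import KPipeline
  import soundfile as sf

  pipeline = KPipeline(lang_code='a')     # 'a' = American English
  text = "Kokoro runs comfortably on small hardware."

  # Yields (graphemes, phonemes, audio) per chunk; output is 24 kHz.
  for i, (gs, ps, audio) in enumerate(pipeline(text, voice='af_heart')):
      sf.write(f'kokoro_{i}.wav', audio, 24000)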
The model is good, but I will say their inference code leaves a lot to be desired. I had to rewrite large portions of it for simple things like correct chunking and streaming. The advertised expressive keywords are very much hit-or-miss, and the devs have unfortunately gone dark.
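For anyone hitting the same wall, the chunking fix is conceptually simple. This is the shape of what I mean, with `synthesize` as a hypothetical stand-in for the actual model call:

  import re

  def stream_tts(text, synthesize, max_chars=300):
      # Split on sentence boundaries, batch up to max_chars, and yield
      # audio as soon as each chunk is ready instead of waiting for all of it.
      buf = ""
      for sentence in re.split(r'(?<=[.!?])\s+', text):
          if buf and len(buf) + len(sentence) + 1 > max_chars:
              yield synthesize(buf)
              buf = ""
          buf = (buf + " " + sentence).strip()
      if buf:
          yield synthesize(buf)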
https://github.com/mlang/llm-tts
Strictly speaking, even music generation fits the usage pattern: text in, audio out.
llm-tts is far from complete, but it makes it relatively "easy" to try a few models in a uniform way.
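The uniform interface really is that small. This is not llm-tts's actual API, just the shape of the abstraction:

  from typing import Protocol

  class TextToAudio(Protocol):
      def generate(self, text: str) -> bytes:  # raw audio out
          ...

  def render(model: TextToAudio, script: str) -> bytes:
      # TTS, podcast dialogue, even music generation all fit this call shape.
      return model.generate(script)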
> In fact, we intentionally decided not to denoise our training data because we think it's an interesting feature for BGM to show up at just the right moment. You can think of it as a little easter egg we left for you.
It's not a bug, it's a feature! Okaaaaay
I would love to have a model that can make sense of things like stressing particular syllables or phonemes to make a point.
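SSML, the W3C markup that some commercial engines already accept, expresses exactly this kind of control; it would be great if open models learned it too. A sketch of what it looks like:

  # SSML sketch: stress via <emphasis>, exact sounds via <phoneme>.
  ssml = """
  <speak>
    I said the <emphasis level="strong">red</emphasis> one, not the blue one.
    That's a <phoneme alphabet="ipa" ph="ˈrɛkɚd">record</phoneme>.
  </speak>
  """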
I'm actually more interested in STT (ASR) but the choices there are rather limited.
edit: Ah, there's a lock icon next to the name of each proprietary model.
Generally if a model is trending on that page, there’s enough juice for it to be worth a try. There’s a lot of subjective-opinion-having in this space, so beyond “is it trending on HF” the best eval is your own ears. But if something is not trending on HF it is unlikely to be much good.
Hey look! [enthusiastic] Should we tell the others? Maybe not ... [giggles]
etc. In fact, I think this kind of thing is absolutely necessary if you want to use this to replace a voice actor.
It seems that only variants of English, Spanish, and Chinese are somewhat working.
Disclaimer: I used to work for Soniox
In Android Auto / CarPlay I can't even get voice guidance that works properly, much less reading notifications or composing a reply using STT.
edit: I had forgotten about Jai Ho (Slumdog Millionaire) and Lose Yourself (8 Mile)
Nothing on that list - movies or songs - had the cultural impact of Furious 7 or See You Again.
Someone else mentioned in this thread that you cannot add annotations to the text to control the output. I think for these models to really level up there will have to be an intermediate step that takes your regular text as input and it generates an annotated output, which can be passed to the TTS model. That would give users way more control over the final output, since they would be able to inspect and tweak any details instead of expecting the model to get everything correctly in a single pass.
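Agreed. Sketching the two-pass idea (`llm_annotate` and `tts` are hypothetical stand-ins for an LLM call and a tag-aware TTS model):

  ANNOTATE_PROMPT = (
      "Insert performance tags like [pause], [giggles], [whispering] into "
      "this script where a voice actor would naturally use them. "
      "Return only the annotated script."
  )

  def render(script, llm_annotate, tts):
      annotated = llm_annotate(ANNOTATE_PROMPT, script)
      print(annotated)       # the inspectable, hand-tweakable middle step
      return tts(annotated)  # only then commit to audio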
Compared to IBM's Stephen Hawking chair, maybe. But Apple TTS is not acceptable quality by any modern understanding of SotA, IMO.
If you need a non-visual output of text, SotA is a waste of electrons.
If you want to try and mimic a human speaker, then it ain’t.
The question is why you would need to have the computer sound more human, except for "because I can".
We can't be friends
I think translation would be a big use - maybe translating your voice to another language while maintaining emotion and intonation, or dubbing content (videos, movies, podcasts, ...) that isn't otherwise available in your native language.
Traditional non-ML TTS for longer content like podcasts or audiobooks seems like it'd become grating to the point of being unlistenable, or at least a significantly worse experience. That kind of listening stands to benefit from more natural-sounding voices that can place emphasis in the right places.
Since Stephen Hawking was brought up, there are likely also people with voice-impairing illnesses who would like to speak in their own voice again (in addition to those who are fine with a robotic voice). Or alternatively, people who are uncomfortable with their natural voice and want to communicate closer to how they wish to be perceived.
Could also potentially be used for new forms of interactive media that aren't currently feasible - customised movies, audio dramas where the listener plays a role, videogame NPCs that react with more than just prerecorded lines, etc.
There's a lot of stuff I don't have time to sit down and read, but want to listen to while I cook/laundry/shower/drive/etc.
Often recordings don't exist. Or when they do, an audiobook just has a bad voiceover artist, or one that just rubs you the wrong way.
The more human text-to-speech sounds, the easier and less distracting it is to listen to. There's real value in it, it's not "because I can".
You know how it's nicer to read in 300 dpi instead of 72 dpi? Or in Garamond rather than Courier? Or in Helvetica rather than Comic Sans? It's like that, only for speech.
Some of them have tone wobbles, which IIRC were more common in early TTS models. Looks like the huge context window is really helping out here.
Making it "open" would be unwise for a commercial entity. =3
For example, many academic data sets are not public domain and can't be used in a commercial context. A GPL claim on that data is often an argument about which thief showed up first.
Rule #24: A lawyer's Strategic Truth is to never lie, but also to avoid voluntarily disclosing information that may help opponents.
Thus, a business will never disclose they paid a fool to break laws for them... =3
Indeed, these adversarial behaviors do not follow the spirit of FOSS community standards. If a project started as FOSS, then FOSS it should remain. =3
My usage is for Chinese, but the phonemes it generated looked very much like IPA.
But I do agree with you in that generally there's probably no negative connotation (yet).
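If you want to sanity-check that output, the phonemizer package with the espeak-ng backend emits IPA-style phonemes for Mandarin too, assuming espeak-ng is installed with the 'cmn' voice:

  # pip install phonemizer   (requires the espeak-ng system package)
  from phonemizer import phonemize

  print(phonemize("hello world", language='en-us', backend='espeak'))
  print(phonemize("你好世界", language='cmn', backend='espeak'))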
A 100M podcast model
https://github.com/microsoft/VibeVoice
I was trying to get this working on Strix Halo.
https://github.com/WhisperSpeech/WhisperSpeech
Or is there some OpenAI official Whisper TTS?