frontpage.

A Tale of Two Standards, POSIX and Win32 (2005)

https://www.samba.org/samba/news/articles/low_point/tale_two_stds_os2.html
1•goranmoomin•2m ago•0 comments

Ask HN: Has the Downfall of SaaS Started?

1•throwaw12•3m ago•0 comments

Flirt: The Native Backend

https://blog.buenzli.dev/flirt-native-backend/
2•senekor•4m ago•0 comments

OpenAI's Latest Platform Targets Enterprise Customers

https://aibusiness.com/agentic-ai/openai-s-latest-platform-targets-enterprise-customers
1•myk-e•7m ago•0 comments

Goldman Sachs taps Anthropic's Claude to automate accounting, compliance roles

https://www.cnbc.com/2026/02/06/anthropic-goldman-sachs-ai-model-accounting.html
2•myk-e•9m ago•3 comments

Ai.com bought by Crypto.com founder for $70M in biggest-ever website name deal

https://www.ft.com/content/83488628-8dfd-4060-a7b0-71b1bb012785
1•1vuio0pswjnm7•10m ago•1 comments

Big Tech's AI Push Is Costing More Than the Moon Landing

https://www.wsj.com/tech/ai/ai-spending-tech-companies-compared-02b90046
1•1vuio0pswjnm7•12m ago•0 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
1•1vuio0pswjnm7•14m ago•0 comments

Suno, AI Music, and the Bad Future [video]

https://www.youtube.com/watch?v=U8dcFhF0Dlk
1•askl•16m ago•1 comments

Ask HN: How are researchers using AlphaFold in 2026?

1•jocho12•19m ago•0 comments

Running the "Reflections on Trusting Trust" Compiler

https://spawn-queue.acm.org/doi/10.1145/3786614
1•devooops•24m ago•0 comments

Watermark API – $0.01/image, 10x cheaper than Cloudinary

https://api-production-caa8.up.railway.app/docs
1•lembergs•25m ago•1 comments

Now send your marketing campaigns directly from ChatGPT

https://www.mail-o-mail.com/
1•avallark•29m ago•1 comments

Queueing Theory v2: DORA metrics, queue-of-queues, chi-alpha-beta-sigma notation

https://github.com/joelparkerhenderson/queueing-theory
1•jph•41m ago•0 comments

Show HN: Hibana – choreography-first protocol safety for Rust

https://hibanaworks.dev/
5•o8vm•43m ago•0 comments

Haniri: A live autonomous world where AI agents survive or collapse

https://www.haniri.com
1•donangrey•43m ago•1 comments

GPT-5.3-Codex System Card [pdf]

https://cdn.openai.com/pdf/23eca107-a9b1-4d2c-b156-7deb4fbc697c/GPT-5-3-Codex-System-Card-02.pdf
1•tosh•56m ago•0 comments

Atlas: Manage your database schema as code

https://github.com/ariga/atlas
1•quectophoton•59m ago•0 comments

Geist Pixel

https://vercel.com/blog/introducing-geist-pixel
2•helloplanets•1h ago•0 comments

Show HN: MCP to get latest dependency package and tool versions

https://github.com/MShekow/package-version-check-mcp
1•mshekow•1h ago•0 comments

The better you get at something, the harder it becomes to do

https://seekingtrust.substack.com/p/improving-at-writing-made-me-almost
2•FinnLobsien•1h ago•0 comments

Show HN: WP Float – Archive WordPress blogs to free static hosting

https://wpfloat.netlify.app/
1•zizoulegrande•1h ago•0 comments

Show HN: I Hacked My Family's Meal Planning with an App

https://mealjar.app
1•melvinzammit•1h ago•0 comments

Sony BMG copy protection rootkit scandal

https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootkit_scandal
2•basilikum•1h ago•0 comments

The Future of Systems

https://novlabs.ai/mission/
2•tekbog•1h ago•1 comments

NASA now allowing astronauts to bring their smartphones on space missions

https://twitter.com/NASAAdmin/status/2019259382962307393
2•gbugniot•1h ago•0 comments

Claude Code Is the Inflection Point

https://newsletter.semianalysis.com/p/claude-code-is-the-inflection-point
4•throwaw12•1h ago•2 comments

Show HN: MicroClaw – Agentic AI Assistant for Telegram, Built in Rust

https://github.com/microclaw/microclaw
1•everettjf•1h ago•2 comments

Show HN: Omni-BLAS – 4x faster matrix multiplication via Monte Carlo sampling

https://github.com/AleatorAI/OMNI-BLAS
1•LowSpecEng•1h ago•1 comments

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

https://codemanship.wordpress.com/2026/01/05/the-ai-ready-software-developer-conclusion-same-game...
1•lifeisstillgood•1h ago•0 comments

TTS still sucks

https://duarteocarmo.com/blog/tts-still-sucks
64•speckx•2mo ago

Comments

skybrian•2mo ago
...if you care about voice cloning.

Maybe that's not so important?

lxe•2mo ago
That's because you're sleeping on things like Higgs Audio
meatmanek•2mo ago
If this demo video[1] is indicative of what you can expect, I'm not particularly impressed. For me, every single one of the recordings fell all the way to the bottom of the uncanny valley.

1. https://github.com/user-attachments/assets/0fd73fad-097f-48a...

hackingonempty•2mo ago
I would think the only way to fairly evaluate the performance of these models, as they approach that of professional human voice actors, is to evaluate them against humans in a sufficiently powered randomized, controlled, and blinded trial.
zamadatix•2mo ago
Agreed, but this doesn't approach professional human voice actors - unless they are acting like an android.
rurban•2mo ago
I recently watched the first AI dubbed movie, and it was as horrible as this.
Karrot_Kream•2mo ago
A big caveat here: the author is only looking at a ranking of open models. It's buried in a single sentence, but it makes a big difference to model quality. Kokoro is only #15 in the overall rankings, so if #15 is what you consider the "best model," you need to be cognizant that you're leaving performance on the table.

I've heard a lot of Substacks voiced by Eleven Labs models and they seem fine (with the occasional weirdness around a proper noun). Not a bad article, but I think more examples of TTS usage would make it more useful.

I guess the takeaway is: open-weight TTS models are only okay and could be a lot better?

regulation_d•2mo ago
Yeah, in my experience the more helpful conclusion is "TTS is not commoditized yet". At some point in the next 5 years, convincing TTS will be table stakes. But for now, paying for TTS gets you better results.
TheAceOfHearts•2mo ago
The paid models are still too expensive for personal long-form use cases. For example: if I want to generate an audiobook from a web novel, the price can run to thousands of dollars. If I'm just a regular reader (not the author), that's prohibitively expensive for someone who just wants to enjoy the story in a different medium.
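
Back-of-envelope on why (the per-character price here is an assumption for illustration, not a quoted rate):

    # Rough cost model for API TTS on a long web novel.
    # All numbers are illustrative assumptions, not quoted prices.
    PRICE_PER_1K_CHARS = 0.20   # USD; commercial APIs tend to land around $0.10-0.30
    NOVEL_CHARS = 10_000_000    # long-running web novels can reach this length

    cost = NOVEL_CHARS / 1000 * PRICE_PER_1K_CHARS
    print(f"~${cost:,.0f} to narrate the whole thing")  # ~$2,000
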
BoorishBears•2mo ago
Despite ElevenLabs API usage being expensive, ElevenReader is $11 a month for unlimited personal long-form content.

Even with a local model and hardware you already own, you're not beating that on electricity costs.

rubyn00bie•2mo ago
I dunno about the electricity claims for practical purposes. Where I live, $11 buys roughly 128 hours of 600W. I suppose the real question is: would it take 128 hours ($11) of power to generate 720 hours of TTS content, assuming a good-enough model were available?
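
Sanity-checking that figure (the electricity rate is my assumption, chosen to match the 128-hour number above):

    # Back-of-envelope on the $11 electricity budget.
    RATE_USD_PER_KWH = 0.143  # assumed residential rate implied by the figures above
    GPU_WATTS = 600

    kwh = 11.0 / RATE_USD_PER_KWH        # ~77 kWh
    hours = kwh * 1000 / GPU_WATTS       # ~128 h of full-tilt generation

    # To produce 720 h of audio in that window, the model must run
    # at roughly 5.6x real time on this hardware.
    print(f"{hours:.0f} h of GPU time, {720 / hours:.1f}x real-time TTS needed")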

On Android, I'd imagine (big emphasis on imagine, since I don't use it) you could probably script something up and use a phone with an audio jack to record it, theoretically hitting that maximum of 720 hours of content per month. But I'd imagine at some point they'd find it peculiar that you're listening to content 24/7.

lostmsu•2mo ago
Kokoro is available as a system TTS for Android via an OSS project called "sherpa": https://k2-fsa.github.io/sherpa/onnx/android/index.html

I believe its power usage is negligible in comparison to, for example, the screen, or maybe even Bluetooth audio.

staticman2•2mo ago
I listen to web novels with the ElevenLabs Reader all the time (the $11-a-month unlimited plan). I love it.

When it's a foreign web novel with no English translation, I first translate the web novel with Claude Sonnet.

huskyr•2mo ago
Yup, ElevenLabs still pretty much rules this space. Especially if you're looking for non-English models, it's really hard to find anything good, although the latest Chatterbox[1] now supports 23 languages.

[1]: https://github.com/resemble-ai/chatterbox

fleshmonad•2mo ago
It's very interesting to see that there actually are people who want to automatically create a "podcast" from their blog using their cloned voice. Is this just what tech-bro culture does to someone? Or is it about hustling and grinding while getting your very important word out there? I mean, over time one would certainly save up to 20 minutes per article...
bigfishrunning•2mo ago
Exactly! Why would I want to listen to a written article instead of just reading it?

Also, I suspect these AI-Podcast blogs are probably just generated with AI too, so it's likely safe to skip the whole mess

raw_anon_1111•2mo ago
For me it would be when I'm driving or working out. But I can't imagine listening to an AI-generated podcast. I do listen to the Stratechery podcast, which has the same content as the email.

But not only does he read it himself, he has someone else narrate the quotes, and he uses chapter art that goes along with the article.

tonyarkles•2mo ago
To some degree, you could make the same argument about written books and audio books. Mostly I listen to audiobooks because I’m often bored in the car and learning something seems like a good use of my time.
bigfishrunning•2mo ago
An audio book is usually read by an actor, or the author, and often has a performative quality to it. I'm honestly not interested in anything read by a machine unless it's absolutely indistinguishable from a professional person, and current machine-voice-synthesis (AI or not) isn't there.
TiredOfLife•2mo ago
Wait till you get old and have trouble reading.
imiric•2mo ago
> Anything over 1000 characters starts hallucinating.

So just feed it batches smaller than 1000 characters? It's not like TTS requires maintaining a large context at a time.
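
A minimal sketch of that naive chunking approach, assuming the model accepts arbitrary inputs under the limit (the replies below explain what this loses at chunk boundaries):

    import re

    def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
        """Split text into chunks under max_chars, breaking at sentence ends."""
        sentences = re.split(r"(?<=[.!?])\s+", text)
        chunks, current = [], ""
        for s in sentences:
            # Start a new chunk when appending would cross the limit.
            # (A single sentence longer than max_chars still becomes one big chunk.)
            if current and len(current) + len(s) + 1 > max_chars:
                chunks.append(current)
                current = s
            else:
                current = f"{current} {s}".strip()
        if current:
            chunks.append(current)
        return chunks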

simlevesque•2mo ago
Context helps with guessing what the next word will be.
zahlman•2mo ago
If you've been given 1000 characters (a fairly long paragraph) of text to read (and supposing you get to study them before you start speaking), is "guessing what the next word will be" all that relevant to decisions about intonation?
BoorishBears•2mo ago
They're pointing in the right direction, but with the wrong specifics: you can't maintain consistent prosody without context.

The simplest examples are punctuation marks, which change your speech before you reach the mark, but the problem extends past sentence boundaries.

For example:

"He didn't steal the green car. He borrowed it."

vs

"He didn't steal the green car. He stole the red one."

A natural speaker would slightly emphasize steal and borrowed in the 1st example, but emphasize green and red in the 2nd.

Or like when you're building a set:

"Peter called Mary."

vs

"John called Mary. Peter called Mary. Who didn't call Mary?"

-

These all sound like small nits, but for naively stitched-together TTS, at best they nudge the narration toward the uncanny valley (which may be acceptable for some use cases)... and at worst they make the model sound broken.

zahlman•2mo ago
> The simplest examples are punctuation marks which change your speech before you reach the mark, but the problem extends past sentence boundaries

I agree, but it seems unusual for this to matter past paragraph boundaries, and it sounds like there should be enough room for a full paragraph of context.

BoorishBears•2mo ago
It depends on the level of quality you're going for, but prosody even changes based on how related the preceding paragraph is to the upcoming one.

And the current SOTA for TTS includes breathing too, so you can't just put a fixed empty pause between your paragraphs.

People are chunking by paragraphs anyway (or even by sentences) and it works, but the top commercial models support maintaining a context, or passing in the most recently generated text, for that reason.
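
Shape-wise it's just a sliding window. A minimal sketch, with synthesize() as a hypothetical stand-in for whatever context mechanism a given model exposes (the exact parameter names vary by vendor):

    def narrate(paragraphs: list[str], synthesize) -> list[bytes]:
        """Synthesize each paragraph while feeding the previous one as context."""
        audio, prev = [], ""
        for p in paragraphs:
            # previous_text is a hypothetical parameter, not a specific vendor API.
            audio.append(synthesize(p, previous_text=prev))
            prev = p  # slide the context window forward by one paragraph
        return audio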

tonyarkles•2mo ago
With no significant background in ML-based TTS, I’m assuming that a larger context window would help with tone as well. “We are gathered here today to mourn the loss of…” really provides context into how the whole thing might sound, even if most of it is singing the praises of the deceased.
superkuh•2mo ago
For local TTS for a podcast, I'd try the quantized .gguf versions of Microsoft VibeVoice Large in ComfyUI to clone my voice from a ~30 second speech sample, then apply it to marked-up text of the desired podcast. But it'd be nowhere near real time and would require dedicating a $300 GPU to it. And the quantized version often goes off the rails and loses consistency in voice tone or accent. So just one run often isn't enough, and you have to piece the good parts of many separate runs together. It's not set-it-and-forget-it.
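
That multi-run stitching loop, sketched (generate() and score() are hypothetical stand-ins; in practice generate() would be a ComfyUI/VibeVoice run, and score() could simply be manual review):

    def best_of_n(chunks: list[str], generate, score, n: int = 4) -> list[bytes]:
        """Render each chunk n times and keep the least-broken take."""
        out = []
        for chunk in chunks:
            candidates = [generate(chunk) for _ in range(n)]
            out.append(max(candidates, key=score))
        return out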

I do a lot of desktop screen-reader and pdf/doc/epub/etc text-to-speech every single day. It's been 20 years and I still use Festival 1.96 TTS with the voice_nitech_us_slt_arctic_hts voice, because it's so computationally cheap and just enough of a step above normal festival/espeak/mbrola/etc-type TTS quality to be clear and tolerable. For this local "do screenreader stuff really fast" use case, I've tried modern TTS like VibeVoice, Kokoro TTS, sherpa-onnx, Piper TTS, Orpheus TTS, etc. They all have consistency issues, many are way too slow even with a $300 GPU dedicated to them, and most output weird garbled noises at unpredictable times along with the good output.

derac•2mo ago
In my experience, VibeVoice is quite good as well. Even the smaller model.
OfflineSergio•2mo ago
>I do a lot of desktop screen-reader and pdf/doc/epub/etc text to speech every single day.

I've been working on a product called WithAudio (https://with.audio). Are you open to me reaching out and giving you a free license, so you can use it and let me know what you think? I should say it only supports Windows and Mac (ARM). I'm looking for people who have used similar products to get their feedback.

mcny•2mo ago
Slightly off-topic, but why would a blog of all things have DRM content that I need to enable?
jsheard•2mo ago
It looks like the inline Apple Podcasts player is causing that, though it's not clear why, since it's loading unencrypted MP3s directly from the author's S3 bucket. I guess their player eagerly sets up DRM playback at startup rather than waiting until it's needed, or they're using EME for something else (fingerprinting?).
observationist•2mo ago
Superhuman TTS is well within the capabilities of the big AI labs. Even Google had voices indistinguishable from human back in 2017, but they deliberately kneecapped them because of the potential for misuse. Boomers and older folks are not culturally or mentally equipped to handle it - even the crappy open-source voice cloning we had in 2019 got used to scam people into buying gift cards.

Because of the potential for abuse, nobody wants to release a truly good, general model, because it makes lawyers lose sleep. A few more generations of hardware, though, and there will be enough open data and DIY scaffolding out there to produce a superhuman model, and someone will release it.

Deepfake video is already indistinguishable from real video (not oneshot prompt video generation, but deliberate skilled craft using AI tools.)

Higgsfield and other tools allow for spectacular voice results, but it takes craft and care. The oneshot stuff is deliberately underpowered. OpenAI doesn't want to be responsible for a viral pitch-perfect campaign ad, or fake scandal video sinking a politician, for example.

Once the lawyers calm down, or we get a decent digital bill of rights that places clear accountability on the user of the tool and not the toolmaker, things should get better. Until then, look for the rogue YOLO boutique services or the ambitious open-source crews to be the first to superhuman, widely available TTS.

fortran77•2mo ago
Stop blaming us old people for your lack of good TTS models.
onedognight•2mo ago
Username checks out.
bigfishrunning•2mo ago
> Boomers and older folks are not culturally or mentally equipped to handle it

I think a lot of younger people are also not mentally equipped to handle it. Outside of the hackernews sphere of influence, people are really bad at spotting AI slop (and also really bad at caring about it)

tonyarkles•2mo ago
Half tongue in cheek when I say this… that might be true, but what are the odds of them actually answering a phone call?
bsder•2mo ago
> people are really bad at spotting AI slop

Erm, guilty as charged? Although, I don't think you can blame people for that.

There was a video recently comparing a bunch of "influencer girls" holding signs saying "This is AI" or "This is Real". They could all have been AI, or could all have been real. I have zero confidence that I could actually spot the difference.

This is doubly true as an "Online Video Persona" has a bunch of idiosyncrasies that make them slightly ... off ... even if they're real (example: YouTube Thumbnail Face, face filters, etc.). AI is really good at twigging into those idiosyncrasies and it serves as nice camouflage for AI weirdness.

tatersolid•2mo ago
This isn’t true in my experience. I’m a parent of three teens and they immediately exclaim “clanker” and hit skip when encountering any form of AI generated content on any platform.

They recognize AI slop easily and definitely do care enough to avoid it. As do their friends.

AI-generated content has near-zero commercial value long-term.

8n4vidtmkvmk•2mo ago
I'm very skeptical about this zero commercial value claim. I don't think everyone skips past it, and even if they do, that's just the stuff they're detecting as AI. How much have they not identified? What about in a couple years?

Heck, even humans subtly trying to sell something give off a vibe you can pick up quickly. But now and then they're entertaining or subliminal enough that they get through.

munk-a•2mo ago
> Boomers and older folks are not culturally or mentally equipped to handle it

I'm glad you mentioned this because the "Grandma - I was arrested and you need to send bail" scams are already ridiculously effective to run. Better TTS will make voice communication without some additional verification completely untrustworthy.

But, also, I don't want better TTS. I can understand the words current robotic TTS is saying, so it's doing the job it needs to do. Right now there are useful ways to use TTS that provide real value to society; better TTS would just enable better cloaking of TTS and allow bad actors to more effectively waste human time. I would be perfectly happy if TTS remained at the level it is today.

8n4vidtmkvmk•2mo ago
I still think it would be fun for a video game. Write a backstory for a whole bunch of NPCs and let the player dig as deep as they like.

I'm not sure what the bottleneck is right now. Either this idea isn't as fun as I think, or we can't do it in real time on consumer hardware yet.

munk-a•2mo ago
A lot of full conversion mods just find community members that want to do some VO for practice or as a resume booster or just for the funsies. I think you'd be surprised how easy it is to get half-decent voice actors if you've got an interesting idea to build out.
hdjrudni•2mo ago
The problem isn't hiring/paying the voice actors, it's that the NPC can say anything. It's not pre-scripted.
jsheard•2mo ago
What does it even mean for a TTS model to be "superhuman" when the goal is to imitate human speech?
observationist•2mo ago
A single model that can produce voices indistinguishable from human speech, clone any voice perfectly, and produce audio faster than a human can. Superhuman specifically in speed: minutes of pitch-perfect voice per second of operation, for example.
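
That speed claim is just a real-time-factor (RTF) statement. Illustrative numbers, not benchmarks:

    # RTF = compute time / audio duration (lower is faster; 1.0 = real time).
    audio_seconds = 120      # "minutes of voice..."
    compute_seconds = 1      # "...per second of operation"
    rtf = compute_seconds / audio_seconds
    print(f"RTF = {rtf:.4f}")  # 0.0083, i.e. ~120x faster than real time
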
rurban•2mo ago
The movie industry would love to use it for dubbing, and would pay a lot. They already think they pay voice actors too much.
andrewstuart•2mo ago
Commercial TTS mostly sucks too.

There are flashes of brilliance, but most of it is noticeably computer generated.

horhay•2mo ago
The Gemini models, Eleven V3, and whatever internal audio model Sora 2 uses are roughly neck and neck, converging in performance. They have some unexplainable flavor to them, though. Especially Sora.
actuallyalys•2mo ago
While it sounds like this blogger doesn't want to bother (and perhaps experimenting with AI is itself the appeal), I personally appreciate when authors read their posts instead of delegating the task to AI.
neilv•2mo ago
> After filtering by my stupid rule of open models,

That's a good rule.

> You must enable DRM to play some audio or video on this page.

Looks like `embed.podcasts.apple.com` isn't in the same spirit.

AlienRobot•2mo ago
>However, like many models in this leaderboard - I can’t use it - since it doesn’t support voice cloning.

That's such a strange requirement. A TTS is just that: it takes text and speaks it out loud. The user generally doesn't care whose voice it is, and personally I think TTSes sharing the same voice is a good thing for authenticity, since it lets users know that it's a TTS reading the script and not a real person.

You want your voice to be reading the script, but you don't want to personally record yourself reading the text? As far as I'm concerned, that's an edge case. No wonder TTSes can't do that properly, since most people don't need it in the first place.

lielvilla•2mo ago
I wouldn't say it sucks, but it's nowhere near plug-and-play yet.

Totally agree on the pain points - I covered similar thoughts in my post: https://lielvilla.com/blog/death-of-demo/