Ask HN: What happens when AI-voice becomes good enough?

1•boa00•1h ago

I fell into the rabbit hole of TTS models lately. Tried all major paid tools (ElevenLabs/InWorld/etc.), and all the newest open-source models.

I started asking myself: what happens when the voice is "solved"? That is, when it becomes impossible to distinguish from a human. Wanted to hear your opinions!

Sketched some of my own thoughts, and I see two futures:

Future 1: the nuanced version

Audiobooks: I think established authors will still prefer human narrators. If you can afford a $3k–$4k fixed cost for narration, a good human voice is usually worth it. TTS may even push human narration prices down, making that choice easier.

But for new/self-published authors, especially in non-fiction, AI narration may become the default. The choice is often not “AI vs. human narrator,” but “AI audiobook vs. no audiobook.” There will be backlash, but I think people will partly get used to it.

The more interesting threat may be AI readers. If I can buy an ebook for $8–$10 and have it narrated in a voice/style I like for $1–$2, why pay for an AI-narrated audiobook as a separate product? This could partly unbundle audiobooks from platforms like Audible. I’m torn here: AI-narrated self-published audiobooks and AI readers may co-exist, but AI readers could eventually replace most non-human audiobook editions.

Business content: training videos, museum guides, phone systems, short ads, internal explainers, etc. will be mostly AI. Anywhere “good enough is good enough” meets budget pressure, TTS wins. It already does.

Content creation: YouTube, podcasts, TikTok, etc. are different. Among top creators, I think human narration still dominates because personality and authenticity matter. If the voice is part of the brand, TTS is counterproductive.

That said, AI narration will explode in low-effort content. As generative text/video tools create more slop, most of that slop will probably have AI narration. So maybe the ratio of human vs. TTS voices on social media becomes 1:10 by volume, but 10:1 by total viewership in favor of human voices.

Dubbing/translations: heavily AI-dominated, except for high-end creative work like major films or books.

Films: only humans for now, but it could change. I can easily see generative AI technology going far enough that films of Hollywood quality are fully produced with AI. It would involve a new type of “producer,” someone who could manipulate generative AI and mold it into something beautiful, and it would require a new set of tools. Essentially, there would be many, many Pixar-style studios focused on ultra-realistic video with relatively small budgets. For such cases, AI narration would be used, and eventually it could eat almost the whole industry.

Games: TTS seems especially strong here: many distinct voices, short lines, lots of minor characters, and poor economics for hiring actors for everything. I think studios will still use humans for main characters, but many NPCs and indie-game voices will become AI.

Future 2: the hardline version

Anything outside of personal-brand stuff would be AI-generated. If it gets cheap and good enough, and society accepts it, everything from books to films and ads would be AI-narrated.

Human narration would evolve as a profession: you would “sell” the rights to an AI-generated version of your voice.

A new profession of AI sound engineers will emerge, who will use AI to get creative with voice design and voice orchestration to get the best results.

I also feel like voice is quite different from text or image generation, in the sense that there is a weaker uncanny valley. In 95% of cases, voice is just a tool for correctly conveying creatively written text (hopefully written by a human). And for tools, it is mostly a question of getting good enough.

It is also possible that it is not either/or between the two futures: the first future is the next 10 years, and the second future comes some time after that.

Comments

kvasserman•1h ago

I think of it this way. LLMs are supposed to be good at generating text/writing, right? Well, they are not very good at it. They generate plausible content that superficially makes sense. Most people can easily tell AI-generated slop from human writing. I suspect that mimicking the human voice is several levels more difficult for LLMs than mimicking human content. The level of nuance that humans produce in their speech is probably staggering. So I may be completely wrong, but I see no evidence so far to support the idea that either LLMs' writing or speaking is going to get much better any time soon.

ben_w•1h ago

Perhaps, but for what it's worth, when I first heard OpenAI's TTS demo, I assumed they were faking it and a human was speaking because it had "um"s and "err"s.

Right now, the main thing making these things recognisable is that there are so few voices. The voices themselves are basically celebrities, albeit in the same way as some annoying D-list celebrity who somehow managed to get a bajillion contracts for advertising cheap tat.

Given that LLM slop is currently rapidly degrading the trustworthiness of search results (even more so than SEO already had), it's probably for the best if the major AI providers don't release a bunch more voices.

boa00•47m ago

Not sure I agree here.

Text is just human thoughts in their simplest form. Writing is about expressing ideas, and there is an almost infinite number of ways to express them. It is an extremely difficult task, and LLMs only "imitate" it to the best of their training.

This is not at all true for voice. There are an infinite number of possible voices, but a finite number of tones and phonemes you can use to express the text.

It's a much easier technical problem; it's just that it's much harder to gather proper data (you cannot just scrape Reddit and hope for the best, as LLMs do). And voice gets like 1/100th of LLMs' funding.
