I started asking myself: what happens when voice is "solved", i.e. when synthetic speech becomes impossible to distinguish from a human? I'd like to hear your opinions!
Sketched some of my own thoughts, and I see two futures:
Future 1: the nuanced version
Audiobooks: I think established authors will still prefer human narrators. If you can afford a $3k–$4k fixed cost for narration, a good human voice is usually worth it. TTS may even push human narration prices down, making that choice easier.
But for new/self-published authors, especially in non-fiction, AI narration may become the default. The choice is often not “AI vs. human narrator,” but “AI audiobook vs. no audiobook.” There will be backlash, but I think people will partly get used to it.
The more interesting threat may be AI readers. If I can buy an ebook for $8–$10 and have it narrated in a voice/style I like for $1–$2, why pay for an AI-narrated audiobook as a separate product? This could partly unbundle audiobooks from platforms like Audible. I’m torn here: AI-narrated self-published audiobooks and AI readers may co-exist, but AI readers could eventually replace most non-human audiobook editions.
Business content: training videos, museum guides, phone systems, short ads, internal explainers, etc. will be mostly AI. Anywhere “good enough is good enough” meets budget pressure, TTS wins. It already does.
Content creation: YouTube, podcasts, TikTok, etc. are different. Among top creators, I think human narration still dominates because personality and authenticity matter. If the voice is part of the brand, TTS is counterproductive.
That said, AI narration will explode in low-effort content. As generative text/video tools create more slop, most of that slop will probably have AI narration. So maybe the ratio of human to TTS voices on social media becomes 1:10 by volume, but 10:1 by total viewership. That would mean the average human-voiced video draws roughly 100 times the views of the average TTS one.
Dubbing/translations: heavily AI-dominated, except for high-end creative work like major films or books.
Films: only humans for now, but it could change. I can easily see generative AI technology going far enough that films of Hollywood quality are fully produced with AI. It would involve a new type of “producer,” someone who could manipulate generative AI and mold it into something beautiful, and it would require a new set of tools. Essentially, there would be many, many Pixar-style studios focused on ultra-realistic video with relatively small budgets. For such cases, AI narration would be used, and eventually it could eat almost the whole industry.
Games: TTS seems especially strong here: many distinct voices, short lines, lots of minor characters, and poor economics for hiring actors for everything. I think studios will still use humans for main characters, but many NPCs and indie-game voices will become AI.
Future 2: the hardline version
Anything outside of personal-brand stuff would be AI-generated. If it gets cheap and good enough, and society accepts it, everything from books to films and ads would be AI-narrated.
Human narration would evolve as a profession: narrators would “sell” the rights to have their voice generated by AI.
A new profession of AI sound engineers will emerge, who will use AI to get creative with voice design and voice orchestration to get the best results.
I also feel like voice is quite different from text or image generation, in the sense that there is a weaker uncanny valley. In 95% of cases, voice is just a tool for correctly conveying creatively written text (hopefully written by a human). And for tools, it is mostly a question of getting good enough.
It is also possible that the two futures are not either/or: the first covers the next 10 years, and the second comes some time after that.
kvasserman•1h ago
ben_w•1h ago
Right now, the main thing making these things recognisable is that there are so few voices. The voices themselves are basically celebrities, albeit in the same way as some annoying D-list celebrity who somehow managed to get a bajillion contracts for advertising cheap tat.
Given that LLM slop is currently rapidly degrading the trustworthiness of search results (even more so than SEO already had), it's probably for the best if the major AI providers don't release a bunch more voices.
boa00•47m ago
Text is just human thoughts in their simplest form. Writing is about expressing ideas, and there is an almost infinite number of ways to express them. It's an extremely difficult task, and LLMs only "imitate" it to the best of their training.
This is not at all true for voice. There are an infinite number of possible voices, but a finite number of tones and phonemes you can use to express the text.
It's a much easier technical problem; it's just that it's much harder to gather proper data (you cannot just scrape Reddit and hope for the best, as LLMs do). And voice gets like 1/100th of LLMs' funding.