The core idea is to insert an AI text pre-processor before TTS synthesis.
Instead of feeding raw text directly into TTS, an AI model parses and rewrites the text to optimize it for speech, handling things that current TTS pipelines do poorly unless the user is an SSML expert.
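A minimal sketch of where the pre-processor would sit, assuming an LLM does the rewrite; call_llm() and synthesize() are hypothetical stand-ins for whatever LLM client and TTS engine are already in place (the ordering is the point, not the specific APIs):

```python
# Sketch of the proposed pipeline: raw text -> AI rewrite -> unchanged TTS engine.
# call_llm() and synthesize() are hypothetical stand-ins, not any specific API.

REWRITE_PROMPT = (
    "Rewrite the following text for a text-to-speech engine. "
    "Expand ambiguous abbreviations, normalize numbers, and add punctuation "
    "where a pause or emphasis is needed. Preserve the meaning exactly.\n\n"
    "Text: {text}"
)

def call_llm(prompt: str) -> str:
    # Stand-in: wire this to an actual LLM client.
    # The echo below only keeps the sketch runnable.
    return prompt.rsplit("Text: ", 1)[-1]

def synthesize(text: str) -> bytes:
    # Stand-in for the existing TTS engine's synthesis call.
    return text.encode("utf-8")

def preprocess_for_tts(text: str) -> str:
    """Ask the model for a speech-friendly rewrite of the input."""
    return call_llm(REWRITE_PROMPT.format(text=text))

def speak(text: str) -> bytes:
    """Pre-process first, then hand the result to the TTS engine as before."""
    return synthesize(preprocess_for_tts(text))

audio = speak("I want US to eat together.")
```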
What the pre-processor would do:
1. Control pacing, rhythm, and pitch: automatically infer pauses, emphasis, and sentence flow. Most users don’t know SSML, but good pacing alone significantly improves perceived quality.
2. Context-aware pronunciation. Example: in “I want US to eat together,” “US” should be read as the word “us,” not spelled out as “U.S.”
3. Rewrite text for pronunciation clarity (see the sketch after this list).
Normalize numbers: 10 000 → 10,000 or “ten thousand”
Adjust foreign names or ambiguous words
Phonetic hints when needed (e.g., sake → “sayk”)
Small rewrites that preserve meaning but improve speech output
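Some of item 3 could even be handled deterministically before any model is involved. A rough rule-based sketch, assuming an illustrative regex for digit grouping and a tiny phonetic-hint table (both are examples, not a complete normalizer):

```python
import re

# Illustrative rewrite rules: number grouping and crude phonetic hints.
# The hint table just mirrors the "sake" example above; it is not authoritative.
PHONETIC_HINTS = {
    "sake": "sayk",
}

def normalize_numbers(text: str) -> str:
    # "10 000" -> "10,000" so the engine reads one number, not two.
    return re.sub(r"(?<=\d) (?=\d{3}\b)", ",", text)

def apply_phonetic_hints(text: str) -> str:
    # Replace listed words with a spelled-out pronunciation hint.
    def repl(match: re.Match) -> str:
        return PHONETIC_HINTS.get(match.group(0).lower(), match.group(0))
    return re.sub(r"\b\w+\b", repl, text)

def rewrite_for_speech(text: str) -> str:
    return apply_phonetic_hints(normalize_numbers(text))

print(rewrite_for_speech("For old times' sake, it cost 10 000 yen."))
# -> "For old times' sayk, it cost 10,000 yen."
```

An AI pre-processor would go further than these fixed rules (choosing hints from context, rewording awkward sentences), but simple normalization like this already removes a lot of obvious misreads.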
This wouldn’t reach the quality of full neural TTS, but it could dramatically narrow the gap, especially for:
low-resource environments
embedded systems
legacy TTS engines
cost-sensitive use cases
Curious if anyone has seen similar approaches in production, or if this is already being done quietly somewhere.