Audio really is a blue ocean compared to text/image ML. The barriers aren't primarily compute or data - they're knowledge. You can't scale your way out of bad preprocessing or codec choices.
When 4 researchers can build Moshi from scratch in 6 months while big labs consider voice "solved," it shows we're still in a phase where domain expertise matters more than scale. There's an enormous opportunity here for teams who understand both ML and signal processing fundamentals.
amelius•1h ago