Short version: I wanted local voice cloning that worked like a normal Mac app. Not a Gradio interface, not a Conda environment, not a Docker container. A real .app in my Applications folder with no dependencies.
Qwen3-TTS is a genuinely good open model, but using it means setting up Python and dealing with dependency conflicts every time something updates. That's rough on the creative crowd, who use these tools a lot. I figured if I ported it to MLX and wrapped it in SwiftUI, I could skip all of that.
The interesting technical bit: getting the 1.7B model to run well on 8GB Macs took some work. MLX's unified memory helps, but I still had to be careful about quantization tradeoffs: too aggressive and you lose the natural inflection that makes cloned voices sound real. I wound up with a great 5-bit quant.
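If you want to poke at the tradeoff yourself, here's a minimal sketch of how you can eyeball reconstruction error at different bit widths using MLX's affine quantization. This is just an illustration, not the app's actual pipeline: it assumes a recent MLX build that supports 5-bit quantization, and the tensor shape and group size are placeholders.

```python
import mlx.core as mx

# Stand-in weight matrix; real model weights behave differently,
# but this gives a feel for how error scales with bit width.
w = mx.random.normal(shape=(4096, 4096))

for bits in (4, 5, 6, 8):
    # Affine quantize, then reconstruct, and measure the damage.
    wq, scales, biases = mx.quantize(w, group_size=64, bits=bits)
    w_hat = mx.dequantize(wq, scales, biases, group_size=64, bits=bits)
    err = mx.mean(mx.abs(w - w_hat)).item()
    print(f"{bits}-bit: mean abs error {err:.5f}")
```

In practice the numbers only tell part of the story; with TTS you really have to listen for lost inflection, which is why I landed on 5-bit rather than trusting error metrics alone.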
Voice cloning needs just 3 seconds of audio. Longer samples (up to ~12 seconds) improve quality, but there are diminishing returns past that. On the iOS version, I'm capping reference audio at 8 seconds to stay within mobile memory constraints.
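The shipping app does this in Swift, but the capping logic is trivial enough to show as a quick Python sketch (soundfile and the filenames here are assumptions, not the app's code):

```python
import soundfile as sf

MAX_REF_SECONDS = 8  # the iOS cap mentioned above

# Load the reference clip and keep at most the first 8 seconds.
audio, sr = sf.read("reference.wav")
max_samples = int(MAX_REF_SECONDS * sr)
if len(audio) > max_samples:
    audio = audio[:max_samples]
sf.write("reference_trimmed.wav", audio, sr)
```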
My two favorite features: voice design (describe a voice and it makes one up, usually really well) and voice instruction ("read this as if you're highly skeptical" or "yell this as if you're a bit out of breath"). On both, the Qwen team is competing with the state of the art.
Happy to go deep on the MLX port, the quantization choices, or anything else. I'll be here all day.