LoudReader is what came out of it - an iOS app that reads essays, articles, and books aloud, fully on-device. No account, no network after install.
Getting the model to read a sentence was the easy part. Making it not feel like a demo was the rest: streaming synthesis so playback starts before the sentence finishes, porting misaki to Swift because I could only find Python releases, and thermal monitoring and throttling strategy, which was a tough one as well. Runs well on iPhone 14 Pro (what I have) and newer. Tested on my mom's iPhone 12 Pro and it chokes sometimes, so I ported KittenTTS as a lighter fallback for older devices. The whole project took around 2-3 months of weekends with Claude Code and Codex.
Smooth TTS was the hard part, but the app around it grew larger than I expected: EPUB/PDF import, Gutenberg browsing, a saved-articles queue, multi-week reading campaigns. Happy to dig into any of it in the comments.
PDFs, especially academic papers and scanned docs, still annoy me. I built an OCR flow that handles regular documents, but scientific papers with two-column layouts, equations, and fine print are still messy. Curious if anyone here has shipped PDF extraction on mobile that actually handles this well.
This was my first time designing a user-facing product - I'm more of a deep-engineering person, so any feedback is welcome too. I'll post a write-up on the biggest hurdles in the comments as well.
If you've ever tried to listen to something long on a plane, you get why this exists.
mowmiatlas•1h ago
Streaming was the worst one. Kokoro doesn't expose a streaming interface as far as I could find: you hand it a chunk of text and it gives you back the full audio for that chunk. For a reading app you can't wait for a whole paragraph before playback starts, so the whole streaming layer had to be built on top. I didn't want to pre-process the book and then serve full audio; I wanted it to be interactive.
The basic shape: chunk into sentence-sized windows, render in the background, queue rendered chunks for playback, keep a small pre-render lookahead so playback never starves but the phone isn't speculatively rendering an entire chapter it might throw away on a skip.
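The lookahead part of that shape can be sketched roughly like this (a minimal illustration with made-up names, not LoudReader's actual code): chunks inside a small window ahead of the playhead get rendered, everything else waits, so a skip throws away at most a few chunks of speculative work.

```swift
import Foundation

// Hypothetical sketch of a pre-render lookahead window. Only chunks in
// [playhead, playhead + lookahead) are eligible for rendering, so playback
// never starves but the phone isn't rendering an entire chapter it might
// throw away on a skip.
struct RenderQueue {
    private var rendered: [Int: Data] = [:]  // chunk index -> rendered audio
    let lookahead: Int

    init(lookahead: Int = 3) { self.lookahead = lookahead }

    // True when a chunk is inside the lookahead window and not yet rendered.
    func shouldRender(chunk i: Int, playhead: Int) -> Bool {
        i >= playhead && i < playhead + lookahead && rendered[i] == nil
    }

    mutating func store(_ audio: Data, for i: Int) { rendered[i] = audio }
    func audio(for i: Int) -> Data? { rendered[i] }

    // Drop chunks behind the playhead so memory stays bounded.
    mutating func evict(before playhead: Int) {
        rendered = rendered.filter { $0.key >= playhead }
    }
}
```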
Sentence chunking was its own fight. Too long and the model returns null and playback stops. Too short (four or five words at a time) and naturalness suffers, because the model uses context within a sentence to decide intonation; chopped chunks sound like a bad GPS voice. I had to find the goldilocks window where the model is happy and the result still sounds good, and handle long-sentence edge cases by splitting on secondary punctuation and stitching the audio back together without audible seams.
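As a rough sketch of that two-pass chunking (illustrative names and thresholds, not the app's real code): split on sentence terminators first, then break any over-long sentence on secondary punctuation so the model never sees a chunk past some maximum.

```swift
import Foundation

// Hypothetical two-pass chunker: sentences first, then secondary-punctuation
// splits for sentences longer than `maxChars`. Thresholds are placeholders.
func chunkForSynthesis(_ text: String, maxChars: Int = 200, minChars: Int = 30) -> [String] {
    // Pass 1: split on sentence terminators, keeping the terminator.
    var sentences: [String] = []
    var current = ""
    for ch in text {
        current.append(ch)
        if ".!?".contains(ch) {
            sentences.append(current.trimmingCharacters(in: .whitespaces))
            current = ""
        }
    }
    let tail = current.trimmingCharacters(in: .whitespaces)
    if !tail.isEmpty { sentences.append(tail) }

    // Pass 2: break over-long sentences on commas/semicolons/colons;
    // the playback side can stitch these back without an audible seam.
    var chunks: [String] = []
    for sentence in sentences {
        if sentence.count <= maxChars {
            chunks.append(sentence)
            continue
        }
        var piece = ""
        for ch in sentence {
            piece.append(ch)
            if ",;:".contains(ch) && piece.count >= minChars {
                chunks.append(piece.trimmingCharacters(in: .whitespaces))
                piece = ""
            }
        }
        let rest = piece.trimmingCharacters(in: .whitespaces)
        if !rest.isEmpty { chunks.append(rest) }
    }
    return chunks
}
```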
For battery life there's cruise mode. When the screen is off and the next several sentences are already rendered and cached, the app swaps the whole synthesis/playback pipeline for a much lighter sequential AAC player working off hardware-decoded audio files.
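The lightweight player side of cruise mode is conceptually just sequential file playback; a minimal sketch with AVQueuePlayer (file paths are illustrative):

```swift
import AVFoundation

// Minimal sketch of a cruise-mode style player: cached M4A chunks played
// back to back. AAC decode happens in hardware, so the neural engine and
// the synthesis pipeline stay asleep.
let cachedChunks = [
    URL(fileURLWithPath: "/tmp/cache/chunk001.m4a"),  // placeholder paths
    URL(fileURLWithPath: "/tmp/cache/chunk002.m4a"),
]
let player = AVQueuePlayer(items: cachedChunks.map { AVPlayerItem(url: $0) })
player.play()
```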
When the phone's on a charger, a background task pre-renders a chapter or two of upcoming audio and writes it to disk as M4A. That way, by the time you're actually reading, cruise mode has a cache to play from and the neural engine never has to wake up for long stretches. The system decides when to actually run the task, so it piggybacks on the phone's usual overnight charging window.
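The charger-gated scheduling maps naturally onto BGTaskScheduler; a sketch under assumed names (the task identifier and function are hypothetical):

```swift
import BackgroundTasks

// Sketch of scheduling a charger-only pre-render task. The system decides
// when to actually run it, which in practice tends to be the overnight
// charging window. Identifier is a placeholder.
func schedulePreRender() {
    let request = BGProcessingTaskRequest(identifier: "com.example.loudreader.prerender")
    request.requiresExternalPower = true        // only while on the charger
    request.requiresNetworkConnectivity = false // fully on-device
    try? BGTaskScheduler.shared.submit(request)
}
```

The handler registered for that identifier at launch would render a chapter or two to M4A and call `setTaskCompleted(success:)` when done.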
The Neural Engine was a disappointment. I was hoping to get Kokoro onto the ANE for the latency/efficiency win, given that it already runs quite well on CPU, but it uses ops that CoreML doesn't route to the Neural Engine, so it falls back to GPU/CPU. The weird part: forcing .cpuAndNeuralEngine is actually slower than .cpuAndGPU on this model, probably partitioning cost from unsupported ops bouncing between compute units, but I don't fully understand why. If anyone on CoreML has a principled explanation I'd love to hear it.
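For anyone who hasn't poked at this, the compute-unit choice is a one-line config on model load (model class name is a placeholder for the compiled CoreML model):

```swift
import CoreML

// Compute-unit selection on model load. For this model, .cpuAndGPU
// turned out faster than .cpuAndNeuralEngine, likely because unsupported
// ops force the graph to be partitioned across units.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU   // not .cpuAndNeuralEngine, not .all
// let model = try KokoroModel(configuration: config)  // placeholder class name
```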
iPhone 12 mini and lower, and simulators, are cursed. They seem to run Kokoro successfully, i.e. no error, inference completes, but the result is pure crackling/screeching gibberish audio. Same model, same weights, same code path. KittenTTS runs fine on the exact same hardware AND the Xcode simulator. I still don't know what's going on here; curious if anyone's seen similar.
KittenTTS was easy. Ported it as a fallback for older devices and published a minimal iOS example repo while I was at it: https://github.com/pepinu/KittenTTS-iOS if you just want to see how to get a neural TTS model running on iPhone without the full app machinery around it.
Before the iPhone optimization work got far enough along, Kokoro only ran in real time on a MacBook, so I was literally putting a laptop on the passenger seat for long drives just to have something read to me. Very inconvenient, but it made me commit to getting the phone path right. The current build isn't really tested on Mac; maybe in the future.
On the LLM tooling question up front: YES, I used Claude Code and Codex throughout. I might be too much into tokenmaxxing, though, since I'd run several sessions in tandem for bug hunting and several more for review, to get a wisdom-of-the-crowds effect of sorts.