We got it to 400 tok/s prefill and 100 tok/s generation, in 1,800 lines of C++ with no dependencies beyond MLX. Just not redoing work was a 125x improvement.
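For anyone wondering what "not redoing work" can look like in a decode loop, here is a minimal sketch. It assumes the phrase means caching per-token state between generation steps (KV-cache style), which the post does not spell out. `process_token`, `TokenState`, `naive_decode` and `cached_decode` are hypothetical names for illustration, not code from the project, and the "model" is a dummy that only counts how often the expensive step runs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the expensive per-token computation
// (in a real transformer, roughly the key/value work for one token).
struct TokenState { int value; };

static std::size_t g_calls = 0;  // how many times the expensive step ran

TokenState process_token(int token) {
    ++g_calls;
    return TokenState{token * 2};
}

// Naive decode: every step reprocesses the whole sequence so far.
// Assumes a non-empty prompt.
std::size_t naive_decode(std::vector<int> tokens, std::size_t steps) {
    g_calls = 0;
    for (std::size_t s = 0; s < steps; ++s) {
        std::vector<TokenState> states;
        for (int t : tokens) states.push_back(process_token(t));
        tokens.push_back(states.back().value % 100);  // dummy "next token"
    }
    return g_calls;
}

// Cached decode: process the prompt once (prefill), then only the
// newest token on each step. Assumes a non-empty prompt.
std::size_t cached_decode(std::vector<int> tokens, std::size_t steps) {
    g_calls = 0;
    std::vector<TokenState> cache;
    for (int t : tokens) cache.push_back(process_token(t));  // prefill
    for (std::size_t s = 0; s < steps; ++s) {
        int next = cache.back().value % 100;  // same dummy rule as above
        tokens.push_back(next);
        cache.push_back(process_token(next));
    }
    return g_calls;
}
```

With a prompt of P tokens and S generated tokens, the naive loop runs the expensive step S*P + S(S-1)/2 times, while the cached loop runs it P + S times. The gap grows with sequence length, so the speedup you see in practice depends on the workload; the numbers in this toy say nothing about the 125x figure.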
Favorite moment: the model suggested enabling MetalFX to speed up inference. That's Apple's game graphics upscaler. It makes explosions look better.
AMA about any of it. We are working on the Qwen3.5 models. Local AI is going to get a lot better.
dirk94018•1h ago