So I built oMLX. It persists KV cache blocks to SSD, and when a previous context comes back, it restores them from disk instead of recomputing. This alone made Qwen3-Coder-80B on my M3 Ultra actually usable for real coding sessions.
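If you want a feel for the idea, here's a rough sketch. This is not oMLX's actual code, just a toy version under some assumptions: blocks are keyed by a chained hash of their tokens, each block is pickled to its own file, and `BLOCK_SIZE`, `DiskKVCache`, and `fake_prefill` are all made-up names for illustration.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

BLOCK_SIZE = 4  # tokens per block (hypothetical; tiny so the demo is readable)


def block_keys(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a key identifies the whole prefix up to it."""
    keys, prev = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for i in range(0, full, block_size):
        h = hashlib.sha256(prev + repr(tokens[i:i + block_size]).encode())
        prev = h.digest()
        keys.append(h.hexdigest())
    return keys


class DiskKVCache:
    """Toy on-disk block store: one file per block, named by its chained hash."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, tokens, kv_blocks):
        for key, kv in zip(block_keys(tokens), kv_blocks):
            path = self.root / f"{key}.pkl"
            if not path.exists():  # identical prefixes are written only once
                path.write_bytes(pickle.dumps(kv))

    def restore(self, tokens):
        """Return (blocks, n_tokens) for the longest prefix already on disk."""
        restored = []
        for key in block_keys(tokens):
            path = self.root / f"{key}.pkl"
            if not path.exists():
                break  # first miss: everything after this must be recomputed
            restored.append(pickle.loads(path.read_bytes()))
        return restored, len(restored) * BLOCK_SIZE


def fake_prefill(tokens):
    """Stand-in for the expensive forward pass that produces KV tensors."""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    return [("kv", tuple(tokens[i:i + BLOCK_SIZE])) for i in range(0, full, BLOCK_SIZE)]


with tempfile.TemporaryDirectory() as tmp:
    cache = DiskKVCache(tmp)
    first = list(range(10))
    cache.save(first, fake_prefill(first))

    # A later request shares the first 8 tokens, then diverges.
    second = list(range(8)) + [99, 100, 101, 102]
    blocks, reused = cache.restore(second)
    print(f"reused {reused} of {len(second)} tokens from disk")  # reused 8 of 12
```

The chained hash is the part that matters: a block's key depends on everything before it, so a hit means the whole prefix matches, not just those few tokens.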
Some other stuff it does: continuous batching, multi-model serving (LLM + embedding + reranker at once), prefix sharing with copy-on-write, and a native Mac menu bar app so you don't have to touch the terminal.
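For anyone wondering what "prefix sharing with copy-on-write" means in practice, here's a minimal sketch of the general technique. Again, this is an illustration and not how oMLX implements it: `Block` and `Sequence` are hypothetical names, and the blocks hold token IDs where a real server would hold KV tensors.

```python
BLOCK_SIZE = 4  # tokens per block (hypothetical)


class Block:
    """A cache block with a reference count of how many sequences use it."""

    def __init__(self, tokens=None):
        self.tokens = list(tokens or [])
        self.refs = 1


class Sequence:
    def __init__(self, blocks=None):
        self.blocks = blocks or []

    def fork(self):
        """Share every existing block with a new sequence (no data copied)."""
        for b in self.blocks:
            b.refs += 1
        return Sequence(list(self.blocks))

    def append(self, token):
        # Start a fresh block when there is none or the last one is full.
        if not self.blocks or len(self.blocks[-1].tokens) == BLOCK_SIZE:
            self.blocks.append(Block())
        last = self.blocks[-1]
        if last.refs > 1:
            # Copy-on-write: someone else still reads this block, so write to
            # a private copy and leave the shared one untouched.
            last.refs -= 1
            last = Block(last.tokens)
            self.blocks[-1] = last
        last.tokens.append(token)


a = Sequence()
for t in [1, 2, 3, 4, 5, 6]:
    a.append(t)

b = a.fork()   # b shares both of a's blocks
b.append(7)    # only the partially filled last block gets copied

print(a.blocks[0] is b.blocks[0])  # True: full prefix block is still shared
print(a.blocks[1].tokens, b.blocks[1].tokens)  # [5, 6] [5, 6, 7]
```

The payoff is that two requests with the same system prompt or file context keep one copy of the shared blocks in memory, and only pay for the part where they diverge.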
Just shipped a built-in benchmark tool too, so you can test your own setup easily.