Hi HN, I'm the developer. I built oMLX because every Mac LLM server I tried fell apart with coding agents.
The core problem: tools like Claude Code and OpenClaw send requests where the prompt prefix keeps shifting. Most servers invalidate the KV cache after each turn, forcing a full re-prefill. On large contexts (50-100K tokens), that means 30-90 seconds of waiting per request. After a few turns, it's practically unusable.
oMLX solves this with paged SSD caching. Every KV cache block is persisted to disk. When a previous prefix comes back, it's restored from SSD instead of being recomputed. In practice, TTFT drops from 30-90s to 1-3s on follow-up requests. Cache blocks survive server restarts too.
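To make the idea concrete, here's a toy sketch of block-level prefix caching in pure Python — not oMLX's actual code, just the gist: each fixed-size block of the prompt is keyed by a hash of all tokens up to and including it, so a follow-up request that extends a previous prompt finds its shared prefix on disk and only prefills the tail.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

BLOCK = 4  # tokens per cache block (illustrative; real block sizes are larger)

class DiskKVCache:
    """Toy prefix-block cache: each block is keyed by a hash of all tokens
    up to and including it, so two prompts that share a prefix map to the
    same files on disk and the shared part never needs re-prefilling."""

    def __init__(self, root: Path):
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def _key(self, tokens, end):
        return hashlib.sha256(repr(tokens[:end]).encode()).hexdigest()

    def save(self, tokens, kv_blocks):
        # Persist one file per block; strings stand in for real KV tensors.
        for i, kv in enumerate(kv_blocks):
            path = self.root / self._key(tokens, (i + 1) * BLOCK)
            path.write_bytes(pickle.dumps(kv))

    def longest_prefix(self, tokens):
        # Walk block-aligned prefixes until the first cache miss.
        restored = []
        for i in range(len(tokens) // BLOCK):
            path = self.root / self._key(tokens, (i + 1) * BLOCK)
            if not path.exists():
                break
            restored.append(pickle.loads(path.read_bytes()))
        return len(restored) * BLOCK, restored

cache = DiskKVCache(Path(tempfile.mkdtemp()))
turn1 = list(range(10))                    # pretend token IDs from turn 1
cache.save(turn1, ["kv0", "kv1"])          # two blocks cover tokens 0..7
hit, blocks = cache.longest_prefix(turn1 + [99, 100])  # turn 2 extends turn 1
# hit == 8: only the new tail needs prefilling on the follow-up request
```

The real system does the same walk over hashed pages but restores actual KV tensors, which is why restarts don't lose the cache.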
What's under the hood:
- Continuous batching via mlx-lm (multiple concurrent requests)
- Multi-model serving (LLM + VLM + Embedding + Reranker simultaneously, LRU eviction)
- OpenAI- and Anthropic-compatible APIs
- Tool calling support (JSON, Qwen, Gemma, MiniMax, GLM formats + MCP)
- Native macOS menubar app (PyObjC, signed DMG) — download, drag, done
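For anyone curious what tool calling looks like against the OpenAI-compatible endpoint, this is the standard request shape for POST /v1/chat/completions (the model name and tool schema here are illustrative, not oMLX specifics):

```python
import json

# Request body for POST /v1/chat/completions on an OpenAI-compatible
# server. The model name and the tool itself are made up for illustration.
payload = {
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "List the files in /tmp"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "list_files",  # hypothetical tool
            "description": "List files in a directory",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
}
body = json.dumps(payload)
```

When the model decides to use the tool, a compliant server returns a `tool_calls` entry in the assistant message rather than plain text, which is what agents like Claude Code rely on.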
As of v0.2.0: Vision-Language Model support with the same tiered caching
Benchmarks on an M3 Ultra (512GB) with Qwen3-Coder-Next-8bit: 58.7 tok/s for a single request, 243 tok/s at 8x batch. Full results are in the README.
It reuses LM Studio's model directory, so you don't need to re-download anything.
100% open source (Apache 2.0): https://github.com/jundot/omlx