We built MetalRT from scratch in 48 hours: pure C++ straight to Metal, no abstractions, no compromises. The result is the fastest decode performance available today on Apple Silicon.
658 tokens per second on Qwen3-0.6B (4-bit) using a single M4 Max.
We benchmarked against the strongest competitors on the exact same hardware (M4 Max, 64 GB, macOS 26.3):
- MetalRT
- uzu (Rust production engine)
- mlx-lm (Apple's official MLX framework)
- llama.cpp
- Ollama (REST API)
MetalRT is fastest on 3 of 4 models and wins the only clean apples-to-apples comparison: 1.10–1.19× faster than Apple's own MLX using identical model files.
Average 1.67× faster than llama.cpp, 1.59× faster than Ollama.
TTFT (time to first token) on Qwen3-0.6B: 6.6 ms.
Same model weights = same output quality. Only the speed is different.
Public access coming soon as part of MetalRT by RunAnywhere Team.
SilverElfin•2h ago
If you built it that quick - was it generated using AI?
sanchitmonga•2h ago
Models: Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, LFM2.5-1.2B (all 4-bit quantized, greedy decoding, 5 runs, best reported).