The goal is to explore what a Rust-native control plane for modern multimodal inference looks like: continuous batching, KV-aware admission control, predictable behavior under load, and proper streaming semantics.
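To make "KV-aware admission control" concrete, here's a minimal sketch of the idea (the type and field names below are shorthand for this post, not the actual API): a request is admitted only if its worst-case KV-cache footprint fits in free blocks, queued if it could fit once running requests drain, and rejected immediately if it can never fit.

```rust
// Illustrative only: `KvPool` and its fields are placeholders,
// not the real types in the repo.
struct KvPool {
    total_blocks: usize,
    used_blocks: usize,
    tokens_per_block: usize,
}

enum Admission {
    Admit,
    Queue, // could fit once running requests release blocks
    Shed,  // can never fit; fail fast (e.g. 429) instead of stalling
}

impl KvPool {
    fn admit(&self, prompt_tokens: usize, max_new_tokens: usize) -> Admission {
        // Worst-case footprint: the full prompt plus the entire decode budget.
        let needed = (prompt_tokens + max_new_tokens).div_ceil(self.tokens_per_block);
        let free = self.total_blocks - self.used_blocks;
        if needed <= free {
            Admission::Admit
        } else if needed <= self.total_blocks {
            Admission::Queue
        } else {
            Admission::Shed
        }
    }
}
```

Shedding impossible requests up front, rather than letting them camp in the queue, is a big part of staying predictable under load.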
The system exposes an OpenAI-compatible API, supports multi-image inputs, and is designed to degrade gracefully under overload rather than OOM or stall. It’s organized as a monorepo with a gateway, GPU workers, a scheduler, and pluggable engine adapters.
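To give a rough sense of the adapter boundary, it's conceptually a trait along these lines (simplified for the post; the actual trait and types in the repo differ):

```rust
// Simplified sketch of an engine adapter; real request/response types
// would carry sampling params, usage stats, etc.
use futures::stream::BoxStream;

pub struct GenerateRequest {
    pub prompt: String,
    pub images: Vec<Vec<u8>>, // encoded image bytes for multi-image inputs
    pub max_new_tokens: usize,
}

pub struct TokenChunk {
    pub text: String,
    pub finished: bool,
}

#[async_trait::async_trait]
pub trait EngineAdapter: Send + Sync {
    /// Start generation and stream chunks back. Dropping the returned
    /// stream should propagate cancellation to the worker.
    async fn generate(
        &self,
        req: GenerateRequest,
    ) -> anyhow::Result<BoxStream<'static, anyhow::Result<TokenChunk>>>;
}
```

Tying cancellation to stream drop means a client disconnect at the gateway can unwind all the way down to the GPU worker.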
We’ve also included a benchmark suite focused on real-world scenarios rather than synthetic tokens/sec numbers: time to first token (TTFT), cancellation, overload, and fairness.
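For example, TTFT here means wall-clock time from sending the request to receiving the first streamed event. A minimal probe against the OpenAI-compatible endpoint looks something like this (illustrative; a real harness would parse SSE frames properly):

```rust
use std::time::{Duration, Instant};

// Illustrative TTFT probe; the model name and prompt are placeholders.
async fn measure_ttft(client: &reqwest::Client, base_url: &str) -> anyhow::Result<Duration> {
    let start = Instant::now();
    let mut resp = client
        .post(format!("{base_url}/v1/chat/completions"))
        .json(&serde_json::json!({
            "model": "default",
            "stream": true,
            "messages": [{ "role": "user", "content": "hello" }],
        }))
        .send()
        .await?
        .error_for_status()?;
    // Treat the first body chunk as the first token event. Good enough
    // for a sketch; in practice the first SSE frame may be a role delta.
    let _first_chunk = resp.chunk().await?;
    Ok(start.elapsed())
}
```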
Would love feedback from folks building or operating inference infrastructure.