InferX is a ground-up rewrite, not an incremental optimization: we snapshot the entire GPU state (model weights, KV cache, CUDA context) and restore it on demand in under 2 seconds.
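To make the idea concrete, here's a minimal, purely illustrative PyTorch sketch of the weights/KV-cache half of a snapshot-and-restore flow. This is *not* the InferX API; the `GpuSnapshot` name and methods are hypothetical, and a real implementation also has to capture the CUDA context and allocator state, which plain PyTorch can't do.

```python
import torch

class GpuSnapshot:
    """Park a model's GPU tensors in pinned host memory so they can be restored quickly."""

    def __init__(self, tensors: dict[str, torch.Tensor]):
        # Copy each GPU tensor into page-locked (pinned) host memory;
        # pinned pages allow fast async DMA transfers back to the device.
        self.host = {name: t.detach().cpu().pin_memory()
                     for name, t in tensors.items()}

    def restore(self, device: str = "cuda") -> dict[str, torch.Tensor]:
        # Re-upload everything; non_blocking=True lets the copies overlap on the
        # current CUDA stream, so restore time is roughly bounded by PCIe/NVLink
        # bandwidth rather than Python overhead.
        return {name: t.to(device, non_blocking=True)
                for name, t in self.host.items()}
```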
The result: speedups of up to 10x and 90%+ GPU utilization. We can even hot-swap models mid-flight, scheduling them onto the GPU the way an OS schedules threads onto a core.
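And a similarly hypothetical sketch of the "models as threads" idea: a tiny scheduler that keeps several models registered as host-side snapshots and swaps the active one onto the GPU per request. `ModelSlot` and `GpuScheduler` are illustrative names, not anything from the repo.

```python
class ModelSlot:
    # One registered model: a host-side snapshot plus (optionally) its live GPU tensors.
    def __init__(self, name: str, tensors: dict[str, torch.Tensor]):
        self.name = name
        self.snapshot = GpuSnapshot(tensors)  # parked in host memory
        self.live = None                      # GPU copies only when active

class GpuScheduler:
    # Keeps many models registered but only one resident on the GPU,
    # swapping on demand like an OS scheduling threads onto a core.
    def __init__(self):
        self.slots: dict[str, ModelSlot] = {}
        self.active: ModelSlot | None = None

    def register(self, name: str, tensors: dict[str, torch.Tensor]) -> None:
        self.slots[name] = ModelSlot(name, tensors)

    def activate(self, name: str) -> dict[str, torch.Tensor]:
        # Evict whatever is currently resident, then restore the requested
        # model from its host snapshot.
        if self.active is not None and self.active.name != name:
            self.active.live = None       # drop GPU copies; the host snapshot persists
            torch.cuda.empty_cache()      # hand freed blocks back to the driver
        slot = self.slots[name]
        if slot.live is None:
            slot.live = slot.snapshot.restore()
        self.active = slot
        return slot.live
```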
Anyone still doing inference the old way, seriously?
Tech deep dive & benchmarks: https://github.com/inferx-net/inferx
(We're also open-sourcing parts of this soon.)