We’ve updated Docker Model Runner to support vLLM alongside the existing llama.cpp backend. The goal is to bridge the gap between local prototyping (typically GGUF/llama.cpp) and high-throughput production serving (typically Safetensors/vLLM) with a consistent Docker workflow.
Key technical details:
Auto-routing: The tool detects the model format. If you pull a GGUF model, it routes to llama.cpp. If you pull a Safetensors model, it routes to vLLM.
API: It exposes an OpenAI-compatible API (/v1/chat/completions), so client code doesn't need to change based on the backend (see the sketch after this list).
Usage: It’s just docker model run ai/smollm2-vllm.
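Because both backends sit behind the same OpenAI-compatible endpoint, a standard OpenAI client works unchanged whether a model resolves to llama.cpp or vLLM. Here's a minimal sketch using the openai Python package; the local base URL and the placeholder API key are assumptions, so substitute whatever endpoint your Model Runner instance actually exposes:

    from openai import OpenAI

    # Point the standard OpenAI client at the local Model Runner endpoint.
    # The base_url below is an assumption for illustration; use the host/port
    # your Docker Model Runner setup exposes.
    client = OpenAI(
        base_url="http://localhost:12434/engines/v1",  # assumed local endpoint
        api_key="not-needed-locally",                  # placeholder; no auth for local use
    )

    # The same call works regardless of whether the model was routed to
    # llama.cpp (GGUF) or vLLM (Safetensors).
    resp = client.chat.completions.create(
        model="ai/smollm2-vllm",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(resp.choices[0].message.content)

The point of the sketch: swapping the model (and therefore the backend) only changes the model name; the client code and API surface stay the same.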
Current Limitations:
Right now, the vLLM backend is optimized for x86_64 with Nvidia GPUs.
We are actively working on WSL2 support for Windows users and DGX Spark compatibility.
Happy to answer any questions about the integration or the roadmap!
https://www.docker.com/blog/docker-model-runner-integrates-v...