We have just published a short demo of the WoolyAI GPU Hypervisor showcasing VRAM sharing/deduplication: load a single base model into VRAM once, then run multiple isolated LoRA or vLLM stacks against it on the same GPU.
Why this matters
Higher capacity: Share the base model in VRAM; add more adapters or vertical inference stacks per GPU without duplicating the base model's memory footprint.
Isolation & control: Each stack is its own process with independent batching and SLA-aware scheduling.
While vLLM supports multiple adapters in a single vLLM process (roughly as in the sketch below), many teams need predictable per-adapter SLAs. Running independent stacks that share one copy of the base model in VRAM makes that possible on a single GPU.
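For reference, here is a minimal sketch of that in-process route using vLLM's multi-LoRA support; the model and adapter paths are placeholders, and the exact argument names may vary by vLLM version:

```python
# Sketch: multiple LoRA adapters served from ONE vLLM process.
# All adapters share the same engine and batching queue, which is why
# per-adapter SLAs are harder to guarantee in this setup.
# Model name and adapter paths below are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True, max_loras=4)
params = SamplingParams(max_tokens=128)

# Each request names the adapter it wants; all requests share one scheduler.
out_a = llm.generate(["Summarize our Q3 report."], params,
                     lora_request=LoRARequest("finance", 1, "/adapters/finance"))
out_b = llm.generate(["Draft a support reply."], params,
                     lora_request=LoRARequest("support", 2, "/adapters/support"))
```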
The demo runs LoRA inference with PyTorch, but the same approach applies with vLLM; a rough sketch of what one isolated stack looks like follows.
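This is a minimal sketch of one such stack, using PyTorch with Hugging Face transformers/peft; the model and adapter names are placeholders, and the deduplication of the shared base weights across processes is assumed to happen transparently in the hypervisor rather than in this code:

```python
# Sketch: ONE isolated LoRA stack (run one process like this per adapter).
# Every process loads the same base model; the hypervisor is assumed to
# deduplicate those identical base weights in VRAM across processes.
# Model and adapter names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"   # same base model in every process
ADAPTER = "/adapters/finance"               # different LoRA adapter per process

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER)   # attach this stack's LoRA
model.eval()

inputs = tokenizer("Summarize our Q3 report.", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Each process keeps its own batching, scheduling, and adapter, so one noisy tenant cannot starve another; the shared base weights are the only thing the stacks have in common.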
If you’re scaling LoRA inference across business units or model variants and need predictable latency without overprovisioning GPUs, I’d love your feedback. Comment or DM to chat.