SnapLLM: Switch between local LLM in under 1ms Multi-model&-modal serving engine

1•maheshvaikri99•2h ago

Comments

maheshvaikri99•2h ago

Hey everyone,

I've been working on SnapLLM for a while now and wanted to share it with the community. The problem: If you run local models, you know the pain. You load Llama 3, chat with it, then want to try Gemma or Qwen. That means unloading the current model, waiting 30-60 seconds for the new one to load, and repeating this cycle every single time. It breaks your flow and wastes a ton of time.

What SnapLLM does: It keeps multiple models hot in memory and switches between them in under 1 millisecond (benchmarked at ~0.02ms). Load your models once, then snap between them instantly. No more waiting. How it works: Built on top of llama.cpp and stable-diffusion.cpp Uses a vPID (Virtual Processing-In-Disk) architecture for instant context switching Three-tier memory management: GPU VRAM (hot), CPU RAM (warm), SSD (cold) KV cache persistence so you don't lose context

What it supports: Text LLMs: Llama, Qwen, Gemma, Mistral, DeepSeek, Phi, Unsloth AI models, and anything in GGUF format Vision models: Gemma 3 + mmproj, Qwen-VL + mmproj, LLaVA Image generation: Stable Diffusion 1.5, SDXL, SD3, FLUX via stable-diffusion.cpp OpenAI/Anthropic compatible API so you can plug it into your existing tools Desktop UI, CLI, and REST API

Model switch time between any of these: 0.02ms Getting started is simple: Clone the repo and build from source Download GGUF models from Hugging Face (e.g., gemma-3-4b Q5_K_M) Start the server locally Load models through the Desktop UI or API and point to your model folder Start chatting and switching

NVIDIA CUDA is fully supported for GPU acceleration. CPU-only mode works too.

With SLMs getting better every month, being able to quickly switch between specialized small models for different tasks is becoming more practical than running one large model for everything. Load a coding model, a medical model, and a general chat model side by side and switch based on what you need.

Ideal Use Cases: Multi-domain applications (medical + legal + general) Interactive chat with context switching Document QA with repeated queries On-Premise Edge deployment Edge devices like drones, self-driving vehicles, autonomous vehicles, etc Multi-agent workflow

Demo Videos: SnapLLM Desktop App Demo (Vimeo): https://vimeo.com/1157629276 SnapLLM Server and API Demo (Vimeo): https://vimeo.com/1157624031

The server demo walks through starting the server locally after cloning the repo, downloading models from Hugging Face, and loading them through the UI.

Links: GitHub: https://github.com/snapllm/snapllm Arxiv Paper: https://arxiv.org/submit/7238142/view

Star this repository - It helps others discover SnapLLM

MIT licensed. PRs and feedback welcome. If you have questions about the architecture or run into issues, drop them here or open a GitHub issue.

Sammy Jankins – An Autonomous AI Living on a Computer in Dover, New Hampshire

$10M factory in a 600 square foot room

Show HN: A Deployable Cross-Platform SIMD RNG Library for C++ (With Bnchmks)

NVD – CVE-2026-2070

FelPawns: (Update) AI assisted world generation in RimWorld

Show HN: Maravel-Framework 10.62.8 speeds up the console via commands:cache

My Nanbeige4.1 3B chat room can now generate micro applications [video]

Underrated Music Software – Royalty-Free

Dune II written in HTML5/JS

Show HN: Crypthold – Deterministic, Tamper-Evident Secure State Engine

Language models imply world models

Echoed.gg – Discord Alternative

GLM-5 topped the coding benchmarks. Then I used it

Show HN: PrivateWhisper – Run Whisper locally on macOS (offline transcription)

A minimal terminal coding agent harness

It Isn't the Tool, but the Hands – A Response to "Something Big Is Happening"

Dbt-Workbench, an open-source UI for working with dbt projects

Show HN: PolyMCP – A framework for building and orchestrating MCP agents

Dao Heart 3.11 Identity Preserving Value Evolution for Frontier AI Systems

Backboard.io Becomes First AI Platform to Lead Both Major Memory Benchmarks

Show HN: An automaton's code review of Gas Town with sycophancy-mode disabled

'RageCheck' Points Out Manipulative Language in News Articles

Ask HN: Hacker News Fixed Width for Widescreen Monitors" Userstyle?

Extend Trust Across the Software Supply Chain with Red Hat Trusted Libraries

CIA, Pentagon reviewed secret 'Havana syndrome' device in Norway, WaPo reports

I Analyzed 227M Rows of Medicaid Data. Here's a Sample of What I Found in Maine

AI: A Bridge Toward Diverse Intelligence

How to Write Mathematical Papers by Bruce C. Berndt [pdf]

Curosr: Expanding our long-running agents research preview

Show HN: Cappu – ADHD'er take on a different task manager