After playing around with local AI setups for a while, I kept getting annoyed at having to juggle a separate llama.cpp server for each model. Switching between them was a pain, and I always had to restart things just to load a new model.
So I ended up building something to fix that. It's called FlexLLama -
https://github.com/yazon/flexllama
Basically, it's a tool that lets you run multiple llama.cpp instances easily, spread across CPU and GPUs if you've got 'em. Everything sits behind a single OpenAI-compatible API.
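Here's roughly what that looks like from the client side. This is just a sketch using the standard OpenAI Python client; the port and model name are placeholders, use whatever your FlexLLama config actually exposes:

```python
# Minimal sketch: talking to FlexLLama through its OpenAI-compatible API.
# The base URL/port and the model name below are assumptions, not defaults
# pulled from the project -- swap in your own config values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local FlexLLama address
    api_key="not-needed",                 # local server, key is a placeholder
)

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",  # hypothetical model name from your config
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```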
You can run chat models, embeddings, and rerankers all at once, and the models assigned to each runner are reloaded on the fly.
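So, for example, an embeddings request goes to the same endpoint as chat; FlexLLama routes it to whichever runner owns that model and reloads it if it isn't resident. Again, the address and model name here are just placeholders:

```python
# Sketch: embeddings against the same FlexLLama endpoint. Address and model
# name are assumptions -- use the ones from your own config.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

emb = client.embeddings.create(
    model="nomic-embed-text",  # hypothetical embedding model from your config
    input=["FlexLLama routes this to the embedding runner."],
)
print(len(emb.data[0].embedding))  # dimensionality of the returned vector
```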
There's a little web dashboard to monitor and manage runners.
It's super easy to get started: just pip install from the repo, or grab the Docker image for a speedy setup.
I've been using it myself with things like OpenWebUI and some VS Code extensions (Roo Code, Cline, Continue.dev), and it works flawlessly.