At the time I dropped it for LM Studio, which to be fair was not fully open source either, but at least it exposed the model folder and integrated with HF rather than a proprietary model garden for no good reason. I had to go down the same rabbit hole of finding where things are and how they're sorted/separated/etc. It was unnecessarily painful.
Actually they do. It's an option in the server configuration file.
Due to this post I had to search a bit and it seems that llama.cpp recently got router support[1], so I need to have a look at this.
My main use for this is a Discord bot where I have different models for different features, like replying to messages with images/video or pure text, and non-reply generation of sentiment and image descriptions. These all perform best with different models, and it has been very convenient for the server to just swap models in and out on request.
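For what it's worth, that per-feature routing can be sketched as a small table mapping each bot feature to a model name, with the server swapping models on demand based on the `model` field of a standard OpenAI-style chat request. The model names here are placeholders, not a real config:

```python
# Sketch: route each bot feature to a model behind one OpenAI-compatible
# endpoint (a swapping server loads/unloads models keyed on "model").
# Model names are hypothetical placeholders.

FEATURE_MODELS = {
    "image_reply": "qwen2.5-vl-7b",  # hypothetical vision model
    "text_reply": "qwen2.5-7b",      # hypothetical text model
    "sentiment": "gemma-2-2b",       # hypothetical small model
}

def chat_request(feature: str, prompt: str) -> dict:
    """Build an OpenAI-style chat payload for the model tied to a feature."""
    model = FEATURE_MODELS[feature]
    return {
        "model": model,  # the swapping server keys the swap off this field
        "messages": [{"role": "user", "content": prompt}],
    }

req = chat_request("sentiment", "How does this message read?")
print(req["model"])
```

The nice property is that the bot code never manages processes or VRAM; it only changes a string in the request body.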
[1] https://huggingface.co/blog/ggml-org/model-management-in-lla...
The article mentions llama-swap does this
also you might be the only person in the wild I've seen admit to this
I will switch once we have good user experience on simple features.
A new model is released on HF or the Ollama registry? One `ollama pull` and it's available. It's underwhelming? `ollama rm`.
Seems like maybe, at least some of the time, you're being underwhelmed by Ollama, not the model.
The better-performance point alone seems like reason enough to switch away.
https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...
I started with Ollama, and it was great. But I moved to llama.cpp to have more up-to-date fixes. I still use Ollama to pull and list my models because it's so easy. I then built my own set of scripts to populate a separate cache directory of hardlinks so llama-swap can load the GGUFs into llama.cpp.
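A minimal sketch of that hardlink step, assuming the usual layout of content-addressed blob files sitting together in one directory: it links each blob into a separate cache directory under a friendlier `.gguf` name so llama-swap can point llama.cpp at it. The paths and the naming map are illustrative, not a documented Ollama interface:

```python
import os
from pathlib import Path

def link_blobs(blob_dir: str, cache_dir: str, names: dict) -> list:
    """Hardlink content-addressed blobs into a cache dir under .gguf names.

    `names` maps blob filenames (e.g. 'sha256-abc...') to model names.
    Hardlinks cost no extra disk space; both tools see the same file.
    """
    out = []
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    for blob, model in names.items():
        src = Path(blob_dir) / blob
        dst = Path(cache_dir) / f"{model}.gguf"
        if not dst.exists():
            os.link(src, dst)  # hardlink, not a copy
        out.append(str(dst))
    return out
```

In practice you would derive the `names` mapping from the registry's manifest files rather than hardcoding it.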
I’m open to suggestions, but the alternatives outlined in the blog post ain’t it.
> LM Studio gives you a GUI if that’s what you want. It uses llama.cpp under the hood, exposes all the knobs, and supports any GGUF model without lock-in.
> Jan(https://www.jan.ai/) is another open-source desktop app with a clean chat interface and local-first design.
> Msty(https://msty.ai/) offers a polished GUI with multi-model support and built-in RAG. koboldcpp is another option with a web UI and extensive configuration options.
API-wise: LM Studio has REST, OpenAI, and more API compatibilities.
llama-server -hf ggml-org/gemma-4-E4B-it-GGUF --port 8000 (with MCP support and web chat interface)
and you have OpenAI API on the same 8000 port. (https://github.com/ggml-org/llama.cpp/tree/master/tools/serv... lists the endpoints)
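As a quick smoke test of that endpoint, something like the following should work against any OpenAI-compatible server; the port matches the `--port 8000` above, and the prompt is arbitrary. The helper just builds the standard chat-completions body, so only the final call needs a running server:

```python
import json
import urllib.request

def chat_body(model: str, prompt: str) -> bytes:
    """Encode a minimal OpenAI-style /v1/chat/completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

def ask(base_url: str, model: str, prompt: str) -> str:
    """POST a chat request and return the first choice's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=chat_body(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# e.g. ask("http://localhost:8000", "gemma", "Hello!") once llama-server is up
```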
Easier than what?
I came across LM Studio (mentioned in the post) about 3 years ago, before I even knew what Ollama was. It was far better even then.
Just in case you haven't seen it yet, llama.cpp now has a router mode that lets you hot-swap models. I've switched over from llama-swap and have been happy with it.
Ollama v0.0.1 "Fast inference server written in Go, powered by llama.cpp" https://github.com/ollama/ollama/tree/v0.0.1
It's a joke... but also not really? I mean, VLC is "just" an interface to play videos. Videos are content files one "interacts" with, mostly play/pause and a few other functions like seeking. Because there are different video formats, VLC relies on codecs to decode the videos, basically delegating the "hard" part to the codecs.
Now... what's the difference here? A model is a codec, the interactions are sending text/images/etc to it, and the output is text/images/etc. It's not even radically bigger in size, as videos can be huge, like models.
I'm confused as to why this isn't a solved problem, especially (and yes, I'm being a bit sarcastic here, can't help myself) in a time when "AI" has supposedly made all the smart, wise developers who rely on it 10x or even 1000x more productive.
Weird.
I think the codec analogy is neat but isn't the codec here llama.cpp, and the models are content files? Then the equivalent of VLC are things like LMStudio etc. which use llama.cpp to let you run models locally?
I'd guess one reason we haven't solved the "codec" layer is that there doesn't seem to be a standard that open model trainers have converged on yet?
One command, and you are running the models, even with the ROCm drivers, without knowing it.
If llama provides such a UX, they failed terribly at communicating that. Starting with the name. llama.cpp: that's a cpp library! Ollama is the wrapper. That's the mental model. I don't want to build my own program! I just want to have fun :-P
In fact the first line of the wikipedia article is:
> llama.cpp is an open source software library
brew install llama.cpp
llama-server -hf ggml-org/gemma-4-E4B-it-GGUF --port 8000
Go to localhost:8000 for the Web UI. On Linux it accelerates correctly on my AMD GPU, which Ollama failed to do, though of course everyone's mileage seems to vary on this.
    llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
    llama_model_load_from_file_impl: failed to load model
It makes a bunch of decisions for you so you don't have to think much to get a model up and running.
>One command
Notwithstanding the fact that there's about zero difference between `ollama run model-name` and `llama-server -hf model-name`, and that running things in the terminal is already a gigantic UX blocker (Ollama's popularity comes from the fact that it has a GUI), why are you putting the blame back on an open source project that owes you approximately zero communication?
It's not the GUI, it's the curated model hosting platform. Way easier to use than HF for casual users.
No wonder I get downvoted to hell every time I mention this... People here can't even tell anymore. They just find this horrible slop completely normal. HN is just another dead website filled with slop articles, time to move on to some smaller reddit communities...
The progression follows the pattern cleanly:
1. Launch on open source, build on llama.cpp, gain community trust
2. Minimize attribution, make the product look self-sufficient to investors
3. Create lock-in, proprietary model registry format, hashed filenames that don’t work with other tools
4. Launch closed-source components, the GUI app
5. Add cloud services, the monetization vector

    pacman -Ss ollama | wc -l
    16
    pacman -Ss llama.cpp | wc -l
    0
    pacman -Ss lmstudio | wc -l
    0

Maybe some day.

There are packages for Vulkan, ROCm and CUDA. They all work.
FWIW, llama.cpp does almost everything Ollama does, better than Ollama, with the exception of model management, but like, be real, you can just ask it to write an API of your preferred shape and qwen will handle it without issue.
Edit: or maybe that was your point. I guess that for historical reasons this is a kind of generic name for local deployments now (see https://www.reddit.com/r/LocalLLaMA) just like people will call anything ChatGPT.
    % ramalama run qwen3.5-9b
    Error: Manifest for qwen3.5-9b:latest was not found in the Ollama registry
It's truly open source, backed by Mozilla, openly uses llama.cpp, and was created by wizard Justine Tunney of Cosmopolitan Libc fame.