Ollama is slower, and they started out as a shameless llama.cpp ripoff without giving credit. Now they've "ported" it to Go, which means they're just vibe-code-translating llama.cpp, bugs included.
Hmm, what about the fact that Ollama is open-source, can run in Docker, etc.?
And didn't Ollama independently ship a vision pipeline for some multimodal models months before llama.cpp supported it?
Lately I’ve been playing with Unsloth Studio and think that’s probably a much better “give it to a beginner” default.
What does unsloth-studio bring on top?
Unsloth Studio is more featureful (well-integrated tool calling, web search, and code execution are the headline features), and it comes from the people consistently making some of the best GGUF quants of all the popular models. It's also well documented, easy to set up, and has good fine-tuning support.
So you start there, and eventually you want to get off the happy path; then you need to learn more about the server, and it's all so much more complicated than just using Ollama. You just want to try models, not learn the intricacies of hosting LLMs.
brew install llama.cpp
Use the built-in CLI, server, or chat interface, and hook it up to any other app.
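To make that concrete, a minimal sketch of the workflow described above. The binary names (`llama-cli`, `llama-server`) match recent llama.cpp releases; the model path is a placeholder, any local GGUF file works:

```shell
# Install llama.cpp with Homebrew (macOS/Linux).
brew install llama.cpp

# Built-in CLI: one-shot prompt against a local GGUF model
# (./model.gguf is a placeholder path, not a real file).
llama-cli -m ./model.gguf -p "Say hello"

# Or run the bundled OpenAI-compatible server, which also
# serves a web chat UI on the same port:
llama-server -m ./model.gguf --port 8080
# Other apps can then talk to http://localhost:8080/v1
```

Anything that speaks the OpenAI API (editors, coding agents, chat frontends) can point at that server, which is the "hook it up to any other app" part.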
https://www.youtube.com/live/G5OVcKO70ns
The ~10 GB model is super speedy, loading in a few seconds and responding almost instantly. If you just want to see the difference: in the video, the ~10 GB model says hello around the 2-minute mark (and fast!), while the ~20 GB model says hello around 5:45. The gap in loading time and speed is substantial. I also had each of them complete a difficult coding task; both got it right, but the 20 GB model was much slower. It's a bit too slow to use day to day on this setup, plus it would take almost all the memory. The 10 GB model fits comfortably on a 24 GB Mac mini with plenty of RAM left for everything else, and it seems usable for small, useful coding tasks.
greenstevester•2h ago
It's essentially a model that's learned to do the absolute minimum amount of work while still getting paid. I respect that enormously.
It scores 1441 on Arena Elo, roughly the same as Qwen 3.5 at 397B and Kimi k2.5 at 1100B.
Ollama v0.19 switched to Apple's MLX framework on Apple Silicon. 93% faster decode.
They've also improved caching so your coding agents don't have to re-read the entire prompt every time; about time, I'd say.
The gist covers the full setup: install, auto-start on boot, keep the model warm in memory.
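For the auto-start and keep-warm parts, a hedged sketch of one common approach on macOS (this is an assumption about the gist's contents, not a quote from it). `brew services` registers the server with launchd, and Ollama's documented `OLLAMA_KEEP_ALIVE` variable controls how long models stay resident after a request:

```shell
# Auto-start the Ollama server at login via launchd (Homebrew wrapper).
brew services start ollama

# Keep loaded models resident indefinitely instead of unloading
# after the default idle timeout; this must be set in the server's
# environment (e.g. the launchd service env), not just your shell.
export OLLAMA_KEEP_ALIVE=-1
```

The same keep-alive behavior can be requested per call via the API's `keep_alive` field, if you'd rather not pin every model in memory.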
It runs on a 24GB Mac mini, which means the most expensive part of your local AI setup is still the desk you put it on.
krzyk•36m ago
And considering that this Mac mini won't be doing anything else, is there a reason not to just buy a subscription from Claude, OpenAI, Google, etc.?
Are those open models more performant than Sonnet 4.5/4.6? Or do they at least have a bigger context?