At the time I dropped it for LM Studio, which to be fair was not fully open source either, but at least it exposed the model folder and integrated with HF rather than a proprietary model garden for no good reason. I had to go down the same rabbit hole of finding where things are and how they're sorted/separated/etc. It was unnecessarily painful.
Actually they do. It's an option in the server configuration file.
Due to this post I had to search a bit and it seems that llama.cpp recently got router support[1], so I need to have a look at this.
My main use for this is a Discord bot where I have different models for different features, like replying to messages with images/video or pure text, and non-reply generation of sentiment and image descriptions. These all perform best with different models, and it has been very convenient for the server to just swap models in and out on request.
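For what it's worth, that per-feature routing can be sketched as a small table mapping each bot feature to a model name, with the server swapping models on demand based on the `model` field of a standard OpenAI-style chat request. The model names here are placeholders, not a real config:

```python
# Sketch: route each bot feature to a model behind one OpenAI-compatible
# endpoint (a swapping server loads/unloads models keyed on "model").
# Model names are hypothetical placeholders.

FEATURE_MODELS = {
    "image_reply": "qwen2.5-vl-7b",  # hypothetical vision model
    "text_reply": "qwen2.5-7b",      # hypothetical text model
    "sentiment": "gemma-2-2b",       # hypothetical small model
}

def chat_request(feature: str, prompt: str) -> dict:
    """Build an OpenAI-style chat payload for the model tied to a feature."""
    model = FEATURE_MODELS[feature]
    return {
        "model": model,  # the swapping server keys the swap off this field
        "messages": [{"role": "user", "content": prompt}],
    }

req = chat_request("sentiment", "How does this message read?")
print(req["model"])
```

The nice property is that the bot code never manages processes or VRAM; it only changes a string in the request body.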
[1] https://huggingface.co/blog/ggml-org/model-management-in-lla...
The article mentions llama-swap does this
also you might be the only person in the wild I've seen admit to this
I will switch once we have good user experience on simple features.
A new model is released on HF or the Ollama registry? One `ollama pull` and it's available. It's underwhelming? `ollama rm`.
Seems like maybe, at least some of the time, you're being underwhelmed by Ollama, not the model.
The better-performance point alone seems like reason enough to switch away.
https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...
I started with Ollama, and it was great. But I moved to llama.cpp to have more up-to-date fixes. I still use Ollama to pull and list my models because it's so easy. I then built my own set of scripts to populate a separate cache directory of hardlinks so llama-swap can load the GGUFs into llama.cpp.
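A minimal sketch of that hardlink step, assuming the usual layout of content-addressed blob files sitting together in one directory: it links each blob into a separate cache directory under a friendlier `.gguf` name so llama-swap can point llama.cpp at it. The paths and the naming map are illustrative, not a documented Ollama interface:

```python
import os
from pathlib import Path

def link_blobs(blob_dir: str, cache_dir: str, names: dict) -> list:
    """Hardlink content-addressed blobs into a cache dir under .gguf names.

    `names` maps blob filenames (e.g. 'sha256-abc...') to model names.
    Hardlinks cost no extra disk space; both tools see the same file.
    """
    out = []
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    for blob, model in names.items():
        src = Path(blob_dir) / blob
        dst = Path(cache_dir) / f"{model}.gguf"
        if not dst.exists():
            os.link(src, dst)  # hardlink, not a copy
        out.append(str(dst))
    return out
```

In practice you would derive the `names` mapping from the registry's manifest files rather than hardcoding it.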
I’m open to suggestions, but the alternatives outlined in the blog post ain’t it.
> LM Studio gives you a GUI if that’s what you want. It uses llama.cpp under the hood, exposes all the knobs, and supports any GGUF model without lock-in.
> Jan(https://www.jan.ai/) is another open-source desktop app with a clean chat interface and local-first design.
> Msty(https://msty.ai/) offers a polished GUI with multi-model support and built-in RAG. koboldcpp is another option with a web UI and extensive configuration options.
API-wise: LM Studio has REST, OpenAI, and more API compatibilities.
llama-server -hf ggml-org/gemma-4-E4B-it-GGUF --port 8000 (with MCP support and web chat interface)
and you have OpenAI API on the same 8000 port. (https://github.com/ggml-org/llama.cpp/tree/master/tools/serv... lists the endpoints)
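As a quick smoke test of that endpoint, something like the following should work against any OpenAI-compatible server; the port matches the `--port 8000` above, and the prompt is arbitrary. The helper just builds the standard chat-completions body, so only the final call needs a running server:

```python
import json
import urllib.request

def chat_body(model: str, prompt: str) -> bytes:
    """Encode a minimal OpenAI-style /v1/chat/completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

def ask(base_url: str, model: str, prompt: str) -> str:
    """POST a chat request and return the first choice's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=chat_body(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# e.g. ask("http://localhost:8000", "gemma", "Hello!") once llama-server is up
```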
Easier than what?
I came across LM Studio (mentioned in the post) about 3 years ago, before I even knew what Ollama was. It was far better even then.
Just in case you haven't seen it yet, llama.cpp now has a router mode that lets you hot-swap models. I've switched over from llama-swap and have been happy with it.
Ollama v0.0.1 "Fast inference server written in Go, powered by llama.cpp" https://github.com/ollama/ollama/tree/v0.0.1
It's a joke... but also not really? I mean, VLC is "just" an interface to play videos. Videos are content files one "interacts" with, mostly play/pause and a few other functions like seeking. Because there are different video formats, VLC relies on codecs to decode the videos, basically delegating the "hard" part to the codecs.
Now... what's the difference here? A model is a codec, the interactions are sending text/images/etc to it, and the output is text/images/etc. It's not even radically bigger in size, as videos can be huge, like models.
I'm confused as to why this isn't a solved problem, especially (and yes, I'm being a bit sarcastic here, can't help myself) in a time when "AI" has supposedly made all the smart, wise developers who rely on it 10x or even 1000x more productive.
Weird.
I think the codec analogy is neat but isn't the codec here llama.cpp, and the models are content files? Then the equivalent of VLC are things like LMStudio etc. which use llama.cpp to let you run models locally?
I'd guess one reason we haven't solved the "codec" layer is that there doesn't seem to be a standard that open model trainers have converged on yet?
One command, and you are running the models, even with the ROCm drivers, without knowing it.
If llama provides such a UX, they failed terribly at communicating that. Starting with the name. llama.cpp: that's a cpp library! Ollama is the wrapper. That's the mental model. I don't want to build my own program! I just want to have fun :-P
In fact the first line of the wikipedia article is:
> llama.cpp is an open source software library
brew install llama.cpp
llama-server -hf ggml-org/gemma-4-E4B-it-GGUF --port 8000
Go to localhost:8000 for the Web UI. On Linux it accelerates correctly on my AMD GPU, which Ollama failed to do, though of course everyone's mileage seems to vary on this.
    llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
    llama_model_load_from_file_impl: failed to load model
It makes a bunch of decisions for you so you don't have to think much to get a model up and running.
>One command
Notwithstanding the fact that there's about zero difference between `ollama run model-name` and `llama-server -hf model-name`, and that running things in the terminal is already a gigantic UX blocker (Ollama's popularity comes from the fact that it has a GUI), why are you putting the blame back on an open source project that owes you approximately zero communication?
It's not the GUI, it's the curated model hosting platform. Way easier to use than HF for casual users.
No wonder I get downvoted to hell every time I mention this... People here can't even tell anymore. They just find this horrible slop completely normal. HN is just another dead website filled with slop articles, time to move on to some smaller reddit communities...
The progression follows the pattern cleanly:
1. Launch on open source, build on llama.cpp, gain community trust
2. Minimize attribution, make the product look self-sufficient to investors
3. Create lock-in, proprietary model registry format, hashed filenames that don’t work with other tools
4. Launch closed-source components, the GUI app
5. Add cloud services, the monetization vector

    pacman -Ss ollama | wc -l
    16
    pacman -Ss llama.cpp | wc -l
    0
    pacman -Ss lmstudio | wc -l
    0

Maybe some day.

There are packages for Vulkan, ROCm and CUDA. They all work.
FWIW, llama.cpp does almost everything Ollama does, better than Ollama, with the exception of model management, but like, be real, you can just ask it to write an API of your preferred shape and qwen will handle it without issue.
Edit: or maybe that was your point. I guess that for historical reasons this is a kind of generic name for local deployments now (see https://www.reddit.com/r/LocalLLaMA) just like people will call anything ChatGPT.
    % ramalama run qwen3.5-9b
    Error: Manifest for qwen3.5-9b:latest was not found in the Ollama registry
It's truly open source, backed by Mozilla, openly uses llama.cpp, and was created by wizard Justine Tunney of Cosmopolitan Libc fame.