`lms chat` already existed; `lms daemon up` / "llmster" is the new command.
Ah, this is great, been waiting for this! I naively created some tooling on top of the API from the desktop app after seeing they had a CLI, then once I wanted to deploy and run it on a server, I got very confused that the desktop app actually installs the CLI and it requires the desktop app running.
Great that they finally got it working fully headless now :)
I had used oobabooga back in the day and found ollama unnecessary.
I get that I can run local models, but all the paid for (remote) models are superior.
So is the use-case just for people who don’t want to use big tech’s models? Is this just for privacy conscious people? Or is this just for “adult” chats, ie porn bots?
Not being cynical here, just wanting to understand the genuine reasons people are using it.
For me the main BIG deal is that cloud models have online search embedded etc, while this one doesn't.
However, if you don't need that (e.g., for translating, summarizing text, or writing code), it's probably good enough.
I exclusively run local models. On par with Opus 4.5 for most things. gpt-oss is pretty capable. Qwen3 as well.
?
Are you asking it for capital cities or what?
I've invested heavily in local inference. For me, it's a mixture of privacy, control, stability, and cognitive security.
Privacy - my agents can work on tax docs, personal letters, etc.
Control - I do inference steering with some projects: constraining which token can be generated next at any point in time. Not possible with API endpoints.
Stability - I had many bad experiences with frontier labs' inference quality shifting within the same day, likely because of quantization under system load. Worse, they retire models, update their own system prompts, etc. They're not stable.
Cognitive Security - This has become more important as I rely more on my agents for performing administrative work. This is intermixed with the Control/Stability concerns, but the focus is on whether I can trust it to do what I intended it to do, and that it's acting on my instructions, rather than the labs'.
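To make the Control point concrete: constraining which token can be generated next comes down to restricting the choice at each decoding step to an allowed set, which needs access to the raw logits. The following is a minimal sketch of a single greedy step, not any particular runtime's API; the function name, toy vocabulary, and logit values are all made up for illustration.

```python
def constrained_greedy_step(logits, allowed_token_ids):
    """Pick the highest-scoring token among the allowed ones.

    logits: list of floats, one score per vocabulary entry.
    allowed_token_ids: set of vocabulary indices permitted at this step.
    Restricting the argmax this way is equivalent to masking every
    disallowed token to -inf before choosing.
    """
    candidates = [i for i in allowed_token_ids if 0 <= i < len(logits)]
    if not candidates:
        raise ValueError("no valid tokens allowed at this step")
    return max(candidates, key=lambda i: logits[i])


# Toy vocabulary and made-up logits (hypothetical values).
vocab = ["yes", "no", "maybe", "<eos>"]
logits = [1.2, 0.4, 3.1, -0.5]

# With every token allowed, the top-scoring token wins.
unconstrained = vocab[constrained_greedy_step(logits, set(range(len(vocab))))]

# Restricted to a yes/no answer, "maybe" can never be produced.
constrained = vocab[constrained_greedy_step(logits, {0, 1})]
```

In a real setup the allowed set would be recomputed at every step, for example from a grammar or a schema, and the chosen token fed back into the model.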
You don't need LM Studio to run local models; it was (formerly) just a nice UI to download and manage HF models and llama.cpp updates, and to quickly and easily switch manually between CPU / Vulkan / ROCm / CUDA (depending on your platform).
Regarding your actual question, there are several reasons.
First off, your allusion to privacy - absolutely, yes, some people use it for adult role-play. However, consider the more productive motivations for privacy, too: a lot of businesses have trade secrets they may want to discuss or work on with local models without ever releasing that information to cloud providers, no matter how much those cloud providers pinky promise to never peek at it. Google, Microsoft, Meta, et al have consistently demonstrated that they do not value or respect customer privacy expectations, and that they will eagerly comply with illegal, unconstitutional NSA conspiracies to facilitate bulk collection of customer information / data. There is no reason to believe Anthropic, OpenAI, Google, xAI would act any differently today. In fact, there is already a standing court order forcing OpenAI to preserve all customer communications, in a format that can be delivered to the court (i.e. plaintext, or encryption at rest + willing to provide decryption keys to the court), in perpetuity (https://techstartups.com/2025/06/06/court-orders-openai-to-p...)
There are also businesses which have strict, absolute needs for 24/7 availability and low latency, which remote APIs never have offered. Even if the remote APIs were flawless, and even if the businesses have a robust multi-WAN setup with redundant UPS systems, network downtime or even routing issues are more or less an inevitable fact of life, sooner or later. Having local models means you have inference capability as long as you have electricity.
Consider, too, the integrity front: frontier labs may silently modify API-served models to be lower quality for heavy users with little means of detection by end users (multiple labs have been suspected / accused of this; a lack of proof isn't evidence that it didn't happen) or that the API-served models can be modified over time to patch behaviors that may have been previously relied upon for legitimate workloads (imagine a red team that used a jailbreak to get a model to produce code for process hollowing, for instance). This second example absolutely has happened with almost every inference provider.
The open weight local models also have zero marginal cost besides electricity once the hardware is present, unlike PAYG API models, which create financial lock-in and dependency that is in direct contrast with the financial interests of the customers. You can argue about the amortized costs of hardware, but that's a decision for the customer to make using their specific and personal financial and capex / hardware information that you don't have at the end of the day.
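The amortization argument above is simple arithmetic that each customer can run with their own numbers. A rough sketch, assuming hardware is a one-off purchase and that API spend and electricity are flat monthly figures; the function name and all values below are hypothetical placeholders, not real prices.

```python
def breakeven_months(hardware_cost, monthly_api_spend, monthly_electricity_cost):
    """Months until local hardware pays for itself versus pay-as-you-go API use.

    Returns None when running locally does not save money each month,
    i.e. electricity costs at least as much as the API bill it replaces.
    """
    monthly_savings = monthly_api_spend - monthly_electricity_cost
    if monthly_savings <= 0:
        return None
    return hardware_cost / monthly_savings


# Hypothetical inputs: substitute your own capex, API bill, and power cost.
example = breakeven_months(
    hardware_cost=2400.0,
    monthly_api_spend=250.0,
    monthly_electricity_cost=50.0,
)
```

This ignores resale value, depreciation, and differences in output quality, which is exactly why the decision belongs to whoever has the real figures.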
Further, the gap between frontier open weight models and frontier proprietary models has been rapidly shrinking and continues to. See Kimi K2.5, Xiaomi MiMo v2, GLM 4.7, etc. Yes, Opus 4.5, Gemini 3 Pro, GPT-5.2-xhigh are remarkably good models and may beat these at the margin, but most work done via LLMs does not need the absolute best model; many people will opt for a model that gets 95% of the output quality of the absolute frontier model when it can be had for 1/20th the cost (or less).
That's what convinced me they are ready to do real work. Are they going to replace Claude Code? Not currently. But it is insane to me that such a small model can follow those explicit directions and consistently perform that workflow.
During that experimentation, even when I didn't spell out the SQL explicitly, it was able to craft the queries on its own from just a text description, and it has no issue navigating the CLI and file system doing basic day-to-day things.
I'm sure there are a lot of people doing "adult" things, but my interest is sparked because they are finally at the level where they can be a tool in a homelab, and LLM usage limits are no longer subsidized like they used to be. Not to mention I am really disillusioned with big tech having my data, or with exposing a tool that makes API calls to them and can then take actions on my system.
I'll still keep using Claude Code for day-to-day coding. But for small system-based tasks I plan on moving to local LLMs. Their capabilities have inspired me to write my own agentic framework to see what workflows can be put together just for management and automation of day-to-day tasks. Ideally it would be nice to just chat with an LLM and tell it to add an appointment, or a call at x time, or to make sure I do it that day, and have it read my schedule, remind me at a chill time of my day to make the call, and then check that I followed through. I also plan on seeing if I can set it up to remind me and help me practice mindfulness and the general stress management I should do. Sure, a simple reminder might work, but as someone with ADHD who easily forgets reminders as soon as they pop up if I can't get to them right away, being pestered by an agent that wakes up and engages with me seems like it might be an interesting workflow.
And then there's the hacker aspect: now that they are capable, I really want to mess around with persistent knowledge in databases and making them intercommunicate and work together. I might even give them access to rewrite themselves and access the application at runtime with a Lisp. To me, local LLMs have gotten to the point where they are fun and not annoying. I can run a model that is better than ChatGPT 3.5 for the most part; its knowledge is more distilled and narrower, but for what it does understand, its correctness is much better.
But then I decided I'm just a chemical reaction and a product of my environment, so I gave chatGPT all my dirt anyway.
But before, I cared about my privacy.
To your point though, if the successors to Strix Halo, Serpent Lake (x86 intel CPU + Nvidia iGPU) and Medusa Halo (x86 AMD CPU + AMD iGPU) come in at a similar price point, I'll probably go with Serpent Lake, given the specs are otherwise similar (both are looking at 384-bit unified memory bus to LPDDR6 with 256GB unified memory options). CUDA is better than ROCm, no argument there.
That said, this has nothing to do with the (now resolved) issue I was experiencing with LM Studio not respecting existing Developer Mode settings with this latest update. There are good reasons to want to switch between different back-ends (e.g. debugging whether early model release issues, like those we saw with GLM-4.7-Flash, are specific to Vulkan - some of them were in that specific example). Bugs like that do exist, but I've had even fewer stability issues on Vulkan than I've had on CUDA on my 4080.
although, as an amd user, he should know that both vulkan and rocm backends have equal propensity to crap the bed...
"looks like a toy" has very little to do with its use anyway.
On your inference machine:
you@yourbox:~/Downloads/llama.cpp/bin$ ./llama-server -m <path/to/your/model.gguf> --alias <your-alias> --jinja --ctx-size 32768 --host 0.0.0.0 --port 8080 -fa on
Obviously, feel free to change your port, context size, flash attention, other params, etc.
Then, on the system you're running Claude Code on:
export ANTHROPIC_BASE_URL=http://<ip-of-your-inference-system>:<port>
export ANTHROPIC_AUTH_TOKEN="whatever"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
claude --model <your-alias> [optionally: --system-prompt "your system prompt here"]
Note that the auth token can be whatever value you want, but it does need to be set, otherwise a fresh CC install will still prompt you to login / auth with Anthropic or Vertex/Azure/whatever.
but it's a bit too little too late. people running this probably can already setup llama.cpp pretty easily.
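For the llama-server + Claude Code setup described above, it can help to confirm the server is reachable before launching `claude`. A small sketch under stated assumptions: the helper names and the example IP are hypothetical, the environment variables are the three from the instructions, and the check relies on llama-server's `/health` endpoint.

```python
import urllib.request


def claude_code_env(host, port, token="whatever"):
    """Build the environment variables from the instructions above."""
    return {
        "ANTHROPIC_BASE_URL": f"http://{host}:{port}",
        # Any non-empty value works, but it must be set.
        "ANTHROPIC_AUTH_TOKEN": token,
        "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    }


def server_is_up(base_url, timeout=3.0):
    """Return True if the server answers GET /health with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and non-2xx HTTP errors.
        return False


# Hypothetical address of the inference machine; replace with your own.
env = claude_code_env("192.168.1.50", 8080)
```

Calling `server_is_up(env["ANTHROPIC_BASE_URL"])` on the client machine should return True once the model has finished loading; the `env` dict can then be merged into `os.environ` or passed to `subprocess.run` when starting `claude`.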
lmstudio also has some overhead like ollama; llama.cpp or mlx alone are always faster.
https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1...
jiqiren•2h ago
observationist•1h ago
Thanks for the updates!
nubg•23m ago