
Ask HN: What's Your Useful Local LLM Stack?

67•Olshansky•23h ago
What I’m asking HN:

What does your actually useful local LLM stack look like?

I’m looking for something that provides you with real value — not just a sexy demo.

---

After a recent internet outage, I realized I need a local LLM setup as a backup — not just for experimentation and fun.

My daily (remote) LLM stack:

  - Claude Max ($100/mo): My go-to for pair programming. Heavy user of both the Claude web and desktop clients.

  - Windsurf Pro ($15/mo): Love the multi-line autocomplete and how it uses clipboard/context awareness.

  - ChatGPT Plus ($20/mo): My rubber duck, editor, and ideation partner. I use it for everything except code.

Here’s what I’ve cobbled together for my local stack so far:

Tools

  - Ollama: for running models locally

  - Aider: Claude-code-style CLI interface

  - VSCode w/ continue.dev extension: local chat & autocomplete

Models

  - Chat: llama3.1:latest

  - Autocomplete: Qwen2.5 Coder 1.5B

  - Coding/Editing: deepseek-coder-v2:16b

Things I’m not worried about:

  - CPU/Memory (running on an M1 MacBook)

  - Cost (within reason)

  - Data privacy / being trained on (not trying to start a philosophical debate here)

I am worried about:

  - Actual usefulness (i.e. “vibes”)

  - Ease of use (tools that fit with my muscle memory)

  - Correctness (not benchmarks)

  - Latency & speed
Right now: I’ve got it working. I could make a slick demo. But it’s not actually useful yet.
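For concreteness, a minimal sketch of how the pieces above wire together (model tags are the ones listed; the endpoint is Ollama's default, and the ollama_chat/ prefix is Aider's naming for Ollama-served models):

  # pull the chat, autocomplete, and coding/editing models
  ollama pull llama3.1:latest
  ollama pull qwen2.5-coder:1.5b
  ollama pull deepseek-coder-v2:16b

  # point Aider at the local Ollama server and start a session
  export OLLAMA_API_BASE=http://127.0.0.1:11434
  aider --model ollama_chat/deepseek-coder-v2:16b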

---

Who I am

  - CTO of a small startup (5 amazing engineers)

  - 20 years of coding (since I was 13)

  - Ex-big tech

Comments

sshine•21h ago
I just use Claude Code ($20/mo.)

Sometimes with Vim, sometimes with VSCode.

Often just with a terminal for testing the stuff being made.

orthecreedence•21h ago
What plugins/integrations are you using in vim?

sshine•10h ago

  :syn on
  :set tabstop=4
  :set shiftwidth=4
  :set expandtab
Nothing AI-related.

quxbar•21h ago
IMO you're better off investing in tooling that works with or without LLMs:

- extremely clean, succinct code
- autogenerated interfaces from an OpenAPI spec
- exhaustive e2e testing

Once that is set up, you can treat your agents like (sleep-deprived) junior devs.
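To make the second point concrete: typed interfaces can be generated straight from a spec. A sketch using the openapi-typescript tool (file paths are placeholders):

  # generate TypeScript types for every endpoint in the spec
  npx openapi-typescript ./api/openapi.yaml -o ./src/api/schema.d.ts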

mhogers•21h ago
Autogenerated interfaces from an OpenAPI spec are so key: agents are extremely good at creating React code based on these interfaces (plus TypeScript, tests, lints, etc. for extra feedback loops).

codybontecou•21h ago
It seems like you have a decent local stack in place. Unfortunately these systems feel leagues behind Claude Code and the current SOTA agentic coding. But they're great for simple reference lookups, like syntax.

Where I've found the most success with local models is with image generation, text-to-speech, and text-to-text translations.

bix6•21h ago
Following, as I haven’t found a solution. To me the local models feel outdated, and the lack of internet lookup causes issues.

alexhans•20h ago
Unless I'm misremembering, aider with Playwright [1] works with local models, so you can scrape the web.

Depending on your hardware you could do something like:

aider --model "ollama_chat/deepseek-r1:14b" --editor-model "ollama_chat/qwen2.5-coder:14b"

[1] - https://aider.chat/docs/install/optional.html#enable-playwri...

ashwinsundar•21h ago
I just go outside when my internet is down for 15 minutes a year. Or tether to my cell phone plan if the need is urgent.

I don't see the point of a local AI stack, outside of privacy or some ethical concerns (which a local stack doesn't solve anyway imo). I also *only* have 24GB of RAM on my laptop, which it sounds like isn't enough to run any of the best models. Am I missing something by not upgrading and running a high-performance LLM on my machine?

filchermcurr•21h ago
I would say cost is a factor. Maybe not for OP, but many people aren't able to spend $135 a month on AI services.

ashwinsundar•21h ago
Does the cost of a new computer not get factored in? I think I would need to spend $2000+ to run a decent model locally, and even then I can only run open source models.

Not to mention, running a giant model locally for hours a day is sure to shorten the lifespan of the machine…

dpoloncsak•20h ago
$2000 for a new machine is only a little over a year in AI costs for OP.

lm28469•4h ago
Electricity isn't free, and these things are basically continuously-on bread toasters.

haiku2077•20h ago
The computer is a general purpose tool, though. You can play games, edit video and images, and self-host a movie/TV collection with real time transcoding with the same hardware. Many people have powerful PCs for playing games and running professional creative software already.

There's no reason running a model would shorten a machine's lifespan. PSUs, CPUs, motherboards, GPUs and RAM will all be long obsolete before they wear out even under full load. At worst you might have to swap thermal paste/pads a couple of years sooner. (A tube of paste is like, ten bucks.)

outworlder•19h ago
> Not to mention, running a giant model locally for hours a day is sure to shorten the lifespan of the machine…

That is not a thing. Unless there's something wrong (badly managed thermals, an undersized PSU at the limit of its capacity, dusty unfiltered air clogging fans, aggressive overclocking), that's what your computer is built for.

Sure, over a couple of decades there's more electromigration than would otherwise have happened at idle temps. But that's pretty much it.

> I think I would need to spend $2000+ to run a decent model locally

Not really. Repurpose second hand parts and you can do it for 1/4 of that cost. It can also be a server and do other things when you aren't running models.

FuriouslyAdrift•21h ago
I use Reasoner v1 (based on Qwen 2.5-Coder 7B) running locally for programming help/weird ideas/etc. $0

shock•20h ago
There were many hits when I searched for "Reasoner v1 (based on Qwen 2.5-Coder 7B)". Do you have a link to the one you are using?

FuriouslyAdrift•18h ago
Nomic GPT4All https://www.nomic.ai/blog/posts/gpt4all-scaling-test-time-co...

https://github.com/nomic-ai/gpt4all/releases

throwawayffffas•21h ago
What I have setup:

- Ollama: for running llm models

- OpenWebUI: For the chat experience https://docs.openwebui.com/

- ComfyUI: For Stable Diffusion

What I use:

Mostly ComfyUI, and occasionally the LLMs through OpenWebUI.

I have been meaning to try Aider, but mostly I use Claude, at great expense I might add.

Correctness is hit and miss.

Cost is much lower, and latency is better or at least on par with cloud models, at least for the serial use case.

Caveat: in my case, local means running on a server with GPUs on my LAN.
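For anyone reproducing this, a minimal sketch of standing up the OpenWebUI piece against an existing Ollama server (flags follow the project's README; the LAN address is a placeholder):

  docker run -d -p 3000:8080 \
    -e OLLAMA_BASE_URL=http://192.168.1.50:11434 \
    -v open-webui:/app/backend/data \
    --name open-webui ghcr.io/open-webui/open-webui:main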

alkh•21h ago
I personally found Qwen2.5 Coder 7B to be on par with deepseek-coder-v2:16b (but it consumes less RAM at inference and is faster), so that's what I am using locally. I actually created a custom model called "oneliner" that uses Qwen2.5 Coder 7B as a base and this system prompt:

SYSTEM """ You are a professional coder. You goal is to reply to user's questions in a consise and clear way. Your reply must include only code orcommands , so that the user could easily copy and paste them.

Follow these guidelines for python: 1) NEVER recommend using "pip install" directly, always recommend "python3 -m pip install" 2) The following are pypi modules: ruff, pylint, black, autopep8, etc. 3) If the error is module not found, recommend installing the module using "python3 -m pip install" command. 4) If activate is not available create an environment using "python3 -m venv .venv". """

I specifically use it for asking quick questions in the terminal that I can copy & paste straight away (for example, about git). For heavy lifting I am using ChatGPT Plus (my own) + GitHub Copilot (provided by my company) + Gemini (provided by my company as well).
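A minimal sketch of how such a custom model is assembled, assuming Ollama's Modelfile format and a qwen2.5-coder:7b base tag (system prompt abbreviated; full text above):

  # write the Modelfile
  cat > Modelfile <<'EOF'
  FROM qwen2.5-coder:7b
  SYSTEM """
  You are a professional coder. Your goal is to reply to the user's questions in a concise and clear way. Your reply must include only code or commands, so that the user can easily copy and paste them.
  """
  EOF

  # register and use the custom model
  ollama create oneliner -f Modelfile
  ollama run oneliner "how do I undo the last git commit but keep the changes?"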

Can someone explain how one can set up autocomplete via ollama? That's something I would be interested to try.
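For what it's worth, continue.dev has supported pointing its tab-autocomplete at an Ollama model via its config file (~/.continue/config.json in older releases); a sketch, assuming that JSON schema, which may have changed since:

  {
    "tabAutocompleteModel": {
      "title": "Qwen2.5 Coder 1.5B",
      "provider": "ollama",
      "model": "qwen2.5-coder:1.5b"
    }
  }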

CamperBob2•20h ago
> NEVER recommend using "pip install" directly, always recommend "python3 -m pip install"

Just out of curiosity, what's the difference?

Seems like all the cool kids are using uv.

th0ma5•20h ago
You mean to say that there is a lot of hype for uv because it is nice and quick, but it also gives junior people an easy rhetorical win in any discussion about Python packaging right now, so obviously it's going to be very popular even if it doesn't work for everyone.

The difference is essentially to try to decouple the environment from the runtime.

alkh•20h ago
I only recently switched to uv and previously used pyenv, so this was more relevant to me before. There is a case where pip might not be pointing to the right Python version, while `python3 -m pip` ensures you use the same one as your environment. For me it is mostly a habit :)

jdthedisciple•20h ago
uv? guess I'm old school.

pip install it is for me

kh_hk•5h ago
There's nothing old school / cool kids about uv and pip. uv is a pip/venv/... interface: if you know how to use pip and venv, you know how to use uv. I use it as a useful toolchain to circumvent missing project/setup.py/requirements shenanigans.
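A sketch of that interface parity (commands from uv's pip-compatible CLI):

  # the familiar venv + pip workflow, uv-flavored
  uv venv .venv
  uv pip install ruff
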
dent9•12h ago
If you ever find yourself arguing about the best Python package manager then you've already lost. Just use a real language with real library management. I dropped Python for Go and haven't looked back. There are plenty of other alternatives. Python is such a waste of time.

trylist•3h ago
Said unironically in a discussion about local LLMs and AI models.

jbotz•7h ago
When you do `python -m pip install` you're going to get a version of pip that has the same idea of what its environment looks like as the python executable, which is what you want, and which isn't guaranteed with the `pip` executable in your path.
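A quick way to check whether the two agree (output lines are illustrative):

  pip --version             # e.g. pip 24.0 from ... (python 3.9)
  python3 -m pip --version  # e.g. pip 24.0 from ... (python 3.12)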

As an aside, I disagree with the `python3` part... the `python3` name is a crutch that it's long past time to discard; if in 2025 just typing `python` gives you a python 2.x executable your workstation needs some serious updating and/or clean-up, and the sooner you find that out, the better.

instagib•21h ago
It looks like continue.dev has a RAG implementation, but for other file types (PDF, Word, and other languages) you need something else?

I’ve been going through some of the Neovim plugins for local LLM support.

clvx•21h ago
On a related subject, what’s the best hardware to run local LLMs for this use case? Assuming a budget of no more than $2.5K.

And is there an open source implementation of an agentic workflow (search tools and others) to use with local LLMs?

seanmcdirmid•21h ago
I got an M3 Max (the higher-end one) MacBook Pro with 64GB of RAM a while back for $3k; it might be cheaper now that the M3 Ultra is out.

haiku2077•20h ago
I'm using Zed which supports Ollama on my M4 Macs.

https://zed.dev/blog/fastest-ai-code-editor

prettyblocks•20h ago
You can build a pretty good PC with a used 3090 for that budget. It will outperform anything else in terms of speed. Otherwise, you can get something like an M4 Pro Mac with 48GB of RAM.

apparent•14h ago
I've wondered about this also. I have an MBA and like that it's lightweight and relatively cheap. I could buy an MBP and max out the RAM, but I think getting a Mac mini with lots of RAM could actually make more sense. Has anyone set up something like this to make it available to their laptop/iPhone/etc.?

Seems like there would be cost advantages and always-online advantages. And the risk of a desktop computer getting damaged/stolen is much lower than for laptops.

dent9•12h ago
You can get used RTX 3090s for $750-800 each. Pro tip: look for 2.5-slot-sized models like the EVGA XC3 or the older blower models. Then you can get two for $1600, fit them in a full-size case, add 128GB of DDR5 for $300, some Ryzen CPU like the 9900X, and a mobo, case, and PSU to fill up the rest of the budget. If you want to skimp you can drop one of the GPUs (until you're sure you need 48GB of VRAM) and some of the RAM, but you really don't save that much. Just make sure you get a case that can fit multiple full-size GPUs and a mobo that can support them as well. The slot configurations are pretty bad on the AM5 generation for multi-GPU. You'll probably end up with a mobo such as the Asus ProArt.

Also, none of this is worth the money, because it's simply not possible to run the same kinds of models you pay for online on a standard home system. Things like ChatGPT 4o use more VRAM than you'll ever be able to scrounge up unless your budget is closer to $10,000-25,000+. Think multiple RTX A6000 cards or similar. So ultimately you're better off just paying for the online hosted services.

beefnugs•5h ago
I think this proves one of the suck points of AI: there are clearly certain things that the smaller models should be fine at... but there don't seem to be frameworks or anything that constantly analyze, simulate, and evaluate what you could be doing with smaller and cheaper models.

Of course, the economics are completely at odds with any real engineering: nobody wants you to use smaller local models, and nobody wants you to consider cost/efficiency savings.

timr•21h ago
I use Copilot, with the occasional free query to the other services. During coding, I mostly use Claude Sonnet 3.7 or 4 in agent mode, but Gemini 2.5 Pro is a close second. ChatGPT 4o is useless except for Q&A. I see no value in paying more -- the utility rapidly diminishes, because at this point the UI surrounding the models is far less important than the models themselves, which in turn are generally less important than the size of their context windows. Even Claude is only marginally better than Gemini (at coding), and they all suck to the point that I wouldn't trust any of them without reviewing every line. Far better to just pick a tool, get comfortable with it, and not screw around too much.

I don't understand people who pay hundreds of dollars a month for multiple tools. It feels like audiophiles paying $1000 for a platinum cable connector.

th0ma5•20h ago
For sure; when people don't understand the fundamentals (or, in the case of LLMs, the fundamentals are unknowable), all you have is superstition.

650REDHAIR•18h ago
Why was this flagged?

ttkciar•15h ago
Senior software engineer with 46 years of experience (since I was 7). LLM inference hasn't been too useful for me for writing code, but it has proven very useful for explaining my coworkers' code to me.

Recently I had Gemma3-27B-it explain every Python script and library in a repo with the command:

$ find . -name '*.py' -print -exec sh -c '/home/ttk/bin/g3 "Explain this code in detail:\n\n$(cat "$1")"' _ {} \; | tee explain.txt

There were a few files it couldn't figure out without other files, so I ran a second pass with those, giving it the source files it needed to understand source files that used them. Overall, pretty easy, and highly clarifying.

My shell script for wrapping llama.cpp's llama-cli and Gemma3: http://ciar.org/h/g3

That script references this grammar file which forces llama.cpp to infer only ASCII: http://ciar.org/h/ascii.gbnf

Cost: electricity

I've been meaning to check out Aider and GLM-4, but even if it's all it's cracked up to be, I expect to use it sparingly. Skills which aren't exercised are lost, and I'd like to keep my programming skills sharp.

ensocode•6h ago
Thanks for asking. I’ve been thinking about this lately as well, but I always come back to the idea that paying for online services is worth it for now, since this tech is evolving so quickly that a stable local setup wouldn’t justify the time spent tinkering — it could be outdated by tomorrow.