I want everything local – Building my offline AI workspace

https://instavm.io/blog/building-my-offline-ai-workspace

206•mkagenius•2h ago

Comments

shaky•2h ago

This is something that I think about quite a bit and am grateful for this write-up. The amount of friction to get privacy today is astounding.

sneak•1h ago

This writeup has nothing of the sort and is not helpful toward that goal.

frank_nitti•1h ago

I'd assume they are referring to being able to run your own workloads in a home-built system, rather then surrendering that ownership to the tech giants alone

Imustaskforhelp•1h ago

Also you get a sort of complete privacy that the data never leaves your home too whereas at best you would have to trust the AI cloud providers that they are not training or storing that data.

Its just more freedom and privacy in that matter.

doctorpangloss•34m ago

The entire stack involved sends so much telemetry.

noelwelsh•1h ago

It's the hardware more than the software that is the limiting factor at the moment, no? Hardware to run a good LLM locally starts around $2000 (e.g. Strix Halo / AI Max 395) I think a few Strix Halo iterations will make it considerably easier.

colecut•1h ago

This is rapidly improving

https://simonwillison.net/2025/Jul/29/space-invaders/

Imustaskforhelp•1h ago

I hope it improves at such a steady rate! Please lets just hope that there is still room for improvement to packing even more improvements in such LLMS which can help the home labbing community in general.

ramesh31•1h ago

>Hardware to run a good LLM locally starts around $2000 (e.g. Strix Halo / AI Max 395) I think a few Strix Halo iterations will make it considerably easier.

And "good" is still questionable. The thing that makes this stuff useful is when it works instantly like magic. Once you find yourself fiddling around with subpar results at slower speeds, essentially all of the value is gone. Local models have come a long way but there is still nothing even close to Claude levels when it comes to coding. I just tried taking the latest Qwen and GLM models for a spin through OpenRouter with Cline recently and they feel roughly on par with Claude 3.0. Benchmarks are one thing, but reality is a completely different story.

ahmedbaracat•1h ago

Thanks for sharing. Note that the GitHub at the end of the article is not working…

mkagenius•1h ago

Thanks for the heads up. It's fixed now -

Coderunner-UI: https://github.com/instavm/coderunner-ui

Coderunner: https://github.com/instavm/coderunner

navbaker•1h ago

Open Web UI is a great alternative for a chat interface. You can point to an OpenAI API like vLLM or use the native Ollama integration and it has cool features like being able to say something like “generate code for an HTML and JavaScript pong game” and have it display the running code inline with the chat for testing

dmezzetti•1h ago

I built TxtAI with this philosophy in mind: https://github.com/neuml/txtai

pyman•1h ago

Mr Stallman? Richard, is that you?

tcdent•1h ago

I'm constantly tempted by the idealism of this experience, but when you factor in the performance of the models you have access to, and the cost of running them on-demand in a cloud, it's really just a fun hobby instead of a viable strategy to benefit your life.

As the hardware continues to iterate at a rapid pace, anything you pick up second-hand will still deprecate at that pace, making any real investment in hardware unjustifiable.

Coupled with the dramatically inferior performance of the weights you would be running in a local environment, it's just not worth it.

I expect this will change in the future, and am excited to invest in a local inference stack when the weights become available. Until then, you're idling a relatively expensive, rapidly depreciating asset.

braooo•1h ago

Running LLMs at home is a repeat of the mess we make with "run a K8s cluster at home" thinking

You're not OpenAI or Google. Just use pytorch, opencv, etc to build the small models you need.

You don't need Docker even! You can share over a simple code based HTTP router app and pre-shared certs with friends.

You're recreating the patterns required to manage a massive data center in 2-3 computers in your closet. That's insane.

frank_nitti•1h ago

For me, this is essential. On priciple, I won't pay money to be a software engineer.

I never paid for cloud infrastructure out of pocket, but still became the go-to person and achieved lead architecture roles for cloud systems, because learning the FOSS/local tooling "the hard way" put me in a better position to understand what exactly my corporate employers can leverage with the big cash they pay the CSPs.

The same is shaping up in this space. Learning the nuts and bolts of wiring systems together locally with whatever Gen AI workloads it can support, and tinkering with parts of the process, is the only thing that can actually keep me interested and able to excel on this front relative to my peers who just fork out their own money to the fat cats that own billions worth of compute.

I'll continue to support efforts to keep us on the track of engineers still understanding and able to 'own' their technology from the ground up, if only at local tinkering scale

jtbaker•28m ago

Self hosting my own LLM setup in the homelab was what really helped me learn the fundamentals of K8s. If nothing else I'm grateful for that!

Imustaskforhelp•1h ago

So I love linux and would wish to learn devops one day in its entirety to be an expert to actually comment on the whole post but

I feel like they actually used docker for just the isolation part or as a sandbox (technically they didn't use docker but something similar to it for mac (apple containers) ) I don't think that it has anything to do with k8s or scalability or pre shared cert or http router :/

jeremyjh•1h ago

I expect it will never change. In two years if there is a local option as good as GPT-5 there will be a much better cloud option and you'll have the same tradeoffs to make.

c-hendricks•1h ago

Why would AI be one of the few areas where locally-hosted options can't reach "good enough"?

hombre_fatal•1h ago

For some use-cases, like making big complex changes to big complex important code or doing important research, you're pretty much always going to prefer the best model rather than leave intelligence on the table.

For other use-cases, like translations or basic queries, there's a "good enough".

bbarnett•1h ago

I grew up in a time when listening to an mp3 was too computationally expensive and nigh impossible for the average desktop. Now tiny phones can decode high def video realtime due to CPU extensions.

And my phone uses a tiny, tiny amount of power, comparatively, to do so.

CPU extensions and other improvements will make AI a simple, tiny task. Many of the improvements will come from robotics.

pfannkuchen•1h ago

It might change once the companies switch away from lighting VC money on fire mode and switch to profit maximizing mode.

I remember Uber and AirBnB used to seem like unbelievably good deals, for example. That stopped eventually.

jeremyjh•4m ago

This I could see.

duxup•1h ago

Maybe, but my phone has become is a "good enough" computer for most tasks compared to a desktop or my laptop.

Seems plausible the same goes for AI.

kasey_junk•1h ago

I’d be surprised by that outcome. At one point databases were cutting edge tech with each engine leap frogging each other in capability. Still the proprietary db often have features that aren’t matched elsewhere.

But the open db got good enough that you need to justify not using them with specific reasons why.

That seems at least as likely an outcome for models as they continue to improve infinitely into the stars.

victorbjorklund•1h ago

Next two years probably. But at some point we will either hit scales where you really dont need anything better (lets say cloud is 10000 token/s and local is 5000 token/s. Makes no difference for most individual users) or we will hit som wall where ai doesnt get smarter but cost of hardware continues to fall

kvakerok•1h ago

What is even a point of having a self hosted gpt5 equivalent that's not into petabytes of knowledge?

zwnow•50m ago

You know there's a ceiling to all this with the current LLM approaches right? They won't become that much better, its even more likely they will degrade. There are cases of bad actors attacking LLMs by feeding it false information and propaganda. I dont see this changing in the future.

Aurornis•32m ago

There will always be something better on big data center hardware.

However, small models are continuing to improve at the same time that large RAM capacity computing hardware is becoming cheaper. These two will eventually intersect at a point where local performance is good enough and fast enough.

meta_ai_x•1h ago

This is especially true since AI is a large multiplicative factor to your productivity.

If Cloud LLMs have 10 IQ points > local LLM, within a month, you'll notice you'll be struggling behind the dude who just used Cloud LLM.

LocalLlama is for hobbies or your job depends on running locallama.

This is not one-time upfront setup cost vs payoff later tradeoff. It is a tradeoff you are making every query which compounds pretty quickly.

Edit : I expect nothing better than downvotes from this crowd. How HN has fallen on AI will be a case study for the ages

bigyabai•57m ago

> anything you pick up second-hand will still deprecate at that pace

Not really? The people who do local inference most (from what I've seen) are owners of Apple Silicon and Nvidia hardware. Apple Silicon has ~7 years of decent enough LLM support under it's belt, and Nvidia is only now starting to depreciate 11-year-old GPU hardware in drivers.

If you bought a decently powerful inference machine 3 or 5 years ago, it's probably still plugging away with great tok/s. Maybe even faster inference because of MoE architectures or improvements in the backend.

Aurornis•26m ago

> If you bought a decently powerful inference machine 3 or 5 years ago, it's probably still plugging away with great tok/s.

I think this is the difference between people who embrace hobby LLMs and people who don’t:

The token/s output speed on affordable local hardware for large models is not great for me. I already wish the cloud hosted solutions were several times faster. Any time I go to a local model it feels like I’m writing e-mails back and forth to an LLM, not working with it.

And also, the first Apple M1 chip was released less than 5 years ago, not 7.

Uehreka•9m ago

People on HN do a lot of wishful thinking when it comes to the macOS LLM situation. I feel like most of the people touting the Mac’s ability to run LLMs are either impressed that they run at all, are doing fairly simple tasks, or just have a toy model they like to mess around with and it doesn’t matter if it messes up.

And that’s fine! But then people come into the conversation from Claude Code and think there’s a way to run a coding assistant on Mac, saying “sure it won’t be as good as Claude Sonnet, but if it’s even half as good that’ll be fine!”

And then they realize that the heavvvvily quantized models that you can run on a mac (that isn’t a $6000 beast) can’t invoke tools properly, and try to “bridge the gap” by hallucinating tool outputs, and it becomes clear that the models that are small enough to run locally aren’t “20-50% as good as Claude Sonnet”, they’re like toddlers by comparison.

People need to be more clear about what they mean when they say they’re running models locally. If you want to build an image-captioner, fine, go ahead, grab Gemma 7b or something. If you want an assistant you can talk to that will give you advice or help you with arbitrary tasks for work, that’s not something that’s on the menu.

motorest•48m ago

> As the hardware continues to iterate at a rapid pace, anything you pick up second-hand will still deprecate at that pace, making any real investment in hardware unjustifiable.

Can you explain your rationale? It seems that the worst case scenario is that your setup might not be the most performant ever, but it will still work and run models just as it always did.

This sounds like a classical and very basic opex vs capex tradeoff analysis, and these are renowned for showing that on financial terms cloud providers are a preferable option only in a very specific corner case: short-term investment to jump-start infrastructure when you do not know your scaling needs. This is not the case for LLMs.

OP seems to have invested around $600. This is around 3 months worth of an equivalent EC2 instance. Knowing this, can you support your rationale with numbers?

tcdent•19m ago

When considering used hardware you have to take quantization into account; gpt-oss-120b for example is running a very new MXFP4 which will use far more than 80GB to fit into the available fp types on older hardware or Apple silicon.

Open models are trained on modern hardware and will continue to take advantage of cutting edge numeric types, and older hardware will continue to suffer worse performance and larger memory requirements.

motorest•4m ago

You're using a lot of words to say "I believe yesterday's hardware might not run models as as fast as today's hardware."

That's fine. The point is that yesterday's hardware is quite capable of running yesterday's models, and obviously it will also run tomorrow's models.

So the question is cost. Capex vs opex. The fact is that buying your own hardware is proven to be far more cost-effective than paying cloud providers to rent some cycles.

I brought data to the discussion: for the price tag of OP's home lab, you only afford around 3 months worth of an equivalent EC2 instance. What's your counter argument?

Aurornis•33m ago

I think the local LLM scene is very fun and I enjoy following what people do.

However every time I run local models on my MacBook Pro with a ton of RAM, I’m reminded of the gap between local hosted models and the frontier models that I can get for $20/month or nominal price per token from different providers. The difference in speed and quality is massive.

The current local models are very impressive, but they’re still a big step behind the SaaS frontier models. I feel like the benchmark charts don’t capture this gap well, presumably because the models are trained to perform well on those benchmarks.

I already find the frontier models from OpenAI and Anthropic to be slow and frequently error prone, so dropping speed and quality even further isn’t attractive.

I agree that it’s fun as a hobby or for people who can’t or won’t take any privacy risks. For me, I’d rather wait and see what an M5 or M6 MacBook Pro with 128GB of RAM can do before I start trying to put together another dedicated purchase for LLMs.

EVa5I7bHFq9mnYK•3m ago

[delayed]

sneak•1h ago

Halfway through he gives up and uses remote models. The basic premise here is false.

Also, the term “remote code execution” in the beginning is misused. Ironically, remote code execution refers to execution of code locally - by a remote attacker. Claude Code does in fact have that, but I’m not sure if that’s what they’re referring to.

thepoet•1h ago

The blog says more about keeping the user data private. The remote models in the context are operating blind. I am not sure why you are nitpicking, almost nobody reading the blog would take remote code execution in that context.

mark_l_watson•1h ago

That is fairly cool. I was talking about this on X yesterday: another angle however, I use a local web scraper and search engine via meilisearch the main tech web sites I am interested in. For my personal research I use three web search APIs, but there is some latency. Having a big chuck of the web that I am interested in available locally with close to zero latency is nice when running local models, my own MCP services that might need web search, etc.

luke14free•1h ago

you might want to check out what we built -> https://inference.sh supports most major open source/weight models from wan 2.2 video, qwen image, flux, most llms, hunyan 3d etc.. works in a containerized way locally by allowing you to bring your own gpu as an engine (fully free) or allows you to rent remote gpu/pool from a common cloud in case you want to run more complex models. for each model we tried to add quantized/ggufs versions to even wan2.2/qwen image/gemma become possible to execute with as little as 8gb vram gpus. mcp support coming soon in our chat interface so it can access other apps from the ecosystem.

rshemet•1h ago

if you ever end up trying to take this in the mobile direction, consider running on-device AI with Cactus –

https://cactuscompute.com/

Blazing-fast, cross-platform, and supports nearly all recent OS models.

xt00•1h ago

Yea in an ideal world there would be a legal construct around AI agents in the cloud doing something on your behalf that could not be blocked by various stakeholders deciding they don't like the thing you are doing even if totally legal. Things that would be considered fair use, or maybe annoying to certain companies should not be easy for companies to just wholesale block by leveraging business relationships. Barring that, then yea, a local AI setup is the way to go.

sabareesh•1h ago

Here is my rig, running GLM 4.5 Air. Very impressed by this model

https://sabareesh.com/posts/llm-rig/

https://huggingface.co/zai-org/GLM-4.5

mkummer•1h ago

Super cool and well thought out!

I'm working on something similar focused on being able to easily jump between the two (cloud and fully local) using a Bring Your Own [API] Key model – all data/config/settings/prompts are fully stored locally and provider API calls are routed directly (never pass through our servers). Currently using mlc-llm for models & inference fully local in the browser (Qwen3-1.7b has been working great)

[1] https://hypersonic.chat/

Imustaskforhelp•1h ago

I think I still prefer local but I feel like that's because that most AI inference is kinda slow or comparable to local. But I recently tried out cerebras or (I have heard about groq too) and honestly when you try things at 1000 tk/s or similar, your mental model really shifts and becomes quite impatient. Cerebras does say that they don't log your data or anything in general and you would have to trust me to say that I am not sponsored by them (Wish I was tho) Its just that they are kinda nice.

But I still hope that we can someday actually have some meaningful improvements in speed too. Diffusion models seem to be really fast in architecture.

retrocog•1h ago

Its all about context and purpose, isn't it? For certain lightweight uses cases, especially those concerning sensitive user data, a local implementation may make a lot of sense.

kaindume•58m ago

Self hosted and offline AI systems would be great for privacy but the hardware and electricity cost are much too high for most users. I am hoping for a P2P decentralized solution that runs on distributed hardware not controlled by a single corporation.

user3939382•32m ago

I’d settle for homomorphic encryption but that’s a long way off if ever

woadwarrior01•49m ago

> LLMs: Ollama for local models (also private models for now)

Incidentally, I decided to try to Ollama macOS app yesterday, and the first thing it tries to do upon launch is try to connect to some google domain. Not very private.

https://imgur.com/a/7wVHnBA

Aurornis•43m ago

Automatic update checks https://github.com/ollama/ollama/blob/main/docs/faq.md

abtinf•39m ago

Yep, and I’ve noticed the same thing with in vscode with both the cline plugin and the copilot plugin.

I configure them both to use local ollama, block their outbound connections via little snitch, and they just flat out don’t work without the ability to phone home or posthog.

Super disappointing that Cline tries to do so much outbound comms, even after turning off telemetry in the settings.

adsharma•31m ago

https://github.com/adsharma/ask-me-anything

Supports MLX on Apple silicon. Electron app.

There is a CI to build downloadable binaries. Looking to make a v0.1 release.

andylizf•25m ago

This is fantastic work. The focus on a local, sandboxed execution layer is a huge piece of the puzzle for a private AI workspace. The `coderunner` tool looks incredibly useful.

A complementary challenge is the knowledge layer: making the AI aware of your personal data (emails, notes, files) via RAG. As soon as you try this on a large scale, storage becomes a massive bottleneck. A vector database for years of emails can easily exceed 50GB.

(Full disclosure: I'm part of the team at Berkeley that tackled this). We built LEANN, a vector index that cuts storage by ~97% by not storing the embeddings at all. It makes indexing your entire digital life locally actually feasible.

Combining a local execution engine like this with a hyper-efficient knowledge index like LEANN feels like the real path to a true "local Jarvis."

Code: https://github.com/yichuan-w/LEANN Paper: https://arxiv.org/abs/2405.08051

sebmellen•15m ago

I know next to nothing about embeddings.

Are there projects that implement this same “pruned graph” approach for cloud embeddings?

doctoboggan•9m ago

> A vector database for years of emails can easily exceed 50GB.

In 2025 I would consider this a relatively meager requirement.

com2kid•18m ago

> Even with help from the "world's best" LLMs, things didn't go quite as smoothly as we had expected. They hallucinated steps, missed platform-specific quirks, and often left us worse off.

This shows how little native app training data is even available.

People rarely write blog posts about designing native apps, long winded medium tutorials don't exist, heck even the number of open source projects for native desktop apps is a small percentage compared to mobile and web apps.

Historically Microsoft paid some of the best technical writers in the world to write amazing books on how to code for Windows (see: Charles Petzold), but now days that entire industry is almost dead.

These types of holes in training data are going to be a larger and larger problem.

Although this is just representative of software engineering in general - few people want to write native desktop apps because it is a career dead end. Back in the 90s knowing how to write Windows desktop apps was great, it was pretty much a promised middle class lifestyle with a pretty large barrier to entry (C/C++ programming was hard, the Windows APIs were not easy to learn, even though MS dumped tons of money into training programs), but things have changed a lot. Outside of the OS vendors themselves (Microsoft, Apple) and a few legacy app teams (Adobe, Autodesk, etc), very few jobs exist for writing desktop apps.

gen2brain•13m ago

People are talking about AI everywhere, but where can we find documentation, examples, and proof of how it works? It all ends with chat. Which chat is better and cheaper? This local story is just using some publicly available model, but downloaded? When is this going to stop?

bling1•3m ago

On a similar vibe, we developed app.czero.cc to run an LLM inside your chrome browser on your machine hardware without installation (you do have to download the models). Hard to run big models, but it doesnt get more local than that without having to install anything.

Open source lowcode builder – looks awesome for business needs

Simon Willison on the Talking Postgres podcast: AI for data engineers"

Trump Wants UCLA to Pay $1B to Restore Its Research Funding

ChatGPT Is Still a Bullshit Machine

A Guide Dog for the Face-Blind

"Magic" Cleaning Sponges Found to Release Trillions of Microplastic Fibers

Study finds flavor bans cut youth vaping but slow decline in cigarette smoking

Slack Threads are utter dog shit, so I made a quote reply extension with gpt5

Ask HN: Have we reached the acceptance phase of generative AI?

Is Gen X Nostalgia Just Trauma-Bonding?

Buttercup is now open-source

DHH: The Framework Desktop is a beast

Why good finance gets ignored

Thinking Is Becoming a Luxury Good

Putin Tells U.S. He'll Halt War in Exchange for Eastern Ukraine

Chinese biz using AI to hit US politicians, influencers with propaganda

OpenAI will bring back ChatGPT-4o for plus users

A Deeper Dive into Apache Iceberg V3

Glass bottles found to contain more microplastics than plastic bottles

Leann – Claude Code–compatible semantic search with 97% smaller vector index

Python backoff repository was archived

Are you in a mid-career to senior job? Don't fear AI

Jim Lovell Has Died

New signs found of giant gas planet in 'Earth's neighbourhood'

A Vulkan on Metal Mesa 3D Graphics Driver

Apollo 13 astronaut Jim Lovell dies

Show HN: Text Cleanse – Free Online Text Cleaner and Case Converter

First Ethernet-Based AI Memory Fabric System to Increase LLM Efficiency – News

Show HN: GPT OSS: How to run and fine-tune

PayPal to let U.S. businesses accept payment in more than 100 cryptocurrencies

I want everything local – Building my offline AI workspace

Comments

Open source lowcode builder – looks awesome for business needs

Simon Willison on the Talking Postgres podcast: AI for data engineers"

Trump Wants UCLA to Pay $1B to Restore Its Research Funding

ChatGPT Is Still a Bullshit Machine

A Guide Dog for the Face-Blind

"Magic" Cleaning Sponges Found to Release Trillions of Microplastic Fibers

Study finds flavor bans cut youth vaping but slow decline in cigarette smoking

Slack Threads are utter dog shit, so I made a quote reply extension with gpt5

Ask HN: Have we reached the acceptance phase of generative AI?

Is Gen X Nostalgia Just Trauma-Bonding?

Buttercup is now open-source

DHH: The Framework Desktop is a beast

Why good finance gets ignored

Thinking Is Becoming a Luxury Good

Putin Tells U.S. He'll Halt War in Exchange for Eastern Ukraine

Chinese biz using AI to hit US politicians, influencers with propaganda

OpenAI will bring back ChatGPT-4o for plus users

A Deeper Dive into Apache Iceberg V3

Glass bottles found to contain more microplastics than plastic bottles

Leann – Claude Code–compatible semantic search with 97% smaller vector index

Python backoff repository was archived

Are you in a mid-career to senior job? Don't fear AI

Jim Lovell Has Died

New signs found of giant gas planet in 'Earth's neighbourhood'

A Vulkan on Metal Mesa 3D Graphics Driver

Apollo 13 astronaut Jim Lovell dies

Show HN: Text Cleanse – Free Online Text Cleaner and Case Converter

First Ethernet-Based AI Memory Fabric System to Increase LLM Efficiency – News

Show HN: GPT OSS: How to run and fine-tune

PayPal to let U.S. businesses accept payment in more than 100 cryptocurrencies