Ask HN: What Does Your Self-Hosted LLM Stack Look Like in 2025?

17•anditherobot•1d ago

Back when web development was taking off, there was always a go-to stack — something like Postgres + Django + jQuery, or .NET + Bootstrap, SQLITE. Over the years we had proven tech and proven patterns like : MVC, SPA etc...

Now that local LLMs are gaining traction, I’m wondering what the equivalent stack looks like today.

Models, Runtime, hardware and other tools.

That could rival the Claudes, ChatGPTs or Geminis, etc

Thanks

Comments

fazlerocks•1d ago

Running Llama 3.1 70B on 2x4090s with vLLM. Memory is a pain but works decent for most stuff.

Tbh for coding I just use the smaller ones like CodeQwen 7B. way faster and good enough for autocomplete. Only fire up the big model when I actually need it to think.

The annoying part is keeping everything updated, new model drops every week and half don't work with whatever you're already running.

bluejay2387•22h ago

2x 3090's running Ollama and VLLM... Ollama for most stuff and VLLM for the few models that I need to test that don't run on Ollama. Open Web UI as my primary interface. I just moved to Devstral for coding using the Continue plugin in VSCode. I use Qwen 3 32b for creative stuff and Flux Dev for images. Gemma 3 27b for most everything else (slightly less smart than Qwen, but its faster). Mixed Bread for embeddings (though apparently NV-Embed-v2 is better?). Pydantic as my main utility library. This is all for personal stuff. My stack at work is completely different and driven more by our Legal teams than technical decisions.

gabriel_dev•19h ago

Ollama + mac mini 24gb (inference)

runjake•19h ago

Ollama + M3 Max 36GB Mac. Usually with Python + SQLite3.

The models vary depending on the task. DeepSeek distilled has been a favorite for the past several months.

I use various smaller (~3B) models for simpler tasks.

xyc•18h ago

recurse.chat + M2 max Mac

v5v3•5h ago

Ollama on a M1 MacBook pro but will be moving to a Nvidia GPU setup.

More Federal Workers Are Flooding the Job Market, with Worsening Prospects

Analysis of the Spyware That Helped to Compromise a Syrian Army from Within

Someone's fresh Switch 2 arrived with factory firmware on it

What Type of [] Are You?

Jared Isaacman reveals how space agency might have looked under his watch

McSee: Evaluating Rowhammer Attacks and Defenses via DRAM Traffic Analysis

Lessons from a FAT God: An Introspective on the FAT filesystems

Doge Developed Error-Prone AI Tool to "Munch" Veterans Affairs Contracts

Chocolate Quake source port preserving original experience even bugs and quirks

Displaying Korean Text Efficiently

Napster.com Faced ISP Piracy Blockade for "Massive Copyright Violations"

Processing Nebula Images with Open Source Tools vs$350 Software

M&S hackers sent abuse and ransom demand directly to CEO

Xiaohongshu(Rednote) released its dots.llm open source AI model

Crustal to mantle melt storage during the evolution of Hawaiian volcanoes

AI Leaders and Builders Fireside Chat

Natural Intelligence Is Sexy

Show HN: Colorcura – Visualize color palettes inside live UI components

Competitive Coder's Handbook [pdf]

How to (actually) send DTMF on Android without being the default call app

I Let ChatGPT Make All My Architectural Decisions for a Month: The Surprising R

Alpaca's MCP Server for Trading/Quotes

AI isn't coming for your job–it's coming for your company

Three Types of Math Acceleration

BetterAuth vs. NextAuth

Swift and Cute 2D Game Framework: Setting Up a Project with CMake

How to Use Vheer Text to Image Generator: A Beginner's Guide

Show HN: Book to help you build a PostgreSQL-like database server from scratch

Conventional commit generator using local LLMs

MCP Resources Are for Caching