Yet outside of this community, local LLMs still don’t seem mainstream. My hunch: *great UX and durable apps are still thin on the ground.*
If you are using local models, I’d love to learn from your setup and workflows. Please be specific so others can calibrate:
Model(s) & size: exact name/version, and quantization (e.g., Q4_K_M).
Runtime/tooling: e.g., Ollama, LM Studio, etc.
Hardware: CPU/GPU details (VRAM/RAM), OS. If it's a laptop, edge device, or home server, mention that.
Workflows where local wins: privacy/offline, data security, coding, bulk data extraction, RAG over your files, agents/tools, screen capture processing—what's actually sticking for you?
Pain points: quality on complex reasoning, context management, tool reliability, long‑form coherence, energy/thermals, memory, Windows/Mac/Linux quirks.
Favorite app today: the one you actually open daily (and why).
Wishlist: the app you wish existed.
Gotchas/tips: config flags, quant choices, prompt patterns, or evaluation snippets that made a real difference.
If you’re not using local models yet, what’s the blocker—setup friction, quality, missing integrations, battery/thermals, or just “cloud is easier”? Links are welcome, but what helps most is concrete numbers and anecdotes from real use.
A simple reply template (optional):
```
Model(s):
Runtime/tooling:
Hardware:
Use cases that stick:
Pain points:
Favorite app:
Wishlist:
```
Also curious how people think about privacy and security in practice. Thanks!
incomingpain•1d ago
Cloud LLMs can run 1 trillion parameters and have all of Python knowledge in a transparent RAG pipeline over 100 Gbit or faster links. Of course they'll be the best on the block.
But the new GPT-OSS coding benchmarks come in only barely behind Grok 4 or GPT-5 with high reasoning.
>Model(s) & size: exact name/version, and quantization (e.g., Q4_K_M).
My most reliable setup is Devstral + OpenHands: Unsloth Q6_K_XL, 85,000 context, flash attention, K-cache and V-cache quantized at Q8.
Second most reliable: GPT-OSS-20B + opencode. Default MXFP4; I can only load 31,000 context or it fails (still plenty, but hoping this bug gets fixed), and you can't use flash attention or K/V-cache quantization or it becomes dumb as rocks. This Harmony stuff is annoying.
Still preliminary (just got it working today), but testing looks really good: Qwen3-30B-A3B-Thinking-2507 + Roo Code or Qwen Code, 80,000 context, Unsloth Q4_K_XL, flash attention, K-cache and V-cache quantized at Q8.
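In case it helps anyone replicating this outside LM Studio: a rough sketch of what those settings map to as llama.cpp llama-server flags, launched from Python. LM Studio sets all of this in its GUI, flag spellings differ between llama.cpp builds, and the GGUF filename and port below are just placeholders.

```python
# Rough sketch: the context / flash-attention / KV-cache settings above,
# expressed as llama.cpp llama-server flags. Flag spellings vary by
# llama.cpp version -- check `llama-server --help` on your build.
import subprocess

cmd = [
    "llama-server",
    "-m", "Devstral-Small-UD-Q6_K_XL.gguf",  # placeholder path to the GGUF
    "--ctx-size", "85000",                   # 85,000-token context window
    "-fa",                                   # flash attention (newer builds: --flash-attn on)
    "--cache-type-k", "q8_0",                # quantize K cache to Q8
    "--cache-type-v", "q8_0",                # quantize V cache to Q8
    "--port", "8080",                        # placeholder port
]
subprocess.run(cmd, check=True)
```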
>Runtime/tooling: e.g., Ollama, LM studio, etc.
LM Studio. I need Vulkan for my setup; ROCm is just a pain in the ass. They need to support way more Linux distros.
24 GB VRAM.
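For anyone who wants to script against this setup, a minimal sketch of hitting the LM Studio local server from Python, assuming the OpenAI-compatible server is enabled on the default port 1234; the model identifier is whatever LM Studio shows for the loaded model.

```python
# Minimal sketch: talking to LM Studio's local OpenAI-compatible server.
# Assumes the local server is running on the default port (1234) and a
# model (e.g. the Devstral quant above) is already loaded.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="devstral-small",  # placeholder: use the identifier LM Studio shows
    messages=[{"role": "user", "content": "Write a Python function that parses a CSV header."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```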
briansun•22h ago