frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

Shimmy v1.7.0: Running 42B Moe Models on Consumer GPUs with 99.9% VRAM Reduction

https://github.com/Michael-A-Kuykendall/shimmy/releases/tag/v1.7.0
3•MKuykendall•4mo ago

Comments

MKuykendall•4mo ago
I just released Shimmy v1.7.0 with MoE (Mixture of Experts) CPU offloading support, and the results are pretty exciting for anyone who's hit GPU memory walls. What this solves If you've tried running large language models locally, you know the pain: a 42B parameter model typically needs 80GB+ of VRAM, putting it out of reach for most developers. Even "smaller" 20B models often require 40GB+. The breakthrough MoE CPU offloading intelligently moves expert layers to CPU while keeping active computation on GPU. In practice: Phi-3.5-MoE 42B: Runs on 8GB consumer GPUs (was impossible before) GPT-OSS 20B: 71.5% VRAM reduction (15GB → 4.3GB, measured) DeepSeek-MoE 16B: Down to 800MB VRAM with Q2 quantization The tradeoff is 2-7x slower inference, but you can actually run these models instead of not running them at all. Technical implementation Built on enhanced llama.cpp bindings with new with_cpu_moe() and with_n_cpu_moe(n) methods Two CLI flags: --cpu-moe (automatic) and --n-cpu-moe N (manual control) Cross-platform: Windows MSVC CUDA, macOS Metal, Linux x86_64/ARM64 Still sub-5MB binary with zero Python dependencies Ready-to-use models I've uploaded 9 quantized models to HuggingFace specifically optimized for this: Phi-3.5-MoE variants (Q8.0, Q4 K-M, Q2 K) DeepSeek-MoE variants GPT-OSS 20B baseline Getting started # Install cargo install shimmy

# Download a model huggingface-cli download MikeKuykendall/phi-3.5-moe-q4-k-m-cpu-offload-gguf

# Run with MoE offloading ./shimmy serve --cpu-moe --model-path phi-3.5-moe-q4-k-m.gguf Standard OpenAI-compatible API, so existing code works unchanged. Why this matters This democratizes access to state-of-the-art models. Instead of needing a $10,000 GPU or cloud spending, you can run expert models on gaming laptops or modest server hardware. It's not just about making models "work" - it's about sustainable AI deployment where organizations can experiment with cutting-edge architectures without massive infrastructure investments. The technique itself isn't novel (llama.cpp had MoE support), but the Rust bindings, production packaging, and curated model collection make it accessible to developers who just want to run large models locally. Release: https://github.com/Michael-A-Kuykendall/shimmy/releases/tag/... Models: https://huggingface.co/MikeKuykendall Happy to answer questions about the implementation or performance characteristics.

Poland to probe possible links between Epstein and Russia

https://www.reuters.com/world/poland-probe-possible-links-between-epstein-russia-pm-tusk-says-202...
1•doener•3m ago•0 comments

Effectiveness of AI detection tools in identifying AI-generated articles

https://www.ijoms.com/article/S0901-5027(26)00025-1/fulltext
1•XzetaU8•9m ago•0 comments

Warsaw Circle

https://wildtopology.com/bestiary/warsaw-circle/
1•hackandthink•10m ago•0 comments

Reverse Engineering Raiders of the Lost Ark for the Atari 2600

https://github.com/joshuanwalker/Raiders2600
1•pacod•15m ago•0 comments

The AI4Agile Practitioners Report 2026

https://age-of-product.com/ai4agile-practitioners-report-2026/
1•swolpers•16m ago•0 comments

Digital Independence Day

https://di.day/
1•pabs3•19m ago•0 comments

What a bot hacking attempt looks like: SQL injections galore

https://old.reddit.com/r/vibecoding/comments/1qz3a7y/what_a_bot_hacking_attempt_looks_like_i_set_up/
1•cryptoz•20m ago•0 comments

Show HN: FlashMesh – An encrypted file mesh across Google Drive and Dropbox

https://flashmesh.netlify.app
1•Elevanix•22m ago•0 comments

Show HN: AgentLens – Open-source observability and audit trail for AI agents

https://github.com/amitpaz1/agentlens
1•amit_paz•22m ago•0 comments

Show HN: ShipClaw – Deploy OpenClaw to the Cloud in One Click

https://shipclaw.app
1•sunpy•25m ago•0 comments

Unlock the Power of Real-Time Google Trends Visit: Www.daily-Trending.org

https://daily-trending.org
1•azamsayeedit•27m ago•1 comments

Explanation of British Class System

https://www.youtube.com/watch?v=Ob1zWfnXI70
1•lifeisstillgood•27m ago•0 comments

Show HN: Jwtpeek – minimal, user-friendly JWT inspector in Go

https://github.com/alesr/jwtpeek
1•alesrdev•31m ago•0 comments

Willow – Protocols for an uncertain future [video]

https://fosdem.org/2026/schedule/event/CVGZAV-willow/
1•todsacerdoti•32m ago•0 comments

Feedback on a client-side, privacy-first PDF editor I built

https://pdffreeeditor.com/
1•Maaz-Sohail•36m ago•0 comments

Clay Christensen's Milkshake Marketing (2011)

https://www.library.hbs.edu/working-knowledge/clay-christensens-milkshake-marketing
2•vismit2000•43m ago•0 comments

Show HN: WeaveMind – AI Workflows with human-in-the-loop

https://weavemind.ai
9•quentin101010•48m ago•2 comments

Show HN: Seedream 5.0: free AI image generator that claims strong text rendering

https://seedream5ai.org
1•dallen97•50m ago•0 comments

A contributor trust management system based on explicit vouches

https://github.com/mitchellh/vouch
2•admp•52m ago•1 comments

Show HN: Analyzing 9 years of HN side projects that reached $500/month

3•haileyzhou•52m ago•0 comments

The Floating Dock for Developers

https://snap-dock.co
2•OsamaJaber•53m ago•0 comments

Arcan Explained – A browser for different webs

https://arcan-fe.com/2026/01/26/arcan-explained-a-browser-for-different-webs/
2•walterbell•54m ago•0 comments

We are not scared of AI, we are scared of irrelevance

https://adlrocha.substack.com/p/adlrocha-we-are-not-scared-of-ai
1•adlrocha•56m ago•0 comments

Quartz Crystals

https://www.pa3fwm.nl/technotes/tn13a.html
2•gtsnexp•58m ago•0 comments

Show HN: I built a free dictionary API to avoid API keys

https://github.com/suvankar-mitra/free-dictionary-rest-api
2•suvankar_m•1h ago•0 comments

Show HN: Kybera – Agentic Smart Wallet with AI Osint and Reputation Tracking

https://kybera.xyz
3•xipz•1h ago•0 comments

Show HN: brew changelog – find upstream changelogs for Homebrew packages

https://github.com/pavel-voronin/homebrew-changelog
1•kolpaque•1h ago•0 comments

Any chess position with 8 pieces on board and one pair of pawns has been solved

https://mastodon.online/@lichess/116029914921844500
2•baruchel•1h ago•1 comments

LLMs as Language Compilers: Lessons from Fortran for the Future of Coding

https://cyber-omelette.com/posts/the-abstraction-rises.html
3•birdculture•1h ago•0 comments

Projecting high-dimensional tensor/matrix/vect GPT–>ML

https://github.com/tambetvali/LaegnaAIHDvisualization
1•tvali•1h ago•1 comments