frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

Shimmy v1.7.0: Running 42B Moe Models on Consumer GPUs with 99.9% VRAM Reduction

https://github.com/Michael-A-Kuykendall/shimmy/releases/tag/v1.7.0
3•MKuykendall•2h ago

Comments

MKuykendall•2h ago
I just released Shimmy v1.7.0 with MoE (Mixture of Experts) CPU offloading support, and the results are pretty exciting for anyone who's hit GPU memory walls. What this solves If you've tried running large language models locally, you know the pain: a 42B parameter model typically needs 80GB+ of VRAM, putting it out of reach for most developers. Even "smaller" 20B models often require 40GB+. The breakthrough MoE CPU offloading intelligently moves expert layers to CPU while keeping active computation on GPU. In practice: Phi-3.5-MoE 42B: Runs on 8GB consumer GPUs (was impossible before) GPT-OSS 20B: 71.5% VRAM reduction (15GB → 4.3GB, measured) DeepSeek-MoE 16B: Down to 800MB VRAM with Q2 quantization The tradeoff is 2-7x slower inference, but you can actually run these models instead of not running them at all. Technical implementation Built on enhanced llama.cpp bindings with new with_cpu_moe() and with_n_cpu_moe(n) methods Two CLI flags: --cpu-moe (automatic) and --n-cpu-moe N (manual control) Cross-platform: Windows MSVC CUDA, macOS Metal, Linux x86_64/ARM64 Still sub-5MB binary with zero Python dependencies Ready-to-use models I've uploaded 9 quantized models to HuggingFace specifically optimized for this: Phi-3.5-MoE variants (Q8.0, Q4 K-M, Q2 K) DeepSeek-MoE variants GPT-OSS 20B baseline Getting started # Install cargo install shimmy

# Download a model huggingface-cli download MikeKuykendall/phi-3.5-moe-q4-k-m-cpu-offload-gguf

# Run with MoE offloading ./shimmy serve --cpu-moe --model-path phi-3.5-moe-q4-k-m.gguf Standard OpenAI-compatible API, so existing code works unchanged. Why this matters This democratizes access to state-of-the-art models. Instead of needing a $10,000 GPU or cloud spending, you can run expert models on gaming laptops or modest server hardware. It's not just about making models "work" - it's about sustainable AI deployment where organizations can experiment with cutting-edge architectures without massive infrastructure investments. The technique itself isn't novel (llama.cpp had MoE support), but the Rust bindings, production packaging, and curated model collection make it accessible to developers who just want to run large models locally. Release: https://github.com/Michael-A-Kuykendall/shimmy/releases/tag/... Models: https://huggingface.co/MikeKuykendall Happy to answer questions about the implementation or performance characteristics.

Satanic panic – how Dublin's Hellfire Club inspired a new video game

https://www.rte.ie/culture/2025/1009/1536327-how-dublins-hellfire-club-inspired-a-new-video-game/
1•austinallegro•2m ago•0 comments

The Indian messaging app that wants to take on WhatsApp

https://www.bbc.com/news/articles/cy50299w5vwo
1•vinni2•2m ago•0 comments

Free Tools for YouTube Channels

https://utubekit.com/
1•ayushchat•5m ago•0 comments

Redis Security Advisory: CVE-2025-49844

https://redis.io/blog/security-advisory-cve-2025-49844/
1•StefanBatory•6m ago•0 comments

AI Visibility Drift: The Quiet Collapse Between Retrains

https://www.aivojournal.org/visibility-drift-the-quiet-collapse-between-retrains/
1•businessmate•6m ago•0 comments

The Unknotting Number Is Not Additive

https://divisbyzero.com/2025/10/08/the-unknotting-number-is-not-additive/
2•JohnHammersley•9m ago•0 comments

Fuzzing as the basis for effective development a case study of LuaJIT [video]

https://www.youtube.com/watch?v=GwHZaynqh98
1•todsacerdoti•9m ago•0 comments

Room with a View

https://www.thomasmoes.com/52obsessions/room-with-a-view
1•thomoes•11m ago•0 comments

Analysis of 43 official IDF videos–recycled 3D environments by unrelated artists

https://twitter.com/JackSapoch/status/1975965515911921883
1•wahnfrieden•14m ago•0 comments

Show HN: Tdycoder – Local AI code editor using Ollama LLM

https://github.com/TDYSKY/TDYCODER
1•TDYSKY•15m ago•0 comments

Performance Killers in Axum, Tokio, Diesel, WebRTC, and Reqwest

https://autoexplore.medium.com/hidden-performance-killers-in-axum-tokio-diesel-webrtc-and-reqwest...
1•Havunen•15m ago•1 comments

UK universities offered to monitor students' social media

https://www.theguardian.com/education/2025/oct/08/uk-universities-offered-to-monitor-student-soci...
1•treebrained•16m ago•0 comments

Show HN: Sora2 watermark is heavy, so I built a one-click remover

https://sorawatermarkremover.net
1•wushi•17m ago•0 comments

Technological Approach to Mind Everywhere: A Grounded Framework (pdf, 2022)

https://www.frontiersin.org/journals/systems-neuroscience/articles/10.3389/fnsys.2022.768201/full
1•asplake•18m ago•0 comments

The End of Tt-Rss.org

https://community.tt-rss.org/t/the-end-of-tt-rss-org/7164
1•hysan•20m ago•1 comments

Launch your open source journey in climate and sustainability

https://climatetriage.com
1•protontypes•23m ago•0 comments

New keyboard suggestions from the Gboard team for 2025

https://blog.google/intl/ja-jp/products/android-chrome-play/gboard-2025/
1•caminanteblanco•23m ago•0 comments

To Startup and Solopreneur: From Idea to MVP

https://help.paraflow.com/to-startup-and-solopreneur-from-idea-to-mvp
1•Julie309•24m ago•0 comments

Remote HFT shop is hiring

http://mailto:filip@numbagoup.com
1•arbingo•24m ago•1 comments

Digital ID is almost here for everyone in every country

https://www.youtube.com/watch?v=zTdJzSKo5rg
1•EGreg•25m ago•1 comments

The Polygons of Doom: PSX

https://fabiensanglard.net/doom_psx/index.html
1•joexbayer•25m ago•0 comments

URL in children's books redirects to porn site

https://www.theguardian.com/education/2025/oct/08/publisher-libraries-website-childrens-book-porn
1•piqufoh•29m ago•1 comments

UK Cider makers toast bumper summer – but drinks are too strong to sell

https://www.thetimes.com/uk/environment/article/cider-makers-toast-bumper-summer-but-drinks-are-t...
2•petethomas•34m ago•0 comments

LLM Poisoning [1/3] – Reading the Transformers Thougts

https://www.synacktiv.com/en/publications/llm-poisoning-13-reading-the-transformers-thoughts
2•charlestrodet•37m ago•0 comments

The First Long Context Guardrail

https://huggingface.co/GeneralAnalysis/GA_Guard_Core
3•rhavaeis•38m ago•0 comments

ChatGPT image snares suspect in deadly Pacific Palisades fire

https://www.bbc.com/news/articles/c8exz5yg14ko
3•arberavdullahu•41m ago•1 comments

Apple Is Helping the U.S. Government Chill Speech on ICE

https://pxlnv.com/linklog/apple-helping-chill-ice/
2•BallsInIt•44m ago•0 comments

AI generated Code is 10% Bullshit

https://alexanderweichart.de/4_Projects/ai-bs-code/AI-generated-Code-is-10-percent-Bullshit
2•surrTurr•44m ago•0 comments

Numair Faraz Culture of Who's

1•kwoii•50m ago•0 comments

Build Something That Lasts

https://velagao.substack.com/p/build-something-that-lasts
1•velapod•54m ago•0 comments