
The Anthropic Hive Mind

https://steve-yegge.medium.com/the-anthropic-hive-mind-d01f768f3d7b
1•gmays•1m ago•0 comments

The Sling: Humanity's Forgotten Power

https://www.slinging.org/
1•jsattler•7m ago•1 comments

Noam Chomsky, Jeffrey Epstein and the Politics of Betrayal

https://chrishedges.substack.com/p/noam-chomsky-jeffrey-epstein-and
2•chmaynard•8m ago•0 comments

Why securing AI model weights isn't enough

https://www.the-substrate.net/p/why-securing-ai-model-weights-isnt
1•erwald•8m ago•0 comments

Cursor Composer 1.5

https://cursor.com/blog/composer-1-5
4•leerob•9m ago•1 comments

Stop Telling Users Their DNS Is Wrong

https://jacob.gold/posts/stop-telling-users-their-dns-is-wrong/
2•jacobgold•9m ago•0 comments

Show HN: Voice Legacy: AI that interviews your parents before it's too late

https://www.voicelegacy.app
1•blacksausage•10m ago•1 comments

Have the patents for H.264 MPEG-4 AVC expired yet?

https://meta.wikimedia.org/wiki/Have_the_patents_for_H.264_MPEG-4_AVC_expired_yet%3F
1•Velocifyer•12m ago•0 comments

The Great Displacement: AI and the Next Fifty Years

https://drive.google.com/file/d/1nm0wOCJJvGS0WI737XHJ7VVDkjQrbgXf/view
1•sideway•12m ago•0 comments

GitButler CLI Is Good

https://matduggan.com/gitbutler-cli-is-really-good/
1•weaksauce•13m ago•0 comments

Modern Keystroke Visualizer for Linux

https://github.com/linuxmobile/keystroke
1•lwhsiao•13m ago•0 comments

Ask HN: Is Auth0 Down Again?

2•jansan•13m ago•0 comments

Strengthening Windows trust and security through User Transparency and Consent

https://blogs.windows.com/windowsexperience/2026/02/09/strengthening-windows-trust-and-security-t...
1•pentagrama•16m ago•0 comments

Paragraphic – Parametric graphic design app made in Godot

https://paragraphic.design/
1•count_zero•17m ago•0 comments

Ask HN: I experienced an Attack on Telegram and simcards gone!!!

2•khoobid_shoma•19m ago•0 comments

Ask HN: Any good open source projects written by AI agents?

2•grillorafael•19m ago•0 comments

European Processor Initiative

https://www.european-processor-initiative.eu/project/epi/
2•Gravityloss•20m ago•0 comments

Show HN: Linkpreview.io – Debug and preview social share cards

https://linkpreview.io
1•ravikmd•21m ago•0 comments

The power of anime: using anime for education and outreach in STEM

https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2025.1707055/full
2•PaulHoule•23m ago•0 comments

German patent classified as state secret

https://zenodo.org/records/18551454
2•nAOpx•23m ago•1 comments

Show HN: MumbleFlow – $5 local voice-to-text (whisper.cpp, Rust, no cloud)

https://mumble.helix-co.com
1•mumbleflow•23m ago•0 comments

Hims cancels plans to sell compounded GLP-1 pill after FDA backlash

https://www.biopharmadive.com/news/hims-stop-launch-compounded-wegovy-fda-novo/811662/
2•randycupertino•23m ago•2 comments

Regime-Declared Mathematics as Survivor Sets

https://zboralski.github.io/br/maths/index.html
1•kugutsumen•24m ago•0 comments

National data, local stories: ICE detention in 2026

https://exclav.es/2026/02/08/exploring-ice-detention-facilities/
2•ai_critic•24m ago•1 comments

Is AI the Paperclip?

https://www.newcartographies.com/p/is-ai-the-paperclip
3•headalgorithm•27m ago•0 comments

AGI/Singularity: 9,300 Predictions Analyzed

https://research.aimultiple.com/artificial-general-intelligence-singularity-timing/
1•hakkikonu•29m ago•0 comments

America Has a Tungsten Problem

https://www.noleary.com/blog/posts/1
19•noleary•32m ago•3 comments

Show HN: A last-minute romantic gift app with private links

https://fiveminutelove.com/
1•shivamjjha•34m ago•0 comments

Show HN: I killed my Calendly link after people booking randomly

3•Mrakermo•35m ago•1 comments

Show HN: Claude Cowork for Startup Market Analysis

https://brainwave.vc/prompt
1•louison11•39m ago•0 comments

DirectStorage LLM Weight Streaming: 4x faster loading, MoE expert streaming

https://github.com/kibbyd/llm_upper/blob/main/PROJECT_RECORD.md
1•kibbyd1985•1h ago

Comments

kibbyd1985•1h ago
Author here. This project started with a simple question: can you run a 70B MoE model on 8GB VRAM by streaming weights from NVMe SSD to GPU using DirectStorage?

The short answer: the streaming works, but public MoE models don't cooperate.

The long version:

*What works well:* DirectStorage uses DMA to transfer weights from NVMe SSD to GPU via D3D12 staging buffers, skipping the OS page cache that standard I/O relies on. I built a C++ DLL (MSVC) that handles the DirectStorage + D3D12 + CUDA interop, with Go bindings loaded via syscall (no CGO), integrated into Ollama's Backend.Load(). Staging is double-buffered, with D3D12 fences imported as CUDA external semaphores for synchronization. On codestral (12.6 GB, 57 layers), it loads 4.1x faster than stock Ollama, and the advantage grows with model size because standard I/O depends on the OS page cache staying warm.
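The double-buffered staging pattern is easier to see stripped of the GPU plumbing. A minimal Python simulation (not the project's code; thread-pool futures stand in for DirectStorage requests and fence waits, and `upload` stands in for the staging-buffer-to-VRAM copy) of reading chunk i+1 while chunk i uploads:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4  # bytes per staging buffer; tiny for illustration

def read_chunk(blob, i):
    """Stand-in for a DirectStorage read into a staging buffer."""
    return blob[i * CHUNK:(i + 1) * CHUNK]

def upload(chunk, device):
    """Stand-in for the D3D12-to-CUDA copy; here we just append."""
    device.extend(chunk)

def stream(blob):
    """Double-buffered loop: while chunk i uploads, chunk i+1 is reading."""
    device = bytearray()
    n = (len(blob) + CHUNK - 1) // CHUNK
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_chunk, blob, 0)          # fill buffer A
        for i in range(n):
            chunk = pending.result()                      # wait on "fence" for buffer i
            if i + 1 < n:
                pending = io.submit(read_chunk, blob, i + 1)  # kick off buffer B
            upload(chunk, device)                         # overlaps with the next read
    return bytes(device)
```

In the real pipeline the `result()` call corresponds to waiting on the D3D12 fence (imported as a CUDA external semaphore) before the GPU may consume the buffer.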

Note: the weights still need VRAM and RAM — DirectStorage changes the transfer path, not where the weights end up. The win is that DMA doesn't depend on the OS cache being warm.

*The MoE work:* Built full expert streaming — CUDA VMM for sparse-resident pools, lazy physical allocation, on-demand SSD→GPU streaming during Forward(), one-token-lag exact routing (use token t's expert indices to prefetch for t+1), LRU eviction. Ran qwen3:30b (128 experts/layer, 8 active) on 40GB RAM + 8GB VRAM. Pipeline sustains ~1.9 GB/s.
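The LRU-resident pool plus one-token-lag prefetch reduces to a small cache policy. A hedged sketch (class and parameter names are mine, not the repo's; a miss is where the real system would stream an expert from SSD into the VMM pool):

```python
from collections import OrderedDict

class ExpertCache:
    """LRU-resident expert pool; a miss models an on-demand SSD-to-GPU stream."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # expert id -> loaded flag, LRU order
        self.faults = 0

    def touch(self, expert):
        if expert in self.resident:
            self.resident.move_to_end(expert)   # hit: refresh LRU position
            return
        self.faults += 1                        # miss: stream from SSD here
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)   # evict least-recently-used
        self.resident[expert] = True

def generate_token(cache, active_experts, next_hint):
    """Run token t's experts, then use t's routing as the prefetch hint
    for t+1 (the one-token-lag exact-routing trick)."""
    for e in active_experts:
        cache.touch(e)
    for e in next_hint:
        cache.touch(e)
```

The design choice this illustrates: routing indices from token t are already exact by the time t's FFN runs, so prefetching for t+1 costs no speculation, only one token of lag.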

*Where it breaks:* Both models tested (gpt-oss:20b, qwen3:30b) are temporally dense. Over ~50 tokens, every expert gets touched. Reducing cache capacity by 25% causes >1000 faults/token. The temporal working set equals the full model.

The hardest bugs were: (1) Windows DLL search order differences between EXE and DLL contexts causing E_NOTIMPL, (2) D3D12 picking Intel iGPU while CUDA was on NVIDIA dGPU (LUID matching fixed it), (3) D3D12 fence completion not establishing memory visibility for CUDA — had to import the fence as a CUDA external semaphore.

The evaluation harness (max_resident_per_layer, faulted_experts_per_token) is probably the most useful piece — it can immediately tell you if a new MoE model is temporally sparse enough for small-VRAM inference. If anyone knows of MoE models trained with temporal locality objectives, I'd love to test them.
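Both harness metrics can be reproduced offline from a routing trace. A sketch under my own assumptions (function names echo the writeup's metric names, but this implementation is a guess, with one shared LRU pool rather than per-layer pools):

```python
from collections import OrderedDict

def faults_per_token(trace, capacity):
    """trace: list of per-token expert-id sets. Returns each token's fault
    count under an LRU-resident cache of the given capacity."""
    resident, out = OrderedDict(), []
    for experts in trace:
        faults = 0
        for e in experts:
            if e in resident:
                resident.move_to_end(e)
            else:
                faults += 1
                if len(resident) >= capacity:
                    resident.popitem(last=False)
                resident[e] = True
        out.append(faults)
    return out

def temporal_working_set(trace, window):
    """Largest number of distinct experts touched in any sliding window of
    `window` tokens; when this equals the expert count, the model is
    temporally dense and streaming cannot help."""
    return max(
        len(set().union(*trace[i:i + window]))
        for i in range(max(1, len(trace) - window + 1))
    )
```

Running `faults_per_token` on a trace at shrinking capacities is exactly the kind of probe that would surface the >1000 faults/token cliff described above.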

Repos:

- https://github.com/kibbyd/llm_upper (research & docs)
- https://github.com/kibbyd/llm_upper_ollama (Ollama fork)

Full writeup: https://github.com/kibbyd/llm_upper/blob/main/PROJECT_RECORD...