So, hypothetically, if ChatGPT's peak load were 3× its minimum load, they could reallocate up to 2/3 of their inference servers to training during off-peak hours.
Doing the same thing inside an individual GPU seems irrelevant to anyone operating at that scale, when they can approximate the same behavior by reallocating entire servers or even entire racks.
I've had good luck with indirection tables used during lookup inside the kernels consuming/producing the kvcache data - it's essentially the user-mode remapping they do here: you publish a buffer offset table, and the reads to it are uniform across threads, coalesce nicely, and the offsets cache just fine. You have the same memory-locality issues as with virtual memory (contiguous virtual, potentially random physical), but you're not limited to device page sizes, and since you can update the table while work is in flight you can be much more aggressive about reuse and offload: enqueue a DMA to cold storage to evict from VRAM, enqueue a DMA to copy from cold memory into the reused VRAM, enqueue the offset-table update, enqueue the work that uses it, repeat - all without host synchronization. You can also defrag in flight if you do want to restore physical locality. It's nothing crazy and fairly normal in CPU land (or even classic virtual texturing), but in ML GPU land I could write a big paper on it, call it SuperDuperFancyAttention4, and put out press releases...
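Roughly, the shape of that pattern might look like the CUDA sketch below, assuming a pooled KV-cache buffer in VRAM carved into fixed-size blocks; every name here (gather_kv, kv_pool, block_offsets, evict_and_reuse, BLOCK_ELEMS) is hypothetical for illustration, not taken from the post.

```cuda
// Minimal sketch: user-mode remapping of KV blocks via an offset table.
// All names and sizes are illustrative assumptions.
#include <cuda_runtime.h>
#include <cstdint>

constexpr int BLOCK_ELEMS = 256;  // elements per logical KV block (hypothetical)

// Each CUDA block consumes one logical KV block: thread 0 reads the offset
// table once (uniform, cheap to cache), then all threads gather from the
// pool at that offset with coalesced accesses. Logically contiguous,
// physically wherever the allocator last put it.
__global__ void gather_kv(const float* __restrict__ kv_pool,
                          const uint32_t* __restrict__ block_offsets, // logical block -> element offset in pool
                          const int* __restrict__ logical_blocks,     // logical blocks this launch consumes
                          float* __restrict__ out,
                          int n_blocks)
{
    int b = blockIdx.x;
    if (b >= n_blocks) return;

    __shared__ uint32_t base;
    if (threadIdx.x == 0)
        base = block_offsets[logical_blocks[b]];  // one indirection per block
    __syncthreads();

    for (int i = threadIdx.x; i < BLOCK_ELEMS; i += blockDim.x)
        out[b * BLOCK_ELEMS + i] = kv_pool[base + i];
}

// Host side: eviction, restore, table update, and the consuming kernel are
// all just work enqueued on one stream, so ordering comes from the stream
// and the host never has to synchronize in between.
void evict_and_reuse(cudaStream_t s,
                     float* kv_pool, uint32_t* d_block_offsets,
                     float* cold_out,              // pinned host buffer receiving the evicted block
                     const float* cold_in,         // pinned host buffer holding the block to restore
                     uint32_t victim_off,          // physical offset being recycled
                     uint32_t restored_logical,    // logical block that will now live there
                     const uint32_t* h_new_offset) // pinned host copy of victim_off to publish
{
    size_t bytes = BLOCK_ELEMS * sizeof(float);
    cudaMemcpyAsync(cold_out, kv_pool + victim_off, bytes, cudaMemcpyDeviceToHost, s);  // evict to cold storage
    cudaMemcpyAsync(kv_pool + victim_off, cold_in, bytes, cudaMemcpyHostToDevice, s);   // refill the reused VRAM
    cudaMemcpyAsync(d_block_offsets + restored_logical, h_new_offset, sizeof(uint32_t),
                    cudaMemcpyHostToDevice, s);                                          // publish the new mapping
    // ...enqueue gather_kv<<<grid, block, 0, s>>> next; it will see the updated table.
}
```

The point of the sketch is just the ordering: because the table update is itself enqueued on the stream, any kernel launched after it sees the new logical-to-physical mapping without a host round trip.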
Username is Jrxing
"GPU OS" turns out to be just more LLM spam