Virtualizing Nvidia HGX B200 GPUs with Open Source

https://www.ubicloud.com/blog/virtualizing-nvidia-hgx-b200-gpus-with-open-source
63•ben_s•3h ago

Comments

ben_s•2h ago
(author of the blog post here)

For me, the hardest part was virtualizing GPUs with NVLink in the mix. It complicates isolation while trying to preserve performance.

AMA if you want to dig into any of the details.

checker659•1h ago
Isn't SR-IOV a thing with these big GPUs? Or, is it that you're not concerned with fractional granularity?
ben_s•44m ago
In this article, we're primarily concerned with whole-GPU or multi-GPU partitions that preserve NVLink bandwidth, rather than finer-grained fractional sharing of a single GPU.
girfan•2h ago
Cool post. Have you looked at slicing a single GPU up for multiple VMs? Is there anything other than MIG that you have come across to partition SMs and memory bandwidth within a single GPU?
ben_s•1h ago
Thanks! I haven't looked deeply into slicing up a single GPU. My understanding is that vGPU (which we briefly mention in the post) can partition memory but time-shares compute, while MIG is the only mechanism that provides partitioning of both SMs and memory bandwidth within a single GPU.
namibj•1h ago
Last I checked, MIG was the only one that made hard promises, especially about memory bandwidth; as long as your memory access patterns aren't secret and you trust the other guests not to be highly unfriendly with their cache usage, you should be able to get away with much less strict isolation. Think docker vs. VMs-with-dedicated-cores.

But I thought MIG did do the job of chopping a GPU that's too big for most individual users into something that behaves very close to a literal array of smaller GPUs stuffed into the same PCIe card form factor? Think how a Tesla K80 was pretty much just two GK210 "GPUs" on a PLX "PCIe switch" which connects them to each other and to the host. Obviously trivial to give one to each of two VMs (at least if the PLX didn't interfere with IOMMU separation or such). For mere performance isolation it certainly sufficed (once you block a heavy user from power-budget throttling the sibling, at least).

tptacek•53m ago
Can you pass a MIG device into a KVM VM? The team we worked with didn't believe it was possible (they suggested we switch to VMWare); the MIG system interface gives you a UUID, not a PCI BDF.
otterley•1h ago
Are Nvidia’s Fabric Manager and other control plane software Open Source? If so, that’s news to me. It’s not clear that anything in this article relates to Open Source at all; publishing how to do VM management doesn’t qualify. Maybe “open kimono.”

Also, how strong are the security boundaries among multiple tenants when configured in this way? I know, for example, that AWS is extremely careful about how hardware resources are shared across tenants of a physical host to prevent cross-tenant data leakage.

ben_s•1h ago
Fabric Manager itself is not open source. It's NVIDIA-provided software, and today it's required to bring up and manage the NVLink/NVSwitch fabric on HGX systems. What we meant by "open" is that everything around it - the hypervisor, our control plane logic, partition selection, host configuration, etc. - is implemented in the open and available in our repos. You're right that this isn't a fully open GPU stack.

On isolation: in Shared NVSwitch Multitenancy mode, isolation is enforced at multiple layers. Fabric Manager programs the NVSwitch routing tables so GPUs in different partitions cannot exchange NVLink traffic, and each VM receives exclusive ownership of its assigned GPUs via VFIO passthrough. Large providers apply additional hardening and operational controls beyond what we describe here. We're not claiming this is equivalent to AWS's internal threat model, but it does rely on NVIDIA's documented isolation mechanisms.

mindcrash•1h ago
In case all of this sounds interesting:

After skimming the article I noticed that a large chunk of it (specifically the bits on detaching/attaching drivers, qemu and vfio) applies more or less to general GPU virtualization under Linux too!

1) Replace any "nvidia" with "amdgpu" for Team Red based setups when needed

2) The PCI ids are all different, so you'll have to look them up with lspci yourselves

3) Note that with consumer GPUs you need to detach and attach a pair of devices (GPU video and GPU audio); else things might get a bit wonky
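
For readers who want to see the detach/attach step concretely, here is a minimal sketch, assuming the standard Linux sysfs PCI interface (`driver/unbind`, `driver_override`, `drivers_probe`). It is not taken from the article; the function names `vfio_rebind_plan` and `apply_plan` and the `sysfs_root` parameter are hypothetical, and the plan is only computed, not applied, unless you call `apply_plan` as root with the `vfio-pci` module loaded.

```python
from pathlib import Path


def vfio_rebind_plan(bdf: str, sysfs_root: str = "/sys") -> list[tuple[Path, str]]:
    """Return the (file, value) writes that hand one PCI function to vfio-pci.

    `bdf` is the PCI address as shown by `lspci -D`, e.g. "0000:01:00.0".
    `sysfs_root` is a hypothetical parameter so the logic can be tried
    against a fake directory tree instead of the real /sys.
    """
    dev = Path(sysfs_root) / "bus/pci/devices" / bdf
    plan: list[tuple[Path, str]] = []

    # Detach from whatever driver currently owns the function
    # (nvidia, amdgpu, snd_hda_intel, ...), if one is bound.
    if (dev / "driver").exists():
        plan.append((dev / "driver" / "unbind", bdf))

    # Restrict the function so that only vfio-pci may claim it.
    plan.append((dev / "driver_override", "vfio-pci"))

    # Ask the PCI core to re-probe the function so vfio-pci binds to it.
    plan.append((Path(sysfs_root) / "bus/pci/drivers_probe", bdf))
    return plan


def apply_plan(plan: list[tuple[Path, str]]) -> None:
    """Perform the writes. Requires root and a loaded vfio-pci module."""
    for path, value in plan:
        path.write_text(value)
```

For a consumer card you would build one plan per function, for example `0000:01:00.0` (video) and `0000:01:00.1` (audio), and apply both before starting the VM.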

ben_s•1h ago
Thanks for the comment! You're right that a lot of the mechanics apply more generally. On point (3) specifically: we handle this by allocating at the IOMMU-group level rather than individual devices. Our allocator selects an IOMMU group and passes through all devices in that group (e.g., GPU video + audio), which avoids the partial-passthrough wonkiness you mentioned. For reference: https://github.com/ubicloud/ubicloud/blob/main/scheduling/al...
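
A rough illustration of group-level allocation, assuming the standard sysfs layout under `/sys/kernel/iommu_groups/<n>/devices/`. This is not Ubicloud's allocator (see the link above for that); `iommu_groups`, `devices_to_pass_through`, and the `sysfs_root` parameter are hypothetical names used only for this sketch.

```python
from pathlib import Path


def iommu_groups(sysfs_root: str = "/sys") -> dict[int, list[str]]:
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    base = Path(sysfs_root) / "kernel/iommu_groups"
    groups: dict[int, list[str]] = {}
    if not base.is_dir():
        return groups  # IOMMU disabled, or not a Linux host
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            # Each entry is named after a PCI address (a symlink on real sysfs).
            groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return groups


def devices_to_pass_through(bdf: str, groups: dict[int, list[str]]) -> list[str]:
    """Return every device sharing an IOMMU group with `bdf`.

    Passing the whole list through, rather than `bdf` alone, is what
    avoids partial passthrough of a group.
    """
    for members in groups.values():
        if bdf in members:
            return members
    return []
```

On a host where the GPU's video and audio functions share a group, `devices_to_pass_through("0000:01:00.0", iommu_groups())` would return both addresses.
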
moondev•1h ago
In Shared NVSwitch Multitenancy Mode - are there any considerations for leveraging infiniband devices inside each vm at full performance?
ben_s•49m ago
We haven't looked deeply at inter-machine communication yet. NVLink/NVSwitch (which this post focuses on) are intra-node, so InfiniBand is mostly orthogonal I think and comes down to NIC passthrough, NUMA/PCIe placement, and validating RDMA inside the VM.
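
One small piece of the NUMA/PCIe placement question can be checked from sysfs. The sketch below assumes the standard `numa_node` attribute under `/sys/bus/pci/devices/<bdf>/`, which the kernel reports as `-1` when it has no affinity information. The helper names and the `sysfs_root` parameter are hypothetical, and this says nothing about RDMA behaviour inside the VM.

```python
from pathlib import Path
from typing import Optional


def numa_node(bdf: str, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the NUMA node reported for a PCI function, or None if unknown."""
    path = Path(sysfs_root) / "bus/pci/devices" / bdf / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    # The kernel writes -1 when no NUMA affinity is known.
    return node if node >= 0 else None


def same_numa_node(gpu_bdf: str, nic_bdf: str, sysfs_root: str = "/sys") -> bool:
    """True only when both devices report the same, known NUMA node."""
    gpu = numa_node(gpu_bdf, sysfs_root)
    nic = numa_node(nic_bdf, sysfs_root)
    return gpu is not None and gpu == nic
```

A scheduler could use a check like this to prefer pairing a GPU with a NIC on the same socket before passing both through to one VM.
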
tptacek•54m ago
Did you ever manage to get vGPUs working in any other hardware configuration? I know it's not what Hx00 customers want. I bloodied my forehead on that for a month or two with Cloud Hypervisor --- I got to the "light reverse engineering of drivers" stage before walking away.
ben_s•27m ago
We didn't focus on vGPU and largely avoided it on purpose. Instead, we focused on whole-GPU and NVSwitch-partitioned passthrough (Shared NVSwitch Multitenancy Mode), which is a better fit for the workloads we care about.

Beginning January 2026, all ACM publications will be made open access

https://dl.acm.org/openaccess
389•Kerrick•1h ago•41 comments

Classical statues were not painted horribly

https://worksinprogress.co/issue/were-classical-statues-painted-horribly/
306•bensouthwood•4h ago•166 comments

Your job is to deliver code you have proven to work

https://simonwillison.net/2025/Dec/18/code-proven-to-work/
219•simonw•2h ago•191 comments

Virtualizing Nvidia HGX B200 GPUs with Open Source

https://www.ubicloud.com/blog/virtualizing-nvidia-hgx-b200-gpus-with-open-source
63•ben_s•3h ago•15 comments

Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction

19•sidmanchkanti21•1h ago•5 comments

Are Apple gift cards safe to redeem?

https://daringfireball.net/linked/2025/12/17/are-apple-gift-cards-safe-to-redeem
236•tosh•2h ago•185 comments

Using TypeScript to Obtain One of the Rarest License Plates

https://www.jack.bio/blog/licenseplate
81•lafond•2h ago•65 comments

Jonathan Blow has spent the past decade designing 1,400 puzzles for you

https://arstechnica.com/gaming/2025/12/jonathan-blow-has-spent-the-past-decade-designing-1400-puz...
210•furcyd•6d ago•276 comments

Please Just Try Htmx

http://pleasejusttryhtmx.com/
172•iNic•2h ago•168 comments

Creating apps like Signal could be 'hostile activity' claims UK watchdog

https://www.techradar.com/vpn/vpn-privacy-security/creating-apps-like-signal-or-whatsapp-could-be...
298•donohoe•5h ago•200 comments

RCE via ND6 Router Advertisements in FreeBSD

https://www.freebsd.org/security/advisories/FreeBSD-SA-25:12.rtsold.asc
101•weeha•8h ago•56 comments

Microscopic robots that sense, think, act, and compute

https://www.science.org/doi/10.1126/scirobotics.adu8009
5•XzetaU8•4d ago•0 comments

Slowness is a virtue

https://blog.jakobschwichtenberg.com/p/slowness-is-a-virtue
178•jakobgreenfeld•6h ago•68 comments

Dogalog: A realtime Prolog-based livecoding music environment

https://github.com/danja/dogalog
18•triska•4d ago•3 comments

Gemini 3 Flash: Frontier intelligence built for speed

https://blog.google/products/gemini/gemini-3-flash/
1072•meetpateltech•1d ago•564 comments

Hightouch (YC S19) Is Hiring

https://hightouch.com/careers
1•joshwget•5h ago

Egyptian Hieroglyphs: Lesson 1

https://www.egyptianhieroglyphs.net/egyptian-hieroglyphs/lesson-1/
129•jameslk•11h ago•51 comments

I got hacked: My Hetzner server started mining Monero

https://blog.jakesaunders.dev/my-server-started-mining-monero-this-morning/
534•jakelsaunders94•19h ago•328 comments

Show HN: A local-first memory store for LLM agents (SQLite)

https://github.com/CaviraOSS/OpenMemory
29•nullure•4d ago•14 comments

What is an elliptic curve? (2019)

https://www.johndcook.com/blog/2019/02/21/what-is-an-elliptic-curve/
118•tzury•10h ago•12 comments

After ruining a treasured water resource, Iran is drying up

https://e360.yale.edu/features/iran-water-drought-dams-qanats
264•YaleE360•6h ago•214 comments

It's all about momentum

https://combo.cc/posts/its-all-about-momentum-innit/
93•sph•7h ago•32 comments

systemd v259 Released

https://github.com/systemd/systemd/releases/tag/v259
39•voxadam•2h ago•16 comments

Heart and Kidney Diseases and Type 2 Diabetes May Be One Ailment

https://www.scientificamerican.com/article/heart-and-kidney-diseases-plus-type-2-diabetes-may-be-...
30•Brajeshwar•1h ago•10 comments

From profiling to kernel patch: the journey to an eBPF performance fix

https://rovarma.com/articles/from-profiling-to-kernel-patch-the-journey-to-an-ebpf-performance-fix/
23•todsacerdoti•4d ago•1 comments

Most parked domains now serving malicious content

https://krebsonsecurity.com/2025/12/most-parked-domains-now-serving-malicious-content/
94•bookofjoe•4h ago•28 comments

The Big City; Save the Flophouses (1996)

https://www.nytimes.com/1996/01/14/magazine/the-big-city-save-the-flophouses.html
30•ChadNauseam•3d ago•10 comments

AI helps ship faster but it produces 1.7× more bugs

https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
99•birdculture•4h ago•110 comments

Working quickly is more important than it seems (2015)

https://jsomers.net/blog/speed-matters
231•bschne•3d ago•110 comments

Spain fines Airbnb €65M: Why the government is cracking down on illegal rentals

https://www.euronews.com/travel/2025/12/15/spain-fines-airbnb-65-million-why-the-government-is-cr...
89•robtherobber•2h ago•87 comments