Virtualizing Nvidia HGX B200 GPUs with Open Source

https://www.ubicloud.com/blog/virtualizing-nvidia-hgx-b200-gpus-with-open-source
52•ben_s•2h ago

Comments

ben_s•1h ago
(author of the blog post here)

For me, the hardest part was virtualizing GPUs with NVLink in the mix. It complicates isolation while trying to preserve performance.

AMA if you want to dig into any of the details.

checker659•3m ago
Isn't SR-IOV a thing with these big GPUs? Or, is it that you're not concerned with fractional granularity?
girfan•1h ago
Cool post. Have you looked at slicing a single GPU up for multiple VMs? Is there anything other than MIG that you have come across to partition SMs and memory bandwidth within a single GPU?
ben_s•46m ago
Thanks! I haven't looked deeply into slicing up a single GPU. My understanding is that vGPU (which we briefly mention in the post) can partition memory but time-shares compute, while MIG is the only mechanism that provides partitioning of both SMs and memory bandwidth within a single GPU.
namibj•34m ago
Last I checked, MIG was the only one that made hard promises, especially about memory bandwidth. As long as your memory access patterns aren't secret and you trust the other guests not to be highly unfriendly in their cache usage, you should be able to get away with much less strict isolation. Think Docker vs. VMs-with-dedicated-cores.

But I thought MIG did do the job of chopping a GPU that's too big for most individual users into something that behaves very close to a literal array of smaller GPUs stuffed into the same PCIe card form factor? Think of how a Tesla K80 was pretty much just two GK210 "GPUs" on a PLX "PCIe switch" that connects them to each other and to the host. It was obviously trivial to give one to each of two VMs (at least if the PLX didn't interfere with IOMMU separation or such). For mere performance isolation it certainly sufficed, at least once you blocked a heavy user from power-budget throttling the sibling.

otterley•54m ago
Are Nvidia’s Fabric Manager and other control plane software Open Source? If so, that’s news to me. It’s not clear that anything in this article relates to Open Source at all; publishing how to do VM management doesn’t qualify. Maybe “open kimono.”

Also, how strong are the security boundaries among multiple tenants when configured in this way? I know, for example, that AWS is extremely careful about how hardware resources are shared across tenants of a physical host to prevent cross-tenant data leakage.

ben_s•31m ago
Fabric Manager itself is not open source. It's NVIDIA-provided software, and today it's required to bring up and manage the NVLink/NVSwitch fabric on HGX systems. What we meant by "open" is that everything around it - the hypervisor, our control plane logic, partition selection, host configuration, etc. - is implemented in the open and available in our repos. You're right that this isn't a fully open GPU stack.

On isolation: in Shared NVSwitch Multitenancy mode, isolation is enforced at multiple layers. Fabric Manager programs the NVSwitch routing tables so GPUs in different partitions cannot exchange NVLink traffic, and each VM receives exclusive ownership of its assigned GPUs via VFIO passthrough. Large providers apply additional hardening and operational controls beyond what we describe here. We're not claiming this is equivalent to AWS's internal threat model, but it does rely on NVIDIA's documented isolation mechanisms.
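
The exclusive-ownership property described above can be illustrated with a small check. This is only a sketch under stated assumptions: partitions are modeled as plain sets of GPU indices, and `can_activate` is a hypothetical helper, not part of Fabric Manager or any NVIDIA API.

```python
def can_activate(candidate, active_partitions):
    """Return True if `candidate` shares no GPU with any active partition.

    Hypothetical model: each partition is a frozenset of GPU indices.
    """
    return all(candidate.isdisjoint(p) for p in active_partitions)


# Hypothetical state: one tenant already owns GPUs 0-3.
active = [frozenset({0, 1, 2, 3})]

ok = can_activate(frozenset({4, 5}), active)        # disjoint from the tenant
conflict = can_activate(frozenset({3, 4}), active)  # GPU 3 is already owned

print(ok, conflict)  # prints: True False
```

A real control plane would also have to ask Fabric Manager to activate the partition; this sketch covers only the bookkeeping side.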

otterley•3m ago
Then I would kindly ask you to replace the words Open Source with something more accurate as it’s misleading.
mindcrash•52m ago
In case all of this sounds interesting:

After skimming the article I noticed that a large chunk of it (specifically the bits on detaching/attaching drivers, qemu and vfio) applies more or less to general GPU virtualization under Linux too!

1) Replace any "nvidia" with "amdgpu" for Team Red based setups when needed

2) The PCI IDs are all different, so you'll have to look them up with lspci yourselves

3) Note that with consumer GPUs you need to detach and attach a pair of devices (GPU video and GPU audio); else things might get a bit wonky
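
The detach/attach step in the list above can be sketched as a sequence of sysfs writes. This is a minimal sketch, assuming the standard Linux `driver_override` mechanism and a loaded `vfio-pci` module. The function name `vfio_rebind_plan` and the PCI addresses are hypothetical, and the function only returns the planned writes instead of performing them (which would need root).

```python
from pathlib import PurePosixPath

SYSFS_PCI = PurePosixPath("/sys/bus/pci")


def vfio_rebind_plan(bdfs):
    """Return the (path, value) sysfs writes that would move each PCI
    function in `bdfs` from its current driver to vfio-pci.

    Nothing is written here. The `unbind` write only applies when a
    driver is currently bound to the device.
    """
    plan = []
    for bdf in bdfs:
        dev = SYSFS_PCI / "devices" / bdf
        # 1) Detach from the current driver (e.g. nvidia or amdgpu).
        plan.append((str(dev / "driver" / "unbind"), bdf))
        # 2) Restrict the device so that only vfio-pci may claim it.
        plan.append((str(dev / "driver_override"), "vfio-pci"))
        # 3) Ask the PCI core to re-probe, which binds vfio-pci.
        plan.append((str(SYSFS_PCI / "drivers_probe"), bdf))
    return plan


# Hypothetical consumer GPU: video function .0 and audio function .1.
plan = vfio_rebind_plan(["0000:01:00.0", "0000:01:00.1"])
for path, value in plan:
    print(f"echo {value} > {path}")
```

Passing both functions of the card in one call mirrors point (3): the video and audio devices are rebound together.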

ben_s•18m ago
Thanks for the comment! You're right that a lot of the mechanics apply more generally. On point (3) specifically: we handle this by allocating at the IOMMU-group level rather than individual devices. Our allocator selects an IOMMU group and passes through all devices in that group (e.g., GPU video + audio), which avoids the partial-passthrough wonkiness you mentioned. For reference: https://github.com/ubicloud/ubicloud/blob/main/scheduling/al...
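
To illustrate the IOMMU-group idea, here is a minimal sketch that groups PCI devices by IOMMU group using the layout Linux exposes under `/sys/kernel/iommu_groups`. It is not the Ubicloud allocator. The function names and PCI addresses are hypothetical, and the demo runs against a fake directory tree so it works without GPU hardware.

```python
import tempfile
from pathlib import Path


def devices_by_iommu_group(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses in it."""
    groups = {}
    root = Path(root)
    if not root.is_dir():
        return groups
    for group_dir in root.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    return groups


def passthrough_set(bdf, groups):
    """Return every device that shares an IOMMU group with `bdf`."""
    for members in groups.values():
        if bdf in members:
            return members
    raise KeyError(f"{bdf} is not in any IOMMU group")


# Build a fake tree: group 13 holds a GPU's video and audio functions.
with tempfile.TemporaryDirectory() as tmp:
    for group, bdf in [(13, "0000:01:00.0"), (13, "0000:01:00.1"),
                       (14, "0000:02:00.0")]:
        (Path(tmp) / str(group) / "devices" / bdf).mkdir(parents=True)
    demo_groups = devices_by_iommu_group(tmp)

print(passthrough_set("0000:01:00.0", demo_groups))
# prints: ['0000:01:00.0', '0000:01:00.1']
```

Selecting the whole group rather than a single address is what keeps the video and audio functions together.
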
moondev•10m ago
In Shared NVSwitch Multitenancy Mode, are there any considerations for leveraging InfiniBand devices inside each VM at full performance?
