frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Study confirms experience beats youthful enthusiasm

https://www.theregister.com/2026/02/07/boomers_vs_zoomers_workplace/
1•Willingham•39s ago•0 comments

The Big Hunger by Walter J Miller, Jr. (1952)

https://lauriepenny.substack.com/p/the-big-hunger
1•shervinafshar•1m ago•0 comments

The Genus Amanita

https://www.mushroomexpert.com/amanita.html
1•rolph•6m ago•0 comments

We have broken SHA-1 in practice

https://shattered.io/
1•mooreds•7m ago•1 comments

Ask HN: Was my first management job bad, or is this what management is like?

1•Buttons840•8m ago•0 comments

Ask HN: How to Reduce Time Spent Crimping?

1•pinkmuffinere•9m ago•0 comments

KV Cache Transform Coding for Compact Storage in LLM Inference

https://arxiv.org/abs/2511.01815
1•walterbell•14m ago•0 comments

A quantitative, multimodal wearable bioelectronic device for stress assessment

https://www.nature.com/articles/s41467-025-67747-9
1•PaulHoule•16m ago•0 comments

Why Big Tech Is Throwing Cash into India in Quest for AI Supremacy

https://www.wsj.com/world/india/why-big-tech-is-throwing-cash-into-india-in-quest-for-ai-supremac...
1•saikatsg•16m ago•0 comments

How to shoot yourself in the foot – 2026 edition

https://github.com/aweussom/HowToShootYourselfInTheFoot
1•aweussom•16m ago•0 comments

Eight More Months of Agents

https://crawshaw.io/blog/eight-more-months-of-agents
3•archb•18m ago•0 comments

From Human Thought to Machine Coordination

https://www.psychologytoday.com/us/blog/the-digital-self/202602/from-human-thought-to-machine-coo...
1•walterbell•19m ago•0 comments

The new X API pricing must be a joke

https://developer.x.com/
1•danver0•19m ago•0 comments

Show HN: RMA Dashboard fast SAST results for monorepos (SARIF and triage)

https://rma-dashboard.bukhari-kibuka7.workers.dev/
1•bumahkib7•20m ago•0 comments

Show HN: Source code graphRAG for Java/Kotlin development based on jQAssistant

https://github.com/2015xli/jqassistant-graph-rag
1•artigent•25m ago•0 comments

Python Only Has One Real Competitor

https://mccue.dev/pages/2-6-26-python-competitor
3•dragandj•26m ago•0 comments

Tmux to Zellij (and Back)

https://www.mauriciopoppe.com/notes/tmux-to-zellij/
1•maurizzzio•27m ago•1 comments

Ask HN: How are you using specialized agents to accelerate your work?

1•otterley•28m ago•0 comments

Passing user_id through 6 services? OTel Baggage fixes this

https://signoz.io/blog/otel-baggage/
1•pranay01•29m ago•0 comments

DavMail Pop/IMAP/SMTP/Caldav/Carddav/LDAP Exchange Gateway

https://davmail.sourceforge.net/
1•todsacerdoti•30m ago•0 comments

Visual data modelling in the browser (open source)

https://github.com/sqlmodel/sqlmodel
1•Sean766•32m ago•0 comments

Show HN: Tharos – CLI to find and autofix security bugs using local LLMs

https://github.com/chinonsochikelue/tharos
1•fluantix•32m ago•0 comments

Oddly Simple GUI Programs

https://simonsafar.com/2024/win32_lights/
1•MaximilianEmel•33m ago•0 comments

The New Playbook for Leaders [pdf]

https://www.ibli.com/IBLI%20OnePagers%20The%20Plays%20Summarized.pdf
1•mooreds•33m ago•1 comments

Interactive Unboxing of J Dilla's Donuts

https://donuts20.vercel.app
1•sngahane•35m ago•0 comments

OneCourt helps blind and low-vision fans to track Super Bowl live

https://www.dezeen.com/2026/02/06/onecourt-tactile-device-super-bowl-blind-low-vision-fans/
1•gaws•36m ago•0 comments

Rudolf Vrba

https://en.wikipedia.org/wiki/Rudolf_Vrba
1•mooreds•37m ago•0 comments

Autism Incidence in Girls and Boys May Be Nearly Equal, Study Suggests

https://www.medpagetoday.com/neurology/autism/119747
1•paulpauper•38m ago•0 comments

Wellness Hotels Discovery Application

https://aurio.place/
1•cherrylinedev•38m ago•1 comments

NASA delays moon rocket launch by a month after fuel leaks during test

https://www.theguardian.com/science/2026/feb/03/nasa-delays-moon-rocket-launch-month-fuel-leaks-a...
1•mooreds•39m ago•0 comments
Open in hackernews

Virtualizing Nvidia HGX B200 GPUs with Open Source

https://www.ubicloud.com/blog/virtualizing-nvidia-hgx-b200-gpus-with-open-source
116•ben_s•1mo ago

Comments

ben_s•1mo ago
(author of the blog post here)

For me, the hardest part was virtualizing GPUs with NVLink in the mix. It complicates isolation while trying to preserve performance.

AMA if you want to dig into any of the details.

checker659•1mo ago
Isn't SR-IOV a thing with these big GPUs? Or, is it that you're not concerned with fractional granularity?
ben_s•1mo ago
In this article, we're primarily concerned with whole-GPU or multi-GPU partitions that preserve NVLink bandwidth, rather than finer-grained fractional sharing of a single GPU.
spwa4•1mo ago
Would it be possible to implement "virtual memory" for a GPU this way? Let's say you have GPUs at 30% utilization, but memory limited. Could you run 2 workloads by offloading the GPU memory when not in use?
ben_s•1mo ago
Once you oversubscribe GPU memory, performance usually collapses. Frameworks like vLLM can explicitly offload things like the KV cache to CPU memory, but that's an application-level tradeoff, not transparent GPU virtual memory.
girfan•1mo ago
Cool post. Have you looked at slicing a single GPU up for multiple VMs? Is there anything other than MIG that you have come across to partition SMs and memory bandwidth within a single GPU?
ben_s•1mo ago
Thanks! I haven't looked deeply into slicing up a single GPU. My understanding is that vGPU (which we briefly mention in the post) can partition memory but time-shares compute, while MIG is the only mechanism that provides partitioning of both SMs and memory bandwidth within a single GPU.
namibj•1mo ago
Last I checked MIG was the only one that made hard promises about especially memory bandwidth; as long as your memory access patterns aren't secret and you have enough trust in the other guests not being highly unfriendly with their cache usage behavior, you should be able to get away with much less strict isolation. Think docker vs. VMs-with-dedicated-cores.

But I thought MIG did do the job of chopping a GPU that's too big for most individual users into something that behaves very close to a literal array of smaller GPUs stuffed into the same PCIe card form factor? Think how a Tesla K80 was pretty much just two GK210 "GPUs" on a PLX "PCIe switch" which connects them to each other and to the host. Obviously trivial to give one to each of two VMs (at least if the PLX didn't interfere with IOMMU separation or such.... for mere performance isolation it certainly sufficed (once you block a heavy user from power budget throttling the sibling, at least).

tptacek•1mo ago
Can you pass a MIG device into a KVM VM? The team we worked with didn't believe it was possible (they suggested we switch to VMWare); the MIG system interface gives you a UUID, not a PCI BDF.
moondev•1mo ago
Kubevirt has some examples passing a vpgu into kvm

https://kubevirt.io/user-guide/compute/host-devices/

tptacek•1mo ago
Right, vGPUs are explicitly set up to generate BDF addresses that can be passed through (but require host driver support; they're essentially paravirtualized). I'm asking about MIG.
namibj•1mo ago
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supp... says GPU passhtrough is supported on MIG...
my123•1mo ago
There's a MIG vGPU mode usable for this
tptacek•1mo ago
Have you used it? How does it work? How do you drive it? We tried a lot of different things. Is it not paravirtualized, the way vGPUs are?
my123•1mo ago
It works with SR-IOV instead of mdev afaik

Still needs some host SW to drive it but actually does static partitioning

IIRC it's usable through using the MIG-marked vGPU types

otterley•1mo ago
Is Nvidia’s Fabric Manager and other control plane software Open Source? If so, that’s news to me. It’s not clear that anything in this article relates to Open Source at all; publishing how to do VM management doesn’t qualify. Maybe “open kimono.”

Also, how strong are the security boundaries among multiple tenants when configured in this way? I know, for example, that AWS is extremely careful about how hardware resources are shared across tenants of a physical host to prevent cross-tenant data leakage.

ben_s•1mo ago
Fabric Manager itself is not open source. It's NVIDIA-provided software, and today it's required to bring up and manage the NVLink/NVSwitch fabric on HGX systems. What we meant by "open" is that everything around it - the hypervisor, our control plane logic, partition selection, host configuration, etc. - is implemented in the open and available in our repos. You're right that this isn't a fully open GPU stack.

On isolation: in Shared NVSwitch Multitenancy mode, isolation is enforced at multiple layers. Fabric Manager programs the NVSwitch routing tables so GPUs in different partitions cannot exchange NVLink traffic, and each VM receives exclusive ownership of its assigned GPUs via VFIO passthrough. Large providers apply additional hardening and operational controls beyond what we describe here. We're not claiming this is equivalent to AWS's internal threat model, but it does rely on NVIDIA's documented isolation mechanisms.

mindcrash•1mo ago
In case all of this sounds interesting:

After skimming the article I noticed a large chunk of this article (specifically the bits on deattaching/attaching drivers, qemu and vfio) applies more or less to general GPU virtualization under Linux too!

1) Replace any "nvidia" for "amdgpu" for Team Red based setups when needed

2) The PCI ids are all different, so you'll have look them up with lspci yourselves

3) Note that with consumer GPU's you need to deattach and attach a pair of two devices (GPU video and GPU audio); else things might get a bit wonky

ben_s•1mo ago
Thanks for the comment! You're right that a lot of the mechanics apply more generally. On point (3) specifically: we handle this by allocating at the IOMMU-group level rather than individual devices. Our allocator selects an IOMMU group and passes through all devices in that group (e.g., GPU video + audio), which avoids the partial-passthrough wonkiness you mentioned. For reference: https://github.com/ubicloud/ubicloud/blob/main/scheduling/al...
moondev•1mo ago
In Shared NVSwitch Multitenancy Mode - are there any considerations for leveraging infiniband devices inside each vm at full performance?
ben_s•1mo ago
We haven't looked deeply at inter-machine communication yet. NVLink/NVSwitch (which this post focuses on) are intra-node, so InfiniBand is mostly orthogonal I think and comes down to NIC passthrough, NUMA/PCIe placement, and validating RDMA inside the VM.
tptacek•1mo ago
Did you ever manage to get vGPU's working in any other hardware configuration? I know it's not what Hx00 customers want. I bloodied my forehead on that for a month or two with Cloud Hypervisor --- I got to the "light reverse engineering of drivers" stage before walking away.
ben_s•1mo ago
We didn't focus on vGPU and largely avoided it on purpose. Instead, we focused on whole-GPU and NVSwitch-partitioned passthrough (Shared NVSwitch Multitenancy Mode), which is a better fit for the workloads we care about.
tryauuum•1mo ago
can someone explain me like I'm 10 what is a BAR?

Like it says something about mmaping 256 GB of per GPU. But wouldn't it waste 2T of RAM? or do I fail in my understanding of what "mmap" is as well..

EDIT: yes, seems like my understanding of mmap wasn't good, it wastes not RAM but address space

convolvatron•1mo ago
Base Address Register

this term can be used at a couple different points (including mappings from physical addresses to physical hardware in the memory network), but a PCI BAR is a register in the configuration space that tells the card what PCI host addresses map to internal memory regions in the card. one BAR per region.

the PCI BARs are usually configured by the driver after allocating some address space from the kernel.

DRAM BARs in the switching network are generally configured by something running at the BIOS level based on probes of memory controllers and I2C reads from the DIMMS to find out capacity.

ckastner•1mo ago
A lot of this coincides with my own experiments I did to pass-through consumer AMD GPUs into VMs [1], which the Debian ROCm Team uses in their CI.

The Debian package rocm-qemu-support ships scripts that facilitate most of this. I've since generalized this by adding NVIDIA support, but I haven't uploaded the new gpuisol-qemu package [2] to the official Archive yet. It still needs some polishing.

Just dumping this here, to add more references (especially the further reading section, the Gentoo and Arch wikis had a lot of helpful data).

[1]: https://salsa.debian.org/rocm-team/community/team-project/-/...

[2]: https://salsa.debian.org/ckk/gpu-isolation-tools

latchkey•1mo ago
A couple open relevant issues here:

https://github.com/amd/MxGPU-Virtualization/issues/6

https://github.com/amd/MxGPU-Virtualization/issues/16

ckastner•1mo ago
Coincidentally, the first issue (referencing Navi 21) was the one I started these experiments with, and this turned out to be pretty informative.

Our Navi 21 would almost always go AWOL after a test run had been completed, requiring a full reboot. At some point, I noticed that this only happened when our test runner was driving the test; I never had an issue when testing interactively. I eventually realized that our test driver was simply killing the VM when the test was done, which is fine for a CPU-based test, but this messed with the GPU's state. When working interactively, I was always shutting down the host cleanly, which apparently resolved this. A patch to our test runner to cleanly shut down VMs fixed this.

And I've had no luck with iGPUs, as referenced by the second issue.

From what I understand, I don't think that consumer AMD GPUs can/will ever be fully supported, because the GPU reset mechanisms of older cards are so complex. That's why things like vendor-reset [3] exist, which apparently duplicate a lot of the in-kernel driver code but ultimately only twiddle some bits.

[3]: https://github.com/gnif/vendor-reset