Yes.
This is more of a good old classic partitioning, which was rare outside of hardware with special support for it.
Most RTOS + non-RTOS combinations use the RTOS doubling as a hypervisor, with RT tasks running in guaranteed timeframes and the non-RTOS guest running in a more relaxed form.
Unlike in other Linux virtualization solutions such as User Mode Linux (or the aforementioned VMware), special driver software on the host operating system is used to execute the coLinux kernel in a privileged mode (known as ring 0 or supervisor mode).
By constantly switching the machine's state between the host OS state and the coLinux kernel state, coLinux is given full control of the physical machine's MMU (i.e., paging and protection) in its own specially allocated address space, and is able to act just like a native kernel, achieving almost the same performance and functionality that can be expected from a regular Linux which could have run on the same machine standalone.
So my understanding is that it's a Windows driver which contains a full Linux kernel and does some (scary sounding!) time sharing with the Windows kernel running at the same CPL.
The coLinux home page also says:
To cooperatively share hardware with the host operating system, coLinux does not access I/O devices directly. Instead, it interfaces with emulated devices provided by the coLinux drivers in the host OS. For example, a regular file in Windows can be used as a block device in coLinux. All real hardware interrupts are transparently forwarded to the host OS, so this way the host OS's control of the real hardware is not being disturbed and thus it continues to run smoothly.
So just like UML, coLinux hooks int 80h (or sysenter) and forwards the request to Windows. Thus, while it may make use of direct access to the MMU, most devices IIRC are virtualized.
If you need more security/isolation, go to a VM or bare metal.
https://www.digiater.nl/openvms/doc/alpha-v8.3/83final/aa_re...
This sounds like running multiple kernels in a shared security domain, which reduces the performance cost of transitions and sharing, but you lose the reliability and security advantages that a proper VM gives you. It reminds me of coLinux (essentially, a Linux kernel as a Windows NT device driver).
Does anyone have more details on how OpenVMS Galaxy was actually implemented? I believe it was available for both Alpha and Itanium, but not yet x86-64 (and probably never…)
The firmware support was mainly there to provide booting of separate partitions, but otherwise no virtualisation was involved - all resources were exclusively owned.
I think Linux will have to move to a microkernel architecture before this can work. Once you have separate "processes" for hardware drivers, running two userlands side-by-side should be a piece of cake (at least compared to the earlier task of converting the rest of the kernel).
Will be interesting to see where this goes. I like the idea, but if I were to go in that direction, I would choose something like a Genode kernel to supervise multiple Linux kernels.
I think it works for:

- Enhanced security through kernel-level separation
- Better resource utilization than traditional VMs (KVM, Xen, etc.)

but I don't think it works for:

- Improved fault isolation between different workloads
- Potential zero-downtime kernel update with KHO (Kexec HandOver)

since if the "main" kernel crashes or is supposed to get upgraded, then you have to hand hardware back to it.

Isn't that similar to starting up from hibernate to disk? Basically all of your peripherals are powered off and so probably cannot keep their state.
Also, you can actually stop a disk (a member of a RAID device), remove the PCIe-SATA HBA card it is attached to, replace it with a different one, and connect everything back together without any user-space application noticing.
Here's my graphics chip getting reset:
[drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
amdgpu 0000:c6:00.0: amdgpu: MODE2 reset
amdgpu 0000:c6:00.0: amdgpu: GPU reset succeeded, trying to resume

This allowed cheap "logical partitioning" of machines without actually using a hypervisor or special hardware support.
Today, you can grab a physical NIC and create some number of virtual NICs. Same for GPUs.
I guess the idea is that you have some hardware, and each kernel (read "virtual machine") will get:
- some dedicated CPU
- some physical memory
- some virtual NICs
- some storage, maybe (if dedicated; if through network, then nothing to do here)
- maybe a virtual GPU for the AI hype train
Every kernel will mostly think it owns real hardware, while in fact it only deals with part of it (all of this due to virtualized hardware support that can be found in many places).

This does not seem like a general-purpose feature that could be used on our laptops.
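For the CPU and memory part of such a carve-out, today's kernel already has boot parameters that could fence resources off for a second kernel. A purely illustrative fragment (the actual multikernel patchset may configure this differently, and all values here are made up):

```
# Hypothetical command line for the boot ("main") kernel:
# keep CPUs 4-7 and the 4 GiB of RAM starting at the 4 GiB mark
# out of its hands, so a second kernel could later claim them.
isolcpus=4-7 nohz_full=4-7      # keep the scheduler and timer ticks off those CPUs
memmap=4G$0x100000000           # mark that physical range as reserved
```

(Note that `$` needs escaping in some bootloader configs; see the kernel's kernel-parameters documentation for the exact `memmap=` syntax.)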
I think the architecture assumes all loaded kernels are trusted, and imposes no isolation other than having them running on different CPUs.
Given the (relative) simplicity of the PoC, it could be really performant.
Which of the kernels does the PCI enumeration, for instance, and how is it determined which kernel gets ownership of a PCI device? What about ACPI? Serial ports?
How does this architecture transfer ownership of RAM between kernels, or is it a fixed configuration? What about NUMA awareness? (Likely you would want to partition systems so that RAM stays with the CPUs of the same NUMA node.)
Looks to me like one kernel would need to have 'hypervisor'-like behavior in order to divvy up resources to the other kernels. I think PVM (https://lwn.net/Articles/963718/) would be a preferred solution in this case, because the existing software stack for managing hypervisor resources can be reused with it.
Could the new kernel be genetically scored for effectiveness (security, performance, etc), and iterated upon automatically, by e.g. an AI?