MD RAID or DRBD can be broken from userspace when using O_DIRECT

https://bugzilla.kernel.org/show_bug.cgi?id=99171

67•vbezhenar•9h ago

Comments

saurik•7h ago

So... one can, on a filesystem that is mirrored using MD RAID, from userspace, and with no special permissions (as it seems O_DIRECT does not require any), create a standard-looking file that has two different possible contents, depending on which RAID mirror it happens to be read from today? And, this bug, which has been open for a decade now, has, somehow, not been considered to be an all-hands-on-deck security issue that undermines the integrity of every single mechanism people might ever use to validate the content of a file, because... checks notes... we should instead be "teaching [the attacker] not to use [O_DIRECT]"?

(FWIW, I appreciate the performance impact of a full fix here might be brutal, but the suggestion of requiring boot-args opt-in for O_DIRECT in these cases should not have been ignored, as there are a ton of people who might not actively need or even be using O_DIRECT, and the people who do should be required to know what they are getting into.)
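To make the scenario concrete, here is a toy model, not a reproduction: it assumes the divergence comes from each mirror reading the same user buffer independently while the write is in flight. No real O_DIRECT, MD RAID, or DMA is involved, and `mirrored_write` and its `mutate_between` hook are hypothetical names made up for this sketch.

```python
# Toy model only: plain Python objects stand in for the user buffer and the
# two RAID mirrors. Nothing here touches a block device.

def mirrored_write(user_buffer, mirrors, mutate_between=None):
    """Copy user_buffer to each mirror in turn, as independent reads of the
    same memory. mutate_between (hypothetical hook) stands in for another
    userspace thread changing the buffer while the write is in flight."""
    for index in range(len(mirrors)):
        # Each mirror takes its own snapshot of whatever the buffer holds now.
        mirrors[index] = bytes(user_buffer)
        if mutate_between is not None and index == 0:
            mutate_between(user_buffer)


def flip(buffer):
    # The "attacker" rewrites the buffer after the first mirror has read it.
    buffer[:] = b"BBBB"


buf = bytearray(b"AAAA")
mirrors = [b"", b""]
mirrored_write(buf, mirrors, mutate_between=flip)
print(mirrors)  # the two copies no longer agree
```

In this model a later read returns different bytes depending on which entry of `mirrors` serves it, which is the "two possible contents" situation described above.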

summa_tech•7h ago

Wouldn't the performance impact be that of setting the page read-only when the request is submitted, then doing a copy-on-write if the user process does write it? I mean, that's nonzero, TLB flushes being what they are. But they do happen a bunch anyway...
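A rough sketch of that idea, again as a toy model rather than kernel code: the `Page` and `Process` classes and the `write_protected` flag are hypothetical stand-ins for page-table state, and the "fault" is just an `if` in `store`.

```python
# Toy model of "write-protect on submit, copy-on-write on store".
# No real page tables, faults, or TLB flushes are involved.

class Page:
    """Stand-in for one page of user memory."""
    def __init__(self, data):
        self.data = bytearray(data)
        self.write_protected = False


class Process:
    def __init__(self, data):
        self.page = Page(data)

    def submit_direct_write(self):
        # Mark the page read-only and hand it to the (simulated) I/O path.
        self.page.write_protected = True
        return self.page

    def store(self, data):
        # A store to a protected page "faults": the process gets a private
        # copy, and the in-flight page is left untouched.
        if self.page.write_protected:
            self.page = Page(self.page.data)
        self.page.data[:len(data)] = data


proc = Process(b"AAAA")
in_flight = proc.submit_direct_write()
proc.store(b"BBBB")  # racing userspace write

# Both mirrors read the in-flight page, which never changed.
mirror_a = bytes(in_flight.data)
mirror_b = bytes(in_flight.data)
print(mirror_a, mirror_b, bytes(proc.page.data))
```

The cost being discussed corresponds to the protect step and the copy in `store`; the model says nothing about how expensive those are on real hardware.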

vbezhenar•6h ago

Please note that some filesystems, namely bcachefs, btrfs, and zfs, seem to be immune to this issue, probably because they don't just directly delegate writes to the block layer with the O_DIRECT flag. But it is important to be aware of this issue.

saurik•6h ago

While those are all "filesystems", they are also (internally) alternatives to MD RAID; like, you could run zfs on top of MD RAID, but it feels like a waste of zfs (and the same largely goes for btrfs and bcachefs). It thereby is not at all clear to me that it is the filesystems that are "immune to this issue" rather than their respective RAID-like behaviors, as it seems to be the latter that the discussion was focussing on (hence the initial mention of potentially adding btrfs to the issue, which did not otherwise mention any filesystem at all). Put another way: if you did do the unusual thing of running zfs on top of MD RAID, I actually bet you are still vulnerable to this scenario.

(Oh, unless you are maybe talking about something orthogonal to the fixes mentioned in the discussion thread, such as some property of the extra checksumming done by these filesystems? And so, even if the disks de-synchronize, maybe zfs will detect an error if it reads "the wrong one" off of the underlying MD RAID, rather than ending up with the other content?)

ludocode•4h ago

These filesystems are not really alternatives because mdraid supports features those filesystems do not. For example, parity raid is still broken in btrfs (so it effectively does not support it), and last I checked zfs can't grow a parity raid array while mdraid can.

I run btrfs on top of mdraid in RAID6 so I can incrementally grow it while still having copy-on-write, checksums, snapshots, etc.

I hope that one day btrfs fixes its parity raid or bcachefs will become stable enough to fully replace mdraid. In the meantime I'll continue using mdraid with a copy-on-write filesystem on top.

bananapub•3h ago

> zfs can't grow a parity raid array while mdraid can.

indeed out of date - that was merged a long time ago and shipped in a stable version earlier this year.

Polizeiposaune•4h ago

ZFS puts checksums in the block pointer, so, unless you disable checksums, it always knows the expected checksum of a block it is about to read.

When the actual checksum of what was read from storage doesn't match the expected value, it will try reading alternate locations (if there are any), and it will write back the corrected block if it succeeds in reconstructing a block with the expected checksum.
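The read path described above can be sketched roughly like this. It is a simplified model, not ZFS code: `read_block` is a hypothetical helper, the copies are plain byte strings in a list, and SHA-256 is used only as a convenient checksum for the example.

```python
import hashlib


def checksum(data):
    # Assumption for this sketch: SHA-256 as the block checksum.
    return hashlib.sha256(data).digest()


def read_block(expected, copies):
    """Return the first copy whose checksum matches `expected`, and rewrite
    any mismatching copies with that good data (the "write back" step).
    `expected` plays the role of the checksum stored in the block pointer."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("no copy matches the expected checksum")
    for index, candidate in enumerate(copies):
        if checksum(candidate) != expected:
            copies[index] = good  # repair the bad copy
    return good


# The first location holds the "wrong" contents, the alternate one is good.
copies = [b"BBBB", b"AAAA"]
expected = checksum(b"AAAA")
data = read_block(expected, copies)
print(data, copies)
```

Because the expected checksum lives outside the block itself, the reader can tell which copy is the intended one instead of returning whichever it happened to read first.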

weinzierl•6h ago

Linus very much opposed O_DIRECT from the start. If I remember correctly, he only introduced it under pressure from the "database people", i.e. his beloved Oracle.

No wonder O_DIRECT never saw much love.

"I hope some day we can just rip the damn disaster out."

-- Linus Torvalds, 2007

https://lkml.org/lkml/2007/1/10/235

jandrewrogers•5h ago

This is one of several examples where Linus thinks something is bad because he doesn't understand how it is used.

Something like O_DIRECT is critical for high-performance storage in software for well-understood reasons. It enables entire categories of optimization by breaking a kernel abstraction that is intrinsically unfit for purpose; there is no way to fix it in the kernel, the existence of the abstraction is the problem as a matter of theory.

As a database performance enjoyer, I've been using O_DIRECT for 15+ years. Something like it will always exist because removing it would make some high-performance, high-scale software strictly worse.

jeffbee•4h ago

His lack of industry experience is the root cause of many issues in Linux.

vacuity•4h ago

Although this is somewhat true, I think the bigger issue is expecting Linux to support all these use cases. Even if Linus accepted all use cases, it's a different story to maintain a kernel/OS that supports them all. The story from an engineering standpoint is just too unwieldy. A general-purpose OS can only go so far to optimize countless special-purpose uses.

tremon•3h ago

This is not some minor niche use case though, and all other operating systems seem to have no trouble supporting OS fscache bypass.

vacuity•2h ago

Considering how big Linux is and how many different use cases it supports, this could well be an undue maintenance burden for Linux where it wouldn't be for other operating systems. Though, I'll grant that I don't know the details here, and of course Linus is...opinionated.

karmakaze•5h ago

This is nuts. I've used both MD RAID and O_DIRECT though luckily not together on the same system. One system was with btrfs so may have been spared anyway. Footguns/landmines.

rwaksmunski•6h ago

This, fsync() data corruption, BitterFS issues, lack of Audit on io_uring, triplicated EXT2,3,4 code bases. For the past 20 years, every time I consider moving mission critical data from FreeBSD/ZFS something like this pops up.

zokier•6h ago

Personally I think these problems are a sign that POSIX fs/io APIs are just not that good. Or rather, they have been stretched and extended way past their original intent/design. Stuff like zenfs gives an interesting glimpse of what could be.

burnt-resistor•4h ago

FreeBSD 13+ threw away their faithful adaptation of production-proven code for OpenZFS (ZoL).[0,1] I refuse to use OpenZFS (ZoL) because a RHEL file server self-corrupted, wouldn't mount rw any longer, and ZoL shrugged it off without any resolution except "buy more drives and start over".

Overall, there are grossly insufficient comprehensive testing tools, techniques, and culture in FOSS: FreeBSD, Linux, and most projects rely upon informal, under-documented, ad-hoc, meat-based scream testing rather than proper, formal verification of correctness. Although no one ever said high-confidence software engineering was easy, it's essential to avoid entire classes of CVEs and unexpected operation bugs.

0: https://www.freebsd.org/releases/13.0R/relnotes/

1: https://lists.freebsd.org/pipermail/freebsd-fs/2018-December...

guipsp•4h ago

The link you posted explains exactly why they threw it away. You may disagree, but the stakeholders did not.

burnt-resistor•2h ago

Yes, I know. And I know iXsystems folks too. If you want stable, battle-tested ZFS, Solaris is the only supportable option on Sun hardware like the good ol' (legacy) Thumper. OpenZFS isn't tested well enough and there's too much hype and religious zealotry around it. For personal use, it's probably fine for some people but, at this point, semi-alternatives such as xfs and btrfs [thanks to Meta] have billions more hours of production usage.
