The objection is that even the tiniest bug-fix windows get everything but the kitchen sink.
These are both uncomfortable positions to occupy, without doubt.
And the whole reason for a filesystem's existence is to store and maintain your data, so if that is what the patch is for, yes, it should be under consideration as a hotfix.
There's also the broader context: it's a major problem for stabilization if we can't properly support the people using it so they can keep testing.
More context: the kernel as a whole is based on fixed timetables and code review, which it needs because QA (especially automated testing) is extremely spotty. bcachefs's QA, both automated testing and community testing, is extremely good, and we've had bugfix patchsets either held up or turned into flamewars because of this mismatch entirely too many times.
Then what you do is you try to split your work in two. You could think of a stopgap measure or a workaround which is small, can be reviewed easily, and will reduce the impact of the bug while not being a "proper" fix, and prepare the "properer" fix when the merge window opens.
I would ask, since the bug has probably been around since the last stable release, how come it fell through the cracks and was only noticed recently? Could it be that not all setups are affected? If so, can't they live with it until the next merge window?
By making a "feature that fixes the bug for real", you greatly expand the area in which new, unknown bugs may land, with very little time to give it proper testing. This is inevitable, as evidenced by the simple fact that the bug you were trying to fix exists. You can be good, but not that good. Nobody is that good. If anybody was that good, they wouldn't have the bug in the first place.
If you have commercial clients who use your filesystem and you have contractual obligations to fix their bugs and keep their data intact, you could (I'd even say "should") maintain an out-of-tree version with its own release and bugfix schedule. This is IMO the only reasonable way to have it, because the kernel is a huge administrative machine with lots of people, and by mainlining stuff, you necessarily become co-dependent on the release schedule for the whole kernel. I think a conflict between kernel's release schedule and contractual obligations, if you have any, is only a matter of time.
That is indeed what I normally do. For example, 6.14 and 6.15 had people discovering btree iterator locking bugs (manifesting as assertion pops) while running evacuates on large filesystems (it's hard to test a sufficiently deep tree depth in virtual machine tests with our large btree nodes); some small hotfixes went out in rc kernels, but the majority of the work (a whole project to add assertions for path->should_be_locked, which should shut these down for good) waited until the 6.16 merge window.
That was for a less critical bug - your machine crashing is somewhat less severe than losing a filesystem.
In this case, we had a bug pop up in 6.15 where the link count in the VFS inode getting screwed up caused an inode to be deleted that shouldn't have been - a subvolume root - and then an untested repair path took out the entire subvolume.
Ouuuuch.
That's why the repair code was rushed; it had already gotten one filesystem back, and I'd just gotten another report of someone else hitting it - and for every bug report there are almost always more people who hit it and don't report it.
And considering that a lot of people running bcachefs now are getting it from distro kernels and don't know how to build kernels - that is why it was important to get this out quickly through the normal channels.
In addition, the patch wasn't risky, contrary to what Ted was saying. It's a code path that's very well covered by automated tests, including KASAN/UBSAN/lockdep variants - those would have exploded if this patch was incorrect.
When to ship a patch is always a judgement call, and part of how you make that call is how well your QA process can guarantee the patch is correct. Part of what was going on here is a disconnect between those of us who do make heavy use of modern QA infrastructure and those who do it the old school way, relying heavily on manual review and long testing periods for rc kernels.
At work we have our main application which also contains a lot of customer integrations. Our policy has been new features in trunk only, except if it's entirely contained inside a customer-specific integration module.
We do try to avoid it, but this does allow us to be flexible with regards to customer needs, while keeping the base application stable.
This new recovery feature was, as far as I could see, entirely contained within the bcachefs kernel code. Given the experimental status, as long as it was clearly communicated to users, I don't see a huge problem allowing such self-contained features during the RC phase.
Obviously a requirement must be that it doesn't break the build.
I'm not saying those concerns are wrong, but when it's causing a fallout like being kicked out from the kernel, the downsides clearly are more severe than any potential benefits.
I know a lot of people heavily use slack/discord these days, but personally I find the web interfaces way too busy. IRC all the way, for me.
But the problem of communicating effectively enough to produce a coherent design is very real - this goes back to Fred Brooks (Mythical Man Month). I think bcachefs turned out very well with the way the process has gone to date, and now that it's gotten bigger, with more distinct subsystems, I am very eagerly looking forward to the date when I can hand off ownership of some of those subsystems. Lately we've had some sharp developers getting involved - for the past several years it's been mainly users testing it (and some of them have gotten very good at debugging at this point).
So it's happening.
>negative effects on the rest of the kernel
Needing to design and support an API is not purely negative for kernel developers. It also gives a chance to have a proper interface for drivers to use and follow. Take a look at the Rust for Linux which keeps running into undocumented APIs that make little sense and are just whatever <insert most popular driver> does.
We already have that, with the "don't break userspace" policy combined with all of the modules being in-tree.
> Needing to design and support an API is not purely negative for kernel developers.
Sure, it's not purely negative, but it's overall a big net negative.
> Take a look at the Rust for Linux which keeps running into undocumented APIs that make little sense and are just whatever <insert most popular driver> does.
That's an argument against a stable module API! Those things are getting fixed as they get found, but if we had a stable module API, we'd be stuck with them forever.
I recommend reading https://docs.kernel.org/process/stable-api-nonsense.html
Bcachefs is not user space.
>with all of the modules being in-tree.
That is not true. There are out of tree modules such as ZFS.
>That's an argument against a stable module API!
My point was that there was 0 thought put into creating a good API. Additionally API could be evolved over time and have a support period if you care about being able to evolve it and deprecate the old one. And likely even with a better interface there is probably a way to make the old API still function.
bcachefs is still in-tree.
> That is not true. There are out of tree modules such as ZFS.
ZFS could be in-tree in no time at all if Oracle would fix its license. And until they do that, it's not safe to use ZFS-on-Linux anyway, since Oracle could sue you for it.
> My point was that there was 0 thought put into creating a good API.
There is thought put into it: it's exactly what we need right now, because if what we need ever changes, we'll change the API too, thus avoiding YAGNI and similar problems.
> Additionally API could be evolved over time and have a support period if you care about being able to evolve it.
If a temporary "support period" is what you want, then just use the LTS kernels. That's already exactly what they give you.
> And likely even with a better interface there is probably a way to make the old API still function.
That's the big net negative I was mentioning and that https://docs.kernel.org/process/stable-api-nonsense.html talks about too. Sometimes there isn't a feasible way to support part of an old API anymore, and it's not worth holding the whole kernel back just for the out-of-tree modules.
IANAL, but I don't believe either of these things are true.
OpenZFS contains enough code not authored by Sun/Oracle that relicensing it now is effectively impossible.
OTOH, it is under the CDDL, which is a perfectly good open source license; AFAICT the problem, if one exists at all[0], only manifests when distributing the combination of CDDL (OpenZFS) and GPL (Linux) software. If you download CDDL software and compile it into GPL software yourself (say, with DKMS) then it should be fine because you aren't distributing it.
[0] This is a case where I'm going to really emphasize that I'm really not a lawyer and merely point out that ex. Canonical's lawyers do seem to think CDDL+GPL is okay.
Which excludes a vast amount of activity one might want to use Linux for which is otherwise allowed. Like selling a device with a Linux installation, distributing VM or system restore images, etc.
It's not against the license to use them together.
>If a temporary "support period" is what you want, then just use the LTS kernels. That's already exactly what they give you.
Only the Android one does. The regular LTS one has no such guarantee.
>FreeBSD (which is often invoked in these complaints) would have amdgpu for example.
In such a hypothetical FreeBSD could reimplement the stable API of Linux.
Like it does with the userland API of Linux, which is stable:
I worked for HP on storage drivers for a decade or so, and had there been a stable ABI, HP would have shipped proprietary storage drivers for everything. Even without a stable ABI, they shipped proprietary drivers at considerable effort, compiling for myriad different distro kernels. It was a nightmare, and good thing too, or there wouldn't be any open source drivers.
There is absolutely no good reason not to share driver source though so that's a terrible use case to optimize for.
(it's not obvious that having to occasionally disassemble/patch closed-source drivers is worse than the collective effort wasted trying to get every single thing in the kernel and keep it up to date).
However, Kent, if you read this: please just settle down and follow the rules. Quit deliberately antagonizing Linus. The constant drama is incredibly offputting. Don't jeopardize the entire future of bcachefs over the silliest and most temporary concerns.
If you absolutely must argue about some rule or other, then make that argument without having your opening move be to blatantly violate them and then complain when people call you out.
You were the one who wanted into the kernel despite many suggestions that it was too early. That comes with tradeoffs. You need to figure out how to live with that, at least for a year or two. Stop making your self-imposed problems everyone else's problems.
Kent is absolutely technically capable of, and has the vision to, finally displace ext4, xfs, and zfs with a new filesystem that Does Not Lose Data. To jeopardize that by refusing to work within the well-established structure is madness.
And I encourage anyone who wants to contribute to join the IRC channel. It's not a one man show, I work with a lot of people there.
The filesystem should do files, if you want something more complex do it in userspace. We even have FUSE if you want to use the Filesystem API with your crazy network database thing.
The extended (not extant) family (including ext4) don't support copy-on-write. Using them as your primary FS after 2020 (or even 2010) is like using a non-journaling file system after 2010 (or even 2001)--it's a non-negotiable feature at this point. Btrfs has been stable for a decade, and if you don't like or trust it, there's always ZFS, which has been stable 20 years now. Apple now has APFS, with CoW, on all their devices, while MSFT still treats ReFS as unstable, and Windows servers still rely heavily on NTFS.
They seem to be slowly introducing it to the masses, Dev drives you set up on Windows automatically use ReFS
Speed is sometimes more important than absolute reliability, but it’s still an undesirable tradeoff.
Being able to quickly take a "backup" copy of some multi-gb directory tree before performing some potentially destructive operation on it is such a nice safety net to have.
It's also a handy way to back up file metadata, like mtime, without having to design a file format for mapping saved mtimes back to their host files.
You're thinking of the optimization technique of CoW, as in what Linux does when spawning a new thread or forking a process. I'm talking about it in the context of only ever modifying copies of file system data and metadata blocks, for the purpose of ensuring file system integrity, even in the context of sudden power loss (EDIT: wrong link): https://www.qnx.com/developers/docs/8.0/com.qnx.doc.neutrino...
If anything, ordinary file IO is likely to be slightly slower on a CoW file system, due to it always having to copy a block before said block can be modified and updating block pointers.
What kind of journaling though? By default ext4 only uses journaling for metadata updates, not data updates (see "ordered" mode in ext4(5)).
So if you have a (e.g.) 1000MB file, and you update 200MB in the middle of it, you can have a situation where the first 100MB is written out and the system dies with the other 100MB vanishing.
With a CoW, if the second 100MB is not written out and the file sync'd, then on system recovery you're back to the original file being completely intact. With ext4 in the default configuration you have a file that has both new-100MB and stale-100MB in the middle of it.
The updating of the file data and the metadata are two separate steps (by default) in ext4:
* https://www.baeldung.com/linux/ext-journal-modes
* https://michael.kjorling.se/blog/2024/ext4-defaulting-to-dat...
* https://fy.blackhats.net.au/blog/2024-08-13-linux-filesystem...
Whereas with a proper CoW (like ZFS), updates are ACID.
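For anyone stuck on a non-CoW filesystem, applications usually approximate the "old file or new file, never a mix" behaviour themselves with the write-to-temp, fsync, rename pattern. This is a minimal sketch of that userspace pattern, not of how ZFS or ext4 work internally; `atomic_update` is a hypothetical helper name, it assumes a POSIX system, and it reads the whole file into memory, which you would not do for a 1000MB file.

```python
import os
import tempfile


def atomic_update(path, offset, new_bytes):
    """Rewrite part of a file so a crash leaves either the old or the new version."""
    directory = os.path.dirname(os.path.abspath(path))

    # Build the complete new contents in memory (sketch only; not for huge files).
    with open(path, "rb") as src:
        data = bytearray(src.read())
    end = offset + len(new_bytes)
    if end > len(data):
        data.extend(b"\0" * (end - len(data)))
    data[offset:end] = new_bytes

    # Write the new version next to the original so the rename stays on one filesystem.
    # Note: mkstemp creates the file with mode 0600; the original mode is not preserved.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data must be durable before the rename
        os.replace(tmp_path, path)  # atomic swap of the directory entry on POSIX
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

    # Persist the rename itself by syncing the containing directory.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The cost is a full rewrite of the file for every update, which is exactly the work a CoW filesystem avoids by copying only the affected blocks.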
I read that more as "we have filesystems at home, and also get off my lawn".
... It does hard links? After checking: It does hard links.
... Why didn't any programs I had noticeably take advantage of that?
That's pretty much built into most mass storage devices already.
> If a disk bitflips one of my files
The likelihood and consequence of this occurring is in many situations not worth the overhead of adding additional ECC on top of what the drive does.
> ext* won't do anything about it.
What should it do? Blindly hand you the data without any indication that there's a problem with the underlying block? Without an fsck what mechanism do you suppose would manage these errors as they're discovered?
> That's pretty much built into most mass storage devices already.
And ZFS has shown that it is not sufficient (at least for some use-cases, perhaps less of a big deal for 'residential' users).
> The likelihood and consequence of this occurring is in many situations not worth the overhead of adding additional ECC on top of what the drive does.
Not worth it to whom? Not having the option available at all is the problem. I can do a zfs set checksum=off pool_name/dataset_name if I really want that extra couple percentage points of performance.
> Without an fsck what mechanism do you suppose would manage these errors as they're discovered?
Depends on the data involved: if it's part of the file system tree metadata there are often multiple copies even for a single disk on ZFS. So instead of the kernel consuming corrupted data and potentially panicking (or going off into the weeds) it can find a correct copy elsewhere.
If you're in a fancier configuration with some level of RAID, then there could be other copies of the data, or it could be rebuilt through ECC.
With ext*, LVM, and mdadm no such possibility exists because there are no checksums at any of those layers (perhaps if you glom on dm-integrity?).
And with ZFS one can set copies=2 on a per-dataset basis (perhaps just for /home?), and get multiple copies strewn across the disk: won't save you from a drive dying, but could save you from corruption.
I looked at that, in hopes of being able to protect my data. Unfortunately, I considered this something of a fatal flaw:
> It uses journaling for guaranteeing write atomicity by default, which effectively halves the write speed.
That's 10^14 bits for a consumer drive. That's just 12TB. A heavy user (lots of videos or games) would see a bit flip a couple times a year.
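The arithmetic behind that figure, as a rough sketch. It assumes the spec-sheet number means one unrecoverable read error per 10^14 bits read on average, which is an assumption about how to read the spec; the function names and the 25 TB/year read volume are made up for illustration.

```python
# Assumed reading of the spec: one unrecoverable read error per 1e14 bits read.
URE_BITS = 10**14


def bytes_per_expected_error(bits_per_error=URE_BITS):
    """Bytes you can read, on average, before expecting one error."""
    return bits_per_error / 8


def expected_errors(bytes_read, bits_per_error=URE_BITS):
    """Expected number of errors for a given volume of reads."""
    return bytes_read * 8 / bits_per_error


print(bytes_per_expected_error() / 1e12)  # 12.5 (decimal terabytes)
print(expected_errors(25e12))             # 2.0 for a hypothetical 25 TB read per year
```

So the "12TB" above is 12.5 decimal TB, and "a couple times a year" corresponds to reading roughly 25 TB annually under that assumption.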
According to that 10^14 metric I should see read errors just about every month. Except I have just about zero.
Current disks are ~4 years old and run 24/7, and excluding a bad cable incident I've had a single case of a read error (recoverable, thanks ZFS).
I suspect those URE numbers are made by the manufacturers figuring out they can be sure the disk will do 10^14, but they don't actually try to find the real number because 10^14 is good enough.
> What should it do? Blindly hand you the data without any indication that there's a problem with the underlying block?
Well, that's what it does now, and I think that's a problem.
> Without an fsck what mechanism do you suppose would manage these errors as they're discovered?
Linux can fail a read, and IMHO should do so if it cannot return correct data. (I support the ability to override this and tell it to give you the corrupted data, but certainly not by default.) On ZFS, if a read fails its checksum, the OS will first try to get a valid copy (ex. from a mirror or if you've set copies=2), and then if the error can't be recovered then the file read fails and the system reports/records the failure, at which point the user should probably go do a full scrub (which for our purposes should probably count as fsck) and restore the affected file(s) from backup. (Or possibly go buy a new hard drive, depending on the extent of the problem.) I would consider that ideal.
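The read path I'm describing looks roughly like this. It's a toy sketch of "verify, fall back to another copy, otherwise fail the read", not ZFS code: `verified_read` and `ChecksumError` are hypothetical names, and SHA-256 is just a stand-in for whatever checksum the filesystem actually stores.

```python
import hashlib


class ChecksumError(IOError):
    """Raised when no available copy of a block matches its stored checksum."""


def verified_read(copies, expected_sha256):
    """Return (data, index) for the first copy that matches; fail the read otherwise."""
    for index, data in enumerate(copies):
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data, index
    # Nothing trustworthy to return: surface an error rather than corrupt data.
    raise ChecksumError("no copy matched the stored checksum")


good = b"important block"
stored = hashlib.sha256(good).hexdigest()
data, used = verified_read([b"important blocc", good], stored)  # first copy is corrupt
```

A real filesystem would also rewrite the bad copy from the good one and log the event; the point here is only that the default on a mismatch is an error, not silently returned garbage.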
You get that with APFS by default on macOS these days and those features come for free in btrfs, some in XFS, etc on Linux.
That makes checksums and journals of only marginal usefulness.
I wish some review website would have a robot plug and unplug the power cable in a test rig for a few weeks and rate which SSD manufacturers are robust to this stuff.
Kent is in the wrong. Having a lead position in development I would kick Kent off the team.
One thing is to challenge things. What Kent is doing is something completely different. It is obvious he introduced a feature, not only a Bugfix.
If the rules are set in a way that rc1+ gets only Bugfixes, then this is absolutely clear what happens with the feature. Tolerating this once or twice is ok, but Kent is doing this all the time, testing Linus.
Linus is absolutely in the right to kick this out and it's Kent's fault if he does so.
A way to handle this would be with one person (or more) in between Kent and Linus. And maybe a separate tree only for changes and fixes from bcachefs that those people in between would forward to Linus. A staging of sorts.
Software takes up no space and there is no scarcity. Theoretically there could be any number of maintainers and what gets uptake is the de facto upstream. That's what people refer to when they talk about free software development in terms of meritocracy.
It's a damn shame too because bcachefs has some unique features/potential
It would be interesting to know how strict the rules are in the Linux kernel for other people. Other projects have nepotistic structures where some developers can do what they want but others cannot.
Anyway, if Linus had developed the kernel with this kind of strictness from the beginning, maybe it wouldn't have taken off. I don't see why experimental features should follow the rules for stable features.
Ideally they'd be developed outside the kernel until they are perfect, but Kent addresses this in his LWN comment: There is no funding/time to make that ideal scenario possible.
If you’re using experimental file systems, I’d expect you to be pretty competent in being able to hold your own in a storage emergency, like compiling a kernel if that’s the way out.
This is a made up emergency, to break the rules.
Like the sibling commenter, I suspect the word “experimental” is being used here to try and evade the rules that, somehow, every other part of the kernel manages to do just fine with.
That means Linus has to check each of his PRs assuming that it might be pushing the boundaries without warning.
No amount of post hoc justification gets you that trust back, not when this has happened multiple times now.
https://lore.kernel.org/linux-fsdevel/4xkggoquxqprvphz2hwnir...
The PR description would have been fine - if it had been in the right stage of the process.
There is always a point where you have to say "no I can't work with this person any more", but while you are still trying to it's worth trying to figure out why someone is behaving as they do.
bCacheFS, not BCA Chefs. I’m not clued into the kernel at this level so I racked my brain a bit.
Apparently bcachefs won't be the successor. Filesystem development for Linux needs a big shakeup.
Maybe a university could do it.
The amount of "mostly OK" and still an "unstable" RAID6 implementation. Not going to trust a file system with "mostly OK" device replace. Anecdotally, you can search the LKML and here for tons of data loss stories.
While you should have a backup of your data anyway.
Opens LKML archive hoping for another Linus rant.