Understanding ZFS Scrubs and Data Integrity

https://klarasystems.com/articles/understanding-zfs-scrubs-and-data-integrity/

93•zdw•3w ago

Comments

itchingsphynx•2w ago

>Most systems that include ZFS schedule scrubs once per month. This frequency is appropriate for many environments, but high churn systems may require more frequent scrubs.

Is there a more specific 'rule of thumb' for scrub frequency? What variables should one consider?

nubinetwork•2w ago

Total pool size and speed. Less data scrubs faster, as do faster disks or disk topology (a 3 way stripe of nvme will scrub faster than a single sata ssd)

For what its worth, I scrub daily mostly because I can. It's completely overkill, but if it only takes half an hour, then it can run in the middle of the night while I'm sleeping.

ssl-3•2w ago

The cost of a scrub is just a flurry of disk reads and a reduction in performance during a scrub.

If this cost is affordable on a daily basis, then do a scrub daily. If it's only affordable less often, then do it less often.

(Whatever the case: It's not like a scrub causes any harm to the hardware or the data. It can run as frequently as you elect to tolerate.)

agapon•2w ago

With HDDs, it's also mechanical wear and increased chance of a failure. SSDs are not fully immune to increased load either.

ssl-3•2w ago

Is there any evidence that suggests that reading from a hard drive (instead of it just spinning idle) increases physical wear in any meaningful way? Likewise, is there any evidence of this for solid-state storage?

rcxdude•2w ago

Yes. Hard drives have published "Annualized Workload Rate" ratings, which are in TB/year, and the manufacturers state there is no difference between reads and writes for the purpose of this rating.

(https://www.toshiba-storage.com/trends-technology/mttf-what-...)

For SSDs, writes matter a lot more. Reads may increase the temperature of the drive, so they'll have some effect, but I don't think I've seen a read endurance rating for an SSD.

digiown•2w ago

Reading from it requires the read head to move, as opposed to spinning idle where the heads are parked on the side. Moving parts generally wear out over time.

toast0•2w ago

Once a month seems like a reasonable rule of thumb.

But you're balancing the cost of the scrub vs the benefit of learning about a problem as soon as possible.

A scrub does a lot of I/O and a fair amount of computing. The scrub load competes with your application load and depending on the size of your disk(s) and their read bandwidth, it may take quite some time to do the scrub. There's even maybe some potential that the read load could push a weak drive over the edge to failure.

On my personal servers, application load is nearly meaningless, so I do an about monthly scrub from cron which I think will only scrub one zpool at a time per machine, which seems reasonable enough to me. I run relatively large spinning disks, so if I scrubbed on a daily basis, the drives would spend most of the day scrubbing and that doesn't seem reasonable. I haven't run ZFS in a work environment... I'd have to really consider how the read load impacted the production load and if scrubbing with limits to reduce production impact would complete in a reasonable amount of time... I've run some systems that are essentially alwayd busy and if a scrub would take several months, I'd probably only scrub when other systems indicate a problem and I can take the machine out of rotation to examine it.

If I had very high reliability needs or a long time to get replacement drives, I might scrub more often?

If I was worried about power consumption, I might scrub less often (and also let my servers and drives go into standby). The article's recommendation to scan at least once every 4 months seems pretty reasonable, although if you have seriously offline disks, maybe once a year is more approachable. I don't think I'd push beyond that, lots of things don't like to sit for a year and then turn on correctly.

atmosx•2w ago

Once a month is fine ("/etc/cron.monthly/zfs-scrub"):

    #!/bin/bash
    #
    # ZFS scrub script for monthly maintenance
    # Place in /etc/cron.monthly/zfs-scrub
    
    POOL="storage"
    TAG="zfs-scrub"
    
    # Log start
    logger -t "$TAG" -p user.notice "Starting ZFS scrub on pool: $POOL"
    
    # Run the scrub
    if /sbin/zpool scrub "$POOL"; then
        logger -t "$TAG" -p user.notice "ZFS scrub initiated successfully on pool: $POOL"
    else
        logger -t "$TAG" -p user.err "Failed to start ZFS scrub on pool: $POOL"
        exit 1
    fi
    
    exit 0

chungy•2w ago

That script might do with the "-w" parameter passed to scrub. Then "zpool scrub" won't return until the scrub is finished.

k_bx•2w ago

Didn't know about the logger script, looks nice. Can it wrap the launch of the scrub itself so that it logs like logger too, or do you separately track its stdout/stderr when something happens?

update: figured how you can improve that call to add logs to logger

nubinetwork•2w ago

Scrub doesn't log anything by default, you run it and it returns quickly... you have to get the results out of zpool status or through zed.

atmosx•2w ago

The status of the pool is tracked by Prometheus though an ZFS exporter and faults are reported via alert manager & pushover.

kanbankaren•2w ago

Once a month might be too high because HDDs are rated at ~ 180 TB workload/year. Remember, the workload/year limit includes read & writes and doesn't vary much by capacity, so a 10 TB HDD scrubbed monthly consumes 67% of the workload, let alone any other usage.

Scrubbing every quarter is usually sufficient without putting high wear on the HDD.

Hakkin•2w ago

A scrub only reads allocated space, so in your 10TB example, a scrub would only read whatever portion of that 10TB is actually occupied by data. It's also usually recommended to keep your usage below 80% of the total pool size to avoid performance issues, so the worst case in your scenario would be more like ~53% assuming you follow the 80% rule.

formerly_proven•2w ago

Is the 80% rule real or just passed down across decades like other “x% free” rules? Those waste enormous amounts of resources on modern systems and I kind of doubt ZFS actually needs a dozen terabytes or more of free space in order to not shit the bed. Just like Linux doesn’t actually need >100 GB of free memory to work properly.

cornonthecobra•2w ago

Speaking strictly about ZFS internal operations, the free space requirement is closer to 5% on current ZFS versions. That allows for CoW and block reallocations in real-world pools. Heavy churn and very large files will increase that margin.

barrkel•2w ago

In practice you see noticeable degradation of performance for streaming reads of large files written after 85% or so. Files you used to be able to expect to get 500+MB/sec could be down to 50MB/sec. It's fragmentation, and it's fairly scale invariant, in my experience.

magicalhippo•2w ago

> Is the 80% rule real or just passed down across decades like other “x% free” rules?

As I understand it, the primary reason for the 80% was that you're getting close to another limit, which IIRC was around 90%, where the space allocator would switch from finding a nearby large-enough space to finding the best-fitting space. This second mode tanks performance and could lead to much more fragmentation. And since there's no defrag tool, you're stuck with that fragmentation.

It has also changed, now[1] the switch happens at 96% rather than 90%. Also the code has been improved[2] to better keep track of free space.

However, performance can start to degrade before you reach this algorithm switch[3], as you're more likely to generate fragmentation the less free space you have.

However, it was also a generic advice, which was ignorant to your specific workload. If you have a lot of cold data, low churn but it's fairly equal in size, then you're probably less affected than if you have high churn with lots of files of varied sizes.

[1]: https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

[2]: https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSZpoolFra...

[3]: https://www.bsdcan.org/2016/schedule/attachments/366_ZFS%20A...

kanbankaren•2w ago

Still 53% of the useful life of a HDD for just scrubbing is excessive.

You don't lose tracks in 3 months. If you don't read the tracks for a year and if the HDD is operated in high temperatures, then the controller might struggle to read them.

The very act of scrubbing generates heat, so we should use it sparingly.

barrkel•2w ago

I scrub once a quarter because scrubs take 11 days to complete. I have 8x 18TB raidz2 pool, and I keep a couple of spare drives on hand so I can start a resilver as soon as an issue crops up.

In the past, I've gone for a few years between scrubs. One system had a marginal I/O setup and was unreliable for high streaming load. When copying the pool off of it, I had to throttle the I/O to keep it reliable. No data loss though.

Scrubs are intensive. They will IMO provoke failure in drives sooner than not doing them. But they're the kind of failures you want to bring forward if you can afford the replacements (and often the drives are under warranty anyway).

If you don't scrub, eventually you generally start seeing one of two things: delays in reads and writes because drive error recovery is reading and rereading to recover data; or, if you have that disk behaviour disabled via firmware flags (and you should, unless you're reslivering and on your last disk of redundancy), you see zfs kicking a drive out of the pool during normal operations.

If I start seeing unrecoverable errors, or a drive dropping out of the pool, I'll disable scrubs if I don't have a spare drive on hand to start mirroring straight away. But it's better to have the spares. At least two, because often a second drive shows weakness during resilver.

There is a specific failure mode that scrubs defend against: silent disk corruption that only shows up when you read a file, but for files you almost never read. This is a pretty rare occurrence - it's never happened to me in about 50 drives worth of pools over 15 years or so. The way I think about this is, how is it actionable? If it's not a failing disk, you need to check your backups. And thus your scrub interval should be tied to your backup retention.

thatcks•2w ago

The article is correct but it downplays an important limitation of ZFS scrubs when it talks about how they're different from fsck and chkdsk. As the article says (in different words), ZFS scrubs do not check filesystem objects for correctness and consistency; it only checks that they have the expected checksum and so have not become corrupted due to disk errors or other problems. Unfortunately it's possible for ZFS bugs and issues to give you filesystem objects that have problems, and as it stands today ZFS doesn't have anything that either checks or corrects these. Sometimes you find them through incorrect results; sometimes you discover they exist through ZFS assertion failures triggering kernel panics.

(We run ZFS in production and have not been hit by these issues, at least not that we know about. But I know of some historical ZFS bugs in this area and mysterious issues that AFAIK have never been fully diagnosed.)

mustache_kimono•2w ago

    "Scrubs differ significantly from traditional filesystem checks. Tools such as fsck or chkdsk examine logical structures and attempt to repair inconsistencies related to directory trees, allocation maps, reference counts, and other metadata relationships. ZFS does not need to perform these operations during normal scrubs because its transactional design ensures metadata consistency. Every transaction group moves the filesystem from one valid state to another. The scrub verifies the correctness of the data and metadata at the block level, not logical relationships."

> ZFS scrubs do not check filesystem objects for correctness and consistency; it only checks that they have the expected checksum and so have not become corrupted due to disk errors or other problems

A scrub literally reads the object from disk. And, for each block, the checksums are read up the tree. The object is therefore guaranteed to be correct and consistent at least re: the tree of blocks written.

> Unfortunately it's possible for ZFS bugs and issues to give you filesystem objects that have problems

Can you give a more concrete example of what you mean? It sounds like you have some experience with ZFS, but "ZFS doesn't have an fsck" is also some truly ancient FUD, so you will forgive my skepticism.

I'm willing to believe that you request an object and ZFS cannot return that object because of ... a checksum error or a read error in a single disk configuration, but what I have never seen is a scrub that indicates everything is fine, and then reads which don't return an object (because scrubs are just reads themselves?).

Now, are things like pool metadata corruption possible in ZFS? Yes, certainly. I'm just not sure fsck would or could help you out of the same jam if you were using XFS or ext4. AFAIK fsck may repair inconsistencies but I'm not sure it can repair metadata any better than ZFS can?

ori_b•2w ago

Imagine a race condition that writes a file node where a directory node should be. You have a valid object with a valid checksum, but it's hooked into the wrong place in your data structure.

mustache_kimono•2w ago

> Imagine a race condition that writes a file node where a directory node should be. You have a valid object with a valid checksum, but it's hooked into the wrong place in your data structure.

A few things: 1) Is this an actual ZFS issue you encountered or is this a hypothetical? 2) And -- you don't imagine this would be discovered during a scrub? Why not? 3) But -- you do imagine it would be discovered and repaired by an fsck instead? Why so? 4) If so, wouldn't this just be a bug, like a fsck, not some fundamental limitation of the system?

FWIW I've never seen anything like this. I have seen Linux plus a flaky ALPM implementation drop reads and writes. I have seen ZFS notice at the very same moment when the power dropped via errors in `zpool status`. I do wonder if ext4's fsck or XFS's fsck does the same when someone who didn't know any better (like me!) sets the power management policy to "min_power" or "med_power_with_dipm".

ori_b•2w ago

Here's an example: https://www.illumos.org/issues/17734. But it would not be discovered by a scrub because the hashes are valid. Scrubs check hashes, not structure. It would be discovered by a fsck because the structure is invalid. Fscks check structure, not hashes.

They are two different tools, with two different uses.

mustache_kimono•2w ago

> Scrubs check hashes, not structure.

How is the structure not valid here? Can you explain to us how an fsck would discover this bug (show an example where an fsck fixed a similar bug) but ZFS could never? The point I take contention with is that missing an fsck is a problem for ZFS, so more specifically can you answer my 4th Q:

>> 4) If so, wouldn't this just be a bug, like (a bug in) fsck, not some fundamental limitation of the system?

So -- is it possible an fsck might discover an inconsistency ZFS couldn't? Sure. Would this be a fundamental flaw of ZFS, which requires an fsck, instead of merely a bug? I'm less sure.

You do seem to at least understand my general contention with the parent's point. However, the parent is also making a specific claim about a bug which would be extraordinary. Parent's claim is this is a bug which a scrub, which is just a read, wouldn't see, but a subsequent read would reveal.

So -- is it possible an fsck might discover this specific kind of extraordinary bug in ZFS, after a scrub had already read back the data? Of that I'm highly dubious.

ori_b•2w ago

> Can you show us how an fsck would discover this bug but ZFS could never?

I'd have to read closer to be certain, but if my understanding of it is correct, you'd have orphaned objects in the file system. The orphaned objects would be detectable, but would have correct hashes.

Here's a better example of one where scrub clearly doesn't catch it, but fsck would (and zdb does). The metaslab structure could be checked in theory, but isn't. https://neurrone.com/posts/openzfs-silent-metaslab-corruptio...

> if so, wouldn't this just be a bug, like (a bug in) fsck, not some fundamental limitation of the system?

It's not a bug or a fundamental limitation of the system, it's just that fsck != scrub, and nobody has written the code for fsck. If someone wanted to write the code, they could. I suspect it wouldn't even be particularly hard.

But fsck != scrub, and they catch different things.

agapon•2w ago

Generally, it's possible to have data which is not corrupted but which is logically inconsistent (incorrect).

Imagine that a directory ZAP has an entry that points to a bogus object ID. That would be an example. The ZAP block is intact but its content is inconsistent.

Such things can only happen through a logical bug in ZFS itself, not through some external force. But bugs do happen.

If your search through OpenZFS bugs you will find multiple instances. Things like leaking objects or space, etc. That's why zdb now has support for some consistency checking (bit not for repairs).

mustache_kimono•2w ago

> Imagine that a directory ZAP has an entry that points to a bogus object ID. That would be an example. The ZAP block is intact but its content is inconsistent.

The above is interesting and fair enough, but a few points:

First, I'm not sure that makes what seems to be the parent's point -- that scrub is an inadequate replacement for an fsck.

Second, I'm really unsure if your case is the situation the parent is referring to. Parent seems to be indicating actual data loss is occurring. Not leaking objects or space or bogus object IDs. Parent seems to be saying she/he scrubs with no errors and then when she/he tries to read back a file, oops, ZFS can't.

rincebrain•2w ago

The two obvious examples that come to mind are native encryption bugs and spacemap issues.

Nothing about walking the entire tree of blocks and checking hashes validates the spacemaps - they only come up when you're dealing with allocating new blocks, and there have been a number of bugs where ZFS panics because the spacemaps say something insane, so you wind up needing to readonly import or discard the ZIL because it panics about trying to allocate an already-allocated segment if you import RW - and if your ondisk spacemaps are inconsistent in a way that discarding the ZIL doesn't work around, you would need some additional tool to try and repair this, because ZFS has no knobs for it.

Native encryption issues wouldn't be noticed because scrubbing doesn't attempt to untransform data blocks - you indirectly do that when you're walking the structures involved, but the L0 data blocks don't get decompressed or decrypted, since all your hashes are of the transformed blocks. And if you have a block where the hash in the metadata is correct but it doesn't decrypt, for any reason, scrub won't notice, but you sure will if you ever try to decrypt it.

mustache_kimono•2w ago

> The two obvious examples

Appreciate this rincebrain. Know that you know better than most and this certainly covers my 2nd point. I don't imagine these cases cover my first point though? These are not bugs of the type a fsck would catch?

rincebrain•2w ago

A fsck could pretty readily notice and repair the spacemap inconsistencies - zdb already generates its own spacemaps and compares to reality on import.

If you have the keys, technically nothing stops a fsck from noticing the encryption problems, but yes, usually it wouldn't unless you had some known issue you added special detection for, like when XFS years ago had problems with if you mounted it with inode64 once and then not the next time so the inode numbers would wraparound.

magicalhippo•2w ago

> Can you give a more concrete example of what you mean?

There's been several instances. For example, the send/receive code has had bugs leading to cases[1] where the checksum and hence scrub look fine but the data is not.

edit: the recent block cloning has also had some issues, eg[2][3].

I'm pretty sure it's also possible for hardware errors like bad memory to cause the data to get corrupted but the checksum gets computed on the corrupted data, thus it looks ok when scrubbed.

[1]: https://github.com/openzfs/zfs/issues/4809

[2]: https://github.com/openzfs/zfs/issues/15526

[3]: https://github.com/openzfs/zfs/issues/15933

SubjectToChange•2w ago

I like how ZFS doesn’t have “bugs”, it has “defects”.

mustache_kimono•2w ago

> There's been several instances.

I think you're missing the 2nd feature to the parent's point that I take issue with, which is this is not just a bug that a scrub wouldn't find, but it must also be a bug which an fsck would find.

The parent's point is -- ZFS should have an fsck tool because an fsck does something ZFS cannot do by other means. I disagree. Yes, ZFS has bugs like any filesystem. However, I'm not sure an fsck tool would make that situation better?

magicalhippo•2w ago

> I think you're missing the 2nd feature to the parent's point that I take issue with

You're right, I did gloss over that point.

I guess it should be noted that a lot of what fsck does[1] on say ext4 is something ZFS does on pool import, like replying the journal (ZIL) or trying older superblocks (uberblocks[2]). In that regard it's acting more like XFS[3] from what I can see, which just exits with 0.

[1]: https://linux.die.net/man/8/fsck.ext4

[2]: https://openzfs.github.io/openzfs-docs/man/master/8/zpool-im...

[3]: https://linux.die.net/man/8/fsck.xfs

magicalhippo•2w ago

> like replying the journal

Sigh... that's of course replaying the journal.

thatcks•2w ago

Two examples that I can find are https://github.com/openzfs/zfs/issues/7910, where very old versions of ZFS appear to have quietly written slightly incorrect ACL information, and https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/190... where Ubuntu 21.10 shipped with a bug that created corrupted ZFS filesystems. I believe https://www.illumos.org/issues/9847 may be another example of this, although less severe, where ZFS leaked disk space under some circumstances.

E39M5S62•2w ago

Ubuntu shipped with a bug that they introduced by way of a very badly done patch. While I get your point, I don't think it's fair to use Ubuntu as a source - they're grossly incompetent when it comes to handling ZFS.

mustache_kimono•2w ago

> Two examples that I can find

I think you may be misreading my point above. I am not arguing ZFS doesn't have bugs. That's nuts. I am arguing that the bug the parent says he has would be an extraordinary bug.

This is not just a bug that a scrub wouldn't find, but also it is a bug which an fsck would find. And it is not just a bug in the spacemaps or other metadata, but the parent's claim is this is a bug which a scrub, which is just a read, wouldn't see, but a subsequent read would reveal.

thatcks•2w ago

I am the parent, and any competent fsck should find these issues. Fsck traditionally explicitly verifies claimed free space against actual free space, and in a filesystem with ACLs it should also verify that filesystem level metadata like ACLs is sane, just as fsck verifies (for example) sane inode flags and inode field values. ZFS scrubs explicitly do not verify spacemap consistency or a lot of other sorts of consistency.

ZFS scrubs don't even verify that a filesystem's directory tree is acyclic and can reach every claimed in-use filesystem object, but I'm not aware of ZFS bugs in that area. This is because ZFS's 'metadata' for scrubs is much different than how it works in a traditional filesystem. To phrase it in conventional filesystem terms, ZFS has a big list of all in-use inodes, and it verifies the filesystem checksums by going through this list. The 'filesystem metadata' that a scrub verifies is the structure of this list of in-use inodes, plus some other things around the corners.

wereHamster•2w ago

A loooong time age (OpenSolaris days) I had a system that had corrupted its zfs. No fsck was available because the developers claimed (maybe still do) that it's unnecessary.

I had to poke around the raw device (with dd and such) to restore the primary superblock with one of the copies (that zfs keeps in different locations on the device). So clearly the zfs devs thought about the possibility of a corrupt superblock, but didn't feel the need to provide a tool to compare the superblocks and restore one from the other copies. That was the point when I stopped trusting zfs.

Such arrogance…

fvv•2w ago

it's still the case even with now openzfs ? what do you trust now ?

barrkel•2w ago

That's a fine fit of pique - and I once had an awkward file on one of my zfs pools, about three pools ago - but how does it leave you better off, if you want what zfs offers?

Dylan16807•2w ago

> That's a fine fit of pique

So you're rejecting a story about a real bug because...?

> how does it leave you better off

That's a really mercenary way to look at learning about your tools.

But presumably they take smaller risks around zfs systems than they otherwise would.

throw0101a•2w ago

> So clearly the zfs devs thought about the possibility of a corrupt superblock, but didn't feel the need to provide a tool to compare the superblocks and restore one from the other copies.

This mailing list post from 2008 talks about using zdb(8) to mark mark certain uberblocks an invalid so another one would be used:

* https://zfs-discuss.opensolaris.narkive.com/Tx4FaUMv/need-he...

ZDB = ZFS debugger. It's been there since the original Solaris release of ZFS.

> That was the point when I stopped trusting zfs.

As opposed to trusting other file systems and volume managers, which do not have checksums, and so you wouldn't even know about the problem in the first place?

rincebrain•2w ago

That's not using zdb to change anything - it's readonly, all the time. The person reached out and used dd on the disk to corrupt the copies of the uberblock with bad data so that ZFS would be forced to use the older ones (what zpool import -T does, basically, but doing it the hard way).

p_l•2w ago

In my experience[1], the fsck for given filesystem will happily replicate the errors, sometimes in random ways, because often it cannot figure which road to take in face of inconsistency. If anything, OpenZFS built upon that by now documenting the previously deeply hidden option to "rewind" ZFS uberblock if the breakage is recent enough.

[1] I've seen combination of ubuntu bug in packaging (of grub, of all things) and e2fsck nearly wipe a small company from existence, because e2fsck ended up trusting the data it got from superblock when it was not consistent.

phil21•2w ago

> If anything, OpenZFS built upon that by now documenting the previously deeply hidden option to "rewind" ZFS uberblock if the breakage is recent enough.

One of the most "wizardry" moments in my career I've personally witnessed was a deep-level ZFS expert (core OpenZFS developer) we had on retainer come in during a sev0 emergency and rapidly diagnose/rollback a very broken ZFS filesystem to a previous version from a few hours before the incident happened.

This was entirely user error (an admin connected redundant ZFS "heads" to the same JBOD in an incorrect manner so both thought they were primary and both wrote to the disks) that we caught more or less immediately so the damage was somewhat limited. At the time we thought we were screwed and would have to restore from the previous days backup with a multi-day (at best) time to repair.

This was on illumos a few years after the Solaris fork, so I don't think this feature was documented at the time. It certainly was a surprise to me, even though I knew that "in theory" such capability existed. The CLI incantations though were pure wizardry level stuff, especially watching it in real time with terminal sharing with someone who very much knew what they were doing.

klempner•2w ago

>HDDs typically have a BER (Bit Error Rate) of 1 in 1015, meaning some incorrect data can be expected around every 100 TiB read. That used to be a lot, but now that is only 3 or 4 full drive reads on modern large-scale drives. Silent corruption is one of those problems you only notice after it has already done damage.

While the advice is sound, this number isn't the right number for this argument.

That 10^15 number is for UREs, which aren't going to cause silent data corruption -- simple naive RAID style mirroring/parity will easily recover from a known error of this sort without any filesystem layer checksumming. The rates for silent errors, where the disk returns the wrong data that benefit from checksumming, are a couple of orders of magnitude lower.

iberator•2w ago

This is pure theory. Ber shouldn't be counted per sector etc? We shouldn't tread all disk space as single entity, IMO

thfuran•2w ago

Why would that make a difference unless some sectors have higher/lower error rates than others?

Dylan16807•2w ago

For a fixed bit error rate, making your typical error 100x bigger means it will happen 100x less often.

If the typical error is an entire sector, that's over 30 thousand bits. 1:1e15 BER could mean 1 corrupted bit every 100 terabytes or it could mean 1 corrupted sector every 4 exabytes. Or anything in between. If there's any more detailed spec for what that number means, I'd love to see it.

digiown•2w ago

This stat is also complete bullshit. If it were true, your scrubs of any 20+TB pool would get at least corrected errors quite frequently. But this is not the case.

The consumer grade drives are often given an even lower spec of 1 in 1e14. For a 20TB drive, that's more than one error every scrub, which does not happen. I don't know about you, but I would not consider a drive to be functional at all if reading it out in full would produce more than one error on average. Pretty much nothing said on that datasheet reflects reality.

alexfoo•2w ago

> This stat is also complete bullshit. If it were true, your scrubs of any 20+TB pool would get at least corrected errors quite frequently. But this is not the case.

I would expect the ZFS code is written with the expected BER in mind. If it reads something, computes the checksum and goes "uh oh" then it will probably first re-read the block/sector, see that the result is different, possibly re-read it a third time and if all OK continue on without even bothering to log an obvious BER related error. I would expect it only bothers to log or warn about something when it repeatedly reads the same data that breaks the checksum.

Caveat Reddit but https://www.reddit.com/r/zfs/comments/3gpkm9/statistics_on_r... has some useful info in it. The OP starts off with a similar premise that a BER of 10^-14 is rubbish but then people in charge of very large pools of drives wade in with real world experience to give more context.

digiown•2w ago

That's some very old data. I'm curious as to how stuff have changed with all the new advancements like helium drives, HAMR, etc. From the stats Backblaze helpfully publish, I feel like the huge amount of variance between models far outweigh the importance of this specific stat in terms of considering failure risks.

I also thought that it's "URE", i.e. unrecoverable with all the correction mechanisms. I'm aware that drives use various ways to protect against bitrot internally.

allanjude•2w ago

RAID would only be able to recover if it KNEW the data was wrong.

Without a checksum, hardware RAID has no way to KNOW it needs to use the parity to correct the block.

klempner•1w ago

My point is that the most common type of failure here has the drive returning an error, not silently returning bogus data.

kalleboo•2w ago

> HDDs typically have a BER (Bit Error Rate) of 1 in 1015, meaning some incorrect data can be expected around every 100 TiB read. That used to be a lot, but now that is only 3 or 4 full drive reads on modern large-scale drives

I remember this argument way back 16 years ago when the "Why RAID 5 stops working in 2009" article[0] blew up. It's BS. Those aren't the actual average error rates. Those are inflated error rates below which the manufacturer does not want to bother supplying a warranty for.

I have a pool with 260 TB worth of 10/14 TB disks in it 80% full, with monthly scrubs going back years. Not a single checksum error, and in total something like 30 reallocated sectors seen in SMART (half of those on a 7 year old drive).

[0] https://www.zdnet.com/article/why-raid-5-stops-working-in-20...

mnw21cam•2w ago

Agreed. I have a couple of servers each with 168 hard drives, about 6 years old. A few hard drives are starting to fail. ZFS counts read errors (the drive reported an error because its checksum didn't match) and checksum errors (the drive returned data that was actually incorrect and ZFS caught it with a checksum). I have seen lots of read errors, but not a single checksum error yet. Though these are server-grade drives, which might be better than consumer-grade.

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

Haskell for all: Beyond agentic coding

Bye Bye Humanity: The Potential AMOC Collapse

SectorC: A C Compiler in 512 bytes (2023)

Total surface area required to fuel the world with solar (2009)

Speed up responses with fast mode

Software factories and the agentic moment

LLMs as the new high level language

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

Vocal Guide – belt sing without killing yourself

First Proof

Why there is no official statement from Substack about the data leak

Vouch

Wood Gas Vehicles: Firewood in the Fuel Tank (2010)

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Al Lowe on model trains, funny deaths and working with Disney

Start all of your commands with a comma (2009)

Show HN: A luma dependent chroma compression algorithm (image compression)

FDA intends to take action against non-FDA-approved GLP-1 drugs

The AI boom is causing shortages everywhere else

Learning from context is harder than we thought

Where did all the starships go?

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

Selection rather than prediction

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

I write games in C (yes, C) (2016)

An Update on Heroku

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

Haskell for all: Beyond agentic coding

Bye Bye Humanity: The Potential AMOC Collapse

SectorC: A C Compiler in 512 bytes (2023)

Total surface area required to fuel the world with solar (2009)

Speed up responses with fast mode

Software factories and the agentic moment

LLMs as the new high level language

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

Vocal Guide – belt sing without killing yourself

First Proof

Why there is no official statement from Substack about the data leak

Vouch

Wood Gas Vehicles: Firewood in the Fuel Tank (2010)

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Al Lowe on model trains, funny deaths and working with Disney

Start all of your commands with a comma (2009)

Show HN: A luma dependent chroma compression algorithm (image compression)

FDA intends to take action against non-FDA-approved GLP-1 drugs

The AI boom is causing shortages everywhere else

Learning from context is harder than we thought

Where did all the starships go?

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

Selection rather than prediction

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

I write games in C (yes, C) (2016)

An Update on Heroku

Understanding ZFS Scrubs and Data Integrity

Comments