>Any guidelines for what's a good used enterprise SSD?
Look at the seller's other items. You want them to be selling data-center gear.
Look at how many they have to sell - someone clearing out a server won't have 1 drive, they'll have half a dozen plus.
Look for SMART data in the post / a guaranteed minimum health.
I mostly bought S3500/3600/3700 series Intel SSDs. The endurance numbers vary, so you'll need to look up whatever you find.
>The endurance numbers on the enterprise drives are just so much higher.
That plus I'm more confident they'll actually hit them
Micron, Samsung and Intel (enterprise, branded DCxxxx) / SK Hynix / Solidigm (Intel sold its SSD business to SK Hynix, which merged it into Solidigm) are the go-tos for brands. HGST can also be good.
The best guideline is buying from reputable sellers with a decent volume of business (eg, >1000 sales with high ratings on ebay) that focus on enterprise hardware and have a decent return/DOA policy.
You should expect these drives to be partially worn (regardless of the SMART data, which often gets wiped), if for no other reason than the secure-erasure process mandated by a lot of orgs' data security policies, which involves multiple intensive full-disk writes - but also because they've actually been used. Drives from recently released models (within 12 months, eg Micron 7600) are suspect, as that implies a bad batch or that they were misused - especially if they aren't write-focused drives. It's not uncommon for a penny-pinching mid-size or smaller business to buy the wrong drives, wreck them, and have their vendor/VAR reject the warranty claim. That said, it's not always the case; it's entirely possible to get perfectly good recently made drives from reputable second-hand sellers, just don't expect a massive discount in that case.
Otherwise, the best advice I can give you is that redundancy is your friend. If you can't afford to buy at least 2 drives for an array, you should probably stick to buying new. I've had a few lemons over the years, and since availability on the second-hand market for any given model can be variable and you tend to want to build arrays from like devices, you should purchase with the expectation that at least 1 per batch will be bad, just to be safe. Worst case scenario, you end up with an extra drive/hotspare.
I'd rather see 3PB used out of a rated 5PB of writes than an obviously faked 2 GiB written.
If you're US domestic market, then yeah, you can usually avoid Chinese vendors. If you're EU or elsewhere, China can often be the main/only source of affordable drives vs domestic market. Really depends (I don't shop for international buds/clients, but I constantly hear about how the homelabbers across the pond have significantly higher prices/lower availability for surplus enterprise gear in general)
Stick to the rules on reliable vendors with a return policy, buy in bulk with the expectation that some will be bad (not a given, but good to be prepared), and the only real downside of buying from China is longer shipping times.
My hunch is that they don't expose anything because that makes it harder to refund on warranty
- Consumer drives like the Samsung 980 Pro and WD SN850 Black use TLC as SLC when about 30+% of the drive is erased. In that state you can burst-write a bit less than 10% of the drive capacity at 5 GB/s. After that, it slows remarkably. If the filesystem doesn’t automatically trim free space, the drive will eventually be stuck in slow mode all the time.
- Write amplification factor (WAF) is not discussed. Random small writes and partial block deletions will trigger garbage collection, which ends up rewriting data to reclaim freed space in a NAND block.
- A drive with a lot of erased blocks can endure more TBW than one that has all user blocks with data. This is because garbage collection can be more efficient. Again, enable TRIM on your fs.
- Overprovisioning can be used to increase a drive’s TBW. If, before you write to your 0.3 DWPD 1024 GB drive, you partition it so you only use 960 GB, you now have (roughly) a 1 DWPD drive.
- Per the NVMe spec, there are indicators of drive health in the SMART log page.
- Almost all current datacenter or enterprise drives support an OCP SMART log page. This allows you to observe things like the write amplification factor (WAF), rereads due to ECC errors, etc.
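If you want to script that, here's a minimal sketch, assuming smartctl 7.x (for JSON output) and an example device path of /dev/nvme0; it pulls the standard NVMe health fields mentioned above. As far as I know the OCP extended log (WAF, ECC re-reads, etc.) generally needs nvme-cli's OCP plugin rather than plain smartctl, so it isn't shown here.

    # Minimal sketch: read the standard NVMe SMART health fields via smartctl's
    # JSON output (smartctl >= 7.0, usually needs root). /dev/nvme0 is an example.
    import json
    import subprocess

    def nvme_health(device="/dev/nvme0"):
        out = subprocess.run(
            ["smartctl", "-a", "-j", device],
            capture_output=True, text=True, check=False,  # smartctl uses nonzero exits for warnings
        )
        log = json.loads(out.stdout).get("nvme_smart_health_information_log", {})
        return {
            # percentage of the rated endurance consumed (can exceed 100)
            "percentage_used": log.get("percentage_used"),
            # remaining factory-overprovisioned spare, as a percentage
            "available_spare": log.get("available_spare"),
            "available_spare_threshold": log.get("available_spare_threshold"),
            # one data unit = 512,000 bytes per the NVMe spec
            "data_units_written": log.get("data_units_written"),
            "media_errors": log.get("media_errors"),
        }

    if __name__ == "__main__":
        print(nvme_health())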
> - Consumer drives like the Samsung 980 Pro and WD SN850 Black use TLC as SLC when about 30+% of the drive is erased. In that state you can burst-write a bit less than 10% of the drive capacity at 5 GB/s. After that, it slows remarkably. If the filesystem doesn’t automatically trim free space, the drive will eventually be stuck in slow mode all the time.
This is true, but despite all of the controversy about this feature it’s hard to encounter this in practical consumer use patterns.
With the 980 Pro 1TB you can write 113GB before it slows down. (Source https://www.techpowerup.com/review/samsung-980-pro-1-tb-ssd/... ) So you need to be able to source that much data from another high speed SSD and then fill nearly 1/8th of the drive to encounter the slowdown. Even when it slows down you’re still writing at 1.5GB/sec. Also remember that the drive is factory overprovisioned so there is always some amount of space left to handle some of this burst writing.
For as much as this fact gets brought up, I doubt most consumers ever encounter this condition. Someone who is copying very large video files from one drive to another might encounter it on certain operations, but even in slow mode you’re filling the entire drive capacity in about 10 minutes.
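As a back-of-envelope check, using the numbers from that review (113 GB of SLC-cache burst at ~5 GB/s, then ~1.5 GB/s):

    # Rough fill-time estimate for a 1 TB 980 Pro using the figures cited above:
    # ~113 GB at SLC-cache speed, the rest at the post-cache rate.
    burst_gb, burst_gbps = 113, 5.0        # GB written at ~5 GB/s
    capacity_gb, steady_gbps = 1000, 1.5   # total GB, ~1.5 GB/s once the cache is full

    seconds = burst_gb / burst_gbps + (capacity_gb - burst_gb) / steady_gbps
    print(f"~{seconds / 60:.0f} minutes to fill the whole drive")  # ~10 minutes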
This has always been the case, which is why even a decade ago the “pro” drives were odd sizes like 120GB vs 128GB.
Products like that still exist today and the problem tends to show up as drives age and that pool shrinks.
DWPD and TBW ratings, like modern consumer drives use, are just different ways of communicating that contract.
FWIW, if you do a drive-wide discard and then only partition 90% of the drive, you can dramatically improve the garbage-collection slowdown on consumer drives.
In the world of ML and containers you can hit that if you, say, have fstrim scheduled once a week to avoid the cost of online discards.
I would rather have visibility into the size of the reserve space through SMART, but I doubt that will happen.
I think it is safe to say that all drives have this. Refer to the available spare field in the SMART log page (likely via smartctl -a) to see the percentage of factory overprovisioned blocks that are still available.
I hypothesize that as this OP space dwindles writes get slower because they are more likely to get bogged down behind garbage collection.
> I doubt most consumers ever encounter this condition. Someone who is copying very large video files from one drive to another might encounter it on certain operations
I agree. I agree so much that I question the assertion that drive slowness is a major factor in machines feeling slow. My slow laptop is about 5 years old. Firefox spikes to 100+% CPU for several seconds on most page loads. The drive is idle during that time. I place the vast majority of the blame on software bloat.
That said, I am aware of credible assertions that drive wear has contributed to measurable regression in VM boot time for a certain class of servers I’ve worked on.
113GB is pretty easily reached with video files.
> you’re still writing at 1.5GB/sec.
Except for a few seconds at the start, the whole process runs as if you had PCIe 2.0 (15+ years ago). Even with SSDs this fast, there's no chance of making a quick backup/restore, and during the restore you're too slow for the second time in a row.
It's crazy that back in the era of slow PCIe 1.0, fast SLC was what got used, while now with PCIe 5.0, when you really need fast SLC, you get slow TLC, very slow QLC, or even worse, PLC on the way.
I'm also not a fan of the "buy bigger storage" concept, or the conspiracy theory about 480 vs 512.
It sure would be nice if, when considering a product, you could just look at some claimed stats from the vendor about time-related degradation, firmware sparing policy, etc. We shouldn't have to guess!
I don't understand why this is being called a "conspiracy theory"; but if you want some very concrete evidence that this is how they work, a paper was recently published that analyzed the behavior and endurance of various SSDs, and the data would be very difficult to explain with any other theory than this: comparing apples to apples, the drives with better write endurance are merely overprovisioned so the wear-leveling algorithm doesn't cause as much write amplification while reorganizing.
https://news.ycombinator.com/item?id=44985619
> OP on write-intensive SSD. SSD vendors often offer two versions of SSDs with similar hardware specifications, where the lower-capacity model is typically marketed as “write-optimized” or “mixed-use”. One might expect that such write-optimized SSDs would demonstrate improved WAF characteristics due to specialized internal designs. To investigate this, we compared two Micron SSD models: the Micron 7450 PRO, designed for “read-intensive” workloads with a capacity of 960 GB, and the Micron 7450 MAX, intended for “mixed-use” workloads with a capacity of 800 GB. Both SSDs were tested under identical workloads and dataset sizes, as shown in Figure 7b. The WAF results for both models were identical and closely matched the results from the simulator. This suggests that these Micron SSDs, despite being marketed for different workloads, are essentially identical in performance, with the only difference being a larger OP on the “mixed-use” model. For these SSD models, there appear to be no other hardware or algorithmic improvements. As a result, users can achieve similar performance by manually reserving free space on the “read-intensive” SSD, offering a practical alternative to purchasing the “mixed-use” model.
Yes.
You need years from that SSD? Buy a drive with DWPD > 3.
You are a cheap ass and have the money only for a DWPD 0.3 drive? Replace it every year.
You are not sure what your usage would be? Over-provision by buying a bigger drive than you need.
And while we are at it: no, leaving >= 25% of the drive empty for drives > 480GB is just idiotic. Either buy a bigger drive or use common sense - even 10% of a 480GB drive is already 48GB, and for a 2048GB drive it's ~205GB.
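On the "buy a bigger drive than you need" point, the arithmetic is straightforward. A rough sketch with made-up TBW figures (on top of this, the extra free space also lowers write amplification, which is harder to quantify):

    # Rough sketch of why "buy a bigger drive than you need" works as
    # over-provisioning: rated TBW scales with capacity, your workload doesn't.
    # All drive figures below are made up for illustration.
    def years_of_life(tbw_tb: float, workload_gb_per_day: float) -> float:
        return tbw_tb * 1000 / workload_gb_per_day / 365

    workload = 100  # GB written per day
    for capacity_tb, tbw in [(1, 600), (2, 1200), (4, 2400)]:
        print(f"{capacity_tb} TB drive ({tbw} TBW): "
              f"~{years_of_life(tbw, workload):.0f} years at {workload} GB/day")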
In consumer drives. Often it's not even a hardware failure but a firmware one, though to most consumers this is splitting hairs, as the drive is still "dead" - the common ingress points to fix this are not present/disabled on consumer-class drives (thus the blurb at the end of that section about physically swapping controllers). Also, cell failure is far more prevalent than controller failure in drives that lack a DRAM/SLC cache (aka transition flash) layer. Controllers still fail, even at the hardware level, for enterprise and consumers alike though; it's a prevalent issue (pro tip: monitor and rectify the thermals and the prevalence of this problem drops significantly).
> Failure to retain charge: typically, only seen in SSDs, thumb drives, and similar devices left unpowered for long periods of time.
Also happens to flash that sees lots of writes, power cycles, or frequent significant temperature fluctuations. This is more common on portable media (thumb drives) or mobile devices (phones, laptops, especially thin ones).
> Now, let’s take a look at the DC600M Series 2.5” SATA Enterprise SSD datasheet for one of my favorite enterprise-grade drives: Kingston’s DC600M.
Strange choice of drive, but okay - especially considering they don't talk about any of its features that actually make it an enterprise version as opposed to their consumer alternatives: power loss protection, transition flash/DRAM cache, controller and diagnostics options, etc.
> Although Kingston’s DC600M is 3D TLC like Samsung’s EVO (and newer “Pro”) models, it offers nearly double the endurance of Samsung’s older MLC drives, let alone the cheaper TLC! What gives?
For starters, the power regulation and delivery circuitry on enterprise-grade drives tends to be more robust (usually, even on a low-end drive like the DC600M), so the writes that wear the cells are much less likely to actually cause wear due to out-of-spec voltage/current. Their flash topology, channels, bit widths, redundancy (for wear levelling/error correction), etc. are also typically significantly improved. All of these things are FAR more important than the TLC/SLC/MLC discussion they dive into. None of them is a given just because someone brands something an "enterprise drive", but they are things enterprises care about, whereas consumers typically don't have workloads where such considerations make a meaningful difference - they can just use DWPD, or brute force it by vastly overbuying capacity, to evaluate what works for them.
> One might, for example, very confidently expect 20GB per day to be written to a LOG vdev in a pool with synchronous NFS exports, and therefore spec a tiny 128GB consumer SSD rated for 0.3 DWPD... On the surface, this seems more than fine:
Perhaps, but let me stop you right there, as the math that follows is irrelevant for the context presented. You should be asking what kind of DRAM/transition flash (typically SLC if not DRAM) is present in the drive and how the controller handles it (also whether it has PLP) before you ever consider DWPD. If your (S)LOG's payloads fit within the controller's cache size, and that's its only meaningful workload, then 0.3 DWPD is totally fine, as the actual NAND cells that make up the available capacity will experience much less wear than if there were no cache present on the drive.
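For reference, the surface-level math from the quoted example, and why it only "seems more than fine" until write amplification and cache behaviour enter the picture (rough sketch, WAF values purely illustrative):

    # Naive write budget from the quoted example: a 128 GB drive rated 0.3 DWPD
    # vs. 20 GB/day of synchronous log writes. The WAF values are illustrative -
    # small sync writes that miss a DRAM/SLC cache can amplify badly, which is
    # exactly what the raw DWPD math ignores.
    capacity_gb, dwpd_rating, host_writes = 128, 0.3, 20
    budget = capacity_gb * dwpd_rating                # GB of NAND writes/day the rating allows
    for waf in (1, 4):
        nand_writes = host_writes * waf
        verdict = "fits" if nand_writes <= budget else "exceeds the rating"
        print(f"WAF {waf}: {nand_writes} GB/day of NAND writes vs {budget:.1f} budget -> {verdict}")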
Furthermore, regardless of specific application, if your burstable payloads exceed whatever cache layer your drive can handle, you're going to see much more immediate performance degradation, entirely independent of wear on any of your components. This is one area that significantly separates consumer flash from enterprise flash - not QLC/TLC/MLC or how many 3D stacks of it there are. That stuff IS relevant, but it's equally relevant for enterprise and consumer, and it's first and foremost a function of cost and capacity rather than endurance, performance, or anything else.
This is an example of how DWPD is a generic metric that can be broadly used, but when you get into the specifics of use, it can kinda fall on its face.
Thermals are also very important to both endurance/wear and performance, and often go overlooked/misunderstood.
DWPD is not as important as it once was when flash was expensive, drive capacity limited, and there was significantly more overhead in scaling drives up (to vastly oversimplify, a lot fewer PCIe lanes available), but it's still a valuable metric. And like any individual metric, in isolation it can only tell you so much, and different folks/contexts will have different constraints and needs.
Note: kudos to them for bringing up that not all DWPD is equal. Some vendors report DWPD endurance over 3 years instead of 5 to artificially inflate the metric - something to be aware of.
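A quick way to put quoted DWPD figures on the same footing (numbers made up):

    # Two drives can quote identical TBW yet different DWPD simply because one
    # vendor assumes a 3-year warranty window and another 5 years. Normalizing
    # to a common window makes the numbers comparable.
    def normalize_dwpd(dwpd: float, quoted_years: float, to_years: float = 5.0) -> float:
        return dwpd * quoted_years / to_years

    # A "1 DWPD (3-year)" drive is only ~0.6 DWPD on the usual 5-year basis.
    print(f"{normalize_dwpd(1.0, quoted_years=3):.2f} DWPD over 5 years")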
TL;DR: DWPD, IOPS, capacity and price are all perfectly valid ways to evaluate flash drives, especially in the consumer space. As your concerns get more specific/demanding/"enterprise", they come with more and more caveats/nuance, but that's true of any metric for any device tbh.
igtztorrero•2mo ago
Happened to me last week.
I just put it in a plastic bag in the freezer for 15 minutes, and it worked.
I made a copy to my laptop and then installed a new server.
But it doesn't always work like a charm.
Please always have a backup for documents, and a recent snapshot for critical systems.
lvl155•2mo ago
zamadatix•2mo ago
And regularly test that restores actually work - nothing worse than thinking you had backups and then they don't restore right.
serf•2mo ago
drive controllers on HDDs just suddenly go to shit and drop off buses, too.
I guess the difference being that people expect the HDD to fail suddenly whereas with a solid state device most people seem to be convinced that the failure will be graceful.
PunchyHamster•2mo ago
Usually it either starts returning media errors, or slows down (and if it's not replaced in time, a slowing-down drive usually turns into a media-error one).
SSDs (at least the big fleet of Samsung ones we had) are much worse - they just go dead, not even turning read-only. Of course we have redundancy so it's not really a problem, but if the same happened on someone's desktop they'd be screwed if they don't have backups.
toast0•2mo ago
This is exactly the opposite of my lived experience. Spinners fail more often than SSDs, but I don't remember any sudden failures with spinners, as far as I can recall, they all have pre-failure indicators, like terrible noises (doesn't help for remote disks), SMART indicators, failed read/write on a couple sectors here and there, etc. If you don't have backups, but you notice in a reasonable amount of time, you can salvage most of your data. Certainly, sometimes the drives just won't spin up because of a bearing/motor issue; but sometimes you can rotate the drive manually to get it started and capture some data.
The vast majority of my SSD failures have been disappear from the bus; lots of people say they should fail read only, but I've not seen it. If you don't have backups, your data is all gone.
Perhaps I missed the pre-failure indicators from SMART, but it's easier when drives fail but remain available for inspection --- look at a healthy drive, look at a failed drive, see what's different, look at all your drives, predict which one fails next. For drives that disappear, you've got to read and collect the stats regularly and then go back and see if there was anything... I couldn't find anything particularly predictive. I feel disappear from the bus is more in the firmware error category vs physical storage problem, so there may not be real indications, unless it's a power on time based failure...
jandrese•2mo ago
toast0•2mo ago
The ones for relocated sectors, pending sectors, etc. When those add up to N, it's time to replace and you can calibrate that based on your monitoring cycle and backup needs. For a look every once in a while, single copy use case, I'd replace around 10 sectors; for daily monitoring, multiple copies, I'd replace towards 100 sectors. You probably won't get warranty coverage at those numbers though.
Mostly I've only seen the SMART status warning fire for too many power-on hours, which isn't very useful. Power-on hours isn't a good indicator of impending doom (unless there's a firmware error at specific values, which can happen for SSDs or spinners).
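If you want to automate that "replace at N sectors" rule of thumb, a rough sketch (assuming smartctl's JSON output on a SATA drive; the device path and threshold are just examples):

    # Sketch of the "replace once reallocated + pending sectors hit N" rule using
    # smartctl's JSON output (smartctl >= 7.0, usually needs root). Attribute IDs
    # 5 and 197 are the standard reallocated / current-pending sector counts.
    import json
    import subprocess

    WATCHED_IDS = {5, 197}  # Reallocated_Sector_Ct, Current_Pending_Sector

    def bad_sector_count(device="/dev/sda"):
        out = subprocess.run(["smartctl", "-A", "-j", device],
                             capture_output=True, text=True, check=False)
        table = json.loads(out.stdout).get("ata_smart_attributes", {}).get("table", [])
        return sum(a["raw"]["value"] for a in table if a["id"] in WATCHED_IDS)

    if __name__ == "__main__":
        threshold = 10  # the low end suggested above for infrequent monitoring
        count = bad_sector_count()
        print(f"{count} reallocated+pending sectors",
              "-> consider replacing" if count >= threshold else "-> OK for now")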
seanw444•2mo ago
> The vast majority of my SSD failures have been disappear from the bus; lots of people say they should fail read only, but I've not seen it. If you don't have backups, your data is all gone.
I just recovered data a couple weeks ago from my boss's SATA SSD that gave out and went read-only.
magicalhippo•2mo ago
I've had a fair number of HDDs throughout the years. My first one, well my dad's, was a massive 20 MB. I've had a 6+ disk ZFS pool going 24/7 since 2007. The oldest disks had over 7 years of on-time according to SMART data; I replaced them due to capacity.
Out of all that I've only had one HDD go poof gone. The infamous IBM Deathstar[1].
I've had some develop a few bad blocks and that's it, and one which just got worse and worse. But only one which died a sudden death.
Meanwhile I've had multiple SSDs which just stopped working suddenly. Articles write about them going into read-only mode but the ones I've had that went bad just stopped working.
[1]: https://en.wikipedia.org/wiki/Deskstar#IBM_Deskstar_75GXP_fa...
jandrese•2mo ago
I have had a few drives go completely read only on me, which is always a surprise to the underlying OS when it happens. What is interesting is you can't predict when a drive might go read-only on you. I've had a system drive that was only a couple of years old and running on a lightly loaded system claim to have exhausted the write endurance and go read only, although to be fair that drive was a throwaway Inland brand one I got almost for free at Microcenter.
If you really want to see this happen try setting up a Raspberry Pi or similar SBC off of a micro-SD card and leave it running for a couple of years. There is a reason people who are actually serious about those kinds of setups go to great lengths to put the logging on a ramdisk and shut off as much stuff as possible that might touch the disk.
fragmede•2mo ago
I think they’re complicated in different ways. A hard disk drive has to power up an electromagnet in a motor, move an arm, read the magnetic state of the part of the platter under the read head, and correlate that to something? Oh, and there are multiple read heads. Seems ridiculously complex!
jandrese•2mo ago
pkaye•2mo ago
But then as the years progressed, the transistors were made smaller and MLC and TLC were introduced, all to increase capacity, but it made the NAND worse in every other way: endurance, retention, write/erase performance, read disturb. It also makes the algorithms and error recovery process more complicated.
Another difficult thing is recovering the FTL mapping tables from a sudden power loss. Having those power loss protection capacitors makes it so much more robust in every way. I wish more consumer drives included them. It probably just adds $2-3 to the product cost.
namibj•2mo ago
dale_glass•2mo ago
What's that supposed to do for a SSD?
It was a trick for hard disks because on ancient drives the heads could get stuck to the platter, and that might help sometimes. But even for HDDs that's dubiously useful these days.
ahartmetz•2mo ago
butvacuum•2mo ago
Far more often it's the act of simply letting a device sit unpowered that 'fixes' the issue. Speculation on what changed invariably goes on indefinitely.
rcxdude•2mo ago
ssl-3•2mo ago
Stuck heads were/are part of the freezing trick.
Another part of that trick has to do with printed circuit boards and their myriad connections -- you know, the stuff that both HDDs and SSDs have in common.
Freezing them makes things on the PCB contract, sometimes at different rates, and sometimes that change makes things better-enough, long-enough to retrieve the data.
I've recovered data from a few (non-ancient) hard drives that weren't stuck at all by freezing them. Previous to being frozen, they'd spin up fine at room temperature and sometimes would even work well enough to get some data off of them (while logging a ton of errors). After being frozen, they became much more cooperative.
A couple of them would die again after warming back up, and only really behaved while they were continuously frozen. But that was easy enough, too: Just run the USB cable from the adapter through the door seal on the freezer and plug it into a laptop.
This would work about the same for an SSD, in that: If it helps, then it is helpful.