My first in-prod corrupted hard drive problem

https://blog.pavementlink.ch/2026/05/07/my-first-corrupted-hard-drive-problem/

19•r1chk1t•1h ago

Comments

Retr0id•1h ago

> So how were we able to recover the database and the data inside it? Most of the data was probably still intact, only a few sectors were unreadable. Once those were either restored (rewritten with a strong signal) or remapped by the drive’s firmware, the filesystem and the database engine could read the file end-to-end again. SQL Server pages also have checksums, so if any page came back wrong rather than unreadable, we’d have known. We got lucky: the corruption was at the magnetic-signal level, not at the “platter is scratched” level.

This doesn't quite seem to follow. As described, neither of the "recovery" methods actually restore lost data. So why weren't any of the SQL pages left in a bad state?

benlivengood•1h ago

As best as I can tell it was intermittent read failures on some sectors, not permanent failures.

So if you keep rereading that section of the disk you eventually get all the data, save it somewhere, write a bunch of new patterns over it, then write the original data and verify it reads back correctly many times.

I believe the article's analysis about RAID is wrong though; most controllers will start resilvering or just fail a drive once it experiences too many IO errors.

jtchang•1h ago

Confused as to the actual root cause. Don't all hard drives provide SMART diagnostics these days? Was it really bad sectors?

r1chk1t•31m ago

Yes there was bad sectors in the SMART diagnostic

pixel_popping•58m ago

I feel the pain OP.

Over the last decade, I've ran hundreds of servers if not thousands, and I entirely stopped using hard drives, now it's solely SSD/NVMe where the failure rate in practice is incredibly lower, I've had my fair share of middle-night runs because websites are offline or whatever to end-up in a hard drive diagnosis circus.

Imo, the peace of mind you get worth the cost, it also allows you to rethink development entirely, typical example would be that suddenly, copying all node_modules or rust deps is a great idea with 10Gbit/s bandwidth and fast drives (yes, I expect people to shit on me for saying this, please give me the counterarguments if you downvote me), many things change if you have a higher base performance assumption, storage is relatively cheap as well. I would never advise anyone that wants to run continuously in prod with low friction to get servers with HDD.

I get that for some use cases it's not possible, but for large majority of use cases, it's clearly not HDD that is the cost burden. $50 servers gets you TBs of SSD, of course don't go with VPS or "Cloud" if you intend to change your development based on new performance assumptions, it blows my mind the numbers of people paying thousand of dollars just to handle what, 100K visitors a day? That fits on a $100 server and a bunch of Kimsufi hosted across the world as a CDN.

People are overcomplicating infrastructure, big time (which leads to more problems, higher maintenance, security issues and so on).

Retr0id•54m ago

It is quite remarkable how quickly a modern SSD can scan over TBs of data, I'm less afraid of O(n) queries than I used to be.

toast0•42m ago

> Over the last decade, I've ran hundreds of servers if not thousands, and I entirely stopped using hard drives, now it's solely SSD/NVMe where the failure rate in practice is incredibly lower, I've had my fair share of middle-night runs because websites are offline or whatever to end-up in a hard drive diagnosis circus.

My experience is that (most) spinners give off reliable pre-failure indicators (if you take the time to look/script looking), but SSDs fail by disappearing from the bus. The SSDs do fail much less often, but they still fail from time to time and recovery is harder.

Either way, if your data is important to you/your customers, you really need a backup/recovery plan.

I dunno about recent pricing, but not so long ago, it felt like spinners had a pretty high price floor and SSDs didn't... If you don't need a lot of space, you could find a small SSD that was still around the same $/GB as a medium sized SSD, but for spinners, there's a floor in dollars and space. So if you don't need a lot of space, you save money with an SSD and get better perf for free... If you need a lot of space and not a lot of perf, big spinners are more attainable than big SSDs.

ryandrake•32m ago

> My experience is that (most) spinners give off reliable pre-failure indicators (if you take the time to look/script looking), but SSDs fail by disappearing from the bus. The SSDs do fail much less often, but they still fail from time to time and recovery is harder.

I'm not a pro, just a smalltime dork with a homelab. I use cheap WD HDDs on my NAS system connected to an LSI hardware RAID controller. I'll boast that I have a 100% record so far of preventing downtime and data loss by simply listening for the controller's audible alarm and swapping drives right away (I keep brand new spares). I also have offline backups, but have so far never needed them. Not sure how this would change if I moved to SSDs.

Felger•12m ago

Well, SAS disk tend to go in failed state immediately or very quickly, most of the time without going first through the warning state.

SATA disk are indeed generally more predictable failure-wise. Most issues are related to a failing head stack assembly. Rarely platter demagnetization for some disks (Toshiba laptop).

Other failure issues are usually related to a friggin' manufactured firmware issue from Dell, HP or Lenovo corp.

pixel_popping•25m ago

Agree with the diagnostic part.

> Either way, if your data is important to you/your customers, you really need a backup/recovery plan.

You'd be surprised at how many devs/companies walk on eggshells all the time (praying that the fatal moment never arrive) because they aren't "brave" enough to do a proper backup system, which is often few minutes/hours of setup only.

pshirshov•34m ago

So, you were not using a striped mirror ZFS for a prod database? What could go wrong, yep.

r1chk1t•31m ago

learned the hard way

proactivesvcs•25m ago

I'm surprised to have read to the end and found that they're still not performing any hardware monitoring and alerting. SMART may not always show up pre-failure warnings but when it does they can usually be trusted.

Felger•23m ago

Hi, I believe you are quite new to workstation/hardware admin. Lots of things to say here (not native english speaking so basic style, sorry for that) :

Disk errors logged in the system event log are from the I/O layer, low-level class driver (msahci.sys) / filter drivers. See Windows Storage Driver Architecture : https://learn.microsoft.com/en-us/windows-hardware/drivers/s...

A disk error of this type showing in the event log must immediately be treated as an actual disk issue. This is a low level issue below the actual filesystem and application/services. Seems here the .mdf/ldf of your SQL database used one or more bad sectors on the disk surface.

Your disk seems to be only one on the system, so the first thing to do is check SMART status, for example with Crystaldiskinfo (the most used and user-friendly free portable windows software).

It would very probably have shown a warning state for the internal disk, with probably one or more (judging the quantity of disk error entries in your log) for Attribute C5 "Current Pending sectors" and probably some in Att 05 - "Reallocated sectors count" and/or Att C4 - "Realloc event Count".

Second thing to do is trying to backup your data as fast as possible. In your case related to a Ms SQL database, trying to dump it / backup first was the good move. Sadly (DR pro experience here), weak surface / failing Head Stack Assembly of a traditionnal HDD from most vendors has more difficulties reading correctly a sector than writing it.

If the dump/backup fails, the second choice would have be to try to a sector-to-sector dump approach of the whole disk, with either a online (from OS) software capable of reading sectors from the boot disk (didn't try if HDD Raw Copy Tool 2.6 supports it), or an offline solution like Clonezilla, Acronis True Image, Aomei backupper, etc. But offline solution means offline computer and service...

I didn't exactly understood if you had an actual backup of the data or an image of the whole disk. Considering the critical usage of this station, you should have both running : daily data backup or more + up-to-date disk image ready. whatever the type of disk (HDD/SSD). And a spare, identical computer.

As for repair of HDDs "weak sectors" (meaning Current Pending Sectors), it is indeed possible, often with complete data recovery. If not, the sector will be left as is, or may be remapped if written to 0 (it will then shift from Current pending to realloc sector count).

Hard disk Sentinel Pro as such features (Disk repair, Quick Fix), it works quite well. The result vary greatly from one type of failure to another, as from one disk maker to another.

Note that if the SMART shows more than a little dozen of sectors, the head (amp/preamp) is probably failing, making weaker magnetically-wise sectors too difficult to read and/or write. In this case, the count of current and remapped increases every repair/check pass made by the tools. In this case, the drive is toast and must be replaced ASAP.

SSD are a complete different case for repair.

A older autonomous tool, Spinrite, was specialized for this usage (accurate recovery of data), but veeeeery slow.

RAID pertinence : fortunately, it is an expecteed case as most SATA disks are prone to HSA failure before not initialyzing at all. A RAID 1 mirror would have protected you from a mirrored defect accross the two disks.

The RAID controller (true hardware controller like LSI/Avago or Microsemi) or even fake raid like Intel RST / VROC maintain data integrity accross the array's disks. The defective disk will raise bad blocks (that will get marked in metadata of the Raid Volume), but the others disks are fine and the data can be read safely. If too many errors are reported on a disk (very few in fact on most controllers), it will be labelled as failed and taken down from the array.

Google Cloud Fraud Defence is just WEI repackaged

Discord Incident

AI is breaking two vulnerability cultures

Cartoon Network Flash Games

Man Finds $1M Worth of Yu-Gi-Oh Cards in a Dumpster

Serving a website on a Raspberry Pi Zero running in RAM

You gave me a u32. I gave you root. (io_uring ZCRX freelist LPE)

My first in-prod corrupted hard drive problem

An Introduction to Meshtastic

David Attenborough's 100th Birthday

Google Broke reCAPTCHA for De-Googled Android Users

A web page that shows you everything the browser told it without asking

PC Engine CPU

Roadside Attraction

How do I deal with memory leaks? (2022)

Show HN: GETadb.com – every GET request creates a DB

Rumors of my death are slightly exaggerated

Cloudflare to cut about 20% of its workforce

Mojo 1.0 Beta

Apple, Intel have reached preliminary chip-making deal

Poland is now among the 20 largest economies

US Government releases first batch of UAP documents and videos

Canvas online again as ShinyHunters threatens to leak schools’ data

What we lost the last time code got cheap

Maybe you shouldn't install new software for a bit

Show HN: Git for AI Agents

Podman rootless containers and the Copy Fail exploit

Ask HN: We just had an actual UUID v4 collision...

Dirtyfrag: Universal Linux LPE

GeoJSON

My first in-prod corrupted hard drive problem

Comments

Google Cloud Fraud Defence is just WEI repackaged

Discord Incident

AI is breaking two vulnerability cultures

Cartoon Network Flash Games

Man Finds $1M Worth of Yu-Gi-Oh Cards in a Dumpster

Serving a website on a Raspberry Pi Zero running in RAM

You gave me a u32. I gave you root. (io_uring ZCRX freelist LPE)

My first in-prod corrupted hard drive problem

An Introduction to Meshtastic

David Attenborough's 100th Birthday

Google Broke reCAPTCHA for De-Googled Android Users

A web page that shows you everything the browser told it without asking

PC Engine CPU

Roadside Attraction

How do I deal with memory leaks? (2022)

Show HN: GETadb.com – every GET request creates a DB

Rumors of my death are slightly exaggerated

Cloudflare to cut about 20% of its workforce

Mojo 1.0 Beta

Apple, Intel have reached preliminary chip-making deal

Poland is now among the 20 largest economies

US Government releases first batch of UAP documents and videos

Canvas online again as ShinyHunters threatens to leak schools’ data

What we lost the last time code got cheap

Maybe you shouldn't install new software for a bit

Show HN: Git for AI Agents

Podman rootless containers and the Copy Fail exploit

Ask HN: We just had an actual UUID v4 collision...

Dirtyfrag: Universal Linux LPE

GeoJSON