Less than a year after I built it, magic started rebooting almost daily while one of its drives was exhibiting some worrisome smart errors. On the way back from Palo Alto Aiport, with my fiancée's visiting family in tow, I thought I'd stop by the data center on the way, swap the power supply and the bad drive. It was supposed to be a 10mn job.|
Yes, you already know the rest, it wasn't.
First, the machine never rebooted after I put in the new power supply, nor would it power up with the old one (well, the fans started, but no POST). I eventually gave up and brought the machine home for further diagnostics. I found out in the end that one of the CPU slots on the motherboard donated by benley went bad, and the machine would not boot with any CPU in it (the CPUs themselves still seemed ok).
Luckily, I got an old machine called 'ins1' a while ago, as a spare should something like this happen, so it was just a matter of switching motherboards and CPUs. Good thing I had planned for that.
The part where I screwed up is that I had to replace sda with a new drive that I had prepared. I had 6 drives in the machine and no way to know which one was which outside of a label I had made on the front of the box, for a case just like this. So, I pulled the drive, and put a new one in and rebooted the machine with one CPU. I had meant to boot single user mode, but I messed up the boot command line, and when I tried to sysrq to stop multiuser, it wasn't working and the machine eventually booted in multi user mode and started to write on the degraded raid set. (turns out I had a mini keyboard that didn't support sending sysrq)
It's only a bit later that I logged in and realized that I had pulled the wrong drive and since I had written on the raidset I couldn't just shut down and put the good drive back in without some amount of filesystem corruption (I did have to do this once because I had no choice, but it's not something you do first).
(oh, and it was the wrong drive because during the install, I replaced that sata board for another one, and the other board had its port in reverse order, so my labels were also in reverse order...
By then, I only had once choice left, rebuild on a drive that was already good by using the failing drive, and sure enough the failing drive had bad sectors that prevented the rebuild to complete. I still could have forced the raid to discard the bad drive and rebuild the raidset by forcing options to use the drive I was rebuilding on, as a good drive. It works perfectly if you didn't write on raidset in between, but since I had, I figured I'd try to just clone the bad drive since it only had about 5 bad blocks.
First, I went with
dd conv=noerror,sync bs=512, but then googled during the long copy that there was a better way: Gnu ddrescue (don't get confused between that in the older dd_rescue and ddr_help). ddrescue is really mostly the same, except that it copies bigger blocks until it finds and error, had a logfile with recovery, and will retry bad blocks a few times before giving up on them (dd just skips them and replaces them with zeros, which you won't find with with rsync, unless you call rsync with
-c and you even know which file(s) have 0s in side, which is very non trivial with a filesystem over lvm over raid5).
The magic command is therefore:
ddrescue -v -r 10 -d /dev/sda4 /dev/sdd4 log
which takes about 3H on a 250GB drive at 25MB/s average speed.
If ddrescue isn't able to rescue the bad blocks, in theory I should be able to compute the parity for just those blocks from the other drives (including the one I was rebuilding on), hoping/assuming that those blocs weren't ones that got changed in the short amount of time the good drive was removed from the raid. Unfortunately, doing so is pretty non trivial, and there are no tools that I could find to hand pick sectors to rebuild in one direction vs another direction (not counting that it would be super error prone).
The good news is that ddrescue -r 10 was about right: it tried to re-read my bad block 3 times and was able to get the data off the 3rd time, so I got a perfect mirror copy of my drive with issues and won't have to wonder later which portion of which filesystem got a bunch of 0s in the middle of it. Yeah! :)
(the actual data wasn't that important, I had backups of most of it, but it would have been a bit of a pain to recreate, and I always use such an opportunity to learn about the different recovery techniques and tools so that I know what to do the day I come across something very important to restore, hopefully not my data :) )