Marc's Public Blog - Linux Hacking






2008/11/30 Magic Motherboard Crash And Raid Rebuild With DD Rescue
2008-11-30 00:00 in Linux
Less than a year after I built it, magic started rebooting almost daily, while one of its drives was showing some worrisome SMART errors. On the way back from Palo Alto Airport, with my fiancée's visiting family in tow, I thought I'd stop by the data center to swap the power supply and the bad drive. It was supposed to be a 10-minute job.
Yes, you already know the rest: it wasn't.

First, the machine never rebooted after I put in the new power supply, nor would it power up with the old one (well, the fans started, but no POST). I eventually gave up and brought the machine home for further diagnostics. In the end I found out that one of the CPU slots on the motherboard donated by benley had gone bad, and the machine would not boot with any CPU in it (the CPUs themselves still seemed ok).
Luckily, I had gotten an old machine called 'ins1' a while back as a spare in case something like this happened, so it was just a matter of switching motherboards and CPUs. Good thing I had planned for that.

The part where I screwed up is that I had to replace sda with a new drive that I had prepared. I had 6 drives in the machine and no way to tell which one was which, other than a label I had put on the front of the box for a case just like this. So I pulled the drive, put the new one in, and rebooted the machine with one CPU. I had meant to boot into single user mode, but I messed up the boot command line. When I tried to use SysRq to stop the multi-user boot, it didn't work (it turns out my mini keyboard couldn't send SysRq), so the machine eventually came up in multi-user mode and started writing to the degraded raid set.
It was only a bit later that I logged in and realized that I had pulled the wrong drive. Since the raid set had been written to, I couldn't just shut down and put the good drive back in without some amount of filesystem corruption (I did have to do this once because I had no choice, but it's not something you try first).
(Oh, and it was the wrong drive because during the install I had swapped that SATA board for another one, and the other board had its ports in reverse order, so my labels were in reverse order too...)

By then, I only had one choice left: rebuild onto a drive that was already good, using the failing drive, and sure enough the failing drive had bad sectors that prevented the rebuild from completing. I still could have forced the raid to discard the bad drive and rebuild the raid set, using force options to treat the drive I was rebuilding on as a good drive. That works perfectly if you haven't written to the raid set in between, but since I had, I figured I'd try to just clone the bad drive, since it only had about 5 bad blocks.

First, I went with dd conv=noerror,sync bs=512, but while googling during the long copy I found that there was a better way: GNU ddrescue (not to be confused with the older dd_rescue and its dd_rhelp helper). ddrescue does mostly the same thing, except that it copies bigger blocks until it hits an error, keeps a logfile so that a recovery can be resumed, and will retry bad blocks a few times before giving up on them. dd just skips them and replaces them with zeros, which you won't notice with rsync unless you call it with -c, and even then you need to know which file(s) have 0s inside, which is very non-trivial with a filesystem over lvm over raid5.
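
To make the difference with plain dd more concrete, here is a minimal Python sketch of the strategy described above: read in big blocks, drop down to single sectors only inside a block that fails, then go back and retry the bad sectors a few times. This is not ddrescue's actual algorithm or code. The names rescue_copy and _copy_range, the 512-byte sector size and the default block size are all assumptions made for illustration, and a real tool would also persist the list of bad sectors to a logfile so that an interrupted run can resume.

```python
SECTOR = 512  # bytes; the sector size assumed throughout this sketch


def _copy_range(src, dst, offset, length):
    """Copy one range from src to dst; raises OSError if the read fails."""
    src.seek(offset)
    data = src.read(length)
    dst.seek(offset)
    dst.write(data)


def rescue_copy(src, dst, size, block=64 * SECTOR, retries=3):
    """Copy size bytes from src to dst and return the offsets still unreadable."""
    bad = []
    # Fast pass: big reads, dropping to single sectors only inside a failing block.
    for start in range(0, size, block):
        length = min(block, size - start)
        try:
            _copy_range(src, dst, start, length)
        except OSError:
            for off in range(start, start + length, SECTOR):
                try:
                    _copy_range(src, dst, off, min(SECTOR, size - off))
                except OSError:
                    bad.append(off)
    # Retry passes: go back over the bad sectors a limited number of times.
    for _ in range(retries):
        if not bad:
            break
        still_bad = []
        for off in bad:
            try:
                _copy_range(src, dst, off, min(SECTOR, size - off))
            except OSError:
                still_bad.append(off)
        bad = still_bad
    return bad
```

Any offsets returned at the end are the sectors that never read cleanly. Those are the places where a dd conv=noerror,sync copy would have silently left zeros.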

The magic command is therefore: ddrescue -v -r 10 -d /dev/sda4 /dev/sdd4 log, which takes about 3 hours on a 250GB drive at an average speed of 25MB/s.
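
As a sanity check on that duration, the arithmetic can be sketched as follows. It assumes decimal units (250GB = 250e9 bytes, 25MB/s = 25e6 bytes per second) and a steady average speed; copy_hours is just a name picked for this example.

```python
def copy_hours(size_bytes, bytes_per_second):
    """Hours needed to copy size_bytes at a steady bytes_per_second."""
    return size_bytes / bytes_per_second / 3600


# 250 GB at 25 MB/s is 10,000 seconds.
print(f"{copy_hours(250e9, 25e6):.1f} hours")  # prints "2.8 hours"
```

That is a little under the 3 hours quoted, which leaves some room for the time spent retrying bad blocks.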

If ddrescue isn't able to rescue the bad blocks, in theory I should be able to recompute just those blocks from the parity and data on the other drives (including the one I was rebuilding on), hoping/assuming that those blocks weren't among the ones that got changed in the short time the good drive was out of the raid. Unfortunately, doing so is pretty non-trivial, and I couldn't find any tool to hand-pick sectors to rebuild in one direction vs the other (not to mention that it would be super error-prone).
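
The parity part on its own is simple, and a small Python sketch shows the principle: in raid5 the parity chunk of a stripe is the XOR of its data chunks, so any single missing chunk is the XOR of all the others. The hard part, which this sketch does not attempt, is mapping a bad sector on one drive to the matching offsets on the other drives (chunk size, parity rotation, partition offsets). The function name xor_blocks and the sample chunks are made up for illustration.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    if any(len(b) != len(blocks[0]) for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Three data chunks from one stripe, plus the parity that raid5 would store.
chunks = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xff\x00\xaa"]
parity = xor_blocks(chunks)

# Pretend chunks[1] sat on the unreadable sector: rebuild it from the rest.
rebuilt = xor_blocks([chunks[0], chunks[2], parity])
assert rebuilt == chunks[1]
```

This only gives the right answer if none of the surviving chunks in that stripe changed after the missing one was last in sync, which is exactly the assumption mentioned above.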
The good news is that ddrescue -r 10 was about right: it tried to re-read my bad block 3 times and was able to get the data off on the 3rd try, so I got a perfect mirror copy of my problem drive and won't have to wonder later which portion of which filesystem got a bunch of 0s in the middle of it. Yeah! :)
(the actual data wasn't that important, I had backups of most of it, but it would have been a bit of a pain to recreate, and I always use such an opportunity to learn about the different recovery techniques and tools so that I know what to do the day I come across something very important to restore, hopefully not my data :) )

