Marc's Public Blog - Linux Hacking


All | Aquariums | Arduino | Btrfs | Cars | Cats | Clubbing | Computers | Dining | Diving | Electronics | Exercising | Festivals | Flying | Halloween | Hbot | Hiking | Linux | Linuxha | Mexico | Monuments | Museums | Outings | Public | Rc | Sciencemuseums | Solar | Tfsf | Trips

This page has a few of my blog entries about linux, but my main linux page is here
Picture of Linus

Here is a list of older linux event reports I made before my blog was started, then the rest are below
1996/11/18-21:Linux Pavillion Comdex Fall 1996 (photos only). I've been going since then to help at the linux pavillion.
1997/11/18-21: Linux Pavillion Comdex Fall 1997 (photos only)
1998/05/28-30: Linuxexpo 1998 (photos only)
1998/11/16-20: Linux Pavillion Comdex Fall 1998 (full report)
1998/11/11: Silicon Valley Tea Party (report with pictures)
1999/02/15: Windows Refund Day (report with pictures)
1999/03/20: SVLUG KTEH night (photos only)
1999/03/01-04: LinuxWorld Expo Winter 99 (complete report with many pictures)
1999/03/31: Mozilla Party one year anniversary (photos only)
1999/05/18-22: Linuxexpo 1999 (complete report with many pictures)
1999/06/07: June 99 Balug meeting with Linus
1999/08/09-12: LinuxWorld Expo Summer 99 (complete report with many pictures)
1999/11/15-19: Linux Business Show at Comdex Fall 1999 (full report with pictures)
2000/08/14-17: LinuxWorld Expo Summer 2000 (complete report with many pictures)
2001/01/17-20: Linux.conf.au/LCA 2001 (complete report with pictures)
2001/07/25-28: OLS 2001 (photos only)
2001/08/25: Linux 10th Anniversary (report with pictures)
2001/09/27-30: LinuxWorld Expo Summer 2001 report with pictures)
2001/11/05-10: ALS 2001 (photos only)
2002/06/26-29: OLS 2002 (photos only)
2003/01/20-25: LCA 2003 (photos only)
2003/07/23-26: OLS 2003 (photos only)
2004/01/12-17: LCA 2004 (photos only)
2004/07/21-24: OLS 2004 (photos only)
2005/04/18-23: LCA 2005 (photos only)
2006/01/24-28: LCA 2006 (photos only)
2007/01/17-21: LCA 2007 (photos only)

Here is a list of all the talks I've given:

And below are my blog posts:



>>> Back to post index <<<

2008/11/30 Magic Motherboard Crash And Raid Rebuild With DD Rescue
π 2008-11-30 01:01 in Linux
Less than a year after I built it, magic started rebooting almost daily while one of its drives was exhibiting some worrisome smart errors. On the way back from Palo Alto Aiport, with my fiancée's visiting family in tow, I thought I'd stop by the data center on the way, swap the power supply and the bad drive. It was supposed to be a 10mn job.
Yes, you already know the rest, it wasn't.

First, the machine never rebooted after I put in the new power supply, nor would it power up with the old one (well, the fans started, but no POST). I eventually gave up and brought the machine home for further diagnostics. I found out in the end that one of the CPU slots on the motherboard donated by benley went bad, and the machine would not boot with any CPU in it (the CPUs themselves still seemed ok).
Luckily, I got an old machine called 'ins1' a while ago, as a spare should something like this happen, so it was just a matter of switching motherboards and CPUs. Good thing I had planned for that.

The part where I screwed up is that I had to replace sda with a new drive that I had prepared. I had 6 drives in the machine and no way to know which one was which outside of a label I had made on the front of the box, for a case just like this. So, I pulled the drive, and put a new one in and rebooted the machine with one CPU. I had meant to boot single user mode, but I messed up the boot command line, and when I tried to sysrq to stop multiuser, it wasn't working and the machine eventually booted in multi user mode and started to write on the degraded raid set. (turns out I had a mini keyboard that didn't support sending sysrq)
It's only a bit later that I logged in and realized that I had pulled the wrong drive and since I had written on the raidset I couldn't just shut down and put the good drive back in without some amount of filesystem corruption (I did have to do this once because I had no choice, but it's not something you do first).
(oh, and it was the wrong drive because during the install, I replaced that sata board for another one, and the other board had its port in reverse order, so my labels were also in reverse order...

By then, I only had once choice left, rebuild on a drive that was already good by using the failing drive, and sure enough the failing drive had bad sectors that prevented the rebuild to complete. I still could have forced the raid to discard the bad drive and rebuild the raidset by forcing options to use the drive I was rebuilding on, as a good drive. It works perfectly if you didn't write on raidset in between, but since I had, I figured I'd try to just clone the bad drive since it only had about 5 bad blocks.

First, I went with dd conv=noerror,sync bs=512, but then googled during the long copy that there was a better way: Gnu ddrescue (don't get confused between that in the older dd_rescue and ddr_help). ddrescue is really mostly the same, except that it copies bigger blocks until it finds and error, had a logfile with recovery, and will retry bad blocks a few times before giving up on them (dd just skips them and replaces them with zeros, which you won't find with with rsync, unless you call rsync with -c and you even know which file(s) have 0s in side, which is very non trivial with a filesystem over lvm over raid5).

The magic command is therefore: ddrescue -v -r 10 -d /dev/sda4 /dev/sdd4 log which takes about 3H on a 250GB drive at 25MB/s average speed.

If ddrescue isn't able to rescue the bad blocks, in theory I should be able to compute the parity for just those blocks from the other drives (including the one I was rebuilding on), hoping/assuming that those blocs weren't ones that got changed in the short amount of time the good drive was removed from the raid. Unfortunately, doing so is pretty non trivial, and there are no tools that I could find to hand pick sectors to rebuild in one direction vs another direction (not counting that it would be super error prone).
The good news is that ddrescue -r 10 was about right: it tried to re-read my bad block 3 times and was able to get the data off the 3rd time, so I got a perfect mirror copy of my drive with issues and won't have to wonder later which portion of which filesystem got a bunch of 0s in the middle of it. Yeah! :)
(the actual data wasn't that important, I had backups of most of it, but it would have been a bit of a pain to recreate, and I always use such an opportunity to learn about the different recovery techniques and tools so that I know what to do the day I come across something very important to restore, hopefully not my data :) )


More pages: February 2004 March 2004 November 2004 April 2005 August 2005 January 2006 July 2006 August 2007 November 2007 January 2008 October 2008 November 2008 December 2008 January 2009 May 2009 July 2009 August 2009 September 2009 November 2009 December 2009 January 2010 March 2010 April 2010 June 2010 August 2010 October 2010 January 2011 July 2011 August 2011 December 2011 January 2012 March 2012 May 2012 August 2012 December 2012 January 2013 March 2013 May 2013 September 2013 November 2013 January 2014 March 2014 April 2014 May 2014 October 2014 January 2015 March 2015 May 2015 January 2016 February 2016 March 2016 June 2016 July 2016 August 2016 October 2016 January 2017 September 2017 January 2018 March 2018 December 2018 January 2019 January 2020 May 2020 September 2021 March 2023 April 2023

>>> Back to post index <<<

Contact Email