Marc's Public Blog - Misc Public Entries






More pages: May 2017 April 2017 December 2016 November 2016 September 2016 June 2016 May 2016 September 2015 May 2015 April 2015 October 2014 September 2014 July 2014 April 2014 March 2014 February 2014 October 2013 May 2013 April 2013 January 2013 October 2012 September 2012 August 2012 July 2012 May 2012 April 2012 December 2011 November 2011 July 2011 April 2011 March 2011 December 2010 November 2010 October 2010 August 2010 July 2010 June 2010 April 2010 March 2010 February 2010 December 2009 November 2009 October 2009 September 2009 August 2009 June 2009 May 2009 April 2009 March 2009 February 2009 January 2009 December 2008 November 2008 October 2008 June 2008 May 2008 April 2008 March 2008 November 2007 October 2007 September 2007 May 2007 March 2007 December 2006 November 2006 October 2006 September 2006 August 2006 June 2006 May 2006 February 2006 January 2006 December 2005 November 2005 October 2005 October 2004 August 2004 June 2004 May 2004 March 2004 September 1997 July 1996 September 1993 July 1991 December 1988 December 1985 January 1980



2007/11/28 magic took a dive (updated)
2007-11-28 23:58 by Merlin in Public

My main server, magic.merlins.org, which you are reading this page on, had its biggest downtime in a while: 5 to 8 hours depending on the services (www came back up first).
I could actually have brought the services back up quicker by failing over to my secondary live server, but there was too much state involved: making my secondary server primary for mail (including making my mailman backup primary too), and then switching back and dealing with queues, archives, and all that fun stuff.
Once I'd ascertained that I would be able to bring magic back up, I opted to just ride out the downtime rather than switch the services to moremagic and then back to magic a few hours later: too much work was involved, and I had enough on my hands recovering magic as it was.
That said, if magic were ever to really die, say from hardware failure (and it could happen: I found out that one of the two CPUs in there had actually died, and the server is soldiering on with the one CPU left), then I would do a bona fide switchover to moremagic.

So what happened?
I went to the colo to upgrade the drives in my external array (from 36G to 180G, upping the external storage to 1TB).
Unfortunately, while I was swapping the drives on the live server, I decided for some reason to run rescan-scsi-bus to check that my new drives were being seen, and something went very wrong there: the command did something very bad on my primary system's SCSI bus and caused the system array to fail.
When I rebooted (oh and that was with a new kernel, since I used the reboot to upgrade kernels too), my raid5 array was not being seen, and I only had my root filesystem: no /usr, /var, or anything else.
From there, I started debugging and trying the typical commands to bring back a raid array that was killed, but only one drive out of 5 would come back, which was insufficient.
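For the record, the "typical commands" here are mdadm assembly attempts; a sketch only, with hypothetical device names, since these need root and the actual disks:

```shell
# Scan the superblocks and config for known arrays and try to start them
mdadm --assemble --scan

# Or name the array and its members explicitly; --force also accepts
# members whose event counters are slightly out of date
mdadm --assemble --force /dev/md0 \
    /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

# See which drives actually came back
cat /proc/mdstat
mdadm --detail /dev/md0
```

If fewer than 4 of the 5 raid5 members assemble, as happened here, the array cannot start.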
At that point, the next step is to rebuild the raid5 array on top of itself, which is supposed to bring everything back up. I had done this in the very distant past, and it had worked.
It seemed to work: the raid5 array functioned as a physical volume for my lvm volume group again, and it even showed my logical volumes within that VG. I thought I was home free, until I got the dreaded error that none of my filesystems were mountable, or even looked like ext3.
After several reboots, which were not fun because I had to boot with init=/bin/bash due to a problem with the new kernel (I didn't know that yet) and then manually bring up udev, udevd, lvm, and raid5 (it's become non-trivial to do this nowadays), I realized that the new mdadm tools created a raid5 array with different defaults than the tools from 2002, so I had overlaid new md blocks that weren't compatible with the data I had on disk (it was close, though, since I could see my VG and LVs). After more time and more reboots, I realized that the default raid chunk size had changed from 32K to 64K, and that the new default raid5 layout was left-symmetric instead of left-asymmetric (WTF did they have to change that for?).
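The fix amounts to re-creating the superblocks with the 2002-era defaults rather than the current ones. A sketch of the idea (device names are hypothetical; on a real array the member order must match the original, and --assume-clean keeps mdadm from resyncing over the data):

```shell
# Recreate the raid5 metadata using the OLD defaults the data was
# written with: 32K chunks, left-asymmetric parity layout.
# --assume-clean rewrites only the superblocks, not the data blocks.
mdadm --create /dev/md0 --level=5 --raid-devices=5 \
      --chunk=32 --layout=left-asymmetric --assume-clean \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

# Then look for the LVM metadata on the restored array
vgscan
vgchange -ay
```

Getting the chunk size, layout, or device order wrong produces exactly the symptom described above: the PV and VG may still be visible, but the filesystems inside are scrambled.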
Well, 2H later, I had my raid array back up, with my VG and LVs. I was then able to mount all my filesystems except /var, which had been damaged beyond e2fsck recovery (i.e., the entire filesystem was in pieces in lost+found). In hindsight, I should have backed up that data before wiping it, but at the time I felt the data was toast, and I didn't have the time to wait for a 10GB copy to another partition.
My recovery plan was to copy /var from moremagic, which would be close but not quite the same (it was a different machine, but some shared data was rsynced between them daily), and then rsync/overlay the real data from an almost-full machine backup on my main disk server at home.
Then, I had to add the missing pieces (like recent pictures), from my laptop.
In the end, it took 4 to 6 hours of copies to get most of the system back to where it was, with very little data loss. I did lose files that had recently been uploaded to my ftp server (I don't back that up, it's too big), and I lost 8 hours to the work and frustration of piecing everything back together.

I was then able to bring apache back up first, but email had to wait longer for a 2GB mailman sync to finish. As I write this, I'm still rsyncing logs back, and it'll probably take another 12H or so, but the server has been back up and working since about 17:30.
On one hand, I'm glad I had reasonable backups and lost virtually nothing, and that I was able to rebuild the server in place instead of bringing it back home and having to make a new one from scratch; on the other hand, the 8 or so hours I spent doing this sucked.
I'm also concerned that I could lose an entire partition just from running rescan-scsi-bus, which I had run many times in the past without any such problem.

Update 1
Actually, I found out that I lost most of my archived web logs from 1999 to 2005. I'm kind of sad about that, but such is life I guess. It could have been much worse...

Update 2
Never mind: after rebooting this morning (once my last backup restores had finished overnight, a full 24H after the machine went down), I realized that /var/ftp, which I thought I had lost, was actually a separate partition (duh!) and therefore survived when /var was destroyed. So in the end I didn't lose any data at all, just a lot of time.
I can't say anymore that I've never lost anything on a raid5 array, but at least I didn't lose the actual data, since I had backups of it all. Pffeew...
