Marc's Public Blog - Linux Hacking


vvv Click on the categories below to see other topic specific pages vvv



>>> Back to post index <<<

π 2007-11-28 23:58 by Merlin in Computers, Linux, Public

My main server, magic.merlins.org, which you are reading this page on, had its biggest downtime in a while: 5 to 8 hours depending on the services (www came back up first).
I could actually have brought the services back up quicker by failing over to my secondary live server, but because of state involved, and work involved in making my secondary server, primary for mail, and then switching back (this includes making my mailman backup primary too, and then dealing with queues, archives, and all that fun stuff).
After asserting that I'd be able to bring magic back up, I just opted to ride the downtime and not worry about switching the services to moremagic, and then back to magic a few hours later: too much work was involved, and I had enough work on my hands recovering magic as is.
That said, if magic were to really die one day, like the hardware dying (and it could happen, I found out that one of my two CPUs in there actually has died and that the server is continuing to work with one CPU left), then I would do a bona fide switchover to moremagic.

So what happened?
I went to the colo to upgrade the drives in my external array (from 36G to 180G, upping the external storage to 1TB).
Unfortunately, while I was swapping the drives on the live server, for some reason, I decided to run rescan-scsi-bus to see my new drives were being seen, and something went very wrong there: that command caused something very bad to happen on my primary system SCSI bus and caused the system array to fail.
When I rebooted (oh and that was with a new kernel, since I used the reboot to upgrade kernels too), my raid5 array was not being seen, and I only had my root filesystem: no /usr, /var, or anything else.
From there, I started debugging, and trying the typical commands to bring back a raid array that was killed, but it would only bring one drive back out of 5, which was insufficient.
At that point, the next step is to rebuild the raid5 array on top of itself, which is supposed to bring every back up. I had done this in the very distant past, and it had worked.
Unfortunately, it worked enough for my raid5 array to function as a physical volume for my lvm volume group, and it even showed my logical volumes within that VG. I thought I was home free, until I got the dreaded error that none of my filesystems were mountable or even looked like ext3.
After several reboots which were not fun because I had to boot with init=/bin/bash due to a problem with the new kernel (I didn't know that yet), and then manually bring up udev, udevd, lvm, and raid5 (it's become non trivial to do this nowadays), I realized that the new mdadm tools created a different default raid5 array when the tools from 2002, so I had overlayed new md blocks that weren't compatible with the data I had on disk (yet, it was close since I could see my VG and LVs). After more time and more reboots, I realized that the chunck size for raid had changed from 32K to 64K and that the new default raid layout was left-symmetric instead of left-asymmetric (WTF did they have to change that).
Well, 2H later, I had my raid array back up, with my VG and LVs. I was then able to mount all my filesystems, except /var which had been damaged beyond e2fsck recovery (i.e the entire filesystem was in pieces in lost+found). In hindsight, I should have backed up that data before wiping it, but at the time, I felt the data was toast, and I didn't have the time to wait for a 10GB copy to another partition.
My recovery plan was to copy /var from moremagic, which would be close, but not quite the same (it was as different machine, but I had some shared data pieces that were rsynced daily), and then rsync/overlay the real data that I had on an almost full machine backup on my main disk server at home.
Then, I had to add the missing pieces (like recent pictures), from my laptop.
In the end, it took 4 to 6 hours of copies to get most of the system back to where it was, with very little data loss. I did lose files that had recently been uploaded to my ftp server (I don't back that up, it's too big), and I did lose 8 hours of work and frustration to piece everything back together.

I was then able to bring apache back up first, but I had to wait longer for Email for a 2GB mailman sync to finish. As I write this, I'm still rsyncing logs back and it'll probably take another 12H or so, but the server has been back up and working since about 17:30.
On one side, I'm glad I had reasonable backups and lost virtually nothing, as well as the fact that I was able to rebuild the server in place instead of bringing it back home and having to make a new one from scratch, but on the other side, the 8 or so hours I spent doing this, sucked.
I'm also concerned that I was able to lose an entire partition just for running rescan-scsi-bus, which I had run many times in the past without such problems.

Update1
Actually, I found out that I lost most of my archived web logs from 1999 to 2005. I'm kind of sad about that, but such is life I guess. It could have been much worse...

Update2
Never mind, I actually didn't lose anything, except a lot of time. After rebooting this morning (after my last backup restores had finished over night, a full 24H after the machine went down), I just realized that /var/ftp, which I thought I lost was indeed a separate partition (duh!) and therefore wasn't lost when /var was lost. This means that in the end I didn't lose any data at all, except a lot of time.
I can't quite say that I haven't lost anything on a raid5 array anymore, but at least I didn't lose the actual data since I had backups of it all. Pffeew...

More pages: July 2002 February 2004 March 2004 November 2004 April 2005 August 2005 January 2006 July 2006 August 2007 November 2007 December 2007 January 2008 October 2008 November 2008 December 2008 January 2009 May 2009 July 2009 August 2009 September 2009 November 2009 December 2009 January 2010 March 2010 April 2010 June 2010 August 2010 October 2010 January 2011 July 2011 August 2011 December 2011 January 2012 March 2012 May 2012 August 2012 December 2012 January 2013 March 2013 May 2013 September 2013 November 2013 January 2014 March 2014 April 2014 May 2014 October 2014 January 2015 March 2015 May 2015 January 2016 February 2016 June 2016 July 2016 August 2016 October 2016 January 2017 September 2017 January 2018 March 2018 December 2018 January 2019 August 2019 January 2020 May 2020 January 2021 September 2021 March 2023 April 2023 December 2023 June 2024 September 2024 November 2024 July 2025 August 2025 October 2025 November 2025

>>> Back to post index <<<

Contact Email