2007/11/28 magic took a dive (updated)
π 2007-11-28 23:58 by Merlin in Public
My main server, magic.merlins.org, which you are reading this page on, had its biggest downtime in a while: 5 to 8 hours depending on the services (www came back up first).
I could actually have brought the services back up quicker by failing over to my secondary live server, but I chose not to because of the state involved and the work required to make my secondary server primary for mail, and then to switch back (this includes making my mailman backup primary too, and then dealing with queues, archives, and all that fun stuff).
After ascertaining that I'd be able to bring magic back up, I just opted to ride out the downtime and not worry about switching the services to moremagic, and then back to magic a few hours later: too much work was involved, and I had enough work on my hands recovering magic as is.
That said, if magic were to really die one day, like the hardware dying (and it could happen, I found out that one of my two CPUs in there actually has died and that the server is continuing to work with one CPU left), then I would do a bona fide switchover to moremagic.
So what happened?
I went to the colo to upgrade the drives in my external array (from 36G to 180G, upping the external storage to 1TB).
Unfortunately, while I was swapping the drives on the live server, for some reason I decided to run rescan-scsi-bus to check that my new drives were being seen, and that went very wrong: the command upset my primary system SCSI bus and caused the system array to fail.
When I rebooted (oh and that was with a new kernel, since I used the reboot to upgrade kernels too), my raid5 array was not being seen, and I only had my root filesystem: no /usr, /var, or anything else.
From there, I started debugging, and trying the typical commands to bring back a raid array that was killed, but it would only bring one drive back out of 5, which was insufficient.
At that point, the next step is to rebuild the raid5 array on top of itself, which is supposed to bring everything back up. I had done this in the very distant past, and it had worked.
Unfortunately, it only worked well enough for my raid5 array to function as a physical volume for my lvm volume group, and it even showed my logical volumes within that VG. I thought I was home free, until I got the dreaded error that none of my filesystems were mountable or even looked like ext3.
After several reboots, which were not fun because I had to boot with init=/bin/bash due to a problem with the new kernel (I didn't know that yet) and then manually bring up udev, udevd, lvm, and raid5 (it's become nontrivial to do this nowadays), I realized that the new mdadm tools created a different default raid5 array than the tools from 2002, so I had overlaid new md blocks that weren't compatible with the data I had on disk (yet it was close, since I could see my VG and LVs). After more time and more reboots, I realized that the default chunk size for raid had changed from 32K to 64K and that the new default raid layout was left-symmetric instead of left-asymmetric (WTF did they have to change that).
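To illustrate why a different chunk size and layout make the same disks read back as garbage, here is a minimal Python sketch of how a logical offset in a raid5 array maps to a member disk. It is not the actual md code: the function name `locate` and the `old`/`new` parameter sets are made up for illustration, it assumes 5 disks as in the story above, and it assumes data starts at byte 0 of every member disk. The left-asymmetric and left-symmetric rules follow the usual description of those layouts.

```python
KIB = 1024


def locate(offset, raid_disks, chunk_size, layout):
    """Map a logical byte offset in a raid5 array to (disk, offset_on_disk).

    Illustrative only: assumes data starts at byte 0 of each member disk.
    """
    data_disks = raid_disks - 1
    chunk, within = divmod(offset, chunk_size)
    stripe, idx = divmod(chunk, data_disks)
    # "Left" layouts put parity on the last disk for stripe 0 and move it
    # one disk to the left on each following stripe.
    parity = data_disks - (stripe % raid_disks)
    if layout == "left-asymmetric":
        # Data chunks fill the disks in order, skipping the parity disk.
        disk = idx + 1 if idx >= parity else idx
    elif layout == "left-symmetric":
        # Data chunks start right after the parity disk and wrap around.
        disk = (parity + 1 + idx) % raid_disks
    else:
        raise ValueError(f"unknown layout: {layout}")
    return disk, stripe * chunk_size + within


# Hypothetical parameter sets matching the old and new defaults described above.
old = dict(raid_disks=5, chunk_size=32 * KIB, layout="left-asymmetric")
new = dict(raid_disks=5, chunk_size=64 * KIB, layout="left-symmetric")

for offset_kib in (0, 16, 40, 128, 256):
    offset = offset_kib * KIB
    print(offset_kib, locate(offset, **old), locate(offset, **new))
```

In this sketch the first 32K of the array maps to the same place under both parameter sets, and the two mappings diverge right after that. That might be part of why the very start of the volume still looked sane while the filesystems further in did not, but that's a guess based on the toy model, not something verified against the real array.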
Well, 2H later, I had my raid array back up, with my VG and LVs. I was then able to mount all my filesystems, except /var, which had been damaged beyond e2fsck recovery (i.e., the entire filesystem was in pieces in lost+found). In hindsight, I should have backed up that data before wiping it, but at the time, I felt the data was toast, and I didn't have the time to wait for a 10GB copy to another partition.
My recovery plan was to copy /var from moremagic, which would be close, but not quite the same (it was a different machine, but I had some shared data pieces that were rsynced daily), and then rsync/overlay the real data that I had in an almost full machine backup on my main disk server at home.
Then, I had to add the missing pieces (like recent pictures), from my laptop.
In the end, it took 4 to 6 hours of copies to get most of the system back to where it was, with very little data loss. I did lose files that had recently been uploaded to my ftp server (I don't back that up, it's too big), and I did lose 8 hours of work and frustration to piece everything back together.
I was then able to bring apache back up first, but I had to wait longer for email because a 2GB mailman sync had to finish. As I write this, I'm still rsyncing logs back and it'll probably take another 12H or so, but the server has been back up and working since about 17:30.
On one hand, I'm glad I had reasonable backups and lost virtually nothing, and that I was able to rebuild the server in place instead of bringing it back home and having to make a new one from scratch, but on the other hand, the 8 or so hours I spent doing this sucked.
I'm also concerned that I was able to lose an entire partition just for running rescan-scsi-bus, which I had run many times in the past without such problems.
Actually, I found out that I lost most of my archived web logs from 1999 to 2005. I'm kind of sad about that, but such is life I guess. It could have been much worse...
Never mind, I actually didn't lose anything, except a lot of time. After rebooting this morning (after my last backup restores had finished overnight, a full 24H after the machine went down), I just realized that /var/ftp, which I thought I had lost, was in fact a separate partition (duh!) and therefore wasn't lost when /var was lost. This means that in the end I didn't lose any data at all, just a lot of time.
I can't quite say that I haven't lost anything on a raid5 array anymore, but at least I didn't lose the actual data since I had backups of it all. Pffeew...