Marc's Public Blog - Linux Hacking


This page has a few of my blog entries about linux, but my main linux page is here

Here is a list of older linux event reports I made before my blog was started, then the rest are below
1996/11/18-21: Linux Pavilion Comdex Fall 1996 (photos only). I've been going since then to help at the linux pavilion.
1997/11/18-21: Linux Pavilion Comdex Fall 1997 (photos only)
1998/05/28-30: Linuxexpo 1998 (photos only)
1998/11/16-20: Linux Pavilion Comdex Fall 1998 (full report)
1998/11/11: Silicon Valley Tea Party (report with pictures)
1999/02/15: Windows Refund Day (report with pictures)
1999/03/20: SVLUG KTEH night (photos only)
1999/03/01-04: LinuxWorld Expo Winter 99 (complete report with many pictures)
1999/03/31: Mozilla Party one year anniversary (photos only)
1999/05/18-22: Linuxexpo 1999 (complete report with many pictures)
1999/06/07: June 99 Balug meeting with Linus
1999/08/09-12: LinuxWorld Expo Summer 99 (complete report with many pictures)
1999/11/15-19: Linux Business Show at Comdex Fall 1999 (full report with pictures)
2000/08/14-17: LinuxWorld Expo Summer 2000 (complete report with many pictures)
2001/01/17-20: Linux.conf.au/LCA 2001 (complete report with pictures)
2001/07/25-28: OLS 2001 (photos only)
2001/08/25: Linux 10th Anniversary (report with pictures)
2001/09/27-30: LinuxWorld Expo Summer 2001 (report with pictures)
2001/11/05-10: ALS 2001 (photos only)
2002/06/26-29: OLS 2002 (photos only)
2003/01/20-25: LCA 2003 (photos only)
2003/07/23-26: OLS 2003 (photos only)
2004/01/12-17: LCA 2004 (photos only)
2004/07/21-24: OLS 2004 (photos only)
2005/04/18-23: LCA 2005 (photos only)
2006/01/24-28: LCA 2006 (photos only)
2007/01/17-21: LCA 2007 (photos only)

Here is a list of all the talks I've given:

And below are my blog posts:


More pages: July 2002 February 2004 March 2004 November 2004 April 2005 August 2005 January 2006 July 2006 August 2007 November 2007 December 2007 January 2008 October 2008 November 2008 December 2008 January 2009 May 2009 July 2009 August 2009 September 2009 November 2009 December 2009 January 2010 March 2010 April 2010 June 2010 August 2010 October 2010 January 2011 July 2011 August 2011 December 2011 January 2012 March 2012 May 2012 August 2012 December 2012 January 2013 March 2013 May 2013 September 2013 November 2013 January 2014 March 2014 April 2014 May 2014 October 2014 January 2015 March 2015 May 2015 January 2016 February 2016 June 2016 July 2016 August 2016 October 2016 January 2017 September 2017 January 2018 March 2018 December 2018 January 2019 August 2019 January 2020 May 2020 January 2021 September 2021 March 2023 April 2023 December 2023 June 2024 September 2024 November 2024 July 2025 August 2025 October 2025 November 2025



π 2007-11-17 17:31 by Merlin in Linux

In the old days, we had ifconfig, dhclient, APM, and things were simple.

First came ACPI. This is not linux's fault, but boy did it make something as simple as putting your laptop to sleep a real pain in the ass sometimes. I'm not sure how many hours I spent learning the acpi system and getting it to work on my thinkpad, back before distros made it mostly work in most cases (but seriously, APM just worked, and ACPI was a pain in the ass).
I recently upgraded to a new laptop (thinkpad Z61p), on which I figured I'd put a brand new ubuntu feisty (now upgraded to gutsy), and I'm still running a recent kernel.org kernel instead of the vendor provided one. Maybe I'm getting punished for refusing to run gnome/KDE (I really tried, but gnome still sucks, and KDE still didn't quite do it, so I'm back to enlightenment), but simple things don't work:
  • For some obscure reason, Fn+F4 calls acpi_fakekey, which then does nothing (apparently, it might still be talking to the wrong /dev/input/event0), instead of just simply calling the sleep script. Why so complicated? I mean this crap:
    cat /etc/acpi/sleepbtn.sh
    #!/bin/bash
    . /usr/share/acpi-support/key-constants
    #acpi_fakekey $KEY_SLEEP
    /etc/acpi/sleep.sh
    
    Seriously, WTF is acpi_fakekey, and why is there no documentation for it?
  • tpb (thinkpad display) just worked, but was replaced by some complicated hotkey-setup package that does autodetection and still did the wrong thing for my laptop, and still doesn't do anything useful on my laptop with enlightenment (I had to hand re-install tpb, which ubuntu nicely made incompatible with hotkey-setup and ubuntu-desktop)
  • pulseaudio just did not work due to a misbuild (/tmp/.esd vs /tmp/esd-uid), yielding broken sound laptop-wide
  • but the best one is by far avahi, dhcdbd, and other network autoconfiguration stuff. Long gone are the days of simple ifplugd autoconfiguration, and /etc/network/interfaces is now simply empty. Keeping up with all this stuff is really starting to be a mess, especially as the documentation there is pretty light too.

I suppose that by the time all this is working, I'll still end up with a better config than what I can do on windows, but damn, it seems like it's getting unnecessarily hard...
π 2007-11-28 23:58 by Merlin in Computers, Linux, Public

My main server, magic.merlins.org, which you are reading this page on, had its biggest downtime in a while: 5 to 8 hours depending on the services (www came back up first).
I could actually have brought the services back up quicker by failing over to my secondary live server, but I held off because of the state involved and the work involved in making my secondary server primary for mail and then switching back (this includes making my mailman backup primary too, and then dealing with queues, archives, and all that fun stuff).
After ascertaining that I'd be able to bring magic back up, I just opted to ride out the downtime and not worry about switching the services to moremagic, and then back to magic a few hours later: too much work was involved, and I had enough work on my hands recovering magic as is.
That said, if magic were to really die one day, like the hardware dying (and it could happen, I found out that one of my two CPUs in there actually has died and that the server is continuing to work with one CPU left), then I would do a bona fide switchover to moremagic.

So what happened?
I went to the colo to upgrade the drives in my external array (from 36G to 180G, upping the external storage to 1TB).
Unfortunately, while I was swapping the drives on the live server, for some reason I decided to run rescan-scsi-bus to check that my new drives were being seen, and something went very wrong there: that command caused something very bad to happen on my primary system SCSI bus and caused the system array to fail.
When I rebooted (oh and that was with a new kernel, since I used the reboot to upgrade kernels too), my raid5 array was not being seen, and I only had my root filesystem: no /usr, /var, or anything else.
From there, I started debugging, and trying the typical commands to bring back a raid array that was killed, but it would only bring one drive back out of 5, which was insufficient.
At that point, the next step is to rebuild the raid5 array on top of itself, which is supposed to bring everything back up. I had done this in the very distant past, and it had worked.
Unfortunately, this time it only worked well enough for my raid5 array to function as a physical volume for my lvm volume group, and it even showed my logical volumes within that VG. I thought I was home free, until I got the dreaded error that none of my filesystems were mountable or even looked like ext3.
After several reboots, which were not fun because I had to boot with init=/bin/bash due to a problem with the new kernel (I didn't know that yet) and then manually bring up udev, udevd, lvm, and raid5 (it's become non-trivial to do this nowadays), I realized that the new mdadm tools created a different default raid5 array than the tools from 2002, so I had overlaid new md blocks that weren't compatible with the data I had on disk (yet it was close, since I could see my VG and LVs). After more time and more reboots, I realized that the default chunk size for raid had changed from 32K to 64K and that the new default raid layout was left-symmetric instead of left-asymmetric (WTF, why did they have to change that?).
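
To illustrate why the mismatched defaults gave an array that looked almost right, here is a minimal Python sketch of how a raid5 array maps logical byte offsets to member disks. It assumes 5 drives (as in this array), assumes the placement rules match the Linux md driver's left-asymmetric and left-symmetric layouts, and ignores superblocks and data offsets entirely. The function names are made up for illustration and this is not mdadm code.

```python
KIB = 1024


def locate_chunk(chunk_no, raid_disks, layout):
    """Return (stripe, data_disk, parity_disk) for a logical chunk number."""
    data_disks = raid_disks - 1
    stripe, dd_idx = divmod(chunk_no, data_disks)
    # In both "left" layouts, parity starts on the last disk and moves
    # one disk to the left with each stripe.
    pd_idx = data_disks - (stripe % raid_disks)
    if layout == "left-asymmetric":
        # Data fills the disks in order, skipping over the parity disk.
        if dd_idx >= pd_idx:
            dd_idx += 1
    elif layout == "left-symmetric":
        # Data starts on the disk right after parity and wraps around.
        dd_idx = (pd_idx + 1 + dd_idx) % raid_disks
    else:
        raise ValueError("unsupported layout: %s" % layout)
    return stripe, dd_idx, pd_idx


def locate_byte(offset, chunk_size, raid_disks, layout):
    """Return (disk, offset_on_disk) for a logical byte offset in the array."""
    chunk_no, within = divmod(offset, chunk_size)
    stripe, disk, _parity = locate_chunk(chunk_no, raid_disks, layout)
    return disk, stripe * chunk_size + within


def first_mismatch(old, new, step=4 * KIB, limit=1024 * KIB):
    """First logical offset (in steps) that the two geometries map differently."""
    for offset in range(0, limit, step):
        if locate_byte(offset, **old) != locate_byte(offset, **new):
            return offset
    return None


# Geometry the data was written with vs. the geometry it was re-created with.
old = dict(chunk_size=32 * KIB, raid_disks=5, layout="left-asymmetric")
new = dict(chunk_size=64 * KIB, raid_disks=5, layout="left-symmetric")

print(first_mismatch(old, new))  # prints 32768
```

Under these assumptions, the first 32K of the array reads back identically with either geometry and everything after that is scrambled, which may be why the LVM label at the start of the physical volume was still found while none of the filesystems inside were readable.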
Well, 2H later, I had my raid array back up, with my VG and LVs. I was then able to mount all my filesystems, except /var which had been damaged beyond e2fsck recovery (i.e. the entire filesystem was in pieces in lost+found). In hindsight, I should have backed up that data before wiping it, but at the time, I felt the data was toast, and I didn't have the time to wait for a 10GB copy to another partition.
My recovery plan was to copy /var from moremagic, which would be close, but not quite the same (it was a different machine, but I had some shared data pieces that were rsynced daily), and then rsync/overlay the real data that I had on an almost full machine backup on my main disk server at home.
Then, I had to add the missing pieces (like recent pictures), from my laptop.
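
The layered restore described above (closest copy first, more authoritative copies on top) can be sketched as a list of rsync runs applied in order. This is only an illustration in Python: the helper name is invented, the home server and laptop paths are hypothetical placeholders, and it only builds the command lines (with --dry-run by default) rather than running anything.

```python
import shlex


def build_overlay_plan(layers, target, dry_run=True):
    """Return one rsync argv per layer, in the order they should run.

    Later layers are copied over earlier ones, so the most trusted
    source goes last. No --delete is used, so files that only exist
    in an earlier layer are kept.
    """
    plan = []
    for source in layers:
        argv = ["rsync", "-a"]
        if dry_run:
            argv.append("--dry-run")
        # Trailing slashes: copy the directory's contents, not the directory itself.
        argv += [source.rstrip("/") + "/", target.rstrip("/") + "/"]
        plan.append(argv)
    return plan


layers = [
    "moremagic:/var",                # close, but from a different machine
    "homeserver:/backup/magic/var",  # hypothetical path to the home backup
    "laptop:/restore/var",           # hypothetical path to the recent pieces
]

for argv in build_overlay_plan(layers, "/var"):
    print(shlex.join(argv))
```

Printing the plan first and only then running it without the dry-run flag keeps the order of the overlays explicit, which matters since each layer can overwrite files from the previous one.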
In the end, it took 4 to 6 hours of copies to get most of the system back to where it was, with very little data loss. I did lose files that had recently been uploaded to my ftp server (I don't back that up, it's too big), and I did lose 8 hours of work and frustration to piece everything back together.

I was then able to bring apache back up first, but email had to wait longer for a 2GB mailman sync to finish. As I write this, I'm still rsyncing logs back and it'll probably take another 12H or so, but the server has been back up and working since about 17:30.
On one side, I'm glad I had reasonable backups and lost virtually nothing, and that I was able to rebuild the server in place instead of bringing it back home and having to make a new one from scratch, but on the other side, the 8 or so hours I spent doing this sucked.
I'm also concerned that I was able to lose an entire partition just for running rescan-scsi-bus, which I had run many times in the past without such problems.

Update1
Actually, I found out that I lost most of my archived web logs from 1999 to 2005. I'm kind of sad about that, but such is life I guess. It could have been much worse...

Update2
Never mind, I actually didn't lose anything, except a lot of time. After rebooting this morning (after my last backup restores had finished overnight, a full 24H after the machine went down), I just realized that /var/ftp, which I thought I had lost, was indeed a separate partition (duh!) and therefore wasn't lost when /var was lost. This means that in the end I didn't lose any data at all, except a lot of time.
I can't quite say that I haven't lost anything on a raid5 array anymore, but at least I didn't lose the actual data since I had backups of it all. Pffeew...

