Marc's Public Blog - Linux Hacking


This page has a few of my blog entries about linux, but my main linux page is here

More pages: May 2020 January 2020 January 2019 December 2018 March 2018 January 2018 September 2017 January 2017 October 2016 August 2016 July 2016 June 2016 March 2016 February 2016 January 2016 May 2015 March 2015 January 2015 October 2014 May 2014 April 2014 March 2014 January 2014 November 2013 September 2013 May 2013 March 2013 January 2013 December 2012 August 2012 May 2012 March 2012 January 2012 December 2011 August 2011 July 2011 January 2011 October 2010 August 2010 June 2010 April 2010 March 2010 January 2010 December 2009 November 2009 September 2009 August 2009 July 2009 May 2009 January 2009 December 2008 November 2008 October 2008 January 2008 November 2007 August 2007 July 2006 January 2006 August 2005 April 2005 November 2004 March 2004 February 2004

2008/11/30 Magic Motherboard Crash And Raid Rebuild With DD Rescue
π 2008-11-30 01:01 in Linux
Less than a year after I built it, magic started rebooting almost daily while one of its drives was exhibiting some worrisome SMART errors. On the way back from Palo Alto Airport, with my fiancée's visiting family in tow, I thought I'd stop by the data center, swap the power supply and the bad drive. It was supposed to be a 10 minute job.
Yes, you already know the rest, it wasn't.

First, the machine never rebooted after I put in the new power supply, nor would it power up with the old one (well, the fans started, but no POST). I eventually gave up and brought the machine home for further diagnostics. I found out in the end that one of the CPU slots on the motherboard donated by benley went bad, and the machine would not boot with any CPU in it (the CPUs themselves still seemed ok).
Luckily, I got an old machine called 'ins1' a while ago, as a spare should something like this happen, so it was just a matter of switching motherboards and CPUs. Good thing I had planned for that.

The part where I screwed up is that I had to replace sda with a new drive that I had prepared. I had 6 drives in the machine and no way to know which one was which outside of a label I had made on the front of the box, for a case just like this. So, I pulled the drive, and put a new one in and rebooted the machine with one CPU. I had meant to boot single user mode, but I messed up the boot command line, and when I tried to sysrq to stop multiuser, it wasn't working and the machine eventually booted in multi user mode and started to write on the degraded raid set. (turns out I had a mini keyboard that didn't support sending sysrq)
It's only a bit later that I logged in and realized that I had pulled the wrong drive and since I had written on the raidset I couldn't just shut down and put the good drive back in without some amount of filesystem corruption (I did have to do this once because I had no choice, but it's not something you do first).
(oh, and it was the wrong drive because during the install, I swapped that SATA board for another one, and the other board had its ports in reverse order, so my labels were also in reverse order...)

By then, I only had one choice left: rebuild onto a drive that was already good by using the failing drive, and sure enough the failing drive had bad sectors that prevented the rebuild from completing. I still could have forced the raid to discard the bad drive and rebuild the raidset by forcing options to use the drive I was rebuilding on as a good drive. That works perfectly if you didn't write to the raidset in between, but since I had, I figured I'd try to just clone the bad drive since it only had about 5 bad blocks.

First, I went with dd conv=noerror,sync bs=512, but during the long copy I googled and found that there was a better way: GNU ddrescue (not to be confused with the older dd_rescue and its dd_rhelp helper). ddrescue is really mostly the same, except that it copies bigger blocks until it finds an error, keeps a logfile so that it can resume a recovery, and will retry bad blocks a few times before giving up on them. dd just skips them and replaces them with zeros, which you won't find with rsync unless you call rsync with -c, and even then you need to know which file(s) have 0s inside, which is very non trivial with a filesystem over lvm over raid5.
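To make the difference concrete, here is a minimal Python sketch of the two behaviours: zero-filling a sector on the first read error versus retrying it a few times first. It is a toy model, not how ddrescue is implemented (the real tool copies the easy areas first and comes back to the bad ones in later passes). The FlakySource class and rescue_copy function are made-up names for illustration.

```python
SECTOR = 512


class FlakySource:
    """Hypothetical stand-in for a failing drive.

    `failures` maps a sector index to the number of reads that will
    fail before that sector finally becomes readable.
    """

    def __init__(self, data, failures):
        self.data = data
        self.failures = dict(failures)

    def read_sector(self, index):
        if self.failures.get(index, 0) > 0:
            self.failures[index] -= 1
            raise OSError(f"read error on sector {index}")
        start = index * SECTOR
        return self.data[start:start + SECTOR]


def rescue_copy(source, sector_count, retries):
    """Copy every sector, retrying failed reads up to `retries` times.

    Sectors that never read successfully are zero-filled and their
    index is recorded, which is the information plain dd never gives you.
    """
    out = bytearray()
    bad = []
    for index in range(sector_count):
        for _attempt in range(1 + retries):
            try:
                out += source.read_sector(index)
                break
            except OSError:
                continue
        else:
            # Every attempt failed: pad with zeros and remember where.
            out += bytes(SECTOR)
            bad.append(index)
    return bytes(out), bad


data = bytes(range(256)) * 8  # 4 sectors of test data
no_retry_copy, no_retry_bad = rescue_copy(FlakySource(data, {2: 2}), 4, retries=0)
retry_copy, retry_bad = rescue_copy(FlakySource(data, {2: 2}), 4, retries=3)
```

With a sector that fails twice before reading cleanly, the no-retry run ends up with sector 2 zeroed and listed as bad, while the run with retries gets an exact copy and an empty bad list.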

The magic command is therefore: ddrescue -v -r 10 -d /dev/sda4 /dev/sdd4 log which takes about 3 hours on a 250GB drive at 25MB/s average speed.
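The 3 hour figure is easy to sanity check. A quick sketch, assuming decimal units (250 * 10^9 bytes, 25 * 10^6 bytes per second) and that the copied partition covers roughly the whole drive; copy_time_hours is just an illustrative helper name:

```python
def copy_time_hours(size_bytes, rate_bytes_per_s):
    """Time to copy size_bytes at a constant average rate, in hours."""
    return size_bytes / rate_bytes_per_s / 3600


hours = copy_time_hours(250 * 10**9, 25 * 10**6)
print(f"{hours:.1f} hours")  # 10000 seconds, i.e. 2.8 hours
```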

If ddrescue isn't able to rescue the bad blocks, in theory I should be able to compute the parity for just those blocks from the other drives (including the one I was rebuilding on), hoping/assuming that those blocks weren't ones that got changed in the short amount of time the good drive was removed from the raid. Unfortunately, doing so is pretty non trivial, and there are no tools that I could find to hand pick sectors to rebuild in one direction vs another direction (not to mention that it would be super error prone).
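The parity math itself is the simple part: in RAID 5, the parity block of a stripe is the XOR of its data blocks, so any single missing block is the XOR of all the others. A small sketch of that idea with made-up 4 byte blocks; it deliberately ignores the hard part, which is mapping a bad sector on one drive to the matching chunks on the others given the array's chunk size and rotating parity layout.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three made-up data blocks from one stripe, plus their parity.
d0 = b"\x01\x02\x03\x04"
d1 = b"\x10\x20\x30\x40"
d2 = b"\xaa\xbb\xcc\xdd"
parity = xor_blocks([d0, d1, d2])

# Pretend d1 sat on the unreadable sector: rebuild it from the rest.
rebuilt_d1 = xor_blocks([d0, d2, parity])
```

This only gives the right answer if none of the surviving blocks in that stripe changed while the array was running degraded, which is exactly the caveat above.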
The good news is that ddrescue -r 10 was about right: it tried to re-read my bad block 3 times and was able to get the data off the 3rd time, so I got a perfect mirror copy of my drive with issues and won't have to wonder later which portion of which filesystem got a bunch of 0s in the middle of it. Yeah! :)
(the actual data wasn't that important, I had backups of most of it, but it would have been a bit of a pain to recreate, and I always use such an opportunity to learn about the different recovery techniques and tools so that I know what to do the day I come across something very important to restore, hopefully not my data :) )

2008/11/29 Solved Disk Array Instability
π 2008-11-29 01:01 in Linux
Oh boy, do I feel like I have egg on my face...
I finally found the problem that caused me so much grief when I upgraded 5 of my drives from 250GB to 1TB a bit more than a year ago, and also the reason why, since that upgrade, I've had repeated failures with my other array made up of 500GB drives.
I spent countless hours debugging port multiplier problems and once that was stable enough to run (although it would still log loads of warnings/errors/retries), my 500GB drives started to be somewhat unreliable, and would have a high likelihood of dying during the monthly scrub (/usr/share/mdadm/checkarray).

So, I'll give you the answer right away: my 600W power supply wasn't delivering enough power to the drives through the disk array. It's unclear how or why, since the disk array had multiple power connectors, and everything was working fine when I first set it up for power and load, back when I had 250GB drives.
It's only later as I upgraded the drives that the new ones were just a bit too power hungry, and that the disk array had poor power routing, causing some occasional unreliability (i.e. it worked well enough and long enough that I didn't suspect that a power problem had come back). The fix was pretty simple, power each disk array from a different power source (one now uses a molex power strand while the other uses a sata power strand). Just for fun, I'll add that the entire system actually only uses 200W out of its 600W power supply, so it didn't seem obvious at the time (and still isn't), that I was simply overloading one of the power branches, or that the disk arrays really needed more than one connector to be plugged in.

This was really the kind of problem where you can cook a frog by slowly warming up the water it is in. I never noticed that I got into a situation where the power was marginal, because it happened slowly, and I got unclear symptoms: errors on PMP, but I started using PMP back when it was unstable and errors were common, and I was getting drive failures on my 500GB drives while the 1TB ones were rock solid (on the same power bus, go figure). The worst part is that the Seagate drives would develop real bad sectors as a result, so it just looked like PMP wasn't very stable still and that the Seagate drives I had were crap (for the record, those drives are still iffy as they do not reallocate bad blocks by themselves, which is not supposed to happen, marginal power or not).
The aha moment was finally when I was testing my 3rd brand "new remanufactured" drive from Seagate: that drive was having issues too, even though it only had 2 hours of runtime. Then I noticed with smartctl -HAi /dev/device that the drive had 168 power on events... in 2 hours! Yes, from there I could tell it had been losing power. The rest is history...
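That check is easy to automate. Here is a rough sketch that pulls the raw values of the Power_On_Hours and Power_Cycle_Count attributes out of smartctl -A style output and computes power cycles per hour. The SAMPLE text is hand-written to mirror the numbers above, not captured from the actual drive, and the helper names are made up; attribute names and raw value formats vary between drive vendors, so treat the parsing as an assumption.

```python
# Hand-written sample in the layout of `smartctl -A` output (illustrative only).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       168
"""


def raw_values(text):
    """Map attribute name to the first token of its raw value."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            values[fields[1]] = fields[9]
    return values


def cycles_per_hour(text):
    """Power cycles per powered-on hour; a healthy drive should be far below 1."""
    values = raw_values(text)
    hours = int(values["Power_On_Hours"])
    cycles = int(values["Power_Cycle_Count"])
    return cycles / max(hours, 1)


rate = cycles_per_hour(SAMPLE)
```

On a real system the text would come from running smartctl -A against the drive; a rate of dozens of power cycles per hour on a machine that never rebooted points at the power feed rather than the drive.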

I'm happy I finally found the problem, but I must have put 40 hours down the drain over the last 2 years as a result of this power issue :(

2008/11/29 Ubuntu Intrepid Ibex Upgrade From Hell and Network Manager Sucks
π 2008-11-29 01:01 in Linux
This started with me trying to debug an issue with my laptop jumping between wireless networks behind my back. It was a pretty minor problem, but it had been annoying me a bit, so I figured I'd tackle it.
Against better judgement, I figured I'd first upgrade my ubuntu hardy to the just released Intrepid (I guess the name said it all). After the upgrade, I had no more networking, and no more X. Swell...

Networking was easy to bring up temporarily: I had to bring up the interface by hand because networkmanager would no longer see loss of link and bring down eth0, which is what used to trigger one of my scripts to bring up wireless on eth1.

The upgrade to Xorg 1.5.3 was supposed to be a good thing, but it made X crash every 10 minutes or so with fglrx (which was nicely upgraded for me), or with the radeonhd driver. I first had to upgrade my kernel (from 2.6.24) to stop the crashes, and after a fair amount of work, got the radeon and radeonhd drivers working with my mobility firegl V5200 (3d almost works, it just crashes with radeonhd and is very slow with radeon, but when I have time, I'll do some svn pull to get even later drivers and it should work I'm told).
At least the good news is that I'm now running an OSS radeon driver, no more fglrx binary blob. I also get 3D and for the very first time: AIGLX and compositing in enlightenment.

Networkmanager might be supposed to be cool, but it totally fucks up your life if you're not using it exactly the way it was intended. I had auto plugging working, and that stopped after an upgrade. I had auto switching from wired to wireless (through dhcp scripts), and that stopped working too after networkmanager took over the function of ifplugd. After that, I even got networkmanager to just SEGV in protest.

Then, I tried to make networkmanager work, but I soon found out that it's been riddled with bugs and not been playing nice with the rest of the system if you have non standard configs or need to admin some interfaces by hand. Sure, it has an exclude mode, where it will now not even bring the interface down on loss of link (it used to), forcing me to go back to ifplugd or the newer wicd.
The old network manager had null asserts that I reported and were never fixed.
The new one is even worse, it segvs if I manually bring up eth1:

NetworkManager: <info> Unmanaged Device found; state CONNECTED forced. (see
NetworkManager: <WARN> nm_supplicant_interface_add_cb(): Unexpected supplicant error getting interface: wpa_supplicant couldn't grab this interface.
[1]+ Segmentation fault NetworkManager --no-daemon

Then I tried starting clean by removing all my interfaces from /etc/network/interfaces, and networkmanager refused to manage my interface anyway. It looks like it's one of the many problems that people have been seeing.
I like the fix, which says:

As a workaround removed network-manager
sudo apt-get remove network-manager
And i started my network device with:
sudo ifup eth0
Hope this helps
I filed my bug here anyway

And then, as I read the pretty light docs with no info on real troubleshooting, or WTF won't it even manage my eth0, I see gems like these:

you may want to restart the system-settings daemon using the command:
"sudo killall nm-system-settings" to apply those changes.
Err, what? You have to kill a daemon with killall to re-read config files? WTFBBQ?

NetworkManager, you're not managing any of my networks anymore. It looks like wicd will do the job, and if not I'll just go back to ifplugd and custom scripts.

Ubuntu folks: you put out a good distro, but your love affair with gnome and utter shite like networkmanager is not making you look good.

2008/11/12 MythTVs
π 2008-11-12 01:01 in Linux
So, since we have two TVs and rooms to watch them, I figured it would make sense to have two MythTVs when I only had one. My other motivation was that my current MythTV was getting a bit old and was unable to play 1080p content encoded in H264.
The solution was simple: just build a second mythtv box, move my main mythtv setup to the new hardware, make the old hardware a secondary frontend, and upgrade the hardware in the older PC after that. That was a good plan on paper.

So, the first part, the new PC, went ok because I used a bit of brains and threw money at the problem: I'm just too old to fuck around with PC hardware and build my own HTPC case. There are too many things that may not work together, requiring multiple trips to the store to exchange parts, take stuff out and put it back in...
I sent a bid to microcenter, and they actually did a good job building the HTPC. I got a good enough case, was able to get drivers to talk to the front panel LCD, and effectively everything worked except the built in IR port that was hardwired to only talk to a microsoft remote (no thank you). After adding a PVR-350 and wiring its IR receiver, everything worked hardware-wise (Asus P5E-VM HDMI G35, dual core duo 3Ghz, and got the built in intel video chip to work with Xorg. The case is Antec Fusion Black 430 HTPC, which is not small but fairly nice).

Mmmh, and then:

Moving my mythtv setup to work on the new box cost me a lot of lost hair and sleep. This is where the DB in mythtv is a pain in the ass. I had to hand edit the DB to change the IP of my main mythtv server (I didn't even try anything as foolish as renaming the hostname, especially as I called my main myth server, 'myth', making a search/replace in the DB a guaranteed failure).
What happened is that I changed a hostname value that was NULL and later set it back to NULL, except that phpmyadmin was nice enough to store the ASCII string 'NULL' instead of a real NULL, making debug output perplexing since mythtv was effectively looking for NULL and not finding NULL in the DB.
Thanks to Mikal for steering me in the right direction for debugging this after about a week of pulling my hair... After that, everything was working.
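For anyone who hasn't been bitten by this yet, here is a tiny demonstration of why a real NULL and the string 'NULL' look the same in debug output but never match each other in a query. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the demo_settings table and its contents are invented for the example, not taken from the mythtv schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_settings (name TEXT, data TEXT, hostname TEXT)")
# First row has a real SQL NULL, second row has the 4 character string 'NULL'.
conn.execute("INSERT INTO demo_settings VALUES ('SomeSetting', '1', NULL)")
conn.execute("INSERT INTO demo_settings VALUES ('SomeSetting', '2', 'NULL')")

real_nulls = conn.execute(
    "SELECT data FROM demo_settings WHERE hostname IS NULL"
).fetchall()
string_nulls = conn.execute(
    "SELECT data FROM demo_settings WHERE hostname = 'NULL'"
).fetchall()
conn.close()
```

Each query only returns one of the two rows, so code that looks up global settings with IS NULL will silently skip any row where a tool wrote the string instead.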

This is where a smart person would have quit while he was ahead, but no, I had that 3rd task which was to upgrade the CPU in my old myth box (an AMD Sempron 3100+). I ordered an AMD 4000+ (2.6GHz instead of 1.8GHz), the fastest socket 754 upgrade available for that motherboard. I hoped that it would be fast enough to decode H264/1080p, but it turned out not to be that easy to find out.

My old HTPC

The CPU of Doom

One of my many attempts at making it work: a beefier PS from my desktop PC

So the full story took over a month, but basically the new CPU has an integrated memory controller that is very subtly incompatible with my motherboard (I probably have an older bad revision of the hardware).
End result: the CPU works fine if I limit the memory in linux to 252MB. Anything beyond that and it'll crash. Lovely!
(yes, yes, I really tried everything: other memory, other slots, better power supply, memtest, standing on one foot, etc...).
And the best part, kinda? After about a month of trying I did get linux to boot and work with 252MB, and was able to verify that even overclocked at 2.8Ghz, the new CPU can't decode H264 at 1080p anyway (including with the enhanced windows software decoder you pay for).

Boy, I want my 20 hours back!

