Marc's Public Blog - Misc Public Entries

Most recent entry: 2008-12-15 00:00:00 -- Generated on 2009-01-06 22:46:51 by Rig3 0.3-391




More pages: December 2008 November 2008 October 2008 May 2008 April 2008 March 2008 January 2008 November 2007 October 2007 May 2007 March 2007 December 2006 November 2006 October 2006 September 2006 August 2006 June 2006 May 2006 February 2006 January 2006 December 2005 November 2005 October 2005 August 2005 October 2004 August 2004 June 2004 May 2004 March 2004 September 1997 July 1996 September 1993 July 1991 December 1988 December 1985 January 1980


2008/12/15 New Geotagging
2008-12-15 00:00 in Public
I've spent the last few days adding geotagging to my pictures after fixing and improving my pictprocess script, and I updated my geotagging with GPX Visualizer and gpsPhoto page.
I've also updated all my older blog entries in the hiking section so that you can click on a picture and see where it was taken, especially the John Muir Trail from Bishop to Whitney, and other outdoor pictures like San Francisco to Sausalito or the Fremont Older and wine trail biking loop we often do.
2008/12/03 Duplicate Posts On Google Reader
2008-12-03 00:00 in Public
For those of you who are seeing duplicate posts in Google Reader or other Atom/RSS readers, sorry: when I rename posts in rig3, the old post disappears and a new one is created, as expected, but some blog readers don't seem to remove posts that are gone.
It's kind of annoying and I haven't found a fix for it yet. My main blog does not have this problem, and proper readers shouldn't either.
2008/12/03 Onkyo TX-SR705 Receiver
2008-12-03 00:00 in Public
Interesting story: I bought a new receiver to replace my old Yamaha that just died, and when I ran the new Audyssey auto speaker setup, the receiver kept telling me I had an error with my front left speaker. Yet sound was coming out of it just fine, and an ohmmeter confirmed it had a resistance of 4 ohms.

After some debugging, I opened said speaker and found that the magnet attached to the tweeter (by nothing other than magnetism) had detached itself from the tweeter and grafted itself onto the magnet of the lower speaker, messing with the sound somewhat.
I just put the magnet back where it belonged, and the sound setup completed.

I'm just impressed that Audyssey was able to detect the subtle sound difference due to that magnet in the wrong place, and that it reported it back to me :)

2008/12/01 The Power Of Open Source
2008-12-01 00:00 in Public
I bought a Linksys WRT-600N wireless ABGN router, and it's actually quite a capable Linux box. Thankfully, because it was not a telco product like a cell phone :) and because the radio firmware was packaged in a way that keeps open source people from messing with it and breaking FCC rules, some folks started writing better firmware for routers of that class several years ago.
End result: the excellent dd-wrt firmware turned that wireless internet router into a full Linux box, but with very low power use, a small form factor, and little heat dissipation.
It now uses 8 times less power than the old AMD K6 350MHz Linux box it's replacing, frees up a lot of room in my closet, makes no noise, and removes any worries about failing hard drives.

The only thing that router was missing for me was USB serial support, so that I could monitor the boot of my other PC server and/or debug the server remotely if networking goes down, or whatnot. I wanted to keep a config where I had two internet-facing hosts I could ssh into, use either of the two to get to my internal network, and use one to get the boot messages from the other.

The problem was that dd-wrt didn't have usb-serial support: it was doable, but the bits were missing. I also wanted to make a contribution to the folks who write the dd-wrt software, so I put out a bounty to get usb serial working, and it was fixed and added within 24 hours, woot!

And I'm sure when dd-wrt started out, people wondered why you'd want to replace the firmware on a router that already came with working firmware, kind of like how I initially wondered why some guy in Sweden thought it made sense to write new firmware for the Archos mp3 player back in the day (and later I found out how awesome it turned out to be).

2008/12/01 What Overachievement Means at FedEx
2008-12-01 00:00 in Public
I ordered a new receiver, which technically isn't due to arrive until tomorrow per the delivery SLA. However, as a nice surprise, it was actually here 3 days ago already (Friday) and was on the truck for delivery, but: "No attempt made, delivery scheduled for next business day".
I can already see the driver with the package on his truck saying "umph, this one looks heavy. It's early anyway, so I'll worry about delivering it next week".

Not the end of the world, although I'd have been happy to play with the receiver last weekend.

      Tracking number       XXXXXXXXXXXXXXX               Reference             XXXXXXXXX
      Ship date             Nov 24, 2008                  Shipment ID           XXXXXXXXX
      Estimated delivery    Dec 2, 2008                   Destination           XXXXXXXXX
                                                          Service type          Home Delivery
                                                          Weight                35.0 lbs.
      Status                Delivery exception

      Date/Time              Activity                           Location          Details
      Nov 29, 2008  6:17 PM  Delivery exception                 SAN JOSE, CA      No attempt made, delivery scheduled for next business day
                    8:31 AM  On FedEx vehicle for delivery      SAN JOSE, CA
                    7:22 AM  At local FedEx facility            SAN JOSE, CA
                    3:47 AM  Departed FedEx location            SACRAMENTO, CA
                   12:54 AM  Arrived at FedEx location          SACRAMENTO, CA
      Nov 24, 2008 10:08 PM  Arrived at FedEx location          LEWISBERRY, PA
                    7:00 PM  Package data transmitted to FedEx
                    6:35 PM  Picked up                          LEWISBERRY, PA

2008/11/30 Magic Motherboard Crash And Raid Rebuild With DD Rescue
2008-11-30 00:00 in Public
Less than a year after I built it, magic started rebooting almost daily while one of its drives was exhibiting some worrisome SMART errors. On the way back from Palo Alto Airport, with my fiancée's visiting family in tow, I thought I'd stop by the data center and swap the power supply and the bad drive. It was supposed to be a 10-minute job.
Yes, you already know the rest: it wasn't.

First, the machine never rebooted after I put in the new power supply, nor would it power up with the old one (well, the fans started, but no POST). I eventually gave up and brought the machine home for further diagnostics. I found out in the end that one of the CPU slots on the motherboard donated by benley went bad, and the machine would not boot with any CPU in it (the CPUs themselves still seemed ok).
Luckily, I got an old machine called 'ins1' a while ago, as a spare should something like this happen, so it was just a matter of switching motherboards and CPUs. Good thing I had planned for that.

The part where I screwed up is that I had to replace sda with a new drive that I had prepared. I had 6 drives in the machine and no way to know which one was which, other than a label I had made on the front of the box for a case just like this. So I pulled the drive, put a new one in, and rebooted the machine with one CPU. I had meant to boot in single user mode, but I messed up the boot command line, and when I tried to use sysrq to stop multiuser, it didn't work, so the machine eventually booted in multi user mode and started to write to the degraded raid set (turns out I had a mini keyboard that didn't support sending sysrq).
It's only a bit later that I logged in and realized that I had pulled the wrong drive, and since I had written to the raidset, I couldn't just shut down and put the good drive back in without some amount of filesystem corruption (I did have to do this once because I had no choice, but it's not something you do first).
(Oh, and it was the wrong drive because during the install, I replaced that sata board with another one, and the other board had its ports in reverse order, so my labels were also in reverse order...)

By then, I only had one choice left: rebuild onto a drive that was already good by using the failing drive, and sure enough, the failing drive had bad sectors that prevented the rebuild from completing. I still could have forced the raid to discard the bad drive and rebuild the raidset by forcing options to use the drive I was rebuilding on as a good drive. That works perfectly if you didn't write to the raidset in between, but since I had, I figured I'd try to just clone the bad drive since it only had about 5 bad blocks.

First, I went with dd conv=noerror,sync bs=512, but while googling during the long copy I found that there was a better way: GNU ddrescue (not to be confused with the older dd_rescue and dd_rhelp). ddrescue is really mostly the same, except that it copies bigger blocks until it finds an error, keeps a logfile for recovery, and retries bad blocks a few times before giving up on them. dd just skips them and replaces them with zeros, which you won't find with rsync unless you call rsync with -c, and even then you have to work out which file(s) have 0s inside, which is very non-trivial with a filesystem over lvm over raid5.
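
To make the "zeros in the middle of your data" problem concrete, here is a minimal Python sketch that scans an image (or a device opened in binary mode) for runs of all-zero 512-byte sectors, matching the bs=512 used above. The function name and the sample data are made up for illustration, and the result is only a list of candidates: unused space is legitimately full of zeros too, which is exactly why tracking down the damage after the fact is so painful.

```python
import io

SECTOR = 512  # same granularity as dd bs=512


def zero_sector_runs(stream, sector_size=SECTOR):
    """Yield (first_sector, sector_count) for each run of all-zero sectors."""
    zero = bytes(sector_size)
    index = 0
    run_start = None
    while True:
        chunk = stream.read(sector_size)
        if not chunk:
            break
        # A short final chunk is compared against zeros of the same length.
        if chunk == zero[:len(chunk)]:
            if run_start is None:
                run_start = index
        elif run_start is not None:
            yield run_start, index - run_start
            run_start = None
        index += 1
    if run_start is not None:
        yield run_start, index - run_start


if __name__ == "__main__":
    # Hypothetical in-memory "image": data, two zeroed sectors, data, one zeroed sector.
    image = b"\x01" * 512 + bytes(1024) + b"\x02" * 512 + bytes(512)
    for first, count in zero_sector_runs(io.BytesIO(image)):
        print(f"sectors {first}..{first + count - 1} are all zeros")
```

On a real drive you would still have to map each sector back through raid5, lvm and the filesystem to find the affected file, which this sketch does not attempt.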

The magic command is therefore: ddrescue -v -r 10 -d /dev/sda4 /dev/sdd4 log, which takes about 3 hours on a 250GB drive at 25MB/s average speed.
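
As a sanity check on that estimate, here is the arithmetic as a tiny Python sketch. It assumes decimal units (1GB = 10^9 bytes, 1MB = 10^6 bytes) and a constant average rate; the function name is just for illustration.

```python
def copy_hours(size_bytes, bytes_per_second):
    """Time to copy size_bytes at a steady bytes_per_second, in hours."""
    return size_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    # 250GB at 25MB/s is 10,000 seconds, i.e. a bit under 3 hours.
    print(f"{copy_hours(250e9, 25e6):.1f} hours")
```

Retries on bad blocks add time on top of this, so rounding up to 3 hours is reasonable.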

If ddrescue isn't able to rescue the bad blocks, in theory I should be able to compute the parity for just those blocks from the other drives (including the one I was rebuilding on), hoping/assuming that those blocks weren't ones that got changed in the short amount of time the good drive was removed from the raid. Unfortunately, doing so is pretty non-trivial, and I could not find any tools to hand pick sectors to rebuild in one direction vs the other (not to mention that it would be super error prone).
The good news is that ddrescue -r 10 was about right: it tried to re-read my bad block 3 times and was able to get the data off the 3rd time, so I got a perfect mirror copy of my drive with issues and won't have to wonder later which portion of which filesystem got a bunch of 0s in the middle of it. Yeah! :)
(The actual data wasn't that important, and I had backups of most of it, but it would have been a bit of a pain to recreate, and I always use such an opportunity to learn about the different recovery techniques and tools so that I know what to do the day I come across something very important to restore, hopefully not my data :) )
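
On the parity idea mentioned above: in raid5, the parity block of a stripe is the XOR of the data blocks, so any single missing block is the XOR of all the others. Here is a minimal Python sketch of just that step, with made-up block contents and a hypothetical helper name. It deliberately ignores the hard part, which is working out which sectors on which drives form a stripe (chunk size, rotating parity layout, lvm and filesystem offsets).

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    # Three hypothetical data blocks from one stripe, plus their parity.
    d0, d1, d2 = b"\x11" * 8, b"\x22" * 8, b"\x44" * 8
    parity = xor_blocks([d0, d1, d2])

    # Pretend d1 sits on the unreadable sector: rebuild it from the rest.
    rebuilt = xor_blocks([d0, d2, parity])
    print("rebuilt block matches:", rebuilt == d1)
```

This only gives the right answer if none of the surviving blocks in the stripe changed after the drive was pulled, which is the same caveat as in the text.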

2008/11/29 Solved Disk Array Instability
2008-11-29 00:00 in Public
Oh boy, do I have egg on my face...
I finally found the problem that caused me so much grief when I upgraded 5 of my drives from 250GB to 1TB a bit more than a year ago, and the reason why, since that upgrade, I've had repeated failures with my other array comprised of 500GB drives.
I spent countless hours debugging port multiplier problems, and once that was stable enough to run (although it would still log loads of warnings/errors/retries), my 500GB drives started to be somewhat unreliable, and would have a high likelihood of dying during the monthly scrub (/usr/share/mdadm/checkarray).

So, I'll give you the answer right away: my 600W power supply wasn't delivering enough power to the drives through the disk array. It's unclear how or why: said disk array had multiple power connectors, and everything was working fine when I first set it up for power and load, back when I had 250GB drives.
It's only later, as I upgraded the drives, that the new ones were just a bit too power hungry and that the disk array's poor power routing started causing some occasional unreliability (i.e. it worked well enough and long enough that I didn't suspect that a power problem had come back). The fix was pretty simple: power each disk array from a different power source (one now uses a molex power strand while the other uses a sata power strand). Just for fun, I'll add that the entire system actually only uses 200W out of its 600W power supply, so it didn't seem obvious at the time (and still isn't) that I was simply overloading one of the power branches, or that the disk arrays really needed more than one connector to be plugged in.

This was really a case of cooking a frog by slowly warming up the water it is in. I never noticed that I got into a situation where the power was marginal, because it happened slowly, and I got unclear symptoms: errors on PMP, but I started using PMP back when it was unstable and errors were common, and I was getting drive failures on my 500GB drives while the 1TB ones were rock solid (on the same power bus, go figure). The worst part is that the Seagate drives would develop real bad sectors as a result, so it just looked like PMP still wasn't very stable and that the Seagate drives I had were crap (for the record, those drives are still iffy as they do not reallocate bad blocks by themselves, which is not supposed to happen, marginal power or not).
The aha moment finally came when I was testing my 3rd "brand new" remanufactured drive from Seagate: that drive was having issues too, even though it only had 2 hours of runtime. Then I noticed with smartctl -HAi /dev/device that the drive had 168 power-on events... in 2 hours! From there I could tell it had been losing power. The rest is history...
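
For what it's worth, that check is easy to automate. Below is a minimal Python sketch that reads the attribute table printed by smartctl -A and computes power cycles per power-on hour. The sample text is made up to mimic the usual table layout and uses the numbers from this post; attribute names such as Power_On_Hours and Power_Cycle_Count are the common ones but can vary by drive vendor, and the helper names are hypothetical.

```python
def parse_smart_attributes(text):
    """Map attribute name -> integer raw value from a smartctl -A style table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have the raw value in column 10.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer, skip it
    return attrs


def power_cycles_per_hour(attrs):
    """Return power cycles per power-on hour, or None if it cannot be computed."""
    hours = attrs.get("Power_On_Hours")
    cycles = attrs.get("Power_Cycle_Count")
    if not hours or cycles is None:
        return None
    return cycles / hours


# Illustrative sample only, not a real log from the drive in this post.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       168
"""

if __name__ == "__main__":
    rate = power_cycles_per_hour(parse_smart_attributes(SAMPLE))
    print(f"power cycles per power-on hour: {rate}")
```

A drive that sits in an array should see a handful of power cycles over its whole life, so anything like dozens per hour points at the power feed rather than the drive.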

I'm happy I finally found the problem, but I must have put 40 hours down the drain over the last 2 years as a result of this power issue :(

