Marc's Public Blog - Linux Hacking



2008/11/29 Solved Disk Array Instability
2008-11-29 00:00 in Linux
Oh boy, do I have egg on my face...
I finally found the problem that caused me so much grief when I upgraded 5 of my drives from 250GB to 1TB a bit more than a year ago. It also explains why, ever since that upgrade, I've had repeated failures on my other array, which is made up of 500GB drives.
I spent countless hours debugging port multiplier (PMP) problems, and once that was stable enough to run (although it would still log loads of warnings/errors/retries), my 500GB drives started to become somewhat unreliable, with a high likelihood of dying during the monthly scrub (/usr/share/mdadm/checkarray).

So, I'll give you the answer right away: my 600W power supply wasn't delivering enough power to the drives through the disk array. It's unclear how or why, since the disk array had multiple power connectors, but everything worked fine when I first set it up and tested it for power and load, back when I had 250GB drives.
It's only later, as I upgraded the drives, that the new ones turned out to be just a bit too power hungry, and the disk array's poor power routing started causing occasional unreliability (i.e. it worked well enough and long enough that I didn't suspect that a power problem had come back). The fix was pretty simple: power each disk array from a different power strand (one now uses a molex power strand while the other uses a sata power strand). Just for fun, I'll add that the entire system only uses 200W out of its 600W power supply, so it wasn't obvious at the time (and still isn't) that I was simply overloading one of the power branches, or that the disk arrays really needed more than one connector plugged in.

This was really a case of cooking a frog by slowly warming up the water it is in. I never noticed that the power had become marginal, because it happened gradually and the symptoms were unclear: I was getting errors on PMP, but I had started using PMP back when it was unstable and errors were common, and I was getting drive failures on my 500GB drives while the 1TB ones were rock solid (on the same power bus, go figure). The worst part is that the Seagate drives would develop real bad sectors as a result, so it just looked like PMP still wasn't very stable and that the Seagate drives I had were crap (for the record, those drives are still iffy as they do not reallocate bad blocks by themselves, which is not supposed to happen, marginal power or not).
The aha moment finally came when I was testing my 3rd brand "new remanufactured" drive from Seagate: that drive was having issues too, even though it only had 2 hours of runtime. Then I noticed with smartctl -HAi /dev/device that the drive had 168 power-on events... in 2 hours! Yes, from there I could tell it had been losing power. The rest is history...
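If you want to check your own drives for the same symptom, here is a rough Python sketch of the comparison I did by eye: power cycles versus power-on hours from the SMART attribute table. It is only a sketch under a few assumptions: the function names are made up for illustration, the attribute names Power_On_Hours and Power_Cycle_Count are the common ones but can differ between vendors, and the threshold of 1 power cycle per hour is an arbitrary number I picked, not any kind of standard.

```python
import re
import subprocess


def parse_smart_attributes(text):
    """Parse the attribute table printed by `smartctl -A` into {name: raw_value}.

    Assumes the usual 10-column layout where the raw value is the last column
    and starts with an integer (true for the two attributes used below on most
    drives, but not guaranteed for every vendor).
    """
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID; header lines do not.
        if len(parts) >= 10 and parts[0].isdigit():
            match = re.match(r"\d+", parts[9])
            if match:
                attrs[parts[1]] = int(match.group())
    return attrs


def power_cycles_per_hour(attrs):
    """Return power cycles per power-on hour, or None if either value is missing."""
    hours = attrs.get("Power_On_Hours")
    cycles = attrs.get("Power_Cycle_Count")
    if hours is None or cycles is None:
        return None
    # A brand new drive can report 0 hours; treat that as 1 to avoid dividing by zero.
    return cycles / max(hours, 1)


def looks_like_power_loss(attrs, max_cycles_per_hour=1.0):
    """Flag drives that power cycle far more often than they should.

    max_cycles_per_hour is an arbitrary threshold chosen for illustration.
    """
    rate = power_cycles_per_hour(attrs)
    return rate is not None and rate > max_cycles_per_hour


def read_smart_attributes(device):
    """Run `smartctl -A` on a device (usually needs root) and parse the output.

    smartctl's exit status is a bitmask, so a non-zero value is not treated
    as an error here.
    """
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_smart_attributes(result.stdout)


# Example use (the device path is a placeholder):
#   attrs = read_smart_attributes("/dev/sda")
#   if looks_like_power_loss(attrs):
#       print("suspicious number of power cycles, check your power cabling")
```

With my numbers (168 power cycles in 2 hours) the rate comes out to 84 cycles per hour, which no healthy setup should ever get close to.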

I'm happy I finally found the problem, but I must have put 40 hours down the drain over the last 2 years as a result of this power issue :(
