Marc's Public Blog - Linux Hacking


All | Aquariums | Arduino | Btrfs | Cars | Cats | Clubbing | Computers | Diving | Dreamstate | Edc | Electronics | Exercising | Festivals | Flying | Halloween | Hbot | Hiking | Linux | Linuxha | Monuments | Museums | Oshkosh | Outings | Public | Rc | Sciencemuseums | Solar | Tfsf | Trips



>>> Back to post index <<<

2008/11/29 Solved Disk Array Instability
π 2008-11-29 01:01 in Linux
Oh boy, do I feel like putting an egg in my face...
I finally found the problem that caused me soo much grief when I upgraded 5 of my drives from 250GB to 1TB a bit more than a year ago, and then the reason why since that upgrade, I've had repeated failures with my other array comprized of 500GB drives.
I spent countless hours debugging port multiplier problems and once that was stable enough to run (although it would still log loads of warnings/errors/retries), my 500GB drives started to be somewhat unreliable, and would have a high likelyhood of dying during the monthly scrub (/usr/share/mdadm/checkarray).

So, I'll give you the answer right away: my 600W power supply wasn't delivering enough power to the drives through the disk array. It's unclear how or why, the said disk array had multiple power connectors, but everything was working fine when I first set it up for power and load, back when I had 250GB drives.
It's only later as I upgraded the drives that the new ones were just a bit too power hungry, and that the disk array had poor power routing, causing some occasional unreliability (i.e. it worked well enough and long enough that I didn't suspect that a power problem had come back). The fix was pretty simple, power each disk array from a different power source (one now uses a molex power strand while the other uses a sata power strand). Just for fun, I'll add that the entire system actually only uses 200W out of its 600W power supply, so it didn't seem obvious at the time (and still isn't), that I was simply overloading one of the power branches, or that the disk arrays really needed more than one connector to be plugged in.

This was really the problem where you can cook a frog by slowly warming up the water it is in. I never noticed that I got into a situation where the power was marginal, because it happened slowly, and I got unclear symptoms: errors on PMP, but I started using PMP back from when it was unstable and errors were common, and I was getting drive failures on my 500GB drives while the 1TB ones were rock solid (on the same power bus, go figure). The worst part is that the seagate drives would develop real bad sectors as a result, so it just looked like PMP wasn't very stable still and that the seagate drives I had were crap (for the record, those drives are still iffy as they do not reallocate bad blocks by themselves, which is not supposed to happen, marginal power or not).
The haha moment was finally when I was testing my 3rd brand "new remanufactured" drive from seagate, that drive was having issues too, even though it only had 2 hours of runtime. Then I noticed with smartctl -HAi /dev/device that the drive had 168 power on events... in 2 hours! Yes, from there I could tell it had been losing power. The rest is history...

I'm happy I finally found the problem, but I must have put 40 hours down the drain over the last 2 years as a result of this power issue :(


More pages: July 2002 February 2004 March 2004 November 2004 April 2005 August 2005 January 2006 July 2006 August 2007 November 2007 January 2008 October 2008 November 2008 December 2008 January 2009 May 2009 July 2009 August 2009 September 2009 November 2009 December 2009 January 2010 March 2010 April 2010 June 2010 August 2010 October 2010 January 2011 July 2011 August 2011 December 2011 January 2012 March 2012 May 2012 August 2012 December 2012 January 2013 March 2013 May 2013 September 2013 November 2013 January 2014 March 2014 April 2014 May 2014 October 2014 January 2015 March 2015 May 2015 January 2016 February 2016 June 2016 July 2016 August 2016 October 2016 January 2017 September 2017 January 2018 March 2018 December 2018 January 2019 January 2020 May 2020 January 2021 September 2021 March 2023 April 2023 December 2023 June 2024 November 2024

>>> Back to post index <<<

Contact Email