Day 4: Conferences: Shared Storage Clusters

James Bottomley, who works at SteelEye Technology, gave the talk on Shared Storage Clusters. SteelEye is a new Linux company that focuses on clusters.


Types of Clusters

You have high-performance clusters like Beowulf, geared toward computing, whereas high-availability (HA) clusters are designed to handle node failure. The two are supposed to merge eventually, but James chose to deal with HA clusters.

You get to choose between fully hardware-based solutions (such as the AT&T 3B20, where every piece of hardware is redundant) and fully software-based ones.
Now, we're looking at hybrid solutions that use commodity hardware.

Software-only clusters are really cheap, but the hardware you get isn't the most reliable, and it's difficult to achieve data concurrency (it's usually done by data replication, either asynchronous or synchronous, which can be slow and hard to achieve over Ethernet).
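
To make the asynchronous versus synchronous trade-off concrete, here is a minimal toy sketch. It is not from the talk: the Replica and Primary classes and the crash scenario are hypothetical names made up for illustration, and no real network or disk is involved. It only shows why an asynchronous replica can be missing the most recent write when the primary dies.

```python
class Replica:
    """Toy standby node: just a dictionary of block number -> data."""

    def __init__(self):
        self.blocks = {}


class Primary:
    """Toy primary node that replicates each write to one replica."""

    def __init__(self, replica, synchronous):
        self.blocks = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = []  # writes not yet shipped to the replica

    def write(self, block, data):
        self.blocks[block] = data
        if self.synchronous:
            # Synchronous: the write is not complete until the replica has it.
            self.replica.blocks[block] = data
        else:
            # Asynchronous: return immediately and ship the write later.
            self.pending.append((block, data))

    def flush(self):
        """Ship all queued writes to the replica (asynchronous mode only)."""
        for block, data in self.pending:
            self.replica.blocks[block] = data
        self.pending.clear()


def surviving_data_after_primary_crash(synchronous):
    """Write one block, then 'crash' the primary before any flush happens."""
    replica = Replica()
    primary = Primary(replica, synchronous)
    primary.write(0, "trade-1")
    # The primary dies here; only the replica's copy survives.
    return dict(replica.blocks)
```

In this sketch the synchronous replica still holds the block after the crash, while the asynchronous one is empty because flush() never ran. The price of the synchronous mode is that every write has to wait for the replica.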

Shared Storage Clusters

In shared storage clusters, each node has multiple connections to the storage array.
The primary advantage is that you have no data replication needs and all the nodes see the storage. On the other hand, it does cost more and those clusters are more complex.
All nodes see the storage, so you need to ensure protected access, which the devices provide through reservations. You also need to make sure that the operating system supports SCSI reservations, which Linux doesn't do by default.
Another issue is that you need to be able to break a reservation when a node dies, but then you also need to be able to detect that your own reservation was broken by a node that thought you were dead.
Unfortunately, in addition to not supporting SCSI reservations, Linux treats them as an error and does what it does in those cases: it resets the bus, which causes the reservation to be lost :-)
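
The break-and-detect problem can be sketched with a toy model. This is only an illustration under loud assumptions: SharedDisk, its methods and the node names are hypothetical, and nothing here issues real SCSI commands. The reset() method stands in for the bus reset that clears a reservation.

```python
class SharedDisk:
    """Toy stand-in for a disk that honours one reservation at a time."""

    def __init__(self):
        self.holder = None
        self.data = None

    def reserve(self, node):
        """Grant the reservation if it is free or already held by this node."""
        if self.holder in (None, node):
            self.holder = node
            return True
        return False  # reservation conflict

    def reset(self):
        """Model of a reset: whatever reservation existed is lost."""
        self.holder = None

    def write(self, node, data):
        """Only the current reservation holder may write."""
        if self.holder != node:
            raise PermissionError(f"{node} no longer holds the reservation")
        self.data = data


def takeover_scenario():
    """node-b decides node-a is dead and takes the disk; node-a finds out."""
    disk = SharedDisk()
    events = []
    disk.reserve("node-a")
    disk.write("node-a", "a1")

    # node-b believes node-a is dead: break the reservation, then take it.
    if not disk.reserve("node-b"):
        disk.reset()
        disk.reserve("node-b")
    events.append(("holder", disk.holder))

    # node-a was in fact alive; its next write reveals the lost reservation.
    try:
        disk.write("node-a", "a2")
    except PermissionError:
        events.append(("node-a", "lost reservation, must stop writing"))
    return events, disk.data
```

The point of the sketch is the last step: the node that lost the disk has to notice the failure and stop writing, rather than retry in a way that breaks the new owner's reservation.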
Another set of problems with Linux comes from the buffer cache, because it doesn't purge buffers when a disk is unmounted. When the disk is remounted, Linux uses its cache, but that may be a bad idea because the drive may have been written to by another node in the meantime. The solution that was agreed on was to manually purge buffers via an ioctl on umount.
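
The talk did not name the ioctl. As an assumption, the sketch below uses BLKFLSBUF, the Linux block-device ioctl that flushes the buffer cache for a device; whether this is exactly what was agreed on is not confirmed by the talk. The function name and the injectable ioctl parameter are made up here so that the logic can be exercised without a real disk. On a real system this needs root and a real block device path.

```python
import fcntl
import os

# BLKFLSBUF is defined as _IO(0x12, 97) in <linux/fs.h>. The value is
# hard-coded because Python's fcntl module does not export it.
BLKFLSBUF = 0x1261


def flush_block_device_buffers(device_path, ioctl=fcntl.ioctl):
    """Ask the kernel to drop cached buffers for one block device.

    device_path is something like a whole-disk or partition node (assumed
    to be unmounted already). The ioctl argument exists only so that a
    fake can be passed in for a dry run.
    """
    fd = os.open(device_path, os.O_RDONLY)
    try:
        ioctl(fd, BLKFLSBUF, 0)
    finally:
        os.close(fd)
```

A failover script would call this right after unmounting the filesystem and before releasing the disk to another node, so that a later remount reads from the disk rather than from stale cached buffers.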

For the interconnect, you get to use either shared SCSI, with a 12 m maximum cable length, or Fibre Channel, which gives you up to 2 km. The cabling path should be redundant, as you ought to be able to withstand a cable being cut or a card going bad.
Plain SCSI doesn't cut it here, and Fibre Channel is the way to go.

If you look at shared SCSI, it's actually an old technology (8 years). The bus is susceptible to problems coming from just one rogue node, and the cabling is daisy-chained, so it can't withstand any failure. Many problems can also be traced to not having unique IDs across devices.
Fibre Channel, however, is wired like a network: hubs provide segment isolation, and network status lights make for easy fault isolation. Soft loop IDs also ensure unique device numbers.

Another problem is that you have to uniquely identify drives on your bus, and the /dev/sdx names under Linux are not stable. You could read the filesystem UUID or partition label, but that doesn't work when the disk is locked. While devfs could provide a solution here, the better solution is to use new drives that have a unique SCSI identifier that can be read even when the drive is locked.
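
As a rough illustration of why a per-drive identifier helps, the sketch below looks a drive up by its identifier instead of by its /dev/sdx name. Everything in it is hypothetical: the scan lists, the identifier strings and the function name are invented, and how the identifier is actually read from the drive is left out.

```python
def resolve_by_identifier(scan, wanted_id):
    """Return the device name that currently carries wanted_id, or None.

    scan is a list of (device_name, unique_id) pairs as discovered on the
    current boot; the device names may differ from one boot to the next.
    """
    by_id = {unique_id: device for device, unique_id in scan}
    return by_id.get(wanted_id)


# Two boots of the same node: the bus was probed in a different order,
# so the same two drives swapped /dev/sdx names.
boot_1 = [("/dev/sda", "drive-id-0001"), ("/dev/sdb", "drive-id-0002")]
boot_2 = [("/dev/sda", "drive-id-0002"), ("/dev/sdb", "drive-id-0001")]

first = resolve_by_identifier(boot_1, "drive-id-0002")
second = resolve_by_identifier(boot_2, "drive-id-0002")
```

Cluster configuration would then record the identifier rather than the device name, and resolve it again on every node at every boot.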

Cluster Operation

You have to recover identical resources (filesystems, IPs) and rely on the application to recover and restart. Cluster nodes must agree on a node to perform recovery, and the application must be able to make itself sane and continue where it left off.

When you take over from a failed node, you need to be able to put the partition back in a consistent state, and that's where a journalling filesystem comes in very handy.
In a shared storage cluster, the disk resource is the arbitrator, as only a node with access to the disk storage can recover. You may also end up with a split cluster, which can be OK, but you may also end up with a necessary part of your cluster that is cut off from your main network.
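
The idea of the disk as arbitrator in a split cluster can be sketched as follows. This is a toy model, not the mechanism presented in the talk: ArbiterDisk, arbitrate and the rule that each partition nominates its first node (by name) with disk access are all assumptions made for the example.

```python
class ArbiterDisk:
    """Toy shared disk: the first node to reserve it keeps it."""

    def __init__(self):
        self.holder = None

    def reserve(self, node):
        if self.holder is None:
            self.holder = node
        return self.holder == node


def arbitrate(partitions, disk_reachable):
    """Decide which partition of a split cluster may perform recovery.

    partitions is a list of groups of node names that can still talk to
    each other; disk_reachable is the set of nodes with a working path to
    the shared storage. Each partition nominates one node, and every
    nominee tries to reserve the disk. Returns {nominee: won_the_disk}.
    """
    disk = ArbiterDisk()
    results = {}
    for group in partitions:
        candidates = sorted(n for n in group if n in disk_reachable)
        if not candidates:
            continue  # this partition cannot recover anything
        nominee = candidates[0]
        results[nominee] = disk.reserve(nominee)
    return results


split = [["node-a", "node-b"], ["node-c"]]
outcome = arbitrate(split, {"node-b", "node-c"})
only_c = arbitrate(split, {"node-c"})
```

Here node-b and node-c both believe their side should recover, but only the one that gets the disk goes ahead. The sketch says nothing about whether the winning side still has the network connectivity it needs, which is the caveat raised above.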

London Stock Exchange Failure

What went wrong?
The database purge program got too many requests the previous day and hadn't finished by the time trading started. As a result, bad data was given out when the stock market opened, which forced a shutdown of the database.
In other words, no matter how good your cluster is, the whole is only as reliable as the application that you're using.

The talk was supposed to go on, but James ran out of time :-)

The talk slides can be found here


2000/08/24 (00:39): Version 1.0