Day 4: Conferences: Shared Storage Clusters
James Bottomley, who works at SteelEye Technology, gave the talk on Shared
Storage Clusters. SteelEye is a new linux company that focuses on clusters.
Types of Clusters
You have high performance clusters like Beowulf, geared toward computing, whereas
HA clusters are designed to handle node failure. The two are supposed to merge
eventually, but James chose to deal with HA clusters.
You get to choose between fully hardware-based solutions (the AT&T 3B20, where
every piece of hardware is redundant) and fully software-based ones.
Now we're looking at hybrid solutions where you use commodity hardware.
Software-only clusters are really cheap, but the hardware you get isn't the most
reliable and it's difficult to achieve data concurrency (it's usually done by
data replication, either asynchronously or synchronously, which can be slow and
hard to achieve over ethernet).
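The replication trade-off above can be sketched as a toy model. This is not any real replication product; the class and queue names are invented for illustration:

```python
# Toy illustration of the synchronous vs. asynchronous replication trade-off.
# All names here (ReplicatedStore, pending, drain) are invented for this sketch.
from collections import deque

class ReplicatedStore:
    def __init__(self, synchronous):
        self.primary = {}
        self.replica = {}
        self.synchronous = synchronous
        self.pending = deque()      # writes not yet shipped to the replica

    def write(self, key, value):
        self.primary[key] = value
        if self.synchronous:
            self.replica[key] = value          # ack only once the replica has it (slow)
        else:
            self.pending.append((key, value))  # ack immediately, ship later (fast, risky)

    def drain(self):
        """Ship queued writes; in real life this runs in the background."""
        while self.pending:
            k, v = self.pending.popleft()
            self.replica[k] = v

sync_store = ReplicatedStore(synchronous=True)
sync_store.write("x", 1)
print(sync_store.replica)    # {'x': 1}: no window where the replica is stale

async_store = ReplicatedStore(synchronous=False)
async_store.write("x", 1)
print(async_store.replica)   # {}: a crash at this point loses the write
```

The slow path in the synchronous case is the network round trip hidden inside `write`, which is exactly what makes synchronous replication hard over ethernet.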
Shared Storage Clusters
In shared storage clusters, each node has multiple connections to the storage.
The primary advantage is that you have no data replication needs and all the
nodes see the storage. On the other hand, such clusters cost more and are
more complex.
All nodes see the storage, so you need to ensure protected access, which
devices provide through reservation. You also need the operating system to
support SCSI reservation, which linux doesn't do by default.
Another issue is that you need to be able to break a reservation when a node
dies, but then you also need to be able to detect that your reservation was
broken by a node that thought you were dead.
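The reserve/break/detect dance can be modelled with a generation counter: every change of ownership bumps the counter, so a node can tell its reservation was stolen while it was presumed dead. This is a toy model, not real SCSI; the `Disk` and `Node` classes are invented for this sketch:

```python
# Toy model (not real SCSI) of reserve / break / detect-stolen-reservation.
# Class and attribute names are invented for illustration only.

class Disk:
    """Simulates a shared disk that honours one reservation at a time."""
    def __init__(self):
        self.holder = None      # node currently holding the reservation
        self.generation = 0     # bumped every time ownership changes

    def reserve(self, node):
        if self.holder is not None and self.holder != node:
            return False        # someone else holds it
        self.holder = node
        self.generation += 1
        return True

    def break_reservation(self, node):
        """Forcibly steal the reservation after declaring the holder dead."""
        self.holder = node
        self.generation += 1

class Node:
    def __init__(self, name, disk):
        self.name, self.disk = name, disk
        self.seen_generation = None

    def acquire(self):
        if self.disk.reserve(self.name):
            self.seen_generation = self.disk.generation
            return True
        return False

    def still_mine(self):
        """Detect that another node broke our reservation behind our back."""
        return (self.disk.holder == self.name
                and self.disk.generation == self.seen_generation)

disk = Disk()
a, b = Node("A", disk), Node("B", disk)
a.acquire()                      # A owns the disk
disk.break_reservation("B")      # B thinks A is dead and steals it
print(a.still_mine())            # False: A must stop all I/O immediately
```

The point of the check in `still_mine` is the second failure mode from the talk: a node that was wrongly declared dead has to notice and fence itself before it corrupts the shared disk.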
Unfortunately, in addition to not supporting SCSI reservations, linux treats
them as an error and does what it does in those cases: it resets the bus, which
causes the reservation to be lost.
Another set of problems with linux is the buffer cache, because it doesn't
purge buffers when a disk is unmounted. The problem is that when the disk is
remounted, linux uses its cache, but that may be a bad idea because the drive
may have been written to by another node in the meantime. The solution that was
agreed on was to manually purge buffers via an ioctl on unmount.
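A minimal sketch of that purge, assuming the ioctl in question is the block layer's BLKFLSBUF (which flushes and drops a block device's cached buffers). The device path is just an example, and the call needs root and a real block device, so it only runs when invoked with an argument:

```python
# Hedged sketch: purging a block device's buffers with the BLKFLSBUF ioctl.
# Assumption: BLKFLSBUF is the ioctl meant in the talk; the device path below
# is an example only. Requires root and a real block device to actually run.
import fcntl
import os
import sys

# _IO(0x12, 97) on Linux; 0x12 is the block-layer ioctl group.
BLKFLSBUF = 0x1261

def flush_buffers(device):
    """Ask the kernel to flush and drop cached buffers for e.g. "/dev/sdb"."""
    fd = os.open(device, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BLKFLSBUF, 0)
    finally:
        os.close(fd)

if __name__ == "__main__" and len(sys.argv) > 1:
    flush_buffers(sys.argv[1])
```

In a cluster, a node would issue this after unmounting and before handing the disk to its peer, so the next mount on either node starts from the on-disk state rather than a stale cache.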
For the interconnect, you get to use either shared SCSI, with a 12m maximum
cable length, or fibre channel, which gives you up to 2km. The cabling path
should be redundant, as you ought to be able to withstand a cable being cut
or a card going bad. Plain SCSI doesn't cut it here, and fibre channel is the
way to go.
If you look at shared SCSI, it's actually an old technology (8 years); the bus
is susceptible to problems coming from just one rogue node, and the cabling is
daisy-chained, so it can't withstand any failure. Many problems can also be
traced to not having unique IDs across devices.
Fibre channel, however, is wired like a network: hubs provide segment isolation,
and network status lights make for easy fault isolation. Soft loop IDs also
ensure unique device numbers.
Another problem is that you have to uniquely identify drives on your bus, and
the /dev/sdX names under linux are not stable. You could read the filesystem
UUID or partition label, but that doesn't work when the disk is locked. While
devfs could provide a solution here, the better solution is to use new drives
that have a unique SCSI identifier that can be read even when the drive is
locked.
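One such identifier is the Unit Serial Number that SCSI drives report via INQUIRY vital product data page 0x80. As a sketch, here is how that page decodes; the sample bytes are made up, and on a real system the buffer would come from an INQUIRY with the EVPD bit set rather than from a literal:

```python
# Sketch of decoding the SCSI Unit Serial Number VPD page (INQUIRY page 0x80),
# which gives a per-drive identifier independent of /dev/sdX naming.
# The sample bytes are invented; real data comes from the device itself.

def parse_unit_serial(vpd):
    """Parse a VPD page 0x80 buffer into its serial-number string.

    Layout: byte 1 is the page code (0x80), byte 3 the payload length,
    and the serial number follows as ASCII.
    """
    if vpd[1] != 0x80:
        raise ValueError("not the unit serial number page")
    length = vpd[3]
    return vpd[4:4 + length].decode("ascii").strip()

sample = bytes([0x00, 0x80, 0x00, 0x08]) + b"3EK12345"
print(parse_unit_serial(sample))   # 3EK12345
```

Because the INQUIRY command is serviced even when the medium is reserved, this identifier stays readable exactly when the filesystem UUID is not.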
You have to recover identical resources (filesystems, IPs) and rely on the
application to recover and restart. Cluster nodes must agree on a node to
perform recovery, and the application must be able to make itself sane and
continue where it left off.
When you take over a failed node, you need to be able to put the partition
back in a consistent state, and that's where a journalling filesystem comes
in very handy.
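Why journalling helps can be shown with a minimal replay loop: after a takeover, the surviving node applies only fully committed journal records instead of checking the whole disk. This is a toy write-ahead-log sketch, not any particular filesystem's format:

```python
# Minimal write-ahead-journal sketch (not a real filesystem format) showing
# why recovery after takeover is fast and safe: only committed transactions
# are replayed, and a torn trailing transaction is simply discarded.

def replay(journal, disk):
    """Apply fully committed transactions from `journal` onto `disk`."""
    txn = {}
    for record in journal:
        if record[0] == "write":
            _, key, value = record
            txn[key] = value            # buffer until we see a commit marker
        elif record[0] == "commit":
            disk.update(txn)            # commit marker: these writes are durable
            txn = {}
    # anything after the last commit marker is a torn transaction: dropped
    return disk

disk = {"a": 1}
journal = [("write", "a", 2), ("commit",),
           ("write", "b", 9)]          # node died before committing this one
print(replay(journal, disk))           # {'a': 2}: the torn write to 'b' is dropped
```

The takeover node gets a consistent partition in one bounded pass over the journal, which is exactly the property a failover needs.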
In a shared cluster, the disk resource is the arbitrator, as only a node
with access to the disk storage can recover. You may also end up with a split
cluster, which can be OK, but you may also end up with a necessary part of your
cluster being cut off from your main network.
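The disk-as-arbitrator rule can be sketched as a small decision function: when the cluster splits, only a partition that can still reach the shared disk may recover, and if none can, the cluster must stall rather than split-brain. The function and data layout here are invented for illustration:

```python
# Toy arbitration sketch: in a split cluster, only a partition that can still
# reach the shared disk is allowed to recover. Names are invented for this
# illustration; real cluster managers do this with disk locks or reservations.

def pick_recovery_node(partitions, disk_reachable):
    """Return a node allowed to perform recovery, or None if no partition
    can reach the disk (better to stall than risk split-brain)."""
    for partition in partitions:
        for node in partition:
            if disk_reachable[node]:
                return node
    return None

# Network split: {A, B} vs {C}, but only C's path to the storage survived.
partitions = [["A", "B"], ["C"]]
reachable = {"A": False, "B": False, "C": True}
print(pick_recovery_node(partitions, reachable))   # C
```

Note that the larger partition loses here: reaching the disk, not counting heads, is what decides who recovers in a shared storage cluster.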
London Stock Exchange Failure
What went wrong?
The database purge program got too many requests the previous day and hadn't
finished by the time trading started, so it ended up giving out bad data when
the stock market opened and forced a shutdown of the database.
In other words, no matter how good your cluster is, the whole is only as
reliable as the application that you're using.
The talk was supposed to go on, but James ran out of time.
The talk slides can be found here.
2000/08/24 (00:39): Version 1.0