Day 4: Conferences: Distributed Filesystems on Linux: Today and Tomorrow

Peter Braam, who wrote Coda, a distributed FS that supports disconnected operation, has done a lot of research in that domain, so his talk was really something I was looking forward to.

[picture]

As he explained, the usual Unix semantics of "last close wins" don't always work well for network filesystems, especially when clients are sometimes disconnected or don't sync with the server very often.
You then have to decide how clients find out about changes. The client can poll the server to check on the status of each file, like NFS does (although it caches attributes for a certain number of seconds), or the server can tell the clients when a file is modified, like AFS and Coda do.
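
To make the difference between the two approaches concrete, here is a minimal Python sketch. It is only a toy model and not how NFS, AFS or Coda are actually implemented: the class names, the 3 second timeout and the use of a per-file version number are all assumptions made for the illustration.

```python
import time


class Server:
    """Toy file server that keeps one version number per file (hypothetical)."""

    def __init__(self):
        self.versions = {}
        self.callbacks = {}  # path -> clients that asked to be told about changes

    def write(self, path):
        self.versions[path] = self.versions.get(path, 0) + 1
        # Callback style: tell every client caching this file that it changed.
        for client in self.callbacks.pop(path, set()):
            client.invalidate(path)

    def getattr(self, path):
        return self.versions.get(path, 0)

    def register_callback(self, path, client):
        self.callbacks.setdefault(path, set()).add(client)


class PollingClient:
    """NFS-like: trust cached attributes for `ttl` seconds, then ask again."""

    def __init__(self, server, ttl=3.0, clock=time.monotonic):
        self.server, self.ttl, self.clock = server, ttl, clock
        self.cache = {}  # path -> (version, time it was fetched)
        self.requests = 0

    def version(self, path):
        now = self.clock()
        cached = self.cache.get(path)
        if cached is not None and now - cached[1] < self.ttl:
            return cached[0]  # may be stale, but costs no network traffic
        self.requests += 1
        value = self.server.getattr(path)
        self.cache[path] = (value, now)
        return value


class CallbackClient:
    """AFS/Coda-like: keep the cached copy until the server says it changed."""

    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.requests = 0

    def invalidate(self, path):
        self.cache.pop(path, None)

    def version(self, path):
        if path not in self.cache:
            self.requests += 1
            self.server.register_callback(path, self)
            self.cache[path] = self.server.getattr(path)
        return self.cache[path]
```

The polling client can return stale data until its timeout expires, but the server stays stateless. The callback client sees changes right away, at the price of the server having to remember who caches what.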

[picture]

The advantage of NFS is that there is no state on the server and it's very simple, so it's not all that hard to implement clients. Its simple model works amazingly well, even if it's not perfect.

Coda is based on ideas used in AFS, but one major advantage is that it still works in disconnected operation. Coda also has bandwidth adaptation so that it works well over slow links, a bit like rsync, and it also offers write caching. Those nice features, however, make filesystems like AFS, and especially Coda, very complex, as they have to handle versioning of files and keep version vectors to track how the copies across the network relate to each other. Of course, there is also the issue of conflict resolution when you connect back to the server.
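
As an illustration of why the versioning gets complicated, here is a small Python sketch of comparing two version vectors to decide whether one copy is simply newer or whether the two have diverged. It shows the general technique only: the dict representation and the function name `compare` are assumptions, not Coda's actual data structures.

```python
def compare(a, b):
    """Compare two version vectors given as dicts {replica_name: update_count}.

    Returns "equal", "a_newer", "b_newer" or "conflict".
    A replica missing from a vector counts as 0 updates seen.
    """
    replicas = set(a) | set(b)
    a_ahead = any(a.get(r, 0) > b.get(r, 0) for r in replicas)
    b_ahead = any(b.get(r, 0) > a.get(r, 0) for r in replicas)
    if a_ahead and b_ahead:
        # Each side has seen updates the other has not: needs resolution.
        return "conflict"
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


# A laptop that modified a file while disconnected, and a server that was
# also updated in the meantime (hypothetical replica names):
laptop = {"server": 3, "laptop": 1}
server = {"server": 4, "laptop": 0}
result = compare(laptop, server)
```

In this example both sides have changes the other one hasn't seen, which is exactly the case where conflict resolution is needed on reconnection.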

For Peter, as disks have gotten more complex with more firmware, there is no reason why they couldn't run Linux on fast and really cheap ARM CPUs.
Once you get there, there is no reason why you can't add a NIC and an IP stack on the disk itself and turn your disk into a network server. It may seem a bit far-fetched, but if you look at devices like the SNAP server and NAS (network attached storage), there is a definite demand for it.

Peter is now working on a new filesystem, InterMezzo, which uses code from Coda, ext2, rsync, and others. InterMezzo is split into Presto, which sits in the kernel and writes to the pages while keeping a log, and Lento, which is the cache manager and handles replication and syncing.
The problem with Coda is that it's huge, half a million lines of code, so the idea is that InterMezzo has to be much simpler than that in order to work right. At the moment, it has 2500 lines of C code around ext2 and 3800 lines of Perl, and the whole thing was rewritten 4 times, which was possible because it is so small.
While InterMezzo isn't completely finished yet, Peter recommends using it if you have more than a few tens of users, because Coda doesn't scale past that and would be difficult to fix because of its complexity.
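
The split between a piece that logs modifications and a piece that syncs them can be sketched in a few lines of Python. This is only a guess at the general shape, based on the description above: the class `JournalingStore`, the function `replay` and the record format are hypothetical, and the talk did not go into InterMezzo's real log format.

```python
class JournalingStore:
    """Stand-in for the kernel-side piece: applies changes locally and logs them."""

    def __init__(self):
        self.files = {}  # path -> contents
        self.log = []    # list of (operation, path, data) records

    def write(self, path, data):
        self.files[path] = data
        self.log.append(("write", path, data))

    def unlink(self, path):
        self.files.pop(path, None)
        self.log.append(("unlink", path, None))


def replay(log, replica, start=0):
    """Stand-in for the cache manager: push log records to a replica (a dict).

    Returns the log position to resume from on the next sync.
    """
    for op, path, data in log[start:]:
        if op == "write":
            replica[path] = data
        elif op == "unlink":
            replica.pop(path, None)
    return len(log)
```

Because the replica only needs the log records it hasn't seen yet, a sync after a period of disconnection just replays from the last position instead of comparing whole trees.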

Peter then switched to the subject of clusters. The idea is that the client should not care which machine it is talking to. The big gain with clusters is that you get a lot of extra computing power with minimal management when you install new nodes.
As he explained, VAX clusters in the 80s were really an example of great design and reliability, with 100Mbit links between each node, routers, and redundancy on each link. The diskless machines would use the disks from the VAXen that had disks. That technology has unfortunately gone away with DEC, but the idea is to rebuild it with Linux clusters and current technology.
Today, some of the other people working on clusters are Stephen Tweedie and Larry McVoy, and their projects are outlined in this slide:

[picture]

One of the current challenges is to get data from another machine's buffer cache if it's there, instead of needing a commit to disk before the other node can access the data.
Then there are the issues of broken locks when nodes go down, and of breaking read or write locks held by other nodes if needed.
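
To show what those two lock problems look like, here is a toy Python lock table with one operation for each. It is a sketch under simple assumptions (a single lock manager, many readers or one writer per file, hypothetical names like `LockManager` and `node_down`), not a description of any of the cluster projects mentioned in the talk.

```python
class LockManager:
    """Toy cluster lock table: many readers or one writer per path."""

    def __init__(self):
        self.readers = {}  # path -> set of nodes holding a read lock
        self.writer = {}   # path -> node holding the write lock

    def acquire_read(self, path, node):
        if path in self.writer:
            return False
        self.readers.setdefault(path, set()).add(node)
        return True

    def acquire_write(self, path, node):
        if path in self.writer or self.readers.get(path):
            return False
        self.writer[path] = node
        return True

    def node_down(self, node):
        """Drop every lock held by a node that has gone down."""
        for path in [p for p, n in self.writer.items() if n == node]:
            del self.writer[path]
        for holders in self.readers.values():
            holders.discard(node)

    def break_locks(self, path):
        """Forcibly revoke all locks on a path; returns the nodes that lost one."""
        losers = set(self.readers.pop(path, set()))
        if path in self.writer:
            losers.add(self.writer.pop(path))
        return losers
```

The hard part that this sketch leaves out is what to do with data a dead node had modified but not yet written out, and how to tell the nodes whose locks were broken.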

You can find more info on all this by looking at the slides, which are in the middle of the picture library, and by visiting the web site of Peter's new company, Stelias Computing, and the InterMezzo web site.


99/08/13 (12:16): Version 1.0