DragonFly kernel List (threaded) for 2007-10
Re: HAMMER filesystem update - design document
:Anyone up on ReiserFS ?
:(but still capable of a 'clean room' description :)
:As I recall, according to their docs it seems to have been one of the
:first to use BTrees in the general sense for internal structuring ..
:also as I recall, there were some performance problems in specific areas
:of requiring extra CPU for basic IO (due to having to compute tree
:operations rather than do simple pointer manipulations) and also
:concurrent IO (due to the need for complex tree locking types of things,
:possibly compounded by the extra cpu time)
:this kind of a thing is more for replication than 100% raw speed, but
:in any case.. just some topics for discussion I suppose ..
:I still need to read the more detailed commits.
:looking forward to it.
I've never looked at the Reiser code though the comments I get from
friends who use it are on the order of 'extremely reliable but not
the fastest filesystem in the world'.
I don't expect HAMMER to be slow. A B-Tree typically uses a fairly
small radix in the 8-64 range (HAMMER uses 8 for now). A standard
indirect block methodology typically uses a much larger radix, such
as 512, but is only able to organize information in a very restricted,
The are several tricks to making a B-Tree operate efficiently but
the main thing you want to do is to issue large I/Os which cover
multiple B-Tree nodes and then arrange the physical layout of the B-Tree
such that a linear I/O will cover the most likely path(s), thus
reducing the actual number of physical I/O's needed. Locality of
reference is important.
HAMMER will also be able to issue 100% asynchronous I/Os for all B-Tree
operations, because it doesn't need an intact B-Tree for recovery of the
filesystem. It can reconstruct the B-Tree for a cluster by scanning
the records in the cluster and using a stored transaction id verses the
transaction id in the records to determine what can be restored and
what still may have had pending asynchronous I/O and thus cannot.
HAMMER will implement one B-Tree per cluster (where a cluster is e.g.
64MB), and then hook clusters together at B-Tree leaf nodes (B-Tree
leaf -> root of some other cluster). This means that HAMMER will be
able to lock modifying operations cluster-by-cluster at the very
least and hopefully greatly improve the amount of parallelism supported
by the filesystem.
HAMMER uses a index-record-data approach. Each cluster has three types
of information in it: Indexes, records, and data. The index is a B-Tree
and B-Tree nodes will replicate most of the contents of the records as
well as supply a direct pointer to the related data. B-Tree nodes will
be localized in typed filesystem buffers (that is, grouped with other
B-Tree nodes), and B-Tree filesystem buffers will be intermixed with
data filesystem buffers to a degree, so it should have extremely
good caching characteristics. I tried to take into consideration how
hard drives cache data (which is typically whole tracks to begin with)
and incorporate that into the design.
Finally, HAMMER is designed to allow clusters-by-cluster reoptimization
of the storage layout. Anything that isn't optimally layed-out at the
time it was created can be re-layed-out at some later time, e.g. with
a continuously running background process or a nightly cron job or
something of that ilk. This will allow HAMMER to choose to use an
expedient layout instead of an optimal one in its critical path and then
'fix' the layout later on to make re-accesses optimal. I've left a ton
of bytes free in the filesystem buffer headers for records and clusters
for (future) usage-pattern tracking heuristics.
The radix tree bitmap allocator, which has been committed so you can
take a look at it if you want, is extremely sophisticated. It should
be able to optimally allocate and free various types of information
all the way from megabyte+ sized chunks down to a 64-byte boundary,
in powers of 2.
My expectation is that this will lead to a fairly fast filesystem. We
will know in about a month :-)