DragonFly kernel List (threaded) for 2008-05
Re: HEADS UP - Final HAMMER on-disk structural changes being made today
"Matthew Dillon" <firstname.lastname@example.org> wrote in message
:I guess this is probably a good time to ask.
:Have you thought at all about how HAMMER will scale to the current crop
:of NAND-based disks, and solid state storage in general? From my
:high-level understanding of the media and of HAMMER in general I can
:hypothesize that it looks like it will map fairly well in many respects,
:at least much better than any traditional filesystem intended for
:magnetic storage.
:The purpose-specific flash filesystems all gravitate toward being
:log-based and do buffering and writes in a fairly similar manner to
:HAMMER. Current NAND (as far as I know) chips also operate on 16k
:blocks (you have to issue an erasure before a write).
:There are various other things that these purpose-specific filesystems
:do which are more media-specific, like spreading erasures/writes evenly
:over the device in an attempt to extend the life of the disk, re-mapping
:bad blocks, etc. (e.g. LogFS, JFFS2, YAFFS). Considering how well much
:of this seems to map to
:HAMMER's on-disk layout, and considering that any of the extra
:higher-level features could likely be integrated "when the time is
:right" without much pain, it makes me wonder if you took solid state
:disks into consideration or if things just worked out that way. Also,
:do you have any plans or ideas to make HAMMER performant on SSD's?
:I wanted to go into more detail here, and a couple weeks ago I even
:contacted a couple of SSD manufacturers to inquire as to whether there
:were any (or any proposed) standards or methods of device inspection,
:for attempting to do things like block allocation to exploit
:per-nand-chip performance, etc. (None of them have gotten back to me.)
:At any rate, a higher-level email is probably more palatable anyway :)
Well I've thought about it quite a bit over the last month or two
and I think the answer is that HAMMER would not scale any better
than, say, UFS.
The reason is that even though HAMMER uses 16K blocks and even though
HAMMER doesn't 'delete' data or overwrite file data, it *DOES* modify
meta-data in-place. In addition to modifying meta-data in-place
HAMMER also uses an UNDO log for the meta-data, and it tracks
allocations and frees in the blockmaps in-place as well.
The UNDO log itself is fairly small... actually very small because
it does not contain file data (which doesn't get modified in-place),
only meta data.
But things like B-Tree and record elements do get modified in-place
in HAMMER, and that means it will wind up having to do block
replacement at least as often as something like UFS would on an SSD.
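The block-replacement cost is easy to see with a toy model (all names and sizes here are illustrative, not HAMMER's or any real controller's): on NAND, changing even one byte of meta-data in place means reading, erasing, and reprogramming the entire 16K erase block.

```python
# Toy model of a NAND erase block: pages can only be programmed after
# the whole block has been erased. All names here are hypothetical.
ERASE_BLOCK = 16 * 1024   # 16K erase block, as in the discussion

class NandBlock:
    def __init__(self):
        self.data = bytearray(b"\xff" * ERASE_BLOCK)  # erased state
        self.erased = True
        self.erase_count = 0

    def erase(self):
        self.data = bytearray(b"\xff" * ERASE_BLOCK)
        self.erased = True
        self.erase_count += 1

    def program(self, offset, payload):
        # Keep the model simple: require an erased block before writing.
        if not self.erased:
            raise RuntimeError("program before erase")
        self.data[offset:offset + len(payload)] = payload
        self.erased = False

def update_in_place(block, offset, payload):
    """Change a few bytes of meta-data: read-modify-erase-reprogram
    the entire 16K erase block."""
    merged = bytearray(block.data)
    merged[offset:offset + len(payload)] = payload
    block.erase()
    block.program(0, bytes(merged))

blk = NandBlock()
blk.program(0, b"\x00" * ERASE_BLOCK)      # initial meta-data
for i in range(8):                          # 8 tiny B-Tree updates
    update_in_place(blk, 100, bytes([i]))
print(blk.erase_count)  # 8 -- one full-block erasure per one-byte update
```

Eight one-byte meta-data updates cost eight full erase cycles in this model, which is the write-amplification problem any in-place-updating filesystem hits on NAND.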
But I'll also put forth the fact that insofar as SSDs go, and NAND
in particular, you have to think about it from two different points
of view:
(1) A storage subsystem with limited or no static-ram caching.
(2) A storage subsystem with extensive ram (either battery backed
or NOT battery backed), for caching purposes.
In the first case, if no front-end caching is available, then the only
way to get performance out of NAND is to write a filesystem explicitly
built to NAND.
In the second case, when front-end caching is available, I personally
do not believe that it matters as much. With even a moderate amount
of caching you can run any standard filesystem on top and still eke
out most of the performance. Maybe not 100% of the performance a
custom filesystem would give you, but it would be fairly close.
Let me give you an example of case (2), particularly as it applies to
HAMMER. Remember that UNDO log I mentioned? For meta-data? Well, let's
say
I wanted to improve HAMMER's performance on SSD/NAND devices. The
big problem is all the in-place modification of the meta-data.
But I have that UNDO log. And a memory cache. HMMM. So, what if I
logged not only the UNDO information, but also the DO information? That
is, if HAMMER had to update a field in some meta-data somewhere it
would lay down an UNDO record with the old contents of that field,
and I would also have it lay down a record with the NEW contents
of that same field.
If I were to do that, then I wouldn't actually have to update the
meta-data on-media for a very long period of time. Instead I could
simply cache the modified meta-data buffers in memory and lay down
the UNDO+DO records on the flash. If the system crashes or is shut
down, I don't have to flush the dirty buffers in memory. All I would
have to do, when the system is booted up again and the filesystem is
mounted, is simply play back the UNDO+DO records and regenerate the
dirty buffers in memory.
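As a rough sketch of that idea (hypothetical names and record format, not HAMMER's actual UNDO FIFO): each meta-data update appends a record carrying both the old (UNDO) and new (DO) contents, the dirty buffer stays in memory only, and after a simulated crash replaying the DO side of the log regenerates the buffers without the media copy ever having been touched.

```python
# Toy UNDO+DO log. Each meta-data update is recorded with both the old
# contents (UNDO) and the new contents (DO). Dirty buffers live only in
# memory; the log is the durable copy. Hypothetical sketch only.
media = {"btree_node_7": b"old-key"}   # durable meta-data on media
log = []                               # durable UNDO+DO log
cache = {}                             # in-memory dirty buffers

def modify_meta(name, new_value):
    old = cache.get(name, media[name])
    log.append({"name": name, "undo": old, "do": new_value})
    cache[name] = new_value            # media is NOT touched

modify_meta("btree_node_7", b"key-v1")
modify_meta("btree_node_7", b"key-v2")

# --- simulated crash: all dirty buffers are lost ---
cache = {}

# Recovery at mount: replay the DO side of the log to regenerate the
# dirty buffers in memory. No media block replacement ever happened.
for rec in log:
    cache[rec["name"]] = rec["do"]

print(cache["btree_node_7"])   # b'key-v2'
print(media["btree_node_7"])   # b'old-key' -- media still untouched
```

The UNDO side still serves its original purpose: rolling back records past the last committed transaction boundary during the same replay.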
Eventually the meta-data would have to be flushed to the media,
meaning block replacement of course, but with a reasonable amount of
memory a great deal of work could be cached before that had to happen
which means that multiple meta-data modifications could build up and
be written out far more efficiently.
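The efficiency claim can be put in toy numbers (illustrative only): N updates to the same cached meta-data buffer coalesce into a single eventual flush, where in-place updates would each have cost a block replacement.

```python
# Toy accounting for the flush-coalescing argument: N updates to the
# same meta-data buffer cost N block replacements when done in place,
# but only one replacement when cached and flushed together.
updates = [("blockmap", i) for i in range(100)]

in_place_writes = len(updates)      # one erase+program per update

dirty = set()
for name, _ in updates:
    dirty.add(name)                 # coalesce in the buffer cache
coalesced_writes = len(dirty)       # one flush per dirty buffer

print(in_place_writes, coalesced_writes)   # 100 1
```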
So that is an example of case (2), where memory is available for
caching. Such a scheme would not work in case (1), where very little
memory is available for caching. I would theorize a significant
increase in performance on NAND/SSD devices were I to make that change.
You also have to consider the relative value of writing a filesystem
completely from scratch designed explicitly for NAND. I personally
do not have much of an interest in designing such a beast. If I were
to do it I would keep it simple stupid, with none of the bells and
whistles you see in HAMMER. It would be a turnkey product designed
for applications which are aware they are running on a flash-backed
device.
Thanks for your insight.
Case 2 sounds like a far better solution for most of the instances I
can think of where one might want to deploy SSD's in anything that could
be construed as the near future: LOTS of transactions and a > 95% read
workload. NAND-based SSD's perform very well, until you start issuing writes
(unless you're doing things very intelligently, which is what the
purpose-specific filesystems target). Anything over a 5% or so write load on
a traditional filesystem will pretty much hose your performance, based on
numbers I have seen (no links handy). Distributed OLTP applications,
specifically web-centric ones, come to mind.
That said, I wouldn't advocate any change targeting the performance of
NAND-based disks unless it would improve performance also on magnetic media
as well as any future flash-like/solid-state media. SSD's are finally at a
point where they are moving into the enterprise* to support read query heavy
databases and the like, but it would be sheer folly to think of NAND as the
endgame in that department.
*Mobile too, I guess, MacBook Air and Lenovo x300
There MAY BE relative value in doing something like case 2, assuming a case
can be made for it being "beneficial enough" to implement for magnetic
storage. (At first brush I would suspect low-RPM SATA disks might enjoy
that option.) And SSD's will only become more prevalent, especially as
prices fall.
Don't most filesystem implementations call their DO log a "Journal" or so?
I assume that laying out DO records would be exceedingly similar to UNDO
record logging, with most of the code being common (I'm adding a review
of those bits of HAMMER to my todo list), and that the added complexity
would lie in assembling/collating/coalescing the metadata bits in memory
and, more than that, in intelligently flushing those blocks to disk
(expediently enough to avoid creating too much memory pressure) and, of
course, in the recovery code, since you can't just assume your metadata
is proper anymore. Are there
potentially portions of the filesystem (or future features) that DO logging
would simplify? Couldn't a DO log effectively be NOP'd (instead of bounded)
for low-memory situations?
I don't know if it should be pointed out or not, but seemingly the biggest
problem(s) with ZFS in practice all boil down to excess kernel memory
pressure related to the ARC cache. Especially on write-heavy workloads.
(This time with something resembling sane word wrapping)