DragonFly kernel List (threaded) for 2008-05
DragonFly BSD

Re: HEADS UP - Final HAMMER on-disk structural changes being made today

From: "Samuel J. Greear" <dragonflybsd@xxxxxxxxxxxx>
Date: Mon, 5 May 2008 22:28:31 -0700

"Matthew Dillon" <dillon@apollo.backplane.com> wrote in message 200805060403.m4643kvU096121@apollo.backplane.com">news:200805060403.m4643kvU096121@apollo.backplane.com...

:I guess this is probably a good time to ask.
:Have you thought at all about how HAMMER will scale to the current
:generation of NAND-based disks, and solid state storage in general?
:From my high-level understanding of the media and of HAMMER in general
:I can hypothesize that it looks like it will map fairly well in many
:respects; at least, much better than most, if not any, traditional
:filesystem intended for magnetic storage. Purpose-specific flash
:filesystems all gravitate toward being log-based and do buffering and
:allocation in a fairly similar manner to HAMMER. Current NAND chips
:(as far as I understand) also operate on 16k blocks (you have to issue
:an erasure before a write). There are various other things that these
:purpose-specific filesystems try to do which are more media-specific,
:like spreading erasures/writes evenly over the disk in an attempt to
:extend the life of the disk, re-mapping bad blocks, etc. (e.g. LogFS,
:JFFS2, YAFFS). Considering how well much of this seems to map to
:HAMMER's on-disk layout, and considering that any of the extra
:higher-level bits could likely be integrated "when the time is right"
:without much pain, it made me wonder if you took solid state disks
:into consideration or if things just worked out that way. Also, do
:you have any plans or ideas to make HAMMER performant on SSDs?
:I wanted to go into more detail here, and a couple of weeks ago I even
:emailed a couple of SSD manufacturers to inquire as to whether there
:were any existing (or proposed) standards or methods of device
:inspection, for attempting to do things like block allocation that
:exploits per-NAND-chip performance, etc. (None of them have gotten
:back to me.) At any rate, a higher-level email is probably more
:palatable anyway :)

   Well I've thought about it quite a bit over the last month or two
   and I think the answer is that HAMMER would not scale any better
   than, say, UFS.

   The reason is that even though HAMMER uses 16K blocks and even though
   HAMMER doesn't 'delete' data or overwrite file data, it *DOES* modify
   meta-data in-place.  In addition to modifying meta-data in-place
   HAMMER also uses an UNDO log for the meta-data, and it tracks
   allocations and frees in the blockmaps in-place as well.

   The UNDO log itself is fairly small... actually very small because
   it does not contain file data (which doesn't get modified in-place),
   only meta data.

   But things like B-Tree and record elements do get modified in-place
   in HAMMER, and that means it will wind up having to do block
   replacement at least as often as something like UFS would on an SSD.


   But I'll also put forth the fact that insofar as SSDs go, and NAND
   in particular, you have to think about it from two different points
   of view:

   (1) A storage subsystem with limited or no static-ram caching.

   (2) A storage subsystem with extensive ram (either battery backed
       or NOT battery backed), for caching purposes.

   In the first case, if no front-end caching is available, then the only
   way to get performance out of NAND is to write a filesystem explicitly
   built to NAND.

   In the second case, when front-end caching is available, I personally
   do not believe that it matters as much.  With even a moderate amount
   of caching you can run any standard filesystem on top and still eke
   out most of the performance.  Maybe not 100% of the performance a
   custom filesystem would give you, but it would be fairly close.


   Let me give you an example of case (2), particularly as it applies
   to HAMMER.
   Remember that UNDO log I mentioned?  For meta-data?  Well, let's say
   I wanted to improve HAMMER's performance on SSD/NAND devices.  The
   big problem is all the in-place modification of the meta-data.

   But I have that UNDO log.  And a memory cache.  HMMM.  So, what if I
   logged not only the UNDO information, but also the DO information?
   That is, if HAMMER had to update a field in some meta-data somewhere
   it would lay down an UNDO record with the old contents of that field,
   and I would also have it lay down a record with the NEW contents
   of that same field.

   If I were to do that, then I wouldn't actually have to update the
   meta-data on-media for a very long period of time.  Instead I could
   simply cache the modified meta-data buffers in memory and lay down
   the UNDO+DO records on the flash.  If the system crashes or is shut
   down, I don't have to flush the dirty buffers in memory.  All I
   would have to do, when the system is booted up again and the
   filesystem mounted, is simply play back the UNDO+DO records and
   regenerate the dirty buffers in memory.

   Eventually the meta-data would have to be flushed to the media,
   meaning block replacement of course, but with a reasonable amount of
   memory a great deal of work could be cached before that had to happen
   which means that multiple meta-data modifications could build up and
   be written out far more efficiently.

   So that is an example of case (2), where memory is available for
   caching.   Such a scheme would not work in case (1), where very little
   memory is available for caching.  I would theorize a significant
   increase in performance on NAND/SSD devices were I to make that change.


   You also have to consider the relative value of writing a filesystem
   completely from scratch designed explicitly for NAND.  I personally
   do not have much of an interest in designing such a beast.  If I were
   to do it I would keep it simple stupid, with none of the bells and
   whistles you see in HAMMER.  It would be a turnkey product designed
   for applications which are aware they are running on a flash-backed
   storage device.

Matthew Dillon


Thanks for your insight.

Case 2 sounds like a much better fit for most of the instances I can
think of where one might want to deploy SSDs in anything that could be
construed as the near future: LOTS of transactions and a >95% read
workload. NAND-based SSDs perform very well until you start issuing
writes (unless you're doing things very intelligently, which is what the
purpose-specific filesystems target). Anything over a 5% or so write
load on a traditional filesystem will pretty much hose your performance,
based on numbers I have seen (no links handy). Distributed OLTP
applications, specifically web-centric ones, come to mind.

That said, I wouldn't advocate any change targeting the performance of
NAND-based disks unless it would also improve performance on magnetic
media and on any future flash-like/solid-state media. SSDs are finally
at a point where they are moving into the enterprise* to support
read-query-heavy databases and the like, but it would be sheer folly to
think of NAND as the endgame in that department.

*Mobile too, I guess, MacBook Air and Lenovo x300

There MAY BE relative value in doing something like case 2 now, assuming
a case can be made for it being "beneficial enough" to implement for
magnetic storage (on first brush I would suspect low-RPM SATA disks
might enjoy that option), and more so as SSDs become more prevalent,
especially as prices fall.

Don't most filesystem implementations call their DO log a "journal" or
something similar?

I assume that laying down DO records would be exceedingly similar to
UNDO record logging, with most of the code being common (I'm adding a
review of those bits of HAMMER to my todo list), and that the added
complexity would lie in assembling/collating/coalescing/etc. the
metadata bits in memory; more than that, in intelligently flushing
those blocks to disk (expediently enough to avoid creating too much
memory pressure) and, of course, in the recovery code, since you can't
just assume your metadata is proper anymore. Are there potentially
portions of the filesystem (or future features) that DO logging would
simplify? Couldn't a DO log effectively be NOP'd (instead of bounded)
for low-memory situations?

I don't know if it should be pointed out or not, but seemingly the
biggest problems with ZFS in practice all boil down to excess kernel
memory pressure related to the ARC cache, especially on write-heavy
workloads.

Thanks again,
(This time with something resembling sane word wrapping)
