DragonFly BSD
DragonFly kernel List (threaded) for 2004-12

Re: Description of the Journaling topology

To: Matthew Dillon <dillon@xxxxxxxxxxxxxxxxxxxx>
From: Maxim Sobolev <sobomax@xxxxxxxxxxx>
Date: Fri, 31 Dec 2004 02:27:27 +0200

I think that you miss the main idea of a journaling filesystem, which is that the filesystem ensures that the journal entry for an operation is always created *before* the operation itself physically takes place. This isn't guaranteed in your design.

Yes, some buffering may apply, and is applied in existing implementations, but the filesystem should *never* commit an actual update before the corresponding journal entry. In your case, however, it is possible that the filesystem will commit some changes to physical storage while, due to buffering, the corresponding journal entry is lost in a crash. In that case, an attempt to replay the journal may have disastrous consequences, since you will not know what was changed by that "lost" operation.
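The ordering invariant being argued for can be sketched in a toy model (illustrative C only, not DragonFly or any real filesystem code; `journal_write`, `data_write`, and `replay` are made-up names): the journal record is forced to stable storage *before* the data write it describes, so a crash between the two can always be repaired by replay.

```c
#include <assert.h>
#include <string.h>

#define DISK_SIZE    64
#define JOURNAL_MAX  16

/* One journal record: an idempotent "write value V at offset O". */
struct jrec {
    int  off;
    char newval;
    int  durable;               /* 1 once the record is on stable storage */
};

static char        disk[DISK_SIZE];      /* simulated data disk    */
static struct jrec journal[JOURNAL_MAX]; /* simulated journal disk */
static int         jhead;

/* Step 1: append the record AND force it out (stands in for fsync). */
static int journal_write(int off, char newval)
{
    journal[jhead].off = off;
    journal[jhead].newval = newval;
    journal[jhead].durable = 1;  /* synchronous: durable before we return */
    return jhead++;
}

/* Step 2: only now may the data write reach the disk. */
static void data_write(int recno)
{
    assert(journal[recno].durable);  /* the write-ahead invariant */
    disk[journal[recno].off] = journal[recno].newval;
}

/* Crash recovery: re-apply every durable record.  Records are
 * idempotent, so replaying an already-applied one is harmless. */
static void replay(void)
{
    int i;
    for (i = 0; i < jhead; i++)
        disk[journal[i].off] = journal[i].newval;
}
```

If the two steps are reversed and the machine dies in between, the disk holds a change the journal knows nothing about, which is exactly the unreplayable state described above.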

Therefore, IMO, it is impossible to do journaling without co-operation from the filesystem code, and without acknowledgements from whoever actually records the journal entries to persistent storage.


Matthew Dillon wrote:
:Rahul Siddharthan wrote:
:> I'm no expert but I thought the traditional case was fast recovery to
:> a consistent filesystem state (avoiding a long fsck), not recovery of
:> buffered data or fast writing of buffered data to disk. I'm pretty
:> sure ext3, for example, with its default async mount, is very
:> susceptible to losing data. ufs+softupdates most certainly can lose a
:> lot of buffered data.
:>
:> Rahul
:A buffer is not a journal, its a buffer. Journaling file systems put the
:journal ON DISK--if power is lost you replay the journal FROM DISK to
:recover consistent file system. This scheme will not allow that because
:the journal is kept in memory. You can use it for transparent backup,
:but how useful is it for recovery from crashes/power loss? It seems like
:transaction based VFS mirroring, but you cannot replay the journal if
:the system crashes or otherwise reboots unexpectedly.

I think you are a little confused, Gary. The journal we are talking
about is buffered, yes, but only for a short period of time (e.g. less
than a second). This is NO DIFFERENT from what a journaling filesystem
does. When you type 'mkdir blah' in a journaling filesystem it does
*NOT* instantly write the operation out to the journal. Disk
performance would go completely to pot if it did that.
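The batching Dillon describes can be sketched as follows (a hypothetical model, not any real journal implementation): operations accumulate in an in-memory buffer and are pushed to disk in one linear write when the buffer fills (a real system would also flush on a timer), so a single mkdir does not cost a synchronous disk write.

```c
#include <assert.h>

#define JBUF_MAX 128

/* In-memory journal buffer state (illustrative counters only). */
static int jbuf_count;       /* records currently buffered        */
static int disk_writes;      /* how many physical writes happened */
static int records_on_disk;  /* records safely on stable storage  */

/* Push the whole buffered batch to disk in one linear write. */
static void journal_flush(void)
{
    if (jbuf_count == 0)
        return;
    disk_writes++;                 /* one physical write per batch */
    records_on_disk += jbuf_count;
    jbuf_count = 0;
}

/* Called once per metadata operation, e.g. one mkdir. */
static void journal_append(void)
{
    jbuf_count++;
    if (jbuf_count == JBUF_MAX)    /* size trigger; time trigger omitted */
        journal_flush();
}
```

A thousand operations end up costing only a handful of physical writes, which is the performance argument for buffering the journal at all.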

All high performance filesystems buffer to some degree. It is not possible to build a high performance filesystem that does not buffer
(that is, it would no longer be 'high performance' if it didn't).

The key issue here is not that buffering is occurring, but how long the
data remains in the buffer before it gets shipped off to hard storage
somewhere (locally or over the net). That's the issue. And here when
we consider something like, oh, a RAID system's battery-backed RAM...
that would be considered hard storage, but it does not and cannot
replace the buffering that the kernel does.

So what you gain during crash recovery is the ability to restore the
filesystem to its state up to N seconds before the crash, where N depends on the filesystem. With a softupdates filesystem N could be
upwards of 30 seconds. With ReiserFS I would expect N to be in the
< 10 second range. But N will never be 0. The journaling I am
implementing would allow N to be programmed. It could be as little
as a millisecond or as much as the memory buffer can hold depending on
the system operator's preference.
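The programmable window N might look something like this (a sketch under assumed names; `journal_delay_us` and `should_flush` are invented for illustration): the operator's setting bounds how old buffered data may get before it must be flushed, alongside the usual buffer-full trigger.

```c
#include <assert.h>

/* The loss window N as a tunable: flush once the oldest buffered record
 * exceeds journal_delay_us, or the buffer fills.  Values illustrative. */
static long journal_delay_us = 1000;   /* 1 millisecond, operator-settable */

static int should_flush(long oldest_age_us, long buffered, long buf_max)
{
    return oldest_age_us >= journal_delay_us || buffered >= buf_max;
}
```

Setting `journal_delay_us` near zero approaches synchronous journaling at a throughput cost; setting it large trades a bigger loss window for fewer, larger linear writes.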

Even more key is the off-site capability. If the journal is a TCP
connection to another machine the buffering delay could be as little
as a millisecond before the data gets to the target machine, and the
local disks would not be impacted at all. The originating machine could immediately crash without really messing anything up, even if
the data has not yet been committed to hard storage on the target
machine. The target machine could be configured to buffer the data
again before committing to hard storage, or it could commit it
immediately. A key performance issue is that a target machine could be
dedicated to journaled backups of other machines in a cluster and
basically only have to issue linear writes, yielding very high
performance.
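The off-site path can be sketched in miniature (hypothetical framing, not the actual protocol; a socketpair stands in for the TCP connection so the demo is self-contained): the originator ships sequence-numbered records, the target only ever appends them to its log, and the target's ack is what tells the originator the record is safely off-machine.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* One record on the wire: sequence number plus opaque journal data. */
struct wire_rec {
    unsigned seq;
    char     payload[24];
};

/* Target side: read one record, append it to the linear backup log,
 * and ack the sequence number back to the originator. */
static char     backup_log[1024];
static unsigned backup_len;

static void target_consume(int fd)
{
    struct wire_rec r;
    assert(read(fd, &r, sizeof(r)) == (ssize_t)sizeof(r));
    memcpy(backup_log + backup_len, r.payload, sizeof(r.payload));
    backup_len += sizeof(r.payload);          /* append-only: linear I/O */
    assert(write(fd, &r.seq, sizeof(r.seq)) == (ssize_t)sizeof(r.seq));
}

/* Originator side: ship a record down the "wire"... */
static void origin_send(int fd, unsigned seq, const char *data)
{
    struct wire_rec r;
    memset(&r, 0, sizeof(r));
    r.seq = seq;
    strncpy(r.payload, data, sizeof(r.payload) - 1);
    assert(write(fd, &r, sizeof(r)) == (ssize_t)sizeof(r));
}

/* ...and later confirm the target acknowledged it. */
static void origin_wait_ack(int fd, unsigned seq)
{
    unsigned acked;
    assert(read(fd, &acked, sizeof(acked)) == (ssize_t)sizeof(acked));
    assert(acked == seq);
}
```

Because the target only appends, its disk sees purely sequential writes, which is the source of the high-performance claim for a dedicated backup target.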

    So there are some very practical and desirable traits being discussed.

Matthew Dillon <dillon@xxxxxxxxxxxxx>
