DragonFly kernel List (threaded) for 2005-03
Journaling layer update - any really good programmer want to start working on the userland journal scanning utility ? You need to have a LOT of time available!
I'm making good progress on the journaling code. The journaling
layer is now writing out most of the required information for most VOPs.
* Has a working journal creation/deletion/listing mechanism via mountctl
(most options not yet implemented). Still very primitive, but good
enough for testing purposes.
* Writes out the cred, vattr, security audit info (uid, gid, p_pid,
p_comm), timestamps, file write data (but not putpages data yet),
creations, deletions, link, softlink, renames, etc.
Still on the TODO list:
* Writing out the stuff I've forgotten.
* Writing out the UNDO information. UNDO information is what makes a
journal reversible, which is one of the big ticket items that I want
to support, but it requires writing out the prior contents of
data blocks, prior uid, gid, time info, and so forth.
* Identification of vnodes in vnode operations so the journal target
knows what file VOP_WRITE's are associated with without having to
run the log backwards too much. I'll probably write out the file
handle and then also write out the file path if it hasn't been written
out in the last N seconds.
* Direct SWAP backing support for the monitor.
* Adding a crc (I have a field for it but I'm not generating it yet).
* Two-way stream transaction id acknowledgement and a journal write
failure / restart protocol (this is what is going to make the journal
reliable over a network link).
* We need a utility program which scans the (binary) journal and
can generate shell commands to regenerate a filesystem and/or decode
the journal and display the operations in a human readable format.
I am doing all the kernel work, but I am looking for someone to help
engineer and write the user utility that actually does something real
with the generated journal! Anyone interested in doing this? We
want a utility that is capable of:
* extracting a file subhierarchy and generating a mirror of the filesystem.
* extracting a file subhierarchy and generating human readable output
showing the audit trail of all changes made within the subhierarchy.
* extracting a file subhierarchy and generating a new raw journal
containing only that subhierarchy.
* extracting deleted files by name ('undelete' realized!)
* extracting a file subhierarchy and generating a mirror that is
'as of' a particular date in the past.
-- Technical Journal Record Format Details --
The journal record format is in sys/journal.h. It's quite straightforward
but it IS a multi-layer recursive record format. The first layer is a
virtual stream layer (needed because multiple entities may be
writing out transactions to the journal simultaneously). Virtual
streams are typically short-lived entities that represent transactions,
one transaction per virtual stream. The virtual stream layer is
controlled by the journal_rawrecbeg and journal_rawrecend structures
(designed so a utility program can scan the journal forwards or
backwards).
The second layer is a recursive record layer controlled by the
journal_subrecord structure. Each transaction may contain a hierarchy
of subrecords representing all the information required to understand
and/or UNDO the transaction. So, for example, a file creation will
have a JTYPE_CREATE subrecord which contains a number of other subrecords
(JLEAF_PATH1, JLEAF_MODES, JLEAF_UID, JLEAF_GID, etc), and even other
non-leaf nodes (JTYPE_UNDO). All records (with one exception) contain
the actual record size but are always physically 16-byte aligned.
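The size-versus-alignment rule above can be sketched as a small helper.
This is only an illustration: the names JREC_ALIGN and jrec_physical_size
are made up for this example and are not taken from sys/journal.h. It
assumes the physical footprint of a record is simply its recorded size
rounded up to the next multiple of 16 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name; the text only says records are 16-byte aligned. */
#define JREC_ALIGN 16u

/*
 * Round a logical record size up to the next multiple of 16 bytes,
 * giving the number of bytes the record would occupy in the journal
 * under the assumption stated above.
 */
static uint64_t
jrec_physical_size(uint64_t recsize)
{
    /* Guard against wrap-around when adding the rounding bias. */
    assert(recsize <= UINT64_MAX - (JREC_ALIGN - 1));
    return (recsize + (JREC_ALIGN - 1)) & ~(uint64_t)(JREC_ALIGN - 1);
}
```

A scanner would advance by jrec_physical_size(recsize) rather than by
recsize itself when stepping from one record to the next.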
There are three gotchas. First, the high level virtual stream may
break up the subrecords. The stream transaction block must be
reconstructed before the subrecords can be scanned. Second, a NON-LEAF
journal subrecord may have a record size of 0, which means the utility
program has to recurse through it to figure out how big it actually is.
The recsize field is mandatory for LEAF subrecords so you don't have
to worry about those. This occurs because some transactions exceed
the size of the memory fifo and must be flushed out before the journaling
code knows how large the subrecord is! The last gotcha is that the high
level stream representing a transaction may be aborted at the stream
level, right smack in the middle of the subrecord transaction. Scanning
code must understand that the stream block may be 'truncated' relative
to the record sizes indicated by the subrecords. This occurs
if the journal is in the middle of a transaction and then determines
that the operation has in fact failed, e.g. due to the VFS op failing.
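To make the second and third gotchas concrete, here is a rough sketch of
how a scanner might measure one subrecord inside an already reconstructed
transaction buffer. Everything in it is an assumption made for
illustration: the sk_hdr layout, the SK_NESTED and SK_LAST flags, and the
idea that the last child of a non-leaf record is flagged are NOT taken
from sys/journal.h. The real header defines its own fields and its own
way of terminating a size-0 record. The sketch also reads the header in
host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_NESTED 0x0001u   /* hypothetical: record contains child records */
#define SK_LAST   0x0002u   /* hypothetical: last child of its parent */
#define SK_ALIGN  16u       /* records are physically 16-byte aligned */

struct sk_hdr {             /* hypothetical header, not the real layout */
    uint16_t type;
    uint16_t flags;
    uint32_t recsize;       /* header + payload; 0 = unknown (non-leaf only) */
};

static size_t
sk_roundup(size_t n)
{
    return (n + SK_ALIGN - 1) & ~(size_t)(SK_ALIGN - 1);
}

/*
 * Measure the record starting at buf. Returns the physical number of
 * bytes it occupies, or 0 if the buffer ends before the record is
 * complete (e.g. an aborted stream) or the record is malformed.
 */
static size_t
sk_measure(const uint8_t *buf, size_t len)
{
    struct sk_hdr h, c;
    size_t off, n;

    if (len < sizeof(h))
        return 0;
    memcpy(&h, buf, sizeof(h));

    if (h.recsize != 0) {
        /* Size is known: just check that it fits in what we have. */
        n = sk_roundup(h.recsize);
        return (h.recsize >= sizeof(h) && n <= len) ? n : 0;
    }
    if (!(h.flags & SK_NESTED))
        return 0;           /* a leaf must always carry its size */

    /* Size unknown: recurse through the children to find the end. */
    off = sk_roundup(sizeof(h));
    for (;;) {
        if (off > len || len - off < sizeof(c))
            return 0;       /* truncated in the middle of the record */
        memcpy(&c, buf + off, sizeof(c));
        n = sk_measure(buf + off, len - off);
        if (n == 0)
            return 0;
        off += n;
        assert(off <= len); /* each child fit inside the remaining bytes */
        if (c.flags & SK_LAST)
            return off;
    }
}
```

A return value of 0 is how this sketch reports that the stream block
was truncated relative to what the subrecords claim. A real utility
would likely want to tell truncation apart from a corrupt record.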
Yes, it's sophisticated, but the journal must be capable of doing
sophisticated things and has other requirements: multiple processes
building transactions at the same time, transactions being
potentially *huge* (e.g. if you do a 1GB write(), that's a
gigabyte-sized transaction!), having to store UNDO data, wanting to
make the format extensible, and so forth.
I decided not to go for an ultra-compact format because I believe
compaction can be done even better using e.g. a gzip layer.