DragonFly kernel List (threaded) for 2005-03

Journaling layer update - any really good programmer want to start working on the userland journal scanning utility ? You need to have a LOT of time available!

From:	Matthew Dillon <dillon@xxxxxxxxxxxxxxxxxxxx>
Date:	Thu, 3 Mar 2005 22:06:38 -0800 (PST)

    I'm making more good progress with the journaling code.  The journaling
    layer is now writing out most of the required information for most VOPs.
    It currently:

    * Has a working journal creation/deletion/listing mechanism via mountctl
      (most options not yet implemented).  Still very primitive, but good
      enough for testing purposes.

    * Writes out the cred, vattr, security audit info (uid, gid, p_pid,
      p_comm), timestamps, file write data (but not putpages data yet),
      creations, deletions, links, softlinks, renames, etc.

    Still on the TODO list:

    * Writing out the stuff I've forgotten.

    * Writing out the UNDO information.  UNDO information is what makes a
      journal reversible, which is one of the big ticket items that I want
      to support, but it requires writing out the prior contents of
      data blocks, prior uid, gid, time info, and so forth.

    * Identification of vnodes in vnode operations so the journal target
      knows what file VOP_WRITEs are associated with without having to
      run the log backwards too much.  I'll probably write out the file
      handle and then also write out the file path if it hasn't been written
      out in the last N seconds.

    * Direct SWAP backing support for the monitor.

    * Adding a crc (I have a field for it but I'm not generating it yet).

    * Two-way stream transaction id acknowledgement and a journal write
      failure / restart protocol (this is what is going to make the journal
      reliable over a network link).

    * We need a utility program which scans the (binary) journal and 
      can generate shell commands to regenerate a filesystem and/or decode
      the journal and display the operations in a human readable format.


				NEED HELP!

    I am doing all the kernel work, but I am looking for someone to help
    engineer and write the user utility that actually does something real 
    with the generated journal!  Anyone interested in doing this?   We
    want a utility that is capable of:

    * extracting a file subhierarchy and generating a mirror of the filesystem.
    * extracting a file subhierarchy and generating human readable output
      showing the audit trail of all changes made within the subhierarchy.
    * extracting a file subhierarchy and generating a new raw journal 
      containing only that subhierarchy.
    * extracting deleted files by name ('undelete' realized!)
    * extracting a file subhierarchy and generating a mirror that is
      'as of' a particular date in the past.


		-- Technical Journal Record Format Details --

    The journal record format is in sys/journal.h.  It's quite straightforward
    but it IS a multi-layer recursive record format.  The first layer is a
    virtual stream layer (needed because multiple entities may be 
    writing out transactions to the journal simultaneously).   Virtual
    streams are typically short-lived entities that represent transactions.
    One transaction per virtual stream.  The virtual stream layer is
    controlled by the journal_rawrecbeg and journal_rawrecend structures
    (designed so a utility program can scan the journal forwards or
    backwards). 
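
    To illustrate why a begin/end pair lets a scanner walk in either
    direction, here is a minimal sketch.  Everything in it is an assumption
    made for illustration: the hyp_rawbeg/hyp_rawend layouts, the magic
    values, and the rule that both structures carry the same total record
    size are hypothetical and are NOT copied from sys/journal.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts, for illustration only (not sys/journal.h). */
struct hyp_rawbeg {
    uint16_t magic;     /* HYP_BEGMAGIC */
    uint16_t streamid;  /* virtual stream this block belongs to */
    uint32_t recsize;   /* total size: header + payload + trailer */
    uint64_t transid;   /* transaction id */
};

struct hyp_rawend {
    uint16_t magic;     /* HYP_ENDMAGIC */
    uint16_t check;     /* unused in this sketch */
    uint32_t recsize;   /* same value as in the header */
};

#define HYP_BEGMAGIC 0x1234u    /* made-up magic values */
#define HYP_ENDMAGIC 0xCDEFu
#define HYP_MINREC   (sizeof(struct hyp_rawbeg) + sizeof(struct hyp_rawend))

/* Forward step: offset of the record following the one at 'off', or -1. */
static int64_t
hyp_next(const uint8_t *buf, size_t len, size_t off)
{
    struct hyp_rawbeg b;

    if (off > len || len - off < sizeof(b))
        return -1;
    memcpy(&b, buf + off, sizeof(b));
    if (b.magic != HYP_BEGMAGIC || b.recsize < HYP_MINREC ||
        b.recsize > len - off)
        return -1;
    return (int64_t)(off + b.recsize);
}

/*
 * Backward step: 'off' is where a record starts (or the end of the
 * journal).  The trailer of the previous record sits just before it and
 * says how far to step back.  Returns that record's start, or -1.
 */
static int64_t
hyp_prev(const uint8_t *buf, size_t len, size_t off)
{
    struct hyp_rawend e;

    if (off > len || off < sizeof(e))
        return -1;
    memcpy(&e, buf + off - sizeof(e), sizeof(e));
    if (e.magic != HYP_ENDMAGIC || e.recsize < HYP_MINREC ||
        e.recsize > off)
        return -1;
    return (int64_t)(off - e.recsize);
}
```

    A scanner would call hyp_next() starting from offset 0, or hyp_prev()
    starting from the journal length, until either one returns -1.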

    The second layer is a recursive record layer controlled by the
    journal_subrecord structure.  Each transaction may contain a hierarchy
    of subrecords representing all the information required to understand
    and/or UNDO the transaction.  So, for example, a file creation will
    have a JTYPE_CREATE subrecord which contains a number of other subrecords
    (JLEAF_PATH1, JLEAF_MODES, JLEAF_UID, JLEAF_GID, etc), and even other
    non-leaf nodes (JTYPE_UNDO).  All records (with one exception) contain
    the actual record size but are always physically 16-byte aligned.
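
    As a small worked example of the alignment rule: a scanner that reads
    a stated record size has to round it up to a multiple of 16 to find
    where the next record physically starts.  The names HYP_JREC_ALIGN and
    hyp_physical_size are made up for this sketch.

```c
#include <stdint.h>

/* Hypothetical name for the 16-byte physical alignment described above. */
#define HYP_JREC_ALIGN 16u

/* Round a stated record size up to the next multiple of HYP_JREC_ALIGN. */
static uint64_t
hyp_physical_size(uint64_t recsize)
{
    return (recsize + (HYP_JREC_ALIGN - 1)) &
        ~(uint64_t)(HYP_JREC_ALIGN - 1);
}
```

    So a record whose stated size is 37 bytes occupies 48 bytes in the
    journal, and one whose stated size is exactly 32 occupies 32.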

    There are three gotchas.  First, the high level virtual stream may
    break up the subrecords.  The stream transaction block must be
    reconstructed before the subrecords can be scanned.  Second, a NON-LEAF
    journal subrecord may have a record size of 0, which means the utility
    program has to recurse through it to figure out how big it actually is.
    The recsize field is mandatory for LEAF subrecords so you don't have
    to worry about those.  This occurs because some transactions exceed 
    the size of the memory fifo and must be flushed out before the journaling
    code knows how large the subrecord is!  The last gotcha is that the high
    level stream representing a transaction may be aborted at the stream
    level, right smack in the middle of the subrecord transaction.  Scanning
    code must understand that the stream block may be 'truncated' relative
    to the record sizes indicated by the subrecords.  This occurs
    if the journal is in the middle of a transaction and then determines
    that the operation has, in fact, failed, e.g. due to the VFS op failing.
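
    The second and third gotchas can be sketched roughly as follows, for a
    stream block that has already been reassembled into one buffer.  This
    is only a sketch under loud assumptions: the hyp_subrecord layout (a
    16-byte header), the HYP_NESTED flag, and in particular the HYP_LAST
    flag used to find the end of a non-leaf record whose recsize is 0 are
    all hypothetical and are not taken from sys/journal.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flags and layout, for illustration only. */
#define HYP_NESTED 0x8000u  /* record contains subrecords (non-leaf) */
#define HYP_LAST   0x4000u  /* last record at this nesting level */
#define HYP_ALIGN  16u

struct hyp_subrecord {
    uint16_t rectype;   /* type code plus the flags above */
    uint16_t reserved;
    uint32_t recsize;   /* header + payload; may be 0 for non-leaf */
    uint64_t pad;       /* keeps the header at 16 bytes in this sketch */
};

/*
 * Returns the physical size of the record at buf[off], counting leaf
 * records in *leaves.  Returns -1 if the block ends before the record
 * does, which is how an aborted (truncated) transaction shows up here.
 */
static int64_t
hyp_scan(const uint8_t *buf, size_t len, size_t off, int *leaves)
{
    struct hyp_subrecord h, c;
    size_t pos;
    uint64_t phys;
    int64_t n;

    if (off > len || len - off < sizeof(h))
        return -1;
    memcpy(&h, buf + off, sizeof(h));

    if ((h.rectype & HYP_NESTED) == 0) {
        /* Leaf: recsize is mandatory, round it up to the alignment. */
        if (h.recsize < sizeof(h))
            return -1;
        phys = ((uint64_t)h.recsize + HYP_ALIGN - 1) &
            ~(uint64_t)(HYP_ALIGN - 1);
        if (phys > len - off)
            return -1;
        ++*leaves;
        return (int64_t)phys;
    }

    /*
     * Non-leaf: recsize may be 0, so the size is always computed by
     * walking the children until one carries HYP_LAST.
     */
    pos = off + sizeof(h);
    do {
        n = hyp_scan(buf, len, pos, leaves);
        if (n < 0)
            return -1;
        memcpy(&c, buf + pos, sizeof(c));
        pos += (size_t)n;
    } while ((c.rectype & HYP_LAST) == 0);

    return (int64_t)(pos - off);
}
```

    The recursion depth is bounded by the block length divided by 16, so a
    real utility scanning gigabyte-sized transactions would likely want an
    iterative walk with an explicit stack instead.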

    Yes, it's sophisticated, but the journal must be capable of doing
    sophisticated things and has other requirements: multiple processes
    building transactions at the same time, transactions being
    potentially *huge* (gigabytes, e.g. if you do a 1GB write(), that's a
    gigabyte-sized transaction!), having to store UNDO data, wanting to
    make the format extensible, and so forth.

    I decided not to go for an ultra-compact format because I believe that
    can be done even better using e.g. a gzip layer.

					-Matt
					Matthew Dillon 
					<dillon@xxxxxxxxxxxxx>

Follow-Ups:
- Re: Journaling layer update - any really good programmer want to start working on the userland journal scanning utility ? You need to have a LOT of time available!
  - From: Miguel Mendez
