DragonFly kernel List (threaded) for 2004-01
This is the outline of how to implement a background fsck for DragonFly.
DR, this is somewhat more mature than what I wrote on Friday.
ATM DragonFly forces a full fsck before allowing a non-forced read-write
mount of an unclean filesystem. Since these filesystem checks can take quite
a long time, alternatives are highly desired.
The first alternative is using a journal for all meta-data updates. That's
what most Linux filesystems are doing. The advantage is almost no time
needed to bring an uncleanly unmounted filesystem back into a working state.
The disadvantage is a steady slowdown for _all_ meta-data updates.
The second alternative is provided by FreeBSD 5. It uses the soft updates
code to provide a consistent filesystem even in case of power failure etc.
Therefore you can mount a filesystem with soft updates instantly. Without
further processing the only disadvantage is some missing space. In detail,
the softdep code guarantees that the only inconsistencies on the fs are
free blocks and inodes still marked as in use. To garbage collect those
resources FreeBSD 5 uses the filesystem snapshot mechanism to provide
a consistent and stable view of the filesystem for fsck. This allows the
filesystem to be used without any performance penalty as soon as the
background fsck has completed.
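To illustrate the softdep guarantee: the resources to reclaim are exactly
those marked as in use on disk but not referenced from the filesystem tree.
A minimal sketch in C, assuming both sets are available as in-memory bitmaps
where a set bit means "in use" resp. "referenced". The names leaked_bits and
bit_isset are made up for illustration and are not part of FFS or fsck.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: test bit n in a byte-array bitmap. */
static int bit_isset(const uint8_t *map, size_t n)
{
    return (map[n / 8] >> (n % 8)) & 1;
}

/*
 * Collect resources (blocks or inodes) that are marked in use on disk
 * but were never referenced during the scan. With soft updates these
 * should be the only inconsistencies left after a crash.
 * Returns the total number of leaked entries; at most max of them are
 * stored in out[].
 */
static size_t leaked_bits(const uint8_t *in_use, const uint8_t *referenced,
                          size_t nbits, size_t *out, size_t max)
{
    size_t found = 0;

    assert(in_use != NULL && referenced != NULL);
    for (size_t n = 0; n < nbits; n++) {
        if (bit_isset(in_use, n) && !bit_isset(referenced, n)) {
            if (found < max)
                out[found] = n;
            found++;
        }
    }
    return found;
}
```

The hard part is not this comparison but getting a stable "referenced" set
while the filesystem is in use, which is what the snapshot provides.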
While the snapshot code is useful for other things as well, e.g. backups,
it is IMO far too general for this special application. Esp. since snapshots
are persistent across reboots and therefore disk backed they can result
in quite some unnecessary I/O load.
First of all, the "clean" requirement for a read-write mount of a softdep fs
should be dropped. This is what FreeBSD 5 already does.
Second, add the functionality to free a block, fragment or inode by number
or to adjust the reference count of an inode.
Third, instrument certain FFS functions to notify the userland fsck of
updates to the filesystem structure. This is further detailed in the next
section.
The background fsck
The steps to scan the filesystems are:
1. Set message port for block/fragment freeing and fragment allocation.
2. Read the block/fragment bitmaps from disk
2a. Mark blocks/fragments as active when such a message arrives
3. Set message port for inode freeing and directory updates (link, unlink, rename)
4. Notify kernel root directory is being read
5. Read root directory and keep inode,entry pairs and the reference counts
for the inodes
6. Notify kernel root directory scan finished
7. Process messages for the updates of the root directory in reverse
chronological order. Update reference counts accordingly.
8. Continue for other directories from 4 on. Keep list of directories
visited so far.
8a. Update reference counts for changes to already visited directories.
9. Lock first of the visited inodes and compare reference counts,
update if necessary.
10. Read block and fragment list for this inode, mark them as active
11. Let the kernel free unreferenced inodes.
12. Let the kernel free used but inactive blocks and fragments
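Steps 8a, 9 and 11 could look roughly like the following C sketch. All names
(struct dirmsg, apply_dirmsg, reconcile) and the fixed-size tables are
hypothetical, since no message format is defined yet. It also assumes that a
rename is delivered as an unlink plus a link, and that changes to directories
not yet visited can be ignored because they are picked up when the directory
is read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INODES 64   /* toy table size for this sketch */

/* Hypothetical message types for directory updates. */
enum dirmsg_type { DIRMSG_LINK, DIRMSG_UNLINK };

struct dirmsg {
    enum dirmsg_type type;
    uint32_t dir_ino;      /* directory that was changed */
    uint32_t target_ino;   /* inode the entry refers to */
};

struct scan_state {
    uint8_t  visited[MAX_INODES];   /* 1 if the directory was already read */
    uint32_t refcount[MAX_INODES];  /* references seen so far */
};

/*
 * Step 8a: only changes to already visited directories are applied.
 * Unvisited directories are read in their current state later on.
 */
static void apply_dirmsg(struct scan_state *st, const struct dirmsg *m)
{
    assert(m->dir_ino < MAX_INODES && m->target_ino < MAX_INODES);
    if (!st->visited[m->dir_ino])
        return;
    if (m->type == DIRMSG_LINK)
        st->refcount[m->target_ino]++;
    else if (st->refcount[m->target_ino] > 0)
        st->refcount[m->target_ino]--;
}

enum inode_action { INO_OK, INO_ADJUST, INO_FREE };

/*
 * Steps 9 and 11: compare the count seen in the tree with the link
 * count stored in the inode and decide what to ask the kernel for.
 */
static enum inode_action reconcile(uint32_t seen, uint32_t on_disk)
{
    if (seen == 0)
        return INO_FREE;
    return seen == on_disk ? INO_OK : INO_ADJUST;
}
```

The sketch leaves out the locking from step 9; without it the comparison
would race with new messages for the same inode.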
The steps 2a and 8a are repeated and done in the background. The
steps 9 and 10 can be done when an inode is first visited. The order
and notification should be enough to get all references and active
resources located. For the bitmap code, active and free blocks can
be considered equivalent. Fragments need more attention than blocks since
they can extend and move.
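Since active and free can be treated alike, the userland side only needs one
bit per fragment: "marked used on disk, not yet seen active". A rough sketch
in C, with hypothetical names (struct fragmap, mark_active, count_candidates)
and a toy fixed size. It assumes that the kernel reports a fragment extension
or move as messages covering the affected ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NFRAGS 64   /* toy filesystem size in fragments */

/*
 * One bit per fragment, set while the fragment is marked used on disk
 * (step 2) but has not been seen active yet.
 */
struct fragmap {
    uint8_t candidate[NFRAGS / 8];
};

/*
 * Clear the candidate bit for count fragments starting at start.
 * Called when the inode scan finds the range referenced (step 10) or
 * when a free/allocation message for the range arrives (step 2a).
 */
static void mark_active(struct fragmap *fm, size_t start, size_t count)
{
    assert(start <= NFRAGS && count <= NFRAGS - start);
    for (size_t n = start; n < start + count; n++)
        fm->candidate[n / 8] &= (uint8_t)~(1u << (n % 8));
}

/*
 * Step 12: whatever is still a candidate after the scan is used but
 * inactive. A real tool would hand these to the kernel for freeing;
 * the sketch only counts them.
 */
static size_t count_candidates(const struct fragmap *fm)
{
    size_t left = 0;

    for (size_t n = 0; n < NFRAGS; n++)
        left += (fm->candidate[n / 8] >> (n % 8)) & 1;
    return left;
}
```

A fragment that moves then clears its old range through the free message and
its new range through the allocation message, so neither ends up on the list
for step 12.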
This scheme has the advantage of needing no additional I/O besides
reading the complete filesystem tree. The additional messages should
provide a small overhead, but not larger than having a snapshot in