DragonFly commits List (threaded) for 2005-08
Re: cvs commit: src/sys/kern vfs_cache.c vfs_syscalls.c vfs_vnops.c vfs_vopops.c src/sys/sys namecache.h stat.h
:On Thu, Aug 25, 2005 at 01:17:55PM -0700, Matthew Dillon wrote:
:> I don't know about imon under Linux, but kqueue on the BSDs doesn't
:> even come *close* to providing the functionality needed, let alone
:> providing us with a way to monitor changes across distinct invocations.
:> Using kqueue for that sort of thing is a terrible idea.
:Trying to reliably detect what changed between two invocations based on
:any kind of transaction ID is a very bad idea, because it adds a lot
:of *very* nasty issues. I'll comment on some of them later.
Joerg, please keep in mind that I've written a database. A real one.
From scratch. The whole thing. The site is defunct but the white paper
is still there:
So before you start commenting on what you think is wrong, read
that paper and then you will understand how well I understand
cache coherency and transactional database algorithms.
:> I'm not sure I understand what you mean about not reaching all
:> parent directories. Perhaps you did not read the patch set. It
:> most certainly DOES reach all parent directories, whether they've been
:> read or not. That's the whole point. It goes all the way to '/'.
:It only handles vnode changes for entries which are already in the name
:cache. So it is incomplete. It can't behave otherwise without keeping
:the whole directory tree in memory, but that doesn't solve the problem.
The namecache is fully coherent. If a vnode exists for a file that
has not been unlinked, its namecache entry will exist. That is part
of the namecache work I did for DragonFly. It isn't true in FreeBSD,
it IS true in DragonFly. If a file HAS been unlinked, well, it's no
longer in the filesystem, is it? So we don't care any more.
The entire directory tree does not need to be in memory, only the
pieces that lead to (cached) vnodes. DragonFly's namecache subsystem
is able to guarantee this.
:> to just the elements that have changed, and to do so without having to
:> constantly monitor the entire filesystem.
:Monitoring filesystems is not always a good idea. Allowing any
:user/program to detect activity in a subtree of the system can help to
:detect or circumvent security measures.
I'm not particularly worried about such a coarse detection method,
but it could always be restricted to root if need be (like st_gen is).
It's hardly a reason to not do something.
:> The methodology behind the transaction id assignments can make this
:> a 100% reliable operation on a *RUNNING*, *LIVE* system. Detecting
:> in-flight changes is utterly trivial.
:On a running system, it is enough to either get notification when a
:certain vnode changed (kqueue model) or when a vnode changed (imon /
:dnotify model). Trying to detect in-flight changes is *not* utterly
:trivial for any model, since even accurate atime is already difficult to
:achieve for mmapped files. Believing that you can *reliably* back up a
:system based on VOP transactions alone is therefore a dream.
This is not correct. It is certainly NOT enough to just be told
when an inode changes... you need to know where in the namespace
the change occurred and you need to know how the change(s) affect
the namespace. Just knowing that a file with inode BLAH has been
modified is not nearly enough information.
Detecting in-flight changes is trivial. You check the FSMID before
descending into a directory or file, and you check it after you ascend
back out of it. If it has changed, you know that something changed
while you were processing the directory or file, and you simply re-recurse
down and rescan just the bits that now have different FSMIDs.
It's a simple recursive algorithm and it works just fine. If a file
is changing a whole lot, such that you can't take a snapshot before
it changes again, then you need finer-grained information... which is
exactly what the journal part of the system is capable of giving you.
But even without that, the program attempting to do the backup is fully
able to detect that there might be a potential problem and can either
try to brute-force its way through (retry), or do something smarter,
or just report the fact. It's a lot better than you get with 'dump',
that's for sure.
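The check-before/check-after retry loop described above can be sketched in C. This is a toy model: the structure, field, and function names are illustrative only, not DragonFly's actual namecache or VFS interfaces, and the "concurrent writer" is simulated by a callback that fires once mid-scan.

```c
#include <stddef.h>

/* Toy model of an FSMID-stamped directory (illustrative names only).
 * A concurrent writer, if present, fires exactly once while we are
 * "inside" the directory, standing in for real in-flight activity. */
struct dirnode {
    long fsmid;                        /* bumped on any change underneath */
    void (*mutator)(struct dirnode *); /* simulated concurrent writer */
};

/* Simulated scan (copy) of the directory's contents. */
static void scan(struct dirnode *d)
{
    if (d->mutator) {
        void (*m)(struct dirnode *) = d->mutator;
        d->mutator = NULL;             /* fire at most once */
        m(d);
    }
}

/* Check the FSMID before descending and again after ascending; if it
 * moved, something changed in-flight, so rescan. Returns the number
 * of rescans needed to obtain a stable snapshot. */
static int snapshot(struct dirnode *d)
{
    int rescans = 0;
    for (;;) {
        long before = d->fsmid;
        scan(d);
        if (d->fsmid == before)
            return rescans;            /* stable: snapshot is consistent */
        ++rescans;                     /* changed in-flight: re-recurse */
    }
}
```

A real backup program would additionally cap the retry count and fall back to journal data (or just report the file as hot) instead of looping forever on a rapidly changing file.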
:> Nesting overhead is an issue, but not a big one. It's a very solvable
:> problem and certainly should not hold up an implementation. The only
:> real issue occurs when someone does a write() vs someone else stat()ing
:> a directory along the parent path. Again, very solvable and certainly
:> not a show stopper in any way.
:It is a big issue, because it is not controllable. With kqueue,
:imon, and dnotify it can be done *selectively* for filesystems where it
:is needed and wanted. Even my somewhat small filesystems already have
:over a million inodes. Just trying to read them would already create a
:lot more (memory) IO just to update the various atimes.
And it's utterly trivial to do the same with FSMIDs: because they are
based on the namecache topology, all we need is a mechanism that flags
the part of the topology we care about IN the topology itself.
The namecache records in question (representing the base of the
hierarchy being recorded) are then simply locked into the system
and all children of said records inherit the flag. Poof, done.
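The inherit-a-flag-through-the-topology idea can be sketched like this. NCF_MONITOR and the entry layout here are made up for illustration; they are not DragonFly's actual struct namecache or its flag names:

```c
#include <stddef.h>

#define NCF_MONITOR 0x0001  /* hypothetical: subtree selected for FSMID updates */

/* Toy namecache entry: just a parent link and flag word. */
struct ncent {
    struct ncent *parent;
    int flags;
};

/* When a child entry is attached under a parent, it inherits the
 * monitor flag, so flagging one directory covers its whole subtree
 * with no extra per-entry bookkeeping. */
static void ncent_attach(struct ncent *child, struct ncent *parent)
{
    child->parent = parent;
    child->flags = parent ? (parent->flags & NCF_MONITOR) : 0;
}

/* A write() only pays the FSMID-update cost along the parent chain
 * when the entry sits inside a flagged subtree. */
static int ncent_tracked(const struct ncent *nc)
{
    return (nc->flags & NCF_MONITOR) != 0;
}
```

In a real implementation the flagged base records would also be locked into the cache, as described above, so the inherited flag can never be lost to an entry being reclaimed.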
But the plain fact of the matter is that for 99.9% of the installations
out there, doing it globally will not result in any significant
performance impact. We are talking about a few hundred nanoseconds
per write(), even with a deep hierarchy. And as I indicated earlier,
the performance issues are a very solvable problem anyway, so it is
hardly something you can hold up and say 'we can't do it because of this'.
:> is certainly entirely possible to correctly implement the desired
:As soon as you try to make it persistent you add the problem of how
:applications should behave after a reboot. Since you mentioned
:backups, let's just discuss that. The backup program reads a change
:just after it was made by a program, but before it has hit the disk. The
:FSMID it sends to the backup server is therefore nowhere recorded on
:disk (and doing that would involve quite a lot of performance
:penalties). Now the machine is "suddenly" restarting. Can you ensure
:that the same FSMID is not reused, in which case the state of the
:filesystem and the state of the backup is inconsistent? Sure, the
:program can try to detect it, but that would make the entire concept of
These are all fairly trivial problems. Yes, in fact you can *EASILY*
determine that an FSMID hasn't made it to disk, simply by adopting
a database-style transaction ID (which that white paper discusses a
bit, I believe). In fact, if we did not have a transaction ID
recorded in the filesystem (which we don't right now), it would be
very difficult to resynchronize something like the live journal with
the on-disk filesystem.
Think of transaction IDs... the FSMIDs we would record on the disk...
as being snapshot IDs. They don't tell us exactly what is on the disk,
but they give us a waypoint which tells us that all transactions occurring
<= the recorded FSMID are *definitely* on the disk. By comparing those
IDs against the IDs we store in the journal transactions, we can in fact
very easily determine which transactions MAY NOT have made it to disk,
and rerun the journal from point A to point B for the affected elements.
That is a whole lot more robust than anything else that currently exists
for any filesystem, and in fact I believe it would allow us to provide
recovery guarantees based solely on the journal data being
acknowledged, rather than having to wait for the filesystem to catch
up to the journal. That's a big deal.
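As a sketch of that waypoint comparison (a toy journal, not the real journaling code): every record whose FSMID is <= the one recorded on disk is known to be safe, so replay begins at the first record past the waypoint.

```c
#include <stddef.h>

/* Toy journal record carrying the FSMID of the transaction it logs.
 * Records are assumed to be in increasing FSMID order. */
struct jrec {
    long fsmid;
};

/* The on-disk FSMID is a waypoint: all transactions with fsmid <= it
 * are *definitely* on the disk. Return the index of the first journal
 * record that may not have made it, i.e. where replay must begin
 * (n if the whole journal is already covered). */
static size_t journal_replay_start(const struct jrec *j, size_t n,
                                   long disk_fsmid)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        if (j[i].fsmid > disk_fsmid)
            break;
    }
    return i;
}
```

Note the guarantee is deliberately one-sided: records past the waypoint only *may* be missing, so replaying them is always safe, it is at worst redundant work.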
For example, softupdates right now is not able to guarantee data
consistency. If you crash while writing something out, then on reboot
you can wind up with some data blocks full of zeros, or full of old
data, while other data blocks contain new data. The FSMID for the file
would tell us how far back in the journal we need to go to recover
the potentially missing data. There is nothing else that gives us that
information, nothing else that tells us how far back in the journal
we would have to go to recover all the lost data.