Re: lkwt in DragonFly
First, Bosko, thanks for asking your question! You really got me
thinking about our token implementation. I've been whining about it
being broken on our mailing lists for quite a while and as I was
formulating a reply to your email I had an epiphany. So I hope you
don't mind that I've CC'd your email and my reply to the kernel@ list!
: Reading this:
:Token Passing Primitives Not Mutexes
: Do you intend to have a single pcpu token on a given cpu protecting
: all data structures (or a class of data structures)? The tradeoff of
: the lesser-granularity vs the admittedly much lower cost for a common
: token acquisition may or may not be worth it in the end. (The other
: part to the tradeoff is that you can never go with full-preemption).
: I did understand this correctly, right?
:Bosko Milekic * bmilekic@xxxxxxxxxxxxxxxx * bmilekic@xxxxxxxxxxx
Not entirely. First, please note that the current token implementation
does not work very well... it's a mess to use because cpu #B can steal
away cpu #A's token by sending cpu #A an IPI message requesting the token.
This means cpu #A has to be in a critical section to 'protect' the token,
and must re-acquire the token (in case it was lost) after it blocks.
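To make the problem concrete, here is a minimal sketch of the usage
pattern the current code forces. The crit_enter()/crit_exit() and
lwkt_gettoken()/lwkt_reltoken() calls follow the existing interfaces,
but the per-vnode token field and the exact re-acquisition shown are
just illustration, not actual code:

	/*
	 * Illustrative sketch of the CURRENT (broken) pattern.  Because
	 * another cpu can pull the token away with an IPI at almost any
	 * time, the holder has to wrap its use in a critical section,
	 * and after any blocking call it must assume the token was lost.
	 * (vp->v_token is a hypothetical per-vnode token.)
	 */
	crit_enter();                        /* hold off token-stealing IPIs */
	lwkt_gettoken(&vp->v_token);
	/* ... use the structures the token protects ... */
	if (need_to_block) {
		tsleep(vp, 0, "tokwt", 0);   /* token may be stolen while asleep */
		lwkt_gettoken(&vp->v_token); /* so it must be re-acquired ...    */
		/* ... and anything examined before the block must be rechecked  */
	}
	lwkt_reltoken(&vp->v_token);
	crit_exit();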
After thinking about this quite a bit I have come up with a solution
that, in a moment of enlightenment, I realized would make our tokens
operate like self-contained little RCU implementations (superior to
Linux's implementation, in fact!).
I intend to fix our tokens by not allowing an IPI to steal a token
until the owning cpu has gone through a thread switch or has explicitly
released it. This means that, once fixed, our token implementation will
allow us to abstract RCU-like operations, which I think is very
important. It is not what I originally intended tokens to be
but it is turning out to be what they should be.
The nice thing about the revamp I intend to do on the token code is that
the RCU abstraction will wind up being a lot better than Linux's RCU
abstraction, because it will be compartmentalized. Only cpus competing
for the same token are affected rather than all cpus in the system.
We would not have the serialization problem that Linux has.
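As a sketch of what that compartmentalized, RCU-like usage could look
like (the token, list, and field names below are invented for
illustration; only the lwkt_gettoken()/lwkt_reltoken() calls reflect
the real interface):

	#include <sys/queue.h>

	struct pcb {
		LIST_ENTRY(pcb) pcb_entry;
		int             pcb_id;
		/* ... */
	};

	LIST_HEAD(, pcb) pcb_list;          /* invented example list     */
	struct lwkt_token pcb_token;        /* token 'covering' the list */

	/*
	 * Once this cpu holds pcb_token, the revamped rules say no other
	 * cpu can pull it away until we block or release it, so the list
	 * walk below sees a stable structure without any per-element
	 * locking.  Only cpus that also want pcb_token ever serialize
	 * against this -- that is the compartmentalization.
	 */
	static struct pcb *
	pcb_lookup(int id)
	{
		struct pcb *p;

		lwkt_gettoken(&pcb_token);
		LIST_FOREACH(p, &pcb_list, pcb_entry) {
			if (p->pcb_id == id)
				break;
		}
		lwkt_reltoken(&pcb_token);
		return (p);    /* caller must re-validate if it blocks later */
	}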
Tokens will be used in a manner similar to how mutexes are used in
FreeBSD-5, except that in our case (with the revamped token code)
DFly will be able to hold any number of tokens across a blocking
situation without creating deadlocks (but also without guaranteeing
atomicity across the blocking situation). I think this is exactly the
sort of abstraction we need. Mutexes in FBSD-5 and tokens in DFly
are primarily used as interlocks on structures, for which no blocking
occurs, or in situations where the blocking condition is well managed
(e.g. lockmgr). I feel that my new token (RCU-like) implementation
will wind up being far superior to FreeBSD-5's mutex implementation
for 90% of these situations and it will not suffer from the deadlock
potential that mutexes have because it will not guarantee atomicity
across a blocking call.
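A sketch of what holding tokens across a blocking call might look like
under the revamped code (the tokens, flag, and wait channel are
invented for this example):

	/*
	 * Two tokens held across a blocking operation.  With the revamped
	 * tokens this cannot deadlock -- blocking simply lets the tokens
	 * migrate to whoever needs them -- but atomicity is NOT preserved
	 * across the sleep, so the condition is re-tested on wakeup.
	 */
	lwkt_gettoken(&vp->v_token);          /* invented per-vnode token */
	lwkt_gettoken(&mp->mnt_token);        /* invented per-mount token */

	while (vp->v_flag & VBUSYEXAMPLE) {   /* invented flag */
		tsleep(vp, 0, "tokblk", 0);   /* tokens may move while asleep */
		/* when we run again the tokens are ours, but the data may
		   have changed, hence the re-test in the loop condition */
	}
	/* ... do the work, then release ... */
	lwkt_reltoken(&mp->mnt_token);
	lwkt_reltoken(&vp->v_token);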
Also note that several major DragonFly subsystems (LWKT scheduler and
SLAB allocator) are totally mutex & token free and things like the
network stack will also be mostly mutex & token free because it will
assign PCBs to particular threads (so operations on those PCBs are only
done by a particular thread which in turn means no locking of the PCB is
required at all).
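For example, a PCB bound to a protocol thread might only ever be
touched through messages delivered to that thread. A rough sketch (the
message structure and handler are invented; only lwkt_replymsg() is
meant to reflect the LWKT message interface):

	/*
	 * Sketch of the per-thread PCB model: each PCB belongs to exactly
	 * one protocol thread, and every operation on it is delivered to
	 * that thread as a message, so the PCB needs no token or mutex.
	 * (pcb_op_msg and pcb_op_handler are invented for this example.)
	 */
	struct pcb_op_msg {
		struct lwkt_msg op_msg;      /* standard LWKT message header */
		struct inpcb    *op_pcb;     /* target PCB                   */
		int             op_cmd;      /* operation to perform         */
	};

	/*
	 * Runs only in the thread that owns op_pcb, so plain unlocked
	 * loads and stores on the PCB are safe -- no other thread ever
	 * touches it.
	 */
	static void
	pcb_op_handler(struct pcb_op_msg *msg)
	{
		struct inpcb *inp = msg->op_pcb;

		/* ... operate on inp directly, no locking required ... */
		lwkt_replymsg(&msg->op_msg, 0);
	}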
The performance trade-off is basically the ability of the cpu owning a
token to operate on the token without having to use bus-locked
instructions, versus having to send an IPI message to the owning cpu
when the requesting cpu does not own the token. This is not the major
reason for using tokens, though. Still, judging by the benchmark
comparisons we've done between 4.x and DFly, the overhead of a 'miss'
appears to be very small. It turns out that most locks (like vnode locks)
hit the same-cpu case a *LOT* more than they hit the foreign-cpu case
so the overall benefit is at least equivalent to FreeBSD-5's mutex
overhead. It certainly isn't worse. We do lose when it comes to
IPC operations between two processes running on different CPUs, but
if this is the only thing that winds up being a problem it is a simple
enough case to fix.
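In rough pseudo-C the cost split looks something like this (the t_cpu
field, the token_request_ipi() helper, and the overall shape are
assumptions about the internals, not the actual code):

	/*
	 * Pseudo-C for the acquisition cost described above.  When the
	 * requesting cpu already owns the token, acquisition is a few
	 * plain, non bus-locked memory operations; only a miss pays for
	 * an IPI round trip to the owning cpu.
	 */
	void
	token_acquire(struct lwkt_token *tok)
	{
		struct globaldata *gd = mycpu;  /* per-cpu data, DFly style */

		if (tok->t_cpu == gd)
			return;                 /* fast path: already owned  */

		/* slow path: ask the owning cpu to hand the token over */
		token_request_ipi(tok->t_cpu, tok);  /* hypothetical helper */
		while (tok->t_cpu != gd)
			;                       /* spin or switch away until it arrives */
	}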
However, the major reason for using tokens is that it creates an
abstraction that allows us to execute many types of operations as if
the system had only one cpu. Since a token is owned
by a cpu and not a thread, if thread #1 obtains a token and blocks,
thread #2 running on the same cpu can obtain that same token optimally.
Likewise, thread #3 running on a different cpu can 'steal' the token
(but it only gets transferred on a scheduler boundary of the first cpu
once I fix the current token code). We also have the advantage of being
able to use this abstraction heavily without compromising UP performance,
because on a UP box a blocking situation implies a scheduler switch
and since tokens are per-cpu... well, the whole thing devolves into a NOP.
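On UP this can devolve to almost nothing, e.g. something shaped like
the following (illustrative only, not the actual source layout):

	/*
	 * Why the abstraction is ~free on UP: with a single cpu the local
	 * cpu always owns every token, and any blocking point is already
	 * a scheduler switch, so acquisition and release can compile down
	 * to (nearly) nothing.  This #ifdef shape is illustrative.
	 */
	#ifdef SMP
	void	lwkt_gettoken(struct lwkt_token *tok);  /* cross-cpu logic */
	void	lwkt_reltoken(struct lwkt_token *tok);
	#else
	#define lwkt_gettoken(tok)	((void)(tok))   /* UP: trivially owned */
	#define lwkt_reltoken(tok)	((void)(tok))
	#endif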
I have some work to do to move my current token implementation into this
new paradigm, but I do not expect it to be difficult. For example,
if cpu #B steals a token from cpu #A then the threads that own the token
on cpu #A can't be scheduled to run again until cpu #B is through with
the token. This means actively accounting for and structurally linking
the dependency information.
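As a sketch of the bookkeeping this implies (the td_toks list and the
check below are invented to illustrate the idea, not real structures):

	/*
	 * Dependency accounting sketch: a thread holding a token that is
	 * currently on loan to another cpu must not be scheduled until
	 * the token comes home.
	 */
	static int
	thread_tokens_present(struct thread *td)
	{
		struct lwkt_token *tok;

		LIST_FOREACH(tok, &td->td_toks, t_entry) {
			if (tok->t_cpu != mycpu)
				return (0);     /* still held by another cpu */
		}
		return (1);                     /* all tokens home; ok to run */
	}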
Whew! That may seem complex, and in a mutexed environment it would be
very complex. But remember in DFly the whole basis of using this
abstraction is to be able to operate almost as if one were on a single
cpu even when one is on a 16-cpu system, so the actual structural
accounting does not really have to worry about interrupts or stale data
or other cpus. A critical section is sufficient to operate on token meta
data on the owning cpu, and we can actually defer structural linkages
within tokens we own until a thread switch occurs (which won't happen
in most cases since, like a mutex, a token is usually obtained and
released without blocking).
I think I rambled on there. Please feel free to ask additional questions.