DragonFly kernel List (threaded) for 2010-06
Re: HEADS UP - major structure size changes in HEAD
:I fully agree that tokens are a more convenient API than, say, lockmgr
:locks, but AFAIK nobody has investigated their behavior under heavy
:contention (I suspect you have Matt, so please share any information you
:might have collected).
Tokens and the big-giant-lock behave the same way when under contention.
Both basically call lwkt_switch(). The LWKT scheduler then essentially
spins looking for schedulable threads until it finds one for which it
can restore the state of both the MP lock and the token(s).
:Right now, the lwkt_switch() routine is responsible for getting all
:tokens that a thread needs to run. If it can't get them (because some
:token is owned by a thread running on a different cpu), it moves on to
:the next runnable thread.
:However, lwkt_getalltokens() tries to get the tokens in the same order
:that they were acquired by the thread, which means that if you lose a
:race to another thread that needs your Nth token, you just wasted time
:on taking (N - 1) tokens, possibly obstructing other threads that could
:use those tokens to move forward. I admit that in practice most threads
:will take the same tokens in the same order, so I don't expect that to
:be a big issue. Some concrete data would be nice though :)
Well, theoretically getting the tokens in reverse order would more
efficiently locate the contending token, but only if the token is
being held for a long period of time by the other cpu. If it is only
being held for a short period of time (which is the normal case) then
actually having the other cpu get its tokens in forward order can be
more efficient because it adds a slight delay which allows the current
cpu to finish its short-run code and release the token before the
contending cpus try to acquire it again.
Another problem with getting the tokens in reverse order is the
potential for temporary livelocks when several cpus are trying to
get the same set of tokens and some of them are doing it from mainline
code in forward order while others have already contended and are doing
it (in reverse order) from the scheduler. This can cause them to meet
in the middle and cause ALL the cpus to contend (nobody gets the full set
of tokens). This is temporary because eventually all the cpus hitting
this situation wind up in the scheduler and try to get the tokens in
the same order. Plus other timing considerations prevent livelocks
(though to be fair maybe I should add a contention counter to lwkt_switch
and start adding some random delays if it takes too long to find a
If you think about it the meet-in-the-middle collision can create a
horrible cascading effect on systems with a lot of cpus, where everyone
contends until all cpus in question wind up in the scheduler.
:Another thing that makes me sceptical about token scalability is that,
:AFAICT, the scheduler needs to try to acquire them *every time*. So if
:you have M threads on a cpu all waiting for, say, the VM token, then
:while it is held on another cpu you'll try (and fail) to get that token
:M times every time you lwkt_switch(). This is time you could have spent
:running a task that potentially *could* get some work done (nevermind
:the cmp. That is in contrast to sleep locks where you only become
:runnable when a lock you wanted is released.
:I'm sure you've thought about all this; I suppose you're seeing tokens
:as a good trade off between simplicity and scalability?
Well, yes and no. Yes it is spinning, but no it doesn't really reduce
efficiency all that much because if no other kernel threads are runnable
that cpu has nothing else to do anyway.
If another kernel thread is runnable it winds up getting scheduled
within around ~100ns. If only the one contending kernel thread
is runnable it spins in the scheduler and eventually resumes the
original thread without actually having to do a switch. Probably
~50ns or less.
If many kernel threads are contending on the same token the requesting
cpu has nothing to do until the holding cpu releases the token,
no matter how many threads are stuck on the requesting cpu. When the
token is released by the holding cpu then the requesting cpu will
instantly be able to schedule whatever thread (of the many contending
threads) it is currently polling.
So the scheduling overhead winds up being essentially O(1).
There is one other thing I should note here about the LWKT scheduler.
Neither the MP lock or the tokens do any sort of IPI signaling when
a release occurs. They act essentially like spin locks except for
fact that where they spin is in the scheduler and not in the code
trying to acquire the token. IPI signaling would add a huge amount of
extra overhead for locks which are typically held for only a few hundred
nanoseconds (IPI signaling latency is on the order of 1uS so the
notification winds up being wasted most of the time).
This creates an issue with the scheduling of user threads when kernel
threads are also present but contending on the MP lock or a token.
If we schedule the user thread under these conditions the kernel threads
wind up with severe latencies which destroys interactive responsiveness.
Thus the LWKT scheduler currently will refuse to schedule a user thread
if a runnable kernel thread is present which is contending on the
MP lock or a token.
The spin (spinlocks) or the hybrid spin-in-scheduler (MP lock, tokens)
approach gives us a solution which has lower overheads for locks which
are held for very short periods of time.
The lockmgr/mutex design gives us a solution which has lower overheads
for locks which are held for longer periods of time and/or must be held
across blocking conditions.