DragonFly kernel List (threaded) for 2005-02
User/Kernel MPI [LONG, Newbie, Questions]
I'm writing mostly because I'm curious about the topic; I'm
probably not sufficiently an ubercoder to actually write the
implementation but I'd still like to understand the design and
implementation. If I'm jumping the gun and wondering about something
that's not yet on the radar, let me know. Also, apologies that this email
is damn long.
First off, is there any documentation on the messaging and threading model
beyond those on www.dragonflybsd.org/goals ?
Anyway, my friend and I spent a lot of time thinking about the proposed
threading/messaging model and developed it more fully than the docs we'd
read (he's looking for a project for a class; seems to have moved on to
L4/HURD but I'm still curious about DFly). So really, can somebody tell
me if the below is correct or at least on the right track?
=== Scheduler changes
First off, we introduce a userspace thread scheduler to replace the
current, entirely in-kernel userland scheduler (similar in concept to
NetBSD's scheduler activation sa_* API). This exposes to the kernel an
entrypoint which may be called with arguments (such as which signals are
pending, possibly with messages pending - if scatter/gather queues are
used the head of each new chain must be passed in; if the buffer is
static, it can be created once, registered with the kernel, and never
passed back from the kernel - this is really an implementation detail,
though I think the former is preferable as an initial opinion). Details
of when this is called are entangled below.
Secondly, we have to introduce some kind of shared state between the
kernel and userspace. If somebody has a way of avoiding this, do let me
know [the NetBSD people seem to use lots of syscalls which seems worse, to
me]. Firstly, the userland scheduler must inform the kernel of how many
threads are currently runnable (RUNNABLE_COUNTER). This implies that the
userland scheduler needs to provide, as one would expect, thread_yield()
and similar functions which manage this counter in a MP-safe way. The
kernel merely reads this number and the worst case of misreading is a
delayed schedule() or a spurious schedule(). If the kernel additionally
knows the locking mechanism employed it can delay reading until the lock
is released. Note that while this number is redundant against the data
structures below, its presence allows the kernel not to keep persistent
pointers to thread data structures.
In addition to the userland scheduler's per thread data structures, there
must be a per-thread shared data structure as well - primarily it must
contain the thread's runnable state and its context (stored registers).
We introduced several values for this thread state to take on: ACTIVE,
RUNNABLE_YIELDED, RUNNABLE_PREEMPTED, BLOCKED_USERLAND, and
BLOCKED_KERNELSPACE. Note well that despite the presence of these data
structures, they are *NOT* persistently pointed to by kernelspace entities
(excepting for BLOCKING_KERNELSPACE threads - see below), which means that
userland is free to create or destroy threads at its whim without
notifying the kernel (though failure to update the RUNNABLE_COUNTER may
have adverse effects on system performance - both cases [spurious
schedule() and missed schedule()] are probably traceable in the kernel and
the process can be reprimanded appropriately).
+ ACTIVE threads are those that are currently running. The userland
scheduler NEVER jumps into an active thread; doing so would lead to
undefined results, as one expects. The kernel will only see threads of
this type, excepting when it has to annotate one to BLOCKED_KERNELSPACE
+ RUNNABLE_YIELDED threads are those that voluntarily gave up their
timeslice via yield(), while RUNNABLE_PREEMPTED are those that the kernel
preempted at the end of their timeslice. The distinction is drawn merely
to enable different thread scheduling algorithms, in some sense. Note
that _YIELDED state is assertable by userland in thread_yield() (which
must also save %eip and the other registers...) but that the kernel must
WRITE TO the shared data-structure to transition the thread to _PREEMPTED
(in addition to updating the state the kernel must write its register dump
there). The kernel will never see a thread in this state [though it will
put one in _PREEMPTED... see below]. Preemptable threads requires
kernel-side support as far as we can tell, but cooperative multitasking
could do away with that little bit of instrumentation.
+ BLOCKED_USERLAND is the result of a mutex_lock() or msg_waitfor() on a
purely userland construct. In all probability the userland-only
per-thread data-structure will contain additional data for the thread
scheduler's decision process. These processes will never be running, so
the kernel will never see this state.
+ BLOCKED_KERNELSPACE is a very unfortunate state. It means that the
thread is in the process of page faulting or similar but that the kernel
was unable to service the request immediately. In the event of a
fast-fault (e.g., mmap'd region with the backing data in page cache, some
syscalls), this state is avoided as the ACTIVE state prevents the
scheduler from running the thread and we'll soon be able to return
(detrampoline) to it. However, in slow-faults (e.g., swapstorm and we
just asked for a page way off on disk, most syscalls), to avoid truly
blocking, the essentially synchronous message that was the page fault
should be downgraded to an ASYNC message and the kernel's userland
scheduler reentered. The exact details of this SYNC->ASYNC conversion are
not exactly clear but the end result must be that a (set of) LWKT(s) is
spawned to wait on the result with ultimate goal of reversing the thread's
annotation to RUNNABLE_PREEMPTED state [in order to facilitate this, the
kernel obviously needs to keep a pointer out to userland. It will produce
undefined results (probably crash) to delete the data structure while the
kernel is in this state because then it will blindly annotate something
else. This (and RUNNABLE_PREEMPTED, really) smells like a rich area for
bugs and exploits... if somebody has something cleaner, I'm all ears, of
The reason that BLOCKED_KERNELSPACE must be introduced is to decouple the
kernelspace from userland so that the kernel does not have to have
contexts stored per thread, merely per blocking thread, as well as so the
userland thread scheduler can, on the next reentry, note that a previously
active thread has been blocked an can reduce the RUNNABLE_COUNTER
appropriately. [This update cannot be done from kernel due to
So the biggest change, all in all, is that the kernel's userland scheduler
no longer resumes a context - it merely jumps in to the process' thread
scheduler and leaves the decision of which thread runs to that process.
This allows for different threading models to match workloads rather than
trying to tune the global scheduler to deal with it.
=== Message Passing
Now, given all of the above we can start using message passing system
calls in a real way. It's possible to use the above without MPI syscalls
as well as to use MPI syscalls without the above, but neither of those
seem to be desirable.
In the MPI syscall model, instead of faulting, we will call the
targetPort->send_msg() function, which may fault or SYSENTER or as
appropriate in the SYNC case, but we won't see that code or any of the
instrumentation (debugging, IPC, userland drivers / VFS, etc) that the MPI
allows - it's been nicely encapsulated away from us. [Incidentally, the
MPI can be used safely for IPC as long as the ports are used essentially
as file descriptors and the ->send_msg() functions are provided by a
trusted library, but that's another discussion.]
If we send an ASYNC message, our behavior is quite clear. We go on our
business and at some point inform the userland thread scheduler that we're
willing to be blocked on the arrival of a message for us on targetPort
In the event of a SYNC message, there are several cases:
+ The message is handled in full, synchronously, right now.
+ The message is immediately converted to ASYNC and somebody else blocks
for us [immediate return in this case would be a bug].
+ The message is synchronously trampolined into the kernel and later
would block. In this case the kernel must downgrade the remainder of the
message to ASYNC, spawn off a context (or several, again, depending on
details of implementation) and sleep the context on the reception of an
answer to the ASYNC message in flight. Here the thread is annotated by
the detrampoline code to BLOCKING_KERNELSPACE and the kernel userland
scheduler is invoked.
Example of the latter case (possibly incorrect in terminology / details,
but should give the idea). This case is overly long but is robust against
subsystems taking locks, which merely swinging the entire stack to the
side (spawning LWKT at the innermost layer only) would not be:
user calls read() in a way that triggers SYNC messages
read() generates a SYNC message and sends it off
->send_msg() trampolines into the kernel synchronously
the message is routed to the VFS and parsed
the VFS calls (C style) into the filesystem
the FS calls (C style) into the block device layer
the block device layer realized that is has to block. At this point
there is a kernel context (stack) associated with everything up to
this point. Here the message will be downgraded to ASYNC and
dispatched to the appropriate block device after the block layer
creates or reserves a message port. The block layer spawns a thread
to wait on the message it sent.
the block device layer returns a handle to that port back to the FS
along with an error indicating that it had to block.
the FS spawns a thread to wait on this port as well as generate a port
and error return code to the VFS.
the VFS does the same, returning to
the detrampoline code recognizes the error and spawns off a thread to
wait for reception of the message from the VFS to enqueue the result
for the process' userland thread scheduler. It annotates the
currently running thread to BLOCKING_KERNELSPACE. It then jumps over
to the kernel process scheduler.
At the end, the kernel stack is fully unwound, having been transferred to
LWKTs waiting for an answer. This is probably not ideal, but it works and
is robust against locking of any type inside the subsystems; I'm sure
somebody has a better idea.
Why SYNC messages at all? The short answer is performance. If every
inter-subsystem call is translated into an asynchronous message, the
results can be disastrous for performance, as the kernel can end up
spending more cycles in message queue management than in actually doing
the task requested.
=== Implementation details & misc.
+ The mechanisms of segmentation that can be used to give rise to thread
local storage can be used for all of this with the exception of
RUNNABLE_COUNTER. By merely reserving a chunk of memory at the top of the
TLS area (alternatively, the thread's stack and use a segment for that),
the kernel can always know the current thread's data structure location
(It'd be in the segment register) without having to have any notion of
thread IDs or similar. This may slightly impact the ABI but I do not
believe it must as TLS already exists. This may also have impacts against
such VMMs as Xen.
+ RUNNABLE_COUNTER can instead be implemented as a syscall-updated
in-kernel quantity (rather than shared state in the memory area). This
has the advantage that the kernel can decrement it on transition to
BLOCKED_KERNELSPACE as well as always ensure its validity, but the
disadvantage is that the userland must trampoline up for every thread
spawned or killed, which may be bad for performance of certain,
thread-heavy workloads (MozartOZ programs leap to mind, ignoring that
Mozart currently does its own threading - in an ideal world it could spawn
real threads for it).
+ ps & friends now have no process-agnostic way of showing threads if
RUNNABLE_COUNTER is managed in userland. Every process must be asked for
how many threads it has. Perhaps this is another advantage of it being a
syscall. Note that it does not otherwise impact viewing process
+ Register store and restore must be done in userspace on occasion. I
assume this isn't much of a problem but I am unfamiliar with the exact
details of all the permission bits present on x86.
OK, well, that's enough for now. I eagerly await responses.