Keyword substitution: kv
Default branch: MAIN
Add TDF_NETWORK lwkt flag, so that various assertions can be performed to make sure that packets are processed in network threads (i.e. a controlled environment).
Add a MSGF_NORESCHED feature for lwkt thread-based message ports. The idea is to use it to allow certain async messages to be queued to higher priority system threads and schedule those threads without forcing an immediate reschedule. The feature will be used by the new socket code to prevent cavitation between a user process and system protocol thread when the user process is write()ing a lot of data over the network.
Fix issues with the scheduler that were causing unnecessary reschedules between tightly coupled processes as well as inefficient reschedules under heavy loads. The basic problem is that a process entering the kernel is 'passively released', meaning its thread priority is left at TDPRI_USER_NORM. The thread priority is only raised to TDPRI_KERN_USER if the thread switches out. This has the side effect of forcing a LWKT reschedule when any other user process woke up from a blocked condition in the kernel, regardless of its user priority, because its LWKT thread was at the higher TDPRI_KERN_USER priority. This resulted in some significant switching cavitation under load. There is a twist here because we do not want to starve threads running in the kernel acting on behalf of a very low priority user process, because doing so can deadlock the namecache or other kernel elements that sleep with lockmgr locks held. In addition, the 'other' LWKT thread might be associated with a much higher priority user process that we *DO* in fact want to give cpu to. The solution is elegant. First, do not force a LWKT reschedule for the above case. Second, force a LWKT reschedule on every hard clock. Remove all the old hacks. That's it! The result is that the current thread is allowed to return to user mode and run until the next hard clock even if other LWKT threads (running on behalf of a user process) are runnable. Pure kernel LWKT threads still get absolute priority, of course. When the hard clock occurs the other LWKT threads get the cpu and at the end of that whole mess most of those LWKT threads will be trying to return to user mode and the user scheduler will be able to select the best one. Doing this on a hardclock boundary prevents cavitation from occurring at the syscall enter and return boundary. With this change the TDF_NORESCHED and PNORESCHED flags and their associated code hacks have also been removed, along with lwkt_checkpri_self() which is no longer needed.
Fix numerous pageout daemon -> buffer cache deadlocks in the main system. These issues usually only occur on systems with small amounts of ram but it is possible to trigger them on any system.
* Get rid of the IO_NOBWILL hack. Just have the VN device use IO_DIRECT, which will clean out the buffer on completion of the write.
* Add a timeout argument to vm_wait().
* Add a thread->td_flags flag called TDF_SYSTHREAD. kmalloc()'s made from designated threads are allowed to dip into the system reserve when allocating pages. Only the pageout daemon and buf_daemon[_hw] use the flag.
* Add a new static procedure, recoverbufpages(), which explicitly tries to free buffers and their backing pages on the clean queue.
* Add a new static procedure, bio_page_alloc(), to do all the nasty work of allocating a page on behalf of a buffer cache buffer. This function will call vm_page_alloc() with VM_ALLOC_SYSTEM to allow it to dip into the system reserve. If the allocation fails this function will call recoverbufpages() to try to recycle VM pages from clean buffer cache buffers, and will then attempt to reallocate using VM_ALLOC_SYSTEM | VM_ALLOC_INTERRUPT to allow it to dip into the interrupt reserve as well. Warnings will blare on the console. If the effort still fails we sleep for 1/20 of a second and retry. The idea though is for all the effort above to not result in a failure at the end.
Reported-by: Gergo Szakal <bastyaelvtars@gmail.com>
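The fallback order described for bio_page_alloc() can be sketched as a small userland model. This is only an illustration under stated assumptions: the flag values, struct page_pool, page_alloc(), recover_buf_pages() and bio_page_alloc_once() are hypothetical stand-ins, not the kernel's code, and the final "sleep 1/20 second and retry" step is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the real VM_ALLOC_* constants live in the VM headers. */
#define VM_ALLOC_SYSTEM    0x01
#define VM_ALLOC_INTERRUPT 0x02

/* Toy page pool with two reserve floors (assumed: system > interrupt > 0). */
struct page_pool {
    int free_pages;        /* pages currently free */
    int system_reserve;    /* normal allocations stop at this floor */
    int interrupt_reserve; /* VM_ALLOC_SYSTEM allocations stop at this floor */
    int recoverable;       /* pages backing clean buffer cache buffers */
};

/* Stand-in for vm_page_alloc(): true if a page could be taken. */
static bool page_alloc(struct page_pool *p, int flags)
{
    int floor = p->system_reserve;

    if (flags & VM_ALLOC_SYSTEM)
        floor = p->interrupt_reserve;
    if (flags & VM_ALLOC_INTERRUPT)
        floor = 0;
    if (p->free_pages <= floor)
        return false;
    p->free_pages--;
    return true;
}

/* Stand-in for recoverbufpages(): return clean-buffer pages to the pool. */
static int recover_buf_pages(struct page_pool *p)
{
    int n = p->recoverable;

    p->free_pages += n;
    p->recoverable = 0;
    return n;
}

/*
 * One pass of the fallback chain. Returns 1 if the system-reserve attempt
 * worked, 2 if the post-recovery attempt worked, and 0 if the caller
 * should sleep briefly and retry.
 */
static int bio_page_alloc_once(struct page_pool *p)
{
    if (page_alloc(p, VM_ALLOC_SYSTEM))
        return 1;
    recover_buf_pages(p);
    if (page_alloc(p, VM_ALLOC_SYSTEM | VM_ALLOC_INTERRUPT))
        return 2;
    return 0;
}
```

A caller in this model would loop on bio_page_alloc_once() and sleep between passes whenever it returns 0.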
Allocate lwkt threads from objcache instead of custom per-cpu cache backed by zone. Reviewed-by: dillon@
MFC - Fix a nasty memory corruption issue related to the kernel's use of the FP registers for large copies.
Fix a nasty memory corruption issue which can occur due to the kernel bcopy's use of the FP unit. If the destination address faults the NPX code can lose track of the fact that the kernel was using the FP unit. When the fault is resolved the kernel bcopy resumes with corrupted FP registers. The most common situation where this could occur is with pipes, and generally only when the system is paging heavily and causing multiple processes to fault in the kernel FP bcopy code.
Clean up the token code and implement lwkt_token_is_stale(). Users of the token code are now able to detect if the token was acquired and released by someone else while they were blocked. Submitted-by: Michael Neumann <mneumann@ntecs.de>
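One way to picture the staleness check is a generation counter that is bumped on every acquisition. The sketch below assumes that design purely for illustration; the commit does not say how lwkt_token_is_stale() is implemented, and struct token_model, struct tokref_model, token_acquire() and token_is_stale() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: every acquisition bumps the token's generation. */
struct token_model {
    unsigned generation;
};

/* Per-holder reference remembering the generation it acquired. */
struct tokref_model {
    struct token_model *tok;
    unsigned seen;
};

static void token_acquire(struct tokref_model *ref, struct token_model *tok)
{
    ref->tok = tok;
    ref->seen = ++tok->generation;
}

/*
 * True if someone else acquired the token since this reference did,
 * for example while the original holder was blocked.
 */
static bool token_is_stale(const struct tokref_model *ref)
{
    assert(ref->tok != NULL);
    return ref->seen != ref->tok->generation;
}
```

A holder that finds its reference stale after blocking would revalidate whatever state the token protects before continuing.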
Save and restore the FP context in the signal stack frame.
Add a new lightweight function to synchronize IPI queues on other CPUs by broadcasting a NOP IPI to other CPUs; this is used to make sure that all IPIs before the NOP one are processed. Use this new function to fix a possible race between kfree() and malloc_uninit(): kfree() may be in a transitional state when malloc_uninit() is running. Ideas-from: dillon@ Reviewed-by: dillon@
style(9) cleanup: Remove parameter names from prototypes. Submitted-by: Hasso Tepper <hasso@estpak.ee>
Pass structs by reference if you expect the callee to modify them. This fixes kernel boot with gcc41. The gpfault people were seeing comes from vm86_bioscall() in init386(). The cause is that the assembler code passes the struct vm86frame by value, i.e. simply creating it on the stack. This worked up to gcc34, but gcc41 now optimizes stores to unused memory locations away, which is allowed per the standards. This led to an uninitialized stack frame which in turn panicked the box. Oooohh...-please-commit-by: dillon@
Remove LWKT reader-writer locks (kern/lwkt_rwlock.c). Remove lwkt_wait queues (only RW locks used them). Convert remaining uses of RW locks to LOCKMGR locks. In recent months lockmgr locks have been simplified to the point where we no longer need a lighter-weight fully blocking lock. The removal also simplifies lwkt_schedule() in that it no longer needs a special case to deal with wait lists.
gd_tdallq is not protected by the BGL any more; it can only be manipulated on the current cpu. Remove the thread when it exits rather than when it is freed.
Fix numerous bugs in the BSD4 scheduler introduced in recent commits. Primarily, do not try to get a spinlock from a hard interrupt (e.g. IPI) if spinlocks are already being held by the cpu. This will probably have to be made an absolute rule - no spinlocks at all in a hard interrupt / IPI (vs an interrupt thread).
Add two KTR (kernel trace) options: KTR_GIANT_CONTENTION and KTR_SPIN_CONTENTION. These will cause MP lock contention and spin lock contention to be KTR-logged.
Further isolate the user process scheduler data by moving more variables from the globaldata structure to the scheduler module(s). Make the user process scheduler MP safe. Make the LWKT 'pull thread' (to a different cpu) feature MP safe. Streamline the user process scheduler API. Do a near complete rewrite of the BSD4 scheduler. Remote reschedules (reschedules to other cpus), cpu pickup of queued processes, and locality of reference handling should make the new BSD4 scheduler a lot more responsive. Add a demonstration user process scheduler called 'dummy' (kern/usched_dummy.c). Add a kenv variable 'kern.user_scheduler' that can be set to the desired scheduler on boot (i.e. 'bsd4' or 'dummy'). NOTE: Until more of the system is taken out from under the MP lock, these changes actually slow things down slightly. Buildworlds are about ~2.7% slower.
Implement a much faster spinlock.
* Spinlocks can't conflict with FAST interrupts without deadlocking anyway, so instead of using a critical section simply do not allow an interrupt thread to preempt the current thread if it is holding a spinlock. This cuts spinlock overhead in half.
* Implement shared spinlocks in addition to exclusive spinlocks. Shared spinlocks would be used, e.g. for file descriptor table lookups.
* Cache a shared spinlock by using the spinlock's lock field as a bitfield, one for each cpu (bit 31 for exclusive locks). A shared spinlock sets its cpu's shared bit and does not bother clearing it on unlock. This means that multiple, parallel shared spinlock accessors do NOT incur a cache conflict on the spinlock. ALL parallel shared accessors operate at full speed (~10ns vs ~40-100ns in overhead). 90% of the 10ns in overhead is due to a necessary MFENCE to interlock against exclusive spinlocks on the mutex. However, this MFENCE only has to play with pending cpu-local memory writes so it will always run at near full speed.
* Exclusive spinlocks in the face of previously cached shared spinlocks are now slightly more expensive because they have to clear the cached shared spinlock bits by checking the globaldata structure for each conflicting cpu to see if it is still holding a shared spinlock. However, only the initial (unavoidable) atomic swap involves potential cache conflicts. The shared bit checks involve only memory reads and the situation should be self-correcting from a performance standpoint since the shared bits then get cleared.
* Add sysctl's for basic spinlock performance testing. Setting debug.spin_lock_test issues a test. Tests #2 and #3 loop debug.spin_test_count times. p.s. these tests will stall the whole machine.
  1 Test the indefinite wait code
  2 Time the best-case exclusive lock overhead
  3 Time the best-case shared lock overhead
* TODO: A shared->exclusive spinlock upgrade inline with positive feedback, and an exclusive->shared spinlock downgrade inline.
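The cached shared-bit bookkeeping can be illustrated with a single-threaded model. This is not a working lock: it uses no atomics or fences, it returns false where real code would spin, and struct spin_model, cpu_holding[] and the function names are assumptions made for the example. It only shows that a shared unlock leaves the cpu's bit set, and that an exclusive acquire clears stale bits after checking whether each flagged cpu really holds the lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NCPUS   4
#define SPIN_EXCL_BIT (UINT32_C(1) << 31)

/* bits 0..30: cached per-cpu shared bits, bit 31: exclusive */
struct spin_model {
    uint32_t lock;
};

/* Models the per-cpu globaldata record of the shared lock a cpu really holds. */
static const struct spin_model *cpu_holding[MODEL_NCPUS];

static bool shared_lock(struct spin_model *s, int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NCPUS);
    if (s->lock & SPIN_EXCL_BIT)
        return false;                 /* real code would spin here */
    s->lock |= UINT32_C(1) << cpu;    /* no change if the bit is already cached */
    cpu_holding[cpu] = s;
    return true;
}

static void shared_unlock(struct spin_model *s, int cpu)
{
    assert(cpu_holding[cpu] == s);
    cpu_holding[cpu] = NULL;          /* the bit in s->lock is deliberately left set */
}

static bool exclusive_lock(struct spin_model *s)
{
    uint32_t old = s->lock;

    s->lock = SPIN_EXCL_BIT;          /* stands in for the atomic swap */
    if (old & SPIN_EXCL_BIT)
        return false;                 /* someone else is exclusive */
    for (int cpu = 0; cpu < MODEL_NCPUS; cpu++) {
        if ((old & (UINT32_C(1) << cpu)) && cpu_holding[cpu] == s) {
            s->lock = old;            /* a real shared holder exists: back out */
            return false;
        }
    }
    return true;                      /* any stale cached bits are now cleared */
}

static void exclusive_unlock(struct spin_model *s)
{
    assert(s->lock == SPIN_EXCL_BIT);
    s->lock = 0;
}
```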
I'm growing tired of having to add #include lines for header files that the include file(s) I really want depend on. Go through nearly all major system include files and add appropriately #ifndef'd #include lines to include all dependent header files. Kernel source files now only need to #include the header files they directly depend on. So, for example, if I wanted to add a SYSCTL to a kernel source file, I would only have to #include <sys/sysctl.h> to bring in the support for it, rather than four or five header files in addition to <sys/sysctl.h>.
Recent lwkt_token work broke UP builds. Fix the token code to operate properly for both UP and SMP builds. The SMP build uses spinlocks to control access and also to do the preemption check. The tokens are explicitly obtained when a thread is switched in and released when a thread is (non-preemptively) switched out. Spinlocks cannot be used for this purpose on UP because they are coded to a degenerate case on a UP build. On a UP build an explicit preemption check is needed, but no spinlock or per-thread counter is required because the definition of a token is that it is only 'held' while a thread is actually running or preempted. So, by definition, a token can always be obtained and held by a thread on UP EXCEPT in the case where a preempting thread is trying to obtain a token held by the preempted thread. Conditionalize elements in the lwkt_token structure definition to guarantee that SMP fields cannot be used in UP builds or vice versa. The lwkt_token structure is made the same size for both builds. Also remove some of the degenerate spinlock functions (spin_trylock() and spin_tryunlock()) for UP builds to force a compile-time error if an attempt is made to use them. spin_lock*() and spin_unlock*() are retained as degenerate cases on UP. Reported-by: Sascha Wildner <saw@online.de>, walt <wa1ter@myrealbox.com>
Replace the LWKT token code's passive management of token ownership with active management based on Jeff's spin locks (which themselves are an adaptation of Sun spinlocks, I think). LWKT tokens still have the same behavior. That is, even though tokens now use a spinlock internally, they are still active only while the thread is running (or preempted). When a thread non-preemptively switches away all held tokens are released as before and when a thread switches back in all held tokens are reacquired. Use spinlocks instead of tokens to manage access to LWKT RW lock structures. Use spinlocks instead of tokens to manage LWKT wait lists. Tokens are designed to fill a niche between spinlocks and lockmgr locks. Spinlocks are only to be used for short bits of low level code. Tokens are designed to be used when broad serialization is desired but when the caller may be making calls to procedures which might block. Lockmgr locks are designed to be used when strict serialization is desired even across blocking conditions. It should be noted that token overhead is only slightly greater than core spinlock overhead. The only real difference is due to the extra structural management required to record the token in the thread structure so it can be released and reacquired. The overhead of saving and restoring tokens in a thread switch is very rarely exercised (i.e. only when the underlying code actually blocks while holding a token). This patch reduces buildworld -j 8 times by about 5 seconds (1400->1395 seconds on my test box), about 0.3%, but is expected to have a more pronounced effect as further MP work is accomplished.
Bring in the parallel route table code and clean up ARP. The route table is now replicated across all cpus (ncpus, not ncpus2). Note that cloned routes are not replicated. This removes one of the few remaining obstacles to being able to run the network protocol stacks without the BGL. Primary-Design-by: Jeffrey Hsu Work-by: Jeffrey Hsu and Matthew Dillon
Fix a process exit/wait race. The wait*() code was making a faulty test to determine that the exiting process had completely exited and was no longer running. Testing the TDF_RUNNING flag is insufficient because an exiting process may block at various points after becoming a Zombie, but before it deschedules itself for the last time. Add a new flag, TDF_EXITING, which is set just prior to a thread descheduling itself for the last time. The reaper then checks that TDF_EXITING is set and TDF_RUNNING is clear. Fix a second faulty test in both the exit and the thread cpu migration code. If a thread gets preempted, TDF_RUNNING will be temporarily cleared, so testing TDF_RUNNING is not sufficient by itself. We must also test the TDF_PREEMPT_LOCK flag to be sure that it is also clear. So the grand result is that to really be sure the zombie process has been completely descheduled and is no longer running or will ever run again, the TDF_EXITING, TDF_RUNNING, *and* TDF_PREEMPT_LOCK flags must be tested and all must be clear except for TDF_EXITING. It should be noted that TDF_RUNNING on the previously scheduled process is always cleared AFTER we have context-switched into the next scheduled thread or the idle thread, so seeing a cleared TDF_RUNNING along with the appropriate state for the other flags does in fact guarantee that the thread in question is no longer using its stack in any way. Reported-by: Stefan Krueger <skrueger@meinberlikomm.de>
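The three-flag test described above reduces to a single masked comparison. The sketch below uses made-up bit values for the flags (the real ones are defined in sys/thread.h) and a hypothetical helper name, thread_fully_exited().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, for illustration only. */
#define TDF_RUNNING      0x0001
#define TDF_PREEMPT_LOCK 0x0002
#define TDF_EXITING      0x0004

/*
 * A thread is safe to reap only when TDF_EXITING is set and both
 * TDF_RUNNING and TDF_PREEMPT_LOCK are clear.
 */
static bool thread_fully_exited(unsigned td_flags)
{
    const unsigned mask = TDF_EXITING | TDF_RUNNING | TDF_PREEMPT_LOCK;

    return (td_flags & mask) == TDF_EXITING;
}
```

Testing all three bits in one comparison avoids the window where a preempted thread has TDF_RUNNING temporarily clear but TDF_PREEMPT_LOCK still set.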
Consolidate the initialization of td_mpcount into lwkt_init_thread(). Fix a bug in kern.trap_mpsafe, the mplock was not being properly released when operating in vm86 mode (when kern.trap_mpsafe was set to 1).
Add a thread flag, TDF_MPSAFE, which is used during thread creation to determine whether the thread should initially be holding the MP lock or not.
Make tsleep/wakeup() MP SAFE for kernel threads and get us closer to making it MP SAFE for user processes. Currently the code is operating under the rule that access to a thread structure requires cpu locality of reference, and access to a proc structure requires the Big Giant Lock. The two are not mutually exclusive so, for example, tsleep/wakeup on a proc needs both cpu locality of reference *AND* the BGL. This was true with the old tsleep/wakeup and has now been documented.
The new tsleep/wakeup algorithm is quite simple in concept. Each cpu has its own ident based hash table and each hash slot has a cpu mask which tells wakeup() which cpu's might have the ident. A wakeup iterates through all candidate cpus simply by chaining the IPI message through them until either all candidate cpus have been serviced, or (with wakeup_one()) the requested number of threads have been woken up.
Other changes made in this patch set:
* The sense of P_INMEM has been reversed. It is now P_SWAPPEDOUT. Also, P_SWAPPING, P_SWAPINREQ are no longer relevant and have been removed.
* The swapping code has been cleaned up and seriously revamped. The new swapin code staggers swapins to give the VM system a chance to respond to new conditions. Also some lwp-related fixes were made (more p_rtprio vs lwp_rtprio confusion).
* As mentioned above, tsleep/wakeup have been rewritten. The process p_stat no longer does crazy transitions from SSLEEP to SSTOP. There is now only SSLEEP and SSTOP is synthesized from P_SWAPPEDOUT for userland consumption. Additionally, tsleep() with PCATCH will NO LONGER STOP THE PROCESS IN THE TSLEEP CALL. Instead, the actual stop is deferred until the process tries to return to userland. This removes all remaining cases where a stopped process can hold a locked kernel resource.
* A P_BREAKTSLEEP flag has been added. This flag indicates when an event occurs that is allowed to break a tsleep with PCATCH. All the weird undocumented setrunnable() rules have been removed and replaced with a very simple algorithm based on this flag.
* Since the UAREA is no longer swapped, we no longer faultin() on PHOLD(). This also incidentally fixes the 'ps' command's tendency to try to swap all processes back into memory.
* speedup_syncer() no longer does hackish checks on proc0's tsleep channel (td_wchan).
* Userland scheduler acquisition and release has now been tightened up and KKASSERT's have been added (one of the bugs Stefan found was related to an improper lwkt_schedule() that was found by one of the new assertions). We also have added other assertions related to expected conditions.
* A serious race in pmap_release_free_page() has been corrected. We no longer couple the object generation check with a failed pmap_release_free_page() call. Instead the two conditions are checked independently. We no longer loop when pmap_release_free_page() succeeds (it is unclear how that could ever have worked properly).
Major testing by: Stefan Krueger <skrueger@meinberlikomm.de>
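The per-cpu hash table and cpu mask idea can be sketched in userland as follows. This is a simplified model under stated assumptions: the table sizes, hash function, and the names model_tsleep() and model_wakeup() are invented for the example, and a plain loop over cpus stands in for chaining an IPI message from one candidate cpu to the next.

```c
#include <assert.h>
#include <stdint.h>

#define NCPUS    4
#define HASHSIZE 16
#define MAXSLEEP 8

/* Idents of the threads sleeping in one hash slot on one cpu. */
struct slot {
    const void *idents[MAXSLEEP];
    int n;
};

static struct slot percpu[NCPUS][HASHSIZE];   /* one hash table per cpu */
static uint32_t slot_cpumask[HASHSIZE];       /* cpus that may hold the ident */

static unsigned hash_ident(const void *ident)
{
    return (unsigned)(((uintptr_t)ident >> 4) % HASHSIZE);
}

/* Record a sleeper on the given cpu and flag that cpu in the slot's mask. */
static void model_tsleep(int cpu, const void *ident)
{
    unsigned h = hash_ident(ident);
    struct slot *sl = &percpu[cpu][h];

    assert(cpu >= 0 && cpu < NCPUS && sl->n < MAXSLEEP);
    sl->idents[sl->n++] = ident;
    slot_cpumask[h] |= UINT32_C(1) << cpu;
}

/* Wake up to max sleepers on ident (max <= 0 wakes all); returns the count. */
static int model_wakeup(const void *ident, int max)
{
    unsigned h = hash_ident(ident);
    int woken = 0;

    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (!(slot_cpumask[h] & (UINT32_C(1) << cpu)))
            continue;                          /* this cpu cannot have the ident */
        struct slot *sl = &percpu[cpu][h];
        int kept = 0;
        for (int i = 0; i < sl->n; i++) {
            if (sl->idents[i] == ident && (max <= 0 || woken < max))
                woken++;                       /* sleeper removed and woken */
            else
                sl->idents[kept++] = sl->idents[i];
        }
        sl->n = kept;
        if (kept == 0)
            slot_cpumask[h] &= ~(UINT32_C(1) << cpu);
        if (max > 0 && woken >= max)
            break;                             /* wakeup_one() style early stop */
    }
    return woken;
}
```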
Turn around the spinlock code to reduce the chance of programmer error. Remove spin_lock_crit() and spin_unlock_crit(). Instead make the primary spinlock API, spin_lock() and spin_unlock(), enter and exit a critical section. Add two API functions, spin_lock_quick() and spin_unlock_quick() which assume the caller is already in a critical section or that the spinlock will never be used by a preempting thread (hardware interrupt or software interrupt).
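The relationship between the two API levels can be sketched like this. The model is single-threaded and every name in it (struct thread_model, struct spinlock_model, and the _m suffixed functions) is a hypothetical stand-in; it only shows that the primary calls wrap the quick calls in a critical section.

```c
#include <assert.h>

/* Hypothetical models of the thread and spinlock state. */
struct thread_model {
    int critcount;      /* critical section nesting depth */
};

struct spinlock_model {
    int locked;
};

static void crit_enter_m(struct thread_model *td)
{
    td->critcount++;
}

static void crit_exit_m(struct thread_model *td)
{
    assert(td->critcount > 0);
    td->critcount--;
}

/* Quick variants: caller is assumed to be in a critical section already. */
static void spin_lock_quick_m(struct spinlock_model *sp)
{
    assert(!sp->locked);    /* a real lock would spin instead */
    sp->locked = 1;
}

static void spin_unlock_quick_m(struct spinlock_model *sp)
{
    assert(sp->locked);
    sp->locked = 0;
}

/* Primary API: enter the critical section, then take the lock. */
static void spin_lock_m(struct thread_model *td, struct spinlock_model *sp)
{
    crit_enter_m(td);
    spin_lock_quick_m(sp);
}

/* Release in the opposite order. */
static void spin_unlock_m(struct thread_model *td, struct spinlock_model *sp)
{
    spin_unlock_quick_m(sp);
    crit_exit_m(td);
}
```

Making the safe form the default name means a caller has to opt in explicitly to the variant that skips the critical section.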
Remove the dummy IPI messaging routines for UP builds and properly conditionalize the use of IPI messages in various core kernel modules. Change the callback from func(arg, frameptr) to func(arg1, arg2, frameptr), where the new argument (arg2) is an integer supplied by the originator. Create wrappers for simpler versions of the callback: func(arg1), and func(arg1, arg2) (for the moment we presume that GCC will generate code for the full-sized three-argument callback which is compatible with one and two-argument function pointers). This extension to the IPI messaging code is needed to properly implement MP-safe tsleep/wakeup code. Although the extra argument is superfluous in most cases, the overhead of doing an IPI is such that there should be no noticeable impact on performance.
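The commit relies on GCC tolerating calls to one- and two-argument functions through the three-argument pointer type. A portable alternative, shown here only as a sketch and not as what the kernel does, records the arity and dispatches explicitly. All type and function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct intrframe;   /* opaque in this sketch */

typedef void (*ipifunc1_t)(void *arg1);
typedef void (*ipifunc2_t)(void *arg1, int arg2);
typedef void (*ipifunc3_t)(void *arg1, int arg2, struct intrframe *frame);

/* Queued message: the callback plus the originator-supplied arguments. */
struct ipi_msg {
    int nargs;
    union {
        ipifunc1_t f1;
        ipifunc2_t f2;
        ipifunc3_t f3;
    } u;
    void *arg1;
    int arg2;
};

static struct ipi_msg ipi_msg1(ipifunc1_t f, void *arg1)
{
    struct ipi_msg m = { .nargs = 1, .arg1 = arg1, .arg2 = 0 };
    m.u.f1 = f;
    return m;
}

static struct ipi_msg ipi_msg2(ipifunc2_t f, void *arg1, int arg2)
{
    struct ipi_msg m = { .nargs = 2, .arg1 = arg1, .arg2 = arg2 };
    m.u.f2 = f;
    return m;
}

/* Call the callback with as many arguments as it was registered with. */
static void ipi_dispatch(const struct ipi_msg *m, struct intrframe *frame)
{
    switch (m->nargs) {
    case 1:
        m->u.f1(m->arg1);
        break;
    case 2:
        m->u.f2(m->arg1, m->arg2);
        break;
    default:
        assert(m->nargs == 3);
        m->u.f3(m->arg1, m->arg2, frame);
        break;
    }
}

/* Example callbacks operating on an int counter passed as arg1. */
static void bump(void *arg1) { ++*(int *)arg1; }
static void add(void *arg1, int arg2) { *(int *)arg1 += arg2; }
```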
Major cleanup of the interrupt registration subsystem.
* Collapse the separate registrations in the kernel interrupt thread and i386 layers into a single machine-independent kernel interrupt thread layer in kern/kern_intr.c. Get rid of the i386 layer's 'MUX' code entirely.
* Have the interrupt vector assembly code (icu_vector.s and apic_vector.s) call a machine-independent function in the kernel interrupt thread layer to figure out how to process an interrupt.
* Move a lot of assembly into the new C interrupt processing function.
* Add support for INTR_MPSAFE. If a device driver registers an interrupt as being MPSAFE, the Big Giant Lock will not be obtained or required.
* Temporarily just schedule the ithread if a FAST interrupt cannot be executed due to its serializer being locked.
* Add LWKT serialization support for a non-blocking 'try' function.
* Get rid of ointhand2_t and adjust all old ISA code to use inthand2_t.
* Supply a frame pointer as a pointer rather than embedding it on the stack.
* Allow FAST and SLOW interrupts to be mixed on the same IRQ, though this will not necessarily result in optimal operation.
* Remove direct APIC/ICU vector calls from the apic/icu vector assembly code. Everything goes through the new routine in kern/kern_intr.c now.
* Add a new flag, INTR_NOPOLL. Interrupts registered with the flag will not be polled by the upcoming emergency general interrupt polling sysctl (e.g. ATA cannot be safely polled due to the way ATA register access interferes with ATA DMA).
* Remove most of the distinction in the i386 assembly layers between FAST and SLOW interrupts (part 1/2).
* Revamp the interrupt name array returned to userland to list multiple drivers associated with the same IRQ.
1:1 Userland threading stage 2.8/4: Switch the userland scheduler to use lwps instead of procs.
Userland 1:1 threading changes step 1/4+:
o Move thread-local members from struct proc into new struct lwp.
o Add a LIST_HEAD(lwp) p_lwps to struct proc. This links a proc with its lwps.
o Add a td_lwp member to struct thread which links a thread to its lwp, if it exists. This won't replace td_proc completely to save indirections.
o For now embed one struct lwp into struct proc and set up preprocessor linkage so that semantics don't change for the rest of the kernel. Once all consumers are converted to take a struct lwp instead of a struct proc, this will go away.
Reviewed-by: dillon, davidxu
Add a new kernel compile debugging option, DEBUG_CRIT_SECTIONS. This fairly invasive debugging option compiles matching code into the critical section inlines and reports mismatches at run-time. It is used to detect missing/forgotten crit_exit() calls. Note that because there are a number of places where critical sections are manipulated outside the procedures that entered them, this code will generate a number of false hits and should only be used under the direction of experienced developers. Note that the thread structure will be extended by this option.
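The matching idea behind DEBUG_CRIT_SECTIONS can be pictured as each crit_enter() recording an identifier that the corresponding crit_exit() must repeat. The sketch below assumes string identifiers and a fixed-depth array added to the thread structure; struct thread_dbg, crit_enter_id() and crit_exit_id() are hypothetical names, not the kernel's inlines.

```c
#include <assert.h>
#include <string.h>

#define CRIT_DEBUG_DEPTH 8

/* Thread structure extended with debugging state, as the option implies. */
struct thread_dbg {
    int critcount;                        /* nesting depth */
    const char *ids[CRIT_DEBUG_DEPTH];    /* id recorded at each crit_enter */
    int mismatches;                       /* exits whose id did not match */
};

static void crit_enter_id(struct thread_dbg *td, const char *id)
{
    assert(td->critcount < CRIT_DEBUG_DEPTH);
    td->ids[td->critcount++] = id;
}

static void crit_exit_id(struct thread_dbg *td, const char *id)
{
    assert(td->critcount > 0);
    const char *entered = td->ids[--td->critcount];
    if (strcmp(entered, id) != 0)
        td->mismatches++;                 /* a real kernel would report this */
}
```

Code that legitimately exits a critical section in a different procedure from the one that entered it would show up as a mismatch in this model, which corresponds to the false hits the commit warns about.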
When a cpu is stopped due to a panic or the debugger, it can be in virtually any state, including possibly holding a critical section. IPIQ interrupts must still be processed while we are in this state (even though we could be racing IPIQ processing if we were interrupted at just the wrong time). In particular, dumping is not likely to work if a panic occurs on a cpu != 0 unless we process the IPIQ on the stopped cpus. There are simply too many interactions between cpus. Interrupt threads are LWKT scheduled entities and will generally still not work during a panic while dumping. The dumping code expects this. However, call splz() anyway. We may in the future have to allow certain threads to run while dumping. For example, to allow dumping over the network. There are various ways this can be done, such as by masking gd_runqmask or flagging special threads to be runnable while in a panicked or dumping state.
Limit switch-from-interrupt warnings to once per thread to avoid an endless loop. Generate a DDB backtrace when it occurs. Note from Peter's report that it is possible for the idle thread to panic if e.g. an IPI or FAST interrupt running in the idle thread's context panics. This can result in highly unexpected operation and needs to be addressed. Reported-by: Peter Avalos <pavalos@theshell.com>
Add counters for recording Token/MPlock contention; this helps in determining the number of times contention has occurred in the system. The contention counters have been made 64-bit quantities because they are situated within a tight loop. KTR tracepoints have been added for marking the start and stop of a token's contention. New field tr_flags added to struct lwkt_tokref. By adding tracepoints in lwkt_chktokens(9), it gives us interesting data on MP machines when it indirectly sends a passive IPI to the remote CPU for gaining ownership of a token. It would be interesting to see KTR dumps for a 4-CPU or an 8-CPU system. Discussed-with: Matthew Dillon <dillon@apollo.backplane.com>
Add more magic numbers for the token code.
staticize lwkt_reqtoken_remote().
Optimize lwkt_send_ipiq() - the IPI based inter-cpu messaging routine.
* Add a passive version which does not initiate any actual hardware IPI. The message will be handled the next time the target cpu polls the queue (on each tick typically). Adjust the free() path to use this version when freeing memory owned by another cpu.
* Add an interlock to avoid reissuing and unnecessarily stalling on the hardware IPI if a prior hardware IPI to the target cpu has not yet completed processing. This feature theoretically means that two cpus can tightly couple a large number of pipelined messages with only a single actual IPI being sent.
* Reorganize the hysteresis points in the IPIQ FIFOs.
* Change a token livelock warning into a panic if it occurs 10 times in a row.
* Add a call to lwkt_process_ipiq() just after the AP startup code enables a cpu, to process any messages that might have built up during startup. There shouldn't be any, but this may avoid surprises later.
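The passive send and the hardware-IPI interlock can be modelled with a small FIFO. This is a single-threaded sketch with invented names (struct ipiq_model, send_ipiq(), process_ipiq()); a counter stands in for actually raising a hardware IPI.

```c
#include <assert.h>

#define IPIQ_SIZE 8

/* One cpu's incoming message FIFO. */
struct ipiq_model {
    int fifo[IPIQ_SIZE];
    unsigned windex;      /* total messages written */
    unsigned rindex;      /* total messages consumed */
    int ipi_pending;      /* a hardware IPI was sent and not yet serviced */
    int hw_ipis_sent;     /* how many hardware IPIs the model has issued */
};

/*
 * Queue a message. A passive send never signals the target. An active
 * send signals only if no earlier hardware IPI is still outstanding.
 */
static void send_ipiq(struct ipiq_model *q, int msg, int passive)
{
    assert(q->windex - q->rindex < IPIQ_SIZE);
    q->fifo[q->windex % IPIQ_SIZE] = msg;
    q->windex++;
    if (!passive && !q->ipi_pending) {
        q->ipi_pending = 1;
        q->hw_ipis_sent++;
    }
}

/* Target side: drain everything queued so far, then clear the interlock. */
static int process_ipiq(struct ipiq_model *q)
{
    int handled = 0;

    while (q->rindex != q->windex) {
        (void)q->fifo[q->rindex % IPIQ_SIZE];   /* the message would be run here */
        q->rindex++;
        handled++;
    }
    q->ipi_pending = 0;
    return handled;
}
```

In the model, several active sends issued before the target drains its queue cost only one hardware IPI, which is the pipelining effect the commit describes.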
Add syscall primitives for generic userland accessible sleep/wakeup
functions. These functions are capable of sleeping and waking up based on
a generic user VM address. Programs capable of sharing memory are also
capable of interaction through these functions.
Also regenerate our system calls.
umtx_sleep(ptr, matchvalue, timeout)
If *(int *)ptr (userland pointer) matches the matchvalue,
sleep for timeout microseconds. Access to the contents of *ptr plus
entering the sleep is interlocked against calls to umtx_wakeup().
Various error codes are returned depending on what causes the function
to return. Note that the timeout may not exceed 1 second.
umtx_wakeup(ptr, count)
Wakeup at least count processes waiting on the specified userland
address. A count of 0 wakes all waiting processes up. This function
interlocks against umtx_sleep().
The typical race case showing resolution between two userland processes is
shown below. A process releasing a contested mutex may adjust the contents
of the pointer after the kernel has tested *ptr in umtx_sleep(), but this does
not matter because the first process will see that the mutex is set to a
contested state and will call wakeup after changing the contents of the
pointer. Thus, the kernel itself does not have to execute any
compare-and-exchange operations in order to support userland mutexes.
PROCESS 1                   PROCESS 2                         ******** RACE#1 ******
cmp_exg(ptr, FREE, HELD)
        .                   cmp_exg(ptr, HELD, CONTESTED)
        .                   umtx_sleep(ptr, CONTESTED, 0)
        .                   [kernel tests *ptr]               <<<< COMPARE vs
cmp_exg(CONTESTED, FREE)            .                         <<<< CHANGE
        .                   tsleep(....)
umtx_wakeup(ptr, 1)                 .
        .                           .
        .                           .

PROCESS 1                   PROCESS 2                         ******** RACE#2 ******
cmp_exg(ptr, FREE, HELD)
                            cmp_exg(ptr, HELD, CONTESTED)
                            umtx_sleep(ptr, CONTESTED, 0)
cmp_exg(CONTESTED, FREE)                                      <<<< CHANGE vs
umtx_wakeup(ptr, 1)
                            [kernel tests *ptr]               <<<< COMPARE
                            [MISMATCH, DO NOT TSLEEP]
These functions are very loosely based on Jeff Roberson's umtx work in
FreeBSD. These functions are greatly simplified relative to that work in
order to provide a more generic mechanism.
This is precursor work for a port of David Xu's 1:1 userland threading
library.
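A userland mutex built on these two primitives might look like the sketch below. It is an illustration under several assumptions: stub_umtx_sleep() and stub_umtx_wakeup() are do-nothing stand-ins for the system calls so the code compiles anywhere, the state names mirror the FREE/HELD/CONTESTED values in the diagrams, and the slow path uses the common three-state exchange pattern (always leave the word CONTESTED after waiting) rather than the exact cmp_exg sequence drawn above.

```c
#include <assert.h>
#include <stdatomic.h>

enum { MTX_FREE = 0, MTX_HELD = 1, MTX_CONTESTED = 2 };

static int sleep_calls;    /* how often the stub sleep was entered */
static int wakeup_calls;   /* how often the stub wakeup was entered */

/* Stand-in for umtx_sleep(): the real call sleeps while *ptr == matchvalue. */
static int stub_umtx_sleep(atomic_int *ptr, int matchvalue, int timeout)
{
    (void)ptr; (void)matchvalue; (void)timeout;
    sleep_calls++;
    return 0;
}

/* Stand-in for umtx_wakeup(): the real call wakes sleepers on ptr. */
static int stub_umtx_wakeup(atomic_int *ptr, int count)
{
    (void)ptr; (void)count;
    wakeup_calls++;
    return 0;
}

static void mtx_lock(atomic_int *m)
{
    int expect = MTX_FREE;

    /* Fast path: uncontested FREE -> HELD, no kernel involvement. */
    if (atomic_compare_exchange_strong(m, &expect, MTX_HELD))
        return;
    /*
     * Slow path: mark the mutex CONTESTED. If the old value was FREE we
     * now own it, otherwise sleep until the holder wakes us and retry.
     */
    while (atomic_exchange(m, MTX_CONTESTED) != MTX_FREE)
        stub_umtx_sleep(m, MTX_CONTESTED, 0);
}

static void mtx_unlock(atomic_int *m)
{
    /* Only a contested mutex needs the wakeup system call. */
    if (atomic_exchange(m, MTX_FREE) == MTX_CONTESTED)
        stub_umtx_wakeup(m, 1);
}
```

As in the commit's description, all compare-and-exchange work happens in userland, and the kernel side only has to compare *ptr against the expected value before sleeping.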
Avoid redefined symbol warning when libcaps uses thread.h with its own stack specification. Submitted-by: Eirik Nygaard <eirikn@kerneled.com>
Give the MP fields in the thread structure useful names for UP builds so programs like 'ps' (where SMP is not defined during compilation) can pick out the MP info.
Add a stack-size argument to the LWKT threading code so threads can be created with different-sized stacks. Adjust libcaps to match. This is a prerequisite to adding NDIS support. NDIS threads need larger stacks because Microsoft drivers expect larger stacks.
Update the userland scheduler. Fix scheduler interactions which were previously resulting in the wrong process sometimes getting a full 1/10 second slice, which under heavy load resulted in serious glitching. Introduce a new dynamic 'p_interactive' heuristic and allow it to affect priority +/- by a few nice levels. With this patch batch operations such as buildworlds, setiathome should not interfere with X / interactive operations as much as they did before. Note that we are talking about the userland scheduler here, not the LWKT scheduler. Also note that the userland scheduler needs a complete rewrite.
Move the 'p_start' field from struct pstats (Process Statistics) into the thread structure and call it 'td_start'. The behavior of vm_fork(9) is retained, i.e., it still copies the start time from the parent process just as it did before. The 'td_start' will later be used by pure threads to indicate their start time. It has not been committed in this round because use of the microtime() function at such an early point in the boot process might be unsafe. Note, there should be no problem in accessing the td_start field, unless the process is a Zombie; due to the way Zombies are reaped, the thread will be decoupled in kern_wait1() but the process will still be around for a while, so it will not be possible to access the td_start field in such scenarios. A little note about this has been added on top of struct proc in <sys/proc.h> for future reference. This work was a collaboration of Hiten Pandya <hmp@backplane.com> and Matthew Dillon <dillon@apollo.backplane.com>
Spell 'written' properly.
Both 'ps' and the loadav calculations got broken by thread sleeps, which occur without knowledge by the proc and so ps/loadav thought processes sitting in e.g. accept() were in a 'R'un state when they were actually sleeping. Make ps and the loadav calculator thread-aware.
Add lwkt_setcpu_self(), a function which migrates the current thread to the specified cpu. This will soon be used by sysctl_kern_proc() to collect thread information across all available cpus (because it is only legal to manipulate a thread on the cpu it belongs to). Yes, you heard that right and, yes, the overhead is nasty... one whole microsecond per cpu at least, possibly even two. But who cares for something like 'ps'? In-conversation-with: Hiten Pandya <hmp@freebsd.org>
Do some minor critical path performance improvements in the scheduler and at the user/system boundary. Avoid some unnecessary segment prefix ops, remove some unnecessary memory ops by using more optimal critical section inlines, and use 32 bit arithmetic instead of 64 bit arithmetic when calculating system tick overheads in userret(). This saves a whopping 5ns worth of syscall overhead, which just proves how silly I am sometimes.
Cleanup libcaps to support recent LWKT changes. Add TDF_SYSTHREAD back to sys/thread.h (libcaps needs it).
Second major scheduler patch. This corrects interactive issues that were introduced in the pipe sf_buf patch. Split need_resched() into need_user_resched() and need_lwkt_resched(). Userland reschedules are requested when a process is scheduled with a higher priority than the currently running process, and LWKT reschedules are requested when a thread is scheduled with a higher priority than the currently running thread. As before, these are ASTs, LWKTs are not preemptively switched while running in the kernel. Exclusively use the resched wanted flags to determine whether to reschedule or call lwkt_switch() upon return to user mode. We were previously also testing the LWKT run queue for higher priority threads, but this was causing inefficient scheduler interactions when two processes are doing tightly bound synchronous IPC (e.g. using PIPEs) because in DragonFly the LWKT priority of a thread is raised when it enters the kernel, and lowered when it tries to return to userland. The wakeups occurring in the pipe code were causing extra quick-flip thread switches. Introduce a new tsleep() flag which disables the need_lwkt_resched() call when the sleeping thread is woken up. This is used by the PIPE code in the synchronous direct-write PIPE case to avoid the above problem. Redocument and revamp the ESTCPU code. The original changes reduced the interrupt rate from 100Hz (FBsd-4 and FBsd-5) to 20Hz, but did not compensate for the slower ramp-up time. This commit introduces a 'virtual' ESTCPU frequency which compensates without us having to bump up the actual systimer interrupt rate. Redo the P_CURPROC methodology, which is used by the userland scheduler to manage processes running in userland. 
Create a globaldata->gd_uschedcp process pointer which represents the current running-in-userland (or about to be running in userland) process, and carefully recode acquire_curproc() to allow this gd_uschedcp designation to be stolen from other threads trying to return to userland without having to request a reschedule (which would have to switch back to those threads to release the designation). This reduces the number of unnecessary context switches that occur due to scheduler interactions. Also note that this specifically solves the case where there might be several threads running in the kernel which are trying to return to userland at the same time. A heuristic check against gd_upri is used to select the correct thread for scheduling to userland 'most of the time'. When the correct thread is not selected, we fall back to the old behavior of forcing a reschedule. Add debugging sysctl variables to better track userland scheduler efficiency. With these changes pipe statistics are further improved. Though some scheduling aberrations still exist(1), the previous scheduler had totally broken interactive processes and this one does not.
Tests on AMD64 3200+ FN85MB (64KB L1, 1MB L2):
BLKSIZE   BEFORE      NEWPIPE     NOW
          MBytes/s    MBytes/s    MBytes/s
256KB     1900        2200        2250
64KB      1800        2200        2250
32KB      -           -           3300
16KB      1650        2500-3000   2600-3200
8KB       1400        2300        2000-2400(1)
4KB       1300        1400-1500   1500-1700
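The split between userland and LWKT reschedule requests described in this commit can be illustrated with a small userland model. This is only a sketch: the sk_ names, the flag values, and the struct layout are hypothetical and are not taken from the kernel sources. The noresched argument stands in for the new tsleep() flag that suppresses the LWKT reschedule request on wakeup.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real kernel names and values may differ. */
#define SK_USER_RESCHED 0x01u   /* a higher priority user process was scheduled */
#define SK_LWKT_RESCHED 0x02u   /* a higher priority LWKT thread was scheduled */

/* Simplified per-cpu state (assumption: higher number = higher priority). */
struct sk_cpu {
    unsigned reqflags;      /* pending requests, tested on return to user mode */
    int cur_user_pri;       /* priority of the currently running user process */
    int cur_lwkt_pri;       /* priority of the currently running LWKT thread */
};

/* Request a userland reschedule only if the new process beats the current one. */
static void sk_sched_user(struct sk_cpu *cpu, int new_pri)
{
    if (new_pri > cpu->cur_user_pri)
        cpu->reqflags |= SK_USER_RESCHED;
}

/* Same idea for LWKT threads; noresched models the new tsleep() flag. */
static void sk_sched_lwkt(struct sk_cpu *cpu, int new_pri, bool noresched)
{
    if (!noresched && new_pri > cpu->cur_lwkt_pri)
        cpu->reqflags |= SK_LWKT_RESCHED;
}

/*
 * On return to user mode only the flags are consulted (the run queue is
 * not scanned). Returns the pending flags and clears them.
 */
static unsigned sk_userret_check(struct sk_cpu *cpu)
{
    unsigned pending = cpu->reqflags;

    assert((pending & ~(SK_USER_RESCHED | SK_LWKT_RESCHED)) == 0);
    cpu->reqflags = 0;
    return pending;
}
```

In this model a wakeup issued with noresched set leaves the flags untouched, so the woken thread does not force a switch at the next return to user mode. That is the behavior the synchronous direct-write PIPE case relies on.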
Turn TDF_SYSTHREAD into TDF_RESERVED0100 since the flag is never used and such a flag is not required. Discussed with: Matt Dillon
Newtoken commit. Change the token implementation as follows: (1) Obtaining a token no longer enters a critical section. (2) Tokens can be held through scheduler switches and blocking conditions and are effectively released and reacquired on resume. Thus tokens serialize access only while the thread is actually running. Serialization is not broken by preemptive interrupts. That is, interrupt threads which preempt do not release the preempted thread's tokens. (3) Unlike spl's, tokens will interlock w/ interrupt threads on the same or on a different cpu. The vnode interlock code has been rewritten and the API has changed. The mountlist vnode scanning code has been consolidated and all known races have been fixed. The vnode interlock is now a pool token. The code that frees unreferenced vnodes whose last VM page has been freed has been moved out of the low level vm_page_free() code and moved to the periodic filesystem syncer code in vfs_msync(). The SMP startup code and the IPI code has been cleaned up considerably. Certain early token interactions on AP cpus have been moved to the BSP. The LWKT rwlock API has been cleaned up and turned on. Major testing by: David Rhodus
Introduce an MI cpu synchronization API, redo the SMP AP startup code,
and start cleaning up deprecated IPI and clock code. Add a MMU/TLB page
table invalidation API (pmap_inval.c) which properly synchronizes page
table changes with other cpus in SMP environments.
* removed (unused) gd_cpu_lockid
* remove confusing invltlb() and friends, normalize use of cpu_invltlb()
and smp_invltlb().
* redo the SMP AP startup code to make the system work better in
situations where all APs do not startup.
* add memory barrier API, cpu_mb1() and cpu_mb2().
* remove (obsolete, no longer used) old IPI hard and stat clock forwarding
code.
* add a cpu synchronization API which is capable of handling multiple
simultaneous requests without deadlocking or livelocking.
* major changes to the PMAP code to use the new invalidation API.
* remove (unused) all_procs_ipi() and self_ipi().
* only use all_but_self_ipi() if it is known that all AP's started up,
otherwise use a mask.
* remove (obsolete, no longer used) BETTER_CLOCK code
* remove (obsolete, no longer used) Xcpucheckstate IPI code
Testing-by: David Rhodus and others
Cleanup and augment the cpu synchronization API a bit. Embed the maxcount in the structure rather than returning it and requiring it to be passed again, and document the procedures a bit more.
Split the IPIQ messaging out of lwkt_thread.c and move it to its own file, lwkt_ipiq.c. Add a MI synchronous cpu rendezvous API lwkt_cpusync_*(). This API allows the kernel to synchronize an operation across any number of cpus. Multiple cpus can initiate synchronization operations simultaneously without creating a deadlock. The API utilizes the IPI messaging core and guarantees that other synchronization and IPI messaging operations will continue to work during any given synchronization op. The API is a spin-blocking API, meaning that it will not switch threads and can be used by mainline code, interrupts, and other sensitive code. This API is intended to replace smp_rendezvous(), Xcpustop, and other hardwired IPI ops. It will also be used to fix our TLB shootdown code. As of this commit the API has not yet been connected to anything and has been tested only a little.
Create a new machine type, cpumask_t, to represent a mask of cpus, and replace earlier uses of __uint32_t for cpu masks with cpumask_t.
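A cpu mask of this kind is just a bitmask with one bit per cpu. The sketch below is a userland illustration, not the kernel definition: sk_cpumask_t stands in for cpumask_t, the 32-bit width follows the __uint32_t mentioned above, and the helper names are hypothetical. The sk_other_cpus() helper shows the "use a mask of started cpus instead of all-but-self" idea from the SMP startup commit.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for cpumask_t (assumption: one bit per cpu, at most 32 cpus). */
typedef uint32_t sk_cpumask_t;

/* Mask with only the bit for cpuid set. */
static sk_cpumask_t sk_cpumask(int cpuid)
{
    assert(cpuid >= 0 && cpuid < 32);
    return (sk_cpumask_t)1 << cpuid;
}

/* Non-zero if cpuid is a member of mask. */
static int sk_cpumask_isset(sk_cpumask_t mask, int cpuid)
{
    return (mask & sk_cpumask(cpuid)) != 0;
}

/* All cpus that actually started, excluding the calling cpu. */
static sk_cpumask_t sk_other_cpus(sk_cpumask_t started, int self)
{
    return started & ~sk_cpumask(self);
}
```

With cpus 0, 1 and 3 started, sk_other_cpus() called on cpu 1 yields a mask containing cpus 0 and 3 only, so a cpu that never came up is never targeted.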
* Update function defines to match up with the work from this morning so as to fix the kernel build process.
Use a globaldata_t instead of a cpuid in the lwkt_token structure. The LWKT subsystem already uses globaldata_t instead of cpuid for its thread td_gd reference, and the IPI messaging code will soon be converted to take a globaldata_t instead of a cpuid as well. This reduces the number of memory indirections we have to make to access the per-cpu globaldata space in various procedures.
This commit represents a major revamping of the clock interrupt and timebase
infrastructure in DragonFly.
* Rip out the existing 8254 timer 0 code, and also disable the use of
Timer 2 (which means that the PC speaker will no longer go beep). Timer 0
used to represent a periodic interrupt and a great deal of code was in
place to attempt to obtain a timebase off of that periodic interrupt.
Timer 0 is now used in software retriggerable one-shot mode to produce
variable-delay interrupts. A new hardware interrupt clock abstraction
called SYSTIMERS has been introduced which allows threads to register
periodic or one-shot interrupt/IPI callbacks at approximately 1uS
granularity.
Timer 2 is now set in continuous periodic mode with a period of 65536
and provides the timebase for the system, abstracted to 32 bits.
All the old platform-integrated hardclock() and statclock() code has
been rewritten. The old IPI forwarding code has been #if 0'd out and
will soon be entirely removed (the systimer abstraction takes care of
multi-cpu registrations now). The architecture-specific clkintr() now
simply calls an entry point into the systimer and provides a Timer 0
reload and Timer 2 timebase function API.
* On both UP and SMP systems, cpus register systimer interrupts for the Hz
interrupt, the stat interrupt, and the scheduler round-robin interrupt.
The abstraction is carefully designed to allow multiple interrupts occurring
at the same time to be processed in a single hardware interrupt. While
we currently use IPI's to distribute requested interrupts from other cpu's,
the intent is to use the abstraction to take advantage of per-cpu timers
when available (e.g. on the LAPIC) in the future.
systimer interrupts run OUTSIDE THE MP LOCK. Entry points may be called
from the hard interrupt or via an IPI message (IPI messages have always
run outside the MP lock).
* Rip out timecounters and disable alternative timecounter code for other
time sources. This is temporary. Eventually other time sources, such as
the TSC, will be reintegrated as independent, parallel-running entities.
There will be no 'time switching' per se; subsystems will be able to
select which timebase they wish to use. It is desirable to reintegrate
at least the TSC to improve [get]{micro,nano}[up]time() performance.
WARNING: PPS events may not work properly. They were not removed, but
they have not been retested with the new code either.
* Remove spl protection around [get]{micro,nano}[up]time() calls, they are
now internally protected.
* Use uptime instead of realtime in certain CAM timeout tests
* Remove struct clockframe. Use struct intrframe everywhere where clockframe
used to be used.
* Replace most splstatclock() protections with crit_*() protections, because
such protections must now also protect against IPI messaging interrupts.
* Add fields to the per-cpu globaldata structure to access timebase related
information using only a critical section rather than a mutex. However,
the 8254 Timer 2 access code still uses spin locks. More work needs to
be done here, the 'realtime' correction is still done in a single global
'struct timespec basetime' structure.
* Remove the CLKINTR_PENDING icu and apic interrupt hacks.
* Augment the IPI Messaging code to make an intrframe available to callbacks.
* Document 8254 timing modes in i386/isa/timerreg.h. Note that at the
moment we assume an 8254 instead of an 8253 as we are using TIMER_SWSTROBE
mode. This may or may not have to be changed to an 8253 mode.
* Integrate the NTP correction code into the new timebase subsystem.
* Separate boottime from basetime. Once boottime is believed to be stable
it is no longer affected by NTP or other time corrections.
CAVEATS:
* PC speaker no longer works
* Profiling interrupt rate not increased (it needs work to be
made operational on a per-cpu basis rather than system-wide).
* The native timebase API is function-based, but currently hardwired.
* There might or might not be issues with 486 systems due to the
timer mode I am using.
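The commit above states that Timer 2 runs continuously with a period of 65536 and that the resulting timebase is abstracted to 32 bits. One common way to widen a free-running 16-bit counter is to accumulate the modulo-65536 difference between successive reads. The sketch below shows that technique only; it is an assumption about the approach, not the actual kernel code, and all sk_ names are hypothetical. It treats the hardware value as a down-counter and requires at least one read per counter period.

```c
#include <assert.h>
#include <stdint.h>

/* Software state used to widen a 16-bit free-running counter. */
struct sk_timebase {
    uint32_t base;   /* accumulated ticks, wraps naturally at 2^32 */
    uint16_t last;   /* previous raw 16-bit reading */
};

/*
 * Fold a new raw reading into the 32-bit timebase and return it.
 * raw is the current value of a 16-bit down-counter, so the elapsed
 * tick count is (last - raw) modulo 65536. The result is only correct
 * if this is called at least once per 65536-tick period.
 */
static uint32_t sk_timebase_update(struct sk_timebase *tb, uint16_t raw)
{
    uint16_t delta = (uint16_t)(tb->last - raw);

    tb->base += delta;
    tb->last = raw;
    return tb->base;
}
```

The cast to uint16_t is what makes a wrapped reading come out right: going from 65336 down through zero to 65500 yields 65372 elapsed ticks rather than a negative number.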
CAPS IPC library stage 1/3: The core CAPS IPC code, providing system calls to create and connect to named rendezvous points. The CAPS interface implements a many-to-1 (client:server) capability and is totally self contained. The messaging is designed to support single and multi-threading, synchronous or asynchronous (as of this commit: polling and synchronous only). Message data is 100% opaque and so while the intention is to integrate it into a userland LWKT messaging subsystem, the actual system calls do not depend on any LWKT structures. Since these system calls are experimental and may contain root holes, they must be enabled via the sysctl kern.caps_enabled.
Add additional functionality to the upcall support to allow us to wait for an upcall instead of spin. Also fix a bug in the trap code. %gs faults have to be handled in nested interrupts because %gs is not saved and restored. It is also possible that %fs may have to be handled the same way, but I am not sure yet.
Do some fairly major include file cleanups to further separate kernelland
from userland.
* Do not allow userland to include sys/proc.h directly, it must use
sys/user.h instead. This is because sys/proc.h has a huge number
of kernel header file dependencies.
* Do cleanups and work in lwkt_thread.c and lwkt_msgport.c to allow
these files to be directly compiled in an upcoming userland thread
support library.
* sys/lock.h is inappropriately included by a number of third party
programs so we can't disallow its inclusion, but do not include
any kernel structures unless _KERNEL or _KERNEL_STRUCTURES are
defined.
* <ufs/ufs/inode.h> is often included by userland to get at the
on-disk inode structure. Only include the on-disk components and do
not include kernel structural components unless _KERNEL or
_KERNEL_STRUCTURES is defined
* Various usr.bin programs include sys/proc.h unnecessarily.
* The slab allocator has no concept of malloc buckets. Remove malloc
buckets structures and VMSTAT support from the system.
* Make adjustments to sys/thread.h and sys/msgport.h such that the
upcoming userland thread support library can include these files
directly rather than copy them.
* Use low level __int types in sys/globaldata.h, sys/msgport.h,
sys/slaballoc.h, sys/thread.h, and sys/malloc.h, instead of
high level sys/types.h types, reducing include dependencies.
Augment the LWKT thread creation APIs to allow a cpu to be specified. This will be used by upcoming netisr and interrupt thread work to create protocol and interrupt threads on specified cpus rather than cpu #0.
Fix the userland scheduler. When the scheduler releases the P_CURPROC designation it unconditionally handed it off to the highest priority process on the userland process queue, ignoring the fact that the 'current' process might have had a higher priority. There was also a missing call to lwkt_maybe_switch() in the resched_wanted() case that could cause interrupt threads to stall for a long period of time when they could not preempt. In SMP there are still some issues. Niced processes work better, but at the moment the P_CURPROC handoff does not take into account the fact that the new higher priority process might better be handed off to another cpu that is running a lower priority process than the current cpu.
Have lwkt_reltoken() return the generation number to facilitate checks for stolen tokens. Cleanup, optimize, and better document lwkt_gentoken().
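The idea of using a generation number to detect a stolen token can be shown with a toy model. This is a loose sketch under stated assumptions and does not reproduce the real lwkt_reltoken() or lwkt_gentoken() signatures or semantics. It assumes a token stays associated with its last owner after release and that the generation is bumped whenever ownership moves. All sk_ names are hypothetical.

```c
#include <assert.h>

/* Toy token: an owner id plus a generation bumped on every ownership change. */
struct sk_token {
    unsigned gen;    /* generation number */
    int owner;       /* id of the current/last owner, -1 if never owned */
};

/* Acquire the token for 'who'; a change of owner bumps the generation. */
static unsigned sk_gettoken(struct sk_token *tok, int who)
{
    assert(who >= 0);
    if (tok->owner != who) {
        tok->owner = who;
        tok->gen++;
    }
    return tok->gen;
}

/* Release returns the generation so the caller can remember it. */
static unsigned sk_reltoken(struct sk_token *tok)
{
    return tok->gen;
}

/* Non-zero if someone else took the token since 'seen' was recorded. */
static int sk_token_stolen(const struct sk_token *tok, unsigned seen)
{
    return tok->gen != seen;
}
```

A caller that saved the value returned at release time can later compare it against the token and revalidate any cached state only when the generation has moved.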
Fix a number of interrupt related issues.
* Don't access kernel_map in free(), defer such operations to malloc()
* Fix a slab allocator panic due to mishandling of malloc size slab limit checks on machines with small amounts of memory (the slab allocator reduces the size of the zone on low-memory machines but did not handle the reduced size properly).
* Add thread->td_nest_count to prevent splz recursions from underflowing the kernel stack. This can occur because we drop the critical section when calling sched_ithd() in order to allow it to preempt.
* Properly adjust intr_nesting_level around FAST interrupts
* Adjust the debugging printf() in lockmgr to only complain about blockable lock requests from interrupts.
Clean up thread priority and critical section handling during boot. The initial kernel threads (e.g. thread0/proc0) had a priority lower than userland! Default them to the minimum kernel thread priority. Thread0 was also unnecessarily left in a critical section, which prevented certain device probes, such as the APIC 8254 timer test code, from working.
Add the NO_KMEM_MAP kernel configuration option. This is a temporary option that will allow developers to test kmem_map removal and also the upcoming (not this commit) slab allocator. Currently this option removes kmem_map and causes the malloc and zalloc subsystems to use kernel_map exclusively. Change gd_intr_nesting_level. This variable is now only bumped while we are in a FAST interrupt or processing an IPIQ message. This variable is not bumped while we are in a normal interrupt or software interrupt thread. Add warning printf()s if malloc() and related functions detect attempts to use them from within a FAST interrupt or IPIQ. Remove references to the no-longer-used zalloci() and zfreei() functions.
Fix typos in comments.
__P() != wanted, begin removal. In order to preserve white space this needs to be done by hand, as I accidentally killed a source tree that I had gotten this far on. I'm committing this now; LINT and GENERIC both build with these changes, and there are many more to come.
Fix a minor bug in lwkt_init_thread() (the thread was being added to the wrong td_allq). Remove thread->td_cpu. thread->td_gd (which points to the globaldata structure) is sufficient. Add e_cpuid to eproc to compensate.
Syscall messaging work 2: Continue with the implementation of sendsys(),
using int 0x81. This entry point will be responsible for sending system
call messages or waiting for messages / port activity.
With this commit system call messages can be run through 0x81 but at the
moment they will always run synchronously. Here's the core interface
code for IA32:
static __inline int
sendsys(void *port, void *msg, int msgsize)
{
    int error;
    __asm __volatile("int $0x81" : "=a"(error) :
                     "a"(port), "c"(msg), "d"(msgsize) : "memory");
    return(error);
}
Performance versus a direct system call is currently excellent considering
that this is my initial attempt.
                600MHzC3    1.2GHzP3x2(SMP)
getuid()        1300 ns     909 ns
getuid_msg()    1700 ns     1077 ns
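To put the numbers above in perspective, the added cost of the messaging path can be computed directly from the table. The helpers below are only an arithmetic sketch with hypothetical names; the percentages use integer division and therefore round down.

```c
#include <assert.h>

/* Extra nanoseconds spent by the message-based call over the direct call. */
static unsigned sk_overhead_ns(unsigned direct_ns, unsigned msg_ns)
{
    assert(msg_ns >= direct_ns);
    return msg_ns - direct_ns;
}

/* The same overhead as a whole-number percentage of the direct call time. */
static unsigned sk_overhead_pct(unsigned direct_ns, unsigned msg_ns)
{
    assert(direct_ns > 0);
    return sk_overhead_ns(direct_ns, msg_ns) * 100u / direct_ns;
}
```

For the 600MHz C3 this gives 400 ns (about 30%) and for the 1.2GHz P3 SMP box 168 ns (about 18%).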
DEV messaging stage 2/4: In this stage all DEV commands are now being funneled through the message port for action by the port's beginmsg function. CONSOLE and DISK device shims replace the port with their own and then forward to the original. FB (Frame Buffer) shims supposedly do the same thing but I haven't been able to test it. I don't expect instability in mainline code but there might be easy-to-fix issues, and some drivers still need to be converted. See primarily: kern/kern_device.c (new dev_*() functions and inherits cdevsw code from kern/kern_conf.c), sys/device.h, and kern/subr_disk.c for the high points. In this stage all DEV messages are still acted upon synchronously in the context of the caller. We cannot create a separate handler thread until the copyin's (primarily in ioctl functions) are made thread-aware. Note that the messaging shims are going to look rather messy in these early days but as more subsystems are converted over we will begin to use pre-initialized messages and message forwarding to avoid having to constantly rebuild messages prior to use. Note that DEV itself is a mess owing to its 4.x roots and will be cleaned up in subsequent passes. e.g. the way sub-devices inherit the main device's cdevsw was always a bad hack and it still is, and several functions (mmap, kqfilter, psize, poll) return results rather than error codes, which will be fixed since now we have a message to store the result in :-)
This is the initial implementation of the LWKT messaging infrastructure. Messages are sent to message ports and typically replied to a message port embedded in the originating thread's thread structure (td_msgport). The port functions match up and optimize client sync/asynch requests versus target synch/asynch responses. In this initial implementation a port must be owned by a particular thread, and we use *asynch* IPI messaging to forward queueing and dequeueing operations to the correct cpu. Most of the IPI overhead will be absorbed by the fact that these same IPIs also tend to schedule the threads in question, which on the correct cpu (which is the one it will be on) costs nothing. Message ports have in-context dispatch functions for initiating, aborting, and replying to a message which can be overridden and will queue by default. This code compiles but is as yet unreferenced, and almost certainly needs more work.
Collapse gd_astpending and gd_reqpri together into gd_reqflags. gd_reqflags now rolls up requests made pending for doreti. Cleanup a number of scheduling primitives and note that we do not need to use locked bus cycles on per-cpu variables. Note that the awful idelayed hack for certain softints (used only by the TTY subsystem, BTW) gets slightly broken in this commit because idelayed has become per-cpu and the clock ints aren't yet distributed.
MP Implementation 4/4: Final cleanup for this stage. Deal with a race that occurs due to not having to hold the MP lock through an lwkt_switch() where another cpu may pull off a process from the userland scheduler and schedule its thread before the original cpu has completely switched it out. Oddly enough latencies were enough that this bug never caused a crash! Cleanup the scheduling code and in particular the switch assembly code, save and restore eflags (cli/sti state) when switching heavy weight processes (this is already done for light weight threads), add some counters, and optimize fork() to (statistically) stay on the current cpu for a short while to take advantage of locality of cache reference, which greatly improves fork/exec times. Note that synchronous pipe operations between two processes already (statistically) stick to the same cpu (which is what we want).
MP Implementation 3B/4: Remove Xcpuast and Xforward_irq, replacing them with IPI messaging functions. Fix user scheduling issues so user processes are dependably scheduled on available cpus.
MP Implementation 3/4: MAJOR progress on SMP, full userland MP is now working! A number of issues relating to MP lock operation have been fixed, primarily that we have to read %cr2 before get_mplock() since get_mplock() may switch away. Idlethreads can now safely HLT without any performance detriment. The userland scheduler has been almost completely rewritten and is now using an extremely flexible abstraction with a lot of room to grow. pgeflag has been removed from mapdev (without per-page invalidation it isn't safe to use PG_G even on UP). Necessary locked bus cycles have been added for the pmap->pm_active field in swtch.s. CR3 has been unoptimized for the moment (see comment in swtch.s). Since the switch code runs without the MP lock we have to adjust pm_active PRIOR to loading %cr3. Additional sanity checks have been added to the code (see PARANOID_INVLTLB and ONLY_ONE_USER_CPU in the code), plus many more in kern_switch.c. A passive release mechanism has been implemented to optimize P_CURPROC/lwkt priority shifting when going from user->kernel and kernel->user. Note: preemptive interrupts don't care due to the way preemption works so no additional complexity there. Non-locking atomic functions to protect only against local interrupts have been added. astpending now uses non-locking atomic functions to set and clear bits. private_tss has been moved to a per-cpu variable. The LWKT thread module has been considerably enhanced and cleaned up, including some fixes to handle MPLOCKED vs td_mpcount races (so eventually we can do MP locking without a pushfl/cli/popfl combo). stopevent() needs critical section protection, maybe.
MP Implementation 2/4: Implement a poor-man's IPI messaging subsystem, get both cpus arbitrating the BGL for interrupts, IPIing foreign cpu LWKT scheduling requests without crashing, and dealing with the cpl. The APs are in a slightly less degenerate state now, but hardclock and statclock distribution is broken, only one user process is being scheduled at a time, and priorities are all messed up.
MP Implementation 1/2: Get the APIC code working again, sweetly integrate the MP lock into the LWKT scheduler, replace the old simplelock code with tokens or spin locks as appropriate. In particular, the vnode interlock (and most other interlocks) are now tokens. Also clean up a few curproc/cred sequences that are no longer needed. The APs are left in degenerate state with non IPI interrupts disabled as additional LWKT work must be done before we can really make use of them, and FAST interrupts are not managed by the MP lock yet. The main thing for this stage was to get the system working with an APIC again. buildworld tested on UP and 2xCPU/MP (Dell 2550)
Generic MP rollup work.
Add threads to the process-retrieval sysctls so they show up in top, ps, etc. Reorder the boot sequence a little to add a TAILQ for all threads. Add a td_refs field to prevent a thread from disappearing on us.
Misc interrupts/LWKT 1/2: threaded interrupts 2: Major work on the user scheduler, separate it completely from the LWKT scheduler and make user priorities, including idprio, normal, and rtprio, work properly. This includes fixing the priority inversion problem that 4.x had. Also complete the work on interrupt preemption. There were a few things I wasn't doing correctly including not protecting the initial call to cpu_heavy_restore when a process is just starting up. Enhance DDB a bit (threads don't show up in PS yet). This is a major milestone.
Misc interrupts/LWKT 1/2: interlock the idle thread. Put execution of fast interrupts inside a critical section. Make the hardclock and statclock INTR_FAST. Implement the strict priority queue mechanism for LWKTs. Implement prioritized preemption for interrupt and softint preemption. Keep better stats. Note: this commit hacks up the userland scheduler, in particular the notion of 'curproc' because threaded interrupts really mess up the userland scheduler's idea of curproc, which it uses to assume that the process is not on a run queue even though it is runnable. The next step will be to separate out and cleanup the userland scheduler.
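A strict priority queue mechanism of the kind mentioned above is often built from one queue per priority plus a bitmask of non-empty queues, so that the scheduler always picks from the highest non-empty level. The following is a simplified userland sketch of that general technique. The 32-level layout and all sk_ names are assumptions, and per-priority counters stand in for real thread lists.

```c
#include <assert.h>
#include <stdint.h>

#define SK_NQUEUES 32   /* assumption: 32 priority levels, 31 is highest */

struct sk_runq {
    uint32_t mask;              /* bit p is set while level p is non-empty */
    int count[SK_NQUEUES];      /* runnable threads queued at each level */
};

/* Queue one runnable thread at priority pri. */
static void sk_runq_add(struct sk_runq *rq, int pri)
{
    assert(pri >= 0 && pri < SK_NQUEUES);
    rq->count[pri]++;
    rq->mask |= (uint32_t)1 << pri;
}

/*
 * Remove one thread from the highest non-empty level and return that
 * level, or -1 if nothing is runnable. Lower levels never run while a
 * higher level still has work, which is what makes the policy strict.
 */
static int sk_runq_pick(struct sk_runq *rq)
{
    int pri;

    if (rq->mask == 0)
        return -1;
    for (pri = SK_NQUEUES - 1; ((rq->mask >> pri) & 1) == 0; pri--)
        ;
    if (--rq->count[pri] == 0)
        rq->mask &= ~((uint32_t)1 << pri);
    return pri;
}
```

Preemption decisions fit the same picture: an interrupt thread made runnable at a level above the current thread's level is the next one sk_runq_pick() would return.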
Implement interrupt thread preemption + minor cleanup.
threaded interrupts 1: Rewrite the ICU interrupt code, splz, and doreti code. The APIC code hasn't been done yet. Consolidate many interrupt thread related functions into MI code, especially software interrupts. All normal interrupts and software interrupts are now threaded, and I'm almost ready to deal with interrupt-thread-only preemption. At the moment I run interrupt threads in a critical section and probably will continue to do so until I can make them MP safe.
smp/up collapse stage 2 of 2: cleanup the globaldata structure, cleanup and separate machine dependent portions of thread, proc, and globaldata, and reduce the need to include lots of MD header files.
Cleanup lwkt threads a bit, change the exit/reap interlock.
proc->thread stage 6: kernel threads now create processless LWKT threads. A number of obvious curproc cases were removed, tsleep/wakeup was made to work with threads (wmesg, ident, and timeout features moved to threads). There are probably a few curproc cases left to fix.
proc->thread stage 4: rework the VFS and DEVICE subsystems to take thread pointers instead of process pointers as arguments, similar to what FreeBSD-5 did. Note however that ultimately both APIs are going to be message-passing which means the current thread context will not be useable for creds and descriptor access.
proc->thread stage3: make time accounting thread-based and rework it for performance. Cleanup user/sys/interrupt time accounting. Get rid of the microputime and equivalent support code in mi_switch() (it was really a bad idea to put that in the critical path IMHO). Instead account for time statistically from the statclock, which produces time accounting that is just as accurate in the long haul. Remove the u/s/iticks fields from the proc structure and put a slightly different version in the thread structure, so time can be accounted for both threads and processes.
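Statistical accounting from the statclock boils down to charging each tick to whatever mode the thread was in when the tick fired. The sketch below illustrates the idea in userland. The struct only loosely mirrors the u/s/iticks fields named above, and the sk_ names and the permille helper are hypothetical.

```c
#include <assert.h>

/* Mode the cpu was in when the statistics clock fired. */
enum sk_mode { SK_MODE_USER, SK_MODE_SYS, SK_MODE_INTR };

/* Per-thread tick counters, loosely modeled on u/s/iticks. */
struct sk_acct {
    unsigned long uticks;   /* ticks observed in user mode */
    unsigned long sticks;   /* ticks observed in system mode */
    unsigned long iticks;   /* ticks observed in interrupt context */
};

/* Called once per statclock tick: charge the tick to the sampled mode. */
static void sk_statclock_tick(struct sk_acct *a, enum sk_mode mode)
{
    switch (mode) {
    case SK_MODE_USER: a->uticks++; break;
    case SK_MODE_SYS:  a->sticks++; break;
    case SK_MODE_INTR: a->iticks++; break;
    }
}

/* User-mode share in parts per thousand, 0 if no ticks were recorded. */
static unsigned long sk_user_permille(const struct sk_acct *a)
{
    unsigned long total = a->uticks + a->sticks + a->iticks;

    return total ? a->uticks * 1000UL / total : 0;
}
```

No timestamps are taken on the context switch path. The per-tick sampling error averages out over many ticks, which is the "just as accurate in the long haul" argument.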
Optimize lwkt_rwlock.c a bit
thread stage 10: (note stage 9 was the kern/lwkt_rwlock commit). Cleanup thread and process creation functions. Check the spl against ipending in cpu_lwkt_restore (so the idle loop does not lockup the machine). Remove the old VM object kstack allocation and freeing code. Leave newly created processes in a stopped state to fix wakeup/fork_handler races. Normalize the lwkt_init_*() functions. Add a sysctl debug.untimely_switch which will cause the last crit_exit() to yield, which causes a task switch to occur in wakeup() and catches a lot of 4.x-isms that can be found and fixed on UP.
Add kern/lwkt_rwlock.c -- reader/writer locks. Clean up the process exit & reaping interlock code to allow context switches to occur. Clean up and make operational the lwkt_block/signaling code.
thread stage 8: add crit_enter(), per-thread cpl handling, fix deferred interrupt handling for critical sections, add some basic passive token code, and blocking/signaling code. Add structural definitions for additional LWKT mechanisms. Remove asleep/await. Add generation number based xsleep/xwakeup. Note that when exiting the last crit_exit() we run splz() to catch up on blocked interrupts. There is also some #if 0'd code that will cause a thread switch to occur 'at odd times'... primarily wakeup()->lwkt_schedule()->critical_section->switch. This will be useful for testing purposes down the line. The passive token code is mostly disabled at the moment. Its primary use will be under SMP and its primary advantage is very low overhead on UP and, if used properly, should also have good characteristics under SMP.
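The interaction between nested critical sections and deferred interrupts can be modeled in a few lines. This is a userland sketch of the general pattern only: the sk_ names and fields are hypothetical, and the "catch up" step merely stands in for what the commit describes as running splz() on the last crit_exit().

```c
#include <assert.h>

/* Toy per-thread state for nested critical sections. */
struct sk_crit {
    int count;          /* critical section nesting depth */
    unsigned pending;   /* interrupts deferred while count > 0 */
    unsigned handled;   /* interrupts that have actually been run */
};

static void sk_crit_enter(struct sk_crit *t)
{
    t->count++;
}

/* An interrupt arrives: run it now, or defer it if inside a section. */
static void sk_interrupt(struct sk_crit *t, unsigned irqbit)
{
    if (t->count > 0)
        t->pending |= irqbit;
    else
        t->handled |= irqbit;
}

/* Only the outermost exit catches up on the deferred interrupts. */
static void sk_crit_exit(struct sk_crit *t)
{
    assert(t->count > 0);
    if (--t->count == 0 && t->pending != 0) {
        t->handled |= t->pending;
        t->pending = 0;
    }
}
```

Entering is just an increment, which is why critical sections are cheap. The cost of replaying interrupts is paid only once, when the nesting depth returns to zero.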
thread stage 7: Implement basic LWKTs, use a straight round-robin model for the moment. Also continue consolidating the globaldata structure so both UP and SMP use it with more commonality. Temporarily match user processes up with scheduled LWKTs on a 1:1 basis. Eventually user processes will have LWKTs, but they will not all be scheduled 1:1 with the user process's runnability. With this commit work can potentially start to fan out, but I'm not ready to announce yet.
thread stage 6: Move thread stack management from the proc structure to the thread structure, cleanup the pmap_new_*() and pmap_dispose_*() functions, and disable UPAGES swapping (if we eventually separate the kstack from the UPAGES we can reenable it). Also LIFO/4 cache thread structures which improves fork() performance by 40% (when used in typical fork/exec/exit or fork/subshell/exit situations).
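Caching a handful of recently freed thread structures in LIFO order avoids a full allocate/free cycle in fork/exit-heavy workloads. The sketch below reads "LIFO/4" as a LIFO cache holding up to four entries, which is an assumption. It uses malloc()/free() as a stand-in for the real stack and structure allocation, and all sk_ names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define SK_CACHE_MAX 4   /* assumption: keep at most four free structures */

struct sk_thread {
    int id;              /* placeholder payload */
};

static struct sk_thread *sk_cache[SK_CACHE_MAX];
static int sk_cache_n;   /* number of cached structures */

/* Reuse the most recently freed structure if any, else allocate. */
static struct sk_thread *sk_thread_alloc(void)
{
    if (sk_cache_n > 0)
        return sk_cache[--sk_cache_n];
    return malloc(sizeof(struct sk_thread));
}

/* Keep the structure for reuse unless the cache is already full. */
static void sk_thread_free(struct sk_thread *td)
{
    assert(td != NULL);
    if (sk_cache_n < SK_CACHE_MAX)
        sk_cache[sk_cache_n++] = td;
    else
        free(td);
}
```

LIFO order means the structure handed out next is the one freed most recently, which is also the one most likely to still be warm in the cpu cache.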
Oops commit the thread.h file.