Up to [DragonFly] / src / sys / kern
Keyword substitution: kv
Default branch: MAIN
Fix issues with the scheduler that were causing unnecessary reschedules between tightly coupled processes as well as inefficient reschedules under heavy loads.

The basic problem is that a process entering the kernel is 'passively released', meaning its thread priority is left at TDPRI_USER_NORM. The thread priority is only raised to TDPRI_KERN_USER if the thread switches out. This has the side effect of forcing a LWKT reschedule when any other user process wakes up from a blocked condition in the kernel, regardless of its user priority, because its LWKT thread was at the higher TDPRI_KERN_USER priority. This resulted in significant switching cavitation under load.

There is a twist here because we do not want to starve threads running in the kernel acting on behalf of a very low priority user process; doing so can deadlock the namecache or other kernel elements that sleep with lockmgr locks held. In addition, the 'other' LWKT thread might be associated with a much higher priority user process that we *DO* in fact want to give cpu to.

The solution is elegant. First, do not force a LWKT reschedule for the above case. Second, force a LWKT reschedule on every hard clock. Remove all the old hacks. That's it!

The result is that the current thread is allowed to return to user mode and run until the next hard clock even if other LWKT threads (running on behalf of a user process) are runnable. Pure kernel LWKT threads still get absolute priority, of course. When the hard clock occurs the other LWKT threads get the cpu, and at the end of that whole mess most of those LWKT threads will be trying to return to user mode, so the user scheduler will be able to select the best one. Doing this on a hardclock boundary prevents cavitation from occurring at the syscall enter and return boundary.

With this change the TDF_NORESCHED and PNORESCHED flags and their associated code hacks have also been removed, along with lwkt_checkpri_self(), which is no longer needed.
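The two-part rule above can be sketched as a small predicate. This is an illustration only: the constant below is a placeholder, not the real TDPRI_* value from sys/thread.h, and the real decision is made via flags set from the hardclock interrupt rather than a boolean argument.

```c
#include <assert.h>

/* Placeholder priority, not the real DragonFly value. Threads above
 * this are "pure kernel" LWKT threads; at or below it are threads
 * acting on behalf of user processes. */
#define TDPRI_KERN_USER 10

/*
 * Should the current thread, returning to user mode, yield to a
 * runnable LWKT thread of priority other_pri?  Pure kernel threads
 * preempt immediately; threads running on behalf of user processes
 * must wait for the next hard clock tick.
 */
static int
want_lwkt_resched(int other_pri, int hardclock_tick)
{
    if (other_pri > TDPRI_KERN_USER)
        return 1;            /* pure kernel LWKT thread: absolute priority */
    return hardclock_tick;   /* everyone else waits for the hard clock */
}
```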
* Fix some cases where NULL was used but 0 was meant (and vice versa). * Remove some bogus casts of NULL to (void *).
Fix feature logic so changing kern.pipe.dwrite_enable on the fly works properly. Before it could cause processes to block forever.
Fix many bugs and issues in the VM system, particularly related to heavy paging.

* (cleanup) PG_WRITEABLE is now set by the low level pmap code and not by high level code. It means 'this page may contain a managed page table mapping which is writeable', meaning that hardware can dirty the page at any time. The page must be tested via appropriate pmap calls before being disposed of.

* (cleanup) PG_MAPPED is now handled by the low level pmap code and only applies to managed mappings. There is still a bit of cruft left over related to the pmap code's page table pages but the high level code is now clean.

* (bug) Various XIO, SFBUF, and MSFBUF routines which bypass normal paging operations were not properly dirtying pages when the caller intended to write to them.

* (bug) vfs_busy_pages in kern/vfs_bio.c had a busy race. Separate the code out to ensure that we have marked all the pages as undergoing IO before we call vm_page_protect(). vm_page_protect(... VM_PROT_NONE) can block under very heavy paging conditions, and if the pages haven't been marked for IO that could blow up the code.

* (optimization) Make a minor optimization: when busying pages for write IO, downgrade the page table mappings to read-only instead of removing them entirely.

* (bug) In platform/pc32/i386/pmap.c fix various places where pmap_inval_add() was being called at the wrong point. Only one was critical, in pmap_enter(), where pmap_inval_add() was being called so far away from the pmap entry being modified that it could wind up being flushed out prior to the modification, breaking the required cpusync. pmap.c also contains most of the work involved in the PG_MAPPED and PG_WRITEABLE changes.

* (bug) Close numerous pte updating races with hardware setting the modified bit. There is still one race left (in pmap_enter()).

* (bug) Disable pmap_copy() entirely. Fix most of the bugs anyway, but there is still one left in the handling of the srcmpte variable.

* (cleanup) Change vm_page_dirty() from an inline to a real procedure, and move the code which sets the object to writeable/maybedirty into vm_page_dirty().

* (bug) Calls to vm_page_protect(... VM_PROT_NONE) can block. Fix all cases where this call was made with a non-busied page. All such calls are now made with a busied page, preventing blocking races from re-dirtying or remapping the page unexpectedly. (Such blockages could only occur during heavy paging activity where the underlying page table pages are being actively recycled.)

* (bug) Fix the pageout code to properly mark pages as undergoing I/O before changing their protection bits.

* (bug) Busy pages undergoing zeroing or partial zeroing in the vnode pager (vm/vnode_pager.c) to avoid unexpected effects.
Fix some lock ordering issues in the pipe code. In particular fix a bug in the pipe_write() code when multiple writers are present that could cause garbage to be injected into the pipe due to a resize possibly occurring while wpipe->pipe_buffer.cnt is non-zero.
The direct-write pipe code has a bug in it somewhere when the system is paging heavily. Disable it for now.
Make kernel_map, buffer_map, clean_map, exec_map, and pager_map direct structural declarations instead of pointers. Clean up all related code, in particular kmem_suballoc(). Remove the offset calculation for kernel_object. kernel_object's page indices used to be relative to the start of kernel virtual memory in order to improve the performance of VM page scanning algorithms. The optimization is no longer needed now that VM objects use Red-Black trees. Removal of the offset simplifies a number of calculations and makes the code more readable.
Ansify function declarations and fix some minor style issues. In-collaboration-with: Alexey Slynko <email@example.com>
Move flag(s) representing the type of vm_map_entry into its own vm_maptype_t type. This is a precursor to adding a new VM mapping type for virtualized page tables.
Rename malloc->kmalloc, free->kfree, and realloc->krealloc. Pass 1
Get rid of some unused fields in the fileops and adjust the declarations to use the '.field = blah' initialization method.
Add kernel syscall support for explicit blocking and non-blocking I/O regardless of the setting applied to the file pointer.

send/sendmsg/sendto/recv/recvmsg/recvfrom:

New MSG_ flags defined in sys/socket.h may be passed to these functions to override the settings applied to the file pointer on a per-I/O basis.

    MSG_FBLOCKING    - Force the operation to be blocking
    MSG_FNONBLOCKING - Force the operation to be non-blocking

pread/preadv/pwrite/pwritev:

These system calls have been renamed and wrappers will be added to libc. The new system calls are prefixed with a double underscore (like getcwd vs __getcwd) and include an additional flags argument. The new flags are defined in sys/fcntl.h and may be used to override settings applied to the file pointer on a per-I/O basis. Additionally, the internal __ versions of these functions now accept an offset of -1 to mean 'degenerate into a read/readv/write/writev' (i.e. use the offset in the file pointer and update it on completion).

    O_FBLOCKING    - Force the operation to be blocking
    O_FNONBLOCKING - Force the operation to be non-blocking
    O_FAPPEND      - Force the write operation to append (to a regular file)
    O_FOFFSET      - (implied if the offset != -1) the offset is valid
    O_FSYNCWRITE   - Force a synchronous write
    O_FASYNCWRITE  - Force an asynchronous write
    O_FUNBUFFERED  - Force an unbuffered operation (O_DIRECT)
    O_FBUFFERED    - Force a buffered operation (negate O_DIRECT)

If the flags do not specify an operation (e.g. neither FBLOCKING nor FNONBLOCKING is set), then the settings in the file pointer are used. The original system calls will become wrappers in libc, without the flags arguments. The new system calls will be made available to libc_r to allow it to perform non-blocking I/O without having to mess with a descriptor's file flags.

NOTE: the new __pread and __pwrite system calls are backwards compatible with the originals due to a pad byte that libc always set to 0. The new __preadv and __pwritev system calls are NOT backwards compatible, but since they were added to HEAD just two months ago I have decided not to renumber them either.

NOTE: The subrev has been bumped to 1.5.4 and installworld will refuse to install if you are not running at least a 1.5.4 kernel.
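The override rule ("per-I/O flags win, otherwise inherit the file pointer's setting") can be sketched in userland C. The flag values below are made up for illustration; the real O_FBLOCKING/O_FNONBLOCKING values live in sys/fcntl.h and FNONBLOCK in sys/file.h.

```c
#include <assert.h>

/* Placeholder flag values for illustration only. */
#define O_FBLOCKING    0x1   /* per-I/O: force blocking */
#define O_FNONBLOCKING 0x2   /* per-I/O: force non-blocking */
#define FNONBLOCK      0x4   /* file-pointer-level non-blocking flag */

/*
 * Decide whether a single I/O should be non-blocking, given the file
 * pointer's flags and the per-I/O override flags.  If neither force
 * flag is present, the file pointer's own setting applies.
 */
static int
io_nonblocking(int fp_flags, int ioflags)
{
    if (ioflags & O_FBLOCKING)
        return 0;                        /* forced blocking */
    if (ioflags & O_FNONBLOCKING)
        return 1;                        /* forced non-blocking */
    return (fp_flags & FNONBLOCK) != 0;  /* inherit fp setting */
}
```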
Modify kern/makesyscalls.sh to prefix all kernel system call procedures with "sys_". Modify all related kernel procedures to use the new naming convention. This gets rid of most of the namespace overloading between the kernel and standard header files.
More MP work.

* Incorporate fd_knlistsize initialization into fsetfd().

* Mark all fileops vectors as MPSAFE (but get the mplock for most of them). Clean up a number of fileops routines, mainly *_ioctl().

* Make crget(), crhold(), and crfree() MPSAFE. crfree() still needs the mplock on the last release. Give ucred a spinlock to handle the crfree() 0 transition race.
Do a major cleanup of the file descriptor handling code in preparation for making the descriptor table MPSAFE. Introduce a new feature that allows a file descriptor number to be reserved without having to assign a file pointer to it. This allows code such as open(), dup(), etc to reserve descriptors to work with without having to worry about the related file being ripped out from under them by another thread sharing the descriptor table.

falloc() - This function allocates the file pointer and descriptor as before, but does NOT associate the file pointer with the descriptor. Before this change another thread could access the file pointer while the system call creating it was blocked, before the system call had a chance to completely initialize the file pointer. The caller must call fsetfd() to assign or clear the reserved descriptor.

fsetfd() - Is now responsible for associating a file pointer with a previously reserved descriptor or clearing the reservation.

fdealloc() - This hack existed to deal with open/dup races against other threads. The above changes remove the possibility, so this routine has been deleted.

dup code - kern_dup() and dupfdopen() have been completely rewritten. They are much cleaner and less obtuse now. Additional race conditions in the original code were also found and fixed.

funsetfd() - Now returns the file pointer that was cleared and takes responsibility for adjusting fd_lastfile. NOTE: fd_lastfile is inclusive of any reserved descriptors.

fdcopy() - While not yet MPSAFE, fdcopy() now properly handles races against other threads.

fdp->fd_lastfile - This field was not being properly updated in certain failure cases. This commit fixes that. Also, if all a process's descriptors were closed this field was incorrectly left at 0 when it should have been set to -1.

fdp->fd_files - A number of code blocks were trying to optimize a for() loop over all file descriptors by caching a pointer to fd_files. This is a problem because fd_files can be reallocated if code within the loop blocks. These loops have been rewritten.
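The reserve-then-assign idea above can be modeled with a toy descriptor table. This is a simplified userland sketch under assumed names (fd_reserve/fd_assign are illustrative, not the kernel's falloc/fsetfd, and there is no locking here): a reserved slot holds no visible file pointer, so other threads sharing the table cannot touch a half-initialized file.

```c
#include <assert.h>
#include <stddef.h>

#define NFILES 8

struct file { int refs; };

static struct file *fd_files[NFILES];     /* NULL = no file published */
static char         fd_reserved[NFILES];  /* 1 = slot reserved, fp pending */

/*
 * Reserve the lowest unused descriptor without publishing a file
 * pointer.  Other threads see neither a file nor a free slot here.
 */
static int
fd_reserve(void)
{
    for (int fd = 0; fd < NFILES; ++fd) {
        if (fd_files[fd] == NULL && !fd_reserved[fd]) {
            fd_reserved[fd] = 1;
            return fd;
        }
    }
    return -1;
}

/*
 * Publish a fully initialized file pointer into a previously reserved
 * slot (the analogue of fsetfd() associating fp with the descriptor).
 */
static void
fd_assign(int fd, struct file *fp)
{
    assert(fd_reserved[fd]);
    fd_files[fd] = fp;      /* only now does the fp become visible */
    fd_reserved[fd] = 0;
}
```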
Consolidate the file descriptor destruction code used when a newly created file descriptor must be destroyed due to an error into a new procedure, fdealloc(), rather than manually repeating it over and over again. Move holdsock() and holdfp() into kern/kern_descrip.c.
The fdrop() procedure no longer needs a thread argument, remove it.
The thread/proc pointer argument in the VFS subsystem originally existed for... well, I'm not sure *WHY* it originally existed, when most of the time the pointer couldn't be anything other than curthread or curproc or the code wouldn't work. This is particularly true of lockmgr locks. Remove the pointer argument from all VOP_*() functions, all fileops functions, and most ioctl functions.
Now that the C language has a "void *", use it instead of caddr_t.
Make shutdown() a fileops operation rather than a socket operation. Pipes are full-duplex entities, so implement shutdown support for them.
The pipe code was not properly handling kernel space writes. Such writes can be made by the journaling code when journaling to a pipe.
File descriptor cleanup stage 2, remove the separate arrays for file pointers, fileflags, and allocation counts and replace the mess with a single structural array. Also revamp the code that checks whether the file descriptor array is built-in or allocated. Note that the removed malloc's were doing something weird, allocating 'nf * OFILESIZE + 1' bytes instead of 'nf * OFILESIZE' bytes. I could not find any reason at all why it was doing that. It's gone now anyway.
Replace the linear search in file descriptor allocation with an O(log N) algorithm based on full in-place binary search trees augmented with subtree free file descriptor counts. Idea from: Solaris
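The augmented-tree idea can be sketched with an implicit complete binary tree of subtree free counts; finding the lowest free descriptor descends left whenever the left subtree still has a free slot. This is a simplification under assumed names: the real kern_descrip.c code embeds the counts in the fd_files array itself rather than using a separate segment tree.

```c
#include <assert.h>

#define NFD 16                    /* must be a power of two in this sketch */
static int freecnt[2 * NFD];      /* implicit binary tree of free counts;
                                   * leaves NFD..2*NFD-1 map to fds 0..NFD-1 */

static void
fd_table_init(void)
{
    for (int i = NFD; i < 2 * NFD; ++i)
        freecnt[i] = 1;           /* every descriptor starts free */
    for (int i = NFD - 1; i >= 1; --i)
        freecnt[i] = freecnt[2*i] + freecnt[2*i + 1];
}

/*
 * Allocate the lowest free descriptor in O(log N): walk down the tree,
 * preferring the left subtree whenever it still has a free slot, then
 * decrement the free counts along the path back to the root.
 */
static int
fd_alloc(void)
{
    if (freecnt[1] == 0)
        return -1;                /* table full */
    int i = 1;
    while (i < NFD)
        i = freecnt[2*i] ? 2*i : 2*i + 1;
    for (int j = i; j >= 1; j /= 2)
        --freecnt[j];
    return i - NFD;
}

static void
fd_free(int fd)
{
    for (int j = NFD + fd; j >= 1; j /= 2)
        ++freecnt[j];             /* restore counts along the path */
}
```

Freeing a low-numbered descriptor makes it the next one allocated, matching the lowest-free-fd semantics POSIX requires of open() and dup().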
pipe->pipe_buffer.out was not being reset to 0 when switching from direct mode back to copy mode, leading to the pipe data becoming corrupt. Reported-by: Joerg Sonnenberger <firstname.lastname@example.org>
Clean up the XIO API and structure. XIO no longer tries to 'track' partial copies into or out of an XIO. It no longer adjusts xio_offset or xio_bytes once they have been initialized. Instead, a relative offset is now passed to API calls to handle partial copies. This makes the API a lot less confusing and makes the XIO structure a lot more flexible, shareable, and more suitable for use by higher level entities (buffer cache, pipe code, upcoming MSFBUF work, etc).
VFS messaging/interfacing work stage 9/99: VFS 'NEW' API WORK. NOTE: unionfs and nullfs are temporarily broken by this commit.

* Remove the old namecache API. vfs_cache_lookup(), cache_lookup(), cache_enter(), namei() and lookup() are all gone. VOP_LOOKUP() and VOP_CACHEDLOOKUP() have been collapsed into a single non-caching VOP_LOOKUP().

* Complete the new VFS CACHE (namecache) API. The new API is able to supply topological guarantees and is able to reserve namespaces, including negative cache spaces (whether the target name exists or not), which the new API uses to reserve namespace for things like NRENAME and NCREATE (and others).

* Complete the new namecache API: VOP_NRESOLVE, NLOOKUPDOTDOT, NCREATE, NMKDIR, NMKNOD, NLINK, NSYMLINK, NWHITEOUT, NRENAME, NRMDIR, NREMOVE. These new calls take (typically locked) namecache pointers rather than combinations of directory vnodes, file vnodes, and name components. The new calls are *MUCH* simpler in concept and implementation. For example, VOP_RENAME() has 8 arguments while VOP_NRENAME() has only 3.

The new namecache API uses the namecache to lock namespaces without having to lock the underlying vnodes. For example, this allows the kernel to reserve the target name of a create function trivially. Namecache records are maintained BY THE KERNEL for both positive and negative hits.

Generally speaking, the kernel layer is now responsible for resolving path elements. NRESOLVE is called when an unresolved namecache record needs to be resolved. Unlike the old VOP_LOOKUP, NRESOLVE is simply responsible for associating a vnode with a namecache record (positive hit) or telling the system that it's a negative hit; it is not responsible for handling symlinks or other special cases, or for doing any of the other path lookup work.

It should be particularly noted that the new namecache topology does not allow disconnected namecache records. In rare cases where a vnode must be converted to a namecache pointer for new API operation via a file handle (i.e. NFS), the cache_fromdvp() function is provided, and a new API VOP, VOP_NLOOKUPDOTDOT(), is provided to allow the namecache to resolve the topology leading up to the requested vnode. These and other topological guarantees greatly reduce the complexity of the new namecache API.

The new namei() is called nlookup(). This function uses a combination of cache_n*() calls, VOP_NRESOLVE(), and standard VOP calls to resolve the supplied path, deal with symlinks, and so forth, in a nice small compact compartmentalized procedure.

* The old VFS code is no longer responsible for maintaining namecache records, a function which was mostly ad hoc cache_purge()s occurring before the VFS actually knows whether an operation will succeed or not. The new VFS code is typically responsible for adjusting the state of locked namecache records passed into it. For example, if NCREATE succeeds it must call cache_setvp() to associate the passed namecache record with the vnode representing the successfully created file. The new requirements are much less complex than the old ones.

* Most VFSs still implement the old API calls, albeit somewhat modified, and in particular the VOP_LOOKUP function is now *MUCH* simpler. However, the kernel now uses the new API calls almost exclusively and relies on compatibility code installed in the default ops (vop_compat_*()) to convert the new calls to the old calls.

* All kernel system calls and related support functions which used to do complex and confusing namei() operations now do far less complex and far less confusing nlookup() operations.

* SPECOPS shortcutting has been implemented. User reads and writes now go directly to supporting functions which talk to the device via fileops rather than having to be routed through VOP_READ or VOP_WRITE, saving significant overhead. Note, however, that these only really affect /dev/null and /dev/zero. Implementing this was fairly easy: we now simply pass an optional struct file pointer to VOP_OPEN() and let spec_open() handle the override.

SPECIAL NOTES: It should be noted that we must still lock a directory vnode LK_EXCLUSIVE before issuing a VOP_LOOKUP(), even for simple lookups, because a number of VFSs (including UFS) store active directory scanning information in the directory vnode. The legacy NAMEI_LOOKUP cases can be changed to use LK_SHARED once these VFS cases are fixed. In particular, we are now organized well enough to actually be able to do record locking within a directory for handling NCREATE, NDELETE, and NRENAME situations, but it hasn't been done yet.

Many thanks to all of the testers and in particular David Rhodus for finding a large number of panics and other issues.
Make fstat() account for pending direct-write data when run on a pipe. Submitted-by: Hiten Pandya <email@example.com> Obtained-from: FreeBSD 1.172 (Mike Silbersack)
Fix SYSCTL description style.
device switch 1/many: Remove d_autoq, add d_clone (where d_autoq was). d_autoq was used to allow the device port dispatch to mix old-style synchronous calls with new style messaging calls within a particular device. It was never used for that purpose. d_clone will be more fully implemented as work continues. We are going to install d_port in the dev_t (struct specinfo) structure itself and d_clone will be needed to allow devices to 'revector' the port on a minor-number by minor-number basis, in particular allowing minor numbers to be directly dispatched to distinct threads. This is something we will be needing later on.
Fix a bug in sys/pipe.c. xio_init_ubuf() might not be able to load up the requested number of bytes even if the request is limited to XIO_INTERNAL_SIZE if the user buffer base is not page-aligned. XIO will set xio_bytes to the actual size of the buffer. Note that this bug was never exercised due to the 64KB pipe kmem buffer size limit, so it could not have been the cause of recent problems. Use kmem_alloc_nofault() instead of kmem_alloc_pageable() for the kmem reservation. This is more correct but should have no actual effect on the system.
Add an assertion to sys_pipe to cover a possible overrun case and reorder the zone cache code in zalloc() to not assign the link pointer until after various sanity checks are performed.
We must pmap_qremove() pages that we previously pmap_qenter()'d before we can safely call kmem_free(). This corrects a serious corruption issue that occurred when using PIPE algorithms other than the default. The default SFBUF algorithm was not affected by this bug.
Commit an update to the pipe code that implements various pipe algorithms. Note that the newer algorithms are either experimental or only exist for testing purposes. The default remains the same (sfbuf mode), which is considered to be stable. The code is just too useful not to commit it. Add pmap_qenter2() for installing cpu-localized KVM mappings. Add pmap_page_assertzero() which will be used in a later diagnostic commit.
Enhance the pmap_kenter*() API and friends, separating out entries which only need invalidation on the local cpu from entries which need invalidation across the entire system, and provide a synchronization abstraction.

Enhance sf_buf_alloc() and friends to allow the caller to specify whether the sf_buf's kernel mapping is going to be used on just the current cpu or whether it needs to be valid across all cpus. This is done by maintaining a cpumask of known-synchronized cpus in the struct sf_buf.

Optimize sf_buf_alloc() and friends by removing both TAILQ operations in the critical path. TAILQ operations to remove the sf_buf from the free queue are now done in a lazy fashion. Most sf_buf operations allocate a buf, work on it, and free it, so why waste time moving the sf_buf off the freelist if we are only going to move it back onto the free list a microsecond later?

Fix a bug in the sf_buf_alloc() code as it was being used by the PIPE code. sf_buf_alloc() was unconditionally using PCATCH in its tsleep() call, which is only correct when called from the sendfile() interface.

Optimize the PIPE code to require only local cpu_invlpg()'s when mapping sf_buf's, greatly reducing the number of IPIs required. On a DELL-2550, a pipe test which explicitly blows out the sf_buf caching by using huge buffers improves from 350 to 550 MBytes/sec. However, note that buildworld times were not found to have changed.

Replace the PIPE code's custom 'struct pipemapping' structure with a struct xio and use the XIO API functions rather than its own.
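The lazy free-queue idea can be modeled in miniature: a freed buf stays on the free queue with its mapping intact, a cache hit simply revives it in place (no queue removal), and only a cache miss pops queued bufs, skipping any that were revived. This is a toy userland sketch under assumed names; the real sf_buf code hashes on the vm_page, uses TAILQs, and tracks a cpumask.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF 4

struct sfbuf {
    int key;                  /* stands in for the mapped vm_page */
    int refs;
    int on_queue;
    struct sfbuf *next;
};

static struct sfbuf bufs[NBUF];
static struct sfbuf *qhead, *qtail;

static void
enqueue(struct sfbuf *b)
{
    b->next = NULL;
    b->on_queue = 1;
    if (qtail)
        qtail->next = b;
    else
        qhead = b;
    qtail = b;
}

static struct sfbuf *
dequeue(void)
{
    struct sfbuf *b = qhead;
    if (b) {
        qhead = b->next;
        if (qhead == NULL)
            qtail = NULL;
        b->on_queue = 0;
    }
    return b;
}

static void
sf_init(void)
{
    qhead = qtail = NULL;
    for (int i = 0; i < NBUF; ++i) {
        bufs[i].key = -1;
        bufs[i].refs = 0;
        enqueue(&bufs[i]);
    }
}

static struct sfbuf *
sf_alloc(int key)
{
    /* Cache hit: revive in place, leaving the buf on the free queue.
     * This is the lazy part -- no queue removal, no remapping. */
    for (int i = 0; i < NBUF; ++i) {
        if (bufs[i].key == key) {
            bufs[i].refs++;
            return &bufs[i];
        }
    }
    /* Miss: pop queued bufs, skipping any revived ones. */
    struct sfbuf *b;
    while ((b = dequeue()) != NULL && b->refs != 0)
        ;
    if (b == NULL)
        return NULL;
    b->key = key;             /* the real code would pmap_qenter() here */
    b->refs = 1;
    return b;
}

static void
sf_free(struct sfbuf *b)
{
    if (--b->refs == 0 && !b->on_queue)
        enqueue(b);           /* keep the mapping for future cache hits */
}
```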
Second major scheduler patch. This corrects interactive issues that were introduced in the pipe sf_buf patch.

Split need_resched() into need_user_resched() and need_lwkt_resched(). Userland reschedules are requested when a process is scheduled with a higher priority than the currently running process, and LWKT reschedules are requested when a thread is scheduled with a higher priority than the currently running thread. As before, these are ASTs; LWKT threads are not preemptively switched while running in the kernel.

Exclusively use the resched-wanted flags to determine whether to reschedule or call lwkt_switch() upon return to user mode. We were previously also testing the LWKT run queue for higher priority threads, but this was causing inefficient scheduler interactions when two processes are doing tightly bound synchronous IPC (e.g. using PIPEs), because in DragonFly the LWKT priority of a thread is raised when it enters the kernel and lowered when it tries to return to userland. The wakeups occurring in the pipe code were causing extra quick-flip thread switches.

Introduce a new tsleep() flag which disables the need_lwkt_resched() call when the sleeping thread is woken up. This is used by the PIPE code in the synchronous direct-write PIPE case to avoid the above problem.

Redocument and revamp the ESTCPU code. The original changes reduced the interrupt rate from 100Hz (FBsd-4 and FBsd-5) to 20Hz, but did not compensate for the slower ramp-up time. This commit introduces a 'virtual' ESTCPU frequency which compensates without us having to bump up the actual systimer interrupt rate.

Redo the P_CURPROC methodology, which is used by the userland scheduler to manage processes running in userland. Create a globaldata->gd_uschedcp process pointer which represents the current running-in-userland (or about to be running in userland) process, and carefully recode acquire_curproc() to allow this gd_uschedcp designation to be stolen from other threads trying to return to userland, without having to request a reschedule (which would have to switch back to those threads to release the designation). This reduces the number of unnecessary context switches that occur due to scheduler interactions. Also note that this specifically solves the case where there might be several threads running in the kernel which are trying to return to userland at the same time. A heuristic check against gd_upri is used to select the correct thread for scheduling to userland 'most of the time'. When the correct thread is not selected, we fall back to the old behavior of forcing a reschedule.

Add debugging sysctl variables to better track userland scheduler efficiency.

With these changes pipe statistics are further improved. Though some scheduling aberrations still exist (1), the previous scheduler had totally broken interactive processes and this one does not.

    Tests on AMD64 3200+ FN85 MB    BEFORE     NEWPIPE      NOW
    (64KB L1, 1MB L2)               MBytes/s   MBytes/s     MBytes/s
    BLKSIZE
    256KB                           1900       2200         2250
    64KB                            1800       2200         2250
    32KB                            -          -            3300
    16KB                            1650       2500-3000    2600-3200
    8KB                             1400       2300         2000-2400 (1)
    4KB                             1300       1400-1500    1500-1700
Import Alan Cox's /usr/src/sys/kern/sys_pipe.c 1.171. This rips out writer-side KVA mappings and replaces them with writer-side vm_page wiring (left intact from before) plus reader-side SF_BUF copies.

Import 1.141, a simple patch which removes a blocking condition when space is available in the pipe's write buffer, which was causing non-blocking select-based writes to spin-wait unnecessarily.

Import FreeBSD-5.x's uiomove_fromphys(), which sys_pipe.c now uses. This procedure could become very useful in a number of DragonFly subsystems.

This greatly improves PIPE performance for the direct-mapped case (moderate to large reads and writes). Additionally, recent scheduler fixes greatly improve PIPE performance for both the direct-mapped and small-buffer cases. NOTE: wired page limits for pipes have not yet been imported, and the heavy use of sf_buf's may require some tuning in the many-pipes case.

    Tests on AMD64/3200+ FN85 MB    BEFORE     AFTER
    (64KB L1, 1MB L2)               MBytes/s   MBytes/s
    BLKSIZE                         -------    ------
    256KB                           1900       2200
    64KB                            1800       2200
    16KB                            1650       2500-3000
    8KB                             1400       2300
    4KB                             1300       1400-1500 (note 1)

note 1: The 4KB case is not a direct-write case; the results are due to the scheduler fixes only.

Obtained-from: FreeBSD-5.x / FreeBSD's Alan Cox
Implement a pipe KVM cache, primarily to reduce unnecessary TLB IPIs between cpus on MP systems due to continuous KVM allocations.
64 bit address space cleanups which are a prerequisite for future 64 bit address space work and PAE. Note: this is not PAE. This patch basically adds vm_paddr_t, which represents a 'physical address'. Physical addresses may be larger than virtual addresses, and on IA32 we make vm_paddr_t a 64 bit quantity. Submitted-by: Hiten Pandya <firstname.lastname@example.org>
Pass only one argument to vm_page_hold() as a sane person would do. Reported by: DragonFly BuildBox
Return a more sane error code, EPIPE. The EBADF error code is misleading, since we have already got this far, and it's not a bad file descriptor. Obtained from: FreeBSD
Use vm_page_hold() instead of vm_page_wire(). Obtained from: FreeBSD
syscall messaging 3: Expand the 'header' that goes in front of the syscall arguments in the kernel copy. The header was previously just an lwkt_msg. The header is now a 'union sysmsg'. 'union sysmsg' contains an lwkt_msg plus space for the additional meta data required to asynchronize various system calls. We haven't actually asynchronized anything yet and will not be able to until the reply port and abort processing infrastructure is in place. See sys/sysmsg.h for more information on the new header. Also cleanup syscall generation somewhat and add some ibcs2 stuff I missed.
fileops messaging stage 1: add port and feature mask to struct fileops and rename fo_ functions to fold.
syscall messaging 2: Change the standard return value storage for system calls from proc->p_retval to the message structure embedded in the syscall. System calls used to set their non-error return value in p_retval but must now set it in the message structure. This is a necessary precursor to any sort of asynchronization, for obvious reasons. This work was particularly annoying because all the emulation code declares and manually fills in syscall argument structures. This commit could potentially destabilize some of the emulation code, but I went through the most important Linux emulation code three times and tested it with linux-mozilla, so I am fairly confident that I got it right. Note: proper linux emulation requires setting the fallback elf brand to 3 or it will default to SVR4. It really ought to default to linux (3), not SVR4. sysctl -w kern.fallback_elf_brand=3
Remove the priority part of the priority|flags argument to tsleep(). Only flags are passed now. The priority was a user scheduler thingy that is not used by the LWKT subsystem. For process statistics assume sleeps without P_SINTR set to be disk-waits, and sleeps with it set to be normal sleeps. This commit should not contain any operational changes.
proc->thread stage 4: rework the VFS and DEVICE subsystems to take thread pointers instead of process pointers as arguments, similar to what FreeBSD-5 did. Note however that ultimately both APIs are going to be message-passing, which means the current thread context will not be usable for creds and descriptor access.
proc->thread stage 2: MAJOR revamping of system calls, ucred, jail API, and some work on the low level device interface (proc arg -> thread arg). As -current did, I have removed p_cred and incorporated its functions into p_ucred. p_prison has also been moved into p_ucred and adjusted accordingly. The jail interface now tests ucreds rather than processes. The syscall(p,uap) interface has been changed to just (uap). This is inclusive of the emulation code. It makes little sense to pass a proc pointer around, which confuses the MP readability of the code, because most system call code will only work with the current process anyway. Note that eventually *ALL* syscall emulation code will be moved to a kernel-protected userland layer because it really makes no sense whatsoever to implement these emulations in the kernel. suser() now takes no arguments and only operates on the current process. The process argument has been removed from suser_xxx() so it now just takes a ucred and flags. The sysctl interface was adjusted somewhat.
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 220.127.116.11