Keyword substitution: kv
Default branch: MAIN
* Implement new system calls in the kernel: statvfs(), fstatvfs(), fhstatvfs().
* Implement a new VFS op, VFS_STATVFS(). Implement a default for this new op for VFSs which do not implement VFS_STATVFS(), which calls VFS_STATFS() and converts the structure (using Joerg's conversion procedure from libc).
* Remove statvfs(), fstatvfs(), and fhstatvfs() from libc. These functions are now system calls.
Bring uuidgen(3) into libc and implement the uuidgen() system call. Obtained-from: FreeBSD / Marcel Moolenaar
Give the device major / minor numbers their own separate 32 bit fields in the kernel. Change dev_ops to use a RB tree to index major device numbers and remove the 256 device major number limitation. Build a dynamic major number assignment feature into dev_ops_add() and adjust ASR (which already had a hand-rolled one) and MFS to use the feature. MFS at least does not require any filesystem visibility to access its backing device. Major device numbers >= 256 are used for dynamic assignment. Retain filesystem compatibility for device numbers that fall within the range that can be represented in UFS or struct stat (which is a single 32 bit field supporting 8 bit major numbers and 24 bit minor numbers).
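The compatibility rule described above can be sketched in C. This is only an illustration: the function names are hypothetical, and the packing shown (major in the top 8 bits, minor in the low 24) is an assumed layout chosen for clarity, not necessarily the bit layout used by UFS or struct stat.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Limits taken from the commit text: the legacy single 32-bit field
 * holds an 8-bit major number and a 24-bit minor number. */
#define LEGACY_MAJOR_BITS   8
#define LEGACY_MINOR_BITS   24
#define DYNAMIC_MAJOR_BASE  256u    /* majors >= 256 are dynamically assigned */

/* True if the major number falls in the dynamically assigned range. */
static bool
is_dynamic_major(uint32_t major)
{
    return major >= DYNAMIC_MAJOR_BASE;
}

/* True if (major, minor) can be stored in the legacy single field. */
static bool
legacy_representable(uint32_t major, uint32_t minor)
{
    return major < (1u << LEGACY_MAJOR_BITS) &&
           minor < (1u << LEGACY_MINOR_BITS);
}

/* ASSUMED layout for illustration: major in the top 8 bits. */
static uint32_t
legacy_pack(uint32_t major, uint32_t minor)
{
    assert(legacy_representable(major, minor));
    return (major << LEGACY_MINOR_BITS) | minor;
}

static uint32_t
legacy_major(uint32_t packed)
{
    return packed >> LEGACY_MINOR_BITS;
}

static uint32_t
legacy_minor(uint32_t packed)
{
    return packed & ((1u << LEGACY_MINOR_BITS) - 1);
}
```

With separate 32 bit fields in the kernel, a dynamically assigned major simply fails the representability check and is never packed into the legacy field.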
1:1 Userland threading stage 4.8/4: Add syscalls lwp_gettid() and lwp_kill().
Major namecache work primarily to support NULLFS.
* Move the nc_mount field out of the namecache{} record and use a new
namecache handle structure called nchandle { mount, ncp } for all
API accesses to the namecache.
* Remove all mount point linkages from the namecache topology. Each mount
now has its own namecache topology rooted at the root of the mount point.
Mount points are flagged in their underlying filesystem's namecache
topology but instead of linking the mount into the topology, the flag
simply triggers a mountlist scan to locate the mount. ".." is handled
the same way... when the root of a topology is encountered the scan
can traverse to the underlying filesystem via a field stored in the
mount structure.
* Ref the mount structure based on the number of nchandle structures
referencing it, and do not kfree() the mount structure during a forced
unmount if refs remain.
These changes have the following effects:
* Traversal across mount points no longer requires locking of any sort,
preventing process blockages occurring in one mount from leaking across
a mount point to another mount.
* Aliased namespaces such as those created by NULLFS no longer duplicate the
namecache topology of the underlying filesystem. Instead, a NULLFS
mount simply shares the underlying topology (differentiating between
it and the underlying topology by the fact that the name cache
handles { mount, ncp } contain NULLFS's mount pointer).
This saves an immense amount of memory and allows NULLFS to be used
heavily within a system without creating any adverse impact on kernel
memory or performance.
* Since the namecache topology for a NULLFS mount is shared with the
underlying mount, the namecache records are in fact the same records
and thus full coherency between the NULLFS mount and the underlying
filesystem is maintained by design.
* Future efforts, such as a unionfs or shadow fs implementation, now
have a mount structure to work with. The new API is a lot more
flexible than the old one.
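A minimal sketch of the handle idea, assuming simplified stand-in structures: only the pairing { mount, ncp } and the per-handle mount reference come from the description above. The helper names and the fields inside struct mount and struct namecache are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel structures; only the fields needed here. */
struct mount     { const char *name; int refs; };
struct namecache { const char *name; };

/* A handle pairs a namecache record with the mount it is viewed through. */
struct nchandle {
    struct mount     *mount;
    struct namecache *ncp;
};

/* Hypothetical helper: build a handle and take a reference on the mount. */
static struct nchandle
nch_make(struct mount *mp, struct namecache *ncp)
{
    struct nchandle nch = { mp, ncp };

    assert(mp != NULL && ncp != NULL);
    mp->refs++;
    return nch;
}

/* Hypothetical helper: release the mount reference held by a handle. */
static void
nch_drop(struct nchandle *nch)
{
    nch->mount->refs--;
    nch->mount = NULL;
    nch->ncp = NULL;
}

/* Two handles share a namecache record when their ncp pointers match. */
static bool
nch_same_record(const struct nchandle *a, const struct nchandle *b)
{
    return a->ncp == b->ncp;
}

/* Two handles are the same name only if the mount matches as well. */
static bool
nch_equal(const struct nchandle *a, const struct nchandle *b)
{
    return a->mount == b->mount && a->ncp == b->ncp;
}
```

A NULLFS handle and a handle on the underlying filesystem would compare true under nch_same_record() (shared topology, hence coherency) and false under nch_equal() (different mount pointer).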
Make some adjustments to low level madvise/mcontrol/mmap support code to accommodate vmspace_*() calls. Reformulate the new vmspace_*() calls so they operate similarly to the MAP_VPAGETABLE and mcontrol() calls. This also makes vmspaces more 'programmable' in the sense that it will be possible to mix virtual pagetable mmap()ings with other mmap()ings in a vmspace. Fill in the code for all the new vmspace_*() calls except for vmspace_ctl(). NOTE: vmspace calls are effectively disabled unless vm.vkernel_enable is turned on, just like MAP_VPAGETABLE. Renumber the new mcontrol() and vmspace_*() calls and regenerate.
Add two more system calls, __accept and __connect. The old accept() and connect() are still present but will eventually be replaced with a libc wrapper. The new system calls add a flags argument, allowing O_FBLOCKING or O_FNONBLOCKING to be passed to override the non-blocking setting in the file pointer. They are intended to be used by libc_r.
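The override rule can be sketched as a small decision function. The flag names O_FBLOCKING and O_FNONBLOCKING come from the text; the bit values below are placeholders, and treating "both flags set" as a caller error is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values; the real constants live in the kernel headers. */
#define X_FNONBLOCK       0x01  /* stand-in for the file pointer's non-blocking flag */
#define X_O_FBLOCKING     0x02  /* stand-in for O_FBLOCKING */
#define X_O_FNONBLOCKING  0x04  /* stand-in for O_FNONBLOCKING */

/*
 * Decide whether a single call should behave as non-blocking.
 * A per-call flag, when present, overrides the setting stored in
 * the file pointer; otherwise the file pointer's setting is used.
 */
static bool
effective_nonblocking(int file_flags, int call_flags)
{
    int both = X_O_FBLOCKING | X_O_FNONBLOCKING;

    /* ASSUMPTION: requesting both overrides at once is a caller bug. */
    assert((call_flags & both) != both);

    if (call_flags & X_O_FBLOCKING)
        return false;
    if (call_flags & X_O_FNONBLOCKING)
        return true;
    return (file_flags & X_FNONBLOCK) != 0;
}
```

This is the property a threading library needs: it can force one accept or connect to be non-blocking without changing the descriptor's shared state.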
Move all the resource limit handling code into a new file, kern/kern_plimit.c. Add spinlocks for access, and mark getrlimit and setrlimit as being MPSAFE. Document how LWPs will have to be handled - basically we will have to unshare the resource structure once we start allowing multiple LWPs per process, but we can otherwise leave it in the proc structure.
The thread/proc pointer argument in the VFS subsystem originally existed for... well, I'm not sure *WHY* it originally existed when most of the time the pointer couldn't be anything other than curthread or curproc or the code wouldn't work. This is particularly true of lockmgr locks. Remove the pointer argument from all VOP_*() functions, all fileops functions, and most ioctl functions.
Add the preadv() and pwritev() system calls and regenerate. Submitted-by: Chuck Tuffli <ctuffli@gmail.com> Loosely-based-on: FreeBSD
Pass the direction to kern_getdirentries, it will be used by the emulation layer soon without transferring the data to userland first.
Add closefrom(2) syscall. It closes all file descriptors greater than or equal to the given descriptor. This function exits and returns EINTR when necessary. Other errors from close() are ignored.
Uncomment the entry for kern_chroot in kern_syscall.h and change the implementation to take the namecache entry directly.
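A userland sketch of the described semantics, using only POSIX calls. The function name and the 4096 upper bound are assumptions made to keep the example small. A kernel implementation would walk the descriptor table instead of probing every number.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Close every descriptor >= lowfd.  Returns 0 on success, or -1 with
 * errno set to EINTR if a close() was interrupted.  Any other close()
 * error (typically EBADF for an unused slot) is ignored.
 */
static int
closefrom_sketch(int lowfd)
{
    long maxfd = sysconf(_SC_OPEN_MAX);
    long fd;

    assert(lowfd >= 0);
    /* ASSUMPTION: cap the scan so the sketch stays bounded. */
    if (maxfd < 0 || maxfd > 4096)
        maxfd = 4096;

    for (fd = lowfd; fd < maxfd; fd++) {
        if (close((int)fd) == -1 && errno == EINTR)
            return -1;      /* errno is already EINTR */
    }
    return 0;
}
```

A typical use is in a child process just before exec, for example closefrom_sketch(3) to keep only stdin, stdout, and stderr.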
Journaling layer work. Lock down the journaling data format and most of the record-building API. The journaling data format consists of two layers: a logical stream abstraction layer and a recursive subrecord layer. The memory FIFO and worker thread only deal with the logical stream abstraction layer. Subrecord data is broken down into logical stream records which the worker thread then writes out to the journal. Space for a logical stream record is 'reserved' and then filled in by the journaling operation. Other threads can reserve their own space in the memory FIFO, even if earlier reservations have not yet been committed. The worker thread will only write out completed records and it currently does so in sequential order, so the worker thread itself may stall temporarily if the next reservation in the FIFO has not yet been completed. (This will probably have to be changed in the future but for now it's the easiest solution, allowing for some parallelism without creating too big a mess.) Each logical stream is a (typically) short-lived entity, usually encompassing a single VFS operation, but may be made up of multiple stream records. The stream records contain a stream id and bits specifying whether the record is beginning a new logical stream, in the middle somewhere, or ending a logical stream. Small transactions may be able to fit in a single record, in which case multiple bits may be set. Very large transactions, for example when someone does a write(... 10MB), are fully supported and would likely generate a large number of stream records. Such transactions would not necessarily stall other operations from other processes, however, since they would be broken up into smaller pieces for output to the journal. The stream layer serves to multiplex individual logical streams onto the memory FIFO and out the journaling stream descriptor.
The recursive subrecord layer identifies the transaction as well as any other necessary data, including UNDO data if the journal is reversible. A single transaction may contain several sub-records identifying the bits making up the transaction (for example, a 'mkdir' transaction would need a subrecord identifying the userid, groupid, file modes, and path). The record formats also allow for transactional aborts, even if some of the data has already been pushed out to the descriptor due to limited buffer space. And, finally, while the subrecord's header format includes a record size field, this value may not be known for subrecords representing recursive 'pushes' since the header may be flushed out to the journal long before the record is completed. This case is also fully supported. NOTE: The memory FIFO used to ship data to the worker thread is serialized by the BGL for the moment, but will eventually be made per-cpu to support lockless operation under SMP.
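The begin/middle/end marking of stream records can be sketched as follows. The structure, flag names, and flag values are hypothetical. The sketch also assumes a 'middle' record is one with neither bit set, which the text does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical control bits; the real format may encode these differently. */
#define STREAM_BEGIN  0x01
#define STREAM_END    0x02

/* One logical stream record: which stream, where it sits, what it covers. */
struct stream_rec {
    uint16_t stream_id;
    uint8_t  flags;
    size_t   off;   /* offset of this piece within the transaction */
    size_t   len;   /* number of payload bytes in this piece */
};

/*
 * Break a transaction of 'total' bytes into records carrying at most
 * 'max_payload' bytes each.  Returns the number of records written to
 * 'out', or 0 if 'cap' records are not enough.
 */
static size_t
stream_split(uint16_t id, size_t total, size_t max_payload,
             struct stream_rec *out, size_t cap)
{
    size_t n = 0, off = 0;

    assert(max_payload > 0);
    do {
        size_t chunk = total - off;

        if (chunk > max_payload)
            chunk = max_payload;
        if (n == cap)
            return 0;
        out[n].stream_id = id;
        out[n].flags = 0;
        if (off == 0)
            out[n].flags |= STREAM_BEGIN;   /* first piece */
        if (off + chunk == total)
            out[n].flags |= STREAM_END;     /* last piece */
        out[n].off = off;
        out[n].len = chunk;
        off += chunk;
        n++;
    } while (off < total);
    return n;
}
```

A small transaction yields a single record with both bits set. A large write yields many records that can be interleaved in the FIFO with records from other stream ids.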
Improve separation between kernel code and userland code by requiring that source files that #include kernel headers also #define _KERNEL or _KERNEL_STRUCTURES as appropriate. With-suggestions-from: joerg Still-todo: systimer.h, and clean up scsi_da.c Tested-by: cd /usr/src/nrelease && make installer_release Approved-by: dillon If-anything-breaks-yell-at: cpressey
Journaling layer work.
* Adjust the new mountctl syscall to make the passed file descriptor an explicit argument rather than storing the fd in the control structure. Convert the fd to a file pointer to make kern_mountctl() callable from a pure thread.
* Get rid of vop_stdmountctl and just have the VOP default ops call journal_mountctl(), which makes things less confusing.
* Get more of the journaling infrastructure working. Basic installation and removal of the journaling structure and the creation and destruction of the worker thread and stream file pointer now work (with lots of XXX's).
* Add a journaling vector for VOP_NMKDIR to test the journaling VOP ops shim.
Journaling layer work. Add a new system call, mountctl, which will be used to manage the journaling layer. Add a new VOP, VOP_MOUNTCTL, which will be used to pass mountctl operations down into the VFS layer.
Lots of bug fixes to the checkpointing code. The big fix is that you can now checkpoint a program that you have checkpoint-restored. i.e. you run program X, you checkpoint it, you checkpoint-restore X from the checkpoint, and then you checkpoint it again. The issue here is that when a checkpointed program is restored, the checkpoint file is used to map portions of the image of the restored program. If you then tried to checkpoint the restored image, the system would overwrite or destroy the original checkpoint file and the new checkpoint file would have references to the old (now nonexistent) file. Any attempt to restore the recursed checkpoint would result in a seg-fault. That is now fixed.
* Remove the previous checkpoint file before saving the new one. If the program we are checkpointing happens to be a checkpoint restore from the same file then overwriting the file would wind up corrupting the image set we are trying to save.
* When checkpointing a program that has been checkpoint-restored do not attempt to save the file handles for the vnode representing the checkpoint-restored program's own checkpoint file (which is a good chunk of its backing store), because this vnode is likely to be destroyed the moment we close the handle, since we are likely replacing the previous checkpoint file. Instead, the backing store representing the old checkpoint file is copied to the new one.
* Re-checkpointing a program (hitting ^E multiple times) now properly replaces the checkpoint file.
* Properly close any file descriptors from the checkpt(1) program itself when restoring a checkpointed program, and properly replace any file descriptors that need replacing.
* Properly replace p_comm[] when restoring a checkpoint file, so checkpointing again saves under the same program name. 'ps' output is still wrong, though.
TODO LIST:
* Add an iterator to the checkpoint file, accessible via kern.ckptfile, so successive checkpoints save to a blah.ckpt.1, blah.ckpt.2, etc, rather than always overwriting blah.ckpt (the iterator could be saved in the proc structure).
* Add back as a 'feature' the ability for the new checkpoint file to reference the old one. That is, each new checkpoint file would represent a delta relative to the old one. This might be useful when checkpointing programs with ever growing data sets so as not to have to copy the entire contents of the program to the checkpoint file each time you want to make a new checkpoint. It would be hell on the VM system, but it would work.
* Add an option to checkpt(1) so you can checkpoint-restore-enter-gdb all in one go, to be able to debug a checkpointed file more easily.
Inspired by: Brook Davis's HPC presentation. He expressed an interest in possibly porting the checkpoint code so I figure I ought to fix it up.
VFS messaging/interfacing work stage 9/99: VFS 'NEW' API WORK. NOTE: unionfs and nullfs are temporarily broken by this commit.
* Remove the old namecache API. vfs_cache_lookup(), cache_lookup(), cache_enter(), namei() and lookup() are all gone. VOP_LOOKUP() and VOP_CACHEDLOOKUP() have been collapsed into a single non-caching VOP_LOOKUP().
* Complete the new VFS CACHE (namecache) API. The new API is able to supply topological guarantees and is able to reserve namespaces, including negative cache spaces (whether the target name exists or not), which the new API uses to reserve namespace for things like NRENAME and NCREATE (and others).
* Complete the new namecache API. VOP_NRESOLVE, NLOOKUPDOTDOT, NCREATE, NMKDIR, NMKNOD, NLINK, NSYMLINK, NWHITEOUT, NRENAME, NRMDIR, NREMOVE. These new calls take (typically locked) namecache pointers rather than combinations of directory vnodes, file vnodes, and name components. The new calls are *MUCH* simpler in concept and implementation. For example, VOP_RENAME() has 8 arguments while VOP_NRENAME() has only 3 arguments. The new namecache API uses the namecache to lock namespaces without having to lock the underlying vnodes. For example, this allows the kernel to reserve the target name of a create function trivially. Namecache records are maintained BY THE KERNEL for both positive and negative hits. Generally speaking, the kernel layer is now responsible for resolving path elements. NRESOLVE is called when an unresolved namecache record needs to be resolved. Unlike the old VOP_LOOKUP, NRESOLVE is simply responsible for associating a vnode to a namecache record (positive hit) or telling the system that it's a negative hit, and is not responsible for handling symlinks or other special cases or doing any of the other path lookup work. It should be particularly noted that the new namecache topology does not allow disconnected namecache records.
In rare cases where a vnode must be converted to a namecache pointer for new API operation via a file handle (i.e. NFS), the cache_fromdvp() function is provided and a new API VOP, VOP_NLOOKUPDOTDOT(), is provided to allow the namecache to resolve the topology leading up to the requested vnode. These and other topological guarantees greatly reduce the complexity of the new namecache API. The new namei() is called nlookup(). This function uses a combination of cache_n*() calls, VOP_NRESOLVE(), and standard VOP calls to resolve the supplied path, deal with symlinks, and so forth, in a nice small compact compartmentalized procedure.
* The old VFS code is no longer responsible for maintaining namecache records, a function which was mostly ad hoc cache_purge()s occurring before the VFS actually knows whether an operation will succeed or not. The new VFS code is typically responsible for adjusting the state of locked namecache records passed into it. For example, if NCREATE succeeds it must call cache_setvp() to associate the passed namecache record with the vnode representing the successfully created file. The new requirements are much less complex than the old requirements.
* Most VFSs still implement the old API calls, albeit somewhat modified, and in particular the VOP_LOOKUP function is now *MUCH* simpler. However, the kernel now uses the new API calls almost exclusively and relies on compatibility code installed in the default ops (vop_compat_*()) to convert the new calls to the old calls.
* All kernel system calls and related support functions which used to do complex and confusing namei() operations now do far less complex and far less confusing nlookup() operations.
* SPECOPS shortcutting has been implemented. User reads and writes now go directly to supporting functions which talk to the device via fileops rather than having to be routed through VOP_READ or VOP_WRITE, saving significant overhead. Note, however, that these only really affect /dev/null and /dev/zero.
Implementing this was fairly easy: we now simply pass an optional struct file pointer to VOP_OPEN() and let spec_open() handle the override.
SPECIAL NOTES: It should be noted that we must still lock a directory vnode LK_EXCLUSIVE before issuing a VOP_LOOKUP(), even for simple lookups, because a number of VFSs (including UFS) store active directory scanning information in the directory vnode. The legacy NAMEI_LOOKUP cases can be changed to use LK_SHARED once these VFS cases are fixed. In particular, we are now organized well enough to actually be able to do record locking within a directory for handling NCREATE, NDELETE, and NRENAME situations, but it hasn't been done yet.
Many thanks to all of the testers and in particular David Rhodus for finding a large number of panics and other issues.
VFS messaging/interfacing work stage 7/99. BEGIN DESTABILIZATION!
Implement the infrastructure required to allow us to begin switching to the new nlookup() VFS API.
filedesc->fd_ncdir, fd_nrdir, fd_njdir
    File descriptors (associated with processes) now record the namecache pointer related to the current directory, root directory, and jail directory, in addition to the vnode pointers. These pointers are used as the basis for the new path lookup code (nlookup() and friends).
file->f_ncp
    File pointers may now have a referenced+unlocked namecache pointer associated with them. All fp's representing directories have this attached. This allows fchdir() to properly record the ncp in fdp->fd_ncdir and friends.
mount->mnt_ncp
    The namecache topology for crossing a mount point works as follows: when looking up a path element which is a mount point, cache_nlookup() will locate the ncp for the vnode-under the mount point. mount->mnt_ncp represents the root of the mount, that is the vnode-over. nlookup() detects the mount point and accesses mount->mnt_ncp to skip past the vnode-under. When going backwards (..), nlookup() detects the case and skips backwards. The ncp linkages are: ncp->ncp->ncp[vnode_under]->ncp[vnode_over]. That is, when going forwards or backwards nlookup must explicitly skip over the double-ncp when crossing a mount point. This allows us to keep the namecache topology intact across mount points.
NEW CACHE level API functions:
cache_get()
    Reference and lock a namecache entry.
cache_put()
    Dereference and unlock a namecache entry.
cache_lock()
    Lock an already-referenced namecache entry.
cache_unlock()
    Unlock a locked namecache entry.
NOTE: namecache locks are exclusive and recursive. These are the 'namespace' locks that we will be using to guarantee namespace operations such as in a CREATE, RENAME, or REMOVE.
vfs_cache_setroot()
    Set the new system-wide root directory.
cache_allocroot()
    System bootstrap helper function to allocate the root namecache node.
cache_resolve()
    Resolve a NCF_UNRESOLVED namecache node. The namecache node should be locked on call.
cache_setvp()
    (resolver) Associate a VP or create a negative cache entry representation for a namecache pointer and clear NCF_UNRESOLVED. The namecache node should be locked on call.
cache_setunresolved()
    Revert a resolved namecache entry back to an unresolved state, disassociating any vnode but leaving the topology intact. The namecache node should be locked on call.
cache_vget()
    Obtain the locked+refd vnode related to a namecache entry, resolving the entry if necessary. Return ENOENT if the entry represents a negative cache hit.
cache_vref()
    Obtain a refd (not locked) vnode related to a namecache entry, as above.
cache_nlookup()
    The new namecache lookup routine. This routine does a lookup and allocates a new namecache node (into an unresolved state) if necessary. Returns a namecache record whether or not the item can be found and whether or not it represents a positive or negative hit.
cache_lookup()
    OLD API CODE DEPRECATED, but must be maintained until everything has been converted over.
cache_enter()
    OLD API CODE DEPRECATED, but must be maintained until everything has been converted over.
NEW default VOPs:
vop_noresolve()
    Implements a namecache resolver for VFSs which are still using the old VOP_LOOKUP/VOP_CACHEDLOOKUP API (which is all of them still).
VOP_LOOKUP
    OLD API CODE DEPRECATED, but must be maintained until everything has been converted over.
VOP_CACHEDLOOKUP
    OLD API CODE DEPRECATED, but must be maintained until everything has been converted over.
NEW PATHNAME LOOKUP CODE:
nlookup_init()
    Similar to NDINIT, initialize a nlookupdata structure for nlookup() and nlookup_done().
nlookup()
    Lookup a path. Unlike the old namei/lookup code the new lookup code does not do any fancy pre-disposition of the cache for create/delete, it simply looks up the requested path and returns the appropriate locked namecache pointer. The caller can obtain the vnode and directory vnode, as applicable, from the one namecache structure that is returned. Access checks are done on directories leading up to the result but not done on the returned namecache node.
nlookup_done()
    Mandatory routine to clean up a nlookupdata structure after it has been initialized and all operations have been completed on it.
nlookup_simple()
    (in progress) All-in-one wrapped new lookup.
nlookup_mp()
    Helper call for resolving a mount point's glue NCP. Hackish, will be cleaned up later.
nreadsymlink()
    Helper call to resolve a symlink. Note that the namecache does not yet cache symlink data but the intention is to eventually do so to avoid having to do VFS ops to get the data.
naccess()
    Perform access checks on a namecache node given a mode and cred.
naccess_va()
    Perform access checks on a vattr given a mode and cred.
Begin switching VFS operations from using namei to using nlookup. In this batch:
* mount (install mnt_ncp for cross-mount-point handling in nlookup, simplify the vfs_mount() API to no longer pass a nameidata structure)
* [l]stat (use nlookup)
* [f]chdir (use nlookup, use recorded f_ncp)
* [f]chroot (use nlookup, use recorded f_ncp)
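The resolved / unresolved / negative-hit states described for cache_setvp(), cache_setunresolved(), and cache_vget() can be modelled with a toy record. This is not the kernel code: locking and reference counting are omitted, the structure fields and the NCF_UNRESOLVED value are placeholders, and the resolver callback merely stands in for what a VFS resolver would do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NCF_UNRESOLVED 0x01     /* flag name from the text; value is a placeholder */

struct vnode     { int id; };
struct namecache { int nc_flag; struct vnode *nc_vp; };

/* Stand-in for a VFS resolver: returns a vnode, or NULL for "no such name". */
typedef struct vnode *(*toy_resolver_t)(struct namecache *);

/* Associate a vnode (positive hit) or NULL (negative hit); clear UNRESOLVED. */
static void
toy_cache_setvp(struct namecache *ncp, struct vnode *vp)
{
    assert(ncp->nc_flag & NCF_UNRESOLVED);
    ncp->nc_vp = vp;
    ncp->nc_flag &= ~NCF_UNRESOLVED;
}

/* Drop the vnode association but keep the record itself in place. */
static void
toy_cache_setunresolved(struct namecache *ncp)
{
    ncp->nc_vp = NULL;
    ncp->nc_flag |= NCF_UNRESOLVED;
}

/* Resolve on demand; a resolved record with no vnode is a negative hit. */
static int
toy_cache_vget(struct namecache *ncp, toy_resolver_t resolve,
               struct vnode **vpp)
{
    if (ncp->nc_flag & NCF_UNRESOLVED)
        toy_cache_setvp(ncp, resolve(ncp));
    if (ncp->nc_vp == NULL)
        return ENOENT;
    *vpp = ncp->nc_vp;
    return 0;
}

/* Two sample resolvers used to exercise the model. */
static struct vnode toy_vnode = { 42 };

static struct vnode *
resolve_found(struct namecache *ncp)
{
    (void)ncp;
    return &toy_vnode;
}

static struct vnode *
resolve_missing(struct namecache *ncp)
{
    (void)ncp;
    return NULL;
}
```

The point of the model is that a negative hit is still a resolved record, so repeated lookups of a missing name do not have to call back into the filesystem.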
Rearrange the kern_getcwd() procedure to return the base of the string rather than relocating the string. Also fix two bugs: (1) the original bcopy was copying data beyond the end of the buffer ([bp, bp+buflen] exceeds the buffer), and (2) the uap->buflen checks must be made in __getcwd(), before the kernel tries to malloc() space.
Split the __getcwd syscall into a kernel and a userland part, so it can be used in the kernel as well. Pointers by: Matthew Dillon <dillon@apollo.backplane.com>
Change sendfile() to send the header out coalesced with the data. Inspired by Mike Silbersack's FreeBSD rev 1.171 to uipc_syscalls.c.
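The "return the base of the string" approach can be sketched in userland: the path is assembled from the leaf upward at the end of the buffer, and the caller receives a pointer to where the string starts instead of having it moved to the front. The function name and the leaf-first component array are assumptions of this sketch. The kernel walks the namecache rather than an array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Build "/a/b/c" from components given leaf-first ("c", "b", "a"),
 * filling buf from its end.  Returns a pointer to the first character
 * of the finished string inside buf, or NULL if it does not fit.
 */
static char *
build_path_backwards(char *buf, size_t buflen,
                     const char *const *leaf_first, size_t ncomp)
{
    size_t pos, i;

    assert(buf != NULL);
    if (buflen == 0)
        return NULL;
    pos = buflen - 1;
    buf[pos] = '\0';

    for (i = 0; i < ncomp; i++) {
        size_t len = strlen(leaf_first[i]);

        if (pos < len + 1)          /* need room for the name plus '/' */
            return NULL;
        pos -= len;
        memcpy(buf + pos, leaf_first[i], len);
        buf[--pos] = '/';
    }
    if (ncomp == 0) {               /* the root directory itself */
        if (pos < 1)
            return NULL;
        buf[--pos] = '/';
    }
    return buf + pos;
}
```

Because every write is checked against pos before it happens, nothing is copied outside the buffer, and the only copy left to the caller is the final one out to its destination.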
Separate chroot() into kern_chroot(). Rename change_dir() to checkvp_chdir() and reorganize the code to avoid doing weird things to the passed vnode's lock and ref count in deep subroutines (which lead to buggy code). Fix a bug in chdir()/kern_chdir() (the namei data was not being freed in all cases), and also fix a bug in symlink() (missing zfree in error case). Submitted-by: Paul Herman <pherman@frenchfries.net> Additional-work-by: dillon
Split mmap(). Move ovadvise(), ogetpagesize() and ommap() to new file 43bsd/43bsd_vm.c. http://gomerbud.com/daver/patches/dragonfly/syscall-separation-15.diff
Split mkfifo().
Trash the CHECKALT{CREAT,EXIST} macros and friends. Implement
linux_copyin_path() and linux_free_path() for path translation without
using the stackgap.
Use the above and recently split syscalls to remove stackgap allocations
from linux_creat(), linux_open(), linux_lseek(), linux_llseek(),
linux_access(), linux_unlink(), linux_chdir(), linux_chmod(),
linux_mkdir(), linux_rmdir(), linux_rename(), linux_symlink(),
linux_readlink(), linux_truncate(), linux_link(), linux_chown(),
linux_lchown(), linux_uselib(), linux_utime(), linux_mknod(),
linux_newstat(), linux_newlstat(), linux_statfs(), linux_stat64(),
linux_lstat64(), linux_chown16(), linux_lchown16(), linux_execve().
Use split syscalls to reimplement linux_fstatfs().
Implement linux_translate_path() for use in exec_linux_imgact_try().
Split execve(). This required some interesting changes to the shell image activation code and the image_params structure. Userland pointers are no longer passed in the image_params structure. The exec_copyin_args() function now pulls the arguments, environment and filename of the target being execve()'d into a kernel space buffer before calling kern_execve(). The exec_shell_imgact() function does some magic to prepend the interpreter arguments.
The last major syscall separation commit completely broke our lseek() as well as the linux emulated lseek(). It's sheer luck that the system works at all :-). Fix lseek's 64 bit return value.
Split wait4(), setrlimit(), getrlimit(), statfs(), fstatfs(), chdir(),
open(), mknod(), link(), symlink(), unlink(), lseek(), access(), stat(),
lstat(), readlink(), chmod(), chown(), lchown(), utimes(), lutimes(),
futimes(), truncate(), rename(), mkdir(), rmdir(), getdirentries(),
getdents().
Trash the 4.3BSD numeric filesystem type support in mount().
Move ocreat(), olseek(), otruncate(), ostat(), olstat(), owait(),
ogetrlimit(), and osetrlimit() to the 43bsd subtree and reimplement
using split syscalls. Move ogetdirentries() to the subtree without
change because it is such a mess.
Convince linux_waitpid(), linux_wait(), linux_setrlimit(),
linux_old_getrlimit(), and linux_getrlimit() to use split syscalls.
The file kern/vfs_syscalls.c is now completely free of COMPAT_43 code.
I believe that execve() is the only pending split before I can tackle
stackgap usage in the linux emulator's CHECKALT{EXIST,CREAT}() macros.
Remove the FreeBSD 3.x signal code. This includes osendsig(), osigreturn() and a couple of structures that these syscalls depended on. Split the sigaction(), sigprocmask(), sigpending(), sigsuspend(), sigaltstack() and kill() syscalls. Move the 4.3BSD signal syscalls osigvec(), osigblock(), osigsetmask(), osigstack() and okillpg() to the 43bsd subtree. I'm not too sure if these will even work with the FreeBSD-4 signal trampoline code, but they do compile and link. Implement linux_signal(), linux_rt_sigaction(), linux_sigprocmask(), linux_rt_sigprocmask(), linux_sigpending(), linux_kill(), linux_sigaction(), linux_sigsuspend(), linux_rt_sigsuspend(), linux_pause(), and linux_sigaltstack() with the new in-kernel syscalls. This patch kills 7 stackgap allocations in the Linuxolator.
Create the kern_fstat() and kern_ftruncate() in-kernel syscalls. Implement fstat(), nfstat() and ftruncate() using the in-kernel syscalls. Move ofstat() and oftruncate() to the 43bsd emulation tree and implement with in-kernel syscalls. Create the linux_ftruncate() syscall in the linux emulation layer. This replaces a direct use of oftruncate() in the linux syscall map. Rewrite linux_newfstat() and linux_fstat64() with the in-kernel syscalls.
Create kern_readv() and kern_writev() and use them to split read(), pread(), readv(), write(), pwrite(), and writev(). Also, rewrite linux_pread() and linux_pwrite() using the in-kernel syscalls.
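The idea of implementing the single-buffer calls on top of the vectored ones can be shown in userland with the standard readv() and writev() functions. The wrapper names are hypothetical, and the in-kernel functions take different arguments than these POSIX calls.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* read() expressed as a one-element readv(). */
static ssize_t
read_via_readv(int fd, void *buf, size_t nbyte)
{
    struct iovec iov;

    assert(buf != NULL || nbyte == 0);
    iov.iov_base = buf;
    iov.iov_len = nbyte;
    return readv(fd, &iov, 1);
}

/* write() expressed as a one-element writev(). */
static ssize_t
write_via_writev(int fd, const void *buf, size_t nbyte)
{
    struct iovec iov;

    iov.iov_base = (void *)buf;     /* struct iovec is not const-qualified */
    iov.iov_len = nbyte;
    return writev(fd, &iov, 1);
}
```

With this shape only the vectored path needs to be maintained. The scalar entry points reduce to building a single iovec.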
Rename do_dup() to kern_dup() and pull in some changes from FreeBSD-CURRENT. Implement dup(), dup2() and fcntl(F_DUPFD) with kern_dup(). Split fcntl() into fcntl() and kern_fcntl(). Implement linux_fcntl() using kern_fcntl() and replace a call to fcntl() in linux_accept() with a call to kern_fcntl().
Introduce the function iovec_copyin() and its friend iovec_free(). These remove a great deal of duplicate code in the syscall functions. For those who like numbers, this patch uses iovec_copyin() four times in uipc_syscalls.c, two times in linux_socket.c and two times in 43bsd_socket.c. Would somebody please comment on the inclusion of sys/malloc.h in sys/uio.h? Remove sockargs() which was used once in the svr4 emulation code. It is replaced with a small piece of code that gets an mbuf and copyin()'s to its data region. Remove the osendfile() syscall which was inappropriately named and placed in the COMPAT_43 code where it doesn't belong. Split the socket(), shutdown() and sendfile() syscalls. All of the syscalls in kern/uipc_syscalls.c are now split. Prevent a panic due to m_freem()'ing a dangling pointer in recvmsg(), orecvmsg(), linux_recvmsg(). This patch completely removes COMPAT_43 from kern/uipc_syscalls.c.
Modify kern_{send,recv}msg() to take struct uio's, not struct msghdr's.
Fix up all syscalls which use these functions, including the 43bsd
syscalls.
Also, fix some spots where I forgot to pass the message flags in the
emulation code.
Split getsockopt() and setsockopt().
Separate all of the send{to,msg} and recv{from,msg} syscalls and create
kern_sendmsg() and kern_recvmsg(). The functions sendit() and recvit()
are now no more.
Rewrite the related legacy syscalls and place them in the 43bsd emulation
directory. Also move the definition of omsghdr to the emulation directory.
Change the split syscall naming convention from syscall1() to kern_syscall() while moving the prototypes from sys/syscall1.h to sys/kern_syscall.h. Split the listen(), getsockname(), getpeername(), and socketpair() syscalls.
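The split pattern behind the kern_syscall() naming can be sketched as follows. Everything here is illustrative: the argument structures, the toy socket table, and the "_sketch" function names are hypothetical and only show the shape of the split, in which a thin entry point unpacks its own argument layout and a shared core takes plain kernel-space arguments.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy state so the sketch is runnable: one backlog slot per "socket". */
#define NSOCK 4
static int sock_backlog[NSOCK] = { -1, -1, -1, -1 };

/* Core: plain arguments, no knowledge of how the caller passed them. */
static int
kern_listen_sketch(int s, int backlog)
{
    if (s < 0 || s >= NSOCK)
        return EBADF;
    sock_backlog[s] = backlog;
    return 0;
}

/* Native entry point: unpacks its argument block and calls the core. */
struct listen_args_sketch { int s; int backlog; };

static int
sys_listen_sketch(const struct listen_args_sketch *uap)
{
    assert(uap != NULL);
    return kern_listen_sketch(uap->s, uap->backlog);
}

/* Emulation entry point: a different argument layout, the same core. */
struct emu_listen_args_sketch { long backlog; long s; };

static int
emu_listen_sketch(const struct emu_listen_args_sketch *args)
{
    assert(args != NULL);
    return kern_listen_sketch((int)args->s, (int)args->backlog);
}
```

Since the core never touches the caller's argument block, an emulation layer can call it directly with converted values instead of rebuilding a native argument block in user memory.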