Keyword substitution: kv
Default branch: MAIN
Changes to consdev, the low-level kernel console initialization. The consdev API was calling make_dev() extremely early in the boot sequence but, except for a little code in syscons, didn't really need the abstraction to operate the kernel console during boot. Change the consdev API so that it no longer requires the device abstraction to operate. This will allow the device ABI (cdev_t) to be converted to use SYSREF.
Modify the trapframe, sigcontext, ucontext, etc. Add %gs to the trapframe, and add xflags and an expanded floating point save area to sigcontext/ucontext so traps can be fully specified. Remove all the %gs hacks in the system code and signal trampoline and handle %gs faults natively, like we do %fs faults. Implement writebacks to the virtual page table to set VPTE_M and VPTE_A, and add checks for VPTE_R and VPTE_W. Consolidate the TLS save area into an MD structure that can be accessed by MI code. Reformulate the vmspace_ctl() system call to allow an extended context to be passed (for TLS info, soon the FP state, and eventually the LDT). Adjust the GDB patches to recognize the new location of %gs. Properly detect non-exception returns to the virtual kernel when the virtual kernel is running an emulated user process and receives a signal. Also miscellaneous other work on the virtual kernel.
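As a rough illustration of the VPTE_M/VPTE_A writeback and the VPTE_R/VPTE_W checks described above, the sketch below checks a single virtual page table entry on a fault. The bit values, the vpte_t type and the vpte_access() helper are hypothetical stand-ins, not the real vkernel definitions, and a real implementation would have to update the entry atomically.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit assignments; the real values live in the vkernel headers. */
#define VPTE_V 0x01u  /* entry is valid */
#define VPTE_R 0x02u  /* readable */
#define VPTE_W 0x04u  /* writable */
#define VPTE_A 0x08u  /* accessed: written back on any permitted access */
#define VPTE_M 0x10u  /* modified: written back on a permitted write */

typedef uint64_t vpte_t;

/*
 * Check an access against one virtual page table entry. If it is permitted,
 * write VPTE_A (and VPTE_M for a write) back into the entry and return 0;
 * otherwise leave the entry untouched and return EFAULT so the fault can be
 * reflected to the virtual kernel. (Not atomic: single-threaded sketch.)
 */
static int
vpte_access(vpte_t *vpte, int is_write)
{
	vpte_t v = *vpte;

	if ((v & VPTE_V) == 0)
		return EFAULT;                  /* not mapped at all */
	if (is_write ? (v & VPTE_W) == 0 : (v & VPTE_R) == 0)
		return EFAULT;                  /* permission check failed */

	v |= VPTE_A;
	if (is_write)
		v |= VPTE_M;
	*vpte = v;                              /* the writeback */
	return 0;
}
```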
Ansify function declarations and fix some minor style issues. In-collaboration-with: Alexey Slynko <firstname.lastname@example.org>
Recent dev_t work confused sysctl. Adjust the reported type to udev_t and make sysctl recognize it. Reported-by: "Frank W. Josellis" <email@example.com>
Change the kernel dev_t, representing a pointer to a specinfo structure, to cdev_t. Change struct specinfo to struct cdev. The name 'cdev' was taken from FreeBSD. Remove the dev_t shim for the kernel. This commit generally removes the overloading of 'dev_t' between userland and the kernel. Also fix a bug in libkvm where a kernel dev_t (now cdev_t) was not being properly converted to a userland dev_t.
MASSIVE reorganization of the device operations vector. Change cdevsw to dev_ops. dev_ops is a syslink-compatible operations vector structure similar to the vop_ops structure used by vnodes. Remove a huge number of instances where a thread pointer was still being passed as an argument to various device ops and other related routines. The device OPEN and IOCTL calls now take a ucred instead of a thread pointer, and the CLOSE call no longer takes a thread pointer.
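A minimal sketch of what an argument-structure operations vector in the spirit of vop_ops could look like, with OPEN receiving a ucred instead of a thread pointer and CLOSE receiving no thread at all. Every name here (struct dev_ops_sketch, the *_args structures, dev_sketch_open()) is hypothetical and heavily simplified; it does not reproduce the real dev_ops layout, and returning ENODEV for a missing entry is an assumption.

```c
#include <errno.h>
#include <stddef.h>

struct ucred { int cr_uid; };                 /* trimmed-down credentials */

struct dev_generic_args { int a_desc; };      /* hypothetical common header */

struct dev_open_args {
	struct dev_generic_args a_head;
	int           a_oflags;
	struct ucred *a_cred;                 /* a ucred, not a thread pointer */
};

struct dev_close_args {
	struct dev_generic_args a_head;
	int           a_fflag;                /* no thread pointer at all */
};

/* The vector: one function pointer per operation, each taking an args struct. */
struct dev_ops_sketch {
	const char *head_name;
	int (*d_open)(struct dev_open_args *);
	int (*d_close)(struct dev_close_args *);
};

/* Pack the arguments and dispatch through the vector. */
static int
dev_sketch_open(const struct dev_ops_sketch *ops, int oflags, struct ucred *cred)
{
	struct dev_open_args ap = { { 0 }, oflags, cred };

	if (ops->d_open == NULL)
		return ENODEV;                /* assumed default for a missing op */
	return ops->d_open(&ap);
}
```

Passing a single args structure keeps every operation's signature uniform, which is what makes a generic, message-like dispatch possible.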
Make the entire BUF/BIO system BIO-centric instead of BUF-centric. Vnode and device strategy routines now take a BIO and must pass that BIO to biodone(). All code which previously managed a BUF undergoing I/O now manages a BIO. The new BIO-centric algorithms allow BIOs to be stacked, where each layer represents a block translation, completion callback, or caller or device private data. This information is no longer overloaded within the BUF. Translation layer linkages remain intact as a 'cache' after I/O has completed. The VOP and DEV strategy routines no longer make assumptions as to which translated block number applies to them. They use the block number in the BIO specifically passed to them. Change the 'untranslated' constant to NOOFFSET (for bio_offset) and (daddr_t)-1 (for bio_blkno). Rip out all code that previously set the translated block number to the untranslated block number to indicate that the translation had not been made. Rip out all the cluster linkage fields for clustered VFS and clustered paging operations. Clustering now occurs in a private BIO layer using private fields within the BIO. Reformulate the vn_strategy() and dev_dstrategy() abstraction(s). These routines no longer assume that bp->b_vp == the vp of the VOP operation, and the dev_t is no longer stored in the struct buf. Instead, only the vp passed to vn_strategy() (and related *_strategy() routines for VFS ops) and the dev_t passed to dev_dstrategy() (and related *_strategy() routines for device ops) are used by the VFS or DEV code. This will allow an arbitrary number of translation layers in the future. Create an independent per-BIO tracking entity, struct bio_track, which is used to determine when I/O is in progress on the associated device or vnode. NOTE: Unlike FreeBSD's BIO work, our struct BUF is still used to hold the fields describing the data buffer, resid, and error state. Major-testing-by: Stefan Krueger
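To make the stacking idea concrete, here is a simplified, single-threaded sketch: each layer has its own offset and optional completion callback, a new layer starts out untranslated, a previously created layer is reused as a cache, and completion walks back up toward the original caller. The structure fields, bio_push(), bio_complete() and the numeric value of NOOFFSET are assumptions for illustration only; the kernel's real push and biodone() logic differs in detail.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NOOFFSET ((int64_t)-1)  /* "not translated yet" marker (value assumed) */

struct buf { int b_error; };    /* data, resid and error state stay in the BUF */

struct bio {
	struct buf *bio_buf;
	struct bio *bio_prev;                 /* layer closer to the original caller */
	struct bio *bio_next;                 /* translated layer, kept as a cache */
	int64_t     bio_offset;               /* offset valid for THIS layer only */
	void      (*bio_done)(struct bio *);  /* per-layer completion, may be NULL */
};

/* Push a translation layer below 'obio', reusing the cached one if present. */
static struct bio *
bio_push(struct bio *obio)
{
	struct bio *nbio = obio->bio_next;

	if (nbio == NULL) {
		nbio = calloc(1, sizeof(*nbio));
		if (nbio == NULL)
			return NULL;
		nbio->bio_buf = obio->bio_buf;
		nbio->bio_prev = obio;
		nbio->bio_offset = NOOFFSET;  /* the next layer must translate */
		obio->bio_next = nbio;
	}
	/* A reused layer keeps its cached bio_offset from the previous I/O. */
	nbio->bio_done = NULL;
	return nbio;
}

/* Complete I/O from the bottom layer, running each callback on the way up. */
static void
bio_complete(struct bio *bio)
{
	while (bio != NULL) {
		void (*done)(struct bio *) = bio->bio_done;
		struct bio *prev = bio->bio_prev;

		bio->bio_done = NULL;         /* each callback runs once per I/O */
		if (done != NULL)
			done(bio);
		bio = prev;
	}
}
```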
Clean up struct session hold/rele management. The tty half-closed support (i.e. showing 'p0-' in the ps output for the tty instead of '??' after a process has detached) had an issue where the tty would be left with a reference to the freed session structure in certain situations because the session structure's ref-counting code was not properly implementing the release case. Consolidate the disparate session ref-counting code into real sess_hold() and sess_rele() functions and ensure that any tty reference to the session is cleared before the session structure is free()'d. NOTE: Joerg noticed a 0xdeadc1de (deadcode) panic related to this issue which means that prior to this fix it was possible for the bug to cause memory corruption in certain situations. NOTE: Linux does not implement half-closed tty sessions like BSD. Add code to implement fully-closed tty sessions, and document the whole mess, but leave it conditionalized-out for now. Reported-by: Joerg Sonnenberger <firstname.lastname@example.org>
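The essential ordering in the fix (clear any tty reference to the session before freeing it) can be sketched as below. The field names s_count, s_ttyp and t_session are assumptions chosen for illustration, and the sketch has none of the locking or atomic reference counting a real kernel would need.

```c
#include <stdlib.h>

struct session;

struct tty {
	struct session *t_session;   /* session this tty belongs to, or NULL */
};

struct session {
	int         s_count;         /* reference count */
	struct tty *s_ttyp;          /* controlling tty, or NULL */
};

static void
sess_hold(struct session *sp)
{
	++sp->s_count;
}

/*
 * Drop a reference. On the last release, detach the tty from the session
 * BEFORE freeing it, so the tty is never left pointing at freed memory.
 */
static void
sess_rele(struct session *sp)
{
	if (--sp->s_count > 0)
		return;
	if (sp->s_ttyp != NULL && sp->s_ttyp->t_session == sp)
		sp->s_ttyp->t_session = NULL;
	sp->s_ttyp = NULL;
	free(sp);
}
```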
Device layer rollup commit.
* cdevsw_add() is now required. cdevsw_add() and cdevsw_remove() may specify a mask/match indicating the range of supported minor numbers. Multiple cdevsw_add() calls using the same major number, but distinctly different ranges, may be issued. All devices that previously failed to call cdevsw_add() now do so.
* cdevsw_remove() now automatically marks all devices within its supported range as being destroyed.
* vnode->v_rdev is no longer resolved when the vnode is created. Instead, only v_udev (a newly added field) is resolved. v_rdev is resolved when the vnode is opened and cleared on the last close.
* A great deal of code was making rather dubious assumptions with regard to the validity of devices associated with vnodes, primarily because a device structure persisted while being indexed by (major, minor) instead of by (cdevsw, major, minor). For example, if you run a program which connects to a USB device and then you pull the USB device and plug it back in, the vnode subsystem will continue to believe that the device is open when, in fact, it isn't (because it was destroyed and recreated). Note also that all the VFS mount procedures now check devices via v_udev instead of v_rdev prior to calling VOP_OPEN(), since v_rdev is NULL prior to the first open.
* The disk layer's device interaction has been rewritten. The disk layer (i.e. the slice and disklabel management layer) no longer overloads its data onto the device structure representing the underlying physical disk. Instead, the disk layer uses the new cdevsw_add() functionality to register its own cdevsw using the underlying device's major number, and simply does NOT register the underlying device's cdevsw. No confusion is created because the device hash is now based on (cdevsw, major, minor) rather than (major, minor).
  NOTE: This also means that underlying raw disk devices may use the entire device minor number instead of having to reserve the bits used by the disk layer, and also means that we can (theoretically) stack a fully disklabel-supported 'disk' on top of any block device.
* The new reference counting scheme prevents this kind of stale-device problem by associating a device with a cdevsw and disconnecting the device from its cdevsw when the cdevsw is removed. Additionally, all udev2dev() lookups run through the cdevsw mask/match and only successfully find devices still associated with an active cdevsw.
* Major work on MFS: MFS no longer shortcuts vnode and device creation. It now creates a real vnode and a real device and implements real open and close VOPs. Additionally, due to the disk layer changes, MFS is no longer limited to 255 mounts. The new limit is 16 million. Since MFS creates a real device node, mount_mfs will now create a real /dev/mfs<PID> device that can be read from userland (e.g. so you can dump an MFS filesystem).
* BUF AND DEVICE STRATEGY changes. The struct buf contains a b_dev field. In order to properly handle stacked devices, we now require that the b_dev field be initialized before the device strategy routine is called. This required some additional work in various VFS implementations. To enforce this requirement, biodone() now sets b_dev to NODEV. The new disk layer will adjust b_dev before forwarding a request to the actual physical device.
* A bug in the ISO CD boot sequence which resulted in a panic has been fixed.
Testing by: lots of people, but David Rhodus found the most egregious bugs.
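The mask/match rule lets several cdevsw registrations share one major number as long as their minor ranges differ: a minor belongs to a registration when (minor & mask) == match. The sketch below shows only that rule. The table, sketch_cdevsw_add(), sketch_cdevsw_remove() and sketch_cdevsw_lookup() are hypothetical, and the sketch omits the device destruction and reference counting the real cdevsw_remove() performs.

```c
#include <stddef.h>

#define SKETCH_MAXREG 16

/* Stand-in for a driver's switch table; only a name is needed here. */
struct cdevsw_sketch {
	const char *d_name;
};

/* One registration: a major number plus the minor range it claims. */
struct devreg {
	int                         major;
	unsigned int                mask;   /* minor bits to compare */
	unsigned int                match;  /* required value of (minor & mask) */
	const struct cdevsw_sketch *sw;
};

static struct devreg devregs[SKETCH_MAXREG];
static int ndevregs;

/* Register 'sw' for every minor of 'major' with (minor & mask) == match. */
static int
sketch_cdevsw_add(const struct cdevsw_sketch *sw, int major,
    unsigned int mask, unsigned int match)
{
	if (ndevregs == SKETCH_MAXREG)
		return -1;
	devregs[ndevregs].major = major;
	devregs[ndevregs].mask = mask;
	devregs[ndevregs].match = match;
	devregs[ndevregs].sw = sw;
	++ndevregs;
	return 0;
}

/* Remove the registration with exactly this (sw, major, mask, match). */
static int
sketch_cdevsw_remove(const struct cdevsw_sketch *sw, int major,
    unsigned int mask, unsigned int match)
{
	for (int i = 0; i < ndevregs; ++i) {
		struct devreg *r = &devregs[i];

		if (r->sw == sw && r->major == major &&
		    r->mask == mask && r->match == match) {
			devregs[i] = devregs[--ndevregs];  /* order does not matter */
			return 0;
		}
	}
	return -1;
}

/* Lookup: only a still-active registration whose range covers 'minor' matches. */
static const struct cdevsw_sketch *
sketch_cdevsw_lookup(int major, unsigned int minor)
{
	for (int i = 0; i < ndevregs; ++i) {
		const struct devreg *r = &devregs[i];

		if (r->major == major && (minor & r->mask) == r->match)
			return r->sw;
	}
	return NULL;
}
```

For example, registering one cdevsw with mask 0x80/match 0x00 and another with mask 0x80/match 0x80 on the same major splits its minor space in two, and removing the second makes lookups in its half fail instead of returning a stale device.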
device switch 1/many: Remove d_autoq, add d_clone (where d_autoq was). d_autoq was used to allow the device port dispatch to mix old-style synchronous calls with new style messaging calls within a particular device. It was never used for that purpose. d_clone will be more fully implemented as work continues. We are going to install d_port in the dev_t (struct specinfo) structure itself and d_clone will be needed to allow devices to 'revector' the port on a minor-number by minor-number basis, in particular allowing minor numbers to be directly dispatched to distinct threads. This is something we will be needing later on.
Revamp the initial lwkt_abortmsg() support to normalize the abstraction. Now a message's primary command is always processed by the target, even if an abort is requested before the target has retrieved the message from the message port. The message will then be requeued and the abort command copied into lwkt_msg_t->ms_cmd. Thus the target is always guaranteed to see the original message and then a second, abort message (the same message with ms_cmd = ms_abort), regardless of whether the abort was requested before or after the target retrieved the original message. ms_cmd is now an opaque union. LWKT makes no assumptions as to its contents. The NET code now stores nm_handler in ms_cmd as a function vector, and nm_handler has been removed from all netmsg structures. The ms_cmd function vector support nominally returns an integer error code, which is intended to support synchronous/asynchronous optimizations in the future (to bypass message queueing and dequeueing in those situations where they can be bypassed, without messing up the messaging abstraction). The connect() predicate, for which signal/abort support was added in the last commit, now uses the new abort mechanism. Instead of having the handler function check whether a message represents an abort or not, a different handler vector is stored in ms_abort and run when an abort is processed (making for an easy separation of function). The large netmsg switch has been replaced by individual function vectors using the new ms_cmd function vector support. This will soon be removed entirely in favor of direct assignment of LWKT-aware PRU vectors to the message's command vector. NOTE ADDITIONAL: eventually the SYSCALL, VFS, and DEV interfaces will use the new opaque ms_cmd 'function vector' support instead of a command index. Work by: Matthew Dillon and Jeffrey Hsu
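The guarantee described above (the target always runs the original command first and then sees the same message again with ms_cmd replaced by the abort handler) can be modeled in a few lines. This is a single-threaded toy, not the LWKT API: struct msg, struct port, the MSGF_* flags, port_put(), port_get(), msg_abort() and port_dispatch() are all hypothetical.

```c
#include <stddef.h>

struct msg;
typedef int (*msgfunc_t)(struct msg *);

#define MSGF_QUEUED   0x01  /* on the port, not yet retrieved by the target */
#define MSGF_ABORTREQ 0x02  /* abort requested before retrieval */

struct msg {
	struct msg *next;
	msgfunc_t   ms_cmd;     /* what the target runs when it retrieves us */
	msgfunc_t   ms_abort;   /* handler to run for the abort */
	int         flags;
};

struct port {
	struct msg *head, *tail;
};

static void
port_put(struct port *p, struct msg *m)
{
	m->next = NULL;
	if (p->tail != NULL)
		p->tail->next = m;
	else
		p->head = m;
	p->tail = m;
	m->flags |= MSGF_QUEUED;
}

static struct msg *
port_get(struct port *p)
{
	struct msg *m = p->head;

	if (m != NULL) {
		p->head = m->next;
		if (p->head == NULL)
			p->tail = NULL;
		m->next = NULL;
		m->flags &= ~MSGF_QUEUED;
	}
	return m;
}

/* Request an abort of 'm', which was sent to port 'p'. */
static void
msg_abort(struct port *p, struct msg *m)
{
	if (m->flags & MSGF_QUEUED) {
		m->flags |= MSGF_ABORTREQ;  /* original still runs first */
	} else {
		m->ms_cmd = m->ms_abort;    /* original already seen: queue abort */
		port_put(p, m);
	}
}

/* Target side: retrieve one message and run its current command. */
static int
port_dispatch(struct port *p)
{
	struct msg *m = port_get(p);
	int error;

	if (m == NULL)
		return -1;                  /* nothing queued */
	error = m->ms_cmd(m);
	if (m->flags & MSGF_ABORTREQ) {
		m->flags &= ~MSGF_ABORTREQ;
		m->ms_cmd = m->ms_abort;    /* copy the abort into ms_cmd */
		port_put(p, m);             /* and requeue the same message */
	}
	return error;
}
```

Either way the target observes the same two-step sequence, so handlers never need to special-case an abort that arrived early.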
More LWKT messaging cleanups. Isolate the default port functions by making them static and rename lwkt_init_port() to lwkt_initport() to conform with lwkt_initmsg().
This is a major cleanup of the LWKT message port code. The messaging code is getting closer to being directly usable by userland. With these changes, message/port operations are now far better abstracted than they were before.
* Stale fields have been removed from struct lwkt_msg.
* lwkt_abortmsg() has been revamped to make it easier to support.
* lwkt_waitmsg has been converted to a port function.
* mp_*port() function fields have been renamed for better readability.
* ms_cleanupmsg has been removed from struct lwkt_msg.
* Union sysmsg is now struct sysmsg.
* A copyout function has been added to struct sysmsg.
* The system calls have been regenerated.
Fully synchronize sys/boot from FreeBSD-5.x, but add / to the module path so /kernel will be found and loaded instead of /boot/kernel. This will give us all the capabilities of the FreeBSD-5 boot code, including AMD64 and ELF64 support. As part of this work, rather than try to adjust ufs/fs.h and friends to get UFS2 info, I instead copied fs.h and friends from FreeBSD-5 into the sys/boot subtree. Additionally, import Peter Wemm's linker set improvements from FreeBSD-5.x. They happen to be compatible with GCC 2.95.x, and they mean that very few changes need to be made to the boot code. Additionally, import a number of other elements from FreeBSD-5, including the sys/diskmbr.h separation.
Register keyword removal. Approved by: Matt Dillon
DEV messaging stage 2/4: In this stage all DEV commands are now being funneled through the message port for action by the port's beginmsg function. CONSOLE and DISK device shims replace the port with their own and then forward to the original. FB (Frame Buffer) shims supposedly do the same thing, but I haven't been able to test them. I don't expect instability in mainline code, but there might be some easy-to-fix breakage, and some drivers still need to be converted. See primarily kern/kern_device.c (new dev_*() functions; it inherits the cdevsw code from kern/kern_conf.c), sys/device.h, and kern/subr_disk.c for the high points. In this stage all DEV messages are still acted upon synchronously in the context of the caller. We cannot create a separate handler thread until the copyin's (primarily in ioctl functions) are made thread-aware. Note that the messaging shims are going to look rather messy in these early days, but as more subsystems are converted over we will begin to use pre-initialized messages and message forwarding to avoid having to constantly rebuild messages prior to use. Note that DEV itself is a mess owing to its 4.x roots and will be cleaned up in subsequent passes. For example, the way sub-devices inherit the main device's cdevsw was always a bad hack and it still is, and several functions (mmap, kqfilter, psize, poll) return results rather than error codes, which will be fixed since now we have a message to store the result in :-)
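The shim arrangement (a CONSOLE or DISK layer replaces the device's port with its own, then forwards each message to the original, all synchronously in the caller's context) can be sketched as follows. struct devport, struct devmsg, the beginmsg field and every function below are hypothetical names for illustration, not the real kern_device.c interfaces.

```c
#include <stddef.h>

/* Hypothetical message: a command code and a slot for the result. */
struct devmsg {
	int cmd;
	int result;
};

/* Hypothetical port: a beginmsg function plus, for shims, the port it wraps. */
struct devport {
	int            (*beginmsg)(struct devport *, struct devmsg *);
	struct devport  *fwd;   /* original port (shims only) */
	int              seen;  /* messages handled at this port */
};

/* The driver's own port: acts on the message synchronously. */
static int
driver_beginmsg(struct devport *port, struct devmsg *msg)
{
	port->seen++;
	msg->result = msg->cmd * 2;   /* stand-in for real device work */
	return 0;
}

/* A shim port: do its own processing, then forward to the original port. */
static int
shim_beginmsg(struct devport *port, struct devmsg *msg)
{
	port->seen++;
	return port->fwd->beginmsg(port->fwd, msg);
}

/* Replace the port in '*slot' with 'shim', remembering the original. */
static void
shim_install(struct devport **slot, struct devport *shim)
{
	shim->beginmsg = shim_beginmsg;
	shim->fwd = *slot;
	shim->seen = 0;
	*slot = shim;
}

/* Caller side: every command is funneled through the installed port. */
static int
dev_dispatch(struct devport *port, int cmd, int *resultp)
{
	struct devmsg msg = { cmd, 0 };
	int error = port->beginmsg(port, &msg);

	if (error == 0)
		*resultp = msg.result;  /* result carried back in the message */
	return error;
}
```

Because callers only ever see the installed port, a shim can be stacked in front of a driver without either side being aware of it, and the message itself gives functions such as mmap or poll a place to return a result separately from the error code.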
DEV messaging stage 1/4: Rearrange struct cdevsw and add a message port and auto-queueing mask. The mask will tell us which message functions can be safely queued to another thread and which still need to run in the context of the caller. Primary configuration fields (name, cmaj, flags, port, autoq mask) are now at the head of the structure. Function vectors, which may eventually go away, are at the end. The port and autoq fields are non-functional in this stage. The old BDEV device major number support has also been removed from cdevsw, and code has been added to translate the bootdev passed from the boot code (the boot code has always passed the now defunct block device major numbers and we obviously need to keep that compatibility intact).
proc->thread stage 4: rework the VFS and DEVICE subsystems to take thread pointers instead of process pointers as arguments, similar to what FreeBSD-5 did. Note, however, that ultimately both APIs are going to be message-passing, which means the current thread context will not be usable for creds and descriptor access.
proc->thread stage 2: MAJOR revamping of system calls, ucred, jail API, and some work on the low level device interface (proc arg -> thread arg). As -current did, I have removed p_cred and incorporated its functions into p_ucred. p_prison has also been moved into p_ucred and adjusted accordingly. The jail interface tests now use ucreds rather than processes. The syscall(p,uap) interface has been changed to just (uap). This is inclusive of the emulation code. It makes little sense to pass a proc pointer around, since doing so confuses the MP readability of the code and most system call code will only work with the current process anyway. Note that eventually *ALL* syscall emulation code will be moved to a kernel-protected userland layer because it really makes no sense whatsoever to implement these emulations in the kernel. suser() now takes no arguments and only operates with the current process. The process argument has been removed from suser_xxx(), so it now just takes a ucred and flags. The sysctl interface was adjusted somewhat.
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 220.127.116.11