Keyword substitution: kv
Default branch: MAIN
Add an entry for lchflags using the same number as FreeBSD.
Implement a new system call: getvfsstat(). This system call returns an array of statfs and statvfs structures. Unfortunately there is no way to just return an array of statvfs structures because the statvfs structure does not have sufficient information in it to identify the mount point. getvfsstat(struct statfs *buf, struct statvfs *vbuf, long vbufsize, int flags);
* Implement new system calls in the kernel: statvfs(), fstatvfs(), fhstatvfs().
    * Implement a new VFS op, VFS_STATVFS(). Implement a default for this new op for VFSs which do not implement VFS_STATVFS(), which calls VFS_STATFS() and converts the structure (using Joerg's conversion procedure from libc).
    * Remove statvfs(), fstatvfs(), and fhstatvfs() from libc. These functions are now system calls.
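The default VFS_STATVFS() path described above (call VFS_STATFS(), then convert) can be pictured with a small sketch. The two structures below are hypothetical, trimmed-down stand-ins rather than the real struct statfs and struct statvfs, and the field mapping (including the clamp of a negative f_bavail and the choice f_favail = f_ffree) is an assumption for illustration, not the actual conversion procedure taken from libc.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the BSD-style struct statfs (subset of fields). */
struct demo_statfs {
    long f_bsize;   /* fundamental block size */
    long f_blocks;  /* total blocks */
    long f_bfree;   /* free blocks */
    long f_bavail;  /* free blocks for unprivileged users, may be negative */
    long f_files;   /* total file nodes */
    long f_ffree;   /* free file nodes */
};

/* Hypothetical stand-in for the POSIX-style struct statvfs (subset of fields). */
struct demo_statvfs {
    unsigned long f_bsize;
    unsigned long f_frsize;
    uint64_t f_blocks;
    uint64_t f_bfree;
    uint64_t f_bavail;
    uint64_t f_files;
    uint64_t f_ffree;
    uint64_t f_favail;
};

/* Clamp a signed counter into an unsigned field. */
static uint64_t demo_clamp(long v)
{
    return v < 0 ? 0 : (uint64_t)v;
}

/* Fill a statvfs-like structure from a statfs-like one (assumed mapping). */
void demo_statfs_to_statvfs(const struct demo_statfs *in,
                            struct demo_statvfs *out)
{
    assert(in != 0 && out != 0);
    out->f_bsize  = (unsigned long)in->f_bsize;
    out->f_frsize = (unsigned long)in->f_bsize;  /* assumption: same unit */
    out->f_blocks = demo_clamp(in->f_blocks);
    out->f_bfree  = demo_clamp(in->f_bfree);
    out->f_bavail = demo_clamp(in->f_bavail);
    out->f_files  = demo_clamp(in->f_files);
    out->f_ffree  = demo_clamp(in->f_ffree);
    out->f_favail = out->f_ffree;                /* assumption */
}
```

A filesystem that implements its own VFS_STATVFS() would bypass a fallback like this entirely.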
Add pselect syscall. Add pselect man page (obtained from FreeBSD). Add pselect wrapper in libthread_xu that calls pselect syscall. Add pselect wrapper in libc_r that calls poll syscall (see XXX in code and BUGS in pselect man page). Changed libbind to use pselect syscall instead of locally defined wrapper.
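For readers unfamiliar with the interface, here is a minimal usage sketch of the POSIX pselect() call. The helper name wait_readable() and the choice of SIGINT are assumptions made for the example. The signal mask is swapped in only for the duration of the wait, which is the property a wrapper built on poll cannot provide atomically.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable, with SIGINT blocked
 * while waiting. Returns the pselect() result: 1 if readable, 0 on
 * timeout, -1 on error.
 */
int wait_readable(int fd, long timeout_ms)
{
    fd_set rfds;
    struct timespec ts;
    sigset_t mask;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    /* Start from the current mask and add SIGINT for the wait only. */
    sigprocmask(SIG_SETMASK, NULL, &mask);
    sigaddset(&mask, SIGINT);

    return pselect(fd + 1, &rfds, NULL, NULL, &ts, &mask);
}
```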
Regenerate system calls (add uuidgen()).
Add a new system call, lwp_rtprio(), and regenerate system calls. int lwp_rtprio (int, pid_t, lwpid_t, struct rtprio *); This patch provides an alternative to rtprio(2) which is able to operate on individual LWPs. Submitted-by: Aggelos Economopoulos <firstname.lastname@example.org>
Just throw all the main arguments for syslink() into syslink_info and pass the structure. Do not pass the descriptor separately, do not pass a pointer to the structure size (just pass the size directly). The search routines just return one structure at a time so a return size field is not needed. Start revamping syslink() to make it more mbuf-centric. This work is very much still in progress.
Probably the last change to the syslink() system call. Allow a generic structure to be passed and returned and revamp the command structure.
syslink work - Implement code for a reformulated system call, giving the kernel the ability to manage multiple syslink routing hubs. Include the physical id space reservation and allocation and assignment of same.
1:1 Userland threading stage 4.8/4: Add syscalls lwp_gettid() and lwp_kill().
1:1 Userland threading stage 4.7/4: Add a new system call lwp_create() which spawns a new lwp with a given thread function address and given stack pointer. Rework and add some associated functions to realize this goal. In-collaboration-with: Thomas E. Spanjaard <email@example.com>
1:1 Userland threading stage 4.5/4: Add a new syscall, extexit(), which follows approximately the semantics discussed about two years ago. This function will be used by userland threading libraries to exit just a single lwp of a process.
Rename the following special extended I/O system calls. Only libc, libc_r, and VKERNEL are currently known to use these calls so the rename should have no major effects. Fix broken prototypes in unistd.h.
    __accept  -> extaccept
    __connect -> extconnect
    __pread   -> extpread
    __preadv  -> extpreadv
    __pwrite  -> extpwrite
    __pwritev -> extpwritev
    Broken-Prototypes-Reported-by: Joe Talbott <firstname.lastname@example.org>
Modify the trapframe sigcontext, ucontext, etc. Add %gs to the trapframe and xflags and an expanded floating point save area to sigcontext/ucontext so traps can be fully specified. Remove all the %gs hacks in the system code and signal trampoline and handle %gs faults natively, like we do %fs faults. Implement writebacks to the virtual page table to set VPTE_M and VPTE_A and add checks for VPTE_R and VPTE_W. Consolidate the TLS save area into a MD structure that can be accessed by MI code. Reformulate the vmspace_ctl() system call to allow an extended context to be passed (for TLS info and soon the FP and eventually the LDT). Adjust the GDB patches to recognize the new location of %gs. Properly detect non-exception returns to the virtual kernel when the virtual kernel is running an emulated user process and receives a signal. And misc other work on the virtual kernel.
Rename system calls, removing a "sys_" prefix that turned out not to be such a good idea: sys_set_tls_area() to set_tls_area(), and sys_get_tls_area() to get_tls_area().
Add two more vmspace_*() system calls to read and write a vmspace. These will be used by the virtual kernel to handle copyin/copyout. The routines are just empty wrappers at the moment. Implement the body for vmspace_mmap() and vmspace_munmap().
Make some adjustments to low level madvise/mcontrol/mmap support code to accommodate vmspace_*() calls. Reformulate the new vmspace_*() calls so they operate similarly to the MAP_VPAGETABLE and mcontrol() calls. This also makes vmspaces more 'programmable' in the sense that it will be possible to mix virtual pagetable mmap()ings with other mmap()ings in a vmspace. Fill in the code for all the new vmspace_*() calls except for vmspace_ctl(). NOTE: vmspace calls are effectively disabled unless vm.vkernel_enable is turned on, just like MAP_VPAGETABLE. Renumber the new mcontrol() and vmspace_*() calls and regenerate.
MAP_VPAGETABLE support part 3/3. Implement a new system call called mcontrol() which is an extension of madvise(), adding an additional 64 bit argument. Add two new advisories, MADV_INVAL and MADV_SETMAP. MADV_INVAL will invalidate the pmap for the specified virtual address range. You need to do this for the virtual addresses affected by changes made in a virtual page table. MADV_SETMAP sets the top-level page table entry for the virtual page table governing the mapped range. It only works for memory governed by a virtual page table and strange things will happen if you only set the root page table entry for part of the virtual range. Further refine the virtual page table format. Keep with 32 bit VPTEs for the moment, but properly implement VPTE_PS and VPTE_V. VPTE_PS can be used to support 4MB linear maps in the top level page table and it can also be used when specifying the 'root' VPTE to disable the page table entirely and just linear map the backing store. VPTE_V is the 'valid' bit (before it was inverted, now it is normal).
Add skeleton procedures for the vmspace_*() series of system calls which will be used by virtual kernels to implement processes.
Add structures and skeleton code for a new system call called syslink() which will support the kernel syslink API. This is the link protocol that will be used for user<->kernel (e.g. user VFS) and kernel<->kernel (cluster) communications. Syslink-based protocols will be used for DEV, VFS, CCMS, and other cluster-related operations.
Add two more system calls, __accept and __connect. The old accept() and connect() are still present but will eventually be replaced with a libc wrapper. The new system calls add a flags argument, allowing O_FBLOCKING or O_FNONBLOCKING to be passed to override the non-blocking setting in the file pointer. They are intended to be used by libc_r.
Add kernel syscall support for explicit blocking and non-blocking I/O regardless of the setting applied to the file pointer.
    send/sendmsg/sendto/recv/recvmsg/recvfrom: New MSG_ flags defined in sys/socket.h may be passed to these functions to override the settings applied to the file pointer on a per-I/O basis.
        MSG_FBLOCKING    - Force the operation to be blocking
        MSG_FNONBLOCKING - Force the operation to be non-blocking
    pread/preadv/pwrite/pwritev: These system calls have been renamed and wrappers will be added to libc. The new system calls are prefixed with a double underscore (like getcwd vs __getcwd) and include an additional flags argument. The new flags are defined in sys/fcntl.h and may be used to override settings applied to the file pointer on a per-I/O basis. Additionally, the internal __ versions of these functions now accept an offset of -1 to mean 'degenerate into a read/readv/write/writev' (i.e. use the offset in the file pointer and update it on completion).
        O_FBLOCKING    - Force the operation to be blocking
        O_FNONBLOCKING - Force the operation to be non-blocking
        O_FAPPEND      - Force the write operation to append (to a regular file)
        O_FOFFSET      - (implied if the offset != -1) - offset is valid
        O_FSYNCWRITE   - Force a synchronous write
        O_FASYNCWRITE  - Force an asynchronous write
        O_FUNBUFFERED  - Force an unbuffered operation (O_DIRECT)
        O_FBUFFERED    - Force a buffered operation (negate O_DIRECT)
    If the flags do not specify an operation (e.g. neither FBLOCKING nor FNONBLOCKING is set), then the settings in the file pointer are used. The original system calls will become wrappers in libc, without the flags arguments. The new system calls will be made available to libc_r to allow it to perform non-blocking I/O without having to mess with a descriptor's file flags.
    NOTE: the new __pread and __pwrite system calls are backwards compatible with the originals due to a pad byte that libc always set to 0. The new __preadv and __pwritev system calls are NOT backwards compatible, but since they were added to HEAD just two months ago I have decided to not renumber them either.
    NOTE: The subrev has been bumped to 1.5.4 and installworld will refuse to install if you are not running at least a 1.5.4 kernel.
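The two rules in this entry (per-I/O flags override the file pointer's setting, and an offset of -1 degenerates into a plain read) can be sketched in userland C. The DEMO_ flag values and both function names are hypothetical; the real flag definitions live in sys/fcntl.h and the real logic is in the kernel. The entry does not say what happens when both flags are set, so letting the blocking flag win below is an arbitrary choice.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bit values, chosen only for this example. */
#define DEMO_O_FBLOCKING    0x1
#define DEMO_O_FNONBLOCKING 0x2

/*
 * Decide whether one I/O operation should be non-blocking.
 * Per-I/O flags take priority; with neither flag set, the setting
 * stored in the file pointer is used.
 */
int demo_nonblocking(int file_is_nonblocking, int ioflags)
{
    if (ioflags & DEMO_O_FBLOCKING)
        return 0;                     /* arbitrary: blocking wins a tie */
    if (ioflags & DEMO_O_FNONBLOCKING)
        return 1;
    return file_is_nonblocking != 0;
}

/*
 * Positional read where an offset of -1 means "use and update the
 * offset in the file pointer", i.e. behave like read().
 */
ssize_t demo_pread(int fd, void *buf, size_t nbytes, off_t offset)
{
    assert(buf != NULL);
    if (offset == (off_t)-1)
        return read(fd, buf, nbytes);
    return pread(fd, buf, nbytes, offset);
}
```

A libc wrapper for the original pread() would then amount to passing no override flags and a valid offset.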
Modify kern/makesyscalls.sh to prefix all kernel system call procedures with "sys_". Modify all related kernel procedures to use the new naming convention. This gets rid of most of the namespace overloading between the kernel and standard header files.
Oops, the usched_set syscall prototype should be updated.
Mark various forms of read() and write() MPSAFE. Note that the MP lock is still acquired, but now it's a lot deeper in the fileops. Mark dup(), dup2(), close(), closefrom(), and fcntl() MPSAFE. Some code paths don't have to get the MP lock, but most still do deeper into the fileops.
unbreak world: spell MPSAFE correctly
spinlock more of the file descriptor code. No appreciable difference in performance on buildworld tests. Change getvnode() to holdvnode() and use semantics similar to holdsock(). The old getvnode() code wasn't fhold()ing the file pointer. The new holdvnode() code does.
Move all the resource limit handling code into a new file, kern/kern_plimit.c. Add spinlocks for access, and mark getrlimit and setrlimit as being MPSAFE. Document how LWPs will have to be handled - basically we will have to unshare the resource structure once we start allowing multiple LWPs per process, but we can otherwise leave it in the proc structure.
Add the preadv() and pwritev() system calls and regenerate. Submitted-by: Chuck Tuffli <email@example.com> Loosely-based-on: FreeBSD
Backout the rest of 1.29. There are a number of issues with the other system calls too that have to be resolved before we can mark these MPSAFE.
Hold MP lock for getppid(). As noted by Dillon, getppid() is not MP safe. This needs the MP lock for cases such as a parent dying and the process being inherited by init.
Mark a few more system calls MPSAFE: getppid(), getegid(), uname(), getrlimit().
Continue work on our pluggable scheduler abstraction. Implement a system call to set the scheduler for the current process (and future children), and add an abstraction for scheduler registration. Submitted-by: Sergey Glushchenko <firstname.lastname@example.org>
Make struct dirent contain a full 64bit inode. Allow more than 255 byte filenames by increasing d_namlen to 16bit. Remove UFS specific macros from sys/dirent.h, programs which really need them should include vfs/ufs/dir.h. MAXNAMLEN should not be used, but replaced by NAME_MAX. To keep the impact for older BSD code small, d_ino and d_fileno are kept in the old meaning when __BSD_VISIBLE is defined, otherwise the POSIX version d_ino is used. This will be changed later to always define only d_ino and make d_fileno a compatibility macro for __BSD_VISIBLE. d_name is left with hard-coded 256 byte space, this will be changed at some point in the future and doesn't affect the ABI. Programs should correctly allocate space themselves, since the maximum directory entry length can be > 256 bytes. For allocating dirents (e.g. for readdir_r), _DIRENT_RECLEN and _DIRENT_DIRSIZ should be used. NetBSD has chosen the same names. Revamp the compatibility code to always use a local kernel buffer and write out the entries. This will be changed later by passing down the output function to vop_readdir, eliminating the redundant copy. Change NFS and CD9660 to use vop_write_dirent, for CD9660 ensure that the buffers are big enough by prepending char arrays of the right size. Tested-by & discussed-with: dillon
Make nlink_t 32bit and ino_t 64bit. Implement the old syscall numbers for *stat by wrapping the new syscalls and truncating the values. Add a hack for boot2 to keep ino_t 32bit, otherwise we would have to link the 64bit math code in and that would most likely overflow boot2. Bump libc major to annotate changed ABI and work around a problem with strip during installworld. strip is dynamically linked and doesn't play well with the new libc otherwise. Support for 64bit inode numbers is still incomplete, because the dirent is limited to 32bit. The checks for nlink_t have to be redone too.
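As an illustration of "wrapping the new syscalls and truncating the values", the sketch below narrows a new-style stat result into an old-style one. Both structures are hypothetical two-field stand-ins, and the old field widths (32-bit inode, 16-bit link count) are assumptions made for the example rather than something stated in this entry.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical new-layout fields: 64-bit inode, 32-bit link count. */
struct demo_stat_new {
    uint64_t st_ino;
    uint32_t st_nlink;
};

/* Hypothetical old-layout fields (assumed widths). */
struct demo_stat_old {
    uint32_t st_ino;
    uint16_t st_nlink;
};

/* What a compatibility wrapper would do after calling the new syscall. */
void demo_stat_to_old(const struct demo_stat_new *in,
                      struct demo_stat_old *out)
{
    assert(in != 0 && out != 0);
    out->st_ino = (uint32_t)in->st_ino;       /* keeps the low 32 bits */
    out->st_nlink = (uint16_t)in->st_nlink;   /* keeps the low 16 bits */
}
```

Plain truncation means two distinct 64-bit inode numbers can look identical to an old binary, which is one reason the entry calls the support incomplete.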
Move ostat definition from sys/stat.h into emulation43bsd/stat.h.
Remove partial NetBSD support. It's pointless to have an emulation of three syscalls (stat, lstat and fstat), the rest was never finished. Discussed-with: dillon
Tie SCTP into the kernel, this includes adding a new syscall (sctp_peeloff). Obtained from: KAME
Add closefrom(2) syscall. It closes all file descriptors equal to or greater than the given descriptor. The function returns early with EINTR when necessary. Other errors from close() are ignored.
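The semantics described here can be approximated in userland with a loop over close(). This is only a sketch of the behaviour, not the kernel implementation: demo_closefrom() and its maxfd upper bound are hypothetical, introduced so the example stays bounded.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Close every descriptor in [lowfd, maxfd].
 * Returns -1 with errno set to EINTR if a close() is interrupted;
 * any other close() error (for example EBADF on an unused slot)
 * is ignored. Returns 0 otherwise.
 */
int demo_closefrom(int lowfd, int maxfd)
{
    int fd;

    assert(lowfd >= 0);
    for (fd = lowfd; fd <= maxfd; fd++) {
        if (close(fd) == -1 && errno == EINTR)
            return -1;
    }
    return 0;
}
```

A typical use is just before exec in a child process, for example closing everything above stderr.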
Change prototype of sys_set_tls_area and sys_get_tls_area to take the size argument as size_t.
Implement TLS support, tls manual pages, and link the umtx and tls manual pages together. TLS stands for 'thread local storage' and is used to support efficient userland threading and threaded data access models. Three TLS segments are supported in order to (eventually) support GCC3's __thread qualifier. David Xu's thread library only uses one descriptor for now. The system calls implement a mostly machine-independent API which returns architecture-specific results. Rather than pass the actual descriptor structure, which unnecessarily pollutes the userland implementation, we pass a more generic (base,size) and the system call returns the %gs load value for IA32. For AMD64 and other architectures, the returned value will be something for those architectures. The current low level assembly support is not as efficient as it could be, but it is good enough for now. The heavy weight switch code for processes does the work. The light weight switch code for pure kernel threads has not been changed (since the kernel doesn't use TLS descriptors we can just ignore them). Based on work by David Xu <email@example.com> and Matthew Dillon <firstname.lastname@example.org>
Implement sigtimedwait and sigwaitinfo syscalls. Reviewed by: dillon
Add jail_attach syscall.
Minor correction in umtx_*() calls, the mutex pointer should point to volatile store.
Add syscall primitives for generic userland accessible sleep/wakeup functions. These functions are capable of sleeping and waking up based on a generic user VM address. Programs capable of sharing memory are also capable of interaction through these functions. Also regenerate our system calls.
    umtx_sleep(ptr, matchvalue, timeout)
        If *(int *)ptr (userland pointer) matches the matchvalue, sleep for timeout microseconds. Access to the contents of *ptr plus entering the sleep is interlocked against calls to umtx_wakeup(). Various error codes are returned depending on what causes the function to return. Note that the timeout may not exceed 1 second.
    umtx_wakeup(ptr, count)
        Wake up at least count processes waiting on the specified userland address. A count of 0 wakes all waiting processes up. This function interlocks against umtx_sleep().
    The typical race case showing resolution between two userland processes is shown below. A process releasing a contested mutex may adjust the contents of the pointer after the kernel has tested *ptr in umtx_sleep(), but this does not matter because the first process will see that the mutex is set to a contested state and will call wakeup after changing the contents of the pointer. Thus, the kernel itself does not have to execute any compare-and-exchange operations in order to support userland mutexes.
        PROCESS 1                   PROCESS 2                       ******** RACE#1 ******
        cmp_exg(ptr, FREE, HELD)
        .                           cmp_exg(ptr, HELD, CONTESTED)
        .                           umtx_sleep(ptr, CONTESTED, 0)
        .                           [kernel tests *ptr]             <<<< COMPARE vs
        cmp_exg(CONTESTED, FREE)    .                               <<<< CHANGE
        .                           tsleep(....)
        umtx_wakeup(ptr, 1)         .
        .                           .
        .                           .
        PROCESS 1                   PROCESS 2                       ******** RACE#2 ******
        cmp_exg(ptr, FREE, HELD)
                                    cmp_exg(ptr, HELD, CONTESTED)
                                    umtx_sleep(ptr, CONTESTED, 0)
        cmp_exg(CONTESTED, FREE)                                    <<<< CHANGE vs
        umtx_wakeup(ptr, 1)
                                    [kernel tests *ptr]             <<<< COMPARE
                                    [MISMATCH, DO NOT TSLEEP]
    These functions are very loosely based on Jeff Roberson's umtx work in FreeBSD. These functions are greatly simplified relative to that work in order to provide a more generic mechanism. This is precursor work for a port of David Xu's 1:1 userland threading library.
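The FREE/HELD/CONTESTED protocol in this entry can be sketched as a three-state userland mutex. To keep the example portable, demo_sleep() and demo_wakeup() are hypothetical stand-ins for umtx_sleep() and umtx_wakeup(): the sleep stub merely yields the CPU when the value still matches, and the wakeup stub only counts calls. The slow path uses an atomic exchange, a common formulation of this kind of lock, rather than the exact cmp_exg sequence in the entry above.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

enum { MTX_FREE, MTX_HELD, MTX_CONTESTED };

int demo_wakeups;   /* number of calls made to the wakeup stand-in */

/* Stand-in for umtx_sleep(): only wait if *ptr still equals value. */
void demo_sleep(atomic_int *ptr, int value)
{
    if (atomic_load(ptr) == value)
        sched_yield();
}

/* Stand-in for umtx_wakeup(): a real call would wake sleepers on ptr. */
void demo_wakeup(atomic_int *ptr, int count)
{
    (void)ptr;
    (void)count;
    demo_wakeups++;
}

void demo_lock(atomic_int *m)
{
    int c = MTX_FREE;

    /* Fast path: FREE -> HELD without involving the kernel. */
    if (atomic_compare_exchange_strong(m, &c, MTX_HELD))
        return;
    /* Slow path: mark the mutex contested so the holder will wake us. */
    if (c != MTX_CONTESTED)
        c = atomic_exchange(m, MTX_CONTESTED);
    while (c != MTX_FREE) {
        demo_sleep(m, MTX_CONTESTED);
        c = atomic_exchange(m, MTX_CONTESTED);
    }
}

void demo_unlock(atomic_int *m)
{
    /* Release first, then wake one waiter only if someone contested. */
    if (atomic_exchange(m, MTX_FREE) == MTX_CONTESTED)
        demo_wakeup(m, 1);
}
```

Because the releasing side changes the value before calling wakeup, a waiter that reaches the sleep call late sees a mismatch and does not sleep, which is the second race case described in the entry.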
Journaling layer work.
    * Adjust the new mountctl syscall to make the passed file descriptor an explicit argument rather than storing the fd in the control structure. Convert the fd to a file pointer to make kern_mountctl() callable from a pure thread.
    * Get rid of vop_stdmountctl and just have the VOP default ops call journal_mountctl(), which makes things less confusing.
    * Get more of the journaling infrastructure working. Basic installation and removal of the journaling structure and the creation and destruction of the worker thread and stream file pointer now works (with lots of XXX's).
    * Add a journaling vector for VOP_NMKDIR to test the journaling VOP ops shim.
Journaling layer work. Add a new system call, mountctl, which will be used to manage the journaling layer. Add a new VOP, VOP_MOUNTCTL, which will be used to pass mountctl operations down into the VFS layer.
There is enough demand for Kip Macy's checkpointing code to warrant permanent integration into the kernel. Add a fixed system call, sys_checkpoint(2), to support the checkpt(1) utility as well as user programs which want to install their own signal handler (SIGCKPT).
Additional CAPS IPC work. Add additional system calls to allow a CAPS server to set a generation number and a CAPS client to query it, which can be used for any purpose but which is intended to allow a server to tell its clients to invalidate their caches. Add missing fork-handling code. CAPS links are only good on a thread-by-thread basis. When a process forks/rforks/clones any active CAPS links will be created as dummy entries in the forked process, causing CAPS syscalls to return ENOTCONN. This allows code based on CAPS to detect when it has been forked so it can re-connect to the service. Make a slight change to the API. caps_sys_put() now returns an immediate ENOTCONN if it forked. Note that userland CAPS code must still deal with the case where a message has been sent and the connection is lost before the reply is returned. The kernel automatically replies to unreplied messages with 0-length data in these cases. Add additional flags to the API, including one that allows a client to block when connecting to a nonexistent service.
Resident executable support stage 1/4: Add kernel bits and syscall support for in-kernel caching of vmspace structures. The main purpose of this feature is to make it possible to run dynamically linked programs as fast as if they were statically linked, by vmspace_fork()ing their vmspace and saving the copy in the kernel, then using that whenever the program is exec'd.
CAPS IPC library stage 2/3: Adjust syscalls.master and regenerate our system calls.
Implement an upcall mechanism to support userland LWKT. This mechanism will allow multiple processes sharing the same VM space (aka clone/threading) to send each other what are basically IPIs. Two new system calls have been added, upc_register() and upc_control(). Documentation is forthcoming. The upcalls are nicely abstracted and a program can register as many as it wants up to the kernel limit (which is 32 at the moment). The upcalls will be used for passing asynch data from kernel to userland, such as asynch syscall message replies, for thread preemption timing, software interrupts, IPIs between virtual cpus (e.g. between the processes that are sharing the single VM space).
Add the varsym_list() system call and add listing support to the varsym utility. Work done by: Eirik Nygaard <email@example.com> and Matt Dillon
Variant symlink support stage 1/2: Implement support for storing and retrieving system-specific, user-specific, and process-specific variables.
Remove the FreeBSD 3.x signal code. This includes osendsig(), osigreturn() and a couple of structures that these syscalls depended on. Split the sigaction(), sigprocmask(), sigpending(), sigsuspend(), sigaltstack() and kill() syscalls. Move the 4.3BSD signal syscalls osigvec(), osigblock(), osigsetmask(), osigstack() and okillpg() to the 43bsd subtree. I'm not too sure if these will even work with the FreeBSD-4 signal trampoline code, but they do compile and link. Implement linux_signal(), linux_rt_sigaction(), linux_sigprocmask(), linux_rt_sigprocmask(), linux_sigpending(), linux_kill(), linux_sigaction(), linux_sigsuspend(), linux_rt_sigsuspend(), linux_pause(), and linux_sigaltstack() with the new in-kernel syscalls. This patch kills 7 stackgap allocations in the Linuxolator.
makesyscalls.sh wants comments to be on their own line. Move the Checkpoint comment to its own line.
Additional checkpoint support for vmspace info. In particular, the data size is used by sbrk and must be restored for programs to work properly.
Introduce the function iovec_copyin() and its friend iovec_free(). These remove a great deal of duplicate code in the syscall functions. For those who like numbers, this patch uses iovec_copyin() four times in uipc_syscalls.c, two times in linux_socket.c and two times in 43bsd_socket.c. Would somebody please comment on the inclusion of sys/malloc.h in sys/uio.h? Remove sockargs() which was used once in the svr4 emulation code. It is replaced with a small piece of code that gets an mbuf and copyin()'s to its data region. Remove the osendfile() syscall which was inappropriately named and placed in the COMPAT_43 code where it doesn't belong. Split the socket(), shutdown() and sendfile() syscalls. All of the syscalls in kern/uipc_syscalls.c are now split. Prevent a panic due to m_freem()'ing a dangling pointer in recvmsg(), orecvmsg(), linux_recvmsg(). This patch completely removes COMPAT_43 from kern/uipc_syscalls.c.
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 188.8.131.52