Keyword substitution: kv
Default branch: MAIN
* Add a flag to track an in-transit socket abort to avoid races when closing a socket.
* Abort sockets asynchronously to prevent socket proto threads from deadlocking each other.
Reported-by: Peter Avalos
Allocate socket structs with kmalloc() instead of zalloc.
do early copyin / delayed copyout for socket options
Fix socketvar.h inclusion by userland. This is a temporary hack and, frankly, a lot more of the header file should be made _KERNEL-only. Reported-by: Hasso Tepper <email@example.com>
Get rid of an old and terrible hack. Local stream sockets enqueue packets directly on the peer's sockbuf, rather than the sender's sockbuf. That part of the code is fine, but in order to prevent the sender from queueing infinite mbufs (because its sockbuf appears to be empty when you do that) the code dynamically messed around with the sender's high water mark. This blew up the new SOCK_SEQPACKET. In particular, it blows up the use of PR_ATOMIC on stream sockets and can cause spurious EMSGSIZE errors to be returned instead of the EWOULDBLOCK that should have been returned. Also fix, or at least partially fix, the resource limit code which tries to reduce the high water mark when a user is using too many mbufs. This never worked well and still doesn't, but it is in better shape now. Get rid of the crufty code and simply add a flag to the signalsockbuf, SSB_STOP, to stop the sender. Also adjust the vkernel to increase the default socket buffer when connecting to vknet instead of if_tap. VKE currently issues non-blocking writes to vknet/tap and we do not want to lose packets for no good reason.
MFC socket buffer lock fixes: the macros previously always returned success even when they failed. Rework the macros in a new header file, sys/socketvar2.h (this header file is now tagged for 1.10 but is not listed in the commit message), fixing the bug and cleaning them up at the same time.
Separate ssb_lock() and ssb_unlock() into their own header file and reimplement the macros as inlines, using the DragonFly '2' notation for header files containing potentially complex inlines. Correct an extremely old bug that caused ssb_lock() to always return success, even when it failed. This could have been responsible for miscellaneous random network bug reports over the years. Reported-by: Johannes Hofmann <Johannes.Hofmann@gmx.de> Taken-from: FreeBSD using the inline suggested by OpenBSD
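One way a lock macro can end up always reporting success is an operator-precedence slip with the comma operator. The sketch below illustrates that class of bug and the inline-function fix. It is not the DragonFly code: `toy_ssb`, `TOY_LOCKED`, `TOY_LOCK_BUGGY` and `toy_lock` are hypothetical names, and the non-blocking failure path is an assumption made for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_LOCKED 0x01         /* hypothetical "lock held" flag bit */

struct toy_ssb {
    int flags;
};

/*
 * Buggy macro: the comma operator binds more loosely than ?:, so this
 * parses as ((cond ? EWOULDBLOCK : (flags |= TOY_LOCKED)), 0) and the
 * whole expression always evaluates to 0.
 */
#define TOY_LOCK_BUGGY(ssb) \
    ((ssb)->flags & TOY_LOCKED ? EWOULDBLOCK : ((ssb)->flags |= TOY_LOCKED), 0)

/* Inline replacement: each path returns its own status explicitly. */
static inline int
toy_lock(struct toy_ssb *ssb)
{
    assert(ssb != NULL);
    if (ssb->flags & TOY_LOCKED)
        return EWOULDBLOCK;     /* already held: report the failure */
    ssb->flags |= TOY_LOCKED;
    return 0;                   /* lock acquired */
}
```

Calling `TOY_LOCK_BUGGY()` on an already locked buffer yields 0, while `toy_lock()` yields `EWOULDBLOCK`. An inline function also gets type checking and evaluates its argument exactly once.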
MFC - fix a mbuf leak.
Fix a mbuf leak that was introduced in April. In April I made a change that allows sends with control or address information to be discarded when the target socket buffer is full, but the two macros returned the wrong error code and prevented the mbuf from being freed.
Give the sockbuf structure its own header file and supporting source file. Move all sockbuf-specific functions from kern/uipc_socket2.c into the new kern/uipc_sockbuf.c and move all the sockbuf-specific structures from sys/socketvar.h to sys/sockbuf.h. Change the sockbuf structure to only contain those fields required to properly manage a chain of mbufs. Create a signalsockbuf structure to hold the remaining fields (e.g. selinfo, mbmax, etc). Change the so_rcv and so_snd structures in the struct socket from a sockbuf to a signalsockbuf. Remove the recently added sorecv_direct structure which was being used to provide a direct mbuf path to consumers for socket I/O. Use the newly revamped sockbuf base structure instead. This gives mbuf consumers direct access to the sockbuf API functions for use outside of a struct socket. This will also allow new API functions to be added to the sockbuf interface to ease the job of parsing data out of chained mbufs.
Clean up the so_pru_soreceive() API a bit to make it easier to read mbuf chains without having to use a fake UIO.
Add kernel syscall support for explicit blocking and non-blocking I/O regardless of the setting applied to the file pointer.
send/sendmsg/sendto/recv/recvmsg/recvfrom: New MSG_ flags defined in sys/socket.h may be passed to these functions to override the settings applied to the file pointer on a per-I/O basis.
    MSG_FBLOCKING    - Force the operation to be blocking
    MSG_FNONBLOCKING - Force the operation to be non-blocking
pread/preadv/pwrite/pwritev: These system calls have been renamed and wrappers will be added to libc. The new system calls are prefixed with a double underscore (like getcwd vs __getcwd) and include an additional flags argument. The new flags are defined in sys/fcntl.h and may be used to override settings applied to the file pointer on a per-I/O basis. Additionally, the internal __ versions of these functions now accept an offset of -1 to mean 'degenerate into a read/readv/write/writev' (i.e. use the offset in the file pointer and update it on completion).
    O_FBLOCKING    - Force the operation to be blocking
    O_FNONBLOCKING - Force the operation to be non-blocking
    O_FAPPEND      - Force the write operation to append (to a regular file)
    O_FOFFSET      - (implied if the offset != -1) offset is valid
    O_FSYNCWRITE   - Force a synchronous write
    O_FASYNCWRITE  - Force an asynchronous write
    O_FUNBUFFERED  - Force an unbuffered operation (O_DIRECT)
    O_FBUFFERED    - Force a buffered operation (negate O_DIRECT)
If the flags do not specify an operation (e.g. neither FBLOCKING nor FNONBLOCKING is set), then the settings in the file pointer are used. The original system calls will become wrappers in libc, without the flags arguments. The new system calls will be made available to libc_r to allow it to perform non-blocking I/O without having to mess with a descriptor's file flags.
NOTE: the new __pread and __pwrite system calls are backwards compatible with the originals due to a pad byte that libc always set to 0. The new __preadv and __pwritev system calls are NOT backwards compatible, but since they were added to HEAD just two months ago I have decided to not renumber them either.
NOTE: The subrev has been bumped to 1.5.4 and installworld will refuse to install if you are not running at least a 1.5.4 kernel.
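The per-I/O override described in this commit can be sketched from userland as follows. This is only an illustration: `toy_recv_nowait` and `TOY_NONBLOCK_FLAG` are hypothetical names, and on systems that do not define MSG_FNONBLOCKING the sketch assumes the widely available MSG_DONTWAIT as a stand-in, which may differ in detail from the DragonFly flag.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Prefer the flag named in the commit message. Falling back to
 * MSG_DONTWAIT is an assumption of this sketch, not part of the commit.
 */
#ifdef MSG_FNONBLOCKING
#define TOY_NONBLOCK_FLAG MSG_FNONBLOCKING
#else
#define TOY_NONBLOCK_FLAG MSG_DONTWAIT
#endif

/*
 * Attempt one receive without blocking, leaving the descriptor's own
 * O_NONBLOCK setting untouched. Returns the byte count, or -1 with
 * errno set (EAGAIN/EWOULDBLOCK when no data is queued).
 */
static ssize_t
toy_recv_nowait(int fd, void *buf, size_t len)
{
    assert(buf != NULL);
    return recv(fd, buf, len, TOY_NONBLOCK_FLAG);
}
```

Because the flag travels with the call, a threading library can issue non-blocking I/O on a descriptor that the application itself treats as blocking, without touching the shared file flags.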
Move selinfo stuff to the separate header sys/selinfo.h. Make sys/select.h POSIX compatible. Note: Modifications from the original patch. For the moment maintain compatibility with BSD manual pages by ensuring that the prototype for the select() function is declared in both sys/select.h and unistd.h. Submitted-by: Alexey Slynko <firstname.lastname@example.org>
Clean up more #include files. Create an internal __boolean_t so two or three sys/ header files don't have to juggle the type. Use _KERNEL_STRUCTURES in various pieces of user code that delve into kvm. Reported-by: Rumko <email@example.com>, walt <firstname.lastname@example.org>
Remove so_gencnt and so_gen_t. The generation counter is not used any more.
Consolidate the file descriptor destruction code used when a newly created file descriptor must be destroyed due to an error into a new procedure, fdealloc(), rather than manually repeating it over and over again. Move holdsock() and holdfp() into kern/kern_descrip.c.
The thread/proc pointer argument in the VFS subsystem originally existed for... well, I'm not sure *WHY* it originally existed when most of the time the pointer couldn't be anything other than curthread or curproc or the code wouldn't work. This is particularly true of lockmgr locks. Remove the pointer argument from all VOP_*() functions, all fileops functions, and most ioctl functions.
Fix a sockbuf race. Currently the m_free*() path can block, due to objcache_put() blocking when it must access the global depot. This breaks the critical section *AND* the BGL during a time when the sockbuf state is inconsistent. Another process accessing the same sockbuf would then corrupt it. Since depot access is fairly rare, this bug typically required a number of hours to reproduce. Delay the actual freeing of mbufs until after the sockbuf state has been updated. Encapsulate common operations in a procedure, and add additional assertions. NULL out sb_lastrecord when it becomes invalid, and add a considerable amount of debugging code. SOCKBUF_DEBUG has been added. Note that this is a VERY EXPENSIVE kernel compile option which should only be used when specifically debugging the networking subsystem. This stabilization patch is rather hackish. A better cleanup will occur once we are sure we've fixed all the bugs. sbcheck provided by: Jeffrey Hsu Reported-by: David Rhodus, Peter Avalos, YONETANI Tomokazu, Tomaz Borstnar, and numerous other people.
Make shutdown() a fileops operation rather than a socket operation. Pipes are full-duplex entities, so implement shutdown support for them.
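The "delay the free until the state is consistent" idea from the sockbuf race fix can be sketched in plain C as below. This is a simplified userland model, not the kernel code: `toy_buf`, `toy_queue`, `toy_unlink_front` and `toy_free_chain` are hypothetical names, and `free()` stands in for a release path that might block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_buf {
    struct toy_buf *next;
    size_t len;
};

struct toy_queue {
    struct toy_buf *head;
    struct toy_buf *tail;       /* hint to the last buffer, NULL if empty */
    size_t cc;                  /* total bytes queued */
};

/*
 * Unlink up to `count` buffers from the front of q and fix up all of
 * the queue's bookkeeping. Nothing is freed here: the unlinked buffers
 * come back as a private chain for the caller to release afterwards.
 */
static struct toy_buf *
toy_unlink_front(struct toy_queue *q, size_t count)
{
    struct toy_buf *chain = NULL, **linkp = &chain;

    while (count-- > 0 && q->head != NULL) {
        struct toy_buf *b = q->head;

        q->head = b->next;
        q->cc -= b->len;
        b->next = NULL;
        *linkp = b;             /* append to the private chain */
        linkp = &b->next;
    }
    if (q->head == NULL)
        q->tail = NULL;         /* clear the now-stale tail hint */
    assert(q->head != NULL || q->cc == 0);
    return chain;
}

/* Release a private chain; safe to block here, the queue is consistent. */
static size_t
toy_free_chain(struct toy_buf *chain)
{
    size_t n = 0;

    while (chain != NULL) {
        struct toy_buf *next = chain->next;

        free(chain);
        chain = next;
        n++;
    }
    return n;
}
```

The caller runs `toy_unlink_front()` while it holds whatever protects the queue and calls `toy_free_chain()` only after that, so a blocking release can no longer expose a half-updated queue to another thread.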
Keep a hint for the last packet in the singly-linked list of packets in a sockbuf in order to convert the cost of append operations from O(n) to O(1).
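A minimal sketch of the tail-hint technique, assuming a simplified packet list: `toy_pkt`, `toy_sockbuf`, `toy_append` and `toy_remove_first` are hypothetical names and do not mirror the real sockbuf fields.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pkt {
    struct toy_pkt *nextpkt;
    int id;
};

struct toy_sockbuf {
    struct toy_pkt *first;      /* head of the singly-linked packet list */
    struct toy_pkt *last;       /* hint: last packet, NULL when empty */
};

/* Append in O(1) by following the hint instead of walking the list. */
static void
toy_append(struct toy_sockbuf *sb, struct toy_pkt *p)
{
    p->nextpkt = NULL;
    if (sb->last != NULL) {
        assert(sb->last->nextpkt == NULL);  /* hint must point at the tail */
        sb->last->nextpkt = p;
    } else {
        assert(sb->first == NULL);          /* no hint means empty list */
        sb->first = p;
    }
    sb->last = p;
}

/* Remove the first packet, invalidating the hint if the list empties. */
static struct toy_pkt *
toy_remove_first(struct toy_sockbuf *sb)
{
    struct toy_pkt *p = sb->first;

    if (p != NULL) {
        sb->first = p->nextpkt;
        if (sb->first == NULL)
            sb->last = NULL;
        p->nextpkt = NULL;
    }
    return p;
}
```

The price of the O(1) append is that every path removing packets must keep the hint valid, which is why the removal helper resets it when the list becomes empty.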
Code cleanup. Refactor some functions. Push some globals into local scope.
Clean up routing code before I parallelize it.
Cache a pointer to the last mbuf in the sockbuf for faster insertion. Update it on sockbuf insertion and deletion and on user reads. Add a new sbappendstream() function that inserts in constant time. Use it for TCP.
Change (almost) all references to tqh_first and tqe_next and tqe_prev to the correct TAILQ macros. Exceptions are contrib/ipfilter, which will be handled separately, and dev/misc/labpc, which does some very weird things and therefore needs much more care.
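As an illustration of the TAILQ conversion, the sketch below walks a tail queue with the accessor macros from <sys/queue.h> instead of touching tqh_first and tqe_next directly. The structure and function names (`toy_item`, `toy_list`, `toy_sum`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct toy_item {
    int value;
    TAILQ_ENTRY(toy_item) link;     /* expands to the tqe_next/tqe_prev pair */
};

TAILQ_HEAD(toy_list, toy_item);     /* declares struct toy_list */

/*
 * Sum every value in the list. TAILQ_FIRST() and TAILQ_NEXT() replace
 * the open-coded head->tqh_first and it->link.tqe_next accesses.
 */
static int
toy_sum(struct toy_list *head)
{
    struct toy_item *it;
    int sum = 0;

    assert(head != NULL);
    for (it = TAILQ_FIRST(head); it != NULL; it = TAILQ_NEXT(it, link))
        sum += it->value;
    return sum;
}
```

Going through the macros keeps callers independent of the field layout, so the queue implementation can change without touching every user.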
Remove the canwait argument to dup_sockaddr(). Callers of dup_sockaddr() all assume that it just works, so it really has to work. Since interrupts are now threads we can use M_INTWAIT. While it is possible that a memory deadlock issue exists here (e.g. if swapping over NFS), it isn't likely in this case.
Add predicate message facility.
Give UDP its own sosend() function.
Increase the default socket buffer for NFS to deal with linux bugs and to improve performance. The default nfs socket buffer is now 65535 bytes, settable with a sysctl (vfs.nfs.soreserve). It is my belief that when large data block sizes (32K) are negotiated, the larger socket buffer should improve read-ahead performance and reduce nfs socket buffer lock contention that occurs with multiple nfsd's. I was able to do some testing over GigE and it did seem to help, but problems with one of the machines made the tests less than reliable. Credits: Richard Sharpe originally encountered issues with linux NFS clients that were traced to linux doing a bad job in its delayed-ack code. David Rhodus created an initial patch which I used as a partial basis for this commit (circa October 2003).
Once we distribute socket protocol processing requests to different processors, we no longer have a process context to refer to, so eliminate the use of curproc in soreserve() by passing the sockbuf resource limit all the way down from the system call code to sbreserve(). Eliminate the use of curproc in unp_attach() by passing down the fields it needs from the proc structure. Define a pru_attach_info structure to hold the information the attach usrreq function requires. The thread argument to in_pcballoc() is unused, so we don't need to pass a thread structure down to in_pcballoc().
Pull the sf_buf routines and structures out into their own files in anticipation of wider future use. Requested and reviewed by: dillon
This patch improves the performance of sendfile(2). It adds a hash table of active sf_buf mappings to ensure there is exactly one (ref-counted) sf_buf for each vm_page. This saves on the number of sf_bufs used when sendfile(2) sends the same file over and over again to multiple connections simultaneously. It also does lazy updates on the hw page table when a sf_buf ref count goes to zero, placing them on a freelist instead, in effect making sf_bufs a cache of virtual-to-physical translations with LRU replacement on the inactive sf_bufs. Finally, it does a wakeup_one() instead of a broadcast wakeup() when a free sf_buf becomes available.
This patch roughly corresponds to FreeBSD revs 1.219 and 1.220 of sys/i386/i386/vm_machdep.c:
revision 1.219
date: 2003/11/17 18:22:24; author: alc; state: Exp; lines: +48 -26
Change the i386's sf_buf implementation so that it never allocates more than one sf_buf for one vm_page. To accomplish this, we add a global hash table mapping vm_pages to sf_bufs and a reference count to each sf_buf. (This is similar to the patches for RELENG_4 at http://www.cs.princeton.edu/~yruan/debox/.) For the uninitiated, an sf_buf is nothing more than a kernel virtual address that is used for temporary virtual-to-physical mappings by sendfile(2) and zero-copy sockets. As such, there is no reason for one vm_page to have several sf_bufs mapping it. In fact, using more than one sf_buf for a single vm_page increases the likelihood that sendfile(2) blocks, hurting throughput. (See http://www.cs.princeton.edu/~yruan/debox/.)
revision 1.220
date: 2003/12/07 22:49:25; author: alc; state: Exp; lines: +10 -9
Don't remove the virtual-to-physical mapping when an sf_buf is freed. Instead, allow the mapping to persist, but add the sf_buf to a free list. If a later sendfile(2) or zero-copy send resends the same physical page, perhaps with the same or different contents, then the mapping overhead is avoided and the sf_buf is simply removed from the free list. In other words, the i386 sf_buf implementation now behaves as a cache of virtual-to-physical translations using an LRU replacement policy on inactive sf_bufs. This is similar in concept to a part of http://www.cs.princeton.edu/~yruan/debox/ patch, but much simpler in implementation.
Note: none of this is required on alpha, amd64, or ia64. They now use their direct virtual-to-physical mapping to avoid any ephemeral mapping overheads in their sf_buf implementations.
Reviewed by: dillon
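The caching behaviour described in the sf_buf commit above (one ref-counted mapping per page, mappings kept after release, least recently used inactive mapping reused first) can be modelled in a few lines of C. This is a toy model under stated assumptions: a linear scan over a tiny fixed pool replaces the hash table and free list, integer page ids stand in for vm_page pointers, and every `toy_` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NBUFS 2                 /* tiny pool so reuse is easy to observe */

struct toy_sfbuf {
    int page;                       /* page id currently mapped, -1 if none */
    int refcnt;                     /* active users of this mapping */
    unsigned long last_used;        /* logical time of the last release */
};

struct toy_pool {
    struct toy_sfbuf bufs[TOY_NBUFS];
    unsigned long clock;            /* logical clock for LRU ordering */
    int remaps;                     /* times a mapping had to be rebuilt */
};

static void
toy_pool_init(struct toy_pool *p)
{
    p->clock = 0;
    p->remaps = 0;
    for (int i = 0; i < TOY_NBUFS; i++) {
        p->bufs[i].page = -1;
        p->bufs[i].refcnt = 0;
        p->bufs[i].last_used = 0;
    }
}

/*
 * Return the buffer mapping `page`, sharing an active mapping or
 * reviving a cached one when possible. Otherwise reuse the least
 * recently used inactive buffer. Returns NULL if all are active.
 */
static struct toy_sfbuf *
toy_alloc(struct toy_pool *p, int page)
{
    struct toy_sfbuf *victim = NULL;

    for (int i = 0; i < TOY_NBUFS; i++) {
        struct toy_sfbuf *sf = &p->bufs[i];

        if (sf->page == page) {     /* hit: no remapping needed */
            sf->refcnt++;
            return sf;
        }
        if (sf->refcnt == 0 &&
            (victim == NULL || sf->last_used < victim->last_used))
            victim = sf;
    }
    if (victim == NULL)
        return NULL;                /* a kernel would sleep here instead */
    victim->page = page;            /* stands in for the page table update */
    victim->refcnt = 1;
    p->remaps++;
    return victim;
}

/* Drop a reference; the mapping is kept so a later request can reuse it. */
static void
toy_free(struct toy_pool *p, struct toy_sfbuf *sf)
{
    assert(sf->refcnt > 0);
    if (--sf->refcnt == 0)
        sf->last_used = ++p->clock;
}
```

In this model `remaps` counts the expensive operation: repeated requests for the same page, whether concurrent or back to back, leave it unchanged.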
Introduce the function iovec_copyin() and its friend iovec_free(). These remove a great deal of duplicate code in the syscall functions. For those who like numbers, this patch uses iovec_copyin() four times in uipc_syscalls.c, two times in linux_socket.c and two times in 43bsd_socket.c. Would somebody please comment on the inclusion of sys/malloc.h in sys/uio.h? Remove sockargs() which was used once in the svr4 emulation code. It is replaced with a small piece of code that gets an mbuf and copyin()'s to its data region. Remove the osendfile() syscall which was inappropriately named and placed in the COMPAT_43 code where it doesn't belong. Split the socket(), shutdown() and sendfile() syscalls. All of the syscalls in kern/uipc_syscalls.c are now split. Prevent a panic due to m_freem()'ing a dangling pointer in recvmsg(), orecvmsg(), linux_recvmsg(). This patch completely removes COMPAT_43 from kern/uipc_syscalls.c.
Add support for Protocol Independent Multicast. Submitted to FreeBSD by: Pavlin Radoslavov <email@example.com>
__P() != wanted, begin removal. In order to preserve white space this needs to be done by hand, as I accidentally killed a source tree that I had gotten this far on. I'm committing this now; LINT and GENERIC both build with these changes, and there are many more to come.
proc->thread stage 4: rework the VFS and DEVICE subsystems to take thread pointers instead of process pointers as arguments, similar to what FreeBSD-5 did. Note however that ultimately both APIs are going to be message-passing which means the current thread context will not be useable for creds and descriptor access.
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 220.127.116.11