Up to [DragonFly] / src / sys / netinet
Keyword substitution: kv
Default branch: MAIN
- If we receive a redirect or host-dead ICMP message due to packets sent on TCP sockets, we need to go through all CPUs to check the per-cpu TCP inpcbs.
- If we receive a redirect ICMP message due to packets sent on UDP sockets, we need to go through all CPUs to free the UDP inpcbs' cached route entries.
Reported-by: pavalos@
Tested-by: pavalos@
Don't use the listen socket's route cache; it is not currently MPSAFE. This could happen if a listen socket received a TCP segment with an invalid flag combination.
Add the following three running modes for network protocol threads:
1) BGL (default)
2) Adaptive BGL. Protocol threads run without the BGL by default. The BGL will be held if the received msg does not have MSGF_MPSAFE turned on in the ms_flags field.
3) No BGL (experimental)
The code on the main path is done by dillon@.
The following three sysctls and tunables are added to adjust the "mode":
net.netisr.mpsafe_thread
net.inet.tcp.mpsafe_thread
net.inet.udp.mpsafe_thread
They have the same set of values:
0 (default) -- BGL
1 -- Adaptive BGL
2 -- No BGL
NETISR_FLAG_MPSAFE is added (netisr.ni_flags), so that:
- netisr_queue() and schednetisr() could set MSGF_MPSAFE during msg initialization
- netisr_run() (called by ether_input_oncpu()) could hold the BGL based on this flag before calling the netisr's handler
PR_MPSAFE is added (protosw.pr_flags), so that transport_processing_oncpu() could hold the BGL before calling the protocol's input handler.
Kernel API changes:
- The thread parameter to netmsg_service_loop() must be supplied (running mode) and it must have the type "int *"
- netisr_register() takes an additional flags parameter to indicate whether its handler is MPSAFE (NETISR_FLAG_MPSAFE) or not
Reviewed-by: dillon@
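The three running modes described in this entry can be sketched as a single decision: does the protocol thread need the BGL for this message? This is a minimal userland illustration, not the kernel code. The enum, the `struct msg` layout, the flag value, and `need_bgl()` are all hypothetical; only the mode numbers and the MSGF_MPSAFE/ms_flags names come from the log entry.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the real kernel definitions differ. */
enum thread_mode {
    MODE_BGL = 0,          /* sysctl value 0 (default) */
    MODE_ADAPTIVE_BGL = 1, /* sysctl value 1 */
    MODE_NO_BGL = 2        /* sysctl value 2 (experimental) */
};

#define MSGF_MPSAFE 0x0001 /* illustrative bit value only */

struct msg {
    int ms_flags;
};

/* Decide whether the protocol thread must hold the BGL for this message. */
static bool
need_bgl(enum thread_mode mode, const struct msg *m)
{
    switch (mode) {
    case MODE_BGL:
        return true;
    case MODE_ADAPTIVE_BGL:
        /* Hold the BGL only for messages not marked MP-safe. */
        return (m->ms_flags & MSGF_MPSAFE) == 0;
    case MODE_NO_BGL:
        return false;
    }
    return true; /* unknown mode: fall back to the conservative choice */
}
```

In the adaptive mode the cost of the BGL is paid per message, so handlers can be converted to MP-safe one at a time.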
Allocate sackblock structs with kmalloc() instead of zalloc().
After lwkt_waitmsg/lwkt_waitport splitting, the second parameter of lwkt_waitport() is tsleep flags instead of msg pointer.
* Greatly reduce the complexity of the LWKT messaging and port abstraction. Significantly reduce the overhead of the subsystem.
* The message abort algorithm has been rewritten. It now sends a separate message to issue the abort instead of trying to requeue the original message. This also means the TAILQ embedded in the lwkt_msg structure can be used by unrelated code during processing of the message.
* Numerous MSGF_ flags have been removed, and all the LWKT msg/port algorithms have been rewritten and simplified. The message structure is now only touched by the current owner in all situations.
* Numerous structural fields have been removed. In particular, the fields used for message abort sequencing have been simplified and we do not try to embed a 'command' field in the base LWKT message any more.
* Clean up the netmsg abstraction, which is used all over the network stack. Instead of trying to overload fields in lwkt_msg we now simply extend the base lwkt_msg into struct netmsg. The function dispatch now takes a netmsg and returns void (before we had to return EASYNC), and we no longer need weird casts. Accept/connect message aborts are now greatly simplified.
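The "extend the base message" idea from the last item can be sketched in plain C as struct embedding with a void dispatch function. Every name below (`base_msg`, `net_msg`, `reply()`, `service_one()`, `example_handler()`) is hypothetical and chosen for illustration; the real lwkt_msg and netmsg layouts are not reproduced here.

```c
#include <errno.h>

/* Hypothetical base message: only generic bookkeeping lives here. */
struct base_msg {
    int ms_error; /* result reported back to the sender */
    int ms_done;  /* set once the message has been replied */
};

/* Hypothetical network message: extends the base instead of overloading it. */
struct net_msg {
    struct base_msg nm_base;               /* embedded base message */
    void (*nm_dispatch)(struct net_msg *); /* returns void; replies itself */
    int nm_arg;                            /* illustrative payload */
};

/* Stand-in for replying to the sender with an error code. */
static void
reply(struct base_msg *m, int error)
{
    m->ms_error = error;
    m->ms_done = 1;
}

/* Example handler: it decides on its own when and how to reply. */
static void
example_handler(struct net_msg *nm)
{
    reply(&nm->nm_base, nm->nm_arg < 0 ? EINVAL : 0);
}

/* The service loop just calls the dispatch function; no return code to check. */
static void
service_one(struct net_msg *nm)
{
    nm->nm_dispatch(nm);
}
```

Because the handler owns the reply, the loop no longer needs a special "reply later" return value.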
Give the sockbuf structure its own header file and supporting source file. Move all sockbuf-specific functions from kern/uipc_socket2.c into the new kern/uipc_sockbuf.c and move all the sockbuf-specific structures from sys/socketvar.h to sys/sockbuf.h. Change the sockbuf structure to only contain those fields required to properly manage a chain of mbufs. Create a signalsockbuf structure to hold the remaining fields (e.g. selinfo, mbmax, etc). Change the so_rcv and so_snd structures in the struct socket from a sockbuf to a signalsockbuf. Remove the recently added sorecv_direct structure which was being used to provide a direct mbuf path to consumers for socket I/O. Use the newly revamped sockbuf base structure instead. This gives mbuf consumers direct access to the sockbuf API functions for use outside of a struct socket. This will also allow new API functions to be added to the sockbuf interface to ease the job of parsing data out of chained mbufs.
Remove weird license clause which has expired.
Rename printf -> kprintf in sys/ and add some defines where necessary (files which are used in userland, too).
Local variables that were improperly named 'errno' must be renamed so as not to conflict with libc's errno, when building a virtual kernel.
Rename malloc->kmalloc, free->kfree, and realloc->krealloc. Pass 2
Rename malloc->kmalloc, free->kfree, and realloc->krealloc. Pass 1
* Remove (void) casts for discarded return values.
* Put function types on separate lines.
* Ansify function definitions.
* Remove __P.
In-collaboration-with: Alexey Slynko <slynko@tronet.ru>
Add KTR logging to the core tcp protocol loop.
Remove spl*() calls from netinet, replacing them with critical sections. A slight rearrangement of COMMON_START() in tcp_usrreq.c was necessary to ensure that the inp is loaded after entering the critical section.
Implement TCP Appropriate Byte Counting. Reviewed by Noritoshi Demizu, demizu@dd.iij4u.or.jp. Misunderstanding of spec clarified by Mark Allman.
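For readers unfamiliar with Appropriate Byte Counting, the following is a minimal sketch of the window update as RFC 3465 describes it, not the code from this commit. The struct, the function name, and the choice of L = 2 are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical per-connection state for the sketch. */
struct abc_state {
    uint32_t cwnd;        /* congestion window, in bytes */
    uint32_t ssthresh;    /* slow-start threshold, in bytes */
    uint32_t bytes_acked; /* bytes acked since the last increase (cong. avoidance) */
};

/* Per-ACK limit L in segments during slow start; 2 is an assumption here. */
#define ABC_L 2

/* Update cwnd for an ACK that newly acknowledges 'acked' bytes. */
static void
abc_on_ack(struct abc_state *s, uint32_t acked, uint32_t smss)
{
    if (s->cwnd < s->ssthresh) {
        /* Slow start: grow by the bytes acked, capped at L * SMSS. */
        uint32_t cap = ABC_L * smss;
        s->cwnd += (acked < cap) ? acked : cap;
    } else {
        /* Congestion avoidance: one SMSS per cwnd's worth of acked bytes. */
        s->bytes_acked += acked;
        if (s->bytes_acked >= s->cwnd) {
            s->bytes_acked -= s->cwnd;
            s->cwnd += smss;
        }
    }
}
```

Counting bytes rather than ACKs keeps window growth tied to data actually delivered, even when the receiver acks only every other segment or acks are lost.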
Don't call cpu_mb1 after lwkt_setcpu_self, but call it internally after the processing is done in lwkt_setcpu_self. The extern call should automatically invalidate all non-local data and keeping it in lwkt_setcpu_self ensures that it continues to work even with IPO. Requested-by: dillon
Mechanical cleanup of TCP per-cpu statistics code, better naming etc. Correct a stale comment regarding initialisation of the counters.
Minimal patch that allows Path MTU discovery to be turned back on, but leave it off by default. Tested by: Hiroki Sato, Dave Rhodus, Yonetani Tomokazu, Matt Dillon, Andrew Atrens,
Cosmetic cleanups.
Clean up the routing and networking code before I parallelize routing.
Remove the userland visible part of the socket generation counting. As a side issue, the CPU used for processing a PCB isn't shown anymore, since this is currently not included by the userland sockets.
Correct a bug where incoming connections do not properly initialize the inflight bandwidth calculator. Reorg the code a bit, removing random initialization elsewhere and putting it all in one place. Add an idle check and a pure-ack check. Reported-by: Dan Nelson <dnelson@allantgroup.com>
Implement SACK.
Update includes now that the Fast IPSec code has moved to netproto/ipsec. Submitted by: Pawel Biernacki <kaktus@dragonflybsd.pl>
Add a state to sanity check tcp_close() to make sure it is not called twice. Add a 'cpu' field to the inpcb so the cpu owning a pcb can be made well-known, for use in later assertions as we move closer to removing the BGL. Fix a bug in the closing of listen sockets. The inp wildcard hash table removal was being done asynchronously with the freeing of the inp, which could lead to problems. Instead of sending messages in parallel to all tcp protocol threads to remove the wildcard hash we instead chain a single message through all tcp protocol threads to remove the hash, then detach the inp at the end of the chain. There is still an issue with the socket being ripped out from under other protocol threads which might be trying to accept connections on behalf of the listen socket which must be resolved before the BGL can be removed (among other things).
tcp_input()'s DELAY_ACK() code checks to see if the delayed ack timer is running and if it is not it starts it and returns rather than issue an ack. If the timer is already running tcp_input() will generate an immediate ack, resulting in one ack every other packet. This every-other-packet ack is usually required to ensure that the window does not close too much and stall the sender, but it really only exists because the tcp stack does not look ahead to see if there are other incoming packets that need to be processed that might themselves require additional acks. For optimal operation we really want to process all the pending TCP packets for the connection before sending any 'normal' acks. Many ethernet interfaces, including and most especially GigE interfaces, rate-limit their interrupts. This results in several packets being moved from the RX ring to the TCP/IP stack all at once, in a batch. Give the TCP stack its own netisr dispatcher loop rather than using the generic netisr dispatcher loop. The TCP dispatcher loop will call an additional routine, tcp_willblock(), after all messages queued to the TCP protocol stack have been exhausted. When tcp_input() needs to send an ack in the normal header-prediction case it now places the TCPCB on a queue rather than send an immediate ack. tcp_willblock() processes this queue and calls tcp_output() to send the actual ack. The result is that on a GigE interface which typically queues 8+ packets per interrupt, a TCP stream will only be acked once per ~8 packets rather than 4 times (every other packet) per ~8 packets. This *GREATLY* reduces TCP protocol overhead and network ack traffic on both ends of the connection. NOTE: a later commit will deal with pure window space updates which generate an additional ACK per ~8 packets when the user program drains the buffer. Reviewed-by: Jeffrey Hsu <hsu@crater.dragonflybsd.org>
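The queue-then-flush scheme described above can be sketched in userland C. This is an illustration under stated assumptions, not the kernel code: `struct conn`, `queue_ack()`, and `flush_acks()` are hypothetical stand-ins for the TCPCB, the enqueue step in tcp_input(), and tcp_willblock() respectively, and a counter stands in for tcp_output().

```c
#include <stddef.h>

/* Hypothetical stand-in for a TCP control block. */
struct conn {
    int          pending_ack; /* nonzero while queued for an ack */
    int          acks_sent;   /* counter standing in for tcp_output() */
    struct conn *next;
};

struct ack_queue {
    struct conn *head;
};

/* Input path: queue the connection instead of acking immediately. */
static void
queue_ack(struct ack_queue *q, struct conn *c)
{
    if (c->pending_ack)
        return; /* one queued ack covers the whole batch */
    c->pending_ack = 1;
    c->next = q->head;
    q->head = c;
}

/* Dispatcher is about to block: send one ack per queued connection. */
static int
flush_acks(struct ack_queue *q)
{
    int sent = 0;

    while (q->head != NULL) {
        struct conn *c = q->head;

        q->head = c->next;
        c->next = NULL;
        c->pending_ack = 0;
        c->acks_sent++; /* the real code would call tcp_output() here */
        sent++;
    }
    return sent;
}
```

With a batch of eight segments for one connection, `queue_ack()` is called eight times but `flush_acks()` sends a single ack, which is the reduction the entry describes.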
Add the standard DragonFly copyright notice to go along with mine. Approved by: Matt
Update some of my copyright notices before we officially publish DragonFlyBSD in Release 1.0.
The route table treats sockaddr data as opaque, which means that the unused fields in the structure passed to rtalloc() MUST BE ZERO. The syncache code allocates a governing struct syncache structure which contains an embedded struct route, but it does not zero this structure. When used in a mixed IPV4/IPV6 environment, it is possible for a structure to be allocated for IPV4 whose unused fields for the route lookup (e.g. sin_port and sin_zero) may contain garbage. This screws up the route table lookup and causes the wrong route to be returned. I believe the proper fix in this case is to rewrite the route table code, but since that would take a very long time the fix I am committing is to have tcp_rtlookup() zero out the sockaddr_in before it builds it for the rtalloc() call. Reported-by: Richard Nyberg <rnyberg@it.su.se> With-help-from: Hiten Pandya <hmp@nxad.com>
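The failure mode is easy to reproduce in userland: if a lookup key is compared as raw bytes, stale data in unused fields makes two keys for the same destination differ. The sketch below shows the zero-before-fill pattern the entry describes. `build_route_key()` and `route_key_equal()` are hypothetical helpers, not tcp_rtlookup() or the route table code.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Build a lookup key for dst. Zeroing first guarantees that sin_port,
 * sin_zero and any padding do not carry stale data into the comparison.
 * (On BSD systems the real code would also set sin_len.)
 */
static void
build_route_key(struct sockaddr_in *sin, struct in_addr dst)
{
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_addr = dst;
}

/* Opaque byte-wise comparison, as a table that treats sockaddrs as bytes would do. */
static int
route_key_equal(const struct sockaddr_in *a, const struct sockaddr_in *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}
```

Two keys built this way compare equal no matter what the underlying memory held before, while a key with a leftover port value does not match.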
Add in_pcbinfo_init() to encapsulate basic structural setup (right now just the LIST_INIT). Rename inpcbinfo->listhead to inpcbinfo->pcblisthead due to changes in the API (addition of markers). Add support for markers in the inpcbinfo->pcblisthead lists of INPCB structures. Use markers in sysctl output code to iterate through these lists without losing its place or having to worry about structures being ripped out from under it. Scrap the original two-pass code. Redo the sysctl INPCB output code for tcp, udp, and other protocols so we always output the correct number of structures (as specified in xig_count). Generate output for all cpus (for TCP). This is accomplished by using lwkt_setcpu_self() to migrate the kernel thread to each cpu, which allows us to iterate the list(s) managed by that cpu without having to deal with mutexes or other forms of locks. Iterations always wind up on the same cpu they began on. Redo netstat to properly iterate across as many cpu chunks as the inpcb sysctl's return, rather than just the first one. Work-by: Hiten Pandya and Matthew Dillon
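Marker-based iteration can be sketched with a small hand-rolled doubly linked list. This is an illustration of the general technique only; the node layout and every function name below are hypothetical, and the real code uses the kernel's list macros on inpcb structures.

```c
#include <stddef.h>

struct node {
    int          is_marker; /* markers hold a position and carry no data */
    int          value;
    struct node *prev, *next;
};

struct list {
    struct node *head;
};

static void
insert_head(struct list *l, struct node *n)
{
    n->prev = NULL;
    n->next = l->head;
    if (l->head != NULL)
        l->head->prev = n;
    l->head = n;
}

static void
insert_after(struct node *pos, struct node *n)
{
    n->prev = pos;
    n->next = pos->next;
    if (pos->next != NULL)
        pos->next->prev = n;
    pos->next = n;
}

static void
remove_node(struct list *l, struct node *n)
{
    if (n->prev != NULL)
        n->prev->next = n->next;
    else
        l->head = n->next;
    if (n->next != NULL)
        n->next->prev = n->prev;
    n->prev = n->next = NULL;
}

/* Example visitor for the sketch: unlinks nodes with odd values. */
static void
drop_odd(struct list *l, struct node *n)
{
    if (n->value % 2 != 0)
        remove_node(l, n);
}

/*
 * Visit every real node. The marker is moved past a node before the node
 * is visited, so the visitor may unlink that node without the walk
 * losing its place. Returns the number of real nodes visited.
 */
static int
walk_with_marker(struct list *l, void (*visit)(struct list *, struct node *))
{
    struct node marker = { 1, 0, NULL, NULL };
    struct node *n;
    int count = 0;

    insert_head(l, &marker);
    while ((n = marker.next) != NULL) {
        remove_node(l, &marker);
        insert_after(n, &marker);
        if (n->is_marker)
            continue; /* skip markers owned by other walkers */
        visit(l, n);
        count++;
    }
    remove_node(l, &marker);
    return count;
}
```

The walk's position lives in the list itself, so it survives removal of the node just visited; the original two-pass approach had no such anchor.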
Use MPIPE instead of the really hackish use of m_get() and mtod()/dtom() for the 'tcptemp' structure.
Change mbuf allocation flags from M_ to MB_ to avoid confusion with malloc flags. Requested by: Jeffrey Hsu
Remember if an inpcb was entered into the wildcard table to save some cycles when a connection is closed.
Replicate the TCP listen table to give each cpu its own copy.
Fix a netmsg memory leak in the ARP code. Adjust all ms_cmd function dispatches to return a proper error code. Reported-by: multiple people
Revamp the initial lwkt_abortmsg() support to normalize the abstraction. Now a message's primary command is always processed by the target even if an abort is requested before the target has retrieved the message from the message port. The message will then be requeued and the abort command copied into lwkt_msg_t->ms_cmd. Thus the target is always guaranteed to see the original message and then a second, abort message (the same message with ms_cmd = ms_abort) regardless of whether the abort was requested before or after the target retrieved the original message. ms_cmd is now an opaque union. LWKT makes no assumptions as to its contents. The NET code now stores nm_handler in ms_cmd as a function vector, and nm_handler has been removed from all netmsg structures. The ms_cmd function vector support nominally returns an integer error code which is intended to support synchronous/asynchronous optimizations in the future (to bypass messaging queueing and dequeueing in those situations where they can be bypassed, without messing up the messaging abstraction). The connect() predicate for which signal/abort support was added in the last commit now uses the new abort mechanism. Instead of having the handler function check whether a message represents an abort or not, a different handler vector is stored in ms_abort and run when an abort is processed (making for an easy separation of function). The large netmsg switch has been replaced by individual function vectors using the new ms_cmd function vector support. This will soon be removed entirely in favor of direct assignment of LWKT-aware PRU vectors to the message's command vector. NOTE ADDITIONAL: eventually the SYSCALL, VFS, and DEV interfaces will use the new message opaque ms_cmd 'function vector' support instead of a command index. Work by: Matthew Dillon and Jeffrey Hsu
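The two-pass abort described in this entry (original command first, then the same message again with ms_cmd = ms_abort) can be sketched as follows. Only the ms_cmd and ms_abort field names come from the entry. The struct, the handlers, and `target_run()` are hypothetical, and the requeue step is collapsed into a direct second call.

```c
struct amsg;
typedef int (*amsg_func_t)(struct amsg *);

/* Hypothetical message with separate command and abort vectors. */
struct amsg {
    amsg_func_t ms_cmd;   /* primary command handler */
    amsg_func_t ms_abort; /* handler run when the abort is processed */
    int cmd_runs;         /* counters so the sketch can be checked */
    int abort_runs;
};

/* Illustrative handlers; each only records that it ran. */
static int
do_connect(struct amsg *m)
{
    m->cmd_runs++;
    return 0;
}

static int
do_connect_abort(struct amsg *m)
{
    m->abort_runs++;
    return 0;
}

/*
 * Target side. The primary command always runs, even if the abort was
 * requested before the target picked the message up. On abort, the same
 * message is processed a second time with the abort vector installed.
 */
static void
target_run(struct amsg *m, int abort_requested)
{
    m->ms_cmd(m);
    if (abort_requested) {
        m->ms_cmd = m->ms_abort; /* stands in for the requeue step */
        m->ms_cmd(m);
    }
}
```

Keeping the abort logic in its own handler means the primary handler never has to test whether it is being called for an abort.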
Allow an inp control block to be inserted on multiple wildcard hash tables.
Cosmetic changes.
Don't need opt_tcp_input.h for TCP_DISTRIBUTED_TCBINFO anymore.
get rid of TCP_DISTRIBUTED_TCBINFO, it only added confusion.
Add header file to pull in the setting of the TCP_DISTRIBUTED_TCBINFO option.
Push the lwkt_replymsg() up one level from netisr_service_loop() to the message handler so we can explicitly reply or not reply as appropriate.
Make TCP stats per-cpu. Submitted-by: Hiten Pandya <hmp@crater.dragonflybsd.org>
per-cpu tcbinfo[]s aren't ready for prime time yet. The tcbinfo is assigned at tcp_attach time, but there is insufficient information available at this time to select the hash table and the wrong one gets assigned N-1 out of N times on MP systems (N = number of cpus), causing outgoing tcp connections to fail. Add an option, TCP_DISTRIBUTED_TCBINFO, so MP-safe tcbinfo distribution can continue to be developed without impacting users.
Only enter wildcard sockets into the wildcard hash table.
Ifdef out unused variable.
Make tcp_drain() per-cpu.
Make tcp_drain() per-cpu.
Partition the TCP connection table.
Change the "struct inpcbhead *listhead" field in "struct inpcbinfo" to "struct inpcbhead listhead" so we can have a separate list per "struct inpcbinfo" when it becomes per-cpu.
Split out wildcarded sockets from the connection hash table.
Patch for FreeBSD-SA-04:04.tcp: limit the out-of-sequence reassembly queue size to make sure we don't run out of mbufs, which would otherwise allow a DoS attack. This is the same as tcp47.patch. Checked by Robert Garrett & Joerg Sonnenberger.
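The idea of bounding the reassembly queue can be sketched as an admission check. This is an assumption-laden illustration, not the patch itself: the struct, the function name, and the rule that a segment filling the next expected hole is always admitted are choices made for the sketch.

```c
/* Hypothetical accounting for out-of-sequence segments. */
struct reass_limits {
    int cur_segs; /* segments currently held in reassembly queues */
    int max_segs; /* upper bound on held segments */
    int dropped;  /* segments refused because of the bound */
};

/*
 * Return 1 if the segment may be queued, 0 if it must be dropped.
 * A segment that starts at the next expected sequence number is always
 * admitted (assumption for this sketch), since it lets the queue drain.
 */
static int
reass_admit(struct reass_limits *rl, int fills_next_expected)
{
    if (!fills_next_expected && rl->cur_segs + 1 > rl->max_segs) {
        rl->dropped++;
        return 0;
    }
    rl->cur_segs++;
    return 1;
}
```

Without such a bound a peer can send many small out-of-order segments and pin mbufs indefinitely.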
Move <machine/in_cksum.h> to <sys/in_cksum.h>. This file is now platform independent. If we want to add extreme machine specialization later on then sys/in_cksum.h will #include machine/in_cksum.h. Move i386/i386/in_cksum.c to netinet/in_cksum.c. Note that netinet/in_cksum.c already existed but was not used by the build system at all. The move overwrites it. The new in_cksum.c is a portable, complete rewrite which references core assembly (procedure call) to do 32-bit-aligned work. See also i386/i386/in_cksum2.s.
Network threading stage 1/3: netisrs are already software interrupts, which means they already run in their own thread. This commit creates multiple supporting threads for netisrs rather than just one and code has been added to begin routing packets to particular threads based on their content. Eventually this will lead to us being able to isolate and serialize PCBs in particular threads. The tail end of the ip_input path's protocol dispatch, the UIPC (user entry) code, and listen socket have not been covered yet and still need to be serialized. A new debugging sysctl, net.inet.ip.mthread_enable, has been added. It defaults to 1. If you set this sysctl to 0 netisr processing will revert to the prior single-threaded behavior. Submitted-by: Jeffrey Hsu <hsu@FreeBSD.org> Additional-work-by: dillon
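Routing packets to threads "based on their content" usually means hashing connection identifiers so that every packet of one connection lands on the same thread. The sketch below shows that idea only; the hash function, its inputs, and the name `pick_thread()` are assumptions and do not reproduce the dispatch code from this commit.

```c
#include <stdint.h>

/*
 * Map a connection 4-tuple to a thread index in [0, nthreads).
 * The same tuple always yields the same index, which is what allows a
 * PCB to be owned by, and serialized within, a single thread.
 * nthreads must be greater than zero.
 */
static unsigned
pick_thread(uint32_t faddr, uint16_t fport,
            uint32_t laddr, uint16_t lport, unsigned nthreads)
{
    uint32_t h = faddr ^ laddr ^ (((uint32_t)fport << 16) | lport);

    h ^= h >> 16; /* fold the high bits into the low bits */
    return h % nthreads;
}
```

Any deterministic function of the tuple would do for correctness; the quality of the hash only affects how evenly load spreads across threads.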
If IPv6 doesn't need old-style prototypes, maybe it's time we took them out of IPv4's code.
Register keyword removal Approved by: Matt Dillon
LINT pass. Cleanup missed proc->thread conversions and get rid of warnings.
proc->thread stage 4: post commit cleanup. Fix minor issues when recompiling with GENERIC.
proc->thread stage 4: rework the VFS and DEVICE subsystems to take thread pointers instead of process pointers as arguments, similar to what FreeBSD-5 did. Note however that ultimately both APIs are going to be message-passing which means the current thread context will not be useable for creds and descriptor access.
proc->thread stage 2: MAJOR revamping of system calls, ucred, jail API, and some work on the low level device interface (proc arg -> thread arg). As -current did, I have removed p_cred and incorporated its functions into p_ucred. p_prison has also been moved into p_ucred and adjusted accordingly. The jail interface tests now use ucreds rather than processes. The syscall(p,uap) interface has been changed to just (uap). This is inclusive of the emulation code. It makes little sense to pass a proc pointer around which confuses the MP readability of the code, because most system call code will only work with the current process anyway. Note that eventually *ALL* syscall emulation code will be moved to a kernel-protected userland layer because it really makes no sense whatsoever to implement these emulations in the kernel. suser() now takes no arguments and only operates with the current process. The process argument has been removed from suser_xxx() so it now just takes a ucred and flags. The sysctl interface was adjusted somewhat.
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 1.73.2.31