Up to [DragonFly] / src / sys / netinet
Default branch: MAIN
- If we receive a redirect or host-dead ICMP message due to packets sent on
  TCP sockets, we need to go through all CPUs to check the per-cpu TCP inpcbs.
- If we receive a redirect ICMP message due to packets sent on UDP sockets,
  we need to go through all CPUs to free the UDP inpcbs' cached route entries.
Reported-by: pavalos@
Tested-by: pavalos@
In ip_lengthcheck(), make sure that pkthdr.len is not less than the "IP total length" field in the IP header. Change the related tests in ip_input() and ipflow_fastforward() into assertions.
Add comment about ip_lengthcheck()
pr_ctlinput is usually called when certain types of ICMP packets are received.
However, ICMP packets are processed in netisr0, which means that the thread
context in which pr_ctlinput is called is not correct. To handle this, the
following two fixes are applied:
- Add pr_ctlport to protosw and ip6protosw, which can be used to locate the
  correct msgport on which to call pr_ctlinput for a specific protocol
- All information needed by pr_ctlinput is gathered into one netmsg, and this
  netmsg is delivered synchronously (some of the information is on the stack)
Note for new protocol implementations: pr_ctlinput and pr_ctlport should be
both NULL or both non-NULL.
Obtained-from: dillon@
Tested-by: pavalos@
Add the following three running modes for network protocol threads:
1) BGL (default)
2) Adaptive BGL. Protocol threads run without BGL by default. BGL will be
   held if the received msg does not have MSGF_MPSAFE turned on in the
   ms_flags field
3) No BGL (experimental)
The code on the main path is done by dillon@
The following three sysctls and tunables are added to adjust the "mode":
  net.netisr.mpsafe_thread
  net.inet.tcp.mpsafe_thread
  net.inet.udp.mpsafe_thread
They have the same set of values:
  0 (default) -- BGL
  1 -- Adaptive BGL
  2 -- No BGL
NETISR_FLAG_MPSAFE is added (netisr.ni_flags), so that:
- netisr_queue() and schednetisr() could set MSGF_MPSAFE during msg
  initialization
- netisr_run() (called by ether_input_oncpu()) could hold BGL based on this
  flag before calling netisr's handler
PR_MPSAFE is added (protosw.pr_flags), so that tranport_processing_oncpu()
could hold BGL before calling protocol's input handler
Kernel API changes:
- The thread parameter to netmsg_service_loop() must be supplied (running
  mode) and it must have the type of "int *"
- netisr_register() takes an additional flags parameter to indicate whether
  its handler is MPSAFE (NETISR_FLAG_MPSAFE) or not
Reviewed-by: dillon@
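The three running modes can be sketched as a small userland model. Only the mode values 0/1/2 and the MSGF_MPSAFE name come from the commit message; the flag's bit value, the `bgl_*` helpers, `needs_bgl()` and `dispatch_one()` are hypothetical stand-ins, and the per-message locking shown here is a simplification of whatever the kernel actually does.

```c
#include <assert.h>

/* Mode values as listed in the commit message. */
enum run_mode { MODE_BGL = 0, MODE_ADAPTIVE_BGL = 1, MODE_NO_BGL = 2 };

/* Illustrative bit value only; the real MSGF_MPSAFE definition is not shown. */
#define SKETCH_MSGF_MPSAFE 0x0001

static int bgl_held;	/* models whether this thread holds the BGL */

static void bgl_acquire(void) { assert(!bgl_held); bgl_held = 1; }
static void bgl_release(void) { assert(bgl_held); bgl_held = 0; }

static void noop_handler(void) { }

/* Return 1 if a handler for a message with these flags must run under BGL. */
static int
needs_bgl(enum run_mode mode, int ms_flags)
{
	switch (mode) {
	case MODE_NO_BGL:
		return 0;
	case MODE_ADAPTIVE_BGL:
		/* Only messages not marked MPSAFE need the lock. */
		return (ms_flags & SKETCH_MSGF_MPSAFE) == 0;
	case MODE_BGL:
	default:
		return 1;
	}
}

/* Run one handler; returns 1 if the BGL was held while it ran. */
static int
dispatch_one(enum run_mode mode, int ms_flags, void (*handler)(void))
{
	int locked = needs_bgl(mode, ms_flags);

	if (locked)
		bgl_acquire();
	handler();
	if (locked)
		bgl_release();
	return locked;
}
```

In this model, mode 1 is the only one whose behavior depends on the message flags, which is why the netisr and protosw flags matter there.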
Add TDF_NETWORK lwkt flag, so various assertions can be performed to make sure that packets are processed in network threads (i.e. a controlled environment)
Add two tunables to run netisr and udp_thread without mplock, so experiments can be conducted in a controlled environment. By default, both tunables run netisr/udp_thread with mplock.
Fix bugs concerning cached route entry in UDP inpcb.
For an unconnected and unbound UDP socket, the first send calls in_pcbladdr()
to fix the local port, which may change the target CPU of the next send.
in_pcbladdr() has the side effect of allocating the route entry cached in the
inpcb. If the target CPU after in_pcbladdr() is no longer the current CPU, the
route entry will be accessed/freed on a non-owner CPU during later sends.
Similarly, connecting/disconnecting a UDP socket may change the target CPU too;
the target CPU may no longer be the owner of the cached route entry.
So, for the first send on an unconnected and unbound UDP socket, the target
CPU of the next send is compared with the current CPU. If they are different,
the cached route entry is freed, so the next time a packet is sent on this
socket, a new route entry owned by the correct CPU will be cached. The same
target CPU check is applied to UDP socket connect/disconnect.
Originally UDP PRU_CONNECT always happened on CPU0, which causes problems if
the following conditions are met:
- Dst of the cached route entry is different from the dst to be connected
- Cached route entry is not allocated on CPU0
This could happen if two datagrams are sent on an unbound and unconnected UDP
socket; later connecting this UDP socket will cause the cached route entry
to be freed on a different CPU. To solve this problem, PRU_CONNECT is
dispatched according to the existing [lf]{addr,port} pairs.
If in_pcbladdr() fails after altering the cached route entry, the cached route
entry is freed to make sure that freeing this cached route entry happens on
its owner CPU.
Reported-by: y0netan1@
Tested-by: y0netan1@
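The target CPU check described in this commit can be sketched as follows. This is a minimal userland model, not the kernel code: `struct rt_sketch`, `struct inpcb_sketch`, `cache_route()`, `fixup_cached_route()` and `route_survives()` are hypothetical names, and CPU ownership is modeled as a plain integer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a route entry and the inpcb that caches it. */
struct rt_sketch { int owner_cpu; };
struct inpcb_sketch { struct rt_sketch *cached_rt; };

/* Cache a route owned by `cpu` (models the in_pcbladdr() side effect). */
static void
cache_route(struct inpcb_sketch *inp, int cpu)
{
	assert(inp->cached_rt == NULL);
	inp->cached_rt = malloc(sizeof(*inp->cached_rt));
	assert(inp->cached_rt != NULL);
	inp->cached_rt->owner_cpu = cpu;
}

/*
 * Called on the current (owner) CPU once the target CPU of the next
 * operation is known. If the socket is about to move to another CPU,
 * drop the cached route here, while we are still on its owner.
 */
static void
fixup_cached_route(struct inpcb_sketch *inp, int cur_cpu, int next_cpu)
{
	if (inp->cached_rt == NULL)
		return;
	assert(inp->cached_rt->owner_cpu == cur_cpu);
	if (next_cpu != cur_cpu) {
		free(inp->cached_rt);
		inp->cached_rt = NULL;
	}
}

/* Demo helper: returns 1 if the cached route survives the check. */
static int
route_survives(int cur_cpu, int next_cpu)
{
	struct inpcb_sketch inp = { NULL };
	int kept;

	cache_route(&inp, cur_cpu);
	fixup_cached_route(&inp, cur_cpu, next_cpu);
	kept = (inp.cached_rt != NULL);
	free(inp.cached_rt);
	return kept;
}
```

The point of the design is that the free always happens on the owner CPU; the next send on the new CPU then starts with an empty cache and allocates its own entry.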
Make divert(4) sockets dispatch mbufs to the correct lwkt port for further
processing (ip_{input,output}):
- Add mbuf** function parameter to protosw.pr_mport()
- Pass 'addr' to pr_mport() in so_pru_send(); udp_soport() is adjusted
accordingly
- Add an additional parameter to ip_mport(), so it can be called with both
incoming and outgoing packets. The processing for outgoing UDP packets
matches udp_soport()
- Add div_soport() as IPPROTO_DIVERT's pr_mport()
o Delegate non-PRU_SEND operation to cpu0_soport()
o Move receiving interface setting up code from div_output() into this
function, so ip_mport() could be called
o Use ip_mport() to find the target lwkt port
For ip_lengthcheck():
- Centralize check failure processing.
- Set the passed-in mbuf pointer to NULL if the check fails, so that callers
  won't need to do that.
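A minimal userland sketch of this pattern: one failure label that frees the packet and clears the caller's pointer. `struct pkt_sketch`, `pkt_make()`, `pkt_free()` and `lengthcheck_sketch()` are hypothetical; the header checks are generic IPv4 sanity checks (header length at least 20 bytes, total length at least the header length, buffer at least the total length) and are not copied from the real ip_lengthcheck().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define IP_MIN_HLEN 20	/* minimum IPv4 header length in bytes */

/* Hypothetical flat stand-in for an mbuf chain. */
struct pkt_sketch {
	size_t   len;	/* plays the role of pkthdr.len */
	uint8_t *data;
};

static void
pkt_free(struct pkt_sketch *p)
{
	if (p != NULL) {
		free(p->data);
		free(p);
	}
}

/* Build a zeroed packet with the given IHL (in 32-bit words) and total length. */
static struct pkt_sketch *
pkt_make(size_t buflen, unsigned ihl, unsigned total)
{
	struct pkt_sketch *p = malloc(sizeof(*p));

	assert(p != NULL && buflen >= 4);
	p->data = calloc(buflen, 1);
	assert(p->data != NULL);
	p->len = buflen;
	p->data[0] = (uint8_t)(0x40 | (ihl & 0x0f));	/* version 4 + IHL */
	p->data[2] = (uint8_t)(total >> 8);		/* total length, big endian */
	p->data[3] = (uint8_t)(total & 0xff);
	return p;
}

/* Returns 1 if the packet passes; on failure frees it and sets *pp to NULL. */
static int
lengthcheck_sketch(struct pkt_sketch **pp)
{
	struct pkt_sketch *p = *pp;
	size_t hlen, total;

	if (p->len < IP_MIN_HLEN)
		goto failed;
	hlen = (size_t)(p->data[0] & 0x0f) << 2;
	if (hlen < IP_MIN_HLEN || hlen > p->len)
		goto failed;
	total = ((size_t)p->data[2] << 8) | p->data[3];
	if (total < hlen)
		goto failed;
	if (p->len < total)	/* buffer shorter than the IP total length */
		goto failed;
	return 1;

failed:
	/* Single exit for every failed check. */
	pkt_free(p);
	*pp = NULL;
	return 0;
}
```

Because the callee clears the pointer, a caller that accidentally touches the packet after a failed check dereferences NULL instead of freed memory.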
Use m_freem() to free the whole mbuf chain. Confirmed-by: hsu@
Remove weird license clause which has expired.
We can only do upper-layer protocol length checks on the first fragment.
Correct the th_off check against ip_len. The check in ip_demux occurs before ip_len is adjusted so we have to add iphlen into the equation. We forgot to do this when the code was originally moved from tcp_input to ip_demux (in tcp_input the check occurs after ip_len is adjusted). This fixes a panic assertion in tcp_input when a mangled packet is received. Reported-by: Joe Talbott <josepht@cstone.net>
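The arithmetic behind this fix can be illustrated with two small predicates. The function names and the TCP_MIN_HLEN constant are hypothetical; `thoff` is the TCP header length in bytes (th_off << 2), and the only fact taken from the commit is that ip_len still includes the IP header when the demux check runs.

```c
#include <assert.h>

#define TCP_MIN_HLEN 20	/* minimum TCP header length in bytes */

/* Check as written for a length that no longer includes the IP header. */
static int
thoff_ok_adjusted(int payload_len, int thoff)
{
	return thoff >= TCP_MIN_HLEN && thoff <= payload_len;
}

/* Check at demux time: ip_len still counts the IP header, so add iphlen. */
static int
thoff_ok_demux(int ip_len, int iphlen, int thoff)
{
	return thoff >= TCP_MIN_HLEN && iphlen + thoff <= ip_len;
}
```

For a 40-byte datagram with a 20-byte IP header, a claimed 24-byte TCP header cannot fit. Feeding the unadjusted ip_len to the first predicate accepts it, which is the kind of mangled packet that later trips the assertion.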
Now that 'so_pcb' is properly declared as a 'void *', remove a layer of indirection and directly use 'so->so_pcb' in place of 'sotoinpcb(so)'.
When a PCMCIA networking card is removed the IF code may free() the network interface before processing has completed on pending packets, leaving a dangling pointer in the mbuf and causing a crash. The two solutions are to either ref-count the network interface on a per-packet basis or to synchronize against consumers of the packet. ref-counting is very expensive in an MP system so we have chosen to synchronize against consumers by sending a NOP message to all protocol processing threads and waiting for it to be replied. This only occurs when an interface is being brought down and is not expected to introduce any performance issues. Crash-Reported-by: Jonathon McKitrick <jcm@FreeBSD-uk.eu.org>
Clean up the routing and networking code before I parallelize routing.
Separate out the length checks from IP dispatch and also do them along the IPSEC path to the protocol processing routines. Reported by: Andrew Atrens <atrens@nortelnetworks.com>
Since ip_input() truncates the packet to ip->ip_len prior to entering the protocol stack, ip_demux must incorporate an ip_len check in its tcp/udp prechecks to avoid an assertion in the tcp/udp stacks if the packet is malformed. This is a temporary hack until Jeff and I can come up with a better way to do per-protocol mbuf checks. Basically the issue involved is that we want to pull up the entire tcp/udp/etc... header in order to be able to trivially demux an IP packet (choose which protocol thread to route it to). Since we have to m_pullup the packet in the demux we would rather do all header length checks in the demux as well so as not to have to repeat them in the protocol code. Reported-by: Sven Willenberger <sven@dmv.com> In-discussion-with: Jeffrey Hsu
tcp_input()'s DELAY_ACK() code checks to see if the delayed ack timer is running and if it is not it starts it and returns rather than issue an ack. If the timer is already running tcp_input() will generate an immediate ack, resulting in one ack every other packet. This every-other-packet ack is usually required to ensure that the window does not close too much and stall the sender, but it really only exists because the tcp stack does not look ahead to see if there are other incoming packets that need to be processed that might themselves require additional acks. For optimal operation we really want to process all the pending TCP packets for the connection before sending any 'normal' acks.
Many ethernet interfaces, including and most especially GigE interfaces, rate-limit their interrupts. This results in several packets being moved from the RX ring to the TCP/IP stack all at once, in a batch.
Give the TCP stack its own netisr dispatcher loop rather than using the generic netisr dispatcher loop. The TCP dispatcher loop will call an additional routine, tcp_willblock(), after all messages queued to the TCP protocol stack have been exhausted. When tcp_input() needs to send an ack in the normal header-prediction case it now places the TCPCB on a queue rather than send an immediate ack. tcp_willblock() processes this queue and calls tcp_output() to send the actual ack.
The result is that on a GigE interface which typically queues 8+ packets per interrupt, a TCP stream will only be acked once per ~8 packets rather than 4 times (every other packet) per ~8 packets. This *GREATLY* reduces TCP protocol overhead and network ack traffic on both ends of the connection.
NOTE: a later commit will deal with pure window space updates which generate an additional ACK per ~8 packets when the user program drains the buffer.
Reviewed-by: Jeffrey Hsu <hsu@crater.dragonflybsd.org>
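The queue-then-drain idea can be modeled in a few lines of userland C. Everything here is a hypothetical sketch: `struct tcpcb_sketch`, the singly linked `ack_queue`, and the `*_sketch` functions only mimic the roles of tcp_input(), tcp_willblock() and tcp_output() as described in the commit message, with an ack reduced to a counter increment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced TCP control block. */
struct tcpcb_sketch {
	int acks_sent;			/* how many acks went out */
	int queued;			/* already on the pending-ack queue? */
	struct tcpcb_sketch *next;
};

static struct tcpcb_sketch *ack_queue;	/* connections that owe an ack */

/* Input path: instead of acking now, note that this connection owes an ack. */
static void
tcp_input_sketch(struct tcpcb_sketch *tp)
{
	if (!tp->queued) {
		tp->queued = 1;
		tp->next = ack_queue;
		ack_queue = tp;
	}
}

/* Called once the dispatcher has no more queued messages: flush the acks. */
static void
tcp_willblock_sketch(void)
{
	struct tcpcb_sketch *tp;

	while ((tp = ack_queue) != NULL) {
		ack_queue = tp->next;
		tp->next = NULL;
		tp->queued = 0;
		tp->acks_sent++;	/* stands in for tcp_output() */
	}
}

/* Process a batch of npackets for one connection; return acks generated. */
static int
acks_for_batch(int npackets)
{
	struct tcpcb_sketch tp = { 0, 0, NULL };
	int i;

	for (i = 0; i < npackets; i++)
		tcp_input_sketch(&tp);
	tcp_willblock_sketch();
	return tp.acks_sent;
}
```

In this model a batch of any size produces a single ack per connection, compared with roughly one ack per two packets under the every-other-packet behavior described above.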
Fix two serious bugs in the IP demux code. First, if ip_mport() m_pullup()'s an mbuf, the new/modified mbuf is not returned to the caller and the caller may wind up using a stale/freed mbuf. Second, ip_mport() was not consistently freeing mbufs which could lead to both a memory leak and a double free. Reported-by: YONETANI Tomokazu <qhwt+dragonfly-bugs@les.ath.cx> (panic: TCP header not in one mbuf).
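The first bug is a general hazard of pullup-style routines, which may free the buffer they were given and hand back a different one (or NULL on failure). A minimal sketch of the usual remedy, passing a pointer to the caller's pointer, is shown below; `struct buf_sketch`, `buf_new()`, `pullup_sketch()` and `demux_sketch()` are hypothetical stand-ins rather than the real mbuf API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical buffer: `contig` bytes are contiguous out of `total`. */
struct buf_sketch {
	size_t contig;
	size_t total;
};

static struct buf_sketch *
buf_new(size_t contig, size_t total)
{
	struct buf_sketch *b = malloc(sizeof(*b));

	assert(b != NULL);
	b->contig = contig;
	b->total = total;
	return b;
}

/*
 * Pullup-style helper: returns a buffer with at least `need` contiguous
 * bytes. It may free the original and return a new one; on failure it
 * frees the original and returns NULL.
 */
static struct buf_sketch *
pullup_sketch(struct buf_sketch *b, size_t need)
{
	struct buf_sketch *nb;

	if (b->contig >= need)
		return b;
	if (b->total < need) {
		free(b);
		return NULL;
	}
	nb = buf_new(need, b->total);
	free(b);
	return nb;
}

/*
 * Demux step taking a pointer to the caller's pointer, so the caller
 * always ends up with either the current buffer or NULL, never a stale one.
 */
static int
demux_sketch(struct buf_sketch **bp, size_t hdrlen)
{
	*bp = pullup_sketch(*bp, hdrlen);
	return *bp != NULL;
}
```

Writing the result back unconditionally also settles ownership: after a failed call the caller holds NULL and has nothing left to free, which avoids both the leak and the double free.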
Add the standard DragonFly copyright notice to go along with mine. Approved by: Matt
In ip_mport() (IP packet demux code), check the minimum length requirement for an IP packet *before* the defrag case rather than after. This avoids a panic in ip_input.c where the length is already assumed to be reasonable. Bug-report-by: Allan Fields <bsd@afields.ca> In-consultation-with: Jeffrey Hsu <hsu@leaf.dragonflybsd.org>
The default protocol threads also need the check for same thread synchronous execution. Reported by: YONETANI Tomokazu <qhwt+dragonfly-bugs@les.ath.cx>
Replicate the TCP listen table to give each cpu its own copy.
Pass more information down to the protocol-specific socket dispatch function to use if desired.
Revamp the initial lwkt_abortmsg() support to normalize the abstraction. Now a message's primary command is always processed by the target even if an abort is requested before the target has retrieved the message from the message port. The message will then be requeued and the abort command copied into lwkt_msg_t->ms_cmd. Thus the target is always guaranteed to see the original message and then a second, abort message (the same message with ms_cmd = ms_abort) regardless of whether the abort was requested before or after the target retrieved the original message.
ms_cmd is now an opaque union. LWKT makes no assumptions as to its contents. The NET code now stores nm_handler in ms_cmd as a function vector, and nm_handler has been removed from all netmsg structures. The ms_cmd function vector support nominally returns an integer error code which is intended to support synchronous/asynchronous optimizations in the future (to bypass messaging queueing and dequeueing in those situations where they can be bypassed, without messing up the messaging abstraction).
The connect() predicate for which signal/abort support was added in the last commit now uses the new abort mechanism. Instead of having the handler function check whether a message represents an abort or not, a different handler vector is stored in ms_abort and run when an abort is processed (making for an easy separation of function).
The large netmsg switch has been replaced by individual function vectors using the new ms_cmd function vector support. This will soon be removed entirely in favor of direct assignment of LWKT-aware PRU vectors to the message's command vector.
NOTE ADDITIONAL: eventually the SYSCALL, VFS, and DEV interfaces will use the new message opaque ms_cmd 'function vector' support instead of a command index.
Work by: Matthew Dillon and Jeffrey Hsu
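The ordering guarantee (original command first, then the abort command on the same message) can be sketched in userland. This is an illustration under assumptions: the field names ms_cmd and ms_abort come from the commit message, but here they are plain function pointers rather than an opaque union, there is no real message port, and `struct msg_sketch`, `target_service()` and `run_connect()` are hypothetical.

```c
#include <assert.h>
#include <string.h>

struct msg_sketch;
typedef int (*msg_func_t)(struct msg_sketch *);

/* Hypothetical message: command vector, abort vector, and a trace. */
struct msg_sketch {
	msg_func_t ms_cmd;		/* primary command */
	msg_func_t ms_abort;		/* run if an abort was requested */
	int        abort_requested;
	char       trace[8];		/* records which handlers ran */
};

static void
trace_add(struct msg_sketch *m, char c)
{
	size_t n = strlen(m->trace);

	if (n + 1 < sizeof(m->trace)) {
		m->trace[n] = c;
		m->trace[n + 1] = '\0';
	}
}

static int connect_cmd(struct msg_sketch *m)   { trace_add(m, 'C'); return 0; }
static int connect_abort(struct msg_sketch *m) { trace_add(m, 'A'); return 0; }

/*
 * Target side: the primary command always runs. If an abort was
 * requested (even before the message was retrieved), the same message
 * is processed a second time with ms_cmd replaced by ms_abort.
 */
static void
target_service(struct msg_sketch *m)
{
	m->ms_cmd(m);
	if (m->abort_requested) {
		m->abort_requested = 0;
		m->ms_cmd = m->ms_abort;	/* models the requeue */
		m->ms_cmd(m);
	}
}

/* Demo: returns the handler trace for a connect with or without abort. */
static const char *
run_connect(int abort_before_receive)
{
	static struct msg_sketch m;

	memset(&m, 0, sizeof(m));
	m.ms_cmd = connect_cmd;
	m.ms_abort = connect_abort;
	m.abort_requested = abort_before_receive;
	target_service(&m);
	return m.trace;
}
```

Keeping the abort logic in its own vector means the primary handler never has to test whether it is being asked to abort.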
Send connects to the right processor.
Fix typo with last minute change in last commit.
Push the lwkt_replymsg() up one level from netisr_service_loop() to the message handler so we can explicitly reply or not reply as appropriate.
Make TCP stats per-cpu. Submitted-by: Hiten Pandya <hmp@crater.dragonflybsd.org>
Consistently use "foreign" and "local", which are invariant on the host machine, instead of "src" and "dst", which vary according to whether a packet is being received or sent.
Fix byte-order.
Consolidate length checks in ip_demux().
Remove the ip_mthread_enable sysctl option. Enough code has been converted over to threads and message-passing that true dispatching is required for proper synchronization. Approved by: Matt Dillon
Do all the length checks before returning even if "ip_mthread_enable" is not enabled.
Consolidate length checks in ip_demux().
Make tcp_drain() per-cpu.
Partition the TCP connection table.
Dispatch upper-half protocol request handling.
Verify code assumption on number of processors with a kernel assertion.
Use power of 2 masking to make packet hash function fast.
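Power-of-2 masking replaces a modulo with a single AND, which only works when the table size is a power of two. A minimal sketch, with hypothetical helper names (`is_pow2()`, `pick_cpu()`) and no claim about the actual hash function used in the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* True if n is a non-zero power of two. */
static int
is_pow2(unsigned n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Map a packet hash to a slot in [0, nslots). For a power-of-two
 * nslots, hash & (nslots - 1) equals hash % nslots without a divide.
 */
static unsigned
pick_cpu(uint32_t hash, unsigned nslots)
{
	assert(is_pow2(nslots));
	return hash & (nslots - 1);
}
```

The assertion documents the precondition; with a slot count that is not a power of two, the mask would skip some slots entirely.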
This is a major cleanup of the LWKT message port code. The messaging code
is getting closer to being directly useable by userland. With these changes
message/port operations are now far better abstracted than they were before.
* Stale fields have been removed from struct lwkt_msg.
* lwkt_abortmsg() has been revamped to make it easier to support.
* lwkt_waitmsg has been converted to a port function.
* mp_*port() function fields have been renamed for better readability.
* ms_cleanupmsg has been removed from struct lwkt_msg.
* Union sysmsg is now struct sysmsg.
* A copyout function has been added to struct sysmsg.
* The system calls have been regenerated.
What happens when you mod a negative number? Mask off the hash value before modding so the returned value doesn't overflow the thread array when hashing packets. Also disable the multi-threaded networking by default. Setting net.inet.ip.mthread_enable to 1 will enable it.
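To answer the question in the commit message: in C (C99 and later) the result of `%` takes the sign of the dividend, so a negative hash yields a negative index. The sketch below contrasts the unmasked and masked forms; NTHREADS, the 0x7fffffff mask and the function names are assumptions for illustration, not the values used in the actual fix.

```c
#include <assert.h>

#define NTHREADS 5	/* hypothetical size of the thread array */

/* Unmasked: a negative hash produces a negative, out-of-range index. */
static int
index_unmasked(int hash)
{
	return hash % NTHREADS;
}

/* Masked: clear the sign bit first so the result is in [0, NTHREADS). */
static int
index_masked(int hash)
{
	return (int)(((unsigned)hash & 0x7fffffffu) % NTHREADS);
}
```

Here `index_unmasked(-7)` is -2, which would index before the start of the array, while `index_masked(-7)` stays within bounds.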
Network threading stage 1/3: netisrs are already software interrupts, which means they already run in their own thread. This commit creates multiple supporting threads for netisrs rather than just one and code has been added to begin routing packets to particular threads based on their content. Eventually this will lead to us being able to isolate and serialize PCBs in particular threads. The tail end of the ip_input path's protocol dispatch, the UIPC (user entry) code, and listen socket have not been covered yet and still need to be serialized. A new debugging sysctl, net.inet.ip.mthread_enable, has been added. It defaults to 1. If you set this sysctl to 0, netisr processing will revert to the prior single-threaded behavior. Submitted-by: Jeffrey Hsu <hsu@FreeBSD.org> Additional-work-by: dillon