Keyword substitution: kv
Default branch: MAIN
Remove MSGF_PRIORITY support. The flag testing and message queue selection on the hot code path introduce a noticeable performance regression during IP forwarding (from 667Kpps to 655Kpps, about 1.8%, with 64-byte packets and fastforwarding enabled on a Phenom 9550).
- During ARP resolution, the current thread's msgport, or the current CPU's netisr msgport if the current thread is not TDF_NETWORK, is recorded in addition to the mbuf.
- When ARP resolution finishes, the recorded mbuf is dispatched to the recorded msgport to do network output using a priority message.
This eliminates the last network output in route threads.
Reviewed-by: dillon@
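The hold-and-dispatch scheme described in this entry can be modeled in userland roughly as follows. This is only a sketch under stated assumptions: `struct port`, `struct arp_entry`, `arp_hold()` and `arp_resolved()` are hypothetical names rather than the kernel's, and a delivery counter stands in for a real message port.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a message port: it only counts deliveries. */
struct port {
	int delivered;
};

/* Hypothetical ARP entry holding one packet while resolution is pending. */
struct arp_entry {
	const void  *held_pkt;   /* packet waiting for the link-layer address */
	struct port *held_port;  /* port recorded when the packet was held */
};

/* Record the packet together with the port that should later send it. */
static void
arp_hold(struct arp_entry *e, const void *pkt, struct port *cur)
{
	e->held_pkt = pkt;
	e->held_port = cur;
}

/*
 * Resolution finished: hand the held packet back to the recorded port
 * instead of transmitting from the resolving thread.
 * Returns 1 if a packet was dispatched, 0 if nothing was held.
 */
static int
arp_resolved(struct arp_entry *e)
{
	if (e->held_pkt == NULL)
		return 0;
	assert(e->held_port != NULL);
	e->held_port->delivered++;
	e->held_pkt = NULL;
	e->held_port = NULL;
	return 1;
}
```

The point of the model is that the thread completing the resolution never performs the output itself; it only forwards the packet to whichever port was recorded at hold time.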
- fix UP build
- Add tunable (net.link.ether.inet.arp_mpsafe) to register ARP as MPSAFE netisr. - Hold BGL on CARP in arp input path.
Split arprequest() into two parts, arpreq_alloc() and arpreq_send(). arprequest() simply calls these two functions sequentially. Add arprequest_async(), which allocates the ARP request using arpreq_alloc() and then dispatches the real sending (arpreq_send()) to the current CPU's netisr. Callers of arprequest_async() do not need to worry about the ifp's serializer state. This function also makes sure that the network output happens in a TDF_NETWORK kernel thread. Let arp_ifinit(), arp_ifinit2() and arp_rtrequest() call arprequest_async().
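A rough userland model of the alloc/send split is sketched below. The function names follow the commit message, but their signatures, `struct arpreq`, `struct netisr_queue` and `netisr_drain()` are assumptions made for illustration; the real code works on mbufs and netisr message ports.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request object; the real one would be an mbuf. */
struct arpreq {
	unsigned int target_ip;
	int          sent;
};

#define QLEN 8

/* Hypothetical queue standing in for the current CPU's netisr port. */
struct netisr_queue {
	struct arpreq *slot[QLEN];
	int            count;
};

static struct arpreq *
arpreq_alloc(unsigned int target_ip)
{
	struct arpreq *r = malloc(sizeof(*r));

	if (r != NULL) {
		r->target_ip = target_ip;
		r->sent = 0;
	}
	return r;
}

static void
arpreq_send(struct arpreq *r)
{
	r->sent = 1;	/* real code would call the interface output routine */
}

/* Synchronous form: allocate, then send right away. Returns 0 on success. */
static int
arprequest(unsigned int target_ip)
{
	struct arpreq *r = arpreq_alloc(target_ip);

	if (r == NULL)
		return -1;
	arpreq_send(r);
	free(r);
	return 0;
}

/* Asynchronous form: allocate now, leave the send to the network thread. */
static int
arprequest_async(struct netisr_queue *q, unsigned int target_ip)
{
	struct arpreq *r;

	if (q->count == QLEN)
		return -1;
	r = arpreq_alloc(target_ip);
	if (r == NULL)
		return -1;
	q->slot[q->count++] = r;
	return 0;
}

/* Runs in the modeled network thread: perform the deferred sends in order. */
static int
netisr_drain(struct netisr_queue *q)
{
	int i, n = q->count;

	for (i = 0; i < n; i++) {
		arpreq_send(q->slot[i]);
		free(q->slot[i]);
		q->slot[i] = NULL;
	}
	q->count = 0;
	return n;
}
```

Because the caller of the asynchronous form only allocates and enqueues, it never touches the output path and so does not need to care which serializers it happens to hold.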
- Constify 'enaddr' - Minor style change
For SMP kernels:
- Don't call arp_update_oncpu() again in in_arpinput()
- Pass dologging to arp_update_oncpu() only if the current CPU is CPU0, and remove the redundant cpu==0 checks
Reviewed-by: dillon@
- Minor style and white space changes - Break long line
Add NETISR_FLAG_NOTMPSAFE, which can be used as the last parameter to netisr_register(). It is more expressive and less error-prone than 0. Suggested-by: hsu@
Add the following three running modes for network protocol threads:
1) BGL (default)
2) Adaptive BGL. Protocol threads run without the BGL by default. The BGL will be held if the received msg does not have MSGF_MPSAFE set in its ms_flags field
3) No BGL (experimental)
The code on the main path is done by dillon@
The following three sysctls and tunables are added to adjust the "mode":
net.netisr.mpsafe_thread
net.inet.tcp.mpsafe_thread
net.inet.udp.mpsafe_thread
They take the same set of values:
0 (default) -- BGL
1 -- Adaptive BGL
2 -- No BGL
NETISR_FLAG_MPSAFE is added (netisr.ni_flags), so that:
- netisr_queue() and schednetisr() can set MSGF_MPSAFE during msg initialization
- netisr_run() (called by ether_input_oncpu()) can hold the BGL based on this flag before calling the netisr's handler
PR_MPSAFE is added (protosw.pr_flags), so that transport_processing_oncpu() can hold the BGL before calling the protocol's input handler
Kernel API changes:
- The thread parameter to netmsg_service_loop() must be supplied (running mode) and it must have the type "int *"
- netisr_register() takes an additional flags parameter to indicate whether its handler is MPSAFE (NETISR_FLAG_MPSAFE) or not
Reviewed-by: dillon@
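The decision each protocol thread makes per message under the three modes can be sketched as below. The numeric mode values mirror the sysctl settings listed in this entry; the enum names, the `need_bgl()` helper and the bit value given to MSGF_MPSAFE are assumptions for illustration only.

```c
/* Values mirror the sysctl settings above; the enum names are hypothetical. */
enum netisr_mode {
	MODE_BGL      = 0,	/* always run under the BGL */
	MODE_ADAPTIVE = 1,	/* take the BGL only for non-MPSAFE messages */
	MODE_NOBGL    = 2	/* never take the BGL (experimental) */
};

#define MSGF_MPSAFE 0x0001	/* bit value chosen for illustration only */

/* Return 1 if the thread should hold the BGL while handling this message. */
static int
need_bgl(enum netisr_mode mode, unsigned int ms_flags)
{
	switch (mode) {
	case MODE_BGL:
		return 1;
	case MODE_ADAPTIVE:
		return (ms_flags & MSGF_MPSAFE) == 0;
	case MODE_NOBGL:
		return 0;
	}
	return 1;	/* unknown mode: fall back to the conservative choice */
}
```

In adaptive mode the cost of the BGL is paid only by handlers that were not registered as MPSAFE, which is what lets individual protocols be converted one at a time.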
- Avoid excessive goto - Adjust arphdr pointer, if we need to do another m_pullup() - Indentation in switch block Obtained-from: FreeBSD w/ mod
- ifnet.if_output() should be called without ifnet.if_serializer being held. Add an assertion about it in ether_output().
- ether_output_frame() should be called without ifnet.if_serializer being held. Add an assertion in it.
- arp_ifinit() will be called with ifnet.if_serializer held. To prevent serializer recursion, ifnet.if_serializer is released before calling arprequest(), which assumes the caller does not hold the output interface's serializer.
- ifnet.if_serializer can't be held when calling arp_ifinit2().
Reported-by: dillon@
Reduce ifnet.if_serializer contention on the output path:
- Push ifnet.if_serializer holding down into each ifnet.if_output implementation
- Add a serializer into ifaltq, which is used to protect the send queue instead of its parent's if_serializer. This change has the following implications:
  o On the output path, enqueueing packets and calling ifnet.if_start are decoupled
  o In device drivers, the poll->dev_encap_ok->dequeue operation sequence is no longer safe; dequeue->dev_encap_fail->prepend should be used instead
  This serializer will be held by using lwkt_serialize_adaptive_enter()
- Add an altq_started field into ifaltq, which is used to interlock the calling of its parent's if_start, to reduce ifnet.if_serializer contention. if_devstart(), a helper function which utilizes ifaltq.altq_started, is added to reduce code duplication in ethernet device drivers.
- Add if_cpuid into ifnet. This field indicates on which CPU the device driver's interrupt will happen.
- Add ifq_dispatch(). This function will try to hold ifnet.if_serializer in order to call ifnet.if_start. If this attempt fails, this function will schedule ifnet.if_start to be called on the CPU located by ifnet.if_start_cpuid. if_start_nmsg, which is a per-CPU netmsg, is added to ifnet to facilitate ifnet.if_start scheduling. ifq_dispatch() is currently called by ether_output_frame()
- Use ifq_classic_ functions, if altq is not enabled
- Fix various device driver bugs in their if_start implementations
- Add ktr for ifq classic enqueue and dequeue
- Add ktr for ifnet.if_start
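The dequeue->dev_encap_fail->prepend sequence recommended for drivers can be illustrated with a toy send queue. Everything here is hypothetical (`struct sendq`, `sq_dequeue()`, `sq_prepend()`, `dev_start()`, and free descriptors modeled as a plain counter), and the queue's own serializer is left out to keep the sketch short.

```c
#define QCAP 4

/* Hypothetical fixed-size send queue of packet ids; index 0 is the head. */
struct sendq {
	int pkt[QCAP];
	int len;
};

/* Remove the head packet. Returns 1 and stores it in *out, or 0 if empty. */
static int
sq_dequeue(struct sendq *q, int *out)
{
	int i;

	if (q->len == 0)
		return 0;
	*out = q->pkt[0];
	for (i = 1; i < q->len; i++)
		q->pkt[i - 1] = q->pkt[i];
	q->len--;
	return 1;
}

/* Put a packet back at the head. Returns 0 if the queue is full. */
static int
sq_prepend(struct sendq *q, int p)
{
	int i;

	if (q->len == QCAP)
		return 0;
	for (i = q->len; i > 0; i--)
		q->pkt[i] = q->pkt[i - 1];
	q->pkt[0] = p;
	q->len++;
	return 1;
}

/*
 * Hypothetical start routine. hw_slots models free TX descriptors.
 * The packet is dequeued first; if encapsulation fails it is prepended
 * so ordering is preserved. Returns the number of packets sent.
 */
static int
dev_start(struct sendq *q, int hw_slots)
{
	int sent = 0, p;

	while (sq_dequeue(q, &p)) {
		if (hw_slots == 0) {	/* encapsulation failed */
			sq_prepend(q, p);
			break;
		}
		hw_slots--;
		sent++;
	}
	return sent;
}
```

Peeking at the head, encapsulating, and only then dequeueing is unsafe once the queue has its own lock, because another CPU may change the head between the peek and the dequeue; taking ownership first avoids that window.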
Parallelize access to ifnet.if_addrhead by duplicating the list itself on each CPU; each list element points to the ifaddr:
- Add SI_SUB_PRE_DRIVERS before SI_SUB_DRIVERS, so action can be taken before drivers' initialization (mainly before the NIC driver's if_attach())
- Move netisr_init() to the FIRST of SI_SUB_PRE_DRIVERS, so that netmsg_service_port_init() can be called at an earlier stage of system initialization.
- Create one thread on each CPU to propagate changes to ifnet.if_addrhead. Their thread ports are registered with netmsg_service_port_init() for port syncing operation.
- Changes to ifnet.if_addrhead begin in netisr0, i.e. a series of changes to ifnet.if_addrhead is serialized by netisr0
- ifaddr's refcnt is moved to its list elements, i.e. per-CPU refcnt. They are initialized to 1 instead of 0.
- A magic field is added to the ifaddr list element to make sure that IFAREF and IFAFREE are called on a valid ifaddr list element. This field is initialized to a magic value and is wiped out once the list element's refcnt drops to 0
- To close the gap between testing and freeing, once the ifaddr list element's refcnt drops to 0, ifa_portfn(0) (a thread's port on CPU0) is poked to check whether the ifaddr is referenced on other CPUs; if not, the ifaddr is freed on ifa_portfn(0)
Reviewed-by: dillon@ (earlier version)
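The per-CPU reference counting with a magic field might look roughly like this in a single-threaded model. All names (`struct ifa_container`, `struct ifaddr_model`, `ifa_init()`, `ifa_ref()`, `ifa_free()`) and the magic constant are assumptions, and the cross-CPU check that the commit performs via a message to CPU0's port is collapsed into a plain loop.

```c
#include <assert.h>

#define NCPU      2
#define IFA_MAGIC 0x49464143u	/* arbitrary illustrative value */

/* Hypothetical per-CPU list element carrying its own reference count. */
struct ifa_container {
	unsigned int magic;
	int          refcnt;
};

struct ifaddr_model {
	struct ifa_container per_cpu[NCPU];
	int                  freed;
};

/* Each per-CPU element starts valid, with a reference count of 1. */
static void
ifa_init(struct ifaddr_model *ifa)
{
	int i;

	for (i = 0; i < NCPU; i++) {
		ifa->per_cpu[i].magic = IFA_MAGIC;
		ifa->per_cpu[i].refcnt = 1;
	}
	ifa->freed = 0;
}

/* Take a reference on the given CPU's element; it must still be valid. */
static void
ifa_ref(struct ifaddr_model *ifa, int cpu)
{
	assert(ifa->per_cpu[cpu].magic == IFA_MAGIC);
	ifa->per_cpu[cpu].refcnt++;
}

/*
 * Drop a reference. When this CPU's count reaches 0 the magic is wiped,
 * and the address is freed only if no other CPU still holds a reference.
 */
static void
ifa_free(struct ifaddr_model *ifa, int cpu)
{
	struct ifa_container *c = &ifa->per_cpu[cpu];
	int i;

	assert(c->magic == IFA_MAGIC);
	if (--c->refcnt > 0)
		return;
	c->magic = 0;
	for (i = 0; i < NCPU; i++) {
		if (ifa->per_cpu[i].refcnt > 0)
			return;
	}
	ifa->freed = 1;
}
```

Keeping the count per CPU means the common reference and release operations touch only CPU-local state; the more expensive cross-CPU check happens only when a local count drops to zero.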
Fix arprequest serialization. arprequest() calls ifp->if_output() without locally grabbing the respective serializer, so ASSERT_SERIALIZED at the beginning of the function. Grab the serializer at arp_rtrequest() when it calls arprequest(). Reviewed-by: sephe@
Kill token ring remainder.
Nuke FDDI support.
Nuke token ring support. This also means one blob less in DragonFly.
Nuke ARCnet support.
Bring CARP into the tree. CARP = Common Address Redundancy Protocol, which allows an IP address to hot switch to backup machine(s) when the master goes offline. Submitted-by: Baptiste Ritter <firstname.lastname@example.org>, Jonathan, and Nicolas Testing-by: Thomas Nikolajsen, Gergo Szakal Obtained-from: OpenBSD, NetBSD, and FreeBSD
Add lwkt_sleep() to formalize a shortcut numerous bits of code have been using for a while, which is to directly deschedule oneself and switch away. This method of blocking requires a direct lwkt_schedule() call to reschedule the thread and is primarily used by the message port abstraction. Change the psignal code to check TDF_SINTR in the thread flags instead of checking MSGPORTF_WAITING in the thread's private message port. The lwkt_waitmsg() and lwkt_waitport() functions use the same msgport backend function (mp_waitport). Separate the backend into two functions, mp_waitport and mp_waitmsg, and allow tsleep flags to be passed in instead of flagging interruptability in the lwkt_msg flags. Optimize the lwkt_waitmsg() backends - in the fully synchronous critical path case no critical sections or spinlocks are required at all.
* Greatly reduce the complexity of the LWKT messaging and port abstraction. Significantly reduce the overhead of the subsystem.
* The message abort algorithm has been rewritten. It now sends a separate message to issue the abort instead of trying to requeue the original message. This also means the TAILQ embedded in the lwkt_msg structure can be used by unrelated code during processing of the message.
* Numerous MSGF_ flags have been removed, and all the LWKT msg/port algorithms have been rewritten and simplified. The message structure is now only touched by the current owner in all situations.
* Numerous structural fields have been removed. In particular, the fields used for message abort sequencing have been simplified and we do not try to embed a 'command' field in the base LWKT message any more.
* Clean up the netmsg abstraction, which is used all over the network stack. Instead of trying to overload fields in lwkt_msg we now simply extend the base lwkt_msg into struct netmsg. The function dispatch now takes a netmsg and returns void (before we had to return EASYNC), and we no longer need weird casts. Accept/connect message aborts are now greatly simplified.
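The idea of extending the base message into a network message with a void dispatch function can be sketched as follows. The struct layouts and field names are illustrative assumptions, not the kernel's definitions, and `struct pkt_netmsg` is a hypothetical extended message used only to show how a handler recovers its own payload.

```c
/* Hypothetical base message, reduced to two fields. */
struct lwkt_msg_model {
	int ms_flags;
	int ms_error;
};

struct netmsg_model;
typedef void (*netisr_fn_t)(struct netmsg_model *);

/* Network message: the base message extended with a dispatch function. */
struct netmsg_model {
	struct lwkt_msg_model nm_lmsg;		/* base message comes first */
	netisr_fn_t           nm_dispatch;	/* takes a netmsg, returns void */
};

/* Hypothetical handler-specific message built on top of the netmsg. */
struct pkt_netmsg {
	struct netmsg_model base;
	int                 pkt_len;
	int                 handled;
};

static void
pkt_handler(struct netmsg_model *nm)
{
	/* Valid because 'base' is the first member of struct pkt_netmsg. */
	struct pkt_netmsg *pm = (struct pkt_netmsg *)nm;

	pm->handled = pm->pkt_len;
	nm->nm_lmsg.ms_error = 0;	/* result is stored, not returned */
}

/* What a service loop would do for each message it receives. */
static void
netmsg_dispatch(struct netmsg_model *nm)
{
	nm->nm_dispatch(nm);
}
```

With the handler's result carried in the message itself, the dispatch function has no return value to interpret, which is the simplification the last bullet describes.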
Remove weird license clause which has expired.
Rename printf -> kprintf in sys/ and add some defines where necessary (files which are used in userland, too).
Fix bug introduced in rev 1.33 and make a clearer code flow. Noticed-by: y0netani
Embed the netmsg in the mbuf itself rather than allocating one for each received packet. This greatly reduces the overhead in the network receive path (removing a malloc() and free()).
Bring in the parallel route table code and clean up ARP. The route table is now replicated across all cpus (ncpus, not ncpus2). Note that cloned routes are not replicated. This removes one of the few remaining obstacles to being able to run the network protocol stacks without the BGL. Primary-Design-by: Jeffrey Hsu Work-by: Jeffrey Hsu and Matthew Dillon
Bring in if_bridge from Open-/Net-/FreeBSD Based-on-patch-by: Andrew Atrens Reviewed-and-locking-corrected-by: dillon and sephe
Adjust sources to accommodate the repo copy of our bridging code: sys/net/bridge was copied to sys/net/oldbridge.
Make all network interrupt service routines MPSAFE part 1/3. Replace the critical section that was previously used to serialize access with the LWKT serializer. Integrate the serializer into the IFNET structure. Note that kern.intr_mpsafe must be set to 1 for network interrupts to actually run MPSAFE. Also note that any interrupts shared with other non-MP drivers will cause all drivers on that interrupt to run with the Big Giant Lock.
Network interrupt - Each network driver then simply passes that serializer to bus_setup_intr() so only a single serializer is required to process the entire interrupt path. LWKT serialization support is already 100% integrated into the interrupt subsystem so it will already be held as of when the registered interrupt procedure is called.
Ioctl and if_* functions - All callers of if_* functions (such as if_start, if_ioctl, etc) now obtain the IFNET serializer before making the call. Thus all of these entry points into the driver will now be serialized.
if_input - All code that calls if_input now ensures that the serializer is held. It will either already be held (when called from a driver), or the serializer will be wrapped around the call. When packets are forwarded or bridged between interfaces, the target interface serializer will be dropped temporarily to avoid a deadlock.
Device Driver access - dev_* entry points into certain pseudo-network devices now obtain and release the serializer. This had to be done on a device-by-device basis (but there are only a few such devices).
Thanks to several people for helping test the patch, in particular Sepherosa Ziehau.
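The serializer handling around if_input and forwarding can be modeled with a non-recursive lock that asserts on misuse. This is a sketch only: `struct serializer`, `struct ifnet_model` and the helper functions are hypothetical, and it assumes one reading of the commit text, namely that the serializer already held by the input path is released around the call into the other interface so that two serializers are never held at once.

```c
#include <assert.h>

/* Hypothetical non-recursive serializer: entering twice is a bug. */
struct serializer {
	int held;
};

struct ifnet_model {
	struct serializer ser;
	int               rx;
	int               tx;
};

static void
ser_enter(struct serializer *s)
{
	assert(!s->held);	/* would deadlock in a real kernel */
	s->held = 1;
}

static void
ser_exit(struct serializer *s)
{
	assert(s->held);
	s->held = 0;
}

/* Output entry point: wraps the interface's own serializer around the work. */
static void
if_output_model(struct ifnet_model *ifp)
{
	ser_enter(&ifp->ser);
	ifp->tx++;
	ser_exit(&ifp->ser);
}

/*
 * Input path, called with src's serializer held. Before handing the
 * packet to dst, the held serializer is dropped and then reacquired.
 */
static void
if_input_forward(struct ifnet_model *src, struct ifnet_model *dst)
{
	assert(src->ser.held);
	src->rx++;
	ser_exit(&src->ser);
	if_output_model(dst);
	ser_enter(&src->ser);
}
```

Dropping the held serializer also covers the case where a packet is sent back out of the interface it arrived on, which would otherwise try to enter the same serializer twice.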
Remove spl*() calls from netinet, replacing them with critical sections. A slight rearrangement of COMMON_START() in tcp_usrreq.c was necessary to ensure that the inp is loaded after entering the critical section.
Cosmetic changes only.
Now that we generate the ethernet header in place in the mbuf instead of in a secondary buffer, we have to remove space for the old ethernet header before we send a packet waiting for ARP resolution back to ether_output() again.
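Why the header space has to be removed can be seen in a small model of an in-place header. The `struct pkt` layout and both helpers are hypothetical; in the kernel the trimming would presumably be done with an mbuf routine such as m_adj(), though the commit message does not say which call is used.

```c
#include <stddef.h>

#define ETHER_HDR_LEN 14

/* Hypothetical packet: 'len' valid bytes starting at buf[off]. */
struct pkt {
	unsigned char buf[64];
	size_t        off;
	size_t        len;
};

/* Claim room for the ethernet header in front of the payload, in place. */
static int
pkt_prepend_hdr(struct pkt *p)
{
	if (p->off < ETHER_HDR_LEN)
		return -1;	/* no headroom left */
	p->off -= ETHER_HDR_LEN;
	p->len += ETHER_HDR_LEN;
	return 0;
}

/* Undo the above before the packet goes through the output path again. */
static int
pkt_strip_hdr(struct pkt *p)
{
	if (p->len < ETHER_HDR_LEN)
		return -1;
	p->off += ETHER_HDR_LEN;
	p->len -= ETHER_HDR_LEN;
	return 0;
}
```

If a held packet were resubmitted without the strip step, the output routine would prepend a second header in front of the stale one (or, as in this model, run out of headroom).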
Now that I understand the poorly written BSD routing code and what it was trying to do, rewrite it in a clear and concise manner. The old rtalloc1() code written by CSRG had a number of problems:
1. it was not clear which route was being returned
2. it was not clear what was being reported
3. it hid the essential radix tree lookup operation inside a series of conditional tests and inline assignments
4. it had multiple gotos to the inside of if statements
5. it intermixed reporting code with the operational logic of lookup and cloning
6. it assigned multiple times to key variables
7. it had unnecessary assignments to key variables
8. it overloaded the "report" argument parameter to have two different semantics
9. it misnamed the key route lookup function "rtalloc1", obscuring all uses of route lookup.
In contrast to the rtalloc1 code in FreeBSD 4 or the even more convoluted rtalloc1 code in FreeBSD 5, the DragonFlyBSD version
A. has a clear control flow that makes the common case obvious by highlighting the core call to the radix tree lookup function, eliminating gotos into if statements, and completely separating out the special-case cloning logic
B. makes it clear which route is being returned by only assigning once to the key "rt" variable and by explicitly returning "rt" or "clonedroute"
C. abstracts out the reporting code into its own reporting API
D. cleans up the semantics of the "report" argument parameter to only indicate whether to report a miss and not whether to clone
E. introduces a simple single-argument API for callers that want to clone and those that do not.
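The control flow described in points A through E can be sketched with a toy table in place of the radix tree. None of these names come from the commit: `radix_lookup()`, `route_lookup()`, `route_lookup_clone()` and `route_lookup_noclone()` are hypothetical, as is the rule that cloning entries match a /24.

```c
#include <stddef.h>

/* Hypothetical route entry; 'cloning' marks a route that spawns host routes. */
struct rtentry_model {
	unsigned int dst;
	int          cloning;
};

static struct rtentry_model table[] = {
	{ 0x0a000000u, 1 },	/* 10.0.0.0/24, cloning */
	{ 0xc0a80001u, 0 }	/* 192.168.0.1, host route */
};
static struct rtentry_model clone_slot;	/* storage for the one cloned route */
static int misses_reported;

/* Stand-in for the radix tree lookup. */
static struct rtentry_model *
radix_lookup(unsigned int dst)
{
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		unsigned int mask = table[i].cloning ? 0xffffff00u : 0xffffffffu;

		if ((dst & mask) == table[i].dst)
			return &table[i];
	}
	return NULL;
}

/* Core lookup: one lookup call, 'rt' assigned once, cloning kept separate. */
static struct rtentry_model *
route_lookup(unsigned int dst, int report_miss, int do_clone)
{
	struct rtentry_model *rt = radix_lookup(dst);

	if (rt == NULL) {
		if (report_miss)
			misses_reported++;	/* reporting lives apart from lookup */
		return NULL;
	}
	if (do_clone && rt->cloning) {
		clone_slot.dst = dst;
		clone_slot.cloning = 0;
		return &clone_slot;		/* the cloned route */
	}
	return rt;
}

/* Single-argument wrappers for callers that do or do not want cloning. */
static struct rtentry_model *
route_lookup_clone(unsigned int dst)
{
	return route_lookup(dst, 1, 1);
}

static struct rtentry_model *
route_lookup_noclone(unsigned int dst)
{
	return route_lookup(dst, 1, 0);
}
```

Separating "report a miss" from "clone" into two independent arguments is what removes the double meaning the old "report" parameter carried.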
Clean up the networking code before I parallelize the routing code.
Clean up the routing and networking code before I parallelize routing.
Clean up routing code before I parallelize it.
Merge from FreeBSD: revision 1.102 date: 2003/02/08 15:05:15; author: orion; state: Exp; lines: +7 -6 Avoid multiply for preemptive arp calculation since it hits every ethernet packet sent. Prompted by: Jeffrey Hsu <hsu@FreeBSD.org>
Fix a conditional. sdl was not unconditionally being checked for NULL. Found-by: Mikhail Teterin <email@example.com>
Normally we want to warn if the local IP address is used by a different host. This isn't useful for 0.0.0.0, because it is used by dhclient when no address is known.
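The rule in this entry amounts to one extra test before the duplicate-address warning. A minimal sketch, assuming host-byte-order addresses and a hypothetical `should_warn_duplicate()` helper that is not part of the actual source:

```c
#include <stdint.h>
#include <string.h>

/*
 * Decide whether to warn that another host is using our IP address.
 * Addresses are plain 32-bit values here; 0 stands for 0.0.0.0.
 */
static int
should_warn_duplicate(uint32_t sender_ip, uint32_t my_ip,
    const uint8_t sender_mac[6], const uint8_t my_mac[6])
{
	if (my_ip == 0)
		return 0;	/* 0.0.0.0: no address known yet, e.g. under dhclient */
	if (sender_ip != my_ip)
		return 0;	/* not our address at all */
	return memcmp(sender_mac, my_mac, 6) != 0;	/* same IP, different host */
}
```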
Fix 'route add -host <target> -interface <interface_name>'. This was previously adding a static arp entry with the interface's MAC address instead of the target's address or an incomplete address. The result is that the target cannot be routed to. The fix is to (1) install the route with 'incomplete' link level info rather than using the interface's MAC address, (2) allow the incomplete address to be resolved and time out normally, (3) re-clear the entry to an incomplete status instead of destroying it when the ARP times out, and (4) make 'arp -d -a' only clear link level routes marked static instead of deleting them. Noticed-by: Mikhail Teterin <firstname.lastname@example.org>
timeout/untimeout ==> callout_* This moves the reset of the arptimer after the list processing. There's no protection against multiple runs here, so this makes more sense.
Add if_broadcastaddr to struct ifnet to hold the link layer broadcast address. Use this in place of the various direct references esp. to etherbroadcastaddr. Inspired-by: NetBSD if.h, rev. 1.29
Change mbuf allocation flags from M_ to MB_ to avoid confusion with malloc flags. Requested by: Jeffrey Hsu
Fix a netmsg memory leak in the ARP code. Adjust all ms_cmd function dispatches to return a proper error code. Reported-by: multiple people
Push the lwkt_replymsg() up one level from netisr_service_loop() to the message handler so we can explicitly reply or not reply as appropriate.
Dispatch upper-half protocol request handling.
if_xname support Part 2/2: Convert remaining netif devices and implement full support for if_xname. Restructure struct ifnet in net/if_var.h, pulling in a few minor additional changes from current including making if_dunit an int, and making if_flags an int. Submitted-by: Max Laier <email@example.com>
Network threading stage 1/3: netisrs are already software interrupts, which means they already run in their own thread. This commit creates multiple supporting threads for netisrs rather than just one, and code has been added to begin routing packets to particular threads based on their content. Eventually this will lead to us being able to isolate and serialize PCBs in particular threads. The tail end of the ip_input path's protocol dispatch, the UIPC (user entry) code, and listen sockets have not been covered yet and still need to be serialized. A new debugging sysctl, net.inet.ip.mthread_enable, has been added. It defaults to 1. If you set this sysctl to 0, netisr processing will revert to the prior single-threaded behavior. Submitted-by: Jeffrey Hsu <hsu@FreeBSD.org> Additional-work-by: dillon
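Routing packets to threads "based on their content" is commonly done by hashing the flow identifiers, so that every packet of one connection lands on the same thread. The commit message does not describe the actual function used; the hash below and the name `pkt_to_thread()` are purely illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a flow (addresses and ports) to one of nthreads protocol threads.
 * The mixing steps are arbitrary; only determinism per flow matters here.
 */
static unsigned int
pkt_to_thread(uint32_t src, uint32_t dst, uint16_t sport, uint16_t dport,
    unsigned int nthreads)
{
	uint32_t h;

	assert(nthreads > 0);
	h = src ^ dst ^ (((uint32_t)sport << 16) | dport);
	h ^= h >> 16;
	return h % nthreads;
}
```

Because the result depends only on the flow's identifiers, the per-connection state (the PCB) is only ever touched by one thread, which is the isolation the commit is working toward.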
Apply FreeBSD Security Advisory FreeBSD-SA-03:14.arp. Fix DOS crash due to arp starvation.
Centralize if queue handling. Original patch against FreeBSD submitted by Jonathan Lemon. Reviewed by Matt Dillon.
If IPv6 doesn't need old-style prototypes, maybe it's time we took them out of IPv4's code.
kernel tree reorganization stage 1: Major cvs repository work (not logged as commits) plus a major reworking of the #include's to accommodate the relocations.
* CVS repository files manually moved. Old directories left intact and empty (temporary).
* Reorganize all filesystems into vfs/, most devices into dev/, sub-divide devices by function.
* Begin to move device-specific architecture files to the device subdirs rather than throwing them all into, e.g. i386/include
* Reorganize files related to system busses, placing the related code in a new bus/ directory. Also move cam to bus/cam though this may not have been the best idea in retrospect.
* Reorganize emulation code and place it in a new emulation/ directory.
* Remove the -I- compiler option in order to allow #include file localization, rename all config generated X.h files to use_X.h to clean up the conflicts.
* Remove /usr/src/include (or /usr/include) dependencies during the kernel build, beyond what is normally needed to compile helper programs.
* Make config create 'machine' softlinks for architecture specific directories outside of the standard <arch>/include.
* Bump the config rev.
WARNING! after this commit /usr/include and /usr/src/sys/compile/* should be regenerated from scratch.
Register keyword removal Approved by: Matt Dillon
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 220.127.116.11