$NetBSD: TODO.smpnet,v 1.40 2021/01/20 10:26:43 nia Exp $

MP-safe components
==================

They work without the big kernel lock (KERNEL_LOCK), i.e., with the
NET_MPSAFE kernel option.  Some components scale up and some don't.

 - Device drivers
   - aq(4)
   - vioif(4)
   - vmx(4)
   - wm(4)
   - ixg(4)
   - ixl(4)
   - ixv(4)
 - Layer 2
   - Ethernet (if_ethersubr.c)
   - bridge(4)
     - STP
   - Fast forward (ipflow)
 - Layer 3
   - All except for items in the below section
 - Interfaces
   - gif(4)
   - ipsecif(4)
   - l2tp(4)
   - pppoe(4)
     - if_spppsubr.c
   - tap(4)
   - tun(4)
   - vlan(4)
 - Packet filters
   - npf(7)
 - Others
   - bpf(4)
   - ipsec(4)
   - opencrypto(9)
   - pfil(9)

Non MP-safe components and kernel options
=========================================

The following components and options aren't MP-safe yet, i.e., they require
the big kernel lock.  Some of them can be used safely even if NET_MPSAFE is
enabled because they're still protected by the big kernel lock.  The others
aren't protected and so are unsafe, e.g., they may crash the kernel.

Protected ones
--------------

 - Device drivers
   - Most drivers other than ones listed in the above section
 - Layer 4
   - DCCP
   - SCTP
   - TCP
   - UDP

Unprotected ones
----------------

 - Layer 2
   - ARCNET (if_arcsubr.c)
   - IEEE 1394 (if_ieee1394subr.c)
   - IEEE 802.11 (ieee80211(4))
 - Layer 3
   - IPSELSRC
   - MROUTING
     - PIM
   - MPLS (mpls(4))
   - IPv6 address selection policy
 - Interfaces
   - agr(4)
   - carp(4)
   - faith(4)
   - gre(4)
   - ppp(4)
   - sl(4)
   - stf(4)
   - if_srt
 - Packet filters
   - ipf(4)
   - pf(4)
 - Others
   - AppleTalk (sys/netatalk/)
   - Bluetooth (sys/netbt/)
   - altq(4)
   - kttcp(4)
   - NFS

Known issues
============

NOMPSAFE
--------

We use "NOMPSAFE" as a mark that indicates that the code around it isn't
MP-safe yet.  We use it in comments and also as part of function names, for
example m_get_rcvif_NOMPSAFE.  Let's use "NOMPSAFE" to make it easy to find
non-MP-safe code with grep.

bpf
---

MP-ification of bpf requires that all bpf_mtap* calls be made in normal LWP
context or softint context, i.e., not in hardware interrupt context.
For Tx, all bpf_mtap calls satisfy the requirement.  For Rx, most bpf_mtap
calls are made in softint context.  Unfortunately some bpf_mtap calls on Rx
are still made in hardware interrupt context.

This is the list of the functions that have such bpf_mtap calls:

 - sca_frame_process() @ sys/dev/ic/hd64570.c

Ideally we should make the functions run in softint somehow, but we have
neither actual devices nor time (or interest/love) to work on the task, so
instead we provide a deferred bpf_mtap mechanism that forcibly runs bpf_mtap
in softint context.  It's a workaround and once the functions run in softint,
we should use the original bpf_mtap again.

if_mcast_op() - SIOCADDMULTI/SIOCDELMULTI
-----------------------------------------

This helper function is called to add or remove multicast addresses for an
interface.  When called via ioctl it takes IFNET_LOCK(); when called via
sosetopt() it doesn't.

Various network drivers can't assert IFNET_LOCKED() in their if_ioctl because
of this.  Generally drivers still take care to call splnet() even with
NET_MPSAFE before calling ether_ioctl(), but they do not take KERNEL_LOCK(),
so this is actually unsafe.

Lingering obsolete variables
----------------------------

Some obsolete global variables and member variables of structures remain to
avoid breaking old userland programs which directly access such variables via
kvm(3).

The following programs still use kvm(3) to get some information related to
the network stack.

 - netstat(1)
 - vmstat(1)
 - fstat(1)

netstat(1) accesses ifnet_list, the head of a list of interface objects
(struct ifnet), and traverses each object through the ifnet#if_list member
variable.  ifnet_list and ifnet#if_list are obsoleted by ifnet_pslist and
ifnet#if_pslist_entry respectively.  netstat also accesses the IP address
list of an interface through ifnet#if_addrlist.
struct ifaddr, struct in_ifaddr and struct in6_ifaddr are accessed, so the
following obsolete member variables cannot be removed: ifaddr#ifa_list,
in_ifaddr#ia_hash, in_ifaddr#ia_list, in6_ifaddr#ia_next and
in6_ifaddr#_ia6_multiaddrs.  Note that netstat already implements alternative
methods to fetch the above information via sysctl(3).

vmstat(1) shows statistics of hash tables created by hashinit(9) in the
kernel.  The statistics are retrieved via kvm(3).  The global variables
in_ifaddrhash and in_ifaddrhashtbl, which are for a hash table of IPv4
addresses and obsoleted by in_ifaddrhash_pslist and in_ifaddrhashtbl_pslist,
are kept for this purpose.  We should provide a means to fetch statistics of
hash tables via sysctl(3).

fstat(1) shows information about bpf instances.  Each bpf instance (struct
bpf) is obtained via kvm(3).  The bpf_d#_bd_next, bpf_d#_bd_filter and
bpf_d#_bd_list member variables are obsolete but remain.  ifnet#if_xname is
also accessed via struct bpf_if, and the obsolete ifnet#if_list is required
to remain so that the offset of ifnet#if_xname does not change.  The
statistic counters (bpf#bd_rcount, bpf#bd_dcount and bpf#bd_ccount) are also
victims of this restriction; for scalability the statistic counters should be
per-CPU and we should stop using atomic operations for them; however, we have
to keep the counters and the atomic operations.

Scalability
-----------

 - Per-CPU rtcaches (used in say IP forwarding) aren't scalable on multiple
   flows per CPU
 - ipsec(4) isn't scalable on the number of SA/SP; the cost of a look-up is
   O(n)
 - opencrypto(9)'s crypto_newsession()/crypto_freesession() aren't scalable
   as they are serialized by one mutex

ALTQ
----

If ALTQ is enabled in the kernel, it forces the use of just one Tx queue
(if_snd) for packet transmissions, which serializes all Tx packet processing
on that queue.
We should probably design and implement an alternative queuing mechanism that
deals with multi-core systems in the first place, instead of making the
existing ALTQ MP-safe, because that's just annoying.

Using kernel modules
--------------------

Please note that if you enable NET_MPSAFE in your kernel, and you use any
loadable kernel modules (including compat_xx modules or individual network
interface if_xxx device driver modules), you will need to build custom
modules.  For each module you will need to add the following line to its
Makefile:

	CPPFLAGS+=	-DNET_MPSAFE

Failure to do this may result in unpredictable behavior.

IPv4 address initialization atomicity
-------------------------------------

An IPv4 address is referenced by several data structures: an associated
interface, its local route, a connected route (if necessary), the global
list, the global hash table, etc.  These data structures are not updated
atomically, i.e., an IPv4 address can be in an inconsistent state in the
kernel while it is being initialized.

One known failure caused by this issue is that incoming packets destined for
an address that is being initialized can loop in the network stack for a
short period of time.  The address initialization creates a local route first
and then registers the address in the global hash table, which is used to
decide whether an incoming packet is destined for the host by checking
whether the destination of the packet is registered in the hash table.  So,
if the host allows forwarding, an incoming packet can match the local route
of the initializing address at ip_output while it fails the to-self check
described above at ip_input.  Because the matched local route points to a
loopback interface as its destination interface, the incoming packet is sent
to the network stack (ip_input) again, which results in looping.  The loop
stops once the initializing address is registered in the hash table.
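
The failure described above can be reproduced in a toy model.  The following
is a minimal C sketch, not kernel code: struct toy_addr_state and
toy_input_passes() are hypothetical names introduced only for illustration,
and the real ip_input/ip_output paths are far more involved.  It counts how
many times a packet re-enters the input path for a given combination of
"local route installed" and "address registered in the hash table".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct toy_addr_state {
	bool local_route;	/* local route for the address is installed */
	bool in_hash;		/* address is registered in the hash table */
	bool forwarding;	/* host forwards packets */
};

/*
 * Count how many times a packet destined for the address enters the
 * input path before it is delivered, dropped or forwarded elsewhere.
 * max_passes bounds the model so that a looping state terminates.
 */
int
toy_input_passes(const struct toy_addr_state *st, int max_passes)
{
	int passes = 0;

	assert(st != NULL);
	while (passes < max_passes) {
		passes++;
		if (st->in_hash)
			break;	/* to-self check succeeds: deliver locally */
		if (!st->forwarding || !st->local_route)
			break;	/* dropped, or forwarded by another route */
		/* The local route points at loopback: back to input. */
	}
	return passes;
}
```

With local_route and forwarding set but in_hash clear, the packet keeps
re-entering the input path until the model's bound is reached; once in_hash
is set, it is delivered on the first pass.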

One solution to the issue is to reorder the address initialization steps:
first register the address in the hash table, then create its routes.
Another solution is to use the routing table for the to-self check instead of
the global hash table, as IPv6 does.

if_flags
--------

To avoid data races on if_flags, it should be protected by a lock (currently
IFNET_LOCK).  Thus, if_flags should not be accessed during packet processing,
to avoid performance degradation from lock contention.  Traditionally the
IFF_RUNNING, IFF_UP and IFF_OACTIVE flags of if_flags are checked during
packet processing.  If you make a driver MP-safe you must remove such checks.

IFF_ALLMULTI can be set/unset via if_mcast_op.  To protect updates of the
flag, we added IFNET_LOCK around if_mcast_op.  However, that was not a good
approach because if_mcast_op is typically called in the middle of a call path
and holding IFNET_LOCK in such places is problematic.  In fact, a deadlock
has been observed.  Probably we should remove IFNET_LOCK and manage
IFF_ALLMULTI somewhere other than if_flags, for example in ethercom or the
driver itself (or a common driver framework once it appears).  Such a change
is feasible because IFF_ALLMULTI is only set/unset by a driver and not
accessed from any common components such as network protocols.

Also IFF_PROMISC is checked in ether_input and we should get rid of it
somehow.
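
One way to picture the direction suggested above is a driver that keeps its
own state instead of consulting if_flags.  The following is a userland C11
sketch under stated assumptions, not code from any real driver: struct
toy_softc and the toy_* functions are hypothetical, and C11 atomics stand in
for whatever primitives a kernel driver would actually use.  The control path
updates the state, and the packet path reads it without taking a lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver state; these names are not from any real driver. */
struct toy_softc {
	atomic_bool sc_running;	/* stands in for IFF_RUNNING checks */
	bool sc_allmulti;	/* stands in for IFF_ALLMULTI; control path only */
};

/* Control path: the caller is assumed to hold the driver's own lock. */
void
toy_set_running(struct toy_softc *sc, bool running)
{
	assert(sc != NULL);
	atomic_store_explicit(&sc->sc_running, running, memory_order_release);
}

/* Control path: record all-multicast mode without touching if_flags. */
void
toy_set_allmulti(struct toy_softc *sc, bool on)
{
	assert(sc != NULL);
	sc->sc_allmulti = on;
}

/* Packet path: decide whether to process a packet without taking a lock. */
bool
toy_can_process(struct toy_softc *sc)
{
	assert(sc != NULL);
	return atomic_load_explicit(&sc->sc_running, memory_order_acquire);
}
```

In this sketch the packet path never reads a field that needs IFNET_LOCK, and
the all-multicast state lives in the driver, so a multicast update does not
have to take IFNET_LOCK in the middle of a call path.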