251 lines
8.9 KiB
Plaintext
251 lines
8.9 KiB
Plaintext
$NetBSD: TODO.smpnet,v 1.39 2020/08/01 08:20:47 maxv Exp $
|
|
|
|
MP-safe components
|
|
==================
|
|
|
|
They work without the big kernel lock (KERNEL_LOCK), i.e., with NET_MPSAFE
|
|
kernel option. Some components scale up and some don't.
|
|
|
|
- Device drivers
|
|
- aq(4)
|
|
- vioif(4)
|
|
- vmx(4)
|
|
- wm(4)
|
|
- ixg(4)
|
|
- ixl(4)
|
|
- ixv(4)
|
|
- Layer 2
|
|
- Ethernet (if_ethersubr.c)
|
|
- bridge(4)
|
|
- STP
|
|
- Fast forward (ipflow)
|
|
- Layer 3
|
|
- All except for items in the below section
|
|
- Interfaces
|
|
- gif(4)
|
|
- ipsecif(4)
|
|
- l2tp(4)
|
|
- pppoe(4)
|
|
- if_spppsubr.c
|
|
- tun(4)
|
|
- vlan(4)
|
|
- Packet filters
|
|
- npf(7)
|
|
- Others
|
|
- bpf(4)
|
|
- ipsec(4)
|
|
- opencrypto(9)
|
|
- pfil(9)
|
|
|
|
Non MP-safe components and kernel options
|
|
=========================================
|
|
|
|
The components and options aren't MP-safe, i.e., requires the big kernel lock,
|
|
yet. Some of them can be used safely even if NET_MPSAFE is enabled because
|
|
they're still protected by the big kernel lock. The others aren't protected and
|
|
so unsafe, e.g, they may crash the kernel.
|
|
|
|
Protected ones
|
|
--------------
|
|
|
|
- Device drivers
|
|
- Most drivers other than ones listed in the above section
|
|
- Layer 4
|
|
- DCCP
|
|
- SCTP
|
|
- TCP
|
|
- UDP
|
|
|
|
Unprotected ones
|
|
----------------
|
|
|
|
- Layer 2
|
|
- ARCNET (if_arcsubr.c)
|
|
- IEEE 1394 (if_ieee1394subr.c)
|
|
- IEEE 802.11 (ieee80211(4))
|
|
- Layer 3
|
|
- IPSELSRC
|
|
- MROUTING
|
|
- PIM
|
|
- MPLS (mpls(4))
|
|
- IPv6 address selection policy
|
|
- Interfaces
|
|
- agr(4)
|
|
- carp(4)
|
|
- faith(4)
|
|
- gre(4)
|
|
- ppp(4)
|
|
- sl(4)
|
|
- stf(4)
|
|
- if_srt
|
|
- tap(4)
|
|
- Packet filters
|
|
- ipf(4)
|
|
- pf(4)
|
|
- Others
|
|
- AppleTalk (sys/netatalk/)
|
|
- Bluetooth (sys/netbt/)
|
|
- altq(4)
|
|
- kttcp(4)
|
|
- NFS
|
|
|
|
Know issues
|
|
===========
|
|
|
|
NOMPSAFE
|
|
--------
|
|
|
|
We use "NOMPSAFE" as a mark that indicates that the code around it isn't MP-safe
|
|
yet. We use it in comments and also use as part of function names, for example
|
|
m_get_rcvif_NOMPSAFE. Let's use "NOMPSAFE" to make it easy to find non-MP-safe
|
|
codes by grep.
|
|
|
|
bpf
|
|
---
|
|
|
|
MP-ification of bpf requires all of bpf_mtap* are called in normal LWP context
|
|
or softint context, i.e., not in hardware interrupt context. For Tx, all
|
|
bpf_mtap satisfy the requrement. For Rx, most of bpf_mtap are called in softint.
|
|
Unfortunately some bpf_mtap on Rx are still called in hardware interrupt context.
|
|
|
|
This is the list of the functions that have such bpf_mtap:
|
|
|
|
- sca_frame_process() @ sys/dev/ic/hd64570.c
|
|
|
|
Ideally we should make the functions run in softint somehow, but we don't have
|
|
actual devices, no time (or interest/love) to work on the task, so instead we
|
|
provide a deferred bpf_mtap mechanism that forcibly runs bpf_mtap in softint
|
|
context. It's a workaround and once the functions run in softint, we should use
|
|
the original bpf_mtap again.
|
|
|
|
if_mcast_op() - SIOCADDMULTI/SIOCDELMULTI
|
|
-----------------------------------------
|
|
Helper function is called to add or remove multicast addresses for
|
|
interface. When called via ioctl it takes IFNET_LOCK(), when called
|
|
via sosetopt() it doesn't.
|
|
|
|
Various network drivers can't assert IFNET_LOCKED() in their if_ioctl
|
|
because of this. Generally drivers still take care to splnet() even
|
|
with NET_MPSAFE before calling ether_ioctl(), but they do not take
|
|
KERNEL_LOCK(), so this is actually unsafe.
|
|
|
|
Lingering obsolete variables
|
|
-----------------------------
|
|
|
|
Some obsolete global variables and member variables of structures remain to
|
|
avoid breaking old userland programs which directly access such variables via
|
|
kvm(3).
|
|
|
|
The following programs still use kvm(3) to get some information related to
|
|
the network stack.
|
|
|
|
- netstat(1)
|
|
- vmstat(1)
|
|
- fstat(1)
|
|
|
|
netstat(1) accesses ifnet_list, the head of a list of interface objects
|
|
(struct ifnet), and traverses each object through ifnet#if_list member variable.
|
|
ifnet_list and ifnet#if_list is obsoleted by ifnet_pslist and
|
|
ifnet#if_pslist_entry respectively. netstat also accesses the IP address list
|
|
of an interface throught ifnet#if_addrlist. struct ifaddr, struct in_ifaddr
|
|
and struct in6_ifaddr are accessed and the following obsolete member variables
|
|
are stuck: ifaddr#ifa_list, in_ifaddr#ia_hash, in_ifaddr#ia_list,
|
|
in6_ifaddr#ia_next and in6_ifaddr#_ia6_multiaddrs. Note that netstat already
|
|
implements alternative methods to fetch the above information via sysctl(3).
|
|
|
|
vmstat(1) shows statistics of hash tables created by hashinit(9) in the kernel.
|
|
The statistic information is retrieved via kvm(3). The global variables
|
|
in_ifaddrhash and in_ifaddrhashtbl, which are for a hash table of IPv4
|
|
addresses and obsoleted by in_ifaddrhash_pslist and in_ifaddrhashtbl_pslist,
|
|
are kept for this purpose. We should provide a means to fetch statistics of
|
|
hash tables via sysctl(3).
|
|
|
|
fstat(1) shows information of bpf instances. Each bpf instance (struct bpf) is
|
|
obtained via kvm(3). bpf_d#_bd_next, bpf_d#_bd_filter and bpf_d#_bd_list
|
|
member variables are obsolete but remain. ifnet#if_xname is also accessed
|
|
via struct bpf_if and obsolete ifnet#if_list is required to remain to not change
|
|
the offset of ifnet#if_xname. The statistic counters (bpf#bd_rcount,
|
|
bpf#bd_dcount and bpf#bd_ccount) are also victims of this restriction; for
|
|
scalability the statistic counters should be per-CPU and we should stop using
|
|
atomic operations for them however we have to remain the counters and atomic
|
|
operations.
|
|
|
|
Scalability
|
|
-----------
|
|
|
|
- Per-CPU rtcaches (used in say IP forwarding) aren't scalable on multiple
|
|
flows per CPU
|
|
- ipsec(4) isn't scalable on the number of SA/SP; the cost of a look-up
|
|
is O(n)
|
|
- opencrypto(9)'s crypto_newsession()/crypto_freesession() aren't scalable
|
|
as they are serialized by one mutex
|
|
|
|
ALTQ
|
|
----
|
|
|
|
If ALTQ is enabled in the kernel, it enforces to use just one Tx queue (if_snd)
|
|
for packet transmissions, resulting in serializing all Tx packet processing on
|
|
the queue. We should probably design and implement an alternative queuing
|
|
mechanism that deals with multi-core systems at the first place, not making the
|
|
existing ALTQ MP-safe because it's just annoying.
|
|
|
|
Using kernel modules
|
|
--------------------
|
|
|
|
Please note that if you enable NET_MPSAFE in your kernel, and you use and
|
|
loadable kernel modules (including compat_xx modules or individual network
|
|
interface if_xxx device driver modules), you will need to build custom
|
|
modules. For each module you will need to add the following line to its
|
|
Makefile:
|
|
|
|
CPPFLAGS+= NET_MPSAFE
|
|
|
|
Failure to do this may result in unpredictable behavior.
|
|
|
|
IPv4 address initialization atomicity
|
|
-------------------------------------
|
|
|
|
An IPv4 address is referenced by several data structures: an associated
|
|
interface, its local route, a connected route (if necessary), the global list,
|
|
the global hash table, etc. These data structures are not updated atomically,
|
|
i.e., there can be inconsistent states on an IPv4 address in the kernel during
|
|
the initialization of an IPv4 address.
|
|
|
|
One known failure of the issue is that incoming packets destinating to an
|
|
initializing address can loop in the network stack in a short period of time.
|
|
The address initialization creates an local route first and then registers an
|
|
initializing address to the global hash table that is used to decide if an
|
|
incoming packet destinates to the host by checking the destination of the packet
|
|
is registered to the hash table. So, if the host allows forwaring, an incoming
|
|
packet can match on a local route of an initializing address at ip_output while
|
|
it fails the to-self check described above at ip_input. Because a matched local
|
|
route points a loopback interface as its destination interface, an incoming
|
|
packet sends to the network stack (ip_input) again, which results in looping.
|
|
The loop stops once an initializing address is registered to the hash table.
|
|
|
|
One solution of the issue is to reorder the address initialization instructions,
|
|
first register an address to the hash table then create its routes. Another
|
|
solution is to use the routing table for the to-self check instead of using the
|
|
global hash table, like IPv6.
|
|
|
|
if_flags
|
|
--------
|
|
|
|
To avoid data race on if_flags it should be protected by a lock (currently it's
|
|
IFNET_LOCK). Thus, if_flags should not be accessed on packet processing to
|
|
avoid performance degradation by lock contentions. Traditionally IFF_RUNNING,
|
|
IFF_UP and IFF_OACTIVE flags of if_flags are checked on packet processing. If
|
|
you make a driver MP-safe you must remove such checks.
|
|
|
|
IFF_ALLMULTI can be set/unset via if_mcast_op. To protect updates of the flag,
|
|
we had added IFNET_LOCK around if_mcast_op. However that was not a good
|
|
approach because if_mcast_op is typically called in the middle of a call path
|
|
and holding IFNET_LOCK such places is problematic. Actually a deadlock is
|
|
observed. Probably we should remove IFNET_LOCK and manage IFF_ALLMULTI
|
|
somewhere other than if_flags, for example ethercom or driver itself (or a
|
|
common driver framework once it appears). Such a change is feasible because
|
|
IFF_ALLMULTI is only set/unset by a driver and not accessed from any common
|
|
components such as network protocols.
|
|
|
|
Also IFF_PROMISC is checked in ether_input and we should get rid of it somehow.
|