Set up a command and function pointer in one case statement
instead of having a secondary case statement within a loop.
This makes the code much easier to follow, and possibly makes it easier
to add more compat in the future.
Don't panic when running an old binary without compat support.
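A minimal sketch of the pattern described above, not the committed code:
one switch selects both the command and its handler, and a request that
needs missing compat support returns an error instead of panicking. All
names here (CMD_*, handle_new, handle_compat, compat_hook_sketch,
dispatch_sketch) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical request codes; the real commands are not named in the log. */
enum { CMD_NEW = 1, CMD_OLD = 2, CMD_ANCIENT = 3 };

typedef int (*handler_fn)(int cmd, int arg);

static int handle_new(int cmd, int arg)    { (void)cmd; return arg; }
static int handle_compat(int cmd, int arg) { (void)cmd; return arg * 2; }

/* Set to NULL to model a kernel built without compat support. */
static handler_fn compat_hook_sketch = handle_compat;

/* Returns 0 and stores the handler's value in *result, or an errno. */
static int
dispatch_sketch(int request, int arg, int *result)
{
	int cmd;
	handler_fn fn;

	assert(result != NULL);

	/* One switch picks the command and the handler together. */
	switch (request) {
	case CMD_NEW:
		cmd = CMD_NEW;
		fn = handle_new;
		break;
	case CMD_OLD:
	case CMD_ANCIENT:
		cmd = CMD_NEW;	/* old requests map onto the new command */
		fn = compat_hook_sketch;
		break;
	default:
		return EINVAL;
	}

	/* An old binary without compat support gets an error, not a panic. */
	if (fn == NULL)
		return ENOSYS;

	*result = fn(cmd, arg);
	return 0;
}
```

Adding another compat level in this layout means adding one more case
label; the call site below the switch stays unchanged.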
building was completed only to discover that within there lay havoc.
On the second day all just groaned and moaned, and it must be someone
else's problem.
On the third day, St. Martin stepped in and traced the culprit, which
provided inspiration, and a correction was made.
Forevermore all were agog at just how such a trivial thing could do
so much damage...
OK... to be a little less vague. The loopback interface is a truly
"special" thing, and rump knew that - and treated it very specially.
Unfortunately, when the loopback interface is changed, and rump does
not keep up, bad things happen.
This (overall) might, or might not, be the correct fix - but for now
it appears to work. If someone, sometime, finds a better way to
deal with the issues of the loopback interface's true majesty, feel
free to revert this and do it another way.
happen automatically (via "registration" of the setup function in a
link-set), and if we're not a module, the SYSCTL_SETUP_PROTO() will
not have declared a function prototype!
This change makes struct ifaddr and its variants (in_ifaddr and in6_ifaddr)
MP-safe by using pserialize and psref. At this moment, pserialize_perform
and psref_target_destroy are disabled because (1) we don't need them
because of softnet_lock and (2) they cause a deadlock because of
softnet_lock. So we'll enable them when we remove softnet_lock in the future.
netstat now uses sysctl instead of kvm(3) to get address information from
the kernel. So we can avoid the issue introduced by the reverted commit
(PR kern/51325) by updating netstat with the latest source code.
The rump code needs to call devsw_attach() in order to assign a dev_major
for bpf; it then uses this to create rump's /dev/bpf node. Unfortunately,
this leaves the devsw attached, so when the bpf module tries to initialize
itself, it gets an EEXIST error and fails.
So, once rump has figured out what the dev_major should be, call devsw_detach()
to remove the devsw. Then, when the module initialization code calls
devsw_attach() it will succeed.
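The failure mode and the fix can be modelled in userland. This is only a
sketch: fake_devsw_attach, fake_devsw_detach and rump_learn_major_sketch
are hypothetical stand-ins with simplified signatures, and the
single-slot table and the major number 42 are invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical single-slot stand-in for the kernel's devsw table. */
static struct {
	bool attached;
	char name[16];
	int major;
} slot;

static int
fake_devsw_attach(const char *name, int *major)
{
	assert(name != NULL && major != NULL);
	if (slot.attached && strcmp(slot.name, name) == 0)
		return EEXIST;		/* already attached under this name */
	strncpy(slot.name, name, sizeof(slot.name) - 1);
	slot.name[sizeof(slot.name) - 1] = '\0';
	slot.major = 42;		/* arbitrary major for the sketch */
	slot.attached = true;
	*major = slot.major;
	return 0;
}

static void
fake_devsw_detach(void)
{
	slot.attached = false;
}

/* rump side: learn the major, create the node, optionally detach again. */
static int
rump_learn_major_sketch(const char *name, int *major, bool detach_after)
{
	int error = fake_devsw_attach(name, major);

	if (error != 0)
		return error;
	/* ... the /dev node would be created here using *major ... */
	if (detach_after)
		fake_devsw_detach();	/* leave the slot free for the module */
	return 0;
}
```

With detach_after set to false, a later attach under the same name gets
EEXIST; with it set to true, the later attach succeeds.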
Timers (such as nd6_timer) typically free/destroy some data in a callout
(softint). If we apply psz/psref to such data, we cannot free/destroy it
there because the synchronization primitives of psz/psref cannot be used in
softint. So run timer callbacks as workqueue works (normal LWP context).
Enqueuing a work twice (i.e., calling workqueue_enqueue before
the previous work is scheduled) isn't allowed. For nd6_timer and
rt_timer_timer, this doesn't happen because callout_reset is called only
from the workqueue's work. OTOH, ip{,6}flow_slowtimo's callout can be called
before its work starts and completes because the callout is periodically
called regardless of completion of the work. To avoid such a situation,
add a flag for each protocol; the flag is set to true when a work is
enqueued and set to false after the work has finished. workqueue_enqueue is
called only if the flag is false.
Proposed on tech-net and tech-kern.
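The enqueue guard described above can be sketched in userland C11. This
is an illustration under assumptions, not the kernel code: struct
slowtimo_sketch and both functions are hypothetical, a counter stands in
for workqueue_enqueue(), and C11 atomics stand in for whatever primitive
the kernel uses for the flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct slowtimo_sketch {
	atomic_bool enqueued;	/* true while a work is pending or running */
	int enqueue_count;	/* stands in for workqueue_enqueue() calls */
};

/* Callout side: runs periodically, whether or not the work has finished. */
static bool
slowtimo_callout_sketch(struct slowtimo_sketch *st)
{
	assert(st != NULL);
	/* atomic_exchange returns the previous value of the flag. */
	if (atomic_exchange(&st->enqueued, true))
		return false;	/* a work is still outstanding: skip */
	st->enqueue_count++;	/* the real code would enqueue the work here */
	return true;
}

/* Worker side: clear the flag only once the work has finished. */
static void
slowtimo_work_sketch(struct slowtimo_sketch *st)
{
	assert(st != NULL);
	/* ... the actual timer processing would go here ... */
	atomic_store(&st->enqueued, false);
}
```

A callout that fires while the flag is still set simply skips its turn;
the next periodic callout after the work completes enqueues again.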
Reverting the whole change set just messes up many files uselessly
because changes to them (except for if.h) are proper.
- Remove ifa_pslist_entry that breaks kvm(3) users (e.g., netstat -ia)
- Change IFADDR_{READER,WRITER}_* macros to use old IFADDR_* (or just NOP)
for now
Fix PR kern/51325
Note that we leave the old list just in case; it seems there are some
kvm(3) users accessing the list. We can remove it later once we have
confirmed that nobody actually does.
A panic occurs in rn_match() called by encap[46]_lookup(). The reason is that
gif(4) does not suspend receive packet processing even though it suspends
transmit packet processing while anyone is doing a gif(4) ioctl.
To prevent calling softint_schedule() after softint_disestablish() has been
called, the following modifications are added:
+ ioctl (writing configuration) side
- turn off the IFF_RUNNING flag before changing configuration
- wait for softint handler completion before changing configuration
+ packet processing (reading configuration) side
- if the IFF_RUNNING flag is off, do nothing
+ in whole
- add gif_list_lock_{enter,exit} to prevent the same configuration from
being set on other gif(4) interfaces
Addresses of an interface (struct ifaddr) have a (reverse) pointer to an
interface object (ifa->ifa_ifp). If the addresses are guaranteed to be freed
when their interface is destroyed, the pointer is always valid and we don't
need the tweak of replacing the pointer with an if_index as was done for mbuf.
To guarantee that assumption, the following changes are required:
- Deactivate the interface near the beginning of if_detach. This prevents
in6_unlink_ifa from saving multicast addresses (wrongly)
- Invalidate rtcache(s) and clear a rtentry referencing an address on
RTM_DELETE. rtcache(s) may delay freeing an address
- Replace callout_stop with callout_halt for DAD timers to ensure that
such timers are stopped in if_detach
Basically we should insert an item into a collection (say a list) only after
the item's initialization has completed, to avoid accessing an item
that is only half initialized. ifaddr (in{,6}_ifaddr) isn't handled
that way and needs to be fixed.
In order to do so, we need to tweak {arp,nd6}_rtrequest, which depend
on an ifaddr being inserted during its initialization; they search the
interface's address list to determine whether rt_getkey(rt) of a given
rtentry is in the list, in order to know whether the route's interface
should be a loopback, which doesn't work after the change. To make it work,
first check the RTF_LOCAL flag that is set in rt_ifa_addlocal, which
eventually calls {arp,nd6}_rtrequest. Note that we still need the original
code for the case of removing and re-adding a local interface route.
- If NET_MPSAFE is not defined, IFQ_LOCK is a nop. Currently, that means
IFQ_ENQUEUE() on some paths such as bridge_enqueue() is wrongly called in
parallel.
- If ALTQ is enabled, Tx processing should call if_transmit() (= IFQ_ENQUEUE
+ ifp->if_start()) instead of ifp->if_transmit() to call ALTQ_ENQUEUE()
and ALTQ_DEQUEUE().
Furthermore, ALTQ processing currently always requires KERNEL_LOCK.
To this end, callers need to pass a struct psref to the functions
and the functions acquire a reference to ifp with it. In some cases,
we can simply use if_get_byindex; however, in other cases
(say rt->rt_ifp and ia->ifa_ifp), we have no MP-safe way for now.
In order to take a reference anyway, we use the non-MP-safe function
if_acquire_NOMPSAFE for the latter cases. They should be fixed in
the future somehow.
The motivation is the same as the mbuf's rcvif case; avoid having a pointer
to an ifnet object in ip_moptions and ip6_moptions, which is not MP-safe.
ip_moptions and ip6_moptions can be stored in a PCB for inet or inet6
whose lifetime differs from that of an ifnet, so an ifnet object can
disappear at any time while we access it via them. Thus we need to look up
an ifnet object by if_index every time for safety.
Having a pointer to an interface in a mbuf isn't safe if we remove the big
kernel locks; an interface object (ifnet) can be destroyed at any time during
packet processing, and accessing such an object via a pointer is racy. Instead
we have to get the object from the interface collection (ifindex2ifnet) via
an interface index (if_index) that is stored in the mbuf instead of a
pointer.
The change provides two APIs: m_{get,put}_rcvif_psref, which use psref(9)
for sleepable critical sections, and m_{get,put}_rcvif, which use
pserialize(9) for other critical sections. The change also adds another
API called m_get_rcvif_NOMPSAFE, which is NOT MP-safe and is meant for the
transition period, i.e., it is intended to be used in places that are not
planned to be MP-ified soon.
The change adds some overhead due to psref to performance-sensitive paths;
however, the overhead is not serious: 2% down at worst.
Proposed on tech-kern and tech-net.
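The index-instead-of-pointer idea can be sketched as follows. This is a
simplified userland illustration, not the kernel API: every name with a
_sketch suffix is hypothetical, the table size is arbitrary, and the
psref(9)/pserialize(9) protection that the real m_get_rcvif variants
provide around the lookup is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IF_SKETCH 8		/* arbitrary table size for the sketch */

struct ifnet_sketch {
	unsigned if_index;
	const char *if_xname;
};

/* Index-to-object table; a NULL slot means the interface is gone. */
static struct ifnet_sketch *ifindex2ifnet_sketch[MAX_IF_SKETCH];

/* The mbuf stores only the index, never a pointer to the interface. */
struct mbuf_sketch {
	unsigned rcvif_index;
};

static void
if_register_sketch(struct ifnet_sketch *ifp)
{
	assert(ifp != NULL && ifp->if_index < MAX_IF_SKETCH);
	ifindex2ifnet_sketch[ifp->if_index] = ifp;
}

static void
if_unregister_sketch(unsigned idx)
{
	assert(idx < MAX_IF_SKETCH);
	ifindex2ifnet_sketch[idx] = NULL;
}

/* Look the interface up on every use; NULL if it was destroyed meanwhile. */
static struct ifnet_sketch *
m_get_rcvif_sketch(const struct mbuf_sketch *m)
{
	if (m->rcvif_index >= MAX_IF_SKETCH)
		return NULL;
	return ifindex2ifnet_sketch[m->rcvif_index];
}
```

A caller has to handle a NULL result, which is what a stale pointer
could never signal.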
The API is used to set (or reset) the receiving interface of a mbuf.
They are the counterpart of m_get_rcvif, which will come in another
commit; they hide the internals of rcvif operations and reduce the diff of
the upcoming change.
No functional change.
can be included in kernels which need them without also duplicating
them in other modules. Removes the duplicate symbols I found which
prevented loading i2c and bpf modules after having fixed PR 45125.
ifnet_lock is a dedicated mechanism to safely destroy an interface while
ioctl operations are running. Replace it with a more generic mechanism
using psref(9).
The new API makes it possible to obtain an ifnet object protected by psref(9).
It is intended to be used where an obtained ifnet object is used across
sleepable operations.
The change ensures that ifnet objects in the ifnet list aren't freed during
list iterations by using pserialize(9) and psref(9).
Note that the change adds a pslist(9) for ifnet but doesn't remove the
original ifnet list (ifnet_list) to avoid breaking kvm(3) users. We
shouldn't use the original list in the kernel anymore.
We no longer need to change rtentry below if_output.
The change makes it clear where rtentries are changed (or not)
and helps forthcoming locking (or psref-ing) of rtentries.
rt_gwroute of rtentry is a reference to a rtentry of the gateway
for a rtentry with RTF_GATEWAY. That was used by L2 (arp and ndp)
to look up L2 addresses. By separating L2 nexthop caches, we don't
need a route for the purpose and we can stop using rt_gwroute.
By doing so, we can reduce referencing and modifying rtentries,
which makes it easy to apply a lock (and/or psref) to the
routing table and rtentries.
One issue in doing this is keeping the RTF_REJECT behavior. It seems it
was broken when we moved the rtalloc1 things from L2 output routines
(e.g., ether_output) to ip_hresolv_output, but (fortunately?)
it kept working by accident. What we got wrong:
- RTF_REJECT was checked for any routes in L2 output routines,
but in ip_hresolv_output it is checked only when the route
is RTF_GATEWAY
- The RTF_REJECT check wasn't copied to IPv6 (nd6_output)
It seems that the rt_gwroute checks hid the mistakes so that it appeared to
work (unexpectedly), and removing the rt_gwroute checks unveils the
issue. So we need to fix the RTF_REJECT checks in ip_hresolv_output
and also add them to nd6_output.
One more point we have to take care of is returning an errno; we need
to mimic looutput's behavior. Originally the RTF_REJECT check was
done either in L2 output routines or in looutput. The latter
applies when a reject route points to a loopback interface.
However, the RTF_REJECT check is now done before looutput, so to keep
the original behavior we need to return the errno that looutput
would choose. The newly added rt_check_reject_route does such tweaks.
Note that there is an issue that ioctls for an interface and a destruction
of the interface can run in parallel and it causes race conditions on
bridge as well (it rarely happens). The issue will be addressed in the
interface common code (if.c).
We need to enable it by default because bridge_input now runs
in softint, but bridge_input w/o BRIDGE_MPSAFE was designed on the
assumption that it runs in hardware interrupt.
Note that some racy code remains in bridge_output; it will be
fixed in the upcoming change (applying psref(9)).
The show arptab command of ddb is now inappropriate because it actually dumps
routes, but arp entries aren't routes anymore. So rename it to show routes
and move the code from if_arp.c to route.c.
ok christos@
By this change, nexthop caches (IP-MAC address pair) are not stored
in the routing table anymore. Instead nexthop caches are stored in
each network interface; we already have lltable/llentry data structure
for this purpose. This change also obsoletes the concept of cloning/cloned
routes. Cloned routes no longer exist, while cloning routes still exist
but are renamed to connected routes.
Noticeable changes are:
- Nexthop caches aren't listed in route show/netstat -r
- sysctl(NET_RT_DUMP) doesn't return them
- If RTF_LLDATA is specified, it returns nexthop caches
- Several definitions of routing flags and messages are removed
- RTF_CLONING, RTF_XRESOLVE, RTF_LLINFO, RTF_CLONED and RTM_RESOLVE
- RTF_CONNECTED is added
- It has the same value of RTF_CLONING for backward compatibility
- route's -xresolve, -[no]cloned and -llinfo options are removed
- -[no]cloning remains because it seems there are users
- -[no]connected is introduced and recommended
to be used instead of -[no]cloning
- route show/netstat -r drops some flags
- 'L' and 'c' are not seen anymore
- 'C' now indicates a connected route
- Gateway value of a route of an interface address is now not
a L2 address but "link#N" like a connected (cloning) route
- Proxy ARP: "arp -s ... pub" doesn't create a route
Details of the behavior changes can be found in the diffs under tests/.
Proposed on tech-net and tech-kern:
http://mail-index.netbsd.org/tech-net/2016/03/11/msg005701.html
introduced in the prior patch.
The queue has capacity to store 8 link state changes; if it overflows, the
oldest state change is lost, but the oldest DOWN state change is preserved
to ensure that any subsequent UP state changes are reflected properly.
Because there are only 3 states to queue, the queue itself is implemented
by storing 2-bit numbers in a bigger one.
To increase the size of the queue, just increase the size of the backing
store to a bigger number.
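A minimal sketch of a FIFO of 2-bit items packed into one integer, as
described above. It is not the committed code: the names (lq_*, LS_*)
and the encoding (state + 1, so that 0 marks an empty slot) are
assumptions made for illustration, and on overflow this sketch simply
drops the oldest entry; the special handling that preserves the oldest
DOWN change is omitted.

```c
#include <assert.h>
#include <stdint.h>

enum link_state_sketch { LS_UNKNOWN = 0, LS_DOWN = 1, LS_UP = 2 };

#define LQ_ITEM_BITS	2
#define LQ_ITEM_MASK	0x3u
/* A 16-bit backing store holds 8 items; a wider type would hold more. */
#define LQ_CAPACITY	((int)(sizeof(uint16_t) * 8 / LQ_ITEM_BITS))

/* Number of queued items; the oldest item lives in the low bits. */
static int
lq_length(uint16_t q)
{
	int n = 0;

	while (n < LQ_CAPACITY &&
	    ((q >> (n * LQ_ITEM_BITS)) & LQ_ITEM_MASK) != 0)
		n++;
	return n;
}

/* Append a state; when full, the oldest entry is shifted out first. */
static uint16_t
lq_push(uint16_t q, enum link_state_sketch s)
{
	int n = lq_length(q);

	assert(s <= LS_UP);
	if (n == LQ_CAPACITY) {
		q >>= LQ_ITEM_BITS;	/* drop the oldest entry */
		n--;
	}
	/* Stored as s + 1 so that an all-zero slot means "empty". */
	return (uint16_t)(q | ((unsigned)(s + 1) << (n * LQ_ITEM_BITS)));
}

/* Remove and return the oldest state, or -1 if the queue is empty. */
static int
lq_pop(uint16_t *q)
{
	unsigned item = *q & LQ_ITEM_MASK;

	if (item == 0)
		return -1;
	*q >>= LQ_ITEM_BITS;
	return (int)item - 1;
}
```

Widening the queue in this layout only requires changing the backing
type consistently (here uint16_t appears in the capacity macro and the
function signatures).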
The workaround was introduced because lltable/llentry uses rwlock
but it may be executed in hardware interrupt due to fast forward.
Now we don't run fast forward in hardware interrupt anymore, so
we can remove the workaround.
if_link_state_change can execute the network stack that is expected to
not run in hardware interrupt (at least now), however network drivers
may call it in hardware interrupt. Avoid that by introducing a new
softint for if_link_state_change.
The original patch is provided by mlelstv@ and tweaked a bit by me.
Should fix PR kern/50602.
Thanks to the introduction of softint-based if_input, the entire bridge code
now never runs in hardware interrupt context. So we can simplify the code.
- Remove spin mutexes
- They were needed because some code of bridge could run in
hardware interrupt context
- We now need only an adaptive mutex for each shared object
(a member list and a forwarding table)
- Remove pktqueue
- bridge_input is already in softint, using another softint
(for bridge_forward) is useless
- Packet distribution should be done in device drivers
This change intends to run the whole network stack in softint context
(or normal LWP), not hardware interrupt context. Note that the work is
still incomplete by this change; to that end, we also have to softint-ify
if_link_state_change (and bpf) which can still run in hardware interrupt.
This change softint-ifies at ifp->if_input, which is called from
each device driver (and ieee80211_input), to ensure Layer 2 runs
in softint (e.g., ether_input and bridge_input). To this end,
we provide a framework (called percpuq) that utilizes softint(9)
and percpu ifqueues. With this patch, rxintr of most drivers just
queues received packets and schedules a softint, and the softint
dequeues packets and does the rest of the packet processing.
To minimize changes to each driver, percpuq is allocated in struct
ifnet for now and that is initialized by default (in if_attach).
We probably have to move percpuq to softc of each driver, but it's
future work. At this point, only wm(4) has percpuq in its softc
as a reference implementation.
Additional information including performance numbers can be found
in the thread at tech-kern@ and tech-net@:
http://mail-index.netbsd.org/tech-kern/2016/01/14/msg019997.html
Acknowledgment: riastradh@ greatly helped this work.
Thank you very much!
This change was intended, but Nakahara-san had already made a better
one locally! So I'll let him commit that one, and I'll try not to
step on anyone's toes again.
Mostly mechanical change to replace it, culling some now-needless
boilerplate around all the users.
This does not substantively change the ip_encap API or eliminate
abuse of sketchy pointer casts -- that will come later, and will be
easier now that it is not tangled up with struct protosw.
You can't use this unless you know what it is a priori: the formal
prototype is variadic, and the different instances (e.g., ip_output,
route_output) have different real prototypes.
Convert the only user of it, raw_send in net/raw_cb.c, to take an
explicit callback argument. Convert the only instances of it,
route_output and key_output, to such explicit callbacks for raw_send.
Use assertions to make sure the conversion to explicit callbacks is
warranted.
Discussed on tech-net with no objections:
https://mail-index.netbsd.org/tech-net/2016/01/16/msg005484.html
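A sketch of the explicit-callback shape, under assumptions: the types
and signatures below are simplified and hypothetical (the real raw_send
takes more arguments), and the two *_output_sketch functions only return
a value so the dispatch is observable.

```c
#include <assert.h>
#include <stddef.h>

struct msg_sketch {
	int len;
};

/* One explicit, fully typed callback instead of a variadic protosw hook. */
typedef int (*raw_output_fn)(struct msg_sketch *m, void *ctx);

static int
raw_send_sketch(struct msg_sketch *m, void *ctx, raw_output_fn output)
{
	/* The caller must say which output routine it means. */
	assert(m != NULL && output != NULL);
	return output(m, ctx);
}

/* Hypothetical stand-ins for the two instances named in the log. */
static int
route_output_sketch(struct msg_sketch *m, void *ctx)
{
	(void)ctx;
	return m->len;
}

static int
key_output_sketch(struct msg_sketch *m, void *ctx)
{
	(void)ctx;
	return -m->len;
}
```

Because the callback type is spelled out, a mismatched function is
rejected at compile time rather than discovered at run time.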
llentry#la_opaque, which is for token ring, is allocated in arp.c
and freed in arp.c when freeing llentry. However, llentry can be
freed from other places, e.g., lltable_free. In such cases,
la_opaque is never freed.
To fix that, add a new callback (lle_ll_free) to llentry and
register a destruction function for la_opaque with it. On freeing a
llentry, we can reliably free la_opaque via the callback.
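A userland sketch of the destructor-callback idea. The struct layout and
the *_sketch functions are hypothetical and use malloc/free instead of
kernel allocators; only the field names la_opaque and lle_ll_free come
from the text above. The counter exists purely to make the callback
observable.

```c
#include <assert.h>
#include <stdlib.h>

struct llentry_sketch {
	void *la_opaque;	/* link-layer specific data */
	void (*lle_ll_free)(struct llentry_sketch *);	/* its destructor */
};

static int opaque_frees;	/* demonstration only: counts callback runs */

/* The owner of la_opaque registers this as the destructor. */
static void
arp_free_opaque_sketch(struct llentry_sketch *lle)
{
	free(lle->la_opaque);
	lle->la_opaque = NULL;
	opaque_frees++;
}

static struct llentry_sketch *
llentry_alloc_sketch(size_t opaque_size)
{
	struct llentry_sketch *lle = calloc(1, sizeof(*lle));

	if (lle == NULL)
		return NULL;
	if (opaque_size > 0) {
		lle->la_opaque = malloc(opaque_size);
		if (lle->la_opaque == NULL) {
			free(lle);
			return NULL;
		}
		lle->lle_ll_free = arp_free_opaque_sketch;
	}
	return lle;
}

/* The common free path: every caller releases la_opaque the same way. */
static void
llentry_free_sketch(struct llentry_sketch *lle)
{
	assert(lle != NULL);
	if (lle->lle_ll_free != NULL)
		lle->lle_ll_free(lle);
	free(lle);
}
```

Since the destructor travels with the entry, a free issued from any
place releases the attached data without knowing what it is.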
lltable and llentry were introduced to replace ARP cache data structure
for further restructuring of the routing table: L2 nexthop cache
separation. This change replaces the NDP cache data structure
(llinfo_nd6) with them as well as ARP.
One noticeable change is to the neighbor cache GC mechanism that was
introduced to prevent IPv6 DoS attacks. net.inet6.ip6.neighborgcthresh
was the max number of caches that we store in the system. After
introducing lltable/llentry, the value is changed to be on a per-interface
basis because lltable/llentry stores neighbor caches in each interface
separately. And the change brings one degradation; the old GC mechanism
dropped excess packets based on LRU while the new implementation drops
packets in order from the beginning of lltable (a hash table + linked
lists). It would be improved in the future.
Added functions in in6.c come from FreeBSD (as of r286629) and are
tweaked for NetBSD.
Proposed on tech-kern and tech-net.
Using softnet_lock for mutual exclusion between lltable_free and
arptimer was wrong and caused a deadlock between
them; lltable_free waits for arptimer to complete by calling
callout_halt with softnet_lock, which is held in arptimer; however,
lltable_free also holds llentry's lock, which is also taken in
arptimer, so arptimer never obtains the lock and neither makes
progress. We have to pass llentry's lock to
callout_halt instead.
modular/filesystem. In the non-modular case we initialize through attach.
In the modular/builtin case we define the module to be class misc so it
attaches late (after percpu is initialized) since driver modules attach
too early. In the modular/filesystem case we define it to be a driver
module since we autoload it via /dev/npf open.
are initialized. MODULE_CLASS_DRIVER modules are now initialized before
autoconfiguration starts, but npf_init has a dependency on percpu(9) which
doesn't work until CPUs have attached (at least on ARM).
Callers of arpresolve() now pass the error code back to their caller,
masking out EWOULDBLOCK.
This allows applications such as ping(8) to display a suitable error
condition.
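A sketch of the masking described above. The function name is
hypothetical, and it rests on an assumption not stated in the log: that
EWOULDBLOCK from the resolver means the packet is being held until
resolution completes, so the caller should not see it as a failure.

```c
#include <assert.h>
#include <errno.h>

/*
 * resolve_error stands in for the value returned by arpresolve().
 * Returns what the output routine would hand back to its own caller.
 */
static int
output_after_resolve_sketch(int resolve_error)
{
	assert(resolve_error >= 0);

	if (resolve_error != 0) {
		/* Assumed meaning: packet queued, resolution in progress. */
		if (resolve_error == EWOULDBLOCK)
			return 0;
		/* Any other error is passed back unchanged. */
		return resolve_error;
	}

	/* ... address resolved: the frame would be transmitted here ... */
	return 0;
}
```

Errors other than EWOULDBLOCK now travel up the call chain, which is
what lets a program such as ping(8) report them.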
In its place, use rtrequest1() inside rt_ifa_addlocal() and
rtdeletemsg() inside rt_ifa_remlocal().
This removes the need for INET/INET6 specific code and allows
greater control over the creation of the local address route.
Currently RX can run on a CPU other than CPU#0, so always enqueuing
to the pktqueue of CPU#0 makes no sense. Let's use curcpu's pktqueue,
although the bridge_forward softint doesn't run in parallel without
NET_MPSAFE.
This is a temporary solution. We need a fundamental solution.
With GATEWAY (fastforward), the whole forwarding processing runs in
hardware interrupt context. So we cannot use rwlock for lltable and
llentry in that case.
This change replaces rwlock with mutex(IPL_NET) for lltable and llentry
when GATEWAY is enabled. We need to tweak locking only around rtree
in lltable_free. Other than that, what we need to do is to change macros
for locks.
I hope fastforward runs in softint some day in the future...
The previous code took locks in the following order:
- LLE_WLOCKs
- mutex_enter(softnet_lock)
- LLE_WUNLOCKs
- mutex_exit(softnet_lock)
This fix moves mutex_enter(softnet_lock) before LLE_WLOCKs.
We always have to hold softnet_lock when touching la_rt. And we have to
use callout_halt with softnet_lock instead of callout_stop for
la_timer (arptimer) because arptimer holds softnet_lock inside it.
This fix may solve a kernel panic christos@ encountered.
Highlights of the change are:
- Use llentry instead of llinfo to manage ARP caches
- ARP specific data are stored in the hashed list
of an interface instead of the global list (llinfo_arp)
- Fine-grain locking on llentry
- arptimer (callout) per ARP cache
- the global timer callout with the big locks can be
removed (though softnet_lock is still required for now)
- net.inet.arp.prune is now obsoleted
- it was the interval of the global timer callout
- net.inet.arp.refresh is now obsoleted
- it was a parameter that prevents expiration of active caches
- Removed to simplify the timer logic, but we may be able to
restore the feature if really needed
Proposed on tech-kern and tech-net.
lltable/llentry are new L2 nexthop cache data structures that
store caches in each interface (struct ifnet). They are imported
to replace the current ARP cache implementation that uses the
global list with the big kernel lock, and to provide fine-grain
locking for cache operations. They are also planned to replace
NDP caches.
The code is based on FreeBSD's lltable/llentry as of r286629
and tweaked for NetBSD.
rtrequest has already done it. So we don't need to do it once more.
This fixes a regression in ARP cache expiration in which an expired
cache doesn't disappear.
Some code in sys/net* uses time_second to manage time periods such as
cache expirations. However, time_second doesn't increase monotonically
and can leap, say via settimeofday(2), according to time_second(9). We
should use time_uptime instead to avoid such time leaps.
This change replaces time_second with time_uptime. Additionally it
converts a time based on time_uptime to a time based on time_second
when the kernel passes the time to userland programs that expect
the latter, and vice versa.
Note that we shouldn't leak time_uptime to other hosts over the
network. My investigation shows there is no such leak:
http://mail-index.netbsd.org/tech-net/2015/08/06/msg005332.html
Discussed on tech-kern and tech-net.
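The conversion between the two time bases can be sketched as below. The
helper names and the explicit now_wall/now_mono parameters are
hypothetical; in the kernel the current values would come from
time_second and time_uptime rather than being passed in.

```c
#include <assert.h>
#include <time.h>

/*
 * Convert a timestamp kept on the monotonic (uptime) base to the
 * wall-clock base, given the current value of both clocks.
 */
static time_t
mono_to_wall_sketch(time_t t_mono, time_t now_wall, time_t now_mono)
{
	assert(now_mono >= 0);
	return t_mono + (now_wall - now_mono);
}

/* The inverse: a wall-clock time from userland onto the monotonic base. */
static time_t
wall_to_mono_sketch(time_t t_wall, time_t now_wall, time_t now_mono)
{
	assert(now_mono >= 0);
	return t_wall - (now_wall - now_mono);
}
```

For example, an expiration stored as uptime 150 while the uptime is 100
is 50 seconds away. If the wall clock is stepped, the converted value
moves with it, but the remaining interval stays 50 seconds, which is the
property the change is after.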