The change to if_spppsubr.c moves the test for whether LCP should
request a mru change until after the pppoe device has picked up the
mtu of the underlying ethernet device.
returned to userland by read(2) also needs to be converted.
For this, the bpf descriptor is flagged as compat32 (or not) in the
open and ioctl functions (where the user process's pid is also updated
in the descriptor). When the bpf buffer is filled in, the 32-bit or native
header is used depending on the information stored in the descriptor.
This won't work if a 64-bit binary does the open and ioctls, and then
execs a 32-bit program which will do the read. But this is very
unlikely to happen in real life ...
Tested on i386 and loongson; with these changes my loongson can run
dhclient and tcpdump with an n32 userland.
sys/stdarg.h and expect compiler to provide proper builtins, defaulting
to the GCC interface. lint still has a special fallback.
Reduce abuse of _BSD_VA_LIST_ by defining __va_list by default and
derive va_list as required by standards.
smpls_addrs in sockaddr_mpls. The number of smpls_addrs is found from
smpls_len. First label encountered is BoS.
XXX: need to do the same for LSE and this feature needs to be documented.
This is still somewhat experimental. Tested between 2 similar boxes
so far. There is much potential for performance improvement. For now,
I've changed the gmac code to accept any data alignment, as the "char *"
pointer suggests. As the code is practically used, 32-bit alignment
can be assumed, at the cost of data copies. I don't know whether
bytewise access or copies are worse performance-wise. For efficient
implementations using SSE2 instructions on x86, even stricter
alignment requirements might arise.
will have an easier time replacing it with something different, even if
it is a second radix-trie implementation.
sys/net/route.c and sys/net/rtsock.c no longer operate directly on
radix_nodes or radix_node_heads.
Hopefully this will reduce the temptation to implement multipath or
source-based routing using grotty hacks to the grotty old radix-trie
code, too. :-)
M_NOWAIT cause dhcpd on a low-memory server with lots of interfaces to
occasionally fail to start with ENOBUFS; (M_WAITOK | M_CANFAIL) seems to
fix this.
Tested on 3 different dhcp servers.
- Add libnpf(3) - a library to control NPF (configuration, ruleset, etc).
- Add NPF support for ftp-proxy(8).
- Add rc.d script for NPF.
- Convert npfctl(8) to use libnpf(3) and thus make it less depressive.
Note: the next clean-up step should be a parser, once dholland@ finishes it.
- Add more documentation.
- Various fixes.
- Add the concept of rule procedure: separate normalization, logging and
potentially other functions from the rule structure. Rule procedure can be
shared amongst the rules. Separation is both at kernel level (npf_rproc_t)
and configuration ("procedure" + "apply").
- Fix portmap sharing for NAT policy.
- Update TCP state tracking logic. Use TCP FSM definitions.
- Add if_byindex(), OK by matt@. Use in logging for the lookup.
- Fix traceroute ALG and many other bugs; misc clean-up.
- Add support for session saving/restoring.
- Add packet logging support (can tcpdump a pseudo-interface).
- Support reload without flushing of sessions; rework some locking.
- Revisit session management, replace linking with npf_sentry_t entries.
- Add some counters for statistics, using percpu(9).
- Add IP_DF flag cleansing.
- Fix various bugs; misc clean-up.
by zero while validating the bpf program.
originally spotted by skrll@, and broke the month-old atf test for
this exact problem: net_bpf_t_div-by-zero_div_by_zero.
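
A rough sketch of the kind of check involved, for illustration only. This
is not the kernel's validator: struct ex_insn, the EX_OP_* opcodes and
ex_validate() are hypothetical stand-ins, and only a constant divisor can
be rejected ahead of time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified instruction format (not the real struct bpf_insn). */
struct ex_insn {
	uint16_t code;	/* operation selector */
	uint32_t k;	/* constant operand */
};

/* Hypothetical opcodes, used by this sketch only. */
enum { EX_OP_ADD_K = 1, EX_OP_DIV_K = 2, EX_OP_RET = 3 };

/* Return 1 if the program is acceptable, 0 if it must be rejected. */
int
ex_validate(const struct ex_insn *prog, size_t len)
{
	if (prog == NULL || len == 0)
		return 0;
	for (size_t i = 0; i < len; i++) {
		/* Division by a constant zero can be caught before running. */
		if (prog[i].code == EX_OP_DIV_K && prog[i].k == 0)
			return 0;
	}
	/* This sketch also requires the last instruction to be a return. */
	return prog[len - 1].code == EX_OP_RET;
}
```

A program containing { EX_OP_DIV_K, 0 } is refused up front, so the
interpreter never reaches the division.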
- Add proper TCP state tracking as described in Guido van Rooij's paper,
plus handle TCP Window Scaling option.
- Completely rework npf_cache_t, reduce granularity, simplify code.
- Add npf_addr_t as an abstraction, amend session handling code, as well
as NAT code et al, to use it. Now design is prepared for IPv6 support.
- Handle IPv4 fragments i.e. perform packet reassembly.
- Add support for IPv4 ID randomization and minimum TTL enforcement.
- Add support for TCP MSS "clamping".
- Random bits for IPv6. Various fixes and clean-up.
1. Fix inverted node order, so that negative value from comparison operator
would represent lower (left) node, and positive - higher (right) node.
2. Add an argument (i.e. "context"), passed to comparison operators.
3. Change rb_tree_insert_node() to return a node - either inserted one or
already existing one.
4. Amend the interface to manipulate the actual object, instead of the
rb_node (in a similar way as Patricia-tree interface does).
5. Update all RB-tree users accordingly.
XXX: Perhaps rename rb.h to rbtree.h, since cleaning-up..
1-3 address the PR/43488 by Jeremy Huddleston.
Passes RB-tree regression tests.
Reviewed by: matt@, christos@
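
Items 1-4 can be illustrated with a small sketch. Note that this is a
plain unbalanced binary search tree, not a red-black tree and not the
actual rb_tree interface; every ex_* name is hypothetical. It only shows
the calling conventions: a comparator that receives a context pointer and
returns negative for "lower (left)", and an insert that works on the
object itself and returns either the new object or the one already there.

```c
#include <assert.h>
#include <stddef.h>

struct ex_item {
	int key;
	struct ex_item *left, *right;
};

/* Negative: a sorts lower (left of b). Positive: higher (right of b). */
typedef int (*ex_cmp_fn)(void *ctx, const struct ex_item *a,
    const struct ex_item *b);

struct ex_tree {
	struct ex_item *root;
	ex_cmp_fn cmp;
	void *ctx;	/* handed to every comparison */
};

int
ex_cmp_int(void *ctx, const struct ex_item *a, const struct ex_item *b)
{
	unsigned *ncalls = ctx;	/* here the context just counts comparisons */

	if (ncalls != NULL)
		(*ncalls)++;
	return (a->key > b->key) - (a->key < b->key);
}

/* Insert item; return it, or the equal item that was already present. */
struct ex_item *
ex_tree_insert(struct ex_tree *t, struct ex_item *item)
{
	struct ex_item **link = &t->root;

	while (*link != NULL) {
		int d = t->cmp(t->ctx, item, *link);

		if (d == 0)
			return *link;
		link = (d < 0) ? &(*link)->left : &(*link)->right;
	}
	item->left = item->right = NULL;
	*link = item;
	return item;
}
```

A caller can detect a duplicate by comparing the returned pointer with the
one it passed in.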
- Add support for bi-directional NAT and redirection / port forwarding.
- Finish filtering on ICMP type/code and add filtering on TCP flags.
- Add support for TCP reset (RST) or ICMP destination unreachable on block.
- Fix a bunch of bugs; misc cleanup.
hosts. IPv6 is probably still broken, and, actually, the lookup table
for mask values should be kept in network byte order, not host byte order
and the corresponding change to the srtconfig ioctl interface made.
But at least this works.
1) RFC2367 says in 2.3.3 Address Extension: "All non-address
information in the sockaddrs, such as sin_zero for AF_INET sockaddrs,
and sin6_flowinfo for AF_INET6 sockaddrs, MUST be zeroed out."
the IPSEC_NAT_T code was expecting the port information it needs
to be conveyed in the sockaddr instead of exclusively by
SADB_X_EXT_NAT_T_SPORT and SADB_X_EXT_NAT_T_DPORT,
and was not zeroing out the port information in the non-nat-traversal
case.
Since it was expecting the port information to reside in the sockaddr
it could get away with (re)setting the ports after starting to use them.
-> Set the natt ports before setting the SA mature.
2) RFC3947 has two Original Address fields, initiator and responder,
so we need SADB_X_EXT_NAT_T_OAI and SADB_X_EXT_NAT_T_OAR and not just
SADB_X_EXT_NAT_T_OA
The change has been created using vanhu's patch for FreeBSD as reference.
Note that establishing actual nat-t sessions has not yet been tested.
Likely fixes the following:
PR bin/41757
PR net/42592
PR net/42606
- Designed to be fully MP-safe and highly efficient.
- Tables/IP sets (hash or red-black tree) for high performance lookups.
- Stateful filtering and Network Address Port Translation (NAPT).
Framework for application level gateways (ALGs).
- Packet inspection engine called n-code processor - inspired by BPF -
supporting generic RISC-like and specific CISC-like instructions for
common patterns (e.g. IPv4 address matching). See npf_ncode(9) manual.
- Convenient userland utility npfctl(8) with npf.conf(8).
NOTE: This is not yet a fully capable alternative to PF or IPFilter.
Further work (support for binat/rdr, return-rst/return-icmp, common ALGs,
state saving/restoring, logging, etc) is in progress.
Thanks a lot to Matt Thomas for various useful comments and code review.
Aye by: board@
to find and unlink routes that reference the detached ifnet: make
if_rt_walktree() return ERESTART whenever it has deleted a route.
Whenever rt_walktree() returns ERESTART, if_detach() restarts it.
I believe that this fix resembles one by Jonathan Kollasch or by someone
else, which has languished in a PR for too long. Sorry!
Tested by me and by Jeff Rizzo.
XXX It's supposed to be safe for rn_walktree() to apply to the routing
XXX table a routine that may delete routes. Why isn't it safe in
XXX practice?
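
The restart pattern can be sketched outside the kernel roughly like this.
It is a toy model, not the routing code: the flat route table, the ex_*
names and the EX_ERESTART value are all made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define EX_ERESTART	(-3)	/* stand-in for the kernel's ERESTART */
#define EX_NROUTES	8

struct ex_route {
	int in_use;
	int ifindex;	/* interface this route points at */
};

typedef int (*ex_walk_fn)(struct ex_route *tbl, size_t slot, void *arg);

/* Visit every live route; stop at the first non-zero callback result. */
int
ex_walktree(struct ex_route *tbl, ex_walk_fn fn, void *arg)
{
	for (size_t i = 0; i < EX_NROUTES; i++) {
		int error;

		if (!tbl[i].in_use)
			continue;
		error = fn(tbl, i, arg);
		if (error != 0)
			return error;
	}
	return 0;
}

/* Delete a route that references the interface, then request a restart. */
int
ex_if_rt_walktree(struct ex_route *tbl, size_t slot, void *arg)
{
	int ifindex = *(int *)arg;

	if (tbl[slot].ifindex != ifindex)
		return 0;
	tbl[slot].in_use = 0;
	return EX_ERESTART;
}

/* Keep walking from the top until a full pass deletes nothing. */
void
ex_if_detach(struct ex_route *tbl, int ifindex)
{
	while (ex_walktree(tbl, ex_if_rt_walktree, &ifindex) == EX_ERESTART)
		continue;
}
```

Each pass deletes at most one route, so the walker never continues from a
position that a deletion may have invalidated.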
These annotations help to mitigate false sharing on multiprocessor
systems.
Variables annotated with __cacheline_aligned are placed into the
.data.cacheline_aligned section in the kernel. Each item in this
section is aligned on a cacheline boundary - this avoids false
sharing. Highly contended global locks are a good candidate for
__cacheline_aligned annotation.
Variables annotated with __read_mostly are packed together tightly
into a .data.read_mostly section in the kernel. The idea here is that
we can pack infrequently modified data items into a cacheline and
avoid having to purge the cache, which would happen if read mostly
data and write mostly data shared a cacheline. Initialisation variables
are a prime candidate for __read_mostly annotations.
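
A userland sketch of the alignment half of this, assuming a GCC-compatible
compiler and a 64-byte cacheline. EX_CACHELINE_ALIGNED and the other ex_*
names are hypothetical; the real macros also place the variable into a
dedicated linker section, which is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

#define EX_CACHE_LINE_SIZE	64	/* assumed; really per-architecture */
#define EX_CACHELINE_ALIGNED	__attribute__((aligned(EX_CACHE_LINE_SIZE)))

struct ex_lock {
	volatile unsigned locked;
};

/* Two contended locks, each starting on its own cacheline. */
struct ex_lock ex_sched_lock EX_CACHELINE_ALIGNED = { 0 };
struct ex_lock ex_vm_lock EX_CACHELINE_ALIGNED = { 0 };

/* Non-zero if p sits on a cacheline boundary. */
int
ex_is_line_aligned(const void *p)
{
	return ((uintptr_t)p % EX_CACHE_LINE_SIZE) == 0;
}

/* Non-zero if a and b start within the same cacheline. */
int
ex_same_line(const void *a, const void *b)
{
	return ((uintptr_t)a / EX_CACHE_LINE_SIZE) ==
	    ((uintptr_t)b / EX_CACHE_LINE_SIZE);
}
```

Because both locks start on a line boundary, a CPU spinning on one does
not invalidate the line holding the other.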
- better named one
- not suffering from buffer overflow
- simpler
- handling different separators
- returning error codes for errors
Some ideas from one posted on tech-net by Jonathan A. Kollasch
Be more lenient on input string format. Each nibble pair may optionally be
followed by any of ':', '-', '.' or ' '.
Make source string const and work on a temporary copy. The caller may not
expect their string to be destroyed.
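
A sketch of a parser with these properties, for illustration. The name
ex_ether_aton() and its exact return values are assumptions, not the
committed function; this version never writes to the source string, so it
needs no temporary copy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int
ex_hexval(int c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Parse six nibble pairs into dst. Returns 0, or EINVAL on bad input. */
int
ex_ether_aton(uint8_t dst[6], const char *src)
{
	uint8_t tmp[6];

	for (size_t i = 0; i < 6; i++) {
		int hi, lo;

		hi = ex_hexval((unsigned char)src[0]);
		if (hi < 0)
			return EINVAL;	/* also catches a short string */
		lo = ex_hexval((unsigned char)src[1]);
		if (lo < 0)
			return EINVAL;
		tmp[i] = (uint8_t)((hi << 4) | lo);
		src += 2;
		/* Each pair may be followed by one optional separator. */
		if (*src != '\0' && strchr(":-. ", *src) != NULL)
			src++;
	}
	if (*src != '\0')
		return EINVAL;		/* trailing garbage */
	memcpy(dst, tmp, sizeof(tmp));	/* dst is only written on success */
	return 0;
}
```

With this sketch, "00:11-22.33 44:55" and "001122334455" both parse to the
same six bytes, and dst is left alone when an error is returned.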
kern/39940 and by Martti Kuparinen on current-users@: replace the
ioctl lock with finer-grained locking. Lock the ports list and
wait to if_clone_destroy() until all threads are out of the softc.
Thanks to Martti Kuparinen for testing these changes.
#if NBPFILTER is no longer required in the client. This change
doesn't yet add support for loading bpf as a module, since drivers
can register before bpf is attached. However, callers of bpf can
now be modularized.
Dynamically loadable bpf could probably be done fairly easily with
coordination from the stub driver and the real driver by registering
attachments in the stub before the real driver is loaded and doing
a handoff. ... and I'm not going to ponder the depths of unload
here.
Tested with i386/MONOLITHIC, modified MONOLITHIC without bpf and rump.
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567
do drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.
- Drop the INET6 block. The commands are never given to this function
and truncating the sockaddr is arguably not the desired result anyway.
- Clear the address before copying. This fixes SIOCGIFNETMASK and possible
  other ioctls for users that don't check sa_len. This includes
COMPAT_43 and Linux emulation.
OK dyoung@
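
Why clearing first matters can be shown with a small sketch. struct
ex_sockaddr and ex_copy_addr() are hypothetical stand-ins for the real
sockaddr handling; the point is only that a short source address (a
netmask, for instance) must not leave stale bytes behind in the
destination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a BSD-style sockaddr with a leading length byte. */
struct ex_sockaddr {
	unsigned char sa_len;
	unsigned char sa_family;
	char sa_data[14];
};

/* Copy at most sa_len bytes into dst; the rest of dst ends up zeroed. */
void
ex_copy_addr(struct ex_sockaddr *dst, const struct ex_sockaddr *src)
{
	size_t len = src->sa_len;

	if (len > sizeof(*dst))
		len = sizeof(*dst);
	memset(dst, 0, sizeof(*dst));	/* clear before copying */
	memcpy(dst, src, len);
}
```

A consumer that ignores sa_len and reads the whole structure then sees
zeros past the valid part instead of whatever was in the buffer before.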
The pfsync interface exposes changes in pf(4) over a pseudo-interface, and
can be used to synchronise different pf instances.
This work was part of my 2009 GSoC.
No objection on tech-net@
addresses. Make the kernel support SIOC[SG]IFADDRPREF for IPv6
interface addresses.
In in6ifa_ifpforlinklocal(), consult preference numbers before
making an otherwise arbitrary choice of in6_ifaddr. Otherwise,
preference numbers are *not* consulted by the kernel, but that will
be rather easy for somebody with a little bit of free time to fix.
Please note that setting the preference number for a link-local
IPv6 address does not work right, yet, but that ought to be fixed
soon.
In support of the changes above,
1 Add a method to struct domain for "externalizing" a sockaddr, and
provide an implementation for IPv6. Expect more work in this area: it
may be more proper to say that the IPv6 implementation "internalizes"
a sockaddr. Add sockaddr_externalize().
2 Add a subroutine, sofamily(), that returns a struct socket's address
family or AF_UNSPEC.
3 Make a lot of IPv4-specific code generic, and move it from
sys/netinet/ to sys/net/ for re-use by IPv6 parts of the kernel and
ifconfig(8).
queue's maximum length, current length, and number of drops. E.g.,
% sysctl net.interfaces.bnx0
net.interfaces.bnx0.sndq.len = 0
net.interfaces.bnx0.sndq.maxlen = 509
net.interfaces.bnx0.sndq.drops = 0
Let userland adjust the maximum queue length.
While I'm here, add a 64-bit generation number, if_index_gen, to
ifnet; the pair [ifp->if_index, ifp->if_index_gen] can serve to
identify an ifnet for the lifetime of the system. I will use this
in an upcoming change.
Ok matt@.
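
One way such a pair could be used, sketched under assumptions: the ex_*
types are hypothetical, and drawing the generation from a global counter
at attach time is a guess made for the example, not a description of the
committed code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical miniature of the two ifnet fields discussed above. */
struct ex_ifnet {
	unsigned if_index;
	uint64_t if_index_gen;
};

/* A saved reference to an interface that may since have gone away. */
struct ex_ifid {
	unsigned index;
	uint64_t gen;
};

static uint64_t ex_next_gen;	/* assumed: one global, never reused */

void
ex_if_attach(struct ex_ifnet *ifp, unsigned index)
{
	ifp->if_index = index;
	ifp->if_index_gen = ex_next_gen++;
}

struct ex_ifid
ex_ifid_of(const struct ex_ifnet *ifp)
{
	struct ex_ifid id = { ifp->if_index, ifp->if_index_gen };

	return id;
}

/* True only if ifp is the very interface the id was taken from. */
int
ex_ifid_matches(const struct ex_ifid *id, const struct ex_ifnet *ifp)
{
	return id->index == ifp->if_index && id->gen == ifp->if_index_gen;
}
```

An index alone can be recycled after a detach; the generation tells the
old interface apart from a new one that received the same index.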
These changes allow vlans to be layered above agr, with the attach
and detach propagated to the member ports in the aggregation.
Note the agr interface must be up before the vlan is attached.
Adds SIOCINITIFADDR support to the wm driver for setting the AF_LINK
address, necessary for agr to be able to set the mac addresses of each
port to the agr address (i.e. so it can receive all intended traffic
at the hardware level).
Adds support for disabling the LACP protocol by setting LINK1 on the agr
interface (e.g. ifconfig agr0 link1).
In consultation with tls@.
will be processed when the radix "subsystem" is initialized -- all
users must be attached before any inits to know the max keylength.
Use of link sets is no longer required, and only attached domains
need to be considered.
BRDGGFILT and BRDGSFILT bridge controls are only available with BRIDGE_IPF and PFIL_HOOKS defined.
In amd64 GENERIC and XEN kernel configs PFIL_HOOKS is defined but BRIDGE_IPF is not.
When a BRDGGFILT or BRDGSFILT command comes in, then ifd->ifd_cmd is not in range
of bridge_control_table_size. Then bc is not set and is dereferenced
later => BOOM.
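
The failure mode and the obvious guard can be sketched as follows. This is
a simplified model with hypothetical ex_* names, not the bridge code: the
table here has only two entries, standing in for a kernel built without
the filter commands.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ex_bridge_control {
	int (*bc_func)(void *arg);
};

int ex_cmd_add(void *arg) { (void)arg; return 0; }
int ex_cmd_del(void *arg) { (void)arg; return 0; }

/* Commands compiled into this (hypothetical) kernel configuration. */
const struct ex_bridge_control ex_control_table[] = {
	{ ex_cmd_add },
	{ ex_cmd_del },
	/* filter commands would only appear here with the right options */
};
const size_t ex_control_table_size =
    sizeof(ex_control_table) / sizeof(ex_control_table[0]);

int
ex_bridge_ioctl(unsigned long cmd, void *arg)
{
	const struct ex_bridge_control *bc;

	/* Reject out-of-range commands before bc is ever dereferenced. */
	if (cmd >= ex_control_table_size)
		return EINVAL;
	bc = &ex_control_table[cmd];
	return bc->bc_func(arg);
}
```

Without the range check, bc would stay unset for a command number beyond
the table and the call through it would crash.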
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.
Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.
thr0 accept(fd, ...)
thr1 close(fd)
context (reported on various mailing-lists, and part of PR kern/41114,
causing panic in pf(4) and possibly ipf(4) when BRIDGE_IPF is used).
Defer bridge_forward() to a software interrupt; bridge_input() enqueues
mbufs to ifp->if_snd which is handled in bridge_forward().
RT_ prefix and use them appropriately, instead of making copies. Make
pppd use the RT_ROUNDUP macro; fixes proxyarp setting on 64 bit hosts.
XXX: All this should be pulled up to 5.0
Note the vlan interface does not see updates to the parent's capabilities
so if, for example, TSO is on in both, then turned off in the parent it
will remain on in the vlan interface.
Extends the Opencrypto API to allow the destination buffer size to be
specified when it's not the same size as the input buffer (i.e. for
operations like compress and decompress).
The crypto_op and crypt_n_op structures gain a u_int dst_len field.
The session_op structure gains a comp_alg field to specify a compression
algorithm.
Moved four ioctls to new ids; CIOCGSESSION, CIOCNGSESSION, CIOCCRYPT,
and CIOCNCRYPTM.
Added four backward compatible ioctls; OCIOCGSESSION, OCIOCNGSESSION,
OCIOCCRYPT, and OCIOCNCRYPTM.
Backward compatibility is maintained in ocryptodev.h and ocryptodev.c which
implement the original ioctls and set dst_len and comp_alg to 0.
Adds user-space access to compression features.
Adds software gzip support (CRYPTO_GZIP_COMP).
Adds the fast version of crc32 from zlib to libkern. This should be generally
useful and provide a place to start normalizing the various crc32 routines
in the kernel. The crc32 routine is used in this patch to support GZIP.
With input and support from tls@NetBSD.org.
There are still about 1600 left, but they have ',' or /* ... */
in the actual variable definitions - which my awk script doesn't handle.
There are also many that need () -> (void).
(The script does handle misordered arguments.)
* A sign extension error creating the bridge ID corrupted the
priority (always making it the maximum).
* Do not catch STP packets on an interface for which STP is not
enabled -- it's a violation of the spec, and causes STP to fail on
neighboring bridges.
* An optimization to bstp_input() -- some information is already
known when we call it.
contributed anonymously.
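
For the first item, here is one plausible way a sign extension can clobber
the priority. It is an illustration, not the original code: it assumes a
64-bit bridge ID with the 16-bit priority on top of a 6-byte address held
in signed chars, and the ex_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy: a negative address byte is sign-extended before the shift. */
uint64_t
ex_bridge_id_buggy(uint16_t prio, const signed char mac[6])
{
	uint64_t id = (uint64_t)prio << 48;

	for (int i = 0; i < 6; i++)
		id |= (uint64_t)mac[i] << (8 * (5 - i));
	return id;
}

/* Fixed: narrow to an unsigned byte first, then widen. */
uint64_t
ex_bridge_id(uint16_t prio, const signed char mac[6])
{
	uint64_t id = (uint64_t)prio << 48;

	for (int i = 0; i < 6; i++)
		id |= (uint64_t)(uint8_t)mac[i] << (8 * (5 - i));
	return id;
}
```

In the buggy version an address byte with its top bit set widens to a
value whose upper bits are all ones, and those bits are ORed over the
priority field, forcing it to 0xffff.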