packet handler. Change the default policy to block when the config is
loaded and set it to pass when flush operation is performed.
- Use kmem_zalloc(9) instead of kmem_alloc(9) in few places.
- npf_rproc_{create,release}: use kmem_intr_{alloc,free} as the destruction
of rule procedure might happen in the interrupt handler (under a very rare
condition, if config reload races with the handler).
- npf_session_establish: check whether layer 3 and 4 are cached.
- npfctl_build_group: do not make groups as passing rules.
- Remove some unecessary header inclusion.
- Enable checking for zero mask in IP{4,6}MATCH after npfctl changes.
- Make locking symmetric for npf_ruleset_inspect().
- Sync function prototypes in npf(3) man page with reality.
- Rename NPF_TABLE_RBTREE to NPF_TABLE_TREE.
implementation. Rewrite pseudodevice code to use cprng_strong(9).
The new pseudodevice is cloning, so each caller gets bits from a stream
generated with its own key. Users of /dev/urandom get their generators
keyed on a "best effort" basis -- the kernel will rekey generators
whenever the entropy pool hits the high water mark -- while users of
/dev/random get their generators rekeyed every time key-length bits
are output.
The underlying cprng_strong API can use AES-256 or AES-128, but we use
AES-128 because of concerns about related-key attacks on AES-256. This
improves performance (and reduces entropy pool depletion) significantly
for users of /dev/urandom but does cause users of /dev/random to rekey
twice as often.
Also fixes various bugs (including some missing locking and a reseed-counter
overflow in the CTR_DRBG code) found while testing this.
For long reads, this generator is approximately 20 times as fast as the
old generator (dd with bs=64K yields 53MB/sec on 2Ghz Core2 instead of
2.5MB/sec) and also uses a separate mutex per instance so concurrency
is greatly improved. For reads of typical key sizes for modern
cryptosystems (16-32 bytes) performance is about the same as the old
code: a little better for 32 bytes, a little worse for 16 bytes.
prefix does not have IFA_ROUTE.
Don't scrub the interface in SIOCAIFADDR if the new address does't
have IFA_ROUTE. If more functions are added to in_ifscrub then this logic
might need to be revisited.
Fixes PR/26450.
<20111022023242.BA26F14A158@mail.netbsd.org>. This change includes
the following:
An initial cleanup and minor reorganization of the entropy pool
code in sys/dev/rnd.c and sys/dev/rndpool.c. Several bugs are
fixed. Some effort is made to accumulate entropy more quickly at
boot time.
A generic interface, "rndsink", is added, for stream generators to
request that they be re-keyed with good quality entropy from the pool
as soon as it is available.
The arc4random()/arc4randbytes() implementation in libkern is
adjusted to use the rndsink interface for rekeying, which helps
address the problem of low-quality keys at boot time.
An implementation of the FIPS 140-2 statistical tests for random
number generator quality is provided (libkern/rngtest.c). This
is based on Greg Rose's implementation from Qualcomm.
A new random stream generator, nist_ctr_drbg, is provided. It is
based on an implementation of the NIST SP800-90 CTR_DRBG by
Henric Jungheim. This generator users AES in a modified counter
mode to generate a backtracking-resistant random stream.
An abstraction layer, "cprng", is provided for in-kernel consumers
of randomness. The arc4random/arc4randbytes API is deprecated for
in-kernel use. It is replaced by "cprng_strong". The current
cprng_fast implementation wraps the existing arc4random
implementation. The current cprng_strong implementation wraps the
new CTR_DRBG implementation. Both interfaces are rekeyed from
the entropy pool automatically at intervals justifiable from best
current cryptographic practice.
In some quick tests, cprng_fast() is about the same speed as
the old arc4randbytes(), and cprng_strong() is about 20% faster
than rnd_extract_data(). Performance is expected to improve.
The AES code in src/crypto/rijndael is no longer an optional
kernel component, as it is required by cprng_strong, which is
not an optional kernel component.
The entropy pool output is subjected to the rngtest tests at
startup time; if it fails, the system will reboot. There is
approximately a 3/10000 chance of a false positive from these
tests. Entropy pool _input_ from hardware random numbers is
subjected to the rngtest tests at attach time, as well as the
FIPS continuous-output test, to detect bad or stuck hardware
RNGs; if any are detected, they are detached, but the system
continues to run.
A problem with rndctl(8) is fixed -- datastructures with
pointers in arrays are no longer passed to userspace (this
was not a security problem, but rather a major issue for
compat32). A new kernel will require a new rndctl.
The sysctl kern.arandom() and kern.urandom() nodes are hooked
up to the new generators, but the /dev/*random pseudodevices
are not, yet.
Manual pages for the new kernel interfaces are forthcoming.
RTF_ANNOUNCE was defined as RTF_PROTO2. The flag is used to indicated
that host should act as a proxy for a link level arp or ndp request.
(If RTF_PROTO2 is used as an experimental flag (as advertised),
various problems can occur.)
This commit provides a first-class definition with its own bit for
RTF_ANNOUNCE, removes the old aliasing definitions, and adds support
for the new RTF_ANNOUNCE flag to netstat(8) and route(8).,
Also, remove unused RTF_ flags that collide with RTF_PROTO1:
netinet/icmp6.h defined RTF_PROBEMTU as RTF_PROTO1
netinet/if_inarp.h defined RTF_USETRAILERS as RTF_PROTO1
(Neither of these flags are used anywhere. Both have been removed
to reduce chances of collision with RTF_PROTO1.)
Figuring this out and the diff are the work of Beverly Schwartz of
BBN.
(Passed release build, boot in VM, with no apparently related atf
failures.)
Approved for Public Release, Distribution Unlimited
This material is based upon work supported by the Defense Advanced
Research Projects Agency and Space and Naval Warfare Systems Center,
Pacific, under Contract No. N66001-09-C-2073.
SIOCSLIFPHYADDR, in gif_ioctl() or in gre_ioctl(), because those
operations are ordinarily kauth-orized already in ifioctl().
Kauth-orizing SIOCSIFFLAGS in gre_ioctl() caused a panic ("panic:
bpf_detachd: ifpromisc failed: 1") when tcpdump(8) was interrupted.
Somehow bpf(4) enables promiscuous mode using different credentials than
it uses to disable promiscuous mode, hence the ifpromisc failure. This
may have something to do with privilege-separation in tcpdump(8). I.e.,
an LWP with SIOCSIFFLAGS privilege opens /dev/bpf, but an LWP without
SIOCSIFFLAGS privilege closes it.
checksum-offload. Note well: it really is necessary to clear the
csum_data.
While I'm here, remove the do-nothing case for SIOCSIFDSTADDR and let
ifioctl_common() or the protocol handle it.
into a struct ifnet_lock that the ifnet has a pointer to. In a
non-_KERNEL environment, don't #include <sys/percpu.h> et cetera, and
don't define the struct ifnet_lock but *do* declare it.
Add ifnet functions, if_mcast_op(), if_flags_set(), and if_addr_init()
for adding/deleting multicast addresses, modifying the if_flags,
and initializing local/remote addresses. Make ifpromisc() use
if_flags_set(). Protocols and network drivers should use these
instead of ifp->if_ioctl() calls. Subsequent commits will
replace ifp->if_ioctl(SIOCADDMULTI| SIOCDELMULTI| SIOCSIFDSTADDR|
SIOCINITIFADDR| SIOCSIFFLAGS) calls with calls to the new functions.
Use a mutex(9) to synchronize ifp->if_ioctl() calls originating in
userland. Also synchronize ifp->if_ioctl() calls with ifnet detachment
and reclamation.
The change to if_spppsubr.c moves the test for whether LCP should
request a mru change until after the pppoe device has picked up the
mtu of the underlying ethernet device.
returned to userland by read(2) also needs to be converted.
For this, the bpf descriptor is flagged as compat32 (or not) in the
open and ioctl functions (where the user process's pid is also updated
in the descriptor). When the bpf buffer is filled in, the 32bits or native
header is used depending on the information stored in the descriptor.
This won't work if a 64bit binary does the open and ioctls, and then
exec a 32bit program which will do the read. But this is very
unlikely to happen in real life ...
Tested on i386 and loongson; with these changes my loongson can run
dhclient and tcpdump with a n32 userland.
sys/stdarg.h and expect compiler to provide proper builtins, defaulting
to the GCC interface. lint still has a special fallback.
Reduce abuse of _BSD_VA_LIST_ by defining __va_list by default and
derive va_list as required by standards.
smpls_addrs in sockaddr_mpls. The number of smpls_addrs is found from
smpls_len. First label encountered is BoS.
XXX: need to do the same for LSE and this feature needs to be documented.
This is still somewhat experimental. Tested between 2 similar boxes
so far. There is much potential for performance improvement. For now,
I've changed the gmac code to accept any data alignment, as the "char *"
pointer suggests. As the code is practically used, 32-bit alignment
can be assumed, at the cost of data copies. I don't know whether
bytewise access or copies are worse performance-wise. For efficient
implementations using SSE2 instructions on x86, even stricter
alignment requirements might arise.
will have an easier time replacing it with something different, even if
it is a second radix-trie implementation.
sys/net/route.c and sys/net/rtsock.c no longer operate directly on
radix_nodes or radix_node_heads.
Hopefully this will reduce the temptation to implement multipath or
source-based routing using grotty hacks to the grotty old radix-trie
code, too. :-)
M_NOWAIT cause dhcpd on a low-memory server with lots of interfaces to
occasionally fail to start with ENOBUFS; (M_WAITOK | M_CANFAIL) seems to
fix this.
Tested on 3 different dhcp servers.
- Add libnpf(3) - a library to control NPF (configuration, ruleset, etc).
- Add NPF support for ftp-proxy(8).
- Add rc.d script for NPF.
- Convert npfctl(8) to use libnpf(3) and thus make it less depressive.
Note: next clean-up step should be a parser, once dholland@ will finish it.
- Add more documentation.
- Various fixes.
- Add the concept of rule procedure: separate normalization, logging and
potentially other functions from the rule structure. Rule procedure can be
shared amongst the rules. Separation is both at kernel level (npf_rproc_t)
and configuration ("procedure" + "apply").
- Fix portmap sharing for NAT policy.
- Update TCP state tracking logic. Use TCP FSM definitions.
- Add if_byindex(), OK by matt@. Use in logging for the lookup.
- Fix traceroute ALG and many other bugs; misc clean-up.
- Add support for session saving/restoring.
- Add packet logging support (can tcpdump a pseudo-interface).
- Support reload without flushing of sessions; rework some locking.
- Revisit session mangement, replace linking with npf_sentry_t entries.
- Add some counters for statistics, using percpu(9).
- Add IP_DF flag cleansing.
- Fix various bugs; misc clean-up.
by zero while validating the bpf program.
originally spotted by skrll@, and broke atf the month-old atf test for
this exact problem: net_bpf_t_div-by-zero_div_by_zero.
- Add proper TCP state tracking as described in Guido van Rooij paper,
plus handle TCP Window Scaling option.
- Completely rework npf_cache_t, reduce granularity, simplify code.
- Add npf_addr_t as an abstraction, amend session handling code, as well
as NAT code et al, to use it. Now design is prepared for IPv6 support.
- Handle IPv4 fragments i.e. perform packet reassembly.
- Add support for IPv4 ID randomization and minimum TTL enforcement.
- Add support for TCP MSS "clamping".
- Random bits for IPv6. Various fixes and clean-up.
1. Fix inverted node order, so that negative value from comparison operator
would represent lower (left) node, and positive - higher (right) node.
2. Add an argument (i.e. "context"), passed to comparison operators.
3. Change rb_tree_insert_node() to return a node - either inserted one or
already existing one.
4. Amend the interface to manipulate the actual object, instead of the
rb_node (in a similar way as Patricia-tree interface does).
5. Update all RB-tree users accordingly.
XXX: Perhaps rename rb.h to rbtree.h, since cleaning-up..
1-3 address the PR/43488 by Jeremy Huddleston.
Passes RB-tree regression tests.
Reviewed by: matt@, christos@
- Add support for bi-directional NAT and redirection / port forwarding.
- Finish filtering on ICMP type/code and add filtering on TCP flags.
- Add support for TCP reset (RST) or ICMP destination unreachable on block.
- Fix a bunch of bugs; misc cleanup.
hosts. IPv6 is probably still broken, and, actually, the lookup table
for mask values should be kept in network byte order, not host byte order
and the corresponding change to the srtconfig ioctl interface made.
But at least this works.
1) RFC2367 says in 2.3.3 Address Extension: "All non-address
information in the sockaddrs, such as sin_zero for AF_INET sockaddrs,
and sin6_flowinfo for AF_INET6 sockaddrs, MUST be zeroed out."
the IPSEC_NAT_T code was expecting the port information it needs
to be conveyed in the sockaddr instead of exclusively by
SADB_X_EXT_NAT_T_SPORT and SADB_X_EXT_NAT_T_DPORT,
and was not zeroing out the port information in the non-nat-traversal
case.
Since it was expecting the port information to reside in the sockaddr
it could get away with (re)setting the ports after starting to use them.
-> Set the natt ports before setting the SA mature.
2) RFC3947 has two Original Address fields, initiator and responder,
so we need SADB_X_EXT_NAT_T_OAI and SADB_X_EXT_NAT_T_OAR and not just
SADB_X_EXT_NAT_T_OA
The change has been created using vanhu's patch for FreeBSD as reference.
Note that establishing actual nat-t sessions has not yet been tested.
Likely fixes the following:
PR bin/41757
PR net/42592
PR net/42606
- Designed to be fully MP-safe and highly efficient.
- Tables/IP sets (hash or red-black tree) for high performance lookups.
- Stateful filtering and Network Address Port Translation (NAPT).
Framework for application level gateways (ALGs).
- Packet inspection engine called n-code processor - inspired by BPF -
supporting generic RISC-like and specific CISC-like instructions for
common patterns (e.g. IPv4 address matching). See npf_ncode(9) manual.
- Convenient userland utility npfctl(8) with npf.conf(8).
NOTE: This is not yet a fully capable alternative to PF or IPFilter.
Further work (support for binat/rdr, return-rst/return-icmp, common ALGs,
state saving/restoring, logging, etc) is in progress.
Thanks a lot to Matt Thomas for various useful comments and code review.
Aye by: board@
to find and unlink routes that reference the detached ifnet: make
if_rt_walktree() return ERESTART whenever it has deleted a route.
Whenever rt_walktree() returns ERESTART, if_detach() restarts it.
I believe that this fix resembles one by Jonathan Kollasch or by someone
else, which has languished in a PR for too long. Sorry!
Tested by me and by Jeff Rizzo.
XXX It's supposed to be safe for rn_walktree() to apply to the routing
XXX table a routine that may delete routes. Why isn't it safe in
XXX practice?
These annotations help to mitigate false sharing on multiprocessor
systems.
Variables annotated with __cacheline_aligned are placed into the
.data.cacheline_aligned section in the kernel. Each item in this
section is aligned on a cachline boundary - this avoids false
sharing. Highly contended global locks are a good candidate for
__cacheline_aligned annotation.
Variables annotated with __read_mostly are packed together tightly
into a .data.read_mostly section in the kernel. The idea here is that
we can pack infrequently modified data items into a cacheline and
avoid having to purge the cache, which would happen if read mostly
data and write mostly data shared a cachline. Initialisation variables
are a prime candiate for __read_mostly annotations.
- better named one
- not suffering from buffer oveflow
- simpler
- handling different separators
- returning error codes for errors
Some ideas from one posted on tech-net by Jonathan A. Kollasch
Be more leinent on input string format. Each nibble pair may optionally be
followed by any of ':', '-', '.' or ' '.
Make source string const and work on a temporary copy. The caller may not
expect their string to be destroyed.
kern/39940 and by Martti Kuparinen on current-users@: replace the
ioctl lock with finer-grained locking. Lock the ports list and
wait to if_clone_destroy() until all threads are out of the softc.
Thanks to Martti Kuparinen for testing these changes.
#if NBPFILTER is no longer required in the client. This change
doesn't yet add support for loading bpf as a module, since drivers
can register before bpf is attached. However, callers of bpf can
now be modularized.
Dynamically loadable bpf could probably be done fairly easily with
coordination from the stub driver and the real driver by registering
attachments in the stub before the real driver is loaded and doing
a handoff. ... and I'm not going to ponder the depths of unload
here.
Tested with i386/MONOLITHIC, modified MONOLITHIC without bpf and rump.
#if NBPFILTER is no longer required in the client. This change
doesn't yet add support for loading bpf as a module, since drivers
can register before bpf is attached. However, callers of bpf can
now be modularized.
Dynamically loadable bpf could probably be done fairly easily with
coordination from the stub driver and the real driver by registering
attachments in the stub before the real driver is loaded and doing
a handoff. ... and I'm not going to ponder the depths of unload
here.
Tested with i386/MONOLITHIC, modified MONOLITHIC without bpf and rump.
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567
do drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.
- Drop the INET6 block. The commands are never given to this function
and truncating the sockaddr is arguably not the desired result anyway.
- Clear the address before copying. This fixes SIOCGIFNETMASK and possible
other ioctls for users that don't check sa_len. This includes
COMPAT_43 and Linux emulation.
OK dyoung@
Pfsync interface exposes change in the pf(4) over a pseudo-interface, and can
be used to synchronise different pf.
This work was part of my 2009 GSoC
No objection on tech-net@
addresses. Make the kernel support SIOC[SG]IFADDRPREF for IPv6
interface addresses.
In in6ifa_ifpforlinklocal(), consult preference numbers before
making an otherwise arbitrary choice of in6_ifaddr. Otherwise,
preference numbers are *not* consulted by the kernel, but that will
be rather easy for somebody with a little bit of free time to fix.
Please note that setting the preference number for a link-local
IPv6 address does not work right, yet, but that ought to be fixed
soon.
In support of the changes above,
1 Add a method to struct domain for "externalizing" a sockaddr, and
provide an implementation for IPv6. Expect more work in this area: it
may be more proper to say that the IPv6 implementation "internalizes"
a sockaddr. Add sockaddr_externalize().
2 Add a subroutine, sofamily(), that returns a struct socket's address
family or AF_UNSPEC.
3 Make a lot of IPv4-specific code generic, and move it from
sys/netinet/ to sys/net/ for re-use by IPv6 parts of the kernel and
ifconfig(8).
queue's maximum length, current length, and number of drops. E.g.,
% sysctl net.interfaces.bnx0
net.interfaces.bnx0.sndq.len = 0
net.interfaces.bnx0.sndq.maxlen = 509
net.interfaces.bnx0.sndq.drops = 0
Let userland adjust the maximum queue length.
While I'm here, add a 64-bit generation number, if_index_gen, to
ifnet; the pair [ifp->if_index, ifp->if_index_gen] can serve to
identify an ifnet for the lifetime of the system. I will use this
in an upcoming change.
Ok matt@.