kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals
kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)
based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe
Remove the set-but-not-used "proto" variable.
Guard the "ostate" variable in #ifdef TCP_DEBUG.
Remove the set-but-not-used "parentinpcb" variable in syn_cache_get().
netinet/files.ipfilter, etinet/files.netinet, netinet6/files.netinet6,
and netinet6/files.netipsec.
XXX There are still a few stragglers in conf/files, which are entangled
with other network protocols.
hunting for an MSS option to clamp. The previous code assumed that at least
one more byte of options (such as a TCPOPT_EOL) would follow the MSS
option; now, we allow the MSS option to end on the last byte of the
TCP header.
Packets have been observed "in the wild" with a TCP header length of
'6' (24 bytes.. 20 bytes fixed header, 4 bytes options) with a 4-byte
MSS option exactly filling the 4 bytes of options payload and no
following TCPOPT_EOL.
RFC793 is quite explicit that the EOL byte:
" .. need only be used if the end of the options would not
otherwise coincide with the end of the TCP header."
"In rare cases when there is no room for ip options ip_insertoptions()
can fail and corrupt a header length. Initialize len and check what
ip_insertoptions() returns."
This merge changes the device switch tables from static array to
dynamically generated by config(8).
- All device switches is defined as a constant structure in device drivers.
- The new grammer ``device-major'' is introduced to ``files''.
device-major <prefix> char <num> [block <num>] [<rules>]
- All device major numbers must be listed up in port dependent majors.<arch>
by using this grammer.
- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.
- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.
- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.
- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.
TCPCB .. the fields need to be converted back to net-order, because
the packet is checksummed after the TCPCB lookup happens.
From YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>.
we can always keep 2 packets on the wire, no matter what SO_SNDBUF is,
and therefore ACKs will never be delayed unless we run out of data to
transmit. The problem is quite easy to tickle when the MTU of the
outgoing interface is larger than the socket buffer size (e.g. loopback).
Fix from Charles Hannum.
optimization made last year. should solve PR 17867 and 10195.
IP_HDRINCL behavior of raw ip socket is kept unchanged. we may want to
provide IP_HDRINCL variant that does not swap endian.
value of the TCP_NODELAY socket option from the listener to the
newly connected connection. Agrees with how Linux & FreeBSD behave,
and goes more with the spirit of accept(2) creating a socket with
the same properties as the listener.
Analysis by Kevin Lahey. Closes PR 17616 by myself.
* Keep pointers to the first and last mbufs of the last record in the
socket buffer.
* Use the sb_lastrecord pointer in the sbappend*() family of functions
to avoid traversing the packet chain to find the last record.
* Add a new sbappend_stream() function for stream protocols which
guarantee that there will never be more than one record in the
socket buffer. This function uses the sb_mbtail pointer to perform
the data insertion. Make TCP use sbappend_stream().
On a profiling run, this makes sbappend of a TCP transmission using
a 1M socket buffer go from 50% of the time to .02% of the time.
Thanks to Bill Sommerfeld and YAMAMOTO Takashi for their debugging
assistance!
1. size_t is 64 bits, so use a u_32_t for iplused
2. microtime() and friends expect a struct timeval,
passing the first of two unsigned longs will not cut it.
as necessary:
* Implement a new mbuf utility routine, m_copyup(), is is like
m_pullup(), except that it always prepends and copies, rather
than only doing so if the desired length is larger than m->m_len.
m_copyup() also allows an offset into the destination mbuf, which
allows space for packet headers, in the forwarding case.
* Add *_HDR_ALIGNED_P() macros for IP, IPv6, ICMP, and IGMP. These
macros expand to 1 if __NO_STRICT_ALIGNMENT is defined, so that
architectures which do not have strict alignment constraints don't
pay for the test or visit the new align-if-needed path.
* Use the new macros to check if a header needs to be aligned, or to
assert that it already is, as appropriate.
Note: This code is still somewhat experimental. However, the new
code path won't be visited if individual device drivers continue
to guarantee that packets are delivered to layer 3 already properly
aligned (which are rules that are already in use).
- the destination is IPv4 multicast or 255.255.255.255, and
- outgoing interface is specified via socket option
this simplifies operation of routed
(no longer reqiure 224.0.0.0/4 to be set up)
behavior changes:
- two iocts used by ndp(8) are now obsolete (backward compat provided).
use sysctl path instead.
- lo0 does not get ::1 automatically. it will get ::1 when lo0 comes up.
benefit currently). Rework tcp_reass code to optimize the 4 most likely causes
of out-of-order packets: first OoO pkt, next OoO pkt in seq, OoO pkt is part
of new chuck of OoO packets, and the OoO pkt fills the first hole. Add evcnts
to instrument tcp_reass (enabled by the options TCP_REASS_COUNTERS). This is
part 1/2 of tcp_reass changes.
* Remove the code that allocates a cluster if the packet would
fit in one; it totally defeats doing references to M_EXT mbufs
in the socket buffer. This drastically reduces the number of
data copies in the tcp_output() path for applications which use
large writes. Kudos to Matt Thomas for pointing me in the right
direction.
the QNX licence seems to be allow both non-commercial and commercial
use actually.
According to Darren, the H.323 proxy code is buggy ATM, but is imported
here for reference anyway.
Configured by a new option "mssclamp" in NAT rules, like:
map pppoe0 192.168.1.0/24 -> 0/32 mssclamp 1452
This is based on work by Xiaodan Tang <xtang@qnx.com>.
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:
* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot callocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.
From art@openbsd.org.
This avoids kernel crashes when we don't handle nonsensial values
like 0 gracefully. Better check here once beforehand than having to
check for non meaningful values in time critical paths (like tcp_output).
Fixes PR 15709.
Don't copy ttl from the inner packet to the encapsulating packet. Make
the outer ttl sysctl'able. This should close PR 14269 from Jasper Wallace
(change partly from there) and it makes traceroute work over gre tunnels.
outgoing interface (ia == NULL after IFP_TO_IA).
historic behavior (up to revision 1.43) was to use 0.0.0.0 as source address,
but it seems like a mistake according to RFC1112/1122.
as config(8) will warn for value-less defparam options
- minor whitespace/formatting cleanup
- consolidate opt_tcp_recvspace.h and opt_tcp_sendspace.h into opt_tcp_space.h
as discussed in tech-net several weeks ago. It turned out that
KAME had already added this functionality to the IPv6 stack, so
I followed their example in adding the sysctl variables
net.inet.icmp.rediraccept and net.inet.icmp.redirtimeout.
Add capabilities bits that indicate an interface can only perform
in-bound TCPv4 or UDPv4 checksums. There is at least one Gig-E chip
for which this is true (Level One LXT-1001), and this is also the
case for the Intel i82559 10/100 Ethernet chips.
all open TCP connections in tcp_slowtimo() (which is called 2x
per second). It's fairly rare for TCP timers to actually fire,
so saving this list traversal is good, especially if you want
to scale to thousands of open connections.
and call it directly from tcp_slowtimo() (via a table) rather
than going through tcp_userreq().
This will allow us to call TCP timers directly from callouts,
in a future revision.
Instead of incrementing t_idle and t_rtt in tcp_slowtimo(), we now
take a timstamp (via tcp_now) and use subtraction to compute the
delta when we actually need it (using unsigned arithmetic so that
tcp_now wrapping is handled correctly).
Based on similar changes in FreeBSD.