route_in6, struct route_iso), replacing all caches with a struct
route.
The principle benefit of this change is that all of the protocol
families can benefit from route cache-invalidation, which is
necessary for correct routing. Route-cache invalidation fixes an
ancient PR, kern/3508, at long last; it fixes various other PRs,
also.
Discussions with and ideas from Joerg Sonnenberger influenced this
work tremendously. Of course, all design oversights and bugs are
mine.
DETAILS
1 I added to each address family a pool of sockaddrs. I have
introduced routines for allocating, copying, and duplicating,
and freeing sockaddrs:
struct sockaddr *sockaddr_alloc(sa_family_t af, int flags);
struct sockaddr *sockaddr_copy(struct sockaddr *dst,
const struct sockaddr *src);
struct sockaddr *sockaddr_dup(const struct sockaddr *src, int flags);
void sockaddr_free(struct sockaddr *sa);
sockaddr_alloc() returns either a sockaddr from the pool belonging
to the specified family, or NULL if the pool is exhausted. The
returned sockaddr has the right size for that family; sa_family
and sa_len fields are initialized to the family and sockaddr
length---e.g., sa_family = AF_INET and sa_len = sizeof(struct
sockaddr_in). sockaddr_free() puts the given sockaddr back into
its family's pool.
sockaddr_dup() and sockaddr_copy() work analogously to strdup()
and strcpy(), respectively. sockaddr_copy() KASSERTs that the
family of the destination and source sockaddrs are alike.
The 'flags' argumet for sockaddr_alloc() and sockaddr_dup() is
passed directly to pool_get(9).
2 I added routines for initializing sockaddrs in each address
family, sockaddr_in_init(), sockaddr_in6_init(), sockaddr_iso_init(),
etc. They are fairly self-explanatory.
3 structs route_in6 and route_iso are no more. All protocol families
use struct route. I have changed the route cache, 'struct route',
so that it does not contain storage space for a sockaddr. Instead,
struct route points to a sockaddr coming from the pool the sockaddr
belongs to. I added a new method to struct route, rtcache_setdst(),
for setting the cache destination:
int rtcache_setdst(struct route *, const struct sockaddr *);
rtcache_setdst() returns 0 on success, or ENOMEM if no memory is
available to create the sockaddr storage.
It is now possible for rtcache_getdst() to return NULL if, say,
rtcache_setdst() failed. I check the return value for NULL
everywhere in the kernel.
4 Each routing domain (struct domain) has a list of live route
caches, dom_rtcache. rtflushall(sa_family_t af) looks up the
domain indicated by 'af', walks the domain's list of route caches
and invalidates each one.
parentheses in return statements.
Cosmetic: don't open-code TAILQ_FOREACH().
Cosmetic: change types of variables to avoid oodles of casts: in
in6_src.c, avoid casts by changing several route_in6 pointers
to struct route pointers. Remove unnecessary casts to caddr_t
elsewhere.
Pave the way for eliminating address family-specific route caches:
soon, struct route will not embed a sockaddr, but it will hold
a reference to an external sockaddr, instead. We will set the
destination sockaddr using rtcache_setdst(). (I created a stub
for it, but it isn't used anywhere, yet.) rtcache_free() will
free the sockaddr. I have extracted from rtcache_free() a helper
subroutine, rtcache_clear(). rtcache_clear() will "forget" a
cached route, but it will not forget the destination by releasing
the sockaddr. I use rtcache_clear() instead of rtcache_free()
in rtcache_update(), because rtcache_update() is not supposed
to forget the destination.
Constify:
1 Introduce const accessor for route->ro_dst, rtcache_getdst().
2 Constify the 'dst' argument to ifnet->if_output(). This
led me to constify a lot of code called by output routines.
3 Constify the sockaddr argument to protosw->pr_ctlinput. This
led me to constify a lot of code called by ctlinput routines.
4 Introduce const macros for converting from a generic sockaddr
to family-specific sockaddrs, e.g., sockaddr_in: satocsin6,
satocsin, et cetera.
from Kentaro A. Kurahone, with minor adjustments by me.
the ack prediction part of the original patch was omitted because
it's a separate change. reviewed by Rui Paulo.
The code to generate an ISS via an MD5 hash has been present in the
NetBSD kernel since 2001, but it wasn't even exported to userland at
that time. It was agreed on tech-net with the original author <thorpej>
that we should let the user decide if he wants to enable it or not.
Not enabled by default.
happen in the TCP stack, this interface calls the specified callback to
handle the situation according to the currently selected congestion
control algorithm.
A new sysctl node was created: net.inet.tcp.congctl.{available,selected}
with obvious meanings.
The old net.inet.tcp.newreno MIB was removed.
The API is discussed in tcp_congctl(9).
In the near future, it will be possible to selected a congestion control
algorithm on a per-socket basis.
Discussed on tech-net and reviewed by <yamt>.
Both available for IPv4 and IPv6.
Basic implementation test results are available at
http://netbsd-soc.sourceforge.net/projects/ecn/testresults.html.
Work sponsored by the Google Summer of Code project 2006.
Special thanks to Kentaro Kurahone, Allen Briggs and Matt Thomas for their
help, comments and support during the project.
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
1. Don't act on ICMP-need-frag immediately if adhoc checks on the
advertised MTU fail. The MTU update is delayed until a TCP retransmit
happens.
2. Ignore ICMP Source Quench messages meant for TCP connections.
From OpenBSD.
- introduce t_segqlen, the number of segments in segq/timeq.
the name is from freebsd.
- rather than maintaining a copy of sack blocks (rcv_sack_block[]),
build it directly from the segment list when needed.
http://www.sigusr1.org/~kurahone/tcp-sack-netbsd-02152005.diff.gz
Fixes in that patch for pre-existing TCP pcb initializations were already
committed to NetBSD-current, so are not included in this commit.
The SACK patch has been observed to correctly negotiate and respond,
to SACKs in wide-area traffic.
There are two indepenently-observed, as-yet-unresolved anomalies:
First, seeing unexplained delays between in fast retransmission
(potentially explainable by an 0.2sec RTT between adjacent
ethernet/wifi NICs); and second, peculiar and unepxlained TCP
retransmits observed over an ath0 card.
After discussion with several interested developers, I'm committing
this now, as-is, for more eyes to use and look over. Current hypothesis
is that the anomalies above may in fact be due to link/level (hardware,
driver, HAL, firmware) abberations in the test setup, affecting both
Kentaro's wired-Ethernet NIC and in my two (different) WiFi NICs.
* t_partialacks<0 means we are not in fast recovery.
* t_partialacks==0 means we are in fast recovery, but we have not received
any partial acks yet.
* t_partialacks>0 means we are in fast recovery, and we have received
partial acks.
This is used to implement 2 changes in RFC 3782:
* We keep the notion that we are in fast recovery separate from t_dupacks, so
it is not reset due to out-of-order acks. (This affects both the Reno and
NewReno cases.)
* We only reset the retransmit timer on the first partial ack -- preventing us
from possibly taking one RTO per segment once fast recovery is initiated.
As before, it is hard to measure any difference between Reno and NewReno in the
real-world cases that I've tested.
1) If an echoed RFC 1323 time stamp appears to be later than the current time,
ignore it and fall back to old-style RTT calculation. This prevents ending
up with a negative RTT and panicking later.
2) Fix NewReno. This involves a few changes:
a) Implement the send_high variable in RFC 2582. Our implementation is
subtly different; it is one *past* the last sequence number transmitted
rather than being equal to it. This simplifies some logic and makes
the code smaller. Additional logic was required to prevent sequence
number wraparound problems; this is not mentioned in RFC 2582.
b) Make sure we reset t_dupacks on new acks, but *not* on a partial ack.
All of the new ack code is pushed out into tcp_newreno(). (Later this
will probably be a pluggable function.) Thus t_dupacks keeps track of
whether we're in fast recovery all the time, with Reno or NewReno, which
keeps some logic simpler.
c) We do not need to update snd_recover when we're not in fast recovery.
See tech-net for an explanation of this.
d) In the gratuitous fast retransmit prevention case, do not send a packet.
RFC 2582 specifically says that we should "do nothing".
e) Do not inflate the congestion window on a partial ack. (This is done by
testing t_dupacks to see whether we're still in fast recovery.)
This brings the performance of NewReno back up to the same as Reno in a few
random test cases (e.g. transferring peer-to-peer over my wireless network).
I have not concocted a good test case for the behavior specific to NewReno.
support IPv6 if KAME IPSEC (RFC is not explicit about how we make data stream
for checksum with IPv6, but i'm pretty sure using normal pseudo-header is the
right thing).
XXX
current TCP MD5 signature code has giant flaw:
it does not validate signature on input (can't believe it! what is the point?)
(MD5 signatures for TCP, as used with BGP). Credit for original
FreeBSD code goes to Bruce M. Simpson, with FreeBSD sponsorship
credited to sentex.net. Shortening of the setsockopt() name
attributed to Vincent Jardin.
This commit is a minimal, working version of the FreeBSD code, as
MFC'ed to FreeBSD-4. It has received minimal testing with a ttcp
modified to set the TCP-MD5 option; BMS's additions to tcpdump-current
(tcpdump -M) confirm that the MD5 signatures are correct. Committed
as-is for further testing between a NetBSD BGP speaker (e.g., quagga)
and industry-standard BGP speakers (e.g., Cisco, Juniper).
NOTE: This version has two potential flaws. First, I do see any code
that verifies recieved TCP-MD5 signatures. Second, the TCP-MD5
options are internally padded and assumed to be 32-bit aligned. A more
space-efficient scheme is to pack all TCP options densely (and
possibly unaligned) into the TCP header ; then do one final padding to
a 4-byte boundary. Pre-existing comments note that accounting for
TCP-option space when we add SACK is yet to be done. For now, I'm
punting on that; we can solve it properly, in a way that will handle
SACK blocks, as a separate exercise.
In case a pullup to NetBSD-2 is requested, this adds sys/netipsec/xform_tcp.c
,and modifies:
sys/net/pfkeyv2.h,v 1.15
sys/netinet/files.netinet,v 1.5
sys/netinet/ip.h,v 1.25
sys/netinet/tcp.h,v 1.15
sys/netinet/tcp_input.c,v 1.200
sys/netinet/tcp_output.c,v 1.109
sys/netinet/tcp_subr.c,v 1.165
sys/netinet/tcp_usrreq.c,v 1.89
sys/netinet/tcp_var.h,v 1.109
sys/netipsec/files.netipsec,v 1.3
sys/netipsec/ipsec.c,v 1.11
sys/netipsec/ipsec.h,v 1.7
sys/netipsec/key.c,v 1.11
share/man/man4/tcp.4,v 1.16
lib/libipsec/pfkey.c,v 1.20
lib/libipsec/pfkey_dump.c,v 1.17
lib/libipsec/policy_token.l,v 1.8
sbin/setkey/parse.y,v 1.14
sbin/setkey/setkey.8,v 1.27
sbin/setkey/token.l,v 1.15
Note that the preceding two revisions to tcp.4 will be
required to cleanly apply this diff.
individually, create a tcpcb template pre-initialized (and pre-zero'd)
with the static and mostly-static tcpcb parameters. The template is
now copied into the new tcpcb, which zeros and initializes most of the
tcpcb in one pass. The template is kept up-to-date as TCP sysctl
variables are changed.
Combined with the previous sb_max change, TCP socket creation is now
25% faster.
cooperating with the callout code in working around the race
condition caused by the TCP code's use of the callout facility.
Instead of unconditionally releasing memory in tcp_close() and
SYN_CACHE_PUT(), check whether any of the related callout handlers
are about to be invoked (but have not yet done callout_ack()), and
if so, just mark the associated data structure (tcpcb or syn cache
entry) as "dead", and test for this (and release storage) in the
callout handler functions.
sent from. This change avoid a linear search through all mbufs when using
large TCP windows, and therefore permit high-speed connections on long
distances.
Tested on a 1 Gigabit connection between Luleå and San Francisco, a distance
of about 15000km. With TCP windows of just over 20 Mbytes it could keep up
with 950Mbit/s.
After discussions with Matt Thomas and Jason Thorpe.
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.
Bump the kernel rev up to 1.6V