/* $NetBSD: tcp_input.c,v 1.319 2011/12/19 11:59:57 drochner Exp $ */

/*
 * Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of the project nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

/*
 * @(#)COPYRIGHT 1.1 (NRL) 17 January 1995
 *
 * NRL grants permission for redistribution and use in source and binary
 * forms, with or without modification, of the software and documentation
 * created at NRL provided that the following conditions are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. All advertising materials mentioning features or use of this software
 *    must display the following acknowledgements:
 *	This product includes software developed by the University of
 *	California, Berkeley and its contributors.
 *	This product includes software developed at the Information
 *	Technology Division, US Naval Research Laboratory.
 * 4. Neither the name of the NRL nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THE SOFTWARE PROVIDED BY NRL IS PROVIDED BY NRL AND CONTRIBUTORS ``AS
 * IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
 * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
 * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NRL OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
 * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
 * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * The views and conclusions contained in the software and documentation
 * are those of the authors and should not be interpreted as representing
 * official policies, either expressed or implied, of the US Naval
 * Research Laboratory (NRL).
 */

/*-
Reduces the resources demanded by TCP sessions in TIME_WAIT-state using
methods called Vestigial Time-Wait (VTW) and Maximum Segment Lifetime
Truncation (MSLT).
MSLT and VTW were contributed by Coyote Point Systems, Inc.
Even after a TCP session enters the TIME_WAIT state, its corresponding
socket and protocol control blocks (PCBs) stick around until the TCP
Maximum Segment Lifetime (MSL) expires. On a host whose workload
necessarily creates and closes down many TCP sockets, the sockets & PCBs
for TCP sessions in TIME_WAIT state amount to many megabytes of dead
weight in RAM.
Maximum Segment Lifetime Truncation (MSLT) assigns each TCP session to
a class based on the nearness of the peer. Corresponding to each class
is an MSL, and a session uses the MSL of its class. The classes are
loopback (local host equals remote host), local (local host and remote
host are on the same link/subnet), and remote (local host and remote
host communicate via one or more gateways). Classes corresponding to
nearer peers have lower MSLs by default: 2 seconds for loopback, 10
seconds for local, 60 seconds for remote. Loopback and local sessions
expire more quickly when MSLT is used.
Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket
dead weight with a compact representation of the session, called a
"vestigial PCB". VTW data structures are designed to be very fast and
memory-efficient: for fast insertion and lookup of vestigial PCBs,
the PCBs are stored in a hash table that is designed to minimize the
number of cacheline visits per lookup/insertion. The memory both
for vestigial PCBs and for elements of the PCB hashtable comes from
fixed-size pools, and linked data structures exploit this to conserve
memory by representing references with a narrow index/offset from the
start of a pool instead of a pointer. When space for new vestigial PCBs
runs out, VTW makes room by discarding old vestigial PCBs, oldest first.
VTW cooperates with MSLT.
It may help to think of VTW as a "FIN cache" by analogy to the SYN
cache.
A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT
sessions as fast as it can is approximately 17% idle when VTW is active
versus 0% idle when VTW is inactive. It has 103 megabytes more free RAM
when VTW is active (approximately 64k vestigial PCBs are created) than
when it is inactive.
 * Copyright (c) 1997, 1998, 1999, 2001, 2005, 2006,
 * 2011 The NetBSD Foundation, Inc.
 * All rights reserved.
 *
 * This code is derived from software contributed to The NetBSD Foundation
 * by Coyote Point Systems, Inc.
 * This code is derived from software contributed to The NetBSD Foundation
 * by Jason R. Thorpe and Kevin M. Lahey of the Numerical Aerospace Simulation
 * Facility, NASA Ames Research Center.
 * This code is derived from software contributed to The NetBSD Foundation
 * by Charles M. Hannum.
 * This code is derived from software contributed to The NetBSD Foundation
 * by Rui Paulo.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
 * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
 * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */

/*
 * Copyright (c) 1982, 1986, 1988, 1990, 1993, 1994, 1995
 * The Regents of the University of California. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * @(#)tcp_input.c 8.12 (Berkeley) 5/24/95
 */

/*
 * TODO list for SYN cache stuff:
 *
 * Find room for a "state" field, which is needed to keep a
 * compressed state for TIME_WAIT TCBs. It's been noted already
 * that this is fairly important for very high-volume web and
 * mail servers, which use a large number of short-lived
 * connections.
 */

#include <sys/cdefs.h>
__KERNEL_RCSID(0, "$NetBSD: tcp_input.c,v 1.319 2011/12/19 11:59:57 drochner Exp $");

#include "opt_inet.h"
#include "opt_ipsec.h"
#include "opt_inet_csum.h"
#include "opt_tcp_debug.h"

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/protosw.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <sys/errno.h>
#include <sys/syslog.h>
#include <sys/pool.h>
#include <sys/domain.h>
#include <sys/kernel.h>
#ifdef TCP_SIGNATURE
#include <sys/md5.h>
#endif
#include <sys/lwp.h> /* for lwp0 */
First step of random number subsystem rework described in
<20111022023242.BA26F14A158@mail.netbsd.org>. This change includes
the following:
An initial cleanup and minor reorganization of the entropy pool
code in sys/dev/rnd.c and sys/dev/rndpool.c. Several bugs are
fixed. Some effort is made to accumulate entropy more quickly at
boot time.
A generic interface, "rndsink", is added, for stream generators to
request that they be re-keyed with good quality entropy from the pool
as soon as it is available.
The arc4random()/arc4randbytes() implementation in libkern is
adjusted to use the rndsink interface for rekeying, which helps
address the problem of low-quality keys at boot time.
An implementation of the FIPS 140-2 statistical tests for random
number generator quality is provided (libkern/rngtest.c). This
is based on Greg Rose's implementation from Qualcomm.
A new random stream generator, nist_ctr_drbg, is provided. It is
based on an implementation of the NIST SP800-90 CTR_DRBG by
Henric Jungheim. This generator uses AES in a modified counter
mode to generate a backtracking-resistant random stream.
An abstraction layer, "cprng", is provided for in-kernel consumers
of randomness. The arc4random/arc4randbytes API is deprecated for
in-kernel use. It is replaced by "cprng_strong". The current
cprng_fast implementation wraps the existing arc4random
implementation. The current cprng_strong implementation wraps the
new CTR_DRBG implementation. Both interfaces are rekeyed from
the entropy pool automatically at intervals justifiable from best
current cryptographic practice.
In some quick tests, cprng_fast() is about the same speed as
the old arc4randbytes(), and cprng_strong() is about 20% faster
than rnd_extract_data(). Performance is expected to improve.
The AES code in src/crypto/rijndael is no longer an optional
kernel component, as it is required by cprng_strong, which is
not an optional kernel component.
The entropy pool output is subjected to the rngtest tests at
startup time; if it fails, the system will reboot. There is
approximately a 3/10000 chance of a false positive from these
tests. Entropy pool _input_ from hardware random numbers is
subjected to the rngtest tests at attach time, as well as the
FIPS continuous-output test, to detect bad or stuck hardware
RNGs; if any are detected, they are detached, but the system
continues to run.
A problem with rndctl(8) is fixed -- datastructures with
pointers in arrays are no longer passed to userspace (this
was not a security problem, but rather a major issue for
compat32). A new kernel will require a new rndctl.
The sysctl kern.arandom() and kern.urandom() nodes are hooked
up to the new generators, but the /dev/*random pseudodevices
are not, yet.
Manual pages for the new kernel interfaces are forthcoming.
#include <sys/cprng.h>

#include <net/if.h>
#include <net/route.h>
#include <net/if_types.h>

#include <netinet/in.h>
#include <netinet/in_systm.h>
#include <netinet/ip.h>
#include <netinet/in_pcb.h>
#include <netinet/in_var.h>
#include <netinet/ip_var.h>
#include <netinet/in_offload.h>

#ifdef INET6
#ifndef INET
#include <netinet/in.h>
#endif
#include <netinet/ip6.h>
#include <netinet6/ip6_var.h>
#include <netinet6/in6_pcb.h>
#include <netinet6/ip6_var.h>
#include <netinet6/in6_var.h>
#include <netinet/icmp6.h>
#include <netinet6/nd6.h>
#ifdef TCP_SIGNATURE
#include <netinet6/scope6_var.h>
#endif
#endif

#ifndef INET6
/* always need ip6.h for IP6_EXTHDR_GET */
#include <netinet/ip6.h>
#endif

#include <netinet/tcp.h>
#include <netinet/tcp_fsm.h>
#include <netinet/tcp_seq.h>
#include <netinet/tcp_timer.h>
#include <netinet/tcp_var.h>
#include <netinet/tcp_private.h>
#include <netinet/tcpip.h>
#include <netinet/tcp_congctl.h>
#include <netinet/tcp_debug.h>

#ifdef KAME_IPSEC
#include <netinet6/ipsec.h>
#include <netinet6/ipsec_private.h>
#include <netkey/key.h>
#endif /*KAME_IPSEC*/
#ifdef INET6
#include "faith.h"
#if defined(NFAITH) && NFAITH > 0
#include <net/if_faith.h>
#endif
#endif /* INET6 */

#ifdef FAST_IPSEC
#include <netipsec/ipsec.h>
#include <netipsec/ipsec_var.h>
#include <netipsec/ipsec_private.h>
#include <netipsec/key.h>
#ifdef INET6
#include <netipsec/ipsec6.h>
#endif
#endif /* FAST_IPSEC*/

#include <netinet/tcp_vtw.h>
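
The MSLT description near the top of this file assigns each session to a class (loopback, local, remote) with default MSLs of 2, 10 and 60 seconds. A minimal user-space sketch of that mapping follows. The names `sk_peer_class`, `sk_classify` and `sk_msl_seconds` are hypothetical, and comparing masked IPv4 addresses is only an assumed stand-in for however the kernel actually decides that a peer is on the same link.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical peer classes, named after the MSLT description. */
enum sk_peer_class { SK_PEER_LOOPBACK, SK_PEER_LOCAL, SK_PEER_REMOTE };

/*
 * Assumption: "same link/subnet" is approximated by comparing the
 * masked IPv4 addresses (all values in host byte order).
 */
static enum sk_peer_class
sk_classify(uint32_t laddr, uint32_t faddr, uint32_t netmask)
{
	if (laddr == faddr)
		return SK_PEER_LOOPBACK;
	if ((laddr & netmask) == (faddr & netmask))
		return SK_PEER_LOCAL;
	return SK_PEER_REMOTE;
}

/* Default MSL per class, in seconds, as quoted in the description. */
static int
sk_msl_seconds(enum sk_peer_class c)
{
	switch (c) {
	case SK_PEER_LOOPBACK:
		return 2;
	case SK_PEER_LOCAL:
		return 10;
	default:
		return 60;
	}
}
```

A session would then use `sk_msl_seconds(sk_classify(...))` in place of one global MSL, so nearer peers leave TIME_WAIT sooner.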

int tcprexmtthresh = 3;
int tcp_log_refused;

int tcp_do_autorcvbuf = 1;
int tcp_autorcvbuf_inc = 16 * 1024;
int tcp_autorcvbuf_max = 256 * 1024;
int tcp_msl = (TCPTV_MSL / PR_SLOWHZ);

static int tcp_rst_ppslim_count = 0;
static struct timeval tcp_rst_ppslim_last;
static int tcp_ackdrop_ppslim_count = 0;
static struct timeval tcp_ackdrop_ppslim_last;

#define TCP_PAWS_IDLE (24U * 24 * 60 * 60 * PR_SLOWHZ)

/* for modulo comparisons of timestamps */
#define TSTMP_LT(a,b) ((int)((a)-(b)) < 0)
#define TSTMP_GEQ(a,b) ((int)((a)-(b)) >= 0)
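
The two macros above compare timestamps modulo 2^32: the unsigned difference is reinterpreted as a signed value, so a timestamp that has wrapped past zero still counts as later than one taken just before the wrap. The sketch below shows the same idea in user space. The function names `sk_tstmp_lt` and `sk_tstmp_geq` are hypothetical, and 32-bit timestamps are assumed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers mirroring TSTMP_LT/TSTMP_GEQ for 32-bit values.
 * The result is only meaningful when the two timestamps are less than
 * 2^31 apart.
 */
static int
sk_tstmp_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(uint32_t)(a - b) < 0;
}

static int
sk_tstmp_geq(uint32_t a, uint32_t b)
{
	return (int32_t)(uint32_t)(a - b) >= 0;
}
```

For example, `sk_tstmp_lt(0xfffffff0, 0x10)` is true even though the first value is numerically larger, because the second value lies 32 ticks ahead once wraparound is taken into account.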

/*
 * Neighbor Discovery, Neighbor Unreachability Detection Upper layer hint.
 */
#ifdef INET6
static inline void
nd6_hint(struct tcpcb *tp)
{
	struct rtentry *rt;

	if (tp != NULL && tp->t_in6pcb != NULL && tp->t_family == AF_INET6 &&
	    (rt = rtcache_validate(&tp->t_in6pcb->in6p_route)) != NULL)
		nd6_nud_hint(rt, NULL, 0);
}
#else
static inline void
nd6_hint(struct tcpcb *tp)
{
}
#endif

/*
 * Compute ACK transmission behavior. Delay the ACK unless
 * we have already delayed an ACK (must send an ACK every two segments).
 * We also ACK immediately if we received a PUSH and the ACK-on-PUSH
 * option is enabled.
 */
static void
tcp_setup_ack(struct tcpcb *tp, const struct tcphdr *th)
{

	if (tp->t_flags & TF_DELACK ||
	    (tcp_ack_on_push && th->th_flags & TH_PUSH))
		tp->t_flags |= TF_ACKNOW;
	else
		TCP_SET_DELACK(tp);
}
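
The decision made by tcp_setup_ack() can be exercised outside the kernel with a small model. This is only a sketch: the `SK_*` flag values, `struct sk_conn` and `sk_setup_ack` are hypothetical, and where the kernel uses TCP_SET_DELACK() the model merely sets a flag (any timer handling is left out).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits standing in for TF_DELACK, TF_ACKNOW, TH_PUSH. */
#define SK_DELACK	0x01u	/* an ACK is already being delayed */
#define SK_ACKNOW	0x02u	/* send an ACK immediately */
#define SK_PUSH		0x08u	/* the segment carried PUSH */

struct sk_conn {
	unsigned flags;
};

/*
 * ACK now if an ACK is already pending (every second segment), or if
 * the segment has PUSH set and ACK-on-PUSH is enabled.  Otherwise
 * start delaying an ACK.
 */
static void
sk_setup_ack(struct sk_conn *c, unsigned seg_flags, bool ack_on_push)
{
	if ((c->flags & SK_DELACK) != 0 ||
	    (ack_on_push && (seg_flags & SK_PUSH) != 0))
		c->flags |= SK_ACKNOW;
	else
		c->flags |= SK_DELACK;
}
```

Calling `sk_setup_ack` twice on a fresh connection first sets `SK_DELACK` and then `SK_ACKNOW`, which matches the "ACK every two segments" rule in the comment.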

static void
icmp_check(struct tcpcb *tp, const struct tcphdr *th, int acked)
{

	/*
	 * If we had a pending ICMP message that refers to data that have
	 * just been acknowledged, disregard the recorded ICMP message.
	 */
	if ((tp->t_flags & TF_PMTUD_PEND) &&
	    SEQ_GT(th->th_ack, tp->t_pmtud_th_seq))
		tp->t_flags &= ~TF_PMTUD_PEND;

	/*
	 * Keep track of the largest chunk of data
	 * acknowledged since last PMTU update
	 */
	if (tp->t_pmtud_mss_acked < acked)
		tp->t_pmtud_mss_acked = acked;
}

/*
 * Convert TCP protocol fields to host order for easier processing.
 */
static void
tcp_fields_to_host(struct tcphdr *th)
{

	NTOHL(th->th_seq);
	NTOHL(th->th_ack);
	NTOHS(th->th_win);
	NTOHS(th->th_urp);
}

/*
 * ... and reverse the above.
 */
static void
tcp_fields_to_net(struct tcphdr *th)
{

	HTONL(th->th_seq);
	HTONL(th->th_ack);
	HTONS(th->th_win);
	HTONS(th->th_urp);
}
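
The same in-place conversion can be tried in user space with the POSIX ntohl()/htonl() family. In this sketch `struct sk_tcphdr` is a hypothetical, cut-down header holding only the four fields converted above, and the `sk_` function names are likewise made up for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical header with just the fields that get byte-swapped. */
struct sk_tcphdr {
	uint32_t seq;
	uint32_t ack;
	uint16_t win;
	uint16_t urp;
};

/* Network order to host order, in place. */
static void
sk_fields_to_host(struct sk_tcphdr *th)
{
	th->seq = ntohl(th->seq);
	th->ack = ntohl(th->ack);
	th->win = ntohs(th->win);
	th->urp = ntohs(th->urp);
}

/* Host order back to network order, in place. */
static void
sk_fields_to_net(struct sk_tcphdr *th)
{
	th->seq = htonl(th->seq);
	th->ack = htonl(th->ack);
	th->win = htons(th->win);
	th->urp = htons(th->urp);
}
```

Converting to host order and back yields the original bytes, so a header can be made convenient for arithmetic during processing and then restored before it is passed on.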

#ifdef TCP_CSUM_COUNTERS
#include <sys/device.h>

#if defined(INET)
extern struct evcnt tcp_hwcsum_ok;
extern struct evcnt tcp_hwcsum_bad;
extern struct evcnt tcp_hwcsum_data;
extern struct evcnt tcp_swcsum;
#endif /* defined(INET) */
#if defined(INET6)
extern struct evcnt tcp6_hwcsum_ok;
extern struct evcnt tcp6_hwcsum_bad;
extern struct evcnt tcp6_hwcsum_data;
extern struct evcnt tcp6_swcsum;
#endif /* defined(INET6) */

#define TCP_CSUM_COUNTER_INCR(ev) (ev)->ev_count++

#else

#define TCP_CSUM_COUNTER_INCR(ev) /* nothing */

#endif /* TCP_CSUM_COUNTERS */

#ifdef TCP_REASS_COUNTERS
#include <sys/device.h>

extern struct evcnt tcp_reass_;
extern struct evcnt tcp_reass_empty;
extern struct evcnt tcp_reass_iteration[8];
extern struct evcnt tcp_reass_prependfirst;
extern struct evcnt tcp_reass_prepend;
extern struct evcnt tcp_reass_insert;
extern struct evcnt tcp_reass_inserttail;
extern struct evcnt tcp_reass_append;
extern struct evcnt tcp_reass_appendtail;
extern struct evcnt tcp_reass_overlaptail;
extern struct evcnt tcp_reass_overlapfront;
extern struct evcnt tcp_reass_segdup;
extern struct evcnt tcp_reass_fragdup;

#define TCP_REASS_COUNTER_INCR(ev) (ev)->ev_count++

#else

#define TCP_REASS_COUNTER_INCR(ev) /* nothing */

#endif /* TCP_REASS_COUNTERS */

static int tcp_reass(struct tcpcb *, const struct tcphdr *, struct mbuf *,
    int *);
static int tcp_dooptions(struct tcpcb *, const u_char *, int,
    struct tcphdr *, struct mbuf *, int, struct tcp_opt_info *);

#ifdef INET
static void tcp4_log_refused(const struct ip *, const struct tcphdr *);
#endif
#ifdef INET6
static void tcp6_log_refused(const struct ip6_hdr *, const struct tcphdr *);
#endif

#define TRAVERSE(x) while ((x)->m_next) (x) = (x)->m_next

#if defined(MBUFTRACE)
struct mowner tcp_reass_mowner = MOWNER_INIT("tcp", "reass");
#endif /* defined(MBUFTRACE) */

static struct pool tcpipqent_pool;

void
tcpipqent_init(void)
{

	pool_init(&tcpipqent_pool, sizeof(struct ipqent), 0, 0, 0, "tcpipqepl",
	    NULL, IPL_VM);
}

struct ipqent *
tcpipqent_alloc(void)
{
	struct ipqent *ipqe;
	int s;

	s = splvm();
	ipqe = pool_get(&tcpipqent_pool, PR_NOWAIT);
	splx(s);

	return ipqe;
}

void
tcpipqent_free(struct ipqent *ipqe)
{
	int s;

	s = splvm();
	pool_put(&tcpipqent_pool, ipqe);
	splx(s);
}
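
The init/alloc/free trio above wraps a pool of fixed-size entries, and tcpipqent_alloc() asks for an entry without waiting. A much simplified user-space sketch of a fixed-size pool is shown below. Everything in it (`sk_pool`, `sk_item`, the capacity of four) is hypothetical, it is not how pool(9) is implemented, and the interrupt-level protection that splvm()/splx() provide is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define SK_POOL_NITEMS	4	/* hypothetical capacity */

struct sk_item {
	struct sk_item *next;	/* free-list link, used only while free */
	int payload;
};

struct sk_pool {
	struct sk_item items[SK_POOL_NITEMS];
	struct sk_item *freelist;
};

/* Thread every item onto the free list. */
static void
sk_pool_init(struct sk_pool *pp)
{
	size_t i;

	pp->freelist = NULL;
	for (i = 0; i < SK_POOL_NITEMS; i++) {
		pp->items[i].next = pp->freelist;
		pp->freelist = &pp->items[i];
	}
}

/* Non-blocking get: returns NULL when the pool is exhausted. */
static struct sk_item *
sk_pool_get(struct sk_pool *pp)
{
	struct sk_item *it = pp->freelist;

	if (it != NULL)
		pp->freelist = it->next;
	return it;
}

/* Return an item to the head of the free list. */
static void
sk_pool_put(struct sk_pool *pp, struct sk_item *it)
{
	it->next = pp->freelist;
	pp->freelist = it;
}
```

As in the sketch, a caller that allocates without waiting has to be prepared for a NULL result.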
|
|
|
|
|
2006-12-06 12:08:27 +03:00
|
|
|
static int
|
|
|
|
tcp_reass(struct tcpcb *tp, const struct tcphdr *th, struct mbuf *m, int *tlen)
|
1993-03-21 12:45:37 +03:00
|
|
|
{
|
2000-03-30 16:51:13 +04:00
|
|
|
struct ipqent *p, *q, *nq, *tiqe = NULL;
|
1999-07-01 12:12:45 +04:00
|
|
|
struct socket *so = NULL;
|
1998-04-30 00:43:29 +04:00
|
|
|
int pkt_flags;
|
|
|
|
tcp_seq pkt_seq;
|
|
|
|
unsigned pkt_len;
|
|
|
|
u_long rcvpartdupbyte = 0;
|
|
|
|
u_long rcvoobyte;
|
2002-05-07 06:59:38 +04:00
|
|
|
#ifdef TCP_REASS_COUNTERS
|
|
|
|
u_int count = 0;
|
|
|
|
#endif
|
2008-04-12 09:58:22 +04:00
|
|
|
uint64_t *tcps;
|
1993-03-21 12:45:37 +03:00
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
if (tp->t_inpcb)
|
|
|
|
so = tp->t_inpcb->inp_socket;
|
|
|
|
#ifdef INET6
|
|
|
|
else if (tp->t_in6pcb)
|
|
|
|
so = tp->t_in6pcb->in6p_socket;
|
|
|
|
#endif
|
|
|
|
|
1998-12-19 00:38:02 +03:00
|
|
|
TCP_REASS_LOCK_CHECK(tp);
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
1999-07-01 12:12:45 +04:00
|
|
|
* Call with th==0 after becoming established to
|
1993-03-21 12:45:37 +03:00
|
|
|
* force pre-ESTABLISHED data up to user socket.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if (th == 0)
|
1993-03-21 12:45:37 +03:00
|
|
|
goto present;
|
|
|
|
|
2006-12-06 12:10:45 +03:00
|
|
|
m_claimm(m, &tcp_reass_mowner);
|
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
rcvoobyte = *tlen;
|
1995-11-21 04:07:34 +03:00
|
|
|
/*
|
1998-04-30 00:43:29 +04:00
|
|
|
* Copy these to local variables because the tcpiphdr
|
|
|
|
* gets munged while we are collapsing mbufs.
|
1995-11-21 04:07:34 +03:00
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
pkt_seq = th->th_seq;
|
|
|
|
pkt_len = *tlen;
|
|
|
|
pkt_flags = th->th_flags;
|
2002-05-07 06:59:38 +04:00
|
|
|
|
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_);
|
|
|
|
|
|
|
|
if ((p = TAILQ_LAST(&tp->segq, ipqehead)) != NULL) {
|
|
|
|
/*
|
|
|
|
* When we miss a packet, the vast majority of the time we get
|
|
|
|
* packets that follow it in order. So optimize for that.
|
|
|
|
*/
|
|
|
|
if (pkt_seq == p->ipqe_seq + p->ipqe_len) {
|
|
|
|
p->ipqe_len += pkt_len;
|
|
|
|
p->ipqe_flags |= pkt_flags;
|
2004-04-22 19:05:33 +04:00
|
|
|
m_cat(p->ipre_mlast, m);
|
|
|
|
TRAVERSE(p->ipre_mlast);
|
2004-02-26 05:34:59 +03:00
|
|
|
m = NULL;
|
2002-05-07 06:59:38 +04:00
|
|
|
tiqe = p;
|
|
|
|
TAILQ_REMOVE(&tp->timeq, p, ipqe_timeq);
|
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_appendtail);
|
|
|
|
goto skip_replacement;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* While we're here, if the pkt is completely beyond
|
|
|
|
* anything we have, just insert it at the tail.
|
|
|
|
*/
|
|
|
|
if (SEQ_GT(pkt_seq, p->ipqe_seq + p->ipqe_len)) {
|
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_inserttail);
|
|
|
|
goto insert_it;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
q = TAILQ_FIRST(&tp->segq);
|
|
|
|
|
|
|
|
if (q != NULL) {
|
|
|
|
/*
|
|
|
|
* If this segment immediately precedes the first out-of-order
|
|
|
|
* block, simply slap the segment in front of it and (mostly)
|
|
|
|
* skip the complicated logic.
|
|
|
|
*/
|
|
|
|
if (pkt_seq + pkt_len == q->ipqe_seq) {
|
|
|
|
q->ipqe_seq = pkt_seq;
|
|
|
|
q->ipqe_len += pkt_len;
|
|
|
|
q->ipqe_flags |= pkt_flags;
|
|
|
|
m_cat(m, q->ipqe_m);
|
|
|
|
q->ipqe_m = m;
|
2004-04-22 19:05:33 +04:00
|
|
|
q->ipre_mlast = m; /* last mbuf may have changed */
|
|
|
|
TRAVERSE(q->ipre_mlast);
|
2002-05-07 06:59:38 +04:00
|
|
|
tiqe = q;
|
|
|
|
TAILQ_REMOVE(&tp->timeq, q, ipqe_timeq);
|
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_prependfirst);
|
|
|
|
goto skip_replacement;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_empty);
|
|
|
|
}
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
|
|
|
* Find a segment which begins after this one does.
|
|
|
|
*/
|
2002-05-07 06:59:38 +04:00
|
|
|
for (p = NULL; q != NULL; q = nq) {
|
|
|
|
nq = TAILQ_NEXT(q, ipqe_q);
|
|
|
|
#ifdef TCP_REASS_COUNTERS
|
|
|
|
count++;
|
|
|
|
#endif
|
1998-04-30 00:43:29 +04:00
|
|
|
/*
|
|
|
|
* If the received segment is just right after this
|
|
|
|
* fragment, merge the two together and then check
|
|
|
|
* for further overlaps.
|
|
|
|
*/
|
|
|
|
if (q->ipqe_seq + q->ipqe_len == pkt_seq) {
|
|
|
|
#ifdef TCPREASS_DEBUG
|
|
|
|
printf("tcp_reass[%p]: concat %u:%u(%u) to %u:%u(%u)\n",
|
|
|
|
tp, pkt_seq, pkt_seq + pkt_len, pkt_len,
|
|
|
|
q->ipqe_seq, q->ipqe_seq + q->ipqe_len, q->ipqe_len);
|
|
|
|
#endif
|
|
|
|
pkt_len += q->ipqe_len;
|
|
|
|
pkt_flags |= q->ipqe_flags;
|
|
|
|
pkt_seq = q->ipqe_seq;
|
2004-04-22 19:05:33 +04:00
|
|
|
m_cat(q->ipre_mlast, m);
|
|
|
|
TRAVERSE(q->ipre_mlast);
|
1998-04-30 00:43:29 +04:00
|
|
|
m = q->ipqe_m;
|
2002-05-07 06:59:38 +04:00
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_append);
|
1998-04-30 00:43:29 +04:00
|
|
|
goto free_ipqe;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* If the received segment is completely past this
|
|
|
|
* fragment, we need to go to the next fragment.
|
|
|
|
*/
|
|
|
|
if (SEQ_LT(q->ipqe_seq + q->ipqe_len, pkt_seq)) {
|
|
|
|
p = q;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
/*
|
2002-06-09 20:33:36 +04:00
|
|
|
* If the fragment is past the received segment,
|
1998-04-30 00:43:29 +04:00
|
|
|
* it (or any following) can't be concatenated.
|
|
|
|
*/
|
2002-05-07 06:59:38 +04:00
|
|
|
if (SEQ_GT(q->ipqe_seq, pkt_seq + pkt_len)) {
|
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_insert);
|
1993-03-21 12:45:37 +03:00
|
|
|
break;
|
2002-05-07 06:59:38 +04:00
|
|
|
}
|
|
|
|
|
1998-04-30 00:43:29 +04:00
|
|
|
/*
|
|
|
|
* We've received all the data in this segment before.
|
|
|
|
* Mark it as a duplicate and return.
|
|
|
|
*/
|
|
|
|
if (SEQ_LEQ(q->ipqe_seq, pkt_seq) &&
|
|
|
|
SEQ_GEQ(q->ipqe_seq + q->ipqe_len, pkt_seq + pkt_len)) {
|
2008-04-12 09:58:22 +04:00
|
|
|
tcps = TCP_STAT_GETREF();
|
|
|
|
tcps[TCP_STAT_RCVDUPPACK]++;
|
|
|
|
tcps[TCP_STAT_RCVDUPBYTE] += pkt_len;
|
|
|
|
TCP_STAT_PUTREF();
|
Commit TCP SACK patches from Kentaro A. Karahone's patch at:
http://www.sigusr1.org/~kurahone/tcp-sack-netbsd-02152005.diff.gz
Fixes in that patch for pre-existing TCP pcb initializations were already
committed to NetBSD-current, so are not included in this commit.
The SACK patch has been observed to correctly negotiate and respond
to SACKs in wide-area traffic.
There are two independently observed, as-yet-unresolved anomalies:
first, unexplained delays in fast retransmission
(potentially explainable by a 0.2sec RTT between adjacent
ethernet/wifi NICs); and second, peculiar and unexplained TCP
retransmits observed over an ath0 card.
After discussion with several interested developers, I'm committing
this now, as-is, for more eyes to use and look over. Current hypothesis
is that the anomalies above may in fact be due to link-level (hardware,
driver, HAL, firmware) aberrations in the test setup, affecting both
Kentaro's wired-Ethernet NIC and my two (different) WiFi NICs.
2005-02-28 19:20:59 +03:00
|
|
|
tcp_new_dsack(tp, pkt_seq, pkt_len);
|
1998-04-30 00:43:29 +04:00
|
|
|
m_freem(m);
|
2005-03-30 00:10:16 +04:00
|
|
|
if (tiqe != NULL) {
|
|
|
|
tcpipqent_free(tiqe);
|
|
|
|
}
|
2002-05-07 06:59:38 +04:00
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_segdup);
|
Reduces the resources demanded by TCP sessions in TIME_WAIT-state using
methods called Vestigial Time-Wait (VTW) and Maximum Segment Lifetime
Truncation (MSLT).
MSLT and VTW were contributed by Coyote Point Systems, Inc.
Even after a TCP session enters the TIME_WAIT state, its corresponding
socket and protocol control blocks (PCBs) stick around until the TCP
Maximum Segment Lifetime (MSL) expires. On a host whose workload
necessarily creates and closes down many TCP sockets, the sockets & PCBs
for TCP sessions in TIME_WAIT state amount to many megabytes of dead
weight in RAM.
Maximum Segment Lifetime Truncation (MSLT) assigns each TCP session to
a class based on the nearness of the peer. Corresponding to each class
is an MSL, and a session uses the MSL of its class. The classes are
loopback (local host equals remote host), local (local host and remote
host are on the same link/subnet), and remote (local host and remote
host communicate via one or more gateways). Classes corresponding to
nearer peers have lower MSLs by default: 2 seconds for loopback, 10
seconds for local, 60 seconds for remote. Loopback and local sessions
expire more quickly when MSLT is used.
Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket
dead weight with a compact representation of the session, called a
"vestigial PCB". VTW data structures are designed to be very fast and
memory-efficient: for fast insertion and lookup of vestigial PCBs,
the PCBs are stored in a hash table that is designed to minimize the
number of cacheline visits per lookup/insertion. The memory both
for vestigial PCBs and for elements of the PCB hashtable comes from
fixed-size pools, and linked data structures exploit this to conserve
memory by representing references with a narrow index/offset from the
start of a pool instead of a pointer. When space for new vestigial PCBs
runs out, VTW makes room by discarding old vestigial PCBs, oldest first.
VTW cooperates with MSLT.
It may help to think of VTW as a "FIN cache" by analogy to the SYN
cache.
A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT
sessions as fast as it can is approximately 17% idle when VTW is active
versus 0% idle when VTW is inactive. It has 103 megabytes more free RAM
when VTW is active (approximately 64k vestigial PCBs are created) than
when it is inactive.
2011-05-03 22:28:44 +04:00
|
|
|
goto out;
|
1998-04-30 00:43:29 +04:00
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Received segment completely overlaps this fragment
|
|
|
|
* so we drop the fragment (this keeps the temporal
|
|
|
|
* ordering of segments correct).
|
|
|
|
*/
|
|
|
|
if (SEQ_GEQ(q->ipqe_seq, pkt_seq) &&
|
|
|
|
SEQ_LEQ(q->ipqe_seq + q->ipqe_len, pkt_seq + pkt_len)) {
|
|
|
|
rcvpartdupbyte += q->ipqe_len;
|
|
|
|
m_freem(q->ipqe_m);
|
2002-05-07 06:59:38 +04:00
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_fragdup);
|
1998-04-30 00:43:29 +04:00
|
|
|
goto free_ipqe;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* RX'ed segment extends past the end of the
|
|
|
|
* fragment. Drop the overlapping bytes. Then
|
|
|
|
* merge the fragment and segment then treat as
|
|
|
|
* a longer received packet.
|
|
|
|
*/
|
2004-02-26 05:34:59 +03:00
|
|
|
if (SEQ_LT(q->ipqe_seq, pkt_seq) &&
|
|
|
|
SEQ_GT(q->ipqe_seq + q->ipqe_len, pkt_seq)) {
|
1998-04-30 00:43:29 +04:00
|
|
|
int overlap = q->ipqe_seq + q->ipqe_len - pkt_seq;
|
|
|
|
#ifdef TCPREASS_DEBUG
|
|
|
|
printf("tcp_reass[%p]: trim starting %d bytes of %u:%u(%u)\n",
|
|
|
|
tp, overlap,
|
|
|
|
pkt_seq, pkt_seq + pkt_len, pkt_len);
|
|
|
|
#endif
|
|
|
|
m_adj(m, overlap);
|
|
|
|
rcvpartdupbyte += overlap;
|
2004-04-22 19:05:33 +04:00
|
|
|
m_cat(q->ipre_mlast, m);
|
|
|
|
TRAVERSE(q->ipre_mlast);
|
1998-04-30 00:43:29 +04:00
|
|
|
m = q->ipqe_m;
|
|
|
|
pkt_seq = q->ipqe_seq;
|
|
|
|
pkt_len += q->ipqe_len - overlap;
|
|
|
|
rcvoobyte -= overlap;
|
2002-05-07 06:59:38 +04:00
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_overlaptail);
|
1998-04-30 00:43:29 +04:00
|
|
|
goto free_ipqe;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* RX'ed segment extends past the front of the
|
|
|
|
* fragment. Drop the overlapping bytes on the
|
|
|
|
* received packet. The packet will then be
|
|
|
|
* concatenated with this fragment a bit later.
|
|
|
|
*/
|
2004-02-26 05:34:59 +03:00
|
|
|
if (SEQ_GT(q->ipqe_seq, pkt_seq) &&
|
|
|
|
SEQ_LT(q->ipqe_seq, pkt_seq + pkt_len)) {
|
1998-04-30 00:43:29 +04:00
|
|
|
int overlap = pkt_seq + pkt_len - q->ipqe_seq;
|
|
|
|
#ifdef TCPREASS_DEBUG
|
|
|
|
printf("tcp_reass[%p]: trim trailing %d bytes of %u:%u(%u)\n",
|
|
|
|
tp, overlap,
|
|
|
|
pkt_seq, pkt_seq + pkt_len, pkt_len);
|
|
|
|
#endif
|
|
|
|
m_adj(m, -overlap);
|
|
|
|
pkt_len -= overlap;
|
|
|
|
rcvpartdupbyte += overlap;
|
2002-05-07 06:59:38 +04:00
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_overlapfront);
|
1998-04-30 00:43:29 +04:00
|
|
|
rcvoobyte -= overlap;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* If the received segment immediately precedes this
|
|
|
|
* fragment then tack the fragment onto this segment
|
|
|
|
* and reinsert the data.
|
|
|
|
*/
|
|
|
|
if (q->ipqe_seq == pkt_seq + pkt_len) {
|
|
|
|
#ifdef TCPREASS_DEBUG
|
|
|
|
printf("tcp_reass[%p]: append %u:%u(%u) to %u:%u(%u)\n",
|
|
|
|
tp, q->ipqe_seq, q->ipqe_seq + q->ipqe_len, q->ipqe_len,
|
|
|
|
pkt_seq, pkt_seq + pkt_len, pkt_len);
|
|
|
|
#endif
|
|
|
|
pkt_len += q->ipqe_len;
|
|
|
|
pkt_flags |= q->ipqe_flags;
|
|
|
|
m_cat(m, q->ipqe_m);
|
2002-05-07 06:59:38 +04:00
|
|
|
TAILQ_REMOVE(&tp->segq, q, ipqe_q);
|
|
|
|
TAILQ_REMOVE(&tp->timeq, q, ipqe_timeq);
|
2005-03-16 03:39:56 +03:00
|
|
|
tp->t_segqlen--;
|
|
|
|
KASSERT(tp->t_segqlen >= 0);
|
|
|
|
KASSERT(tp->t_segqlen != 0 ||
|
|
|
|
(TAILQ_EMPTY(&tp->segq) &&
|
|
|
|
TAILQ_EMPTY(&tp->timeq)));
|
2005-03-30 00:10:16 +04:00
|
|
|
if (tiqe == NULL) {
|
2004-02-26 05:34:59 +03:00
|
|
|
tiqe = q;
|
2005-03-30 00:10:16 +04:00
|
|
|
} else {
|
|
|
|
tcpipqent_free(q);
|
|
|
|
}
|
2002-05-07 06:59:38 +04:00
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_prepend);
|
1998-04-30 00:43:29 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* If the fragment is before the segment, remember it.
|
|
|
|
* When this loop is terminated, p will contain the
|
|
|
|
* pointer to the fragment that is right before the received
|
|
|
|
* segment.
|
|
|
|
*/
|
|
|
|
if (SEQ_LEQ(q->ipqe_seq, pkt_seq))
|
|
|
|
p = q;
|
|
|
|
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is a common operation. It also will allow
|
|
|
|
* us to avoid a malloc/free in most instances.
|
|
|
|
*/
|
|
|
|
free_ipqe:
|
2002-05-07 06:59:38 +04:00
|
|
|
TAILQ_REMOVE(&tp->segq, q, ipqe_q);
|
|
|
|
TAILQ_REMOVE(&tp->timeq, q, ipqe_timeq);
|
2005-03-16 03:39:56 +03:00
|
|
|
tp->t_segqlen--;
|
|
|
|
KASSERT(tp->t_segqlen >= 0);
|
|
|
|
KASSERT(tp->t_segqlen != 0 ||
|
|
|
|
(TAILQ_EMPTY(&tp->segq) && TAILQ_EMPTY(&tp->timeq)));
|
2005-03-30 00:10:16 +04:00
|
|
|
if (tiqe == NULL) {
|
2004-02-26 05:34:59 +03:00
|
|
|
tiqe = q;
|
2005-03-30 00:10:16 +04:00
|
|
|
} else {
|
|
|
|
tcpipqent_free(q);
|
|
|
|
}
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
|
|
|
|
2002-05-07 06:59:38 +04:00
|
|
|
#ifdef TCP_REASS_COUNTERS
|
|
|
|
if (count > 7)
|
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_iteration[0]);
|
|
|
|
else if (count > 0)
|
|
|
|
TCP_REASS_COUNTER_INCR(&tcp_reass_iteration[count]);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
insert_it:
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
1998-04-30 00:43:29 +04:00
|
|
|
* Allocate a new queue entry since the received segment did not
|
|
|
|
* collapse onto any other out-of-order block; thus we are allocating
|
|
|
|
* a new block. If it had collapsed, tiqe would not be NULL and
|
|
|
|
* we would be reusing it.
|
|
|
|
* XXX If we can't, just drop the packet. XXX
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
1998-04-30 00:43:29 +04:00
|
|
|
if (tiqe == NULL) {
|
2005-03-30 00:10:16 +04:00
|
|
|
tiqe = tcpipqent_alloc();
|
1998-04-30 00:43:29 +04:00
|
|
|
if (tiqe == NULL) {
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_RCVMEMDROP);
|
1998-04-30 00:43:29 +04:00
|
|
|
m_freem(m);
|
2011-05-03 22:28:44 +04:00
|
|
|
goto out;
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
1998-04-30 00:43:29 +04:00
|
|
|
/*
|
|
|
|
* Update the counters.
|
|
|
|
*/
|
2008-04-12 09:58:22 +04:00
|
|
|
tcps = TCP_STAT_GETREF();
|
|
|
|
tcps[TCP_STAT_RCVOOPACK]++;
|
|
|
|
tcps[TCP_STAT_RCVOOBYTE] += rcvoobyte;
|
1998-04-30 00:43:29 +04:00
|
|
|
if (rcvpartdupbyte) {
|
2008-04-12 09:58:22 +04:00
|
|
|
tcps[TCP_STAT_RCVPARTDUPPACK]++;
|
|
|
|
tcps[TCP_STAT_RCVPARTDUPBYTE] += rcvpartdupbyte;
|
1998-04-30 00:43:29 +04:00
|
|
|
}
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STAT_PUTREF();
|
1998-04-30 00:43:29 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Insert the new fragment queue entry into both queues.
|
|
|
|
*/
|
1995-11-21 04:07:34 +03:00
|
|
|
tiqe->ipqe_m = m;
|
2004-04-22 19:05:33 +04:00
|
|
|
tiqe->ipre_mlast = m;
|
1998-04-30 00:43:29 +04:00
|
|
|
tiqe->ipqe_seq = pkt_seq;
|
|
|
|
tiqe->ipqe_len = pkt_len;
|
|
|
|
tiqe->ipqe_flags = pkt_flags;
|
1995-11-21 04:07:34 +03:00
|
|
|
if (p == NULL) {
|
2002-05-07 06:59:38 +04:00
|
|
|
TAILQ_INSERT_HEAD(&tp->segq, tiqe, ipqe_q);
|
1998-04-30 00:43:29 +04:00
|
|
|
#ifdef TCPREASS_DEBUG
|
|
|
|
if (tiqe->ipqe_seq != tp->rcv_nxt)
|
|
|
|
printf("tcp_reass[%p]: insert %u:%u(%u) at front\n",
|
|
|
|
tp, pkt_seq, pkt_seq + pkt_len, pkt_len);
|
|
|
|
#endif
|
1995-11-21 04:07:34 +03:00
|
|
|
} else {
|
2002-05-07 06:59:38 +04:00
|
|
|
TAILQ_INSERT_AFTER(&tp->segq, p, tiqe, ipqe_q);
|
1998-04-30 00:43:29 +04:00
|
|
|
#ifdef TCPREASS_DEBUG
|
|
|
|
printf("tcp_reass[%p]: insert %u:%u(%u) after %u:%u(%u)\n",
|
|
|
|
tp, pkt_seq, pkt_seq + pkt_len, pkt_len,
|
|
|
|
p->ipqe_seq, p->ipqe_seq + p->ipqe_len, p->ipqe_len);
|
|
|
|
#endif
|
1995-11-21 04:07:34 +03:00
|
|
|
}
|
2005-03-16 03:39:56 +03:00
|
|
|
tp->t_segqlen++;
|
1993-03-21 12:45:37 +03:00
|
|
|
|
2002-05-07 06:59:38 +04:00
|
|
|
skip_replacement:
|
|
|
|
|
|
|
|
TAILQ_INSERT_HEAD(&tp->timeq, tiqe, ipqe_timeq);
|
1998-04-30 00:43:29 +04:00
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
present:
|
|
|
|
/*
|
|
|
|
* Present data to user, advancing rcv_nxt through
|
|
|
|
* completed sequence space.
|
|
|
|
*/
|
1994-10-14 19:01:48 +03:00
|
|
|
if (TCPS_HAVEESTABLISHED(tp->t_state) == 0)
|
2011-05-03 22:28:44 +04:00
|
|
|
goto out;
|
2002-05-07 06:59:38 +04:00
|
|
|
q = TAILQ_FIRST(&tp->segq);
|
1998-04-30 00:43:29 +04:00
|
|
|
if (q == NULL || q->ipqe_seq != tp->rcv_nxt)
|
2011-05-03 22:28:44 +04:00
|
|
|
goto out;
|
1998-04-30 00:43:29 +04:00
|
|
|
if (tp->t_state == TCPS_SYN_RECEIVED && q->ipqe_len)
|
2011-05-03 22:28:44 +04:00
|
|
|
goto out;
|
1995-11-21 04:07:34 +03:00
|
|
|
|
1998-04-30 00:43:29 +04:00
|
|
|
tp->rcv_nxt += q->ipqe_len;
|
|
|
|
pkt_flags = q->ipqe_flags & TH_FIN;
|
2007-12-20 22:53:29 +03:00
|
|
|
nd6_hint(tp);
|
1998-04-30 00:43:29 +04:00
|
|
|
|
2002-05-07 06:59:38 +04:00
|
|
|
TAILQ_REMOVE(&tp->segq, q, ipqe_q);
|
|
|
|
TAILQ_REMOVE(&tp->timeq, q, ipqe_timeq);
|
2005-03-16 03:39:56 +03:00
|
|
|
tp->t_segqlen--;
|
|
|
|
KASSERT(tp->t_segqlen >= 0);
|
|
|
|
KASSERT(tp->t_segqlen != 0 ||
|
|
|
|
(TAILQ_EMPTY(&tp->segq) && TAILQ_EMPTY(&tp->timeq)));
|
1998-04-30 00:43:29 +04:00
|
|
|
if (so->so_state & SS_CANTRCVMORE)
|
|
|
|
m_freem(q->ipqe_m);
|
|
|
|
else
|
2002-07-04 01:36:57 +04:00
|
|
|
sbappendstream(&so->so_rcv, q->ipqe_m);
|
2005-03-30 00:10:16 +04:00
|
|
|
tcpipqent_free(q);
|
2011-05-03 22:28:44 +04:00
|
|
|
TCP_REASS_UNLOCK(tp);
|
1993-03-21 12:45:37 +03:00
|
|
|
sorwakeup(so);
|
1998-04-30 00:43:29 +04:00
|
|
|
return (pkt_flags);
|
2011-05-03 22:28:44 +04:00
out:
	TCP_REASS_UNLOCK(tp);
	return (0);
}

#ifdef INET6
int
tcp6_input(struct mbuf **mp, int *offp, int proto)
{
	struct mbuf *m = *mp;

	/*
	 * draft-itojun-ipv6-tcp-to-anycast
	 * better place to put this in?
	 */
	if (m->m_flags & M_ANYCAST6) {
		struct ip6_hdr *ip6;
		if (m->m_len < sizeof(struct ip6_hdr)) {
			if ((m = m_pullup(m, sizeof(struct ip6_hdr))) == NULL) {
				TCP_STATINC(TCP_STAT_RCVSHORT);
				return IPPROTO_DONE;
			}
		}
		ip6 = mtod(m, struct ip6_hdr *);
		icmp6_error(m, ICMP6_DST_UNREACH, ICMP6_DST_UNREACH_ADDR,
		    (char *)&ip6->ip6_dst - (char *)ip6);
		return IPPROTO_DONE;
	}

	tcp_input(m, *offp, proto);
	return IPPROTO_DONE;
}
#endif

#ifdef INET
static void
tcp4_log_refused(const struct ip *ip, const struct tcphdr *th)
{
	char src[4*sizeof "123"];
	char dst[4*sizeof "123"];

	if (ip) {
		strlcpy(src, inet_ntoa(ip->ip_src), sizeof(src));
		strlcpy(dst, inet_ntoa(ip->ip_dst), sizeof(dst));
	}
	else {
		strlcpy(src, "(unknown)", sizeof(src));
		strlcpy(dst, "(unknown)", sizeof(dst));
	}
	log(LOG_INFO,
	    "Connection attempt to TCP %s:%d from %s:%d\n",
	    dst, ntohs(th->th_dport),
	    src, ntohs(th->th_sport));
}
#endif

#ifdef INET6
static void
tcp6_log_refused(const struct ip6_hdr *ip6, const struct tcphdr *th)
{
	char src[INET6_ADDRSTRLEN];
	char dst[INET6_ADDRSTRLEN];

	if (ip6) {
		strlcpy(src, ip6_sprintf(&ip6->ip6_src), sizeof(src));
		strlcpy(dst, ip6_sprintf(&ip6->ip6_dst), sizeof(dst));
	}
	else {
		strlcpy(src, "(unknown v6)", sizeof(src));
		strlcpy(dst, "(unknown v6)", sizeof(dst));
	}
	log(LOG_INFO,
	    "Connection attempt to TCP [%s]:%d from [%s]:%d\n",
	    dst, ntohs(th->th_dport),
	    src, ntohs(th->th_sport));
}
#endif
/*
 * Checksum extended TCP header and data.
 */
int
tcp_input_checksum(int af, struct mbuf *m, const struct tcphdr *th,
    int toff, int off, int tlen)
{

	/*
	 * XXX it's better to record and check if this mbuf is
	 * already checked.
	 */

	switch (af) {
#ifdef INET
	case AF_INET:
		switch (m->m_pkthdr.csum_flags &
			((m->m_pkthdr.rcvif->if_csum_flags_rx & M_CSUM_TCPv4) |
			 M_CSUM_TCP_UDP_BAD | M_CSUM_DATA)) {
		case M_CSUM_TCPv4|M_CSUM_TCP_UDP_BAD:
			TCP_CSUM_COUNTER_INCR(&tcp_hwcsum_bad);
			goto badcsum;

		case M_CSUM_TCPv4|M_CSUM_DATA: {
			u_int32_t hw_csum = m->m_pkthdr.csum_data;

			TCP_CSUM_COUNTER_INCR(&tcp_hwcsum_data);
			if (m->m_pkthdr.csum_flags & M_CSUM_NO_PSEUDOHDR) {
				const struct ip *ip =
				    mtod(m, const struct ip *);

				hw_csum = in_cksum_phdr(ip->ip_src.s_addr,
				    ip->ip_dst.s_addr,
				    htons(hw_csum + tlen + off + IPPROTO_TCP));
			}
			if ((hw_csum ^ 0xffff) != 0)
				goto badcsum;
			break;
		}

		case M_CSUM_TCPv4:
			/* Checksum was okay. */
			TCP_CSUM_COUNTER_INCR(&tcp_hwcsum_ok);
			break;

		default:
			/*
			 * Must compute it ourselves. Maybe skip checksum
			 * on loopback interfaces.
			 */
			if (__predict_true(!(m->m_pkthdr.rcvif->if_flags &
					     IFF_LOOPBACK) ||
					   tcp_do_loopback_cksum)) {
				TCP_CSUM_COUNTER_INCR(&tcp_swcsum);
				if (in4_cksum(m, IPPROTO_TCP, toff,
					      tlen + off) != 0)
					goto badcsum;
			}
			break;
		}
		break;
#endif /* INET */

#ifdef INET6
	case AF_INET6:
		switch (m->m_pkthdr.csum_flags &
			((m->m_pkthdr.rcvif->if_csum_flags_rx & M_CSUM_TCPv6) |
			 M_CSUM_TCP_UDP_BAD | M_CSUM_DATA)) {
		case M_CSUM_TCPv6|M_CSUM_TCP_UDP_BAD:
			TCP_CSUM_COUNTER_INCR(&tcp6_hwcsum_bad);
			goto badcsum;

#if 0 /* notyet */
		case M_CSUM_TCPv6|M_CSUM_DATA:
#endif

		case M_CSUM_TCPv6:
			/* Checksum was okay. */
			TCP_CSUM_COUNTER_INCR(&tcp6_hwcsum_ok);
			break;

		default:
			/*
			 * Must compute it ourselves. Maybe skip checksum
			 * on loopback interfaces.
			 */
			if (__predict_true((m->m_flags & M_LOOP) == 0 ||
			    tcp_do_loopback_cksum)) {
				TCP_CSUM_COUNTER_INCR(&tcp6_swcsum);
				if (in6_cksum(m, IPPROTO_TCP, toff,
				    tlen + off) != 0)
					goto badcsum;
			}
		}
		break;
#endif /* INET6 */
	}

	return 0;

badcsum:
	TCP_STATINC(TCP_STAT_RCVBADSUM);
	return -1;
}
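The `(hw_csum ^ 0xffff) != 0` test above relies on a property of the Internet checksum: summing a segment together with its own correct checksum in one's-complement arithmetic yields 0xffff. The following stand-alone sketch shows that property on a plain byte buffer. The helper names `ocsum16` and `csum_ok` are hypothetical, and the TCP pseudo-header that the kernel routines also cover is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of 16-bit big-endian words, folded to 16 bits. */
uint16_t
ocsum16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)			/* odd trailing byte, padded with zero */
		sum += (uint32_t)buf[i] << 8;
	while (sum >> 16)		/* fold carries back into the low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* A buffer that includes its own correct checksum sums to 0xffff. */
int
csum_ok(const uint8_t *buf, size_t len)
{
	return (ocsum16(buf, len) ^ 0xffff) == 0;
}
```

For example, the words 0x1234 and 0x5678 sum to 0x68ac, so the checksum field must hold its complement, 0x9753, for `csum_ok()` to accept the buffer.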
/* When a packet arrives addressed to a vestigial tcpcb, we
 * nevertheless have to respond to it per the spec.
 */
static void tcp_vtw_input(struct tcphdr *th, vestigial_inpcb_t *vp,
			  struct mbuf *m, int tlen, int multicast)
{
	int tiflags;
	int todrop, dupseg;
	uint32_t t_flags = 0;
	uint64_t *tcps;

	tiflags = th->th_flags;
	todrop = vp->rcv_nxt - th->th_seq;
	dupseg = false;

	if (todrop > 0) {
		if (tiflags & TH_SYN) {
			tiflags &= ~TH_SYN;
			++th->th_seq;
			if (th->th_urp > 1)
				--th->th_urp;
			else {
				tiflags &= ~TH_URG;
				th->th_urp = 0;
			}
			--todrop;
		}
		if (todrop > tlen ||
		    (todrop == tlen && (tiflags & TH_FIN) == 0)) {
			/*
			 * Any valid FIN or RST must be to the left of the
			 * window. At this point the FIN or RST must be a
			 * duplicate or out of sequence; drop it.
			 */
			if (tiflags & TH_RST)
				goto drop;
			tiflags &= ~(TH_FIN|TH_RST);
			/*
			 * Send an ACK to resynchronize and drop any data.
			 * But keep on processing for RST or ACK.
			 */
			t_flags |= TF_ACKNOW;
			todrop = tlen;
			dupseg = true;
			tcps = TCP_STAT_GETREF();
			tcps[TCP_STAT_RCVDUPPACK] += 1;
			tcps[TCP_STAT_RCVDUPBYTE] += todrop;
			TCP_STAT_PUTREF();
		} else if ((tiflags & TH_RST)
			   && th->th_seq != vp->rcv_nxt) {
			/*
			 * Test for reset before adjusting the sequence
			 * number for overlapping data.
			 */
			goto dropafterack_ratelim;
		} else {
			tcps = TCP_STAT_GETREF();
			tcps[TCP_STAT_RCVPARTDUPPACK] += 1;
			tcps[TCP_STAT_RCVPARTDUPBYTE] += todrop;
			TCP_STAT_PUTREF();
		}

//		tcp_new_dsack(tp, th->th_seq, todrop);
//		hdroptlen += todrop; /*drop from head afterwards*/

		th->th_seq += todrop;
		tlen -= todrop;

		if (th->th_urp > todrop)
			th->th_urp -= todrop;
		else {
			tiflags &= ~TH_URG;
			th->th_urp = 0;
		}
	}

	/*
	 * If new data are received on a connection after the
	 * user processes are gone, then RST the other end.
	 */
	if (tlen) {
		TCP_STATINC(TCP_STAT_RCVAFTERCLOSE);
		goto dropwithreset;
	}

	/*
	 * If segment ends after window, drop trailing data
	 * (and PUSH and FIN); if nothing left, just ACK.
	 */
	todrop = (th->th_seq + tlen) - (vp->rcv_nxt+vp->rcv_wnd);

	if (todrop > 0) {
		TCP_STATINC(TCP_STAT_RCVPACKAFTERWIN);
		if (todrop >= tlen) {
			/*
			 * The segment actually starts after the window.
			 * th->th_seq + tlen - vp->rcv_nxt - vp->rcv_wnd >= tlen
			 * th->th_seq - vp->rcv_nxt - vp->rcv_wnd >= 0
			 * th->th_seq >= vp->rcv_nxt + vp->rcv_wnd
			 */
			TCP_STATADD(TCP_STAT_RCVBYTEAFTERWIN, tlen);
			/*
			 * If a new connection request is received
			 * while in TIME_WAIT, drop the old connection
			 * and start over if the sequence numbers
			 * are above the previous ones.
			 */
			if ((tiflags & TH_SYN)
			    && SEQ_GT(th->th_seq, vp->rcv_nxt)) {
				/* We only support this in the !NOFDREF case, which
				 * is to say: not here.
				 */
				goto dropwithreset;
			}
			/*
			 * If window is closed can only take segments at
			 * window edge, and have to drop data and PUSH from
			 * incoming segments. Continue processing, but
			 * remember to ack. Otherwise, drop segment
			 * and (if not RST) ack.
			 */
			if (vp->rcv_wnd == 0 && th->th_seq == vp->rcv_nxt) {
				t_flags |= TF_ACKNOW;
				TCP_STATINC(TCP_STAT_RCVWINPROBE);
			} else
				goto dropafterack;
		} else
			TCP_STATADD(TCP_STAT_RCVBYTEAFTERWIN, todrop);
		m_adj(m, -todrop);
		tlen -= todrop;
		tiflags &= ~(TH_PUSH|TH_FIN);
	}

	if (tiflags & TH_RST) {
		if (th->th_seq != vp->rcv_nxt)
			goto dropafterack_ratelim;

		vtw_del(vp->ctl, vp->vtw);
		goto drop;
	}

	/*
	 * If the ACK bit is off we drop the segment and return.
	 */
	if ((tiflags & TH_ACK) == 0) {
		if (t_flags & TF_ACKNOW)
			goto dropafterack;
		else
			goto drop;
	}

	/*
	 * In TIME_WAIT state the only thing that should arrive
	 * is a retransmission of the remote FIN. Acknowledge
	 * it and restart the finack timer.
	 */
	vtw_restart(vp);
	goto dropafterack;

dropafterack:
	/*
	 * Generate an ACK dropping incoming segment if it occupies
	 * sequence space, where the ACK reflects our state.
	 */
	if (tiflags & TH_RST)
		goto drop;
	goto dropafterack2;

dropafterack_ratelim:
	/*
	 * We may want to rate-limit ACKs against SYN/RST attack.
	 */
	if (ppsratecheck(&tcp_ackdrop_ppslim_last, &tcp_ackdrop_ppslim_count,
	    tcp_ackdrop_ppslim) == 0) {
		/* XXX stat */
		goto drop;
	}
	/* ...fall into dropafterack2... */

dropafterack2:
	(void)tcp_respond(0, m, m, th, th->th_seq + tlen, th->th_ack,
	    TH_ACK);
	return;

dropwithreset:
	/*
	 * Generate a RST, dropping incoming segment.
	 * Make ACK acceptable to originator of segment.
	 */
	if (tiflags & TH_RST)
		goto drop;

	if (tiflags & TH_ACK)
		tcp_respond(0, m, m, th, (tcp_seq)0, th->th_ack, TH_RST);
	else {
		if (tiflags & TH_SYN)
			++tlen;
		(void)tcp_respond(0, m, m, th, th->th_seq + tlen, (tcp_seq)0,
		    TH_RST|TH_ACK);
	}
	return;
drop:
	m_freem(m);
}
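The first half of `tcp_vtw_input()` trims data that the receiver has already seen: `todrop` is the distance from the segment's first sequence number to `rcv_nxt`. A reduced sketch of that arithmetic follows, with the flag handling (SYN, FIN, RST, URG) and the statistics left out. `struct seg` and `trim_duplicate` are hypothetical names, and clamping `todrop` to the segment length stands in for the "complete duplicate" branch above.

```c
#include <stdint.h>

/* Hypothetical, reduced view of a segment: first sequence number and length. */
struct seg {
	uint32_t seq;
	int len;
};

/*
 * Trim bytes that lie before rcv_nxt from the front of the segment.
 * Returns the number of bytes trimmed.  The subtraction is done modulo
 * 2^32 and then read as a signed value, so it stays correct when the
 * sequence space wraps.
 */
int
trim_duplicate(struct seg *s, uint32_t rcv_nxt)
{
	int todrop = (int32_t)(rcv_nxt - s->seq);

	if (todrop <= 0)
		return 0;		/* nothing in the segment is old */
	if (todrop > s->len)
		todrop = s->len;	/* the whole segment is a duplicate */
	s->seq += (uint32_t)todrop;
	s->len -= todrop;
	return todrop;
}
```

In the vestigial case any bytes that survive this trimming are new data for a connection whose socket is already gone, which is why the function above answers a non-zero `tlen` with a reset.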
/*
 * TCP input routine, follows pages 65-76 of RFC 793 very closely.
 */
void
tcp_input(struct mbuf *m, ...)
{
	struct tcphdr *th;
	struct ip *ip;
	struct inpcb *inp;
#ifdef INET6
	struct ip6_hdr *ip6;
	struct in6pcb *in6p;
#endif
	u_int8_t *optp = NULL;
	int optlen = 0;
	int len, tlen, toff, hdroptlen = 0;
	struct tcpcb *tp = 0;
	int tiflags;
	struct socket *so = NULL;
	int todrop, acked, ourfinisacked, needoutput = 0;
	bool dupseg;
#ifdef TCP_DEBUG
	short ostate = 0;
#endif
	u_long tiwin;
	struct tcp_opt_info opti;
	int off, iphlen;
	va_list ap;
	int af; /* af on the wire */
	struct mbuf *tcp_saveti = NULL;
	uint32_t ts_rtt;
	uint8_t iptos;
	uint64_t *tcps;
	vestigial_inpcb_t vestige;

	vestige.valid = 0;

	MCLAIM(m, &tcp_rx_mowner);
	va_start(ap, m);
	toff = va_arg(ap, int);
	(void)va_arg(ap, int); /* ignore value, advance ap */
	va_end(ap);

	TCP_STATINC(TCP_STAT_RCVTOTAL);

	memset(&opti, 0, sizeof(opti));
	opti.ts_present = 0;
	opti.maxseg = 0;

	/*
	 * RFC1122 4.2.3.10, p. 104: discard bcast/mcast SYN.
	 *
	 * TCP is, by definition, unicast, so we reject all
	 * multicast outright.
	 *
	 * Note, there are additional src/dst address checks in
	 * the AF-specific code below.
	 */
	if (m->m_flags & (M_BCAST|M_MCAST)) {
		/* XXX stat */
		goto drop;
	}
#ifdef INET6
	if (m->m_flags & M_ANYCAST6) {
		/* XXX stat */
		goto drop;
	}
#endif

	/*
	 * Get IP and TCP header.
	 * Note: IP leaves IP header in first mbuf.
	 */
	ip = mtod(m, struct ip *);
	switch (ip->ip_v) {
#ifdef INET
	case 4:
#ifdef INET6
		ip6 = NULL;
#endif
		af = AF_INET;
		iphlen = sizeof(struct ip);
		IP6_EXTHDR_GET(th, struct tcphdr *, m, toff,
			sizeof(struct tcphdr));
		if (th == NULL) {
			TCP_STATINC(TCP_STAT_RCVSHORT);
			return;
		}
		/* We do the checksum after PCB lookup... */
		len = ntohs(ip->ip_len);
		tlen = len - toff;
		iptos = ip->ip_tos;
		break;
#endif
#ifdef INET6
	case 6:
		ip = NULL;
		iphlen = sizeof(struct ip6_hdr);
		af = AF_INET6;
		ip6 = mtod(m, struct ip6_hdr *);
		IP6_EXTHDR_GET(th, struct tcphdr *, m, toff,
			sizeof(struct tcphdr));
		if (th == NULL) {
			TCP_STATINC(TCP_STAT_RCVSHORT);
			return;
		}

		/* Be proactive about malicious use of IPv4 mapped address */
		if (IN6_IS_ADDR_V4MAPPED(&ip6->ip6_src) ||
		    IN6_IS_ADDR_V4MAPPED(&ip6->ip6_dst)) {
			/* XXX stat */
			goto drop;
		}

		/*
		 * Be proactive about unspecified IPv6 address in source.
		 * As we use all-zero to indicate unbound/unconnected pcb,
		 * unspecified IPv6 address can be used to confuse us.
		 *
		 * Note that packets with unspecified IPv6 destination are
		 * already dropped in ip6_input.
		 */
		if (IN6_IS_ADDR_UNSPECIFIED(&ip6->ip6_src)) {
			/* XXX stat */
			goto drop;
		}

		/*
		 * Make sure destination address is not multicast.
		 * Source address checked in ip6_input().
		 */
		if (IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst)) {
			/* XXX stat */
			goto drop;
		}

		/* We do the checksum after PCB lookup... */
		len = m->m_pkthdr.len;
		tlen = len - toff;
		iptos = (ntohl(ip6->ip6_flow) >> 20) & 0xff;
		break;
#endif
	default:
		m_freem(m);
		return;
	}

Changes to allow the IPv4 and IPv6 layers to align headers themselves,
as necessary:
* Implement a new mbuf utility routine, m_copyup(), which is like
m_pullup(), except that it always prepends and copies, rather
than only doing so if the desired length is larger than m->m_len.
m_copyup() also allows an offset into the destination mbuf, which
allows space for packet headers, in the forwarding case.
* Add *_HDR_ALIGNED_P() macros for IP, IPv6, ICMP, and IGMP. These
macros expand to 1 if __NO_STRICT_ALIGNMENT is defined, so that
architectures which do not have strict alignment constraints don't
pay for the test or visit the new align-if-needed path.
* Use the new macros to check if a header needs to be aligned, or to
assert that it already is, as appropriate.
Note: This code is still somewhat experimental. However, the new
code path won't be visited if individual device drivers continue
to guarantee that packets are delivered to layer 3 already properly
aligned (which are rules that are already in use).
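The align-if-needed idea in the list above can be sketched outside the kernel as follows. This is an illustration under stated assumptions, not the NetBSD code: `SKETCH_NO_STRICT_ALIGNMENT`, `SKETCH_HDR_ALIGNED_P` and `copy_if_unaligned` are hypothetical names, a 4-byte alignment requirement is assumed, and a plain `memcpy()` into a caller-supplied buffer stands in for what m_copyup() does with mbufs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the *_HDR_ALIGNED_P() macros: constant 1 when
 * the platform has no strict alignment rules, otherwise a test of the low
 * address bits (4-byte alignment assumed here).
 */
#ifdef SKETCH_NO_STRICT_ALIGNMENT
#define SKETCH_HDR_ALIGNED_P(p)	1
#else
#define SKETCH_HDR_ALIGNED_P(p)	((((uintptr_t)(p)) & 3) == 0)
#endif

/*
 * Copy the header into dst only if src is misaligned.
 * Returns 1 if a copy was made, 0 if src was already aligned
 * (dst is left untouched in that case).
 */
int
copy_if_unaligned(const unsigned char *src, unsigned char *dst, size_t len)
{
	if (SKETCH_HDR_ALIGNED_P(src))
		return 0;
	memcpy(dst, src, len);
	return 1;
}
```

When the macro is the constant 1, the compiler can discard the copy path entirely, which is the "don't pay for the test" property the commit message describes.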
2002-07-01 02:40:32 +04:00
	KASSERT(TCP_HDR_ALIGNED_P(th));

	/*
	 * Check that TCP offset makes sense,
	 * pull out TCP options and adjust length. XXX
	 */
	off = th->th_off << 2;
	if (off < sizeof (struct tcphdr) || off > tlen) {
		TCP_STATINC(TCP_STAT_RCVBADOFF);
		goto drop;
	}
	tlen -= off;

	/*
	 * tcp_input() has been modified to use tlen to mean the TCP data
	 * length throughout the function. Other functions can use
	 * m->m_pkthdr.len as the basis for calculating the TCP data length.
	 * rja
	 */

	if (off > sizeof (struct tcphdr)) {
		IP6_EXTHDR_GET(th, struct tcphdr *, m, toff, off);
		if (th == NULL) {
			TCP_STATINC(TCP_STAT_RCVSHORT);
			return;
		}
		/*
		 * NOTE: ip/ip6 will not be affected by m_pulldown()
		 * (as they're before toff) and we don't need to update those.
		 */
		KASSERT(TCP_HDR_ALIGNED_P(th));
		optlen = off - sizeof (struct tcphdr);
		optp = ((u_int8_t *)th) + sizeof(struct tcphdr);
		/*
		 * Do quick retrieval of timestamp options ("options
		 * prediction?"). If timestamp is the only option and it's
		 * formatted as recommended in RFC 1323 appendix A, we
		 * quickly get the values now and not bother calling
		 * tcp_dooptions(), etc.
		 */
		if ((optlen == TCPOLEN_TSTAMP_APPA ||
		     (optlen > TCPOLEN_TSTAMP_APPA &&
			optp[TCPOLEN_TSTAMP_APPA] == TCPOPT_EOL)) &&
		     *(u_int32_t *)optp == htonl(TCPOPT_TSTAMP_HDR) &&
		     (th->th_flags & TH_SYN) == 0) {
			opti.ts_present = 1;
			opti.ts_val = ntohl(*(u_int32_t *)(optp + 4));
			opti.ts_ecr = ntohl(*(u_int32_t *)(optp + 8));
			optp = NULL; /* we've parsed the options */
		}
	}
	tiflags = th->th_flags;

	/*
	 * Locate pcb for segment.
	 */
findpcb:
	inp = NULL;
#ifdef INET6
	in6p = NULL;
#endif
	switch (af) {
#ifdef INET
	case AF_INET:
		inp = in_pcblookup_connect(&tcbtable, ip->ip_src, th->th_sport,
		    ip->ip_dst, th->th_dport,
		    &vestige);
		if (inp == 0 && !vestige.valid) {
			TCP_STATINC(TCP_STAT_PCBHASHMISS);
			inp = in_pcblookup_bind(&tcbtable, ip->ip_dst, th->th_dport);
		}
#ifdef INET6
Reduces the resources demanded by TCP sessions in TIME_WAIT-state using
methods called Vestigial Time-Wait (VTW) and Maximum Segment Lifetime
Truncation (MSLT).
MSLT and VTW were contributed by Coyote Point Systems, Inc.
Even after a TCP session enters the TIME_WAIT state, its corresponding
socket and protocol control blocks (PCBs) stick around until the TCP
Maximum Segment Lifetime (MSL) expires. On a host whose workload
necessarily creates and closes down many TCP sockets, the sockets & PCBs
for TCP sessions in TIME_WAIT state amount to many megabytes of dead
weight in RAM.
Maximum Segment Lifetimes Truncation (MSLT) assigns each TCP session to
a class based on the nearness of the peer. Corresponding to each class
is an MSL, and a session uses the MSL of its class. The classes are
loopback (local host equals remote host), local (local host and remote
host are on the same link/subnet), and remote (local host and remote
host communicate via one or more gateways). Classes corresponding to
nearer peers have lower MSLs by default: 2 seconds for loopback, 10
seconds for local, 60 seconds for remote. Loopback and local sessions
expire more quickly when MSLT is used.
Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket
dead weight with a compact representation of the session, called a
"vestigial PCB". VTW data structures are designed to be very fast and
memory-efficient: for fast insertion and lookup of vestigial PCBs,
the PCBs are stored in a hash table that is designed to minimize the
number of cacheline visits per lookup/insertion. The memory both
for vestigial PCBs and for elements of the PCB hashtable comes from
fixed-size pools, and linked data structures exploit this to conserve
memory by representing references with a narrow index/offset from the
start of a pool instead of a pointer. When space for new vestigial PCBs
runs out, VTW makes room by discarding old vestigial PCBs, oldest first.
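The two ideas in this paragraph, narrow index references into a fixed-size pool and oldest-first discard when the pool is full, can be sketched as below. This is a simplified illustration, not the VTW implementation: the names `vp_ref`, `vp_entry`, `vp_pool`, `vp_deref` and `vp_alloc` are hypothetical, the pool size is tiny so the eviction is easy to follow, and the hash table is omitted entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VP_POOL_SIZE 4u		/* hypothetical; tiny for illustration */
#define VP_NIL       UINT16_MAX	/* "null" reference */

typedef uint16_t vp_ref;	/* narrow index used in place of a pointer */

struct vp_entry {		/* hypothetical compact session record */
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

struct vp_pool {
	struct vp_entry ent[VP_POOL_SIZE];
	vp_ref   oldest;	/* index of the oldest live entry */
	uint16_t count;		/* number of live entries */
};

/* Turn an index back into a pointer relative to the start of the pool. */
static inline struct vp_entry *
vp_deref(struct vp_pool *p, vp_ref r)
{
	if (r == VP_NIL)
		return NULL;
	assert(r < VP_POOL_SIZE);
	return &p->ent[r];
}

/*
 * Allocate a slot.  While the pool has room, slots are handed out in
 * order; once it is full, the oldest entry is discarded and reused.
 */
static inline vp_ref
vp_alloc(struct vp_pool *p)
{
	vp_ref r;

	if (p->count < VP_POOL_SIZE) {
		r = (vp_ref)((p->oldest + p->count) % VP_POOL_SIZE);
		p->count++;
	} else {
		r = p->oldest;	/* discard oldest first */
		p->oldest = (vp_ref)((p->oldest + 1u) % VP_POOL_SIZE);
	}
	return r;
}
```

In this sketch a reference costs two bytes instead of the size of a pointer, which is the kind of saving the paragraph describes; a real design would also have to unlink a discarded entry from any hash chain that still refers to it.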
VTW cooperates with MSLT.
It may help to think of VTW as a "FIN cache" by analogy to the SYN
cache.
A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT
sessions as fast as it can is approximately 17% idle when VTW is active
versus 0% idle when VTW is inactive. It has 103 megabytes more free RAM
when VTW is active (approximately 64k vestigial PCBs are created) than
when it is inactive.
2011-05-03 22:28:44 +04:00
|
|
|
if (inp == 0 && !vestige.valid) {
|
1999-07-01 12:12:45 +04:00
|
|
|
struct in6_addr s, d;
|
|
|
|
|
|
|
|
/* mapped addr case */
|
2009-03-18 19:00:08 +03:00
|
|
|
memset(&s, 0, sizeof(s));
|
1999-07-01 12:12:45 +04:00
|
|
|
s.s6_addr16[5] = htons(0xffff);
|
|
|
|
bcopy(&ip->ip_src, &s.s6_addr32[3], sizeof(ip->ip_src));
|
2009-03-18 19:00:08 +03:00
|
|
|
memset(&d, 0, sizeof(d));
|
1999-07-01 12:12:45 +04:00
|
|
|
d.s6_addr16[5] = htons(0xffff);
|
|
|
|
bcopy(&ip->ip_dst, &d.s6_addr32[3], sizeof(ip->ip_dst));
|
2003-09-04 13:16:57 +04:00
|
|
|
in6p = in6_pcblookup_connect(&tcbtable, &s,
|
2011-05-03 22:28:44 +04:00
|
|
|
th->th_sport, &d, th->th_dport,
|
|
|
|
0, &vestige);
|
|
|
|
if (in6p == 0 && !vestige.valid) {
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_PCBHASHMISS);
|
2003-09-04 13:16:57 +04:00
|
|
|
in6p = in6_pcblookup_bind(&tcbtable, &d,
|
|
|
|
th->th_dport, 0);
|
1999-07-01 12:12:45 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
#ifndef INET6
|
2011-05-03 22:28:44 +04:00
|
|
|
if (inp == 0 && !vestige.valid)
|
1999-07-01 12:12:45 +04:00
|
|
|
#else
|
2011-05-03 22:28:44 +04:00
|
|
|
if (inp == 0 && in6p == 0 && !vestige.valid)
|
1999-07-01 12:12:45 +04:00
|
|
|
#endif
|
|
|
|
{
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_NOPORT);
|
2002-10-16 19:15:28 +04:00
|
|
|
if (tcp_log_refused &&
|
|
|
|
(tiflags & (TH_RST|TH_ACK|TH_SYN)) == TH_SYN) {
|
2002-06-29 08:13:21 +04:00
|
|
|
tcp4_log_refused(ip, th);
|
1999-05-24 00:33:50 +04:00
|
|
|
}
|
2008-02-20 14:44:07 +03:00
|
|
|
tcp_fields_to_host(th);
|
2000-02-15 22:54:11 +03:00
|
|
|
goto dropwithreset_ratelim;
|
1996-01-31 06:49:23 +03:00
|
|
|
}
|
2011-12-19 15:59:56 +04:00
|
|
|
#if defined(KAME_IPSEC) || defined(FAST_IPSEC)
|
2003-09-10 04:58:29 +04:00
|
|
|
if (inp && (inp->inp_socket->so_options & SO_ACCEPTCONN) == 0 &&
|
|
|
|
ipsec4_in_reject(m, inp)) {
|
2008-04-23 10:09:04 +04:00
|
|
|
IPSEC_STATINC(IPSEC_STAT_IN_POLVIO);
|
1999-07-01 12:12:45 +04:00
|
|
|
goto drop;
|
|
|
|
}
|
|
|
|
#ifdef INET6
|
2003-09-10 04:58:29 +04:00
|
|
|
else if (in6p &&
|
|
|
|
(in6p->in6p_socket->so_options & SO_ACCEPTCONN) == 0 &&
|
2007-02-10 12:43:05 +03:00
|
|
|
ipsec6_in_reject_so(m, in6p->in6p_socket)) {
|
2008-04-23 10:09:04 +04:00
|
|
|
IPSEC_STATINC(IPSEC_STAT_IN_POLVIO);
|
1999-07-01 12:12:45 +04:00
|
|
|
goto drop;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
#endif /*IPSEC*/
|
|
|
|
break;
|
2000-10-17 07:06:42 +04:00
|
|
|
#endif /*INET*/
|
2000-10-20 00:22:59 +04:00
|
|
|
#ifdef INET6
|
1999-07-01 12:12:45 +04:00
|
|
|
case AF_INET6:
|
1999-07-17 11:07:08 +04:00
|
|
|
{
|
|
|
|
int faith;
|
|
|
|
|
|
|
|
#if defined(NFAITH) && NFAITH > 0
|
2001-05-08 14:15:13 +04:00
|
|
|
faith = faithprefix(&ip6->ip6_dst);
|
1999-07-17 11:07:08 +04:00
|
|
|
#else
|
|
|
|
faith = 0;
|
|
|
|
#endif
|
2003-09-04 13:16:57 +04:00
|
|
|
in6p = in6_pcblookup_connect(&tcbtable, &ip6->ip6_src,
|
2011-05-03 22:28:44 +04:00
|
|
|
th->th_sport, &ip6->ip6_dst, th->th_dport, faith, &vestige);
|
|
|
|
if (!in6p && !vestige.valid) {
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_PCBHASHMISS);
|
2003-09-04 13:16:57 +04:00
|
|
|
in6p = in6_pcblookup_bind(&tcbtable, &ip6->ip6_dst,
|
1999-07-17 11:07:08 +04:00
|
|
|
th->th_dport, faith);
|
1999-07-01 12:12:45 +04:00
|
|
|
}
|
2011-05-03 22:28:44 +04:00
|
|
|
if (!in6p && !vestige.valid) {
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_NOPORT);
|
2002-10-16 19:15:28 +04:00
|
|
|
if (tcp_log_refused &&
|
|
|
|
(tiflags & (TH_RST|TH_ACK|TH_SYN)) == TH_SYN) {
|
2002-06-29 08:13:21 +04:00
|
|
|
tcp6_log_refused(ip6, th);
|
2002-03-12 07:36:47 +03:00
|
|
|
}
|
2008-02-20 14:44:07 +03:00
|
|
|
tcp_fields_to_host(th);
|
2000-02-15 22:54:11 +03:00
|
|
|
goto dropwithreset_ratelim;
|
1999-07-01 12:12:45 +04:00
|
|
|
}
|
2011-12-19 15:59:56 +04:00
|
|
|
#if defined(KAME_IPSEC) || defined(FAST_IPSEC)
|
2011-05-03 22:28:44 +04:00
|
|
|
if (in6p
|
|
|
|
&& (in6p->in6p_socket->so_options & SO_ACCEPTCONN) == 0
|
|
|
|
&& ipsec6_in_reject(m, in6p)) {
|
2008-04-23 10:09:04 +04:00
|
|
|
IPSEC6_STATINC(IPSEC_STAT_IN_POLVIO);
|
1999-07-01 12:12:45 +04:00
|
|
|
goto drop;
|
|
|
|
}
|
|
|
|
#endif /*IPSEC*/
|
|
|
|
break;
|
1999-07-17 11:07:08 +04:00
|
|
|
}
|
1999-07-01 12:12:45 +04:00
|
|
|
#endif
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
|
|
|
|
1996-09-09 18:51:07 +04:00
|
|
|
/*
|
|
|
|
* If the state is CLOSED (i.e., TCB does not exist) then
|
|
|
|
* all data in the incoming segment is discarded.
|
|
|
|
* If the TCB exists but is in CLOSED state, it is embryonic,
|
|
|
|
* but should either do a listen or a connect soon.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
tp = NULL;
|
|
|
|
so = NULL;
|
|
|
|
if (inp) {
|
2009-07-19 03:09:53 +04:00
|
|
|
/* Check the minimum TTL for socket. */
|
|
|
|
if (ip->ip_ttl < inp->inp_ip_minttl)
|
|
|
|
goto drop;
|
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
tp = intotcpcb(inp);
|
|
|
|
so = inp->inp_socket;
|
|
|
|
}
|
|
|
|
#ifdef INET6
|
|
|
|
else if (in6p) {
|
|
|
|
tp = in6totcpcb(in6p);
|
|
|
|
so = in6p->in6p_socket;
|
|
|
|
}
|
|
|
|
#endif
|
2011-05-03 22:28:44 +04:00
|
|
|
else if (vestige.valid) {
|
|
|
|
int mc = 0;
|
|
|
|
|
|
|
|
/* We do not support the resurrection of vtw tcpcbs.
|
|
|
|
*/
|
|
|
|
if (tcp_input_checksum(af, m, th, toff, off, tlen))
|
|
|
|
goto badcsum;
|
|
|
|
|
|
|
|
switch (af) {
|
|
|
|
#ifdef INET6
|
|
|
|
case AF_INET6:
|
|
|
|
mc = IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst);
|
|
|
|
break;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
case AF_INET:
|
|
|
|
mc = (IN_MULTICAST(ip->ip_dst.s_addr)
|
|
|
|
|| in_broadcast(ip->ip_dst, m->m_pkthdr.rcvif));
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
tcp_fields_to_host(th);
|
|
|
|
tcp_vtw_input(th, &vestige, m, tlen, mc);
|
|
|
|
m = 0;
|
|
|
|
goto drop;
|
|
|
|
}
|
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
if (tp == 0) {
|
2008-02-20 14:44:07 +03:00
|
|
|
tcp_fields_to_host(th);
|
2000-02-15 22:54:11 +03:00
|
|
|
goto dropwithreset_ratelim;
|
1999-07-01 12:12:45 +04:00
|
|
|
}
|
1993-03-21 12:45:37 +03:00
|
|
|
if (tp->t_state == TCPS_CLOSED)
|
|
|
|
goto drop;
|
2000-02-12 20:19:34 +03:00
|
|
|
|
2008-07-04 22:22:21 +04:00
|
|
|
KASSERT(so->so_lock == softnet_lock);
|
|
|
|
KASSERT(solocked(so));
|
|
|
|
|
2000-02-12 20:19:34 +03:00
|
|
|
/*
|
|
|
|
* Checksum extended TCP header and data.
|
|
|
|
*/
|
2004-12-21 08:51:31 +03:00
|
|
|
if (tcp_input_checksum(af, m, th, toff, off, tlen))
|
|
|
|
goto badcsum;
|
2000-02-12 20:19:34 +03:00
|
|
|
|
2008-02-20 14:44:07 +03:00
|
|
|
tcp_fields_to_host(th);
|
2000-02-12 20:19:34 +03:00
|
|
|
|
1994-05-13 10:02:48 +04:00
|
|
|
/* Unscale the window into a 32-bit value. */
|
|
|
|
if ((tiflags & TH_SYN) == 0)
|
1999-07-01 12:12:45 +04:00
|
|
|
tiwin = th->th_win << tp->snd_scale;
|
1994-05-13 10:02:48 +04:00
|
|
|
else
|
1999-07-01 12:12:45 +04:00
|
|
|
tiwin = th->th_win;
|
|
|
|
|
|
|
|
#ifdef INET6
|
|
|
|
/* save packet options if user wanted */
|
|
|
|
if (in6p && (in6p->in6p_flags & IN6P_CONTROLOPTS)) {
|
|
|
|
if (in6p->in6p_options) {
|
|
|
|
m_freem(in6p->in6p_options);
|
|
|
|
in6p->in6p_options = 0;
|
|
|
|
}
|
2006-04-15 06:32:22 +04:00
|
|
|
KASSERT(ip6 != NULL);
|
1999-07-01 12:12:45 +04:00
|
|
|
ip6_savecontrol(in6p, &in6p->in6p_options, ip6, m);
|
|
|
|
}
|
|
|
|
#endif
|
1994-05-13 10:02:48 +04:00
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
if (so->so_options & (SO_DEBUG|SO_ACCEPTCONN)) {
|
1999-07-01 12:12:45 +04:00
|
|
|
union syn_cache_sa src;
|
|
|
|
union syn_cache_sa dst;
|
|
|
|
|
2009-03-18 19:00:08 +03:00
|
|
|
memset(&src, 0, sizeof(src));
|
|
|
|
memset(&dst, 0, sizeof(dst));
|
1999-07-01 12:12:45 +04:00
|
|
|
switch (af) {
|
2000-10-17 07:06:42 +04:00
|
|
|
#ifdef INET
|
1999-07-01 12:12:45 +04:00
|
|
|
case AF_INET:
|
|
|
|
src.sin.sin_len = sizeof(struct sockaddr_in);
|
|
|
|
src.sin.sin_family = AF_INET;
|
|
|
|
src.sin.sin_addr = ip->ip_src;
|
|
|
|
src.sin.sin_port = th->th_sport;
|
|
|
|
|
|
|
|
dst.sin.sin_len = sizeof(struct sockaddr_in);
|
|
|
|
dst.sin.sin_family = AF_INET;
|
|
|
|
dst.sin.sin_addr = ip->ip_dst;
|
|
|
|
dst.sin.sin_port = th->th_dport;
|
|
|
|
break;
|
2000-10-17 07:06:42 +04:00
|
|
|
#endif
|
1999-07-01 12:12:45 +04:00
|
|
|
#ifdef INET6
|
|
|
|
case AF_INET6:
|
|
|
|
src.sin6.sin6_len = sizeof(struct sockaddr_in6);
|
|
|
|
src.sin6.sin6_family = AF_INET6;
|
|
|
|
src.sin6.sin6_addr = ip6->ip6_src;
|
|
|
|
src.sin6.sin6_port = th->th_sport;
|
|
|
|
|
|
|
|
dst.sin6.sin6_len = sizeof(struct sockaddr_in6);
|
|
|
|
dst.sin6.sin6_family = AF_INET6;
|
|
|
|
dst.sin6.sin6_addr = ip6->ip6_dst;
|
|
|
|
dst.sin6.sin6_port = th->th_dport;
|
|
|
|
break;
|
|
|
|
#endif /* INET6 */
|
|
|
|
default:
|
|
|
|
goto badsyn; /*sanity*/
|
|
|
|
}
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
if (so->so_options & SO_DEBUG) {
|
2002-10-22 07:07:06 +04:00
|
|
|
#ifdef TCP_DEBUG
|
1993-03-21 12:45:37 +03:00
|
|
|
ostate = tp->t_state;
|
2002-10-22 07:07:06 +04:00
|
|
|
#endif
|
2000-06-30 20:44:33 +04:00
|
|
|
|
|
|
|
tcp_saveti = NULL;
|
|
|
|
if (iphlen + sizeof(struct tcphdr) > MHLEN)
|
|
|
|
goto nosave;
|
|
|
|
|
|
|
|
if (m->m_len > iphlen && (m->m_flags & M_EXT) == 0) {
|
|
|
|
tcp_saveti = m_copym(m, 0, iphlen, M_DONTWAIT);
|
|
|
|
if (!tcp_saveti)
|
|
|
|
goto nosave;
|
|
|
|
} else {
|
|
|
|
MGETHDR(tcp_saveti, M_DONTWAIT, MT_HEADER);
|
|
|
|
if (!tcp_saveti)
|
|
|
|
goto nosave;
|
2003-02-26 09:31:08 +03:00
|
|
|
MCLAIM(m, &tcp_mowner);
|
2000-06-30 20:44:33 +04:00
|
|
|
tcp_saveti->m_len = iphlen;
|
|
|
|
m_copydata(m, 0, iphlen,
|
2007-03-04 08:59:00 +03:00
|
|
|
mtod(tcp_saveti, void *));
|
2000-06-30 20:44:33 +04:00
|
|
|
}
|
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
if (M_TRAILINGSPACE(tcp_saveti) < sizeof(struct tcphdr)) {
|
|
|
|
m_freem(tcp_saveti);
|
|
|
|
tcp_saveti = NULL;
|
|
|
|
} else {
|
|
|
|
tcp_saveti->m_len += sizeof(struct tcphdr);
|
2007-03-04 08:59:00 +03:00
|
|
|
memcpy(mtod(tcp_saveti, char *) + iphlen, th,
|
2000-06-30 20:44:33 +04:00
|
|
|
sizeof(struct tcphdr));
|
1999-07-01 12:12:45 +04:00
|
|
|
}
|
2000-06-30 20:44:33 +04:00
|
|
|
nosave:;
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
|
|
|
if (so->so_options & SO_ACCEPTCONN) {
|
2002-06-09 20:33:36 +04:00
|
|
|
if ((tiflags & (TH_RST|TH_ACK|TH_SYN)) != TH_SYN) {
|
1997-11-21 09:18:30 +03:00
|
|
|
if (tiflags & TH_RST) {
|
1999-07-01 12:12:45 +04:00
|
|
|
syn_cache_reset(&src.sa, &dst.sa, th);
|
1997-11-21 09:18:30 +03:00
|
|
|
} else if ((tiflags & (TH_ACK|TH_SYN)) ==
|
|
|
|
(TH_ACK|TH_SYN)) {
|
|
|
|
/*
|
|
|
|
* Received a SYN,ACK. This should
|
|
|
|
* never happen while we are in
|
|
|
|
* LISTEN. Send an RST.
|
|
|
|
*/
|
|
|
|
goto badsyn;
|
|
|
|
} else if (tiflags & TH_ACK) {
|
1999-07-01 12:12:45 +04:00
|
|
|
so = syn_cache_get(&src.sa, &dst.sa,
|
|
|
|
th, toff, tlen, so, m);
|
1997-07-24 01:26:40 +04:00
|
|
|
if (so == NULL) {
|
|
|
|
/*
|
|
|
|
* We don't have a SYN for
|
|
|
|
* this ACK; send an RST.
|
|
|
|
*/
|
1997-11-21 09:18:30 +03:00
|
|
|
goto badsyn;
|
1997-07-24 01:26:40 +04:00
|
|
|
} else if (so ==
|
|
|
|
(struct socket *)(-1)) {
|
|
|
|
/*
|
|
|
|
* We were unable to create
|
|
|
|
* the connection. If the
|
|
|
|
* 3-way handshake was
|
1998-09-19 08:34:34 +04:00
|
|
|
* completed, an RST has
|
1997-07-24 01:26:40 +04:00
|
|
|
* been sent to the peer.
|
|
|
|
* Since the mbuf might be
|
|
|
|
* in use for the reply,
|
|
|
|
* do not free it.
|
|
|
|
*/
|
|
|
|
m = NULL;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* We have created a
|
|
|
|
* full-blown connection.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
tp = NULL;
|
|
|
|
inp = NULL;
|
|
|
|
#ifdef INET6
|
|
|
|
in6p = NULL;
|
|
|
|
#endif
|
|
|
|
switch (so->so_proto->pr_domain->dom_family) {
|
2000-10-17 07:06:42 +04:00
|
|
|
#ifdef INET
|
1999-07-01 12:12:45 +04:00
|
|
|
case AF_INET:
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
break;
|
2000-10-17 07:06:42 +04:00
|
|
|
#endif
|
1999-07-01 12:12:45 +04:00
|
|
|
#ifdef INET6
|
|
|
|
case AF_INET6:
|
|
|
|
in6p = sotoin6pcb(so);
|
|
|
|
tp = in6totcpcb(in6p);
|
|
|
|
break;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
if (tp == NULL)
|
|
|
|
goto badsyn; /*XXX*/
|
1997-07-24 01:26:40 +04:00
|
|
|
tiwin <<= tp->snd_scale;
|
|
|
|
goto after_listen;
|
|
|
|
}
|
2002-06-09 20:33:36 +04:00
|
|
|
} else {
|
1998-09-19 08:32:51 +04:00
|
|
|
/*
|
|
|
|
* None of RST, SYN or ACK was set.
|
|
|
|
* This is an invalid packet for a
|
|
|
|
* TCB in LISTEN state. Send a RST.
|
|
|
|
*/
|
|
|
|
goto badsyn;
|
|
|
|
}
|
2002-06-09 20:33:36 +04:00
|
|
|
} else {
|
1997-07-24 01:26:40 +04:00
|
|
|
/*
|
1997-11-21 09:18:30 +03:00
|
|
|
* Received a SYN.
|
2006-10-12 15:46:30 +04:00
|
|
|
*
|
|
|
|
* RFC1122 4.2.3.10, p. 104: discard bcast/mcast SYN
|
1997-11-21 09:18:30 +03:00
|
|
|
*/
|
2006-10-12 15:46:30 +04:00
|
|
|
if (m->m_flags & (M_BCAST|M_MCAST))
|
|
|
|
goto drop;
|
|
|
|
|
|
|
|
switch (af) {
|
|
|
|
#ifdef INET6
|
|
|
|
case AF_INET6:
|
|
|
|
if (IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst))
|
|
|
|
goto drop;
|
|
|
|
break;
|
|
|
|
#endif /* INET6 */
|
|
|
|
case AF_INET:
|
|
|
|
if (IN_MULTICAST(ip->ip_dst.s_addr) ||
|
|
|
|
in_broadcast(ip->ip_dst, m->m_pkthdr.rcvif))
|
|
|
|
goto drop;
|
|
|
|
break;
|
|
|
|
}
|
1999-07-01 12:12:45 +04:00
|
|
|
|
|
|
|
#ifdef INET6
|
2002-08-19 06:13:46 +04:00
|
|
|
/*
|
|
|
|
* If deprecated address is forbidden, we do
|
|
|
|
* not accept SYN to deprecated interface
|
|
|
|
* address to prevent any new inbound
|
|
|
|
* connection from getting established.
|
2002-08-19 06:17:54 +04:00
|
|
|
* When we do not accept SYN, we send a TCP
|
|
|
|
* RST, with deprecated source address (instead
|
2002-08-19 06:13:46 +04:00
|
|
|
* of dropping it). We compromise it as it is
|
2002-08-19 06:17:54 +04:00
|
|
|
* much better for peer to send a RST, and
|
|
|
|
* RST will be the final packet for the
|
|
|
|
* exchange.
|
2002-08-19 06:13:46 +04:00
|
|
|
*
|
2002-08-19 06:17:54 +04:00
|
|
|
* If we do not forbid deprecated addresses, we
|
|
|
|
* accept the SYN packet. RFC2462 does not
|
|
|
|
* suggest dropping SYN in this case.
|
|
|
|
* If we decipher RFC2462 5.5.4, it says like
|
|
|
|
* this:
|
2002-08-19 06:13:46 +04:00
|
|
|
* 1. use of deprecated addr with existing
|
|
|
|
* communication is okay - "SHOULD continue
|
|
|
|
* to be used"
|
|
|
|
* 2. use of it with new communication:
|
|
|
|
* (2a) "SHOULD NOT be used if alternate
|
|
|
|
* address with sufficient scope is
|
|
|
|
* available"
|
2005-02-27 01:45:09 +03:00
|
|
|
* (2b) nothing mentioned otherwise.
|
2002-08-19 06:13:46 +04:00
|
|
|
* Here we fall into (2b) case as we have no
|
2002-08-19 06:17:54 +04:00
|
|
|
* choice in our source address selection - we
|
|
|
|
* must obey the peer.
|
2002-08-19 06:13:46 +04:00
|
|
|
*
|
|
|
|
* The wording in RFC2462 is confusing, and
|
|
|
|
* there are multiple descriptions of
|
|
|
|
* deprecated address handling - worse, they
|
2002-08-19 06:17:54 +04:00
|
|
|
* are not exactly the same. I believe 5.5.4
|
|
|
|
* is the best one, so we follow 5.5.4.
|
2002-08-19 06:13:46 +04:00
|
|
|
*/
|
2002-08-19 06:17:54 +04:00
|
|
|
if (af == AF_INET6 && !ip6_use_deprecated) {
|
2002-08-19 06:13:46 +04:00
|
|
|
struct in6_ifaddr *ia6;
|
|
|
|
if ((ia6 = in6ifa_ifpwithaddr(m->m_pkthdr.rcvif,
|
|
|
|
&ip6->ip6_dst)) &&
|
|
|
|
(ia6->ia6_flags & IN6_IFF_DEPRECATED)) {
|
|
|
|
tp = NULL;
|
|
|
|
goto dropwithreset;
|
|
|
|
}
|
|
|
|
}
|
2002-08-19 06:17:54 +04:00
|
|
|
#endif
|
|
|
|
|
2011-12-19 15:59:56 +04:00
|
|
|
#if defined(KAME_IPSEC) || defined(FAST_IPSEC)
|
2003-09-10 04:58:29 +04:00
|
|
|
switch (af) {
|
2004-03-29 08:59:02 +04:00
|
|
|
#ifdef INET
|
2003-09-10 04:58:29 +04:00
|
|
|
case AF_INET:
|
|
|
|
if (ipsec4_in_reject_so(m, so)) {
|
2008-04-23 10:09:04 +04:00
|
|
|
IPSEC_STATINC(IPSEC_STAT_IN_POLVIO);
|
2003-09-10 04:58:29 +04:00
|
|
|
tp = NULL;
|
|
|
|
goto dropwithreset;
|
|
|
|
}
|
|
|
|
break;
|
2004-03-29 08:59:02 +04:00
|
|
|
#endif
|
2003-09-10 04:58:29 +04:00
|
|
|
#ifdef INET6
|
|
|
|
case AF_INET6:
|
|
|
|
if (ipsec6_in_reject_so(m, so)) {
|
2008-04-23 10:09:04 +04:00
|
|
|
IPSEC6_STATINC(IPSEC_STAT_IN_POLVIO);
|
2003-09-10 04:58:29 +04:00
|
|
|
tp = NULL;
|
|
|
|
goto dropwithreset;
|
|
|
|
}
|
|
|
|
break;
|
2007-02-10 12:43:05 +03:00
|
|
|
#endif /*INET6*/
|
2003-09-10 04:58:29 +04:00
|
|
|
}
|
2007-02-10 12:43:05 +03:00
|
|
|
#endif /*IPSEC*/
|
2003-09-10 04:58:29 +04:00
|
|
|
|
2002-08-19 06:17:54 +04:00
|
|
|
/*
|
|
|
|
* LISTEN socket received a SYN
|
|
|
|
* from itself? This can't possibly
|
|
|
|
* be valid; drop the packet.
|
|
|
|
*/
|
|
|
|
if (th->th_sport == th->th_dport) {
|
|
|
|
int i;
|
|
|
|
|
|
|
|
switch (af) {
|
|
|
|
#ifdef INET
|
|
|
|
case AF_INET:
|
|
|
|
i = in_hosteq(ip->ip_src, ip->ip_dst);
|
|
|
|
break;
|
|
|
|
#endif
|
|
|
|
#ifdef INET6
|
|
|
|
case AF_INET6:
|
|
|
|
i = IN6_ARE_ADDR_EQUAL(&ip6->ip6_src, &ip6->ip6_dst);
|
|
|
|
break;
|
|
|
|
#endif
|
|
|
|
default:
|
|
|
|
i = 1;
|
|
|
|
}
|
|
|
|
if (i) {
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_BADSYN);
|
2002-08-19 06:17:54 +04:00
|
|
|
goto drop;
|
|
|
|
}
|
|
|
|
}
|
2002-08-19 06:13:46 +04:00
|
|
|
|
1997-11-21 09:18:30 +03:00
|
|
|
/*
|
|
|
|
* SYN looks ok; create compressed TCP
|
|
|
|
* state for it.
|
1997-07-24 01:26:40 +04:00
|
|
|
*/
|
|
|
|
if (so->so_qlen <= so->so_qlimit &&
|
1999-07-01 12:12:45 +04:00
|
|
|
syn_cache_add(&src.sa, &dst.sa, th, tlen,
|
|
|
|
so, m, optp, optlen, &opti))
|
1997-07-24 01:26:40 +04:00
|
|
|
m = NULL;
|
|
|
|
}
|
|
|
|
goto drop;
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
1997-07-24 01:26:40 +04:00
|
|
|
after_listen:
|
|
|
|
#ifdef DIAGNOSTIC
|
|
|
|
/*
|
|
|
|
* Should not happen now that all embryonic connections
|
|
|
|
* are handled with compressed state.
|
|
|
|
*/
|
|
|
|
if (tp->t_state == TCPS_LISTEN)
|
|
|
|
panic("tcp_input: TCPS_LISTEN");
|
|
|
|
#endif
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
|
|
|
* Segment received on connection.
|
|
|
|
* Reset idle time and keep-alive timer.
|
|
|
|
*/
|
2001-09-10 19:23:09 +04:00
|
|
|
tp->t_rcvtime = tcp_now;
|
1996-09-11 03:26:05 +04:00
|
|
|
if (TCPS_HAVEESTABLISHED(tp->t_state))
|
2007-06-20 19:29:17 +04:00
|
|
|
TCP_TIMER_ARM(tp, TCPT_KEEP, tp->t_keepidle);
|
1993-03-21 12:45:37 +03:00
|
|
|
|
|
|
|
/*
|
1997-07-24 01:26:40 +04:00
|
|
|
* Process options.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
2004-05-18 18:44:14 +04:00
|
|
|
#ifdef TCP_SIGNATURE
|
|
|
|
if (optp || (tp->t_flags & TF_SIGNATURE))
|
|
|
|
#else
|
1997-07-24 01:26:40 +04:00
|
|
|
if (optp)
|
2004-05-18 18:44:14 +04:00
|
|
|
#endif
|
|
|
|
if (tcp_dooptions(tp, optp, optlen, th, m, toff, &opti) < 0)
|
|
|
|
goto drop;
|
1994-05-13 10:02:48 +04:00
|
|
|
|
Commit TCP SACK patches from Kentaro A. Kurahone's patch at:
http://www.sigusr1.org/~kurahone/tcp-sack-netbsd-02152005.diff.gz
Fixes in that patch for pre-existing TCP pcb initializations were already
committed to NetBSD-current, so are not included in this commit.
The SACK patch has been observed to correctly negotiate and respond
to SACKs in wide-area traffic.
There are two independently-observed, as-yet-unresolved anomalies:
First, seeing unexplained delays in fast retransmission
(potentially explainable by a 0.2sec RTT between adjacent
ethernet/wifi NICs); and second, peculiar and unexplained TCP
retransmits observed over an ath0 card.
After discussion with several interested developers, I'm committing
this now, as-is, for more eyes to use and look over. Current hypothesis
is that the anomalies above may in fact be due to link-level (hardware,
driver, HAL, firmware) aberrations in the test setup, affecting both
Kentaro's wired-Ethernet NIC and my two (different) WiFi NICs.
2005-02-28 19:20:59 +03:00
|
|
|
if (TCP_SACK_ENABLED(tp)) {
|
|
|
|
tcp_del_sackholes(tp, th);
|
|
|
|
}
|
|
|
|
|
2006-09-05 04:29:35 +04:00
|
|
|
if (TCP_ECN_ALLOWED(tp)) {
|
2010-04-16 07:13:03 +04:00
|
|
|
if (tiflags & TH_CWR) {
|
|
|
|
tp->t_flags &= ~TF_ECN_SND_ECE;
|
|
|
|
}
|
2006-09-05 04:29:35 +04:00
|
|
|
switch (iptos & IPTOS_ECN_MASK) {
|
|
|
|
case IPTOS_ECN_CE:
|
|
|
|
tp->t_flags |= TF_ECN_SND_ECE;
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_ECN_CE);
|
2006-09-05 04:29:35 +04:00
|
|
|
break;
|
|
|
|
case IPTOS_ECN_ECT0:
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_ECN_ECT);
|
2006-09-05 04:29:35 +04:00
|
|
|
break;
|
|
|
|
case IPTOS_ECN_ECT1:
|
|
|
|
/* XXX: ignore for now -- rpaulo */
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Congestion experienced.
|
|
|
|
* Ignore if we are already trying to recover.
|
|
|
|
*/
|
|
|
|
if ((tiflags & TH_ECE) && SEQ_GEQ(tp->snd_una, tp->snd_recover))
|
2006-10-15 21:45:06 +04:00
|
|
|
tp->t_congctl->cong_exp(tp);
|
2006-09-05 04:29:35 +04:00
|
|
|
}
|
|
|
|
|
2005-01-27 00:49:27 +03:00
|
|
|
if (opti.ts_present && opti.ts_ecr) {
|
|
|
|
/*
|
|
|
|
* Calculate the RTT from the returned time stamp and the
|
|
|
|
* connection's time base. If the time stamp is later than
|
2005-01-27 19:56:06 +03:00
|
|
|
* the current time, or is extremely old, fall back to non-1323
|
2011-04-14 19:48:48 +04:00
|
|
|
* RTT calculation. Since ts_rtt is unsigned, we can test both
|
2005-01-27 19:56:06 +03:00
|
|
|
* at the same time.
|
2011-04-20 17:35:51 +04:00
|
|
|
*
|
|
|
|
* Note that ts_rtt is in units of slow ticks (500
|
|
|
|
* ms). Since most earthbound RTTs are < 500 ms,
|
|
|
|
* observed values will have large quantization noise.
|
|
|
|
* Our smoothed RTT is then the fraction of observed
|
|
|
|
* samples that are 1 tick instead of 0 (times 500
|
|
|
|
* ms).
|
|
|
|
*
|
|
|
|
* ts_rtt is increased by 1 to denote a valid sample,
|
|
|
|
* with 0 indicating an invalid measurement. This
|
|
|
|
* extra 1 must be removed when ts_rtt is used, or
|
|
|
|
* else an erroneous extra 500 ms will result.
|
2005-01-27 00:49:27 +03:00
|
|
|
*/
|
2005-06-06 16:10:09 +04:00
|
|
|
ts_rtt = TCP_TIMESTAMP(tp) - opti.ts_ecr + 1;
|
|
|
|
if (ts_rtt > TCP_PAWS_IDLE)
|
|
|
|
ts_rtt = 0;
|
|
|
|
} else {
|
|
|
|
ts_rtt = 0;
|
2005-01-27 00:49:27 +03:00
|
|
|
}
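The timestamp-based RTT sample computed above can be tried in isolation. The following is a minimal sketch, not the kernel code: `sketch_ts_rtt` and `SKETCH_PAWS_IDLE` are hypothetical names, and the limit of 24 days in 500 ms ticks is an assumption modeled on the comment's description. It shows how one unsigned comparison rejects both an echoed timestamp from the future and one that is extremely old.

```c
#include <stdint.h>

/* Assumed stand-in for TCP_PAWS_IDLE: 24 days counted in 500 ms ticks. */
#define SKETCH_PAWS_IDLE (24u * 24u * 60u * 60u * 2u)

/*
 * Return 0 when there is no usable sample, otherwise the RTT in ticks
 * plus one (so that a measured RTT of 0 ticks is still a valid sample).
 * now and ecr are 32-bit tick counters that are allowed to wrap.
 */
uint32_t
sketch_ts_rtt(uint32_t now, uint32_t ecr, int ts_present)
{
	uint32_t rtt;

	if (!ts_present || ecr == 0)
		return 0;

	/*
	 * Unsigned subtraction is modular: if ecr is later than now, the
	 * difference becomes a very large value, so the single range
	 * check below catches "from the future" and "too old" alike.
	 */
	rtt = now - ecr + 1;
	if (rtt > SKETCH_PAWS_IDLE)
		return 0;
	return rtt;
}
```

A caller that uses the result as an RTT must subtract the extra 1 first, which is what the `ts_rtt - 1` argument to `tcp_xmit_timer()` does in the ack-prediction path.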
|
|
|
|
|
2002-06-09 20:33:36 +04:00
|
|
|
/*
|
1993-03-21 12:45:37 +03:00
|
|
|
* Header prediction: check for the two common cases
|
|
|
|
* of a uni-directional data xfer. If the packet has
|
|
|
|
* no control flags, is in-sequence, the window didn't
|
|
|
|
* change and we're not retransmitting, it's a
|
|
|
|
* candidate. If the length is zero and the ack moved
|
|
|
|
* forward, we're the sender side of the xfer. Just
|
|
|
|
* free the data acked & wake any higher level process
|
|
|
|
* that was blocked waiting for space. If the length
|
|
|
|
* is non-zero and the ack didn't move, we're the
|
|
|
|
* receiver side. If we're getting packets in-order
|
|
|
|
* (the reassembly queue is empty), add the data to
|
|
|
|
* the socket buffer and note that we need a delayed ack.
|
|
|
|
*/
|
|
|
|
if (tp->t_state == TCPS_ESTABLISHED &&
|
2006-09-05 04:29:35 +04:00
|
|
|
(tiflags & (TH_SYN|TH_FIN|TH_RST|TH_URG|TH_ECE|TH_CWR|TH_ACK))
|
|
|
|
== TH_ACK &&
|
1997-07-24 01:26:40 +04:00
|
|
|
(!opti.ts_present || TSTMP_GEQ(opti.ts_val, tp->ts_recent)) &&
|
1999-07-01 12:12:45 +04:00
|
|
|
th->th_seq == tp->rcv_nxt &&
|
1994-05-13 10:02:48 +04:00
|
|
|
tiwin && tiwin == tp->snd_wnd &&
|
1993-03-21 12:45:37 +03:00
|
|
|
tp->snd_nxt == tp->snd_max) {
|
1994-05-13 10:02:48 +04:00
|
|
|
|
2002-06-09 20:33:36 +04:00
|
|
|
/*
|
1994-05-13 10:02:48 +04:00
|
|
|
* If last ACK falls within this segment's sequence numbers,
|
2008-02-05 02:56:14 +03:00
|
|
|
* record the timestamp.
|
|
|
|
* NOTE that the test is modified according to the latest
|
|
|
|
* proposal of the tcplw@cray.com list (Braden 1993/04/26).
|
|
|
|
*
|
|
|
|
* note that we already know
|
|
|
|
* TSTMP_GEQ(opti.ts_val, tp->ts_recent)
|
1994-05-13 10:02:48 +04:00
|
|
|
*/
|
1997-07-24 01:26:40 +04:00
|
|
|
if (opti.ts_present &&
|
2008-02-05 02:56:14 +03:00
|
|
|
SEQ_LEQ(th->th_seq, tp->last_ack_sent)) {
|
2005-01-27 20:14:04 +03:00
|
|
|
tp->ts_recent_age = tcp_now;
|
1997-07-24 01:26:40 +04:00
|
|
|
tp->ts_recent = opti.ts_val;
|
1994-05-13 10:02:48 +04:00
|
|
|
}
|
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
if (tlen == 0) {
|
2005-01-27 00:49:27 +03:00
|
|
|
/* Ack prediction. */
|
1999-07-01 12:12:45 +04:00
|
|
|
if (SEQ_GT(th->th_ack, tp->snd_una) &&
|
|
|
|
SEQ_LEQ(th->th_ack, tp->snd_max) &&
|
1995-06-11 13:36:28 +04:00
|
|
|
tp->snd_cwnd >= tp->snd_wnd &&
|
2005-01-27 06:39:36 +03:00
|
|
|
tp->t_partialacks < 0) {
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
|
|
|
* this is a pure ack for outstanding data.
|
|
|
|
*/
|
2005-06-06 16:10:09 +04:00
|
|
|
if (ts_rtt)
|
2011-05-26 03:20:57 +04:00
|
|
|
tcp_xmit_timer(tp, ts_rtt - 1);
|
2001-09-10 19:23:09 +04:00
|
|
|
else if (tp->t_rtttime &&
|
1999-07-01 12:12:45 +04:00
|
|
|
SEQ_GT(th->th_ack, tp->t_rtseq))
|
2001-09-10 19:23:09 +04:00
|
|
|
tcp_xmit_timer(tp,
|
2004-04-25 04:08:54 +04:00
|
|
|
tcp_now - tp->t_rtttime);
|
1999-07-01 12:12:45 +04:00
|
|
|
acked = th->th_ack - tp->snd_una;
|
2008-04-12 09:58:22 +04:00
|
|
|
tcps = TCP_STAT_GETREF();
|
|
|
|
tcps[TCP_STAT_PREDACK]++;
|
|
|
|
tcps[TCP_STAT_RCVACKPACK]++;
|
|
|
|
tcps[TCP_STAT_RCVACKBYTE] += acked;
|
|
|
|
TCP_STAT_PUTREF();
|
2007-12-20 22:53:29 +03:00
|
|
|
nd6_hint(tp);
|
2003-07-02 23:33:20 +04:00
|
|
|
|
2003-10-24 14:25:40 +04:00
|
|
|
if (acked > (tp->t_lastoff - tp->t_inoff))
|
|
|
|
tp->t_lastm = NULL;
|
1993-03-21 12:45:37 +03:00
|
|
|
sbdrop(&so->so_snd, acked);
|
2003-10-24 14:25:40 +04:00
|
|
|
tp->t_lastoff -= acked;
|
2003-06-29 22:58:26 +04:00
|
|
|
|
2008-02-20 14:44:07 +03:00
|
|
|
icmp_check(tp, th, acked);
|
2005-07-19 21:00:02 +04:00
|
|
|
|
2005-01-27 00:49:27 +03:00
|
|
|
tp->snd_una = th->th_ack;
|
2005-02-28 19:20:59 +03:00
|
|
|
tp->snd_fack = tp->snd_una;
|
2005-01-27 00:49:27 +03:00
|
|
|
if (SEQ_LT(tp->snd_high, tp->snd_una))
|
|
|
|
tp->snd_high = tp->snd_una;
|
1993-03-21 12:45:37 +03:00
|
|
|
m_freem(m);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If all outstanding data are acked, stop
|
|
|
|
* retransmit timer, otherwise restart timer
|
|
|
|
* using current (possibly backed-off) value.
|
|
|
|
* If process is waiting for space,
|
2008-03-01 17:16:49 +03:00
|
|
|
* wakeup/selnotify/signal. If data
|
1993-03-21 12:45:37 +03:00
|
|
|
* are ready to send, let tcp_output
|
|
|
|
* decide between more output or persist.
|
|
|
|
*/
|
|
|
|
if (tp->snd_una == tp->snd_max)
|
1998-05-06 05:21:20 +04:00
|
|
|
TCP_TIMER_DISARM(tp, TCPT_REXMT);
|
|
|
|
else if (TCP_TIMER_ISARMED(tp,
|
|
|
|
TCPT_PERSIST) == 0)
|
|
|
|
TCP_TIMER_ARM(tp, TCPT_REXMT,
|
|
|
|
tp->t_rxtcur);
|
1993-03-21 12:45:37 +03:00
|
|
|
|
1998-04-30 00:43:29 +04:00
|
|
|
sowwakeup(so);
|
2010-04-01 18:31:51 +04:00
|
|
|
if (so->so_snd.sb_cc) {
|
2010-04-01 04:24:41 +04:00
|
|
|
KERNEL_LOCK(1, NULL);
|
1993-03-21 12:45:37 +03:00
|
|
|
(void) tcp_output(tp);
|
2010-04-01 04:24:41 +04:00
|
|
|
KERNEL_UNLOCK_ONE(NULL);
|
2010-04-01 18:31:51 +04:00
|
|
|
}
|
1999-07-22 16:56:56 +04:00
|
|
|
if (tcp_saveti)
|
|
|
|
m_freem(tcp_saveti);
|
1993-03-21 12:45:37 +03:00
|
|
|
return;
|
|
|
|
}
|
1999-07-01 12:12:45 +04:00
|
|
|
} else if (th->th_ack == tp->snd_una &&
|
2002-05-07 06:59:38 +04:00
|
|
|
TAILQ_FIRST(&tp->segq) == NULL &&
|
1999-07-01 12:12:45 +04:00
|
|
|
tlen <= sbspace(&so->so_rcv)) {
|
2007-08-02 06:42:40 +04:00
|
|
|
int newsize = 0; /* automatic sockbuf scaling */
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
|
|
|
* this is a pure, in-sequence data packet
|
|
|
|
* with nothing on the reassembly queue and
|
|
|
|
* we have enough buffer space to take it.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
tp->rcv_nxt += tlen;
|
2008-04-12 09:58:22 +04:00
|
|
|
tcps = TCP_STAT_GETREF();
|
|
|
|
tcps[TCP_STAT_PREDDAT]++;
|
|
|
|
tcps[TCP_STAT_RCVPACK]++;
|
|
|
|
tcps[TCP_STAT_RCVBYTE] += tlen;
|
|
|
|
TCP_STAT_PUTREF();
|
2007-12-20 22:53:29 +03:00
|
|
|
nd6_hint(tp);
|
2007-08-02 06:42:40 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Automatic sizing enables the performance of large buffers
|
|
|
|
* and most of the efficiency of small ones by only allocating
|
|
|
|
* space when it is needed.
|
|
|
|
*
|
|
|
|
* On the receive side the socket buffer memory is only rarely
|
|
|
|
* used to any significant extent. This allows us to be much
|
|
|
|
* more aggressive in scaling the receive socket buffer. For
|
|
|
|
* the case that the buffer space is actually used to a large
|
|
|
|
* extent and we run out of kernel memory we can simply drop
|
|
|
|
* the new segments; TCP on the sender will just retransmit them
|
|
|
|
* later. Setting the buffer size too big may only consume too
|
|
|
|
* much kernel memory if the application doesn't read() from
|
|
|
|
* the socket or packet loss or reordering makes use of the
|
|
|
|
* reassembly queue.
|
|
|
|
*
|
|
|
|
* The criteria to step up the receive buffer one notch are:
|
|
|
|
* 1. the number of bytes received during the time it takes
|
|
|
|
* one timestamp to be reflected back to us (the RTT);
|
|
|
|
* 2. received bytes per RTT is within seven eighths of the
|
|
|
|
* current socket buffer size;
|
|
|
|
* 3. receive buffer size has not hit maximal automatic size;
|
|
|
|
*
|
|
|
|
* This algorithm does one step per RTT at most and only if
|
|
|
|
* we receive a bulk stream w/o packet losses or reorderings.
|
|
|
|
* Shrinking the buffer during idle times is not necessary as
|
|
|
|
* it doesn't consume any memory when idle.
|
|
|
|
*
|
|
|
|
* TODO: Only step up if the application is actually serving
|
|
|
|
* the buffer to better manage the socket buffer resources.
|
|
|
|
*/
|
|
|
|
if (tcp_do_autorcvbuf &&
|
|
|
|
opti.ts_ecr &&
|
|
|
|
(so->so_rcv.sb_flags & SB_AUTOSIZE)) {
|
|
|
|
if (opti.ts_ecr > tp->rfbuf_ts &&
|
2007-08-02 17:06:30 +04:00
|
|
|
opti.ts_ecr - tp->rfbuf_ts < PR_SLOWHZ) {
|
2007-08-02 06:42:40 +04:00
|
|
|
if (tp->rfbuf_cnt >
|
|
|
|
(so->so_rcv.sb_hiwat / 8 * 7) &&
|
|
|
|
so->so_rcv.sb_hiwat <
|
|
|
|
tcp_autorcvbuf_max) {
|
|
|
|
newsize =
|
|
|
|
min(so->so_rcv.sb_hiwat +
|
|
|
|
tcp_autorcvbuf_inc,
|
|
|
|
tcp_autorcvbuf_max);
|
|
|
|
}
|
|
|
|
/* Start over with next RTT. */
|
|
|
|
tp->rfbuf_ts = 0;
|
|
|
|
tp->rfbuf_cnt = 0;
|
|
|
|
} else
|
|
|
|
tp->rfbuf_cnt += tlen; /* add up */
|
|
|
|
}
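The step-up rule for receive buffer autosizing described above can be exercised outside the kernel. This is a hedged sketch that mirrors the control flow of the block above with hypothetical names (`struct sketch_rcvbuf`, `sketch_autorcvbuf`); the increment, the maximum and the slow-timer rate are passed in as parameters rather than read from the real tunables.

```c
#include <stdint.h>

/* Hypothetical subset of the state the autosizing code consults. */
struct sketch_rcvbuf {
	uint32_t hiwat;     /* current buffer limit (cf. so_rcv.sb_hiwat) */
	uint32_t rfbuf_ts;  /* timestamp that opened the measurement */
	uint32_t rfbuf_cnt; /* bytes counted in the current measurement */
};

/*
 * Account for tlen received bytes whose segment echoed ts_ecr.
 * Returns the new buffer size to request, or 0 for "no change".
 */
uint32_t
sketch_autorcvbuf(struct sketch_rcvbuf *rb, uint32_t ts_ecr, uint32_t tlen,
    uint32_t inc, uint32_t max, uint32_t slowhz)
{
	uint32_t newsize = 0;

	if (ts_ecr == 0)
		return 0;	/* no echoed timestamp, nothing to measure */

	if (ts_ecr > rb->rfbuf_ts && ts_ecr - rb->rfbuf_ts < slowhz) {
		/* One measurement interval has elapsed. */
		if (rb->rfbuf_cnt > rb->hiwat / 8 * 7 && rb->hiwat < max) {
			/* Nearly full and below the cap: step up once. */
			newsize = rb->hiwat + inc;
			if (newsize > max)
				newsize = max;
		}
		/* Start over with the next interval. */
		rb->rfbuf_ts = 0;
		rb->rfbuf_cnt = 0;
	} else {
		rb->rfbuf_cnt += tlen;	/* keep accumulating */
	}
	return newsize;
}
```

The sketch only computes the requested size. In the code above the request is applied later with `sbreserve()`, and autosizing is switched off for the socket if that reservation fails.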
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
1994-05-13 10:02:48 +04:00
|
|
|
* Drop TCP, IP headers and TCP options then add data
|
|
|
|
* to socket buffer.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
2002-09-06 03:02:18 +04:00
|
|
|
if (so->so_state & SS_CANTRCVMORE)
|
|
|
|
m_freem(m);
|
|
|
|
else {
|
2007-08-02 06:42:40 +04:00
|
|
|
/*
|
|
|
|
* Set new socket buffer size.
|
|
|
|
* Give up when limit is reached.
|
|
|
|
*/
|
|
|
|
if (newsize)
|
|
|
|
if (!sbreserve(&so->so_rcv,
|
|
|
|
newsize, so))
|
|
|
|
so->so_rcv.sb_flags &= ~SB_AUTOSIZE;
|
2002-09-06 03:02:18 +04:00
|
|
|
m_adj(m, toff + off);
|
|
|
|
sbappendstream(&so->so_rcv, m);
|
|
|
|
}
|
1993-03-21 12:45:37 +03:00
|
|
|
sorwakeup(so);
|
2008-02-20 14:44:07 +03:00
|
|
|
tcp_setup_ack(tp, th);
|
2010-04-01 18:31:51 +04:00
|
|
|
if (tp->t_flags & TF_ACKNOW) {
|
2010-04-01 04:24:41 +04:00
|
|
|
KERNEL_LOCK(1, NULL);
|
1997-12-11 09:33:29 +03:00
|
|
|
(void) tcp_output(tp);
|
2010-04-01 04:24:41 +04:00
|
|
|
KERNEL_UNLOCK_ONE(NULL);
|
2010-04-01 18:31:51 +04:00
|
|
|
}
|
1999-07-22 16:56:56 +04:00
|
|
|
if (tcp_saveti)
|
|
|
|
m_freem(tcp_saveti);
|
1993-03-21 12:45:37 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
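The flag test that gates the header-prediction fast path above can be checked on its own. The sketch below uses the standard TCP header flag bit values under hypothetical `SK_` names; `sketch_fastpath_flags_ok` is an illustration, not a function in this file. It shows that only ACK may be set among the control flags, while PSH is ignored.

```c
#include <stdint.h>

/* Standard TCP header flag bits, renamed to avoid clashing with <netinet/tcp.h>. */
#define SK_TH_FIN  0x01
#define SK_TH_SYN  0x02
#define SK_TH_RST  0x04
#define SK_TH_PUSH 0x08
#define SK_TH_ACK  0x10
#define SK_TH_URG  0x20
#define SK_TH_ECE  0x40
#define SK_TH_CWR  0x80

/*
 * Nonzero when the segment's flags allow the fast path: ACK must be
 * set, and none of SYN, FIN, RST, URG, ECE or CWR may be set.
 * PSH is not part of the mask, so it does not affect the result.
 */
int
sketch_fastpath_flags_ok(uint8_t flags)
{
	const uint8_t mask = SK_TH_SYN | SK_TH_FIN | SK_TH_RST | SK_TH_URG |
	    SK_TH_ECE | SK_TH_CWR | SK_TH_ACK;

	return (flags & mask) == SK_TH_ACK;
}
```

The remaining conditions of the real test (established state, in-sequence segment, unchanged nonzero window, no retransmission in progress) depend on connection state and are left out of this sketch.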
|
|
|
|
|
|
|
|
/*
|
1999-12-08 19:22:20 +03:00
|
|
|
* Compute mbuf offset to TCP data segment.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
1999-12-08 19:22:20 +03:00
|
|
|
hdroptlen = toff + off;
|
1993-03-21 12:45:37 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate amount of space in receive window,
|
|
|
|
* and then do TCP input processing.
|
|
|
|
* Receive window is amount of space in rcv queue,
|
|
|
|
* but not less than advertised window.
|
|
|
|
*/
|
|
|
|
{ int win;
|
|
|
|
|
|
|
|
win = sbspace(&so->so_rcv);
|
|
|
|
if (win < 0)
|
|
|
|
win = 0;
|
1997-07-06 11:04:34 +04:00
|
|
|
tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt));
|
1993-03-21 12:45:37 +03:00
|
|
|
}
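The receive window rule in the block above (free buffer space, but never less than what was already advertised) is sketched below as a standalone function. `sketch_rcv_wnd` is a hypothetical name and the parameters stand in for `sbspace(&so->so_rcv)`, `tp->rcv_adv` and `tp->rcv_nxt`.

```c
#include <stdint.h>

/*
 * Receive window: the free space in the receive buffer, clamped at
 * zero, but never smaller than the window already advertised to the
 * peer (rcv_adv - rcv_nxt, computed modulo 2^32).
 */
uint32_t
sketch_rcv_wnd(int32_t space, uint32_t rcv_adv, uint32_t rcv_nxt)
{
	int32_t win = space < 0 ? 0 : space;
	int32_t adv = (int32_t)(rcv_adv - rcv_nxt);

	return (uint32_t)(win > adv ? win : adv);
}
```

Taking the maximum keeps the window from shrinking below an offer the peer may already be acting on, even when the socket buffer has less room than before.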
|
|
|
|
|
2007-08-02 06:42:40 +04:00
|
|
|
/* Reset receive buffer auto scaling when not in bulk receive mode. */
|
|
|
|
tp->rfbuf_ts = 0;
|
|
|
|
tp->rfbuf_cnt = 0;
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
switch (tp->t_state) {
|
|
|
|
/*
|
|
|
|
* If the state is SYN_SENT:
|
|
|
|
* if seg contains an ACK, but not for our SYN, drop the input.
|
|
|
|
* if seg contains a RST, then drop the connection.
|
|
|
|
* if seg does not contain SYN, then drop it.
|
|
|
|
* Otherwise this is an acceptable SYN segment
|
|
|
|
* initialize tp->rcv_nxt and tp->irs
|
|
|
|
* if seg contains ack then advance tp->snd_una
|
2006-09-05 04:29:35 +04:00
|
|
|
* if seg contains an ECE and ECN support is enabled, the stream
|
|
|
|
* is ECN capable.
|
1993-03-21 12:45:37 +03:00
|
|
|
* if SYN has been acked change to ESTABLISHED else SYN_RCVD state
|
|
|
|
* arrange for segment to be acked (eventually)
|
|
|
|
* continue processing rest of data/controls, beginning with URG
|
|
|
|
*/
|
|
|
|
case TCPS_SYN_SENT:
|
|
|
|
if ((tiflags & TH_ACK) &&
|
1999-07-01 12:12:45 +04:00
|
|
|
(SEQ_LEQ(th->th_ack, tp->iss) ||
|
|
|
|
SEQ_GT(th->th_ack, tp->snd_max)))
|
1993-03-21 12:45:37 +03:00
|
|
|
goto dropwithreset;
|
|
|
|
if (tiflags & TH_RST) {
|
|
|
|
if (tiflags & TH_ACK)
|
|
|
|
tp = tcp_drop(tp, ECONNREFUSED);
|
|
|
|
goto drop;
|
|
|
|
}
|
|
|
|
if ((tiflags & TH_SYN) == 0)
|
|
|
|
goto drop;
|
|
|
|
if (tiflags & TH_ACK) {
|
2005-01-27 00:49:27 +03:00
|
|
|
tp->snd_una = th->th_ack;
|
1993-03-21 12:45:37 +03:00
|
|
|
if (SEQ_LT(tp->snd_nxt, tp->snd_una))
|
|
|
|
tp->snd_nxt = tp->snd_una;
|
2005-01-27 00:49:27 +03:00
|
|
|
if (SEQ_LT(tp->snd_high, tp->snd_una))
|
|
|
|
tp->snd_high = tp->snd_una;
|
2000-05-05 18:51:46 +04:00
|
|
|
TCP_TIMER_DISARM(tp, TCPT_REXMT);
|
2006-09-05 04:29:35 +04:00
|
|
|
|
|
|
|
if ((tiflags & TH_ECE) && tcp_do_ecn) {
|
|
|
|
tp->t_flags |= TF_ECN_PERMIT;
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_ECN_SHS);
|
2006-09-05 04:29:35 +04:00
|
|
|
}
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
1999-07-01 12:12:45 +04:00
|
|
|
tp->irs = th->th_seq;
|
1993-03-21 12:45:37 +03:00
|
|
|
tcp_rcvseqinit(tp);
|
|
|
|
tp->t_flags |= TF_ACKNOW;
|
1997-09-23 01:49:55 +04:00
|
|
|
tcp_mss_from_peer(tp, opti.maxseg);
|
1998-04-01 02:49:09 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize the initial congestion window. If we
|
|
|
|
* had to retransmit the SYN, we must initialize cwnd
|
1998-07-18 02:58:56 +04:00
|
|
|
* to 1 segment (i.e. the Loss Window).
|
1998-04-01 02:49:09 +04:00
|
|
|
*/
|
1998-07-18 02:58:56 +04:00
|
|
|
if (tp->t_flags & TF_SYN_REXMT)
|
|
|
|
tp->snd_cwnd = tp->t_peermss;
|
2003-03-01 07:40:27 +03:00
|
|
|
else {
|
|
|
|
int ss = tcp_init_win;
|
|
|
|
#ifdef INET
|
|
|
|
if (inp != NULL && in_localaddr(inp->inp_faddr))
|
|
|
|
ss = tcp_init_win_local;
|
|
|
|
#endif
|
|
|
|
#ifdef INET6
|
|
|
|
if (in6p != NULL && in6_localaddr(&in6p->in6p_faddr))
|
|
|
|
ss = tcp_init_win_local;
|
|
|
|
#endif
|
|
|
|
tp->snd_cwnd = TCP_INITIAL_WINDOW(ss, tp->t_peermss);
|
|
|
|
}
|
1998-04-01 02:49:09 +04:00
|
|
|
|
1997-09-23 01:49:55 +04:00
|
|
|
tcp_rmx_rtt(tp);
|
2000-05-05 19:05:29 +04:00
|
|
|
if (tiflags & TH_ACK) {
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_CONNECTS);
|
Reduces the resources demanded by TCP sessions in TIME_WAIT-state using
methods called Vestigial Time-Wait (VTW) and Maximum Segment Lifetime
Truncation (MSLT).
MSLT and VTW were contributed by Coyote Point Systems, Inc.
Even after a TCP session enters the TIME_WAIT state, its corresponding
socket and protocol control blocks (PCBs) stick around until the TCP
Maximum Segment Lifetime (MSL) expires. On a host whose workload
necessarily creates and closes down many TCP sockets, the sockets & PCBs
for TCP sessions in TIME_WAIT state amount to many megabytes of dead
weight in RAM.
Maximum Segment Lifetime Truncation (MSLT) assigns each TCP session to
a class based on the nearness of the peer. Corresponding to each class
is an MSL, and a session uses the MSL of its class. The classes are
loopback (local host equals remote host), local (local host and remote
host are on the same link/subnet), and remote (local host and remote
host communicate via one or more gateways). Classes corresponding to
nearer peers have lower MSLs by default: 2 seconds for loopback, 10
seconds for local, 60 seconds for remote. Loopback and local sessions
expire more quickly when MSLT is used.
Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket
dead weight with a compact representation of the session, called a
"vestigial PCB". VTW data structures are designed to be very fast and
memory-efficient: for fast insertion and lookup of vestigial PCBs,
the PCBs are stored in a hash table that is designed to minimize the
number of cacheline visits per lookup/insertion. The memory both
for vestigial PCBs and for elements of the PCB hashtable comes from
fixed-size pools, and linked data structures exploit this to conserve
memory by representing references with a narrow index/offset from the
start of a pool instead of a pointer. When space for new vestigial PCBs
runs out, VTW makes room by discarding old vestigial PCBs, oldest first.
VTW cooperates with MSLT.
It may help to think of VTW as a "FIN cache" by analogy to the SYN
cache.
A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT
sessions as fast as it can is approximately 17% idle when VTW is active
versus 0% idle when VTW is inactive. It has 103 megabytes more free RAM
when VTW is active (approximately 64k vestigial PCBs are created) than
when it is inactive.
2011-05-03 22:28:44 +04:00
|
|
|
/*
|
|
|
|
* move tcp_established before soisconnected
|
2011-05-17 09:40:24 +04:00
|
|
|
* because upcall handler can drive tcp_output
|
2011-05-03 22:28:44 +04:00
|
|
|
* functionality.
|
|
|
|
* XXX we might call soisconnected at the end of
|
|
|
|
* all processing
|
|
|
|
*/
|
1997-09-23 01:49:55 +04:00
|
|
|
tcp_established(tp);
|
2011-05-03 22:28:44 +04:00
|
|
|
soisconnected(so);
|
1994-05-13 10:02:48 +04:00
|
|
|
/* Do window scaling on this connection? */
|
|
|
|
if ((tp->t_flags & (TF_RCVD_SCALE|TF_REQ_SCALE)) ==
|
2002-10-22 08:24:50 +04:00
|
|
|
(TF_RCVD_SCALE|TF_REQ_SCALE)) {
|
1994-05-13 10:02:48 +04:00
|
|
|
tp->snd_scale = tp->requested_s_scale;
|
|
|
|
tp->rcv_scale = tp->request_r_scale;
|
|
|
|
}
|
1998-12-19 00:38:02 +03:00
|
|
|
TCP_REASS_LOCK(tp);
|
1999-07-01 12:12:45 +04:00
|
|
|
(void) tcp_reass(tp, NULL, (struct mbuf *)0, &tlen);
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
|
|
|
* if we didn't have to retransmit the SYN,
|
|
|
|
* use its rtt as our initial srtt & rtt var.
|
|
|
|
*/
|
2001-09-10 19:23:09 +04:00
|
|
|
if (tp->t_rtttime)
|
|
|
|
tcp_xmit_timer(tp, tcp_now - tp->t_rtttime);
|
1993-03-21 12:45:37 +03:00
|
|
|
} else
|
|
|
|
tp->t_state = TCPS_SYN_RECEIVED;
|
|
|
|
|
|
|
|
/*
|
1999-07-01 12:12:45 +04:00
|
|
|
* Advance th->th_seq to correspond to first data byte.
|
1993-03-21 12:45:37 +03:00
|
|
|
* If data, trim to stay within window,
|
|
|
|
* dropping FIN if necessary.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
th->th_seq++;
|
|
|
|
if (tlen > tp->rcv_wnd) {
|
|
|
|
todrop = tlen - tp->rcv_wnd;
|
1993-03-21 12:45:37 +03:00
|
|
|
m_adj(m, -todrop);
|
1999-07-01 12:12:45 +04:00
|
|
|
tlen = tp->rcv_wnd;
|
1993-03-21 12:45:37 +03:00
|
|
|
tiflags &= ~TH_FIN;
|
2008-04-12 09:58:22 +04:00
|
|
|
tcps = TCP_STAT_GETREF();
|
|
|
|
tcps[TCP_STAT_RCVPACKAFTERWIN]++;
|
|
|
|
tcps[TCP_STAT_RCVBYTEAFTERWIN] += todrop;
|
|
|
|
TCP_STAT_PUTREF();
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
1999-07-01 12:12:45 +04:00
|
|
|
tp->snd_wl1 = th->th_seq - 1;
|
|
|
|
tp->rcv_up = th->th_seq;
|
1993-03-21 12:45:37 +03:00
|
|
|
goto step6;
|
1997-07-24 01:26:40 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the state is SYN_RECEIVED:
|
|
|
|
* If seg contains an ACK, but not for our SYN, drop the input
|
|
|
|
* and generate an RST. See page 36, rfc793
|
|
|
|
*/
|
|
|
|
case TCPS_SYN_RECEIVED:
|
|
|
|
if ((tiflags & TH_ACK) &&
|
1999-07-01 12:12:45 +04:00
|
|
|
(SEQ_LEQ(th->th_ack, tp->iss) ||
|
|
|
|
SEQ_GT(th->th_ack, tp->snd_max)))
|
1997-07-24 01:26:40 +04:00
|
|
|
goto dropwithreset;
|
|
|
|
break;
|
1993-03-21 12:45:37 +03:00
|
|
|
}
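Both the SYN_SENT and SYN_RECEIVED cases above reject an ACK that does not cover our SYN. The sketch below restates that test with hypothetical names (`SKETCH_SEQ_LEQ`, `SKETCH_SEQ_GT`, `sketch_ack_unacceptable`); the macros are assumed to behave like the kernel's `SEQ_LEQ`/`SEQ_GT`, that is, comparisons on the signed 32-bit difference so that they remain correct across sequence number wraparound.

```c
#include <stdint.h>

/* Modular sequence-number comparisons on the signed 32-bit difference. */
#define SKETCH_SEQ_LEQ(a, b) ((int32_t)((uint32_t)(a) - (uint32_t)(b)) <= 0)
#define SKETCH_SEQ_GT(a, b)  ((int32_t)((uint32_t)(a) - (uint32_t)(b)) > 0)

/*
 * Nonzero when ack does not acknowledge our SYN: it is at or before
 * the initial send sequence number, or beyond anything sent so far.
 */
int
sketch_ack_unacceptable(uint32_t ack, uint32_t iss, uint32_t snd_max)
{
	return SKETCH_SEQ_LEQ(ack, iss) || SKETCH_SEQ_GT(ack, snd_max);
}
```

With `iss = 1000` and `snd_max = 1001`, only `ack == 1001` is acceptable; the same holds when `iss` is `0xFFFFFFFF` and the acceptable value wraps to 0.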
|
|
|
|
|
|
|
|
/*
|
|
|
|
* States other than LISTEN or SYN_SENT.
|
1994-05-13 10:02:48 +04:00
|
|
|
* First check timestamp, if present.
|
2002-06-09 20:33:36 +04:00
|
|
|
* Then check that at least some bytes of segment are within
|
1993-03-21 12:45:37 +03:00
|
|
|
* receive window. If segment begins before rcv_nxt,
|
|
|
|
* drop leading data (and SYN); if nothing left, just ack.
|
2002-06-09 20:33:36 +04:00
|
|
|
*
|
1994-05-13 10:02:48 +04:00
|
|
|
* RFC 1323 PAWS: If we have a timestamp reply on this segment
|
|
|
|
* and it's less than ts_recent, drop it.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
1997-07-24 01:26:40 +04:00
|
|
|
if (opti.ts_present && (tiflags & TH_RST) == 0 && tp->ts_recent &&
|
|
|
|
TSTMP_LT(opti.ts_val, tp->ts_recent)) {
|
1994-05-13 10:02:48 +04:00
|
|
|
|
|
|
|
/* Check to see if ts_recent is over 24 days old. */
|
2005-01-27 20:14:04 +03:00
|
|
|
if (tcp_now - tp->ts_recent_age > TCP_PAWS_IDLE) {
|
1994-05-13 10:02:48 +04:00
|
|
|
/*
|
|
|
|
* Invalidate ts_recent. If this segment updates
|
|
|
|
* ts_recent, the age will be reset later and ts_recent
|
|
|
|
* will get a valid value. If it does not, setting
|
|
|
|
* ts_recent to zero will at least satisfy the
|
|
|
|
* requirement that zero be placed in the timestamp
|
|
|
|
* echo reply when ts_recent isn't valid. The
|
|
|
|
* age isn't reset until we get a valid ts_recent
|
|
|
|
* because we don't want out-of-order segments to be
|
|
|
|
* dropped when ts_recent is old.
|
|
|
|
*/
|
|
|
|
tp->ts_recent = 0;
|
|
|
|
} else {
|
2008-04-12 09:58:22 +04:00
|
|
|
tcps = TCP_STAT_GETREF();
|
|
|
|
tcps[TCP_STAT_RCVDUPPACK]++;
|
|
|
|
tcps[TCP_STAT_RCVDUPBYTE] += tlen;
|
|
|
|
tcps[TCP_STAT_PAWSDROP]++;
|
|
|
|
TCP_STAT_PUTREF();
|
Commit TCP SACK patches from Kentaro A. Karahone's patch at:
http://www.sigusr1.org/~kurahone/tcp-sack-netbsd-02152005.diff.gz
Fixes in that patch for pre-existing TCP pcb initializations were already
committed to NetBSD-current, so are not included in this commit.
The SACK patch has been observed to correctly negotiate and respond
to SACKs in wide-area traffic.
There are two independently observed, as-yet-unresolved anomalies:
first, unexplained delays in fast retransmission
(potentially explainable by a 0.2sec RTT between adjacent
ethernet/wifi NICs); and second, peculiar and unexplained TCP
retransmits observed over an ath0 card.
After discussion with several interested developers, I'm committing
this now, as-is, for more eyes to use and look over. Current hypothesis
is that the anomalies above may in fact be due to link-level (hardware,
driver, HAL, firmware) aberrations in the test setup, affecting both
Kentaro's wired-Ethernet NIC and my two (different) WiFi NICs.
2005-02-28 19:20:59 +03:00
|
|
|
tcp_new_dsack(tp, th->th_seq, tlen);
|
1994-05-13 10:02:48 +04:00
|
|
|
goto dropafterack;
|
|
|
|
}
|
|
|
|
}
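The PAWS test above can be sketched in isolation. This is a minimal illustration, not the kernel's code: `tstmp_lt()` is a hypothetical stand-in for `TSTMP_LT`, the `idle_limit` parameter stands in for `TCP_PAWS_IDLE`, and the three-way result only names the branches taken above (accept, invalidate `ts_recent`, or drop after ACK).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for TSTMP_LT(): true if a is older than b,
 * using signed 32-bit arithmetic so the comparison survives wraparound. */
static bool
tstmp_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

enum paws_action { PAWS_ACCEPT, PAWS_INVALIDATE, PAWS_DROP };

/* Mirrors the branch structure of the PAWS check; idle_limit is an
 * assumed parameter expressed in the same clock ticks as now. */
static enum paws_action
paws_check(bool ts_present, bool is_rst, uint32_t ts_val,
    uint32_t ts_recent, uint32_t ts_recent_age, uint32_t now,
    uint32_t idle_limit)
{
	if (!ts_present || is_rst || ts_recent == 0 ||
	    !tstmp_lt(ts_val, ts_recent))
		return PAWS_ACCEPT;       /* PAWS does not apply */
	if (now - ts_recent_age > idle_limit)
		return PAWS_INVALIDATE;   /* ts_recent is stale: zero it, keep segment */
	return PAWS_DROP;                 /* old duplicate: count it and ACK */
}
```

Under these assumptions, a segment whose timestamp value is older than a fresh `ts_recent` is dropped, while the same segment arriving after a long idle period only invalidates `ts_recent`.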
|
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
todrop = tp->rcv_nxt - th->th_seq;
|
2007-02-22 09:16:03 +03:00
|
|
|
dupseg = false;
|
1993-03-21 12:45:37 +03:00
|
|
|
if (todrop > 0) {
|
|
|
|
if (tiflags & TH_SYN) {
|
|
|
|
tiflags &= ~TH_SYN;
|
1999-07-01 12:12:45 +04:00
|
|
|
th->th_seq++;
|
2002-06-09 20:33:36 +04:00
|
|
|
if (th->th_urp > 1)
|
1999-07-01 12:12:45 +04:00
|
|
|
th->th_urp--;
|
1996-09-09 18:51:07 +04:00
|
|
|
else {
|
1993-03-21 12:45:37 +03:00
|
|
|
tiflags &= ~TH_URG;
|
1999-07-01 12:12:45 +04:00
|
|
|
th->th_urp = 0;
|
1996-09-09 18:51:07 +04:00
|
|
|
}
|
1993-03-21 12:45:37 +03:00
|
|
|
todrop--;
|
|
|
|
}
|
1999-07-01 12:12:45 +04:00
|
|
|
if (todrop > tlen ||
|
|
|
|
(todrop == tlen && (tiflags & TH_FIN) == 0)) {
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
2004-04-18 03:35:37 +04:00
|
|
|
* Any valid FIN or RST must be to the left of the
|
|
|
|
* window. At this point the FIN or RST must be a
|
|
|
|
* duplicate or out of sequence; drop it.
|
1994-04-12 22:07:46 +04:00
|
|
|
*/
|
2004-04-18 03:35:37 +04:00
|
|
|
if (tiflags & TH_RST)
|
|
|
|
goto drop;
|
|
|
|
tiflags &= ~(TH_FIN|TH_RST);
|
1994-04-12 22:07:46 +04:00
|
|
|
/*
|
1998-01-24 08:04:27 +03:00
|
|
|
* Send an ACK to resynchronize and drop any data.
|
|
|
|
* But keep on processing for RST or ACK.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
1994-04-12 22:07:46 +04:00
|
|
|
tp->t_flags |= TF_ACKNOW;
|
1999-07-01 12:12:45 +04:00
|
|
|
todrop = tlen;
|
2007-02-22 09:16:03 +03:00
|
|
|
dupseg = true;
|
2008-04-12 09:58:22 +04:00
|
|
|
tcps = TCP_STAT_GETREF();
|
|
|
|
tcps[TCP_STAT_RCVDUPPACK]++;
|
|
|
|
tcps[TCP_STAT_RCVDUPBYTE] += todrop;
|
|
|
|
TCP_STAT_PUTREF();
|
2004-04-27 18:46:07 +04:00
|
|
|
} else if ((tiflags & TH_RST) &&
|
2009-06-20 21:29:31 +04:00
|
|
|
th->th_seq != tp->rcv_nxt) {
|
2004-04-27 18:46:07 +04:00
|
|
|
/*
|
|
|
|
* Test for reset before adjusting the sequence
|
|
|
|
* number for overlapping data.
|
|
|
|
*/
|
|
|
|
goto dropafterack_ratelim;
|
1993-03-21 12:45:37 +03:00
|
|
|
} else {
|
2008-04-12 09:58:22 +04:00
|
|
|
tcps = TCP_STAT_GETREF();
|
|
|
|
tcps[TCP_STAT_RCVPARTDUPPACK]++;
|
|
|
|
tcps[TCP_STAT_RCVPARTDUPBYTE] += todrop;
|
|
|
|
TCP_STAT_PUTREF();
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
2005-02-28 19:20:59 +03:00
|
|
|
tcp_new_dsack(tp, th->th_seq, todrop);
|
1999-12-08 19:22:20 +03:00
|
|
|
hdroptlen += todrop; /*drop from head afterwards*/
|
1999-07-01 12:12:45 +04:00
|
|
|
th->th_seq += todrop;
|
|
|
|
tlen -= todrop;
|
|
|
|
if (th->th_urp > todrop)
|
|
|
|
th->th_urp -= todrop;
|
1993-03-21 12:45:37 +03:00
|
|
|
else {
|
|
|
|
tiflags &= ~TH_URG;
|
1999-07-01 12:12:45 +04:00
|
|
|
th->th_urp = 0;
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
|
|
|
}
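The leading-edge trimming above can be illustrated with a simplified model. This sketch assumes a hypothetical `struct seg` and deliberately omits the urgent-pointer adjustment, the RST special cases, the statistics counters, and the D-SACK bookkeeping; it only shows how `todrop` is derived and applied.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of an incoming segment. */
struct seg {
	uint32_t seq;   /* sequence number of first byte */
	int32_t  len;   /* payload length */
	bool     syn;
	bool     fin;
};

/* Drop data that lies before rcv_nxt. Returns the number of payload
 * bytes trimmed and sets *dup when the whole segment was a duplicate. */
static int32_t
trim_head(struct seg *s, uint32_t rcv_nxt, bool *dup)
{
	int32_t todrop = (int32_t)(rcv_nxt - s->seq);

	*dup = false;
	if (todrop <= 0)
		return 0;               /* nothing precedes rcv_nxt */
	if (s->syn) {                   /* a duplicate SYN occupies one sequence number */
		s->syn = false;
		s->seq++;
		todrop--;
	}
	if (todrop > s->len || (todrop == s->len && !s->fin)) {
		s->fin = false;         /* any FIN here is a duplicate as well */
		todrop = s->len;
		*dup = true;
	}
	s->seq += (uint32_t)todrop;
	s->len -= todrop;
	return todrop;
}
```

Note the asymmetry in the duplicate test: a segment whose data is entirely old but whose FIN sits exactly at `rcv_nxt` is not treated as a complete duplicate, so the FIN is still processed.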
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If new data are received on a connection after the
|
|
|
|
* user processes are gone, then RST the other end.
|
|
|
|
*/
|
|
|
|
if ((so->so_state & SS_NOFDREF) &&
|
1999-07-01 12:12:45 +04:00
|
|
|
tp->t_state > TCPS_CLOSE_WAIT && tlen) {
|
1993-03-21 12:45:37 +03:00
|
|
|
tp = tcp_close(tp);
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_RCVAFTERCLOSE);
|
1993-03-21 12:45:37 +03:00
|
|
|
goto dropwithreset;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If segment ends after window, drop trailing data
|
|
|
|
* (and PUSH and FIN); if nothing left, just ACK.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
todrop = (th->th_seq + tlen) - (tp->rcv_nxt+tp->rcv_wnd);
|
1993-03-21 12:45:37 +03:00
|
|
|
if (todrop > 0) {
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_RCVPACKAFTERWIN);
|
1999-07-01 12:12:45 +04:00
|
|
|
if (todrop >= tlen) {
|
2004-04-18 03:35:37 +04:00
|
|
|
/*
|
|
|
|
* The segment actually starts after the window.
|
|
|
|
* th->th_seq + tlen - tp->rcv_nxt - tp->rcv_wnd >= tlen
|
|
|
|
* th->th_seq - tp->rcv_nxt - tp->rcv_wnd >= 0
|
|
|
|
* th->th_seq >= tp->rcv_nxt + tp->rcv_wnd
|
|
|
|
*/
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATADD(TCP_STAT_RCVBYTEAFTERWIN, tlen);
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
|
|
|
* If a new connection request is received
|
|
|
|
* while in TIME_WAIT, drop the old connection
|
|
|
|
* and start over if the sequence numbers
|
|
|
|
* are above the previous ones.
|
2002-08-28 06:23:57 +04:00
|
|
|
*
|
|
|
|
* NOTE: We will checksum the packet again, and
|
|
|
|
* so we need to put the header fields back into
|
|
|
|
* network order!
|
|
|
|
* XXX This kind of sucks, but we don't expect
|
|
|
|
* XXX this to happen very often, so maybe it
|
|
|
|
* XXX doesn't matter so much.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
|
|
|
if (tiflags & TH_SYN &&
|
|
|
|
tp->t_state == TCPS_TIME_WAIT &&
|
1999-07-01 12:12:45 +04:00
|
|
|
SEQ_GT(th->th_seq, tp->rcv_nxt)) {
|
1993-03-21 12:45:37 +03:00
|
|
|
tp = tcp_close(tp);
|
2008-02-20 14:44:07 +03:00
|
|
|
tcp_fields_to_net(th);
|
1993-03-21 12:45:37 +03:00
|
|
|
goto findpcb;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* If window is closed can only take segments at
|
|
|
|
* window edge, and have to drop data and PUSH from
|
|
|
|
* incoming segments. Continue processing, but
|
|
|
|
* remember to ack. Otherwise, drop segment
|
2004-04-18 03:35:37 +04:00
|
|
|
* and (if not RST) ack.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if (tp->rcv_wnd == 0 && th->th_seq == tp->rcv_nxt) {
|
1993-03-21 12:45:37 +03:00
|
|
|
tp->t_flags |= TF_ACKNOW;
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_RCVWINPROBE);
|
1993-03-21 12:45:37 +03:00
|
|
|
} else
|
|
|
|
goto dropafterack;
|
|
|
|
} else
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATADD(TCP_STAT_RCVBYTEAFTERWIN, todrop);
|
1993-03-21 12:45:37 +03:00
|
|
|
m_adj(m, -todrop);
|
1999-07-01 12:12:45 +04:00
|
|
|
tlen -= todrop;
|
1993-03-21 12:45:37 +03:00
|
|
|
tiflags &= ~(TH_PUSH|TH_FIN);
|
|
|
|
}
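The trailing-edge computation above can be checked with a small stand-alone function. This is a sketch under simplifying assumptions: `bytes_kept()` is a hypothetical helper that returns only how much payload fits, and it ignores the TIME_WAIT restart and zero-window probe cases handled above.

```c
#include <stdint.h>

/* Returns how many payload bytes of a segment fit inside the receive
 * window [rcv_nxt, rcv_nxt + rcv_wnd). Unsigned subtraction followed by
 * a signed cast keeps the result correct across sequence wraparound. */
static int32_t
bytes_kept(uint32_t seq, int32_t len, uint32_t rcv_nxt, uint32_t rcv_wnd)
{
	int32_t todrop =
	    (int32_t)((seq + (uint32_t)len) - (rcv_nxt + rcv_wnd));

	if (todrop <= 0)
		return len;         /* segment ends inside the window */
	if (todrop >= len)
		return 0;           /* segment starts at or past the right edge */
	return len - todrop;        /* trailing bytes are trimmed */
}
```

For example, a 100-byte segment starting at `rcv_nxt` with a 60-byte window keeps 60 bytes, which corresponds to `m_adj(m, -todrop)` with `todrop == 40`.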
|
|
|
|
|
1994-05-13 10:02:48 +04:00
|
|
|
/*
|
|
|
|
* If last ACK falls within this segment's sequence numbers,
|
2008-02-05 02:56:14 +03:00
|
|
|
* record the timestamp.
|
|
|
|
* NOTE:
|
|
|
|
* 1) That the test incorporates suggestions from the latest
|
|
|
|
* proposal of the tcplw@cray.com list (Braden 1993/04/26).
|
|
|
|
* 2) That updating only on newer timestamps interferes with
|
|
|
|
* our earlier PAWS tests, so this check should be solely
|
|
|
|
* predicated on the sequence space of this segment.
|
|
|
|
* 3) That we modify the segment boundary check to be
|
|
|
|
* Last.ACK.Sent <= SEG.SEQ + SEG.Len
|
|
|
|
* instead of RFC1323's
|
|
|
|
* Last.ACK.Sent < SEG.SEQ + SEG.Len,
|
|
|
|
* This modified check allows us to overcome RFC1323's
|
|
|
|
* limitations as described in Stevens TCP/IP Illustrated
|
|
|
|
* Vol. 2 p.869. In such cases, we can still calculate the
|
|
|
|
* RTT correctly when RCV.NXT == Last.ACK.Sent.
|
1994-05-13 10:02:48 +04:00
|
|
|
*/
|
2008-02-05 02:56:14 +03:00
|
|
|
if (opti.ts_present &&
|
1999-07-01 12:12:45 +04:00
|
|
|
SEQ_LEQ(th->th_seq, tp->last_ack_sent) &&
|
2008-02-05 02:56:14 +03:00
|
|
|
SEQ_LEQ(tp->last_ack_sent, th->th_seq + tlen +
|
|
|
|
((tiflags & (TH_SYN|TH_FIN)) != 0))) {
|
2005-01-27 20:14:04 +03:00
|
|
|
tp->ts_recent_age = tcp_now;
|
1997-07-24 01:26:40 +04:00
|
|
|
tp->ts_recent = opti.ts_val;
|
1994-05-13 10:02:48 +04:00
|
|
|
}
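The modified boundary check described in the comment above can be written as a pure predicate. This is an illustrative sketch: `seq_leq()` is a hypothetical stand-in for `SEQ_LEQ`, and `should_record_ts()` is a made-up name for the condition guarding the `ts_recent` update.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for SEQ_LEQ(): wraparound-safe a <= b. */
static bool
seq_leq(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) <= 0;
}

/* True when Last.ACK.Sent falls within the segment, using the relaxed
 * test SEG.SEQ <= Last.ACK.Sent <= SEG.SEQ + SEG.LEN (+1 for SYN/FIN). */
static bool
should_record_ts(uint32_t seg_seq, uint32_t seg_len, bool syn_or_fin,
    uint32_t last_ack_sent)
{
	return seq_leq(seg_seq, last_ack_sent) &&
	    seq_leq(last_ack_sent, seg_seq + seg_len + (syn_or_fin ? 1u : 0u));
}
```

With the `<=` form, a pure ACK (length 0) whose sequence number equals `last_ack_sent` still updates the timestamp, which the strict `<` form would reject.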
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
|
|
|
* If the RST bit is set examine the state:
|
|
|
|
* SYN_RECEIVED STATE:
|
|
|
|
* If passive open, return to LISTEN state.
|
|
|
|
* If active open, inform user that connection was refused.
|
|
|
|
* ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSE_WAIT STATES:
|
|
|
|
* Inform user that connection was reset, and close tcb.
|
|
|
|
* CLOSING, LAST_ACK, TIME_WAIT STATES
|
|
|
|
* Close the tcb.
|
|
|
|
*/
|
2004-04-20 20:52:12 +04:00
|
|
|
if (tiflags & TH_RST) {
|
2009-06-20 21:29:31 +04:00
|
|
|
if (th->th_seq != tp->rcv_nxt)
|
2004-04-20 20:52:12 +04:00
|
|
|
goto dropafterack_ratelim;
|
1993-03-21 12:45:37 +03:00
|
|
|
|
2004-04-20 20:52:12 +04:00
|
|
|
switch (tp->t_state) {
|
|
|
|
case TCPS_SYN_RECEIVED:
|
|
|
|
so->so_error = ECONNREFUSED;
|
|
|
|
goto close;
|
1993-03-21 12:45:37 +03:00
|
|
|
|
2004-04-20 20:52:12 +04:00
|
|
|
case TCPS_ESTABLISHED:
|
|
|
|
case TCPS_FIN_WAIT_1:
|
|
|
|
case TCPS_FIN_WAIT_2:
|
|
|
|
case TCPS_CLOSE_WAIT:
|
|
|
|
so->so_error = ECONNRESET;
|
|
|
|
close:
|
|
|
|
tp->t_state = TCPS_CLOSED;
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_DROPS);
|
2004-04-20 20:52:12 +04:00
|
|
|
tp = tcp_close(tp);
|
|
|
|
goto drop;
|
1993-03-21 12:45:37 +03:00
|
|
|
|
2004-04-20 20:52:12 +04:00
|
|
|
case TCPS_CLOSING:
|
|
|
|
case TCPS_LAST_ACK:
|
|
|
|
case TCPS_TIME_WAIT:
|
|
|
|
tp = tcp_close(tp);
|
|
|
|
goto drop;
|
|
|
|
}
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2004-04-18 03:35:37 +04:00
|
|
|
* Since we've covered the SYN-SENT and SYN-RECEIVED states above
|
|
|
|
* we must be in a synchronized state. RFC 793 states (under RST
|
|
|
|
* generation) that any unacceptable segment (an out-of-order SYN
|
|
|
|
* qualifies) received in a synchronized state must elicit only an
|
|
|
|
* empty acknowledgment segment ... and the connection remains in
|
|
|
|
* the same state.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
2004-04-20 23:49:15 +04:00
|
|
|
if (tiflags & TH_SYN) {
|
|
|
|
if (tp->rcv_nxt == th->th_seq) {
|
|
|
|
tcp_respond(tp, m, m, th, (tcp_seq)0, th->th_ack - 1,
|
|
|
|
TH_ACK);
|
|
|
|
if (tcp_saveti)
|
|
|
|
m_freem(tcp_saveti);
|
|
|
|
return;
|
|
|
|
}
|
2005-02-27 01:45:09 +03:00
|
|
|
|
2004-04-20 20:52:12 +04:00
|
|
|
goto dropafterack_ratelim;
|
2004-04-20 23:49:15 +04:00
|
|
|
}
|
1993-03-21 12:45:37 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the ACK bit is off we drop the segment and return.
|
|
|
|
*/
|
1999-04-10 02:01:07 +04:00
|
|
|
if ((tiflags & TH_ACK) == 0) {
|
|
|
|
if (tp->t_flags & TF_ACKNOW)
|
|
|
|
goto dropafterack;
|
|
|
|
else
|
|
|
|
goto drop;
|
|
|
|
}
|
2002-06-09 20:33:36 +04:00
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
|
|
|
* Ack processing.
|
|
|
|
*/
|
|
|
|
switch (tp->t_state) {
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In SYN_RECEIVED state if the ack ACKs our SYN then enter
|
|
|
|
* ESTABLISHED state and continue processing, otherwise
|
|
|
|
* send an RST.
|
|
|
|
*/
|
|
|
|
case TCPS_SYN_RECEIVED:
|
1999-07-01 12:12:45 +04:00
|
|
|
if (SEQ_GT(tp->snd_una, th->th_ack) ||
|
|
|
|
SEQ_GT(th->th_ack, tp->snd_max))
|
1993-03-21 12:45:37 +03:00
|
|
|
goto dropwithreset;
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_CONNECTS);
|
1993-03-21 12:45:37 +03:00
|
|
|
soisconnected(so);
|
1997-09-23 01:49:55 +04:00
|
|
|
tcp_established(tp);
|
1994-05-13 10:02:48 +04:00
|
|
|
/* Do window scaling? */
|
|
|
|
if ((tp->t_flags & (TF_RCVD_SCALE|TF_REQ_SCALE)) ==
|
2002-10-22 08:24:50 +04:00
|
|
|
(TF_RCVD_SCALE|TF_REQ_SCALE)) {
|
1994-05-13 10:02:48 +04:00
|
|
|
tp->snd_scale = tp->requested_s_scale;
|
|
|
|
tp->rcv_scale = tp->request_r_scale;
|
|
|
|
}
|
1998-12-19 00:38:02 +03:00
|
|
|
TCP_REASS_LOCK(tp);
|
1999-07-01 12:12:45 +04:00
|
|
|
(void) tcp_reass(tp, NULL, (struct mbuf *)0, &tlen);
|
|
|
|
tp->snd_wl1 = th->th_seq - 1;
|
1993-03-21 12:45:37 +03:00
|
|
|
/* fall into ... */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In ESTABLISHED state: drop duplicate ACKs; ACK out of range
|
|
|
|
* ACKs. If the ack is in the range
|
1999-07-01 12:12:45 +04:00
|
|
|
* tp->snd_una < th->th_ack <= tp->snd_max
|
|
|
|
* then advance tp->snd_una to th->th_ack and drop
|
1993-03-21 12:45:37 +03:00
|
|
|
* data from the retransmission queue. If this ACK reflects
|
|
|
|
* more up to date window information we update our window information.
|
|
|
|
*/
|
|
|
|
case TCPS_ESTABLISHED:
|
|
|
|
case TCPS_FIN_WAIT_1:
|
|
|
|
case TCPS_FIN_WAIT_2:
|
|
|
|
case TCPS_CLOSE_WAIT:
|
|
|
|
case TCPS_CLOSING:
|
|
|
|
case TCPS_LAST_ACK:
|
|
|
|
case TCPS_TIME_WAIT:
|
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
if (SEQ_LEQ(th->th_ack, tp->snd_una)) {
|
2005-01-28 03:18:22 +03:00
|
|
|
if (tlen == 0 && !dupseg && tiwin == tp->snd_wnd) {
|
2011-03-09 03:44:23 +03:00
|
|
|
TCP_STATINC(TCP_STAT_RCVDUPACK);
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
|
|
|
* If we have outstanding data (other than
|
|
|
|
* a window probe), this is a completely
|
|
|
|
* duplicate ack (ie, window info didn't
|
|
|
|
* change), the ack is the biggest we've
|
|
|
|
* seen and we've seen exactly our rexmt
|
|
|
|
* threshold of them, assume a packet
|
|
|
|
* has been dropped and retransmit it.
|
|
|
|
* Kludge snd_nxt & the congestion
|
|
|
|
* window so we send only this one
|
|
|
|
* packet.
|
|
|
|
*/
|
1998-05-06 05:21:20 +04:00
|
|
|
if (TCP_TIMER_ISARMED(tp, TCPT_REXMT) == 0 ||
|
1999-07-01 12:12:45 +04:00
|
|
|
th->th_ack != tp->snd_una)
|
1993-03-21 12:45:37 +03:00
|
|
|
tp->t_dupacks = 0;
|
2005-02-28 19:20:59 +03:00
|
|
|
else if (tp->t_partialacks < 0 &&
|
2008-01-29 15:34:47 +03:00
|
|
|
(++tp->t_dupacks == tcprexmtthresh ||
|
2005-02-28 19:20:59 +03:00
|
|
|
TCP_FACK_FASTRECOV(tp))) {
|
2006-10-09 20:27:07 +04:00
|
|
|
/*
|
|
|
|
* Do the fast retransmit, and adjust
|
|
|
|
* congestion control parameters.
|
|
|
|
*/
|
|
|
|
if (tp->t_congctl->fast_retransmit(tp, th)) {
|
|
|
|
/* False fast retransmit */
|
2005-01-27 00:49:27 +03:00
|
|
|
break;
|
2006-10-09 20:27:07 +04:00
|
|
|
} else
|
2005-02-28 19:20:59 +03:00
|
|
|
goto drop;
|
1993-03-21 12:45:37 +03:00
|
|
|
} else if (tp->t_dupacks > tcprexmtthresh) {
|
1997-11-08 05:35:22 +03:00
|
|
|
tp->snd_cwnd += tp->t_segsz;
|
2010-04-01 04:24:41 +04:00
|
|
|
KERNEL_LOCK(1, NULL);
|
1993-03-21 12:45:37 +03:00
|
|
|
(void) tcp_output(tp);
|
2010-04-01 04:24:41 +04:00
|
|
|
KERNEL_UNLOCK_ONE(NULL);
|
1993-03-21 12:45:37 +03:00
|
|
|
goto drop;
|
|
|
|
}
|
2005-01-28 03:18:22 +03:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* If the ack appears to be very old, only
|
|
|
|
* allow data that is in-sequence. This
|
|
|
|
* makes it somewhat more difficult to insert
|
|
|
|
* forged data by guessing sequence numbers.
|
|
|
|
* Send an ack to try to update the send
|
|
|
|
* sequence number on the other side.
|
|
|
|
*/
|
|
|
|
if (tlen && th->th_seq != tp->rcv_nxt &&
|
2004-04-20 20:52:12 +04:00
|
|
|
SEQ_LT(th->th_ack,
|
2005-01-28 03:18:22 +03:00
|
|
|
tp->snd_una - tp->max_sndwnd))
|
|
|
|
goto dropafterack;
|
|
|
|
}
|
1993-03-21 12:45:37 +03:00
|
|
|
break;
|
|
|
|
}
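The duplicate-ACK counting above can be modelled with a small state machine. This sketch is a simplification under stated assumptions: it ignores `t_partialacks`, the FACK test, and the pluggable `t_congctl` hooks; `REXMT_THRESH` is an illustrative constant standing in for `tcprexmtthresh`; and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define REXMT_THRESH 3  /* assumed value standing in for tcprexmtthresh */

enum dupack_action { DUPACK_NONE, DUPACK_FAST_RETRANSMIT, DUPACK_INFLATE };

struct dupack_state {
	int      dupacks;   /* consecutive duplicate ACKs seen */
	uint32_t cwnd;      /* congestion window, bytes */
	uint32_t segsz;     /* segment size, bytes */
};

/* Process one ACK that carries no data and no window change. */
static enum dupack_action
on_duplicate_ack(struct dupack_state *st, bool rexmt_armed,
    bool ack_is_snd_una)
{
	if (!rexmt_armed || !ack_is_snd_una) {
		st->dupacks = 0;            /* no outstanding data, or not the newest ACK */
		return DUPACK_NONE;
	}
	if (++st->dupacks == REXMT_THRESH)
		return DUPACK_FAST_RETRANSMIT;
	if (st->dupacks > REXMT_THRESH) {
		st->cwnd += st->segsz;      /* each extra dup ACK means a packet left the network */
		return DUPACK_INFLATE;
	}
	return DUPACK_NONE;
}
```

In this model the retransmit fires exactly once, on the threshold-th duplicate; later duplicates only inflate the window until a new ACK arrives.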
|
|
|
|
/*
|
|
|
|
* If the congestion window was inflated to account
|
|
|
|
* for the other side's cached packets, retract it.
|
|
|
|
*/
|
2006-10-09 20:27:07 +04:00
|
|
|
/* XXX: make SACK have its own congestion control
|
|
|
|
* struct -- rpaulo */
|
2005-02-28 19:20:59 +03:00
|
|
|
if (TCP_SACK_ENABLED(tp))
|
|
|
|
tcp_sack_newack(tp, th);
|
|
|
|
else
|
2006-10-09 20:27:07 +04:00
|
|
|
tp->t_congctl->fast_retransmit_newack(tp, th);
|
1999-07-01 12:12:45 +04:00
|
|
|
if (SEQ_GT(th->th_ack, tp->snd_max)) {
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_RCVACKTOOMUCH);
|
1993-03-21 12:45:37 +03:00
|
|
|
goto dropafterack;
|
|
|
|
}
|
1999-07-01 12:12:45 +04:00
|
|
|
acked = th->th_ack - tp->snd_una;
|
2008-04-12 09:58:22 +04:00
|
|
|
tcps = TCP_STAT_GETREF();
|
|
|
|
tcps[TCP_STAT_RCVACKPACK]++;
|
|
|
|
tcps[TCP_STAT_RCVACKBYTE] += acked;
|
|
|
|
TCP_STAT_PUTREF();
|
1993-03-21 12:45:37 +03:00
|
|
|
|
|
|
|
/*
|
1994-05-13 10:02:48 +04:00
|
|
|
* If we have a timestamp reply, update smoothed
|
|
|
|
* round trip time. If no timestamp is present but
|
|
|
|
* transmit timer is running and timed sequence
|
1993-03-21 12:45:37 +03:00
|
|
|
* number was acked, update smoothed round trip time.
|
|
|
|
* Since we now have an rtt measurement, cancel the
|
|
|
|
* timer backoff (cf., Phil Karn's retransmit alg.).
|
|
|
|
* Recompute the initial retransmit timer.
|
|
|
|
*/
|
2005-06-06 16:10:09 +04:00
|
|
|
if (ts_rtt)
|
2011-05-26 03:20:57 +04:00
|
|
|
tcp_xmit_timer(tp, ts_rtt - 1);
|
2001-09-10 19:23:09 +04:00
|
|
|
else if (tp->t_rtttime && SEQ_GT(th->th_ack, tp->t_rtseq))
|
|
|
|
tcp_xmit_timer(tp, tcp_now - tp->t_rtttime);
|
1993-03-21 12:45:37 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If all outstanding data is acked, stop retransmit
|
|
|
|
* timer and remember to restart (more output or persist).
|
|
|
|
* If there is more data to be acked, restart retransmit
|
|
|
|
* timer, using current (possibly backed-off) value.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if (th->th_ack == tp->snd_max) {
|
1998-05-06 05:21:20 +04:00
|
|
|
TCP_TIMER_DISARM(tp, TCPT_REXMT);
|
1993-03-21 12:45:37 +03:00
|
|
|
needoutput = 1;
|
1998-05-06 05:21:20 +04:00
|
|
|
} else if (TCP_TIMER_ISARMED(tp, TCPT_PERSIST) == 0)
|
|
|
|
TCP_TIMER_ARM(tp, TCPT_REXMT, tp->t_rxtcur);
|
2006-10-09 20:27:07 +04:00
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
2006-10-09 20:27:07 +04:00
|
|
|
* New data has been acked, adjust the congestion window.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
2006-10-10 15:13:02 +04:00
|
|
|
tp->t_congctl->newack(tp, th);
|
1993-03-21 12:45:37 +03:00
|
|
|
|
2007-12-20 22:53:29 +03:00
|
|
|
nd6_hint(tp);
|
1993-03-21 12:45:37 +03:00
|
|
|
if (acked > so->so_snd.sb_cc) {
|
|
|
|
tp->snd_wnd -= so->so_snd.sb_cc;
|
|
|
|
sbdrop(&so->so_snd, (int)so->so_snd.sb_cc);
|
|
|
|
ourfinisacked = 1;
|
|
|
|
} else {
|
2003-10-24 14:25:40 +04:00
|
|
|
if (acked > (tp->t_lastoff - tp->t_inoff))
|
|
|
|
tp->t_lastm = NULL;
|
1993-03-21 12:45:37 +03:00
|
|
|
sbdrop(&so->so_snd, acked);
|
2003-10-24 14:25:40 +04:00
|
|
|
tp->t_lastoff -= acked;
|
2004-04-14 22:07:52 +04:00
|
|
|
tp->snd_wnd -= acked;
|
1993-03-21 12:45:37 +03:00
|
|
|
ourfinisacked = 0;
|
|
|
|
}
|
1998-04-30 00:43:29 +04:00
|
|
|
sowwakeup(so);
|
2005-07-19 21:00:02 +04:00
|
|
|
|
2008-02-20 14:44:07 +03:00
|
|
|
icmp_check(tp, th, acked);
|
2005-07-19 21:00:02 +04:00
|
|
|
|
2005-01-27 00:49:27 +03:00
|
|
|
tp->snd_una = th->th_ack;
|
2005-02-28 19:20:59 +03:00
|
|
|
if (SEQ_GT(tp->snd_una, tp->snd_fack))
|
|
|
|
tp->snd_fack = tp->snd_una;
|
1993-03-21 12:45:37 +03:00
|
|
|
if (SEQ_LT(tp->snd_nxt, tp->snd_una))
|
|
|
|
tp->snd_nxt = tp->snd_una;
|
2005-01-27 00:49:27 +03:00
|
|
|
if (SEQ_LT(tp->snd_high, tp->snd_una))
|
|
|
|
tp->snd_high = tp->snd_una;
|
1993-03-21 12:45:37 +03:00
|
|
|
|
|
|
|
switch (tp->t_state) {
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In FIN_WAIT_1 STATE in addition to the processing
|
|
|
|
* for the ESTABLISHED state if our FIN is now acknowledged
|
|
|
|
* then enter FIN_WAIT_2.
|
|
|
|
*/
|
|
|
|
case TCPS_FIN_WAIT_1:
|
|
|
|
if (ourfinisacked) {
|
|
|
|
/*
|
|
|
|
* If we can't receive any more
|
|
|
|
* data, then closing user can proceed.
|
|
|
|
* Starting the timer is contrary to the
|
|
|
|
* specification, but if we don't get a FIN
|
|
|
|
* we'll hang forever.
|
|
|
|
*/
|
|
|
|
if (so->so_state & SS_CANTRCVMORE) {
|
|
|
|
soisdisconnected(so);
|
2007-06-20 19:29:17 +04:00
|
|
|
if (tp->t_maxidle > 0)
|
1998-09-10 14:46:03 +04:00
|
|
|
TCP_TIMER_ARM(tp, TCPT_2MSL,
|
2007-06-20 19:29:17 +04:00
|
|
|
tp->t_maxidle);
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
|
|
|
tp->t_state = TCPS_FIN_WAIT_2;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In CLOSING STATE in addition to the processing for
|
|
|
|
* the ESTABLISHED state if the ACK acknowledges our FIN
|
|
|
|
* then enter the TIME-WAIT state, otherwise ignore
|
|
|
|
* the segment.
|
|
|
|
*/
|
|
|
|
case TCPS_CLOSING:
|
|
|
|
if (ourfinisacked) {
|
|
|
|
tp->t_state = TCPS_TIME_WAIT;
|
|
|
|
tcp_canceltimers(tp);
|
Reduces the resources demanded by TCP sessions in TIME_WAIT-state using
methods called Vestigial Time-Wait (VTW) and Maximum Segment Lifetime
Truncation (MSLT).
MSLT and VTW were contributed by Coyote Point Systems, Inc.
Even after a TCP session enters the TIME_WAIT state, its corresponding
socket and protocol control blocks (PCBs) stick around until the TCP
Maximum Segment Lifetime (MSL) expires. On a host whose workload
necessarily creates and closes down many TCP sockets, the sockets & PCBs
for TCP sessions in TIME_WAIT state amount to many megabytes of dead
weight in RAM.
Maximum Segment Lifetime Truncation (MSLT) assigns each TCP session to
a class based on the nearness of the peer. Corresponding to each class
is an MSL, and a session uses the MSL of its class. The classes are
loopback (local host equals remote host), local (local host and remote
host are on the same link/subnet), and remote (local host and remote
host communicate via one or more gateways). Classes corresponding to
nearer peers have lower MSLs by default: 2 seconds for loopback, 10
seconds for local, 60 seconds for remote. Loopback and local sessions
expire more quickly when MSLT is used.
Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket
dead weight with a compact representation of the session, called a
"vestigial PCB". VTW data structures are designed to be very fast and
memory-efficient: for fast insertion and lookup of vestigial PCBs,
the PCBs are stored in a hash table that is designed to minimize the
number of cacheline visits per lookup/insertion. The memory both
for vestigial PCBs and for elements of the PCB hashtable comes from
fixed-size pools, and linked data structures exploit this to conserve
memory by representing references with a narrow index/offset from the
start of a pool instead of a pointer. When space for new vestigial PCBs
runs out, VTW makes room by discarding old vestigial PCBs, oldest first.
VTW cooperates with MSLT.
It may help to think of VTW as a "FIN cache" by analogy to the SYN
cache.
A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT
sessions as fast as it can is approximately 17% idle when VTW is active
versus 0% idle when VTW is inactive. It has 103 megabytes more free RAM
when VTW is active (approximately 64k vestigial PCBs are created) than
when it is inactive.
2011-05-03 22:28:44 +04:00
|
|
|
TCP_TIMER_ARM(tp, TCPT_2MSL, 2 * tp->t_msl);
|
1993-03-21 12:45:37 +03:00
|
|
|
soisdisconnected(so);
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In LAST_ACK, we may still be waiting for data to drain
|
|
|
|
* and/or to be acked, as well as for the ack of our FIN.
|
|
|
|
* If our FIN is now acknowledged, delete the TCB,
|
|
|
|
* enter the closed state and return.
|
|
|
|
*/
|
|
|
|
case TCPS_LAST_ACK:
|
|
|
|
if (ourfinisacked) {
|
|
|
|
tp = tcp_close(tp);
|
|
|
|
goto drop;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In TIME_WAIT state the only thing that should arrive
|
|
|
|
* is a retransmission of the remote FIN. Acknowledge
|
|
|
|
* it and restart the finack timer.
|
|
|
|
*/
|
|
|
|
case TCPS_TIME_WAIT:
|
Reduces the resources demanded by TCP sessions in TIME_WAIT-state using
methods called Vestigial Time-Wait (VTW) and Maximum Segment Lifetime
Truncation (MSLT).
MSLT and VTW were contributed by Coyote Point Systems, Inc.
Even after a TCP session enters the TIME_WAIT state, its corresponding
socket and protocol control blocks (PCBs) stick around until the TCP
Maximum Segment Lifetime (MSL) expires. On a host whose workload
necessarily creates and closes down many TCP sockets, the sockets & PCBs
for TCP sessions in TIME_WAIT state amount to many megabytes of dead
weight in RAM.
Maximum Segment Lifetimes Truncation (MSLT) assigns each TCP session to
a class based on the nearness of the peer. Corresponding to each class
is an MSL, and a session uses the MSL of its class. The classes are
loopback (local host equals remote host), local (local host and remote
host are on the same link/subnet), and remote (local host and remote
host communicate via one or more gateways). Classes corresponding to
nearer peers have lower MSLs by default: 2 seconds for loopback, 10
seconds for local, 60 seconds for remote. Loopback and local sessions
expire more quickly when MSLT is used.
Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket
dead weight with a compact representation of the session, called a
"vestigial PCB". VTW data structures are designed to be very fast and
memory-efficient: for fast insertion and lookup of vestigial PCBs,
the PCBs are stored in a hash table that is designed to minimize the
number of cacheline visits per lookup/insertion. The memory both
for vestigial PCBs and for elements of the PCB hashtable come from
fixed-size pools, and linked data structures exploit this to conserve
memory by representing references with a narrow index/offset from the
start of a pool instead of a pointer. When space for new vestigial PCBs
runs out, VTW makes room by discarding old vestigial PCBs, oldest first.
VTW cooperates with MSLT.
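The two ideas in the paragraph above (links stored as narrow pool indices rather than pointers, and oldest-first reuse when the pool is full) can be sketched as follows. Everything here is hypothetical and greatly simplified: the `vp_` names, the 4-entry pool, and the single `key` field standing in for a connection's addresses and ports are assumptions, and the sketch makes no attempt at the cacheline-conscious hash layout the real VTW code is described as having.

```c
#include <stdint.h>

#define VP_POOL_SIZE 4			/* tiny pool, for illustration only */
#define VP_NIL       UINT16_MAX		/* "null" index */

/* Hypothetical vestigial entry: the chain link is a 16-bit pool index. */
struct vp_entry {
	uint32_t key;			/* stand-in for the connection 4-tuple */
	uint16_t next;			/* index of next entry in the hash chain */
};

struct vp_pool {
	struct vp_entry ent[VP_POOL_SIZE];
	uint16_t head[VP_POOL_SIZE];	/* hash bucket heads, as indices */
	uint16_t oldest;		/* next slot to (re)use, in FIFO order */
	uint16_t count;			/* number of live entries */
};

void
vp_init(struct vp_pool *p)
{
	for (int i = 0; i < VP_POOL_SIZE; i++)
		p->head[i] = VP_NIL;
	p->oldest = 0;
	p->count = 0;
}

/* Remove entry idx from its hash chain. */
static void
vp_unlink(struct vp_pool *p, uint16_t idx)
{
	uint16_t *link = &p->head[p->ent[idx].key % VP_POOL_SIZE];

	while (*link != VP_NIL && *link != idx)
		link = &p->ent[*link].next;
	if (*link == idx)
		*link = p->ent[idx].next;
}

/* Insert key, discarding the oldest entry if the pool is full. */
uint16_t
vp_insert(struct vp_pool *p, uint32_t key)
{
	uint16_t slot = p->oldest;
	uint16_t bucket = key % VP_POOL_SIZE;

	if (p->count == VP_POOL_SIZE)
		vp_unlink(p, slot);	/* evict: slots are filled in FIFO order */
	else
		p->count++;
	p->ent[slot].key = key;
	p->ent[slot].next = p->head[bucket];
	p->head[bucket] = slot;
	p->oldest = (uint16_t)((slot + 1) % VP_POOL_SIZE);
	return slot;
}

/* Return the pool index holding key, or VP_NIL if absent. */
uint16_t
vp_lookup(const struct vp_pool *p, uint32_t key)
{
	uint16_t i = p->head[key % VP_POOL_SIZE];

	while (i != VP_NIL && p->ent[i].key != key)
		i = p->ent[i].next;
	return i;
}
```

Because every link is a 2-byte index into a fixed array, each entry is smaller than it would be with a pointer-sized link, which is the memory saving the commit message refers to.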
It may help to think of VTW as a "FIN cache" by analogy to the SYN
cache.
A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT
sessions as fast as it can is approximately 17% idle when VTW is active
versus 0% idle when VTW is inactive. It has 103 megabytes more free RAM
when VTW is active (approximately 64k vestigial PCBs are created) than
when it is inactive.
2011-05-03 22:28:44 +04:00
|
|
|
TCP_TIMER_ARM(tp, TCPT_2MSL, 2 * tp->t_msl);
|
1993-03-21 12:45:37 +03:00
|
|
|
goto dropafterack;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
step6:
|
|
|
|
/*
|
|
|
|
* Update window information.
|
|
|
|
* Don't look at window if no ACK: TAC's send garbage on first SYN.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if ((tiflags & TH_ACK) && (SEQ_LT(tp->snd_wl1, th->th_seq) ||
|
2006-02-18 20:34:49 +03:00
|
|
|
(tp->snd_wl1 == th->th_seq && (SEQ_LT(tp->snd_wl2, th->th_ack) ||
|
|
|
|
(tp->snd_wl2 == th->th_ack && tiwin > tp->snd_wnd))))) {
|
1993-03-21 12:45:37 +03:00
|
|
|
/* keep track of pure window updates */
|
1999-07-01 12:12:45 +04:00
|
|
|
if (tlen == 0 &&
|
|
|
|
tp->snd_wl2 == th->th_ack && tiwin > tp->snd_wnd)
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_RCVWINUPD);
|
1994-05-13 10:02:48 +04:00
|
|
|
tp->snd_wnd = tiwin;
|
1999-07-01 12:12:45 +04:00
|
|
|
tp->snd_wl1 = th->th_seq;
|
|
|
|
tp->snd_wl2 = th->th_ack;
|
1993-03-21 12:45:37 +03:00
|
|
|
if (tp->snd_wnd > tp->max_sndwnd)
|
|
|
|
tp->max_sndwnd = tp->snd_wnd;
|
|
|
|
needoutput = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Process segments with URG.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if ((tiflags & TH_URG) && th->th_urp &&
|
1993-03-21 12:45:37 +03:00
|
|
|
TCPS_HAVERCVDFIN(tp->t_state) == 0) {
|
|
|
|
/*
|
|
|
|
* This is a kludge, but if we receive and accept
|
|
|
|
* random urgent pointers, we'll crash in
|
|
|
|
* soreceive. It's hard to imagine someone
|
|
|
|
* actually wanting to send this much urgent data.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if (th->th_urp + so->so_rcv.sb_cc > sb_max) {
|
|
|
|
th->th_urp = 0; /* XXX */
|
1993-03-21 12:45:37 +03:00
|
|
|
tiflags &= ~TH_URG; /* XXX */
|
|
|
|
goto dodata; /* XXX */
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* If this segment advances the known urgent pointer,
|
|
|
|
* then mark the data stream. This should not happen
|
|
|
|
* in CLOSE_WAIT, CLOSING, LAST_ACK or TIME_WAIT STATES since
|
2002-06-09 20:33:36 +04:00
|
|
|
* a FIN has been received from the remote side.
|
1993-03-21 12:45:37 +03:00
|
|
|
* In these states we ignore the URG.
|
|
|
|
*
|
|
|
|
* According to RFC961 (Assigned Protocols),
|
|
|
|
* the urgent pointer points to the last octet
|
|
|
|
* of urgent data. We continue, however,
|
|
|
|
* to consider it to indicate the first octet
|
2002-06-09 20:33:36 +04:00
|
|
|
* of data past the urgent section as the original
|
1993-03-21 12:45:37 +03:00
|
|
|
* spec states (in one of two places).
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if (SEQ_GT(th->th_seq+th->th_urp, tp->rcv_up)) {
|
|
|
|
tp->rcv_up = th->th_seq + th->th_urp;
|
1993-03-21 12:45:37 +03:00
|
|
|
so->so_oobmark = so->so_rcv.sb_cc +
|
|
|
|
(tp->rcv_up - tp->rcv_nxt) - 1;
|
|
|
|
if (so->so_oobmark == 0)
|
|
|
|
so->so_state |= SS_RCVATMARK;
|
|
|
|
sohasoutofband(so);
|
|
|
|
tp->t_oobflags &= ~(TCPOOB_HAVEDATA | TCPOOB_HADDATA);
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Remove out of band data so doesn't get presented to user.
|
|
|
|
* This can happen independent of advancing the URG pointer,
|
|
|
|
* but if two URG's are pending at once, some out-of-band
|
|
|
|
* data may creep in... ick.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if (th->th_urp <= (u_int16_t) tlen
|
1993-03-21 12:45:37 +03:00
|
|
|
#ifdef SO_OOBINLINE
|
|
|
|
&& (so->so_options & SO_OOBINLINE) == 0
|
|
|
|
#endif
|
|
|
|
)
|
1999-12-08 19:22:20 +03:00
|
|
|
tcp_pulloutofband(so, th, m, hdroptlen);
|
1993-03-21 12:45:37 +03:00
|
|
|
} else
|
|
|
|
/*
|
|
|
|
* If no out of band data is expected,
|
|
|
|
* pull receive urgent pointer along
|
|
|
|
* with the receive window.
|
|
|
|
*/
|
|
|
|
if (SEQ_GT(tp->rcv_nxt, tp->rcv_up))
|
|
|
|
tp->rcv_up = tp->rcv_nxt;
|
|
|
|
dodata: /* XXX */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Process the segment text, merging it into the TCP sequencing queue,
|
1999-09-10 07:24:14 +04:00
|
|
|
* and arranging for acknowledgement of receipt if necessary.
|
1993-03-21 12:45:37 +03:00
|
|
|
* This process logically involves adjusting tp->rcv_wnd as data
|
|
|
|
* is presented to the user (this happens in tcp_usrreq.c,
|
|
|
|
* case PRU_RCVD). If a FIN has already been received on this
|
|
|
|
* connection then we just ignore the text.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if ((tlen || (tiflags & TH_FIN)) &&
|
1993-03-21 12:45:37 +03:00
|
|
|
TCPS_HAVERCVDFIN(tp->t_state) == 0) {
|
1999-07-01 12:12:45 +04:00
|
|
|
/*
|
|
|
|
* Insert segment ti into reassembly queue of tcp with
|
|
|
|
* control block tp. Return TH_FIN if reassembly now includes
|
|
|
|
* a segment with FIN. The macro form does the common case
|
|
|
|
* inline (segment is the next to be received on an
|
|
|
|
* established connection, and the queue is empty),
|
|
|
|
* avoiding linkage into and removal from the queue and
|
|
|
|
* repetition of various conversions.
|
|
|
|
* Set DELACK for segments received in order, but ack
|
|
|
|
* immediately when segments are out of order
|
|
|
|
* (so fast retransmit can work).
|
|
|
|
*/
|
|
|
|
/* NOTE: this was TCP_REASS() macro, but used only once */
|
|
|
|
TCP_REASS_LOCK(tp);
|
|
|
|
if (th->th_seq == tp->rcv_nxt &&
|
2002-05-07 06:59:38 +04:00
|
|
|
TAILQ_FIRST(&tp->segq) == NULL &&
|
1999-07-01 12:12:45 +04:00
|
|
|
tp->t_state == TCPS_ESTABLISHED) {
|
2008-02-20 14:44:07 +03:00
|
|
|
tcp_setup_ack(tp, th);
|
1999-07-01 12:12:45 +04:00
|
|
|
tp->rcv_nxt += tlen;
|
|
|
|
tiflags = th->th_flags & TH_FIN;
|
2008-04-12 09:58:22 +04:00
|
|
|
tcps = TCP_STAT_GETREF();
|
|
|
|
tcps[TCP_STAT_RCVPACK]++;
|
|
|
|
tcps[TCP_STAT_RCVBYTE] += tlen;
|
|
|
|
TCP_STAT_PUTREF();
|
2007-12-20 22:53:29 +03:00
|
|
|
nd6_hint(tp);
|
2002-09-06 03:02:18 +04:00
|
|
|
if (so->so_state & SS_CANTRCVMORE)
|
|
|
|
m_freem(m);
|
|
|
|
else {
|
|
|
|
m_adj(m, hdroptlen);
|
|
|
|
sbappendstream(&(so)->so_rcv, m);
|
|
|
|
}
|
2008-08-04 08:08:47 +04:00
|
|
|
TCP_REASS_UNLOCK(tp);
|
1999-07-01 12:12:45 +04:00
|
|
|
sorwakeup(so);
|
|
|
|
} else {
|
1999-12-08 19:22:20 +03:00
|
|
|
m_adj(m, hdroptlen);
|
1999-07-01 12:12:45 +04:00
|
|
|
tiflags = tcp_reass(tp, th, m, &tlen);
|
|
|
|
tp->t_flags |= TF_ACKNOW;
|
|
|
|
}
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
|
|
|
* Note the amount of data that peer has sent into
|
|
|
|
* our window, in order to estimate the sender's
|
|
|
|
* buffer size.
|
|
|
|
*/
|
|
|
|
len = so->so_rcv.sb_hiwat - (tp->rcv_adv - tp->rcv_nxt);
|
|
|
|
} else {
|
|
|
|
m_freem(m);
|
1999-07-01 12:12:45 +04:00
|
|
|
m = NULL;
|
1993-03-21 12:45:37 +03:00
|
|
|
tiflags &= ~TH_FIN;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If FIN is received ACK the FIN and let the user know
|
1996-01-31 08:56:56 +03:00
|
|
|
* that the connection is closing. Ignore a FIN received before
|
|
|
|
* the connection is fully established.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
1996-01-31 08:56:56 +03:00
|
|
|
if ((tiflags & TH_FIN) && TCPS_HAVEESTABLISHED(tp->t_state)) {
|
1993-03-21 12:45:37 +03:00
|
|
|
if (TCPS_HAVERCVDFIN(tp->t_state) == 0) {
|
|
|
|
socantrcvmore(so);
|
|
|
|
tp->t_flags |= TF_ACKNOW;
|
|
|
|
tp->rcv_nxt++;
|
|
|
|
}
|
|
|
|
switch (tp->t_state) {
|
|
|
|
|
|
|
|
/*
|
1996-01-31 08:56:56 +03:00
|
|
|
* In ESTABLISHED STATE enter the CLOSE_WAIT state.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
|
|
|
case TCPS_ESTABLISHED:
|
|
|
|
tp->t_state = TCPS_CLOSE_WAIT;
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If still in FIN_WAIT_1 STATE FIN has not been acked so
|
|
|
|
* enter the CLOSING state.
|
|
|
|
*/
|
|
|
|
case TCPS_FIN_WAIT_1:
|
|
|
|
tp->t_state = TCPS_CLOSING;
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In FIN_WAIT_2 state enter the TIME_WAIT state,
|
2002-06-09 20:33:36 +04:00
|
|
|
* starting the time-wait timer, turning off the other
|
1993-03-21 12:45:37 +03:00
|
|
|
* standard timers.
|
|
|
|
*/
|
|
|
|
case TCPS_FIN_WAIT_2:
|
|
|
|
tp->t_state = TCPS_TIME_WAIT;
|
|
|
|
tcp_canceltimers(tp);
|
2011-05-03 22:28:44 +04:00
|
|
|
TCP_TIMER_ARM(tp, TCPT_2MSL, 2 * tp->t_msl);
|
1993-03-21 12:45:37 +03:00
|
|
|
soisdisconnected(so);
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In TIME_WAIT state restart the 2 MSL time_wait timer.
|
|
|
|
*/
|
|
|
|
case TCPS_TIME_WAIT:
|
2011-05-03 22:28:44 +04:00
|
|
|
TCP_TIMER_ARM(tp, TCPT_2MSL, 2 * tp->t_msl);
|
1993-03-21 12:45:37 +03:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2001-07-08 20:18:56 +04:00
|
|
|
#ifdef TCP_DEBUG
|
|
|
|
if (so->so_options & SO_DEBUG)
|
1999-07-01 12:12:45 +04:00
|
|
|
tcp_trace(TA_INPUT, ostate, tp, tcp_saveti, 0);
|
2001-07-08 20:18:56 +04:00
|
|
|
#endif
|
1993-03-21 12:45:37 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Return any desired output.
|
|
|
|
*/
|
Commit TCP SACK patches from Kentaro A. Karahone's patch at:
http://www.sigusr1.org/~kurahone/tcp-sack-netbsd-02152005.diff.gz
Fixes in that patch for pre-existing TCP pcb initializations were already
committed to NetBSD-current, so are not included in this commit.
The SACK patch has been observed to correctly negotiate and respond
to SACKs in wide-area traffic.
There are two independently observed, as-yet-unresolved anomalies:
first, unexplained delays in fast retransmission
(potentially explainable by a 0.2sec RTT between adjacent
ethernet/wifi NICs); and second, peculiar and unexplained TCP
retransmits observed over an ath0 card.
After discussion with several interested developers, I'm committing
this now, as-is, for more eyes to use and look over. Current hypothesis
is that the anomalies above may in fact be due to link-level (hardware,
driver, HAL, firmware) aberrations in the test setup, affecting both
Kentaro's wired-Ethernet NIC and my two (different) WiFi NICs.
2005-02-28 19:20:59 +03:00
|
|
|
if (needoutput || (tp->t_flags & TF_ACKNOW)) {
|
2010-04-01 04:24:41 +04:00
|
|
|
KERNEL_LOCK(1, NULL);
|
1993-03-21 12:45:37 +03:00
|
|
|
(void) tcp_output(tp);
|
2010-04-01 04:24:41 +04:00
|
|
|
KERNEL_UNLOCK_ONE(NULL);
|
2005-02-28 19:20:59 +03:00
|
|
|
}
|
1999-07-22 16:56:56 +04:00
|
|
|
if (tcp_saveti)
|
|
|
|
m_freem(tcp_saveti);
|
2011-05-03 22:28:44 +04:00
|
|
|
|
|
|
|
if (tp->t_state == TCPS_TIME_WAIT
|
|
|
|
&& (so->so_state & SS_NOFDREF)
|
|
|
|
&& (tp->t_inpcb || af != AF_INET)
|
|
|
|
&& (tp->t_in6pcb || af != AF_INET6)
|
|
|
|
&& ((af == AF_INET ? tcp4_vtw_enable : tcp6_vtw_enable) & 1) != 0
|
|
|
|
&& TAILQ_EMPTY(&tp->segq)
|
|
|
|
&& vtw_add(af, tp)) {
|
|
|
|
;
|
|
|
|
}
|
1993-03-21 12:45:37 +03:00
|
|
|
return;
|
|
|
|
|
1997-11-21 09:18:30 +03:00
|
|
|
badsyn:
|
|
|
|
/*
|
|
|
|
* Received a bad SYN. Increment counters and dropwithreset.
|
|
|
|
*/
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_BADSYN);
|
1997-11-21 09:18:30 +03:00
|
|
|
tp = NULL;
|
|
|
|
goto dropwithreset;
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
dropafterack:
|
|
|
|
/*
|
|
|
|
* Generate an ACK dropping incoming segment if it occupies
|
|
|
|
* sequence space, where the ACK reflects our state.
|
|
|
|
*/
|
|
|
|
if (tiflags & TH_RST)
|
|
|
|
goto drop;
|
2004-04-20 20:52:12 +04:00
|
|
|
goto dropafterack2;
|
|
|
|
|
|
|
|
dropafterack_ratelim:
|
|
|
|
/*
|
|
|
|
* We may want to rate-limit ACKs against SYN/RST attack.
|
|
|
|
*/
|
|
|
|
if (ppsratecheck(&tcp_ackdrop_ppslim_last, &tcp_ackdrop_ppslim_count,
|
|
|
|
tcp_ackdrop_ppslim) == 0) {
|
|
|
|
/* XXX stat */
|
|
|
|
goto drop;
|
|
|
|
}
|
|
|
|
/* ...fall into dropafterack2... */
|
|
|
|
|
|
|
|
dropafterack2:
|
1993-03-21 12:45:37 +03:00
|
|
|
m_freem(m);
|
|
|
|
tp->t_flags |= TF_ACKNOW;
|
2010-04-01 04:24:41 +04:00
|
|
|
KERNEL_LOCK(1, NULL);
|
1993-03-21 12:45:37 +03:00
|
|
|
(void) tcp_output(tp);
|
2010-04-01 04:24:41 +04:00
|
|
|
KERNEL_UNLOCK_ONE(NULL);
|
1999-07-22 16:56:56 +04:00
|
|
|
if (tcp_saveti)
|
|
|
|
m_freem(tcp_saveti);
|
1993-03-21 12:45:37 +03:00
|
|
|
return;
|
|
|
|
|
2000-02-15 22:54:11 +03:00
|
|
|
dropwithreset_ratelim:
|
|
|
|
/*
|
|
|
|
* We may want to rate-limit RSTs in certain situations,
|
|
|
|
* particularly if we are sending an RST in response to
|
|
|
|
* an attempt to connect to or otherwise communicate with
|
|
|
|
* a port for which we have no socket.
|
|
|
|
*/
|
2000-07-27 15:34:06 +04:00
|
|
|
if (ppsratecheck(&tcp_rst_ppslim_last, &tcp_rst_ppslim_count,
|
|
|
|
tcp_rst_ppslim) == 0) {
|
|
|
|
/* XXX stat */
|
|
|
|
goto drop;
|
|
|
|
}
|
2000-02-15 22:54:11 +03:00
|
|
|
/* ...fall into dropwithreset... */
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
dropwithreset:
|
|
|
|
/*
|
|
|
|
* Generate a RST, dropping incoming segment.
|
|
|
|
* Make ACK acceptable to originator of segment.
|
|
|
|
*/
|
2000-02-12 20:19:34 +03:00
|
|
|
if (tiflags & TH_RST)
|
1993-03-21 12:45:37 +03:00
|
|
|
goto drop;
|
2002-03-19 17:35:20 +03:00
|
|
|
|
2002-03-24 20:09:01 +03:00
|
|
|
switch (af) {
|
|
|
|
#ifdef INET6
|
|
|
|
case AF_INET6:
|
|
|
|
/* For following calls to tcp_respond */
|
|
|
|
if (IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst))
|
|
|
|
goto drop;
|
|
|
|
break;
|
|
|
|
#endif /* INET6 */
|
|
|
|
case AF_INET:
|
|
|
|
if (IN_MULTICAST(ip->ip_dst.s_addr) ||
|
|
|
|
in_broadcast(ip->ip_dst, m->m_pkthdr.rcvif))
|
|
|
|
goto drop;
|
|
|
|
}
|
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
if (tiflags & TH_ACK)
|
1999-07-15 02:37:13 +04:00
|
|
|
(void)tcp_respond(tp, m, m, th, (tcp_seq)0, th->th_ack, TH_RST);
|
1993-03-21 12:45:37 +03:00
|
|
|
else {
|
|
|
|
if (tiflags & TH_SYN)
|
1999-07-01 12:12:45 +04:00
|
|
|
tlen++;
|
1999-07-15 02:37:13 +04:00
|
|
|
(void)tcp_respond(tp, m, m, th, th->th_seq + tlen, (tcp_seq)0,
|
1993-03-21 12:45:37 +03:00
|
|
|
TH_RST|TH_ACK);
|
|
|
|
}
|
1999-07-22 16:56:56 +04:00
|
|
|
if (tcp_saveti)
|
|
|
|
m_freem(tcp_saveti);
|
1993-03-21 12:45:37 +03:00
|
|
|
return;
|
|
|
|
|
2001-06-02 20:17:09 +04:00
|
|
|
badcsum:
|
1993-03-21 12:45:37 +03:00
|
|
|
drop:
|
|
|
|
/*
|
|
|
|
* Drop space held by incoming segment and return.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if (tp) {
|
|
|
|
if (tp->t_inpcb)
|
|
|
|
so = tp->t_inpcb->inp_socket;
|
|
|
|
#ifdef INET6
|
|
|
|
else if (tp->t_in6pcb)
|
|
|
|
so = tp->t_in6pcb->in6p_socket;
|
|
|
|
#endif
|
|
|
|
else
|
|
|
|
so = NULL;
|
2001-07-08 20:18:56 +04:00
|
|
|
#ifdef TCP_DEBUG
|
1999-07-22 16:56:56 +04:00
|
|
|
if (so && (so->so_options & SO_DEBUG) != 0)
|
1999-07-01 12:12:45 +04:00
|
|
|
tcp_trace(TA_DROP, ostate, tp, tcp_saveti, 0);
|
2001-07-08 20:18:56 +04:00
|
|
|
#endif
|
1999-07-01 12:12:45 +04:00
|
|
|
}
|
1999-07-22 16:56:56 +04:00
|
|
|
if (tcp_saveti)
|
|
|
|
m_freem(tcp_saveti);
|
1993-03-21 12:45:37 +03:00
|
|
|
m_freem(m);
|
|
|
|
return;
|
|
|
|
}
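The window-update test at `step6` in `tcp_input` above can be read as a standalone predicate. The sketch below restates it outside the kernel, assuming the conventional BSD definition of `SEQ_LT` as a signed 32-bit difference; the `wnd_state` struct and the function name `window_update_ok` are hypothetical, and the `TH_ACK` check is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a precedes b" (assumed to match the BSD macro). */
#define SEQ_LT(a, b)	((int32_t)((uint32_t)(a) - (uint32_t)(b)) < 0)

/* Hypothetical subset of the TCP control block used by the test. */
struct wnd_state {
	uint32_t snd_wl1;	/* seq of the segment used for the last update */
	uint32_t snd_wl2;	/* ack of the segment used for the last update */
	uint32_t snd_wnd;	/* current send window */
};

/*
 * Accept a window update if the segment is newer by sequence number,
 * or equally new but newer by acknowledgement number, or identical in
 * both but advertising a larger window.
 */
bool
window_update_ok(const struct wnd_state *w, uint32_t seq, uint32_t ack,
    uint32_t win)
{
	return SEQ_LT(w->snd_wl1, seq) ||
	    (w->snd_wl1 == seq && (SEQ_LT(w->snd_wl2, ack) ||
	    (w->snd_wl2 == ack && win > w->snd_wnd)));
}
```

The signed-difference comparison is what keeps the test correct when sequence numbers wrap past 2^32, so a segment numbered 0x10 still counts as newer than one numbered 0xfffffff0.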
|
|
|
|
|
2004-05-18 18:44:14 +04:00
|
|
|
#ifdef TCP_SIGNATURE
|
|
|
|
int
|
2007-03-04 08:59:00 +03:00
|
|
|
tcp_signature_apply(void *fstate, void *data, u_int len)
|
2004-05-18 18:44:14 +04:00
|
|
|
{
|
|
|
|
|
|
|
|
MD5Update(fstate, (u_char *)data, len);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
struct secasvar *
|
|
|
|
tcp_signature_getsav(struct mbuf *m, struct tcphdr *th)
|
|
|
|
{
|
|
|
|
struct secasvar *sav;
|
|
|
|
#ifdef FAST_IPSEC
|
|
|
|
union sockaddr_union dst;
|
|
|
|
#endif
|
|
|
|
struct ip *ip;
|
|
|
|
struct ip6_hdr *ip6;
|
|
|
|
|
|
|
|
ip = mtod(m, struct ip *);
|
|
|
|
switch (ip->ip_v) {
|
|
|
|
case 4:
|
|
|
|
ip = mtod(m, struct ip *);
|
|
|
|
ip6 = NULL;
|
|
|
|
break;
|
|
|
|
case 6:
|
|
|
|
ip = NULL;
|
|
|
|
ip6 = mtod(m, struct ip6_hdr *);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef FAST_IPSEC
|
|
|
|
/* Extract the destination from the IP header in the mbuf. */
|
2009-03-18 19:00:08 +03:00
|
|
|
memset(&dst, 0, sizeof(union sockaddr_union));
|
2007-02-10 12:43:05 +03:00
|
|
|
if (ip !=NULL) {
|
|
|
|
dst.sa.sa_len = sizeof(struct sockaddr_in);
|
|
|
|
dst.sa.sa_family = AF_INET;
|
|
|
|
dst.sin.sin_addr = ip->ip_dst;
|
|
|
|
} else {
|
|
|
|
dst.sa.sa_len = sizeof(struct sockaddr_in6);
|
|
|
|
dst.sa.sa_family = AF_INET6;
|
|
|
|
dst.sin6.sin6_addr = ip6->ip6_dst;
|
|
|
|
}
|
2004-05-18 18:44:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Look up an SADB entry which matches the address of the peer.
|
|
|
|
*/
|
|
|
|
sav = KEY_ALLOCSA(&dst, IPPROTO_TCP, htonl(TCP_SIG_SPI));
|
|
|
|
#else
|
|
|
|
if (ip)
|
2007-03-04 08:59:00 +03:00
|
|
|
sav = key_allocsa(AF_INET, (void *)&ip->ip_src,
|
|
|
|
(void *)&ip->ip_dst, IPPROTO_TCP,
|
2005-04-26 09:37:45 +04:00
|
|
|
htonl(TCP_SIG_SPI), 0, 0);
|
2004-05-18 18:44:14 +04:00
|
|
|
else
|
2007-03-04 08:59:00 +03:00
|
|
|
sav = key_allocsa(AF_INET6, (void *)&ip6->ip6_src,
|
|
|
|
(void *)&ip6->ip6_dst, IPPROTO_TCP,
|
2005-04-26 09:37:45 +04:00
|
|
|
htonl(TCP_SIG_SPI), 0, 0);
|
2004-05-18 18:44:14 +04:00
|
|
|
#endif
|
|
|
|
|
|
|
|
return (sav); /* freesav must be performed by caller */
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
tcp_signature(struct mbuf *m, struct tcphdr *th, int thoff,
|
|
|
|
struct secasvar *sav, char *sig)
|
|
|
|
{
|
|
|
|
MD5_CTX ctx;
|
|
|
|
struct ip *ip;
|
|
|
|
struct ipovly *ipovly;
|
|
|
|
struct ip6_hdr *ip6;
|
|
|
|
struct ippseudo ippseudo;
|
|
|
|
struct ip6_hdr_pseudo ip6pseudo;
|
|
|
|
struct tcphdr th0;
|
2004-06-26 07:29:15 +04:00
|
|
|
int l, tcphdrlen;
|
2004-05-18 18:44:14 +04:00
|
|
|
|
|
|
|
if (sav == NULL)
|
|
|
|
return (-1);
|
|
|
|
|
2004-06-26 07:29:15 +04:00
|
|
|
tcphdrlen = th->th_off * 4;
|
|
|
|
|
2004-05-18 18:44:14 +04:00
|
|
|
switch (mtod(m, struct ip *)->ip_v) {
|
|
|
|
case 4:
|
|
|
|
ip = mtod(m, struct ip *);
|
|
|
|
ip6 = NULL;
|
|
|
|
break;
|
|
|
|
case 6:
|
|
|
|
ip = NULL;
|
|
|
|
ip6 = mtod(m, struct ip6_hdr *);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return (-1);
|
|
|
|
}
|
|
|
|
|
|
|
|
MD5Init(&ctx);
|
|
|
|
|
|
|
|
if (ip) {
|
|
|
|
memset(&ippseudo, 0, sizeof(ippseudo));
|
|
|
|
ipovly = (struct ipovly *)ip;
|
|
|
|
ippseudo.ippseudo_src = ipovly->ih_src;
|
|
|
|
ippseudo.ippseudo_dst = ipovly->ih_dst;
|
|
|
|
ippseudo.ippseudo_pad = 0;
|
|
|
|
ippseudo.ippseudo_p = IPPROTO_TCP;
|
|
|
|
ippseudo.ippseudo_len = htons(m->m_pkthdr.len - thoff);
|
|
|
|
MD5Update(&ctx, (char *)&ippseudo, sizeof(ippseudo));
|
|
|
|
} else {
|
|
|
|
memset(&ip6pseudo, 0, sizeof(ip6pseudo));
|
|
|
|
ip6pseudo.ip6ph_src = ip6->ip6_src;
|
|
|
|
in6_clearscope(&ip6pseudo.ip6ph_src);
|
|
|
|
ip6pseudo.ip6ph_dst = ip6->ip6_dst;
|
|
|
|
in6_clearscope(&ip6pseudo.ip6ph_dst);
|
|
|
|
ip6pseudo.ip6ph_len = htons(m->m_pkthdr.len - thoff);
|
|
|
|
ip6pseudo.ip6ph_nxt = IPPROTO_TCP;
|
|
|
|
MD5Update(&ctx, (char *)&ip6pseudo, sizeof(ip6pseudo));
|
|
|
|
}
|
|
|
|
|
|
|
|
th0 = *th;
|
|
|
|
th0.th_sum = 0;
|
|
|
|
MD5Update(&ctx, (char *)&th0, sizeof(th0));
|
|
|
|
|
2004-06-26 07:29:15 +04:00
|
|
|
l = m->m_pkthdr.len - thoff - tcphdrlen;
|
2004-05-18 18:44:14 +04:00
|
|
|
if (l > 0)
|
2004-06-26 07:29:15 +04:00
|
|
|
m_apply(m, thoff + tcphdrlen,
|
|
|
|
m->m_pkthdr.len - thoff - tcphdrlen,
|
2004-05-18 18:44:14 +04:00
|
|
|
tcp_signature_apply, &ctx);
|
|
|
|
|
|
|
|
MD5Update(&ctx, _KEYBUF(sav->key_auth), _KEYLEN(sav->key_auth));
|
|
|
|
MD5Final(sig, &ctx);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
#endif
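The option-walking loop used by `tcp_dooptions` below follows the usual kind/length/value layout of TCP options. A minimal userland sketch of the same bounds checks is given here, extracting only the MSS option; the function name `find_mss` and the `OPT_` constants are hypothetical stand-ins (the numeric values are the standard option kinds 0, 1 and 2, and the MSS option length of 4).

```c
#include <stddef.h>
#include <stdint.h>

#define OPT_EOL		0	/* end of option list */
#define OPT_NOP		1	/* one-byte padding */
#define OPT_MAXSEG	2	/* maximum segment size, total length 4 */

/*
 * Walk a TCP option buffer of cnt bytes and return the MSS value,
 * or 0 if the option is absent or the buffer is malformed.
 */
uint16_t
find_mss(const uint8_t *cp, int cnt)
{
	int opt, optlen;

	for (; cp != NULL && cnt > 0; cnt -= optlen, cp += optlen) {
		opt = cp[0];
		if (opt == OPT_EOL)
			break;
		if (opt == OPT_NOP) {
			optlen = 1;	/* NOP has no length byte */
			continue;
		}
		if (cnt < 2)		/* no room for a length byte */
			break;
		optlen = cp[1];
		if (optlen < 2 || optlen > cnt)	/* bogus or truncated */
			break;
		if (opt == OPT_MAXSEG && optlen == 4)
			return (uint16_t)((cp[2] << 8) | cp[3]);
	}
	return 0;
}
```

Rejecting `optlen < 2` matters because a zero length would otherwise leave `cp` and `cnt` unchanged and the loop would never terminate.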
|
|
|
|
|
2011-04-14 19:48:48 +04:00
|
|
|
/*
|
|
|
|
* tcp_dooptions: parse and process tcp options.
|
|
|
|
*
|
|
|
|
* returns -1 if this segment should be dropped. (eg. wrong signature)
|
|
|
|
* otherwise returns 0.
|
|
|
|
*/
|
|
|
|
|
2006-10-21 14:08:54 +04:00
|
|
|
static int
|
|
|
|
tcp_dooptions(struct tcpcb *tp, const u_char *cp, int cnt,
|
2007-05-19 01:31:16 +04:00
|
|
|
struct tcphdr *th,
|
2006-11-16 04:32:37 +03:00
|
|
|
struct mbuf *m, int toff, struct tcp_opt_info *oi)
|
1993-03-21 12:45:37 +03:00
|
|
|
{
|
1995-04-13 10:35:38 +04:00
|
|
|
u_int16_t mss;
|
2004-05-18 18:44:14 +04:00
|
|
|
int opt, optlen = 0;
|
|
|
|
#ifdef TCP_SIGNATURE
|
2007-03-04 08:59:00 +03:00
|
|
|
void *sigp = NULL;
|
2004-05-18 18:44:14 +04:00
|
|
|
char sigbuf[TCP_SIGLEN];
|
|
|
|
struct secasvar *sav = NULL;
|
|
|
|
#endif
|
1993-03-21 12:45:37 +03:00
|
|
|
|
2004-05-18 18:44:14 +04:00
|
|
|
for (; cp && cnt > 0; cnt -= optlen, cp += optlen) {
|
1993-03-21 12:45:37 +03:00
|
|
|
opt = cp[0];
|
|
|
|
if (opt == TCPOPT_EOL)
|
|
|
|
break;
|
|
|
|
if (opt == TCPOPT_NOP)
|
|
|
|
optlen = 1;
|
|
|
|
else {
|
2000-07-09 16:49:08 +04:00
|
|
|
if (cnt < 2)
|
|
|
|
break;
|
1993-03-21 12:45:37 +03:00
|
|
|
optlen = cp[1];
|
2000-07-09 16:49:08 +04:00
|
|
|
if (optlen < 2 || optlen > cnt)
|
1993-03-21 12:45:37 +03:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
switch (opt) {
|
|
|
|
|
|
|
|
default:
|
|
|
|
continue;
|
|
|
|
|
|
|
|
case TCPOPT_MAXSEG:
|
1994-05-13 10:02:48 +04:00
|
|
|
if (optlen != TCPOLEN_MAXSEG)
|
1993-03-21 12:45:37 +03:00
|
|
|
continue;
|
1999-07-01 12:12:45 +04:00
|
|
|
if (!(th->th_flags & TH_SYN))
|
1993-03-21 12:45:37 +03:00
|
|
|
continue;
|
2005-08-12 02:25:18 +04:00
|
|
|
if (TCPS_HAVERCVDSYN(tp->t_state))
|
|
|
|
continue;
|
1997-07-24 01:26:40 +04:00
|
|
|
bcopy(cp + 2, &mss, sizeof(mss));
|
|
|
|
oi->maxseg = ntohs(mss);
|
1993-03-21 12:45:37 +03:00
|
|
|
break;
|
1994-05-13 10:02:48 +04:00
|
|
|
|
|
|
|
case TCPOPT_WINDOW:
|
|
|
|
if (optlen != TCPOLEN_WINDOW)
|
|
|
|
continue;
|
1999-07-01 12:12:45 +04:00
|
|
|
if (!(th->th_flags & TH_SYN))
|
1994-05-13 10:02:48 +04:00
|
|
|
continue;
|
2005-08-12 02:25:18 +04:00
|
|
|
if (TCPS_HAVERCVDSYN(tp->t_state))
|
|
|
|
continue;
|
1994-05-13 10:02:48 +04:00
|
|
|
tp->t_flags |= TF_RCVD_SCALE;
|
1998-04-29 01:52:16 +04:00
|
|
|
tp->requested_s_scale = cp[2];
|
|
|
|
if (tp->requested_s_scale > TCP_MAX_WINSHIFT) {
|
1999-07-01 12:12:45 +04:00
|
|
|
#if 0 /*XXX*/
|
|
|
|
char *p;
|
|
|
|
|
|
|
|
if (ip)
|
|
|
|
p = ntohl(ip->ip_src);
|
|
|
|
#ifdef INET6
|
|
|
|
else if (ip6)
|
|
|
|
p = ip6_sprintf(&ip6->ip6_src);
|
|
|
|
#endif
|
|
|
|
else
|
|
|
|
p = "(unknown)";
|
|
|
|
log(LOG_ERR, "TCP: invalid wscale %d from %s, "
|
|
|
|
"assuming %d\n",
|
|
|
|
tp->requested_s_scale, p,
|
|
|
|
TCP_MAX_WINSHIFT);
|
|
|
|
#else
|
|
|
|
log(LOG_ERR, "TCP: invalid wscale %d, "
|
|
|
|
"assuming %d\n",
|
1998-04-29 01:52:16 +04:00
|
|
|
tp->requested_s_scale,
|
|
|
|
TCP_MAX_WINSHIFT);
|
1999-07-01 12:12:45 +04:00
|
|
|
#endif
|
1998-04-29 01:52:16 +04:00
|
|
|
tp->requested_s_scale = TCP_MAX_WINSHIFT;
|
|
|
|
}
|
1994-05-13 10:02:48 +04:00
|
|
|
break;
|
|
|
|
|
|
|
|
case TCPOPT_TIMESTAMP:
|
|
|
|
if (optlen != TCPOLEN_TIMESTAMP)
|
|
|
|
continue;
|
1997-09-23 01:49:55 +04:00
|
|
|
oi->ts_present = 1;
|
1997-07-24 01:26:40 +04:00
|
|
|
bcopy(cp + 2, &oi->ts_val, sizeof(oi->ts_val));
|
|
|
|
NTOHL(oi->ts_val);
|
|
|
|
bcopy(cp + 6, &oi->ts_ecr, sizeof(oi->ts_ecr));
|
|
|
|
NTOHL(oi->ts_ecr);
|
1994-05-13 10:02:48 +04:00
|
|
|
|
2005-08-12 02:25:18 +04:00
|
|
|
if (!(th->th_flags & TH_SYN))
|
|
|
|
continue;
|
|
|
|
if (TCPS_HAVERCVDSYN(tp->t_state))
|
|
|
|
continue;
|
2002-06-09 20:33:36 +04:00
|
|
|
/*
|
1994-05-13 10:02:48 +04:00
|
|
|
* A timestamp received in a SYN makes
|
|
|
|
* it ok to send timestamp requests and replies.
|
|
|
|
*/
|
2005-08-12 02:25:18 +04:00
|
|
|
tp->t_flags |= TF_RCVD_TSTMP;
|
|
|
|
tp->ts_recent = oi->ts_val;
|
|
|
|
tp->ts_recent_age = tcp_now;
|
2005-06-30 06:58:28 +04:00
|
|
|
break;
|
|
|
|
|
1998-04-30 00:43:29 +04:00
|
|
|
case TCPOPT_SACK_PERMITTED:
|
|
|
|
if (optlen != TCPOLEN_SACK_PERMITTED)
|
|
|
|
continue;
|
1999-07-01 12:12:45 +04:00
|
|
|
if (!(th->th_flags & TH_SYN))
|
1998-04-30 00:43:29 +04:00
|
|
|
continue;
|
2005-08-12 02:25:18 +04:00
|
|
|
if (TCPS_HAVERCVDSYN(tp->t_state))
|
|
|
|
continue;
|
Commit TCP SACK patches from Kentaro A. Karahone's patch at:
http://www.sigusr1.org/~kurahone/tcp-sack-netbsd-02152005.diff.gz
Fixes in that patch for pre-existing TCP pcb initializations were already
committed to NetBSD-current, so are not included in this commit.
The SACK patch has been observed to correctly negotiate and respond
to SACKs in wide-area traffic.
There are two independently-observed, as-yet-unresolved anomalies:
First, seeing unexplained delays in fast retransmission
(potentially explainable by an 0.2sec RTT between adjacent
ethernet/wifi NICs); and second, peculiar and unexplained TCP
retransmits observed over an ath0 card.
After discussion with several interested developers, I'm committing
this now, as-is, for more eyes to use and look over. Current hypothesis
is that the anomalies above may in fact be due to link-level (hardware,
driver, HAL, firmware) aberrations in the test setup, affecting both
Kentaro's wired-Ethernet NIC and my two (different) WiFi NICs.
2005-02-28 19:20:59 +03:00
|
|
|
if (tcp_do_sack) {
|
|
|
|
tp->t_flags |= TF_SACK_PERMIT;
|
|
|
|
tp->t_flags |= TF_WILL_SACK;
|
|
|
|
}
|
1998-04-30 00:43:29 +04:00
|
|
|
break;
|
|
|
|
|
|
|
|
case TCPOPT_SACK:
|
2005-02-28 19:20:59 +03:00
|
|
|
tcp_sack_option(tp, th, cp, optlen);
|
1998-04-30 00:43:29 +04:00
|
|
|
break;
|
Initial commit of a port of the FreeBSD implementation of RFC 2385
(MD5 signatures for TCP, as used with BGP). Credit for original
FreeBSD code goes to Bruce M. Simpson, with FreeBSD sponsorship
credited to sentex.net. Shortening of the setsockopt() name
attributed to Vincent Jardin.
This commit is a minimal, working version of the FreeBSD code, as
MFC'ed to FreeBSD-4. It has received minimal testing with a ttcp
modified to set the TCP-MD5 option; BMS's additions to tcpdump-current
(tcpdump -M) confirm that the MD5 signatures are correct. Committed
as-is for further testing between a NetBSD BGP speaker (e.g., quagga)
and industry-standard BGP speakers (e.g., Cisco, Juniper).
NOTE: This version has two potential flaws. First, I do not see any code
that verifies received TCP-MD5 signatures. Second, the TCP-MD5
options are internally padded and assumed to be 32-bit aligned. A more
space-efficient scheme is to pack all TCP options densely (and
possibly unaligned) into the TCP header; then do one final padding to
a 4-byte boundary. Pre-existing comments note that accounting for
TCP-option space when we add SACK is yet to be done. For now, I'm
punting on that; we can solve it properly, in a way that will handle
SACK blocks, as a separate exercise.
In case a pullup to NetBSD-2 is requested, this adds sys/netipsec/xform_tcp.c,
and modifies:
sys/net/pfkeyv2.h,v 1.15
sys/netinet/files.netinet,v 1.5
sys/netinet/ip.h,v 1.25
sys/netinet/tcp.h,v 1.15
sys/netinet/tcp_input.c,v 1.200
sys/netinet/tcp_output.c,v 1.109
sys/netinet/tcp_subr.c,v 1.165
sys/netinet/tcp_usrreq.c,v 1.89
sys/netinet/tcp_var.h,v 1.109
sys/netipsec/files.netipsec,v 1.3
sys/netipsec/ipsec.c,v 1.11
sys/netipsec/ipsec.h,v 1.7
sys/netipsec/key.c,v 1.11
share/man/man4/tcp.4,v 1.16
lib/libipsec/pfkey.c,v 1.20
lib/libipsec/pfkey_dump.c,v 1.17
lib/libipsec/policy_token.l,v 1.8
sbin/setkey/parse.y,v 1.14
sbin/setkey/setkey.8,v 1.27
sbin/setkey/token.l,v 1.15
Note that the preceding two revisions to tcp.4 will be
required to cleanly apply this diff.
2004-04-26 02:25:03 +04:00
|
|
|
#ifdef TCP_SIGNATURE
|
|
|
|
case TCPOPT_SIGNATURE:
|
|
|
|
if (optlen != TCPOLEN_SIGNATURE)
|
|
|
|
continue;
|
2009-03-18 18:14:29 +03:00
|
|
|
if (sigp && memcmp(sigp, cp + 2, TCP_SIGLEN))
|
2004-05-18 18:44:14 +04:00
|
|
|
return (-1);
|
|
|
|
|
|
|
|
sigp = sigbuf;
|
|
|
|
memcpy(sigbuf, cp + 2, TCP_SIGLEN);
|
|
|
|
tp->t_flags |= TF_SIGNATURE;
|
2004-04-26 02:25:03 +04:00
|
|
|
break;
|
|
|
|
#endif
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
|
|
|
}
|
2004-05-18 18:44:14 +04:00
|
|
|
|
|
|
|
#ifdef TCP_SIGNATURE
|
|
|
|
if (tp->t_flags & TF_SIGNATURE) {
|
|
|
|
|
|
|
|
sav = tcp_signature_getsav(m, th);
|
|
|
|
|
|
|
|
if (sav == NULL && tp->t_state == TCPS_LISTEN)
|
|
|
|
return (-1);
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((sigp ? TF_SIGNATURE : 0) ^ (tp->t_flags & TF_SIGNATURE)) {
|
2004-06-26 07:29:15 +04:00
|
|
|
if (sav == NULL)
|
|
|
|
return (-1);
|
2004-05-18 18:44:14 +04:00
|
|
|
#ifdef FAST_IPSEC
|
|
|
|
KEY_FREESAV(&sav);
|
|
|
|
#else
|
|
|
|
key_freesav(sav);
|
|
|
|
#endif
|
|
|
|
return (-1);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (sigp) {
|
|
|
|
char sig[TCP_SIGLEN];
|
|
|
|
|
2008-02-20 14:44:07 +03:00
|
|
|
tcp_fields_to_net(th);
|
2004-05-18 18:44:14 +04:00
|
|
|
if (tcp_signature(m, th, toff, sav, sig) < 0) {
|
2008-02-20 14:44:07 +03:00
|
|
|
tcp_fields_to_host(th);
|
2004-06-26 07:29:15 +04:00
|
|
|
if (sav == NULL)
|
|
|
|
return (-1);
|
2004-05-18 18:44:14 +04:00
|
|
|
#ifdef FAST_IPSEC
|
|
|
|
KEY_FREESAV(&sav);
|
|
|
|
#else
|
|
|
|
key_freesav(sav);
|
|
|
|
#endif
|
|
|
|
return (-1);
|
|
|
|
}
|
2008-02-20 14:44:07 +03:00
|
|
|
tcp_fields_to_host(th);
|
2004-05-18 18:44:14 +04:00
|
|
|
|
2009-03-18 18:14:29 +03:00
|
|
|
if (memcmp(sig, sigp, TCP_SIGLEN)) {
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_BADSIG);
|
2004-06-26 07:29:15 +04:00
|
|
|
if (sav == NULL)
|
|
|
|
return (-1);
|
2004-05-18 18:44:14 +04:00
|
|
|
#ifdef FAST_IPSEC
|
|
|
|
KEY_FREESAV(&sav);
|
|
|
|
#else
|
|
|
|
key_freesav(sav);
|
|
|
|
#endif
|
|
|
|
return (-1);
|
|
|
|
} else
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_GOODSIG);
|
2004-05-18 18:44:14 +04:00
|
|
|
|
|
|
|
key_sa_recordxfer(sav, m);
|
|
|
|
#ifdef FAST_IPSEC
|
|
|
|
KEY_FREESAV(&sav);
|
|
|
|
#else
|
|
|
|
key_freesav(sav);
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
return (0);
|
1993-03-21 12:45:37 +03:00
|
|
|
}
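The timestamp case in the option loop above copies two 32-bit values out of the option bytes and converts them from network byte order. A minimal standalone sketch of that step follows. The names `parse_timestamp`, `load_be32`, `OPT_TIMESTAMP` and `OPTLEN_TIMESTAMP` are hypothetical and not part of this file; the sketch assembles the values byte by byte instead of using bcopy() and NTOHL(), so it does not depend on host byte order or alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; kind 8 and length 10 are the TCP timestamp option. */
#define OPT_TIMESTAMP     8
#define OPTLEN_TIMESTAMP 10	/* kind + length + two 32-bit values */

/* Assemble a big-endian (network order) 32-bit value from four bytes. */
static uint32_t
load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Parse one timestamp option starting at cp (cp[0] is the kind byte,
 * cp[1] the length byte). Returns 0 and fills in ts_val/ts_ecr on
 * success, -1 if the option is not a well-formed timestamp option.
 */
static int
parse_timestamp(const unsigned char *cp, size_t optlen,
    uint32_t *ts_val, uint32_t *ts_ecr)
{
	assert(cp != NULL);
	if (optlen != OPTLEN_TIMESTAMP || cp[0] != OPT_TIMESTAMP ||
	    cp[1] != OPTLEN_TIMESTAMP)
		return -1;
	*ts_val = load_be32(cp + 2);	/* sender's timestamp */
	*ts_ecr = load_be32(cp + 6);	/* echoed timestamp */
	return 0;
}
```

Unlike the kernel code, this sketch reports a malformed option to the caller; the loop above simply skips such options with `continue`.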
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pull out of band byte out of a segment so
|
|
|
|
* it doesn't appear in the user's data queue.
|
|
|
|
* It is still reflected in the segment length for
|
|
|
|
* sequencing purposes.
|
|
|
|
*/
|
1994-01-09 02:07:16 +03:00
|
|
|
void
|
2005-02-04 02:39:32 +03:00
|
|
|
tcp_pulloutofband(struct socket *so, struct tcphdr *th,
|
|
|
|
struct mbuf *m, int off)
|
1993-03-21 12:45:37 +03:00
|
|
|
{
|
1999-12-08 19:22:20 +03:00
|
|
|
int cnt = off + th->th_urp - 1;
|
2002-06-09 20:33:36 +04:00
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
while (cnt >= 0) {
|
|
|
|
if (m->m_len > cnt) {
|
2007-03-04 08:59:00 +03:00
|
|
|
char *cp = mtod(m, char *) + cnt;
|
1993-03-21 12:45:37 +03:00
|
|
|
struct tcpcb *tp = sototcpcb(so);
|
|
|
|
|
|
|
|
tp->t_iobc = *cp;
|
|
|
|
tp->t_oobflags |= TCPOOB_HAVEDATA;
|
|
|
|
bcopy(cp+1, cp, (unsigned)(m->m_len - cnt - 1));
|
|
|
|
m->m_len--;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
cnt -= m->m_len;
|
|
|
|
m = m->m_next;
|
|
|
|
if (m == 0)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
panic("tcp_pulloutofband");
|
|
|
|
}
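The chain walk in tcp_pulloutofband() can be tried outside the kernel with a simplified buffer type. The sketch below is an illustration only: `struct buf` and `pull_byte` are hypothetical stand-ins for `struct mbuf` and the function above, memmove() replaces bcopy(), and a short chain yields -1 instead of a panic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct mbuf: a chain of byte buffers. */
struct buf {
	struct buf *next;
	char *data;
	int len;
};

/*
 * Remove the byte at offset cnt, counted across the whole chain, and
 * return it through *out. Returns 0 on success, -1 if the chain is
 * shorter than cnt + 1 bytes.
 */
static int
pull_byte(struct buf *b, int cnt, char *out)
{
	assert(cnt >= 0);
	for (; b != NULL; b = b->next) {
		if (b->len > cnt) {
			char *cp = b->data + cnt;

			*out = *cp;
			/* Close the gap; the regions overlap, so memmove. */
			memmove(cp, cp + 1, (size_t)(b->len - cnt - 1));
			b->len--;
			return 0;
		}
		cnt -= b->len;	/* offset is relative to the next buffer */
	}
	return -1;
}
```

Only the buffer that holds the byte shrinks; earlier buffers in the chain are left untouched, which matches the `m->m_len--` in the original.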
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Collect new round-trip time estimate
|
|
|
|
* and update averages and current timeout.
|
2011-04-20 17:35:51 +04:00
|
|
|
*
|
|
|
|
* rtt is in units of slow ticks (typically 500 ms) -- essentially the
|
|
|
|
* difference of two timestamps.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
1994-01-09 02:07:16 +03:00
|
|
|
void
|
2005-02-04 02:39:32 +03:00
|
|
|
tcp_xmit_timer(struct tcpcb *tp, uint32_t rtt)
|
1993-03-21 12:45:37 +03:00
|
|
|
{
|
2001-09-10 19:23:09 +04:00
|
|
|
int32_t delta;
|
1993-03-21 12:45:37 +03:00
|
|
|
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_RTTUPDATED);
|
1993-03-21 12:45:37 +03:00
|
|
|
if (tp->t_srtt != 0) {
|
|
|
|
/*
|
2011-04-20 17:35:51 +04:00
|
|
|
* Compute the amount to add to srtt for smoothing,
|
|
|
|
* *alpha, or 2^(-TCP_RTT_SHIFT). Because
|
|
|
|
* srtt is stored in 1/32 slow ticks, we conceptually
|
|
|
|
* shift left 5 bits, subtract srtt to get the
|
|
|
|
* difference, and then shift right by TCP_RTT_SHIFT
|
|
|
|
* (3) to obtain 1/8 of the difference.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
1995-06-12 00:39:22 +04:00
|
|
|
delta = (rtt << 2) - (tp->t_srtt >> TCP_RTT_SHIFT);
|
2011-04-20 17:35:51 +04:00
|
|
|
/*
|
|
|
|
* This can never happen, because delta's lowest
|
|
|
|
* possible value is -1/8 of t_srtt. But if it does,
|
|
|
|
* set srtt to some reasonable value, here chosen
|
|
|
|
* as 1/8 tick.
|
|
|
|
*/
|
1993-03-21 12:45:37 +03:00
|
|
|
if ((tp->t_srtt += delta) <= 0)
|
1996-12-10 21:20:19 +03:00
|
|
|
tp->t_srtt = 1 << 2;
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
2011-04-20 17:35:51 +04:00
|
|
|
* RFC2988 requires that rttvar be updated first.
|
|
|
|
* This code is compliant because "delta" is the old
|
|
|
|
* srtt minus the new observation (scaled).
|
|
|
|
*
|
|
|
|
* RFC2988 says:
|
|
|
|
* rttvar = (1-beta) * rttvar + beta * |srtt-observed|
|
|
|
|
*
|
|
|
|
* delta is in units of 1/32 ticks, and has then been
|
|
|
|
* divided by 8. This is equivalent to being in 1/16s
|
|
|
|
* units and divided by 4. Subtract from it 1/4 of
|
|
|
|
* the existing rttvar to form the (signed) amount to
|
|
|
|
* adjust.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
|
|
|
if (delta < 0)
|
|
|
|
delta = -delta;
|
|
|
|
delta -= (tp->t_rttvar >> TCP_RTTVAR_SHIFT);
|
2011-04-20 17:35:51 +04:00
|
|
|
/*
|
|
|
|
* As with srtt, this should never happen. There is
|
|
|
|
* no support in RFC2988 for this operation. But 1/4s
|
2011-04-20 18:08:07 +04:00
|
|
|
* as rttvar when faced with something arguably wrong
|
2011-04-20 17:35:51 +04:00
|
|
|
* is ok.
|
|
|
|
*/
|
1993-03-21 12:45:37 +03:00
|
|
|
if ((tp->t_rttvar += delta) <= 0)
|
1996-12-10 21:20:19 +03:00
|
|
|
tp->t_rttvar = 1 << 2;
|
Reduces the resources demanded by TCP sessions in TIME_WAIT-state using
methods called Vestigial Time-Wait (VTW) and Maximum Segment Lifetime
Truncation (MSLT).
MSLT and VTW were contributed by Coyote Point Systems, Inc.
Even after a TCP session enters the TIME_WAIT state, its corresponding
socket and protocol control blocks (PCBs) stick around until the TCP
Maximum Segment Lifetime (MSL) expires. On a host whose workload
necessarily creates and closes down many TCP sockets, the sockets & PCBs
for TCP sessions in TIME_WAIT state amount to many megabytes of dead
weight in RAM.
Maximum Segment Lifetimes Truncation (MSLT) assigns each TCP session to
a class based on the nearness of the peer. Corresponding to each class
is an MSL, and a session uses the MSL of its class. The classes are
loopback (local host equals remote host), local (local host and remote
host are on the same link/subnet), and remote (local host and remote
host communicate via one or more gateways). Classes corresponding to
nearer peers have lower MSLs by default: 2 seconds for loopback, 10
seconds for local, 60 seconds for remote. Loopback and local sessions
expire more quickly when MSLT is used.
Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket
dead weight with a compact representation of the session, called a
"vestigial PCB". VTW data structures are designed to be very fast and
memory-efficient: for fast insertion and lookup of vestigial PCBs,
the PCBs are stored in a hash table that is designed to minimize the
number of cacheline visits per lookup/insertion. The memory both
for vestigial PCBs and for elements of the PCB hashtable come from
fixed-size pools, and linked data structures exploit this to conserve
memory by representing references with a narrow index/offset from the
start of a pool instead of a pointer. When space for new vestigial PCBs
runs out, VTW makes room by discarding old vestigial PCBs, oldest first.
VTW cooperates with MSLT.
It may help to think of VTW as a "FIN cache" by analogy to the SYN
cache.
A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT
sessions as fast as it can is approximately 17% idle when VTW is active
versus 0% idle when VTW is inactive. It has 103 megabytes more free RAM
when VTW is active (approximately 64k vestigial PCBs are created) than
when it is inactive.
2011-05-03 22:28:44 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If srtt exceeds .01 second, ensure we use the 'remote' MSL
|
|
|
|
* Problem is: it doesn't work. Disabled by defaulting
|
|
|
|
* tcp_rttlocal to 0; see corresponding code in
|
|
|
|
* tcp_subr that selects local vs remote in a different way.
|
|
|
|
*
|
|
|
|
* The static branch prediction hint here should be removed
|
|
|
|
* when the rtt estimator is fixed and the rtt_enable code
|
|
|
|
* is turned back on.
|
|
|
|
*/
|
|
|
|
if (__predict_false(tcp_rttlocal) && tcp_msl_enable
|
|
|
|
&& tp->t_srtt > tcp_msl_remote_threshold
|
|
|
|
&& tp->t_msl < tcp_msl_remote) {
|
|
|
|
tp->t_msl = tcp_msl_remote;
|
|
|
|
}
|
1993-03-21 12:45:37 +03:00
|
|
|
} else {
|
2002-06-09 20:33:36 +04:00
|
|
|
/*
|
2011-04-20 17:35:51 +04:00
|
|
|
* This is the first measurement. Per RFC2988, 2.2,
|
|
|
|
* set srtt=R and rttvar=R/2.
|
|
|
|
* For srtt, storage representation is 1/32 ticks,
|
|
|
|
* so shift left by 5.
|
2011-04-20 18:08:07 +04:00
|
|
|
* For rttvar, storage representation is 1/16 ticks,
|
2011-04-20 17:35:51 +04:00
|
|
|
* so shift left by 4, but then right by 1 to halve.
|
1993-03-21 12:45:37 +03:00
|
|
|
*/
|
1995-06-12 00:39:22 +04:00
|
|
|
tp->t_srtt = rtt << (TCP_RTT_SHIFT + 2);
|
|
|
|
tp->t_rttvar = rtt << (TCP_RTTVAR_SHIFT + 2 - 1);
|
1993-03-21 12:45:37 +03:00
|
|
|
}
|
2001-09-10 19:23:09 +04:00
|
|
|
tp->t_rtttime = 0;
|
1993-03-21 12:45:37 +03:00
|
|
|
tp->t_rxtshift = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* the retransmit should happen at rtt + 4 * rttvar.
|
|
|
|
* Because of the way we do the smoothing, srtt and rttvar
|
|
|
|
* will each average +1/2 tick of bias. When we compute
|
|
|
|
* the retransmit timer, we want 1/2 tick of rounding and
|
|
|
|
* 1 extra tick because of +-1/2 tick uncertainty in the
|
|
|
|
* firing of the timer. The bias will give us exactly the
|
|
|
|
* 1.5 tick we need. But, because the bias is
|
|
|
|
* statistical, we have to test that we don't drop below
|
|
|
|
* the minimum feasible timer (which is 2 ticks).
|
|
|
|
*/
|
2001-09-10 19:23:09 +04:00
|
|
|
TCPT_RANGESET(tp->t_rxtcur, TCP_REXMTVAL(tp),
|
|
|
|
max(tp->t_rttmin, rtt + 2), TCPTV_REXMTMAX);
|
2002-06-09 20:33:36 +04:00
|
|
|
|
1993-03-21 12:45:37 +03:00
|
|
|
/*
|
|
|
|
* We received an ack for a packet that wasn't retransmitted;
|
|
|
|
* it is probably safe to discard any error indications we've
|
|
|
|
* received recently. This isn't quite right, but close enough
|
|
|
|
* for now (a route might have failed after we sent a segment,
|
|
|
|
* and the return path might not be symmetrical).
|
|
|
|
*/
|
|
|
|
tp->t_softerror = 0;
|
|
|
|
}
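The fixed-point arithmetic in tcp_xmit_timer() is easier to follow with concrete numbers. The following is a minimal userland sketch of just the smoothing step, assuming a gain of 1/8 for srtt (shift 3) and 1/4 for rttvar (shift 2) as the comments above describe. `struct rtt_state`, `rtt_update`, `RTT_SHIFT` and `RTTVAR_SHIFT` are hypothetical names; the MSL check and the retransmit timer update are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed shifts, taken from the comments in tcp_xmit_timer(). */
#define RTT_SHIFT    3	/* srtt gain is 1/8 */
#define RTTVAR_SHIFT 2	/* rttvar gain is 1/4 */

struct rtt_state {
	int32_t srtt;	/* smoothed RTT, in 1/32 tick */
	int32_t rttvar;	/* RTT variance, in 1/16 tick */
};

/* Fold one RTT sample (in whole ticks) into the estimator. */
static void
rtt_update(struct rtt_state *s, uint32_t rtt)
{
	int32_t delta;

	assert(rtt < (1u << 20));	/* keep the shifts inside int32_t */
	if (s->srtt != 0) {
		/* 1/8 of (sample - srtt), expressed in 1/32 tick. */
		delta = (int32_t)(rtt << 2) - (s->srtt >> RTT_SHIFT);
		if ((s->srtt += delta) <= 0)
			s->srtt = 1 << 2;
		/* 1/4 of (|sample - srtt| - rttvar), in 1/16 tick. */
		if (delta < 0)
			delta = -delta;
		delta -= (s->rttvar >> RTTVAR_SHIFT);
		if ((s->rttvar += delta) <= 0)
			s->rttvar = 1 << 2;
	} else {
		/* First sample: srtt = R, rttvar = R/2. */
		s->srtt = (int32_t)(rtt << (RTT_SHIFT + 2));
		s->rttvar = (int32_t)(rtt << (RTTVAR_SHIFT + 2 - 1));
	}
}
```

Starting from a zeroed state, a first sample of 4 ticks stores srtt = 128 (4 ticks in 1/32 units) and rttvar = 32 (2 ticks in 1/16 units). A second sample of 4 ticks leaves srtt at 128 and decays rttvar to 24.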
|
|
|
|
|
1998-10-05 01:33:52 +04:00
|
|
|
|
1997-07-24 01:26:40 +04:00
|
|
|
/*
|
|
|
|
* TCP compressed state engine. Currently used to hold compressed
|
|
|
|
* state for SYN_RECEIVED.
|
|
|
|
*/
|
|
|
|
|
|
|
|
u_long syn_cache_count;
|
|
|
|
u_int32_t syn_hash1, syn_hash2;
|
|
|
|
|
|
|
|
#define SYN_HASH(sa, sp, dp) \
|
|
|
|
((((sa)->s_addr^syn_hash1)*(((((u_int32_t)(dp))<<16) + \
|
1998-04-03 12:02:45 +04:00
|
|
|
((u_int32_t)(sp)))^syn_hash2)))
|
1999-07-01 12:12:45 +04:00
|
|
|
#ifndef INET6
|
|
|
|
#define SYN_HASHALL(hash, src, dst) \
|
|
|
|
do { \
|
2005-05-30 01:41:23 +04:00
|
|
|
hash = SYN_HASH(&((const struct sockaddr_in *)(src))->sin_addr, \
|
|
|
|
((const struct sockaddr_in *)(src))->sin_port, \
|
|
|
|
((const struct sockaddr_in *)(dst))->sin_port); \
|
2002-11-02 10:20:42 +03:00
|
|
|
} while (/*CONSTCOND*/ 0)
|
1999-07-01 12:12:45 +04:00
|
|
|
#else
|
|
|
|
#define SYN_HASH6(sa, sp, dp) \
|
|
|
|
((((sa)->s6_addr32[0] ^ (sa)->s6_addr32[3] ^ syn_hash1) * \
|
|
|
|
(((((u_int32_t)(dp))<<16) + ((u_int32_t)(sp)))^syn_hash2)) \
|
|
|
|
& 0x7fffffff)
|
|
|
|
|
|
|
|
#define SYN_HASHALL(hash, src, dst) \
|
|
|
|
do { \
|
|
|
|
switch ((src)->sa_family) { \
|
|
|
|
case AF_INET: \
|
2005-05-30 01:41:23 +04:00
|
|
|
hash = SYN_HASH(&((const struct sockaddr_in *)(src))->sin_addr, \
|
|
|
|
((const struct sockaddr_in *)(src))->sin_port, \
|
|
|
|
((const struct sockaddr_in *)(dst))->sin_port); \
|
1999-07-01 12:12:45 +04:00
|
|
|
break; \
|
|
|
|
case AF_INET6: \
|
2005-05-30 01:41:23 +04:00
|
|
|
hash = SYN_HASH6(&((const struct sockaddr_in6 *)(src))->sin6_addr, \
|
|
|
|
((const struct sockaddr_in6 *)(src))->sin6_port, \
|
|
|
|
((const struct sockaddr_in6 *)(dst))->sin6_port); \
|
1999-07-01 12:12:45 +04:00
|
|
|
break; \
|
|
|
|
default: \
|
|
|
|
hash = 0; \
|
|
|
|
} \
|
2001-09-12 01:03:20 +04:00
|
|
|
} while (/*CONSTCOND*/0)
|
1999-07-01 12:12:45 +04:00
|
|
|
#endif /* INET6 */
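The IPv4 hash above mixes the source address and both ports with two secrets and is later reduced modulo the table size. A small sketch of the same arithmetic as plain functions follows, with the secrets passed as parameters rather than read from globals. `syn_hash4` and `bucket_index` are hypothetical names, and the bucket count in the usage note is only an example value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same mixing as SYN_HASH: (addr ^ secret1) * (ports ^ secret2),
 * with the destination port in the upper 16 bits of the port word.
 * Unsigned 32-bit multiplication wraps modulo 2^32.
 */
static uint32_t
syn_hash4(uint32_t addr, uint16_t sport, uint16_t dport,
    uint32_t secret1, uint32_t secret2)
{
	uint32_t ports = ((uint32_t)dport << 16) + (uint32_t)sport;

	return (addr ^ secret1) * (ports ^ secret2);
}

/* Reduce a hash to a bucket index, as syn_cache_insert() does. */
static uint32_t
bucket_index(uint32_t hash, uint32_t nbuckets)
{
	assert(nbuckets != 0);
	return hash % nbuckets;
}
```

For example, `syn_hash4(3, 2, 1, 1, 0)` is `(3 ^ 1) * 65538 = 131076`, which lands in bucket 105 of a 293-bucket table. Changing either secret changes the bucket an address and port pair maps to, which is why the secrets are re-drawn whenever the cache is empty.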
|
1997-07-24 01:26:40 +04:00
|
|
|
|
2009-01-29 23:38:22 +03:00
|
|
|
static struct pool syn_cache_pool;
|
1998-08-02 04:35:51 +04:00
|
|
|
|
1999-04-29 07:54:22 +04:00
|
|
|
/*
|
|
|
|
* We don't estimate RTT with SYNs, so each packet starts with the default
|
2001-09-12 01:03:20 +04:00
|
|
|
* RTT and each timer step has a fixed timeout value.
|
1999-04-29 07:54:22 +04:00
|
|
|
*/
|
|
|
|
#define SYN_CACHE_TIMER_ARM(sc) \
|
|
|
|
do { \
|
|
|
|
TCPT_RANGESET((sc)->sc_rxtcur, \
|
|
|
|
TCPTV_SRTTDFLT * tcp_backoff[(sc)->sc_rxtshift], TCPTV_MIN, \
|
|
|
|
TCPTV_REXMTMAX); \
|
2001-09-12 01:03:20 +04:00
|
|
|
callout_reset(&(sc)->sc_timer, \
|
|
|
|
(sc)->sc_rxtcur * (hz / PR_SLOWHZ), syn_cache_timer, (sc)); \
|
|
|
|
} while (/*CONSTCOND*/0)
|
1999-04-29 07:54:22 +04:00
|
|
|
|
Two changes, designed to make us even more resilient against TCP
ISS attacks (which we already fend off quite well).
1. First-cut implementation of RFC1948, Steve Bellovin's cryptographic
hash method of generating TCP ISS values. Note, this code is experimental
and disabled by default (experimental enough that I don't export the
variable via sysctl yet, either). There are a couple of issues I'd
like to discuss with Steve, so this code should only be used by people
who really know what they're doing.
2. Per a recent thread on Bugtraq, it's possible to determine a system's
uptime by snooping the RFC1323 TCP timestamp options sent by a host; in
4.4BSD, timestamps are created by incrementing the tcp_now variable
at 2 Hz; there's even a company out there that uses this to determine
web server uptime. According to Newsham's paper "The Problem With
Random Increments", while NetBSD's TCP ISS generation method is much
better than the "random increment" method used by FreeBSD and OpenBSD,
it is still theoretically possible to mount an attack against NetBSD's
method if the attacker knows how many times the tcp_iss_seq variable
has been incremented. By not leaking uptime information, we can make
that much harder to determine. So, we avoid the leak by giving each
TCP connection a timebase of 0.
2001-03-20 23:07:51 +03:00
|
|
|
#define SYN_CACHE_TIMESTAMP(sc) (tcp_now - (sc)->sc_timebase)
|
|
|
|
|
2007-11-10 02:55:58 +03:00
|
|
|
static inline void
|
|
|
|
syn_cache_rm(struct syn_cache *sc)
|
|
|
|
{
|
|
|
|
TAILQ_REMOVE(&tcp_syn_cache[sc->sc_bucketidx].sch_bucket,
|
|
|
|
sc, sc_bucketq);
|
|
|
|
sc->sc_tp = NULL;
|
|
|
|
LIST_REMOVE(sc, sc_tpq);
|
|
|
|
tcp_syn_cache[sc->sc_bucketidx].sch_length--;
|
|
|
|
callout_stop(&sc->sc_timer);
|
|
|
|
syn_cache_count--;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void
|
|
|
|
syn_cache_put(struct syn_cache *sc)
|
|
|
|
{
|
|
|
|
if (sc->sc_ipopts)
|
|
|
|
(void) m_free(sc->sc_ipopts);
|
|
|
|
rtcache_free(&sc->sc_route);
|
2010-04-22 00:40:16 +04:00
|
|
|
sc->sc_flags |= SCF_DEAD;
|
|
|
|
if (!callout_invoking(&sc->sc_timer))
|
|
|
|
callout_schedule(&(sc)->sc_timer, 1);
|
2007-11-10 02:55:58 +03:00
|
|
|
}
|
|
|
|
|
1998-05-07 05:37:27 +04:00
|
|
|
void
|
2005-02-04 02:39:32 +03:00
|
|
|
syn_cache_init(void)
|
1998-05-07 05:37:27 +04:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2009-01-29 23:38:22 +03:00
|
|
|
pool_init(&syn_cache_pool, sizeof(struct syn_cache), 0, 0, 0,
|
|
|
|
"synpl", NULL, IPL_SOFTNET);
|
|
|
|
|
1999-04-29 07:54:22 +04:00
|
|
|
/* Initialize the hash buckets. */
|
1998-05-07 05:37:27 +04:00
|
|
|
for (i = 0; i < tcp_syn_cache_size; i++)
|
2001-09-12 01:03:20 +04:00
|
|
|
TAILQ_INIT(&tcp_syn_cache[i].sch_bucket);
|
1997-07-24 01:26:40 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2005-02-04 02:39:32 +03:00
|
|
|
syn_cache_insert(struct syn_cache *sc, struct tcpcb *tp)
|
1997-07-24 01:26:40 +04:00
|
|
|
{
|
1999-04-29 07:54:22 +04:00
|
|
|
struct syn_cache_head *scp;
|
1997-07-24 01:26:40 +04:00
|
|
|
struct syn_cache *sc2;
|
2001-09-12 01:03:20 +04:00
|
|
|
int s;
|
1997-07-24 01:26:40 +04:00
|
|
|
|
1998-05-07 05:37:27 +04:00
|
|
|
/*
|
|
|
|
* If there are no entries in the hash table, reinitialize
|
|
|
|
* the hash secrets.
|
|
|
|
*/
|
1997-07-24 01:26:40 +04:00
|
|
|
if (syn_cache_count == 0) {
|
First step of random number subsystem rework described in
<20111022023242.BA26F14A158@mail.netbsd.org>. This change includes
the following:
An initial cleanup and minor reorganization of the entropy pool
code in sys/dev/rnd.c and sys/dev/rndpool.c. Several bugs are
fixed. Some effort is made to accumulate entropy more quickly at
boot time.
A generic interface, "rndsink", is added, for stream generators to
request that they be re-keyed with good quality entropy from the pool
as soon as it is available.
The arc4random()/arc4randbytes() implementation in libkern is
adjusted to use the rndsink interface for rekeying, which helps
address the problem of low-quality keys at boot time.
An implementation of the FIPS 140-2 statistical tests for random
number generator quality is provided (libkern/rngtest.c). This
is based on Greg Rose's implementation from Qualcomm.
A new random stream generator, nist_ctr_drbg, is provided. It is
based on an implementation of the NIST SP800-90 CTR_DRBG by
Henric Jungheim. This generator uses AES in a modified counter
mode to generate a backtracking-resistant random stream.
An abstraction layer, "cprng", is provided for in-kernel consumers
of randomness. The arc4random/arc4randbytes API is deprecated for
in-kernel use. It is replaced by "cprng_strong". The current
cprng_fast implementation wraps the existing arc4random
implementation. The current cprng_strong implementation wraps the
new CTR_DRBG implementation. Both interfaces are rekeyed from
the entropy pool automatically at intervals justifiable from best
current cryptographic practice.
In some quick tests, cprng_fast() is about the same speed as
the old arc4randbytes(), and cprng_strong() is about 20% faster
than rnd_extract_data(). Performance is expected to improve.
The AES code in src/crypto/rijndael is no longer an optional
kernel component, as it is required by cprng_strong, which is
not an optional kernel component.
The entropy pool output is subjected to the rngtest tests at
startup time; if it fails, the system will reboot. There is
approximately a 3/10000 chance of a false positive from these
tests. Entropy pool _input_ from hardware random numbers is
subjected to the rngtest tests at attach time, as well as the
FIPS continuous-output test, to detect bad or stuck hardware
RNGs; if any are detected, they are detached, but the system
continues to run.
A problem with rndctl(8) is fixed -- datastructures with
pointers in arrays are no longer passed to userspace (this
was not a security problem, but rather a major issue for
compat32). A new kernel will require a new rndctl.
The sysctl kern.arandom() and kern.urandom() nodes are hooked
up to the new generators, but the /dev/*random pseudodevices
are not, yet.
Manual pages for the new kernel interfaces are forthcoming.
2011-11-20 02:51:18 +04:00
|
|
|
syn_hash1 = cprng_fast32();
|
|
|
|
syn_hash2 = cprng_fast32();
|
1997-07-24 01:26:40 +04:00
|
|
|
}
|
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
SYN_HASHALL(sc->sc_hash, &sc->sc_src.sa, &sc->sc_dst.sa);
|
1999-04-29 07:54:22 +04:00
|
|
|
sc->sc_bucketidx = sc->sc_hash % tcp_syn_cache_size;
|
|
|
|
scp = &tcp_syn_cache[sc->sc_bucketidx];
|
1997-07-24 01:26:40 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure that we don't overflow the per-bucket
|
|
|
|
* limit or the total cache size limit.
|
|
|
|
*/
|
|
|
|
s = splsoftnet();
|
|
|
|
if (scp->sch_length >= tcp_syn_bucket_limit) {
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_SC_BUCKETOVERFLOW);
|
1998-05-07 05:37:27 +04:00
|
|
|
/*
|
1999-04-29 07:54:22 +04:00
|
|
|
* The bucket is full. Toss the oldest element in the
|
2001-09-12 01:03:20 +04:00
|
|
|
* bucket. This will be the first entry in the bucket.
|
1999-04-29 07:54:22 +04:00
|
|
|
*/
|
2001-09-12 01:03:20 +04:00
|
|
|
sc2 = TAILQ_FIRST(&scp->sch_bucket);
|
1999-04-29 07:54:22 +04:00
|
|
|
#ifdef DIAGNOSTIC
|
|
|
|
/*
|
|
|
|
* This should never happen; we should always find an
|
|
|
|
* entry in our bucket.
|
1998-05-07 05:37:27 +04:00
|
|
|
*/
|
2001-09-12 01:03:20 +04:00
|
|
|
if (sc2 == NULL)
|
|
|
|
panic("syn_cache_insert: bucketoverflow: impossible");
|
1999-04-29 07:54:22 +04:00
|
|
|
#endif
|
2007-11-10 02:55:58 +03:00
|
|
|
syn_cache_rm(sc2);
|
|
|
|
syn_cache_put(sc2); /* calls pool_put but see spl above */
|
1997-07-24 01:26:40 +04:00
|
|
|
} else if (syn_cache_count >= tcp_syn_cache_limit) {
|
2001-09-12 01:03:20 +04:00
|
|
|
struct syn_cache_head *scp2, *sce;
|
|
|
|
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_SC_OVERFLOWED);
|
1997-07-24 01:26:40 +04:00
|
|
|
/*
|
1999-04-29 07:54:22 +04:00
|
|
|
* The cache is full. Toss the oldest entry in the
|
2001-09-12 01:03:20 +04:00
|
|
|
* first non-empty bucket we can find.
|
|
|
|
*
|
|
|
|
* XXX We would really like to toss the oldest
|
|
|
|
* entry in the cache, but we hope that this
|
|
|
|
* condition doesn't happen very often.
|
1997-07-24 01:26:40 +04:00
|
|
|
*/
|
2001-09-12 01:03:20 +04:00
|
|
|
scp2 = scp;
|
|
|
|
if (TAILQ_EMPTY(&scp2->sch_bucket)) {
|
|
|
|
sce = &tcp_syn_cache[tcp_syn_cache_size];
|
|
|
|
for (++scp2; scp2 != scp; scp2++) {
|
|
|
|
if (scp2 >= sce)
|
|
|
|
scp2 = &tcp_syn_cache[0];
|
|
|
|
if (! TAILQ_EMPTY(&scp2->sch_bucket))
|
|
|
|
break;
|
|
|
|
}
|
1999-04-29 07:54:22 +04:00
|
|
|
#ifdef DIAGNOSTIC
|
2001-09-12 01:03:20 +04:00
|
|
|
/*
|
|
|
|
* This should never happen; we should always find a
|
|
|
|
* non-empty bucket.
|
|
|
|
*/
|
|
|
|
if (scp2 == scp)
|
|
|
|
panic("syn_cache_insert: cacheoverflow: "
|
|
|
|
"impossible");
|
1999-04-29 07:54:22 +04:00
|
|
|
#endif
|
2001-09-12 01:03:20 +04:00
|
|
|
}
|
|
|
|
sc2 = TAILQ_FIRST(&scp2->sch_bucket);
|
2007-11-10 02:55:58 +03:00
|
|
|
syn_cache_rm(sc2);
|
|
|
|
syn_cache_put(sc2); /* calls pool_put but see spl above */
|
1997-07-24 01:26:40 +04:00
|
|
|
}
|
|
|
|
|
1999-04-29 07:54:22 +04:00
|
|
|
/*
|
|
|
|
* Initialize the entry's timer.
|
|
|
|
*/
|
|
|
|
sc->sc_rxttot = 0;
|
|
|
|
sc->sc_rxtshift = 0;
|
|
|
|
SYN_CACHE_TIMER_ARM(sc);
|
1997-07-24 01:26:40 +04:00
|
|
|
|
1999-08-25 19:23:12 +04:00
|
|
|
/* Link it from tcpcb entry */
|
|
|
|
LIST_INSERT_HEAD(&tp->t_sc, sc, sc_tpq);
|
|
|
|
|
1998-05-07 05:37:27 +04:00
|
|
|
/* Put it into the bucket. */
|
2001-09-12 01:03:20 +04:00
|
|
|
TAILQ_INSERT_TAIL(&scp->sch_bucket, sc, sc_bucketq);
|
1999-04-29 07:54:22 +04:00
|
|
|
scp->sch_length++;
|
1998-05-07 05:37:27 +04:00
|
|
|
syn_cache_count++;
|
|
|
|
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_SC_ADDED);
|
1997-07-24 01:26:40 +04:00
|
|
|
splx(s);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
1999-04-29 07:54:22 +04:00
|
|
|
* Walk the timer queues, looking for SYN,ACKs that need to be retransmitted.
|
|
|
|
* If we have retransmitted an entry the maximum number of times, expire
|
|
|
|
* that entry.
|
1997-07-24 01:26:40 +04:00
|
|
|
*/
|
|
|
|
void
|
2001-09-12 01:03:20 +04:00
|
|
|
syn_cache_timer(void *arg)
|
1997-07-24 01:26:40 +04:00
|
|
|
{
|
2001-09-12 01:03:20 +04:00
|
|
|
struct syn_cache *sc = arg;
|
1997-07-24 01:26:40 +04:00
|
|
|
|
2008-04-24 15:38:36 +04:00
|
|
|
mutex_enter(softnet_lock);
|
|
|
|
KERNEL_LOCK(1, NULL);
|
2003-07-20 20:35:07 +04:00
|
|
|
callout_ack(&sc->sc_timer);
|
|
|
|
|
|
|
|
if (__predict_false(sc->sc_flags & SCF_DEAD)) {
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_SC_DELAYED_FREE);
|
2007-07-10 00:51:58 +04:00
|
|
|
callout_destroy(&sc->sc_timer);
|
2003-07-20 20:35:07 +04:00
|
|
|
pool_put(&syn_cache_pool, sc);
|
2008-04-24 15:38:36 +04:00
|
|
|
KERNEL_UNLOCK_ONE(NULL);
|
|
|
|
mutex_exit(softnet_lock);
|
2003-07-20 20:35:07 +04:00
|
|
|
return;
|
|
|
|
}
|
1999-04-29 07:54:22 +04:00
|
|
|
|
2001-09-12 01:03:20 +04:00
|
|
|
if (__predict_false(sc->sc_rxtshift == TCP_MAXRXTSHIFT)) {
|
|
|
|
/* Drop it -- too many retransmissions. */
|
|
|
|
goto dropit;
|
|
|
|
}
|
|
|
|
|
1999-04-29 07:54:22 +04:00
|
|
|
/*
|
2001-09-12 01:03:20 +04:00
|
|
|
* Compute the total amount of time this entry has
|
|
|
|
* been on a queue. If this entry has been on longer
|
|
|
|
* than the keep alive timer would allow, expire it.
|
1999-04-29 07:54:22 +04:00
|
|
|
*/
|
2001-09-12 01:03:20 +04:00
|
|
|
sc->sc_rxttot += sc->sc_rxtcur;
|
2007-06-20 19:29:17 +04:00
|
|
|
if (sc->sc_rxttot >= tcp_keepinit)
|
2001-09-12 01:03:20 +04:00
|
|
|
goto dropit;
|
1999-04-29 07:54:22 +04:00
|
|
|
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_SC_RETRANSMITTED);
|
2001-09-12 01:03:20 +04:00
|
|
|
(void) syn_cache_respond(sc, NULL);
|
1999-04-29 07:54:22 +04:00
|
|
|
|
2001-09-12 01:03:20 +04:00
|
|
|
/* Advance the timer back-off. */
|
|
|
|
sc->sc_rxtshift++;
|
|
|
|
SYN_CACHE_TIMER_ARM(sc);
|
1999-04-29 07:54:22 +04:00
|
|
|
|
2008-04-24 15:38:36 +04:00
|
|
|
KERNEL_UNLOCK_ONE(NULL);
|
|
|
|
mutex_exit(softnet_lock);
|
2001-09-12 01:03:20 +04:00
|
|
|
return;
|
1999-04-29 07:54:22 +04:00
|
|
|
|
2001-09-12 01:03:20 +04:00
|
|
|
dropit:
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_SC_TIMED_OUT);
|
2007-11-10 02:55:58 +03:00
|
|
|
syn_cache_rm(sc);
|
2010-04-22 00:40:16 +04:00
|
|
|
if (sc->sc_ipopts)
|
|
|
|
(void) m_free(sc->sc_ipopts);
|
|
|
|
rtcache_free(&sc->sc_route);
|
|
|
|
callout_destroy(&sc->sc_timer);
|
|
|
|
pool_put(&syn_cache_pool, sc);
|
2008-04-24 15:38:36 +04:00
|
|
|
KERNEL_UNLOCK_ONE(NULL);
|
|
|
|
mutex_exit(softnet_lock);
|
1997-07-24 01:26:40 +04:00
|
|
|
}
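The retransmit-or-expire decision made by syn_cache_timer() can be sketched on its own. This is a minimal userland sketch, not the kernel code: `struct entry`, `MAX_SHIFT`, `KEEPINIT_TICKS`, `BASE_RTO_TICKS` and the plain doubling in `entry_arm()` are hypothetical stand-ins for the syn cache entry, TCP_MAXRXTSHIFT, tcp_keepinit and SYN_CACHE_TIMER_ARM, whose real values and backoff schedule are defined elsewhere.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for TCP_MAXRXTSHIFT, tcp_keepinit and the base RTO. */
enum { MAX_SHIFT = 12, KEEPINIT_TICKS = 150, BASE_RTO_TICKS = 6 };

struct entry {
	int rxtshift;	/* number of retransmissions so far */
	int rxtcur;	/* current timeout, in ticks */
	int rxttot;	/* total time spent waiting, in ticks */
};

/* Assumed backoff: double the timeout on every retransmission. */
static void
entry_arm(struct entry *e)
{
	e->rxtcur = BASE_RTO_TICKS << e->rxtshift;
}

/*
 * Called when the timer for an entry fires.  Returns true if the
 * SYN,ACK should be retransmitted and the timer re-armed, false if
 * the entry has expired and should be dropped.
 */
static bool
entry_timer_fired(struct entry *e)
{
	if (e->rxtshift == MAX_SHIFT)
		return false;		/* too many retransmissions */

	e->rxttot += e->rxtcur;		/* account for the wait just ended */
	if (e->rxttot >= KEEPINIT_TICKS)
		return false;		/* waited longer than keep-init allows */

	e->rxtshift++;			/* advance the timer back-off */
	entry_arm(e);
	return true;
}
```

With these assumed constants the total-time limit, not the shift limit, is what ends the entry's life; which of the two checks triggers first in the kernel depends on the real tunables.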
|
|
|
|
|
1999-08-25 19:23:12 +04:00
|
|
|
/*
|
|
|
|
* Remove syn cache entries created by the specified tcb entry,
|
|
|
|
* because it does not make sense to keep them
|
|
|
|
* (if there's no tcb entry, syn cache entry will never be used)
|
|
|
|
*/
|
|
|
|
void
|
2005-02-04 02:39:32 +03:00
|
|
|
syn_cache_cleanup(struct tcpcb *tp)
|
1999-08-25 19:23:12 +04:00
|
|
|
{
|
|
|
|
struct syn_cache *sc, *nsc;
|
|
|
|
int s;
|
|
|
|
|
|
|
|
s = splsoftnet();
|
|
|
|
|
|
|
|
for (sc = LIST_FIRST(&tp->t_sc); sc != NULL; sc = nsc) {
|
|
|
|
nsc = LIST_NEXT(sc, sc_tpq);
|
|
|
|
|
|
|
|
#ifdef DIAGNOSTIC
|
|
|
|
if (sc->sc_tp != tp)
|
|
|
|
panic("invalid sc_tp in syn_cache_cleanup");
|
|
|
|
#endif
|
2007-11-10 02:55:58 +03:00
|
|
|
syn_cache_rm(sc);
|
|
|
|
syn_cache_put(sc); /* calls pool_put but see spl above */
|
1999-08-25 19:23:12 +04:00
|
|
|
}
|
|
|
|
/* just for safety */
|
|
|
|
LIST_INIT(&tp->t_sc);
|
|
|
|
|
|
|
|
splx(s);
|
|
|
|
}
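The loop in syn_cache_cleanup() reads the next pointer before it releases the current entry. A minimal sketch of that traversal pattern follows, using a hand-rolled singly linked list rather than the kernel's LIST macros; `struct node`, `list_push()` and `list_drain()` are hypothetical names used only for illustration.

```c
#include <stdlib.h>

struct node {
	struct node *next;
	int owner;	/* hypothetical stand-in for the sc_tp back-pointer */
};

/*
 * Free every node on the list and leave the head empty.  The next
 * pointer is read before the current node is freed, because the
 * node's memory must not be touched afterwards.  Returns the number
 * of nodes freed.
 */
static int
list_drain(struct node **head)
{
	struct node *n, *next;
	int freed = 0;

	for (n = *head; n != NULL; n = next) {
		next = n->next;		/* save before free() */
		free(n);
		freed++;
	}
	*head = NULL;			/* reset the head, just for safety */
	return freed;
}

/* Push a new node on the front; returns 0 on success, -1 if out of memory. */
static int
list_push(struct node **head, int owner)
{
	struct node *n = malloc(sizeof(*n));

	if (n == NULL)
		return -1;
	n->owner = owner;
	n->next = *head;
	*head = n;
	return 0;
}
```

Advancing with `n = n->next` in the loop header instead would read freed memory on every iteration, which is why the saved `next` is needed.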
|
|
|
|
|
1997-07-24 01:26:40 +04:00
|
|
|
/*
|
|
|
|
* Find an entry in the syn cache.
|
|
|
|
*/
|
|
|
|
struct syn_cache *
|
2005-05-30 01:41:23 +04:00
|
|
|
syn_cache_lookup(const struct sockaddr *src, const struct sockaddr *dst,
|
2005-02-04 02:39:32 +03:00
|
|
|
struct syn_cache_head **headp)
|
1997-07-24 01:26:40 +04:00
|
|
|
{
|
1998-05-07 05:37:27 +04:00
|
|
|
struct syn_cache *sc;
|
|
|
|
struct syn_cache_head *scp;
|
1997-07-24 01:26:40 +04:00
|
|
|
u_int32_t hash;
|
|
|
|
int s;
|
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
SYN_HASHALL(hash, src, dst);
|
1997-07-24 01:26:40 +04:00
|
|
|
|
1998-05-07 05:37:27 +04:00
|
|
|
scp = &tcp_syn_cache[hash % tcp_syn_cache_size];
|
|
|
|
*headp = scp;
|
1997-07-24 01:26:40 +04:00
|
|
|
s = splsoftnet();
|
2001-09-12 01:03:20 +04:00
|
|
|
for (sc = TAILQ_FIRST(&scp->sch_bucket); sc != NULL;
|
|
|
|
sc = TAILQ_NEXT(sc, sc_bucketq)) {
|
1997-07-24 01:26:40 +04:00
|
|
|
if (sc->sc_hash != hash)
|
|
|
|
continue;
|
2009-03-18 18:14:29 +03:00
|
|
|
if (!memcmp(&sc->sc_src, src, src->sa_len) &&
|
|
|
|
!memcmp(&sc->sc_dst, dst, dst->sa_len)) {
|
1997-07-24 01:26:40 +04:00
|
|
|
splx(s);
|
|
|
|
return (sc);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
splx(s);
|
|
|
|
return (NULL);
|
|
|
|
}
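syn_cache_lookup() picks a bucket from the hash, rejects entries whose cached hash differs, and only then compares the addresses byte for byte. The sketch below shows the same shape in userland. Everything in it is an assumption made for illustration: `struct key`, `struct item`, `NBUCKETS`, and the FNV-1a `hash_pair()` (the kernel computes its hash with SYN_HASHALL and sizes the table with tcp_syn_cache_size).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS 8	/* hypothetical table size */

struct key {
	uint8_t len;		/* number of valid bytes in bytes[] */
	uint8_t bytes[15];
};

struct item {
	struct item *next;	/* chain within one bucket */
	uint32_t hash;		/* full hash, cached at insert time */
	struct key src, dst;
};

static struct item *table[NBUCKETS];

/* Toy hash (FNV-1a over both keys); an assumption, not SYN_HASHALL. */
static uint32_t
hash_pair(const struct key *src, const struct key *dst)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < src->len; i++)
		h = (h ^ src->bytes[i]) * 16777619u;
	for (i = 0; i < dst->len; i++)
		h = (h ^ dst->bytes[i]) * 16777619u;
	return h;
}

static int
key_equal(const struct key *a, const struct key *b)
{
	return a->len == b->len && memcmp(a->bytes, b->bytes, a->len) == 0;
}

static void
item_insert(struct item *it)
{
	struct item **bucket;

	it->hash = hash_pair(&it->src, &it->dst);
	bucket = &table[it->hash % NBUCKETS];
	it->next = *bucket;
	*bucket = it;
}

static struct item *
item_lookup(const struct key *src, const struct key *dst)
{
	uint32_t hash = hash_pair(src, dst);
	struct item *it;

	for (it = table[hash % NBUCKETS]; it != NULL; it = it->next) {
		if (it->hash != hash)
			continue;	/* cheap reject before the byte compare */
		if (key_equal(&it->src, src) && key_equal(&it->dst, dst))
			return it;
	}
	return NULL;
}
```

Caching the full hash in each entry lets most non-matching entries in a shared bucket be skipped with a single integer comparison.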
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function gets called when we receive an ACK for a
|
|
|
|
* socket in the LISTEN state. We look up the connection
|
|
|
|
* in the syn cache, and if it's there, we pull it out of
|
|
|
|
* the cache and turn it into a full-blown connection in
|
|
|
|
* the SYN-RECEIVED state.
|
|
|
|
*
|
|
|
|
* The return values may not be immediately obvious, and their effects
|
|
|
|
* can be subtle, so here they are:
|
|
|
|
*
|
|
|
|
* NULL SYN was not found in cache; caller should drop the
|
|
|
|
* packet and send an RST.
|
|
|
|
*
|
|
|
|
* -1 We were unable to create the new connection, and are
|
|
|
|
* aborting it. An ACK,RST is being sent to the peer
|
|
|
|
* (unless we got screwy sequence numbers; see below),
|
|
|
|
* because the 3-way handshake has been completed. Caller
|
|
|
|
* should not free the mbuf, since we may be using it. If
|
|
|
|
* we are not, we will free it.
|
|
|
|
*
|
|
|
|
* Otherwise, the return value is a pointer to the new socket
|
|
|
|
* associated with the connection.
|
|
|
|
*/
|
|
|
|
struct socket *
|
2005-02-04 02:39:32 +03:00
|
|
|
syn_cache_get(struct sockaddr *src, struct sockaddr *dst,
|
2006-11-16 04:32:37 +03:00
|
|
|
struct tcphdr *th, unsigned int hlen, unsigned int tlen,
|
2005-02-04 02:39:32 +03:00
|
|
|
struct socket *so, struct mbuf *m)
|
1997-07-24 01:26:40 +04:00
|
|
|
{
|
1998-05-07 05:37:27 +04:00
|
|
|
struct syn_cache *sc;
|
|
|
|
struct syn_cache_head *scp;
|
2000-03-30 16:51:13 +04:00
|
|
|
struct inpcb *inp = NULL;
|
1999-07-01 12:12:45 +04:00
|
|
|
#ifdef INET6
|
2000-03-30 16:51:13 +04:00
|
|
|
struct in6pcb *in6p = NULL;
|
1999-07-01 12:12:45 +04:00
|
|
|
#endif
|
2000-03-30 16:51:13 +04:00
|
|
|
struct tcpcb *tp = 0;
|
1997-07-24 01:26:40 +04:00
|
|
|
struct mbuf *am;
|
|
|
|
int s;
|
1999-07-01 12:12:45 +04:00
|
|
|
struct socket *oso;
|
1997-07-24 01:26:40 +04:00
|
|
|
|
|
|
|
s = splsoftnet();
|
1999-07-01 12:12:45 +04:00
|
|
|
if ((sc = syn_cache_lookup(src, dst, &scp)) == NULL) {
|
1997-07-24 01:26:40 +04:00
|
|
|
splx(s);
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
1999-01-24 04:19:28 +03:00
|
|
|
* Verify the sequence and ack numbers. Try getting the correct
|
|
|
|
* response again.
|
1997-07-24 01:26:40 +04:00
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if ((th->th_ack != sc->sc_iss + 1) ||
|
|
|
|
SEQ_LEQ(th->th_seq, sc->sc_irs) ||
|
|
|
|
SEQ_GT(th->th_seq, sc->sc_irs + 1 + sc->sc_win)) {
|
1999-04-29 07:54:22 +04:00
|
|
|
(void) syn_cache_respond(sc, m);
|
1997-07-24 01:26:40 +04:00
|
|
|
splx(s);
|
|
|
|
return ((struct socket *)(-1));
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Remove this cache entry */
|
2007-11-10 02:55:58 +03:00
|
|
|
syn_cache_rm(sc);
|
1997-07-24 01:26:40 +04:00
|
|
|
splx(s);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ok, create the full blown connection, and set things up
|
|
|
|
* as they would have been set up if we had created the
|
|
|
|
* connection when the SYN arrived. If we can't create
|
|
|
|
* the connection, abort it.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
/*
|
|
|
|
* inp still has the OLD in_pcb stuff, set the
|
|
|
|
* v6-related flags on the new guy, too. This is
|
|
|
|
* done particularly for the case where an AF_INET6
|
|
|
|
* socket is bound only to a port, and a v4 connection
|
|
|
|
* comes in on that port.
|
2002-06-09 20:33:36 +04:00
|
|
|
* we also copy the flowinfo from the original pcb
|
1999-07-01 12:12:45 +04:00
|
|
|
* to the new one.
|
|
|
|
*/
|
|
|
|
oso = so;
|
1997-07-24 01:26:40 +04:00
|
|
|
so = sonewconn(so, SS_ISCONNECTED);
|
|
|
|
if (so == NULL)
|
|
|
|
goto resetandabort;
|
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
switch (so->so_proto->pr_domain->dom_family) {
|
2000-10-17 07:06:42 +04:00
|
|
|
#ifdef INET
|
1999-07-01 12:12:45 +04:00
|
|
|
case AF_INET:
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
break;
|
2000-10-17 07:06:42 +04:00
|
|
|
#endif
|
1999-07-01 12:12:45 +04:00
|
|
|
#ifdef INET6
|
|
|
|
case AF_INET6:
|
|
|
|
in6p = sotoin6pcb(so);
|
|
|
|
break;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
switch (src->sa_family) {
|
2000-10-17 07:06:42 +04:00
|
|
|
#ifdef INET
|
1999-07-01 12:12:45 +04:00
|
|
|
case AF_INET:
|
|
|
|
if (inp) {
|
|
|
|
inp->inp_laddr = ((struct sockaddr_in *)dst)->sin_addr;
|
|
|
|
inp->inp_lport = ((struct sockaddr_in *)dst)->sin_port;
|
|
|
|
inp->inp_options = ip_srcroute();
|
|
|
|
in_pcbstate(inp, INP_BOUND);
|
|
|
|
if (inp->inp_options == NULL) {
|
|
|
|
inp->inp_options = sc->sc_ipopts;
|
|
|
|
sc->sc_ipopts = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#ifdef INET6
|
|
|
|
else if (in6p) {
|
|
|
|
/* IPv4 packet to AF_INET6 socket */
|
2009-03-18 19:00:08 +03:00
|
|
|
memset(&in6p->in6p_laddr, 0, sizeof(in6p->in6p_laddr));
|
1999-07-01 12:12:45 +04:00
|
|
|
in6p->in6p_laddr.s6_addr16[5] = htons(0xffff);
|
|
|
|
bcopy(&((struct sockaddr_in *)dst)->sin_addr,
|
|
|
|
&in6p->in6p_laddr.s6_addr32[3],
|
|
|
|
sizeof(((struct sockaddr_in *)dst)->sin_addr));
|
|
|
|
in6p->in6p_lport = ((struct sockaddr_in *)dst)->sin_port;
|
|
|
|
in6totcpcb(in6p)->t_family = AF_INET;
|
2003-05-30 05:15:04 +04:00
|
|
|
if (sotoin6pcb(oso)->in6p_flags & IN6P_IPV6_V6ONLY)
|
|
|
|
in6p->in6p_flags |= IN6P_IPV6_V6ONLY;
|
|
|
|
else
|
|
|
|
in6p->in6p_flags &= ~IN6P_IPV6_V6ONLY;
|
2003-09-04 13:16:57 +04:00
|
|
|
in6_pcbstate(in6p, IN6P_BOUND);
|
1999-07-01 12:12:45 +04:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
break;
|
2000-10-17 07:06:42 +04:00
|
|
|
#endif
|
1999-07-01 12:12:45 +04:00
|
|
|
#ifdef INET6
|
|
|
|
case AF_INET6:
|
|
|
|
if (in6p) {
|
|
|
|
in6p->in6p_laddr = ((struct sockaddr_in6 *)dst)->sin6_addr;
|
|
|
|
in6p->in6p_lport = ((struct sockaddr_in6 *)dst)->sin6_port;
|
2003-09-04 13:16:57 +04:00
|
|
|
in6_pcbstate(in6p, IN6P_BOUND);
|
1999-07-01 12:12:45 +04:00
|
|
|
}
|
|
|
|
break;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
#ifdef INET6
|
|
|
|
if (in6p && in6totcpcb(in6p)->t_family == AF_INET6 && sotoinpcb(oso)) {
|
|
|
|
struct in6pcb *oin6p = sotoin6pcb(oso);
|
|
|
|
/* inherit socket options from the listening socket */
|
|
|
|
in6p->in6p_flags |= (oin6p->in6p_flags & IN6P_CONTROLOPTS);
|
|
|
|
if (in6p->in6p_flags & IN6P_CONTROLOPTS) {
|
|
|
|
m_freem(in6p->in6p_options);
|
|
|
|
in6p->in6p_options = 0;
|
|
|
|
}
|
|
|
|
ip6_savecontrol(in6p, &in6p->in6p_options,
|
|
|
|
mtod(m, struct ip6_hdr *), m);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2011-12-19 15:59:56 +04:00
|
|
|
#if defined(KAME_IPSEC) || defined(FAST_IPSEC)
|
2000-01-31 17:18:52 +03:00
|
|
|
/*
|
|
|
|
* we make a copy of policy, instead of sharing the policy,
|
|
|
|
* for better behavior in terms of SA lookup and dead SA removal.
|
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if (inp) {
|
2000-01-31 17:18:52 +03:00
|
|
|
/* copy old policy into new socket's */
|
2002-06-11 23:39:59 +04:00
|
|
|
if (ipsec_copy_pcbpolicy(sotoinpcb(oso)->inp_sp, inp->inp_sp))
|
1999-07-01 12:12:45 +04:00
|
|
|
printf("tcp_input: could not copy policy\n");
|
|
|
|
}
|
|
|
|
#ifdef INET6
|
|
|
|
else if (in6p) {
|
2000-01-31 17:18:52 +03:00
|
|
|
/* copy old policy into new socket's */
|
2002-06-11 23:39:59 +04:00
|
|
|
if (ipsec_copy_pcbpolicy(sotoin6pcb(oso)->in6p_sp,
|
|
|
|
in6p->in6p_sp))
|
1999-07-01 12:12:45 +04:00
|
|
|
printf("tcp_input: could not copy policy\n");
|
1998-04-07 09:09:19 +04:00
|
|
|
}
|
1999-07-01 12:12:45 +04:00
|
|
|
#endif
|
|
|
|
#endif
|
1997-07-24 01:26:40 +04:00
|
|
|
|
1999-01-24 04:19:28 +03:00
|
|
|
/*
|
|
|
|
* Give the new socket our cached route reference.
|
|
|
|
*/
|
Here are various changes designed to protect against bad IPv4
routing caused by stale route caches (struct route). Route caches
are sprinkled throughout PCBs, the IP fast-forwarding table, and
IP tunnel interfaces (gre, gif, stf).
Stale IPv6 and ISO route caches will be treated by separate patches.
Thank you to Christoph Badura for suggesting the general approach
to invalidating route caches that I take here.
Here are the details:
Add hooks to struct domain for tracking and for invalidating each
domain's route caches: dom_rtcache, dom_rtflush, and dom_rtflushall.
Introduce helper subroutines, rtflush(ro) for invalidating a route
cache, rtflushall(family) for invalidating all route caches in a
routing domain, and rtcache(ro) for notifying the domain of a new
cached route.
Chain together all IPv4 route caches where ro_rt != NULL. Provide
in_rtcache() for adding a route to the chain. Provide in_rtflush()
and in_rtflushall() for invalidating IPv4 route caches. In
in_rtflush(), set ro_rt to NULL, and remove the route from the
chain. In in_rtflushall(), walk the chain and remove every route
cache.
In rtrequest1(), call rtflushall() to invalidate route caches when
a route is added.
In gif(4), discard the workaround for stale caches that involves
expiring them every so often.
Replace the pattern 'RTFREE(ro->ro_rt); ro->ro_rt = NULL;' with a
call to rtflush(ro).
Update ipflow_fastforward() and all other users of route caches so
that they expect a cached route, ro->ro_rt, to turn to NULL.
Take care when moving a 'struct route' to rtflush() the source and
to rtcache() the destination.
In domain initializers, use .dom_xxx tags.
KNF here and there.
2006-12-09 08:33:04 +03:00
|
|
|
if (inp) {
|
Eliminate address family-specific route caches (struct route, struct
route_in6, struct route_iso), replacing all caches with a struct
route.
The principal benefit of this change is that all of the protocol
families can benefit from route cache-invalidation, which is
necessary for correct routing. Route-cache invalidation fixes an
ancient PR, kern/3508, at long last; it fixes various other PRs,
also.
Discussions with and ideas from Joerg Sonnenberger influenced this
work tremendously. Of course, all design oversights and bugs are
mine.
DETAILS
1 I added to each address family a pool of sockaddrs. I have
introduced routines for allocating, copying, duplicating,
and freeing sockaddrs:
struct sockaddr *sockaddr_alloc(sa_family_t af, int flags);
struct sockaddr *sockaddr_copy(struct sockaddr *dst,
const struct sockaddr *src);
struct sockaddr *sockaddr_dup(const struct sockaddr *src, int flags);
void sockaddr_free(struct sockaddr *sa);
sockaddr_alloc() returns either a sockaddr from the pool belonging
to the specified family, or NULL if the pool is exhausted. The
returned sockaddr has the right size for that family; sa_family
and sa_len fields are initialized to the family and sockaddr
length---e.g., sa_family = AF_INET and sa_len = sizeof(struct
sockaddr_in). sockaddr_free() puts the given sockaddr back into
its family's pool.
sockaddr_dup() and sockaddr_copy() work analogously to strdup()
and strcpy(), respectively. sockaddr_copy() KASSERTs that the
family of the destination and source sockaddrs are alike.
The 'flags' argument for sockaddr_alloc() and sockaddr_dup() is
passed directly to pool_get(9).
2 I added routines for initializing sockaddrs in each address
family, sockaddr_in_init(), sockaddr_in6_init(), sockaddr_iso_init(),
etc. They are fairly self-explanatory.
3 structs route_in6 and route_iso are no more. All protocol families
use struct route. I have changed the route cache, 'struct route',
so that it does not contain storage space for a sockaddr. Instead,
struct route points to a sockaddr coming from the pool the sockaddr
belongs to. I added a new method to struct route, rtcache_setdst(),
for setting the cache destination:
int rtcache_setdst(struct route *, const struct sockaddr *);
rtcache_setdst() returns 0 on success, or ENOMEM if no memory is
available to create the sockaddr storage.
It is now possible for rtcache_getdst() to return NULL if, say,
rtcache_setdst() failed. I check the return value for NULL
everywhere in the kernel.
4 Each routing domain (struct domain) has a list of live route
caches, dom_rtcache. rtflushall(sa_family_t af) looks up the
domain indicated by 'af', walks the domain's list of route caches
and invalidates each one.
2007-05-03 00:40:22 +04:00
|
|
|
rtcache_copy(&inp->inp_route, &sc->sc_route);
|
|
|
|
rtcache_free(&sc->sc_route);
|
2006-12-09 08:33:04 +03:00
|
|
|
}
|
1999-07-01 12:12:45 +04:00
|
|
|
#ifdef INET6
|
2006-12-09 08:33:04 +03:00
|
|
|
else {
|
2007-05-03 00:40:22 +04:00
|
|
|
rtcache_copy(&in6p->in6p_route, &sc->sc_route);
|
|
|
|
rtcache_free(&sc->sc_route);
|
2006-12-09 08:33:04 +03:00
|
|
|
}
|
1999-07-01 12:12:45 +04:00
|
|
|
#endif
|
1999-01-24 04:19:28 +03:00
|
|
|
|
1997-07-24 01:26:40 +04:00
|
|
|
am = m_get(M_DONTWAIT, MT_SONAME); /* XXX */
|
1998-01-18 08:56:15 +03:00
|
|
|
if (am == NULL)
|
1997-07-24 01:26:40 +04:00
|
|
|
goto resetandabort;
|
2003-02-26 09:31:08 +03:00
|
|
|
MCLAIM(am, &tcp_mowner);
|
1999-07-01 12:12:45 +04:00
|
|
|
am->m_len = src->sa_len;
|
2007-03-04 08:59:00 +03:00
|
|
|
bcopy(src, mtod(am, void *), src->sa_len);
|
1999-07-01 12:12:45 +04:00
|
|
|
if (inp) {
|
2007-12-16 17:12:34 +03:00
|
|
|
if (in_pcbconnect(inp, am, &lwp0)) {
|
1999-07-01 12:12:45 +04:00
|
|
|
(void) m_free(am);
|
|
|
|
goto resetandabort;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#ifdef INET6
|
|
|
|
else if (in6p) {
|
|
|
|
if (src->sa_family == AF_INET) {
|
|
|
|
/* IPv4 packet to AF_INET6 socket */
|
|
|
|
struct sockaddr_in6 *sin6;
|
|
|
|
sin6 = mtod(am, struct sockaddr_in6 *);
|
|
|
|
am->m_len = sizeof(*sin6);
|
2009-03-18 19:00:08 +03:00
|
|
|
memset(sin6, 0, sizeof(*sin6));
|
1999-07-01 12:12:45 +04:00
|
|
|
sin6->sin6_family = AF_INET6;
|
|
|
|
sin6->sin6_len = sizeof(*sin6);
|
|
|
|
sin6->sin6_port = ((struct sockaddr_in *)src)->sin_port;
|
|
|
|
sin6->sin6_addr.s6_addr16[5] = htons(0xffff);
|
|
|
|
bcopy(&((struct sockaddr_in *)src)->sin_addr,
|
|
|
|
&sin6->sin6_addr.s6_addr32[3],
|
|
|
|
sizeof(sin6->sin6_addr.s6_addr32[3]));
|
|
|
|
}
|
2005-11-15 21:39:46 +03:00
|
|
|
if (in6_pcbconnect(in6p, am, NULL)) {
|
1999-07-01 12:12:45 +04:00
|
|
|
(void) m_free(am);
|
|
|
|
goto resetandabort;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
else {
|
1997-07-24 01:26:40 +04:00
|
|
|
(void) m_free(am);
|
|
|
|
goto resetandabort;
|
|
|
|
}
|
|
|
|
(void) m_free(am);
|
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
if (inp)
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
#ifdef INET6
|
|
|
|
else if (in6p)
|
|
|
|
tp = in6totcpcb(in6p);
|
|
|
|
#endif
|
|
|
|
else
|
|
|
|
tp = NULL;
|
2002-07-18 07:23:01 +04:00
|
|
|
tp->t_flags = sototcpcb(oso)->t_flags & TF_NODELAY;
|
1997-07-24 01:26:40 +04:00
|
|
|
if (sc->sc_request_r_scale != 15) {
|
|
|
|
tp->requested_s_scale = sc->sc_requested_s_scale;
|
|
|
|
tp->request_r_scale = sc->sc_request_r_scale;
|
|
|
|
tp->snd_scale = sc->sc_requested_s_scale;
|
|
|
|
tp->rcv_scale = sc->sc_request_r_scale;
|
2002-10-22 08:24:50 +04:00
|
|
|
tp->t_flags |= TF_REQ_SCALE|TF_RCVD_SCALE;
|
1997-07-24 01:26:40 +04:00
|
|
|
}
|
1998-04-03 12:02:45 +04:00
|
|
|
if (sc->sc_flags & SCF_TIMESTAMP)
|
2002-10-22 08:24:50 +04:00
|
|
|
tp->t_flags |= TF_REQ_TSTMP|TF_RCVD_TSTMP;
|
Two changes, designed to make us even more resilient against TCP
ISS attacks (which we already fend off quite well).
1. First-cut implementation of RFC1948, Steve Bellovin's cryptographic
hash method of generating TCP ISS values. Note, this code is experimental
and disabled by default (experimental enough that I don't export the
variable via sysctl yet, either). There are a couple of issues I'd
like to discuss with Steve, so this code should only be used by people
who really know what they're doing.
2. Per a recent thread on Bugtraq, it's possible to determine a system's
uptime by snooping the RFC1323 TCP timestamp options sent by a host; in
4.4BSD, timestamps are created by incrementing the tcp_now variable
at 2 Hz; there's even a company out there that uses this to determine
web server uptime. According to Newsham's paper "The Problem With
Random Increments", while NetBSD's TCP ISS generation method is much
better than the "random increment" method used by FreeBSD and OpenBSD,
it is still theoretically possible to mount an attack against NetBSD's
method if the attacker knows how many times the tcp_iss_seq variable
has been incremented. By not leaking uptime information, we can make
that much harder to determine. So, we avoid the leak by giving each
TCP connection a timebase of 0.
2001-03-20 23:07:51 +03:00
|
|
|
tp->ts_timebase = sc->sc_timebase;
|
1997-07-24 01:26:40 +04:00
|
|
|
|
|
|
|
tp->t_template = tcp_template(tp);
|
|
|
|
if (tp->t_template == 0) {
|
|
|
|
tp = tcp_drop(tp, ENOBUFS); /* destroys socket */
|
|
|
|
so = NULL;
|
|
|
|
m_freem(m);
|
|
|
|
goto abort;
|
|
|
|
}
|
|
|
|
|
|
|
|
tp->iss = sc->sc_iss;
|
|
|
|
tp->irs = sc->sc_irs;
|
|
|
|
tcp_sendseqinit(tp);
|
|
|
|
tcp_rcvseqinit(tp);
|
|
|
|
tp->t_state = TCPS_SYN_RECEIVED;
|
2007-06-20 19:29:17 +04:00
|
|
|
TCP_TIMER_ARM(tp, TCPT_KEEP, tp->t_keepinit);
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_ACCEPTS);
|
1997-07-24 01:26:40 +04:00
|
|
|
|
Commit TCP SACK patches from Kentaro A. Karahone's patch at:
http://www.sigusr1.org/~kurahone/tcp-sack-netbsd-02152005.diff.gz
Fixes in that patch for pre-existing TCP pcb initializations were already
committed to NetBSD-current, so are not included in this commit.
The SACK patch has been observed to correctly negotiate and respond
to SACKs in wide-area traffic.
There are two independently observed, as-yet-unresolved anomalies:
first, unexplained delays in fast retransmission
(potentially explainable by a 0.2sec RTT between adjacent
ethernet/wifi NICs); and second, peculiar and unexplained TCP
retransmits observed over an ath0 card.
After discussion with several interested developers, I'm committing
this now, as-is, for more eyes to use and look over. Current hypothesis
is that the anomalies above may in fact be due to link-level (hardware,
driver, HAL, firmware) aberrations in the test setup, affecting both
Kentaro's wired-Ethernet NIC and my two (different) WiFi NICs.
2005-02-28 19:20:59 +03:00
|
|
|
if ((sc->sc_flags & SCF_SACK_PERMIT) && tcp_do_sack)
|
|
|
|
tp->t_flags |= TF_WILL_SACK;
|
|
|
|
|
2006-09-05 04:29:35 +04:00
|
|
|
if ((sc->sc_flags & SCF_ECN_PERMIT) && tcp_do_ecn)
|
|
|
|
tp->t_flags |= TF_ECN_PERMIT;
|
|
|
|
|
Initial commit of a port of the FreeBSD implementation of RFC 2385
(MD5 signatures for TCP, as used with BGP). Credit for original
FreeBSD code goes to Bruce M. Simpson, with FreeBSD sponsorship
credited to sentex.net. Shortening of the setsockopt() name
attributed to Vincent Jardin.
This commit is a minimal, working version of the FreeBSD code, as
MFC'ed to FreeBSD-4. It has received minimal testing with a ttcp
modified to set the TCP-MD5 option; BMS's additions to tcpdump-current
(tcpdump -M) confirm that the MD5 signatures are correct. Committed
as-is for further testing between a NetBSD BGP speaker (e.g., quagga)
and industry-standard BGP speakers (e.g., Cisco, Juniper).
NOTE: This version has two potential flaws. First, I do not see any code
that verifies received TCP-MD5 signatures. Second, the TCP-MD5
options are internally padded and assumed to be 32-bit aligned. A more
space-efficient scheme is to pack all TCP options densely (and
possibly unaligned) into the TCP header ; then do one final padding to
a 4-byte boundary. Pre-existing comments note that accounting for
TCP-option space when we add SACK is yet to be done. For now, I'm
punting on that; we can solve it properly, in a way that will handle
SACK blocks, as a separate exercise.
In case a pullup to NetBSD-2 is requested, this adds sys/netipsec/xform_tcp.c,
and modifies:
sys/net/pfkeyv2.h,v 1.15
sys/netinet/files.netinet,v 1.5
sys/netinet/ip.h,v 1.25
sys/netinet/tcp.h,v 1.15
sys/netinet/tcp_input.c,v 1.200
sys/netinet/tcp_output.c,v 1.109
sys/netinet/tcp_subr.c,v 1.165
sys/netinet/tcp_usrreq.c,v 1.89
sys/netinet/tcp_var.h,v 1.109
sys/netipsec/files.netipsec,v 1.3
sys/netipsec/ipsec.c,v 1.11
sys/netipsec/ipsec.h,v 1.7
sys/netipsec/key.c,v 1.11
share/man/man4/tcp.4,v 1.16
lib/libipsec/pfkey.c,v 1.20
lib/libipsec/pfkey_dump.c,v 1.17
lib/libipsec/policy_token.l,v 1.8
sbin/setkey/parse.y,v 1.14
sbin/setkey/setkey.8,v 1.27
sbin/setkey/token.l,v 1.15
Note that the preceding two revisions to tcp.4 will be
required to cleanly apply this diff.
2004-04-26 02:25:03 +04:00
|
|
|
#ifdef TCP_SIGNATURE
|
|
|
|
if (sc->sc_flags & SCF_SIGNATURE)
|
|
|
|
tp->t_flags |= TF_SIGNATURE;
|
|
|
|
#endif
|
|
|
|
|
1997-09-23 01:49:55 +04:00
|
|
|
/* Initialize tp->t_ourmss before we deal with the peer's! */
|
|
|
|
tp->t_ourmss = sc->sc_ourmaxseg;
|
|
|
|
tcp_mss_from_peer(tp, sc->sc_peermaxseg);
|
1998-04-01 02:49:09 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize the initial congestion window. If we
|
|
|
|
* had to retransmit the SYN,ACK, we must initialize cwnd
|
1998-07-18 02:58:56 +04:00
|
|
|
* to 1 segment (i.e. the Loss Window).
|
1998-04-01 02:49:09 +04:00
|
|
|
*/
|
1999-04-29 07:54:22 +04:00
|
|
|
if (sc->sc_rxtshift)
|
1998-07-18 02:58:56 +04:00
|
|
|
tp->snd_cwnd = tp->t_peermss;
|
2003-03-01 07:40:27 +03:00
|
|
|
else {
|
|
|
|
int ss = tcp_init_win;
|
|
|
|
#ifdef INET
|
|
|
|
if (inp != NULL && in_localaddr(inp->inp_faddr))
|
|
|
|
ss = tcp_init_win_local;
|
|
|
|
#endif
|
|
|
|
#ifdef INET6
|
|
|
|
if (in6p != NULL && in6_localaddr(&in6p->in6p_faddr))
|
|
|
|
ss = tcp_init_win_local;
|
|
|
|
#endif
|
|
|
|
tp->snd_cwnd = TCP_INITIAL_WINDOW(ss, tp->t_peermss);
|
|
|
|
}
|
1998-04-01 02:49:09 +04:00
|
|
|
|
1997-09-23 01:49:55 +04:00
|
|
|
tcp_rmx_rtt(tp);
|
1997-07-24 01:26:40 +04:00
|
|
|
tp->snd_wl1 = sc->sc_irs;
|
|
|
|
tp->rcv_up = sc->sc_irs + 1;
|
|
|
|
|
|
|
|
/*
|
2003-01-05 02:43:02 +03:00
|
|
|
* This is what would have happened in tcp_output() when
|
1997-07-24 01:26:40 +04:00
|
|
|
* the SYN,ACK was sent.
|
|
|
|
*/
|
|
|
|
tp->snd_up = tp->snd_una;
|
|
|
|
tp->snd_max = tp->snd_nxt = tp->iss+1;
|
1998-05-06 05:21:20 +04:00
|
|
|
TCP_TIMER_ARM(tp, TCPT_REXMT, tp->t_rxtcur);
|
1999-04-29 07:54:22 +04:00
|
|
|
if (sc->sc_win > 0 && SEQ_GT(tp->rcv_nxt + sc->sc_win, tp->rcv_adv))
|
|
|
|
tp->rcv_adv = tp->rcv_nxt + sc->sc_win;
|
1997-07-24 01:26:40 +04:00
|
|
|
tp->last_ack_sent = tp->rcv_nxt;
|
2005-01-27 06:39:36 +03:00
|
|
|
tp->t_partialacks = -1;
|
|
|
|
tp->t_dupacks = 0;
|
1997-07-24 01:26:40 +04:00
|
|
|
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_SC_COMPLETED);
|
2006-10-05 21:35:19 +04:00
|
|
|
s = splsoftnet();
|
2007-11-10 02:55:58 +03:00
|
|
|
syn_cache_put(sc);
|
2006-10-05 21:35:19 +04:00
|
|
|
splx(s);
|
1997-07-24 01:26:40 +04:00
|
|
|
return (so);
|
|
|
|
|
|
|
|
resetandabort:
|
2004-04-25 07:29:11 +04:00
|
|
|
(void)tcp_respond(NULL, m, m, th, (tcp_seq)0, th->th_ack, TH_RST);
|
1997-07-24 01:26:40 +04:00
|
|
|
abort:
|
2008-07-03 19:35:28 +04:00
|
|
|
if (so != NULL) {
|
|
|
|
(void) soqremque(so, 1);
|
1997-07-24 01:26:40 +04:00
|
|
|
(void) soabort(so);
|
2008-07-28 22:41:07 +04:00
|
|
|
mutex_enter(softnet_lock);
|
2008-07-03 19:35:28 +04:00
|
|
|
}
|
2006-10-05 21:35:19 +04:00
|
|
|
s = splsoftnet();
|
2007-11-10 02:55:58 +03:00
|
|
|
syn_cache_put(sc);
|
2006-10-05 21:35:19 +04:00
|
|
|
splx(s);
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_SC_ABORTED);
|
1997-07-24 01:26:40 +04:00
|
|
|
return ((struct socket *)(-1));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function is called when we get a RST for a
|
2001-06-19 17:42:07 +04:00
|
|
|
* non-existent connection, so that we can see if the
|
1997-07-24 01:26:40 +04:00
|
|
|
* connection is in the syn cache. If it is, zap it.
|
|
|
|
*/
|
|
|
|
|
|
|
|
void
|
2005-02-04 02:39:32 +03:00
|
|
|
syn_cache_reset(struct sockaddr *src, struct sockaddr *dst, struct tcphdr *th)
|
1997-07-24 01:26:40 +04:00
|
|
|
{
|
1998-05-07 05:37:27 +04:00
|
|
|
struct syn_cache *sc;
|
|
|
|
struct syn_cache_head *scp;
|
1997-07-24 01:26:40 +04:00
|
|
|
int s = splsoftnet();
|
|
|
|
|
1999-07-01 12:12:45 +04:00
|
|
|
if ((sc = syn_cache_lookup(src, dst, &scp)) == NULL) {
|
1997-07-24 01:26:40 +04:00
|
|
|
splx(s);
|
|
|
|
return;
|
|
|
|
}
|
1999-07-01 12:12:45 +04:00
|
|
|
if (SEQ_LT(th->th_seq, sc->sc_irs) ||
|
|
|
|
SEQ_GT(th->th_seq, sc->sc_irs+1)) {
|
1997-07-24 01:26:40 +04:00
|
|
|
splx(s);
|
|
|
|
return;
|
|
|
|
}
|
2007-11-10 02:55:58 +03:00
|
|
|
syn_cache_rm(sc);
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_SC_RESET);
|
2007-11-10 02:55:58 +03:00
|
|
|
syn_cache_put(sc); /* calls pool_put but see spl above */
|
2006-10-05 21:35:19 +04:00
|
|
|
splx(s);
|
1997-07-24 01:26:40 +04:00
|
|
|
}
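The sequence check in syn_cache_reset() above accepts a RST only when its sequence number is the peer's initial sequence number or one past it, compared modulo 2^32. A minimal sketch of that test follows. The macros are written in the style of the BSD SEQ_LT()/SEQ_GT() macros, and rst_seq_acceptable() is a hypothetical helper introduced here for illustration, not a function from this file.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Modular sequence comparisons, in the style of the BSD tcp_seq.h macros. */
#define SEQ_LT(a, b)	((int32_t)((a) - (b)) < 0)
#define SEQ_GT(a, b)	((int32_t)((a) - (b)) > 0)

/*
 * Hypothetical helper: returns 1 when a RST carrying sequence number
 * `seq` would pass the check for a cache entry whose peer ISN is `irs`,
 * i.e. when seq equals irs or irs + 1 in modular arithmetic.
 */
static int
rst_seq_acceptable(tcp_seq seq, tcp_seq irs)
{
	if (SEQ_LT(seq, irs) || SEQ_GT(seq, irs + 1))
		return 0;
	return 1;
}
```

Because the subtraction wraps, the check still behaves sensibly when irs is close to 2^32 - 1 and irs + 1 wraps to 0.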
|
|
|
|
|
|
|
|
void
|
2005-05-30 01:41:23 +04:00
|
|
|
syn_cache_unreach(const struct sockaddr *src, const struct sockaddr *dst,
|
2005-02-04 02:39:32 +03:00
|
|
|
struct tcphdr *th)
|
1997-07-24 01:26:40 +04:00
|
|
|
{
|
1998-05-07 05:37:27 +04:00
|
|
|
struct syn_cache *sc;
|
|
|
|
struct syn_cache_head *scp;
|
1997-07-24 01:26:40 +04:00
|
|
|
int s;
|
|
|
|
|
|
|
|
s = splsoftnet();
|
1999-07-01 12:12:45 +04:00
|
|
|
if ((sc = syn_cache_lookup(src, dst, &scp)) == NULL) {
|
1997-07-24 01:26:40 +04:00
|
|
|
splx(s);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
/* If the sequence number != sc_iss, then it's a bogus ICMP msg */
|
|
|
|
if (ntohl (th->th_seq) != sc->sc_iss) {
|
|
|
|
splx(s);
|
|
|
|
return;
|
|
|
|
}
|
1998-09-09 05:32:27 +04:00
|
|
|
|
|
|
|
/*
|
2004-01-02 15:01:39 +03:00
|
|
|
* If we've retransmitted 3 times and this is our second error,
|
1998-09-09 05:32:27 +04:00
|
|
|
* we remove the entry. Otherwise, we allow it to continue on.
|
|
|
|
* This prevents us from incorrectly nuking an entry during a
|
|
|
|
* spurious network outage.
|
|
|
|
*
|
|
|
|
* See tcp_notify().
|
|
|
|
*/
|
1999-04-29 07:54:22 +04:00
|
|
|
if ((sc->sc_flags & SCF_UNREACH) == 0 || sc->sc_rxtshift < 3) {
|
1998-09-09 05:32:27 +04:00
|
|
|
sc->sc_flags |= SCF_UNREACH;
|
|
|
|
splx(s);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2007-11-10 02:55:58 +03:00
|
|
|
syn_cache_rm(sc);
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_SC_UNREACH);
|
2007-11-10 02:55:58 +03:00
|
|
|
syn_cache_put(sc); /* calls pool_put but see spl above */
|
2006-10-05 21:35:19 +04:00
|
|
|
splx(s);
|
1997-07-24 01:26:40 +04:00
|
|
|
}
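The removal rule in syn_cache_unreach() above can be read as a "second strike" policy: the first unreachable error only marks the entry, and the entry is dropped only when it is already marked and the SYN,ACK has been retransmitted at least 3 times. The sketch below isolates that decision. struct mock_entry, the flag value, and unreach_should_remove() are illustrative assumptions, not the kernel's definitions.

```c
#include <assert.h>

#define SCF_UNREACH	0x0001	/* illustrative flag value (assumption) */

/* Hypothetical stand-in for the two syn cache fields the rule reads. */
struct mock_entry {
	int flags;
	int rxtshift;	/* number of SYN,ACK retransmissions so far */
};

/* Returns 1 if the entry should be removed, 0 if it is kept (and marked). */
static int
unreach_should_remove(struct mock_entry *e)
{
	if ((e->flags & SCF_UNREACH) == 0 || e->rxtshift < 3) {
		e->flags |= SCF_UNREACH;	/* remember this first error */
		return 0;
	}
	return 1;
}
```

With three retransmissions already done, the first error marks the entry and only the second error removes it.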
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Given a LISTEN socket and an inbound SYN request, add
|
|
|
|
* this to the syn cache, and send back a segment:
|
|
|
|
* <SEQ=ISS><ACK=RCV_NXT><CTL=SYN,ACK>
|
|
|
|
* to the source.
|
1997-09-23 01:49:55 +04:00
|
|
|
*
|
1998-06-02 22:33:02 +04:00
|
|
|
* IMPORTANT NOTE: We do _NOT_ ACK data that might accompany the SYN.
|
|
|
|
* Doing so would require that we hold onto the data and deliver it
|
|
|
|
* to the application. However, if we are the target of a SYN-flood
|
|
|
|
* DoS attack, an attacker could send data which would eventually
|
|
|
|
* consume all available buffer space if it were ACKed. By not ACKing
|
|
|
|
* the data, we avoid this DoS scenario.
|
1997-07-24 01:26:40 +04:00
|
|
|
*/
|
|
|
|
|
|
|
|
int
|
2005-02-04 02:39:32 +03:00
|
|
|
syn_cache_add(struct sockaddr *src, struct sockaddr *dst, struct tcphdr *th,
|
|
|
|
unsigned int hlen, struct socket *so, struct mbuf *m, u_char *optp,
|
|
|
|
int optlen, struct tcp_opt_info *oi)
|
1997-07-24 01:26:40 +04:00
|
|
|
{
|
1997-09-23 01:49:55 +04:00
|
|
|
struct tcpcb tb, *tp;
|
1997-07-24 01:26:40 +04:00
|
|
|
long win;
|
1998-05-07 05:37:27 +04:00
|
|
|
struct syn_cache *sc;
|
1997-07-24 01:26:40 +04:00
|
|
|
struct syn_cache_head *scp;
|
1998-04-07 09:09:19 +04:00
|
|
|
struct mbuf *ipopts;
|
2004-05-18 18:44:14 +04:00
|
|
|
struct tcp_opt_info opti;
|
2006-10-05 21:35:19 +04:00
|
|
|
int s;
|
1997-07-24 01:26:40 +04:00
|
|
|
|
1997-09-23 01:49:55 +04:00
|
|
|
tp = sototcpcb(so);
|
1997-07-24 01:26:40 +04:00
|
|
|
|
2009-03-18 19:00:08 +03:00
|
|
|
memset(&opti, 0, sizeof(opti));
|
2004-05-18 18:44:14 +04:00
|
|
|
|
1997-07-24 01:26:40 +04:00
|
|
|
/*
|
|
|
|
* RFC1122 4.2.3.10, p. 104: discard bcast/mcast SYN
|
2000-02-12 20:19:34 +03:00
|
|
|
*
|
|
|
|
* Note this check is performed in tcp_input() very early on.
|
1997-07-24 01:26:40 +04:00
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize some local state.
|
|
|
|
*/
|
|
|
|
win = sbspace(&so->so_rcv);
|
|
|
|
if (win > TCP_MAXWIN)
|
|
|
|
win = TCP_MAXWIN;
|
|
|
|
|
2000-10-17 07:06:42 +04:00
|
|
|
switch (src->sa_family) {
|
|
|
|
#ifdef INET
|
|
|
|
case AF_INET:
|
1999-07-01 12:12:45 +04:00
|
|
|
/*
|
|
|
|
* Remember the IP options, if any.
|
|
|
|
*/
|
|
|
|
ipopts = ip_srcroute();
|
2000-10-17 07:06:42 +04:00
|
|
|
break;
|
|
|
|
#endif
|
|
|
|
default:
|
1999-07-01 12:12:45 +04:00
|
|
|
ipopts = NULL;
|
2000-10-17 07:06:42 +04:00
|
|
|
}
|
1998-04-07 09:09:19 +04:00
|
|
|
|
2004-05-18 18:44:14 +04:00
|
|
|
#ifdef TCP_SIGNATURE
|
|
|
|
if (optp || (tp->t_flags & TF_SIGNATURE))
|
|
|
|
#else
|
|
|
|
if (optp)
|
|
|
|
#endif
|
|
|
|
{
|
2004-06-26 07:29:15 +04:00
|
|
|
tb.t_flags = tcp_do_rfc1323 ? (TF_REQ_SCALE|TF_REQ_TSTMP) : 0;
|
|
|
|
#ifdef TCP_SIGNATURE
|
|
|
|
tb.t_flags |= (tp->t_flags & TF_SIGNATURE);
|
|
|
|
#endif
|
2005-08-12 18:41:00 +04:00
|
|
|
tb.t_state = TCPS_LISTEN;
|
2004-05-18 18:44:14 +04:00
|
|
|
if (tcp_dooptions(&tb, optp, optlen, th, m, m->m_pkthdr.len -
|
|
|
|
sizeof(struct tcphdr) - optlen - hlen, oi) < 0)
|
|
|
|
return (0);
|
1997-07-24 01:26:40 +04:00
|
|
|
} else
|
|
|
|
tb.t_flags = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* See if we already have an entry for this connection.
|
1999-04-29 07:54:22 +04:00
|
|
|
* If we do, resend the SYN,ACK. We do not count this
|
|
|
|
* as a retransmission (XXX though maybe we should).
|
1997-07-24 01:26:40 +04:00
|
|
|
*/
|
1999-07-01 12:12:45 +04:00
|
|
|
if ((sc = syn_cache_lookup(src, dst, &scp)) != NULL) {
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_SC_DUPESYN);
|
1998-04-07 09:09:19 +04:00
|
|
|
if (ipopts) {
|
|
|
|
/*
|
|
|
|
* If we were remembering a previous source route,
|
|
|
|
* forget it and use the new one we've been given.
|
|
|
|
*/
|
|
|
|
if (sc->sc_ipopts)
|
|
|
|
(void) m_free(sc->sc_ipopts);
|
|
|
|
sc->sc_ipopts = ipopts;
|
|
|
|
}
|
1999-04-29 07:54:22 +04:00
|
|
|
sc->sc_timestamp = tb.ts_recent;
|
|
|
|
if (syn_cache_respond(sc, m) == 0) {
|
2008-04-12 09:58:22 +04:00
|
|
|
uint64_t *tcps = TCP_STAT_GETREF();
|
|
|
|
tcps[TCP_STAT_SNDACKS]++;
|
|
|
|
tcps[TCP_STAT_SNDTOTAL]++;
|
|
|
|
TCP_STAT_PUTREF();
|
1997-07-24 01:26:40 +04:00
|
|
|
}
|
|
|
|
return (1);
|
|
|
|
}
|
|
|
|
|
2006-10-05 21:35:19 +04:00
|
|
|
s = splsoftnet();
|
1998-08-02 04:35:51 +04:00
|
|
|
sc = pool_get(&syn_cache_pool, PR_NOWAIT);
|
2006-10-05 21:35:19 +04:00
|
|
|
splx(s);
|
2002-06-09 20:33:36 +04:00
|
|
|
if (sc == NULL) {
|
1998-04-07 09:09:19 +04:00
|
|
|
if (ipopts)
|
|
|
|
(void) m_free(ipopts);
|
1997-07-24 01:26:40 +04:00
|
|
|
return (0);
|
1998-04-07 09:09:19 +04:00
|
|
|
}
|
|
|
|
|
1997-07-24 01:26:40 +04:00
|
|
|
/*
|
1998-04-07 09:09:19 +04:00
|
|
|
* Fill in the cache, and put the necessary IP and TCP
|
1997-07-24 01:26:40 +04:00
|
|
|
* options into the reply.
|
|
|
|
*/
|
2009-03-18 19:00:08 +03:00
|
|
|
memset(sc, 0, sizeof(struct syn_cache));
|
2008-04-24 15:38:36 +04:00
|
|
|
callout_init(&sc->sc_timer, CALLOUT_MPSAFE);
|
1999-07-01 12:12:45 +04:00
|
|
|
bcopy(src, &sc->sc_src, src->sa_len);
|
|
|
|
bcopy(dst, &sc->sc_dst, dst->sa_len);
|
1998-04-01 02:49:09 +04:00
|
|
|
sc->sc_flags = 0;
|
1998-04-07 09:09:19 +04:00
|
|
|
sc->sc_ipopts = ipopts;
|
1999-07-01 12:12:45 +04:00
|
|
|
sc->sc_irs = th->th_seq;
|
Two changes, designed to make us even more resilient against TCP
ISS attacks (which we already fend off quite well).
1. First-cut implementation of RFC1948, Steve Bellovin's cryptographic
hash method of generating TCP ISS values. Note, this code is experimental
and disabled by default (experimental enough that I don't export the
variable via sysctl yet, either). There are a couple of issues I'd
like to discuss with Steve, so this code should only be used by people
who really know what they're doing.
2. Per a recent thread on Bugtraq, it's possible to determine a system's
uptime by snooping the RFC1323 TCP timestamp options sent by a host; in
4.4BSD, timestamps are created by incrementing the tcp_now variable
at 2 Hz; there's even a company out there that uses this to determine
web server uptime. According to Newsham's paper "The Problem With
Random Increments", while NetBSD's TCP ISS generation method is much
better than the "random increment" method used by FreeBSD and OpenBSD,
it is still theoretically possible to mount an attack against NetBSD's
method if the attacker knows how many times the tcp_iss_seq variable
has been incremented. By not leaking uptime information, we can make
that much harder to determine. So, we avoid the leak by giving each
TCP connection a timebase of 0.
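As a rough illustration of the second change, the sketch below assumes the timestamp sent on a connection is the global tick counter minus a per-connection timebase captured at setup. conn_timestamp() is a hypothetical helper, and the subtraction is an assumption about how the timebase is applied; the commit message only states that each connection starts from a timebase of 0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: timestamp value placed in a segment, given the
 * global tick counter and the timebase captured when the connection
 * was created. Unsigned subtraction wraps safely.
 */
static uint32_t
conn_timestamp(uint32_t now_ticks, uint32_t timebase)
{
	return now_ticks - timebase;
}
```

If the timebase is captured as "now minus one" at setup, a host that has been up for a million ticks and a host that has been up for fifty ticks both send 1 as their first timestamp, so the value no longer reveals uptime.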
2001-03-20 23:07:51 +03:00
|
|
|
switch (src->sa_family) {
|
|
|
|
#ifdef INET
|
|
|
|
case AF_INET:
|
|
|
|
{
|
|
|
|
struct sockaddr_in *srcin = (void *) src;
|
|
|
|
struct sockaddr_in *dstin = (void *) dst;
|
|
|
|
|
|
|
|
sc->sc_iss = tcp_new_iss1(&dstin->sin_addr,
|
|
|
|
&srcin->sin_addr, dstin->sin_port,
|
|
|
|
srcin->sin_port, sizeof(dstin->sin_addr), 0);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
#endif /* INET */
|
|
|
|
#ifdef INET6
|
|
|
|
case AF_INET6:
|
|
|
|
{
|
|
|
|
struct sockaddr_in6 *srcin6 = (void *) src;
|
|
|
|
struct sockaddr_in6 *dstin6 = (void *) dst;
|
|
|
|
|
|
|
|
sc->sc_iss = tcp_new_iss1(&dstin6->sin6_addr,
|
|
|
|
&srcin6->sin6_addr, dstin6->sin6_port,
|
|
|
|
srcin6->sin6_port, sizeof(dstin6->sin6_addr), 0);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
#endif /* INET6 */
|
|
|
|
}
|
1997-07-24 01:26:40 +04:00
|
|
|
sc->sc_peermaxseg = oi->maxseg;
|
1998-04-14 01:18:19 +04:00
|
|
|
sc->sc_ourmaxseg = tcp_mss_to_advertise(m->m_flags & M_PKTHDR ?
|
1999-09-23 06:21:30 +04:00
|
|
|
m->m_pkthdr.rcvif : NULL,
|
|
|
|
sc->sc_src.sa.sa_family);
|
1999-04-29 07:54:22 +04:00
|
|
|
sc->sc_win = win;
|
2008-02-05 12:38:47 +03:00
|
|
|
sc->sc_timebase = tcp_now - 1; /* see tcp_newtcpcb() */
|
1999-04-29 07:54:22 +04:00
|
|
|
sc->sc_timestamp = tb.ts_recent;
|
2002-10-22 08:24:50 +04:00
|
|
|
if ((tb.t_flags & (TF_REQ_TSTMP|TF_RCVD_TSTMP)) ==
|
|
|
|
(TF_REQ_TSTMP|TF_RCVD_TSTMP))
|
1998-04-03 12:02:45 +04:00
|
|
|
sc->sc_flags |= SCF_TIMESTAMP;
|
1997-07-24 01:26:40 +04:00
|
|
|
if ((tb.t_flags & (TF_RCVD_SCALE|TF_REQ_SCALE)) ==
|
|
|
|
(TF_RCVD_SCALE|TF_REQ_SCALE)) {
|
|
|
|
sc->sc_requested_s_scale = tb.requested_s_scale;
|
|
|
|
sc->sc_request_r_scale = 0;
|
2007-08-02 06:42:40 +04:00
|
|
|
/*
|
2007-11-04 14:04:26 +03:00
|
|
|
* Pick the smallest possible scaling factor that
|
|
|
|
* will still allow us to scale up to sb_max.
|
|
|
|
*
|
|
|
|
* We do this because there are broken firewalls that
|
|
|
|
* will corrupt the window scale option, leading to
|
|
|
|
* the other endpoint believing that our advertised
|
|
|
|
* window is unscaled. At scale factors larger than
|
|
|
|
* 5 the unscaled window will drop below 1500 bytes,
|
|
|
|
* leading to serious problems when traversing these
|
|
|
|
* broken firewalls.
|
|
|
|
*
|
|
|
|
* With the default sbmax of 256K, a scale factor
|
|
|
|
* of 3 will be chosen by this algorithm. Those who
|
|
|
|
* choose a larger sbmax should watch out
|
|
|
|
* for the compatibility problems mentioned above.
|
2007-08-02 06:42:40 +04:00
|
|
|
*
|
|
|
|
* RFC1323: The Window field in a SYN (i.e., a <SYN>
|
|
|
|
* or <SYN,ACK>) segment itself is never scaled.
|
|
|
|
*/
|
1997-07-24 01:26:40 +04:00
|
|
|
while (sc->sc_request_r_scale < TCP_MAX_WINSHIFT &&
|
2007-11-04 14:04:26 +03:00
|
|
|
(TCP_MAXWIN << sc->sc_request_r_scale) < sb_max)
|
1997-07-24 01:26:40 +04:00
|
|
|
sc->sc_request_r_scale++;
|
|
|
|
} else {
|
|
|
|
sc->sc_requested_s_scale = 15;
|
|
|
|
sc->sc_request_r_scale = 15;
|
|
|
|
}
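The scale selection loop above can be checked in isolation. This sketch assumes TCP_MAXWIN is 65535 and TCP_MAX_WINSHIFT is 14 (the RFC 1323 limits); pick_rcv_scale() is a hypothetical name for the loop, with sb_max passed in as a parameter rather than read from a global.

```c
#include <assert.h>

#define TCP_MAXWIN		65535	/* largest unscaled window (assumed) */
#define TCP_MAX_WINSHIFT	14	/* largest shift in RFC 1323 (assumed) */

/*
 * Hypothetical helper: smallest shift such that TCP_MAXWIN << shift
 * reaches sb_max, capped at TCP_MAX_WINSHIFT.
 */
static int
pick_rcv_scale(unsigned long sb_max)
{
	int scale = 0;

	while (scale < TCP_MAX_WINSHIFT &&
	    ((unsigned long)TCP_MAXWIN << scale) < sb_max)
		scale++;
	return scale;
}
```

For sb_max = 256K, 65535 << 2 is 262140, still just below 262144, so the loop advances once more and settles on 3, which agrees with the comment above.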
|
Commit the TCP SACK patch from Kentaro A. Kurahone, available at:
http://www.sigusr1.org/~kurahone/tcp-sack-netbsd-02152005.diff.gz
Fixes in that patch for pre-existing TCP pcb initializations were already
committed to NetBSD-current, so are not included in this commit.
The SACK patch has been observed to correctly negotiate, and respond
to, SACKs in wide-area traffic.
There are two independently observed, as-yet-unresolved anomalies:
first, unexplained delays in fast retransmission
(potentially explainable by a 0.2sec RTT between adjacent
ethernet/wifi NICs); and second, peculiar and unexplained TCP
retransmits observed over an ath0 card.
After discussion with several interested developers, I'm committing
this now, as-is, for more eyes to use and look over. Current hypothesis
is that the anomalies above may in fact be due to link-level (hardware,
driver, HAL, firmware) aberrations in the test setup, affecting both
Kentaro's wired-Ethernet NIC and my two (different) WiFi NICs.
2005-02-28 19:20:59 +03:00
|
|
|
if ((tb.t_flags & TF_SACK_PERMIT) && tcp_do_sack)
|
|
|
|
sc->sc_flags |= SCF_SACK_PERMIT;
|
2006-09-05 04:29:35 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* ECN setup packet received.
|
|
|
|
*/
|
|
|
|
if ((th->th_flags & (TH_ECE|TH_CWR)) && tcp_do_ecn)
|
|
|
|
sc->sc_flags |= SCF_ECN_PERMIT;
|
|
|
|
|
Initial commit of a port of the FreeBSD implementation of RFC 2385
(MD5 signatures for TCP, as used with BGP). Credit for original
FreeBSD code goes to Bruce M. Simpson, with FreeBSD sponsorship
credited to sentex.net. Shortening of the setsockopt() name
attributed to Vincent Jardin.
This commit is a minimal, working version of the FreeBSD code, as
MFC'ed to FreeBSD-4. It has received minimal testing with a ttcp
modified to set the TCP-MD5 option; BMS's additions to tcpdump-current
(tcpdump -M) confirm that the MD5 signatures are correct. Committed
as-is for further testing between a NetBSD BGP speaker (e.g., quagga)
and industry-standard BGP speakers (e.g., Cisco, Juniper).
NOTE: This version has two potential flaws. First, I do not see any code
that verifies received TCP-MD5 signatures. Second, the TCP-MD5
options are internally padded and assumed to be 32-bit aligned. A more
space-efficient scheme is to pack all TCP options densely (and
possibly unaligned) into the TCP header, then do one final padding to
a 4-byte boundary. Pre-existing comments note that accounting for
TCP-option space when we add SACK is yet to be done. For now, I'm
punting on that; we can solve it properly, in a way that will handle
SACK blocks, as a separate exercise.
In case a pullup to NetBSD-2 is requested, this adds sys/netipsec/xform_tcp.c,
and modifies:
sys/net/pfkeyv2.h,v 1.15
sys/netinet/files.netinet,v 1.5
sys/netinet/ip.h,v 1.25
sys/netinet/tcp.h,v 1.15
sys/netinet/tcp_input.c,v 1.200
sys/netinet/tcp_output.c,v 1.109
sys/netinet/tcp_subr.c,v 1.165
sys/netinet/tcp_usrreq.c,v 1.89
sys/netinet/tcp_var.h,v 1.109
sys/netipsec/files.netipsec,v 1.3
sys/netipsec/ipsec.c,v 1.11
sys/netipsec/ipsec.h,v 1.7
sys/netipsec/key.c,v 1.11
share/man/man4/tcp.4,v 1.16
lib/libipsec/pfkey.c,v 1.20
lib/libipsec/pfkey_dump.c,v 1.17
lib/libipsec/policy_token.l,v 1.8
sbin/setkey/parse.y,v 1.14
sbin/setkey/setkey.8,v 1.27
sbin/setkey/token.l,v 1.15
Note that the preceding two revisions to tcp.4 will be
required to cleanly apply this diff.
2004-04-26 02:25:03 +04:00
|
|
|
#ifdef TCP_SIGNATURE
|
2004-05-18 18:44:14 +04:00
|
|
|
if (tb.t_flags & TF_SIGNATURE)
|
|
|
|
sc->sc_flags |= SCF_SIGNATURE;
|
2004-04-26 02:25:03 +04:00
|
|
|
#endif
|
1999-08-25 19:23:12 +04:00
|
|
|
sc->sc_tp = tp;
|
1999-04-29 07:54:22 +04:00
|
|
|
if (syn_cache_respond(sc, m) == 0) {
|
2008-04-12 09:58:22 +04:00
|
|
|
uint64_t *tcps = TCP_STAT_GETREF();
|
|
|
|
tcps[TCP_STAT_SNDACKS]++;
|
|
|
|
tcps[TCP_STAT_SNDTOTAL]++;
|
|
|
|
TCP_STAT_PUTREF();
|
1999-08-25 19:23:12 +04:00
|
|
|
syn_cache_insert(sc, tp);
|
1997-07-24 01:26:40 +04:00
|
|
|
} else {
|
2006-10-05 21:35:19 +04:00
|
|
|
s = splsoftnet();
|
2010-05-26 21:38:29 +04:00
|
|
|
/*
|
|
|
|
* syn_cache_put() will try to schedule the timer, so
|
|
|
|
* we need to initialize it
|
|
|
|
*/
|
|
|
|
SYN_CACHE_TIMER_ARM(sc);
|
2007-11-10 02:55:58 +03:00
|
|
|
syn_cache_put(sc);
|
2006-10-05 21:35:19 +04:00
|
|
|
splx(s);
|
2008-04-12 09:58:22 +04:00
|
|
|
TCP_STATINC(TCP_STAT_SC_DROPPED);
|
1997-07-24 01:26:40 +04:00
|
|
|
}
|
|
|
|
return (1);
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
2005-02-04 02:39:32 +03:00
|
|
|
syn_cache_respond(struct syn_cache *sc, struct mbuf *m)
|
1997-07-24 01:26:40 +04:00
|
|
|
{
|
2007-12-20 23:24:49 +03:00
|
|
|
#ifdef INET6
|
2007-12-20 22:53:29 +03:00
|
|
|
struct rtentry *rt;
|
2007-12-20 23:24:49 +03:00
|
|
|
#endif
|
1999-12-08 19:22:20 +03:00
|
|
|
struct route *ro;
|
1997-07-24 01:26:40 +04:00
|
|
|
u_int8_t *optp;
|
1999-01-24 04:19:28 +03:00
|
|
|
int optlen, error;
|
|
|
|
u_int16_t tlen;
|
1999-07-01 12:12:45 +04:00
|
|
|
struct ip *ip = NULL;
|
|
|
|
#ifdef INET6
|
|
|
|
struct ip6_hdr *ip6 = NULL;
|
|
|
|
#endif
|
2006-09-05 04:29:35 +04:00
|
|
|
struct tcpcb *tp = NULL;
|
1999-07-01 12:12:45 +04:00
|
|
|
struct tcphdr *th;
|
|
|
|
u_int hlen;
|
2003-08-23 01:53:01 +04:00
|
|
|
struct socket *so;
|
1999-07-01 12:12:45 +04:00
|
|
|
|
Eliminate address family-specific route caches (struct route, struct
route_in6, struct route_iso), replacing all caches with a struct
route.
The principal benefit of this change is that all of the protocol
families can benefit from route cache-invalidation, which is
necessary for correct routing. Route-cache invalidation fixes an
ancient PR, kern/3508, at long last; it fixes various other PRs,
also.
Discussions with and ideas from Joerg Sonnenberger influenced this
work tremendously. Of course, all design oversights and bugs are
mine.
DETAILS
1 I added to each address family a pool of sockaddrs. I have
introduced routines for allocating, copying, duplicating,
and freeing sockaddrs:
struct sockaddr *sockaddr_alloc(sa_family_t af, int flags);
struct sockaddr *sockaddr_copy(struct sockaddr *dst,
const struct sockaddr *src);
struct sockaddr *sockaddr_dup(const struct sockaddr *src, int flags);
void sockaddr_free(struct sockaddr *sa);
sockaddr_alloc() returns either a sockaddr from the pool belonging
to the specified family, or NULL if the pool is exhausted. The
returned sockaddr has the right size for that family; sa_family
and sa_len fields are initialized to the family and sockaddr
length---e.g., sa_family = AF_INET and sa_len = sizeof(struct
sockaddr_in). sockaddr_free() puts the given sockaddr back into
its family's pool.
sockaddr_dup() and sockaddr_copy() work analogously to strdup()
and strcpy(), respectively. sockaddr_copy() KASSERTs that the
family of the destination and source sockaddrs are alike.
The 'flags' argument for sockaddr_alloc() and sockaddr_dup() is
passed directly to pool_get(9).
2 I added routines for initializing sockaddrs in each address
family, sockaddr_in_init(), sockaddr_in6_init(), sockaddr_iso_init(),
etc. They are fairly self-explanatory.
3 structs route_in6 and route_iso are no more. All protocol families
use struct route. I have changed the route cache, 'struct route',
so that it does not contain storage space for a sockaddr. Instead,
struct route points to a sockaddr coming from the pool the sockaddr
belongs to. I added a new method to struct route, rtcache_setdst(),
for setting the cache destination:
int rtcache_setdst(struct route *, const struct sockaddr *);
rtcache_setdst() returns 0 on success, or ENOMEM if no memory is
available to create the sockaddr storage.
It is now possible for rtcache_getdst() to return NULL if, say,
rtcache_setdst() failed. I check the return value for NULL
everywhere in the kernel.
4 Each routing domain (struct domain) has a list of live route
caches, dom_rtcache. rtflushall(sa_family_t af) looks up the
domain indicated by 'af', walks the domain's list of route caches
and invalidates each one.
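The sockaddr routines listed in item 1 live in the kernel and draw from per-family pools, so they cannot be exercised directly from userland. The sketch below is a userland analogue only: every mock_ name and struct mock_sockaddr are hypothetical, calloc() stands in for the pool, and the 'flags' argument is omitted. It shows the alloc/copy/dup/free relationship and the family check that sockaddr_copy() is described as asserting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Userland stand-in for a BSD-style sockaddr with a length byte. */
struct mock_sockaddr {
	uint8_t	sa_len;
	uint8_t	sa_family;
	char	sa_data[14];
};

/* Allocate a zeroed sockaddr for `af`; NULL when memory is exhausted. */
static struct mock_sockaddr *
mock_sockaddr_alloc(uint8_t af)
{
	struct mock_sockaddr *sa = calloc(1, sizeof(*sa));

	if (sa == NULL)
		return NULL;
	sa->sa_len = sizeof(*sa);
	sa->sa_family = af;
	return sa;
}

/* Copy src over dst, strcpy()-style; the families must already match. */
static struct mock_sockaddr *
mock_sockaddr_copy(struct mock_sockaddr *dst, const struct mock_sockaddr *src)
{
	assert(dst->sa_family == src->sa_family);
	memcpy(dst, src, src->sa_len);
	return dst;
}

/* Allocate a fresh copy of src, strdup()-style. */
static struct mock_sockaddr *
mock_sockaddr_dup(const struct mock_sockaddr *src)
{
	struct mock_sockaddr *sa = mock_sockaddr_alloc(src->sa_family);

	if (sa == NULL)
		return NULL;
	return mock_sockaddr_copy(sa, src);
}

/* Release a sockaddr obtained from mock_sockaddr_alloc() or _dup(). */
static void
mock_sockaddr_free(struct mock_sockaddr *sa)
{
	free(sa);
}
```

As with strdup(), the caller owns the result of mock_sockaddr_dup() and must check it for NULL before use.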
2007-05-03 00:40:22 +04:00
|
|
|
ro = &sc->sc_route;
|
1999-07-01 12:12:45 +04:00
|
|
|
switch (sc->sc_src.sa.sa_family) {
|
|
|
|
case AF_INET:
|
|
|
|
hlen = sizeof(struct ip);
|
|
|
|
break;
|
|
|
|
#ifdef INET6
|
|
|
|
case AF_INET6:
|
|
|
|
hlen = sizeof(struct ip6_hdr);
|
|
|
|
break;
|
|
|
|
#endif
|
|
|
|
default:
|
|
|
|
if (m)
|
|
|
|
m_freem(m);
|
2004-01-02 15:01:39 +03:00
|
|
|
return (EAFNOSUPPORT);
|
1999-07-01 12:12:45 +04:00
|
|
|
}
|
1997-07-24 01:26:40 +04:00
|
|
|
|
1999-01-24 04:19:28 +03:00
|
|
|
/* Compute the size of the TCP options. */
|
1997-07-24 01:26:40 +04:00
|
|
|
optlen = 4 + (sc->sc_request_r_scale != 15 ? 4 : 0) +
|
2005-02-28 19:20:59 +03:00
|
|
|
((sc->sc_flags & SCF_SACK_PERMIT) ? (TCPOLEN_SACK_PERMITTED + 2) : 0) +
|
2004-04-26 02:25:03 +04:00
|
|
|
#ifdef TCP_SIGNATURE
|
2004-05-18 18:44:14 +04:00
|
|
|
((sc->sc_flags & SCF_SIGNATURE) ? (TCPOLEN_SIGNATURE + 2) : 0) +
|
2004-04-26 02:25:03 +04:00
|
|
|
#endif
|
2004-05-18 18:44:14 +04:00
|
|
|
((sc->sc_flags & SCF_TIMESTAMP) ? TCPOLEN_TSTAMP_APPA : 0);
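The option-length expression above adds a fixed 4 bytes for the MSS option and then a 4-byte-aligned amount for each negotiated option. A small sketch of the same arithmetic follows. The three TCPOLEN_ values are assumptions based on the conventional <netinet/tcp.h> definitions, and synack_optlen() with its boolean parameters is a hypothetical helper, not part of this file.

```c
#include <assert.h>

/* Assumed option lengths, as conventionally defined in <netinet/tcp.h>. */
#define TCPOLEN_SACK_PERMITTED	2
#define TCPOLEN_SIGNATURE	18
#define TCPOLEN_TSTAMP_APPA	12	/* 10-byte timestamp option + 2 NOPs */

/* Hypothetical helper: SYN,ACK option bytes for a set of enabled options. */
static int
synack_optlen(int wscale, int sack, int sig, int tstamp)
{
	return 4 +				/* MSS option, always sent */
	    (wscale ? 4 : 0) +			/* window scale, padded to 4 */
	    (sack ? TCPOLEN_SACK_PERMITTED + 2 : 0) +
	    (sig ? TCPOLEN_SIGNATURE + 2 : 0) +
	    (tstamp ? TCPOLEN_TSTAMP_APPA : 0);
}
```

Under these assumed lengths every term is a multiple of 4, so the total keeps the TCP header 32-bit aligned without a final padding step.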
|
Initial commit of a port of the FreeBSD implementation of RFC 2385
(MD5 signatures for TCP, as used with BGP). Credit for original
FreeBSD code goes to Bruce M. Simpson, with FreeBSD sponsorship
credited to sentex.net. Shortening of the setsockopt() name
attributed to Vincent Jardin.
This commit is a minimal, working version of the FreeBSD code, as
MFC'ed to FreeBSD-4. It has received minimal testing with a ttcp
modified to set the TCP-MD5 option; BMS's additions to tcpdump-current
(tcpdump -M) confirm that the MD5 signatures are correct. Committed
as-is for further testing between a NetBSD BGP speaker (e.g., quagga)
and industry-standard BGP speakers (e.g., Cisco, Juniper).
NOTE: This version has two potential flaws. First, I do see any code
that verifies recieved TCP-MD5 signatures. Second, the TCP-MD5
options are internally padded and assumed to be 32-bit aligned. A more
space-efficient scheme is to pack all TCP options densely (and
possibly unaligned) into the TCP header ; then do one final padding to
a 4-byte boundary. Pre-existing comments note that accounting for
TCP-option space when we add SACK is yet to be done. For now, I'm
punting on that; we can solve it properly, in a way that will handle
SACK blocks, as a separate exercise.
In case a pullup to NetBSD-2 is requested, this adds sys/netipsec/xform_tcp.c
,and modifies:
sys/net/pfkeyv2.h,v 1.15
sys/netinet/files.netinet,v 1.5
sys/netinet/ip.h,v 1.25
sys/netinet/tcp.h,v 1.15
sys/netinet/tcp_input.c,v 1.200
sys/netinet/tcp_output.c,v 1.109
sys/netinet/tcp_subr.c,v 1.165
sys/netinet/tcp_usrreq.c,v 1.89
sys/netinet/tcp_var.h,v 1.109
sys/netipsec/files.netipsec,v 1.3
sys/netipsec/ipsec.c,v 1.11
sys/netipsec/ipsec.h,v 1.7
sys/netipsec/key.c,v 1.11
share/man/man4/tcp.4,v 1.16
lib/libipsec/pfkey.c,v 1.20
lib/libipsec/pfkey_dump.c,v 1.17
lib/libipsec/policy_token.l,v 1.8
sbin/setkey/parse.y,v 1.14
sbin/setkey/setkey.8,v 1.27
sbin/setkey/token.l,v 1.15
Note that the preceding two revisions to tcp.4 will be
required to cleanly apply this diff.
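The space trade-off raised in the NOTE of this commit message can be sketched in a few lines of userland C. This is an illustration only, not the committed code: the helper names `roundup4`, `padded_len`, and `dense_len` are hypothetical, and the option lengths used in the usage note are the standard on-the-wire sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of 4; the TCP data offset counts 32-bit words. */
static size_t
roundup4(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/* Scheme A (hypothetical helper): pad every option to a 4-byte boundary. */
size_t
padded_len(const size_t *optlens, size_t nopts)
{
	size_t total = 0;

	assert(optlens != NULL || nopts == 0);
	for (size_t i = 0; i < nopts; i++)
		total += roundup4(optlens[i]);
	return total;
}

/* Scheme B (hypothetical helper): pack options densely, pad once at the end. */
size_t
dense_len(const size_t *optlens, size_t nopts)
{
	size_t total = 0;

	assert(optlens != NULL || nopts == 0);
	for (size_t i = 0; i < nopts; i++)
		total += optlens[i];
	return roundup4(total);
}
```

With MSS (4), window scale (3), SACK-permitted (2), timestamps (10), and MD5 signature (18), per-option padding needs 44 bytes, which exceeds the 40 bytes available for TCP options, while dense packing needs exactly 40.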
2004-04-26 02:25:03 +04:00

1999-07-01 12:12:45 +04:00
	tlen = hlen + sizeof(struct tcphdr) + optlen;
1999-01-24 04:19:28 +03:00

	/*
2000-06-30 20:44:33 +04:00
	 * Create the IP+TCP header from scratch.
1999-01-24 04:19:28 +03:00
	 */
2000-06-30 20:44:33 +04:00
	if (m)
		m_freem(m);
2000-07-23 09:00:01 +04:00
#ifdef DIAGNOSTIC
	if (max_linkhdr + tlen > MCLBYTES)
		return (ENOBUFS);
#endif
2000-06-30 20:44:33 +04:00
	MGETHDR(m, M_DONTWAIT, MT_DATA);
2010-12-02 22:07:27 +03:00
	if (m && (max_linkhdr + tlen) > MHLEN) {
2000-06-30 20:44:33 +04:00
		MCLGET(m, M_DONTWAIT);
2000-07-06 01:45:14 +04:00
		if ((m->m_flags & M_EXT) == 0) {
2000-06-30 20:44:33 +04:00
			m_freem(m);
			m = NULL;
		}
1997-07-24 01:26:40 +04:00
	}
2000-06-30 20:44:33 +04:00
	if (m == NULL)
		return (ENOBUFS);
2003-02-26 09:31:08 +03:00
	MCLAIM(m, &tcp_tx_mowner);
1997-07-24 01:26:40 +04:00

1999-01-24 04:19:28 +03:00
	/* Fixup the mbuf. */
	m->m_data += max_linkhdr;
	m->m_len = m->m_pkthdr.len = tlen;
1999-08-25 19:23:12 +04:00
	if (sc->sc_tp) {
		tp = sc->sc_tp;
		if (tp->t_inpcb)
			so = tp->t_inpcb->inp_socket;
#ifdef INET6
		else if (tp->t_in6pcb)
			so = tp->t_in6pcb->in6p_socket;
#endif
		else
			so = NULL;
2003-08-23 02:49:34 +04:00
	} else
		so = NULL;
2000-06-30 20:44:33 +04:00
	m->m_pkthdr.rcvif = NULL;
1999-07-01 12:12:45 +04:00
	memset(mtod(m, u_char *), 0, tlen);

	switch (sc->sc_src.sa.sa_family) {
	case AF_INET:
		ip = mtod(m, struct ip *);
2004-05-18 18:44:14 +04:00
		ip->ip_v = 4;
1999-07-01 12:12:45 +04:00
		ip->ip_dst = sc->sc_src.sin.sin_addr;
		ip->ip_src = sc->sc_dst.sin.sin_addr;
		ip->ip_p = IPPROTO_TCP;
		th = (struct tcphdr *)(ip + 1);
		th->th_dport = sc->sc_src.sin.sin_port;
		th->th_sport = sc->sc_dst.sin.sin_port;
		break;
#ifdef INET6
	case AF_INET6:
		ip6 = mtod(m, struct ip6_hdr *);
2004-05-18 18:44:14 +04:00
		ip6->ip6_vfc = IPV6_VERSION;
1999-07-01 12:12:45 +04:00
		ip6->ip6_dst = sc->sc_src.sin6.sin6_addr;
		ip6->ip6_src = sc->sc_dst.sin6.sin6_addr;
		ip6->ip6_nxt = IPPROTO_TCP;
		/* ip6_plen will be updated in ip6_output() */
		th = (struct tcphdr *)(ip6 + 1);
		th->th_dport = sc->sc_src.sin6.sin6_port;
		th->th_sport = sc->sc_dst.sin6.sin6_port;
		break;
#endif
1999-07-02 16:45:32 +04:00
	default:
		th = NULL;
1999-07-01 12:12:45 +04:00
	}
1999-01-24 04:19:28 +03:00

1999-07-01 12:12:45 +04:00
	th->th_seq = htonl(sc->sc_iss);
	th->th_ack = htonl(sc->sc_irs + 1);
	th->th_off = (sizeof(struct tcphdr) + optlen) >> 2;
	th->th_flags = TH_SYN|TH_ACK;
	th->th_win = htons(sc->sc_win);
	/* th_sum already 0 */
	/* th_urp already 0 */
1999-01-24 04:19:28 +03:00

	/* Tack on the TCP options. */
1999-07-01 12:12:45 +04:00
	optp = (u_int8_t *)(th + 1);
1999-01-24 04:19:28 +03:00
	*optp++ = TCPOPT_MAXSEG;
	*optp++ = 4;
	*optp++ = (sc->sc_ourmaxseg >> 8) & 0xff;
	*optp++ = sc->sc_ourmaxseg & 0xff;
1997-07-24 01:26:40 +04:00

	if (sc->sc_request_r_scale != 15) {
1999-01-24 04:19:28 +03:00
		*((u_int32_t *)optp) = htonl(TCPOPT_NOP << 24 |
1997-07-24 01:26:40 +04:00
		    TCPOPT_WINDOW << 16 | TCPOLEN_WINDOW << 8 |
		    sc->sc_request_r_scale);
1999-01-24 04:19:28 +03:00
		optp += 4;
1997-07-24 01:26:40 +04:00
	}

1998-04-03 12:02:45 +04:00
	if (sc->sc_flags & SCF_TIMESTAMP) {
1999-01-24 04:19:28 +03:00
		u_int32_t *lp = (u_int32_t *)(optp);
1997-07-24 01:26:40 +04:00
		/* Form timestamp option as shown in appendix A of RFC 1323. */
		*lp++ = htonl(TCPOPT_TSTAMP_HDR);
Two changes, designed to make us even more resilient against TCP
ISS attacks (which we already fend off quite well).
1. First-cut implementation of RFC1948, Steve Bellovin's cryptographic
hash method of generating TCP ISS values. Note, this code is experimental
and disabled by default (experimental enough that I don't export the
variable via sysctl yet, either). There are a couple of issues I'd
like to discuss with Steve, so this code should only be used by people
who really know what they're doing.
2. Per a recent thread on Bugtraq, it's possible to determine a system's
uptime by snooping the RFC1323 TCP timestamp options sent by a host; in
4.4BSD, timestamps are created by incrementing the tcp_now variable
at 2 Hz; there's even a company out there that uses this to determine
web server uptime. According to Newsham's paper "The Problem With
Random Increments", while NetBSD's TCP ISS generation method is much
better than the "random increment" method used by FreeBSD and OpenBSD,
it is still theoretically possible to mount an attack against NetBSD's
method if the attacker knows how many times the tcp_iss_seq variable
has been incremented. By not leaking uptime information, we can make
that much harder to determine. So, we avoid the leak by giving each
TCP connection a timebase of 0.
2001-03-20 23:07:51 +03:00
		*lp++ = htonl(SYN_CACHE_TIMESTAMP(sc));
1999-04-29 07:54:22 +04:00
		*lp = htonl(sc->sc_timestamp);
1999-01-24 04:19:28 +03:00
		optp += TCPOLEN_TSTAMP_APPA;
1997-07-24 01:26:40 +04:00
	}
Commit TCP SACK patches from Kentaro A. Kurahone's patch at:
http://www.sigusr1.org/~kurahone/tcp-sack-netbsd-02152005.diff.gz
Fixes in that patch for pre-existing TCP pcb initializations were already
committed to NetBSD-current, so are not included in this commit.
The SACK patch has been observed to correctly negotiate and respond
to SACKs in wide-area traffic.
There are two independently observed, as-yet-unresolved anomalies:
first, unexplained delays in fast retransmission
(potentially explainable by a 0.2sec RTT between adjacent
ethernet/wifi NICs); and second, peculiar and unexplained TCP
retransmits observed over an ath0 card.
After discussion with several interested developers, I'm committing
this now, as-is, for more eyes to use and look over. Current hypothesis
is that the anomalies above may in fact be due to link-level (hardware,
driver, HAL, firmware) aberrations in the test setup, affecting both
Kentaro's wired-Ethernet NIC and my two (different) WiFi NICs.
2005-02-28 19:20:59 +03:00
	if (sc->sc_flags & SCF_SACK_PERMIT) {
		u_int8_t *p = optp;

		/* Let the peer know that we will SACK. */
		p[0] = TCPOPT_SACK_PERMITTED;
		p[1] = 2;
		p[2] = TCPOPT_NOP;
		p[3] = TCPOPT_NOP;
		optp += 4;
	}
2006-09-05 04:29:35 +04:00
	/*
	 * Send ECN SYN-ACK setup packet.
	 * Routes can be asymmetric, so, even if we receive a packet
	 * with ECE and CWR set, we must not assume no one will block
	 * the ECE packet we are about to send.
	 */
	if ((sc->sc_flags & SCF_ECN_PERMIT) && tp &&
	    SEQ_GEQ(tp->snd_nxt, tp->snd_max)) {
		th->th_flags |= TH_ECE;
2008-04-12 09:58:22 +04:00
		TCP_STATINC(TCP_STAT_ECN_SHS);
2006-09-05 04:29:35 +04:00

		/*
		 * draft-ietf-tcpm-ecnsyn-00.txt
		 *
		 * "[...] a TCP node MAY respond to an ECN-setup
		 * SYN packet by setting ECT in the responding
		 * ECN-setup SYN/ACK packet, indicating to routers
		 * that the SYN/ACK packet is ECN-Capable.
		 * This allows a congested router along the path
		 * to mark the packet instead of dropping the
		 * packet as an indication of congestion."
		 *
		 * "[...] There can be a great benefit in setting
		 * an ECN-capable codepoint in SYN/ACK packets [...]
		 * Congestion is most likely to occur in
		 * the server-to-client direction. As a result,
		 * setting an ECN-capable codepoint in SYN/ACK
		 * packets can reduce the occurrence of three-second
		 * retransmit timeouts resulting from the drop
		 * of SYN/ACK packets."
		 *
		 * Page 4 and 6, January 2006.
		 */

		switch (sc->sc_src.sa.sa_family) {
#ifdef INET
		case AF_INET:
			ip->ip_tos |= IPTOS_ECN_ECT0;
			break;
#endif
#ifdef INET6
		case AF_INET6:
			ip6->ip6_flow |= htonl(IPTOS_ECN_ECT0 << 20);
			break;
#endif
		}
2008-04-12 09:58:22 +04:00
		TCP_STATINC(TCP_STAT_ECN_ECT);
2006-09-05 04:29:35 +04:00
	}
2004-04-26 02:25:03 +04:00
#ifdef TCP_SIGNATURE
	if (sc->sc_flags & SCF_SIGNATURE) {
2004-05-18 18:44:14 +04:00
		struct secasvar *sav;
		u_int8_t *sigp;

		sav = tcp_signature_getsav(m, th);
2005-02-27 01:45:09 +03:00

2004-05-18 18:44:14 +04:00
		if (sav == NULL) {
			if (m)
				m_freem(m);
			return (EPERM);
		}

		*optp++ = TCPOPT_SIGNATURE;
		*optp++ = TCPOLEN_SIGNATURE;
		sigp = optp;
2009-03-18 19:00:08 +03:00
		memset(optp, 0, TCP_SIGLEN);
2004-05-18 18:44:14 +04:00
		optp += TCP_SIGLEN;
		*optp++ = TCPOPT_NOP;
		*optp++ = TCPOPT_EOL;

		(void)tcp_signature(m, th, hlen, sav, sigp);
2004-04-26 02:25:03 +04:00

2004-05-18 18:44:14 +04:00
		key_sa_recordxfer(sav, m);
#ifdef FAST_IPSEC
		KEY_FREESAV(&sav);
#else
		key_freesav(sav);
#endif
2004-04-26 02:25:03 +04:00
	}
2004-05-18 18:44:14 +04:00
#endif
2004-04-26 02:25:03 +04:00

1999-01-24 04:19:28 +03:00
	/* Compute the packet's checksum. */
1999-07-01 12:12:45 +04:00
	switch (sc->sc_src.sa.sa_family) {
	case AF_INET:
		ip->ip_len = htons(tlen - hlen);
		th->th_sum = 0;
2004-05-18 18:44:14 +04:00
		th->th_sum = in4_cksum(m, IPPROTO_TCP, hlen, tlen - hlen);
1999-07-01 12:12:45 +04:00
		break;
#ifdef INET6
	case AF_INET6:
		ip6->ip6_plen = htons(tlen - hlen);
		th->th_sum = 0;
		th->th_sum = in6_cksum(m, IPPROTO_TCP, hlen, tlen - hlen);
		break;
#endif
	}
1999-01-24 04:19:28 +03:00

1997-07-24 01:26:40 +04:00
	/*
1999-01-24 04:19:28 +03:00
	 * Fill in some straggling IP bits. Note the stack expects
	 * ip_len to be in host order, for convenience.
1997-07-24 01:26:40 +04:00
	 */
1999-07-01 12:12:45 +04:00
	switch (sc->sc_src.sa.sa_family) {
2000-10-17 07:06:42 +04:00
#ifdef INET
1999-07-01 12:12:45 +04:00
	case AF_INET:
2002-08-14 04:23:27 +04:00
		ip->ip_len = htons(tlen);
1999-07-01 12:12:45 +04:00
		ip->ip_ttl = ip_defttl;
		/* XXX tos? */
		break;
2000-10-17 07:06:42 +04:00
#endif
1999-07-01 12:12:45 +04:00
#ifdef INET6
	case AF_INET6:
1999-12-15 09:28:43 +03:00
		ip6->ip6_vfc &= ~IPV6_VERSION_MASK;
		ip6->ip6_vfc |= IPV6_VERSION;
1999-07-01 12:12:45 +04:00
		ip6->ip6_plen = htons(tlen - hlen);
1999-12-13 18:17:17 +03:00
		/* ip6_hlim will be initialized afterwards */
1999-07-01 12:12:45 +04:00
		/* XXX flowlabel? */
		break;
#endif
	}
1997-07-24 01:26:40 +04:00

2003-08-23 00:20:09 +04:00
	/* XXX use IPsec policy on listening socket, on SYN ACK */
	tp = sc->sc_tp;

1999-07-01 12:12:45 +04:00
	switch (sc->sc_src.sa.sa_family) {
2000-10-17 07:06:42 +04:00
#ifdef INET
1999-07-01 12:12:45 +04:00
	case AF_INET:
2000-10-17 06:57:01 +04:00
		error = ip_output(m, sc->sc_ipopts, ro,
2005-02-27 01:45:09 +03:00
		    (ip_mtudisc ? IP_MTUDISC : 0),
2011-08-31 22:31:02 +04:00
		    NULL, so);
1999-07-01 12:12:45 +04:00
		break;
2000-10-17 07:06:42 +04:00
#endif
1999-07-01 12:12:45 +04:00
#ifdef INET6
	case AF_INET6:
1999-12-13 18:17:17 +03:00
		ip6->ip6_hlim = in6_selecthlim(NULL,
2008-01-14 07:19:09 +03:00
		    (rt = rtcache_validate(ro)) != NULL ? rt->rt_ifp
		    : NULL);
1999-12-13 18:17:17 +03:00

Eliminate address family-specific route caches (struct route, struct
route_in6, struct route_iso), replacing all caches with a struct
route.
The principal benefit of this change is that all of the protocol
families can benefit from route cache-invalidation, which is
necessary for correct routing. Route-cache invalidation fixes an
ancient PR, kern/3508, at long last; it fixes various other PRs,
also.
Discussions with and ideas from Joerg Sonnenberger influenced this
work tremendously. Of course, all design oversights and bugs are
mine.
DETAILS
1 I added to each address family a pool of sockaddrs. I have
introduced routines for allocating, copying, duplicating,
and freeing sockaddrs:
struct sockaddr *sockaddr_alloc(sa_family_t af, int flags);
struct sockaddr *sockaddr_copy(struct sockaddr *dst,
const struct sockaddr *src);
struct sockaddr *sockaddr_dup(const struct sockaddr *src, int flags);
void sockaddr_free(struct sockaddr *sa);
sockaddr_alloc() returns either a sockaddr from the pool belonging
to the specified family, or NULL if the pool is exhausted. The
returned sockaddr has the right size for that family; sa_family
and sa_len fields are initialized to the family and sockaddr
length---e.g., sa_family = AF_INET and sa_len = sizeof(struct
sockaddr_in). sockaddr_free() puts the given sockaddr back into
its family's pool.
sockaddr_dup() and sockaddr_copy() work analogously to strdup()
and strcpy(), respectively. sockaddr_copy() KASSERTs that the
family of the destination and source sockaddrs are alike.
The 'flags' argument for sockaddr_alloc() and sockaddr_dup() is
passed directly to pool_get(9).
2 I added routines for initializing sockaddrs in each address
family, sockaddr_in_init(), sockaddr_in6_init(), sockaddr_iso_init(),
etc. They are fairly self-explanatory.
3 structs route_in6 and route_iso are no more. All protocol families
use struct route. I have changed the route cache, 'struct route',
so that it does not contain storage space for a sockaddr. Instead,
struct route points to a sockaddr coming from the pool the sockaddr
belongs to. I added a new method to struct route, rtcache_setdst(),
for setting the cache destination:
int rtcache_setdst(struct route *, const struct sockaddr *);
rtcache_setdst() returns 0 on success, or ENOMEM if no memory is
available to create the sockaddr storage.
It is now possible for rtcache_getdst() to return NULL if, say,
rtcache_setdst() failed. I check the return value for NULL
everywhere in the kernel.
4 Each routing domain (struct domain) has a list of live route
caches, dom_rtcache. rtflushall(sa_family_t af) looks up the
domain indicated by 'af', walks the domain's list of route caches
and invalidates each one.
2007-05-03 00:40:22 +04:00
|
|
|
error = ip6_output(m, NULL /*XXX*/, ro, 0, NULL, so, NULL);
|
1999-07-01 12:12:45 +04:00
|
|
|
break;
|
|
|
|
#endif
|
1999-07-02 16:45:32 +04:00
|
|
|
default:
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
break;
|
1999-07-01 12:12:45 +04:00
|
|
|
}
|
1999-01-24 04:19:28 +03:00
|
|
|
return (error);
|
1997-07-24 01:26:40 +04:00
|
|
|
}
|
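The "Compute the packet's checksum" step in the function above zeroes th_sum and then calls in4_cksum() over the TCP header and options. As a rough userland analogue, the sketch below computes the standard Internet checksum over an IPv4 pseudo-header plus a flat TCP segment buffer. It is an assumption-laden illustration, not the kernel's mbuf-based routine; `sum_words`, `fold`, and `tcp4_cksum` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add len bytes to a running sum, read as big-endian 16-bit words. */
static uint32_t
sum_words(uint32_t sum, const uint8_t *p, size_t len)
{
	while (len > 1) {
		sum += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len == 1)
		sum += (uint32_t)p[0] << 8;	/* odd trailing byte, zero-padded */
	return sum;
}

/* Fold the carries back into 16 bits and take the one's complement. */
static uint16_t
fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Checksum of a TCP segment (header, options, data) over IPv4.
 * The checksum field inside seg must be zero when generating.
 * The result is in host order; store it big-endian in the header.
 */
uint16_t
tcp4_cksum(const uint8_t src[4], const uint8_t dst[4],
    const uint8_t *seg, size_t len)
{
	uint32_t sum = 0;

	assert(len <= 0xffff);
	sum = sum_words(sum, src, 4);	/* pseudo-header: source address */
	sum = sum_words(sum, dst, 4);	/* pseudo-header: destination address */
	sum += 6;			/* zero byte followed by protocol (TCP = 6) */
	sum += (uint32_t)len;		/* pseudo-header: TCP length */
	sum = sum_words(sum, seg, len);
	return fold(sum);
}
```

A receiver can verify a segment by running the same function over the segment with the stored checksum left in place; a result of 0 means the checksum matches.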