NetBSD/sys/netinet6/udp6_usrreq.c

483 lines
13 KiB
C

Reduces the resources demanded by TCP sessions in TIME_WAIT state using methods called Vestigial Time-Wait (VTW) and Maximum Segment Lifetime Truncation (MSLT). MSLT and VTW were contributed by Coyote Point Systems, Inc.

Even after a TCP session enters the TIME_WAIT state, its corresponding socket and protocol control blocks (PCBs) stick around until the TCP Maximum Segment Lifetime (MSL) expires. On a host whose workload necessarily creates and closes down many TCP sockets, the sockets and PCBs for TCP sessions in TIME_WAIT state amount to many megabytes of dead weight in RAM.

Maximum Segment Lifetime Truncation (MSLT) assigns each TCP session to a class based on the nearness of the peer. Corresponding to each class is an MSL, and a session uses the MSL of its class. The classes are loopback (local host equals remote host), local (local host and remote host are on the same link/subnet), and remote (local host and remote host communicate via one or more gateways). Classes corresponding to nearer peers have lower MSLs by default: 2 seconds for loopback, 10 seconds for local, 60 seconds for remote. Loopback and local sessions expire more quickly when MSLT is used.

Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket dead weight with a compact representation of the session, called a "vestigial PCB". VTW data structures are designed to be very fast and memory-efficient: for fast insertion and lookup of vestigial PCBs, the PCBs are stored in a hash table that is designed to minimize the number of cacheline visits per lookup/insertion. The memory both for vestigial PCBs and for elements of the PCB hash table comes from fixed-size pools, and linked data structures exploit this to conserve memory by representing references with a narrow index/offset from the start of a pool instead of a pointer. When space for new vestigial PCBs runs out, VTW makes room by discarding old vestigial PCBs, oldest first. VTW cooperates with MSLT.

It may help to think of VTW as a "FIN cache" by analogy to the SYN cache.

A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT sessions as fast as it can is approximately 17% idle when VTW is active versus 0% idle when VTW is inactive. It has 103 megabytes more free RAM when VTW is active (approximately 64k vestigial PCBs are created) than when it is inactive.
2011-05-03 22:28:44 +04:00
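The class-to-MSL mapping described in the commit message above can be illustrated with a small standalone sketch. This is not the kernel's implementation: the names `mslt_class`, `mslt_classify`, and `mslt_msl_seconds` are hypothetical, and the two boolean inputs stand in for whatever peer-nearness test the kernel actually performs. Only the three classes and their default values (2, 10, and 60 seconds) come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names: the three peer classes named in the commit message. */
enum mslt_class {
	MSLT_LOOPBACK,	/* local host equals remote host */
	MSLT_LOCAL,	/* both hosts on the same link/subnet */
	MSLT_REMOTE,	/* reached via one or more gateways */
	MSLT_NCLASSES
};

/* Default MSL per class, in seconds, as stated in the commit message. */
static const unsigned mslt_default_msl[MSLT_NCLASSES] = {
	[MSLT_LOOPBACK] = 2,
	[MSLT_LOCAL] = 10,
	[MSLT_REMOTE] = 60,
};

/*
 * Pick the nearest class that applies. The boolean inputs are an
 * assumption of this sketch, not the kernel's interface.
 */
static enum mslt_class
mslt_classify(bool same_host, bool same_link)
{
	if (same_host)
		return MSLT_LOOPBACK;
	if (same_link)
		return MSLT_LOCAL;
	return MSLT_REMOTE;
}

/* Look up the MSL a session of the given class would use. */
static unsigned
mslt_msl_seconds(enum mslt_class c)
{
	assert((unsigned)c < (unsigned)MSLT_NCLASSES);
	return mslt_default_msl[c];
}
```

Under these assumptions, a session to a peer on the same subnet would be classified with `mslt_classify(false, true)` and would use the 10-second MSL instead of the 60-second remote default.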
/* $NetBSD: udp6_usrreq.c,v 1.89 2011/05/03 18:28:45 dyoung Exp $ */
/* $KAME: udp6_usrreq.c,v 1.86 2001/05/27 17:33:00 itojun Exp $ */
/*
* Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project.
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. Neither the name of the project nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*/
/*
* Copyright (c) 1982, 1986, 1989, 1993
* The Regents of the University of California. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. Neither the name of the University nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
* @(#)udp_var.h 8.1 (Berkeley) 6/10/93
*/
#include <sys/cdefs.h>
__KERNEL_RCSID(0, "$NetBSD: udp6_usrreq.c,v 1.89 2011/05/03 18:28:45 dyoung Exp $");
#include <sys/param.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/protosw.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <sys/errno.h>
#include <sys/stat.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/syslog.h>
#include <sys/sysctl.h>
#include <net/if.h>
#include <net/route.h>
#include <net/if_types.h>
#include <netinet/in.h>
#include <netinet/in_var.h>
#include <netinet/in_systm.h>
#include <netinet/ip.h>
#include <netinet/ip_var.h>
#include <netinet/in_pcb.h>
#include <netinet/udp.h>
#include <netinet/udp_var.h>
#include <netinet/ip6.h>
#include <netinet6/ip6_var.h>
#include <netinet6/in6_pcb.h>
#include <netinet/icmp6.h>
#include <netinet6/udp6_var.h>
#include <netinet6/udp6_private.h>
#include <netinet6/ip6protosw.h>
#include <netinet/in_offload.h>
#include "faith.h"
#if defined(NFAITH) && NFAITH > 0
#include <net/if_faith.h>
#endif
/*
 * UDP protocol implementation.
 * Per RFC 768, August, 1980.
 */

extern struct inpcbtable udbtable;

percpu_t *udp6stat_percpu;

static void udp6_notify(struct in6pcb *, int);
static void sysctl_net_inet6_udp6_setup(struct sysctllog **);

void
udp6_init(void)
{

	sysctl_net_inet6_udp6_setup(NULL);
}

/*
 * Notify a udp user of an asynchronous error;
 * just wake up so that he can collect error status.
 */
static void
udp6_notify(struct in6pcb *in6p, int errno)
{

	in6p->in6p_socket->so_error = errno;
	sorwakeup(in6p->in6p_socket);
	sowwakeup(in6p->in6p_socket);
}

void *
udp6_ctlinput(int cmd, const struct sockaddr *sa, void *d)
{
	struct udphdr uh;
	struct ip6_hdr *ip6;
	const struct sockaddr_in6 *sa6 = (const struct sockaddr_in6 *)sa;
	struct mbuf *m;
	int off;
	void *cmdarg;
	struct ip6ctlparam *ip6cp = NULL;
	const struct sockaddr_in6 *sa6_src = NULL;
	void (*notify)(struct in6pcb *, int) = udp6_notify;
	struct udp_portonly {
		u_int16_t uh_sport;
		u_int16_t uh_dport;
	} *uhp;

	if (sa->sa_family != AF_INET6 ||
	    sa->sa_len != sizeof(struct sockaddr_in6))
		return NULL;

	if ((unsigned)cmd >= PRC_NCMDS)
		return NULL;
	if (PRC_IS_REDIRECT(cmd))
		notify = in6_rtchange, d = NULL;
	else if (cmd == PRC_HOSTDEAD)
		d = NULL;
	else if (cmd == PRC_MSGSIZE) {
		/* special code is present, see below */
		notify = in6_rtchange;
	}
	else if (inet6ctlerrmap[cmd] == 0)
		return NULL;

	/* if the parameter is from icmp6, decode it. */
	if (d != NULL) {
		ip6cp = (struct ip6ctlparam *)d;
		m = ip6cp->ip6c_m;
		ip6 = ip6cp->ip6c_ip6;
		off = ip6cp->ip6c_off;
		cmdarg = ip6cp->ip6c_cmdarg;
		sa6_src = ip6cp->ip6c_src;
	} else {
		m = NULL;
		ip6 = NULL;
		cmdarg = NULL;
		sa6_src = &sa6_any;
		off = 0;
	}

	if (ip6) {
		/*
		 * XXX: We assume that when IPV6 is non NULL,
		 * M and OFF are valid.
		 */

		/* check if we can safely examine src and dst ports */
		if (m->m_pkthdr.len < off + sizeof(*uhp)) {
			if (cmd == PRC_MSGSIZE)
				icmp6_mtudisc_update((struct ip6ctlparam *)d, 0);
			return NULL;
		}

		memset(&uh, 0, sizeof(uh));
		m_copydata(m, off, sizeof(*uhp), (void *)&uh);

		if (cmd == PRC_MSGSIZE) {
			int valid = 0;

			/*
			 * Check to see if we have a valid UDP socket
			 * corresponding to the address in the ICMPv6 message
			 * payload.
			 */
			if (in6_pcblookup_connect(&udbtable, &sa6->sin6_addr,
			    uh.uh_dport, (const struct in6_addr *)&sa6_src->sin6_addr,
			    uh.uh_sport, 0, 0))
				valid++;
#if 0
			/*
			 * As the use of sendto(2) is fairly popular,
			 * we may want to allow non-connected pcb too.
			 * But it could be too weak against attacks...
			 * We should at least check if the local address (= s)
			 * is really ours.
			 */
			else if (in6_pcblookup_bind(&udbtable, &sa6->sin6_addr,
			    uh.uh_dport, 0))
				valid++;
#endif

			/*
			 * Depending on the value of "valid" and routing table
			 * size (mtudisc_{hi,lo}wat), we will:
			 * - recalculate the new MTU and create the
			 *   corresponding routing entry, or
			 * - ignore the MTU change notification.
			 */
			icmp6_mtudisc_update((struct ip6ctlparam *)d, valid);

			/*
			 * regardless of if we called
			 * icmp6_mtudisc_update(), we need to call
			 * in6_pcbnotify(), to notify path MTU change
			 * to the userland (RFC3542), because some
			 * unconnected sockets may share the same
			 * destination and want to know the path MTU.
			 */
		}

		(void) in6_pcbnotify(&udbtable, sa, uh.uh_dport,
		    (const struct sockaddr *)sa6_src, uh.uh_sport, cmd, cmdarg,
		    notify);
	} else {
		(void) in6_pcbnotify(&udbtable, sa, 0,
		    (const struct sockaddr *)sa6_src, 0, cmd, cmdarg, notify);
	}
	return NULL;
}

extern int udp6_sendspace;
extern int udp6_recvspace;

int
udp6_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *addr6,
    struct mbuf *control, struct lwp *l)
{
	struct in6pcb *in6p = sotoin6pcb(so);
	int error = 0;
	int s;

	/*
	 * MAPPED_ADDR implementation info:
	 *  Mapped addr support for PRU_CONTROL is not necessary.
	 *  Because typical user of PRU_CONTROL is such as ifconfig,
	 *  and they don't associate any addr to their socket.  Then
	 *  socket family is only hint about the PRU_CONTROL'ed address
	 *  family, especially when getting addrs from kernel.
	 *  So AF_INET socket need to be used to control AF_INET addrs,
	 *  and AF_INET6 socket for AF_INET6 addrs.
	 */
	if (req == PRU_CONTROL)
		return in6_control(so, (u_long)m, (void *)addr6,
		    (struct ifnet *)control, l);

	if (req == PRU_PURGEIF) {
		mutex_enter(softnet_lock);
		in6_pcbpurgeif0(&udbtable, (struct ifnet *)control);
		in6_purgeif((struct ifnet *)control);
		in6_pcbpurgeif(&udbtable, (struct ifnet *)control);
		mutex_exit(softnet_lock);
		return 0;
	}

	if (req == PRU_ATTACH)
		sosetlock(so);
	else if (in6p == NULL) {
		error = EINVAL;
		goto release;
	}

	switch (req) {
	case PRU_ATTACH:
		/*
		 * MAPPED_ADDR implementation spec:
		 *  Always attach for IPv6,
		 *  and only when necessary for IPv4.
		 */
		if (in6p != NULL) {
			error = EINVAL;
			break;
		}
		s = splsoftnet();
		error = in6_pcballoc(so, &udbtable);
		splx(s);
		if (error)
			break;
		error = soreserve(so, udp6_sendspace, udp6_recvspace);
		if (error)
			break;
		in6p = sotoin6pcb(so);
		in6p->in6p_cksum = -1;	/* just to be sure */
		break;

	case PRU_DETACH:
		in6_pcbdetach(in6p);
		break;

	case PRU_BIND:
		s = splsoftnet();
		error = in6_pcbbind(in6p, addr6, l);
		splx(s);
		break;

	case PRU_CONNECT:
		if (!IN6_IS_ADDR_UNSPECIFIED(&in6p->in6p_faddr)) {
			error = EISCONN;
			break;
		}
		s = splsoftnet();
		error = in6_pcbconnect(in6p, addr6, l);
		splx(s);
		if (error == 0)
			soisconnected(so);
		break;

	case PRU_DISCONNECT:
		if (IN6_IS_ADDR_UNSPECIFIED(&in6p->in6p_faddr)) {
			error = ENOTCONN;
			break;
		}
		s = splsoftnet();
		in6_pcbdisconnect(in6p);
		memset((void *)&in6p->in6p_laddr, 0, sizeof(in6p->in6p_laddr));
		splx(s);
		so->so_state &= ~SS_ISCONNECTED;		/* XXX */
		in6_pcbstate(in6p, IN6P_BOUND);		/* XXX */
		break;

	case PRU_SHUTDOWN:
		socantsendmore(so);
		break;

	case PRU_SEND:
		s = splsoftnet();
		error = udp6_output(in6p, m, addr6, control, l);
		splx(s);
		return error;

	case PRU_ABORT:
		soisdisconnected(so);
		in6_pcbdetach(in6p);
		break;

	case PRU_SOCKADDR:
		in6_setsockaddr(in6p, addr6);
		break;

	case PRU_PEERADDR:
		in6_setpeeraddr(in6p, addr6);
		break;

	case PRU_SENSE:
		/*
		 * stat: don't bother with a blocksize
		 */
		return 0;

	case PRU_LISTEN:
	case PRU_CONNECT2:
	case PRU_ACCEPT:
	case PRU_SENDOOB:
	case PRU_FASTTIMO:
	case PRU_SLOWTIMO:
	case PRU_PROTORCV:
	case PRU_PROTOSEND:
		error = EOPNOTSUPP;
		break;

	case PRU_RCVD:
	case PRU_RCVOOB:
		return EOPNOTSUPP;	/* do not free mbuf's */

	default:
		panic("udp6_usrreq");
	}

release:
	if (control != NULL)
		m_freem(control);
	if (m != NULL)
		m_freem(m);
	return error;
}

static int
sysctl_net_inet6_udp6_stats(SYSCTLFN_ARGS)
{

	return (NETSTAT_SYSCTL(udp6stat_percpu, UDP6_NSTATS));
}

static void
sysctl_net_inet6_udp6_setup(struct sysctllog **clog)
{

	sysctl_createv(clog, 0, NULL, NULL,
		       CTLFLAG_PERMANENT,
		       CTLTYPE_NODE, "net", NULL,
		       NULL, 0, NULL, 0,
		       CTL_NET, CTL_EOL);
	sysctl_createv(clog, 0, NULL, NULL,
		       CTLFLAG_PERMANENT,
		       CTLTYPE_NODE, "inet6", NULL,
		       NULL, 0, NULL, 0,
		       CTL_NET, PF_INET6, CTL_EOL);
	sysctl_createv(clog, 0, NULL, NULL,
		       CTLFLAG_PERMANENT,
		       CTLTYPE_NODE, "udp6",
		       SYSCTL_DESCR("UDPv6 related settings"),
		       NULL, 0, NULL, 0,
		       CTL_NET, PF_INET6, IPPROTO_UDP, CTL_EOL);

	sysctl_createv(clog, 0, NULL, NULL,
		       CTLFLAG_PERMANENT|CTLFLAG_READWRITE,
		       CTLTYPE_INT, "sendspace",
		       SYSCTL_DESCR("Default UDP send buffer size"),
		       NULL, 0, &udp6_sendspace, 0,
		       CTL_NET, PF_INET6, IPPROTO_UDP, UDP6CTL_SENDSPACE,
		       CTL_EOL);
	sysctl_createv(clog, 0, NULL, NULL,
		       CTLFLAG_PERMANENT|CTLFLAG_READWRITE,
		       CTLTYPE_INT, "recvspace",
		       SYSCTL_DESCR("Default UDP receive buffer size"),
		       NULL, 0, &udp6_recvspace, 0,
		       CTL_NET, PF_INET6, IPPROTO_UDP, UDP6CTL_RECVSPACE,
		       CTL_EOL);
	sysctl_createv(clog, 0, NULL, NULL,
		       CTLFLAG_PERMANENT|CTLFLAG_READWRITE,
		       CTLTYPE_INT, "do_loopback_cksum",
		       SYSCTL_DESCR("Perform UDP checksum on loopback"),
		       NULL, 0, &udp_do_loopback_cksum, 0,
		       CTL_NET, PF_INET6, IPPROTO_UDP, UDP6CTL_LOOPBACKCKSUM,
		       CTL_EOL);
	sysctl_createv(clog, 0, NULL, NULL,
		       CTLFLAG_PERMANENT,
		       CTLTYPE_STRUCT, "pcblist",
		       SYSCTL_DESCR("UDP protocol control block list"),
		       sysctl_inpcblist, 0, &udbtable, 0,
		       CTL_NET, PF_INET6, IPPROTO_UDP, CTL_CREATE,
		       CTL_EOL);
	sysctl_createv(clog, 0, NULL, NULL,
		       CTLFLAG_PERMANENT,
		       CTLTYPE_STRUCT, "stats",
		       SYSCTL_DESCR("UDPv6 statistics"),
		       sysctl_net_inet6_udp6_stats, 0, NULL, 0,
		       CTL_NET, PF_INET6, IPPROTO_UDP, UDP6CTL_STATS,
		       CTL_EOL);
}

void
udp6_statinc(u_int stat)
{

	KASSERT(stat < UDP6_NSTATS);
	UDP6_STATINC(stat);
}