add Freenix 2000 paper on m_pulldown(9), by itojun.
parent
acd533ce9b
commit
daddfe35da
|
@ -0,0 +1,108 @@
|
|||
.\" $Id: 0.t,v 1.1 2001/07/04 05:29:25 itojun Exp $
|
||||
.\"
|
||||
.EQ
|
||||
delim $$
|
||||
.EN
|
||||
.if n .ND
|
||||
.TL
|
||||
Mbuf issues in 4.4BSD IPv6/IPsec support
|
||||
.br
|
||||
\(em experiences from KAME IPv6/IPsec implementation \(em
|
||||
.AU
|
||||
Jun-ichiro itojun Hagino
|
||||
.AI
|
||||
KAME Project
|
||||
Research Laboratory, Internet Initiative Japan Inc.
|
||||
\f[CR]http://www.kame.net/\fP
|
||||
.I itojun@iijlab.net
|
||||
.AB
|
||||
The 4.4BSD network stack has made certain assumptions regarding the packets it will handle.
|
||||
In particular, 4.4BSD assumes that
|
||||
(1) the total protocol header length is shorter than or equal to MHLEN,
|
||||
usually 100 bytes, and
|
||||
(2) there are a limited number of protocol headers on a packet.
|
||||
Neither of these assumptions holds any longer,
|
||||
due to the way IPv6/IPsec specifications are written.
|
||||
.PP
|
||||
We at the KAME project
|
||||
are implementing IPv6 and IPsec support code on top of 4.4BSD.
|
||||
To cope with the problems, we have introduced the following changes:
|
||||
(1) a new function called
|
||||
.I m_pulldown,
|
||||
which adjusts the mbuf chain with a minimal number of copies/allocations, and
|
||||
(2) a new calling sequence for parsing inbound packet headers.
|
||||
These changes allow us to manipulate incoming packets in a safer,
|
||||
more efficient, and more spec-conformant way.
|
||||
The technique described in this paper is integrated into the KAME IPv6/IPsec
|
||||
stack kit, and is freely available under BSD copyright.
|
||||
The KAME codebase is being merged into NetBSD, OpenBSD and FreeBSD.
|
||||
An integration into BSD/OS is planned.
|
||||
.AE
|
||||
.\".LP
|
||||
.de PT
|
||||
.lt \\n(LLu
|
||||
.pc %
|
||||
.nr PN \\n%
|
||||
.tl '\\*(LH'\\*(CH'\\*(RH'
|
||||
.lt \\n(.lu
|
||||
..
|
||||
.\".af PN i
|
||||
.\".ce
|
||||
.\".B "TABLE OF CONTENTS"
|
||||
.\".LP
|
||||
.\".sp 1
|
||||
.\".nf
|
||||
.\".B "1. Introduction"
|
||||
.\".LP
|
||||
.\".sp .5v
|
||||
.\".nf
|
||||
.\".B "2. The \fIgprof\fP Profiler"
|
||||
.\"\0.1. Data Presentation"
|
||||
.\"\0.1.1. The Flat Profile
|
||||
.\"\0.1.2. The Call Graph Profile
|
||||
.\"\0.2 Profiling the Kernel
|
||||
.\".LP
|
||||
.\".sp .5v
|
||||
.\".nf
|
||||
.\".B "3. Using \fIgprof\fP to Improve Performance
|
||||
.\"\0.1. Using the Profiler
|
||||
.\"\0.2. An Example of Tuning
|
||||
.\".LP
|
||||
.\".sp .5v
|
||||
.\".nf
|
||||
.\".B "4. Conclusions"
|
||||
.\".LP
|
||||
.\".sp .5v
|
||||
.\".nf
|
||||
.\".B Acknowledgements
|
||||
.\".LP
|
||||
.\".sp .5v
|
||||
.\".nf
|
||||
.\".B References
|
||||
.\".af PN 1
|
||||
.ds CH
|
||||
.ds LH
|
||||
.ds RH
|
||||
.\".ds LH mbuf issues in 4.4BSD IPv6 support
|
||||
.\".ds RH Contents
|
||||
.\".bp 1
|
||||
.ds CF
|
||||
.ds LF
|
||||
.ds RF
|
||||
.\".if t .ds CF Freenix2000
|
||||
.\".if t .ds LF
|
||||
.\".if t .ds RF Jun-ichiro itojun Hagino
|
||||
.\".bp 1
|
||||
.de _d
|
||||
.if t .ta .6i 2.1i 2.6i
|
||||
.\" 2.94 went to 2.6, 3.64 to 3.30
|
||||
.if n .ta .84i 2.6i 3.30i
|
||||
..
|
||||
.de _f
|
||||
.if t .ta .5i 1.25i 2.5i
|
||||
.\" 3.5i went to 3.8i
|
||||
.if n .ta .7i 1.75i 3.8i
|
||||
..
|
||||
.nr figure 0
|
||||
.nr table 0
|
||||
.if t .2C
|
|
@ -0,0 +1,343 @@
|
|||
.\" $Id: 1.t,v 1.1 2001/07/04 05:29:25 itojun Exp $
|
||||
.\"
|
||||
.\".ds RH 4.4BSD incompatibility with IPv6/IPsec packet processing
|
||||
.NH 1
|
||||
4.4BSD incompatibility with IPv6/IPsec packet processing
|
||||
.PP
|
||||
The 4.4BSD network code holds a packet in a chain of ``mbuf'' structures.
|
||||
Mbuf structures come in three flavors:
|
||||
.IP \(sq
|
||||
non-cluster header mbuf, which holds MHLEN
|
||||
(100 bytes in a 32-bit architecture installation of 4.4BSD),
|
||||
.IP \(sq
|
||||
non-cluster data mbuf, which holds MLEN (104 bytes), and
|
||||
.IP \(sq
|
||||
cluster mbuf which holds MCLBYTES (2048 bytes).
|
||||
.LP
|
||||
We can make a chain of mbuf structures as a linked list.
|
||||
Mbuf chains will efficiently hold variable-length packet data.
|
||||
Such chains also enable us to insert or remove
|
||||
some of the packet data from the chain
|
||||
without data copies.
|
||||
.PP
|
||||
When processing inbound packets, 4.4BSD uses a function called
|
||||
.I m_pullup
|
||||
to ease the manipulation of data content in the mbufs.
|
||||
It also uses a deep function call tree for inbound packet processing.
|
||||
While these two items work just fine for traditional IPv4 processing,
|
||||
they do not work as well with IPv6 and IPsec processing.
|
||||
.NH 2
|
||||
Restrictions in 4.4BSD m_pullup
|
||||
.PP
|
||||
For input packet processing,
|
||||
the 4.4BSD network stack uses the
|
||||
.I m_pullup
|
||||
function to ease parsing efforts
|
||||
by adjusting the data content in mbufs for placement onto the continuous memory
|
||||
region.
|
||||
.I m_pullup
|
||||
is defined as follows:
|
||||
.DS
|
||||
.SM
|
||||
\f[CR]struct mbuf *
|
||||
m_pullup(m, len)
|
||||
struct mbuf *m;
|
||||
int len;\fP
|
||||
.DE
|
||||
.NL
|
||||
.I m_pullup
|
||||
will ensure that the first
|
||||
.I len
|
||||
bytes in the packet
|
||||
are placed in the continuous memory region.
|
||||
After a call to
|
||||
.I m_pullup,
|
||||
the caller can safely access the first
|
||||
.I len
|
||||
bytes of the packet, assuming that they are continuous.
|
||||
The caller can, for example, safely use pointer variables into
|
||||
the continuous region, as long as they point inside the
|
||||
.I len
|
||||
boundary.
|
||||
.PP
|
||||
.1C
|
||||
.KS
|
||||
.PS
|
||||
box wid boxwid*1.2 "IPv6 header" "next = routing"
|
||||
box same "routing header" "next = auth"
|
||||
box same "auth header" "next = TCP"
|
||||
box same "TCP header"
|
||||
box same "TCP payload"
|
||||
.PE
|
||||
.ce
|
||||
.nr figure +1
|
||||
Figure \n[figure]: IPv6 extension header chain
|
||||
.KE
|
||||
.if t .2C
|
||||
.I m_pullup
|
||||
makes certain assumptions regarding protocol headers.
|
||||
.I m_pullup
|
||||
can only take
|
||||
.I len
|
||||
up to MHLEN.
|
||||
If the total packet header length is longer than MHLEN,
|
||||
.I m_pullup
|
||||
will fail, and the result will be a loss of the packet.
|
||||
Under IPv4,
|
||||
.[
|
||||
RFC791
|
||||
.]
|
||||
the length assumption worked fine in most cases,
|
||||
since for almost every protocol, the total length of the protocol header part
|
||||
was less than MHLEN.
|
||||
Each packet has only two protocol headers, including the IPv4 header.
|
||||
For example, the total length of the protocol header part of a TCP packet
|
||||
(up to TCP data payload) is a maximum of 120 bytes.
|
||||
Typically, this length is 40 to 48 bytes.
|
||||
When an IPv4 option is present, it is stripped off before TCP
|
||||
header processing, and the maximum length passed to
|
||||
.I m_pullup
|
||||
will be 100.
|
||||
.IP 1
|
||||
The IPv4 header occupies 20 bytes.
|
||||
.IP 2
|
||||
The IPv4 option occupies 40 bytes maximum.
|
||||
It will be stripped off before we parse the TCP header.
|
||||
Also note that the use of IPv4 options is very rare.
|
||||
.IP 3
|
||||
The TCP header length is 20 bytes.
|
||||
.IP 4
|
||||
The TCP option is 40 bytes maximum.
|
||||
In most cases it is 0 to 8 bytes.
|
||||
.LP
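The arithmetic behind these figures can be checked with a small sketch.
The constant and function names below are illustrative and are not taken from the 4.4BSD sources;
the values are the ones quoted in the text, and MHLEN is assumed to be 100 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes quoted in the text; names are hypothetical. */
enum {
	MHLEN_ASSUMED = 100,	/* single non-cluster header mbuf */
	IPV4_HDR_LEN  = 20,
	IPV4_OPT_MAX  = 40,
	TCP_HDR_LEN   = 20,
	TCP_OPT_MAX   = 40
};

/* Total header bytes in front of the TCP payload. */
static size_t
tcp4_header_total(size_t ipopt_len, size_t tcpopt_len)
{
	assert(ipopt_len <= (size_t)IPV4_OPT_MAX);
	assert(tcpopt_len <= (size_t)TCP_OPT_MAX);
	return (size_t)IPV4_HDR_LEN + ipopt_len +
	    (size_t)TCP_HDR_LEN + tcpopt_len;
}

/* Nonzero if a pullup of len bytes stays within one header mbuf. */
static int
fits_in_mhlen(size_t len)
{
	return len <= (size_t)MHLEN_ASSUMED;
}
```

With no options the total is 40 bytes; with both option fields full it is 120 bytes, which no longer fits in the assumed MHLEN.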
|
||||
.PP
|
||||
IPv6 specification
|
||||
.[
|
||||
RFC2460
|
||||
.]
|
||||
and IPsec specification
|
||||
.[
|
||||
RFC2401
|
||||
.]
|
||||
allow more flexible use of protocol headers
|
||||
by introducing chained extension headers.
|
||||
With chained extension headers, each header has a ``next header'' field in it.
|
||||
A chain of headers can be made as shown
|
||||
.nr figure +1
|
||||
in Figure \n[figure].
|
||||
.nr figure -1
|
||||
The type of protocol header is determined by
|
||||
inspecting the previous protocol header.
|
||||
The specification places no restriction on the number of extension headers.
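As a rough illustration of how such a chain is followed, the sketch below walks extension headers held in one flat buffer.
It is a simplification under stated assumptions: the buffer is already continuous, and only the hop-by-hop (0), routing (43) and destination options (60) headers are handled, since they share the generic RFC 2460 layout (next header in byte 0, length in 8-octet units in byte 1, not counting the first 8 octets).
The Authentication Header encodes its length differently and is left out.
The function name and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PROTO_HOPOPTS = 0, PROTO_TCP = 6, PROTO_ROUTING = 43, PROTO_DSTOPTS = 60 };

/*
 * buf points just past the fixed IPv6 header; nxt is that header's
 * next-header value.  Returns the first protocol number that is not one
 * of the three handled extension headers, or -1 if the buffer is
 * truncated.  On success *offp is the offset of that protocol's header
 * and *countp is the number of extension headers skipped.
 */
static int
walk_ext_headers(const uint8_t *buf, size_t buflen, int nxt,
    size_t *offp, unsigned *countp)
{
	size_t off = 0;
	unsigned count = 0;

	assert(buf != NULL && offp != NULL && countp != NULL);
	while (nxt == PROTO_HOPOPTS || nxt == PROTO_ROUTING ||
	    nxt == PROTO_DSTOPTS) {
		size_t hlen;

		if (buflen - off < 8)
			return -1;	/* not even the minimal header */
		hlen = ((size_t)buf[off + 1] + 1) * 8;
		if (buflen - off < hlen)
			return -1;	/* header runs past the buffer */
		nxt = buf[off];		/* type of the following header */
		off += hlen;
		count++;
	}
	*offp = off;
	*countp = count;
	return nxt;
}
```

Nothing in the loop bounds the number of iterations other than the buffer length, which mirrors the point made in the text.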
|
||||
.PP
|
||||
Because of extension header chains, there is now no upper limit on
|
||||
protocol packet header length.
|
||||
The
|
||||
.I m_pullup
|
||||
function would impose an unnecessary restriction
on extension header processing.
|
||||
In addition,
|
||||
with the introduction of IPsec, it is now impossible to strip off extension headers
|
||||
during inbound packet processing.
|
||||
All of the data on the packet must be retained if it is to be authenticated
|
||||
using Authentication Header.
|
||||
.[
|
||||
RFC2402
|
||||
.]
|
||||
Continuing the use of
|
||||
.I m_pullup
|
||||
will limit the
|
||||
number of extension headers allowed on the packet,
|
||||
and could jeopardize the possible usefulness of IPv6 extension headers. \**
|
||||
.FS
|
||||
In IPv4 days, the IPv4 options turned out to be unusable
|
||||
due to a lack of implementation.
|
||||
This was because most commercial products simply did not support IPv4 options.
|
||||
.FE
|
||||
.PP
|
||||
Another problem related to
|
||||
.I m_pullup
|
||||
is that it tends to copy the protocol header even
|
||||
when it is unnecessary to do so.
|
||||
For example, consider the mbuf chain shown
|
||||
.nr figure +1
|
||||
in Figure \n[figure]:
|
||||
.nr figure -1
|
||||
.KS
|
||||
.PS
|
||||
define pointer { box ht boxht*1/4 }
|
||||
define payload { box }
|
||||
IP: [
|
||||
IPp: pointer
|
||||
IPd: payload with .n at bottom of IPp "IPv4"
|
||||
]
|
||||
move
|
||||
TCP: [
|
||||
TCPp: pointer
|
||||
TCPd: payload with .n at bottom of TCPp "TCP" "TCP payload"
|
||||
]
|
||||
arrow from IP.IPp.center to TCP.TCPp.center
|
||||
.PE
|
||||
.ce
|
||||
.nr figure +1
|
||||
.nr beforepullup \n[figure]
|
||||
Figure \n[figure]: mbuf chain before \fIm_pullup\fP
|
||||
.KE
|
||||
Here, the first mbuf contains an IPv4 header in the continuous region,
|
||||
and the second mbuf contains a TCP header in the continuous region.
|
||||
When we look at the content of the TCP header,
|
||||
under 4.4BSD the code will look like the following:
|
||||
.DS
|
||||
.SM
|
||||
\f[CR]struct ip *ip;
|
||||
struct tcphdr *th;
|
||||
ip = mtod(m, struct ip *);
|
||||
/* extra copy with m_pullup */
|
||||
m = m_pullup(m, iphdrlen + tcphdrlen);
|
||||
/* MUST reinit ip */
|
||||
ip = mtod(m, struct ip *);
|
||||
th = (struct tcphdr *)(mtod(m, caddr_t) + iphdrlen);\fP
|
||||
.NL
|
||||
.DE
|
||||
As a result, we will get the mbuf chain shown in
|
||||
.nr figure +1
|
||||
Figure \n[figure].
|
||||
.nr figure -1
|
||||
.KF
|
||||
.PS
|
||||
define pointer { box ht boxht*1/4 }
|
||||
define payload { box }
|
||||
IP: [
|
||||
IPp: pointer
|
||||
IPd: payload with .n at bottom of IPp "IPv4" "TCP"
|
||||
]
|
||||
move
|
||||
TCP: [
|
||||
TCPp: pointer
|
||||
TCPd: payload with .n at bottom of TCPp "TCP payload"
|
||||
]
|
||||
arrow from IP.IPp.center to TCP.TCPp.center
|
||||
.PE
|
||||
.ce
|
||||
.nr figure +1
|
||||
Figure \n[figure]: mbuf chain in figure \n[beforepullup] after \fIm_pullup\fP
|
||||
.KE
|
||||
Because
|
||||
.I m_pullup
|
||||
is only able to make a continuous
|
||||
region starting from the top of the mbuf chain,
|
||||
it copies the TCP portion in the second mbuf
|
||||
into the first mbuf.
|
||||
The copy could be avoided if
|
||||
.I m_pullup
|
||||
were clever enough
|
||||
to handle this case.
|
||||
Also, the caller side is required to reinitialize all of
|
||||
the pointers that point to the content of mbuf,
|
||||
since after
|
||||
.I m_pullup,
|
||||
the first mbuf on the chain
|
||||
.1C
|
||||
.KS
|
||||
.PS
|
||||
ellipse "\fIip6_input\fP"
|
||||
arrow
|
||||
ellipse "\fIrthdr6_input\fP"
|
||||
arrow
|
||||
ellipse "\fIah_input\fP"
|
||||
arrow "stack" "overflow"
|
||||
ellipse "\fIesp_input\fP"
|
||||
arrow
|
||||
ellipse "\fItcp_input\fP"
|
||||
.PE
|
||||
.ce
|
||||
Figure 5: an excessively deep call chain can cause kernel stack overflow
|
||||
.KE
|
||||
.if t .2C
|
||||
.LP
|
||||
can be reallocated and lives at
|
||||
a different address than before.
|
||||
While the
|
||||
.I m_pullup
|
||||
design has provided simplicity in packet parsing,
|
||||
it is disadvantageous for protocols like IPv6.
|
||||
.PP
|
||||
The problems can be summarized as follows:
|
||||
(1)
|
||||
.I m_pullup
|
||||
imposes too strong a restriction
|
||||
on the total length of the packet header (MHLEN);
|
||||
(2)
|
||||
.I m_pullup
|
||||
makes an extra copy even when this can be avoided; and
|
||||
(3)
|
||||
.I m_pullup
|
||||
requires the caller to reinitialize all of the pointers into the mbuf chain.
|
||||
.NH 2
|
||||
Protocol header processing with a deep function call chain
|
||||
.PP
|
||||
Under 4.4BSD, protocol header processing will make a chain of function calls.
|
||||
For example, if we have an IPv4 TCP packet, the following function call chain will be made
|
||||
.nr figure +1
|
||||
(see Figure \n[figure]):
|
||||
.nr figure -1
|
||||
.IP (1)
|
||||
.I ipintr
|
||||
will be called from the network software interrupt logic,
|
||||
.IP (2)
|
||||
.I ipintr
|
||||
processes the IPv4 header, then calls
|
||||
.I tcp_input.
|
||||
.\".I ipintr
|
||||
.\"can be called
|
||||
.\".I ip_input
|
||||
.\"from its functionality.
|
||||
.IP (3)
|
||||
.I tcp_input
|
||||
will process the TCP header and pass the data payload
|
||||
to the socket queues.
|
||||
.LP
|
||||
.KF
|
||||
.PS
|
||||
ellipse "\fIipintr\fP"
|
||||
arrow
|
||||
ellipse "\fItcp_input\fP"
|
||||
.PE
|
||||
.ce
|
||||
.nr figure +1
|
||||
Figure \n[figure]: function call chain in IPv4 inbound packet processing
|
||||
.KE
|
||||
.PP
|
||||
If chained extension headers are handled as described above,
|
||||
the kernel stack can overflow because of the deep function call chain, as shown in
|
||||
.nr figure +1
|
||||
Figure \n[figure].
|
||||
.nr figure -1
|
||||
.nr figure +1
|
||||
IPv6/IPsec specifications do not define any upper limit
|
||||
to the number of extension headers on a packet,
|
||||
so a malicious party can transmit a ``legal'' packet with a large number of chained
|
||||
headers in order to attack IPv6/IPsec implementations.
|
||||
We have experienced kernel stack overflow in IPsec code,
|
||||
tunnelled packet processing code, and in several other cases.
|
||||
The IPsec processing routines tend to use a large chunk of memory
|
||||
on the kernel stack, in order to hold intermediate data and the secret keys
|
||||
used for encryption. \**
|
||||
.FS
|
||||
For example, blowfish encryption processing code typically uses
|
||||
an intermediate data region of 4K or more.
|
||||
With a typical 4.4BSD installation on the i386 architecture,
|
||||
the kernel stack region occupies less than 8K bytes and does not grow on demand.
|
||||
.FE
|
||||
We cannot put the intermediate data region into a static data region outside of
|
||||
the kernel stack,
|
||||
because it would degrade performance on multiprocessors
|
||||
due to data locking.
|
||||
.PP
|
||||
Even though the IPv6 specifications do not define any restrictions
|
||||
on the number of extension headers, it may be possible
|
||||
to impose an additional restriction in an IPv6 implementation for safety.
|
||||
In any case, it is not possible to estimate the amount of
kernel stack that will be used by protocol handlers.
|
||||
We need a better calling convention for IPv6/IPsec header processing,
|
||||
regardless of the limits in the number of extension headers we may impose.
|
|
@ -0,0 +1,286 @@
|
|||
.\" $Id: 2.t,v 1.1 2001/07/04 05:29:25 itojun Exp $
|
||||
.\"
|
||||
.\".ds RH KAME approach
|
||||
.NH 1
|
||||
KAME approach
|
||||
.PP
|
||||
This section describes the approaches we at the KAME project
|
||||
took against the problems mentioned in the previous section.
|
||||
We introduce a new function called
|
||||
.I m_pulldown,
|
||||
in place of
|
||||
.I m_pullup,
|
||||
for adjusting payload data in the mbuf.
|
||||
We also change the calling sequence for the protocol input function.
|
||||
.NH 2
|
||||
What is the KAME project?
|
||||
.PP
|
||||
In the early days of IPv6/IPsec development,
|
||||
the Japanese research community felt it very important to make
|
||||
reference code available in a freely redistributable form
|
||||
for educational, research and deployment purposes.
|
||||
The KAME project is a consortium of 7 Japanese companies and
|
||||
an academic research group.
|
||||
The project aims to deliver an IPv6/IPsec reference implementation
for 4.4BSD, under a BSD license.
|
||||
The KAME project intends to deliver the most
|
||||
spec-conformant IPv6/IPsec implementation possible.
|
||||
.NH 2
|
||||
m_pulldown function
|
||||
.PP
|
||||
Here we introduce a new function,
|
||||
.I m_pulldown,
|
||||
to address the 3 problems with
|
||||
.I m_pullup
|
||||
that we have described above.
|
||||
The actual source code is included at the end of this paper.
|
||||
The function prototype is as follows:
|
||||
.DS
|
||||
.SM
|
||||
\f[CR]struct mbuf *
|
||||
m_pulldown(m, off, len, offp)
|
||||
struct mbuf *m;
|
||||
int off, len;
|
||||
int *offp;\fP
|
||||
.NL
|
||||
.DE
|
||||
.I m_pulldown
|
||||
will ensure that the data region in the mbuf chain,
|
||||
starting at
|
||||
.I off
|
||||
and ending at
|
||||
.I "off + len",
|
||||
is put into a continuous memory region.
|
||||
.I len
|
||||
must be smaller than, or equal to, MCLBYTES (2048 bytes).
|
||||
The function returns a pointer to an intermediate mbuf in the chain
|
||||
(we refer to the pointer as \fIn\fP), and puts the new offset in
|
||||
.I n
|
||||
to
|
||||
.I *offp.
|
||||
If
|
||||
.I offp
|
||||
is NULL, the resulting region can be located by
|
||||
.I "mtod(n, caddr_t)";
|
||||
if
|
||||
.I offp
|
||||
is non-null, it will be located at
|
||||
.I "mtod(n, caddr_t) + *offp".
|
||||
The mbuf prior to
|
||||
.I off
|
||||
will remain untouched,
|
||||
so it is safe to keep the pointers to the mbuf chain.
|
||||
For example, consider the mbuf chain
|
||||
.nr figure +1
|
||||
in Figure \n[figure]
|
||||
.nr figure -1
|
||||
as the input.
|
||||
.KF
|
||||
.PS
|
||||
define pointer { box ht boxht*1/4 }
|
||||
define payload { box }
|
||||
IP: [
|
||||
IPp: pointer
|
||||
IPd: payload with .n at bottom of IPp "mbuf1" "50 bytes"
|
||||
]
|
||||
move
|
||||
TCP: [
|
||||
TCPp: pointer
|
||||
TCPd: payload with .n at bottom of TCPp "mbuf2" "20 bytes"
|
||||
]
|
||||
arrow from IP.IPp.center to TCP.TCPp.center
|
||||
.PE
|
||||
.ce
|
||||
.nr figure +1
|
||||
Figure \n[figure]: mbuf chain before the call to \fIm_pulldown\fP
|
||||
.KE
|
||||
If we call
|
||||
.I m_pulldown
|
||||
with
|
||||
.I "off = 40",
|
||||
.I "len = 10",
|
||||
and a non-null
|
||||
.I offp,
|
||||
the mbuf chain will remain unchanged.
|
||||
The return value will be a pointer to mbuf1, and
|
||||
.I *offp
|
||||
will be
|
||||
filled with 40.
|
||||
If we call
|
||||
.I m_pulldown
|
||||
with
|
||||
.I "off = 40",
|
||||
.I "len = 20",
|
||||
and null
|
||||
.I offp,
|
||||
then the mbuf chain will be modified as shown
|
||||
.nr figure +1
|
||||
in Figure \n[figure],
|
||||
.nr figure -1
|
||||
by allocating a new mbuf, mbuf3,
|
||||
into the middle and moving data from both mbuf1 and mbuf2.
|
||||
The function returns a pointer to mbuf3.
|
||||
.KF
|
||||
.PS
|
||||
define pointer { box ht boxht*1/4 }
|
||||
define payload { box }
|
||||
IP: [
|
||||
IPp: pointer
|
||||
IPd: payload with .n at bottom of IPp "mbuf1" "40 bytes"
|
||||
]
|
||||
move 0.2;
|
||||
INT: [
|
||||
INTp: pointer
|
||||
INTd: payload with .n at bottom of INTp "mbuf3" "20 bytes"
|
||||
]
|
||||
move 0.2;
|
||||
TCP: [
|
||||
TCPp: pointer
|
||||
TCPd: payload with .n at bottom of TCPp "mbuf2'" "10 bytes"
|
||||
]
|
||||
arrow from IP.IPp.center to INT.INTp.center
|
||||
arrow from INT.INTp.center to TCP.TCPp.center
|
||||
.PE
|
||||
.ce
|
||||
.nr figure +1
|
||||
Figure \n[figure]: mbuf chain after call to \fIm_pulldown\fP, with \fIoff = 40\fP and \fIlen = 20\fP
|
||||
.KE
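The two calls above can be reproduced with a userland toy model.
This is only a sketch of the behavior described in the text, not the KAME source:
struct toy_mbuf and toy_pulldown are hypothetical names, the toy always allocates a fresh mbuf when the region spans mbufs, it leaves emptied mbufs in the chain, it does not free the chain on failure, and it does not model splitting an mbuf when
.I offp
is NULL and the region starts in the middle of an mbuf.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MLEN 104		/* data bytes per toy mbuf (MLEN in the text) */

struct toy_mbuf {		/* hypothetical stand-in for struct mbuf */
	struct toy_mbuf *next;
	int len;		/* valid bytes in data[] */
	unsigned char data[TOY_MLEN];
};

static struct toy_mbuf *
toy_pulldown(struct toy_mbuf *m, int off, int len, int *offp)
{
	struct toy_mbuf *n = m, *t, *nn;
	int avail, take;

	assert(off >= 0 && len > 0);
	/* Find the mbuf holding byte off; off becomes the offset inside n. */
	while (n != NULL && off >= n->len) {
		off -= n->len;
		n = n->next;
	}
	if (n == NULL)
		return NULL;

	/* Already continuous: leave the chain alone. */
	if (off + len <= n->len) {
		if (offp == NULL && off != 0)
			return NULL;	/* splitting n is not modelled */
		if (offp != NULL)
			*offp = off;
		return n;
	}

	/* The region spans mbufs; check that the chain holds len bytes. */
	if (len > TOY_MLEN)
		return NULL;
	avail = n->len - off;
	for (t = n->next; t != NULL && avail < len; t = t->next)
		avail += t->len;
	if (avail < len)
		return NULL;
	nn = calloc(1, sizeof(*nn));
	if (nn == NULL)
		return NULL;

	/* Move the tail of n, then the head of later mbufs, into nn. */
	take = n->len - off;
	memcpy(nn->data, n->data + off, (size_t)take);
	nn->len = take;
	n->len = off;
	for (t = n->next; nn->len < len; t = t->next) {
		take = len - nn->len;
		if (take > t->len)
			take = t->len;
		memcpy(nn->data + nn->len, t->data, (size_t)take);
		memmove(t->data, t->data + take, (size_t)(t->len - take));
		t->len -= take;
		nn->len += take;
	}
	nn->next = n->next;	/* link nn between n and the rest */
	n->next = nn;
	if (offp != NULL)
		*offp = 0;
	return nn;
}
```

On a 50-byte mbuf followed by a 20-byte mbuf, a call with an offset of 40 and a length of 10 returns the first mbuf unchanged with the offset set to 40, while a length of 20 yields a 40/20/10 chain with the new mbuf in the middle.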
|
||||
The
|
||||
.I m_pulldown
|
||||
function solves all 3 problems in
|
||||
.I m_pullup
|
||||
that were described in the previous section.
|
||||
.I m_pulldown
|
||||
does not copy mbufs when copying is not necessary.
|
||||
Since it does not modify the mbuf chain prior to the specified offset
|
||||
.I off,
|
||||
it is not necessary for the caller to re-initialize the pointers into the mbuf data
|
||||
region.
|
||||
With
|
||||
.I m_pullup,
|
||||
we always needed to specify the data payload length, starting from the very first byte
|
||||
in the packet.
|
||||
With
|
||||
.I m_pulldown,
|
||||
we pass
|
||||
.I off
|
||||
as the offset to the data payload we are interested in.
|
||||
This change avoids extra data manipulation when we are only interested in
|
||||
the intermediate data portion of the packet.
|
||||
It also eases the assumption regarding total packet header length.
|
||||
While
|
||||
.I m_pullup
|
||||
assumes that the total packet header length is smaller than or equal to MHLEN
|
||||
(100 bytes),
|
||||
.I m_pulldown
|
||||
assumes that the length of a single packet header is smaller than or equal to MCLBYTES
|
||||
(2048 bytes).
|
||||
With the mbuf framework this is the best we
can do, since there is no way to hold a continuous region longer than
|
||||
MCLBYTES in a standard mbuf chain.
|
||||
.NH 2
|
||||
New function prototype for inbound packet processing
|
||||
.PP
|
||||
For IPv6 processing, our code does not make a deep function call chain.
|
||||
Rather, we make a loop in the very last part of
|
||||
.I ip6_input,
|
||||
as shown in Figure 8.
|
||||
IPPROTO_DONE is a pseudo-protocol type value that identifies the end of the
|
||||
extension header chain.
|
||||
If more protocol headers exist,
|
||||
each header processing code will update the pointer variables
|
||||
and return the next extension header type.
|
||||
If the final header in the chain has been reached,
|
||||
IPPROTO_DONE is returned.
|
||||
.\" figure 8
|
||||
.nr figure +1
|
||||
With this code, we no longer have a deep call chain for IPv6/IPsec processing.
|
||||
Rather,
|
||||
.I ip6_input
|
||||
will make calls to each extension header processor
|
||||
directly.
|
||||
This avoids the possibility of overflowing the kernel stack due to multiple
|
||||
extension header processing.
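The effect on stack depth can be seen in a small userland model of the same loop.
Everything here is hypothetical: the toy protocol numbers, the two-byte toy header format (next protocol, then header length in bytes) and the names toy_sw and toy_input are inventions for illustration and do not come from the KAME code.
The point is only that each handler returns to the loop before the next one is called, so the call depth stays constant however long the chain is.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_ROUTING = 0, TOY_AH = 1, TOY_TCP = 2, TOY_NPROTO = 3, TOY_DONE = 255 };

struct toy_pkt {
	const unsigned char *buf;
	size_t len;
	int handlers_run;	/* how many handlers the loop invoked */
};

typedef int (*toy_input_fn)(struct toy_pkt *, size_t *);

/* Extension header handler: consume one header, return the next type. */
static int
toy_ext_input(struct toy_pkt *p, size_t *offp)
{
	size_t off = *offp;

	if (p->len - off < 2 || p->buf[off + 1] < 2 ||
	    p->len - off < p->buf[off + 1])
		return TOY_DONE;	/* malformed or truncated: stop */
	*offp = off + p->buf[off + 1];
	p->handlers_run++;
	return p->buf[off];
}

/* Final handler: nothing follows the upper-layer header. */
static int
toy_tcp_input(struct toy_pkt *p, size_t *offp)
{
	(void)offp;
	p->handlers_run++;
	return TOY_DONE;
}

static const toy_input_fn toy_sw[TOY_NPROTO] = {
	toy_ext_input, toy_ext_input, toy_tcp_input
};

/* Dispatch loop; returns the number of handlers run, or -1 on an unknown type. */
static int
toy_input(struct toy_pkt *p, int nxt)
{
	size_t off = 0;

	assert(p != NULL);
	p->handlers_run = 0;
	while (nxt != TOY_DONE) {
		if (nxt < 0 || nxt >= TOY_NPROTO)
			return -1;
		nxt = toy_sw[nxt](p, &off);
	}
	return p->handlers_run;
}
```

Each extension handler advances the offset by at least two bytes or stops, so the loop terminates once the buffer is exhausted.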
|
||||
.KF
|
||||
.PS
|
||||
A: ellipse "\fIip6_input\fP"
|
||||
right
|
||||
move
|
||||
move
|
||||
up
|
||||
move
|
||||
B: ellipse "\fIrthdr6_input\fP"
|
||||
move to last ellipse .s
|
||||
down
|
||||
C: ellipse "\fIah_input\fP"
|
||||
D: ellipse "\fIesp_input\fP"
|
||||
E: ellipse "\fItcp_input\fP"
|
||||
|
||||
arrow from 1/4 <A.e, A.ne> to 1/4 <B.w, B.nw>
|
||||
arrow from 1/4 <B.w, B.sw> to 1/4 <A.e, A.se>
|
||||
|
||||
arrow from 1/4 <A.e, A.ne> to 1/4 <C.w, C.nw>
|
||||
arrow from 1/4 <C.w, C.sw> to 1/4 <A.e, A.se>
|
||||
|
||||
arrow from 1/4 <A.e, A.ne> to 1/4 <D.w, D.nw>
|
||||
arrow from 1/4 <D.w, D.sw> to 1/4 <A.e, A.se>
|
||||
|
||||
arrow from 3/8 <A.e, A.ne> to 1/4 <E.w, E.nw>
|
||||
arrow from 3/8 <E.w, E.sw> to 1/4 <A.e, A.se>
|
||||
.PE
|
||||
.ce
|
||||
.nr figure +1
|
||||
Figure \n[figure]: KAME avoids a deep function call chain by making a loop in \fIip6_input\fP
|
||||
.KE
|
||||
.PP
|
||||
Regardless of the calling sequence imposed by the
|
||||
.I pr_input
|
||||
function prototype, it is important not to use up the kernel
|
||||
stack region in protocol handlers.
|
||||
Sometimes it is necessary to decrease the size of kernel stack usage
|
||||
by using pointer variables and dynamically allocated regions.
|
||||
.1C
|
||||
.KF
|
||||
.DS
|
||||
.ps 8
|
||||
.vs 9
|
||||
\f[CR]struct ip6protosw {
|
||||
int (*pr_input) __P((struct mbuf **, int *, int));
|
||||
/* and other members */
|
||||
};
|
||||
|
||||
ip6_input(m)
|
||||
struct mbuf *m;
|
||||
{
|
||||
/* in the very last part */
|
||||
extern struct ip6protosw inet6sw[];
|
||||
/* the first one in extension header chain */
|
||||
nxt = ip6.ip6_nxt;
|
||||
while (nxt != IPPROTO_DONE)
|
||||
nxt = (*inet6sw[ip6_protox[nxt]].pr_input)(&m, &off, nxt);
|
||||
}
|
||||
|
||||
/* in each header processing code */
|
||||
int
|
||||
foohdr_input(mp, offp, proto)
|
||||
struct mbuf **mp;
|
||||
int *offp;
|
||||
int proto;
|
||||
{
|
||||
/* some processing, may modify mbuf chain */
|
||||
|
||||
if (we have more header to go) {
|
||||
*mp = newm;
|
||||
*offp = nxtoff;
|
||||
return nxt;
|
||||
} else {
|
||||
m_freem(newm);
|
||||
return IPPROTO_DONE;
|
||||
}
|
||||
}\fP
|
||||
.DE
|
||||
.NL
|
||||
.ce
|
||||
Figure 8: KAME IPv6 header chain processing code.
|
||||
.KE
|
||||
.if t .2C
|
|
@ -0,0 +1,77 @@
|
|||
.\" $Id: 4.t,v 1.1 2001/07/04 05:29:25 itojun Exp $
|
||||
.\"
|
||||
.\".ds RH Alternative approaches
|
||||
.NH 1
|
||||
Alternative approaches
|
||||
.PP
|
||||
Many BSD-based IPv6 stacks have been implemented.
|
||||
While the most popular stacks include NRL, INRIA and KAME,
|
||||
dozens of other BSD-based IPv6 implementations have been made.
|
||||
This section presents alternative approaches for purposes of comparison.
|
||||
.NH 2
|
||||
NRL m_pullup2
|
||||
.PP
|
||||
The latest NRL IPv6 release copes with the
|
||||
.I m_pullup
|
||||
limitation by introducing a new function,
|
||||
.I m_pullup2.
|
||||
.I m_pullup2
|
||||
works similarly to
|
||||
.I m_pullup,
|
||||
but it allows
|
||||
.I len
|
||||
to extend up to MCLBYTES, which corresponds to 2048 bytes in a typical installation.
|
||||
When
|
||||
the
|
||||
.I len
|
||||
parameter is smaller than or equal to MHLEN,
|
||||
.I m_pullup2
|
||||
simply calls
|
||||
.I m_pullup
|
||||
from the inside.
|
||||
.PP
|
||||
While
|
||||
.I m_pullup2
|
||||
works well for packet headers up to MCLBYTES with very little change
|
||||
in code, it does not avoid making unnecessary copies.
|
||||
It also imposes restrictions on the total length of packet headers.
|
||||
The assumption here is that the total length of packet headers is less than
|
||||
MCLBYTES.
|
||||
.NH 2
|
||||
Hydrangea changes to m_devget
|
||||
.PP
|
||||
The Hydrangea IPv6 stack was implemented by a group of Japanese researchers,
|
||||
and is one of the ancestors of the KAME IPv6 stack.
|
||||
The Hydrangea IPv6 stack avoids the need for
|
||||
.I m_pullup
|
||||
by modifying the mbuf allocation policy in drivers.
|
||||
For inbound packets, the drivers allocate mbufs by using the
|
||||
.I m_devget
|
||||
function, or by re-implementing the behavior of
|
||||
.I m_devget.
|
||||
.I m_devget
|
||||
allocates mbufs as follows:
|
||||
.IP 1
|
||||
If the packet fits in MHLEN (100 bytes), allocate a single non-cluster mbuf.
|
||||
.IP 2
|
||||
If the packet is larger than MHLEN but fits in MHLEN + MLEN (204 bytes),
|
||||
allocate two non-cluster mbufs.
|
||||
.IP 3
|
||||
Otherwise, allocate multiple cluster mbufs, MCLBYTES (2048 bytes) in size.
|
||||
.LP
|
||||
For typical packets, the second case is where
|
||||
.I m_pullup
|
||||
is used.
|
||||
The Hydrangea stack avoids the use of
|
||||
.I m_pullup
|
||||
by eliminating the second case.
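The three-way allocation policy listed above can be written down as a small classifier.
This is a sketch of the policy as described in the text, not the actual
.I m_devget
code; the enum and function names are hypothetical, and the sizes are the 100, 104 and 2048 byte figures quoted earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes quoted in the text; names are hypothetical. */
enum { MHLEN_ASSUMED = 100, MLEN_ASSUMED = 104, MCLBYTES_ASSUMED = 2048 };

enum devget_layout { ONE_SMALL_MBUF, TWO_SMALL_MBUFS, CLUSTER_MBUFS };

/* Which of the three cases applies to a packet of pktlen bytes. */
static enum devget_layout
devget_layout_for(size_t pktlen)
{
	if (pktlen <= (size_t)MHLEN_ASSUMED)
		return ONE_SMALL_MBUF;
	if (pktlen <= (size_t)MHLEN_ASSUMED + (size_t)MLEN_ASSUMED)
		return TWO_SMALL_MBUFS;	/* the case Hydrangea eliminates */
	return CLUSTER_MBUFS;
}

/* Number of clusters needed when the third case applies. */
static size_t
devget_cluster_count(size_t pktlen)
{
	assert(pktlen > 0);
	return (pktlen + (size_t)MCLBYTES_ASSUMED - 1) /
	    (size_t)MCLBYTES_ASSUMED;
}
```

Packets of 101 to 204 bytes fall into the second case, which is where a header can end up split across two mbufs.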
|
||||
.PP
|
||||
This approach worked well in most cases, but failed for (1) the loopback interface,
|
||||
(2) tunnelled packets, and (3) non-conforming drivers.
|
||||
With the Hydrangea approach, every device driver had to be examined
|
||||
to ensure the new mbuf allocation policy.
|
||||
We could not be sure if the constraint was guaranteed until we checked the
|
||||
driver code,
|
||||
and the Hydrangea approach raised many support issues.
|
||||
This was one of our motivations for introducing
|
||||
.I m_pulldown.
|
|
@ -0,0 +1,382 @@
|
|||
.\" $Id: 8.t,v 1.1 2001/07/04 05:29:25 itojun Exp $
|
||||
.\"
|
||||
.\".ds RH Comparisons
|
||||
.NH 1
|
||||
Comparisons
|
||||
.PP
|
||||
This section compares the following three approaches in terms of
|
||||
their characteristics and actual behavior:
|
||||
(1) 4.4BSD
|
||||
.I m_pullup,
|
||||
(2) NRL
|
||||
.I m_pullup2,
|
||||
and (3) KAME
|
||||
.I m_pulldown.
|
||||
.LP
|
||||
.NH 2
|
||||
Comparison of assumptions
|
||||
.PP
|
||||
Table 1 shows the assumptions made by each of the three approaches.
|
||||
As mentioned earlier,
|
||||
.I m_pullup
|
||||
imposes too stringent a requirement on the total length of packet headers.
|
||||
.I m_pullup2
|
||||
is workable in most cases, although
|
||||
this approach adds more restrictions than the specification claims.
|
||||
.I m_pulldown
|
||||
assumes that a single packet header is no larger than MCLBYTES,
|
||||
but makes
|
||||
no restriction regarding the total length of packet headers.
|
||||
With a standard mbuf chain,
|
||||
this is the best
|
||||
.I m_pulldown
|
||||
can do, since there is no way to hold a continuous region longer than MCLBYTES.
|
||||
This characteristic can contribute to better specification conformance,
|
||||
since
|
||||
.I m_pulldown
|
||||
will impose fewer additional restrictions due to the
|
||||
requirements of implementation.
|
||||
.PP
|
||||
Among the three approaches, only
|
||||
.I m_pulldown
|
||||
avoids making unnecessary copies of intermediate header data and
|
||||
avoids pointer reinitialization after calls to these functions.
|
||||
These attributes result in smaller overhead during input packet processing.
|
||||
.PP
|
||||
.nr table +1
|
||||
At present,
|
||||
we know of no other 4.4BSD-based IPv6/IPsec stack that addresses kernel
|
||||
stack overflow issues,
|
||||
although we are open to
|
||||
new perspectives and new information.
|
||||
.NH 2
|
||||
Performance comparison based on simulated statistics
|
||||
.PP
|
||||
To compare the behavior and performance of
|
||||
.I m_pulldown
|
||||
against
|
||||
.I m_pullup
|
||||
and
|
||||
.I m_pullup2
|
||||
using the same set of traffic and
|
||||
mbuf chains, we have gathered simulated statistics for
|
||||
.I m_pullup
|
||||
and
|
||||
.I m_pullup2,
|
||||
inside the
|
||||
.I m_pulldown
|
||||
function.
|
||||
By running a kernel using the modified
|
||||
.I m_pulldown
|
||||
function,
|
||||
we can easily
|
||||
gather statistics for these three functions against exactly the same traffic.
|
||||
.PP
|
||||
The comparison was made on a computer
|
||||
(with Celeron 366MHz CPU, 192M bytes of memory)
|
||||
running NetBSD 1.4.1 with the KAME IPv6/IPsec stack.
|
||||
Network drivers allocate mbufs just as normal 4.4BSD does.
|
||||
.I m_pulldown
|
||||
is called whenever it is needed to ensure continuity in packet data
|
||||
during inbound packet processing.
|
||||
The role of the computer is as an end node, not a router.
|
||||
.PP
|
||||
To describe the content of the following table,
|
||||
we must look at the source code fragment.
|
||||
.nr figure +1
|
||||
Figure \n[figure]
|
||||
.nr figure -1
|
||||
shows the code fragment from our source code.
|
||||
The code fragment will
|
||||
(1) make the TCP header on the mbuf chain
|
||||
.I m
|
||||
at offset
|
||||
.I hdrlen
|
||||
continuous, and (2) point to the region with the pointer
|
||||
.I th.
|
||||
We use a macro named IP6_EXTHDR_CHECK,
|
||||
and the code before and after the macro expansion is shown in the figure.
|
||||
.KF
|
||||
.LD
|
||||
.ps 6
|
||||
.vs 7
|
||||
\f[CR]/* ensure that *th from hdrlen is continuous */
|
||||
/* before macro expansion... */
|
||||
struct tcphdr *th;
|
||||
IP6_EXTHDR_CHECK(th, struct tcphdr *, m,
|
||||
hdrlen, sizeof(*th));
|
||||
if (th == NULL)
|
||||
return; /*m is already freed*/
|
||||
|
||||
|
||||
/* after macro expansion... */
|
||||
struct tcphdr *th;
|
||||
int off;
|
||||
struct mbuf *n;
|
||||
if (m->m_len < hdrlen + sizeof(*th)) {
|
||||
n = m_pulldown(m, hdrlen, sizeof(*th), &off);
|
||||
if (n)
|
||||
th = (struct tcphdr *)(mtod(n, caddr_t) + off);
|
||||
else
|
||||
th = NULL;
|
||||
} else
|
||||
th = (struct tcphdr *)(mtod(m, caddr_t) + hdrlen);
|
||||
if (th == NULL)
|
||||
return;\fP
|
||||
.NL
|
||||
.DE
|
||||
.nr figure +1
|
||||
Figure \n[figure]: code fragment for trimming an mbuf chain.
|
||||
.KE
|
||||
In Table 2,
|
||||
the first column identifies the test case.
|
||||
The second column shows the number of times
|
||||
the IP6_EXTHDR_CHECK macro was used.
|
||||
In other words, it shows the number of times we have made checks against
|
||||
mbuf length.
|
||||
The remaining columns show, from left to right,
|
||||
the number of times memory allocation/copy was performed in each of the variants.
|
||||
In the case of
|
||||
.I m_pullup,
|
||||
we counted, in the fail column, the number of cases in which we passed
.I len
in excess of MHLEN (96 bytes in this installation).
|
||||
.\"With
|
||||
.\".I m_pullup2
|
||||
.\"and
|
||||
.\".I m_pulldown,
|
||||
.\"there were no such failures.
|
||||
No such failures occurred with
.I m_pullup2
or
.I m_pulldown,
which suggests
|
||||
that there was no packet with a packet header portion larger than
|
||||
MCLBYTES (2048 bytes).
|
||||
.\" The percentage in parentheses is ratio against the number on the first column.
|
||||
In the evaluation we have used
|
||||
.I m_pulldown
|
||||
against IPv6 traffic only.
|
||||
.1C
|
||||
.KF
|
||||
.TS
|
||||
center box;
|
||||
l cfI cfI cfI
|
||||
l c c c.
|
||||
m_pullup m_pullup2 m_pulldown
|
||||
_
|
||||
total header length MHLEN(100) MCLBYTES(2048) \(mi
|
||||
single header length \(mi \(mi MCLBYTES(2048)
|
||||
_
|
||||
T{
|
||||
avoids copy on intermediate headers
|
||||
T} no no yes
|
||||
_
|
||||
T{
|
||||
avoids pointer reinitialization
|
||||
T} no no yes
|
||||
.TE
|
||||
.ce
|
||||
Table 1: assumptions in mbuf manipulation approaches.
|
||||
.KE
|
||||
.KF
|
||||
.TS
|
||||
center box;
|
||||
c |c |cfI s s |cfI s s |cfI s
|
||||
c |r |c c c |c c c |c c
|
||||
r |r |r r r |r r r |r r.
|
||||
test len checks m_pulldown m_pullup m_pullup2
|
||||
call alloc copy alloc copy fail alloc copy
|
||||
_
|
||||
(1) 204923 1706 1595 1596 165 165 1541 1596 1596
|
||||
(2) 1063995 23786 22931 23008 1171 1229 22557 22895 22953
|
||||
(3) 520028 1245 948 957 432 432 813 945 945
|
||||
(4) 438602 180 6 6 178 178 2 24 24
|
||||
(5) 5570 2236 206 206 812 812 1424 1424 1424
|
||||
.TE
|
||||
.ce
|
||||
Table 2: number of mbuf allocations/copies against traffic
|
||||
.KE
|
||||
.KF
|
||||
.TS
|
||||
center box;
|
||||
c |c c c c |c c c
|
||||
c |r r r r |r r r.
|
||||
test IPv6 input TCP UDP ICMPv6 1 mbuf 2 mbufs ext mbuf(s)
|
||||
_
|
||||
(1) 29334 20892 2699 5739 3624 15632 10078
|
||||
(2) 313218 215919 15930 80263 38751 172976 101491
|
||||
(3) 132267 117822 8561 5882 12782 59799 59686
|
||||
(4) 73160 66512 5249 1343 7475 42053 23632
|
||||
(5) 1433 148 53 52 103 1203 127
|
||||
.TE
|
||||
.ce
|
||||
Table 3: Traffic characteristics for tests in Table 2
|
||||
.KE
|
||||
.if t .2C
|
||||
.PP
|
||||
From these measured results, we obtain several interesting observations.
|
||||
.I m_pullup
|
||||
actually failed on IPv6 traffic.
|
||||
If an IPv6 implementation uses
|
||||
.I m_pullup
|
||||
for IPv6 input processing,
|
||||
it must be coded carefully so as to avoid trying
|
||||
.I m_pullup
|
||||
against any length longer than MHLEN.
|
||||
To achieve this end, the code copies the data portion from the mbuf
|
||||
chain to a separate buffer, and the cost of memory copies becomes a penalty.
|
||||
.PP
|
||||
Due to the nature of this simulation,
|
||||
the comparison described above may contain an implicit bias.
|
||||
Since the IPv6 protocol processing code is written by using
|
||||
.I m_pulldown,
|
||||
the code is somewhat biased toward
|
||||
.I m_pulldown.
|
||||
If a programmer had to write the entire IPv6 protocol processing with
|
||||
.I m_pullup
|
||||
only, he or she would use
|
||||
.I m_copydata
|
||||
to copy intermediate
|
||||
extension headers buried deep inside the header chains,
|
||||
thus making it unnecessary to call
|
||||
.I m_pullup.
|
||||
In any case, a call to
|
||||
.I m_copydata
|
||||
will result in a data copy,
|
||||
which causes extra overhead.
|
||||
.\"The author thinks that this bias toward
|
||||
.\".I m_pulldown
|
||||
.\"is therefore negligible.
|
||||
.PP
|
||||
In all cases, the number of length checks (second column) exceeds the
|
||||
number of inbound packets.
|
||||
This behavior is the same as in the original 4.4BSD stack;
|
||||
we did not add a significant number of length checks to the code.
|
||||
This is because
|
||||
.I m_pulldown
|
||||
(or
|
||||
.I m_pullup
|
||||
in the 4.4BSD case)
|
||||
is called
|
||||
as necessary during the parsing of the headers.
|
||||
For example, to process a TCP-over-IPv6 packet, at least 3
|
||||
checks would be made against m->m_len;
|
||||
these checks would be made
|
||||
to grab the IPv6 header (40 bytes),
|
||||
to grab the TCP header (20 bytes), and to grab the TCP header
|
||||
and options (20 to 60 bytes).
|
||||
The length of the TCP option part is kept inside the TCP header,
|
||||
so the length needs to be checked twice for the TCP part.
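.PP
The three checks can be sketched in userland C as follows.
This is a minimal illustration, not the KAME code: it assumes a flat copy of
the packet with no extension headers, and the names
tcp6_length_checks, IP6_HDR_LEN, TCP_MIN_HDR_LEN and need are hypothetical.
It only computes how many bytes from the start of the packet must be
contiguous at each check; the third value depends on the data offset field
read after the second check.

```c
#include <stddef.h>
#include <stdint.h>

/* Header sizes quoted in the text. */
enum { IP6_HDR_LEN = 40, TCP_MIN_HDR_LEN = 20 };

/*
 * Fill need[0..2] with the number of bytes, counted from the start of
 * the packet, that must be contiguous at each of the three checks.
 * "pkt" is a flat copy of a TCP-over-IPv6 packet without extension
 * headers (an assumption; the kernel works on an mbuf chain).
 * Returns 0 on success, -1 if the packet is short or malformed.
 */
int
tcp6_length_checks(const uint8_t *pkt, size_t pktlen, size_t need[3])
{
	size_t thlen;

	/* check 1: the fixed IPv6 header */
	need[0] = IP6_HDR_LEN;
	if (pktlen < need[0])
		return -1;

	/* check 2: the fixed part of the TCP header */
	need[1] = IP6_HDR_LEN + TCP_MIN_HDR_LEN;
	if (pktlen < need[1])
		return -1;

	/* data offset: upper 4 bits of TCP byte 12, in 32-bit words */
	thlen = (size_t)(pkt[IP6_HDR_LEN + 12] >> 4) * 4;
	if (thlen < TCP_MIN_HDR_LEN)
		return -1;

	/* check 3: the TCP header together with its options */
	need[2] = IP6_HDR_LEN + thlen;
	if (pktlen < need[2])
		return -1;
	return 0;
}
```

The second check cannot be skipped, because the length used by the third
check is stored inside the bytes that the second check makes accessible.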
|
||||
.\"If the function call overhead is more significant than the actual
|
||||
.\".I m_pullup
|
||||
.\"or
|
||||
.\".I m_pulldown
|
||||
.\"operation,
|
||||
.\"we may be able to blindly call
|
||||
.\".I m_pulldown
|
||||
.\"with the maximum TCP option length
|
||||
.\"(60 bytes) in order to reduce the number of function calls.
|
||||
.KF
|
||||
.PS
|
||||
Ao: box invis ht boxht*2
|
||||
A: box at center of Ao "IPv6 header"
|
||||
Bo: box invis ht boxht*2
|
||||
B: box at center of Bo "TCP header" "(len)"
|
||||
Co: box invis ht boxht*2
|
||||
C: box at center of Co "TCP options"
|
||||
D: box "payload"
|
||||
|
||||
arrow from 1/3 of the way between Ao.sw and Ao.se to Ao.sw
|
||||
arrow from 2/3 of the way between Ao.sw and Ao.se to Ao.se
|
||||
line invis from Ao.sw to Ao.se "40"
|
||||
line from Ao.sw to 4/5 of the way between Ao.sw and A.sw
|
||||
line from Ao.se to 4/5 of the way between Ao.se and A.se
|
||||
|
||||
arrow from 1/3 of the way between Bo.nw and Bo.ne to Bo.nw
|
||||
arrow from 2/3 of the way between Bo.nw and Bo.ne to Bo.ne
|
||||
line invis from Bo.nw to Bo.ne "20"
|
||||
line from Bo.nw to 4/5 of the way between Bo.nw and B.nw
|
||||
line from Bo.ne to 4/5 of the way between Bo.ne and B.ne
|
||||
|
||||
arrow from 1/3 of the way between Bo.sw and Co.se to Bo.sw
|
||||
arrow from 2/3 of the way between Bo.sw and Co.se to Co.se
|
||||
line invis from Bo.sw to Co.se "20 to 60"
|
||||
line from Bo.sw to 4/5 of the way between Bo.sw and B.sw
|
||||
line from Co.se to 4/5 of the way between Co.se and C.se
|
||||
.PE
|
||||
.ce
|
||||
.nr figure +1
|
||||
Figure \n[figure]: processing a TCP-over-IPv6 packet requires 3 length checks.
|
||||
.KE
|
||||
The results suggest that we call
|
||||
.I m_pulldown
|
||||
more frequently in ICMPv6 processing than in the processing of other protocols.
|
||||
These additional calls are made for parsing of ICMPv6 and for neighbor discovery options.
|
||||
The use of the loopback interface also contributes to the use of
|
||||
.I m_pulldown.
|
||||
.PP
|
||||
In the tests, the number of copies made in the
|
||||
.I m_pullup2
|
||||
case is similar to the number made in the
|
||||
.I m_pulldown
|
||||
case.
|
||||
.I m_pulldown
|
||||
makes fewer copies than
|
||||
.I m_pullup2
|
||||
for packets such as the following:
|
||||
.IP \(sq
|
||||
A packet is kept in multiple mbufs.
|
||||
With the mbuf allocation policy in
|
||||
.I m_devget,
|
||||
we will see two mbufs holding a single packet
|
||||
if the packet is larger than MHLEN and smaller than MHLEN + MLEN,
|
||||
or the packet is larger than MCLBYTES.
|
||||
.IP \(sq
|
||||
We have extension headers in multiple mbufs.
|
||||
The header portion of the packet needs to occupy the first mbuf and
|
||||
subsequent mbufs.
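.PP
The first condition can be written as a small predicate.
The sketch below is an illustration under stated assumptions, not code from
the stack: pkt_spans_mbufs is a hypothetical name, X_MHLEN and X_MCLBYTES use
the values quoted in this paper (96 and 2048 bytes), and X_MLEN is an assumed
value, since the real size is platform dependent.

```c
#include <stddef.h>

/*
 * X_MHLEN and X_MCLBYTES follow the values quoted in the text;
 * X_MLEN is an assumption made for this sketch only.
 */
enum { X_MHLEN = 96, X_MLEN = 108, X_MCLBYTES = 2048 };

/*
 * Return 1 if, by the condition stated in the text, a packet of "len"
 * bytes is expected to be held in more than one mbuf, and 0 otherwise.
 */
int
pkt_spans_mbufs(size_t len)
{
	/* larger than MHLEN and smaller than MHLEN + MLEN */
	if (len > X_MHLEN && len < X_MHLEN + X_MLEN)
		return 1;
	/* larger than MCLBYTES */
	if (len > X_MCLBYTES)
		return 1;
	return 0;
}
```

With these assumed sizes, a 128-byte packet falls into the first range,
while a 1500-byte packet does not meet either condition.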
|
||||
.LP
|
||||
To demonstrate the difference, we have generated an IPv6 packet with a
|
||||
routing header, with 4 IPv6 addresses.
|
||||
The test result is presented as the 5th test in Table 2.
|
||||
The packet looks like
|
||||
.nr figure +1
|
||||
Figure \n[figure].
|
||||
.nr figure -1
|
||||
The first 112 bytes are occupied by an IPv6 header and a routing header,
|
||||
and the remaining 16 bytes are used for an ICMPv6 header and payload.
|
||||
The packet met the above condition, and
|
||||
.I m_pulldown
|
||||
made fewer copies than
|
||||
.I m_pullup2.
|
||||
To process the single incoming ICMPv6 packet shown in the figure,
|
||||
.I m_pullup2
|
||||
made 7 copies while
|
||||
.I m_pulldown
|
||||
made only 1 copy.
|
||||
.KF
|
||||
.LD
|
||||
.ps 6
|
||||
.vs 7
|
||||
\f[CR]node A (source) = 2001:240:0:200:260:97ff:fe07:69ea
|
||||
node B (destination) = 2001:240:0:200:a00:5aff:fe38:6f86
|
||||
17:39:43.346078 A > B:
|
||||
srcrt (type=0,segleft=4,[0]B,[1]B,[2]B,[3]B):
|
||||
icmp6: echo request (len 88, hlim 64)
|
||||
6000 0000 0058 2b40 2001 0240 0000 0200
|
||||
0260 97ff fe07 69ea 2001 0240 0000 0200
|
||||
0a00 5aff fe38 6f86 3a08 0004 0000 0000
|
||||
2001 0240 0000 0200 0a00 5aff fe38 6f86
|
||||
2001 0240 0000 0200 0a00 5aff fe38 6f86
|
||||
2001 0240 0000 0200 0a00 5aff fe38 6f86
|
||||
2001 0240 0000 0200 0a00 5aff fe38 6f86
|
||||
8000 b650 030e 00c8 ce6e fd38 d553 0700
|
||||
.DE
|
||||
.ce
|
||||
.nr figure +1
|
||||
Figure \n[figure]: Packets with IPv6 routing header.
|
||||
.KE
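.PP
The 112-byte figure can be checked from the dump itself.
The routing header starts with the bytes 3a 08 00 04: next header 0x3a
(ICMPv6), a header extension length of 8, routing type 0 and 4 segments left.
The length field counts 8-octet units beyond the first 8 octets, so the
routing header is (8 + 1) * 8 = 72 bytes, and 40 + 72 = 112.
A minimal sketch of that arithmetic follows; ip6_exthdr_len and
ip6_upper_offset are hypothetical names, and the packet is assumed to be a
flat buffer with exactly one extension header.

```c
#include <stddef.h>
#include <stdint.h>

/* Size of the fixed IPv6 header. */
enum { IP6_HDR_LEN = 40 };

/*
 * Length in bytes of an IPv6 extension header whose second byte is the
 * header extension length: 8-octet units, not counting the first 8.
 */
size_t
ip6_exthdr_len(const uint8_t *ext)
{
	return ((size_t)ext[1] + 1) * 8;
}

/*
 * Offset of the upper-layer header in a flat packet that carries
 * exactly one extension header after the fixed IPv6 header.
 */
size_t
ip6_upper_offset(const uint8_t *pkt)
{
	return IP6_HDR_LEN + ip6_exthdr_len(pkt + IP6_HDR_LEN);
}
```

For the packet in the figure, the upper-layer offset exceeds the MHLEN value
quoted earlier, which is why the header portion cannot sit in the first mbuf
alone.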
|
||||
.PP
|
||||
During the test, we experienced no kernel stack overflow,
|
||||
thanks to a new calling sequence between IPv6 protocol handlers.
|
||||
.PP
|
||||
The number of copies and mbuf allocations varies widely between tests.
|
||||
We need to investigate the traffic characteristics more carefully,
for example, the average length of the header portion in packets.
|
|
@ -0,0 +1,234 @@
|
|||
.\" $Id: 9.t,v 1.1 2001/07/04 05:29:25 itojun Exp $
|
||||
.\"
|
||||
.\".ds RH Related work
|
||||
.NH 1
|
||||
Related work
|
||||
.PP
|
||||
Van Jacobson proposed the pbuf structure \**
|
||||
.FS
|
||||
A reference should be here,
|
||||
but I'm having a hard time finding published literature for it.
|
||||
.FE
|
||||
as an alternative to the BSD mbuf structure.
|
||||
The proposal has two main arguments.
|
||||
The first is the use of a contiguous data buffer instead of chained fragments
|
||||
like mbufs.
|
||||
The second is the improvement to TCP performance by restructuring
|
||||
TCP input/output handling.
|
||||
While the latter point still holds for IPv6,
|
||||
we believe that the former point must be reviewed carefully before being used with IPv6.
|
||||
Our experience suggests that we need to insert many intermediate extension headers into
|
||||
the packet data during IPv6 outbound packet processing.
|
||||
We believe that mbuf is more suitable
|
||||
than the proposed pbuf structure for handling the packet data efficiently.
|
||||
Using pbuf may result in the making of more copies than in the mbuf case.
|
||||
.PP
|
||||
In a cross-BSD portability paper,
|
||||
.[
|
||||
metz four bsds
|
||||
.]
|
||||
Craig Metz described
|
||||
.I nbuf
|
||||
structure in the NRL IPv6/IPsec stack.
|
||||
nbuf is a wrapper structure used to unify Linux linear-buffer packet management
|
||||
and BSD mbuf structure, and is not closely related to the topic of this paper.
|
||||
The
|
||||
.I m_pullup2
|
||||
example discussed in this paper is drawn from the NRL implementation.
|
||||
.\".ds RH Conclusions
|
||||
.NH 1
|
||||
Conclusions
|
||||
.PP
|
||||
This paper discussed mbuf manipulation in a 4.4BSD-based IPv6/IPsec stack,
|
||||
namely the KAME IPv6/IPsec implementation.
|
||||
4.4BSD makes certain assumptions regarding packet header length and its format.
|
||||
For IPv6/IPsec support, we removed those assumptions from the
|
||||
4.4BSD code.
|
||||
We introduced the
|
||||
.I m_pulldown
|
||||
function and a new function call sequence for inbound packet processing.
|
||||
These innovations helped us to implement IPv6/IPsec in a very spec-conformant manner,
|
||||
with fewer implementation restrictions relative to the specifications.
|
||||
.PP
|
||||
The described code is publicly available, under a BSD-like license,
|
||||
at \f[CR]ftp://ftp.kame.net/\fP.
|
||||
The KAME IPv6/IPsec stack is being merged into 4.4BSD variants like FreeBSD,
|
||||
NetBSD and OpenBSD.
|
||||
An integration into BSD/OS is planned.
|
||||
We will be able to see official releases of these OSes with KAME code soon.
|
||||
.PP
|
||||
.\".ds RH Acknowledgements
|
||||
.NH 1
|
||||
Acknowledgements
|
||||
.PP
|
||||
The paper was made possible by the collective efforts of researchers at
|
||||
the KAME project and the WIDE project and of other IPv6 implementers at large.
|
||||
We would also like to acknowledge all four BSD groups who helped
|
||||
us improve the KAME IPv6 stack code
|
||||
by sending bug reports and improvement suggestions,
|
||||
and the Freenix reviewers who helped polish the paper.
|
||||
.[
|
||||
$LIST$
|
||||
.]
|
||||
.if t .2C
|
||||
.LD
|
||||
.ps 5
|
||||
.vs 6
|
||||
\f[CR]\s5/*
|
||||
* ensure that [off, off + len) is contiguous on the mbuf chain "m".
|
||||
* packet chain before "off" is kept untouched.
|
||||
* if offp == NULL, the target will start at <retval, 0> on resulting chain.
|
||||
* if offp != NULL, the target will start at <retval, *offp> on resulting chain.
|
||||
*
|
||||
* on error return (NULL return value), original "m" will be freed.
|
||||
*
|
||||
* XXX M_TRAILINGSPACE/M_LEADINGSPACE on shared cluster (sharedcluster)
|
||||
*/
|
||||
struct mbuf *
|
||||
m_pulldown(m, off, len, offp)
|
||||
struct mbuf *m;
|
||||
int off, len;
|
||||
int *offp;
|
||||
{
|
||||
struct mbuf *n, *o;
|
||||
int hlen, tlen, olen;
|
||||
int sharedcluster;
|
||||
|
||||
/* check invalid arguments. */
|
||||
if (m == NULL)
|
||||
panic("m == NULL in m_pulldown()");
|
||||
if (len > MCLBYTES) {
|
||||
m_freem(m);
|
||||
return NULL; /* impossible */
|
||||
}
|
||||
|
||||
n = m;
|
||||
while (n != NULL && off > 0) {
|
||||
if (n->m_len > off)
|
||||
break;
|
||||
off -= n->m_len;
|
||||
n = n->m_next;
|
||||
}
|
||||
	/* be sure to point to a non-empty mbuf */
|
||||
while (n != NULL && n->m_len == 0)
|
||||
n = n->m_next;
|
||||
if (!n) {
|
||||
m_freem(m);
|
||||
return NULL; /* mbuf chain too short */
|
||||
}
|
||||
|
||||
/*
|
||||
* the target data is on <n, off>.
|
||||
* if we got enough data on the mbuf "n", we're done.
|
||||
*/
|
||||
if ((off == 0 || offp) && len <= n->m_len - off)
|
||||
goto ok;
|
||||
|
||||
/*
|
||||
* when len < n->m_len - off and off != 0, it is a special case.
|
||||
* len bytes from <n, off> sits in single mbuf, but the caller does
|
||||
* not like the starting position (off).
|
||||
* chop the current mbuf into two pieces, set off to 0.
|
||||
*/
|
||||
if (len < n->m_len - off) {
|
||||
o = m_copym(n, off, n->m_len - off, M_DONTWAIT);
|
||||
if (o == NULL) {
|
||||
m_freem(m);
|
||||
return NULL; /* ENOBUFS */
|
||||
}
|
||||
n->m_len = off;
|
||||
o->m_next = n->m_next;
|
||||
n->m_next = o;
|
||||
n = n->m_next;
|
||||
off = 0;
|
||||
goto ok;
|
||||
}
|
||||
|
||||
/*
|
||||
* we need to take hlen from <n, off> and tlen from <n->m_next, 0>,
|
||||
* and construct contiguous mbuf with m_len == len.
|
||||
* note that hlen + tlen == len, and tlen > 0.
|
||||
*/
|
||||
hlen = n->m_len - off;
|
||||
tlen = len - hlen;
|
||||
|
||||
/*
|
||||
* ensure that we have enough trailing data on mbuf chain.
|
||||
* if not, we can do nothing about the chain.
|
||||
*/
|
||||
olen = 0;
|
||||
for (o = n->m_next; o != NULL; o = o->m_next)
|
||||
olen += o->m_len;
|
||||
if (hlen + olen < len) {
|
||||
m_freem(m);
|
||||
return NULL; /* mbuf chain too short */
|
||||
}
|
||||
|
||||
/*
|
||||
* easy cases first.
|
||||
* we need to use m_copydata() to get data from <n->m_next, 0>.
|
||||
*/
|
||||
if ((n->m_flags & M_EXT) == 0)
|
||||
sharedcluster = 0;
|
||||
else {
|
||||
if (n->m_ext.ext_free)
|
||||
sharedcluster = 1;
|
||||
else if (MCLISREFERENCED(n))
|
||||
sharedcluster = 1;
|
||||
else
|
||||
sharedcluster = 0;
|
||||
}
|
||||
if ((off == 0 || offp) && M_TRAILINGSPACE(n) >= tlen
|
||||
&& !sharedcluster) {
|
||||
m_copydata(n->m_next, 0, tlen, mtod(n, caddr_t) + n->m_len);
|
||||
n->m_len += tlen;
|
||||
m_adj(n->m_next, tlen);
|
||||
goto ok;
|
||||
}
|
||||
if ((off == 0 || offp) && M_LEADINGSPACE(n->m_next) >= hlen
|
||||
&& !sharedcluster) {
|
||||
n->m_next->m_data -= hlen;
|
||||
n->m_next->m_len += hlen;
|
||||
bcopy(mtod(n, caddr_t) + off, mtod(n->m_next, caddr_t), hlen);
|
||||
n->m_len -= hlen;
|
||||
n = n->m_next;
|
||||
off = 0;
|
||||
goto ok;
|
||||
}
|
||||
|
||||
/*
|
||||
	 * now, we need to do it the hard way.  don't m_copy as there's no room
	 * on either end.
|
||||
*/
|
||||
MGET(o, M_DONTWAIT, m->m_type);
|
||||
if (o == NULL) {
|
||||
m_freem(m);
|
||||
return NULL; /* ENOBUFS */
|
||||
}
|
||||
if (len > MHLEN) { /* use MHLEN just for safety */
|
||||
MCLGET(o, M_DONTWAIT);
|
||||
if ((o->m_flags & M_EXT) == 0) {
|
||||
m_freem(m);
|
||||
m_free(o);
|
||||
return NULL; /* ENOBUFS */
|
||||
}
|
||||
}
|
||||
/* get hlen from <n, off> into <o, 0> */
|
||||
o->m_len = hlen;
|
||||
bcopy(mtod(n, caddr_t) + off, mtod(o, caddr_t), hlen);
|
||||
n->m_len -= hlen;
|
||||
/* get tlen from <n->m_next, 0> into <o, hlen> */
|
||||
m_copydata(n->m_next, 0, tlen, mtod(o, caddr_t) + o->m_len);
|
||||
o->m_len += tlen;
|
||||
m_adj(n->m_next, tlen);
|
||||
o->m_next = n->m_next;
|
||||
n->m_next = o;
|
||||
n = o;
|
||||
off = 0;
|
||||
|
||||
ok:
|
||||
if (offp)
|
||||
*offp = off;
|
||||
return n;
|
||||
}
|
||||
.DE
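.PP
The first step of the function above, locating the pair <n, off> on the
chain, can be exercised outside the kernel.
The sketch below is a userland model under stated assumptions: struct seg is
a hypothetical stand-in for struct mbuf that keeps only the next pointer and
the length, and seg_locate is a hypothetical name.
Unlike m_pulldown, it never frees or modifies the chain.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct mbuf: only what the walk needs. */
struct seg {
	struct seg *next;	/* next segment on the chain, or NULL */
	int len;		/* number of data bytes in this segment */
};

/*
 * Locate the segment that holds byte "off" of the chain "s", mirroring
 * the first two loops of m_pulldown.  On success, return that segment
 * and store the offset inside it in *offp.  Return NULL if the chain
 * is too short; *offp is left untouched in that case.
 */
struct seg *
seg_locate(struct seg *s, int off, int *offp)
{
	/* skip whole segments that lie entirely before "off" */
	while (s != NULL && off > 0) {
		if (s->len > off)
			break;
		off -= s->len;
		s = s->next;
	}
	/* be sure to point to a non-empty segment */
	while (s != NULL && s->len == 0)
		s = s->next;
	if (s != NULL)
		*offp = off;
	return s;
}
```

For a chain of 40, 0, 72 and 16 bytes, offset 40 resolves to the start of
the 72-byte segment, because the empty segment is skipped.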
|
|
@ -0,0 +1,23 @@
|
|||
# $Id: Makefile,v 1.1 2001/07/04 05:29:25 itojun Exp $
|
||||
|
||||
DIR= papers/pulldown
|
||||
SRCS= 0.t 1.t 2.t 4.t 8.t 9.t
|
||||
MACROS= -ms
|
||||
DPSRCS= ${SRCS} refs.r Makefile
|
||||
|
||||
paper.ps: ${DPSRCS}
|
||||
${SOELIM} -I${.CURDIR} ${SRCS} | \
|
||||
${REFER} -P -S -e -p ${.CURDIR}/refs.r | \
|
||||
${PIC} | ${TBL} | ${EQN} | ${ROFF} > ${.TARGET}
|
||||
|
||||
paper.dvi: ${DPSRCS}
|
||||
${SOELIM} -I${.CURDIR} ${SRCS} | \
|
||||
${REFER} -P -S -e -p ${.CURDIR}/refs.r | \
|
||||
${PIC} | ${TBL} | ${ROFF} -Tdvi > ${.TARGET}
|
||||
|
||||
paper.txt: ${DPSRCS}
|
||||
${SOELIM} -I${.CURDIR} ${SRCS} | \
|
||||
${REFER} -P -S -e -p ${.CURDIR}/refs.r | \
|
||||
${PIC} | ${TBL} | ${EQN} -Tascii | nroff -ms > ${.TARGET}
|
||||
|
||||
.include <bsd.doc.mk>
|
|
@ -0,0 +1,166 @@
|
|||
%A S. Deering
|
||||
%A R. Hinden
|
||||
%B RFC1883
|
||||
%T Internet Protocol, Version 6 (IPv6) Specification
|
||||
%D December 1995
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc1883.txt
|
||||
|
||||
%A A. Conta
|
||||
%A S. Deering
|
||||
%B RFC1885
|
||||
%T Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification
|
||||
%D December 1995
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc1885.txt
|
||||
|
||||
%A R. Hinden
|
||||
%A S. Deering
|
||||
%B RFC2373
|
||||
%T IP Version 6 Addressing Architecture
|
||||
%D July 1998
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2373.txt
|
||||
|
||||
%A J. Postel
|
||||
%B RFC793
|
||||
%T Transmission Control Protocol
|
||||
%D Sep 1, 1981
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc793.txt
|
||||
|
||||
%A J. Postel
|
||||
%A J.K. Reynolds
|
||||
%B RFC959
|
||||
%T File Transfer Protocol
|
||||
%D Oct 1, 1985
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc959.txt
|
||||
|
||||
%A A. Durand
|
||||
%A B. Buclin
|
||||
%B RFC2546
|
||||
%T 6Bone Routing Practice
|
||||
%D March 1999
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2546.txt
|
||||
|
||||
%A S. Deering
|
||||
%A R. Hinden
|
||||
%B RFC2460
|
||||
%T Internet Protocol, Version 6 (IPv6) Specification
|
||||
%D December 1998
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2460.txt
|
||||
|
||||
%A T. Narten
|
||||
%A E. Nordmark
|
||||
%A W. Simpson
|
||||
%B RFC2461
|
||||
%T Neighbor Discovery for IP Version 6 (IPv6)
|
||||
%D December 1998
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2461.txt
|
||||
|
||||
%A S. Thomson
|
||||
%A T. Narten
|
||||
%B RFC2462
|
||||
%T IPv6 Stateless Address Autoconfiguration
|
||||
%D December 1998
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2462.txt
|
||||
|
||||
%A A. Conta
|
||||
%A S. Deering
|
||||
%B RFC2463
|
||||
%T Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification
|
||||
%D December 1998
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2463.txt
|
||||
|
||||
%A T. Bates
|
||||
%A Y. Rekhter
|
||||
%B RFC2260
|
||||
%T Scalable Support for Multi-homed Multi-provider Connectivity
|
||||
%D January 1998
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2260.txt
|
||||
|
||||
%A G. Malkin
|
||||
%A R. Minnear
|
||||
%B RFC2080
|
||||
%T RIPng for IPv6
|
||||
%D January 1997
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2080.txt
|
||||
|
||||
%A G. Malkin
|
||||
%B RFC2081
|
||||
%T RIPng Protocol Applicability Statement
|
||||
%D January 1997
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2081.txt
|
||||
|
||||
%A T. Bates
|
||||
%A R. Chandra
|
||||
%A D. Katz
|
||||
%A Y. Rekhter
|
||||
%B RFC2283
|
||||
%T Multiprotocol Extensions for BGP-4
|
||||
%D February 1998
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2283.txt
|
||||
|
||||
%A P. Marques
|
||||
%A F. Dupont
|
||||
%B RFC2545
|
||||
%T Use of BGP-4 Multiprotocol Extensions for IPv6 Inter-Domain Routing
|
||||
%D March 1999
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2545.txt
|
||||
|
||||
%A P. Ferguson
|
||||
%A D. Senie
|
||||
%B RFC2267
|
||||
%T Network Ingress Filtering: Defeating Denial of Service Attacks which employ IP Source Address Spoofing
|
||||
%D January 1998
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2267.txt
|
||||
|
||||
%A Matt Crawford
|
||||
%B draft-ietf-ipngwg-router-renum-09.txt
|
||||
%T Router Renumbering for IPv6
|
||||
%D June 1999
|
||||
%O work in progress material
|
||||
|
||||
%A R. Gilligan
|
||||
%A E. Nordmark
|
||||
%B RFC1933
|
||||
%T Transition Mechanisms for IPv6 Hosts and Routers
|
||||
%D April 1996
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc1933.txt
|
||||
|
||||
%A Erik Nordmark
|
||||
%B draft-ietf-ngtrans-siit-06.txt
|
||||
%T Stateless IP/ICMP Translator (SIIT)
|
||||
%D June 24, 1999
|
||||
%O work in progress material
|
||||
|
||||
%Q TIS
|
||||
%T TIS Gauntlet
|
||||
%O http://www.tis.com/
|
||||
|
||||
%A Marcus Ranum
|
||||
%T Firewall Toolkit (FWTK)
|
||||
%O http://www.fwtk.org/
|
||||
%D first released in October 1, 1993
|
||||
|
||||
%A John Postel
|
||||
%B RFC791
|
||||
%T Internet Protocol
|
||||
%D September 1981
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc791.txt
|
||||
|
||||
%A Stephen Kent
|
||||
%A Randall Atkinson
|
||||
%B RFC2401
|
||||
%T Security Architecture for the Internet Protocol
|
||||
%D November 1998
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2401.txt
|
||||
|
||||
%A Stephen Kent
|
||||
%A Randall Atkinson
|
||||
%B RFC2402
|
||||
%T IP Authentication Header
|
||||
%D November 1998
|
||||
%O ftp://ftp.isi.edu/in-notes/rfc2402.txt
|
||||
|
||||
%A Craig Metz
|
||||
%T Porting Kernel Code to Four BSDs and Linux
|
||||
%D June 1999
|
||||
%B 1999 USENIX annual technical conference, Freenix track
|
||||
%O http://www.usenix.org/publications/library/proceedings/usenix99/metz.html
|