add Freenix 2000 paper on m_pulldown(9), by itojun.

itojun 2001-07-04 05:29:25 +00:00
parent acd533ce9b
commit daddfe35da
8 changed files with 1619 additions and 0 deletions

@ -0,0 +1,108 @@
.\" $Id: 0.t,v 1.1 2001/07/04 05:29:25 itojun Exp $
.\"
.EQ
delim $$
.EN
.if n .ND
.TL
Mbuf issues in 4.4BSD IPv6/IPsec support
.br
\(em experiences from KAME IPv6/IPsec implementation \(em
.AU
Jun-ichiro itojun Hagino
.AI
KAME Project
Research Laboratory, Internet Initiative Japan Inc.
\f[CR]http://www.kame.net/\fP
.I itojun@iijlab.net
.AB
The 4.4BSD network stack has made certain assumptions regarding the packets it will handle.
In particular, 4.4BSD assumes that
(1) the total protocol header length is shorter than or equal to MHLEN,
usually 100 bytes, and
(2) there are a limited number of protocol headers on a packet.
Neither of these assumptions holds any longer,
due to the way IPv6/IPsec specifications are written.
.PP
We at the KAME project
are implementing IPv6 and IPsec support code on top of 4.4BSD.
To cope with the problems, we have introduced the following changes:
(1) a new function called
.I m_pulldown,
which adjusts the mbuf chain with a minimal number of copies/allocations, and
(2) a new calling sequence for parsing inbound packet headers.
These changes allow us to manipulate incoming packets in a safer,
more efficient, and more spec-conformant way.
The technique described in this paper is integrated into the KAME IPv6/IPsec
stack kit, and is freely available under BSD copyright.
The KAME codebase is being merged into NetBSD, OpenBSD and FreeBSD.
An integration into BSD/OS is planned.
.AE
.\".LP
.de PT
.lt \\n(LLu
.pc %
.nr PN \\n%
.tl '\\*(LH'\\*(CH'\\*(RH'
.lt \\n(.lu
..
.\".af PN i
.\".ce
.\".B "TABLE OF CONTENTS"
.\".LP
.\".sp 1
.\".nf
.\".B "1. Introduction"
.\".LP
.\".sp .5v
.\".nf
.\".B "2. The \fIgprof\fP Profiler"
.\"\0.1. Data Presentation"
.\"\0.1.1. The Flat Profile
.\"\0.1.2. The Call Graph Profile
.\"\0.2 Profiling the Kernel
.\".LP
.\".sp .5v
.\".nf
.\".B "3. Using \fIgprof\fP to Improve Performance
.\"\0.1. Using the Profiler
.\"\0.2. An Example of Tuning
.\".LP
.\".sp .5v
.\".nf
.\".B "4. Conclusions"
.\".LP
.\".sp .5v
.\".nf
.\".B Acknowledgements
.\".LP
.\".sp .5v
.\".nf
.\".B References
.\".af PN 1
.ds CH
.ds LH
.ds RH
.\".ds LH mbuf issues in 4.4BSD IPv6 support
.\".ds RH Contents
.\".bp 1
.ds CF
.ds LF
.ds RF
.\".if t .ds CF Freenix2000
.\".if t .ds LF
.\".if t .ds RF Jun-ichiro itojun Hagino
.\".bp 1
.de _d
.if t .ta .6i 2.1i 2.6i
.\" 2.94 went to 2.6, 3.64 to 3.30
.if n .ta .84i 2.6i 3.30i
..
.de _f
.if t .ta .5i 1.25i 2.5i
.\" 3.5i went to 3.8i
.if n .ta .7i 1.75i 3.8i
..
.nr figure 0
.nr table 0
.if t .2C

@ -0,0 +1,343 @@
.\" $Id: 1.t,v 1.1 2001/07/04 05:29:25 itojun Exp $
.\"
.\".ds RH 4.4BSD incompatibility with IPv6/IPsec packet processing
.NH 1
4.4BSD incompatibility with IPv6/IPsec packet processing
.PP
The 4.4BSD network code holds a packet in a chain of ``mbuf'' structures.
Mbuf structures come in three flavors:
.IP \(sq
non-cluster header mbuf, which holds MHLEN
(100 bytes on a 32-bit architecture installation of 4.4BSD),
.IP \(sq
non-cluster data mbuf, which holds MLEN (104 bytes), and
.IP \(sq
cluster mbuf which holds MCLBYTES (2048 bytes).
.LP
We can make a chain of mbuf structures as a linked list.
Mbuf chains will efficiently hold variable-length packet data.
Such chains also enable us to insert or remove
some of the packet data from the chain
without data copies.
.PP
When processing inbound packets, 4.4BSD uses a function called
.I m_pullup
to ease the manipulation of data content in the mbufs.
It also uses a deep function call tree for inbound packet processing.
While these two items work just fine for traditional IPv4 processing,
they do not work as well with IPv6 and IPsec processing.
.NH 2
Restrictions in 4.4BSD m_pullup
.PP
For input packet processing,
the 4.4BSD network stack uses the
.I m_pullup
function to ease parsing efforts
by rearranging the data in the mbufs so that it occupies a continuous memory
region.
.I m_pullup
is defined as follows:
.DS
.SM
\f[CR]struct mbuf *
m_pullup(m, len)
struct mbuf *m;
int len;\fP
.DE
.NL
.I m_pullup
will ensure that the first
.I len
bytes in the packet
are placed in a continuous memory region.
After a call to
.I m_pullup,
the caller can safely access the first
.I len
bytes of the packet, assuming that they are continuous.
The caller can, for example, safely use pointer variables into
the continuous region, as long as they point inside the
.I len
boundary.
.PP
.1C
.KS
.PS
box wid boxwid*1.2 "IPv6 header" "next = routing"
box same "routing header" "next = auth"
box same "auth header" "next = TCP"
box same "TCP header"
box same "TCP payload"
.PE
.ce
.nr figure +1
Figure \n[figure]: IPv6 extension header chain
.KE
.if t .2C
.I m_pullup
makes certain assumptions regarding protocol headers.
.I m_pullup
can only take
.I len
up to MHLEN.
If the total packet header length is longer than MHLEN,
.I m_pullup
will fail, and the result will be a loss of the packet.
Under IPv4,
.[
RFC791
.]
the length assumption worked fine in most cases,
since for almost every protocol, the total length of the protocol header part
was less than MHLEN.
Each packet has only two protocol headers, including the IPv4 header.
For example, the total length of the protocol header part of a TCP packet
(up to TCP data payload) is a maximum of 120 bytes.
Typically, this length is 40 to 48 bytes.
When an IPv4 option is present, it is stripped off before TCP
header processing, so the length passed to
.I m_pullup
never exceeds 100 bytes.
.IP 1
The IPv4 header occupies 20 bytes.
.IP 2
The IPv4 option occupies 40 bytes maximum.
It will be stripped off before we parse the TCP header.
Also note that the use of IPv4 options is very rare.
.IP 3
The TCP header length is 20 bytes.
.IP 4
The TCP option is 40 bytes maximum.
In most cases it is 0 to 8 bytes.
.LP
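.PP
The arithmetic behind these figures can be sketched as follows.
This is an illustration only, not 4.4BSD code; the macro and function names
are hypothetical, and the sizes are the ones quoted in the list above.
By those figures the worst case is 120 bytes with IPv4 options in place and
80 bytes once they are stripped, which is within MHLEN.
.DS
```c
#include <stddef.h>

/* Sizes quoted in the list above; all names here are hypothetical. */
#define SK_MHLEN	100	/* header mbuf capacity, 32-bit 4.4BSD */
#define SK_IP_HDR	20	/* IPv4 header without options */
#define SK_IP_OPT_MAX	40	/* IPv4 options, maximum */
#define SK_TCP_HDR	20	/* TCP header without options */
#define SK_TCP_OPT_MAX	40	/* TCP options, maximum */

/* Header bytes in front of the TCP payload. */
size_t
sk_tcpip_hdrlen(size_t ipopt, size_t tcpopt)
{
	return SK_IP_HDR + ipopt + SK_TCP_HDR + tcpopt;
}

/* Nonzero if a single m_pullup call could cover len bytes. */
int
sk_fits_mhlen(size_t len)
{
	return len <= SK_MHLEN;
}
```
.DE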
.PP
The IPv6 specification
.[
RFC2460
.]
and the IPsec specification
.[
RFC2401
.]
allow more flexible use of protocol headers
by introducing chained extension headers.
With chained extension headers, each header has a ``next header field'' in it.
A chain of headers can be made as shown
.nr figure +1
in Figure \n[figure].
.nr figure -1
The type of protocol header is determined by
inspecting the previous protocol header.
There is no restriction on the number of extension headers in the spec.
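.PP
To make the chaining concrete, the following minimal sketch walks a header
chain held in one flat buffer.
It is not KAME code: the function name and the SK_ constants are hypothetical,
and it assumes the layout shared by the hop-by-hop, routing and destination
options headers, in which the first byte is the next header field and the
second byte is the header length in 8-byte units, not counting the first 8 bytes.
Other extension headers, such as the Authentication Header, encode their
length differently, so the walk simply stops there.
.DS
```c
#include <stddef.h>
#include <stdint.h>

#define SK_IPV6_HDRLEN	40	/* fixed IPv6 header */
#define SK_HOPOPTS	0	/* hop-by-hop options header */
#define SK_ROUTING	43	/* routing header */
#define SK_DSTOPTS	60	/* destination options header */

/*
 * Walk the extension headers of the IPv6 packet in buf[0..len).
 * Returns the offset of the first header that is not one of the three
 * types above and stores its protocol number in *protop; returns -1
 * if the packet is truncated.
 */
long
sk_skip_exthdrs(const uint8_t *buf, size_t len, int *protop)
{
	size_t off = SK_IPV6_HDRLEN;
	int nxt;

	if (len < SK_IPV6_HDRLEN)
		return -1;
	nxt = buf[6];		/* next header field of the IPv6 header */
	while (nxt == SK_HOPOPTS || nxt == SK_ROUTING || nxt == SK_DSTOPTS) {
		size_t hlen;

		if (len - off < 2)
			return -1;
		/* byte 1 holds the length in 8-byte units, minus one unit */
		hlen = ((size_t)buf[off + 1] + 1) * 8;
		if (len - off < hlen)
			return -1;
		nxt = buf[off];	/* the type of the header that follows */
		off += hlen;
	}
	*protop = nxt;
	return (long)off;
}
```
.DE
.LP
The loop has no fixed bound on the number of headers; it ends only when it
meets a header type it does not skip or runs out of data.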
.PP
Because of extension header chains, there is now no upper limit on
protocol packet header length.
The
.I m_pullup
function would impose an unnecessary restriction
on extension header processing.
In addition,
with the introduction of IPsec, it is now impossible to strip off extension headers
during inbound packet processing.
All of the data on the packet must be retained if it is to be authenticated
using the Authentication Header.
.[
RFC2402
.]
Continuing the use of
.I m_pullup
will limit the
number of extension headers allowed on the packet,
and could jeopardize the possible usefulness of IPv6 extension headers. \**
.FS
In IPv4 days, the IPv4 options turned out to be unusable
due to a lack of implementation.
This was because most commercial products simply did not support IPv4 options.
.FE
.PP
Another problem related to
.I m_pullup
is that it tends to copy the protocol header even
when it is unnecessary to do so.
For example, consider the mbuf chain shown
.nr figure +1
in Figure \n[figure]:
.nr figure -1
.KS
.PS
define pointer { box ht boxht*1/4 }
define payload { box }
IP: [
IPp: pointer
IPd: payload with .n at bottom of IPp "IPv4"
]
move
TCP: [
TCPp: pointer
TCPd: payload with .n at bottom of TCPp "TCP" "TCP payload"
]
arrow from IP.IPp.center to TCP.TCPp.center
.PE
.ce
.nr figure +1
.nr beforepullup \n[figure]
Figure \n[figure]: mbuf chain before \fIm_pullup\fP
.KE
Here, the first mbuf contains an IPv4 header in the continuous region,
and the second mbuf contains a TCP header in the continuous region.
When we look at the content of the TCP header,
under 4.4BSD the code will look like the following:
.DS
.SM
\f[CR]struct ip *ip;
struct tcphdr *th;
ip = mtod(m, struct ip *);
/* extra copy with m_pullup */
m = m_pullup(m, iphdrlen + tcphdrlen);
if (m == NULL)
	return;	/* m_pullup freed the chain */
/* MUST reinit ip */
ip = mtod(m, struct ip *);
th = (struct tcphdr *)(mtod(m, caddr_t) + iphdrlen);\fP
.NL
.DE
As a result, we will get the mbuf chain shown in
.nr figure +1
Figure \n[figure].
.nr figure -1
.KF
.PS
define pointer { box ht boxht*1/4 }
define payload { box }
IP: [
IPp: pointer
IPd: payload with .n at bottom of IPp "IPv4" "TCP"
]
move
TCP: [
TCPp: pointer
TCPd: payload with .n at bottom of TCPp "TCP payload"
]
arrow from IP.IPp.center to TCP.TCPp.center
.PE
.ce
.nr figure +1
Figure \n[figure]: mbuf chain in figure \n[beforepullup] after \fIm_pullup\fP
.KE
Because
.I m_pullup
is only able to make a continuous
region starting from the top of the mbuf chain,
it copies the TCP portion in the second mbuf
into the first mbuf.
The copy could be avoided if
.I m_pullup
were clever enough
to handle this case.
Also, the caller side is required to reinitialize all of
the pointers that point to the content of mbuf,
since after
.I m_pullup,
the first mbuf on the chain
can be reallocated and may live at
a different address than before.
.1C
.KS
.PS
ellipse "\fIip6_input\fP"
arrow
ellipse "\fIrthdr6_input\fP"
arrow
ellipse "\fIah_input\fP"
arrow "stack" "overflow"
ellipse "\fIesp_input\fP"
arrow
ellipse "\fItcp_input\fP"
.PE
.ce
Figure 5: an excessively deep call chain can cause kernel stack overflow
.KE
.if t .2C
.LP
While the
.I m_pullup
design has provided simplicity in packet parsing,
it is disadvantageous for protocols like IPv6.
.PP
The problems can be summarized as follows:
(1)
.I m_pullup
imposes too strong a restriction
on the total length of the packet header (MHLEN);
(2)
.I m_pullup
makes an extra copy even when this can be avoided; and
(3)
.I m_pullup
requires the caller to reinitialize all of the pointers into the mbuf chain.
.NH 2
Protocol header processing with a deep function call chain
.PP
Under 4.4BSD, protocol header processing will make a chain of function calls.
For example, if we have an IPv4 TCP packet, the following function call chain will be made
.nr figure +1
(see Figure \n[figure]):
.nr figure -1
.IP (1)
.I ipintr
will be called from the network software interrupt logic,
.IP (2)
.I ipintr
processes the IPv4 header, then calls
.I tcp_input.
.\".I ipintr
.\"can be called
.\".I ip_input
.\"from its functionality.
.IP (3)
.I tcp_input
will process the TCP header and pass the data payload
to the socket queues.
.LP
.KF
.PS
ellipse "\fIipintr\fP"
arrow
ellipse "\fItcp_input\fP"
.PE
.ce
.nr figure +1
Figure \n[figure]: function call chain in IPv4 inbound packet processing
.KE
.PP
If chained extension headers are handled as described above,
the kernel stack can overflow because of the deep function call chain, as shown in
.nr figure +1
Figure \n[figure].
.nr figure -1
.nr figure +1
IPv6/IPsec specifications do not define any upper limit
to the number of extension headers on a packet,
so a malicious party can transmit a ``legal'' packet with a large number of chained
headers in order to attack IPv6/IPsec implementations.
We have experienced kernel stack overflow in IPsec code,
tunnelled packet processing code, and in several other cases.
The IPsec processing routines tend to use a large chunk of memory
on the kernel stack, in order to hold intermediate data and the secret keys
used for encryption. \**
.FS
For example, blowfish encryption processing code typically uses
an intermediate data region of 4K or more.
With a typical 4.4BSD installation on the i386 architecture,
the kernel stack region occupies less than 8K bytes and does not grow on demand.
.FE
We cannot put the intermediate data region into a static data region outside of
the kernel stack,
because it would degrade performance on multiprocessors
due to data locking.
.PP
Even though the IPv6 specifications do not define any restrictions
on the number of extension headers, it may be possible
to impose additional restrictions in an IPv6 implementation for safety.
In any case, it is not possible to estimate the amount of
kernel stack that will be used by protocol handlers.
We need a better calling convention for IPv6/IPsec header processing,
regardless of the limits in the number of extension headers we may impose.

@ -0,0 +1,286 @@
.\" $Id: 2.t,v 1.1 2001/07/04 05:29:25 itojun Exp $
.\"
.\".ds RH KAME approach
.NH 1
KAME approach
.PP
This section describes the approaches we at the KAME project
took to address the problems mentioned in the previous section.
We introduce a new function called
.I m_pulldown,
in place of
.I m_pullup,
for adjusting payload data in the mbuf.
We also change the calling sequence for the protocol input function.
.NH 2
What is the KAME project?
.PP
In the early days of IPv6/IPsec development,
the Japanese research community felt it very important to make
reference code available in a freely-redistributable form
for educational, research and deployment purposes.
The KAME project is a consortium of 7 Japanese companies and
an academic research group.
The project aims to deliver an IPv6/IPsec reference implementation
for 4.4BSD, under the BSD license.
The KAME project intends to deliver the most
spec-conformant IPv6/IPsec implementation possible.
.NH 2
m_pulldown function
.PP
Here we introduce a new function,
.I m_pulldown,
to address the 3 problems with
.I m_pullup
that we have described above.
The actual source code is included at the end of this paper.
The function prototype is as follows:
.DS
.SM
\f[CR]struct mbuf *
m_pulldown(m, off, len, offp)
struct mbuf *m;
int off, len;
int *offp;\fP
.NL
.DE
.I m_pulldown
will ensure that the data region in the mbuf chain,
starting at
.I off
and ending at
.I "off + len",
is put into a continuous memory region.
.I len
must be smaller than, or equal to, MCLBYTES (2048 bytes).
The function returns a pointer to an intermediate mbuf in the chain
(we refer to the pointer as \fIn\fP), and stores the new offset within
.I n
in
.I *offp.
If
.I offp
is NULL, the resulting region can be located by
.I "mtod(n, caddr_t)";
if
.I offp
is non-null, it will be located at
.I "mtod(n, caddr_t) + *offp".
The mbuf prior to
.I off
will remain untouched,
so it is safe to keep the pointers to the mbuf chain.
For example, consider the mbuf chain
.nr figure +1
in Figure \n[figure]
.nr figure -1
as the input.
.KF
.PS
define pointer { box ht boxht*1/4 }
define payload { box }
IP: [
IPp: pointer
IPd: payload with .n at bottom of IPp "mbuf1" "50 bytes"
]
move
TCP: [
TCPp: pointer
TCPd: payload with .n at bottom of TCPp "mbuf2" "20 bytes"
]
arrow from IP.IPp.center to TCP.TCPp.center
.PE
.ce
.nr figure +1
Figure \n[figure]: mbuf chain before the call to \fIm_pulldown\fP
.KE
If we call
.I m_pulldown
with
.I "off = 40",
.I "len = 10",
and a non-null
.I offp,
the mbuf chain will remain unchanged.
The return value will be a pointer to mbuf1, and
.I *offp
will be
filled with 40.
If we call
.I m_pulldown
with
.I "off = 40",
.I "len = 20",
and a null
.I offp,
then the mbuf chain will be modified as shown
.nr figure +1
in Figure \n[figure],
.nr figure -1
by allocating a new mbuf, mbuf3,
into the middle and moving data from both mbuf1 and mbuf2.
The function returns a pointer to mbuf3.
.KF
.PS
define pointer { box ht boxht*1/4 }
define payload { box }
IP: [
IPp: pointer
IPd: payload with .n at bottom of IPp "mbuf1" "40 bytes"
]
move 0.2;
INT: [
INTp: pointer
INTd: payload with .n at bottom of INTp "mbuf3" "20 bytes"
]
move 0.2;
TCP: [
TCPp: pointer
TCPd: payload with .n at bottom of TCPp "mbuf2'" "10 bytes"
]
arrow from IP.IPp.center to INT.INTp.center
arrow from INT.INTp.center to TCP.TCPp.center
.PE
.ce
.nr figure +1
Figure \n[figure]: mbuf chain after call to \fIm_pulldown\fP, with \fIoff = 40\fP and \fIlen = 20\fP
.KE
The
.I m_pulldown
function solves all 3 problems in
.I m_pullup
that were described in the previous section.
.I m_pulldown
does not copy mbufs when copying is not necessary.
Since it does not modify the mbuf chain prior to the specified offset
.I off,
it is not necessary for the caller to re-initialize the pointers into the mbuf data
region.
With
.I m_pullup,
we always needed to specify the data payload length, starting from the very first byte
in the packet.
With
.I m_pulldown,
we pass
.I off
as the offset to the data payload we are interested in.
This change avoids extra data manipulation when we are only interested in
the intermediate data portion of the packet.
It also eases the assumption regarding total packet header length.
While
.I m_pullup
assumes that the total packet header length is smaller than or equal to MHLEN
(100 bytes),
.I m_pulldown
assumes that the length of a single packet header is smaller than or equal to MCLBYTES
(2048 bytes).
With the mbuf framework this is the best we
can do, since there is no way to hold a continuous region longer than
MCLBYTES in a standard mbuf chain.
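.PP
The case in which no copy is needed can be sketched with a toy model.
The code below is not the KAME implementation: struct sk_mbuf and sk_locate
are hypothetical stand-ins that model only the lookup, and they return NULL
where the real function would allocate a new mbuf and copy.
With the 50-byte and 20-byte chain used in the example above,
off = 40 and len = 10 yields the first mbuf with offset 40,
while off = 40 and len = 20 crosses the mbuf boundary and falls outside
this fast path.
.DS
```c
#include <stddef.h>

/* Toy stand-in for struct mbuf: only the fields the sketch needs. */
struct sk_mbuf {
	struct sk_mbuf *m_next;
	char *m_data;
	int m_len;
};

/*
 * Find the mbuf that already holds [off, off + len) in one piece.
 * Returns that mbuf and stores the offset within it in *offp.
 * Returns NULL if the range crosses an mbuf boundary or runs past the
 * end of the chain.  The chain itself is never modified.
 */
struct sk_mbuf *
sk_locate(struct sk_mbuf *m, int off, int len, int *offp)
{
	if (off < 0 || len < 0)
		return NULL;
	/* skip the mbufs that lie entirely before off */
	while (m != NULL && off >= m->m_len) {
		off -= m->m_len;
		m = m->m_next;
	}
	if (m == NULL || len > m->m_len - off)
		return NULL;
	*offp = off;
	return m;
}
```
.DE
.LP
Because the lookup never touches the mbufs in front of the requested offset,
pointers into them stay valid.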
.NH 2
New function prototype for inbound packet processing
.PP
For IPv6 processing, our code does not make a deep function call chain.
Rather, we make a loop in the very last part of
.I ip6_input,
as shown in Figure 8.
IPPROTO_DONE is a pseudo-protocol type value that identifies the end of the
extension header chain.
If more protocol headers exist,
each header processing code will update the pointer variables
and return the next extension header type.
If the final header in the chain has been reached,
IPPROTO_DONE is returned.
.\" figure 8
.nr figure +1
With this code, we no longer have a deep call chain for IPv6/IPsec processing.
Rather,
.I ip6_input
will make calls to each extension header processor
directly.
This avoids the possibility of overflowing the kernel stack due to multiple
extension header processing.
.KF
.PS
A: ellipse "\fIip6_input\fP"
right
move
move
up
move
B: ellipse "\fIrthdr6_input\fP"
move to last ellipse .s
down
C: ellipse "\fIah_input\fP"
D: ellipse "\fIesp_input\fP"
E: ellipse "\fItcp_input\fP"
arrow from 1/4 <A.e, A.ne> to 1/4 <B.w, B.nw>
arrow from 1/4 <B.w, B.sw> to 1/4 <A.e, A.se>
arrow from 1/4 <A.e, A.ne> to 1/4 <C.w, C.nw>
arrow from 1/4 <C.w, C.sw> to 1/4 <A.e, A.se>
arrow from 1/4 <A.e, A.ne> to 1/4 <D.w, D.nw>
arrow from 1/4 <D.w, D.sw> to 1/4 <A.e, A.se>
arrow from 3/8 <A.e, A.ne> to 1/4 <E.w, E.nw>
arrow from 3/8 <E.w, E.sw> to 1/4 <A.e, A.se>
.PE
.ce
.nr figure +1
Figure \n[figure]: KAME avoids a deep function call chain by making a loop in \fIip6_input\fP
.KE
.PP
Regardless of the calling sequence imposed by the
.I pr_input
function prototype, it is important not to use up the kernel
stack region in protocol handlers.
Sometimes it is necessary to decrease the size of kernel stack usage
by using pointer variables and dynamically allocated regions.
.1C
.KF
.DS
.ps 8
.vs 9
\f[CR]struct ip6protosw {
int (*pr_input) __P((struct mbuf **, int *, int));
/* and other members */
};
ip6_input(m)
struct mbuf *m;
{
/* in the very last part */
extern struct ip6protosw inet6sw[];
/* the first one in extension header chain */
nxt = ip6.ip6_nxt;
while (nxt != IPPROTO_DONE)
nxt = (*inet6sw[ip6_protox[nxt]].pr_input)(&m, &off, nxt);
}
/* in each header processing code */
int
foohdr_input(mp, offp, proto)
struct mbuf **mp;
int *offp;
int proto;
{
/* some processing, may modify mbuf chain */
if (we have more header to go) {
*mp = newm;
*offp = nxtoff;
return nxt;
} else {
m_freem(newm);
return IPPROTO_DONE;
}
}\fP
.DE
.NL
.ce
Figure 8: KAME IPv6 header chain processing code.
.KE
.if t .2C
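.PP
The loop in Figure 8 can be tried outside the kernel with a toy model.
The following sketch is an assumption-laden illustration, not KAME code:
the packet structure, the handler type, the toy header layout
(next type in byte 0, header length in byte 1) and the SK_PROTO_DONE value
are all hypothetical.
It shows only the control flow, in which every handler returns to the loop
before the next one is called.
.DS
```c
#include <stddef.h>

#define SK_PROTO_DONE	257	/* hypothetical end marker, outside 0..255 */
#define SK_NPROTO	256

struct sk_packet {
	const unsigned char *data;
	size_t len;
};

/* A handler consumes one header at *offp and returns the next type. */
typedef int (*sk_input_fn)(struct sk_packet *, size_t *, int);

/* Toy header: byte 0 = next header type, byte 1 = header length. */
int
sk_generic_input(struct sk_packet *p, size_t *offp, int proto)
{
	size_t hlen;
	int nxt;

	(void)proto;
	if (p->len - *offp < 2)
		return SK_PROTO_DONE;
	hlen = p->data[*offp + 1];
	if (hlen < 2 || p->len - *offp < hlen)
		return SK_PROTO_DONE;
	nxt = p->data[*offp];
	*offp += hlen;
	return nxt;
}

/*
 * Dispatch loop: stack depth does not grow with the number of
 * headers.  Returns the number of handlers that ran.
 */
int
sk_input_loop(struct sk_packet *p, sk_input_fn table[SK_NPROTO], int nxt)
{
	size_t off = 0;
	int calls = 0;

	while (nxt != SK_PROTO_DONE) {
		if (nxt < 0 || nxt >= SK_NPROTO || table[nxt] == NULL)
			break;	/* no handler registered: stop parsing */
		nxt = table[nxt](p, &off, nxt);
		calls++;
	}
	return calls;
}
```
.DE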

@ -0,0 +1,77 @@
.\" $Id: 4.t,v 1.1 2001/07/04 05:29:25 itojun Exp $
.\"
.\".ds RH Alternative approaches
.NH 1
Alternative approaches
.PP
Many BSD-based IPv6 stacks have been implemented.
While the most popular stacks include NRL, INRIA and KAME,
dozens of other BSD-based IPv6 implementations have been made.
This section presents alternative approaches for purposes of comparison.
.NH 2
NRL m_pullup2
.PP
The latest NRL IPv6 release copes with the
.I m_pullup
limitation by introducing a new function,
.I m_pullup2.
.I m_pullup2
works similarly to
.I m_pullup,
but it allows
.I len
to extend up to MCLBYTES, which corresponds to 2048 bytes in a typical installation.
When
the
.I len
parameter is smaller than or equal to MHLEN,
.I m_pullup2
simply calls
.I m_pullup
from the inside.
.PP
While
.I m_pullup2
works well for packet headers up to MCLBYTES with very little change
in code, it does not avoid making unnecessary copies.
It also imposes restrictions on the total length of packet headers.
The assumption here is that the total length of packet headers is less than
MCLBYTES.
.NH 2
Hydrangea changes to m_devget
.PP
The Hydrangea IPv6 stack was implemented by a group of Japanese researchers,
and is one of the ancestors of the KAME IPv6 stack.
The Hydrangea IPv6 stack avoids the need for
.I m_pullup
by modifying the mbuf allocation policy in drivers.
For inbound packets, the drivers allocate mbufs by using the
.I m_devget
function, or by re-implementing the behavior of
.I m_devget.
.I m_devget
allocates mbufs as follows:
.IP 1
If the packet fits in MHLEN (100 bytes), allocate a single non-cluster mbuf.
.IP 2
If the packet is larger than MHLEN but fits in MHLEN + MLEN (204 bytes),
allocate two non-cluster mbufs.
.IP 3
Otherwise, allocate multiple cluster mbufs, MCLBYTES (2048 bytes) in size.
.LP
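.PP
The three cases can be written down as a small classifier.
This is a sketch of the policy as stated in the list above, not the
.I m_devget
source; the function names and the SK_ constants are hypothetical, and the
sizes are those of a 32-bit installation.
.DS
```c
#include <stddef.h>

#define SK_MHLEN	100	/* header mbuf capacity */
#define SK_MLEN		104	/* data mbuf capacity */
#define SK_MCLBYTES	2048	/* cluster capacity */

enum sk_layout { SK_ONE_MBUF, SK_TWO_MBUFS, SK_CLUSTERS };

/* Which of the three cases applies to a packet of pktlen bytes. */
enum sk_layout
sk_devget_layout(size_t pktlen)
{
	if (pktlen <= SK_MHLEN)
		return SK_ONE_MBUF;
	if (pktlen <= SK_MHLEN + SK_MLEN)
		return SK_TWO_MBUFS;
	return SK_CLUSTERS;
}

/* Number of cluster mbufs needed in the third case. */
size_t
sk_cluster_count(size_t pktlen)
{
	return (pktlen + SK_MCLBYTES - 1) / SK_MCLBYTES;
}
```
.DE
.LP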
For typical packets, the second case is where
.I m_pullup
is used.
The Hydrangea stack avoids the use of
.I m_pullup
by eliminating the second case.
.PP
This approach worked well in most cases, but failed for (1) the loopback interface,
(2) tunnelled packets, and (3) non-conforming drivers.
With the Hydrangea approach, every device driver had to be examined
to ensure that it followed the new mbuf allocation policy.
We could not be sure if the constraint was guaranteed until we checked the
driver code,
and the Hydrangea approach raised many support issues.
This was one of our motivations for introducing
.I m_pulldown.

@ -0,0 +1,382 @@
.\" $Id: 8.t,v 1.1 2001/07/04 05:29:25 itojun Exp $
.\"
.\".ds RH Comparisons
.NH 1
Comparisons
.PP
This section compares the following three approaches in terms of
their characteristics and actual behavior:
(1) 4.4BSD
.I m_pullup,
(2) NRL
.I m_pullup2,
and (3) KAME
.I m_pulldown.
.LP
.NH 2
Comparison of assumptions
.PP
Table 1 shows the assumptions made by each of the three approaches.
As mentioned earlier,
.I m_pullup
imposes too stringent a requirement on the total length of packet headers.
.I m_pullup2
is workable in most cases, although
this approach adds more restrictions than the specification claims.
.I m_pulldown
assumes that a single packet header is no larger than MCLBYTES,
but makes
no restriction regarding the total length of packet headers.
With a standard mbuf chain,
this is the best
.I m_pulldown
can do, since there is no way to hold a continuous region longer than MCLBYTES.
This characteristic can contribute to better specification conformance,
since
.I m_pulldown
will impose fewer additional restrictions due to the
requirements of implementation.
.PP
Among the three approaches, only
.I m_pulldown
avoids making unnecessary copies of intermediate header data and
avoids pointer reinitialization after calls to these functions.
These attributes result in smaller overhead during input packet processing.
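.PP
The length assumptions of the three approaches can be restated as a predicate.
The sketch below is only a restatement for illustration; the enum, the
function name and the SK_ constants are hypothetical.
For a packet whose headers total 112 bytes, with the largest single header
being 72 bytes, the predicate rejects
.I m_pullup
and accepts the other two.
.DS
```c
#include <stddef.h>

#define SK_MHLEN	100
#define SK_MCLBYTES	2048

enum sk_approach { SK_PULLUP, SK_PULLUP2, SK_PULLDOWN };

/*
 * Nonzero if the approach can make every header continuous, given the
 * total length of all headers and the length of the largest one.
 */
int
sk_can_parse(enum sk_approach a, size_t total, size_t largest)
{
	switch (a) {
	case SK_PULLUP:		/* limited by the total header length */
		return total <= SK_MHLEN;
	case SK_PULLUP2:	/* same kind of limit, raised to a cluster */
		return total <= SK_MCLBYTES;
	case SK_PULLDOWN:	/* only a single header must fit */
		return largest <= SK_MCLBYTES;
	}
	return 0;
}
```
.DE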
.PP
.nr table +1
At present,
we know of no other 4.4BSD-based IPv6/IPsec stack that addresses kernel
stack overflow issues,
although we are open to
new perspectives and new information.
.NH 2
Performance comparison based on simulated statistics
.PP
To compare the behavior and performance of
.I m_pulldown
against
.I m_pullup
and
.I m_pullup2
using the same set of traffic and
mbuf chains, we have gathered simulated statistics for
.I m_pullup
and
.I m_pullup2,
inside the
.I m_pulldown
function.
By running a kernel using the modified
.I m_pulldown
function,
we can easily
gather statistics for these three functions against exactly the same traffic.
.PP
The comparison was made on a computer
(with Celeron 366MHz CPU, 192M bytes of memory)
running NetBSD 1.4.1 with the KAME IPv6/IPsec stack.
Network drivers allocate mbufs just as normal 4.4BSD does.
.I m_pulldown
is called whenever it is needed to ensure continuity in packet data
during inbound packet processing.
The role of the computer is as an end node, not a router.
.PP
To describe the content of the following table,
we must look at the source code fragment.
.nr figure +1
Figure \n[figure]
.nr figure -1
shows the code fragment from our source code.
The code fragment will
(1) make the TCP header on the mbuf chain
.I m
at offset
.I hdrlen
continuous, and (2) point to the region with the pointer
.I th.
We use a macro named IP6_EXTHDR_CHECK,
and the code before and after the macro expansion is shown in the figure.
.KF
.LD
.ps 6
.vs 7
\f[CR]/* ensure that *th from hdrlen is continuous */
/* before macro expansion... */
struct tcphdr *th;
IP6_EXTHDR_CHECK(th, struct tcphdr *, m,
hdrlen, sizeof(*th));
if (th == NULL)
return; /*m is already freed*/
/* after macro expansion... */
struct tcphdr *th;
int off;
struct mbuf *n;
if (m->m_len < hdrlen + sizeof(*th)) {
n = m_pulldown(m, hdrlen, sizeof(*th), &off);
if (n)
th = (struct tcphdr *)(mtod(n, caddr_t) + off);
else
th = NULL;
} else
th = (struct tcphdr *)(mtod(m, caddr_t) + hdrlen);
if (th == NULL)
return;\fP
.NL
.DE
.nr figure +1
Figure \n[figure]: code fragment for trimming mbuf chain.
.KE
In Table 2,
the first column identifies the test case.
The second column shows the number of times
the IP6_EXTHDR_CHECK macro was used.
In other words, it shows the number of times we have made checks against
mbuf length.
The remaining columns show, from left to right,
the number of times memory allocation/copy was performed in each of the variants.
In the case of
.I m_pullup,
we counted the number of cases we passed
.I len
in excess of MHLEN (96 bytes in this installation).
.\"With
.\".I m_pullup2
.\"and
.\".I m_pulldown,
.\"there were no such failures.
This result suggests
that there was no packet with a packet header portion larger than
MCLBYTES (2048 bytes).
.\" The percentage in parentheses is ratio against the number on the first column.
In the evaluation we have used
.I m_pulldown
against IPv6 traffic only.
.1C
.KF
.TS
center box;
l cfI cfI cfI
l c c c.
m_pullup m_pullup2 m_pulldown
_
total header length MHLEN(100) MCLBYTES(2048) \(mi
single header length \(mi \(mi MCLBYTES(2048)
_
T{
avoids copy on intermediate headers
T} no no yes
_
T{
avoids pointer reinitialization
T} no no yes
.TE
.ce
Table 1: assumptions in mbuf manipulation approaches.
.KE
.KF
.TS
center box;
c |c |cfI s s |cfI s s |cfI s
c |r |c c c |c c c |c c
r |r |r r r |r r r |r r.
test len checks m_pulldown m_pullup m_pullup2
call alloc copy alloc copy fail alloc copy
_
(1) 204923 1706 1595 1596 165 165 1541 1596 1596
(2) 1063995 23786 22931 23008 1171 1229 22557 22895 22953
(3) 520028 1245 948 957 432 432 813 945 945
(4) 438602 180 6 6 178 178 2 24 24
(5) 5570 2236 206 206 812 812 1424 1424 1424
.TE
.ce
Table 2: number of mbuf allocation/copy against traffic
.KE
.KF
.TS
center box;
c |c c c c |c c c
c |r r r r |r r r.
test IPv6 input TCP UDP ICMPv6 1 mbuf 2 mbufs ext mbuf(s)
_
(1) 29334 20892 2699 5739 3624 15632 10078
(2) 313218 215919 15930 80263 38751 172976 101491
(3) 132267 117822 8561 5882 12782 59799 59686
(4) 73160 66512 5249 1343 7475 42053 23632
(5) 1433 148 53 52 103 1203 127
.TE
.ce
Table 3: Traffic characteristics for tests in Table 2
.KE
.if t .2C
.PP
From these measured results, we obtain several interesting observations.
.I m_pullup
actually failed on IPv6 traffic.
If an IPv6 implementation uses
.I m_pullup
for IPv6 input processing,
it must be coded carefully so as to avoid trying
.I m_pullup
against any length longer than MHLEN.
To achieve this end, the code copies the data portion from the mbuf
chain to a separate buffer, and the cost of memory copies becomes a penalty.
.PP
Due to the nature of this simulation,
the comparison described above may contain an implicit bias.
Since the IPv6 protocol processing code is written by using
.I m_pulldown,
the code is somewhat biased toward
.I m_pulldown.
If a programmer had to write the entire IPv6 protocol processing with
.I m_pullup
only, he or she would use
.I m_copydata
to copy intermediate
extension headers buried deep inside the header chains,
thus making it unnecessary to call
.I m_pullup.
In any case, a call to
.I m_copydata
will result in a data copy,
which causes extra overhead.
.\"The author thinks that this bias toward
.\".I m_pulldown
.\"is therefore negligible.
.PP
In all cases, the number of length checks (second column) exceeds the
number of inbound packets.
This behavior is the same as in the original 4.4BSD stack;
we did not add a significant number of length checks to the code.
This is because
.I m_pulldown
(or
.I m_pullup
in the 4.4BSD case)
is called
as necessary during the parsing of the headers.
For example, to process a TCP-over-IPv6 packet, at least 3
checks would be made against m->m_len;
these checks would be made
to grab the IPv6 header (40 bytes),
to grab the TCP header (20 bytes), and to grab the TCP header
and options (20 to 60 bytes).
The length of the TCP option part is kept inside the TCP header,
so the length needs to be checked twice for the TCP part.
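.PP
The three checks can be illustrated on a packet held in one flat buffer.
This is a simplified sketch, not the KAME input path: it assumes that no
extension headers are present, and the function name and SK_ constants are
hypothetical.
The TCP header length is read from the data offset field, the upper 4 bits
of byte 12 of the TCP header, which counts 32-bit words.
.DS
```c
#include <stddef.h>
#include <stdint.h>

#define SK_IP6_HDRLEN	40
#define SK_TCP_HDRLEN	20

/*
 * Run the three length checks on a TCP-over-IPv6 packet in
 * buf[0..len).  Returns the offset of the TCP payload, or -1 if any
 * check fails.
 */
long
sk_tcp6_payload_off(const uint8_t *buf, size_t len)
{
	size_t thlen;

	/* check 1: the IPv6 header */
	if (len < SK_IP6_HDRLEN)
		return -1;
	/* check 2: the fixed part of the TCP header */
	if (len - SK_IP6_HDRLEN < SK_TCP_HDRLEN)
		return -1;
	/* the header length is known only after check 2 */
	thlen = (size_t)(buf[SK_IP6_HDRLEN + 12] >> 4) * 4;
	if (thlen < SK_TCP_HDRLEN)
		return -1;
	/* check 3: the TCP header together with its options */
	if (len - SK_IP6_HDRLEN < thlen)
		return -1;
	return (long)(SK_IP6_HDRLEN + thlen);
}
```
.DE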
.\"If the function call overhead is more significant than the actual
.\".I m_pullup
.\"or
.\".I m_pulldown
.\"operation,
.\"we may be able to blindly call
.\".I m_pulldown
.\"with the maximum TCP option length
.\"(60 bytes) in order to reduce the number of function calls.
.KF
.PS
Ao: box invis ht boxht*2
A: box at center of Ao "IPv6 header"
Bo: box invis ht boxht*2
B: box at center of Bo "TCP header" "(len)"
Co: box invis ht boxht*2
C: box at center of Co "TCP options"
D: box "payload"
arrow from 1/3 of the way between Ao.sw and Ao.se to Ao.sw
arrow from 2/3 of the way between Ao.sw and Ao.se to Ao.se
line invis from Ao.sw to Ao.se "40"
line from Ao.sw to 4/5 of the way between Ao.sw and A.sw
line from Ao.se to 4/5 of the way between Ao.se and A.se
arrow from 1/3 of the way between Bo.nw and Bo.ne to Bo.nw
arrow from 2/3 of the way between Bo.nw and Bo.ne to Bo.ne
line invis from Bo.nw to Bo.ne "20"
line from Bo.nw to 4/5 of the way between Bo.nw and B.nw
line from Bo.ne to 4/5 of the way between Bo.ne and B.ne
arrow from 1/3 of the way between Bo.sw and Co.se to Bo.sw
arrow from 2/3 of the way between Bo.sw and Co.se to Co.se
line invis from Bo.sw to Co.se "20 to 60"
line from Bo.sw to 4/5 of the way between Bo.sw and B.sw
line from Co.se to 4/5 of the way between Co.se and C.se
.PE
.ce
.nr figure +1
Figure \n[figure]: processing a TCP-over-IPv6 packet requires 3 length checks.
.KE
The results suggest that we call
.I m_pulldown
more frequently in ICMPv6 processing than in the processing of other protocols.
These additional calls are made for parsing of ICMPv6 and for neighbor discovery options.
The use of the loopback interface also contributes to the use of
.I m_pulldown.
.PP
In the tests, the number of copies made in the
.I m_pullup2
case is similar to the number made in the
.I m_pulldown
case.
.I m_pulldown
makes fewer copies than
.I m_pullup2
for packets such as the following:
.IP \(sq
The packet is held in multiple mbufs.
With the mbuf allocation policy in
.I m_devget,
a single packet is held in multiple mbufs
if it is larger than MHLEN and smaller than MHLEN + MLEN,
or if it is larger than MCLBYTES.
.IP \(sq
The extension headers span multiple mbufs.
The header portion of the packet must occupy the first mbuf and
continue into subsequent mbufs.
.LP
To demonstrate the difference, we have generated an IPv6 packet with a
routing header carrying 4 IPv6 addresses.
The test result is presented as the 5th test in Table 2.
The packet looks like the one in
.nr figure +1
Figure \n[figure].
.nr figure -1
The first 112 bytes are occupied by an IPv6 header and a routing header,
and the remaining 16 bytes are used for an ICMPv6 header and payload.
The packet met the above condition, and
.I m_pulldown
made fewer copies than
.I m_pullup2.
To process single incoming ICMPv6 packet shown in the figure,
.I m_pullup2
made 7 copies while
.I m_pulldown
made only 1 copy.
.KF
.LD
.ps 6
.vs 7
\f[CR]node A (source) = 2001:240:0:200:260:97ff:fe07:69ea
node B (destination) = 2001:240:0:200:a00:5aff:fe38:6f86
17:39:43.346078 A > B:
srcrt (type=0,segleft=4,[0]B,[1]B,[2]B,[3]B):
icmp6: echo request (len 88, hlim 64)
6000 0000 0058 2b40 2001 0240 0000 0200
0260 97ff fe07 69ea 2001 0240 0000 0200
0a00 5aff fe38 6f86 3a08 0004 0000 0000
2001 0240 0000 0200 0a00 5aff fe38 6f86
2001 0240 0000 0200 0a00 5aff fe38 6f86
2001 0240 0000 0200 0a00 5aff fe38 6f86
2001 0240 0000 0200 0a00 5aff fe38 6f86
8000 b650 030e 00c8 ce6e fd38 d553 0700
.DE
.ce
.nr figure +1
Figure \n[figure]: a packet with an IPv6 routing header.
.KE
.PP
During the test, we experienced no kernel stack overflow,
thanks to a new calling sequence between IPv6 protocol handlers.
.PP
The numbers of copies and mbuf allocations vary considerably between tests.
We need to investigate traffic characteristics more carefully,
for example the average length of the header portion of packets.

@ -0,0 +1,234 @@
.\" $Id: 9.t,v 1.1 2001/07/04 05:29:25 itojun Exp $
.\"
.\".ds RH Related work
.NH 1
Related work
.PP
Van Jacobson proposed the pbuf structure \**
.FS
A reference should be here,
but I'm having a hard time finding published literature for it.
.FE
as an alternative to the BSD mbuf structure.
The proposal makes two main points.
The first is the use of a contiguous data buffer instead of chained fragments
like mbufs.
The second is the improvement of TCP performance by restructuring
TCP input/output handling.
While the latter point still holds for IPv6,
we believe that the former point must be reviewed carefully before being applied to IPv6.
Our experience suggests that we need to insert many intermediate extension headers into
the packet data during IPv6 outbound packet processing.
With a chain, a new header can be linked in as a separate mbuf,
whereas a contiguous buffer has to be shifted or reallocated to make room.
We believe that mbuf is more suitable
than the proposed pbuf structure for handling the packet data efficiently.
Using pbuf may result in more copies being made than in the mbuf case.
.PP
In a cross-BSD portability paper,
.[
metz four bsds
.]
Craig Metz described the
.I nbuf
structure in the NRL IPv6/IPsec stack.
nbuf is a wrapper structure used to unify Linux linear-buffer packet management
and the BSD mbuf structure, and is not closely related to the topic of this paper.
The
.I m_pullup2
example discussed in this paper is drawn from the NRL implementation.
.\".ds RH Conclusions
.NH 1
Conclusions
.PP
This paper discussed mbuf manipulation in a 4.4BSD-based IPv6/IPsec stack,
namely the KAME IPv6/IPsec implementation.
4.4BSD makes certain assumptions regarding packet header length and format.
For IPv6/IPsec support, we removed those assumptions from the
4.4BSD code.
We introduced the
.I m_pulldown
function and a new function call sequence for inbound packet processing.
These changes helped us to implement IPv6/IPsec in a spec-conformant manner,
with fewer implementation restrictions relative to the specifications.
.PP
The described code is publicly available, under a BSD-like license,
at \f[CR]ftp://ftp.kame.net/\fP.
The KAME IPv6/IPsec stack is being merged into 4.4BSD variants such as FreeBSD,
NetBSD and OpenBSD.
An integration into BSD/OS is planned.
Official releases of these operating systems with KAME code are expected soon.
.PP
.\".ds RH Acknowledgements
.NH 1
Acknowledgements
.PP
The paper was made possible by the collective efforts of researchers at
the KAME project and the WIDE project and of other IPv6 implementers at large.
We would also like to acknowledge all four BSD groups, who helped
us improve the KAME IPv6 stack code
by sending bug reports and improvement suggestions,
and the Freenix reviewers, who helped polish the paper.
.[
$LIST$
.]
.if t .2C
.LD
.ps 5
.vs 6
\f[CR]\s5/*
 * ensure that [off, off + len) is contiguous on the mbuf chain "m".
 * packet chain before "off" is kept untouched.
 * if offp == NULL, the target will start at <retval, 0> on resulting chain.
 * if offp != NULL, the target will start at <retval, *offp> on resulting chain.
 *
 * on error return (NULL return value), original "m" will be freed.
 *
 * XXX M_TRAILINGSPACE/M_LEADINGSPACE on shared cluster (sharedcluster)
 */
struct mbuf *
m_pulldown(m, off, len, offp)
	struct mbuf *m;
	int off, len;
	int *offp;
{
	struct mbuf *n, *o;
	int hlen, tlen, olen;
	int sharedcluster;
	/* check invalid arguments. */
	if (m == NULL)
		panic("m == NULL in m_pulldown()");
	if (len > MCLBYTES) {
		m_freem(m);
		return NULL;	/* impossible */
	}
	n = m;
	while (n != NULL && off > 0) {
		if (n->m_len > off)
			break;
		off -= n->m_len;
		n = n->m_next;
	}
	/* be sure to point non-empty mbuf */
	while (n != NULL && n->m_len == 0)
		n = n->m_next;
	if (!n) {
		m_freem(m);
		return NULL;	/* mbuf chain too short */
	}
	/*
	 * the target data is on <n, off>.
	 * if we got enough data on the mbuf "n", we're done.
	 */
	if ((off == 0 || offp) && len <= n->m_len - off)
		goto ok;
	/*
	 * when len < n->m_len - off and off != 0, it is a special case.
	 * len bytes from <n, off> sits in single mbuf, but the caller does
	 * not like the starting position (off).
	 * chop the current mbuf into two pieces, set off to 0.
	 */
	if (len < n->m_len - off) {
		o = m_copym(n, off, n->m_len - off, M_DONTWAIT);
		if (o == NULL) {
			m_freem(m);
			return NULL;	/* ENOBUFS */
		}
		n->m_len = off;
		o->m_next = n->m_next;
		n->m_next = o;
		n = n->m_next;
		off = 0;
		goto ok;
	}
	/*
	 * we need to take hlen from <n, off> and tlen from <n->m_next, 0>,
	 * and construct contiguous mbuf with m_len == len.
	 * note that hlen + tlen == len, and tlen > 0.
	 */
	hlen = n->m_len - off;
	tlen = len - hlen;
	/*
	 * ensure that we have enough trailing data on mbuf chain.
	 * if not, we can do nothing about the chain.
	 */
	olen = 0;
	for (o = n->m_next; o != NULL; o = o->m_next)
		olen += o->m_len;
	if (hlen + olen < len) {
		m_freem(m);
		return NULL;	/* mbuf chain too short */
	}
	/*
	 * easy cases first.
	 * we need to use m_copydata() to get data from <n->m_next, 0>.
	 */
	if ((n->m_flags & M_EXT) == 0)
		sharedcluster = 0;
	else {
		if (n->m_ext.ext_free)
			sharedcluster = 1;
		else if (MCLISREFERENCED(n))
			sharedcluster = 1;
		else
			sharedcluster = 0;
	}
	if ((off == 0 || offp) && M_TRAILINGSPACE(n) >= tlen
	 && !sharedcluster) {
		m_copydata(n->m_next, 0, tlen, mtod(n, caddr_t) + n->m_len);
		n->m_len += tlen;
		m_adj(n->m_next, tlen);
		goto ok;
	}
	if ((off == 0 || offp) && M_LEADINGSPACE(n->m_next) >= hlen
	 && !sharedcluster) {
		n->m_next->m_data -= hlen;
		n->m_next->m_len += hlen;
		bcopy(mtod(n, caddr_t) + off, mtod(n->m_next, caddr_t), hlen);
		n->m_len -= hlen;
		n = n->m_next;
		off = 0;
		goto ok;
	}
	/*
	 * now, we need to do the hard way.  don't m_copy as there's no room
	 * on both end.
	 */
	MGET(o, M_DONTWAIT, m->m_type);
	if (o == NULL) {
		m_freem(m);
		return NULL;	/* ENOBUFS */
	}
	if (len > MHLEN) {	/* use MHLEN just for safety */
		MCLGET(o, M_DONTWAIT);
		if ((o->m_flags & M_EXT) == 0) {
			m_freem(m);
			m_free(o);
			return NULL;	/* ENOBUFS */
		}
	}
	/* get hlen from <n, off> into <o, 0> */
	o->m_len = hlen;
	bcopy(mtod(n, caddr_t) + off, mtod(o, caddr_t), hlen);
	n->m_len -= hlen;
	/* get tlen from <n->m_next, 0> into <o, hlen> */
	m_copydata(n->m_next, 0, tlen, mtod(o, caddr_t) + o->m_len);
	o->m_len += tlen;
	m_adj(n->m_next, tlen);
	o->m_next = n->m_next;
	n->m_next = o;
	n = o;
	off = 0;
ok:
	if (offp)
		*offp = off;
	return n;
}
.DE

@ -0,0 +1,23 @@
# $Id: Makefile,v 1.1 2001/07/04 05:29:25 itojun Exp $
DIR= papers/pulldown
SRCS= 0.t 1.t 2.t 4.t 8.t 9.t
MACROS= -ms
DPSRCS= ${SRCS} refs.r Makefile
paper.ps: ${DPSRCS}
	${SOELIM} -I${.CURDIR} ${SRCS} | \
	    ${REFER} -P -S -e -p ${.CURDIR}/refs.r | \
	    ${PIC} | ${TBL} | ${EQN} | ${ROFF} > ${.TARGET}
paper.dvi: ${DPSRCS}
	${SOELIM} -I${.CURDIR} ${SRCS} | \
	    ${REFER} -P -S -e -p ${.CURDIR}/refs.r | \
	    ${PIC} | ${TBL} | ${ROFF} -Tdvi > ${.TARGET}
paper.txt: ${DPSRCS}
	${SOELIM} -I${.CURDIR} ${SRCS} | \
	    ${REFER} -P -S -e -p ${.CURDIR}/refs.r | \
	    ${PIC} | ${TBL} | ${EQN} -Tascii | nroff -ms > ${.TARGET}
.include <bsd.doc.mk>

@ -0,0 +1,166 @@
%A S. Deering
%A R. Hinden
%B RFC1883
%T Internet Protocol, Version 6 (IPv6) Specification
%D December 1995
%O ftp://ftp.isi.edu/in-notes/rfc1883.txt

%A A. Conta
%A S. Deering
%B RFC1885
%T Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification
%D December 1995
%O ftp://ftp.isi.edu/in-notes/rfc1885.txt

%A R. Hinden
%A S. Deering
%B RFC2373
%T IP Version 6 Addressing Architecture
%D July 1998
%O ftp://ftp.isi.edu/in-notes/rfc2373.txt

%A J. Postel
%B RFC793
%T Transmission Control Protocol
%D September 1981
%O ftp://ftp.isi.edu/in-notes/rfc793.txt

%A J. Postel
%A J.K. Reynolds
%B RFC959
%T File Transfer Protocol
%D October 1985
%O ftp://ftp.isi.edu/in-notes/rfc959.txt

%A A. Durand
%A B. Buclin
%B RFC2546
%T 6Bone Routing Practice
%D March 1999
%O ftp://ftp.isi.edu/in-notes/rfc2546.txt

%A S. Deering
%A R. Hinden
%B RFC2460
%T Internet Protocol, Version 6 (IPv6) Specification
%D December 1998
%O ftp://ftp.isi.edu/in-notes/rfc2460.txt

%A T. Narten
%A E. Nordmark
%A W. Simpson
%B RFC2461
%T Neighbor Discovery for IP Version 6 (IPv6)
%D December 1998
%O ftp://ftp.isi.edu/in-notes/rfc2461.txt

%A S. Thomson
%A T. Narten
%B RFC2462
%T IPv6 Stateless Address Autoconfiguration
%D December 1998
%O ftp://ftp.isi.edu/in-notes/rfc2462.txt

%A A. Conta
%A S. Deering
%B RFC2463
%T Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification
%D December 1998
%O ftp://ftp.isi.edu/in-notes/rfc2463.txt

%A T. Bates
%A Y. Rekhter
%B RFC2260
%T Scalable Support for Multi-homed Multi-provider Connectivity
%D January 1998
%O ftp://ftp.isi.edu/in-notes/rfc2260.txt

%A G. Malkin
%A R. Minnear
%B RFC2080
%T RIPng for IPv6
%D January 1997
%O ftp://ftp.isi.edu/in-notes/rfc2080.txt

%A G. Malkin
%B RFC2081
%T RIPng Protocol Applicability Statement
%D January 1997
%O ftp://ftp.isi.edu/in-notes/rfc2081.txt

%A T. Bates
%A R. Chandra
%A D. Katz
%A Y. Rekhter
%B RFC2283
%T Multiprotocol Extensions for BGP-4
%D February 1998
%O ftp://ftp.isi.edu/in-notes/rfc2283.txt

%A P. Marques
%A F. Dupont
%B RFC2545
%T Use of BGP-4 Multiprotocol Extensions for IPv6 Inter-Domain Routing
%D March 1999
%O ftp://ftp.isi.edu/in-notes/rfc2545.txt

%A P. Ferguson
%A D. Senie
%B RFC2267
%T Network Ingress Filtering: Defeating Denial of Service Attacks which employ IP Source Address Spoofing
%D January 1998
%O ftp://ftp.isi.edu/in-notes/rfc2267.txt

%A Matt Crawford
%B draft-ietf-ipngwg-router-renum-09.txt
%T Router Renumbering for IPv6
%D June 1999
%O work in progress material

%A R. Gilligan
%A E. Nordmark
%B RFC1933
%T Transition Mechanisms for IPv6 Hosts and Routers
%D April 1996
%O ftp://ftp.isi.edu/in-notes/rfc1933.txt

%A Erik Nordmark
%B draft-ietf-ngtrans-siit-06.txt
%T Stateless IP/ICMP Translator (SIIT)
%D June 24, 1999
%O work in progress material

%Q TIS
%T TIS Gauntlet
%O http://www.tis.com/

%A Marcus Ranum
%T Firewall Toolkit (FWTK)
%O http://www.fwtk.org/
%D first released on October 1, 1993

%A Jon Postel
%B RFC791
%T Internet Protocol
%D September 1981
%O ftp://ftp.isi.edu/in-notes/rfc791.txt

%A Stephen Kent
%A Randall Atkinson
%B RFC2401
%T Security Architecture for the Internet Protocol
%D November 1998
%O ftp://ftp.isi.edu/in-notes/rfc2401.txt

%A Stephen Kent
%A Randall Atkinson
%B RFC2402
%T IP Authentication Header
%D November 1998
%O ftp://ftp.isi.edu/in-notes/rfc2402.txt

%A Craig Metz
%T Porting Kernel Code to Four BSDs and Linux
%D June 1999
%B 1999 USENIX annual technical conference, Freenix track
%O http://www.usenix.org/publications/library/proceedings/usenix99/metz.html