- for ipv4, defer decision to ip layer as h/w checksum offloading does
so that it can check the actual interface the packet is going to.
- for ipv6, disable it.
(maybe will be revisited when it implements h/w checksum offloading.)
ok'ed by Jason Thorpe.
- introduce t_segqlen, the number of segments in segq/timeq.
the name is from freebsd.
- rather than maintaining a copy of sack blocks (rcv_sack_block[]),
build it directly from the segment list when needed.
- Use the correct IP header length variable for other-than-first packets.
- Remove redundant setting of the original IP header length in the first
packet's csum_data. (It's already set before ip_fragment() is called
in 1.147.)
net.local.stream.pcblist
net.local.dgram.pcblist
net.inet.tcp.pcblist
net.inet.udp.pcblist
net.inet.raw.pcblist
net.inet6.tcp6.pcblist
net.inet6.udp6.pcblist
net.inet6.raw6.pcblist
which allow retrieval of the pcbs in use for those protocols. The
struct involved is 32/64 bit clean and incorporates parts of struct
inpcb, struct unpcb, a bit of struct tcpcb, and two socket addresses.
on gnats-bugs in PR kern/29544.
Tested with an NFS client using default rwsize on an NFS server
with wm(4) interface configured IP4CSUM,TCP4CSUM,UDP4CSUM.
Prior revision required the server to have checksum offload disabled.
it in the TCP stack to test which of the REXMT or PERSIST timer is in use.
This fixes a race condition that could cause "panic: tcp_output REXMT". See
tech-net for details.
http://www.sigusr1.org/~kurahone/tcp-sack-netbsd-02152005.diff.gz
Fixes in that patch for pre-existing TCP pcb initializations were already
committed to NetBSD-current, so are not included in this commit.
The SACK patch has been observed to correctly negotiate and respond,
to SACKs in wide-area traffic.
There are two indepenently-observed, as-yet-unresolved anomalies:
First, seeing unexplained delays between in fast retransmission
(potentially explainable by an 0.2sec RTT between adjacent
ethernet/wifi NICs); and second, peculiar and unepxlained TCP
retransmits observed over an ath0 card.
After discussion with several interested developers, I'm committing
this now, as-is, for more eyes to use and look over. Current hypothesis
is that the anomalies above may in fact be due to link/level (hardware,
driver, HAL, firmware) abberations in the test setup, affecting both
Kentaro's wired-Ethernet NIC and in my two (different) WiFi NICs.
headers and LKM.
Add MKPF; if set to no, don't build and install the pf(4) programs,
headers, LKM and spamd.
Both options default to yes, so nothing changed in the default build.
Reviewed by lukem.
checksum is always in the L4 header by the time we get to this point. It
was occasionally not there due to a bug in tcp_respond, which has since
been fixed.
So, instead just stash the length of the L3 header in the high 16 bits of
csum_data.
(from an offset to the end of the packet), the pseudo-header checksum must be
calculated by software. So, provide it in the TCP/UDP header when
M_CSUM_NO_PSEUDOHDR is set in the interface's if_csum_flags_tx.
The start offset, the end of the IP header, is also provided in the high 16
bits of pkthdr.csum_data. Such that the driver need not examine the packet
at all.
XXX At the request of Jonathan Stone, note that sharing of if_csum_flags_tx &
pkthdr.csum_flags for checksum quirks should be re-evaluated.
Hans Rosenfeld (rosenfeld at grumpf.hope-2000.org)
This change makes it possible to add gif interfaces to bridges, which
will then send and receive IP protocol 97 packets. Packets are Ethernet
frames with an EtherIP header prepended.
1) dupseg_fix_=true from NS: do not count a segment with completely duplicate
data as a duplicate ack. This can occur due to duplicate packets in the
network, or due to fast retransmit from the other side.
2) dupack_reset_=false from NS: do not reset the duplicate ack counter or exit
fast recovery if we happen to get data or a window update along with a
duplicate ack.
3) In the "very old ack" case that itojun added, send an ACK before dropping
the segment, to try to update the other side's send sequence number.
4) Check the ssthresh crossover point with >= rather than >. Otherwise we
start to do "exponential" growth immediately following recovery, where we
should be doing "linear". This is what NS does.
and it could detect even older time stamps. (Really, to be 100% correct, there
should be a timer that clears these out -- but it probably doesn't matter in
the real world.)
* t_partialacks<0 means we are not in fast recovery.
* t_partialacks==0 means we are in fast recovery, but we have not received
any partial acks yet.
* t_partialacks>0 means we are in fast recovery, and we have received
partial acks.
This is used to implement 2 changes in RFC 3782:
* We keep the notion that we are in fast recovery separate from t_dupacks, so
it is not reset due to out-of-order acks. (This affects both the Reno and
NewReno cases.)
* We only reset the retransmit timer on the first partial ack -- preventing us
from possibly taking one RTO per segment once fast recovery is initiated.
As before, it is hard to measure any difference between Reno and NewReno in the
real-world cases that I've tested.
1) If an echoed RFC 1323 time stamp appears to be later than the current time,
ignore it and fall back to old-style RTT calculation. This prevents ending
up with a negative RTT and panicking later.
2) Fix NewReno. This involves a few changes:
a) Implement the send_high variable in RFC 2582. Our implementation is
subtly different; it is one *past* the last sequence number transmitted
rather than being equal to it. This simplifies some logic and makes
the code smaller. Additional logic was required to prevent sequence
number wraparound problems; this is not mentioned in RFC 2582.
b) Make sure we reset t_dupacks on new acks, but *not* on a partial ack.
All of the new ack code is pushed out into tcp_newreno(). (Later this
will probably be a pluggable function.) Thus t_dupacks keeps track of
whether we're in fast recovery all the time, with Reno or NewReno, which
keeps some logic simpler.
c) We do not need to update snd_recover when we're not in fast recovery.
See tech-net for an explanation of this.
d) In the gratuitous fast retransmit prevention case, do not send a packet.
RFC 2582 specifically says that we should "do nothing".
e) Do not inflate the congestion window on a partial ack. (This is done by
testing t_dupacks to see whether we're still in fast recovery.)
This brings the performance of NewReno back up to the same as Reno in a few
random test cases (e.g. transferring peer-to-peer over my wireless network).
I have not concocted a good test case for the behavior specific to NewReno.
- Allow rn_init() to be called multiple times, but do nothing except the
first call.
- Include opt_inet.h so that #ifdef INET works.
- Call rn_init() from encap_init() explicitly rather than depending on the
order of initialization.
received packet so that the checksum is not performed twice. Also,
tcp_respond() does not fill-in the m_pkthdr.csum_data, so a h/w checksum may
have the wrong offset.
OK from Jason Thorpe.
Connect over tcp on the loopback is broken:
4729 amq 0.000007 CALL connect(4,0x804f2a0,0x1c)
4729 amq 75.007420 RET connect -1 errno 60 Connection timed out
and marked as out of window - trying to do the add will result in a failure
and the packet being blocked, incorrectly.
Committed By: darrenr
Tested By: smb
("no domain for AF 0") on if_detach.
- SIOCAIFADDR, SIOCSIFADDR: free an address on error.
- SIOCSIFNETMASK, SIOCSIFDSTADDR: reject operations for an interface which
has no AF_INET addresses.
partly from OpenBSD and FreeBSD.
reviewed by Christos Zoulas on tech-net@.
the interface route and various internal state. Also, it should use an ifreq,
not an if_aliasreq. Addresses PR 9604. (Nothing in our source tree uses
SIOCSIFNETMASK, though. Perhaps it should be deprecated.)
1.) Make sure that "pass" is always initialized.
2.) Make sure the code doesn't use a stale mbuf pointer after fr_makefrip()
has been called. This fixes PR kern/25868.
Analyzed and reviewed by Steve Woodford.
support IPv6 if KAME IPSEC (RFC is not explicit about how we make data stream
for checksum with IPv6, but i'm pretty sure using normal pseudo-header is the
right thing).
XXX
current TCP MD5 signature code has giant flaw:
it does not validate signature on input (can't believe it! what is the point?)
(MD5 signatures for TCP, as used with BGP). Credit for original
FreeBSD code goes to Bruce M. Simpson, with FreeBSD sponsorship
credited to sentex.net. Shortening of the setsockopt() name
attributed to Vincent Jardin.
This commit is a minimal, working version of the FreeBSD code, as
MFC'ed to FreeBSD-4. It has received minimal testing with a ttcp
modified to set the TCP-MD5 option; BMS's additions to tcpdump-current
(tcpdump -M) confirm that the MD5 signatures are correct. Committed
as-is for further testing between a NetBSD BGP speaker (e.g., quagga)
and industry-standard BGP speakers (e.g., Cisco, Juniper).
NOTE: This version has two potential flaws. First, I do see any code
that verifies recieved TCP-MD5 signatures. Second, the TCP-MD5
options are internally padded and assumed to be 32-bit aligned. A more
space-efficient scheme is to pack all TCP options densely (and
possibly unaligned) into the TCP header ; then do one final padding to
a 4-byte boundary. Pre-existing comments note that accounting for
TCP-option space when we add SACK is yet to be done. For now, I'm
punting on that; we can solve it properly, in a way that will handle
SACK blocks, as a separate exercise.
In case a pullup to NetBSD-2 is requested, this adds sys/netipsec/xform_tcp.c
,and modifies:
sys/net/pfkeyv2.h,v 1.15
sys/netinet/files.netinet,v 1.5
sys/netinet/ip.h,v 1.25
sys/netinet/tcp.h,v 1.15
sys/netinet/tcp_input.c,v 1.200
sys/netinet/tcp_output.c,v 1.109
sys/netinet/tcp_subr.c,v 1.165
sys/netinet/tcp_usrreq.c,v 1.89
sys/netinet/tcp_var.h,v 1.109
sys/netipsec/files.netipsec,v 1.3
sys/netipsec/ipsec.c,v 1.11
sys/netipsec/ipsec.h,v 1.7
sys/netipsec/key.c,v 1.11
share/man/man4/tcp.4,v 1.16
lib/libipsec/pfkey.c,v 1.20
lib/libipsec/pfkey_dump.c,v 1.17
lib/libipsec/policy_token.l,v 1.8
sbin/setkey/parse.y,v 1.14
sbin/setkey/setkey.8,v 1.27
sbin/setkey/token.l,v 1.15
Note that the preceding two revisions to tcp.4 will be
required to cleanly apply this diff.
to pool_init. Untouched pools are ones that either in arch-specific
code, or aren't initialiased during initial system startup.
Convert struct session, ucred and lockf to pools.
closer to normal behaviour for the current century.
New Reno is now on by default (which is really the only reasonable
choice, since we don't do SACK); instead of an initial window of 1
for non-local nets, we now use Sally Floyd's magic 4K rule.
left of the receive window, ignore it. Add some additional comments to
the code that deals with received segemnts that are completely to the right
of the receive window. If an invalid SYN is received, force an ACK and
drop it; if the other side really sent the SYN; it'll respond with a reset.
timer, otherwise there is a tiny window where both timers are
active, and this is not correct according to the comments in the
code. I believe that this is the cause of the to_ticks <= 0 assertion
failure in callout_schedule() that I've been getting.
(this can never have worked)
now I can use a "bge" gigabit interface with hw checksumming
ttcp-t: 2147483648 bytes in 18.31 real seconds = 114527.11 KB/sec +++
woow!
When under pressure for mbufs or we have too many fragments in the IP
reassembly queue, drop half of all fragments. This multiplicative-drop
strategy ensures we return to a healthy state, even under borderline
denial-of-service from extremely lossy NFS-over-UDP peers.
The multiplicative-drop phase currently drops 50% of fragments, but
has pre-placed support for implementing drop-fractions other than 50%
The threshhold for the `drop-half' phase is the new variable,
ip_maxfrags which is calculated as nmbclusters/4.
ip_input.c now keeps ip_nmbclusters, a cached copy of nmbclusters.
Before using limits derived from nmbclusters, we check if nmbclusters
and ip_nmclusters are equal. If not, we recompute Ip parameters
derived from nmbclusters. Based on a suggestion by Jason Thorpe.
ip_maxfrags is currently auto-recalcuated.
The counters ip_nfrags and ip_nfragpacketsr are now declared static
and uninitialized (bss), to discourage tampering with them.
The idea is that we only clear M_CANFASTFWD if an SPD exists
for the packet. Otherwise, it's safe to add a fast-forward
cache entry for the route.
To make this work properly, we invalidate the entire ipflow
cache if a fast-ipsec key is added or changed.
to check if interface exists, as (1) if_index has different meaning
(2) ifindex2ifnet could become NULL when interface gets destroyed,
since when we have introduced dynamically-created interfaces. from kame
- seed2 is necessary, but use it as "seed2 + x" not "seed2 ^ x".
- skipping number is not needed, so disable it for 16bit generator (makes
the repetition period to 30000)
Gone are the old kern_sysctl(), cpu_sysctl(), hw_sysctl(),
vfs_sysctl(), etc, routines, along with sysctl_int() et al. Now all
nodes are registered with the tree, and nodes can be added (or
removed) easily, and I/O to and from the tree is handled generically.
Since the nodes are registered with the tree, the mapping from name to
number (and back again) can now be discovered, instead of having to be
hard coded. Adding new nodes to the tree is likewise much simpler --
the new infrastructure handles almost all the work for simple types,
and just about anything else can be done with a small helper function.
All existing nodes are where they were before (numerically speaking),
so all existing consumers of sysctl information should notice no
difference.
PS - I'm sorry, but there's a distinct lack of documentation at the
moment. I'm working on sysctl(3/8/9) right now, and I promise to
watch out for buses.
XXX: The decision whether or not to fast forward should be made
XXX: dynamically. Using the current approach seriously reduces
XXX: routing performance on gateways with IPsec enabled.
* Include "opt_inet.h" everywhere IP-ids are generated with ip_newid(),
so the RANDOM_IP_ID option is visible. Also in ip_id(), to ensure
the prototype for ip_randomid() is made visible.
* Add new sysctl to enable randomized IP-ids, provided the kernel was
configured with RANDOM_IP_ID. (The sysctl defaults to zero, and is
a read-only zero if RANDOM_IP_ID is not configured).
Note that the implementation of randomized IP ids is still defective,
and should not be enabled at all (even if configured) without
very careful deliberation. Caveat emptor.
Revert the (default) ip_id algorithm to the pre-randomid algorithm,
due to demonstrated low-period repeated IDs from the randomized IP_id
code. Consensus is that the low-period repetition (much less than
2^15) is not suitable for general-purpose use.
Allocators of new IPv4 IDs should now call the function ip_newid().
Randomized IP_ids is now a config-time option, "options RANDOM_IP_ID".
ip_newid() can use ip_random-id()_IP_ID if and only if configured
with RANDOM_IP_ID. A sysctl knob should be provided.
This API may be reworked in the near future to support linear ip_id
counters per (src,dst) IP-address pair.
due to demonstrated low-period repeated IDs from the randomized IP_id
code. Consensus is that the low-period repetition (much less than
2^15) is not suitable for general-purpose use.
Allocators of new IPv4 IDs should now call the function ip_newid().
Randomized IP_ids is now a config-time option, "options RANDOM_IP_ID".
ip_newid() can use ip_random-id()_IP_ID if and only if configured
with RANDOM_IP_ID. A sysctl knob should be provided.
This API may be reworked in the near future to support linear ip_id
counters per (src,dst) IP-address pair.
mbuf chains which are recycled (e.g., ICMP reflection, loopback
interface). A consensus was reached that such recycled packets should
behave (more-or-less) the same way if a new chain had been allocated
and the contents copied to that chain.
Some packet tags may in future be marked as "persistent" (e.g., for
mandatory access controls) and should persist across such deletion.
NetBSD as yet hos no persistent tags, so m_tag_delete_nonpersistent()
just deletes all tags. This should not be relied upon.
in_ifaddrhead. Recent changes in struct names caused a namespace
collision in fast-ipsec, which are most cleanly fixed by using
"in_ifaddrhead" as the listhead name.
sysctl. Add a protocol-independent sysctl handler to show the per-protocol
"struct ifq' statistics. Add IP(v4) specific call to the handler.
Other protocols can show their per-protocol input statistics by
allocating a sysclt node and calling sysctl_ifq() with their own struct ifq *.
As posted to tech-kern plus improvements/cleanup suggested by Andrew Brown.
initialized. Update the txp(4) to compensate.
- Statically initialize the TCP timer callout handles in the tcpcb
template. We still use callout_setfunc(), but that call is now much
less expensive. Add a comment that the compiler is likely to unroll
the loop (so don't sweat that it's there).
During testing the prediction counters show a hit-rate on about 85% for
packets sent on a local LAN, and better than 99% for intercontinental
high-speed bulk traffic (!).
close sockets on address changes, which was deemed to be a bad idea and was
summarily removed, so there is no point in wasting effort on maintaining it
any more.
individually, create a tcpcb template pre-initialized (and pre-zero'd)
with the static and mostly-static tcpcb parameters. The template is
now copied into the new tcpcb, which zeros and initializes most of the
tcpcb in one pass. The template is kept up-to-date as TCP sysctl
variables are changed.
Combined with the previous sb_max change, TCP socket creation is now
25% faster.
throughput significantly in a wide variety of test cases, including
local gigabit ethernet with both jumbo and standard frames,
transcontinental (U.S.) connections with e2e bandwidths ranging from
10Mbit/sec to 155Mbit/sec, and on a variety of test connections
between the NetBSD Project public servers and machines in Australia.
The impact of this change is less dramatic for high-delay connections
when Path MTU is in use but still measurable.
For optimal performance on local gigabit networks, a higher socket
buffer size (at least 64K) will still yield a substantial improvement
in performance, but 32K gets us most of the way there in my test
cases, with only a cost of _doubling_ memory use per socket rather
than _quadrupling_ it.
N.B. Windows NT, at least since Win2k SP2, uses a default socket buffer
size (or their analogue thereof) of 64K, which is a useful data
point.
the purpose of this code appears to be on crack -- it's talking about
end-to-end authentication, but the purpose of an AH tunnel is NOT end-to-end
authentication; it's authentication of the tunnel endpoints.
NB: This does not fix the fact that IPsec leaks "packet tags."
argument. So check so is non-NULL before doing the pointer-chasing
dance to find the PCB. (Unless and until we rework fast-ipsec and
KAME, to pass a struct in_pcbhdr * instead of the struct socket *).
argument to ip6_output() with a new explicit struct in6pcb* argument.
(The underlying socket can be obtained via in6pcb->inp6_socket.)
In preparation for fast-ipsec. Reviewed by itojun.
reported by Shiva Shenoy
while we're here, fix another problem when the same interface address is
assigned to !IFF_MULTICAST and IFF_MULTICAST interface. if ip_multicast_if()
returns the first one, join/leave will fail, which is not an desired effect.