- Don't use home-grown queue manipulation. Use <sys/queue.h> instead. The
data structures are a little larger, but we are otherwise wasting the
memory chunk anyway (we're already a 64-byte malloc bucket).
- Fix a bug in the cache-is-full case: if the oldest element removed from
the first non-empty bucket was the only element in the bucket, the
bucket wouldn't be removed from the bucket cache, causing queue corruption
later.
- Optimize the syn cache timers by using PRT timers rather than home-grown
decrement-and-propagate timers.
This code is now a fair bit smaller, and significantly easier to read
and understand.
the protocol dispatch layer for TCP timers. This saves having to
modify a potentially large number of timer values (which were shorts,
and expanded to ... a lot of code on the Alpha).
- kern/5381 (Dennis Ferguson): check IP header checksum in fast forward
code.
- In ipflow_slowtimo(), if no IP flows are in use, don't bother checking
all of the hash buckets.
flow, by setting the "can fast forward" flag in the packet header, and
giving a chance for filters to clear the flag. If the flag is still
set after the filters have given it a chance, the packet will be used
to create a fast-forward flow entry.
the burst size allowed, but rather a fixed number of packets, as
described in the Internet Draft. Default allowed burst is 4 packets,
per the Draft.
Make the use of CWM and the allowed burst size tunable via sysctl.
if_fddisubr.c to fastpath IP forwarding. If ip_forward successfully
forwards a packet, it will create a cache (ipflow) entry. ether_input
and fddi_input will first call ipflow_fastforward with the received
packet and if the packet passes enough tests, it will be forwarded (the
ttl is decremented and the cksum is adjusted incrementally).
conditional (tcp_compat_42). The kernel config option TCP_COMPAT_42
will still enable this by default, or disable this by default if the
option is not included (i.e. current behavior). This will be made a
sysctl soon.
rule was to update the timestamp if the sequence numbers are in range. New
rule adds a check that the timestamp is advancing, thus preventing our notion
of the most recent timestamp from incorrectly moving backwards.
all of the fragments. Use the mtu of route in preference of the MTU of the
interface when doing fragmentation decisions. (ie. Fragment to the path
mtu if it is available).
TCP connections by using the MTU of the interface. Also added
a knob, mss_ifmtu, to force all connections to use the MTU of
the interface to calculate the advertised MSS.
meeting of IETF #41 by Amy Hughes <ahughes@isi.edu>, and in an upcoming
internet draft from Hughes, Touch, and Heidemann.
CWM eliminates line-rate bursts after idle periods by counting pending
(unacknowledged) packets and limiting the congestion window to the
initial congestion window plus the pending packet count. This has the
effect of allowing us to use the window as long as we continue to transmit,
but as soon as we stop transmitting, we go back to a slow-start (also known
as `use it or lose it').
This is not enabled by default. You can enable this behavior by patching
the "tcp_cwm" global (set it to non-zero) or by building a kernel with the
TCP_CWM option.
to ACK immediately any packet that arrived with PSH set. This breaks
delayed ACKs in a few specific common cases that delayed ACKs were
supposed to help, and ends up not making much (if any) difference in
the case where where this ACK-on-PSH change was supposed to help.
Per discussion with several members of the TCPIMPL and TCPSAT IETF
working groups.
code, as clarified in the TCPIMPL WG meeting at IETF #41: If the SYN
(active open) or SYN,ACK (passive open) was retransmitted, the initial
congestion window for the first slow start of that connection must be
one segment.
RTO estimation changes. Under some circumstances it would return a value
of 0, while the old Van Jacobson RTO code would return a minimum of 3.
This would result in 12 retransmissions, each 1 second apart.
This takes care of those instances, and ensures that t_rttmin is
used everywhere as a lower bound.
change pfil_add_hook to put output filters at the tail of the queue,
while continuing to place input filters at the head of the queue. update
the two users of these functions, and document these changes.
fixes PR#4593.
in the packet. This fixes a bug that was resulting in extra packets
in retransmissions (the second packet would be 12 bytes long,
reflecting the RFC1323 timestamp option size).
results in reserved ephemeral ports starting at the top (as per
current practice), and shouldn't have a negative effect on normal
ephemeral ports...
* initialise inpt_lastlow in in_pcbinit
* IP_PORTRANGE socket option, which controls how the ephemeral ports
are allocated. it takes the following settings:
IP_PORTRANGE_DEFAULT use anonportmin (49152) -> anonportmax (65535)
IP_PORTRANGE_HIGH as IP_PORTRANGE_DEFAULT (retained for FreeBSD
compat reasons, where these are separate)
IP_PORTRANGE_LOW use 600 -> 1023. only works if uid==0.
* in_pcb flag INP_ANONPORT. set if port was allocated ephmerally
* support sysctl net.inet.ip.anonportmin (lowest ephemeral port)
and net.inet.ip.anonportmax (highest ephemeral port).
these can't be set to >65535, < IPPORT_RESERVED (unless IPNOPRIVPORTS
is defined), and anonportmin has to be < anonportmax.
* use a cleaner way of only cycling through the available set once;
this will be useful for when a random allocation scheme is used
* define IPPORT_ANON{MIN,MAX} instead of IPPORT_USER{LOW,HIGH}
so_linger is used as an argument to tsleep(), so was stuffed with
clockticks for the TCP linger time. However, so_linger is set directly from
l_linger if the linger time is specified, and l_linger is seconds (although
this is not currently documented anywhere). Fix this to set the TCP
linger time in seconds, and multiply so_linger by hz when tsleep() is
called to actually perform the linger.
- When running the slow timers, skip PCBs in LISTEN state.
- When processing the persist timer, drop the connection if the connection
idle time exceeds the maximum backoff for retransmit. Part of
kern/2335 (pete@daemon.net).
- If we fail to allocate mbufs for the outgoing segment, free the header
and abort.
From Stevens:
- Ensure the persist timer is running if the send window reaches zero.
Part of the fix for kern/2335 (pete@daemon.net).
The sysctl'able variable "tcp_init_win", when set to 0, selects an
auto-tuning algorithm for selecting the initial window, based on transmit
segment size, per discussion in the IETF tcpimpl working group.
Default initial window is still 1 segment, but will soon become 2 segments,
per discussion in tcpimpl.
in tcp_output(), and it will only be cleared in tcp_output() if the ACK was
transmitted sucessfully. Also, don't count delayed ACKs here, let tcp_output()
count them.
case. Sending an RST to ourselves is a little silly, considering that
we'll just attempt to remove a non-existent compressed state entry and
then drop the packet anyway.
socket:
- If we received a SYN,ACK, send an RST.
- If we received a SYN, and the connection attempt appears to come from
itself, send an RST, since it cannot possibly be valid.
pseudo-device rnd # /dev/random and in-kernel generator
in config files.
o Add declaration to all architectures.
o Clean up copyright message in rnd.c, rnd.h, and rndpool.c to include
that this code is derived in part from Ted Tyso's linux code.
Basically, in silly window avoidance, don't use the raw MSS we advertised
to the peer. What we really want here is the _expected_ size of received
segments, so we need to account for the path MTU (eventually; right now,
the interface MTU for "local" addresses and loopback or tcp_mssdflt for
non-local addresses). Without this, silly window avoidance would never
kick in if we advertised a very large (e.g. ~64k) MSS to the peer.
- Don't overload t_maxseg. Previous behavior was to set it to the min
of the peer's advertised MSS, our advertised MSS, and tcp_mssdflt
(for non-local networks). This breaks PMTU discovery running on
either host. Instead, remember the MSS we advertise, and use it
as appropriate (in silly window avoidance).
- Per last bullet, split tcp_mss() into several functions for handling
MSS (ours and peer's), and performing various tasks when a connection
becomes ESTABLISHED.
- Introduce a new function, tcp_segsize(), which computes the max size
for every segment transmitted in tcp_output(). This will eventually
be used to hook in PMTU discovery.
where it is now, and adding the specialized for Ethernet version of the ARP
structure, for the benefit of programs which are externally (to us) maintained
and not (yet) ported.
XXX This should NOT be used inside the kernel.
which happen to have a TCB in TIME_WAIT, where an mbuf which had been
advanced past the IP+TCP headers and TCP options would be reused as if
it had not been advanced. Problem found by Juergen Hannken-Illjes, who
also suggested a work-around on which this fix is based.
fixed in FreeBSD by John Polstra:
Fix a bug (apparently very old) that can cause a TCP connection to
be dropped when it has an unusual traffic pattern. For full details
as well as a test case that demonstrates the failure, see the
referenced PR (FreeBSD's kern/3998).
Under certain circumstances involving the persist state, it is
possible for the receive side's tp->rcv_nxt to advance beyond its
tp->rcv_adv. This causes (tp->rcv_adv - tp->rcv_nxt) to become
negative. However, in the code affected by this fix, that difference
was interpreted as an unsigned number by max(). Since it was
negative, it was taken as a huge unsigned number. The effect was
to cause the receiver to believe that its receive window had negative
size, thereby rejecting all received segments including ACKs. As
the test case shows, this led to fruitless retransmissions and
eventually to a dropped connection. Even connections using the
loopback interface could be dropped. The fix substitutes the signed
imax() for the unsigned max() function.
Bill informs me that his research indicates this bug appeared in Reno.
use of mbuf external storage and increasing performance (by eliminating
an m_pullup() for clusters in the IP reassembly code).
Changes from Koji Imada <koji@math.human.nagoya-u.ac.jp>, in PR #3628
and #3480, with ever-so-slight integration changes by me.