Note, we're reusing the previously unused slot for "MTU discovery" (which
was moved to the "net.inet.ip" branch of the sysctl tree quite some time
ago).
- Filter out multicast destinations explicitly for every incoming packet,
not just SYNs. Previously, non-SYN multicast destination would be
filtered out as a side effect of PCB lookup. Remove now redundant
similar checks in the dropwithreset case and in syn_cache_add().
- Defer the TCP checksum until we know that we want to process the
packet (i.e. have a non-CLOSED connection or a listen socket).
increasing both of them will result in negative number on udp
"delivered" stat on netstat(8), since netstat computes number of delivered
packet by subtracting them from number of inbound packets.
- a packet is delivered to an address X,
- and the address X is configured on my !IFF_UP interface
- and ipforwarding=1
NetBSD PR: 9387
From: nrt@iij.ad.jp
RFC2553/2292-compliant header file path, now the following headers are
forbidden:
netinet6/ip6.h
netinet6/icmp6.h
netinet6/in6.h
if you want netinet6/{ip6,icmp6}.h, use netinet/{ip6,icmp6}.h.
if you want netinet6/in6.h, you just need to include netinet/in.h.
it pulls it in.
(we may need to integrate them into netinet/in.h, but for cross-BSD code
sharing i'd like to keep it like this for now)
(netinet6/{ip6,icmp6}.h is non-standard path - these files should go away)
it was not possible to use cvsmove in this case.
when you try to look at history, chase it toward netinet6/{ip6,icmp6}.h.
it is not supposed to work.
logging fix: add "\n" to some of log() in in6_prefix.c.
improve in6_ifdetach(). now almost all structure depend on ifnet
will be cleared up.
possible loose ends:
- cached route_in6 in static varaiables needs to be cleared as well
- there are ifaddr manipulation without reference counting,
which should be fixed
we still see panics after card removal, though... not sure what is left.
(sync with kame)
although this version has been changed somewhat:
- reference counting on ifaddrs isn't as complete as Bill's original
work was. This is hard to get right, and we should attack one
protocol at a time.
- This doesn't do reference counting or dynamic allocation of ifnets yet.
- This version introduces a new PRU -- PRU_PURGEADDR, which is used to
purge an ifaddr from a protocol. The old method Bill used didn't work
on all protocols, and it only worked on some because it was Very Lucky.
This mostly works ... i.e. works for my USB Ethernet, except for a dangling
ifaddr reference left by the IPv6 code; have not yet tracked this down.
- interop issues in ipcomp is fixed
- padding type (after ESP) is configurable
- key database memory management (need more fixes)
- policy specification is revisited
XXX m->m_pkthdr.rcvif is still overloaded - hope to fix it soon
code, from netbsd-current repository.
#ifdef'ed version is always available from ftp.kame.net.
XXX please do not make too many diff-unfriendly changes, we'll need to take
bunch of diffs on upgrade...
AF_INET6 wildcard listening socket. heavily documented in ip6(4).
net.inet6.ip6.bindv6only defines default value. default is 1.
"options INET6_BINDV6ONLY" removes any code fragment that supports
IPV6_BINDV6ONLY == 0 case (not defopt'ed as use of this is rare).
Patch from Darren send to the mailing list after he released 3.3.6 and
did a bad job with using the wrong way to update the NetBSD version
of ipfilter.
due to massive changes in KAME side.
- IPv6 output goes through nd6_output
- faith can capture IPv4 packets as well - you can run IPv4-to-IPv6 translator
using heavily modified DNS servers
- per-interface statistics (required for IPv6 MIB)
- interface autoconfig is revisited
- udp input handling has a big change for mapped address support.
- introduce in4_cksum() for non-overwriting checksumming
- introduce m_pulldown()
- neighbor discovery cleanups/improvements
- netinet/in.h strictly conforms to RFC2553 (no extra defs visible to userland)
- IFA_STATS is fixed a bit (not tested)
- and more more more.
TODO:
- cleanup os-independency #ifdef
- avoid rcvif dual use (for IPsec) to help ifdetach
(sorry for jumbo commit, I can't separate this any more...)
(if the definition is like in rfc2553) they are not supposed to be used.
XXX i'm trying to change rfc2553 sockaddr_storage definition to include
"ss_len" and "ss_family". see ipngwg. situation might change soon.
just for reference purposes.
This commit includes 1.4 -> 1.4.1 sync for kame branch.
The branch does not compile at all (due to the lack of ALTQ and some other
source code). Please do not try to modify the branch, this is just for
referenre purposes.
synchronization to latest KAME will take place on HEAD branch soon.
RTM_IFINFO is now 0xf, 0xe is RTM_OIFINFO which returns the old (if_msghdr14)
struct with 32bit counters (binary compat, conditioned on COMPAT_14).
Same for sysctl: node 3 is renamed NET_RT_OIFLIST, NET_RT_IFLIST is now node 4.
Change rt_msg1() to add an mbuf to the mbuf chain instead of just panic()
when the message is larger than MHLEN.
Avoid forwarding ip unicast packets which were contained inside
link-level multicast packets; having M_MCAST still set in the packet
header flags will mean that the packet will get multicast to a bogus
group instead of unicast to the next hop.
Malformed packets like this have occasionally been spotted "in the
wild" on a mediaone cable modem segment which also had multiple netbsd
machines running as router/NAT boxes.
Without this, any subnet with multiple netbsd routers receiving all
multicasts will generate a packet storm on receipt of such a
multicast. Note that we already do the same check here for link-level
broadcasts; ip6_forward already does this as well.
Note that multicast forwarding does not go through ip_forward().
Adding some code to if_ethersubr to sanity check link-level
vs. ip-level multicast addresses might also be worthwhile.
back into the packet. (ip_output() clears it since ipsec reuses that
packet field in the output path. by putting it back, we're going to
pretend we're back on the input path now).
This is important, because for most protocols, link level fragmentation is
used, but with different default effective MTUs. (e.g.: IPv4 default MTU
is 1500 octets, IPv6 default MTU is 9072 octets).
MSS advertisement must always be:
max(if mtu) - ip hdr siz - tcp hdr siz
We violated this in the previous code so it was fixed.
tcp_mss_to_advertise() now takes af (af on wire) as its argument,
to compute right ip hdr siz.
tcp_segsize() will take care of IPsec header size.
One thing I'm not really sure is how to handle IPsec header size in
*rxsegsizep (inbound segment size estimation).
The current code subtracts possible *outbound* IPsec size from *rxsegsizep,
hoping that the peer is using the same IPsec policy as me.
It may not be applicable, could TCP gulu please comment...
This situation happens on severe memory shortage. We may need more
improvements here and there.
- Grab IEEE802 address from IFT_ETHER card, even if the card is
inserted after bootup time. Is there any other card that can be
inserted afterwards? pcmcia fddi card? :-P
- RFC2373 u bit handling suggests that we SHOULD NOT copy interface id from
ethernet card to pseudo interface, when ethernet card has IEEE802/EUI64
with u bit != 0 (this means that IEEE802/EUI64 is not universally unique).
Do not use such address as, for example, interface id for gif interface.
(I have such an ethernet card myself)
This may change interface id for your gif interface. be careful upgrading
rc files.
(sync with recent KAME)
the member is used to pass struct socket to ip{,6}_output for ipsec decisions.
(i agree it is kind of ugly. we need to modify struct mbuf if we are
to do better - which seems to me a bit too much)
New Reno fast recovery code was being executed even when New Reno was
disabled, resulting in an unfortunate interaction with the traditional
fast recovery code, the end resulting being that the very condition
that would trigger the traditional fast recovery mechanism caused fast
recovery to be disabled!
Problem reported by Ted Lemon, and some analytical help from Charles Hannum.
Stale syn cache entries are useless because none of them will be used
if there is no listening socket, as tcp_input looks up listening socket by
in_pcblookup*() before looking into syn cache.
This fixes race condition due to dangling socket pointer from syn cache
entries to listening socket (this was introduced when ipsec is merged in).
This should preserve currently implemented behavior (but not 4.4BSD
behavior prior to syn cache).
Tested in KAME repository before commit, but we'd better run some
regression tests.
check that the packet if of the rigth protocol before giving it to the
proxy module, otherwise let the ipnat code handle it.
What happens in kern/7831 is that a router sends back a icmp message for
a TCP SYN, and ip_proxy.c forwards it to ip_ftp_pxy.c which can only
handle TCP packets. The icmp message is properly handled by ipnat, no need to
go to ip_ftp_pxy.c.
- Make sure that snd_recover is always at least snd_una. If we don't do
this, there can be confusion when sequence numbers wrap around on a
large loss-free data transfer.
- When doing a New Reno retransmit, snd_una hasn't been updated yet,
and the socket's send buffer has not yet dropped off ACK'd data, so
don't muddle with snd_una, so that tcp_output() gets the correct data
offset.
- When doing a New Reno retransmit, make sure the congestion window is
open one segment beyond the ACK'd data, so that we can actually perform
the retransmit.
Partially derived from, although more complete than, similar changes in
OpenBSD, which in turn originated from Tom Henderson <tomh@cs.berkeley.edu>.
is not the expected one.
I see PRC_REDIRECT_HOST with sa->sa_family == AF_UNIX coming to
{tcp,udp}_ctlinput() when I use dhclient, and I feel like adding
more sanity checks, without logging - if we log it it is too noisy.
when ip header and tcp header are not adjacent to each other
(i.e. when ip6 options are attached).
To test this, try
telnet @::1@::1 port
toward a port without responding server. Prior to the fix, the kernel will
generate broken RST packet.
for the protocol in the specified packet.
Fix statistic gathering to not make bogus increments of ips_delivered and
ips_noproto for cases where rip_input() is called by a protocol handler
(such as icmp_input or igmp_input) which has already processed the packet.
(Sorry for a big commit, I can't separate this into several pieces...)
Pls check sys/netinet6/TODO and sys/netinet6/IMPLEMENTATION for details.
- sys/kern: do not assume single mbuf, accept chained mbuf on passing
data from userland to kernel (or other way round).
- "midway" ATM card: ATM PVC pseudo device support, like those done in ALTQ
package (ftp://ftp.csl.sony.co.jp/pub/kjc/).
- sys/netinet/tcp*: IPv4/v6 dual stack tcp support.
- sys/netinet/{ip6,icmp6}.h, sys/net/pfkeyv2.h: IETF document assumes those
file to be there so we patch it up.
- sys/netinet: IPsec additions are here and there.
- sys/netinet6/*: most of IPv6 code sits here.
- sys/netkey: IPsec key management code
- dev/pci/pcidevs: regen
In my understanding no code here is subject to export control so it
should be safe.
address zero of each net/subnet is a broadcast address.
(The default value is nonzero, which preserves the current behavior).
This can be set using sysctl; the boot-time default can also be
configured using the HOSTZEROBROADCAST kernel config option.
While we're here, defopt HOSTZEROBROADCAST and SUBNETSARELOCAL
after various m_adj()s have been done. Kludge around this with a cheesy
macro that knows where the drivers put the mac header in the first mbuf.
XXX There should be a better way to do this.
where one side can think a connection exists, where the other side thinks
the connection was never established.
The original problem was first reported by Ty Sarna in PR #5909. The
original fix I made to the code didn't cover all cases. The problem this
fix addresses was reported by Christoph Badura via private e-mail.
Many thanks to Bill Sommerfeld for helping me to test this code, and
for finding a subtle bug.
remote of the tunnel can be found.
XXX If you manually mark the interface as "UP" and set the MTU later
XXX sending a packet will still cause a kernel panic.
passing SIOCSIFADDR/SIOIFDSTADDR, but by passing the addresses in
the appropriate structs.
One of the mysteries of ifconfig IMHO...
Should fix kern/6899.
and netinet, currently only tested under netinet.
Disabled by default, enabled by compiling the kernel with option
IFA_STATS. Enabling this feature seems to make the ip_output function
take 13% longer than before, which should be OK for people that need
this feature.
same uid or by root.
This code is from FreeBSD. (Whilst it was originally obtained from OpenBSD,
FreeBSD fixed it to work with multicast. To quote the commit message:
- Don't bother checking for conflicting sockets if we're binding to a
multicast address.
- Don't return an error if we're binding to INADDR_ANY, the conflicting
socket is bound to INADDR_ANY, and the conflicting socket has
SO_REUSEPORT set.
)
- Don't use tcp_respond(), instead create the tcp/ip header from scratch,
and send it ourself.
- Reuse the mbuf that carried the SYN, or allocate one if that is not
available.
- Cache the route we look up to do the Path MTU Discovery check, and
transfer the reference to that route to the inpcb when the connection
completes.
* Macro'ize a small, but often repeated code fragment.
SYN,ACK packets during Path MTU Discovery. Fix tcp_respond() to do the
appropriate route lookup and set DF as appropriate.
Also, fixup similar code in tcp_output() to relookup the route if it
is down.
ipip_input(), and returns non-zero if mrt_ipip_input() handled the
packet.
XXX Eventually, the multicast code should probably use regular IP-IP
XXX `interfaces', but mrouted knows about the VIF table, etc.
up the interface to ip_mroute.c somewhat, and properly separates IP-IP
from GRE. (They are similar, but they are different protocols, and should
not be implemented in the same place.)
where the user issued a write with a length greater than MLEN but less
than MINCLSIZE, thus causing two mbufs to be used. The loop in sosend()
would then call PRU_SEND twice, causing TCP to transmit 2 packets when
it could have transmitted one.
Suggested by Justin Walker <justin@apple.com> on the freebsd-net
mailing list.
((so->so_proto->pr_flags & PR_CONNREQUIRED) == 0 ||
(so->so_options & SO_ACCEPTCONN) == 0)
since the latter is always true, so the former test in unnecessary.
from `TCP/IP Illustrated, Volume 2', W. Richard Stevens, p 730.
Our other constants also use "ATALK".
Added many new ETHERTYPE constants to sys/net/ethertypes.h, including the
ones from libpcap and tcpdump "ethertype.h" files.
of a lookup_wildcard arg; simplifies the logic a bit.
* when assigning ephemeral ports in in_pcbbind(), always call
in_pcblookup_port() with lookup_wildcard=1, so that ephemeral port
allocation on sockets with SO_REUSEADDR set won't potentially bind to a
port in use by something else (principle of least surprise).
syn_cache_unreach() should remove the entry, or just continue on.
Algorithm is to only remove the entry if we've had more than one unreach
error and have retransmitted 3 or more times. This prevents the following
scenario, as noted in PR #5909 (PR from Ty Sarna, scenario from
Charles Hannum):
* Host A sends a SYN.
* Host A retransmits the SYN.
* Host B gets the first SYN and sends a SYN-ACK.
* Host B gets the second SYN and sends a SYN-ACK.
* One of the SYN-ACK bounces with an
ICMP unreachable, causing the `SYN cache' entry to be
removed with no notification.
* Host A receives the other SYN-ACK, sends an ACK, and goes to
ESTABLISHED state.
Should fix PR #5909.
suggestion in draft-floyd-incr-init-win-03. Rather than scaling cwnd back
by the ratio of new segment size to old segment size, we perform a slow start
using the Initial Window, computed with the new segment size.
from the output flags for CLOSING state. There is no harm in retransmitting
the FIN, and this change has unexpected side effects that break simultaneous
close behaviour.
NETHER, NFDDI, NARC are not used anywhere. Remove #include "ether.h",
which had no effect.
Removes clash with "options NATM" for native-ATM network protocol stack.
(1.44) missed a test for the right interface, making some machines answer
to some bogus arp requests (like for WHO-HAS 127.0.0.1).
The quick patch in 1.46-1.47 does not work for so-called "unnumbered"
interfaces, that is, (point-to-point) interfaces that share their local
address with another (e.g., the Ethernet) interface.
We add a macro to in_var.h, to step (in the current implementation) through
the hash chain and fine more entries with the same address, and use that
in if_arp.c to find one which belongs to our interface.
as with user-land programs, include files are installed by each directory
in the tree that has includes to install. (This allows more flexibility
as to what gets installed, makes 'partial installs' easier, and gives us
more options as to which machines' includes get installed at any given
time.) The old SYS_INCLUDES={symlinks,copies} behaviours are _both_
still supported, though at least one bug in the 'symlinks' case is
fixed by this change. Include files can't be build before installation,
so directories that have includes as targets (e.g. dev/pci) have to move
those targets into a different Makefile.
- Don't use home-grown queue manipulation. Use <sys/queue.h> instead. The
data structures are a little larger, but we are otherwise wasting the
memory chunk anyway (we're already a 64-byte malloc bucket).
- Fix a bug in the cache-is-full case: if the oldest element removed from
the first non-empty bucket was the only element in the bucket, the
bucket wouldn't be removed from the bucket cache, causing queue corruption
later.
- Optimize the syn cache timers by using PRT timers rather than home-grown
decrement-and-propagate timers.
This code is now a fair bit smaller, and significantly easier to read
and understand.
the protocol dispatch layer for TCP timers. This saves having to
modify a potentially large number of timer values (which were shorts,
and expanded to ... a lot of code on the Alpha).
- kern/5381 (Dennis Ferguson): check IP header checksum in fast forward
code.
- In ipflow_slowtimo(), if no IP flows are in use, don't bother checking
all of the hash buckets.
flow, by setting the "can fast forward" flag in the packet header, and
giving a chance for filters to clear the flag. If the flag is still
set after the filters have given it a chance, the packet will be used
to create a fast-forward flow entry.
the burst size allowed, but rather a fixed number of packets, as
described in the Internet Draft. Default allowed burst is 4 packets,
per the Draft.
Make the use of CWM and the allowed burst size tunable via sysctl.
if_fddisubr.c to fastpath IP forwarding. If ip_forward successfully
forwards a packet, it will create a cache (ipflow) entry. ether_input
and fddi_input will first call ipflow_fastforward with the received
packet and if the packet passes enough tests, it will be forwarded (the
ttl is decremented and the cksum is adjusted incrementally).
conditional (tcp_compat_42). The kernel config option TCP_COMPAT_42
will still enable this by default, or disable this by default if the
option is not included (i.e. current behavior). This will be made a
sysctl soon.
rule was to update the timestamp if the sequence numbers are in range. New
rule adds a check that the timestamp is advancing, thus preventing our notion
of the most recent timestamp from incorrectly moving backwards.
all of the fragments. Use the mtu of route in preference of the MTU of the
interface when doing fragmentation decisions. (ie. Fragment to the path
mtu if it is available).
TCP connections by using the MTU of the interface. Also added
a knob, mss_ifmtu, to force all connections to use the MTU of
the interface to calculate the advertised MSS.
meeting of IETF #41 by Amy Hughes <ahughes@isi.edu>, and in an upcoming
internet draft from Hughes, Touch, and Heidemann.
CWM eliminates line-rate bursts after idle periods by counting pending
(unacknowledged) packets and limiting the congestion window to the
initial congestion window plus the pending packet count. This has the
effect of allowing us to use the window as long as we continue to transmit,
but as soon as we stop transmitting, we go back to a slow-start (also known
as `use it or lose it').
This is not enabled by default. You can enable this behavior by patching
the "tcp_cwm" global (set it to non-zero) or by building a kernel with the
TCP_CWM option.
to ACK immediately any packet that arrived with PSH set. This breaks
delayed ACKs in a few specific common cases that delayed ACKs were
supposed to help, and ends up not making much (if any) difference in
the case where where this ACK-on-PSH change was supposed to help.
Per discussion with several members of the TCPIMPL and TCPSAT IETF
working groups.
code, as clarified in the TCPIMPL WG meeting at IETF #41: If the SYN
(active open) or SYN,ACK (passive open) was retransmitted, the initial
congestion window for the first slow start of that connection must be
one segment.
RTO estimation changes. Under some circumstances it would return a value
of 0, while the old Van Jacobson RTO code would return a minimum of 3.
This would result in 12 retransmissions, each 1 second apart.
This takes care of those instances, and ensures that t_rttmin is
used everywhere as a lower bound.
change pfil_add_hook to put output filters at the tail of the queue,
while continuing to place input filters at the head of the queue. update
the two users of these functions, and document these changes.
fixes PR#4593.
in the packet. This fixes a bug that was resulting in extra packets
in retransmissions (the second packet would be 12 bytes long,
reflecting the RFC1323 timestamp option size).