Commit Graph

350 Commits

Author SHA1 Message Date
ozaki-r
4c25fb2f83 Add rtcache_unref to release points of rtentry stemming from rtcache
In the MP-safe world, a rtentry stemming from a rtcache can be freed at any
points. So we need to protect rtentries somehow say by reference couting or
passive references. Regardless of the method, we need to call some release
function of a rtentry after using it.

The change adds a new function rtcache_unref to release a rtentry. At this
point, this function does nothing because for now we don't add a reference
to a rtentry when we get one from a rtcache. We will add something useful
in a further commit.

This change is a part of changes for MP-safe routing table. It is separated
to avoid one big change that makes difficult to debug by bisecting.
2016-12-08 05:16:33 +00:00
mrg
bbc9acc117 apply a #ifdef INET6 so the previous compiles without INET6. 2016-11-15 22:23:09 +00:00
mlelstv
845a599209 Enforce alignment requirements that are violated in some cases.
For machines that don't need strict alignment (i386,amd64,vax,m68k) this
is a no-op.

Fixes PR kern/50766 but should be improved.
2016-11-15 20:50:28 +00:00
ozaki-r
fe6d427551 Avoid storing a pointer of an interface in a mbuf
Having a pointer of an interface in a mbuf isn't safe if we remove big
kernel locks; an interface object (ifnet) can be destroyed anytime in any
packet processing and accessing such object via a pointer is racy. Instead
we have to get an object from the interface collection (ifindex2ifnet) via
an interface index (if_index) that is stored to a mbuf instead of an
pointer.

The change provides two APIs: m_{get,put}_rcvif_psref that use psref(9)
for sleep-able critical sections and m_{get,put}_rcvif that use
pserialize(9) for other critical sections. The change also adds another
API called m_get_rcvif_NOMPSAFE, that is NOT MP-safe and for transition
moratorium, i.e., it is intended to be used for places where are not
planned to be MP-ified soon.

The change adds some overhead due to psref to performance sensitive paths,
however the overhead is not serious, 2% down at worst.

Proposed on tech-kern and tech-net.
2016-06-10 13:31:43 +00:00
ozaki-r
d938d837b3 Introduce m_set_rcvif and m_reset_rcvif
The API is used to set (or reset) a received interface of a mbuf.
They are counterpart of m_get_rcvif, which will come in another
commit, hide internal of rcvif operation, and reduce the diff of
the upcoming change.

No functional change.
2016-06-10 13:27:10 +00:00
rtr
e2a3307b85 Reduce code duplication.
Split creation of IPv4-Mapped IPv6 addresses into its own function
and use it.

No functional change intended.  As posted to tech-net@
2016-02-15 14:59:03 +00:00
pooka
1c4a50f192 sprinkle _KERNEL_OPT 2015-08-24 22:21:26 +00:00
matt
6de0fc0ff8 Make sure that snd_win doesn't go negative. 2015-07-24 04:31:20 +00:00
ozaki-r
fcda92b6be Remove unused arguments and the associated code from nd6_nud_hint()
from OpenBSD
2015-07-15 09:20:18 +00:00
rtr
e7083d7a4b remove transitional functions in{,6}_pcbconnect_m() that were used in
converting protocol user requests to accept sockaddr instead of mbufs.

remove tcp_input copy in to mbuf from sockaddr and just copy to sockaddr
to make it possible for the transitional functions to go away.

no version bump since these functions only existed for a short time and
were commented as adapters (they appeared in 7.99.15).
2015-05-24 15:43:45 +00:00
kefren
110f4b05db Don't try to do PCB lookup for bad checksummed segments
Fixes PR/43510 and PR/48452
2015-05-15 18:03:45 +00:00
rtr
fd12cf39ee make connect syscall use sockaddr_big and modify pr_{send,connect}
nam parameter type from buf * to sockaddr *.

final commit for parameter type changes to protocol user requests

* bump kernel version to 7.99.15 for parameter type changes to pr_{send,connect}
2015-05-02 17:18:03 +00:00
ozaki-r
2373b55abc Introduce in6_selecthlim_rt to consolidate an idiom for rt->rt_ifp
It consolidates a scattered routine:
(rt = rtcache_validate(&in6p->in6p_route)) != NULL ? rt->rt_ifp : NULL
2015-04-27 02:59:44 +00:00
rtr
8699b912c3 Move code that is conditional on options INET6 into #ifdef INET6.
* Re-organize some variable declarations to limit #ifdef's.
* Move INET and INET6 code into respective switch cases to simplify
  #ifdef INET6.

No intended functional change.
2015-03-14 02:08:16 +00:00
he
1d14d02249 Port over the TCP_INFO socket option from FreeBSD, originally from
the Linux 2.6 TCP API.  This permits the caller to query certain information
about a TCP connection, and is used by pkgsrc's net/iperf3 test program
if available.

This extends struct tcbcb with three fields to count retransmits,
out-of-sequence receives and zero window announcements, and will
therefore warrant a kernel revision bump (done separately).
2015-02-14 12:57:52 +00:00
christos
f89df58b37 use the new printing code. 2014-12-02 20:25:47 +00:00
rtr
822872eada split PRU_RCVD function out of pr_generic() usrreq switches and put into
separate functions

  - always KASSERT(solocked(so)) even if not implemented

  - replace calls to pr_generic() with req = PRU_RCVD with calls to
    pr_rcvd()
2014-08-08 03:05:44 +00:00
rmind
c173b112be tcp_signature_getsav: handle !ipsec_used case and fix the build (hi christos!). 2014-05-30 02:27:29 +00:00
christos
5d61e6c015 Introduce 2 new variables: ipsec_enabled and ipsec_used.
Ipsec enabled is controlled by sysctl and determines if is allowed.
ipsec_used is set automatically based on ipsec being enabled, and
rules existing.
2014-05-30 01:39:03 +00:00
maxv
c4846ab7e9 ';;' -> ';'
no functional change

spotted by my code scanner

ok christos@
2014-03-01 16:46:14 +00:00
kefren
4d4f2b7db1 * implement TCP CUBIC congestion control algorithm
* move tcp_sack_newack bits inside reno and newreno_fast_retransmit_newack
* notify ECN peer about cwnd shrink in [new]reno_slow_retransmit

Based on the patch proposed on tech-net@ on Nov 7 with minor improvments:
 * adapt wmax for no-fast convergence case
 * correct cbrt calculation for big window sizes (>750KB)
2013-11-12 09:02:05 +00:00
martin
3c7c27640e Remove unused variable 2013-09-15 14:40:56 +00:00
rmind
8088e72932 Remove SS_ISCONFIRMING, it is unused and TP4 will not come back. 2013-08-29 17:49:20 +00:00
christos
1071947586 merge error paths, pass the address of sav; pointed out by Greg Troxel 2013-06-06 00:03:14 +00:00
christos
27fe772ddc IPSEC has not come in two speeds for a long time now (IPSEC == kame,
FAST_IPSEC). Make everything refer to IPSEC to avoid confusion.
2013-06-05 19:01:26 +00:00
christos
bda667f3a1 remove unintended commit (this was to avoid a bug in the hme driver which
I have not been able to reproduce)
2012-06-22 15:09:36 +00:00
christos
40114b997c PR/46602: Move the rfc6056 port randomization to the IP layer. 2012-06-22 14:54:34 +00:00
yamt
88246c7ae3 comment 2012-04-13 15:35:57 +00:00
drochner
364a06bb29 remove KAME IPSEC, replaced by FAST_IPSEC 2012-03-22 20:34:37 +00:00
drochner
11ce8fbd83 fix build in the (FAST_)IPSEC & TCP_SIGNATURE case 2012-01-11 14:39:08 +00:00
christos
42c420856f - fix offsetof usage, and redundant defines
- kill pointer casts to 0
2011-12-31 20:41:58 +00:00
drochner
23e5beaef1 rename the IPSEC in-kernel CPP variable and config(8) option to
KAME_IPSEC, and make IPSEC define it so that existing kernel
config files work as before
Now the default can be easily be changed to FAST_IPSEC just by
setting the IPSEC alias to FAST_IPSEC.
2011-12-19 11:59:56 +00:00
tls
3afd44cf08 First step of random number subsystem rework described in
<20111022023242.BA26F14A158@mail.netbsd.org>.  This change includes
the following:

	An initial cleanup and minor reorganization of the entropy pool
	code in sys/dev/rnd.c and sys/dev/rndpool.c.  Several bugs are
	fixed.  Some effort is made to accumulate entropy more quickly at
	boot time.

	A generic interface, "rndsink", is added, for stream generators to
	request that they be re-keyed with good quality entropy from the pool
	as soon as it is available.

	The arc4random()/arc4randbytes() implementation in libkern is
	adjusted to use the rndsink interface for rekeying, which helps
	address the problem of low-quality keys at boot time.

	An implementation of the FIPS 140-2 statistical tests for random
	number generator quality is provided (libkern/rngtest.c).  This
	is based on Greg Rose's implementation from Qualcomm.

	A new random stream generator, nist_ctr_drbg, is provided.  It is
	based on an implementation of the NIST SP800-90 CTR_DRBG by
	Henric Jungheim.  This generator users AES in a modified counter
	mode to generate a backtracking-resistant random stream.

	An abstraction layer, "cprng", is provided for in-kernel consumers
	of randomness.  The arc4random/arc4randbytes API is deprecated for
	in-kernel use.  It is replaced by "cprng_strong".  The current
	cprng_fast implementation wraps the existing arc4random
	implementation.  The current cprng_strong implementation wraps the
	new CTR_DRBG implementation.  Both interfaces are rekeyed from
	the entropy pool automatically at intervals justifiable from best
	current cryptographic practice.

	In some quick tests, cprng_fast() is about the same speed as
	the old arc4randbytes(), and cprng_strong() is about 20% faster
	than rnd_extract_data().  Performance is expected to improve.

	The AES code in src/crypto/rijndael is no longer an optional
	kernel component, as it is required by cprng_strong, which is
	not an optional kernel component.

	The entropy pool output is subjected to the rngtest tests at
	startup time; if it fails, the system will reboot.  There is
	approximately a 3/10000 chance of a false positive from these
	tests.  Entropy pool _input_ from hardware random numbers is
	subjected to the rngtest tests at attach time, as well as the
	FIPS continuous-output test, to detect bad or stuck hardware
	RNGs; if any are detected, they are detached, but the system
	continues to run.

	A problem with rndctl(8) is fixed -- datastructures with
	pointers in arrays are no longer passed to userspace (this
	was not a security problem, but rather a major issue for
	compat32).  A new kernel will require a new rndctl.

	The sysctl kern.arandom() and kern.urandom() nodes are hooked
	up to the new generators, but the /dev/*random pseudodevices
	are not, yet.

	Manual pages for the new kernel interfaces are forthcoming.
2011-11-19 22:51:18 +00:00
yamt
4fa9fc4940 fix a double unlock bug introduced by tcp_input.c rev.1.312. 2011-10-31 13:01:42 +00:00
plunky
7f3d4048d7 NULL does not need a cast 2011-08-31 18:31:02 +00:00
joerg
3eb244d801 Retire varargs.h support. Move machine/stdarg.h logic into MI
sys/stdarg.h and expect compiler to provide proper builtins, defaulting
to the GCC interface. lint still has a special fallback.
Reduce abuse of _BSD_VA_LIST_ by defining __va_list by default and
derive va_list as required by standards.
2011-07-17 20:54:30 +00:00
gdt
c238210804 Remove erroneous additional tick in RTO estimation. The variable
ts_rtt is 1 plus the RTT, so that 0 can mean invalid measurement.
However, the code failed to subtract the 1 back out before use.  With
this change, TCP from Massachusetts to France now typically has 1s RTO
values, rather than 1.5s.

This bug was found and fixed by Bev Schwartz of BBN.  This material is
based upon work supported by the Defense Advanced Research Projects
Agency and Space and Naval Warfare Systems Center, Pacific, under
Contract No. N66001-09-C-2073.  Approved for Public Release,
Distribution Unlimited
2011-05-25 23:20:57 +00:00
dholland
5d71a1f21c typo in comment 2011-05-17 05:40:24 +00:00
dyoung
c2e43be1c5 Reduces the resources demanded by TCP sessions in TIME_WAIT-state using
methods called Vestigial Time-Wait (VTW) and Maximum Segment Lifetime
Truncation (MSLT).

MSLT and VTW were contributed by Coyote Point Systems, Inc.

Even after a TCP session enters the TIME_WAIT state, its corresponding
socket and protocol control blocks (PCBs) stick around until the TCP
Maximum Segment Lifetime (MSL) expires.  On a host whose workload
necessarily creates and closes down many TCP sockets, the sockets & PCBs
for TCP sessions in TIME_WAIT state amount to many megabytes of dead
weight in RAM.

Maximum Segment Lifetimes Truncation (MSLT) assigns each TCP session to
a class based on the nearness of the peer.  Corresponding to each class
is an MSL, and a session uses the MSL of its class.  The classes are
loopback (local host equals remote host), local (local host and remote
host are on the same link/subnet), and remote (local host and remote
host communicate via one or more gateways).  Classes corresponding to
nearer peers have lower MSLs by default: 2 seconds for loopback, 10
seconds for local, 60 seconds for remote.  Loopback and local sessions
expire more quickly when MSLT is used.

Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket
dead weight with a compact representation of the session, called a
"vestigial PCB".  VTW data structures are designed to be very fast and
memory-efficient: for fast insertion and lookup of vestigial PCBs,
the PCBs are stored in a hash table that is designed to minimize the
number of cacheline visits per lookup/insertion.  The memory both
for vestigial PCBs and for elements of the PCB hashtable come from
fixed-size pools, and linked data structures exploit this to conserve
memory by representing references with a narrow index/offset from the
start of a pool instead of a pointer.  When space for new vestigial PCBs
runs out, VTW makes room by discarding old vestigial PCBs, oldest first.
VTW cooperates with MSLT.

It may help to think of VTW as a "FIN cache" by analogy to the SYN
cache.

A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT
sessions as fast as it can is approximately 17% idle when VTW is active
versus 0% idle when VTW is inactive.  It has 103 megabytes more free RAM
when VTW is active (approximately 64k vestigial PCBs are created) than
when it is inactive.
2011-05-03 18:28:44 +00:00
yamt
3e17d0f5a4 tcp_input: simplify redundant assignment. no functional changes. 2011-04-25 22:12:43 +00:00
wiz
d8926a5a43 Fix typos. 2011-04-20 14:08:07 +00:00
gdt
f641bea548 Rewrite comments about TCP RTO calculations.
Long ago, the storage representations of srtt and rttvar were changed
from the 4.4BSD scheme, and the comments are out of sync with the
code.  This commit rewrites most of the comments that explain the RTO
calculations, and points out some issues in the code.

Joint work with Bev Schwartz of BBN (original analysis and comments),
but I have rewritten and extended them, so errors are mine.

This material is based upon work supported by the Defense Advanced
Research Projects Agency and Space and Naval Warfare Systems Center,
Pacific, under Contract No. N66001-09-C-2073.  Approved for Public
Release, Distribution Unlimited
2011-04-20 13:35:51 +00:00
yamt
0b881f8c57 comments 2011-04-14 15:48:48 +00:00
yamt
b1563ea6d9 fix a typo in rev.1.283, which broke tcp dupack and duppack statistics. 2011-03-09 00:44:23 +00:00
plunky
d334ec0fc0 fix potential mbuf overflow, from Alexander Danilov on tech-net 2010-12-02 19:07:27 +00:00
bouyer
adad9c5471 Make sure SYN_CACHE_TIMER_ARM() has been run before calling syn_cache_put()
as it will reschedule the timer.  Fixes PR kern/43318.
2010-05-26 17:38:29 +00:00
bouyer
c638cbeac1 syn_cache_put(): defer all pool_put() to the callout. Reschedule
the callout if needed so frees are not delayed too much.
syn_cache_timer(): we can't call syn_cache_put() here any more,
so move code deleted from syn_cache_put() here.

Avoid KASSERT() in kern_timeout.c because pool_put() is called from
ipintr context, as reported in
http://mail-index.netbsd.org/tech-kern/2010/03/19/msg007762.html
Thanks to Andrew Doran and Mindaugas Rasiukevicius for help and review.
2010-04-21 20:40:16 +00:00
rmind
b278cb5138 tcp_input: set ECE flag even if CWR flag is active.
Submitted by Richard Scheffenegger in PR/43150.
2010-04-16 03:13:03 +00:00
tls
4e0229021b Oops. Fix LOCKDEBUG panic -- and spurious calls to tcp_output()! -- in
previous.  Be careful with that {}, Eugene.
2010-04-01 14:31:51 +00:00
tls
994b02bdbe After discussion with ad@: it appears that KERNEL_LOCK also protects
the driver output path (that is, ifp->if_output()).  In the case of
entry through the socket code, we are fine, because pru_usrreq takes
KERNEL_LOCK.  However, there are a few other ways to cause output
which require protection:

	1) direct calls to tcp_output() in tcp_input()
	2) fast-forwarding code (ip_flow) -- protected elsewise
	   against itself by the softnet lock.
	3) *Possibly* the ARP code.  I have currently persuaded
	   myself that it is safe because of how it's called.
	4) Possibly the ICMP code.

This change addresses #1 and #2.
2010-04-01 00:24:41 +00:00