done by Artur Grabowski and Thomas Nordin for OpenBSD, which is more
efficient in several ways than the callwheel implementation that it is
replacing. It has been adapted to our pre-existing callout API, and
also provides the slightly more efficient (and much more intuitive)
API (adapted to the callout_*() naming scheme) that the OpenBSD version
provides.
Among other things, this shaves a bunch of cycles off rescheduling-in-
the-future a callout which is already scheduled, which the common case
for TCP timers (notably REXMT and KEEP).
The API has been simplified a bit, as well. The (very confusing to
a good many people) "ACTIVE" state for callouts has gone away. There
is now only "PENDING" (scheduled to fire in the future) and "EXPIRED"
(has fired, and the function called).
Kernel version bump not done; we'll ride the 1.6N bump that happened
with the malloc(9) change.
as ltsleep() may call callout_reset() with the scheduler lock held.
So, prevent interrupts that may take the scheduler lock while holding
the callwheel lock.
counters. These counters do not exist on all CPUs, but where they
do exist, can be used for counting events such as dcache misses that
would otherwise be difficult or impossible to instrument by code
inspection or hardware simulation.
pmc(9) is meant to be a general interface. Initially, the Intel XScale
counters are the only ones supported.
"earliest" firing callout in a bucket. This allows us to skip
the scan up the bucket if no callouts are due in the bucket.
A cheap O(1) hint update is done at callout insertion (if new callout
is earlier than hint) and removal (is bucket empty). A thorough
refresh of the hint is done when the bucket is traversed.
This doesn't matter much on machines with small values of hz
(e.g. i386), but on systems with large values of hz (e.g. Alpha),
it has a definite positive effect.
Also, keep the callwheel stats in evcnts, so that you can view them
with "vmstat -e".
defined, call addupc_intr() directly from statclock() in the system time case,
using the same P_OWEUPC path if the copyin/copyout fails.
Use this in i386 to remove profiling code from the normal userret() path.
- Periodically invoke roundrobin() from hardclock() on all cpu's rather
than from a timer callout; this allows time-slicing on non-primary cpu's.
- Make pscnt per-cpu.
- Notice psdiv changes on each cpu, and adjust pscnt at that point.
Also, invoke setstatclockrate() from the clock interrupt when each cpu
notices the divisor change, rather than when starting/stopping the
profiling clock.
Stops sleeps from returning early (by up to a clock tick), and return 0
ticks for timeouts that should happen now or in the past.
Returning 0 is different from the legacy hzto() interface, and callers
need to check for it.
"KERN_SYSVIPC_SEM_INFO" and "KERN_SYSVIPC_SHM_INFO" to return the
info and data structures for the relevent SysV IPC types. The return
structures use fixed-size types and should be compat32 safe. All
user-visible changes are protected with
#if !defined(_POSIX_C_SOURCE) && !defined(_XOPEN_SOURCE)
Make all variable declarations extern in msg.h, sem.h and shm.h and
add relevent variable declarations to sysv_*.c and remove unneeded
header files from those .c files.
Make compat14 SysV IPC conversion functions and sysctl_file() static.
Change the data pointer to "void *" in sysctl_clockrate(),
sysctl_ntptime(), sysctl_file() and sysctl_doeproc().
timeout()/untimeout() API:
- Clients supply callout handle storage, thus eliminating problems of
resource allocation.
- Insertion and removal of callouts is constant time, important as
this facility is used quite a lot in the kernel.
The old timeout()/untimeout() API has been removed from the kernel.
approximation of reality if the MD code doesn't. This variable is the
equivalent of "tickfix" for the non-NTP path.
This allows an alpha kernel (where hz=1024) with "options NTP" to
synch up quite nicely (as opposed to having an frequency error of
~560ppm, which is outside the capture range of the PLL).
that is priority is rasied. Add a new spllowersoftclock() to provide the
atomic drop-to-softclock semantics that the old splsoftclock() provided,
and update calls accordingly.
This fixes a problem with using the "rnd" pseudo-device from within
interrupt context to extract random data (e.g. from within the softnet
interrupt) where doing so would incorrectly unblock interrupts (causing
all sorts of lossage).
XXX 4 platforms do not have priority-raising capability: newsmips, sparc,
XXX sparc64, and VAX. This platforms still have this bug until their
XXX spl*() functions are fixed.
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclk() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code
=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
=== New schedclk() mechansism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurance an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock. Define these for alpha, where hz==1024, and nice was
totally broke.
=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
be incorporated directly (with greater wieght) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this, it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.
=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calcuation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
is running (and NTP is not enabled), the adjtime()-handling code clobbers
any tickfix that may be necessary for systems with clocks with frequency
greater than 1000Hz.