syscall/trap entry, eliminating a test+branch on every syscall/trap.
This wasn't possible in the 3.99.x timeframe when l->l_cred came about
because there wasn't a reliable/timely way to force an ONPROC LWP running on
a remote CPU into the kernel (which is just about the only new thing in
this scheme).
No functional change, but it does get rid of a bunch of assumptions about
how mi_userret() works, making it easier to adjust in the future, and it
works as a kind of documentation too.
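For illustration, the sort of per-syscall test+branch this eliminates,
as a hypothetical sketch rather than the actual mi_userret() code:

    /* refresh the LWP's cached credentials on every return to user */
    if (__predict_false(l->l_cred != l->l_proc->p_cred)) {
        refresh_lwp_cred(l);    /* hypothetical helper */
    }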
- Rather than setting it up at each site where we block, make it a property of
syncobj_t. Then, do not hang onto the priority boost until userret(),
drop it as soon as the LWP is out of the run queue and onto a CPU.
Holding onto it longer is of questionable benefit.
- This allows two members of lwp_t to be deleted, and mi_userret() to be
simplified a lot (next step: trim it down to a single conditional).
- While here, constify syncobj_t and de-inline a bunch of small functions
like lwp_lock() which turn out not to be small after all (I don't know
why, but atomic_*_relaxed() seem to provoke a compiler shitfit above and
beyond what volatile does).
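As a concrete example of the de-inlining in the last item, lwp_lock() is
now out of line; a simplified sketch of its shape (details may differ
from kern_lwp.c):

    void
    lwp_lock(lwp_t *l)
    {
        kmutex_t *cur = atomic_load_consume(&l->l_mutex);

        /*
         * l_mutex can change while we block on it, so re-check and
         * retry until we hold the LWP's current mutex.
         */
        mutex_spin_enter(cur);
        while (__predict_false(atomic_load_relaxed(&l->l_mutex) != cur)) {
            mutex_spin_exit(cur);
            cur = atomic_load_consume(&l->l_mutex);
            mutex_spin_enter(cur);
        }
    }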
- Do away with separate pool_cache for some kernel objects that have no special
requirements and use the general purpose allocator instead (see the sketch
after this list). On one of my test systems this makes for a small (~1%) but
repeatable reduction in system time during builds, presumably because it
decreases the kernel's cache / memory bandwidth footprint a little.
- vfs_lockf: cache a pointer to the uidinfo and put mutex in the data segment.
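The sketch mentioned above, for a hypothetical foo object: the dedicated
pool_cache(9) is replaced by the general purpose kmem(9) allocator.

    /* Before: a dedicated cache just for foo objects. */
    foo = pool_cache_get(foo_cache, PR_WAITOK);
    /* ... */
    pool_cache_put(foo_cache, foo);

    /* After: the general purpose allocator. */
    foo = kmem_alloc(sizeof(*foo), KM_SLEEP);
    /* ... */
    kmem_free(foo, sizeof(*foo));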
And make them bind to the CPU as a side effect, instead of requiring
the caller to have already done so.
This lets us eliminate the assertions so we can use them in ddb even
when things are going haywire and we just want to get diagnostics.
XXX kernel revbump -- struct cpu_info change
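Roughly the shape of the change, with illustrative names rather than the
real functions: bind to the CPU here instead of asserting that the
caller already did.

    void
    frobnitz_suspend(void)                  /* illustrative name */
    {
        /* was: KASSERT(kpreempt_disabled() || cold); */
        kpreempt_disable();
        curcpu()->ci_frobnitz = true;       /* illustrative cpu_info member */
        kpreempt_enable();
    }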
We can't return early from defibrillate because the IPI may have yet
to run -- we can't return until the other CPU is definitely done
using the ipi_msg_t we created on the stack.
We should avoid calling panic again on the patient CPU in case it was
already in the middle of a panic, so that we don't re-enter panic
while, e.g., trying to print a stack trace.
Sprinkle some comments.
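For reference, the constraint looks like this with the ipi(9) API (the
handler name is illustrative):

    ipi_msg_t msg = { .func = defib_handler, .arg = ci };   /* on our stack */

    ipi_unicast(&msg, ci);
    /*
     * We must not leave this stack frame until the remote CPU is done
     * with msg; ipi_wait() is what provides that guarantee.
     */
    ipi_wait(&msg);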
The timecounter ticks only on the primary CPU, so of course it will
go stale if it's suspended.
(It is, perhaps, a mistake that it only ticks on the primary CPU,
even if the primary CPU is offlined or in a polled-input console
loop, but that's a separate issue.)
This way we can suspend heartbeats on a single CPU while the console
is in polling mode, not just when the CPU is offlined. This should
be rare, so it's not _convenient_, but it should enable us to fix
polling-mode console input when the hardclock timer is still running
on other CPUs.
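The intended usage pattern is roughly this; heartbeat_resume is assumed
as the counterpart, and the real console code paths differ in detail:

    heartbeat_suspend();        /* this CPU only */
    cnpollc(true);              /* enter polled-input mode */
    c = cngetc();               /* may spin for a long time */
    cnpollc(false);
    heartbeat_resume();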
Do this on successful offlining, not on failed offlining.
No functional change right now because heartbeat_suspend is
implemented as a noop -- heartbeat(9) will just check the
SPCF_OFFLINE flag. But if we change it to not be a noop, well, then
we need to call it in the right place.
As soon as the workqueue function has been called, it is forbidden to
touch the struct work passed to it -- the function might free or
reuse the data structure it is embedded in.
So workqueue_wait is forbidden to search the queue for the batch of
running work items. Instead, use a generation number which is odd
while the thread is processing a batch of work and even when not.
There's still a small optimization available with the struct work
pointer to wait for: if we find the work item in one of the per-CPU
_pending_ queues, then after we wait for a batch of work to complete
on that CPU, we don't need to wait for work on any other CPUs.
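A sketch of the generation scheme, with hypothetical lock and field
names standing in for the real ones in subr_workqueue.c:

    /* worker thread, around each batch */
    mutex_enter(&q->q_mutex);
    q->q_gen |= 1;              /* odd: a batch is in flight */
    /* ...move the pending items onto a local list... */
    mutex_exit(&q->q_mutex);
    /* ...run the items; each may be freed as soon as its callback runs... */
    mutex_enter(&q->q_mutex);
    q->q_gen++;                 /* even again: batch done */
    cv_broadcast(&q->q_cv);
    mutex_exit(&q->q_mutex);

    /* workqueue_wait(): wait out any batch that might contain the item */
    /* (the separate check of the pending queues is omitted here) */
    gen = q->q_gen;
    if ((gen & 1) != 0) {
        do {
            cv_wait(&q->q_cv, &q->q_mutex);
        } while (q->q_gen == gen);
    }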
PR kern/57574
XXX pullup-10
XXX pullup-9
XXX pullup-8
entropy_extract may sleep on an adaptive lock, which invalidates
percpu(9) references.
Add a note in the comment over entropy_extract about this.
Discovered by stumbling upon this panic during a test run:
[ 1.0200050] panic: kernel diagnostic assertion "(cprng == percpu_getref(cprng_fast_percpu)) && (percpu_putref(cprng_fast_percpu), true)" failed: file "/home/riastradh/netbsd/current/src/sys/rump/librump/rumpkern/../../../crypto/cprng_fast/cprng_fast.c", line 117
XXX pullup-10
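A simplified sketch of the hazard, in the shape it took in cprng_fast
above (the flag passed to entropy_extract is illustrative):

    uint8_t seed[32];

    /*
     * Wrong: the percpu reference is held across a call that may sleep
     * on an adaptive lock, so the LWP can migrate off the CPU the
     * reference belongs to.
     */
    cprng = percpu_getref(cprng_fast_percpu);
    entropy_extract(seed, sizeof(seed), ENTROPY_WAIT);      /* may sleep */
    /* ... */
    percpu_putref(cprng_fast_percpu);

    /* Right: do the extraction first, or drop the reference before it. */
    entropy_extract(seed, sizeof(seed), ENTROPY_WAIT);      /* may sleep */
    cprng = percpu_getref(cprng_fast_percpu);
    /* ... */
    percpu_putref(cprng_fast_percpu);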
Evidently rump starts threads before it sets cold = 0, and deferring
the call to rump_thread_allow(NULL) in rump.c rump_init_callback
until after setting cold = 0 doesn't work either -- rump kernels just
hang. To be investigated -- for now, let's just stop breaking
thousands of tests.
Softints are forbidden to run while cold. So let's make sure nobody
even tries it -- if they do, they might be delayed indefinitely,
which is going to be much harder to diagnose.
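To make the prohibition concrete, the check amounts to something like
this in the softint scheduling path (the placement is illustrative):

    void
    softint_schedule(void *arg)
    {
        /* Softints cannot fire while cold; catch misuse early. */
        KASSERT(!cold);
        /* ... */
    }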
- Nix the entropy stage (cold, warm, hot). Just use the usual kernel
`cold' (cold: single-core, single-thread; interrupts may happen),
and don't make any three-way distinction about whether interrupts
or threads or other CPUs can be running.
Instead, while cold, use splhigh/splx or forbid paths to come from
interrupt context, and while warm, use mutex or the per-CPU hard
and soft interrupt paths for low latency. This comes at a small
cost to some interrupt latency, since we may stir the pool in
interrupt context -- but only for a very short window early at boot
between configure and configure2, so it's hard to imagine it
matters much.
- Allow rnd_add_uint32 to run in hard interrupt context or with spin
locks held, but defer processing to softint and drop samples on the
floor if the buffer is full. This is mainly used for cheaply tossing
samples from drivers for non-HWRNG devices into the entropy pool,
so it is often used from interrupt context and/or under spin locks.
- New rnd_add_data_intr provides the interrupt-like data entry path
for arbitrary buffers and driver-specified entropy estimates: defer
processing to softint and drop samples on the floor if the buffer is
full.
- Document that rnd_add_data is forbidden under spin locks outside
interrupt context (will crash in LOCKDEBUG), and inadvisable in
interrupt context (but technically permitted just in case there are
compatibility issues for now); later we can forbid it altogether in
interrupt context or under spin locks.
- Audit all uses of rnd_add_data to use rnd_add_data_intr where it
might be used in interrupt context or under a spin lock.
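For the audit in the last item, the resulting pattern in a driver looks
roughly like this (the softc layout is hypothetical, and
rnd_add_data_intr is assumed to take the same arguments as rnd_add_data):

    /* interrupt handler, or any path that may hold a spin lock: */
    rnd_add_data_intr(&sc->sc_rndsource, buf, len, entropybits);

    /* thread context, no spin locks held: */
    rnd_add_data(&sc->sc_rndsource, buf, len, entropybits);

    /* cheap per-event samples from non-HWRNG devices, as before: */
    rnd_add_uint32(&sc->sc_rndsource, datum);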
This fixes a regression from last year when the global entropy lock
was changed from IPL_VM (spin) to IPL_SOFTSERIAL (adaptive). Thought
I'd caught all the problems from that, but another one bit three
different people this week, presumably because of recent changes that
led to more non-HWRNG drivers entering the entropy consolidation
path from rnd_add_uint32.
In my attempt to preserve the rnd(9) API for the (now long-since
abandoned) prospect of pullup to netbsd-9 in my rewrite of the
entropy subsystem in 2020, I didn't introduce a separate entry point
for entering entropy from interrupt context or equivalent, i.e., spin
locks held, and instead made rnd_add_data rely on cpu_intr_p() to
decide whether to process the whole sample under a lock or only take
as much as there's buffer space for before scheduling a softint. In
retrospect, that was a mistake (though perhaps not as much of a
mistake as other entropy API decisions...), a mistake which is
finally getting rectified now by rnd_add_data_intr.
XXX pullup-10
This is used by savecore(8), vmstat(8), and possibly other things.
XXX Should really teach dump and savecore(8) to use an intentionally
designed header, rather than relying on kvm(3) -- and make it work
for saving cores from other kernels so you don't have to boot the
same buggy kernel to get a core dump.
This code was apparently written under the misapprehension that
membar_producer on one side is good enough, but that doesn't
accomplish anything other than making the code unnecessarily obscure.
For semantics, you always always always need memory barriers to come
in pairs, with membar_consumer on the reading side if you want the
membar_producer on the writing side to have any useful effect.
It is unfortunate that this might hurt performance, but that's an
unfortunate consequence of the design made without understanding
memory barriers, not an unfortunate consequence of the memory
barriers.
If it is really critical for the read side to avoid memory barriers,
then the write side needs to issue an IPI or xcall to effect memory
barriers -- at higher cost to the write side, of course.
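The required pairing, as a minimal sketch with hypothetical variables:

    /* writer */
    p->datum = compute();
    membar_producer();          /* order the datum before the flag */
    p->ready = 1;

    /* reader */
    if (p->ready) {
        membar_consumer();      /* order the flag before the datum */
        use(p->datum);
    }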
Only relevant during 32-bit wraparound, so the potential performance
impact of using splhigh here is negligible; indeed, we would have to
go out of our way to exercise this in tests before it will ever
happen in the next century.
These use atomic load on platforms with atomic 64-bit load, and
seqlocks on platforms without.
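On the seqlock side, the reader looks roughly like this; the names are
illustrative, not the actual kern_tc.c symbols:

    /*
     * Illustrative globals; the writer makes the sequence odd, updates
     * the 64-bit count, then makes the sequence even again.
     */
    static volatile unsigned uptime_seq;
    static uint64_t uptime_sec;

    uint64_t
    get_uptime_seconds(void)
    {
        unsigned gen;
        uint64_t t;

        do {
            while ((gen = atomic_load_acquire(&uptime_seq)) & 1)
                continue;       /* writer mid-update; spin */
            t = uptime_sec;     /* non-atomic 64-bit read */
            membar_consumer();  /* finish reading t before rechecking */
        } while (atomic_load_relaxed(&uptime_seq) != gen);

        return t;
    }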
This has the unfortunate side effect of slightly reducing the real
times available on 32-bit platforms, from ending some time in the
year 584942417355 AD, still available on 64-bit platforms, to ending
some time in the year 584942417218 AD. But during that slightly
shorter time, 32-bit platforms can avoid bugs arising from non-atomic
access to time_uptime and time_second.
Note: All platforms still have non-atomic access problems for
bintime, binuptime, nanotime, nanouptime, &c. This can be addressed
by putting a seqlock around timebasebin and possibly some other
variable -- to be done in a later change.
XXX kernel ABI change -- deleting symbols
XXX potential kernel ABI change -- not sure any modules actually use
struct syncobj but it's hard to rule that out because sys/syncobj.h
leaks into sys/lwp.h
No reason for one kthread_join to interfere with another, or to cause
non-cyclic dependencies to get stuck.
Uses struct lwp::l_private for this purpose, which for user threads
stores the tls pointer. I don't think anything in kthread(9) uses
l_private -- generally kernel threads will use lwp specificdata. But
maybe this should use a new member instead, or a union member with an
existing pointer for the purpose.
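For context, the caller-facing pattern is unchanged; only the wait
protocol underneath is affected. The softc and thread function here are
hypothetical:

    /* create a joinable kernel thread */
    error = kthread_create(PRI_NONE, KTHREAD_MUSTJOIN, NULL,
        example_worker, sc, &sc->sc_lwp, "exampleworker");
    if (error)
        return error;
    /* ... */
    /* tear down: waits only for this thread, independent of other joins */
    error = kthread_join(sc->sc_lwp);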