mi_switch(), migration for LSONPROC is now performed via idle loop.
Handles/fixes on-CPU case in lwp_migrate(), misc.
Closes PR/38169, idea of migration via idle loop by Andrew Doran.
Restore NetBSD's shutdown behavior of more than 6 years before rev 1.176.
Ok joerg@.
It is essential that we restore some hardware to initial conditions
before rebooting, in order to avoid interfering with the BIOS
bootstrap. For example, if NetBSD gives control back to the Soekris
comBIOS while the kernel text is write-protected, the BIOS bootstrap
hangs during the power-on self-test, "POST: 0123456789bcefghip".
In principle, bus masters can also interfere with BIOS boot.
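A minimal sketch of the idea on x86, assuming the usual cpufunc.h helpers;
where exactly this lands in the reboot path is illustrative:

    /* Let supervisor code write read-only pages again before the
     * BIOS regains control, so write-protected kernel text cannot
     * trip up the firmware's POST. */
    lcr0(rcr0() & ~CR0_WP);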
- Fix performance regression introduced by the workaround by making job
stealing a lot simpler: if the local run queue is empty, let the CPU enter
the idle loop. In the idle loop, try to steal a job from another CPU's run
queue. If we succeed, re-enter mi_switch() immediately to dispatch the job
(see the sketch after this list).
- When stealing jobs, consider a remote CPU to have one less job in its
queue if it's currently in the idle loop. It will dispatch the job soon,
so there's no point sloshing it about.
- Introduce a few event counters to monitor what's happening with the run
queues.
- Revert the idle CPU bitmap change. It's pointless considering NUMA.
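A sketch of the flow described above; apart from mi_switch() and cpu_idle(),
the helper names are illustrative, not NetBSD's own:

    void
    idle_body(struct cpu_info *ci)
    {
            if (local_runq_empty(ci) && steal_job(ci)) {
                    /* Stole a job from a remote run queue:
                     * re-enter mi_switch() to dispatch it. */
                    mi_switch(curlwp);
                    return;
            }
            cpu_idle();     /* nothing to run, wait for an interrupt */
    }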
Fail sched_catchlwp() if mutex_tryenter() on the remote CPU's state fails.
Seems to work around the issue described in this PR.
XXX Stealing jobs from remote CPUs could probably be moved into the idle
loop, making the locking quite a bit simpler.
- Do timeslicing for SCHED_RR threads. At ~16Hz it's too slow but better
than nothing. XXX
- If a SCHED_OTHER thread has hogged the CPU for 1/8s without taking a
trip through mi_switch(), try to force a kernel preemption to give other
threads a chance.
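A hedged sketch of the hog check from the scheduler tick; the lwp field
names and the exact resched flag are assumptions about this era of the tree:

    /* ~1/8s of CPU time without a trip through mi_switch():
     * ask for a kernel preemption on this CPU. */
    if (l->l_class == SCHED_OTHER &&
        hardclock_ticks - l->l_rticks >= hz / 8)
            cpu_need_resched(curcpu(), RESCHED_KPREEMPT);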
run through copy-on-write. Call fscow_run() with valid data where possible.
The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.
- Add a flag B_MODIFY to bread(), breada() and breadn(). If set, the caller
intends to modify the returned buffer.
- Always run copy-on-write on buffers returned from ffs_balloc().
- Add a new function ffs_getblk() that gets a buffer, assigns a new blkno,
optionally clears the buffer and runs copy-on-write. Handle possible errors
from getblk() or fscow_run(). Part of PR kern/38664.
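A minimal usage sketch; the exact bread() signature for this era is an
assumption:

    /* Tell the buffer layer we will modify the block, so
     * copy-on-write can run before the data is handed back. */
    error = bread(vp, lbn, size, cred, B_MODIFY, &bp);
    if (error)
            return error;
    /* ...modify bp->b_data, then bwrite(bp) or bdwrite(bp)... */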
Welcome to 4.99.63
Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>
compiles in GENERIC for i386. This is still a work-in-progress, but
this checkin covers most of the mechanical work (changing signalling
to be able to accommodate SA's process-wide signalling and re-adding
includes of sys/sa.h and savar.h). Subsequent changes will be much
more interesting.
Also, kern_sa.c has received partial cleanup. There's still more
to do, though.
Make VFS hooks dynamic while we're here and say farewell to VFS_ATTACH and
VFS_HOOKS_ATTACH linksets.
As a consequence, most of the file systems can now be loaded as new style
modules.
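For example, a file system now announces itself through the module machinery
rather than a linkset (a real file system hangs its init/fini logic off the
module command handler):

    MODULE(MODULE_CLASS_VFS, ffs, NULL);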
Quick sanity check by ad@.
Simplify the mount locking. Remove all the crud to deal with recursion on
the mount lock, and crud to deal with unmount as another weirdo lock.
Hopefully this will fix the deadlocks once and for all. With this commit
there are two locks on each mount (see the sketch after the list):
- krwlock_t mnt_unmounting. This is used to prevent unmount across critical
sections like getnewvnode(). It's only ever read locked with rw_tryenter(),
and is only ever write locked in dounmount(). A write hold can't be taken
on this lock if the current LWP could hold a vnode lock.
- kmutex_t mnt_updating. This is taken by threads updating the mount, for
example when going r/o -> r/w, and is only present to serialize updates.
In order to take this lock, a read hold must first be taken on
mnt_unmounting, and the two need to be held across the operation.
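A sketch of the documented ordering for an update, with error handling
trimmed:

    if (!rw_tryenter(&mp->mnt_unmounting, RW_READER))
            return ENOENT;          /* unmount in progress */
    mutex_enter(&mp->mnt_updating);
    /* ...update the mount, e.g. flip r/o -> r/w... */
    mutex_exit(&mp->mnt_updating);
    rw_exit(&mp->mnt_unmounting);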
One effect of this change: previously if an unmount failed, we would make a
half-hearted attempt to back out of it gracefully, but that was unlikely to
work in a lot of cases. Now while an unmount that will be aborted is in
progress, new file operations within the mount will fail instead of being
delayed. That is unlikely to be a problem though, because if the admin
requests unmount of a file system then s(he) has made a decision to deny
access to the resource.
- sys_sync: acquire a write lock on the mount since the operation modifies
the mount structure.
- sys_fchdir: use vfs_trybusy(). If an unmount is in progress, just fail it.
- vnode and cwd locks were being taken with proc_lock held, which is bad
because proc_lock can only be held for a short period of time.
- Processes could have continually forked and escaped notice, keeping
a reference to the old directory on top of which a new mount exists.
getvnode: Use vfs_trybusy, not vfs_busy. If unmount is in progress we
could deadlock, because vnode locks can be held during getnewvnode().
dounmount() locks in the reverse order (vfs_busy -> vnode).
The previous fix worked, but it opened a window where mounts could have
disappeared from mountlist while the caller was traversing it using
vfs_trybusy(). Fix that.
BUFQ_CANCEL(queue, element) removes the specified element previously queued
on the queue. It returns NULL if it was not found on the queue and the
element if it was successfully removed.
Ran through tech-kern; the name was changed from BUFQ_REVOKE() at the
suggestion of Jason Thorpe.
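Usage, per the description above (the softc field is illustrative):

    /* Try to pull bp back off the queue before it is dispatched. */
    if (BUFQ_CANCEL(sc->sc_bufq, bp) == NULL) {
            /* not found: the element was no longer queued */
    }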
The symptom was that sometimes file systems would occasionally not appear
in output from 'df' or 'mount' if the system was busy. Resolution:
- Make mount locks work somewhat like vm_map locks.
- vfs_trybusy() now only fails if the mount is gone, or if someone is
unmounting the file system. Simple contention on mnt_lock doesn't
cause it to fail.
- vfs_busy() will wait even if the file system is being unmounted.
Kernel preemption is prevented by one of the following:
- Holding kernel_lock (indicating that the code is not MT safe).
- Bracketing critical sections with kpreempt_disable/kpreempt_enable.
- Holding the interrupt priority level above IPL_NONE.
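A sketch of the second option; the body of the critical section is
illustrative:

    kpreempt_disable();
    ci = curcpu();          /* stays our CPU until kpreempt_enable() */
    /* ...touch ci's per-CPU state... */
    kpreempt_enable();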
Statistics on kernel preemption are reported via event counters, and
where preemption is deferred for some reason, it's also reported via
lockstat. The LWP priority at which preemption is triggered is tuneable
via sysctl.
DragonflyBSD uses the crit names for something quite different.
- Add a kpreempt_disabled function for diagnostic assertions.
- Add inline versions of kpreempt_enable/kpreempt_disable for primitives.
- Make some more changes for preemption safety to the x86 pmap.
we no longer need to guard against access from hardware interrupt handlers.
Additionally, if cloning a process with CLONE_SIGHAND, arrange to have the
child process share the parent's lock so that signal state may be kept in
sync. Partially addresses PR kern/37437.
Merge proclist_mutex and proclist_lock into a single adaptive mutex
(proc_lock).
Implications:
- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.
- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.
- The system spends less time at IPL_SCHED, and there is less lock activity.
- Socket layer becomes MP safe.
- Unix protocols become MP safe.
- Allows protocol processing interrupts to safely block on locks.
- Fixes a number of race conditions.
With much feedback from matt@ and plunky@.
error-out the master buffer.
The old setup was nondeterministic: since there is no B_ERROR flag anymore, a
nested buffer scheduled later could clear the error again. It also would
discard the error the nested buffer returned.
condition), it leaves the control message with file descriptors. Calling
unp_dispose() will interpret the message as containing file pointers
and crash the system.
This change removes unp_dispose() from this failure path and avoids
using goto to jump into switch statements...
The previous workaround to ignore such messages in unp_scan() is removed.
appended while the receiver is blocked, the sockbuf will be corrupted.
Dequeue control messages from the sockbuf and sync its state in one
pass. Only then process the control messages. From FreeBSD.
put in the right spot. The 'next' link in the new entry must become globally
visible before the list head is updated. This could have affected systems
with weak memory ordering like the alpha.
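A sketch of the publish order with the store barrier in the right spot (the
list field names are illustrative):

    new->next = head;       /* fill the entry first... */
    membar_producer();      /* ...make it visible before... */
    head = new;             /* ...publishing it via the head */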
mandatory. Remove the 4BSD run queue code. Effects:
- Pluggable scheduler is only responsible for co-ordinating timeshared jobs.
- All systems run with per-CPU run queues.
- 4BSD scheduler gets processor sets / affinity.
- 4BSD scheduler gets a significant performance boost on some workloads.
Discussed on tech-kern@.
- Use atomic ops directly, since rwlocks work the same way on all platforms.
- Try to make it a bit more cache efficient, and use branch hints.
- Fix a bug in rw_downgrade() where the turnstile lock was not released.
- Remove a couple of redundant assertions.
- Use atomic_swap instead of atomic_cas where it's safe to do so.
- After acquiring the turnstile lock in rw_vector_enter, check if the
owner is running again and spin if so.
- Introduce and use rw_onproc() instead of abusing mutex_onproc().
- Change the handoff/release algorithm to reduce the window during which an
rwlock can be held but its owner not running on a CPU.
- percpu_getptr() is now called percpu_getref() and implicitly disables
preemption (via crit_enter()) when it is called.
- Added percpu_putref() which implicitly reenables preemption (via
crit_exit()).
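Usage sketch (the percpu handle and field are illustrative):

    struct stats *st = percpu_getref(stats_percpu); /* preemption off */
    st->nops++;
    percpu_putref(stats_percpu);                    /* preemption back on */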
PRI_KTHREAD range. This is kind of ugly, but needed because of direct handoff
with rwlocks, and because threads that block holding a mutex regularly hold
other locks/resources.
Problem addressed: priority lending works well where a thread blocking on a
turnstile has a high priority level (eg realtime). For timeshared threads
(low priority) it's unlikely to have much effect. In the latter case threads
awoken from a turnstile can and do compete for CPU time with regular waits
like disk I/O. On MP systems this can result in a feedback loop where
threads cannot quickly get access to a resource held by a thread waking from
a turnstile. The waking thread eventually runs when enough of the other
threads block waiting for it, freeing up the CPU. The end result is a lot of
idle time during builds.
there are no waiters. This gives a major boost to build.sh on larger
systems as directory vnode locks are exclusive for lookup, but are often
only held for a very short period of time.
This change has the potential to more readily expose lock order reversals
and other types of deadlock.
tech-kern last year. Instead of modifying callout_stop, add a new
routine (callout_halt) which will sleep if the callout is already in
flight. Note that if a callout can take locks, the caller of callout_halt
must not hold any of those locks - otherwise the two could deadlock.
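Typical use, assuming a driver softc with a callout:

    /* Wait out a possibly in-flight handler.  Locks that the
     * handler takes must not be held at this point. */
    callout_halt(&sc->sc_tick_ch, NULL);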
we have unlocked the kqueue to check its state, leave it queued and
re-check later.
- knote_dequeue: fold into knote_detach since nothing else uses it.
- Note a couple more problems.
buf_memcalc is used during bootstrap as well. Fix a NULL dereference in that
case.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.
testing by rmind@ and myself.
Which approach to use is still being discussed, but I would like to get
this out of my working tree. If we decide to use a different approach
there is no problem with revisiting this.
- Redo reference counting to be sane. LWPs accessing files take a short
term reference on the local file descriptor. This is the most common
case. While a file is in a process descriptor table, a reference is
held to the file. The file reference count only changes during control
operations like open() or close(). Code that comes at files from an
unusual direction (i.e. foreign to the process) like procfs or sysctl
takes a reference on the file (f_count), and not on a descriptor.
- Remove knowledge of reference counting and locking from most code that
deals with files.
- Make the usual case of file descriptor lookup lockless.
- Make kqueue MP and MT safe. PR kern/38098, PR kern/38137.
- Fix numerous file handling bugs, and bugs in the descriptor code that
affected multithreaded processes.
- Split descriptor system calls out into sys_descrip.c.
- A few stylistic changes: KNF, remove unused casts now that caddr_t is
gone. Replace dumb gotos with loop control in a few places.
- Don't do redundant pointer passing (struct proc, lwp, filedesc *) unless
the routine is likely to be inlined. Most of the time it's about the
current process.
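A sketch of the common short-term reference pattern described above; the
exact signatures for this era are assumed:

    file_t *fp;

    if ((fp = fd_getfile(fd)) == NULL)
            return EBADF;
    error = (*fp->f_ops->fo_ioctl)(fp, cmd, data);
    fd_putfile(fd);         /* drop the short-term reference */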
This is for netsmb which wants to poll sockets directly.
- When polling a socket, first check for pending I/O without acquiring any
locks. If no I/O seems to be pending, acquire locks/spl and check again,
doing selrecord() if necessary.
value, which can be overwritten with syscalls.conf defines (just like
sys_nosys).
This fixes a problem where rump awk variables are set to the value 0,
leading to the creation of an unexpected file with that name.
ok by pooka.
This makes the uid_find(), chgproccnt(), chgsbsize(), lf_alloc() and
lf_free() functions lockless.
- Increase the size of uihashtbl in case of MP system, as suggested by <ad>.
- Add HASH_SLIST type for hashinit().
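A sketch of the new hash type in use; the exact hashinit() signature here is
an assumption:

    uihashtbl = hashinit(desired, HASH_SLIST, true, &uihash);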
Reviewed by <ad>.
existing behaviour: the unsleep method unlocks and wakes the swapper if
needs be. If false, the caller is doing a batch operation and will take
care of that later. This is kind of ugly, but it's difficult for the caller
to know which lock to release in some situations.
Improve PMF-ability.
Add a 'flags' argument to suspend/resume handlers and
callers such as pmf_system_suspend().
Define a flag, PMF_F_SELF, which indicates to PMF that a
device is suspending/resuming itself. Add helper routines,
pmf_device_suspend_self(dev) and pmf_device_resume_self(dev),
that call pmf_device_suspend(dev, PMF_F_SELF) and
pmf_device_resume(dev, PMF_F_SELF), respectively. Use
PMF_F_SELF to suspend/resume self in ath(4), audio(4),
rtw(4), and sip(4).
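For instance, a driver might suspend itself like this (the softc field is
illustrative):

    /* Wraps pmf_device_suspend(dev, PMF_F_SELF). */
    if (!pmf_device_suspend_self(sc->sc_dev))
            aprint_error_dev(sc->sc_dev, "self-suspend failed\n");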
In ath(4) and in rtw(4), replace the icky sc_enable/sc_disable
callbacks, provided by the bus front-end, with
self-suspension/resumption. Also, clean up the bus
front-ends. Make sure that the interrupt handler is
disestablished during suspension. Get rid of driver-private
flags (e.g., RTW_F_ENABLED, ath_softc->sc_invalid); use
device_is_active()/device_has_power() calls, instead.
In the network-class suspend handler, call if_stop(, 0)
instead of if_stop(, 1), because the latter is superfluous
(bus- and driver-suspension hooks will 'disable' the NIC),
and it may cause recursion.
In the network-class resume handler, prevent infinite
recursion through if_init() by getting out early if we are
self-suspending (PMF_F_SELF).
rtw(4) improvements:
Destroy rtw(4) callouts when we detach it. Make rtw at
pci detachable. Print some more information with the "rx
frame too long" warning.
Remove activate() methods:
Get rid of rtw_activate() and ath_activate(). The device
activate() methods are not good for much these days.
Make ath at cardbus resume with crypto functions intact:
Introduce a boolean device property, "pmf-powerdown". If
pmf-powerdown is present and false, it indicates that a
bus back-end should not remove power from a device.
Honor this property in cardbus_child_suspend().
Set this property to 'false' in ath_attach(), since removing
power from an ath at cardbus seems to lobotomize the WPA
crypto engine. XXX Should the pmf-powerdown property
propagate toward the root of the device tree?
Miscellaneous ath(4) changes:
Warn if ath(4) tries to write crypto keys to suspended
hardware.
Reduce differences between FreeBSD and NetBSD in ath(4)
multicast filter setup.
Make ath_printrxbuf() print an rx descriptor's status &
key index, to help debug crypto errors.
Shorten a staircase in ath_ioctl(). Don't check for
ieee80211_ioctl() return code ERESTART, it never happens.
going through a syscall trap. These are currently useful for rumps.
As all the standard syscalls are not compiled into librump, mark
relevant ones with RUMP in syscalls.master. To do e.g. a mkdir
"system call" from a rump, one would call
rump_sys_mkdir("/dir", mode, &eval);
where the last value represents something to store errno into.
never sleep. Should fix PR/37245 by <yamt>.
- Fix a regression: disallow catching of bound threads. Also, allow
migration of non-bound kthreads; this restriction seems pointless.
- Few micro-optimisations, misc.