I'm leaving in the conditional around the legacy membar_enters
(store-before-load, store-before-store) in kern_mutex.c and in
kern_lock.c because they may still matter: store-before-load barriers
tend to be the most expensive kind, so eliding them is probably
worthwhile on x86. (It also may not matter; I just don't care to do
measurements right now, and it's a single valid and potentially
justifiable use case in the whole tree.)
However, membar_release/acquire can be mere instruction barriers on
all TSO platforms including x86, so there's no need to go out of our
way with a bad API to conditionalize them. If the procedure call
overhead is measurable, we could just change them to macros on x86
that expand to __insn_barrier.
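For concreteness, such a macro might look like this (a sketch only: the
conditional is hypothetical, while __insn_barrier() is the existing
compiler barrier from <sys/cdefs.h>):

	#ifdef __HYPOTHETICAL_X86_TSO	/* made-up conditional */
	#define	membar_acquire()	__insn_barrier()
	#define	membar_release()	__insn_barrier()
	#endif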
Discussed on tech-kern:
https://mail-index.netbsd.org/tech-kern/2023/02/23/msg028729.html
Currently there is a voluntary yield only every 100ms, which is a long
time. This should help avoid hogging the CPU while flushing lots of
data to big disks on systems without kpreemption.
- Parallel mounts may get the same fsid. Always increment "xxxfs_mntid"
  to make this unlikely.
- Directly walk "mountlist" to prevent a rare deadlock where one thread
  holds a vnode locked, calls vfs_getnewfsid() and the iterator has to
  wait for a suspended file system while the suspending thread needs
  this vnode lock.
This just goes through my recent reference count membar audit and
changes membar_exit to membar_release and membar_enter to
membar_acquire -- this should make everything cheaper on most CPUs
without hurting correctness, because membar_acquire is generally
cheaper than membar_enter.
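Concretely, for the release pattern discussed below, the rename amounts
to this (a sketch; obj and refcnt are illustrative names):

	membar_release();	/* was membar_exit() */
	if (atomic_dec_uint_nv(&obj->refcnt) != 0)
		return;
	membar_acquire();	/* was membar_enter() */
	free_stuff(obj);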
vdevgone relies on this to ensure that if there is a concurrent
revoke in progress, it will wait for that revoke to finish -- that
way, it can guarantee all I/O operations have completed and the
device is closed.
If two threads are using an object that is freed when the reference
count goes to zero, we need to ensure that all memory operations
related to the object happen before freeing the object.
Using atomic_dec_uint_nv(&refcnt) == 0 ensures that only one thread
takes responsibility for freeing, but it is not enough to ensure that
the other thread's memory operations happen before the freeing.
Consider:

	Thread A                        Thread B
	obj->foo = 42;                  obj->baz = 73;
	mumble(&obj->bar);              grumble(&obj->quux);
	/* membar_exit(); */            /* membar_exit(); */
	atomic_dec -- not last          atomic_dec -- last
	                                /* membar_enter(); */
	                                KASSERT(invariant(obj->foo,
	                                    obj->bar));
	                                free_stuff(obj);
The memory barriers ensure that

	obj->foo = 42;
	mumble(&obj->bar);

in thread A happens before

	KASSERT(invariant(obj->foo, obj->bar));
	free_stuff(obj);

in thread B. Without them, this ordering is not guaranteed.
So in general it is necessary to do

	membar_exit();
	if (atomic_dec_uint_nv(&obj->refcnt) != 0)
		return;
	membar_enter();
to release a reference, for the `last one out hit the lights' style
of reference counting. (This is in contrast to the style where one
thread blocks new references and then waits under a lock for existing
ones to drain with a condvar -- no membar needed thanks to mutex(9).)
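For comparison, the blocking style needs no membars because mutex(9)
provides the ordering; a minimal sketch (field names illustrative,
using the mutex(9) and condvar(9) APIs):

	mutex_enter(&obj->lock);
	obj->dying = true;		/* block new references */
	while (obj->refcnt != 0)
		cv_wait(&obj->cv, &obj->lock);
	mutex_exit(&obj->lock);
	free_stuff(obj);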
I searched for atomic_dec to find all these. Obviously we ought to
have a better abstraction for this because there's so much copypasta.
This is a stop-gap measure to fix actual bugs until we have that. It
would be nice if an abstraction could gracefully handle the different
styles of reference counting in use -- some years ago I drafted an
API for this, but making it cover everything got a little out of hand
(particularly with struct vnode::v_usecount) and I ended up setting
it aside to work on psref/localcount instead for better scalability.
I got bored of adding #ifdef __HAVE_ATOMIC_AS_MEMBAR everywhere, so I
only put it on things that look performance-critical on 5sec review.
We should really adopt membar_enter_preatomic/membar_exit_postatomic
or something (except they are applicable only to atomic r/m/w, not to
atomic_load/store_*, making the naming annoying) and get rid of all
the ifdefs.
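For reference, the ifdef dance in question looks like this (the membars
and the release pattern are as above; only the conditionals are the
point here):

	#ifndef __HAVE_ATOMIC_AS_MEMBAR
		membar_exit();
	#endif
		if (atomic_dec_uint_nv(&obj->refcnt) != 0)
			return;
	#ifndef __HAVE_ATOMIC_AS_MEMBAR
		membar_enter();
	#endif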
locked and referenced across the call to swap_off() and finally
use it from vfs_unmountall1() to remove swap after unmounting
the last file system.
Addresses PR kern/54969 (Disk cache is no longer flushed on shutdown)
disks before the raid:
forcefully unmounting / (/dev/raid0a)...
sd1: detached
sd0: detached
raid0: cache flush to component /dev/sd0a failed.
raid0: cache flush to component /dev/sd1a failed.
fatal page fault in supervisor mode
Stopped in pid 2356.2356 (reboot) at netbsd:sdstrategy+0x36
Reopens PR kern/54969: Disk cache is no longer flushed on shutdown
which relied on taking extra vnode refs.
Benchmarking various experimental changes over the past few months
suggests that it's better to avoid vnode refs as much as possible.
Making cwdi_lock an RW lock already did that to some extent for
getcwd() and will permit the same for namei() too.
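For context, the read-side pattern this enables looks roughly like this
(a sketch using the rwlock(9) API; the lookup itself is elided and
error handling is omitted):

	struct cwdinfo *cwdi = l->l_proc->p_cwdi;
	struct vnode *dvp;

	rw_enter(&cwdi->cwdi_lock, RW_READER);
	/*
	 * cwdi_cdir cannot change while the reader lock is held,
	 * so the lookup can start from it without taking a vnode
	 * reference.
	 */
	dvp = cwdi->cwdi_cdir;
	/* ... resolve the path from dvp ... */
	rw_exit(&cwdi->cwdi_lock);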
- Have a stab at clustering the members of vnode_t and vnode_impl_t in a
more cache-conscious way. With that done, go back to adjusting v_usecount
with atomics and keep vi_lock directly in vnode_impl_t (saves KVA).
- Allow VOP_LOCK(LK_NONE) for the benefit of VFS_VGET() and VFS_ROOT().
Make sure LK_UPGRADE always comes with LK_NOWAIT.
- Make use of cwdinfo mostly lockless.
Don't take a mount reference for fstrans as it gets notified about the
release. Defer the final free of the mount to fstrans_mount_dtor(),
which runs once fstrans has released all references to this mount.
This prevents the mount's memory from being reused as a new mount
before fstrans has released all references.
Addresses PR kern/53928: modules/t_builtin:disable test case randomly fails.
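The deferred free reduces to something like this sketch (illustrative
only; the real fstrans code keeps its own bookkeeping, and the
reference count field shown here is made up):

	static void
	fstrans_mount_dtor(struct mount *mp)
	{

		/*
		 * Called each time fstrans drops a reference; the last
		 * one out frees the mount, so its memory cannot be
		 * recycled while fstrans still references it.
		 */
		if (atomic_dec_uint_nv(&mp->mnt_fstrans_refcnt) > 0)
			return;
		kmem_free(mp, sizeof(*mp));
	}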
valid "mnt_transinfo" and remove now unneeded flag IMNT_HAS_TRANS.
Run fstrans_start()/fstrans_done() on dead_rootmount if FSTRANS_DEAD_ENABLED.
Should become the default for DIAGNOSTIC in the future.
use FSTRANS_SHARED as the lock type, so remove the lock type argument.
File system state FSTRANS_SUSPENDING is now unused, so remove it.
Regen vnode_if files.
Ride 8.99.1 less than an hour ago.
kmem_alloc() with KM_SLEEP
kmem_zalloc() with KM_SLEEP
percpu_alloc()
pserialize_create()
psref_class_create()
All of these paths include an assertion that the allocation has not
failed, so callers should not assert that again.
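For illustration, the now-redundant pattern looks like this (a sketch;
sc is a hypothetical allocation):

	sc = kmem_zalloc(sizeof(*sc), KM_SLEEP);
	KASSERT(sc != NULL);	/* redundant: KM_SLEEP cannot fail */

and can simply become

	sc = kmem_zalloc(sizeof(*sc), KM_SLEEP);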