gcc generates calls to this symbol in programs that use
__sync_*_compare_and_swap, which require full sequential consistency
barriers, including store-before-load ordering on both sides of the
atomic; none of the release/acquire operations guarantee that, so we
have to insert explicit DMB instructions.
Note: gcc's own definition omits some of the DMB instructions, but I
can't prove that it's correct that way -- stores preceding the CAS
must complete before the load part of the CAS, and the store part of
the CAS must complete before loads following the CAS. Maybe there's
some way to prove that one of these orderings is guaranteed some
other way than a DMB but I'm not seeing it, and store-before-load
ordering is hard to understand.
Patch by skrll@ based on a patch by mrg@, soliloquy in commit message
by me.
The more-compatible LOCK ADD $0,-N(%rsp) turns out to be cheaper
than MFENCE anyway. Let's save some space and maintenance and rip
out the hotpatching for it.
- membar_enter/exit ordering was backwards.
- membar_enter doesn't make any sense for load anyway.
- Switch to membar_release for store and membar_acquire for load.
The only sensible orderings for a simple load or store are acquire or
release, respectively, or sequential consistency. This never
provided correct sequential consistency before -- we should really
make it conditional on memmodel but I don't know offhand what the
values of memmodel might be and this is at least better than before.
The names membar_enter/exit were unclear, and the documentation of
membar_enter has disagreed with the implementations on sparc,
powerpc, and even x86(!) for the entire time it has been in NetBSD.
The terms `acquire' and `release' are ubiquitous in the literature
today, and have been adopted in the C and C++ standards to mean
load-before-load/store and load/store-before-store, respectively,
which are exactly the orderings required by acquiring and releasing a
mutex, as well as other useful applications like decrementing a
reference count and then freeing the underlying object if it went to
zero.
Originally I proposed changing one word in the documentation for
membar_enter to make it load-before-load/store instead of
store-before-load/store, i.e., to make it an acquire barrier. I
proposed this on the grounds that
(a) all implementations guarantee load-before-load/store,
(b) some implementations fail to guarantee store-before-load/store,
and
(c) all uses in-tree assume load-before-load/store.
I verified parts (a) and (b) (except, for (a), powerpc didn't even
guarantee load-before-load/store -- isync isn't necessarily enough;
need lwsync in general -- but it _almost_ did, and it certainly didn't
guarantee store-before-load/store).
Part (c) might not be correct, however: under the mistaken assumption
that atomic-r/m/w then membar-w/rw is equivalent to atomic-r/m/w then
membar-r/rw, I only audited the cases of membar_enter that _aren't_
immediately after an atomic-r/m/w. All of those cases assume
load-before-load/store. But my assumption was wrong -- there are
cases of atomic-r/m/w then membar-w/rw that would be broken by
changing to atomic-r/m/w then membar-r/rw:
https://mail-index.netbsd.org/tech-kern/2022/03/29/msg028044.html
Furthermore, the name membar_enter has been adopted in other places
like OpenBSD where it actually does follow the documentation and
guarantee store-before-load/store, even if that order is not useful.
So the name membar_enter currently lives in a bad place where it
means either of two things -- r/rw or w/rw.
With this change, we deprecate membar_enter/exit, introduce
membar_acquire/release as better names for the useful pair (r/rw and
rw/w), and make sure the implementation of membar_enter guarantees
both what was documented _and_ what was implemented, making it an
alias for membar_sync.
While here, rework all of the membar_* definitions and aliases. The
new logic follows a rule to make it easier to audit:
membar_X is defined as an alias for membar_Y iff membar_X is
guaranteed by membar_Y.
The `no stronger than' relation is (the transitive closure of):
- membar_consumer (r/r) is guaranteed by membar_acquire (r/rw)
- membar_producer (w/w) is guaranteed by membar_release (rw/w)
- membar_acquire (r/rw) is guaranteed by membar_sync (rw/rw)
- membar_release (rw/w) is guaranteed by membar_sync (rw/rw)
And, for the deprecated membars:
- membar_enter (whether r/rw, w/rw, or rw/rw) is guaranteed by
membar_sync (rw/rw)
- membar_exit (rw/w) is guaranteed by membar_release (rw/w)
(membar_exit is identical to membar_release, but the name is
deprecated.)
Finally, while here, annotate some of the instructions with their
semantics. For powerpc, leave an essay with citations on the
unfortunate but -- as far as I can tell -- necessary decision to use
lwsync, not isync, for membar_acquire and membar_consumer.
Also add membar(3) and atomic(3) man page links.
This will be deprecated soon but let's avoid leaving rakes to trip on
with it arising from disagreement over the documentation (W/RW) and
implementation and usage (R/RW).
This will be deprecated soon but let's avoid leaving rakes to trip on
with it arising from disagreement over the documentation (W/RW) and
implementation and usage (R/RW).
This will be deprecated soon but let's avoid leaving rakes to trip on
with it arising from disagreement over the documentation (W/RW) and
implementation and usage (R/RW).
This will be deprecated soon but let's avoid leaving rakes to trip on
with it arising from disagreement over the documentation (W/RW) and
implementation and usage (R/RW).
This will be deprecated soon but let's avoid leaving rakes to trip on
with it arising from disagreement over the documentation (W/RW) and
implementation and usage (R/RW).
On x86, every store is a store-release, so there is no need for any
barrier. But this wasn't a barrier anyway; it was just a store,
which was redundant with the store of the return address to the stack
implied by CALL even if issuing a store made a difference.
lfence is only needed for MD logic, such as operations on I/O memory
rather than normal cacheable memory, or special instructions like
RDTSC -- never for MI synchronization between threads/CPUs. No need
for hot-patching to do lfence here.
(The x86_lfence function might reasonably be patched on i386 to do
lfence for MD logic, but it isn't now and this doesn't change that.)
In TSO this is the only memory barrier ever needed, and somehow we
got this wrong and instead issued an unnecessary membar #LoadLoad --
not needed even in PSO let alone in TSO.
XXX Apparently we may run userland programs with PSO or RMO, in which
case all of these membars need fixing:
PSO RMO
membar_consumer nop membar #LoadLoad
membar_producer membar #StoreStore membar #StoreStore
membar_enter nop membar #LoadLoad|LoadStore
membar_exit membar #StoreStore membar #LoadStore|StoreStore
membar_sync membar #StoreLoad|StoreStore
membar #...everything...
But at least this fixes the TSO case in which we run the kernel.
Also I'm not sure there's any non-TSO hardware out there in practice.
membar_sync is required to be a full sequential consistency barrier,
equivalent to MEMBAR #StoreStore|LoadStore|StoreLoad|LoadLoad on
sparcv9. LDSTUB and SWAP are the only pre-v9 instructions that do
this and SWAP doesn't exist on all v7 hardware, so use LDSTUB.
Note: I'm having a hard time nailing down a reference for the
ordering implied by LDSTUB and SWAP. I'm _pretty sure_ SWAP has to
imply store-load ordering since the SPARCv8 manual recommends it for
Dekker's algorithm (which notoriously requires store-load ordering),
and the formal memory model treats LDSTUB and SWAP the same for
ordering. But the v8 and v9 manuals aren't clear.
GCC issues STBAR and LDSTUB, but (a) I don't see why STBAR is
necessary here, (b) STBAR doesn't exist on v7 so it'd be a pain to
use, and (c) from what I've heard (although again it's hard to nail
down authoritative references here) all actual SPARC hardware is TSO
or SC anyway so STBAR is a noop in all the silicon anyway.
Either way, certainly this is better than what we had before, which
was nothing implying ordering at all, just a store!
5c44459c3b
This bug was reported by Danilo Ramos of Eideticom, Inc. It has
lain in wait 13 years before being found! The bug was introduced
in zlib 1.2.2.2, with the addition of the Z_FIXED option. That
option forces the use of fixed Huffman codes. For rare inputs with
a large number of distant matches, the pending buffer into which
the compressed data is written can overwrite the distance symbol
table which it overlays. That results in corrupted output due to
invalid distances, and can result in out-of-bound accesses,
crashing the application.
The fix here combines the distance buffer and literal/length
buffers into a single symbol buffer. Now three bytes of pending
buffer space are opened up for each literal or length/distance
pair consumed, instead of the previous two bytes. This assures
that the pending buffer cannot overwrite the symbol table, since
the maximum fixed code compressed length/distance is 31 bits, and
since there are four bytes of pending space for every three bytes
of symbol space.
This change should be safe because it doesn't remove or weaken any
memory barriers, but does add, clarify, or strengthen barriers.
Goals:
- Make sure mutex_enter/exit and mutex_spin_enter/exit have
acquire/release semantics.
- New macros make maintenance easier and purpose clearer:
. SYNC_ACQ is for load-before-load/store barrier, and BDSYNC_ACQ
for a branch delay slot -- currently defined as plain sync for MP
and nothing, or nop, for UP; thus it is no weaker than SYNC and
BDSYNC as currently defined, which is syncw on Octeon, plain sync
on non-Octeon MP, and nothing/nop on UP.
It is not clear to me whether load-then-syncw or ll/sc-then-syncw
or even bare load provides load-acquire semantics on Octeon -- if
no, this will fix bugs; if yes (like it is on SPARC PSO), we can
relax SYNC_ACQ to be syncw or nothing later.
. SYNC_REL is for load/store-before-store barrier -- currently
defined as plain sync for MP and nothing for UP.
It is not clear to me whether syncw-then-store is enough for
store-release on Octeon -- if no, we can leave this as is; if
yes, we can relax SYNC_REL to be syncw on Octeon.
. SYNC_PLUNGER is there to flush clogged Cavium store buffers, and
BDSYNC_PLUNGER for a branch delay slot -- syncw on Octeon,
nothing or nop on non-Octeon.
=> This is not necessary (or, as far as I'm aware, sufficient)
for acquire semantics -- it serves only to flush store buffers
where stores might otherwise linger for hundreds of thousands
of cycles, which would, e.g., cause spin locks to be held for
unreasonably long durations.
Newerish revisions of the MIPS ISA also have finer-grained sync
variants that could be plopped in here.
Mechanism:
Insert these barriers in the right places, replacing only those where
the definition is currently equivalent, so this change is safe.
- Replace #ifdef _MIPS_ARCH_OCTEONP / syncw / #endif at the end of
atomic_cas_* by SYNC_PLUNGER, which is `sync 4' (a.k.a. syncw) if
__OCTEON__ and empty otherwise.
=> From what I can tell, __OCTEON__ is defined in at least as many
contexts as _MIPS_ARCH_OCTEONP -- i.e., there are some Octeons
with no _MIPS_ARCH_OCTEONP, but I don't know if any of them are
relevant to us or ever saw the light of day outside Cavium; we
seem to buid with `-march=octeonp' so this is unlikely to make a
difference. If it turns out that we do care, well, now there's
a central place to make the distinction for sync instructions.
- Replace post-ll/sc SYNC by SYNC_ACQ in _atomic_cas_*, which are
internal kernel versions used in sys/arch/mips/include/lock.h where
it assumes they have load-acquire semantics. Should move this to
lock.h later, since we _don't_ define __HAVE_ATOMIC_AS_MEMBAR on
MIPS and so the extra barrier might be costly.
- Insert SYNC_REL before ll/sc, and replace post-ll/sc SYNC by
SYNC_ACQ, in _ucas_*, which is used without any barriers in futex
code and doesn't mention barriers in the man page so I have to
assume it is required to be a release/acquire barrier.
- Change BDSYNC to BDSYNC_ACQ in mutex_enter and mutex_spin_enter.
This is necessary to provide load-acquire semantics -- unclear if
it was provided already by syncw on Octeon, but it seems more
likely that either (a) no sync or syncw is needed at all, or (b)
syncw is not enough and sync is needed, since syncw is only a
store-before-store ordering barrier.
- Insert SYNC_REL before ll/sc in mutex_exit and mutex_spin_exit.
This is currently redundant with the SYNC already there, but
SYNC_REL more clearly identifies the necessary semantics in case we
want to define it differently on different systems, and having a
sync in the middle of an ll/sc is a bit weird and possibly not a
good idea, so I intend to (carefully) remove the redundant SYNC in
a later change.
- Change BDSYNC to BDSYNC_PLUNGER at the end of mutex_exit. This has
no semantic change right now -- it's syncw on Octeon, sync on
non-Octeon MP, nop on UP -- but we can relax it later to nop on
non-Cavium MP.
- Leave LLSCSYNC in for now -- it is apparently there for a Cavium
erratum, but I'm not sure what the erratum is, exactly, and I have
no reference for it. I suspect these can be safely removed, but we
might have to double up some other syncw instructions -- Linux uses
it only in store-release sequences, not at the head of every ll/sc.
- Eradicate last vestiges of mb_* barriers.
- In __cpu_simple_lock_init, omit needless barrier. It is the
caller's responsibility to ensure __cpu_simple_lock_init happens
before other operations on it anyway, so there was never any need
for a barrier here.
- In __cpu_simple_lock_try, leave comments about memory ordering
guarantees of the kernel's _atomic_cas_uint, which are inexplicably
different from the non-underscored atomic_cas_uint.
- In __cpu_simple_unlock, use membar_exit instead of mb_memory, and do
it unconditionally.
This ensures that in __cpu_simple_lock/.../__cpu_simple_unlock, all
memory operations in the ellipsis happen before the store that
releases the lock.
- On Octeon, the barrier was omitted altogether, which is a bug --
it needs to be there or else there is no happens-before relation
and whoever takes the lock next might see stale values stored or
even stomp over the unlocking CPU's delayed loads.
- On non-Octeon, the mb_memory was sync. Using membar_exit
preserves this.
XXX On Octeon, membar_exit only issues syncw -- this seems wrong,
only store-before-store and not load/store-before-store, unless the
CNMIPS architecture guarantees it is sufficient here like
SPARCv8/v9 PSO (`Partial Store Order').
- Leave an essay with citations about why we have an apparently
pointless syncw _after_ releasing a lock, to work around a design
bug^W^Wquirk in cnmips which sometimes buffers stores for hundreds
of thousands of cycles for fun unless you issue syncw.
In the current implementation, locks are acquired at the entrance of the mcount
internal function, so the higher the number of cores, the more lock conflict
occurs, making profiling performance in a MULTIPROCESSOR environment unusable
and slow. Profiling buffers has been changed to be reserved for each CPU,
improving profiling performance in MP by several to several dozen times.
- Eliminated cpu_simple_lock in mcount internal function, using per-CPU buffers.
- Add ci_gmon member to struct cpu_info of each MP arch.
- Add kern.profiling.percpu node in sysctl tree.
- Add new -c <cpuid> option to kgmon(8) to specify the cpuid, like openbsd.
For compatibility, if the -c option is not specified, the entire system can be
operated as before, and the -p option will get the total profiling data for
all CPUs.