1136 Commits

Author SHA1 Message Date
skrll 16e0b464c4 Fix the copy&paste botch from previous. Spotted by Tom Lane. 2022-05-16 06:07:23 +00:00
skrll bbf56d84ab *** empty log message *** 2022-05-14 05:35:55 +00:00
riastradh 4a6459a8d1 mips/cavium: Take advantage of Octeon's guaranteed r/rw ordering. 2022-04-21 12:06:31 +00:00
riastradh cfa39f97b0 libc/atomic: Fix membars in __atomic_load/store_* stubs.
- membar_enter/exit ordering was backwards.
- membar_enter doesn't make any sense for load anyway.
- Switch to membar_release for store and membar_acquire for load.

The only sensible orderings for a simple load or store are acquire or
release, respectively, or sequential consistency.  This never
provided correct sequential consistency before -- we should really
make it conditional on memmodel but I don't know offhand what the
values of memmodel might be and this is at least better than before.
2022-04-09 23:38:57 +00:00
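The pairing described in cfa39f97b0 (acquire after a load, release before a store) can be sketched in portable C11 as follows. This is an illustrative stand-in, not the libc stub itself: the sketch_* names are hypothetical, and atomic_thread_fence is assumed here to play the role of membar_acquire/membar_release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical load stub: relaxed load, then an acquire (r/rw) barrier. */
static uint32_t
sketch_atomic_load_4(_Atomic uint32_t *p)
{
	uint32_t v = atomic_load_explicit(p, memory_order_relaxed);

	atomic_thread_fence(memory_order_acquire);
	return v;
}

/* Hypothetical store stub: release (rw/w) barrier, then relaxed store. */
static void
sketch_atomic_store_4(_Atomic uint32_t *p, uint32_t v)
{
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(p, v, memory_order_relaxed);
}

/* Single-threaded sanity check: a stored value is read back unchanged. */
static void
sketch_selftest(void)
{
	_Atomic uint32_t cell = 0;

	sketch_atomic_store_4(&cell, 42);
	assert(sketch_atomic_load_4(&cell) == 42);
}
```

Neither stub provides sequential consistency, which matches the caveat in the log message; a stronger barrier would be needed if the memmodel argument asked for it.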
riastradh 4f8ce3b31d Introduce membar_acquire/release. Deprecate membar_enter/exit.
The names membar_enter/exit were unclear, and the documentation of
membar_enter has disagreed with the implementations on sparc,
powerpc, and even x86(!) for the entire time it has been in NetBSD.

The terms `acquire' and `release' are ubiquitous in the literature
today, and have been adopted in the C and C++ standards to mean
load-before-load/store and load/store-before-store, respectively,
which are exactly the orderings required by acquiring and releasing a
mutex, as well as other useful applications like decrementing a
reference count and then freeing the underlying object if it went to
zero.

Originally I proposed changing one word in the documentation for
membar_enter to make it load-before-load/store instead of
store-before-load/store, i.e., to make it an acquire barrier.  I
proposed this on the grounds that

(a) all implementations guarantee load-before-load/store,
(b) some implementations fail to guarantee store-before-load/store,
and
(c) all uses in-tree assume load-before-load/store.

I verified parts (a) and (b) (except, for (a), powerpc didn't even
guarantee load-before-load/store -- isync isn't necessarily enough;
need lwsync in general -- but it _almost_ did, and it certainly didn't
guarantee store-before-load/store).

Part (c) might not be correct, however: under the mistaken assumption
that atomic-r/m/w then membar-w/rw is equivalent to atomic-r/m/w then
membar-r/rw, I only audited the cases of membar_enter that _aren't_
immediately after an atomic-r/m/w.  All of those cases assume
load-before-load/store.  But my assumption was wrong -- there are
cases of atomic-r/m/w then membar-w/rw that would be broken by
changing to atomic-r/m/w then membar-r/rw:

https://mail-index.netbsd.org/tech-kern/2022/03/29/msg028044.html

Furthermore, the name membar_enter has been adopted in other places
like OpenBSD where it actually does follow the documentation and
guarantee store-before-load/store, even if that order is not useful.
So the name membar_enter currently lives in a bad place where it
means either of two things -- r/rw or w/rw.

With this change, we deprecate membar_enter/exit, introduce
membar_acquire/release as better names for the useful pair (r/rw and
rw/w), and make sure the implementation of membar_enter guarantees
both what was documented _and_ what was implemented, making it an
alias for membar_sync.

While here, rework all of the membar_* definitions and aliases.  The
new logic follows a rule to make it easier to audit:

	membar_X is defined as an alias for membar_Y iff membar_X is
	guaranteed by membar_Y.

The `no stronger than' relation is (the transitive closure of):

- membar_consumer (r/r) is guaranteed by membar_acquire (r/rw)
- membar_producer (w/w) is guaranteed by membar_release (rw/w)
- membar_acquire (r/rw) is guaranteed by membar_sync (rw/rw)
- membar_release (rw/w) is guaranteed by membar_sync (rw/rw)

And, for the deprecated membars:

- membar_enter (whether r/rw, w/rw, or rw/rw) is guaranteed by
  membar_sync (rw/rw)
- membar_exit (rw/w) is guaranteed by membar_release (rw/w)

(membar_exit is identical to membar_release, but the name is
deprecated.)

Finally, while here, annotate some of the instructions with their
semantics.  For powerpc, leave an essay with citations on the
unfortunate but -- as far as I can tell -- necessary decision to use
lwsync, not isync, for membar_acquire and membar_consumer.

Also add membar(3) and atomic(3) man page links.
2022-04-09 23:32:51 +00:00
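The aliasing rule in 4f8ce3b31d can be modelled as a subset test over the four ordering pairs. The sketch below is only an assumption-laden model for auditing: the ORD_* and MB_* names and may_alias() are hypothetical and do not appear in the tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Ordering pairs a barrier can enforce: earlier access / later access. */
enum {
	ORD_LL = 1 << 0,	/* load-before-load */
	ORD_LS = 1 << 1,	/* load-before-store */
	ORD_SL = 1 << 2,	/* store-before-load */
	ORD_SS = 1 << 3		/* store-before-store */
};

#define MB_CONSUMER	(ORD_LL)				/* r/r */
#define MB_PRODUCER	(ORD_SS)				/* w/w */
#define MB_ACQUIRE	(ORD_LL | ORD_LS)			/* r/rw */
#define MB_RELEASE	(ORD_LS | ORD_SS)			/* rw/w */
#define MB_SYNC		(ORD_LL | ORD_LS | ORD_SL | ORD_SS)	/* rw/rw */
#define MB_ENTER_DOC	(ORD_SL | ORD_SS)	/* membar_enter as documented: w/rw */

/* membar_X may alias membar_Y only if Y enforces every pair X requires. */
static bool
may_alias(unsigned x, unsigned y)
{
	return (x & ~y) == 0;
}
```

Under this model, may_alias(MB_ENTER_DOC, MB_ACQUIRE) is false while may_alias(MB_ENTER_DOC, MB_SYNC) is true, which is consistent with membar_enter becoming an alias for membar_sync.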
riastradh d808f015e1 riscv/membar_ops: Upgrade membar_enter from W/RW to RW/RW.
This will be deprecated soon but let's avoid leaving rakes to trip on
with it arising from disagreement over the documentation (W/RW) and
implementation and usage (R/RW).
2022-04-09 22:53:53 +00:00
riastradh 75d950a155 x86_64/membar_ops: Upgrade membar_enter from R/RW to RW/RW.
This will be deprecated soon but let's avoid leaving rakes to trip on
with it arising from disagreement over the documentation (W/RW) and
implementation and usage (R/RW).
2022-04-09 22:53:45 +00:00
riastradh a1f4bcbfda i386/membar_ops: Upgrade membar_enter from R/RW to RW/RW.
This will be deprecated soon but let's avoid leaving rakes to trip on
with it arising from disagreement over the documentation (W/RW) and
implementation and usage (R/RW).
2022-04-09 22:53:36 +00:00
riastradh 48b2cb5aa9 sparc64/membar_ops: Upgrade membar_enter from R/RW to RW/RW.
This will be deprecated soon but let's avoid leaving rakes to trip on
with it arising from disagreement over the documentation (W/RW) and
implementation and usage (R/RW).
2022-04-09 22:53:25 +00:00
riastradh ca73d72920 sparc/membar_ops: Upgrade membar_enter from R/RW to RW/RW.
This will be deprecated soon but let's avoid leaving rakes to trip on
with it arising from disagreement over the documentation (W/RW) and
implementation and usage (R/RW).
2022-04-09 22:53:17 +00:00
riastradh a8d0eed140 aarch64/membar_ops: Fix wrong symbol end. 2022-04-09 12:07:37 +00:00
riastradh d767c9730a x86: Add a note on membar_sync and mfence. 2022-04-09 12:07:29 +00:00
riastradh 3066bbbbf8 x86: Omit needless store in membar_producer/exit.
On x86, every store is a store-release, so there is no need for any
barrier.  But this wasn't a barrier anyway; it was just a store,
which was redundant with the store of the return address to the stack
implied by CALL even if issuing a store made a difference.
2022-04-09 12:07:17 +00:00
riastradh e0c914a79b x86: Every load is a load-acquire, so membar_consumer is a noop.
lfence is only needed for MD logic, such as operations on I/O memory
rather than normal cacheable memory, or special instructions like
RDTSC -- never for MI synchronization between threads/CPUs.  No need
for hot-patching to do lfence here.

(The x86_lfence function might reasonably be patched on i386 to do
lfence for MD logic, but it isn't now and this doesn't change that.)
2022-04-09 12:07:00 +00:00
riastradh ffe06880f0 sparc64: Fix membar_sync by issuing membar #StoreLoad.
In TSO this is the only memory barrier ever needed, and somehow we
got this wrong and instead issued an unnecessary membar #LoadLoad --
not needed even in PSO let alone in TSO.

XXX Apparently we may run userland programs with PSO or RMO, in which
case all of these membars need fixing:

                        PSO                     RMO
membar_consumer         nop                     membar #LoadLoad
membar_producer         membar #StoreStore      membar #StoreStore
membar_enter            nop                     membar #LoadLoad|LoadStore
membar_exit             membar #StoreStore      membar #LoadStore|StoreStore
membar_sync             membar #StoreLoad|StoreStore
                                                membar #...everything...

But at least this fixes the TSO case in which we run the kernel.
Also I'm not sure there's any non-TSO hardware out there in practice.
2022-04-09 12:06:47 +00:00
riastradh da06f841fd sparc: Fix membar_sync with LDSTUB.
membar_sync is required to be a full sequential consistency barrier,
equivalent to MEMBAR #StoreStore|LoadStore|StoreLoad|LoadLoad on
sparcv9.  LDSTUB and SWAP are the only pre-v9 instructions that do
this and SWAP doesn't exist on all v7 hardware, so use LDSTUB.

Note: I'm having a hard time nailing down a reference for the
ordering implied by LDSTUB and SWAP.  I'm _pretty sure_ SWAP has to
imply store-load ordering since the SPARCv8 manual recommends it for
Dekker's algorithm (which notoriously requires store-load ordering),
and the formal memory model treats LDSTUB and SWAP the same for
ordering.  But the v8 and v9 manuals aren't clear.

GCC issues STBAR and LDSTUB, but (a) I don't see why STBAR is
necessary here, (b) STBAR doesn't exist on v7 so it'd be a pain to
use, and (c) from what I've heard (although again it's hard to nail
down authoritative references here) all actual SPARC hardware is TSO
or SC anyway so STBAR is a noop in all the silicon anyway.

Either way, certainly this is better than what we had before, which
was nothing implying ordering at all, just a store!
2022-04-09 12:06:39 +00:00
riastradh 09ff5f3b48 Nix trailing whitespace in files of membars, atomics, and lock stubs.
Will be touching many of these files soon for functional changes.

No functional change intended.
2022-04-06 22:47:55 +00:00
wiz 0362f707fc zlib: Fix a bug that can crash deflate on some input when using Z_FIXED.
5c44459c3b

This bug was reported by Danilo Ramos of Eideticom, Inc. It has
lain in wait 13 years before being found! The bug was introduced
in zlib 1.2.2.2, with the addition of the Z_FIXED option. That
option forces the use of fixed Huffman codes. For rare inputs with
a large number of distant matches, the pending buffer into which
the compressed data is written can overwrite the distance symbol
table which it overlays. That results in corrupted output due to
invalid distances, and can result in out-of-bound accesses,
crashing the application.

The fix here combines the distance buffer and literal/length
buffers into a single symbol buffer. Now three bytes of pending
buffer space are opened up for each literal or length/distance
pair consumed, instead of the previous two bytes. This assures
that the pending buffer cannot overwrite the symbol table, since
the maximum fixed code compressed length/distance is 31 bits, and
since there are four bytes of pending space for every three bytes
of symbol space.
2022-03-24 10:13:01 +00:00
riastradh 05a5e24cff mips: Membar audit.
This change should be safe because it doesn't remove or weaken any
memory barriers, but does add, clarify, or strengthen barriers.

Goals:

- Make sure mutex_enter/exit and mutex_spin_enter/exit have
  acquire/release semantics.

- New macros make maintenance easier and purpose clearer:

  . SYNC_ACQ is for load-before-load/store barrier, and BDSYNC_ACQ
    for a branch delay slot -- currently defined as plain sync for MP
    and nothing, or nop, for UP; thus it is no weaker than SYNC and
    BDSYNC as currently defined, which is syncw on Octeon, plain sync
    on non-Octeon MP, and nothing/nop on UP.

    It is not clear to me whether load-then-syncw or ll/sc-then-syncw
    or even bare load provides load-acquire semantics on Octeon -- if
    no, this will fix bugs; if yes (like it is on SPARC PSO), we can
    relax SYNC_ACQ to be syncw or nothing later.

  . SYNC_REL is for load/store-before-store barrier -- currently
    defined as plain sync for MP and nothing for UP.

    It is not clear to me whether syncw-then-store is enough for
    store-release on Octeon -- if no, we can leave this as is; if
    yes, we can relax SYNC_REL to be syncw on Octeon.

  . SYNC_PLUNGER is there to flush clogged Cavium store buffers, and
    BDSYNC_PLUNGER for a branch delay slot -- syncw on Octeon,
    nothing or nop on non-Octeon.

    => This is not necessary (or, as far as I'm aware, sufficient)
       for acquire semantics -- it serves only to flush store buffers
       where stores might otherwise linger for hundreds of thousands
       of cycles, which would, e.g., cause spin locks to be held for
       unreasonably long durations.

  Newerish revisions of the MIPS ISA also have finer-grained sync
  variants that could be plopped in here.

Mechanism:

Insert these barriers in the right places, replacing only those where
the definition is currently equivalent, so this change is safe.

- Replace #ifdef _MIPS_ARCH_OCTEONP / syncw / #endif at the end of
  atomic_cas_* by SYNC_PLUNGER, which is `sync 4' (a.k.a. syncw) if
  __OCTEON__ and empty otherwise.

  => From what I can tell, __OCTEON__ is defined in at least as many
     contexts as _MIPS_ARCH_OCTEONP -- i.e., there are some Octeons
     with no _MIPS_ARCH_OCTEONP, but I don't know if any of them are
     relevant to us or ever saw the light of day outside Cavium; we
     seem to build with `-march=octeonp' so this is unlikely to make a
     difference.  If it turns out that we do care, well, now there's
     a central place to make the distinction for sync instructions.

- Replace post-ll/sc SYNC by SYNC_ACQ in _atomic_cas_*, which are
  internal kernel versions used in sys/arch/mips/include/lock.h where
  it assumes they have load-acquire semantics.  Should move this to
  lock.h later, since we _don't_ define __HAVE_ATOMIC_AS_MEMBAR on
  MIPS and so the extra barrier might be costly.

- Insert SYNC_REL before ll/sc, and replace post-ll/sc SYNC by
  SYNC_ACQ, in _ucas_*, which is used without any barriers in futex
  code and doesn't mention barriers in the man page so I have to
  assume it is required to be a release/acquire barrier.

- Change BDSYNC to BDSYNC_ACQ in mutex_enter and mutex_spin_enter.
  This is necessary to provide load-acquire semantics -- unclear if
  it was provided already by syncw on Octeon, but it seems more
  likely that either (a) no sync or syncw is needed at all, or (b)
  syncw is not enough and sync is needed, since syncw is only a
  store-before-store ordering barrier.

- Insert SYNC_REL before ll/sc in mutex_exit and mutex_spin_exit.
  This is currently redundant with the SYNC already there, but
  SYNC_REL more clearly identifies the necessary semantics in case we
  want to define it differently on different systems, and having a
  sync in the middle of an ll/sc is a bit weird and possibly not a
  good idea, so I intend to (carefully) remove the redundant SYNC in
  a later change.

- Change BDSYNC to BDSYNC_PLUNGER at the end of mutex_exit.  This has
  no semantic change right now -- it's syncw on Octeon, sync on
  non-Octeon MP, nop on UP -- but we can relax it later to nop on
  non-Cavium MP.

- Leave LLSCSYNC in for now -- it is apparently there for a Cavium
  erratum, but I'm not sure what the erratum is, exactly, and I have
  no reference for it.  I suspect these can be safely removed, but we
  might have to double up some other syncw instructions -- Linux uses
  it only in store-release sequences, not at the head of every ll/sc.
2022-02-27 19:21:53 +00:00
riastradh e35e7b15e2 mips: Brush up __cpu_simple_lock.
- Eradicate last vestiges of mb_* barriers.

- In __cpu_simple_lock_init, omit needless barrier.  It is the
  caller's responsibility to ensure __cpu_simple_lock_init happens
  before other operations on it anyway, so there was never any need
  for a barrier here.

- In __cpu_simple_lock_try, leave comments about memory ordering
  guarantees of the kernel's _atomic_cas_uint, which are inexplicably
  different from the non-underscored atomic_cas_uint.

- In __cpu_simple_unlock, use membar_exit instead of mb_memory, and do
  it unconditionally.

  This ensures that in __cpu_simple_lock/.../__cpu_simple_unlock, all
  memory operations in the ellipsis happen before the store that
  releases the lock.

  - On Octeon, the barrier was omitted altogether, which is a bug --
    it needs to be there or else there is no happens-before relation
    and whoever takes the lock next might see stale values stored or
    even stomp over the unlocking CPU's delayed loads.

  - On non-Octeon, the mb_memory was sync.  Using membar_exit
    preserves this.

  XXX On Octeon, membar_exit only issues syncw -- this seems wrong,
  only store-before-store and not load/store-before-store, unless the
  CNMIPS architecture guarantees it is sufficient here like
  SPARCv8/v9 PSO (`Partial Store Order').

- Leave an essay with citations about why we have an apparently
  pointless syncw _after_ releasing a lock, to work around a design
  bug^W^Wquirk in cnmips which sometimes buffers stores for hundreds
  of thousands of cycles for fun unless you issue syncw.
2022-02-12 17:10:02 +00:00
andvar 5ceb9d96fa fix typos in comments. 2022-01-15 10:38:56 +00:00
andvar 1cb7819f04 fix various typos in comments. 2021-12-12 22:20:52 +00:00
andvar 42412bc75c s/efficent/efficient/ in comments. 2021-12-08 20:11:54 +00:00
msaitoh 7a2933d5cb s/asychronous/asynchronous/ in comment. 2021-12-05 04:24:08 +00:00
msaitoh 7c496db356 s/absense/absence/ in comment. 2021-12-05 03:24:19 +00:00
msaitoh 344f0d1e04 s/exisit/exist/ in comment. 2021-12-05 02:52:17 +00:00
andvar a27a533e2d fix a few typos in comments and log message. 2021-11-14 20:51:57 +00:00
christos 00f17ebc18 Use defined constant instead of direct value (Etienne Brateau) 2021-10-28 15:09:08 +00:00
christos b0d97acfad Fix build with -Werror=array-parameter (Etienne Brateau) 2021-10-28 15:08:05 +00:00
andvar 50d9072672 remove duplicate "the" article in comments. 2021-10-04 21:02:39 +00:00
andvar a136e22ab6 fix various typos in comments, messages and documentation. 2021-09-19 10:34:06 +00:00
andvar 72e44f84cb fix typos in word "successfully", mainly s/succesfully/successfully/. 2021-09-16 21:29:41 +00:00
andvar 4ddb87935b s/aquire/acquire/ in comments, also one typo fix acqure->acquire. 2021-09-07 13:24:45 +00:00
christos 8f97cb72d8 remove lint exclusion 2021-08-30 12:52:32 +00:00
ryo 567a3a02e7 Improve the performance of kernel profiling on MULTIPROCESSOR, and make it possible to get profiling data for each CPU.
In the current implementation, locks are acquired at the entrance of the mcount
internal function, so the more cores there are, the more lock contention
occurs, making profiling in a MULTIPROCESSOR environment unusably
slow. Profiling buffers are now reserved for each CPU,
improving profiling performance on MP by several to several dozen times.

- Eliminated cpu_simple_lock in mcount internal function, using per-CPU buffers.
- Add ci_gmon member to struct cpu_info of each MP arch.
- Add kern.profiling.percpu node in sysctl tree.
- Add new -c <cpuid> option to kgmon(8) to specify the cpuid, like OpenBSD.
  For compatibility, if the -c option is not specified, the entire system can be
  operated as before, and the -p option will get the total profiling data for
  all CPUs.
2021-08-14 17:51:18 +00:00
ryo 1979ff4ae2 don't include "opt_multiprocessor.h" inside an ifdef, so that "make depend" works properly. 2021-08-14 17:38:44 +00:00
andvar ebbc7028d3 fix typos in words "pointer" and s/fram /frame/ 2021-08-13 20:47:54 +00:00
skrll 1306a159ff Whitespace 2021-08-08 07:17:18 +00:00
andvar 077d1c0f36 fix various typos in comments and log messages. 2021-08-02 12:56:22 +00:00
andvar 5298fab779 s/overwriten/overwritten/ in comments. 2021-08-01 21:58:56 +00:00
andvar 31f72197e0 fix more typos in style found one in file - check/fix them all. 2021-07-31 14:36:33 +00:00
skrll 65d55bcee1 As we're providing the legacy gcc __sync built-in functions for atomic
memory access we might as well get the memory barriers right...
From the gcc documentation:

In most cases, these built-in functions are considered a full barrier.
That is, no memory operand is moved across the operation, either forward
or backward. Further, instructions are issued as necessary to prevent the
processor from speculating loads across the operation and from queuing
stores after the operation.

type __sync_lock_test_and_set (type *ptr, type value, ...)

   This built-in function is not a full barrier, but rather an acquire
   barrier. This means that references after the operation cannot move to
   (or be speculated to) before the operation, but previous memory stores
   may not be globally visible yet, and previous memory loads may not yet
   be satisfied.

void __sync_lock_release (type *ptr, ...)

   This built-in function is not a full barrier, but rather a release
   barrier. This means that all previous memory stores are globally
   visible, and all previous memory loads have been satisfied, but
   following memory reads are not prevented from being speculated to
   before the barrier.
2021-07-29 10:29:05 +00:00
simonb 3fc2996b41 #define<tab> consistency. 2021-07-28 08:01:10 +00:00
skrll 8e8c0784cf Remove memory barriers from the atomic_ops(3) atomic operations. They're
not needed for correctness.

Add the correct memory barriers to the gcc legacy __sync built-in
functions for atomic memory access.  From the gcc documentation:

In most cases, these built-in functions are considered a full barrier.
That is, no memory operand is moved across the operation, either forward
or backward. Further, instructions are issued as necessary to prevent the
processor from speculating loads across the operation and from queuing
stores after the operation.

type __sync_lock_test_and_set (type *ptr, type value, ...)

   This built-in function is not a full barrier, but rather an acquire
   barrier. This means that references after the operation cannot move to
   (or be speculated to) before the operation, but previous memory stores
   may not be globally visible yet, and previous memory loads may not yet
   be satisfied.

void __sync_lock_release (type *ptr, ...)

   This built-in function is not a full barrier, but rather a release
   barrier. This means that all previous memory stores are globally
   visible, and all previous memory loads have been satisfied, but
   following memory reads are not prevented from being speculated to
   before the barrier.
2021-07-28 07:32:20 +00:00
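The acquire/release pair quoted from the gcc documentation is what a minimal spin lock needs. The following is a sketch under that assumption, using the real __sync_lock_test_and_set and __sync_lock_release built-ins; the sketch_lock_t type and the sketch_* function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical minimal spin lock: 0 = unlocked, 1 = locked. */
typedef volatile int sketch_lock_t;

/* Try once; returns 1 if the lock was taken.  Acquire barrier only. */
static int
sketch_trylock(sketch_lock_t *l)
{
	return __sync_lock_test_and_set(l, 1) == 0;
}

/* Spin until the lock is taken. */
static void
sketch_lock(sketch_lock_t *l)
{
	while (!sketch_trylock(l))
		continue;
}

/* Store 0 to the lock word with release semantics. */
static void
sketch_unlock(sketch_lock_t *l)
{
	__sync_lock_release(l);
}
```

Because neither built-in is a full barrier, code that needs store-before-load ordering across the lock boundary would still need an explicit full barrier such as __sync_synchronize().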
andvar 7991f5a7b8 Fix all remaining typos, mainly in comments but also in a few definitions and log messages, reported by me in PR kern/54889.
Also fix some additional typos in comments, found while reviewing the same files or typos.
2021-07-24 21:31:31 +00:00
skrll 6a2d1b5533 #include <sys/param.h> 2021-07-22 13:54:38 +00:00
skrll 5e911a385d s/ifdef _ARM_ARCH_6/if defined(_ARM_ARCH_6)/ for consistency. NFCI. 2021-07-10 06:53:40 +00:00
skrll 52728926ba One more s/pte/ptr/ 2021-07-06 08:31:41 +00:00
skrll 6788795c38 typo in comment s/pte/ptr/ 2021-07-05 08:50:31 +00:00
skrll 68a49f39f0 Fix the logic operation for atomic_nand_{8,16,32,64}
From the gcc docs, the operations are as follows:

 { tmp = *ptr; *ptr = ~(tmp & value); return tmp; }   // nand (fetch_and_nand)
 { tmp = ~(*ptr & value); *ptr = tmp; return *ptr; }   // nand (nand_and_fetch)

yes, this is really rather strange.
2021-07-04 06:55:47 +00:00
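The two nand forms quoted in 68a49f39f0 differ only in which value is returned. A small non-atomic reference model makes this easy to check against the compiler's __atomic_fetch_nand and __atomic_nand_fetch built-ins. The model_* and nand_builtins_match_model names are hypothetical, and the model is for illustration only, not the tree's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Non-atomic model of fetch-and-nand: returns the OLD value. */
static uint32_t
model_fetch_and_nand(uint32_t *ptr, uint32_t value)
{
	uint32_t tmp = *ptr;

	*ptr = ~(tmp & value);
	return tmp;
}

/* Non-atomic model of nand-and-fetch: returns the NEW value. */
static uint32_t
model_nand_and_fetch(uint32_t *ptr, uint32_t value)
{
	uint32_t tmp = ~(*ptr & value);

	*ptr = tmp;
	return tmp;
}

/* Returns 1 if the GCC/Clang built-ins agree with the models above. */
static int
nand_builtins_match_model(uint32_t initial, uint32_t value)
{
	uint32_t a = initial, b = initial, c = initial, d = initial;
	uint32_t old_model = model_fetch_and_nand(&a, value);
	uint32_t old_builtin = __atomic_fetch_nand(&b, value, __ATOMIC_SEQ_CST);
	uint32_t new_model = model_nand_and_fetch(&c, value);
	uint32_t new_builtin = __atomic_nand_fetch(&d, value, __ATOMIC_SEQ_CST);

	return old_model == old_builtin && a == b &&
	    new_model == new_builtin && c == d;
}
```

Note that the result is ~(old & value), not (~old) & value; the latter is what gcc implemented before version 4.4, which is presumably why the log message calls the semantics strange.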