These are needed for BUS_SPACE_MAP_PREFETCHABLE mappings. On x86,
these are WC-type memory regions, which means -- unlike normal
WB-type memory regions -- loads can be reordered with loads,
requiring LFENCE, and stores can be reordered with stores, requiring
SFENCE.
Reference: AMD64 Architecture Programmer's Manual, Volume 2: System
Programming, Sec. 7.4.1 `Memory Barrier Interaction with Memory
Types', Table 7-3 `Memory Access Ordering Rules'.
- A driver that expects prefetchable memory and knows to issue the
needed bus_space_barrier calls can pass BUS_SPACE_MAP_PREFETCHABLE
to indicate a desire to map the memory prefetchable if the BAR
allows it.
(A driver that _really wants_ BUS_SPACE_MAP_PREFETCHABLE even if
the BAR claims _not_ to be prefetchable can use pci_mapreg_info and
bus_space_map explicitly -- this is not different from what we have
today.)
- For a driver that _does not_ expect prefetchable memory, the
appearance of the prefetchable bit in the BAR shouldn't cause it to
use BUS_SPACE_MAP_PREFETCHABLE, because the driver will not issue
the needed bus_space_barrier calls to get sensible results.
Note: `Prefetchable' here, sometimes called `write-combining', means
reads have no side effects, and writes are idempotent, so it is safe
to issue reads out of order and safe to combine writes.
Mappings with BUS_SPACE_MAP_PREFETCHABLE are often more weakly
ordered than normal memory -- e.g., on x86, in WC-type memory
regions, loads can be reordered with loads and stores can be reordered
with stores, which is not possible with any other type of memory
region.
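For illustration, here is a minimal, hypothetical sketch of a driver
that opts in to a prefetchable mapping and then issues the
bus_space_barrier calls it owes; the driver name, softc fields, and
register offset are invented, and error handling is trimmed:

#include <sys/bus.h>
#include <dev/pci/pcireg.h>
#include <dev/pci/pcivar.h>

#define XY_REG_DOORBELL	0x1000		/* invented register offset */

struct xy_softc {
	bus_space_tag_t		sc_bst;
	bus_space_handle_t	sc_bsh;
	bus_size_t		sc_size;
};

/* Map BAR 2, asking for a prefetchable (WC) mapping if the BAR allows it. */
static int
xy_map(struct xy_softc *sc, const struct pci_attach_args *pa)
{

	return pci_mapreg_map(pa, PCI_BAR(2), PCI_MAPREG_TYPE_MEM,
	    BUS_SPACE_MAP_PREFETCHABLE,
	    &sc->sc_bst, &sc->sc_bsh, NULL, &sc->sc_size);
}

/*
 * Stores to a WC mapping may be combined and reordered with one
 * another, so flush them with a write barrier before the store that
 * tells the device to look at the buffer.
 */
static void
xy_post(struct xy_softc *sc, const uint32_t *buf, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		bus_space_write_4(sc->sc_bst, sc->sc_bsh, 4 * i, buf[i]);

	bus_space_barrier(sc->sc_bst, sc->sc_bsh, 0, 4 * n,
	    BUS_SPACE_BARRIER_WRITE);
	bus_space_write_4(sc->sc_bst, sc->sc_bsh, XY_REG_DOORBELL, 1);
}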
Discussed on tech-kern a while ago:
https://mail-index.NetBSD.org/tech-kern/2017/03/22/msg021678.html
This is option A, which received the most support. It should help
unconfuse drivers that do not expect prefetchable mappings, which is
what Yamaguchi-san tripped over recently:
https://mail-index.NetBSD.org/tech-kern/2019/12/02/msg025785.html
While here, fix a bug that was formerly in xcall(9): the xc_wait fast
path was missing an acquire operation, which is needed so that all
memory operations in the xcall on remote CPUs happen before any memory
operations on the issuing CPU after xc_wait returns.
All stores of xc->xc_donep are done with atomic_store_release so that
we can safely use atomic_load_acquire to read it outside the lock.
However, this fast path only works on platforms with cheap 64-bit
atomic load/store, so conditionalize it on __HAVE_ATOMIC64_LOADSTORE.
(Under the lock, no need for atomic loads since nobody else will be
issuing stores.)
For review, here's the relevant diff against the old version of the
fast path, from before it was removed and other things in the file
changed:
diff --git a/sys/kern/subr_xcall.c b/sys/kern/subr_xcall.c
index 45a877aa90e0..b6bfb6455291 100644
--- a/sys/kern/subr_xcall.c
+++ b/sys/kern/subr_xcall.c
@@ -84,6 +84,7 @@ __KERNEL_RCSID(0, "$NetBSD: subr_xcall.c,v 1.27 2019/10/06 15:11:17 uwe Exp $");
#include <sys/evcnt.h>
#include <sys/kthread.h>
#include <sys/cpu.h>
+#include <sys/atomic.h>
#ifdef _RUMPKERNEL
#include "rump_private.h"
@@ -334,10 +353,12 @@ xc_wait(uint64_t where)
xc = &xc_low_pri;
}
+#ifdef __HAVE_ATOMIC64_LOADSTORE
/* Fast path, if already done. */
- if (xc->xc_donep >= where) {
+ if (atomic_load_acquire(&xc->xc_donep) >= where) {
return;
}
+#endif
/* Slow path: block until awoken. */
mutex_enter(&xc->xc_lock);
@@ -422,7 +443,11 @@ xc_thread(void *cookie)
(*func)(arg1, arg2);
mutex_enter(&xc->xc_lock);
+#ifdef __HAVE_ATOMIC64_LOADSTORE
+ atomic_store_release(&xc->xc_donep, xc->xc_donep + 1);
+#else
xc->xc_donep++;
+#endif
}
/* NOTREACHED */
}
@@ -462,7 +487,6 @@ xc__highpri_intr(void *dummy)
* Lock-less fetch of function and its arguments.
* Safe since it cannot change at this point.
*/
- KASSERT(xc->xc_donep < xc->xc_headp);
func = xc->xc_func;
arg1 = xc->xc_arg1;
arg2 = xc->xc_arg2;
@@ -475,7 +499,13 @@ xc__highpri_intr(void *dummy)
* cross-call has been processed - notify waiters, if any.
*/
mutex_enter(&xc->xc_lock);
- if (++xc->xc_donep == xc->xc_headp) {
+ KASSERT(xc->xc_donep < xc->xc_headp);
+#ifdef __HAVE_ATOMIC64_LOADSTORE
+ atomic_store_release(&xc->xc_donep, xc->xc_donep + 1);
+#else
+ xc->xc_donep++;
+#endif
+ if (xc->xc_donep == xc->xc_headp) {
cv_broadcast(&xc->xc_busy);
}
mutex_exit(&xc->xc_lock);
This was a very nice win in my tests on a 48 CPU box.
- Reorganise cpu_data slightly according to usage.
- Put cpu_onproc into struct cpu_info alongside ci_curlwp (now ci_onproc).
- On x86, put some items in their own cache lines according to usage, like
the IPI bitmask and ci_want_resched.
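As a rough illustration of the layout idea (this is not the actual
struct cpu_info; the ci_ipimask field and the example struct are
invented), grouping fields by who writes them keeps remote IPI posts
and local reschedule flags from false-sharing with the hot scheduler
fields:

#include <sys/param.h>		/* for __cacheline_aligned */

struct lwp;

struct example_cpu_info {
	/* Hot, mostly locally used scheduler state, kept together. */
	struct lwp	*ci_curlwp;	/* current lwp */
	struct lwp	*ci_onproc;	/* replaces the old cpu_onproc */

	/*
	 * Written by other CPUs when posting IPIs: give it its own
	 * cache line so remote stores don't keep invalidating the
	 * line that holds ci_curlwp.
	 */
	volatile uint32_t ci_ipimask __cacheline_aligned;

	/* Written frequently on wakeups; likewise isolated. */
	int		ci_want_resched __cacheline_aligned;
};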
Changes:
- membar_producer();
*p = v;
=>
atomic_store_release(p, v);
(Effectively this is like using membar_exit instead of
membar_producer, which is what we should have been doing all along:
it ensures that stores by the `reader' can't affect earlier loads by
the writer, e.g. KASSERT(p->refcnt == 0) in the writer and
atomic_inc(&p->refcnt) in the reader.)
- p = *pp;
if (p != NULL) membar_datadep_consumer();
=>
p = atomic_load_consume(pp);
(Only makes a difference on DEC Alpha. As long as lists generally
have at least one element, issuing the barrier unconditionally is not
likely to make a measurable difference, and it keeps the code simpler
and clearer; see the sketch below.)
No other functional change intended. While here, annotate each
synchronizing load and store with its counterpart in a comment.
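To make the two patterns above concrete, here is a hypothetical
combined sketch (struct node, headp, and the function names are
invented): a writer publishes a node onto a singly linked list and a
reader walks the list and takes a reference:

#include <sys/atomic.h>
#include <sys/systm.h>

struct node {
	struct node	*n_next;
	unsigned int	n_refcnt;
	int		n_key;
};

static struct node *headp;

void
node_publish(struct node *n)
{

	KASSERT(n->n_refcnt == 0);	/* load, before the publish */
	n->n_next = headp;

	/*
	 * membar_producer() would order the two stores, but would let
	 * the KASSERT's load sink below the publishing store; the
	 * release keeps that load above it, so a reader's increment
	 * cannot be observed by the KASSERT.
	 */
	atomic_store_release(&headp, n);
}

struct node *
node_lookup(int key)
{
	struct node *n;

	/*
	 * Each consume load replaces `n = ...; if (n != NULL)
	 * membar_datadep_consumer();'.  The barrier (Alpha-only) is
	 * now issued unconditionally, which costs nothing measurable
	 * on non-empty lists and keeps the loop simple.
	 */
	for (n = atomic_load_consume(&headp); n != NULL;
	     n = atomic_load_consume(&n->n_next)) {
		if (n->n_key == key) {
			atomic_inc_uint(&n->n_refcnt);	/* store by reader */
			return n;
		}
	}
	return NULL;
}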
Don't call kpreempt_disable() / kpreempt_enable() to make sure we're not
preempted while using the value of curcpu(). Instead, observe the value of
l_ncsw before and after the check to see if we have been preempted. If
we have been preempted, then we need to retry the read.
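For reference, a sketch of that pattern (the per-CPU field
ci_example_count and the function name are invented for illustration):

#include <sys/param.h>
#include <sys/lwp.h>
#include <sys/cpu.h>

uint64_t
example_read_percpu(void)
{
	struct lwp *l = curlwp;
	struct cpu_info *ci;
	uint64_t val;
	int ncsw;

	do {
		ncsw = l->l_ncsw;	/* note the context-switch count... */
		ci = curcpu();
		val = ci->ci_example_count;
		/*
		 * ...and re-check it: if it changed, we may have been
		 * preempted and migrated off ci between curcpu() and
		 * the read, so retry.
		 */
	} while (__predict_false(l->l_ncsw != ncsw));

	return val;
}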