In the current implementation, locks are acquired at the entrance of the mcount
internal function, so the higher the number of cores, the more lock conflict
occurs, making profiling performance in a MULTIPROCESSOR environment unusable
and slow. Profiling buffers has been changed to be reserved for each CPU,
improving profiling performance in MP by several to several dozen times.
- Eliminated cpu_simple_lock in mcount internal function, using per-CPU buffers.
- Add ci_gmon member to struct cpu_info of each MP arch.
- Add kern.profiling.percpu node in sysctl tree.
- Add new -c <cpuid> option to kgmon(8) to specify the cpuid, like openbsd.
For compatibility, if the -c option is not specified, the entire system can be
operated as before, and the -p option will get the total profiling data for
all CPUs.
memory access we might as well get the memory barriers right...
From the gcc documentation:
In most cases, these built-in functions are considered a full barrier.
That is, no memory operand is moved across the operation, either forward
or backward. Further, instructions are issued as necessary to prevent the
processor from speculating loads across the operation and from queuing
stores after the operation.
type __sync_lock_test_and_set (type *ptr, type value, ...)
This built-in function is not a full barrier, but rather an acquire
barrier. This means that references after the operation cannot move to
(or be speculated to) before the operation, but previous memory stores
may not be globally visible yet, and previous memory loads may not yet
be satisfied.
void __sync_lock_release (type *ptr, ...)
This built-in function is not a full barrier, but rather a release
barrier. This means that all previous memory stores are globally
visible, and all previous memory loads have been satisfied, but
following memory reads are not prevented from being speculated to
before the barrier.
not needed for correctness.
Add the correct memory barriers to the gcc legacy __sync built-in
functions for atomic memory access. From the gcc documentation:
In most cases, these built-in functions are considered a full barrier.
That is, no memory operand is moved across the operation, either forward
or backward. Further, instructions are issued as necessary to prevent the
processor from speculating loads across the operation and from queuing
stores after the operation.
type __sync_lock_test_and_set (type *ptr, type value, ...)
This built-in function is not a full barrier, but rather an acquire
barrier. This means that references after the operation cannot move to
(or be speculated to) before the operation, but previous memory stores
may not be globally visible yet, and previous memory loads may not yet
be satisfied.
void __sync_lock_release (type *ptr, ...)
This built-in function is not a full barrier, but rather a release
barrier. This means that all previous memory stores are globally
visible, and all previous memory loads have been satisfied, but
following memory reads are not prevented from being speculated to
before the barrier.
indeed quicker, it nonetheless failed to actually fill all the requested
memory with the specified value much of the time if a non-aligned start
address was used.