- Override __libc_thr_init() instead of using our own constructor.
- Add pthread__getenv() and use instead of getenv(). This is used before
we are up and running and unfortunatley getenv() takes locks.
Other changes:
- Cache the spinlock vectors in pthread__st. Internal spinlock operations
now take 1 function call instead of 3 (i386).
- Use pthread__self() internally, not pthread_self().
- Use __attribute__ ((visibility("hidden"))) in some places.
- Kill PTHREAD_MAIN_DEBUG.
preparing to sleep on a mutex are slow in relative terms, so this allows
us to recover from short lock holds without blocking, while not wasting
too much time on longer holds.
architecture to provide asm versions of the RAS operations.
We do this because relying on the compiler to get the RAS right is not
sensible. (It gets alpha wrong and hppa is suboptimal)
Provide asm RAS ops for hppa.
(A slightly different version) reviewed by Andrew Doran.
Instead, make the deferred wakeup list a per-thread array and pass down
the lwpid_t's that way.
- In pthread_cond_wait(), take the mutex before dealing with early wakeup.
In this way there should never be contention on the CV's spinlock if
the app follows POSIX rules (there should only be contention on the
user-provided mutex).
- Add a port of the kernel's rwlocks. The rwlock's spinlock is only taken if
there is contention. This is enabled where atomic ops are available. Right
now that is only i386 and amd64 because I don't have other hardware to
test with. It's trivial to add stubs for other architectures as long as
they have compare-and-swap. When we have proper atomic ops the old rwlock
code can be removed.
- Add a new mutex implementation that's similar to the kernel's mutexes, but
uses compare-and-swap to maintain the waiters list, so no spinlocks are
involved. Same caveats apply as for the rwlocks.
yielding. This is a nasty band-aid but with many threads, looping over
sched_yield() wastes a huge amount of CPU time. It would be nice to have a
way to temporarily disable preemption, but it turns out that's yet another
no-brain concept that has been patented and the patent holder seems to be
suing people lately. Another alternative is probably to have kernel-assisted
spinlocks.
Chops another ~10% off create/join in a loop on i386.
- Disable low level debugging as this is stable. Improves benchmarks
across the board by a small percentage. Uncontested mutex acquire
and release in a loop becomes about 8% quicker.
- Minor cleanup.
detach/join.
- Make mutex acquire spin for a short time, as done with spinlocks.
- Make the number of spins controllable with the env var PTHREAD_NSPINS.
- Reduce the amount of time that libpthread internal spinlocks are held.
- Rely more on the barrier effects of park/unpark to avoid taking spinlocks.
- Simplify the locking around pthreads and the global queues.
- Align per-thread sync data on a 128 byte boundary.
- Offset thread stacks by a small amount to try and reduce cache thrash.
so avoid making them.
- When parking an LWP on a condition variable, point the hint argument at
the mutex's waiters queue. Chances are we will be awoken from that later.
- enable concurrency according to environment variable PTHREAD_CONCURRENCY
- add idle VP wakeup if there are additional jobs and idle VPs
- make reidlequeue per VP
- enable spinning for locks
- fix race condition in alarm processing
- fix race condition in mutex locking
- make debugging output line buffered and add VP prefix to debug lines
explicitly declared global in the asm() statements for the benefit of
SH5 binutils. Otherwise, the assembler/linker (I haven't figured out
which) botches the SHmedia bit when generating GOT references for
these symbols in the shared version of the library.
Ok'd by Nathan.