initialisation order is correct in this case as _lwp_setprivate has been
called already by ld.elf_so for dynamic programs or _libc_init for
statically linked ones.
_rtld_tls_allocate and _rtld_tls_free. libpthread uses these functions to
set up the thread private area of all new threads. ld.elf_so is
responsible for setting up the private area for the initial thread.
Similar functions are called from _libc_init for static binaries, using
dl_iterate_phdr to access the ELF Program Header.
Add test cases to exercise the different TLS storage models. Test cases
are compiled and installed on all platforms, but are skipped on
platforms not marked for TLS support.
This material is based upon work partially supported by
The NetBSD Foundation under a contract with Joerg Sonnenberger.
It is inspired by the TLS support in FreeBSD by Doug Rabson and the
clean ups of the DragonFly port of the original FreeBSD modifications.
libgcc_s's __register_frame_info gets called from libc's CSU code before
the libc constructors are run. __register_frame_info in turn calls
pthread_mutex_lock. libpthread is not initialised at this point and
therefore pthread__self() traps when dereferencing the thread register.
This worked before because the garbage from pthread__self() is
effectively ignored.
on all platforms except VAX and IA64. Add fast access via register for
AMD64, i386 and SH3 ports. Use this fast access in libpthread to replace
the stack based pthread_self(). Implement skeleton support for Alpha,
HPPA, PowerPC, SPARC and SPARC64, but leave it disabled.
Ports that support this feature provide __HAVE____LWP_GETPRIVATE_FAST in
machine/types.h and a corresponding __lwp_getprivate_fast in
machine/mcontext.h.
This material is based upon work partially supported by
The NetBSD Foundation under a contract with Joerg Sonnenberger.
the situation, I decided to commit it. There is an inherent problem
with ASLR and the way the pthread library is using the thread stack.
Our pthread library chooses the stack for each thread strategically
so that it can find the location of the pthread struct for each
thread by masking the stack pointer and looking just below the red
zone it creates. Unfortunately with ASLR you get many random values
for the initial stack, and there are situations where the masked
stack base ends up below the base of the stack. (This happens on
x86 when the stack base happens to be 0x???02000, for example, and
your stackmask is 0xffe00000.) To fix this, we detect the
pathological cases (this happens only in the main thread), allocate
more stack, and mprotect it appropriately. Then we stash the main
base and the main struct, so that when we look for the pthread
struct in pthread__id, we can special case the main thread.
Another way to work around the problem is unlimiting stacksize,
but the proper way is to use TLS to find the thread structure and
not to play games with the thread stacks.
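The masking arithmetic from the example above can be checked with a
small sketch. This is not the libpthread code: the helper name, the
16-byte offset of the stack pointer below the base, and the concrete
high address digits are assumptions chosen for illustration, since the
text only gives the pattern 0x???02000 and the mask 0xffe00000.

```c
#include <stdint.h>

/*
 * Hypothetical helper: given the initial stack base (highest address of
 * a downward-growing stack) and the stack mask, return how many bytes
 * lie between the masked stack pointer and the stack base.
 * The stack pointer is assumed to sit 16 bytes below the base.
 */
static uint32_t demo_room_below_base(uint32_t base, uint32_t stackmask)
{
    uint32_t sp = base - 16;            /* assumed initial stack pointer */
    uint32_t masked = sp & stackmask;   /* where the mask trick lands */

    return base - masked;
}
```

With an illustrative base of 0xbfc02000 and the mask 0xffe00000, the
masked address is 0xbfc00000, only 0x2000 bytes below the base. With a
base that happens to be 2MB aligned, such as 0xbfe00000, the masked
address is a full 0x200000 bytes below it.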
- Rely on _lwp_makecontext() to set up the thread identity register.
This is not currently done (a bug), nor does libpthread use the
threadreg yet. I'm doing this so the code can be used by the
person working on TLS to verify that their threadreg code is working.
SCHED_FIFO threads
- Change condvar sync so that we never take the condvar's spinlock without
first holding the caller-provided mutex. Previously, the spinlock was only
taken without the mutex in an error path, but it was enough to trigger the
problem described in the PR.
- Even with this change, applications calling pthread_cond_signal/broadcast
without holding the interlocking mutex are still subject to the problem
described in the PR. POSIX discourages this, saying that it leads to
undefined scheduling behaviour, which seems good enough for the time being.
- Elsewhere, use a hash of mutexes instead of per-object spinlocks to
synchronize entry/exit from sleep queues.
- Simplify how sleep queues are maintained.
- Add new functions: pthread_mutex_held_np, mutex_owner_np, rwlock_held_np,
rwlock_wrheld_np, rwlock_rdheld_np. These match the kernel's locking
primitives and can be used when porting kernel code to userspace.
- Always create LWPs detached. Do join/exit sync mostly in userland. When
looped on a dual core box this seems ~30% quicker than using lwp_wait().
Reduce number of lock acquire/release ops during thread exit.
- Play scrooge again and chop more cycles off acquire/release.
- Spin while the lock holder is running on another CPU (adaptive mutexes).
- Do non-atomic release.
Threadreg:
- Add the necessary hooks to use a thread register.
- Add the code for i386, using %gs.
- Leave i386 code disabled until xen and COMPAT_NETBSD32 have the changes.
- Override __libc_thr_init() instead of using our own constructor.
- Add pthread__getenv() and use instead of getenv(). This is used before
we are up and running and unfortunately getenv() takes locks.
Other changes:
- Cache the spinlock vectors in pthread__st. Internal spinlock operations
now take 1 function call instead of 3 (i386).
- Use pthread__self() internally, not pthread_self().
- Use __attribute__ ((visibility("hidden"))) in some places.
- Kill PTHREAD_MAIN_DEBUG.
architecture to provide asm versions of the RAS operations.
We do this because relying on the compiler to get the RAS right is not
sensible. (It gets alpha wrong and hppa is suboptimal.)
Provide asm RAS ops for hppa.
(A slightly different version) reviewed by Andrew Doran.
the following do not wake other threads early:
	pthread_mutex_lock(&mutex);
	pthread_cond_broadcast(&cond);
	foo = malloc(100); /* takes libc mutexes */
	pthread_mutex_unlock(&mutex);
- Eliminate mutexattr_private and just set a bit in ptm_owner if the mutex
is recursive. This forces the slow path to be taken for recursive mutexes.
Overload an unused field in pthread_mutex_t to record whether or not it's
an errorcheck mutex.
- Streamline pthread_mutex_lock / pthread_mutex_unlock a bit more. As a
side effect, this makes it possible to have assembly stubs for them.
- Update some comments and fix minor bugs. Minor cosmetic changes.
- Replace some spinlocks with mutexes and rwlocks.
- Change the process private semaphores to use mutexes and condition
variables instead of doing the synchronization directly. Spinlocks
are no longer used by the semaphore code.
Instead, make the deferred wakeup list a per-thread array and pass down
the lwpid_t's that way.
- In pthread_cond_wait(), take the mutex before dealing with early wakeup.
In this way there should never be contention on the CV's spinlock if
the app follows POSIX rules (there should only be contention on the
user-provided mutex).
- Add a port of the kernel's rwlocks. The rwlock's spinlock is only taken if
there is contention. This is enabled where atomic ops are available. Right
now that is only i386 and amd64 because I don't have other hardware to
test with. It's trivial to add stubs for other architectures as long as
they have compare-and-swap. When we have proper atomic ops the old rwlock
code can be removed.
- Add a new mutex implementation that's similar to the kernel's mutexes, but
uses compare-and-swap to maintain the waiters list, so no spinlocks are
involved. Same caveats apply as for the rwlocks.
Chops another ~10% off create/join in a loop on i386.
- Disable low level debugging as this is stable. Improves benchmarks
across the board by a small percentage. Uncontested mutex acquire
and release in a loop becomes about 8% quicker.
- Minor cleanup.
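Maintaining a waiters list with compare-and-swap, as the mutex item
above describes, can be sketched with C11 atomics. This is only an
illustration of the general technique, not the libpthread
implementation: the struct and function names are hypothetical, and the
real code keeps additional state alongside the list.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical waiter record; the real structure is not shown in the text. */
struct demo_waiter {
    struct demo_waiter *next;
    int id;
};

/* Hypothetical lock-free list head. */
struct demo_waitq {
    _Atomic(struct demo_waiter *) head;
};

/* Push a waiter onto the list using a compare-and-swap retry loop. */
static void demo_waitq_push(struct demo_waitq *q, struct demo_waiter *w)
{
    struct demo_waiter *old =
        atomic_load_explicit(&q->head, memory_order_relaxed);

    do {
        w->next = old;  /* on CAS failure, old is reloaded for us */
    } while (!atomic_compare_exchange_weak_explicit(&q->head, &old, w,
        memory_order_release, memory_order_relaxed));
}

/* Detach the whole list in one atomic step, e.g. to wake all waiters. */
static struct demo_waiter *demo_waitq_take_all(struct demo_waitq *q)
{
    return atomic_exchange_explicit(&q->head, (struct demo_waiter *)NULL,
        memory_order_acquire);
}
```

Taking the entire list with a single exchange sidesteps the ABA problem
that removing individual entries from such a list would raise.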
hint pointer, but do so in a way that remains compatible with older
pthread libraries. This can be used to wake another thread before the
calling thread goes to sleep, saving at least one syscall + involuntary
context switch. This turns out to be a fairly large win on the condvar
benchmarks that I have tried.
detach/join.
- Make mutex acquire spin for a short time, as done with spinlocks.
- Make the number of spins controllable with the env var PTHREAD_NSPINS.
- Reduce the amount of time that libpthread internal spinlocks are held.
- Rely more on the barrier effects of park/unpark to avoid taking spinlocks.
- Simplify the locking around pthreads and the global queues.
- Align per-thread sync data on a 128 byte boundary.
- Offset thread stacks by a small amount to try and reduce cache thrash.
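The spin-before-blocking idea and the PTHREAD_NSPINS knob from the list
above might look roughly like the following. It is a sketch under
several assumptions: the function names and the default of 1000 spins
are invented for illustration, an atomic_flag stands in for the real
lock word, and plain getenv() is used for brevity.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_DEFAULT_NSPINS 1000    /* hypothetical default */

/* Parse a spin count; fall back to dflt on missing or malformed input. */
static int demo_parse_nspins(const char *s, int dflt)
{
    char *end;
    long v;

    if (s == NULL || *s == '\0')
        return dflt;
    v = strtol(s, &end, 10);
    if (*end != '\0' || v < 0 || v > INT_MAX)
        return dflt;
    return (int)v;
}

/* Read the spin count from the environment (plain getenv, for brevity). */
static int demo_nspins_from_env(void)
{
    return demo_parse_nspins(getenv("PTHREAD_NSPINS"), DEMO_DEFAULT_NSPINS);
}

/*
 * Try to take the lock up to nspins times. Returns true if acquired;
 * on false the caller would fall back to sleeping in the kernel.
 */
static bool demo_spin_acquire(atomic_flag *lock, int nspins)
{
    for (int i = 0; i < nspins; i++) {
        if (!atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
            return true;
    }
    return false;
}

/* Release the lock taken by demo_spin_acquire(). */
static void demo_release(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

A spin count of 0 makes demo_spin_acquire() fail immediately, which
corresponds to always taking the blocking path.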
After resuming execution, the thread must check to see if it
has been restarted as a result of pthread_cond_signal(). If it
has, but cannot take the wakeup (because of, e.g., a pending Unix
signal or timeout) then try to ensure that another thread sees
it. This is necessary because there may be multiple waiters,
and at least one should take the wakeup if possible.
so avoid making them.
- When parking an LWP on a condition variable, point the hint argument at
the mutex's waiters queue. Chances are we will be awoken from that later.
be used only as a hint. Clear the pointer when releasing the mutex.
- When releasing a mutex, wake all waiters. Makes it possible to transfer
waiters from another object to a mutex.
any threads are created turned out to be not such a good idea.
there are stronger requirements on what has to work in a forked child
while a process is still single-threaded. so take all that stuff
back out and fix the problems with single-threaded programs that
are linked with libpthread differently, by checking if the library
has been started and doing completely different stuff if it hasn't been:
- for pthread_rwlock_timedrdlock(), just fail with EDEADLK immediately.
- for sem_wait(), the only thing that can unlock the semaphore is a
signal handler, so use sigsuspend() to wait for a signal.
- for pthread_mutex_lock_slow(), just go into an infinite loop
waiting for signals.
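The sem_wait() case above, where only a signal handler can post, can be
sketched with sigsuspend(). This is a simplified stand-in and not the
library code: the counter, the handler and the function names are
hypothetical, and it assumes a single-threaded process.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical semaphore count, only ever posted from a signal handler. */
static volatile sig_atomic_t demo_count;

/* Signal handler acting as the "post" side. */
static void demo_post_handler(int signo)
{
    (void)signo;
    demo_count++;
}

/*
 * Wait until the count is positive, then take one unit.
 * signo is blocked while the count is tested, and sigsuspend() unblocks
 * it atomically while sleeping, so a post cannot slip in between the
 * test and the sleep.
 */
static int demo_wait(int signo)
{
    sigset_t block, old, waitmask;

    sigemptyset(&block);
    sigaddset(&block, signo);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    waitmask = old;
    sigdelset(&waitmask, signo);    /* let signo through while asleep */

    while (demo_count == 0)
        sigsuspend(&waitmask);      /* returns after a handler has run */
    demo_count--;

    return sigprocmask(SIG_SETMASK, &old, NULL);
}
```

The caller is expected to have installed demo_post_handler for signo
with sigaction() before calling demo_wait().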
I also noticed that there's a "sem2" test that has never worked in its
single-threaded form. the problem there is that a signal handler tries
to take a sem_t interlock which is already held when the signal is received.
fix this too, by adding a single-threaded case for sem_trywait() that
blocks signals instead of using the userland interlock.
call pthread__start() if it hasn't already been called. this avoids
an internal assertion from the library if these routines are used
before any threads are created and they need to sleep.
fixes PR 20256, PR 24241, PR 25722, PR 26096.