the bug appeared only with requests roughly 2*sizeof(size_t) to
4*sizeof(size_t) bytes smaller than a multiple of the page size, and
only for requests large enough to be serviced by mmap instead of the
normal heap. it was only ever observed on 64-bit machines but
presumably could also affect 32-bit (albeit with a smaller window of
opportunity).
since vfprintf will provide a temporary buffer in the case where the
target FILE has a zero buffer size, don't bother setting up a real
buffer for vdprintf. this also allows us to skip the call to fflush
since we know everything will be written out before vfprintf returns.
this change makes it so most calls to fprintf(stderr, ...) will result
in a single writev syscall, as opposed to roughly 2*N syscalls (and
possibly more) where N is the number of format specifiers. in
principle we could use a much larger buffer, but it's best not to
increase the stack requirements too much. most messages are under 80
chars.
0e10000000000000000000000000000000 was setting ERANGE
exponent char e/p was considered part of the match even if not
followed by a valid decimal value
"1e +10" was parsed as "1e+10"
hex digits were misinterpreted as 0..5 instead of 10..15
the problem: there is a (single-instruction) race condition window
between a thread flagging itself dead and decrementing itself from the
thread count. if it receives the rsyscall signal at this exact moment,
the rsyscall caller will never succeed in signalling enough flags to
succeed, and will deadlock forever. in previous versions of musl, the
about-to-terminate thread masked all signals prior to decrementing
the thread count, but this cost a whole syscall just to account for
extremely rare races.
the solution is a huge hack: rather than blocking in the signal
handler if the thread is dead, modify the signal mask of the saved
context and return in order to prevent further signal handling by the
dead thread. this allows the dead thread to continue decrementing the
thread count (if it had not yet done so) and exiting, even while the
live part of the program blocks for rsyscall.
for some inexplicable reason, linux allows the sender of realtime
signals to spoof its identity. permission checks for sending signals
should limit the impact to same-user processes, but just to be safe,
we avoid trusting the siginfo structure and instead simply examine the
program state to see if we're in the middle of a legitimate rsyscall.
if init_malloc returns positive (successful first init), malloc will
retry getting a chunk from the free bins rather than expanding the
heap again. also pass init_malloc a hint for the size of the initial
allocation.
we want to keep atomically updated fields (locks and thread count) and
really anything writable far away from frequently-needed function
pointers. stuff some rarely-needed function pointers in between to
pad, hopefully up to a cache line boundary.
instead of allocating a userspace structure for signal-based timers,
simply use the kernel timer id. we use the fact that thread pointers
will always be zero in the low bit (actually more) to encode integer
timerid values as pointers.
also, this change ensures that the timer_destroy syscall has completed
before the library timer_destroy function returns, in case it matters.
the major idea of this patch is not to depend on having the timer
pointer delivered to the signal handler, and instead use the thread
pointer to get the callback function address and argument. this way,
the parent thread can make the timer_create syscall while the child
thread is starting, and it should never have to block waiting for the
barrier.
unlocking an unlocked mutex is not UB for robust or error-checking
mutexes, so we must avoid calling __pthread_self (which might crash
due to lack of thread-register initialization) until after checking
that the mutex is locked.
why does this affect behavior? well, the linker seems to traverse
archive files starting from its current position when resolving
symbols. since calloc.c comes alphabetically (and thus in sequence in
the archive file) between __simple_malloc.c and malloc.c, attempts to
resolve the "malloc" symbol for use by calloc.c were pulling in the
full malloc.c implementation rather than the __simple_malloc.c
implementation.
as of now, lite_malloc.c and malloc.c are adjacent in the archive and
in the correct order, so malloc.c should never be used to resolve
"malloc" unless it's already needed to resolve another symbol ("free"
or "realloc").
this roughly halves the cost of pthread_mutex_unlock, at least for
non-robust, normal-type mutexes.
the a_store change is in preparation for future support of archs which
require a memory barrier or special atomic store operation, and also
should prevent the possibility of the compiler misordering writes.
cycle-level benchmark on atom cpu showed typical pthread_mutex_lock
call dropping from ~120 cycles to ~90 cycles with this change. benefit
may vary with compiler options and version, but this optimization is
very cheap to make and should always help some.
this implementation is superior to the glibc/nptl implementation, in
that it gives true realtime behavior. there is no risk of timer
expiration events being lost due to failed thread creation or failed
malloc, because the thread is created as time creation time, and
reused until the timer is deleted.
- there is no longer any risk of spoofing cancellation requests, since
the cancel flag is set in pthread_cancel rather than in the signal
handler.
- cancellation signal is no longer unblocked when running the
cancellation handlers. instead, pthread_create will cause any new
threads created from a cancellation handler to unblock their own
cancellation signal.
- various tweaks in preparation for POSIX timer support.