this change lifts undocumented restrictions on calls by replacement
mallocs to libc functions that might take these locks, and sets the
stage for lifting restrictions on the child execution environment
after multithreaded fork.
care is taken to #define macros to replace all four functions (malloc,
calloc, realloc, free) even if not all of them will be used, using an
undefined symbol name for the ones intended not to be used so that any
inadvertent future use will be caught at compile time rather than
directed to the wrong implementation.
allowing the application to replace malloc (since commit
c9f415d7ea) has brought multiple
headaches where it's used from various critical sections in libc
components. for example:
- the thread-local message buffers allocated for dlerror can't be
freed at thread exit time because application code would then run in
the context of a non-existant thread. this was handled in commit
aa5a9d15e0 by queuing them for free
later.
- the dynamic linker has to be careful not to pass memory allocated at
early startup time (necessarily using its own malloc) to realloc or
free after redoing relocations with the application and all
libraries present. bugs in this area were fixed several times, at
least in commits 0c5c8f5da6 and
2f1f51ae7b and possibly others.
- by calling the allocator from contexts where libc-internal locks are
held, we impose undocumented requirements on alternate malloc
implementations not to call into any libc function that might
attempt to take these locks; if they do, deadlock results.
- work to make fork of a multithreaded parent give the child an
unrestricted execution environment is blocked by lock order issues
as long as the application-provided allocator can be called with
libc-internal locks held.
these problems are all fixed by giving libc internals access to the
original, non-replaced allocator, for use where needed. it can't be
used everywhere, as some interfaces like str[n]dup, open_[w]memstream,
getline/getdelim, etc. are required to provide the called memory
obtained as if by (the public) malloc. and there are a number of libc
interfaces that are "pure library" code, not part of some internal
singleton, and where using the application's choice of malloc
implementation is preferable -- things like glob, regex, etc.
one might expect there to be significant cost to static-linked
programs, pulling in two malloc implementations, one of them
mostly-unused, if malloc is replaced. however, in almost all of the
places where malloc is used internally, care has been taken already
not to pull in realloc/free (i.e. to link with just the bump
allocator). this size optimization carries over automatically.
the newly-exposed internal allocator functions are obtained by
renaming the actual definitions, then adding new wrappers around them
with the public names. technically __libc_realloc and __libc_free
could be aliases rather than needing a layer of wrapper, but this
would almost surely break certain instrumentation (valgrind) and the
size and performance difference is negligible. __libc_calloc needs to
be handled specially since calloc is designed to work with either the
internal or the replaced malloc.
as a bonus, this change also eliminates the longstanding ugly
dependency of the static bump allocator on order of object files in
libc.a, by making it so there's only one definition of the malloc
function and having it in the same source file as the bump allocator.
the only place stdio was used here was for reading the ldso path file,
taking advantage of getdelim to automatically allocate and resize the
buffer. the motivation for use here was that, with shared libraries,
stdio is already available anyway and free to use. this has long been
a nuisance to users because getdelim's use of realloc here triggered a
valgrind bug, but removing it doesn't really fix that; on some archs
even calling the valgrind-interposed malloc at this point will crash.
the actual motivation for this change is moving towards getting rid of
use of application-provided malloc in parts of libc where it would be
called with libc-internal locks held, leading to the possibility of
deadlock if the malloc implementation doesn't follow unwritten rules
about which libc functions are safe for it to call. since getdelim is
required to produce a pointer as if by malloc (i.e. that can be passed
to reallor or free), it necessarily must use the public malloc.
instead of performing a realloc loop as the path file is read, first
query its size with fstat and allocate only once. this produces
slightly different truncation behavior when racing with writes to a
file, but neither behavior is or could be made safe anyway; on a live
system, ldso path files should be replaced by atomic rename only. the
change should also reduce memory waste.
thread-local buffers allocated for dlerror need to be queued for free
at a later time when the owning thread exits, since malloc may be
replaced by application code and the exiting context is not valid to
call application code from. the code to process queue of pending
frees, introduced in commit aa5a9d15e0,
gratuitously held the lock for the entire duration of queue
processing, updating the global queue pointer after each free, despite
there being no logical requirement that all frees finish before
another thread can access the queue.
instead, immediately claim the whole queue for freeing and release the
lock, then walk the list and perform frees without the lock held. the
change is unlikely to make any meaningful difference to performance,
but it eliminates one point where the allocator is called under an
internal lock. since the allocator may be application-provided, such
calls are undesirable because they allow application code to impede
forward progress of libc functions in other threads arbitrarily long,
and to induce deadlock if it calls a libc function that requires the
same lock.
the change also eliminates a lock ordering consideration that's an
impediment upcoming work with multithreaded fork.
the ABI type for the vector registers in fpregset_t, struct
fpsimd_context, and struct user_fpsimd_struct is __uint128_t, which
was presumably originally not used because it's a nonstandard type,
but its existence is mandated by the aarch64 psABI. use of the wrong
type here broke software using these structures, and encouraged
incorrect fixes with casts rather than reinterpretation of
representation.
the reasoning in commit 2d0bbe6c78 was
not entirely correct. while it's true that setting the waiters flag
ensures that the next unlock will perform a wake, it's possible that
the wake is consumed by a mutex waiter that has no relationship with
the condvar wait queue being processed, which then takes the mutex.
when that thread subsequently unlocks, it sees no waiters, and leaves
the rest of the condvar queue stuck.
bring back the waiter count adjustment, but skip it for PI mutexes,
for which a successful lock-after-waiting always sets the waiters bit.
if future changes are made to bring this same waiters-bit contract to
all lock types, this can be reverted.
sem_open is required to return the same sem_t pointer for all
references to the same named semaphore when it's opened more than once
in the same process. thus we keep a table of all the mapped semaphores
and their reference counts. the code path for sem_close checked the
reference count, but then proceeded to unmap the semaphore regardless
of whether the count had reached zero.
add an immediate unlock-and-return for the nonzero refcnt case so the
property of performing the munmap syscall after releasing the lock can
be preserved.
resource limits have been process-wide since linux 2.6.10, and the
prlimit syscall was added in 2.6.36, so prlimit can be assumed to set
the resource limits correctly for the whole process.
commit bd153422f2 reintroduced the bug
fixed in c21051e90c by refactoring the
__syscall_ret into _Fork where it once again runs before the atfork
handlers are called. since _Fork is a public interface that sets
errno, this can't be fixed the way it was fixed last time without
making new internal interfaces. instead, just save errno, and restore
it only on error to ensure that a value of 0 is never restored.
pthread_cond_wait arranged for requeued waiters to wake when the mutex
is unlocked by temporarily adjusting the mutex's waiter count. commit
54ca677983 broke this when introducing
PI mutexes by repurposing the waiter count field of the mutex
structure. since then, for PI mutexes, the waiter count adjustment was
misinterpreted by the mutex locking code as indicating that the mutex
is non a non-recoverable state.
it would be possible to special-case PI mutexes here, but instead just
drop all adjustment of the waiters count, and instead use the lock
word waiters bit for all mutex types. since the mutex is either held
by the caller or in unrecoverable state at the time the bit is set, it
will necessarily still be set at the time of any subsequent valid
unlock operation, and this will produce the desired effect of waking
the next waiter.
if waiter counts are entirely dropped at some point in the future this
code should still work without modification.
commit 25ea9f712c introduced a deadlock
to the posix_spawn child whereby, if abort was called in the parent
and ended up taking the abort lock to terminate the process, the
__libc_sigaction calls in the child would wait forever to obtain a
lock that would not be released. this could be fixed by having abort
set the abort lock as the exit futex address, but it's cleaner to just
remove the SIGABRT special handling from the internal __libc_sigaction
and lift it to the public sigaction function.
nothing but the posix_spawn child calls __libc_sigaction on SIGABRT,
and since commit b7bc966522 the abort
lock is held at the time of __clone, which precludes the child
inheriting a kernel-level signal disposition inconsistent with the
disposition on the abstract machine. this means it's fine to inspect
and modify the disposition in the child without a lock.
Merge changes from Solar Designer's crypt_blowfish v1.3. This makes
crypt_blowfish fully compatible with OpenBSD's bcrypt by adding
support for the $2b$ prefix (which behaves the same as
crypt_blowfish's $2y$).
this makes the code slightly smaller and eliminates timer_create from
relevance to possible future changes to multithreaded fork.
the barrier of a_store isn't technically needed here, but a_store is
used anyway for internal consistency of the memory model.
this was leftover from when the actual SIGEV_THREAD timer logic was in
the signal handler. commit 5b74eed3b3
replaced that with use of sigwaitinfo, with the actual signal left
blocked, so the no-op signal handler was no longer serving any
purpose.
the signal disposition reset to SIG_DFL is still needed, however, in
case we inherited SIG_IGN from a foreign-libc process.
assert is not specified to flush open stdio streams, and doing so can
block indefinitely waiting for a lock already held or an output
operation to a file that can't accept more output until an
unsatisfiable condition is met.
commit 500c6886c6 broke this by fixing
the behavior of fread to conform to the C standard; getgroupslist was
assuming the old behavior, that a request to read 1 member of length 0
would return 1, not 0.
this change prevents the child created concurrently with abort from
seeing the SIGABRT disposition change from SIG_IGN to SIG_DFL (other
changes are not visible anyway) and prevents leaking the write end of
the child pipe to children created by fork in another thread, which
may block return of posix_spawn indefinitely if the forked child does
not exit or exec.
along with other changes, this suggests that __abort_lock should
perhaps eventually be renamed to reflect that it's becoming a broader
lock on related "process lifetime" state.
the existing abort locking logic in sigaction only accounted for
attempts to change the disposition, not attempts to observe the change
made by abort.
unfortunately the change is still observable in at least one other
place: inheritance of signal dispositions across exec and posix_spawn.
fixing these is a separate task and it's not even clear whether a
complete fix is possible.
the _Fork interface is defined for future issue of POSIX as the
outcome of Austin Group issue 62, which drops the AS-safety
requirement for fork, and provides an AS-safe replacement that does
not run the registered atfork handlers.
commit 188759bbee documented the intent
to allow recursive dlopen based on tracking ctor_visitor, but used a
kernel tid rather than the pthread_t to identify the caller. as a
result, it would not behave as intended under fork by a ctor, where
the child tid would not match.
queue_ctors should not be called with the init_fini_lock held, since
it may longjmp out on allocation failure. this introduces a minor
TOCTOU race with p->constructed, but one already exists further down
anyway, and by design it's okay to run through the queue more than
once anyway. the only reason we bother to check p->constructed at all
is to avoid spurious failure of dlopen when the library is already
fully loaded and constructed.
this makes the code slightly smaller and eliminates these functions
from relevance to possible future changes to multithreaded fork.
the barrier of a_store isn't technically needed here, but a_store is
used anyway for internal consistency of the memory model.
if the multithreaded parent forked while another thread was calling
sigaction for SIGABRT or calling abort, the child could inherit a lock
state in which future calls to abort will deadlock, or in which the
disposition for SIGABRT has already been reset to SIG_DFL. this is
nonconforming since abort is AS-safe and permitted to be called
concurrently with fork or in the MT-forked child.
the dummy definition of __abort_lock in sigaction.c was performing
exactly the same role that putting the lock in its own source file
could and should have been used to achieve.
while we're moving it, give it a proper declaration.
previously, if a file descriptor had aio operations pending in the
parent before fork, attempting to close it in the child would attempt
to cancel a thread belonging to the parent. this could deadlock, fail,
or crash the whole process of the cancellation signal handler was not
yet installed in the parent. in addition, further use of aio from the
child could malfunction or deadlock.
POSIX specifies that async io operations are not inherited by the
child on fork, so clear the entire aio fd map in the child, and take
the aio map lock (with signals blocked) across the fork so that the
lock is kept in a consistent state.
taking the deprecated/dropped vfork spec strictly, doing pretty much
anything but execve in the child is wrong and undefined. however,
these are commonly needed operations to setup the child state before
exec, and historical implementations tolerated them.
for single-threaded parents, these operations already worked as
expected in the vforked child. however, due to the need for __synccall
to synchronize id/resource limit changes among all threads, calling
these functions in the vforked child of a multithreaded parent caused
a misdirected broadcast signaling of all threads in the parent. these
signals could kill the parent entirely if the synccall signal handler
had never been installed in the parent, or could be ignored if it had,
or could signal/kill one or more utterly wrong processes if the parent
already terminated (due to vfork semantics, only possible via fatal
signal) and the parent tids were recycled. in any case, the expected
number of semaphore posts would never happen, so the child would
permanently hang (with all signals blocked) waiting for them.
to mitigate this, and also make the normal usage case work as
intended, treat the condition where the caller's actual tid does not
match the tid in its thread structure as single-threaded, and bypass
the entire synccall broadcast operation.
commit 0a05eace16 implemented AT_EACCESS
for faccessat with a horrible hack, creating a child process to change
switch uid/gid and perform the access probe without making potentially
irreversible changes to the caller's credentials. this was due to the
syscall lacking a flags argument.
linux 5.8 introduced a new syscall, SYS_faccessat2, fixing this
deficiency. use it if any flags are passed, and fallback to the old
strategy on ENOSYS. continue using the old syscall when there are no
flags.
Ethernet protocol number for media redundancy protocol, see
linux commit 4714d13791f831d253852c8b5d657270becb8b2a
bridge: uapi: mrp: Add mrp attributes.
the linux faccessat syscall lacks a flag argument that is necessary
to implement the posix api, see
linux commit c8ffd8bcdd28296a198f237cc595148a8d4adfbe
vfs: add faccessat2 syscall
add TCP_NLA_BYTES_NOTSENT and new tcp_zerocopy_receive fields, see
linux commit c8856c051454909e5059df4e81c77b9c366c5515
tcp-zerocopy: Return inq along with tcp receive zerocopy.
linux commit 33946518d493cdf10aedb4a483f1aa41948a3dab
tcp-zerocopy: Return sk_err (if set) along with tcp receive zerocopy.
linux commit e08ab0b377a1489760533424437c5f4be7f484a4
tcp: add bytes not sent to SCM_TIMESTAMPING_OPT_STATS
it remaps anon mappings without unmapping the original. chromeos plans
to use it with userfaultfd, see:
linux commit e346b3813067d4b17383f975f197a9aa28a3b077
mm/mremap: add MREMAP_DONTUNMAP to mremap()
see
linux commit 9e2ba2c34f1922ca1e0c7d31b30ace5842c2e7d1
fanotify: send FAN_DIR_MODIFY event flavor with dir inode and name
linux commit 44d705b0370b1d581f46ff23e5d33e8b5ff8ec58
fanotify: report name info for FAN_DIR_MODIFY event
added in
linux commit 1a50ec0b3b2e9a83f1b1245ea37a853aac2f741c
arm64: Implement archrandom.h for ARMv8.5-RNG
linux commit d4209d8b717311d114b5d47ba7f8249fd44e97c2
arm64: cpufeature: Export matrix and other features to userspace
these were missed before, added in
linux commit 1201937491822b61641c1878ebcd16a93aed4540
arm64: Expose ARMv8.5 CondM capability to userspace
linux commit ca9503fc9e9812aa6258e55d44edb03eb30fc46f
arm64: Expose FRINT capabilities to userspace
reuses a bit from CSIGNAL so it can only be used with unshare
and clone3, added in
linux commit 769071ac9f20b6a447410c7eaa55d1a5233ef40c
ns: Introduce Time Namespace
needed for storage drivers with userspace component that may
run in the IO path, see
linux commit 8d19f1c8e1937baf74e1962aae9f90fa3aeab463
prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
The use of TCP_ in udp.h is not known, fortunately udp.h is not
specified by posix so there are no strict namespace rules, added in
linux commit e27cca96cd68fa2c6814c90f9a1cfd36bb68c593
xfrm: add espintcp (RFC 8229)
TCP_NLA_TIMEOUT_REHASH queries timeout-triggered rehash attempts,
tcpm_ifindex limits the scope of TCP_MD5SIG* sockopt to a device.
see
linux commit 32efcc06d2a15fa87585614d12d6c2308cc2d3f3
tcp: export count for rehash attempts
linux commit 6b102db50cdde3ba2f78631ed21222edf3a5fb51
net: Add device index to tcp_md5sig
add IPPROTO_ETHERNET and IPPROTO_MPTCP, see
linux commit 2677625387056136e256c743e3285b4fe3da87bb
seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
linux commit faf391c3826cd29feae02078ca2022d2f912f7cc
tcp: Define IPPROTO_MPTCP