Commit Graph

99 Commits

Author SHA1 Message Date
Alex Xu (Hello71)
63c67053a3 in early stage ldso before __dls2b, call mprotect with __syscall
if LTO is enabled, gcc hoists the call to ___errno_location outside the
loop even though the access to errno is gated behind head != &ldso
because ___errno_location is marked __attribute__((const)). this causes
the program to crash because TLS is not yet initialized when called from
__dls2. this is also possible if LTO is not enabled; even though gcc 11
doesn't do it, it is still wrong to use errno here.

since the start and end are already aligned, we can simply call
__syscall instead of using global errno.

Fixes: e13a2b8953 ("implement PT_GNU_RELRO support")
2022-07-02 16:12:04 -04:00
Érico Nogueira
b7a130e0b9 remove unnecessary cast for map_library return
the function already returns (void *)
2021-04-20 15:40:27 -04:00
Rich Felker
aad50fcd79 fix regression in dl_iterate_phdr reporting of modules with no TLS
__tls_get_addr should not be called with an invalid TLS module id of
0. in practice it probably "works", returning the DTV length as if it
were a pointer, and the callback should probably not inspect
dlpi_tls_data in this case, but it's likely that some real-world
callbacks use a check on dlpi_tls_data being non-null, rather than on
dlpi_tls_modid being nonzero, to conclude that the module has TLS.
2021-04-16 10:20:46 -04:00
Rich Felker
521b4d27a0 fix dl_iterate_phdr dlpi_tls_data reporting to match spec
dl_iterate_phdr was wrongly reporting the address of the DSO's PT_TLS
image rather than the calling thread's instance of the TLS. the man
page, which is essentially normative for a nonstandard function of
this sort, clearly specifies the latter. it does not clarify where
exactly within/relative-to the image the pointer should point, but the
reasonable thing to do is match the ABI's DTP offset, and this seems
to be what other implementations do.
2021-03-26 13:35:41 -04:00
Rich Felker
cfdfd5ea3c don't fail to map library/executable with zero-length segment maps
reportedly the GNU linker can emit such segments, causing spurious
failure to load due to mmap with a length of zero producing EINVAL.
no action is required for such a load map (it's effectively a nop in
the program headers table) so just treat it as always successful.
2021-03-05 11:13:02 -05:00
Rich Felker
167390f055 lift child restrictions after multi-threaded fork
as the outcome of Austin Group tracker issue #62, future editions of
POSIX have dropped the requirement that fork be AS-safe. this allows
but does not require implementations to synchronize fork with internal
locks and give forked children of multithreaded parents a partly or
fully unrestricted execution environment where they can continue to
use the standard library (per POSIX, they can only portably use
AS-safe functions).

up until recently, taking this allowance did not seem desirable.
however, commit 8ed2bd8bfc exposed the
extent to which applications and libraries are depending on the
ability to use malloc and other non-AS-safe interfaces in MT-forked
children, by converting latent very-low-probability catastrophic state
corruption into predictable deadlock. dealing with the fallout has
been a huge burden for users/distros.

while it looks like most of the non-portable usage in applications
could be fixed given sufficient effort, at least some of it seems to
occur in language runtimes which are exposing the ability to run
unrestricted code in the child as part of the contract with the
programmer. any attempt at fixing such contracts is not just a
technical problem but a social one, and is probably not tractable.

this patch extends the fork function to take locks for all libc
singletons in the parent, and release or reset those locks in the
child, so that when the underlying fork operation takes place, the
state protected by these locks is consistent and ready for the child
to use. locking is skipped in the case where the parent is
single-threaded so as not to interfere with legacy AS-safety property
of fork in single-threaded programs. lock order is mostly arbitrary,
but the malloc locks (including bump allocator in case it's used) must
be taken after the locks on any subsystems that might use malloc, and
non-AS-safe locks cannot be taken while the thread list lock is held,
imposing a requirement that it be taken last.
2020-11-11 15:55:30 -05:00
Rich Felker
34952fe5de convert malloc use under libc-internal locks to use internal allocator
this change lifts undocumented restrictions on calls by replacement
mallocs to libc functions that might take these locks, and sets the
stage for lifting restrictions on the child execution environment
after multithreaded fork.

care is taken to #define macros to replace all four functions (malloc,
calloc, realloc, free) even if not all of them will be used, using an
undefined symbol name for the ones intended not to be used so that any
inadvertent future use will be caught at compile time rather than
directed to the wrong implementation.
2020-11-11 13:31:50 -05:00
Rich Felker
c1e5d243b7 drop use of getdelim/stdio in dynamic linker
the only place stdio was used here was for reading the ldso path file,
taking advantage of getdelim to automatically allocate and resize the
buffer. the motivation for use here was that, with shared libraries,
stdio is already available anyway and free to use. this has long been
a nuisance to users because getdelim's use of realloc here triggered a
valgrind bug, but removing it doesn't really fix that; on some archs
even calling the valgrind-interposed malloc at this point will crash.

the actual motivation for this change is moving towards getting rid of
use of application-provided malloc in parts of libc where it would be
called with libc-internal locks held, leading to the possibility of
deadlock if the malloc implementation doesn't follow unwritten rules
about which libc functions are safe for it to call. since getdelim is
required to produce a pointer as if by malloc (i.e. that can be passed
to reallor or free), it necessarily must use the public malloc.

instead of performing a realloc loop as the path file is read, first
query its size with fstat and allocate only once. this produces
slightly different truncation behavior when racing with writes to a
file, but neither behavior is or could be made safe anyway; on a live
system, ldso path files should be replaced by atomic rename only. the
change should also reduce memory waste.
2020-11-11 10:55:13 -05:00
rcombs
ccba23459e ldso: notify the debugger when we're doing a dlopen
Otherwise lldb doesn't notice the new library and stack traces
containing it get cut off unhelpfully.
2020-10-27 01:10:29 -04:00
Rich Felker
50716702d4 ldso: use pthread_t rather than kernel tid to track ctor visitor
commit 188759bbee documented the intent
to allow recursive dlopen based on tracking ctor_visitor, but used a
kernel tid rather than the pthread_t to identify the caller. as a
result, it would not behave as intended under fork by a ctor, where
the child tid would not match.
2020-10-14 20:27:12 -04:00
Rich Felker
1efc8eb2c7 fix stale lock when allocation of ctor queue fails during dlopen
queue_ctors should not be called with the init_fini_lock held, since
it may longjmp out on allocation failure. this introduces a minor
TOCTOU race with p->constructed, but one already exists further down
anyway, and by design it's okay to run through the queue more than
once anyway. the only reason we bother to check p->constructed at all
is to avoid spurious failure of dlopen when the library is already
fully loaded and constructed.
2020-10-14 20:27:12 -04:00
Rich Felker
57f6e85c9d remove redundant pthread struct members repeated for layout purposes
dtv_copy, canary2, and canary_at_end existed solely to match multiple
ABI and asm-accessed layouts simultaneously. now that pthread_arch.h
can be included before struct __pthread is defined, the struct layout
can depend on macros defined by pthread_arch.h.
2020-08-27 18:36:45 -04:00
Rich Felker
e9f4fd1185 have ldso track replacement of aligned_alloc
this is in preparation for improving behavior of malloc interposition.
2020-06-10 22:02:45 -04:00
Rich Felker
cee88b76f7 move declaration of interfaces between malloc and ldso to dynlink.h
this eliminates consumers of malloc_impl.h outside of the malloc
implementation.
2020-06-02 21:38:25 -04:00
Fangrui Song
72658c658b ldso: remove redundant switch case for REL_NONE
as a result of commit b6a6cd703f,
the REL_NONE case is now redundant.
2020-03-20 12:35:38 -04:00
Rich Felker
0ff18be208 fix incorrect __hwcap seen in dynamic-linked __set_thread_area
the bug fixed in commit b82cd6c78d was
mostly masked on arm because __hwcap was zero at the point of the call
from the dynamic linker to __set_thread_area, causing the access to
libc.auxv to be skipped and kuser_helper versions of TLS access and
atomics to be used instead of the armv6 or v7 versions. however, on
kernels with kuser_helper removed for hardening it would crash.

since __set_thread_area potentially uses __hwcap, it must be
initialized before the function is called. move the AT_HWCAP lookup
from stage 3 to stage 2b.
2020-01-15 16:15:49 -05:00
Rich Felker
d6bbea2acf fix fdpic regression in dynamic linker with overly smart compilers
at least gcc 9 broke execution of DT_INIT/DT_FINI for fdpic archs
(presently only sh) by recognizing that the stores to the
compound-literal function descriptor constructed to call them were
dead stores. there's no way to make a "may_alias function", so instead
launder the descriptor through an asm-statement barrier. in practice
just making the compound literal volatile seemed to have worked too,
but this should be less of a hack and more accurately convey the
semantics of what transformations are not valid.
2020-01-01 00:15:04 -05:00
Rich Felker
b82cd6c78d fix crashing ldso on archs where __set_thread_area examines auxv
commit 1c84c99913 moved the call to
__init_tp above the initialization of libc.auxv, inadvertently
breaking archs where __set_thread_area examines auxv for the sake of
determining the TLS/atomic model needed at runtime. this broke armv6
and sh2.
2019-12-31 21:59:07 -05:00
Rich Felker
b529ec9b52 move stage3_func typedef out of shared internal dynlink.h header
this interface contract is entirely internal to dynlink.c.
2019-12-31 21:51:07 -05:00
Rich Felker
22daaea39f add time64 redirect for, and redirecting implementation of, dlsym
if symbols are being redirected to provide the new time64 ABI, dlsym
must perform matching redirections; otherwise, it would poke a hole in
the magic and return pointers to functions that are not safe to call
from a caller using time64 types.

rather than duplicating a table of redirections, use the time64
symbols present in libc's symbol table to derive the decision for
whether a particular symbol needs to be redirected.
2019-11-02 18:30:56 -04:00
Rich Felker
9d35fec9e1 fix regression whereby main thread didn't get TLS relocations
commit ffab43602b broke this by moving
relocations after not only the allocation of storage for the main
thread's static TLS, but after the copying of the TLS image. thus,
relocation results were not reflected in the main thread's copy. this
could be fixed by calling __reset_tls after relocations, but instead
split the allocation and installation before/after relocations so that
there's not a redundant copy.

due to commit 71af530987, updating of
static_tls_cnt needs to be kept with allocation of static TLS, before
relocations, rather than after installation.
2019-08-13 21:53:30 -04:00
Szabolcs Nagy
f2435263d7 make relocation time symbol lookup and dlsym consistent
Using common code path for all symbol lookups fixes three dlsym issues:

- st_shndx of STT_TLS symbols were not checked and thus an undefined
  tls symbol reference could be incorrectly treated as a definition
  (the sysv hash lookup returns undefined symbols, gnu does not, so should
  be rare in practice).

- symbol binding was not checked so a hidden symbol may be returned
  (in principle STB_LOCAL symbols may appear in the dynamic symbol table
  for hidden symbols, but linkers most likely don't produce it).

- mips specific behaviour was not applied (ARCH_SYM_REJECT_UND) so
  undefined symbols may be returned on mips.

always_inline is used to avoid relocation performance regression, the
code generation for find_sym should not be affected.
2019-08-12 18:25:38 -04:00
Rich Felker
1f060ed2fb ldso: correct condition for local symbol handling in do_relocs
commit 7a9669e977 added use of the
symbol reference as the definition, in place of performing a lookup,
for STT_SECTION symbol references that were first found used in FDPIC.
such references may happen in certain other cases, such as
local-dynamic TLS and with relocation types that require a symbol but
that are being used for non-symbolic purposes, like the powerpc
unaligned address relocations.

in all such cases I'm aware of, the symbol referenced is a section
symbol (STT_SECTION); however, the important semantic property is not
its being a section, but rather its binding local (STB_LOCAL). check
the latter instead of the former for greater generality and semantic
correctness.
2019-08-12 18:19:38 -04:00
Samuel Holland
08869deb7e add support for powerpc/powerpc64 unaligned relocations
R_PPC_UADDR32 (R_PPC64_UADDR64) has the same meaning as R_PPC_ADDR32
(R_PPC64_ADDR64), except that its address need not be aligned. For
powerpc64, BFD ld(1) will automatically convert between ADDR<->UADDR
relocations when the address is/isn't at its native alignment. This
will happen if, for example, there is a pointer in a packed struct.

gold and lld do not currently generate R_PPC64_UADDR64, but pass
through misaligned R_PPC64_ADDR64 relocations from object files,
possibly relaxing them to misaligned R_PPC64_RELATIVE. In both cases
(relaxed or not) this violates the PSABI, which defines the relevant
field type as "a 64-bit field occupying 8 bytes, the alignment of
which is 8 bytes unless otherwise specified."

All three linkers violate the PSABI on 32-bit powerpc, where the only
difference is that the field is 32 bits wide, aligned to 4 bytes.

Currently musl fails to load executables linked by BFD ld containing
R_PPC64_UADDR64, with the error "unsupported relocation type 43".
This change provides compatibility with BFD ld on powerpc64, and any
static linker on either architecture that starts following the PSABI
more closely.
2019-08-11 17:43:57 -04:00
Rich Felker
71af530987 ldso: remove redundant runtime checks in static TLS logic
as a result of commit ffab43602b,
static_tls_cnt is now valid during relocations at program startup, so
it's no longer necessary to condition the check against static_tls_cnt
on this being a runtime (dlopen) relocation.
2019-08-11 11:57:38 -04:00
Rich Felker
ffab43602b ldso: fix calloc misuse allocating initial tls
this is analogous to commit 2f1f51ae7b,
and should have been caught at the same time since it was right next
to the code moved in that commit. between final stage 3 reloc_all and
the jump to the main program's entry point, it is not valid to call
any functions which may be interposed by the application; doing so
results in execution of application code before ctors have run, and on
fdpic archs, before the main program's fdpic self-fixups have taken
place, which will produce runaway wrong execution.
2019-08-11 11:48:06 -04:00
Rich Felker
9b83182069 fix inadvertent use of uninitialized variable in dladdr
commit c8b49b2fbc introduced code that
checked bestsym to determine whether a matching symbol was found, but
bestsym is uninitialized if not. instead use best, consistent with use
in the rest of the function.

simplified from bug report and patch by Cheng Liu.
2019-07-06 17:47:43 -04:00
Rich Felker
54b7564b72 remove unnecessary and problematic _Noreturn from crt/ldso startup
after commit a48ccc159a removed the use
of _Noreturn on the stage3_func type (which only worked due to it
being defined to the "GNU C" attribute in C99 mode), GCC could no
longer assume that the ends of __dls2 and __dls2b are unreachable, and
produced a warning that a function marked _Noreturn returns.

also, since commit 4390383b32, the
_Noreturn declaration for __libc_start_main in crt1/rcrt1 has been not
only inconsistent with the definition, but wrong. formally,
__libc_start_main does return, via a (hopefully) tail call to a helper
function after the barrier. incorrect usage of _Noreturn in the
declaration was probably formal UB.

the _Noreturn specifiers were not useful in any of these places, so
remove them all. now, the only remaining usage of _Noreturn is in
public interfaces where _Noreturn is part of their contract.
2019-06-25 19:05:40 -04:00
Szabolcs Nagy
a60b9e0686 fix tls offsets when p_vaddr%p_align != 0 on TLS_ABOVE_TP targets
currently the bfd linker does not seem to create tls segments where
p_vaddr%p_align != 0, but this is valid in ELF and then the runtime
computed tls offset must satisfy

  offset%p_align == (base+p_vaddr)%p_align

and in case of local exec tls (main executable) the smallest such
offset must be used (otherwise it is incompatible with the offset
computed by the static linker). the !TLS_ABOVE_TP case is handled
correctly (the offset is negative then in the formula).

the ldso code for TLS_ABOVE_TP is changed so the static tls offset
of each module satisfies the formula.
2019-05-16 21:48:39 -04:00
Szabolcs Nagy
6104dae908 fix static tls offsets of shared libs on TLS_ABOVE_TP targets
tls_offset should always point to the end of the allocated static tls
area, but this was not handled correctly on "tls variant 1" targets
in the dynamic linker:

after application tls was allocated, tls_offset was aligned up,
potentially wasting tls space. (alignment may be needed at the
begining of the tls area, not at the end, but that will be fixed
separately as it is unlikely to affect real binaries.)

when static tls was allocated for a shared library, tls_offset was
only updated with the size of the tls segment which does not include
alignment gaps, which can easily happen if the tls size update for
one library leaves tls_offset misaligned for the next one. this can
cause oob access in __copy_tls or arbitrary breakage at tls access.
(the issue was observed on aarch64 with rust binaries)
2019-05-16 20:12:56 -04:00
Fangrui Song
f450c150d3 remove unused struct dso members from dynlink.c
maintainer's note: commit 9d44b6460a
removed their use.
2019-05-12 09:51:45 -04:00
Rich Felker
22e5bbd0de overhaul i386 syscall mechanism not to depend on external asm source
this is the first part of a series of patches intended to make
__syscall fully self-contained in the object file produced using
syscall.h, which will make it possible for crt1 code to perform
syscalls.

the (confusingly named) i386 __vsyscall mechanism, which this commit
removes, was introduced before the presence of a valid thread pointer
was mandatory; back then the thread pointer was setup lazily only if
threads were used. the intent was to be able to perform syscalls using
the kernel's fast entry point in the VDSO, which can use the sysenter
(Intel) or syscall (AMD) instruction instead of int $128, but without
inlining an access to the __syscall global at the point of each
syscall, which would incur a significant size cost from PIC setup
everywhere. the mechanism also shuffled registers/calling convention
around to avoid spills of call-saved registers, and to avoid
allocating ebx or ebp via asm constraints, since there are plenty of
broken-but-supported compiler versions which are incapable of
allocating ebx with -fPIC or ebp with -fno-omit-frame-pointer.

the new mechanism preserves the properties of avoiding spills and
avoiding allocation of ebx/ebp in constraints, but does it inline,
using some fairly simple register shuffling, and uses a field of the
thread structure rather than global data for the vdso-provided syscall
code address.

for now, the external __syscall function is refactored not to use the
old __vsyscall so it can be kept, but the intent is to remove it too.
2019-04-10 17:10:36 -04:00
Ilya Matveychikov
7784680072 fix the use of syscall result in dl_mmap 2019-04-06 09:38:49 -04:00
Ray
086a12b920 delete a redundant if in dynamic linker ctor execution loop 2019-04-02 10:36:49 -04:00
Rich Felker
50cd02386b fix invalid-/double-/use-after-free in new dlopen ctor execution
this affected the error path where dlopen successfully found and
loaded the requested dso and all its dependencies, but failed to
resolve one or more relocations, causing the operation to fail after
storage for the ctor queue was allocated.

commit 188759bbee wrongly put the free
for the ctor_queue array in the error path inside a loop over each
loaded dso that needed to be backed-out, rather than just doing it
once. in addition, the exit path also observed the ctor_queue pointer
still being nonzero, and would attempt to call ctors on the backed-out
dsos unless the double-free crashed the process first.
2019-03-10 13:16:59 -04:00
Rich Felker
43e7efb465 avoid malloc of ctor queue for programs with no external deps
together with the previous two commits, this completes restoration of
the property that dynamic-linked apps with no external deps and no tls
have no failure paths before entry.
2019-03-03 13:24:23 -05:00
Rich Felker
f034f145bd avoid malloc of deps arrays for ldso and vdso
neither has or can have any dependencies, but since commit
4035556907, gratuitous zero-length deps
arrays were being allocated for them. use a dummy array instead.
2019-03-03 12:45:23 -05:00
Rich Felker
e612d094b1 avoid malloc of deps array for programs with no external deps
traditionally, we've provided a guarantee that dynamic-linked
applications with no external dependencies (nothing but libc) and no
thread-local storage have no failure paths before the entry point.
normally, thanks to reclaim_gaps, such a malloc will not require a
syscall anyway, but if segment alignment is unlucky, it might. use a
builtin array for this common special case.
2019-03-03 12:42:16 -05:00
Rich Felker
2f1f51ae7b fix malloc misuse for startup ctor queue, breakage on fdpic archs
in the case where malloc is being replaced, it's not valid to call
malloc between final relocations and main app's crt1 entry point; on
fdpic archs the main app's entry point will not yet have performed the
self-fixups necessary to call its code.

to fix, reorder queue_ctors before final relocations. an alternative
solution would be doing the allocation from __libc_start_init, after
the entry point but before any ctors run. this is less desirable,
since it would leave a call to malloc that might be provided by the
application happening at startup when doing so can be easily avoided.
2019-03-03 12:06:44 -05:00
Rich Felker
8e43b5613e synchronize shared library dtor exec against concurrent loads/ctors
previously, going way back, there was simply no synchronization here.
a call to exit concurrent with ctor execution from dlopen could cause
a dtor to execute concurrently with its corresponding ctor, or could
cause dtors for newly-constructed libraries to be skipped.

introduce a shutting_down state that blocks further ctor execution,
producing the quiescence the dtor execution loop needs to ensure any
kind of consistency, and that blocks further calls to dlopen so that a
call into dlopen from a dtor cannot deadlock.

better approaches to some of this may be possible, but the changes
here at least make things safe.
2019-03-03 12:06:44 -05:00
Rich Felker
188759bbee overhaul shared library ctor execution for dependency order, concurrency
previously, shared library constructors at program start and dlopen
time were executed in reverse load order. some libraries, however,
rely on a depth-first dependency order, which most other dynamic
linker implementations provide. this is a much more reasonable, less
arbitrary order, and it turns out to have much better properties with
regard to how slow-running ctors affect multi-threaded programs, and
how recursive dlopen behaves.

this commit builds on previous work tracking direct dependencies of
each dso (commit 4035556907), and
performs a topological sort on the dependency graph at load time while
the main ldso lock is held and before success is committed, producing
a queue of constructors needed by the newly-loaded dso (or main
application). in the case of circular dependencies, the dependency
chain is simply broken at points where it becomes circular.

when the ctor queue is run, the init_fini_lock is held only for
iteration purposes; it's released during execution of each ctor, so
that arbitrarily-long-running application code no longer runs with a
lock held in the caller. this prevents a dlopen with slow ctors in one
thread from arbitrarily delaying other threads that call dlopen.
fully-independent ctors can run concurrently; when multiple threads
call dlopen with a shared dependency, one will end up executing the
ctor while the other waits on a condvar for it to finish.

another corner case improved by these changes is recursive dlopen
(call from a ctor). previously, recursive calls to dlopen could cause
a ctor for a library to be executed before the ctor for its
dependency, even when there was no relation between the calling
library and the library it was loading, simply due to the naive
reverse-load-order traversal. now, we can guarantee that recursive
dlopen in non-circular-dependency usage preserves the desired ctor
execution order properties, and that even in circular usage, at worst
the libraries whose ctors call dlopen will fail to have completed
construction when ctors that depend on them run.

init_fini_lock is changed to a normal, non-recursive mutex, since it
is no longer held while calling back into application code.
2019-03-03 12:06:22 -05:00
Rich Felker
88207361ea record preloaded libraries as direct pseudo-dependencies of main app
this makes calling dlsym on the main app more consistent with the
global symbol table (load order), and is a prerequisite for
dependency-order ctor execution to work correctly with LD_PRELOAD.
2019-03-02 17:29:29 -05:00
Rich Felker
0c5c8f5da6 fix unsafety of new ldso dep tracking in presence of malloc replacement
commit 4035556907 introduced runtime
realloc of an array that may have been allocated before symbols were
resolved outside of libc, which is invalid if the allocator has been
replaced. track this condition and manually copy if needed.
2019-03-02 17:29:11 -05:00
Rich Felker
4035556907 fix and overhaul dlsym depedency order, always record direct deps
dlsym with an explicit handle is specified to use "dependency order",
a breadth-first search rooted at the argument. this has always been
implemented by iterating a flattened dependency list built at dlopen
time. however, the logic for building this list was completely wrong
except in trivial cases; it simply used the list of libraries loaded
since a given library, and their direct dependencies, as that
library's dependencies, which could result in misordering, wrongful
omission of deep dependencies from the search, and wrongful inclusion
of unrelated libraries in the search.

further, libraries did not have any recorded list of resolved
dependencies until they were explicitly dlopened, meaning that
DT_NEEDED entries had to be resolved again whenever a library
participated as a dependency of more than one dlopened library.

with this overhaul, the resolved direct dependency list of each
library is always recorded when it is first loaded, and can be
extended to a full flattened breadth-first search list if dlopen is
called on the library. the extension is performed using the direct
dependency list as a queue and appending copies of the direct
dependency list of each dependency in the queue, excluding duplicates,
until the end of the queue is reached. the direct deps remain
available for future use as the initial subarray of the full deps
array.

first-load logic in dlopen is updated to match these changes, and
clarified.
2019-02-27 16:05:37 -05:00
Rich Felker
71db5dfaa9 fix crash/misbehavior from oob read in new dynamic tls installation
code introduced in commit 9d44b6460a
wrongly attempted to read past the end of the currently-installed dtv
to determine if a dso provides new, not-already-installed tls. this
logic was probably leftover from an earlier draft of the code that
wrongly installed the new dtv before populating it.

it would work if we instead queried the new, not-yet-installed dtv,
but instead, replace the incorrect check with a simple range check
against old_cnt. this also catches modules that have no tls at all
with a single condition.
2019-02-27 12:07:20 -05:00
Rich Felker
6516282d2a fix crash in new dynamic tls installation when last dep lacks tls
code introduced in commit 9d44b6460a
wrongly assumed the dso list tail was the right place to find new dtv
storage. however, this is only true if the last-loaded dependency has
tls. the correct place to get it is the dso corresponding to the tls
module list tail. introduce a container_of macro to get it, and use
it.

ultimately, dynamic tls allocation should be refactored so that this
is not an issue. there is no reason to be allocating new dtv space at
each load_library; instead it could happen after all new libraries
have been loaded but before they are committed. such changes may be
made later, but this commit fixes the present regression.
2019-02-25 02:09:36 -05:00
Rich Felker
ba18c1ecc6 add membarrier syscall wrapper, refactor dynamic tls install to use it
the motivation for this change is twofold. first, it gets the fallback
logic out of the dynamic linker, improving code readability and
organization. second, it provides application code that wants to use
the membarrier syscall, which depends on preregistration of intent
before the process becomes multithreaded unless unbounded latency is
acceptable, with a symbol that, when linked, ensures that this
registration happens.
2019-02-22 03:25:39 -05:00
Rich Felker
609dd57c4e fix loop logic cruft in dynamic tls installation
commit 9d44b6460a inadvertently
contained leftover logic from a previous approach to the fallback
signaling loop. it had no adverse effect, since j was always nonzero
if the loop body was reachable, but it makes no sense to be there with
the current approach to avoid signaling self.
2019-02-22 02:24:33 -05:00
Rich Felker
9d44b6460a install dynamic tls synchronously at dlopen, streamline access
previously, dynamic loading of new libraries with thread-local storage
allocated the storage needed for all existing threads at load-time,
precluding late failure that can't be handled, but left installation
in existing threads to take place lazily on first access. this imposed
an additional memory access and branch on every dynamic tls access,
and imposed a requirement, which was not actually met, that the
dynamic tlsdesc asm functions preserve all call-clobbered registers
before calling C code to to install new dynamic tls on first access.
the x86[_64] versions of this code wrongly omitted saving and
restoring of fpu/vector registers, assuming the compiler would not
generate anything using them in the called C code. the arm and aarch64
versions saved known existing registers, but failed to be future-proof
against expansion of the register file.

now that we track live threads in a list, it's possible to install the
new dynamic tls for each thread at dlopen time. for the most part,
synchronization is not needed, because if a thread has not
synchronized with completion of the dlopen, there is no way it can
meaningfully request access to a slot past the end of the old dtv,
which remains valid for accessing slots which already existed.
however, it is necessary to ensure that, if a thread sees its new dtv
pointer, it sees correct pointers in each of the slots that existed
prior to the dlopen. my understanding is that, on most real-world
coherency architectures including all the ones we presently support, a
built-in consume order guarantees this; however, don't rely on that.
instead, the SYS_membarrier syscall is used to ensure that all threads
see the stores to the slots of their new dtv prior to the installation
of the new dtv. if it is not supported, the same is implemented in
userspace via signals, using the same mechanism as __synccall.

the __tls_get_addr function, variants, and dynamic tlsdesc asm
functions are all updated to remove the fallback paths for claiming
new dynamic tls, and are now all branch-free.
2019-02-18 21:01:16 -05:00
Rich Felker
1c84c99913 add new stage 2b to dynamic linker bootstrap for thread pointer
commit a603a75a72 removed attribute
const from __errno_location and pthread_self, and the same reasoning
forced arch definitions of __pthread_self to use volatile asm,
significantly impacting code generation and imposing manual caching of
pointers where the impact might be noticable.

reorder the thread pointer setup and place it across a strong barrier
(symbolic function lookup) so that there is no assumed ordering
between the initialization and the accesses to the thread pointer in
stage 3.
2018-10-16 13:50:28 -04:00