The received length field in the message may be greater than the
size of the 'answer' buffer in which the message resides. Currently,
ABUF_SIZE is 768, so a larger 'alens[i]' results in an out-of-bounds
read in __dns_parse().
To fix this, limit the length to the size of the received buffer.
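A minimal sketch of the clamp, using the names above:

    if (alens[i] > ABUF_SIZE) alens[i] = ABUF_SIZE;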
the buffer-flush function did not account for mbtowc returning 0
rather than 1 when converting the nul character. this prevented
advancing past it, instead repeatedly converting it into the output
wide character string until the max output length was exhausted.
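a sketch of the corrected advance, the key point being that mbtowc
reports 0 bytes for the nul character even though one byte must be
consumed (names illustrative):

    l = mbtowc(&wc, src, n);
    if (l < 0) return -1;   /* encoding error */
    if (l == 0) l = 1;      /* nul converts to L'\0' but occupies one byte */
    src += l;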
commit d42269d7c8 appropriated the
stream error flag temporarily to let the printf family of functions
suppress further output attempts after encountering a write error.
since the wide printf code relies on (narrow) vfprintf to print
padding and numeric conversions, a hack was put in vfprintf not to
clear the initial error status unless the stream is narrow oriented.
this was okay, because calling vfprintf on a wide-oriented stream
(outside of internal use by the implementation) produces undefined
behavior. however, it was highly non-obvious to anyone reading the
wide printf code, where the calls to fprintf without first checking
for error status appeared erroneous.
this patch removes all direct use of fprintf from the wide printf
core, except in the numeric conversions case where it was already
checked before starting processing of the directive that the error
status is not set. the other calls, which were performing padding, are
replaced by a new pad() helper function, which performs the check and
abstracts out the mechanism of writing the padding.
direct use of the error flag is also replaced by ferror, which is
defined as a macro in stdio_impl.h, expanding directly to the flag
check with no call or locking overhead.
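a sketch of the shape of such a helper; the real signature and write
mechanism may differ:

    static void pad(FILE *f, wchar_t c, int n)
    {
        /* the error check lives here so callers need not repeat it */
        if (ferror(f)) return;
        while (n-- > 0 && fputwc(c, f) != WEOF);
    }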
unlike with wide printf variants, encoding errors are not a vector by
which this bug is reachable, and the out() helper function already
ensured that no further output could be written after an output error,
transient or otherwise. however, the %n specifier could still be
processed after an error, yielding a side effect that wrongly implied
output had succeeded.
due to buffering effects, it's still possible for %n to show output as
having "succeeded", but for it never to appear on the underlying file
due to an error at flush time. this change, however, ensures that
processing of %n does not conflict with any error which has already
been seen.
this fixes a broader bug for which a special case was reported by
Bruno Haible, in the form of %n getting processed (and reporting the
number of wide characters which would have been written, but weren't)
after an encoding error (EILSEQ). in addition to the %n case, some but
not all of the format specifiers continued to attempt output after an
error. in particular, %c, %lc, and %s all used fputwc directly without
any check for error status.
as long as the error condition was permanent rather than transient,
these write attempts had no visible side effects, but in theory they
could become visible, for example with EAGAIN/EWOULDBLOCK or ENOSPC, if
the condition precluding output came to an end. this could produce
output with missing non-final data, rather than just truncated output,
albeit with the function still returning -1 as expected to report an
error.
to fix this, a check is added to stop processing of any new directive
(including %n) if the stream is already in error state, and direct use
of fputwc is replaced with calls to the out() helper function, which
checks for error status.
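a sketch of the guarded helper in the wide core, with ferror being the
lock-free macro form from stdio_impl.h:

    static void out(FILE *f, const wchar_t *s, size_t l)
    {
        /* stops at the first error; nothing more is written once the
           stream is in error state */
        while (l-- && !ferror(f)) fputwc(*s++, f);
    }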
note that fprintf is also used directly without checking error status,
but due to how commit d42269d7c8
previously attempted to solve the issue of output after error, the
call to fprintf does not attempt to write anything when the
wide-oriented stream is already in error state. this is non-obvious,
and is quite a hack, so it should be changed, but I've left it alone
for now to make the bug fix commit itself as non-invasive as possible.
this function was overlooked during the time64 transition, probably as
a result of not having any time-related types in its application-side
interface. however, for archs that lack the traditional poll syscall
and have only ppoll, it used timespec as part of its interface with
the kernel: the millisecond timeout was converted to a timespec to
pass to SYS_ppoll. this is a type/ABI mismatch on 32-bit archs with
legacy time32 syscalls.
only one supported arch, or1k, is affected. all of the others either
have SYS_poll, or are 64-bit.
rather than using timespec, define a type locally to match what the
kernel expects. the condition (SYS_ppoll_time64 == SYS_ppoll),
comparable to conditions used elsewhere in timespec-handling code,
evaluates true for "natively time64" 32-bit archs including x32,
future riscv32, and all future 32-bit archs (via definitions in
internal syscall.h). otherwise, the arch is either 64-bit or has
syscalls that take the legacy type, and in either case "long" is
correct.
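a sketch of the resulting fallback, with the locally defined
kernel-facing type (syscall_cp and _NSIG/8 are musl internals; names
are illustrative):

    #if SYS_ppoll_time64 == SYS_ppoll
    typedef long long kernel_ts_t;  /* natively time64 32-bit archs */
    #else
    typedef long kernel_ts_t;       /* 64-bit, or legacy time32 syscalls */
    #endif

    int poll(struct pollfd *fds, nfds_t n, int timeout)
    {
        return syscall_cp(SYS_ppoll, fds, n, timeout >= 0 ?
            ((kernel_ts_t[]){timeout/1000, timeout%1000*1000000}) : 0,
            0, _NSIG/8);
    }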
this fix is based on bug report and proposal by Alexey Izbyshev but
with a different approach to the changes to minimize the contextual
knowledge needed for a reader to understand the source file.
If the (normalized) timeout passed to select exceeds INT_MAX seconds on
an arch with SYS_pselect6_time64 and the kernel is too old to support
time64 syscalls, the timeout is implicitly converted to (32-bit) long on
the fallback path, losing its upper 32 bits and potentially becoming a
small positive value, violating the intended semantics, or even
a negative value, causing the fallback syscall to fail. Fix this by
saturating the timeout at INT_MAX as done in other time64 fallback
cases.
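A sketch of the saturation on the fallback path, with names following
the surrounding code (s holds the normalized seconds):

    /* time64 fallback: saturate rather than truncate to 32 bits */
    if (s > INT_MAX) s = INT_MAX;
    return syscall_cp(SYS_pselect6, n, rfds, wfds, efds,
        tv ? ((long[]){s, ns}) : 0, data);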
this is the best-effort fallback path for kernels that can't actually
support the dup3 functionality. it was setting the FD_CLOEXEC flag on the
target fd (new) even if the dup2 operation failed. normally that
shouldn't happen under correct usage, but it's possible if the source
fd is not open or intentionally invalid (e.g. -1).
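a sketch of the corrected step:

    int r = __syscall(SYS_dup2, old, new);
    /* only mark close-on-exec if the dup actually succeeded */
    if (r >= 0 && (flags & O_CLOEXEC))
        __syscall(SYS_fcntl, new, F_SETFD, FD_CLOEXEC);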
our dup3 code wrongly skipped directly to making the SYS_dup2 syscall
whenever the O_CLOEXEC bit of flags was not set. this is incorrect if
any new flags are ever added, as it would silently ignore them rather
than failing with an error.
archs which lack SYS_dup2 were unaffected.
adjust the logic so that SYS_dup3 is attempted whenever flags is
nonzero, and explicitly fail with EINVAL if SYS_dup3 is unavailable
and there are any unknown flags.
kernels using the fallback have an inherent close-on-exec race
condition and as such support for them is only best-effort anyway.
however, ignoring potential new flags is still very bad behavior.
instead, fail with EINVAL.
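a sketch of the adjusted dispatch, in musl's internal __syscall error
convention:

    if (flags) {
        int r = __syscall(SYS_dup3, old, new, flags);
        if (r != -ENOSYS) return __syscall_ret(r);
        /* SYS_dup3 unavailable: only O_CLOEXEC can be emulated */
        if (flags & ~O_CLOEXEC) return __syscall_ret(-EINVAL);
    }
    /* fall through to the best-effort SYS_dup2 path shown above */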
If the buffer passed to getservbyport_r is just enough to store two
pointers after aligning it, getnameinfo is called with buflen == 0
(which means that the service name is not needed) and trivially succeeds.
Then, strtol is called on the address just past the buffer end, and
if it doesn't happen to find the port number there, getservbyport_r
spuriously succeeds and returns the same bad address to the caller.
Fix this by ensuring that buflen is at least 1 when passed to
getnameinfo.
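A simplified sketch of the check (the real code also deals with
aligning the buffer first):

    /* two pointers for s_aliases, plus at least one byte for the name */
    if (buflen < 2*sizeof(char *) + 1) return ERANGE;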
getifaddrs computes &ctx->first->ifa even if ctx->first is NULL. While
this shouldn't be possible on the success path because the loopback
interface is hardcoded into the kernel, this is still possible on the
error path (for example, if __rtnetlink_enumerate couldn't create a
socket due to exceeding the fd limit).
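A sketch of the guard:

    /* don't form &ctx->first->ifa through a null pointer */
    *ifap = ctx->first ? &ctx->first->ifa : 0;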
accept4 emulation via accept ignores unknown flags, so it can spuriously
succeed instead of failing (or succeed without doing the action implied
by an unknown flag if it's added in a future kernel). Worse, unknown
flags trigger the fallback code even on modern kernels if the real
accept4 syscall returns EINVAL, because this is indistinguishable from
socketcall returning EINVAL due to lack of accept4 support.
Fix this by always failing with EINVAL if unknown flags are present and
the syscall is missing or failed with EINVAL.
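A sketch of the fixed fallback entry (socketcall_cp is musl's internal
syscall wrapper):

    int ret = socketcall_cp(accept4, fd, addr, len, flg, 0, 0);
    if (ret >= 0 || (errno != ENOSYS && errno != EINVAL)) return ret;
    if (flg & ~(SOCK_CLOEXEC|SOCK_NONBLOCK)) {
        errno = EINVAL;   /* never silently ignore unknown flags */
        return -1;
    }
    ret = accept(fd, addr, len);
    /* best-effort FD_CLOEXEC / O_NONBLOCK emulation follows */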
This is completely analogous to commit 633183b5d1.
Similar code called from __lookup_name is not affected because it checks
that the line contains the host name surrounded by blanks.
When IPv6 nameservers are present, __res_msend_rc attempts to disable
the IPV6_V6ONLY socket option to ensure that it can communicate with IPv4
nameservers (if they are present too) via IPv4-mapped IPv6 addresses.
However, this option can't be disabled on bound sockets, so setsockopt
always fails.
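The natural fix, implied here, is to clear the option before binding;
a sketch:

    /* IPV6_V6ONLY can only be changed while the socket is unbound */
    setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &(int){0}, sizeof(int));
    bind(fd, (void *)&sa, sl);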
A zero returned from recvmsg is currently treated as if some data were
received, so if a DNS server closes its TCP socket before sending the
full answer, __res_msend_rc will spin until the timeout elapses because
a POLLIN event will be reported on each poll. Fix this by treating an
early EOF as an error.
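A sketch of the check, context elided:

    ssize_t rlen = recvmsg(fd, &mh, 0);
    if (rlen <= 0) goto err;   /* 0 means the server closed early: an error */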
DNS parsing callbacks pass the response buffer end instead of the actual
response end to dn_expand, so a malformed DNS response can use message
compression to make dn_expand jump past the response end and attempt to
parse uninitialized parts of that buffer, which might succeed and return
garbage.
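A sketch of the corrected call, where plen is the actual response
length rather than the buffer size:

    /* the second argument bounds compression-pointer chasing */
    if (__dn_expand(packet, packet + plen, data, tmp, sizeof tmp) > 0)
        /* ... */;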
There are several issues with range checks in this function:
* The question section parsing loop can read up to two out-of-bounds
bytes before doing the range check and bailing out.
* The answer section parsing loop, in addition to the same issue as
above, uses the wrong length in the range check that doesn't prevent
OOB reads when computing len later.
* The len range check before calling the callback is off by 10. Also,
p+len can overflow in a (probably theoretical) case when p is within
2^16 from UINTPTR_MAX.
Because __dns_parse is used only with stack-allocated buffers, such
small overreads can't result in a segfault. The first two also don't
affect the function result, but the last one may result in getaddrinfo
incorrectly succeeding and returning up to 10 bytes past the
response buffer as part of the IP address, and in the (canon) name
returned by getaddrinfo/getnameinfo being affected by memory past the
response buffer (because dn_expand might interpret it as a pointer).
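An illustrative sketch of the tightened answer-section checks, with r
the response, rlen its actual length and p the cursor (the shape of
the fix, not the exact committed code):

    while (p - r < rlen && *p - 1U < 127) p++;  /* bounded name skip */
    if (p - r > rlen - 12) return -1;           /* name byte + 10-byte RR header */
    p += 1 + !!*p;
    len = p[8]*256 + p[9];
    if (len + 10 > rlen - (p - r)) return -1;   /* checked without forming p+len */
    if (callback(ctx, p[1], p + 10, len, r, rlen) < 0) return -1;
    p += 10 + len;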
Before this commit, DNS timeouts always used CLOCK_REALTIME, which
could produce spurious timeouts or delays if wall time changed for
whatever reason.
Now we try CLOCK_MONOTONIC and only fall back to CLOCK_REALTIME when
it is unavailable.
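A sketch of the clock selection as a millisecond helper:

    #include <time.h>

    static unsigned long mtime(void)
    {
        struct timespec ts;
        if (clock_gettime(CLOCK_MONOTONIC, &ts) < 0)
            clock_gettime(CLOCK_REALTIME, &ts);
        return (unsigned long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }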
As a result of using simple subtraction to implement the return values
for wcscmp and wcsncmp, integer overflow can occur (producing
undefined behavior, and in practice, a wrong comparison result). This
does not occur for meaningful character values (21-bit range) but the
functions are specified to work on arbitrary wchar_t arrays.
This patch replaces the subtraction with a little bit of code that
orders the characters correctly, returning -1 if the character from
the first string is smaller than the one from the second, 0 if they
are equal and 1 if the character from the first string is larger than
the one from the second.
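The resulting comparison, sketched for wcscmp:

    #include <wchar.h>

    int wcscmp(const wchar_t *l, const wchar_t *r)
    {
        for (; *l == *r && *l && *r; l++, r++);
        /* ordering without subtraction: no overflow possible */
        return *l < *r ? -1 : *l > *r;
    }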
A signed int shift overflowed when computing a constant mask; use a hex
literal instead. This is unlikely to cause actual issues unless the
code was compiled with ubsan or similar instrumentation specifically
to catch this. The stripped libc.so is unchanged on x86_64.
Reported by q66 on irc.
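A hypothetical illustration of the pattern:

    int bad = 0xff << 24;        /* shifts a set bit into the sign bit: UB */
    unsigned good = 0xff000000;  /* same mask value, well-defined */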
When a dot is encountered, the loop counter is incremented before
exiting the loop, but the corresponding ip array element is left
uninitialized, so the subsequent memmove (if "::" was seen) and the
loop copying ip to the output buffer will operate on an uninitialized
uint16_t.
The uninitialized data never directly influences the control flow and
is overwritten on successful return by the second half of the parsed
IPv4 address. But it's better to fix this to avoid unexpected
transformations by a sufficiently smart compiler and reports from
UB-detection tools.
The kernel defines a limit on the number of fds that can be passed
through an SCM_RIGHTS ancillary message as SCM_MAX_FD. The value was
255 before kernel 2.6.38 (after that it is 253), and an SCM_RIGHTS
ancillary message with 255 fds requires 1040 bytes, slightly more than
the current 1024 byte internal buffer in sendmsg. 1024 is an arbitrary
size, so increase it to match the arbitrary size limit in the
kernel. This fixes tests that verify support for up to SCM_MAX_FD
fds.
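The arithmetic behind the 1040-byte figure, on a typical 64-bit arch:

    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        /* 16-byte cmsghdr + 255*4 = 1020 bytes of fds, padded to 1024 */
        printf("%zu\n", (size_t)CMSG_SPACE(255 * sizeof(int)));  /* 1040 */
        return 0;
    }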
until the mq notification event arrives, it is mandatory that signals
be blocked. otherwise, a signal can be received, and its handler
executed, in a thread which does not yet exist on the abstract
machine.
after the point of the event arriving, having signals blocked is not a
conformance requirement but a QoI requirement. while the application
can unblock any signals it wants unblocked in the event handler
thread, it could not block signals that did not start out blocked
without a race window where they are momentarily unblocked, and this
would preclude controlled delivery or other forms of acceptance
(sigwait, etc.) anywhere in the application.
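a sketch of the standard pattern satisfying this: block all signals in
the initiating thread so the worker inherits a fully blocked mask,
then restore:

    sigset_t all, old;
    sigfillset(&all);
    pthread_sigmask(SIG_BLOCK, &all, &old);
    pthread_create(&td, &attr, worker, args);  /* starts fully blocked */
    pthread_sigmask(SIG_SETMASK, &old, 0);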
in the error path where the mq_notify syscall fails, the initiating
thread may have closed the socket before the worker thread calls recv
on it. even in the absence of such a race, if the recv call failed,
e.g. due to seccomp policy blocking it, the worker thread could
proceed to close, producing a double-close condition.
this can all be simplified by moving the mq_notify syscall into the
new thread, so that the error case does not require pthread_cancel.
now, the initiating thread only needs to read back the error status
after waiting for the worker thread to consume its arguments.
disabling cancellation around the pthread_join call seems to be the
safest and logically simplest fix. i believe it would also be possible
to just perform the unmap directly here after __tl_sync, removing the
dependency on pthread_join, but such an approach would redundantly
encode many more implementation assumptions.
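a sketch of the fix:

    int cs;
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &cs);
    pthread_join(td, 0);   /* join can no longer act on cancellation */
    pthread_setcancelstate(cs, 0);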
the logic to check hwcap for SPE register file inadvertently clobbered
the val argument before use. switch to a different work register so
this doesn't happen.
we wrongly defined a dummy SA_RESTORER flag on these archs, despite
the kernel interface not actually having such a feature. on archs
which lack SA_RESTORER, the kernel sigaction structure also lacks the
restorer function pointer member, which means the signal mask appears
at a different offset. the kernel was thereby interpreting the bits of
the code address as part of the signal set to be masked while handling
the signal.
this patch removes the erroneous SA_RESTORER definitions from archs
which do not have it, makes access to the member conditional on
whether SA_RESTORER is defined for the arch, and removes the
now-unused asm for the affected archs.
because there are reportedly versions of qemu-user which also use the
wrong ABI here, the old ksigaction struct size is preserved with an
unused member at the end. this is harmless and mitigates the risk of
such a bug turning into a buffer overflow onto the sigaction
function's stack.
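a sketch of the resulting kernel-facing layout:

    struct k_sigaction {
        void (*handler)(int);
        unsigned long flags;
    #ifdef SA_RESTORER
        void (*restorer)(void);  /* only on archs that really have it */
    #endif
        unsigned mask[2];
    #ifndef SA_RESTORER
        void *unused;            /* preserves old size for buggy qemu-user */
    #endif
    };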
the result of the 0xffff mask with the exit status could have bit 15
set, in which case multiplying by 0x10001 overflows 32-bit signed int.
making the multiply unsigned avoids the overflow. it also changes the
sign extension behavior of the subsequent >> operation, but the
affected bits are all unwanted anyway and all discarded by the cast to
short.
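a hypothetical illustration with bit 15 set:

    int st = 0x8000;             /* masked status, bit 15 set */
    /* st * 0x10001 would be 0x80008000: signed int overflow, UB */
    unsigned u = st * 0x10001U;  /* well-defined unsigned arithmetic */
    short res = u >> 8;          /* unwanted high bits discarded */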
mips has its own mechanisms for DT_DEBUG because it makes _DYNAMIC
read-only, and the original mechanism, DT_MIPS_RLD_MAP, was
PIE-incompatible. DT_MIPS_RLD_MAP_REL was added to remedy this, but we
never implemented support for it. add it now using the same idioms for
mips-specific ldso logic.
memmem has been adopted for the next issue of POSIX (outcome of
tracker item 1061). since mem* is in the reserved namespace for
string.h it's already fully conforming to expose it by default, so
just do so.
while no lock is held here, so this is not a lock-order issue,
replacement malloc is likely to want to use pthread_atfork, possibly making the
call to malloc infinitely recursive.
even if not, there is no reason to prefer an application-provided
malloc here.
printf_core() runs twice, and during its first run, nl_arg is
uninitialized and must not be read. It gets initialized at the end of
the first run. Conversely, nl_type does not need to be set during the
second run, as its useful life has ended at that point, since the only
time it is read is during that exact same initialization. Therefore we
can simply alternate the assignments.
p and w do still need to get values assigned to them, since at least one
line in the same if-statement depends on that, but they can be dummy
values. arg does not need to be assigned, since in the first run, we
encounter a continue statement before using the argument.
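A sketch of the alternation, with a hypothetical flag distinguishing
the two passes:

    if (argpos >= 0) {
        if (first_pass) nl_type[argpos] = st;  /* record type only */
        else arg = nl_arg[argpos];             /* args are valid now */
    }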
because the has-waiters state in the semaphore value futex word is
only representable when the value is zero (the special value -1
represents "0 with potential new waiters"), it's lost if intervening
operations make the semaphore value positive again. this creates an
ABA issue in sem_post, whereby the post uses a stale waiters count
rather than re-evaluating it, skipping the futex wake if the stale
count was zero.
the fix here is based on a proposal by Alexey Izbyshev, with minor
changes to eliminate costly new spurious wake syscalls.
the basic idea is to replace the special value -1 with a sticky
waiters bit (repurposing the sign bit) preserved under both wait and
post. any post that takes place with the waiters bit set will perform
a futex wake.
to be useful, the waiters bit needs to be removable, and to remove it
safely, we perform a broadcast wake instead of a normal single-task
wake whenever removing the bit. this lets any un-accounted-for waiters
wake and re-add the waiters bit if they still need it.
there are multiple possible choices for when to perform this
broadcast, but the optimal choice seems to be doing it whenever the
observed waiters count is less than two (semantically, this means
exactly one, but we might see a stale count of zero). in this case,
the expected number of threads to be woken is one, with exactly the
same cost as a non-broadcast wake.
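a sketch of the resulting post operation (a_cas and __wake are musl
internals; __val[0] is the value futex word with the sign bit as the
sticky waiters bit, __val[1] the waiters count, __val[2] the futex
priv flag):

    int val, new, waiters, priv = sem->__val[2];
    do {
        val = sem->__val[0];
        waiters = sem->__val[1];
        if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
            errno = EOVERFLOW;
            return -1;
        }
        new = val + 1;
        /* remove the sticky bit only when at most one waiter might
           exist; the broadcast below lets stragglers re-add it */
        if (waiters <= 1) new &= ~0x80000000;
    } while (a_cas(sem->__val, val, new) != val);
    if (val < 0) __wake(sem->__val, waiters > 1 ? 1 : -1, priv);
    return 0;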
when PAGE_SIZE is not constant, internal/libc.h defines it to expand
to libc.page_size. however, kernel_mapped_dso, reachable from stage 2
of the dynamic linker bootstrap (__dls2), needs PAGE_SIZE to interpret
the relro range. at this point the libc object is both uninitialized
and invalid to access according to our model for bootstrapping, which
does not assume any external-linkage objects are accessible until
stages 2b/3. in practice it likely worked because hidden visibility
tends to behave like internal linkage, but this is not a property that
the dynamic linker was designed to rely upon.
this bug likely manifested as relro malfunction on archs with variable
page size, due to incorrect mask when aligning the relro bounds to
page boundaries.
while there are certainly more direct ways to fix the known problem
point here, a maximally future-proof way is to just bypass the libc.h
PAGE_SIZE definition in the dynamic linker and instead have dynlink.c
define its own internal-linkage object for variable page size. then,
if anything else in stage 2 ever ends up referencing PAGE_SIZE, it
will just automatically work right.
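a sketch of the bypass in dynlink.c, placed before the inclusion of
libc.h so its definition never applies (names illustrative):

    static size_t ldso_page_size;  /* internal linkage: safe in stage 2 */
    #ifndef PAGE_SIZE
    #define PAGE_SIZE ldso_page_size
    #endif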
this is analogous to skip_relative logic in do_relocs -- because
relative relocations for the dynamic linker itself were already
performed at entry (stage 1), they must not be applied again.
the rule that the longest digit sequence not beginning with a zero is
greater only applies when both sequences being compared are
non-degenerate. this is spelled out explicitly in the man page, which
may be deemed authoritative for this nonstandard function: "If one or
both of these is empty, then return what strcmp(3) would have
returned..."
we were wrongly treating any sequence of digits not beginning with a
zero as greater than a non-digit in the other string.
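a concrete case, where the second string's digit sequence is
degenerate and the result must therefore match strcmp:

    #define _GNU_SOURCE
    #include <string.h>

    /* '1' (0x31) < 'a' (0x61): must be negative; the bug made it positive */
    int ok = strverscmp("1", "a") < 0;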
if async cancellation is enabled and acted upon, the stack pointer is
not necessarily pointing to a __syscall_cp_asm stack frame. the
contents of the stack being wrong don't really matter, but if the
stack pointer is not suitably aligned, the procedure call ABI is
violated when calling back into C code via __cancel, and pthread_exit,
cancellation cleanup handlers, TSD destructors, etc. may malfunction
or crash.
for the async cancel case, just call __cancel directly like we did
prior to commit 102f6a01e2. restore the
signal mask prior to doing this since the cancellation handler runs
with all signals blocked.
commit f081d5336a fixed
gethostbyname[2]_r to treat negative results as a non-error, leaving
gethostbyname[2] wrongly returning a pointer to the unfilled result
buffer rather than a null pointer. since, as documented with commit
fe82bb9b92, the caller of
gethostby{name[2],addr}_r can always rely on the result pointer being
set, use that consistently rather than trying to duplicate logic about
whether we have a result or not in gethostby{name[2],addr}.
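a sketch of the wrapper pattern, with the static result storage (h,
buf, size) elided:

    struct hostent *res;
    gethostbyname2_r(name, af, &h, buf, size, &res, &h_errno);
    return res;   /* null pointer for negative results, &h on success */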
the only functional change here should be that MAXADDRS is only
checked for RRs that provide address results, so that a CNAME which
appears after an excessive number of address RRs does not get ignored.
I'm not aware of any servers that order the RRs this way, and it may
even be forbidden to do so, but I prefer having the callback logic not
be order dependent.
other than that, the motivation for this change is that the A and AAAA
cases were mostly duplicate code that could be combined as a single
code path.