Treat %gs the same way we treat %ds/%es/%fs: restore it in INTRFASTEXIT
on 32bit LWPs.
On Xen however, its behavior does not change, because we need to do an
hypercall before INTR_RESTORE_GPRS, and that's too complicated for now.
As a side effect, this change fixes a bug in the ACPI wakeup code; %fs/%gs
were not restored on 32bit LWPs, and chances are they would segfault
shortly afterwards.
Support for USER_LDT on amd64 is almost complete now.
vmin is only an optional hint since we're not passing UVM_FLAG_FIXED,
but that doesn't mean we should use uninitialized stack garbage as
the hint.
Noted by chs@.
Candidate fix for PR kern/45718: `processes sometimes get stuck and
spin in vm_map', a problem that has been plaguing all our 32-bit
ports for years.
Since we currently use large (256k) buffers for execargs, and since
nobody has stepped up to tackle breaking them into bite-sized (or at
least page-sized) chunks, after KVA gets sufficiently fragmented we
can't allocate new execargs buffers from kernel_map.
Until 2008, we always carved out KVA for execargs on boot with a uvm
submap exec_map of kernel_map. Then ad@ found that the uvm_km_free
call, to discard them when done, cost about 100us, which a pool
avoided:
https://mail-index.NetBSD.org/tech-kern/2008/06/25/msg001854.html
https://mail-index.NetBSD.org/tech-kern/2008/06/26/msg001859.html
ad@ _simultaneously_ introduced a pool _and_ eliminated the reserved
KVA in the exec_map submap. This change preserves the pool, but
restores exec_map (with less code, by putting it in MI code instead
of copying it in every MD initialization routine).
Patch proposed on tech-kern:
https://mail-index.NetBSD.org/tech-kern/2017/10/19/msg022461.html
Patch tested by bouyer@:
https://mail-index.NetBSD.org/tech-kern/2017/10/20/msg022465.html
I previously discussed the issue on tech-kern before I knew of the
history around exec_map:
https://mail-index.NetBSD.org/tech-kern/2012/12/09/msg014695.html
The candidate workaround I proposed of using pool_setlowat to force
preallocation of KVA would also force preallocation of physical RAM,
which is a waste not incurred by using exec_map, and which is part of
why I never committed it.
There may remain a general problem that if thread A calls pool_get
and tries to service that request by a uvm_km_alloc call that hangs
because KVA is scarce, and thread B does pool_put, the pool_put in
thread B will not notify the pool_get in thread A that it doesn't
need to wait for KVA, and so thread A may continue to hang in
uvm_km_alloc. However,
(a) That won't apply here, because there is exactly as much KVA
available in exec_map as exec_pool will ever try to use.
(b) It is possible that this may not even matter in other cases as long as
the page daemon eventually tries to shrink the pool, which will cause
a uvm_km_free that can unhang the hung uvm_km_alloc.
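The reasoning in (a) can be illustrated with a small userland model. This is
only a sketch, not the UVM or pool(9) code: the names execargs_get,
execargs_put, exec_arena and the cap EXECARGS_NBUFS are hypothetical, and a
static array stands in for the address range reserved by exec_map. Because
the range is reserved once and holds exactly as many buffers as will ever be
handed out, a get can fail only when every buffer is in use, never because
the address space around it is fragmented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes for illustration; not the kernel's actual values. */
#define EXECARGS_BUFSIZE (256 * 1024)	/* the "large (256k) buffers" */
#define EXECARGS_NBUFS   4		/* assumed cap on concurrent buffers */

/* Range reserved once up front, standing in for the exec_map submap. */
static unsigned char exec_arena[EXECARGS_NBUFS][EXECARGS_BUFSIZE];
static int exec_slot_busy[EXECARGS_NBUFS];

/* Hand out one buffer from the reserved range, or NULL if all are in use. */
static void *
execargs_get(void)
{
	for (size_t i = 0; i < EXECARGS_NBUFS; i++) {
		if (!exec_slot_busy[i]) {
			exec_slot_busy[i] = 1;
			return exec_arena[i];
		}
	}
	return NULL;	/* a real pool would wait here for a put */
}

/* Return a buffer; its slot becomes available to the next get. */
static void
execargs_put(void *buf)
{
	size_t i = (size_t)((unsigned char *)buf - &exec_arena[0][0]) /
	    EXECARGS_BUFSIZE;

	assert(i < EXECARGS_NBUFS && exec_slot_busy[i]);
	exec_slot_busy[i] = 0;
}
```

In this model, once all EXECARGS_NBUFS buffers are taken, the next
execargs_get returns NULL, and it succeeds again as soon as any buffer has
been passed to execargs_put.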
XXX pullup-8
XXX pullup-7
XXX pullup-6
XXX pullup-5, perhaps...
and ata_channel_destroy() respectively, to make attachment code simpler,
and to make it easier to spot special queue manipulation like cmdide(4)
on topic of PR kern/52606
product array, rather than switch inside attach routine
XXX judging from product name, Silicon Image 0680 might be newer than 0649
XXX and hence have actually independent channels, but I don't have the hw
XXX so keeping as-is
no functional change, just to improve visibility in course of fixing
PR kern/52606
Treat %fs the same way we treat %ds and %es. For a new 32bit LWP %fs is
set to GUDATA32_SEL, and always updated in INTRFASTEXIT.
This solves an important issue we had until now: we couldn't handle the
faults generated by the "movw $val,%fs" instructions, because they were
deep into the kernel context. Now %fs can fault only in INTRFASTEXIT,
which is safe.
Note that it also fixes a bug I believe affected the kernel: on AMD CPUs,
setting %fs to zero does not flush the internal register state, and
therefore we could leak the %fs base address when context-switching. This
being said, I couldn't trigger the issue on the AMD cpu I have. Whatever,
it's fixed now, since we first set %fs to GUDATA32 - which does flush the
register state.
Right now, we are saving and restoring %ds/%es each time we enter/leave the
kernel. However, we let %fs/%gs live in the kernel space, and we rely on
the fact that when switching to an LWP, %fs/%gs are set right away (via
cpu_switchto or setregs).
It has two drawbacks: we are taking care of %ds/%es while they are
deprecated (useless) on 64bit LWPs, and we are restricting %fs/%gs while
they still have a meaning on 32bit LWPs.
Therefore, handle 32bit and 64bit LWPs differently:
* 64bit LWPs use fixed segregs, which are not taken care of.
* 32bit LWPs have dynamic segregs, always saved/restored.
For now, only %ds and %es are changed; %fs and %gs will be in the next
passes.
The trapframe is constructed as usual. In INTRFASTEXIT, we restore %ds/%es
depending on the %cs value. If %cs contains one of the two standard 64bit
selectors, don't do anything. Otherwise, restore everything.
When doing a context switch, just restore %ds/%es to their default values.
On a 32bit LWP they will be overwritten by INTRFASTEXIT; on a 64bit LWP
they won't be updated.
In the ACPI wakeup code, restore %ds/%es to the default 64bit user value.
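The return-to-user decision described above can be sketched in C. The real
INTRFASTEXIT is assembly; this is only a model of its logic. The selector
values, struct layouts and function names below are hypothetical and chosen
for illustration, not taken from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical selector values, chosen only for this sketch. */
#define SKETCH_UCODE64_A	0x0023u	/* first standard 64bit %cs */
#define SKETCH_UCODE64_B	0x0033u	/* second standard 64bit %cs */
#define SKETCH_UDATA_DEFAULT	0x001bu	/* default user %ds/%es */

struct sketch_trapframe { uint16_t cs, ds, es; };	/* saved on entry */
struct sketch_cpu { uint16_t ds, es; };			/* live registers */

/* 64bit LWPs are recognized by %cs and use fixed segregs. */
static bool
uses_fixed_segregs(uint16_t cs)
{
	return cs == SKETCH_UCODE64_A || cs == SKETCH_UCODE64_B;
}

/* Context switch: just put the defaults back. */
static void
ctxsw_reset(struct sketch_cpu *cpu)
{
	assert(cpu != NULL);
	cpu->ds = SKETCH_UDATA_DEFAULT;
	cpu->es = SKETCH_UDATA_DEFAULT;
}

/* Return to userland: restore %ds/%es only for non-64bit %cs values. */
static void
fastexit_restore(const struct sketch_trapframe *tf, struct sketch_cpu *cpu)
{
	assert(tf != NULL && cpu != NULL);
	if (uses_fixed_segregs(tf->cs))
		return;		/* 64bit LWP: leave the defaults alone */
	cpu->ds = tf->ds;	/* 32bit LWP: dynamic segregs */
	cpu->es = tf->es;
}
```

After ctxsw_reset, a frame whose cs is one of the two 64bit values leaves
ds/es at the default, while any other cs causes the saved ds/es to be
loaded.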
extend the uint64_t's when building it, so we're leaking 48 bits of kernel
stack to userland.
Having said that, it appears that I unintentionally fixed most of this
issue in locore.S::rev1.127 - by building the frame with interrupts
disabled, we are implicitly guaranteeing that the structure doesn't get
overwritten by the kernel. Which means, we are leaking to userland data
that comes from userland anyway.
(still other places with this issue, but I'll fix them differently)
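The "48 bits" figure can be illustrated as follows, assuming (the start of
this message is cut off) that the issue is 16-bit values being stored into
64-bit frame slots without zero-extension. The helper names are hypothetical
and the stale value is made up; this models the arithmetic only, not the
locore.S code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model a 16-bit store into the low bits of a 64-bit slot: whatever was
 * in the upper 48 bits stays there and is copied out with the slot.
 */
static uint64_t
store16_no_extend(uint64_t slot, uint16_t val)
{
	return (slot & ~UINT64_C(0xffff)) | val;
}

/* Zero-extending store: the previous slot contents are fully replaced. */
static uint64_t
store16_zero_extend(uint64_t slot, uint16_t val)
{
	uint64_t out = val;

	(void)slot;			/* old contents are discarded */
	assert((out >> 16) == 0);	/* upper 48 bits are clear */
	return out;
}
```

With a stale slot of 0xdeadbeefcafe1234 and a value of 0x001b, the first
helper yields 0xdeadbeefcafe001b and the second yields 0x1b.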
confusion in the code (in part introduced by myself), and clearly this
place is not supposed to handle 32bit LWPs.
Right now we're returning EINVAL, but verily we would need to redirect
these calls to their netbsd32 counterparts.
code duplication, and reducing the size of /bin/sh by a trivial amount.
NFCI.
This is being done now as there are two other changes forthcoming, both
of which benefit - one would result in even more code duplication without
this, the other might need to alter how this is done, and doing it after this
means there's just one place to change (if required).