into cpu_info directly. This concerns only {i386, Xen-i386, Xen-amd64},
because amd64 already has a direct map that is way faster than that.
There are two major issues with the global array: maxcpus entries are
allocated even though common i386 machines are unlikely to have that
many CPUs, and the base VA of these entries is not cache-line-aligned,
which practically guarantees cache-line thrashing each time the VAs
are entered.
Now the number of tmp VAs allocated is proportional to the number of
CPUs attached (which reduces memory consumption), and the base is
properly aligned.
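A rough sketch of the new layout (the member name and slot count here
are illustrative, not the actual patch): the per-CPU slots live in
cpu_info and are padded to the cache line size, so two CPUs never
share a line:

/* Illustrative sketch only. COHERENCY_UNIT is NetBSD's cache line
 * size constant, __aligned() the usual sys/cdefs.h attribute. */
struct cpu_info {
	/* ... existing members ... */
	vaddr_t	ci_tmp_vas[4] __aligned(COHERENCY_UNIT);
};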
On my 3-core AMD, the number of DC_refills_L2 events triggered when
performing 5x10^6 calls to pmap_zero_page on two dedicated cores is on
average divided by two with this patch.
Discussed on tech-kern a little.
by UVM, so there is no physical loss.
On amd64 we always remap the kernel text with 2MB pages, and because of
the 1MB start address we were forced to map [0MB; 2MB[ inside the first
large page. The problem is that the lower half is used by UVM to
allocate physical pages, and some of these pages could be given to
userland. We could end up with userland-controllable data mapped into
the kernel text on a privileged page, which is far from a good idea
from a security point of view.
I am not fixing i386 yet, because the large page size there depends on
PAE, and we probably don't want the text located at 4MB on low-memory
systems.
(note: I didn't introduce this issue, it was already there when I came in)
do not attempt to pmap_tlb_shootdown() after a pmap_kenter_ma() during
boot. pmap_tlb_shootdown() assumes a post-boot environment. Instead,
invalidate the entry on the local CPU only.
XXX: to DTRT, probably this assumption needs re-examination.
XXX: The tradeoff is a (predicted) single word-size comparison
penalty, so perhaps the decision needs performance stats.
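A hedged sketch of the resulting shape, using the generic 'cold' flag
as the illustrative predicate (the actual test and arguments may
differ):

/* Skip the cross-CPU shootdown machinery during early boot and just
 * invalidate the mapping on the local CPU; opte is the previous PTE. */
if (cold)
	pmap_update_pg(va);	/* local invlpg only */
else
	pmap_tlb_shootdown(pmap_kernel(), va, opte, TLBSHOOT_KENTER);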
all, with magic offsets here and there in different layers of the system.
It is just blind luck that everything has always worked as expected so
far.
Due to this wrong design we have a problem now: we allocate one physical
page for lapic, and it happens to overlap with the dummy page, which
causes the system to crash.
Fix this by keeping the dummy va directly in a variable instead of
using magic offsets. The asm locore now increments the first pa to
hide the dummy page from machdep and pmap.
Add new ptrace(2) calls:
- PT_COUNT_WATCHPOINTS - count the number of available hardware watchpoints
- PT_READ_WATCHPOINT - read struct ptrace_watchpoint from the kernel state
- PT_WRITE_WATCHPOINT - write new struct ptrace_watchpoint state; this
  includes enabling and disabling watchpoints
The ptrace_watchpoint structure contains MI and MD parts:
typedef struct ptrace_watchpoint {
	int		pw_index;	/* HW Watchpoint ID (count from 0) */
	lwpid_t		pw_lwpid;	/* LWP described */
	struct mdpw	pw_md;		/* MD fields */
} ptrace_watchpoint_t;
For example amd64 defines MD as follows:
struct mdpw {
	void	*md_address;
	int	md_condition;
	int	md_length;
};
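A minimal usage sketch from the debugger side, assuming the usual
ptrace(2) convention of passing the structure via the addr argument
and its size via data, and the count being returned as the result;
the MD field usage is only illustrative:

#include <string.h>
#include <sys/types.h>
#include <sys/ptrace.h>

/* Arm hardware watchpoint slot 0 in a traced, stopped child.
 * Returns the number of available watchpoints, or -1 on error. */
static int
set_first_watchpoint(pid_t child, void *watched_addr)
{
	struct ptrace_watchpoint pw;
	int n;

	n = ptrace(PT_COUNT_WATCHPOINTS, child, NULL, 0);
	if (n <= 0)
		return -1;

	memset(&pw, 0, sizeof(pw));
	pw.pw_index = 0;			/* first HW slot */
	pw.pw_lwpid = 0;			/* main LWP */
	pw.pw_md.md_address = watched_addr;	/* amd64 MD part */
	if (ptrace(PT_WRITE_WATCHPOINT, child, &pw, sizeof(pw)) == -1)
		return -1;
	return n;
}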
These calls are protected with the __HAVE_PTRACE_WATCHPOINTS guard.
Tested on amd64, initial support added for i386 and XEN.
Sponsored by <The NetBSD Foundation>
The benefits of the change are:
- We can reduce code
- We can provide the same behavior between drivers
  - Where/when if_ipackets is counted up
  - Note that some drivers still update packet statistics in their
    own way (periodic updates)
- bpf_mtap now runs in softint context
  - This makes it easy to MP-ify bpf
Proposed on tech-kern and tech-net
This feature was intended to detect stack overflow with the CPU Debug
Registers (x86). It was never ported to other architectures, not even
amd64, and it would have to be adapted for SMP...
Currently there might be better ways to detect stack overflows, like
page mapping protection. Since the number of Debug Registers is
restricted (4 on x86), tear it down completely.
This interface introduced helper functions for the Debug Registers;
they will be replaced with the new <x86/dbregs.h> interface.
KSTACK_CHECK_DR0 was disabled by default and won't affect ordinary
users.
Sponsored by <The NetBSD Foundation>
This is a more opaque and appropriate type, as vaddr_t is meant to be
used for virtual address values. Not all DRs on x86 represent virtual
addresses (DR6 and DR7 definitely do not).
No functional change intended.
Change suggested by <christos>
Sponsored by <The NetBSD Foundation>
There are 8 Debug Registers on i386 (available at least since the
80386) and 16 on AMD64. Currently DR4-DR5 are reserved on both CPU
families and DR8-DR15 are still reserved on AMD64. Therefore add
accessors for DR0-DR3 and DR6-DR7 for all ports.
Debug Registers x86:
* DR0-DR3 Debug Address Registers
* DR4-DR5 Reserved
* DR6 Debug Status Register
* DR7 Debug Control Register
* DR8-DR15 Reserved
Access to these registers is available only from the kernel (ring 0),
as it requires privileged access. For this reason the special Xen
functions must be used to get and set the registers in XEN3 kernels.
The Xen-specific functions as defined in NetBSD:
- HYPERVISOR_get_debugreg()
- HYPERVISOR_set_debugreg()
This code extends the existing rdr6() and ldr6() accessors with the
following additions:
- rdr0() & ldr0()
- rdr1() & ldr1()
- rdr2() & ldr2()
- rdr3() & ldr3()
- rdr7() & ldr7()
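For illustration, here is what one such native (non-Xen) accessor pair
looks like on amd64; i386 uses movl, and the XEN3 variants go through
the hypercalls listed above instead (a sketch, not the exact committed
code):

/* Read/load DR0 with the privileged mov-from/to-DR instructions;
 * usable only in ring 0. */
static __inline vaddr_t
rdr0(void)
{
	vaddr_t val;

	__asm volatile("movq %%dr0,%0" : "=r" (val));
	return val;
}

static __inline void
ldr0(vaddr_t val)
{
	__asm volatile("movq %0,%%dr0" : : "r" (val));
}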
Traditionally the accessors for DR6 took a vaddr_t argument. While
that is an appropriate type for DR0-DR3, DR6 and DR7 should arguably
use u_long; however, it's not a big deal. The resulting functionality
is equivalent, so stick to this convention and use the vaddr_t type
for all DR accessors.
There was already a function defined for rdr6() in XEN, but it had a
nit on AMD64: it cast the result of HYPERVISOR_get_debugreg() to
u_int (32-bit on AMD64), truncating it. That still works for DR6, but
for the sake of simplicity always return the full 64-bit value.
The new accessors duplicate the functionality of the dr0() function
available on i386 under the KSTACK_CHECK_DR0 option. dr0() is a
specialized layer with logic to set appropriate types of interrupts;
the new accessors are designed to pass verbatim values from userland
(with simple sanity checks in the kernel). At the moment there are no
plans to make KSTACK_CHECK_DR0 coexist with debug-register use by
user applications (debuggers).
options KSTACK_CHECK_DR0
	Detect kernel stack overflow using the DR0 register. This
	option uses the DR0 register exclusively, so you can't use DR0
	for other purposes (e.g., hardware breakpoints) if you turn it
	on.
The KSTACK_CHECK_DR0 functionality was designed for i386 and never ported
to amd64.
Code tested on i386 and amd64 with kernels: GENERIC, XEN3_DOMU, XEN3_DOM0.
Sponsored by <The NetBSD Foundation>
LX_SLOT_KERNBASE: the base address is KERNBASE, and we just start
mapping from KERNTEXTOFF. This is for symmetry with normal amd64 and
does not change anything.
kernel doesn't boot:
(XEN) mm.c:2394:d139v0 Bad type (saw 5400000000000001 != exp 7000000000000000) for mfn 1136f5 (pfn 621)
(XEN) mm.c:887:d139v0 Could not get page type PGT_writable_page
(XEN) mm.c:939:d139v0 Error getting mfn 1136f5 (pfn 621) from L1 entry 00000001136f5003 for l1e_owner=139, pg_owner=139
(XEN) mm.c:1254:d139v0 Failure in alloc_l1_table: entry 33
(XEN) mm.c:2141:d139v0 Error while validating mfn 112f57 (pfn dbf) for type 1000000000000000: caf=8000000000000003 taf=1000000000000001
(XEN) mm.c:947:d139v0 Attempt to create linear p.t. with write perms
(XEN) mm.c:1330:d139v0 Failure in alloc_l2_table: entry 3
(XEN) mm.c:2141:d139v0 Error while validating mfn 112f5b (pfn dbb) for type 2200000000000000: caf=8000000000000003 taf=2200000000000001
(XEN) mm.c:1412:d139v0 Failure in alloc_l3_table: entry 3
(XEN) mm.c:2141:d139v0 Error while validating mfn 112f60 (pfn db6) for type 3000000000000000: caf=8000000000000003 taf=3000000000000001
(XEN) mm.c:3044:d139v0 Error while pinning mfn 112f60
(XEN) traps.c:459:d139v0 Unhandled bkpt fault/trap [#3] on VCPU 0 [ec=0000]
(XEN) domain_crash_sync called from entry.S: fault at ffff82d080231894 compat_create_bounce_frame+0xda/0xf2
The API is used to set (or reset) the receive interface of an mbuf.
These are the counterparts of m_get_rcvif, which will come in another
commit; they hide the internals of the rcvif operation and reduce the
diff of the upcoming change.
No functional change.
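A hedged sketch of the pair, assuming rcvif is still stored as a plain
pointer in the packet header (which is exactly what these helpers are
meant to hide from callers):

/* Illustrative only: callers no longer touch m_pkthdr.rcvif. */
static __inline void
m_set_rcvif(struct mbuf *m, struct ifnet *ifp)
{
	m->m_pkthdr.rcvif = ifp;
}

static __inline void
m_reset_rcvif(struct mbuf *m)
{
	m->m_pkthdr.rcvif = NULL;
}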
guest requirements and support.
Add infrastructure to query the hypervisor about feature support.
For verbose boot, print the features supported by the hypervisor for
this guest.
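The query itself presumably follows the public Xen interface
(XENVER_get_features); a minimal sketch, with the function name and
error handling being illustrative:

/* Ask the hypervisor for its feature bitmap via the xen_version
 * hypercall with the XENVER_get_features subcommand. */
static uint32_t
xen_query_features(void)
{
	xen_feature_info_t fi;

	fi.submap_idx = 0;	/* first submap */
	if (HYPERVISOR_xen_version(XENVER_get_features, &fi) < 0)
		return 0;
	return fi.submap;	/* bitmask of XENFEAT_* flags */
}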
xennet_stop()) so that we don't get events while sleeping (e.g.
in softint_disestablish()) when some structures have already been
freed.
Problem reported and patch tested by Rohan Desai.
This change intends to run the whole network stack in softint context
(or normal LWP), not hardware interrupt context. Note that the work
is still incomplete with this change; to that end, we also have to
softint-ify if_link_state_change (and bpf), which can still run in
hardware interrupt context.
This change softint-ifies at ifp->if_input, which is called from
each device driver (and ieee80211_input), to ensure Layer 2 (e.g.,
ether_input and bridge_input) runs in softint context. To this end,
we provide a framework (called percpuq) that utilizes softint(9)
and percpu ifqueues. With this patch, the rxintr of most drivers just
queues received packets and schedules a softint, and the softint
dequeues packets and does the rest of the packet processing.
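To make the new shape concrete, a hedged sketch of a driver Rx
interrupt under this scheme (all xx_* names are invented for
illustration):

/* Hypothetical softc, just enough for the sketch. */
struct xx_softc {
	struct ethercom sc_ethercom;	/* contains the ifnet */
};
static struct mbuf *xx_get_frame(struct xx_softc *); /* Rx ring pull */

/* Rx interrupt: do not run the stack here; ifp->if_input now just
 * enqueues the mbuf on the per-CPU queue and schedules the softint
 * that does the real Layer 2+ processing. */
static void
xx_rxintr(struct xx_softc *sc)
{
	struct ifnet *ifp = &sc->sc_ethercom.ec_if;
	struct mbuf *m;

	while ((m = xx_get_frame(sc)) != NULL)
		(*ifp->if_input)(ifp, m);
}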
To minimize changes to each driver, percpuq is allocated in struct
ifnet for now and that is initialized by default (in if_attach).
We probably have to move percpuq to the softc of each driver, but
that is future work. At this point, only wm(4) has percpuq in its
softc as a reference implementation.
Additional information including performance numbers can be found
in the thread at tech-kern@ and tech-net@:
http://mail-index.netbsd.org/tech-kern/2016/01/14/msg019997.html
Acknowledgment: riastradh@ greatly helped this work.
Thank you very much!
request again (possibly because of compiler optimisations), by using
copies and barriers.
From XSA155:
The compiler can emit optimizations in the PV backend drivers which
can lead to double fetch vulnerabilities. Specifically the shared
memory between the frontend and backend can be fetched twice (during
which time the frontend can alter the contents) possibly leading to
arbitrary code execution in backend.
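The standard mitigation pattern, sketched here against a block backend
(the function and ring names are illustrative): copy the request out
of the shared page once, issue a read barrier, then validate and use
only the private copy:

static void
xx_handle_request(blkif_back_ring_t *ring, RING_IDX cons)
{
	blkif_request_t req;

	/* Snapshot the untrusted request out of the shared ring... */
	memcpy(&req, RING_GET_REQUEST(ring, cons), sizeof(req));
	/* ...and keep the compiler/CPU from re-reading shared memory. */
	xen_rmb();

	/* From here on, only 'req' is examined; the frontend can no
	 * longer change values that have already been validated. */
}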