Commit Graph

1262 Commits

Author SHA1 Message Date
maxv
ac670b3099 Instead of using a global array with per-cpu indexes, embed the tmp VAs
into cpu_info directly. This concerns only {i386, Xen-i386, Xen-amd64},
because amd64 already has a direct map that is way faster than that.

There are two major issues with the global array: maxcpus entries are
allocated, even though it is unlikely that common i386 machines have so many
cpus; and the base VA of these entries is not cache-line-aligned, which
practically guarantees cache-line thrashing each time the VAs are entered.

Now the number of tmp VAs allocated is proportional to the number of CPUs
attached (which therefore reduces memory consumption), and the base is
properly aligned.
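
For illustration, a minimal sketch of the new layout, with hypothetical member
and constant names (the real ones in the tree may differ):

#define TMP_NVAS	4	/* e.g. zero/copy windows */

struct cpu_info {
	/* ... existing members ... */
	/* tmp VAs now live here, one set per attached CPU, and the base
	 * can be cache-line-aligned together with the rest of the struct */
	vaddr_t		ci_tmp_vas[TMP_NVAS] __aligned(64);
};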

On my 3-core AMD, the number of DC_refills_L2 events triggered when
performing 5x10^6 calls to pmap_zero_page on two dedicated cores is on
average divided by two with this patch.

Discussed on tech-kern a little.
2017-02-11 14:11:24 +00:00
maxv
6c9d31ed8a Rename ldt->ldtstore and gdt->gdtstore on i386. It reduces the diff with
amd64, and makes it easier to track down these variables on nxr - 'ldt'
and 'gdt' being common keywords.
2017-02-05 10:42:21 +00:00
maxv
2b26583164 Increase KERNTEXTOFF from 1MB to 2MB on amd64. [1MB; 2MB[ is now handled
by UVM, so there is no physical loss.

On amd64 we always remap the kernel text with 2MB pages, and because of the
1MB start address we were forced to map [0MB; 2MB[ inside the first large
page. The problem is that the lower half is used by UVM to allocate physical
pages, and it is possible that some of these could be used by userland. We
could end up with userland-controllable data mapped into the kernel text on
a privileged page, which is far from a good idea from a security point of view.

I am not fixing i386 yet, because the large page size depends on PAE, and
we probably don't want the kernel text located at 4MB on low-memory systems.

(note: I didn't introduce this issue, it was already there when I came in)
2017-02-02 19:09:08 +00:00
maxv
a4a4753729 Use __read_mostly on these variables, to reduce the probability of false
sharing.
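
For illustration, __read_mostly places a declaration in a section of rarely
written data, so that it does not share cache lines with frequently written
variables; a hypothetical example:

static int	example_limit __read_mostly;	/* hypothetical variable */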
2017-02-02 08:57:04 +00:00
maxv
4d2995d98f Import xpmap_pg_nx, and put it in the per-cpu recursive slot on amd64. 2017-01-22 19:42:48 +00:00
maxv
14f8c1a9b1 Export xpmap_pg_nx, and put it in the page table pages. It does not change
anything, since Xen removes the X bit on these; but it is better for
consistency.
2017-01-22 19:24:51 +00:00
maxv
2b321a2d23 Remove a few #if 0s, and explain what we are doing on PAE: the last two PAs
are entered in reverse order.
2017-01-06 08:32:26 +00:00
cherry
85a999caa3 In the MP case,
do not attempt a pmap_tlb_shootdown() after a pmap_kenter_ma() during
boot: pmap_tlb_shootdown() assumes it runs post-boot. Instead, invalidate the
entry on the local CPU only.
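
A rough sketch of this logic, assuming a check on the global 'cold' flag and
the x86 pmap shootdown signature (the actual diff may differ):

	if (cold) {
		/* Too early for IPIs: invalidate on the local CPU only. */
		pmap_update_pg(va);
	} else {
		pmap_tlb_shootdown(pmap_kernel(), va, opte, TLBSHOOT_KENTER);
	}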

XXX: to DTRT, this assumption probably needs re-examination.
XXX: The tradeoff is a (predicted) single word-size comparison
     penalty, so perhaps the decision needs performance stats.
2016-12-26 08:53:11 +00:00
skrll
fb73e3c1d6 Hold the interlock before cv_broadcast as per condvar(9) 2016-12-26 08:16:28 +00:00
cherry
b94dbe7b54 balloon(4) can now use uvm_hotplug(9)
Do this.
2016-12-23 17:01:10 +00:00
maxv
8a3e058bbd The way the xen dummy page is taken care of makes absolutely no sense at
all, with magic offsets here and there in different layers of the system.
It is just blind luck that everything has always worked as expected so
far.

Due to this wrong design we have a problem now: we allocate one physical
page for lapic, and it happens to overlap with the dummy page, which
causes the system to crash.

Fix this by keeping the dummy va directly in a variable instead of using magic
offsets. The asm locore now increments the first pa to hide the dummy page
from machdep and pmap.
2016-12-16 19:52:22 +00:00
kamil
241cf91ddc Add support for a hardware-assisted watchpoints/breakpoints API in ptrace(2)
Add new ptrace(2) calls:
 - PT_COUNT_WATCHPOINTS - count the number of available hardware watchpoints
 - PT_READ_WATCHPOINT   - read struct ptrace_watchpoint from the kernel state
 - PT_WRITE_WATCHPOINT  - write new struct ptrace_watchpoint state; this
                          includes enabling and disabling watchpoints

The ptrace_watchpoint structure contains MI and MD parts:

typedef struct ptrace_watchpoint {
	int		pw_index;	/* HW Watchpoint ID (count from 0) */
	lwpid_t		pw_lwpid;	/* LWP described */
	struct mdpw	pw_md;		/* MD fields */
} ptrace_watchpoint_t;

For example, amd64 defines the MD part as follows:
struct mdpw {
	void	*md_address;
	int	 md_condition;
	int	 md_length;
};
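
A hypothetical userland usage sketch; the addr/data calling convention assumed
below (a pointer to the structure plus its size) is an assumption, not spelled
out here:

#include <sys/types.h>
#include <sys/ptrace.h>
#include <string.h>

static int
set_watchpoint(pid_t child, lwpid_t lwp, void *watched_addr)
{
	struct ptrace_watchpoint pw;

	memset(&pw, 0, sizeof(pw));
	pw.pw_index = 0;			/* first hardware watchpoint */
	pw.pw_lwpid = lwp;			/* LWP being described */
	pw.pw_md.md_address = watched_addr;	/* amd64 MD field from above */
	/* md_condition/md_length: MD-specific encodings, omitted here */

	return ptrace(PT_WRITE_WATCHPOINT, child, &pw, sizeof(pw));
}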

These calls are protected with the __HAVE_PTRACE_WATCHPOINTS guard.

Tested on amd64, initial support added for i386 and XEN.

Sponsored by <The NetBSD Foundation>
2016-12-15 12:04:17 +00:00
ozaki-r
dd8638eea5 Move bpf_mtap and if_ipackets++ on Rx of each driver to percpuq if_input
The benefits of the change are:
- We can reduce code
- We can provide the same behavior across drivers
  - Where/when if_ipackets is counted up
  - Note that some drivers still update packet statistics in their own
    way (periodic updates)
- bpf_mtap now runs in softint
  - This makes it easy to MP-ify bpf

Proposed on tech-kern and tech-net
2016-12-15 09:28:02 +00:00
kamil
192986ec23 Tear down KSTACK_CHECK_DR0, an i386-only feature to detect stack overflow
This feature was intended to detect stack overflow with the CPU Debug Registers
(x86). It was never ported to other ports, not even amd64, and would need to be
adapted for SMP...

Currently there might be better ways to detect stack overflows, like page
mapping protection. Since the number of Debug Registers is restricted
(4 on x86), tear it down completely.

This interface introduced helper functions for Debug Registers; they will
be replaced with the new <x86/dbregs.h> interface.

KSTACK_CHECK_DR0 was disabled by default and won't affect ordinary users.

Sponsored by <The NetBSD Foundation>
2016-12-13 10:54:27 +00:00
kamil
9b94e002e8 Switch x86 CPU Debug Register types from vaddr_t to register_t
This is a more opaque and appropriate type, as vaddr_t is meant to be used
for virtual address values. Not all DRs on x86 represent virtual
addresses (DR6 and DR7 definitely do not).

No functional change intended.

Change suggested by <christos>

Sponsored by <The NetBSD Foundation>
2016-12-13 10:21:33 +00:00
kamil
266caf90fb Add accessors for available x86 Debug Registers
There are 8 Debug Registers on i386 (available at least since the 80386) and 16
on AMD64. Currently DR4 and DR5 are reserved on both CPU families, and
DR8-DR15 are still reserved on AMD64. Therefore add accessors for DR0-DR3 and
DR6-DR7 for all ports.

Debug Registers x86:
 * DR0-DR3  Debug Address Registers
 * DR4-DR5  Reserved
 * DR6      Debug Status Register
 * DR7      Debug Control Register
 * DR8-DR15 Reserved

Access to the registers is available only from the kernel (ring 0), as
privileged access is required. For this reason the special Xen functions must
be used to get and set the registers in XEN3 kernels.

XEN-specific functions as defined in NetBSD:
 - HYPERVISOR_get_debugreg()
 - HYPERVISOR_set_debugreg()

This code extends the existing rdr6() and ldr6() accessors with additional ones:
 - rdr0() & ldr0()
 - rdr1() & ldr1()
 - rdr2() & ldr2()
 - rdr3() & ldr3()
 - rdr7() & ldr7()
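
As an illustration, a native (non-Xen) amd64 accessor pair in this style could
look like the sketch below; the Xen variants would instead go through
HYPERVISOR_get_debugreg()/HYPERVISOR_set_debugreg():

static inline vaddr_t
rdr0(void)
{
	vaddr_t val;

	__asm volatile("movq %%dr0,%0" : "=r" (val));
	return val;
}

static inline void
ldr0(vaddr_t val)
{
	__asm volatile("movq %0,%%dr0" : : "r" (val));
}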

Traditionally the accessors for DR6 took a vaddr_t argument; while that is an
appropriate type for DR0-DR3, DR6-DR7 should arguably use u_long, but it's
not a big deal. The resulting functionality is equivalent, so stick
to this convention and use the vaddr_t type for all DR accessors.

There was already a function defined for rdr6() in XEN, but it had a nit on
AMD64: it cast the result of HYPERVISOR_get_debugreg() to u_int (32-bit on
AMD64), truncating it. That still works for DR6, but for the sake of
simplicity always return the full 64-bit value.

The new accessors duplicate the functionality of the dr0() function available
on i386 within the KSTACK_CHECK_DR0 option. dr0() is a specialized layer with
logic to set appropriate types of interrupts; the new accessors are designed to
pass verbatim values from userland (with simple sanity checks in the
kernel). At the moment there are no plans to make it possible for
KSTACK_CHECK_DR0 to coexist with debug registers for user applications
(debuggers).

     options KSTACK_CHECK_DR0
     Detect kernel stack overflow using DR0 register.  This option uses DR0
     register exclusively so you can't use DR0 register for other purpose
     (e.g., hardware breakpoint) if you turn this on.

The KSTACK_CHECK_DR0 functionality was designed for i386 and never ported
to amd64.

Code tested on i386 and amd64 with kernels: GENERIC, XEN3_DOMU, XEN3_DOM0.

Sponsored by <The NetBSD Foundation>
2016-11-27 14:49:21 +00:00
maxv
415d831276 KNF a little 2016-11-25 12:20:03 +00:00
ozaki-r
61f9115f54 Sweep unnecessary xcall.h inclusions 2016-11-21 04:10:05 +00:00
maxv
401e389a28 Mmh, apparently I didn't properly test my previous change since it does not
compile anymore
2016-11-15 17:01:12 +00:00
maxv
944fdd3c50 Keep simplifying that stuff. Also, replace plX_pi(KERNTEXTOFF) by
LX_SLOT_KERNBASE: the base address is KERNBASE, and we just start mapping
from KERNTEXTOFF. This is for symmetry with the normal amd64 and does not
change anything.
2016-11-15 15:37:20 +00:00
maxv
979460599e Rename xen_pmap_bootstrap to xen_locore; it really has nothing to do with
pmap and is just a C version of what amd64 and i386 do in asm.
2016-11-11 11:34:51 +00:00
maxv
748acd38e3 Start simplifying the Xen locore: rename and reorder several things, remove
awful debug messages, use unsigned counters, fix typos and KNF.
2016-11-11 11:12:42 +00:00
maxv
6afccfe796 Map the PTE space as non-executable on PAE. The same is already done on
amd64.
2016-11-01 12:16:10 +00:00
maxv
36b3504040 Map the remaining pages as non-executable. Only text should have X. 2016-11-01 12:00:21 +00:00
jdolecek
ee25a7878f provide stub intr xname establish for xen 2016-10-17 18:23:49 +00:00
kre
30a3d786ca This should return the amd64 build to a working state (and hopefully
i386 as well) - but this is a hideous hack, and should be reverted
as soon as a better (which means any) alternative is available.
2016-10-16 06:40:43 +00:00
joerg
d87b908a67 Give Xen/i386 the same process and file limits as native i386. 2016-09-23 22:07:12 +00:00
bouyer
7929952b1a Revert to 1.59 (adding back the W^X kernel mappings), and move the data+bss
mapping later so that mappings that should be RO (such as page tables) won't
be made RW by accident.
2016-08-25 17:03:57 +00:00
bouyer
db4f79c55f Stopgap measure: revert to rev 1.56. Starting with 1.57, an i386 PAE Xen
kernel doesn't boot:
(XEN) mm.c:2394:d139v0 Bad type (saw 5400000000000001 != exp 7000000000000000) for mfn 1136f5 (pfn 621)
(XEN) mm.c:887:d139v0 Could not get page type PGT_writable_page
(XEN) mm.c:939:d139v0 Error getting mfn 1136f5 (pfn 621) from L1 entry 00000001136f5003 for l1e_owner=139, pg_owner=139
(XEN) mm.c:1254:d139v0 Failure in alloc_l1_table: entry 33
(XEN) mm.c:2141:d139v0 Error while validating mfn 112f57 (pfn dbf) for type 1000000000000000: caf=8000000000000003 taf=1000000000000001
(XEN) mm.c:947:d139v0 Attempt to create linear p.t. with write perms
(XEN) mm.c:1330:d139v0 Failure in alloc_l2_table: entry 3
(XEN) mm.c:2141:d139v0 Error while validating mfn 112f5b (pfn dbb) for type 2200000000000000: caf=8000000000000003 taf=2200000000000001
(XEN) mm.c:1412:d139v0 Failure in alloc_l3_table: entry 3
(XEN) mm.c:2141:d139v0 Error while validating mfn 112f60 (pfn db6) for type 3000000000000000: caf=8000000000000003 taf=3000000000000001
(XEN) mm.c:3044:d139v0 Error while pinning mfn 112f60
(XEN) traps.c:459:d139v0 Unhandled bkpt fault/trap [#3] on VCPU 0 [ec=0000]
(XEN) domain_crash_sync called from entry.S: fault at ffff82d080231894 compat_create_bounce_frame+0xda/0xf2
2016-08-23 11:03:52 +00:00
maxv
a38b5c9a58 Make the I/O area non-executable on Xen. 2016-08-11 15:35:10 +00:00
maxv
d7d5f3349a Map the recursive slot and page table pages as non-executable on Xen. Same
as normal x86.
2016-08-03 11:51:18 +00:00
maxv
2f746d1585 Map the kernel text, rodata and data+bss independently on Xen, with
respectively RX, R and RW.
2016-08-02 14:21:53 +00:00
maxv
d2a4f6b0ae Use PG_RO instead of a magic zero. 2016-08-02 13:29:35 +00:00
maxv
039c7ddcb0 KNF, and use PAGE_SIZE instead of NBPG. 2016-08-02 13:25:56 +00:00
msaitoh
71fbb921c3 KNF. No functional change. 2016-07-11 11:31:49 +00:00
msaitoh
8bc54e5be6 KNF. Remove extra spaces. No functional change. 2016-07-07 06:55:38 +00:00
nonaka
6af4cd3352 Pass bus_dma(9) tag to allow for porting sdhc(4) at acpi. 2016-06-21 11:33:32 +00:00
jnemeth
4ec72366f1 - add machdep.xen.version sysctl to easily get hypervisor version
- move machdep.xen_timepush_ticks to machdep.xen.timepush_ticks to
  consolidate all Xen related sysctls under machdep.xen
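
A hypothetical userland query of the new node (this assumes the node is
exported as a string, which is not stated here):

#include <sys/param.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	char ver[32];
	size_t len = sizeof(ver);

	if (sysctlbyname("machdep.xen.version", ver, &len, NULL, 0) == 0)
		printf("Xen %s\n", ver);
	return 0;
}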
2016-06-12 09:08:09 +00:00
ozaki-r
d938d837b3 Introduce m_set_rcvif and m_reset_rcvif
The API is used to set (or reset) the receive interface of an mbuf.
They are counterparts of m_get_rcvif, which will come in another
commit; they hide the internals of rcvif handling and reduce the diff of
the upcoming change.
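
For illustration, in a driver Rx path the accessors replace direct field
accesses, roughly:

	m_set_rcvif(m, ifp);	/* was: m->m_pkthdr.rcvif = ifp; */
	/* ... hand the mbuf to the stack ... */
	m_reset_rcvif(m);	/* was: m->m_pkthdr.rcvif = NULL; */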

No functional change.
2016-06-10 13:27:10 +00:00
jnemeth
d9c5d3073b Feeding uninitialised garbage to the hypervisor is likely to be a bad idea. 2016-06-08 01:59:06 +00:00
bouyer
55f57916be Switch to ELF notes for amd64, instead of the old key=value list, to describe
the guest requirements and support.
Add infrastructure to query the hypervisor about feature support.
For verbose boot, print the features supported by the hypervisor for this
  guest.
2016-05-29 17:06:17 +00:00
jnemeth
605ea3fe8e make CPU microcode loading dependent on both DOM0OPS AND CPU_UCODE 2016-05-20 03:41:20 +00:00
christos
c7f0ba033b Account for the CRC len (Jean-Jacques.Puig) 2016-05-09 15:11:35 +00:00
mlelstv
6846a009d9 no condition for cpu_rng here 2016-02-27 15:42:20 +00:00
mlelstv
850f8baca4 add missing cpu_rng.c to kernel 2016-02-27 14:28:50 +00:00
bouyer
f35e6799e0 In xennet_xenbus_detach(), remove the event handler early (just after
xennet_stop()) so that we don't get events while sleeping (e.g.
in softint_disestablish()) when some structures have already been
freed.
Problem reported and patch tested by Rohan Desai.
2016-02-16 08:41:32 +00:00
ozaki-r
9c4cd06355 Introduce softint-based if_input
This change intends to run the whole network stack in softint context
(or a normal LWP), not hardware interrupt context. Note that the work is
still incomplete with this change; to that end, we also have to softint-ify
if_link_state_change (and bpf), which can still run in hardware interrupt
context.

This change softint-ifies ifp->if_input, which is called from
each device driver (and ieee80211_input), to ensure Layer 2 runs
in softint (e.g., ether_input and bridge_input). To this end,
we provide a framework (called percpuq) that utilizes softint(9)
and percpu ifqueues. With this patch, the rxintr of most drivers just
queues received packets and schedules a softint, and the softint
dequeues packets and does the rest of the packet processing.
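
A hypothetical driver Rx handler under this framework (the driver name and
helpers below are made up; if_percpuq_enqueue() is assumed to be the enqueue
entry point):

static int
xxx_rxintr(void *arg)
{
	struct xxx_softc *sc = arg;
	struct ifnet *ifp = &sc->sc_ethercom.ec_if;
	struct mbuf *m;

	/* Hard interrupt context: only pull frames off the hardware. */
	while ((m = xxx_rx_dequeue(sc)) != NULL) {
		m_set_rcvif(m, ifp);
		/* Queue the packet and schedule the softint that will run
		 * ether_input() and the rest of the stack later. */
		if_percpuq_enqueue(ifp->if_percpuq, m);
	}
	return 1;	/* claimed the interrupt */
}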

To minimize changes to each driver, percpuq is allocated in struct
ifnet for now and is initialized by default (in if_attach).
We probably have to move percpuq to the softc of each driver, but that is
future work. At this point, only wm(4) has percpuq in its softc
as a reference implementation.

Additional information including performance numbers can be found
in the thread at tech-kern@ and tech-net@:
http://mail-index.netbsd.org/tech-kern/2016/01/14/msg019997.html

Acknowledgment: riastradh@ greatly helped this work.
Thank you very much!
2016-02-09 08:32:07 +00:00
bouyer
13ee92e7ec Apply patch from XSA155: make sure that the backend won't read parts of the
request again (possibly because of compiler optimisations), by using
copies and barriers.
From XSA155:
The compiler can emit optimizations in the PV backend drivers which
can lead to double fetch vulnerabilities. Specifically the shared
memory between the frontend and backend can be fetched twice (during
which time the frontend can alter the contents) possibly leading to
arbitrary code execution in backend.
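
Illustrative sketch of the pattern (the names below are generic, not the
literal patch): copy the shared request into backend-private memory, issue a
read barrier, and from then on only ever look at the copy:

	blkif_request_t req;

	/* Fetch the frontend's request exactly once. */
	memcpy(&req, RING_GET_REQUEST(&ring, cons), sizeof(req));
	xen_rmb();	/* keep the copy ordered before any use of it */
	/* ... validate and process 'req'; never re-read the shared ring ... */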
2016-01-06 15:28:40 +00:00
christos
19d921e536 need definition 2015-12-13 16:11:14 +00:00
christos
07d3d8c47c fix the build. 2015-12-13 15:22:31 +00:00