so that our kernels works with newer xen-3 hypervisors; and correct the value
of VIRT_BASE for dom0.
Now that we can embed the values of KERNBASE and KERNTEXTOFF in the binary
for Xen, make the domU memory layout the same as dom0 for Xen3 (making
it the other way round doens't work; probably because of alignement
constraints in the hypervisor). The old domU layout is used if options
XEN_COMPAT_030001 is present in the kernel config file. Enable this the
domU kernel config files for now, in case someone wants to run a NetBSD
domU on an older Xen3 installation.
CPUs are now configured on mainbus only in dom0, and only to know about
their APIC id. virtual CPUs are attached to hypervisor as:
vcpu* at hypervisor?
and this is what's used as curcpu(). The kernel config files needs to be
updated for this, see XEN3_DOM0 or XEN3_DOMU for examples.
XEN3_DOM0 now has acpi, MPBIOS and ioapic by default.
Note that a Xen dom0 kernel doens't have access to the lapic.
Old hypercall call method still still available with
options XEN_NO_HYPERCALLPAGE
but this is disabled by default (xen-3.0.2-2 supports hypercall call page
just fine).
While there add a VIRT_BASE= string in __xen_guest section; from
Bastian Blank on port-xen@.
- Add a few scopes to the kernel: system, network, and machdep.
- Add a few more actions/sub-actions (requests), and start using them as
opposed to the KAUTH_GENERIC_ISSUSER place-holders.
- Introduce a basic set of listeners that implement our "traditional"
security model, called "bsd44". This is the default (and only) model we
have at the moment.
- Update all relevant documentation.
- Add some code and docs to help folks who want to actually use this stuff:
* There's a sample overlay model, sitting on-top of "bsd44", for
fast experimenting with tweaking just a subset of an existing model.
This is pretty cool because it's *really* straightforward to do stuff
you had to use ugly hacks for until now...
* And of course, documentation describing how to do the above for quick
reference, including code samples.
All of these changes were tested for regressions using a Python-based
testsuite that will be (I hope) available soon via pkgsrc. Information
about the tests, and how to write new ones, can be found on:
http://kauth.linbsd.org/kauthwiki
NOTE FOR DEVELOPERS: *PLEASE* don't add any code that does any of the
following:
- Uses a KAUTH_GENERIC_ISSUSER kauth(9) request,
- Checks 'securelevel' directly,
- Checks a uid/gid directly.
(or if you feel you have to, contact me first)
This is still work in progress; It's far from being done, but now it'll
be a lot easier.
Relevant mailing list threads:
http://mail-index.netbsd.org/tech-security/2006/01/25/0011.htmlhttp://mail-index.netbsd.org/tech-security/2006/03/24/0001.htmlhttp://mail-index.netbsd.org/tech-security/2006/04/18/0000.htmlhttp://mail-index.netbsd.org/tech-security/2006/05/15/0000.htmlhttp://mail-index.netbsd.org/tech-security/2006/08/01/0000.htmlhttp://mail-index.netbsd.org/tech-security/2006/08/25/0000.html
Many thanks to YAMAMOTO Takashi, Matt Thomas, and Christos Zoulas for help
stablizing kauth(9).
Full credit for the regression tests, making sure these changes didn't break
anything, goes to Matt Fleming and Jaime Fournier.
Happy birthday Randi! :)
because of the jitter caused by the serial console), which is not only
cosmetic but is bad for clock accuracy. Introducing a 1s delay before reading
Xen's idea of the CPU frequency fixes this.
of CPU_MAXFAMILY. This effectively causes to downgrade to i386 class
instead of a nonexistant class, and overrunning classnames[] by one.
Coverity ID 1472.
reality -- the memory taken by the kernel image should be counted in the
former, but not the latter, as is the case on other ports.
Discussed on port-xen; approved by bouyer@.
cc_microtime with a MD function that isn't affected by erratic "clock
interrupts" and instead takes more advantage of time information
provided by the hypervisor. Fixes, most importantly, a case where the
clock as seen by userland would sometimes bounce back and forth by up to
1<<31 us (~35 min).
Approved by bouyer@; explained in more detail in
http://mail-index.netbsd.org/port-xen/2006/02/28/0002.html
- kernel (both dom0 and domU) boot, console is functionnal and it can starts
software from a ramdisk
- there is no driver front-end expect console for domU yet.
- dom0 can probe devices and ex(4) work when Xen3 is booted without acpi
and apic support. But the on-board IDE doens't get interrupts.
The PCI code still needs work (it's hardcoded to mode 1). Some of this
code should be shared with ../x86
The physical insterrupt code needs to get MPBIOS and ACPI support, and
do interrupt routing to properly interract with Xen.
To enable Xen-3.0 support, add
options XEN3
to your kernel config file (this will disable Xen2 support)
Changes affecting Xen-2.0 support (no functionnal changes intended):
- get more constants from genassym for assembly code
- remove some unneeded registers move from start()
- map the shared info page from start(), and remove the pte = 0xffffffff hack
- vector.S: in hypervisor_callback() make sure %esi points to
HYPERVISOR_shared_info before accessing the info page. Remplace some
hand-written assembly with the equivalent macro defined in frameasm.h
- more debug code, dissabled by default.
while here added my copyright on some files I worked on in 2005.
Always call HYPERVISOR_fpu_taskswitch() at the end of npxsave_lwp().
This fixes the FPU problems detected by paranoia on a NetBSD/Xen guest.
Based on patch sent by Paul Ripke on port-xen, but reworked by me.
Should fix port-xen/30977.
kernel by x86 platforms (instead of a simple char *). This way, the code
in, e.g., lookup_bootinfo, is a bit easier to understand.
While here, move the lookup_bootinfo function used in x86 platforms (amd64,
i386 and xen) to a common file (x86/x86_machdep.c), as it was exactly the
same in all of them.
- move copyin and friends from locore.S to their own file, copy.S.
share it between i386 and xen.
- defparam KERNBASE and kill KERNBASE_LOCORE hack.
- add more symbols to assym.h and use it where appropriate.
instead to always trying PG_RW and falling back to PG_RO if this fails.
Use uvm_map_checkprot() in IOCTL_PRIVCMD_MMAP and IOCTL_PRIVCMD_MMAP_BATCH
to compute the appropriate vm_prot_t for pmap_remap_pages().
Thanks to Jed Davis for pointing out uvm_map_checkprot().
In pmap_remap_pages() new mappings are created (PG_RW|PG_M). When saving
a domain, the hypervisor will refuse to map the foreing pages RW.
As a temporary measure, retry the mapping read-only if PG_RW fails, so that
domain save will work. Also fix the PTP's wire_count if the MMU update
fails (prevent a kernel panic).
Ceri Storey to port-xen, with some additionnal changes by me:
- include bus_dma.c, bus_space.c and pci_machdep.c if pci is defined
instead of dom0ops
- Make various initialisations, and probe/attach pci busses based on NPCI
instead of DOM0OPS
- in conf/files.xen, move xen-specific devices before non-xen specific devices
so that the xen-specific match function is called first, to avoid false
attachement from too liberal match function in non-xen code.
disks to other domains) from Jed Davis, <jld@panix.com>:
* Issue multiple requests when necessary rather than
assuming that arbitrary requests can be mapped into single
contiguous virtual address ranges.
* Don't assume that all data for a request is consecutive
in memory. With some client OSes, it's not.
The above two changes fix data corruption issues with Linux
clients with certain filesystem block sizes.
* Gracefully handle memory or pool allocation failures after
beginning to handle a request from the ring.
* Merge contiguous requests to avoid the "64K turns into 44K + 20K
and doubles the transactions per second at the disk" problem
caused by the 11-page limit caused by the structure of Xen
ring entries. This causes a very slight performance decrease
for sequential 64K I/O if the disk is not already saturated with
requests (about 1%) but halves the transactions per second we
hit the disk with -- or better. It even compensates for bizarre
Linux behaviour like breaking long requests up into 5.5K pieces.
* Probably some stuff I forgot to mention.
Disk throughput (though not latency) is now much, much closer to the
"raw hardware" case than it was before.
them from interrupt context.
xpq_flush_queue() is called from IPL_NET in if_xennet.c, and
other xpq_queue* functions may be called from interrupt context via
pmap_kenter*(). Should fix port-xen/30153.
Thanks to Jason Thorpe and YAMAMOTO Takashi for enlightments on this issue.
where the printing of `version' is already performed.
This has the benefit of allowing the copyright to be available
via dmesg(8) on platforms which need the `msgbuf' to be setup
in cpu_startup() before printed output is remembered.
- sort the ih_evt_handler list by IPL, higher first. Otherwise some handlers
would have been delayed, event if they could run at the current IPL.
- As ih_evt_handler is sorted, remove IPLs that have been processed for
an event when calling hypervisor_set_ipending()
- In hypervisor_set_ipending(), enter the event in ipl_evt_mask only
for the lowest IPL. As deffered IPLs are processed high to low,
this ensure that hypervisor_enable_event() will be called only when all
callbacks have been called for an event. We don't need the evtch_maskcount[]
counters any more.
Thanks to YAMAMOTO Takashi for ideas and feedback.
cause an event to be both handled and marked as pending, or being
marked as pending twice (triggering the diagnostic check
evtch_maskcount[port] == 0 in hypervisor_set_ipending):
mask and clear event by word of 32bit in do_hypervisor_event() or stipending(),
instead of by indiviual bits in do_event() or xenevt_event().
In addition this is marginally more efficient.
including soft interrupt, and this is way too low in some use (lots of domains,
or domains with lots of xennet, or even hardware with lots of devices at
different interrupts).
Based on idea from YAMAMOTO Takashi, keep one list of handler per-event and
one per-IPL (so the same handler is now in 2 lists). In the common case were
an event is received at low IPL, we can call the handlers quickly (there
is usually only one handler per event, unless the event is mapped to a
physical interrupt and this interrupt is shared by different devices).
Deffered events and software interrupts are handled by a bitmask (as before)
with one bit per IPL. When one IPL has an event pending all handlers for
this IPL will be called.
With this change, it is now possible to have all the 1024 events active.
While here, handle debug event in a special way: the handler is always called,
regardless of the current IPL. Make the handler print usefull informations
about events and IPL states.
Also remove code not used on Xen in files inherited from the x86 port.
- distinguish paddr_t and bus_addr_t.
for xen, use bus_addr_t in the sense of machine address.
- move _X86_BUS_DMA_PRIVATE part of bus.h into bus_private.h.
- remove special handling of xen_shm. we can always grab
machine address from pte.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.
We can't write directly to a gdt slot, we need to go though
xen_update_descriptor().
From YAMAMOTO Takashi: this is not a problem in HEAD, because
as the kva for gdt is pageable, when gdt_put_slot1 attempts to modify gdt,
the fault handler allocates and maps a new page for you.
So the kernel doesn't panic, but could leak some memory.
the interrupt latch (which is probably done by the hypervisor, linux/xen
doesn't do it either). Now the "fputest" configure test from pkgsrc/math/yorick
works as expected.
Thanks to Christian Limpach for the hint.
if the corresponding bit needs to be set in evtchn_pending_sel, and eventually
force an upcall (if we got a second event when this one what being handled).
For now to this by calling hypervisor_enable_irq(), this could be rewritten
in inline assembly by someone knowing enouth about i386 assembly :)
Check the passed in address as well as determining the maximum length
using VM_MAXUSER_ADDRESS in copyinstr and copyoutstr.
Problem originally fixed in OpenBSD/i386.
This fix suggested by Charles Hannum (mycroft at netbsd dot org).
Protect the callback queue with splvm()
XXX some debug printf about the callback stuff is left here. This is because
I've not been able to trigger this condition yet, so I've left them
until we sure the code works as intended.
- allocate kva for vm_map_entry from the map itsself and
remove the static limit, MAX_KMAPENT.
- keep merged entries for later splitting to fix allocate-to-free problem.
PR/24039.
around a cngetc() that will never return while "halted", which is rude,
and which also requires domain 0 to not just restart us, but kill us
first. Suggestion from Michael Kukat (though this change is not the
same as the one he suggested).
Will cngetc() actually return something when running in domain 0 with a
VGA console? I don't think it will; if it actually does, we should make
this behaviour depend on whether we're in domain 0 or some other domain.
does different things depending what's in %ebx.
We weren't setting %ebx here *at all*; so we did not get SCHEDOP_yield,
which was what was wanted; so unpredictable things happened, notably
immediate return to the NetBSD idle loop, in other words spinning on CPU.
Definitely not cool!
Michael Kukat caught this one and suggested this fix on port-xen.
actually adjusting the time correctly (calling hardclock as needed, not
just blindly every time Xen schedules us) based on Xen's idea of the
time in the shared page.
Xen source repo change info:
ChangeSet
2004/09/22 13:47:22+01:00 cl349@freefall.cl.cam.ac.uk
Fix time.
netbsd-2.0-xen-sparse/sys/arch/xen/xen/clock.c
2004/09/22 13:47:21+01:00 cl349@freefall.cl.cam.ac.uk +28 -3
Don't call hardclock on spurious timer interrupt and call hardclock
for missed interrupts.
netbsd-2.0-xen-sparse/sys/arch/xen/conf/XEN
2004/09/22 13:47:21+01:00 cl349@freefall.cl.cam.ac.uk +0 -1
Don't need custom HZ value any longer.
: ----------------------------------------------------------------------