New interfaces for PV drivers:
- Xen transcendent memory
- USB IO
- SCSI IO
PCI IO improvements:
- PCI MSI support
- PCI Express AER support
New features:
- xen honors flags to be placed into guest kernel available pte bits
if enabled (for grant table)
- support for 128 vcpus
(old interface is still present and supports up to 32 vcpus)
- PCI passthrough: new hypercalls to support SR-IOV
- new hypercall for physical cpu hotplugging
- new hypercall for physical page offlining
- fixes to compile with clang
- machine check recovery mechanism
current CPU, but for any CPU which may accept this event.
xen/xenevt.c: more use of atomic ops and locks where appropriate, and some
other SMP fixes. Handle all events on the primary CPU (may be revisited
later). Set/clear ci_evtmask[] for watched events.
This should fix the problems on dom0 kernels reported by jym@
xbdback_xenbus_destroy(). The second call will wait forever as the first
already caused the xbd thread to exit.
Have xbdback_disconnect() check if we're already disconnected and if so,
do nothing.
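A minimal sketch of the "already disconnected, do nothing" guard described
above. The structure, field and function names are hypothetical stand-ins,
not the actual xbdback(4) code, and the real driver also has to synchronize
with the xbd thread.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the backend instance state. */
struct backend_instance {
	bool connected;		/* true while the I/O thread is alive */
	int  teardown_count;	/* counts real teardowns, for illustration */
};

/*
 * Disconnect the instance. A second call is a no-op, so the caller
 * never waits for a thread that has already exited.
 * Returns true if this call performed the teardown.
 */
bool
backend_disconnect(struct backend_instance *bi)
{
	if (!bi->connected)
		return false;	/* already disconnected: do nothing */

	bi->connected = false;
	bi->teardown_count++;	/* real code would stop the thread here */
	return true;
}
```

Calling backend_disconnect() twice on a connected instance tears it down
once; the second call returns immediately.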
instead of continuations directly from shm callbacks or interrupt
handlers. The whole CPS design remains but is adapted to cope with
a thread model.
This patch allows scheduling away I/O requests of domains that behave
abnormally, or even destroying them if there is a need to (without thrashing
dom0 with lots of error messages at IPL_BIO).
I took this opportunity to make the driver MPSAFE, so multiple instances
can run concurrently. Moved from home-grown pool(9) queues to
pool_cache(9), and reworked the callback mechanism so that it delegates
I/O processing to a thread instead of handling it itself through the
continuation trampoline.
This one fixes the potential DoS many have seen in a dom0 when trying to
suspend a NetBSD domU with a corrupted I/O ring.
Benchmarks (build.sh release runs and bonnie++) do not show any
performance regression; the "new" driver is on par with the "old" one.
ok bouyer@.
<20111022023242.BA26F14A158@mail.netbsd.org>. This change includes
the following:
An initial cleanup and minor reorganization of the entropy pool
code in sys/dev/rnd.c and sys/dev/rndpool.c. Several bugs are
fixed. Some effort is made to accumulate entropy more quickly at
boot time.
A generic interface, "rndsink", is added, for stream generators to
request that they be re-keyed with good quality entropy from the pool
as soon as it is available.
The arc4random()/arc4randbytes() implementation in libkern is
adjusted to use the rndsink interface for rekeying, which helps
address the problem of low-quality keys at boot time.
An implementation of the FIPS 140-2 statistical tests for random
number generator quality is provided (libkern/rngtest.c). This
is based on Greg Rose's implementation from Qualcomm.
A new random stream generator, nist_ctr_drbg, is provided. It is
based on an implementation of the NIST SP800-90 CTR_DRBG by
Henric Jungheim. This generator uses AES in a modified counter
mode to generate a backtracking-resistant random stream.
An abstraction layer, "cprng", is provided for in-kernel consumers
of randomness. The arc4random/arc4randbytes API is deprecated for
in-kernel use. It is replaced by "cprng_strong". The current
cprng_fast implementation wraps the existing arc4random
implementation. The current cprng_strong implementation wraps the
new CTR_DRBG implementation. Both interfaces are rekeyed from
the entropy pool automatically at intervals justifiable from best
current cryptographic practice.
In some quick tests, cprng_fast() is about the same speed as
the old arc4randbytes(), and cprng_strong() is about 20% faster
than rnd_extract_data(). Performance is expected to improve.
The AES code in src/crypto/rijndael is no longer an optional
kernel component, as it is required by cprng_strong, which is
not an optional kernel component.
The entropy pool output is subjected to the rngtest tests at
startup time; if it fails, the system will reboot. There is
approximately a 3/10000 chance of a false positive from these
tests. Entropy pool _input_ from hardware random numbers is
subjected to the rngtest tests at attach time, as well as the
FIPS continuous-output test, to detect bad or stuck hardware
RNGs; if any are detected, they are detached, but the system
continues to run.
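As an illustration of the kind of check the FIPS 140-2 statistical tests
perform, here is a sketch of the monobit test on a 20000-bit sample. It is
not the libkern/rngtest.c code; the function names are hypothetical, and the
acceptance bounds are quoted from memory of FIPS 140-2, so they should be
checked against the standard before being relied upon.

```c
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_BYTES 2500	/* 20000 bits per sample */

/* Count the one bits in a 20000-bit sample. */
unsigned
monobit_count(const uint8_t sample[SAMPLE_BYTES])
{
	unsigned ones = 0;

	for (size_t i = 0; i < SAMPLE_BYTES; i++)
		for (uint8_t b = sample[i]; b != 0; b &= (uint8_t)(b - 1))
			ones++;	/* each pass clears the lowest set bit */
	return ones;
}

/* Returns 1 if the sample passes the monobit test, 0 otherwise. */
int
monobit_pass(const uint8_t sample[SAMPLE_BYTES])
{
	unsigned ones = monobit_count(sample);

	/* Assumed bounds: 9725 < ones < 10275. */
	return ones > 9725 && ones < 10275;
}
```

A stuck source that emits only zeros or only ones fails this check at once,
which is the "bad or stuck hardware RNG" case the commit message describes.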
A problem with rndctl(8) is fixed -- data structures with
pointers in arrays are no longer passed to userspace (this
was not a security problem, but rather a major issue for
compat32). A new kernel will require a new rndctl.
The sysctl kern.arandom() and kern.urandom() nodes are hooked
up to the new generators, but the /dev/*random pseudodevices
are not, yet.
Manual pages for the new kernel interfaces are forthcoming.
http://mail-index.netbsd.org/source-changes/2011/10/22/msg028271.html
From the Log:
Log Message:
Various interrupt fixes, mainly:
keep a per-cpu mask of enabled events, and use it to get pending events.
A cpu-specific event (all of them at this time) should never be masked
by another CPU, because it may prevent the target CPU from seeing it
(the clock events all fire at once, for example).
- Make clock MP aware.
- Bring in fixes that bouyer@ brought in via:
cvs rdiff -u -r1.54.6.4 -r1.54.6.5 src/sys/arch/xen/xen/clock.c
Thanks to riz@ for testing on dom0
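The per-CPU mask of enabled events can be pictured with the sketch below.
It assumes that the events a CPU should handle are those that are pending,
not globally masked, and enabled on that CPU; the structure and function
names are hypothetical and do not mirror the actual Xen event channel code.

```c
#include <stdint.h>

#define NCPU 4	/* hypothetical CPU count for the sketch */

/* Hypothetical stand-ins for the shared and per-CPU event bitmaps. */
struct evt_state {
	uint64_t pending;		/* events raised by the hypervisor */
	uint64_t masked;		/* events masked globally */
	uint64_t cpu_enabled[NCPU];	/* events each CPU has enabled */
};

/* Events that the given CPU should process right now. */
uint64_t
events_for_cpu(const struct evt_state *es, unsigned cpu)
{
	return es->pending & ~es->masked & es->cpu_enabled[cpu];
}
```

Because each CPU filters through its own cpu_enabled[] word, one CPU never
has to mask an event that belongs to another CPU in order to ignore it.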
is to provide routines that do as KASSERT(9) says: append a message
to the panic format string when the assertion triggers, with optional
arguments.
Fix call sites to reflect the new definition.
Discussed on tech-kern@. See
http://mail-index.netbsd.org/tech-kern/2011/09/07/msg011427.html
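A userland sketch of an assertion that appends a formatted message with
optional arguments. The macro and function names are hypothetical, this is
not the NetBSD definition, and the failure is recorded in a buffer instead
of calling panic() so that the behaviour can be observed.

```c
#include <stdarg.h>
#include <stdio.h>

char last_failure[256];	/* holds the most recent failure message */

/* Format: assertion "<expr>" failed: <message> */
void
assert_failed(const char *expr, const char *fmt, ...)
{
	va_list ap;
	int n;

	n = snprintf(last_failure, sizeof(last_failure),
	    "assertion \"%s\" failed: ", expr);
	if (n < 0 || (size_t)n >= sizeof(last_failure))
		return;		/* prefix did not fit; keep what we have */
	va_start(ap, fmt);
	vsnprintf(last_failure + n, sizeof(last_failure) - (size_t)n, fmt, ap);
	va_end(ap);
}

/* The variadic part carries the format string and its arguments. */
#define ASSERTMSG_SKETCH(e, ...) \
	((e) ? (void)0 : assert_failed(#e, __VA_ARGS__))
```

A call such as ASSERTMSG_SKETCH(x == 4, "x=%d", x) leaves the buffer
untouched when the condition holds and records both the failed expression
and the offending value when it does not.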
boolean, so checking for "true" with "== 0" is... wrong.
Now xennet(4) should work as expected, and not stay in the InitWait state
(which blocks network communication with the backend).
Thanks to riz@ and sborrill@ for reporting breakage with -current
xennet(4) after my merge.
slightly modified by me to profit from runtime checks for dom0 privileges
instead of using compile time macros (DOM0OPS).
It should now be possible to use pkgsrc's sysutils/xentools inside
a domU to query XenStore entries (or even modify part of it if the domain
has enough rights).
Goal: save/restore support in NetBSD domUs, for i386, i386 PAE and amd64.
Executive summary:
- split all Xen drivers (xenbus(4), grant tables, xbd(4), xennet(4))
in two parts: suspend and resume, and hook them to pmf(9).
- modify pmap so that Xen hypervisor does not cry out loud in case
it finds "unexpected" recursive memory mappings
- provide a sysctl(7), machdep.xen.suspend, to command suspend from
userland via powerd(8). Note: a suspend can only be handled correctly
when dom0 requested it, so provide a mechanism that prevents the
kernel from blindly validating the user's commands
The code is still in experimental state, use at your own risk: restore
can corrupt backend communications rings; this can completely thrash
dom0 as it will loop at a high interrupt level trying to honor
all domU requests.
XXX PAE suspend does not work in amd64 currently, due to (yet again!)
page validation issues with hypervisor. Will fix.
XXX secondary CPUs are not suspended, I will write the handlers
in sync with cherry's Xen MP work.
Tested under i386 and amd64, bear in mind ring corruption though.
No build break expected, GENERICs and XEN* kernels should be fine.
./build.sh distribution still running. In any case: sorry if it does
break for you, contact me directly for reports.
ranges that include the least and the greatest vmem_addr_t. Update
vmem(9) uses throughout the kernel. Slightly expand on the tests in
subr_vmem.c, which still pass. I've been running a kernel with this
patch without any trouble.
(virq_timer_to_evtch, indexed by cpuid) different from the
VIRQ <> event channel one (virq_to_evtch, indexed by event channel ID).
This is fine: fix a "harmless" bug that resulted in the event
channel of VIRQ_TIMER getting lost during bind as it was not stored
in the proper array.
"Harmless" because it is not critical for -current, however in the Xen
save/restore branch this completely cripples restore. Xen clock gets
suspended, but never comes back (fetched channel ID being invalid). Oops.
Add a small comment so we can better see the "get => allocate? => set"
chain of actions when binding/unbinding event channels.
prematurely in case they do, to avoid looping "endlessly" (or at least
a very long time) at IPL_BIO while trying to handle requests.
This should not happen in a nominal scenario, but the ring can get
corrupted for whatever reason (memory errors, domU failures or
exploitation).
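One way to detect a corrupted ring before processing it is to bound the
number of outstanding requests. The sketch below assumes free-running
32-bit producer and consumer indices; RING_SIZE and the function name are
hypothetical, and the actual check in the driver may differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 32u	/* hypothetical number of ring slots */

/*
 * The unsigned difference between the producer and consumer indices is
 * the number of outstanding requests, even after the counters wrap.
 * A healthy ring can never hold more than RING_SIZE of them.
 */
bool
ring_indices_sane(uint32_t req_prod, uint32_t req_cons)
{
	return (uint32_t)(req_prod - req_cons) <= RING_SIZE;
}
```

A backend that sees this check fail can stop handling the ring at once
instead of iterating over billions of bogus requests at IPL_BIO.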
in the block device being opened twice. Fixes port-xen/45158,
although the underlying cause (multiple open of the same device not
properly handled any more) is not fixed.
- make sure to enter the continuation loop at splbio(), and add some
KASSERT() for this.
- When a flush operation is enqueued to the workqueue, make sure the
continuation loop can't be restarted by a previous workqueue
completion or an event. We can't restart it at this point because
the flush event is still recorded as the current I/O.
For this add a xbdback_co_cache_doflush_wait() which acts as a noop;
the workqueue callback will restart the loop once the flush is complete.
Should fix "kernel diagnostic assertion xbd_io->xio_mapped == 0" panics
reported by Jeff Rizzo on port-xen@.
belong to either kern or hw.
Rename machdep.xen_timepush_ticks to xen.timepush_ticks, so it can live
under the same tree as the balloon node, machdep.xen.
ok bouyer@.
easier to follow (and understand). Helped track down a regression
between save/restore xbdback(4) states.
A few minor fixes, which are merely cosmetic:
- call graph is (somewhat) more readable
- rework the xbdback_do_io routine with a switch statement, so as to
trigger a panic() in case an invalid operation passed through the sanity
checks. panic might be overkill here, but I am sure to catch errors in
case it happens.
sys/stdarg.h and expect compiler to provide proper builtins, defaulting
to the GCC interface. lint still has a special fallback.
Reduce abuse of _BSD_VA_LIST_ by defining __va_list by default and
derive va_list as required by standards.
- cpu_load_pmap: perform tlbflush() after xen_set_user_pgd().
- xen_pmap_bootstrap: perform xpq_queue_tlb_flush() in the end.
- pmap_tlb_shootdown: do not check PG_G for Xen.
- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.
- Simplify locking in some pmap(9) modules by removing P->V locking.
- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).
- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.
- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.
Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.
xbdback(4) rather than xbd(4), and use i for identifier separation
(like xvif(4)).
The name is not used outside of event counters (vmstat -i), so
should be transparent to Xen block scripts.
the reboot of PR port-xen/45028
Now that Xen2 is gone, handle FPU context switches the same way as
amd64. This makes all tests in /usr/tests/lib/libc/ieeefp pass.
of the ``nr_segments'' variable.
In cases where we are running domUs with an architecture different from the
dom0 one (for example: 32-bit domUs on a 64-bit dom0), copying segments
with an invalid nr_segments value will lead to the corruption of the
xbdback instance structure and quickly crash the dom0 backend.
Tested under a 64-bit dom0 with 32-bit domUs. No regression observed.
ok bouyer@.
Will be pulled up to -4 and -5.
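The fix amounts to validating nr_segments before using it as a copy length.
A sketch under stated assumptions: the structures, MAX_SEGMENTS and the
function name are simplified, hypothetical stand-ins rather than the Xen
block interface definitions.

```c
#include <stdint.h>
#include <string.h>

#define MAX_SEGMENTS 11	/* hypothetical per-request segment limit */

struct segment {
	uint32_t gref;
	uint8_t  first_sect, last_sect;
};

struct request {
	uint8_t nr_segments;	/* written by the untrusted frontend */
	struct segment seg[MAX_SEGMENTS];
};

/*
 * Copy a request out of shared memory. Returns 0 on success, or -1 if
 * nr_segments is out of range, in which case nothing is copied.
 */
int
copy_request(struct request *dst, const struct request *src)
{
	uint8_t n = src->nr_segments;	/* read once, then validate */

	if (n > MAX_SEGMENTS)
		return -1;
	dst->nr_segments = n;
	memcpy(dst->seg, src->seg, n * sizeof(src->seg[0]));
	return 0;
}
```

Reading the count into a local variable first matters: the frontend shares
the memory and could change the field between the check and the copy.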
possible between the decrement and the fetch of the ref counter value,
hence we might call the G/C routine twice. Not good.
Also remove the 'volatile' attribute, refcnt is only used by xbdi_put/_get
and should not be exposed anywhere else (except for initialization).
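The race goes away when the decrement and the fetch happen in one atomic
step. The sketch below uses C11 atomics to stay self-contained; kernel code
would use the atomic_ops(3) primitives instead, and the structure and
function names here are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct instance {
	atomic_uint refcnt;
};

void
instance_get(struct instance *xi)
{
	atomic_fetch_add(&xi->refcnt, 1);
}

/*
 * Drop a reference. atomic_fetch_sub() returns the value held before
 * the decrement in one indivisible step, so exactly one caller can
 * observe the 1 -> 0 transition. Returns true for that caller, which
 * is then the only one allowed to destroy the instance.
 */
bool
instance_put(struct instance *xi)
{
	unsigned prev = atomic_fetch_sub(&xi->refcnt, 1);

	assert(prev > 0);	/* underflow means a get/put imbalance */
	return prev == 1;
}
```

With a separate decrement and read, two threads dropping the last two
references could both read zero and both run the destructor.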
functions. The frontend watch function is easier to read, and mixing
switch() with goto's error paths is rather error-prone.
While here, sprinkle some aprint_*.
Tested under amd64 dom0 with i386 PAE and amd64 domUs.
XenbusStateConnected mode. Under rare occasions, the xenbus watcher
can fire multiple times, overwriting the I/O ring memory mappings with
invalid values. This will lead sooner or later to dom0 panic().
Will ask for pullup. FWIW, xbdback(4) is not affected.
that talks with Xenstore to query backend's information. Resuming is now
performed just after xennet(4) attachment instead of waiting for backend
to announce its features in Xenstore and change its state.
This fixes the race observed by Urban Boquist when the domU boots with
root on NFS.
FWIW, the boot code (when root is NFS-backed) can init() the xennet(4)
interface very early: it tried to access ifnet structures that were not
yet allocated.
Will ask for a pullup. Thanks to Urban for reporting the issue and
investigating it. Confirmed fixed. No regression observed by me for
dynamic attach/detach of xvif(4) and xennet(4) interfaces.
See also http://mail-index.netbsd.org/port-xen/2011/04/18/msg006647.html
- turns balloon into a driver that attaches to xenbus(4). This allows
disabling the functionality either at compile time or boot time via
userconf(4). Driver can implement detach or pmf(9) hooks if deemed
necessary.
- keeps Cherry's locking model, but simplify it a bit. There is now
only one target value serialized inside balloon, we do not feed back an
alternative value to Xenstore (clients are not expected to see its value
evolve behind their back, and can't do much about that either)
- implements min threshold; this is an admin-settable value that tells
driver to "not balloon below this threshold." This can be used by domain
to keep memory reservations, useful if activity is expected in the near
future.
- in addition to min threshold, the driver implements internally a
safeguard value (uvmexp.freemin + 1MiB), so that admin cannot
inadvertently set min to a very low value forcing domain into heavy
memory pressure and swapping.
- create the sysctl(8) kern.xen.balloon tree. 4 nodes are actually present
(values are in KiB):
- min: (rw) an admin-settable value that prevents ballooning below this
mark
- max: (ro) the maximum size for reservation, as set by xm(1) mem-max.
- current: (ro) the current reservation for domain.
- target: (rw) the targeted reservation for domain.
- fix a few limitations here and there, most notably the max_reservation
hypercall, and KiB vs pages representations at interfaces.
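The interplay between the min threshold, the internal safeguard and the
maximum reservation can be sketched as a clamp. This assumes the driver
clamps an out-of-range target rather than rejecting it, and that the lower
bound never exceeds max; the function and parameter names are hypothetical.

```c
#include <stdint.h>

#define SAFEGUARD_SLACK_KIB 1024u	/* the 1 MiB margin, in KiB */

/*
 * Clamp a requested balloon target (all values in KiB). freemin_kib
 * stands in for uvmexp.freemin converted to KiB, min_kib is the
 * admin-set minimum, max_kib is the maximum reservation.
 */
uint64_t
balloon_clamp_target(uint64_t target_kib, uint64_t min_kib,
    uint64_t max_kib, uint64_t freemin_kib)
{
	uint64_t floor_kib = freemin_kib + SAFEGUARD_SLACK_KIB;

	if (floor_kib < min_kib)
		floor_kib = min_kib;	/* honour the larger lower bound */
	if (target_kib < floor_kib)
		return floor_kib;
	if (target_kib > max_kib)
		return max_kib;
	return target_kib;
}
```

Even with min set to 0, a target of a few hundred KiB is raised to the
safeguard value, which is the protection against heavy memory pressure
described in the list above.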
The driver is still turned off by default. Enabling it would need more
approval, especially from bouyer@, cherry@ and cegger@.
FWIW: tested it for two days, from an amd64 dom0 (with dom0 ballooning
enabled for xend), and a bunch of domUs. Did not notice anything suspicious.
XXX it still has one big limitation: it cannot hotplug memory pages in
uvm(9) if they were not present beforehand. Example: ballooning above
physmem will give more pages to domain but it won't use it to serve
allocations, unless we teach uvm(9) how to handle the extra pages.
pci_find_rom(), pci_intr_map(9), pci_enumerate_bus(), nor the match
predicate passed to pciide_compat_intr_establish() should ever modify
their pci_attach_args argument, so make their pci_attach_args arguments
const and deal with the fallout throughout the kernel.
For the most part, these changes add a 'const' where there was no
'const' before, however, some drivers and MD code used to modify
pci_attach_args. Now those drivers either copy their pci_attach_args
and modify the copy, or refrain from modifying pci_attach_args:
Xen: according to Manuel Bouyer, writing to pci_attach_args in
pci_intr_map() was a leftover from Xen 2. Probably a bug. I
stopped writing it. I have not tested this change.
siside(4): sis_hostbr_match() needlessly wrote to pci_attach_args.
Probably a bug. I use a temporary variable. I have not tested this
change.
slide(4): sl82c105_chip_map() overwrote the caller's pci_attach_args.
Probably a bug. Use a local pci_attach_args. I have not tested
this change.
viaide(4): via_sata_chip_map() and via_sata_chip_map_new() overwrote the
caller's pci_attach_args. Probably a bug. Make a local copy of the
caller's pci_attach_args and modify the copy. I have not tested
this change.
While I'm here, make pci_mapreg_submap() static.
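The "copy, then modify the copy" pattern applied to those drivers can be
sketched as follows. The structure is a heavily trimmed, hypothetical
stand-in for struct pci_attach_args, and the function names and the ID
value are illustrative only.

```c
#include <stdint.h>

/* Hypothetical, trimmed stand-in for struct pci_attach_args. */
struct attach_args {
	uint32_t id;
	uint32_t class_reg;
};

/* A helper that only reads its argument, hence the const qualifier. */
int
chip_matches(const struct attach_args *pa)
{
	return (pa->id & 0xffffu) == 0x1106u;	/* example ID value */
}

/*
 * A caller that needs a tweaked view copies the arguments and edits
 * the copy; the caller-owned original is never written.
 */
int
match_with_override(const struct attach_args *pa, uint32_t new_id)
{
	struct attach_args local = *pa;	/* private, writable copy */

	local.id = new_id;
	return chip_matches(&local);
}
```

With the const qualifier in place, an accidental write through pa becomes a
compile-time error rather than a silent change to the caller's state.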
With these changes in place, I have tested the compilation of these
kernels:
alpha GENERIC
amd64 GENERIC XEN3_DOM0
arc GENERIC
atari HADES MILAN-PCIIDE
bebox GENERIC
cats GENERIC
cobalt GENERIC
evbarm-eb NSLU2
evbarm-el ADI_BRH ARMADILLO9 CP3100 GEMINI GEMINI_MASTER GEMINI_SLAVE GUMSTIX
HDL_G IMX31LITE INTEGRATOR IQ31244 IQ80310 IQ80321 IXDP425 IXM1200
KUROBOX_PRO LUBBOCK MARVELL_NAS NAPPI SHEEVAPLUG SMDK2800 TEAMASA_NPWR
TEAMASA_NPWR_FC TS7200 TWINTAIL ZAO425
evbmips-el AP30 DBAU1500 DBAU1550 MALTA MERAKI MTX-1 OMSAL400 RB153 WGT624V3
evbmips64-el XLSATX
evbppc EV64260 MPC8536DS MPC8548CDS OPENBLOCKS200 OPENBLOCKS266
OPENBLOCKS266_OPT P2020RDB PMPPC RB800 WALNUT
hp700 GENERIC
i386 ALL XEN3_DOM0 XEN3_DOMU
ibmnws GENERIC
macppc GENERIC
mvmeppc GENERIC
netwinder GENERIC
ofppc GENERIC
prep GENERIC
sandpoint GENERIC
sgimips GENERIC32_IP2x
sparc GENERIC_SUN4U KRUPS
sparc64 GENERIC
As of Sun Apr 3 15:26:26 CDT 2011, I could not compile these kernels
with or without my patches in place:
### evbmips-el GDIUM
nbmake: nbmake: don't know how to make /home/dyoung/pristine-nbsd/src/sys/arch/mips/mips/softintr.c. Stop
### evbarm-el MPCSA_GENERIC
src/sys/arch/evbarm/conf/MPCSA_GENERIC:318: ds1672rtc*: unknown device `ds1672rtc'
### ia64 GENERIC
/tmp/genassym.28085/assym.c: In function 'f111':
/tmp/genassym.28085/assym.c:67: error: invalid application of 'sizeof' to incomplete type 'struct pcb'
/tmp/genassym.28085/assym.c:76: error: dereferencing pointer to incomplete type
### sgimips GENERIC32_IP3x
crmfb.o: In function `crmfb_attach':
crmfb.c:(.text+0x2304): undefined reference to `ddc_read_edid'
crmfb.c:(.text+0x2304): relocation truncated to fit: R_MIPS_26 against `ddc_read_edid'
crmfb.c:(.text+0x234c): undefined reference to `edid_parse'
crmfb.c:(.text+0x234c): relocation truncated to fit: R_MIPS_26 against `edid_parse'
crmfb.c:(.text+0x2354): undefined reference to `edid_print'
crmfb.c:(.text+0x2354): relocation truncated to fit: R_MIPS_26 against `edid_print'
Although the hypercall arguments (like struct sysctl_readconsole) are not
compatible between different XEN_SYSCTL_INTERFACE_VERSIONs (one of the
reasons why the sysctl calls should only be used by xentools directly),
it's still practical to have when one wants to query Xen's dmesg from
ddb(4) in case of a panic.
Note: additional code is needed for readconsole() functionality, but adding
the hypercall should not cause any harm.
- Use free_otherend_details() instead of calling free() on xbusd_otherend.
- rename talk_to_otherend() to watch_otherend(). We register a watch for
changes in the otherend device "state"; we are not really talking to it.
- add missing prototypes.
not in HEAD:
- use uvm_km_alloc() instead of kmem_alloc() to enforce alignment when
allocating p2m_frame pages (xentools can only deal with page-aligned
addresses)
- do not use paddr_t for p2m_frame_list_list with PAE, xentools expects
32-bit PFNs even with 64-bit PTEs.
Required to make ``xm dump-core'' work as expected.
call it for different levels (L1 => L4).
Replace all calls to xpq_queue_pin_table(...) in MD code with these new
functions, with proper #ifdef'ing depending on $MACHINE.
Rationale:
- only one function to modify for logging
- pushes responsibility to caller for choosing the proper pin level, rather
than Xen internal functions; this makes the pin level explicit rather than
implicit.
Boot tested for dom0 i386/amd64, PAE included. No functional change intended.
Honour this for dependency processing in bsd.dep.mk. Switch i386 and
amd64 assembly to use ISO C90 preprocessor concat and drop the
-traditional-cpp on this platform.
handled by the hypervisor and all CPUs are running when the dom0 is started.
In addition, we don't have a reliable way to determine the boot CPU as
- we may not be running on the boot CPU
- we don't have access to the lapic id
So simplify by ignoring the information and assign phycpu_info_primary to the
first attached CPU.