Old qemu versions required that 1st s/g entry is the header.
Since QEMU 1.5, patchset titled "virtio-net: iovec handling cleanup"
removed this limitation but a feature bit is needed so guests know it's
safe to lay out header differently.
This patch applies on top and adds such a feature bit to QEMU.
It is set by default for virtio-net.
virtio net header inline with the data is beneficial
for latency and small packet bandwidth - guest driver
code utilizing this feature has been acked but missed 3.11
by a narrow margin, it's pending for 3.12.
This feature bit is cleared by default when compatibility with old
machine types is requested.
Other performance-sensitive devices (blk and scsi)
don't yet support arbitrary s/g layouts, so
we only set this bit for virtio-net for now.
There are plans to allow arbitrary layouts there, but
no code has been posted yet.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Currently macvtap based macvlan device is working in promiscuous
mode, we want to implement mac-programming over macvtap through
Libvirt for better performance.
Design:
QEMU notifies Libvirt when rx-filter config is changed in guest,
then Libvirt query the rx-filter information by a monitor command,
and sync the change to macvtap device. Related rx-filter config
of the nic contains main mac, rx-mode items and vlan table.
This patch adds a QMP event to notify management of rx-filter change,
and adds a monitor command for management to query rx-filter
information.
Test:
If we repeatedly add/remove vlan, and change macaddr of vlan
interfaces in guest by a loop script.
Result:
The events will flood the QMP client(management), management takes
too much resource to process the events.
Event_throttle API (set rate to 1 ms) can avoid the events to flood
QMP client, but it could cause an unexpected delay (~1ms), guests
guests normally expect rx-filter updates immediately.
So we use a flag for each nic to avoid events flooding, the event
is emitted once until the query command is executed. The flag
implementation could not introduce unexpected delay.
There maybe exist an uncontrollable delay if we let Libvirt do the
real change, guests normally expect rx-filter updates immediately.
But it's another separate issue, we can investigate it when the
work in Libvirt side is done.
Michael S. Tsirkin: tweaked to enable events on start
Michael S. Tsirkin: fixed not to crash when no id
Michael S. Tsirkin: fold in patch:
"additional fixes for mac-programming feature"
Amos Kong: always notify QMP client if mactable is changed
Amos Kong: return NULL list if no net client supports rx-filter query
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Amos Kong <akong@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
g_hash_table_get_keys() was only introduced in glib 2.14, and we're
still targeting a minimum version of 2.12. Rewrite the offending
code (introduced in commit 721fae1) to use g_hash_table_foreach()
to build the list of keys.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Tested-by: Laurent Desnogues <laurent.desnogues@gmail.com>
Tested-by: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
Message-id: 1372678819-8633-1-git-send-email-peter.maydell@linaro.org
LPAE CPUs have more potentially valid bits in the TTBCR, and so the
simple masking out of invalid bits is no longer sufficient to obtain
the base address width field of the register, which is what we use to
precalculate c2_mask and c2_base_mask. Explicitly extract the
relevant register field rather than simply shifting by the register
value.
This bug would have had no ill effects in practice, since if the
EAE bit (TTBCR bit 31) is set then we don't use the precalculated
masks, and if EAE is zero then bits 30..3 are all UNK/SBZP, so
well-behaved guests won't set them. However the shift is undefined
behaviour, so we should avoid it.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1372347527-4428-1-git-send-email-peter.maydell@linaro.org
Allow for defining const opaque data in ARM CP register definitions by
setting .opaque = foo. If non null opaque is passed into
define_one_arm_cp_reg_with_opaque then that opaque will take
precedence, otherwise if null opaque is passed, the original opaque
data will be used.
Signed-off-by: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
Message-id: cf0a3ac3438d97464240db9f5f4ef1585cbc1d77.1373429432.git.peter.crosthwaite@xilinx.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Unimplemented registers in the cp15, CRn=0, opc1=0, CRm=0 space default
to aliasing the MIDR register. Set all registers in the space to access
MIDR by default.
Signed-off-by: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
Message-id: 6127846712b7ad2727354a4f5e1d809451f1e859.1373429432.git.peter.crosthwaite@xilinx.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The if block detecting OMAP/StrongARM modifies the id_cp_reginfo
.access fields in place. So there is no need to replicate the call
to define_arm_cp_reg(). Dropped, and let the OMAP case fall through
to the normal behaviour after the in-place modification.
Signed-off-by: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
Message-id: 72aae9b8ebbc9a76d2b06faf8666ef8a4b34b92a.1373429432.git.peter.crosthwaite@xilinx.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The ARMv8 SEVL instruction is in the architectural hint space already
emulated as nop. This makes the decoding of SEVL explicit for clarity.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Message-id: 1370606786-5650-3-git-send-email-mans@mansr.com
[PMM: added 'SEVL' to the TODO comment]
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
This adds support for the ARMv8 load acquire/store release instructions.
Since qemu does nothing special for memory barriers, these can be
emulated like their non-acquire/release counterparts.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The Calxeda ECX-2000 chip (aka. Midway) is model-wise quite similar
to the Highbank. The most prominent difference is the Cortex-A15 CPU
core in it, together with the associated core peripherals.
Add a new ARM machine type called "midway".
Move the L2 cache controller device into the Highbank specific part,
since Midway does not have (and need) it.
Signed-off-by: Andre Przywara <andre.przywara@calxeda.com>
Message-id: 1373026897-12085-3-git-send-email-andre.przywara@calxeda.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
To allow the modelling of machines similar to Calxeda Highbank,
introduce a parameter to the init function and call it from a
wrapper. This allows to tweak the definition for individual machines
later on.
Signed-off-by: Andre Przywara <andre.przywara@calxeda.com>
Message-id: 1373026897-12085-2-git-send-email-andre.przywara@calxeda.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The A15 Versatile Express board can remap a variety of things at address
0. We don't currently emulate the Serial Configuration Controller which
is how the guest can control this remapping, but we can provide the
initial default mapping of the first flash device into this space.
In particular this allows QEMU to boot flash images such as UEFI which
expect to include an exception vector table.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Tested-by: Grant Likely <grant.likely@linaro.org>
Message-id: 1373374180-19884-1-git-send-email-peter.maydell@linaro.org
The drqbmp field of struct soc_dma_s is a uint64_t; however several
places in the code attempt to set bits in it using "(1 << drq)",
which will fail if drq is large enough that the 1 bit gets shifted
off the top of a 32 bit integer. Change these to "(1ULL << drq)" so
that the promotion to 64 bit happens before the shift rather than
afterwards.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1372423919-5669-1-git-send-email-peter.maydell@linaro.org
Add a cast to avoid potentially shifting into the sign bit of
a signed value, which is undefined behaviour in C.
(Detected with clang's -fsanitize=undefined.)
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1372341831-4264-1-git-send-email-peter.maydell@linaro.org
The a15mpcore device claims that its default value for num-irq
is the number of interrupts used by the A15MP in the vexpress-a15
board. However that chip has 128 external interrupts, not 64.
Since there is only one A15 based model in QEMU currently, we
can fix this by simply changing the default value.
This error was causing recent (3.10) Linux kernels to print
warnings/backtraces when the number of interrupts reported
by the GIC was smaller than an interrupt number they wanted
to use.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1373032481-15280-1-git-send-email-peter.maydell@linaro.org
Signed-off-by: Mans Rullgard <mans@mansr.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
commit 1db8b5efe0 introduced an issue
where QEMU would segfault if you have an unattached Cadence UART.
Fix by guarding the flush-on-reset logic on there being a qemu_chr
attachment.
Reported-by: Soren Brinkmann <soren.brinkmann@xilinx.com>
Signed-off-by: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
Tested-by: Soren Brinkmann <soren.brinkmann@xilinx.com>
Message-id: 9009578ee10a50d994b2e10aa2840d73765f5968.1370577272.git.peter.crosthwaite@xilinx.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
AHCI couldn't cope with asynchronous commands that aren't doing DMA, it
simply wouldn't complete them. Due to the bug fixed in commit f68ec837,
FLUSH commands would seem to have completed immediately even if they
were still running on the host. After the commit, they would simply hang
and never unset the BSY bit, rendering AHCI unusable on any OS sending
flushes.
This patch adds another callback for the completion of asynchronous
commands. This is what AHCI really wants to use for its command
completion logic rather than an DMA completion callback.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
If bdrv_flush_all() returns an error, there is an inconsistency in the
view of an image file between the source and the destination host.
Completing the migration would lead to corruption. Better abort
migration in this case.
To reproduce this case, try the following (ensures that there is
something to flush, and then fails that flush):
$ qemu-img create -f qcow2 test.qcow2 1G
$ cat blkdebug.cfg
[inject-error]
event = "flush_to_os"
errno = "5"
$ qemu-system-x86_64 -hda blkdebug:blkdebug.cfg:test.qcow2 -monitor stdio
(qemu) qemu-io ide0-hd0 "write 0 4k"
(qemu) migrate ...
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
If flushing the block devices fails, return an error. The VM is stopped
anyway.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
bdrv_flush() can fail, and bdrv_flush_all() should return an error as
well if this happens for a block device. It returns the first error
return now, but still at least tries to flush the remaining devices even
in error cases.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
One of the major reasons for doing something new for -blockdev and
blockdev-add was that the old block layer code parses filenames instead
of just taking them literally. So we should really leave it untouched
when it's passing using the new interfaces (like -drive
file.filename=...).
This allows opening relative file names that contain a colon.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Make "drive_backup" available on the HMP monitor:
drive_backup [-n] [-f] device target [format]
The -n flag requests QEMU to reuse the image found in new-image-file,
instead of recreating it from scratch.
The -f flag requests QEMU to copy the whole disk, so that the result
does not need a backing file. Note that this flag *must* currently be
passed since the other sync modes ('none' and 'top') have not been
implemented yet. Requiring it ensures that "drive_backup" behaves like
"drive_mirror".
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The drive-backup command is similar to the drive-mirror command, except
no guest data written after the command executes gets copied. Add a
sync mode argument which determines whether the entire disk is copied,
just allocated clusters, or only clusters being written to by the guest.
Currently only sync mode 'full' is supported - it copies the entire disk.
For read-only point-in-time snapshots we may only need sync mode 'none'
since the target can be a qcow2 file using the guest's disk as its
backing file (no need to copy the entire disk). Finally, sync mode
'top' is useful if we wish to preserve the backing chain.
Note that this patch just adds the sync mode argument to drive-backup.
It does not implement sync modes 'top' or 'none'. This patch is
necessary so we can add a drive-backup HMP command that behaves like the
existing drive-mirror HMP command and takes a sync mode.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The 1980 epoch is used by the ARC PALcode for NT. But we're emulating
a system using the SRM PALcode. Using the proper epoch results in less
confusion in the guest userland.
Signed-off-by: Richard Henderson <rth@twiddle.net>
The memory and i/o core now support passing 64-bit accesses along
from the guest, so we no longer need to emulate them.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Honor the implementation maximum access size, and at least check
the minimum access size.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Not really correct, but we don't implement all of the random devices
that the kernel looks for. This is good enough to keep us booting.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Advancements in the ioport subsystem mean that we need no longer
thunk memory-mapped i/o through the system-io address space.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Setting it to LE forces a byte swap when host != guest endian but
this makes no sense at all.
Herve made the suggestion upon observing that word writes/reads
were broken into byte writes/reads in such a way as to assume
devices are interpret registers as LE.
However, even if this were a problem, marking the region as LE is
not useful because what's essentially happening here is that LE is
open coded. So by marking it LE in MemoryRegionOps, we're doing a
superflous swap.
Now, the portio code is suspicious to begin with. The dispatch
layer really has no purpose in splitting I/O requests in the first
place...
Cc: Hervé Poussineau <hpoussin@reactos.org>
Cc: Alex Graf <agraf@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If a user chooses to turn on the auto-converge migration capability
these changes detect the lack of convergence and throttle down the
guest. i.e. force the VCPUs out of the guest for some duration
and let the migration thread catchup and help converge.
Verified the convergence using the following :
- Java Warehouse workload running on a 20VCPU/256G guest(~80% busy)
- OLTP like workload running on a 80VCPU/512G guest (~80% busy)
Sample results with Java warehouse workload : (migrate speed set to 20Gb and
migrate downtime set to 4seconds).
(qemu) info migrate
capabilities: xbzrle: off auto-converge: off <----
Migration status: active
total time: 1487503 milliseconds
expected downtime: 519 milliseconds
transferred ram: 383749347 kbytes
remaining ram: 2753372 kbytes
total ram: 268444224 kbytes
duplicate: 65461532 pages
skipped: 64901568 pages
normal: 95750218 pages
normal bytes: 383000872 kbytes
dirty pages rate: 67551 pages
---
(qemu) info migrate
capabilities: xbzrle: off auto-converge: on <----
Migration status: completed
total time: 241161 milliseconds
downtime: 6373 milliseconds
transferred ram: 28235307 kbytes
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64946416 pages
skipped: 64903523 pages
normal: 7044971 pages
normal bytes: 28179884 kbytes
Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
The auto-converge migration capability allows the user to specify if they
choose live migration seqeunce to automatically detect and force convergence.
Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Introduce an asynchronous version of run_on_cpu() i.e. the caller
doesn't have to block till the call back routine finishes execution
on the target vcpu.
Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
# By Alexander Graf (16) and others
# Via Alexander Graf
* agraf/ppc-for-upstream: (22 commits)
PPC: dbdma: Support more multi-issue DMA requests
PPC: Add timer handler for newworld mac-io
PPC: dbdma: Support unaligned DMA access
PPC: dbdma: Wait for DMA until we have data
PPC: dbdma: Move processing to io
PPC: dbdma: macio: Add DMA callback
PPC: dbdma: Move static bh variable to device struct
PPC: dbdma: Introduce kick function
PPC: dbdma: Move defines into header file
PPC: dbdma: Allow new commands in RUN state
PPC: dbdma: Fix debug print
PPC: Mac: Add debug prints in macio and dbdma code
PPC: dbdma: Replace tabs with spaces
PPC: Macio: Replace tabs with spaces
PPC: g3beige: Move secondary IDE bus to mac-io
PPC: Mac: Fix guest exported tbfreq values
target-ppc: Add POWER8 v1.0 CPU model
pseries: move interrupt controllers to hw/intc/
spapr: Respect -bios command line option for SLOF
spapr: Use named enum for function remove_hpte
...
Message-id: 1373562085-29728-1-git-send-email-agraf@suse.de
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
A DMA request can happen for data that hasn't been completely been
provided by the IDE core yet. For example
- DBDMA request for 0x1000 bytes
- IDE request for 1 sector
- DBDMA wants to read 0x1000 bytes (8 sectors) from bdrv
- breakage
Instead, we should truncate our bdrv request to the maximum number
of sectors we're allowed to read at that given time. Once that transfer
is through, we will fall into our recently introduced waiting logic.
- DBDMA requests for 0x1000 bytes
- IDE request for 1 sector
- DBDMA wants to read MIN(0x1000, 1 * 512) bytes
- DBDMA finishes reading, indicates to IDE core that transfer is complete
- IDE request for 7 sectors
- DBDMA finishes the DMA
Reported-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Signed-off-by: Alexander Graf <agraf@suse.de>
Mac OS X accesses fancy timer registers inside of the mac-io on bootup.
These really should be ticking at the mac-io bus frequency, but I don't
see anyone upset when we just make them as fast as we want to.
With this patch on top of my previous patch queue and latest OpenBIOS
I am able to boot Mac OS X 10.4 with -M mac99.
Signed-off-by: Alexander Graf <agraf@suse.de>
The DBDMA engine really just reads bytes from a producing device (IDE
in our case) and shoves these bytes into memory. It doesn't care whether
any alignment takes place or not.
Our code today however assumes that block accesses always happen on
sector (512 byte) boundaries. This is a fair assumption for most cases.
However, Mac OS X really likes to do unaligned, incomplete accesses
that it finishes with the next DMA request.
So we need to read / write the unaligned bits independent of the actual
asynchronous request, because that one can only handle 512-byte-aligned
data. We also need to cache these unaligned sectors until the next DMA
request, at which point the data might be successfully flushed from the
pipe.
Signed-off-by: Alexander Graf <agraf@suse.de>
We should only start processing DMA requests when we have data to process.
Hold off working through the DMA shuffling until the IDE core told us that
it's ready.
This is required because the guest can program the DMA engine or the IDE
transfer first. Both are legal.
Signed-off-by: Alexander Graf <agraf@suse.de>
Soon we will introduce intermediate processing pauses which will
allow the bottom half to restart a DMA request that couldn't be
fulfilled yet.
For that to work, move the processing variable into the io struct
which is what DMA providers work with.
While touching it, also change it into a bool
Signed-off-by: Alexander Graf <agraf@suse.de>
We need to know when the IDE core starts a DMA transfer. Add a notifier
function so we have the chance to start transmitting data.
Signed-off-by: Alexander Graf <agraf@suse.de>
The DBDMA controller has a bottom half to asynchronously process DMA
request queues.
This bh was stored as a gross static variable. Move it into the device
struct instead.
While at it, move all users of it to the new generic kick function.
Signed-off-by: Alexander Graf <agraf@suse.de>
The DBDMA engine really is running all the time, waiting for input. However
we don't want to waste cycles constantly polling.
So introduce a kick function that data providers can call to notify the
DBDMA controller of new input.
Signed-off-by: Alexander Graf <agraf@suse.de>
We usually keep struct and constant definitions in header files. Move
them there to stay consistent and to make access to fields easier.
Signed-off-by: Alexander Graf <agraf@suse.de>
The DBDMA controller can not change its command stream while it's
actively streaming data, true. But the fact that it's in RUN state
doesn't actually indicate anything. It could just as well be in
WAIT while in RUN. And then it's legal to change commands.
This fixes a real world issue I've encountered with Mac OS X.
Signed-off-by: Alexander Graf <agraf@suse.de>
There was a debug print that didn't compile for me because the format
and the arguments weren't in sync. Fix it up.
Signed-off-by: Alexander Graf <agraf@suse.de>
The macio code is basically undebuggable as it stands today, with no
debug prints anywhere whatsoever. DBDMA was better, but I needed a
few more to create reasonable logs that tell me where breakage is.
Add a DPRINTF macro in the macio source file and add a bunch of debug
prints that are all disabled by default of course.
Signed-off-by: Alexander Graf <agraf@suse.de>