cleanup qemu docs
parent 1f154abfa6
commit 326a9a5fba
|
@ -1,104 +0,0 @@
|
|||
/*
 * This model describes the interaction between aio_set_dispatching()
 * and aio_notify().
 *
 * Author: Paolo Bonzini <pbonzini@redhat.com>
 *
 * This file is in the public domain.  If you really want a license,
 * the WTFPL will do.
 *
 * To simulate it:
 *     spin -p docs/aio_notify.promela
 *
 * To verify it:
 *     spin -a docs/aio_notify.promela
 *     gcc -O2 pan.c
 *     ./a.out -a
 */

#define MAX   4
#define LAST  (1 << (MAX - 1))
#define FINAL ((LAST << 1) - 1)

bool dispatching;
bool event;

int req, done;

active proctype waiter()
{
    int fetch, blocking;

    do
        :: done != FINAL -> {
            // Computing "blocking" is separate from execution of the
            // "bottom half"
            blocking = (req == 0);

            // This is our "bottom half"
            atomic { fetch = req; req = 0; }
            done = done | fetch;

            // Wait for a nudge from the other side
            do
                :: event == 1 -> { event = 0; break; }
                :: !blocking  -> break;
            od;

            dispatching = 1;

            // If you are simulating this model, you may want to add
            // something like this here:
            //
            //      int foo; foo++; foo++; foo++;
            //
            // This only wastes some time and makes it more likely
            // that the notifier process hits the "fast path".

            dispatching = 0;
        }
        :: else -> break;
    od
}

active proctype notifier()
{
    int next = 1;
    int sets = 0;

    do
        :: next <= LAST -> {
            // generate a request
            req = req | next;
            next = next << 1;

            // aio_notify
            if
                :: dispatching == 0 -> sets++; event = 1;
                :: else             -> skip;
            fi;

            // Test both synchronous and asynchronous delivery
            if
                :: 1 -> do
                            :: req == 0 -> break;
                        od;
                :: 1 -> skip;
            fi;
        }
        :: else -> break;
    od;
    printf("Skipped %d event_notifier_set\n", MAX - sets);
}
|
||||
|
||||
#define p (done == FINAL)

never {
    do
        :: 1                      // after an arbitrarily long prefix
        :: p -> break             // p becomes true
    od;
    do
        :: !p -> accept: break    // it then must remain true forever after
    od
}
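The request bookkeeping in the model above can also be traced in plain C. This is an illustrative sketch only: it runs one fixed, single-threaded interleaving, so it checks none of the concurrency that spin explores. The helper names post_request(), bottom_half() and run_sequential_schedule() are invented for this example and do not appear in QEMU.

```c
#include <stdbool.h>

#define MAX   4
#define LAST  (1 << (MAX - 1))      /* 8: the last request bit */
#define FINAL ((LAST << 1) - 1)     /* 15: all MAX request bits set */

/* Shared state of the model; plain globals because this sketch is
 * single-threaded. */
static int req, done;

/* The notifier's "generate a request" step: post one request bit. */
void post_request(int bit)
{
    req |= bit;
}

/* The waiter's "bottom half": take every pending request and record it.
 * The Promela model performs the fetch-and-clear inside atomic { }. */
void bottom_half(void)
{
    int fetch = req;
    req = 0;
    done |= fetch;
}

/* Run one fixed schedule: post each request, then service it at once.
 * Returns true if every request bit ended up in done. */
bool run_sequential_schedule(void)
{
    req = 0;
    done = 0;
    for (int next = 1; next <= LAST; next <<= 1) {
        post_request(next);
        bottom_half();
    }
    return done == FINAL;
}
```

With MAX set to 4 the notifier posts the bits 1, 2, 4 and 8, so FINAL is 15; the never claim in the model states that once done reaches that value it stays there.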
|
|
@ -1,352 +0,0 @@
|
|||
CPUs perform independent memory operations effectively in random order,
|
||||
but this can be a problem for CPU-CPU interaction (including interactions
|
||||
between QEMU and the guest). Multi-threaded programs use various tools
|
||||
to instruct the compiler and the CPU to restrict the order to something
|
||||
that is consistent with the expectations of the programmer.
|
||||
|
||||
The most basic tool is locking. Mutexes, condition variables and
|
||||
semaphores are used in QEMU, and should be the default approach to
|
||||
synchronization. Anything else is considerably harder, but it's
|
||||
also justified more often than one would like. The two tools that
|
||||
are provided by qemu/atomic.h are memory barriers and atomic operations.
|
||||
|
||||
Macros defined by qemu/atomic.h fall in three camps:
|
||||
|
||||
- compiler barriers: barrier();
|
||||
|
||||
- weak atomic access and manual memory barriers: atomic_read(),
|
||||
atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_read_barrier_depends();
|
||||
|
||||
- sequentially consistent atomic access: everything else.
|
||||
|
||||
|
||||
COMPILER MEMORY BARRIER
|
||||
=======================
|
||||
|
||||
barrier() prevents the compiler from moving the memory accesses on either
|
||||
side of it to the other side. The compiler barrier has no direct effect
|
||||
on the CPU, which may then reorder things however it wishes.
|
||||
|
||||
barrier() is mostly used within qemu/atomic.h itself. On some
|
||||
architectures, CPU guarantees are strong enough that blocking compiler
|
||||
optimizations already ensures the correct order of execution. In this
|
||||
case, qemu/atomic.h will reduce stronger memory barriers to simple
|
||||
compiler barriers.
|
||||
|
||||
Still, barrier() can be useful when writing code that can be interrupted
|
||||
by signal handlers.
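The signal handler case can be sketched as follows. This is a minimal example under stated assumptions: the barrier() definition shown is the usual GCC/Clang idiom (an empty asm statement with a "memory" clobber) and is assumed rather than copied from qemu/atomic.h, and the names payload, ready, seen, on_signal() and publish() are invented for the illustration.

```c
#include <signal.h>

/* Assumed GCC/Clang compiler-only barrier: an empty asm statement with a
 * "memory" clobber.  It emits no CPU instruction. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static int payload;                       /* data handed to the handler */
static volatile sig_atomic_t ready;       /* flag the handler checks */
static volatile sig_atomic_t seen = -1;   /* what the handler observed */

/* Hypothetical handler: it only trusts payload once ready is set. */
void on_signal(int signo)
{
    (void)signo;
    if (ready) {
        seen = payload;
    }
}

/* Publish payload for a handler that interrupts this same thread. */
void publish(int value)
{
    payload = value;
    barrier();      /* keep the compiler from sinking the store below */
    ready = 1;
}
```

Because the handler runs on the same CPU as the code it interrupts, only compiler reordering matters here, which is why a compiler barrier is enough and no CPU barrier is used.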
|
||||
|
||||
|
||||
SEQUENTIALLY CONSISTENT ATOMIC ACCESS
|
||||
=====================================
|
||||
|
||||
Most of the operations in the qemu/atomic.h header ensure *sequential
|
||||
consistency*, where "the result of any execution is the same as if the
|
||||
operations of all the processors were executed in some sequential order,
|
||||
and the operations of each individual processor appear in this sequence
|
||||
in the order specified by its program".
|
||||
|
||||
qemu/atomic.h provides the following set of atomic read-modify-write
|
||||
operations:
|
||||
|
||||
void atomic_inc(ptr)
|
||||
void atomic_dec(ptr)
|
||||
void atomic_add(ptr, val)
|
||||
void atomic_sub(ptr, val)
|
||||
void atomic_and(ptr, val)
|
||||
void atomic_or(ptr, val)
|
||||
|
||||
typeof(*ptr) atomic_fetch_inc(ptr)
|
||||
typeof(*ptr) atomic_fetch_dec(ptr)
|
||||
typeof(*ptr) atomic_fetch_add(ptr, val)
|
||||
typeof(*ptr) atomic_fetch_sub(ptr, val)
|
||||
typeof(*ptr) atomic_fetch_and(ptr, val)
|
||||
typeof(*ptr) atomic_fetch_or(ptr, val)
|
||||
typeof(*ptr) atomic_xchg(ptr, val)
|
||||
typeof(*ptr) atomic_cmpxchg(ptr, old, new)
|
||||
|
||||
all of which return the old value of *ptr. These operations are
|
||||
polymorphic; they operate on any type that is as wide as an int.
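The "returns the old value" convention can be tried out with the C11 <stdatomic.h> operations, which are used here as stand-ins for the qemu/atomic.h macros. That correspondence is an assumption made for illustration, and the wrapper names fetch_add_old(), xchg_old() and cmpxchg_old() are hypothetical.

```c
#include <stdatomic.h>

/* Add val to *counter and return the value it held before the addition,
 * in the manner of atomic_fetch_add(). */
int fetch_add_old(atomic_int *counter, int val)
{
    return atomic_fetch_add(counter, val);   /* seq_cst by default */
}

/* Store val and return the previous contents, in the manner of
 * atomic_xchg(). */
int xchg_old(atomic_int *ptr, int val)
{
    return atomic_exchange(ptr, val);
}

/* Compare-and-swap that returns the old value, in the manner of
 * atomic_cmpxchg().  C11 reports the old value through `expected`
 * rather than as the return value. */
int cmpxchg_old(atomic_int *ptr, int old, int new_val)
{
    int expected = old;
    atomic_compare_exchange_strong(ptr, &expected, new_val);
    /* Equal to `old` on success, the current contents on failure. */
    return expected;
}
```

A caller detects a successful compare-and-swap by checking that the returned value equals the `old` argument it passed in.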
|
||||
|
||||
Sequentially consistent loads and stores can be done using:
|
||||
|
||||
atomic_fetch_add(ptr, 0) for loads
|
||||
atomic_xchg(ptr, val) for stores
|
||||
|
||||
However, they are quite expensive on some platforms, notably POWER and
|
||||
ARM. Therefore, qemu/atomic.h provides two primitives with slightly
|
||||
weaker constraints:
|
||||
|
||||
typeof(*ptr) atomic_mb_read(ptr)
|
||||
void atomic_mb_set(ptr, val)
|
||||
|
||||
The semantics of these primitives map to Java volatile variables,
|
||||
and are strongly related to memory barriers as used in the Linux
|
||||
kernel (see below).
|
||||
|
||||
As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
|
||||
be reordered with each other, and it is also not possible to reorder
|
||||
"normal" accesses around them.
|
||||
|
||||
However, and this is the important difference between
|
||||
atomic_mb_read/atomic_mb_set and sequential consistency, it is important
|
||||
for both threads to access the same volatile variable. It is not the
|
||||
case that everything visible to thread A when it writes volatile field f
|
||||
becomes visible to thread B after it reads volatile field g. The store
|
||||
and load have to "match" (i.e., be performed on the same volatile
|
||||
field) to achieve the right semantics.
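A sketch of a matching store and load is given here. It uses C11 sequentially consistent atomic_store() and atomic_load() as stand-ins for atomic_mb_set() and atomic_mb_read(); that substitution, and the names payload, f, writer() and reader(), are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;      /* ordinary data, written before the flag */
static atomic_bool f;    /* the one variable both threads agree on */

/* Thread A: fill in the data, then publish it by storing to f. */
void writer(int value)
{
    payload = value;
    atomic_store(&f, true);
}

/* Thread B: returns true and fills *out only once the store to f has
 * been observed.  Because it loads the same variable that the writer
 * stored, the earlier write to payload is visible here. */
bool reader(int *out)
{
    if (!atomic_load(&f)) {
        return false;
    }
    *out = payload;
    return true;
}
```

If reader() polled some other flag g instead of f, nothing would tie its view of payload to the writer's store, which is the mismatch the paragraph above warns about.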
|
||||
|
||||
|
||||
These operations operate on any type that is as wide as an int or smaller.
|
||||
|
||||
|
||||
WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS
|
||||
=============================================
|
||||
|
||||
Compared to sequentially consistent atomic access, programming with
|
||||
weaker consistency models can be considerably more complicated.
|
||||
If the algorithm you are writing includes both writes
|
||||
and reads on the same side, it is generally simpler to use sequentially
|
||||
consistent primitives.
|
||||
|
||||
When using this model, variables are accessed with atomic_read() and
|
||||
atomic_set(), and restrictions to the ordering of accesses are enforced
|
||||
using the smp_rmb(), smp_wmb(), smp_mb() and smp_read_barrier_depends()
|
||||
memory barriers.
|
||||
|
||||
atomic_read() and atomic_set() prevent the compiler from using
|
||||
optimizations that might otherwise optimize accesses out of existence
|
||||
on the one hand, or that might create unsolicited accesses on the other.
|
||||
In general this should not have any effect, because the same compiler
|
||||
barriers are already implied by memory barriers. However, it is useful
|
||||
to do so, because it tells readers which variables are shared with
|
||||
other threads, and which are local to the current thread or protected
|
||||
by other, more mundane means.
|
||||
|
||||
Memory barriers control the order of references to shared memory.
|
||||
They come in four kinds:
|
||||
|
||||
- smp_rmb() guarantees that all the LOAD operations specified before
|
||||
the barrier will appear to happen before all the LOAD operations
|
||||
specified after the barrier with respect to the other components of
|
||||
the system.
|
||||
|
||||
In other words, smp_rmb() puts a partial ordering on loads, but is not
|
||||
required to have any effect on stores.
|
||||
|
||||
- smp_wmb() guarantees that all the STORE operations specified before
|
||||
the barrier will appear to happen before all the STORE operations
|
||||
specified after the barrier with respect to the other components of
|
||||
the system.
|
||||
|
||||
In other words, smp_wmb() puts a partial ordering on stores, but is not
|
||||
required to have any effect on loads.
|
||||
|
||||
- smp_mb() guarantees that all the LOAD and STORE operations specified
|
||||
before the barrier will appear to happen before all the LOAD and
|
||||
STORE operations specified after the barrier with respect to the other
|
||||
components of the system.
|
||||
|
||||
smp_mb() puts a partial ordering on both loads and stores. It is
|
||||
stronger than both a read and a write memory barrier; it implies both
|
||||
smp_rmb() and smp_wmb(), but it also prevents STOREs coming before the
|
||||
barrier from overtaking LOADs coming after the barrier and vice versa.
|
||||
|
||||
- smp_read_barrier_depends() is a weaker kind of read barrier. On
|
||||
most processors, whenever two loads are performed such that the
|
||||
second depends on the result of the first (e.g., the first load
|
||||
retrieves the address to which the second load will be directed),
|
||||
the processor will guarantee that the first LOAD will appear to happen
|
||||
before the second with respect to the other components of the system.
|
||||
However, this is not always true---for example, it was not true on
|
||||
Alpha processors. Whenever this kind of access happens to shared
|
||||
memory (that is not protected by a lock), a read barrier is needed,
|
||||
and smp_read_barrier_depends() can be used instead of smp_rmb().
|
||||
|
||||
Note that the first load really has to have a _data_ dependency and not
|
||||
a control dependency. If the address for the second load is dependent
|
||||
on the first load, but the dependency is through a conditional rather
|
||||
than actually loading the address itself, then it's a _control_
|
||||
dependency and a full read barrier or better is required.
|
||||
|
||||
|
||||
This is the set of barriers that is required *between* two atomic_read()
|
||||
and atomic_set() operations to achieve sequential consistency:
|
||||
|
||||
                    |               2nd operation             |
                    |-----------------------------------------|
     1st operation  | (after last) | atomic_read | atomic_set |
     ---------------+--------------+-------------+------------|
     (before first) |              | none        | smp_wmb()  |
     ---------------+--------------+-------------+------------|
     atomic_read    | smp_rmb()    | smp_rmb()*  | **         |
     ---------------+--------------+-------------+------------|
     atomic_set     | none         | smp_mb()*** | smp_wmb()  |
     ---------------+--------------+-------------+------------|
|
||||
|
||||
* Or smp_read_barrier_depends().
|
||||
|
||||
** This requires a load-store barrier. How to achieve this varies
|
||||
depending on the machine, but in practice smp_rmb()+smp_wmb()
|
||||
should have the desired effect. For example, on PowerPC the
|
||||
lwsync instruction is a combined load-load, load-store and
|
||||
store-store barrier.
|
||||
|
||||
*** This requires a store-load barrier. On most machines, the only
|
||||
way to achieve this is a full barrier.
|
||||
|
||||
|
||||
You can see that the two possible definitions of atomic_mb_read()
|
||||
and atomic_mb_set() are the following:
|
||||
|
||||
1) atomic_mb_read(p) = atomic_read(p); smp_rmb()
|
||||
atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v); smp_mb()
|
||||
|
||||
2) atomic_mb_read(p) = smp_mb(); atomic_read(p); smp_rmb()
|
||||
atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v);
|
||||
|
||||
Usually the former is used, because smp_mb() is expensive and a program
|
||||
normally has more reads than writes. Therefore it makes more sense to
|
||||
make atomic_mb_set() the more expensive operation.
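The first definition can be written out in C. The sketch below is not the qemu/atomic.h implementation: it assumes C11 fences as conservative stand-ins for smp_rmb(), smp_wmb() and smp_mb(), relaxed C11 accesses as stand-ins for atomic_read() and atomic_set(), and it uses the hypothetical names mb_read() and mb_set() to avoid suggesting otherwise.

```c
#include <stdatomic.h>

/* Assumed C11 stand-ins for the barriers, not the QEMU macros.  The
 * acquire and release fences are somewhat stronger than a pure read or
 * write barrier, which is safe for this illustration. */
#define smp_rmb() atomic_thread_fence(memory_order_acquire)
#define smp_wmb() atomic_thread_fence(memory_order_release)
#define smp_mb()  atomic_thread_fence(memory_order_seq_cst)

/* Definition 1, read side: a plain atomic read followed by a read
 * barrier.  This is the cheap half. */
int mb_read(atomic_int *p)
{
    int v = atomic_load_explicit(p, memory_order_relaxed);
    smp_rmb();
    return v;
}

/* Definition 1, write side: write barrier, plain atomic store, then the
 * expensive full barrier. */
void mb_set(atomic_int *p, int v)
{
    smp_wmb();
    atomic_store_explicit(p, v, memory_order_relaxed);
    smp_mb();
}
```

Moving the smp_mb() from mb_set() to the start of mb_read() would give the second definition, at the cost of a full barrier on every read.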
|
||||
|
||||
There are two common cases in which atomic_mb_read and atomic_mb_set
|
||||
generate too many memory barriers, and thus it can be useful to manually
|
||||
place barriers instead:
|
||||
|
||||
- when a data structure has one thread that is always a writer
|
||||
and one thread that is always a reader, manual placement of
|
||||
memory barriers makes the write side faster. Furthermore,
|
||||
correctness is easy to check for in this case using the "pairing"
|
||||
trick that is explained below:
|
||||
|
||||
         thread 1                                thread 1
         -------------------------               ------------------------
         (other writes)
                                                 smp_wmb()
         atomic_mb_set(&a, x)                    atomic_set(&a, x)
                                                 smp_wmb()
         atomic_mb_set(&b, y)                    atomic_set(&b, y)

                                       =>
         thread 2                                thread 2
         -------------------------               ------------------------
         y = atomic_mb_read(&b)                  y = atomic_read(&b)
                                                 smp_rmb()
         x = atomic_mb_read(&a)                  x = atomic_read(&a)
                                                 smp_rmb()
|
||||
|
||||
- sometimes, a thread is accessing many variables that are otherwise
|
||||
unrelated to each other (for example because, apart from the current
|
||||
thread, exactly one other thread will read or write each of these
|
||||
variables). In this case, it is possible to "hoist" the implicit
|
||||
barriers provided by atomic_mb_read() and atomic_mb_set() outside
|
||||
a loop. For example, the above definition of atomic_mb_read() gives
|
||||
the following transformation:
|
||||
|
||||
    n = 0;                                  n = 0;
    for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
      n += atomic_mb_read(&a[i]);             n += atomic_read(&a[i]);
                                            smp_rmb();
|
||||
|
||||
  Similarly, atomic_mb_set() can be transformed as follows:

                                            smp_wmb();
    for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
      atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
                                            smp_mb();
|
||||
|
||||
|
||||
The two tricks can be combined. In this case, splitting a loop in
|
||||
two lets you hoist the barriers out of the loops _and_ eliminate the
|
||||
expensive smp_mb():
|
||||
|
||||
                                                 smp_wmb();
    for (i = 0; i < 10; i++) {            =>     for (i = 0; i < 10; i++)
      atomic_mb_set(&a[i], false);                 atomic_set(&a[i], false);
      atomic_mb_set(&b[i], false);               smp_wmb();
    }                                            for (i = 0; i < 10; i++)
                                                   atomic_set(&b[i], false);
                                                 smp_mb();
|
||||
|
||||
The other thread can still use atomic_mb_read()/atomic_mb_set().
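The hoisting transformations can be sketched in compilable C. As before, C11 fences and relaxed accesses are assumed stand-ins for the QEMU barriers and for atomic_read()/atomic_set(); the function names sum_flags() and clear_flags() are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Sum n elements with a single read barrier after the loop instead of
 * one barrier per element. */
int sum_flags(atomic_int *a, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += atomic_load_explicit(&a[i], memory_order_relaxed);
    }
    atomic_thread_fence(memory_order_acquire);   /* stands in for smp_rmb() */
    return total;
}

/* Clear n elements with one write barrier before the loop and one full
 * barrier after it. */
void clear_flags(atomic_int *a, size_t n)
{
    atomic_thread_fence(memory_order_release);   /* stands in for smp_wmb() */
    for (size_t i = 0; i < n; i++) {
        atomic_store_explicit(&a[i], 0, memory_order_relaxed);
    }
    atomic_thread_fence(memory_order_seq_cst);   /* stands in for smp_mb() */
}
```

The saving is that the barrier count no longer grows with the number of elements; the transformation is only valid under the condition stated above, namely that the variables are otherwise unrelated to each other.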
|
||||
|
||||
|
||||
Memory barrier pairing
|
||||
----------------------
|
||||
|
||||
A useful rule of thumb is that memory barriers should always, or almost
|
||||
always, be paired with another barrier. In the case of QEMU, however,
|
||||
note that the other barrier may actually be in a driver that runs in
|
||||
the guest!
|
||||
|
||||
For the purposes of pairing, smp_read_barrier_depends() and smp_rmb()
|
||||
both count as read barriers. A read barrier shall pair with a write
|
||||
barrier or a full barrier; a write barrier shall pair with a read
|
||||
barrier or a full barrier. A full barrier can pair with anything.
|
||||
For example:
|
||||
|
||||
    thread 1             thread 2
    ===============      ===============
    a = 1;
    smp_wmb();
    b = 2;               x = b;
                         smp_rmb();
                         y = a;
|
||||
|
||||
Note that the "writing" thread is accessing the variables in the
|
||||
opposite order as the "reading" thread. This is expected: stores
|
||||
before the write barrier will normally match the loads after the
|
||||
read barrier, and vice versa. The same is true for more than 2
|
||||
accesses and for data dependency barriers:
|
||||
|
||||
    thread 1             thread 2
    ===============      ===============
    b[2] = 1;
    smp_wmb();
    x->i = 2;
    smp_wmb();
    a = x;               x = a;
                         smp_read_barrier_depends();
                         y = x->i;
                         smp_read_barrier_depends();
                         z = b[y];
|
||||
|
||||
smp_wmb() also pairs with atomic_mb_read(), and smp_rmb() also pairs
|
||||
with atomic_mb_set().
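The pointer-publication pairing can be sketched in C11. This is a rough illustration under assumptions: a release fence stands in for smp_wmb(), and an acquire fence is used as a conservative replacement for smp_read_barrier_depends(), since C11 has no widely implemented equivalent of the latter. The names struct item, storage, shared, publish_item() and read_item() are invented for the example.

```c
#include <stdatomic.h>
#include <stddef.h>

struct item {
    int i;
};

static struct item storage;              /* backing object for the example */
static _Atomic(struct item *) shared;    /* NULL until published */

/* Writer: initialise the object, then publish the pointer.  The release
 * fence plays the role of smp_wmb(). */
void publish_item(int value)
{
    storage.i = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&shared, &storage, memory_order_relaxed);
}

/* Reader: load the pointer, then dereference it.  The acquire fence is
 * the paired read-side barrier.  Returns -1 if nothing has been
 * published yet. */
int read_item(void)
{
    struct item *x = atomic_load_explicit(&shared, memory_order_relaxed);
    if (x == NULL) {
        return -1;
    }
    atomic_thread_fence(memory_order_acquire);
    return x->i;
}
```

The writer touches the object first and the pointer last, while the reader touches the pointer first and the object last, which is the opposite-order pattern described above.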
|
||||
|
||||
|
||||
COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
|
||||
============================================
|
||||
|
||||
Here is a list of differences between Linux kernel atomic operations
|
||||
and memory barriers, and the equivalents in QEMU:
|
||||
|
||||
- atomic operations in Linux are always on a 32-bit int type and
|
||||
use a boxed atomic_t type; atomic operations in QEMU are polymorphic
|
||||
and use normal C types.
|
||||
|
||||
- atomic_read and atomic_set in Linux give no guarantee at all;
|
||||
atomic_read and atomic_set in QEMU include a compiler barrier
|
||||
(similar to the ACCESS_ONCE macro in Linux).
|
||||
|
||||
- most atomic read-modify-write operations in Linux return void;
|
||||
in QEMU, all of them return the old value of the variable.
|
||||
|
||||
- different atomic read-modify-write operations in Linux imply
|
||||
a different set of memory barriers; in QEMU, all of them enforce
|
||||
sequential consistency, which means they imply full memory barriers
|
||||
before and after the operation.
|
||||
|
||||
- Linux does not have an equivalent of atomic_mb_read() and
|
||||
atomic_mb_set(). In particular, note that set_mb() is a little
|
||||
weaker than atomic_mb_set().
|
||||
|
||||
|
||||
SOURCES
|
||||
=======
|
||||
|
||||
* Documentation/memory-barriers.txt from the Linux kernel
|
||||
|
||||
* "The JSR-133 Cookbook for Compiler Writers", available at
|
||||
http://g.oswego.edu/dl/jmm/cookbook.html
|
|
@ -1,161 +0,0 @@
|
|||
Block I/O error injection using blkdebug
|
||||
----------------------------------------
|
||||
Copyright (C) 2014 Red Hat Inc
|
||||
|
||||
This work is licensed under the terms of the GNU GPL, version 2 or later. See
|
||||
the COPYING file in the top-level directory.
|
||||
|
||||
The blkdebug block driver is a rule-based error injection engine. It can be
|
||||
used to exercise error code paths in block drivers including ENOSPC (out of
|
||||
space) and EIO.
|
||||
|
||||
This document gives an overview of the features available in blkdebug.
|
||||
|
||||
Background
|
||||
----------
|
||||
Block drivers have many error code paths that handle I/O errors. Image formats
|
||||
are especially complex since metadata I/O errors during cluster allocation or
|
||||
while updating tables happen halfway through request processing and require
|
||||
discipline to keep image files consistent.
|
||||
|
||||
Error injection allows test cases to trigger I/O errors at specific points.
|
||||
This way, all error paths can be tested to make sure they are correct.
|
||||
|
||||
Rules
|
||||
-----
|
||||
The blkdebug block driver takes a list of "rules" that tell the error injection
|
||||
engine when to fail an I/O request.
|
||||
|
||||
Each I/O request is evaluated against the rules. If a rule matches the request
|
||||
then its "action" is executed.
|
||||
|
||||
Rules can be placed in a configuration file; the configuration file
|
||||
follows the same .ini-like format used by QEMU's -readconfig option, and
|
||||
each section of the file represents a rule.
|
||||
|
||||
The following configuration file defines a single rule:
|
||||
|
||||
$ cat blkdebug.conf
|
||||
[inject-error]
|
||||
event = "read_aio"
|
||||
errno = "28"
|
||||
|
||||
This rule fails all aio read requests with ENOSPC (28). Note that the errno
|
||||
value depends on the host. On Linux, see
|
||||
/usr/include/asm-generic/errno-base.h for errno values.
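Since the numeric values are host-specific, one way to find the number to put in the configuration file is to ask the host's <errno.h>. The helper below is a small sketch; the function name errno_value() is hypothetical, and it only knows the two error names mentioned in this document.

```c
#include <errno.h>
#include <string.h>

/* Map an error name used in this document to the host's numeric value.
 * Returns -1 for any name this sketch does not know. */
int errno_value(const char *name)
{
    if (strcmp(name, "ENOSPC") == 0) {
        return ENOSPC;
    }
    if (strcmp(name, "EIO") == 0) {
        return EIO;
    }
    return -1;
}
```

Printing errno_value("ENOSPC") with printf("%d\n", ...) on the machine that will run QEMU gives the value to use for the errno attribute.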
|
||||
|
||||
Invoke QEMU as follows:
|
||||
|
||||
$ qemu-system-x86_64 \
|
||||
-drive if=none,cache=none,file=blkdebug:blkdebug.conf:test.img,id=drive0 \
|
||||
-device virtio-blk-pci,drive=drive0,id=virtio-blk-pci0
|
||||
|
||||
Rules support the following attributes:
|
||||
|
||||
event - which type of operation to match (e.g. read_aio, write_aio,
|
||||
flush_to_os, flush_to_disk). See the "Events" section for
|
||||
information on events.
|
||||
|
||||
state - (optional) the engine must be in this state number in order for this
|
||||
rule to match. See the "State transitions" section for information
|
||||
on states.
|
||||
|
||||
errno - the numeric errno value to return when a request matches this rule.
|
||||
The errno values depend on the host since the numeric values are not
|
||||
standardized in the POSIX specification.
|
||||
|
||||
sector - (optional) a sector number that the request must overlap in order to
|
||||
match this rule
|
||||
|
||||
once - (optional, default "off") only execute this action on the first
|
||||
matching request
|
||||
|
||||
immediately - (optional, default "off") return a NULL BlockAIOCB
|
||||
pointer and fail without an errno instead. This
|
||||
exercises the code path where BlockAIOCB fails and the
|
||||
caller's BlockCompletionFunc is not invoked.
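The way these attributes combine can be sketched as a matching function. This is not the code in block/blkdebug.c: struct rule, its field names, the use of -1 for "no sector given" and the function rule_apply() are all assumptions made for illustration, and the "immediately" attribute is left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory form of one inject-error rule. */
struct rule {
    const char *event;   /* event name, e.g. "read_aio" */
    int state;           /* 0 means "any state" */
    int64_t sector;      /* -1 means "any sector" (assumed encoding) */
    bool once;           /* fire on the first matching request only */
    bool fired;          /* set once the rule has matched */
    int error;           /* errno value to inject */
};

/* Return the errno to inject for a request, or 0 if the rule does not
 * match.  The request covers sectors [start, start + count). */
int rule_apply(struct rule *r, const char *event, int state,
               int64_t start, int64_t count)
{
    if (strcmp(r->event, event) != 0) {
        return 0;                       /* different kind of operation */
    }
    if (r->state != 0 && r->state != state) {
        return 0;                       /* engine is in another state */
    }
    if (r->sector >= 0 &&
        (r->sector < start || r->sector >= start + count)) {
        return 0;                       /* request does not overlap sector */
    }
    if (r->once && r->fired) {
        return 0;                       /* already used up */
    }
    r->fired = true;
    return r->error;
}
```

Each check corresponds to one attribute, and a request must pass all of them before the rule's action runs.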
|
||||
|
||||
Events
|
||||
------
|
||||
Block drivers provide information about the type of I/O request they are about
|
||||
to make so rules can match specific types of requests. For example, the qcow2
|
||||
block driver tells blkdebug when it accesses the L1 table so rules can match
|
||||
only L1 table accesses and not other metadata or guest data requests.
|
||||
|
||||
The core events are:
|
||||
|
||||
read_aio - guest data read
|
||||
|
||||
write_aio - guest data write
|
||||
|
||||
flush_to_os - write out unwritten block driver state (e.g. cached metadata)
|
||||
|
||||
flush_to_disk - flush the host block device's disk cache
|
||||
|
||||
See block/blkdebug.c:event_names[] for the full list of events. You may need
|
||||
to grep block driver source code to understand the meaning of specific events.
|
||||
|
||||
State transitions
|
||||
-----------------
|
||||
There are cases where more power is needed to match a particular I/O request in
|
||||
a longer sequence of requests. For example:
|
||||
|
||||
write_aio
|
||||
flush_to_disk
|
||||
write_aio
|
||||
|
||||
How do we match the 2nd write_aio but not the first? This is where state
|
||||
transitions come in.
|
||||
|
||||
The error injection engine has an integer called the "state" that always starts
|
||||
initialized to 1. The state integer is internal to blkdebug and cannot be
|
||||
observed from outside but rules can interact with it for powerful matching
|
||||
behavior.
|
||||
|
||||
Rules can be conditional on the current state and they can transition to a new
|
||||
state.
|
||||
|
||||
When a rule's "state" attribute is non-zero then the current state must equal
|
||||
the attribute in order for the rule to match.
|
||||
|
||||
For example, to match the 2nd write_aio:
|
||||
|
||||
[set-state]
|
||||
event = "write_aio"
|
||||
state = "1"
|
||||
new_state = "2"
|
||||
|
||||
[inject-error]
|
||||
event = "write_aio"
|
||||
state = "2"
|
||||
errno = "5"
|
||||
|
||||
The first write_aio request matches the set-state rule and transitions from
|
||||
state 1 to state 2. Once state 2 has been entered, the set-state rule no
|
||||
longer matches since it requires state 1. But the inject-error rule now
|
||||
matches the next write_aio request and injects EIO (5).
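The walk-through above can be reproduced with a small simulation. It is a sketch rather than the blkdebug implementation: the merged struct rule, the rules[] table and handle_event() are hypothetical, and the code assumes that every rule is evaluated against the state the engine was in when the event arrived, with the transition taking effect afterwards.

```c
#include <string.h>

/* Hypothetical merged form of the two rule kinds: a rule may inject an
 * error (error != 0) or change the state (new_state != 0). */
struct rule {
    const char *event;
    int state;       /* required state, 0 = any */
    int new_state;   /* state to enter on match, 0 = unchanged */
    int error;       /* errno to inject on match, 0 = none */
};

static const struct rule rules[] = {
    { "write_aio", 1, 2, 0 },   /* [set-state] */
    { "write_aio", 2, 0, 5 },   /* [inject-error], errno "5" as above */
};

static int state = 1;   /* the engine always starts in state 1 */

/* Process one event and return the errno injected, or 0 for success. */
int handle_event(const char *event)
{
    int old_state = state;   /* all rules see the state at event arrival */
    int error = 0;

    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
        const struct rule *r = &rules[i];
        if (strcmp(r->event, event) != 0) {
            continue;
        }
        if (r->state != 0 && r->state != old_state) {
            continue;
        }
        if (r->new_state != 0) {
            state = r->new_state;
        }
        if (r->error != 0) {
            error = r->error;
        }
    }
    return error;
}
```

Feeding it the sequence write_aio, flush_to_disk, write_aio lets the first write through, ignores the flush, and fails the second write with 5.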
|
||||
|
||||
State transition rules support the following attributes:
|
||||
|
||||
event - which type of operation to match (e.g. read_aio, write_aio,
|
||||
flush_to_os, flush_to_disk). See the "Events" section for
|
||||
information on events.
|
||||
|
||||
state - (optional) the engine must be in this state number in order for this
|
||||
rule to match
|
||||
|
||||
new_state - transition to this state number
|
||||
|
||||
Suspend and resume
|
||||
------------------
|
||||
Exercising code paths in block drivers may require specific ordering amongst
|
||||
concurrent requests. The "breakpoint" feature allows requests to be halted on
|
||||
a blkdebug event and resumed later. This makes it possible to achieve
|
||||
deterministic ordering when multiple requests are in flight.
|
||||
|
||||
Breakpoints on blkdebug events are associated with a user-defined "tag" string.
|
||||
This tag serves as an identifier by which the request can be resumed at a later
|
||||
point.
|
||||
|
||||
See the qemu-io(1) break, resume, remove_break, and wait_break commands for
|
||||
details.
|
|
@ -1,69 +0,0 @@
|
|||
= Block driver correctness testing with blkverify =
|
||||
|
||||
== Introduction ==
|
||||
|
||||
This document describes how to use the blkverify protocol to test that a block
|
||||
driver is operating correctly.
|
||||
|
||||
It is difficult to test and debug block drivers against real guests. Often
|
||||
processes inside the guest will crash because corrupt sectors were read as part
|
||||
of the executable. Other times obscure errors are raised by a program inside
|
||||
the guest. These issues are extremely hard to trace back to bugs in the block
|
||||
driver.
|
||||
|
||||
Blkverify solves this problem by catching data corruption inside QEMU the first
|
||||
time bad data is read and reporting the disk sector that is corrupted.
|
||||
|
||||
== How it works ==
|
||||
|
||||
The blkverify protocol has two child block devices, the "test" device and the
|
||||
"raw" device. Read/write operations are mirrored to both devices so their
|
||||
state should always be in sync.
|
||||
|
||||
The "raw" device is a raw image, a flat file, that has identical starting
|
||||
contents to the "test" image. The idea is that the "raw" device will handle
|
||||
read/write operations correctly and not corrupt data. It can be used as a
|
||||
reference for comparison against the "test" device.
|
||||
|
||||
After a mirrored read operation completes, blkverify will compare the data and
|
||||
raise an error if it is not identical. This makes it possible to catch the
|
||||
first instance where corrupt data is read.
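The comparison step can be pictured as a sector-by-sector memcmp. The function below is an illustrative sketch and not the code in the blkverify driver; the name first_mismatch() and the fixed 512-byte SECTOR_SIZE are assumptions.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Compare two buffers holding nb_sectors sectors each.  Returns the
 * index of the first sector whose contents differ, or -1 if the buffers
 * are identical. */
int64_t first_mismatch(const uint8_t *test, const uint8_t *raw,
                       int64_t nb_sectors)
{
    for (int64_t s = 0; s < nb_sectors; s++) {
        const size_t off = (size_t)s * SECTOR_SIZE;
        if (memcmp(test + off, raw + off, SECTOR_SIZE) != 0) {
            return s;
        }
    }
    return -1;
}
```

Reporting the first differing sector, rather than just "mismatch", is what makes it practical to go back and inspect the image at that location.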
|
||||
|
||||
== Example ==
|
||||
|
||||
Imagine raw.img has 0xcd repeated throughout its first sector:
|
||||
|
||||
$ ./qemu-io -c 'read -v 0 512' raw.img
|
||||
00000000: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
|
||||
00000010: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
|
||||
[...]
|
||||
000001e0: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
|
||||
000001f0: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
|
||||
read 512/512 bytes at offset 0
|
||||
512.000000 bytes, 1 ops; 0.0000 sec (97.656 MiB/sec and 200000.0000 ops/sec)
|
||||
|
||||
And test.img is corrupt: its first sector is zeroed when it shouldn't be:
|
||||
|
||||
$ ./qemu-io -c 'read -v 0 512' test.img
|
||||
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
|
||||
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
|
||||
[...]
|
||||
000001e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
|
||||
000001f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
|
||||
read 512/512 bytes at offset 0
|
||||
512.000000 bytes, 1 ops; 0.0000 sec (81.380 MiB/sec and 166666.6667 ops/sec)
|
||||
|
||||
This error is caught by blkverify:
|
||||
|
||||
$ ./qemu-io -c 'read 0 512' blkverify:raw.img:test.img
|
||||
blkverify: read sector_num=0 nb_sectors=4 contents mismatch in sector 0
|
||||
|
||||
A more realistic scenario is verifying the installation of a guest OS:
|
||||
|
||||
$ ./qemu-img create raw.img 16G
|
||||
$ ./qemu-img create -f qcow2 test.qcow2 16G
|
||||
$ x86_64-softmmu/qemu-system-x86_64 -cdrom debian.iso \
|
||||
-drive file=blkverify:raw.img:test.qcow2
|
||||
|
||||
If the installation is aborted when blkverify detects corruption, use qemu-io
|
||||
to explore the contents of the disk image at the sector in question.
|
|
@ -1,43 +0,0 @@
|
|||
= Bootindex property =
|
||||
|
||||
Block and net devices have a bootindex property. This property is used to
|
||||
determine the order in which firmware will consider devices for booting
|
||||
the guest OS. If the bootindex property is not set for a device, it gets
|
||||
the lowest boot priority. There is no particular order in which devices with
|
||||
unset bootindex property will be considered for booting, but they will
|
||||
still be bootable.
|
||||
|
||||
== Example ==
|
||||
|
||||
Let's assume we have a QEMU machine with two NICs (virtio, e1000) and two
|
||||
disks (IDE, virtio):
|
||||
|
||||
qemu -drive file=disk1.img,if=none,id=disk1
|
||||
-device ide-drive,drive=disk1,bootindex=4
|
||||
-drive file=disk2.img,if=none,id=disk2
|
||||
-device virtio-blk-pci,drive=disk2,bootindex=3
|
||||
-netdev type=user,id=net0 -device virtio-net-pci,netdev=net0,bootindex=2
|
||||
-netdev type=user,id=net1 -device e1000,netdev=net1,bootindex=1
|
||||
|
||||
Given the command above, firmware should try to boot from the e1000 NIC
|
||||
first. If this fails, it should try the virtio NIC next; if this fails
|
||||
too, it should try the virtio disk, and then the IDE disk.
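The expected ordering can be expressed as a sort on bootindex, with devices that have no bootindex placed last. The sketch below only illustrates that rule and is not how QEMU or any firmware implements it; struct boot_dev, the use of a negative value for "not set", and the function names are assumptions.

```c
#include <stdlib.h>

/* Hypothetical device record; bootindex < 0 means "not set". */
struct boot_dev {
    const char *name;
    int bootindex;
};

/* qsort comparator: ascending bootindex, devices without one last. */
int by_bootindex(const void *pa, const void *pb)
{
    const struct boot_dev *a = pa;
    const struct boot_dev *b = pb;

    if (a->bootindex < 0 || b->bootindex < 0) {
        /* An unset index sorts after every set one. */
        return (a->bootindex < 0) - (b->bootindex < 0);
    }
    return (a->bootindex > b->bootindex) - (a->bootindex < b->bootindex);
}

/* Order n devices the way firmware is expected to consider them. */
void sort_boot_order(struct boot_dev *devs, size_t n)
{
    qsort(devs, n, sizeof(devs[0]), by_bootindex);
}
```

Applied to the four devices in the command above, the result is e1000, virtio-net-pci, virtio-blk-pci, ide-drive. Devices without a bootindex compare equal to each other, so their relative order is unspecified, matching the text.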
|
||||
|
||||
== Limitations ==
|
||||
|
||||
1. Some firmware has limitations on which devices can be considered for
|
||||
booting. For instance, the PC BIOS boot specification allows only one
|
||||
disk to be bootable. If boot from disk fails for some reason, the BIOS
|
||||
won't retry booting from another disk. It can still try to boot from
|
||||
floppy or net, though.
|
||||
|
||||
2. Sometimes, firmware cannot map the device path QEMU wants firmware to
|
||||
boot from to a boot method. It doesn't happen for devices the firmware
|
||||
can natively boot from, but if firmware relies on an option ROM for
|
||||
booting, and the same option ROM is used for booting from more than one
|
||||
device, the firmware may not be able to ask the option ROM to boot from
|
||||
a particular device reliably. For instance with the PC BIOS, if a SCSI HBA
|
||||
has three bootable devices target1, target3, target5 connected to it,
|
||||
the option ROM will have a boot method for each of them, but it is not
|
||||
possible to map from boot method back to a specific target. This is a
|
||||
shortcoming of the PC BIOS boot specification.
|
|
@ -1,181 +0,0 @@
|
|||
QEMU CCID Device Documentation.
|
||||
|
||||
Contents
|
||||
1. USB CCID device
|
||||
2. Building
|
||||
3. Using ccid-card-emulated with hardware
|
||||
4. Using ccid-card-emulated with certificates
|
||||
5. Using ccid-card-passthru with client side hardware
|
||||
6. Using ccid-card-passthru with client side certificates
|
||||
7. Passthrough protocol scenario
|
||||
8. libcacard
|
||||
|
||||
1. USB CCID device
|
||||
|
||||
The USB CCID device is a USB device implementing the CCID specification, which
|
||||
lets one connect smart card readers that implement the same spec. For more
|
||||
information see the specification:
|
||||
|
||||
Universal Serial Bus
|
||||
Device Class: Smart Card
|
||||
CCID
|
||||
Specification for
|
||||
Integrated Circuit(s) Cards Interface Devices
|
||||
Revision 1.1
|
||||
April 22nd, 2005
|
||||
|
||||
Smartcards are used for authentication, single sign on, decryption in
|
||||
public/private schemes and digital signatures. A smartcard reader on the client
|
||||
cannot be used on a guest with simple usb passthrough since it will then not be
|
||||
available on the client, possibly locking the computer when it is "removed". On
|
||||
the other hand this device can let you use the smartcard on both the client and
|
||||
the guest machine. It is also possible to have a completely virtual smart card
|
||||
reader and smart card (i.e. not backed by a physical device) using this device.
|
||||
|
||||
2. Building
|
||||
|
||||
The cryptographic functions and access to the physical card are done via NSS.
|
||||
|
||||
Installing NSS:
|
||||
|
||||
In redhat/fedora:
|
||||
yum install nss-devel
|
||||
In ubuntu/debian:
|
||||
apt-get install libnss3-dev
|
||||
(not tested on ubuntu)
|
||||
|
||||
Configuring and building:
|
||||
./configure --enable-smartcard && make
|
||||
|
||||
|
||||
3. Using ccid-card-emulated with hardware
|
||||
|
||||
Assuming you have a working smartcard on the host with the current
|
||||
user, using NSS, qemu acts as another NSS client using ccid-card-emulated:
|
||||
|
||||
qemu -usb -device usb-ccid -device ccid-card-emulated
|
||||
|
||||
|
||||
4. Using ccid-card-emulated with certificates stored in files
|
||||
|
||||
You must create the CA and card certificates. This is a one time process.
|
||||
We use NSS certificates:
|
||||
|
||||
mkdir fake-smartcard
|
||||
cd fake-smartcard
|
||||
certutil -N -d sql:$PWD
|
||||
certutil -S -d sql:$PWD -s "CN=Fake Smart Card CA" -x -t TC,TC,TC -n fake-smartcard-ca
|
||||
certutil -S -d sql:$PWD -t ,, -s "CN=John Doe" -n id-cert -c fake-smartcard-ca
|
||||
certutil -S -d sql:$PWD -t ,, -s "CN=John Doe (signing)" --nsCertType smime -n signing-cert -c fake-smartcard-ca
|
||||
certutil -S -d sql:$PWD -t ,, -s "CN=John Doe (encryption)" --nsCertType sslClient -n encryption-cert -c fake-smartcard-ca
|
||||
|
||||
Note: you must have exactly three certificates.
|
||||
|
||||
You can use the emulated card type with the certificates backend:
|
||||
|
||||
qemu -usb -device usb-ccid -device ccid-card-emulated,backend=certificates,db=sql:$PWD,cert1=id-cert,cert2=signing-cert,cert3=encryption-cert
|
||||
|
||||
To use the certificates in the guest, export the CA certificate:
|
||||
|
||||
certutil -L -r -d sql:$PWD -o fake-smartcard-ca.cer -n fake-smartcard-ca
|
||||
|
||||
and import it in the guest:
|
||||
|
||||
certutil -A -d /etc/pki/nssdb -i fake-smartcard-ca.cer -t TC,TC,TC -n fake-smartcard-ca
|
||||
|
||||
In a Linux guest you can then use the CoolKey PKCS #11 module to access
|
||||
the card:
|
||||
|
||||
certutil -d /etc/pki/nssdb -L -h all
|
||||
|
||||
It will prompt you for the PIN (which is the password you assigned to the
|
||||
certificate database early on), and then show you all three certificates
|
||||
together with the manually imported CA cert:
|
||||
|
||||
Certificate Nickname Trust Attributes
|
||||
fake-smartcard-ca CT,C,C
|
||||
John Doe:CAC ID Certificate u,u,u
|
||||
John Doe:CAC Email Signature Certificate u,u,u
|
||||
John Doe:CAC Email Encryption Certificate u,u,u
|
||||
|
||||
If this does not happen, CoolKey is not installed or not registered with
|
||||
NSS. Registration can be done from Firefox or the command line:
|
||||
|
||||
modutil -dbdir /etc/pki/nssdb -add "CAC Module" -libfile /usr/lib64/pkcs11/libcoolkeypk11.so
|
||||
modutil -dbdir /etc/pki/nssdb -list
|
||||
|
||||
|
||||
5. Using ccid-card-passthru with client side hardware
|
||||
|
||||
On the host, specify the ccid-card-passthru device with a suitable chardev:
|
||||
|
||||
qemu -chardev socket,server,host=0.0.0.0,port=2001,id=ccid,nowait -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid
|
||||
|
||||
On the client, run vscclient, built when you built QEMU:
|
||||
|
||||
vscclient <qemu-host> 2001
|
||||
|
||||
|
||||
6. Using ccid-card-passthru with client side certificates
|
||||
|
||||
This case is not particularly useful, but you can use it to debug
|
||||
your setup if #4 works but #5 does not.
|
||||
|
||||
Follow the instructions in #4 to create the certificates.
|
||||
Run qemu as per #5, and run vscclient from the "fake-smartcard"
|
||||
directory as follows:
|
||||
|
||||
qemu -chardev socket,server,host=0.0.0.0,port=2001,id=ccid,nowait -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid
|
||||
vscclient -e "db=\"sql:$PWD\" use_hw=no soft=(,Test,CAC,,id-cert,signing-cert,encryption-cert)" <qemu-host> 2001
|
||||
|
||||
|
||||
7. Passthrough protocol scenario
|
||||
|
||||
This is a typical interchange of messages when using the passthru card device.
|
||||
usb-ccid is a usb device. It defaults to an unattached usb device on startup.
|
||||
usb-ccid expects a chardev and expects the protocol defined in
|
||||
cac_card/vscard_common.h to be passed over that.
|
||||
The usb-ccid device can be in one of three modes:
|
||||
* detached
|
||||
* attached with no card
|
||||
* attached with card
|
||||
|
||||
A typical interchange is: (the arrow shows who started each exchange, it can be client
|
||||
originated or guest originated)
|
||||
|
||||
client event | vscclient | passthru | usb-ccid | guest event
|
||||
----------------------------------------------------------------------------------------------
|
||||
| VSC_Init | | |
|
||||
| VSC_ReaderAdd | | attach |
|
||||
| | | | sees new usb device.
|
||||
card inserted -> | | | |
|
||||
| VSC_ATR | insert | insert | see new card
|
||||
| | | |
|
||||
| VSC_APDU | VSC_APDU | | <- guest sends APDU
|
||||
client<->physical | | | |
|
||||
card APDU exchange| | | |
|
||||
client response ->| VSC_APDU | VSC_APDU | | receive APDU response
|
||||
...
|
||||
[APDU<->APDU repeats several times]
|
||||
...
|
||||
card removed -> | | | |
|
||||
| VSC_CardRemove | remove | remove | card removed
|
||||
...
|
||||
[(card insert, apdu's, card remove) repeat]
|
||||
...
|
||||
kill/quit | | | |
|
||||
vscclient | | | |
|
||||
| VSC_ReaderRemove | | detach |
|
||||
| | | | usb device removed.
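The three modes and the messages in the table above can be read as a small state machine. The following is only a sketch of that reading in Python, not code from QEMU or vscclient: the message names are taken from the table, while the transition table, the function names and the rule that anything not shown in the table is rejected are assumptions made for illustration.

```python
# Modes listed in the text above.
DETACHED = "detached"
NO_CARD = "attached with no card"
WITH_CARD = "attached with card"

# (current mode, message) -> next mode, one entry per row of the table.
# Combinations the table does not show are treated as errors (an assumption).
TRANSITIONS = {
    (DETACHED, "VSC_Init"): DETACHED,
    (DETACHED, "VSC_ReaderAdd"): NO_CARD,
    (NO_CARD, "VSC_ATR"): WITH_CARD,
    (WITH_CARD, "VSC_APDU"): WITH_CARD,
    (WITH_CARD, "VSC_CardRemove"): NO_CARD,
    (NO_CARD, "VSC_ReaderRemove"): DETACHED,
}


def next_mode(mode, message):
    """Return the mode reached when `message` arrives in `mode`."""
    try:
        return TRANSITIONS[(mode, message)]
    except KeyError:
        raise ValueError("%s not expected while %s" % (message, mode))


def replay(messages, mode=DETACHED):
    """Feed a sequence of messages through the state machine."""
    for message in messages:
        mode = next_mode(mode, message)
    return mode


session = ["VSC_Init", "VSC_ReaderAdd", "VSC_ATR", "VSC_APDU", "VSC_APDU",
           "VSC_CardRemove", "VSC_ReaderRemove"]
print(replay(session))
```

Replaying the whole session from the table ends in the detached mode again, which matches the last row ("usb device removed").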
|
||||
|
||||
|
||||
8. libcacard
|
||||
|
||||
Both ccid-card-emulated and vscclient use libcacard as the card emulator.
|
||||
libcacard implements a completely virtual CAC (DoD standard for smart
|
||||
cards) compliant card and uses NSS to retrieve certificates and do
|
||||
any encryption. The backend can then be a real reader and card, or
|
||||
certificates stored in files.
|
||||
|
||||
For documentation of the library see docs/libcacard.txt.
|
||||
|
|
@ -1,37 +0,0 @@
|
|||
###########################################################################
|
||||
#
|
||||
# You can pass this file directly to qemu using the -readconfig
|
||||
# command line switch.
|
||||
#
|
||||
# This config file creates an EHCI adapter with companion UHCI
|
||||
# controllers as multifunction device in PCI slot "1d".
|
||||
#
|
||||
# Specify "bus=ehci.0" when creating usb devices to hook them up
|
||||
# there.
|
||||
#
|
||||
|
||||
[device "ehci"]
|
||||
driver = "ich9-usb-ehci1"
|
||||
addr = "1d.7"
|
||||
multifunction = "on"
|
||||
|
||||
[device "uhci-1"]
|
||||
driver = "ich9-usb-uhci1"
|
||||
addr = "1d.0"
|
||||
multifunction = "on"
|
||||
masterbus = "ehci.0"
|
||||
firstport = "0"
|
||||
|
||||
[device "uhci-2"]
|
||||
driver = "ich9-usb-uhci2"
|
||||
addr = "1d.1"
|
||||
multifunction = "on"
|
||||
masterbus = "ehci.0"
|
||||
firstport = "2"
|
||||
|
||||
[device "uhci-3"]
|
||||
driver = "ich9-usb-uhci3"
|
||||
addr = "1d.2"
|
||||
multifunction = "on"
|
||||
masterbus = "ehci.0"
|
||||
firstport = "4"
|
|
@ -1,239 +0,0 @@
|
|||
# Specification for the fuzz testing tool
|
||||
#
|
||||
# Copyright (C) 2014 Maria Kustova <maria.k@catit.be>
|
||||
#
|
||||
# This program is free software: you can redistribute it and/or modify
|
||||
# it under the terms of the GNU General Public License as published by
|
||||
# the Free Software Foundation, either version 2 of the License, or
|
||||
# (at your option) any later version.
|
||||
#
|
||||
# This program is distributed in the hope that it will be useful,
|
||||
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
# GNU General Public License for more details.
|
||||
#
|
||||
# You should have received a copy of the GNU General Public License
|
||||
# along with this program. If not, see <http://www.gnu.org/licenses/>.
|
||||
|
||||
|
||||
Image fuzzer
|
||||
============
|
||||
|
||||
Description
|
||||
-----------
|
||||
|
||||
The goal of the image fuzzer is to catch crashes of qemu-io/qemu-img
|
||||
by providing them with randomly corrupted images.
|
||||
Test images are generated from scratch and have valid inner structure with some
|
||||
elements, e.g. L1/L2 tables, having random invalid values.
|
||||
|
||||
|
||||
Test runner
|
||||
-----------
|
||||
|
||||
The test runner generates test images, executes tests utilizing generated
|
||||
images, indicates their results and collects all test related artifacts (logs,
|
||||
core dumps, test images, backing files).
|
||||
A test is the execution of all available commands under test with the same
|
||||
generated test image.
|
||||
By default, the test runner generates new tests and executes them until
|
||||
keyboard interruption. But if a test seed is specified via the '--seed' runner
|
||||
parameter, then only one test with this seed will be executed; once it finishes,
|
||||
the runner will exit.
|
||||
|
||||
The runner uses an external image fuzzer to generate test images. An image
|
||||
generator should be specified as a mandatory parameter of the test runner.
|
||||
For details about interactions between the runner and fuzzers, see "Module
|
||||
interfaces".
|
||||
|
||||
The runner activates generation of core dumps during test executions, but it
|
||||
assumes that core dumps will be generated in the current working directory.
|
||||
For comprehensive test results, please set up your test environment
|
||||
properly.
|
||||
|
||||
Paths to binaries under test (SUTs) qemu-img and qemu-io are retrieved from
|
||||
environment variables. If the environment check fails the runner will
|
||||
use SUTs installed in system paths.
|
||||
qemu-img is required for creation of backing files, so it's mandatory to set
|
||||
the related environment variable if it's not installed in the system path.
|
||||
For details about environment variables see qemu-iotests/check.
|
||||
|
||||
The runner accepts a JSON array of fields expected to be fuzzed via the
|
||||
'--config' argument, e.g.
|
||||
|
||||
'[["feature_name_table"], ["header", "l1_table_offset"]]'
|
||||
|
||||
Each sublist can have one or two strings defining image structure elements.
|
||||
In the latter case a parent element should be placed on the first position,
|
||||
and a field name on the second one.
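As an illustration of the format described above, the following Python 3 sketch parses a '--config' style string and checks that every sublist holds one or two strings. It is not taken from the runner; the function name parse_fuzz_config and the (parent, field) tuple representation are assumptions.

```python
import json


def parse_fuzz_config(text):
    """Parse a JSON fuzzer configuration into (parent, field) pairs.

    `field` is None when a sublist names only a parent element.
    """
    config = json.loads(text)
    if not isinstance(config, list):
        raise ValueError("configuration must be a JSON array")
    parsed = []
    for entry in config:
        valid = (isinstance(entry, list) and len(entry) in (1, 2)
                 and all(isinstance(item, str) for item in entry))
        if not valid:
            raise ValueError("expected one or two strings, got %r" % (entry,))
        parent = entry[0]
        field = entry[1] if len(entry) == 2 else None
        parsed.append((parent, field))
    return parsed


print(parse_fuzz_config(
    '[["feature_name_table"], ["header", "l1_table_offset"]]'))
```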
|
||||
|
||||
The runner accepts a list of commands under test as a JSON array via
|
||||
the '--command' argument. Each command is a list containing a SUT and all its
|
||||
arguments, e.g.
|
||||
|
||||
runner.py -c '[["qemu-io", "$test_img", "-c", "write $off $len"]]'
|
||||
/tmp/test ../qcow2
|
||||
|
||||
For variable arguments, the following aliases can be used:
|
||||
- $test_img for a fuzzed img
|
||||
- $off for an offset in the fuzzed image
|
||||
- $len for a data size
|
||||
|
||||
Values for the last two aliases will be generated based on the size of the virtual
|
||||
disk of the generated image.
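A minimal sketch of how the three aliases could be expanded, assuming Python 3 and the standard string.Template class. The helper name expand_command and the rule used to pick the offset and length (any range that fits inside the virtual disk) are assumptions, not the runner's actual logic.

```python
import random
from string import Template


def expand_command(command, test_img, disk_size, rng):
    """Replace $test_img, $off and $len in every argument of one command."""
    off = rng.randrange(disk_size)             # offset inside the virtual disk
    length = rng.randint(1, disk_size - off)   # size that stays inside the disk
    values = {"test_img": test_img, "off": off, "len": length}
    # safe_substitute leaves any unknown $name untouched instead of failing.
    return [Template(arg).safe_substitute(values) for arg in command]


rng = random.Random(42)
command = ["qemu-io", "$test_img", "-c", "write $off $len"]
print(expand_command(command, "/tmp/test/test.img", 1 << 20, rng))
```

Passing the random generator in explicitly keeps the expansion reproducible for a given seed.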
|
||||
If no commands are specified, the runner will execute commands from
|
||||
the default list:
|
||||
- qemu-img check
|
||||
- qemu-img info
|
||||
- qemu-img convert
|
||||
- qemu-io -c read
|
||||
- qemu-io -c write
|
||||
- qemu-io -c aio_read
|
||||
- qemu-io -c aio_write
|
||||
- qemu-io -c flush
|
||||
- qemu-io -c discard
|
||||
- qemu-io -c truncate
|
||||
|
||||
|
||||
Qcow2 image generator
|
||||
---------------------
|
||||
|
||||
The 'qcow2' generator is a Python package providing the 'create_image' method as
|
||||
a single public API. See details in 'Test runner/image fuzzer' chapter of
|
||||
'Module interfaces'.
|
||||
|
||||
Qcow2 contains two submodules: fuzz.py and layout.py.
|
||||
|
||||
'fuzz.py' contains all fuzzing functions, one per image field. It's assumed
|
||||
that after code analysis every field will have its own constraints for its value.
|
||||
For now only universal potentially dangerous values are used, e.g. type limits
|
||||
for integers or unsafe symbols such as '%s' for strings. For bitmasks, a random number
|
||||
of bits are set to ones. All fuzzed values are checked for inequality with the
|
||||
current valid value of the field. In case of equality the value will be
|
||||
regenerated.
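The regenerate-until-different rule can be sketched as follows. This is a Python 3 illustration, not the contents of fuzz.py: the function names, the list of 32-bit boundary values and the choice to allow zero set bits in a bitmask are assumptions.

```python
import random

# Type limits for an unsigned 32-bit field (illustrative selection).
UINT32_LIMITS = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]


def fuzz_uint32(current, rng):
    """Pick a boundary value that differs from the current valid value."""
    while True:
        value = rng.choice(UINT32_LIMITS)
        if value != current:        # regenerate on equality
            return value


def fuzz_bitmask(current, width, rng):
    """Set a random number of randomly chosen bits in a `width`-bit mask."""
    while True:
        count = rng.randint(0, width)
        value = 0
        for bit in rng.sample(range(width), count):
            value |= 1 << bit
        if value != current:        # regenerate on equality
            return value


rng = random.Random(0)
print(hex(fuzz_uint32(0xFFFFFFFF, rng)), bin(fuzz_bitmask(0b11, 8, rng)))
```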
|
||||
|
||||
'layout.py' creates a random valid image, fuzzes a random subset of the image
|
||||
fields by 'fuzz.py' module and writes a fuzzed image to the file specified.
|
||||
If a fuzzer configuration is specified, it is interpreted as follows:
|
||||
|
||||
1. If a list contains a parent image element only, then some random portion
|
||||
of fields of this element will be fuzzed every test.
|
||||
The same behavior is applied for the entire image if no configuration is
|
||||
used. This case is useful for test specialization.
|
||||
|
||||
2. If a list contains a parent element and a field name, then a field
|
||||
will always be fuzzed in every test. This case is useful for regression
|
||||
testing.
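The two cases above can be sketched as one selection function. This Python 3 code is an illustration only and is not taken from layout.py: the name select_fields, the dictionary describing the image fields and the decision to always pick at least one field are assumptions.

```python
import random


def select_fields(image_fields, fuzz_config, rng):
    """Return the set of (parent, field) pairs to fuzz in one test.

    image_fields maps a parent element name to the list of its field names.
    """
    if fuzz_config is None:
        # No configuration: the whole image behaves like a parent-only entry.
        candidates = [(parent, field)
                      for parent, fields in image_fields.items()
                      for field in fields]
        return set(rng.sample(candidates, rng.randint(1, len(candidates))))
    selected = set()
    for entry in fuzz_config:
        parent = entry[0]
        if len(entry) == 2:
            # Case 2: the named field is fuzzed in every test.
            selected.add((parent, entry[1]))
        else:
            # Case 1: a random portion of the parent's fields is fuzzed.
            fields = image_fields[parent]
            count = rng.randint(1, len(fields))
            selected.update((parent, f) for f in rng.sample(fields, count))
    return selected


fields = {"header": ["l1_table_offset", "nb_snapshots", "magic"],
          "feature_name_table": ["name"]}
config = [["header", "nb_snapshots"], ["feature_name_table"]]
print(sorted(select_fields(fields, config, random.Random(3))))
```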
|
||||
|
||||
The generator can create header fields, header extensions, L1/L2 tables and
|
||||
refcount table and blocks.
|
||||
|
||||
Module interfaces
|
||||
-----------------
|
||||
|
||||
* Test runner/image fuzzer
|
||||
|
||||
The runner calls an image generator specifying the path to a test image file,
|
||||
path to a backing file and its format and a fuzzer configuration.
|
||||
An image generator is expected to provide a
|
||||
|
||||
'create_image(test_img_path, backing_file_path=None,
|
||||
backing_file_format=None, fuzz_config=None)'
|
||||
|
||||
method that creates a test image, writes it to the specified file and returns
|
||||
the size of the virtual disk.
|
||||
The file should be created if it doesn't exist or overwritten otherwise.
|
||||
fuzz_config has a form of a list of lists. Every sublist can have one
|
||||
or two elements: first element is a name of a parent image element, second one
|
||||
if exists is a name of a field in this element.
|
||||
Example:
|
||||
[['header', 'l1_table_offset'],
|
||||
['header', 'nb_snapshots'],
|
||||
['feature_name_table']]
|
||||
|
||||
The random seed is set by the runner at every test execution for regression
|
||||
purposes, so an image generator should not modify it internally.
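To make the contract concrete, here is a sketch of a placeholder generator and of the runner-side call, written for Python 3. Only the create_image signature comes from the text above; the payload (random bytes, not a valid qcow2 image), the fixed virtual size, the helper run_once and the file name test.img are assumptions.

```python
import os
import random
import tempfile


def create_image(test_img_path, backing_file_path=None,
                 backing_file_format=None, fuzz_config=None):
    """Placeholder generator; the payload is NOT a valid qcow2 image."""
    virtual_size = 1 << 20  # hypothetical fixed virtual disk size in bytes
    # Uses the module-level generator seeded by the runner, never reseeds it.
    payload = bytes(random.getrandbits(8) for _ in range(512))
    with open(test_img_path, "wb") as image:  # creates or overwrites the file
        image.write(payload)
    return virtual_size


def run_once(generator, work_dir, seed, fuzz_config=None):
    """Runner side: set the seed first, then ask the generator for an image."""
    random.seed(seed)
    test_img_path = os.path.join(work_dir, "test.img")
    size = generator(test_img_path, None, None, fuzz_config)
    with open(test_img_path, "rb") as image:
        return size, image.read()


work_dir = tempfile.mkdtemp()
first = run_once(create_image, work_dir, seed=1234)
second = run_once(create_image, work_dir, seed=1234)
print(first == second)
```

Because the generator draws from the generator state the runner has just seeded, two runs with the same seed write identical files, which is what makes a logged seed usable for regression.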
|
||||
|
||||
|
||||
Overall fuzzer requirements
|
||||
===========================
|
||||
|
||||
Input data:
|
||||
----------
|
||||
|
||||
- image template (generator)
|
||||
- work directory
|
||||
- action vector (optional)
|
||||
- seed (optional)
|
||||
- SUT and its arguments (optional)
|
||||
|
||||
|
||||
Fuzzer requirements:
|
||||
-------------------
|
||||
|
||||
1. Should be able to inject random data
|
||||
2. Should be able to select a random value from the manually pregenerated
|
||||
vector (boundary values, e.g. max/min cluster size)
|
||||
3. Image template should describe a general structure invariant for all
|
||||
test images (image format description)
|
||||
4. Image template should be autonomous and other fuzzer parts should not
|
||||
rely on it
|
||||
5. Image template should contain reference rules (not only block+size
|
||||
description)
|
||||
6. Should generate the test image with the correct structure based on an image
|
||||
template
|
||||
7. Should accept a seed as an argument (for regression purpose)
|
||||
8. Should generate a seed if it is not specified as an input parameter.
|
||||
9. The same seed should generate the same image for the same action vector,
|
||||
specified or generated.
|
||||
10. Should accept a vector of actions as an argument (for test reproducing and
|
||||
for test case specification, e.g. group of tests for header structure,
|
||||
group of tests for snapshots, etc)
|
||||
11. Action vector should be randomly generated from the pool of available
|
||||
actions, if it is not specified as an input parameter
|
||||
12. Pool of actions should be defined automatically based on an image template
|
||||
13. Should accept a SUT and its call parameters as an argument or select them
|
||||
randomly otherwise. Since it's expected to change rarely, the list
|
||||
of all possible test commands can be available in the test runner
|
||||
internally.
|
||||
14. Should support an external cancellation of a test run
|
||||
15. Seed should be logged (for regression purpose)
|
||||
16. All files related to a test result should be collected: a test image,
|
||||
SUT logs, fuzzer logs and crash dumps
|
||||
17. Should be compatible with python version 2.4-2.7
|
||||
18. Usage of external libraries should be limited as much as possible.
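Requirements 7 to 11 and 15 (seed handling and action vector selection) can be sketched together. This is a Python 3 illustration under stated assumptions: the function name start_test, the 32-bit seed range and the use of the logging module are not prescribed by the list above.

```python
import logging
import random


def start_test(action_pool, seed=None, action_vector=None):
    """Return the seed and action vector that define one test."""
    if seed is None:
        # Requirement 8: generate a seed when none is given.
        seed = random.SystemRandom().randrange(2 ** 32)
    # Requirement 15: log the seed so the test can be repeated.
    logging.info("seed: %d", seed)
    rng = random.Random(seed)
    if action_vector is None:
        # Requirement 11: draw the vector from the pool of available actions.
        count = rng.randint(1, len(action_pool))
        action_vector = rng.sample(action_pool, count)
    return seed, action_vector


pool = ["header", "l1_table", "l2_tables", "refcount_table"]
seed, vector = start_test(pool, seed=7)
print(seed, vector == start_test(pool, seed=7)[1])
```

Using a private random.Random instance seeded once per test is one way to satisfy requirement 9: the same seed always yields the same generated action vector.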
|
||||
|
||||
|
||||
Image formats:
|
||||
-------------
|
||||
|
||||
Main target image format is qcow2, but support of image templates should
|
||||
provide an ability to add any other image format.
|
||||
|
||||
|
||||
Effectiveness:
|
||||
-------------
|
||||
|
||||
The fuzzer can be controlled via template, seed and action vector;
|
||||
it makes the fuzzer itself invariant to an image format and test logic.
|
||||
It should be able to perform rather complex and precise tests, that can be
|
||||
specified via an action vector. Otherwise, knowledge about an image structure
|
||||
allows the fuzzer to generate the pool of all available areas that can be fuzzed
|
||||
and randomly select some of them and so compose its own action vector.
|
||||
Also complexity of a template defines complexity of the fuzzer, so its
|
||||
functionality can be varied from simple model-independent fuzzing to smart
|
||||
model-based one.
|
||||
|
||||
|
||||
Glossary:
|
||||
--------
|
||||
|
||||
Action vector is a sequence of structure elements retrieved from an image
|
||||
format, each of them will be fuzzed for the test image. It's a subset of
|
||||
elements of the action pool. Example: header, refcount table, etc.
|
||||
Action pool is all available elements of an image structure that are generated
|
||||
automatically from an image template.
|
||||
Image template is a formal description of an image structure and relations
|
||||
between image blocks.
|
||||
Test image is an output image of the fuzzer defined by the current seed and
|
||||
action vector.
|
|
@ -1,483 +0,0 @@
|
|||
This file documents the CAC (Common Access Card) library in the libcacard
|
||||
subdirectory.
|
||||
|
||||
Virtual Smart Card Emulator
|
||||
|
||||
This emulator is designed to provide emulation of actual smart cards to a
|
||||
virtual card reader running in a guest virtual machine. The emulated smart
|
||||
cards can be representations of real smart cards, where the necessary functions
|
||||
such as signing, card removal/insertion, etc. are mapped to real, physical
|
||||
cards which are shared with the client machine the emulator is running on, or
|
||||
the cards could be pure software constructs.
|
||||
|
||||
The emulator is structured to allow multiple replaceable or additional pieces,
|
||||
so it can be easily modified for future requirements. The primary envisioned
|
||||
modifications are:
|
||||
|
||||
1) The socket connection to the virtual card reader (presumably a CCID reader,
|
||||
but other ISO-7816 compatible readers could be used). The code that handles
|
||||
this is in vscclient.c.
|
||||
|
||||
2) The virtual card low level emulation. This is currently supplied by using
|
||||
NSS. This emulation could be replaced by implementations based on other
|
||||
security libraries, including but not limited to openssl+pkcs#11 library,
|
||||
raw pkcs#11, Microsoft CAPI, direct opensc calls, etc. The code that handles
|
||||
this is in vcard_emul_nss.c.
|
||||
|
||||
3) Emulation for new types of cards. The current implementation emulates the
|
||||
original DoD CAC standard with separate pki containers. This emulator lives in
|
||||
cac.c. More than one card type emulator could be included. Other cards could
|
||||
be emulated as well, including PIV, newer versions of CAC, PKCS #15, etc.
|
||||
|
||||
--------------------
|
||||
Replacing the Socket Based Virtual Reader Interface.
|
||||
|
||||
The current implementation contains a replaceable module vscclient.c. The
|
||||
current vscclient.c implements a sockets interface to the virtual ccid reader
|
||||
on the guest. CCID commands that are pertinent to emulation are passed
|
||||
across the socket, and their responses are passed back along that same socket.
|
||||
The protocol that vscclient uses is defined in vscard_common.h and connects
|
||||
to a qemu ccid usb device. Since this socket runs as a client, vscclient.c
|
||||
implements a program with a main entry. It also handles argument parsing for
|
||||
the emulator.
|
||||
|
||||
An application that wants to use the virtual reader can replace vscclient.c
|
||||
with its own implementation that connects to its own CCID reader. The calls
|
||||
that the CCID reader can call are:
|
||||
|
||||
VReaderList * vreader_get_reader_list();
|
||||
|
||||
This function returns a list of virtual readers. These readers may map to
|
||||
physical devices, or simulated devices depending on the vcard back end. Each
|
||||
reader in the list should represent a reader to the virtual machine. Virtual
|
||||
USB address mapping is left to the CCID reader front end. This call can be
|
||||
made any time to get an updated list. The returned list is a copy of the
|
||||
internal list that can be referenced by the caller without locking. This copy
|
||||
must be freed by the caller with vreader_list_delete when it is no longer
|
||||
needed.
|
||||
|
||||
VReaderListEntry *vreader_list_get_first(VReaderList *);
|
||||
|
||||
This function gets the first entry on the reader list. Along with
|
||||
vreader_list_get_next(), vreader_list_get_first() can be used to walk the
|
||||
reader list returned from vreader_get_reader_list(). VReaderListEntries are
|
||||
part of the list themselves and do not need to be freed separately from the
|
||||
list. If there are no entries on the list, it will return NULL.
|
||||
|
||||
VReaderListEntry *vreader_list_get_next(VReaderListEntry *);
|
||||
|
||||
This function gets the next entry in the list. If there are no more entries
|
||||
it will return NULL.
|
||||
|
||||
VReader * vreader_list_get_reader(VReaderListEntry *)
|
||||
|
||||
This function returns the reader stored in the reader List entry. Caller gets
|
||||
a new reference to a reader. The caller must free its reference when it is
|
||||
finished with vreader_free().
|
||||
|
||||
void vreader_free(VReader *reader);
|
||||
|
||||
This function frees a reference to a reader. Readers are reference counted
|
||||
and are automatically deleted when the last reference is freed.
|
||||
|
||||
void vreader_list_delete(VReaderList *list);
|
||||
|
||||
This function frees the list, all the elements on the list, and all the
|
||||
reader references held by the list.
|
||||
|
||||
VReaderStatus vreader_power_on(VReader *reader, char *atr, int *len);
|
||||
|
||||
This function simulates a card power on. A virtual card does not care about
|
||||
the actual voltage and other physical parameters, but it does care that the
|
||||
card is actually on or off. Cycling the card causes the card to reset. If
|
||||
the caller provides enough space, vreader_power_on will return the ATR of
|
||||
the virtual card. The amount of space provided in atr should be indicated
|
||||
in *len. The function modifies *len to be the actual length of the
|
||||
returned ATR.
|
||||
|
||||
VReaderStatus vreader_power_off(VReader *reader);
|
||||
|
||||
This function simulates a power off of a virtual card.
|
||||
|
||||
VReaderStatus vreader_xfer_bytes(VReader *reader, unsigned char *send_buf,
|
||||
int send_buf_len,
|
||||
unsigned char *receive_buf,
|
||||
int receive_buf_len);
|
||||
|
||||
This function sends a raw apdu to a card and returns the card's response.
|
||||
The CCID front end should return the response back. Most of the emulation
|
||||
is driven from these APDUs.
|
||||
|
||||
VReaderStatus vreader_card_is_present(VReader *reader);
|
||||
|
||||
This function returns whether or not the reader has a card inserted. The
|
||||
vreader_power_on, vreader_power_off, and vreader_xfer_bytes will return
|
||||
VREADER_NO_CARD if no card is inserted.
|
||||
|
||||
const char *vreader_get_name(VReader *reader);
|
||||
|
||||
This function returns the name of the reader. The name comes from the card
|
||||
emulator level and is usually related to the name of the physical reader.
|
||||
|
||||
VReaderID vreader_get_id(VReader *reader);
|
||||
|
||||
This function returns the id of a reader. All readers start out with an id
|
||||
of -1. The application can set the id with vreader_set_id.
|
||||
|
||||
VReaderStatus vreader_set_id(VReader *reader, VReaderID id);
|
||||
|
||||
This function sets the reader id. The application is responsible for making
|
||||
sure that the id is unique for all readers it is actively using.
|
||||
|
||||
VReader *vreader_find_reader_by_id(VReaderID id);
|
||||
|
||||
This function returns the reader which matches the id. If two readers match,
|
||||
only one is returned. The function returns NULL if the id is -1.
|
||||
|
||||
Event *vevent_wait_next_vevent();
|
||||
|
||||
This function blocks waiting for reader and card insertion events. There
|
||||
will be one event for each card insertion, each card removal, each reader
|
||||
insertion and each reader removal. At start up, events are created for all
|
||||
the initial readers found, as well as all the cards that are inserted.
|
||||
|
||||
Event *vevent_get_next_vevent();
|
||||
|
||||
This function returns a pending event if it exists, otherwise it returns
|
||||
NULL. It does not block.
|
||||
|
||||
----------------
|
||||
Card Type Emulator: Adding a New Virtual Card Type
|
||||
|
||||
The ISO 7816 card spec describes 2 types of cards:
|
||||
1) File system cards, where the smartcard is managed by reading and writing
|
||||
data to files in a file system. There is currently only boilerplate
|
||||
implemented for file system cards.
|
||||
2) VM cards, where the card has loadable applets which perform the card
|
||||
functions. The current implementation supports VM cards.
|
||||
|
||||
In the case of VM cards, the difference between various types of cards is
|
||||
really what applets have been installed in that card. This structure is
|
||||
mirrored in card type emulators. The 7816 emulator already handles the basic
|
||||
ISO 7816 commands. Card type emulators simply need to add the virtual applets
|
||||
which emulate the real card applets. Card type emulators have exactly one
|
||||
public entry point:
|
||||
|
||||
VCardStatus xxx_card_init(VCard *card, const char *flags,
|
||||
const unsigned char *cert[],
|
||||
int cert_len[],
|
||||
VCardKey *key[],
|
||||
int cert_count);
|
||||
|
||||
The parameters for this are:
|
||||
card - the virtual card structure which will represent this card.
|
||||
flags - option flags that may be specific to this card type.
|
||||
cert - array of binary certificates.
|
||||
cert_len - array of lengths of each of the certificates specified in cert.
|
||||
key - array of opaque key structures representing the private keys on
|
||||
the card.
|
||||
cert_count - number of entries in cert, cert_len, and key arrays.
|
||||
|
||||
Any cert, cert_len, or key with the same index are matching sets. That is
|
||||
cert[0] is cert_len[0] long and has the corresponding private key of key[0].
|
||||
|
||||
The card type emulator is expected to own the VCardKeys, but it should copy
|
||||
any raw cert data it wants to save. It can create new applets and add them to
|
||||
the card using the following functions:
|
||||
|
||||
VCardApplet *vcard_new_applet(VCardProcessAPDU apdu_func,
|
||||
VCardResetApplet reset_func,
|
||||
const unsigned char *aid,
|
||||
int aid_len);
|
||||
|
||||
This function creates a new applet. Applet structures store the following
|
||||
information:
|
||||
1) the AID of the applet (set by aid and aid_len).
|
||||
2) a function to handle APDUs for this applet. (set by apdu_func, more on
|
||||
this below).
|
||||
3) a function to reset the applet state when the applet is selected.
|
||||
(set by reset_func, more on this below).
|
||||
4) applet private data, a data pointer used by the card type emulator to
|
||||
store any data or state it needs to complete requests. (set by a
|
||||
separate call).
|
||||
5) applet private data free, a function used to free the applet private
|
||||
data when the applet itself is destroyed.
|
||||
The created applet can be added to the card with vcard_add_applet below.
|
||||
|
||||
void vcard_set_applet_private(VCardApplet *applet,
|
||||
VCardAppletPrivate *private,
|
||||
VCardAppletPrivateFree private_free);
|
||||
This function sets the private data and the corresponding free function.
|
||||
VCardAppletPrivate is an opaque data structure to the rest of the emulator.
|
||||
The card type emulator can define it any way it wants by defining
|
||||
struct VCardAppletPrivateStruct {};. If there is already a private data
|
||||
structure on the applet, the old one is freed before the new one is set up.
|
||||
Passing two NULLs clears any existing private data.
|
||||
|
||||
VCardStatus vcard_add_applet(VCard *card, VCardApplet *applet);
|
||||
|
||||
Add an applet onto the list of applets attached to the card. Once an applet
|
||||
has been added, it can be selected by its AID, and then commands will be
|
||||
routed to its VCardProcessAPDU function. This function adopts the applet that
|
||||
is passed into it. Note: 2 applets with the same AID should not be added to
|
||||
the same card. It is permissible to add more than one applet. Multiple applets
|
||||
may have the same VCardProcessAPDU entry point.
|
||||
|
||||
The certs and keys should be attached to private data associated with one or
|
||||
more appropriate applets for that card. Control will come to the card type
|
||||
emulator once one of its applets is selected through the VCardProcessAPDU
|
||||
function it specified when it created the applet.
|
||||
|
||||
The signature of VCardResetApplet is:
|
||||
VCardStatus (*VCardResetApplet) (VCard *card, int channel);
|
||||
This function will reset any internal applet state that needs to be
|
||||
cleared after a select applet call. It should return VCARD_DONE.
|
||||
|
||||
The signature of VCardProcessAPDU is:
|
||||
VCardStatus (*VCardProcessAPDU)(VCard *card, VCardAPDU *apdu,
|
||||
VCardResponse **response);
|
||||
This function examines the APDU and determines whether it should process
|
||||
the apdu directly, reject the apdu as invalid, or pass the apdu on to
|
||||
the basic 7816 emulator for processing.
|
||||
If the 7816 emulator should process the apdu, then the VCardProcessAPDU
|
||||
should return VCARD_NEXT.
|
||||
If there is an error, then VCardProcessAPDU should return an error
|
||||
response using vcard_make_response and the appropriate 7816 error code
|
||||
(see card_7816t.h) or vcard_make_response with a card type specific error
|
||||
code. It should then return VCARD_DONE.
|
||||
If the apdu can be processed correctly, VCardProcessAPDU should do so,
|
||||
set the response value appropriately for that APDU, and return VCARD_DONE.
|
||||
VCardProcessAPDU should always set the response if it returns VCARD_DONE.
|
||||
It should always either return VCARD_DONE or VCARD_NEXT.
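The VCARD_DONE / VCARD_NEXT contract described above amounts to a two-stage dispatch. The Python sketch below models that control flow only and is not libcacard code: the handlers, the dictionary used as an APDU, and instruction 0x01 are made up for illustration, and responses are reduced to raw status bytes.

```python
VCARD_DONE = "VCARD_DONE"
VCARD_NEXT = "VCARD_NEXT"


def example_applet(apdu):
    """Hypothetical applet: handles one made-up instruction, defers the rest."""
    if apdu["ins"] == 0x01:
        return VCARD_DONE, b"\x90\x00"      # processed, response is set
    return VCARD_NEXT, None                 # let the basic emulator decide


def basic_7816(apdu):
    """Stand-in for the basic emulator: accepts SELECT, rejects the rest."""
    if apdu["ins"] == 0xA4:                 # SELECT
        return VCARD_DONE, b"\x90\x00"
    return VCARD_DONE, b"\x6d\x00"          # instruction not supported


def process(apdu, applet_handler, fallback=basic_7816):
    """Try the applet first; fall back when it returns VCARD_NEXT."""
    status, response = applet_handler(apdu)
    if status == VCARD_NEXT:
        status, response = fallback(apdu)
    if status != VCARD_DONE or response is None:
        raise RuntimeError("a handler must set a response with VCARD_DONE")
    return response


for ins in (0x01, 0xA4, 0x77):
    print(hex(ins), process({"ins": ins}, example_applet).hex())
```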
|
||||
|
||||
Parsing the APDU --
|
||||
|
||||
Prior to calling the card type emulator's VCardProcessAPDU function, the emulator has already decoded the APDU header and set several fields:
|
||||
|
||||
apdu->a_data - The raw apdu data bytes.
|
||||
apdu->a_len - The length of the raw apdu data.
|
||||
apdu->a_body - The start of any post header parameter data.
|
||||
apdu->a_Lc - The parameter length value.
|
||||
apdu->a_Le - The expected length of any returned data.
|
||||
apdu->a_cla - The raw apdu class.
|
||||
apdu->a_channel - The channel (decoded from the class).
|
||||
apdu->a_secure_messaging_type - The decoded secure messaging type
|
||||
(from class).
|
||||
apdu->a_type - The decoded class type.
|
||||
apdu->a_gen_type - the generic class type (7816, PROPRIETARY, RFU, PTS).
|
||||
apdu->a_ins - The instruction byte.
|
||||
apdu->a_p1 - Parameter 1.
|
||||
apdu->a_p2 - Parameter 2.
|
||||
|
||||
Creating a Response --
|
||||
|
||||
The expected result of any APDU call is a response. The card type emulator must
|
||||
set *response with an appropriate VCardResponse value if it returns VCARD_DONE.
|
||||
Responses could be as simple as returning a 2 byte status word response, to as
|
||||
complex as returning a block of data along with a 2 byte response. Which is
|
||||
returned will depend on the semantics of the APDU. The following functions will
|
||||
create card responses.
|
||||
|
||||
VCardResponse *vcard_make_response(VCard7816Status status);
|
||||
|
||||
This is the most basic function to get a response. This function will
|
||||
return a response that consists solely of one 2 byte status code. If that status
|
||||
code is defined in card_7816t.h, then this function is guaranteed to
|
||||
return a response with that status. If a card type specific status code
|
||||
is passed and vcard_make_response fails to allocate the appropriate memory
|
||||
for that response, then vcard_make_response will return a VCardResponse
|
||||
of VCARD7816_STATUS_EXC_ERROR_MEMORY. In any case, this function is
|
||||
guaranteed to return a valid VCardResponse.
|
||||
|
||||
VCardResponse *vcard_response_new(unsigned char *buf, int len,
|
||||
VCard7816Status status);
|
||||
|
||||
This function is similar to vcard_make_response except it includes some
|
||||
returned data with the response. It could also fail to allocate enough
|
||||
memory, in which case it will return NULL.
|
||||
|
||||
VCardResponse *vcard_response_new_status_bytes(unsigned char sw1,
|
||||
unsigned char sw2);
|
||||
|
||||
Sometimes in 7816 the response bytes are treated as two separate bytes with
|
||||
split meanings. This function allows you to create a response based on
|
||||
two separate bytes. This function could fail, in which case it will return
|
||||
NULL.
|
||||
|
||||
VCardResponse *vcard_response_new_bytes(unsigned char *buf, int len,
|
||||
unsigned char sw1,
|
||||
unsigned char sw2);
|
||||
|
||||
This function is the same as vcard_response_new except you may specify
|
||||
the status as two separate bytes like vcard_response_new_status_bytes.
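As a rough illustration of how the data-plus-status and two-byte variants relate, the sketch below builds a response buffer with the payload followed by SW1 and SW2, which is the usual 7816 response layout. DemoResponse and the demo_ functions are hypothetical stand-ins, not the libcacard implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for VCardResponse: payload followed by SW1 SW2. */
typedef struct {
    unsigned char *bytes;   /* len data bytes plus two status bytes */
    int total_len;
} DemoResponse;

/* Build a response from data plus a 16-bit status word.
 * Returns NULL on allocation failure. */
static DemoResponse *demo_response_new(const unsigned char *buf, int len,
                                       unsigned short status)
{
    DemoResponse *r;

    assert(len >= 0 && (len == 0 || buf != NULL));
    r = malloc(sizeof(*r));
    if (r == NULL) {
        return NULL;
    }
    r->bytes = malloc((size_t)len + 2);
    if (r->bytes == NULL) {
        free(r);
        return NULL;
    }
    if (len > 0) {
        memcpy(r->bytes, buf, (size_t)len);
    }
    r->bytes[len] = (unsigned char)(status >> 8);       /* SW1 */
    r->bytes[len + 1] = (unsigned char)(status & 0xff); /* SW2 */
    r->total_len = len + 2;
    return r;
}

/* Same, but with the status given as two separate bytes. */
static DemoResponse *demo_response_new_bytes(const unsigned char *buf, int len,
                                             unsigned char sw1,
                                             unsigned char sw2)
{
    return demo_response_new(buf, len, (unsigned short)((sw1 << 8) | sw2));
}

static void demo_response_delete(DemoResponse *r)
{
    if (r != NULL) {
        free(r->bytes);
        free(r);
    }
}
```

Callers of the NULL-returning variants have to check the result, unlike vcard_make_response, which is described above as always returning a valid response.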
|
||||
|
||||
|
||||
Implementing functionality ---
|
||||
|
||||
The following helper functions access information about the current card
|
||||
and applet.
|
||||
|
||||
VCARDAppletPrivate *vcard_get_current_applet_private(VCard *card,
|
||||
int channel);
|
||||
|
||||
This function returns any private data set by the card type emulator on
|
||||
the currently selected applet. The card type emulator keeps track of the
|
||||
current applet state in this data structure. Any certs and keys associated
|
||||
with a particular applet are also stored here.
|
||||
|
||||
int vcard_emul_get_login_count(VCard *card);
|
||||
|
||||
This function returns the number of remaining login attempts for this
|
||||
card. If the card emulator does not know, or the card does not have a
|
||||
way of giving this information, this function returns -1.
|
||||
|
||||
|
||||
VCard7816Status vcard_emul_login(VCard *card, unsigned char *pin,
|
||||
int pin_len);
|
||||
|
||||
This function logs into the card and returns the standard 7816 status
|
||||
word depending on the success or failure of the call.
|
||||
|
||||
void vcard_emul_delete_key(VCardKey *key);
|
||||
|
||||
This function frees the VCardKey passed in to xxxx_card_init. The card
|
||||
type emulator is responsible for freeing this key when it no longer needs
|
||||
it.
|
||||
|
||||
VCard7816Status vcard_emul_rsa_op(VCard *card, VCardKey *key,
|
||||
unsigned char *buffer,
|
||||
int buffer_size);
|
||||
|
||||
This function does a raw rsa op on the buffer with the given key.
|
||||
|
||||
The sample card type emulator is found in cac.c. It implements the cac specific
|
||||
applets. Only those applets needed by the coolkey pkcs#11 driver on the guest
|
||||
have been implemented. To support the full range of CAC middleware, a complete CAC
|
||||
card according to the CAC specs should be implemented here.
|
||||
|
||||
------------------------------
|
||||
Virtual Card Emulator
|
||||
|
||||
This code accesses both real smart cards and simulated smart cards through
|
||||
services provided on the client. The current implementation uses NSS, which
|
||||
already knows how to talk to various PKCS #11 modules on the client, and is
|
||||
portable to most operating systems. A particular emulator can have only one
|
||||
virtual card implementation at a time.
|
||||
|
||||
The virtual card emulator consists of a series of virtual card services. In
|
||||
addition to the services described above (services starting with
|
||||
vcard_emul_xxxx), the virtual card emulator also provides the following
|
||||
functions:
|
||||
|
||||
VCardEmulError vcard_emul_init(const VCardEmulOptions *options);
|
||||
|
||||
The options structure is built by another function in the virtual card
|
||||
interface where a string of virtual card emulator specific strings is
|
||||
mapped to the options. The actual structure is defined by the virtual card
|
||||
emulator and is used to determine the configuration of soft cards, or to
|
||||
determine which physical cards to present to the guest.
|
||||
|
||||
The vcard_emul_init function will build up sets of readers and create any
|
||||
threads that are needed to watch for changes in the reader state. If readers
|
||||
have cards present in them, they are also initialized.
|
||||
|
||||
Readers are created with the function:
|
||||
|
||||
VReader *vreader_new(VReaderEmul *reader_emul,
|
||||
VReaderEmulFree reader_emul_free);
|
||||
|
||||
The reader_emul_free function is used to free the VReaderEmul * when the reader is
|
||||
destroyed. The VReaderEmul structure is an opaque structure to the
|
||||
rest of the code, but defined by the virtual card emulator, which can
|
||||
use it to store any reader specific state.
|
||||
|
||||
Once the reader has been created, it can be added to the front end with the
|
||||
call:
|
||||
|
||||
VReaderStatus vreader_add_reader(VReader *reader);
|
||||
|
||||
This function will automatically generate the appropriate new reader
|
||||
events and add the reader to the list.
|
||||
|
||||
To create a new card, the virtual card emulator will call a similar
|
||||
function.
|
||||
|
||||
VCard *vcard_new(VCardEmul *card_emul,
|
||||
VCardEmulFree card_emul_free);
|
||||
|
||||
Like vreader_new, this function takes a virtual card emulator specific
|
||||
structure which it uses to keep track of the card state.
|
||||
|
||||
Once the card is created, it is attached to a card type emulator with the
|
||||
following function:
|
||||
|
||||
VCardStatus vcard_init(VCard *vcard, VCardEmulType type,
                       const char *flags,
                       unsigned char *const *certs,
                       int *cert_len,
                       VCardKey *key[],
                       int cert_count);
|
||||
|
||||
The vcard is the value returned from vcard_new. The type is the
|
||||
card type emulator that this card should be presented to the guest as.
|
||||
The flags are card type emulator specific options. The certs,
|
||||
cert_len, and keys are all arrays of length cert_count. These are the
|
||||
same as the parameters xxxx_card_init() accepts.
|
||||
|
||||
Finally the card is associated with its reader by the call:
|
||||
|
||||
VReaderStatus vreader_insert_card(VReader *vreader, VCard *vcard);
|
||||
|
||||
This function, like vreader_add_reader, will take care of any event
|
||||
notification for the card insert.
|
||||
|
||||
|
||||
VCardEmulError vcard_emul_force_card_remove(VReader *vreader);
|
||||
|
||||
Force a card that is present to appear to be removed to the guest, even if
|
||||
that card is a physical card and is present.
|
||||
|
||||
|
||||
VCardEmulError vcard_emul_force_card_insert(VReader *reader);
|
||||
|
||||
Force a card that has been removed by vcard_emul_force_card_remove to be
|
||||
reinserted from the point of view of the guest. This will only work if the
|
||||
card is physically present (which is always true for a soft card).
|
||||
|
||||
void vcard_emul_get_atr(VCard *card, unsigned char *atr, int *atr_len);
|
||||
|
||||
Return the virtual ATR for the card. By convention this should be the value
|
||||
VCARD_ATR_PREFIX(size) followed by several ascii bytes related to this
|
||||
particular emulator. For instance the NSS emulator returns
|
||||
{VCARD_ATR_PREFIX(3), 'N', 'S', 'S' }. Do not return more data than *atr_len.
|
||||
|
||||
void vcard_emul_reset(VCard *card, VCardPower power);
|
||||
|
||||
Set the state of 'card' to the current power level and reset its internal
|
||||
state (logout, etc).
|
||||
|
||||
-------------------------------------------------------
|
||||
List of files and their function:
|
||||
README - This file
|
||||
card_7816.c - emulate basic 7816 functionality. Parse APDUs.
|
||||
card_7816.h - apdu and response services definitions.
|
||||
card_7816t.h - 7816 specific structures, types and definitions.
|
||||
event.c - event handling code.
|
||||
event.h - event handling services definitions.
|
||||
eventt.h - event handling structures and types
|
||||
vcard.c - handle common virtual card services like creation, destruction, and
|
||||
applet management.
|
||||
vcard.h - common virtual card services function definitions.
|
||||
vcardt.h - common virtual card types.
|
||||
vreader.c - common virtual reader services.
|
||||
vreader.h - common virtual reader services definitions.
|
||||
vreadert.h - common virtual reader types.
|
||||
vcard_emul_type.c - manage the card type emulators.
|
||||
vcard_emul_type.h - definitions for card type emulators.
|
||||
cac.c - card type emulator for CAC cards
|
||||
vcard_emul.h - virtual card emulator service definitions.
|
||||
vcard_emul_nss.c - virtual card emulator implementation for nss.
|
||||
vscclient.c - socket connection to guest qemu usb driver.
|
||||
vscard_common.h - common header with the guest qemu usb driver.
|
||||
mutex.h - header file for machine independent mutexes.
|
||||
link_test.c - static test to make sure all the symbols are properly defined.
|
|
@ -1,58 +0,0 @@
|
|||
LIVE BLOCK OPERATIONS
|
||||
=====================
|
||||
|
||||
High level description of live block operations. Note these are not
|
||||
supported for use with the raw format at the moment.
|
||||
|
||||
Snapshot live merge
|
||||
===================
|
||||
|
||||
Given a snapshot chain, described in this document in the following
|
||||
format:
|
||||
|
||||
[A] -> [B] -> [C] -> [D]
|
||||
|
||||
Where the rightmost object ([D] in the example) is the current
|
||||
image which the guest OS has write access to. To the left of it is its base
|
||||
image, and so on accordingly until the leftmost image, which has no
|
||||
base.
|
||||
|
||||
The snapshot live merge operation transforms such a chain into a
|
||||
smaller one with fewer elements, such as this transformation relative
|
||||
to the first example:
|
||||
|
||||
[A] -> [D]
|
||||
|
||||
Currently only forward merge with target being the active image is
|
||||
supported, that is, data is copied towards the right with the
|
||||
destination being the rightmost image.
|
||||
|
||||
The operation is implemented in QEMU through image streaming facilities.
|
||||
|
||||
The basic idea is to execute 'block_stream virtio0' while the guest is
|
||||
running. Progress can be monitored using 'info block-jobs'. When the
|
||||
streaming operation completes it raises a QMP event. 'block_stream'
|
||||
copies data from the backing file(s) into the active image. When finished,
|
||||
it adjusts the backing file pointer.
|
||||
|
||||
The 'base' parameter specifies an image which data need not be streamed from.
|
||||
This image will be used as the backing file for the active image when the
|
||||
operation is finished.
|
||||
|
||||
In the example above, the command would be:
|
||||
|
||||
(qemu) block_stream virtio0 A
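The effect of the 'base' parameter on the chain can be modelled with a small sketch. It only tracks which images remain in the chain after streaming completes, not the data copy itself; DemoChain and demo_stream are hypothetical names, and the behaviour without a base (only the active image remains) is an assumption based on the description above.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_CHAIN 8

/* A backing chain, base image first and active image last: A -> B -> C -> D. */
typedef struct {
    const char *images[DEMO_MAX_CHAIN];
    int count;
} DemoChain;

/*
 * Model the end state of streaming into the active image.  Everything
 * between 'base' and the active image drops out of the chain, since its
 * data now lives in the active image.  With base == NULL the whole chain
 * is streamed and only the active image remains.  Returns 0 on success,
 * -1 if 'base' is not a backing image of the active image.
 */
static int demo_stream(DemoChain *chain, const char *base)
{
    int base_idx = -1;
    int i;

    assert(chain != NULL && chain->count >= 1 &&
           chain->count <= DEMO_MAX_CHAIN);
    if (base != NULL) {
        for (i = 0; i < chain->count - 1; i++) {
            if (strcmp(chain->images[i], base) == 0) {
                base_idx = i;
                break;
            }
        }
        if (base_idx < 0) {
            return -1;
        }
    }
    /* Keep images[0..base_idx], then the active image. */
    chain->images[base_idx + 1] = chain->images[chain->count - 1];
    chain->count = base_idx + 2;
    return 0;
}
```

With the chain [A] -> [B] -> [C] -> [D] and base "A", the model ends with [A] -> [D], matching the transformation shown earlier.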
|
||||
|
||||
|
||||
Live block copy
|
||||
===============
|
||||
|
||||
To copy an in use image to another destination in the filesystem, one
|
||||
should create a live snapshot in the desired destination, then stream
|
||||
into that image. Example:
|
||||
|
||||
(qemu) snapshot_blkdev ide0-hd0 /new-path/disk.img qcow2
|
||||
|
||||
(qemu) block_stream ide0-hd0
|
||||
|
||||
|
|
@ -1,296 +0,0 @@
|
|||
= Migration =
|
||||
|
||||
QEMU has code to load/save the state of the guest that it is running.
|
||||
These are two complementary operations. Saving the state just does
|
||||
that, saves the state for each device that the guest is running.
|
||||
Restoring a guest is just the opposite operation: we need to load the
|
||||
state of each device.
|
||||
|
||||
For this to work, QEMU has to be launched with the same arguments the
|
||||
two times. I.e. it can only restore the state in one guest that has
|
||||
the same devices as the one it was saved from (this last requirement can
|
||||
be relaxed a bit, but for now we can consider that configuration has
|
||||
to be exactly the same).
|
||||
|
||||
Once we are able to save/restore a guest, a new functionality is
|
||||
requested: migration. This means that QEMU is able to start in one
|
||||
machine and be "migrated" to another machine, i.e. be moved to
|
||||
another machine.
|
||||
|
||||
Next was the "live migration" functionality. This is important
|
||||
because some guests run with a lot of state (especially RAM), and it
|
||||
can take a while to move all state from one machine to another. Live
|
||||
migration allows the guest to continue running while the state is
|
||||
transferred. Only while the last part of the state is transferred has
|
||||
the guest to be stopped. Typically the time that the guest is
|
||||
unresponsive during live migration is in the low hundreds of milliseconds
|
||||
(notice that this depends on a lot of things).
|
||||
|
||||
=== Types of migration ===
|
||||
|
||||
Now that we have talked about live migration, there are several ways
|
||||
to do migration:
|
||||
|
||||
- tcp migration: do the migration using tcp sockets
|
||||
- unix migration: do the migration using unix sockets
|
||||
- exec migration: do the migration using the stdin/stdout through a process.
|
||||
- fd migration: do the migration using a file descriptor that is
|
||||
passed to QEMU. QEMU doesn't care how this file descriptor is opened.
|
||||
|
||||
All these four migration protocols use the same infrastructure to
|
||||
save/restore device state. This infrastructure is shared with the
|
||||
savevm/loadvm functionality.
|
||||
|
||||
=== State Live Migration ===
|
||||
|
||||
This is used for RAM and block devices. It is not yet ported to vmstate.
|
||||
<Fill more information here>
|
||||
|
||||
=== What is the common infrastructure ===
|
||||
|
||||
QEMU uses a QEMUFile abstraction to be able to do migration. Any type
|
||||
of migration that wants to use QEMU infrastructure has to create a
|
||||
QEMUFile with:
|
||||
|
||||
QEMUFile *qemu_fopen_ops(void *opaque,
                         QEMUFilePutBufferFunc *put_buffer,
                         QEMUFileGetBufferFunc *get_buffer,
                         QEMUFileCloseFunc *close);
|
||||
|
||||
The functions have the following functionality:
|
||||
|
||||
This function writes a chunk of data to a file at the given position.
|
||||
The pos argument can be ignored if the file is only used for
|
||||
streaming. The handler should try to write all of the data it can.
|
||||
|
||||
typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
|
||||
int64_t pos, int size);
|
||||
|
||||
Read a chunk of data from a file at the given position. The pos argument
|
||||
can be ignored if the file is only used for streaming. The number of
|
||||
bytes actually read should be returned.
|
||||
|
||||
typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
|
||||
int64_t pos, int size);
|
||||
|
||||
Close a file and return an error code.
|
||||
|
||||
typedef int (QEMUFileCloseFunc)(void *opaque);
|
||||
|
||||
You can use any internal state that you need using the opaque void *
|
||||
pointer that is passed to all functions.
|
||||
|
||||
The important functions for us are put_buffer()/get_buffer() that
|
||||
allow us to write/read a buffer to/from the QEMUFile.
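As a sketch of what such callbacks can look like, here is a backend that keeps the "file" in a fixed-size memory buffer reached through the opaque pointer. The typedefs are copied from above; DemoMemFile and the demo_ functions are hypothetical, and returning the number of bytes written from the put handler is an assumption, since the text only says it should write as much as it can.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Callback types as given above. */
typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
                                    int64_t pos, int size);
typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
                                    int64_t pos, int size);
typedef int (QEMUFileCloseFunc)(void *opaque);

/* Hypothetical backend state: a fixed-size in-memory "file". */
typedef struct {
    uint8_t data[64];
    int64_t used;     /* number of valid bytes in data[] */
    int closed;
} DemoMemFile;

static int demo_put_buffer(void *opaque, const uint8_t *buf,
                           int64_t pos, int size)
{
    DemoMemFile *f = opaque;
    int64_t room;

    assert(pos >= 0 && size >= 0);
    room = (int64_t)sizeof(f->data) - pos;
    if (room <= 0) {
        return 0;
    }
    if (size > room) {
        size = (int)room;   /* write as much as fits */
    }
    memcpy(f->data + pos, buf, (size_t)size);
    if (pos + size > f->used) {
        f->used = pos + size;
    }
    return size;
}

static int demo_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
{
    DemoMemFile *f = opaque;
    int64_t avail;

    assert(pos >= 0 && size >= 0);
    avail = f->used - pos;
    if (avail <= 0) {
        return 0;           /* nothing left to read */
    }
    if (size > avail) {
        size = (int)avail;
    }
    memcpy(buf, f->data + pos, (size_t)size);
    return size;            /* bytes actually read */
}

static int demo_close(void *opaque)
{
    DemoMemFile *f = opaque;

    f->closed = 1;
    return 0;
}
```

The three functions have the shapes expected for the put_buffer, get_buffer and close arguments, with a DemoMemFile pointer passed as the opaque value.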
|
||||
|
||||
=== How to save the state of one device ===
|
||||
|
||||
The state of a device is saved using intermediate buffers. There are
|
||||
some helper functions to assist this saving.
|
||||
|
||||
There is a new concept that we have to explain here: device state
|
||||
version. When we migrate a device, we save/load the state as a series
|
||||
of fields. Sometimes, due to bugs or new functionality, we need to
|
||||
change the state to store more/different information. We use the
|
||||
version to identify each time that we make a change. Each version is
|
||||
associated with a series of fields saved. The save_state always saves
|
||||
the state as the newest version. But load_state is sometimes able to
|
||||
load state from an older version.
|
||||
|
||||
=== Legacy way ===
|
||||
|
||||
This way is going to disappear as soon as all current users are ported to VMSTATE.
|
||||
|
||||
Each device has to register two functions, one to save the state and
|
||||
another to load the state back.
|
||||
|
||||
int register_savevm(DeviceState *dev,
                    const char *idstr,
                    int instance_id,
                    int version_id,
                    SaveStateHandler *save_state,
                    LoadStateHandler *load_state,
                    void *opaque);
|
||||
|
||||
typedef void SaveStateHandler(QEMUFile *f, void *opaque);
|
||||
typedef int LoadStateHandler(QEMUFile *f, void *opaque, int version_id);
|
||||
|
||||
The important functions for the device state format are the save_state
|
||||
and load_state. Notice that load_state receives a version_id
|
||||
parameter to know what state format it is receiving. save_state doesn't
|
||||
have a version_id parameter because it always uses the latest version.
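A pair of handlers in this style might look like the sketch below. DemoFile stands in for QEMUFile, and the device, its fields and the version history (a 'mode' byte added in version 2) are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMUFile: a tiny byte stream. */
typedef struct {
    uint8_t buf[16];
    int wpos, rpos;
} DemoFile;

static void demo_put_byte(DemoFile *f, uint8_t v)
{
    assert(f->wpos < (int)sizeof(f->buf));
    f->buf[f->wpos++] = v;
}

static uint8_t demo_get_byte(DemoFile *f)
{
    assert(f->rpos < f->wpos);
    return f->buf[f->rpos++];
}

/* Hypothetical device: 'mode' was added to the state in version 2. */
typedef struct {
    uint8_t status;
    uint8_t mode;
} DemoDevice;

#define DEMO_VERSION 2

/* Same shape as SaveStateHandler: always writes the latest format. */
static void demo_save(DemoFile *f, void *opaque)
{
    DemoDevice *d = opaque;

    demo_put_byte(f, d->status);
    demo_put_byte(f, d->mode);
}

/* Same shape as LoadStateHandler: version_id selects the format. */
static int demo_load(DemoFile *f, void *opaque, int version_id)
{
    DemoDevice *d = opaque;

    if (version_id < 1 || version_id > DEMO_VERSION) {
        return -1;              /* a format we do not understand */
    }
    d->status = demo_get_byte(f);
    if (version_id >= 2) {
        d->mode = demo_get_byte(f);
    } else {
        d->mode = 0;            /* default for streams that predate 'mode' */
    }
    return 0;
}
```

The two functions must agree on field order and on which version introduced each field, which is exactly the kind of hand-maintained coupling the legacy approach requires.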
|
||||
|
||||
=== VMState ===
|
||||
|
||||
The legacy way of saving/loading state of the device had the problem
|
||||
that we have to maintain two functions in sync. If we made a change
|
||||
in one of them and not in the other, we would get a failed migration.
|
||||
|
||||
VMState changed the way that state is saved/loaded. Instead of using
|
||||
a function to save the state and another to load it, it was changed to
|
||||
a declarative way of what the state consisted of. Now VMState is able
|
||||
to interpret that definition to be able to load/save the state. As
|
||||
the state is declared only once, it can't go out of sync in the
|
||||
save/load functions.
|
||||
|
||||
An example (from hw/input/pckbd.c)
|
||||
|
||||
static const VMStateDescription vmstate_kbd = {
    .name = "pckbd",
    .version_id = 3,
    .minimum_version_id = 3,
    .fields = (VMStateField[]) {
        VMSTATE_UINT8(write_cmd, KBDState),
        VMSTATE_UINT8(status, KBDState),
        VMSTATE_UINT8(mode, KBDState),
        VMSTATE_UINT8(pending, KBDState),
        VMSTATE_END_OF_LIST()
    }
};
|
||||
|
||||
We are declaring the state with name "pckbd".
|
||||
The version_id is 3, and the fields are 4 uint8_t in a KBDState structure.
|
||||
We registered this with:
|
||||
|
||||
vmstate_register(NULL, 0, &vmstate_kbd, s);
|
||||
|
||||
Note: talk about how vmstate <-> qdev interact, and what the instance ids mean.
|
||||
|
||||
You can search for VMSTATE_* macros for lots of types used in QEMU in
|
||||
include/hw/hw.h.
|
||||
|
||||
=== More about versions ===
|
||||
|
||||
You can see that there are several version fields:
|
||||
|
||||
- version_id: the maximum version_id supported by VMState for that device.
|
||||
- minimum_version_id: the minimum version_id that VMState is able to understand
|
||||
for that device.
|
||||
- minimum_version_id_old: For devices that were not able to port to vmstate, we can
|
||||
assign a function that knows how to read this old state. This field is
|
||||
ignored if there is no load_state_old handler.
|
||||
|
||||
So, VMState is able to read versions from minimum_version_id to
|
||||
version_id. And the function load_state_old() (if present) is able to
|
||||
load state from minimum_version_id_old to minimum_version_id. This
|
||||
function is deprecated and will be removed when no more users are left.
|
||||
|
||||
=== Massaging functions ===
|
||||
|
||||
Sometimes, it is not enough to be able to save the state directly
|
||||
from one structure; we first need to fill in the correct values there. One
|
||||
example is when we are using kvm. Before saving the cpu state, we
|
||||
need to ask kvm to copy to QEMU the state that it is using. And the
|
||||
opposite when we are loading the state, we need a way to tell kvm to
|
||||
load the state for the cpu that we have just loaded from the QEMUFile.
|
||||
|
||||
The functions to do that are inside a vmstate definition, and are called:
|
||||
|
||||
- int (*pre_load)(void *opaque);
|
||||
|
||||
This function is called before we load the state of one device.
|
||||
|
||||
- int (*post_load)(void *opaque, int version_id);
|
||||
|
||||
This function is called after we load the state of one device.
|
||||
|
||||
- void (*pre_save)(void *opaque);
|
||||
|
||||
This function is called before we save the state of one device.
|
||||
|
||||
Example: You can look at hpet.c, which uses the three functions to
|
||||
massage the state that is transferred.
|
||||
|
||||
If you use memory API functions that update memory layout outside
|
||||
initialization (i.e., in response to a guest action), this is a strong
|
||||
indication that you need to call these functions in a post_load callback.
|
||||
Examples of such memory API functions are:
|
||||
|
||||
- memory_region_add_subregion()
|
||||
- memory_region_del_subregion()
|
||||
- memory_region_set_readonly()
|
||||
- memory_region_set_enabled()
|
||||
- memory_region_set_address()
|
||||
- memory_region_set_alias_offset()
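The reason for redoing such calls in post_load can be shown with a small model: the migrated field arrives on the destination, but the memory layout derived from it does not. The device, its fields and demo_set_window (a stand-in for a call such as memory_region_set_enabled()) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device: 'window_enabled' is migrated, 'window_mapped' is
 * derived runtime state that is never put on the wire. */
typedef struct {
    uint8_t window_enabled;
    int window_mapped;
    int map_calls;          /* counts calls to the stand-in below */
} DemoDevice;

/* Stand-in for a memory API call such as memory_region_set_enabled(). */
static void demo_set_window(DemoDevice *d, int enabled)
{
    d->window_mapped = enabled;
    d->map_calls++;
}

/* Guest-triggered path: updates both the migrated and the derived state. */
static void demo_guest_write(DemoDevice *d, uint8_t value)
{
    d->window_enabled = value & 1;
    demo_set_window(d, d->window_enabled);
}

/* Same shape as the post_load hook: rebuild the derived state from the
 * fields that were just loaded. */
static int demo_post_load(void *opaque, int version_id)
{
    DemoDevice *d = opaque;

    (void)version_id;
    demo_set_window(d, d->window_enabled);
    return 0;
}
```

On the destination, the guest write that originally triggered the layout change never happens again, so the post_load hook is the only place where the layout can be brought back in line with the loaded fields.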
|
||||
|
||||
=== Subsections ===
|
||||
|
||||
The use of version_id allows us to migrate from older versions
|
||||
to newer versions of a device. But not the other way around. This
|
||||
makes it very complicated to fix bugs in stable branches. If we need to
|
||||
add anything to the state to fix a bug, we have to disable migration
|
||||
to older versions that don't have that bug-fix (i.e. a new field).
|
||||
|
||||
But that bug-fix is only needed sometimes, not always. For
|
||||
instance, if the device is in the middle of a DMA operation, it is
|
||||
using a specific functionality, ....
|
||||
|
||||
It is impossible to create a way to make migration from any version to
|
||||
any other version work. But we can do better than only allowing
|
||||
migration from older versions to newer ones. For those fields that are
|
||||
only needed sometimes, we add the idea of subsections. A subsection
|
||||
is "like" a device vmstate, but with one particularity: it has a Boolean
|
||||
function that tells if those values need to be sent or not. If
|
||||
this function returns false, the subsection is not sent.
|
||||
|
||||
On the receiving side, if we find a subsection for a device that we
|
||||
don't understand, we just fail the migration. If we understand all
|
||||
the subsections, then we load the state with success.
|
||||
|
||||
One important note is that the post_load() function is called "after"
|
||||
loading all subsections, because a newer subsection could change the same
|
||||
value that it uses.
|
||||
|
||||
Example:
|
||||
|
||||
static bool ide_drive_pio_state_needed(void *opaque)
{
    IDEState *s = opaque;

    return ((s->status & DRQ_STAT) != 0)
        || (s->bus->error_status & BM_STATUS_PIO_RETRY);
}

const VMStateDescription vmstate_ide_drive_pio_state = {
    .name = "ide_drive/pio_state",
    .version_id = 1,
    .minimum_version_id = 1,
    .pre_save = ide_drive_pio_pre_save,
    .post_load = ide_drive_pio_post_load,
    .fields = (VMStateField[]) {
        VMSTATE_INT32(req_nb_sectors, IDEState),
        VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                             vmstate_info_uint8, uint8_t),
        VMSTATE_INT32(cur_io_buffer_offset, IDEState),
        VMSTATE_INT32(cur_io_buffer_len, IDEState),
        VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
        VMSTATE_INT32(elementary_transfer_size, IDEState),
        VMSTATE_INT32(packet_transfer_size, IDEState),
        VMSTATE_END_OF_LIST()
    }
};

const VMStateDescription vmstate_ide_drive = {
    .name = "ide_drive",
    .version_id = 3,
    .minimum_version_id = 0,
    .post_load = ide_drive_post_load,
    .fields = (VMStateField[]) {
        .... several fields ....
        VMSTATE_END_OF_LIST()
    },
    .subsections = (VMStateSubsection []) {
        {
            .vmsd = &vmstate_ide_drive_pio_state,
            .needed = ide_drive_pio_state_needed,
        }, {
            /* empty */
        }
    }
};
|
||||
|
||||
Here we have a subsection for the pio state. We only need to
|
||||
save/send this state when we are in the middle of a pio operation
|
||||
(that is what ide_drive_pio_state_needed() checks). If DRQ_STAT is
|
||||
not enabled, the values in those fields are garbage and don't need to
|
||||
be sent.
|
|
@ -1,134 +0,0 @@
|
|||
Copyright (c) 2014 Red Hat Inc.
|
||||
|
||||
This work is licensed under the terms of the GNU GPL, version 2 or later. See
|
||||
the COPYING file in the top-level directory.
|
||||
|
||||
|
||||
This document explains the IOThread feature and how to write code that runs
|
||||
outside the QEMU global mutex.
|
||||
|
||||
The main loop and IOThreads
|
||||
---------------------------
|
||||
QEMU is an event-driven program that can do several things at once using an
|
||||
event loop. The VNC server and the QMP monitor are both processed from the
|
||||
same event loop, which monitors their file descriptors until they become
|
||||
readable and then invokes a callback.
|
||||
|
||||
The default event loop is called the main loop (see main-loop.c). It is
|
||||
possible to create additional event loop threads using -object
|
||||
iothread,id=my-iothread.
|
||||
|
||||
Side note: The main loop and IOThread are both event loops but their code is
|
||||
not shared completely. Sometimes it is useful to remember that although they
|
||||
are conceptually similar they are currently not interchangeable.
|
||||
|
||||
Why IOThreads are useful
|
||||
------------------------
|
||||
IOThreads allow the user to control the placement of work. The main loop is a
|
||||
scalability bottleneck on hosts with many CPUs. Work can be spread across
|
||||
several IOThreads instead of just one main loop. When set up correctly this
|
||||
can improve I/O latency and reduce jitter seen by the guest.
|
||||
|
||||
The main loop is also deeply associated with the QEMU global mutex, which is a
|
||||
scalability bottleneck in itself. vCPU threads and the main loop use the QEMU
|
||||
global mutex to serialize execution of QEMU code. This mutex is necessary
|
||||
because a lot of QEMU's code historically was not thread-safe.
|
||||
|
||||
The fact that all I/O processing is done in a single main loop and that the
|
||||
QEMU global mutex is contended by all vCPU threads and the main loop explains
|
||||
why it is desirable to place work into IOThreads.
|
||||
|
||||
The experimental virtio-blk data-plane implementation has been benchmarked and
|
||||
shows these effects:
|
||||
ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf
|
||||
|
||||
How to program for IOThreads
|
||||
----------------------------
|
||||
The main difference between legacy code and new code that can run in an
|
||||
IOThread is dealing explicitly with the event loop object, AioContext
|
||||
(see include/block/aio.h). Code that only works in the main loop
|
||||
implicitly uses the main loop's AioContext. Code that supports running
|
||||
in IOThreads must be aware of its AioContext.
|
||||
|
||||
AioContext supports the following services:
|
||||
* File descriptor monitoring (read/write/error on POSIX hosts)
|
||||
* Event notifiers (inter-thread signalling)
|
||||
* Timers
|
||||
* Bottom Halves (BH) deferred callbacks
|
||||
|
||||
There are several old APIs that use the main loop AioContext:
|
||||
* LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor
|
||||
* LEGACY qemu_aio_set_event_notifier() - monitor an event notifier
|
||||
* LEGACY timer_new_ms() - create a timer
|
||||
* LEGACY qemu_bh_new() - create a BH
|
||||
* LEGACY qemu_aio_wait() - run an event loop iteration
|
||||
|
||||
Since they implicitly work on the main loop they cannot be used in code that
|
||||
runs in an IOThread. They might cause a crash or deadlock if called from an
|
||||
IOThread since the QEMU global mutex is not held.
|
||||
|
||||
Instead, use the AioContext functions directly (see include/block/aio.h):
|
||||
* aio_set_fd_handler() - monitor a file descriptor
|
||||
* aio_set_event_notifier() - monitor an event notifier
|
||||
* aio_timer_new() - create a timer
|
||||
* aio_bh_new() - create a BH
|
||||
* aio_poll() - run an event loop iteration
|
||||
|
||||
The AioContext can be obtained from the IOThread using
|
||||
iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
|
||||
Code that takes an AioContext argument works in both IOThreads and the main
|
||||
loop, depending on which AioContext instance the caller passes in.
|
||||
|
||||
How to synchronize with an IOThread
|
||||
-----------------------------------
|
||||
AioContext is not thread-safe so some rules must be followed when using file
|
||||
descriptors, event notifiers, timers, or BHs across threads:
|
||||
|
||||
1. AioContext functions can be called safely from file descriptor, event
|
||||
notifier, timer, or BH callbacks invoked by the AioContext. No locking is
|
||||
necessary.
|
||||
|
||||
2. Other threads wishing to access the AioContext must use
|
||||
aio_context_acquire()/aio_context_release() for mutual exclusion. Once the
|
||||
context is acquired no other thread can access it or run event loop iterations
|
||||
in this AioContext.
|
||||
|
||||
aio_context_acquire()/aio_context_release() calls may be nested. This
|
||||
means you can call them if you're not sure whether #1 applies.
|
||||
|
||||
There is currently no lock ordering rule if a thread needs to acquire multiple
|
||||
AioContexts simultaneously. Therefore, it is only safe for code holding the
|
||||
QEMU global mutex to acquire other AioContexts.
|
||||
|
||||
Side note: the best way to schedule a function call across threads is to create
|
||||
a BH in the target AioContext beforehand and then call qemu_bh_schedule(). No
|
||||
acquire/release or locking is needed for the qemu_bh_schedule() call. But be
|
||||
sure to acquire the AioContext for aio_bh_new() if necessary.
|
||||
|
||||
The relationship between AioContext and the block layer
|
||||
-------------------------------------------------------
|
||||
The AioContext originates from the QEMU block layer because it provides a
|
||||
scoped way of running event loop iterations until all work is done. This
|
||||
feature is used to complete all in-flight block I/O requests (see
|
||||
bdrv_drain_all()). Nowadays AioContext is a generic event loop that can be
|
||||
used by any QEMU subsystem.
|
||||
|
||||
The block layer has support for AioContext integrated. Each BlockDriverState
|
||||
is associated with an AioContext using bdrv_set_aio_context() and
|
||||
bdrv_get_aio_context(). This allows block layer code to process I/O inside the
|
||||
right AioContext. Other subsystems may wish to follow a similar approach.
|
||||
|
||||
Block layer code must therefore expect to run in an IOThread and avoid using
|
||||
old APIs that implicitly use the main loop. See the "How to program for
|
||||
IOThreads" section above for information on how to do that.
|
||||
|
||||
If main loop code such as a QMP function wishes to access a BlockDriverState it
|
||||
must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure the
|
||||
IOThread does not run in parallel.
|
||||
|
||||
Long-running jobs (usually in the form of coroutines) are best scheduled in the
|
||||
BlockDriverState's AioContext to avoid the need to acquire/release around each
|
||||
bdrv_*() call. Be aware that there is currently no mechanism to get notified
|
||||
when bdrv_set_aio_context() moves this BlockDriverState to a different
|
||||
AioContext (see bdrv_detach_aio_context()/bdrv_attach_aio_context()), so you
|
||||
may need to add this if you want to support long-running jobs.
|
|
@ -1,102 +0,0 @@
|
|||
|
||||
multiseat howto (with some multihead coverage)
|
||||
==============================================
|
||||
|
||||
host side
|
||||
---------
|
||||
|
||||
First you must compile qemu with a user interface supporting
|
||||
multihead/multiseat and input event routing. Right now this
|
||||
list includes sdl2 and gtk (both 2+3):
|
||||
|
||||
./configure --enable-sdl --with-sdlabi=2.0
|
||||
|
||||
or
|
||||
|
||||
./configure --enable-gtk
|
||||
|
||||
|
||||
Next put together the qemu command line:
|
||||
|
||||
qemu -enable-kvm -usb $memory $disk $whatever \
|
||||
-display [ sdl | gtk ] \
|
||||
-vga std \
|
||||
-device usb-tablet
|
||||
|
||||
That is it for the first head, which will use the standard vga, the
|
||||
standard ps/2 keyboard (implicitly there) and the usb-tablet. Now the
|
||||
additional switches for the second head:
|
||||
|
||||
-device pci-bridge,addr=12.0,chassis_nr=2,id=head.2 \
|
||||
-device secondary-vga,bus=head.2,addr=02.0,id=video.2 \
|
||||
-device nec-usb-xhci,bus=head.2,addr=0f.0,id=usb.2 \
|
||||
-device usb-kbd,bus=usb.2.0,port=1,display=video.2 \
|
||||
-device usb-tablet,bus=usb.2.0,port=2,display=video.2
|
||||
|
||||
This places a pci bridge in slot 12, connects a display adapter and
|
||||
xhci (usb) controller to the bridge. Then it adds a usb keyboard and
|
||||
usb tablet, both connected to the xhci and linked to the display.
|
||||
|
||||
The "display=video.2" sets up the input routing. Any input coming from
|
||||
the window which belongs to the video.2 display adapter will be routed
|
||||
to these input devices.
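
The switches for an extra head follow a regular naming pattern, so they can be generated. This is a minimal sketch; extra_head_args() is a hypothetical helper, and reusing the same pattern for heads other than 2 is an assumption that this howto does not test.

```python
def extra_head_args(n, bridge_addr):
    """Return the -device switches for head number n (hypothetical helper)."""
    bridge, video, usb = f"head.{n}", f"video.{n}", f"usb.{n}"
    devices = [
        # pci bridge that groups everything belonging to this head
        f"pci-bridge,addr={bridge_addr},chassis_nr={n},id={bridge}",
        # display adapter and usb controller sit behind the bridge
        f"secondary-vga,bus={bridge},addr=02.0,id={video}",
        f"nec-usb-xhci,bus={bridge},addr=0f.0,id={usb}",
        # input devices are linked to the display for input routing
        f"usb-kbd,bus={usb}.0,port=1,display={video}",
        f"usb-tablet,bus={usb}.0,port=2,display={video}",
    ]
    args = []
    for dev in devices:
        args += ["-device", dev]
    return args


args = extra_head_args(2, "12.0")
print(" \\\n".join(f"{a} {b}" for a, b in zip(args[::2], args[1::2])))
```

For n=2 and bridge_addr="12.0" this reproduces the five switches listed above.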
|
||||
|
||||
The sdl2 ui will start up with two windows, one for each display
|
||||
device. The gtk ui will start with a single window and each display
|
||||
in a separate tab. You can either simply switch tabs to switch heads,
|
||||
or use the "View / Detach tab" menu item to move one of the displays
|
||||
to its own window so you can see both display devices side-by-side.
|
||||
|
||||
Note on spice: Spice handles multihead just fine. But it can't do
|
||||
multiseat. For tablet events the event source is sent to the spice
|
||||
agent. But qemu can't figure it out, so it can't do input routing.
|
||||
Fixing this needs a new or extended input interface between
|
||||
libspice-server and qemu. For keyboard events it is even worse: The
|
||||
event source isn't included in the spice protocol, so the wire
|
||||
protocol must be extended to support this.
|
||||
|
||||
|
||||
guest side
|
||||
----------
|
||||
|
||||
You need a fairly recent linux guest: systemd with loginctl, and kernel
3.14+ with CONFIG_DRM_BOCHS enabled. Fedora 20 will do, but it must be
fully updated to get the new kernel, i.e. the live iso doesn't cut
it.
|
||||
|
||||
Now we'll have to configure the guest. Boot and login. "lspci -vt"
|
||||
should list the pci bridge with the display adapter and usb controller:
|
||||
|
||||
[root@fedora ~]# lspci -vt
|
||||
-[0000:00]-+-00.0 Intel Corporation 440FX - 82441FX PMC [Natoma]
|
||||
[ ... ]
|
||||
\-12.0-[01]--+-02.0 Device 1234:1111
|
||||
\-0f.0 NEC Corporation USB 3.0 Host Controller
|
||||
|
||||
Good. Now let's tell the system that the pci bridge and all devices
|
||||
below it belong to a separate seat by dropping a file into
|
||||
/etc/udev/rules.d:
|
||||
|
||||
[root@fedora ~]# cat /etc/udev/rules.d/70-qemu-autoseat.rules
|
||||
SUBSYSTEMS=="pci", DEVPATH=="*/0000:00:12.0", TAG+="seat", ENV{ID_AUTOSEAT}="1"
|
||||
|
||||
Reboot. System should come up with two seats. With loginctl you can
|
||||
check the configuration:
|
||||
|
||||
[root@fedora ~]# loginctl list-seats
|
||||
SEAT
|
||||
seat0
|
||||
seat-pci-pci-0000_00_12_0
|
||||
|
||||
2 seats listed.
|
||||
|
||||
You can use "loginctl seat-status seat-pci-pci-0000_00_12_0" to list
|
||||
the devices attached to the seat.
|
||||
|
||||
Background info is here:
|
||||
http://www.freedesktop.org/wiki/Software/systemd/multiseat/
|
||||
|
||||
Enjoy!
|
||||
|
||||
--
|
||||
Gerd Hoffmann <kraxel@redhat.com>
|
|
@ -1,152 +0,0 @@
|
|||
################################################################
|
||||
#
|
||||
# qemu -M q35 creates a bare machine with just the very essential
|
||||
# chipset devices being present:
|
||||
#
|
||||
# 00.0 - Host bridge
|
||||
# 1f.0 - ISA bridge / LPC
|
||||
# 1f.2 - SATA (AHCI) controller
|
||||
# 1f.3 - SMBus controller
|
||||
#
|
||||
# This config file documents the other devices and how they are
|
||||
# created. You can simply use "-readconfig $thisfile" to create
|
||||
# them all. Here is an overview:
|
||||
#
|
||||
# 19.0 - Ethernet controller (not created, our e1000 emulation
|
||||
# doesn't emulate the ich9 device).
|
||||
# 1a.* - USB Controller #2 (ehci + uhci companions)
|
||||
# 1b.0 - HD Audio Controller
|
||||
# 1c.* - PCI Express Ports
|
||||
# 1d.* - USB Controller #1 (ehci + uhci companions,
|
||||
# "qemu -M q35 -usb" creates these too)
|
||||
# 1e.0 - PCI Bridge
|
||||
#
|
||||
|
||||
[device "ich9-ehci-2"]
|
||||
driver = "ich9-usb-ehci2"
|
||||
multifunction = "on"
|
||||
bus = "pcie.0"
|
||||
addr = "1a.7"
|
||||
|
||||
[device "ich9-uhci-4"]
|
||||
driver = "ich9-usb-uhci4"
|
||||
multifunction = "on"
|
||||
bus = "pcie.0"
|
||||
addr = "1a.0"
|
||||
masterbus = "ich9-ehci-2.0"
|
||||
firstport = "0"
|
||||
|
||||
[device "ich9-uhci-5"]
|
||||
driver = "ich9-usb-uhci5"
|
||||
multifunction = "on"
|
||||
bus = "pcie.0"
|
||||
addr = "1a.1"
|
||||
masterbus = "ich9-ehci-2.0"
|
||||
firstport = "2"
|
||||
|
||||
[device "ich9-uhci-6"]
|
||||
driver = "ich9-usb-uhci6"
|
||||
multifunction = "on"
|
||||
bus = "pcie.0"
|
||||
addr = "1a.2"
|
||||
masterbus = "ich9-ehci-2.0"
|
||||
firstport = "4"
|
||||
|
||||
|
||||
[device "ich9-hda-audio"]
|
||||
driver = "ich9-intel-hda"
|
||||
bus = "pcie.0"
|
||||
addr = "1b.0"
|
||||
|
||||
|
||||
[device "ich9-pcie-port-1"]
|
||||
driver = "ioh3420"
|
||||
multifunction = "on"
|
||||
bus = "pcie.0"
|
||||
addr = "1c.0"
|
||||
port = "1"
|
||||
chassis = "1"
|
||||
|
||||
[device "ich9-pcie-port-2"]
|
||||
driver = "ioh3420"
|
||||
multifunction = "on"
|
||||
bus = "pcie.0"
|
||||
addr = "1c.1"
|
||||
port = "2"
|
||||
chassis = "2"
|
||||
|
||||
[device "ich9-pcie-port-3"]
|
||||
driver = "ioh3420"
|
||||
multifunction = "on"
|
||||
bus = "pcie.0"
|
||||
addr = "1c.2"
|
||||
port = "3"
|
||||
chassis = "3"
|
||||
|
||||
[device "ich9-pcie-port-4"]
|
||||
driver = "ioh3420"
|
||||
multifunction = "on"
|
||||
bus = "pcie.0"
|
||||
addr = "1c.3"
|
||||
port = "4"
|
||||
chassis = "4"
|
||||
|
||||
##
|
||||
# Example PCIe switch with two downstream ports
|
||||
#
|
||||
#[device "pcie-switch-upstream-port-1"]
|
||||
# driver = "x3130-upstream"
|
||||
# bus = "ich9-pcie-port-4"
|
||||
# addr = "00.0"
|
||||
#
|
||||
#[device "pcie-switch-downstream-port-1-1"]
|
||||
# driver = "xio3130-downstream"
|
||||
# multifunction = "on"
|
||||
# bus = "pcie-switch-upstream-port-1"
|
||||
# addr = "00.0"
|
||||
# port = "1"
|
||||
# chassis = "5"
|
||||
#
|
||||
#[device "pcie-switch-downstream-port-1-2"]
|
||||
# driver = "xio3130-downstream"
|
||||
# multifunction = "on"
|
||||
# bus = "pcie-switch-upstream-port-1"
|
||||
# addr = "00.1"
|
||||
# port = "1"
|
||||
# chassis = "6"
|
||||
|
||||
[device "ich9-ehci-1"]
|
||||
driver = "ich9-usb-ehci1"
|
||||
multifunction = "on"
|
||||
bus = "pcie.0"
|
||||
addr = "1d.7"
|
||||
|
||||
[device "ich9-uhci-1"]
|
||||
driver = "ich9-usb-uhci1"
|
||||
multifunction = "on"
|
||||
bus = "pcie.0"
|
||||
addr = "1d.0"
|
||||
masterbus = "ich9-ehci-1.0"
|
||||
firstport = "0"
|
||||
|
||||
[device "ich9-uhci-2"]
|
||||
driver = "ich9-usb-uhci2"
|
||||
multifunction = "on"
|
||||
bus = "pcie.0"
|
||||
addr = "1d.1"
|
||||
masterbus = "ich9-ehci-1.0"
|
||||
firstport = "2"
|
||||
|
||||
[device "ich9-uhci-3"]
|
||||
driver = "ich9-usb-uhci3"
|
||||
multifunction = "on"
|
||||
bus = "pcie.0"
|
||||
addr = "1d.2"
|
||||
masterbus = "ich9-ehci-1.0"
|
||||
firstport = "4"
|
||||
|
||||
|
||||
[device "ich9-pci-bridge"]
|
||||
driver = "i82801b11-bridge"
|
||||
bus = "pcie.0"
|
||||
addr = "1e.0"
|
|
@ -1,590 +0,0 @@
|
|||
= How to use the QAPI code generator =
|
||||
|
||||
QAPI is a native C API within QEMU which provides management-level
|
||||
functionality to internal/external users. For external
|
||||
users/processes, this interface is made available by a JSON-based
|
||||
QEMU Monitor protocol that is provided by the QMP server.
|
||||
|
||||
To map QMP-defined interfaces to the native C QAPI implementations,
|
||||
a JSON-based schema is used to define types and function
|
||||
signatures, and a set of scripts is used to generate types/signatures,
|
||||
and marshaling/dispatch code. The QEMU Guest Agent also uses these
|
||||
scripts, paired with a separate schema, to generate
|
||||
marshaling/dispatch code for the guest agent server running in the
|
||||
guest.
|
||||
|
||||
This document will describe how the schemas, scripts, and resulting
|
||||
code are used.
|
||||
|
||||
|
||||
== QMP/Guest agent schema ==
|
||||
|
||||
This file defines the types, commands, and events used by QMP. It should
|
||||
fully describe the interface used by QMP.
|
||||
|
||||
This file is designed to be loosely based on JSON although it's technically
|
||||
executable Python. While dictionaries are used, they are parsed as
|
||||
OrderedDicts so that ordering is preserved.
|
||||
|
||||
There are two basic syntaxes used, type definitions and command definitions.
|
||||
|
||||
The first syntax defines a type and is represented by a dictionary. There are
|
||||
three kinds of user-defined types that are supported: complex types,
|
||||
enumeration types and union types.
|
||||
|
||||
Generally speaking, type definitions should always use CamelCase for the type
|
||||
names. Command names should be all lower case with words separated by a hyphen.
|
||||
|
||||
|
||||
=== Includes ===
|
||||
|
||||
The QAPI schema definitions can be modularized using the 'include' directive:
|
||||
|
||||
{ 'include': 'path/to/file.json'}
|
||||
|
||||
The directive is evaluated recursively, and include paths are relative to the
|
||||
file using the directive. Multiple includes of the same file are safe.
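
The include rules can be sketched in a few lines of Python. This is not the parser from the QAPI scripts: load_schema() is a hypothetical function, and it assumes one expression per line to keep the example short. It shows recursive evaluation, paths relative to the including file, and why including the same file twice is harmless.

```python
import ast
import os
import tempfile


def load_schema(path, seen=None):
    """Return all expressions from path, following 'include' directives."""
    seen = set() if seen is None else seen
    real = os.path.realpath(path)
    if real in seen:
        # A file that was already included contributes nothing new.
        return []
    seen.add(real)
    exprs = []
    with open(real) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            # Simplification: each line is one Python literal.
            expr = ast.literal_eval(line)
            if 'include' in expr:
                # Include paths are relative to the including file.
                child = os.path.join(os.path.dirname(real), expr['include'])
                exprs += load_schema(child, seen)
            else:
                exprs.append(expr)
    return exprs


with tempfile.TemporaryDirectory() as d:
    os.mkdir(os.path.join(d, 'sub'))
    with open(os.path.join(d, 'sub', 'common.json'), 'w') as f:
        f.write("{ 'enum': 'MyEnum', 'data': [ 'value1', 'value2' ] }\n")
    with open(os.path.join(d, 'top.json'), 'w') as f:
        f.write("{ 'include': 'sub/common.json' }\n")
        f.write("{ 'include': 'sub/common.json' }\n")
        f.write("{ 'type': 'MyType', 'data': { 'member1': 'MyEnum' } }\n")
    loaded = load_schema(os.path.join(d, 'top.json'))
    names = [e.get('enum') or e.get('type') for e in loaded]

print(names)
```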
|
||||
|
||||
|
||||
=== Complex types ===
|
||||
|
||||
A complex type is a dictionary containing a single key whose value is a
|
||||
dictionary. This corresponds to a struct in C or an Object in JSON. An
|
||||
example of a complex type is:
|
||||
|
||||
{ 'type': 'MyType',
|
||||
'data': { 'member1': 'str', 'member2': 'int', '*member3': 'str' } }
|
||||
|
||||
The use of '*' as a prefix to the name means the member is optional.
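
Because the schema syntax is also valid Python, the '*' convention is easy to inspect. The following sketch is an illustration only; check() is a hypothetical helper and not part of the QAPI scripts. It separates mandatory from optional members and tests whether an object supplies the right ones.

```python
import ast

schema_text = """
{ 'type': 'MyType',
  'data': { 'member1': 'str', 'member2': 'int', '*member3': 'str' } }
"""

# The schema expression is a Python literal, so literal_eval can read it.
expr = ast.literal_eval(schema_text.strip())

# A leading '*' marks an optional member; the '*' is not part of the name.
required = [m for m in expr['data'] if not m.startswith('*')]
optional = [m[1:] for m in expr['data'] if m.startswith('*')]


def check(obj):
    """Return True if obj has every mandatory member and no unknown ones."""
    known = set(required) | set(optional)
    return set(required) <= set(obj) <= known


print(required, optional)
print(check({'member1': 'a', 'member2': 1}))
print(check({'member1': 'a'}))
```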
|
||||
|
||||
The default initialization value of an optional argument should not be changed
|
||||
between versions of QEMU unless the new default maintains backward
|
||||
compatibility to the user-visible behavior of the old default.
|
||||
|
||||
With proper documentation, this policy still allows some flexibility; for
|
||||
example, documenting that a default of 0 picks an optimal buffer size allows
|
||||
one release to declare the optimal size at 512 while another release declares
|
||||
the optimal size at 4096 - the user-visible behavior is not the bytes used by
|
||||
the buffer, but the fact that the buffer was optimal size.
|
||||
|
||||
On input structures (only mentioned in the 'data' side of a command), changing
|
||||
from mandatory to optional is safe (older clients will supply the option, and
|
||||
newer clients can benefit from the default); changing from optional to
|
||||
mandatory is backwards incompatible (older clients may be omitting the option,
|
||||
and must continue to work).
|
||||
|
||||
On output structures (only mentioned in the 'returns' side of a command),
|
||||
changing from mandatory to optional is in general unsafe (older clients may be
|
||||
expecting the field, and could crash if it is missing), although it can be done
|
||||
if the only way that the optional argument will be omitted is when it is
|
||||
triggered by the presence of a new input flag to the command that older clients
|
||||
don't know to send. Changing from optional to mandatory is safe.
|
||||
|
||||
A structure that is used in both input and output of various commands
|
||||
must consider the backwards compatibility constraints of both directions
|
||||
of use.
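
The compatibility rules above fit in a small lookup table. The sketch below only restates them; SAFE and change_is_safe() are hypothetical names, and the table ignores the exception for output members that are omitted only in response to a new input flag.

```python
# (direction, old state, new state) -> is the change backwards compatible?
SAFE = {
    ('input', 'mandatory', 'optional'): True,
    ('input', 'optional', 'mandatory'): False,
    ('output', 'mandatory', 'optional'): False,   # unsafe in general
    ('output', 'optional', 'mandatory'): True,
}


def change_is_safe(directions, old, new):
    """A change is safe only if it is safe in every direction of use."""
    return all(SAFE[(d, old, new)] for d in directions)


print(change_is_safe(['input'], 'mandatory', 'optional'))
print(change_is_safe(['input', 'output'], 'mandatory', 'optional'))
print(change_is_safe(['input', 'output'], 'optional', 'mandatory'))
```

For a structure used in both directions, neither change is safe, which is the point of the last paragraph above.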
|
||||
|
||||
A complex type definition can specify another complex type as its base.
|
||||
In this case, the fields of the base type are included as top-level fields
|
||||
of the new complex type's dictionary in the QMP wire format. An example
|
||||
definition is:
|
||||
|
||||
{ 'type': 'BlockdevOptionsGenericFormat', 'data': { 'file': 'str' } }
|
||||
{ 'type': 'BlockdevOptionsGenericCOWFormat',
|
||||
'base': 'BlockdevOptionsGenericFormat',
|
||||
'data': { '*backing': 'str' } }
|
||||
|
||||
An example BlockdevOptionsGenericCOWFormat object on the wire could use
|
||||
both fields like this:
|
||||
|
||||
{ "file": "/some/place/my-image",
|
||||
"backing": "/some/place/my-backing-file" }
|
||||
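
How base members end up at the top level can be sketched as follows. The dictionary layout and wire_members() are assumptions made for this example and do not mirror the generator's internal data structures.

```python
types = {
    'BlockdevOptionsGenericFormat': {'data': {'file': 'str'}},
    'BlockdevOptionsGenericCOWFormat': {
        'base': 'BlockdevOptionsGenericFormat',
        'data': {'*backing': 'str'},
    },
}


def wire_members(name):
    """Return the top-level members of a type, base members first."""
    t = types[name]
    members = {}
    if 'base' in t:
        # Base members are merged in, not nested under a separate key.
        members.update(wire_members(t['base']))
    members.update(t['data'])
    return members


print(list(wire_members('BlockdevOptionsGenericCOWFormat')))
```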
|
||||
=== Enumeration types ===
|
||||
|
||||
An enumeration type is a dictionary containing a single key whose value is a
|
||||
list of strings. An example enumeration is:
|
||||
|
||||
{ 'enum': 'MyEnum', 'data': [ 'value1', 'value2', 'value3' ] }
|
||||
|
||||
=== Union types ===
|
||||
|
||||
Union types are used to let the user choose between several different data
|
||||
types. A union type is defined using a dictionary as explained in the
|
||||
following paragraphs.
|
||||
|
||||
|
||||
A simple union type defines a mapping from discriminator values to data types
|
||||
like in this example:
|
||||
|
||||
{ 'type': 'FileOptions', 'data': { 'filename': 'str' } }
|
||||
{ 'type': 'Qcow2Options',
|
||||
'data': { 'backing-file': 'str', 'lazy-refcounts': 'bool' } }
|
||||
|
||||
{ 'union': 'BlockdevOptions',
|
||||
'data': { 'file': 'FileOptions',
|
||||
'qcow2': 'Qcow2Options' } }
|
||||
|
||||
In the QMP wire format, a simple union is represented by a dictionary that
|
||||
contains the 'type' field as a discriminator, and a 'data' field that is of the
|
||||
specified data type corresponding to the discriminator value:
|
||||
|
||||
{ "type": "qcow2", "data" : { "backing-file": "/some/place/my-image",
|
||||
"lazy-refcounts": true } }
|
||||
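
A receiver of the simple union shown above could select the branch like this. It is a sketch for a union without a base type; branch_type() is a hypothetical helper.

```python
# Discriminator value -> data type, as in the BlockdevOptions example.
union = {'file': 'FileOptions', 'qcow2': 'Qcow2Options'}


def branch_type(obj):
    """Return the data type name selected by a simple-union wire object."""
    if set(obj) != {'type', 'data'}:
        raise ValueError("expected exactly the members 'type' and 'data'")
    if obj['type'] not in union:
        raise ValueError('unknown discriminator value: %s' % obj['type'])
    return union[obj['type']]


wire = {"type": "qcow2",
        "data": {"backing-file": "/some/place/my-image",
                 "lazy-refcounts": True}}
print(branch_type(wire))
```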
|
||||
|
||||
A union definition can specify a complex type as its base. In this case, the
|
||||
fields of the complex type are included as top-level fields of the union
|
||||
dictionary in the QMP wire format. An example definition is:
|
||||
|
||||
{ 'type': 'BlockdevCommonOptions', 'data': { 'readonly': 'bool' } }
|
||||
{ 'union': 'BlockdevOptions',
|
||||
'base': 'BlockdevCommonOptions',
|
||||
'data': { 'raw': 'RawOptions',
|
||||
'qcow2': 'Qcow2Options' } }
|
||||
|
||||
And it looks like this on the wire:
|
||||
|
||||
{ "type": "qcow2",
|
||||
"readonly": false,
|
||||
"data" : { "backing-file": "/some/place/my-image",
|
||||
"lazy-refcounts": true } }
|
||||
|
||||
|
||||
Flat union types avoid the nesting on the wire. They are used whenever a
|
||||
specific field of the base type is declared as the discriminator ('type' is
|
||||
then no longer generated). The discriminator must be of enumeration type.
|
||||
The above example can then be modified as follows:
|
||||
|
||||
{ 'enum': 'BlockdevDriver', 'data': [ 'raw', 'qcow2' ] }
|
||||
{ 'type': 'BlockdevCommonOptions',
|
||||
'data': { 'driver': 'BlockdevDriver', 'readonly': 'bool' } }
|
||||
{ 'union': 'BlockdevOptions',
|
||||
'base': 'BlockdevCommonOptions',
|
||||
'discriminator': 'driver',
|
||||
'data': { 'raw': 'RawOptions',
|
||||
'qcow2': 'Qcow2Options' } }
|
||||
|
||||
Resulting in this JSON object:
|
||||
|
||||
{ "driver": "qcow2",
|
||||
"readonly": false,
|
||||
"backing-file": "/some/place/my-image",
|
||||
"lazy-refcounts": true }
|
||||
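
The difference between the two encodings is easiest to see by converting one into the other. The function flatten() below is hypothetical and exists only for illustration; QEMU does not perform this conversion.

```python
def flatten(nested, discriminator):
    """Rewrite a simple-union object (with base fields) in flat-union form."""
    # The discriminator field of the base type replaces 'type'.
    flat = {discriminator: nested['type']}
    # Base fields are already at the top level and stay there.
    flat.update({k: v for k, v in nested.items() if k not in ('type', 'data')})
    # Branch members move up from 'data' to the top level.
    flat.update(nested['data'])
    return flat


nested = {"type": "qcow2",
          "readonly": False,
          "data": {"backing-file": "/some/place/my-image",
                   "lazy-refcounts": True}}
flat = flatten(nested, 'driver')
print(flat)
```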
|
||||
|
||||
Anonymous unions are a special kind of union. They don't form a dictionary in
|
||||
the wire format but allow the direct use of different types in their place. As
|
||||
they aren't structured, they don't have any explicit discriminator but use
|
||||
the (QObject) data type of their value as an implicit discriminator. This means
|
||||
that they are restricted to using only one discriminator value per QObject
|
||||
type. For example, you cannot have two different complex types in an anonymous
|
||||
union, or two different integer types.
|
||||
|
||||
Anonymous unions are declared using an empty dictionary as their discriminator.
|
||||
The discriminator values never appear on the wire; they are only used in the
|
||||
generated C code. Anonymous unions cannot have a base type.
|
||||
|
||||
{ 'union': 'BlockRef',
|
||||
'discriminator': {},
|
||||
'data': { 'definition': 'BlockdevOptions',
|
||||
'reference': 'str' } }
|
||||
|
||||
This example allows using both of the following example objects:
|
||||
|
||||
{ "file": "my_existing_block_device_id" }
|
||||
{ "file": { "driver": "file",
|
||||
"readonly": false,
|
||||
"filename": "/tmp/mydisk.qcow2" } }
|
||||
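
The implicit discriminator can be sketched by dispatching on the type of the value. Python's dict and str stand in for the corresponding QObject types here, which is an analogy; pick_branch() is a hypothetical helper.

```python
def pick_branch(value):
    """Return the BlockRef branch selected by the type of the value."""
    if isinstance(value, dict):
        return 'definition'   # a dictionary: BlockdevOptions
    if isinstance(value, str):
        return 'reference'    # a string: the id of an existing device
    raise TypeError('no branch for %s' % type(value).__name__)


print(pick_branch("my_existing_block_device_id"))
print(pick_branch({"driver": "file", "readonly": False,
                   "filename": "/tmp/mydisk.qcow2"}))
```

Two branches of the same kind (for example two dictionaries) could not be told apart this way, which is the restriction described above.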
|
||||
|
||||
=== Commands ===
|
||||
|
||||
Commands are defined by using a dictionary containing three members. The first
|
||||
member is the command name, the second member is a dictionary containing
|
||||
arguments, and the third member is the return type.
|
||||
|
||||
An example command is:
|
||||
|
||||
{ 'command': 'my-command',
|
||||
'data': { 'arg1': 'str', '*arg2': 'str' },
|
||||
'returns': 'str' }
|
||||
|
||||
=== Events ===
|
||||
|
||||
Events are defined with the keyword 'event'. When 'data' is also specified,
|
||||
additional info will be included in the event. Finally, a C API is
generated in qapi-event.h; when it is called by QEMU code, a message with a
timestamp is emitted on the wire. A timestamp of -1 means that retrieving
the host time failed.
|
||||
|
||||
An example event is:
|
||||
|
||||
{ 'event': 'EVENT_C',
|
||||
'data': { '*a': 'int', 'b': 'str' } }
|
||||
|
||||
Resulting in this JSON object:
|
||||
|
||||
{ "event": "EVENT_C",
|
||||
"data": { "b": "test string" },
|
||||
"timestamp": { "seconds": 1267020223, "microseconds": 435656 } }
|
||||
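
The shape of an event message can be reproduced with a short function. This is a sketch and not the generated C code; build_event() and its now_us parameter are hypothetical.

```python
import time


def build_event(name, data=None, now_us=None):
    """Build an event message; now_us is the time in microseconds."""
    if now_us is None:
        now_us = time.time_ns() // 1000
    seconds, microseconds = divmod(now_us, 1000000)
    event = {'event': name}
    if data:
        # Optional members that were not supplied are simply absent.
        event['data'] = data
    event['timestamp'] = {'seconds': seconds, 'microseconds': microseconds}
    return event


msg = build_event('EVENT_C', {'b': 'test string'}, now_us=1267020223435656)
print(msg)
```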
|
||||
|
||||
== Code generation ==
|
||||
|
||||
Schemas are fed into 4 scripts to generate all the code/files that, paired
|
||||
with the core QAPI libraries, comprise everything required to take JSON
|
||||
commands read in by a QMP/guest agent server, unmarshal the arguments into
|
||||
the underlying C types, call into the corresponding C function, and map the
|
||||
response back to a QMP/guest agent response to be returned to the user.
|
||||
|
||||
As an example, we'll use the following schema, which describes a single
|
||||
complex user-defined type (which will produce a C struct, along with a list
|
||||
node structure that can be used to chain together a list of such types in
|
||||
case we want to accept/return a list of this type with a command), and a
|
||||
command which takes that type as a parameter and returns the same type:
|
||||
|
||||
$ cat example-schema.json
|
||||
{ 'type': 'UserDefOne',
|
||||
'data': { 'integer': 'int', 'string': 'str' } }
|
||||
|
||||
{ 'command': 'my-command',
|
||||
'data': {'arg1': 'UserDefOne'},
|
||||
'returns': 'UserDefOne' }
|
||||
|
||||
{ 'event': 'MY_EVENT' }
|
||||
|
||||
=== scripts/qapi-types.py ===
|
||||
|
||||
Used to generate the C types defined by a schema. The following files are
|
||||
created:
|
||||
|
||||
$(prefix)qapi-types.h - C types corresponding to types defined in
|
||||
the schema you pass in
|
||||
$(prefix)qapi-types.c - Cleanup functions for the above C types
|
||||
|
||||
The $(prefix) is an optional parameter used as a namespace to keep the
|
||||
generated code from one schema/code-generation separated from others so code
|
||||
can be generated/used from multiple schemas without clobbering previously
|
||||
created code.
|
||||
|
||||
Example:
|
||||
|
||||
$ python scripts/qapi-types.py --output-dir="qapi-generated" \
|
||||
--prefix="example-" --input-file=example-schema.json
|
||||
$ cat qapi-generated/example-qapi-types.c
|
||||
[Uninteresting stuff omitted...]
|
||||
|
||||
void qapi_free_UserDefOneList(UserDefOneList *obj)
|
||||
{
|
||||
QapiDeallocVisitor *md;
|
||||
Visitor *v;
|
||||
|
||||
if (!obj) {
|
||||
return;
|
||||
}
|
||||
|
||||
md = qapi_dealloc_visitor_new();
|
||||
v = qapi_dealloc_get_visitor(md);
|
||||
visit_type_UserDefOneList(v, &obj, NULL, NULL);
|
||||
qapi_dealloc_visitor_cleanup(md);
|
||||
}
|
||||
|
||||
void qapi_free_UserDefOne(UserDefOne *obj)
|
||||
{
|
||||
QapiDeallocVisitor *md;
|
||||
Visitor *v;
|
||||
|
||||
if (!obj) {
|
||||
return;
|
||||
}
|
||||
|
||||
md = qapi_dealloc_visitor_new();
|
||||
v = qapi_dealloc_get_visitor(md);
|
||||
visit_type_UserDefOne(v, &obj, NULL, NULL);
|
||||
qapi_dealloc_visitor_cleanup(md);
|
||||
}
|
||||
|
||||
$ cat qapi-generated/example-qapi-types.h
|
||||
[Uninteresting stuff omitted...]
|
||||
|
||||
#ifndef EXAMPLE_QAPI_TYPES_H
|
||||
#define EXAMPLE_QAPI_TYPES_H
|
||||
|
||||
[Builtin types omitted...]
|
||||
|
||||
typedef struct UserDefOne UserDefOne;
|
||||
|
||||
typedef struct UserDefOneList
|
||||
{
|
||||
union {
|
||||
UserDefOne *value;
|
||||
uint64_t padding;
|
||||
};
|
||||
struct UserDefOneList *next;
|
||||
} UserDefOneList;
|
||||
|
||||
[Functions on builtin types omitted...]
|
||||
|
||||
struct UserDefOne
|
||||
{
|
||||
int64_t integer;
|
||||
char *string;
|
||||
};
|
||||
|
||||
void qapi_free_UserDefOneList(UserDefOneList *obj);
|
||||
void qapi_free_UserDefOne(UserDefOne *obj);
|
||||
|
||||
#endif
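
To make the mapping from schema to C struct concrete, here is a toy generator. It is not scripts/qapi-types.py: gen_struct() and the C_TYPES table are assumptions made for this sketch, and only the two builtin types used by the example schema are handled.

```python
# Schema type -> C declaration prefix (only what the example needs).
C_TYPES = {'int': 'int64_t ', 'str': 'char *'}


def gen_struct(name, members):
    """Return C source for a struct with one field per schema member."""
    lines = ['struct %s' % name, '{']
    for member, typ in members.items():
        lines.append('    %s%s;' % (C_TYPES[typ], member))
    lines.append('};')
    return '\n'.join(lines)


out = gen_struct('UserDefOne', {'integer': 'int', 'string': 'str'})
print(out)
```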
|
||||
|
||||
=== scripts/qapi-visit.py ===
|
||||
|
||||
Used to generate the visitor functions used to walk through and convert
|
||||
a QObject (as provided by QMP) to a native C data structure and
|
||||
vice-versa, as well as the visitor function used to dealloc a complex
|
||||
schema-defined C type.
|
||||
|
||||
The following files are generated:
|
||||
|
||||
$(prefix)qapi-visit.c: visitor function for a particular C type, used
|
||||
to automagically convert QObjects into the
|
||||
corresponding C type and vice-versa, as well
|
||||
as for deallocating memory for an existing C
|
||||
type
|
||||
|
||||
$(prefix)qapi-visit.h: declarations for previously mentioned visitor
|
||||
functions
|
||||
|
||||
Example:
|
||||
|
||||
$ python scripts/qapi-visit.py --output-dir="qapi-generated" \
|
||||
--prefix="example-" --input-file=example-schema.json
|
||||
$ cat qapi-generated/example-qapi-visit.c
|
||||
[Uninteresting stuff omitted...]
|
||||
|
||||
static void visit_type_UserDefOne_fields(Visitor *m, UserDefOne **obj, Error **errp)
|
||||
{
|
||||
Error *err = NULL;
|
||||
visit_type_int(m, &(*obj)->integer, "integer", &err);
|
||||
if (err) {
|
||||
goto out;
|
||||
}
|
||||
visit_type_str(m, &(*obj)->string, "string", &err);
|
||||
if (err) {
|
||||
goto out;
|
||||
}
|
||||
|
||||
out:
|
||||
error_propagate(errp, err);
|
||||
}
|
||||
|
||||
void visit_type_UserDefOne(Visitor *m, UserDefOne **obj, const char *name, Error **errp)
|
||||
{
|
||||
Error *err = NULL;
|
||||
|
||||
visit_start_struct(m, (void **)obj, "UserDefOne", name, sizeof(UserDefOne), &err);
|
||||
if (!err) {
|
||||
if (*obj) {
|
||||
visit_type_UserDefOne_fields(m, obj, errp);
|
||||
}
|
||||
visit_end_struct(m, &err);
|
||||
}
|
||||
error_propagate(errp, err);
|
||||
}
|
||||
|
||||
void visit_type_UserDefOneList(Visitor *m, UserDefOneList **obj, const char *name, Error **errp)
|
||||
{
|
||||
Error *err = NULL;
|
||||
GenericList *i, **prev;
|
||||
|
||||
visit_start_list(m, name, &err);
|
||||
if (err) {
|
||||
goto out;
|
||||
}
|
||||
|
||||
for (prev = (GenericList **)obj;
|
||||
!err && (i = visit_next_list(m, prev, &err)) != NULL;
|
||||
prev = &i) {
|
||||
UserDefOneList *native_i = (UserDefOneList *)i;
|
||||
visit_type_UserDefOne(m, &native_i->value, NULL, &err);
|
||||
}
|
||||
|
||||
error_propagate(errp, err);
|
||||
err = NULL;
|
||||
visit_end_list(m, &err);
|
||||
out:
|
||||
error_propagate(errp, err);
|
||||
}
|
||||
$ python scripts/qapi-visit.py --output-dir="qapi-generated" \
|
||||
--prefix="example-" --input-file=example-schema.json
|
||||
$ cat qapi-generated/example-qapi-visit.h
|
||||
[Uninteresting stuff omitted...]
|
||||
|
||||
#ifndef EXAMPLE_QAPI_VISIT_H
|
||||
#define EXAMPLE_QAPI_VISIT_H
|
||||
|
||||
[Visitors for builtin types omitted...]
|
||||
|
||||
void visit_type_UserDefOne(Visitor *m, UserDefOne **obj, const char *name, Error **errp);
|
||||
void visit_type_UserDefOneList(Visitor *m, UserDefOneList **obj, const char *name, Error **errp);
|
||||
|
||||
#endif
|
||||
|
||||
=== scripts/qapi-commands.py ===
|
||||
|
||||
Used to generate the marshaling/dispatch functions for the commands defined
|
||||
in the schema. The following files are generated:
|
||||
|
||||
$(prefix)qmp-marshal.c: command marshal/dispatch functions for each
|
||||
QMP command defined in the schema. Functions
|
||||
generated by qapi-visit.py are used to
|
||||
convert QObjects received from the wire into
|
||||
function parameters, and the same visitor
functions are used to convert native C return
values to QObjects for transmission back
|
||||
over the wire.
|
||||
|
||||
$(prefix)qmp-commands.h: Function prototypes for the QMP commands
|
||||
specified in the schema.
|
||||
|
||||
Example:
|
||||
|
||||
$ python scripts/qapi-commands.py --output-dir="qapi-generated" \
|
||||
--prefix="example-" --input-file=example-schema.json
|
||||
$ cat qapi-generated/example-qmp-marshal.c
|
||||
[Uninteresting stuff omitted...]
|
||||
|
||||
static void qmp_marshal_output_my_command(UserDefOne *ret_in, QObject **ret_out, Error **errp)
|
||||
{
|
||||
Error *local_err = NULL;
|
||||
QmpOutputVisitor *mo = qmp_output_visitor_new();
|
||||
QapiDeallocVisitor *md;
|
||||
Visitor *v;
|
||||
|
||||
v = qmp_output_get_visitor(mo);
|
||||
visit_type_UserDefOne(v, &ret_in, "unused", &local_err);
|
||||
if (local_err) {
|
||||
goto out;
|
||||
}
|
||||
*ret_out = qmp_output_get_qobject(mo);
|
||||
|
||||
out:
|
||||
error_propagate(errp, local_err);
|
||||
qmp_output_visitor_cleanup(mo);
|
||||
md = qapi_dealloc_visitor_new();
|
||||
v = qapi_dealloc_get_visitor(md);
|
||||
visit_type_UserDefOne(v, &ret_in, "unused", NULL);
|
||||
qapi_dealloc_visitor_cleanup(md);
|
||||
}
|
||||
|
||||
static void qmp_marshal_input_my_command(QDict *args, QObject **ret, Error **errp)
|
||||
{
|
||||
Error *local_err = NULL;
|
||||
UserDefOne *retval = NULL;
|
||||
QmpInputVisitor *mi = qmp_input_visitor_new_strict(QOBJECT(args));
|
||||
QapiDeallocVisitor *md;
|
||||
Visitor *v;
|
||||
UserDefOne *arg1 = NULL;
|
||||
|
||||
v = qmp_input_get_visitor(mi);
|
||||
visit_type_UserDefOne(v, &arg1, "arg1", &local_err);
|
||||
if (local_err) {
|
||||
goto out;
|
||||
}
|
||||
|
||||
retval = qmp_my_command(arg1, &local_err);
|
||||
if (local_err) {
|
||||
goto out;
|
||||
}
|
||||
|
||||
qmp_marshal_output_my_command(retval, ret, &local_err);
|
||||
|
||||
out:
|
||||
error_propagate(errp, local_err);
|
||||
qmp_input_visitor_cleanup(mi);
|
||||
md = qapi_dealloc_visitor_new();
|
||||
v = qapi_dealloc_get_visitor(md);
|
||||
visit_type_UserDefOne(v, &arg1, "arg1", NULL);
|
||||
qapi_dealloc_visitor_cleanup(md);
|
||||
return;
|
||||
}
|
||||
|
||||
static void qmp_init_marshal(void)
|
||||
{
|
||||
qmp_register_command("my-command", qmp_marshal_input_my_command, QCO_NO_OPTIONS);
|
||||
}
|
||||
|
||||
qapi_init(qmp_init_marshal);
|
||||
$ cat qapi-generated/example-qmp-commands.h
|
||||
[Uninteresting stuff omitted...]
|
||||
|
||||
#ifndef EXAMPLE_QMP_COMMANDS_H
|
||||
#define EXAMPLE_QMP_COMMANDS_H
|
||||
|
||||
#include "example-qapi-types.h"
|
||||
#include "qapi/qmp/qdict.h"
|
||||
#include "qapi/error.h"
|
||||
|
||||
UserDefOne *qmp_my_command(UserDefOne *arg1, Error **errp);
|
||||
|
||||
#endif
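
The marshal/dispatch flow can be summarized with a toy dispatcher. It is a sketch in Python, not the generated C: register(), dispatch() and the body of my_command() are hypothetical. The 'execute', 'arguments' and 'return' keys follow the QMP wire convention, which this document does not describe.

```python
import json

commands = {}


def register(name):
    """Decorator that plays the role of qmp_register_command()."""
    def deco(fn):
        commands[name] = fn
        return fn
    return deco


@register('my-command')
def my_command(arg1):
    # Stand-in for the handwritten implementation of the command.
    return {'integer': arg1['integer'] + 1, 'string': arg1['string']}


def dispatch(request_text):
    """Unmarshal the arguments, call the command, marshal the result."""
    request = json.loads(request_text)
    handler = commands[request['execute']]
    result = handler(**request.get('arguments', {}))
    return json.dumps({'return': result}, sort_keys=True)


reply = dispatch('{"execute": "my-command",'
                 ' "arguments": {"arg1": {"integer": 1, "string": "x"}}}')
print(reply)
```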
|
||||
|
||||
=== scripts/qapi-event.py ===
|
||||
|
||||
Used to generate the event-related C code defined by a schema. The
|
||||
following files are created:
|
||||
|
||||
$(prefix)qapi-event.h - Function prototypes for each event type, plus an
|
||||
enumeration of all event names
|
||||
$(prefix)qapi-event.c - Implementation of functions to send an event
|
||||
|
||||
Example:
|
||||
|
||||
$ python scripts/qapi-event.py --output-dir="qapi-generated" \
|
||||
--prefix="example-" --input-file=example-schema.json
|
||||
$ cat qapi-generated/example-qapi-event.c
|
||||
[Uninteresting stuff omitted...]
|
||||
|
||||
void qapi_event_send_my_event(Error **errp)
|
||||
{
|
||||
QDict *qmp;
|
||||
Error *local_err = NULL;
|
||||
QMPEventFuncEmit emit;
|
||||
emit = qmp_event_get_func_emit();
|
||||
if (!emit) {
|
||||
return;
|
||||
}
|
||||
|
||||
qmp = qmp_event_build_dict("MY_EVENT");
|
||||
|
||||
emit(EXAMPLE_QAPI_EVENT_MY_EVENT, qmp, &local_err);
|
||||
|
||||
error_propagate(errp, local_err);
|
||||
QDECREF(qmp);
|
||||
}
|
||||
|
||||
const char *EXAMPLE_QAPIEvent_lookup[] = {
|
||||
"MY_EVENT",
|
||||
NULL,
|
||||
};
|
||||
$ cat qapi-generated/example-qapi-event.h
|
||||
[Uninteresting stuff omitted...]
|
||||
|
||||
#ifndef EXAMPLE_QAPI_EVENT_H
|
||||
#define EXAMPLE_QAPI_EVENT_H
|
||||
|
||||
#include "qapi/error.h"
|
||||
#include "qapi/qmp/qdict.h"
|
||||
#include "example-qapi-types.h"
|
||||
|
||||
|
||||
void qapi_event_send_my_event(Error **errp);
|
||||
|
||||
extern const char *EXAMPLE_QAPIEvent_lookup[];
|
||||
typedef enum EXAMPLE_QAPIEvent
|
||||
{
|
||||
EXAMPLE_QAPI_EVENT_MY_EVENT = 0,
|
||||
EXAMPLE_QAPI_EVENT_MAX = 1,
|
||||
} EXAMPLE_QAPIEvent;
|
||||
|
||||
#endif
|
|
@ -1,416 +0,0 @@
|
|||
= How to convert to -device & friends =
|
||||
|
||||
=== Specifying Bus and Address on Bus ===
|
||||
|
||||
In qdev, each device has a parent bus. Some devices provide one or
|
||||
more buses for children. You can specify a device's parent bus with
|
||||
-device parameter bus.
|
||||
|
||||
A device typically has a device address on its parent bus. For buses
|
||||
where this address can be configured, devices provide a bus-specific
|
||||
property. Examples:
|
||||
|
||||
bus property name value format
|
||||
PCI addr %x.%x (dev.fn, .fn optional)
|
||||
I2C address %u
|
||||
SCSI scsi-id %u
|
||||
IDE unit %u
|
||||
HDA cad %u
|
||||
virtio-serial-bus nr %u
|
||||
ccid-bus slot %u
|
||||
USB port %d(.%d)* (port.port...)
|
||||
|
||||
Example: device i440FX-pcihost is on the root bus, and provides a PCI
|
||||
bus named pci.0. To put a FOO device into its slot 4, use -device
|
||||
FOO,bus=/i440FX-pcihost/pci.0,addr=4. The abbreviated form bus=pci.0
|
||||
also works as long as the bus name is unique.
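
The PCI value format from the table reads as hexadecimal device and function numbers. A minimal sketch of a parser, where parse_pci_addr() is a hypothetical name and not a QEMU function:

```python
def parse_pci_addr(text):
    """Parse '%x.%x' (dev.fn, '.fn' optional) into (device, function)."""
    dev, _, fn = text.partition('.')
    # Both parts are hexadecimal; a missing function defaults to 0.
    return (int(dev, 16), int(fn, 16) if fn else 0)


print(parse_pci_addr('4'))
print(parse_pci_addr('1f.2'))
```

So addr=4 in the example above means device 4, function 0.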
|
||||
|
||||
=== Block Devices ===
|
||||
|
||||
A QEMU block device (drive) has a host and a guest part.
|
||||
|
||||
In the general case, the guest device is connected to a controller
|
||||
device. For instance, the IDE controller provides two IDE buses, each
|
||||
of which can have up to two ide-drive devices, and each ide-drive
|
||||
device is a guest part, and is connected to a host part.
|
||||
|
||||
Except we sometimes lump controller, bus(es) and drive device(s) all
|
||||
together into a single device. For instance, the ISA floppy
|
||||
controller is connected to up to two host drives.
|
||||
|
||||
The old ways to define block devices define host and guest part
|
||||
together. Sometimes, they can even define a controller device in
|
||||
addition to the block device.
|
||||
|
||||
The new way keeps the parts separate: you create the host part with
|
||||
-drive, and guest device(s) with -device.
|
||||
|
||||
The various old ways to define drives all boil down to the common form
|
||||
|
||||
-drive if=TYPE,bus=BUS,unit=UNIT,OPTS...
|
||||
|
||||
TYPE, BUS and UNIT identify the controller device, which of its buses
|
||||
to use, and the drive's address on that bus. Details depend on TYPE.
|
||||
|
||||
Instead of bus=BUS,unit=UNIT, you can also say index=IDX.
|
||||
|
||||
In the new way, this becomes something like
|
||||
|
||||
-drive if=none,id=DRIVE-ID,HOST-OPTS...
|
||||
-device DEVNAME,drive=DRIVE-ID,DEV-OPTS...
|
||||
|
||||
The old OPTS get split into HOST-OPTS and DEV-OPTS as follows:
|
||||
|
||||
* file, format, snapshot, cache, aio, readonly, rerror, werror go into
|
||||
HOST-OPTS.
|
||||
|
||||
* cyls, head, secs and trans go into HOST-OPTS. Future work: they
|
||||
should go into DEV-OPTS instead.
|
||||
|
||||
* serial goes into DEV-OPTS, for devices supporting serial numbers.
|
||||
For other devices, it goes nowhere.
|
||||
|
||||
* media is special. In the old way, it selects disk vs. CD-ROM with
|
||||
if=ide, if=scsi and if=xen. The new way uses DEVNAME for that.
|
||||
Additionally, readonly=on goes into HOST-OPTS.
|
||||
|
||||
* addr is special, see if=virtio below.
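The split described in the list above could be sketched as follows. The function name split_drive_opts and its supports_serial flag are assumptions made for illustration; the option names are the ones listed above. media and addr are returned separately because they need case-by-case handling.

```python
# Option names taken from the list above; everything else is illustrative.
HOST_OPTS = {"file", "format", "snapshot", "cache", "aio", "readonly",
             "rerror", "werror", "cyls", "head", "secs", "trans"}


def split_drive_opts(opts, supports_serial=True):
    """Split old -drive OPTS (a dict) into (host_opts, dev_opts, special)."""
    host, dev, special = {}, {}, {}
    for name, value in opts.items():
        if name in HOST_OPTS:
            host[name] = value
        elif name == "serial":
            # serial goes into DEV-OPTS only for devices that support it;
            # for other devices it goes nowhere.
            if supports_serial:
                dev[name] = value
        elif name in ("media", "addr"):
            special[name] = value
        else:
            raise ValueError(f"option not covered by this sketch: {name}")
    return host, dev, special


host, dev, special = split_drive_opts(
    {"file": "disk.img", "cache": "none", "serial": "QM0001", "media": "disk"})
print(host, dev, special)
```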
|
||||
|
||||
The -device argument differs in detail for each type of drive:
|
||||
|
||||
* if=ide
|
||||
|
||||
-device DEVNAME,drive=DRIVE-ID,bus=IDE-BUS,unit=UNIT
|
||||
|
||||
where DEVNAME is either ide-hd or ide-cd, IDE-BUS identifies an IDE
|
||||
bus, normally either ide.0 or ide.1, and UNIT is either 0 or 1.
|
||||
|
||||
* if=scsi
|
||||
|
||||
The old way implicitly creates SCSI controllers as needed. The new
|
||||
way makes that explicit:
|
||||
|
||||
-device lsi53c895a,id=ID
|
||||
|
||||
As for all PCI devices, you can add bus=PCI-BUS,addr=DEVFN to
|
||||
control the PCI device address.
|
||||
|
||||
This SCSI controller provides a single SCSI bus, named ID.0. Put a
|
||||
disk on it:
|
||||
|
||||
-device DEVNAME,drive=DRIVE-ID,bus=ID.0,scsi-id=UNIT
|
||||
|
||||
where DEVNAME is either scsi-hd, scsi-cd or scsi-generic.
|
||||
|
||||
* if=floppy
|
||||
|
||||
-global isa-fdc.driveA=DRIVE-ID
|
||||
-global isa-fdc.driveB=DRIVE-ID
|
||||
|
||||
This is -global instead of -device, because the floppy controller is
|
||||
created automatically, and we want to configure that one, not create
|
||||
a second one (which isn't possible anyway).
|
||||
|
||||
Without any -global isa-fdc,... you get an empty driveA and no
|
||||
driveB. You can use -nodefaults to suppress the default driveA, see
|
||||
"Default Devices".
|
||||
|
||||
* if=virtio
|
||||
|
||||
-device virtio-blk-pci,drive=DRIVE-ID,class=C,vectors=V,ioeventfd=IOEVENTFD
|
||||
|
||||
This lets you control PCI device class and MSI-X vectors.
|
||||
|
||||
IOEVENTFD controls whether or not ioeventfd is used for virtqueue
|
||||
notify. It can be set to on (default) or off.
|
||||
|
||||
As for all PCI devices, you can add bus=PCI-BUS,addr=DEVFN to
|
||||
control the PCI device address. This replaces option addr available
|
||||
with -drive if=virtio.
|
||||
|
||||
* if=pflash, if=mtd, if=sd, if=xen are not yet available with -device
|
||||
|
||||
For USB devices, the old way is actually different:
|
||||
|
||||
-usbdevice disk:format=FMT:FILENAME
|
||||
|
||||
This provides much less control than -drive's OPTS... The new way fixes
|
||||
that:
|
||||
|
||||
-device usb-storage,drive=DRIVE-ID,removable=RMB
|
||||
|
||||
The removable parameter gives control over the SCSI INQUIRY removable
|
||||
(RMB) bit. USB thumbdrives usually set removable=on, while USB hard
|
||||
disks set removable=off.
|
||||
|
||||
Bug: usb-storage pretends to be a block device, but it's really a SCSI
|
||||
controller that can serve only a single device, which it creates
|
||||
automatically. The automatic creation guesses what kind of guest part
|
||||
to create from the host part, like -drive if=scsi. Host and guest
|
||||
part are not cleanly separated.
|
||||
|
||||
=== Character Devices ===
|
||||
|
||||
A QEMU character device has a host and a guest part.
|
||||
|
||||
The old ways to define character devices define host and guest part
|
||||
together.
|
||||
|
||||
The new way keeps the parts separate: you create the host part with
|
||||
-chardev, and the guest device with -device.
|
||||
|
||||
The various old ways to define a character device are all of the
|
||||
general form
|
||||
|
||||
-FOO FOO-OPTS...,LEGACY-CHARDEV
|
||||
|
||||
where FOO-OPTS... is specific to -FOO, and the host part
|
||||
LEGACY-CHARDEV is the same everywhere.
|
||||
|
||||
In the new way, this becomes
|
||||
|
||||
-chardev HOST-OPTS...,id=CHR-ID
|
||||
-device DEVNAME,chardev=CHR-ID,DEV-OPTS...
|
||||
|
||||
The appropriate DEVNAME depends on the machine type. For type "pc":
|
||||
|
||||
* -serial becomes -device isa-serial,iobase=IOADDR,irq=IRQ,index=IDX
|
||||
|
||||
This lets you control I/O ports and IRQs.
|
||||
|
||||
* -parallel becomes -device isa-parallel,iobase=IOADDR,irq=IRQ,index=IDX
|
||||
|
||||
This lets you control I/O ports and IRQs.
|
||||
|
||||
* -usbdevice serial:vendorid=VID,productid=PRID becomes
|
||||
-device usb-serial,vendorid=VID,productid=PRID
|
||||
|
||||
* -usbdevice braille doesn't support LEGACY-CHARDEV syntax. It always
|
||||
uses "braille". With -device, this useful default is gone, so you
|
||||
have to use something like
|
||||
|
||||
-device usb-braille,chardev=braille,vendorid=VID,productid=PRID
|
||||
-chardev braille,id=braille
|
||||
|
||||
* -virtioconsole becomes
|
||||
-device virtio-serial-pci,class=C,vectors=V,ioeventfd=IOEVENTFD,max_ports=N
|
||||
-device virtconsole,is_console=NUM,nr=NR,name=NAME
|
||||
|
||||
LEGACY-CHARDEV translates to -chardev HOST-OPTS... as follows:
|
||||
|
||||
* null becomes -chardev null
|
||||
|
||||
* pty, msmouse, braille, stdio likewise
|
||||
|
||||
* vc:WIDTHxHEIGHT becomes -chardev vc,width=WIDTH,height=HEIGHT
|
||||
|
||||
* vc:<COLS>Cx<ROWS>C becomes -chardev vc,cols=<COLS>,rows=<ROWS>
|
||||
|
||||
* con: becomes -chardev console
|
||||
|
||||
* COM<NUM> becomes -chardev serial,path=COM<NUM>
|
||||
|
||||
* file:FNAME becomes -chardev file,path=FNAME
|
||||
|
||||
* pipe:FNAME becomes -chardev pipe,path=FNAME
|
||||
|
||||
* tcp:HOST:PORT,OPTS... becomes -chardev socket,host=HOST,port=PORT,OPTS...
|
||||
|
||||
* telnet:HOST:PORT,OPTS... becomes
|
||||
-chardev socket,host=HOST,port=PORT,OPTS...,telnet=on
|
||||
|
||||
* udp:HOST:PORT@LOCALADDR:LOCALPORT becomes
|
||||
-chardev udp,host=HOST,port=PORT,localaddr=LOCALADDR,localport=LOCALPORT
|
||||
|
||||
* unix:FNAME becomes -chardev socket,path=FNAME
|
||||
|
||||
* /dev/parportN becomes -chardev parport,file=/dev/parportN
|
||||
|
||||
* /dev/ppiN likewise
|
||||
|
||||
* Any other /dev/FNAME becomes -chardev tty,path=/dev/FNAME
|
||||
|
||||
* mon:LEGACY-CHARDEV is special: it multiplexes the monitor onto the
|
||||
character device defined by LEGACY-CHARDEV. -chardev provides more
|
||||
general multiplexing instead: you can connect up to four users to a
|
||||
single host part. You need to pass mux=on to -chardev to enable
|
||||
switching the input focus.
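A few of the translations listed above can be sketched in code. This is only an illustration: legacy_to_chardev is a hypothetical helper, it covers a subset of the forms (the simple names, file:, pipe:, unix:, tcp: and telnet:), and it does not validate the option strings it passes through.

```python
SIMPLE = {"null", "pty", "msmouse", "braille", "stdio"}


def legacy_to_chardev(legacy, chr_id):
    """Translate a few LEGACY-CHARDEV forms into a -chardev option value."""
    if legacy in SIMPLE:
        opts = legacy
    elif legacy.startswith("file:"):
        opts = "file,path=" + legacy[len("file:"):]
    elif legacy.startswith("pipe:"):
        opts = "pipe,path=" + legacy[len("pipe:"):]
    elif legacy.startswith("unix:"):
        opts = "socket,path=" + legacy[len("unix:"):]
    elif legacy.startswith(("tcp:", "telnet:")):
        kind, rest = legacy.split(":", 1)
        # HOST:PORT comes first, optional OPTS follow the first comma.
        address, _, extra = rest.partition(",")
        host, _, port = address.rpartition(":")
        opts = f"socket,host={host},port={port}"
        if extra:
            opts += "," + extra
        if kind == "telnet":
            opts += ",telnet=on"
    else:
        raise ValueError(f"form not covered by this sketch: {legacy}")
    return f"{opts},id={chr_id}"


print("-chardev", legacy_to_chardev("tcp:localhost:4444,server,nowait", "mon1"))
```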
|
||||
|
||||
QEMU uses LEGACY-CHARDEV syntax not just to set up guest devices, but
|
||||
also in various other places such as -monitor or -net
|
||||
user,guestfwd=... You can use chardev:CHR-ID in place of
|
||||
LEGACY-CHARDEV to refer to a host part defined with -chardev.
|
||||
|
||||
=== Network Devices ===
|
||||
|
||||
Host and guest part of network devices have always been separate.
|
||||
|
||||
The old way to define the guest part looks like this:
|
||||
|
||||
-net nic,netdev=NET-ID,macaddr=MACADDR,model=MODEL,name=ID,addr=STR,vectors=V
|
||||
|
||||
Except for USB it looks like this:
|
||||
|
||||
-usbdevice net:netdev=NET-ID,macaddr=MACADDR,name=ID
|
||||
|
||||
The new way is -device:
|
||||
|
||||
-device DEVNAME,netdev=NET-ID,mac=MACADDR,DEV-OPTS...
|
||||
|
||||
DEVNAME equals MODEL, except for virtio you have to name the virtio
|
||||
device appropriate for the bus (virtio-net-pci for PCI), and for USB
|
||||
you have to use usb-net.
|
||||
|
||||
The old name=ID parameter becomes the usual id=ID with -device.
|
||||
|
||||
For PCI devices, you can add bus=PCI-BUS,addr=DEVFN to control the PCI
|
||||
device address, as usual. The old -net nic provides parameter addr
|
||||
for that, which is silently ignored when the NIC is not a PCI device.
|
||||
|
||||
For virtio-net-pci, you can control whether or not ioeventfd is used for
|
||||
virtqueue notify by setting ioeventfd= to on or off (default).
|
||||
|
||||
-net nic accepts vectors=V for all models, but it's silently ignored
|
||||
except for virtio-net-pci (model=virtio). With -device, only devices
|
||||
that support it accept it.
|
||||
|
||||
Not all devices are available with -device at this time. All PCI
|
||||
devices and ne2k_isa are.
|
||||
|
||||
Some PCI devices aren't available with -net nic, e.g. i82558a.
|
||||
|
||||
To connect to a VLAN instead of an ordinary host part, replace
|
||||
netdev=NET-ID by vlan=VLAN.
|
||||
|
||||
=== Graphics Devices ===
|
||||
|
||||
Host and guest part of graphics devices have always been separate.
|
||||
|
||||
The old way to define the guest graphics device is -vga VGA. Not all
|
||||
machines support all -vga options.
|
||||
|
||||
The new way is -device. The mapping from -vga argument to -device
|
||||
depends on the machine type. For machine "pc", it's:
|
||||
|
||||
std -device VGA
|
||||
cirrus -device cirrus-vga
|
||||
vmware -device vmware-svga
|
||||
qxl -device qxl-vga
|
||||
none -nodefaults
|
||||
disables more than just VGA, see "Default Devices"
|
||||
|
||||
As for all PCI devices, you can add bus=PCI-BUS,addr=DEVFN to control
|
||||
the PCI device address.
|
||||
|
||||
-device VGA supports properties bios-offset and bios-size, but they
|
||||
aren't used with machine type "pc".
|
||||
|
||||
For machine "isapc", it's
|
||||
|
||||
std -device isa-vga
|
||||
cirrus not yet available with -device
|
||||
none -nodefaults
|
||||
disables more than just VGA, see "Default Devices"
|
||||
|
||||
Bug: the new way doesn't work for machine types "pc" and "isapc",
|
||||
because it violates obscure device initialization ordering
|
||||
constraints.
|
||||
|
||||
=== Audio Devices ===
|
||||
|
||||
Host and guest part of audio devices have always been separate.
|
||||
|
||||
The old way to define guest audio devices is -soundhw C1,...
|
||||
|
||||
The new way is to define each guest audio device separately with
|
||||
-device.
|
||||
|
||||
Map from -soundhw sound card name to -device:
|
||||
|
||||
ac97 -device AC97
|
||||
cs4231a -device cs4231a,iobase=IOADDR,irq=IRQ,dma=DMA
|
||||
es1370 -device ES1370
|
||||
gus -device gus,iobase=IOADDR,irq=IRQ,dma=DMA,freq=F
|
||||
hda -device intel-hda,msi=MSI -device hda-duplex
|
||||
sb16 -device sb16,iobase=IOADDR,irq=IRQ,dma=DMA,dma16=DMA16,version=V
|
||||
adlib not yet available with -device
|
||||
pcspk not yet available with -device
|
||||
|
||||
For PCI devices, you can add bus=PCI-BUS,addr=DEVFN to control the PCI
|
||||
device address, as usual.
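The map above can be applied mechanically. The sketch below is illustrative only: soundhw_to_device_args is a hypothetical name, and the optional properties (iobase, irq, dma and so on) are left out so that the devices keep their defaults.

```python
# Mapping from the table above; optional properties such as iobase are omitted.
SOUNDHW_TO_DEVICE = {
    "ac97": ["AC97"],
    "cs4231a": ["cs4231a"],
    "es1370": ["ES1370"],
    "gus": ["gus"],
    "hda": ["intel-hda", "hda-duplex"],
    "sb16": ["sb16"],
}


def soundhw_to_device_args(soundhw):
    """Turn the value of -soundhw C1,... into a flat -device argument list."""
    args = []
    for card in soundhw.split(","):
        if card not in SOUNDHW_TO_DEVICE:
            # adlib and pcspk land here: not yet available with -device.
            raise ValueError(f"{card}: not available with -device")
        for devname in SOUNDHW_TO_DEVICE[card]:
            args += ["-device", devname]
    return args


print(soundhw_to_device_args("ac97,hda"))
```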
|
||||
|
||||
=== USB Devices ===
|
||||
|
||||
The old way to define a virtual USB device is -usbdevice DRIVER:OPTS...
|
||||
|
||||
The new way is -device DEVNAME,DEV-OPTS... Details depend on DRIVER:
|
||||
|
||||
* ccid -device usb-ccid
|
||||
* keyboard -device usb-kbd
|
||||
* mouse -device usb-mouse
|
||||
* tablet -device usb-tablet
|
||||
* wacom-tablet -device usb-wacom-tablet
|
||||
* host:... See "Host Device Assignment"
|
||||
* disk:... See "Block Devices"
|
||||
* serial:... See "Character Devices"
|
||||
* braille See "Character Devices"
|
||||
* net:... See "Network Devices"
|
||||
* bt:... not yet available with -device
|
||||
|
||||
=== Watchdog Devices ===
|
||||
|
||||
Host and guest part of watchdog devices have always been separate.
|
||||
|
||||
The old way to define a guest watchdog device is -watchdog DEVNAME.
|
||||
The new way is -device DEVNAME. For PCI devices, you can add
|
||||
bus=PCI-BUS,addr=DEVFN to control the PCI device address, as usual.
|
||||
|
||||
=== Host Device Assignment ===
|
||||
|
||||
QEMU supports assigning host PCI devices (qemu-kvm only at this time)
|
||||
and host USB devices.
|
||||
|
||||
The old way to assign a host PCI device is
|
||||
|
||||
-pcidevice host=ADDR,dma=none,id=ID
|
||||
|
||||
The new way is
|
||||
|
||||
-device pci-assign,host=ADDR,iommu=IOMMU,id=ID
|
||||
|
||||
The old dma=none becomes iommu=off with -device.
|
||||
|
||||
The old way to assign a host USB device is
|
||||
|
||||
-usbdevice host:auto:BUS.ADDR:VID:PRID
|
||||
|
||||
where any of BUS, ADDR, VID, PRID can be the wildcard *.
|
||||
|
||||
The new way is
|
||||
|
||||
-device usb-host,hostbus=BUS,hostaddr=ADDR,vendorid=VID,productid=PRID
|
||||
|
||||
Omitted options match anything, just like the old way's wildcard.
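The wildcard rule can be illustrated with a small translation sketch. usbdevice_host_to_device is a hypothetical helper; it assumes the old argument has exactly the host:auto:BUS.ADDR:VID:PRID shape shown above and copies the values through unchanged.

```python
def usbdevice_host_to_device(legacy):
    """Translate host:auto:BUS.ADDR:VID:PRID into a -device usb-host value."""
    prefix = "host:auto:"
    if not legacy.startswith(prefix):
        raise ValueError(f"form not covered by this sketch: {legacy}")
    location, vid, prid = legacy[len(prefix):].split(":")
    bus, addr = location.split(".")
    fields = [("hostbus", bus), ("hostaddr", addr),
              ("vendorid", vid), ("productid", prid)]
    # A wildcard in the old syntax corresponds to leaving the property out.
    props = [f"{name}={value}" for name, value in fields if value != "*"]
    return ",".join(["usb-host"] + props)


print("-device", usbdevice_host_to_device("host:auto:1.*:0x1234:*"))
```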
|
||||
|
||||
=== Default Devices ===
|
||||
|
||||
QEMU creates a number of devices by default, depending on the machine
|
||||
type.
|
||||
|
||||
-device DEVNAME... and -global DEVNAME... suppress default devices for
|
||||
some DEVNAMEs:
|
||||
|
||||
default device      suppressing DEVNAMEs
|
||||
CD-ROM              ide-cd, ide-drive, scsi-cd
|
||||
isa-fdc's driveA    isa-fdc
|
||||
parallel            isa-parallel
|
||||
serial              isa-serial
|
||||
VGA                 VGA, cirrus-vga, vmware-svga
|
||||
virtioconsole       virtio-serial-pci, virtio-serial-s390, virtio-serial
|
||||
|
||||
The default NIC is connected to a default part created along with it.
|
||||
It is *not* suppressed by configuring a NIC with -device (you may call
|
||||
that a bug). -net and -netdev suppress the default NIC.
|
||||
|
||||
-nodefaults suppresses all the default devices mentioned above, plus a
|
||||
few other things such as default SD-Card drive and default monitor.
|
|
@ -1,102 +0,0 @@
|
|||
; qemupciserial.inf for QEMU, based on MSPORTS.INF
|
||||
|
||||
; The driver itself is shipped with Windows (serial.sys). This is
|
||||
; just an inf file to tell windows which pci id the serial pci card
|
||||
; emulated by qemu has, and to apply a name tag to it which windows
|
||||
; will show in the device manager.
|
||||
|
||||
; Installing the driver: Go to device manager. You should find a "pci
|
||||
; serial card" tagged with a yellow question mark. Open properties.
|
||||
; Pick "update driver". Then "select driver manually". Pick "Ports
|
||||
; (Com+Lpt)" from the list. Click "Have a disk". Select this file.
|
||||
; Procedure may vary a bit depending on the windows version.
|
||||
|
||||
; This file covers all options: pci-serial, pci-serial-2x, pci-serial-4x
|
||||
; for both 32 and 64 bit platforms.
|
||||
|
||||
[Version]
|
||||
Signature="$Windows NT$"
|
||||
Class=MultiFunction
|
||||
ClassGUID={4d36e971-e325-11ce-bfc1-08002be10318}
|
||||
Provider=%QEMU%
|
||||
DriverVer=12/29/2013,1.3.0
|
||||
[ControlFlags]
|
||||
ExcludeFromSelect=*
|
||||
[Manufacturer]
|
||||
%QEMU%=QEMU,NTx86,NTAMD64
|
||||
|
||||
[QEMU.NTx86]
|
||||
%QEMU-PCI_SERIAL_1_PORT%=ComPort_inst1, PCI\VEN_1B36&DEV_0002
|
||||
%QEMU-PCI_SERIAL_2_PORT%=ComPort_inst2, PCI\VEN_1B36&DEV_0003
|
||||
%QEMU-PCI_SERIAL_4_PORT%=ComPort_inst4, PCI\VEN_1B36&DEV_0004
|
||||
|
||||
[QEMU.NTAMD64]
|
||||
%QEMU-PCI_SERIAL_1_PORT%=ComPort_inst1, PCI\VEN_1B36&DEV_0002
|
||||
%QEMU-PCI_SERIAL_2_PORT%=ComPort_inst2, PCI\VEN_1B36&DEV_0003
|
||||
%QEMU-PCI_SERIAL_4_PORT%=ComPort_inst4, PCI\VEN_1B36&DEV_0004
|
||||
|
||||
[ComPort_inst1]
|
||||
Include=mf.inf
|
||||
Needs=MFINSTALL.mf
|
||||
|
||||
[ComPort_inst2]
|
||||
Include=mf.inf
|
||||
Needs=MFINSTALL.mf
|
||||
|
||||
[ComPort_inst4]
|
||||
Include=mf.inf
|
||||
Needs=MFINSTALL.mf
|
||||
|
||||
[ComPort_inst1.HW]
|
||||
AddReg=ComPort_inst1.RegHW
|
||||
|
||||
[ComPort_inst2.HW]
|
||||
AddReg=ComPort_inst2.RegHW
|
||||
|
||||
[ComPort_inst4.HW]
|
||||
AddReg=ComPort_inst4.RegHW
|
||||
|
||||
[ComPort_inst1.Services]
|
||||
Include=mf.inf
|
||||
Needs=MFINSTALL.mf.Services
|
||||
|
||||
[ComPort_inst2.Services]
|
||||
Include=mf.inf
|
||||
Needs=MFINSTALL.mf.Services
|
||||
|
||||
[ComPort_inst4.Services]
|
||||
Include=mf.inf
|
||||
Needs=MFINSTALL.mf.Services
|
||||
|
||||
[ComPort_inst1.RegHW]
|
||||
HKR,Child0000,HardwareID,,*PNP0501
|
||||
HKR,Child0000,VaryingResourceMap,1,00, 00,00,00,00, 08,00,00,00
|
||||
HKR,Child0000,ResourceMap,1,02
|
||||
|
||||
[ComPort_inst2.RegHW]
|
||||
HKR,Child0000,HardwareID,,*PNP0501
|
||||
HKR,Child0000,VaryingResourceMap,1,00, 00,00,00,00, 08,00,00,00
|
||||
HKR,Child0000,ResourceMap,1,02
|
||||
HKR,Child0001,HardwareID,,*PNP0501
|
||||
HKR,Child0001,VaryingResourceMap,1,00, 08,00,00,00, 08,00,00,00
|
||||
HKR,Child0001,ResourceMap,1,02
|
||||
|
||||
[ComPort_inst4.RegHW]
|
||||
HKR,Child0000,HardwareID,,*PNP0501
|
||||
HKR,Child0000,VaryingResourceMap,1,00, 00,00,00,00, 08,00,00,00
|
||||
HKR,Child0000,ResourceMap,1,02
|
||||
HKR,Child0001,HardwareID,,*PNP0501
|
||||
HKR,Child0001,VaryingResourceMap,1,00, 08,00,00,00, 08,00,00,00
|
||||
HKR,Child0001,ResourceMap,1,02
|
||||
HKR,Child0002,HardwareID,,*PNP0501
|
||||
HKR,Child0002,VaryingResourceMap,1,00, 10,00,00,00, 08,00,00,00
|
||||
HKR,Child0002,ResourceMap,1,02
|
||||
HKR,Child0003,HardwareID,,*PNP0501
|
||||
HKR,Child0003,VaryingResourceMap,1,00, 18,00,00,00, 08,00,00,00
|
||||
HKR,Child0003,ResourceMap,1,02
|
||||
|
||||
[Strings]
|
||||
QEMU="QEMU"
|
||||
QEMU-PCI_SERIAL_1_PORT="1x QEMU PCI Serial Card"
|
||||
QEMU-PCI_SERIAL_2_PORT="2x QEMU PCI Serial Card"
|
||||
QEMU-PCI_SERIAL_4_PORT="4x QEMU PCI Serial Card"
|
|
@ -1,87 +0,0 @@
|
|||
QEMU Machine Protocol
|
||||
=====================
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
The QEMU Machine Protocol (QMP) allows applications to operate a
|
||||
QEMU instance.
|
||||
|
||||
QMP is JSON[1] based and features the following:
|
||||
|
||||
- Lightweight, text-based, easy to parse data format
|
||||
- Asynchronous messages support (i.e. events)
|
||||
- Capabilities Negotiation
|
||||
|
||||
For detailed information on QMP's usage, please refer to the following files:
|
||||
|
||||
o qmp-spec.txt QEMU Machine Protocol current specification
|
||||
o qmp-commands.txt QMP supported commands (auto-generated at build-time)
|
||||
o qmp-events.txt List of available asynchronous events
|
||||
|
||||
[1] http://www.json.org
|
||||
|
||||
Usage
|
||||
-----
|
||||
|
||||
You can use the -qmp option to enable QMP. For example, the following
|
||||
makes QMP available on localhost port 4444:
|
||||
|
||||
$ qemu [...] -qmp tcp:localhost:4444,server,nowait
|
||||
|
||||
However, for more flexibility and to make use of more options, the -mon
|
||||
command-line option should be used. For instance, the following example
|
||||
creates one HMP instance (human monitor) on stdio and one QMP instance
|
||||
on localhost port 4444:
|
||||
|
||||
$ qemu [...] -chardev stdio,id=mon0 -mon chardev=mon0,mode=readline \
|
||||
-chardev socket,id=mon1,host=localhost,port=4444,server,nowait \
|
||||
-mon chardev=mon1,mode=control,pretty=on
|
||||
|
||||
Please refer to QEMU's manpage for more information.
|
||||
|
||||
Simple Testing
|
||||
--------------
|
||||
|
||||
To manually test QMP one can connect with telnet and issue commands by hand:
|
||||
|
||||
$ telnet localhost 4444
|
||||
Trying 127.0.0.1...
|
||||
Connected to localhost.
|
||||
Escape character is '^]'.
|
||||
{
|
||||
"QMP": {
|
||||
"version": {
|
||||
"qemu": {
|
||||
"micro": 50,
|
||||
"minor": 6,
|
||||
"major": 1
|
||||
},
|
||||
"package": ""
|
||||
},
|
||||
"capabilities": [
|
||||
]
|
||||
}
|
||||
}
|
||||
|
||||
{ "execute": "qmp_capabilities" }
|
||||
{
|
||||
"return": {
|
||||
}
|
||||
}
|
||||
|
||||
{ "execute": "query-status" }
|
||||
{
|
||||
"return": {
|
||||
"status": "prelaunch",
|
||||
"singlestep": false,
|
||||
"running": false
|
||||
}
|
||||
}
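The same exchange can be scripted instead of typed into telnet. The following is a minimal sketch, not a full client: the function names are made up for illustration, and it assumes the monitor is not in pretty mode, so that each message arrives as one line of JSON.

```python
import json
import socket


def read_message(reader):
    """Read one JSON message; assumes one message per line (no pretty mode)."""
    line = reader.readline()
    if not line:
        raise ConnectionError("QMP peer closed the connection")
    return json.loads(line)


def execute(reader, writer, command):
    """Send a command and return the content of its "return" member."""
    writer.write(json.dumps({"execute": command}) + "\n")
    writer.flush()
    while True:
        reply = read_message(reader)
        if "event" in reply:
            continue  # asynchronous events may arrive between replies
        if "error" in reply:
            raise RuntimeError(reply["error"])
        return reply["return"]


def negotiate(reader, writer):
    """Consume the greeting, then leave capabilities negotiation mode."""
    greeting = read_message(reader)
    execute(reader, writer, "qmp_capabilities")
    return greeting["QMP"]["version"]["qemu"]


def query_status(host="localhost", port=4444):
    """Connect to the port used in the examples above and run query-status."""
    with socket.create_connection((host, port)) as sock:
        with sock.makefile("rw", encoding="utf-8") as stream:
            negotiate(stream, stream)
            return execute(stream, stream, "query-status")
```

With a QEMU instance started as shown under Usage, calling query_status() would be expected to return the same dictionary that the telnet session prints for query-status.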
|
||||
|
||||
Please refer to the qapi-schema.json file for a complete command reference.
|
||||
|
||||
QMP wiki page
|
||||
-------------
|
||||
|
||||
http://wiki.qemu-project.org/QMP
|
|
@ -1,627 +0,0 @@
|
|||
QEMU Machine Protocol Events
|
||||
============================
|
||||
|
||||
ACPI_DEVICE_OST
|
||||
---------------
|
||||
|
||||
Emitted when guest executes ACPI _OST method.
|
||||
|
||||
- data: ACPIOSTInfo type as described in qapi-schema.json
|
||||
|
||||
{ "event": "ACPI_DEVICE_OST",
|
||||
"data": { "device": "d1", "slot": "0", "slot-type": "DIMM", "source": 1, "status": 0 } }
|
||||
|
||||
BALLOON_CHANGE
|
||||
--------------
|
||||
|
||||
Emitted when the guest changes the actual BALLOON level. This
|
||||
value is equivalent to the 'actual' field returned by the
|
||||
'query-balloon' command.
|
||||
|
||||
Data:
|
||||
|
||||
- "actual": actual level of the guest memory balloon in bytes (json-number)
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "BALLOON_CHANGE",
|
||||
"data": { "actual": 944766976 },
|
||||
"timestamp": { "seconds": 1267020223, "microseconds": 435656 } }
|
||||
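A client might decode such an event as sketched below. The helper names are hypothetical, and the sketch assumes that "seconds" counts from the Unix epoch, which this file does not state.

```python
import json
from datetime import datetime, timedelta, timezone


def event_time(event):
    """Return the event timestamp as a timezone-aware UTC datetime."""
    stamp = event["timestamp"]
    base = datetime.fromtimestamp(stamp["seconds"], tz=timezone.utc)
    return base + timedelta(microseconds=stamp["microseconds"])


def balloon_mib(event):
    """Convert the "actual" balloon level from bytes to MiB."""
    return event["data"]["actual"] / (1024 * 1024)


raw = ('{ "event": "BALLOON_CHANGE", "data": { "actual": 944766976 }, '
       '"timestamp": { "seconds": 1267020223, "microseconds": 435656 } }')
event = json.loads(raw)
print(event_time(event).isoformat(), balloon_mib(event), "MiB")
```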
|
||||
BLOCK_IMAGE_CORRUPTED
|
||||
---------------------
|
||||
|
||||
Emitted when a disk image is being marked corrupt.
|
||||
|
||||
Data:
|
||||
|
||||
- "device": Device name (json-string)
|
||||
- "msg": Informative message (e.g., reason for the corruption) (json-string)
|
||||
- "offset": If the corruption resulted from an image access, this is the access
|
||||
offset into the image (json-int)
|
||||
- "size": If the corruption resulted from an image access, this is the access
|
||||
size (json-int)
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "BLOCK_IMAGE_CORRUPTED",
|
||||
"data": { "device": "ide0-hd0",
|
||||
"msg": "Prevented active L1 table overwrite", "offset": 196608,
|
||||
"size": 65536 },
|
||||
"timestamp": { "seconds": 1378126126, "microseconds": 966463 } }
|
||||
|
||||
BLOCK_IO_ERROR
|
||||
--------------
|
||||
|
||||
Emitted when a disk I/O error occurs.
|
||||
|
||||
Data:
|
||||
|
||||
- "device": device name (json-string)
|
||||
- "operation": I/O operation (json-string, "read" or "write")
|
||||
- "action": action that has been taken; it is one of the following (json-string):
|
||||
"ignore": error has been ignored
|
||||
"report": error has been reported to the device
|
||||
"stop": the VM is going to stop because of the error
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "BLOCK_IO_ERROR",
|
||||
"data": { "device": "ide0-hd1",
|
||||
"operation": "write",
|
||||
"action": "stop" },
|
||||
"timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
|
||||
|
||||
Note: If action is "stop", a STOP event will eventually follow the
|
||||
BLOCK_IO_ERROR event.
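A handler for this event could look like the sketch below. describe_block_io_error is a hypothetical name; the only logic taken from the text above is that an action of "stop" means a STOP event is expected to follow.

```python
def describe_block_io_error(event):
    """Summarise a BLOCK_IO_ERROR event and say whether STOP should follow."""
    data = event["data"]
    expects_stop = data["action"] == "stop"
    summary = f'{data["operation"]} error on {data["device"]}: {data["action"]}'
    return summary, expects_stop


example = {"event": "BLOCK_IO_ERROR",
           "data": {"device": "ide0-hd1", "operation": "write", "action": "stop"},
           "timestamp": {"seconds": 1265044230, "microseconds": 450486}}
print(describe_block_io_error(example))
```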
|
||||
|
||||
BLOCK_JOB_CANCELLED
|
||||
-------------------
|
||||
|
||||
Emitted when a block job has been cancelled.
|
||||
|
||||
Data:
|
||||
|
||||
- "type": Job type (json-string; "stream" for image streaming
|
||||
"commit" for block commit)
|
||||
- "device": Device name (json-string)
|
||||
- "len": Maximum progress value (json-int)
|
||||
- "offset": Current progress value (json-int)
|
||||
On success this is equal to len.
|
||||
On failure this is less than len.
|
||||
- "speed": Rate limit, bytes per second (json-int)
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "BLOCK_JOB_CANCELLED",
|
||||
"data": { "type": "stream", "device": "virtio-disk0",
|
||||
"len": 10737418240, "offset": 134217728,
|
||||
"speed": 0 },
|
||||
"timestamp": { "seconds": 1267061043, "microseconds": 959568 } }
|
||||
|
||||
BLOCK_JOB_COMPLETED
|
||||
-------------------
|
||||
|
||||
Emitted when a block job has completed.
|
||||
|
||||
Data:
|
||||
|
||||
- "type": Job type (json-string; "stream" for image streaming
|
||||
"commit" for block commit)
|
||||
- "device": Device name (json-string)
|
||||
- "len": Maximum progress value (json-int)
|
||||
- "offset": Current progress value (json-int)
|
||||
On success this is equal to len.
|
||||
On failure this is less than len.
|
||||
- "speed": Rate limit, bytes per second (json-int)
|
||||
- "error": Error message (json-string, optional)
|
||||
Only present on failure. This field contains a human-readable
|
||||
error message. There are no semantics other than that streaming
|
||||
has failed and clients should not try to interpret the error
|
||||
string.
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "BLOCK_JOB_COMPLETED",
|
||||
"data": { "type": "stream", "device": "virtio-disk0",
|
||||
"len": 10737418240, "offset": 10737418240,
|
||||
"speed": 0 },
|
||||
"timestamp": { "seconds": 1267061043, "microseconds": 959568 } }
|
||||
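The success and failure rules for "offset", "len" and "error" can be checked as in the following sketch. block_job_result is a hypothetical helper, and treating a zero-length job as fully done is an assumption of the sketch.

```python
def block_job_result(event):
    """Return (succeeded, fraction_done) for a BLOCK_JOB_COMPLETED event."""
    data = event["data"]
    # On success offset equals len and no "error" member is present.
    succeeded = "error" not in data and data["offset"] == data["len"]
    fraction = data["offset"] / data["len"] if data["len"] else 1.0
    return succeeded, fraction


completed = {"event": "BLOCK_JOB_COMPLETED",
             "data": {"type": "stream", "device": "virtio-disk0",
                      "len": 10737418240, "offset": 10737418240, "speed": 0}}
print(block_job_result(completed))
```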
|
||||
BLOCK_JOB_ERROR
|
||||
---------------
|
||||
|
||||
Emitted when a block job encounters an error.
|
||||
|
||||
Data:
|
||||
|
||||
- "device": device name (json-string)
|
||||
- "operation": I/O operation (json-string, "read" or "write")
|
||||
- "action": action that has been taken; it is one of the following (json-string):
|
||||
"ignore": error has been ignored, the job may fail later
|
||||
"report": error will be reported and the job canceled
|
||||
"stop": error caused job to be paused
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "BLOCK_JOB_ERROR",
|
||||
"data": { "device": "ide0-hd1",
|
||||
"operation": "write",
|
||||
"action": "stop" },
|
||||
"timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
|
||||
|
||||
BLOCK_JOB_READY
|
||||
---------------
|
||||
|
||||
Emitted when a block job is ready to complete.
|
||||
|
||||
Data:
|
||||
|
||||
- "type": Job type (json-string; "stream" for image streaming
|
||||
"commit" for block commit)
|
||||
- "device": Device name (json-string)
|
||||
- "len": Maximum progress value (json-int)
|
||||
- "offset": Current progress value (json-int)
|
||||
On success this is equal to len.
|
||||
On failure this is less than len.
|
||||
- "speed": Rate limit, bytes per second (json-int)
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "BLOCK_JOB_READY",
|
||||
"data": { "device": "drive0", "type": "mirror", "speed": 0,
|
||||
"len": 2097152, "offset": 2097152 },
|
||||
"timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
|
||||
|
||||
Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
|
||||
event.
|
||||
|
||||
DEVICE_DELETED
|
||||
--------------
|
||||
|
||||
Emitted whenever the device removal completion is acknowledged
|
||||
by the guest.
|
||||
At this point, it's safe to reuse the specified device ID.
|
||||
Device removal can be initiated by the guest or by HMP/QMP commands.
|
||||
|
||||
Data:
|
||||
|
||||
- "device": device name (json-string, optional)
|
||||
- "path": device path (json-string)
|
||||
|
||||
{ "event": "DEVICE_DELETED",
|
||||
"data": { "device": "virtio-net-pci-0",
|
||||
"path": "/machine/peripheral/virtio-net-pci-0" },
|
||||
"timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
|
||||
|
||||
DEVICE_TRAY_MOVED
|
||||
-----------------
|
||||
|
||||
Emitted whenever the tray of a removable device is moved by the guest
|
||||
or by HMP/QMP commands.
|
||||
|
||||
Data:
|
||||
|
||||
- "device": device name (json-string)
|
||||
- "tray-open": true if the tray has been opened or false if it has been closed
|
||||
(json-bool)
|
||||
|
||||
{ "event": "DEVICE_TRAY_MOVED",
|
||||
"data": { "device": "ide1-cd0",
|
||||
"tray-open": true
|
||||
},
|
||||
"timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
|
||||
|
||||
GUEST_PANICKED
|
||||
--------------
|
||||
|
||||
Emitted when guest OS panic is detected.
|
||||
|
||||
Data:
|
||||
|
||||
- "action": Action that has been taken (json-string, currently always "pause").
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "GUEST_PANICKED",
|
||||
"data": { "action": "pause" } }
|
||||
|
||||
NIC_RX_FILTER_CHANGED
|
||||
---------------------
|
||||
|
||||
The event is emitted once until the query command is executed;
|
||||
the first event will always be emitted.
|
||||
|
||||
Data:
|
||||
|
||||
- "name": net client name (json-string)
|
||||
- "path": device path (json-string)
|
||||
|
||||
{ "event": "NIC_RX_FILTER_CHANGED",
|
||||
"data": { "name": "vnet0",
|
||||
"path": "/machine/peripheral/vnet0/virtio-backend" },
|
||||
"timestamp": { "seconds": 1368697518, "microseconds": 326866 } }
|
||||
|
||||
POWERDOWN
|
||||
---------
|
||||
|
||||
Emitted when the Virtual Machine is powered down through the power
|
||||
control system, such as via ACPI.
|
||||
|
||||
Data: None.
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "POWERDOWN",
|
||||
"timestamp": { "seconds": 1267040730, "microseconds": 682951 } }
|
||||
|
||||
QUORUM_FAILURE
|
||||
--------------
|
||||
|
||||
Emitted by the Quorum block driver if it fails to establish a quorum.
|
||||
|
||||
Data:
|
||||
|
||||
- "reference": device name if defined else node name.
|
||||
- "sector-num": Number of the first sector of the failed read operation.
|
||||
- "sectors-count": Failed read operation sector count.
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "QUORUM_FAILURE",
|
||||
"data": { "reference": "usr1", "sector-num": 345435, "sectors-count": 5 },
|
||||
"timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
|
||||
|
||||
QUORUM_REPORT_BAD
|
||||
-----------------
|
||||
|
||||
Emitted to report a corruption of a Quorum file.
|
||||
|
||||
Data:
|
||||
|
||||
- "error": Error message (json-string, optional)
|
||||
Only present on failure. This field contains a human-readable
|
||||
error message. There are no semantics other than that the
|
||||
block layer reported an error and clients should not try to
|
||||
interpret the error string.
|
||||
- "node-name": The graph node name of the block driver state.
|
||||
- "sector-num": Number of the first sector of the failed read operation.
|
||||
- "sectors-count": Failed read operation sector count.
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "QUORUM_REPORT_BAD",
|
||||
"data": { "node-name": "1.raw", "sector-num": 345435, "sectors-count": 5 },
|
||||
"timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
|
||||
|
||||
RESET
|
||||
-----
|
||||
|
||||
Emitted when the Virtual Machine is reset.
|
||||
|
||||
Data: None.
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "RESET",
|
||||
"timestamp": { "seconds": 1267041653, "microseconds": 9518 } }
|
||||
|
||||
RESUME
|
||||
------
|
||||
|
||||
Emitted when the Virtual Machine resumes execution.
|
||||
|
||||
Data: None.
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "RESUME",
|
||||
"timestamp": { "seconds": 1271770767, "microseconds": 582542 } }
|
||||
|
||||
RTC_CHANGE
|
||||
----------
|
||||
|
||||
Emitted when the guest changes the RTC time.
|
||||
|
||||
Data:
|
||||
|
||||
- "offset": Offset between base RTC clock (as specified by -rtc base), and
|
||||
new RTC clock value (json-number)
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "RTC_CHANGE",
|
||||
"data": { "offset": 78 },
|
||||
"timestamp": { "seconds": 1267020223, "microseconds": 435656 } }
|
||||
|
||||
SHUTDOWN
|
||||
--------
|
||||
|
||||
Emitted when the Virtual Machine has shut down, indicating that qemu
|
||||
is about to exit.
|
||||
|
||||
Data: None.
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "SHUTDOWN",
|
||||
"timestamp": { "seconds": 1267040730, "microseconds": 682951 } }
|
||||
|
||||
Note: If the command-line option "-no-shutdown" has been specified, a STOP
|
||||
event will eventually follow the SHUTDOWN event.
|
||||
|
||||
SPICE_CONNECTED
|
||||
---------------
|
||||
|
||||
Emitted when a SPICE client connects.
|
||||
|
||||
Data:
|
||||
|
||||
- "server": Server information (json-object)
|
||||
- "host": IP address (json-string)
|
||||
- "port": port number (json-string)
|
||||
- "family": address family (json-string, "ipv4" or "ipv6")
|
||||
- "client": Client information (json-object)
|
||||
- "host": IP address (json-string)
|
||||
- "port": port number (json-string)
|
||||
- "family": address family (json-string, "ipv4" or "ipv6")
|
||||
|
||||
Example:
|
||||
|
||||
{ "timestamp": {"seconds": 1290688046, "microseconds": 388707},
|
||||
"event": "SPICE_CONNECTED",
|
||||
"data": {
|
||||
"server": { "port": "5920", "family": "ipv4", "host": "127.0.0.1"},
|
||||
"client": {"port": "52873", "family": "ipv4", "host": "127.0.0.1"}
|
||||
}}
|
||||
|
||||
SPICE_DISCONNECTED
|
||||
------------------
|
||||
|
||||
Emitted when a SPICE client disconnects.
|
||||
|
||||
Data:
|
||||
|
||||
- "server": Server information (json-object)
|
||||
- "host": IP address (json-string)
|
||||
- "port": port number (json-string)
|
||||
- "family": address family (json-string, "ipv4" or "ipv6")
|
||||
- "client": Client information (json-object)
|
||||
- "host": IP address (json-string)
|
||||
- "port": port number (json-string)
|
||||
- "family": address family (json-string, "ipv4" or "ipv6")
|
||||
|
||||
Example:
|
||||
|
||||
{ "timestamp": {"seconds": 1290688046, "microseconds": 388707},
|
||||
"event": "SPICE_DISCONNECTED",
|
||||
"data": {
|
||||
"server": { "port": "5920", "family": "ipv4", "host": "127.0.0.1"},
|
||||
"client": {"port": "52873", "family": "ipv4", "host": "127.0.0.1"}
|
||||
}}
|
||||
|
||||
SPICE_INITIALIZED
|
||||
-----------------
|
||||
|
||||
Emitted after initial handshake and authentication takes place (if any)
|
||||
and the SPICE channel is up and running.
|
||||
|
||||
Data:
|
||||
|
||||
- "server": Server information (json-object)
|
||||
- "host": IP address (json-string)
|
||||
- "port": port number (json-string)
|
||||
- "family": address family (json-string, "ipv4" or "ipv6")
|
||||
- "auth": authentication method (json-string, optional)
|
||||
- "client": Client information (json-object)
|
||||
- "host": IP address (json-string)
|
||||
- "port": port number (json-string)
|
||||
- "family": address family (json-string, "ipv4" or "ipv6")
|
||||
- "connection-id": spice connection id. All channels with the same id
|
||||
belong to the same spice session (json-int)
|
||||
- "channel-type": channel type. "1" is the main control channel, filter for
|
||||
this one if you want to track spice sessions only (json-int)
|
||||
- "channel-id": channel id. Usually "0", but may differ when
|
||||
multiple channels of the same type exist, such as multiple
|
||||
display channels in a multihead setup (json-int)
|
||||
- "tls": whether the channel is encrypted (json-bool)
|
||||
|
||||
Example:
|
||||
|
||||
{ "timestamp": {"seconds": 1290688046, "microseconds": 417172},
|
||||
"event": "SPICE_INITIALIZED",
|
||||
"data": {"server": {"auth": "spice", "port": "5921",
|
||||
"family": "ipv4", "host": "127.0.0.1"},
|
||||
"client": {"port": "49004", "family": "ipv4", "channel-type": 3,
|
||||
"connection-id": 1804289383, "host": "127.0.0.1",
|
||||
"channel-id": 0, "tls": true}
|
||||
}}
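
The "channel-type" note above suggests a simple client-side filter. A minimal Python sketch, assuming events have already been read from the monitor as JSON text; the function name `is_spice_session_start` is hypothetical and not part of QEMU:

```python
import json

MAIN_CHANNEL_TYPE = 1  # per the field description above, 1 is the main control channel


def is_spice_session_start(event: dict) -> bool:
    """Return True only for SPICE_INITIALIZED events on the main channel."""
    if event.get("event") != "SPICE_INITIALIZED":
        return False
    client = event.get("data", {}).get("client", {})
    return client.get("channel-type") == MAIN_CHANNEL_TYPE


# The example event from this section (channel-type 3, so not the main channel).
sample = json.loads("""
{ "timestamp": {"seconds": 1290688046, "microseconds": 417172},
  "event": "SPICE_INITIALIZED",
  "data": {"server": {"auth": "spice", "port": "5921",
                      "family": "ipv4", "host": "127.0.0.1"},
           "client": {"port": "49004", "family": "ipv4", "channel-type": 3,
                      "connection-id": 1804289383, "host": "127.0.0.1",
                      "channel-id": 0, "tls": true}}}
""")
print(is_spice_session_start(sample))
```

Events that pass the filter can then be grouped by "connection-id" to follow one session.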
|
||||
|
||||
SPICE_MIGRATE_COMPLETED
|
||||
-----------------------
|
||||
|
||||
Emitted when SPICE migration has completed.
|
||||
|
||||
Data: None.
|
||||
|
||||
Example:
|
||||
|
||||
{ "timestamp": {"seconds": 1290688046, "microseconds": 417172},
|
||||
"event": "SPICE_MIGRATE_COMPLETED" }
|
||||
|
||||
|
||||
STOP
|
||||
----
|
||||
|
||||
Emitted when the Virtual Machine is stopped.
|
||||
|
||||
Data: None.
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "STOP",
|
||||
"timestamp": { "seconds": 1267041730, "microseconds": 281295 } }
|
||||
|
||||
SUSPEND
|
||||
-------
|
||||
|
||||
Emitted when guest enters S3 state.
|
||||
|
||||
Data: None.
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "SUSPEND",
|
||||
"timestamp": { "seconds": 1344456160, "microseconds": 309119 } }
|
||||
|
||||
SUSPEND_DISK
|
||||
------------
|
||||
|
||||
Emitted when the guest makes a request to enter S4 state.
|
||||
|
||||
Data: None.
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "SUSPEND_DISK",
|
||||
"timestamp": { "seconds": 1344456160, "microseconds": 309119 } }
|
||||
|
||||
Note: QEMU shuts down when entering S4 state.
|
||||
|
||||
VNC_CONNECTED
|
||||
-------------
|
||||
|
||||
Emitted when a VNC client establishes a connection.
|
||||
|
||||
Data:
|
||||
|
||||
- "server": Server information (json-object)
|
||||
- "host": IP address (json-string)
|
||||
- "service": port number (json-string)
|
||||
- "family": address family (json-string, "ipv4" or "ipv6")
|
||||
- "auth": authentication method (json-string, optional)
|
||||
- "client": Client information (json-object)
|
||||
- "host": IP address (json-string)
|
||||
- "service": port number (json-string)
|
||||
- "family": address family (json-string, "ipv4" or "ipv6")
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "VNC_CONNECTED",
|
||||
"data": {
|
||||
"server": { "auth": "sasl", "family": "ipv4",
|
||||
"service": "5901", "host": "0.0.0.0" },
|
||||
"client": { "family": "ipv4", "service": "58425",
|
||||
"host": "127.0.0.1" } },
|
||||
"timestamp": { "seconds": 1262976601, "microseconds": 975795 } }
|
||||
|
||||
|
||||
Note: This event is emitted before any authentication takes place, thus
|
||||
the authentication ID is not provided.
|
||||
|
||||
VNC_DISCONNECTED
|
||||
----------------
|
||||
|
||||
Emitted when the connection is closed.
|
||||
|
||||
Data:
|
||||
|
||||
- "server": Server information (json-object)
|
||||
- "host": IP address (json-string)
|
||||
- "service": port number (json-string)
|
||||
- "family": address family (json-string, "ipv4" or "ipv6")
|
||||
- "auth": authentication method (json-string, optional)
|
||||
- "client": Client information (json-object)
|
||||
- "host": IP address (json-string)
|
||||
- "service": port number (json-string)
|
||||
- "family": address family (json-string, "ipv4" or "ipv6")
|
||||
- "x509_dname": TLS dname (json-string, optional)
|
||||
- "sasl_username": SASL username (json-string, optional)
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "VNC_DISCONNECTED",
|
||||
"data": {
|
||||
"server": { "auth": "sasl", "family": "ipv4",
|
||||
"service": "5901", "host": "0.0.0.0" },
|
||||
"client": { "family": "ipv4", "service": "58425",
|
||||
"host": "127.0.0.1", "sasl_username": "luiz" } },
|
||||
"timestamp": { "seconds": 1262976601, "microseconds": 975795 } }
|
||||
|
||||
VNC_INITIALIZED
|
||||
---------------
|
||||
|
||||
Emitted after authentication takes place (if any) and the VNC session is
|
||||
made active.
|
||||
|
||||
Data:
|
||||
|
||||
- "server": Server information (json-object)
|
||||
- "host": IP address (json-string)
|
||||
- "service": port number (json-string)
|
||||
- "family": address family (json-string, "ipv4" or "ipv6")
|
||||
- "auth": authentication method (json-string, optional)
|
||||
- "client": Client information (json-object)
|
||||
- "host": IP address (json-string)
|
||||
- "service": port number (json-string)
|
||||
- "family": address family (json-string, "ipv4" or "ipv6")
|
||||
- "x509_dname": TLS dname (json-string, optional)
|
||||
- "sasl_username": SASL username (json-string, optional)
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "VNC_INITIALIZED",
|
||||
"data": {
|
||||
"server": { "auth": "sasl", "family": "ipv4",
|
||||
"service": "5901", "host": "0.0.0.0"},
|
||||
"client": { "family": "ipv4", "service": "46089",
|
||||
"host": "127.0.0.1", "sasl_username": "luiz" } },
|
||||
"timestamp": { "seconds": 1263475302, "microseconds": 150772 } }
|
||||
|
||||
VSERPORT_CHANGE
|
||||
---------------
|
||||
|
||||
Emitted when the guest opens or closes a virtio-serial port.
|
||||
|
||||
Data:
|
||||
|
||||
- "id": device identifier of the virtio-serial port (json-string)
|
||||
- "open": true if the guest has opened the virtio-serial port (json-bool)
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "VSERPORT_CHANGE",
|
||||
"data": { "id": "channel0", "open": true },
|
||||
"timestamp": { "seconds": 1401385907, "microseconds": 422329 } }
|
||||
|
||||
WAKEUP
|
||||
------
|
||||
|
||||
Emitted when the guest has woken up from S3 and is running.
|
||||
|
||||
Data: None.
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "WAKEUP",
|
||||
"timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
|
||||
|
||||
WATCHDOG
|
||||
--------
|
||||
|
||||
Emitted when the watchdog device's timer has expired.
|
||||
|
||||
Data:
|
||||
|
||||
- "action": Action that has been taken, it's one of the following (json-string):
|
||||
"reset", "shutdown", "poweroff", "pause", "debug", or "none"
|
||||
|
||||
Example:
|
||||
|
||||
{ "event": "WATCHDOG",
|
||||
"data": { "action": "reset" },
|
||||
"timestamp": { "seconds": 1267061043, "microseconds": 959568 } }
|
||||
|
||||
Note: If action is "reset", "shutdown", or "pause" the WATCHDOG event is
|
||||
followed respectively by the RESET, SHUTDOWN, or STOP events.
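
A client that wants to anticipate the follow-up event can encode the note above as a lookup table. This is a sketch only; the helper name `expected_follow_up` is hypothetical, and actions not named in the note are assumed to have no follow-up event:

```python
from typing import Optional

# Follow-up events named in the note above; other actions have none listed.
FOLLOW_UP_EVENT = {"reset": "RESET", "shutdown": "SHUTDOWN", "pause": "STOP"}


def expected_follow_up(watchdog_event: dict) -> Optional[str]:
    """Return the event expected after a WATCHDOG event, or None."""
    action = watchdog_event["data"]["action"]
    return FOLLOW_UP_EVENT.get(action)


event = {"event": "WATCHDOG", "data": {"action": "reset"},
         "timestamp": {"seconds": 1267061043, "microseconds": 959568}}
print(expected_follow_up(event))
```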
|
|
@ -1,273 +0,0 @@
|
|||
QEMU Machine Protocol Specification
|
||||
|
||||
1. Introduction
|
||||
===============
|
||||
|
||||
This document specifies the QEMU Machine Protocol (QMP), a JSON-based protocol
|
||||
which is available for applications to operate QEMU at the machine-level.
|
||||
|
||||
2. Protocol Specification
|
||||
=========================
|
||||
|
||||
This section details the protocol format. For the purpose of this document
|
||||
"Client" is any application which is using QMP to communicate with QEMU and
|
||||
"Server" is QEMU itself.
|
||||
|
||||
JSON data structures, when mentioned in this document, are always in the
|
||||
following format:
|
||||
|
||||
json-DATA-STRUCTURE-NAME
|
||||
|
||||
Where DATA-STRUCTURE-NAME is any valid JSON data structure, as defined by
|
||||
the JSON standard:
|
||||
|
||||
http://www.ietf.org/rfc/rfc4627.txt
|
||||
|
||||
For convenience, json-object members and json-array elements mentioned in
|
||||
this document will be in a certain order. However, in real protocol usage
|
||||
they can be in ANY order, thus no particular order should be assumed.
|
||||
|
||||
2.1 General Definitions
|
||||
-----------------------
|
||||
|
||||
2.1.1 All interactions transmitted by the Server are json-objects, always
|
||||
terminating with CRLF
|
||||
|
||||
2.1.2 All json-object members are mandatory unless specified otherwise
|
||||
|
||||
2.2 Server Greeting
|
||||
-------------------
|
||||
|
||||
As soon as a Client connects, the Server issues a greeting message, which signals
|
||||
that the connection has been successfully established and that the Server is
|
||||
ready for capabilities negotiation (for more information refer to section
|
||||
'4. Capabilities Negotiation').
|
||||
|
||||
The greeting message format is:
|
||||
|
||||
{ "QMP": { "version": json-object, "capabilities": json-array } }
|
||||
|
||||
Where,
|
||||
|
||||
- The "version" member contains the Server's version information (the format
|
||||
is the same as that of the query-version command)
|
||||
- The "capabilities" member specifies the availability of features beyond the
|
||||
baseline specification
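
As an illustration, a client might decode the greeting like this. The sketch assumes the greeting arrives as one line of JSON text and uses the version layout shown in the example of section 3.1; `parse_greeting` is a hypothetical helper name:

```python
import json


def parse_greeting(line: str):
    """Split a greeting into (version string, advertised capabilities)."""
    qmp = json.loads(line)["QMP"]
    qemu = qmp["version"]["qemu"]
    version = f'{qemu["major"]}.{qemu["minor"]}.{qemu["micro"]}'
    return version, qmp["capabilities"]


greeting = ('{ "QMP": { "version": { "qemu": { "micro": 50, "minor": 6, "major": 1 },'
            ' "package": ""}, "capabilities": []}}')
version, capabilities = parse_greeting(greeting)
print(version, capabilities)
```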
|
||||
|
||||
2.3 Issuing Commands
|
||||
--------------------
|
||||
|
||||
The format for command execution is:
|
||||
|
||||
{ "execute": json-string, "arguments": json-object, "id": json-value }
|
||||
|
||||
Where,
|
||||
|
||||
- The "execute" member identifies the command to be executed by the Server
|
||||
- The "arguments" member is used to pass any arguments required for the
|
||||
execution of the command; it is optional when no arguments are required
|
||||
- The "id" member is a transaction identification associated with the
|
||||
command execution; it is optional and will be part of the response if
|
||||
provided
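
The command format above can be sketched as a small builder. This is an illustration, not QEMU code: `build_command` is a hypothetical helper, and framing (how the text is terminated on the wire) is left to the caller:

```python
import json

_NO_ID = object()  # sentinel so that None remains a usable "id" value


def build_command(name: str, arguments: dict = None, cmd_id=_NO_ID) -> str:
    """Serialize a command, omitting the optional members when unused."""
    msg = {"execute": name}
    if arguments is not None:
        msg["arguments"] = arguments
    if cmd_id is not _NO_ID:
        msg["id"] = cmd_id
    return json.dumps(msg)


print(build_command("stop"))
print(build_command("query-kvm", cmd_id="example"))
```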
|
||||
|
||||
2.4 Command Responses
|
||||
---------------------
|
||||
|
||||
There are two possible responses which the Server will issue as the result
|
||||
of a command execution: success or error.
|
||||
|
||||
2.4.1 success
|
||||
-------------
|
||||
|
||||
The format of a success response is:
|
||||
|
||||
{ "return": json-object, "id": json-value }
|
||||
|
||||
Where,
|
||||
|
||||
- The "return" member contains the command returned data, which is defined
|
||||
on a per-command basis, or an empty json-object if the command does not
|
||||
return data
|
||||
- The "id" member contains the transaction identification associated
|
||||
with the command execution if issued by the Client
|
||||
|
||||
2.4.2 error
|
||||
-----------
|
||||
|
||||
The format of an error response is:
|
||||
|
||||
{ "error": { "class": json-string, "desc": json-string }, "id": json-value }
|
||||
|
||||
Where,
|
||||
|
||||
- The "class" member contains the error class name (e.g. "GenericError")
|
||||
- The "desc" member is a human-readable error message. Clients should
|
||||
not attempt to parse this message.
|
||||
- The "id" member contains the transaction identification associated with
|
||||
the command execution if issued by the Client
|
||||
|
||||
NOTE: Some errors can occur before the Server is able to read the "id" member,
|
||||
in these cases the "id" member will not be part of the error response, even
|
||||
if provided by the client.
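
A client can tell the two response kinds apart by which top-level member is present. The sketch below assumes one response per line of JSON text; `QMPError` and `unwrap_response` are hypothetical names. Note that "id" is read with a default, since it may be missing even from an error response:

```python
import json


class QMPError(Exception):
    """Error response from the Server; 'desc' is for humans, not for parsing."""

    def __init__(self, error_class: str, desc: str, cmd_id=None):
        super().__init__(f"{error_class}: {desc}")
        self.error_class = error_class
        self.desc = desc
        self.cmd_id = cmd_id


def unwrap_response(line: str):
    """Return (returned data, id) on success, raise QMPError on error."""
    msg = json.loads(line)
    cmd_id = msg.get("id")  # absent if the Client sent none or parsing failed early
    if "error" in msg:
        err = msg["error"]
        raise QMPError(err["class"], err["desc"], cmd_id)
    return msg["return"], cmd_id


print(unwrap_response('{ "return": { "enabled": true, "present": true }, "id": "example"}'))
```

Callers should branch on `error_class` only, never on the text of `desc`.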
|
||||
|
||||
2.5 Asynchronous events
|
||||
-----------------------
|
||||
|
||||
As a result of state changes, the Server may send messages unilaterally
|
||||
to the Client at any time. They are called "asynchronous events".
|
||||
|
||||
The format of asynchronous events is:
|
||||
|
||||
{ "event": json-string, "data": json-object,
|
||||
"timestamp": { "seconds": json-number, "microseconds": json-number } }
|
||||
|
||||
Where,
|
||||
|
||||
- The "event" member contains the event's name
|
||||
- The "data" member contains event-specific data, which is defined on a
|
||||
per-event basis; it is optional
|
||||
- The "timestamp" member contains the exact time of when the event occurred
|
||||
in the Server. It is a fixed json-object with time in seconds and
|
||||
microseconds
|
||||
|
||||
For a listing of supported asynchronous events, please refer to the
|
||||
qmp-events.txt file.
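
A short sketch of handling an event on the Client side follows. It assumes that "seconds" counts from the Unix epoch, which this section does not state explicitly; `is_event` and `event_time` are hypothetical helper names:

```python
import json
from datetime import datetime, timedelta, timezone


def is_event(message: dict) -> bool:
    """Events carry an "event" member; responses carry "return" or "error"."""
    return "event" in message


def event_time(event: dict) -> datetime:
    """Combine the two timestamp fields (assumed Unix epoch based)."""
    ts = event["timestamp"]
    base = datetime.fromtimestamp(ts["seconds"], tz=timezone.utc)
    return base + timedelta(microseconds=ts["microseconds"])


powerdown = json.loads(
    '{ "timestamp": { "seconds": 1258551470, "microseconds": 802384 },'
    ' "event": "POWERDOWN" }'
)
if is_event(powerdown):
    # "data" is optional, so fall back to an empty object.
    print(powerdown["event"], event_time(powerdown).isoformat(), powerdown.get("data", {}))
```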
|
||||
|
||||
3. QMP Examples
|
||||
===============
|
||||
|
||||
This section provides some examples of real QMP usage; in all of them
|
||||
"C" stands for "Client" and "S" stands for "Server".
|
||||
|
||||
3.1 Server greeting
|
||||
-------------------
|
||||
|
||||
S: { "QMP": { "version": { "qemu": { "micro": 50, "minor": 6, "major": 1 },
|
||||
"package": ""}, "capabilities": []}}
|
||||
|
||||
3.2 Simple 'stop' execution
|
||||
---------------------------
|
||||
|
||||
C: { "execute": "stop" }
|
||||
S: { "return": {} }
|
||||
|
||||
3.3 KVM information
|
||||
-------------------
|
||||
|
||||
C: { "execute": "query-kvm", "id": "example" }
|
||||
S: { "return": { "enabled": true, "present": true }, "id": "example"}
|
||||
|
||||
3.4 Parsing error
|
||||
-----------------
|
||||
|
||||
C: { "execute": }
|
||||
S: { "error": { "class": "GenericError", "desc": "Invalid JSON syntax" } }
|
||||
|
||||
3.5 Powerdown event
|
||||
-------------------
|
||||
|
||||
S: { "timestamp": { "seconds": 1258551470, "microseconds": 802384 },
|
||||
"event": "POWERDOWN" }
|
||||
|
||||
4. Capabilities Negotiation
|
||||
===========================
|
||||
|
||||
When a Client successfully establishes a connection, the Server is in
|
||||
Capabilities Negotiation mode.
|
||||
|
||||
In this mode only the qmp_capabilities command is allowed to run; all
|
||||
other commands will return the CommandNotFound error. Asynchronous
|
||||
messages are not delivered either.
|
||||
|
||||
Clients should use the qmp_capabilities command to enable capabilities
|
||||
advertised in the Server's greeting (section '2.2 Server Greeting') they
|
||||
support.
|
||||
|
||||
When the qmp_capabilities command is issued, and if it does not return an
|
||||
error, the Server enters Command mode, where capability changes take
|
||||
effect, all commands (except qmp_capabilities) are allowed and asynchronous
|
||||
messages are delivered.
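
The mode switch described above can be sketched over any pair of text streams. This is a hedged illustration: `negotiate` is a hypothetical helper, the in-memory streams stand in for a real monitor connection, terminating the Client's command with CRLF is an assumption, and no capabilities are enabled because the argument format is not covered here:

```python
import io
import json


def negotiate(reader, writer) -> list:
    """Read the greeting, issue qmp_capabilities, return advertised capabilities."""
    greeting = json.loads(reader.readline())
    capabilities = greeting["QMP"]["capabilities"]
    writer.write(json.dumps({"execute": "qmp_capabilities"}) + "\r\n")
    writer.flush()
    reply = json.loads(reader.readline())
    if "error" in reply:
        raise RuntimeError(reply["error"]["desc"])
    return capabilities  # the Server is now in Command mode


# Canned Server output: the greeting from section 3.1, then a success response.
server_output = io.StringIO(
    '{"QMP": {"version": {"qemu": {"micro": 50, "minor": 6, "major": 1},'
    ' "package": ""}, "capabilities": []}}\r\n'
    '{"return": {}}\r\n'
)
client_output = io.StringIO()
print(negotiate(server_output, client_output))
```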
|
||||
|
||||
5. Compatibility Considerations
|
||||
===============================
|
||||
|
||||
All protocol changes or new features which modify the protocol format in an
|
||||
incompatible way are disabled by default and will be advertised by the
|
||||
capabilities array (section '2.2 Server Greeting'). Thus, Clients can check
|
||||
that array and enable the capabilities they support.
|
||||
|
||||
The QMP Server performs a type check on the arguments to a command. It
|
||||
generates an error if a value does not have the expected type for its
|
||||
key, or if it does not understand a key that the Client included. The
|
||||
strictness of the Server catches wrong assumptions of Clients about
|
||||
the Server's schema. Clients can assume that, when such validation
|
||||
errors occur, they will be reported before the command generated any
|
||||
side effect.
|
||||
|
||||
However, Clients must not assume any particular:
|
||||
|
||||
- Length of json-arrays
|
||||
- Size of json-objects; in particular, future versions of QEMU may add
|
||||
new keys and Clients should be able to ignore them.
|
||||
- Order of json-object members or json-array elements
|
||||
- Set of errors generated by a command; that is, new errors can be added
|
||||
to any existing command in newer versions of the Server
|
||||
|
||||
Of course, the Server does guarantee to send valid JSON. But apart from
|
||||
this, a Client should be "conservative in what they send, and liberal in
|
||||
what they accept".
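
In practice, "liberal in what they accept" mostly means reading the members a Client knows by name and ignoring the rest. A minimal sketch, in which `pick_known` and the "future-key" member are hypothetical:

```python
def pick_known(obj: dict, known: tuple) -> dict:
    """Keep only the members this Client understands, in any input order."""
    return {key: obj[key] for key in known if key in obj}


# A query-kvm style result with an extra member a newer Server might add.
result = {"future-key": 1, "present": True, "enabled": True}
print(pick_known(result, ("enabled", "present")))
```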
|
||||
|
||||
6. Downstream extension of QMP
|
||||
==============================
|
||||
|
||||
We recommend that downstream consumers of QEMU do *not* modify QMP.
|
||||
Management tools should be able to support both upstream and downstream
|
||||
versions of QMP without special logic, and downstream extensions are
|
||||
inherently at odds with that.
|
||||
|
||||
However, we recognize that it is sometimes impossible for downstreams to
|
||||
avoid modifying QMP. Both upstream and downstream need to take care to
|
||||
preserve long-term compatibility and interoperability.
|
||||
|
||||
To help with that, QMP reserves JSON object member names beginning with
|
||||
'__' (double underscore) for downstream use ("downstream names"). This
|
||||
means upstream will never use any downstream names for its commands,
|
||||
arguments, errors, asynchronous events, and so forth.
|
||||
|
||||
Any new names downstream wishes to add must begin with '__'. To
|
||||
ensure compatibility with other downstreams, it is strongly
|
||||
recommended that you prefix your downstream names with '__RFQDN_' where
|
||||
RFQDN is a valid, reverse fully qualified domain name which you
|
||||
control. For example, a qemu-kvm specific monitor command would be:
|
||||
|
||||
(qemu) __org.linux-kvm_enable_irqchip
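
The naming rule can be sketched as two helpers. Both function names are hypothetical; the domain used below is taken from the example above:

```python
def downstream_name(domain: str, name: str) -> str:
    """Build '__RFQDN_name' by reversing the labels of a domain you control."""
    rfqdn = ".".join(reversed(domain.split(".")))
    return f"__{rfqdn}_{name}"


def is_downstream_name(name: str) -> bool:
    """Names beginning with a double underscore are reserved for downstream."""
    return name.startswith("__")


print(downstream_name("linux-kvm.org", "enable_irqchip"))
```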
|
||||
|
||||
Downstream must not change the server greeting (section 2.2) other than
|
||||
to offer additional capabilities. But see below for why even that is
|
||||
discouraged.
|
||||
|
||||
Section '5. Compatibility Considerations' applies to downstream as well
|
||||
as to upstream, obviously. It follows that downstream must behave
|
||||
exactly like upstream for any input not containing members with
|
||||
downstream names ("downstream members"), except it may add members
|
||||
with downstream names to its output.
|
||||
|
||||
Thus, a client should not be able to distinguish downstream from
|
||||
upstream as long as it doesn't send input with downstream members, and
|
||||
properly ignores any downstream members in the output it receives.
|
||||
|
||||
Advice on downstream modifications:
|
||||
|
||||
1. Introducing new commands is okay. If you want to extend an existing
|
||||
command, consider introducing a new one with the new behaviour
|
||||
instead.
|
||||
|
||||
2. Introducing new asynchronous messages is okay. If you want to extend
|
||||
an existing message, consider adding a new one instead.
|
||||
|
||||
3. Introducing new errors for use in new commands is okay. Adding new
|
||||
errors to existing commands counts as extension, so 1. applies.
|
||||
|
||||
4. New capabilities are strongly discouraged. Capabilities are for
|
||||
evolving the basic protocol, and multiple diverging basic protocol
|
||||
dialects are most undesirable.
|
|
@ -1,420 +0,0 @@
|
|||
(RDMA: Remote Direct Memory Access)
|
||||
RDMA Live Migration Specification, Version # 1
|
||||
==============================================
|
||||
Wiki: http://wiki.qemu-project.org/Features/RDMALiveMigration
|
||||
Github: git@github.com:hinesmr/qemu.git, 'rdma' branch
|
||||
|
||||
Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
|
||||
|
||||
An *exhaustive* paper (2010) with additional performance details is
|
||||
linked on the QEMU wiki above.
|
||||
|
||||
Contents:
|
||||
=========
|
||||
* Introduction
|
||||
* Before running
|
||||
* Running
|
||||
* Performance
|
||||
* RDMA Migration Protocol Description
|
||||
* Versioning and Capabilities
|
||||
* QEMUFileRDMA Interface
|
||||
* Migration of VM's ram
|
||||
* Error handling
|
||||
* TODO
|
||||
|
||||
Introduction:
|
||||
=============
|
||||
|
||||
RDMA helps make your migration more deterministic under heavy load because
|
||||
of the significantly lower latency and higher throughput over TCP/IP. This is
|
||||
because the RDMA I/O architecture reduces the number of interrupts and
|
||||
data copies by bypassing the host networking stack. In particular, a TCP-based
|
||||
migration, under certain types of memory-bound workloads, may take a more
|
||||
unpredictable amount of time to complete the migration if the amount of
|
||||
memory tracked during each live migration iteration round cannot keep pace
|
||||
with the rate of dirty memory produced by the workload.
|
||||
|
||||
RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA
|
||||
over Converged Ethernet) as well as Infiniband-based. This implementation of
|
||||
migration using RDMA is capable of using both technologies because of
|
||||
the use of the OpenFabrics OFED software stack that abstracts out the
|
||||
programming model irrespective of the underlying hardware.
|
||||
|
||||
Refer to openfabrics.org or your respective RDMA hardware vendor for
|
||||
guidance on how to verify that you have the OFED software stack
|
||||
installed in your environment. You should be able to successfully link
|
||||
against the "librdmacm" and "libibverbs" libraries and development headers
|
||||
for a working build of QEMU to run successfully using RDMA Migration.
|
||||
|
||||
BEFORE RUNNING:
|
||||
===============
|
||||
|
||||
Use of RDMA during migration requires pinning and registering memory
|
||||
with the hardware. This means that memory must be physically resident
|
||||
before the hardware can transmit that memory to another machine.
|
||||
If this is not acceptable for your application or product, then the use
|
||||
of RDMA migration may in fact be harmful to co-located VMs or other
|
||||
software on the machine if there is not sufficient memory available to
|
||||
relocate the entire footprint of the virtual machine. If so, then the
|
||||
use of RDMA is discouraged and it is recommended to use standard TCP migration.
|
||||
|
||||
Experimental: Next, decide if you want dynamic page registration.
|
||||
For example, if you have an 8GB RAM virtual machine, but only 1GB
|
||||
is in active use, then enabling the rdma-pin-all capability will cause all 8GB to
|
||||
be pinned and resident in memory. This feature mostly affects the
|
||||
bulk-phase round of the migration and can be enabled for extremely
|
||||
high-performance RDMA hardware using the following command:
|
||||
|
||||
QEMU Monitor Command:
|
||||
$ migrate_set_capability rdma-pin-all on # disabled by default
|
||||
|
||||
Performing this action will cause all 8GB to be pinned, so if that's
|
||||
not what you want, then please ignore this step altogether.
|
||||
|
||||
On the other hand, this will also significantly speed up the bulk round
|
||||
of the migration, which can greatly reduce the "total" time of your migration.
|
||||
Example performance of this using an idle VM in the previous example
|
||||
can be found in the "Performance" section.
|
||||
|
||||
Note: for very large virtual machines (hundreds of GBs), pinning
|
||||
*all* of the memory of your virtual machine in the kernel is very expensive
|
||||
and may extend the initial bulk iteration time by many seconds,
|
||||
thus extending the total migration time. However, this will not
|
||||
affect the determinism or predictability of your migration; you will
|
||||
still gain from the benefits of advanced pinning with RDMA.
|
||||
|
||||
RUNNING:
|
||||
========
|
||||
|
||||
First, set the migration speed to match your hardware's capabilities:
|
||||
|
||||
QEMU Monitor Command:
|
||||
$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device
|
||||
|
||||
Next, on the destination machine, add the following to the QEMU command line:
|
||||
|
||||
qemu ..... -incoming rdma:host:port
|
||||
|
||||
Finally, perform the actual migration on the source machine:
|
||||
|
||||
QEMU Monitor Command:
|
||||
$ migrate -d rdma:host:port
|
||||
|
||||
PERFORMANCE
|
||||
===========
|
||||
|
||||
Here is a brief summary of total migration time and downtime using RDMA:
|
||||
Using a 40gbps infiniband link performing a worst-case stress test,
|
||||
using an 8GB RAM virtual machine:
|
||||
|
||||
Using the following command:
|
||||
$ apt-get install stress
|
||||
$ stress --vm-bytes 7500M --vm 1 --vm-keep
|
||||
|
||||
1. Migration throughput: 26 gigabits/second.
|
||||
2. Downtime (stop time) varies between 15 and 100 milliseconds.
|
||||
|
||||
EFFECTS of memory registration on bulk phase round:
|
||||
|
||||
For example, in the same 8GB RAM example with all 8GB of memory in
|
||||
active use and the VM itself completely idle, using the same 40 gbps
|
||||
infiniband link:
|
||||
|
||||
1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
|
||||
2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
|
||||
|
||||
These numbers would of course scale up to whatever size virtual machine
|
||||
you have to migrate using RDMA.
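
For a rough sense of how these figures scale, the time spent purely on the wire can be estimated from size and throughput. The sketch assumes "8GB" means 8 GiB and that Gbps is decimal (10^9 bits per second); `wire_time_seconds` is a hypothetical helper and models nothing but raw transfer:

```python
def wire_time_seconds(size_gib: float, throughput_gbps: float) -> float:
    """Lower bound on transfer time: bits to move divided by link throughput."""
    bits = size_gib * 2**30 * 8
    return bits / (throughput_gbps * 1e9)


for gbps in (9.5, 26):
    print(f"8 GiB at {gbps} Gbps: {wire_time_seconds(8, gbps):.2f} s")
```

The totals quoted above (about 7.5 and 4 seconds) are somewhat higher than these estimates, which is consistent with part of the total being spent outside raw transfer.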
|
||||
|
||||
Enabling this feature does *not* have any measurable effect on
|
||||
migration *downtime*. This is because, without this feature, all of the
|
||||
memory will already have been registered in advance during
|
||||
the bulk round and does not need to be re-registered during the successive
|
||||
iteration rounds.
|
||||
|
||||
RDMA Migration Protocol Description:
|
||||
====================================
|
||||
|
||||
Migration with RDMA is separated into two parts:
|
||||
|
||||
1. The transmission of the pages using RDMA
|
||||
2. Everything else (a control channel is introduced)
|
||||
|
||||
"Everything else" is transmitted using a formal
|
||||
protocol now, consisting of infiniband SEND messages.
|
||||
|
||||
An infiniband SEND message is the standard ibverbs
|
||||
message used by applications of infiniband hardware.
|
||||
The only difference between a SEND message and an RDMA
|
||||
message is that SEND messages cause notifications
|
||||
to be posted to the completion queue (CQ) on the
|
||||
infiniband receiver side, whereas RDMA messages (used
|
||||
for VM's ram) do not (to behave like an actual DMA).
|
||||
|
||||
Messages in infiniband require two things:
|
||||
|
||||
1. registration of the memory that will be transmitted
|
||||
2. (SEND only) work requests to be posted on both
|
||||
sides of the network before the actual transmission
|
||||
can occur.
|
||||
|
||||
RDMA messages are much easier to deal with. Once the memory
|
||||
on the receiver side is registered and pinned, we're
|
||||
basically done. All that is required is for the sender
|
||||
side to start dumping bytes onto the link.
|
||||
|
||||
(Memory is not released from pinning until the migration
|
||||
completes, given that RDMA migrations are very fast.)
|
||||
|
||||
SEND messages require more coordination because the
|
||||
receiver must have reserved space (using a receive
|
||||
work request) on the receive queue (RQ) before QEMUFileRDMA
|
||||
can start using them to carry all the bytes as
|
||||
a control transport for migration of device state.
|
||||
|
||||
To begin the migration, the initial connection setup is
|
||||
as follows (migration-rdma.c):
|
||||
|
||||
1. Receiver and Sender are started (command line or libvirt)
|
||||
2. Both sides post two RQ work requests
|
||||
3. Receiver does listen()
|
||||
4. Sender does connect()
|
||||
5. Receiver does accept()
|
||||
6. Check versioning and capabilities (described later)
|
||||
|
||||
At this point, we define a control channel on top of SEND messages
|
||||
which is described by a formal protocol. Each SEND message has a
|
||||
header portion and a data portion (but together are transmitted
|
||||
as a single SEND message).
|
||||
|
||||
Header:
|
||||
* Length (of the data portion, uint32, network byte order)
|
||||
* Type (what command to perform, uint32, network byte order)
|
||||
* Repeat (Number of commands in data portion, same type only)
|
||||
|
||||
The 'Repeat' field is here to support future multiple page registrations
|
||||
in a single message without any need to change the protocol itself
|
||||
so that the protocol is compatible across multiple versions of QEMU.
|
||||
Version #1 requires that all server implementations of the protocol must
|
||||
check this field and register all requests found in the array of commands located
|
||||
in the data portion and return an equal number of results in the response.
|
||||
The maximum number of repeats is hard-coded to 4096. This is a conservative
|
||||
limit based on the maximum size of a SEND message along with empirical
|
||||
observations on the maximum future benefit of simultaneous page registrations.
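
The header layout can be sketched with Python's struct module. Length and Type are uint32 in network byte order as stated above; treating Repeat as a third uint32 with no padding is an assumption, and the function names are hypothetical:

```python
import struct

HEADER_FORMAT = "!III"  # length, type, repeat; '!' selects network byte order
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
MAX_REPEAT = 4096  # hard-coded limit described above


def pack_control_message(msg_type: int, data: bytes, repeat: int = 1) -> bytes:
    """Prefix the data portion with a header, as one SEND payload."""
    if not 0 <= repeat <= MAX_REPEAT:
        raise ValueError(f"repeat must be between 0 and {MAX_REPEAT}")
    return struct.pack(HEADER_FORMAT, len(data), msg_type, repeat) + data


def unpack_control_message(payload: bytes):
    """Return (type, repeat, data) and reject truncated payloads."""
    length, msg_type, repeat = struct.unpack_from(HEADER_FORMAT, payload)
    data = payload[HEADER_SIZE:HEADER_SIZE + length]
    if len(data) != length:
        raise ValueError("data portion shorter than the header's length field")
    return msg_type, repeat, data


wire = pack_control_message(3, b"abc")
print(unpack_control_message(wire))
```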
|
||||
|
||||
The 'type' field has 12 different command values:
|
||||
1. Unused
|
||||
2. Error (sent to the source during bad things)
|
||||
3. Ready (control-channel is available)
|
||||
4. QEMU File (for sending non-live device state)
|
||||
5. RAM Blocks request (used right after connection setup)
|
||||
6. RAM Blocks result (used right after connection setup)
|
||||
7. Compress page (zap zero page and skip registration)
|
||||
8. Register request (dynamic chunk registration)
|
||||
9. Register result ('rkey' to be used by sender)
|
||||
10. Register finished (registration for current iteration finished)
|
||||
11. Unregister request (unpin previously registered memory)
|
||||
12. Unregister finished (confirmation that unpin completed)
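
For reference while reading the rest of this description, the list can be written as an enumeration. The member names are hypothetical, the numbers simply follow the list above (the values an implementation puts on the wire may differ), and the request/response pairing is inferred from the command names:

```python
from enum import IntEnum


class ControlType(IntEnum):
    """Command values, numbered as in the list above."""
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10
    UNREGISTER_REQUEST = 11
    UNREGISTER_FINISHED = 12


# Requests and the reply each one appears to be paired with (inferred).
EXPECTED_REPLY = {
    ControlType.RAM_BLOCKS_REQUEST: ControlType.RAM_BLOCKS_RESULT,
    ControlType.REGISTER_REQUEST: ControlType.REGISTER_RESULT,
    ControlType.UNREGISTER_REQUEST: ControlType.UNREGISTER_FINISHED,
}

print(ControlType(9).name)
```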
|
||||
|
||||
A single control message, as hinted above, can contain within the data
|
||||
portion an array of many commands of the same type. If there is more than
|
||||
one command, then the 'repeat' field will be greater than 1.
|
||||
|
||||
After connection setup, messages 5 & 6 are used to exchange ram block
|
||||
information and optionally pin all the memory if requested by the user.
|
||||
|
||||
After ram block exchange is completed, we have two protocol-level
|
||||
functions, responsible for communicating control-channel commands
|
||||
using the above list of values:
|
||||
|
||||
Logically:
|
||||
|
||||
qemu_rdma_exchange_recv(header, expected command type)
|
||||
|
||||
1. We transmit a READY command to let the sender know that
|
||||
we are *ready* to receive some data bytes on the control channel.
|
||||
2. Before attempting to receive the expected command, we post another
|
||||
RQ work request to replace the one we just used up.
|
||||
3. Block on a CQ event channel and wait for the SEND to arrive.
|
||||
4. When the send arrives, librdmacm will unblock us.
|
||||
5. Verify that the command-type and version received match the ones we expected.
|
||||
|
||||
qemu_rdma_exchange_send(header, data, optional response header & data):
|
||||
|
||||
1. Block on the CQ event channel waiting for a READY command
|
||||
from the receiver to tell us that the receiver
|
||||
is *ready* for us to transmit some new bytes.
|
||||
2. Optionally: if we are expecting a response from the command
|
||||
(that we have not yet transmitted), let's post an RQ
|
||||
work request to receive that data a few moments later.
|
||||
3. When the READY arrives, librdmacm will
|
||||
unblock us and we immediately post a RQ work request
|
||||
to replace the one we just used up.
|
||||
4. Now, we can actually post the work request to SEND
|
||||
the requested command type of the header we were asked for.
|
||||
5. Optionally, if we are expecting a response (as before),
|
||||
we block again and wait for that response using the additional
|
||||
work request we previously posted. (This is used to carry
|
||||
'Register result' commands (#9) back to the sender, which
|
||||
hold the rkey needed to perform RDMA. Note that the virtual address
|
||||
corresponding to this rkey was already exchanged at the beginning
|
||||
of the connection (described below).)
|
||||
|
||||
All of the remaining command types (not including 'ready')
|
||||
described above use the aforementioned two functions to do the hard work:
|
||||
|
||||
1. After connection setup, RAMBlock information is exchanged using
|
||||
this protocol before the actual migration begins. This information includes
|
||||
a description of each RAMBlock on the server side as well as the virtual addresses
|
||||
and lengths of each RAMBlock. This is used by the client to determine the
|
||||
start and stop locations of chunks and how to register them dynamically
|
||||
before performing the RDMA operations.
|
||||
2. During runtime, once a 'chunk' becomes full of pages ready to
|
||||
be sent with RDMA, the registration commands are used to ask the
|
||||
other side to register the memory for this chunk and respond
|
||||
with the result (rkey) of the registration.
|
||||
3. The QEMUFile interfaces also call these functions (described below)
|
||||
when transmitting non-live state, such as devices or to send
|
||||
its own protocol information during the migration process.
|
||||
4. Finally, zero pages are only checked if a page has not yet been registered
|
||||
using chunk registration (or not checked at all and unconditionally
|
||||
written if chunk registration is disabled). This is accomplished using
|
||||
the "Compress" command listed above. If the page *has* been registered
|
||||
then we check the entire chunk for zero. Only if the entire chunk is
|
||||
zero do we send a compress command to zap the page on the other side.
|
||||
|
||||
Versioning and Capabilities
|
||||
===========================
|
||||
Current version of the protocol is version #1.
|
||||
|
||||
The same version applies to both protocol traffic and capabilities
|
||||
negotiation. (i.e. There is only one version number that is referred to
|
||||
by all communication).
|
||||
|
||||
librdmacm provides the user with a 'private data' area to be exchanged
|
||||
at connection-setup time before any infiniband traffic is generated.
|
||||
|
||||
Header:
|
||||
* Version (protocol version validated before send/recv occurs),
|
||||
uint32, network byte order
|
||||
* Flags (bitwise OR of each capability),
|
||||
uint32, network byte order
|
||||
|
||||
There is no data portion of this header right now, so there is
|
||||
no length field. The maximum size of the 'private data' section
|
||||
is only 192 bytes per the Infiniband specification, so it's not
|
||||
very useful for data anyway. This structure needs to remain small.
|
||||
|
||||
This private data area is a convenient place to check for protocol
|
||||
versioning because the user does not need to register memory to
|
||||
transmit a few bytes of version information.
|
||||
|
||||
This is also a convenient place to negotiate capabilities
|
||||
(like dynamic page registration).
|
||||
|
||||
If the version is invalid, we throw an error.
|
||||
|
||||
If the version is new, we only negotiate the capabilities that the
|
||||
requested version is able to perform and ignore the rest.
|
||||
|
||||
Currently there is only one capability in Version #1: dynamic page registration
|
||||
|
||||
Finally: Negotiation happens with the Flags field: If the primary-VM
|
||||
sets a flag, but the destination does not support this capability, it
|
||||
will return a zero-bit for that flag and the primary-VM will understand
|
||||
that as not being an available capability and will thus disable that
|
||||
capability on the primary-VM side.
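
The private-data header and the flag negotiation described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the struct, the function names and the bit chosen for dynamic page registration are assumptions.

```c
#include <stdint.h>

#define CAP_DYNAMIC_PAGE_REG 0x1u /* hypothetical bit assignment */

struct rdma_cap_header {
    uint32_t version;
    uint32_t flags;
};

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Serialize both fields in network (big-endian) byte order. */
void cap_header_pack(const struct rdma_cap_header *h, uint8_t out[8])
{
    put_be32(out, h->version);
    put_be32(out + 4, h->flags);
}

/* Inverse of cap_header_pack(). */
void cap_header_unpack(const uint8_t in[8], struct rdma_cap_header *h)
{
    h->version = get_be32(in);
    h->flags = get_be32(in + 4);
}

/* Destination side: keep only the requested flags it also supports. */
uint32_t cap_negotiate(uint32_t requested, uint32_t supported)
{
    return requested & supported;
}
```

The primary-VM side would then treat any bit that comes back as zero as an unavailable capability and disable it locally.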
|
||||
|
||||
QEMUFileRDMA Interface:
|
||||
=======================
|
||||
|
||||
QEMUFileRDMA introduces a couple of new functions:
|
||||
|
||||
1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
|
||||
2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
|
||||
|
||||
These two functions are very short and simply use the protocol
|
||||
described above to deliver bytes without changing the upper-level
|
||||
users of QEMUFile that depend on a bytestream abstraction.
|
||||
|
||||
Finally, how do we hand off the actual bytes to get_buffer()?
|
||||
|
||||
Again, because we're trying to "fake" a bytestream abstraction
|
||||
using an analogy not unlike individual UDP frames, we have
|
||||
to hold on to the bytes received from control-channel's SEND
|
||||
messages in memory.
|
||||
|
||||
Each time we receive a complete "QEMU File" control-channel
|
||||
message, the bytes from SEND are copied into a small local holding area.
|
||||
|
||||
Then, we return the number of bytes requested by get_buffer()
|
||||
and leave the remaining bytes in the holding area until get_buffer()
|
||||
comes around for another pass.
|
||||
|
||||
If the buffer is empty, then we follow the same steps
|
||||
listed above and issue another "QEMU File" protocol command,
|
||||
asking for a new SEND message to re-fill the buffer.
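
The holding-area behaviour described above can be sketched like this. It is only an illustration under stated assumptions: the struct, the capacity and the function names are hypothetical, and the real code obtains the payload from a control-channel SEND rather than from a plain pointer.

```c
#include <stddef.h>
#include <string.h>

#define HOLD_CAP 4096 /* hypothetical capacity of the holding area */

struct holding_area {
    unsigned char data[HOLD_CAP];
    size_t len; /* bytes currently held */
    size_t pos; /* next unread byte */
};

/*
 * Copy the payload of one complete "QEMU File" message into the area.
 * Returns 0 on success, -1 if it does not fit or unread bytes remain.
 */
int hold_fill(struct holding_area *h, const void *payload, size_t n)
{
    if (n > HOLD_CAP || h->pos < h->len) {
        return -1;
    }
    memcpy(h->data, payload, n);
    h->len = n;
    h->pos = 0;
    return 0;
}

/*
 * Hand out up to 'want' bytes and keep the rest for the next call.
 * A return value of 0 means the area is empty and the caller has to
 * ask for another SEND message to refill it.
 */
size_t hold_read(struct holding_area *h, void *buf, size_t want)
{
    size_t avail = h->len - h->pos;
    size_t n = want < avail ? want : avail;

    memcpy(buf, h->data + h->pos, n);
    h->pos += n;
    return n;
}
```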
|
||||
|
||||
Migration of VM's ram:
|
||||
======================
|
||||
|
||||
At the beginning of the migration, (migration-rdma.c),
|
||||
the sender and the receiver populate the list of RAMBlocks
|
||||
to be registered with each other into a structure.
|
||||
Then, using the aforementioned protocol, they exchange a
|
||||
description of these blocks with each other, to be used later
|
||||
during the iteration of main memory. This description includes
|
||||
a list of all the RAMBlocks, their offsets and lengths, virtual
|
||||
addresses and possibly includes pre-registered RDMA keys in case dynamic
|
||||
page registration was disabled on the server-side, otherwise not.
|
||||
|
||||
Main memory is not migrated with the aforementioned protocol,
|
||||
but is instead migrated with normal RDMA Write operations.
|
||||
|
||||
Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
|
||||
Chunk size is not dynamic, but it could be in a future implementation.
|
||||
There's nothing to indicate that this is useful right now.
|
||||
|
||||
When a chunk is full (or a flush() occurs), the memory backed by
|
||||
the chunk is registered with librdmacm and pinned in memory on
|
||||
both sides using the aforementioned protocol.
|
||||
After pinning, an RDMA Write is generated and transmitted
|
||||
for the entire chunk.
|
||||
|
||||
Chunks are also transmitted in batches: This means that we
|
||||
do not request that the hardware signal the completion queue
|
||||
for the completion of *every* chunk. The current batch size
|
||||
is about 64 chunks (corresponding to 64 MB of memory).
|
||||
Only the last chunk in a batch must be signaled.
|
||||
This helps keep everything as asynchronous as possible
|
||||
and helps keep the hardware busy performing RDMA operations.
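
The chunk and batch arithmetic above can be sketched as two helpers. The names are hypothetical and the batch size of exactly 64 is taken from the "about 64 chunks" figure, so treat it as an assumption rather than a fixed constant of the implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_SHIFT 20u   /* 1 MB chunks, as stated above */
#define BATCH_CHUNKS 64u  /* assumed batch size ("about 64 chunks") */

/* Index of the chunk that contains a byte offset within a RAMBlock. */
uint64_t chunk_index(uint64_t offset)
{
    return offset >> CHUNK_SHIFT;
}

/*
 * Decide whether the next RDMA Write should request a completion.
 * 'writes_posted' counts the chunk writes posted before this one;
 * only the last write of each batch is signaled.
 */
bool write_needs_signal(uint64_t writes_posted)
{
    return (writes_posted + 1) % BATCH_CHUNKS == 0;
}
```

With these values, one completion is requested for every 64 MB of transferred memory.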
|
||||
|
||||
Error-handling:
|
||||
===============
|
||||
|
||||
Infiniband has what is called a "Reliable, Connected"
|
||||
link (one of 4 choices). This is the mode that
|
||||
we use for RDMA migration.
|
||||
|
||||
If a *single* message fails,
|
||||
the decision is to abort the migration entirely and
|
||||
clean up all the RDMA descriptors and unregister all
|
||||
the memory.
|
||||
|
||||
After cleanup, the Virtual Machine is returned to normal
|
||||
operation the same way that would happen if the TCP
|
||||
socket is broken during a non-RDMA based migration.
|
||||
|
||||
TODO:
|
||||
=====
|
||||
1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
|
||||
are not compatible with infiniband memory pinning and will result in
|
||||
an aborted migration (but with the source VM left unaffected).
|
||||
2. Use of the recent /proc/<pid>/pagemap would likely speed up
|
||||
the use of KSM and ballooning while using RDMA.
|
||||
3. Some form of balloon-device usage tracking would also
|
||||
help alleviate some issues.
|
||||
4. Use LRU to provide more fine-grained direction of UNREGISTER
|
||||
requests for unpinning memory in an overcommitted environment.
|
||||
5. Expose UNREGISTER support to the user by way of workload-specific
|
||||
hints about application behavior.
|
|
@ -1,24 +0,0 @@
|
|||
QEMU<->ACPI BIOS CPU hotplug interface
|
||||
--------------------------------------
|
||||
|
||||
QEMU supports CPU hotplug via ACPI. This document
|
||||
describes the interface between QEMU and the ACPI BIOS.
|
||||
|
||||
ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
|
||||
-----------------------------------------
|
||||
|
||||
Generic ACPI GPE block. Bit 2 (GPE.2) used to notify CPU
|
||||
hot-add/remove event to ACPI BIOS, via SCI interrupt.
|
||||
|
||||
CPU present bitmap for:
|
||||
ICH9-LPC (IO port 0x0cd8-0x0cf7, 1-byte access)
|
||||
PIIX-PM (IO port 0xaf00-0xaf1f, 1-byte access)
|
||||
---------------------------------------------------------------
|
||||
One bit per CPU. Bit position reflects corresponding CPU APIC ID.
|
||||
Read-only.
|
||||
|
||||
CPU hot-add/remove notification:
|
||||
-----------------------------------------------------
|
||||
QEMU sets/clears corresponding CPU bit on hot-add/remove event.
|
||||
CPU present map read by ACPI BIOS GPE.2 handler to notify OS of CPU
|
||||
hot-(un)plug events.
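
What a GPE.2 handler does with the present bitmap can be sketched in C over a snapshot of the bytes read from the IO ports. This is an illustration only: the function names are hypothetical, and it assumes that bit n of byte k stands for APIC ID 8*k + n.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the CPU with this APIC ID is marked present in the snapshot. */
bool cpu_present(const uint8_t *bitmap, size_t nbytes, unsigned apic_id)
{
    size_t byte = apic_id / 8;

    if (byte >= nbytes) {
        return false;
    }
    return ((bitmap[byte] >> (apic_id % 8)) & 1u) != 0;
}

/*
 * Compare two snapshots and store the APIC IDs whose presence changed.
 * Returns the number of IDs written to 'changed' (at most 'max').
 */
size_t cpu_changes(const uint8_t *before, const uint8_t *after,
                   size_t nbytes, unsigned *changed, size_t max)
{
    size_t n = 0;

    for (size_t i = 0; i < nbytes * 8 && n < max; i++) {
        if (cpu_present(before, nbytes, (unsigned)i) !=
            cpu_present(after, nbytes, (unsigned)i)) {
            changed[n++] = (unsigned)i;
        }
    }
    return n;
}
```

For the PIIX-PM range 0xaf00-0xaf1f the snapshot would be 32 bytes long, which covers 256 APIC IDs.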
|
|
@ -1,44 +0,0 @@
|
|||
QEMU<->ACPI BIOS memory hotplug interface
|
||||
-----------------------------------------
|
||||
|
||||
ACPI BIOS GPE.3 handler is dedicated for notifying OS about memory hot-add
|
||||
events.
|
||||
|
||||
Memory hot-plug interface (IO port 0xa00-0xa17, 1-4 byte access):
|
||||
---------------------------------------------------------------
|
||||
0xa00:
|
||||
read access:
|
||||
[0x0-0x3] Lo part of memory device phys address
|
||||
[0x4-0x7] Hi part of memory device phys address
|
||||
[0x8-0xb] Lo part of memory device size in bytes
|
||||
[0xc-0xf] Hi part of memory device size in bytes
|
||||
[0x10-0x13] Memory device proximity domain
|
||||
[0x14] Memory device status fields
|
||||
bits:
|
||||
0: Device is enabled and may be used by guest
|
||||
1: Device insert event, used to distinguish device for which
|
||||
no device check event to OSPM was issued.
|
||||
It's valid only when bit 1 is set.
|
||||
2-7: reserved and should be ignored by OSPM
|
||||
[0x15-0x17] reserved
|
||||
|
||||
write access:
|
||||
[0x0-0x3] Memory device slot selector, selects active memory device.
|
||||
All following accesses to other registers in 0xa00-0xa17
|
||||
region will read/store data from/to selected memory device.
|
||||
[0x4-0x7] OST event code reported by OSPM
|
||||
[0x8-0xb] OST status code reported by OSPM
|
||||
[0xc-0x13] reserved, writes into it are ignored
|
||||
[0x14] Memory device control fields
|
||||
bits:
|
||||
0: reserved, OSPM must clear it before writing to register
|
||||
1: if set to 1 clears device insert event, set by OSPM
|
||||
after it has emitted device check event for the
|
||||
selected memory device
|
||||
2-7: reserved, OSPM must clear them before writing to register
|
||||
|
||||
Selecting memory device slot beyond present range has no effect on platform:
|
||||
- write accesses to memory hot-plug registers not documented above are
|
||||
ignored
|
||||
- read accesses to memory hot-plug registers not documented above return
|
||||
all bits set to 1.
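
The read layout above can be decoded from a 0x18-byte snapshot of the register block as sketched below. The struct and function names are hypothetical, and the sketch assumes the multi-byte fields are little-endian, as is usual for x86 IO port accesses.

```c
#include <stdbool.h>
#include <stdint.h>

struct mem_dev_info {
    uint64_t phys_addr;  /* [0x0-0x7]   lo/hi parts combined */
    uint64_t size;       /* [0x8-0xf]   lo/hi parts combined */
    uint32_t proximity;  /* [0x10-0x13] proximity domain */
    bool enabled;        /* [0x14] bit 0 */
    bool insert_pending; /* [0x14] bit 1 */
};

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the registers of the currently selected memory device. */
void mem_dev_decode(const uint8_t regs[0x18], struct mem_dev_info *out)
{
    out->phys_addr = ((uint64_t)le32(regs + 0x4) << 32) | le32(regs + 0x0);
    out->size = ((uint64_t)le32(regs + 0xc) << 32) | le32(regs + 0x8);
    out->proximity = le32(regs + 0x10);
    out->enabled = (regs[0x14] & 0x01) != 0;
    out->insert_pending = (regs[0x14] & 0x02) != 0;
}

/* Control value for offset 0x14 that clears the insert event (bit 1). */
uint8_t mem_dev_ack_insert(void)
{
    return 0x02;
}
```

The slot selector has to be written to offset 0x0 first, so that the snapshot describes the intended memory device.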
|
|
@ -1,45 +0,0 @@
|
|||
QEMU<->ACPI BIOS PCI hotplug interface
|
||||
--------------------------------------
|
||||
|
||||
QEMU supports PCI hotplug via ACPI, for PCI bus 0. This document
|
||||
describes the interface between QEMU and the ACPI BIOS.
|
||||
|
||||
ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
|
||||
-----------------------------------------
|
||||
|
||||
Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject
|
||||
event to ACPI BIOS, via SCI interrupt.
|
||||
|
||||
PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access):
|
||||
---------------------------------------------------------------
|
||||
Slot injection notification pending. One bit per slot.
|
||||
|
||||
Read by ACPI BIOS GPE.1 handler to notify OS of injection
|
||||
events. Read-only.
|
||||
|
||||
PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
|
||||
-----------------------------------------------------
|
||||
Slot removal notification pending. One bit per slot.
|
||||
|
||||
Read by ACPI BIOS GPE.1 handler to notify OS of removal
|
||||
events. Read-only.
|
||||
|
||||
PCI device eject (IO port 0xae08-0xae0b, 4-byte access):
|
||||
----------------------------------------
|
||||
|
||||
Write: Used by ACPI BIOS _EJ0 method to request device removal.
|
||||
One bit per slot.
|
||||
|
||||
Read: Hotplug features register. Used by platform to identify features
|
||||
available. Current base feature set (no bits set):
|
||||
- Read-only "up" register @0xae00, 4-byte access, bit per slot
|
||||
- Read-only "down" register @0xae04, 4-byte access, bit per slot
|
||||
- Read/write "eject" register @0xae08, 4-byte access,
|
||||
write: bit per slot eject, read: hotplug feature set
|
||||
- Read-only hotplug capable register @0xae0c, 4-byte access, bit per slot
|
||||
|
||||
PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
|
||||
-----------------------------------------------
|
||||
|
||||
Used by ACPI BIOS _RMV method to indicate removability status to OS. One
|
||||
bit per slot. Read-only
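
The one-bit-per-slot convention shared by the registers above can be sketched as two helpers that work on 32-bit register values. The function names are hypothetical; how the values are read from or written to the IO ports is left out.

```c
#include <stdint.h>

/* Value to write to the eject register (0xae08) for a single slot. */
uint32_t pci_eject_value(unsigned slot)
{
    return slot < 32 ? (uint32_t)1 << slot : 0;
}

/*
 * Take the lowest pending slot out of an "up" or "down" register
 * snapshot. Returns the slot number, or -1 when no bit is left.
 */
int pci_next_slot(uint32_t *pending)
{
    for (unsigned slot = 0; slot < 32; slot++) {
        uint32_t bit = (uint32_t)1 << slot;

        if (*pending & bit) {
            *pending &= ~bit;
            return (int)slot;
        }
    }
    return -1;
}
```

A GPE.1 handler could call pci_next_slot() in a loop on the value read from 0xae00 or 0xae04 and notify the OS once per returned slot.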
|
|
@ -1,96 +0,0 @@
|
|||
|
||||
Device Specification for Inter-VM shared memory device
|
||||
------------------------------------------------------
|
||||
|
||||
The Inter-VM shared memory device is designed to share a region of memory to
|
||||
userspace in multiple virtual guests. The memory region does not belong to any
|
||||
guest, but is a POSIX memory object on the host. Optionally, the device may
|
||||
support sending interrupts to other guests sharing the same memory region.
|
||||
|
||||
|
||||
The Inter-VM PCI device
|
||||
-----------------------
|
||||
|
||||
*BARs*
|
||||
|
||||
The device supports three BARs. BAR0 is a 1 Kbyte MMIO region to support
|
||||
registers. BAR1 is used for MSI-X when it is enabled in the device. BAR2 is
|
||||
used to map the shared memory object from the host. The size of BAR2 is
|
||||
specified when the guest is started and must be a power of 2 in size.
|
||||
|
||||
*Registers*
|
||||
|
||||
The device currently supports 4 registers of 32-bits each. Registers
|
||||
are used for synchronization between guests sharing the same memory object when
|
||||
interrupts are supported (this requires using the shared memory server).
|
||||
|
||||
The server assigns each VM an ID number and sends this ID number to the QEMU
|
||||
process when the guest starts.
|
||||
|
||||
enum ivshmem_registers {
|
||||
IntrMask = 0,
|
||||
IntrStatus = 4,
|
||||
IVPosition = 8,
|
||||
Doorbell = 12
|
||||
};
|
||||
|
||||
The first two registers are the interrupt mask and status registers. Mask and
|
||||
status are only used with pin-based interrupts. They are unused with MSI
|
||||
interrupts.
|
||||
|
||||
Status Register: The status register is set to 1 when an interrupt occurs.
|
||||
|
||||
Mask Register: The mask register is bitwise ANDed with the interrupt status
|
||||
and the result will raise an interrupt if it is non-zero. However, since 1 is
|
||||
the only value the status will be set to, it is only the first bit of the mask
|
||||
that has any effect. Therefore interrupts can be masked by setting the first
|
||||
bit to 0 and unmasked by setting the first bit to 1.
|
||||
|
||||
IVPosition Register: The IVPosition register is read-only and reports the
|
||||
guest's ID number. The guest IDs are non-negative integers. When using the
|
||||
server, since the server is a separate process, the VM ID will only be set when
|
||||
the device is ready (shared memory is received from the server and accessible via
|
||||
the device). If the device is not ready, the IVPosition will return -1.
|
||||
Applications should ensure that they have a valid VM ID before accessing the
|
||||
shared memory.
|
||||
|
||||
Doorbell Register: To interrupt another guest, a guest must write to the
|
||||
Doorbell register. The doorbell register is 32-bits, logically divided into
|
||||
two 16-bit fields. The high 16-bits are the guest ID to interrupt and the low
|
||||
16-bits are the interrupt vector to trigger. The semantics of the value
|
||||
written to the doorbell depends on whether the device is using MSI or a regular
|
||||
pin-based interrupt. In short, MSI uses vectors while regular interrupts set the
|
||||
status register.
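
The register semantics above can be sketched with a few helpers. The function names are hypothetical, and the readiness check assumes that the -1 returned by IVPosition reads back as all ones in a 32-bit register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Doorbell value: destination guest ID in the high 16 bits, vector low. */
uint32_t ivshmem_doorbell(uint16_t peer_id, uint16_t vector)
{
    return ((uint32_t)peer_id << 16) | vector;
}

/* Pin-based interrupts: raised when status ANDed with mask is non-zero. */
bool ivshmem_irq_raised(uint32_t status, uint32_t mask)
{
    return (status & mask) != 0;
}

/* IVPosition reads as -1 until the device is ready (assumed 0xffffffff). */
bool ivshmem_ready(uint32_t ivposition)
{
    return ivposition != UINT32_MAX;
}
```

For example, ivshmem_doorbell(3, 1) yields 0x00030001, which asks for vector 1 to be raised in the guest with ID 3.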
|
||||
|
||||
Regular Interrupts
|
||||
|
||||
If regular interrupts are used (due to either a guest not supporting MSI or the
|
||||
user specifying not to use them on startup) then the value written to the lower
|
||||
16 bits of the Doorbell register is arbitrary and will trigger an
|
||||
interrupt in the destination guest.
|
||||
|
||||
Message Signalled Interrupts
|
||||
|
||||
An ivshmem device may support multiple MSI vectors. If so, the lower 16-bits
|
||||
written to the Doorbell register must be between 0 and the maximum number of
|
||||
vectors the guest supports. The lower 16 bits written to the doorbell are the
|
||||
MSI vector that will be raised in the destination guest. The number of MSI
|
||||
vectors is configurable but it is set when the VM is started.
|
||||
|
||||
The important thing to remember with MSI is that it is only a signal, no status
|
||||
is set (since MSI interrupts are not shared). All information other than the
|
||||
interrupt itself should be communicated via the shared memory region. Devices
|
||||
supporting multiple MSI vectors can use different vectors to indicate different
|
||||
events have occurred. The semantics of interrupt vectors are left to the
|
||||
user's discretion.
|
||||
|
||||
|
||||
Usage in the Guest
|
||||
------------------
|
||||
|
||||
The shared memory device is intended to be used with the provided UIO driver.
|
||||
Very little configuration is needed. The guest should map BAR0 to access the
|
||||
registers (an array of 32-bit ints allows simple writing) and map BAR2 to
|
||||
access the shared memory region itself. The size of the shared memory region
|
||||
is specified when the guest (or shared memory server) is started. A guest may
|
||||
map the whole shared memory region or only part of it.
|
|
@ -1,50 +0,0 @@
|
|||
|
||||
PCI IDs for qemu
|
||||
================
|
||||
|
||||
Red Hat, Inc. donates a part of its device ID range to qemu, to be used for
|
||||
virtual devices. The vendor IDs are 1af4 (formerly Qumranet ID) and 1b36.
|
||||
|
||||
Contact Gerd Hoffmann <kraxel@redhat.com> to get a device ID assigned
|
||||
for your devices.
|
||||
|
||||
1af4 vendor ID
|
||||
--------------
|
||||
|
||||
The 1000 -> 10ff device ID range is used as follows for virtio-pci devices.
|
||||
Note that this allocation is separate from the virtio device IDs, which are
|
||||
maintained as part of the virtio specification.
|
||||
|
||||
1af4:1000 network device
|
||||
1af4:1001 block device
|
||||
1af4:1002 balloon device
|
||||
1af4:1003 console device
|
||||
1af4:1004 SCSI host bus adapter device
|
||||
1af4:1005 entropy generator device
|
||||
1af4:1009 9p filesystem device
|
||||
|
||||
1af4:10f0 to 1af4:10ff Available for experimental usage without registration.
|
||||
Must get an official ID when the code leaves the test lab (i.e. when seeking
|
||||
upstream merge or shipping a distro/product) to avoid conflicts.
|
||||
|
||||
1af4:1100 Used as PCI Subsystem ID for existing hardware devices emulated
|
||||
by qemu.
|
||||
|
||||
1af4:1110 ivshmem device (shared memory, docs/specs/ivshmem_device_spec.txt)
|
||||
|
||||
All other device IDs are reserved.
|
||||
|
||||
1b36 vendor ID
|
||||
--------------
|
||||
|
||||
The 0000 -> 00ff device ID range is used as follows for QEMU-specific
|
||||
PCI devices (other than virtio):
|
||||
|
||||
1b36:0001 PCI-PCI bridge
|
||||
1b36:0002 PCI serial port (16550A) adapter (docs/specs/pci-serial.txt)
|
||||
1b36:0003 PCI Dual-port 16550A adapter (docs/specs/pci-serial.txt)
|
||||
1b36:0004 PCI Quad-port 16550A adapter (docs/specs/pci-serial.txt)
|
||||
|
||||
All these devices are documented in docs/specs.
|
||||
|
||||
The 0100 device ID is used for the QXL video card device.
|
|
@ -1,34 +0,0 @@
|
|||
|
||||
QEMU pci serial devices
|
||||
=======================
|
||||
|
||||
There is one single-port variant and two multiport variants. Linux
|
||||
guests work out of the box with all cards. There is a Windows inf file
|
||||
(docs/qemupciserial.inf) to set up the single-port card in Windows
|
||||
guests.
|
||||
|
||||
|
||||
single-port card
|
||||
----------------
|
||||
|
||||
Name: pci-serial
|
||||
PCI ID: 1b36:0002
|
||||
|
||||
PCI Region 0:
|
||||
IO bar, 8 bytes long, with the 16550 uart mapped to it.
|
||||
Interrupt is wired to pin A.
|
||||
|
||||
|
||||
multiport cards
|
||||
---------------
|
||||
|
||||
Name: pci-serial-2x
|
||||
PCI ID: 1b36:0003
|
||||
|
||||
Name: pci-serial-4x
|
||||
PCI ID: 1b36:0004
|
||||
|
||||
PCI Region 0:
|
||||
IO bar, with two/four 16550 uarts mapped one after another.
|
||||
The first is at offset 0, second at offset 8, ...
|
||||
Interrupt is wired to pin A.
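
The register layout of the cards can be sketched as an offset calculation. The function name is hypothetical; the 8-byte stride follows from the single-port description and from "second at offset 8" above.

```c
#define UART_STRIDE 8u /* each 16550 occupies 8 bytes of the IO bar */

/*
 * Offset of UART 'index' within PCI region 0 for a card with 1, 2 or 4
 * ports. Returns -1 for an unknown card size or a port that is absent.
 */
int pci_serial_uart_offset(unsigned ports, unsigned index)
{
    if ((ports != 1 && ports != 2 && ports != 4) || index >= ports) {
        return -1;
    }
    return (int)(index * UART_STRIDE);
}
```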
|
|
@ -1,26 +0,0 @@
|
|||
pci-test is a device used for testing low level IO
|
||||
|
||||
device implements up to two BARs: BAR0 and BAR1.
|
||||
Each BAR can be memory or IO. Guests must detect
|
||||
BAR type and act accordingly.
|
||||
|
||||
Each BAR size is up to 4K bytes.
|
||||
Each BAR starts with the following header:
|
||||
|
||||
typedef struct PCITestDevHdr {
|
||||
uint8_t test; <- write-only, starts a given test number
|
||||
uint8_t width_type; <- read-only, type and width of access for a given test.
|
||||
1,2,4 for byte,word or long write.
|
||||
any other value if test not supported on this BAR
|
||||
uint8_t pad0[2];
|
||||
uint32_t offset; <- read-only, offset in this BAR for a given test
|
||||
uint32_t data; <- read-only, data to use for a given test
|
||||
uint32_t count; <- for debugging. number of writes detected.
|
||||
uint8_t name[]; <- for debugging. 0-terminated ASCII string.
|
||||
} PCITestDevHdr;
|
||||
|
||||
All registers are little endian.
|
||||
|
||||
device is expected to always implement tests 0 to N on each BAR, and to add new
|
||||
tests with higher numbers. In this way a guest can scan test numbers until it
|
||||
detects an access type that it does not support on this BAR, then stop.
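
The scanning procedure described above can be sketched as a loop over test numbers. Everything here is hypothetical scaffolding: the accessor callbacks stand for whatever memory or IO access the guest uses for the BAR, and the in-memory fake device only exists so that the loop can be exercised without hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accessors for the 'test' and 'width_type' header fields. */
struct testdev_ops {
    void (*write_test)(void *bar, uint8_t test);
    uint8_t (*read_width_type)(void *bar);
};

/* 1, 2 and 4 mean byte, word and long; anything else is unsupported. */
bool testdev_width_ok(uint8_t width_type)
{
    return width_type == 1 || width_type == 2 || width_type == 4;
}

/* Count the tests 0..N-1 a BAR implements; stop at the first gap. */
unsigned testdev_count_tests(const struct testdev_ops *ops, void *bar)
{
    unsigned n = 0;

    while (n < 256) {
        ops->write_test(bar, (uint8_t)n);
        if (!testdev_width_ok(ops->read_width_type(bar))) {
            break;
        }
        n++;
    }
    return n;
}

/* In-memory stand-in for a BAR that implements 'supported' tests. */
struct fake_bar {
    uint8_t selected;
    uint8_t supported;
};

static void fake_write_test(void *bar, uint8_t test)
{
    ((struct fake_bar *)bar)->selected = test;
}

static uint8_t fake_read_width_type(void *bar)
{
    struct fake_bar *b = bar;

    return b->selected < b->supported ? 4 : 0;
}

const struct testdev_ops fake_ops = { fake_write_test, fake_read_width_type };
```

A guest driver would supply accessors that match the detected BAR type (memory or IO) in place of the fake ones.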
|
|
@ -1,78 +0,0 @@
|
|||
When used with the "pseries" machine type, qemu-system-ppc64 implements
|
||||
a set of hypervisor calls using a subset of the server "PAPR" specification
|
||||
(IBM internal at this point), which is also what IBM's proprietary hypervisor
|
||||
adheres to.
|
||||
|
||||
The subset is selected based on the requirements of Linux as a guest.
|
||||
|
||||
In addition to those calls, we have added our own private hypervisor
|
||||
calls which are mostly used as a private interface between the firmware
|
||||
running in the guest and QEMU.
|
||||
|
||||
All those hypercalls start at hcall number 0xf000 which corresponds
|
||||
to an implementation-specific range in PAPR.
|
||||
|
||||
- H_RTAS (0xf000)
|
||||
|
||||
RTAS is a set of runtime services generally provided by the firmware
|
||||
inside the guest to the operating system. It predates the existence
|
||||
of hypervisors (it was originally an extension to Open Firmware) and
|
||||
is still used by PAPR to provide various services that aren't performance
|
||||
sensitive.
|
||||
|
||||
We currently implement the RTAS services in QEMU itself. The actual RTAS
|
||||
"firmware" blob in the guest is a small stub of a few instructions which
|
||||
calls our private H_RTAS hypervisor call to pass the RTAS calls to QEMU.
|
||||
|
||||
Arguments:
|
||||
|
||||
r3 : H_RTAS (0xf000)
|
||||
r4 : Guest physical address of RTAS parameter block
|
||||
|
||||
Returns:
|
||||
|
||||
H_SUCCESS : Successfully called the RTAS function (RTAS result
|
||||
will have been stored in the parameter block)
|
||||
H_PARAMETER : Unknown token
|
||||
|
||||
- H_LOGICAL_MEMOP (0xf001)
|
||||
|
||||
When the guest runs in "real mode" (in powerpc lingua this means
|
||||
with MMU disabled, i.e. guest effective == guest physical), it only
|
||||
has access to a subset of memory and no IOs.
|
||||
|
||||
PAPR provides a set of hypervisor calls to perform cachable or
|
||||
non-cachable accesses to any guest physical addresses that the
|
||||
guest can use in order to access IO devices while in real mode.
|
||||
|
||||
This is typically used by the firmware running in the guest.
|
||||
|
||||
However, doing a hypercall for each access is extremely inefficient
|
||||
(even more so when running KVM) when accessing the frame buffer. In
|
||||
that case, things like scrolling become unusably slow.
|
||||
|
||||
This hypercall allows the guest to request a "memory op" to be applied
|
||||
to memory. The supported memory ops at this point are to copy a range
|
||||
of memory (supports overlap of source and destination) and XOR which
|
||||
is used by our SLOF firmware to invert the screen.
|
||||
|
||||
Arguments:
|
||||
|
||||
r3: H_LOGICAL_MEMOP (0xf001)
|
||||
r4: Guest physical address of destination
|
||||
r5: Guest physical address of source
|
||||
r6: Individual element size
|
||||
0 = 1 byte
|
||||
1 = 2 bytes
|
||||
2 = 4 bytes
|
||||
3 = 8 bytes
|
||||
r7: Number of elements
|
||||
r8: Operation
|
||||
0 = copy
|
||||
1 = xor
|
||||
|
||||
Returns:
|
||||
|
||||
H_SUCCESS : Success
|
||||
H_PARAMETER : Invalid argument
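
The argument handling above can be sketched against a flat byte array standing in for guest memory. This is an illustration under assumptions: the names and return codes are hypothetical stand-ins for H_SUCCESS and H_PARAMETER, and "xor" is taken to mean XOR with all ones (bitwise inversion), which matches the screen-inverting use mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    MEMOP_OK = 0,       /* stands in for H_SUCCESS */
    MEMOP_BAD_ARG = -1  /* stands in for H_PARAMETER */
};

/*
 * mem/mem_size: flat buffer standing in for guest physical memory.
 * dst, src:     offsets into that buffer (r4, r5).
 * esize:        element size code, 0..3 for 1, 2, 4, 8 bytes (r6).
 * count:        number of elements (r7).
 * op:           0 = copy, 1 = xor, assumed to mean inversion (r8).
 */
int logical_memop(uint8_t *mem, size_t mem_size, uint64_t dst, uint64_t src,
                  uint64_t esize, uint64_t count, uint64_t op)
{
    size_t len;

    if (esize > 3 || op > 1) {
        return MEMOP_BAD_ARG;
    }
    if (count > (mem_size >> esize)) {
        return MEMOP_BAD_ARG; /* total length would exceed the buffer */
    }
    len = (size_t)count << esize;
    if (dst > mem_size - len || src > mem_size - len) {
        return MEMOP_BAD_ARG;
    }

    memmove(mem + dst, mem + src, len); /* safe for overlapping ranges */
    if (op == 1) {
        for (size_t i = 0; i < len; i++) {
            mem[dst + i] = (uint8_t)~mem[dst + i];
        }
    }
    return MEMOP_OK;
}
```

Copying first and inverting the destination afterwards keeps the overlap handling in one place, at the cost of a second pass over the data.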
|
||||
|
|
@ -1,39 +0,0 @@
|
|||
PVPANIC DEVICE
|
||||
==============
|
||||
|
||||
pvpanic device is a simulated ISA device, through which a guest panic
|
||||
event is sent to qemu, and a QMP event is generated. This allows
|
||||
management apps (e.g. libvirt) to be notified and respond to the event.
|
||||
|
||||
The management app has the option of waiting for GUEST_PANICKED events,
|
||||
and/or polling for guest-panicked RunState, to learn when the pvpanic
|
||||
device has fired a panic event.
|
||||
|
||||
ISA Interface
|
||||
-------------
|
||||
|
||||
pvpanic exposes a single I/O port, by default 0x505. On read, the bits
|
||||
recognized by the device are set. Software should ignore bits it doesn't
|
||||
recognize. On write, the bits not recognized by the device are ignored.
|
||||
Software should set only bits both itself and the device recognize.
|
||||
Currently, only bit 0 is recognized; setting it indicates a guest panic
|
||||
has happened.
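
The read-then-write rule above can be sketched as a single helper. The names are hypothetical; the port read and write themselves are left to the caller.

```c
#include <stdint.h>

#define PVPANIC_PANICKED 0x01u /* bit 0: a guest panic has happened */

/*
 * 'device_bits' is the value read from the pvpanic port. The result is
 * the value to write back: only bits that both this software and the
 * device recognize. It is 0 if the device does not recognize bit 0.
 */
uint8_t pvpanic_event_value(uint8_t device_bits)
{
    const uint8_t guest_bits = PVPANIC_PANICKED;

    return (uint8_t)(device_bits & guest_bits);
}
```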
|
||||
|
||||
ACPI Interface
|
||||
--------------
|
||||
|
||||
pvpanic device is defined with ACPI ID "QEMU0001". Custom methods:
|
||||
|
||||
RDPT: To determine whether guest panic notification is supported.
|
||||
Arguments: None
|
||||
Return: Returns a byte, bit 0 set to indicate guest panic
|
||||
notification is supported. Other bits are reserved and
|
||||
should be ignored.
|
||||
|
||||
WRPT: To send a guest panic event
|
||||
Arguments: Arg0 is a byte, with bit 0 set to indicate guest panic has
|
||||
happened. Other bits are reserved and should be cleared.
|
||||
Return: None
|
||||
|
||||
The ACPI device will automatically refer to the right port in case it
|
||||
is modified.
|
|
@ -1,362 +0,0 @@
|
|||
== General ==
|
||||
|
||||
A qcow2 image file is organized in units of constant size, which are called
|
||||
(host) clusters. A cluster is the unit in which all allocations are done,
|
||||
both for actual guest data and for image metadata.
|
||||
|
||||
Likewise, the virtual disk as seen by the guest is divided into (guest)
|
||||
clusters of the same size.
|
||||
|
||||
All numbers in qcow2 are stored in Big Endian byte order.
|
||||
|
||||
|
||||
== Header ==
|
||||
|
||||
The first cluster of a qcow2 image contains the file header:
|
||||
|
||||
Byte 0 - 3: magic
|
||||
QCOW magic string ("QFI\xfb")
|
||||
|
||||
4 - 7: version
|
||||
Version number (valid values are 2 and 3)
|
||||
|
||||
8 - 15: backing_file_offset
|
||||
Offset into the image file at which the backing file name
|
||||
is stored (NB: The string is not null terminated). 0 if the
|
||||
image doesn't have a backing file.
|
||||
|
||||
16 - 19: backing_file_size
|
||||
Length of the backing file name in bytes. Must not be
|
||||
longer than 1023 bytes. Undefined if the image doesn't have
|
||||
a backing file.
|
||||
|
||||
20 - 23: cluster_bits
|
||||
Number of bits that are used for addressing an offset
|
||||
within a cluster (1 << cluster_bits is the cluster size).
|
||||
Must not be less than 9 (i.e. 512 byte clusters).
|
||||
|
||||
Note: qemu as of today has an implementation limit of 2 MB
|
||||
as the maximum cluster size and won't be able to open images
|
||||
with larger cluster sizes.
|
||||
|
||||
24 - 31: size
|
||||
Virtual disk size in bytes
|
||||
|
||||
32 - 35: crypt_method
|
||||
0 for no encryption
|
||||
1 for AES encryption
|
||||
|
||||
36 - 39: l1_size
|
||||
Number of entries in the active L1 table
|
||||
|
||||
40 - 47: l1_table_offset
|
||||
Offset into the image file at which the active L1 table
|
||||
starts. Must be aligned to a cluster boundary.
|
||||
|
||||
48 - 55: refcount_table_offset
|
||||
Offset into the image file at which the refcount table
|
||||
starts. Must be aligned to a cluster boundary.
|
||||
|
||||
56 - 59: refcount_table_clusters
|
||||
Number of clusters that the refcount table occupies
|
||||
|
||||
60 - 63: nb_snapshots
|
||||
Number of snapshots contained in the image
|
||||
|
||||
64 - 71: snapshots_offset
|
||||
Offset into the image file at which the snapshot table
|
||||
starts. Must be aligned to a cluster boundary.
|
||||
|
||||
If the version is 3 or higher, the header has the following additional fields.
|
||||
For version 2, the values are assumed to be zero, unless specified otherwise
|
||||
in the description of a field.
|
||||
|
||||
72 - 79: incompatible_features
|
||||
Bitmask of incompatible features. An implementation must
|
||||
fail to open an image if an unknown bit is set.
|
||||
|
||||
Bit 0: Dirty bit. If this bit is set then refcounts
|
||||
may be inconsistent, make sure to scan L1/L2
|
||||
tables to repair refcounts before accessing the
|
||||
image.
|
||||
|
||||
Bit 1: Corrupt bit. If this bit is set then any data
|
||||
structure may be corrupt and the image must not
|
||||
be written to (unless for regaining
|
||||
consistency).
|
||||
|
||||
Bits 2-63: Reserved (set to 0)
|
||||
|
||||
80 - 87: compatible_features
|
||||
Bitmask of compatible features. An implementation can
|
||||
safely ignore any unknown bits that are set.
|
||||
|
||||
Bit 0: Lazy refcounts bit. If this bit is set then
|
||||
lazy refcount updates can be used. This means
|
||||
marking the image file dirty and postponing
|
||||
refcount metadata updates.
|
||||
|
||||
Bits 1-63: Reserved (set to 0)
|
||||
|
||||
88 - 95: autoclear_features
|
||||
Bitmask of auto-clear features. An implementation may only
|
||||
write to an image with unknown auto-clear features if it
|
||||
clears the respective bits from this field first.
|
||||
|
||||
Bits 0-63: Reserved (set to 0)
|
||||
|
||||
96 - 99: refcount_order
|
||||
Describes the width of a reference count block entry (width
|
||||
in bits: refcount_bits = 1 << refcount_order). For version 2
|
||||
images, the order is always assumed to be 4
|
||||
(i.e. refcount_bits = 16).
|
||||
This value may not exceed 6 (i.e. refcount_bits = 64).
|
||||
|
||||
100 - 103: header_length
|
||||
Length of the header structure in bytes. For version 2
|
||||
images, the length is always assumed to be 72 bytes.
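The field offsets above can be turned into a small parser. The following is a minimal sketch, not QEMU's implementation: it covers only bytes 24 - 103 as listed here, assumes big-endian integers (the byte order is not restated in this part of the text), and the names parse_header_fields, V2_TAIL and V3_EXTRA are made up for illustration.

```python
import struct

# Bytes 24-71 exist in every version; bytes 72-103 only in version 3+.
# Big-endian (">") is an assumption about the qcow2 byte order.
V2_TAIL = struct.Struct(">QIIQQIIQ")   # size .. snapshots_offset
V3_EXTRA = struct.Struct(">QQQII")     # incompatible_features .. header_length

V2_NAMES = ("size", "crypt_method", "l1_size", "l1_table_offset",
            "refcount_table_offset", "refcount_table_clusters",
            "nb_snapshots", "snapshots_offset")
V3_NAMES = ("incompatible_features", "compatible_features",
            "autoclear_features", "refcount_order", "header_length")

KNOWN_INCOMPATIBLE = 0b11  # bit 0 (dirty) and bit 1 (corrupt)


def parse_header_fields(header: bytes, version: int) -> dict:
    """Decode header bytes 24-103 and apply the version 2 defaults."""
    fields = dict(zip(V2_NAMES, V2_TAIL.unpack_from(header, 24)))
    if version >= 3:
        fields.update(zip(V3_NAMES, V3_EXTRA.unpack_from(header, 72)))
    else:
        # Defaults stated in the field descriptions for version 2 images.
        fields.update(incompatible_features=0, compatible_features=0,
                      autoclear_features=0, refcount_order=4,
                      header_length=72)
    unknown = fields["incompatible_features"] & ~KNOWN_INCOMPATIBLE
    if unknown:
        # An unknown incompatible bit means the image must not be opened.
        raise ValueError(f"unknown incompatible feature bits: {unknown:#x}")
    if fields["refcount_order"] > 6:
        raise ValueError("refcount_order may not exceed 6")
    fields["refcount_bits"] = 1 << fields["refcount_order"]
    return fields
```

Unknown compatible bits are deliberately not checked, since the text allows them to be ignored.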
|
||||
|
||||
Directly after the image header, optional sections called header extensions can
|
||||
be stored. Each extension has a structure like the following:
|
||||
|
||||
Byte 0 - 3: Header extension type:
|
||||
0x00000000 - End of the header extension area
|
||||
0xE2792ACA - Backing file format name
|
||||
0x6803f857 - Feature name table
|
||||
other - Unknown header extension, can be safely
|
||||
ignored
|
||||
|
||||
4 - 7: Length of the header extension data
|
||||
|
||||
8 - n: Header extension data
|
||||
|
||||
n - m: Padding to round up the header extension size to the next
|
||||
multiple of 8.
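As an illustration of the extension layout, here is a hedged sketch of walking the header extension area. It assumes big-endian integers and that the caller passes the offset at which the extensions start (the header length); iter_header_extensions and the EXT_* names are hypothetical.

```python
import struct

EXT_END = 0x00000000
EXT_BACKING_FORMAT = 0xE2792ACA
EXT_FEATURE_NAMES = 0x6803F857


def iter_header_extensions(buf: bytes, start: int):
    """Yield (type, data) for each extension beginning at byte `start`."""
    pos = start
    while pos + 8 <= len(buf):
        ext_type, length = struct.unpack_from(">II", buf, pos)
        if ext_type == EXT_END:
            return
        data = buf[pos + 8:pos + 8 + length]
        if len(data) < length:
            raise ValueError("truncated header extension")
        yield ext_type, data
        # Skip the data plus the padding up to the next multiple of 8.
        pos += 8 + ((length + 7) & ~7)
```

A reader that does not recognise a type simply skips the yielded data, which matches the "can be safely ignored" rule.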
|
||||
|
||||
Unless stated otherwise, each header extension type shall appear at most once
|
||||
in the same image.
|
||||
|
||||
If the image has a backing file then the backing file name should be stored in
|
||||
the remaining space between the end of the header extension area and the end of
|
||||
the first cluster. It is not allowed to store other data here, so that an
|
||||
implementation can safely modify the header and add extensions without harming
|
||||
data of compatible features that it doesn't support. Compatible features that
|
||||
need space for additional data can use a header extension.
|
||||
|
||||
|
||||
== Feature name table ==
|
||||
|
||||
The feature name table is an optional header extension that contains the name
|
||||
for features used by the image. It can be used by applications that don't know
|
||||
the respective feature (e.g. because the feature was introduced only later) to
|
||||
display a useful error message.
|
||||
|
||||
The number of entries in the feature name table is determined by the length of
|
||||
the header extension data. Each entry looks like this:
|
||||
|
||||
Byte 0: Type of feature (select feature bitmap)
|
||||
0: Incompatible feature
|
||||
1: Compatible feature
|
||||
2: Autoclear feature
|
||||
|
||||
1: Bit number within the selected feature bitmap (valid
|
||||
values: 0-63)
|
||||
|
||||
2 - 47: Feature name (padded with zeros, but not necessarily null
|
||||
terminated if it has full length)
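A minimal sketch of decoding such a table, assuming the extension data has already been extracted as a byte string. Each entry is 48 bytes (type, bit number, 46 bytes of name); parse_feature_name_table is a hypothetical helper name.

```python
FEATURE_TYPES = {0: "incompatible", 1: "compatible", 2: "autoclear"}
ENTRY_SIZE = 48  # 1 byte type + 1 byte bit number + 46 bytes name


def parse_feature_name_table(data: bytes):
    """Return a list of (feature type, bit number, name) tuples."""
    entries = []
    for pos in range(0, len(data) - ENTRY_SIZE + 1, ENTRY_SIZE):
        kind, bit = data[pos], data[pos + 1]
        raw = data[pos + 2:pos + ENTRY_SIZE]
        # The name is zero padded, but a full-length name has no terminator.
        name = raw.split(b"\0", 1)[0].decode("ascii", errors="replace")
        entries.append((FEATURE_TYPES.get(kind, f"unknown({kind})"), bit, name))
    return entries
```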
|
||||
|
||||
|
||||
== Host cluster management ==
|
||||
|
||||
qcow2 manages the allocation of host clusters by maintaining a reference count
|
||||
for each host cluster. A refcount of 0 means that the cluster is free, 1 means
|
||||
that it is used, and >= 2 means that it is used and any write access must
|
||||
perform a COW (copy on write) operation.
|
||||
|
||||
The refcounts are managed in a two-level table. The first level is called
|
||||
refcount table and has a variable size (which is stored in the header). The
|
||||
refcount table can cover multiple clusters, however it needs to be contiguous
|
||||
in the image file.
|
||||
|
||||
It contains pointers to the second level structures which are called refcount
|
||||
blocks and are exactly one cluster in size.
|
||||
|
||||
Given an offset into the image file, the refcount of its cluster can be obtained
|
||||
as follows:
|
||||
|
||||
refcount_block_entries = (cluster_size * 8 / refcount_bits)
|
||||
|
||||
refcount_block_index = (offset / cluster_size) % refcount_block_entries
|
||||
refcount_table_index = (offset / cluster_size) / refcount_block_entries
|
||||
|
||||
refcount_block = load_cluster(refcount_table[refcount_table_index]);
|
||||
return refcount_block[refcount_block_index];
|
||||
|
||||
Refcount table entry:
|
||||
|
||||
Bit 0 - 8: Reserved (set to 0)
|
||||
|
||||
9 - 63: Bits 9-63 of the offset into the image file at which the
|
||||
refcount block starts. Must be aligned to a cluster
|
||||
boundary.
|
||||
|
||||
If this is 0, the corresponding refcount block has not yet
|
||||
been allocated. All refcounts managed by this refcount block
|
||||
are 0.
|
||||
|
||||
Refcount block entry (x = refcount_bits - 1):
|
||||
|
||||
Bit 0 - x: Reference count of the cluster. If refcount_bits implies a
|
||||
sub-byte width, note that bit 0 means the least significant
|
||||
bit in this context.
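The lookup and the two entry formats can be combined into one function. This is a sketch under stated assumptions: the whole image is available as a byte string, integers are big-endian, and the refcount table is large enough for the requested index (no bounds check against refcount_table_clusters). get_refcount is a hypothetical name.

```python
def get_refcount(image: bytes, offset: int, cluster_bits: int,
                 refcount_order: int, refcount_table_offset: int) -> int:
    """Return the refcount of the host cluster containing `offset`."""
    cluster_size = 1 << cluster_bits
    refcount_bits = 1 << refcount_order
    block_entries = cluster_size * 8 // refcount_bits

    cluster_index = offset // cluster_size
    block_index = cluster_index % block_entries
    table_index = cluster_index // block_entries

    entry_pos = refcount_table_offset + 8 * table_index
    table_entry = int.from_bytes(image[entry_pos:entry_pos + 8], "big")
    block_offset = table_entry & ~0x1FF  # bits 0-8 are reserved
    if block_offset == 0:
        return 0  # refcount block not allocated: all refcounts are 0

    bit_pos = block_index * refcount_bits
    if refcount_bits >= 8:
        start = block_offset + bit_pos // 8
        return int.from_bytes(image[start:start + refcount_bits // 8], "big")
    # Sub-byte widths: bit 0 is the least significant bit of the byte.
    byte = image[block_offset + bit_pos // 8]
    return (byte >> (bit_pos % 8)) & ((1 << refcount_bits) - 1)
```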
|
||||
|
||||
|
||||
== Cluster mapping ==
|
||||
|
||||
Just as for refcounts, qcow2 uses a two-level structure for the mapping of
|
||||
guest clusters to host clusters. They are called the L1 and L2 tables.
|
||||
|
||||
The L1 table has a variable size (stored in the header) and may use multiple
|
||||
clusters, however it must be contiguous in the image file. L2 tables are
|
||||
exactly one cluster in size.
|
||||
|
||||
Given an offset into the virtual disk, the offset into the image file can be
|
||||
obtained as follows:
|
||||
|
||||
l2_entries = (cluster_size / sizeof(uint64_t))
|
||||
|
||||
l2_index = (offset / cluster_size) % l2_entries
|
||||
l1_index = (offset / cluster_size) / l2_entries
|
||||
|
||||
l2_table = load_cluster(l1_table[l1_index]);
|
||||
cluster_offset = l2_table[l2_index];
|
||||
|
||||
return cluster_offset + (offset % cluster_size)
|
||||
|
||||
L1 table entry:
|
||||
|
||||
Bit 0 - 8: Reserved (set to 0)
|
||||
|
||||
9 - 55: Bits 9-55 of the offset into the image file at which the L2
|
||||
table starts. Must be aligned to a cluster boundary. If the
|
||||
offset is 0, the L2 table and all clusters described by this
|
||||
L2 table are unallocated.
|
||||
|
||||
56 - 62: Reserved (set to 0)
|
||||
|
||||
63: 0 for an L2 table that is unused or requires COW, 1 if its
|
||||
refcount is exactly one. This information is only accurate
|
||||
in the active L1 table.
|
||||
|
||||
L2 table entry:
|
||||
|
||||
Bit 0 - 61: Cluster descriptor
|
||||
|
||||
62: 0 for standard clusters
|
||||
1 for compressed clusters
|
||||
|
||||
63: 0 for a cluster that is unused or requires COW, 1 if its
|
||||
refcount is exactly one. This information is only accurate
|
||||
in L2 tables that are reachable from the active L1
|
||||
table.
|
||||
|
||||
Standard Cluster Descriptor:
|
||||
|
||||
Bit 0: If set to 1, the cluster reads as all zeros. The host
|
||||
cluster offset can be used to describe a preallocation,
|
||||
but it won't be used for reading data from this cluster,
|
||||
nor is data read from the backing file if the cluster is
|
||||
unallocated.
|
||||
|
||||
With version 2, this is always 0.
|
||||
|
||||
1 - 8: Reserved (set to 0)
|
||||
|
||||
9 - 55: Bits 9-55 of host cluster offset. Must be aligned to a
|
||||
cluster boundary. If the offset is 0, the cluster is
|
||||
unallocated.
|
||||
|
||||
56 - 61: Reserved (set to 0)
|
||||
|
||||
|
||||
Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
|
||||
|
||||
Bit 0 - x-1: Host cluster offset. This is usually _not_ aligned to a
|
||||
cluster boundary!
|
||||
|
||||
x - 61: Compressed size of the images in sectors of 512 bytes
|
||||
|
||||
If a cluster is unallocated, read requests shall read the data from the backing
|
||||
file (except if bit 0 in the Standard Cluster Descriptor is set). If there is
|
||||
no backing file or the backing file is smaller than the image, they shall read
|
||||
zeros for all parts that are not covered by the backing file.
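The index arithmetic and the bit layouts of L1 and standard L2 entries can be sketched as follows. This is an illustration only: it does not decode compressed cluster descriptors, and the function names and the "copied" key (for bit 63) are invented for this example.

```python
OFFSET_MASK = ((1 << 56) - 1) & ~((1 << 9) - 1)  # bits 9-55


def split_guest_offset(offset: int, cluster_bits: int):
    """Return (l1_index, l2_index, offset within the cluster)."""
    cluster_size = 1 << cluster_bits
    l2_entries = cluster_size // 8  # sizeof(uint64_t) == 8
    cluster_index = offset // cluster_size
    return (cluster_index // l2_entries,
            cluster_index % l2_entries,
            offset % cluster_size)


def decode_l1_entry(entry: int) -> dict:
    return {"l2_offset": entry & OFFSET_MASK,       # 0 means unallocated
            "copied": bool((entry >> 63) & 1)}      # refcount exactly one


def decode_l2_entry(entry: int) -> dict:
    info = {"compressed": bool((entry >> 62) & 1),
            "copied": bool((entry >> 63) & 1)}
    if not info["compressed"]:
        # Standard cluster descriptor.
        info["reads_as_zero"] = bool(entry & 1)
        info["host_offset"] = entry & OFFSET_MASK   # 0 means unallocated
    return info
```

A reader would fall back to the backing file (or zeros) when host_offset is 0 and reads_as_zero is not set.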
|
||||
|
||||
|
||||
== Snapshots ==
|
||||
|
||||
qcow2 supports internal snapshots. Their basic principle of operation is to
|
||||
switch the active L1 table, so that a different set of host clusters is
|
||||
exposed to the guest.
|
||||
|
||||
When creating a snapshot, the L1 table should be copied and the refcount of all
|
||||
L2 tables and clusters reachable from this L1 table must be increased, so that
|
||||
a write causes a COW and isn't visible in other snapshots.
|
||||
|
||||
When loading a snapshot, bit 63 of all entries in the new active L1 table and
|
||||
all L2 tables referenced by it must be reconstructed from the refcount table
|
||||
as it doesn't need to be accurate in inactive L1 tables.
|
||||
|
||||
A directory of all snapshots is stored in the snapshot table, a contiguous area
|
||||
in the image file, whose starting offset and length are given by the header
|
||||
fields snapshots_offset and nb_snapshots. The entries of the snapshot table
|
||||
have variable length, depending on the length of ID, name and extra data.
|
||||
|
||||
Snapshot table entry:
|
||||
|
||||
Byte 0 - 7: Offset into the image file at which the L1 table for the
|
||||
snapshot starts. Must be aligned to a cluster boundary.
|
||||
|
||||
8 - 11: Number of entries in the L1 table of the snapshot
|
||||
|
||||
12 - 13: Length of the unique ID string describing the snapshot
|
||||
|
||||
14 - 15: Length of the name of the snapshot
|
||||
|
||||
16 - 19: Time at which the snapshot was taken in seconds since the
|
||||
Epoch
|
||||
|
||||
20 - 23: Subsecond part of the time at which the snapshot was taken
|
||||
in nanoseconds
|
||||
|
||||
24 - 31: Time that the guest was running until the snapshot was
|
||||
taken in nanoseconds
|
||||
|
||||
32 - 35: Size of the VM state in bytes. 0 if no VM state is saved.
|
||||
If there is VM state, it starts at the first cluster
|
||||
described by the first L1 table entry that doesn't describe a
|
||||
regular guest cluster (i.e. VM state is stored like guest
|
||||
disk content, except that it is stored at offsets that are
|
||||
larger than the virtual disk presented to the guest)
|
||||
|
||||
36 - 39: Size of extra data in the table entry (used for future
|
||||
extensions of the format)
|
||||
|
||||
variable: Extra data for future extensions. Unknown fields must be
|
||||
ignored. Currently defined are (offset relative to snapshot
|
||||
table entry):
|
||||
|
||||
Byte 40 - 47: Size of the VM state in bytes. 0 if no VM
|
||||
state is saved. If this field is present,
|
||||
the 32-bit value in bytes 32-35 is ignored.
|
||||
|
||||
Byte 48 - 55: Virtual disk size of the snapshot in bytes
|
||||
|
||||
Version 3 images must include extra data at least up to
|
||||
byte 55.
|
||||
|
||||
variable: Unique ID string for the snapshot (not null terminated)
|
||||
|
||||
variable: Name of the snapshot (not null terminated)
|
||||
|
||||
variable: Padding to round up the snapshot table entry size to the
|
||||
next multiple of 8.
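Because the entries have variable length, a reader has to compute where the next one begins. The sketch below decodes one entry and returns the position of the following one. It assumes big-endian integers and UTF-8 compatible strings; parse_snapshot_entry and the dictionary keys are hypothetical names.

```python
import struct

FIXED = struct.Struct(">QIHHIIQII")  # bytes 0-39 of a snapshot table entry


def parse_snapshot_entry(buf: bytes, pos: int = 0):
    """Decode one entry at `pos`; return (entry dict, position of next entry)."""
    (l1_offset, l1_size, id_len, name_len, secs, nsecs,
     vm_clock_ns, vm_state_size, extra_size) = FIXED.unpack_from(buf, pos)
    cur = pos + FIXED.size
    extra = buf[cur:cur + extra_size]
    cur += extra_size

    snap = {"l1_offset": l1_offset, "l1_size": l1_size,
            "date_sec": secs, "date_nsec": nsecs,
            "vm_clock_ns": vm_clock_ns, "vm_state_size": vm_state_size}
    if extra_size >= 8:    # bytes 40-47 override the 32-bit VM state size
        snap["vm_state_size"] = struct.unpack_from(">Q", extra, 0)[0]
    if extra_size >= 16:   # bytes 48-55: virtual disk size of the snapshot
        snap["disk_size"] = struct.unpack_from(">Q", extra, 8)[0]

    snap["id"] = buf[cur:cur + id_len].decode("utf-8", errors="replace")
    cur += id_len
    snap["name"] = buf[cur:cur + name_len].decode("utf-8", errors="replace")
    cur += name_len

    # Entries are padded to the next multiple of 8 bytes.
    next_pos = pos + ((cur - pos + 7) & ~7)
    return snap, next_pos
```

Calling it nb_snapshots times, starting at snapshots_offset and feeding each returned position back in, walks the whole table.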
|
|
@ -1,138 +0,0 @@
|
|||
=Specification=
|
||||
|
||||
The file format looks like this:
|
||||
|
||||
+----------+----------+----------+-----+
|
||||
| cluster0 | cluster1 | cluster2 | ... |
|
||||
+----------+----------+----------+-----+
|
||||
|
||||
The first cluster begins with the '''header'''. The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file. A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''. L1 and L2 tables are composed of one or more contiguous clusters.
|
||||
|
||||
Normally the file size will be a multiple of the cluster size. If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written. Legitimate extra information should use space between the header and the first regular cluster.
|
||||
|
||||
All fields are little-endian.
|
||||
|
||||
==Header==
|
||||
Header {
|
||||
uint32_t magic; /* QED\0 */
|
||||
|
||||
uint32_t cluster_size; /* in bytes */
|
||||
uint32_t table_size; /* for L1 and L2 tables, in clusters */
|
||||
uint32_t header_size; /* in clusters */
|
||||
|
||||
uint64_t features; /* format feature bits */
|
||||
uint64_t compat_features; /* compat feature bits */
|
||||
uint64_t autoclear_features; /* self-resetting feature bits */
|
||||
|
||||
uint64_t l1_table_offset; /* in bytes */
|
||||
uint64_t image_size; /* total logical image size, in bytes */
|
||||
|
||||
/* if (features & QED_F_BACKING_FILE) */
|
||||
uint32_t backing_filename_offset; /* in bytes from start of header */
|
||||
uint32_t backing_filename_size; /* in bytes */
|
||||
}
|
||||
|
||||
Field descriptions:
|
||||
* ''cluster_size'' must be a power of 2 in range [2^12, 2^26].
|
||||
* ''table_size'' must be a power of 2 in range [1, 16].
|
||||
* ''header_size'' is the number of clusters used by the header and any additional information stored before regular clusters.
|
||||
* ''features'', ''compat_features'', and ''autoclear_features'' are file format extension bitmaps. They work as follows:
|
||||
** An image with unknown ''features'' bits enabled must not be opened. File format changes that are not backwards-compatible must use ''features'' bits.
|
||||
** An image with unknown ''compat_features'' bits enabled can be opened safely. The unknown features are simply ignored and represent backwards-compatible changes to the file format.
|
||||
** An image with unknown ''autoclear_features'' bits enabled can be opened safely after clearing the unknown bits. This allows for backwards-compatible changes to the file format which degrade gracefully and can be re-enabled again by a new program later.
|
||||
* ''l1_table_offset'' is the offset of the first byte of the L1 table in the image file and must be a multiple of ''cluster_size''.
|
||||
* ''image_size'' is the block device size seen by the guest and must be a multiple of 512 bytes.
|
||||
* ''backing_filename_offset'' and ''backing_filename_size'' describe a string in (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints. The string must be stored within the first ''header_size'' clusters. The backing filename may be an absolute path or relative to the image file.
|
||||
|
||||
Feature bits:
|
||||
* QED_F_BACKING_FILE = 0x01. The image uses a backing file.
|
||||
* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use.
|
||||
* QED_F_BACKING_FORMAT_NO_PROBE = 0x04. The backing file is a raw disk image and no file format autodetection should be attempted. This should be used to ensure that raw backing files are never detected as an image format if they happen to contain magic constants.
|
||||
|
||||
There are currently no defined ''compat_features'' or ''autoclear_features'' bits.
|
||||
|
||||
Fields predicated on a feature bit are only used when that feature is set. The fields always take up header space, regardless of whether or not the feature bit is set.
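The header layout and the constraints in the field descriptions can be sketched as a parser. This is an illustration, not QEMU's code: it assumes the struct is stored without padding exactly as declared (64 bytes, little-endian), and parse_qed_header is a hypothetical name.

```python
import struct

QED_HEADER = struct.Struct("<4sIIIQQQQQII")  # 64 bytes, little-endian
QED_F_BACKING_FILE = 0x01
QED_F_NEED_CHECK = 0x02
QED_F_BACKING_FORMAT_NO_PROBE = 0x04
KNOWN_FEATURES = 0x07


def is_pow2(n: int) -> bool:
    return n > 0 and n & (n - 1) == 0


def parse_qed_header(buf: bytes) -> dict:
    (magic, cluster_size, table_size, header_size, features, compat_features,
     autoclear_features, l1_table_offset, image_size,
     name_offset, name_size) = QED_HEADER.unpack_from(buf)
    if magic != b"QED\0":
        raise ValueError("not a QED image")
    if not (is_pow2(cluster_size) and 2**12 <= cluster_size <= 2**26):
        raise ValueError("bad cluster_size")
    if not (is_pow2(table_size) and 1 <= table_size <= 16):
        raise ValueError("bad table_size")
    if features & ~KNOWN_FEATURES:
        raise ValueError("unknown feature bits; image must not be opened")
    if l1_table_offset % cluster_size or image_size % 512:
        raise ValueError("misaligned l1_table_offset or image_size")

    header = {"cluster_size": cluster_size, "table_size": table_size,
              "header_size": header_size, "features": features,
              "compat_features": compat_features,
              "autoclear_features": autoclear_features,
              "l1_table_offset": l1_table_offset, "image_size": image_size}
    if features & QED_F_BACKING_FILE:
        # The string must lie within the first header_size clusters.
        if name_offset + name_size > header_size * cluster_size:
            raise ValueError("backing filename outside header")
        raw = buf[name_offset:name_offset + name_size]
        header["backing_filename"] = raw.decode("utf-8", errors="replace")
    return header
```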
|
||||
|
||||
==Tables==
|
||||
|
||||
Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
|
||||
|
||||
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
|
||||
|
||||
Table {
|
||||
uint64_t offsets[TABLE_NOFFSETS];
|
||||
}
|
||||
|
||||
The tables are organized as follows:
|
||||
|
||||
+----------+
|
||||
| L1 table |
|
||||
+----------+
|
||||
,------' | '------.
|
||||
+----------+ | +----------+
|
||||
| L2 table | ... | L2 table |
|
||||
+----------+ +----------+
|
||||
,------' | '------.
|
||||
+----------+ | +----------+
|
||||
| Data | ... | Data |
|
||||
+----------+ +----------+
|
||||
|
||||
A table is made up of one or more contiguous clusters. The table_size header field determines table size for an image file. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
|
||||
|
||||
The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
|
||||
header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
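As a worked example of this limit, the following sketch computes TABLE_NOFFSETS and the largest logical image size for the cluster_size=64 KB, table_size=4 case mentioned above. The helper names are illustrative only.

```python
def table_noffsets(cluster_size: int, table_size: int) -> int:
    """Number of 64-bit offsets in one table."""
    return table_size * cluster_size // 8


def max_image_size(cluster_size: int, table_size: int) -> int:
    """Largest image_size that a two-level table hierarchy can map."""
    n = table_noffsets(cluster_size, table_size)
    return n * n * cluster_size


n = table_noffsets(64 * 1024, 4)       # 32768 offsets per 256 KB table
limit = max_image_size(64 * 1024, 4)   # 2**46 bytes, i.e. 64 TiB
```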
|
||||
|
||||
L1, L2, and data cluster offsets must be aligned to header.cluster_size. The following offsets have special meanings:
|
||||
|
||||
===L2 table offsets===
|
||||
* 0 - unallocated. The L2 table is not yet allocated.
|
||||
|
||||
===Data cluster offsets===
|
||||
* 0 - unallocated. The data cluster is not yet allocated.
|
||||
* 1 - zero. The data cluster contents are all zeroes and no cluster is allocated.
|
||||
|
||||
Future format extensions may wish to store per-offset information. The least significant 12 bits of an offset are reserved for this purpose and must be set to zero. Image files with cluster_size > 2^12 will have more unused bits which should also be zeroed.
|
||||
|
||||
===Unallocated L2 tables and data clusters===
|
||||
Reads to an unallocated area of the image file access the backing file. If there is no backing file, then zeroes are produced. The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes.
|
||||
|
||||
Writes to an unallocated area cause a new data cluster to be allocated, and a new L2 table if that is also unallocated. The new data cluster is populated with data from the backing file (or zeroes if no backing file) and the data being written.
|
||||
|
||||
===Zero data clusters===
|
||||
Zero data clusters are a space-efficient way of storing zeroed regions of the image.
|
||||
|
||||
Reads to a zero data cluster produce zeroes. Note that the difference between an unallocated and a zero data cluster is that zero data clusters stop the reading of contents from the backing file.
|
||||
|
||||
Writes to a zero data cluster cause a new data cluster to be allocated. The new data cluster is populated with zeroes and the data being written.
|
||||
|
||||
===Logical offset translation===
|
||||
Logical offsets are translated into cluster offsets as follows:
|
||||
|
||||
table_bits table_bits cluster_bits
|
||||
<--------> <--------> <--------------->
|
||||
+----------+----------+-----------------+
|
||||
| L1 index | L2 index | byte offset |
|
||||
+----------+----------+-----------------+
|
||||
|
||||
Structure of a logical offset
|
||||
|
||||
offset_mask = ~(cluster_size - 1) # mask for the image file byte offset
|
||||
|
||||
def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
|
||||
l2_offset = l1_table[l1_index]
|
||||
l2_table = load_table(l2_offset)
|
||||
cluster_offset = l2_table[l2_index] & offset_mask
|
||||
return cluster_offset + byte_offset
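The function above takes the indices as inputs. A possible way to derive them from a logical offset, following the bit-field picture, is sketched here together with a helper that recognises the two special data cluster offsets. Both function names are hypothetical.

```python
def split_logical_offset(offset: int, cluster_size: int, table_size: int):
    """Return (l1_index, l2_index, byte_offset) for a logical offset."""
    cluster_bits = cluster_size.bit_length() - 1
    table_noffsets = table_size * cluster_size // 8
    table_bits = table_noffsets.bit_length() - 1

    byte_offset = offset & (cluster_size - 1)
    l2_index = (offset >> cluster_bits) & (table_noffsets - 1)
    l1_index = offset >> (cluster_bits + table_bits)
    return l1_index, l2_index, byte_offset


def classify_data_offset(entry: int) -> str:
    """Interpret an L2 table entry using the special offsets 0 and 1."""
    if entry == 0:
        return "unallocated"   # read from the backing file, or zeroes
    if entry == 1:
        return "zero"          # reads as zeroes, no cluster allocated
    return "allocated"
```

Both helpers rely on cluster_size and table_size being powers of two, as the header rules require.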
|
||||
|
||||
==Consistency checking==
|
||||
|
||||
This section is informational and included to provide background on the use of the QED_F_NEED_CHECK ''features'' bit.
|
||||
|
||||
The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure. A dirty image must be checked on open because its metadata may not be consistent.
|
||||
|
||||
The consistency check verifies the following invariants:
|
||||
# Each cluster is referenced once and only once. It is an inconsistency to have a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked if it has no references.
|
||||
# Offsets must be within the image file size and must be ''cluster_size'' aligned.
|
||||
# Table offsets must be at least ''table_size'' * ''cluster_size'' bytes from the end of the image file so that there is space for the entire table.
|
||||
|
||||
The consistency check process starts from ''l1_table_offset'' and scans all L2 tables. After the check completes with no other errors besides leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed.
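A rough sketch of such a scan over tables that have already been loaded into memory. It checks alignment, range, and duplicate references, but does not report leaked clusters; check_image and its argument layout (an L1 list plus a dictionary mapping L2 table offsets to their entries) are assumptions made for this example.

```python
def check_image(l1_table, l2_tables, l1_table_offset,
                file_size, cluster_size, table_size):
    """Return a list of error strings; an empty list means no errors found."""
    errors = []
    seen = set()
    table_bytes = table_size * cluster_size

    def claim(offset, nbytes, what):
        # Offsets must be aligned and leave room for the whole object.
        if offset % cluster_size or offset + nbytes > file_size:
            errors.append(f"{what} at {offset:#x} misaligned or out of range")
            return False
        clusters = range(offset, offset + nbytes, cluster_size)
        if any(c in seen for c in clusters):
            errors.append(f"{what} at {offset:#x} referenced more than once")
            return False
        seen.update(clusters)
        return True

    claim(l1_table_offset, table_bytes, "L1 table")
    for l2_offset in l1_table:
        if l2_offset == 0 or not claim(l2_offset, table_bytes, "L2 table"):
            continue
        for data_offset in l2_tables.get(l2_offset, []):
            if data_offset in (0, 1):   # unallocated or zero cluster
                continue
            claim(data_offset, cluster_size, "data cluster")
    return errors
```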
|
|
@ -1,81 +0,0 @@
|
|||
|
||||
QEMU Standard VGA
|
||||
=================
|
||||
|
||||
Exists in two variants, for isa and pci.
|
||||
|
||||
command line switches:
|
||||
-vga std [ picks isa for -M isapc, otherwise pci ]
|
||||
-device VGA [ pci variant ]
|
||||
-device isa-vga [ isa variant ]
|
||||
-device secondary-vga [ legacy-free pci variant ]
|
||||
|
||||
|
||||
PCI spec
|
||||
--------
|
||||
|
||||
Applies to the pci variant only for obvious reasons.
|
||||
|
||||
PCI ID: 1234:1111
|
||||
|
||||
PCI Region 0:
|
||||
Framebuffer memory, 16 MB in size (by default).
|
||||
Size is tunable via vga_mem_mb property.
|
||||
|
||||
PCI Region 1:
|
||||
Reserved (so we have the option to make the framebuffer bar 64bit).
|
||||
|
||||
PCI Region 2:
|
||||
MMIO bar, 4096 bytes in size (qemu 1.3+)
|
||||
|
||||
PCI ROM Region:
|
||||
Holds the vgabios (qemu 0.14+).
|
||||
|
||||
|
||||
The legacy-free variant has no ROM and has PCI_CLASS_DISPLAY_OTHER
|
||||
instead of PCI_CLASS_DISPLAY_VGA.
|
||||
|
||||
|
||||
IO ports used
|
||||
-------------
|
||||
|
||||
Doesn't apply to the legacy-free pci variant; use the MMIO bar instead.
|
||||
|
||||
03c0 - 03df : standard vga ports
|
||||
01ce : bochs vbe interface index port
|
||||
01cf : bochs vbe interface data port (x86 only)
|
||||
01d0 : bochs vbe interface data port
|
||||
|
||||
|
||||
Memory regions used
|
||||
-------------------
|
||||
|
||||
0xe0000000 : Framebuffer memory, isa variant only.
|
||||
|
||||
The pci variant used to mirror the framebuffer bar here, qemu 0.14+
|
||||
stops doing that (except when in -M pc-$old compat mode).
|
||||
|
||||
|
||||
MMIO area spec
|
||||
--------------
|
||||
|
||||
Likewise applies to the pci variant only for obvious reasons.
|
||||
|
||||
0000 - 03ff : reserved, for possible virtio extension.
|
||||
0400 - 041f : vga ioports (0x3c0 -> 0x3df), remapped 1:1.
|
||||
word access is supported, bytes are written
|
||||
in little endian order (aka index port first),
|
||||
so indexed registers can be updated with a
|
||||
single mmio write (and thus only one vmexit).
|
||||
0500 - 0515 : bochs dispi interface registers, mapped flat
|
||||
without index/data ports. Use (index << 1)
|
||||
as offset for (16bit) register access.
|
||||
|
||||
0600 - 0607 : qemu extended registers. qemu 2.2+ only.
|
||||
The pci revision is 2 (or greater) when
|
||||
these registers are present. The registers
|
||||
are 32bit.
|
||||
0600 : qemu extended register region size, in bytes.
|
||||
0604 : framebuffer endianness register.
|
||||
- 0xbebebebe indicates big endian.
|
||||
- 0x1e1e1e1e indicates little endian.
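The offsets within the MMIO bar follow directly from the map above. The sketch below only computes offsets and decodes the endianness value; it performs no device access, and the constant and function names are invented for illustration.

```python
MMIO_VGA_PORTS = 0x400    # mirrors I/O ports 0x3c0-0x3df, 1:1
MMIO_DISPI = 0x500        # bochs dispi registers, 16 bit each
MMIO_QEXT_SIZE = 0x600    # qemu extended register region size
MMIO_QEXT_ENDIAN = 0x604  # framebuffer endianness register


def vga_port_offset(port: int) -> int:
    """MMIO offset that mirrors a standard VGA I/O port."""
    if not 0x3c0 <= port <= 0x3df:
        raise ValueError("not a standard vga port")
    return MMIO_VGA_PORTS + (port - 0x3c0)


def dispi_offset(index: int) -> int:
    """MMIO offset of a bochs dispi register: (index << 1) from 0x500."""
    if not 0 <= index <= 10:   # 0x500-0x515 holds eleven 16 bit registers
        raise ValueError("dispi index out of range")
    return MMIO_DISPI + (index << 1)


def decode_endianness(value: int) -> str:
    return {0xbebebebe: "big", 0x1e1e1e1e: "little"}.get(value, "unknown")
```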
|
|
@ -1,266 +0,0 @@
|
|||
Vhost-user Protocol
|
||||
===================
|
||||
|
||||
Copyright (c) 2014 Virtual Open Systems Sarl.
|
||||
|
||||
This work is licensed under the terms of the GNU GPL, version 2 or later.
|
||||
See the COPYING file in the top-level directory.
|
||||
===================
|
||||
|
||||
This protocol aims to complement the ioctl interface used to control the
|
||||
vhost implementation in the Linux kernel. It implements the control plane needed
|
||||
to establish virtqueue sharing with a user space process on the same host. It
|
||||
uses communication over a Unix domain socket to share file descriptors in the
|
||||
ancillary data of the message.
|
||||
|
||||
The protocol defines 2 sides of the communication, master and slave. Master is
|
||||
the application that shares its virtqueues, in our case QEMU. Slave is the
|
||||
consumer of the virtqueues.
|
||||
|
||||
In the current implementation QEMU is the Master, and the Slave is intended to
|
||||
be a software Ethernet switch running in user space, such as Snabbswitch.
|
||||
|
||||
Master and slave can be either a client (i.e. connecting) or server (listening)
|
||||
in the socket communication.
|
||||
|
||||
Message Specification
|
||||
---------------------
|
||||
|
||||
Note that all numbers are in the machine native byte order. A vhost-user message
|
||||
consists of 3 header fields and a payload:
|
||||
|
||||
------------------------------------
|
||||
| request | flags | size | payload |
|
||||
------------------------------------
|
||||
|
||||
* Request: 32-bit type of the request
|
||||
* Flags: 32-bit bit field:
|
||||
- Lower 2 bits are the version (currently 0x01)
|
||||
- Bit 2 is the reply flag - needs to be sent on each reply from the slave
|
||||
* Size - 32-bit size of the payload
|
||||
|
||||
|
||||
Depending on the request type, payload can be:
|
||||
|
||||
* A single 64-bit integer
|
||||
-------
|
||||
| u64 |
|
||||
-------
|
||||
|
||||
u64: a 64-bit unsigned integer
|
||||
|
||||
* A vring state description
|
||||
---------------
|
||||
| index | num |
|
||||
---------------
|
||||
|
||||
Index: a 32-bit index
|
||||
Num: a 32-bit number
|
||||
|
||||
* A vring address description
|
||||
--------------------------------------------------------------
|
||||
| index | flags | size | descriptor | used | available | log |
|
||||
--------------------------------------------------------------
|
||||
|
||||
Index: a 32-bit vring index
|
||||
Flags: a 32-bit vring flags
|
||||
Descriptor: a 64-bit user address of the vring descriptor table
|
||||
Used: a 64-bit user address of the vring used ring
|
||||
Available: a 64-bit user address of the vring available ring
|
||||
Log: a 64-bit guest address for logging
|
||||
|
||||
* Memory regions description
|
||||
---------------------------------------------------
|
||||
| num regions | padding | region0 | ... | region7 |
|
||||
---------------------------------------------------
|
||||
|
||||
Num regions: a 32-bit number of regions
|
||||
Padding: 32-bit
|
||||
|
||||
A region is:
|
||||
-----------------------------------------------------
|
||||
| guest address | size | user address | mmap offset |
|
||||
-----------------------------------------------------
|
||||
|
||||
Guest address: a 64-bit guest address of the region
|
||||
Size: a 64-bit size
|
||||
User address: a 64-bit user address
|
||||
mmap offset: 64-bit offset where region starts in the mapped memory
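A sketch of serialising the memory regions description in the eight-slot layout drawn above, using native byte order. Whether unused slots are actually transmitted is not stated here, so the zero-filled tail is an assumption; pack_memory and unpack_memory are hypothetical names.

```python
import struct

MAX_REGIONS = 8
HEAD = struct.Struct("=II")      # num regions, padding
REGION = struct.Struct("=QQQQ")  # guest address, size, user address, mmap offset


def pack_memory(regions) -> bytes:
    """Pack up to eight (guest_addr, size, user_addr, mmap_offset) tuples."""
    if len(regions) > MAX_REGIONS:
        raise ValueError("at most 8 regions fit in the payload")
    out = HEAD.pack(len(regions), 0)
    for region in regions:
        out += REGION.pack(*region)
    # Assumption: unused slots are zero filled to keep the fixed layout.
    out += bytes(REGION.size * (MAX_REGIONS - len(regions)))
    return out


def unpack_memory(payload: bytes):
    count, _padding = HEAD.unpack_from(payload)
    if count > MAX_REGIONS:
        raise ValueError("too many regions")
    return [REGION.unpack_from(payload, HEAD.size + i * REGION.size)
            for i in range(count)]
```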
|
||||
|
||||
In QEMU the vhost-user message is implemented with the following struct:
|
||||
|
||||
typedef struct VhostUserMsg {
|
||||
VhostUserRequest request;
|
||||
uint32_t flags;
|
||||
uint32_t size;
|
||||
union {
|
||||
uint64_t u64;
|
||||
struct vhost_vring_state state;
|
||||
struct vhost_vring_addr addr;
|
||||
VhostUserMemory memory;
|
||||
};
|
||||
} QEMU_PACKED VhostUserMsg;
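To illustrate the three header fields, here is a minimal sketch that builds and decodes a message in native byte order. It does not send anything over a socket; pack_message and unpack_header are hypothetical names, and the request id 1 is the VHOST_USER_GET_FEATURES id listed under "Message types".

```python
import struct

HEADER = struct.Struct("=III")   # request, flags, size (native byte order)
VERSION_MASK = 0x3               # lower 2 bits of flags
VERSION = 0x1
REPLY_FLAG = 1 << 2              # bit 2 of flags
VHOST_USER_GET_FEATURES = 1


def pack_message(request: int, payload: bytes = b"", reply: bool = False) -> bytes:
    flags = VERSION | (REPLY_FLAG if reply else 0)
    return HEADER.pack(request, flags, len(payload)) + payload


def unpack_header(data: bytes):
    """Return (request, version, is_reply, payload size)."""
    request, flags, size = HEADER.unpack_from(data)
    return request, flags & VERSION_MASK, bool(flags & REPLY_FLAG), size


# A request without payload, and a reply carrying a u64 payload.
request_msg = pack_message(VHOST_USER_GET_FEATURES)
reply_msg = pack_message(VHOST_USER_GET_FEATURES,
                         struct.pack("=Q", 0x1234), reply=True)
```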
|
||||
|
||||
Communication
|
||||
-------------
|
||||
|
||||
The protocol for vhost-user is based on the existing implementation of vhost
|
||||
for the Linux Kernel. Most messages that can be sent via the Unix domain socket
|
||||
implementing vhost-user have an equivalent ioctl to the kernel implementation.
|
||||
|
||||
The communication consists of master sending message requests and slave sending
|
||||
message replies. Most of the requests don't require replies. Here is a list of
|
||||
the ones that do:
|
||||
|
||||
* VHOST_GET_FEATURES
|
||||
* VHOST_GET_VRING_BASE
|
||||
|
||||
There are several messages that the master sends with file descriptors passed
|
||||
in the ancillary data:
|
||||
|
||||
* VHOST_SET_MEM_TABLE
|
||||
* VHOST_SET_LOG_FD
|
||||
* VHOST_SET_VRING_KICK
|
||||
* VHOST_SET_VRING_CALL
|
||||
* VHOST_SET_VRING_ERR
|
||||
|
||||
If Master is unable to send the full message or receives a wrong reply it will
|
||||
close the connection. An optional reconnection mechanism can be implemented.
|
||||
|
||||
Message types
|
||||
-------------
|
||||
|
||||
* VHOST_USER_GET_FEATURES
|
||||
|
||||
Id: 1
|
||||
Equivalent ioctl: VHOST_GET_FEATURES
|
||||
Master payload: N/A
|
||||
Slave payload: u64
|
||||
|
||||
Get the features bitmask from the underlying vhost implementation.
|
||||
|
||||
* VHOST_USER_SET_FEATURES
|
||||
|
||||
Id: 2
|
||||
Equivalent ioctl: VHOST_SET_FEATURES
|
||||
Master payload: u64
|
||||
|
||||
Enable features in the underlying vhost implementation using a bitmask.
|
||||
|
||||
* VHOST_USER_SET_OWNER
|
||||
|
||||
Id: 3
|
||||
Equivalent ioctl: VHOST_SET_OWNER
|
||||
Master payload: N/A
|
||||
|
||||
Issued when a new connection is established. It sets the current Master
|
||||
as an owner of the session. This can be used on the Slave as a
|
||||
"session start" flag.
|
||||
|
||||
* VHOST_USER_RESET_OWNER
|
||||
|
||||
Id: 4
|
||||
Equivalent ioctl: VHOST_RESET_OWNER
|
||||
Master payload: N/A
|
||||
|
||||
Issued when a connection is about to be closed. The Master will no
|
||||
longer own this connection (and will usually close it).
|
||||
|
||||
* VHOST_USER_SET_MEM_TABLE
|
||||
|
||||
Id: 5
|
||||
Equivalent ioctl: VHOST_SET_MEM_TABLE
|
||||
Master payload: memory regions description
|
||||
|
||||
Sets the memory map regions on the slave so it can translate the vring
|
||||
addresses. In the ancillary data there is an array of file descriptors
|
||||
for each memory mapped region. The size and ordering of the fds matches
|
||||
the number and ordering of memory regions.
|
||||
|
||||
* VHOST_USER_SET_LOG_BASE
|
||||
|
||||
Id: 6
|
||||
Equivalent ioctl: VHOST_SET_LOG_BASE
|
||||
Master payload: u64
|
||||
|
||||
Sets the logging base address.
|
||||
|
||||
* VHOST_USER_SET_LOG_FD
|
||||
|
||||
Id: 7
|
||||
Equivalent ioctl: VHOST_SET_LOG_FD
|
||||
Master payload: N/A
|
||||
|
||||
Sets the logging file descriptor, which is passed as ancillary data.
|
||||
|
||||
* VHOST_USER_SET_VRING_NUM
|
||||
|
||||
Id: 8
|
||||
Equivalent ioctl: VHOST_SET_VRING_NUM
|
||||
Master payload: vring state description
|
||||
|
||||
Sets the size (number of descriptors) of the vring.
|
||||
|
||||
* VHOST_USER_SET_VRING_ADDR
|
||||
|
||||
Id: 9
|
||||
Equivalent ioctl: VHOST_SET_VRING_ADDR
|
||||
Master payload: vring address description
|
||||
Slave payload: N/A
|
||||
|
||||
Sets the addresses of the different aspects of the vring.
|
||||
|
||||
* VHOST_USER_SET_VRING_BASE
|
||||
|
||||
Id: 10
|
||||
Equivalent ioctl: VHOST_SET_VRING_BASE
|
||||
Master payload: vring state description
|
||||
|
||||
Sets the base offset in the available vring.
|
||||
|
||||
* VHOST_USER_GET_VRING_BASE
|
||||
|
||||
Id: 11
|
||||
Equivalent ioctl: VHOST_GET_VRING_BASE
|
||||
Master payload: vring state description
|
||||
Slave payload: vring state description
|
||||
|
||||
Get the available vring base offset.
|
||||
|
||||
* VHOST_USER_SET_VRING_KICK
|
||||
|
||||
Id: 12
|
||||
Equivalent ioctl: VHOST_SET_VRING_KICK
|
||||
Master payload: u64
|
||||
|
||||
Set the event file descriptor for adding buffers to the vring. It
|
||||
is passed in the ancillary data.
|
||||
Bits (0-7) of the payload contain the vring index. Bit 8 is the
|
||||
invalid FD flag. This flag is set when there is no file descriptor
|
||||
in the ancillary data. This signals that polling should be used
|
||||
instead of waiting for a kick.
|
||||
|
||||
* VHOST_USER_SET_VRING_CALL
|
||||
|
||||
Id: 13
|
||||
Equivalent ioctl: VHOST_SET_VRING_CALL
|
||||
Master payload: u64
|
||||
|
||||
Set the event file descriptor to signal when buffers are used. It
|
||||
is passed in the ancillary data.
|
||||
Bits (0-7) of the payload contain the vring index. Bit 8 is the
|
||||
invalid FD flag. This flag is set when there is no file descriptor
|
||||
in the ancillary data. This signals that polling will be used
|
||||
instead of waiting for the call.
|
||||
|
||||
* VHOST_USER_SET_VRING_ERR
|
||||
|
||||
Id: 14
|
||||
Equivalent ioctl: VHOST_SET_VRING_ERR
|
||||
Master payload: u64
|
||||
|
||||
Set the event file descriptor to signal when an error occurs. It
|
||||
is passed in the ancillary data.
|
||||
Bits (0-7) of the payload contain the vring index. Bit 8 is the
|
||||
invalid FD flag. This flag is set when there is no file descriptor
|
||||
in the ancillary data.
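The u64 payload shared by the three SET_VRING_KICK/CALL/ERR messages can be sketched as follows. The helper names are hypothetical, and passing the descriptor itself in the ancillary data is outside the scope of this example.

```python
VRING_INDEX_MASK = 0xFF   # bits 0-7: vring index
INVALID_FD_FLAG = 1 << 8  # bit 8: no file descriptor in the ancillary data


def encode_vring_fd_payload(vring_index: int, has_fd: bool) -> int:
    if not 0 <= vring_index <= VRING_INDEX_MASK:
        raise ValueError("vring index must fit in 8 bits")
    return vring_index | (0 if has_fd else INVALID_FD_FLAG)


def decode_vring_fd_payload(payload: int):
    """Return (vring index, True if a file descriptor accompanies the message)."""
    return payload & VRING_INDEX_MASK, not (payload & INVALID_FD_FLAG)
```

For the kick and call messages, a decoded has_fd of False is the signal to fall back to polling.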
|
|
@ -1,92 +0,0 @@
|
|||
General Description
|
||||
===================
|
||||
|
||||
This document describes the VMware PVSCSI device interface specification.
|
||||
Created by Dmitry Fleytman (dmitry@daynix.com), Daynix Computing LTD.
|
||||
Based on source code of PVSCSI Linux driver from kernel 3.0.4
|
||||
|
||||
PVSCSI Device Interface Overview
|
||||
================================
|
||||
|
||||
The interface is based on a memory area shared between the hypervisor and the VM.
|
||||
The memory area is obtained by the driver as a device IO memory resource of
|
||||
PVSCSI_MEM_SPACE_SIZE length.
|
||||
The shared memory consists of a registers area and a rings area.
|
||||
The registers area is used to raise hypervisor interrupts and issue device
|
||||
commands. The rings area is used to transfer data descriptors and SCSI
|
||||
commands from VM to hypervisor and to transfer messages produced by
|
||||
hypervisor to VM. Data itself is transferred via virtual scatter-gather DMA.
|
||||
|
||||
PVSCSI Device Registers
|
||||
=======================
|
||||
|
||||
The length of the registers area is 1 page (PVSCSI_MEM_SPACE_COMMAND_NUM_PAGES).
|
||||
The structure of the registers area is described by the PVSCSIRegOffset enum.
|
||||
There are registers to issue device commands (with optional short data),
|
||||
issue device interrupts, and control interrupt masking.
|
||||
|
||||
PVSCSI Device Rings
|
||||
===================
|
||||
|
||||
There are three rings in shared memory:
|
||||
|
||||
1. Request ring (struct PVSCSIRingReqDesc *req_ring)
|
||||
- ring for OS to device requests
|
||||
2. Completion ring (struct PVSCSIRingCmpDesc *cmp_ring)
|
||||
- ring for device request completions
|
||||
3. Message ring (struct PVSCSIRingMsgDesc *msg_ring)
|
||||
- ring for messages from device.
|
||||
This ring is optional and the guest might not configure it.
|
||||
There is a control area (struct PVSCSIRingsState *rings_state) used to control
|
||||
rings operation.
|
||||
|
||||
PVSCSI Device to Host Interrupts
|
||||
================================
|
||||
The PVSCSI device supports the following interrupt types:
|
||||
1. Completion interrupts (completion ring notifications):
|
||||
PVSCSI_INTR_CMPL_0
|
||||
PVSCSI_INTR_CMPL_1
|
||||
2. Message interrupts (message ring notifications):
|
||||
PVSCSI_INTR_MSG_0
|
||||
PVSCSI_INTR_MSG_1
|
||||
|
||||
Interrupts are controlled via the PVSCSI_REG_OFFSET_INTR_MASK register.
|
||||
A set bit means the interrupt is enabled; a cleared bit means it is disabled.
|
||||
|
||||
The supported interrupt modes are legacy, MSI and MSI-X.
|
||||
In case of legacy interrupts, the PVSCSI_REG_OFFSET_INTR_STATUS register
|
||||
is used to check which interrupt has arrived. Interrupts are
|
||||
acknowledged when the corresponding bit is written to the interrupt
|
||||
status register.
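The mask and status semantics above can be modelled with a small sketch. The bit positions assigned to the four interrupt names are assumptions made for illustration (check the PVSCSI driver headers for the real values), and the class is a toy model, not device code.

```python
# Bit positions are assumed for illustration only.
PVSCSI_INTR_CMPL_0 = 1 << 0
PVSCSI_INTR_CMPL_1 = 1 << 1
PVSCSI_INTR_MSG_0 = 1 << 2
PVSCSI_INTR_MSG_1 = 1 << 3


class InterruptModel:
    """Toy model of the interrupt mask and status registers."""

    def __init__(self):
        self.mask = 0     # PVSCSI_REG_OFFSET_INTR_MASK: set bit = enabled
        self.status = 0   # PVSCSI_REG_OFFSET_INTR_STATUS: raised causes

    def raise_interrupt(self, bits):
        """Device side: record one or more interrupt causes."""
        self.status |= bits

    def pending(self):
        """Causes that are both raised and enabled by the mask."""
        return self.status & self.mask

    def write_status(self, bits):
        """Driver side: writing a bit acknowledges that interrupt."""
        self.status &= ~bits
```

With only the completion bits set in the mask, a raised message interrupt stays recorded in the status register but is not reported as pending.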
|
||||
|
||||
PVSCSI Device Operation Sequences
|
||||
=================================
|
||||
|
||||
1. Startup sequence:
|
||||
a. Issue PVSCSI_CMD_ADAPTER_RESET command;
|
||||
aa. Windows driver reads interrupt status register here;
|
||||
b. Issue PVSCSI_CMD_SETUP_MSG_RING command with no additional data,
|
||||
check status and disable device messages if error returned;
|
||||
(Omitted if device messages disabled by driver configuration)
|
||||
c. Issue PVSCSI_CMD_SETUP_RINGS command, provide rings configuration
|
||||
as struct PVSCSICmdDescSetupRings;
|
||||
d. Issue PVSCSI_CMD_SETUP_MSG_RING command again, provide
|
||||
rings configuration as struct PVSCSICmdDescSetupMsgRing;
|
||||
e. Unmask completion and message (if device messages enabled) interrupts.
|
||||
|
||||
2. Shutdown sequence
|
||||
a. Mask interrupts;
|
||||
b. Flush request ring using PVSCSI_REG_OFFSET_KICK_NON_RW_IO;
|
||||
c. Issue PVSCSI_CMD_ADAPTER_RESET command.
|
||||
|
||||
3. Send request
|
||||
a. Fill next free request ring descriptor;
|
||||
b. Issue PVSCSI_REG_OFFSET_KICK_RW_IO for R/W operations;
|
||||
or PVSCSI_REG_OFFSET_KICK_NON_RW_IO for other operations.
|
||||
|
||||
4. Abort command
|
||||
a. Issue PVSCSI_CMD_ABORT_CMD command.
|
||||
|
||||
5. Request completion processing
|
||||
a. Upon completion interrupt arrival, process the completion
|
||||
and message (if enabled) rings.
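The startup sequence (item 1 above) can be written down as a short sketch that returns the ordered driver actions. It assumes that step d and the message interrupt unmasking are skipped once device messages are disabled, which the text implies but does not state outright. The function name and the step labels are hypothetical, and the Windows-only step aa is left out.

```python
def startup_sequence(messages_enabled, msg_ring_probe_ok=True):
    """Return the ordered list of driver actions for device startup."""
    steps = ["PVSCSI_CMD_ADAPTER_RESET"]                 # step a
    use_messages = messages_enabled
    if use_messages:
        # Step b: probe with no additional data; an error disables messages.
        steps.append("PVSCSI_CMD_SETUP_MSG_RING (probe)")
        use_messages = msg_ring_probe_ok
    steps.append("PVSCSI_CMD_SETUP_RINGS")               # step c
    if use_messages:
        steps.append("PVSCSI_CMD_SETUP_MSG_RING")        # step d
    steps.append("unmask completion interrupts")         # step e
    if use_messages:
        steps.append("unmask message interrupts")
    return steps
```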
|
|
@ -1,19 +0,0 @@
|
|||
A Spice port channel carries arbitrary communication between the Spice
|
||||
server host side and the client side.
|
||||
|
||||
Thanks to the associated reverse fully qualified domain name (fqdn),
|
||||
a Spice client can handle the various ports appropriately.
|
||||
|
||||
The following fqdn names are reserved by the QEMU project:
|
||||
|
||||
org.qemu.monitor.hmp.0
|
||||
QEMU human monitor
|
||||
|
||||
org.qemu.monitor.qmp.0
|
||||
QEMU control monitor
|
||||
|
||||
org.qemu.console.serial.0
|
||||
QEMU virtual serial port
|
||||
|
||||
org.qemu.console.debug.0
|
||||
QEMU debug console
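A client could dispatch on the reserved names listed above with a simple lookup. This is a minimal sketch; the table and function names are hypothetical and no Spice client API is implied.

```python
# Reserved port names and their meaning, as listed above.
RESERVED_PORTS = {
    "org.qemu.monitor.hmp.0": "QEMU human monitor",
    "org.qemu.monitor.qmp.0": "QEMU control monitor",
    "org.qemu.console.serial.0": "QEMU virtual serial port",
    "org.qemu.console.debug.0": "QEMU debug console",
}


def describe_port(fqdn):
    """Return the description of a reserved port name, or None."""
    return RESERVED_PORTS.get(fqdn)
```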
|
|
@ -1,349 +0,0 @@
|
|||
= Tracing =
|
||||
|
||||
== Introduction ==
|
||||
|
||||
This document describes the tracing infrastructure in QEMU and how to use it
|
||||
for debugging, profiling, and observing execution.
|
||||
|
||||
== Quickstart ==
|
||||
|
||||
1. Build with the 'simple' trace backend:
|
||||
|
||||
./configure --enable-trace-backends=simple
|
||||
make
|
||||
|
||||
2. Create a file with the events you want to trace:
|
||||
|
||||
echo bdrv_aio_readv > /tmp/events
|
||||
echo bdrv_aio_writev >> /tmp/events
|
||||
|
||||
3. Run the virtual machine to produce a trace file:
|
||||
|
||||
qemu -trace events=/tmp/events ... # your normal QEMU invocation
|
||||
|
||||
4. Pretty-print the binary trace file:
|
||||
|
||||
./scripts/simpletrace.py trace-events trace-* # Override * with QEMU <pid>
|
||||
|
||||
== Trace events ==
|
||||
|
||||
There is a set of static trace events declared in the "trace-events" source
|
||||
file. Each trace event declaration names the event, its arguments, and the
|
||||
format string which can be used for pretty-printing:
|
||||
|
||||
qemu_vmalloc(size_t size, void *ptr) "size %zu ptr %p"
|
||||
qemu_vfree(void *ptr) "ptr %p"
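The three parts of a declaration (event name, arguments, format string) can be pulled apart with a few lines of Python. This is a rough sketch for the simple form shown above, not the parser used by "tracetool"; the function name is hypothetical and event properties are not handled.

```python
import re

# name(arguments) "format string"
DECL_RE = re.compile(r'^(\w+)\(([^)]*)\)\s*"(.*)"$')


def parse_declaration(line):
    """Split a declaration into (name, [arguments], format string)."""
    match = DECL_RE.match(line.strip())
    if match is None:
        raise ValueError("not a trace event declaration: %r" % line)
    name, args, fmt = match.groups()
    arg_list = [arg.strip() for arg in args.split(",") if arg.strip()]
    return name, arg_list, fmt
```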
|
||||
|
||||
The "trace-events" file is processed by the "tracetool" script during build to
|
||||
generate code for the trace events. Trace events are invoked directly from
|
||||
source code like this:
|
||||
|
||||
#include "trace.h" /* needed for trace event prototype */
|
||||
|
||||
void *qemu_vmalloc(size_t size)
|
||||
{
|
||||
void *ptr;
|
||||
size_t align = QEMU_VMALLOC_ALIGN;
|
||||
|
||||
if (size < align) {
|
||||
align = getpagesize();
|
||||
}
|
||||
ptr = qemu_memalign(align, size);
|
||||
trace_qemu_vmalloc(size, ptr);
|
||||
return ptr;
|
||||
}
|
||||
|
||||
=== Declaring trace events ===
|
||||
|
||||
The "tracetool" script produces the trace.h header file which is included by
|
||||
every source file that uses trace events. Since many source files include
|
||||
trace.h, it uses a minimum of types and other header files included to keep the
|
||||
namespace clean and compile times and dependencies down.
|
||||
|
||||
Trace events should use types as follows:
|
||||
|
||||
* Use stdint.h types for fixed-size types. Most offsets and guest memory
|
||||
addresses are best represented with uint32_t or uint64_t. Use fixed-size
|
||||
types over primitive types whose size may change depending on the host
|
||||
(32-bit versus 64-bit) so trace events don't truncate values or break
|
||||
the build.
|
||||
|
||||
* Use void * for pointers to structs or for arrays. The trace.h header
|
||||
cannot include all user-defined struct declarations and it is therefore
|
||||
necessary to use void * for pointers to structs.
|
||||
|
||||
* For everything else, use primitive scalar types (char, int, long) with the
|
||||
appropriate signedness.
|
||||
|
||||
Format strings should reflect the types defined in the trace event. Take
|
||||
special care to use PRId64 and PRIu64 for int64_t and uint64_t types,
|
||||
respectively. This ensures portability between 32- and 64-bit platforms.
|
||||
|
||||
=== Hints for adding new trace events ===
|
||||
|
||||
1. Trace state changes in the code. Interesting points in the code usually
|
||||
involve a state change like starting, stopping, allocating, freeing. State
|
||||
changes are good trace events because they can be used to understand the
|
||||
execution of the system.
|
||||
|
||||
2. Trace guest operations. Guest I/O accesses like reading device registers
|
||||
are good trace events because they can be used to understand guest
|
||||
interactions.
|
||||
|
||||
3. Use correlator fields so the context of an individual line of trace output
|
||||
can be understood. For example, trace the pointer returned by malloc and
|
||||
used as an argument to free. This way mallocs and frees can be matched up.
|
||||
Trace events with no context are not very useful.
|
||||
|
||||
4. Name trace events after their function. If there are multiple trace events
|
||||
in one function, append a unique distinguisher at the end of the name.
|
||||
|
||||
== Generic interface and monitor commands ==
|
||||
|
||||
You can programmatically query and control the state of trace events through a
|
||||
backend-agnostic interface provided by the header "trace/control.h".
|
||||
|
||||
Note that some of the backends do not provide an implementation for some parts
|
||||
of this interface, in which case QEMU will just print a warning (please refer to
|
||||
header "trace/control.h" to see which routines are backend-dependent).
|
||||
|
||||
The state of events can also be queried and modified through monitor commands:
|
||||
|
||||
* info trace-events
|
||||
View available trace events and their state. State 1 means enabled, state 0
|
||||
means disabled.
|
||||
|
||||
* trace-event NAME on|off
|
||||
Enable/disable a given trace event or a group of events (using wildcards).
|
||||
|
||||
The "-trace events=<file>" command line argument can be used to enable the
|
||||
events listed in <file> from the very beginning of the program. This file must
|
||||
contain one event name per line.
|
||||
|
||||
If a line in the "-trace events=<file>" file begins with a '-', the trace event
|
||||
will be disabled instead of enabled. This is useful when a wildcard was used
|
||||
to enable an entire family of events but one noisy event needs to be disabled.
|
||||
|
||||
Wildcard matching is supported in both the monitor command "trace-event" and the
|
||||
events list file. That means you can enable/disable the events having a common
|
||||
prefix in a batch. For example, virtio-blk trace events could be enabled using
|
||||
the following monitor command:
|
||||
|
||||
trace-event virtio_blk_* on
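The effect of an events list file with wildcards and '-' prefixed lines can be sketched like this. The sketch assumes that lines are applied in order and uses Python's fnmatch for matching, which accepts more than the '*' wildcard, so it only approximates QEMU's matching. The function name and the event names in the usage note are illustrative.

```python
from fnmatch import fnmatchcase


def apply_events_file(lines, known_events):
    """Return the set of events left enabled after applying all lines."""
    enabled = set()
    for raw in lines:
        pattern = raw.strip()
        if not pattern:
            continue
        # A leading '-' disables the matching events instead of enabling them.
        disable = pattern.startswith("-")
        if disable:
            pattern = pattern[1:]
        matches = {name for name in known_events if fnmatchcase(name, pattern)}
        if disable:
            enabled -= matches
        else:
            enabled |= matches
    return enabled
```

For example, the two lines "virtio_blk_*" and "-virtio_blk_rw_complete" enable every event with the virtio_blk_ prefix except the one named on the second line.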
|
||||
|
||||
== Trace backends ==
|
||||
|
||||
The "tracetool" script automates tedious trace event code generation and also
|
||||
keeps the trace event declarations independent of the trace backend. The trace
|
||||
events are not tightly coupled to a specific trace backend, such as LTTng or
|
||||
SystemTap. Support for trace backends can be added by extending the "tracetool"
|
||||
script.
|
||||
|
||||
The trace backends are chosen at configure time:
|
||||
|
||||
./configure --enable-trace-backends=simple
|
||||
|
||||
For a list of supported trace backends, try ./configure --help or see below.
|
||||
If multiple backends are enabled, the trace is sent to them all.
|
||||
|
||||
The following subsections describe the supported trace backends.
|
||||
|
||||
=== Nop ===
|
||||
|
||||
The "nop" backend generates empty trace event functions so that the compiler
|
||||
can optimize out trace events completely. This is the default and imposes no
|
||||
performance penalty.
|
||||
|
||||
Note that regardless of the selected trace backend, events with the "disable"
|
||||
property will be generated with the "nop" backend.
|
||||
|
||||
=== Stderr ===
|
||||
|
||||
The "stderr" backend sends trace events directly to standard error. This
|
||||
effectively turns trace events into debug printfs.
|
||||
|
||||
This is the simplest backend and can be used together with existing code that
|
||||
uses DPRINTF().
|
||||
|
||||
=== Simpletrace ===
|
||||
|
||||
The "simple" backend supports common use cases and comes as part of the QEMU
|
||||
source tree. It may not be as powerful as platform-specific or third-party
|
||||
trace backends but it is portable. This is the recommended trace backend
|
||||
unless you have specific needs for more advanced backends.
|
||||
|
||||
The "simple" backend currently does not capture string arguments; it simply
|
||||
records the char* pointer value instead of the string that is pointed to.
|
||||
|
||||
=== Ftrace ===
|
||||
|
||||
The "ftrace" backend writes trace data to the ftrace marker. This effectively
|
||||
sends trace events to the ftrace ring buffer, and you can compare QEMU trace
|
||||
data and kernel trace data (especially from kvm.ko when using KVM).
|
||||
|
||||
If you use KVM, enable kvm events in ftrace:
|
||||
|
||||
# echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
|
||||
|
||||
After running qemu as the root user, you can get the trace:
|
||||
|
||||
# cat /sys/kernel/debug/tracing/trace
|
||||
|
||||
Restriction: the "ftrace" backend is available on Linux only.
|
||||
|
||||
==== Monitor commands ====
|
||||
|
||||
* trace-file on|off|flush|set <path>
|
||||
Enable/disable/flush the trace file or set the trace file name.
|
||||
|
||||
==== Analyzing trace files ====
|
||||
|
||||
The "simple" backend produces binary trace files that can be formatted with the
|
||||
simpletrace.py script. The script takes the "trace-events" file and the binary
|
||||
trace:
|
||||
|
||||
./scripts/simpletrace.py trace-events trace-12345
|
||||
|
||||
You must ensure that the same "trace-events" file was used to build QEMU,
|
||||
otherwise trace event declarations may have changed and output will not be
|
||||
consistent.
|
||||
|
||||
=== LTTng Userspace Tracer ===
|
||||
|
||||
The "ust" backend uses the LTTng Userspace Tracer library. There are no
|
||||
monitor commands built into QEMU; instead, UST utilities should be used to list,
|
||||
enable/disable, and dump traces.
|
||||
|
||||
Package lttng-tools is required for userspace tracing. You must ensure that the
|
||||
current user belongs to the "tracing" group, or manually launch the
|
||||
lttng-sessiond daemon for the current user prior to running any instance of
|
||||
QEMU.
|
||||
|
||||
While running an instrumented QEMU, LTTng should be able to list all available
|
||||
events:
|
||||
|
||||
lttng list -u
|
||||
|
||||
Create tracing session:
|
||||
|
||||
lttng create mysession
|
||||
|
||||
Enable events:
|
||||
|
||||
lttng enable-event qemu:g_malloc -u
|
||||
|
||||
Where the events can either be a comma-separated list of events, or "-a" to
|
||||
enable all tracepoint events. Start and stop tracing as needed:
|
||||
|
||||
lttng start
|
||||
lttng stop
|
||||
|
||||
View the trace:
|
||||
|
||||
lttng view
|
||||
|
||||
Destroy tracing session:
|
||||
|
||||
lttng destroy
|
||||
|
||||
Babeltrace can be used at any later time to view the trace:
|
||||
|
||||
babeltrace $HOME/lttng-traces/mysession-<date>-<time>
|
||||
|
||||
=== SystemTap ===
|
||||
|
||||
The "dtrace" backend uses DTrace sdt probes but has only been tested with
|
||||
SystemTap. When SystemTap support is detected a .stp file with wrapper probes
|
||||
is generated to make use in scripts more convenient. This step can also be
|
||||
performed manually after a build in order to change the binary name in the .stp
|
||||
probes:
|
||||
|
||||
scripts/tracetool --dtrace --stap \
|
||||
--binary path/to/qemu-binary \
|
||||
--target-type system \
|
||||
--target-name x86_64 \
|
||||
<trace-events >qemu.stp
|
||||
|
||||
== Trace event properties ==
|
||||
|
||||
Each event in the "trace-events" file can be prefixed with a space-separated
|
||||
list of zero or more of the following event properties.
|
||||
|
||||
=== "disable" ===
|
||||
|
||||
If a specific trace event is going to be invoked a huge number of times, this
|
||||
might have a noticeable performance impact even when the event is
|
||||
programmatically disabled.
|
||||
|
||||
In this case you should declare such an event with the "disable" property. This
|
||||
will effectively disable the event at compile time (by using the "nop" backend),
|
||||
thus having no performance impact at all on regular builds (i.e., unless you
|
||||
edit the "trace-events" file).
|
||||
|
||||
In addition, there might be cases where relatively complex computations must be
|
||||
performed to generate values that are only used as arguments for a trace
|
||||
function. In these cases you can use the macro 'TRACE_${EVENT_NAME}_ENABLED' to
|
||||
guard such computations and avoid their compilation when the event is disabled:
|
||||
|
||||
#include "trace.h" /* needed for trace event prototype */
|
||||
|
||||
void *qemu_vmalloc(size_t size)
|
||||
{
|
||||
void *ptr;
|
||||
size_t align = QEMU_VMALLOC_ALIGN;
|
||||
|
||||
if (size < align) {
|
||||
align = getpagesize();
|
||||
}
|
||||
ptr = qemu_memalign(align, size);
|
||||
if (TRACE_QEMU_VMALLOC_ENABLED) { /* preprocessor macro */
|
||||
void *complex;
|
||||
/* some complex computations to produce the 'complex' value */
|
||||
trace_qemu_vmalloc(size, ptr, complex);
|
||||
}
|
||||
return ptr;
|
||||
}
|
||||
|
||||
You can check both if the event has been disabled and is dynamically enabled at
|
||||
the same time using the 'trace_event_get_state' routine (see header
|
||||
"trace/control.h" for more information).
|
||||
|
||||
=== "tcg" ===
|
||||
|
||||
Guest code generated by TCG can be traced by defining an event with the "tcg"
|
||||
event property. Internally, this property generates two events:
|
||||
"<eventname>_trans" to trace the event at translation time, and
|
||||
"<eventname>_exec" to trace the event at execution time.
|
||||
|
||||
Instead of using these two events directly, you should use the function
|
||||
"trace_<eventname>_tcg" during translation (TCG code generation). This function
|
||||
will automatically call "trace_<eventname>_trans", and will generate the
|
||||
necessary TCG code to call "trace_<eventname>_exec" during guest code execution.
|
||||
|
||||
Events with the "tcg" property can be declared in the "trace-events" file with a
|
||||
mix of native and TCG types, and "trace_<eventname>_tcg" will gracefully forward
|
||||
them to the "<eventname>_trans" and "<eventname>_exec" events. Since TCG values
|
||||
are not known at translation time, these are ignored by the "<eventname>_trans"
|
||||
event. Because of this, the entry in the "trace-events" file needs two printing
|
||||
formats (separated by a comma):
|
||||
|
||||
tcg foo(uint8_t a1, TCGv_i32 a2) "a1=%d", "a1=%d a2=%d"
|
||||
|
||||
For example:
|
||||
|
||||
#include "trace-tcg.h"
|
||||
|
||||
void some_disassembly_func (...)
|
||||
{
|
||||
uint8_t a1 = ...;
|
||||
TCGv_i32 a2 = ...;
|
||||
trace_foo_tcg(a1, a2);
|
||||
}
|
||||
|
||||
This will immediately call:
|
||||
|
||||
void trace_foo_trans(uint8_t a1);
|
||||
|
||||
and will generate the TCG code to call:
|
||||
|
||||
void trace_foo_exec(uint8_t a1, uint32_t a2);
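The way a "tcg" event splits into a "<eventname>_trans" and a "<eventname>_exec" event can be sketched as below. Only the TCGv_i32 to uint32_t mapping appears in the text; the helper name and the data layout are hypothetical, and this is not how "tracetool" itself is written.

```python
# TCG types and the native type seen at execution time.
# Only the mapping shown in the text is listed here.
TCG_TO_NATIVE = {"TCGv_i32": "uint32_t"}


def split_tcg_event(name, args):
    """Split a list of (type, name) arguments into trans and exec events."""
    # TCG values are unknown at translation time, so "_trans" drops them.
    trans_args = [(t, n) for t, n in args if t not in TCG_TO_NATIVE]
    # At execution time every TCG value is available as a native value.
    exec_args = [(TCG_TO_NATIVE.get(t, t), n) for t, n in args]
    return {name + "_trans": trans_args, name + "_exec": exec_args}
```

Applied to the "foo" declaration above, this yields one argument for the translation-time event and two for the execution-time event, matching the two format strings.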
|
|
@ -1,47 +0,0 @@
|
|||
|
||||
qemu usb storage emulation
|
||||
--------------------------
|
||||
|
||||
QEMU has three devices for usb storage emulation.
|
||||
|
||||
Number one emulates the classic bulk-only transport protocol which is
|
||||
used by 99% of the usb sticks on the market today and is called
|
||||
"usb-storage". Usage (hooking up to xhci, other host controllers work
|
||||
too):
|
||||
|
||||
qemu ${other_vm_args} \
|
||||
-drive if=none,id=stick,file=/path/to/file.img \
|
||||
-device nec-usb-xhci,id=xhci \
|
||||
-device usb-storage,bus=xhci.0,drive=stick
|
||||
|
||||
|
||||
Number two is the newer usb attached scsi transport. This one doesn't
|
||||
automagically create a scsi disk, so you have to explicitly attach one
|
||||
manually. Multiple logical units are supported. Here is an example
|
||||
with three logical units:
|
||||
|
||||
qemu ${other_vm_args} \
|
||||
-drive if=none,id=uas-disk1,file=/path/to/file1.img \
|
||||
-drive if=none,id=uas-disk2,file=/path/to/file2.img \
|
||||
-drive if=none,id=uas-cdrom,media=cdrom,file=/path/to/image.iso \
|
||||
-device nec-usb-xhci,id=xhci \
|
||||
-device usb-uas,id=uas,bus=xhci.0 \
|
||||
-device scsi-hd,bus=uas.0,scsi-id=0,lun=0,drive=uas-disk1 \
|
||||
-device scsi-hd,bus=uas.0,scsi-id=0,lun=1,drive=uas-disk2 \
|
||||
-device scsi-cd,bus=uas.0,scsi-id=0,lun=5,drive=uas-cdrom
|
||||
|
||||
|
||||
Number three emulates the classic bulk-only transport protocol too.
|
||||
It's called "usb-bot". It shares most code with "usb-storage", and
|
||||
the guest will not be able to see the difference. The qemu command
|
||||
line interface is similar to usb-uas though, i.e. no automatic scsi
|
||||
disk creation. It also features support for up to 16 LUNs. The LUN
|
||||
numbers must be contiguous, i.e. for three devices you must use 0+1+2.
|
||||
The 0+1+5 numbering from the "usb-uas" example isn't going to work
|
||||
with "usb-bot".
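The LUN numbering rule for "usb-bot" can be expressed as a small check. This is a sketch of the rule as stated above (at most 16 LUNs, numbered from 0 without gaps); the function name is hypothetical and the check is not QEMU's own validation code.

```python
MAX_LUNS = 16  # "usb-bot" supports up to 16 LUNs


def check_bot_luns(luns):
    """Return True if the LUN numbers are 0, 1, 2, ... without gaps."""
    ordered = sorted(luns)
    if len(ordered) > MAX_LUNS:
        return False
    return ordered == list(range(len(ordered)))
```

The 0+1+2 numbering passes this check, while the 0+1+5 numbering from the "usb-uas" example does not.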
|
||||
|
||||
enjoy,
|
||||
Gerd
|
||||
|
||||
--
|
||||
Gerd Hoffmann <kraxel@redhat.com>
|
|
@ -1,161 +0,0 @@
|
|||
|
||||
USB 2.0 Quick Start
|
||||
===================
|
||||
|
||||
The QEMU EHCI Adapter can be used with and without companion
|
||||
controllers. See below for the companion controller mode.
|
||||
|
||||
When not running in companion controller mode there are two completely
|
||||
separate USB busses: One USB 1.1 bus driven by the UHCI controller and
|
||||
one USB 2.0 bus driven by the EHCI controller. Devices must be
|
||||
attached to the correct controller manually.
|
||||
|
||||
The '-usb' switch will make qemu create the UHCI controller as part of
|
||||
the PIIX3 chipset. The USB 1.1 bus will carry the name "usb-bus.0".
|
||||
|
||||
You can use the standard -device switch to add an EHCI controller to
|
||||
your virtual machine. It is strongly recommended to specify an ID for
|
||||
the controller so the USB 2.0 bus gets an individual name, for example
|
||||
'-device usb-ehci,id=ehci'. This will give you a USB 2.0 bus named
|
||||
"ehci.0".
|
||||
|
||||
I strongly recommend also using -device to attach usb devices because
|
||||
you can specify the bus they should be attached to this way. Here is
|
||||
a complete example:
|
||||
|
||||
qemu -M pc ${otheroptions} \
|
||||
-drive if=none,id=usbstick,file=/path/to/image \
|
||||
-usb \
|
||||
-device usb-ehci,id=ehci \
|
||||
-device usb-tablet,bus=usb-bus.0 \
|
||||
-device usb-storage,bus=ehci.0,drive=usbstick
|
||||
|
||||
This attaches a usb tablet to the UHCI adapter and a usb mass storage
|
||||
device to the EHCI adapter.
|
||||
|
||||
|
||||
Companion controller support
|
||||
----------------------------
|
||||
|
||||
Companion controller support has been added recently. The operational
|
||||
model described above with two completely separate busses still works
|
||||
fine. Additionally the UHCI and OHCI controllers got the ability to
|
||||
attach to a usb bus created by EHCI as companion controllers. This is
|
||||
done by specifying the masterbus and firstport properties. masterbus
|
||||
specifies the bus name the controller should attach to. firstport
|
||||
specifies the first port the controller should attach to, which is
|
||||
needed as usually one ehci controller with six ports has three uhci
|
||||
companion controllers with two ports each.
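The firstport arithmetic for the six-port example can be sketched as follows. The sketch assumes firstport values count from zero, which the text does not state; the function name is hypothetical.

```python
def companion_firstports(ehci_ports=6, ports_per_companion=2):
    """Return the firstport value for each companion controller."""
    if ehci_ports % ports_per_companion:
        raise ValueError("ports do not divide evenly among companions")
    # Assumption: ports are numbered from 0, so companions start at 0, 2, 4.
    return list(range(0, ehci_ports, ports_per_companion))
```

With the defaults this gives three companion controllers, one per pair of EHCI ports.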
|
||||
|
||||
There is a config file in docs which will do all this for you, just
|
||||
try ...
|
||||
|
||||
qemu -readconfig docs/ich9-ehci-uhci.cfg
|
||||
|
||||
... then use "bus=ehci.0" to assign your usb devices to that bus.
|
||||
|
||||
|
||||
xhci controller support
|
||||
-----------------------
|
||||
|
||||
There is also xhci host controller support available. It has received a lot
|
||||
less testing than ehci and there are a bunch of known limitations, so
|
||||
ehci may work better for you. On the other hand the xhci hardware
|
||||
design is much more virtualization-friendly, thus xhci emulation uses
|
||||
fewer resources (especially cpu). If you want to give xhci a try,
|
||||
use this to add the host controller ...
|
||||
|
||||
qemu -device nec-usb-xhci,id=xhci
|
||||
|
||||
... then use "bus=xhci.0" when assigning usb devices.
|
||||
|
||||
|
||||
More USB tips & tricks
|
||||
======================
|
||||
|
||||
Recently the usb pass through driver (also known as usb-host) and the
|
||||
qemu usb subsystem gained a few capabilities which are available only
|
||||
via qdev properties, i.e. when using '-device'.
|
||||
|
||||
|
||||
physical port addressing
|
||||
------------------------
|
||||
|
||||
First you can (for all usb devices) specify the physical port where
|
||||
the device will show up in the guest. This can be done using the
|
||||
"port" property. UHCI has two root ports (1,2). EHCI has four root
|
||||
ports (1-4), and the emulated (1.1) USB hub has eight ports.
|
||||
|
||||
Plugging a tablet into UHCI port 1 works like this:
|
||||
|
||||
-device usb-tablet,bus=usb-bus.0,port=1
|
||||
|
||||
Plugging a hub into UHCI port 2 works like this:
|
||||
|
||||
-device usb-hub,bus=usb-bus.0,port=2
|
||||
|
||||
Plugging a virtual usb stick into port 4 of the hub just plugged works
|
||||
this way:
|
||||
|
||||
-device usb-storage,bus=usb-bus.0,port=2.4,drive=...
|
||||
|
||||
You can do basically the same in the monitor using the device_add
|
||||
command. If you want to unplug devices too you should specify some
|
||||
unique id which you can use to refer to the device ...
|
||||
|
||||
(qemu) device_add usb-tablet,bus=usb-bus.0,port=1,id=my-tablet
|
||||
(qemu) device_del my-tablet
|
||||
|
||||
... when unplugging it with device_del.
|
||||
|
||||
|
||||
USB pass through hints
|
||||
----------------------
|
||||
|
||||
The usb-host driver has a bunch of properties to specify the device
|
||||
which should be passed to the guest:
|
||||
|
||||
hostbus=<nr> -- Specifies the bus number the device must be attached
|
||||
to.
|
||||
|
||||
hostaddr=<nr> -- Specifies the device address the device got
|
||||
assigned by the host os.
|
||||
|
||||
hostport=<str> -- Specifies the physical port the device is attached
|
||||
to.
|
||||
|
||||
vendorid=<hexnr> -- Specifies the vendor ID of the device.
|
||||
productid=<hexnr> -- Specifies the product ID of the device.
|
||||
|
||||
In theory you can combine all these properties as you like. In
|
||||
practice only a few combinations are useful:
|
||||
|
||||
(1) vendorid+productid -- match for a specific device, pass it to
|
||||
the guest when it shows up somewhere in the host.
|
||||
|
||||
(2) hostbus+hostport -- match for a specific physical port in the
|
||||
host, any device which is plugged in there gets passed to the
|
||||
guest.
|
||||
|
||||
(3) hostbus+hostaddr -- most useful for ad-hoc pass through as the
|
||||
hostaddr isn't stable, the next time you plug in the device it
|
||||
gets a new one ...
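How a combination of these properties selects a host device can be sketched as a simple filter. The dictionary layout, the function name and the sample values are hypothetical; the sketch only illustrates the idea that every property given must match, and it is not the usb-host implementation.

```python
def matches(filter_props, device):
    """True if every property in the filter equals the device's value."""
    return all(device.get(key) == value
               for key, value in filter_props.items())


# A made-up host device, described with the property names from above.
example_device = {
    "hostbus": 1,
    "hostaddr": 5,
    "hostport": "1",
    "vendorid": 0x1234,
    "productid": 0x5678,
}
```

Combination (1) corresponds to a filter with vendorid and productid only, combination (2) to hostbus and hostport, and combination (3) to hostbus and hostaddr.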
|
||||
|
||||
Note that USB 1.1 devices are handled by UHCI/OHCI and USB 2.0 by
|
||||
EHCI. That means a device plugged into the very same physical port
|
||||
may show up on different busses depending on the speed. The port I'm
|
||||
using for testing is bus 1 + port 1 for 2.0 devices and bus 3 + port 1
|
||||
for 1.1 devices. Passing through any device plugged into that port
|
||||
and assigning it to the correct bus can be done this way:
|
||||
|
||||
qemu -M pc ${otheroptions} \
|
||||
-usb \
|
||||
-device usb-ehci,id=ehci \
|
||||
-device usb-host,bus=usb-bus.0,hostbus=3,hostport=1 \
|
||||
-device usb-host,bus=ehci.0,hostbus=1,hostport=1
|
||||
|
||||
enjoy,
|
||||
Gerd
|
||||
|
||||
--
|
||||
Gerd Hoffmann <kraxel@redhat.com>
|
|
@ -1,105 +0,0 @@
|
|||
virtio balloon memory statistics
|
||||
================================
|
||||
|
||||
The virtio balloon driver supports guest memory statistics reporting. These
|
||||
statistics are available to QEMU users as QOM (QEMU Object Model) device
|
||||
properties via a polling mechanism.
|
||||
|
||||
Before querying the available stats, clients first have to enable polling.
|
||||
This is done by writing a time interval value (in seconds) to the
|
||||
guest-stats-polling-interval property. This value can be:
|
||||
|
||||
> 0 enables polling in the specified interval. If polling is already
|
||||
enabled, the polling time interval is changed to the new value
|
||||
|
||||
0 disables polling. Previous polled statistics are still valid and
|
||||
can be queried.
|
||||
|
||||
Once polling is enabled, the virtio-balloon device in QEMU will start
|
||||
polling the guest's balloon driver for new stats in the specified time
|
||||
interval.
|
||||
|
||||
To retrieve those stats, clients have to query the guest-stats property,
|
||||
which will return a dictionary containing:
|
||||
|
||||
o A key named 'stats', containing all available stats. If the guest
|
||||
doesn't support a particular stat, or if it couldn't be retrieved,
|
||||
its value will be -1. Currently, the following stats are supported:
|
||||
|
||||
- stat-swap-in
|
||||
- stat-swap-out
|
||||
- stat-major-faults
|
||||
- stat-minor-faults
|
||||
- stat-free-memory
|
||||
- stat-total-memory
|
||||
|
||||
o A key named 'last-update', which contains the last stats update
|
||||
timestamp in seconds. Since this timestamp is generated by the host,
|
||||
a buggy guest can't influence its value. The value is 0 if the guest
|
||||
has not updated the stats (yet).
|
||||
|
||||
It's also important to note the following:
|
||||
|
||||
- Previously polled statistics remain available even if the polling is
|
||||
later disabled
|
||||
|
||||
- As noted above, if a guest doesn't support a particular stat its value
|
||||
will always be -1. However, it's also possible that a guest temporarily
|
||||
couldn't update one or even all stats. If this happens, just wait for
|
||||
the next update
|
||||
|
||||
- Polling can be enabled even if the guest doesn't have stats support
|
||||
or the balloon driver wasn't loaded in the guest. If this is the case
|
||||
and stats are queried, last-update will be 0.
|
||||
|
||||
- The polling timer is only re-armed when the guest responds to the
|
||||
statistics request. This means that if a (buggy) guest doesn't ever
|
||||
respond to the request the timer will never be re-armed, which has
|
||||
the same effect as disabling polling
|
||||
|
||||
Here are a few examples. QEMU is started with '-balloon virtio', which
|
||||
generates '/machine/peripheral-anon/device[1]' as the QOM path for the
|
||||
balloon device.
|
||||
|
||||
Enable polling with a 2-second interval:
|
||||
|
||||
{ "execute": "qom-set",
|
||||
"arguments": { "path": "/machine/peripheral-anon/device[1]",
|
||||
"property": "guest-stats-polling-interval", "value": 2 } }
|
||||
|
||||
{ "return": {} }
|
||||
|
||||
Change polling to 10 seconds:
|
||||
|
||||
{ "execute": "qom-set",
|
||||
"arguments": { "path": "/machine/peripheral-anon/device[1]",
|
||||
"property": "guest-stats-polling-interval", "value": 10 } }
|
||||
|
||||
{ "return": {} }
|
||||
|
||||
Get stats:
|
||||
|
||||
{ "execute": "qom-get",
|
||||
"arguments": { "path": "/machine/peripheral-anon/device[1]",
|
||||
"property": "guest-stats" } }
|
||||
{
|
||||
"return": {
|
||||
"stats": {
|
||||
"stat-swap-out": 0,
|
||||
"stat-free-memory": 844943360,
|
||||
"stat-minor-faults": 219028,
|
||||
"stat-major-faults": 235,
|
||||
"stat-total-memory": 1044406272,
|
||||
"stat-swap-in": 0
|
||||
},
|
||||
"last-update": 1358529861
|
||||
}
|
||||
}
|
||||
|
||||
Disable polling:
|
||||
|
||||
{ "execute": "qom-set",
|
||||
"arguments": { "path": "/machine/peripheral-anon/device[1]",
|
||||
"property": "guest-stats-polling-interval", "value": 0 } }
|
||||
|
||||
{ "return": {} }
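The QMP messages shown above can also be built and interpreted from a script. The following is a minimal sketch that only constructs the JSON commands and filters a reply according to the rules in this document (-1 means a stat is unavailable, a last-update of 0 means no update yet). The helper names are hypothetical and no QMP transport is included.

```python
BALLOON_PATH = "/machine/peripheral-anon/device[1]"


def qom_set_interval(seconds, path=BALLOON_PATH):
    """Build the qom-set command; an interval of 0 disables polling."""
    return {"execute": "qom-set",
            "arguments": {"path": path,
                          "property": "guest-stats-polling-interval",
                          "value": seconds}}


def qom_get_stats(path=BALLOON_PATH):
    """Build the qom-get command for the guest-stats property."""
    return {"execute": "qom-get",
            "arguments": {"path": path, "property": "guest-stats"}}


def usable_stats(reply):
    """Return the stats with real values, or None if never updated."""
    result = reply["return"]
    if result["last-update"] == 0:
        return None
    # A value of -1 marks a stat the guest did not provide.
    return {key: value for key, value in result["stats"].items()
            if value != -1}
```

The dictionaries can be serialized with json.dumps() before being sent over a QMP connection.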
|
|
@ -1,50 +0,0 @@
|
|||
VNC LED state Pseudo-encoding
|
||||
=============================
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
This document describes the Pseudo-encoding of LED state for RFB, which
|
||||
is the protocol used in VNC; see the reference link below:
|
||||
|
||||
http://tigervnc.svn.sourceforge.net/viewvc/tigervnc/rfbproto/rfbproto.rst?content-type=text/plain
|
||||
|
||||
When accessing a guest by console through VNC, there might be a mismatch
|
||||
between the lock keys notification LED on the computer running the VNC
|
||||
client session and the current status of the lock keys on the guest
|
||||
machine.
|
||||
|
||||
To solve this problem, this document adds an LED state Pseudo-encoding
extension to the VNC protocol that deals with setting the LED state.
|
||||
|
||||
Pseudo-encoding
|
||||
---------------
|
||||
|
||||
This Pseudo-encoding, when requested by the client, declares to the server that the client supports
|
||||
LED state extensions to the protocol.
|
||||
|
||||
The Pseudo-encoding number for LED state is defined as:
|
||||
|
||||
======= ===============================================================
|
||||
Number Name
|
||||
======= ===============================================================
|
||||
-261 'LED state Pseudo-encoding'
|
||||
======= ===============================================================
|
||||
|
||||
LED state Pseudo-encoding
|
||||
--------------------------
|
||||
|
||||
The LED state Pseudo-encoding describes the encoding of LED state which
|
||||
consists of 3 bits. From left to right, the bits represent the Caps, Num,
and Scroll lock keys respectively. '1' indicates that the LED should be
on and '0' that it should be off.
|
||||
|
||||
Some example encodings follow:
|
||||
|
||||
======= ===============================================================
|
||||
Code Description
|
||||
======= ===============================================================
|
||||
100 CapsLock is on, NumLock and ScrollLock are off
|
||||
010 NumLock is on, CapsLock and ScrollLock are off
|
||||
111 CapsLock, NumLock and ScrollLock are on
|
||||
======= ===============================================================
|
|
@ -1,651 +0,0 @@
|
|||
= How to write QMP commands using the QAPI framework =
|
||||
|
||||
This document is a step-by-step guide on how to write new QMP commands using
|
||||
the QAPI framework. It also shows how to implement new style HMP commands.
|
||||
|
||||
This document doesn't discuss QMP protocol level details, nor does it dive
|
||||
into the QAPI framework implementation.
|
||||
|
||||
For an in-depth introduction to the QAPI framework, please refer to
|
||||
docs/qapi-code-gen.txt. For documentation about the QMP protocol, please
|
||||
check the files in QMP/.
|
||||
|
||||
== Overview ==
|
||||
|
||||
Generally speaking, the following steps should be taken in order to write a
|
||||
new QMP command.
|
||||
|
||||
1. Write the command and type specification(s) in the QAPI schema file
|
||||
(qapi-schema.json in the root source directory)
|
||||
|
||||
2. Write the QMP command itself, which is a regular C function. Preferably,
|
||||
the command should be exported by some QEMU subsystem. But it can also be
|
||||
added to the qmp.c file
|
||||
|
||||
3. At this point the command can be tested under the QMP protocol
|
||||
|
||||
4. Write the HMP command equivalent. This is not required and should only be
|
||||
done if it does make sense to have the functionality in HMP. The HMP command
|
||||
is implemented in terms of the QMP command
|
||||
|
||||
The following sections will demonstrate each of the steps above. We will start
|
||||
very simple and get more complex as we progress.
|
||||
|
||||
=== Testing ===
|
||||
|
||||
For all the examples in the next sections, the test setup is the same and is
|
||||
shown here.
|
||||
|
||||
First, QEMU should be started as:
|
||||
|
||||
# /path/to/your/source/qemu [...] \
|
||||
-chardev socket,id=qmp,port=4444,host=localhost,server \
|
||||
-mon chardev=qmp,mode=control,pretty=on
|
||||
|
||||
Then, in a different terminal:
|
||||
|
||||
$ telnet localhost 4444
|
||||
Trying 127.0.0.1...
|
||||
Connected to localhost.
|
||||
Escape character is '^]'.
|
||||
{
|
||||
"QMP": {
|
||||
"version": {
|
||||
"qemu": {
|
||||
"micro": 50,
|
||||
"minor": 15,
|
||||
"major": 0
|
||||
},
|
||||
"package": ""
|
||||
},
|
||||
"capabilities": [
|
||||
]
|
||||
}
|
||||
}
|
||||
|
||||
The above output is the QMP server saying you're connected. The server is
|
||||
actually in capabilities negotiation mode. To enter command mode, type:
|
||||
|
||||
{ "execute": "qmp_capabilities" }
|
||||
|
||||
Then the server should respond:
|
||||
|
||||
{
|
||||
"return": {
|
||||
}
|
||||
}
|
||||
|
||||
Which is QMP's way of saying "the latest command executed OK and didn't return
|
||||
any data". Now you're ready to enter the QMP example commands as explained in
|
||||
the following sections.
|
||||
|
||||
== Writing a command that doesn't return data ==
|
||||
|
||||
This is the simplest QMP command that can be written. Usually, this kind of
|
||||
command carries some meaningful action in QEMU but here it will just print
|
||||
"Hello, world" to the standard output.
|
||||
|
||||
Our command will be called "hello-world". It takes no arguments, nor does it
|
||||
return any data.
|
||||
|
||||
The first step is to add the following line to the bottom of the
|
||||
qapi-schema.json file:
|
||||
|
||||
{ 'command': 'hello-world' }
|
||||
|
||||
The "command" keyword defines a new QMP command. It's a JSON object. All
|
||||
schema entries are JSON objects. The line above will instruct the QAPI to
|
||||
generate any prototypes and the necessary code to marshal and unmarshal
|
||||
protocol data.
|
||||
|
||||
The next step is to write the "hello-world" implementation. As explained
|
||||
earlier, it's preferable for commands to live in QEMU subsystems. But
|
||||
"hello-world" doesn't pertain to any, so we put its implementation in qmp.c:
|
||||
|
||||
void qmp_hello_world(Error **errp)
|
||||
{
|
||||
printf("Hello, world!\n");
|
||||
}
|
||||
|
||||
There are a few things to be noticed:
|
||||
|
||||
1. QMP command implementation functions must be prefixed with "qmp_"
|
||||
2. qmp_hello_world() returns void; this is in accordance with the fact that the
|
||||
command doesn't return any data
|
||||
3. It takes an "Error **" argument. This is required. Later we will see how to
|
||||
return errors and take additional arguments. The Error argument should not
|
||||
be touched if the command doesn't return errors
|
||||
4. We won't add the function's prototype. That's automatically done by the QAPI
|
||||
5. Printing to the terminal is discouraged for QMP commands; we do it here
|
||||
because it's the easiest way to demonstrate a QMP command
|
||||
|
||||
Now a little hack is needed. As we're still using the old QMP server we need
|
||||
to add the new command to its internal dispatch table. This step won't be
|
||||
required in the near future. Open the qmp-commands.hx file and add the
|
||||
following at the bottom:
|
||||
|
||||
{
|
||||
.name = "hello-world",
|
||||
.args_type = "",
|
||||
.mhandler.cmd_new = qmp_marshal_input_hello_world,
|
||||
},
|
||||
|
||||
You're done. Now build qemu, run it as suggested in the "Testing" section,
|
||||
and then type the following QMP command:
|
||||
|
||||
{ "execute": "hello-world" }
|
||||
|
||||
Then check the terminal running qemu and look for the "Hello, world" string. If
|
||||
you don't see it then something went wrong.
|
||||
|
||||
=== Arguments ===
|
||||
|
||||
Let's add an argument called "message" to our "hello-world" command. The new
|
||||
argument will contain the string to be printed to stdout. It's an optional
|
||||
argument; if it's not present, we print our default "Hello, world" string.
|
||||
|
||||
The first change we have to do is to modify the command specification in the
|
||||
schema file to the following:
|
||||
|
||||
{ 'command': 'hello-world', 'data': { '*message': 'str' } }
|
||||
|
||||
Notice the new 'data' member in the schema. It's a JSON object in which each
|
||||
element is an argument to the command in question. Also notice the asterisk,
|
||||
it's used to mark the argument optional (that means that you shouldn't use it
|
||||
for mandatory arguments). Finally, 'str' is the argument's type, which
|
||||
stands for "string". The QAPI also supports integers, booleans, enumerations
|
||||
and user defined types.
|
||||
|
||||
Now, let's update our C implementation in qmp.c:
|
||||
|
||||
void qmp_hello_world(bool has_message, const char *message, Error **errp)
|
||||
{
|
||||
if (has_message) {
|
||||
printf("%s\n", message);
|
||||
} else {
|
||||
printf("Hello, world\n");
|
||||
}
|
||||
}
|
||||
|
||||
There are two important details to be noticed:
|
||||
|
||||
1. All optional arguments are accompanied by a 'has_' boolean, which is set
|
||||
to true if the optional argument is present and to false otherwise
|
||||
2. The C implementation signature must follow the schema's argument ordering,
|
||||
which is defined by the "data" member
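The two points above can be tried outside of QEMU with a small stand-in. The sketch below is not the generated QAPI code; demo_hello_world() is a hypothetical function that follows the same calling convention but writes into a buffer, so the result can be inspected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Stand-in for qmp_hello_world(): the optional "message" argument is
 * preceded by its "has_message" flag, in schema order. The value of
 * `message` is only looked at when has_message is true.
 * Returns the length of the text that was (or would have been) written.
 */
int demo_hello_world(bool has_message, const char *message,
                     char *out, size_t len)
{
    const char *text;

    assert(out != NULL);
    assert(!has_message || message != NULL);
    text = has_message ? message : "Hello, world";
    return snprintf(out, len, "%s\n", text);
}
```

Calling it with has_message set to false ignores the message pointer entirely, which is why the flag, and not a NULL check, decides whether the argument was supplied.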
|
||||
|
||||
The last step is to update the qmp-commands.hx file:
|
||||
|
||||
{
|
||||
.name = "hello-world",
|
||||
.args_type = "message:s?",
|
||||
.mhandler.cmd_new = qmp_marshal_input_hello_world,
|
||||
},
|
||||
|
||||
Notice that the "args_type" member got our "message" argument. The character
|
||||
"s" stands for "string" and "?" means it's optional. This too must be ordered
|
||||
according to the C implementation and schema file. You can look for more
|
||||
examples in the qmp-commands.hx file if you need to define more arguments.
|
||||
|
||||
Again, this step won't be required in the future.
|
||||
|
||||
Time to test our new version of the "hello-world" command. Build qemu, run it as
|
||||
described in the "Testing" section and then send two commands:
|
||||
|
||||
{ "execute": "hello-world" }
|
||||
{
|
||||
"return": {
|
||||
}
|
||||
}
|
||||
|
||||
{ "execute": "hello-world", "arguments": { "message": "We love qemu" } }
|
||||
{
|
||||
"return": {
|
||||
}
|
||||
}
|
||||
|
||||
You should see "Hello, world" and "We love qemu" in the terminal running qemu;
|
||||
if you don't see these strings, then something went wrong.
|
||||
|
||||
=== Errors ===
|
||||
|
||||
QMP commands should use the error interface exported by the error.h header
|
||||
file. Basically, errors are set by calling the error_set() function.
|
||||
|
||||
Let's say we don't accept the string "message" to contain the word "love". If
|
||||
it does contain it, we want the "hello-world" command to return an error:
|
||||
|
||||
void qmp_hello_world(bool has_message, const char *message, Error **errp)
|
||||
{
|
||||
if (has_message) {
|
||||
if (strstr(message, "love")) {
|
||||
error_set(errp, ERROR_CLASS_GENERIC_ERROR,
|
||||
"the word 'love' is not allowed");
|
||||
return;
|
||||
}
|
||||
printf("%s\n", message);
|
||||
} else {
|
||||
printf("Hello, world\n");
|
||||
}
|
||||
}
|
||||
|
||||
The first argument to the error_set() function is the Error pointer to pointer,
|
||||
which is passed to all QMP functions. The second argument is an ErrorClass
|
||||
value, which should be ERROR_CLASS_GENERIC_ERROR most of the time (more
|
||||
details about error classes are given below). The third argument is a
human-readable description of the error; this is a free-form printf-like string.
|
||||
|
||||
Let's test the example above. Build qemu, run it as defined in the "Testing"
|
||||
section, and then issue the following command:
|
||||
|
||||
{ "execute": "hello-world", "arguments": { "message": "all you need is love" } }
|
||||
|
||||
The QMP server's response should be:
|
||||
|
||||
{
|
||||
"error": {
|
||||
"class": "GenericError",
|
||||
"desc": "the word 'love' is not allowed"
|
||||
}
|
||||
}
|
||||
|
||||
As a general rule, all QMP errors should use ERROR_CLASS_GENERIC_ERROR. There
|
||||
are two exceptions to this rule:
|
||||
|
||||
1. A non-generic ErrorClass value exists* for the failure you want to report
|
||||
(eg. DeviceNotFound)
|
||||
|
||||
2. Management applications have to take special action on the failure you
|
||||
want to report, hence you have to add a new ErrorClass value so that they
|
||||
can check for it
|
||||
|
||||
If the failure you want to report doesn't fall in one of the two cases above,
|
||||
just report ERROR_CLASS_GENERIC_ERROR.
|
||||
|
||||
* All existing ErrorClass values are defined in the qapi-schema.json file
|
||||
|
||||
=== Command Documentation ===
|
||||
|
||||
There's only one step missing to make "hello-world"'s implementation complete,
|
||||
and that's its documentation in the schema file.
|
||||
|
||||
This is very important. No QMP command will be accepted in QEMU without proper
|
||||
documentation.
|
||||
|
||||
There are many examples of such documentation in the schema file already, but
|
||||
here goes "hello-world"'s new entry for the qapi-schema.json file:
|
||||
|
||||
##
|
||||
# @hello-world
|
||||
#
|
||||
# Print a client provided string to the standard output stream.
|
||||
#
|
||||
# @message: #optional string to be printed
|
||||
#
|
||||
# Returns: Nothing on success.
|
||||
#
|
||||
# Notes: if @message is not provided, the "Hello, world" string will
|
||||
# be printed instead
|
||||
#
|
||||
# Since: <next qemu stable release, eg. 1.0>
|
||||
##
|
||||
{ 'command': 'hello-world', 'data': { '*message': 'str' } }
|
||||
|
||||
Please, note that the "Returns" clause is optional if a command doesn't return
|
||||
any data nor any errors.
|
||||
|
||||
=== Implementing the HMP command ===
|
||||
|
||||
Now that the QMP command is in place, we can also make it available in the human
|
||||
monitor (HMP).
|
||||
|
||||
With the introduction of the QAPI, HMP commands make QMP calls. Most of the
|
||||
time HMP commands are simple wrappers. All HMP command implementations live in
|
||||
the hmp.c file.
|
||||
|
||||
Here's the implementation of the "hello-world" HMP command:
|
||||
|
||||
void hmp_hello_world(Monitor *mon, const QDict *qdict)
|
||||
{
|
||||
const char *message = qdict_get_try_str(qdict, "message");
|
||||
Error *err = NULL;
|
||||
|
||||
qmp_hello_world(!!message, message, &err);
|
||||
if (err) {
|
||||
monitor_printf(mon, "%s\n", error_get_pretty(err));
|
||||
error_free(err);
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
Also, you have to add the function's prototype to the hmp.h file.
|
||||
|
||||
There are three important points to be noticed:
|
||||
|
||||
1. The "mon" and "qdict" arguments are mandatory for all HMP functions. The
|
||||
former is the monitor object. The latter is how the monitor passes
|
||||
arguments entered by the user to the command implementation
|
||||
2. hmp_hello_world() performs error checking. In this example we just print
|
||||
the error description to the user, but we could do more, like taking
|
||||
different actions depending on the error qmp_hello_world() returns
|
||||
3. The "err" variable must be initialized to NULL before performing the
|
||||
QMP call
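The error-passing convention in points 2 and 3 can be modelled without QEMU. The following sketch uses toy stand-ins (DemoError, demo_error_set(), demo_error_free() are all hypothetical names) and is not QEMU's Error API. It only illustrates why the caller starts with a NULL pointer and checks it after the call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for QEMU's Error object; not the real API. */
typedef struct DemoError {
    char *msg;
} DemoError;

void demo_error_set(DemoError **errp, const char *msg)
{
    DemoError *err;

    if (errp == NULL) {
        return;                 /* caller is not interested in errors */
    }
    assert(*errp == NULL);      /* caller must have initialized it to NULL */
    err = malloc(sizeof(*err));
    if (err == NULL) {
        abort();
    }
    err->msg = malloc(strlen(msg) + 1);
    if (err->msg == NULL) {
        abort();
    }
    strcpy(err->msg, msg);
    *errp = err;
}

void demo_error_free(DemoError *err)
{
    if (err != NULL) {
        free(err->msg);
        free(err);
    }
}

/* Toy command: fails when the message contains the word "love". */
void demo_command(const char *message, DemoError **errp)
{
    if (strstr(message, "love") != NULL) {
        demo_error_set(errp, "the word 'love' is not allowed");
    }
}

/* Caller pattern: returns 1 if the command reported an error, else 0. */
int demo_caller(const char *message)
{
    DemoError *err = NULL;      /* must be NULL before the call */

    demo_command(message, &err);
    if (err != NULL) {
        demo_error_free(err);   /* the caller owns the error object */
        return 1;
    }
    return 0;
}
```

If "err" were left uninitialized, the check after the call could see a garbage pointer even though the command succeeded.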
|
||||
|
||||
There's one last step to actually make the command available to monitor users,
|
||||
we should add it to the hmp-commands.hx file:
|
||||
|
||||
{
|
||||
.name = "hello-world",
|
||||
.args_type = "message:s?",
|
||||
.params = "hello-world [message]",
|
||||
.help = "Print message to the standard output",
|
||||
.mhandler.cmd = hmp_hello_world,
|
||||
},
|
||||
|
||||
STEXI
|
||||
@item hello_world @var{message}
|
||||
@findex hello_world
|
||||
Print message to the standard output
|
||||
ETEXI
|
||||
|
||||
To test this you have to open a user monitor and issue the "hello-world"
|
||||
command. It might be instructive to check the command's documentation with
|
||||
HMP's "help" command.
|
||||
|
||||
Please, check the "-monitor" command-line option to know how to open a user
|
||||
monitor.
|
||||
|
||||
== Writing a command that returns data ==
|
||||
|
||||
A QMP command is capable of returning any data the QAPI supports like integers,
|
||||
strings, booleans, enumerations and user defined types.
|
||||
|
||||
In this section we will focus on user defined types. Please, check the QAPI
|
||||
documentation for information about the other types.
|
||||
|
||||
=== User Defined Types ===
|
||||
|
||||
FIXME This example needs to be redone after commit 6d32717
|
||||
|
||||
For this example we will write the query-alarm-clock command, which returns
|
||||
information about QEMU's timer alarm. For more information about it, please
|
||||
check the "-clock" command-line option.
|
||||
|
||||
We want to return two pieces of information. The first one is the alarm clock's
|
||||
name. The second one is when the next alarm will fire. The former information is
|
||||
returned as a string, the latter is an integer in nanoseconds (which is not
|
||||
very useful in practice, as the timer has probably already fired when the
|
||||
information reaches the client).
|
||||
|
||||
The best way to return that data is to create a new QAPI type, as shown below:
|
||||
|
||||
##
|
||||
# @QemuAlarmClock
|
||||
#
|
||||
# QEMU alarm clock information.
|
||||
#
|
||||
# @clock-name: The alarm clock method's name.
|
||||
#
|
||||
# @next-deadline: #optional The time (in nanoseconds) the next alarm will fire.
|
||||
#
|
||||
# Since: 1.0
|
||||
##
|
||||
{ 'type': 'QemuAlarmClock',
|
||||
'data': { 'clock-name': 'str', '*next-deadline': 'int' } }
|
||||
|
||||
The "type" keyword defines a new QAPI type. Its "data" member contains the
|
||||
type's members. In this example our members are the "clock-name" and the
|
||||
"next-deadline" one, which is optional.
|
||||
|
||||
Now let's define the query-alarm-clock command:
|
||||
|
||||
##
|
||||
# @query-alarm-clock
|
||||
#
|
||||
# Return information about QEMU's alarm clock.
|
||||
#
|
||||
# Returns a @QemuAlarmClock instance describing the alarm clock method
|
||||
# being currently used by QEMU (this is usually set by the '-clock'
|
||||
# command-line option).
|
||||
#
|
||||
# Since: 1.0
|
||||
##
|
||||
{ 'command': 'query-alarm-clock', 'returns': 'QemuAlarmClock' }
|
||||
|
||||
Notice the "returns" keyword. As its name suggests, it's used to define the
|
||||
data returned by a command.
|
||||
|
||||
It's time to implement the qmp_query_alarm_clock() function, you can put it
|
||||
in the qemu-timer.c file:
|
||||
|
||||
QemuAlarmClock *qmp_query_alarm_clock(Error **errp)
|
||||
{
|
||||
QemuAlarmClock *clock;
|
||||
int64_t deadline;
|
||||
|
||||
clock = g_malloc0(sizeof(*clock));
|
||||
|
||||
deadline = qemu_next_alarm_deadline();
|
||||
if (deadline > 0) {
|
||||
clock->has_next_deadline = true;
|
||||
clock->next_deadline = deadline;
|
||||
}
|
||||
clock->clock_name = g_strdup(alarm_timer->name);
|
||||
|
||||
return clock;
|
||||
}
|
||||
|
||||
There are a number of things to be noticed:
|
||||
|
||||
1. The QemuAlarmClock type is automatically generated by the QAPI framework,
|
||||
its members correspond to the type's specification in the schema file
|
||||
2. As specified in the schema file, the function returns a QemuAlarmClock
|
||||
instance and takes no arguments (besides the "errp" one, which is mandatory
|
||||
for all QMP functions)
|
||||
3. The "clock" variable (which will point to our QAPI type instance) is
|
||||
allocated by the regular g_malloc0() function. Note that we chose to
|
||||
initialize the memory to zero. This is recommended for all QAPI types, as
|
||||
it helps avoid bad surprises (especially with booleans)
|
||||
4. Remember that "next_deadline" is optional? All optional members have a
|
||||
'has_TYPE_NAME' member that should be properly set by the implementation,
|
||||
as shown above
|
||||
5. Even static strings, such as "alarm_timer->name", should be dynamically
|
||||
allocated by the implementation. This is so because the QAPI also generates
|
||||
a function to free its types and it cannot distinguish between dynamically
|
||||
or statically allocated strings
|
||||
6. You have to include the "qmp-commands.h" header file in qemu-timer.c,
|
||||
otherwise qemu won't build
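Points 3 to 5 come down to ownership: the free function releases every string it finds, so every string must come from the heap. Here is a minimal sketch with plain calloc()/free() in place of the GLib helpers. DemoClock and its functions are hypothetical and only mirror the shape of the generated type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Mirrors the shape of the generated type; not the generated code itself. */
typedef struct DemoClock {
    char *clock_name;           /* always heap-allocated, owned by the struct */
    bool has_next_deadline;
    int64_t next_deadline;
} DemoClock;

DemoClock *demo_clock_new(const char *name, int64_t deadline)
{
    DemoClock *clk;

    assert(name != NULL);
    clk = calloc(1, sizeof(*clk));  /* zeroed: has_next_deadline starts false */
    if (clk == NULL) {
        abort();
    }
    if (deadline > 0) {
        clk->has_next_deadline = true;
        clk->next_deadline = deadline;
    }
    /* Copy even if `name` is a static string, so the free function is safe. */
    clk->clock_name = malloc(strlen(name) + 1);
    if (clk->clock_name == NULL) {
        abort();
    }
    strcpy(clk->clock_name, name);
    return clk;
}

void demo_clock_free(DemoClock *clk)
{
    if (clk == NULL) {
        return;
    }
    free(clk->clock_name);      /* would be invalid on a string literal */
    free(clk);
}
```

Storing the literal pointer directly in clock_name would make demo_clock_free() call free() on memory that was never allocated.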
|
||||
|
||||
The last step is to add the corresponding entry in the qmp-commands.hx file:
|
||||
|
||||
{
|
||||
.name = "query-alarm-clock",
|
||||
.args_type = "",
|
||||
.mhandler.cmd_new = qmp_marshal_input_query_alarm_clock,
|
||||
},
|
||||
|
||||
Time to test the new command. Build qemu, run it as described in the "Testing"
|
||||
section and try this:
|
||||
|
||||
{ "execute": "query-alarm-clock" }
|
||||
{
|
||||
"return": {
|
||||
"next-deadline": 2368219,
|
||||
"clock-name": "dynticks"
|
||||
}
|
||||
}
|
||||
|
||||
==== The HMP command ====
|
||||
|
||||
Here's the HMP counterpart of the query-alarm-clock command:
|
||||
|
||||
void hmp_info_alarm_clock(Monitor *mon)
|
||||
{
|
||||
QemuAlarmClock *clock;
|
||||
Error *err = NULL;
|
||||
|
||||
clock = qmp_query_alarm_clock(&err);
|
||||
if (err) {
|
||||
monitor_printf(mon, "Could not query alarm clock information\n");
|
||||
error_free(err);
|
||||
return;
|
||||
}
|
||||
|
||||
monitor_printf(mon, "Alarm clock method in use: '%s'\n", clock->clock_name);
|
||||
if (clock->has_next_deadline) {
|
||||
monitor_printf(mon, "Next alarm will fire in %" PRId64 " nanoseconds\n",
|
||||
clock->next_deadline);
|
||||
}
|
||||
|
||||
qapi_free_QemuAlarmClock(clock);
|
||||
}
|
||||
|
||||
It's important to notice that hmp_info_alarm_clock() calls
|
||||
qapi_free_QemuAlarmClock() to free the data returned by qmp_query_alarm_clock().
|
||||
For user defined types, the QAPI will generate a qapi_free_QAPI_TYPE_NAME()
|
||||
function and that's what you have to use to free the types you define and
|
||||
qapi_free_QAPI_TYPE_NAMEList() for list types (explained in the next section).
|
||||
If the QMP call returns a string, then you should use g_free() to free it.
|
||||
|
||||
Also note that hmp_info_alarm_clock() performs error handling. That's not
|
||||
strictly required if you're sure the QMP function doesn't return errors, but
|
||||
it's good practice to always check for errors.
|
||||
|
||||
Another important detail is that HMP's "info" commands don't go into the
|
||||
hmp-commands.hx. Instead, they go into the info_cmds[] table, which is defined
|
||||
in the monitor.c file. The entry for the "info alarmclock" follows:
|
||||
|
||||
{
|
||||
.name = "alarmclock",
|
||||
.args_type = "",
|
||||
.params = "",
|
||||
.help = "show information about the alarm clock",
|
||||
.mhandler.info = hmp_info_alarm_clock,
|
||||
},
|
||||
|
||||
To test this, run qemu and type "info alarmclock" in the user monitor.
|
||||
|
||||
=== Returning Lists ===
|
||||
|
||||
For this example, we're going to return all available methods for the timer
|
||||
alarm, which is pretty much what the command-line option "-clock ?" does,
|
||||
except that we're also going to inform which method is in use.
|
||||
|
||||
The first step is to define a new type:
|
||||
|
||||
##
|
||||
# @TimerAlarmMethod
|
||||
#
|
||||
# Timer alarm method information.
|
||||
#
|
||||
# @method-name: The method's name.
|
||||
#
|
||||
# @current: true if this alarm method is currently in use, false otherwise
|
||||
#
|
||||
# Since: 1.0
|
||||
##
|
||||
{ 'type': 'TimerAlarmMethod',
|
||||
'data': { 'method-name': 'str', 'current': 'bool' } }
|
||||
|
||||
The command will be called "query-alarm-methods", here is its schema
|
||||
specification:
|
||||
|
||||
##
|
||||
# @query-alarm-methods
|
||||
#
|
||||
# Returns information about available alarm methods.
|
||||
#
|
||||
# Returns: a list of @TimerAlarmMethod for each method
|
||||
#
|
||||
# Since: 1.0
|
||||
##
|
||||
{ 'command': 'query-alarm-methods', 'returns': ['TimerAlarmMethod'] }
|
||||
|
||||
Notice the syntax for returning lists "'returns': ['TimerAlarmMethod']", this
|
||||
should be read as "returns a list of TimerAlarmMethod instances".
|
||||
|
||||
The C implementation follows:
|
||||
|
||||
TimerAlarmMethodList *qmp_query_alarm_methods(Error **errp)
|
||||
{
|
||||
TimerAlarmMethodList *method_list = NULL;
|
||||
const struct qemu_alarm_timer *p;
|
||||
bool current = true;
|
||||
|
||||
for (p = alarm_timers; p->name; p++) {
|
||||
TimerAlarmMethodList *info = g_malloc0(sizeof(*info));
|
||||
info->value = g_malloc0(sizeof(*info->value));
|
||||
info->value->method_name = g_strdup(p->name);
|
||||
info->value->current = current;
|
||||
|
||||
current = false;
|
||||
|
||||
info->next = method_list;
|
||||
method_list = info;
|
||||
}
|
||||
|
||||
return method_list;
|
||||
}
|
||||
|
||||
The most important difference from the previous examples is the
|
||||
TimerAlarmMethodList type, which is automatically generated by the QAPI from
|
||||
the TimerAlarmMethod type.
|
||||
|
||||
Each list node is represented by a TimerAlarmMethodList instance. We have to
|
||||
allocate it, and that's done inside the for loop: the "info" pointer points to
|
||||
an allocated node. We also have to allocate the node's contents, which is
|
||||
stored in its "value" member. In our example, the "value" member is a pointer
|
||||
to a TimerAlarmMethod instance.
|
||||
|
||||
Notice that the "current" variable is used as "true" only in the first
|
||||
iteration of the loop. That's because the alarm timer method in use is the
|
||||
first element of the alarm_timers array. Also notice that QAPI lists are handled
|
||||
by hand and we return the head of the list.
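One consequence of building the list by prepending is that the result comes out in reverse array order. The sketch below reproduces the loop with stand-in types and a NULL-terminated array of names, so it can be run outside of QEMU. All Demo* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct DemoMethod {
    char *method_name;
    bool current;
} DemoMethod;

typedef struct DemoMethodList {
    DemoMethod *value;              /* the node's contents */
    struct DemoMethodList *next;
} DemoMethodList;

/* Allocate zeroed memory or abort, similar in spirit to g_malloc0(). */
static void *demo_alloc0(size_t size)
{
    void *p = calloc(1, size);

    if (p == NULL) {
        abort();
    }
    return p;
}

DemoMethodList *demo_query_methods(const char *const *names)
{
    DemoMethodList *head = NULL;
    bool current = true;            /* the first array entry is the one in use */
    size_t i;

    assert(names != NULL);
    for (i = 0; names[i] != NULL; i++) {
        DemoMethodList *node = demo_alloc0(sizeof(*node));

        node->value = demo_alloc0(sizeof(*node->value));
        node->value->method_name = demo_alloc0(strlen(names[i]) + 1);
        strcpy(node->value->method_name, names[i]);
        node->value->current = current;
        current = false;

        node->next = head;          /* prepend: the list ends up reversed */
        head = node;
    }
    return head;
}

void demo_free_methods(DemoMethodList *list)
{
    while (list != NULL) {
        DemoMethodList *next = list->next;

        free(list->value->method_name);
        free(list->value);
        free(list);
        list = next;
    }
}
```

With the names "dynticks" and "unix" in that order, the head of the returned list is "unix" and the entry marked as current comes last.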
|
||||
|
||||
To test this you have to add the corresponding qmp-commands.hx entry:
|
||||
|
||||
{
|
||||
.name = "query-alarm-methods",
|
||||
.args_type = "",
|
||||
.mhandler.cmd_new = qmp_marshal_input_query_alarm_methods,
|
||||
},
|
||||
|
||||
Now build qemu, run it as explained in the "Testing" section and try our new
|
||||
command:
|
||||
|
||||
{ "execute": "query-alarm-methods" }
|
||||
{
|
||||
"return": [
|
||||
{
|
||||
"current": false,
|
||||
"method-name": "unix"
|
||||
},
|
||||
{
|
||||
"current": true,
|
||||
"method-name": "dynticks"
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
The HMP counterpart is a bit more complex than previous examples because it
|
||||
has to traverse the list; it's shown below for reference:
|
||||
|
||||
void hmp_info_alarm_methods(Monitor *mon)
|
||||
{
|
||||
TimerAlarmMethodList *method_list, *method;
|
||||
Error *err = NULL;
|
||||
|
||||
method_list = qmp_query_alarm_methods(&err);
|
||||
if (err) {
|
||||
monitor_printf(mon, "Could not query alarm methods\n");
|
||||
error_free(err);
|
||||
return;
|
||||
}
|
||||
|
||||
for (method = method_list; method; method = method->next) {
|
||||
monitor_printf(mon, "%c %s\n", method->value->current ? '*' : ' ',
|
||||
method->value->method_name);
|
||||
}
|
||||
|
||||
qapi_free_TimerAlarmMethodList(method_list);
|
||||
}
|
|
@ -1,128 +0,0 @@
|
|||
XBZRLE (Xor Based Zero Run Length Encoding)
|
||||
===========================================
|
||||
|
||||
Using XBZRLE (Xor Based Zero Run Length Encoding) allows for the reduction
|
||||
of VM downtime and the total live-migration time of virtual machines.
|
||||
It is particularly useful for virtual machines running memory write intensive
|
||||
workloads that are typical of large enterprise applications such as SAP ERP
|
||||
Systems, and generally speaking for any application that uses a sparse memory
|
||||
update pattern.
|
||||
|
||||
Instead of sending the changed guest memory page this solution will send a
|
||||
compressed version of the updates, thus reducing the amount of data sent during
|
||||
live migration.
|
||||
In order to be able to calculate the update, the previous memory pages need to
|
||||
be stored on the source. Those pages are stored in a dedicated cache
|
||||
(hash table) and are accessed by their address.
|
||||
The larger the cache size the better the chances are that the page has already
|
||||
been stored in the cache.
|
||||
A small cache size will result in high cache miss rate.
|
||||
Cache size can be changed before and during migration.
|
||||
|
||||
Format
|
||||
=======
|
||||
|
||||
The compression format performs an XOR between the previous and current content
|
||||
of the page, where zero represents an unchanged value.
|
||||
The page data delta is represented by zero and non zero runs.
|
||||
A zero run is represented by its length (in bytes).
|
||||
A non zero run is represented by its length (in bytes) and the new data.
|
||||
The run length is encoded using ULEB128 (http://en.wikipedia.org/wiki/LEB128)
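For reference, ULEB128 stores an unsigned integer 7 bits at a time, least significant group first, and sets the top bit of every byte except the last. A minimal encoder sketch for 32-bit values follows. It is a general-purpose illustration and not necessarily how QEMU's own helper is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode `value` as ULEB128 into `out`.
 * Returns the number of bytes written (at most 5 for a 32-bit value).
 */
size_t uleb128_encode(uint32_t value, uint8_t *out)
{
    size_t n = 0;

    assert(out != NULL);
    do {
        uint8_t byte = value & 0x7f;    /* low 7 bits */

        value >>= 7;
        if (value != 0) {
            byte |= 0x80;               /* more bytes follow */
        }
        out[n++] = byte;
    } while (value != 0);
    return n;
}
```

A run length of 1001 encodes to the two bytes e9 07, and any length below 128 fits in a single byte.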
|
||||
|
||||
There can be more than one valid encoding, the sender may send a longer encoding
|
||||
for the benefit of reducing computation cost.
|
||||
|
||||
page = zrun nzrun
|
||||
| zrun nzrun page
|
||||
|
||||
zrun = length
|
||||
|
||||
nzrun = length byte...
|
||||
|
||||
length = uleb128 encoded integer
|
||||
|
||||
On the sender side XBZRLE is used as a compact delta encoding of page updates,
|
||||
retrieving the old page content from the cache (default size of 64 MB). The
|
||||
receiving side uses the existing page's content and XBZRLE to decode the new
|
||||
page's content.
|
||||
|
||||
This work was originally based on research results published in
|
||||
VEE 2011: Evaluation of Delta Compression Techniques for Efficient Live
|
||||
Migration of Large Virtual Machines by Benoit, Svard, Tordsson and Elmroth.
|
||||
The XBRLE delta encoder described there was later improved, which resulted
in XBZRLE.
|
||||
|
||||
XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads making it
|
||||
ideal for in-line, real-time encoding such as is needed for live-migration.
|
||||
|
||||
Example
|
||||
old buffer:
|
||||
1001 zeros
|
||||
05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 68 00 00 6b 00 6d
|
||||
3074 zeros
|
||||
|
||||
new buffer:
|
||||
1001 zeros
|
||||
01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 68 00 00 67 00 69
|
||||
3074 zeros
|
||||
|
||||
encoded buffer:
|
||||
|
||||
encoded length 24
|
||||
e9 07 0f 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 03 01 67 01 01 69
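The example can be checked by decoding it. The sketch below applies an encoded buffer to the old page content, following the grammar given earlier: alternating zero runs, which are skipped, and non-zero runs, whose bytes are copied. The demo_* functions are illustrative and are not QEMU's decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read one ULEB128 integer; returns bytes consumed, or 0 on bad input. */
static size_t demo_uleb128_decode(const uint8_t *in, size_t len,
                                  uint32_t *value)
{
    uint32_t result = 0;
    unsigned shift = 0;
    size_t i;

    for (i = 0; i < len && shift < 32; i++, shift += 7) {
        result |= (uint32_t)(in[i] & 0x7f) << shift;
        if ((in[i] & 0x80) == 0) {
            *value = result;
            return i + 1;
        }
    }
    return 0;   /* truncated or too long for 32 bits */
}

/*
 * Apply an encoded delta to `page`, which holds the old content on entry
 * and the new content on return. Returns 0 on success, -1 on bad input.
 */
int demo_xbzrle_decode(const uint8_t *enc, size_t enc_len,
                       uint8_t *page, size_t page_len)
{
    size_t in = 0, out = 0;

    assert(enc != NULL && page != NULL);
    while (in < enc_len) {
        uint32_t run;
        size_t used;

        /* zrun: these bytes did not change, so skip over them */
        used = demo_uleb128_decode(enc + in, enc_len - in, &run);
        if (used == 0 || run > page_len - out) {
            return -1;
        }
        in += used;
        out += run;

        /* nzrun: a length followed by that many new bytes */
        used = demo_uleb128_decode(enc + in, enc_len - in, &run);
        if (used == 0) {
            return -1;
        }
        in += used;
        if (run > page_len - out || run > enc_len - in) {
            return -1;
        }
        memcpy(page + out, enc + in, run);
        in += run;
        out += run;
    }
    return 0;
}
```

Note that the trailing 3074 unchanged bytes need no entry in the encoded buffer: the decoder simply stops when the input is exhausted.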
|
||||
|
||||
Usage
|
||||
======================
|
||||
1. Verify the destination QEMU version is able to decode the new format.
|
||||
{qemu} info migrate_capabilities
|
||||
{qemu} xbzrle: off , ...
|
||||
|
||||
2. Activate xbzrle on both source and destination:
|
||||
{qemu} migrate_set_capability xbzrle on
|
||||
|
||||
3. Set the XBZRLE cache size - the cache size is in MBytes and should be a
|
||||
power of 2. The cache default value is 64MBytes. (on source only)
|
||||
{qemu} migrate_set_cache_size 256m
|
||||
|
||||
4. Start outgoing migration
|
||||
{qemu} migrate -d tcp:destination.host:4444
|
||||
{qemu} info migrate
|
||||
capabilities: xbzrle: on
|
||||
Migration status: active
|
||||
transferred ram: A kbytes
|
||||
remaining ram: B kbytes
|
||||
total ram: C kbytes
|
||||
total time: D milliseconds
|
||||
duplicate: E pages
|
||||
normal: F pages
|
||||
normal bytes: G kbytes
|
||||
cache size: H bytes
|
||||
xbzrle transferred: I kbytes
|
||||
xbzrle pages: J pages
|
||||
xbzrle cache miss: K
|
||||
xbzrle overflow : L
|
||||
|
||||
xbzrle cache-miss: the number of cache misses to date - high cache-miss rate
|
||||
indicates that the cache size is set too low.
|
||||
xbzrle overflow: the number of overflows, that is, pages for which the delta
|
||||
could not be compressed. This can happen if the changes in the pages are too
|
||||
large or there are many short changes; for example, changing every second byte
|
||||
(half a page).
|
||||
|
||||
Testing: Testing indicated that live migration with XBZRLE was completed in 110
|
||||
seconds, whereas without it the migration was unable to complete.
|
||||
|
||||
A simple synthetic memory r/w load generator:
|
||||
.. #include <stdlib.h>
.. #include <stdio.h>
|
||||
.. int main()
|
||||
.. {
|
||||
.. char *buf = (char *) calloc(4096, 4096);
|
||||
.. while (1) {
|
||||
.. int i;
|
||||
.. for (i = 0; i < 4096 * 4; i++) {
|
||||
.. buf[i * 4096 / 4]++;
|
||||
.. }
|
||||
.. printf(".");
|
||||
.. }
|
||||
.. }
|
|
@ -1,34 +0,0 @@
|
|||
= Save Devices =
|
||||
|
||||
QEMU has code to load/save the state of the guest that it is running.
|
||||
These are two complementary operations. Saving the state just does
|
||||
that: it saves the state of each device that the guest is running.
|
||||
|
||||
These operations are normally used with migration (see migration.txt),
|
||||
but it is also possible to save the state of all devices to a file,
|
||||
without saving the RAM or the block devices of the VM.
|
||||
|
||||
This operation is called "xen-save-devices-state" (see
|
||||
QMP/qmp-commands.txt)
|
||||
|
||||
|
||||
The binary format used in the file is the following:
|
||||
|
||||
|
||||
-------------------------------------------
|
||||
|
||||
32 bit big endian: QEMU_VM_FILE_MAGIC
|
||||
32 bit big endian: QEMU_VM_FILE_VERSION
|
||||
|
||||
for_each_device
|
||||
{
|
||||
8 bit: QEMU_VM_SECTION_FULL
|
||||
32 bit big endian: section_id
|
||||
8 bit: idstr (ID string) length
|
||||
string: idstr (ID string)
|
||||
32 bit big endian: instance_id
|
||||
32 bit big endian: version_id
|
||||
buffer: device specific data
|
||||
}
|
||||
|
||||
8 bit: QEMU_VM_EOF
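The per-device header in the layout above can be read with a few big-endian loads. This is a minimal sketch that works on an in-memory buffer. The struct and function names are hypothetical, and the values of the QEMU_VM_* constants are deliberately not assumed here, so the section type byte is returned unchecked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_section_header {
    uint8_t  section_type;      /* e.g. QEMU_VM_SECTION_FULL, not checked here */
    uint32_t section_id;
    char     idstr[256];        /* 8-bit length, so at most 255 bytes plus NUL */
    uint32_t instance_id;
    uint32_t version_id;
};

/* Load a 32-bit big-endian integer. */
static uint32_t demo_get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Parse one per-device header from `buf`.
 * Returns the number of bytes consumed, or 0 if the buffer is too short.
 */
size_t demo_parse_section_header(const uint8_t *buf, size_t len,
                                 struct demo_section_header *h)
{
    size_t pos = 0;
    uint8_t idlen;

    assert(buf != NULL && h != NULL);
    if (len < 1 + 4 + 1) {
        return 0;
    }
    h->section_type = buf[pos++];
    h->section_id = demo_get_be32(buf + pos);
    pos += 4;
    idlen = buf[pos++];
    if (len - pos < (size_t)idlen + 4 + 4) {
        return 0;
    }
    memcpy(h->idstr, buf + pos, idlen);
    h->idstr[idlen] = '\0';
    pos += idlen;
    h->instance_id = demo_get_be32(buf + pos);
    pos += 4;
    h->version_id = demo_get_be32(buf + pos);
    pos += 4;
    return pos;
}
```

The device specific data that follows carries no length field in this layout, so a reader has to know each device's format in order to find the start of the next section.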
|