docs/migration: Split "Postcopy"

Split postcopy into a separate file. Introduce a head page "features.rst"
to keep all the features on top of the migration framework.

Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20240109064628.595453-7-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>

parent 774ad6b53b
commit bfb4c7cd99

docs/devel/migration/features.rst | 9 (new file)

@@ -0,0 +1,9 @@

Migration features
==================

Migration has plenty of features to support different use cases.

.. toctree::
   :maxdepth: 2

   postcopy

@@ -8,6 +8,7 @@ QEMU live migration works.

   :maxdepth: 2

   main
   features
   compatibility
   vfio
   virtio

@@ -644,308 +644,3 @@ algorithm will restrict virtual CPUs as needed to keep their dirty page

rate inside the limit. This leads to more steady reading performance during
live migration and can aid in improving large guest responsiveness.

(The "Postcopy" section removed here is identical to the content of the new
file docs/devel/migration/postcopy.rst shown below.)

docs/devel/migration/postcopy.rst | 304 (new file)

@@ -0,0 +1,304 @@

Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge). Its plus side is that there is an upper bound
on the amount of migration traffic and the time it takes; the down side is
that during the postcopy phase, a failure of *either* side causes the guest
to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if
precopy doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode. Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on. Issuing it after the end of a migration is harmless.
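
The same switchover can also be scripted over QMP instead of the HMP
monitor. The sketch below is illustrative only: the socket paths and the
destination URI are made-up examples and error handling is omitted. It uses
the QMP commands ``migrate-set-capabilities``, ``migrate`` and
``migrate-start-postcopy``::

  #!/usr/bin/env python3
  # Illustrative sketch: drive a precopy->postcopy switchover through QMP.
  # The socket paths and destination URI below are hypothetical examples.
  import json
  import socket

  def qmp_command(sock_path, command, arguments=None):
      """Connect to a QMP socket, negotiate capabilities, run one command."""
      with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
          s.connect(sock_path)
          f = s.makefile("rw")
          json.loads(f.readline())                     # server greeting
          for cmd in ({"execute": "qmp_capabilities"},
                      {"execute": command, "arguments": arguments or {}}):
              f.write(json.dumps(cmd) + "\n")
              f.flush()
              while True:
                  reply = json.loads(f.readline())
                  if "event" not in reply:             # skip async events
                      break
          return reply

  # Enable the capability on *both* sides before migration starts.
  for path in ("/tmp/qmp-src.sock", "/tmp/qmp-dst.sock"):
      qmp_command(path, "migrate-set-capabilities",
                  {"capabilities": [{"capability": "postcopy-ram",
                                     "state": True}]})

  # Start a normal (precopy) migration on the source ...
  qmp_command("/tmp/qmp-src.sock", "migrate", {"uri": "tcp:destination:4444"})

  # ... then switch it into postcopy mode whenever desired.
  qmp_command("/tmp/qmp-src.sock", "migrate-start-postcopy")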

Blocktime is a postcopy live migration metric, intended to show how long a
vCPU was in a state of interruptible sleep due to a pagefault. The metric is
calculated both as an overlapped value across all vCPUs and separately for
each vCPU. These values are calculated on the destination side. To enable
postcopy blocktime calculation, enter the following command on the
destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command:
the ``postcopy-blocktime`` field shows the overlapped blocking time across
all vCPUs, and ``postcopy-vcpu-blocktime`` shows a list of blocking times
per vCPU.
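
For example, reusing the ``qmp_command()`` helper sketched above (the
destination socket path is again a made-up example), the counters can be
read on the destination once the capability has been enabled::

  info = qmp_command("/tmp/qmp-dst.sock", "query-migrate")["return"]
  print("overall blocktime :", info.get("postcopy-blocktime"))
  print("per-vCPU blocktime:", info.get("postcopy-vcpu-blocktime"))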

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages
  that the destination is waiting for).

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source; as such the
migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream
completely before the device load begins, to free the stream up. This is
achieved by 'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

- Command: 'postcopy listen'
- The device state

  A series of sections, identical to the precopy stream's device state
  stream, containing everything except postcopiable devices (i.e. RAM)
- Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy),
however when a page request is received from the destination, the dirty page
scanning restarts from the requested location. This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                      1            2 3      4      5             6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN )
  thread                            |            |
                                    |          (page request)
                                    |             \___
                                    v                 \
  listen thread:                    --- page -- page -- page -- page -- page --
                                         a       b       c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

  All the data associated with the package - the ( ... ) section in the
  diagram - is read into memory, and the main thread recurses into
  qemu_loadvm_state_main to process the contents of the package (2), which
  contains commands (3,6) and devices (4...)

- On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)

  a new thread (a) is started that takes over servicing the migration stream,
  while the main thread carries on loading the package. It loads normal
  background page data (b), but if during a device load a fault happens (5)
  the returned page (c) is loaded by the listen thread, allowing the main
  thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

  letting the destination CPUs start running. At the end of the
  ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
  is no longer used by migration, while the listen thread carries on servicing
  page data until the end of migration.

Postcopy Recovery
-----------------

Compared to precopy, postcopy is special in its error handling. When an
error happens (in this case, mostly a network error), QEMU cannot easily
fail the migration because the VM data resides in both the source and
destination QEMU instances. Instead, when an issue happens, QEMU on both
sides will go into a paused state, and a recovery phase is needed to
continue the paused postcopy migration.

The recovery phase normally contains a few steps:

- When a network issue occurs, both QEMU instances will go into a PAUSED
  state

- When the network is recovered (or a new network is provided), the admin
  can set up the new channel for migration using the QMP command
  'migrate-recover' on the destination node, preparing for a resume (see
  the sketch after this list).

- On the source host, the admin can continue the interrupted postcopy
  migration using the QMP command 'migrate' with the resume=true flag set.

- After the connection is re-established, QEMU will continue the postcopy
  migration on both sides.
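
As a concrete, purely illustrative sketch of those last steps, reusing the
``qmp_command()`` helper from above (socket paths and URIs are hypothetical
examples)::

  # On the destination: prepare a fresh channel for the resumed migration.
  qmp_command("/tmp/qmp-dst.sock", "migrate-recover",
              {"uri": "tcp:0.0.0.0:5555"})

  # On the source: resume the interrupted postcopy migration on that channel.
  qmp_command("/tmp/qmp-src.sock", "migrate",
              {"uri": "tcp:destination:5555", "resume": True})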

During a paused postcopy migration, the VM can logically still continue
running: it is not impacted by accesses to pages that were already migrated
to the destination VM before the interruption happened. However, if any of
the missing pages is accessed on the destination VM, the VM thread will be
halted waiting for the page to be migrated, which means it can stay halted
until the recovery is complete.

The impact of accessing missing pages depends on the configuration of the
guest. For example, with async page fault enabled, the guest can logically
schedule out the threads accessing missing pages proactively.

Postcopy states
---------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

- Advise

  Set at the start of migration if postcopy is enabled, even
  if it hasn't had the start command; here the destination
  checks that its OS has the support needed for postcopy, and performs
  setup to ensure the RAM mappings are suitable for later postcopy.
  The destination will fail early in migration at this point if the
  required OS support is not present.
  (Triggered by reception of POSTCOPY_ADVISE command)

- Discard

  Entered on receipt of the first 'discard' command; prior to
  the first Discard being performed, hugepages are switched off
  (using madvise) to ensure that no new huge pages are created
  during the postcopy phase, and to cause any huge pages that
  have discards on them to be broken.

- Listen

  The first command in the package, POSTCOPY_LISTEN, switches
  the destination state to Listen, and starts a new thread
  (the 'listen thread') which takes over the job of receiving
  pages off the migration stream, while the main thread carries
  on processing the blob. With this thread able to process page
  reception, the destination now 'sensitises' the RAM to detect
  any access to missing pages (on Linux using the 'userfault'
  system).

- Running

  POSTCOPY_RUN causes the destination to synchronise all
  state and start the CPUs and IO devices running. The main
  thread now finishes processing the migration package and
  carries on as it would for normal precopy migration
  (although it can't do the cleanup it would do as it
  finishes a normal migration).

- Paused

  Postcopy can run into a paused state (normally on both sides at the same
  time), where all threads will be temporarily halted, mostly due to
  network errors. When reaching the paused state, migration will make sure
  the QEMU binaries on both sides maintain the data without corrupting
  the VM. To continue the migration, the admin needs to fix the
  migration channel using the QMP command 'migrate-recover' on the
  destination node, then resume the migration using the QMP command
  'migrate' again on the source node, with the resume=true flag set.

- End

  The listen thread can now quit, and perform the cleanup of migration
  state; the migration is now complete.
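
The ordering above can be summarised in a tiny model. This is not QEMU code,
just a conceptual sketch of the transitions described in this section, with
the paused state entered from (and returning to) the postcopy running phase
during recovery::

  from enum import Enum, auto

  class PostcopyState(Enum):
      ADVISE = auto()
      DISCARD = auto()
      LISTEN = auto()
      RUNNING = auto()
      PAUSED = auto()      # entered on channel failure, see "Postcopy Recovery"
      END = auto()

  # Conceptual transition table for the description above (not QEMU's actual
  # data structure).
  NEXT = {
      PostcopyState.ADVISE:  {PostcopyState.DISCARD},
      PostcopyState.DISCARD: {PostcopyState.LISTEN},
      PostcopyState.LISTEN:  {PostcopyState.RUNNING},
      PostcopyState.RUNNING: {PostcopyState.PAUSED, PostcopyState.END},
      PostcopyState.PAUSED:  {PostcopyState.RUNNING},   # after migrate-recover
      PostcopyState.END:     set(),
  }

  def advance(state, new_state):
      assert new_state in NEXT[state], f"illegal transition {state} -> {new_state}"
      return new_state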

Source side page map
--------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending. During the precopy phase this is updated as the CPU dirties
pages, however during postcopy the CPUs are stopped and nothing should
dirty anything any more. Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.
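
A minimal model of that bitmap behaviour (purely illustrative, not QEMU's
implementation)::

  # Conceptual model of the migration bitmap, not QEMU's implementation.
  class MigrationBitmap:
      def __init__(self, num_pages):
          # At the start of postcopy every not-yet-sent page is marked dirty.
          self.dirty = [True] * num_pages

      def mark_dirty(self, page):
          # Precopy only: running CPUs keep re-dirtying pages.
          self.dirty[page] = True

      def page_sent(self, page):
          # Postcopy: source CPUs are stopped, so the only update is clearing
          # the bit once the page has gone out on the stream.
          self.dirty[page] = False

      def needs_sending(self, page):
          return self.dirty[page]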

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked
     (see the worked numbers after this list).
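
The rough arithmetic behind point (d)::

  # Time to push one hugepage over a 10Gbps link, as claimed in point (d).
  link_bps = 10e9 / 8                      # 10 Gbps expressed in bytes/second
  for size_name, size in (("2MB", 2 * 1024**2), ("1GB", 1024**3)):
      print(f"{size_name} hugepage: ~{size / link_bps * 1000:.1f} ms to transfer")
  # ~1.7 ms for a 2MB page versus ~860 ms for a 1GB page, during which the
  # faulting vCPU on the destination stays blocked.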

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the
types of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs, which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault. The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets. The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
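
A sketch of what such a mapping table lookup could look like (illustrative
only; the record layout and function name here are invented, not QEMU's
actual API)::

  import bisect

  # One (client_base, length, ramblock, ramblock_offset) record per region
  # the client registered with userfault; invented structure for illustration.
  regions = [
      (0x7f0000000000, 0x40000000, "pc.ram", 0x0),
      (0x7f8000000000, 0x10000000, "vga.vram", 0x0),
  ]
  regions.sort()
  bases = [r[0] for r in regions]

  def client_fault_to_ramblock(fault_addr):
      """Translate a fault address in the client's address space back to a
      (RAMBlock name, offset) pair so a page request can be sent."""
      i = bisect.bisect_right(bases, fault_addr) - 1
      if i < 0:
          raise KeyError(f"fault address {fault_addr:#x} below all regions")
      base, length, block, block_off = regions[i]
      if fault_addr - base >= length:
          raise KeyError(f"fault address {fault_addr:#x} not in a registered region")
      return block, block_off + (fault_addr - base)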

.. note::
  There are two future improvements that would be nice:

  a) Some way to make QEMU ignorant of the addresses in the client's
     address space
  b) Avoiding the need for QEMU to perform ufd-wake calls after the
     pages have arrived

Retro-fitting postcopy to existing clients is possible:

  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy. In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy Preemption Mode
------------------------

Postcopy preempt is a capability introduced in the QEMU 8.0 release. It
allows urgent pages (those whose page faults were explicitly requested by
the destination QEMU) to be sent over a separate preempt channel, rather
than queued in the background migration channel. Anyone who cares about the
latency of page faults during a postcopy migration should enable this
feature. By default, it is not enabled.
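
Like the other postcopy features it is controlled by a migration capability.
Assuming the capability name ``postcopy-preempt``, it could be switched on
next to ``postcopy-ram`` before migration starts, reusing the
``qmp_command()`` helper sketched earlier (socket paths are again
hypothetical)::

  # Assumed capability name: postcopy-preempt (enable on both sides, together
  # with postcopy-ram, before the migration starts).
  for path in ("/tmp/qmp-src.sock", "/tmp/qmp-dst.sock"):
      qmp_command(path, "migrate-set-capabilities",
                  {"capabilities": [{"capability": "postcopy-preempt",
                                     "state": True}]})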