rdma: update documentation to reflect new unpin support

As requested, the protocol now includes memory unpinning support.
This has been implemented in a non-optimized manner, in such a way
that one could build an LRU policy, or use other workload-specific
information, on top of the basic mechanism to influence how unpinning
happens at runtime.

The feature is not yet user-facing, and thus can only be enabled
at compile-time.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Michael R. Hines 2013-07-22 10:01:51 -04:00 committed by Juan Quintela
parent 3464700f6a
commit a5f56b906e

@@ -35,7 +35,7 @@ memory tracked during each live migration iteration round cannot keep pace
 with the rate of dirty memory produced by the workload.
 RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
-over Convered Ethernet) as well as Infiniband-based. This implementation of
+over Converged Ethernet) as well as Infiniband-based. This implementation of
 migration using RDMA is capable of using both technologies because of
 the use of the OpenFabrics OFED software stack that abstracts out the
 programming model irrespective of the underlying hardware.
@@ -202,7 +202,7 @@ The maximum number of repeats is hard-coded to 4096. This is a conservative
 limit based on the maximum size of a SEND message along with emperical
 observations on the maximum future benefit of simultaneous page registrations.
-The 'type' field has 10 different command values:
+The 'type' field has 12 different command values:
 1. Unused
 2. Error (sent to the source during bad things)
 3. Ready (control-channel is available)
@@ -213,6 +213,8 @@ The 'type' field has 10 different command values:
 8. Register request (dynamic chunk registration)
 9. Register result ('rkey' to be used by sender)
 10. Register finished (registration for current iteration finished)
+11. Unregister request (unpin previously registered memory)
+12. Unregister finished (confirmation that unpin completed)
 A single control message, as hinted above, can contain within the data
 portion an array of many commands of the same type. If there is more than
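
The command list touched by this hunk could be mirrored in C roughly as follows. This is a minimal sketch, not QEMU's actual definitions: the identifier names (`CMD_*`, `cmd_type_in_range`, `cmd_type_name`, `cmd_is_unpin`) are hypothetical, and the numeric values simply reuse the 1-based numbering of the documentation list, which may differ from the values used on the wire. Entries 4 through 7 fall outside the hunks shown here and are deliberately left out rather than guessed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; values reuse the 1-based numbering from the list. */
enum rdma_cmd_type {
    CMD_UNUSED              = 1,
    CMD_ERROR               = 2,
    CMD_READY               = 3,
    /* 4-7 are not visible in this diff and are intentionally omitted. */
    CMD_REGISTER_REQUEST    = 8,
    CMD_REGISTER_RESULT     = 9,
    CMD_REGISTER_FINISHED   = 10,
    CMD_UNREGISTER_REQUEST  = 11,  /* new in this commit */
    CMD_UNREGISTER_FINISHED = 12,  /* new in this commit */
    CMD_MAX                 = 12
};

/* Returns 1 when 'type' lies within the 12 documented values. */
int cmd_type_in_range(int type)
{
    return type >= CMD_UNUSED && type <= CMD_MAX;
}

/* Returns the documented label, or NULL for values this sketch omits. */
const char *cmd_type_name(int type)
{
    switch (type) {
    case CMD_UNUSED:              return "Unused";
    case CMD_ERROR:               return "Error";
    case CMD_READY:               return "Ready";
    case CMD_REGISTER_REQUEST:    return "Register request";
    case CMD_REGISTER_RESULT:     return "Register result";
    case CMD_REGISTER_FINISHED:   return "Register finished";
    case CMD_UNREGISTER_REQUEST:  return "Unregister request";
    case CMD_UNREGISTER_FINISHED: return "Unregister finished";
    default:                      return NULL;
    }
}

/* Returns 1 for the two unpin-related commands added by this commit. */
int cmd_is_unpin(int type)
{
    return type == CMD_UNREGISTER_REQUEST || type == CMD_UNREGISTER_FINISHED;
}
```

A receiver could call `cmd_type_in_range()` before dispatching on the 'type' field, so that a peer predating this commit rejects values 11 and 12 instead of misinterpreting them.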
@@ -243,7 +245,7 @@ qemu_rdma_exchange_send(header, data, optional response header & data):
 from the receiver to tell us that the receiver
 is *ready* for us to transmit some new bytes.
 2. Optionally: if we are expecting a response from the command
-(that we have no yet transmitted), let's post an RQ
+(that we have not yet transmitted), let's post an RQ
 work request to receive that data a few moments later.
 3. When the READY arrives, librdmacm will
 unblock us and we immediately post a RQ work request
@@ -293,8 +295,10 @@ librdmacm provides the user with a 'private data' area to be exchanged
 at connection-setup time before any infiniband traffic is generated.
 Header:
-* Version (protocol version validated before send/recv occurs), uint32, network byte order
-* Flags (bitwise OR of each capability), uint32, network byte order
+* Version (protocol version validated before send/recv occurs),
+uint32, network byte order
+* Flags (bitwise OR of each capability),
+uint32, network byte order
 There is no data portion of this header right now, so there is
 no length field. The maximum size of the 'private data' section
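
The two-field header described in this hunk can be illustrated with a small pack/unpack pair. This is a sketch under stated assumptions: the struct and function names (`cap_header`, `cap_header_pack`, `cap_header_unpack`) are hypothetical, and the layout assumes Version comes first and Flags second, with no padding, as the order of the list suggests. Only `htonl`/`ntohl` and `memcpy` are real library calls.

```c
#include <arpa/inet.h>  /* htonl, ntohl */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CAP_WIRE_SIZE 8  /* two uint32 fields, no data portion, no length */

/* Hypothetical in-memory form of the header, in host byte order. */
struct cap_header {
    uint32_t version;
    uint32_t flags;   /* bitwise OR of capability bits */
};

/* Serialize to network byte order: Version first, then Flags (assumed). */
void cap_header_pack(const struct cap_header *h, unsigned char *out)
{
    uint32_t v = htonl(h->version);
    uint32_t f = htonl(h->flags);

    assert(h != NULL && out != NULL);
    memcpy(out, &v, sizeof(v));
    memcpy(out + sizeof(v), &f, sizeof(f));
}

/* Parse a buffer produced by cap_header_pack() back into host order. */
void cap_header_unpack(const unsigned char *in, struct cap_header *h)
{
    uint32_t v, f;

    assert(in != NULL && h != NULL);
    memcpy(&v, in, sizeof(v));
    memcpy(&f, in + sizeof(v), sizeof(f));
    h->version = ntohl(v);
    h->flags = ntohl(f);
}
```

Copying through `memcpy` rather than casting the buffer to a struct pointer avoids alignment and padding assumptions about the 'private data' area.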
@@ -313,7 +317,7 @@ If the version is invalid, we throw an error.
 If the version is new, we only negotiate the capabilities that the
 requested version is able to perform and ignore the rest.
-Currently there is only *one* capability in Version #1: dynamic page registration
+Currently there is only one capability in Version #1: dynamic page registration
 Finally: Negotiation happens with the Flags field: If the primary-VM
 sets a flag, but the destination does not support this capability, it
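
The sentence above is cut off by the hunk boundary, so the exact behavior is not visible here. One plausible reading of flag-based negotiation is a bitwise intersection of what the primary-VM requests and what the destination supports. The sketch below illustrates that reading only; the function names and the bit value chosen for dynamic page registration are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit value for the single Version #1 capability. */
#define CAP_DYNAMIC_PAGE_REGISTRATION 0x01u

/*
 * Assumed negotiation rule: a capability survives only if the
 * primary-VM requested it and the destination supports it.
 */
uint32_t caps_negotiate(uint32_t requested, uint32_t supported)
{
    return requested & supported;
}

/* Returns 1 when every bit of 'cap' is set in the negotiated flags. */
int cap_enabled(uint32_t negotiated, uint32_t cap)
{
    return (negotiated & cap) == cap;
}
```

Under this assumption, unknown bits from a newer peer are dropped silently, which matches the stated intent to "ignore the rest".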
@@ -413,3 +417,8 @@ TODO:
 the use of KSM and ballooning while using RDMA.
 4. Also, some form of balloon-device usage tracking would also
 help alleviate some issues.
+5. Move UNREGISTER requests to a separate thread.
+6. Use LRU to provide more fine-grained direction of UNREGISTER
+requests for unpinning memory in an overcommitted environment.
+7. Expose UNREGISTER support to the user by way of workload-specific
+hints about application behavior.