rdma: update documentation to reflect new unpin support
As requested, the protocol now includes memory unpinning support. This has been implemented in a non-optimized manner, in such a way that one could devise an LRU or other workload-specific information on top of the basic mechanism to influence the way unpinning happens during runtime. The feature is not yet user-facing, and thus can only be enabled at compile-time.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
parent 3464700f6a
commit a5f56b906e
@@ -35,7 +35,7 @@ memory tracked during each live migration iteration round cannot keep pace
 with the rate of dirty memory produced by the workload.
 
 RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
-over Convered Ethernet) as well as Infiniband-based. This implementation of
+over Converged Ethernet) as well as Infiniband-based. This implementation of
 migration using RDMA is capable of using both technologies because of
 the use of the OpenFabrics OFED software stack that abstracts out the
 programming model irrespective of the underlying hardware.
@@ -202,7 +202,7 @@ The maximum number of repeats is hard-coded to 4096. This is a conservative
 limit based on the maximum size of a SEND message along with emperical
 observations on the maximum future benefit of simultaneous page registrations.
 
-The 'type' field has 10 different command values:
+The 'type' field has 12 different command values:
 1. Unused
 2. Error (sent to the source during bad things)
 3. Ready (control-channel is available)
@@ -213,6 +213,8 @@ The 'type' field has 10 different command values:
 8. Register request (dynamic chunk registration)
 9. Register result ('rkey' to be used by sender)
 10. Register finished (registration for current iteration finished)
+11. Unregister request (unpin previously registered memory)
+12. Unregister finished (confirmation that unpin completed)
 
 A single control message, as hinted above, can contain within the data
 portion an array of many commands of the same type. If there is more than
@@ -243,7 +245,7 @@ qemu_rdma_exchange_send(header, data, optional response header & data):
 from the receiver to tell us that the receiver
 is *ready* for us to transmit some new bytes.
 2. Optionally: if we are expecting a response from the command
-(that we have no yet transmitted), let's post an RQ
+(that we have not yet transmitted), let's post an RQ
 work request to receive that data a few moments later.
 3. When the READY arrives, librdmacm will
 unblock us and we immediately post a RQ work request
@@ -293,8 +295,10 @@ librdmacm provides the user with a 'private data' area to be exchanged
 at connection-setup time before any infiniband traffic is generated.
 
 Header:
-* Version (protocol version validated before send/recv occurs), uint32, network byte order
-* Flags (bitwise OR of each capability), uint32, network byte order
+* Version (protocol version validated before send/recv occurs),
+uint32, network byte order
+* Flags (bitwise OR of each capability),
+uint32, network byte order
 
 There is no data portion of this header right now, so there is
 no length field. The maximum size of the 'private data' section
@@ -313,7 +317,7 @@ If the version is invalid, we throw an error.
 If the version is new, we only negotiate the capabilities that the
 requested version is able to perform and ignore the rest.
 
-Currently there is only *one* capability in Version #1: dynamic page registration
+Currently there is only one capability in Version #1: dynamic page registration
 
 Finally: Negotiation happens with the Flags field: If the primary-VM
 sets a flag, but the destination does not support this capability, it
@@ -413,3 +417,8 @@ TODO:
 the use of KSM and ballooning while using RDMA.
 4. Also, some form of balloon-device usage tracking would also
 help alleviate some issues.
+5. Move UNREGISTER requests to a separate thread.
+6. Use LRU to provide more fine-grained direction of UNREGISTER
+requests for unpinning memory in an overcommitted environment.
+7. Expose UNREGISTER support to the user by way of workload-specific
+hints about application behavior.
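
Reviewer note: the connection-setup header touched by this change (Version and Flags, each a uint32 in network byte order, with Flags a bitwise OR of capabilities) can be illustrated with a minimal Python sketch. This is not the QEMU implementation. The constant CAP_DYNAMIC_REGISTRATION and its bit value, the helper names, and the rule that negotiation clears unsupported bits are all assumptions made for illustration; the excerpt names the capability but gives neither its numeric value nor the full negotiation rule.

```python
import struct

# Hypothetical bit assignment, for illustration only: the documentation
# names "dynamic page registration" as the single Version #1 capability,
# but this excerpt does not state its numeric flag value.
CAP_DYNAMIC_REGISTRATION = 0x1

# Two uint32 fields (version, flags) in network byte order, no length field.
HEADER = struct.Struct("!II")


def pack_capabilities(version, flags):
    """Encode the header as 8 bytes in network (big-endian) byte order."""
    return HEADER.pack(version, flags)


def unpack_capabilities(blob):
    """Decode (version, flags) from the first 8 bytes of the private data."""
    return HEADER.unpack(blob[:HEADER.size])


def negotiate(requested_flags, supported_flags):
    """Assumed rule: keep only the capability bits both sides support."""
    return requested_flags & supported_flags
```

For example, a source requesting the hypothetical capability bit from a destination that supports none of the flags would end up with a negotiated value of 0, and would then fall back to running without that capability.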