==============
NVMe Emulation
==============

QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and
``nvme-subsys`` devices.

See the following sections for specific information on

* `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_.
* Configuration of `Optional Features`_ such as `Controller Memory Buffer`_,
  `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data
  Protection`_.

Adding NVMe Devices
===================

Controller Emulation
--------------------

The QEMU emulated NVMe controller implements version 1.4 of the NVM Express
specification. All mandatory features are implemented, with a couple of
exceptions and limitations:

* Accounting numbers in the SMART/Health log page are reset when the device
  is power cycled.
* Interrupt Coalescing is not supported and is disabled by default.

The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the
following parameters:

.. code-block:: console

   -drive file=nvm.img,if=none,id=nvm
   -device nvme,serial=deadbeef,drive=nvm

There are a number of optional general parameters for the ``nvme`` device. Some
are mentioned here, but see ``-device nvme,help`` to list all possible
parameters.

``max_ioqpairs=UINT32`` (default: ``64``)
  Set the maximum number of allowed I/O queue pairs. This replaces the
  deprecated ``num_queues`` parameter.

``msix_qsize=UINT16`` (default: ``65``)
  The number of MSI-X vectors that the device should support.

``mdts=UINT8`` (default: ``7``)
  Set the Maximum Data Transfer Size of the device.

``use-intel-id`` (default: ``off``)
  Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and
  Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
  previously used.

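
As a worked example of how ``mdts`` translates into bytes: the description of
``zoned.zasl`` in this document notes that ``mdts`` is given as a power of two
in units of the minimum memory page size (CAP.MPSMIN). Assuming a minimum page
size of 4 KiB (an assumption for illustration; check the CAP register reported
by the emulated controller), the default ``mdts=7`` would correspond to:

.. math::

   \text{max transfer size} = 2^{\mathrm{mdts}} \times \mathrm{MPSMIN}
                            = 2^{7} \times 4\,\mathrm{KiB}
                            = 512\,\mathrm{KiB}
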
Additional Namespaces
---------------------

In the simplest possible invocation sketched above, the device only supports a
single namespace with the namespace identifier ``1``. To support multiple
namespaces and additional features, the ``nvme-ns`` device must be used.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2

The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
identifiers are allocated automatically, starting from ``1``.

There are a number of parameters available:

``nsid`` (default: ``0``)
  Explicitly set the namespace identifier.

``uuid`` (default: *autogenerated*)
  Set the UUID of the namespace. This will be reported as a "Namespace UUID"
  descriptor in the Namespace Identification Descriptor List.

``eui64``
  Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended
  Unique Identifier" descriptor in the Namespace Identification Descriptor List.
  Since machine type 6.1, a non-zero default value is used if the parameter
  is not provided. For earlier machine types, the field defaults to 0.

``bus``
  If more than one ``nvme`` device is defined, this parameter may be used to
  attach the namespace to a specific ``nvme`` device (identified by an ``id``
  parameter on the controller device).

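
As a sketch of how ``bus`` and ``nsid`` might be combined, the following
defines two controllers and attaches one namespace to each by referring to the
controller ``id``. The second controller's ``id`` and ``serial`` values, and
the image file names, are placeholders chosen for illustration only.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -device nvme,id=nvme-ctrl-1,serial=cafebabe
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,bus=nvme-ctrl-0,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,bus=nvme-ctrl-1,nsid=1

Without ``bus``, both namespaces would attach to the most recently defined
controller, so naming the target explicitly makes the layout independent of
the order of the arguments.
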
NVM Subsystems
--------------

Additional features become available if the controller device (``nvme``) is
linked to an NVM Subsystem device (``nvme-subsys``).

The NVM Subsystem emulation allows features such as shared namespaces and
multipath I/O.

.. code-block:: console

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:

``shared`` (default: ``on`` since 6.2)
  Specifies that the namespace will be attached to all controllers in the
  subsystem. If set to ``off``, the namespace will remain a private namespace
  and may only be attached to a single controller at a time. Shared namespaces
  are always automatically attached to all controllers (also when controllers
  are hotplugged).

``detached`` (default: ``off``)
  If set to ``on``, the namespace will be available in the subsystem, but
  not attached to any controllers initially. A shared namespace with this set
  to ``on`` will never be automatically attached to controllers.

Thus, adding

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both
controllers. NSID 3 will be a private namespace due to ``shared=off`` and only
attachable to a single controller at a time. Additionally, it will not be
attached to any controller initially (due to ``detached=on``) or to hotplugged
controllers.

Optional Features
=================

Controller Memory Buffer
------------------------

``nvme`` device parameters related to the Controller Memory Buffer support:

``cmb_size_mb=UINT32`` (default: ``0``)
  This adds a Controller Memory Buffer of the given size at offset zero in BAR
  2.

``legacy-cmb`` (default: ``off``)
  By default, the device uses the "v1.4 scheme" for the Controller Memory
  Buffer support (i.e., the CMB is initially disabled and must be explicitly
  enabled by the host). Set this to ``on`` to behave as a v1.3 device with
  respect to the CMB.

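
A minimal sketch of a controller with a Controller Memory Buffer, using only
the parameters described above. The 64 MiB size is an arbitrary value picked
for illustration.

.. code-block:: console

   -drive file=nvm.img,if=none,id=nvm
   -device nvme,serial=deadbeef,drive=nvm,cmb_size_mb=64

Appending ``legacy-cmb=on`` to the ``nvme`` device would select the v1.3
behaviour instead, which may be useful for guests that do not explicitly
enable the CMB.
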
Simple Copy
-----------

The device includes support for TP 4065 ("Simple Copy Command"). A number of
additional ``nvme-ns`` device parameters may be used to control the Copy
command limits:

``mssrl=UINT16`` (default: ``128``)
  Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
  number of logical blocks that may be specified in each source range.

``mcl=UINT32`` (default: ``128``)
  Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
  blocks that may be specified in a Copy command (the total for all source
  ranges).

``msrc=UINT8`` (default: ``127``)
  Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
  source ranges that may be used in a Copy command. This is a 0's based value.

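
As an illustration of how the three limits interact, the following sketch
(with arbitrarily chosen values) would allow up to 16 source ranges per Copy
command, since ``msrc`` is 0's based, each covering at most 64 logical blocks,
with no more than 256 logical blocks copied in total:

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,mssrl=64,mcl=256,msrc=15
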
Zoned Namespaces
----------------

A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set
``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.

The namespace may be configured with additional parameters:

``zoned.zone_size=SIZE`` (default: ``128MiB``)
  Define the zone size (``ZSZE``).

``zoned.zone_capacity=SIZE`` (default: ``0``)
  Define the zone capacity (``ZCAP``). If left at the default (``0``), the zone
  capacity will equal the zone size.

``zoned.descr_ext_size=UINT32`` (default: ``0``)
  Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64
  bytes.

``zoned.cross_read=BOOL`` (default: ``off``)
  Set to ``on`` to allow reads to cross zone boundaries.

``zoned.max_active=UINT32`` (default: ``0``)
  Set the maximum number of active resources (``MAR``). The default (``0``)
  allows all zones to be active.

``zoned.max_open=UINT32`` (default: ``0``)
  Set the maximum number of open resources (``MOR``). The default (``0``)
  allows all zones to be open. If ``zoned.max_active`` is specified, this value
  must be less than or equal to that.

``zoned.zasl=UINT8`` (default: ``0``)
  Set the maximum data transfer size for the Zone Append command. Like
  ``mdts``, the value is specified as a power of two (2^n) and is in units of
  the minimum memory page size (CAP.MPSMIN). The default value (``0``)
  has this property inherit the ``mdts`` value.

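
The parameters above might be combined as in the following sketch. The image
name and all numeric values are placeholders for illustration; the only
constraint taken from the text above is that ``zoned.max_open`` does not
exceed ``zoned.max_active``.

.. code-block:: console

   -drive file=zns.img,if=none,id=zns
   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -device nvme-ns,drive=zns,zoned=on,zoned.zone_size=64M,zoned.zone_capacity=48M,zoned.max_open=16,zoned.max_active=32

Here each 64 MiB zone exposes 48 MiB of writable capacity, and
``zoned.zasl`` is left at its default so that Zone Append inherits the
``mdts`` limit of the controller.
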
Flexible Data Placement
-----------------------

The device may be configured to support TP 4146 ("Flexible Data Placement") by
configuring it (``fdp=on``) on the subsystem::

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16

The subsystem emulates a single Endurance Group, on which Flexible Data
Placement will be supported. Also note that the device emulation deviates
slightly from the specification, by always enabling the "FDP Mode" feature on
the controller if the subsystem is configured for Flexible Data Placement.

Enabling Flexible Data Placement on the subsystem enables the following
parameters:

``fdp.nrg`` (default: ``1``)
  Set the number of Reclaim Groups.

``fdp.nruh`` (default: ``0``)
  Set the number of Reclaim Unit Handles. This is a mandatory parameter and
  must be non-zero.

``fdp.runs`` (default: ``96M``)
  Set the Reclaim Unit Nominal Size. Defaults to 96 MiB.

Namespaces within this subsystem may request Reclaim Unit Handles::

    -device nvme-ns,drive=nvm-1,fdp.ruhs=RUHLIST

The ``RUHLIST`` is a semicolon-separated list (e.g. ``0;1;2;3``) and may
include ranges (e.g. ``0;8-15``). If no reclaim unit handle list is specified,
the controller will assign the controller-specified reclaim unit handle to
placement handle identifier 0.

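
Putting the pieces together, a complete configuration might look like the
following sketch. The drive and image names are placeholders. Note that the
``-device`` argument holding the ``RUHLIST`` is quoted here, on the assumption
that the command is typed into a POSIX shell, where an unquoted ``;`` would
otherwise terminate the command.

.. code-block:: console

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device "nvme-ns,drive=nvm-1,fdp.ruhs=0;8-15"
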
Metadata
--------

The virtual namespace device supports LBA metadata in the form of separate
metadata (``MPTR``-based) and extended LBAs.

``ms=UINT16`` (default: ``0``)
  Defines the number of metadata bytes per LBA.

``mset=UINT8`` (default: ``0``)
  Set to ``1`` to enable extended LBAs.

End-to-End Data Protection
--------------------------

The virtual namespace device supports DIF- and DIX-based protection information
(depending on ``mset``).

``pi=UINT8`` (default: ``0``)
  Enable protection information of the specified type (type ``1``, ``2`` or
  ``3``).

``pil=UINT8`` (default: ``0``)
  Controls the location of the protection information within the metadata. Set
  to ``1`` to transfer protection information as the first eight bytes of
  metadata. Otherwise, the protection information is transferred as the last
  eight bytes.

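
As a sketch of how the metadata and protection information parameters might
be used together, the following namespace carries eight bytes of metadata per
LBA as extended LBAs and enables Type 1 protection information. The value
``ms=8`` is an assumption made for illustration, chosen so that the metadata
is just large enough to hold the eight bytes of protection information
mentioned above.

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,ms=8,mset=1,pi=1
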
Virtualization Enhancements and SR-IOV (Experimental Support)
-------------------------------------------------------------

The ``nvme`` device supports Single Root I/O Virtualization and Sharing
along with Virtualization Enhancements. The controller has to be linked to
an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.

A number of parameters are present (**please note that they may be
subject to change**):

``sriov_max_vfs`` (default: ``0``)
  Indicates the maximum number of PCIe virtual functions supported
  by the controller. Specifying a non-zero value enables reporting of both
  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
  by the NVMe device. Virtual function controllers will not report SR-IOV.

``sriov_vq_flexible``
  Indicates the total number of flexible queue resources assignable to all
  the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.

``sriov_vi_flexible``
  Indicates the total number of flexible interrupt resources assignable to
  all the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.

``sriov_max_vi_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual interrupt resources assignable
  to a secondary controller. The default ``0`` resolves to
  ``(sriov_vi_flexible / sriov_max_vfs)``.

``sriov_max_vq_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual queue resources assignable to
  a secondary controller. The default ``0`` resolves to
  ``(sriov_vq_flexible / sriov_max_vfs)``.

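
To illustrate how these parameters relate to each other, consider the
following sketch with four virtual functions. All numeric values are chosen
for illustration only and have not been checked against any limits that the
device may enforce beyond the formulas given above.

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,max_ioqpairs=16,msix_qsize=17,sriov_max_vfs=4,sriov_vq_flexible=8,sriov_vi_flexible=4

Applying the formulas above, the primary controller would keep
``16 - 8 = 8`` private queue resources and ``17 - 4 = 13`` private interrupt
resources. With ``sriov_max_vq_per_vf`` and ``sriov_max_vi_per_vf`` left at
their defaults, each secondary controller could be assigned at most
``8 / 4 = 2`` queue resources and ``4 / 4 = 1`` interrupt resource.
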
The simplest possible invocation enables the capability to set up one VF
controller and assign an admin queue, an IO queue, and an MSI-X interrupt.

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,
    sriov_vq_flexible=2,sriov_vi_flexible=1

The minimum steps required to configure a functional NVMe secondary
controller are:

* unbind flexible resources from the primary controller

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
   nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0

* perform a Function Level Reset on the primary controller to actually
  release the resources

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

* enable VF

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

* assign the flexible resources to the VF and set it ONLINE

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

* bind the NVMe driver to the VF

.. code-block:: console

   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind