qemu/include/hw
thomas 681e338ee6 virtio-net: Fix network stall at the host side waiting for kick
Patch 06b1297017 ("virtio-net: fix network stall under load")
added double-check to test whether the available buffer size
can satisfy the request or not, in case the guest has added
some buffers to the avail ring simultaneously after the first
check. It will be lucky if the available buffer size becomes
okay after the double-check, then the host can send the packet
to the guest. If the buffer size still can't satisfy the request,
even if the guest has added some buffers, viritio-net would
stall at the host side forever.

The patch enables notification and checks whether the guest has
added some buffers since last check of available buffers when
the available buffers are insufficient. If no buffer is added,
return false, else recheck the available buffers in the loop.
If the available buffers are sufficient, disable notification
and return true.

Changes:
1. Change the return type of virtqueue_get_avail_bytes() from void
   to int, it returns an opaque that represents the shadow_avail_idx
   of the virtqueue on success, else -1 on error.
2. Add a new API: virtio_queue_enable_notification_and_check(),
   it takes an opaque as input arg which is returned from
   virtqueue_get_avail_bytes(). It enables notification firstly,
   then checks whether the guest has added some buffers since
   last check of available buffers or not by virtio_queue_poll(),
   return ture if yes.

The patch also reverts patch "06b12970174".

The case below can reproduce the stall.

                                       Guest 0
                                     +--------+
                                     | iperf  |
                    ---------------> | server |
         Host       |                +--------+
       +--------+   |                    ...
       | iperf  |----
       | client |----                  Guest n
       +--------+   |                +--------+
                    |                | iperf  |
                    ---------------> | server |
                                     +--------+

Boot many guests from qemu with virtio network:
 qemu ... -netdev tap,id=net_x \
    -device virtio-net-pci-non-transitional,\
    iommu_platform=on,mac=xx:xx:xx:xx:xx:xx,netdev=net_x

Each guest acts as iperf server with commands below:
 iperf3 -s -D -i 10 -p 8001
 iperf3 -s -D -i 10 -p 8002

The host as iperf client:
 iperf3 -c guest_IP -p 8001 -i 30 -w 256k -P 20 -t 40000
 iperf3 -c guest_IP -p 8002 -i 30 -w 256k -P 20 -t 40000

After some time, the host loses connection to the guest,
the guest can send packet to the host, but can't receive
packet from the host.

It's more likely to happen if SWIOTLB is enabled in the guest,
allocating and freeing bounce buffer takes some CPU ticks,
copying from/to bounce buffer takes more CPU ticks, compared
with that there is no bounce buffer in the guest.
Once the rate of producing packets from the host approximates
the rate of receiveing packets in the guest, the guest would
loop in NAPI.

         receive packets    ---
               |             |
               v             |
           free buf      virtnet_poll
               |             |
               v             |
     add buf to avail ring  ---
               |
               |  need kick the host?
               |  NAPI continues
               v
         receive packets    ---
               |             |
               v             |
           free buf      virtnet_poll
               |             |
               v             |
     add buf to avail ring  ---
               |
               v
              ...           ...

On the other hand, the host fetches free buf from avail
ring, if the buf in the avail ring is not enough, the
host notifies the guest the event by writing the avail
idx read from avail ring to the event idx of used ring,
then the host goes to sleep, waiting for the kick signal
from the guest.

Once the guest finds the host is waiting for kick singal
(in virtqueue_kick_prepare_split()), it kicks the host.

The host may stall forever at the sequences below:

         Host                        Guest
     ------------                 -----------
 fetch buf, send packet           receive packet ---
         ...                          ...         |
 fetch buf, send packet             add buf       |
         ...                        add buf   virtnet_poll
    buf not enough      avail idx-> add buf       |
    read avail idx                  add buf       |
                                    add buf      ---
                                  receive packet ---
    write event idx                   ...         |
    wait for kick                   add buf   virtnet_poll
                                      ...         |
                                                 ---
                                 no more packet, exit NAPI

In the first loop of NAPI above, indicated in the range of
virtnet_poll above, the host is sending packets while the
guest is receiving packets and adding buffers.
 step 1: The buf is not enough, for example, a big packet
         needs 5 buf, but the available buf count is 3.
         The host read current avail idx.
 step 2: The guest adds some buf, then checks whether the
         host is waiting for kick signal, not at this time.
         The used ring is not empty, the guest continues
         the second loop of NAPI.
 step 3: The host writes the avail idx read from avail
         ring to used ring as event idx via
         virtio_queue_set_notification(q->rx_vq, 1).
 step 4: At the end of the second loop of NAPI, recheck
         whether kick is needed, as the event idx in the
         used ring written by the host is beyound the
         range of kick condition, the guest will not
         send kick signal to the host.

Fixes: 06b1297017 ("virtio-net: fix network stall under load")
Cc: qemu-stable@nongnu.org
Signed-off-by: Wencheng Yang <east.moutain.yang@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
(cherry picked from commit f937309fbd)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
(Mjt: context fixup in include/hw/virtio/virtio.h)
2024-08-06 17:18:25 +03:00
..
acpi acpi: add get_dev_aml_func() helper 2022-11-07 14:08:17 -05:00
adc hw/adc/zynq-xadc: Use qemu_irq typedef 2022-05-19 16:19:02 +01:00
arm hw/arm/boot: Make write_bootloader() public as arm_write_bootloader() 2023-05-18 21:09:59 +03:00
audio introduce -audio as a replacement for -soundhw 2022-05-14 12:33:44 +02:00
block block: add missed block_acct_setup with new block device init procedure 2022-09-30 18:42:34 +02:00
char hw/riscv: spike: Allow using binary firmware as bios 2022-01-21 15:52:56 +10:00
core accel/qtest: Support qtest accelerator for Windows 2022-10-28 11:17:12 +02:00
cpu Use OBJECT_DECLARE_SIMPLE_TYPE when possible 2020-09-18 14:12:32 -04:00
cris hw: Replace anti-social QOM type names 2021-03-19 15:18:43 +01:00
cxl hw/pci-bridge/cxl-upstream: Add a CDAT table access DOE 2022-11-07 13:12:19 -05:00
display xlnx_dp: Introduce a vblank signal 2022-06-08 19:38:47 +01:00
dma hw/dma/xlnx_csu_dma: Support starting a read transfer through a class method 2022-01-28 14:29:46 +00:00
firmware hw/smbios: add core_count2 to smbios table type 4 2022-11-07 14:08:17 -05:00
gpio hw/gpio: replace HWADDR_PRIx with PRIx64 2022-05-25 10:31:33 +02:00
hyperv hw/hyperv/vmbus: Remove unused vmbus_load/save_req() 2022-05-30 19:49:42 +02:00
i2c hw/i2c/aspeed: Fix Tx count and Rx size error in buffer pool mode 2023-09-11 10:53:51 +03:00
i386 include/hw/i386/x86-iommu: Fix struct X86IOMMU_MSIMessage for big endian hosts 2023-08-04 08:27:03 +03:00
ide hw/ide/piix: Introduce TYPE_ macros for PIIX IDE controllers 2022-10-31 11:32:07 +01:00
input pckbd: remove legacy i8042_mm_init() function 2022-07-18 19:28:46 +01:00
intc hw/intc: Move mtimer/mtimecmp to aclint 2022-09-07 09:19:10 +02:00
ipack ipack: Rename ipack_bus_new_inplace() to ipack_bus_init() 2021-09-30 13:42:10 +01:00
ipmi Use OBJECT_DECLARE_SIMPLE_TYPE when possible 2020-09-18 14:12:32 -04:00
isa hw/isa/vt82c686: Instantiate PM function in host device 2022-10-31 11:32:07 +01:00
kvm target/i386: always create kvmclock device 2020-09-30 19:11:36 +02:00
loongarch Revert "hw/loongarch/virt: Add cfi01 pflash device" 2022-12-05 11:24:35 -05:00
m68k hw/m68k/mcf: Add missing 'exec/hwaddr.h' header 2022-02-21 10:35:13 +01:00
mem acpi/nvdimm: Define trace events for NVDIMM and substitute nvdimm_debug() 2022-07-26 10:37:46 -04:00
mips hw/mips/bootloader: Allow bl_gen_jump_kernel to optionally set register 2022-10-31 11:32:45 +01:00
misc hw/ppc/mac.h: Rename to include/hw/nvram/mac_nvram.h 2022-10-31 18:48:23 +00:00
net Clean up decorations and whitespace around header guards 2022-05-11 16:50:32 +02:00
nubus Clean up header guards that don't match their file name 2022-05-11 16:49:06 +02:00
nvram Revert "x86: return modified setup_data only if read as memory, not as file" 2023-03-29 10:20:04 +03:00
openrisc hw/openrisc: Split re-usable boot time apis out to boot.c 2022-09-04 07:02:56 +01:00
pci pcie: Introduce pcie_sriov_num_vfs 2024-03-14 21:24:34 +03:00
pci-bridge pci/pci_expander_bridge: For CXL HB delay the HB register memory region setup. 2022-06-09 19:32:49 -04:00
pci-host hw/loongarch: Improve fdt for LoongArch virt machine 2022-11-04 17:07:40 +08:00
ppc ppc/spapr: Introduce SPAPR_IRQ_NR_IPIS to refer IRQ range for CPU IPIs. 2024-04-16 21:16:09 +03:00
rdma qapi: introduce x-query-rdma QMP command 2021-11-02 15:55:14 +00:00
remote vfio-user: handle device interrupts 2022-06-15 16:43:42 +01:00
riscv hw/riscv: virt: Enable booting S-mode firmware from pflash 2022-10-14 14:29:50 +10:00
rtc hw/rtc/sun4v-rtc: Relicense to GPLv2-or-later 2024-03-13 23:09:00 +03:00
rx Clean up header guards that don't match their file name 2022-05-11 16:49:06 +02:00
s390x Revert "s390x/s390-virtio-ccw: add zpcii-disable machine property" 2022-11-08 10:10:57 +01:00
scsi include/hw/scsi/scsi.h: Remove unused scsi_legacy_handle_cmdline() prototype 2022-10-22 23:21:16 +02:00
sd hw/sd: Fix sun4i allwinner-sdhost for U-Boot 2022-11-21 11:45:12 +00:00
sensor hw/sensor: Add IC_DEVICE_ID to ISL voltage regulators 2022-07-14 16:24:38 +02:00
sh4 hw/intc/sh_intc: Inline and drop sh_intc_source() function 2021-10-30 18:39:37 +02:00
southbridge hw/isa/piix3: Inline and remove piix3_create() 2022-06-11 11:44:50 +02:00
sparc hw: Replace anti-social QOM type names 2021-03-19 15:18:43 +01:00
ssi aspeed/smc: Cache AspeedSMCClass 2022-10-24 11:20:15 +02:00
timer hw/intc: Move mtimer/mtimecmp to aclint 2022-09-07 09:19:10 +02:00
tricore Clean up ill-advised or unusual header guards 2022-05-11 16:50:01 +02:00
usb hw/usb: fix tab indentation 2022-11-08 11:13:48 +01:00
vfio vfio/common: Rename VFIOGuestIOMMU::iommu into ::iommu_mr 2022-05-06 09:06:51 -06:00
virtio virtio-net: Fix network stall at the host side waiting for kick 2024-08-06 17:18:25 +03:00
watchdog Clean up header guards that don't match their file name 2022-05-11 16:49:06 +02:00
xen hw/i386/xen/xen-hvm: Inline xen_piix_pci_write_config_client() and remove it 2022-06-29 00:24:59 +02:00
xtensa Include hw/irq.h a lot less 2019-08-16 13:31:52 +02:00
boards.h machine: Add helpers to get cores/threads per socket 2023-09-11 10:53:50 +03:00
clock.h host-utils: add 128-bit quotient support to divu128/divs128 2021-10-27 17:10:00 -07:00
elf_ops.h load_elf: fix iterator's type for elf file processing 2024-01-19 13:41:14 +03:00
fw-path-provider.h Use DECLARE_*CHECKER* macros 2020-09-09 09:27:09 -04:00
hotplug.h Use DECLARE_*CHECKER* macros 2020-09-09 09:27:09 -04:00
hw.h compiler.h: replace QEMU_NORETURN with G_NORETURN 2022-04-21 17:03:51 +04:00
ide.h include/hw/ide: Unexport pci_piix3_xen_ide_unplug() 2022-06-09 14:47:42 +01:00
irq.h hw/core/irq: remove unused 'qemu_irq_split' function 2022-04-21 11:37:04 +01:00
loader-fit.h nomaintainer: Fix Lesser GPL version number 2020-11-15 17:04:40 +01:00
loader.h hw/core/loader: return image sizes as ssize_t 2022-06-10 09:31:42 +10:00
nmi.h Use DECLARE_*CHECKER* macros 2020-09-09 09:27:09 -04:00
or-irq.h Use DECLARE_*CHECKER* macros 2020-09-09 09:27:09 -04:00
pcmcia.h Use OBJECT_DECLARE_TYPE when possible 2020-09-18 14:12:32 -04:00
platform-bus.h nomaintainer: Fix Lesser GPL version number 2020-11-15 17:04:40 +01:00
ptimer.h ptimer: Rename PTIMER_POLICY_DEFAULT to PTIMER_POLICY_LEGACY 2022-05-19 16:19:03 +01:00
qdev-clock.h clock: Add ClockEvent parameter to callbacks 2021-03-08 17:20:01 +00:00
qdev-core.h memory: prevent dma-reentracy issues 2023-09-11 10:53:50 +03:00
qdev-dma.h Supply missing header guards 2019-06-12 13:20:21 +02:00
qdev-properties-system.h qdev: Reuse DEFINE_PROP in all DEFINE_PROP_* macros 2020-12-18 15:20:17 -05:00
qdev-properties.h qdev-properties: Add a new macro with bitmask check for uint64_t property 2022-05-14 12:32:41 +02:00
register.h hw/core/register: Add more 64-bit utilities 2021-09-01 11:59:12 +10:00
registerfields.h hw/registerfields: Add shared fields macros 2022-06-22 09:49:34 +02:00
resettable.h Use DECLARE_*CHECKER* macros 2020-09-09 09:27:09 -04:00
stream.h hw/core/stream: Rename StreamSlave as StreamSink 2020-12-10 12:15:04 -05:00
sysbus.h qom: Remove module_obj_name parameter from OBJECT_DECLARE* macros 2020-09-18 14:12:32 -04:00
usb.h hw/usb: fix tab indentation 2022-11-08 11:13:48 +01:00
vmstate-if.h Use DECLARE_*CHECKER* macros 2020-09-09 09:27:09 -04:00