qemu/include/hw
thomas f937309fbd virtio-net: Fix network stall at the host side waiting for kick
Patch 06b1297017 ("virtio-net: fix network stall under load")
added double-check to test whether the available buffer size
can satisfy the request or not, in case the guest has added
some buffers to the avail ring simultaneously after the first
check. It will be lucky if the available buffer size becomes
okay after the double-check, then the host can send the packet
to the guest. If the buffer size still can't satisfy the request,
even if the guest has added some buffers, viritio-net would
stall at the host side forever.

The patch enables notification and checks whether the guest has
added some buffers since last check of available buffers when
the available buffers are insufficient. If no buffer is added,
return false, else recheck the available buffers in the loop.
If the available buffers are sufficient, disable notification
and return true.

Changes:
1. Change the return type of virtqueue_get_avail_bytes() from void
   to int, it returns an opaque that represents the shadow_avail_idx
   of the virtqueue on success, else -1 on error.
2. Add a new API: virtio_queue_enable_notification_and_check(),
   it takes an opaque as input arg which is returned from
   virtqueue_get_avail_bytes(). It enables notification firstly,
   then checks whether the guest has added some buffers since
   last check of available buffers or not by virtio_queue_poll(),
   return ture if yes.

The patch also reverts patch "06b12970174".

The case below can reproduce the stall.

                                       Guest 0
                                     +--------+
                                     | iperf  |
                    ---------------> | server |
         Host       |                +--------+
       +--------+   |                    ...
       | iperf  |----
       | client |----                  Guest n
       +--------+   |                +--------+
                    |                | iperf  |
                    ---------------> | server |
                                     +--------+

Boot many guests from qemu with virtio network:
 qemu ... -netdev tap,id=net_x \
    -device virtio-net-pci-non-transitional,\
    iommu_platform=on,mac=xx:xx:xx:xx:xx:xx,netdev=net_x

Each guest acts as iperf server with commands below:
 iperf3 -s -D -i 10 -p 8001
 iperf3 -s -D -i 10 -p 8002

The host as iperf client:
 iperf3 -c guest_IP -p 8001 -i 30 -w 256k -P 20 -t 40000
 iperf3 -c guest_IP -p 8002 -i 30 -w 256k -P 20 -t 40000

After some time, the host loses connection to the guest,
the guest can send packet to the host, but can't receive
packet from the host.

It's more likely to happen if SWIOTLB is enabled in the guest,
allocating and freeing bounce buffer takes some CPU ticks,
copying from/to bounce buffer takes more CPU ticks, compared
with that there is no bounce buffer in the guest.
Once the rate of producing packets from the host approximates
the rate of receiveing packets in the guest, the guest would
loop in NAPI.

         receive packets    ---
               |             |
               v             |
           free buf      virtnet_poll
               |             |
               v             |
     add buf to avail ring  ---
               |
               |  need kick the host?
               |  NAPI continues
               v
         receive packets    ---
               |             |
               v             |
           free buf      virtnet_poll
               |             |
               v             |
     add buf to avail ring  ---
               |
               v
              ...           ...

On the other hand, the host fetches free buf from avail
ring, if the buf in the avail ring is not enough, the
host notifies the guest the event by writing the avail
idx read from avail ring to the event idx of used ring,
then the host goes to sleep, waiting for the kick signal
from the guest.

Once the guest finds the host is waiting for kick singal
(in virtqueue_kick_prepare_split()), it kicks the host.

The host may stall forever at the sequences below:

         Host                        Guest
     ------------                 -----------
 fetch buf, send packet           receive packet ---
         ...                          ...         |
 fetch buf, send packet             add buf       |
         ...                        add buf   virtnet_poll
    buf not enough      avail idx-> add buf       |
    read avail idx                  add buf       |
                                    add buf      ---
                                  receive packet ---
    write event idx                   ...         |
    wait for kick                   add buf   virtnet_poll
                                      ...         |
                                                 ---
                                 no more packet, exit NAPI

In the first loop of NAPI above, indicated in the range of
virtnet_poll above, the host is sending packets while the
guest is receiving packets and adding buffers.
 step 1: The buf is not enough, for example, a big packet
         needs 5 buf, but the available buf count is 3.
         The host read current avail idx.
 step 2: The guest adds some buf, then checks whether the
         host is waiting for kick signal, not at this time.
         The used ring is not empty, the guest continues
         the second loop of NAPI.
 step 3: The host writes the avail idx read from avail
         ring to used ring as event idx via
         virtio_queue_set_notification(q->rx_vq, 1).
 step 4: At the end of the second loop of NAPI, recheck
         whether kick is needed, as the event idx in the
         used ring written by the host is beyound the
         range of kick condition, the guest will not
         send kick signal to the host.

Fixes: 06b1297017 ("virtio-net: fix network stall under load")
Cc: qemu-stable@nongnu.org
Signed-off-by: Wencheng Yang <east.moutain.yang@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
2024-08-02 11:09:52 +08:00
..
acpi hw/acpi: Update CPUs AML with cpu-(ctrl)dev change 2024-07-22 20:15:41 -04:00
adc aspeed/adc: Add AST2700 support 2024-07-21 07:46:38 +02:00
arm aspeed: Introduce a AspeedSoCClass 'boot_from_emmc' handler 2024-07-21 07:46:38 +02:00
audio virtio-snd: rewrite invalid tx/rx message handling 2024-04-09 02:31:16 -04:00
block aspeed/smc: Only wire flash devices at reset 2024-03-19 11:58:15 +01:00
char hw/char/stm32l4x5_usart: Enable serial read and write 2024-04-25 10:21:59 +01:00
core physmem: Add helper function to destroy CPU AddressSpace 2024-07-22 20:15:41 -04:00
cpu
cris hw/net/etraxfs-eth: use qemu_configure_nic_device() 2024-02-02 16:23:47 +00:00
cxl hw/cxl: Support firmware updates 2024-07-21 14:42:58 -04:00
display hw/display : Add device DM163 2024-04-30 16:02:43 +01:00
dma hw/dma: Pass parent object to i8257_dma_init() 2024-02-15 16:58:46 +01:00
firmware hw/smbios: Remove 'uuid_encoded' argument from smbios_set_defaults() 2024-06-19 12:40:49 +02:00
fsi hw/fsi: Aspeed APB2OPB & On-chip peripheral bus 2024-02-01 08:33:18 +01:00
gpio hw/gpio/aspeed: Add reg_table_count to AspeedGPIOClass 2024-07-02 07:52:43 +02:00
hyperv vmbus: Print a warning when enabled without the recommended set of features 2024-03-08 14:18:56 +01:00
i2c hw/i2c/aspeed: rename the I2C class pool attribute to share_pool 2024-07-21 07:46:38 +02:00
i386 target/i386/cpu: Mask off SGX/SGX_LC feature words for non-PC machine 2024-07-31 13:13:31 +02:00
ide ide, vl: turn -win2k-hack into a property on IDE devices 2024-02-28 00:23:39 +01:00
input hw/input/pckbd: Open-code i8042_setup_a20_line() wrapper 2024-02-22 12:47:35 +01:00
intc hw/loongarch: Remove unimplemented extioi INT_encode mode 2024-07-19 10:40:04 +08:00
ipack
ipmi
isa hw/isa/vt82c686: Bring back via_isa_set_irq() 2023-11-28 14:26:37 +01:00
loongarch hw/loongarch: Modify flash block size to 256K 2024-07-19 10:40:04 +08:00
m68k m68k: Clean up includes 2024-01-30 21:20:20 +03:00
mem hw/mem/memory-device: Remove legacy_align from memory_device_pre_plug() 2024-06-19 12:40:49 +02:00
mips hw/mips: Inline 'bios.h' definitions 2024-01-05 16:20:15 +01:00
misc aspeed/scu: Add boot-from-eMMC HW strapping bit for AST2600 SoC 2024-07-21 07:46:38 +02:00
net hw/net:ftgmac100: introduce TX and RX ring base address high registers to support 64 bits 2024-07-09 08:05:44 +02:00
nubus hw/nubus: add nubus-virtio-mmio device 2024-02-27 09:36:39 +01:00
nvram hw/nvram: Add BCM2835 OTP device 2024-07-01 12:48:55 +01:00
openrisc hw/openrisc: Split re-usable boot time apis out to boot.c 2022-09-04 07:02:56 +01:00
pci Revert "hw/pci: Rename has_power to enabled" 2024-08-01 04:32:00 -04:00
pci-bridge hw/cxl: Add a switch mailbox CCI function 2023-11-07 03:39:11 -05:00
pci-host hw/ppc: Avoid using Monitor in pnv_phb4_pic_print_info() 2024-06-19 12:40:49 +02:00
ppc pnv/xive2: Dump more END state with 'info pic' 2024-07-26 09:51:33 +10:00
remote include/hw/pci: Split pci_device.h off pci.h 2023-01-08 01:54:22 -05:00
riscv hw/riscv/virt.c: add address-cells in create_fdt_one_aplic() 2024-06-26 22:32:29 +10:00
rtc hw/i386: move rtc-reset-reinjection command out of hw/rtc 2024-05-10 15:45:15 +02:00
rx hw/rx/rx62n: Only call qdev_get_gpio_in() when necessary 2024-02-15 16:58:46 +01:00
s390x hw/intc/s390_flic: Fix interrupt controller migration on s390x with TCG 2024-07-02 08:02:01 +02:00
scsi esp.c: keep track of the DRQ state during DMA 2024-02-13 19:37:28 +00:00
sd hw/sd/sdcard: Basis for eMMC support 2024-07-16 20:26:47 +02:00
sensor hw/sensor: Add IC_DEVICE_ID to ISL voltage regulators 2022-07-14 16:24:38 +02:00
sh4
southbridge hw/isa/piix: Allow for optional PIT creation in PIIX3 2023-10-22 05:18:17 -04:00
sparc hw/sparc/grlib: split out the headers for each peripherals 2024-02-15 16:58:46 +01:00
ssi hw/ssi: Extend SPI model 2024-07-26 09:21:06 +10:00
timer hw/timer: Move HPET_INTCAP definition to "hpet.h" 2024-02-20 20:34:21 +03:00
tricore hw/tricore/testboard: Use qdev_new() instead of QOM basic API 2024-02-22 12:47:40 +01:00
usb include: Include headers where needed 2023-01-08 01:54:22 -05:00
vfio vfio/common: Allow disabling device dirty page tracking 2024-07-23 17:14:53 +02:00
virtio virtio-net: Fix network stall at the host side waiting for kick 2024-08-02 11:09:52 +08:00
watchdog aspeed/wdt: Add AST2700 support 2024-06-16 21:08:54 +02:00
xen hw/xen: detect when running inside stubdomain 2024-07-01 14:57:18 +02:00
xtensa
boards.h smbios: make memory device size configurable per Machine 2024-07-22 20:15:41 -04:00
clock.h hw/clock: Let clock_set_mul_div() return a boolean value 2024-03-26 14:24:06 +01:00
elf_ops.h.inc hw/elf_ops: Rename elf_ops.h -> elf_ops.h.inc 2024-04-25 12:48:12 +02:00
fw-path-provider.h
hotplug.h pci: fix 'hotplugglable' property behavior 2023-03-07 12:38:59 -05:00
hw.h compiler.h: replace QEMU_NORETURN with G_NORETURN 2022-04-21 17:03:51 +04:00
irq.h hw/core/irq: remove unused 'qemu_irq_split' function 2022-04-21 11:37:04 +01:00
loader-fit.h
loader.h loader: remove load_image_gzipped function as its not used anywhere 2024-07-16 20:04:08 +02:00
nmi.h
or-irq.h hw: Replace qemu_or_irq typedef by OrIRQState 2023-02-27 13:27:05 +00:00
pcmcia.h replace TABs with spaces 2023-03-20 12:43:50 +01:00
platform-bus.h
ptimer.h ptimer: Rename PTIMER_POLICY_DEFAULT to PTIMER_POLICY_LEGACY 2022-05-19 16:19:03 +01:00
qdev-clock.h
qdev-core.h include/hw/qdev-core.h: Correct and clarify gpio doc comments 2024-07-16 20:04:08 +02:00
qdev-dma.h
qdev-properties-system.h migration/multifd: Add new migration option zero-page-detection. 2024-03-11 16:57:05 -04:00
qdev-properties.h qdev-properties: alias all object class properties 2023-12-21 22:49:28 +01:00
register.h
registerfields.h hw/registerfields: Add shared fields macros 2022-06-22 09:49:34 +02:00
resettable.h reset: Add RESET_TYPE_SNAPSHOT_LOAD 2024-04-25 10:21:59 +01:00
stream.h
sysbus.h hw/sysbus: Remove now unused sysbus_address_space() 2024-02-26 18:40:21 +01:00
usb.h hw/usb: remove usb_bus_find 2024-02-27 09:37:21 +01:00
vmstate-if.h