mirrors/qemu - qemu - SynapseOS git

Author	SHA1	Message	Date
Richard Henderson	22437b4de9	util/bufferiszero: Add simd acceleration for aarch64 Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely double-check with the compiler flags for __ARM_NEON and don't bother with a runtime check. Otherwise, model the loop after the x86 SSE2 function. Use UMAXV for the vector reduction. This is 3 cycles on cortex-a76 and 2 cycles on neoverse-n1. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>	2024-05-03 08:03:35 -07:00
Richard Henderson	bf67aa3dd2	util/bufferiszero: Simplify test_buffer_is_zero_next_accel Because the three alternatives are monotonic, we don't need to keep a couple of bitmasks, just identify the strongest alternative at startup. Generalize test_buffer_is_zero_next_accel and init_accel by always defining an accel_table array. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>	2024-05-03 08:03:05 -07:00
Richard Henderson	0100ce2b49	util/bufferiszero: Introduce biz_accel_fn typedef Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>	2024-05-03 08:03:05 -07:00
Richard Henderson	7ae6399a85	util/bufferiszero: Improve scalar variant Split less-than and greater-than 256 cases. Use unaligned accesses for head and tail. Avoid using out-of-bounds pointers in loop boundary conditions. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>	2024-05-03 08:03:05 -07:00
Alexander Monakov	f28e0bbefa	util/bufferiszero: Optimize SSE2 and AVX2 variants Increase unroll factor in SIMD loops from 4x to 8x in order to move their bottlenecks from ALU port contention to load issue rate (two loads per cycle on popular x86 implementations). Avoid using out-of-bounds pointers in loop boundary conditions. Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of PTEST, which is not profitable there (like in the removed SSE4 variant). Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20240206204809.9859-6-amonakov@ispras.ru>	2024-05-03 08:03:05 -07:00
Alexander Monakov	93a6085618	util/bufferiszero: Remove useless prefetches Use of prefetching in bufferiszero.c is quite questionable: - prefetches are issued just a few CPU cycles before the corresponding line would be hit by demand loads; - they are done for simple access patterns, i.e. where hardware prefetchers can perform better; - they compete for load ports in loops that should be limited by load port throughput rather than ALU throughput. Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20240206204809.9859-5-amonakov@ispras.ru>	2024-05-03 08:03:05 -07:00
Alexander Monakov	cbe3d52646	util/bufferiszero: Reorganize for early test for acceleration Test for length >= 256 inline, where is is often a constant. Before calling into the accelerated routine, sample three bytes from the buffer, which handles most non-zero buffers. Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru> Message-Id: <20240206204809.9859-3-amonakov@ispras.ru> [rth: Use __builtin_constant_p; move the indirect call out of line.] Signed-off-by: Richard Henderson <richard.henderson@linaro.org>	2024-05-03 08:03:05 -07:00
Alexander Monakov	d018425c32	util/bufferiszero: Remove AVX512 variant Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD routines are invoked much more rarely in normal use when most buffers are non-zero. This makes use of AVX512 unprofitable, as it incurs extra frequency and voltage transition periods during which the CPU operates at reduced performance, as described in https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru> Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20240206204809.9859-4-amonakov@ispras.ru> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>	2024-05-03 08:03:04 -07:00
Alexander Monakov	8a917b99d5	util/bufferiszero: Remove SSE4.1 variant The SSE4.1 variant is virtually identical to the SSE2 variant, except for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing if an SSE register is all zeroes. The PTEST instruction decodes to two uops, so it can be handled only by the complex decoder, and since CMP+JNE are macro-fused, both sequences decode to three uops. The uops comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so PCMPEQB+PMOVMSKB is comparatively more flexible from dispatch standpoint. Hence, the use of PTEST brings no benefit from throughput standpoint. Its latency is not important, since it feeds only a conditional jump, which terminates the dependency chain. I never observed PTEST variants to be faster on real hardware. Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20240206204809.9859-2-amonakov@ispras.ru>	2024-05-03 08:03:04 -07:00
Richard Henderson	51f4d916b5	util/bufferiszero: Use i386 host/cpuinfo.h Use cpuinfo_init() during init_accel(), and the variable cpuinfo during test_buffer_is_zero_next_accel(). Adjust the logic that cycles through the set of accelerators for testing. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>	2023-05-23 16:51:13 -07:00
Richard Henderson	5d133dd839	include/qemu/cpuid: Introduce xgetbv_low Replace the two uses of asm to expand xgetbv with an inline function. Since one of the two has been using the mnemonic, assume that the comment about "older versions of the assember" is obsolete, as even that is 4 years old. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>	2023-03-05 13:44:07 -08:00
Richard Henderson	701ea5870d	util/bufferiszero: Use __attribute__((target)) for avx2/avx512 Use the attribute, which is supported by clang, instead of the #pragma, which is not supported and, for some reason, also not detected by the meson probe, so we fail by -Werror. Include only <immintrin.h> as that is the outermost "official" header for these intrinsics -- emmintrin.h and smmintrin -- are older SSE2 and SSE4 specific headers, while the immintrin.h includes all of the Intel intrinsics. Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>	2023-01-16 10:14:12 -10:00
Michael S. Tsirkin	2a728de1ff	cpuid: use unsigned for max cpuid __get_cpuid_max returns an unsigned value. For consistency, store the result in an unsigned variable. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>	2022-02-04 09:07:43 -05:00
Robert Hoo	8f13a39dc0	util/bufferiszero: improve avx2 accelerator By increasing avx2 length_to_accel to 128, we can simplify its logic and reduce a branch. The authorship of this patch actually belongs to Richard Henderson <richard.henderson@linaro.org>, I just fixed a boundary case on his original patch. Suggested-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Message-Id: <1585119021-46593-2-git-send-email-robert.hu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-01 14:24:03 -04:00
Robert Hoo	b87c99d073	util/bufferiszero: assign length_to_accel value for each accelerator case Because in unit test, init_accel() will be called several times, each with different accelerator type. Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Message-Id: <1585119021-46593-1-git-send-email-robert.hu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-01 14:24:03 -04:00
Robert Hoo	27f08ea1c7	util: add util function buffer_zero_avx512() And intialize buffer_is_zero() with it, when Intel AVX512F is available on host. This function utilizes Intel AVX512 fundamental instructions which is faster than its implementation with AVX2 (in my unit test, with 4K buffer, on CascadeLake SP, ~36% faster, buffer_zero_avx512() V.S. buffer_zero_avx2()). Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-03-16 23:02:21 +01:00
Markus Armbruster	a8d2532645	Include qemu-common.h exactly where needed No header includes qemu-common.h after this commit, as prescribed by qemu-common.h's file comment. Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190523143508.25387-5-armbru@redhat.com> [Rebased with conflicts resolved automatically, except for include/hw/arm/xlnx-zynqmp.h hw/arm/nrf51_soc.c hw/arm/msf2-soc.c block/qcow2-refcount.c block/qcow2-cluster.c block/qcow2-cache.c target/arm/cpu.h target/lm32/cpu.h target/m68k/cpu.h target/mips/cpu.h target/moxie/cpu.h target/nios2/cpu.h target/openrisc/cpu.h target/riscv/cpu.h target/tilegx/cpu.h target/tricore/cpu.h target/unicore32/cpu.h target/xtensa/cpu.h; bsd-user/main.c and net/tap-bsd.c fixed up]	2019-06-12 13:20:20 +02:00
Richard Henderson	5dd8990841	util: Introduce include/qemu/cpuid.h Clang 3.9 passes the CONFIG_AVX2_OPT configure test. However, the supplied <cpuid.h> does not contain the bit_AVX2 define that we use when detecting whether the routine can be enabled. Introduce a qemu-specific header that uses the compiler's definition of __cpuid et al, but supplies any missing bit_* definitions needed. This avoids introducing any extra ifdefs to util/bufferiszero.c, and allows quite a few to be removed from tcg/i386/tcg-target.inc.c. Signed-off-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Peter Maydell <peter.maydell@linaro.org> Message-id: 20170719044018.18063-1-rth@twiddle.net Signed-off-by: Peter Maydell <peter.maydell@linaro.org>	2017-07-24 12:42:55 +01:00
Richard Henderson	d9911d14e0	cutils: Rewrite x86 buffer zero checking Handle alignment of buffers, so that the vector paths can be used more often. Signed-off-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <1473800239-13841-1-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-09-14 12:25:14 +02:00
Richard Henderson	083d012a38	cutils: Add generic prefetch There's no real knowledge of the cacheline size, just prefetching one loop ahead. Signed-off-by: Richard Henderson <rth@twiddle.net> Message-Id: <1472496380-19706-7-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-09-13 19:13:32 +02:00
Paolo Bonzini	86444f084b	cutils: Add SSE4 version Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-09-13 19:13:32 +02:00
Richard Henderson	efad668245	cutils: Add test for buffer_is_zero Signed-off-by: Richard Henderson <rth@twiddle.net> Message-Id: <1472496380-19706-6-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-09-13 19:13:32 +02:00
Richard Henderson	43ff5e01ec	cutils: Remove ppc buffer zero checking For ppc64le, gcc6 does extremely poorly with the Altivec code. Moreover, on POWER7 and POWER8, a hand-optimized Altivec version turns out to be no faster than the revised integer version, and therefore not worth the effort. Signed-off-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-09-13 19:13:32 +02:00
Richard Henderson	2250d3a293	cutils: Remove aarch64 buffer zero checking The revised integer version is 4 times faster than the neon version on an AppliedMicro Mustang. Even with hand scheduling and additional unrolling I cannot make any neon version run as fast as the integer. Signed-off-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-09-13 19:13:31 +02:00
Richard Henderson	5e33a87222	cutils: Rearrange buffer_is_zero acceleration Allow selection of several acceleration functions based on the size and alignment of the buffer. Do not require ifunc support for AVX2 acceleration. Signed-off-by: Richard Henderson <rth@twiddle.net> Message-Id: <1472496380-19706-5-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-09-13 19:13:30 +02:00
Richard Henderson	a1febc4950	cutils: Export only buffer_is_zero Since the two users don't make use of the returned offset, beyond ensuring that the entire buffer is zero, consider the can_use_buffer_find_nonzero_offset and buffer_find_nonzero_offset functions internal. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Richard Henderson <rth@twiddle.net> Message-Id: <1472496380-19706-4-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-09-13 19:09:45 +02:00
Richard Henderson	8c70c1b0c7	cutils: Remove SPLAT macro This is unused and complicates the vector interface. Signed-off-by: Richard Henderson <rth@twiddle.net> Message-Id: <1472496380-19706-3-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-09-13 19:09:45 +02:00
Richard Henderson	88ca8e80de	cutils: Move buffer_is_zero and subroutines to a new file Signed-off-by: Richard Henderson <rth@twiddle.net> Message-Id: <1472496380-19706-2-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-09-13 19:09:45 +02:00

28 Commits