Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
double-check with the compiler flags for __ARM_NEON and don't bother with
a runtime check. Otherwise, model the loop after the x86 SSE2 function.
Use UMAXV for the vector reduction. This is 3 cycles on cortex-a76 and
2 cycles on neoverse-n1.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>