When host vector registers and operations were introduced, I failed
to mark the registers call clobbered as required by the ABI.
Fixes: 770c2fc7bb
Cc: qemu-stable@nongnu.org
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Also, assert that we don't overflow any of two different offsets into
the TB. Both unwind and goto_tb both record a uint16_t for later use.
This fixes an arm-softmmu test case utilizing NEON in which there is
a TB generated that runs to 7800 opcodes, and compiles to 96k on an
x86_64 host. This overflows the 16-bit offset in which we record the
goto_tb reset offset. Because of that overflow, we install a jump
destination that goes to neverland. Boom.
With this reduced op count, the same TB compiles to about 48k for
aarch64, ppc64le, and x86_64 hosts, and neither assertion fires.
Cc: qemu-stable@nongnu.org
Reported-by: "Jason A. Donenfeld" <Jason@zx2c4.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
The assembler in most versions of Mac OS X is pretty old and does not
support the xgetbv instruction. To go around this problem, the raw
encoding of the instruction is used instead.
Signed-off-by: John Arbuckle <programmingkidx@gmail.com>
Message-Id: <20180604215102.11002-1-programmingkidx@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
The VPUNPCKLD* instructions are all "non-destructive source",
indicated by "NDS" in the encoding string in the x86 ISA manual.
This means that they take two source operands, one of which is
encoded in the VEX.vvvv field. We were incorrectly treating them
as if they were destructive-source and passing 0 as the 'v'
argument of tcg_out_vex_modrm(). This meant we were always
using %xmm0 as one of the source operands, causing incorrect
results if the register allocator happened to want to use
something else. For instance the input AArch64 insn:
DUP v26.16b, w21
which becomes TCG IR ops:
dup_vec v128,e8,tmp2,x21
st_vec v128,e8,tmp2,env,$0xa40
was assembled to:
0x607c568c: c4 c1 7a 7e 86 e8 00 00 vmovq 0xe8(%r14), %xmm0
0x607c5694: 00
0x607c5695: c5 f9 60 c8 vpunpcklbw %xmm0, %xmm0, %xmm1
0x607c5699: c5 f9 61 c9 vpunpcklwd %xmm1, %xmm0, %xmm1
0x607c569d: c5 f9 70 c9 00 vpshufd $0, %xmm1, %xmm1
0x607c56a2: c4 c1 7a 7f 8e 40 0a 00 vmovdqu %xmm1, 0xa40(%r14)
0x607c56aa: 00
when the vpunpcklwd insn should be "%xmm1, %xmm1, %xmm1".
This resulted in our incorrectly setting the output vector to
q26=0000320000003200:0000320000003200
when given an input of x21 == 0000000002803200
rather than the expected all-zeroes.
Pass the correct source register number to tcg_out_vex_modrm()
for these insns.
Fixes: 770c2fc7bb
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20180504153431.5169-1-peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Unknown why -m32 was passing with gcc but not clang; it should have
failed for both. This would be used for tcg_gen_dup_i64_vec, and
visible with the right TB and an aarch64 guest.
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
The x86 vector instruction set is extremely irregular. With newer
editions, Intel has filled in some of the blanks. However, we don't
get many 64-bit operations until SSE4.2, introduced in 2009.
The subsequent edition was for AVX1, introduced in 2011, which added
three-operand addressing, and adjusts how all instructions should be
encoded.
Given the relatively narrow 2 year window between possible to support
and desirable to support, and to vastly simplify code maintainence,
I am only planning to support AVX1 and later cpus.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
It's not even clear what the interface REG and VAL32 were supposed to mean.
All uses had REG = 0 and VAL32 was the bitset assigned to the destination.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Already it saves 2 bytes per call, but also the constant pool
entry may well be shared across multiple calls.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Dispense with TCGBackendData, as it has never been used for more than
holding a single pointer. Use a define in the cpu/tcg-target.h to
signal requirement for TCGLabelQemuLdst, so that we can drop the no-op
tcg-be-null.h stubs. Rename tcg-be-ldst.h to tcg-ldst.inc.c.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Clang 3.9 passes the CONFIG_AVX2_OPT configure test. However, the
supplied <cpuid.h> does not contain the bit_AVX2 define that we use
when detecting whether the routine can be enabled.
Introduce a qemu-specific header that uses the compiler's definition
of __cpuid et al, but supplies any missing bit_* definitions needed.
This avoids introducing any extra ifdefs to util/bufferiszero.c, and
allows quite a few to be removed from tcg/i386/tcg-target.inc.c.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 20170719044018.18063-1-rth@twiddle.net
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
This reverts commit 4ac7691073.
This fixes
http://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03062.html
While I think we could get away with relying on the undocumented
behaviour, the tcg constraint system isn't powerful enough to
properly describe the required (non-)overlap conditions.
Reported-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
The ISA manual documents the output is undefined if the input was zero.
However, we document in target-i386 that the behavior of real silicon
is to preserve the contents of the output register. We also mention
that there are real applications that depend on this. That this is
baked into silicon is mentioned as a potential cause for some false
sharing behaviour wrt lzcnt/tzcnt.
Taking advantage of this allows us to save 2 insns in the normal case,
and 4 insns for i686 emulating a 64-bit clz.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Previously we could not have different constraints for different ISA levels,
which prevented us from eliding the matching constraint for shifts.
We do now have to make sure that the operands match for constant shifts.
We can also handle some small left shifts via lea.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Use a switch instead of searching a table. Share constraints between
32-bit and 64-bit, when at all possible.
Signed-off-by: Richard Henderson <rth@twiddle.net>
This will let us choose how to interpret a given constraint
depending on whether the opcode is 32- or 64-bit. Which will
let us share more constraint combinations between opcodes.
At the same time, change the interface to return the advanced
pointer instead of passing it in/out by reference.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
This will allow the target to tailor the constraints to the
auto-detected ISA extensions.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
TARGET_PAGE_MASK, as defined, has type "int". We need to extend
that to the proper target width before oring in an "unsigned".
Signed-off-by: Richard Henderson <rth@twiddle.net>
Generate a 'lock orl $0,0(%esp)' instruction for ordering instead of
mfence which has similar ordering semantics.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Message-Id: <20160714202026.9727-3-bobby.prani@gmail.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Previously we allowed fully unaligned operations, but not operations
that are aligned but with less alignment than the operation size.
In addition, arm32, ia64, mips, and sparc had been omitted from the
previous overalignment patch, which would have led to that alignment
being enforced.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Some architectures (e.g. ARMv8) need the address which is aligned
to a size more than the size of the memory access.
To support such check it's enough the current costless alignment
check implementation in QEMU, but we need to support
an alignment size specifying.
Signed-off-by: Sergey Sorokin <afarallax@yandex.ru>
Message-Id: <1466705806-679898-1-git-send-email-afarallax@yandex.ru>
Signed-off-by: Richard Henderson <rth@twiddle.net>
[rth: Assert in tcg_canonicalize_memop. Leave get_alignment_bits
available for, though unused by, user-mode. Retain logging difference
based on ALIGNED_ONLY.]
While we can store constants via constrants on INDEX_op_st_i32 et al,
we weren't able to spill constants to backing store.
Add a new backend interface, tcg_out_sti, which may store the constant
(and is allowed to fail). Rearrange the temp_* helpers so that we only
attempt to directly store a constant when the temp is becoming dead/free.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Briefly describe in a comment how direct block chaining is done. It
should help in understanding of the following data fields.
Rename some fields in TranslationBlock and TCGContext structures to
better reflect their purpose (dropping excessive 'tb_' prefix in
TranslationBlock but keeping it in TCGContext):
tb_next_offset => jmp_reset_offset
tb_jmp_offset => jmp_insn_offset
tb_next => jmp_target_addr
jmp_next => jmp_list_next
jmp_first => jmp_list_first
Avoid using a magic constant as an invalid offset which is used to
indicate that there's no n-th jump generated.
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Ensure direct jump patching in i386 is atomic by:
* naturally aligning a location of direct jump address;
* using atomic_read()/atomic_set() for code patching.
tcg_out_nopn() implementation:
Suggested-by: Richard Henderson <rth@twiddle.net>.
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Message-Id: <1461341333-19646-6-git-send-email-sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Check for CONFIG_DEBUG_TCG instead of NDEBUG, drop now useless code.
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Message-id: 1461228530-14852-2-git-send-email-aurelien@aurel32.net
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The TCG code is quite performance sensitive, but at the same time can
also be quite tricky. That is why asserts that can be enabled with the
--enable-debug-tcg configure option.
This used to work the following way:
| #include "config.h"
|
| ...
|
| #if !defined(CONFIG_DEBUG_TCG) && !defined(NDEBUG)
| /* define it to suppress various consistency checks (faster) */
| #define NDEBUG
| #endif
|
| ...
|
| #include <assert.h>
Since commit 757e725b (tcg: Clean up includes) "config.h" as been
replaced by "qemu/osdep.h" which itself includes <assert.h>. As a
consequence the assertions are always enabled, even when using
--disable-debug-tcg, causing a performance regression, especially on
targets with many registers. For instance on qemu-system-ppc the
speed difference is about 15%.
tcg_debug_assert is controlled directly by CONFIG_DEBUG_TCG and already
uses in some places. This patch replaces all the calls to assert into
calss to tcg_debug_assert.
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Message-id: 1461228530-14852-1-git-send-email-aurelien@aurel32.net
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Commit 757e725b58 added a number of #include "qemu/osdep.h"
files to the tcg-target.c files (as they were named at the time).
These are unnecessary because these files are not standalone C
files, and the tcg/tcg.c file which includes them will have
already included osdep.h on their behalf. Remove the unneeded
include directives.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <1456238983-10160-4-git-send-email-peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Rename the per-architecture tcg-target.c files to tcg-target.inc.c.
This makes it clearer that they are not intended to be standalone
C files, but are instead #included into another source file.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <1456238983-10160-2-git-send-email-peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>