Fix some of the nasty TCG race conditions and crashes by implementing
cpu_exit() as setting a flag which is checked at the start of each TB.
This avoids crashes if a thread or signal handler calls cpu_exit()
while the execution thread is itself modifying the TB graph (which
may happen in system emulation mode as well as in linux-user mode
with a multithreaded guest binary).
This fixes the crashes seen in LP:668799; however there are another
class of crashes described in LP:1098729 which stem from the fact
that in linux-user with a multithreaded guest all threads will
use and modify the same global TCG date structures (including the
generated code buffer) without any kind of locking. This means that
multithreaded guest binaries are still in the "unsupported"
category.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Document tcg_qemu_tb_exec(). In particular, its return value is a
combination of a pointer to the next translation block and some
extra information in the low two bits. Provide some #defines for
the values passed in these bits to improve code clarity.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Switch the default for qemu_log logging output from "/tmp/qemu.log"
to stderr. This is an incompatible change in some sense, but logging
is mostly used for debugging purposes so it shouldn't affect production
use. The previous behaviour can be obtained by adding "-D /tmp/qemu.log"
to the command line.
This change requires us to:
* update all the documentation/help text (we take the opportunity
to smooth out minor inconsistencies between the phrasing in
linux-user/bsd-user/system help messages)
* make linux-user and bsd-user defer to qemu-log for the default
logging destination rather than overriding it themselves
* ensure that all logfile closing is done via qemu_log_close()
and that that function doesn't close stderr
as well as the obvious change to the behaviour of do_qemu_set_log()
when no logfile name has been specified.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-id: 1361901160-28729-1-git-send-email-peter.maydell@linaro.org
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We even had the encoding of smull already handy...
Cc: Andrzej Zaborowski <balrogg@gmail.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
We're going to have use for this shortly in implementing other helpers.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Commit 0b0d3320db (TCG: Final globals
clean-up) moved code_gen_prologue but forgot to update ppc code.
This broke the build on 32-bit ppc. ppc64 is unaffected.
Cc: Evgeny Voevodin <evgenyvoevodin@gmail.com>
Cc: Blue Swirl <blauwirbel@gmail.com>
Signed-off-by: Andreas Färber <andreas.faerber@web.de>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Rename the public-facing function cpu_set_log to qemu_set_log. This
requires us to rename the internal-only qemu_set_log() to
do_qemu_set_log().
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
It's worth to clean-up translation blocks variables and move them
into one context as was suggested by Swirl.
Also if we use this context directly inside tcg_ctx, then it
speeds up code generation a bit.
Signed-off-by: Evgeny Voevodin <evgenyvoevodin@gmail.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Silence a (legitimate) complaint about missing parentheses:
tcg/arm/tcg-target.c: In function ‘tcg_out_qemu_ld’:
tcg/arm/tcg-target.c:1148:5: error: suggest parentheses around
comparison in operand of ‘&’ [-Werror=parentheses]
tcg/arm/tcg-target.c: In function ‘tcg_out_qemu_st’:
tcg/arm/tcg-target.c:1357:5: error: suggest parentheses around
comparison in operand of ‘&’ [-Werror=parentheses]
which meant that we would mistakenly always assert if running
a QEMU built with debug enabled on ARM.
Signed-off-by: Peter Maydell <peter.maydelL@linaro.org>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This adds two optimizations using the non-zero bit mask. In some cases
involving shifts or ANDs the value can become zero, and can thus be
optimized to a move of zero. Second, useless zero-extension or an
AND with constant can be detected that would only zero bits that are
already zero.
The main advantage of this optimization is that it turns zero-extensions
into moves, thus enabling much better copy propagation (around 1% code
reduction). Here is for example a "test $0xff0000,%ecx + je" before
optimization:
mov_i64 tmp0,rcx
movi_i64 tmp1,$0xff0000
discard cc_src
and_i64 cc_dst,tmp0,tmp1
movi_i32 cc_op,$0x1c
ext32u_i64 tmp0,cc_dst
movi_i64 tmp12,$0x0
brcond_i64 tmp0,tmp12,eq,$0x0
and after (without patch on the left, with on the right):
movi_i64 tmp1,$0xff0000 movi_i64 tmp1,$0xff0000
discard cc_src discard cc_src
and_i64 cc_dst,rcx,tmp1 and_i64 cc_dst,rcx,tmp1
movi_i32 cc_op,$0x1c movi_i32 cc_op,$0x1c
ext32u_i64 tmp0,cc_dst
movi_i64 tmp12,$0x0 movi_i64 tmp12,$0x0
brcond_i64 tmp0,tmp12,eq,$0x0 brcond_i64 cc_dst,tmp12,eq,$0x0
Other similar cases: "test %eax, %eax + jne" where eax is already 32-bit
(after optimization, without patch on the left, with on the right):
discard cc_src discard cc_src
mov_i64 cc_dst,rax mov_i64 cc_dst,rax
movi_i32 cc_op,$0x1c movi_i32 cc_op,$0x1c
ext32u_i64 tmp0,cc_dst
movi_i64 tmp12,$0x0 movi_i64 tmp12,$0x0
brcond_i64 tmp0,tmp12,ne,$0x0 brcond_i64 rax,tmp12,ne,$0x0
"test $0x1, %dl + je":
movi_i64 tmp1,$0x1 movi_i64 tmp1,$0x1
discard cc_src discard cc_src
and_i64 cc_dst,rdx,tmp1 and_i64 cc_dst,rdx,tmp1
movi_i32 cc_op,$0x1a movi_i32 cc_op,$0x1a
ext8u_i64 tmp0,cc_dst
movi_i64 tmp12,$0x0 movi_i64 tmp12,$0x0
brcond_i64 tmp0,tmp12,eq,$0x0 brcond_i64 cc_dst,tmp12,eq,$0x0
In some cases TCG even outsmarts GCC. :) Here the input code has
"and $0x2,%eax + movslq %eax,%rbx + test %rbx, %rbx" and the optimizer,
thanks to copy propagation, does the following:
movi_i64 tmp12,$0x2 movi_i64 tmp12,$0x2
and_i64 rax,rax,tmp12 and_i64 rax,rax,tmp12
mov_i64 cc_dst,rax mov_i64 cc_dst,rax
ext32s_i64 tmp0,rax -> nop
mov_i64 rbx,tmp0 -> mov_i64 rbx,cc_dst
and_i64 cc_dst,rbx,rbx -> nop
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Add a "mask" field to the tcg_temp_info struct. A bit that is zero
in "mask" will always be zero in the corresponding temporary.
Zero bits in the mask can be produced from moves of immediates,
zero-extensions, ANDs with constants, shifts; they can then be
be propagated by logical operations, shifts, sign-extensions,
negations, deposit operations, and conditional moves. Other
operations will just reset the mask to all-ones, i.e. unknown.
[rth: s/target_ulong/tcg_target_ulong/]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
The next patch will add to the TCG optimizer a field that should be
non-zero in the default case. Thus, replace the memset of the
temps array with a loop. Only the state field has to be up-to-date,
because others are not used except if the state is TCG_TEMP_COPY
or TCG_TEMP_CONST.
[rth: Extracted the loop to a function.]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Commit 7f6f0ae5b9 added two assertions.
One of these assertions is not needed:
The pointer ts is never NULL because it is initialized with the
address of an array element.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Existing compile-time detection is spotty at best. Convert
it all to runtime detection instead.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
In dead_temp, local temps should always be marked as back to memory,
even if they have not been allocated (i.e. they are discared before
cross a basic block).
It fixes the following assertion in target-xtensa:
qemu-system-xtensa: tcg/tcg.c:1665: temp_save: Assertion `s->temps[temp].val_type == 2 || s->temps[temp].fixed_reg' failed.
Aborted
Reported-by: Max Filippov <jcmvbkbc@gmail.com>
Tested-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
The bswap16 TCG opcode assumes that the high bytes of the temp equal
to 0 before calling it. The ARM backend implementation takes this
assumption to slightly optimize the generated code.
The same implementation is called for implementing the cross-endian
qemu_st16 opcode, where this assumption is not true anymore. One way to
fix that would be to zero the high bytes before calling it. Given the
store instruction just ignore them, it is possible to provide a slightly
more optimized version. With ARMv6+ the rev16 instruction does the work
correctly. For lower ARM versions the patch provides a version which
behaves correctly with non-zero high bytes, but fill them with junk.
Cc: Andrzej Zaborowski <balrogg@gmail.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: qemu-stable@nongnu.org
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
The TCG arm backend considers likely that the offset to the TLB
entries does not exceed 12 bits for mem_index = 0. In practice this is
not true for at least the MIPS target.
The current patch fixes that by loading the bits 23-12 with a separate
instruction, and using loads with address writeback, independently of
the value of mem_idx. In total this allow a 24-bit offset, which is a
lot more than needed.
Cc: Andrzej Zaborowski <balrogg@gmail.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: qemu-stable@nongnu.org
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
The operations for INDEX_op_deposit_i32 and INDEX_op_deposit_i64
are now supported and enabled by default.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
mmu access looks something like:
<check tlb>
if miss goto slow_path
<fast path>
done:
...
; end of the TB
slow_path:
<pre process>
mr r3, r27 ; move areg0 to r3
; (r3 holds the first argument for all the PPC32 ABIs)
<call mmu_helper>
b $+8
.long done
<post process>
b done
On ppc32 <call mmu_helper> is:
(SysV and Darwin)
mmu_helper is most likely not within direct branching distance from
the call site, necessitating
a. moving 32 bit offset of mmu_helper into a GPR ; 8 bytes
b. moving GPR to CTR/LR ; 4 bytes
c. (finally) branching to CTR/LR ; 4 bytes
r3 setting - 4 bytes
call - 16 bytes
dummy jump over retaddr - 4 bytes
embedded retaddr - 4 bytes
Total overhead - 28 bytes
(PowerOpen (AIX))
a. moving 32 bit offset of mmu_helper's TOC into a GPR1 ; 8 bytes
b. loading 32 bit function pointer into GPR2 ; 4 bytes
c. moving GPR2 to CTR/LR ; 4 bytes
d. loading 32 bit small area pointer into R2 ; 4 bytes
e. (finally) branching to CTR/LR ; 4 bytes
r3 setting - 4 bytes
call - 24 bytes
dummy jump over retaddr - 4 bytes
embedded retaddr - 4 bytes
Total overhead - 36 bytes
Following is done to trim the code size of slow path sections:
In tcg_target_qemu_prologue trampolines are emitted that look like this:
trampoline:
mfspr r3, LR
addi r3, 4
mtspr LR, r3 ; fixup LR to point over embedded retaddr
mr r3, r27
<jump mmu_helper> ; tail call of sorts
And slow path becomes:
slow_path:
<pre process>
<call trampoline>
.long done
<post process>
b done
call - 4 bytes (trampoline is within code gen buffer
and most likely accessible via
direct branch)
embedded retaddr - 4 bytes
Total overhead - 8 bytes
In the end the icache pressure is decreased by 20/28 bytes at the cost
of an extra jump to trampoline and adjusting LR (to skip over embedded
retaddr) once inside.
Signed-off-by: malc <av1474@comtv.ru>