We had two fields specific to INDEX_op_call. Rename these and
add some macros so that the fields may be reused for other opcodes.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
With no fixed array allocation, we can't overflow a buffer.
This will be important as optimizations related to host vectors
may expand the number of ops used.
Use QTAILQ to link the ops together.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
These are now trivial sets and tests against NULL. Unwrap.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Rather than have separate code only used for guest_base,
rely on a recent change to handle constant pool entries.
Cc: qemu-s390x@nongnu.org
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Both ARMv6 and AArch64 currently may drop complex guest_base values
into the constant pool. But generic code wasn't expecting that, and
the pool is not emitted. Correct that.
Tested-by: Emilio G. Cota <cota@braap.org>
Tested-by: Laurent Desnogues <laurent.desnogues@gmail.com>
Reported-by: Laurent Desnogues <laurent.desnogues@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This is identical for each target. So, move the initialization to
common code. Move the variable itself out of tcg_ctx and name it
cpu_env to minimize changes within targets.
This also means we can remove tcg_global_reg_new_{ptr,i32,i64},
since there are no longer global-register temps created by targets.
Reviewed-by: Emilio G. Cota <cota@braap.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This enables parallel TCG code generation. However, we do not take
advantage of it yet since tb_lock is still held during tb_gen_code.
In user-mode we use a single TCG context; see the documentation
added to tcg_region_init for the rationale.
Note that targets do not need any conversion: targets initialize a
TCGContext (e.g. defining TCG globals), and after this initialization
has finished, the context is cloned by the vCPU threads, each of
them keeping a separate copy.
TCG threads claim one entry in tcg_ctxs[] by atomically increasing
n_tcg_ctxs. Do not be too annoyed by the subsequent atomic_read's
of that variable and tcg_ctxs; they are there just to play nice with
analysis tools such as thread sanitizer.
Note that we do not allocate an array of contexts (we allocate
an array of pointers instead) because when tcg_context_init
is called, we do not know yet how many contexts we'll use since
the bool behind qemu_tcg_mttcg_enabled() isn't set yet.
Previous patches folded some TCG globals into TCGContext. The non-const
globals remaining are only set at init time, i.e. before the TCG
threads are spawned. Here is a list of these set-at-init-time globals
under tcg/:
Only written by tcg_context_init:
- indirect_reg_alloc_order
- tcg_op_defs
Only written by tcg_target_init (called from tcg_context_init):
- tcg_target_available_regs
- tcg_target_call_clobber_regs
- arm: arm_arch, use_idiv_instructions
- i386: have_cmov, have_bmi1, have_bmi2, have_lzcnt,
have_movbe, have_popcnt
- mips: use_movnz_instructions, use_mips32_instructions,
use_mips32r2_instructions, got_sigill (tcg_target_detect_isa)
- ppc: have_isa_2_06, have_isa_3_00, tb_ret_addr
- s390: tb_ret_addr, s390_facilities
- sparc: qemu_ld_trampoline, qemu_st_trampoline (build_trampolines),
use_vis3_instructions
Only written by tcg_prologue_init:
- 'struct jit_code_entry one_entry'
- aarch64: tb_ret_addr
- arm: tb_ret_addr
- i386: tb_ret_addr, guest_base_flags
- ia64: tb_ret_addr
- mips: tb_ret_addr, bswap32_addr, bswap32u_addr, bswap64_addr
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This is groundwork for supporting multiple TCG contexts.
The naive solution here is to split code_gen_buffer statically
among the TCG threads; this however results in poor utilization
if translation needs are different across TCG threads.
What we do here is to add an extra layer of indirection, assigning
regions that act just like pages do in virtual memory allocation.
(BTW if you are wondering about the chosen naming, I did not want
to use blocks or pages because those are already heavily used in QEMU).
We use a global lock to serialize allocations as well as statistics
reporting (we now export the size of the used code_gen_buffer with
tcg_code_size()). Note that for the allocator we could just use
a counter and atomic_inc; however, that would complicate the gathering
of tcg_code_size()-like stats. So given that the region operations are
not a fast path, a lock seems the most reasonable choice.
The effectiveness of this approach is clear after seeing some numbers.
I used the bootup+shutdown of debian-arm with '-tb-size 80' as a benchmark.
Note that I'm evaluating this after enabling per-thread TCG (which
is done by a subsequent commit).
* -smp 1, 1 region (entire buffer):
qemu: flush code_size=83885014 nb_tbs=154739 avg_tb_size=357
qemu: flush code_size=83884902 nb_tbs=153136 avg_tb_size=363
qemu: flush code_size=83885014 nb_tbs=152777 avg_tb_size=364
qemu: flush code_size=83884950 nb_tbs=150057 avg_tb_size=373
qemu: flush code_size=83884998 nb_tbs=150234 avg_tb_size=373
qemu: flush code_size=83885014 nb_tbs=154009 avg_tb_size=360
qemu: flush code_size=83885014 nb_tbs=151007 avg_tb_size=370
qemu: flush code_size=83885014 nb_tbs=151816 avg_tb_size=367
That is, 8 flushes.
* -smp 8, 32 regions (80/32 MB per region) [i.e. this patch]:
qemu: flush code_size=76328008 nb_tbs=141040 avg_tb_size=356
qemu: flush code_size=75366534 nb_tbs=138000 avg_tb_size=361
qemu: flush code_size=76864546 nb_tbs=140653 avg_tb_size=361
qemu: flush code_size=76309084 nb_tbs=135945 avg_tb_size=375
qemu: flush code_size=74581856 nb_tbs=132909 avg_tb_size=375
qemu: flush code_size=73927256 nb_tbs=135616 avg_tb_size=360
qemu: flush code_size=78629426 nb_tbs=142896 avg_tb_size=365
qemu: flush code_size=76667052 nb_tbs=138508 avg_tb_size=368
Again, 8 flushes. Note how buffer utilization is not 100%, but it
is close. Smaller region sizes would yield higher utilization,
but we want region allocation to be rare (it acquires a lock), so
we do not want to go too small.
* -smp 8, static partitioning of 8 regions (10 MB per region):
qemu: flush code_size=21936504 nb_tbs=40570 avg_tb_size=354
qemu: flush code_size=11472174 nb_tbs=20633 avg_tb_size=370
qemu: flush code_size=11603976 nb_tbs=21059 avg_tb_size=365
qemu: flush code_size=23254872 nb_tbs=41243 avg_tb_size=377
qemu: flush code_size=28289496 nb_tbs=52057 avg_tb_size=358
qemu: flush code_size=43605160 nb_tbs=78896 avg_tb_size=367
qemu: flush code_size=45166552 nb_tbs=82158 avg_tb_size=364
qemu: flush code_size=63289640 nb_tbs=116494 avg_tb_size=358
qemu: flush code_size=51389960 nb_tbs=93937 avg_tb_size=362
qemu: flush code_size=59665928 nb_tbs=107063 avg_tb_size=372
qemu: flush code_size=38380824 nb_tbs=68597 avg_tb_size=374
qemu: flush code_size=44884568 nb_tbs=79901 avg_tb_size=376
qemu: flush code_size=50782632 nb_tbs=90681 avg_tb_size=374
qemu: flush code_size=39848888 nb_tbs=71433 avg_tb_size=372
qemu: flush code_size=64708840 nb_tbs=119052 avg_tb_size=359
qemu: flush code_size=49830008 nb_tbs=90992 avg_tb_size=362
qemu: flush code_size=68372408 nb_tbs=123442 avg_tb_size=368
qemu: flush code_size=33555560 nb_tbs=59514 avg_tb_size=378
qemu: flush code_size=44748344 nb_tbs=80974 avg_tb_size=367
qemu: flush code_size=37104248 nb_tbs=67609 avg_tb_size=364
That is, 20 flushes. Note how a static partitioning approach uses
the code buffer poorly, leading to many unnecessary flushes.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Groundwork for supporting multiple TCG contexts.
While at it, also allocate temps_used directly as a bitmap of the
required size, instead of using a bitmap of TCG_MAX_TEMPS via
TCGTempSet.
Performance-wise we lose about 1.12% in a translation-heavy workload
such as booting+shutting down debian-arm:
Performance counter stats for 'taskset -c 0 arm-softmmu/qemu-system-arm \
-machine type=virt -nographic -smp 1 -m 4096 \
-netdev user,id=unet,hostfwd=tcp::2222-:22 \
-device virtio-net-device,netdev=unet \
-drive file=die-on-boot.qcow2,id=myblock,index=0,if=none \
-device virtio-blk-device,drive=myblock \
-kernel kernel.img -append console=ttyAMA0 root=/dev/vda1 \
-name arm,debug-threads=on -smp 1' (10 runs):
exec time (s) Relative slowdown wrt original (%)
---------------------------------------------------------------
original 20.213321616 0.
tcg_malloc 20.441130078 1.1270214
TCGContext 20.477846517 1.3086662
g_malloc 20.780527895 2.8061013
The other two alternatives shown in the table are:
- TCGContext: embed temps[TCG_MAX_TEMPS] and TCGTempSet used_temps
in TCGContext. This is simple enough but it isn't faster than using
tcg_malloc; moreover, it wastes memory.
- g_malloc: allocate/deallocate both temps and used_temps every time
tcg_optimize is executed.
Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This is groundwork for supporting multiple TCG contexts.
To avoid scalability issues when profiling info is enabled, this patch
makes the profiling info counters distributed via the following changes:
1) Consolidate profile info into its own struct, TCGProfile, which
TCGContext also includes. Note that tcg_table_op_count is brought
into TCGProfile after dropping the tcg_ prefix.
2) Iterate over the TCG contexts in the system to obtain the total counts.
This change also requires updating the accessors to TCGProfile fields to
use atomic_read/set whenever there may be conflicting accesses (as defined
in C11) to them.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Groundwork for supporting multiple TCG contexts.
Note that having n_tcg_ctxs is unnecessary. However, it is
convenient to have it, since it will simplify iterating over the
array: we'll have just a for loop instead of having to iterate
over a NULL-terminated array (which would require n+1 elems)
or having to check with ifdef's for usermode/softmmu.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Groundwork for supporting multiple TCG contexts.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Groundwork for supporting multiple TCG contexts.
The core of this patch is this change to tcg/tcg.h:
> -extern TCGContext tcg_ctx;
> +extern TCGContext tcg_init_ctx;
> +extern TCGContext *tcg_ctx;
Note that for now we set *tcg_ctx to whatever TCGContext is passed
to tcg_context_init -- in this case &tcg_init_ctx.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Groundwork for supporting multiple TCG contexts.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Thereby decoupling the resulting translated code from the current state
of the system.
The tb->cflags field is not passed to tcg generation functions. So
we add a field to TCGContext, storing there a copy of tb->cflags.
Most architectures have <= 32 registers, which results in a 4-byte hole
in TCGContext. Use this hole for the new field.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This will enable us to decouple code translation from the value
of parallel_cpus at any given time. It will also help us minimize
TB flushes when generating code via EXCP_ATOMIC.
Note that the declaration of parallel_cpus is brought to exec-all.h
to be able to define there the "curr_cflags" inline.
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Using the offset of a temporary, relative to TCGContext, rather than
its index means that we don't use 0. That leaves offset 0 free for
a NULL representation without having to leave index 0 unused.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
When we used structures for TCGv_*, we needed a macro in order to
perform a comparison. Now that we use pointers, this is just clutter.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
The GET and MAKE functions weren't really specific enough.
We now have a full complement of functions that convert exactly
between temporaries, arguments, tcgv pointers, and indices.
The target/sparc change is also a bug fix, which would have affected
a host that defines TCG_TARGET_HAS_extr[lh]_i64_i32, i.e. MIPS64.
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Transform TCGv_* to an "argument" or a temporary.
For now, an argument is simply the temporary index.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
While we're touching many of the lines anyway, adjust the naming
of the functions to better distinguish when "TCGArg" vs "TCGTemp"
should be used.
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Copy s->nb_globals or s->nb_temps to a local variable for the purposes
of iteration. This should allow the compiler to use low-overhead
looping constructs on some hosts.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
This avoids having to allocate external memory for each temporary.
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
At the same time, drop the TCGContext argument and use tcg_ctx instead.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
This avoids needing to test the index of a temp against nb_globals.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Rather than have a separate buffer of 10*max_ops entries,
give each opcode 10 entries. The result is actually a bit
smaller and should have slightly more cache locality.
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Will come in handy very soon.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Groundwork for supporting multiple TCG contexts.
The hash table becomes read-only after it is filled in,
so we can save space by keeping just a global pointer to it.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Groundwork for supporting multiple TCG contexts.
Compile-tested for all targets on an x86_64 host.
Suggested-by: Richard Henderson <rth@twiddle.net>
Acked-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
In preparation for adding tc.size to be able to keep track of
TB's using the binary search tree implementation from glib.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
It is unlikely that we will ever want to call this helper passing
an argument other than the current PC. So just remove the argument,
and use the pc we already get from cpu_get_tb_cpu_state.
This change paves the way to having a common "tb_lookup" function.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
It's not even clear what the interface REG and VAL32 were supposed to mean.
All uses had REG = 0 and VAL32 was the bitset assigned to the destination.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20170911213328.9701-4-f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This fixes building for ppc64 on ppc32 (changed in 5964fca8a1):
tcg/ppc/tcg-target.inc.c: In function 'tb_target_set_jmp_target':
include/qemu/compiler.h:86:30: error: static assertion failed: \
"not expecting: sizeof(*(uint64_t *)jmp_addr) > ATOMIC_REG_SIZE"
QEMU_BUILD_BUG_ON(sizeof(*ptr) > ATOMIC_REG_SIZE); \
^
tcg/ppc/tcg-target.inc.c:1377:9: note: in expansion of macro 'atomic_set'
atomic_set((uint64_t *)jmp_addr, pair);
^
Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20170911204936.5020-1-f4bug@amsat.org>
[rth: Added commentary requested by pmm.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
changed in 659ef5cbb8, this fixes building with --enable-tcg-interpreter:
/home/travis/build/qemu/qemu/tcg/tcg.c:116:14: error: ‘tcg_out_ldst_finalize’ used but never defined [-Werror]
static bool tcg_out_ldst_finalize(TCGContext *s);
^
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20170911022839.23231-1-f4bug@amsat.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
At this point the conversion is a wash. Loading of TB+ofs is
smaller, but the actual return address from exit_tb is larger.
There are a few more insns required to transition between TBs.
But the expectation is that accesses to the constant pool will
on the whole be smaller.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Move constants before all of the functions.
Move tcg_out_<format> functions before all
of the others. No functional change.
Signed-off-by: Richard Henderson <rth@twiddle.net>
We are not going to use ldrd for loading the comparator
for 32-bit guests, so don't limit cmp_off to 8 bits then.
This eliminates one insn in the tlb load for some guests.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Use UBFX to avoid limitation on CPU_TLB_BITS. Since we're dropping
the initial shift, we need to replace the page masking. We can use
MOVW+BIC to do this without shifting. The result is the same size
as the armv6 path with one less conditional instruction.
Signed-off-by: Richard Henderson <rth@twiddle.net>
We were passing in -2 instead of +2, but then ignoring
the actual contents of addend in the calculation.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Already it saves 2 bytes per call, but also the constant pool
entry may well be shared across multiple calls.
Signed-off-by: Richard Henderson <rth@twiddle.net>
A new shared header tcg-pool.inc.c adds new_pool_label,
for registering a tcg_target_ulong to be emitted after
the generated code, plus relocation data to install a
pointer to the data.
A new pointer is added to the TCGContext, so that we
dump the constant pool as data, not code.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Dispense with TCGBackendData, as it has never been used for more than
holding a single pointer. Use a define in the cpu/tcg-target.h to
signal requirement for TCGLabelQemuLdst, so that we can drop the no-op
tcg-be-null.h stubs. Rename tcg-be-ldst.h to tcg-ldst.inc.c.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Replace the USE_DIRECT_JUMP ifdef with a TCG_TARGET_HAS_direct_jump
boolean test. Replace the tb_set_jmp_target1 ifdef with an unconditional
function tb_target_set_jmp_target.
While we're touching all backends, add a parameter for tb->tc_ptr;
we're going to need it shortly for some backends.
Move tb_set_jmp_target and tb_add_jump from exec-all.h to cpu-exec.c.
This opens the possibility for TCG_TARGET_HAS_direct_jump to be
a runtime decision -- based on host cpu capabilities, the size of
code_gen_buffer, or a future debugging switch.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Missed being added as part of 71650df7b0.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
This allows LOAD HALFWORD IMMEDIATE ON CONDITION,
eliminating one insn in some common cases.
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
This allows using a 3-operand insn form for some arithmetic,
logicals and shifts.
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Use a switch instead of searching a table.
Acked-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Currently, we cannot use mttcg for running strong memory model guests
on weak memory model hosts due to missing ordering semantics.
We implicitly generate fence instructions for stronger guests if an
ordering mismatch is detected. We generate fences only for the orders
for which fence instructions are necessary, for example a fence is not
necessary between a store and a subsequent load on x86 since its
absence in the guest binary tells that ordering need not be
ensured. Also note that if we find multiple subsequent fence
instructions in the generated IR, we combine them in the TCG
optimization pass.
This patch allows us to boot an x86 guest on ARM64 hosts using mttcg.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Message-Id: <20170829063313.10237-4-bobby.prani@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
We threatened to remove ia64 as host in v2.9.0. Its time has now come.
There are still some usages of defined(__ia64__) throughout the source
code that would be triggered if one were to enable TCI on an ia64 host.
Leave those alone for now.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
For a 64-bit ILP32 host, aligning to sizeof(long) is not enough.
Guess the minimum for any host is 8, as that covers uint64_t.
Qemu doesn't use a host long double or host vectors, except in
extremely limited circumstances.
Fixes a bus error for a sparc v8plus host.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Patch 85aa80813d changed the IF emitting the TST instruction,
but failed to change the ?: converting CMP to CMPEQ, so the
result of the TST is ignored.
Signed-off-by: Richard Henderson <rth@twiddle.net>
With the move of some docs/ to docs/devel/ on ac06724a71,
a couple of references were not updated.
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Clang 3.9 passes the CONFIG_AVX2_OPT configure test. However, the
supplied <cpuid.h> does not contain the bit_AVX2 define that we use
when detecting whether the routine can be enabled.
Introduce a qemu-specific header that uses the compiler's definition
of __cpuid et al, but supplies any missing bit_* definitions needed.
This avoids introducing any extra ifdefs to util/bufferiszero.c, and
allows quite a few to be removed from tcg/i386/tcg-target.inc.c.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 20170719044018.18063-1-rth@twiddle.net
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reserve a register for the guest_base using ppc code for reference.
By doing so, we do not have to recompute it for every memory load.
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Message-Id: <1499677934-2249-1-git-send-email-jiang.biao2@zte.com.cn>
Every vCPU now uses a separate set of TBs for each set of dynamic
tracing event state values. Each set of TBs can be used by any number of
vCPUs to maximize TB reuse when vCPUs have the same tracing state.
This feature is later used by tracetool to optimize tracing of guest
code events.
The maximum number of TB sets is defined as 2^E, where E is the number
of events that have the 'vcpu' property (their state is stored in
CPUState->trace_dstate).
For this to work, a change on the dynamic tracing state of a vCPU will
force it to flush its virtual TB cache (which is only indexed by
address), and fall back to the physical TB cache (which now contains the
vCPU's dynamic tracing state as part of the hashing function).
Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-id: 149915775266.6295.10060144081246467690.stgit@frigg.lan
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
When running a helloworld program with qemu-i386 in linux-user
mode on Loongson 3A3000, it will crash. This patch fix the bug.
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Message-Id: <1499669979-25904-1-git-send-email-jiang.biao2@zte.com.cn>
Signed-off-by: Richard Henderson <rth@twiddle.net>
This patch enables the indirect jump path using an LDR (literal)
instruction. It will be interesting to test and see which performs
better among the two paths.
CC: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Message-Id: <20170630143614.31059-3-bobby.prani@gmail.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
We use ADRP+ADD to compute the target address for goto_tb. This patch
introduces the NOP instruction which is used to align the above
instruction pair so that we can use one atomic instruction to patch
the destination offsets.
CC: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Message-Id: <20170630143614.31059-2-bobby.prani@gmail.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
We can use a branch to register instruction for exit_tb for offsets
greater than 128MB.
CC: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Message-Id: <20170630143614.31059-1-bobby.prani@gmail.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
The new placement of the TB means that we can use one insn
to load the goto_tb destination directly from the TB.
Signed-off-by: Richard Henderson <rth@twiddle.net>
The new placement of the TB means that we can use one insn
to load the return value for exit_tb returning the TB pointer.
Tested-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Allocating an arbitrarily-sized array of tbs results in either
(a) a lot of memory wasted or (b) unnecessary flushes of the code
cache when we run out of TB structs in the array.
An obvious solution would be to just malloc a TB struct when needed,
and keep the TB array as an array of pointers (recall that tb_find_pc()
needs the TB array to run in O(log n)).
Perhaps a better solution, which is implemented in this patch, is to
allocate TB's right before the translated code they describe. This
results in some memory waste due to padding to have code and TBs in
separate cache lines--for instance, I measured 4.7% of padding in the
used portion of code_gen_buffer when booting aarch64 Linux on a
host with 64-byte cache lines. However, it can allow for optimizations
in some host architectures, since TCG backends could safely assume that
the TB and the corresponding translated code are very close to each
other in memory. See this message by rth for a detailed explanation:
https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg05172.html
Subject: Re: GSoC 2017 Proposal: TCG performance enhancements
Message-ID: <1e67644b-4b30-887e-d329-1848e94c9484@twiddle.net>
Suggested-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1496790745-314-3-git-send-email-cota@braap.org>
[rth: Simplify the arithmetic in tcg_tb_alloc]
Signed-off-by: Richard Henderson <rth@twiddle.net>
Add helpers to gather cache info from the host at init-time.
For now, only export the host's I/D cache line sizes, which we
will use to improve cache locality to avoid false sharing.
Suggested-by: Richard Henderson <rth@twiddle.net>
Suggested-by: Geert Martin Ijewski <gm.ijewski@web.de>
Tested-by: Geert Martin Ijewski <gm.ijewski@web.de>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1496794624-4083-1-git-send-email-cota@braap.org>
[rth: Move all implementations from tcg/ppc/]
Signed-off-by: Richard Henderson <rth@twiddle.net>
move tcg-runtime.c, translate-all.(ch) and translate-common.c into
accel/tcg/ subdirectory and updated related trace-events file.
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <1496383606-18060-4-git-send-email-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In theory this would re-enable usage of QEMU on an armv4 host.
Whether this is worthwhile is debatable -- we've been unconditionally
issuing the armv5t BX instruction in the prologue since 2011 without
complaint. Possibly we should simply require an armv6 host.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Instead of exporting goto_ptr directly to TCG frontends, export
tcg_gen_lookup_and_goto_ptr(), which calls goto_ptr with the pointer
returned by the lookup_tb_ptr() helper. This is the only use case
we have for goto_ptr and lookup_tb_ptr, so having this function is
very convenient. Furthermore, it trivially allows us to avoid calling
the lookup helper if goto_ptr is not implemented by the backend.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1493263764-18657-2-git-send-email-cota@braap.org>
Message-Id: <1493263764-18657-3-git-send-email-cota@braap.org>
Message-Id: <1493263764-18657-4-git-send-email-cota@braap.org>
Message-Id: <1493263764-18657-5-git-send-email-cota@braap.org>
[rth: Squashed 4 related commits.]
Signed-off-by: Richard Henderson <rth@twiddle.net>
Users of tcg_gen_atomic_cmpxchg and do_atomic_op rightfully utilize
the output. Even though this code is dead, it gets translated, and
without the initialization we encounter a tcg_error.
Reported-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Tested-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Tested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
We already require gcc 4.1 or newer (for the atomic
support), so the fallback codepaths for older gcc
versions than that are now dead code and we can
just delete them.
NB: clang reports itself as gcc 4.2 (regardless of
clang version), so clang won't be using the fallbacks
either.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
The C store helper functions take the address argument as a
target_ulong type; if this is 32 bit but the host is 64 bit
then the SPARC calling convention requires that the caller
must zero extend the value. We weren't doing this, which
meant we could pass values to the caller with high bits set
and QEMU would crash if it was compiled with optimizations.
In particular, the i386 BIOS would not start.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1490871151-29029-3-git-send-email-peter.maydell@linaro.org
Reviewed-by: Richard Henderson <rth@twiddle.net>
The C store helper functions take the data argument as a uint8_t,
uint16_t, etc depending on the store size. The SPARC calling
convention requires that data types smaller than the register
size must be extended by the caller. We weren't doing this,
which meant that if QEMU was compiled with optimizations enabled
we could end up storing incorrect values to guest memory.
(In particular the i386 guest BIOS would crash on startup.)
Add code to the trampolines that call the store helpers to
do the zero extension as required.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-id: 1490871151-29029-2-git-send-email-peter.maydell@linaro.org
Reviewed-by: Richard Henderson <rth@twiddle.net>
Merge the original development branch due to breakage caused by the
MTTCG merge.
Conflicts:
cpu-exec.c
translate-common.c
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To fix the following warnings:
In file included from /users/pranith/qemu/tcg/tcg.c:255:
/users/pranith/qemu/tcg/aarch64/tcg-target.inc.c:879:24: warning: implicit conversion from enumeration type 'TCGMemOp' (aka 'enum TCGMemOp') to different enumeration type 'TCGType' (aka 'enum TCGType')
[-Wenum-conversion]
tcg_out_cmp(s, ext, a, b, b_const);
~~~~~~~~~~~ ^~~
/users/pranith/qemu/tcg/aarch64/tcg-target.inc.c:893:36: warning: implicit conversion from enumeration type 'TCGMemOp' (aka 'enum TCGMemOp') to different enumeration type 'TCGType' (aka 'enum TCGType')
[-Wenum-conversion]
tcg_out_insn(s, 3201, CBZ, ext, a, offset);
~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
/users/pranith/qemu/tcg/aarch64/tcg-target.inc.c:389:65: note: expanded from macro 'tcg_out_insn'
glue(tcg_out_insn_,FMT)(S, glue(glue(glue(I,FMT),_),OP), ## __VA_ARGS__)
^
/users/pranith/qemu/tcg/aarch64/tcg-target.inc.c:895:37: warning: implicit conversion from enumeration type 'TCGMemOp' (aka 'enum TCGMemOp') to different enumeration type 'TCGType' (aka 'enum TCGType')
[-Wenum-conversion]
tcg_out_insn(s, 3201, CBNZ, ext, a, offset);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
/users/pranith/qemu/tcg/aarch64/tcg-target.inc.c:389:65: note: expanded from macro 'tcg_out_insn'
glue(tcg_out_insn_,FMT)(S, glue(glue(glue(I,FMT),_),OP), ## __VA_ARGS__)
^
/users/pranith/qemu/tcg/aarch64/tcg-target.inc.c:1610:27: warning: implicit conversion from enumeration type 'TCGType' (aka 'enum TCGType') to different enumeration type 'TCGMemOp' (aka 'enum TCGMemOp')
[-Wenum-conversion]
tcg_out_brcond(s, ext, a2, a0, a1, const_args[1], arg_label(args[3]));
~~~~~~~~~~~~~~ ^~~
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Message-Id: <20170217154311.13920-1-bobby.prani@gmail.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
This enables the multi-threaded system emulation by default for ARMv7
and ARMv8 guests using the x86_64 TCG backend. This is because on the
guest side:
- The ARM translate.c/translate-64.c have been converted to
- use MTTCG safe atomic primitives
- emit the appropriate barrier ops
- The ARM machine has been updated to
- hold the BQL when modifying shared cross-vCPU state
- defer powerctl changes to async safe work
All the host backends support the barrier and atomic primitives but
need to provide same-or-better support for normal load/store
operations.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Acked-by: Peter Maydell <peter.maydell@linaro.org>
Tested-by: Pranith Kumar <bobby.prani@gmail.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
We know there will be cases where MTTCG won't work until additional work
is done in the front/back ends to support. It will however be useful to
be able to turn it on.
As a result MTTCG will default to off unless the combination is
supported. However the user can turn it on for the sake of testing.
Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
[AJB: move to -accel tcg,thread=multi|single, defaults]
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
We'll be using the memory ordering definitions to define values for
both the host and guest. To avoid fighting with circular header
dependencies just move these types into their own minimal header.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
The icount interrupt flag and tcg_exit_req serve almost the same
purpose, let's make them completely the same.
The former TB_EXIT_REQUESTED and TB_EXIT_ICOUNT_EXPIRED cases are
unified, since we can distinguish them from the value of the
interrupt flag.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
tb_jmp_insn_offset and tb_jmp_reset_offset are pointers
and cannot be used with ARRAY_SIZE.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Message-id: 20170202195601.11286-1-sw@weilnetz.de
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
This reverts commit 4ac7691073.
This fixes
http://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03062.html
While I think we could get away with relying on the undocumented
behaviour, the tcg constraint system isn't powerful enough to
properly describe the required (non-)overlap conditions.
Reported-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
There were some patterns, like 0x0000_ffff_ffff_00ff, for which we
would select to begin a multi-insn sequence with MOVN, but would
fail to set the 0x0000 lane back from 0xffff.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Message-Id: <20161207180727.6286-3-rth@twiddle.net>
When al == xzr, we cannot use addi/subi because that encodes xsp.
Force a zero into the temp register for that (rare) case.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Message-Id: <20161207180727.6286-2-rth@twiddle.net>
The number of actual invocations of ctpop itself does not warrent
an opcode, but it is very helpful for POWER7 to use in generating
an expansion for ctz.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
The number of actual invocations does not warrent an opcode,
and the backends generating it. But at least we can eliminate
redundant helpers.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
The ISA manual documents the output is undefined if the input was zero.
However, we document in target-i386 that the behavior of real silicon
is to preserve the contents of the output register. We also mention
that there are real applications that depend on this. That this is
baked into silicon is mentioned as a potential cause for some false
sharing behaviour wrt lzcnt/tzcnt.
Taking advantage of this allows us to save 2 insns in the normal case,
and 4 insns for i686 emulating a 64-bit clz.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Previously we could not have different constraints for different ISA levels,
which prevented us from eliding the matching constraint for shifts.
We do now have to make sure that the operands match for constant shifts.
We can also handle some small left shifts via lea.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Use a switch instead of searching a table. Share constraints between
32-bit and 64-bit, when at all possible.
Signed-off-by: Richard Henderson <rth@twiddle.net>