The primary motivation is to remove a dozen insns along
the fast-path in tb_lookup. As a byproduct, this allows
us to completely remove parallel_cpus.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Lets make sure all the flags we compare when looking up blocks are
together in the same place.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20210224165811.11567-5-alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
We don't really deal in cf_mask most of the time. The one time it's
relevant is when we want to remove an invalidated TB from the QHT
lookup. Everywhere else we should be looking up things without
CF_INVALID set.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20210224165811.11567-4-alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
There is nothing special about this compile flag that doesn't mean we
can't just compute it with curr_cflags() which we should be using when
building a new set.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20210224165811.11567-3-alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
When icount is enabled and we recompile an MMIO access we end up
double counting the instruction execution. To avoid this we introduce
the CF_MEMI cflag which only allows memory instrumentation for the
next TB (which won't yet have been counted). As this is part of the
hashed compile flags we will only execute the generated TB while
coming out of a cpu_io_recompile.
While we are at it delete the old TODO. We might as well keep the
translation handy as it's likely you will repeatedly hit it on each
MMIO access.
Reported-by: Aaron Lindsay <aaron@os.amperecomputing.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Aaron Lindsay <aaron@os.amperecomputing.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210213130325.14781-21-alex.bennee@linaro.org>
Now we no longer generate CF_NOCACHE blocks we can remove a bunch of
the special case handling for them. While we are at it we can remove
the unused tb->orig_tb field and save a few bytes on the TB structure.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210213130325.14781-20-alex.bennee@linaro.org>
Use g2h_untagged in contexts that have no cpu, e.g. the binary
loaders that operate before the primary cpu is created. As a
colollary, target_mmap and friends must use untagged addresses,
since they are used by the loaders.
Use g2h_untagged on values returned from target_mmap, as the
kernel never applies a tag itself.
Use g2h_untagged on all pc values. The only current user of
tags, aarch64, removes tags from code addresses upon branch,
so "pc" is always untagged.
Use g2h with the cpu context on hand wherever possible.
Use g2h_untagged in lock_user, which will be updated soon.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20210212184902.1251044-13-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Each thread must have its own pc, even under TCI.
Remove the GETPC ifdef, because GETPC is always available for
helpers, and thus is always required. Move the assignment
under INDEX_op_call, because the value is only visible when
we make a call to a helper function.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210204014509.882821-6-richard.henderson@linaro.org>
As cpu_io_recompile() is only called within TCG accelerator
in cputlb.c, declare it locally.
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20210117164813.4101761-6-f4bug@amsat.org>
[rth: Adjust vs changed tb_flush_jmp_cache patch.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
tb_gen_code() is only called within TCG accelerator, declare it locally.
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20210117164813.4101761-4-f4bug@amsat.org>
[rth: Adjust vs changed tb_flush_jmp_cache patch.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Move and make the function static, as the only users
are here in cputlb.c.
Suggested-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
cpu_gen_init() is TCG specific, only used in tcg/translate-all.c.
No need to export it to other accelerators, declare it statically.
Reviewed-by: Claudio Fontana <cfontana@suse.de>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20210117164813.4101761-2-f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Add two helper functions, using a global variable to hold
the displacement. The displacement is currently always 0,
so no change in behaviour.
Begin using the functions in tcg common code only.
Reviewed-by: Joelle van Dyne <j@getutm.app>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
There is no "version 2" of the "Lesser" General Public License.
It is either "GPL version 2.0" or "Lesser GPL version 2.1".
This patch replaces all occurrences of "Lesser GPL version 2" with
"Lesser GPL version 2.1" in comment section.
Signed-off-by: Chetan Pant <chetan4windows@gmail.com>
Message-Id: <20201023123353.19796-1-chetan4windows@gmail.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
On ARM, the Top Byte Ignore feature means that only 56 bits of
the address are significant in the virtual address. We are
required to give the entire 64-bit address to FAR_ELx on fault,
which means that we do not "clean" the top byte early in TCG.
This new interface allows us to flush all 256 possible aliases
for a given page, currently missed by tlb_flush_page*.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 20201016210754.818257-2-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
refactoring of cpus.c continues with cpu timer state extraction.
cpu-timers: responsible for the softmmu cpu timers state,
including cpu clocks and ticks.
icount: counts the TCG instructions executed. As such it is specific to
the TCG accelerator. Therefore, it is built only under CONFIG_TCG.
One complication is due to qtest, which uses an icount field to warp time
as part of qtest (qtest_clock_warp).
In order to solve this problem, provide a separate counter for qtest.
This requires fixing assumptions scattered in the code that
qtest_enabled() implies icount_enabled(), checking each specific case.
Signed-off-by: Claudio Fontana <cfontana@suse.de>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
[remove redundant initialization with qemu_spice_init]
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
[fix lingering calls to icount_get]
Signed-off-by: Claudio Fontana <cfontana@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
clang's C11 atomic_fetch_*() functions only take a C11 atomic type
pointer argument. QEMU uses direct types (int, etc) and this causes a
compiler error when a QEMU code calls these functions in a source file
that also included <stdatomic.h> via a system header file:
$ CC=clang CXX=clang++ ./configure ... && make
../util/async.c:79:17: error: address argument to atomic operation must be a pointer to _Atomic type ('unsigned int *' invalid)
Avoid using atomic_*() names in QEMU's atomic.h since that namespace is
used by <stdatomic.h>. Prefix QEMU's APIs with 'q' so that atomic.h
and <stdatomic.h> can co-exist. I checked /usr/include on my machine and
searched GitHub for existing "qatomic_" users but there seem to be none.
This patch was generated using:
$ git grep -h -o '\<atomic\(64\)\?_[a-z0-9_]\+' include/qemu/atomic.h | \
sort -u >/tmp/changed_identifiers
$ for identifier in $(</tmp/changed_identifiers); do
sed -i "s%\<$identifier\>%q$identifier%g" \
$(git grep -I -l "\<$identifier\>")
done
I manually fixed line-wrap issues and misaligned rST tables.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200923105646.47864-1-stefanha@redhat.com>
I was after adding qemu_spin_destroy calls, but while at
it I noticed that we are leaking some memory.
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Robert Foley <robert.foley@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20200609200738.445-5-robert.foley@linaro.org>
Message-Id: <20200612190237.30436-8-alex.bennee@linaro.org>
This new interface will allow targets to probe for a page
and then handle watchpoints themselves. This will be most
useful for vector predicated memory operations, where one
page lookup can be used for many operations, and one test
can avoid many watchpoint checks.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20200508154359.7494-6-richard.henderson@linaro.org
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20200508154359.7494-4-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
This gets around a build problem with --disable-tcg.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
This will be used by plugins to get the host address
of instructions.
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
MVCL is interruptible and we should check for interrupts and process
them after writing back the variables to the registers. Let's check
for any exit requests and exit to the main loop. Introduce a new helper
function for that: cpu_loop_exit_requested().
When booting Fedora 30, I can see a handful of these exits and it seems
to work reliable. Also, Richard explained why this works correctly even
when MVCL is called via EXECUTE:
(1) TB with EXECUTE runs, at address Ae
- env->psw_addr stored with Ae.
- helper_ex() runs, memory address Am computed
from D2a(X2a,B2a) or from psw.addr+RI2.
- env->ex_value stored with memory value modified by R1a
(2) TB of executee runs,
- env->ex_value stored with 0.
- helper_mvcl() runs, using and updating R1b, R1b+1, R2b, R2b+1.
(3a) helper_mvcl() completes,
- TB of executee continues, psw.addr += ilen.
- Next instruction is the one following EXECUTE.
(3b) helper_mvcl() exits to main loop,
- cpu_loop_exit_restore() unwinds psw.addr = Ae.
- Next instruction is the EXECUTE itself...
- goto 1.
As the PoP mentiones that an interruptible instruction called via EXECUTE
should avoid modifying storage/registers that are used by EXECUTE itself,
it is fine to retrigger EXECUTE.
Cc: Alex Bennée <alex.bennee@linaro.org>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
There is only one caller, tlb_set_page_with_attrs. We cannot
inline the entire function because the AddressSpaceDispatch
structure is private to exec.c, and cannot easily be moved to
include/exec/memory-internal.h.
Compute is_ram and is_romd once within tlb_set_page_with_attrs.
Fold the number of tests against these predicates. Compute
cpu_physical_memory_is_clean outside of the tlb lock region.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Let's also allow to probe other access types.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20190830100959.26615-3-david@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
... similar to tlb_vaddr_to_host(); however, allow access to the host
page except when TLB_NOTDIRTY or TLB_MMIO is set.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20190830100959.26615-2-david@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Factor it out into common code. Similar to the !CONFIG_USER_ONLY variant,
let's not allow to cross page boundaries.
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20190826075112.25637-4-david@redhat.com>
[rth: Move cpu & cc variables inside if block.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Back in 2016, we discussed[1] rules for headers, and these were
generally liked:
1. Have a carefully curated header that's included everywhere first. We
got that already thanks to Peter: osdep.h.
2. Headers should normally include everything they need beyond osdep.h.
If exceptions are needed for some reason, they must be documented in
the header. If all that's needed from a header is typedefs, put
those into qemu/typedefs.h instead of including the header.
3. Cyclic inclusion is forbidden.
This patch gets include/ closer to obeying 2.
It's actually extracted from my "[RFC] Baby steps towards saner
headers" series[2], which demonstrates a possible path towards
checking 2 automatically. It passes the RFC test there.
[1] Message-ID: <87h9g8j57d.fsf@blackfin.pond.sub.org>
https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg03345.html
[2] Message-Id: <20190711122827.18970-1-armbru@redhat.com>
https://lists.nongnu.org/archive/html/qemu-devel/2019-07/msg02715.html
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-Id: <20190812052359.30071-2-armbru@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
No header includes qemu-common.h after this commit, as prescribed by
qemu-common.h's file comment.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190523143508.25387-5-armbru@redhat.com>
[Rebased with conflicts resolved automatically, except for
include/hw/arm/xlnx-zynqmp.h hw/arm/nrf51_soc.c hw/arm/msf2-soc.c
block/qcow2-refcount.c block/qcow2-cluster.c block/qcow2-cache.c
target/arm/cpu.h target/lm32/cpu.h target/m68k/cpu.h target/mips/cpu.h
target/moxie/cpu.h target/nios2/cpu.h target/openrisc/cpu.h
target/riscv/cpu.h target/tilegx/cpu.h target/tricore/cpu.h
target/unicore32/cpu.h target/xtensa/cpu.h; bsd-user/main.c and
net/tap-bsd.c fixed up]
We can now use the CPUClass hook instead of a named function.
Create a static tlb_fill function to avoid other changes within
cputlb.c. This also isolates the asserts within. Remove the
named tlb_fill function from all of the targets.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
In order to handle TB's that translate to too much code, we
need to place the control of the length of the translation
in the hands of the code gen master loop.
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Include the cluster number in the hash we use to look
up TBs. This is important because a TB that is valid
for one cluster at a given physical address and set
of CPU flags is not necessarily valid for another:
the two clusters may have different views of physical
memory, or may have different CPU features (eg FPU
present or absent).
We put the cluster number in the high 8 bits of the
TB cflags. This gives us up to 256 clusters, which should
be enough for anybody. If we ever need more, or need
more bits in cflags for other purposes, we could make
tb_hash_func() take more data (and expand qemu_xxhash7()
to qemu_xxhash8()).
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Message-id: 20190121152218.9592-4-peter.maydell@linaro.org
Paves the way for the addition of a per-TLB lock.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <20181009174557.16125-4-cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
We set up TLB entries in tlb_set_page_with_attrs(), where we have
some logic for determining whether the TLB entry is considered
to be RAM-backed, and thus has a valid addend field. When we
look at the TLB entry in get_page_addr_code(), we use different
logic for determining whether to treat the page as RAM-backed
and use the addend field. This is confusing, and in fact buggy,
because the code in tlb_set_page_with_attrs() correctly decides
that rom_device memory regions not in romd mode are not RAM-backed,
but the code in get_page_addr_code() thinks they are RAM-backed.
This typically results in "Bad ram pointer" assertion if the
guest tries to execute from such a memory region.
Fix this by making get_page_addr_code() just look at the
TLB_MMIO bit in the code_address field of the TLB, which
tlb_set_page_with_attrs() sets if and only if the addend
field is not valid for code execution.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-id: 20180713150945.12348-1-peter.maydell@linaro.org
There is no need for a stub, since tb_invalidate_phys_addr can be excised
altogether when TCG is disabled. This is a bit cleaner since it avoids
using code that is clearly specific to user-mode emulation (it calls
mmap_lock/unlock) for the !CONFIG_TCG case.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix the --disable-tcg breakage introduced by 8bca9a03ec:
$ configure --disable-tcg
[...]
$ make -C i386-softmmu exec.o
make: Entering directory 'i386-softmmu'
CC exec.o
In file included from source/qemu/exec.c:62:0:
source/qemu/include/exec/ram_addr.h:96:6: error: conflicting types for ‘tb_invalidate_phys_range’
void tb_invalidate_phys_range(ram_addr_t start, ram_addr_t end);
^~~~~~~~~~~~~~~~~~~~~~~~
In file included from source/qemu/exec.c:24:0:
source/qemu/include/exec/exec-all.h:309:6: note: previous declaration of ‘tb_invalidate_phys_range’ was here
void tb_invalidate_phys_range(target_ulong start, target_ulong end);
^~~~~~~~~~~~~~~~~~~~~~~~
source/qemu/exec.c:1043:6: error: conflicting types for ‘tb_invalidate_phys_addr’
void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr, MemTxAttrs attrs)
^~~~~~~~~~~~~~~~~~~~~~~
In file included from source/qemu/exec.c:24:0:
source/qemu/include/exec/exec-all.h:308:6: note: previous declaration of ‘tb_invalidate_phys_addr’ was here
void tb_invalidate_phys_addr(target_ulong addr);
^~~~~~~~~~~~~~~~~~~~~~~
make: *** [source/qemu/rules.mak:69: exec.o] Error 1
make: Leaving directory 'i386-softmmu'
Tested to build x86_64-softmmu and i386-softmmu targets.
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-id: 20180629200710.27626-1-f4bug@amsat.org
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Place them in exec.c, exec-all.h and ram_addr.h. This removes
knowledge of translate-all.h (which is an internal header) from
several files outside accel/tcg and removes knowledge of
AddressSpace from translate-all.c (as it only operates on ram_addr_t).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use mmap_lock in user-mode to protect TCG state and the page descriptors.
In !user-mode, each vCPU has its own TCG state, so no locks needed.
Per-page locks are used to protect the page descriptors.
Per-TB locks are used in both modes to protect TB jumps.
Some notes:
- tb_lock is removed from notdirty_mem_write by passing a
locked page_collection to tb_invalidate_phys_page_fast.
- tcg_tb_lookup/remove/insert/etc have their own internal lock(s),
so there is no need to further serialize access to them.
- do_tb_flush is run in a safe async context, meaning no other
vCPU threads are running. Therefore acquiring mmap_lock there
is just to please tools such as thread sanitizer.
- Not visible in the diff, but tb_invalidate_phys_page already
has an assert_memory_lock.
- cpu_io_recompile is !user-only, so no mmap_lock there.
- Added mmap_unlock()'s before all siglongjmp's that could
be called in user-mode while mmap_lock is held.
+ Added an assert for !have_mmap_lock() after returning from
the longjmp in cpu_exec, just like we do in cpu_exec_step_atomic.
Performance numbers before/after:
Host: AMD Opteron(tm) Processor 6376
ubuntu 17.04 ppc64 bootup+shutdown time
700 +-+--+----+------+------------+-----------+------------*--+-+
| + + + + + *B |
| before ***B*** ** * |
|tb lock removal ###D### *** |
600 +-+ *** +-+
| ** # |
| *B* #D |
| *** * ## |
500 +-+ *** ### +-+
| * *** ### |
| *B* # ## |
| ** * #D# |
400 +-+ ** ## +-+
| ** ### |
| ** ## |
| ** # ## |
300 +-+ * B* #D# +-+
| B *** ### |
| * ** #### |
| * *** ### |
200 +-+ B *B #D# +-+
| #B* * ## # |
| #* ## |
| + D##D# + + + + |
100 +-+--+----+------+------------+-----------+------------+--+-+
1 8 16 Guest CPUs 48 64
png: https://imgur.com/HwmBHXe
debian jessie aarch64 bootup+shutdown time
90 +-+--+-----+-----+------------+------------+------------+--+-+
| + + + + + + |
| before ***B*** B |
80 +tb lock removal ###D### **D +-+
| **### |
| **## |
70 +-+ ** # +-+
| ** ## |
| ** # |
60 +-+ *B ## +-+
| ** ## |
| *** #D |
50 +-+ *** ## +-+
| * ** ### |
| **B* ### |
40 +-+ **** # ## +-+
| **** #D# |
| ***B** ### |
30 +-+ B***B** #### +-+
| B * * # ### |
| B ###D# |
20 +-+ D ##D## +-+
| D# |
| + + + + + + |
10 +-+--+-----+-----+------------+------------+------------+--+-+
1 8 16 Guest CPUs 48 64
png: https://imgur.com/iGpGFtv
The gains are high for 4-8 CPUs. Beyond that point, however, unrelated
lock contention significantly hurts scalability.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This applies to both user-mode and !user-mode emulation.
Instead of relying on a global lock, protect the list of incoming
jumps with tb->jmp_lock. This lock also protects tb->cflags,
so update all tb->cflags readers outside tb->jmp_lock to use
atomic reads via tb_cflags().
In order to find the destination TB (and therefore its jmp_lock)
from the origin TB, we introduce tb->jmp_dest[].
I considered not using a linked list of jumps, which simplifies
code and makes the struct smaller. However, it unnecessarily increases
memory usage, which results in a performance decrease. See for
instance these numbers booting+shutting down debian-arm:
Time (s) Rel. err (%) Abs. err (s) Rel. slowdown (%)
------------------------------------------------------------------------------
before 20.88 0.74 0.154512 0.
after 20.81 0.38 0.079078 -0.33524904
GTree 21.02 0.28 0.058856 0.67049808
GHashTable + xxhash 21.63 1.08 0.233604 3.5919540
Using a hash table or a binary tree to keep track of the jumps
doesn't really pay off, not only due to the increased memory usage,
but also because most TBs have only 0 or 1 jumps to them. The maximum
number of jumps when booting debian-arm that I measured is 35, but
as we can see in the histogram below a TB with that many incoming jumps
is extremely rare; the average TB has 0.80 incoming jumps.
n_jumps: 379208; avg jumps/tb: 0.801099
dist: [0.0,1.0)|▄█▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁ ▁▁▁ ▁▁▁ ▁|[34.0,35.0]
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
The appended adds assertions to make sure we do not longjmp with page
locks held. Note that user-mode has nothing to check, since page_locks
are !user-mode only.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Groundwork for supporting parallel TCG generation.
Instead of using a global lock (tb_lock) to protect changes
to pages, use fine-grained, per-page locks in !user-mode.
User-mode stays with mmap_lock.
Sometimes changes need to happen atomically on more than one
page (e.g. when a TB that spans across two pages is
added/invalidated, or when a range of pages is invalidated).
We therefore introduce struct page_collection, which helps
us keep track of a set of pages that have been locked in
the appropriate locking order (i.e. by ascending page index).
This commit first introduces the structs and the function helpers,
to then convert the calling code to use per-page locking. Note
that tb_lock is not removed yet.
While at it, rename tb_alloc_page to tb_page_add, which pairs with
tb_page_remove.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This commit does several things, but to avoid churn I merged them all
into the same commit. To wit:
- Use uintptr_t instead of TranslationBlock * for the list of TBs in a page.
Just like we did in (c37e6d7e "tcg: Use uintptr_t type for
jmp_list_{next|first} fields of TB"), the rationale is the same: these
are tagged pointers, not pointers. So use a more appropriate type.
- Only check the least significant bit of the tagged pointers. Masking
with 3/~3 is unnecessary and confusing.
- Introduce the TB_FOR_EACH_TAGGED macro, and use it to define
PAGE_FOR_EACH_TB, which improves readability. Note that
TB_FOR_EACH_TAGGED will gain another user in a subsequent patch.
- Update tb_page_remove to use PAGE_FOR_EACH_TB. In case there
is a bug and we attempt to remove a TB that is not in the list, instead
of segfaulting (since the list is NULL-terminated) we will reach
g_assert_not_reached().
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This paves the way for enabling scalable parallel generation of TCG code.
Instead of tracking TBs with a single binary search tree (BST), use a
BST for each TCG region, protecting it with a lock. This is as scalable
as it gets, since each TCG thread operates on a separate region.
The core of this change is the introduction of struct tcg_region_tree,
which contains a pointer to a GTree and an associated lock to serialize
accesses to it. We then allocate an array of tcg_region_tree's, adding
the appropriate padding to avoid false sharing based on
qemu_dcache_linesize.
Given a tc_ptr, we first find the corresponding region_tree. This
is done by special-casing the first and last regions first, since they
might be of size != region.size; otherwise we just divide the offset
by region.stride. I was worried about this division (several dozen
cycles of latency), but profiling shows that this is not a fast path.
Note that region.stride is not required to be a power of two; it
is only required to be a multiple of the host's page size.
Note that with this design we can also provide consistent snapshots
about all region trees at once; for instance, tcg_tb_foreach
acquires/releases all region_tree locks before/after iterating over them.
For this reason we now drop tb_lock in dump_exec_info().
As an alternative I considered implementing a concurrent BST, but this
can be tricky to get right, offers no consistent snapshots of the BST,
and performance and scalability-wise I don't think it could ever beat
having separate GTrees, given that our workload is insert-mostly (all
concurrent BST designs I've seen focus, understandably, on making
lookups fast, which comes at the expense of convoluted, non-wait-free
insertions/removals).
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Currently we don't support board configurations that put an IOMMU
in the path of the CPU's memory transactions, and instead just
assert() if the memory region fonud in address_space_translate_for_iotlb()
is an IOMMUMemoryRegion.
Remove this limitation by having the function handle IOMMUs.
This is mostly straightforward, but we must make sure we have
a notifier registered for every IOMMU that a transaction has
passed through, so that we can flush the TLB appropriately
when any of the IOMMUs change their mappings.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20180604152941.20374-5-peter.maydell@linaro.org
The API for cpu_transaction_failed() says that it takes the physical
address for the failed transaction. However we were actually passing
it the offset within the target MemoryRegion. We don't currently
have any target CPU implementations of this hook that require the
physical address; fix this bug so we don't get confused if we ever
do add one.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20180611125633.32755-3-peter.maydell@linaro.org
As part of plumbing MemTxAttrs down to the IOMMU translate method,
add MemTxAttrs as an argument to tb_invalidate_phys_addr().
Its callers either have an attrs value to hand, or don't care
and can use MEMTXATTRS_UNSPECIFIED.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20180521140402.23318-3-peter.maydell@linaro.org
In icount mode, instructions that access io memory spaces in the middle
of the translation block invoke TB recompilation. After recompilation,
such instructions become last in the TB and are allowed to access io
memory spaces.
When the code includes instruction like i386 'xchg eax, 0xffffd080'
which accesses APIC, QEMU goes into an infinite loop of the recompilation.
This instruction includes two memory accesses - one read and one write.
After the first access, APIC calls cpu_report_tpr_access, which restores
the CPU state to get the current eip. But cpu_restore_state_from_tb
resets the cpu->can_do_io flag which makes the second memory access invalid.
Therefore the second memory access causes a recompilation of the block.
Then these operations repeat again and again.
This patch moves resetting cpu->can_do_io flag from
cpu_restore_state_from_tb to cpu_loop_exit* functions.
It also adds a parameter for cpu_restore_state which controls restoring
icount. There is no need to restore icount when we only query CPU state
without breaking the TB. Restoring it in such cases leads to the
incorrect flow of the virtual time.
In most cases new parameter is true (icount should be recalculated).
But there are two cases in i386 and openrisc when the CPU state is only
queried without the need to break the TB. This patch fixes both of
these cases.
Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>
Message-Id: <20180409091320.12504.35329.stgit@pasha-VirtualBox>
[rth: Make can_do_io setting unconditional; move from cpu_exec;
make cpu_loop_exit_{noexc,restore} call cpu_loop_exit.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
The MC68040 MMU provides the size of the access that
triggers the page fault.
This size is set in the Special Status Word which
is written in the stack frame of the access fault
exception.
So we need the size in m68k_cpu_unassigned_access() and
m68k_cpu_handle_mmu_fault().
To be able to do that, this patch modifies the prototype of
handle_mmu_fault handler, tlb_fill() and probe_write().
do_unassigned_access() already includes a size parameter.
This patch also updates handle_mmu_fault handlers and
tlb_fill() of all targets (only parameter, no code change).
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20180118193846.24953-2-laurent@vivier.eu>
Normally we create an address space for that CPU and pass that address
space into the function. Let's just do it inside to unify address space
creations. It'll simplify my next patch to rename those address spaces.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20171123092333.16085-3-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We are still seeing signals during translation time when we walk over
a page protection boundary. This expands the check to ensure the host
PC is inside the code generation buffer. The original suggestion was
to check versus tcg_ctx.code_gen_ptr but as we now segment the
translation buffer we have to settle for just a general check for
being inside.
I've also fixed up the declaration to make it clear it can deal with
invalid addresses. A later patch will fix up the call sites.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20171108153245.20740-2-alex.bennee@linaro.org
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Tested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
We don't really free anything in this function anymore; we just remove
the TB from the binary search tree.
Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This is a prerequisite for supporting multiple TCG contexts, since
we will have threads generating code in separate regions of
code_gen_buffer.
For this we need a new field (.size) in struct tb_tc to keep
track of the size of the translated code. This field uses a size_t
to avoid adding a hole to the struct, although really an unsigned
int would have been enough.
The comparison function we use is optimized for the common case:
insertions. Profiling shows that upon booting debian-arm, 98%
of comparisons are between existing tb's (i.e. a->size and b->size
are both !0), which happens during insertions (and removals, but
those are rare). The remaining cases are lookups. From reading the glib
sources we see that the first key is always the lookup key. However,
the code does not assume this to always be the case because this
behaviour is not guaranteed in the glib docs. However, we embed
this knowledge in the code as a branch hint for the compiler.
Note that tb_free does not free space in the code_gen_buffer anymore,
since we cannot easily know whether the tb is the last one inserted
in code_gen_buffer. The next patch in this series renames tb_free
to tb_remove to reflect this.
Performance-wise, lookups in tb_find_pc are the same as before:
O(log n). However, insertions are O(log n) instead of O(1), which
results in a small slowdown when booting debian-arm:
Performance counter stats for 'build/arm-softmmu/qemu-system-arm \
-machine type=virt -nographic -smp 1 -m 4096 \
-netdev user,id=unet,hostfwd=tcp::2222-:22 \
-device virtio-net-device,netdev=unet \
-drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \
-device virtio-blk-device,drive=myblock \
-kernel img/arm/aarch32-current-linux-kernel-only.img \
-append console=ttyAMA0 root=/dev/vda1 \
-name arm,debug-threads=on -smp 1' (10 runs):
- Before:
8048.598422 task-clock (msec) # 0.931 CPUs utilized ( +- 0.28% )
16,974 context-switches # 0.002 M/sec ( +- 0.12% )
0 cpu-migrations # 0.000 K/sec
10,125 page-faults # 0.001 M/sec ( +- 1.23% )
35,144,901,879 cycles # 4.367 GHz ( +- 0.14% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
65,758,252,643 instructions # 1.87 insns per cycle ( +- 0.33% )
10,871,298,668 branches # 1350.707 M/sec ( +- 0.41% )
192,322,212 branch-misses # 1.77% of all branches ( +- 0.32% )
8.640869419 seconds time elapsed ( +- 0.57% )
- After:
8146.242027 task-clock (msec) # 0.923 CPUs utilized ( +- 1.23% )
17,016 context-switches # 0.002 M/sec ( +- 0.40% )
0 cpu-migrations # 0.000 K/sec
18,769 page-faults # 0.002 M/sec ( +- 0.45% )
35,660,956,120 cycles # 4.378 GHz ( +- 1.22% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
65,095,366,607 instructions # 1.83 insns per cycle ( +- 1.73% )
10,803,480,261 branches # 1326.192 M/sec ( +- 1.95% )
195,601,289 branch-misses # 1.81% of all branches ( +- 0.39% )
8.828660235 seconds time elapsed ( +- 0.38% )
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Now that we have curr_cflags, we can include CF_USE_ICOUNT
early and then remove it as necessary.
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
These flags are used by target/*/translate.c,
and affect code generation.
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This will enable us to decouple code translation from the value
of parallel_cpus at any given time. It will also help us minimize
TB flushes when generating code via EXCP_ATOMIC.
Note that the declaration of parallel_cpus is brought to exec-all.h
to be able to define there the "curr_cflags" inline.
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
In preparation for adding tc.size to be able to keep track of
TB's using the binary search tree implementation from glib.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
And fix the following warning when DEBUG_TB_INVALIDATE is enabled
in translate-all.c:
CC mipsn32-linux-user/accel/tcg/translate-all.o
/data/src/qemu/accel/tcg/translate-all.c: In function ‘tb_alloc_page’:
/data/src/qemu/accel/tcg/translate-all.c:1201:16: error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘tb_page_addr_t {aka unsigned int}’ [-Werror=format=]
printf("protecting code page: 0x" TARGET_FMT_lx "\n",
^
cc1: all warnings being treated as errors
/data/src/qemu/rules.mak:66: recipe for target 'accel/tcg/translate-all.o' failed
make[1]: *** [accel/tcg/translate-all.o] Error 1
Makefile:328: recipe for target 'subdir-mipsn32-linux-user' failed
make: *** [subdir-mipsn32-linux-user] Error 2
cota@flamenco:/data/src/qemu/build ((18f3fe1...) *$)$
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This gets rid of a hole in struct TranslationBlock.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Replace the USE_DIRECT_JUMP ifdef with a TCG_TARGET_HAS_direct_jump
boolean test. Replace the tb_set_jmp_target1 ifdef with an unconditional
function tb_target_set_jmp_target.
While we're touching all backends, add a parameter for tb->tc_ptr;
we're going to need it shortly for some backends.
Move tb_set_jmp_target and tb_add_jump from exec-all.h to cpu-exec.c.
This opens the possibility for TCG_TARGET_HAS_direct_jump to be
a runtime decision -- based on host cpu capabilities, the size of
code_gen_buffer, or a future debugging switch.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Used later. An enum makes expected values explicit and
bounds the value space of switches.
Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Message-Id: <150002049746.22386.2316077281615710615.stgit@frigg.lan>
Signed-off-by: Richard Henderson <rth@twiddle.net>
This will allow some amount of cleanup to happen before
switching the backends over to enum DisasJumpType.
Reviewed-by: Emilio G. Cota <cota@braap.org>
Reviewed-by: Lluís Vilanova <vilanova@ac.upc.edu>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Needed to implement a target-agnostic gen_intermediate_code()
in the future.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Benneé <alex.benee@linaro.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Message-Id: <150002025498.22386.18051908483085660588.stgit@frigg.lan>
Signed-off-by: Richard Henderson <rth@twiddle.net>
* new model of the ARM MPS2/MPS2+ FPGA based development board
* clean up DISAS_* exit conditions and fix various regressions
since commits e75449a3468a6b28c7b5 (in particular including
ones which broke OP-TEE guests)
* make Cortex-M3 and M4 correctly default to 8 PMSA regions
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJZbLEBAAoJEDwlJe0UNgzeTqsP/06M2a/rswUKjIGAsXv+TeTl
5N31g9E6Jr57HXK94Q0XtNkLlPwvIn97Dcv6VKg5+E8OgJx7ozldwZVFghWvMbOA
mbaikzgTRRUf6ydNTA4DtWYZPkaLNT86Vmb2T1GKS0nmw2ymd+hMLNk5vZd1jhDv
krHxwECI5e+u1INpw7erlQ2mqVP1NjvOuMNtjdAgtJ5tnjFRfQaVedePmr5qOuIK
xkYMKMNtled/KS0gP4TaSu5S012iYhzrpKISN/g4WHT/8kllr+iEowNAOJSJ6l38
oaBJJJCsLwnnV1nRClp4NNQv0Q/RXyIex5mPkeWERk4QU9adSDHnYJR7xn7JEyzV
l9o+av28bXA7l3C8BOi3ahSGh5cDu+hif0Biml/Kke7e4+1Lp3/QWSQ+p/E5PDDq
rhk65cg07PxSHeogN8hgu+RYN0gF3WBKASwUIDAkVdBsLlH8LVmoT5DtllL+6PyY
cwCp3nWeu0q2YDxGOfCZrUC4YJMl8hqHoWbdVah8vLKV/w/JVUtVEIol0za50dzG
ii6wOLqzV8GH0vkVa5x0InlH+t+/LtDRVkgHUT3/64eEEG+SsK/GmZeEtvcmp7GP
7Qx+Dd7hPgh+uis0XZPz37vqyCYhaFNw1+M9EESlQKUKfdY8B5B5bpXVDOBF+0Zl
daOoMw8xBd21DXNk9tCk
=gVxi
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20170717' into staging
target-arm queue:
* new model of the ARM MPS2/MPS2+ FPGA based development board
* clean up DISAS_* exit conditions and fix various regressions
since commits e75449a3468a6b28c7b5 (in particular including
ones which broke OP-TEE guests)
* make Cortex-M3 and M4 correctly default to 8 PMSA regions
# gpg: Signature made Mon 17 Jul 2017 13:43:45 BST
# gpg: using RSA key 0x3C2525ED14360CDE
# gpg: Good signature from "Peter Maydell <peter.maydell@linaro.org>"
# gpg: aka "Peter Maydell <pmaydell@gmail.com>"
# gpg: aka "Peter Maydell <pmaydell@chiark.greenend.org.uk>"
# Primary key fingerprint: E1A5 C593 CD41 9DE2 8E83 15CF 3C25 25ED 1436 0CDE
* remotes/pmaydell/tags/pull-target-arm-20170717:
MAINTAINERS: Add entries for MPS2 board
hw/arm/mps2: Add ethernet
hw/arm/mps2: Add SCC
hw/misc/mps2_scc: Implement MPS2 Serial Communication Controller
hw/arm/mps2: Add timers
hw/char/cmsdk-apb-timer: Implement CMSDK APB timer device
hw/arm/mps2: Add UARTs
hw/char/cmsdk-apb-uart.c: Implement CMSDK APB UART
hw/arm/mps2: Implement skeleton mps2-an385 and mps2-an511 board models
target/arm: use DISAS_EXIT for eret handling
target/arm: use gen_goto_tb for ISB handling
target/arm/translate: ensure gen_goto_tb sets exit flags
target/arm/translate.h: expand comment on DISAS_EXIT
target/arm/translate: make DISAS_UPDATE match declared semantics
include/exec/exec-all: document common exit conditions
target/arm: Make Cortex-M3 and M4 default to 8 PMSA regions
qdev: support properties which don't set a default value
qdev-properties.h: Explicitly set the default value for arraylen properties
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
As a precursor to later patches attempt to come up with a more
concrete wording for what each of the common exit cases would be.
CC: Emilio G. Cota <cota@braap.org>
CC: Richard Henderson <rth@twiddle.net>
CC: Lluís Vilanova <vilanova@ac.upc.edu>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Message-id: 20170713141928.25419-2-alex.bennee@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Every vCPU now uses a separate set of TBs for each set of dynamic
tracing event state values. Each set of TBs can be used by any number of
vCPUs to maximize TB reuse when vCPUs have the same tracing state.
This feature is later used by tracetool to optimize tracing of guest
code events.
The maximum number of TB sets is defined as 2^E, where E is the number
of events that have the 'vcpu' property (their state is stored in
CPUState->trace_dstate).
For this to work, a change on the dynamic tracing state of a vCPU will
force it to flush its virtual TB cache (which is only indexed by
address), and fall back to the physical TB cache (which now contains the
vCPU's dynamic tracing state as part of the hashing function).
Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-id: 149915775266.6295.10060144081246467690.stgit@frigg.lan
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Add CONFIG_TCG around TLB-related functions and structure declarations.
Some of these functions are defined in ./accel/tcg/cputlb.c, which will
not be linked in if TCG is disabled, and have no stubs; therefore, their
callers will also be compiled out for --disable-tcg.
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead of exporting goto_ptr directly to TCG frontends, export
tcg_gen_lookup_and_goto_ptr(), which calls goto_ptr with the pointer
returned by the lookup_tb_ptr() helper. This is the only use case
we have for goto_ptr and lookup_tb_ptr, so having this function is
very convenient. Furthermore, it trivially allows us to avoid calling
the lookup helper if goto_ptr is not implemented by the backend.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1493263764-18657-2-git-send-email-cota@braap.org>
Message-Id: <1493263764-18657-3-git-send-email-cota@braap.org>
Message-Id: <1493263764-18657-4-git-send-email-cota@braap.org>
Message-Id: <1493263764-18657-5-git-send-email-cota@braap.org>
[rth: Squashed 4 related commits.]
Signed-off-by: Richard Henderson <rth@twiddle.net>
This introduces support to the cputlb API for flushing all CPUs TLBs
with one call. This avoids the need for target helpers to iterate
through the vCPUs themselves.
An additional variant of the API (_synced) will cause the source vCPUs
work to be scheduled as "safe work". The result will be all the flush
operations will be complete by the time the originating vCPU executes
its safe work. The calling implementation can either end the TB
straight away (which will then pick up the cpu->exit_request on
entering the next block) or defer the exit until the architectural
sync point (usually a barrier instruction).
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
While the vargs approach was flexible the original MTTCG ended up
having munge the bits to a bitmap so the data could be used in
deferred work helpers. Instead of hiding that in cputlb we push the
change to the API to make it take a bitmap of MMU indexes instead.
For ARM some the resulting flushes end up being quite long so to aid
readability I've tended to move the index shifting to a new line so
all the bits being or-ed together line up nicely, for example:
tlb_flush_page_by_mmuidx(other_cs, pageaddr,
(1 << ARMMMUIdx_S1SE1) |
(1 << ARMMMUIdx_S1SE0));
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
[AT: SPARC parts only]
Reviewed-by: Artyom Tarasenko <atar4qemu@gmail.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
[PM: ARM parts only]
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Some architectures allow to flush the tlb of other VCPUs. This is not a problem
when we have only one thread for all VCPUs but it definitely needs to be an
asynchronous work when we are in true multithreaded work.
We take the tb_lock() when doing this to avoid racing with other threads
which may be invalidating TB's at the same time. The alternative would
be to use proper atomic primitives to clear the tlb entries en-mass.
This patch doesn't do anything to protect other cputlb function being
called in MTTCG mode making cross vCPU changes.
Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
[AJB: remove need for g_malloc on defer, make check fixes, tb_lock]
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
There are now only two uses of the global exit_request left.
The first ensures we exit the run_loop when we first start to process
pending work and in the kick handler. This is just as easily done by
setting the first_cpu->exit_request flag.
The second use is in the round robin kick routine. The global
exit_request ensured every vCPU would set its local exit_request and
cause a full exit of the loop. Now the iothread isn't being held while
running we can just rely on the kick handler to push us out as intended.
We lightly re-factor the main vCPU thread to ensure cpu->exit_requests
cause us to exit the main loop and process any IO requests that might
come along. As an cpu->exit_request may legitimately get squashed
while processing the EXCP_INTERRUPT exception we also check
cpu->queued_work_first to ensure queued work is expedited as soon as
possible.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
..and make the definition local to cpus. In preparation for MTTCG the
concept of a global tcg_current_cpu will no longer make sense. However
we still need to keep track of it in the single-threaded case to be able
to exit quickly when required.
qemu_cpu_kick_no_halt() moves and becomes qemu_cpu_kick_rr_cpu() to
emphasise its use-case. qemu_cpu_kick now kicks the relevant cpu as
well as qemu_kick_rr_cpu() which will become a no-op in MTTCG.
For the time being the setting of the global exit_request remains.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
When icount is active, tb_add_jump is surprisingly called with an
out of bounds basic block index. I have no idea how that can work,
but it does not seem like a good idea. Clear *last_tb for all
TB_EXIT_ICOUNT_EXPIRED cases, even when all you have to do is
refill icount_extra.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We have never has the concept of global TLB entries which would avoid
the flush so we never actually use this flag. Drop it and make clear
that tlb_flush is the sledge-hammer it has always been.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
[DG: ppc portions]
Acked-by: David Gibson <david@gibson.dropbear.id.au>
softmmu requires more functions to be thread-safe, because translation
blocks can be invalidated from e.g. notdirty callbacks. Probably the
same holds for user-mode emulation, it's just that no one has ever
tried to produce a coherent locking there.
This patch will guide the introduction of more tb_lock and tb_unlock
calls for system emulation.
Note that after this patch some (most) of the mentioned functions are
still called outside tb_lock/tb_unlock. The next one will rectify this.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Message-Id: <20161027151030.20863-7-alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This adds asserts to check the locking on the various translation
engines structures. There are two sets of structures that are protected
by locks.
The first the l1map and PageDesc structures used to track which
translation blocks are associated with which physical addresses. In
user-mode this is covered by the mmap_lock.
The second case are TB context related structures which are protected by
tb_lock which is also user-mode only.
Currently the asserts do nothing in SoftMMU mode but this will change
for MTTCG.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Message-Id: <20161027151030.20863-4-alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When we cannot emulate an atomic operation within a parallel
context, this exception allows us to stop the world and try
again in a serial context.
Reviewed-by: Emilio G. Cota <cota@braap.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Modify all CPUs to call it from XXX_cpu_realizefn() function.
Remove all the cannot_destroy_with_object_finalize_yet as
unsafe references have been moved to cpu_exec_realizefn().
(tested with QOM command provided by commit 4c315c27)
for arm:
Setting of cpu->mp_affinity is moved from arm_cpu_initfn()
to arm_cpu_realizefn() as setting of cpu_index is now done
in cpu_exec_realizefn(). To avoid to overwrite an user defined
value, we set it to an invalid value by default, and update
it in realize function only if the value is still invalid.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Add a mutex for the CPU list to system emulation, as it will be used to
manage safe work. Abstract manipulation of the CPU list in new functions
cpu_list_add and cpu_list_remove.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The return address argument to the softmmu template helpers was
confused. In the legacy case, we wanted to indicate that there
is no return address, and so passed in NULL. However, we then
immediately subtracted GETPC_ADJ from NULL, resulting in a non-zero
value, indicating the presence of an (invalid) return address.
Push the GETPC_ADJ subtraction down to the only point it's required:
immediately before use within cpu_restore_state_from_tb, after all
NULL pointer checks have been completed.
This makes GETPC and GETRA identical. Remove GETRA as the lesser
used macro, replacing all uses with GETPC.
Signed-off-by: Richard Henderson <rth@twiddle.net>
When invalidating a translation block, set an invalid flag into the
TranslationBlock structure first. It is also necessary to check whether
the target TB is still valid after acquiring 'tb_lock' but before calling
tb_add_jump() since TB lookup is to be performed out of 'tb_lock' in
future. Note that we don't have to check 'last_tb'; an already invalidated
TB will not be executed anyway and it is thus safe to patch it.
Suggested-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are functions tlb_fill(), cpu_unaligned_access() and
do_unaligned_access() that are called with access type and mmu index
arguments. But these arguments are named 'is_write' and 'is_user' in their
declarations. The patches fix the arguments to avoid a confusion.
Signed-off-by: Sergey Sorokin <afarallax@yandex.ru>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Message-id: 1465907177-1399402-1-git-send-email-afarallax@yandex.ru
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.
Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.
The appended fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to other alternatives considered,
namely:
- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
MRU is implemented here by adding an intermediate struct
that contains the u32 hash and a pointer to the TB; this
allows us, on an MRU promotion, to copy said struct (that is not
at the head), and put this new copy at the head. After a grace
period, the original non-head struct can be eliminated, and
after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
no MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
no MRU for lookups; MRU for inserts.
The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
http://imgur.com/a/Awgnq
Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.
Host: Intel Xeon E5-2690
28 ++------------+-------------+-------------+-------------+------------++
A***** + + + master **A*** +
27 ++ * xxhash ##B###++
| A******A****** xxhash-rcu $$C$$$ |
26 C$$ A******A****** qht-fixed-nomru*%%D%%%++
D%%$$ A******A******A*qht-dyn-mru A*E****A
25 ++ %%$$ qht-dyn-nomru &&F&&&++
B#####% |
24 ++ #C$$$$$ ++
| B### $ |
| ## C$$$$$$ |
23 ++ # C$$$$$$ ++
| B###### C$$$$$$ %%%D
22 ++ %B###### C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
| D%%%%%%B###### @E@@@@@@ %%%D%%%@@@E@@@@@@E
21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
+ E@@@ F&&& + E@ + F&&& + +
20 ++------------+-------------+-------------+-------------+------------++
14 16 18 20 22 24
log2 number of buckets
Host: Intel i7-4790K
14.5 ++------------+------------+-------------+------------+------------++
A** + + + master **A*** +
14 ++ ** xxhash ##B###++
13.5 ++ ** xxhash-rcu $$C$$$++
| qht-fixed-nomru %%D%%% |
13 ++ A****** qht-dyn-mru @@E@@@++
| A*****A******A****** qht-dyn-nomru &&F&&& |
12.5 C$$ A******A******A*****A****** ***A
12 ++ $$ A*** ++
D%%% $$ |
11.5 ++ %% ++
B### %C$$$$$$ |
11 ++ ## D%%%%% C$$$$$ ++
| # % C$$$$$$ |
10.5 F&&&&&&B######D%%%%% C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$ $$$C
10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
+ F&& D%%%%%%B######B######B#####B###@@@D%%% +
9.5 ++------------+------------+-------------+------------+------------++
14 16 18 20 22 24
log2 number of buckets
Note that the original point before this patch series is X=15 for "master";
the little sensitivity to the increased number of buckets is due to the
poor hashing function in master.
xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be do consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to really confirm that
there was indeed a failed lookup. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.
qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.
However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.
Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.
The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:
- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time
We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.
A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush; this makes sense because it is not guaranteed that the
table will reach the same number of TBs later on (e.g. most bootup code is
thrown away after boot); it makes sense to grow the hash table as
more code blocks are translated. This also avoids the complication of
having to build downsizing hysteresis logic into qht.
Reviewed-by: Sergey Fedorov <serge.fedorov@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1465412133-3029-15-git-send-email-cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
The function cpu_resume_from_signal() is now always called with a
NULL puc argument, and is rather misnamed since it is never called
from a signal handler. It is essentially forcing an exit to the
top level cpu loop but without raising any exception, so rename
it to cpu_loop_exit_noexc() and drop the useless unused argument.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Acked-by: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: Riku Voipio <riku.voipio@linaro.org>
Message-id: 1463494687-25947-4-git-send-email-peter.maydell@linaro.org
exec-all.h contains TCG-specific definitions. It is not needed outside
TCG-specific files such as translate.c, exec.c or *helper.c.
One generic function had snuck into include/exec/exec-all.h; move it to
include/qom/cpu.h.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TCG backends do not need most of exec-all.h; extract what they actually
need to a separate file or move it directly to tcg.h. The next patch
will stop including exec-all.h from everywhere.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'tb_invalidated_flag' was meant to catch two events:
* some TB has been invalidated by tb_phys_invalidate();
* the whole translation buffer has been flushed by tb_flush().
Then it was checked:
* in cpu_exec() to ensure that the last executed TB can be safely
linked to directly call the next one;
* in cpu_exec_nocache() to decide if the original TB should be provided
for further possible invalidation along with the temporarily
generated TB.
It is always safe to patch an invalidated TB since it is not going to be
used anyway. It is also safe to call tb_phys_invalidate() for an already
invalidated TB. Thus, setting this flag in tb_phys_invalidate() is
simply unnecessary. Moreover, it can prevent from pretty proper linking
of TBs, if any arbitrary TB has been invalidated. So just don't touch it
in tb_phys_invalidate().
If this flag is only used to catch whether tb_flush() has been called
then rename it to 'tb_flushed'. Declare it as 'bool' and stick to using
only 'true' and 'false' to set its value. Also, instead of setting it in
tb_gen_code(), just after tb_flush() has been called, do it right inside
of tb_flush().
In cpu_exec(), this flag is used to track if tb_flush() has been called
and have made 'next_tb' (a reference to the last executed TB) invalid
for linking it to directly call the next TB. tb_flush() can be called
during the CPU execution loop from tb_gen_code(), during TB execution or
by another thread while 'tb_lock' is released. Catch for translation
buffer flush reliably by resetting this flag once before first TB lookup
and each time we find it set before trying to add a direct jump. Don't
touch in in tb_find_physical().
Each vCPU has its own execution loop in multithreaded mode and thus
should have its own copy of the flag to be able to reset it with its own
'next_tb' and don't affect any other vCPU execution thread. So make this
flag per-vCPU and move it to CPUState.
In cpu_exec_nocache(), we only need to check if tb_flush() has been
called from tb_gen_code() called by cpu_exec_nocache() itself. To do
this reliably, preserve the old value of the flag, reset it before
calling tb_gen_code(), check afterwards, and combine the saved value
back to the flag.
This patch is based on the patch "tcg: move tb_invalidated_flag to
CPUState" from Paolo Bonzini <pbonzini@redhat.com>.
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
The check is to make sure that another thread hasn't already done the
same while we were outside of tb_lock. Mention this in a comment.
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
These fields do not contain pure pointers to a TranslationBlock
structure. So uintptr_t is the most appropriate type for them.
Also put some asserts to assure that the two least significant bits of
the pointer are always zero before assigning it to jmp_list_first.
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>