..
  Copyright (c) 2015-2020 Linaro Ltd.

  This work is licensed under the terms of the GNU GPL, version 2 or
  later. See the COPYING file in the top-level directory.

==================
Multi-threaded TCG
==================

This document outlines the design for multi-threaded TCG (a.k.a. MTTCG)
system-mode emulation. User-mode emulation has always mirrored the
thread structure of the translated executable, although some of the
changes done for MTTCG system emulation have improved the stability of
linux-user emulation.

The original system-mode TCG implementation was single-threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This is enabled by default for all front-end/back-end
(FE/BE) combinations where the host memory model is able to accommodate
the guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
guest has had the required work done to support this safely
(TARGET_SUPPORTS_MTTCG).

System emulation will fall back to the original round-robin approach
if:

* forced by --accel tcg,thread=single
* enabling --icount mode
* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
These are associated with looking up the next translation block to
execute. These include:

- tb_jmp_cache (per-vCPU, cache of recent jumps)
- tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page, this code
is critical to performance as looking up the next TB to execute is the
most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention to do it.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The fall-back
QHT-based hash table is also designed for lockless lookups. Locks
are only taken when code generation is required or TranslationBlocks
have their block-to-block jumps patched.

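The lock-free fast path described above can be pictured as a small
direct-mapped cache of pointers that is only ever read and written with
atomic accesses. This is a sketch under stated assumptions, using C11
atomics rather than QEMU's own atomic macros; struct tb, struct vcpu,
JMP_CACHE_SIZE and the jmp_cache_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_CACHE_SIZE 256  /* hypothetical; must be a power of two */

struct tb {                 /* stand-in for a TranslationBlock */
    uint64_t pc;
};

struct vcpu {
    _Atomic(struct tb *) jmp_cache[JMP_CACHE_SIZE];
};

static size_t jmp_cache_index(uint64_t pc)
{
    return (size_t)(pc >> 2) & (JMP_CACHE_SIZE - 1);
}

/* Fast path: one atomic load and no lock. Returns NULL on a miss. */
static struct tb *jmp_cache_lookup(struct vcpu *cpu, uint64_t pc)
{
    struct tb *tb = atomic_load_explicit(&cpu->jmp_cache[jmp_cache_index(pc)],
                                         memory_order_acquire);
    /* The slot may hold a different block that maps to the same index. */
    return (tb != NULL && tb->pc == pc) ? tb : NULL;
}

/* Called after the slower hash-table lookup (or translation) succeeds. */
static void jmp_cache_insert(struct vcpu *cpu, struct tb *tb)
{
    assert(tb != NULL);
    atomic_store_explicit(&cpu->jmp_cache[jmp_cache_index(tb->pc)], tb,
                          memory_order_release);
}
```

A miss is not an error: the caller would fall back to the global hash
table and, only if that also misses, take a lock and translate.
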
Global TCG State
----------------

User-mode emulation
~~~~~~~~~~~~~~~~~~~

We need to protect the entire code generation cycle including any post
generation patching of the translated code. This also implies a shared
translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held on a
per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().

!User-mode emulation
~~~~~~~~~~~~~~~~~~~~

Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking during translation.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which when full will force a flush of all translations and start from
scratch again. Some operations also force a full flush of translations
including:

- debugging operations (breakpoint insertion/removal)
- some CPU helper functions
- linux-user spawning its first thread

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.

More granular translation invalidation events are typically due
to a change of the state of a physical page:

- code modification (self-modifying code, patching code)
- page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need the jump patches reversing so they will return
to the C code.

There are a number of look-up caches that need to be properly updated
including the:

- jump lookup cache
- the physical-to-tb lookup hash table
- the global page table

The global page table (l1_map) provides a multi-level look-up
for PageDesc structures, which contain pointers to the start of a
linked list of all Translation Blocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs

- safely patch/revert direct jumps
- remove central PageDesc lookup entries
- ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of tb->jmp_lock,
where tb is the destination block of a jump. Each origin block keeps a
pointer to its destinations so that the appropriate lock can be acquired before
iterating over a jump list.

The global page table is a lockless radix tree; cmpxchg is used
to atomically insert new elements.

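The cmpxchg-based insertion mentioned above follows a common pattern:
allocate the new node privately, try to publish it with a single
compare-and-swap, and discard it if another thread got there first. The
sketch below illustrates that pattern with C11 atomics; struct level,
L2_SIZE and level_get_or_alloc() are hypothetical and do not mirror the
real l1_map code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define L2_SIZE 16   /* hypothetical fan-out of one tree level */

struct level {       /* stand-in for one level of the radix tree */
    int entries[L2_SIZE];
};

/* Return the level behind *slot, allocating and publishing it if needed. */
static struct level *level_get_or_alloc(_Atomic(struct level *) *slot)
{
    assert(slot != NULL);
    struct level *cur = atomic_load_explicit(slot, memory_order_acquire);
    if (cur != NULL) {
        return cur;                       /* common case: already present */
    }

    struct level *fresh = calloc(1, sizeof(*fresh));
    if (fresh == NULL) {
        return NULL;
    }

    struct level *expected = NULL;
    if (atomic_compare_exchange_strong_explicit(slot, &expected, fresh,
                                                memory_order_acq_rel,
                                                memory_order_acquire)) {
        return fresh;                     /* we published our node */
    }

    /* Another thread won the race; use its node and drop ours. */
    free(fresh);
    return expected;
}
```

Because a slot only ever changes from NULL to a valid pointer, readers
never need a lock to walk the tree.
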
The lookup caches are updated atomically and the lookup hash uses QHT
which is designed for concurrent safe lookup.

Parallel code generation is supported. QHT is used at insertion time
as the synchronization point across threads, thereby ensuring that we only
keep track of a single TranslationBlock for each guest code block.

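The "first insertion wins" rule can be illustrated with a much simpler
table than QHT. The sketch below is an assumption-laden stand-in: it
uses one mutex for the whole table and has no lockless readers, so it
only shows how two threads that translated the same guest code agree on
a single block. struct tb_table and tb_table_insert() are hypothetical
names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define N_BUCKETS 64  /* hypothetical table size */

struct tb {                      /* stand-in for a TranslationBlock */
    uint64_t phys_pc;
    struct tb *next;
};

struct tb_table {
    pthread_mutex_t lock;        /* a real QHT is far finer grained */
    struct tb *buckets[N_BUCKETS];
};

/*
 * Insert tb unless a block for the same guest address already exists.
 * Returns NULL if tb was inserted, or the existing block otherwise, in
 * which case the caller is expected to discard its own translation.
 */
static struct tb *tb_table_insert(struct tb_table *t, struct tb *tb)
{
    assert(t != NULL && tb != NULL);
    struct tb **head = &t->buckets[(tb->phys_pc >> 2) % N_BUCKETS];
    struct tb *existing = NULL;

    pthread_mutex_lock(&t->lock);
    for (struct tb *p = *head; p != NULL; p = p->next) {
        if (p->phys_pc == tb->phys_pc) {
            existing = p;
            break;
        }
    }
    if (existing == NULL) {
        tb->next = *head;
        *head = tb;
    }
    pthread_mutex_unlock(&t->lock);
    return existing;
}
```
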
Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which once populated will allow
a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

- Memory regions (dividing up access to PIO, MMIO and RAM)
- Dirty page tracking (for code gen, SMC detection, migration and display)
- Virtual TLB (for translating guest address->real address)

When the TLB tables are updated by a vCPU thread other than their own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

- TLB Flush All/Page

  - can be across-vCPUs
  - cross vCPU TLB flush may need other vCPU brought to halt
  - change may need to be visible to the calling vCPU immediately

- TLB Flag Update

  - usually cross-vCPU
  - want change to be visible as soon as possible

- TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)

  - This is a per-vCPU table - by definition can't race
  - updated by its own thread when the slow-path is forced

(Current solution)

We have updated cputlb.c to defer cross-vCPU operations with
async_run_on_cpu(), which ensures each vCPU sees a coherent state
when it next runs its work (in a few instructions' time).

A new set of operations (tlb_flush_*_all_cpus) take an additional flag
which when set will force synchronisation by setting the source vCPU's
work as "safe work" and exiting the cpu run loop. This ensures that by
the time execution restarts all flush operations have completed.

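The deferral idea can be sketched as a per-vCPU work queue: any thread
may append a work item, but only the owning vCPU thread ever runs it, so
the per-vCPU TLB itself needs no lock. This is an illustration of the
pattern only, not the real async_run_on_cpu() implementation;
vcpu_queue_work(), vcpu_process_work() and the tlb_generation field are
hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef void (*work_fn)(void *cpu_state, void *arg);

struct work_item {
    work_fn fn;
    void *arg;
    struct work_item *next;
};

struct vcpu {
    pthread_mutex_t work_lock;            /* protects the queue only */
    struct work_item *work_head, *work_tail;
    int tlb_generation;                   /* stand-in for per-vCPU TLB state */
};

/* Callable from any thread: queue fn to run later on cpu's own thread. */
static int vcpu_queue_work(struct vcpu *cpu, work_fn fn, void *arg)
{
    assert(cpu != NULL && fn != NULL);
    struct work_item *wi = malloc(sizeof(*wi));
    if (wi == NULL) {
        return -1;
    }
    wi->fn = fn;
    wi->arg = arg;
    wi->next = NULL;

    pthread_mutex_lock(&cpu->work_lock);
    if (cpu->work_tail != NULL) {
        cpu->work_tail->next = wi;
    } else {
        cpu->work_head = wi;
    }
    cpu->work_tail = wi;
    pthread_mutex_unlock(&cpu->work_lock);
    return 0;
}

/* Called only by the vCPU's own thread, outside translated code. */
static void vcpu_process_work(struct vcpu *cpu)
{
    pthread_mutex_lock(&cpu->work_lock);
    struct work_item *wi = cpu->work_head;
    cpu->work_head = cpu->work_tail = NULL;
    pthread_mutex_unlock(&cpu->work_lock);

    while (wi != NULL) {                  /* run items in FIFO order */
        struct work_item *next = wi->next;
        wi->fn(cpu, wi->arg);
        free(wi);
        wi = next;
    }
}

/* Example work item: a stand-in for flushing this vCPU's TLB. */
static void flush_tlb_work(void *cpu_state, void *arg)
{
    (void)arg;
    ((struct vcpu *)cpu_state)->tlb_generation++;
}
```

The "safe work" variant described above would additionally wait until
every vCPU has left translated code before running the item; that part
is not shown here.
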
TLB flag updates are all done atomically and are also protected by the
corresponding page lock.

(Known limitation)

Not really a limitation but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently, thanks to KVM work, any access to IO memory is automatically
protected by the global iothread mutex, also known as the BQL (Big
QEMU Lock). Any IO region that doesn't use the global mutex is expected
to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update the state of more than a single vCPU should take
the BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross vCPU.

Memory Consistency
==================

Between emulated guests and host systems there is a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to any memory operations as well as just loads or stores.

The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
on the various forms of memory barrier and the guarantees they can
provide.

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring for example
a change to a signal flag will only be visible once the changes to
payload are.

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas Arm has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers

- host systems with stronger implied guarantees can skip some barriers
- merge consecutive barriers to the strongest one

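Both requirements reduce to simple set operations if each barrier is
described as a bitmask of the orderings it enforces. The sketch below
assumes such an encoding; the BAR_* names, barrier_merge() and
barrier_for_host() are hypothetical and are not the TCG definitions.

```c
#include <assert.h>

/* Hypothetical encoding: one bit per ordering a barrier enforces. */
enum {
    BAR_LD_LD = 1 << 0,
    BAR_LD_ST = 1 << 1,
    BAR_ST_LD = 1 << 2,
    BAR_ST_ST = 1 << 3,
    BAR_ALL   = BAR_LD_LD | BAR_LD_ST | BAR_ST_LD | BAR_ST_ST,
};

/* Two adjacent barriers collapse into one enforcing the union of both. */
static unsigned barrier_merge(unsigned a, unsigned b)
{
    assert((a & ~BAR_ALL) == 0 && (b & ~BAR_ALL) == 0);
    return a | b;
}

/*
 * Drop the orderings the host already guarantees. A result of zero
 * means no barrier instruction needs to be emitted at all.
 */
static unsigned barrier_for_host(unsigned wanted, unsigned host_guarantees)
{
    return wanted & ~host_guarantees;
}
```

For example, on a host that guarantees everything except store-load
ordering, a merged load-load plus store-store barrier disappears
entirely, while a full barrier shrinks to just its store-load part.
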
(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated to emit fences when required:

- target-i386
- target-arm
- target-aarch64
- target-alpha
- target-mips

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is Arm's ldrex/strex
pair where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:

- Support classic atomic instructions
- Support load/store exclusive (or load link/store conditional) pairs
- Generic enough infrastructure to support all guest architectures

CURRENT OPEN QUESTIONS:

- How problematic is the ABA problem in general?

(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
Arm's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where
this may be a problem - typically presenting a locking ABI which
assumes cmpxchg like semantics.

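One way to picture how an exclusive pair can be built from a
compare-and-swap, and why the result is exposed to the ABA problem, is
the sketch below. It uses C11 atomics in place of the tcg_gen_atomic_*
helpers; struct excl_monitor, load_exclusive() and store_exclusive() are
hypothetical names, not the code any QEMU front-end generates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU record of the last load-exclusive. */
struct excl_monitor {
    _Atomic uint32_t *addr;   /* address being monitored, or NULL */
    uint32_t val;             /* value seen by the load-exclusive */
};

static uint32_t load_exclusive(struct excl_monitor *mon, _Atomic uint32_t *addr)
{
    assert(mon != NULL && addr != NULL);
    mon->addr = addr;
    mon->val = atomic_load(addr);
    return mon->val;
}

/* Returns true if the store happened, false if the exclusive failed. */
static bool store_exclusive(struct excl_monitor *mon, _Atomic uint32_t *addr,
                            uint32_t newval)
{
    assert(mon != NULL && addr != NULL);
    if (mon->addr != addr) {
        return false;             /* no matching load-exclusive */
    }
    uint32_t expected = mon->val;
    mon->addr = NULL;             /* the monitor is cleared either way */
    /* Succeeds whenever memory still holds the value that was loaded. */
    return atomic_compare_exchange_strong(addr, &expected, newval);
}
```

The compare-and-swap only checks the value, not whether the location was
written in the meantime. If another vCPU changes the value from A to B
and back to A, store_exclusive() still succeeds, whereas real hardware
would fail the store. That difference is the ABA exposure noted above.
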
The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.