0b2675c473
The term "QEMU global mutex" is identical to the more widely used Big QEMU Lock ("BQL"). Update the code comments and documentation to use "BQL" instead of "QEMU global mutex". Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com> Message-id: 20240102153529.486531-6-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
373 lines
14 KiB
ReStructuredText
373 lines
14 KiB
ReStructuredText
..
|
|
Copyright (c) 2015-2020 Linaro Ltd.
|
|
|
|
This work is licensed under the terms of the GNU GPL, version 2 or
|
|
later. See the COPYING file in the top-level directory.
|
|
|
|
==================
|
|
Multi-threaded TCG
|
|
==================
|
|
|
|
This document outlines the design for multi-threaded TCG (a.k.a MTTCG)
|
|
system-mode emulation. user-mode emulation has always mirrored the
|
|
thread structure of the translated executable although some of the
|
|
changes done for MTTCG system emulation have improved the stability of
|
|
linux-user emulation.
|
|
|
|
The original system-mode TCG implementation was single threaded and
|
|
dealt with multiple CPUs with simple round-robin scheduling. This
|
|
simplified a lot of things but became increasingly limited as systems
|
|
being emulated gained additional cores and per-core performance gains
|
|
for host systems started to level off.
|
|
|
|
vCPU Scheduling
|
|
===============
|
|
|
|
We introduce a new running mode where each vCPU will run on its own
|
|
user-space thread. This is enabled by default for all FE/BE
|
|
combinations where the host memory model is able to accommodate the
|
|
guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
|
|
guest has had the required work done to support this safely
|
|
(TARGET_SUPPORTS_MTTCG).
|
|
|
|
System emulation will fall back to the original round robin approach
|
|
if:
|
|
|
|
* forced by --accel tcg,thread=single
|
|
* enabling --icount mode
|
|
* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)
|
|
|
|
In the general case of running translated code there should be no
|
|
inter-vCPU dependencies and all vCPUs should be able to run at full
|
|
speed. Synchronisation will only be required while accessing internal
|
|
shared data structures or when the emulated architecture requires a
|
|
coherent representation of the emulated machine state.
|
|
|
|
Shared Data Structures
|
|
======================
|
|
|
|
Main Run Loop
|
|
-------------
|
|
|
|
Even when there is no code being generated there are a number of
|
|
structures associated with the hot-path through the main run-loop.
|
|
These are associated with looking up the next translation block to
|
|
execute. These include:
|
|
|
|
tb_jmp_cache (per-vCPU, cache of recent jumps)
|
|
tb_ctx.htable (global hash table, phys address->tb lookup)
|
|
|
|
As TB linking only occurs when blocks are in the same page this code
|
|
is critical to performance as looking up the next TB to execute is the
|
|
most common reason to exit the generated code.
|
|
|
|
DESIGN REQUIREMENT: Make access to lookup structures safe with
|
|
multiple reader/writer threads. Minimise any lock contention to do it.
|
|
|
|
The hot-path avoids using locks where possible. The tb_jmp_cache is
|
|
updated with atomic accesses to ensure consistent results. The fall
|
|
back QHT based hash table is also designed for lockless lookups. Locks
|
|
are only taken when code generation is required or TranslationBlocks
|
|
have their block-to-block jumps patched.
|
|
|
|
Global TCG State
|
|
----------------
|
|
|
|
User-mode emulation
|
|
~~~~~~~~~~~~~~~~~~~
|
|
|
|
We need to protect the entire code generation cycle including any post
|
|
generation patching of the translated code. This also implies a shared
|
|
translation buffer which contains code running on all cores. Any
|
|
execution path that comes to the main run loop will need to hold a
|
|
mutex for code generation. This also includes times when we need flush
|
|
code or entries from any shared lookups/caches. Structures held on a
|
|
per-vCPU basis won't need locking unless other vCPUs will need to
|
|
modify them.
|
|
|
|
DESIGN REQUIREMENT: Add locking around all code generation and TB
|
|
patching.
|
|
|
|
(Current solution)
|
|
|
|
Code generation is serialised with mmap_lock().
|
|
|
|
!User-mode emulation
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Each vCPU has its own TCG context and associated TCG region, thereby
|
|
requiring no locking during translation.
|
|
|
|
Translation Blocks
|
|
------------------
|
|
|
|
Currently the whole system shares a single code generation buffer
|
|
which when full will force a flush of all translations and start from
|
|
scratch again. Some operations also force a full flush of translations
|
|
including:
|
|
|
|
- debugging operations (breakpoint insertion/removal)
|
|
- some CPU helper functions
|
|
- linux-user spawning its first thread
|
|
|
|
This is done with the async_safe_run_on_cpu() mechanism to ensure all
|
|
vCPUs are quiescent when changes are being made to shared global
|
|
structures.
|
|
|
|
More granular translation invalidation events are typically due
|
|
to a change of the state of a physical page:
|
|
|
|
- code modification (self modify code, patching code)
|
|
- page changes (new page mapping in linux-user mode)
|
|
|
|
While setting the invalid flag in a TranslationBlock will stop it
|
|
being used when looked up in the hot-path there are a number of other
|
|
book-keeping structures that need to be safely cleared.
|
|
|
|
Any TranslationBlocks which have been patched to jump directly to the
|
|
now invalid blocks need the jump patches reversing so they will return
|
|
to the C code.
|
|
|
|
There are a number of look-up caches that need to be properly updated
|
|
including the:
|
|
|
|
- jump lookup cache
|
|
- the physical-to-tb lookup hash table
|
|
- the global page table
|
|
|
|
The global page table (l1_map) which provides a multi-level look-up
|
|
for PageDesc structures which contain pointers to the start of a
|
|
linked list of all Translation Blocks in that page (see page_next).
|
|
|
|
Both the jump patching and the page cache involve linked lists that
|
|
the invalidated TranslationBlock needs to be removed from.
|
|
|
|
DESIGN REQUIREMENT: Safely handle invalidation of TBs
|
|
- safely patch/revert direct jumps
|
|
- remove central PageDesc lookup entries
|
|
- ensure lookup caches/hashes are safely updated
|
|
|
|
(Current solution)
|
|
|
|
The direct jump themselves are updated atomically by the TCG
|
|
tb_set_jmp_target() code. Modification to the linked lists that allow
|
|
searching for linked pages are done under the protection of tb->jmp_lock,
|
|
where tb is the destination block of a jump. Each origin block keeps a
|
|
pointer to its destinations so that the appropriate lock can be acquired before
|
|
iterating over a jump list.
|
|
|
|
The global page table is a lockless radix tree; cmpxchg is used
|
|
to atomically insert new elements.
|
|
|
|
The lookup caches are updated atomically and the lookup hash uses QHT
|
|
which is designed for concurrent safe lookup.
|
|
|
|
Parallel code generation is supported. QHT is used at insertion time
|
|
as the synchronization point across threads, thereby ensuring that we only
|
|
keep track of a single TranslationBlock for each guest code block.
|
|
|
|
Memory maps and TLBs
|
|
--------------------
|
|
|
|
The memory handling code is fairly critical to the speed of memory
|
|
access in the emulated system. The SoftMMU code is designed so the
|
|
hot-path can be handled entirely within translated code. This is
|
|
handled with a per-vCPU TLB structure which once populated will allow
|
|
a series of accesses to the page to occur without exiting the
|
|
translated code. It is possible to set flags in the TLB address which
|
|
will ensure the slow-path is taken for each access. This can be done
|
|
to support:
|
|
|
|
- Memory regions (dividing up access to PIO, MMIO and RAM)
|
|
- Dirty page tracking (for code gen, SMC detection, migration and display)
|
|
- Virtual TLB (for translating guest address->real address)
|
|
|
|
When the TLB tables are updated by a vCPU thread other than their own
|
|
we need to ensure it is done in a safe way so no inconsistent state is
|
|
seen by the vCPU thread.
|
|
|
|
Some operations require updating a number of vCPUs TLBs at the same
|
|
time in a synchronised manner.
|
|
|
|
DESIGN REQUIREMENTS:
|
|
|
|
- TLB Flush All/Page
|
|
- can be across-vCPUs
|
|
- cross vCPU TLB flush may need other vCPU brought to halt
|
|
- change may need to be visible to the calling vCPU immediately
|
|
- TLB Flag Update
|
|
- usually cross-vCPU
|
|
- want change to be visible as soon as possible
|
|
- TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
|
|
- This is a per-vCPU table - by definition can't race
|
|
- updated by its own thread when the slow-path is forced
|
|
|
|
(Current solution)
|
|
|
|
We have updated cputlb.c to defer operations when a cross-vCPU
|
|
operation with async_run_on_cpu() which ensures each vCPU sees a
|
|
coherent state when it next runs its work (in a few instructions
|
|
time).
|
|
|
|
A new set up operations (tlb_flush_*_all_cpus) take an additional flag
|
|
which when set will force synchronisation by setting the source vCPUs
|
|
work as "safe work" and exiting the cpu run loop. This ensure by the
|
|
time execution restarts all flush operations have completed.
|
|
|
|
TLB flag updates are all done atomically and are also protected by the
|
|
corresponding page lock.
|
|
|
|
(Known limitation)
|
|
|
|
Not really a limitation but the wait mechanism is overly strict for
|
|
some architectures which only need flushes completed by a barrier
|
|
instruction. This could be a future optimisation.
|
|
|
|
Emulated hardware state
|
|
-----------------------
|
|
|
|
Currently thanks to KVM work any access to IO memory is automatically protected
|
|
by the BQL (Big QEMU Lock). Any IO region that doesn't use the BQL is expected
|
|
to do its own locking.
|
|
|
|
However IO memory isn't the only way emulated hardware state can be
|
|
modified. Some architectures have model specific registers that
|
|
trigger hardware emulation features. Generally any translation helper
|
|
that needs to update more than a single vCPUs of state should take the
|
|
BQL.
|
|
|
|
As the BQL, or global iothread mutex is shared across the system we
|
|
push the use of the lock as far down into the TCG code as possible to
|
|
minimise contention.
|
|
|
|
(Current solution)
|
|
|
|
MMIO access automatically serialises hardware emulation by way of the
|
|
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
|
|
and also defer the reset/startup of vCPUs to the vCPU context by way
|
|
of async_run_on_cpu().
|
|
|
|
Updates to interrupt state are also protected by the BQL as they can
|
|
often be cross vCPU.
|
|
|
|
Memory Consistency
|
|
==================
|
|
|
|
Between emulated guests and host systems there are a range of memory
|
|
consistency models. Even emulating weakly ordered systems on strongly
|
|
ordered hosts needs to ensure things like store-after-load re-ordering
|
|
can be prevented when the guest wants to.
|
|
|
|
Memory Barriers
|
|
---------------
|
|
|
|
Barriers (sometimes known as fences) provide a mechanism for software
|
|
to enforce a particular ordering of memory operations from the point
|
|
of view of external observers (e.g. another processor core). They can
|
|
apply to any memory operations as well as just loads or stores.
|
|
|
|
The Linux kernel has an excellent `write-up
|
|
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
|
|
on the various forms of memory barrier and the guarantees they can
|
|
provide.
|
|
|
|
Barriers are often wrapped around synchronisation primitives to
|
|
provide explicit memory ordering semantics. However they can be used
|
|
by themselves to provide safe lockless access by ensuring for example
|
|
a change to a signal flag will only be visible once the changes to
|
|
payload are.
|
|
|
|
DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
|
|
|
|
This would enforce a strong load/store ordering so all loads/stores
|
|
complete at the memory barrier. On single-core non-SMP strongly
|
|
ordered backends this could become a NOP.
|
|
|
|
Aside from explicit standalone memory barrier instructions there are
|
|
also implicit memory ordering semantics which comes with each guest
|
|
memory access instruction. For example all x86 load/stores come with
|
|
fairly strong guarantees of sequential consistency whereas Arm has
|
|
special variants of load/store instructions that imply acquire/release
|
|
semantics.
|
|
|
|
In the case of a strongly ordered guest architecture being emulated on
|
|
a weakly ordered host the scope for a heavy performance impact is
|
|
quite high.
|
|
|
|
DESIGN REQUIREMENTS: Be efficient with use of memory barriers
|
|
- host systems with stronger implied guarantees can skip some barriers
|
|
- merge consecutive barriers to the strongest one
|
|
|
|
(Current solution)
|
|
|
|
The system currently has a tcg_gen_mb() which will add memory barrier
|
|
operations if code generation is being done in a parallel context. The
|
|
tcg_optimize() function attempts to merge barriers up to their
|
|
strongest form before any load/store operations. The solution was
|
|
originally developed and tested for linux-user based systems. All
|
|
backends have been converted to emit fences when required. So far the
|
|
following front-ends have been updated to emit fences when required:
|
|
|
|
- target-i386
|
|
- target-arm
|
|
- target-aarch64
|
|
- target-alpha
|
|
- target-mips
|
|
|
|
Memory Control and Maintenance
|
|
------------------------------
|
|
|
|
This includes a class of instructions for controlling system cache
|
|
behaviour. While QEMU doesn't model cache behaviour these instructions
|
|
are often seen when code modification has taken place to ensure the
|
|
changes take effect.
|
|
|
|
Synchronisation Primitives
|
|
--------------------------
|
|
|
|
There are two broad types of synchronisation primitives found in
|
|
modern ISAs: atomic instructions and exclusive regions.
|
|
|
|
The first type offer a simple atomic instruction which will guarantee
|
|
some sort of test and conditional store will be truly atomic w.r.t.
|
|
other cores sharing access to the memory. The classic example is the
|
|
x86 cmpxchg instruction.
|
|
|
|
The second type offer a pair of load/store instructions which offer a
|
|
guarantee that a region of memory has not been touched between the
|
|
load and store instructions. An example of this is Arm's ldrex/strex
|
|
pair where the strex instruction will return a flag indicating a
|
|
successful store only if no other CPU has accessed the memory region
|
|
since the ldrex.
|
|
|
|
Traditionally TCG has generated a series of operations that work
|
|
because they are within the context of a single translation block so
|
|
will have completed before another CPU is scheduled. However with
|
|
the ability to have multiple threads running to emulate multiple CPUs
|
|
we will need to explicitly expose these semantics.
|
|
|
|
DESIGN REQUIREMENTS:
|
|
- Support classic atomic instructions
|
|
- Support load/store exclusive (or load link/store conditional) pairs
|
|
- Generic enough infrastructure to support all guest architectures
|
|
CURRENT OPEN QUESTIONS:
|
|
- How problematic is the ABA problem in general?
|
|
|
|
(Current solution)
|
|
|
|
The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
|
|
can be used directly or combined to emulate other instructions like
|
|
Arm's ldrex/strex instructions. While they are susceptible to the ABA
|
|
problem so far common guests have not implemented patterns where
|
|
this may be a problem - typically presenting a locking ABI which
|
|
assumes cmpxchg like semantics.
|
|
|
|
The code also includes a fall-back for cases where multi-threaded TCG
|
|
ops can't work (e.g. guest atomic width > host atomic width). In this
|
|
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
|
|
an exclusive lock which ensures all emulation is serialised.
|
|
|
|
While the atomic helpers look good enough for now there may be a need
|
|
to look at solutions that can more closely model the guest
|
|
architectures semantics.
|