4f71086665
8 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Emilio G. Cota
|
4e2ca83e71 |
tcg: define CF_PARALLEL and use it for TB hashing along with CF_COUNT_MASK
This will enable us to decouple code translation from the value of parallel_cpus at any given time. It will also help us minimize TB flushes when generating code via EXCP_ATOMIC. Note that the declaration of parallel_cpus is brought to exec-all.h to be able to define there the "curr_cflags" inline. Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org> |
||
Lluís Vilanova
|
61a67f71dd |
exec: [tcg] Use different TBs according to the vCPU's dynamic tracing state
Every vCPU now uses a separate set of TBs for each set of dynamic tracing event state values. Each set of TBs can be used by any number of vCPUs to maximize TB reuse when vCPUs have the same tracing state. This feature is later used by tracetool to optimize tracing of guest code events. The maximum number of TB sets is defined as 2^E, where E is the number of events that have the 'vcpu' property (their state is stored in CPUState->trace_dstate). For this to work, a change on the dynamic tracing state of a vCPU will force it to flush its virtual TB cache (which is only indexed by address), and fall back to the physical TB cache (which now contains the vCPU's dynamic tracing state as part of the hashing function). Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu> Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-id: 149915775266.6295.10060144081246467690.stgit@frigg.lan Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> |
||
Emilio G. Cota
|
6f1653180f |
tb-hash: improve tb_jmp_cache hash function in user mode
Optimizations to cross-page chaining and indirect branches make performance more sensitive to the hit rate of tb_jmp_cache. The constraint of reserving some bits for the page number lowers the achievable quality of the hashing function. However, user-mode does not have this requirement. Thus, with this change we use for user-mode a hashing function that is both faster and of better quality than the previous one. Measurements: Note: baseline (i.e. speedup == 1x) is QEMU v2.9.0. - SPECint06 (test set), x86_64-linux-user. Host: Intel i7-6700K @ 4.00GHz 2.2x +-+--------------------------------------------------------------------------------------------------------------+-+ | | | jr | 2x +jr+multhash +....................................................+++++...................................+-+ | jr+hash |$$$ | | |$+$ | | ### $ | 1.8x +-+......................................................................#|#.$...................................+-+ | ++#+# $ | | |# # $ | 1.6x +-+....................................................................***.#.$....................++$$$..........+-+ | $$$ *+* # $ |$+$ | | ++$$$ ### $ * * # $ +++|$ $ | | ++###+$ # # $ * * # $ ### ****## $ | 1.4x +-+...................***+#.$.........***.#.$..........................*.*.#.$...........#+#$$.*++*|#.$..........+-+ | *+* # $ * * # $ * * # $ # # $ * *+# $ | | * * # $ +++++ * * # $ * * # $ *** # $ * * # $ ###$$ | 1.2x +-+...................*.*.#.$.***##$$.*.*.#.$..........................*.*.#.$.........*.*.#.$.*..*.#.$.***+#+$..+-+ | * * # $ *+* # $ * * # $ +++ * * # $ ++###$$ * * # $ * * # $ * * # $ | | ***##$$ * * # $ * * # $ * * # $ ***##$$ ++### * * # $ *** #+$ * * # $ * * # $ * * # $ | | *+*+#+$ ***##$$$ * * # $ * * # $ * * # $ *+* # $ ++####$$ ***+# * * # $ * * # $ * * # $ * * # $ * * # $ | 1x +-++-*+*+#+$+*+*+#-+$+*+*-#+$+*+*+#+$+*+*+#+$+*-*+#+$+***++#+$+*+*+#$$+*+*+#+$+*+*+#+$+*+*-#+$+*+-*+#+$+*+*+#+$-++-+ | * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ | | * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ | 0.8x +-+--***##$$-***##$$$-***##$$-***##$$-***##$$-***##$$-***###$$-***##$$-***##$$-***##$$-***##$$-****##$$-***##$$--+-+ astar bzip2 gcc gobmk h264ref hmmlibquantum mcf omnetpperlbench sjengxalancbmk hmean png: http://imgur.com/4UXTrEc Here I also tried the hash function suggested by Paolo ("multhash"): return ((uint64_t) (pc * 2654435761) >> 32) & (TB_JMP_CACHE_SIZE - 1); As you can see it is just as good as the other new function ("hash"), which is what I ended up going with. - SPECint06 (train set), x86_64-linux-user. Host: Intel i7-6700K @ 4.00GHz 2.6x +-+--------------------------------------------------------------------------------------------------------------+-+ | | | jr ### | 2.4x +jr+hash...........................................................................................#.#...........+-+ | # # | | # # | 2.2x +-+................................................................................................#.#...........+-+ | # # | | # # | 2x +-+................................................................................................#.#...........+-+ | **** # | | * * # | 1.8x +-+.............................................................................................*..*.#...........+-+ | +++ * * # | | #### #### * * # | 1.6x +-+......................................####.............................#..#.****..#..........*..*.#...........+-+ | +++ #++# **** # * * # #### * * # | | ### # # * * # * * # # # * * # | 1.4x +-+...................****+#..........****..#..........................*..*..#.*..*..#....#..#..*..*.#...........+-+ | *++* # * * # * * # * * # *** # * * # #### | | * * # #### * * # * * # * * # * * # * * # **** # | 1.2x +-+...................*..*.#..****++#.*..*..#..........................*..*..#.*..*..#..*.*..#..*..*.#..*..*..#..+-+ | ****### * * # * * # * * # * * # * * # * * # * * # * * # | | * * # ***### * * # * * # * * # ****## * * # * * # * * # * * # * * # | 1x +-+--****###--***###--****##--****###-****###--***###--***###--****##--****###-****###--***###--****##--****###--+-+ astar bzip2 gcc gobmk h264ref hmmlibquantum mcf omnetpperlbench sjengxalancbmk hmean png: http://imgur.com/ArCbHqo - NBench, x86_64-linux-user. Host: Intel i7-6700K @ 4.00GHz 1.12x +-+-------------------------------------------------------------------------------------------------------------+-+ | | | jr +++ | 1.1x +jr+hash...........................................................####.........................................+-+ | +++#| # | | | #++# | 1.08x +-+................................+++................+++.+++..*****..#.........................................+-+ | | +++ | | * | * # | | | | | | *+++* # | 1.06x +-+................................****###.............|...|...*...*..#.........................+++.............+-+ | *| * |# ****### * * # | | | *| *++# *| * |# * * # #### | 1.04x +-+................................*++*..#............*|.*.|#..*...*..#........................#.|#.............+-+ | * * # *++*++# * * # +++#++# | | * * # * * # * * # | # # +++#### | 1.02x +-+................................*..*..#......+++...*..*..#..*...*..#.....................****..#..*****++#...+-+ | +++ * * # +++ | * * # * * # +++ *| * # *+++* # | | +++ | +++ +++ ++++++ * * # *****### * * # * * # | +++ ++++++ *++* # * * # | 1x +-++-+++++####++****###++++-+####+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-+++####-+*****###++*++*++#++*+-+*++#+-++-+ | *****| # *++* |# *****| # * * # * *++# * * # * * # **** |# * * # * * # * * # | | * | *| # * *++# * | *++# * * # * * # * * # * * # *| *++# * * # * * # * * # | 0.98x +-+...*.|.*++#..*..*..#..*+++*..#..*..*..#..*...*..#..*..*..#..*...*..#..*++*..#..*...*..#..*..*..#..*...*..#...+-+ | *+++* # * * # * * # * * # * * # * * # * * # * * # * * # * * # * * # | | * * # * * # * * # * * # * * # * * # * * # * * # * * # * * # * * # | 0.96x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+ ASSIGNMENT BITFIELD FOURFP EMULATION HUFFMAN LU DECOMPOSITIONEURAL NNUMERIC SOSTRING SORT hmean png: http://imgur.com/ZXFX0hJ - NBench, arm-linux-user. Host: Intel i7-4790K @ 4.00GHz 1.3x +-+-------------------------------------------------------------------------------------------------------------+-+ | #### | | jr # # +++ | 1.25x +jr+hash.....................#..#...........................................####................................+-+ | # # # # | | # # # # | 1.2x +-+..........................#..#...........................................#..#................................+-+ | # # # # | | # # # # | 1.15x +-+..........................#..#...........................................#..#................................+-+ | # # #### # # | | # # # # # # | 1.1x +-+..........................#..#..................................#..#.....#..#................................+-+ | # # # # # # +++ | | # # #### # # # # #### | 1.05x +-+..........................#..#...............#..#.....####......#..#.....#..#.........................#..#...+-+ | # # # # # # # # # # +++ # # | | +++ ***** # #### ***** # # # +++# # **** # ****### # # | 1x +-++-+*****###++****+++++*+-+*++#+-****++#-+*+++*-+#+++++#++#++*****++#+-*++*++#-+*****-++++*++*++#++*****++#+-++-+ | * * # * * | * * # * * # * * # **** # * * # * * # * *### * *++# * * # | | * * # * *### * * # * * # * * # * * # * * # * * # * * # * * # * * # | 0.95x +-+...*...*..#..*..*.|#..*...*..#..*..*..#..*...*..#..*..*..#..*...*..#..*..*..#..*...*..#..*..*..#..*...*..#...+-+ | * * # * * |# * * # * * # * * # * * # * * # * * # * * # * * # * * # | | * * # * * |# * * # * * # * * # * * # * * # * * # * * # * * # * * # | 0.9x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+ ASSIGNMENT BITFIELD FOURFP EMULATION HUFFMAN LU DECOMPOSITIONEURAL NNUMERIC SOSTRING SORT hmean png: http://imgur.com/FfD27ey Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-Id: <1493263764-18657-12-git-send-email-cota@braap.org> Signed-off-by: Richard Henderson <rth@twiddle.net> |
||
Markus Armbruster
|
2a6a4076e1 |
Clean up ill-advised or unusual header guards
Cleaned up with scripts/clean-header-guards.pl. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Richard Henderson <rth@twiddle.net> |
||
Emilio G. Cota
|
909eaac9bb |
tb hash: track translated blocks with qht
Having a fixed-size hash table for keeping track of all translation blocks is suboptimal: some workloads are just too big or too small to get maximum performance from the hash table. The MRU promotion policy helps improve performance when the hash table is a little undersized, but it cannot make up for severely undersized hash tables. Furthermore, frequent MRU promotions result in writes that are a scalability bottleneck. For scalability, lookups should only perform reads, not writes. This is not a big deal for now, but it will become one once MTTCG matures. The appended fixes these issues by using qht as the implementation of the TB hash table. This solution is superior to other alternatives considered, namely: - master: implementation in QEMU before this patchset - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU. - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU. MRU is implemented here by adding an intermediate struct that contains the u32 hash and a pointer to the TB; this allows us, on an MRU promotion, to copy said struct (that is not at the head), and put this new copy at the head. After a grace period, the original non-head struct can be eliminated, and after another grace period, freed. - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize + no MRU for lookups; MRU for inserts. The appended solution is the following: - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize + no MRU for lookups; MRU for inserts. The plots below compare the considered solutions. The Y axis shows the boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis sweeps the number of buckets (or initial number of buckets for qht-autoresize). The plots in PNG format (and with errorbars) can be seen here: http://imgur.com/a/Awgnq Each test runs 5 times, and the entire QEMU process is pinned to a single core for repeatability of results. Host: Intel Xeon E5-2690 28 ++------------+-------------+-------------+-------------+------------++ A***** + + + master **A*** + 27 ++ * xxhash ##B###++ | A******A****** xxhash-rcu $$C$$$ | 26 C$$ A******A****** qht-fixed-nomru*%%D%%%++ D%%$$ A******A******A*qht-dyn-mru A*E****A 25 ++ %%$$ qht-dyn-nomru &&F&&&++ B#####% | 24 ++ #C$$$$$ ++ | B### $ | | ## C$$$$$$ | 23 ++ # C$$$$$$ ++ | B###### C$$$$$$ %%%D 22 ++ %B###### C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C | D%%%%%%B###### @E@@@@@@ %%%D%%%@@@E@@@@@@E 21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B + E@@@ F&&& + E@ + F&&& + + 20 ++------------+-------------+-------------+-------------+------------++ 14 16 18 20 22 24 log2 number of buckets Host: Intel i7-4790K 14.5 ++------------+------------+-------------+------------+------------++ A** + + + master **A*** + 14 ++ ** xxhash ##B###++ 13.5 ++ ** xxhash-rcu $$C$$$++ | qht-fixed-nomru %%D%%% | 13 ++ A****** qht-dyn-mru @@E@@@++ | A*****A******A****** qht-dyn-nomru &&F&&& | 12.5 C$$ A******A******A*****A****** ***A 12 ++ $$ A*** ++ D%%% $$ | 11.5 ++ %% ++ B### %C$$$$$$ | 11 ++ ## D%%%%% C$$$$$ ++ | # % C$$$$$$ | 10.5 F&&&&&&B######D%%%%% C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$ $$$C 10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B + F&& D%%%%%%B######B######B#####B###@@@D%%% + 9.5 ++------------+------------+-------------+------------+------------++ 14 16 18 20 22 24 log2 number of buckets Note that the original point before this patch series is X=15 for "master"; the little sensitivity to the increased number of buckets is due to the poor hashing function in master. xxhash-rcu has significant overhead due to the constant churn of allocating and deallocating intermediate structs for implementing MRU. An alternative would be do consider failed lookups as "maybe not there", and then acquire the external lock (tb_lock in this case) to really confirm that there was indeed a failed lookup. This, however, would not be enough to implement dynamic resizing--this is more complex: see "Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming" by Triplett, McKenney and Walpole. This solution was discarded due to the very coarse RCU read critical sections that we have in MTTCG; resizing requires waiting for readers after every pointer update, and resizes require many pointer updates, so this would quickly become prohibitive. qht-fixed-nomru shows that MRU promotion is advisable for undersized hash tables. However, qht-dyn-mru shows that MRU promotion is not important if the hash table is properly sized: there is virtually no difference in performance between qht-dyn-nomru and qht-dyn-mru. Before this patch, we're at X=15 on "xxhash"; after this patch, we're at X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we can achieve with optimum sizing of the hash table, while keeping the hash table scalable for readers. The improvement we get before and after this patch for booting debian jessie with arm-softmmu is: - Intel Xeon E5-2690: 10.5% less time - Intel i7-4790K: 5.2% less time We could get this same improvement _for this particular workload_ by statically increasing the size of the hash table. But this would hurt workloads that do not need a large hash table. The dynamic (upward) resizing allows us to start small and enlarge the hash table as needed. A quick note on downsizing: the table is resized back to 2**15 buckets on every tb_flush; this makes sense because it is not guaranteed that the table will reach the same number of TBs later on (e.g. most bootup code is thrown away after boot); it makes sense to grow the hash table as more code blocks are translated. This also avoids the complication of having to build downsizing hysteresis logic into qht. Reviewed-by: Sergey Fedorov <serge.fedorov@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-Id: <1465412133-3029-15-git-send-email-cota@braap.org> Signed-off-by: Richard Henderson <rth@twiddle.net> |
||
Emilio G. Cota
|
42bd32287f |
tb hash: hash phys_pc, pc, and flags with xxhash
For some workloads such as arm bootup, tb_phys_hash is performance-critical. The is due to the high frequency of accesses to the hash table, originated by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's. More info: https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html To dig further into this I modified an arm image booting debian jessie to immediately shut down after boot. Analysis revealed that quite a bit of time is unnecessarily spent in tb_phys_hash: the cause is poor hashing that results in very uneven loading of chains in the hash table's buckets; the longest observed chain had ~550 elements. The appended addresses this with two changes: 1) Use xxhash as the hash table's hash function. xxhash is a fast, high-quality hashing function. 2) Feed the hashing function with not just tb_phys, but also pc and flags. This improves performance over using just tb_phys for hashing, since that resulted in some hash buckets having many TB's, while others getting very few; with these changes, the longest observed chain on a single hash bucket is brought down from ~550 to ~40. Tests show that the other element checked for in tb_find_physical, cs_base, is always a match when tb_phys+pc+flags are a match, so hashing cs_base is wasteful. It could be that this is an ARM-only thing, though. UPDATE: On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote: > The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB > consisting of only a delay slot). > It may well still turn out to be reasonable to ignore cs_base for hashing. BTW, after this change the hash table should not be called "tb_hash_phys" anymore; this is addressed later in this series. This change gives consistent bootup time improvements. I tested two host machines: - Intel Xeon E5-2690: 11.6% less time - Intel i7-4790K: 19.2% less time Increasing the number of hash buckets yields further improvements. However, using a larger, fixed number of buckets can degrade performance for other workloads that do not translate as many blocks (600K+ for debian-jessie arm bootup). This is dealt with later in this series. Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-Id: <1465412133-3029-8-git-send-email-cota@braap.org> Signed-off-by: Richard Henderson <rth@twiddle.net> |
||
Peter Crosthwaite
|
41da4bd642 |
cpu-defs: Move out TB_JMP defines
These are not Architecture specific in any way so move them out of cpu-defs.h. tb-hash.h is an appropriate place as a leading user and their strong relationship to TB hashing and caching. Reviewed-by: Richard Henderson <rth@redhat.com> Signed-off-by: Peter Crosthwaite <crosthwaite.peter@gmail.com> Message-Id: <43ceca65a3fa240efac49aa0bf604ad0442e1710.1433052532.git.crosthwaite.peter@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
Peter Crosthwaite
|
e1b89321ba |
include/exec: Move tb hash functions out
This is one of very few things in exec-all with a genuine CPU architecture dependency. Move these hashing helpers to a new header to trim exec-all.h down to a near architecture-agnostic header. The defs are only used by cpu-exec and translate-all which are both arch-obj's so the new tb-hash.h has no core code usage. Reviewed-by: Richard Henderson <rth@redhat.com> Signed-off-by: Peter Crosthwaite <crosthwaite.peter@gmail.com> Message-Id: <9d048b96f7cfa64a4d9c0b88e0dd2877fac51d41.1433052532.git.crosthwaite.peter@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |