msgsnd has a broadcast mode that sends hypervisor doorbells to all
threads belonging to the same core as the target. A "subcore" mode
sends to all or one thread depending on 1LPAR mode.
Reviewed-by: Glenn Miles <milesg@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
attn is an implementation-specific instruction that on POWER (and G5/
970) can be enabled with a HID bit (disabled = illegal), and executing
it causes the host processor to stop and the service processor to be
notified. Generally used for debugging.
Implement attn and make it checkstop the system, which should be good
enough for QEMU debugging.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Add support for the clrbhrb and mfbhrbe instructions.
Since neither instruction is believed to be critical to
performance, both instructions were implemented using helper
functions.
Access to both instructions is controlled by bits in the
HFSCR (for privileged state) and MMCR0 (for problem state).
A new function, helper_mmcr0_facility_check, was added for
checking MMCR0[BHRBA] and raising a facility_unavailable exception
if required.
NOTE: For P8 and P9, due to a performance issue, branch history will
not be kept, but the instructions will be allowed to execute
as normal with the exception that the mfbhrbe instruction will
always return a zero value.
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Glenn Miles <milesg@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
This commit continues adding support for the Branch History
Rolling Buffer (BHRB) as is provided starting with the P8
processor and continuing with its successors. This commit
is limited to the recording and filtering of taken branches.
The following changes were made:
- Enabled functionality on P10 processors only due to
performance impact seen with P8 and P9 where it is not
disabled for non problem state branches.
- Added a BHRB buffer for storing branch instruction and
target addresses for taken branches
- Renamed gen_update_cfar to gen_update_branch_history and
added a 'target' parameter to hold the branch target
address and 'inst_type' parameter to use for filtering
- Added TCG code to gen_update_branch_history that stores
data to the BHRB and updates the BHRB offset.
- Added BHRB resource initialization and reset functions
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Glenn Miles <milesg@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Moving the following instructions to decodetree specification :
v{max, min}{u, s}{b, h, w, d} : VX-form
The changes were verified by validating that the tcg ops generated by those
instructions remain the same, which were captured with the '-d in_asm,op' flag.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Chinmay Rath <rathc@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Moving the following instructions to decodetree specification:
v{and, andc, nand, or, orc, nor, xor, eqv} : VX-form
The changes were verified by validating that the tcp ops generated by those
instructions remain the same, which were captured with the '-d in_asm,op' flag.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Chinmay Rath <rathc@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Moving the following instructions to decodetree specification :
{l,st}ve{b,h,w}x,
{l,st}v{x,xl},
lvs{l,r} : X-form
The changes were verified by validating that the tcg ops generated by those
instructions remain the same, which were captured using the '-d in_asm,op' flag.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Chinmay Rath <rathc@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Moving the below instructions to decodetree specification :
andi[s]., {ori, xori}[s] : D-form
{and, andc, nand, or, orc, nor, xor, eqv}[.],
exts{b, h, w}[.], cnt{l, t}z{w, d}[.],
popcnt{b, w, d}, prty{w, d}, cmp, bpermd : X-form
With this patch, all the fixed-point logical instructions have been
moved to decodetree.
The changes were verified by validating that the tcg ops generated by those
instructions remain the same, which were captured with the '-d in_asm,op' flag.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Chinmay Rath <rathc@linux.ibm.com>
[np: 32-bit compile fix]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Moving the following instructions to decodetree specification :
cmp{rb, eqb}, t{w, d} : X-form
t{w, d}i : D-form
isel : A-form
The changes were verified by validating that the tcg ops generated by those
instructions remain the same, which were captured using the '-d in_asm,op' flag.
Also for CMPRB, following review comments :
Replaced repetition of arithmetic right shifting (tcg_gen_shri_i32) followed
by extraction of last 8 bits (tcg_gen_ext8u_i32) with extraction of the required
bits using offsets (tcg_gen_extract_i32).
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Chinmay Rath <rathc@linux.ibm.com>
[np: 32-bit compile fix]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Moving the below instructions to decodetree specification :
divd[u, e, eu][o][.] : XO-form
mod{sd, ud} : X-form
With this patch, all the fixed-point arithmetic instructions have been
moved to decodetree.
The changes were verified by validating that the tcg ops generated by those
instructions remain the same, which were captured using the '-d in_asm,op' flag.
Also, remaned do_divwe method in fixedpoint-impl.c.inc to do_dive because it is
now used to divide doubleword operands as well, and not just words.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Chinmay Rath <rathc@linux.ibm.com>
[np: 32-bit compile fix]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Moving the following instructions to decodetree :
mul{ld, ldo, hd, hdu}[.] : XO-form
madd{hd, hdu, ld} : VA-form
The changes were verified by validating that the tcg ops generated by those
instructions remain the same, which were captured with the '-d in_asm,op'
flag.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Chinmay Rath <rathc@linux.ibm.com>
[np: 32-bit compile fix]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Moving the below instructions to decodetree specification :
neg[o][.] : XO-form
mod{sw, uw}, darn : X-form
The changes were verified by validating that the tcg ops generated by those
instructions remain the same, which were captured with the '-d in_asm,op' flag.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Chinmay Rath <rathc@linux.ibm.com>
[np: 32-bit compile fix]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Moving the following instructions to decodetree specification :
divw[u, e, eu][o][.] : XO-form
The changes were verified by validating that the tcg ops generated by those
instructions remain the same, which were captured with the '-d in_asm,op' flag.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Chinmay Rath <rathc@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Moving the following instructions to decodetree specification :
mulli : D-form
mul{lw, lwo, hw, hwu}[.] : XO-form
The changes were verified by validating that the tcg ops generated by those
instructions remain the same, which were captured with the '-d in_asm,op' flag.
Also cleaned up code for mullw[o][.] as per review comments while
keeping the logic of the tcg ops generated semantically same.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Chinmay Rath <rathc@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
This patch moves the below instructions to decodetree specification :
f{add, sub, mul, div, re, rsqrte, madd, msub, nmadd, nmsub}[s][.] : A-form
ft{div, sqrt} : X-form
With this patch, all the floating-point arithmetic instructions have been
moved to decodetree.
The changes were verified by validating that the tcg ops generated by those
instructions remain the same, which were captured with the '-d in_asm,op' flag.
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Chinmay Rath <rathc@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
POWER10 adds a new field to sync for store-store syncs, and some
new variants of the existing syncs that include persistent memory.
Implement the store-store syncs and plwsync/phwsync.
Reviewed-by: Chinmay Rath <rathc@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Memory barriers are supposed to do something on BookE systems, these
were probably just missed during MTTCG enablement, maybe no targets
support SMP. Either way, add proper BookE implementations.
Reviewed-by: Chinmay Rath <rathc@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
This tries to faithfully reproduce the odd BookE logic. Note the
e206 check in gen_msync_4xx() is always false, so not carried over.
It does change the handling of non-zero reserved bits outside the
defined fields from being illegal to being ignored, which the
architecture specifies ot help with backward compatibility of new
fields. The existing behaviour causes illegal instruction exceptions
when using new POWER10 sync variants that add new fields, after this
the instructions are accepted and are implemented as supersets of
the new behaviour, as intended.
Reviewed-by: Chinmay Rath <rathc@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
With mttcg, broadcast tlbie instructions do not wait until other vCPUs
have been kicked out of TCG execution before they complete (including
necessary subsequent tlbsync, etc., instructions). This is contrary to
the ISA, and it permits other vCPUs to use translations after the TLB
flush. For example:
CPU0
// *memP is initially 0, memV maps to memP with *pte
*pte = 0;
ptesync ; tlbie ; eieio ; tlbsync ; ptesync
*memP = 1;
CPU1
assert(*memV == 0);
It is possible for the assertion to fail because CPU1 translates memV
using the TLB after CPU0 has stored 1 to the underlying memory. This
race was observed with a careful test case where CPU1 checks run in a
very large expensive TB so it can run for the entire CPU0 period between
clearing the pte and storing the memory, but host vCPU thread preemption
could cause the race to hit anywhere.
As explained in commit 4ddc104689 ("target/ppc: Fix tlbie"), it is not
enough to just use tlb_flush_all_cpus_synced(), because that does not
execute until the calling CPU has finished its TB. It is also required
that the TB is ended at the point where the TLB flush must subsequently
take effect.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
This patch moves the below instructions to decodetree specification:
{add, subf}[c,e,me,ze][o][.] : XO-form
addic[.], subfic : D-form
addex : Z23-form
This patch introduces XO form instructions into decode tree
specification, for which all the four variations([o][.]) have been
handled with a single pattern. The changes were verified by validating
that the tcg ops generated by those instructions remain the same, which
were captured with the '-d in_asm,op' flag.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
Signed-off-by: Chinmay Rath <rathc@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
The move to decodetree flipped the inequality test for the VEC / VSX
MSR facility check.
This caused application crashes under Linux, where these facility
unavailable interrupts are used for lazy-switching of VEC/VSX register
sets. Getting the incorrect interrupt would result in wrong registers
being loaded, potentially overwriting live values and/or exposing
stale ones.
Cc: qemu-stable@nongnu.org
Reported-by: Joel Stanley <joel@jms.id.au>
Fixes: 70426b5bb7 ("target/ppc: moved stxvx and lxvx from legacy to decodtree")
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1769
Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
Tested-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Tested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Correct typos automatically found with the `typos` tool
<https://crates.io/crates/typos>
Signed-off-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org>
Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>
(mjt: remove 2 "arbitrer" hunks, suggested by BALATON)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Allow the name 'cpu_env' to be used for something else.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
LQ, STQ have the same register-pair ordering as LQARX/STQARX., which is
the even (lower) register contains the most significant bits. This is
not implemented correctly for big-endian.
do_ldst_quad() has variables low_addr_gpr and high_addr_gpr which is
confusing because they are low and high addresses, whereas LQARX/STQARX.
and most such things use the low and high values for lo/hi variables.
The conversion to native 128-bit memory access functions missed this
strangeness.
Fix this by changing the if condition, and change the variable names to
hi/lo to match convention.
Cc: qemu-stable@nongnu.org
Reported-by: Ivan Warren <ivan@vmfacility.fr>
Fixes: 57b38ffd0c ("target/ppc: Use tcg_gen_qemu_{ld,st}_i128 for LQARX, LQ, STQ")
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1836
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Now that gen_icount_io_start() is a simple wrapper to
translator_io_start(), inline it.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20230602095439.48102-1-philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
The following commits changed the code such that the fallback to MFSS for MFFSCRN,
MFFSCRNI, MFFSCE and MFFSL on pre 3.0 ISAs was removed and became an illegal instruction:
bf8adfd88b - target/ppc: Move mffscrn[i] to decodetree
394c2e2fda - target/ppc: Move mffsce to decodetree
3e5bce70ef - target/ppc: Move mffsl to decodetree
The hardware will handle them as a MFFS instruction as the code did previously.
This means applications that were segfaulting under qemu when encountering these
instructions which is used in glibc libm functions for example.
The fallback for MFFSCDRN and MFFSCDRNI added in a later patch was also missing.
This patch restores the fallback to MFSS for these instructions on pre 3.0s ISAs
as the hardware decoder would, fixing the segfaulting libm code. It doesn't have
the fallback for 3.0 onwards to match hardware behaviour.
Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Reviewed-by: Matheus Ferst <matheus.ferst@eldorado.org.br>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230510111913.1718734-1-richard.purdie@linuxfoundation.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
No need to roll our own, as this is now provided by tcg.
This was the last use of retxl, so remove that too.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
In function do_extractm() the mask is calculated as
dup_const(1 << (element_width - 1)). '1' being signed int
works fine for MO_8,16,32. For MO_64, on PPC64 host
this ends up becoming 0 on compilation. The vextractdm
uses MO_64, and it ends up having mask as 0.
Explicitly use 1ULL instead of signed int 1 like its
used everywhere else.
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1536
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Lucas Mateus Castro <lucas.araujo@eldorado.org.br>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Message-Id: <168319292809.1159309.5817546227121323288.stgit@ltc-boston1.aus.stglabs.ibm.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Compute all carry bits in parallel instead of a loop.
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
All uses are strictly read-only.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
All remaining uses are strictly read-only.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Initialize a new temp instead of tcg_const_*.
Fix a pasto in a comment.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
All remaining uses are strictly read-only.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Compute both partial results separately and accumulate
at the end, instead of accumulating in the middle.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Move the body out of this large macro.
Use tcg_constant_i64.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Translators are no longer required to free tcg temporaries.
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Since tcg_temp_new is now identical, use that.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Used gvec to translate XVTSTDCSP and XVTSTDCDP.
xvtstdcsp:
rept loop imm master version prev version current version
25 4000 0 0,206200 0,040730 (-80.2%) 0,040740 (-80.2%)
25 4000 1 0,205120 0,053650 (-73.8%) 0,053510 (-73.9%)
25 4000 3 0,206160 0,058630 (-71.6%) 0,058570 (-71.6%)
25 4000 51 0,217110 0,191490 (-11.8%) 0,192320 (-11.4%)
25 4000 127 0,206160 0,191490 (-7.1%) 0,192640 (-6.6%)
8000 12 0 1,234719 0,418833 (-66.1%) 0,386365 (-68.7%)
8000 12 1 1,232417 1,435979 (+16.5%) 1,462792 (+18.7%)
8000 12 3 1,232760 1,766073 (+43.3%) 1,743990 (+41.5%)
8000 12 51 1,239281 1,319562 (+6.5%) 1,423479 (+14.9%)
8000 12 127 1,231708 1,315760 (+6.8%) 1,426667 (+15.8%)
xvtstdcdp:
rept loop imm master version prev version current version
25 4000 0 0,159930 0,040830 (-74.5%) 0,040610 (-74.6%)
25 4000 1 0,160640 0,053670 (-66.6%) 0,053480 (-66.7%)
25 4000 3 0,160020 0,063030 (-60.6%) 0,062960 (-60.7%)
25 4000 51 0,160410 0,128620 (-19.8%) 0,127470 (-20.5%)
25 4000 127 0,160330 0,127670 (-20.4%) 0,128690 (-19.7%)
8000 12 0 1,190365 0,422146 (-64.5%) 0,388417 (-67.4%)
8000 12 1 1,191292 1,445312 (+21.3%) 1,428698 (+19.9%)
8000 12 3 1,188687 1,980656 (+66.6%) 1,975354 (+66.2%)
8000 12 51 1,191250 1,264500 (+6.1%) 1,355083 (+13.8%)
8000 12 127 1,197313 1,266729 (+5.8%) 1,349156 (+12.7%)
Overall, these instructions are the hardest ones to measure performance
as the gvec implementation is affected by the immediate. Above there are
5 different scenarios when it comes to immediate and 2 when it comes to
rept/loop combination. The immediates scenarios are: all bits are 0
therefore the target register should just be changed to 0, with 1 bit
set, with 2 bits set in a combination the new implementation can deal
with using gvec, 4 bits set and the new implementation can't deal with
it using gvec and all bits set. The rept/loop scenarios are high loop
and low rept (so it should spend more time executing it than translating
it) and high rept low loop (so it should spend more time translating it
than executing this code).
These comparisons are between the upstream version, a previous similar
implementation and a one with a cleaner code(this one).
For a comparison with o previous different implementation:
<20221010191356.83659-13-lucas.araujo@eldorado.org.br>
Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20221019125040.48028-13-lucas.araujo@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Moved XSTSTDCSP, XSTSTDCDP and XSTSTDCQP to decodetree and moved some of
its decoding away from the helper as previously the DCMX, XB and BF were
calculated in the helper with the help of cpu_env, now that part was
moved to the decodetree with the rest.
xvtstdcsp:
rept loop master patch
8 12500 1,85393600 1,94683600 (+5.0%)
25 4000 1,78779800 1,92479000 (+7.7%)
100 1000 2,12775000 2,28895500 (+7.6%)
500 200 2,99655300 3,23102900 (+7.8%)
2500 40 6,89082200 7,44827500 (+8.1%)
8000 12 17,50585500 18,95152100 (+8.3%)
xvtstdcdp:
rept loop master patch
8 12500 1,39043100 1,33539800 (-4.0%)
25 4000 1,35731800 1,37347800 (+1.2%)
100 1000 1,51514800 1,56053000 (+3.0%)
500 200 2,21014400 2,47906000 (+12.2%)
2500 40 5,39488200 6,68766700 (+24.0%)
8000 12 13,98623900 18,17661900 (+30.0%)
xvtstdcdp:
rept loop master patch
8 12500 1,35123800 1,34455800 (-0.5%)
25 4000 1,36441200 1,36759600 (+0.2%)
100 1000 1,49763500 1,54138400 (+2.9%)
500 200 2,19020200 2,46196400 (+12.4%)
2500 40 5,39265700 6,68147900 (+23.9%)
8000 12 14,04163600 18,19669600 (+29.6%)
As some values are now decoded outside the helper and passed to it as an
argument the number of arguments of the helper increased, the number
of TCGop needed to load the arguments increased. I suspect that's why
the slow-down in the tests with a high REPT but low LOOP.
Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20221019125040.48028-12-lucas.araujo@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Moved XVTSTDCSP and XVTSTDCDP to decodetree an restructured the helper
to be simpler and do all decoding in the decodetree (so XB, XT and DCMX
are all calculated outside the helper).
Obs: The tests in this one are slightly different, these are the sum of
these instructions with all possible immediate and those instructions
are repeated 10 times.
xvtstdcsp:
rept loop master patch
8 12500 2,76402100 2,70699100 (-2.1%)
25 4000 2,64867100 2,67884100 (+1.1%)
100 1000 2,73806300 2,78701000 (+1.8%)
500 200 3,44666500 3,61027600 (+4.7%)
2500 40 5,85790200 6,47475500 (+10.5%)
8000 12 15,22102100 17,46062900 (+14.7%)
xvtstdcdp:
rept loop master patch
8 12500 2,11818000 1,61065300 (-24.0%)
25 4000 2,04573400 1,60132200 (-21.7%)
100 1000 2,13834100 1,69988100 (-20.5%)
500 200 2,73977000 2,48631700 (-9.3%)
2500 40 5,05067000 5,25914100 (+4.1%)
8000 12 14,60507800 15,93704900 (+9.1%)
Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20221019125040.48028-11-lucas.araujo@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Moved VPRTYBW and VPRTYBD to use gvec and both of them and VPRTYBQ to
decodetree. VPRTYBW and VPRTYBD now also use .fni4 and .fni8,
respectively.
vprtybw:
rept loop master patch
8 12500 0,01198900 0,00703100 (-41.4%)
25 4000 0,01070100 0,00571400 (-46.6%)
100 1000 0,01123300 0,00678200 (-39.6%)
500 200 0,01601500 0,01535600 (-4.1%)
2500 40 0,03872900 0,05562100 (43.6%)
8000 12 0,10047000 0,16643000 (65.7%)
vprtybd:
rept loop master patch
8 12500 0,00757700 0,00788100 (4.0%)
25 4000 0,00652500 0,00669600 (2.6%)
100 1000 0,00714400 0,00825400 (15.5%)
500 200 0,01211000 0,01903700 (57.2%)
2500 40 0,03483800 0,07021200 (101.5%)
8000 12 0,09591800 0,21036200 (119.3%)
vprtybq:
rept loop master patch
8 12500 0,00675600 0,00667200 (-1.2%)
25 4000 0,00619400 0,00643200 (3.8%)
100 1000 0,00707100 0,00751100 (6.2%)
500 200 0,01199300 0,01342000 (11.9%)
2500 40 0,03490900 0,04092900 (17.2%)
8000 12 0,09588200 0,11465100 (19.6%)
I wasn't expecting such a performance lost in both VPRTYBD and VPRTYBQ,
I'm not sure if it's worth to move those instructions. Comparing the
assembly of the helper with the TCGop they are pretty similar, so
I'm not sure why vprtybd took so much more time.
Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20221019125040.48028-6-lucas.araujo@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
This patch moves VADDCUW and VSUBCUW to decodtree with gvec using an
implementation based on the helper, with the main difference being
changing the -1 (aka all bits set to 1) result returned by cmp when
true to +1. It also implemented a .fni4 version of those instructions
and dropped the helper.
vaddcuw:
rept loop master patch
8 12500 0,01008200 0,00612400 (-39.3%)
25 4000 0,01091500 0,00471600 (-56.8%)
100 1000 0,01332500 0,00593700 (-55.4%)
500 200 0,01998500 0,01275700 (-36.2%)
2500 40 0,04704300 0,04364300 (-7.2%)
8000 12 0,10748200 0,11241000 (+4.6%)
vsubcuw:
rept loop master patch
8 12500 0,01226200 0,00571600 (-53.4%)
25 4000 0,01493500 0,00462100 (-69.1%)
100 1000 0,01522700 0,00455100 (-70.1%)
500 200 0,02384600 0,01133500 (-52.5%)
2500 40 0,04935200 0,03178100 (-35.6%)
8000 12 0,09039900 0,09440600 (+4.4%)
Overall there was a gain in performance, but the TCGop code was still
slightly bigger in the new version (it went from 4 to 5).
Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20221019125040.48028-4-lucas.araujo@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>