Commit Graph

97 Commits

Author SHA1 Message Date
Philippe Mathieu-Daudé
ff676046fb misc: remove duplicated includes
exec: housekeeping (funny since 02d0e09503)

applied using ./scripts/clean-includes

Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2017-12-18 17:07:02 +03:00
Peter Maydell
4d60b25b37 accel/tcg/cpu-exec-common.c: Remove unnecessary include of memory-internal.h
The cpu-exec-common.c file includes memory-internal.h, but it doesn't
actually use anything from that header. Remove the unnecessary include.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Tested-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2017-12-18 17:07:02 +03:00
Emilio G. Cota
55bbc8610c translate-all: fix 'consisits' typo in comment
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2017-12-18 17:07:02 +03:00
Juan Quintela
8f2c4cbc76 tcg: Fix compilation without TCG
Commit 2726627197 started to use tb_unlock() and tlb_set_dirty() on
non TCG code.  Add the functions as stubs, so that builds with TCG
disabled continue to compile.

Signed-off-by: Juan Quintela <quintela@redhat.com>
Acked-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
[PMM: tweaked commit message]
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2017-11-23 10:02:44 +00:00
Peter Maydell
34d49937e4 accel/tcg: Handle atomic accesses to notdirty memory correctly
To do a write to memory that is marked as notdirty, we need
to invalidate any TBs we have cached for that memory, and
update the cpu physical memory dirty flags for VGA and migration.
The slowpath code in notdirty_mem_write() does all this correctly,
but the new atomic handling code in atomic_mmu_lookup() doesn't
do anything at all, it just clears the dirty bit in the TLB.

The effect of this bug is that if the first write to a notdirty
page for which we have cached TBs is by a guest atomic access,
we fail to invalidate the TBs and subsequently will execute
incorrect code. This can be seen by trying to run 'javac' on AArch64.

Use the new notdirty_call_before() and notdirty_call_after()
functions to correctly handle the update to notdirty memory
in the atomic codepath.

Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 1511201308-23580-3-git-send-email-peter.maydell@linaro.org
2017-11-21 12:09:25 +00:00
Peter Maydell
b11ce33fe0 Revert "cpu-exec: don't overwrite exception_index"
This reverts commit e01cecabf3,
which breaks booting of aarch64 Linux images.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2017-11-20 10:58:27 +00:00
Peter Maydell
62955e101e Miscellaneous bugfixes
-----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAloMXN0UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNNAQf/e7/uT2tW7WNfamSOMYXswf0R6ak+
 KjVSG+qiNsKaZzXmMFkhm4n0u1vCW0VGEQGRHr0MoSCyyfhupzLRHxfHi8SytqTf
 S6wqNtIbOK0L8bW+U5vzADks33UCuuUNlVZeOAkEPaXiLlgxmBoHfyoXkIGemJc2
 epx5x22rloNQLaBoL7FGmAkQhQCSJg19hAtRLo0tkryCwBZ9P6a1K0aNAHU2RFaB
 LgRFcxwduwTydsHRYeQ8J7YR0fERle01QUB8y9tlOc8/d2x9yRPBWhPHwscKMv6I
 JwM0c2Mnw6Yqbwyj7snWty7epgUcHWrOVnZnaIpNW9Z8m/wgz28oZ3a09w==
 =6wL6
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging

Miscellaneous bugfixes

# gpg: Signature made Wed 15 Nov 2017 15:27:25 GMT
# gpg:                using RSA key 0xBFFBD25F78C7AE83
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>"
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>"
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4  E2F7 7E15 100C CD36 69B1
#      Subkey fingerprint: F133 3857 4B66 2389 866C  7682 BFFB D25F 78C7 AE83

* remotes/bonzini/tags/for-upstream:
  fix scripts/update-linux-headers.sh here document
  exec: Do not resolve subpage in mru_section
  util/stats64: Fix min/max comparisons
  cpu-exec: avoid cpu_exec_nocache infinite loop with record/replay
  cpu-exec: don't overwrite exception_index
  vhost-user-scsi: add missing virtqueue_size param
  target-i386: adds PV_TLB_FLUSH CPUID feature bit
  thread-posix: fix qemu_rec_mutex_trylock macro
  Makefile: simpler/faster "make help"
  ioapic/tracing: Remove last DPRINTFs
  Enable 8-byte wide MMIO for 16550 serial devices

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2017-11-16 14:42:54 +00:00
Richard Henderson
ec603b5584 tcg: Record code_gen_buffer address for user-only memory helpers
When we handle a signal from a fault within a user-only memory helper,
we cannot cpu_restore_state with the PC found within the signal frame.
Use a TLS variable, helper_retaddr, to record the unwind start point
to find the faulting guest insn.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-11-15 10:33:27 +01:00
Pavel Dovgalyuk
17b50b0c29 cpu-exec: avoid cpu_exec_nocache infinite loop with record/replay
This patch ensures that icount_decr.u32.high is clear before calling
cpu_exec_nocache when exception is pending.  Because the exception is
caused by the first instruction in the block and it cannot be executed
without resetting the flag.

There are two parts in the fix.  First, clear icount_decr.u32.high in
cpu_handle_interrupt (just before processing the "dependent" request,
stored in cpu->interrupt_request or cpu->exit_request) rather than
cpu_loop_exec_tb; this ensures that cpu_handle_exception is always
reached with zero icount_decr.u32.high unless another interrupt has
happened in the meanwhile.

Second, try to cause the exception at the beginning of
cpu_handle_exception, and exit immediately if the TB cannot
execute.  With this change, interrupts are processed and
cpu_exec_nocache can make process.

Signed-off-by: Maria Klimushenkova <maria.klimushenkova@ispras.ru>
Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
Message-Id: <20171114081818.27640.33165.stgit@pasha-VirtualBox>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-11-14 14:46:46 +01:00
Pavel Dovgalyuk
e01cecabf3 cpu-exec: don't overwrite exception_index
This patch adds a condition before overwriting exception_index fiels.
It is needed when exception_index is already set to some meaningful value.

Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>

Message-Id: <20171114081812.27640.26372.stgit@pasha-VirtualBox>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-11-14 14:46:46 +01:00
Alex Bennée
d25f2a7227 accel/tcg/translate-all: expand cpu_restore_state addr check
We are still seeing signals during translation time when we walk over
a page protection boundary. This expands the check to ensure the host
PC is inside the code generation buffer. The original suggestion was
to check versus tcg_ctx.code_gen_ptr but as we now segment the
translation buffer we have to settle for just a general check for
being inside.

I've also fixed up the declaration to make it clear it can deal with
invalid addresses. A later patch will fix up the call sites.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20171108153245.20740-2-alex.bennee@linaro.org
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Tested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2017-11-13 13:55:27 +00:00
Peter Maydell
426eeecdf5 cpu-exec: Exit exclusive region on longjmp from step_atomic
Commit ac03ee5331 narrowed the scope of the exclusive
region so it only covers when we're executing the TB, not when
we're generating it. However it missed that there is more than
one execution path out of cpu_tb_exec -- if the atomic insn
causes an exception then the code will longjmp out, skipping
the code to end the exclusive region. This causes QEMU to hang
the next time the CPU calls start_exclusive(), waiting for
itself to exit the region.

Move the "end the region" code out to the end of the
function so that it is run for both normal exit and also
for exit-via-longjmp. We have to use a volatile bool flag
to decide whether we need to end the region, because we
can longjump out of the codegen as well as the execution.

(For some reason this only reproduces for me with a clang
optimized build, not a gcc debug build.)

Reviewed-by: Emilio G. Cota <cota@braap.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Fixes: ac03ee5331
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <1509640536-32160-1-git-send-email-peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-11-03 09:34:21 +01:00
Emilio G. Cota
cc689485ee translate-all: exit from tb_phys_invalidate if qht_remove fails
Two or more threads might race while invalidating the same TB. We currently
do not check for this at all despite taking tb_lock, which means we would
wrongly invalidate the same TB more than once. This bug has actually been
hit by users: I recently saw a report on IRC, although I have yet to see
the corresponding test case.

Fix this by using qht_remove as the synchronization point; if it fails,
that means the TB has already been invalidated, and therefore there
is nothing left to do in tb_phys_invalidate.

Note that this solution works now that we still have tb_lock, and will
continue working once we remove tb_lock.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1508445114-4717-1-git-send-email-cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Emilio G. Cota
3468b59e18 tcg: enable multiple TCG contexts in softmmu
This enables parallel TCG code generation. However, we do not take
advantage of it yet since tb_lock is still held during tb_gen_code.

In user-mode we use a single TCG context; see the documentation
added to tcg_region_init for the rationale.

Note that targets do not need any conversion: targets initialize a
TCGContext (e.g. defining TCG globals), and after this initialization
has finished, the context is cloned by the vCPU threads, each of
them keeping a separate copy.

TCG threads claim one entry in tcg_ctxs[] by atomically increasing
n_tcg_ctxs. Do not be too annoyed by the subsequent atomic_read's
of that variable and tcg_ctxs; they are there just to play nice with
analysis tools such as thread sanitizer.

Note that we do not allocate an array of contexts (we allocate
an array of pointers instead) because when tcg_context_init
is called, we do not know yet how many contexts we'll use since
the bool behind qemu_tcg_mttcg_enabled() isn't set yet.

Previous patches folded some TCG globals into TCGContext. The non-const
globals remaining are only set at init time, i.e. before the TCG
threads are spawned. Here is a list of these set-at-init-time globals
under tcg/:

Only written by tcg_context_init:
- indirect_reg_alloc_order
- tcg_op_defs
Only written by tcg_target_init (called from tcg_context_init):
- tcg_target_available_regs
- tcg_target_call_clobber_regs
- arm: arm_arch, use_idiv_instructions
- i386: have_cmov, have_bmi1, have_bmi2, have_lzcnt,
        have_movbe, have_popcnt
- mips: use_movnz_instructions, use_mips32_instructions,
        use_mips32r2_instructions, got_sigill (tcg_target_detect_isa)
- ppc: have_isa_2_06, have_isa_3_00, tb_ret_addr
- s390: tb_ret_addr, s390_facilities
- sparc: qemu_ld_trampoline, qemu_st_trampoline (build_trampolines),
         use_vis3_instructions

Only written by tcg_prologue_init:
- 'struct jit_code_entry one_entry'
- aarch64: tb_ret_addr
- arm: tb_ret_addr
- i386: tb_ret_addr, guest_base_flags
- ia64: tb_ret_addr
- mips: tb_ret_addr, bswap32_addr, bswap32u_addr, bswap64_addr

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Emilio G. Cota
e8feb96fcc tcg: introduce regions to split code_gen_buffer
This is groundwork for supporting multiple TCG contexts.

The naive solution here is to split code_gen_buffer statically
among the TCG threads; this however results in poor utilization
if translation needs are different across TCG threads.

What we do here is to add an extra layer of indirection, assigning
regions that act just like pages do in virtual memory allocation.
(BTW if you are wondering about the chosen naming, I did not want
to use blocks or pages because those are already heavily used in QEMU).

We use a global lock to serialize allocations as well as statistics
reporting (we now export the size of the used code_gen_buffer with
tcg_code_size()). Note that for the allocator we could just use
a counter and atomic_inc; however, that would complicate the gathering
of tcg_code_size()-like stats. So given that the region operations are
not a fast path, a lock seems the most reasonable choice.

The effectiveness of this approach is clear after seeing some numbers.
I used the bootup+shutdown of debian-arm with '-tb-size 80' as a benchmark.
Note that I'm evaluating this after enabling per-thread TCG (which
is done by a subsequent commit).

* -smp 1, 1 region (entire buffer):
    qemu: flush code_size=83885014 nb_tbs=154739 avg_tb_size=357
    qemu: flush code_size=83884902 nb_tbs=153136 avg_tb_size=363
    qemu: flush code_size=83885014 nb_tbs=152777 avg_tb_size=364
    qemu: flush code_size=83884950 nb_tbs=150057 avg_tb_size=373
    qemu: flush code_size=83884998 nb_tbs=150234 avg_tb_size=373
    qemu: flush code_size=83885014 nb_tbs=154009 avg_tb_size=360
    qemu: flush code_size=83885014 nb_tbs=151007 avg_tb_size=370
    qemu: flush code_size=83885014 nb_tbs=151816 avg_tb_size=367

That is, 8 flushes.

* -smp 8, 32 regions (80/32 MB per region) [i.e. this patch]:

    qemu: flush code_size=76328008 nb_tbs=141040 avg_tb_size=356
    qemu: flush code_size=75366534 nb_tbs=138000 avg_tb_size=361
    qemu: flush code_size=76864546 nb_tbs=140653 avg_tb_size=361
    qemu: flush code_size=76309084 nb_tbs=135945 avg_tb_size=375
    qemu: flush code_size=74581856 nb_tbs=132909 avg_tb_size=375
    qemu: flush code_size=73927256 nb_tbs=135616 avg_tb_size=360
    qemu: flush code_size=78629426 nb_tbs=142896 avg_tb_size=365
    qemu: flush code_size=76667052 nb_tbs=138508 avg_tb_size=368

Again, 8 flushes. Note how buffer utilization is not 100%, but it
is close. Smaller region sizes would yield higher utilization,
but we want region allocation to be rare (it acquires a lock), so
we do not want to go too small.

* -smp 8, static partitioning of 8 regions (10 MB per region):
    qemu: flush code_size=21936504 nb_tbs=40570 avg_tb_size=354
    qemu: flush code_size=11472174 nb_tbs=20633 avg_tb_size=370
    qemu: flush code_size=11603976 nb_tbs=21059 avg_tb_size=365
    qemu: flush code_size=23254872 nb_tbs=41243 avg_tb_size=377
    qemu: flush code_size=28289496 nb_tbs=52057 avg_tb_size=358
    qemu: flush code_size=43605160 nb_tbs=78896 avg_tb_size=367
    qemu: flush code_size=45166552 nb_tbs=82158 avg_tb_size=364
    qemu: flush code_size=63289640 nb_tbs=116494 avg_tb_size=358
    qemu: flush code_size=51389960 nb_tbs=93937 avg_tb_size=362
    qemu: flush code_size=59665928 nb_tbs=107063 avg_tb_size=372
    qemu: flush code_size=38380824 nb_tbs=68597 avg_tb_size=374
    qemu: flush code_size=44884568 nb_tbs=79901 avg_tb_size=376
    qemu: flush code_size=50782632 nb_tbs=90681 avg_tb_size=374
    qemu: flush code_size=39848888 nb_tbs=71433 avg_tb_size=372
    qemu: flush code_size=64708840 nb_tbs=119052 avg_tb_size=359
    qemu: flush code_size=49830008 nb_tbs=90992 avg_tb_size=362
    qemu: flush code_size=68372408 nb_tbs=123442 avg_tb_size=368
    qemu: flush code_size=33555560 nb_tbs=59514 avg_tb_size=378
    qemu: flush code_size=44748344 nb_tbs=80974 avg_tb_size=367
    qemu: flush code_size=37104248 nb_tbs=67609 avg_tb_size=364

That is, 20 flushes. Note how a static partitioning approach uses
the code buffer poorly, leading to many unnecessary flushes.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Emilio G. Cota
f51f315a67 translate-all: use qemu_protect_rwx/none helpers
The helpers require the address and size to be page-aligned, so
do that before calling them.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Emilio G. Cota
c3fac1138e tcg: distribute profiling counters across TCGContext's
This is groundwork for supporting multiple TCG contexts.

To avoid scalability issues when profiling info is enabled, this patch
makes the profiling info counters distributed via the following changes:

1) Consolidate profile info into its own struct, TCGProfile, which
   TCGContext also includes. Note that tcg_table_op_count is brought
   into TCGProfile after dropping the tcg_ prefix.
2) Iterate over the TCG contexts in the system to obtain the total counts.

This change also requires updating the accessors to TCGProfile fields to
use atomic_read/set whenever there may be conflicting accesses (as defined
in C11) to them.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Emilio G. Cota
b1311c4acf tcg: define tcg_init_ctx and make tcg_ctx a pointer
Groundwork for supporting multiple TCG contexts.

The core of this patch is this change to tcg/tcg.h:

> -extern TCGContext tcg_ctx;
> +extern TCGContext tcg_init_ctx;
> +extern TCGContext *tcg_ctx;

Note that for now we set *tcg_ctx to whatever TCGContext is passed
to tcg_context_init -- in this case &tcg_init_ctx.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Emilio G. Cota
44ded3d048 tcg: take tb_ctx out of TCGContext
Groundwork for supporting multiple TCG contexts.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Emilio G. Cota
f19c6cc6fc translate-all: report correct avg host TB size
Since commit 6e3b2bfd6 ("tcg: allocate TB structs before the
corresponding translated code") we are not fully utilizing
code_gen_buffer for translated code, and therefore are
incorrectly reporting the amount of translated code as well as
the average host TB size. Address this by:

- Making the conscious choice of misreporting the total translated code;
  doing otherwise would mislead users into thinking "-tb-size" is not
  honoured.

- Expanding tb_tree_stats to accurately count the bytes of translated code on
  the host, and using this for reporting the average tb host size,
  as well as the expansion ratio.

In the future we might want to consider reporting the accurate numbers for
the total translated code, together with a "bookkeeping/overhead" field to
account for the TB structs.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Emilio G. Cota
be1e01171b exec-all: rename tb_free to tb_remove
We don't really free anything in this function anymore; we just remove
the TB from the binary search tree.

Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Emilio G. Cota
2ac01d6daf translate-all: use a binary search tree to track TBs in TBContext
This is a prerequisite for supporting multiple TCG contexts, since
we will have threads generating code in separate regions of
code_gen_buffer.

For this we need a new field (.size) in struct tb_tc to keep
track of the size of the translated code. This field uses a size_t
to avoid adding a hole to the struct, although really an unsigned
int would have been enough.

The comparison function we use is optimized for the common case:
insertions. Profiling shows that upon booting debian-arm, 98%
of comparisons are between existing tb's (i.e. a->size and b->size
are both !0), which happens during insertions (and removals, but
those are rare). The remaining cases are lookups. From reading the glib
sources we see that the first key is always the lookup key. However,
the code does not assume this to always be the case because this
behaviour is not guaranteed in the glib docs. However, we embed
this knowledge in the code as a branch hint for the compiler.

Note that tb_free does not free space in the code_gen_buffer anymore,
since we cannot easily know whether the tb is the last one inserted
in code_gen_buffer. The next patch in this series renames tb_free
to tb_remove to reflect this.

Performance-wise, lookups in tb_find_pc are the same as before:
O(log n). However, insertions are O(log n) instead of O(1), which
results in a small slowdown when booting debian-arm:

Performance counter stats for 'build/arm-softmmu/qemu-system-arm \
	-machine type=virt -nographic -smp 1 -m 4096 \
	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
	-device virtio-net-device,netdev=unet \
	-drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \
	-device virtio-blk-device,drive=myblock \
	-kernel img/arm/aarch32-current-linux-kernel-only.img \
	-append console=ttyAMA0 root=/dev/vda1 \
	-name arm,debug-threads=on -smp 1' (10 runs):

- Before:

       8048.598422      task-clock (msec)         #    0.931 CPUs utilized            ( +-  0.28% )
            16,974      context-switches          #    0.002 M/sec                    ( +-  0.12% )
                 0      cpu-migrations            #    0.000 K/sec
            10,125      page-faults               #    0.001 M/sec                    ( +-  1.23% )
    35,144,901,879      cycles                    #    4.367 GHz                      ( +-  0.14% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    65,758,252,643      instructions              #    1.87  insns per cycle          ( +-  0.33% )
    10,871,298,668      branches                  # 1350.707 M/sec                    ( +-  0.41% )
       192,322,212      branch-misses             #    1.77% of all branches          ( +-  0.32% )

       8.640869419 seconds time elapsed                                          ( +-  0.57% )

- After:
       8146.242027      task-clock (msec)         #    0.923 CPUs utilized            ( +-  1.23% )
            17,016      context-switches          #    0.002 M/sec                    ( +-  0.40% )
                 0      cpu-migrations            #    0.000 K/sec
            18,769      page-faults               #    0.002 M/sec                    ( +-  0.45% )
    35,660,956,120      cycles                    #    4.378 GHz                      ( +-  1.22% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    65,095,366,607      instructions              #    1.83  insns per cycle          ( +-  1.73% )
    10,803,480,261      branches                  # 1326.192 M/sec                    ( +-  1.95% )
       195,601,289      branch-misses             #    1.81% of all branches          ( +-  0.39% )

       8.828660235 seconds time elapsed                                          ( +-  0.38% )

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Richard Henderson
416986d3f9 tcg: Remove CF_IGNORE_ICOUNT
Now that we have curr_cflags, we can include CF_USE_ICOUNT
early and then remove it as necessary.

Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Emilio G. Cota
ac03ee5331 cpu-exec: lookup/generate TB outside exclusive region during step_atomic
Now that all code generation has been converted to check CF_PARALLEL, we can
generate !CF_PARALLEL code without having yet set !parallel_cpus --
and therefore without having to be in the exclusive region during
cpu_exec_step_atomic.

While at it, merge cpu_exec_step into cpu_exec_step_atomic.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Emilio G. Cota
e82d5a2460 tcg: check CF_PARALLEL instead of parallel_cpus
Thereby decoupling the resulting translated code from the current state
of the system.

The tb->cflags field is not passed to tcg generation functions. So
we add a field to TCGContext, storing there a copy of tb->cflags.

Most architectures have <= 32 registers, which results in a 4-byte hole
in TCGContext. Use this hole for the new field.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:42 -07:00
Emilio G. Cota
c5a49c63fa tcg: convert tb->cflags reads to tb_cflags(tb)
Convert all existing readers of tb->cflags to tb_cflags, so that we
use atomic_read and therefore avoid undefined behaviour in C11.

Note that the remaining setters/getters of the field are protected
by tb_lock, and therefore do not need conversion.

Luckily all readers access the field via 'tb->cflags' (so no foo.cflags,
bar->cflags in the code base), which makes the conversion easily
scriptable:

FILES=$(git grep 'tb->cflags' target include/exec/gen-icount.h \
	 accel/tcg/translator.c | cut -f1 -d':' | sort | uniq)

perl -pi -e 's/([^.>])tb->cflags/$1tb_cflags(tb)/g' $FILES
perl -pi -e 's/([a-z->.]*)(->|\.)tb->cflags/tb_cflags($1$2tb)/g' $FILES

Then manually fixed the few errors that checkpatch reported.

Compile-tested for all targets.

Suggested-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:41 -07:00
Richard Henderson
9b990ee5a3 tcg: Add CPUState cflags_next_tb
We were generating code during tb_invalidate_phys_page_range,
check_watchpoint, cpu_io_recompile, and (seemingly) discarding
the TB, assuming that it would magically be picked up during
the next iteration through the cpu_exec loop.

Instead, record the desired cflags in CPUState so that we request
the proper TB so that there is no more magic.

Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:41 -07:00
Emilio G. Cota
4e2ca83e71 tcg: define CF_PARALLEL and use it for TB hashing along with CF_COUNT_MASK
This will enable us to decouple code translation from the value
of parallel_cpus at any given time. It will also help us minimize
TB flushes when generating code via EXCP_ATOMIC.

Note that the declaration of parallel_cpus is brought to exec-all.h
to be able to define there the "curr_cflags" inline.

Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24 13:53:41 -07:00
David Hildenbrand
f52bfb1214 accel/tcg: allow to invalidate a write TLB entry immediately
Background: s390x implements Low-Address Protection (LAP). If LAP is
enabled, writing to effective addresses (before any translation)
0-511 and 4096-4607 triggers a protection exception.

So we have subpage protection on the first two pages of every address
space (where the lowcore - the CPU private data resides).

By immediately invalidating the write entry but allowing the caller to
continue, we force every write access onto these first two pages into
the slow path. we will get a tlb fault with the specific accessed
addresses and can then evaluate if protection applies or not.

We have to make sure to ignore the invalid bit if tlb_fill() succeeds.

Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20171016202358.3633-2-david@redhat.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2017-10-20 13:32:10 +02:00
Peter Maydell
a8b392ac9a * TCG 8-byte atomic accesses bugfix (Andrew)
* Report disk rotation rate (Daniel)
 * Report invalid scsi-disk block size configuration (Mark)
 * KVM and memory API MemoryListener fixes (David, Maxime, Peter Xu)
 * x86 CPU hotplug crash fix (Igor)
 * Load/store API documentation (Peter Maydell)
 * Small fixes by myself and Thomas
 * qdev DEVICE_DELETED deferral (Michael)
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAlnnJUgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMifwf/dTZwtGqvAV4+jezCiZ3MTknz39dM
 HOGnD3m2xy04QT5LHiwDmaLFXy1y/AUVQm79JMPN4dKoFvtruREoWUq8EU0FCsLZ
 PkdCbJuXKGiBYMRXkQQxeT8lAyaBQwZdc+O9mYuOrSGZOQscA7SxgClYmzVdVzcy
 ZNTqkuaw1NDIAapdfGv94WLza4Nb8XX8bFwohgkf4mLDXifhjYHQTbBTfB0NqPxH
 Rk3HU+wgYUCJRYXpvktESgzRo5sm1aozCRq3f0Y6RV12ylgF6GG4CyN7YcKRn8eh
 NZbyehHiF5YU2kuvO9SmAB+FqM2+aMtq8uuNuI1Nxgd222MOVaChyWc3jg==
 =gmUj
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging

* TCG 8-byte atomic accesses bugfix (Andrew)
* Report disk rotation rate (Daniel)
* Report invalid scsi-disk block size configuration (Mark)
* KVM and memory API MemoryListener fixes (David, Maxime, Peter Xu)
* x86 CPU hotplug crash fix (Igor)
* Load/store API documentation (Peter Maydell)
* Small fixes by myself and Thomas
* qdev DEVICE_DELETED deferral (Michael)

# gpg: Signature made Wed 18 Oct 2017 10:56:24 BST
# gpg:                using RSA key 0xBFFBD25F78C7AE83
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>"
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>"
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4  E2F7 7E15 100C CD36 69B1
#      Subkey fingerprint: F133 3857 4B66 2389 866C  7682 BFFB D25F 78C7 AE83

* remotes/bonzini/tags/for-upstream: (29 commits)
  scsi: reject configurations with logical block size > physical block size
  qdev: defer DEVICE_DEL event until instance_finalize()
  Revert "qdev: Free QemuOpts when the QOM path goes away"
  qdev: store DeviceState's canonical path to use when unparenting
  qemu-pr-helper: use new libmultipath API
  watch_mem_write: implement 8-byte accesses
  notdirty_mem_write: implement 8-byte accesses
  memory: reuse section_from_flat_range()
  kvm: simplify kvm_align_section()
  kvm: region_add and region_del is not called on updates
  kvm: fix error message when failing to unregister slot
  kvm: tolerate non-existing slot for log_start/log_stop/log_sync
  kvm: fix alignment of ram address
  memory: call log_start after region_add
  target/i386: trap on instructions longer than >15 bytes
  target/i386: introduce x86_ld*_code
  tco: add trace events
  docs/devel/loads-stores.rst: Document our various load and store APIs
  nios2: define tcg_env
  build: remove CONFIG_LIBDECNUMBER
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2017-10-19 15:38:07 +01:00
David Hildenbrand
a6ffc4232a kvm: simplify kvm_align_section()
Use ROUND_UP and simplify the code a bit.

Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20171016144302.24284-7-david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-18 10:15:00 +02:00
David Hildenbrand
90ed4bcc3a kvm: region_add and region_del is not called on updates
Attributes are not updated via region_add()/region_del(). Attribute changes
lead to a delete first, followed by a new add.

If this would ever not be the case, we would get an error when trying to
register the new slot.

Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20171016144302.24284-6-david@redhat.com>
Tested-by: Joe Clifford <joeclifford@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-18 10:14:52 +02:00
David Hildenbrand
1c4fdabaf7 kvm: fix error message when failing to unregister slot
"overlapping" is a leftover, let's drop it.

Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20171016144302.24284-5-david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-18 10:14:48 +02:00
David Hildenbrand
e377e87ca6 kvm: tolerate non-existing slot for log_start/log_stop/log_sync
If we want to trap every access to a section, we might not have a
slot. So let's just tolerate if we don't have one.

Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20171016144302.24284-4-david@redhat.com>
Tested-by: Joe Clifford <joeclifford@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-18 10:14:42 +02:00
David Hildenbrand
bbfd3017eb kvm: fix alignment of ram address
Fix the wrong calculation of the delta, used to align the ram address.

This only strikes if alignment has to be done.

Reported-by: Joe Clifford <joeclifford@gmail.com>
Fixes: 5ea69c2e36 ("kvm: factor out alignment of memory section")
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20171016144302.24284-3-david@redhat.com>
Tested-by: Joe Clifford <joeclifford@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-18 10:14:35 +02:00
Richard Henderson
de258eb07d tcg: Fix off-by-one in assert in page_set_flags
Most of the users of page_set_flags offset (page, page + len) as
the end points.  One might consider this an error, since the other
users do supply an endpoint as the last byte of the region.

However, the first thing that page_set_flags does is round end UP
to the start of the next page.  Which means computing page + len - 1
is in the end pointless.  Therefore, accept this usage and do not
assert when given the exact size of the vm as the endpoint.

Signed-off-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20170708025030.15845-2-rth@twiddle.net>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
2017-10-16 16:00:56 +03:00
Emilio G. Cota
e7e168f413 exec-all: extract tb->tc_* into a separate struct tc_tb
In preparation for adding tc.size to be able to keep track of
TB's using the binary search tree implementation from glib.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10 07:37:10 -07:00
Emilio G. Cota
6eb062abd6 translate-all: define and use DEBUG_TB_CHECK_GATE
This prevents bit rot by ensuring the debug code is compiled when
building a user-mode target.

Unfortunately the helpers are user-mode-only so we cannot fully
get rid of the ifdef checks. Add a comment to explain this.

Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10 07:37:10 -07:00
Emilio G. Cota
dae9e03aed translate-all: define and use DEBUG_TB_INVALIDATE_GATE
This gets rid of an ifdef check while ensuring that the debug code
is compiled, which prevents bit rot.

Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10 07:37:10 -07:00
Emilio G. Cota
67a5b5d2f6 exec-all: introduce TB_PAGE_ADDR_FMT
And fix the following warning when DEBUG_TB_INVALIDATE is enabled
in translate-all.c:

  CC      mipsn32-linux-user/accel/tcg/translate-all.o
/data/src/qemu/accel/tcg/translate-all.c: In function ‘tb_alloc_page’:
/data/src/qemu/accel/tcg/translate-all.c:1201:16: error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘tb_page_addr_t {aka unsigned int}’ [-Werror=format=]
         printf("protecting code page: 0x" TARGET_FMT_lx "\n",
                ^
cc1: all warnings being treated as errors
/data/src/qemu/rules.mak:66: recipe for target 'accel/tcg/translate-all.o' failed
make[1]: *** [accel/tcg/translate-all.o] Error 1
Makefile:328: recipe for target 'subdir-mipsn32-linux-user' failed
make: *** [subdir-mipsn32-linux-user] Error 2
cota@flamenco:/data/src/qemu/build ((18f3fe1...) *$)$

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10 07:37:10 -07:00
Emilio G. Cota
424079c13b translate-all: define and use DEBUG_TB_FLUSH_GATE
This gets rid of some ifdef checks while ensuring that the debug code
is compiled, which prevents bit rot.

Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10 07:37:10 -07:00
Emilio G. Cota
84f1c148da exec-all: bring tb->invalid into tb->cflags
This gets rid of a hole in struct TranslationBlock.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10 07:37:10 -07:00
Emilio G. Cota
f6bb84d531 tcg: consolidate TB lookups in tb_lookup__cpu_state
This avoids duplicating code. cpu_exec_step will also use the
new common function once we integrate parallel_cpus into tb->cflags.

Note that in this commit we also fix a race, described by Richard Henderson
during review. Think of this scenario with threads A and B:

   (A) Lookup succeeds for TB in hash without tb_lock
        (B) Sets the TB's tb->invalid flag
        (B) Removes the TB from tb_htable
        (B) Clears all CPU's tb_jmp_cache
   (A) Store TB into local tb_jmp_cache

Given that order of events, (A) will keep executing that invalid TB until
another flush of its tb_jmp_cache happens, which in theory might never happen.
We can fix this by checking the tb->invalid flag every time we look up a TB
from tb_jmp_cache, so that in the above scenario, next time we try to find
that TB in tb_jmp_cache, we won't, and will therefore be forced to look it
up in tb_htable.

Performance-wise, I measured a small improvement when booting debian-arm.
Note that inlining pays off:

 Performance counter stats for 'taskset -c 0 qemu-system-arm \
	-machine type=virt -nographic -smp 1 -m 4096 \
	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
	-device virtio-net-device,netdev=unet \
	-drive file=jessie.qcow2,id=myblock,index=0,if=none \
	-device virtio-blk-device,drive=myblock \
	-kernel kernel.img -append console=ttyAMA0 root=/dev/vda1 \
	-name arm,debug-threads=on -smp 1' (10 runs):

Before:
      18714.917392 task-clock                #    0.952 CPUs utilized            ( +-  0.95% )
            23,142 context-switches          #    0.001 M/sec                    ( +-  0.50% )
                 1 CPU-migrations            #    0.000 M/sec
            10,558 page-faults               #    0.001 M/sec                    ( +-  0.95% )
    53,957,727,252 cycles                    #    2.883 GHz                      ( +-  0.91% ) [83.33%]
    24,440,599,852 stalled-cycles-frontend   #   45.30% frontend cycles idle     ( +-  1.20% ) [83.33%]
    16,495,714,424 stalled-cycles-backend    #   30.57% backend  cycles idle     ( +-  0.95% ) [66.66%]
    76,267,572,582 instructions              #    1.41  insns per cycle
                                             #    0.32  stalled cycles per insn  ( +-  0.87% ) [83.34%]
    12,692,186,323 branches                  #  678.186 M/sec                    ( +-  0.92% ) [83.35%]
       263,486,879 branch-misses             #    2.08% of all branches          ( +-  0.73% ) [83.34%]

      19.648474449 seconds time elapsed                                          ( +-  0.82% )

After, w/ inline (this patch):
      18471.376627 task-clock                #    0.955 CPUs utilized            ( +-  0.96% )
            23,048 context-switches          #    0.001 M/sec                    ( +-  0.48% )
                 1 CPU-migrations            #    0.000 M/sec
            10,708 page-faults               #    0.001 M/sec                    ( +-  0.81% )
    53,208,990,796 cycles                    #    2.881 GHz                      ( +-  0.98% ) [83.34%]
    23,941,071,673 stalled-cycles-frontend   #   44.99% frontend cycles idle     ( +-  0.95% ) [83.34%]
    16,161,773,848 stalled-cycles-backend    #   30.37% backend  cycles idle     ( +-  0.76% ) [66.67%]
    75,786,269,766 instructions              #    1.42  insns per cycle
                                             #    0.32  stalled cycles per insn  ( +-  1.24% ) [83.34%]
    12,573,617,143 branches                  #  680.708 M/sec                    ( +-  1.34% ) [83.33%]
       260,235,550 branch-misses             #    2.07% of all branches          ( +-  0.66% ) [83.33%]

      19.340502161 seconds time elapsed                                          ( +-  0.56% )

After, w/o inline:
      18791.253967 task-clock                #    0.954 CPUs utilized            ( +-  0.78% )
            23,230 context-switches          #    0.001 M/sec                    ( +-  0.42% )
                 1 CPU-migrations            #    0.000 M/sec
            10,563 page-faults               #    0.001 M/sec                    ( +-  1.27% )
    54,168,674,622 cycles                    #    2.883 GHz                      ( +-  0.80% ) [83.34%]
    24,244,712,629 stalled-cycles-frontend   #   44.76% frontend cycles idle     ( +-  1.37% ) [83.33%]
    16,288,648,572 stalled-cycles-backend    #   30.07% backend  cycles idle     ( +-  0.95% ) [66.66%]
    77,659,755,503 instructions              #    1.43  insns per cycle
                                             #    0.31  stalled cycles per insn  ( +-  0.97% ) [83.34%]
    12,922,780,045 branches                  #  687.702 M/sec                    ( +-  1.06% ) [83.34%]
       261,962,386 branch-misses             #    2.03% of all branches          ( +-  0.71% ) [83.35%]

      19.700174670 seconds time elapsed                                          ( +-  0.56% )

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10 07:37:10 -07:00
Emilio G. Cota
7f11636dbe tcg: remove addr argument from lookup_tb_ptr
It is unlikely that we will ever want to call this helper passing
an argument other than the current PC. So just remove the argument,
and use the pc we already get from cpu_get_tb_cpu_state.

This change paves the way to having a common "tb_lookup" function.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10 07:37:10 -07:00
Emilio G. Cota
841710c78e cpu-exec: rename have_tb_lock to acquired_tb_lock in tb_find
Reusing the have_tb_lock name, which is also defined in translate-all.c,
makes code reviewing unnecessarily harder.

Avoid potential confusion by renaming the local have_tb_lock variable
to something else.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10 07:37:10 -07:00
Emilio G. Cota
13e1094735 translate-all: make have_tb_lock static
It is only used by this object, and it's not exported to any other.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10 07:37:10 -07:00
Emilio G. Cota
0aecede612 tcg: fix corruption of code_time profiling counter upon tb_flush
Whenever there is an overflow in code_gen_buffer (e.g. we run out
of space in it and have to flush it), the code_time profiling counter
ends up with an invalid value (that is, code_time -= profile_getclock(),
without later on getting += profile_getclock() due to the goto).

Fix it by using the ti variable, so that we only update code_time
when there is no overflow. Note that in case there is an overflow
we fail to account for the elapsed coding time, but this is quite rare
so we can probably live with it.

"info jit" before/after, roughly at the same time during debian-arm bootup:

- before:
Statistics:
TB flush count      1
TB invalidate count 4665
TLB flush count     998
JIT cycles          -615191529184601 (-256329.804 s at 2.4 GHz)
translated TBs      302310 (aborted=0 0.0%)
avg ops/TB          48.4 max=438
deleted ops/TB      8.54
avg temps/TB        32.31 max=38
avg host code/TB    361.5
avg search data/TB  24.5
cycles/op           -42014693.0
cycles/in byte      -121444900.2
cycles/out byte     -5629031.1
cycles/search byte     -83114481.0
  gen_interm time   -0.0%
  gen_code time     100.0%
optim./code time    -0.0%
liveness/code time  -0.0%
cpu_restore count   6236
  avg cycles        110.4

- after:
Statistics:
TB flush count      1
TB invalidate count 4665
TLB flush count     1010
JIT cycles          1996899624 (0.832 s at 2.4 GHz)
translated TBs      297961 (aborted=0 0.0%)
avg ops/TB          48.5 max=438
deleted ops/TB      8.56
avg temps/TB        32.31 max=38
avg host code/TB    361.8
avg search data/TB  24.5
cycles/op           138.2
cycles/in byte      398.4
cycles/out byte     18.5
cycles/search byte     273.1
  gen_interm time   14.0%
  gen_code time     86.0%
optim./code time    19.4%
liveness/code time  10.3%
cpu_restore count   6372
  avg cycles        111.0

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10 07:37:10 -07:00
Emilio G. Cota
83974cf4f8 cputlb: bring back tlb_flush_count under !TLB_DEBUG
Commit f0aff0f124 ("cputlb: add assert_cpu_is_self checks") buried
the increment of tlb_flush_count under TLB_DEBUG. This results in
"info jit" always (mis)reporting 0 TLB flushes when !TLB_DEBUG.

Besides, under MTTCG tlb_flush_count is updated by several threads,
so in order not to lose counts we'd either have to use atomic ops
or distribute the counter, which is more scalable.

This patch does the latter by embedding tlb_flush_count in CPUArchState.
The global count is then easily obtained by iterating over the CPU list.

Note that this change also requires updating the accessors to
tlb_flush_count to use atomic_read/set whenever there may be conflicting
accesses (as defined in C11) to it.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10 07:37:10 -07:00
Greg Kurz
11748ba72e kvm: check KVM_CAP_NR_VCPUS with kvm_vm_check_extension()
On a modern server-class ppc host with the following CPU topology:

Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0,8,16,24
Off-line CPU(s) list:  1-7,9-15,17-23,25-31
Thread(s) per core:    1

If both KVM PR and KVM HV loaded and we pass:

        -machine pseries,accel=kvm,kvm-type=PR -smp 8

We expect QEMU to warn that this exceeds the number of online CPUs:

Warning: Number of SMP cpus requested (8) exceeds the recommended
 cpus supported by KVM (4)
Warning: Number of hotpluggable cpus requested (8) exceeds the
 recommended cpus supported by KVM (4)

but nothing is printed...

This happens because on ppc the KVM_CAP_NR_VCPUS capability is VM
specific  ndreally depends on the KVM type, but we currently use it
as a global capability. And KVM returns a fallback value based on
KVM HV being present. Maybe KVM on POWER shouldn't presume anything
as long as it doesn't have a VM, but in all cases, we should call
KVM_CREATE_VM first and use KVM_CAP_NR_VCPUS as a VM capability.

This patch hence changes kvm_recommended_vcpus() accordingly and
moves the sanity checking of smp_cpus after the VM creation.

It is okay for the other archs that also implement KVM_CAP_NR_VCPUS,
ie, mips, s390, x86 and arm, because they don't depend on the VM
being created or not.

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <150600966286.30533.10909862523552370889.stgit@bahia.lan>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-02 14:38:06 +02:00
Greg Kurz
62dd4edaaf kvm: check KVM_CAP_SYNC_MMU with kvm_vm_check_extension()
On a server-class ppc host, this capability depends on the KVM type,
ie, HV or PR. If both KVM are present in the kernel, we will always
get the HV specific value, even if we explicitely requested PR on
the command line.

This can have an impact if we're using hugepages or a balloon device.

Since we've already created the VM at the time any user calls
kvm_has_sync_mmu(), switching to kvm_vm_check_extension() is
enough to fix any potential issue.

It is okay for the other archs that also implement KVM_CAP_SYNC_MMU,
ie, mips, s390, x86 and arm, because they don't depend on the VM being
created or not.

While here, let's cache the state of this extension in a bool variable,
since it has several users in the code, as suggested by Thomas Huth.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <150600965332.30533.14702405809647835716.stgit@bahia.lan>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-02 14:38:06 +02:00