1. Review and commit patch
[ 896733 ] Lazy flags, for more instructions, only 1 src op
May be partially, but I hope to get all ideas from patch in
2. Get Bochs speedup after lazy flags optimization
3. Most important for me: improve correctness of emulation by handling several
undocumented EFLAGS modifications. And finally pass
UFLAGS - Undefined Flags Test v 3.0
Copyright (C) Potemkin's Hackers Group (PHG) 1989,1995
The test still fails on > 50% of its checks.
According to the Intel manuals:
The LOCK prefix can be prepended only to the following instructions
and only to those forms of the instructions where the destination
operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG,
CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. If
the LOCK prefix is used with one of these instructions and the source
operand is a memory operand, an undefined opcode exception (#UD) will
be generated. An undefined opcode exception will also be generated if
the LOCK prefix is used with any instruction not in the above list.
Checking of the LOCK prefix done in fetchDecode state and not overloads
Bochs's execution.
there to offer a way to substitute more efficient code
to do the RMW cases. At the moment, they just map to
the normal functions.
Sorry, restored the previous version ...
"bx_bool" which is always defined as Bit32u on all platforms. In Carbon
specific code, Boolean is still used because the Carbon header files
define it to unsigned char.
- this fixes bug [ 623152 ] MacOSX: Triple Exception Booting win95.
The bug was that some code in Bochs depends on Boolean to be a
32 bit value. (This should be fixed, but I don't know all the places
where it needs to be fixed yet.) Because Carbon defined Boolean as
an unsigned char, Bochs just followed along and used the unsigned char
definition to avoid compile problems. This exposed the dependency
on 32 bit Boolean on MacOS X only and led to major simulation problems,
that could only be reproduced and debugged on that platform.
- On the mailing list we debated whether to make all Booleans into "bool" or
our own type. I chose bx_bool for several reasons.
1. Unlike C++'s bool, we can guarantee that bx_bool is the same size on all
platforms, which makes it much less likely to have more platform-specific
simulation differences in the future. (I spent hours on a borrowed
MacOSX machine chasing bug 618388 before discovering that different sized
Booleans were the problem, and I don't want to repeat that.)
2. We still have at least one dependency on 32 bit Booleans which must be
fixed some time, but I don't want to risk introducing new bugs into the
simulation just before the 2.0 release.
Modified Files:
bochs.h config.h.in gdbstub.cc logio.cc main.cc pc_system.cc
pc_system.h plugin.cc plugin.h bios/rombios.c cpu/apic.cc
cpu/arith16.cc cpu/arith32.cc cpu/arith64.cc cpu/arith8.cc
cpu/cpu.cc cpu/cpu.h cpu/ctrl_xfer16.cc cpu/ctrl_xfer32.cc
cpu/ctrl_xfer64.cc cpu/data_xfer16.cc cpu/data_xfer32.cc
cpu/data_xfer64.cc cpu/debugstuff.cc cpu/exception.cc
cpu/fetchdecode.cc cpu/flag_ctrl_pro.cc cpu/init.cc
cpu/io_pro.cc cpu/lazy_flags.cc cpu/lazy_flags.h cpu/mult16.cc
cpu/mult32.cc cpu/mult64.cc cpu/mult8.cc cpu/paging.cc
cpu/proc_ctrl.cc cpu/segment_ctrl_pro.cc cpu/stack_pro.cc
cpu/tasking.cc debug/dbg_main.cc debug/debug.h debug/sim2.cc
disasm/dis_decode.cc disasm/disasm.h doc/docbook/Makefile
docs-html/cosimulation.html fpu/wmFPUemu_glue.cc
gui/amigaos.cc gui/beos.cc gui/carbon.cc gui/gui.cc gui/gui.h
gui/keymap.cc gui/keymap.h gui/macintosh.cc gui/nogui.cc
gui/rfb.cc gui/sdl.cc gui/siminterface.cc gui/siminterface.h
gui/term.cc gui/win32.cc gui/wx.cc gui/wxmain.cc gui/wxmain.h
gui/x.cc instrument/example0/instrument.cc
instrument/example0/instrument.h
instrument/example1/instrument.cc
instrument/example1/instrument.h
instrument/stubs/instrument.cc instrument/stubs/instrument.h
iodev/cdrom.cc iodev/cdrom.h iodev/cdrom_osx.cc iodev/cmos.cc
iodev/devices.cc iodev/dma.cc iodev/dma.h iodev/eth_arpback.cc
iodev/eth_packetmaker.cc iodev/eth_packetmaker.h
iodev/floppy.cc iodev/floppy.h iodev/guest2host.h
iodev/harddrv.cc iodev/harddrv.h iodev/ioapic.cc
iodev/ioapic.h iodev/iodebug.cc iodev/iodev.h
iodev/keyboard.cc iodev/keyboard.h iodev/ne2k.h
iodev/parallel.h iodev/pci.cc iodev/pci.h iodev/pic.h
iodev/pit.cc iodev/pit.h iodev/pit_wrap.cc iodev/pit_wrap.h
iodev/sb16.cc iodev/sb16.h iodev/serial.cc iodev/serial.h
iodev/vga.cc iodev/vga.h memory/memory.h memory/misc_mem.cc
into inline functions with asm() statements in cpu.h. This cleans
up the *.cc code (which now doesn't have any asm()s in it), and
centralizes the asm() code so constraints can be modified in one
place. This also makes it easier to cover more instructions
with asm()s for more efficient eflags handling.
all available optimizations in one shot.
Finished one last case of an instruction which could but didn't use
the Read-Modify-Write variants of access.cc functions.
Started going through the integer instructions, merging obvious cases
where there are two "if (modrm==11b) {" clauses and very little
action in between, and cleaning up the aweful indentation leftover
from many years ago when those instructions were implemented using
cut-and-paste. We may get a little extra performance out of these
mods, but they'll also be easier after I'm finished to enhance
with asm() statements to knock out the lazy flags processing on x86.
were simply replacements of the eflags mask constants with
the macro names already in cpu.h for asm() statements. I forgot
to use the macros for some instructions.
0x000008d5 -> EFlagsOSZAPCMask
0x000008d4 -> EFlagsOSZAPMask
use getB_CF() etc. getB_CF() and friends are only for a relatively
small number of cases where a true boolean/binary number (0 or 1) is required
rather than 0 or non-0 as is returned by get_CF().
user can turn on/off use of native host specific inline asm
statements. By default, this option is enabled, so you only
need it to disable inline asms in your compile for now.
Currently only on x86+GCC environments, will inline asm()
statements be used. Eventually, other platforms could specify
some asm()s; probably for endian issues such as byte-swapping
and unaligned memory accesses. On x86, there are some inline
asm()s which do the arithmetic EFLAGS processing so that the
lazy flags handling is somewhat bypassed. Eventually, I'll
add more, at least for the more common instructions. This
adds a little extra performance.
of (1 & (val32>>N)), and added a getB_?F() accessor for special
cases which need a strict binary value (exactly 0 or 1). Most
code only needed a value for logical comparison. I modified the
special cases which do need a binary number for shifting and
comparison between flags, to use the special getB_?F() accessor.
Cleaned up memory.cc functions a little, now that all accesses
are within a single page.
Fixed a (not very likely encountered) bug in fetchdecode.cc (and
fetchdecode64.cc) where a 2-byte opcode starting with a prefix
starts at the last offset on a page. There were no checks
on the segment overrides for a boundary condition. I added them.
The eflags enhancements added just a tiny bit of performance.
so frequently.
Coded asm() statements for INC/DEC_ERX() instructions.
Cleaned up the iCache a litle including a bug fix. The
generation ID was decrementing the whole field including
some high meta bits. That could roll over after 1 Billion
cycles. I know only decrement if the field is valid, to
save the write.
I implemented inline functions which can serve the value of
the arithmetic flags if they are cached, and redirect to
the lazy_flags.cc routines if not.
Most of this was just prep work for adding more asm() statements
for native eflags processing when on x86.
also extended by the REX.B field on Hammer) is passed to instructions.
I rearranged the bxInstruction_c to free up a field to be used
to pass this info when mod-rm bytes are not used. This got rid
of the ugly ((i->b1 & 7) + i->rex_b) code.
Probably shaved just a very little run time off Hammer emulation,
and even less on x86-32. The resultant is a little cleaner anyways.
in cpu.cc out of the main loop, and into the asynchronous
events handling. I went through all the code paths, and
there doesn't seem to be any reason for that code to be
in the hot loop.
Added another accessor for getting instruction data, called
modC0(). A lot of instructions test whether the mod field
of mod-nnn-rm is 0xc0 or not, ie., it's a register operation
and not memory. So I flag this in fetchdecode{,64}.cc.
This added on the order of 1% performance improvement for
a Win95 boot.
Macroized a few leftover calls to Write_RMV_virtual_xyz()
that didn't get modified in the x86-64 merge. Really, they
just call the real function for now, but I want to have them
available to do direct writes with the guest2host TLB pointers.
to bitfields. bxInstruction_c is now 24 bytes, including 4 for
the memory addr resolution function pointer, and 4 for the
execution function pointer (16 + 4 + 4).
Coded more accessors, to abstract access from most code.
with accessors. Had to touch a number of files to update the
access using the new accessors.
Moved rm_addr to the CPU structure, to slim down bxInstruction_c
and to prevent future instruction caching from getting sprayed
with writes to individual rm_addr fields. There only needs to
be one. Though need to deal with instructions which have
static non-modrm addresses, but which are using rm_addr since
that will change.
bxInstruction_c is down to about 40 bytes now. Trying to
get down to 24 bytes.
use accessors. This lets me work on compressing the
size of fetch-decode structure (now called bxInstruction_c).
I've reduced it down to about 76 bytes. We should be able
to do much better soon. I needed the abstraction of the
accessors, so I have a lot of freedom to re-arrange things
without making massive future changes.
Lost a few percent of performance in these mods, but my
main focus was to get the abstraction.
All the EFLAGS bits used to be cached in separate fields. I left
a few of them in separate fields for now - might remove them
at some point also. When the arithmetic fields are known
(ie they're not in lazy mode), they are all cached in a
32-bit EFLAGS image, just like the x86 EFLAGS register expects.
All other eflags are store in the 32-bit register also, with
a few also mirrored in separate fields for now.
The reason I did this, was so that on x86 hosts, asm() statements
can be #ifdef'd in to do the calculation and get the native
eflags results very cheaply. Just to test that it works, I
coded ADD_EdId() and ADD_EwIw() with some conditionally compiled
asm()s for accelerated eflags processing and it works.
-Kevin
Read-Modify-Write instructions. The first read phase stores
the host pointer in the "pages" field if a direct use pointer
is available. The Write phase first checks if a pointer was
issued and uses it for a direct write if available.
I chose the "pages" field since it needs to be checked by the
write_RMW_virtual variants anyways and thus needs to be
cached anyways.
Mostly the mods where to access.cc, but I did also macro-ize
the calls to write_RMW_virtual...() in files which use it
and cpu.h. Right now, the macro is just a straight pass-through.
I tried expanding it to a quick initial check for the pointer
availability to do the write in-place, with a function call
as a fall-back. That didn't seemed to matter at all.
Booting is not helped by this really. The upper bound of
the gain is 5 or 6%, and that's only if you have a loop that
looks like:
label:
add [eax], ebx ;; mega read-modify-write instruction
jmp label ;; intensive loop.
tries to fix it. The shortcuts to register names such as AX and DL are
#defines in cpu/cpu.h, and they are defined in terms of BX_CPU_THIS_PTR.
When BX_USE_CPU_SMF=1, this works fine. (This is what bochs used for
a long time, and nobody used the SMF=0 mode at all.) To make SMP bochs
work, I had to get SMF=0 mode working for the CPU so that there could
be an array of cpus.
When SMF=0 for the CPU, BX_CPU_THIS_PTR is defined to be "this->" which
only works within methods of BX_CPU_C. Code outside of BX_CPU_C must
reference BX_CPU(num) instead.
- to try to enforce the correct use of AL/AX/DL/etc. shortcuts, they are
now only #defined when "NEED_CPU_REG_SHORTCUTS" is #defined. This is
only done in the cpu/*.cc code.
To see the commit logs for this use either cvsweb or
cvs update -r BRANCH-io-cleanup and then 'cvs log' the various files.
In general this provides a generic interface for logging.
logfunctions:: is a class that is inherited by some classes, and also
. allocated as a standalone global called 'genlog'. All logging uses
. one of the ::info(), ::error(), ::ldebug(), ::panic() methods of this
. class through 'BX_INFO(), BX_ERROR(), BX_DEBUG(), BX_PANIC()' macros
. respectively.
.
. An example usage:
. BX_INFO(("Hello, World!\n"));
iofunctions:: is a class that is allocated once by default, and assigned
as the iofunction of each logfunctions instance. It is this class that
maintains the file descriptor and other output related code, at this
point using vfprintf(). At some future point, someone may choose to
write a gui 'console' for bochs to which messages would be redirected
simply by assigning a different iofunction class to the various logfunctions
objects.
More cleanup is coming, but this works for now. If you want to see alot
of debugging output, in main.cc, change onoff[LOGLEV_DEBUG]=0 to =1.
Comments, bugs, flames, to me: todd@fries.net