Virtual Mode Extensions (VME)
and
Protected Mode Virtual Interrupts (PVI)
instructions STI and CLI have full support of these features, according to Intel docs. Need to check POPF and PUSHF instructions and afterwise VME and PVI extensions could be enabled in CR4
split REPEAT instructions according to opsize to speedup execution
now each REPEATABLE instruction splitted to 3 different instructions, one for 16-bit operand size, one for 32-bit and one for 64-bit. Choosing of correct instruction occure in fetchdecode step.
Manual says that GP(0) shouldd be generated in this case ALWAYS
Fixed instructions PANIC messages to ERROR for this case
And ... do not leave PANIC messages w/o taking care that user could push CONTINUE button and program should know to continue after the PANIC code line. Mainly in rerurn instructions were several problems ...
the patch release notes by Zwane:
o Define symbols for constants like
o APIC arbitration
o Processor priority
o Various interrupt delivery fixes
o Focus processor checking
o ExtINT delivery
I need to release this now so that i don't fall too far behind CVS, when
it was part of the bochs-smp patch it could boot 2.4.18 4way. Apologies
for the whitespace changes.
Also remove patch.apic-ppr-zwane patch because it already included in
patch.apic-zwane.
I hope it will help to boot x86-64 or cmp systems required missed APIC
features !
1. Review and commit patch
[ 896733 ] Lazy flags, for more instructions, only 1 src op
May be partially, but I hope to get all ideas from patch in
2. Get Bochs speedup after lazy flags optimization
3. Most important for me: improve correctness of emulation by handling several
undocumented EFLAGS modifications. And finally pass
UFLAGS - Undefined Flags Test v 3.0
Copyright (C) Potemkin's Hackers Group (PHG) 1989,1995
The test still fails on > 50% of its checks.
2. Fixed bug
[ 989478 ] I-Cache and undefined Instruktions
The L4 microkernel uses an undefined instruction to
trap for a special requests into the kernel (LOCK NOP).
The handler fixes this up and gives the user a special
code page with syscall stubs. If you're not using the
I-Cache optimization everthing works find on bochs. But
if you enable the I-Cache (--enable-icache), then the
undefined opcode exception is thrown only once for ever
virtual address it occurs. See the demodisk of the
L4KA::pistachio
(http://www.l4ka.org/projects/pistachio/download.php).
In this case the pingpong benchmark of this demo is of
interest. Everything runs fine until the program tries
to spawn a new task for its measurements. This new task
shares the code of the creating program. But the new
task stops executing at the undefined instruction
explained above and no exception is thrown.
- v8086 priveleged instruction processing bug (was also reported by
LightCone Aug 7 2003)
- exception process bug (was reported by Diego Henriquez Sat Nov 15
01:16:51 CET 2003)
- segment validation with IRET instruction
- CS segment not present exception processing with IRET
configure script option --enable-magic-breakpoints (enabled by default).
Documented the instruction required to trigger the magic breakpoint
(xchgw %bx,%bx).
With this coding style each instruction could be implemented separatelly even not together with current Bochs FPU emulator.
Step-by-step I am going to transfer all FPU instructions from current Bochs FPU emulator to new style and remove an old bugged emulator.
Anyway, now I could implement all currently missed FPU instructions without hacking wm-fpu-emu.
Instructions MOV_CxRx and MOV_RxCx are not supported in v8086 mode according to Intel manuals.
Also these instructions are treated as register-to-register regardless to MODRM byte fields (according to AMD manuals)
Also commit fix for MOV_EwSw by Kevin
PNI could be enabled by setting BX_SUPPORT_PNI in config.h
After the feature will be fully validation I'll also add configure option.
The implemntation is ~complete. I've missed only three FPU new opcodes of FUSTTP instruction and MONITOR/WAIT instructions.
Enjoy ! ;)
check.
Commented out a number of instances of invalidate_prefetch_q(),
for branches which do not change CS since the EIP window mechanism
takes care of validating that EIP lands in the current page or not
in the main cpu loop anyways.
Fixed a couple cases (v8086 mode and real mode) of loading CS where
the EIP page window was not invalidated in segment_ctrl_pro.cc.
That may fix some aliasing problems reported before (OS2).
According to the Intel manuals:
The LOCK prefix can be prepended only to the following instructions
and only to those forms of the instructions where the destination
operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG,
CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. If
the LOCK prefix is used with one of these instructions and the source
operand is a memory operand, an undefined opcode exception (#UD) will
be generated. An undefined opcode exception will also be generated if
the LOCK prefix is used with any instruction not in the above list.
Checking of the LOCK prefix done in fetchDecode state and not overloads
Bochs's execution.
- it works only on x86 with gcc2.95+
- uses the GCC function atribute "regparm(n)" to declare that certain
functions use the register calling convention
- performance improvement is about 6%
1) fixed the type of "hostPageAddr" and associated typecasts.
2) fixed the type of "pages" and associated typecasts (overloaded variable)
3) patch to cpu.cc to calculate "eipPageBias" correctly in 64 bit mode
* renamed CPU_ID to BX_CPU_ID.
with this new name there is no possibility for name contentions and BX_CPU_ID
definition could be moved out to NEED_CPU_REG_SHORTCUTS block
* returned back `unsigned BX_CPU::which_cpu(void)` function
* added BX_CPU_ID parameter for
BX_INSTR_PHY_READ(a20addr, len);
BX_INSTR_PHY_WRITE(a20addr, len);
now it will be
BX_INSTR_PHY_READ(cpu_id, a20addr, len);
BX_INSTR_PHY_WRITE(cpu_id, a20addr, len);
> CPU_ID is defined as
> #define CPU_ID (BX_CPU_THIS_PTR local_apic.get_id())
> This is not true when the APIC name is changed (true in Linux). Please
> change this to:
> #define CPU_ID (BX_CPU_THIS - BX_CPU(0))
Because source files were added/removed it would require an update
of the windows and macos project files, so I want to wait until after 2.0.
M Makefile.in 1.51 back to 1.50
M cpu.h 1.121 back to 1.120
M fetchdecode.cc 1.37 back to 1.36
M fetchdecode64.cc 1.33 back to 1.32
M sse.cc 1.17 back to 1.16
A sse2.cc 1.27 back to 1.26 (added back)
R sse_move.cc removed
R sse_pfp.cc removed
- to bring these changes back again, all we have to do is
"cvs update -j tmp-before1 -j tmp-after1"
sse.cc -> general SSE stuff and SSE integer (MMX extensions)
sse_move.cc -> memory transfer and shuffle opcodes
sse_pfp.cc -> packed floating point operations
debugger, SMP, and x86-64. A few macros were missing the CPU_ID argument,
and a few passed nonexistent variables to the instrumentation macros.
- I changed CPU_ID into a plain old macro instead of an inline call to a
trivial which_cpu() function, and removed which_cpu().
Modified Files:
cpu/cpu.h cpu/ctrl_xfer64.cc debug/dbg_main.cc
- Now compiles for plain ia-32
- Fixed some printf formatting for ia32 only.
- Update to latest Win32 DLL
- Added an ICEBP (Undoc 0xF8, INT 01) facility.
- updated to use latest VGA refresh routine
there to offer a way to substitute more efficient code
to do the RMW cases. At the moment, they just map to
the normal functions.
Sorry, restored the previous version ...
"bx_bool" which is always defined as Bit32u on all platforms. In Carbon
specific code, Boolean is still used because the Carbon header files
define it to unsigned char.
- this fixes bug [ 623152 ] MacOSX: Triple Exception Booting win95.
The bug was that some code in Bochs depends on Boolean to be a
32 bit value. (This should be fixed, but I don't know all the places
where it needs to be fixed yet.) Because Carbon defined Boolean as
an unsigned char, Bochs just followed along and used the unsigned char
definition to avoid compile problems. This exposed the dependency
on 32 bit Boolean on MacOS X only and led to major simulation problems,
that could only be reproduced and debugged on that platform.
- On the mailing list we debated whether to make all Booleans into "bool" or
our own type. I chose bx_bool for several reasons.
1. Unlike C++'s bool, we can guarantee that bx_bool is the same size on all
platforms, which makes it much less likely to have more platform-specific
simulation differences in the future. (I spent hours on a borrowed
MacOSX machine chasing bug 618388 before discovering that different sized
Booleans were the problem, and I don't want to repeat that.)
2. We still have at least one dependency on 32 bit Booleans which must be
fixed some time, but I don't want to risk introducing new bugs into the
simulation just before the 2.0 release.
Modified Files:
bochs.h config.h.in gdbstub.cc logio.cc main.cc pc_system.cc
pc_system.h plugin.cc plugin.h bios/rombios.c cpu/apic.cc
cpu/arith16.cc cpu/arith32.cc cpu/arith64.cc cpu/arith8.cc
cpu/cpu.cc cpu/cpu.h cpu/ctrl_xfer16.cc cpu/ctrl_xfer32.cc
cpu/ctrl_xfer64.cc cpu/data_xfer16.cc cpu/data_xfer32.cc
cpu/data_xfer64.cc cpu/debugstuff.cc cpu/exception.cc
cpu/fetchdecode.cc cpu/flag_ctrl_pro.cc cpu/init.cc
cpu/io_pro.cc cpu/lazy_flags.cc cpu/lazy_flags.h cpu/mult16.cc
cpu/mult32.cc cpu/mult64.cc cpu/mult8.cc cpu/paging.cc
cpu/proc_ctrl.cc cpu/segment_ctrl_pro.cc cpu/stack_pro.cc
cpu/tasking.cc debug/dbg_main.cc debug/debug.h debug/sim2.cc
disasm/dis_decode.cc disasm/disasm.h doc/docbook/Makefile
docs-html/cosimulation.html fpu/wmFPUemu_glue.cc
gui/amigaos.cc gui/beos.cc gui/carbon.cc gui/gui.cc gui/gui.h
gui/keymap.cc gui/keymap.h gui/macintosh.cc gui/nogui.cc
gui/rfb.cc gui/sdl.cc gui/siminterface.cc gui/siminterface.h
gui/term.cc gui/win32.cc gui/wx.cc gui/wxmain.cc gui/wxmain.h
gui/x.cc instrument/example0/instrument.cc
instrument/example0/instrument.h
instrument/example1/instrument.cc
instrument/example1/instrument.h
instrument/stubs/instrument.cc instrument/stubs/instrument.h
iodev/cdrom.cc iodev/cdrom.h iodev/cdrom_osx.cc iodev/cmos.cc
iodev/devices.cc iodev/dma.cc iodev/dma.h iodev/eth_arpback.cc
iodev/eth_packetmaker.cc iodev/eth_packetmaker.h
iodev/floppy.cc iodev/floppy.h iodev/guest2host.h
iodev/harddrv.cc iodev/harddrv.h iodev/ioapic.cc
iodev/ioapic.h iodev/iodebug.cc iodev/iodev.h
iodev/keyboard.cc iodev/keyboard.h iodev/ne2k.h
iodev/parallel.h iodev/pci.cc iodev/pci.h iodev/pic.h
iodev/pit.cc iodev/pit.h iodev/pit_wrap.cc iodev/pit_wrap.h
iodev/sb16.cc iodev/sb16.h iodev/serial.cc iodev/serial.h
iodev/vga.cc iodev/vga.h memory/memory.h memory/misc_mem.cc
SSE/SSE2 for Stanislav. Also, some method prototypes and
skeletal functions in access.cc for read/write double quadword
features.
Also cleaned up one warning in protect_ctrl.cc for non-64 bit compiles.
There was an unused variable, only used for 64-bit.
This is an interim update to allow others to test.
We have userland code running!!! (up to a point)
Able to start executing "sash" as /sbin/init in userland from linux 64 bit
kernel until it crashes trying to access a null pointer. No kernel panics
though, just a segfault loop.
into inline functions with asm() statements in cpu.h. This cleans
up the *.cc code (which now doesn't have any asm()s in it), and
centralizes the asm() code so constraints can be modified in one
place. This also makes it easier to cover more instructions
with asm()s for more efficient eflags handling.
hack with longjmp() back to cpu.cc main decode loop, and added a
check in there to return control when bx_guard.special_unwind_stack
is set (compiling with debugger enabled only).
If in the debugger you try to execute further instructions
(which you shouldn't), other fields need to be reset I would
think, such as EXT and errorno, and have to make sure ESP/EIP
are corrected properly. Basically, this hack is only good
for examining the current situation of a nasty fault.
exit out of cpu_loop() and back to the caller can be honored.
Previously, code in this function was a part of cpu_loop so
a "return;" would already do that. Now, a value is passed
back to cpu_loop() to denote such a request, and then a return
is executed from cpu_loop().
I haven't tested this yet, but previously I must have broke
certain debugging requests by moving the code to a separate
function and not fixing the "return;" statements.
Symptom: Linux kernel 2.4.19 would hang in random places. CPU still
running, but in dle loop.
Cause: if APIC interrupt occurred while a PIC interrupt was pending, the
PIC interrupt would be lost. This is because either an APIC or PIC
interrupt would trash any pending interrupt event because INTR is only a state,
not an event queue.
Temporary fix: reworked apic.cc to have it's own copy of INTR state. cpu.cc now
checks for both cpu.INTR and local_apic.INTR.
Need to do further research to see if local_apic and pic can be integrated in such
a way as properly manage the combined effects of both devices accessing INTR state.
value and a change-mask, rather than passing all the boolean
change flags as arguments.
Recoded the POPF instruction in flag_ctrl.cc to use the
new writeEFlags() function, and to make it more sane.
Also, the old write_flags() and write_eflags() functions
redirect to writeEFlags() for now. Later, when we get
back in a development mode, it would be better to make
all calls use the new function and get rid of the old ones.
been using the Boolean type for a number of multi-bit fields on the
assumption that it is actually many bits wide. However, this assumption is
unsafe and has caused some bugs that are hard to track down.
- in the Carbon library on MacOS X, Boolean is defined to be an unsigned char.
This has been causing some of the EFLAGS accessors to fail (bits 8-31)
because they depended on Boolean being 32 bits wide. I changed these
accessors to return Bit32u instead. I believe that this will finally fix
[ 618388 ] Unable to boot under MacOS X.
- It would be possible to create a bochs specific type for booleans (bx_bool),
but it's cleaner to simply use "Boolean" when we actually mean a 1-bit true
or false field, and Bit8u/Bit32u when it is a multibit field.
32-bits rather than 64. This is possible, because there is
always an active null (heartbeat) timer, with periodicity
of less than or equal to the maximum 32-bit int value.
This generates a little less code in the hot part of cpu_loop,
and saved about 3% execution time on a Win95 boot.
Moved the asynchronous handling code from cpu_loop() to its
own function since it's a long path. This neatened up the
code a little (less gotos and all), and made it more clear
to use a "while (1)" around the iterative code in cpu_loop().
These seem to be working better, are a more simple design,
easier to understand, and AFAIK don't have race conditions
in them like the old ones do.
Re-coded the apic timer, to return cycle accurate values
which vary with each iteration of a read from a guest OS.
The previous implementation had very poor resolution. It
also didn't check the mask bit to see if an apic timer
interrupt should occur on countdown to 0. The apic timer
now calls its own bochs timer, rather than tag on the
one in iodev/devices.cc.
I needed to use one new function which is an inline in
pc_sytem.h. That would have to be added to the old pc_system.h if
we have to back-out to it.
Linux/x86-64 now boots until it hits two undefined opcodes:
FXRSTOR (0f ae). This restores FPU, MMX, XMM and MXCSR registers
from a 512-byte region of memory. We don't implement this yet.
MOVNTDQ (66 0f e7). This is a move involving an XMM register.
The 0x66 prefix is used so it's a double quadword, rather than
MOVNTQ (0f e7) which operates on a single quadword.
The Linux kernel panic is on the MOVNTQD opcodes. Perhaps that's
because that opcode is used in exception handling of the 1st?
Looks like we need to implement some new instructions.
restart another one in wxWindows. Fixed that. Also, on restart, the
apic id's left over from the first run were causing panics. Fixed that.
- modified: main.cc cpu/apic.cc cpu/cpu.h cpu/init.cc
now simply return a cached value which is set upon mode changes.
The biggest problem was protected_mode() which did something like:
return CR0.PM && ! EFLAGS.VM
This adds up when it was being executed many times in branch functions
etc. Now, cached values are set and sampled instead.
and Jas Sandys-Lumsdaine to split out common instructions into
variants which deal with the mod=11b case (Reg-Reg) and the
other cases (which do memory ops). Actually, I only split
MOV_GwEw and MOV_GdEd for now. According to some instrumentation
of a Win95 boot, they were the most frequently used opcode by far.
Some things changed in the ctrl_xfer*.cc, fetchdecode*.cc,
and cpu.cc since the original patches, so I did some patch
integration by hand. Check the placement of the
macros BX_INSTR_FETCH_DECODE_COMPLETED() and BX_INSTR_OPCODE()
in cpu.cc to make sure I go them right. Also, I changed the
parameters to BX_INSTR_OPCODE() to update them to the new code.
I put some comments before each of these to help determine if
the placement is right.
These macros are only compiled in if you are gathering instrumentation
data from bochs, so they shouldn't effect others.
Created 64-bit versions of some branch instructions and
changed fetchdecode64.cc to use them instead. This keeps the
#ifdef pollution down for 32-bit code and made fixing them
easier. They needed to clear the upper bits of RIP for
16-bit operand sizes. They also should not have had a protection
limit check in them, especially since that field is still
32-bit in cpu.h, so there's no way to set nominal 64-bit values.
The 32-bit versions were also not honoring the upper 32-bits
of RIP.
LOOPNE64_Jb
LOOPE64_Jb
LOOP64_Jb
JCXZ64_Jb
Changed all occurances of JCC_Jw/JCC_Jd in fetchdecode64.cc to
use JCC_Jq, which was coded already. Both JMP_Jq and JCC_Jq are
now fixed w.r.t. 16-bit opsizes and upper RIP bit clearing.
fetching 64-bit address opcode info, which was incorrect.
Fixed. Got rid of BxImmediate_Oq. fetchdecode64.cc now
uses BxImmediateO, like the fetch routine does. Addresses which
are embedded in the opcode, have a size which depends on
the current addressing size. For long-mode, this is
either 64 (default) or 32 (AddrSize over-ride). BxImmediate_O
now conditionally fetches based on AddrSize.
64-bit bug#2: In JMP_Jq(), when the current operand size is
16-bits, the upper dword of RIP was not being cleared. The
semantics with this case are weird - one would think the
top 48 bits would be cleared, but apparently only the top
32 bits are. Anyways, I fixed this.
Replaced some of the messy immediate fetching (byte-by-byte) in
fetchdecode64.cc with ReadHost{Q,D}WordFromLittleEndian() calls
for cleanliness. Should do this for all the cases, plus
the 32-bit stuff.
Since the SYSCALL replaces the LOADALL instruction, it is incompatible with
earlier CPU types.
At moment, the SYSCALL is only enabled by x86-64 emulation, but the code
can be incorporated in IA32 only emulations.
Instructions added:
0F 05 SYSCALL (replaces LOADALL)
0F 07 SYSRET (new)
TODO: restructure #if ... so that it can be used by non x86-64 emulations.
use getB_CF() etc. getB_CF() and friends are only for a relatively
small number of cases where a true boolean/binary number (0 or 1) is required
rather than 0 or non-0 as is returned by get_CF().
loadSRegLMNominal() which should be used to load a segment register
in long-mode with nominal values which are compatible with existing
checks and expectations for descriptor cache values.
Fixed 64-bit iret to not do a descriptor fetch if SS selector is null.
Also load SS with loadSRegLMNorminal() in the same case.
these from interfering from a normal compile here's what I did.
In config.h.in (which will generate config.h after a configure),
I added a #define called KPL64Hacks:
#define KPL64Hacks
*After* running configure, you must set this by hand. It will
default to off, so you won't get my hacks in a normal compile.
This will go away soon. There is also a macro just after that
called BailBigRSP(). You don't need to enabled that, but you
can. In many of the instructions which seemed like they could
be hit by the fetchdecode64() process, but which also touched
EIP/ESP, I inserted a macro. Usually this macro expands to nothing.
If you like, you can enabled it, and it will panic if it finds
the upper bits of RIP/RSP set. This helped me find bugs.
Also, I cleaned up the emulation in ctrl_xfer{8,16,32}.cc.
There were some really old legacy code snippets which directly
accessed operands on the stack with access_linear. Lots of
ugly code instead of just pop_32() etc. Cleaning those up,
minimized the number of instructions which directly manipulate
the stack pointer, which should help in refining 64-bit support.
which says to paste getB_ with flag and then paste with (. It should
be "getB_##flag(void)". Some preprocessors are complaining about pasting
the symbol with the paren.
of (1 & (val32>>N)), and added a getB_?F() accessor for special
cases which need a strict binary value (exactly 0 or 1). Most
code only needed a value for logical comparison. I modified the
special cases which do need a binary number for shifting and
comparison between flags, to use the special getB_?F() accessor.
Cleaned up memory.cc functions a little, now that all accesses
are within a single page.
Fixed a (not very likely encountered) bug in fetchdecode.cc (and
fetchdecode64.cc) where a 2-byte opcode starting with a prefix
starts at the last offset on a page. There were no checks
on the segment overrides for a boundary condition. I added them.
The eflags enhancements added just a tiny bit of performance.
so frequently.
Coded asm() statements for INC/DEC_ERX() instructions.
Cleaned up the iCache a litle including a bug fix. The
generation ID was decrementing the whole field including
some high meta bits. That could roll over after 1 Billion
cycles. I know only decrement if the field is valid, to
save the write.
I implemented inline functions which can serve the value of
the arithmetic flags if they are cached, and redirect to
the lazy_flags.cc routines if not.
Most of this was just prep work for adding more asm() statements
for native eflags processing when on x86.
also extended by the REX.B field on Hammer) is passed to instructions.
I rearranged the bxInstruction_c to free up a field to be used
to pass this info when mod-rm bytes are not used. This got rid
of the ugly ((i->b1 & 7) + i->rex_b) code.
Probably shaved just a very little run time off Hammer emulation,
and even less on x86-32. The resultant is a little cleaner anyways.
in cpu.cc out of the main loop, and into the asynchronous
events handling. I went through all the code paths, and
there doesn't seem to be any reason for that code to be
in the hot loop.
Added another accessor for getting instruction data, called
modC0(). A lot of instructions test whether the mod field
of mod-nnn-rm is 0xc0 or not, ie., it's a register operation
and not memory. So I flag this in fetchdecode{,64}.cc.
This added on the order of 1% performance improvement for
a Win95 boot.
Macroized a few leftover calls to Write_RMV_virtual_xyz()
that didn't get modified in the x86-64 merge. Really, they
just call the real function for now, but I want to have them
available to do direct writes with the guest2host TLB pointers.
but if you hand edit cpu/cpu.h, and change BxICacheEntries,
you can try different sizes. I'll make this more flexible
with configure. For now, use "--enable-icache" with no parameters.
- Modified fetchdecode.cc/fetchdecode64.cc just enough so that
instructions which encode a direct address now use a memory
resolution function which just sticks the immediate address
into rm_addr. With cached instructions we need this.
to bitfields. bxInstruction_c is now 24 bytes, including 4 for
the memory addr resolution function pointer, and 4 for the
execution function pointer (16 + 4 + 4).
Coded more accessors, to abstract access from most code.
with accessors. Had to touch a number of files to update the
access using the new accessors.
Moved rm_addr to the CPU structure, to slim down bxInstruction_c
and to prevent future instruction caching from getting sprayed
with writes to individual rm_addr fields. There only needs to
be one. Though need to deal with instructions which have
static non-modrm addresses, but which are using rm_addr since
that will change.
bxInstruction_c is down to about 40 bytes now. Trying to
get down to 24 bytes.
use accessors. This lets me work on compressing the
size of fetch-decode structure (now called bxInstruction_c).
I've reduced it down to about 76 bytes. We should be able
to do much better soon. I needed the abstraction of the
accessors, so I have a lot of freedom to re-arrange things
without making massive future changes.
Lost a few percent of performance in these mods, but my
main focus was to get the abstraction.
no longer used. Also rearranged that struct a little
to be more compressed. Over time, I'm going to reduce
it further, for use with future accelerations.
enhancement to bochs. You can now configure with
--enable-guest2host-tlb.
Force the support of big pages (PSE) when x86-64 is configured.
Reverted back to only one kind of TLB entry style, since everything
is ported.
Fixed one bug in io.cc with as_64 and the index registers.
There are others, as noticed by Peter.
class declaration, for example:
static const unsigned os_64=0, as_64=0;
After reading some suggestions on usenet, I changed these into
enums instead, like this:
enum { os_64=0, as_64=0 };
printing a message when a reserved bit was set, but not causing
a #GP(0). As well, I force a new PAE support option to 1 when
Hammer support is enabled.
called cpu_mode. Now there is one for cpu32, but it is declared:
static const unsigned cpu_mode=BX_MODE_IA32;
This way the compiler can compile-out if-then-else clauses based
on it, allowing for easier code sharing.
member functions are turned on, BX_CPU_C_PREFIX expands to nothing, and any
method that uses BX_CPU_C_PREFIX instead of explictly writing "BX_CPU_C::"
will not be a member function at all. This makes it impossible for code
outside the BX_CPU_C object to call the accessor because sometimes the method
is at ptr_to_cpu->get_EIP() and other times you'd have to do just get_EIP().
The only way I've found to solve this is to remove the BX_CPU_C_PREFIX
and write BX_CPU_C:: instead.
- in debug/dbg_main.cc I removed the EBP, EIP, ESP, SP shortcuts. Now
the accessors are used everywhere. Also I replaced a reference to
the short-lived get_erx() accessor with ones that work: get_EAX(), etc.
- with these changes the current cvs compiles with any combination of
debugger enabled/disabled, SMP enabled/disabled, and x86-64 enabled/disabled.
BX_READ_8BIT_REG() --> BX_READ_8BIT_REGx()
BX_WRITE_8BIT_REG() --> BX_WRITE_8BIT_REGx()
They use an extra parameter "extended". I coded this
as the macro without the "x" for cpu32 compiles. This
allows for ease of merging and code sharing.
- add get_erx() method to bx_gen_reg_t which returns the erx field of the
structure (which is has a different name in cpu and cpu64). Providing
an accessor is one strategy for avoiding igly "#ifdef BX_SUPPORT_X86_64"
statements in the rest of the code.
- cpu64/init.cc: the "eflags" before get_flag and set_flag is no longer
correct. removed.
- modified files: load32bitOShack.cc logio.cc cpu/cpu.h cpu64/apic.cc
cpu64/cpu.h cpu64/init.cc cpu64/proc_ctrl.cc debug/dbg_main.cc
cpu64 directories. Instead of using the macros introduced in cpu.h rev 1.37
such as GetEFlagsDFLogical and SetEFlagsDF and ClearEFlagsDF, I made inline
methods on the BX_CPU_C object that access the eflags fields. The problem
with the macros is that they cannot be used outside the BX_CPU_C object. The
macros have now been removed, and all references to eflags now use these new
accessors.
- I debated whether to put the accessors as members of the BX_CPU_C object
or members of the bx_flags_reg_t struct. I chose to make them members
of BX_CPU_C for two reasons: 1. the lazy flags are implemented as
members of BX_CPU_C, and 2. the eflags are referenced in many many places
and it is more compact without having to put eflags in front of each. (The
real problem with compactness is having to write BX_CPU_THIS_PTR in front of
everything, but that's another story.)
- Kevin pointed out a major bug in my set accessor code. What a difference a
little tilde can make! That is fixed now.
- modified: load32bitOShack.cc debug/dbg_main.cc
and in both cpu and cpu64 directories:
cpu.cc cpu.h ctrl_xfer_pro.cc debugstuff.cc exception.cc flag_ctrl.cc
flag_ctrl_pro.cc init.cc io.cc io_pro.cc proc_ctrl.cc soft_int.cc
string.cc vm8086.cc
a consistent way of accessing these flags that works both inside and
outside the BX_CPU class, I added inline accessor methods for each
flag: assert_FLAG(), clear_FLAG(), set_FLAG(value), and get_FLAG ()
that returns its value. I use assert to mean "set the value to one"
to avoid confusion, since there's also a set method that takes a value.
- the eflags access macros (e.g. GetEFlagsDFLogical, ClearEFlagsTF) are
now defined in terms of the inline accessors. In most cases it will
result in the same code anyway. The major advantage of the accesors
is that they can be used from inside or outside the BX_CPU object, while
the macros can only be used from inside.
- since almost all eflags were stored in val32 now, I went ahead and
removed the if_, rf, and vm fields. Now the val32 bit is the
"official" value for these flags, and they have accessors just like
everything else.
- init.cc: move the registration of registers until after they have been
initialized so that the initial value of each parameter is correct.
Modified files:
debug/dbg_main.cc cpu/cpu.h cpu/debugstuff.cc cpu/flag_ctrl.cc
cpu/flag_ctrl_pro.cc cpu/init.cc
with GCC) align them with the GCC special alignment attribute.
Since there was then one available field, I split the protection
attributes and native host pointers into their own fields.
Before, with 3 dwords per TLB entry, some entries (about 3/8)
were spanning two processor cache lines (assuming a 32-byte
cache line). Now, they all fit within one cache line.
Knocked about 1.4% off Win95 boot time, probably more off normal
software runs.
All the EFLAGS bits used to be cached in separate fields. I left
a few of them in separate fields for now - might remove them
at some point also. When the arithmetic fields are known
(ie they're not in lazy mode), they are all cached in a
32-bit EFLAGS image, just like the x86 EFLAGS register expects.
All other eflags are store in the 32-bit register also, with
a few also mirrored in separate fields for now.
The reason I did this, was so that on x86 hosts, asm() statements
can be #ifdef'd in to do the calculation and get the native
eflags results very cheaply. Just to test that it works, I
coded ADD_EdId() and ADD_EwIw() with some conditionally compiled
asm()s for accelerated eflags processing and it works.
-Kevin
it can decide how to proceed. Some of those bits are necessary
to make TLB invalidation decisions. INVLPG doesn't cause
a whole TLB flush anymore, just one page. Some of the
current CPU behaviours model the P6, especially on CR0
reloads. Earlier processors kept some pre-change pre-fetched
instructions until a branch. We could probably model that
by setting a flag, and letting the revalidate_prefetch_q
function cause serialization.
The TLB flush code only invalidates entries which are not
already invalidated for the case where the TLB invalidation
ID trick is not in use.
Read-Modify-Write instructions. The first read phase stores
the host pointer in the "pages" field if a direct use pointer
is available. The Write phase first checks if a pointer was
issued and uses it for a direct write if available.
I chose the "pages" field since it needs to be checked by the
write_RMW_virtual variants anyways and thus needs to be
cached anyways.
Mostly the mods where to access.cc, but I did also macro-ize
the calls to write_RMW_virtual...() in files which use it
and cpu.h. Right now, the macro is just a straight pass-through.
I tried expanding it to a quick initial check for the pointer
availability to do the write in-place, with a function call
as a fall-back. That didn't seemed to matter at all.
Booting is not helped by this really. The upper bound of
the gain is 5 or 6%, and that's only if you have a loop that
looks like:
label:
add [eax], ebx ;; mega read-modify-write instruction
jmp label ;; intensive loop.
Kevin Lawton says he doesn't get a performance benefit.
I'm not sure if I do. Either way, the difference isn't
very large.
This code may get removed if it turns out to be useless.
- bx_gen_reg cannot be declared with BX_SMF or it can't read gen_reg
when static member functions are turned on.
- use "BX_CPU_C_PREFIX" instead of "BX_CPU_C::" for get_segment_base.
- the SMF (static member function) tricks are just plain wierd. The only way to
really be sure that you're not breaking something is to try compiling it with
SMF on and with SMF off. e.g. "configure && make" and
"configure --enable-processors=2 && make".
mode uses the notion of the guest-to-host TLB. This has the
benefit of allowing more uniform and streamlined acceleration
code in access.cc which does not have to check if CR0.PG
is set, eliminating a few instructions per guest access.
Shaved just a little off execution time, as expected.
Also, access_linear now breaks accesses which span two pages,
into two calls the the physical memory routines, when paging
is off, just like it always has for paging on. Besides
being more uniform, this allows the physical memory access
routines to known the complete data item is contained
within a single physical page, and stop reapplying the
A20ADDR() macro to pointers as it increments them.
Perhaps things can be optimized a little more now there too...
I renamed the routines to {read,write}PhysicalPage() as
a reminder that these routines now operate on data
solely within one page.
I also added a little code so that the paging module is
notified when the A20 line is tweaked, so it can dump
whatever mappings it wants to.
I have not tested these functions, but they model the format and
acceleration principals of the byte/word/dword functions. Give them
a try on both little/big endian machines.
so that a compare of the current access could be done more
efficiently against the cached values, both in the normal
paging routines, and in the accelerated code in access.cc.
This cut down the amount of code path needed to get to
direct use of a host address nicely, and speed definitely
got a boost as a result, especially if you use the
--enable-guest2host-tlb option.
The CR0.WP flag was a real pain, because it imparts
a complication on the way protections work. Fortunately
it's not a high-change flag, so I just base the new
cached info on the current CR0.WP value, and dump
the TLB cache when it changes.
access routines in access.cc, completing the upgrade of
those routines. You do need '--enable-guest2host-tlb', before
you get the speedups for now. The guest2host mods seem pretty
solid, though I do need to see what effects the A20 line has
on this cache and the paging TLB in general.
added --enable-repeat-speedups with default to disabled.
Reconfigure/recompile and the speedup code will be #ifdef'd
out for now. It manifested as junk written to the VGA screen
while booting/running Windows.
Also made some more mods to the main cpu loop. Moved the
handling of EXT/errorno outside the main loop, much like
the extra EIP/ESP commits were moved, for a little better
performance.
I changed the fetch_ptr/bytesleft method of fetching to
a slightly different model, which calculates a window
for which EIP will be valid (land on the current page),
and a bias which when applied to EIP will be from
0..upper_page_limit. Speed is about the same for either
method, but a pseudo-op/threaded-interpreter will plug
in better with this and be faster.
- Paging code rehash. You must now use --enable-4meg-pages to
use 4Meg pages, with the default of disabled, since we don't well
support 4Meg pages yet. Paging table walks model a real CPU
more closely now, and I fixed some bugs in the old logic.
- Segment check redundancy elimination. After a segment is loaded,
reads and writes are marked when a segment type check succeeds, and
they are skipped thereafter, when possible.
- Repeated IO and memory string copy acceleration. Only some variants
of instructions are available on all platforms, word and dword
variants only on x86 for the moment due to alignment and endian issues.
This is compiled in currently with no option - I should add a configure
option.
- Added a guest linear address to host TLB. Actually, I just stick
the host address (mem.vector[addr] address) in the upper 29 bits
of the field 'combined_access' since they are unused. Convenient
for now. I'm only storing page frame addresses. This was the
simplest for of such a TLB. We can likely enhance this. Also,
I only accelerated the normal read/write routines in access.cc.
Could also modify the read-modify-write versions too. You must
use --enable-guest2host-tlb, to try this out. Currently speeds
up Win95 boot time by about 3.5% for me. More ground to cover...
- Minor mods to CPUI/MOV_CdRd for CMOV.
- Integrated enhancements from Volker to getHostMemAddr() for PCI
being enabled.
Specific changes from the patch:
1.) renamed fdcache_eip to fdcache_ip, as it is using
the RIP instead of the EIP.
2.) added a Boolean array fdcache_is32 which uses is32
to determine icache hits. Otherwise we could run 32-bit
code as 16-bit or vice versa.
Modified Files:
config.h.in cpu/cpu.cc cpu/cpu.h memory/memory.cc
[ #433759 ] virtual address checks can overflow
> Bochs has been crashing in some cases when you try to access data which
> overlaps the segment limit, when the segment limit is near the 32-bit
> boundary. The example that came up a few times is reading/writing 4 bytes
> starting at 0xffffffff when the segment limit was 0xffffffff. The
> condition used to compare offset+length-1 with the limit, but
> offset+length-1 was overflowing so the comparison went wrong. This patch
> changes the condition so that it supports all segment limits except for
> sizes 0,1,2,3 bytes. Dave and I figured that these sizes would not be
> needed, while size 0xffffffff is used quite a lot.
in an output format similar to gdb (when you do info all-registers).
Also, if you do "info all" you get the CPU registers and the FPU
registers.
- added bx_cpu_c method called fpu_print_regs, which is implemented
in wmFPUemu_glue.cc
BX_SUPPORT_APIC were used. To follow the pattern used by other
names like this, I changed them all to BX_SUPPORT_APIC.
Thanks to Tom Lindström for chasing this down!
BX_CPU_C bx_cpu;
BX_MEM_C bx_mem;
and when more than one processor, use
BX_CPU_C *bx_cpu_array[BX_SMP_PROCESSORS];
BX_MEM_C *bx_mem_array[BX_ADDRESS_SPACES];
The changeover is controlled by BX_SMP_PROCESSORS, but there are only
a few code changes since nearly all code uses the BX_CPU(n) and BX_MEM(n)
macros.
- This turns out to make a 10% speed difference! With this revision,
the CVS version now gets 95% of the performance of the 3/25/2000
snapshot, which I've been using as my baseline.
tries to fix it. The shortcuts to register names such as AX and DL are
#defines in cpu/cpu.h, and they are defined in terms of BX_CPU_THIS_PTR.
When BX_USE_CPU_SMF=1, this works fine. (This is what bochs used for
a long time, and nobody used the SMF=0 mode at all.) To make SMP bochs
work, I had to get SMF=0 mode working for the CPU so that there could
be an array of cpus.
When SMF=0 for the CPU, BX_CPU_THIS_PTR is defined to be "this->" which
only works within methods of BX_CPU_C. Code outside of BX_CPU_C must
reference BX_CPU(num) instead.
- to try to enforce the correct use of AL/AX/DL/etc. shortcuts, they are
now only #defined when "NEED_CPU_REG_SHORTCUTS" is #defined. This is
only done in the cpu/*.cc code.
in BRANCH-smp-bochs revisions.
- The general task was to make multiple CPU's which communicate
through their APICs. So instead of BX_CPU and BX_MEM, we now have
BX_CPU(x) and BX_MEM(y). For an SMP simulation you have several
processors in a shared memory space, so there might be processors
BX_CPU(0..3) but only one memory space BX_MEM(0). For cosimulation,
you could have BX_CPU(0) with BX_MEM(0), then BX_CPU(1) with
BX_MEM(1). WARNING: Cosimulation is almost certainly broken by the
SMP changes.
- to simulate multiple CPUs, you have to give each CPU time to execute
in turn. This is currently implemented using debugger guards. The
cpu loop steps one CPU for a few instructions, then steps the
next CPU for a few instructions, etc.
- there is some limited support in the debugger for two CPUs, for
example printing information from each CPU when single stepping.
To see the commit logs for this use either cvsweb or
cvs update -r BRANCH-io-cleanup and then 'cvs log' the various files.
In general this provides a generic interface for logging.
logfunctions:: is a class that is inherited by some classes, and also
. allocated as a standalone global called 'genlog'. All logging uses
. one of the ::info(), ::error(), ::ldebug(), ::panic() methods of this
. class through 'BX_INFO(), BX_ERROR(), BX_DEBUG(), BX_PANIC()' macros
. respectively.
.
. An example usage:
. BX_INFO(("Hello, World!\n"));
iofunctions:: is a class that is allocated once by default, and assigned
as the iofunction of each logfunctions instance. It is this class that
maintains the file descriptor and other output related code, at this
point using vfprintf(). At some future point, someone may choose to
write a gui 'console' for bochs to which messages would be redirected
simply by assigning a different iofunction class to the various logfunctions
objects.
More cleanup is coming, but this works for now. If you want to see alot
of debugging output, in main.cc, change onoff[LOGLEV_DEBUG]=0 to =1.
Comments, bugs, flames, to me: todd@fries.net