32-bits rather than 64. This is possible, because there is
always an active null (heartbeat) timer, with periodicity
of less than or equal to the maximum 32-bit int value.
This generates a little less code in the hot part of cpu_loop,
and saved about 3% execution time on a Win95 boot.
Moved the asynchronous handling code from cpu_loop() to its
own function since it's a long path. This neatened up the
code a little (less gotos and all), and made it more clear
to use a "while (1)" around the iterative code in cpu_loop().
These seem to be working better, are a more simple design,
easier to understand, and AFAIK don't have race conditions
in them like the old ones do.
Re-coded the apic timer, to return cycle accurate values
which vary with each iteration of a read from a guest OS.
The previous implementation had very poor resolution. It
also didn't check the mask bit to see if an apic timer
interrupt should occur on countdown to 0. The apic timer
now calls its own bochs timer, rather than tag on the
one in iodev/devices.cc.
I needed to use one new function which is an inline in
pc_sytem.h. That would have to be added to the old pc_system.h if
we have to back-out to it.
Linux/x86-64 now boots until it hits two undefined opcodes:
FXRSTOR (0f ae). This restores FPU, MMX, XMM and MXCSR registers
from a 512-byte region of memory. We don't implement this yet.
MOVNTDQ (66 0f e7). This is a move involving an XMM register.
The 0x66 prefix is used so it's a double quadword, rather than
MOVNTQ (0f e7) which operates on a single quadword.
The Linux kernel panic is on the MOVNTQD opcodes. Perhaps that's
because that opcode is used in exception handling of the 1st?
Looks like we need to implement some new instructions.
restart another one in wxWindows. Fixed that. Also, on restart, the
apic id's left over from the first run were causing panics. Fixed that.
- modified: main.cc cpu/apic.cc cpu/cpu.h cpu/init.cc
now simply return a cached value which is set upon mode changes.
The biggest problem was protected_mode() which did something like:
return CR0.PM && ! EFLAGS.VM
This adds up when it was being executed many times in branch functions
etc. Now, cached values are set and sampled instead.
and Jas Sandys-Lumsdaine to split out common instructions into
variants which deal with the mod=11b case (Reg-Reg) and the
other cases (which do memory ops). Actually, I only split
MOV_GwEw and MOV_GdEd for now. According to some instrumentation
of a Win95 boot, they were the most frequently used opcode by far.
Some things changed in the ctrl_xfer*.cc, fetchdecode*.cc,
and cpu.cc since the original patches, so I did some patch
integration by hand. Check the placement of the
macros BX_INSTR_FETCH_DECODE_COMPLETED() and BX_INSTR_OPCODE()
in cpu.cc to make sure I go them right. Also, I changed the
parameters to BX_INSTR_OPCODE() to update them to the new code.
I put some comments before each of these to help determine if
the placement is right.
These macros are only compiled in if you are gathering instrumentation
data from bochs, so they shouldn't effect others.
Created 64-bit versions of some branch instructions and
changed fetchdecode64.cc to use them instead. This keeps the
#ifdef pollution down for 32-bit code and made fixing them
easier. They needed to clear the upper bits of RIP for
16-bit operand sizes. They also should not have had a protection
limit check in them, especially since that field is still
32-bit in cpu.h, so there's no way to set nominal 64-bit values.
The 32-bit versions were also not honoring the upper 32-bits
of RIP.
LOOPNE64_Jb
LOOPE64_Jb
LOOP64_Jb
JCXZ64_Jb
Changed all occurances of JCC_Jw/JCC_Jd in fetchdecode64.cc to
use JCC_Jq, which was coded already. Both JMP_Jq and JCC_Jq are
now fixed w.r.t. 16-bit opsizes and upper RIP bit clearing.
fetching 64-bit address opcode info, which was incorrect.
Fixed. Got rid of BxImmediate_Oq. fetchdecode64.cc now
uses BxImmediateO, like the fetch routine does. Addresses which
are embedded in the opcode, have a size which depends on
the current addressing size. For long-mode, this is
either 64 (default) or 32 (AddrSize over-ride). BxImmediate_O
now conditionally fetches based on AddrSize.
64-bit bug#2: In JMP_Jq(), when the current operand size is
16-bits, the upper dword of RIP was not being cleared. The
semantics with this case are weird - one would think the
top 48 bits would be cleared, but apparently only the top
32 bits are. Anyways, I fixed this.
Replaced some of the messy immediate fetching (byte-by-byte) in
fetchdecode64.cc with ReadHost{Q,D}WordFromLittleEndian() calls
for cleanliness. Should do this for all the cases, plus
the 32-bit stuff.
Since the SYSCALL replaces the LOADALL instruction, it is incompatible with
earlier CPU types.
At moment, the SYSCALL is only enabled by x86-64 emulation, but the code
can be incorporated in IA32 only emulations.
Instructions added:
0F 05 SYSCALL (replaces LOADALL)
0F 07 SYSRET (new)
TODO: restructure #if ... so that it can be used by non x86-64 emulations.
use getB_CF() etc. getB_CF() and friends are only for a relatively
small number of cases where a true boolean/binary number (0 or 1) is required
rather than 0 or non-0 as is returned by get_CF().
loadSRegLMNominal() which should be used to load a segment register
in long-mode with nominal values which are compatible with existing
checks and expectations for descriptor cache values.
Fixed 64-bit iret to not do a descriptor fetch if SS selector is null.
Also load SS with loadSRegLMNorminal() in the same case.
these from interfering from a normal compile here's what I did.
In config.h.in (which will generate config.h after a configure),
I added a #define called KPL64Hacks:
#define KPL64Hacks
*After* running configure, you must set this by hand. It will
default to off, so you won't get my hacks in a normal compile.
This will go away soon. There is also a macro just after that
called BailBigRSP(). You don't need to enabled that, but you
can. In many of the instructions which seemed like they could
be hit by the fetchdecode64() process, but which also touched
EIP/ESP, I inserted a macro. Usually this macro expands to nothing.
If you like, you can enabled it, and it will panic if it finds
the upper bits of RIP/RSP set. This helped me find bugs.
Also, I cleaned up the emulation in ctrl_xfer{8,16,32}.cc.
There were some really old legacy code snippets which directly
accessed operands on the stack with access_linear. Lots of
ugly code instead of just pop_32() etc. Cleaning those up,
minimized the number of instructions which directly manipulate
the stack pointer, which should help in refining 64-bit support.
which says to paste getB_ with flag and then paste with (. It should
be "getB_##flag(void)". Some preprocessors are complaining about pasting
the symbol with the paren.
of (1 & (val32>>N)), and added a getB_?F() accessor for special
cases which need a strict binary value (exactly 0 or 1). Most
code only needed a value for logical comparison. I modified the
special cases which do need a binary number for shifting and
comparison between flags, to use the special getB_?F() accessor.
Cleaned up memory.cc functions a little, now that all accesses
are within a single page.
Fixed a (not very likely encountered) bug in fetchdecode.cc (and
fetchdecode64.cc) where a 2-byte opcode starting with a prefix
starts at the last offset on a page. There were no checks
on the segment overrides for a boundary condition. I added them.
The eflags enhancements added just a tiny bit of performance.
so frequently.
Coded asm() statements for INC/DEC_ERX() instructions.
Cleaned up the iCache a litle including a bug fix. The
generation ID was decrementing the whole field including
some high meta bits. That could roll over after 1 Billion
cycles. I know only decrement if the field is valid, to
save the write.
I implemented inline functions which can serve the value of
the arithmetic flags if they are cached, and redirect to
the lazy_flags.cc routines if not.
Most of this was just prep work for adding more asm() statements
for native eflags processing when on x86.
also extended by the REX.B field on Hammer) is passed to instructions.
I rearranged the bxInstruction_c to free up a field to be used
to pass this info when mod-rm bytes are not used. This got rid
of the ugly ((i->b1 & 7) + i->rex_b) code.
Probably shaved just a very little run time off Hammer emulation,
and even less on x86-32. The resultant is a little cleaner anyways.
in cpu.cc out of the main loop, and into the asynchronous
events handling. I went through all the code paths, and
there doesn't seem to be any reason for that code to be
in the hot loop.
Added another accessor for getting instruction data, called
modC0(). A lot of instructions test whether the mod field
of mod-nnn-rm is 0xc0 or not, ie., it's a register operation
and not memory. So I flag this in fetchdecode{,64}.cc.
This added on the order of 1% performance improvement for
a Win95 boot.
Macroized a few leftover calls to Write_RMV_virtual_xyz()
that didn't get modified in the x86-64 merge. Really, they
just call the real function for now, but I want to have them
available to do direct writes with the guest2host TLB pointers.
but if you hand edit cpu/cpu.h, and change BxICacheEntries,
you can try different sizes. I'll make this more flexible
with configure. For now, use "--enable-icache" with no parameters.
- Modified fetchdecode.cc/fetchdecode64.cc just enough so that
instructions which encode a direct address now use a memory
resolution function which just sticks the immediate address
into rm_addr. With cached instructions we need this.
to bitfields. bxInstruction_c is now 24 bytes, including 4 for
the memory addr resolution function pointer, and 4 for the
execution function pointer (16 + 4 + 4).
Coded more accessors, to abstract access from most code.
with accessors. Had to touch a number of files to update the
access using the new accessors.
Moved rm_addr to the CPU structure, to slim down bxInstruction_c
and to prevent future instruction caching from getting sprayed
with writes to individual rm_addr fields. There only needs to
be one. Though need to deal with instructions which have
static non-modrm addresses, but which are using rm_addr since
that will change.
bxInstruction_c is down to about 40 bytes now. Trying to
get down to 24 bytes.
use accessors. This lets me work on compressing the
size of fetch-decode structure (now called bxInstruction_c).
I've reduced it down to about 76 bytes. We should be able
to do much better soon. I needed the abstraction of the
accessors, so I have a lot of freedom to re-arrange things
without making massive future changes.
Lost a few percent of performance in these mods, but my
main focus was to get the abstraction.
no longer used. Also rearranged that struct a little
to be more compressed. Over time, I'm going to reduce
it further, for use with future accelerations.
enhancement to bochs. You can now configure with
--enable-guest2host-tlb.
Force the support of big pages (PSE) when x86-64 is configured.
Reverted back to only one kind of TLB entry style, since everything
is ported.
Fixed one bug in io.cc with as_64 and the index registers.
There are others, as noticed by Peter.
class declaration, for example:
static const unsigned os_64=0, as_64=0;
After reading some suggestions on usenet, I changed these into
enums instead, like this:
enum { os_64=0, as_64=0 };
printing a message when a reserved bit was set, but not causing
a #GP(0). As well, I force a new PAE support option to 1 when
Hammer support is enabled.
called cpu_mode. Now there is one for cpu32, but it is declared:
static const unsigned cpu_mode=BX_MODE_IA32;
This way the compiler can compile-out if-then-else clauses based
on it, allowing for easier code sharing.
member functions are turned on, BX_CPU_C_PREFIX expands to nothing, and any
method that uses BX_CPU_C_PREFIX instead of explictly writing "BX_CPU_C::"
will not be a member function at all. This makes it impossible for code
outside the BX_CPU_C object to call the accessor because sometimes the method
is at ptr_to_cpu->get_EIP() and other times you'd have to do just get_EIP().
The only way I've found to solve this is to remove the BX_CPU_C_PREFIX
and write BX_CPU_C:: instead.
- in debug/dbg_main.cc I removed the EBP, EIP, ESP, SP shortcuts. Now
the accessors are used everywhere. Also I replaced a reference to
the short-lived get_erx() accessor with ones that work: get_EAX(), etc.
- with these changes the current cvs compiles with any combination of
debugger enabled/disabled, SMP enabled/disabled, and x86-64 enabled/disabled.
BX_READ_8BIT_REG() --> BX_READ_8BIT_REGx()
BX_WRITE_8BIT_REG() --> BX_WRITE_8BIT_REGx()
They use an extra parameter "extended". I coded this
as the macro without the "x" for cpu32 compiles. This
allows for ease of merging and code sharing.