You need to use '--enable-global-pages' to configure in support.
If you have something to boot that uses them, give them a
spin. Really they were introduced for PPro and above, but
I haven't put in any limits. CPUID and CR4 report the proper
bits when configured, regardless of --enable-cpu-level at the
moment.
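For reference, a minimal sketch of the two bits involved, assuming the
standard x86 layout (CR4.PGE is bit 7; the PGE feature flag is bit 13 of
EDX for CPUID leaf 1). The constant and function names are illustrative
only and are not the identifiers used in the source tree.

```cpp
#include <cstdint>

// Architectural bit positions; the names themselves are hypothetical.
constexpr std::uint32_t kCr4Pge      = 1u << 7;   // CR4.PGE: global pages enabled
constexpr std::uint32_t kCpuidEdxPge = 1u << 13;  // CPUID.1:EDX.PGE: feature present

// Report the PGE feature bit only when global-page support is configured in.
constexpr std::uint32_t cpuidEdxFeatures(std::uint32_t base, bool globalPagesConfigured) {
    return globalPagesConfigured ? (base | kCpuidEdxPge) : (base & ~kCpuidEdxPge);
}

// True if the guest has turned global pages on in its CR4 image.
constexpr bool globalPagesActive(std::uint32_t cr4) {
    return (cr4 & kCr4Pge) != 0;
}
```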
if off, we were still reading CR3 from the TSS and reloading
it! This was causing problems with a DOS extender. When
paging is turned back on, CR3 would be incorrect.
with GCC) align them with the GCC special alignment attribute.
Since there was then one available field, I split the protection
attributes and native host pointers into their own fields.
Before, with 3 dwords per TLB entry, some entries (about 3/8)
were spanning two processor cache lines (assuming a 32-byte
cache line). Now, they all fit within one cache line.
Knocked about 1.4% off Win95 boot time, probably more off normal
software runs.
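A minimal sketch of the layout idea, assuming a 32-bit host pointer and
a 32-byte cache line. The struct and field names are hypothetical, and
it uses the standard alignas specifier where the text above uses the
GCC-specific attribute.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 4-dword TLB entry, aligned so it never straddles a cache line.
struct alignas(16) TlbEntry {
    std::uint32_t lpf;           // linear page frame
    std::uint32_t ppf;           // physical page frame
    std::uint32_t accessBits;    // protection attributes, now in their own field
    std::uint32_t hostPageAddr;  // native host pointer (32-bit host assumed)
};

static_assert(sizeof(TlbEntry) == 16, "entry must be exactly 4 dwords");

// True if an object of `size` bytes at byte offset `offset` crosses a
// cache-line boundary.
constexpr bool spansCacheLine(std::size_t offset, std::size_t size,
                              std::size_t lineSize = 32) {
    return (offset / lineSize) != ((offset + size - 1) / lineSize);
}
```

With 16-byte entries at 16-byte-aligned offsets, spansCacheLine() is
false for every entry; a 12-byte entry at offset 24 does cross a line.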
BX_READ not 0. BX_READ was 10. While I was at it, I did
change BX_{READ,WRITE,RW} to {0,1,2} rather than {10,11,12}
in case that helps optimize code.
There may be more paging checks we should do before changing
any state, to avoid receiving a page fault in the middle.
I put some extra comments in there.
to request bulk IO operations to IO devices which are bulk IO aware.
Currently, I modified only harddrv.cc to be aware. I added some
fields to the bx_devices_c class for the IO instructions to
place requests and receive responses from the IO device emulation.
Devices other than the hard drive don't monitor these fields, so they
respond as normal. The hard drive now monitors these fields for
bulk requests, and if enabled, it memcpy()'s data straight from
the disk buffer to memory. This eliminates numerous inp/outp calling
sequences per disk sector.
I used the fields in bx_devices_c so that I would not have to
disrupt most IO device modules. Enhancements can be made to
other devices if they use high-bandwidth IO via in/out instructions.
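A minimal sketch of the request/response handshake, assuming a single
shared request block. BulkIoRequest and diskBulkRead are hypothetical
names for illustration; the actual fields in bx_devices_c are not
reproduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical request block shared between the IO instruction and devices.
struct BulkIoRequest {
    bool          enabled     = false;    // set by an IO instruction that can go bulk
    std::uint8_t* hostAddr    = nullptr;  // host pointer to the guest memory destination
    std::size_t   bytesWanted = 0;        // how much the instruction would like to move
    std::size_t   bytesDone   = 0;        // response, filled in by a bulk-aware device
};

// Hypothetical bulk-aware read: copy straight from the disk buffer to memory.
// Returns the number of bytes moved; 0 means "fall back to one-at-a-time IO".
std::size_t diskBulkRead(BulkIoRequest& req, const std::uint8_t* sectorBuf,
                         std::size_t bufPos, std::size_t bufLen) {
    req.bytesDone = 0;
    if (!req.enabled || req.hostAddr == nullptr || bufPos >= bufLen) return 0;
    std::size_t n = bufLen - bufPos;          // bytes left in the disk buffer
    if (n > req.bytesWanted) n = req.bytesWanted;
    std::memcpy(req.hostAddr, sectorBuf + bufPos, n);
    req.bytesDone = n;
    return n;
}
```

A device that ignores the request block never sets bytesDone, so the
instruction just proceeds with its normal per-element path.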
All the EFLAGS bits used to be cached in separate fields. I left
a few of them in separate fields for now - might remove them
at some point also. When the arithmetic fields are known
(ie they're not in lazy mode), they are all cached in a
32-bit EFLAGS image, just like the x86 EFLAGS register expects.
All other eflags are stored in the 32-bit register also, with
a few also mirrored in separate fields for now.
The reason I did this was so that on x86 hosts, asm() statements
can be #ifdef'd in to do the calculation and get the native
eflags results very cheaply. Just to test that it works, I
coded ADD_EdId() and ADD_EwIw() with some conditionally compiled
asm()s for accelerated eflags processing and it works.
-Kevin
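As an illustration of the single-image approach described above, here is
a portable sketch that computes the arithmetic flags of a 32-bit ADD and
merges them into an EFLAGS image. It is the generic C++ fallback, not
the asm() path; the bit positions are the architectural ones, while the
function and constant names are made up for this example.

```cpp
#include <cstdint>

// Architectural EFLAGS bit positions for the arithmetic flags.
constexpr std::uint32_t kCF = 1u << 0, kPF = 1u << 2, kAF = 1u << 4,
                        kZF = 1u << 6, kSF = 1u << 7, kOF = 1u << 11;
constexpr std::uint32_t kArithMask = kCF | kPF | kAF | kZF | kSF | kOF;

// Compute the flags for r = a + b and merge them into the given image,
// leaving all non-arithmetic bits of the image untouched.
std::uint32_t addFlags32(std::uint32_t eflags, std::uint32_t a, std::uint32_t b) {
    const std::uint32_t r = a + b;
    std::uint32_t f = 0;
    if (r < a) f |= kCF;                                // unsigned carry out
    std::uint32_t p = r & 0xFFu;                        // parity of the low byte
    p ^= p >> 4; p ^= p >> 2; p ^= p >> 1;
    if ((p & 1u) == 0) f |= kPF;                        // PF set on even parity
    if ((a ^ b ^ r) & 0x10u) f |= kAF;                  // carry out of bit 3
    if (r == 0) f |= kZF;
    if (r & 0x80000000u) f |= kSF;
    if ((~(a ^ b) & (a ^ r)) & 0x80000000u) f |= kOF;   // signed overflow
    return (eflags & ~kArithMask) | f;
}
```

On an x86 host the body could be replaced by an ADD followed by a read
of the native flags, which is the shortcut the 32-bit image makes cheap.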
it can decide how to proceed. Some of those bits are necessary
to make TLB invalidation decisions. INVLPG doesn't cause
a whole TLB flush anymore, just one page. Some of the
current CPU behaviours model the P6, especially on CR0
reloads. Earlier processors kept some pre-change pre-fetched
instructions until a branch. We could probably model that
by setting a flag, and letting the revalidate_prefetch_q
function cause serialization.
The TLB flush code only invalidates entries which are not
already invalidated for the case where the TLB invalidation
ID trick is not in use.
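A minimal sketch of single-page versus whole-TLB invalidation, assuming
a direct-mapped TLB with 4K pages and no invalidation-ID trick. The
table size, tag encoding and all names are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint32_t kTlbSize    = 1024;         // hypothetical size, power of two
constexpr std::uint32_t kInvalidLpf = 0xFFFFFFFFu;  // never equals a page-aligned address

struct Tlb {
    std::uint32_t lpf[kTlbSize];  // linear page frame tag per slot

    Tlb() { for (std::uint32_t& tag : lpf) tag = kInvalidLpf; }

    static std::uint32_t slotOf(std::uint32_t laddr)  { return (laddr >> 12) & (kTlbSize - 1); }
    static std::uint32_t frameOf(std::uint32_t laddr) { return laddr & 0xFFFFF000u; }

    void insert(std::uint32_t laddr)    { lpf[slotOf(laddr)] = frameOf(laddr); }
    bool hit(std::uint32_t laddr) const { return lpf[slotOf(laddr)] == frameOf(laddr); }

    // INVLPG: drop only the entry that maps this page, if it is cached.
    void invlpg(std::uint32_t laddr) {
        std::uint32_t& tag = lpf[slotOf(laddr)];
        if (tag == frameOf(laddr)) tag = kInvalidLpf;
    }

    // Full flush: write only to entries that are not already invalid.
    void flushAll() {
        for (std::uint32_t& tag : lpf)
            if (tag != kInvalidLpf) tag = kInvalidLpf;
    }
};
```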
Read-Modify-Write instructions. The first read phase stores
the host pointer in the "pages" field if a direct use pointer
is available. The Write phase first checks if a pointer was
issued and uses it for a direct write if available.
I chose the "pages" field since it needs to be checked by the
write_RMW_virtual variants anyways and thus needs to be
cached anyways.
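A toy model of the two phases, assuming a flat byte array for guest
memory. GuestMem and its members are hypothetical; rmwHostPtr stands in
for the "pages" field mentioned above.

```cpp
#include <cstdint>

struct GuestMem {
    std::uint8_t  ram[4096] = {};
    bool          directOk   = true;     // may a direct host pointer be handed out?
    std::uint8_t* rmwHostPtr = nullptr;  // stands in for the "pages" field
    std::uint32_t rmwAddr    = 0;        // cached translation for the slow path
    unsigned      slowWrites = 0;        // counts fall-backs, for illustration only

    // Read phase: remember the host pointer if one is available.
    std::uint8_t readRmwByte(std::uint32_t addr) {
        rmwAddr    = addr % sizeof ram;
        rmwHostPtr = directOk ? &ram[rmwAddr] : nullptr;
        return ram[rmwAddr];
    }

    // Write phase: use the pointer issued by the read phase when there is
    // one, otherwise take the slower path through the cached translation.
    void writeRmwByte(std::uint8_t value) {
        if (rmwHostPtr != nullptr) { *rmwHostPtr = value; return; }
        ++slowWrites;
        ram[rmwAddr] = value;
    }
};
```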
Mostly the mods were to access.cc, but I did also macro-ize
the calls to write_RMW_virtual...() in files which use it
and cpu.h. Right now, the macro is just a straight pass-through.
I tried expanding it to a quick initial check for the pointer
availability to do the write in-place, with a function call
as a fall-back. That didn't seem to matter at all.
Booting is not helped by this really. The upper bound of
the gain is 5 or 6%, and that's only if you have a loop that
looks like:
label:
add [eax], ebx ;; mega read-modify-write instruction
jmp label ;; intensive loop.
Kevin Lawton says he doesn't get a performance benefit.
I'm not sure if I do. Either way, the difference isn't
very large.
This code may get removed if it turns out to be useless.
- modified files: config.h.in cpu/init.cc debug/dbg_main.cc gui/control.cc
gui/siminterface.cc gui/siminterface.h gui/wxdialog.cc gui/wxdialog.h
gui/wxmain.cc gui/wxmain.h iodev/keyboard.cc
----------------------------------------------------------------------
Patch name: patch.wx-show-cpu2
Author: Bryce Denney
Date: Fri Sep 6 12:13:28 EDT 2002
Description:
Second try at implementing the "Debug:Show Cpu" and "Debug:Show
Keyboard" dialog with values that change as the simulation proceeds.
(Nobody gets to see the first try.) This is the first step toward
making something resembling a wxWindows debugger.
First, variables which are going to be visible in the CI must be
registered as parameters. For some variables, it might be acceptable
to change them from Bit32u into bx_param_num_c and access them only
with set/get methods, but for most variables it would be a horrible
pain and wreck performance.
To deal with this, I introduced the concept of a shadow parameter. A
normal parameter has its value stored inside the struct, but a shadow
parameter has only a pointer to the value. Shadow params allow you to
treat any variable as if it were a parameter, without having to change
its type and access it using get/set methods. Of course, a shadow
param's value is controlled by someone else, so it can change at any
time.
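A stripped-down sketch of the idea, assuming a numeric parameter with
virtual get/set. NumParam and ShadowNumParam are hypothetical stand-ins
and leave out everything the real parameter classes carry (names,
ranges, handlers).

```cpp
#include <cstdint>

// Normal parameter: the value lives inside the object.
class NumParam {
public:
    explicit NumParam(std::int32_t initial = 0) : value_(initial) {}
    virtual ~NumParam() = default;
    virtual std::int32_t get() const { return value_; }
    virtual void set(std::int32_t v) { value_ = v; }
private:
    std::int32_t value_;
};

// Shadow parameter: only a pointer to a variable owned by someone else.
class ShadowNumParam : public NumParam {
public:
    explicit ShadowNumParam(std::int32_t* target) : target_(target) {}
    std::int32_t get() const override { return *target_; }
    void set(std::int32_t v) override { *target_ = v; }
private:
    std::int32_t* target_;  // not owned; may change at any time
};
```

The owner keeps using its plain variable at full speed, while anything
that understands NumParam can display or edit it through the shadow.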
To demonstrate and test the registration of shadow parameters, I
added code in cpu/init.cc to register a few CPU registers and
code in iodev/keyboard.cc to register a few keyboard state values.
Now these parameters are visible in the Debug:Show CPU and
Debug:Show Keyboard dialog boxes.
The Debug:Show* dialog boxes are created by the ParamDialog class,
which already understands how to display each type of parameter,
including the new shadow parameters (because they are just a subclass
of a normal parameter class). I have added a ParamDialog::Refresh()
method, which rereads the value from every parameter that it is
displaying and changes the displayed value. At the moment, in the
Debug:Show CPU dialog, changing the values has no effect. However
this is trivial to add when it's time (just call CommitChanges!). It
wouldn't really make sense to change the values unless you have paused
the simulation, for example when single stepping with the debugger.
The Refresh() method must be called periodically or else the dialog
will show the initial values forever. At the moment, Refresh() is
called when the simulator sends an async event called
BX_ASYNC_EVT_REFRESH, created by a call to SIM->refresh_ci ().
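A single-threaded sketch of that refresh path, assuming the event is
queued and handled later by the interface. In the real setup the event
crosses from the simulator thread to the GUI thread; the types and names
here are hypothetical, and the "does anyone want this" check is included
so that no event is queued when no dialog is showing.

```cpp
#include <deque>

enum class AsyncEvent { Refresh };

// Hypothetical config-interface side.
struct ConfigInterface {
    int dialogsNeedingRefresh = 0;   // open dialogs that show live values
    int refreshesHandled      = 0;   // how many times dialogs were re-read
    std::deque<AsyncEvent> inbox;    // events waiting for the GUI side

    bool wantRefresh() const { return dialogsNeedingRefresh > 0; }

    // GUI side: drain the inbox and refresh dialogs once per event.
    void processEvents() {
        while (!inbox.empty()) {
            if (inbox.front() == AsyncEvent::Refresh) ++refreshesHandled;
            inbox.pop_front();
        }
    }
};

// Simulator side: only send an event if somebody is going to use it.
void refreshCi(ConfigInterface& ci) {
    if (ci.wantRefresh()) ci.inbox.push_back(AsyncEvent::Refresh);
}
```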
Details:
- implement shadow parameter class for Bit32s, called bx_shadow_num_c.
implement shadow parameter class for Boolean, called bx_shadow_bool_c.
more to follow (I need one for every type!)
- now the simulator thread can request that the config interface refresh
its display. For now, the refresh event causes the CI to check every
parameter it is watching and change the display value. Later, it may
be worth the trouble to keep track of which parameters have actually
changed. Code in the simulator thread calls SIM->refresh_ci(), which
creates an async event called BX_ASYNC_EVT_REFRESH and sends it to
the config interface. When it arrives in the wxWindows gui thread,
it calls RefreshDialogs(), which calls the Refresh() method on any
dialogs that might need it.
- in the debugger, SIM->refresh_ci() is called before every prompt
is printed. Otherwise, the refresh would wait until the next
SIM->periodic(), which might be thousands of cycles. This way,
when you're single stepping, the dialogs update with every step.
- To improve performance, the CI has a flag (MyFrame::WantRefresh())
which tells whether it has any need for refresh events. If no
dialogs are showing that need refresh events, then no event is sent
between threads.
- add a few defaults to the param classes that affect the settings of
newly created parameters. When declaring a lot of params with
similar settings it's more compact to set the default for new params
rather than to change each one separately. default_text_format is
the printf format string for displaying numbers. default_base is
the default base for displaying numbers (10, 16, 2, etc.)
- I added to ParamDialog to make it able to display modeless dialog
boxes such as "Debug:Show CPU". The new Refresh() method queries
all the parameters for their current value and changes the value in
the wxWindows control. The ParamDialog class still needs a little
work; for example, if it's modal it should have Cancel/Ok buttons,
but if it's going to be modeless it should maybe have Apply (commit
any changes) and Close.
- bx_gen_reg cannot be declared with BX_SMF or it can't read gen_reg
when static member functions are turned on.
- use "BX_CPU_C_PREFIX" instead of "BX_CPU_C::" for get_segment_base.
- the SMF (static member function) tricks are just plain weird. The only way to
really be sure that you're not breaking something is to try compiling it with
SMF on and with SMF off. e.g. "configure && make" and
"configure --enable-processors=2 && make".
direct reads/writes from native variables to the x86 (guest)
memory image. Look at the end of bochs.h. Don't know if that's
the right place to put them, but here you can extend these
macros to platform-specific asm() code if you like, or just
use the generic C code I supplied. Some platforms have special
instructions for byte-order swapping etc. Also, you can't
make any assumptions about the alignment of the pointers
passed.
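A generic sketch of such accessors, written as inline functions rather
than macros. The names are hypothetical, not the ones in bochs.h. They
work byte by byte, so they are independent of host byte order and make
no assumption about pointer alignment.

```cpp
#include <cstdint>

// Read a 32-bit value stored in x86 (little-endian) order in guest memory.
inline std::uint32_t readGuestDword(const std::uint8_t* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Write a native 32-bit value into guest memory in x86 order.
inline void writeGuestDword(std::uint8_t* p, std::uint32_t v) {
    p[0] = static_cast<std::uint8_t>(v);
    p[1] = static_cast<std::uint8_t>(v >> 8);
    p[2] = static_cast<std::uint8_t>(v >> 16);
    p[3] = static_cast<std::uint8_t>(v >> 24);
}
```

A platform-specific version could swap in a byte-swap instruction or a
plain load/store where the host allows unaligned little-endian access.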
mode uses the notion of the guest-to-host TLB. This has the
benefit of allowing more uniform and streamlined acceleration
code in access.cc which does not have to check if CR0.PG
is set, eliminating a few instructions per guest access.
Shaved just a little off execution time, as expected.
Also, access_linear now breaks accesses which span two pages,
into two calls to the physical memory routines, when paging
is off, just like it always has for paging on. Besides
being more uniform, this allows the physical memory access
routines to know the complete data item is contained
within a single physical page, and stop reapplying the
A20ADDR() macro to pointers as it increments them.
Perhaps things can be optimized a little more now there too...
I renamed the routines to {read,write}PhysicalPage() as
a reminder that these routines now operate on data
solely within one page.
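The page-splitting step can be sketched as follows, assuming 4K pages
and an access no longer than one page. Chunk and splitAtPage are
hypothetical names; each resulting chunk would be handed to its own
{read,write}PhysicalPage() call.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kPageSize = 4096;

struct Chunk {
    std::uint32_t addr;  // start address of this piece
    std::size_t   len;   // bytes in this piece, all within one page
};

// Split [addr, addr + len) at a page boundary.
// Returns the number of chunks written to `out` (1 or 2).
// Assumes 0 < len <= kPageSize.
inline int splitAtPage(std::uint32_t addr, std::size_t len, Chunk out[2]) {
    const std::uint32_t offset = addr & (kPageSize - 1);
    const std::size_t   room   = kPageSize - offset;  // bytes left in the first page
    if (len <= room) {
        out[0] = Chunk{addr, len};
        return 1;
    }
    out[0] = Chunk{addr, room};
    out[1] = Chunk{addr + static_cast<std::uint32_t>(room), len - room};
    return 2;
}
```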
I also added a little code so that the paging module is
notified when the A20 line is tweaked, so it can dump
whatever mappings it wants to.
I have not tested these functions, but they model the format and
acceleration principles of the byte/word/dword functions. Give them
a try on both little/big endian machines.
so that a compare of the current access could be done more
efficiently against the cached values, both in the normal
paging routines, and in the accelerated code in access.cc.
This cut down the amount of code path needed to get to
direct use of a host address nicely, and speed definitely
got a boost as a result, especially if you use the
--enable-guest2host-tlb option.
The CR0.WP flag was a real pain, because it imparts
a complication on the way protections work. Fortunately
it's not a high-change flag, so I just base the new
cached info on the current CR0.WP value, and dump
the TLB cache when it changes.
checks were not honoring the EFLAGS.DF bit, but assuming it was always
equal to 0 (increment upward). Plus some general cleanup of the
acceleration code.
I left '--enable-repeat-speedups' disabled by default, but
it seems pretty solid. Definitely adds performance for disk
heavy workloads.
access routines in access.cc, completing the upgrade of
those routines. For now, you need '--enable-guest2host-tlb' to
get the speedups. The guest2host mods seem pretty
solid, though I do need to see what effects the A20 line has
on this cache and the paging TLB in general.
added --enable-repeat-speedups, disabled by default.
Reconfigure/recompile and the speedup code will be #ifdef'd
out for now. It manifested as junk written to the VGA screen
while booting/running Windows.
Also made some more mods to the main cpu loop. Moved the
handling of EXT/errorno outside the main loop, much like
the extra EIP/ESP commits were moved, for a little better
performance.
I changed the fetch_ptr/bytesleft method of fetching to
a slightly different model, which calculates a window
for which EIP will be valid (land on the current page),
and a bias which when applied to EIP will be from
0..upper_page_limit. Speed is about the same for either
method, but a pseudo-op/threaded-interpreter will plug
in better with this and be faster.
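A minimal sketch of the window/bias calculation, assuming 4K pages and a
flat 32-bit linear address formed as segment base plus EIP. FetchWindow,
makeWindow and eipInWindow are hypothetical names.

```cpp
#include <cstdint>

struct FetchWindow {
    std::uint32_t bias;    // added to EIP to give an offset into the current page
    std::uint32_t window;  // number of byte offsets that are valid (one page here)
};

// Build the window for the page that csBase + eip currently lands on.
inline FetchWindow makeWindow(std::uint32_t csBase, std::uint32_t eip) {
    const std::uint32_t laddr     = csBase + eip;
    const std::uint32_t pageStart = laddr & ~0xFFFu;
    // Chosen so that eip + bias == laddr - pageStart (mod 2^32).
    return FetchWindow{csBase - pageStart, 4096u};
}

// One unsigned compare checks both the lower and the upper page limit.
inline bool eipInWindow(const FetchWindow& w, std::uint32_t eip) {
    return (eip + w.bias) < w.window;
}
```

Because the arithmetic is unsigned, an EIP below the page start wraps to
a huge biased value and fails the same compare as one past the page end.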
- Paging code rehash. You must now use --enable-4meg-pages to
use 4Meg pages; it is disabled by default, since we don't support
4Meg pages well yet. Paging table walks model a real CPU
more closely now, and I fixed some bugs in the old logic.
- Segment check redundancy elimination. After a segment is loaded,
reads and writes are marked when a segment type check succeeds, and
they are skipped thereafter, when possible.
- Repeated IO and memory string copy acceleration. Only some instruction
variants are available on all platforms; word and dword variants are
x86-only for the moment due to alignment and endian issues.
This is compiled in currently with no option - I should add a configure
option.
- Added a guest linear address to host TLB. Actually, I just stick
the host address (mem.vector[addr] address) in the upper 29 bits
of the field 'combined_access' since they are unused. Convenient
for now. I'm only storing page frame addresses. This was the
simplest form of such a TLB. We can likely enhance this. Also,
I only accelerated the normal read/write routines in access.cc.
Could also modify the read-modify-write versions too. You must
use --enable-guest2host-tlb to try this out. Currently speeds
up Win95 boot time by about 3.5% for me. More ground to cover...
- Minor mods to CPUID/MOV_CdRd for CMOV.
- Integrated enhancements from Volker to getHostMemAddr() for PCI
being enabled.
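The guest-to-host TLB item above packs a host address into the otherwise
unused upper bits of an existing field. A minimal sketch of that kind of
packing, assuming the low 3 bits hold access flags and the host page
address is at least 8-byte aligned. It uses uintptr_t so it also works
on 64-bit hosts; the function names are hypothetical.

```cpp
#include <cstdint>

constexpr std::uintptr_t kAccessMask = 0x7;  // low 3 bits: access/protection flags

// Combine a host page address with the access flags in one word.
inline std::uintptr_t packEntry(const void* hostPage, unsigned accessBits) {
    return (reinterpret_cast<std::uintptr_t>(hostPage) & ~kAccessMask)
         | (static_cast<std::uintptr_t>(accessBits) & kAccessMask);
}

// Recover the host page address from the upper bits.
inline std::uint8_t* hostPageOf(std::uintptr_t combined) {
    return reinterpret_cast<std::uint8_t*>(combined & ~kAccessMask);
}

// Recover the access flags from the low bits.
inline unsigned accessOf(std::uintptr_t combined) {
    return static_cast<unsigned>(combined & kAccessMask);
}
```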
for BX_CPU_LEVEL >= 6, and to have the CMOV instructions generate
an undefined opcode exception after printing info that they were
called, if BX_CPU_LEVEL <= 5. I suppose we could have a separate
configure option, but mirroring Intel, CMOV is available as of
Pentium Pro.
For now, you have to compile with --enable-cpu-level=6 for CMOV
support to be compiled in.
Specific changes from the patch:
1.) renamed fdcache_eip to fdcache_ip, as it is using
the RIP instead of the EIP.
2.) added a Boolean array fdcache_is32 which uses is32
to determine icache hits. Otherwise we could run 32-bit
code as 16-bit or vice versa.
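A minimal sketch of why the mode flag has to be part of the hit test,
assuming a direct-mapped decode cache. DecodeCache, its size and its
members are hypothetical and only mirror the roles of fdcache_ip and
fdcache_is32.

```cpp
#include <cstdint>

constexpr unsigned kCacheSize = 256;  // hypothetical size

struct DecodeCache {
    std::uint64_t ip[kCacheSize];     // instruction pointer tag (role of fdcache_ip)
    bool          is32[kCacheSize];   // mode the entry was decoded in (fdcache_is32)
    bool          valid[kCacheSize];

    DecodeCache() {
        for (unsigned i = 0; i < kCacheSize; ++i) {
            ip[i] = 0; is32[i] = false; valid[i] = false;
        }
    }

    static unsigned slotOf(std::uint64_t addr) {
        return static_cast<unsigned>(addr % kCacheSize);
    }

    void insert(std::uint64_t addr, bool mode32) {
        const unsigned s = slotOf(addr);
        ip[s] = addr; is32[s] = mode32; valid[s] = true;
    }

    // A hit needs the same address AND the same operand-size mode;
    // otherwise 32-bit code could be replayed as 16-bit or vice versa.
    bool hit(std::uint64_t addr, bool mode32) const {
        const unsigned s = slotOf(addr);
        return valid[s] && ip[s] == addr && is32[s] == mode32;
    }
};
```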
Modified Files:
config.h.in cpu/cpu.cc cpu/cpu.h memory/memory.cc