Bochs/bochs-performance/testcases/README

In this directory, I'm developing test cases to study
different options for improving Bochs performance.

ideas:

write a loop or nested loop that takes some serious computing, that takes
  maybe 15 seconds to run
benchmark on several machines
compile with optimization and understand the assembly
reimplement with switch stmt, function pointers, etc.
benchmark again to measure the overhead of switch & function calls
test on several machines
reimplement with goto *label (gcc extension) style
benchmark again to measure the overhead of goto *label dispatch
test on several machines
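
As a starting point for that comparison, here is a minimal sketch of the
same tiny interpreter loop written three ways: a switch, a table of
function pointers, and gcc's goto *label extension.  The opcodes
(OP_ADD, OP_DOUBLE, OP_HALT) and all function names are invented for
illustration and are not taken from the Bochs source.  The third variant
needs gcc or a compiler that supports the same extension.

```c
/* Three dispatch styles over the same made-up opcode stream.
 * Opcodes and names are hypothetical, not from the Bochs source. */
enum { OP_ADD, OP_DOUBLE, OP_HALT };

/* 1. switch statement */
int run_switch(const unsigned char *code)
{
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_ADD:    acc += 1; break;
        case OP_DOUBLE: acc *= 2; break;
        default:        return acc;   /* OP_HALT or unknown opcode */
        }
    }
}

/* 2. table of function pointers, one call per opcode */
static void do_add(int *acc)    { *acc += 1; }
static void do_double(int *acc) { *acc *= 2; }

int run_fnptr(const unsigned char *code)
{
    static void (*const table[])(int *) = { do_add, do_double };
    int acc = 0;
    while (*code < OP_HALT)           /* stop at OP_HALT or anything above */
        table[*code++](&acc);
    return acc;
}

/* 3. goto *label (gcc "labels as values" extension).
 * Assumes every opcode in the stream is valid; no bounds check. */
int run_goto(const unsigned char *code)
{
    static void *const labels[] = { &&l_add, &&l_double, &&l_halt };
    int acc = 0;
    goto *labels[*code++];
l_add:
    acc += 1;
    goto *labels[*code++];
l_double:
    acc *= 2;
    goto *labels[*code++];
l_halt:
    return acc;
}
```

All three return the same result for the same opcode stream, so timing
each one over a long stream should isolate the dispatch cost of each
style.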

Or, by working more directly with the Bochs source, we can test the
performance benefit of the "cut-and-paste native code" method on
a small scale.

---------------------------------------------------
Performance measurements

blur.c 1.2
-----------
estimated 1.43 million instructions per iteration (-O2)
  (see blur-1.2-performance.gnumeric spreadsheet for details)
on Athlon 750 with no optimization, 4.0622 ms per iteration
on Athlon 750 with -O2, 1.795138 ms per iteration
on P2 350 with no optimization, 6.162224 ms per iteration
on P2 350 with -O2, 2.5258 ms per iteration
on Bochs with -O2 on Athlon 750, 478 ms per iteration

Conclusion
Bochs is about 270x slower than native code (478 ms versus 1.795 ms per
iteration, both with -O2 on the Athlon 750).
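
The ratios behind this conclusion can be checked with a few lines of C.
The inputs are the measurements listed above; the helper names are made
up for this sketch.

```c
/* Ratios derived from the blur.c 1.2 timings (ms per iteration).
 * Function names are illustrative only. */

/* How many times slower one timing is than another. */
double slowdown(double slow_ms, double fast_ms)
{
    return slow_ms / fast_ms;
}

/* Native throughput in instructions per nanosecond, from the
 * estimated instruction count per iteration. */
double instr_per_ns(double instr_per_iter, double ms_per_iter)
{
    return instr_per_iter / (ms_per_iter * 1e6);   /* 1 ms = 1e6 ns */
}
```

slowdown(478.0, 1.795138) comes out near 266, which rounds to the 270x
quoted above.  instr_per_ns(1.43e6, 1.795138) comes out near 0.8, but it
is only as good as the 1.43 million instruction estimate.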

Try replacing the innermost loop of blur() with a function call
that does the same thing.  Continue to compile with -O2.

if innermost loop is "sum += array[x2][y2]", 1.796 ms per iter.
if innermost loop is "blur_func (&sum, &array[x2][y2])", 3.526978 ms per iter.
function with three arguments: 3.784 ms per iter.
function with four arguments: 4.390 ms per iter.
function with five arguments: 4.879 ms per iter.
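
For reference, here is a rough reconstruction of the two variants being
compared.  blur.c itself is not reproduced in this file, so the array
size (128x128 ints), the loop bounds (1..126) and the signature of
blur_func are inferred from the -O2 assembly listing further down and
from the call expression quoted above; treat them as assumptions.  The
call counter and the names BLUR_DIM and blur_with_call exist only in
this sketch.

```c
/* Sketch of blur() with the innermost statement inline and as a call.
 * Dimensions and bounds are inferred from the assembly listing
 * (512-byte rows, compares against 126); the real blur.c may differ. */
#define BLUR_DIM 128

int array[BLUR_DIM][BLUR_DIM];
int array2[BLUR_DIM][BLUR_DIM];
long blur_func_calls;            /* instrumentation, not in the original */

/* Innermost work moved into a function with two pointer arguments. */
void blur_func(int *sum, int *data)
{
    *sum += *data;
    blur_func_calls++;
}

/* Variant 1: innermost loop is "sum += array[x2][y2]". */
void blur(void)
{
    int x, y, x2, y2, sum;
    for (x = 1; x < BLUR_DIM - 1; x++)
        for (y = 1; y < BLUR_DIM - 1; y++) {
            sum = 0;
            for (x2 = x - 1; x2 <= x + 1; x2++)
                for (y2 = y - 1; y2 <= y + 1; y2++)
                    sum += array[x2][y2];
            array2[x][y] = sum;
        }
}

/* Variant 2: innermost loop is "blur_func (&sum, &array[x2][y2])". */
void blur_with_call(void)
{
    int x, y, x2, y2, sum;
    for (x = 1; x < BLUR_DIM - 1; x++)
        for (y = 1; y < BLUR_DIM - 1; y++) {
            sum = 0;
            for (x2 = x - 1; x2 <= x + 1; x2++)
                for (y2 = y - 1; y2 <= y + 1; y2++)
                    blur_func(&sum, &array[x2][y2]);
            array2[x][y] = sum;
        }
}
```

With these bounds the innermost statement runs 126*126*9 = 142884 times
per pass, which matches the call count used in the overhead calculation.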

What is the overhead in terms of instructions?
func_overhead(ARGS) for a function with ARGS arguments (independent of
the number of instructions N in the function body):
  1.5*ARGS to push them onto the stack
  1 for the call
  1 to set %esp back to normal
  1.5*ARGS to load them into registers and maybe save the old register values
  2 for leave & ret

  = 4 + 3*ARGS   instructions

This is ignoring the fact that some CISC instructions are much more expensive
than others.

However when you measure the time cost of the function call, it is
much larger.

Compare blur with no function call to blur with a call to a function with
2 arguments: 1.796 ms per iter versus 3.526978 ms per iter.  The function
was called 142884 times per iteration, so the overhead of each function
call is (3.526978 - 1.796) ms / 142884 = 12.1 ns.  (At 1.43 million
instructions in 1.795 ms, or about 0.8 instructions per ns, this machine
can execute between 9 and 10 instructions in 12.1 ns.)

I briefly tested the overhead of function calls with 2, 3, 4, and 5
arguments.  Each figure below is the difference between NO function call
(inline function) and a function call with N arguments.

2 args: 12.1 ns more than no function call
3 args: 13.91 ns more than no function call
4 args: 18.15 ns more than no function call
5 args: 21.58 ns more than no function call

Based on these measurements, maybe func_overhead should be
func_overhead(ARGS) = 4 + 2.5*ARGS instructions
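
The per-call figures above come from a subtraction and a division.  A
small sketch of that arithmetic follows, using the call count of 142884
and the timings from this file.  The function names are invented here.

```c
/* Per-call overhead from two timings, plus the two instruction models. */

/* Overhead of one call in ns, given ms per iteration with and without
 * the call and the number of calls per iteration. */
double call_overhead_ns(double with_call_ms, double inline_ms, long calls)
{
    return (with_call_ms - inline_ms) * 1e6 / (double)calls;
}

/* First estimate, from counting instructions by hand. */
double overhead_counted(int args)  { return 4.0 + 3.0 * args; }

/* Revised estimate, adjusted to fit the measurements. */
double overhead_measured(int args) { return 4.0 + 2.5 * args; }
```

call_overhead_ns(3.526978, 1.796, 142884) gives about 12.1 ns.  At
roughly 0.8 instructions per ns (itself an estimate) that is about 9.6
instructions, compared with 10 from the first model and 9 from the
revised one.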


Measure the cost of a switch statement.
With no switch statement, around 1.84 ms per iteration.
With 2 case switch+default, 2.3867 ms.
With 3 case switch+default, 3.739635 ms.
With 4 case switch, 3.341807 ms.   (successive compares)
With 8 case switch, 2.241196 ms.   (jump table)
With 16 case switch, 3.122807 ms.  (jump table)
With 32 case switch, 3.080882 ms.  (jump table)

Once you have enough cases that gcc creates a jump table, the
cost of a switch statement is about 7 instructions.
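
For illustration, a dense switch of the kind that tends to become a jump
table might look like the sketch below.  The case bodies are made up;
the switch actually used in these tests is not shown in this file.

```c
/* Hypothetical 8-case switch with consecutive case values.  gcc often
 * lowers a dense switch like this to a bounds check plus an indirect
 * jump through a table, but the choice is the compiler's; inspect the
 * output of "gcc -O2 -S" to see which form you got. */
int switch8(int op, int sum, int value)
{
    switch (op) {
    case 0: sum += value;     break;
    case 1: sum -= value;     break;
    case 2: sum ^= value;     break;
    case 3: sum |= value;     break;
    case 4: sum &= value;     break;
    case 5: sum += 2 * value; break;
    case 6: sum -= 2 * value; break;
    case 7: sum = value;      break;
    default:                  break;   /* leave sum unchanged */
    }
    return sum;
}
```

With only a few cases, or with widely scattered case values, the
compiler is more likely to fall back to successive compares.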

Try with a bunch of function calls in a switch stmt.


blur-O2                 1.798693 ms
blur-O2-func            3.309882 ms
blur-O2-switch          3.101132 ms
blur-O2-switch-call     2.667050 ms
blur-O2-fnptr-switch    8.669802 ms
blur-O2-fnptr-table     6.010503 ms


;;;;;;;;;;;blur.c revision 1.2, compiled with -O2 -S;;;;;;;;;;;;
;; with some annotations by Bryce to figure out what is what.

	.file	"blur.c"
	.version	"01.01"
gcc2_compiled.:
.text
	.align 4
.globl blur
	.type	 blur,@function
blur:
	pushl %ebp
	movl %esp,%ebp
	subl $36,%esp
	pushl %edi
	pushl %esi
	pushl %ebx
	movl $1,%eax
	.p2align 4,,7
.L20:
	movl $1,%edi
	leal -1(%eax),%ebx
	movl %ebx,-28(%ebp)
	leal 1(%eax),%ebx
	movl %ebx,-16(%ebp)
	sall $9,%eax
	movl %eax,-24(%ebp)
	movl %ebx,-8(%ebp)
	movl -28(%ebp),%ebx
	movl %ebx,-32(%ebp)
	sall $9,-32(%ebp)
	.p2align 4,,7
.L24:
	xorl %esi,%esi                 ; let %esi = sum
	movl -28(%ebp),%ecx
	leal 1(%edi),%ebx
	movl %ebx,-20(%ebp)
	cmpl -8(%ebp),%ecx
	jg .L26


	movl %ebx,-12(%ebp)
	movl -16(%ebp),%ebx
	movl %ebx,-4(%ebp)
	movl -32(%ebp),%ebx
	movl %ebx,-36(%ebp)
	.p2align 4,,7
.L28:
	leal -1(%edi),%edx            ; let %edx = y2
	cmpl -12(%ebp),%edx           ; test if y2 > max; if so, skip the inner loop
	jg .L27


	;; build pointer in %eax to array[x2][y2]
	movl -36(%ebp),%eax
	addl $array,%eax
	leal (%eax,%edx,4),%eax
	.p2align 4,,7

.L32:
	;; innermost loop. it has precomputed the endpoint in -20(%ebp)
	addl (%eax),%esi               ;; sum += (%eax)
	addl $4,%eax                   ;; point %eax to next value
	incl %edx                      ;; y2++
	cmpl -20(%ebp),%edx            ;; if y2<=max
	jle .L32

.L27:
	addl $512,-36(%ebp)
	incl %ecx
	cmpl -4(%ebp),%ecx
	jle .L28

.L26:
	movl -24(%ebp),%ebx
	leal (%ebx,%edi,4),%eax
	movl %esi,array2(%eax)
	movl -20(%ebp),%edi
	cmpl $126,%edi
	jle .L24

	movl -16(%ebp),%eax
	cmpl $126,%eax
	jle .L20

	leal -48(%ebp),%esp
	popl %ebx
	popl %esi
	popl %edi
	leave
	ret
.Lfe1:
	.size	 blur,.Lfe1-blur
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
----------------------------------


blur-opcode.c

In a past email, I described a way to use gcc and gas to produce
native code, so that Bochs would not need its own code generator for
every supported platform.  I will attempt a proof of concept here.

I will try to create a system in which snippets of C++ code can be compiled
by gcc, then extracted to form a chunk of native binary code.  The chunks
will be designed so that they can be pasted together efficiently.

I haven't tried anything like this before....