2a3524e2d2
I know, the makefile is a total mess but it works for now. |
||
---|---|---|
.. | ||
blur-how-many-instructions.gnumeric | ||
blur-opcode.c | ||
blur-performance.gnumeric | ||
blur.c | ||
cost-of-function-call.gnumeric | ||
cost-of-switch.gnumeric | ||
Makefile | ||
README |
In this directory, I'm developing some test cases to study some different options for Bochs performance improvements. ideas: write a loop or nested loop that takes some serious computing, that takes maybe 15 seconds to run benchmark on several machines compile with optimization and understand the assembly reimplement with switch stmt,function pointers, etc benchmark again to measure the overhead of switch & function calls test on several machines reimplement with goto *label (gcc extension) style benchmark again to measure the overhead of switch & function calls test on several machines or, by working more directly with the bochs source, we can test the performance benefit of the "cut-and-paste native code" method on a small scale. --------------------------------------------------- Performance measurements blur.c 1.2 ----------- estimated 1.43 million instructions per iteration (-O2) (see blur-1.2-performance.gnumeric spreadsheet for details) on Athlon 750 with no optimization, 4.0622 ms per iteration on Athlon 750 with -O2, 1.795138 ms per iteration on P2 350 with no optimization, 6.162224 ms per iteration on P2 350 with -O2, 2.5258 ms per iteration on Bochs with -O2 on Athlon 750, 478 ms per iteration Conclusion Bochs is about 270x slower than native code. Try replacing the innermost loop of blur() with a function call that does the same thing. Continue to compile with -O2. if innermost loop is "sum += array[x2][y2]", 1.796 ms per iter. if innermost loop is "blur_func (&sum, &array[x2][y2])", 3.526978 ms per iter. function with three arguments: 3.784 function with four arguments: 4.390 function with five arguments: 4.879 what is the overhead in terms of instructions? func_overhead(N, ARGS) for N instructions in the function, ARGS arguments F(N, ARGS) = ? 1.5*ARGS to push them onto the stack 1 for the call 1 to set %esp back to normal 1.5*ARGS to load it into a register and maybe save the old register value 2 for leave & ret = 4 + 3*ARGS instructions This is ignoring the fact that some CISC instructions are much more expensive than others. However when you measure the time cost of the function call, it is much larger. Compare blur with no function call to blur with a call to a function with 2 arguments. 1.796 ms per iter versus 3.526978 ms per iter. The function was called 142884 times, so the overhead of each function call is 12.1 ns. (In 12.1 ns this machine can execute between 9-10 instructions.) I briefly tested the overhead of function calls of 2,3,4,5 arguments. This should measure the difference between NO function call (inline function) and a function call with N arguments. 2 args: 12.1 ns more than no function call 3 args: 13.91 ns more than no function call 4 args: 18.15 ns more than no function call 5 args: 21.58 ns more than no function call Based on measurements, maybe the func_overhead should be func_overhead(ARGS) = 4 + 2.5*ARGS instructions Measure the cost of a switch statement. With no switch statement, around 1.84 ms per iteration. With 2 case switch+default, 2.3867 ms. With 3 case switch+default, 3.739635 ms. With 4 case switch, 3.341807 ms. (successive compares) With 8 case switch, 2.241196 ms. (jump table) With 16 case switch, 3.122807 ms. (jump table) With 32 case switch, 3.080882 ms. (jump table) Once you have enough cases that gcc creates a jump table, the cost of a switch statement is about 7 instructions. Try with a bunch of function calls in a switch stmt. blur-O2 1.798693 ms blur-O2-func 3.309882 ms blur-O2-switch 3.101132 ms blur-O2-switch-call 2.667050 ms blur-O2-fnptr-switch 8.669802 ms blur-O2-fnptr-table 6.010503 ms ;;;;;;;;;;;blur.c revision 1.2, compiled with -O2 -S;;;;;;;;;;;; ;; with some annotations by Bryce to figure out what is what. .file "blur.c" .version "01.01" gcc2_compiled.: .text .align 4 .globl blur .type blur,@function blur: pushl %ebp movl %esp,%ebp subl $36,%esp pushl %edi pushl %esi pushl %ebx movl $1,%eax .p2align 4,,7 .L20: movl $1,%edi leal -1(%eax),%ebx movl %ebx,-28(%ebp) leal 1(%eax),%ebx movl %ebx,-16(%ebp) sall $9,%eax movl %eax,-24(%ebp) movl %ebx,-8(%ebp) movl -28(%ebp),%ebx movl %ebx,-32(%ebp) sall $9,-32(%ebp) .p2align 4,,7 .L24: xorl %esi,%esi ; let %esi = sum movl -28(%ebp),%ecx leal 1(%edi),%ebx movl %ebx,-20(%ebp) cmpl -8(%ebp),%ecx jg .L26 movl %ebx,-12(%ebp) movl -16(%ebp),%ebx movl %ebx,-4(%ebp) movl -32(%ebp),%ebx movl %ebx,-36(%ebp) .p2align 4,,7 .L28: leal -1(%edi),%edx ; let %edx = y2 cmpl -12(%ebp),%edx ; test if y2 jg .L27 ;; build pointer in %eax to array[x2][y2] movl -36(%ebp),%eax addl $array,%eax leal (%eax,%edx,4),%eax .p2align 4,,7 .L32: ;; innermost loop. it has precomputed the endpoint in -20(%ebp) addl (%eax),%esi ;; sum += (%eax) addl $4,%eax ;; point %eax to next value incl %edx ;; y2++ cmpl -20(%ebp),%edx ;; if y2<=max jle .L32 .L27: addl $512,-36(%ebp) incl %ecx cmpl -4(%ebp),%ecx jle .L28 .L26: movl -24(%ebp),%ebx leal (%ebx,%edi,4),%eax movl %esi,array2(%eax) movl -20(%ebp),%edi cmpl $126,%edi jle .L24 movl -16(%ebp),%eax cmpl $126,%eax jle .L20 leal -48(%ebp),%esp popl %ebx popl %esi popl %edi leave ret .Lfe1: .size blur,.Lfe1-blur ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ---------------------------------- blur-opcode.c In a past email, I described a way to use gcc and gas to produce native code, so that Bochs would not need its own code generator for every supported platform. I will attempt a proof of concept here. I will try to create a system in which snippets of C++ code can be compiled by gcc, then extracted to form a chunk of native binary code. The chunks will be designed so that they can be pasted together efficiently. I haven't tried anything like this before....