[z88dk-dev] benchmarks
I committed a few benchmark programs in:
http://z88dk.cvs.sourceforge.net/viewvc ... enchmarks/
dhrystone 2.1, whetstone 1.2 and the common yet uninformative sieve of Eratosthenes. dhrystone has a 2d array in it so it can only be compiled with sdcc for now.
I also have paranoia.c which I'll commit tomorrow. It's supposed to test and identify weaknesses in the floating point implementation.
------------------------------------------------------------------------------
I've committed this one too. Currently it can only be compiled with sdcc due to a couple of minor issues I'll get around to fixing. It also needs the %s, %d, %f, %e and %g printf converters, so the new clib has to be recompiled with the float ones enabled. The total size is about 52k so I was only able to run it under cpm. There's a compile line at the top of paranoia.h.
It identified several issues in math48 so I'll have to take a look at them later. I also ran into a problem with one of the partially qualified rules in SO3 so I had to fully qualify it. A chunk of the rules are not properly qualified with dead register indication because the peepholer currently doesn't treat function calls properly: to be safe it reports all registers as used at a call, which prevents many rules from being applied when they should be. So I've modified our copy of sdcc to implement a hack. In the new clib all fastcall functions get a "_fastcall" appended to their names, so I look at the function name to see if it contains "_fastcall" and, if it does, indicate that de,hl are live; for any other sort of function all registers are indicated as unused. This will work but it's a hack. It also means I can probably redo the rules file properly without affecting the code substitutions.
------------------------------------------------------------------------------
Well the main issue is that sccz80 doesn't seem to accept static initialization of doubles:
double c = 1.0;
comes up as an error. Does someone want to have a look at that? I'd be interested in seeing the results under sccz80 as I expect it to be much smaller than sdcc, and the reports will be more accurate for math48 as sccz80 uses the full 48-bit float.
------------------------------------------------------------------------------
Again with the benchmarks:-
whetstone (time in z80 cycles, target is an embedded one without stdio and heap)
(sccz80 + new clib)
zcc +embedded -vn -startup=0 -clib=new -DNOPRINTF -DNOTIMER -DNOCOMMAND whetstone.c -o whetstone -lm -m
ticks whetstone_CODE.bin -start 8c1 -end 11aa -counter 9999999999999
size: 5407
speed: 974,224,224
(sdcc + new clib)
zcc +embedded -vn -SO3 -startup=0 -clib=sdcc_iy --max-allocs-per-node200000 -DNOPRINTF -DNOTIMER -DNOCOMMAND whetstone.c -o whetstone -lm -m
ticks whetstone_CODE.bin -start 829 -end 130b -counter 9999999999999
size: 6099
speed: 919,719,774
(sccz80 + classic clib genmath)
zcc +test -vn -DNOPRINTF -DNOTIMER -DNOCOMMAND whetstone.c -o whetstone -lm -lndos -m
ticks whetstone -start f7 -end a1f -counter 9999999999999
size: 5513
speed: 1,295,331,166
(sdcc alone)
typedef float double_t; __asm __endasm; sinf etc
sdcc -mz80 -DNOPRINTF -DNOTIMER -DNOCOMMAND --reserve-regs-iy --max-allocs-per-node200000 whetstone.c -o whetstone
hex2bin whetstone.ihx
ticks whetstone.bin -start 22b -end fd2 -counter 9999999999999 (.noi file)
size: 14865
speed: 2,821,469,046
I'm trying to get hitech-c in there too but when I run it under cpm emulation, the emulator complains that hitech is calling bdos function 0x66, which is not supported. Bdos 0x66 is a date/time function that is only supported under mpm. I'm not sure what is happening there. Is anyone else able to successfully run hitech? I may have to try udo's z80pack tools; its z80 core is missing instructions so it can't run z88dk-generated code, but maybe hitech will be ok.
I'd also like to try with bds c but I'm not sure what the state of its float support is. It seems to be a four-digit bcd implementation that is lacking most math functions.
------------------------------------------------------------------------------
I don't know what's wrong, Hitech C is widely used on emulators.
My favourite is z80mu, just because it is the first one I had.
This is the latest one I found..
http://homepage3.nifty.com/takeda-toshiya/cpm/
z80pack (a simh "fork") is probably the most accurate (too much perhaps !!) if we exclude the whole-hardware emulators.
I think the things you are doing with the benchmarks are very interesting (and useful).. the new DK seems to be quickly getting a good shape and personality, which I'm still trying to get familiar with, but ..come on.. that is normal with new stuff !
------------------------------------------------------------------------------
BDS C and float.
The BDS C has an FP library but it is a whole external package. It includes a rewritten printf and all (atof, ftoa, low level mantissa handling, and headers for declarations).
As with many other aspects of BDS C it is not a standard thing.. just small and fast as usual but not good for a C compatibility test.
------------------------------------------------------------------------------
It's fun to see. I'll post a few more tonight. Here's the sieve of Eratosthenes (prime numbers):
sieve of Eratosthenes
#pragma output CLIB_EXIT_STACK_SIZE = 0
#pragma output CLIB_MALLOC_HEAP_SIZE = 0
#pragma output CLIB_STDIO_HEAP_SIZE = 0
(sccz80 + new clib)
zcc +embedded -vn -startup=0 -clib=new -DNOPRINTF sieve.c -o sieve -m
ticks sieve_CODE.bin -start xxx -end xxx -counter 9999999999999
size: 8343
speed: 5,325,739
(sdcc + new clib)
zcc +embedded -vn -SO3 -startup=0 -clib=sdcc_iy --max-allocs-per-node200000 -DNOPRINTF sieve.c -o sieve -m
ticks sieve_CODE.bin -start xxx -end xxx -counter 9999999999999
size: 8282
speed: 3,691,568
(sccz80 + classic clib)
zcc +test -vn -DNOPRINTF sieve.c -o sieve -lndos -m
ticks sieve -start a4 -end 13e -counter 9999999999999
size: 8358
speed: 5,325,739 (yes the code produced is identical to new clib)
(sdcc alone)
__asm __endasm; ticks::
sdcc -mz80 -DNOPRINTF --reserve-regs-iy --max-allocs-per-node200000 sieve.c
hex2bin sieve.ihx
ticks sieve.bin -start xxx -end xxx -counter 9999999999999
size: 720 + 8008 (bss/data) = 8728
speed: 4,150,710
------------------------------------------------------------------------------
Dhrystone 2.1
#pragma output CLIB_EXIT_STACK_SIZE = 0
#pragma output CLIB_MALLOC_HEAP_SIZE = 0
#pragma output CLIB_STDIO_HEAP_SIZE = 0
sccz80 won't compile without replacing 2d array
(sdcc + new clib)
zcc +embedded -vn -SO3 -startup=0 -DNOPRINTF -DNOTIMER -DNOSTRUCTASSIGN -DNOREGISTER -clib=sdcc_iy --max-allocs-per-node200000 dhry_1.c dhry_2.c -o dhry -m
ticks dhry_CODE.bin -start xxx -end xxx -counter 99999999999
size: 7399
speed: 273,062,480
(sdcc alone)
__asm __endasm; ticks:: typedef float double_t;
sdcc -mz80 -DNOPRINTF -DNOTIMER -DNOSTRUCTASSIGN -DNOREGISTER --reserve-regs-iy --max-allocs-per-node200000 -c dhry_1.c
sdcc -mz80 -DNOPRINTF -DNOTIMER -DNOSTRUCTASSIGN -DNOREGISTER --reserve-regs-iy --max-allocs-per-node200000 -c dhry_2.c
sdcc -mz80 -DNOPRINTF -DNOTIMER -DNOSTRUCTASSIGN -DNOREGISTER --reserve-regs-iy --max-allocs-per-node200000 -c dhry_1.rel dhry_2.rel -o dhry.ihx
hex2bin dhry.ihx
ticks dhry.bin -start xxx -end xxx -counter 99999999999
size: 2719 + 5219 (bss/data) = 7938
speed: 339,822,490
------------------------------------------------------------------------------
This one tests integer math performance and you can see a tremendous speedup when using the fast integer option in the new clib. The cost of using the fast integer lib (no loop unrolling here) is about 1k. sdcc + new clib + fast integer option is almost six times faster than sdcc alone with about the same code size! sdcc is hurt by not having a div() equivalent and by having its 32-bit lib implemented in C.
pi
#pragma output CLIB_EXIT_STACK_SIZE = 0
#pragma output CLIB_MALLOC_HEAP_SIZE = 0
#pragma output CLIB_STDIO_HEAP_SIZE = 0
(sccz80 + new clib + fast integer option)
zcc +embedded -vn -startup=0 -clib=new -DNOPRINTF pi.c -o pi -m
ticks pi_CODE.bin -start xxx -end xxx -counter 99999999999
size: 7306
speed: 1,455,531,292
(sdcc + new clib + fast integer option)
zcc +embedded -vn -startup=0 -clib=sdcc_iy -DNOPRINTF --max-allocs-per-node200000 pi.c -o pi -m
ticks pi_CODE.bin -start xxx -end xxx -counter 99999999999
size: 7380
speed: 1,550,506,214
(sccz80 + new clib)
zcc +embedded -vn -startup=0 -clib=new -DNOPRINTF pi.c -o pi -m
ticks pi_CODE.bin -start xxx -end xxx -counter 99999999999
size: 6289
speed: 3,773,744,792
(sdcc + new clib)
zcc +embedded -vn -startup=0 -clib=sdcc_iy -DNOPRINTF --max-allocs-per-node200000 pi.c -o pi -m
ticks pi_CODE.bin -start xxx -end xxx -counter 99999999999
size: 6363
speed: 3,870,344,956
(sccz80 + classic clib)
No div() so replaced with / and %
zcc +test -vn -DNOPRINTF pi.c -o pi -lndos -m
ticks pi -start xxx -end xxx -counter 9999999999999
size: 6263
speed: 5,391,465,260
(sdcc alone)
No div() so replaced with / and %, __asm __endasm; ticks::
sdcc -mz80 -DNOPRINTF --reserve-regs-iy --max-allocs-per-node200000 pi.c
hex2bin pi.ihx
ticks pi.bin -start xxx -end xxx -counter 9999999999999
size: 1714 + 5622 (bss/data) = 7336
speed: 8,877,132,996
------------------------------------------------------------------------------
Some hitech c results:
*********
WHETSTONE:
(hitech-c under cp/m)
comments to old style, unable to define STATIC so replaced STATIC with text search, typedef double double_t;
GIVES SOME INCORRECT RESULTS!
C -O -V -DNOPRINTF -DNOTIMER -DNOCOMMAND -MWHETDC.MAP WHETDC.C -LF
appmake +rom -s 32768 -f 0 -o whetdc0.rom
appmake +inject -b whetdc0.rom -i whetdc.com -s 256 -o whetdc.rom
ticks whetdc.rom -start 14f -end 1d2 -counter 99999999999
size: 6869 + 414 + 147 = 7430 (includes some cp/m overhead)
speed: 637,332,032
compared to the best z88dk result:
(sdcc + new clib)
zcc +embedded -vn -SO3 -startup=0 -clib=sdcc_iy --max-allocs-per-node200000 -DNOPRINTF -DNOTIMER -DNOCOMMAND whetstone.c -o whetstone -lm -m
ticks whetstone_CODE.bin -start 829 -end 130b -counter 9999999999999
size: 6099
speed: 919,719,774
We actually lost on speed here. I didn't expect that at all, but it must be kept in mind that math48 implements a 48-bit double whereas hitech implements a 32-bit one. However the good result for hitech is spoiled by the fact that some of the computed values were incorrect, so its float implementation is buggy.
******
sieve of Eratosthenes
(hitech-c under cp/m)
comments to old style
C -O -V -DNOPRINTF -MSIEVE.MAP SIEVE.C
appmake +rom -s 32768 -f 0 -o sieve0.rom
appmake +inject -b sieve0.rom -i sieve.com -s 256 -o sieve.rom
ticks sieve.rom -start 14f -end 1d2 -counter 99999999999
size: 678 + 2 + 8010 = 8690 (includes some cp/m overhead)
speed: 4,547,538
compared to best z88dk result:
(sdcc + new clib)
zcc +embedded -vn -SO3 -startup=0 -clib=sdcc_iy --max-allocs-per-node200000 -DNOPRINTF sieve.c -o sieve -m
ticks sieve_CODE.bin -start xxx -end xxx -counter 9999999999999
size: 8282
speed: 3,691,568
*****
pi.c
(hitech-c under cp/m)
No div() so replaced with / and %, comments to old style
C -O -V -DNOPRINTF -MPI.MAP PI.C
appmake +rom -s 32768 -f 0 -o pi0.rom
appmake +inject -b pi0.rom -i pi.com -s 256 -o pi.rom
ticks pi.rom -start 13d -end 236 -counter 99999999999
size: 1159 + 2 + 5616 = 6777 (includes some cp/m overhead)
speed: 5,465,612,292
best z88dk result:
(speed)
(sccz80 + new clib + fast integer option)
zcc +embedded -vn -startup=0 -clib=new -DNOPRINTF pi.c -o pi -m
ticks pi_CODE.bin -start xxx -end xxx -counter 99999999999
size: 7306
speed: 1,455,531,292
(size/speed compromise)
(sccz80 + new clib)
zcc +embedded -vn -startup=0 -clib=new -DNOPRINTF pi.c -o pi -m
ticks pi_CODE.bin -start xxx -end xxx -counter 99999999999
size: 6289
speed: 3,773,744,792
I think we'll always be smaller and faster than hitech except for that surprise float result, although if we had a 32-bit float lib I expect the results would be in z88dk's favour again. Hitech's float package is unreliable as well.
I'm still missing a dhrystone for hitech but it's a bit late here to continue so I'll look at this and bds c tomorrow. I think I'll start tabulating the results too. Can you think of any other useful benchmark programs to try?
------------------------------------------------------------------------------
I think it is a fairly complete test already.. adding further detail could compromise the table readability.
BTW the actual results are already intriguing..let's give time to others to verify such results.
Also let's publish the results before being tempted to tune the optimizers for such specific benchmarks !!
------------------------------------------------------------------------------
Last one for hitech-c:
Dhrystone 2.1
(hitech c)
#ifndef NOSTATIC illegal, can't #define REGISTER, typedef double double_t;
(optimize switch -O had to be omitted as compiler complained "no end record found")
C -V -DNOPRINTF -DNOTIMER -DNOSTRUCTASSIGN -DNOREGISTER -MDHRY.MAP DHRY_1.C DHRY_2.C
appmake +rom -s 32768 -f 0 -o dhry0.rom
appmake +inject -b dhry0.rom -i dhry_1.com -s 256 -o dhry.rom
ticks dhry.rom -start 1a9 -end 359 -counter 99999999999
size: 2454 + 128 + 5221 = 7803 (includes some cp/m overhead)
speed: 544,140,142
compare to z88dk result:
(sdcc + new clib)
zcc +embedded -vn -SO3 -startup=0 -DNOPRINTF -DNOTIMER -DNOSTRUCTASSIGN -DNOREGISTER -clib=sdcc_iy --max-allocs-per-node200000 dhry_1.c dhry_2.c -o dhry -m
ticks dhry_CODE.bin -start xxx -end xxx -counter 99999999999
size: 7399
speed: 273,062,480
I had to turn off optimization on the hitech compile (-O) since the compiler crapped out with it turned on. Once the source file gets to a certain size the optimization step can't be enabled. I'm not sure how much that hurt hitech's performance but it runs at only half the speed of the z88dk compile.
I'll document the source code and how the numbers were obtained so others can verify or try with other compilers.
The -SO3 rules with sdcc are producing almost optimal code on the benchmarks. I can only see a few problems, which I don't think we can fix because they have to do with sdcc switching endianness in 16-bit quantities. I'm also going to redo the SO3 rules -- what I've put in there now was mainly a learning exercise so there's a lot of "wrong way to do things" in there. The redo is a bit of a pia because there are probably close to 350 additional rules now :-/
Other tests I was thinking of would be maybe sorting, string manipulation and file io performance. Sorting is really about the quality of your qsort function (I'm confident we're well ahead of everyone else) so it's not really a good indicator of anything unless your application involves a lot of sorting. But it seems to be a common performance test like the equally questionable sieve of Eratosthenes.
------------------------------------------------------------------------------
Benchmark results now in the under-construction wiki:
http://www.z88dk.org/wiki/doku.php?id=t ... benchmarks
I'll go back and explain how to compile and collect the numbers seen as well as double-check results.
------------------------------------------------------------------------------
I applied a temporary hack to sdcc's peepholer that will look at the function name in calls to determine what registers are live at the call. We can do this because the entire library appends "_fastcall" to all fastcall function names so if the call has "_fastcall" in the name it can be known that dehl is live (sadly I can't narrow that down to just hl for int arguments) and no registers are live otherwise (params are on the stack).
That has meant that the speculative -SO3 rules that we have can be made safe and I've just finished that. Now that I have some experience, ideally I'd start making rules from scratch again but that will be for another day. I'm going to go through several compiles to make sure programs still work with -SO3 and then I will return to the benchmarking, thoroughly document that and maybe find a couple more compilers to compare against.
Here's the pi.c asm output from sdcc without -SO3 (pi-SDCC.opt) and with -SO3 (pi-SO3.opt).
https://drive.google.com/file/d/0B6XhJJ ... sp=sharing
See if you can tell the difference. The former is 284 lines, the latter 177. My rule of thumb is #bytes = #lines/2 (unless there is a lot of ix addressing, then it goes higher). This would suggest a savings of approx 53 bytes on a 140 byte program (35%). In the benchmarks the code size of these small programs is swamped by the data/bss size, which is close to the same for all compilers.
------------------------------------------------------------------------------
Ok.. well, it sounds incredible !
I even had to take a bit of time to answer because you left me totally surprised.
Well done, if we find a way to make the hack acceptable I'd definitely insist on that.. I wonder if Phil can support us somehow.
------------------------------------------------------------------------------
In reply to Stefano Bodrato (stefano_bodrato@...), 05.11.2015 15:06:
Here's a first version of a patch that should just work for
__z88dk_fastcall, i.e. it should give the peephole rules exact
information on which function call uses which registers for arguments
(for calls using the call asm instruction, support for other ways of
calling functions might come later).
There seem to be a few corner cases, in which it doesn't work yet - in
that case it will print a warning "Fallback to old way of handling
register arguments in peephole".
I didn't find time to really test it yet.
Philipp
------------------------------------------------------------------------------
In reply to Philipp Klaus Krause, 05.11.2015 21:52:
A slightly modified version of the patch I posted earlier is now in sdcc
revision #9386. Please test if it works for the needs of z88dk.
Philipp
I've now tried it and it seems to be working just as well as my hack so I've made the necessary changes for the next nightly build (Nov 6).An ightly modified version of the patch I posted earlier is now in sdcc
revision #9386. Please test if it works for the needs of z88dk.
Philipp
Thanks Philip.
It seems the peepholer is unable to handle "ex de,hl" properly. E.g., it should be able to determine that "ld hl,#_wloc + 1" is dead in the following:
ld hl,(#_wloc)
add hl,hl
ex de,hl
ld hl,#_wloc + 1
ex de,hl
add hl,hl
add hl,hl
ex de,hl
ld hl,#_room + 0x0006
add hl,de
but this hasn't impacted the code I've been compiling yet, as quite often there is another rule that applies. In the case above "ex de,hl; ld hl,#_wloc + 1; ex de,hl" can be replaced with "ld de,#_wloc + 1" and the code collapses to:
ld hl,(#_wloc)
add hl,hl
add hl,hl
add hl,hl
ld de,#_room + 0x0006
add hl,de
;; if notUsed('de')
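To make the "ex de,hl" point concrete, here is a minimal sketch in C of a forward scan that follows a loaded value as it moves between hl and de. The instruction model (Op, Pair, load_is_dead, wloc_example) is hypothetical and reduced to the opcodes in the example above; it is not sdcc's peep.c code, only an illustration of the bookkeeping that would be needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced model: only the opcodes used in the example. */
typedef enum { LD_HL, LD_DE, ADD_HL_HL, ADD_HL_DE, EX_DE_HL } Op;
typedef enum { PAIR_HL, PAIR_DE } Pair;

/* Does this opcode read the given register pair? */
static bool op_reads(Op op, Pair p)
{
    switch (op) {
    case ADD_HL_HL: return p == PAIR_HL;
    case ADD_HL_DE: return true;        /* reads both hl and de */
    default:        return false;       /* plain loads read neither pair */
    }
}

/* Does this opcode overwrite the given register pair? */
static bool op_writes(Op op, Pair p)
{
    switch (op) {
    case LD_HL:
    case ADD_HL_HL:
    case ADD_HL_DE: return p == PAIR_HL;
    case LD_DE:     return p == PAIR_DE;
    default:        return false;
    }
}

/* True if the value loaded at code[start] is overwritten before it is read.
   "ex de,hl" neither reads nor kills the value; it only moves it. */
bool load_is_dead(const Op *code, size_t n, size_t start)
{
    assert(start < n && (code[start] == LD_HL || code[start] == LD_DE));
    Pair where = (code[start] == LD_HL) ? PAIR_HL : PAIR_DE;
    for (size_t i = start + 1; i < n; i++) {
        if (code[i] == EX_DE_HL) {
            where = (where == PAIR_HL) ? PAIR_DE : PAIR_HL;
            continue;
        }
        if (op_reads(code[i], where))  return false;
        if (op_writes(code[i], where)) return true;
    }
    return false;   /* ran off the end: conservatively assume live */
}

/* The ten-instruction _wloc sequence above, in order. */
const Op wloc_example[] = {
    LD_HL, ADD_HL_HL, EX_DE_HL, LD_HL, EX_DE_HL,
    ADD_HL_HL, ADD_HL_HL, EX_DE_HL, LD_HL, ADD_HL_DE
};
```

Calling load_is_dead(wloc_example, 10, 3) asks about "ld hl,#_wloc + 1": the value moves to de, survives the two "add hl,hl", moves back to hl, and is then overwritten by "ld hl,#_room + 0x0006" without ever being read.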
One place where code is being affected is that the peepholer doesn't seem to follow branches. To do that you'd probably have to introduce a max instruction count that the peepholer would follow and then recursively peephole at branch points, applying a logical && to the results of the branches to come to a decision on whether registers are live. I know this would probably be a huge problem to implement in the current code base but here's a short real example:
ld hl,#_main_k_1_268
ld a,(hl)
add a,#0xF2
ld (hl),a
inc hl
ld a,(hl)
adc a,#0xFF
ld (hl),a
dec hl
or a,(hl)
jp NZ,l_main_00110$
Substitution of a 16-bit add can't occur there because the peepholer can't determine if either bc or de is dead. Assuming de is dead this can become:
ld hl,(#_main_k_1_268)
ld de,#0xFFF2
add hl,de
ld a,h
ld (#_main_k_1_268),hl
ld hl,#_main_k_1_268 + 1
dec hl
or a,(hl)
jp NZ,l_main_00110$
;; if notUsed('de')
=
ld hl,(#_main_k_1_268)
ld de,#0xFFF2
add hl,de
ld a,h
ld (#_main_k_1_268),hl
or a,l
jp NZ,l_main_00110$
;; if notUsed('hl')
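The branch-following idea described above (bounded instruction count, recurse at branch points, && the results) could look roughly like the following sketch. Everything here is an assumption made for illustration: the Insn model, the de_not_used name, and the instructions placed after the "jp NZ" in the sample arrays are hypothetical and do not come from sdcc or from the real program. A return is treated as a use of de, on the conservative assumption that de may carry part of a return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced model of one instruction, tracking only de. */
typedef enum { K_PLAIN, K_JUMP, K_COND_JUMP, K_RETURN } Kind;

typedef struct {
    Kind   kind;
    bool   reads_de;
    bool   writes_de;
    size_t target;      /* destination index, used by the two jump kinds */
} Insn;

/* True only if de is overwritten before being read on every path explored
   within `budget` instructions. False means "assume de is live". */
bool de_not_used(const Insn *code, size_t n, size_t pc, int budget)
{
    assert(code != NULL);
    while (pc < n && budget-- > 0) {
        const Insn *in = &code[pc];
        if (in->reads_de)  return false;
        if (in->writes_de) return true;
        switch (in->kind) {
        case K_JUMP:
            pc = in->target;
            break;
        case K_COND_JUMP:
            /* Dead only if dead on the taken AND the fall-through path. */
            return de_not_used(code, n, in->target, budget)
                && de_not_used(code, n, pc + 1, budget);
        case K_RETURN:
            return false;   /* conservative: the caller may read de */
        default:
            pc++;
            break;
        }
    }
    return false;           /* budget or code exhausted: assume live */
}

/* Hypothetical continuations after "or a,(hl) ; jp NZ,l_main_00110$". */
const Insn both_paths_kill_de[] = {
    { K_PLAIN,     false, false, 0 },   /* or a,(hl)                   */
    { K_COND_JUMP, false, false, 4 },   /* jp NZ,l_main_00110$         */
    { K_PLAIN,     false, true,  0 },   /* fall-through: loads de      */
    { K_RETURN,    false, false, 0 },
    { K_PLAIN,     false, true,  0 },   /* l_main_00110$: loads de     */
    { K_RETURN,    false, false, 0 },
};

const Insn target_reads_de[] = {
    { K_PLAIN,     false, false, 0 },
    { K_COND_JUMP, false, false, 4 },
    { K_PLAIN,     false, true,  0 },   /* fall-through: loads de      */
    { K_RETURN,    false, false, 0 },
    { K_PLAIN,     true,  false, 0 },   /* l_main_00110$: reads de     */
    { K_RETURN,    false, false, 0 },
};
```

With a budget of 16 the first array reports de as unused and the second does not. With a budget of 2 even the first array falls back to "live", which is the safe answer when the scan gives up early.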
The one big problem is the tendency of sdcc to reverse endianness in 16-bit quantities but I know this is probably a major problem to fix so maybe it's something to keep in mind next time you look at code generation. Here's an example from the dhrystone2.1 benchmark:
ld hl,(#_Proc_8_Int_Loc_1_427)
ld de,#0x0014
add hl,de
ld e,l
ld d,h
add hl, hl
add hl, de
add hl, hl
add hl, hl
add hl, hl
add hl, de
add hl, hl
add hl, hl
ld e,6 (ix)
ld d,7 (ix)
add hl,de
Up to here everything is great. But then sdcc decides to swap endianness by moving hl into ed (h into e, l into d).
ld e,h
ld d,l
ld hl,(#_Proc_8_Int_Loc_1_427)
add hl,hl
ld b,l
ld c,h
ld a,d
add a, b
ld d,a
ld a,e
adc a, c
ld e,a
ld a,4 (ix)
add a, b
ld b,a
ld a,5 (ix)
adc a, c
ld c,a
ld l, b
ld h, c
Using the peephole set we have now this code could have been replaced by:
ex de,hl
ld hl,(#_Proc_8_Int_Loc_1_427)
add hl,hl
ld c,l
ld b,h
ex de,hl
add hl,bc
ex de,hl
ld l,4 (ix)
ld h,5 (ix)
add hl,bc
ld c,l
ld b,h
if sdcc used big endian math. That saves 7 bytes and 14 cycles. This sudden change shows up quite a lot and it's not something easy to fix with peephole rules since undoing the endianness reversal takes up time and space too.
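The 7 byte / 14 cycle figure can be checked by tallying both sequences with the standard Z80 instruction sizes and T-state counts. The sketch below does that in C; the Cost struct, the array names and the helper functions are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One instruction with its encoded size and documented Z80 T-states. */
typedef struct { const char *text; unsigned bytes; unsigned tstates; } Cost;

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* What sdcc emitted after the endianness swap. */
const Cost sdcc_seq[] = {
    {"ld e,h", 1, 4},      {"ld d,l", 1, 4},
    {"ld hl,(nn)", 3, 16}, {"add hl,hl", 1, 11},
    {"ld b,l", 1, 4},      {"ld c,h", 1, 4},
    {"ld a,d", 1, 4},      {"add a,b", 1, 4},  {"ld d,a", 1, 4},
    {"ld a,e", 1, 4},      {"adc a,c", 1, 4},  {"ld e,a", 1, 4},
    {"ld a,4 (ix)", 3, 19}, {"add a,b", 1, 4}, {"ld b,a", 1, 4},
    {"ld a,5 (ix)", 3, 19}, {"adc a,c", 1, 4}, {"ld c,a", 1, 4},
    {"ld l,b", 1, 4},      {"ld h,c", 1, 4},
};

/* The proposed replacement using 16-bit adds. */
const Cost alt_seq[] = {
    {"ex de,hl", 1, 4},    {"ld hl,(nn)", 3, 16}, {"add hl,hl", 1, 11},
    {"ld c,l", 1, 4},      {"ld b,h", 1, 4},
    {"ex de,hl", 1, 4},    {"add hl,bc", 1, 11},  {"ex de,hl", 1, 4},
    {"ld l,4 (ix)", 3, 19}, {"ld h,5 (ix)", 3, 19},
    {"add hl,bc", 1, 11},  {"ld c,l", 1, 4},      {"ld b,h", 1, 4},
};

/* Sum of encoded sizes over a sequence. */
unsigned total_bytes(const Cost *seq, size_t n)
{
    assert(seq != NULL);
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) sum += seq[i].bytes;
    return sum;
}

/* Sum of T-states over a sequence. */
unsigned total_tstates(const Cost *seq, size_t n)
{
    assert(seq != NULL);
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) sum += seq[i].tstates;
    return sum;
}
```

The totals come to 26 bytes and 129 T-states for the sdcc sequence against 19 bytes and 115 T-states for the replacement, which agrees with the saving quoted above.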
------------------------------------------------------------------------------
On 06.11.2015 21:17, alvin (alvin_albrecht@...) wrote:
> I've now tried it and it seems to be working just as well as my hack so I've made the necessary changes for the next nightly build (Nov 6).
It should work a bit better: It should handle correctly that only hl is
needed if the function argument is 16 bits, and that only l is needed if
the function argument is 8 bits.
Philipp
I got hold of hitech z80 v7.50 for msdos and will be benchmarking code with that too. This is the last z80 compiler put out by hitech.
A couple of results for pi.c:
** WITHOUT LDIV
hitech750
size: 6332 bytes
time: 5,520,762,227
z88dk/sdcc
size: 6154 bytes
time: 5,285,278,076
z88dk/sdcc with fast integer math option
size: 7171 bytes
time: 1,990,813,171
** WITH LDIV
hitech750
size: 6473 bytes
time: 5,884,343,627
z88dk/sdcc
size: 6154 bytes
time: 3,786,981,324
z88dk/sdcc with fast integer math option
size: 7182 bytes
time: 1,467,142,582
I'm taking out the startup code and restart stuff so code sizes will be a bit smaller than on the wiki page currently. The times will be a bit faster for z88dk/sdcc too because there are a few more peephole rules in that are affecting a few spots.
The C code generated is quite similar between hitech 750 and z88dk/sdcc with -SO3 on. Hitech's is probably a bit better because there are one or two places where the 8-bitness in the sdcc code can't be replaced with 16-bit instructions by peephole rules.
The surprising thing is hitech C is actually slower when ldiv() is used. It seems that hitech does not get a quotient and remainder from a single division and is dividing twice so that there is no advantage to using it. Hitech does do struct copy so I guess that overhead is what slows it down in comparison to the plain /% version. It looks very much like hitech is using static memory to store the ldiv result which means it is not generating re-entrant code.
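For reference, the two forms being compared look like this in standard C. This is only a sketch of the general pattern, not the actual pi.c source; the function names are made up. ldiv() from <stdlib.h> returns an ldiv_t struct by value, which is where the struct copy mentioned above comes in. Whether the library performs one division or two underneath is up to the implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Quotient and remainder from one library call. */
ldiv_t divmod_ldiv(long numer, long denom)
{
    assert(denom != 0);
    return ldiv(numer, denom);      /* struct returned by value */
}

/* The plain "/%" version: two operators, which a compiler may or may
   not fold into a single division routine call. */
ldiv_t divmod_ops(long numer, long denom)
{
    assert(denom != 0);
    ldiv_t r;
    r.quot = numer / denom;
    r.rem  = numer % denom;
    return r;
}
```

Both functions give the same quot and rem for any non-zero denominator, so the choice between them is purely a matter of how well the compiler and library implement each form.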
------------------------------------------------------------------------------
On 09.11.2015 09:08, alvin (alvin_albrecht@...) wrote:
> I got hold of hitech z80 v7.50 for msdos and will be benchmarking
> code with that too. This is the last z80 compiler put out by
> hitech.
A long time ago, I did a small test with HITECH-C 7.80PL2:
http://sdcc.sourceforge.net/mediawiki/i ... _code_size
It did ok in code size, but around SDCC 3.0, SDCC started generating
smaller code.
Philipp
I finished up the benchmarks at:
http://www.z88dk.org/wiki/doku.php?id=t ... ystone_2.1
A package for download including source code and some instructions for compiles with all the compilers is linked. I added the Hitechv750 results too.
IAR Z80 seems to have been dropped as a commercial product and it's a bit hard to locate it through all the virus-laden junk being distributed by the crackers.
Softools Cross-C is still available but the trial period is only 30 days so I may postpone getting that until there is a backlog of stuff we may want to compare against.
------------------------------------------------------------------------------
> It should work a bit better: It should handle correctly that only hl is
> needed if the function argument is 16 bits, and that only l is needed if
> the function argument is 8 bits.
It's working very well.
I have one example where I'm not sure what is going on:
ld a,h
ex de,hl
ld hl,#_input + 1
xor a, a
ld (de),a
dec hl
"ld a,h" is not being killed by the peepholer even though there is an "xor a,a" following.
z80MightRead() in "peep.c" looks like it takes care of "xor a,a" properly:
if(!strcmp(pl->line, "xor\ta, a") || !strcmp(pl->line, "xor\ta,a"))
return FALSE;
A second question I have is: Is there any way to specify that a rule be qualified by whether flags are tested?
Here's an example:
replace restart {
ld hl,#0xFFFF
add hl,de
} by {
ld l,e
ld h,d
dec hl
; peephole z88dk-311a
}
// **** carry flag
A 16-bit addition of -1 is replaced with a 16-bit decrement. This is correct except that the flags are not equivalent afterwards: "add hl,de" updates the carry flag while "dec hl" leaves the flags alone. What I'm concerned about is whether sdcc might generate that code to do a comparison and then branch. I haven't seen that pattern show up yet, so this rule has been fine so far, but it would be better if it could be qualified by saying "if the flags are not tested."
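The flag difference can be modelled in a few lines of C. This is a sketch that tracks only hl and the carry flag (the real "add hl,de" also touches H and N); the Result struct and function names are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Register pair hl and the carry flag after a sequence has run. */
typedef struct { uint16_t hl; bool carry; } Result;

/* ld hl,#0xFFFF ; add hl,de   -> carry is set by the 16-bit add */
Result add_minus_one(uint16_t de, bool carry_in)
{
    uint32_t sum = (uint32_t)0xFFFF + de;
    Result r = { (uint16_t)sum, sum > 0xFFFF };
    (void)carry_in;                 /* the add overwrites carry */
    return r;
}

/* ld l,e ; ld h,d ; dec hl    -> 16-bit dec leaves flags untouched */
Result copy_and_dec(uint16_t de, bool carry_in)
{
    Result r = { (uint16_t)(de - 1u), carry_in };
    return r;
}
```

For de = 5 with carry initially clear, both versions leave 4 in hl, but the add version sets carry and the dec version leaves it clear. A following "jp C" or "jp NC" would therefore behave differently, which is exactly the case the rule cannot currently guard against.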
------------------------------------------------------------------------------
On 12.11.2015 20:02, alvin (alvin_albrecht@...) wrote:
> A second question I have is: Is there any way to specify that a rule be qualified by whether flags are tested?
Not yet. A workaround is to look at the common sequences and make the
peepholes a bit longer to include the next instruction that overwrites
flags.
Philipp