[z88dk-dev] floats and sccz80

Post by **alvin** » Wed May 13, 2015 7:11 am

It turns out fastcall and callee linkage works with floats and sdcc but we haven't been taking advantage of it. The float code that sccz80 is generating is very compact with these options enabled. sdcc is perhaps 4-5x larger but there is much room for improvement in sdcc yet.

Anyway I have been trying test programs to see if sccz80 actually needs a floating point accumulator and second register in memory and I don't think it does. The last param before a float call (fastcall or library add/sub/mul/div) always seems to be collected into the floating point accumulator and the first thing done after a call is to copy the result out of the FA or to send it directly to the next float call. It seems I can skip the step of writing the float to the floating point accumulator in memory and can instead keep the FA in registers. Has anyone still got this fresh enough in mind to confirm I can do this?

I am close to trying the math48 float library with sccz80 and sdcc. I'd like to keep the effective floating point accumulator in register BCDEHL whenever possible and avoid writing to and then immediately reading that from the in-memory FA. I think I can do away with the in-memory FA altogether.

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y

Post by **alvin** » Thu May 14, 2015 5:53 am

After many more test programs, it looks like I can get away with a single floating point accumulator (FA) in the exchange set BCDEHL'. Since sccz80 does not use the exx set, FA retains its value while sccz80 makes adjustments to the stack following subroutine calls. A single float parameter is thus passed into a function in BCDEHL' and a float value is returned in BCDEHL'. I have to make changes to the math48 library to treat the exx set as the primary input and return value.

Since sccz80 evaluates float constants using the genmath format, there has to be a format conversion between the compiler and the math48 library. This happens in the sccz80 float primitives (dload, dstore, dldpsh, dpush, dpush2). The conversion is actually not that difficult since the formats are almost identical. Both formats use exponent = 0 to indicate a zero value and both have 40 bits of mantissa and 8 bits of exponent with bias 128. The storage order in memory is all that is different.

sdcc is using a 32-bit format which I haven't looked into yet but it should be easy to do the format conversion at the compiler / mathlib interface.

At the end of it, floats are stored in memory and handled by the C compilers in their native format. The printf / scanf implementation of %f, etc, only has to deal with the compiler native format.

And this hints at complications with the plan to allow mixing of object code generated by sdcc and sccz80. For one, the implementations of printf/scanf %f is different. There can't be one printf function shared by the two compilers. Worse, floats cannot be exchanged between an sccz80 function and an sdcc function. Maybe a constraint could be introduced to only allow floats to be handled by one compiler. Then there is the problem with vararg functions in general. Parameters are passed in opposite orders on the stack so the asm implementation of those functions walk the stack using a compile-time determined subroutine depending on the currently active compiler. This means sccz80 gets a version of vararg functions that walk the stack downward and sdcc gets a version that walk the stack upward. Clearly for vararg functions there isn't one implementation that can be shared by both compilers.

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y

Post by **alvin** » Fri May 22, 2015 7:11 pm

We seem to have bugs in the ftoa() function.

Here's a small test program I was trying on the zx target while investigating asin:

Code: Select all

#include <stdio.h>
#include <math.h>

float x;

main()
{
x = 0.25;
printf("asin(%f) = %f\n", x, asin(x));
}

Compiled with:
zcc +zx -vn asinc.c -o asin -lm -lndos -create-app

(I tried -lmz with identical results except for the output string format but I think -lm is genmath)

The result:

asin(2140078080.000000) = 2140106849.000000

There are a bunch of issues:

1. I checked the source and the constant 0.25 is correctly stored but is obviously not being printed properly. This probably points at ftoa()

2. The assembler is emitting a lot of warnings about use of obsolete XDEF, etc so it looks like the crts need to have all the old keywords updated.

3. The reason I was checking asin is because it doesn't look right to me but I'll have to check again once I can see the results. It's computing asin(x) = 2*atan(x/(1.+sqrt(1.-(x*x))));. Straightforward trig gives asin(x) = atan(x/(sqrt(1.-(x*x)))); I don't think it's equivalent but I am checking to see if it's done that way to get better behaved results.

4. For the first time I see the exx set used in the asm output of sccz80. I wasn't aware it was doing that

call printf
exx
ld hl,14 ;const
add hl,sp
ld sp,hl
exx

I'm wondering if a call to a vararg function with float result may also generate this code -- I am using the EXX set to store a float result from the math48 library so this could be a problem. (The floating point accumulator is in the exx set instead of memory).

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y

Post by **alvin** » Fri May 22, 2015 7:52 pm

4. For the first time I see the exx set used in the asm output of sccz80. I wasn't aware it was doing that

call printf
exx
ld hl,14 ;const
add hl,sp
ld sp,hl
exx

I'm wondering if a call to a vararg function with float result may also generate this code -- I am using the EXX set to store a float result from the math48 library so this could be a problem. (The floating point accumulator is in the exx set instead of memory).

It looks like this is safe. This will only happen when there is a result in DEHL that needs to be kept while the stack is adjusted by a large amount. The result is not in DEHL when a float is returned.

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y

Post by **alvin** » Fri May 22, 2015 8:05 pm

And this hints at complications with the plan to allow mixing of object code generated by sdcc and sccz80. For one, the implementations of printf/scanf %f is different. There can't be one printf function shared by the two compilers. Worse, floats cannot be exchanged between an sccz80 function and an sdcc function.

I'm going to go with this plan to allow object code generated by sdcc to intermingle with object code generated by sccz80. I'm looking at making all the sdcc functions __smallc (they are not now) so that in the iy version of the library the sdcc c interface can share the same code as the sccz80 c interface. There will then be no penalty for mixing sdcc and sccz80 generated code.

There will be one incompatibility with vararg functions. sdcc has to to do that R->L parameter order and sccz80 has to do that L->R order. All the vararg library functions call a specific function to walk the stack in the correct order depending on which compiler is generating code. So printf, ioctl, etc are compiled to work with one or the other compiler but not both at the same time. I'll try to dream up something to make things work but if not the user will have to use one compiler or the other for vararg functions. The float incompatibility also means floats cannot be passed between sdcc and sccz80 functions but I think this is not so much a big deal as long as the user can keep this straight.

The headers I will turn into macros so that they are not so lengthy.

I can hopefully start testing the math48 float package tonight with the new clib. I am very curious to see if there is any appreciable speedup from genmath given the floating point accumulators are held in registers rather than memory. The one thing we are not taking advantage of now in the classic clib is CALLEE and FASTCALL linkage for float functions -- sccz80 is generating very compact code when these are used so I will be doing that for math48. I've added a lot of functions from the C11 standard to math48, though not all and definitely some things are missing and will be missed so should be done eventually.

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y

Post by **alvin** » Sun May 24, 2015 6:22 pm

I'm starting to do some testing now. I've noticed a few things sccz80 is doing that is not the best way to do things (the int/float conversions for example always assume long and sometimes don't distinguish between signed and unsigned). It also seems sccz80 cannot statically initialize doubles and it stores -ve const doubles as positives and generates runtime code to negate when the constants are loaded.

But the code is very compact especially when the math functions are made CALLEE / FASTCALL:

#include <math.h>

double a, b;

main()
{
a = 123.456;
b = -32.76;

b = atan(a);
a = fmax(a,b);

return atan((a + b) * a - (a/b) + b) + 2.5;
}

==>

._main
ld hl,i_2+0
call dload
ld hl,_a
call dstore
ld hl,i_2+6
call dload
call minusfa
ld hl,_b
call dstore
ld hl,_a
call dload
call atan
ld hl,_b
call dstore
ld hl,_a
call dldpsh
ld hl,_b
call dldpsh
call fmax
ld hl,_a
call dstore
ld hl,_a
call dldpsh
ld hl,_b
call dload
call dadd
call dpush
ld hl,_a
call dload
call dmul
call dpush
ld hl,_a
call dldpsh
ld hl,_b
call dload
call ddiv
call dsub
call dpush
ld hl,_b
call dload
call dadd
call atan
call dpush
ld hl,i_2+12
call dload
call dadd
call ifix
ret

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y

Post by **alvin** » Mon May 25, 2015 6:13 am

There are still bugs but I am compiling float programs that don't crash. Just a lot of wrong answers

The basic operations are working though and I did try one comparison test with genmath:

Code: Select all

#include <math.h>

int i;
double a, b;

main()
{
a = b = 1.010362272981;

for (i=1; i != 1000; ++i)
a *= b;

return a;
}

Current new clib compile:

zcc +zx -vn -clib=new math.c -lm_sccz80 -o math

Classic compile:

zcc +zx -vn math.c -lm -lndos -o math

Both came up with the same answer (30007). The constant is the 1000th root of 30000.0

The for-loops are the same in both versions. Time taken to execute the for-loop:

math48: 7,945,990 cycles = 1.98 seconds @ 4MHz
genmath: 9,393,937 cycles = 2.34 seconds @ 4MHz

Program size is hard to compare now as the two compiles are using different crts and libraries. For this program, attached code from math48 was 350 bytes although there isn't a lot of different float operations being done here.

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y

Post by **alvin** » Tue May 26, 2015 1:34 am

I think all the bugs are gone and math48 is integrated for sccz80 and the new clib.

The following functions are available now:

acos, add, asin, atan, atan2, cbrt, compare, copysign, cos, div, exp, exp2, expm1, fabs, fdim, fma, fmax, fmin, fmod, frexp, hypot, ilogb, isgreater, isgreaterequal, isless, islessequal, islessgreater, isunordered, ldexp, log, log10, log1p, log2, logb, modf, mul, pow, scalbn, sin, sqrt, sub, tan, trunc

Most of the remaining ones are fairly easy to add (floor, ceil, hyberbolic functions) and I'll try to do that over the next few days. I'm looking at the aefg scan converters for printf and scanf first so that you can actually see output on screen. Right now I'm returning the result from main as an integer to see if correct results are had.

All the math48 functions have a few letter prefix in their names to create a namespace for them: "mm48_", "lm48_", "cm48_" and if things are done this way for all float packages, all the float code can be built into each target's main library and can be available through fully qualified names. A separate library named "m" ("mz", etc) connects the sccz80 float primitives to the math library chosen. So for now math48 is assigned the "m.lib" so linking on the compile line with "-lm_sccz80" will connect sccz80 to math48. The only thing this "m" library contains is a bunch of defc definitions as in "defc dadd = mm48_fpadd" where the sccz80 primitive "dadd" is aliased to a float library function.

sccz80 is not generating correct code to call functions that take one float parameter so for now none of the float functions can be called through a function pointer. I'll fix that at some point later on. Once i get printf/scanf done I'll take a pass at sdcc next.

There are a bunch of issues with mixing object code from sccz80 and sdcc so I'm just going to take the first step in accommodating that and that will changing all the callee functions to jump into the asm implementation rather than including it as spoken of earlier. This should reduce binary size and allow the asm implementation to be shared by both compilers later on.

Along with that will be changing all headers files into macros so that they are much shorter. The idea is a header file includes a short linkage header that defines callee, fastcall, etc macros depending on which compiler is active. Then each function prototype is defined once using the appropriate macro and that will be expanded into that giant headache that is in the header files now.

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y

Post by **alvin** » Tue May 26, 2015 1:47 am

Ah yes, since yesterday I have been unable to commit new library objects to the directory libsrc/_DEVELOPMENT/lib. It says it's waiting for a lock to be released that belongs to me. The lock should reside on sf's servers so I don't know if anyone knows how that lock can be forced to be released? It's been locked for at least 24 hours now so I'm beginning to lose hope that this is the sort of lock that will time out.

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y

Post by **alvin** » Sun May 31, 2015 7:03 am

I got an ftoa() function working tonight so it won't be long before the printf implementation is incorporated.

I'm doing things differently than what is in genmath and sdcc now -- both are doing repeated float multiplication by 10 to generate digits and are not fully implementing the float flags in scan conversion. I'm extending what was done in math48 where the decimal string is found by first converting the float's binary power of 2 to an equivalent power of 10 = e. Then that number is divided by 10^e to get a binary number between 1 and 10. This is made into mantissa bits and is repeatedly muliplied by 10 in integer arithmetic (a bunch of shifts) with the decimal digit coming from the top 4 bits of the mantissa in each iteration.

The number of significant decimal digits is controlled by the math library; in math48's and genmath's case that's 13 digits (approx 2^-40) so ftoa only generates 13 digits and then zeroes. To avoid large buffer use, the ftoa conversion counts zeroes inserted before & after the decimal point and zeroes appended after the decimal string. So the workspace buffer only contains the significant decimal digits plus a few doodads. All the expected flags are supported: +,' ',# (do not remove trailing zeroes and decimal point) and rounding is applied properly.

ftoa() is also able to generate nan and inf strings if the math library supports that (neither math48 nor genmath do).

I expect the ftoa conversion to be much faster than in either sccz80 or sdcc now but that's untested as of yet.

ftoa() is a non-standard function so I settled on this prototype:

; size_t ftoa(float x, char *buf, uint16_t prec, uint16_t flag)

prec = precision, the number of decimal digits that should appear after the decimal point
flag =logical or of FTOA_FLAG_PLUS, FTOA_FLAG_SPACE, FTOA_FLAG_HASH which correspond to the printf flags

The size of the resulting string not including terminating '\0' is returned. If buf==0 the string is not written but you still get the string length so this can be used to find out the amount of memory needed to hold the result.

------------------------------------------------------------------------------

Post by **alvin** » Sun May 31, 2015 7:15 am

I'm doing things differently than what is in genmath and sdcc now -- both are doing repeated float multiplication by 10 to generate digits

Not quite true... sdcc is able to cast the integer portion of the float to int since its float type is only 32 bit so the mantissa easily fits into a long. But it's doing integer division to get each decimal digit which is still much slower than the *10 implemented as a couple of shifts for each digit. The fraction part is done by float multiply by 10 and cast to int.

------------------------------------------------------------------------------

Philipp Klaus Krause · Sun May 31, 2015 7:29 am

On 31.05.2015 09:15, alvin (alvin_albrecht@...) wrote:

I'm doing things differently than what is in genmath and sdcc now
-- both are doing repeated float multiplication by 10 to generate
digits
Not quite true... sdcc is able to cast the integer portion of the
float to int since its float type is only 32 bit so the mantissa
easily fits into a long. But it's doing integer division to get each
decimal digit which is still much slower than the *10 implemented as
a couple of shifts for each digit. The fraction part is done by
float multiply by 10 and cast to int.

On the Z80 there's another fast way to print integers (don't know id it
helps here though). I was considering using it in the sdcc printf()
family, but did not get around to implementing it there yet (as I'm
planning a bigger printf rewrite there anyway).
It goes from binary to bcd to ascii. See attachd example (I used that on
the ColecoVision).

Philipp

Post by **alvin** » Mon Jun 01, 2015 3:21 am

ftoe() is added and the code is rationalized with ftoa().

Test program:

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

unsigned int len1, len2;
double a, b;

char buffer[128];

main()
{
len1 = ftoa(1.23456e35, NULL, 6, FTOA_FLAG_PLUS | FTOA_FLAG_HASH);
len2 = ftoa(1.23456e35, buffer, 6, FTOA_FLAG_PLUS | FTOA_FLAG_HASH);
printf("%s\n   len1 = %u, len2 = %u\n\n", buffer, len1, len2);

len1 = ftoa(20.5, NULL, 6, FTOA_FLAG_PLUS | FTOA_FLAG_HASH);
len2 = ftoa(20.5, buffer, 6, FTOA_FLAG_PLUS | FTOA_FLAG_HASH);
printf("%s\n   len1 = %u, len2 = %u\n\n", buffer, len1, len2);

len1 = ftoa(1000.0*asin(0.98), NULL, 9, 0);
len2 = ftoa(1000.0*asin(0.98), buffer, 9, 0);
printf("%s\n   len1 = %u, len2 = %u\n\n", buffer, len1, len2);

len1 = ftoe(1.23456e35, NULL, 6, FTOA_FLAG_PLUS | FTOA_FLAG_HASH);
len2 = ftoe(1.23456e35, buffer, 6, FTOA_FLAG_PLUS | FTOA_FLAG_HASH);
printf("%s\n   len1 = %u, len2 = %u\n\n", buffer, len1, len2);

len1 = ftoe(20.5, NULL, 6, FTOA_FLAG_PLUS | FTOA_FLAG_HASH);
len2 = ftoe(20.5, buffer, 6, FTOA_FLAG_PLUS | FTOA_FLAG_HASH);
printf("%s\n   len1 = %u, len2 = %u\n\n", buffer, len1, len2);

len1 = ftoe(1000.0*asin(0.98), NULL, 9, FTOA_FLAG_HASH);
len2 = ftoe(1000.0*asin(0.98), buffer, 9, FTOA_FLAG_HASH);
printf("%s\n   len1 = %u, len2 = %u\n\n", buffer, len1, len2);
}

The first three floats are printed with ftoa() and the same floats are then printed using ftoe(). Each trial calls ftoa/ftoe twice to get result string size, first with a NULL buffer and second with a non-NULL. I'm not sure why the last result is off by one for the NULL case.

A screenshot from program running on zx target:

https://drive.google.com/open?id=0B6XhJ ... authuser=0

The lock issue is not resolved so I can't commit updated library files; I'll have to find help at sf to remove the orphaned lock. If you want to try it, you'll have to rebuild the new clib (go to z88dk/libsrc/_DEVELOPMENT and run "Winmake all" from a windows shell. The makefile is not up to date for other machines).

On the Z80 there's another fast way to print integers (don't know if it
helps here though). I was considering using it in the sdcc printf()
family, but did not get around to implementing it there yet (as I'm
planning a bigger printf rewrite there anyway).
It goes from binary to bcd to ascii. See attachd example (I used that on
the ColecoVision).

That's an interesting method -- it doesn't help with the float case, but it could be used with int->ascii conversion (as you're doing in that example). We're doing it with the usual repeated subtraction by powers of 10: http://z88dk.cvs.sourceforge.net/viewvc ... iew=markup

That routine is written to allow carry flag on input to determine if leading zeroes should also be output. That allows the smaller int->ascii conversion to be reused by the larger long->ascii conversion. The subtraction loop is a little bit faster but on average 5 iterations per decimal digit (except 3.5 on first for an int) compared to that bcd method with is one loop iteration per bit. That's ~23 iterations for an int using subtraction vs 16 iterations with bcd. BCD is also shorter. That is probably worth looking into.

------------------------------------------------------------------------------

Post by **alvin** » Thu Jun 04, 2015 6:01 pm

I've added fully function %aefgAEFG printf converters to the new clib and they implement all flags. I ended up printing the float string to a temporary buffer on the stack; initially I was going to send the float string piecemeal so no temporary buffer was needed but this added about 70 bytes to the code size and would have been slower. Instead I allocated space for a float string of 64 chars on the stack and if this size is exceeded "<double>" is sent to the stream (analogous to sending "<null>" when printing %s with pointer == 0). A consequence is printf can use ~120 bytes of stack space when printing floats. Width is limited to 32767 (as for the rest of the converters) and precision is limited to 255 (unlike the rest which can go as high as 32767). Higher width/prec than 32767 generates an error but float precision above 255 will be capped at 255 without error.

Associated with those converters are non-standard stdlib functions ftoa(), ftoe(), ftog() and ftoh(). All have the same prototypes:

size_t ftoa(double x, void *buf, uint16_t prec, uint16_t flags); // like %f
size_t ftoe(double x, void *buf, uint16_t prec, uint16_t flags); // like %e
size_t ftog(double x, void *buf, uint16_t prec, uint16_t flags); // like %g (selects better format of %f or %e)
size_t ftoh(double x, void *buf, uint16_t prec, uint16_t flags); // like %a (float hex string that shows binary representation exactly)

The return value is always the length of the float string not including a terminating '\0'.

If buf==0, the float string is not written but the length is returned so a program can use this to allocate a buffer sufficient to hold the string.

prec = precision which represents number of digits appearing after '.' or total number of significant digits printed depending on conversion. A 16-bit parameter is accepted but this is capped at 255 (254 for %a) because 8-bit registers are used to hold this value.

ftoh() uses a precision value of 255 to mean "use the number of significant digits supported by the float type", ie the entire binary representation is printed, no more, no less.

I've not done sdcc yet -- that will be after I get scanf working. At this moment, new clib compiles using printf must link to the math library "-lm_sccz80" as by default the float converters are enabled so a linking error will occur if float functions are not linked. I'm doing that next, using the stdio settings generated by zpragma to select reduced printfs inside the crt to eliminate this issue just like it's done now in the classic lib.

------------------------------------------------------------------------------

Post by **alvin** » Fri Jun 05, 2015 4:59 am

I'm doing that next, using the stdio settings generated by zpragma to select reduced printfs inside the crt to eliminate this issue just like it's done now in the classic lib.

I've done this now but may have to undo it.

Enabling of printf/scanf converters works as follows:

* A bitmask in a config file for each target selects which converters are included in the library when it is built:

http://z88dk.cvs.sourceforge.net/viewvc ... iew=markup
(line 247 here)

As you can see, this is currently set to enable everything include the float converters %aefgAEFG

* sccz80 further scans printf strings at compile time to determine what level of printf is required: mini (integer without width/precision), complex (integer) and float. This is imperfect so a command line option can force selection of a particular level but in practice it works so well people don't even notice it. sccz80 writes the printf level into the opts file which is read by the crt. The crt makes the connection between calls to vfprintf to one of the three library functions vfprintf_mini, vfprintf_complex, vfprintf_float.

I have put this in on a trial basis for exactly one crt, the default 32-column compile for the zx target. ( zcc +zx -vn -clib=new main.c -o main). I've made three versions of vfprintf in the source tree to support this: vfprintf_unlocked (full featured), vfprintf_unlocked_fp (full featured, same as <-), vfprintf_unlocked_int (does not include %aefgAEFG). The clib configuration is honoured in these implementations as well so you can still remove individual converters by editing the config file and rebuilding the library.

The compiles work correctly, automatically selecting vfprintf_unlocked_int when no floats are printed in printf and vfprintf_unlocked_fp otherwise.

If you don't print floats you can compile with "zcc +zx -vn -clib=new main.c -o main" despite %aefg being enabled in the config file. If you do print floats, a linking error will be generated unless you also link to the math library with "zcc +zx -vn -clib=new main.c -lm_sccz80 -o main"

And that brings us to sdcc. sdcc does not do the equivalent so what is in printf is determined by the library config file which by default enables floats %aefg. This means printf is compiled with float code whether or not the program prints floats so you have to link to the math library all the time in order to eliminate linking errors ("zcc +zx -vn -clib=sdcc_ix --reserve-regs-iy main.c -lm_sccz80 -o main"). (NOTE floats are not working with sdcc currently so printing them will likely cause a crash). If you don't want the float code in printf you have to edit that config file and rebuild the library for the target which might take about 3-5 minutes. It's a bit inconvenient so I'm wondering if anyone can think of another solution that will be automatic as in the sccz80 case?

------------------------------------------------------------------------------

Post by **alvin** » Fri Jun 05, 2015 6:06 pm

I was trying a few test programs to compare speed with genmath but there are bugs in the genmath int->float conversion which I will post in the bugs forum.

Here is one test program:

// #include <stropts.h>
#include <stdio.h>
#include <math.h>

double pi_loc;
unsigned int angle;

main()
{
// ioctl(1, IOCTL_OTERM_PAUSE, 0);

pi_loc = 4.0*atan(1.0);
printf("pi = %f\n\n", pi_loc);

for (angle=0; angle < 360; ++angle)
{
printf("sin(%f) = %f; ", (double)(angle), sin((pi_loc/180.0)*angle));
}
}

Compile with classic library for zx target using genmath (40 bit mantissa, 8 bit exponent):

zcc +zx -vn test.c -o test -lm -lndos

Compile with classic library for zx target using the native zx rom for floats (32 bit mantissa, 8 bit exponent):

zcc +zx -vn test.c -o test -lmz -lndos

Compile using new clib using math48 (40 bit mantissa, 8 bit exponent):

zcc +zx -vn -clib=new -startup=4 test.c -o test -lm_sccz80

Uncomment the ioctl for the new clib to set the terminal to continuously scroll. All three outputs use the 64 column mode.

Approximate runtime by stopwatch which includes printing and scrolling the window:

native zx + classic (uses %e rather than %f) : longer
genmath + classic : ~2mins20s
math48 + new : ~27s

Both the classic compiles do not work because their int->float conversion is bugged so the angles are large negative numbers rather than 0-360. This means they probably execute a codepath that computes the angle mod 2pi. Although the new lib properly eliminates trailing zeroes and decimal point, it still computes the default 6 sig figures as the others are printing but the extra printing causes more time for scrolling and plotting to the display. There are a lot of factors at play here but even considering that there seems to be a big improvement in speed with math48.

If you don't want the float code in printf you have to edit that config file and rebuild the library for the target which might take about 3-5 minutes. It's a bit inconvenient so I'm wondering if anyone can think of another solution that will be automatic as in the sccz80 case?

And I thought of a way using the current z80asm underline kludge

If z80asm can't find a function ("vfprintf_unlocked") then it tries again using the same name with underscore prepended ("_vfprintf_unlocked"). I could assign the integer vfprintf to that name so that z80asm settles on the integer vfprintf unless otherwise instructed, eg by pragma in the source or define on the compile line. But still this would be different behaviour than what is standard in the C world.

I think we'll just have to settle for the library config and have the user rebuild the library if he wants floats enabled in printf. If we want to mix object files generated by sdcc and sccz80 they will have to use the same vfprintf function.

------------------------------------------------------------------------------

Post by **alvin** » Tue Jun 09, 2015 5:56 pm

The repo is reverted to clib configuration without printf %aefg enabled so to enable those requires a recompile of the target library. You can still use the stdlib functions ftoa, ftoe, ftog to create float strings with printf float disabled.

I've added the first attempt at atof, strtod and strtof. These will also be the foundation of scanf %aefg. strtod is not quite complete yet -- it understands all decimal float strings as well as "inf", "infinity" and "nan" and trailing float indicators ("fFlL"). It doesn't yet support float hex strings or "nan(...)".

The implementation is again very fast and I expect it to be a couple times faster than genmath or sdcc. It reads no more than the float library's indicated number of significant digits and counts power of 10 modifier rather than multiplying by 10 or dividing by 10 at each step. While reading the significant digits, a special mul10 subroutine is used that performs float multiplication by 10 with shifts and addition to the exponent. The last step is adjustment by a power of 10 using a single call to another specialized function for that purpose.

I did figure out while writing this that I need to specify exactly the assembly language interface as I was getting sccz80 primitives mixed up with others. sccz80 is using "ufloat" to convert a 32bit unsigned long to float whereas I was expecting a 16bit unsigned int to float. Clear names have to be introduced for the asm interface.

Anyway a brief test program:

#include <stdio.h>
#include <stdlib.h>

double a;

char in[128];
char out[128];

main()
{
while (1)
{
printf("\nEnter float: ");

fflush(stdin);
scanf("%[^\n]", in);

a = strtod(in, NULL);
ftog(a, out, 6, 0);

printf("Float was: %s\n", out);
}
}

Compile with:

zcc +zx -vn clib=new test.c -o test -lm_sccz80

You will note that strtod accepts both 'e' and 'p' as exponential character since some code will be reused for the hex float string conversion.

------------------------------------------------------------------------------

Post by **alvin** » Sat Jun 13, 2015 9:49 am

strtod() is completed except for some cleaning up. It should now parse any C11 decimal float string, hex float string, "inf", "infinity", "nan", "nan(...)". Because it does all these things, it is also lengthy maybe ~600 bytes including some library functions that could be reused by the program. It's still much smaller than the free 'embedded' implementations available on the web in C but I think I will introduce some configuration flags to opt out of the special forms and hex float strings. Maybe down the road we might want to offer a slower and smaller version too.

It's late so I haven't thoroughly tested corner cases so there may yet be some problems. Here is the test program I've been using if you'd like to try:

// zcc +zx -vn -clib=new test.c -o test -lm_sccz80

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

double a;

char in[128];
char *in_end;

char out[128];

main()
{
while (1)
{
printf("\nEnter float: ");

fflush(stdin);
scanf("%[^\n]", in);

errno = 0;
a = strtod(in, &in_end);

printf("%s\n", strerror(errno));
printf("Subject : \"%.*s\"\n", in_end - in, in);
printf("Trailing: \"%s\"\n", in_end);

ftog(a, out, 6, 0);
printf("Float was: %s\n", out);

ftoh(a, out, 255, 0);
printf("Float was: %s\n", out);
}
}

Still to do:

* scanf integration
* formally specify the asm and c interfaces to the math library and better organize math48 to follow that in order for it to act as a template
* get rid of the float/double dichotomy in the code
* bring the float lib to sdcc

------------------------------------------------------------------------------

Post by **alvin** » Sun Jun 14, 2015 12:00 am

I'm a little surprised by the level of inaccuracy that shows up in floating point calculations.

If I enter "1e-12", the float read is printed back as "1.0000000000e-12" which is correct to ten decimal digits.

If I enter "10e-12", the float read is printed back as "1.0022700657e-11" where you can see inaccuracy begins in the third decimal place or about the ninth or tenth binary digit out of 40.

This result is being computed as 10.0*1e-12 so I thought there was no way there should be an error that large.

I worked it out by hand:

1e-12
=
0x8.d0e95b27d * 2^(-43)

(the hex format is showing the mantissa bits exactly)

If I multiply this by hand by 10 (add this mantissa to a version shifted right by two bits then add 3 to the base 2 exponent) I get:

0xb.0523b1f1c * 2^(-40)

This is almost exactly the number returned when entering "10e-12" which the program is yielding in hex format as:

0xb.0523b1f1d * 2^(-40)

and decimal equivalent is also:

1.0022700657e-11

You can find this in your calculator by doing:

b0523b1f1c
hex to decimal = 757293850396
* 2^(-36) * 2^(-40) = 1.002270065685...e-11

(the 2^(-36) is first move the binary point just after the first hex digit)

So it all seems correct. I am just very surprised that the error sneaks up into the 9th or 10th binary digit out of 40. I still can't see how that is happening from the manual shifted add I did to multiply by 10 ??? I spent an hour looking for the math error and there doesn't seem to be one.

------------------------------------------------------------------------------

Post by **alvin** » Sun Jun 14, 2015 12:13 am

I should add I have changed the program slightly so that all the significant digits are shown instead of defaulting to precision of 6:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

double a;

char in[128];
char *in_end;

char out[128];

main()
{
while (1)
{
printf("\nEnter float: ");

fflush(stdin);
scanf("%[^\n]", in);

errno = 0;

a = strtod(in, &in_end);

printf("%s\n", strerror(errno));

printf("Subject : \"%.*s\"\n", in_end - in, in);
printf("Trailing: \"%s\"\n", in_end);

ftog(a, out, 11, FTOA_FLAG_HASH);
printf("Float was: %s\n", out);

ftoh(a, out, -1, FTOA_FLAG_HASH);
printf("Float was: %s\n", out);
}
}

------------------------------------------------------------------------------

Post by **alvin** » Sun Jun 14, 2015 3:14 pm

I'm a little surprised by the level of inaccuracy that shows up in floating point calculations.

If I enter "1e-12", the float read is printed back as "1.0000000000e-12" which is correct to ten decimal digits.

If I enter "10e-12", the float read is printed back as "1.0022700657e-11" where you can see inaccuracy begins in the third decimal place or about the ninth or tenth binary digit out of 40.
........
So it all seems correct. I am just very surprised that the error sneaks up into the 9th or 10th binary digit out of 40. I still can't see how that is happening from the manual shifted add I did to multiply by 10 ??? I spent an hour looking for the math error and there doesn't seem to be one.

I found the error -- a transposition in the hex digits defining the constant 1e12:

defw $684D, $A510, $00A8 ;1E+12
defw $68D4, $A510, $00A8 ;1E+12 -- anders, transposition of digits here!!

When printing positive powers of 10 as in "1e12" or "10e12" the error was covered up by decimal rounding. However printing negative powers as in "10e-12" involves a division where the error became magnified.

With this patch applied, "10e-12" now properly displays as "1e-11". If anything else is parsed incorrectly let me know!

zx spectrum snapshot executable:

https://drive.google.com/file/d/0B6XhJJ ... sp=sharing

------------------------------------------------------------------------------

Post by **stefano** » Mon Jun 15, 2015 4:25 pm

Well done

------------------------------------------------------------------------------

Post by **alvin** » Thu Jun 18, 2015 5:45 am

I've come up with a standard math interface that should be able to be applied to any math library. It's taking a some time to redo all the functions but hopefully I will find some time to finish it tomorrow evening.

Here's an example showing what the assembly language interface looks like for the math48 library:

; double acosh(double x)

SECTION code_fp_math48

PUBLIC am48_acosh

EXTERN am48_dpush, am48_dconst_1, am48_dsub, am48_sqrt
EXTERN am48_dadd, am48_dmul_s, am48_dadd_s, am48_log

am48_acosh:

; inverse hyperbolic cos
; AC' = acosh(AC')
;
; enter : AC' = double x
;
; exit : success
;
; AC' = acosh(x)
; carry reset
;
; fail if domain error (x < 1)
;
; AC' = 0
; carry set, errno set
;
; uses : af, af', bc', de', hl'

; acosh(x) = ln(x+sqrt(x-1)*sqrt(x+1))

push bc ; save AC
push de
push hl

call am48_dpush ; push AC'=x

call am48_dconst_1
call am48_dsub
call am48_sqrt ; AC'= sqrt(x-1)

pop hl
pop de
pop bc ; AC = x

jp c, am48_dpopret ; if x < 1

push bc ; push x
push de
push hl

exx ; AC'= x

push bc ; save sqrt(x-1)
push de
push hl

call am48_dconst_1
call am48_dadd
call am48_sqrt ; AC'= sqrt(x+1)

call am48_dmul_s ; AC'= sqrt(x+1)*sqrt(x-1)
call am48_dadd_s ; AC'= x+sqrt(x+1)*sqrt(x-1)

pop hl
pop de
pop bc ; restore original AC

jp am48_log ; AC'= ln(AC')

The names without leading "am48_" are the standard names exposed to the clib and are defined when you link a math libary (eg "-lm"). If math48 is assigned to math library "m" then "m.lib" defines the standard names like so:

defc dadd = am48_dadd
defc dmul = am48_dmul
etc

So "m.lib" is really just a collection of aliases to the math48 asm interface. Similarly another math library linked with "-lmz", eg, would define aliases for the standard names for a different math library:

defc dadd = agen_dadd
defc dmul = agen_dmul
etc

So the linking "-lm", etc is the mechanism for creating the standard names called from the clib to perform the float functions.

Because the math libraries themselves have different names for the same functions (the leading "am48_" bit, eg) all the math library code can be placed in the target's standard library. This means you can call the math library functions directly by fully qualified name without linking if that's what you want.

Each math library implements a standard set of functions, examples of which are the am48_ functions in the listing above. I think I have about 175 of them defined. They include the standard clib math functions (sqrt, cos, etc) as well as lower level functions like add, sub, etc. The computation model universally fits any math library, I believe. There is a primary floating point accumulator in the EXX set (BC', DE', HL'). Anything could be there, including a pointer to a block of memory where the float is stored. Math48 places the entire float in BC', DE', HL'. A secondary float register can be in the main set BC, DE, HL. The math primitives can operate on one parameter (the primary floating point acc in EXX set) or two (both sets of registers). Further, floats can be pushed and popped from the stack or loaded and stored to memory. Some of the math primitives can operate with one parameter in the primary acc (EXX set) and the other on the stack.

Anyway I think it is a good structure for floating point calculations on the z80. The code seems to be compact and easy to program.

------------------------------------------------------------------------------

Post by **alvin** » Sun Jun 21, 2015 5:49 am

I found a very curious result. I tried using stefano's 3d plot example from the wiki http://www.z88dk.org/wiki/doku.php?id=e ... ts:coswave

It graphs the function "cos(x^2 + y^2)" on screen.

First some running times (@3.5MHz and interrupts enabled on the zx target):

clib=new, math48 : 2m20s
clib=classic, genmath : 2m31s
clib=classic, zxmath : 3m17s
zx basic : 4m19s

zxmath is the c compiler using the native float implementation in the zx rom. The zx basic version is an equivalent basic program that, naturally, also uses the zx rom float implementation.

When compiling for the classic library I used the available "plot" function but the new clib doesn't have graphics so I used the display address manipulators for its compile as shown below. The times are very close so the different plot methods can make some difference.

The program using math48 and the new clib:

#include <stdio.h>
#include <math.h>
#include <arch/spectrum.h>

double x, y;
unsigned int z, buf, px, py;

main()
{
for (x = -3.0; x < 3.0; x += 0.06)
{
buf = 100;

for (y = -3.0; y < 3.0; y += 0.2)
{
z = 70.0 - (10.0 * (y + 3.0) + (10.0 * cos(x*x + y*y)));

if (buf > z)
{
buf = z;

px = 32.0 * (x + 3.0);
py = z * 2;

*(unsigned char *)(zx_pxy2saddr(px, py)) |= zx_px2bitmask(px);
// plot(px, py);
}
}
}
}

Now here is the surprise: the drawn graphs are not the same! You can have a look at some emulator snapshots I took after running each program.

https://drive.google.com/file/d/0B6XhJJ ... sp=sharing

Drag and drop each snapshot to a spectrum emulator and compare the output graphs. You will see both math48 and genmath generate similar graphs which are flat around the (0,0) coordinate. The basic implementation produces a much smoother graph that is rounded around (0,0). The C compiled program using the zx rom math implementation is in between -- it is not identical to the basic program's output! I'm not sure why that is yet.

Keep in mind both math48 and genmath have more mantissa bits than the zx rom's fp (40 vs 32). I am wondering if the transcendental functions are not implemented with enough accuracy for the 40 bit mantissa. Math48 is using a truncated taylor series while I know the zx rom is using a chebyshev series. Taylor series error tends to be monotonic in an interval while chebyshev oscillates a fixed amount so the latter delivers better results with fewer terms. I have not looked at the zx fp implementation yet to know if it uses a fixed number of terms or continues to calculate until error is small. The math48 (and I assume genmath) uses a fixed number of terms which could mean worse error in some intervals.

Any other ideas? Having 40 bits of mantissa is meaningless if your series can't generate that level of precision.

------------------------------------------------------------------------------

Post by **alvin** » Sun Jun 21, 2015 6:12 am

I was just looking at the asm generated by the compiler and liking how small the float code is:

ld hl,_y
call dldpsh
ld hl,i_2+0
call dload
call dlt
ld a,h
or l
jp z,i_7
ld hl,i_2+18
call dldpsh
ld hl,i_2+24
call dldpsh
ld hl,_y
call dldpsh
ld hl,i_2+0
call dload
call dadd
call dmul
call dpush
ld hl,i_2+24
call dldpsh
ld hl,_x
call dldpsh
ld hl,_x
call dload
call dmul
call dpush
ld hl,_y
call dldpsh
ld hl,_y
call dload
call dmul
call dadd
call cos
call dmul
call dadd
call dsub
call ifix

It's very nice to see the compiler is able to just roll off one fp operation after another.

But then I spotted this:

ld hl,(_px)
push hl
ld hl,(_py)
push hl
call zx_pxy2saddr_callee
push hl
ld e,(hl)
inc hl
ld d,(hl)
push de
ld hl,(_px)
call zx_px2bitmask
pop de
call l_or
pop de
call l_pint

for the C code:

*(unsigned char *)(zx_pxy2saddr(px, py)) |= zx_px2bitmask(px);

The C code is trying to manipulate one byte on screen but the generated asm is writing an entire word. That won't affect the graph as the second byte being ORed into the screen is 0 but this is definitely a compiler bug.

------------------------------------------------------------------------------