I have a large function which does collision detection. Its flow is to start with x,y coords, add 8 to one of them, call a function, add 1, call the function again, add 8 to the other, call the function, and so on. There are lots of branches which depend on the way a sprite is facing, etc. I tried to optimise it by replacing all the local variables (most are temporaries holding the results of adding the offsets) with globals. The code size went from 861 bytes to 1159, and although the T-states were hard to measure empirically, I'm pretty sure it got slower.
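To make the shape concrete, here is a rough sketch of the two styles. This is not the actual game code: probe(), the variable names and the exact offsets are made up for illustration, and the real function has many more steps and branches.

```c
/* Hypothetical stand-in for the real per-point check: reports a hit
   when both coordinates sit on a 16-pixel boundary. */
static unsigned char probe(unsigned char x, unsigned char y)
{
    return (unsigned char)(((x & 0x0F) == 0) && ((y & 0x0F) == 0));
}

/* Style A: the offset temporaries are locals. */
unsigned char collide_locals(unsigned char x, unsigned char y)
{
    unsigned char px = (unsigned char)(x + 8);  /* add 8 to one coord */
    unsigned char py = y;
    unsigned char hits = probe(px, py);

    px = (unsigned char)(px + 1);               /* then add 1 */
    hits = (unsigned char)(hits + probe(px, py));

    py = (unsigned char)(py + 8);               /* then add 8 to the other */
    hits = (unsigned char)(hits + probe(px, py));
    return hits;
}

/* Style B: the same temporaries held as globals. */
unsigned char g_px, g_py, g_hits;

unsigned char collide_globals(unsigned char x, unsigned char y)
{
    g_px = (unsigned char)(x + 8);
    g_py = y;
    g_hits = probe(g_px, g_py);

    g_px = (unsigned char)(g_px + 1);
    g_hits = (unsigned char)(g_hits + probe(g_px, g_py));

    g_py = (unsigned char)(g_py + 8);
    g_hits = (unsigned char)(g_hits + probe(g_px, g_py));
    return g_hits;
}
```

Both versions compute the same thing; the only difference is where the temporaries live, which is the change that took the real function from 861 bytes to 1159.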
So I wrote a test case and had a play with it. Consider this simple code:
Code:
unsigned int test1( unsigned char x, unsigned int y )
{
    unsigned int result;
    unsigned char x_local = x;
    unsigned int y_local = y;
    result = x_local + y_local;
    return result;
}
Code:
0031 DD E5          push ix
0033 DD 21 00 00    ld ix,0
0037 DD 39          add ix,sp
0039                ;test.c:23: unsigned char x_local = x;
0039 DD 5E 04       ld e,(ix+4)
003C                ;test.c:24: unsigned int y_local = y;
003C DD 4E 05       ld c,(ix+5)
003F DD 46 06       ld b,(ix+6)
0042                ;test.c:26: result = x_local + y_local;
0042 26 00          ld h,0x00
0044 6B             ld l, e
0045 09             add hl, bc
0046                ;test.c:28: return result;
0046                ;test.c:29: }
0046 DD E1          pop ix
0048 C9             ret
The alternative:
Code:
unsigned int result;
unsigned char x_local;
unsigned int y_local;

unsigned int test2( unsigned char x, unsigned int y )
{
    x_local = x;
    y_local = y;
    result = x_local + y_local;
    return result;
}
Code:
0000 DD E5          push ix
0002 DD 21 00 00    ld ix,0
0006 DD 39          add ix,sp
0008                ;test.c:11: x_local = x;
0008 DD 7E 04       ld a,(ix+4)
000B 32 02 00       ld (_x_local),a
000E                ;test.c:12: y_local = y;
000E DD 6E 05       ld l,(ix+5)
0011 DD 66 06       ld h,(ix+6)
0014 22 03 00       ld (_y_local),hl
0017                ;test.c:14: result = x_local + y_local;
0017 3A 02 00       ld a,(_x_local)
001A 06 00          ld b,0x00
001C 21 03 00       ld hl,_y_local
001F 86             add a, (hl)
0020 32 00 00       ld (_result),a
0023 78             ld a, b
0024 21 04 00       ld hl,_y_local + 1
0027 8E             adc a, (hl)
0028 32 01 00       ld (_result + 1),a
002B                ;test.c:16: return result;
002B 2A 00 00       ld hl, (_result)
002E                ;test.c:17: }
002E DD E1          pop ix
0030 C9             ret
I can see that copying the input values from the stack into the globals costs time and instructions, and since that only happens once, it makes this a slightly unfair comparison in such a small, simple test case. Even so, the pattern of loading a memory address and then reading or writing through it is clearly heavier than using the index register, and in my game code the difference is quite marked.
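To put rough numbers on "heavier", here is a small tally of the two listings. The per-instruction T-state figures are the documented Z80 timings as I remember them, so treat them as an assumption and check them against a reference before relying on the totals. The array and function names are just for this sketch.

```c
/* T-states per instruction, in listing order. Timings quoted from
   memory of the Z80 documentation: an assumption, not measured. */

/* Shared frame code: push ix; ld ix,0; add ix,sp; pop ix; ret */
static const unsigned char frame_t[] = { 15, 14, 15, 14, 10 };

/* test1 body: ld e,(ix+4); ld c,(ix+5); ld b,(ix+6);
               ld h,0; ld l,e; add hl,bc */
static const unsigned char test1_t[] = { 19, 19, 19, 7, 4, 11 };

/* test2 body: ld a,(ix+4); ld (_x_local),a; ld l,(ix+5); ld h,(ix+6);
               ld (_y_local),hl; ld a,(_x_local); ld b,0; ld hl,_y_local;
               add a,(hl); ld (_result),a; ld a,b; ld hl,_y_local+1;
               adc a,(hl); ld (_result+1),a; ld hl,(_result) */
static const unsigned char test2_t[] = { 19, 13, 19, 19, 16, 13, 7, 10,
                                         7, 13, 4, 10, 7, 13, 16 };

/* Add up n T-state entries. */
static unsigned int sum_t(const unsigned char *t, unsigned int n)
{
    unsigned int total = 0;
    unsigned int i;
    for (i = 0; i < n; i++)
        total += t[i];
    return total;
}

/* Whole-function totals: frame code plus body. */
unsigned int test1_tstates(void)
{
    return sum_t(frame_t, sizeof frame_t) + sum_t(test1_t, sizeof test1_t);
}

unsigned int test2_tstates(void)
{
    return sum_t(frame_t, sizeof frame_t) + sum_t(test2_t, sizeof test2_t);
}
```

On those figures test1 comes to 147 T-states and test2 to 254. The sizes can be read straight off the listing addresses: 24 bytes for test1 (0x0031 to 0x0048) against 49 bytes for test2 (0x0000 to 0x0030).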
So, what is the current conventional wisdom? Am I doing this wrong and drawing incorrect conclusions? Has the ZSDCC compiler been improved in this area, or was the advice to prefer globals only ever meant for sccz80 users? Most importantly, what's the current recommendation: globals or locals?