Sometimes one wants to speed up a piece of code but gets stuck on how to do it.
So I invite you to participate and contribute to this special topic, as a sort of notebook for all of us, collecting the best known tricks for cutting the execution time of machine language programs.
My point of view: code that gets faster without getting much bigger is also more efficient code.
The Rules:
- target platform: VIC-20
- machine language only (sorry, BASIC fans; the Minimon cartridge from Mike, Vicmon or a similar tool can help you get started)
- no speedcode (unrolled branch loops)
- only 6502 CPU opcodes, normal and extra
- speed gain should be reasonable
- clock cycles in postings should be indicated for every opcode, every code segment, and the complete execution time
- postings are not limited to my first example; new and different code examples are welcome
- feedback mandatory
Note 1: In all later examples I will abbreviate "clock cycles" to cy and "raster lines" to rl (1 rl = 71 cy).
Note 2: The examples don't consider:
- whether the code runs with interrupts disabled (SEI) or enabled (CLI)
- delays induced by the G and BREAK commands, or other outside delays caused by the assembler or ML monitor in use
---------------------------------------------------------------------------------------------------------------------------------------
Please consider the following code example, which transfers some uncompressed data, here one page of buffer data, to the first half of the unexpanded VIC screen.
I would write it like this, Example 1:
opcode              cy
--------------------------------------
      LDY #$00      2 +
loop  LDA $A000,Y   4
      STA $1E00,Y   5
      INY           2
      BNE loop      3 / (2 on loop exit)
Quite a lot of cycles for so little work.
I used to think that adding something to the loop would raise the execution time, but in fact it doesn't, for a reason that is not so obvious. See for yourself.
Example 2:
opcode              cy
--------------------------------------
      LDY #$7F      2 +
loop  LDA $A000,Y   4
      STA $1E00,Y   5
      LDA $A080,Y   4
      STA $1E80,Y   5
      DEY           2
      BPL loop      3 / (2 on loop exit)
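So we gained 640 cy with no great effort (+6 bytes) compared with my first example. The second LDA/STA pair halves the number of passes, so the index update and the branch are paid for only 128 times instead of 256.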
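For anyone who wants to check the 640 cy without an emulator, the totals can be added up by hand. The Python sketch below only sums the per-opcode cycle counts from the two listings. It assumes that neither the indexed loads nor the branch cross a page boundary, and the helper name `loop_cycles` is made up for this illustration.

```python
def loop_cycles(setup, body, iterations, branch_taken=3, branch_not_taken=2):
    """Total cycles for one setup instruction followed by a counted loop.

    body is the sum of all cycles inside the loop except the closing branch.
    Assumption: no page-crossing penalties anywhere (not measured on hardware).
    """
    # Every pass but the last takes the branch; the last pass falls through.
    return (setup
            + (iterations - 1) * (body + branch_taken)
            + body + branch_not_taken)

# Example 1: LDA (4) + STA (5) + INY (2), 256 passes, LDY # (2) as setup
example1 = loop_cycles(2, 4 + 5 + 2, 256)

# Example 2: two LDA/STA pairs (4 + 5 + 4 + 5) + DEY (2), 128 passes
example2 = loop_cycles(2, 4 + 5 + 4 + 5 + 2, 128)

gain = example1 - example2
print("Example 1:", example1, "cy")
print("Example 2:", example2, "cy")
print("Gain:     ", gain, "cy")
```

Under these assumptions Example 1 comes to 3585 cy and Example 2 to 2945 cy, which is where the 640 cy difference comes from.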
Too bad that the limits of this first trick are soon reached: each further doubling of the LDA/STA pairs in the loop will not raise the cy gain as one might expect.
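Why the gain shrinks can be seen by counting: each pass costs 9 cy per LDA/STA pair plus 5 cy of overhead (index step and taken branch), and only that overhead gets smaller when pairs are added. A rough Python sketch under the same no-page-crossing assumption follows. `copy_loop_cycles` is a hypothetical helper, and it assumes a 2 cy immediate load as setup and a 2 cy branch on loop exit, as in the listings above.

```python
def copy_loop_cycles(pairs, total_bytes=256):
    """Cycles to copy total_bytes with `pairs` LDA/STA pairs per loop pass.

    Assumptions: LDA abs,Y = 4 cy, STA abs,Y = 5 cy, INY/DEY = 2 cy,
    taken branch = 3 cy, final branch = 2 cy, setup load = 2 cy,
    and no page-crossing penalties.
    """
    assert total_bytes % pairs == 0, "pairs must divide the byte count"
    passes = total_bytes // pairs
    per_pass = pairs * (4 + 5) + 2 + 3
    # The last pass falls through the branch, which saves 1 cy.
    return 2 + passes * per_pass - 1

previous = None
for pairs in (1, 2, 4, 8):
    total = copy_loop_cycles(pairs)
    saved = "" if previous is None else f"  (saves {previous - total} cy)"
    print(f"{pairs} pair(s): {total} cy{saved}")
    previous = total
```

With 1, 2, 4 and 8 pairs this gives 3585, 2945, 2625 and 2465 cy, so each doubling saves only half as much as the one before: 640, 320, then 160 cy. The 2304 cy spent on the loads and stores themselves (256 x 9) cannot be reduced this way.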
But the good news is that the percentage speed gain of 17.8 % (640 cy) holds constant if the second block of video RAM is filled too, by expanding the loop with LDA $A100/80,Y / STA $1F00/80 ... Check it yourself.
Who wants to beat this, and can?
Done, and bye for now.