Speeding up execution time without using speedcode

Noizer · Post by **Noizer** » Mon Jun 08, 2020 12:29 pm

Dear all,

Sometimes one wants to speed up a piece of code but gets stuck on how to do it.
So I invite you to participate in and contribute to this topic, as a sort of notebook for all of us, collecting the best-known tricks for speeding up the execution time of machine language programs.
My point of view: code that gets faster without getting much bigger is also the more efficient code.

The Rules:
- target platform: VIC-20
- machine language only (sorry, BASIC fans; the Minimon cartridge from Mike, Vicmon or a similar tool could help you get started)
- no speedcode (unrolled loops)
- only 6502 CPU normal & extra opcodes
- the speed gain should be reasonable
- clock cycles in postings should be indicated for every opcode, every code segment and the complete execution time
- postings are not limited to my first example; new and different code examples are welcome
- feedback is mandatory

Note 1: In all later examples I will abbreviate "clock cycles" to cy and "raster lines" to rl (1 rl = 71 cy, as on a PAL VIC-20).
Note 2: The examples don't consider:
- whether the code runs with interrupts disabled (SEI) or enabled (CLI)
- delays induced by the G and BREAK commands, or other outside delays caused by the assembler or ML monitor used
---------------------------------------------------------------------------------------------------------------------------------------

Please consider the following code example, which transfers some uncompressed data: here, a one-page buffer to the first half of the unexpanded VIC screen.

I would write it like this. Example 1:

Code:

opcode              cy
--------------------------------------
      LDY #$00      2 +
loop  LDA $A000,Y   4
      STA $1E00,Y   5
      INY           2
      BNE loop      3 / (2 on loop exit)

So the execution time is 2 + (256 * 14) - 1 = 1 + 256 * 14 = 3585 cy, ~50.5 rl.

Quite a lot for so little.

I used to think that adding something to the loop would raise the execution time, but in fact it doesn't, for reasons that are not so obvious. See for yourself.
Example 2:

Code:

opcode              cy
--------------------------------------
      LDY #$7F      2 +
loop  LDA $A000,Y   4
      STA $1E00,Y   5
      LDA $A080,Y   4
      STA $1E80,Y   5
      DEY           2
      BPL loop      3 / (2 on loop exit)

Execution time now: 1 + 128 * 23 = 2945 cy, ~41.5 rl (9 rl less than above).

So we gained 640 cy for little extra effort (+6 bytes) compared to my first example.
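The cycle arithmetic for both examples can be re-checked with a short script. This is only a sketch: it takes the per-instruction costs exactly as listed above and assumes no page-crossing penalties (the table bases are chosen so that adding Y never carries into the next page, and the branch is assumed to stay within one page). The table and function names are made up for this illustration.

```python
# Per-instruction cycle costs as given in Examples 1 and 2.
# Assumption: no page-crossing penalties on the indexed loads or the branch.
CYCLES = {
    "LDY #imm": 2,
    "LDA abs,Y": 4,
    "STA abs,Y": 5,
    "INY": 2,
    "DEY": 2,
}
BRANCH_TAKEN = 3
BRANCH_NOT_TAKEN = 2
CY_PER_RL = 71  # cycles per raster line, from Note 1


def loop_cycles(body, passes):
    """Cycles for LDY #imm followed by `passes` runs of body + branch."""
    per_pass = sum(CYCLES[op] for op in body) + BRANCH_TAKEN
    # The final branch falls through, which is one cycle cheaper.
    return CYCLES["LDY #imm"] + passes * per_pass - (BRANCH_TAKEN - BRANCH_NOT_TAKEN)


example1 = loop_cycles(["LDA abs,Y", "STA abs,Y", "INY"], 256)
example2 = loop_cycles(["LDA abs,Y", "STA abs,Y"] * 2 + ["DEY"], 128)
saved = example1 - example2

print(example1, round(example1 / CY_PER_RL, 1))  # 3585 50.5
print(example2, round(example2 / CY_PER_RL, 1))  # 2945 41.5
print(saved, round(100 * saved / example1, 1))   # 640 17.9 (17.8 when truncated)
```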

Too bad that this first trick soon reaches its limits: each further doubling of the LDA/STA pairs in the loop does not raise the cy gain as much as one might expect.
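Why each doubling pays off less becomes visible when the total is written as a formula. A sketch, assuming the loop always covers 256 bytes with `pairs` LDA/STA pairs per pass and the same costs as in the examples (9 cy per pair, 5 cy for the counter update plus the taken branch, and a net 1 cy for LDY and the final fall-through); `cycles_for_pairs` is a hypothetical helper:

```python
def cycles_for_pairs(pairs, total_bytes=256):
    """Total cycles when each loop pass copies `pairs` bytes."""
    if total_bytes % pairs:
        raise ValueError("pairs must divide total_bytes")
    passes = total_bytes // pairs
    per_pass = 9 * pairs + 5      # LDA abs,Y + STA abs,Y per pair, DEY/INY + taken branch
    return 1 + passes * per_pass  # +2 for LDY, -1 for the final untaken branch


previous = None
for pairs in (1, 2, 4, 8, 16, 32):
    total = cycles_for_pairs(pairs)
    gain = "" if previous is None else f"(-{previous - total} cy)"
    print(f"{pairs:2d} pairs: {total} cy {gain}")
    previous = total
```

The copy itself always costs 256 * 9 = 2304 cy; only the loop overhead of 5 cy per pass shrinks, so successive doublings gain 640, 320, 160, 80 and 40 cy.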

But the good news is that the percentage speed gain of 17.8 % (640 cy) stays constant if the second block of video RAM is filled too, expanding the loop with LDA $A100/80,Y / STA $1F00/80 ... Check it yourself.

Who wants to, and can, undercut this?

Done, and bye for now.

chysn · Post by **chysn** » Mon Jun 08, 2020 12:47 pm

Noizer wrote: ↑Mon Jun 08, 2020 12:29 pm Now the question: Who wants to, and can, undercut this?

I mean... if you're not going to make me write it out, or write a script to generate it, there's always

Code:

lda $a000
sta $1e00
lda $a001
sta $1e01
lda $a002
sta $1e02
;
; yadda yadda yadda
;
lda $a0ff
sta $1eff

for 2048 cycles?

But my medium is unexpanded VIC, so I'm almost always more interested in memory optimization over cycle optimization.
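A script that generates such a listing is short. The sketch below is in Python (no script was posted in the thread, so the function name and output format are assumptions) and also tallies size and time using the standard costs of absolute-mode LDA and STA: 3 bytes and 4 cy each.

```python
def unrolled_copy(src=0xA000, dst=0x1E00, count=256):
    """Return the fully unrolled copy as a list of assembly source lines."""
    lines = []
    for i in range(count):
        lines.append(f"lda ${src + i:04x}")
        lines.append(f"sta ${dst + i:04x}")
    return lines


listing = unrolled_copy()
code_bytes = 3 * len(listing)   # every instruction is 3 bytes long
cycles = 4 * len(listing)       # every instruction takes 4 cycles

print("\n".join(listing[:4]))
print(f"... {len(listing)} instructions, {code_bytes} bytes, {cycles} cy")
```

This reproduces the 2048 cy, at a cost of 1536 bytes of code, which is a large share of the memory on an unexpanded VIC-20.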

tlr · Post by **tlr** » Mon Jun 08, 2020 1:03 pm

For arbitrary data I guess it's not possible to undercut that strategy. Assuming there are sequences in the data, like runs of equal symbols or simple patterns, then you can potentially generate the output faster. Finding the patterns could be automated, as in coding a specialized cruncher.
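One way to picture the run idea: if the data is fixed at build time, a generator can load each value once and store it to every address in the run. The sketch below is only an illustration of that idea, not an existing tool; it assumes straight-line output (so it is speedcode in this thread's sense), the standard costs of LDA #imm (2 cy) and STA abs (4 cy), and hypothetical names and sample data throughout.

```python
def generate_run_code(data, dst=0x1E00):
    """Emit 'lda #value' once per run of equal bytes, then one 'sta' per byte."""
    lines = []
    cycles = 0
    previous = None
    for offset, value in enumerate(data):
        if value != previous:
            lines.append(f"lda #${value:02x}")    # 2 cycles, once per run
            cycles += 2
            previous = value
        lines.append(f"sta ${dst + offset:04x}")  # 4 cycles per byte
        cycles += 4
    return lines, cycles


# Hypothetical screen data: two runs of 128 equal bytes each.
screen = bytes([0x20] * 128 + [0xA0] * 128)
lines, cycles = generate_run_code(screen)
print(len(lines), cycles)  # 258 1028
```

The cost is 4 cy per byte plus 2 cy per run, so the fewer and longer the runs, the closer the total gets to 1024 cy for one page.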

Noizer · Post by **Noizer** » Mon Jun 08, 2020 1:09 pm

chysn wrote: ↑Mon Jun 08, 2020 12:47 pm
Noizer wrote: ↑Mon Jun 08, 2020 12:29 pm Now the question: Who wants to, and can, undercut this?
I mean... if you're not going to make me write it out, or write a script to generate it, there's always
Code:
lda $a000
sta $1e00
lda $a001
sta $1e01
lda $a002
sta $1e02
;
; yadda yadda yadda
;
lda $a0ff
sta $1eff
for 2048 cycles?

But my medium is unexpanded VIC, so I'm almost always more interested in memory optimization over cycle optimization.

Hi, what you wrote is called "speedcode". I set out some rules to exclude certain approaches as too trivial, right?

chysn · Post by **chysn** » Mon Jun 08, 2020 1:18 pm

Noizer wrote: ↑Mon Jun 08, 2020 1:09 pm Hi, what you wrote is called "speedcode". I set out some rules to exclude certain approaches as too trivial, right?

I had never heard the term "speedcode" and a Google search didn't shed any light on what it meant. It seems like if your primary goal is speed, then why would you prohibit speedcode?

It might be better to measure performance using some ratio of code size and cycle count. My gut feeling is that the speedcode would be the worst approach, and Example 1 would be the best. Every time I do a machine language project for VIC-20, I reach the point where I'd kill my brother-in-law for six bytes.*

* Hyperbole
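Such a ratio could take many forms. As one hypothetical figure of merit (not something specified in the thread), the sketch below reports cycles saved per extra byte of code relative to Example 1, using instruction sizes of 2 bytes for LDY #imm and for the branch, 3 bytes for LDA/STA with a 16-bit address, and 1 byte for INY/DEY:

```python
# (name, code size in bytes, cycles) for the three variants in this thread.
VARIANTS = [
    ("Example 1", 2 + 3 + 3 + 1 + 2, 3585),
    ("Example 2", 2 + 4 * 3 + 1 + 2, 2945),
    ("Full unroll", 512 * 3, 2048),
]


def saved_per_extra_byte(variant, baseline):
    """Cycles saved per additional byte of code, compared with the baseline."""
    _, size, cycles = variant
    _, base_size, base_cycles = baseline
    return (base_cycles - cycles) / (size - base_size)


baseline = VARIANTS[0]
for variant in VARIANTS[1:]:
    name, size, cycles = variant
    ratio = saved_per_extra_byte(variant, baseline)
    print(f"{name}: {size} bytes, {cycles} cy, {ratio:.1f} cy saved per extra byte")
```

By this particular measure Example 2 buys roughly 107 cy per extra byte and the full unroll about 1 cy per extra byte; a different metric could rank the variants differently.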

DarwinNE · Post by **DarwinNE** » Mon Jun 08, 2020 3:51 pm

Noizer wrote: ↑Mon Jun 08, 2020 12:29 pm I used to think that adding something to the loop would raise the execution time, but in fact it doesn't, for reasons that are not so obvious. See for yourself.

Isn't that an example of loop unrolling?

https://en.wikipedia.org/wiki/Loop_unrolling

Noizer · Post by **Noizer** » Fri Jun 12, 2020 3:58 pm

I can't explain why you couldn't find any example on the web. Maybe lockdown internet bandwidth issues?
See at least here:
https://codebase64.pokefinder.org/doku. ... :speedcode
http://www.cactus.jawnet.pl/attitude/?a ... 7&which=21
Both are for the C64, but you surely know what is good for the VIC-20.
But as already stressed, that approach is off-topic here: the challenge is to not use speedcode.
Unrolling a loop completely is speedcode. If one cleverly reduces the loop count without wasting available RAM, that is not speedcode, it's code optimization.
BTW, my first example can be expanded to save many more cycles with far less effort than spamming the memory with endless LDA blabla, STA blabla.
In the end the performance gain is ~33.5 %, versus ~43.3 % with pure speedcode.
I calculated 2385 cy versus 2048 cy (from chysn's posting), delta ~11 %.
Sorry, no time to post the code segment, but everyone can reproduce this easily.
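The code segment was not posted. One loop shape that reproduces the 2385 cy figure is 16 LDA/STA pairs per pass with 16 passes; whether this is the loop Noizer had in mind is an assumption. The sketch below generates that listing and re-derives the numbers, with the label `loop` and the function name chosen for illustration:

```python
def partial_unroll(pairs=16, src=0xA000, dst=0x1E00, total=256):
    """Build the listing and cycle count for a loop copying `pairs` bytes per pass."""
    passes = total // pairs
    if total % pairs or passes > 128:
        raise ValueError("pairs must divide total and leave at most 128 passes (BPL)")
    lines = [f"      LDY #${passes - 1:02X}"]
    for p in range(pairs):
        label = "loop  " if p == 0 else "      "
        # Pair p handles the block of `passes` bytes starting at offset p * passes.
        lines.append(f"{label}LDA ${src + p * passes:04X},Y")
        lines.append(f"      STA ${dst + p * passes:04X},Y")
    lines += ["      DEY", "      BPL loop"]
    # 2 (LDY) + passes * (9 per pair + 2 DEY + 3 taken BPL) - 1 (last BPL not taken)
    cycles = 2 + passes * (9 * pairs + 5) - 1
    return lines, cycles


lines, cycles = partial_unroll()
print("\n".join(lines[:5]))
print(f"... {cycles} cy, gain over Example 1: {100 * (3585 - cycles) / 3585:.1f} %")
```

With the defaults this yields 35 source lines (101 bytes of code) and 2385 cy, a gain of about 33.5 % over Example 1. The same calculation for the 2048 cy full unroll gives about 42.9 %.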
@chysn: Right, the overall gain function when adding pairs of LDA/STA is a hyperbola. Beyond a certain point, further steps are no longer worth expanding the loop for.
P.S.: Did you forget to post a hyperbola graphic or something else in your last posting?

Mike · Post by **Mike** » Fri Jun 12, 2020 5:01 pm

My usual approach would be to use another, intrinsically faster algorithm (with whatever amounts of loop unrolling I deem sensible, *if* that technique can even be usefully applied), but the example in the OP doesn't naturally lead to that kind of challenge.

chysn · Post by **chysn** » Fri Jun 12, 2020 8:09 pm

Right. I'd be happy to try my hand at this unironically. Optimization for cycles is a bit outside of my comfort zone, so I'd welcome the chance to try it and talk about it. But a small loop like this doesn't offer many opportunities. It's already pretty much optimal.

groepaz · Post by **groepaz** » Sat Jun 13, 2020 8:49 am

Noizer wrote: Unrolling a loop completely is speedcode. If one cleverly reduces the loop count without wasting available RAM, that is not speedcode, it's code optimization.

ugh well. a lot of so-called "speedcode" is still not completely unrolled. it's always a compromise between execution time and available memory.

Noizer · Post by **Noizer** » Sun Jun 14, 2020 12:22 pm

Thank you all out there for the feedback.
I will get back to you at the right place and the right time.
In the meantime, I guess this topic could enjoy many more replies.
