Speeding up execution time without using speedcode

Noizer
Vic 20 Devotee
Posts: 297
Joined: Tue May 15, 2018 12:00 pm
Location: Europa

Post by Noizer »

Dear all,

Sometimes one wants to speed up a piece of code but gets stuck on how to do it.
So here I invite you to participate in and contribute to this special topic, as a sort of notebook for all of us, collecting the best known tricks to speed up the execution time of machine language programs.
My point of view: code that is faster without being much larger is also more efficient code.

The Rules:
- target platform: VIC-20
- machine language only (sorry BASIC fans; the Minimon cartridge from Mike, Vicmon or a similar tool could help you get started)
- no speedcode (unrolled branch loops)
- only 6502 CPU opcodes, normal and extra
- speed gain should be reasonable
- clock cycles in postings should be indicated for every opcode, every code segment, and the complete execution time
- postings are not limited to my first example; new and different code examples are welcome
- feedback mandatory :)

Note 1: In all later examples I will abbreviate "clock cycles" to cy and "raster lines" to rl (1 rl = 71 cy, as on a PAL VIC-20).
Note 2: the examples don't consider:
- whether the code runs with interrupts disabled (SEI) or enabled (CLI)
- delays induced by the G and BREAK commands, or other external delays caused by the assembler or ML monitor in use
---------------------------------------------------------------------------------------------------------------------------------------

Please consider the following code example, which transfers some uncompressed data: here, one page of buffer data to the first half of the unexpanded VIC screen.

I would write it like this, Example 1:

Code: Select all

opcode			cy
--------------------------------------
  LDY #$00		2 +
* LDA $A000,Y		4
  STA $1E00,Y		5
  INY			2
  BNE *			3 / (2 on loop exit)
So, execution time is: 2 + (256 * 14) - 1 = 1 +256*14= 3585 cy, ~50.5 rl :)

Quite a lot for so little.

I used to think that adding something to the loop would increase the execution time, but that is not the case, for a reason that is not so obvious. See for yourself.
Example 2:

Code: Select all

opcode			cy
--------------------------------------
  LDY #$7F		2 +
* LDA $A000,Y		4
  STA $1E00,Y		5
  LDA $A080,Y		4
  STA $1E80,Y		5
  DEY			2
  BPL *			3 / (2 on loop exit)
  
Execution time now: 1 + 128*23 = 2945 cy, ~41.5 rl (9 rl fewer than above)

So we gained 640 cy with little extra effort (+6 bytes) compared to my first example. :wink:

Too bad the limits of this first trick are reached soon: each further doubling of the LDA/STA pairs in the loop raises the cy gain by less than one might expect.
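The diminishing return can be checked with a little arithmetic. The following is only a sketch in Python, not code from the thread: it assumes the cycle counts used in Examples 1 and 2 (LDY # = 2, LDA abs,Y = 4 without page crossing, STA abs,Y = 5, INY/DEY = 2, branch taken = 3, branch not taken = 2), and the names `loop_cycles` and `pairs` are made up for illustration.

```python
# Hypothetical cycle model for the copy loop with `pairs` LDA/STA pairs per pass.
# Assumed timings: LDY #imm = 2, LDA abs,Y = 4 (no page crossing),
# STA abs,Y = 5, INY/DEY = 2, branch taken = 3, branch not taken = 2.
PAGE = 256  # bytes to copy


def loop_cycles(pairs):
    """Total cycles to copy one 256-byte page with `pairs` LDA/STA pairs per pass."""
    if pairs < 1 or PAGE % pairs != 0:
        raise ValueError("pairs must divide 256 evenly")
    passes = PAGE // pairs
    per_pass = pairs * (4 + 5) + 2 + 3   # the pairs, INY/DEY, taken branch
    return 2 + passes * per_pass - 1     # LDY, all passes, last branch falls through


if __name__ == "__main__":
    previous = None
    for pairs in (1, 2, 4, 8, 16):
        total = loop_cycles(pairs)
        note = "" if previous is None else f"  (saves {previous - total} cy over the previous row)"
        print(f"{pairs:2d} pairs: {total} cy{note}")
        previous = total
```

Under these assumptions each doubling saves half as much as the one before (640, 320, 160, 80 cy), because only the 5 cy of loop overhead per pass is being spread over more bytes.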

But the good news is that the percentage speed gain of 17.8 % (640 cy) stays constant if the second block of video RAM is filled too, by expanding the loop with LDA $A100/80,Y / STA $1F00/80 ... Check it yourself.

Who wants to, and can, undercut this?

Done, and bye for now.
Last edited by Noizer on Mon Jun 08, 2020 1:02 pm, edited 1 time in total.
Valid rule today as earlier: 1 Byte = 8 Bits
-._/classes instead of masses\_.-
chysn
Vic 20 Scientist
Posts: 1205
Joined: Tue Oct 22, 2019 12:36 pm
Website: http://www.beigemaze.com
Location: Michigan, USA
Occupation: Software Dev Manager

Post by chysn »

Noizer wrote: Mon Jun 08, 2020 12:29 pm Now the question: Who wants and can undercut this?
I mean... if you're not going to make me write it out, or write a script to generate it, there's always

Code: Select all

lda $a000
sta $1e00
lda $a001
sta $1e01
lda $a002
sta $1e02
;
; yadda yadda yadda
;
lda $a0ff
sta $1eff
for 2048 cycles?

But my medium is unexpanded VIC, so I'm almost always more interested in memory optimization over cycle optimization.
VIC-20 Projects: wAx Assembler, TRBo: Turtle RescueBot, Helix Colony, Sub Med, Trolley Problem, Dungeon of Dance, ZEPTOPOLIS, MIDI KERNAL, The Archivist, Ed for Prophet-5

WIP: MIDIcast BASIC extension

he/him/his
tlr
Vic 20 Nerd
Posts: 567
Joined: Mon Oct 04, 2004 10:53 am

Post by tlr »

For arbitrary data I guess it's not possible to undercut that strategy. Assuming there are sequences in the data, like runs of equal symbols or simple patterns, then you can potentially generate the output faster. Finding the patterns could be automated, as in coding a specialized cruncher.

Post by Noizer »

chysn wrote: Mon Jun 08, 2020 12:47 pm
Noizer wrote: Mon Jun 08, 2020 12:29 pm Now the question: Who wants and can undercut this?
I mean... if you're not going to make me write it out, or write a script to generate it, there's always

Code: Select all

lda $a000
sta $1e00
lda $a001
sta $1e01
lda $a002
sta $1e02
;
; yadda yadda yadda
;
lda $a0ff
sta $1eff
for 2048 cycles?

But my medium is unexpanded VIC, so I'm almost always more interested in memory optimization over cycle optimization.
Hi, what you wrote is called "speedcode". I set out some rules to exclude certain approaches as too trivial, right? :D

Post by chysn »

Noizer wrote: Mon Jun 08, 2020 1:09 pm Hi, what you wrote is called "speedcode", I explained some rules to avoid some approachs, as too trivial, right? :D
I had never heard the term "speedcode" and a Google search didn't shed any light on what it meant. It seems like if your primary goal is speed, then why would you prohibit speedcode?

It might be better to measure performance using some ratio of code size and cycle count. My gut feeling is that the speedcode would be the worst approach, and Example 1 would be the best. Every time I do a machine language project for VIC-20, I reach the point where I'd kill my brother-in-law for six bytes.*

* Hyperbole
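One way to put numbers on the size-versus-speed idea is sketched below in Python. It is not from the thread: the metric (cycles saved per extra byte, relative to Example 1) is just one assumed choice among many possible ratios, the function names are hypothetical, and the byte counts assume LDY # = 2 bytes, LDA/STA absolute or absolute,Y = 3 bytes each, INY/DEY = 1 byte and a branch = 2 bytes.

```python
# Hypothetical size/speed comparison; the metric itself is an assumption.
# Assumed sizes: LDY #imm = 2 bytes, LDA/STA abs or abs,Y = 3 bytes each,
# INY/DEY = 1 byte, branch = 2 bytes. Cycle counts as used in the thread.
PAGE = 256  # bytes to copy


def loop_variant(pairs):
    """Return (bytes, cycles) for the indexed loop with `pairs` LDA/STA pairs."""
    size = 2 + pairs * 6 + 1 + 2
    cycles = 2 + (PAGE // pairs) * (pairs * 9 + 5) - 1
    return size, cycles


def speedcode_variant():
    """Return (bytes, cycles) for 256 absolute LDA/STA pairs with no loop."""
    return PAGE * 6, PAGE * (4 + 4)


def cycles_saved_per_extra_byte(variant, baseline):
    """Cycles gained over the baseline for each byte of additional code."""
    size, cycles = variant
    base_size, base_cycles = baseline
    return (base_cycles - cycles) / (size - base_size)


if __name__ == "__main__":
    baseline = loop_variant(1)  # Example 1
    print(f"Example 1: {baseline[0]} bytes, {baseline[1]} cy")
    candidates = {
        "Example 2 (2 pairs)": loop_variant(2),
        "speedcode": speedcode_variant(),
    }
    for name, variant in candidates.items():
        ratio = cycles_saved_per_extra_byte(variant, baseline)
        print(f"{name}: {variant[0]} bytes, {variant[1]} cy, "
              f"{ratio:.1f} cy saved per extra byte")
```

By this particular measure Example 2 buys roughly 107 cy per extra byte, while the fully unrolled version buys about 1 cy per extra byte and needs 1536 bytes for the copy alone.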
DarwinNE
Vic 20 Devotee
Posts: 231
Joined: Tue Sep 04, 2018 2:40 am
Website: http://davbucci.chez-alice.fr
Location: Grenoble - France

Post by DarwinNE »

Noizer wrote: Mon Jun 08, 2020 12:29 pm I thought earlier if I add something to the loop, execution time will rise, but in fact it's not so, due not so evident, see yourself.
Isn't that an example of loop unrolling?

https://en.wikipedia.org/wiki/Loop_unrolling

Post by Noizer »

I cannot explain why you couldn't find any examples on the web. Maybe lockdown internet bandwidth issues?
See at least here:
https://codebase64.pokefinder.org/doku. ... :speedcode
http://www.cactus.jawnet.pl/attitude/?a ... 7&which=21
Both are for the C64, but you surely know what is good for the VIC-20.
But as already stressed, that approach is off topic here; the challenge is to not use speedcode.
Unrolling a loop completely is speedcode. If one cleverly reduces the loop count without wasting available RAM, that is not speedcode, it's code optimization.
BTW, my first example can be expanded to save many more cycles, with far less effort than spamming memory with endless LDA blabla, STA blabla.
In the end the performance gain is ~33.5%, versus ~43.3% with pure speedcode.
I calculated 2385 cy versus 2048 cy (from chysn's posting), delta ~11%.
Sorry, no time to post the code segment, but everyone can reproduce this easily.
@chysn: Right, the overall gain curve when adding LDA/STA pairs is a hyperbole. Beyond a certain point, expanding the loop further is no longer worth it.
P.S.: Did you forget to post a hyperbole graphic or something else in your last posting? :wink:
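Since the code segment was not posted, here is a rough Python check of which loop shape matches the quoted 2385 cy. This is a sketch under assumptions, not the author's code: it reuses the cycle counts from Examples 1 and 2, only tries power-of-two pair counts, and the helper names are hypothetical.

```python
# Hypothetical check of the quoted figures, using the thread's cycle counts:
# LDY # = 2, LDA abs,Y = 4, STA abs,Y = 5, INY/DEY = 2, branch = 3 (2 on exit).
PAGE = 256
BASELINE = 3585    # Example 1
SPEEDCODE = 2048   # chysn's fully unrolled version


def loop_cycles(pairs):
    """Total cycles for the indexed loop with `pairs` LDA/STA pairs per pass."""
    return 2 + (PAGE // pairs) * (pairs * 9 + 5) - 1


def gain_percent(cycles, baseline=BASELINE):
    """Speed gain relative to the baseline, in percent."""
    return 100.0 * (baseline - cycles) / baseline


def pairs_for(target_cycles):
    """Find the power-of-two pair count whose loop takes exactly target_cycles."""
    for pairs in (1, 2, 4, 8, 16, 32, 64, 128):
        if loop_cycles(pairs) == target_cycles:
            return pairs
    return None


if __name__ == "__main__":
    pairs = pairs_for(2385)
    print(f"2385 cy matches {pairs} LDA/STA pairs run {PAGE // pairs} times")
    print(f"loop gain over Example 1:      {gain_percent(2385):.1f} %")
    print(f"speedcode gain over Example 1: {gain_percent(SPEEDCODE):.1f} %")
```

With these assumptions, 2385 cy corresponds to a loop of 16 LDA/STA pairs executed 16 times (about 101 bytes), a gain of about 33.5 %. The same model puts the speedcode gain nearer 42.9 % than 43.3 %.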
Mike
Herr VC
Posts: 4841
Joined: Wed Dec 01, 2004 1:57 pm
Location: Munich, Germany
Occupation: electrical engineer

Post by Mike »

My usual approach would be to use another, intrinsically faster algorithm (with whatever amounts of loop unrolling I deem sensible, *if* that technique can even be usefully applied), but the example in the OP doesn't naturally lead to that kind of challenge.

Post by chysn »

Right. I'd be happy to try my hand at this unironically. Optimization for cycles is a bit outside of my comfort zone, so I'd welcome the chance to try it and talk about it. But a small loop like this doesn't offer many opportunities. It's already pretty much optimal.
groepaz
Vic 20 Scientist
Posts: 1187
Joined: Wed Aug 25, 2010 5:30 pm

Post by groepaz »

Noizer wrote: Unrolling a loop completely - is speedcode. If one reduce the loop counter clever, without pumping away disponible RAM, this is not speedcode, it's code optimization.
ugh well. a lot of so called "speedcode" is still not completely unrolled. it's always a compromise between execution time and available memory.
I'm just a Software Guy who has no Idea how the Hardware works. Don't listen to me.

Post by Noizer »

Thank you, guys out there, for the feedback.
I will come back to you at the right place and the right time.
In the meantime this topic could use many more replies, I guess.