Multiplication in assembler

Basic and Machine Language

Moderator: Moderators

Kakemoms
Vic 20 Nerd
Posts: 740
Joined: Sun Feb 15, 2015 8:45 am

Multiplication in assembler

Post by Kakemoms »

Hi

I am trying to think of the most efficient way to multiply a number in 6502. Its obvious that x*y can be done by summing (128*x7+64*x6+32*x5+16*x4+8*x3+4*x2+2*x1+1*x0)*y in which x7-x0 are the respective bits of number of x (e.g. 1 or 0). Rol and Asl will permutate the y values, e.g. 128*1*y can be computed by 7*rol (or in practice 1 ror and result into high bit+bit 7 low bit). Well, you see were I get to.

This has probably been done a thousand times already so maybe any of you have a good link to some optimized code?

For multiplication of z by 400 I found that the code below was pretty efficient (well, a table is faster but also memory consuming):

Code: Select all

zval=mult400+1
zscr    word 0          ; result here

mult400 lda     #0      ; put z value here before calling
        rol     a
        rol     a
        rol     a
        rol     a       ; swap lower 4->higher 4
        tay
        rol     a       ; get carry
        tax
        tya
        and     #$f0    ; lower 4 bits
        sta     zs1+1   ; store low byte
        lda     zval
        and     #$1     ; load 128*z into x
        rol             ; bit 0 to carry
        rol             ; carry to bit 7 low byte, clear carry
zs1     adc     #0      ; add low byte, overflow to carry

        txa
        and     #$f     ; higher 4 bits->high byte
        adc     #$0     ; add carry
        sta     zs2+1   ; store high byte

        lda     zval
        and     #$fe    ;load 128*z into a (high byte)
        lsr     a       ;divide by 2 & clear carry
        adc     zval    ; add 256*z to high byte
zs2     adc     #0
        sta     zscr+1  ; store high byte
        rts
        
User avatar
srowe
Vic 20 Scientist
Posts: 1325
Joined: Mon Jun 16, 2014 3:19 pm

Re: Multiplication in assembler

Post by srowe »

There are some multiplication and division routines on 6502.org

http://www.6502.org/source/
groepaz
Vic 20 Scientist
Posts: 1180
Joined: Wed Aug 25, 2010 5:30 pm

Re: Multiplication in assembler

Post by groepaz »

there is also some good stuff on codebase: http://codebase64.org/doku.php?id=base:6502_6510_maths
I'm just a Software Guy who has no Idea how the Hardware works. Don't listen to me.
Kakemoms
Vic 20 Nerd
Posts: 740
Joined: Sun Feb 15, 2015 8:45 am

Re: Multiplication in assembler

Post by Kakemoms »

groepaz wrote:there is also some good stuff on codebase: http://codebase64.org/doku.php?id=base:6502_6510_maths
I found this one, but all multiplications are essencially loops with adc. Not very fast.
Kakemoms
Vic 20 Nerd
Posts: 740
Joined: Sun Feb 15, 2015 8:45 am

Re: Multiplication in assembler

Post by Kakemoms »

groepaz wrote:there is also some good stuff on codebase: http://codebase64.org/doku.php?id=base:6502_6510_maths
That looks more like it. The 32-bit multiplication was interesting.

While plundering around I made a 4-bit*4-bit multiplication without loops. Haven't tested it yet so probably buggy:

Code: Select all

; multiply two 4-bit numbers
; x*y in which x3-x0 indicates respective bits3-0.
; x3*y+x2*y+x1*y+x0*y=a
; efficient as it does not need to concern with overflow during asl operation

	txa
	and	#$8	; bit3 of x
	bne	bit3	; bit3 was not set (and a=0)
	tya		; get a
	asl		; multiply by 8
	asl
	asl
bit3	sta	rs2+1	; store y*8 in rs2
	txa
	and	#$4	; bit2 of x
	bne	bit2	; bit2 was not set (and a=0)
	tya		; get a
	asl		; multiply by 4
	asl
bit2	clc
rs2	adc	#0	; add with y*8 result
	sta	rs1+1	; store y*8+y*4
	txa
	and	#$2	; bit1 of x
	bne	bit1	; bit1 was not set (and a=0)
	tya		; multiply by 2
	asl
bit1	clc
rs1	adc	#0	; add with previous result
	sty	rs0+1	; save y
	tay		; store y*8+y*4+y*2 in y
	txa
	and	#$1	; bit0 of x
	bne	bit0	; bit1 was not set (and a=0)	
	clc
	tya
rs0	adc	#0
	tay
bit0	rts		;result of x*y in y
User avatar
pixel
Vic 20 Scientist
Posts: 1329
Joined: Fri Feb 28, 2014 3:56 am
Website: http://hugbox.org/
Location: Berlin, Germany
Occupation: Pan–galactic shaman

Re: Multiplication in assembler

Post by pixel »

I like doodling around. :mrgreen:

Code: Select all

	txa
	asl
	asl
	asl
	asl
	sty tmp
	ora tmp
	tax
	lda fourbyfourtable_lo,x
	ldy fourbyfourtable_hi,x
A man without talent or ambition is most easily pleased. Others set his path and he is content.
https://github.com/SvenMichaelKlose
User avatar
pixel
Vic 20 Scientist
Posts: 1329
Joined: Fri Feb 28, 2014 3:56 am
Website: http://hugbox.org/
Location: Berlin, Germany
Occupation: Pan–galactic shaman

Re: Multiplication in assembler

Post by pixel »

Oh, no!

Code: Select all

	lda xshiftedbyfour,x
	ora numbersasindex,y
	tax
	lda fourbyfourtable_lo,x
	ldy fourbyfourtable_hi,x
A man without talent or ambition is most easily pleased. Others set his path and he is content.
https://github.com/SvenMichaelKlose
User avatar
Mike
Herr VC
Posts: 4816
Joined: Wed Dec 01, 2004 1:57 pm
Location: Munich, Germany
Occupation: electrical engineer

Re: Multiplication in assembler

Post by Mike »

15 x 15 = 225 < 256

Code: Select all

TYA
ORA x_shifted_left_by_four,X
TAX
LDA four_by_four_table,X
Any questions?
Kakemoms
Vic 20 Nerd
Posts: 740
Joined: Sun Feb 15, 2015 8:45 am

Re: Multiplication in assembler

Post by Kakemoms »

Kakemoms wrote:
groepaz wrote:there is also some good stuff on codebase: http://codebase64.org/doku.php?id=base:6502_6510_maths
That looks more like it. The 32-bit multiplication was interesting.

While plundering around I made a 4-bit*4-bit multiplication without loops. Haven't tested it yet so probably buggy:

Code: Select all

; multiply two 4-bit numbers
; x*y in which x3-x0 indicates respective bits3-0.
; x3*y+x2*y+x1*y+x0*y=a
; efficient as it does not need to concern with overflow during asl operation

	txa
	and	#$8	; bit3 of x
	bne	bit3	; bit3 was not set (and a=0)
	tya		; get a
	asl		; multiply by 8
	asl
	asl
bit3	sta	rs2+1	; store y*8 in rs2
	txa
	and	#$4	; bit2 of x
	bne	bit2	; bit2 was not set (and a=0)
	tya		; get a
	asl		; multiply by 4
	asl
bit2	clc
rs2	adc	#0	; add with y*8 result
	sta	rs1+1	; store y*8+y*4
	txa
	and	#$2	; bit1 of x
	bne	bit1	; bit1 was not set (and a=0)
	tya		; multiply by 2
	asl
bit1	clc
rs1	adc	#0	; add with previous result
	sty	rs0+1	; save y
	tay		; store y*8+y*4+y*2 in y
	txa
	and	#$1	; bit0 of x
	bne	bit0	; bit1 was not set (and a=0)	
	clc
	tya
rs0	adc	#0
	tay
bit0	rts		;result of x*y in y
Ok, assuming the first part works (after unceremonously not debugging or testing anything).. I droodled the rest to get a 8bitx8bit multiplication

Code: Select all

; multiply 8bit*8bit=x*y
; lower4bits=l4b
; higher4bits=h4b
; utilize x=l4bx+16*h4bx and y=l4by+16*h4by
; x*y=14bx*14by+l4bx*16*h4by+16*h4bx*l4by+16*16*h4bx*h4by

; uses Vic20 memory $3-$6 for fast zero page storage (unused by kernal/basic)
; and $92-$93 for storing result (only used during tape operation)

	txa
	and	#$f
	sta	$3	;l4bx
	txa
	lsr	a
	lsr	a
	lsr	a
	lsr	a
	sta	$4	;h4bx
	tya
	and	#$f
	sta	$5	;l4by
	tya
	lsr
	lsr
	lsr
	lsr
	sta	$6	;h4by

; all 4bit parts divided into byte 3-6, so start multiplying

	ldx	$3
	ldy	$6
	jsr	mult44	;l4bx*h4by
	sty	$92	;temp store

	ldx	$4
	lda	$5
	jsr	mult44	;h4bx*l4by
	tya
	clc
	adc	$92	;add temp store, carry on overflow
	lda	$92
	and	#$f0	;4 highest bits
	ror		;shift right and put carry in bit7
	lsr
	lsr
	lsr		;all bits in bit4-bit0
	sta	$93	;high byte result (e.g. *16)
	lda	$92
	and	#$f
	asl
	asl
	asl
	asl		;*16
	sta	$92	;low byte result

	ldx	$3
	ldy	$5
	jsr	mult44	;l4bx*l4by
	tya
	adc	$92	;add low byte
	lda	#0
	adc	$93

	lda	$6
	tay
	lda	$4
	tax
	jsr	mult44	;h4bx*h4by*16*16
	tya
	adc	$93	;result high byte
	rts

Only thing I promize is that it probably doesn't work yet. :P
User avatar
pixel
Vic 20 Scientist
Posts: 1329
Joined: Fri Feb 28, 2014 3:56 am
Website: http://hugbox.org/
Location: Berlin, Germany
Occupation: Pan–galactic shaman

Re: Multiplication in assembler

Post by pixel »

Code: Select all

	stx selfmod+2
selfmod:
	lda $ff00,y
Just kidding… :wink:
A man without talent or ambition is most easily pleased. Others set his path and he is content.
https://github.com/SvenMichaelKlose
User avatar
Mike
Herr VC
Posts: 4816
Joined: Wed Dec 01, 2004 1:57 pm
Location: Munich, Germany
Occupation: electrical engineer

Re: Multiplication in assembler

Post by Mike »

If the multiplication forms a time-critical part of a bigger program, and you can afford the memory, you are well-advised to throw tables against the problem. Otherwise any kind of shift-and-add algorithm is probably sufficient.

For my demo vector.prg, I used the formula 4*X*Y = (X+Y)² - (X-Y)² to build a 8x8 -> 16 multiply-add, and synthesized a 16x16 -> 32 multiply from 4 of those. Each of those 8x8 -> 16 mul-adds needs 67..72 cycles, and 'feeds' from four 256 byte tables.

For an implementation of a ellipse draw algorithm (featured in MAXIGRAFIK), I factored out all multiplications out of the inner loop. They all were of the type 8x24 -> 32, and shift-and-add was all that was needed.

For address-calculations with a limited number of indices I just use tables with low- and high-bytes. Easiest to implement and less error-prone. :mrgreen:

But then I suppose you already knew all that.
Kakemoms
Vic 20 Nerd
Posts: 740
Joined: Sun Feb 15, 2015 8:45 am

Re: Multiplication in assembler

Post by Kakemoms »

Mike wrote:15 x 15 = 225 < 256

Code: Select all

TYA
ORA x_shifted_left_by_four,X
TAX
LDA four_by_four_table,X
Any questions?
I guess I would have used page 1,2 or 3 for that table and added the 8bitx8bit code to get full byte multiplication?
Kakemoms
Vic 20 Nerd
Posts: 740
Joined: Sun Feb 15, 2015 8:45 am

Re: Multiplication in assembler

Post by Kakemoms »

Mike wrote:If the multiplication forms a time-critical part of a bigger program, and you can afford the memory, you are well-advised to throw tables against the problem. Otherwise any kind of shift-and-add algorithm is probably sufficient.

For my demo vector.prg, I used the formula 4*X*Y = (X+Y)² - (X-Y)² to build a 8x8 -> 16 multiply-add, and synthesized a 16x16 -> 32 multiply from 4 of those. Each of those 8x8 -> 16 mul-adds needs 67..72 cycles, and 'feeds' from four 256 byte tables.

For an implementation of a ellipse draw algorithm (featured in MAXIGRAFIK), I factored out all multiplications out of the inner loop. They all were of the type 8x24 -> 32, and shift-and-add was all that was needed.

For address-calculations with a limited number of indices I just use tables with low- and high-bytes. Easiest to implement and less error-prone. :mrgreen:

But then I suppose you already knew all that.
Yep. For an unexpanded Vic20 you probably want to keep it thight. My 4x4bit is bout 52bytes while the 8x8bit adds another 91bytes.

Now, a 16x16 table of multiplied numbers would be symmetrical since x*y=y*x, so using some tricks it could be compressed to about 16x8+8 bytes. E.g. the row of x=1 would not be needed (except for 1x1) as the answer is in the y=1 column if one swap x&y:

Code: Select all


	sty	t1+1
t1	cpx	#00
	bcs	t3	;x>y
	txa		;y=<x
	asl
	asl
	asl
	asl
	sta	t2+1
t2	lda	$0100,y
	rts
t3	tya
	asl
	asl
	asl
	asl
	sta	t4+1
t4	lda	$0100,x
	rts	

The problem is then to compress the data as x=1 with y=2,3,4...16 will not be used, but will need to be there if you use a row/column way of storing the results.
User avatar
Mike
Herr VC
Posts: 4816
Joined: Wed Dec 01, 2004 1:57 pm
Location: Munich, Germany
Occupation: electrical engineer

Re: Multiplication in assembler

Post by Mike »

That extra logic to access a compressed variant of the 4x4 table is probably slower than this table-less implementation:

Code: Select all

 ; A := X*Y, 4x4 -> 8 multiplication (top 4 bits of Y assumed 0)
 STY sum
 TXA
 ASL
 ASL
 ASL
 ASL
 LDX #4
.loop
 ASL 
 BCC skip
 CLC
 ADC sum
.skip
 DEX
 BNE loop
Of course, it's also possible to unroll the loop body ASL/BCC/CLC/ADC four times.

Furthermore, the CLC in the inner loop can be eliminated, if Y is decremented by 1 before it is stored in 'sum'.
Kakemoms
Vic 20 Nerd
Posts: 740
Joined: Sun Feb 15, 2015 8:45 am

Re: Multiplication in assembler

Post by Kakemoms »

Mike wrote:That extra logic to access a compressed variant of the 4x4 table is probably slower than this table-less implementation:

Code: Select all

 ; A := X*Y, 4x4 -> 8 multiplication (top 4 bits of Y assumed 0)
 STY sum
 TXA
 ASL
 ASL
 ASL
 ASL
 LDX #4
.loop
 ASL 
 BCC skip
 CLC
 ADC sum
.skip
 DEX
 BNE loop
Of course, it's also possible to unroll the loop body ASL/BCC/CLC/ADC four times.

Furthermore, the CLC in the inner loop can be eliminated, if Y is decremented by 1 before it is stored in 'sum'.
That was neat! Untangling it and removing clc only gives 7 more instructions for a further reduction of 7*4-2=26 cycles. Putting "sum" on zeropage (in $96 or alike) gives a running time of maximum 43 cycles (39 average) while compiling to only 29 bytes.

Of course.. a table would only use 12 cycles but 225 bytes.
Post Reply