Multiplication in assembler

Basic and Machine Language

Moderator: Moderators

Kakemoms
Vic 20 Nerd
Posts: 740
Joined: Sun Feb 15, 2015 8:45 am

Re: Multiplication in assembler

Post by Kakemoms »

I condensed the 8x8bit multiplication a little and included a init routine to make the 4x4bit table using the 4x4 multiplication routine. Total 171 bytes. You may want to tangle the loop there to save some bytes and optimize it a little further (or maybe just include the table). I haven't tested it yet, so probably buggy.

Code: Select all

tab44=$100      ; use a page of mem somewhere (may want to move this)

initmult         ; set up 15x15 multiplication table
        ldx     #$f
im1     ldy     #$f
im2     jsr     mult44
	sta	$96
        iny
        txa
        asl
        asl
        asl
        asl
        sta     im3+1
        lda     $96
im3     sta     tab44,y
        dey
        bne     im2
        dex
        bne     im1
        rts


mult44  ; Mike's 4x4 multiplication code
        DEY            ; 2 To counteract carry add: %1111->%1110 + C
        STY $96        ; 3
        TXA            ; 2
        ASL            ; 2
        ASL            ; 2
        ASL            ; 2
        ASL            ; 2 %XXXX0000
        ASL            ; 2 %XXX00000
        BCC rm0        ; 3
        ADC $96        ; 3 %XXX01111
rm0     ASL            ; 2 %XX011110
        BCC rm1        ; 3
        ADC $96        ; 3 %XX011110+1110+c=%11101101
rm1     ASL            ; 2 %X1011010
        BCC rm2        ; 3
        ADC $96        ; 3 %X1011010+1110+c=%X1101001
rm2     ASL            ; 2 %11010010
        BCC rm3        ; 3
        ADC $96        ; 3 %11010010+1110+c=%11100001=225=15*15
rm3     RTS            ; 2


mult44s                   ; a OR y is already shifted to top 4 bits 
        sta     m4s+1
m4s     lda     tab44,y
        rts

mult44u                 ; have to shift a
        asl
        asl
        asl
        asl
        sta     m4u+1
m4u     lda     tab44,y
        rts


mult88f
        txa
        and     #$f
        sta     $3      ;l4bx
        stx     $4      ;h4bx
        tya
        and     #$f
        sta     $5      ;l4by
        sty     $6      ;h4by

        ldy     $3
        lda     $6
        and     #$f0    ;already shifted
        jsr     mult44s  ;l4bx*h4by
        sty     $92     ;temp store
        lda     $4
        and     #$f0    ;already shifted
        ldy     $5
        jsr     mult44s  ;h4bx*l4by
        tya
        clc
        adc     $92     ;add temp store, carry on overflow
        sta     $92
        and     #$f0    ;4 highest bits
        ror             ;shift right and put carry in bit7
        lsr
        lsr
        lsr             ;all bits in bit4-bit0
        sta     $93     ;high byte result (e.g. *16)
        lda     $92
        and     #$f
        asl
        asl
        asl
        asl             ;*16
        sta     $92     ;low byte result
        lda     $3
        ldy     $5
        jsr     mult44u  ;l4bx*l4by
        tya
        adc     $92     ;add low byte
        sta     $92
        lda     $93
        adc     #0
        sta     $93
        lda     $6
        and     #$f0
        tay
        lda     $4
        lsr
        lsr
        lsr
        lsr
        jsr     mult44s  ;h4bx*h4by*16*16
        tya
        adc     $93     ;result high byte
        sta     $93
        rts

Last edited by Kakemoms on Sun Aug 30, 2015 3:21 am, edited 1 time in total.
User avatar
Mike
Herr VC
Posts: 4816
Joined: Wed Dec 01, 2004 1:57 pm
Location: Munich, Germany
Occupation: electrical engineer

Re: Multiplication in assembler

Post by Mike »

Of course sum should go into zeropage. However, my maths concerning the two table-less and the one table-based implementation of 4x4 are slightly different (average run-time given):

Code: Select all

 ; A := X*Y, 4x4 -> 8 multiplication (top 4 bits of Y assumed 0)
 STY sum       ;    3
 TXA           ;    2
 ASL           ;    2
 ASL           ;    2
 ASL           ;    2
 ASL           ;    2
 LDX #4        ;    2
.loop
 ASL           ;    8 = 4x 2
 BCC skip      ;   10 = 4x(0.5x3 + 0.5x2)
 CLC           ;    4 = 4x(0.5x0 + 0.5x2)
 ADC sum       ;    6 = 4x(0.5x0 + 0.5x3)
.skip
 DEX           ;    8 = 4x 2
 BNE loop      ;   11 = 3x 3 + 1x 2
----------------------------------------
                   62 cycles / 18 bytes


unrolled and with DEY+removed CLC:

 DEY           ;    2
 STY sum       ;    3
 TXA           ;    2
 ASL           ;    2
 ASL           ;    2
 ASL           ;    2
 ASL           ;    2
 ASL           ;    2
 BCC skp1      ;    2.5 = 0.5x3 + 0.5x2
 ADC sum       ;    1.5 = 0.5x0 + 0.5x3
.skp1
 ASL           ;    2
 BCC skp2      ;    2.5 = 0.5x3 + 0.5x2
 ADC sum       ;    1.5 = 0.5x0 + 0.5x3
.skp2
 ASL           ;    2
 BCC skp3      ;    2.5 = 0.5x3 + 0.5x2
 ADC sum       ;    1.5 = 0.5x0 + 0.5x3
.skp3
 ASL           ;    2
 BCC skp4      ;    2.5 = 0.5x3 + 0.5x2
 ADC sum       ;    1.5 = 0.5x0 + 0.5x3
.skp4
---------------------------------------
                   39 cycles / 28 bytes


table based:

 TYA           ;    2
 ORA table1,X  ;    4*
 TAX           ;    2
 LDA table2,X  ;    4*
---------------------------------------
                   12 cycles
                    8 bytes code
             256 + 16 bytes data
My fastest 8 x 8 + 8 -> 16 multiply-add weighs in at 67..72 cycles, 66 bytes code and 1024 bytes table data.

A slightly slower variant I implemented, that actually also builds the 8 x 8 from 4 x 4 multiplies goes with 137..141 cycles, 95 bytes code and 768 bytes table data (256 bytes for 4x4 multiply, 2x 256 bytes for shift tables). The main issue with implementing the 8x8 multiply from four 4x4 multiplies is, that it requires a lot of housekeeping for the extraction of factors and the processing of intermediary results, and this slows down the whole construct even though LUTs are used. :(

And a table-less 8x8 with shift-add?

Code: Select all

hi:lo = xx*yy + 00:lo ; adr(hi)=adr(xx)

 LDA lo     ;   2
 PHA        ;   3
--------------------
 LDA #0     ;   2
 STA lo     ;   3
 LDX #8     ;   2
.loop
 ASL lo     ;  40 = 8x 5
 ROL hi     ;  40 = 8x 5
 BCC skip   ;  20 = 8x(0.5x3 + 0.5x2)
 CLC        ;   8 = 8x(0.5x0 + 0.5x2)
 LDA lo     ;  12 = 8x(0.5x0 + 0.5x3)
 ADC yy     ;  12 = 8x(0.5x0 + 0.5x3)
 STA lo     ;  12 = 8x(0.5x0 + 0.5x3)
 BCC skip   ;  10 = 8x(0.5x0 + 0.5x(0.5x3 + 0.5x2))
 INC hi     ;  10 = 8x(0.5x0 + 0.5x(0.5x0 + 0.5x5))
.skip
 DEX        ;  16 = 8x 2
 BNE loop   ;  23 = 7x 3 + 1x 2
--------------------
 PLA        ;   4
 CLC        ;   2
 ADC lo     ;   3
 STA lo     ;   3
 BCC end    ;   2.5 = (0.5x3 + 0.5x2)
 INC hi     ;   2.5 = (0.5x0 + 0.5x5)
.end

multiply:       210 cycles / 26 bytes
multiply-add:   232 cycles / 39 bytes
Not much slower than a 8x8 multiply built from four 4x4 multiplies, but *much* smaller ... so, take your pick.
Kakemoms
Vic 20 Nerd
Posts: 740
Joined: Sun Feb 15, 2015 8:45 am

Re: Multiplication in assembler

Post by Kakemoms »

Mike wrote:

Code: Select all

multiply:       210 cycles / 26 bytes
multiply-add:   232 cycles / 39 bytes
Not much slower than a 8x8 multiply built from four 4x4 multiplies, but *much* smaller ... so, take your pick.
Ok, I condensed it a little using the table functions directly into the 8x8 multiplier. Got rid of 4*jsr+4*rts=12*4=48 cycles.

Code: Select all

mult88f
        txa             ;2
        and     #$f     ;2
        sta     m4s1+1  ;4
        asl             ;2
        asl             ;2
        asl             ;2
        asl             ;2
        sta     m4u1+1  ;4
        txa             ;2 
        and     #$f0    ;2 
        sta     m4s2+1  ;4
        lsr             ;2
        lsr             ;2
        lsr             ;2
        lsr             ;2 (c=0)
        sta     m4s3+1  ;4 
        tya             ;2
        and     #$f0    ;2
        tax             ;2
        tya             ;2
        and     #$f     ;2
        tay
m4s1    lda     tab44,x ;5
m4s2    adc     tab44,y ;5 h4bx*l4by - c=0
        sta     $92     ;3
        ror             ;2 shift right and put carry in bit7
        lsr             ;2
        lsr             ;2
        lsr             ;2 all bits in bit4-bit0
        sta     $93     ;3 high byte result (e.g. *16)
        lda     $92     ;3
        and     #$f     ;2
        asl             ;2
        asl             ;2
        asl             ;2
        asl             ;2 *16
m4u1    adc     tab44,y ;5 l4bx*l4by
        sta     $92     ;3
m4s3    lda     tab44,x ;5 h4bx*h4by*16*16
        adc     $93     ;3 result high byte, add overflow carry
        sta     $93     ;3

Edit:
166 cycles and 106 bytes + 256byte table. Not half as good as yours yet..
Edit2:
154 cycles and 98 bytes +256byte table. I also compiled and tested it and it works..
Edit3&4:
106 cycles and 68 bytes +256byte table. Still working...

I can also add shift tables a 256x2 byte table. I save about 12 cycles on that.
Kakemoms
Vic 20 Nerd
Posts: 740
Joined: Sun Feb 15, 2015 8:45 am

Re: Multiplication in assembler

Post by Kakemoms »

Kakemoms wrote:I condensed the 8x8bit multiplication a little and included a init routine to make the 4x4bit table using the 4x4 multiplication routine. Total 171 bytes. You may want to tangle the loop there to save some bytes and optimize it a little further (or maybe just include the table). I haven't tested it yet, so probably buggy.

(code see above)
I just found an error in the 4x4 table init routine. I am not sure what caused it, but I managed to make a much shorter routine:

Code: Select all

initmult
         ; set up 15x15 multiplication table
        ldx     #$0     ; multiplier
        ldy     #$0     ; index

iml0    tya
        clc
        adc     #16
        sta     imc1+1
        txa             ; start value=multiplier
        sta     $96     ; add multiplier to A each round
iml1    sta     tab44,y
        adc     $96
        iny
imc1    cpy     #00
        bne     iml1
        inx
        cpx     #16
        bne     iml0
        rts
Post Reply