One of the instructions the Cortex-M0 lost compared to the Cortex-M3 core is SMULL, the signed 32×32-bit multiply with a 64-bit result. This instruction is extremely helpful when you want to do fixed-point arithmetic with more than 16 bits. Compilers typically emulate it in software, so you can still write:

```c
#include <stdint.h>

int fixed_point_multiply(int x, int y) {
    return ((int64_t) x * (int64_t) y) >> 16;
}
```
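To see why the 64-bit intermediate matters, here is a small sketch that assumes the common Q16.16 convention (16 integer bits, 16 fractional bits). The text only implies that convention through the shift by 16, and `q16_demo` is a hypothetical helper added for illustration.

```c
#include <stdint.h>

/* Repeats the function from above so the example is self-contained. */
int fixed_point_multiply(int x, int y)
{
    return (int)(((int64_t) x * (int64_t) y) >> 16);
}

/* Hypothetical demo: 1.5 * 2.25 in the assumed Q16.16 format. */
int q16_demo(void)
{
    int a = 3 * 65536 / 2;   /* 1.5  -> 0x00018000 */
    int b = 9 * 65536 / 4;   /* 2.25 -> 0x00024000 */

    /* The raw product a * b is roughly 1.4e10, which does not fit in
       32 bits; only the 64-bit intermediate keeps it intact. */
    return fixed_point_multiply(a, b);   /* 3.375 -> 0x00036000 */
}
```

Even these small operands overflow a plain 32-bit multiply, which is exactly the case SMULL (or its emulation) covers.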

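To implement SMULL's function, `int64_t z = (int64_t) x * y`, we need to add the individual halfword products. Split each operand into halfwords, `x = [a,b]` and `y = [c,d]`, where `a` and `c` are the upper (signed) halfwords and `b` and `d` the lower (unsigned) ones. Let `~` denote a sign-extended halfword and `0` a zeroed halfword; then the multiplication `x` × `y`, or `[a,b] * [c,d]`, can be calculated as `00[b*d] + ~[~a*d]0 + ~[~c*b]0 + [~a*~c]00`.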
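The decomposition can be checked in plain C before writing any assembly. The following is a minimal sketch, not the routine itself: `smull_by_parts` is a hypothetical name, and it assumes that `>>` on a negative `int32_t` is an arithmetic shift, as it is on GCC and Clang.

```c
#include <stdint.h>

/* Hypothetical helper: rebuilds the 64-bit product from the four
   halfword partial products named in the text. */
int64_t smull_by_parts(int32_t x, int32_t y)
{
    int32_t  a = x >> 16;                  /* ~a: upper halfword, sign extended
                                              (assumes arithmetic shift) */
    uint32_t b = (uint32_t) x & 0xFFFFu;   /* 0b: lower halfword, zero extended */
    int32_t  c = y >> 16;                  /* ~c */
    uint32_t d = (uint32_t) y & 0xFFFFu;   /* 0d */

    /* Multiplying by powers of two avoids left-shifting negative values. */
    int64_t bd = (int64_t) (b * d);                        /* 00[b*d]    */
    int64_t ad = (int64_t) a * (int32_t) d * 65536;        /* ~[~a*d]0   */
    int64_t cb = (int64_t) c * (int32_t) b * 65536;        /* ~[~c*b]0   */
    int64_t ac = (int64_t) a * c * 4294967296LL;           /* [~a*~c]00  */

    return bd + ad + cb + ac;
}
```

For any pair of 32-bit inputs the result should equal `(int64_t) x * y`; the signed upper halfwords carry the sign, while the lower halfwords are treated as unsigned.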

To do this efficiently in assembly, we need to take into account that we only have 8 registers to work with, and that we have a carry to transport between the lower and upper word. So we will start with calculating the middle terms, and add the remaining ones at the end. The following code takes its parameters in `r0` and `r1`, and returns the product in `r1:r0`.

```
    PUSH    {r4-r7}
    ASRS    r4, r0, #16     ; r4 = [~a]
    UXTH    r5, r0          ; r5 = [0b]
    ASRS    r6, r1, #16     ; r6 = [~c]
    UXTH    r7, r1          ; r7 = [0d]

    MOV     r0, r4
    MULS    r0, r7, r0
    ASRS    r1, r0, #16
    LSLS    r0, r0, #16     ; r1:r0 = ~[~a*0d]0

    MOV     r2, r6
    MULS    r2, r5, r2
    ASRS    r3, r2, #16
    LSLS    r2, r2, #16     ; r3:r2 = ~[~c*0b]0

    ADDS    r0, r0, r2
    ADCS    r1, r1, r3      ; r1:r0 = ~[~a*0d]0 + ~[~c*0b]0

    MULS    r5, r7, r5
    MOVS    r7, #0
    ADDS    r0, r0, r5
    ADCS    r1, r1, r7      ; r1:r0 += 00[0b*0d]

    MULS    r4, r6, r4
    ADDS    r1, r1, r4      ; r1:r0 += [~a*~c]00

    POP     {r4-r7}
    BX      lr
```
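To make the carry handling easier to follow, here is a rough C model of the routine that uses only 32-bit words and tracks the carry by hand. It is a sketch for reasoning about the assembly, not a replacement for it: `smull_m0_model` is a hypothetical name, the local variables merely mirror the registers, and the code assumes arithmetic right shift of negative values and two's complement conversion, as on GCC and Clang.

```c
#include <stdint.h>

/* Hypothetical 32-bit model of the assembly routine; lo/hi play the
   role of r0/r1. */
int64_t smull_m0_model(int32_t x, int32_t y)
{
    int32_t  r4 = x >> 16;                  /* ASRS: [~a] (assumes arithmetic shift) */
    uint32_t r5 = (uint32_t) x & 0xFFFFu;   /* UXTH: [0b] */
    int32_t  r6 = y >> 16;                  /* ASRS: [~c] */
    uint32_t r7 = (uint32_t) y & 0xFFFFu;   /* UXTH: [0d] */

    /* Middle term ~a*d: the product fits in 32 bits, then it is
       spread over two words shifted left by 16. */
    int32_t  ad = r4 * (int32_t) r7;
    uint32_t hi = (uint32_t) (ad >> 16);
    uint32_t lo = (uint32_t) ad << 16;

    /* Middle term ~c*b, handled the same way. */
    int32_t  cb  = r6 * (int32_t) r5;
    uint32_t hi2 = (uint32_t) (cb >> 16);
    uint32_t lo2 = (uint32_t) cb << 16;

    /* ADDS / ADCS: add the low words, carry into the high words. */
    uint32_t sum   = lo + lo2;
    uint32_t carry = sum < lo2;
    lo = sum;
    hi = hi + hi2 + carry;

    /* Low term b*d: unsigned, affects the high word only via the carry. */
    uint32_t bd = r5 * r7;
    sum   = lo + bd;
    carry = sum < bd;
    lo = sum;
    hi += carry;

    /* High term ~a*~c lands entirely in the upper word. */
    hi += (uint32_t) (r4 * r6);

    return (int64_t) (((uint64_t) hi << 32) | lo);
}
```

The upper word is allowed to wrap modulo 2^32 along the way; since the true product always fits in 64 signed bits, the final value of `hi` is still correct.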