Division of high numbers in assembly - optimization

I am trying to find the location of a cursor within an array of 1896 cells (representing the whole console in 2D, 79*24). For this I took the location and divided it by 79.
MOV AX, [Y-16H]
DEC AX
MOV BX, 79
DIV BX
MOV Z, DL
ADD Z, DH
MOV DL, Z
MOV Z, AL
ADD Z, AH
MOV DH, Z
I get an overflow error. Can you tell me what I am doing wrong, please? Maybe suggest a solution?

DIV BX divides the 32-bit number formed by DX (high word) and AX (low word) by BX. You therefore need to clear DX (e.g. XOR DX,DX) prior to the division to avoid an overflow.
By the way, are you sure you don't want to divide by 80? I've never heard of a 79-column console, although I'm no expert on such matters.

As Michael mentioned, you need to clear the DX register prior to division.
That said, if you are interested in speed (the usual reason for assembly coding), it is much faster to transform the division by 79 into an equivalent operation using a multiplication and a right shift: (x * 53093) >> 22.
This works because 1/79th is approximately equal to 53093 / (2**22).


STM32 Gyroscope angle tracking

I'm working with a gyroscope (L3GD20) with a 2000 DPS full-scale range.
Correct me if there is a mistake:
I start by reading the High and Low register values for the 3 axes and concatenating them. Then I multiply every value by 0.07 to convert it into DPS.
My main goal is to track the angle over time, so I simply implemented a timer which reads the data every dt = 10 ms
and integrates ValueInDPS * 10 ms. Here is the code line I'm using:
angleX += (resultGyroX)*dt*0.001; //0.001 to get dt in [seconds]
This should give us the value of the angle in [degrees], am I right?
The problem is that the values I'm getting are a bit weird; for example, when I make a rotation of 90°, I get something like 70°...
Your method is a recipe for imprecision and accumulated error.
You should avoid using floating point (especially if there is no FPU), and especially if this code is in the timer interrupt handler.
You should also avoid unnecessarily converting to degrees/sec on every sample - that conversion is needed only for presentation, so perform it only when you actually need the value - internally the integrator should work in raw gyro sample units.
Additionally, if you are doing floating point in both an ISR and a normal thread and you have an FPU, you may also encounter unrelated errors, because FPU registers are not necessarily preserved and restored by an interrupt handler. All in all, floating point should only be used advisedly.
So let us assume you have a function gyroIntegrate() called precisely every 10ms:
static int32_t ax = 0 ;
static int32_t ay = 0 ;
static int32_t az = 0 ;

void gyroIntegrate( int32_t sample_x, int32_t sample_y, int32_t sample_z )
{
ax += sample_x ;
ay += sample_y ;
az += sample_z ;
}
Note: ax etc. are the integration of the raw sample values and so proportional to the angle relative to the starting position.
To convert ax to degrees:
degrees = (ax × r) / s
Where:
r is the gyro resolution in degrees per second per digit (0.07)
s is the sample rate in samples per second (100).
Now you would do well to avoid floating point, and here it is entirely unnecessary; s / r is a constant (1428.571 in this case). So to read the current angle represented by the integrator, you might have a function:
#define GYRO_SIGMA_TO_DEGREESx10 14286
void getAngleXYZ( int32_t* xdeg, int32_t* ydeg, int32_t* zdeg )
{
*xdeg = (ax * 10) / GYRO_SIGMA_TO_DEGREESx10 ;
*ydeg = (ay * 10) / GYRO_SIGMA_TO_DEGREESx10 ;
*zdeg = (az * 10) / GYRO_SIGMA_TO_DEGREESx10 ;
}
}
getAngleXYZ() should be called from the application layer when you need a result - not from the integrator - you do the math at the point of need and have CPU cycles left to do more useful stuff.
Note that in the above I have ignored the possibility of arithmetic overflow of the integrator. As it is, it is good for approximately ±1.5 million degrees (±4175 rotations), so it may not be a problem in some applications. You could use an int64_t, or, if you are not interested in the number of rotations but just the absolute angle, then in the integrator:
ax += sample_x ;
ax %= GYRO_SIGMA_360 ;
Where GYRO_SIGMA_360 equals 514286 (360 x s / r).
Unfortunately, MEMS sensor math is quite complicated.
I would personally use the ready-made libraries provided by ST: https://www.st.com/en/embedded-software/x-cube-mems1.html.
I actually use them, and the results are very good.

Error "Relative branch out of reach" in avr

I'm new to AVR. I get a "Relative branch out of reach" error for the "brne round_loop" line while debugging. Can anyone help me? Thank you so much for your help.
; Test if round counter has reached 14
mov t4, rc
subi t4, 14
brne round_loop
round_loop:
; XOR state and key
eor s0, k0
eor s1, k1
eor s2, k2
eor s3, k3
The AVR BRNE instruction is a 16-bit opcode, 7 bits of which are the branch offset. This 7-bit signed operand can have a value k in the range -64 ≤ k ≤ +63; the PC is modified by k + 1 (i.e. -63 to +64). If the jump is further than that, a relative branch is unsuitable.
You either need to locate the target closer to the branch, or branch to a nearby stub that performs an unconditional jump: JMP has a 22-bit range, and the relative jump RJMP has a 12-bit range.
mov t4, rc
subi t4, 14
brne round_loop_longjmp
rjmp no_round_jmp
round_loop_longjmp:
rjmp round_loop
no_round_jmp:
...
A relative branch means that the jump occurs by adding a (positive or negative) offset to the program counter (which holds the address of the instruction being executed). That means round_loop in brne is not translated into an absolute address, but into a distance from the current instruction. The offset for brne is 7 bits, so it must lie within roughly ±64 words (most instructions are 1 word, so about 64 instructions). So the round_loop label should be within about 64 instructions of the brne instruction, either before or after it.
If you can't move round_loop within that range then you'll have to do a branch to a label that will do a JMP to round_loop.

Julia vs Python compiler/interpreter optimization

I recently asked the following question about Python:
Interpreter optimization in Python
Say I have a string in x, is the Python interpreter smart enough to
know that: string.replace(x, x) should be converted to a NOP?
The answer seems to be No (although the Python interpreter is able to do some optimizations through the peephole optimiser).
I don't know what the equivalent expression in Julia would be, but is Julia capable of optimizing these types of relatively obvious
statements?
Relying on the compiler
The question is: can Julia provide enough information to LLVM so that the compiler can optimize things?
For your example, yes, and you can verify it with code_native. The answer is pre-multiplied, the unnecessary assignment to x is optimized away, and the function always returns a constant:
julia> f()=(x=7*24*60*60)
f (generic function with 1 method)
julia> code_native(f,())
.section __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
push RBP
mov RBP, RSP
mov EAX, 604800
Source line: 1
pop RBP
ret
Optimization from typing
And it can go a bit further at times because more knowledge is available from type information. The converse is that the Any type and globals are to be avoided if possible.
Example
In Case I, the comparison needs to be made because y might be greater than or equal to 256; but in Case II, since y is only 1 byte, its value can never reach 256, so the function is optimized to always return 1.
Case I
julia> g(y::Int16)=(y<256?1:0)
g (generic function with 1 method)
julia> code_native(g,(Int16,))
.section __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
push RBP
mov RBP, RSP
cmp DI, 256
Source line: 1
setl AL
movzx EAX, AL
pop RBP
ret
Case II
julia> g(y::Int8)=(y<256?1:0)
g (generic function with 2 methods)
julia> code_native(g,(Int8,))
.section __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
push RBP
mov RBP, RSP
mov EAX, 1
Source line: 1
pop RBP
ret

add vs mul (IA32-Assembly)

I know that add is faster than mul.
I want to know how to go about using add instead of mul in the following code in order to make it more efficient.
Sample code:
mov eax, [ebp + 8] #eax = x1
mov ecx, [ebp + 12] #ecx = x2
mov edx, [ebp + 16] #edx = y1
mov ebx, [ebp + 20] #ebx = y2
sub eax,ecx #eax = x1-x2
sub edx,ebx #edx = y1-y2
mul edx #eax = (x1-x2)*(y1-y2)
add is faster than mul, but if you want to multiply two general values, mul is far faster than any loop of iterated add operations.
You can't seriously use add to make that code go faster than it will with mul. If you needed to multiply by some small constant value (such as 2), then maybe you could use add to speed things up. But for the general case - no.
If you are multiplying two values that you don't know in advance, it is effectively impossible to beat the multiply instruction in x86 assembler.
If you know the value of one of the operands in advance, you may be able to beat the multiply instruction by using a small number of adds. This works particularly well when the known operand is small and has only a few bits set in its binary representation. To multiply an unknown value x by a known value of the form 2^p + 2^q + ... + 2^r, you simply add x*2^p + x*2^q + ... + x*2^r for the bits p, q, ... and r that are set. This is easily accomplished in assembler by left shifting and adding:
; x in EDX
; product to EAX
xor eax,eax
shl edx,r ; x*2^r
add eax,edx
shl edx,q-r ; x*2^q
add eax,edx
shl edx,p-q ; x*2^p
add eax,edx
The key problem with this is that it takes at least 4 clocks to do this, assuming
a superscalar CPU constrained by register dependencies. Multiply typically takes
10 or fewer clocks on modern CPUs, and if this sequence gets longer than that in time
you might as well do a multiply.
To multiply by 9:
mov eax,edx ; same effect as xor eax,eax/shl edx,0/add eax,edx
shl edx,3 ; x*2^3
add eax,edx
This beats multiply; should only take 2 clocks.
What is less well known is the use of the LEA (load effective address) instruction,
to accomplish fast multiply-by-small-constant.
LEA takes only a single clock worst case, and its execution can often
be overlapped with other instructions by superscalar CPUs.
LEA is essentially "add two values with small constant multipliers".
It computes t=2^k*x+y for k=1,2,3 (see the Intel reference manual) for t, x and y
being any register. If x==y, you can get 1,2,3,4,5,8,9 times x,
but using x and y as separate registers allows intermediate results to be combined
and moved to other registers (e.g., to t), and this turns out to be remarkably handy.
Using it, you can accomplish a multiply by 9 using a single instruction:
lea eax,[edx*8+edx] ; takes 1 clock
Using LEA carefully, you can multiply by a variety of peculiar constants in a small number of cycles:
lea eax,[edx*4+edx] ; 5 * edx
lea eax,[eax*2+edx] ; 11 * edx
lea eax,[eax*4] ; 44 * edx
To do this, you have to decompose your constant multiplier into various factors/sums involving
1,2,3,4,5,8 and 9. It is remarkable how many small constants you can do this for, and still
only use 3-4 instructions.
If you allow the use of other typically single-clock instructions (e.g., SHL/SUB/NEG/MOV)
you can multiply by some constant values that pure LEA can't
do as efficiently by itself. To multiply by 31:
lea eax,[4*edx]
lea eax,[8*eax] ; 32*edx
sub eax,edx; 31*edx ; 3 clocks
The corresponding LEA sequence is longer:
lea eax,[edx*4+edx] ; 5 * edx
lea eax,[edx*2+eax] ; 7 * edx
lea eax,[eax*2+edx] ; 15 * edx
lea eax,[eax*2+edx] ; 31 * edx ; 4 clocks
Figuring out these sequences is a bit tricky, but you can set up an organized attack.
Since LEA, SHL, SUB, NEG and MOV are all single-clock instructions worst
case, and zero clocks if they have no dependencies on other instructions, you can compute the execution cost of any such sequence. This means you can implement a dynamic-programming algorithm to generate the best possible sequence of such instructions.
This is only useful if the clock count is smaller than the integer multiply for your particular CPU
(I use 5 clocks as rule of thumb), and it doesn't use up all the registers, or
at least it doesn't use up registers that are already busy (avoiding any spills).
I've actually built this into our PARLANSE compiler, and it is very effective for computing offsets into arrays of structures A[i], where the size of the structure elements in A is a known constant. A clever person would possibly cache the answer so it doesn't
have to be recomputed each time the same constant multiplier occurs; I didn't actually do that, because
the time to generate such sequences is less than you'd expect.
It is mildly interesting to print out the sequences of instructions needed to multiply by all constants
from 1 to 10000. Most of them can be done in 5-6 instructions worst case.
As a consequence, the PARLANSE compiler hardly ever uses an actual multiply when indexing even the nastiest
arrays of nested structures.
Unless your multiplications are fairly simplistic, the add most likely won't outperform a mul. Having said that, you can use add to do multiplications:
Multiply by 2:
add eax,eax ; x2
Multiply by 4:
add eax,eax ; x2
add eax,eax ; x4
Multiply by 8:
add eax,eax ; x2
add eax,eax ; x4
add eax,eax ; x8
They work nicely for powers of two. I'm not saying they're faster. They were certainly necessary in the days before fancy multiplication instructions. That's from someone whose soul was forged in the hell-fires that were the MOS Technology 6502, Zilog Z80 and RCA 1802 :-)
You can even multiply by non-powers by simply storing interim results:
Multiply by 9:
push ebx ; preserve
push eax ; save for later
add eax,eax ; x2
add eax,eax ; x4
add eax,eax ; x8
pop ebx ; get original eax into ebx
add eax,ebx ; x9
pop ebx ; recover original ebx
I generally suggest that you write your code primarily for readability and only worry about performance when you need it. However, if you're working in assembler, you may well already be at that point. But I'm not sure my "solution" is really applicable to your situation, since you have an arbitrary multiplicand.
You should, however, always profile your code in the target environment to ensure that what you're doing is actually faster. Assembler doesn't change that aspect of optimisation at all.
If you really want to see some more general purpose assembler for using add to do multiplication, here's a routine that will take two unsigned values in ax and bx and return the product in ax. It will not handle overflow elegantly.
START: MOV AX, 0007 ; Load up registers
MOV BX, 0005
CALL MULT ; Call multiply function.
HLT ; Stop.
MULT: PUSH BX ; Preserve BX, CX, DX.
PUSH CX
PUSH DX
XOR CX,CX ; CX is the accumulator.
CMP BX, 0 ; If multiplying by zero, just stop.
JZ FIN
MORE: PUSH BX ; Xfer BX to DX for bit check.
POP DX
AND DX, 0001 ; Is lowest bit 1?
JZ NOADD ; No, do not add.
ADD CX,AX
NOADD: SHL AX,1 ; Shift AX left (double).
SHR BX,1 ; Shift BX right (integer halve, next bit).
JNZ MORE ; Keep going until no more bits in BX.
FIN: PUSH CX ; Xfer product from CX to AX.
POP AX
POP DX ; Restore registers and return.
POP CX
POP BX
RET
It relies on the fact that 123 multiplied by 456 is identical to:
123 x 6
+ 1230 x 5
+ 12300 x 4
which is the same way you were taught multiplication back in grade/primary school. It's easier with binary since you're only ever multiplying by zero or one (in other words, either adding or not adding).
It's pretty old-school x86 (8086, from a DEBUG session - I can't believe they still actually include that thing in XP) since that was about the last time I coded directly in assembler. There's something to be said for high level languages :-)
When it comes to assembly instructions, the speed of executing any instruction is measured in clock cycles. A mul instruction always takes more clock cycles than an add, but if you execute add in a loop, the overall clock cycles needed to do a multiplication using add will be far more than a single mul instruction. You can have a look at the following URLs, which list the clock cycles of a single add/mul instruction, and do the math on which one will be faster:
http://home.comcast.net/~fbui/intel_a.html#add
http://home.comcast.net/~fbui/intel_m.html#mul
My recommendation is to use the mul instruction rather than putting add in a loop; the latter is a very inefficient solution.
I'd have to echo the responses you already have - for a general multiply you're best off using MUL - after all, that's what it's there for!
In some specific cases, where you know you'll be wanting to multiply by a specific fixed value each time (for example, in working out a pixel index in a bitmap) then you can consider breaking the multiply down into a (small) handful of SHLs and ADDs - e.g.:
1280 x 1024 display - each line on the
display is 1280 pixels.
1280 = 1024 + 256 = 2^10 + 2^8
y * 1280 = y * (2 ^ 10) + y * (2 ^ 8)
= ADD (SHL y, 10), (SHL y, 8)
...given that graphics processing is likely to need to be speedy, such an approach may save you precious clock cycles.

Generating random numbers with ARM Assembly

I want to generate random numbers to use in my iPhone project by inlining some assembly in my Objective-C code. Is this possible with ARM assembly?
Look up LFSR (linear feedback shift register) on Google. It's not a true random number generator, but you can make pretty good pseudo-random numbers with maybe three or four lines of assembler.
Go to Wikipedia, find the easiest random number generation algorithm, and reimplement it in assembly :)
; ========================= RANDOM.INC =========================
; Call with, NOTHING
; Returns, AL = random number between 0-255,
; AX may be a random number too ?
; DW RNDNUM holds AX=random_number_in AL
SEED DW 3749h
RNDNUM DW 0
align 16
RANDOM:
PUSH DX
MOV AX,[SEED] ;; AX = seed
MOV DX,8405h ;; DX = 8405h
MUL DX ;; MUL (8405h * SEED) into dword DX:AX
;
CMP AX,[SEED]
JNZ GOTSEED ;; if new SEED = old SEED, alter SEED
MOV AH,DL
INC AX
GOTSEED:
MOV WORD [SEED],AX ;; We have a new seed, so store it
MOV AX,DX ;; AL = random number
MOV WORD [RNDNUM],AX
POP DX
RET
Just load a variable from an uninitialized memory address. At every access, increment the address to get new random numbers. Voila: guaranteed random, but not well distributed.