MIPS assembly variable registers - optimization

I have a section of code that I'd like to reuse, while changing only one register in one instruction. The initial register is $f18 in coproc1, and each time this code is run I want it to use the next COP1 register (max of 4 times). In this case I am very limited on memory and available GPR registers so I would not like to make a separate subroutine for this.
I know I can use self-modifying code to change the actual instruction, but doing so seems to require me to know the exact address of the line in question. This makes developing my code difficult because the size will constantly fluctuate until I'm finished.
Is it possible to reference a memory address by label+offset?
And is there a better way to do this using very few instructions and additional registers?
calc_and_add_color:
srl $t2, $t2, $t0
andi $s2, $t2, 0x1F
mtc1 $s2, $f22 #f22 is now red/green/blue component
cvt.s.w $f22, $f22
mul.s $f25, $f22, $f18 #<<<F18 HERE IS WHAT I WANT TO CHANGE
round.w.s $f25, $f25
lh $s2, 0x0($s1)
mfc1 $s5, $f25
addu $s5, $s5, $s2 #add new red comp. to existing one
andi $s5, $s5, 0x1F #mask to low 5 bits
sh $s5, 0x0($s1) #store it
addiu $s1, $s1, 0x6C0 #next color
addiu $s2, $zero, 0x5 #bit shifter
andi $s5, $fp, 0x0003 #isolate counter
bnel $s5, $zero, calc_and_add_color #when fp reaches zero every color has been done
addiu $fp, $fp, -0x1
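Rather than patching the instruction, one common alternative is to keep the four coefficients in a small data table and reload a single fixed register (e.g. $f18 via lwc1) from table-base + 4*pass at the top of each pass, so the mul.s itself never changes. A minimal C model of that idea (the coefficient values are made-up placeholders, and scale_component is a hypothetical name):

```c
/* Hypothetical model: instead of rewriting the mul.s instruction,
 * index a 4-entry coefficient table with the pass number. In MIPS
 * this would be one lwc1 from coef_table + pass*4 into a fixed
 * register such as $f18. The values below are placeholders. */
static const float coef_table[4] = {1.0f, 0.5f, 0.25f, 0.125f};

float scale_component(int component, int pass)
{
    return (float)component * coef_table[pass & 3];
}
```

The cost per pass is one extra load, but no self-modifying code and no need to know the patched instruction's address.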

MIPS Dynamic Array allocation

I would like to ask the user how many numbers they want to enter, then let them enter the numbers and store them in an array. I would like to call it myArray.
I cannot find anything clear anywhere.
print_str ("How many numbers would you like to enter? ")
li $v0, 5           # read n from the user (syscall 5)
syscall
move $s7, $v0       # $s7 = n
sll $a0, $s7, 2     # n x 4 bytes for memory allocation
li $v0, 9           # (1) Allocate a block of memory (sbrk)
syscall             # $v0 <-- address
move $s5, $v0       # (2) Keep the base address (this is myArray)
li $s6, 0           # i = 0
inputLoop:
bge $s6, $s7, exitInputLoop
li $v0, 5           # read the next number
syscall
sll $t0, $s6, 2     # offset = i * 4
addu $t0, $s5, $t0
sw $v0, 0($t0)      # myArray[i] = input
addi $s6, $s6, 1    # i++
j inputLoop
exitInputLoop:
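For reference, the intended structure — read n, allocate the whole block once, then fill it in a loop — looks like this as a C sketch (malloc stands in for MARS syscall 9, next_input is a stub for syscall 5, and fill_array is a hypothetical name):

```c
#include <stdlib.h>

/* Stub standing in for MARS syscall 5 (read integer): returns canned values. */
static int next_input(void)
{
    static const int canned[] = {7, 42, -3};
    static int pos = 0;
    return canned[pos++];
}

/* Intended structure of the MIPS code: allocate the whole block once
 * (n * 4 bytes, as sll $a0,$s7,2 then syscall 9 would), then fill it. */
int *fill_array(int n)
{
    int *myArray = malloc(n * sizeof(int)); /* base address from "syscall 9" */
    for (int i = 0; i < n; i++)             /* inputLoop: while i < n */
        myArray[i] = next_input();          /* sw value, 0(base + 4*i) */
    return myArray;
}
```

The key point is that the allocation happens exactly once, before the loop, and the loop only indexes the block.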

Where to start with Keyboard interrupt Handler

I'm getting started with a program in MARS MIPS that will let the user type something of the form "x+y=" in the MMIO input window and get back "x+y=z". However, I just don't really know where to start. I have the basics set up, but I need to write an entire interrupt handler.
I'm using MARS, and have enabled the interrupt enable bit, but that's about all I've figured out.
.text
main:
#Turn on the interrupt enable bit of the receiver control register
lui $t0, 0xffff
lw $t1, 0($t0)
ori $t1, $t1, 0x0002
sw $t1, 0($t0)
.data
expBuffer: .space 60
expBuff: .word 0
.ktext 0x80000180
#Store all used registers
#Recover all used registers
.kdata
#Registers

Too many cycles

I'm considering an example from Optimizing Assembly by Agner Fog. He tests:
Example 12.6b. DAXPY algorithm, 32-bit mode
L1:
movapd xmm1, [esi+eax] ; X[i], X[i+1]
mulpd xmm1, xmm2 ; X[i] * DA, X[i+1] * DA
movapd xmm0, [edi+eax] ; Y[i], Y[i+1]
subpd xmm0, xmm1 ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
movapd [edi+eax], xmm0 ; Store result
add eax, 16 ; Add size of two elements to index
cmp eax, ecx ; Compare with n*8
jl L1 ; Loop back
and he estimates ~6 cycles/iteration on a Pentium M. I tried the same thing on my CPU, an Ivy Bridge, and achieved 3 cycles per iteration, but from my computations on paper it should be possible to get 2 cycles. I don't know whether I made a mistake in the theoretical computations or whether it can be improved.
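For reference, the loop's comments describe a DAXPY-style update, y[i] = y[i] - X[i]*DA, two elements per iteration; a scalar C model of the same computation:

```c
/* Scalar model of the SSE2 loop: y[i] -= x[i] * da.
 * The vector version does two doubles per iteration, with da
 * broadcast into both lanes of xmm2. */
void daxpy_sub(double *y, const double *x, double da, int n)
{
    for (int i = 0; i < n; i++)
        y[i] -= x[i] * da;
}
```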
What I did:
From http://www.realworldtech.com/sandy-bridge/5/ I know that my CPU can retire 4 micro-ops per cycle, so retirement is not the bottleneck.
#fused-domain uops = 8, so 8 / 4 = 2 cycles
So 2 cycles is our current lower bound. Let's look at the other possibilities:
The micro-ops follow the pattern 1-1-1-1-1-1-1-1, and according to Agner Fog my CPU's decoders handle a 1-1-1-1 pattern (among others), so the instructions can be decoded in 2 cycles. That is not a bottleneck either. Moreover, SnB-family CPUs have a uop cache, so neither fetching nor decoding should be the bottleneck.
The loop is 32 bytes of instructions, so it fits into the uop-cache window (32 bytes).
From my experiments, adding a nop instruction increases the cycles per iteration (by approx. 0.5 cycles).
So, the question is:
Where is that ONE cycle? :D
Are you sure you're not limited by memory bandwidth? You need to test with arrays that fit in L1 cache, because you're depending on two loads and one store per 2 clocks. (That's more than half of IvB's theoretical max of two 128b memory ops per clock, with at most one of them being a store.)
Decoding can't be relevant because your loop fits in the loop buffer. (It's less than 28 uops). So it just issues in groups of 4 already-decoded uops.
Your fused-domain uop counts are wrong. cmp/jl can macro-fuse into one compare-and-branch uop. However, that mistake is balanced by another mistake due to something that's not in Agner Fog's guide (yet).
movapd [edi+eax], xmm0 can't micro-fuse on IvB, so it's 2 fused-domain uops.
SnB-family CPUs can only micro-fuse memory operands that don't use an index register. I recently found official confirmation in Intel's optimization manual that explains the different results from Agner's testing vs. my testing: such addressing modes can micro-fuse in the decoders and uop cache, but not in the OOO core. (See my answer on that question. I should send Agner another message to let him know that Intel's docs sorted out our confusion...)
Try this:
add ecx, edi ; end-of-Y pointer
sub esi, edi ; esi = X-Y
L1:
movapd xmm1, [esi+edi] ; 1 uop
mulpd xmm1, xmm2 ; 1 uop
movapd xmm0, [edi] ; 1 uop
subpd xmm0, xmm1 ; 1 uop
movapd [edi], xmm0 ; 1 micro-fused uop
add edi, 16 ; 1 uop
cmp edi, ecx ; loop while (dst < end_Y)
jb L1 ; cmp+jb = 1 macro-fused uop
Loads don't need to micro-fuse, but stores are 2 fused-domain uops. (store-address and store-data).
IACA would have told you that the store was 2 uops and couldn't micro-fuse. It's worth having a look at. Sometimes its numbers are wrong (e.g. it thinks shrd is still slow on SnB), but often it's useful as long as you realize it's a simplistic approximation to the real hardware behaviour, and not a cycle-accurate simulator.
My version is 7 total fused-domain uops, so it should run at one iteration per 2 clocks on SnB-family CPUs. Your original was 8 uops, so this change shouldn't make any difference. I wrote it before noticing that you didn't account for macro-fusion of the cmp/jcc, so I was thinking your loop was actually 9 uops. Since adding a single nop slows your code down, that's additional evidence that it's 8 fused-domain uops. If cache misses from testing with arrays that are too large don't explain it, then maybe IvB is doing badly at scheduling the load/store uops somehow? That seems unlikely, since they all must use port 2 or 3 for store-address or load uops. (In the unfused domain, store-data uops go to port 4.)
Are you sure you really got one iteration per 3 cycles with your loop? It doesn't make sense that adding a nop could slow it down beyond that, because a 9 uop loop should issue in 3 cycles.

How could I optimize my code to cut down on the cycles it takes to run it?

I am a newcomer to assembly language on MIPS, but I have prior experience in Java. I have the following block of code and was wondering how I could make it significantly faster. As you can see, this code takes a total of 45 cycles to run, and the div instruction is a big portion of the total. Maybe I could use something else in place of div to optimize the code and cut down on cycles?
The code:
li  $t0, -32       # 2 cycles
lw  $t2, 0($s1)    # 1 cycle
div $t2, $t2, $t0  # 41 cycles
sw  $t2, 0($s1)    # 1 cycle
                   # total: 45 cycles
Your help is much appreciated. Thanks.
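Since the divisor -32 is a (negated) power of two, the 41-cycle div can be replaced by a handful of single-cycle instructions: add a bias of 31 to negative dividends so the arithmetic right shift truncates toward zero the way div does, then negate. A C model of the sequence (assuming 32-bit two's-complement ints and an arithmetic >> on signed values, which is what MIPS sra provides):

```c
/* Model of replacing "div $t2, $t2, $t0" (with $t0 = -32) by shifts:
 * x / 32 truncating toward zero is (x + 31*(x < 0)) >> 5,
 * and x / -32 is the negation of that. */
int div_by_minus32(int x)
{
    int bias = (x >> 31) & 31;   /* 31 if x < 0, else 0 (sra + andi) */
    int q = (x + bias) >> 5;     /* arithmetic shift right by 5 (sra) */
    return -q;                   /* negate (subu $t2, $zero, $t2) */
}
```

In MIPS that is roughly sra/andi/addu/sra/subu — about 5 cycles instead of 41.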

add vs mul (IA32-Assembly)

I know that the add instruction is faster than mul.
I want to know how to go about using add instead of mul in the following code in order to make it more efficient.
Sample code:
mov eax, [ebp + 8] #eax = x1
mov ecx, [ebp + 12] #ecx = x2
mov edx, [ebp + 16] #edx = y1
mov ebx, [ebp + 20] #ebx = y2
sub eax,ecx #eax = x1-x2
sub edx,ebx #edx = y1-y2
mul edx #eax = (x1-x2)*(y1-y2)
add is faster than mul, but if you want to multiply two general values, mul is far faster than a loop of repeated add operations.
You can't seriously use add to make that code go faster than it will with mul. If you needed to multiply by some small constant value (such as 2), then maybe you could use add to speed things up. But for the general case - no.
If you are multiplying two values that you don't know in advance, it is effectively impossible to beat the multiply instruction in x86 assembler.
If you know the value of one of the operands in advance, you may be able to beat the multiply instruction by using a small number of adds. This works particularly well when the known operand is small and has only a few bits set in its binary representation. To multiply an unknown value x by a known value of the form 2^p + 2^q + ... + 2^r, you simply add x*2^p + x*2^q + ... + x*2^r for the bits p, q, ... r that are set. This is easily accomplished in assembler by left shifting and adding (assuming r <= q <= p):
; x in EDX
; product to EAX
xor eax,eax
shl edx,r ; x*2^r
add eax,edx
shl edx,q-r ; x*2^q
add eax,edx
shl edx,p-q ; x*2^p
add eax,edx
The key problem with this is that it takes at least 4 clocks, assuming a superscalar CPU constrained by register dependencies. Multiply typically takes 10 or fewer clocks on modern CPUs, and if this sequence gets longer than that in time, you might as well do a multiply.
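The shift-and-add sequence can be sanity-checked in C; mul_by_pqr is a hypothetical helper mirroring the cumulative shifts of the assembly above:

```c
/* C model of the shl/add sequence: multiply x by 2^p + 2^q + 2^r,
 * with r <= q <= p. edx accumulates the shifts exactly as the asm
 * does (r, then q-r more, then p-q more); eax collects the sum. */
unsigned mul_by_pqr(unsigned x, unsigned p, unsigned q, unsigned r)
{
    unsigned eax = 0, edx = x;
    edx <<= r;      eax += edx;   /* x*2^r */
    edx <<= q - r;  eax += edx;   /* x*2^q */
    edx <<= p - q;  eax += edx;   /* x*2^p */
    return eax;
}
```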
To multiply by 9:
mov eax,edx ; same effect as xor eax,eax / shl edx,0 / add eax,edx
shl edx,3 ; x*2^3
add eax,edx
This beats multiply; should only take 2 clocks.
What is less well known is the use of the LEA (load effective address) instruction to accomplish a fast multiply-by-small-constant. LEA takes only a single clock in the worst case, and its execution can often be overlapped with other instructions on superscalar CPUs. LEA is essentially "add two values, one scaled by a small constant". It computes t = 2^k*x + y for k = 0, 1, 2, 3 (see the Intel reference manual), with t, x and y being any registers. If x == y, you can get 1, 2, 3, 4, 5, 8 or 9 times x, but using x and y as separate registers allows intermediate results to be combined and moved to other registers (e.g., to t), and this turns out to be remarkably handy.
Using it, you can accomplish a multiply by 9 using a single instruction:
lea eax,[edx*8+edx] ; takes 1 clock
Using LEA carefully, you can multiply by a variety of peculiar constants in a small number of cycles:
lea eax,[edx*4+edx] ; 5 * edx
lea eax,[eax*2+edx] ; 11 * edx
lea eax,[eax*4] ; 44 * edx
To do this, you have to decompose your constant multiplier into various factors/sums involving 1, 2, 3, 4, 5, 8 and 9. It is remarkable how many small constants you can do this for, and still only use 3-4 instructions.
If you allow the use of other typically single-clock instructions (e.g., SHL/SUB/NEG/MOV) you can multiply by some constant values that pure LEA can't do as efficiently by itself. To multiply by 31:
lea eax,[4*edx]
lea eax,[8*eax] ; 32*edx
sub eax,edx         ; 31*edx ; 3 clocks
The corresponding LEA sequence is longer:
lea eax,[edx*4+edx]
lea eax,[edx*2+eax] ; eax*7
lea eax,[eax*2+edx] ; eax*15
lea eax,[eax*2+edx] ; eax*31 ; 4 clocks
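The LEA chains above reduce to plain shift/add identities, which is easy to verify in C (function names here are just for illustration):

```c
/* Each LEA step as arithmetic: t = x*2^k + y for k = 0..3. */
unsigned times5(unsigned x)  { return (x << 2) + x; }          /* lea [x*4+x] */
unsigned times11(unsigned x) { return (times5(x) << 1) + x; }  /* lea [t*2+x] */
unsigned times44(unsigned x) { return times11(x) << 2; }       /* lea [t*4]   */
unsigned times31(unsigned x) { return (x << 5) - x; }          /* shl+sub form */
```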
Figuring out these sequences is a bit tricky, but you can set up an organized attack. Since LEA, SHL, SUB, NEG and MOV are all single-clock instructions in the worst case, and zero clocks if they have no dependencies on other instructions, you can compute the execution cost of any such sequence. This means you can implement a dynamic programming algorithm to generate the best possible sequence of such instructions.
This is only useful if the clock count is smaller than that of the integer multiply for your particular CPU (I use 5 clocks as a rule of thumb), and it doesn't use up all the registers, or at least doesn't use up registers that are already busy (avoiding any spills).
I've actually built this into our PARLANSE compiler, and it is very effective for computing offsets into arrays of structures A[i], where the size of the structure element in A is a known constant. A clever person would possibly cache the answer so it doesn't have to be recomputed each time the same constant multiplier occurs; I didn't actually do that because the time to generate such sequences is less than you'd expect.
It is mildly interesting to print out the sequences of instructions needed to multiply by all constants from 1 to 10000. Most of them can be done in 5-6 instructions worst case. As a consequence, the PARLANSE compiler hardly ever uses an actual multiply when indexing even the nastiest arrays of nested structures.
Unless your multiplications are fairly simplistic, the add most likely won't outperform a mul. Having said that, you can use add to do multiplications:
Multiply by 2:
add eax,eax ; x2
Multiply by 4:
add eax,eax ; x2
add eax,eax ; x4
Multiply by 8:
add eax,eax ; x2
add eax,eax ; x4
add eax,eax ; x8
They work nicely for powers of two. I'm not saying they're faster. They were certainly necessary in the days before fancy multiplication instructions. That's from someone whose soul was forged in the hell-fires that were the Mostek 6502, Zilog z80 and RCA1802 :-)
You can even multiply by non-powers of two by simply storing interim results:
Multiply by 9:
push ebx ; preserve
push eax ; save for later
add eax,eax ; x2
add eax,eax ; x4
add eax,eax ; x8
pop ebx ; get original eax into ebx
add eax,ebx ; x9
pop ebx ; recover original ebx
I generally suggest that you write your code primarily for readability and only worry about performance when you need it. However, if you're working in assembler, you may well already be at that point. But I'm not sure my "solution" is really applicable to your situation, since you have an arbitrary multiplicand.
You should, however, always profile your code in the target environment to ensure that what you're doing is actually faster. Assembler doesn't change that aspect of optimisation at all.
If you really want to see some more general purpose assembler for using add to do multiplication, here's a routine that will take two unsigned values in ax and bx and return the product in ax. It will not handle overflow elegantly.
START: MOV AX, 0007 ; Load up registers
MOV BX, 0005
CALL MULT ; Call multiply function.
HLT ; Stop.
MULT: PUSH BX ; Preserve BX, CX, DX.
PUSH CX
PUSH DX
XOR CX,CX ; CX is the accumulator.
CMP BX, 0 ; If multiplying by zero, just stop.
JZ FIN
MORE: PUSH BX ; Xfer BX to DX for bit check.
POP DX
AND DX, 0001 ; Is lowest bit 1?
JZ NOADD ; No, do not add.
ADD CX,AX
NOADD: SHL AX,1 ; Shift AX left (double).
SHR BX,1 ; Shift BX right (integer halve, next bit).
JNZ MORE ; Keep going until no more bits in BX.
FIN: PUSH CX ; Xfer product from CX to AX.
POP AX
POP DX ; Restore registers and return.
POP CX
POP BX
RET
It relies on the fact that 123 multiplied by 456 is identical to:
123 x 6
+ 1230 x 5
+ 12300 x 4
which is the same way you were taught multiplication back in grade/primary school. It's easier with binary since you're only ever multiplying by zero or one (in other words, either adding or not adding).
It's pretty old-school x86 (8086, from a DEBUG session - I can't believe they still actually include that thing in XP) since that was about the last time I coded directly in assembler. There's something to be said for high level languages :-)
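The same shift-and-add algorithm as the MULT routine, written in C for comparison (16-bit like the 8086 version, so overflow simply wraps as it would in AX):

```c
#include <stdint.h>

/* Shift-and-add multiply, mirroring the 8086 MULT routine: for each
 * set bit of b (SHR BX,1), add the correspondingly shifted a
 * (SHL AX,1) into an accumulator (CX). */
uint16_t mult(uint16_t a, uint16_t b)
{
    uint16_t cx = 0;          /* accumulator, like CX */
    while (b != 0) {          /* JNZ MORE: bits remain in b */
        if (b & 1)            /* AND DX,0001: is lowest bit 1? */
            cx += a;          /* ADD CX,AX */
        a <<= 1;              /* double a; wraps mod 2^16 like AX */
        b >>= 1;              /* halve b, exposing the next bit */
    }
    return cx;                /* product mod 2^16; overflow ignored */
}
```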
When it comes to assembly instructions, the speed of executing any instruction is measured in clock cycles. A mul instruction always takes more clock cycles than an add operation, but if you execute the same add instruction in a loop, the overall clock cycles to do a multiplication with add will be far more than a single mul instruction. You can have a look at the following URLs, which give the clock cycles of a single add/mul instruction, so you can do the math on which one will be faster:
http://home.comcast.net/~fbui/intel_a.html#add
http://home.comcast.net/~fbui/intel_m.html#mul
My recommendation is to use the mul instruction rather than putting add in a loop; the latter is a very inefficient solution.
I'd have to echo the responses you have already - for a general multiply you're best off using MUL - after all it's what it's there for!
In some specific cases, where you know you'll be wanting to multiply by a specific fixed value each time (for example, in working out a pixel index in a bitmap) then you can consider breaking the multiply down into a (small) handful of SHLs and ADDs - e.g.:
1280 x 1024 display - each line on the
display is 1280 pixels.
1280 = 1024 + 256 = 2^10 + 2^8
y * 1280 = y * (2 ^ 10) + y * (2 ^ 8)
= ADD (SHL y, 10), (SHL y, 8)
...given that graphics processing is likely to need to be speedy, such an approach may save you precious clock cycles.
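The row-offset trick in C, for reference (row_offset is a hypothetical name):

```c
/* y * 1280 = y * (2^10 + 2^8): two shifts and an add replace the mul,
 * e.g. when computing the start of row y on a 1280-pixel-wide display. */
unsigned row_offset(unsigned y)
{
    return (y << 10) + (y << 8);
}
```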