I'm studying the hardware specification of the 8086 and I'm wondering: what does the BHE' signal do? When is it activated, and when is it deactivated?
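The 8086 can address bytes (8 bits) and words (16 bits) in memory. BHE' (Bus High Enable) is active low: when it is 0, the upper half of the data bus (D15-D8) takes part in the transfer, while A0 = 0 selects the lower half (D7-D0).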
To access a byte at an even address, the A0 signal will be logically 0 and the BHE signal will be 1.
To access a byte at an odd address, the A0 signal will be logically 1 and the BHE signal will be 0.
To access a word at an even address, the A0 signal will be logically 0 and the BHE signal will also be 0.
instruction        A0   BHE   cycles
mov al, [1234h]    0    1     10
mov al, [1235h]    1    0     10
mov ax, [1234h]    0    0     10
To access a word at an odd address, the processor will need to address the bytes separately. This will incur a penalty of 4 cycles!
The instruction mov ax, [1235h] will take 14 cycles.
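The rules above can be sketched as a small model. This is only an illustration in Python (not 8086 code); the function names bus_cycles and mov_cycles are hypothetical, and the 10-cycle base and 4-cycle penalty are simply the figures quoted above.

```python
def bus_cycles(address, size):
    """Return the (A0, BHE) pin levels for each bus cycle of an access.

    size is 1 for a byte, 2 for a word. BHE is the level on the
    active-low pin: 0 means the high data byte (D15-D8) is enabled.
    """
    a0 = address & 1
    if size == 1:
        # Even byte uses D7-D0 (BHE = 1); odd byte uses D15-D8 (BHE = 0).
        return [(a0, 1 - a0)]
    if a0 == 0:
        # Aligned word: both halves of the bus in a single bus cycle.
        return [(0, 0)]
    # Misaligned word: the odd byte first, then the even byte at address + 1.
    return [(1, 0), (0, 1)]


def mov_cycles(address, size, base=10, penalty=4):
    """Clock count for a mov from memory, using the figures quoted above."""
    extra_bus_cycles = len(bus_cycles(address, size)) - 1
    return base + penalty * extra_bus_cycles


for addr, size in [(0x1234, 1), (0x1235, 1), (0x1234, 2), (0x1235, 2)]:
    print(hex(addr), size, bus_cycles(addr, size), mov_cycles(addr, size))
```

The misaligned word is the only case that needs two bus cycles, which is where the extra 4 clocks come from.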
I'm looking at an example from Agner Fog's Optimizing Assembly guide. He tests:
Example 12.6b. DAXPY algorithm, 32-bit mode
L1:
movapd xmm1, [esi+eax] ; X[i], X[i+1]
mulpd xmm1, xmm2 ; X[i] * DA, X[i+1] * DA
movapd xmm0, [edi+eax] ; Y[i], Y[i+1]
subpd xmm0, xmm1 ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
movapd [edi+eax], xmm0 ; Store result
add eax, 16 ; Add size of two elements to index
cmp eax, ecx ; Compare with n*8
jl L1 ; Loop back
and he estimates ~6 cycles per iteration on Pentium M. I tried the same thing on my CPU, an Ivy Bridge, and achieved 3 cycles per iteration, but from my calculations on paper it should be possible to get 2 cycles. I don't know whether I made a mistake in my theoretical calculations or whether the code can be improved.
What I did:
From http://www.realworldtech.com/sandy-bridge/5/ I know that my CPU can retire 4 micro-ops per cycle, so retirement is not a bottleneck.
#uops fused = 8 / 4 = 2
So 2 cycles is our current bottleneck. Let's look at the other possibilities:
The instructions decode with the micro-op pattern 1-1-1-1-1-1-1-1, and according to Agner Fog my CPU's decoders handle the pattern 1-1-1-1 (among others). So the instructions can be decoded in 2 cycles, and decoding is not a bottleneck. Moreover, SnB CPUs have a micro-op cache, so neither fetching nor decoding should be the bottleneck.
The loop is 32 bytes of code, so it fits into one micro-op cache window (32 bytes).
In my experiments, adding a nop instruction increases the number of cycles per iteration (by approx. 0.5 cycle).
So, the question is:
Where is that ONE cycle? :D
Are you sure you're not limited by memory bandwidth? You need to test with arrays that fit in L1 cache, because you're depending on two loads and one store per 2 clocks. (That's more than half of IvB's theoretical max of two 128b memory ops per clock, with at most one of them being a store.)
Decoding can't be relevant because your loop fits in the loop buffer. (It's less than 28 uops). So it just issues in groups of 4 already-decoded uops.
Your fused-domain uop counts are wrong. cmp/jl can macro-fuse into one compare-and-branch uop. However, that mistake is balanced by another mistake due to something that's not in Agner Fog's guide (yet).
movapd [edi+eax], xmm0 can't micro-fuse on IvB, so it's 2 fused-domain uops.
SnB-family CPUs can only micro-fuse memory operands that don't use an index register. I recently found official confirmation in Intel's optimization manual that explains the different results from Agner's testing vs. my testing: such addressing modes can micro-fuse in the decoders and uop cache, but not in the OOO core. (See my answer on that question. I should send Agner another message to let him know that Intel's docs sorted out our confusion...)
Try this:
add ecx, edi ; end-of-Y pointer
sub esi, edi ; esi = X-Y
L1:
movapd xmm1, [esi+edi] ; 1 uop
mulpd xmm1, xmm2 ; 1 uop
movapd xmm0, [edi] ; 1 uop
subpd xmm0, xmm1 ; 1 uop
movapd [edi], xmm0 ; 1 micro-fused uop
add edi, 16 ; 1 uop
cmp edi, ecx ; loop while (dst < end_Y)
jb L1 ; cmp+jb = 1 macro-fused uop
Loads don't need to micro-fuse, but stores are 2 fused-domain uops. (store-address and store-data).
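To make the pointer arithmetic in that loop easier to follow, here is a rough Python model of it (Python only because it is easy to run; it says nothing about uops). Assumptions: memory is a flat list indexed by element rather than by byte, the loop steps one element at a time instead of 16 bytes, n > 0, and the name daxpy_pointer_diff is made up for this sketch.

```python
def daxpy_pointer_diff(memory, x_base, y_base, n, da):
    """Update Y in place (Y[i] = Y[i] - X[i] * DA) using one moving pointer."""
    end = y_base + n            # add ecx, edi  -> end-of-Y pointer
    esi = x_base - y_base       # sub esi, edi  -> esi = X - Y
    edi = y_base
    while edi < end:            # cmp edi, ecx / jb L1
        x_val = memory[esi + edi]                # indexed load, [esi+edi] is X[i]
        memory[edi] = memory[edi] - x_val * da   # the store uses [edi] only
        edi += 1                # add edi, 16 in the real loop
    return memory


mem = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]   # X at 0..3, Y at 4..7
print(daxpy_pointer_diff(mem, x_base=0, y_base=4, n=4, da=2.0))
```

Because esi holds the difference X - Y, the single pointer edi reaches both arrays, and only the load needs the two-register addressing mode.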
IACA would have told you that the store was 2 uops and couldn't micro-fuse. It's worth having a look at. Sometimes its numbers are wrong (e.g. it thinks shrd is still slow on SnB), but often it's useful as long as you realize it's a simplistic approximation to the real hardware behaviour, and not a cycle-accurate simulator.
My version is 7 total fused-domain uops. So it should run at one iteration per 2 clocks on SnB-family CPUs. Your original was 8 uops, so this change shouldn't make any difference. I wrote it before noticing that you didn't account for macro-fusion of the cmp/jcc, so I was thinking your loop was actually 9 uops. Since adding a single nop slows your code down, that's additional evidence that it's 8 fused-domain uops. If cache misses from testing with arrays that are too large don't explain it, then maybe IvB is doing badly at scheduling the load/store uops somehow? Seems unlikely, since they all must use port 2 or 3 for store-address or load uops. (In the unfused domain, store-data uops go to port 4.)
Are you sure you really got one iteration per 3 cycles with your loop? It doesn't make sense that adding a nop could slow it down beyond that, because a 9 uop loop should issue in 3 cycles.
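The front-end arithmetic used in this answer can be written down as a one-liner. This is a simplistic sketch, assuming a 4-wide issue stage and that an issue group never spans two loop iterations; the name cycles_per_iteration is hypothetical, and real hardware has other limits (ports, memory) that this ignores.

```python
import math

ISSUE_WIDTH = 4  # fused-domain uops issued per clock on SnB-family (assumed)


def cycles_per_iteration(fused_uops, issue_width=ISSUE_WIDTH):
    """Lower bound on clocks per iteration from the issue stage alone."""
    return math.ceil(fused_uops / issue_width)


for uops in (7, 8, 9):
    print(uops, "uops ->", cycles_per_iteration(uops), "cycles")
```

Under these assumptions 7 and 8 uops both come out at 2 cycles, and a 9th uop pushes the loop to 3.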
I tried to find the location of a cursor within an array of 1896 elements (the whole console flattened from 2D, 79*24). For this I took the location and divided it by 79.
MOV AX, [Y-16H]
DEC AX
MOV BX, 79
DIV BX
MOV Z, DL
ADD Z, DH
MOV DL, Z
MOV Z, AL
ADD Z, AH
MOV DH, Z
I get an overflow error. Can you tell me what I am doing wrong, please? Maybe suggest a solution?
DIV BX divides the 32-bit number formed by DX (high word) and AX (low word) by BX. You therefore need to clear DX (e.g. XOR DX,DX) prior to the division to avoid an overflow.
By the way, are you sure you don't want to divide by 80? I've never heard of a 79-column console, although I'm no expert on such matters.
As Michael mentioned, you need to clear the DX register prior to division.
That said, if you are interested in speed (the usual reason for assembly coding), it is much faster to transform the division by 79 into an equivalent operation using a multiplication and a right shift: (x * 53093) >> 22.
This works because 1/79th is approximately equal to 53093 / (2**22).
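A quick way to convince yourself of the constant is to check it exhaustively. The sketch below is Python rather than assembly, and the names MAGIC, div79 and divmod79 are made up here. One caveat it makes visible: the product x * 53093 needs up to 32 bits for a 16-bit x, so on a 16-bit CPU the multiply has to produce a DX:AX result.

```python
MAGIC = 53093   # 2**22 / 79 rounded up
SHIFT = 22


def div79(x):
    """Quotient of x / 79 via multiply and right shift."""
    return (x * MAGIC) >> SHIFT


def divmod79(x):
    """Quotient and remainder, the remainder recovered by multiplying back."""
    q = div79(x)
    return q, x - q * 79


# Exhaustive check over every unsigned 16-bit value.
print(all(div79(x) == x // 79 for x in range(65536)))
print(divmod79(1895))   # last cell of a 79*24 grid
```

The check passes for all 16-bit inputs, which comfortably covers the 0..1895 range in the question.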
I'm using clang+LLVM 2.9 to compile various workloads for x86 with the -Os option. Small binary size is important and I must use static linking. All binaries are 32-bit.
I notice that many instructions use addressing modes with 32-bit displacements when only 8 bits are actually used. For example:
89 84 24 d4 00 00 00 mov %eax,0xd4(%esp)
Why didn't the compiler/assembler choose the compact 8-bit displacement?
89 44 24 d4 mov %eax,0xd4(%esp)
In fact, these wasted addressing bytes are over 2% of my entire binary!
I looked at LLVM's link time optimization and tried --emit-llvm, but it didn't mention or help this issue.
Is there some link-time optimization that can use knowledge of the actual displacements to choose the smaller instruction form?
Thanks for any help!
In x86, offsets are signed. This allows you to access data on both sides of the base address. Therefore, the range of an 8 bit offset is -128 to 127. Your instruction is referencing data 212 bytes forward (the value 0xD4 in decimal). If it had been encoded using an 8 bit offset, it would be -44 in decimal, which is not what you wanted.
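The sign issue is easy to check numerically. A minimal sketch in Python, with hypothetical helper names as_signed8 and fits_disp8:

```python
def as_signed8(byte):
    """Interpret one byte the way the CPU sign-extends an 8-bit displacement."""
    return byte - 256 if byte >= 0x80 else byte


def fits_disp8(displacement):
    """True if a displacement can use the short one-byte encoding."""
    return -128 <= displacement <= 127


print(as_signed8(0xD4))    # what the 4-byte encoding would really address
print(fits_disp8(0xD4))    # 212 is out of range for disp8
print(fits_disp8(0x7F))    # 127 is the largest forward offset that fits
```

So only stack slots within 127 bytes above (or 128 bytes below) the base register can use the short form.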
I know that add is faster than mul.
I want to know how to use add instead of mul in the following code in order to make it more efficient.
Sample code:
mov eax, [ebp + 8] #eax = x1
mov ecx, [ebp + 12] #ecx = x2
mov edx, [ebp + 16] #edx = y1
mov ebx, [ebp + 20] #ebx = y2
sub eax,ecx #eax = x1-x2
sub edx,ebx #edx = y1-y2
mul edx #eax = (x1-x2)*(y1-y2)
add is faster than mul, but if you want to multiply two general values, mul is far faster than any loop iterating add operations.
You can't seriously use add to make that code go faster than it will with mul. If you needed to multiply by some small constant value (such as 2), then maybe you could use add to speed things up. But for the general case - no.
If you are multiplying two values that you don't know in advance, it is effectively impossible to beat the multiply instruction in x86 assembler.
If you know the value of one of the operands in advance, you may be able to beat the multiply instruction by using a small number of adds. This works particularly well when the known operand is small and has only a few bits set in its binary representation. To multiply an unknown value x by a known value of the form 2^p+2^q+...+2^r, you simply add x*2^p+x*2^q+...+x*2^r, where p, q, ... and r are the bits that are set. This is easily accomplished in assembler by left shifting and adding:
; x in EDX
; product to EAX
xor eax,eax
shl edx,r ; x*2^r
add eax,edx
shl edx,q-r ; x*2^q
add eax,edx
shl edx,p-q ; x*2^p
add eax,edx
The key problem with this is that it takes at least 4 clocks to do this, assuming
a superscalar CPU constrained by register dependencies. Multiply typically takes
10 or fewer clocks on modern CPUs, and if this sequence gets longer than that in time
you might as well do a multiply.
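As a sanity check of the shift-and-add scheme, the following Python sketch derives the shift counts (r, q-r, p-q, ...) from a constant and replays them with 32-bit wraparound. The names shift_add_plan and apply_plan are hypothetical, and the model says nothing about timing.

```python
MASK32 = 0xFFFFFFFF


def shift_add_plan(constant):
    """Return the successive shift counts (r, q-r, p-q, ...) for a constant > 0."""
    plan = []
    previous_bit = 0
    bit = 0
    while constant >> bit:
        if (constant >> bit) & 1:
            plan.append(bit - previous_bit)
            previous_bit = bit
        bit += 1
    return plan


def apply_plan(x, plan):
    """Replay the shl/add pairs on 32-bit registers."""
    eax = 0                              # xor eax,eax
    edx = x & MASK32
    for count in plan:
        edx = (edx << count) & MASK32    # shl edx,count
        eax = (eax + edx) & MASK32       # add eax,edx
    return eax


print(shift_add_plan(41))                 # 41 = 2^5 + 2^3 + 2^0
print(apply_plan(7, shift_add_plan(41)))  # same as 7 * 41
```

Each set bit costs one shl/add pair, which is why the approach only pays off for constants with few set bits.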
To multiply by 9:
mov eax,edx ; same effect as xor eax,eax / shl edx,0 / add eax,edx
shl edx,3 ; x*2^3
add eax,edx
This beats multiply; should only take 2 clocks.
What is less well known is the use of the LEA (load effective address) instruction
to accomplish a fast multiply-by-small-constant.
LEA takes only a single clock in the worst case, and its execution time can often
be overlapped with other instructions by superscalar CPUs.
LEA is essentially "add two values with small constant multipliers".
It computes t=2^k*x+y for k=0,1,2,3 (see the Intel reference manual) for t, x and y
being any register. Using the same register for x and y (or omitting y), you can get 1,2,3,4,5,8,9 times x,
but using x and y as separate registers allows for intermediate results to be combined
and moved to other registers (e.g., to t), and this turns out to be remarkably handy.
Using it, you can accomplish a multiply by 9 using a single instruction:
lea eax,[edx*8+edx] ; takes 1 clock
Using LEA carefully, you can multiply by a variety of peculiar constants in a small number of cycles:
lea eax,[edx*4+edx] ; 5 * edx
lea eax,[eax*2+edx] ; 11 * edx
lea eax,[eax*4] ; 44 * edx
To do this, you have to decompose your constant multiplier into various factors/sums involving
1,2,3,4,5,8 and 9. It is remarkable how many small constants you can do this for, and still
only use 3-4 instructions.
If you allow the use of other typically single-clock instructions (e.g., SHL/SUB/NEG/MOV)
you can multiply by some constant values that pure LEA can't
do as efficiently by itself. To multiply by 31:
lea eax,[4*edx]
lea eax,[8*eax] ; 32*edx
sub eax,edx ; 31*edx ; 3 clocks
The corresponding LEA sequence is longer:
lea eax,[edx*4+edx] ; 5 * edx
lea eax,[edx*2+eax] ; 7 * edx
lea eax,[eax*2+edx] ; 15 * edx
lea eax,[eax*2+edx] ; 31 * edx ; 4 clocks
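The LEA sequences above are easy to get wrong by one factor, so here is a small Python model for checking them. The helper lea and the function names are hypothetical; the model only reproduces the base + index*scale arithmetic with 32-bit wraparound.

```python
MASK32 = 0xFFFFFFFF


def lea(base=0, index=0, scale=1):
    """Model of lea dst,[base + index*scale] on 32-bit registers."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only allows scale factors 1, 2, 4 and 8")
    return (base + index * scale) & MASK32


def times_44(edx):
    eax = lea(base=edx, index=edx, scale=4)   # 5 * edx
    eax = lea(base=edx, index=eax, scale=2)   # 11 * edx
    return lea(index=eax, scale=4)            # 44 * edx


def times_31_shift_sub(edx):
    eax = lea(index=edx, scale=4)             # 4 * edx
    eax = lea(index=eax, scale=8)             # 32 * edx
    return (eax - edx) & MASK32               # 31 * edx


def times_31_lea_only(edx):
    eax = lea(base=edx, index=edx, scale=4)   # 5 * edx
    eax = lea(base=eax, index=edx, scale=2)   # 7 * edx
    eax = lea(base=edx, index=eax, scale=2)   # 15 * edx
    return lea(base=edx, index=eax, scale=2)  # 31 * edx


print(times_44(3), times_31_shift_sub(10), times_31_lea_only(10))
```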
Figuring out these sequences is a bit tricky, but you can set up an organized attack.
Since LEA, SHL, SUB, NEG, MOV are all single-clock instructions in the worst
case, and zero clocks if they have no dependences on other instructions, you can compute the execution cost of any such sequence. This means you can implement a dynamic programming algorithm to generate the best possible sequence of such instructions.
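To give a feel for such a search, here is a deliberately simplified sketch in Python. It is not the algorithm used in the compiler described here: it is a plain breadth-first search over a single accumulator register, it only knows a handful of LEA forms plus "sub acc,x", and it minimizes instruction count rather than clocks. The name shortest_sequence, the overshoot limit and max_steps are all assumptions of this sketch.

```python
from collections import deque


def shortest_sequence(target, max_steps=6):
    """Fewest single-accumulator steps that turn x into target * x."""
    if target == 1:
        return []
    limit = target + 8           # allow a small overshoot for a final sub
    seen = {1}
    queue = deque([(1, [])])     # (current multiple of x, steps so far)
    while queue:
        value, steps = queue.popleft()
        if len(steps) == max_steps:
            continue
        candidates = []
        for scale in (2, 4, 8):
            candidates.append((value * scale, f"lea acc,[acc*{scale}]"))
        for scale in (1, 2, 4, 8):
            candidates.append((value * scale + 1, f"lea acc,[acc*{scale}+x]"))
            candidates.append((value * (scale + 1), f"lea acc,[acc*{scale}+acc]"))
        candidates.append((value - 1, "sub acc,x"))
        for new_value, text in candidates:
            if 0 < new_value <= limit and new_value not in seen:
                if new_value == target:
                    return steps + [text]
                seen.add(new_value)
                queue.append((new_value, steps + [text]))
    return None                  # nothing found within max_steps


for constant in (9, 31, 44):
    print(constant, shortest_sequence(constant))
```

Within this restricted instruction set the search needs one step for 9 and three steps each for 31 and 44, which matches the hand-written sequences above. A real generator would also track extra temporaries and dependency chains.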
This is only useful if the clock count is smaller than the integer multiply for your particular CPU
(I use 5 clocks as rule of thumb), and it doesn't use up all the registers, or
at least it doesn't use up registers that are already busy (avoiding any spills).
I've actually built this into our PARLANSE compiler, and it is very effective for computing offsets into arrays of structures A[i], where the size of the structure element in A is the known constant. A clever person would possibly cache the answer so it doesn't
have to be recomputed each time multiplying the same constant occurs; I didn't actually do that because
the time to generate such sequences is less than you'd expect.
It is mildly interesting to print out the sequences of instructions needed to multiply by all constants
from 1 to 10000. Most of them can be done in 5-6 instructions worst case.
As a consequence, the PARLANSE compiler hardly ever uses an actual multiply when indexing even the nastiest
arrays of nested structures.
Unless your multiplications are fairly simplistic, the add most likely won't outperform a mul. Having said that, you can use add to do multiplications:
Multiply by 2:
add eax,eax ; x2
Multiply by 4:
add eax,eax ; x2
add eax,eax ; x4
Multiply by 8:
add eax,eax ; x2
add eax,eax ; x4
add eax,eax ; x8
They work nicely for powers of two. I'm not saying they're faster. They were certainly necessary in the days before fancy multiplication instructions. That's from someone whose soul was forged in the hell-fires that were the Mostek 6502, Zilog z80 and RCA1802 :-)
You can even multiply by non-powers by simply storing interim results:
Multiply by 9:
push ebx ; preserve
push eax ; save for later
add eax,eax ; x2
add eax,eax ; x4
add eax,eax ; x8
pop ebx ; get original eax into ebx
add eax,ebx ; x9
pop ebx ; recover original ebx
I generally suggest that you write your code primarily for readability and only worry about performance when you need it. However, if you're working in assembler, you may well already be at that point. But I'm not sure my "solution" is really applicable to your situation since you have an arbitrary multiplicand.
You should, however, always profile your code in the target environment to ensure that what you're doing is actually faster. Assembler doesn't change that aspect of optimisation at all.
If you really want to see some more general purpose assembler for using add to do multiplication, here's a routine that will take two unsigned values in ax and bx and return the product in ax. It will not handle overflow elegantly.
START: MOV AX, 0007 ; Load up registers
MOV BX, 0005
CALL MULT ; Call multiply function.
HLT ; Stop.
MULT: PUSH BX ; Preserve BX, CX, DX.
PUSH CX
PUSH DX
XOR CX,CX ; CX is the accumulator.
CMP BX, 0 ; If multiplying by zero, just stop.
JZ FIN
MORE: PUSH BX ; Xfer BX to DX for bit check.
POP DX
AND DX, 0001 ; Is lowest bit 1?
JZ NOADD ; No, do not add.
ADD CX,AX
NOADD: SHL AX,1 ; Shift AX left (double).
SHR BX,1 ; Shift BX right (integer halve, next bit).
JNZ MORE ; Keep going until no more bits in BX.
FIN: PUSH CX ; Xfer product from CX to AX.
POP AX
POP DX ; Restore registers and return.
POP CX
POP BX
RET
It relies on the fact that 123 multiplied by 456 is identical to:
123 x 6
+ 1230 x 5
+ 12300 x 4
which is the same way you were taught multiplication back in grade/primary school. It's easier with binary since you're only ever multiplying by zero or one (in other words, either adding or not adding).
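For readers who would rather trace the idea than the register shuffling, here is the same shift-and-add loop modelled in Python with 16-bit wraparound. The name mult16 is made up, and the comments map each line back to the routine above.

```python
MASK16 = 0xFFFF


def mult16(ax, bx):
    """Shift-and-add product of two unsigned 16-bit values, truncated to 16 bits."""
    cx = 0                             # XOR CX,CX  (accumulator)
    while bx != 0:                     # CMP BX,0 / JNZ MORE
        if bx & 1:                     # AND DX,0001: is the lowest bit 1?
            cx = (cx + ax) & MASK16    # ADD CX,AX
        ax = (ax << 1) & MASK16        # SHL AX,1   (double)
        bx >>= 1                       # SHR BX,1   (next bit)
    return cx


print(mult16(7, 5))        # the values loaded at START
print(mult16(123, 456))
print(mult16(300, 300))    # overflows 16 bits, so the result wraps
```

The loop runs once per bit of the second operand, so at most 16 iterations for 16-bit values.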
It's pretty old-school x86 (8086, from a DEBUG session - I can't believe they still actually include that thing in XP) since that was about the last time I coded directly in assembler. There's something to be said for high level languages :-)
When it comes to assembly instructions, the execution speed of any instruction is measured in clock cycles. The mul instruction always takes more clock cycles than an add operation, but if you execute the same add instruction in a loop, the overall clock count for doing a multiplication with add will be far higher than for a single mul instruction. You can have a look at the following URLs, which list the clock cycles of a single add/mul instruction, so that you can do the math on which one will be faster.
http://home.comcast.net/~fbui/intel_a.html#add
http://home.comcast.net/~fbui/intel_m.html#mul
My recommendation is to use the mul instruction rather than putting add in a loop; the latter is a very inefficient solution.
I'd have to echo the responses you have already - for a general multiply you're best off using MUL - after all it's what it's there for!
In some specific cases, where you know you'll be wanting to multiply by a specific fixed value each time (for example, in working out a pixel index in a bitmap) then you can consider breaking the multiply down into a (small) handful of SHLs and ADDs - e.g.:
1280 x 1024 display - each line on the
display is 1280 pixels.
1280 = 1024 + 256 = 2^10 + 2^8
y * 1280 = y * (2 ^ 10) + y * (2 ^ 8)
= ADD (SHL y, 10), (SHL y, 8)
...given that graphics processing is likely to need to be speedy, such an approach may save you precious clock cycles.
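The decomposition is simple to verify. A minimal Python sketch; row_offset_1280 and pixel_index are hypothetical names, and pixel_index assumes one array element per pixel:

```python
def row_offset_1280(y):
    """y * 1280 computed as (y << 10) + (y << 8), since 1280 = 2^10 + 2^8."""
    return (y << 10) + (y << 8)


def pixel_index(x, y):
    """Index of pixel (x, y) in a 1280-wide framebuffer, one element per pixel."""
    return row_offset_1280(y) + x


print(row_offset_1280(3))    # 3 * 1280
print(pixel_index(5, 2))     # 2 * 1280 + 5
```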