How to convert int32x4_t to float32x4_t - neon

Can anyone tell me how I can convert an int32x4_t to a float32x4_t using NEON intrinsics?
I am working on i.MX6 hardware with a Cortex-A9 processor. I am new to NEON.
Any help is appreciated.

Try this:
float32x4_t vcvtq_n_f32_s32(int32x4_t a, __constrange(1,32) int b); // VCVT.F32.S32 q0, q0, #32
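Note that the prototype above is the fixed-point variant: it divides by 2^b while converting. A minimal sketch of both conversion intrinsics (vcvtq_f32_s32 is the plain element-wise conversion):

#include <arm_neon.h>

void convert_example(void) {
    int32x4_t ints = vdupq_n_s32(42);

    // Plain conversion: each lane becomes (float)lane.
    float32x4_t plain = vcvtq_f32_s32(ints);

    // Fixed-point conversion: each lane becomes lane / 2^8,
    // useful when the integers hold Q8 fixed-point values.
    float32x4_t q8 = vcvtq_n_f32_s32(ints, 8);

    (void)plain;
    (void)q8;
}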

Related

Display Potentiometer Value on LCD

I am doing a project using Mbed OS where I have to use an LCD 1602 to display the value of a potentiometer.
I was able to connect the LCD and display "Hello World" in my previous project, but I don't know how to create a program that reads the potentiometer. I've read through the Arm Mbed website and still don't know how to write the code.
I'm using a Nucleo board.
The source code for my project is as follows.
#include "mbed.h"
#include "TextLCD.h"
AnalogOut mypot(A0);
TextLCD lcd(D1, D2, D4, D5, D6, D7 ); // rs, rw, e, d4, d5, d6, d7
int main() {
while(1){
for(i=0.0f, i<1.0f , i+=1.0f){
mypot = i
lcd.printf("mypot.read()");
}
}
}
I assume that you want to read the input voltage that is modified by changing the setting of a potentiometer, a kind of variable resistor with a knob that, when twisted, divides the incoming voltage into a different output voltage.
It looks like you are using the wrong class, AnalogOut; you should instead be using the class AnalogIn. See https://os.mbed.com/docs/mbed-os/v5.15/apis/analogin.html
Use the AnalogIn API to read an external voltage applied to an analog input pin. AnalogIn() reads the voltage as a fraction of the system voltage. The value is a floating point from 0.0 (VSS) to 1.0 (VCC). For example, if you have a 3.3V system and the applied voltage is 1.65V, then AnalogIn() reads 0.5 as the value.
So your program should look something like the following. I don't have your hardware, so I can't test this, but it is in accordance with the Mbed documentation. You will need to make sure things are wired up to the proper analog pin and that you specify the correct analog pin to read from.
#include "mbed.h"
#include "TextLCD.h"
AnalogIn mypot(A0);
TextLCD lcd(D1, D2, D4, D5, D6, D7 ); // rs, rw, e, d4, d5, d6, d7
int main() {
while(1) { // begin infinite loop
// read the voltage level from the analog pin. the value is in
// the range of 0.0 to 1.0 and is interpreted as a percentage from 0% to 100%.
float voltage = mypot.read(); // Read input voltage, a float in the range [0.0, 1.0]
lcd.printf("normalized voltage read is %f\n", voltage);
wait(0.5f); // wait a half second then we poll the voltage again.
}
}
wait() is deprecated but it should work. See https://os.mbed.com/docs/mbed-os/v5.15/apis/wait.html
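If you want to avoid the deprecated call, ThisThread::sleep_for should work as a replacement on Mbed OS 5.10 and later (a sketch; check the API of your Mbed version):

// Instead of wait(0.5f):
ThisThread::sleep_for(500); // sleep for 500 ms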
See as well Working with Text Displays.

Division of high numbers in assembly

I tried to find the location of a cursor within an array of 1896 cells (which covers the whole console in 2D: 79*24). For this I took the location and divided it by 79.
MOV AX, [Y-16H]
DEC AX
MOV BX, 79
DIV BX
MOV Z, DL
ADD Z, DH
MOV DL, Z
MOV Z, AL
ADD Z, AH
MOV DH, Z
I get an overflow error. Can you tell me what I am doing wrong, please? Maybe suggest a solution?
DIV BX divides the 32-bit number formed by DX (high word) and AX (low word) by BX. You therefore need to clear DX (e.g. XOR DX,DX) prior to the division to avoid an overflow.
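To make the overflow condition concrete, here is what DIV BX computes, sketched in C (a hypothetical helper, not part of the original answer):

#include <stdint.h>

// DIV BX: the 32-bit dividend is DX:AX. The CPU raises a divide
// (overflow) fault if the quotient does not fit in 16 bits; this C
// sketch simply truncates instead. bx must be non-zero, since DIV
// also faults on divide-by-zero. Clearing DX first guarantees the
// dividend is just AX.
void div_bx(uint16_t *dx, uint16_t *ax, uint16_t bx) {
    uint32_t dividend = ((uint32_t)*dx << 16) | *ax;
    *ax = (uint16_t)(dividend / bx); // quotient -> AX
    *dx = (uint16_t)(dividend % bx); // remainder -> DX
}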
By the way, are you sure you don't want to divide by 80? I've never heard of a 79-column console, although I'm no expert on such matters.
As Michael mentioned, you need to clear the DX register prior to division.
That said, if you are interested in speed (the usual reason for assembly coding), it is much faster to transform the division by 79 into an equivalent operation using a multiplication and a right shift: (x * 53093) >> 22.
This works because 1/79th is approximately equal to 53093 / (2**22).
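If you want to convince yourself of that identity, a quick C check (a hypothetical test harness):

#include <assert.h>
#include <stdint.h>

int main(void) {
    // (x * 53093) >> 22 matches x / 79 over the whole console range
    // (in fact for all x below roughly 97000).
    for (uint32_t x = 0; x < 1896; ++x)
        assert((x * 53093u) >> 22 == x / 79u);
    return 0;
}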

Why movlps and movhps SSE instructions are faster than movups for transferring misaligned data?

I found that some SSE-optimized math code uses a combination of the movlps and movhps instructions instead of a single movups instruction to transfer misaligned data. I didn't know why, so I tried it myself with the pseudo-code below:
struct Vec4
{
    float f[4];
};

const size_t nSize = sizeof(Vec4) * 100;
Vec4* pA = (Vec4*)malloc( nSize );
Vec4* pB = (Vec4*)malloc( nSize );
Vec4* pR = (Vec4*)malloc( nSize );
...Some data initialization code here
...Records current time by QueryPerformanceCounter()
for( int i=0; i<100000; ++i )
{
    for( int j=0; j<100; ++j )
    {
        Vec4* a = &pA[j];
        Vec4* b = &pB[j];
        Vec4* r = &pR[j];
        __asm
        {
            mov eax, a
            mov ecx, b
            mov edx, r
            ...option 1:
            movups xmm0, [eax]
            movups xmm1, [ecx]
            mulps xmm0, xmm1
            movups [edx], xmm0
            ...option 2:
            movlps xmm0, [eax]
            movhps xmm0, [eax+8]
            movlps xmm1, [ecx]
            movhps xmm1, [ecx+8]
            mulps xmm0, xmm1
            movlps [edx], xmm0
            movhps [edx+8], xmm0
        }
    }
}
...Calculates passed time
free( pA );
free( pB );
free( pR );
I ran the code many times and averaged the timings.
For the movups version, the result is about 50 ms.
For the movlps/movhps version, the result is about 46 ms.
I also tried a data-aligned version with the __declspec(align(16)) specifier on the structure, allocated by _aligned_malloc(); its result is about 34 ms.
Why is the combination of movlps and movhps faster? Does it mean we'd be better off using movlps and movhps instead of movups?
Athlons of this generation (K8) only have 64-bit wide ALU units. So every 128-bit SSE instruction needs to be split into two 64-bit instructions, which incurs an overhead for some instructions.
On this type of processor you will generally find no speedup using SSE compared to equivalent MMX code.
Quoting Agner Fog in The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers:
12.9 64 bit versus 128 bit instructions
It is a big advantage to use 128-bit instructions on K10, but not on K8 because each 128-bit instruction is split into two 64-bit macro-operations on the K8.
128 bit memory write instructions are handled as two 64-bit macro-operations on K10, while a 128 bit memory read is done with a single macro-operation on K10 (2 on K8).
128 bit memory read instructions use only the FMISC unit on K8, but all three units on K10. It is therefore not advantageous to use XMM registers just for moving blocks of data from one memory position to another on the K8, but it is advantageous on K10.
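For reference, the two options can also be written with SSE intrinsics rather than inline assembly - a sketch, not the asker's code, assuming the compiler emits movups for _mm_loadu_ps and movlps/movhps for _mm_loadl_pi/_mm_loadh_pi (which is what those intrinsics map to):

#include <xmmintrin.h> // SSE

// Option 1: single unaligned 128-bit load/store per vector.
void mul_movups(const float* a, const float* b, float* r) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(r, _mm_mul_ps(va, vb));
}

// Option 2: two 64-bit halves per vector via movlps/movhps.
void mul_movlh(const float* a, const float* b, float* r) {
    __m128 va = _mm_loadl_pi(_mm_setzero_ps(), (const __m64*)a);
    va = _mm_loadh_pi(va, (const __m64*)(a + 2));
    __m128 vb = _mm_loadl_pi(_mm_setzero_ps(), (const __m64*)b);
    vb = _mm_loadh_pi(vb, (const __m64*)(b + 2));
    __m128 vr = _mm_mul_ps(va, vb);
    _mm_storel_pi((__m64*)r, vr);
    _mm_storeh_pi((__m64*)(r + 2), vr);
}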
movups works with unaligned data. movlps and movhps each move only 8 bytes and have no 16-byte alignment requirement, which is why the pair can beat a single movups on unaligned data on these CPUs. For time measurement and comparison it is better to use rdtsc, not milliseconds.

Which one is better, gcc or armcc for NEON optimizations?

Referring to @auselen's answer here: Using ARM NEON intrinsics to add alpha and permute, it looks like the armcc compiler is far better than the gcc compiler for NEON optimizations. Is this really true? I haven't actually tried the armcc compiler, but I got pretty well-optimized code using the gcc compiler with the -O3 optimization flag. Now I'm wondering whether armcc is really that good. So which of the two compilers is better, considering all the factors?
Compilers are software as well; they tend to improve over time. Any generic claim like "armcc is better than GCC on NEON" (or, better put, at vectorization) can't hold true forever, since one developer group can close the gap with enough attention. However, initially it is logical to expect compilers developed by hardware companies to be superior, because they need to demonstrate/market these features.
One recent example I saw was here on Stack Overflow about an answer for branch prediction. Quoting from last line of updated section "This goes to show that even mature modern compilers can vary wildly in their ability to optimize code...".
I am a big fan of GCC, but I wouldn't bet on quality of code produced by it against compilers from Intel or ARM. I expect any mainstream commercial compiler to produce code at least as good as GCC.
One empirical answer to this question could be to use hilbert-space's neon optimization example and see how different compilers optimize it.
void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
    int i;
    uint8x8_t rfac = vdup_n_u8 (77);
    uint8x8_t gfac = vdup_n_u8 (151);
    uint8x8_t bfac = vdup_n_u8 (28);
    n /= 8;

    for (i = 0; i < n; i++)
    {
        uint16x8_t temp;
        uint8x8x3_t rgb = vld3_u8 (src);
        uint8x8_t result;

        temp = vmull_u8 (rgb.val[0], rfac);
        temp = vmlal_u8 (temp, rgb.val[1], gfac);
        temp = vmlal_u8 (temp, rgb.val[2], bfac);
        result = vshrn_n_u16 (temp, 8);
        vst1_u8 (dest, result);
        src += 8*3;
        dest += 8;
    }
}
This is armcc 5.01
20: f421140d vld3.8 {d1-d3}, [r1]!
24: e2822001 add r2, r2, #1
28: f3810c04 vmull.u8 q0, d1, d4
2c: f3820805 vmlal.u8 q0, d2, d5
30: f3830806 vmlal.u8 q0, d3, d6
34: f2880810 vshrn.i16 d0, q0, #8
38: f400070d vst1.8 {d0}, [r0]!
3c: e1520003 cmp r2, r3
40: bafffff6 blt 20 <neon_convert+0x20>
This is GCC 4.4.3-4.7.1
1e: f961 040d vld3.8 {d16-d18}, [r1]!
22: 3301 adds r3, #1
24: 4293 cmp r3, r2
26: ffc0 4ca3 vmull.u8 q10, d16, d19
2a: ffc1 48a6 vmlal.u8 q10, d17, d22
2e: ffc2 48a7 vmlal.u8 q10, d18, d23
32: efc8 4834 vshrn.i16 d20, q10, #8
36: f940 470d vst1.8 {d20}, [r0]!
3a: d1f0 bne.n 1e <neon_convert+0x1e>
These look extremely similar, so we have a draw. After seeing this I tried the add-alpha-and-permute example mentioned above again.
void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix)
{
    numPix /= 8; // process 8 pixels at a time
    uint8x8_t alpha = vdup_n_u8 (0xff);

    for (int i = 0; i < numPix; i++)
    {
        uint8x8x3_t rgb = vld3_u8 (src);
        uint8x8x4_t bgra;

        bgra.val[0] = rgb.val[2]; // these lines are slow
        bgra.val[1] = rgb.val[1]; // these lines are slow
        bgra.val[2] = rgb.val[0]; // these lines are slow
        bgra.val[3] = alpha;

        vst4_u8(dst, bgra);
        src += 8*3;
        dst += 8*4;
    }
}
Compiling with gcc...
$ arm-linux-gnueabihf-gcc --version
arm-linux-gnueabihf-gcc (crosstool-NG linaro-1.13.1-2012.05-20120523 - Linaro GCC 2012.05) 4.7.1 20120514 (prerelease)
$ arm-linux-gnueabihf-gcc -std=c99 -O3 -c ~/temp/permute.c -marm -mfpu=neon-vfpv4 -mcpu=cortex-a9 -o ~/temp/permute_gcc.o
00000000 <neonPermuteRGBtoBGRA>:
0: e3520000 cmp r2, #0
4: e2823007 add r3, r2, #7
8: b1a02003 movlt r2, r3
c: e92d01f0 push {r4, r5, r6, r7, r8}
10: e1a021c2 asr r2, r2, #3
14: e24dd01c sub sp, sp, #28
18: e3520000 cmp r2, #0
1c: da000019 ble 88 <neonPermuteRGBtoBGRA+0x88>
20: e3a03000 mov r3, #0
24: f460040d vld3.8 {d16-d18}, [r0]!
28: eccd0b06 vstmia sp, {d16-d18}
2c: e59dc014 ldr ip, [sp, #20]
30: e2833001 add r3, r3, #1
34: e59d6010 ldr r6, [sp, #16]
38: e1530002 cmp r3, r2
3c: e59d8008 ldr r8, [sp, #8]
40: e1a0500c mov r5, ip
44: e59dc00c ldr ip, [sp, #12]
48: e1a04006 mov r4, r6
4c: f3c73e1f vmov.i8 d19, #255 ; 0xff
50: e1a06008 mov r6, r8
54: e59d8000 ldr r8, [sp]
58: e1a0700c mov r7, ip
5c: e59dc004 ldr ip, [sp, #4]
60: ec454b34 vmov d20, r4, r5
64: e1a04008 mov r4, r8
68: f26401b4 vorr d16, d20, d20
6c: e1a0500c mov r5, ip
70: ec476b35 vmov d21, r6, r7
74: f26511b5 vorr d17, d21, d21
78: ec454b34 vmov d20, r4, r5
7c: f26421b4 vorr d18, d20, d20
80: f441000d vst4.8 {d16-d19}, [r1]!
84: 1affffe6 bne 24 <neonPermuteRGBtoBGRA+0x24>
88: e28dd01c add sp, sp, #28
8c: e8bd01f0 pop {r4, r5, r6, r7, r8}
90: e12fff1e bx lr
Compiling with armcc...
$ armcc
ARM C/C++ Compiler, 5.01 [Build 113]
$ armcc --C99 --cpu=Cortex-A9 -O3 -c permute.c -o permute_arm.o
00000000 <neonPermuteRGBtoBGRA>:
0: e1a03fc2 asr r3, r2, #31
4: f3870e1f vmov.i8 d0, #255 ; 0xff
8: e0822ea3 add r2, r2, r3, lsr #29
c: e1a031c2 asr r3, r2, #3
10: e3a02000 mov r2, #0
14: ea000006 b 34 <neonPermuteRGBtoBGRA+0x34>
18: f420440d vld3.8 {d4-d6}, [r0]!
1c: e2822001 add r2, r2, #1
20: eeb01b45 vmov.f64 d1, d5
24: eeb02b46 vmov.f64 d2, d6
28: eeb05b40 vmov.f64 d5, d0
2c: eeb03b41 vmov.f64 d3, d1
30: f401200d vst4.8 {d2-d5}, [r1]!
34: e1520003 cmp r2, r3
38: bafffff6 blt 18 <neonPermuteRGBtoBGRA+0x18>
3c: e12fff1e bx lr
In this case armcc produces much better code. I think this justifies fgp's answer above. Most of the time GCC will produce good enough code, but you should keep an eye on the critical parts, and most importantly: first you must measure/profile.
If you use NEON intrinsics, the compiler shouldn't matter that much. Most (if not all) NEON intrinsics translate to a single NEON instruction, so the only thing left to the compiler is register allocation and instruction scheduling. In my experience, both GCC 4.2 and Clang 3.1 do reasonably well at those tasks.
Note, however, that the NEON instructions are a bit more expressive than the NEON intrinsics. For example, the NEON load/store instructions have pre- and post-increment addressing modes which combine a load or store with an increment of the address register, thus saving you one instruction. The NEON intrinsics don't provide an explicit way to do that, but instead rely on the compiler to combine a regular NEON load/store intrinsic and an address increment into a load/store instruction with post-increment. Similarly, some load/store instructions allow you to specify the alignment of the memory address, and execute faster if you specify stricter alignment guarantees. The NEON intrinsics, again, don't allow you to specify alignment explicitly, but instead rely on the compiler to deduce the correct alignment specifier. In theory, you can use "align" attributes on your pointers to provide suitable hints to the compiler, but at least Clang seems to ignore those...
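As a small illustration of the post-increment point (a sketch; the helper name is made up):

#include <arm_neon.h>

// The compiler is expected to fold the load and the pointer bump into
// a single post-incremented load, e.g. "vld3.8 {d0-d2}, [r0]!".
uint8x8x3_t load_rgb_and_advance(const uint8_t **src) {
    uint8x8x3_t rgb = vld3_u8(*src);
    *src += 8 * 3; // the increment the compiler should fuse into vld3.8
    return rgb;
}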
In my experience, neither Clang nor GCC are very bright when it comes to those kinds of optimizations. Fortunately, the additional performance benefit of these kinds of optimization usually isn't all that high - it's more like 10% than 100%.
Another area where those two compilers aren't particularly smart is avoidance of stack spilling. If your code uses more vector-valued variables than there are NEON registers, I've seen both compilers produce horrible code. Basically, what they seem to do is schedule instructions based on the assumption that there are enough registers available. Register allocation seems to come afterwards, and seems to simply spill values to the stack once it runs out of registers. So make sure your code has a working set of less than 16 128-bit vectors or 32 64-bit vectors at any time!
Overall, I've gotten pretty good results from both GCC and Clang, but I regularly had to reorganize the code a bit to avoid compiler idiosyncrasies. My advice would be to stick with GCC or Clang, but check the output regularly with the disassembler of your choice.
So, overall, I'd say sticking with GCC is fine. You might want to look at the disassembly of the performance-critical parts, though, and check whether it looks reasonable.

add vs mul (IA32-Assembly)

I know that add is faster than mul.
I want to know how to use add instead of mul in the following code in order to make it more efficient.
Sample code:
mov eax, [ebp + 8] #eax = x1
mov ecx, [ebp + 12] #ecx = x2
mov edx, [ebp + 16] #edx = y1
mov ebx, [ebp + 20] #ebx = y2
sub eax,ecx #eax = x1-x2
sub edx,ebx #edx = y1-y2
mul edx #eax = (x1-x2)*(y1-y2)
add is faster than mul, but if you want to multiply two general values, mul is far faster than any loop iterating add operations.
You can't seriously use add to make that code go faster than it will with mul. If you needed to multiply by some small constant value (such as 2), then maybe you could use add to speed things up. But for the general case - no.
If you are multiplying two values that you don't know in advance, it is effectively impossible to beat the multiply instruction in x86 assembler.
If you know the value of one of the operands in advance, you may be able to beat the multiply instruction by using a small number of adds. This works particularly well when the known operand is small and has only a few bits set in its binary representation. To multiply an unknown value x by a known value consisting of 2^p + 2^q + ... + 2^r, you simply add x*2^p + x*2^q + ... + x*2^r, one term for each set bit p, q, ..., r. This is easily accomplished in assembler by left shifting and adding:
; x in EDX
; product to EAX (assumes p > q > r)
xor eax,eax
shl edx,r   ; x*2^r
add eax,edx
shl edx,q-r ; x*2^q
add eax,edx
shl edx,p-q ; x*2^p
add eax,edx
The key problem with this is that it takes at least 4 clocks, assuming a superscalar CPU constrained by register dependencies. Multiply typically takes 10 or fewer clocks on modern CPUs, and if this sequence gets longer than that in time, you might as well do a multiply.
To multiply by 9:
mov eax,edx ; same effect as xor eax,eax/shl edx,0/add eax,edx
shl edx,3 ; x*2^3
add eax,edx
This beats multiply; should only take 2 clocks.
What is less well known is the use of the LEA (load effective address) instruction to accomplish a fast multiply-by-small-constant. LEA takes only a single clock worst case, and its execution can often be overlapped with other instructions by superscalar CPUs.
LEA is essentially an "add two values with small constant multipliers" instruction. It computes t = 2^k * x + y for k = 0, 1, 2, 3 (see the Intel reference manual), with t, x and y being any registers. If x == y, you can get 1, 2, 3, 4, 5, 8, or 9 times x, but using x and y as separate registers allows intermediate results to be combined and moved to other registers (e.g., to t), and this turns out to be remarkably handy.
Using it, you can accomplish a multiply by 9 using a single instruction:
lea eax,[edx*8+edx] ; takes 1 clock
Using LEA carefully, you can multiply by a variety of peculiar constants in a small number of cycles:
lea eax,[edx*4+edx] ; 5 * edx
lea eax,[eax*2+edx] ; 11 * edx
lea eax,[eax*4] ; 44 * edx
To do this, you have to decompose your constant multiplier into various factors/sums involving 1, 2, 3, 4, 5, 8 and 9. It is remarkable how many small constants you can do this for, and still only use 3-4 instructions.
If you allow the use of other typically single-clock instructions (e.g., SHL/SUB/NEG/MOV) you can multiply by some constant values that pure LEA can't do as efficiently by itself. To multiply by 31:
lea eax,[4*edx]     ; 4*edx
lea eax,[8*eax]     ; 32*edx
sub eax,edx         ; 31*edx - 3 clocks
The corresponding pure-LEA sequence is longer:
lea eax,[edx*4+edx] ; 5*edx
lea eax,[edx*2+eax] ; 7*edx
lea eax,[eax*2+edx] ; 15*edx
lea eax,[eax*2+edx] ; 31*edx - 4 clocks
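If you want to sanity-check those sequences, here is straight-line C mirroring both of them (a hypothetical test harness, not part of the original answer):

#include <assert.h>
#include <stdint.h>

// Mirrors: lea eax,[4*edx] / lea eax,[8*eax] / sub eax,edx
static uint32_t mul31_shl_sub(uint32_t x) {
    uint32_t eax = x * 4;  // 4*x
    eax = eax * 8;         // 32*x
    return eax - x;        // 31*x
}

// Mirrors the four-LEA sequence: 5x -> 7x -> 15x -> 31x.
static uint32_t mul31_lea(uint32_t x) {
    uint32_t eax = x * 4 + x;  // 5*x
    eax = x * 2 + eax;         // 7*x
    eax = eax * 2 + x;         // 15*x
    eax = eax * 2 + x;         // 31*x
    return eax;
}

int main(void) {
    for (uint32_t x = 0; x < 100000; ++x) {
        assert(mul31_shl_sub(x) == 31 * x);
        assert(mul31_lea(x) == 31 * x);
    }
    return 0;
}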
Figuring out these sequences is a bit tricky, but you can set up an organized attack. Since LEA, SHL, SUB, NEG and MOV are all single-clock instructions worst case, and zero clocks if they have no dependences on other instructions, you can compute the execution cost of any such sequence. This means you can implement a dynamic programming algorithm to generate the best possible sequence of such instructions.
This is only useful if the clock count is smaller than that of the integer multiply for your particular CPU (I use 5 clocks as a rule of thumb), and if it doesn't use up all the registers, or at least doesn't use up registers that are already busy (avoiding any spills).
I've actually built this into our PARLANSE compiler, and it is very effective for computing offsets into arrays of structures A[i], where the size of the structure element in A is a known constant. A clever person would possibly cache the answer so it doesn't have to be recomputed each time the same constant multiplier occurs; I didn't actually do that, because the time to generate such sequences is less than you'd expect.
It is mildly interesting to print out the sequences of instructions needed to multiply by all constants from 1 to 10000. Most of them can be done in 5-6 instructions worst case.
As a consequence, the PARLANSE compiler hardly ever uses an actual multiply when indexing even the nastiest arrays of nested structures.
Unless your multiplications are fairly simplistic, the add most likely won't outperform a mul. Having said that, you can use add to do multiplications:
Multiply by 2:
add eax,eax ; x2
Multiply by 4:
add eax,eax ; x2
add eax,eax ; x4
Multiply by 8:
add eax,eax ; x2
add eax,eax ; x4
add eax,eax ; x8
They work nicely for powers of two. I'm not saying they're faster. They were certainly necessary in the days before fancy multiplication instructions. That's from someone whose soul was forged in the hell-fires that were the Mostek 6502, Zilog z80 and RCA1802 :-)
You can even multiply by non-powers by simply storing interim results:
Multiply by 9:
push ebx ; preserve
push eax ; save for later
add eax,eax ; x2
add eax,eax ; x4
add eax,eax ; x8
pop ebx ; get original eax into ebx
add eax,ebx ; x9
pop ebx ; recover original ebx
I generally suggest that you write your code primarily for readability and only worry about performance when you need it. However, if you're working in assembler, you may well already be at that point. But I'm not sure my "solution" is really applicable to your situation, since you have an arbitrary multiplicand.
You should, however, always profile your code in the target environment to ensure that what you're doing is actually faster. Assembler doesn't change that aspect of optimisation at all.
If you really want to see some more general-purpose assembler that uses add to do multiplication, here's a routine that takes two unsigned values in ax and bx and returns the product in ax. It will not handle overflow elegantly.
START: MOV AX, 0007 ; Load up registers
MOV BX, 0005
CALL MULT ; Call multiply function.
HLT ; Stop.
MULT: PUSH BX ; Preserve BX, CX, DX.
PUSH CX
PUSH DX
XOR CX,CX ; CX is the accumulator.
CMP BX, 0 ; If multiplying by zero, just stop.
JZ FIN
MORE: PUSH BX ; Xfer BX to DX for bit check.
POP DX
AND DX, 0001 ; Is lowest bit 1?
JZ NOADD ; No, do not add.
ADD CX,AX
NOADD: SHL AX,1 ; Shift AX left (double).
SHR BX,1 ; Shift BX right (integer halve, next bit).
JNZ MORE ; Keep going until no more bits in BX.
FIN: PUSH CX ; Xfer product from CX to AX.
POP AX
POP DX ; Restore registers and return.
POP CX
POP BX
RET
It relies on the fact that 123 multiplied by 456 is identical to:
123 x 6
+ 1230 x 5
+ 12300 x 4
which is the same way you were taught multiplication back in grade/primary school. It's easier with binary since you're only ever multiplying by zero or one (in other words, either adding or not adding).
It's pretty old-school x86 (8086, from a DEBUG session - I can't believe they still actually include that thing in XP) since that was about the last time I coded directly in assembler. There's something to be said for high level languages :-)
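For comparison, here is the same shift-and-add algorithm as a minimal C sketch (a hypothetical helper, with the same 16-bit wrap-around behaviour as the routine above):

#include <stdint.h>

// For each set bit in b, add the correspondingly shifted a into the
// accumulator - exactly what the ADD/SHL/SHR loop above does.
uint16_t mult(uint16_t a, uint16_t b) {
    uint16_t acc = 0;   // CX, the accumulator
    while (b != 0) {
        if (b & 1)      // lowest bit of BX set?
            acc += a;   // ADD CX,AX
        a <<= 1;        // SHL AX,1 (double)
        b >>= 1;        // SHR BX,1 (next bit)
    }
    return acc;         // product, modulo 2^16 like the asm
}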
When it comes to assembly instructions, the speed of executing any instruction is measured in clock cycles. A mul instruction always takes more clock cycles than an add operation, but if you execute add in a loop, the overall clock-cycle count for doing a multiplication with add will be far more than that of a single mul instruction. You can have a look at the following URLs, which list the clock cycles of the individual add/mul instructions, so you can do the math on which one will be faster.
http://home.comcast.net/~fbui/intel_a.html#add
http://home.comcast.net/~fbui/intel_m.html#mul
My recommendation is to use the mul instruction rather than putting add in a loop; the latter is a very inefficient solution.
I'd have to echo the responses you already have - for a general multiply you're best off using MUL - after all, that's what it's there for!
In some specific cases, where you know you'll want to multiply by a specific fixed value each time (for example, when working out a pixel index in a bitmap), you can consider breaking the multiply down into a (small) handful of SHLs and ADDs - e.g.:
A 1280 x 1024 display - each line on the display is 1280 pixels.
1280 = 1024 + 256 = 2^10 + 2^8
y * 1280 = y * (2^10) + y * (2^8)
         = ADD (SHL y, 10), (SHL y, 8)
...given that graphics processing is likely to need to be speedy, such an approach may save you precious clock cycles.
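The same decomposition in C (a trivial sketch; the function name is made up):

unsigned pixel_row_offset(unsigned y) {
    // y*1280 without a multiply: y*1024 + y*256
    return (y << 10) + (y << 8);
}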