Suggestion in ARM NEON optimization - optimization

For academic purposes I want to try to write an ARM NEON optimization of the following algorithm, if only to test whether any performance improvement is possible. I suspect it is not a good candidate for SIMD optimization, because the results are merged together, losing the parallelization gains.
This is the algorithm:
const uchar* center = ...;
int t0, t1, val;
t0 = center[0]; t1 = center[1];
val = t0 < t1;
t0 = center[2]; t1 = center[3];
val |= (t0 < t1) << 1;
t0 = center[4]; t1 = center[5];
val |= (t0 < t1) << 2;
t0 = center[6]; t1 = center[7];
val |= (t0 < t1) << 3;
t0 = center[8]; t1 = center[9];
val |= (t0 < t1) << 4;
t0 = center[10]; t1 = center[11];
val |= (t0 < t1) << 5;
t0 = center[12]; t1 = center[13];
val |= (t0 < t1) << 6;
t0 = center[14]; t1 = center[15];
val |= (t0 < t1) << 7;
d[i] = (uchar)val;
This is what I thought in ARM assembly:
VLD2.8 {d0, d1} ["center" addr]
Supposing 8-bit chars, this first operation should load the t0 and t1 values alternately into 2 registers (de-interleaving them).
VCLT.U8 d2, d0, d1
A single "less than" operation for all the comparisons. NOTES: I've read that VCLT is possible only with a #0 constant as second operand, so this must be inverted into a >=. Reading the ARM documentation, I think the result for every 8-bit value will be "all 1" for true (11111111) or "all 0" for false (00000000).
VSHR.U8 d4, d2, #7
this right shift will delete 7 out of 8 values in the register 8-bit "cells" (mainly to delete 7 ones). I've used d4 because of the next step will be the first d register mapped in q2.
Now problems begin: shifting and ORs.
VSHLL.U8 q2[1], d4[1], 1
VSHLL.U8 q2[2], d4[2], 2
...
VSHLL.U8 q2[7], d4[7], 7
I can imagine only this way (if it's possible to use [offsets]) for left shifts. Q2 should be specified instead of d4 according to the documentation.
VORR(.U8) d4[0], d4[1], d4[0]
VORR(.U8) d4[0], d4[2], d4[0]
...
VORR(.U8) d4[0], d4[7], d4[0]
Last step should give the result.
VST1.8 d4[0], [d[i] addr]
Simple store of the result.
It is my first approach to ARM NEON, so probably many assumptions may be incorrect. Help me understand possible errors, and suggest a better solution if possible.
EDIT:
This is the final working code after the suggested solutions:
__asm__ __volatile ("VLD2.8 {d0, d1}, [%[ordered_center]] \n\t"
"VCGT.U8 d2, d1, d0 \n\t"
"MOV r1, 0x01 \n\t"
"MOV r2, 0x0200 \n\t"
"ORR r2, r2, r1 \n\t"
"MOV r1, 0x10 \n\t"
"MOV r3, 0x2000 \n\t"
"ORR r3, r3, r1 \n\t"
"MOVT r2, 0x0804 \n\t"
"MOVT r3, 0x8040 \n\t"
"VMOV.32 d3[0], r2 \n\t"
"VMOV.32 d3[1], r3 \n\t"
"VAND d0, d2, d3 \n\t"
"VPADDL.U8 d0, d0 \n\t"
"VPADDL.U16 d0, d0 \n\t"
"VPADDL.U32 d0, d0 \n\t"
"VST1.8 d0[0], [%[desc]] \n\t"
:
: [ordered_center] "r" (ordered_center), [desc] "r" (&desc[i])
: "d0", "d1", "d2", "d3", "r1", "r2", "r3");

After the comparison, you have an array of 8 booleans represented by 0xff or 0x00. The reason SIMD comparisons (on any architecture) produce those values is to make them useful for a bit-mask operation (and/or bit-select in NEON's case) so you can turn the result into an arbitrary value quickly, without a multiply.
So rather than reducing them to 1 or 0 and shifting them about, you'll find it easier to mask them with the constant 0x8040201008040201. Then each lane contains the bit corresponding to its position in the final result. You can pre-load the constant into another register (I'll use d3).
VAND d0, d2, d3
Then, to combine the results, you can use VPADD (instead of OR), which will combine adjacent pairs of lanes, d0[0] = d0[0] + d0[1], d0[1] = d0[2] + d0[3], etc... Since the bit patterns do not overlap there is no carry and add works just as well as or. Also, because the output is half as large as the input we have to fill in the second half with junk. I've used a second copy of d0 for that.
You'll need to do the add three times to get all columns combined.
VPADD.u8 d0, d0, d0
VPADD.u8 d0, d0, d0
VPADD.u8 d0, d0, d0
and the result will now be in d0[0].
As you can see, d0 has room for seven more results; and some lanes of the VPADD operations have been working with junk data. It would be better if you could fetch more data at once, and feed that additional work in as you go so that none of the arithmetic is wasted.
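To convince yourself that the mask-and-add trick gives the same byte as the original shift-and-OR code, here is a small scalar model in plain C. It is only a sketch: `vpadd_u8` and `descriptor_byte_model` are names made up for this illustration, and the code imitates the lane arithmetic of VAND and VPADD.U8 rather than using real NEON instructions.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of VPADD.U8 d, n, m: the low four output lanes are the
   pairwise sums of n, the high four are the pairwise sums of m
   (each sum wraps at 8 bits). */
void vpadd_u8(uint8_t d[8], const uint8_t n[8], const uint8_t m[8])
{
    uint8_t out[8];
    for (int i = 0; i < 4; i++) {
        out[i]     = (uint8_t)(n[2 * i] + n[2 * i + 1]);
        out[i + 4] = (uint8_t)(m[2 * i] + m[2 * i + 1]);
    }
    memcpy(d, out, sizeof out);
}

/* Hypothetical helper: models compare, VAND with the mask constant,
   then three VPADD steps, for one group of 16 input bytes. */
uint8_t descriptor_byte_model(const uint8_t center[16])
{
    uint8_t lanes[8];
    for (int i = 0; i < 8; i++) {
        /* compare: 0xff if center[2i] < center[2i+1], else 0x00 */
        uint8_t cmp = (center[2 * i] < center[2 * i + 1]) ? 0xff : 0x00;
        /* lane i of 0x8040201008040201 is (1 << i) */
        lanes[i] = cmp & (uint8_t)(1u << i);
    }
    /* the set bits never overlap, so adding behaves like OR */
    vpadd_u8(lanes, lanes, lanes);
    vpadd_u8(lanes, lanes, lanes);
    vpadd_u8(lanes, lanes, lanes);
    return lanes[0];
}
```

After the third pairwise add every lane holds the same sum, which is why reading lane 0 is enough.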
EDIT
Supposing the loop is unrolled four times; with results in d4, d5, d6, and d7; the constant mentioned earlier should be loaded into, say, d30 and d31, and then some q register arithmetic can be used:
VAND q0, q2, q15
VAND q1, q3, q15
VPADD.u8 d0, d0, d1
VPADD.u8 d2, d2, d3
VPADD.u8 d0, d0, d2
VPADD.u8 d0, d0, d0
With the final result in d0[0..3], or simply the 32-bit value in d0[0].
There seem to be lots of registers free to unroll it further, but I don't know how many of those you'll use up on other calculations.

1) load a d register with the value 0x8040201008040201
2) vand with the result of vclt
3) vpaddl.u8 from 2)
4) vpaddl.u16 from 3)
5) vpaddl.u32 from 4)
6) store the lowest single byte from 5)

Start with expressing the parallelism explicitly to begin with:
int /* bool, whatever ... */ val[8] = {
center[0] < center[1],
center[2] < center[3],
center[4] < center[5],
center[6] < center[7],
center[8] < center[9],
center[10] < center[11],
center[12] < center[13],
center[14] < center[15]
};
d[i] = extract_mask(val);
The shifts are equivalent to a "mask move", as you want each comparison to result in a single bit.
The comparison of the above sixteen values can be done by first doing a structure load (vld2.8) to split adjacent bytes into two uint8x8_t, then the parallel compare. The result of that is a uint8x8_t with either 0xff or 0x00 in the bytes. You want one bit of each, in the respective bit position.
That's a "mask extract"; on Intel SSE2, that would be PMOVMSKB (the _mm_movemask_epi8 intrinsic), but on NEON no direct equivalent exists; three vpadd as shown above (or see "SSE _mm_movemask_epi8 equivalent method for ARM NEON" for more on this) are a suitable substitute.
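For reference, a plain scalar definition of the `extract_mask` used in the snippet above could look like the following. This is just one possible reading of what that helper is meant to do (bit k of the result comes from `val[k]`); the original snippet does not define it.

```c
/* One possible scalar definition of extract_mask(): bit k of the result
   is set when val[k] is non-zero. This is an assumption about the
   helper, which the snippet above leaves undefined. */
unsigned char extract_mask(const int val[8])
{
    unsigned mask = 0;
    for (int k = 0; k < 8; k++)
        mask |= (unsigned)(val[k] != 0) << k;
    return (unsigned char)mask;
}
```

The NEON sequence replaces this loop with one AND and three pairwise adds, but the result should be the same byte.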

load 16-bit data from table in 8051 without modifying DPTR

I'm trying to make a simple routine for the 8051 processor that allows me to load any 16-bit number of my choice from a table stored in code memory without modifying any part of DPTR and without requiring stack space. So push and pop cannot be used. Also, I want to use the least amount of processing time possible.
So far I came up with the following code that sort-of allows me to load a value from a table of 4 16-bit values to accumulator and R2 where R2 has the high byte and A has the low byte.
Is this the most efficient way to do this? If so, how do I calculate how much to add to the accumulator before each movc instruction in this example?
mov A,#2h ;want 2nd entry from table
acall getpointer ;run function below
;here R2:A should form correct 16-bit pointer ( = 0456h)
END
getpointer:
rl A ;multiply A value * 2
mov R2,A ;copy to R2
inc R2 ;R2=A+1
;add something to A but what?
movc A,@A+PC ;Load first byte
xch A,R2 ;put result in R2 and let A=original A+1
;add something to A again but what?
movc A,@A+PC ;load second byte
ret ;keep result in A and exit
mytable:
dw 0123h
dw 0456h
dw 0789h
dw 0000h
Try this:
getpointer:
rl a
mov r2, a
add a, #5 ; skip all insts after 1st movc and 1 byte
movc a, @a+pc
xch a, r2 ; 1-byte
inc a ; 1-byte ; skip all instrs after 2nd movc
movc a, @a+pc ; 1-byte
ret ; 1-byte
mytable:
...
I hope I got it right. Note that movc a, @a+pc first increments pc, then adds a to this incremented value, so the offset in a has to count every byte between the end of that movc and the table entry. This is why I added instruction lengths in the comments, to show how much code there is.
Note that index of 2 corresponds to 0789h, not 0456h.
Also note that you may need to swap a and r2 and the cheapest may be to swap the data within the table.
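If you want to double-check the two offsets without an assembler, a small C model of code memory works. This is only a sketch under a few assumptions: the instruction sizes are the standard 8051 encodings (add a,#imm is 2 bytes, the others here are 1 byte), the routine is placed at address 0, and dw is assumed to store the high byte first, which is common for 8051 assemblers but worth checking for yours. `getpointer_model` and `code_mem` are names invented for this illustration.

```c
#include <stdint.h>

/* Addresses relative to the start of getpointer. Only instruction sizes
   matter here, so the opcode bytes themselves are left as zero. */
enum {
    ADDR_AFTER_MOVC1 = 5, /* rl(1) + mov(1) + add #imm(2) + movc(1) */
    ADDR_AFTER_MOVC2 = 8, /* ... + xch(1) + inc(1) + movc(1) */
    ADDR_TABLE       = 9  /* ... + ret(1) */
};

/* mytable follows the ret; dw assumed to emit the high byte first */
static const uint8_t code_mem[ADDR_TABLE + 8] = {
    [ADDR_TABLE] = 0x01, 0x23, 0x04, 0x56, 0x07, 0x89, 0x00, 0x00
};

/* Returns what the routine leaves behind, packed as (A << 8) | R2.
   Valid for table indices 0..3. */
uint16_t getpointer_model(uint8_t index)
{
    uint8_t a = (uint8_t)(index << 1);      /* rl a (same as *2 here) */
    uint8_t r2 = a;                         /* mov r2, a */
    a = (uint8_t)(a + 5);                   /* add a, #5 */
    a = code_mem[ADDR_AFTER_MOVC1 + a];     /* movc a, @a+pc */
    uint8_t t = a; a = r2; r2 = t;          /* xch a, r2 */
    a = (uint8_t)(a + 1);                   /* inc a */
    a = code_mem[ADDR_AFTER_MOVC2 + a];     /* movc a, @a+pc */
    return (uint16_t)((a << 8) | r2);
}
```

Under these assumptions the model ends with the first table byte of the entry in A and the second in R2, which is the swapped order relative to what the question asked for.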

If an embedded system coded in C is 8 or 16-bit, how will it manipulate 32-bit data types like int?

I think I'm thinking about this the wrong way, but I'm wondering how an embedded system with less than 32-bits can use 32-bit data values. I'm a beginner programmer so go easy on me :)
base 10
0100 <- carry in/out
5432
+1177
======
6609
never brought up in class but we can now extend that to two operations
100
32
+77
======
09
01
54
+11
======
66
and come up with the 6609 result because we understand that it is column based and each column treated separately.
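base 2
1111
+0011
=====
11110 <- carry in/out
1111
+0011
=====
10010
split into two 2-bit operations, low half first
110 <- carry in/out
11
+11
=====
10
111 <- carry in/out (the carry out of the low half comes in here)
11
+00
=====
100
result 10010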
you can break your operations up into however many bits you want 8, 16, 13, 97 whatever. it is column based (for addition) and it just works. division you should be able to figure out, multiplication is just shifting and adding and can turn that into multiple operations as well
n bits * n bits = 2*n bits so if you have an 8 bit * 8 bit = 16 bit multiply you can use that on an 8 bit system otherwise you have to limit to 4 bits * 4 bits = 8 bits and work with that (or if no multiply then just do the shift and add).
base 2
abcd
* 1101
========
abcd
0000
abcd
+abcd
=========
which you can break down into a shifting and adding problem, can do N bits with a 4 or 8 or M bit processor/registers/alu
Or look at it another way, grade school algebra
(a+b)*(c+d) = ac + bc + ad + bd
mnop * tuvw = ((mn*0x100)+(op)) * ((tu*0x100)+(vw)) = (a+b)*(c+d)
and you should find that you can group the terms that carry a 0x100 factor and the ones that do not,
compute them separately, and put the parts of the answer together using an 8 bit alu (or 4 bits of the 8 bit as needed).
shifting should be obvious just move the bits over to the next byte or (half)word or whatever.
and bitwise operations (xor, and, or) are bitwise so dont need anything special just keep the columns lined up.
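To make the (a+b)*(c+d) expansion concrete, here is a sketch in C of a 16 bit by 16 bit multiply built only from 8 bit by 8 bit partial products. The names `mul8x8` and `mul16x16` are invented for this illustration, and the 32 bit accumulator is used for brevity; on a real 8 bit machine the shifts by 8 and 16 are just a matter of which byte you add into, with the carry rippling upward.

```c
#include <stdint.h>

/* 8 x 8 -> 16 bit multiply, standing in for an 8 bit CPU's MUL */
uint16_t mul8x8(uint8_t x, uint8_t y)
{
    return (uint16_t)(x * y);
}

/* mnop * tuvw with mn, op, tu, vw each one byte */
uint32_t mul16x16(uint16_t p, uint16_t q)
{
    uint8_t mn = (uint8_t)(p >> 8), op = (uint8_t)(p & 0xff);
    uint8_t tu = (uint8_t)(q >> 8), vw = (uint8_t)(q & 0xff);

    uint32_t result = mul8x8(op, vw);           /* no 0x100 factor */
    result += (uint32_t)mul8x8(mn, vw) << 8;    /* one 0x100 factor */
    result += (uint32_t)mul8x8(op, tu) << 8;    /* one 0x100 factor */
    result += (uint32_t)mul8x8(mn, tu) << 16;   /* 0x100 * 0x100 */
    return result;
}
```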
EDIT
Or you could just try it
unsigned long fun1 ( unsigned long a, unsigned long b )
{
return(a+b);
}
00000000 <_fun1>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 1d40 0004 mov 4(r5), r0
8: 1d41 0006 mov 6(r5), r1
c: 6d40 0008 add 10(r5), r0
10: 6d41 000a add 12(r5), r1
14: 0b40 adc r0
16: 1585 mov (sp)+, r5
18: 0087 rts pc
00000000 <fun1>:
0: 0e 5c add r12, r14
2: 0f 6d addc r13, r15
4: 30 41 ret
00000000 <fun1>:
0: 62 0f add r22, r18
2: 73 1f adc r23, r19
4: 84 1f adc r24, r20
6: 95 1f adc r25, r21
8: 08 95 ret
bonus points if you can figure out these instruction sets.
unsigned long fun2 ( unsigned long a, unsigned long b )
{
return(a*b);
}
00000000 <_fun2>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 10e6 mov r3, -(sp)
6: 1d41 0006 mov 6(r5), r1
a: 1d40 000a mov 12(r5), r0
e: 1043 mov r1, r3
10: 00a1 clc
12: 0c03 ror r3
14: 74d7 fff2 ash $-16, r3
18: 6d43 0004 add 4(r5), r3
1c: 70c0 mul r0, r3
1e: 00a1 clc
20: 0c00 ror r0
22: 7417 fff2 ash $-16, r0
26: 6d40 0008 add 10(r5), r0
2a: 7040 mul r0, r1
2c: 10c0 mov r3, r0
2e: 6040 add r1, r0
30: 0a01 clr r1
32: 1583 mov (sp)+, r3
34: 1585 mov (sp)+, r5
36: 0087 rts pc
An 8 bit system can perform 8 bit operations in a single instruction and a single memory access; on such a system, 16 and 32 bit operations require additional data accesses and additional instructions.
For example, typical architectures place arithmetic results in a register (often an accumulator, but some architectures are more orthogonal and can use any register for results), and arithmetic overflow results in a carry flag being set in a status register. In operations larger than the native word size, the code can inspect the carry flag in order to take the appropriate action in subsequent instructions.
So say for an 8 bit system you add 1 to 255, the result in the 8 bit accumulator will be zero, with the carry flag set; the next instruction can then add one to the upper byte of a 16 bit value in response to the carry flag. This can be made to ripple through to any number of bytes or words, so that a system can be made to process operations of arbitrary bit length above that of the underlying architecture just not in a single instruction operation.
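The carry ripple described above can be sketched in C by adding two 32 bit values one byte at a time. `add32_bytewise` is a name invented for this illustration; the `carry` variable plays the role of the CPU's carry flag, and each loop iteration corresponds to one add-with-carry instruction on an 8 bit machine.

```c
#include <stdint.h>

uint32_t add32_bytewise(uint32_t x, uint32_t y)
{
    uint32_t sum = 0;
    unsigned carry = 0;                    /* models the carry flag */
    for (int i = 0; i < 4; i++) {          /* least significant byte first */
        uint8_t xb = (uint8_t)(x >> (8 * i));
        uint8_t yb = (uint8_t)(y >> (8 * i));
        unsigned t = (unsigned)xb + yb + carry;   /* at most 0x1ff */
        sum |= (uint32_t)(t & 0xff) << (8 * i);   /* keep the low 8 bits */
        carry = t >> 8;                           /* pass the overflow on */
    }
    return sum;                            /* final carry is discarded */
}
```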

How to divide a BCD by 2 on an 8085 processor?

On an 8085 processor, an efficient algorithm for dividing a BCD by 2 comes in handy when converting a BCD to binary representation. You might think of recursive subtraction or multiplying by 0.5, however these algorithms require lengthy arithmetics.
Therefore, I would like to share with you the following code (in 8085 assembler) that does it more efficiently. The code has been thoroughly tested on GNUSim8085 and ASM80 emulators. If this code was helpful to you, please share your experience with me.
Before running the code, put the BCD in register A. Set the carry flag if there is a remainder to be received from a more significant byte (worth 50). After execution, register A will contain the result. The carry flag is used to pass the remainder, if any, to the next less significant byte.
The algorithm uses DAA instruction after manipulating C and AC flags in a very special way thus taking into account that any remainder passed down to the next nibble (i.e. half-octet) is worth 5 instead of 8.
;Division of BCD by 2 on an 8085 processor
;Set initial values.
;Register A contains a two-digit BCD. Carry flag contains remainder.
stc
cmc
mvi a, 85H
;Do modified decimal adjust before division.
cmc
cma
rar
adc a
cma
daa
cmc
;Divide by 2.
rar
;Save quotient and remainder to registers B and C.
mov b, a
mvi a, 00H
rar
mov c, a
;Continue working on decimal adjust.
mov a, b
sui 33H
mov b, a
mov a, c
ral
mov a, b
hlt
Suppose a two digit BCD number is represented as: D7D6D5D4 D3D2D1D0
For a division by 2 in binary (or hex), simply right shift the number by one place. If a 1 is shifted out, the remainder is 1, and 0 otherwise. The same thing applies to two digit (8-bit) BCD numbers when D4 is 0, i.e. when no set bit moves from the higher order four bits into the lower order four bits. Now if D4 is 1 (before the shift), then shifting will introduce an 8 (1000) in the lower order four bits, which breaks this process. Observe that in BCD the bit shift should introduce 10/2 = 5, not 16/2 = 8. Thus we can simply adjust by subtracting 8-5 = 3 from the lower order four bits, i.e. 03H from the entire number. The following code summarizes this strategy. We assume the accumulator holds the data; after the division the result is kept in the accumulator and the remainder is kept in register B.
MVI B,00H ; remainder = 0
STC
CMC ; clear the carry flag
RAR ; right shift the data
JNC SKIP
INR B ; CY=1 so, remainder = 1
SKIP: MOV D,A ; backup
ANI 08H ; if get D3 after the shift (or D4 before the shift)
MOV A,D ; get the data from backup
JZ FIN ; if D4 before the shift was 0
SUI 03H ; adjustment for the shift
FIN: HLT ; A has the result, B has the remainder
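The same shift-and-adjust strategy can be modelled in C to try it on a few values. This is only a sketch of the second routine's logic, not a translation of every instruction; `bcd_half` and the result struct are names invented for this illustration, and the input is assumed to be a valid packed BCD byte (00H to 99H).

```c
#include <stdint.h>

struct bcd_half_result {
    uint8_t quotient;   /* packed BCD, what the routine leaves in A */
    uint8_t remainder;  /* 0 or 1, what the routine leaves in B */
};

struct bcd_half_result bcd_half(uint8_t bcd)
{
    struct bcd_half_result r;
    r.remainder = bcd & 1u;               /* bit shifted out by RAR */
    r.quotient = (uint8_t)(bcd >> 1);     /* right shift by one place */
    if (r.quotient & 0x08u)               /* D4 moved into the low digit */
        r.quotient = (uint8_t)(r.quotient - 0x03u); /* it is worth 5, not 8 */
    return r;
}
```

For instance, 85H shifts to 42H with remainder 1 and needs no adjustment, while 10H shifts to 08H and is adjusted to 05H.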

Objective c: How to store indices forming a sequence from middle to middle (indices for array or vector)

I have an array of structs; each struct is basically two points forming an edge. I use structs for performance reasons.
Now I have an object A holding the edges, and a method in this object that tells another object B about a list of consecutive edges in this edges array.
Edges:
e0 e1 e2 e3 e4 e5 e6 e7
Possible edges to return:
e0 e1 e2
e6 e7 e0 e1 e2
How would you return the info about the list of edges? They have to be ordered as shown. My problem is the case when the range starts near the end and wraps around. Otherwise I could just use NSIndexSet. Using arrays does not seem to be a good idea in this case because of performance. There will be many points and edges and things with points and edges.
You can store a set of consecutive "wrapping-around" indices as three numbers:
total number of indices in the original array,
first index of the index set,
number of indices in the index set.
That would be (8, 0, 3) for your first example e0 e1 e2, and
(8, 6, 5) for your second example e6 e7 e0 e1 e2.
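A minimal sketch of that triple as a plain C struct (usable from Objective-C as is) might look like this. `WrappedRange` and `wrapped_index` are names invented for this illustration; the modulo is what makes the range wrap from the last edge back to e0.

```c
#include <stddef.h>

typedef struct {
    size_t total;  /* number of edges in the original array */
    size_t first;  /* index of the first edge in the range */
    size_t count;  /* number of edges in the range */
} WrappedRange;

/* Index into the edge array of the k-th edge of the range,
   for k in [0, count). Assumes total > 0. */
size_t wrapped_index(WrappedRange r, size_t k)
{
    return (r.first + k) % r.total;
}
```

Object B can then loop k from 0 to count - 1 and read the edges directly from A's array, so nothing is copied and no index array has to be allocated.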
"Possible edges to return: e0 e1 e2 e6 e7 e0 e1 e2 ... They have to be ordered as shown"
If you need to save order and to save some elements twice (e0, e1, e2) you can not use NSIndexSet.
NSIndexSet
You should not use index sets to store an arbitrary collection of
integer values because index sets store indexes as sorted ranges. This
makes them more efficient than storing a collection of individual
integers. It also means that each index value can only appear once in
the index set.
Since Objective-C is built on top of C, you can use C-style arrays. It will be very fast.
Small sample:
int* getIndexes(int *number)
{
// here we will generate a fake array;
*number = 10;
int *a = malloc(sizeof(int)*(*number));
for (int i = 0; i < (*number); i++) {
a[i] = i;
}
return a;
}
This function should be used the following way
int N;
int *indexes = getIndexes(&N);
// now in N we have count of indexes
// ... process indexes here
free(indexes);

Small assembly code sequence optimization (intel x86)

I am doing some exercises in assembly language and I found a question about optimization which I can't figure out. Can anyone help me with it?
So the question is to optimize the following assembly code:
----------------------------Example1-------------------------
mov dx, 0 ---> this one I know-> xor dx,dx
----------------------------Example2------------------------
cmp ax, 0
je label
----------------------------Example3-------------------------
mov ax, x
cwd
mov si, 16
idiv si
----> Most I can think of in this example is to subs last 2 lines by idiv 16, but I am not sure
----------------------------Example4-------------------------
mov ax, x
mov bx, 7
mul bx
mov t, ax
----------------------------Example5---------------------------
mov si, offset array1
mov di, offset array2
; for i = 0; i < n; ++i
do:
mov bx, [si]
mov [di], bx
add si, 2
add di, 2
loop do
endforloop
For example 2, you should look at the AND or TEST opcodes. Similar to example 1, they allow you to remove the need for a constant.
For example 4, remember that x * 7 is the same as x * (8 - 1) or, expanding that, x * 8 - x. Multiplying by eight can be done with a shift instruction.
For example 5, you'd think Intel would have provided a much simpler way to transfer from SI to DI, since that is the whole reason for their existence. Maybe something like a REPetitive MOVe String Word :-)
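The example 4 identity is easy to check in C before writing the shift and subtract in assembly. `times7` is a name invented for this illustration; the cast back to 16 bits mirrors the fact that the original code only stores ax (the low word of the product) into t.

```c
#include <stdint.h>

/* x * 7 computed as x * 8 - x, keeping only the low 16 bits */
uint16_t times7(uint16_t x)
{
    return (uint16_t)((x << 3) - x);
}
```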
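For example three, division by a power of two can be implemented as a right shift. Since idiv is a signed division, that means an arithmetic shift (sar), and note that sar rounds toward negative infinity while idiv truncates toward zero, so negative values need a correction (add 15 before shifting right by 4) to get the same quotient. The shift also does not produce the remainder that idiv leaves in dx.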
Note that in example 5, the current code fails to initialize CX as needed (and in the optimized version, you'd definitely want to do that too).
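The signed case of the shift-for-divide replacement in example 3 is worth trying on a few negative values. Here is a sketch in C; `div16_shift` is a name invented for this illustration, and it assumes that `>>` on a negative int is an arithmetic shift, which is implementation-defined in C but is what the usual x86 compilers do (and what sar does).

```c
#include <stdint.h>

/* Signed x / 16 using a shift, matching idiv's truncation toward zero */
int16_t div16_shift(int16_t x)
{
    int bias = (x < 0) ? 15 : 0;         /* 16 - 1, only for negatives */
    return (int16_t)((x + bias) >> 4);   /* assumes arithmetic shift */
}
```

Without the bias, -1 shifted right by 4 gives -1, whereas idiv gives 0.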