NEON assembly: how to load data from different source pointers?

I tried to improve some code, but it is proving quite difficult for me.
I develop with the Android NDK.
The C++ code I want to improve is the following:
unsigned int test_add_C(unsigned int *x, unsigned int *y) {
    unsigned int result = 0;
    for (int i = 0; i < 8; i++) {
        result += x[i] * y[i];
    }
    return result;
}
And the NEON code:
unsigned int test_add_neon(unsigned *x, unsigned *y) {
    unsigned int result;
    __asm__ __volatile__(
        "vld1.32 {d2-d5}, [%[x]] \n\t"
        "vld1.32 {d6-d9}, [%[y]]! \n\t"
        "vmul.s32 d0, d2, d6 \n\t"
        "vmla.s32 d0, d3, d7 \n\t"
        "vmla.s32 d0, d4, d8 \n\t"
        "vmla.s32 d0, d5, d9 \n\t"
        "vpadd.s32 d0, d0 \n\t"
        "vmov %0, r4, d0 \n\t"
        :"=r"(result)
        :"r"(x)
        :"d0", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "r4"
    );
    return result;
}
But when I compile the code, the compiler reports "undefined named operand" for 'x' and 'y'.
I don't know how to load the data from arrays x and y.
Can someone help me?
Thanks a lot.

Variable names inside inline assembly can't be "seen" by the compiler, and must be included in the input/output operands list.
Changing the line
:"r"(x)
to
:[x]"r"(x),[y]"r"(y)
will fix your 'undefined named operand' problem. However, I see a few more potential issues right away.
First, the s32 data type of your multiplication instructions should be u32, since you specify that x and y are of unsigned int type.
Second, you post-increment y but not x in the lines
"vld1.32 {d2-d5}, [%[x]] \n\t"
"vld1.32 {d6-d9}, [%[y]]! \n\t"
Unless this is on purpose, it is better to be consistent.
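Putting those fixes together, a corrected version might look roughly like this (an untested sketch, not verified on a device; it drops the post-increment, names both pointer operands, and extracts the final sum with a pairwise add plus a single-lane move):
unsigned int test_add_neon(unsigned int *x, unsigned int *y) {
    unsigned int result;
    __asm__ __volatile__(
        "vld1.32 {d2-d5}, [%[x]]  \n\t"   // load x[0..7]
        "vld1.32 {d6-d9}, [%[y]]  \n\t"   // load y[0..7]
        "vmul.i32 d0, d2, d6      \n\t"   // d0  = x[0..1] * y[0..1]
        "vmla.i32 d0, d3, d7      \n\t"   // d0 += x[2..3] * y[2..3]
        "vmla.i32 d0, d4, d8      \n\t"   // d0 += x[4..5] * y[4..5]
        "vmla.i32 d0, d5, d9      \n\t"   // d0 += x[6..7] * y[6..7]
        "vpadd.i32 d0, d0, d0     \n\t"   // horizontal add of the two 32-bit lanes
        "vmov.32 %[res], d0[0]    \n\t"   // move the low lane into the result
        : [res]"=r"(result)
        : [x]"r"(x), [y]"r"(y)
        : "d0", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "memory"   // "memory": the asm reads *x and *y
    );
    return result;
}
(i32 is used for the multiplies here because the low 32 bits of the product are the same for signed and unsigned operands; u32 as suggested above is also fine where the assembler accepts it.)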

Related

Is it better for performance to use (multiple) conditional ternary operators rather than an if statement in GLSL?

I remember years ago I was told it was better in a GLSL shader to do
a = condition ? statementX : statementY;
over
if(condition) a = statementX;
else a = statementY;
because, in the latter case, every fragment that didn't satisfy the condition would have its execution halted while statementX was executed for the fragments that did satisfy it, and then those fragments would in turn wait until statementY had been executed for the others; whereas in the former case, statementX and statementY would be executed in parallel for their corresponding fragments. (I guess it's a bit more complicated with workgroups etc., but that's the gist of it, I think.) In fact, even for multiple statements I used to see this:
a0 = condition ? statementX0 : statementY0;
a1 = condition ? statementX1 : statementY1;
a2 = condition ? statementX2 : statementY2;
instead of
if(condition) {
    a0 = statementX0;
    a1 = statementX1;
    a2 = statementX2;
} else {
    a0 = statementY0;
    a1 = statementY1;
    a2 = statementY2;
}
Is this still the case? or have architectures or compilers improved? Is this a premature optimization not worth pursuing? Or still very relevant?
(and is it the same for different kinds of shaders? fragment, vertex, compute etc).
In both cases you would normally have a branch, and both will almost surely lead to the same assembly.
__global__ void simpleTest(int *in, int a, int b, int *out)
{
    int value = *in;
    int p = (value != 0) ? __sinf(a) : __cosf(b);
    *out = p;
}

__global__ void simpleTest2(int *in, int a, int b, int *out)
{
    int value = *in;
    int p;
    if (value != 0)
    {
        p = __sinf(a);
    }
    else
    {
        p = __cosf(b);
    }
    *out = p;
}
Here's how SASS looks for both:
MOV R1, c[0x0][0x44]
MOV R2, c[0x0][0x140]
MOV R3, c[0x0][0x144]
LD.E R2, [R2]
MOV R5, c[0x0][0x154]
ISETP.EQ.AND P0, PT, R2, RZ, PT
#!P0 I2F.F32.S32 R0, c[0x0] [0x148]
#P0 I2F.F32.S32 R4, c[0x0] [0x14c]
#!P0 RRO.SINCOS R0, R0
#P0 RRO.SINCOS R4, R4
#!P0 MUFU.SIN R0, R0
#P0 MUFU.COS R0, R4
MOV R4, c[0x0][0x150]
F2I.S32.F32.TRUNC R0, R0
ST.E [R4], R0
EXIT
BRA 0x98
The #!P0 and #P0 you see are predicates. Each thread has its own predicate bit based on the comparison result. As the processing unit goes through the code, that bit decides whether each predicated instruction is executed (or, put differently, whether its result is committed).
Now let's look at a case in which there is no branch in either version.
__global__ void simpleTest(int *in, int a, int b, int *out)
{
    int value = *in;
    int p = (value != 0) ? a : b;
    *out = p;
}

__global__ void simpleTest2(int *in, int a, int b, int *out)
{
    int value = *in;
    int p;
    if (value != 0)
    {
        p = a;
    }
    else
    {
        p = b;
    }
    *out = p;
}
And here's how SASS looks for both:
MOV R1, c[0x0][0x44]
MOV R2, c[0x0][0x140] ; load in pointer into R2
MOV R3, c[0x0][0x144]
LD.E R2, [R2] ; deref pointer
MOV R6, c[0x0][0x14c] ; load a. b is stored at c[0x0][0x148]
MOV R4, c[0x0][0x150] ; load out pointer into R4
MOV R5, c[0x0][0x154]
ICMP.EQ R0, R6, c[0x0][0x148], R2 ; Check R2 if zero and select source based on result. Result is put into R0.
ST.E [R4], R0
EXIT
BRA 0x60
There's no branch here. You can think of the result as a selection between a and b, i.e. a linear interpolation with a 0/1 weight:
int cond = (*in != 0);
*out = cond * a + (1 - cond) * b;
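The same selection can be written out on the host side; this is just an illustrative C++ sketch of the branchless idea (the function and variable names are mine), not what the compiler emits:
#include <cstdio>

// Branchless selection: cond is 0 or 1, so exactly one of the two terms survives.
int select_branchless(int value, int a, int b) {
    int cond = (value != 0);           // 1 selects a, 0 selects b
    return cond * a + (1 - cond) * b;  // same result as: (value != 0) ? a : b
}

int main() {
    std::printf("%d %d\n", select_branchless(5, 10, 20),   // prints 10
                           select_branchless(0, 10, 20));  // prints 20
    return 0;
}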

Efficiently dividing unsigned value by a power of two, rounding up - in CUDA

I was just reading:
Efficiently dividing unsigned value by a power of two, rounding up
and I was wondering what was the fastest way to do this in CUDA. Of course by "fast" I mean in terms of throughput (that question also addressed the case of subsequent calls depending on each other).
For the lg() function mentioned in that question (base-2 logarithm of divisor), suppose we have:
template <typename T> __device__ int find_first_set(T x);
template <> __device__ int find_first_set<uint32_t>(uint32_t x) { return __ffs(x); }
template <> __device__ int find_first_set<uint64_t>(uint64_t x) { return __ffsll(x); }
template <typename T> __device__ int lg(T x) { return find_first_set(x) - 1; }
Edit: Since I've been made aware that there's no find-first-set in PTX, nor in the instruction set of all nVIDIA GPUs up to this time, let's replace that lg() with the following:
template <typename T> __device__ int population_count(T x);
template <> __device__ int population_count<uint32_t>(uint32_t x) { return __popc(x); }
template <> __device__ int population_count<uint64_t>(uint64_t x) { return __popcll(x); }
template <typename T>
__device__ int lg_for_power_of_2(T x) { return population_count(x - 1); }
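As a quick sanity check of that identity (plain host-side C++, with a small constexpr helper of my own standing in for __popc): for a power of two q, q - 1 has exactly lg(q) set bits.
#include <cstdint>

// Tiny constexpr stand-in for __popc(), just for the compile-time checks below.
constexpr int popcount32(uint32_t x) {
    return x == 0 ? 0 : int(x & 1u) + popcount32(x >> 1);
}

// popcount(q - 1) == log2(q) whenever q is a power of two.
static_assert(popcount32(8u - 1u) == 3, "lg(8) == 3");              // 0b111 has 3 set bits
static_assert(popcount32(1u - 1u) == 0, "lg(1) == 0");              // 0b000 has no set bits
static_assert(popcount32((1u << 31) - 1u) == 31, "lg(2^31) == 31");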
and we now need to implement
template <typename T> T div_by_power_of_2_rounding_up(T p, T q);
... for T = uint32_t and T = uint64_t. (p is the dividend, q is the divisor).
Notes:
As in the original question, we may not assume that p <= std::numeric_limits<T>::max() - q or that p > 0 - that would collapse the various interesting alternatives :-)
0 is not a power of 2, so we may assume q != 0.
I realize solutions might differ for 32-bit and 64-bit; I'm more interested in the former but also in the latter.
Let's focus on Maxwell and Pascal chips.
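As a correctness baseline to compare candidates against (a plain host-side C++ sketch of my own, not tuned for the GPU at all), the naive formulation sidesteps the overflow pitfall by never computing p + q - 1:
#include <cstdint>

// Naive reference: ceil(p / q) for q a power of two (q != 0), valid for any p.
template <typename T>
constexpr T div_by_power_of_2_rounding_up_reference(T p, T q) {
    return p / q + (p % q != 0 ? 1 : 0);
}

// Example matching the test output further below: ceil(32767 / 32) == 1024.
static_assert(div_by_power_of_2_rounding_up_reference<uint32_t>(32767u, 32u) == 1024u,
              "sanity check");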
With funnel shifting available, a possible 32-bit strategy is to (essentially) do a 33-bit shift that preserves the carry of the addition, so the addition can be done before the shift, such as this (not tested):
unsigned sum = dividend + mask;
unsigned result = __funnelshift_r(sum, sum < mask, log_2_of_divisor);
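To see why the 33-bit view gives the right answer, here is a host-side C++ model of the same idea (my own sketch; the 64-bit addition stands in for the carry-out plus __funnelshift_r, which shifts the concatenated hi:lo pair right):
#include <cstdint>

// Model of the funnel-shift strategy: dividend + mask can carry out of 32 bits;
// keeping that carry as a 33rd bit and shifting right yields ceil(dividend / divisor).
uint32_t div_pow2_round_up_model(uint32_t dividend, uint32_t mask, uint32_t log_2_of_divisor) {
    uint64_t sum = static_cast<uint64_t>(dividend) + mask;   // 33-bit sum, carry preserved
    return static_cast<uint32_t>(sum >> log_2_of_divisor);   // shift through the carry bit
}
For example, with dividend = 32767, mask = 31 and log_2_of_divisor = 5 this yields 1024, in line with the test output further down.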
Edit by @einpoklum:
Tested using @RobertCrovella's program; it seems to work fine. The test kernel PTX for SM_61 is:
.reg .pred %p<2>;
.reg .b32 %r<12>;
ld.param.u32 %r5, [_Z4testjj_param_0];
ld.param.u32 %r6, [_Z4testjj_param_1];
neg.s32 %r7, %r6;
and.b32 %r8, %r6, %r7;
clz.b32 %r9, %r8;
mov.u32 %r10, 31;
sub.s32 %r4, %r10, %r9;
add.s32 %r11, %r6, -1;
add.s32 %r2, %r11, %r5;
setp.lt.u32 %p1, %r2, %r11;
selp.u32 %r3, 1, 0, %p1;
// inline asm
shf.r.wrap.b32 %r1, %r2, %r3, %r4;
// inline asm
st.global.u32 [r], %r1;
ret;
and the SASS is:
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ MOV R0, c[0x0][0x144]; /* 0x4c98078005170000 */
/*0018*/ IADD R2, RZ, -c[0x0][0x144]; /* 0x4c1100000517ff02 */
/* 0x001c4c00fe4007f1 */
/*0028*/ IADD32I R0, R0, -0x1; /* 0x1c0ffffffff70000 */
/*0030*/ LOP.AND R2, R2, c[0x0][0x144]; /* 0x4c47000005170202 */
/*0038*/ FLO.U32 R2, R2; /* 0x5c30000000270002 */
/* 0x003fd800fe2007e6 */
/*0048*/ IADD R5, R0, c[0x0][0x140]; /* 0x4c10000005070005 */
/*0050*/ ISETP.LT.U32.AND P0, PT, R5, R0, PT; /* 0x5b62038000070507 */
/*0058*/ IADD32I R0, -R2, 0x1f; /* 0x1d00000001f70200 */
/* 0x001fc400fe2007f6 */
/*0068*/ IADD32I R0, -R0, 0x1f; /* 0x1d00000001f70000 */
/*0070*/ SEL R6, RZ, 0x1, !P0; /* 0x38a004000017ff06 */
/*0078*/ MOV32I R2, 0x0; /* 0x010000000007f002 */
/* 0x0003c400fe4007e4 */
/*0088*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/*0090*/ SHF.R.W R0, R5, R0, R6; /* 0x5cfc030000070500 */
/*0098*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/* 0x001f8000ffe007ff */
/*00a8*/ EXIT; /* 0xe30000000007000f */
/*00b0*/ BRA 0xb0; /* 0xe2400fffff87000f */
/*00b8*/ NOP; /* 0x50b0000000070f00 */
Here's an adaptation of a well-performing answer for the CPU:
template <typename T>
__device__ T div_by_power_of_2_rounding_up(T dividend, T divisor)
{
    auto log_2_of_divisor = lg(divisor);
    auto mask = divisor - 1;
    auto correction_for_rounding_up = ((dividend & mask) + mask) >> log_2_of_divisor;
    return (dividend >> log_2_of_divisor) + correction_for_rounding_up;
}
I wonder whether one can do much better.
The SASS code (using @RobertCrovella's test kernel) for SM_61 is:
code for sm_61
Function : test(unsigned int, unsigned int)
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001fd400fe2007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ IADD R0, RZ, -c[0x0][0x144]; /* 0x4c1100000517ff00 */
/*0018*/ MOV R2, c[0x0][0x144]; /* 0x4c98078005170002 */
/* 0x003fc40007a007f2 */
/*0028*/ LOP.AND R0, R0, c[0x0][0x144]; /* 0x4c47000005170000 */
/*0030*/ FLO.U32 R3, R0; /* 0x5c30000000070003 */
/*0038*/ IADD32I R0, R2, -0x1; /* 0x1c0ffffffff70200 */
/* 0x001fc400fcc017f5 */
/*0048*/ IADD32I R3, -R3, 0x1f; /* 0x1d00000001f70303 */
/*0050*/ LOP.AND R2, R0, c[0x0][0x140]; /* 0x4c47000005070002 */
/*0058*/ IADD R2, R0, R2; /* 0x5c10000000270002 */
/* 0x001fd000fe2007f1 */
/*0068*/ IADD32I R0, -R3, 0x1f; /* 0x1d00000001f70300 */
/*0070*/ MOV R3, c[0x0][0x140]; /* 0x4c98078005070003 */
/*0078*/ MOV32I R6, 0x0; /* 0x010000000007f006 */
/* 0x001fc400fc2407f1 */
/*0088*/ SHR.U32 R4, R2, R0.reuse; /* 0x5c28000000070204 */
/*0090*/ SHR.U32 R5, R3, R0; /* 0x5c28000000070305 */
/*0098*/ MOV R2, R6; /* 0x5c98078000670002 */
/* 0x0003c400fe4007f4 */
/*00a8*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/*00b0*/ IADD R0, R4, R5; /* 0x5c10000000570400 */
/*00b8*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/* 0x001f8000ffe007ff */
/*00c8*/ EXIT; /* 0xe30000000007000f */
/*00d0*/ BRA 0xd0; /* 0xe2400fffff87000f */
/*00d8*/ NOP; /* 0x50b0000000070f00 */
/* 0x001f8000fc0007e0 */
/*00e8*/ NOP; /* 0x50b0000000070f00 */
/*00f0*/ NOP; /* 0x50b0000000070f00 */
/*00f8*/ NOP; /* 0x50b0000000070f00 */
with FLO being the "find leading 1" instruction (thanks @tera). Anyway, that is a lot of instructions, even if you ignore the loads from (what looks like) constant memory... the CPU function inspiring this one compiles into just:
tzcnt rax, rsi
lea rcx, [rdi - 1]
shrx rax, rcx, rax
add rax, 1
test rdi, rdi
cmove rax, rdi
(with clang 3.9.0).
Riffing off of the kewl answer by @tera:
template <typename T> __device__ T pdqru(T p, T q)
{
    return bool(p) * (((p - 1) >> lg(q)) + 1);
}
11 instructions (no branches, no predication) to get the result in R0:
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_61
Function : _Z4testjj
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001fc800fec007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ IADD R0, RZ, -c[0x0][0x144]; /* 0x4c1100000517ff00 */
/*0018*/ LOP.AND R0, R0, c[0x0][0x144]; /* 0x4c47000005170000 */
/* 0x003fc400ffa00711 */
/*0028*/ FLO.U32 R0, R0; /* 0x5c30000000070000 */
/*0030*/ MOV R5, c[0x0][0x140]; /* 0x4c98078005070005 */
/*0038*/ IADD32I R2, -R0, 0x1f; /* 0x1d00000001f70002 */
/* 0x001fd800fcc007f5 */
/*0048*/ IADD32I R0, R5, -0x1; /* 0x1c0ffffffff70500 */
/*0050*/ IADD32I R2, -R2, 0x1f; /* 0x1d00000001f70202 */
/*0058*/ SHR.U32 R0, R0, R2; /* 0x5c28000000270000 */
/* 0x001fd000fe2007f1 */
/*0068*/ IADD32I R0, R0, 0x1; /* 0x1c00000000170000 */
/*0070*/ MOV32I R2, 0x0; /* 0x010000000007f002 */
/*0078*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/* 0x001ffc001e2007f2 */
/*0088*/ ICMP.NE R0, R0, RZ, R5; /* 0x5b4b02800ff70000 */
/*0090*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/*0098*/ EXIT; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*00a8*/ BRA 0xa0; /* 0xe2400fffff07000f */
/*00b0*/ NOP; /* 0x50b0000000070f00 */
/*00b8*/ NOP; /* 0x50b0000000070f00 */
..........................
After studying the above SASS code, it seemed evident that these two instructions:
/*0038*/ IADD32I R2, -R0, 0x1f; /* 0x1d00000001f70002 */
/* 0x001fd800fcc007f5 */
...
/*0050*/ IADD32I R2, -R2, 0x1f; /* 0x1d00000001f70202 */
shouldn't really be necessary. I don't have a precise explanation, but my assumption is that because the FLO.U32 SASS instruction does not have precisely the same semantics as the __ffs() intrinsic, the compiler apparently has an idiom when using that intrinsic, which wraps the basic FLO instruction that is doing the work. It wasn't obvious how to work around this at the C++ source code level, but I was able to use the bfind PTX instruction in a way to reduce the instruction count further, to 7 according to my count (to get the answer into a register):
$ cat t107.cu
#include <cstdio>
#include <cstdint>
__device__ unsigned r = 0;
static __device__ __inline__ uint32_t __my_bfind(uint32_t val){
uint32_t ret;
asm volatile("bfind.u32 %0, %1;" : "=r"(ret): "r"(val));
return ret;}
template <typename T> __device__ T pdqru(T p, T q)
{
return bool(p) * (((p - 1) >> (__my_bfind(q))) + 1);
}
__global__ void test(unsigned p, unsigned q){
#ifdef USE_DISPLAY
unsigned q2 = 16;
unsigned z = 0;
unsigned l = 1U<<31;
printf("result %u/%u = %u\n", p, q, pdqru(p, q));
printf("result %u/%u = %u\n", p, q2, pdqru(p, q2));
printf("result %u/%u = %u\n", p, z, pdqru(p, z));
printf("result %u/%u = %u\n", z, q, pdqru(z, q));
printf("result %u/%u = %u\n", l, q, pdqru(l, q));
printf("result %u/%u = %u\n", q, l, pdqru(q, l));
printf("result %u/%u = %u\n", l, l, pdqru(l, l));
printf("result %u/%u = %u\n", q, q, pdqru(q, q));
#else
r = pdqru(p, q);
#endif
}
int main(){
unsigned h_r;
test<<<1,1>>>(32767, 32);
cudaMemcpyFromSymbol(&h_r, r, sizeof(unsigned));
printf("result = %u\n", h_r);
}
$ nvcc -arch=sm_61 -o t107 t107.cu -std=c++11
$ cuobjdump -sass t107
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_61
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_61
Function : _Z4testjj
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001c4400fe0007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ { MOV32I R3, 0x0; /* 0x010000000007f003 */
/*0018*/ FLO.U32 R2, c[0x0][0x144]; } /* 0x4c30000005170002 */
/* 0x003fd800fec007f6 */
/*0028*/ MOV R5, c[0x0][0x140]; /* 0x4c98078005070005 */
/*0030*/ IADD32I R0, R5, -0x1; /* 0x1c0ffffffff70500 */
/*0038*/ SHR.U32 R0, R0, R2; /* 0x5c28000000270000 */
/* 0x001fc800fca007f1 */
/*0048*/ IADD32I R0, R0, 0x1; /* 0x1c00000000170000 */
/*0050*/ MOV32I R2, 0x0; /* 0x010000000007f002 */
/*0058*/ ICMP.NE R0, R0, RZ, R5; /* 0x5b4b02800ff70000 */
/* 0x001ffc00ffe000f1 */
/*0068*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/*0070*/ EXIT; /* 0xe30000000007000f */
/*0078*/ BRA 0x78; /* 0xe2400fffff87000f */
..........................
Fatbin ptx code:
================
arch = sm_61
code version = [5,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
$ nvcc -arch=sm_61 -o t107 t107.cu -std=c++11 -DUSE_DISPLAY
$ cuda-memcheck ./t107
========= CUDA-MEMCHECK
result 32767/32 = 1024
result 32767/16 = 2048
result 32767/0 = 1
result 0/32 = 0
result 2147483648/32 = 67108864
result 32/2147483648 = 1
result 2147483648/2147483648 = 1
result 32/32 = 1
result = 0
========= ERROR SUMMARY: 0 errors
$
I've only demonstrated the 32-bit example, above.
I think I could make the case that there are really only 6 instructions doing the "work" in the above kernel SASS, and that the remainder of the instructions are kernel "overhead" and/or the instructions needed to store the register result into global memory. It seems evident that the compiler is generating just these instructions as a result of the function:
/*0018*/ FLO.U32 R2, c[0x0][0x144]; // find bit set in q
/* */
/*0028*/ MOV R5, c[0x0][0x140]; // load p
/*0030*/ IADD32I R0, R5, -0x1; // subtract 1 from p
/*0038*/ SHR.U32 R0, R0, R2; // shift (p-1) right by lg(q) bits
/* */
/*0048*/ IADD32I R0, R0, 0x1; // add 1 to result
/*0050*/ ... /* */
/*0058*/ ICMP.NE R0, R0, RZ, R5; // account for p=0 case
However this would be inconsistent with the way I've counted other cases (they should all probably be reduced by 1).
template <typename T> __device__ T div_by_power_of_2_rounding_up(T p, T q)
{
    return p==0 ? 0 : ((p - 1) >> lg(q)) + 1;
}
One instruction shorter than Robert's previous answer (but see his comeback) if my count is correct, or the same instruction count as the funnel shift. Has a branch though - not sure if that makes a difference (other than a benefit if the entire warp gets zero p inputs):
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_61
Function : _Z4testjj
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001fc000fda007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ ISETP.EQ.AND P0, PT, RZ, c[0x0][0x140], PT; /* 0x4b6503800507ff07 */
/*0018*/ { MOV R0, RZ; /* 0x5c9807800ff70000 */
/*0028*/ #P0 BRA 0x90; } /* 0x001fc800fec007fd */
/* 0xe24000000600000f */
/*0030*/ IADD R0, RZ, -c[0x0][0x144]; /* 0x4c1100000517ff00 */
/*0038*/ LOP.AND R0, R0, c[0x0][0x144]; /* 0x4c47000005170000 */
/* 0x003fc400ffa00711 */
/*0048*/ FLO.U32 R0, R0; /* 0x5c30000000070000 */
/*0050*/ MOV R3, c[0x0][0x140]; /* 0x4c98078005070003 */
/*0058*/ IADD32I R2, -R0, 0x1f; /* 0x1d00000001f70002 */
/* 0x001fd800fcc007f5 */
/*0068*/ IADD32I R0, R3, -0x1; /* 0x1c0ffffffff70300 */
/*0070*/ IADD32I R2, -R2, 0x1f; /* 0x1d00000001f70202 */
/*0078*/ SHR.U32 R0, R0, R2; /* 0x5c28000000270000 */
/* 0x001fc800fe2007f6 */
/*0088*/ IADD32I R0, R0, 0x1; /* 0x1c00000000170000 */
/*0090*/ MOV32I R2, 0x0; /* 0x010000000007f002 */
/*0098*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/* 0x001ffc00ffe000f1 */
/*00a8*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/*00b0*/ EXIT; /* 0xe30000000007000f */
/*00b8*/ BRA 0xb8; /* 0xe2400fffff87000f */
..........................
I believe it should still be possible to shave an instruction or two from the funnel shift by writing it in PTX (Morning update: as Robert has proven in the meantime), but I really need to go to bed.
Update2: Doing that (using Harold's funnel shift and writing the function in PTX)
__device__ uint32_t div_by_power_of_2_rounding_up(uint32_t p, uint32_t q)
{
    uint32_t ret;
    asm volatile("{\r\t"
                 ".reg.u32 shift, mask, lo, hi;\n\t"
                 "bfind.u32 shift, %2;\r\t"
                 "sub.u32 mask, %2, 1;\r\t"
                 "add.cc.u32 lo, %1, mask;\r\t"
                 "addc.u32 hi, 0, 0;\r\t"
                 "shf.r.wrap.b32 %0, lo, hi, shift;\n\t"
                 "}"
                 : "=r"(ret) : "r"(p), "r"(q));
    return ret;
}
just gets us to the same instruction count as Robert has already achieved with his simpler C code:
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_61
Function : _Z4testjj
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001fc000fec007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ MOV R0, c[0x0][0x144]; /* 0x4c98078005170000 */
/*0018*/ { IADD32I R2, R0, -0x1; /* 0x1c0ffffffff70002 */
/*0028*/ FLO.U32 R0, c[0x0][0x144]; } /* 0x001fc400fec00716 */
/* 0x4c30000005170000 */
/*0030*/ IADD R5.CC, R2, c[0x0][0x140]; /* 0x4c10800005070205 */
/*0038*/ IADD.X R6, RZ, RZ; /* 0x5c1008000ff7ff06 */
/* 0x003fc800fc8007f1 */
/*0048*/ MOV32I R2, 0x0; /* 0x010000000007f002 */
/*0050*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/*0058*/ SHF.R.W R0, R5, R0, R6; /* 0x5cfc030000070500 */
/* 0x001ffc00ffe000f1 */
/*0068*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/*0070*/ EXIT; /* 0xe30000000007000f */
/*0078*/ BRA 0x78; /* 0xe2400fffff87000f */
..........................
One possible straightforward approach:
$ cat t105.cu
#include <cstdio>
__device__ unsigned r = 0;
template <typename T>
__device__ T pdqru(T p, T q){
T p1 = p + (q-1);
if (sizeof(T) == 8)
q = __ffsll(q);
else
q = __ffs(q);
return (p1<p)?((p>>(q-1))+1) :(p1 >> (q-1));
}
__global__ void test(unsigned p, unsigned q){
#ifdef USE_DISPLAY
unsigned q2 = 16;
unsigned z = 0;
unsigned l = 1U<<31;
printf("result %u/%u = %u\n", p, q, pdqru(p, q));
printf("result %u/%u = %u\n", p, q2, pdqru(p, q2));
printf("result %u/%u = %u\n", p, z, pdqru(p, z));
printf("result %u/%u = %u\n", z, q, pdqru(z, q));
printf("result %u/%u = %u\n", l, q, pdqru(l, q));
printf("result %u/%u = %u\n", q, l, pdqru(q, l));
printf("result %u/%u = %u\n", l, l, pdqru(l, l));
printf("result %u/%u = %u\n", q, q, pdqru(q, q));
#else
r = pdqru(p, q);
#endif
}
int main(){
unsigned h_r;
test<<<1,1>>>(32767, 32);
cudaMemcpyFromSymbol(&h_r, r, sizeof(unsigned));
printf("result = %u\n", h_r);
}
$ nvcc -arch=sm_61 -o t105 t105.cu
$ cuobjdump -sass ./t105
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_61
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_61
Function : _Z4testjj
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001fc800fec007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ IADD R0, RZ, -c[0x0][0x144]; /* 0x4c1100000517ff00 */
/*0018*/ LOP.AND R0, R0, c[0x0][0x144]; /* 0x4c47000005170000 */
/* 0x005fd401fe20003d */
/*0028*/ FLO.U32 R2, R0; /* 0x5c30000000070002 */
/*0030*/ MOV R0, c[0x0][0x144]; /* 0x4c98078005170000 */
/*0038*/ IADD32I R3, -R2, 0x1f; /* 0x1d00000001f70203 */
/* 0x001fd000fc2007f1 */
/*0048*/ IADD32I R0, R0, -0x1; /* 0x1c0ffffffff70000 */
/*0050*/ MOV R2, c[0x0][0x140]; /* 0x4c98078005070002 */
/*0058*/ IADD32I R4, -R3, 0x1f; /* 0x1d00000001f70304 */
/* 0x001fd800fe2007f6 */
/*0068*/ IADD R5, R0, c[0x0][0x140]; /* 0x4c10000005070005 */
/*0070*/ ISETP.LT.U32.AND P0, PT, R5, R0, PT; /* 0x5b62038000070507 */
/*0078*/ SHR.U32 R0, R2, R4; /* 0x5c28000000470200 */
/* 0x001fd000fc2007f1 */
/*0088*/ IADD32I R0, R0, 0x1; /* 0x1c00000000170000 */
/*0090*/ MOV32I R2, 0x0; /* 0x010000000007f002 */
/*0098*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/* 0x001ffc001e2007f2 */
/*00a8*/ #!P0 SHR.U32 R0, R5, R4; /* 0x5c28000000480500 */
/*00b0*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/*00b8*/ EXIT; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*00c8*/ BRA 0xc0; /* 0xe2400fffff07000f */
/*00d0*/ NOP; /* 0x50b0000000070f00 */
/*00d8*/ NOP; /* 0x50b0000000070f00 */
/* 0x001f8000fc0007e0 */
/*00e8*/ NOP; /* 0x50b0000000070f00 */
/*00f0*/ NOP; /* 0x50b0000000070f00 */
/*00f8*/ NOP; /* 0x50b0000000070f00 */
..........................
Fatbin ptx code:
================
arch = sm_61
code version = [5,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
$ nvcc -arch=sm_61 -o t105 t105.cu -DUSE_DISPLAY
$ cuda-memcheck ./t105
========= CUDA-MEMCHECK
result 32767/32 = 1024
result 32767/16 = 2048
result 32767/0 = 2048
result 0/32 = 0
result 2147483648/32 = 67108864
result 32/2147483648 = 1
result 2147483648/2147483648 = 1
result 32/32 = 1
result = 0
========= ERROR SUMMARY: 0 errors
$
Approximately 14 SASS instructions for the 32-bit case, to get the answer into R0. It produces spurious results for the divide-by-zero case.
The equivalent assembly for this answer case looks like this:
$ cat t106.cu
#include <cstdio>
#include <cstdint>
__device__ unsigned r = 0;
template <typename T> __device__ int find_first_set(T x);
template <> __device__ int find_first_set<uint32_t>(uint32_t x) { return __ffs(x); }
template <> __device__ int find_first_set<uint64_t>(uint64_t x) { return __ffsll(x); }
template <typename T> __device__ T lg(T x) { return find_first_set(x) - 1; }
template <typename T>
__device__ T pdqru(T dividend, T divisor)
{
auto log_2_of_divisor = lg(divisor);
auto mask = divisor - 1;
auto correction_for_rounding_up = ((dividend & mask) + mask) >> log_2_of_divisor;
return (dividend >> log_2_of_divisor) + correction_for_rounding_up;
}
__global__ void test(unsigned p, unsigned q){
#ifdef USE_DISPLAY
unsigned q2 = 16;
unsigned z = 0;
unsigned l = 1U<<31;
printf("result %u/%u = %u\n", p, q, pdqru(p, q));
printf("result %u/%u = %u\n", p, q2, pdqru(p, q2));
printf("result %u/%u = %u\n", p, z, pdqru(p, z));
printf("result %u/%u = %u\n", z, q, pdqru(z, q));
printf("result %u/%u = %u\n", l, q, pdqru(l, q));
printf("result %u/%u = %u\n", q, l, pdqru(q, l));
printf("result %u/%u = %u\n", l, l, pdqru(l, l));
printf("result %u/%u = %u\n", q, q, pdqru(q, q));
#else
r = pdqru(p, q);
#endif
}
int main(){
unsigned h_r;
test<<<1,1>>>(32767, 32);
cudaMemcpyFromSymbol(&h_r, r, sizeof(unsigned));
printf("result = %u\n", h_r);
}
$ nvcc -std=c++11 -arch=sm_61 -o t106 t106.cu
$ cuobjdump -sass t106
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_61
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_61
Function : _Z4testjj
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001fd400fe2007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ IADD R0, RZ, -c[0x0][0x144]; /* 0x4c1100000517ff00 */
/*0018*/ MOV R2, c[0x0][0x144]; /* 0x4c98078005170002 */
/* 0x003fc40007a007f2 */
/*0028*/ LOP.AND R0, R0, c[0x0][0x144]; /* 0x4c47000005170000 */
/*0030*/ FLO.U32 R3, R0; /* 0x5c30000000070003 */
/*0038*/ IADD32I R0, R2, -0x1; /* 0x1c0ffffffff70200 */
/* 0x001fc400fcc017f5 */
/*0048*/ IADD32I R3, -R3, 0x1f; /* 0x1d00000001f70303 */
/*0050*/ LOP.AND R2, R0, c[0x0][0x140]; /* 0x4c47000005070002 */
/*0058*/ IADD R2, R0, R2; /* 0x5c10000000270002 */
/* 0x001fd000fe2007f1 */
/*0068*/ IADD32I R0, -R3, 0x1f; /* 0x1d00000001f70300 */
/*0070*/ MOV R3, c[0x0][0x140]; /* 0x4c98078005070003 */
/*0078*/ MOV32I R6, 0x0; /* 0x010000000007f006 */
/* 0x001fc400fc2407f1 */
/*0088*/ SHR.U32 R4, R2, R0.reuse; /* 0x5c28000000070204 */
/*0090*/ SHR.U32 R5, R3, R0; /* 0x5c28000000070305 */
/*0098*/ MOV R2, R6; /* 0x5c98078000670002 */
/* 0x0003c400fe4007f4 */
/*00a8*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/*00b0*/ IADD R0, R4, R5; /* 0x5c10000000570400 */
/*00b8*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/* 0x001f8000ffe007ff */
/*00c8*/ EXIT; /* 0xe30000000007000f */
/*00d0*/ BRA 0xd0; /* 0xe2400fffff87000f */
/*00d8*/ NOP; /* 0x50b0000000070f00 */
/* 0x001f8000fc0007e0 */
/*00e8*/ NOP; /* 0x50b0000000070f00 */
/*00f0*/ NOP; /* 0x50b0000000070f00 */
/*00f8*/ NOP; /* 0x50b0000000070f00 */
..........................
Fatbin ptx code:
================
arch = sm_61
code version = [5,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
$
which appears to be 1 instruction longer, by my count.
Here is an alternative solution via population count. I tried the 32-bit variant only, testing it exhaustively against the reference implementation. Since the divisor q is a power of 2, we can trivially determine the shift count s with the help of the population count operation. The remainder t of the truncating division can be computed with a simple mask m derived directly from the divisor q.
// For p in [0,0xffffffff], q = (1 << s) with s in [0,31], compute ceil(p/q)
__device__ uint32_t reference (uint32_t p, uint32_t q)
{
    uint32_t r = p / q;
    if ((q * r) < p) r++;
    return r;
}
// For p in [0,0xffffffff], q = (1 << s) with s in [0,31], compute ceil(p/q)
__device__ uint32_t solution (uint32_t p, uint32_t q)
{
    uint32_t r, s, t, m;
    m = q - 1;
    s = __popc (m);
    r = p >> s;
    t = p & m;
    if (t > 0) r++;
    return r;
}
Whether solution() is faster than the previously posted codes will likely depend on the specific GPU architecture. Using CUDA 8.0, it compiles to the following sequence of PTX instructions:
add.s32 %r3, %r2, -1;
popc.b32 %r4, %r3;
shr.u32 %r5, %r1, %r4;
and.b32 %r6, %r3, %r1;
setp.ne.s32 %p1, %r6, 0;
selp.u32 %r7, 1, 0, %p1;
add.s32 %r8, %r5, %r7;
For sm_5x, this translates into machine code pretty much 1:1, except that the two instructions SETP and SELP get contracted into a single ICMP, because the comparison is with 0.
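The exhaustive comparison mentioned above is easy to mirror on the CPU. A host-only C++ sketch of my own (plain functions instead of __device__ ones, __builtin_popcount standing in for __popc, and a coarse stride over p instead of the full 2^32 sweep so it finishes quickly) might look like this:
#include <cassert>
#include <cstdint>
#include <cstdio>

// Host-side copies of reference() and solution() from above.
static uint32_t reference_cpu(uint32_t p, uint32_t q) {
    uint32_t r = p / q;
    if ((q * r) < p) r++;
    return r;
}

static uint32_t solution_cpu(uint32_t p, uint32_t q) {
    uint32_t m = q - 1;                  // mask of the bits below the divisor
    uint32_t s = __builtin_popcount(m);  // shift count (== log2(q) for a power-of-2 q)
    uint32_t r = p >> s;
    uint32_t t = p & m;
    if (t > 0) r++;
    return r;
}

int main() {
    const uint64_t stride = 99991;       // prime stride; set to 1 for a full sweep
    for (int shift = 0; shift < 32; ++shift) {
        const uint32_t q = 1u << shift;
        for (uint64_t p = 0; p <= 0xffffffffULL; p += stride) {
            assert(solution_cpu((uint32_t)p, q) == reference_cpu((uint32_t)p, q));
        }
    }
    std::puts("solution_cpu matches reference_cpu on the sampled inputs");
    return 0;
}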

How to optimize loads and stores?

I'm trying to have a bunch of operations executed on different targets such as ARM, Bfin, etc., but every time I write some simple code in C and compile it, each operation ends up with two loads and one store, which is unnecessary.
ldr r2, [fp, #-24]
ldr r3, [fp, #-28]
add r3, r2, r3
str r3, [fp, #-20]
ldr r2, [fp, #-36]
ldr r3, [fp, #-40]
add r3, r2, r3
str r3, [fp, #-32]
ldr r2, [fp, #-44]
ldr r3, [fp, #-48]
add r3, r2, r3
str r3, [fp, #-20]
ldr r3, [fp, #-16]
add r3, r3, #1
str r3, [fp, #-16]
When I turn on any optimization options, even -O1, it simply calculates the result and stores it in the output:
subl $24, %esp
movl $4, 4(%esp)
movl $.LC0, (%esp)
Is there any way I can have the operations performed without fetching the same variables over and over again? I've tried gcc -fgcse-lm and -fgcse-sm, but that didn't work.
It depends on the operation. GCC can't figure out high-level optimizations for
int a(int b, int c)
{
    b-=c;
    c-=b;
    b-=c;
    c-=b;
    b-=c;
    c-=b;
    return c;
}
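To make concrete what "figuring it out" would require here, working the six subtractions through by hand (my own derivation, not something GCC is claimed to do) collapses the whole function to a single expression:
// b -= c;  ->  b1 = b - c
// c -= b;  ->  c1 = c - b1 = 2*c - b
// b -= c;  ->  b2 = b1 - c1 = 2*b - 3*c
// c -= b;  ->  c2 = c1 - b2 = 5*c - 3*b
// b -= c;  ->  b3 = b2 - c2 = 5*b - 8*c
// c -= b;  ->  c3 = c2 - b3 = 13*c - 8*b   (Fibonacci coefficients)
int a_closed_form(int b, int c)
{
    return 13*c - 8*b;   // equivalent to a(b, c), up to signed-overflow corner cases
}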
If you want to do benchmarking and avoid the constant folding and dead code elimination done by GCC's optimizer, you need to use non-constant inputs and make sure the result goes somewhere.
For instance, instead of using
int main(int argc, char** argv) {
    int a = 1;
    int b = 2;
    start_clock();
    int c = a + b;
    int d = c + a;
    int e = d + b;
    stop_clock();
    output_time_needed();
    return 0;
}
You should use something like
int main(int argc, char** argv) {
    int a = argc;
    int b = argc + 1;
    start_clock();
    int c = a + b;
    int d = c + a;
    int e = d + b;
    stop_clock();
    output_time_needed();
    return e;
}

OpenCL Memory Optimization - Nearest Neighbour

I'm writing a program in OpenCL that receives two arrays of points, and calculates the nearest neighbour for each point.
I have two programs for this. One of them will calculate distance for 4 dimensions, and one for 6 dimensions. They are below:
4 dimensions:
kernel void BruteForce(
global read_only float4* m,
global float4* y,
global write_only ushort* i,
read_only uint mx)
{
int index = get_global_id(0);
float4 curY = y[index];
float minDist = MAXFLOAT;
ushort minIdx = -1;
int x = 0;
int mmx = mx;
for(x = 0; x < mmx; x++)
{
float dist = fast_distance(curY, m[x]);
if (dist < minDist)
{
minDist = dist;
minIdx = x;
}
}
i[index] = minIdx;
y[index] = minDist;
}
6 dimensions:
kernel void BruteForce(
global read_only float8* m,
global float8* y,
global write_only ushort* i,
read_only uint mx)
{
int index = get_global_id(0);
float8 curY = y[index];
float minDist = MAXFLOAT;
ushort minIdx = -1;
int x = 0;
int mmx = mx;
for(x = 0; x < mmx; x++)
{
float8 mx = m[x];
float d0 = mx.s0 - curY.s0;
float d1 = mx.s1 - curY.s1;
float d2 = mx.s2 - curY.s2;
float d3 = mx.s3 - curY.s3;
float d4 = mx.s4 - curY.s4;
float d5 = mx.s5 - curY.s5;
float dist = sqrt(d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3 + d4 * d4 + d5 * d5);
if (dist < minDist)
{
minDist = dist;
minIdx = index;
}
}
i[index] = minIdx;
y[index] = minDist;
}
I'm looking for ways to optimize this program for the GPU. I've read some articles (including http://www.macresearch.org/opencl_episode6, which comes with source code) about GPGPU optimization using local memory. I've tried applying that approach and came up with this code:
kernel void BruteForce(
global read_only float4* m,
global float4* y,
global write_only ushort* i,
__local float4 * shared)
{
int index = get_global_id(0);
int lsize = get_local_size(0);
int lid = get_local_id(0);
float4 curY = y[index];
float minDist = MAXFLOAT;
ushort minIdx = 64000;
int x = 0;
for(x = 0; x < {0}; x += lsize)
{
if((x+lsize) > {0})
lsize = {0} - x;
if ( (x + lid) < {0})
{
shared[lid] = m[x + lid];
}
barrier(CLK_LOCAL_MEM_FENCE);
for (int x1 = 0; x1 < lsize; x1++)
{
float dist = distance(curY, shared[x1]);
if (dist < minDist)
{
minDist = dist;
minIdx = x + x1;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
i[index] = minIdx;
y[index] = minDist;
}
I'm getting garbage results for my 'i' output (e.g. many values that are the same). Can anyone point me in the right direction? I'd appreciate any answer that helps me improve this code, or that finds the problem with the optimized version above.
Thank you very much,
Cauê
One way to get a big speed up here is to use local data structures and compute entire blocks of data at a time. You should also only need a single read/write global vector (float4). The same idea can be applied to the 6d version using smaller blocks. Each work group is able to work freely through the block of data it is crunching. I will leave the exact implementation to you because you will know the specifics of your application.
some pseudo-ish-code (4d):
computeBlockSize is the size of the blocks to read from global and crunch.
this value should be a multiple of your work group size. I like 64 as a WG
size; it tends to perform well on most platforms. will be
allocating 2 * float4 * computeBlockSize + uint * computeBlockSize of shared memory.
(max value for ocl 1.0 ~448, ocl 1.1 ~896)
#define computeBlockSize 256
__local float4 blockA[computeBlockSize];
__local float4 blockB[computeBlockSize];
__local uint blockAnearestIndex[computeBlockSize];
now blockA gets computed against all blockB combinations. this is the job of a single work group.
*important*: only blockA ever gets written to. blockB is stored in local memory, but never changed or copied back to global
steps:
load blockA into local memory with async_work_group_copy
blockA is located at get_group_id(0) * computeBlockSize in the global vector
optional: set all blockA 'w' values to MAXFLOAT
optional: load blockAnearestIndex into local memory with async_work_group_copy if needed
need to compute blockA against itself first, then go into the blockB's
be careful to only write to blockA[j], NOT blockA[k]. j is exclusive to this work item
for(j=get_local_id(0); j<computeBlockSize; j++)
    for(k=0; k<computeBlockSize; k++)
        if(j==k) continue; //no self-comparison
        calculate distance of blockA[j] vs blockA[k]
        store min distance in blockA[j].w
        store global index (= i*computeBlockSize + k) of nearest in blockAnearestIndex[j]
barrier(local_mem_fence)
for (i=0; i<get_num_groups(0); i++)
    if (i==get_group_id(0)) continue;
    load blockB into local memory: async_work_group_copy(...)
    for(j=get_local_id(0); j<computeBlockSize; j++)
        for(k=0; k<computeBlockSize; k++)
            calculate distance of blockA[j] vs blockB[k]
            store min distance in blockA[j].w
            store global index (= i*computeBlockSize + k) of nearest in blockAnearestIndex[j]
    barrier(local_mem_fence)
write blockA and blockAnearestIndex to global memory using two async_work_group_copy
There should be no problem in reading a blockB while another work group writes the same block (as its own blockA), because only the W values may have changed. If there happens to be trouble with this -- or if you do require two different vectors of points -- you could use two global vectors like you have above, one with the A's (writeable) and the other with the B's (read only).
This algorithm works best when your global data size is a multiple of computeBlockSize. To handle the edges, two solutions come to mind. I recommend writing a second kernel for the non-square edge blocks that would work in a similar manner as above. The new kernel can execute after the first, and you could save the second pci-e transfer. Alternatively, you can use a distance of -1 to signify a skip in the comparison of two elements (i.e. if either blockA[j].w == -1 or blockB[k].w == -1, continue). This second solution would result in a lot more branching in your kernel though, which is why I recommend writing a new kernel. A very small percentage of your data points will actually fall in an edge block.

Optimization using NEON assembly

I am trying to optimize some parts of OpenCV code using NEON. Here is the original code block I work on. (Note: If it is of any importance, you can find the full source at "opencvfolder/modules/video/src/lkpyramid.cpp". It is an implementation of an object tracking algorithm.)
for( ; x < colsn; x++ )
{
    deriv_type t0 = (deriv_type)(trow0[x+cn] - trow0[x-cn]);
    deriv_type t1 = (deriv_type)((trow1[x+cn] + trow1[x-cn])*3 + trow1[x]*10);
    drow[x*2] = t0; drow[x*2+1] = t1;
}
In this code, deriv_type is 2 bytes in size.
And here is the NEON assembly I have written. With the original code I measure 10-11 fps; with NEON it is worse, and I only get 5-6 fps. I don't really know much about NEON, so there are probably lots of mistakes in this code. Where am I going wrong? Thanks.
for( ; x < colsn; x+=4 )
{
__asm__ __volatile__(
"vld1.16 d2, [%2] \n\t" // d2 = trow0[x+cn]
"vld1.16 d3, [%3] \n\t" // d3 = trow0[x-cn]
"vsub.i16 d9, d2, d3 \n\t" // d9 = d2 - d3
"vld1.16 d4, [%4] \n\t" // d4 = trow1[x+cn]
"vld1.16 d5, [%5] \n\t" // d5 = trow1[x-cn]
"vld1.16 d6, [%6] \n\t" // d6 = trow1[x]
"vmov.i16 d7, #3 \n\t" // d7 = 3
"vmov.i16 d8, #10 \n\t" // d8 = 10
"vadd.i16 d4, d4, d5 \n\t" // d4 = d4 + d5
"vmul.i16 d10, d4, d7 \n\t" // d10 = d4 * d7
"vmla.i16 d10, d6, d8 \n\t" // d10 = d10 + d6 * d8
"vst2.16 {d9,d10}, [%0] \n\t" // drow[x*2] = d9; drow[x*2+1] = d10;
//"vst1.16 d4, [%1] \n\t"
: //output
:"r"(drow+x*2), "r"(drow+x*2+1), "r"(trow0+x+cn), "r"(trow0+x-cn), "r"(trow1+x+cn), "r"(trow1+x-cn), "r"(trow1) //input
:"d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10" //registers
);
}
EDIT
This is the version with intrinsics. It is almost the same as before; it still runs slowly.
const int16x8_t vk3 = { 3, 3, 3, 3, 3, 3, 3, 3 };
const int16x8_t vk10 = { 10, 10, 10, 10, 10, 10, 10, 10 };
for( ; x < colsn; x+=8 )
{
    int16x8x2_t loaded;
    int16x8_t t0a = vld1q_s16(&trow0[x + cn]);
    int16x8_t t0b = vld1q_s16(&trow0[x - cn]);
    loaded.val[0] = vsubq_s16(t0a, t0b); // t0 = (trow0[x + cn] - trow0[x - cn])
    loaded.val[1] = vld1q_s16(&trow1[x + cn]);
    int16x8_t t1b = vld1q_s16(&trow1[x - cn]);
    int16x8_t t1c = vld1q_s16(&trow1[x]);
    loaded.val[1] = vaddq_s16(loaded.val[1], t1b);
    loaded.val[1] = vmulq_s16(loaded.val[1], vk3);
    loaded.val[1] = vmlaq_s16(loaded.val[1], t1c, vk10);
}
You're creating a lot of pipeline stalls due to data hazards. For example these three instructions:
"vadd.i16 d4, d4, d5 \n\t" // d4 = d4 + d5
"vmul.i16 d10, d4, d7 \n\t" // d10 = d4 * d7
"vmla.i16 d10, d6, d8 \n\t" // d10 = d10 + d6 * d8
They each take only one cycle to issue, but there are several-cycle stalls between them because the results are not ready in time (NEON instruction scheduling).
Try unrolling the loop a few times and interleaving their instructions. The compiler might do this for you if you use intrinsics. It's not impossible to beat the compiler at instruction scheduling etc., but it is quite hard and not often worth it (this might fall under not optimizing prematurely).
EDIT
Your intrinsic code is reasonable, I suspect the compiler is just not doing a very good job. Take a look at the assembly code it's producing (objdump -d) and you will probably see that it's also creating a lot of pipeline hazards. A later version of the compiler may help, but if it doesn't you might have to modify the loop yourself to hide the latency of the results (you will need the instruction timings). Keep the current code around, as it is correct and should be optimisable by a clever compiler.
You might end up with something like:
// do step 1 of first iteration
// ...
for (int i = 0; i < n - 1; i++) {
// do step 1 of (i+1)th
// do step 2 of (i)th
// with their instructions interleaved
// ...
}
// do step 2 of (n-1)th
// ...
You can also split the loop into more than 2 steps, or unroll the loop a few times (e.g. change i++ to i+=2, double the body of the loop, changing i to i+1 in the second half). I hope this answer helps, let me know if anything is unclear!
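For concreteness, a 2x-unrolled, interleaved version of your intrinsics loop could look roughly like the sketch below (untested; it assumes deriv_type is int16_t, that the remaining count is a multiple of 16, keeps the question's trow0/trow1/drow/cn names, and adds the vst2q_s16 store that writes the interleaved t0/t1 pairs). The grouping of all loads first is just one plausible schedule; measure before trusting it.
#include <arm_neon.h>

// 2x-unrolled body: issue both iterations' loads up front, then do the arithmetic,
// so each result has more time to become ready before it is consumed.
void process_row(const int16_t* trow0, const int16_t* trow1,
                 int16_t* drow, int cn, int x, int colsn)
{
    const int16x8_t vk3  = vdupq_n_s16(3);
    const int16x8_t vk10 = vdupq_n_s16(10);

    for (; x <= colsn - 16; x += 16)
    {
        // loads for iteration A (elements x .. x+7) and B (elements x+8 .. x+15)
        int16x8_t a0p = vld1q_s16(&trow0[x + cn]);
        int16x8_t a0m = vld1q_s16(&trow0[x - cn]);
        int16x8_t a1p = vld1q_s16(&trow1[x + cn]);
        int16x8_t a1m = vld1q_s16(&trow1[x - cn]);
        int16x8_t a1c = vld1q_s16(&trow1[x]);
        int16x8_t b0p = vld1q_s16(&trow0[x + 8 + cn]);
        int16x8_t b0m = vld1q_s16(&trow0[x + 8 - cn]);
        int16x8_t b1p = vld1q_s16(&trow1[x + 8 + cn]);
        int16x8_t b1m = vld1q_s16(&trow1[x + 8 - cn]);
        int16x8_t b1c = vld1q_s16(&trow1[x + 8]);

        // arithmetic, interleaved between the two iterations
        int16x8x2_t outA, outB;
        outA.val[0] = vsubq_s16(a0p, a0m);                                      // t0 for A
        outB.val[0] = vsubq_s16(b0p, b0m);                                      // t0 for B
        outA.val[1] = vmlaq_s16(vmulq_s16(vaddq_s16(a1p, a1m), vk3), a1c, vk10); // t1 for A
        outB.val[1] = vmlaq_s16(vmulq_s16(vaddq_s16(b1p, b1m), vk3), b1c, vk10); // t1 for B

        vst2q_s16(&drow[2 * x], outA);        // interleaved store of t0/t1 pairs
        vst2q_s16(&drow[2 * (x + 8)], outB);
    }
}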
There is some loop invariant stuff there that needs to be moved outside the for loop - this may help a little.
You could also consider using full-width SIMD operations, so that you can process 8 points per loop iteration rather than 4.
Most importantly though, you should probably be using intrinsics rather than raw asm, so that the compiler can take care of peephole optimisation, register allocation, instruction scheduling, loop unrolling, etc.
E.g.
// constants - init outside loop
const int16x8_t vk3 = { 3, 3, 3, 3, 3, 3, 3, 3 };
const int16x8_t vk10 = { 10, 10, 10, 10, 10, 10, 10, 10 };
for( ; x < colsn; x += 8)
{
int16x8_t t0a = vld1q_s16(&trow0[x + cn]);
int16x8_t t0b = vld1q_s16(&trow0[x - cn]);
int16x8_t t0 = vsubq_s16(t0a, t0b); // t0 = (trow0[x + cn] - trow0[x - cn])
// ...
}