Optimization using NEON assembly

I am trying to optimize some parts of OpenCV code using NEON. Here is the original code block I am working on. (Note: if it is of any importance, you can find the full source at "opencvfolder/modules/video/src/lkpyramid.cpp". It is an implementation of an object tracking algorithm.)
for( ; x < colsn; x++ )
{
deriv_type t0 = (deriv_type)(trow0[x+cn] - trow0[x-cn]);
deriv_type t1 = (deriv_type)((trow1[x+cn] + trow1[x-cn])*3 + trow1[x]*10);
drow[x*2] = t0; drow[x*2+1] = t1;
}
In this code, deriv_type is 2 bytes in size.
And here is the NEON assembly I have written. With the original code I measure 10-11 fps; with NEON it is worse, I only get 5-6 fps. I don't really know much about NEON, so there are probably lots of mistakes in this code. What am I doing wrong? Thanks
for( ; x < colsn; x+=4 )
{
__asm__ __volatile__(
"vld1.16 d2, [%2] \n\t" // d2 = trow0[x+cn]
"vld1.16 d3, [%3] \n\t" // d3 = trow0[x-cn]
"vsub.i16 d9, d2, d3 \n\t" // d9 = d2 - d3
"vld1.16 d4, [%4] \n\t" // d4 = trow1[x+cn]
"vld1.16 d5, [%5] \n\t" // d5 = trow1[x-cn]
"vld1.16 d6, [%6] \n\t" // d6 = trow1[x]
"vmov.i16 d7, #3 \n\t" // d7 = 3
"vmov.i16 d8, #10 \n\t" // d8 = 10
"vadd.i16 d4, d4, d5 \n\t" // d4 = d4 + d5
"vmul.i16 d10, d4, d7 \n\t" // d10 = d4 * d7
"vmla.i16 d10, d6, d8 \n\t" // d10 = d10 + d6 * d8
"vst2.16 {d9,d10}, [%0] \n\t" // drow[x*2] = d9; drow[x*2+1] = d10;
//"vst1.16 d4, [%1] \n\t"
: //output
:"r"(drow+x*2), "r"(drow+x*2+1), "r"(trow0+x+cn), "r"(trow0+x-cn), "r"(trow1+x+cn), "r"(trow1+x-cn), "r"(trow1) //input
:"d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10" //registers
);
}
EDIT
This is the version with intrinsics. It is almost the same as before, and it is still slow.
const int16x8_t vk3 = { 3, 3, 3, 3, 3, 3, 3, 3 };
const int16x8_t vk10 = { 10, 10, 10, 10, 10, 10, 10, 10 };
for( ; x < colsn; x+=8 )
{
int16x8x2_t loaded;
int16x8_t t0a = vld1q_s16(&trow0[x + cn]);
int16x8_t t0b = vld1q_s16(&trow0[x - cn]);
loaded.val[0] = vsubq_s16(t0a, t0b); // t0 = (trow0[x + cn] - trow0[x - cn])
loaded.val[1] = vld1q_s16(&trow1[x + cn]);
int16x8_t t1b = vld1q_s16(&trow1[x - cn]);
int16x8_t t1c = vld1q_s16(&trow1[x]);
loaded.val[1] = vaddq_s16(loaded.val[1], t1b);
loaded.val[1] = vmulq_s16(loaded.val[1], vk3);
loaded.val[1] = vmlaq_s16(loaded.val[1], t1c, vk10);
vst2q_s16(&drow[x*2], loaded); // interleaved store of t0/t1, as in the scalar loop
}

You're creating a lot of pipeline stalls due to data hazards. For example, these three instructions:
"vadd.i16 d4, d4, d5 \n\t" // d4 = d4 + d5
"vmul.i16 d10, d4, d7 \n\t" // d10 = d4 * d7
"vmla.i16 d10, d6, d8 \n\t" // d10 = d10 + d6 * d8
They each take only one cycle to issue, but there are several-cycle stalls between them because the results are not ready (see NEON instruction scheduling).
Try unrolling the loop a few times and interleaving their instructions. The compiler might do this for you if you use intrinsics. It's not impossible to beat the compiler at instruction scheduling and the like, but it is quite hard and not often worth it (this might fall under not optimizing prematurely).
EDIT
Your intrinsic code is reasonable, I suspect the compiler is just not doing a very good job. Take a look at the assembly code it's producing (objdump -d) and you will probably see that it's also creating a lot of pipeline hazards. A later version of the compiler may help, but if it doesn't you might have to modify the loop yourself to hide the latency of the results (you will need the instruction timings). Keep the current code around, as it is correct and should be optimisable by a clever compiler.
You might end up with something like:
// do step 1 of first iteration
// ...
for (int i = 0; i < n - 1; i++) {
// do step 1 of (i+1)th
// do step 2 of (i)th
// with their instructions interleaved
// ...
}
// do step 2 of (n-1)th
// ...
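As a concrete illustration of the unroll-and-interleave variant described just below, here is a rough, untested sketch of the intrinsic loop unrolled by two with the loads grouped ahead of the arithmetic. It assumes (colsn - x) is a multiple of 16 and reuses vk3, vk10 and the vst2q_s16 store from the edited question:
for( ; x < colsn; x += 16 )
{
    // loads for both 8-element groups first, so the arithmetic of one group
    // can hide the load/multiply latency of the other
    int16x8_t a0a = vld1q_s16(&trow0[x + cn]);
    int16x8_t b0a = vld1q_s16(&trow0[x + 8 + cn]);
    int16x8_t a0b = vld1q_s16(&trow0[x - cn]);
    int16x8_t b0b = vld1q_s16(&trow0[x + 8 - cn]);
    int16x8_t a1a = vld1q_s16(&trow1[x + cn]);
    int16x8_t b1a = vld1q_s16(&trow1[x + 8 + cn]);
    int16x8_t a1b = vld1q_s16(&trow1[x - cn]);
    int16x8_t b1b = vld1q_s16(&trow1[x + 8 - cn]);
    int16x8_t a1c = vld1q_s16(&trow1[x]);
    int16x8_t b1c = vld1q_s16(&trow1[x + 8]);
    // arithmetic for the two groups, interleaved
    int16x8x2_t ra, rb;
    ra.val[0] = vsubq_s16(a0a, a0b);                    // t0, group A
    rb.val[0] = vsubq_s16(b0a, b0b);                    // t0, group B
    ra.val[1] = vmulq_s16(vaddq_s16(a1a, a1b), vk3);    // (trow1[x+cn] + trow1[x-cn]) * 3
    rb.val[1] = vmulq_s16(vaddq_s16(b1a, b1b), vk3);
    ra.val[1] = vmlaq_s16(ra.val[1], a1c, vk10);        // + trow1[x] * 10
    rb.val[1] = vmlaq_s16(rb.val[1], b1c, vk10);
    vst2q_s16(&drow[x * 2], ra);                        // interleaved store of t0, t1
    vst2q_s16(&drow[(x + 8) * 2], rb);
}
Whether this actually wins depends on the core and on how the compiler schedules it, so it is still worth checking the generated code with objdump as described above.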
You can also split the loop into more than 2 steps, or unroll the loop a few times (e.g. change i++ to i+=2, double the body of the loop, changing i to i+1 in the second half). I hope this answer helps, let me know if anything is unclear!

There is some loop-invariant stuff there that needs to be moved outside the for loop - this may help a little.
You could also consider using full width SIMD operations, so that you can process 8 points per loop iteration rather than 4.
Most importantly though, you should probably be using intrinsics rather than raw asm, so that the compiler can take care of peephole optimisation, register allocation, instruction scheduling, loop unrolling, etc.
E.g.
// constants - init outside loop
const int16x8_t vk3 = { 3, 3, 3, 3, 3, 3, 3, 3 };
const int16x8_t vk10 = { 10, 10, 10, 10, 10, 10, 10, 10 };
for( ; x < colsn; x += 8)
{
int16x8_t t0a = vld1q_s16(&trow0[x + cn]);
int16x8_t t0b = vld1q_s16(&trow0[x - cn]);
int16x8_t t0 = vsubq_s16(t0a, t0b); // t0 = (trow0[x + cn] - trow0[x - cn])
// ...
}

Related

Transpose 8x8 64-bit matrix

Targeting AVX2, what is the fastest way to transpose an 8x8 matrix containing 64-bit integers (or doubles)?
I searched through this site and found several ways of doing an 8x8 transpose, but mostly for 32-bit floats. So I'm mainly asking because I'm not sure whether the principles that made those algorithms fast readily translate to 64 bits, and second, apparently AVX2 only has 16 registers, so just loading all the values would take up all the registers.
One way of doing it would be to call _MM_TRANSPOSE4_PD on a 2x2 arrangement of 4x4 blocks, but I was wondering whether this is optimal:
#define _MM_TRANSPOSE4_PD(row0,row1,row2,row3) \
{ \
__m256d tmp3, tmp2, tmp1, tmp0; \
\
tmp0 = _mm256_shuffle_pd((row0),(row1), 0x0); \
tmp2 = _mm256_shuffle_pd((row0),(row1), 0xF); \
tmp1 = _mm256_shuffle_pd((row2),(row3), 0x0); \
tmp3 = _mm256_shuffle_pd((row2),(row3), 0xF); \
\
(row0) = _mm256_permute2f128_pd(tmp0, tmp1, 0x20); \
(row1) = _mm256_permute2f128_pd(tmp2, tmp3, 0x20); \
(row2) = _mm256_permute2f128_pd(tmp0, tmp1, 0x31); \
(row3) = _mm256_permute2f128_pd(tmp2, tmp3, 0x31); \
}
Still assuming AVX2, is transposing double[8][8] and int64_t[8][8] largely the same, in principle?
PS: And just being curious, would having AVX-512 change things substantially?
After some thought and discussion in the comments, I think this is the most efficient version, at least when the source and destination data are in RAM. It does not require AVX2; AVX1 is enough.
The main idea: modern CPUs can do twice as many load micro-ops as stores, and on many CPUs loading into the upper half of a vector with vinsertf128 has the same cost as a regular 16-byte load. Compared to your macro, this version no longer needs these relatively expensive (3 cycles of latency on most CPUs) vperm2f128 shuffles.
struct Matrix4x4
{
__m256d r0, r1, r2, r3;
};
inline void loadTransposed( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
// Load top half of the matrix into low half of 4 registers
__m256d t0 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) ); // 00, 01
__m256d t1 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 02, 03
rsi += stride;
__m256d t2 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) ); // 10, 11
__m256d t3 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 12, 13
rsi += stride;
// Load bottom half of the matrix into high half of these registers
t0 = _mm256_insertf128_pd( t0, _mm_loadu_pd( rsi ), 1 ); // 00, 01, 20, 21
t1 = _mm256_insertf128_pd( t1, _mm_loadu_pd( rsi + 2 ), 1 );// 02, 03, 22, 23
rsi += stride;
t2 = _mm256_insertf128_pd( t2, _mm_loadu_pd( rsi ), 1 ); // 10, 11, 30, 31
t3 = _mm256_insertf128_pd( t3, _mm_loadu_pd( rsi + 2 ), 1 );// 12, 13, 32, 33
// Transpose 2x2 blocks in registers.
// Due to the tricky way we loaded stuff, that's enough to transpose the complete 4x4 matrix.
mat.r0 = _mm256_unpacklo_pd( t0, t2 ); // 00, 10, 20, 30
mat.r1 = _mm256_unpackhi_pd( t0, t2 ); // 01, 11, 21, 31
mat.r2 = _mm256_unpacklo_pd( t1, t3 ); // 02, 12, 22, 32
mat.r3 = _mm256_unpackhi_pd( t1, t3 ); // 03, 13, 23, 33
}
inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
_mm256_storeu_pd( rdi, mat.r0 );
_mm256_storeu_pd( rdi + stride, mat.r1 );
_mm256_storeu_pd( rdi + stride * 2, mat.r2 );
_mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}
// Transpose 8x8 matrix of double values
void transpose8x8( double* rdi, const double* rsi )
{
Matrix4x4 block;
// Top-left corner
loadTransposed( block, rsi );
store( block, rdi );
#if 1
// Using another instance of the block to support in-place transpose
Matrix4x4 block2;
loadTransposed( block, rsi + 4 ); // top right block
loadTransposed( block2, rsi + 8 * 4 ); // bottom left block
store( block2, rdi + 4 );
store( block, rdi + 8 * 4 );
#else
// Flip the #if if you can guarantee ( rsi != rdi )
// Performance is about the same, but this version uses 4 less vector registers,
// slightly more efficient when some registers need to be backed up / restored.
assert( rsi != rdi );
loadTransposed( block, rsi + 4 );
store( block, rdi + 8 * 4 );
loadTransposed( block, rsi + 8 * 4 );
store( block, rdi + 4 );
#endif
// Bottom-right corner
loadTransposed( block, rsi + 8 * 4 + 4 );
store( block, rdi + 8 * 4 + 4 );
}
For completeness, here’s a version which uses code very similar to your macro; it does half as many loads, the same count of stores, and more shuffles. I have not benchmarked it, but I would expect it to be slightly slower.
struct Matrix4x4
{
__m256d r0, r1, r2, r3;
};
inline void load( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
mat.r0 = _mm256_loadu_pd( rsi );
mat.r1 = _mm256_loadu_pd( rsi + stride );
mat.r2 = _mm256_loadu_pd( rsi + stride * 2 );
mat.r3 = _mm256_loadu_pd( rsi + stride * 3 );
}
inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
_mm256_storeu_pd( rdi, mat.r0 );
_mm256_storeu_pd( rdi + stride, mat.r1 );
_mm256_storeu_pd( rdi + stride * 2, mat.r2 );
_mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}
inline void transpose( Matrix4x4& m4 )
{
// These unpack instructions transpose lanes within 2x2 blocks of the matrix
const __m256d t0 = _mm256_unpacklo_pd( m4.r0, m4.r1 );
const __m256d t1 = _mm256_unpacklo_pd( m4.r2, m4.r3 );
const __m256d t2 = _mm256_unpackhi_pd( m4.r0, m4.r1 );
const __m256d t3 = _mm256_unpackhi_pd( m4.r2, m4.r3 );
// Produce the transposed matrix by combining these blocks
m4.r0 = _mm256_permute2f128_pd( t0, t1, 0x20 );
m4.r1 = _mm256_permute2f128_pd( t2, t3, 0x20 );
m4.r2 = _mm256_permute2f128_pd( t0, t1, 0x31 );
m4.r3 = _mm256_permute2f128_pd( t2, t3, 0x31 );
}
// Transpose 8x8 matrix with double values
void transpose8x8( double* rdi, const double* rsi )
{
Matrix4x4 block;
// Top-left corner
load( block, rsi );
transpose( block );
store( block, rdi );
// Using another instance of the block to support in-place transpose, with very small overhead
Matrix4x4 block2;
load( block, rsi + 4 ); // top right block
load( block2, rsi + 8 * 4 ); // bottom left block
transpose( block2 );
store( block2, rdi + 4 );
transpose( block );
store( block, rdi + 8 * 4 );
// Bottom-right corner
load( block, rsi + 8 * 4 + 4 );
transpose( block );
store( block, rdi + 8 * 4 + 4 );
}
For small matrices where more than 1 row can fit in a single SIMD vector, AVX-512 has very nice 2-input lane-crossing shuffles with 32-bit or 64-bit granularity, with a vector control. (Unlike _mm512_unpacklo_pd which is basically 4 separate 128-bit shuffles.)
A 4x4 double matrix is "only" 128 bytes, two ZMM __m512d vectors, so you only need two vpermt2pd (_mm512_permutex2var_pd) to produce both output vectors: one shuffle per output vector, with both loads and stores being full width. You do need control vector constants, though.
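As a minimal sketch of that 4x4 case (assuming a packed, row-major 4x4 of double; the function name transpose4x4_avx512 is only for illustration):
#include <immintrin.h>
// Whole 4x4 double matrix held in two __m512d vectors; one 2-input
// lane-crossing shuffle per output vector. Indices >= 8 select elements
// from the second source of _mm512_permutex2var_pd.
void transpose4x4_avx512( double* rdi, const double* rsi )
{
    const __m512d in0 = _mm512_loadu_pd( rsi );       // rows 0 and 1
    const __m512d in1 = _mm512_loadu_pd( rsi + 8 );   // rows 2 and 3
    const __m512i idx0 = _mm512_setr_epi64( 0, 4, 8, 12, 1, 5, 9, 13 );   // output rows 0 and 1
    const __m512i idx1 = _mm512_setr_epi64( 2, 6, 10, 14, 3, 7, 11, 15 ); // output rows 2 and 3
    _mm512_storeu_pd( rdi,     _mm512_permutex2var_pd( in0, idx0, in1 ) );
    _mm512_storeu_pd( rdi + 8, _mm512_permutex2var_pd( in0, idx1, in1 ) );
}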
Using 512-bit vector instructions has some downsides (clock speed and execution port throughput), but if your program can spend a lot of time in code that uses 512-bit vectors, there's probably a significant throughput gain from throwing around more data with each instruction, and having more powerful shuffles.
With 256-bit vectors, vpermt2pd ymm would probably not be useful for a 4x4, because for each __m256d output row, each of the 4 elements you want comes from a different input row. So one 2-input shuffle can't produce the output you want.
I think lane-crossing shuffles with less than 128-bit granularity aren't useful unless your matrix is small enough to fit multiple rows in one SIMD vector. See How to transpose a 16x16 matrix using SIMD instructions? for some algorithmic complexity reasoning about 32-bit elements - an 8x8 transpose of 32-bit elements with AVX1 is about the same as an 8x8 of 64-bit elements with AVX-512, where each SIMD vector holds exactly one whole row.
So there is no need for vector constants, just immediate shuffles of 128-bit chunks, and unpacklo/hi.
Transposing an 8x8 with 512-bit vectors (8 doubles) would have the same problem: each output row of 8 doubles needs 1 double from each of 8 input vectors. So ultimately I think you want a similar strategy to Soonts' AVX answer, starting with _mm512_insertf64x4(v, load, 1) as the first step to get the first half of 2 input rows into one vector.
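That first step might look roughly like this (untested; assuming a packed, row-major 8x8 of double at rsi, and pairing rows 0 and 4 here, although the exact pairing depends on the rest of the shuffle network):
// low 4 doubles of row 0 and row 4 end up in one 512-bit vector
__m512d t0 = _mm512_insertf64x4(
    _mm512_castpd256_pd512( _mm256_loadu_pd( rsi ) ),   // row 0, columns 0..3
    _mm256_loadu_pd( rsi + 8 * 4 ), 1 );                // row 4, columns 0..3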
(If you care about KNL / Xeon Phi, @ZBoson's other answer on How to transpose a 16x16 matrix using SIMD instructions? shows some interesting ideas using merge-masking with 1-input shuffles like vpermpd or vpermq, instead of 2-input shuffles like vunpcklpd or vpermt2pd.)
Using wider vectors means fewer loads and stores, and maybe even fewer total shuffles because each one combines more data. But you also have more shuffling work to do, to get all 8 elements of a row into one vector, instead of just loading and storing to different places in chunks half the size of a row. It's not obvious which is better; I'll update this answer if I get around to actually writing the code.
Note that Ice Lake (first consumer CPU with AVX-512) can do 2 loads and 2 stores per clock. It has better shuffle throughput than Skylake-X for some shuffles, but not for any that are useful for this or Soonts' answer. (All of vperm2f128, vunpcklpd and vpermt2pd only run on port 5, for the ymm and zmm versions. https://uops.info/. vinsertf64x4 zmm, mem, 1 is 2 uops for the front-end, and needs a load port and a uop for p0/p5. (Not p1 because it's a 512-bit uop, and see also SIMD instructions lowering CPU frequency).)

Loading and transposing eight 8-element float vectors

In a tight loop running a DSP algorithm I need to load eight 8-element float vectors, given a base data pointer and offsets in an AVX2 integer register. My current fastest code looks like this:
void LoadTransposed(
const float* data, __m256i offsets,
__m256& v0, __m256& v1, __m256& v2, __m256& v3, __m256& v4, __m256& v5, __m256& v6, __m256& v7)
{
const __m128i offsetsLo = _mm256_castsi256_si128(offsets);
const __m128i offsetsHi = _mm256_extracti128_si256(offsets, 1);
__m256 a0 = _mm256_loadu_ps(data + (uint32)_mm_cvtsi128_si32(offsetsLo ));
__m256 a1 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsLo, 1));
__m256 a2 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsLo, 2));
__m256 a3 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsLo, 3));
__m256 a4 = _mm256_loadu_ps(data + (uint32)_mm_cvtsi128_si32(offsetsHi ));
__m256 a5 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsHi, 1));
__m256 a6 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsHi, 2));
__m256 a7 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsHi, 3));
// transpose
const __m256 t0 = _mm256_unpacklo_ps(a0, a1);
const __m256 t1 = _mm256_unpackhi_ps(a0, a1);
const __m256 t2 = _mm256_unpacklo_ps(a2, a3);
const __m256 t3 = _mm256_unpackhi_ps(a2, a3);
const __m256 t4 = _mm256_unpacklo_ps(a4, a5);
const __m256 t5 = _mm256_unpackhi_ps(a4, a5);
const __m256 t6 = _mm256_unpacklo_ps(a6, a7);
const __m256 t7 = _mm256_unpackhi_ps(a6, a7);
__m256 v = _mm256_shuffle_ps(t0, t2, 0x4E);
const __m256 tt0 = _mm256_blend_ps(t0, v, 0xCC);
const __m256 tt1 = _mm256_blend_ps(t2, v, 0x33);
v = _mm256_shuffle_ps(t1, t3, 0x4E);
const __m256 tt2 = _mm256_blend_ps(t1, v, 0xCC);
const __m256 tt3 = _mm256_blend_ps(t3, v, 0x33);
v = _mm256_shuffle_ps(t4, t6, 0x4E);
const __m256 tt4 = _mm256_blend_ps(t4, v, 0xCC);
const __m256 tt5 = _mm256_blend_ps(t6, v, 0x33);
v = _mm256_shuffle_ps(t5, t7, 0x4E);
const __m256 tt6 = _mm256_blend_ps(t5, v, 0xCC);
const __m256 tt7 = _mm256_blend_ps(t7, v, 0x33);
v0 = _mm256_permute2f128_ps(tt0, tt4, 0x20);
v1 = _mm256_permute2f128_ps(tt1, tt5, 0x20);
v2 = _mm256_permute2f128_ps(tt2, tt6, 0x20);
v3 = _mm256_permute2f128_ps(tt3, tt7, 0x20);
v4 = _mm256_permute2f128_ps(tt0, tt4, 0x31);
v5 = _mm256_permute2f128_ps(tt1, tt5, 0x31);
v6 = _mm256_permute2f128_ps(tt2, tt6, 0x31);
v7 = _mm256_permute2f128_ps(tt3, tt7, 0x31);
}
As you can see, I'm already using blends instead of shuffles to reduce port 5 pressure. I also opted for _mm_cvtsi128_si32 when extracting the 1st vector element, which is only 1 uop, instead of the 2 uops of the inconspicuous _mm_extract_epi32. Also, extracting the lower and higher lanes manually seems to help the compiler a bit and removes redundant vextracti128 instructions.
I've tried equivalent code using gather instructions, which as predicted turned out to be 2x slower, as it's doing effectively 64 loads under the hood:
void LoadTransposed_Gather(
const float* data, __m256i offsets,
__m256& v0, __m256& v1, __m256& v2, __m256& v3, __m256& v4, __m256& v5, __m256& v6, __m256& v7)
{
v0 = _mm256_i32gather_ps(data + 0, offsets, 4);
v1 = _mm256_i32gather_ps(data + 1, offsets, 4);
v2 = _mm256_i32gather_ps(data + 2, offsets, 4);
v3 = _mm256_i32gather_ps(data + 3, offsets, 4);
v4 = _mm256_i32gather_ps(data + 4, offsets, 4);
v5 = _mm256_i32gather_ps(data + 5, offsets, 4);
v6 = _mm256_i32gather_ps(data + 6, offsets, 4);
v7 = _mm256_i32gather_ps(data + 7, offsets, 4);
}
Is there any way to speed this (the former snippet) up even further? According to VTune and IACA, the biggest offender is high port 0 and 5 pressure (probably due to vpextrd used during offset extraction from __m128i registers and all the vunpckhps, vunpcklps and vshufps used during transpose).
Do your offsets have a pattern, like a fixed stride that you could just scale?
If not, perhaps pass them around as a struct instead of an __m256i if you're just going to need to extract them anyway?
Or if you're using SIMD to calculate the offsets (so they're naturally in an __m256i in the first place): store/reload to a local array when you need all 8 elements would save shuffle port bandwidth. Maybe use _mm_cvtsi128_si32 / _mm_extract_epi32(offsetsLo, 1) to get the first 1 or 2 offsets via ALU operations, with a couple of cycles lower latency than store -> reload store forwarding.
e.g. alignas(32) uint32_t offsets[8]; and _mm256_store_si256 into it. (With some compilers, you may need to stop them from "optimizing" that into ALU extracts; you can use volatile on the array as a nasty hack to work around that. But be careful not to defeat optimization more than necessary, e.g. load into tmp vars instead of accessing the volatile array multiple times if you do want each element more than once. volatile will always defeat constant-propagation, and for FP it will defeat things like using the low element of a vector as a scalar with no shuffle necessary.)
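A sketch of what that could look like inside LoadTransposed (untested; off is just a local name for the spilled offsets, and the transpose stage stays exactly as in the question):
alignas(32) uint32_t off[8];
_mm256_store_si256( (__m256i*)off, offsets );   // one vector store...
__m256 a0 = _mm256_loadu_ps(data + off[0]);     // ...then 8 cheap scalar reloads
__m256 a1 = _mm256_loadu_ps(data + off[1]);
__m256 a2 = _mm256_loadu_ps(data + off[2]);
__m256 a3 = _mm256_loadu_ps(data + off[3]);
__m256 a4 = _mm256_loadu_ps(data + off[4]);
__m256 a5 = _mm256_loadu_ps(data + off[5]);
__m256 a6 = _mm256_loadu_ps(data + off[6]);
__m256 a7 = _mm256_loadu_ps(data + off[7]);
// transpose stage unchanged from the question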
2/clock load throughput, and efficient store forwarding from a vector store to scalar reloads of 32-bit elements makes this good (maybe 7 cycle latency IIRC, for a 256-bit store).
Especially if you're doing this transpose in a loop with other ALU work on the transpose result, so the loop mostly bottlenecks on port 5 in the back-end. The extra load uops shouldn't bottleneck on load ports, especially if there are any L1d cache misses. (In which case replays cost extra cycles on the ports for instructions that consume the load results, not on the load uops themselves.)
Also fewer front-end uops:
1 store (p237+p4 micro-fused) + 1 vmovd (p0) + 7 loads (p23) is only 9 total front-end (fused-domain) uops
vs. vextracti128 + 2x vmovd + 6x vpextrd = 15 ALU uops for port 0 and port 5
Store/reload is fine on Zen/Zen2 as well.
IceLake has more ALU shuffle throughput (some vector shuffles can run on another port as well as p5) but store/reload is still a good strategy when you need all the elements and there are 8 of them. Especially for throughput at a small cost in latency.
@Witek902 reports (in comments) that @chtz's suggestion of building the transpose out of vmovups xmm + vinsertf128 reduces the port 5 shuffle throughput bottleneck on HSW / SKL and gives a speedup in practice. vinsertf128 y,y,mem,i is 2 uops (can't micro-fuse) for p015 + p23 on Intel, so it's more like a blend, not needing the shuffle port. (It's also going to be excellent on Bulldozer-family / Zen1, which handle YMM regs as two 128-bit halves.)
Doing only 128-bit loads is also nice for Sandybridge / IvyBridge, where misaligned 256-bit loads are extra expensive.
And on any CPU; if an offset happens to be an odd multiple of 16-byte alignment, neither 128-bit load will cross a cache-line boundary. So no uop replays of dependent ALU uops creating extra back-end port pressure.
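A rough sketch of that load pattern for one pair of rows (untested; off[] is assumed to already hold the extracted offsets, e.g. from the store/reload above):
// rows 0 and 4: their 128-bit halves are combined at load time, so the final
// _mm256_permute2f128_ps stage of the transpose is not needed for them
__m128 r0lo = _mm_loadu_ps(data + off[0]);      // row 0, elements 0..3
__m128 r4lo = _mm_loadu_ps(data + off[4]);      // row 4, elements 0..3
__m256 a04  = _mm256_insertf128_ps(_mm256_castps128_ps256(r0lo), r4lo, 1);
__m128 r0hi = _mm_loadu_ps(data + off[0] + 4);  // row 0, elements 4..7
__m128 r4hi = _mm_loadu_ps(data + off[4] + 4);  // row 4, elements 4..7
__m256 b04  = _mm256_insertf128_ps(_mm256_castps128_ps256(r0hi), r4hi, 1);
// likewise for rows 1/5, 2/6 and 3/7, then the same unpack/shuffle/blend
// stage as in the question, minus the final permute2f128 step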

Destructuring records into custom variable names

Consider this function:
mix : Color -> Color -> Color
mix c1 c2 =
let
{ red, green, blue, alpha } = toRgb c1
{ red, green, blue, alpha } = toRgb c2
in
...
The above won't work because it's introducing duplicate variable names. Is it possible to destructure the above values into r1, r2, g1, g2, etc?
For clarification, toRgb has this signature:
toRgb : Color -> { red:Int, green:Int, blue:Int, alpha:Float }
A hypothetical syntax might express better what I'd like to do:
mix : Color -> Color -> Color
mix c1 c2 =
let
{ red as r1, green as g1, blue as b1, alpha as a1 } = toRgb c1
{ red as r2, green as g2, blue as b2, alpha as a2 } = toRgb c2
in
...
I had trouble figuring out if this was possible and realized that field accessors are so powerful that it didn't matter.
Presumably your code might look something like:
mix : Color -> Color -> Color
mix c1 c2 =
{ red = avg c1.red c2.red
, green = avg c1.green c2.green
, blue = avg c1.blue c2.blue
, alpha = avg c1.alpha c2.alpha
}
Not so terrible or unreadable. BUT, you could even do something like:
mix : Color -> Color -> Color
mix c1 c2 =
{ red = avg .red c1 c2
, green = avg .green c1 c2
, blue = avg .blue c1 c2
, alpha = avg .alpha c1 c2
}
Are those worse than:
mix : Color -> Color -> Color
mix c1 c2 =
let
{ red1, green1, blue1, alpha1 } = toRgb c1
{ red2, green2, blue2, alpha2 } = toRgb c2
in
{ red = avg red1 red2
, green = avg green1 green2
, blue = avg blue1 blue2
, alpha = avg alpha1 alpha2
}
EDIT:
I did not realize Color is part of Core, so here is an edit.
You can destructure a record by its property names.
When you have multiple values of the same record type, you need a helper; in the following example, I defined toTuple to do that.
import Color exposing (Color)
toTuple {red, green, blue, alpha}
= (red, green, blue, alpha)
mix : Color -> Color -> Color
mix c1 c2 =
let
(r1, g1, b1, a1) = toTuple <| Color.toRgb c1
(r2, g2, b2, a2) = toTuple <| Color.toRgb c2
in
Color.rgba
(avg r1 r2)
(avg g1 g2)
(avg b1 b2)
(avgf a1 a2)
avg i j = (i + j) // 2
avgf p q = 0.5 * (p + q)
Original:
I'm not sure this is what you are looking for, but you do not need to convert it to a record.
case ... of allows you to pattern match via the constructor function, e.g.
type Color = RGB Int Int Int
purple = RGB 255 0 255
printRedVal =
case purple of
RGB r g b -> text (toString r)

assembly asm code, how to load data from different source points?

I tried to improve some code, but it seems quite difficult to me.
I develop with the Android NDK.
The C++ code I want to improve follows:
unsigned int test_add_C(unsigned int *x, unsigned int *y) {
unsigned int result = 0;
for (int i = 0; i < 8; i++) {
result += x[i] * y[i];
}
return result;
}
and neon code:
unsigned int test_add_neon(unsigned *x, unsigned *y) {
unsigned int result;
__asm__ __volatile__(
"vld1.32 {d2-d5}, [%[x]] \n\t"
"vld1.32 {d6-d9}, [%[y]]! \n\t"
"vmul.s32 d0, d2, d6 \n\t"
"vmla.s32 d0, d3, d7 \n\t"
"vmla.s32 d0, d4, d8 \n\t"
"vmla.s32 d0, d5, d9 \n\t"
"vpadd.s32 d0, d0 \n\t"
"vmov %0, r4, d0 \n\t"
:"=r"(result)
:"r"(x)
:"d0", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "r4"
);
return result;
}
But when I compile the code, it reports undefined named operands 'x' and 'y'.
I don't know how to load the data from the arrays x and y.
Can someone help me?
Thanks a lot.
Variable names inside inline assembly can't be "seen" by the compiler, and must be included in the input/output operands list.
Changing the line
:"r"(x)
to
:[x]"r"(x),[y]"r"(y)
will fix your 'undefined named operand' problem. However, I see a few more potential issues right away.
First, the datatype s32 of your multiplication instructions should be u32, since you specify x and y are of unsigned int type.
Second, you post-increment y but not x in the lines
"vld1.32 {d2-d5}, [%[x]] \n\t"
"vld1.32 {d6-d9}, [%[y]]! \n\t"
Unless this is on purpose, it is better to be consistent.
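Putting those fixes together (named operands, u32 data types, no stray post-increment), and adding a "memory" clobber because the asm reads the arrays through pointers, a corrected version might look like the sketch below. I have also written the horizontal add and the final move with explicit operands; treat it as an untested starting point:
unsigned int test_add_neon(unsigned int *x, unsigned int *y) {
    unsigned int result;
    __asm__ __volatile__(
        "vld1.32 {d2-d5}, [%[x]]  \n\t"   // x[0..7]
        "vld1.32 {d6-d9}, [%[y]]  \n\t"   // y[0..7]
        "vmul.u32 d0, d2, d6      \n\t"   // d0  = x[0..1] * y[0..1]
        "vmla.u32 d0, d3, d7      \n\t"   // d0 += x[2..3] * y[2..3]
        "vmla.u32 d0, d4, d8      \n\t"   // d0 += x[4..5] * y[4..5]
        "vmla.u32 d0, d5, d9      \n\t"   // d0 += x[6..7] * y[6..7]
        "vpadd.i32 d0, d0, d0     \n\t"   // d0[0] = d0[0] + d0[1]
        "vmov.32 %[res], d0[0]    \n\t"   // result = d0[0]
        : [res] "=r" (result)
        : [x] "r" (x), [y] "r" (y)
        : "d0", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "memory"
    );
    return result;
}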

2nd order IIR filter, coefficients for a butterworth bandpass (EQ)?

Important update: I already figured out the answers and put them in this simple open-source library: http://bartolsthoorn.github.com/NVDSP/ Check it out, it will probably save you quite some time if you're having trouble with audio filters in iOS!
I have created a (realtime) audio buffer (float *data) that holds a few sin(theta) waves with different frequencies.
The code below shows how I created my buffer, and I've tried to apply a bandpass filter, but it just turns the signals into noise/blips:
// Multiple signal generator
__block float *phases = nil;
[audioManager setOutputBlock:^(float *data, UInt32 numFrames, UInt32 numChannels)
{
float samplingRate = audioManager.samplingRate;
NSUInteger activeSignalCount = [tones count];
// Initialize phases
if (phases == nil) {
phases = new float[10];
for(int z = 0; z <= 10; z++) {
phases[z] = 0.0;
}
}
// Multiple signals
NSEnumerator * enumerator = [tones objectEnumerator];
id frequency;
UInt32 c = 0;
while(frequency = [enumerator nextObject])
{
for (int i=0; i < numFrames; ++i)
{
for (int iChannel = 0; iChannel < numChannels; ++iChannel)
{
float theta = phases[c] * M_PI * 2;
if (c == 0) {
data[i*numChannels + iChannel] = sin(theta);
} else {
data[i*numChannels + iChannel] = data[i*numChannels + iChannel] + sin(theta);
}
}
phases[c] += 1.0 / (samplingRate / [frequency floatValue]);
if (phases[c] > 1.0) phases[c] = -1;
}
c++;
}
// Normalize data with active signal count
float signalMulti = 1.0 / (float(activeSignalCount) * (sqrt(2.0)));
vDSP_vsmul(data, 1, &signalMulti, data, 1, numFrames*numChannels);
// Apply master volume
float volume = masterVolumeSlider.value;
vDSP_vsmul(data, 1, &volume, data, 1, numFrames*numChannels);
if (fxSwitch.isOn) {
// H(s) = (s/Q) / (s^2 + s/Q + 1)
// http://www.musicdsp.org/files/Audio-EQ-Cookbook.txt
// BW 2.0 Q 0.667
// http://www.rane.com/note170.html
//The order of the coefficients are, B1, B2, A1, A2, B0.
float Fs = samplingRate;
float omega = 2*M_PI*Fs; // w0 = 2*pi*f0/Fs
float Q = 0.50f;
float alpha = sin(omega)/(2*Q); // sin(w0)/(2*Q)
// Through H
for (int i=0; i < numFrames; ++i)
{
for (int iChannel = 0; iChannel < numChannels; ++iChannel)
{
data[i*numChannels + iChannel] = (data[i*numChannels + iChannel]/Q) / (pow(data[i*numChannels + iChannel],2) + data[i*numChannels + iChannel]/Q + 1);
}
}
float b0 = alpha;
float b1 = 0;
float b2 = -alpha;
float a0 = 1 + alpha;
float a1 = -2*cos(omega);
float a2 = 1 - alpha;
float *coefficients = (float *) calloc(5, sizeof(float));
coefficients[0] = b1;
coefficients[1] = b2;
coefficients[2] = a1;
coefficients[3] = a2;
coefficients[3] = b0;
vDSP_deq22(data, 2, coefficients, data, 2, numFrames);
free(coefficients);
}
// Measure dB
[self measureDB:data:numFrames:numChannels];
}];
My aim is to make a 10-band EQ for this buffer, using vDSP_deq22, the syntax of the method is:
vDSP_deq22(<float *vDSP_A>, <vDSP_Stride vDSP_I>, <float *vDSP_B>, <float *vDSP_C>, <vDSP_Stride vDSP_K>, <vDSP_Length __vDSP_N>)
See: http://developer.apple.com/library/mac/#documentation/Accelerate/Reference/vDSPRef/Reference/reference.html#//apple_ref/doc/c_ref/vDSP_deq22
Arguments:
float *vDSP_A is the input data
float *vDSP_B are 5 filter coefficients
float *vDSP_C is the output data
I have to make 10 filters (10 times vDSP_deq22). Then I set the gain for every band and combine them back together. But what coefficients do I feed every filter? I know vDSP_deq22 is a 2nd order (butterworth) IIR filter, but how do I turn this into a bandpass?
Now I have three questions:
a) Do I have to de-interleave and re-interleave the audio buffer? I know setting the stride to 2 filters just one channel, but how do I filter the other? A stride of 1 would process both channels as one.
b) Do I have to transform/process the buffer before it enters the vDSP_deq22 method? If so, do I also have to transform it back to normal?
c) What values of the coefficients should I set to the 10 vDSP_deq22s?
I've been trying for days now but I haven't been able to figure this one out, please help me out!
Your omega value needs to be normalised, i.e. expressed as a fraction of Fs - it looks like you left out the f0 when you calculated omega, which will make alpha wrong too:
float omega = 2*M_PI*Fs; // w0 = 2*pi*f0/Fs
should probably be:
float omega = 2*M_PI*f0/Fs; // w0 = 2*pi*f0/Fs
where f0 is the centre frequency in Hz.
For your 10 band equaliser you'll need to pick 10 values of f0, spaced logarithmically, e.g. 25 Hz, 50 Hz, 100 Hz, 200 Hz, 400 Hz, 800 Hz, 1.6 kHz, 3.2 kHz, 6.4 kHz, 12.8 kHz.
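For one band, the corrected coefficient computation (the band-pass with constant 0 dB peak gain from the Audio EQ Cookbook you linked) might look like the sketch below. Two things in it are my assumptions rather than verified facts: the coefficients are normalised by a0, and the packing order expected by vDSP_deq22 is { b0, b1, b2, a1, a2 }, so double-check both against Apple's vDSP_deq22 documentation. The call itself just mirrors the one in your question.
float Fs = samplingRate;
float f0 = 1600.0f;                   // centre frequency of this band, in Hz
float Q  = 0.5f;
float w0    = 2.0f * M_PI * f0 / Fs;  // normalised angular frequency
float alpha = sinf(w0) / (2.0f * Q);
float a0 = 1.0f + alpha;
float coefficients[5] = {
    ( alpha)            / a0,         // b0
      0.0f,                           // b1
    (-alpha)            / a0,         // b2
    (-2.0f * cosf(w0))  / a0,         // a1
    ( 1.0f - alpha)     / a0          // a2
};
vDSP_deq22(data, 2, coefficients, data, 2, numFrames);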