Transpose 8x8 64-bits matrix - optimization

Targeting AVX2, what is a fastest way to transpose a 8x8 matrix containing 64-bits integers (or doubles)?
I searched though this site and I found several ways of doing 8x8 transpose but mostly for 32-bits floats. So I'm mainly asking because I'm not sure whether the principles that made those algorithms fast readily translate to 64-bits and second, apparently AVX2 only has 16 registers so only loading all the values would take up all the registers.
One way of doing it would be to call 2x2 _MM_TRANSPOSE4_PD but I was wondering whether this is optimal:
#define _MM_TRANSPOSE4_PD(row0,row1,row2,row3) \
{ \
__m256d tmp3, tmp2, tmp1, tmp0; \
\
tmp0 = _mm256_shuffle_pd((row0),(row1), 0x0); \
tmp2 = _mm256_shuffle_pd((row0),(row1), 0xF); \
tmp1 = _mm256_shuffle_pd((row2),(row3), 0x0); \
tmp3 = _mm256_shuffle_pd((row2),(row3), 0xF); \
\
(row0) = _mm256_permute2f128_pd(tmp0, tmp1, 0x20); \
(row1) = _mm256_permute2f128_pd(tmp2, tmp3, 0x20); \
(row2) = _mm256_permute2f128_pd(tmp0, tmp1, 0x31); \
(row3) = _mm256_permute2f128_pd(tmp2, tmp3, 0x31); \
}
Still assuming AVX2, is transposing double[8][8] and int64_t[8][8] largely the same, in principle?
PS: And just being curious, having AVX512 would change the things substantially, correct?

After some thoughts and discussion in the comments, I think this is the most efficient version, at least when source and destination data is in RAM. It does not require AVX2, AVX1 is enough.
The main idea, modern CPUs can do twice as many load micro-ops compared to stores, and on many CPUs loading stuff into higher half of vectors with vinsertf128 has same cost as regular 16-byte load. Compared to your macro, this version no longer needs these relatively expensive (3 cycles of latency on most CPUs) vperm2f128 shuffles.
struct Matrix4x4
{
__m256d r0, r1, r2, r3;
};
inline void loadTransposed( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
// Load top half of the matrix into low half of 4 registers
__m256d t0 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) ); // 00, 01
__m256d t1 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 02, 03
rsi += stride;
__m256d t2 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) ); // 10, 11
__m256d t3 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 12, 13
rsi += stride;
// Load bottom half of the matrix into high half of these registers
t0 = _mm256_insertf128_pd( t0, _mm_loadu_pd( rsi ), 1 ); // 00, 01, 20, 21
t1 = _mm256_insertf128_pd( t1, _mm_loadu_pd( rsi + 2 ), 1 );// 02, 03, 22, 23
rsi += stride;
t2 = _mm256_insertf128_pd( t2, _mm_loadu_pd( rsi ), 1 ); // 10, 11, 30, 31
t3 = _mm256_insertf128_pd( t3, _mm_loadu_pd( rsi + 2 ), 1 );// 12, 13, 32, 33
// Transpose 2x2 blocks in registers.
// Due to the tricky way we loaded stuff, that's enough to transpose the complete 4x4 matrix.
mat.r0 = _mm256_unpacklo_pd( t0, t2 ); // 00, 10, 20, 30
mat.r1 = _mm256_unpackhi_pd( t0, t2 ); // 01, 11, 21, 31
mat.r2 = _mm256_unpacklo_pd( t1, t3 ); // 02, 12, 22, 32
mat.r3 = _mm256_unpackhi_pd( t1, t3 ); // 03, 13, 23, 33
}
inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
_mm256_storeu_pd( rdi, mat.r0 );
_mm256_storeu_pd( rdi + stride, mat.r1 );
_mm256_storeu_pd( rdi + stride * 2, mat.r2 );
_mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}
// Transpose 8x8 matrix of double values
void transpose8x8( double* rdi, const double* rsi )
{
Matrix4x4 block;
// Top-left corner
loadTransposed( block, rsi );
store( block, rdi );
#if 1
// Using another instance of the block to support in-place transpose
Matrix4x4 block2;
loadTransposed( block, rsi + 4 ); // top right block
loadTransposed( block2, rsi + 8 * 4 ); // bottom left block
store( block2, rdi + 4 );
store( block, rdi + 8 * 4 );
#else
// Flip the #if if you can guarantee ( rsi != rdi )
// Performance is about the same, but this version uses 4 less vector registers,
// slightly more efficient when some registers need to be backed up / restored.
assert( rsi != rdi );
loadTransposed( block, rsi + 4 );
store( block, rdi + 8 * 4 );
loadTransposed( block, rsi + 8 * 4 );
store( block, rdi + 4 );
#endif
// Bottom-right corner
loadTransposed( block, rsi + 8 * 4 + 4 );
store( block, rdi + 8 * 4 + 4 );
}
For completeness, here’s a version which uses the code very similar to your macro, does twice as few loads, same count of stores, and more shuffles. Have not benchmarked but I would expect it to be slightly slower.
struct Matrix4x4
{
__m256d r0, r1, r2, r3;
};
inline void load( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
mat.r0 = _mm256_loadu_pd( rsi );
mat.r1 = _mm256_loadu_pd( rsi + stride );
mat.r2 = _mm256_loadu_pd( rsi + stride * 2 );
mat.r3 = _mm256_loadu_pd( rsi + stride * 3 );
}
inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
_mm256_storeu_pd( rdi, mat.r0 );
_mm256_storeu_pd( rdi + stride, mat.r1 );
_mm256_storeu_pd( rdi + stride * 2, mat.r2 );
_mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}
inline void transpose( Matrix4x4& m4 )
{
// These unpack instructions transpose lanes within 2x2 blocks of the matrix
const __m256d t0 = _mm256_unpacklo_pd( m4.r0, m4.r1 );
const __m256d t1 = _mm256_unpacklo_pd( m4.r2, m4.r3 );
const __m256d t2 = _mm256_unpackhi_pd( m4.r0, m4.r1 );
const __m256d t3 = _mm256_unpackhi_pd( m4.r2, m4.r3 );
// Produce the transposed matrix by combining these blocks
m4.r0 = _mm256_permute2f128_pd( t0, t1, 0x20 );
m4.r1 = _mm256_permute2f128_pd( t2, t3, 0x20 );
m4.r2 = _mm256_permute2f128_pd( t0, t1, 0x31 );
m4.r3 = _mm256_permute2f128_pd( t2, t3, 0x31 );
}
// Transpose 8x8 matrix with double values
void transpose8x8( double* rdi, const double* rsi )
{
Matrix4x4 block;
// Top-left corner
load( block, rsi );
transpose( block );
store( block, rdi );
// Using another instance of the block to support in-place transpose, with very small overhead
Matrix4x4 block2;
load( block, rsi + 4 ); // top right block
load( block2, rsi + 8 * 4 ); // bottom left block
transpose( block2 );
store( block2, rdi + 4 );
transpose( block );
store( block, rdi + 8 * 4 );
// Bottom-right corner
load( block, rsi + 8 * 4 + 4 );
transpose( block );
store( block, rdi + 8 * 4 + 4 );
}

For small matrices where more than 1 row can fit in a single SIMD vector, AVX-512 has very nice 2-input lane-crossing shuffles with 32-bit or 64-bit granularity, with a vector control. (Unlike _mm512_unpacklo_pd which is basically 4 separate 128-bit shuffles.)
A 4x4 double matrix is "only" 128 bytes, two ZMM __m512d vectors, so you only need two vpermt2ps (_mm512_permutex2var_pd) to produce both output vectors: one shuffle per output vector, with both loads and stores being full width. You do need control vector constants, though.
Using 512-bit vector instructions has some downsides (clock speed and execution port throughput), but if your program can spend a lot of time in code that uses 512-bit vectors, there's probably a significant throughput gain from throwing around more data with each instruction, and having more powerful shuffles.
With 256-bit vectors, vpermt2pd ymm would probably not be useful for a 4x4, because for each __m256d output row, each of the 4 elements you want comes from a different input row. So one 2-input shuffle can't produce the output you want.
I think lane-crossing shuffles with less than 128-bit granularity aren't useful unless your matrix is small enough to fit multiple rows in one SIMD vector. See How to transpose a 16x16 matrix using SIMD instructions? for some algorithmic complexity reasoning about 32-bit elements - an 8x8 xpose of 32-bit elements with AVX1 is about the same as an 8x8 of 64-bit elements with AVX-512, where each SIMD vector holds exactly one whole row.
So no need for vector constants, just immediate shuffles of 128-bit chunks, and unpacklo/hi
Transposing an 8x8 with 512-bit vectors (8 doubles) would have the same problem: each output row of 8 doubles needs 1 double from each of 8 input vectors. So ultimately I think you want a similar strategy to Soonts' AVX answer, starting with _mm512_insertf64x4(v, load, 1) as the first step to get the first half of 2 input rows into one vector.
(If you care about KNL / Xeon Phi, #ZBoson's other answer on How to transpose a 16x16 matrix using SIMD instructions? shows some interesting ideas using merge-masking with 1-input shuffles like vpermpd or vpermq, instead of 2-input shuffles like vunpcklpd or vpermt2pd)
Using wider vectors means fewer loads and stores, and maybe even fewer total shuffles because each one combines more data. But you also have more shuffling work to do, to get all 8 elements of a row into one vector, instead of just loading and storing to different places in chunks half the size of a row. It's not obvious is better; I'll update this answer if I get around to actually writing the code.
Note that Ice Lake (first consumer CPU with AVX-512) can do 2 loads and 2 stores per clock. It has better shuffle throughput than Skylake-X for some shuffles, but not for any that are useful for this or Soonts' answer. (All of vperm2f128, vunpcklpd and vpermt2pd only run on port 5, for the ymm and zmm versions. https://uops.info/. vinsertf64x4 zmm, mem, 1 is 2 uops for the front-end, and needs a load port and a uop for p0/p5. (Not p1 because it's a 512-bit uop, and see also SIMD instructions lowering CPU frequency).)

Related

Loading and transposing eight 8-element float vectors

In one of a tight loop running a DSP algorithm I need to load eight 8-element float vectors given a base data pointer and offsets in AVX2 integer register. My current fastest code looks like this:
void LoadTransposed(
const float* data, __m256i offsets,
__m256& v0, __m256& v1, __m256& v2, __m256& v3, __m256& v4, __m256& v5, __m256& v6, __m256& v7)
{
const __m128i offsetsLo = _mm256_castsi256_si128(offsets);
const __m128i offsetsHi = _mm256_extracti128_si256(offsets, 1);
__m256 a0 = _mm256_loadu_ps(data + (uint32)_mm_cvtsi128_si32(offsetsLo ));
__m256 a1 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsLo, 1));
__m256 a2 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsLo, 2));
__m256 a3 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsLo, 3));
__m256 a4 = _mm256_loadu_ps(data + (uint32)_mm_cvtsi128_si32(offsetsHi ));
__m256 a5 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsHi, 1));
__m256 a6 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsHi, 2));
__m256 a7 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsHi, 3));
// transpose
const __m256 t0 = _mm256_unpacklo_ps(a0, a1);
const __m256 t1 = _mm256_unpackhi_ps(a0, a1);
const __m256 t2 = _mm256_unpacklo_ps(a2, a3);
const __m256 t3 = _mm256_unpackhi_ps(a2, a3);
const __m256 t4 = _mm256_unpacklo_ps(a4, a5);
const __m256 t5 = _mm256_unpackhi_ps(a4, a5);
const __m256 t6 = _mm256_unpacklo_ps(a6, a7);
const __m256 t7 = _mm256_unpackhi_ps(a6, a7);
__m256 v = _mm256_shuffle_ps(t0, t2, 0x4E);
const __m256 tt0 = _mm256_blend_ps(t0, v, 0xCC);
const __m256 tt1 = _mm256_blend_ps(t2, v, 0x33);
v = _mm256_shuffle_ps(t1, t3, 0x4E);
const __m256 tt2 = _mm256_blend_ps(t1, v, 0xCC);
const __m256 tt3 = _mm256_blend_ps(t3, v, 0x33);
v = _mm256_shuffle_ps(t4, t6, 0x4E);
const __m256 tt4 = _mm256_blend_ps(t4, v, 0xCC);
const __m256 tt5 = _mm256_blend_ps(t6, v, 0x33);
v = _mm256_shuffle_ps(t5, t7, 0x4E);
const __m256 tt6 = _mm256_blend_ps(t5, v, 0xCC);
const __m256 tt7 = _mm256_blend_ps(t7, v, 0x33);
v0 = _mm256_permute2f128_ps(tt0, tt4, 0x20);
v1 = _mm256_permute2f128_ps(tt1, tt5, 0x20);
v2 = _mm256_permute2f128_ps(tt2, tt6, 0x20);
v3 = _mm256_permute2f128_ps(tt3, tt7, 0x20);
v4 = _mm256_permute2f128_ps(tt0, tt4, 0x31);
v5 = _mm256_permute2f128_ps(tt1, tt5, 0x31);
v6 = _mm256_permute2f128_ps(tt2, tt6, 0x31);
v7 = _mm256_permute2f128_ps(tt3, tt7, 0x31);
}
As you can see, I'm already using blends instead of shuffles to reduce port 5 pressure. I also opted for _mm_cvtsi128_si32 when loading extracting 1st vector element, which is only 1uop, instead of 2uops in case of inconspicuous _mm_extract_epi32. Also, extracting the lower and higher lanes manually seems to help the compiler a bit and removes redundant vextracti128 instructions.
I've tried equivalent code using gather instructions, which as predicted turned out to be 2x slower, as it's doing effectively 64 loads under the hood:
void LoadTransposed_Gather(
const float* data, __m256i offsets,
__m256& v0, __m256& v1, __m256& v2, __m256& v3, __m256& v4, __m256& v5, __m256& v6, __m256& v7)
{
v0 = _mm256_i32gather_ps(data + 0, offsets, 4);
v1 = _mm256_i32gather_ps(data + 1, offsets, 4);
v2 = _mm256_i32gather_ps(data + 2, offsets, 4);
v3 = _mm256_i32gather_ps(data + 3, offsets, 4);
v4 = _mm256_i32gather_ps(data + 4, offsets, 4);
v5 = _mm256_i32gather_ps(data + 5, offsets, 4);
v6 = _mm256_i32gather_ps(data + 6, offsets, 4);
v7 = _mm256_i32gather_ps(data + 7, offsets, 4);
}
Is there any way to speed this (the former snippet) up even further? According to VTune and IACA, the biggest offender is high port 0 and 5 pressure (probably due to vpextrd used during offset extraction from __m128i registers and all the vunpckhps, vunpcklps and vshufps used during transpose).
Do your offsets have a pattern, like a fixed stride that you could just scale?
If not, perhaps pass them around as a struct instead of an __m256i if you're just going to need to extract them anyway?
Or if you're using SIMD to calculate the offsets (so they're naturally in a __m256i in the first place): store/reload to a local array When you need all 8 elements would save shuffle port bandwidth. Maybe _mm_cvtsi128_si32 / _mm_extract_epi32(offsetsLo, 1)) to get the first 1 or 2 offsets via ALU operations, with a couple cycles lower latency than store -> reload store forwarding.
e.g. alignas(32) uint32_t offsets[8]; and _mm256_store_si256 into it. (With some compilers, you may need to stop it from "optimizing" that into ALU extracts. You can use volatile on the array as a nasty hack to work around that. (But be careful not to defeat optimization more than necessary, e.g. load into tmp vars instead of accessing the volatile array multiple times, if you do want each element more than once. This will always defeat constant-propagation, for FP will defeat stuff like using the low element of a vector as a scalar with no shuffle necessary.)
2/clock load throughput, and efficient store forwarding from a vector store to scalar reloads of 32-bit elements makes this good (maybe 7 cycle latency IIRC, for a 256-bit store).
Especially if you're doing this transpose in a loop with other ALU work on the transpose result, so the loop mostly bottlenecks on port 5 in the back-end. The extra load uops shouldn't bottleneck on load ports, especially if there are any L1d cache misses. (In which case replays cost extra cycles on ports for instructions that consume the load results, not of load uops themselves).
Also fewer front-end uops:
1 store (p237+p4 micro-fused) + 1 vmovd (p0) + 7 loads (p23) is only 9 total front-end (fused-domain) uops
vs. vextracti128 + 2x vmovd + 6x vpextrd = 15 ALU uops for port 0 and port 5
Store/reload is fine on Zen/Zen2 as well.
IceLake has more ALU shuffle throughput (some vector shuffles can run on another port as well as p5) but store/reload is still a good strategy when you need all the elements and there are 8 of them. Especially for throughput at a small cost in latency.
#Witek902 reports (in comments) that #chtz's suggestion of building the transpose out of vmovups xmm + vinsertf128 reduces the port 5 shuffle throughput bottleneck on HSW / SKL and gives a speedup in practice. vinsertf128 y,y,mem,i is 2 uops (can't micro-fuse) for p015 + p23 on Intel. So it's more like a blend, not needing the shuffle port. (It's also going to be excellent on Bulldozer-family / Zen1 which handle YMM regs as two 128-bit halves.)
Doing only 128-bit loads is also nice for Sandybridge / IvyBridge, where misaligned 256-bit loads are extra expensive.
And on any CPU; if an offset happens to be an odd multiple of 16-byte alignment, neither 128-bit load will cross a cache-line boundary. So no uop replays of dependent ALU uops creating extra back-end port pressure.

Unknown SSE bottleneck

I have a generic code that I am trying to move to SSE to speed it up since it's getting called a lot. The code in question is basically something like this:
for (int i = 1; i < mysize; ++i)
{
buf[i] = myMin(buf[i], buf[i - 1] + offset);
}
where myMin is your simple min function (a < b) ? a : b (I've looked at the disassembly and there are jumps in here)
My SSE code (which I've gone through several iterations to speed up) is at this form now:
float tmpf = *(tmp - 1);
__m128 off = _mm_set_ss(offset);
for (int l = 0; l < mysize; l += 4)
{
__m128 post = _mm_load_ps(tmp);
__m128 pre = _mm_move_ss(post, _mm_set_ss(tmpf));
pre = _mm_shuffle_ps(pre, pre, _MM_SHUFFLE(0, 3, 2, 1));
pre = _mm_add_ss(pre, off);
post = _mm_min_ss(post, pre);
// reversed
pre = _mm_shuffle_ps(post, post, _MM_SHUFFLE(2, 1, 0, 3));
post = _mm_add_ss(post, off );
pre = _mm_min_ss(pre, post);
post = _mm_shuffle_ps(pre, pre, _MM_SHUFFLE(2, 1, 0, 3));
pre = _mm_add_ss(pre, off);
post = _mm_min_ss(post, pre);
// reversed
pre = _mm_shuffle_ps(post, post, _MM_SHUFFLE(2, 1, 0, 3));
post = _mm_add_ss(post, off);
pre = _mm_min_ss(pre, post);
post = _mm_shuffle_ps(pre, pre, _MM_SHUFFLE(2, 1, 0, 3));
_mm_store_ps(tmp, post);
tmpf = tmp[3];
tmp += 4;
}
Ignoring any edge case scenarios, which I've handled fine, and overhead for those are negligible due to size of buf/tmp, can anyone explain why the SSE version is slower by 2x? VTune keeps attributing it to L1 misses, but as I can see, it should be make 4x less trips to L1 and no branches/jumps, so it should be faster, but it's not. What am I mistaking here?
Thanks
EDIT:
So I did find something else in a separate test case. I didn't think this would matter but alas it did. So mysize above is actually not that big (about 30-50), but there are a LOT of these and they are all being done serially. In that case, the ternary expression is faster than SSE. However, if it's reversed with mysize being in millions and there are only 30-50 iterations of them, the SSE version is faster. Any idea why? I would think memory interactions would be the same for both, including pre-emptive prefetching etc...
If this code is performance critical, you'll have to look at the data that you get. It's the serial dependency that's killing you, and you need to get rid of it.
One very small value an buf [i] will influence a lot of the following values. For example, if offset = 1, buf [0] = 0, and all other values are > 1 million, that one value will influence the next one million. On the other hand, that kind of thing might happen very rarely.
If it is rare, they you check fully vectorised whether buf [i] > buf [i] + offset, replace it if it is, and keep track where changes were made, without considering that buf [i] values could trickle upwards. Then you check where changes were made, and re-check them.
In extreme cases, say buf [i] is always between 0 and 1, and offset > 0.5, you know that buf [i] cannot influence buf [i + 2] at all, so you just ignore the serial dependency and do everything in parallel, fully vectorised.
On the other hand, if you have some tiny values in your buffer that influence large numbers of consecutive values, then you start with the first value buf [0] and fully vectorised check whether buf [i] < buf [0] + i * offset, replacing values, until the check fails.
You say "the values can be anything". If that is the case, for example if buf [i] is randomly chosen anywhere between 0 and 1,000,000, and offset is not very large, then you will have elements buf [i] which force lots of following elements to be buf [i] + (k - i) * offset. For example if offset = 1, and you find buf [i] is about 10,000, then it will force on average about 100 values to be equal to buf [i] + (k - i) * offset.
Here's a branchless solution you could try
for (int i = 1; i < mysize; i++) {
float a = buf[i];
float b = buf[i-1] + offset;
buf[i] = b + (a<b)*(a-b);
}
Here is the assembly:
.L6:
addss xmm0, xmm4
movss xmm1, DWORD PTR [rax]
movaps xmm2, xmm1
add rax, 4
movaps xmm3, xmm6
cmpltss xmm2, xmm0
subss xmm1, xmm0
andps xmm3, xmm2
andnps xmm2, xmm5
orps xmm2, xmm3
mulss xmm1, xmm2
addss xmm0, xmm1
movss DWORD PTR [rax-4], xmm0
cmp rax, rdx
jne .L6
But the version with a branch is probably already better
for (int i = 1; i < mysize; i++) {
float a = buf[i];
float b = buf[i-1] + offset;
buf[i] = a<b ? a : b;
}
Here is the assembly
.L15:
addss xmm0, xmm2
movss xmm1, DWORD PTR [rax]
add rax, 4
minss xmm1, xmm0
movss DWORD PTR [rax-4], xmm1
cmp rax, rdx
movaps xmm0, xmm1
jne .L15
This produces code which is branchless anyway using minss (cmp rax, rdx applies to the loop iterator).
Finally, here is code you can be used with MSVC which produces the same assembly as GCC which is branchless
__m128 offset4 = _mm_set1_ps(offset);
for (int i = 1; i < mysize; i++) {
__m128 a = _mm_load_ss(&buf[i]);
__m128 b = _mm_load_ss(&buf[i-1]);
b = _mm_add_ss(b, offset4);
a = _mm_min_ss(a,b);
_mm_store_ss(&buf[i], a);
}
Here is another form you can try which uses a branch
__m128 offset4 = _mm_set1_ps(offset);
for (int i = 1; i < mysize; i++) {
__m128 a = _mm_load_ss(&buf[i]);
__m128 b = _mm_load_ss(&buf[i-1]);
b = _mm_add_ss(b, offset4);
if(_mm_comige_ss(b,a))
_mm_store_ss(&buf[i], b);
}

accelerate framework cepstrum peak find

I'm trying to find peak values of cepstrum analysis with accelerate framework. I get peak values always at the end of or at the beginning of frames. I'm analysing it real-time getting audio from microphone. What is wrong with this my code? My code is below :
OSStatus microphoneInputCallback (void *inRefCon,
AudioUnitRenderActionFlags *ioActionFlags,
const AudioTimeStamp *inTimeStamp,
UInt32 inBusNumber,
UInt32 inNumberFrames,
AudioBufferList *ioData){
// get reference of test app we need for test app attributes
TestApp *this = (TestApp *)inRefCon;
COMPLEX_SPLIT complexArray = this->fftA;
void *dataBuffer = this->dataBuffer;
float *outputBuffer = this->outputBuffer;
FFTSetup fftSetup = this->fftSetup;
uint32_t log2n = this->fftLog2n;
uint32_t n = this->fftN; // 4096
uint32_t nOver2 = this->fftNOver2;
uint32_t stride = 1;
int bufferCapacity = this->fftBufferCapacity; // 4096
SInt16 index = this->fftIndex;
OSStatus renderErr;
// observation objects
float *observerBufferRef = this->observerBuffer;
int observationCountRef = this->observationCount;
renderErr = AudioUnitRender(rioUnit, ioActionFlags,
inTimeStamp, bus1, inNumberFrames, this->bufferList);
if (renderErr < 0) {
return renderErr;
}
// Fill the buffer with our sampled data. If we fill our buffer, run the
// fft.
int read = bufferCapacity - index;
if (read > inNumberFrames) {
memcpy((SInt16 *)dataBuffer + index, this->bufferList->mBuffers[0].mData, inNumberFrames*sizeof(SInt16));
this->fftIndex += inNumberFrames;
} else {
// If we enter this conditional, our buffer will be filled and we should PERFORM FFT.
memcpy((SInt16 *)dataBuffer + index, this->bufferList->mBuffers[0].mData, read*sizeof(SInt16));
// Reset the index.
this->fftIndex = 0;
/*************** FFT ***************/
//multiply by window
vDSP_vmul((SInt16 *)dataBuffer, 1, this->window, 1, this->outputBuffer, 1, n);
// We want to deal with only floating point values here.
vDSP_vflt16((SInt16 *) dataBuffer, stride, (float *) outputBuffer, stride, bufferCapacity );
/**
Look at the real signal as an interleaved complex vector by casting it.
Then call the transformation function vDSP_ctoz to get a split complex
vector, which for a real signal, divides into an even-odd configuration.
*/
vDSP_ctoz((COMPLEX*)outputBuffer, 2, &complexArray, 1, nOver2);
// Carry out a Forward FFT transform.
vDSP_fft_zrip(fftSetup, &complexArray, stride, log2n, FFT_FORWARD);
vDSP_ztoc(&complexArray, 1, (COMPLEX *)outputBuffer, 2, nOver2);
complexArray.imagp[0] = 0.0f;
vDSP_zvmags(&complexArray, 1, complexArray.realp, 1, nOver2);
bzero(complexArray.imagp, (nOver2) * sizeof(float));
// scale
float scale = 1.0f / (2.0f*(float)n);
vDSP_vsmul(complexArray.realp, 1, &scale, complexArray.realp, 1, nOver2);
// step 2 get log for cepstrum
float *logmag = malloc(sizeof(float)*nOver2);
for (int i=0; i < nOver2; i++)
logmag[i] = logf(sqrtf(complexArray.realp[i]));
// configure float array into acceptable input array format (interleaved)
vDSP_ctoz((COMPLEX*)logmag, 2, &complexArray, 1, nOver2);
// create cepstrum
vDSP_fft_zrip(fftSetup, &complexArray, stride, log2n-1, FFT_INVERSE);
//convert interleaved to real
float *displayData = malloc(sizeof(float)*n);
vDSP_ztoc(&complexArray, 1, (COMPLEX*)displayData, 2, nOver2);
float dominantFrequency = 0;
int currentBin = 0;
float dominantFrequencyAmp = 0;
// find peak of cepstrum
for (int i=0; i < nOver2; i++){
//get current frequency magnitude
if (displayData[i] > dominantFrequencyAmp) {
// DLog("Bufferer filled %f", displayData[i]);
dominantFrequencyAmp = displayData[i];
currentBin = i;
}
}
DLog("currentBin : %i amplitude: %f", currentBin, dominantFrequencyAmp);
}
return noErr;
}
I haven't worked with the Accelerate Framework, but your code appears to be taking the proper steps to calculate the Cepstrum.
The Cepstrum of real acoustic signals tends to have a very large DC component, a large peak at and near zero quefrency [sic]. Just ignore the near-DC portion of the Cepstrum and look for peaks above 20 Hz frequency (above quefrency of Cepstrum_Width/20Hz).
If the input signal contains a series of very closely spaced overtones, the Cepstrum will also have a large peak at the high quefrency end.
For example, the plot below shows the Cepstrum of a Dirichlet Kernel of N=128 and Width=4096, the spectrum of which is a series of very closely spaced overtones.
You may want to use a static synthetic signal to test and debug your code. A good choice for a test signal is any sinusoid with a fundamental F and several overtones at exact integer multiples of F.
Your Cepstra should look something like the following examples.
First a synthetic signal.
The plot below shows the Cepstrum of a synthetic steady-state E2 note, synthesized using a typical near-DC component, a fundamental at 82.4 Hz, and 8 harmonics at integer multiples of 82.4 Hz. The synthetic sinusoid was programmed to generate 4096 samples.
Observe the prominent non-DC peak at 12.36. The Cepstrum width is 1024 (the output of the second FFT), therefore the peak corresponds to 1024/12.36 = 82.8 Hz which is very close to 82.4 Hz the true fundamental frequency.
Now a real acoustical signal.
The plot below shows the Cepstrum of a real acoustic guitar's E2 note. The signal was not windowed prior to the first FFT. Observe the prominent non-DC peak at 542.9. The Cepstrum width is 32768 (the output of the second FFT), therefore the peak corresponds to 32768/542.9 = 60.4 Hz which is fairly far from 82.4 Hz the true fundamental frequency.
The plot below shows the Cepstrum of the same real acoustic guitar's E2 note, but this time the signal was Hann windowed prior to the first FFT. Observe the prominent non-DC peak at 268.46. The Cepstrum width is 32768 (the output of the second FFT), therefore the peak corresponds to 32768/268.46 = 122.1 Hz which is even farther from 82.4 Hz the true fundamental frequency.
The acoustic guitar's E2 note used for this analysis was sampled at 44.1 KHz with a high quality microphone under studio conditions, it contains essentially zero background noise, no other instruments or voices, and no post processing.
References:
Real audio signal data, synthetic signal generation, plots, FFT, and Cepstral analysis were done here: Musical instrument cepstrum

Faster way to structure operations on offset neighborhoods in OpenCL

How can an operation on many overlapping but offset blocks of a 2D array be structured for more efficient execution in OpenCL?
For example, I have the following OpenCL kernel:
__kernel void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
}
}
write_imageui(dest, pos, diff);
}
It produces correct results, but is slow... only ~25 GFLOPS on NVS4200M with 1k by 1k input. (The hardware spec is 155 GFLOPS). I'm guessing this has to do with the memory access patterns. Each work item reads one 16x16 block of data which is the same as all its neighbors in a 16x16 area, and also another offset block of data most of the time overlaps with that of its immediate neighbors. All reads are through samplers. The host program is PyOpenCL (I don't think that actually changes anything) and the work-group size is 16x16.
EDIT: New version of kernel per suggestion below, copy work area to local variables:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int dx = pos.x % 16;
int dy = pos.y % 16;
__local uint4 local_src[16*16];
__local uint4 local_src2[32*32];
local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
barrier(CLK_LOCAL_MEM_FENCE);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
}
}
write_imageui(dest, pos, diff);
}
Result: output is correct, running time is 56% slower. If using local_src only (not local_src2), the result is ~10% faster.
EDIT: Benchmarked on much more powerful hardware, AMD Radeon HD 7850 gets 420GFLOPS, spec is 1751GFLOPS. To be fair the spec is for multiply-add, and there is no multiply here so the expected is ~875GFLOPS, but this is still off by quite a lot compared to the theoretical performance.
EDIT: To ease running tests for anyone who would like to try this out, the host-side program in PyOpenCL below:
import pyopencl as cl
import numpy
import numpy.random
from time import time
CL_SOURCE = '''
// kernel goes here
'''
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()
h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255
mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))
# warmup
for n in range(10):
event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
# benchmark
t1 = time()
for n in range(100):
event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
t2 = time()
print "Duration (host): ", (t2-t1)/100
print "Duration (event): ", (event.profile.end-event.profile.start)*1e-9
EDIT: Thinking about the memory access patterns, the original naive version may be pretty good; when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) all work-items in a work group are reading the same location (so this is just one read??), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they are reading sequential locations (so the reads can be coalesced perfectly??).
This is definitely a memory access problem. Neighbouring work items' pixels can overlap by as much as 15x16, and worse yet, each work item will overlap at least 225 others.
I would use local memory and get work groups to cooperatively process many 16x16 blocks. I like to use a large, square block for each work group. Rectangular blocks are a bit more complicated, but can get better memory utilization for you.
If you read blocks of n by n pixels form your source image, the boarders will overlap by nx15 (or 15xn). You need to calculate the largest possible value for n base on your available local memory size (LDS). If you are using opencl 1.1 or greater, the LDS is at least 32kb. opencl 1.0 promises 16kb per work group.
n <= sqrt(32kb / sizeof(uint4))
n <= sqrt(32768 / 16)
n ~ 45
Using n=45 will use 32400 out of 32768 bytes of the LDS, and let you use 900 work items per group (45-15)^2 = 900. Note: Here's where a rectangular block would help out; for example 64x32 would use all of the LDS, but with group size = (64-15)*(32-15) = 833.
steps to use LDS for your kernel:
allocate a 1D or 2D local array for your cached block of the image. I use a #define constant, and it rarely has to change.
read the uint values from your image, and store locally.
adjust 'pos' for each work item to relate to the local memory
execute the same i,j loops you have, but using the local memory to read values. remember that the i and j loops stop 15 short of n.
Each step can be searched online if you are not sure how to implement it, or you can ask me if you need a hand.
Chances are good that the LDS on your device will outperform the texture read speed. This is counter-intuitive, but remember that you are reading tiny amounts of data at a time, so the gpu may not be able to cache the pixels effectively. The LDS usage will guarantee that the pixels are available, and given the number of times each pixel is read, I expect this to make a huge difference.
Please let me know what kind of results you observe.
UPDATE: Here's my attempt to better explain my solution. I used graph paper for my drawings, because I'm not all that great with image manipulation software.
Above is a sketch of how the values were read from src in your first code snippet. The big problem is that the pos0 rectangle -- 16x16 uint4 values -- is being read in its entirety for each work item in the group (256 of them). My solution involves reading a large area and sharing the data for all 256 work groups.
If you store a 31x31 region of your image in local memory, all 256 work items' data will be available.
steps:
use work group dimensions: (16,16)
read the values of src into a large local buffer ie: uint4 buff[31][31]; The buffer needs to be translated such that 'pos0' is at buff[0][0]
barrier(CLK_LOCAL_MEM_FENCE) to wait for memory copy operations
do the same i,j for loops you had originally, except you leave out the pos and pos0 values. only use i and j for the location. Accumulate 'diff' in the same way you were doing so originally.
write the solution to 'dest'
This is the same as my first response to your question, except I use n=16. This value does not utilize the local memory fully, but will probably work well for most platforms. 256 tends to be a common maximum work group size.
I hope this clears things up for you.
Some suggestions:
Compute more than 1 output pixel in each work item. It will increase data reuse.
Benchmark different work-group sizes to maximize the usage of texture cache.
Maybe there is a way to separate the kernel into two passes (horizontal and vertical).
Update: more suggestions
Instead of loading everything in local memory, try loading only the local_src values, and use read_image for the other one.
Since you do almost no computations, you should measure read speed in GB/s, and compare to the peak memory speed.

Optimizing Vector elements swaps using CUDA

Since I am new to cuda .. I need your kind help
I have this long vector, for each group of 24 elements, I need to do the following:
for the first 12 elements, the even numbered elements are multiplied by -1,
for the second 12 elements, the odd numbered elements are multiplied by -1 then the following swap takes place:
Graph: because I don't yet have enough points, I couldn't post the image so here it is:
http://www.freeimagehosting.net/image.php?e4b88fb666.png
I have written this piece of code, and wonder if you could help me further optimize it to solve for divergence or bank conflicts ..
//subvector is a multiple of 24, Mds and Nds are shared memory
____shared____ double Mds[subVector];
____shared____ double Nds[subVector];
int tx = threadIdx.x;
int tx_mod = tx ^ 0x0001;
int basex = __umul24(blockDim.x, blockIdx.x);
Mds[tx] = M.elements[basex + tx];
__syncthreads();
// flip the signs
if (tx < (tx/24)*24 + 12)
{
//if < 12 and even
if ((tx & 0x0001)==0)
Mds[tx] = -Mds[tx];
}
else
if (tx < (tx/24)*24 + 24)
{
//if >12 and < 24 and odd
if ((tx & 0x0001)==1)
Mds[tx] = -Mds[tx];
}
__syncthreads();
if (tx < (tx/24)*24 + 6)
{
//for the first 6 elements .. swap with last six in the 24elements group (see graph)
Nds[tx] = Mds[tx_mod + 18];
Mds [tx_mod + 18] = Mds [tx];
Mds[tx] = Nds[tx];
}
else
if (tx < (tx/24)*24 + 12)
{
// for the second 6 elements .. swp with next adjacent group (see graph)
Nds[tx] = Mds[tx_mod + 6];
Mds [tx_mod + 6] = Mds [tx];
Mds[tx] = Nds[tx];
}
__syncthreads();
Thanks in advance ..
paul gave you pretty good starting points you previous questions.
couple things to watch out for: you are doing non-base 2 division which is expensive.
Instead try to utilize multidimensional nature of the thread block. For example, make the x-dimension of size 24, which will eliminate need for division.
in general, try to fit thread block dimensions to reflect your data dimensions.
simplify sign flipping: for example, if you do not want to flip sign, you can still multiplied by identity 1. Figure out how to map even/odd numbers to 1 and -1 using just arithmetic: for example sign = (even*2+1) - 2 where even is either 1 or 0.