Loading and transposing eight 8-element float vectors - optimization

In one of the tight loops of a DSP algorithm I need to load eight 8-element float vectors, given a base data pointer and offsets in an AVX2 integer register. My current fastest code looks like this:
void LoadTransposed(
const float* data, __m256i offsets,
__m256& v0, __m256& v1, __m256& v2, __m256& v3, __m256& v4, __m256& v5, __m256& v6, __m256& v7)
{
const __m128i offsetsLo = _mm256_castsi256_si128(offsets);
const __m128i offsetsHi = _mm256_extracti128_si256(offsets, 1);
__m256 a0 = _mm256_loadu_ps(data + (uint32)_mm_cvtsi128_si32(offsetsLo ));
__m256 a1 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsLo, 1));
__m256 a2 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsLo, 2));
__m256 a3 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsLo, 3));
__m256 a4 = _mm256_loadu_ps(data + (uint32)_mm_cvtsi128_si32(offsetsHi ));
__m256 a5 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsHi, 1));
__m256 a6 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsHi, 2));
__m256 a7 = _mm256_loadu_ps(data + (uint32)_mm_extract_epi32(offsetsHi, 3));
// transpose
const __m256 t0 = _mm256_unpacklo_ps(a0, a1);
const __m256 t1 = _mm256_unpackhi_ps(a0, a1);
const __m256 t2 = _mm256_unpacklo_ps(a2, a3);
const __m256 t3 = _mm256_unpackhi_ps(a2, a3);
const __m256 t4 = _mm256_unpacklo_ps(a4, a5);
const __m256 t5 = _mm256_unpackhi_ps(a4, a5);
const __m256 t6 = _mm256_unpacklo_ps(a6, a7);
const __m256 t7 = _mm256_unpackhi_ps(a6, a7);
__m256 v = _mm256_shuffle_ps(t0, t2, 0x4E);
const __m256 tt0 = _mm256_blend_ps(t0, v, 0xCC);
const __m256 tt1 = _mm256_blend_ps(t2, v, 0x33);
v = _mm256_shuffle_ps(t1, t3, 0x4E);
const __m256 tt2 = _mm256_blend_ps(t1, v, 0xCC);
const __m256 tt3 = _mm256_blend_ps(t3, v, 0x33);
v = _mm256_shuffle_ps(t4, t6, 0x4E);
const __m256 tt4 = _mm256_blend_ps(t4, v, 0xCC);
const __m256 tt5 = _mm256_blend_ps(t6, v, 0x33);
v = _mm256_shuffle_ps(t5, t7, 0x4E);
const __m256 tt6 = _mm256_blend_ps(t5, v, 0xCC);
const __m256 tt7 = _mm256_blend_ps(t7, v, 0x33);
v0 = _mm256_permute2f128_ps(tt0, tt4, 0x20);
v1 = _mm256_permute2f128_ps(tt1, tt5, 0x20);
v2 = _mm256_permute2f128_ps(tt2, tt6, 0x20);
v3 = _mm256_permute2f128_ps(tt3, tt7, 0x20);
v4 = _mm256_permute2f128_ps(tt0, tt4, 0x31);
v5 = _mm256_permute2f128_ps(tt1, tt5, 0x31);
v6 = _mm256_permute2f128_ps(tt2, tt6, 0x31);
v7 = _mm256_permute2f128_ps(tt3, tt7, 0x31);
}
As you can see, I'm already using blends instead of shuffles to reduce port 5 pressure. I also opted for _mm_cvtsi128_si32 when extracting the first vector element, which is only 1 uop, instead of the 2 uops of the innocuous-looking _mm_extract_epi32. Also, extracting the lower and higher lanes manually seems to help the compiler a bit and removes redundant vextracti128 instructions.
I've tried equivalent code using gather instructions, which, as predicted, turned out to be 2x slower, as it's effectively doing 64 loads under the hood:
void LoadTransposed_Gather(
const float* data, __m256i offsets,
__m256& v0, __m256& v1, __m256& v2, __m256& v3, __m256& v4, __m256& v5, __m256& v6, __m256& v7)
{
v0 = _mm256_i32gather_ps(data + 0, offsets, 4);
v1 = _mm256_i32gather_ps(data + 1, offsets, 4);
v2 = _mm256_i32gather_ps(data + 2, offsets, 4);
v3 = _mm256_i32gather_ps(data + 3, offsets, 4);
v4 = _mm256_i32gather_ps(data + 4, offsets, 4);
v5 = _mm256_i32gather_ps(data + 5, offsets, 4);
v6 = _mm256_i32gather_ps(data + 6, offsets, 4);
v7 = _mm256_i32gather_ps(data + 7, offsets, 4);
}
Is there any way to speed this (the former snippet) up even further? According to VTune and IACA, the biggest offender is high port 0 and port 5 pressure (probably due to the vpextrd used during offset extraction from the __m128i registers, and all the vunpckhps, vunpcklps and vshufps used during the transpose).

Do your offsets have a pattern, like a fixed stride that you could just scale?
If not, perhaps pass them around as a struct instead of an __m256i if you're just going to need to extract them anyway?
Or if you're using SIMD to calculate the offsets (so they're naturally in a __m256i in the first place): store/reload to a local array when you need all 8 elements would save shuffle-port bandwidth. Maybe use _mm_cvtsi128_si32 / _mm_extract_epi32(offsetsLo, 1) to get the first 1 or 2 offsets via ALU operations, with a couple of cycles lower latency than store -> reload store forwarding.
e.g. alignas(32) uint32_t offsets[8]; and _mm256_store_si256 into it. (With some compilers, you may need to stop them from "optimizing" that into ALU extracts. You can use volatile on the array as a nasty hack to work around that, but be careful not to defeat optimization more than necessary: e.g. load into tmp vars instead of accessing the volatile array multiple times if you do want each element more than once. This always defeats constant-propagation, and for FP it also defeats things like using the low element of a vector as a scalar with no shuffle necessary.)
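A minimal sketch of that store/reload idea, leaving the transpose itself unchanged (untested; the volatile workaround mentioned above is omitted, and the helper name is mine):

#include <immintrin.h>
#include <cstdint>

// Produces the same a0..a7 rows as the question's code, but only the first offset
// goes through an ALU extract (vmovd); the rest come back via store forwarding.
static inline void LoadRows(const float* data, __m256i offsets, __m256 (&a)[8])
{
    alignas(32) uint32_t idx[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(idx), offsets);  // 1 vector store
    a[0] = _mm256_loadu_ps(data + (uint32_t)_mm_cvtsi128_si32(_mm256_castsi256_si128(offsets)));
    for (int i = 1; i < 8; i++)
        a[i] = _mm256_loadu_ps(data + idx[i]);                     // scalar reloads feed the addresses
    // ... transpose a[0..7] exactly as in the question
}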
2/clock load throughput, and efficient store forwarding from a vector store to scalar reloads of 32-bit elements makes this good (maybe 7 cycle latency IIRC, for a 256-bit store).
Especially if you're doing this transpose in a loop with other ALU work on the transpose result, so the loop mostly bottlenecks on port 5 in the back-end. The extra load uops shouldn't bottleneck on the load ports, especially if there are any L1d cache misses. (In that case, replays cost extra cycles on the ports of the instructions that consume the load results, not of the load uops themselves.)
Also fewer front-end uops:
1 store (p237+p4 micro-fused) + 1 vmovd (p0) + 7 loads (p23) is only 9 total front-end (fused-domain) uops
vs. vextracti128 + 2x vmovd + 6x vpextrd = 15 ALU uops for port 0 and port 5
Store/reload is fine on Zen/Zen2 as well.
IceLake has more ALU shuffle throughput (some vector shuffles can run on another port as well as p5) but store/reload is still a good strategy when you need all the elements and there are 8 of them. Especially for throughput at a small cost in latency.
@Witek902 reports (in comments) that @chtz's suggestion of building the transpose out of vmovups xmm + vinsertf128 reduces the port 5 shuffle-throughput bottleneck on HSW / SKL and gives a speedup in practice. vinsertf128 y,y,mem,i is 2 uops (can't micro-fuse) for p015 + p23 on Intel, so it's more like a blend, not needing the shuffle port. (It's also going to be excellent on Bulldozer-family / Zen1, which handle YMM regs as two 128-bit halves.)
Doing only 128-bit loads is also nice for Sandybridge / IvyBridge, where misaligned 256-bit loads are extra expensive.
And on any CPU, if an offset happens to be an odd multiple of 16-byte alignment, neither 128-bit load will cross a cache-line boundary, so there are no uop replays of dependent ALU uops creating extra back-end port pressure.
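A sketch of that vinsertf128 variant (untested, my code rather than the exact version benchmarked in the comments; it takes the offsets from memory as in the store/reload sketch above). Each vinsertf128 pairs the same 4-column slice of row i and row i+4, so the final vperm2f128 stage disappears and only in-lane shuffles remain; the blend trick from the question could still replace the vshufps pairs if port 5 is still too busy:

#include <immintrin.h>
#include <cstdint>

static inline __m256 Load128Pair(const float* lo, const float* hi)
{
    return _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(lo)), _mm_loadu_ps(hi), 1);
}

void LoadTransposed_Insert128(
    const float* data, const uint32_t* off,   // offsets already stored to memory
    __m256& v0, __m256& v1, __m256& v2, __m256& v3,
    __m256& v4, __m256& v5, __m256& v6, __m256& v7)
{
    // a_i holds columns 0..3 of rows i and i+4; b_i holds columns 4..7
    __m256 a0 = Load128Pair(data + off[0],     data + off[4]);
    __m256 a1 = Load128Pair(data + off[1],     data + off[5]);
    __m256 a2 = Load128Pair(data + off[2],     data + off[6]);
    __m256 a3 = Load128Pair(data + off[3],     data + off[7]);
    __m256 b0 = Load128Pair(data + off[0] + 4, data + off[4] + 4);
    __m256 b1 = Load128Pair(data + off[1] + 4, data + off[5] + 4);
    __m256 b2 = Load128Pair(data + off[2] + 4, data + off[6] + 4);
    __m256 b3 = Load128Pair(data + off[3] + 4, data + off[7] + 4);
    // In-lane 4x4 transposes now produce the final rows directly.
    __m256 t0 = _mm256_unpacklo_ps(a0, a1), t1 = _mm256_unpackhi_ps(a0, a1);
    __m256 t2 = _mm256_unpacklo_ps(a2, a3), t3 = _mm256_unpackhi_ps(a2, a3);
    v0 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(1, 0, 1, 0));
    v1 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(3, 2, 3, 2));
    v2 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(1, 0, 1, 0));
    v3 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(3, 2, 3, 2));
    t0 = _mm256_unpacklo_ps(b0, b1); t1 = _mm256_unpackhi_ps(b0, b1);
    t2 = _mm256_unpacklo_ps(b2, b3); t3 = _mm256_unpackhi_ps(b2, b3);
    v4 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(1, 0, 1, 0));
    v5 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(3, 2, 3, 2));
    v6 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(1, 0, 1, 0));
    v7 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(3, 2, 3, 2));
}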

Related

Transpose 8x8 64-bits matrix

Targeting AVX2, what is the fastest way to transpose an 8x8 matrix containing 64-bit integers (or doubles)?
I searched through this site and found several ways of doing an 8x8 transpose, but mostly for 32-bit floats. So I'm mainly asking because, first, I'm not sure whether the principles that made those algorithms fast readily translate to 64-bit elements, and second, AVX2 apparently only has 16 registers, so merely loading all the values would take up all the registers.
One way of doing it would be to apply _MM_TRANSPOSE4_PD to a 2x2 arrangement of 4x4 blocks, but I was wondering whether this is optimal:
#define _MM_TRANSPOSE4_PD(row0,row1,row2,row3) \
{ \
__m256d tmp3, tmp2, tmp1, tmp0; \
\
tmp0 = _mm256_shuffle_pd((row0),(row1), 0x0); \
tmp2 = _mm256_shuffle_pd((row0),(row1), 0xF); \
tmp1 = _mm256_shuffle_pd((row2),(row3), 0x0); \
tmp3 = _mm256_shuffle_pd((row2),(row3), 0xF); \
\
(row0) = _mm256_permute2f128_pd(tmp0, tmp1, 0x20); \
(row1) = _mm256_permute2f128_pd(tmp2, tmp3, 0x20); \
(row2) = _mm256_permute2f128_pd(tmp0, tmp1, 0x31); \
(row3) = _mm256_permute2f128_pd(tmp2, tmp3, 0x31); \
}
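(For concreteness, the block composition I have in mind is something like this; untested, assuming row-major storage and dst != src:)

#include <immintrin.h>

// Transpose an 8x8 double matrix as four 4x4 blocks: transpose each block with the
// macro above and store it at the mirrored block position.
void transpose8x8_blocks(double* dst, const double* src)
{
    for (int bi = 0; bi < 2; bi++)
        for (int bj = 0; bj < 2; bj++)
        {
            const double* s = src + bi * 4 * 8 + bj * 4;  // block (bi, bj), row stride 8
            __m256d r0 = _mm256_loadu_pd(s);
            __m256d r1 = _mm256_loadu_pd(s + 8);
            __m256d r2 = _mm256_loadu_pd(s + 16);
            __m256d r3 = _mm256_loadu_pd(s + 24);
            _MM_TRANSPOSE4_PD(r0, r1, r2, r3);
            double* d = dst + bj * 4 * 8 + bi * 4;        // mirrored block (bj, bi)
            _mm256_storeu_pd(d, r0);
            _mm256_storeu_pd(d + 8, r1);
            _mm256_storeu_pd(d + 16, r2);
            _mm256_storeu_pd(d + 24, r3);
        }
}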
Still assuming AVX2, is transposing double[8][8] and int64_t[8][8] largely the same, in principle?
PS: And just being curious, would having AVX-512 change things substantially?
After some thought and discussion in the comments, I think this is the most efficient version, at least when the source and destination data is in RAM. It does not require AVX2; AVX1 is enough.
The main idea is that modern CPUs can do twice as many load micro-ops as stores, and on many CPUs loading into the higher half of a vector with vinsertf128 has the same cost as a regular 16-byte load. Compared to your macro, this version no longer needs these relatively expensive (3 cycles of latency on most CPUs) vperm2f128 shuffles.
struct Matrix4x4
{
__m256d r0, r1, r2, r3;
};
inline void loadTransposed( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
// Load top half of the matrix into low half of 4 registers
__m256d t0 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) ); // 00, 01
__m256d t1 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 02, 03
rsi += stride;
__m256d t2 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) ); // 10, 11
__m256d t3 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 12, 13
rsi += stride;
// Load bottom half of the matrix into high half of these registers
t0 = _mm256_insertf128_pd( t0, _mm_loadu_pd( rsi ), 1 ); // 00, 01, 20, 21
t1 = _mm256_insertf128_pd( t1, _mm_loadu_pd( rsi + 2 ), 1 );// 02, 03, 22, 23
rsi += stride;
t2 = _mm256_insertf128_pd( t2, _mm_loadu_pd( rsi ), 1 ); // 10, 11, 30, 31
t3 = _mm256_insertf128_pd( t3, _mm_loadu_pd( rsi + 2 ), 1 );// 12, 13, 32, 33
// Transpose 2x2 blocks in registers.
// Due to the tricky way we loaded stuff, that's enough to transpose the complete 4x4 matrix.
mat.r0 = _mm256_unpacklo_pd( t0, t2 ); // 00, 10, 20, 30
mat.r1 = _mm256_unpackhi_pd( t0, t2 ); // 01, 11, 21, 31
mat.r2 = _mm256_unpacklo_pd( t1, t3 ); // 02, 12, 22, 32
mat.r3 = _mm256_unpackhi_pd( t1, t3 ); // 03, 13, 23, 33
}
inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
_mm256_storeu_pd( rdi, mat.r0 );
_mm256_storeu_pd( rdi + stride, mat.r1 );
_mm256_storeu_pd( rdi + stride * 2, mat.r2 );
_mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}
// Transpose 8x8 matrix of double values
void transpose8x8( double* rdi, const double* rsi )
{
Matrix4x4 block;
// Top-left corner
loadTransposed( block, rsi );
store( block, rdi );
#if 1
// Using another instance of the block to support in-place transpose
Matrix4x4 block2;
loadTransposed( block, rsi + 4 ); // top right block
loadTransposed( block2, rsi + 8 * 4 ); // bottom left block
store( block2, rdi + 4 );
store( block, rdi + 8 * 4 );
#else
// Flip the #if if you can guarantee ( rsi != rdi )
// Performance is about the same, but this version uses 4 fewer vector registers,
// slightly more efficient when some registers need to be backed up / restored.
assert( rsi != rdi );
loadTransposed( block, rsi + 4 );
store( block, rdi + 8 * 4 );
loadTransposed( block, rsi + 8 * 4 );
store( block, rdi + 4 );
#endif
// Bottom-right corner
loadTransposed( block, rsi + 8 * 4 + 4 );
store( block, rdi + 8 * 4 + 4 );
}
For completeness, here's a version which uses code very similar to your macro, does half as many loads and the same count of stores, but more shuffles. I have not benchmarked it, but I would expect it to be slightly slower.
struct Matrix4x4
{
__m256d r0, r1, r2, r3;
};
inline void load( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
mat.r0 = _mm256_loadu_pd( rsi );
mat.r1 = _mm256_loadu_pd( rsi + stride );
mat.r2 = _mm256_loadu_pd( rsi + stride * 2 );
mat.r3 = _mm256_loadu_pd( rsi + stride * 3 );
}
inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
_mm256_storeu_pd( rdi, mat.r0 );
_mm256_storeu_pd( rdi + stride, mat.r1 );
_mm256_storeu_pd( rdi + stride * 2, mat.r2 );
_mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}
inline void transpose( Matrix4x4& m4 )
{
// These unpack instructions transpose lanes within 2x2 blocks of the matrix
const __m256d t0 = _mm256_unpacklo_pd( m4.r0, m4.r1 );
const __m256d t1 = _mm256_unpacklo_pd( m4.r2, m4.r3 );
const __m256d t2 = _mm256_unpackhi_pd( m4.r0, m4.r1 );
const __m256d t3 = _mm256_unpackhi_pd( m4.r2, m4.r3 );
// Produce the transposed matrix by combining these blocks
m4.r0 = _mm256_permute2f128_pd( t0, t1, 0x20 );
m4.r1 = _mm256_permute2f128_pd( t2, t3, 0x20 );
m4.r2 = _mm256_permute2f128_pd( t0, t1, 0x31 );
m4.r3 = _mm256_permute2f128_pd( t2, t3, 0x31 );
}
// Transpose 8x8 matrix with double values
void transpose8x8( double* rdi, const double* rsi )
{
Matrix4x4 block;
// Top-left corner
load( block, rsi );
transpose( block );
store( block, rdi );
// Using another instance of the block to support in-place transpose, with very small overhead
Matrix4x4 block2;
load( block, rsi + 4 ); // top right block
load( block2, rsi + 8 * 4 ); // bottom left block
transpose( block2 );
store( block2, rdi + 4 );
transpose( block );
store( block, rdi + 8 * 4 );
// Bottom-right corner
load( block, rsi + 8 * 4 + 4 );
transpose( block );
store( block, rdi + 8 * 4 + 4 );
}
For small matrices where more than 1 row can fit in a single SIMD vector, AVX-512 has very nice 2-input lane-crossing shuffles with 32-bit or 64-bit granularity, with a vector control. (Unlike _mm512_unpacklo_pd which is basically 4 separate 128-bit shuffles.)
A 4x4 double matrix is "only" 128 bytes, two ZMM __m512d vectors, so you only need two vpermt2pd (_mm512_permutex2var_pd) to produce both output vectors: one shuffle per output vector, with both loads and stores being full width. You do need control vector constants, though.
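For example, a sketch of that 4x4 case (untested, my code, assuming row-major storage):

#include <immintrin.h>

void Transpose4x4_AVX512(const double* src, double* dst)
{
    // Two ZMM registers hold the whole 4x4 matrix: rows 0-1 and rows 2-3.
    const __m512d lo = _mm512_loadu_pd(src);      // a00..a03, a10..a13
    const __m512d hi = _mm512_loadu_pd(src + 8);  // a20..a23, a30..a33
    // Control vectors: an index >= 8 selects from the second source operand.
    const __m512i idx0 = _mm512_setr_epi64(0, 4, 8, 12, 1, 5, 9, 13);   // columns 0 and 1
    const __m512i idx1 = _mm512_setr_epi64(2, 6, 10, 14, 3, 7, 11, 15); // columns 2 and 3
    // One vpermt2pd per output vector.
    _mm512_storeu_pd(dst,     _mm512_permutex2var_pd(lo, idx0, hi));
    _mm512_storeu_pd(dst + 8, _mm512_permutex2var_pd(lo, idx1, hi));
}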
Using 512-bit vector instructions has some downsides (clock speed and execution port throughput), but if your program can spend a lot of time in code that uses 512-bit vectors, there's probably a significant throughput gain from throwing around more data with each instruction, and having more powerful shuffles.
With 256-bit vectors, vpermt2pd ymm would probably not be useful for a 4x4, because for each __m256d output row, each of the 4 elements you want comes from a different input row. So one 2-input shuffle can't produce the output you want.
I think lane-crossing shuffles with less than 128-bit granularity aren't useful unless your matrix is small enough to fit multiple rows in one SIMD vector. See How to transpose a 16x16 matrix using SIMD instructions? for some algorithmic complexity reasoning about 32-bit elements: an 8x8 transpose of 32-bit elements with AVX1 is about the same as an 8x8 of 64-bit elements with AVX-512, where each SIMD vector holds exactly one whole row. So there's no need for vector constants, just immediate shuffles of 128-bit chunks plus unpacklo/hi.
Transposing an 8x8 with 512-bit vectors (8 doubles) would have the same problem: each output row of 8 doubles needs 1 double from each of 8 input vectors. So ultimately I think you want a similar strategy to Soonts' AVX answer, starting with _mm512_insertf64x4(v, load, 1) as the first step to get the first half of 2 input rows into one vector.
(If you care about KNL / Xeon Phi, @ZBoson's other answer on How to transpose a 16x16 matrix using SIMD instructions? shows some interesting ideas using merge-masking with 1-input shuffles like vpermpd or vpermq, instead of 2-input shuffles like vunpcklpd or vpermt2pd.)
Using wider vectors means fewer loads and stores, and maybe even fewer total shuffles because each one combines more data. But you also have more shuffling work to do, to get all 8 elements of a row into one vector, instead of just loading and storing to different places in chunks half the size of a row. It's not obvious which is better; I'll update this answer if I get around to actually writing the code.
Note that Ice Lake (first consumer CPU with AVX-512) can do 2 loads and 2 stores per clock. It has better shuffle throughput than Skylake-X for some shuffles, but not for any that are useful for this or Soonts' answer. (All of vperm2f128, vunpcklpd and vpermt2pd only run on port 5, for the ymm and zmm versions. https://uops.info/. vinsertf64x4 zmm, mem, 1 is 2 uops for the front-end, and needs a load port and a uop for p0/p5. (Not p1 because it's a 512-bit uop, and see also SIMD instructions lowering CPU frequency).)

CUDA 5.0 Replay Overhead

I am a novice CUDA programmer. I recently learned about achieving better performance at lower occupancy. Here is a code snippet; I need help understanding a few things about replay overhead and instruction-level parallelism (ILP).
__global__ void myKernel(double *d_dst, double *d_a1, double *d_a2, size_t SIZE)
{
int tId = threadIdx.x + blockDim.x * blockIdx.x;
d_dst[tId] = d_a1[tId] * d_a2[tId];
d_dst[tId + SIZE] = d_a1[tId + SIZE] * d_a2[tId + SIZE];
d_dst[tId + SIZE * 2] = d_a1[tId + SIZE * 2] * d_a2[tId + SIZE * 2];
d_dst[tId + SIZE * 3] = d_a1[tId + SIZE * 3] * d_a2[tId + SIZE * 3];
}
This is my simple kernel, which multiplies two 2D arrays to form a third 2D array (from a logical perspective), where all of these arrays are placed as flat 1D arrays in device memory.
Below is another code snippet:
void doCompute() {
double *h_a1;
double *h_a2;
size_t SIZE = pow(31, 3) + 1;
// Imagine h_a1, h_a2 as 2D arrays
// with 4 rows and SIZE Columns
// For convenience created as 1D arrays
h_a1 = (double *) malloc(SIZE * 4 * sizeof(double));
h_a2 = (double *) malloc(SIZE * 4 * sizeof(double));
memset(h_a1, 5.0, SIZE * 4 * sizeof(double));
memset(h_a2, 5.0, SIZE * 4 * sizeof(double));
double *d_dst;
double *d_a1;
double *d_a2;
cudaMalloc(&d_dst, SIZE * 4 * sizeof(double));
cudaMalloc(&d_a1, SIZE * 4 * sizeof(double));
cudaMalloc(&d_a2, SIZE * 4 * sizeof(double));
cudaMemcpy(d_a1, h_a1, SIZE * 4 * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_a2, h_a2, SIZE * 4 * sizeof(double), cudaMemcpyHostToDevice);
int BLOC_SIZE = 32;
int GRID_SIZE = (SIZE + BLOC_SIZE - 1) / BLOC_SIZE;
myKernel <<< GRID_SIZE, BLOC_SIZE >>> (d_dst, d_a1, d_a2, SIZE);
}
Q1) Am I breaking any coalesced memory access pattern here?
Q2) Can I say that the memory accesses, the way they are coded in the kernel, are also an example of instruction-level parallelism? If yes, am I using ILP2 or ILP4? And why?
Q3) If everything I am doing is right, then why does the nvvp profiler give me the following messages?
Total Replay Overhead: 4.6%
Global Cache Replay Overhead: 30.3%
How can I reduce them or fix them?
Cheers,
The compiler has a limited ability to schedule instructions for possible ILP exploitation. The GPU itself must also have ILP capability, and the extent of this varies by GPU generation. Yes, any resource that is not available can cause a warp to stall, the typical one being data required from memory. The definitions of the replay quantities you're asking about are given here.
So, for example, the global cache replay overhead will be triggered by a cache miss, and your code is going to have some cache misses. Cache misses are possible even though you have 100% coalesced access and (nearly) 100% bandwidth utilization efficiency.
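As an illustration only (an untested sketch, not part of the original answer): issuing all the loads before any store, and marking the pointers __restrict__ so the compiler knows the stores cannot alias the loads, gives the compiler more independent memory operations to schedule per thread. It will not remove cache-miss replays by itself, but it keeps more loads in flight while a miss is outstanding.

__global__ void myKernel_ilp(double* __restrict__ d_dst,
                             const double* __restrict__ d_a1,
                             const double* __restrict__ d_a2,
                             size_t SIZE)
{
    int tId = threadIdx.x + blockDim.x * blockIdx.x;
    // 8 independent loads, none of them ordered behind a store
    double a0 = d_a1[tId],            b0 = d_a2[tId];
    double a1 = d_a1[tId + SIZE],     b1 = d_a2[tId + SIZE];
    double a2 = d_a1[tId + SIZE * 2], b2 = d_a2[tId + SIZE * 2];
    double a3 = d_a1[tId + SIZE * 3], b3 = d_a2[tId + SIZE * 3];
    d_dst[tId]            = a0 * b0;
    d_dst[tId + SIZE]     = a1 * b1;
    d_dst[tId + SIZE * 2] = a2 * b2;
    d_dst[tId + SIZE * 3] = a3 * b3;
}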

Faster way to structure operations on offset neighborhoods in OpenCL

How can an operation on many overlapping but offset blocks of a 2D array be structured for more efficient execution in OpenCL?
For example, I have the following OpenCL kernel:
__kernel void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
}
}
write_imageui(dest, pos, diff);
}
It produces correct results, but is slow... only ~25 GFLOPS on an NVS4200M with 1k by 1k input (the hardware spec is 155 GFLOPS). I'm guessing this has to do with the memory access patterns. Each work item reads one 16x16 block of data which is the same as for all its neighbors in a 16x16 area, and also another, offset 16x16 block that mostly overlaps with those of its immediate neighbors. All reads go through samplers. The host program is PyOpenCL (I don't think that actually changes anything) and the work-group size is 16x16.
EDIT: New version of the kernel per the suggestion below, copying the work area to local memory:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int dx = pos.x % 16;
int dy = pos.y % 16;
__local uint4 local_src[16*16];
__local uint4 local_src2[32*32];
local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
barrier(CLK_LOCAL_MEM_FENCE);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
}
}
write_imageui(dest, pos, diff);
}
Result: output is correct, running time is 56% slower. If using local_src only (not local_src2), the result is ~10% faster.
EDIT: Benchmarked on much more powerful hardware: an AMD Radeon HD 7850 gets 420 GFLOPS; the spec is 1751 GFLOPS. To be fair, the spec is for multiply-add, and there is no multiply here, so the expected figure is ~875 GFLOPS, but this is still off by quite a lot compared to the theoretical performance.
EDIT: To make it easy for anyone who would like to try this out, the host-side PyOpenCL program is below:
import pyopencl as cl
import numpy
import numpy.random
from time import time
CL_SOURCE = '''
// kernel goes here
'''
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()
h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255
mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))
# warmup
for n in range(10):
event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
# benchmark
t1 = time()
for n in range(100):
event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
t2 = time()
print "Duration (host): ", (t2-t1)/100
print "Duration (event): ", (event.profile.end-event.profile.start)*1e-9
EDIT: Thinking about the memory access patterns, the original naive version may be pretty good; when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) all work-items in a work group are reading the same location (so this is just one read??), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they are reading sequential locations (so the reads can be coalesced perfectly??).
This is definitely a memory access problem. Neighbouring work items' pixels can overlap by as much as 15x16, and worse yet, each work item will overlap at least 225 others.
I would use local memory and get work groups to cooperatively process many 16x16 blocks. I like to use a large, square block for each work group. Rectangular blocks are a bit more complicated, but can get better memory utilization for you.
If you read blocks of n by n pixels from your source image, the borders will overlap by n x 15 (or 15 x n). You need to calculate the largest possible value of n based on your available local memory size (LDS). If you are using OpenCL 1.1 or greater, the LDS is at least 32 KB; OpenCL 1.0 promises 16 KB per work group.
n <= sqrt(32kb / sizeof(uint4))
n <= sqrt(32768 / 16)
n ~ 45
Using n=45 will use 32400 out of 32768 bytes of the LDS, and let you use 900 work items per group (45-15)^2 = 900. Note: Here's where a rectangular block would help out; for example 64x32 would use all of the LDS, but with group size = (64-15)*(32-15) = 833.
steps to use LDS for your kernel:
allocate a 1D or 2D local array for your cached block of the image. I use a #define constant, and it rarely has to change.
read the uint values from your image, and store locally.
adjust 'pos' for each work item to relate to the local memory
execute the same i,j loops you have, but using the local memory to read values. remember that the i and j loops stop 15 short of n.
Each step can be searched online if you are not sure how to implement it, or you can ask me if you need a hand.
Chances are good that the LDS on your device will outperform the texture read speed. This is counter-intuitive, but remember that you are reading tiny amounts of data at a time, so the gpu may not be able to cache the pixels effectively. The LDS usage will guarantee that the pixels are available, and given the number of times each pixel is read, I expect this to make a huge difference.
Please let me know what kind of results you observe.
UPDATE: Here's my attempt to better explain my solution. I used graph paper for my drawings, because I'm not all that great with image manipulation software.
Above is a sketch of how the values were read from src in your first code snippet. The big problem is that the pos0 rectangle -- 16x16 uint4 values -- is being read in its entirety by each work item in the group (256 of them). My solution involves reading a large area and sharing the data among all 256 work items.
If you store a 31x31 region of your image in local memory, all 256 work items' data will be available.
steps:
use work group dimensions: (16,16)
read the values of src into a large local buffer ie: uint4 buff[31][31]; The buffer needs to be translated such that 'pos0' is at buff[0][0]
barrier(CLK_LOCAL_MEM_FENCE) to wait for memory copy operations
do the same i,j for loops you had originally, except you leave out the pos and pos0 values. only use i and j for the location. Accumulate 'diff' in the same way you were doing so originally.
write the solution to 'dest'
This is the same as my first response to your question, except using n=31 (a 31x31 cached region for a 16x16 work group). This value does not utilize the local memory fully, but will probably work well for most platforms. 256 tends to be a common maximum work group size.
I hope this clears things up for you.
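For illustration, the steps above might look roughly like this in kernel form (an untested sketch, not my exact production code, assuming the 16x16 work group and image setup from the question; the tile is padded to 32 columns):

__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel_lds(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    const int2 pos = (int2)(get_global_id(0), get_global_id(1));
    const int lx = get_local_id(0);
    const int ly = get_local_id(1);
    const int2 pos0 = (int2)(pos.x - lx, pos.y - ly); // top-left of this group's tile
    // 31x31 pixels needed per group; padded to 32 columns, ~15.5 KB of LDS
    __local uint4 buff[31][32];
    // 256 work items cooperatively load the 31*31 = 961 pixels
    for (int idx = ly * 16 + lx; idx < 31 * 31; idx += 256)
    {
        const int by = idx / 31;
        const int bx = idx % 31;
        buff[by][bx] = read_imageui(src, sampler, (int2)(pos0.x + bx, pos0.y + by));
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int j = 0; j < 16; j++)
        for (int i = 0; i < 16; i++)
            diff += buff[j][i] - buff[ly + j][lx + i];
    write_imageui(dest, pos, diff);
}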
Some suggestions:
Compute more than 1 output pixel in each work item. It will increase data reuse.
Benchmark different work-group sizes to maximize the usage of texture cache.
Maybe there is a way to separate the kernel into two passes (horizontal and vertical).
Update: more suggestions
Instead of loading everything in local memory, try loading only the local_src values, and use read_image for the other one.
Since you do almost no computations, you should measure read speed in GB/s, and compare to the peak memory speed.
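For example (rough arithmetic, assuming the 1024x1024 RGBA8 image from the host code): each work item issues 2 * 256 reads of a 4-byte pixel, so one pass requests about 1024 * 1024 * 512 * 4 bytes, roughly 2 GiB, before any caching. Divide the data the hardware actually had to move by the event time to get effective GB/s, and compare that against the device's peak memory bandwidth.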

Goertzel algorithm - amplitude goes down, and other issues

I am using the Goertzel algorithm to identify a certain frequency.
What I see is that it works great, but in a strange way: when I feed it samples (±500/1024) I get the right values at first, but then they become lower and lower until they reach zero, while the frequency is STILL there. So I get, for example, 700, and then it slowly goes down.
Also, I would like to make the response more exponential, so the difference between noise and the target frequency is larger.
What can cause this problem, and how can I improve my code?
Thanks.
float goertzel_mag(int16_t* data ,int SAMPLING_RATE ,double TARGET_FREQUENCY,int numSamples )
{
int k,i;
float floatnumSamples;
float omega,sine,cosine,coeff,q0,q1,q2,magnitude,real,imag;
float scalingFactor = numSamples / 2.0; // -2
floatnumSamples = (float) numSamples;
k = (int) (0.5 + ((floatnumSamples * TARGET_FREQUENCY) / SAMPLING_RATE));
omega = (2.0 * M_PI * k) / floatnumSamples;
sine = sin(omega);
cosine = cos(omega);
coeff = 2.0 * cosine;
q0=0;
q1=0;
q2=0;
for(i=0; i<numSamples; i++)
{
q0 = coeff * q1 - q2 + data[i];
q2 = q1;
q1 = q0;
}
real = (q1 - q2 * cosine) / scalingFactor;
imag = (q2 * sine) / scalingFactor;
//double theta = atan2 ( imag, real); //PHASE
magnitude = sqrtf(real*real + imag*imag);
return magnitude;
}
After SO much research about Goertzel, I found out that the problem is not the algorithm.
When I input a pure sine wave to the Mac and print out the buffer:
int16_t *q = (int16_t *)(&bufferList)->mBuffers[0].mData;
its values are high at first, but after 5 seconds the signal goes lower and lower towards zero!
Moving the signal source makes it high again, and then it goes down again.
From what I have read, the channel can go into saturation, and maybe this is causing the problem.
The Goertzel algorithm itself is very good.

OpenCL Memory Optimization - Nearest Neighbour

I'm writing a program in OpenCL that receives two arrays of points, and calculates the nearest neighbour for each point.
I have two programs for this. One of them will calculate distance for 4 dimensions, and one for 6 dimensions. They are below:
4 dimensions:
kernel void BruteForce(
global read_only float4* m,
global float4* y,
global write_only ushort* i,
read_only uint mx)
{
int index = get_global_id(0);
float4 curY = y[index];
float minDist = MAXFLOAT;
ushort minIdx = -1;
int x = 0;
int mmx = mx;
for(x = 0; x < mmx; x++)
{
float dist = fast_distance(curY, m[x]);
if (dist < minDist)
{
minDist = dist;
minIdx = x;
}
}
i[index] = minIdx;
y[index] = minDist;
}
6 dimensions:
kernel void BruteForce(
global read_only float8* m,
global float8* y,
global write_only ushort* i,
read_only uint mx)
{
int index = get_global_id(0);
float8 curY = y[index];
float minDist = MAXFLOAT;
ushort minIdx = -1;
int x = 0;
int mmx = mx;
for(x = 0; x < mmx; x++)
{
float8 mx = m[x];
float d0 = mx.s0 - curY.s0;
float d1 = mx.s1 - curY.s1;
float d2 = mx.s2 - curY.s2;
float d3 = mx.s3 - curY.s3;
float d4 = mx.s4 - curY.s4;
float d5 = mx.s5 - curY.s5;
float dist = sqrt(d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3 + d4 * d4 + d5 * d5);
if (dist < minDist)
{
minDist = dist;
minIdx = index;
}
}
i[index] = minIdx;
y[index] = minDist;
}
I'm looking for ways to optimize this program for the GPU. I've read some articles (including http://www.macresearch.org/opencl_episode6, which comes with source code) about GPGPU optimization using local memory. I've tried applying it and came up with this code:
kernel void BruteForce(
global read_only float4* m,
global float4* y,
global write_only ushort* i,
__local float4 * shared)
{
int index = get_global_id(0);
int lsize = get_local_size(0);
int lid = get_local_id(0);
float4 curY = y[index];
float minDist = MAXFLOAT;
ushort minIdx = 64000;
int x = 0;
for(x = 0; x < {0}; x += lsize)
{
if((x+lsize) > {0})
lsize = {0} - x;
if ( (x + lid) < {0})
{
shared[lid] = m[x + lid];
}
barrier(CLK_LOCAL_MEM_FENCE);
for (int x1 = 0; x1 < lsize; x1++)
{
float dist = distance(curY, shared[x1]);
if (dist < minDist)
{
minDist = dist;
minIdx = x + x1;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
i[index] = minIdx;
y[index] = minDist;
}
I'm getting garbage results for my 'i' output (e.g. many values that are the same). Can anyone point me in the right direction? I'll appreciate any answer that helps me improve this code, or maybe find the problem with the optimized version above.
Thank you very much
Cauê
One way to get a big speed up here is to use local data structures and compute entire blocks of data at a time. You should also only need a single read/write global vector (float4). The same idea can be applied to the 6d version using smaller blocks. Each work group is able to work freely through the block of data it is crunching. I will leave the exact implementation to you because you will know the specifics of your application.
Some pseudo-ish code (4d):
computeBlockSize is the size of the blocks to read from global memory and crunch. This value should be a multiple of your work-group size; I like 64 as a WG size, since it tends to perform well on most platforms. The kernel will be allocating 2 * sizeof(float4) * computeBlockSize + sizeof(uint) * computeBlockSize bytes of local memory (max value for computeBlockSize: OpenCL 1.0 ~448, OpenCL 1.1 ~896).
#define computeBlockSize 256
__local float4 blockA[computeBlockSize];
__local float4 blockB[computeBlockSize];
__local uint blockAnearestIndex[computeBlockSize];
Now blockA gets computed against all blockB combinations; this is the job of a single work group.
*important*: only blockA ever gets written to. blockB is stored in local memory, but never changed or copied back to global memory.
steps:
load blockA into local memory with async_work_group_copy
blockA is located at get_group_id(0) * computeBlockSize in the global vector
optional: set all blockA 'w' values to MAXFLOAT
optional: load blockAnearestIndex into local memory with async_work_group_copy if needed
need to compute blockA against itself first, then go into the blockB's
be careful to only write to blockA[j], NOT blockA[k]. j is exclusive to this work item
for(j=get_local_id(0); j<computeBlockSize;j++)
for(k=0;k<computeBlockSize; k++)
if(j==k) continue; //no self-comparison
calculate distance of blockA[j] vs blockA[k]
store min distance in blockA[j].w
store global index (= get_group_id(0)*computeBlockSize + k) of nearest in blockAnearestIndex[j]
barrier(local_mem_fence)
for (i=0;i<get_num_groups(0);i++)
if (i==get_group_id(0)) continue;
load blockB into local memory: async_work_group_copy(...)
for(j=get_local_id(0); j<computeBlockSize;j++)
for(k=0;k<computeBlockSize; k++)
calculate distance of blockA[j] vs blockB[k]
store min distance in blockA[j].w
store global index (= i*computeBlockSize +k) of nearest in blockAnearestIndex[j]
barrier(local_mem_fence)
write blockA and blockAnearestIndex to global memory using two async_work_group_copy
There should be no problem in reading a blockB while another work group writes the same block (as its own blockA), because only the W values may have changed. If there happens to be trouble with this -- or if you do require two different vectors of points, you could use two global vectors like you have above, one with the A's (writeable) and the other with the B's (read only).
This algorithm works best when your global data size is a multiple of computeBlockSize. To handle the edges, two solutions come to mind. I recommend writing a second kernel for the non-square edge blocks that works in a similar manner as above. The new kernel can execute after the first, and you could save the second PCI-e transfer. Alternatively, you can use a distance of -1 to signify a skip in the comparison of two elements (i.e. if either blockA[j].w == -1 or blockB[k].w == -1, continue). This second solution would result in a lot more branching in your kernel though, which is why I recommend writing a new kernel. A very small percentage of your data points will actually fall in an edge block.
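For reference, here is a much-simplified, untested sketch of the same block-caching idea. It keeps the question's two-array interface and only caches the candidate blocks in local memory (no A-vs-A pass and no reuse of .w as storage), so it is closer to a corrected version of the local-memory kernel in the question than to the full scheme above:

#define computeBlockSize 64

kernel void BruteForceBlocked(
    global const float4* m,  // candidate points
    global float4* y,        // query points; overwritten with the min distance, as in the question
    global ushort* i,        // index of the nearest candidate
    const uint mx)           // number of candidates, assumed a multiple of computeBlockSize
{
    local float4 blockB[computeBlockSize];
    const int gid = get_global_id(0);
    float4 curY = y[gid];
    float minDist = MAXFLOAT;
    ushort minIdx = 0;
    // walk the candidate set one block at a time, caching each block in local memory
    for (uint base = 0; base < mx; base += computeBlockSize)
    {
        event_t ev = async_work_group_copy(blockB, m + base, computeBlockSize, 0);
        wait_group_events(1, &ev);
        for (uint k = 0; k < computeBlockSize; k++)
        {
            float dist = fast_distance(curY, blockB[k]);
            if (dist < minDist)
            {
                minDist = dist;
                minIdx = (ushort)(base + k);
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE); // don't overwrite blockB while others still read it
    }
    i[gid] = minIdx;
    y[gid] = (float4)(minDist); // replicate, matching the question's convention
}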