GNURadio PSK bit recovery

I have followed the wonderful GNURadio Guided Tutorial PSK Demodulation:
https://wiki.gnuradio.org/index.php/Guided_Tutorial_PSK_Demodulation
I've created a very simple DBPSK modulator.
I feed in a series of bits that are sliding: the first byte I feed in is 0x01, the next byte is 0x02, then 0x04, 0x08, and so on. This is the output of hd:
00000000 00 00 ac 0e d0 f0 20 40 81 02 04 08 10 00 20 40 |...... #...... #|
00000010 81 02 04 08 10 00 20 40 81 02 04 08 10 00 20 40 |...... #...... #|
*
00015000
The first few bytes are garbage, but then you can see the pattern. Looking at the second line you see:
0x81, 0x02, 0x04, 0x08, 0x10, 0x00, 0x20, 0x40, 0x81
The walking-ones pattern is there, but after 0x10 the PSK demodulator receives a 0x00, and then a few bytes later it receives a 0x81. It almost seems like the timing recovery is off.
Has anyone else seen something like this?

OK, I figured it out. Below is my DBPSK modulation.
If you let this run, the BER will continue to drop. Some things to keep in mind: the PSK Mod block takes an 8-bit value (or perhaps a short or int as well), grabs the bits, and modulates them. The PSK Demod block does the same on the receive side. If you save the output to a file, you will not get the exact bytes back out; you will need to shift the bits to align them. I added the Vector Insert block to generate a preamble of sorts.
Then I wrote some Python to find my preamble:
import numpy as np

def findPreamble(preamble, x):
    # Slide over x and look for the offset where the preamble bytes line up.
    # 'check' accumulates the signed byte differences; zero is treated as a match.
    for i in range(0, len(x) - len(preamble)):
        check = 0
        for j in range(0, len(preamble)):
            check += x[i + j] - preamble[j]
        if check == 0:
            print("Found a preamble at {0}".format(i))
            x = x[i + len(preamble)::]
            break
    return check == 0, x

def shiftBits(x):
    # Shift the whole byte stream left by one bit, pulling the MSB of the
    # next byte into the LSB of the current one.
    for i in range(0, len(x) - 1):
        a = x[i]
        a = a << 1
        if x[i + 1] & 0x80:
            a = a | 1
        x[i] = (a & 0xFF)
    return x

f = open('test.bits', 'rb')
x = f.read()
f.close()

preamble = [0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08]
searchForBit = True
x = np.frombuffer(x, dtype='uint8')
x = x.astype('int')   # work on a writable copy
print(x)
while searchForBit:
    x = shiftBits(x)
    print(x)
    found, y = findPreamble(preamble, x)
    if found:
        searchForBit = False
        y = y.astype('uint8')
        f = open('test.bits', 'wb')
        f.write(y)
        f.close()


Transpose 8x8 64-bit matrix

Targeting AVX2, what is the fastest way to transpose an 8x8 matrix containing 64-bit integers (or doubles)?
I searched through this site and found several ways of doing an 8x8 transpose, but mostly for 32-bit floats. So I'm mainly asking because I'm not sure whether the principles that make those algorithms fast translate readily to 64-bit elements; second, AVX2 apparently has only 16 vector registers, so just loading all the values would take up all the registers.
One way of doing it would be to call _MM_TRANSPOSE4_PD on each of the four 4x4 sub-blocks (a 2x2 arrangement of blocks), but I was wondering whether this is optimal:
#define _MM_TRANSPOSE4_PD(row0,row1,row2,row3) \
{ \
__m256d tmp3, tmp2, tmp1, tmp0; \
\
tmp0 = _mm256_shuffle_pd((row0),(row1), 0x0); \
tmp2 = _mm256_shuffle_pd((row0),(row1), 0xF); \
tmp1 = _mm256_shuffle_pd((row2),(row3), 0x0); \
tmp3 = _mm256_shuffle_pd((row2),(row3), 0xF); \
\
(row0) = _mm256_permute2f128_pd(tmp0, tmp1, 0x20); \
(row1) = _mm256_permute2f128_pd(tmp2, tmp3, 0x20); \
(row2) = _mm256_permute2f128_pd(tmp0, tmp1, 0x31); \
(row3) = _mm256_permute2f128_pd(tmp2, tmp3, 0x31); \
}
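For reference, composing the full 8x8 from this macro would look roughly like this (untested sketch: it transposes each 4x4 sub-block into the mirrored block position, and it assumes dst and src don't alias):
#include <immintrin.h>

// Assumes the _MM_TRANSPOSE4_PD macro defined above; dst and src must not alias.
static void transpose8x8_blocks(double dst[8][8], const double src[8][8])
{
    for (int bi = 0; bi < 8; bi += 4)
        for (int bj = 0; bj < 8; bj += 4)
        {
            __m256d r0 = _mm256_loadu_pd(&src[bi + 0][bj]);
            __m256d r1 = _mm256_loadu_pd(&src[bi + 1][bj]);
            __m256d r2 = _mm256_loadu_pd(&src[bi + 2][bj]);
            __m256d r3 = _mm256_loadu_pd(&src[bi + 3][bj]);
            _MM_TRANSPOSE4_PD(r0, r1, r2, r3);
            // The transposed (bi, bj) block lands at block position (bj, bi).
            _mm256_storeu_pd(&dst[bj + 0][bi], r0);
            _mm256_storeu_pd(&dst[bj + 1][bi], r1);
            _mm256_storeu_pd(&dst[bj + 2][bi], r2);
            _mm256_storeu_pd(&dst[bj + 3][bi], r3);
        }
}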
Still assuming AVX2, is transposing double[8][8] and int64_t[8][8] largely the same, in principle?
PS: And just being curious, would having AVX-512 change things substantially?
After some thought and discussion in the comments, I think this is the most efficient version, at least when the source and destination data is in RAM. It does not require AVX2; AVX1 is enough.
The main idea: modern CPUs can do twice as many load micro-ops as stores, and on many CPUs loading data into the upper half of a vector with vinsertf128 costs the same as a regular 16-byte load. Compared to your macro, this version no longer needs those relatively expensive (3 cycles of latency on most CPUs) vperm2f128 shuffles.
struct Matrix4x4
{
__m256d r0, r1, r2, r3;
};
inline void loadTransposed( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
// Load top half of the matrix into low half of 4 registers
__m256d t0 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) ); // 00, 01
__m256d t1 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 02, 03
rsi += stride;
__m256d t2 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) ); // 10, 11
__m256d t3 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 12, 13
rsi += stride;
// Load bottom half of the matrix into high half of these registers
t0 = _mm256_insertf128_pd( t0, _mm_loadu_pd( rsi ), 1 ); // 00, 01, 20, 21
t1 = _mm256_insertf128_pd( t1, _mm_loadu_pd( rsi + 2 ), 1 );// 02, 03, 22, 23
rsi += stride;
t2 = _mm256_insertf128_pd( t2, _mm_loadu_pd( rsi ), 1 ); // 10, 11, 30, 31
t3 = _mm256_insertf128_pd( t3, _mm_loadu_pd( rsi + 2 ), 1 );// 12, 13, 32, 33
// Transpose 2x2 blocks in registers.
// Due to the tricky way we loaded stuff, that's enough to transpose the complete 4x4 matrix.
mat.r0 = _mm256_unpacklo_pd( t0, t2 ); // 00, 10, 20, 30
mat.r1 = _mm256_unpackhi_pd( t0, t2 ); // 01, 11, 21, 31
mat.r2 = _mm256_unpacklo_pd( t1, t3 ); // 02, 12, 22, 32
mat.r3 = _mm256_unpackhi_pd( t1, t3 ); // 03, 13, 23, 33
}
inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
_mm256_storeu_pd( rdi, mat.r0 );
_mm256_storeu_pd( rdi + stride, mat.r1 );
_mm256_storeu_pd( rdi + stride * 2, mat.r2 );
_mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}
// Transpose 8x8 matrix of double values
void transpose8x8( double* rdi, const double* rsi )
{
Matrix4x4 block;
// Top-left corner
loadTransposed( block, rsi );
store( block, rdi );
#if 1
// Using another instance of the block to support in-place transpose
Matrix4x4 block2;
loadTransposed( block, rsi + 4 ); // top right block
loadTransposed( block2, rsi + 8 * 4 ); // bottom left block
store( block2, rdi + 4 );
store( block, rdi + 8 * 4 );
#else
// Flip the #if if you can guarantee ( rsi != rdi )
// Performance is about the same, but this version uses 4 less vector registers,
// slightly more efficient when some registers need to be backed up / restored.
assert( rsi != rdi );
loadTransposed( block, rsi + 4 );
store( block, rdi + 8 * 4 );
loadTransposed( block, rsi + 8 * 4 );
store( block, rdi + 4 );
#endif
// Bottom-right corner
loadTransposed( block, rsi + 8 * 4 + 4 );
store( block, rdi + 8 * 4 + 4 );
}
For completeness, here's a version which uses code very similar to your macro, does half as many loads, the same number of stores, and more shuffles. I have not benchmarked it, but I would expect it to be slightly slower.
struct Matrix4x4
{
__m256d r0, r1, r2, r3;
};
inline void load( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
mat.r0 = _mm256_loadu_pd( rsi );
mat.r1 = _mm256_loadu_pd( rsi + stride );
mat.r2 = _mm256_loadu_pd( rsi + stride * 2 );
mat.r3 = _mm256_loadu_pd( rsi + stride * 3 );
}
inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
_mm256_storeu_pd( rdi, mat.r0 );
_mm256_storeu_pd( rdi + stride, mat.r1 );
_mm256_storeu_pd( rdi + stride * 2, mat.r2 );
_mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}
inline void transpose( Matrix4x4& m4 )
{
// These unpack instructions transpose lanes within 2x2 blocks of the matrix
const __m256d t0 = _mm256_unpacklo_pd( m4.r0, m4.r1 );
const __m256d t1 = _mm256_unpacklo_pd( m4.r2, m4.r3 );
const __m256d t2 = _mm256_unpackhi_pd( m4.r0, m4.r1 );
const __m256d t3 = _mm256_unpackhi_pd( m4.r2, m4.r3 );
// Produce the transposed matrix by combining these blocks
m4.r0 = _mm256_permute2f128_pd( t0, t1, 0x20 );
m4.r1 = _mm256_permute2f128_pd( t2, t3, 0x20 );
m4.r2 = _mm256_permute2f128_pd( t0, t1, 0x31 );
m4.r3 = _mm256_permute2f128_pd( t2, t3, 0x31 );
}
// Transpose 8x8 matrix with double values
void transpose8x8( double* rdi, const double* rsi )
{
Matrix4x4 block;
// Top-left corner
load( block, rsi );
transpose( block );
store( block, rdi );
// Using another instance of the block to support in-place transpose, with very small overhead
Matrix4x4 block2;
load( block, rsi + 4 ); // top right block
load( block2, rsi + 8 * 4 ); // bottom left block
transpose( block2 );
store( block2, rdi + 4 );
transpose( block );
store( block, rdi + 8 * 4 );
// Bottom-right corner
load( block, rsi + 8 * 4 + 4 );
transpose( block );
store( block, rdi + 8 * 4 + 4 );
}
For small matrices where more than 1 row can fit in a single SIMD vector, AVX-512 has very nice 2-input lane-crossing shuffles with 32-bit or 64-bit granularity, with a vector control. (Unlike _mm512_unpacklo_pd which is basically 4 separate 128-bit shuffles.)
A 4x4 double matrix is "only" 128 bytes, two ZMM __m512d vectors, so you only need two vpermt2pd (_mm512_permutex2var_pd) shuffles to produce both output vectors: one shuffle per output vector, with both loads and stores being full width. You do need control vector constants, though.
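For illustration, a 4x4 version might look like the following (untested sketch; the permute index constants are worked out here under the vpermt2pd convention that indices 0-7 select from the first source and 8-15 from the second, assuming AVX-512F):
#include <immintrin.h>

// Transpose a row-major 4x4 matrix of doubles held in two __m512d vectors
// (rows 0-1 in v01, rows 2-3 in v23).
static void transpose4x4_avx512(double dst[16], const double src[16])
{
    __m512d v01 = _mm512_loadu_pd(src);      // a00..a03, a10..a13
    __m512d v23 = _mm512_loadu_pd(src + 8);  // a20..a23, a30..a33
    // Result element k comes from index idx[k]: 0-7 select from v01, 8-15 from v23.
    const __m512i idx0 = _mm512_set_epi64(13, 9, 5, 1, 12, 8, 4, 0);   // rows 0,1 of the transpose
    const __m512i idx1 = _mm512_set_epi64(15, 11, 7, 3, 14, 10, 6, 2); // rows 2,3 of the transpose
    _mm512_storeu_pd(dst,     _mm512_permutex2var_pd(v01, idx0, v23));
    _mm512_storeu_pd(dst + 8, _mm512_permutex2var_pd(v01, idx1, v23));
}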
Using 512-bit vector instructions has some downsides (clock speed and execution port throughput), but if your program can spend a lot of time in code that uses 512-bit vectors, there's probably a significant throughput gain from throwing around more data with each instruction, and having more powerful shuffles.
With 256-bit vectors, vpermt2pd ymm would probably not be useful for a 4x4, because for each __m256d output row, each of the 4 elements you want comes from a different input row. So one 2-input shuffle can't produce the output you want.
I think lane-crossing shuffles with less than 128-bit granularity aren't useful unless your matrix is small enough to fit multiple rows in one SIMD vector. See How to transpose a 16x16 matrix using SIMD instructions? for some algorithmic-complexity reasoning about 32-bit elements: an 8x8 transpose of 32-bit elements with AVX1 is about the same problem as an 8x8 of 64-bit elements with AVX-512, where each SIMD vector holds exactly one whole row.
In that case there's no need for vector constants, just immediate shuffles of 128-bit chunks plus unpacklo/hi.
Transposing an 8x8 with 512-bit vectors (8 doubles) would have the same problem: each output row of 8 doubles needs 1 double from each of 8 input vectors. So ultimately I think you want a similar strategy to Soonts' AVX answer, starting with _mm512_insertf64x4(v, load, 1) as the first step to get the first half of 2 input rows into one vector.
(If you care about KNL / Xeon Phi, @ZBoson's other answer on How to transpose a 16x16 matrix using SIMD instructions? shows some interesting ideas using merge-masking with 1-input shuffles like vpermpd or vpermq, instead of 2-input shuffles like vunpcklpd or vpermt2pd.)
Using wider vectors means fewer loads and stores, and maybe even fewer total shuffles because each one combines more data. But you also have more shuffling work to do to get all 8 elements of a row into one vector, instead of just loading and storing to different places in chunks half the size of a row. It's not obvious which is better; I'll update this answer if I get around to actually writing the code.
Note that Ice Lake (first consumer CPU with AVX-512) can do 2 loads and 2 stores per clock. It has better shuffle throughput than Skylake-X for some shuffles, but not for any that are useful for this or Soonts' answer. (All of vperm2f128, vunpcklpd and vpermt2pd only run on port 5, for the ymm and zmm versions. https://uops.info/. vinsertf64x4 zmm, mem, 1 is 2 uops for the front-end, and needs a load port and a uop for p0/p5. (Not p1 because it's a 512-bit uop, and see also SIMD instructions lowering CPU frequency).)

WEP shared-key authentication response generation

While capturing packets with Wireshark, I tried to connect my phone to my access point, which uses WEP shared-key authentication (only for testing purposes), and I captured the authentication packets containing the IV, the challenge text, etc. Then I tried to reproduce the ciphertext that my phone sent. I already know the password, so I took the IV, concatenated the two, and fed the result into the RC4 algorithm, which gave me a keystream. I XORed the keystream with the challenge text, but this always gives me a different ciphertext than the one my phone sent.
Maybe I am concatenating the IV and password in the wrong way, or I'm using the wrong algorithm. Also, why is the response in the provided image 147 bytes long?
Image of Wireshark captured packets
The code I'm using:
def KSA(key):
    # key: sequence of byte values (e.g. a list of ints or a bytes object)
    keylength = len(key)
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % keylength]) % 256
        S[i], S[j] = S[j], S[i]  # swap
    return S

def PRGA(S):
    i = 0
    j = 0
    while True:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]  # swap
        K = S[(S[i] + S[j]) % 256]
        yield K

def RC4(key):
    S = KSA(key)
    return PRGA(S)
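For reference, this is roughly how I'm combining them (a sketch with made-up IV, key and challenge values instead of the actual ones from the capture; the per-packet RC4 key is the 3-byte IV followed by the shared key bytes):
# Hypothetical values for illustration only -- substitute the IV, key and
# challenge text taken from the actual capture.
iv = bytes([0x12, 0x34, 0x56])           # 3-byte IV from the auth frame
wep_key = bytes.fromhex('aabbccddee')    # 40-bit shared key
challenge = bytes(128)                   # 128-byte challenge text from frame 2

keystream = RC4(list(iv + wep_key))      # per-packet key = IV || key
ciphertext = bytes(c ^ next(keystream) for c in challenge)
print(ciphertext.hex())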

TI C6x DSP intrinsics for optimising C code

I want to use C66x intrinsics to optimise my code.
Below is some C code that I want to optimise using DSP intrinsics.
I am new to DSP intrinsics, so I don't have full knowledge of which intrinsics to use for the logic below.
uint8_t const src[40] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40};
uint32_t width = 8;
uint32_t axay1_6 = 112345;
uint32_t axay2_6 = 123456;
uint32_t axay3_6 = 134567;
uint32_t axay4_6 = 145678;
C code:
uint8_t const *cLine = src;
uint8_t const *nLine = cLine + width;
uint32_t res = 0;
const uint32_t a1 = (*cLine++) * axay1_6;
const uint32_t a3 = (*nLine++) * axay3_6;
res = a1 + a3;
const uint32_t a2 = (*cLine) * axay2_6;
const uint32_t a4 = (*nLine) * axay4_6;
res += a2 + a4;
C66x intrinsics:
const uint8_t *Ix00, *Ix01, *Iy00,*Iy01;
uint32_t in1,in2;
uint64_t l1, l2;
__x128_t axay1_6 = _dup32_128(axay1_6); //112345 112345 112345 112345
__x128_t axay2_6 = _dup32_128(axay2_6); //123456 123456 123456 123456
__x128_t axay3_6 = _dup32_128(axay3_6); //134567 134567 134567 134567
__x128_t axay4_6 = _dup32_128(axay4_6); //145678 145678 145678 145678
Ix00 = src ;
Ix01 = Ix00 + 1 ;
Iy00 = src + width;
Iy01 = Iy00 + 1;
int64_t I_00 = _mem8_const(Ix00); //00 01 02 03 04 05 06 07
int64_t I_01 = _mem8_const(Ix01); //01 02 03 04 05 06 07 08
int64_t I_10 = _mem8_const(Iy00); //10 11 12 13 14 15 16 17
int64_t I_11 = _mem8_const(Iy01); //11 12 13 14 15 16 17 18
in1 = _loll(I_00); //00 01 02 03
l1 = _unpkbu4(in1); //00 01 02 03 (16x4)
in2 = _hill(I_00); //04 05 06 07
l2 = _unpkbu4(in2); //04 05 06 07 (16x4)
Here I want to get an __x128_t register holding four 32-bit values with the "00 01 02 03" data, so that I can multiply one __x128_t register by another __x128_t register and get an __x128_t result. Presently I am planning to use _qmpy32.
I am new to these C66x DSP intrinsics.
Can you tell me which intrinsic is suitable for getting an __x128_t register with four 32-bit values holding 00 01 02 03 (i.e. how to widen the 16-bit values to 32-bit using a DSP intrinsic)?
Use the _unpkhu2 instruction to expand the 16x4 to 32x4.
__x128_t src1_128, src2_128;
src1_128 = _llto128(_unpkhu2(_hill(l1)), _unpkhu2(_loll(l1)));
src2_128 = _llto128(_unpkhu2(_hill(l2)), _unpkhu2(_loll(l2)));
Be careful: Little-endian/Big-endian settings can make these sorts of things come out in a way you didn't expect.
Also, I wouldn't recommend naming a variable l1. In some fonts, lower-case L and the number 1 are indistinguishable.
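If you go on to use _qmpy32 as you plan, the follow-up could look roughly like this (untested sketch; I believe _qmpy32 takes two __x128_t operands and performs four 32x32 multiplies keeping the low 32 bits of each product, but verify the prototype in TI's C6000 compiler intrinsics table):
/* Sketch only: multiply the widened pixels by replicated coefficients.
   coef1_128 is the __x128_t built with _dup32_128(axay1_6) as in your code. */
__x128_t coef1_128 = _dup32_128(axay1_6);
__x128_t prod1 = _qmpy32(src1_128, coef1_128); /* low 32 bits of 00*axay1_6 ... 03*axay1_6 */
/* Repeat with the other coefficients and the other line's pixels, then add
   the product vectors lane-wise to compute four 'res' outputs at once. */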

Faster way to structure operations on offset neighborhoods in OpenCL

How can an operation on many overlapping but offset blocks of a 2D array be structured for more efficient execution in OpenCL?
For example, I have the following OpenCL kernel:
__kernel void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
}
}
write_imageui(dest, pos, diff);
}
It produces correct results, but is slow: only ~25 GFLOPS on an NVS 4200M with a 1k by 1k input (the hardware spec is 155 GFLOPS). I'm guessing this has to do with the memory access patterns. Each work item reads one 16x16 block of data which is the same as that of all its neighbors in a 16x16 area, plus another, offset 16x16 block that mostly overlaps with those of its immediate neighbors. All reads go through samplers. The host program is PyOpenCL (I don't think that actually changes anything) and the work-group size is 16x16.
EDIT: New version of kernel per suggestion below, copy work area to local variables:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int dx = pos.x % 16;
int dy = pos.y % 16;
__local uint4 local_src[16*16];
__local uint4 local_src2[32*32];
local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
barrier(CLK_LOCAL_MEM_FENCE);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
}
}
write_imageui(dest, pos, diff);
}
Result: output is correct, running time is 56% slower. If using local_src only (not local_src2), the result is ~10% faster.
EDIT: Benchmarked on much more powerful hardware: an AMD Radeon HD 7850 gets 420 GFLOPS against a spec of 1751 GFLOPS. To be fair, the spec is for multiply-add and there is no multiply here, so the expected figure is ~875 GFLOPS, but this is still off by quite a lot compared to the theoretical performance.
EDIT: To ease running tests for anyone who would like to try this out, the host-side program in PyOpenCL below:
import pyopencl as cl
import numpy
import numpy.random
from time import time
CL_SOURCE = '''
// kernel goes here
'''
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()
h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255
mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))
# warmup
for n in range(10):
event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
# benchmark
t1 = time()
for n in range(100):
event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
t2 = time()
print "Duration (host): ", (t2-t1)/100
print "Duration (event): ", (event.profile.end-event.profile.start)*1e-9
EDIT: Thinking about the memory access patterns, the original naive version may actually be pretty good: when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)), all work items in a work group read the same location (so this is effectively just one read?), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they read sequential locations (so the reads can be coalesced perfectly?).
This is definitely a memory access problem. Neighbouring work items' pixels can overlap by as much as 15x16, and worse yet, each work item's data will overlap that of at least 225 others.
I would use local memory and get work groups to cooperatively process many 16x16 blocks. I like to use a large, square block for each work group. Rectangular blocks are a bit more complicated, but can get you better memory utilization.
If you read blocks of n by n pixels from your source image, the borders will overlap by n x 15 (or 15 x n). You need to calculate the largest possible value of n based on your available local memory size (LDS). If you are using OpenCL 1.1 or greater, the LDS is at least 32 KB; OpenCL 1.0 promises 16 KB per work group.
n <= sqrt(32kb / sizeof(uint4))
n <= sqrt(32768 / 16)
n ~ 45
Using n=45 will use 32400 out of 32768 bytes of the LDS, and lets you use (45-15)^2 = 900 work items per group. Note: here's where a rectangular block would help out; for example 64x32 would use all of the LDS, but with group size (64-15)*(32-15) = 833.
steps to use LDS for your kernel:
allocate a 1D or 2D local array for your cached block of the image. I use a #define constant, and it rarely has to change.
read the uint values from your image, and store locally.
adjust 'pos' for each work item to relate to the local memory
execute the same i,j loops you have, but using the local memory to read values. remember that the i and j loops stop 15 short of n.
Each step can be searched online if you are not sure how to implement it, or you can ask me if you need a hand.
Chances are good that the LDS on your device will outperform the texture read speed. This is counter-intuitive, but remember that you are reading tiny amounts of data at a time, so the gpu may not be able to cache the pixels effectively. The LDS usage will guarantee that the pixels are available, and given the number of times each pixel is read, I expect this to make a huge difference.
Please let me know what kind of results you observe.
UPDATE: Here's my attempt to better explain my solution. I used graph paper for my drawings, because I'm not all that great with image manipulation software.
Above is a sketch of how the values are read from src in your first code snippet. The big problem is that the pos0 rectangle -- 16x16 uint4 values -- is read in its entirety by each work item in the group (256 of them). My solution involves reading a large area once and sharing the data among all 256 work items.
If you store a 31x31 region of your image in local memory, all 256 work items' data will be available.
steps:
use work group dimensions: (16,16)
read the values of src into a large local buffer ie: uint4 buff[31][31]; The buffer needs to be translated such that 'pos0' is at buff[0][0]
barrier(CLK_LOCAL_MEM_FENCE) to wait for memory copy operations
do the same i,j for loops you had originally, except leave out the pos and pos0 offsets; use i and j for the first read, and i,j plus the work item's local offset for the second. Accumulate 'diff' in the same way you were doing originally.
write the solution to 'dest'
This is the same as my first response to your question, except I use n=16. This value does not utilize the local memory fully, but will probably work well for most platforms. 256 tends to be a common maximum work group size.
I hope this clears things up for you.
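Here's a rough kernel along those lines (untested sketch, assuming a 16x16 work group and a zero global offset so get_local_id() matches pos % 16):
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel_lds(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int lx = get_local_id(0);
    int ly = get_local_id(1);
    int2 pos0 = (int2)(pos.x - lx, pos.y - ly);   // top-left of this work group's block

    // 31x31 cache of src, translated so that pos0 sits at buff[0][0].
    // 31*31*sizeof(uint4) = 15376 bytes of local memory.
    __local uint4 buff[31][31];
    int lid = ly * 16 + lx;
    for (int k = lid; k < 31 * 31; k += 256)
    {
        int by = k / 31;
        int bx = k % 31;
        buff[by][bx] = read_imageui(src, sampler, (int2)(pos0.x + bx, pos0.y + by));
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            diff += buff[j][i] - buff[j + ly][i + lx];
    write_imageui(dest, pos, diff);
}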
Some suggestions:
Compute more than 1 output pixel in each work item. It will increase data reuse.
Benchmark different work-group sizes to maximize the usage of texture cache.
Maybe there is a way to separate the kernel into two passes (horizontal and vertical); see the sketch after this list.
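To expand on the two-pass idea (my own untested sketch, not verified code): the 16x16 block sum is separable, so one kernel can compute 16-wide horizontal sums into an intermediate global buffer (here called rowsum, which the host would have to allocate as width*height uint4 values), and a second kernel can add 16 of those vertically and form the difference, which equals the original diff:
// Pass 1: horizontal sums of 16 consecutive pixels per row.
__kernel void hsum16(read_only image2d_t src, __global uint4 *rowsum, const int width)
{
    const sampler_t s = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int x = get_global_id(0), y = get_global_id(1);
    uint4 acc = (uint4)(0, 0, 0, 0);
    for (int i = 0; i < 16; i++)
        acc += read_imageui(src, s, (int2)(x + i, y));
    rowsum[y * width + x] = acc;
}

// Pass 2: sum 16 row-sums vertically to get the 16x16 block sum S(x,y),
// then write S(pos0) - S(pos), which equals the original 'diff'.
__kernel void vsum16_diff(__global const uint4 *rowsum, write_only image2d_t dest,
                          const int width, const int height)
{
    int x = get_global_id(0), y = get_global_id(1);
    int x0 = x - x % 16, y0 = y - y % 16;
    uint4 s_pos = (uint4)(0, 0, 0, 0), s_pos0 = (uint4)(0, 0, 0, 0);
    for (int j = 0; j < 16; j++)
    {
        int yj  = min(y  + j, height - 1);   // crude clamp, mimicking CLAMP_TO_EDGE
        int y0j = min(y0 + j, height - 1);
        s_pos  += rowsum[yj  * width + x];
        s_pos0 += rowsum[y0j * width + x0];
    }
    write_imageui(dest, (int2)(x, y), s_pos0 - s_pos);
}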
Update: more suggestions
Instead of loading everything in local memory, try loading only the local_src values, and use read_image for the other one.
Since you do almost no computations, you should measure read speed in GB/s, and compare to the peak memory speed.

How can I convert hex string to binary?

My problem is getting a 64-bit key from the user. For this I need to read 16 characters as a string containing hexadecimal characters (0123456789ABCDEF). I get the string from the user and walk over its characters with the code below, but I don't know how to convert each character to its 4-bit binary value.
.data
insert_into:
.space 64   # buffer for the typed key; syscall 8 reads up to 64 bytes here
Ask_Input:
.asciiz "Please Enter a Key which size is 16, and use hex characters : "
key_array:
.space 64
.text
.globl main
main:
la $a0, Ask_Input
li $v0, 4
syscall
la $a0, insert_into
li $a1, 64
li $v0, 8
syscall
la $t0, insert_into
li $t2, 0
li $t3, 0
loop_convert:
lb $t1, ($t0)
addi $t0, $t0, 1
beq $t1, 10, end_convert
# The character is now in $t1, but I don't know
# how to convert it to a 4-bit value and store it
b loop_convert
end_convert:
li $v0, 10 # exit
syscall
I don't think masking with 0x15, as @Joelmob suggests, is the right solution, because:
'A' = 0x41 → 0x41 & 0x15 = 1
'B' = 0x42 → 0x42 & 0x15 = 0
'C' = 0x43 → 0x43 & 0x15 = 1
'D' = 0x44 → 0x44 & 0x15 = 4
'E' = 0x45 → 0x45 & 0x15 = 5
'F' = 0x46 → 0x46 & 0x15 = 4
none of which are the values 10-15 that these letters should map to.
The easiest way is to subtract the range's lower limit from the character value. I'll give the idea in C; you can easily convert it to MIPS assembly (a rough MIPS sketch follows the C snippet).
if ('0' <= ch && ch <= '9')
{
return ch - '0';
}
else if ('A' <= ch && ch <= 'F')
{
return ch - 'A' + 10;
}
else if ('a' <= ch && ch <= 'f')
{
return ch - 'a' + 10;
}
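A rough MIPS translation of that first variant, using MARS/SPIM pseudo-instructions (untested; the character is assumed to be in $t1 as in your loop, and non-hex characters are not validated):
    # Convert the ASCII hex digit in $t1 to its 4-bit value in $t1.
    blt  $t1, 65, digit_0_9     # codes below 'A' (65) must be '0'..'9'
    andi $t1, $t1, 0xDF         # fold lowercase 'a'..'f' onto 'A'..'F'
    addi $t1, $t1, -55          # 'A' (65) -> 10 ... 'F' (70) -> 15
    b    have_nibble
digit_0_9:
    addi $t1, $t1, -48          # '0' (48) -> 0 ... '9' (57) -> 9
have_nibble:
    # $t1 now holds 0..15; store it or pack two nibbles per byte as needed.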
Another way to implement:
if ('0' <= ch && ch <= '9')
{
return ch & 0x0f;
}
else if (('A' <= ch && ch <= 'F') || ('a' <= ch && ch <= 'f'))
{
return (ch & 0x0f) + 9;
}
However, this can be further optimized to a single comparison using the technique described in the following questions:
Fastest way to determine if an integer is between two integers (inclusive) with known sets of values
Check if number is in range on 8051
Now the checks can be rewritten as below
if ((unsigned char)(ch - '0') <= ('9'-'0'))
if ((unsigned char)(ch - 'A') <= ('F'-'A'))
if ((unsigned char)(ch - 'a') <= ('f'-'a'))
Any modern compiler can do this kind of optimization; here is an example of the generated output:
hex2(unsigned char):
andi $4,$4,0x00ff # ch, ch
addiu $2,$4,-48 # tmp203, ch,
sltu $2,$2,10 # tmp204, tmp203,
bne $2,$0,$L13
nop
andi $2,$4,0xdf # tmp206, ch,
addiu $2,$2,-65 # tmp210, tmp206,
sltu $2,$2,6 # tmp211, tmp210,
beq $2,$0,$L12 #, tmp211,,
andi $4,$4,0xf # tmp212, ch,
j $31
addiu $2,$4,9 # D.2099, tmp212,
$L12:
j $31
li $2,255 # 0xff # D.2099,
$L13:
j $31
andi $2,$4,0xf # D.2099, ch,
Have a look at an ASCII table: the codes for the digits '0'-'9' run from 0x30 to 0x39 (so the low nibble is already the digit's value), and the capital letters A-Z sit between 0x41 and 0x5A. Determine whether the character is a digit or a letter; if it's a digit you're pretty much done, and if it's a letter, mask with 0x15 to get the four bits.
If you want to include lowercase letters, do the same masking procedure after determining whether the character is between 0x61 and 0x7A.