WEP shared-key authentication response generation - authentication

While capturing packets with Wireshark, I tried to connect my phone to my access point, which uses WEP shared-key authentication (for testing purposes only), and I got the authentication packets containing the IV, the challenge text, etc. Then I tried to reproduce the ciphertext that my phone sent. I already know the password, so I took the IV, concatenated the two, and fed the result to the RC4 algorithm, which gave me a keystream. I XORed the keystream with the challenge text, but this always gives me a different ciphertext than the one my phone sent.
Maybe I'm concatenating the IV and the password in the wrong way, or I'm using the wrong algorithm. Also, why is the response in the provided image 147 bytes long?
Image of wireshark captured packets
The code I'm using:
def KSA(key):
    keylength = len(key)
    S = list(range(256))  # list() so that S is mutable on Python 3 as well
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % keylength]) % 256
        S[i], S[j] = S[j], S[i] # swap
    return S
def PRGA(S):
    i = 0
    j = 0
    while True:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i] # swap
        K = S[(S[i] + S[j]) % 256]
        yield K
def RC4(key):
    S = KSA(key)
    return PRGA(S)
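For comparison, here is a minimal sketch of how a WEP-encrypted body is normally built: the RC4 seed is the 3-byte IV followed by the key bytes (IV first), and a CRC-32 integrity value (ICV) of the plaintext is appended, least significant byte first, before everything is XORed with the keystream. The names `rc4_keystream` and `wep_encrypt` are made up for this illustration. It also assumes that `key` holds the raw 5 or 13 key bytes, which is not necessarily the passphrase as typed, and which bytes of the authentication frame make up the plaintext is something to verify against the capture (as I understand it, the encrypted part covers more than the bare challenge text).

```python
import zlib


def rc4_keystream(key: bytes):
    """Yield RC4 keystream bytes for the given key (KSA followed by PRGA)."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    i = j = 0
    while True:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        yield s[(s[i] + s[j]) % 256]


def wep_encrypt(iv: bytes, key: bytes, plaintext: bytes) -> bytes:
    """Encrypt plaintext plus its CRC-32 ICV with RC4 seeded by iv + key."""
    icv = zlib.crc32(plaintext).to_bytes(4, "little")
    stream = rc4_keystream(iv + key)
    return bytes(b ^ next(stream) for b in plaintext + icv)
```

The output is always 4 bytes longer than the plaintext because of the ICV, so a ciphertext computed from the challenge text alone cannot match what the phone sent.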


decoding base64 encoded text with POSIX awk

In a bash script that I'm writing for Linux/Solaris I need to decode more than a hundred thousand base64-encoded text strings, and, because I don't wanna massively fork a non-portable base64 binary from awk, I wrote a function that does the decoding.
Here's the code of my base64_decode function:
function base64_decode(str, out,i,n,v) {
out = ""
if ( ! ("A" in _BASE64_DECODE_c2i) )
for (i = 1; i <= 64; i++)
_BASE64_DECODE_c2i[substr("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",i,1)] = i-1
i = 0
n = length(str)
while (i <= n) {
v = _BASE64_DECODE_c2i[substr(str,++i,1)] * 262144 + \
_BASE64_DECODE_c2i[substr(str,++i,1)] * 4096 + \
_BASE64_DECODE_c2i[substr(str,++i,1)] * 64 + \
_BASE64_DECODE_c2i[substr(str,++i,1)]
out = out sprintf("%c%c%c", int(v/65536), int(v/256), v)
}
return out
}
Which works fine:
printf '%s\n' SmFuZQ== amRvZQ== |
LANG=C command -p awk '
{ print base64_decode($0) }
function base64_decode(...) {...}
'
Jane
jdoe
SIMPLIFIED REAL-LIFE EXAMPLE THAT DOESN'T WORK AS EXPECTED
I want to get the givenName of the users that are members of GroupCode = 025496 from the output of ldapsearch -LLL -o ldif-wrap=no ... '(|(uid=*)(GroupCode=*))' uid givenName sn GroupCode memberUid:
dn: uid=jsmith,ou=users,dc=example,dc=com
givenName: John
sn: SMITH
uid: jsmith
dn: uid=jdoe,ou=users,dc=example,dc=com
uid: jdoe
givenName:: SmFuZQ==
sn:: RE9F
dn: cn=group1,ou=groups,dc=example,dc=com
GroupCode: 025496
memberUid:: amRvZQ==
memberUid: jsmith
Here would be an awk for doing so:
LANG=C command -p awk -F '\n' -v RS='' -v GroupCode=025496 '
{
delete attrs
for (i = 2; i <= NF; i++) {
match($i,/::? /)
key = substr($i,1,RSTART-1)
val = substr($i,RSTART+RLENGTH)
if (RLENGTH == 3)
val = base64_decode(val)
attrs[key] = ((key in attrs) ? attrs[key] SUBSEP val : val)
}
if ( /\nuid:/ )
givenName[ attrs["uid"] ] = attrs["givenName"]
else
memberUid[ attrs["GroupCode"] ] = attrs["memberUid"]
}
END {
n = split(memberUid[GroupCode],uid,SUBSEP)
for ( i = 1; i <= n; i++ )
print givenName[ uid[i] ]
}
function base64_decode(...) { ... }
'
On BSD and Solaris the result is:
Jane
John
While on Linux it is:
John
I don't know where the issue might be; is there something wrong with the base64_decode function and/or the code that uses it?
Your function generates NUL bytes when its argument (encoded string) ends with padding characters (=s). Below is a corrected version of your while loop:
while (i < n) {
v = _BASE64_DECODE_c2i[substr(str,1+i,1)] * 262144 + \
_BASE64_DECODE_c2i[substr(str,2+i,1)] * 4096 + \
_BASE64_DECODE_c2i[substr(str,3+i,1)] * 64 + \
_BASE64_DECODE_c2i[substr(str,4+i,1)]
i += 4
if (v%256 != 0)
out = out sprintf("%c%c%c", int(v/65536), int(v/256), v)
else if (int(v/256)%256 != 0)
out = out sprintf("%c%c", int(v/65536), int(v/256))
else
out = out sprintf("%c", int(v/65536))
}
Note that if the decoded bytes contain an embedded NUL then this approach may not work properly.
The problem is in the base64_decode function, which outputs some junk characters on GNU awk.
You can use this awk code that uses system provided base64 utility as an alternative:
{
delete attrs
for (i = 2; i <= NF; i++) {
match($i,/::? /)
key = substr($i,1,RSTART-1)
val = substr($i,RSTART+RLENGTH)
if (RLENGTH == 3) {
cmd = "echo " val " | base64 -di"
cmd | getline val # should also check exit code here
close(cmd) # close the pipe so that open files don't pile up on large inputs
}
attrs[key] = ((key in attrs) ? attrs[key] SUBSEP val : val)
}
if ( /\nuid:/ )
givenName[ attrs["uid"] ] = attrs["givenName"]
else
memberUid[ attrs["GroupCode"] ] = attrs["memberUid"]
}
END {
n = split(memberUid[GroupCode],uid,SUBSEP)
for ( i = 1; i <= n; i++ )
print givenName[ uid[i] ]
}
I have tested this on gnu and BSD awk versions and I am getting expected output in all the cases.
If you cannot use external base64 utility then I suggest you take a look here for awk version of base64 decode.
This answer is for reference
Here's a working base64_decode function (thanks @MNejatAydin for pointing out the issue(s) in the original one):
function base64_decode(str, out,bits,n,i,c1,c2,c3,c4) {
out = ""
# One-time initialization during the first execution
if ( ! ("A" in _BASE64) )
for (i = 1; i <= 64; i++)
# The "_BASE64" array associates a character to its base64 index
_BASE64[substr("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",i,1)] = i-1
# Decoding the input string
n = length(str)
i = 0
while ( i < n ) {
c1 = substr(str, ++i, 1)
c2 = substr(str, ++i, 1)
c3 = substr(str, ++i, 1)
c4 = substr(str, ++i, 1)
bits = _BASE64[c1] * 262144 + _BASE64[c2] * 4096 + _BASE64[c3] * 64 + _BASE64[c4]
if ( c4 != "=" )
out = out sprintf("%c%c%c", bits/65536, bits/256, bits)
else if ( c3 != "=" )
out = out sprintf("%c%c", bits/65536, bits/256)
else
out = out sprintf("%c", bits/65536)
}
return out
}
WARNING: the function requires LANG=C
It also doesn't check that the input is a valid base64 string; for that you can add a simple condition like:
match( str, "^([a-zA-Z/-9+]{4})*([a-zA-Z/-9+]{2}[a-zA-Z/-9+=]{2})?$" )
Interestingly, the code is 2x faster than base64decode.awk, but it's only 3x faster than forking the base64 binary from inside awk.
notes:
In a base64 encoded string, 4 characters represent 3 bytes of data; the input has to be processed in groups of 4 characters.
Multiplying and dividing an integer by a power of two is equivalent to doing bitwise left and right shift operations.
262144 is 2^18, so N * 262144 is equivalent to N << 18
4096 is 2^12, so N * 4096 is equivalent to N << 12
64 is 2^6, so N * 64 is equivalent to N << 6
65536 is 2^16, so N / 65536 (integer division) is equivalent to N >> 16
256 is 2^8, so N / 256 (integer division) is equivalent to N >> 8
What happens in printf "%c", N:
N is first converted to an integer (if need be) and then, WITH LANG=C, the 8 least significant bits are used for the %c formatting.
How the possible padding of one or two trailing = characters at the end of the encoded string is handled:
If the 4th char isn't = (i.e. there's no padding) then the result should be 3 bytes of data.
If the 4th char is = and the 3rd char isn't = then there are 2 bytes of data to decode.
If the fourth char is = and the third char is = then there's only one byte of data.
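The notes above can be sketched in Python with explicit shifts instead of multiplications and divisions. This is only an illustration of the same bit arithmetic, not a replacement for the awk function; the names `ALPHABET`, `INDEX` and the Python `base64_decode` are chosen here, and like the awk version it assumes well-formed input whose length is a multiple of 4.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def base64_decode(text: str) -> bytes:
    """Decode padded base64 text, four characters (24 bits) at a time."""
    out = bytearray()
    for pos in range(0, len(text), 4):
        c1, c2, c3, c4 = text[pos:pos + 4]
        # "=" is not in INDEX, so padding contributes zero bits.
        bits = (INDEX[c1] << 18) | (INDEX[c2] << 12) \
            | (INDEX.get(c3, 0) << 6) | INDEX.get(c4, 0)
        out.append((bits >> 16) & 0xFF)      # always at least one byte
        if c3 != "=":
            out.append((bits >> 8) & 0xFF)   # second byte unless "xx=="
        if c4 != "=":
            out.append(bits & 0xFF)          # third byte unless "xxx="
    return bytes(out)
```

Deciding the number of output bytes from the padding characters, rather than from whether a decoded byte happens to be zero, is what keeps embedded NUL bytes intact.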

How does the RarePackFour smartcontract generate a new "unique" number from a given random number?

I'm trying to understand the RarePackFour smart contract from the game Gods Unchained. I noticed that they use a random number to generate other "random" numbers (in quotes because I don't think the newly generated numbers are random).
This is the code I'm trying to understand. Could you help me understand what is happening here?
function extract(uint random, uint length, uint start) internal pure returns (uint) {
return (((1 << (length * 8)) - 1) & (random >> ((start * 8) - 1)));
}
Bitwise operators are not really a strong point for me, so it would really help if you could explain what is happening in the code.
Let's go through an example:
length = 1
start = 1
random = 3250 # Binary: 0b110010110010
1. ((1 << (length * 8)) - 1) = 2^8 - 1 = 255 = 0b11111111
2. (random >> ((start * 8) - 1)) = 0b110010110010 >> 7 = 0b11001 = 25
3. 0b11111111 & 0b11001 = 0b11001 = 25
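In general, the function shifts random right by (start * 8) - 1 bits, which is integer division by 2^((start * 8) - 1), and then keeps only the lowest length * 8 bits of the result. Whenever the shifted value already fits in length * 8 bits, as in this example, the mask changes nothing. Note that Solidity only does integer arithmetic here.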
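The same calculation can be replayed in Python to experiment with other inputs. This is a plain port of the expression, assuming start >= 1; Python integers are unbounded, so it does not model the 256-bit width of a Solidity uint.

```python
def extract(random: int, length: int, start: int) -> int:
    """Return `length` bytes' worth of bits of `random`, after dropping
    its lowest (start * 8 - 1) bits."""
    mask = (1 << (length * 8)) - 1   # e.g. length = 1 gives 0b11111111
    shift = start * 8 - 1            # e.g. start = 1 gives a shift of 7
    return mask & (random >> shift)


print(extract(3250, 1, 1))  # the worked example above
```

Since the result is just a fixed slice of the bits of random, the same input always gives the same output; no new randomness is created.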

thermal printer stalls when printing image

I have two Bluetooth thermal printers as well as an integrated device.
One of the printers doesn't support QR codes via GS ( k .. 49, so I'm printing by loading a file.bmp into a Bitmap kotlin class and then sending as image via GS v 0.
The problem I'm facing is that when I print the QR image the other printer stalls mid-image.
I must restart the printer for it to work properly, otherwise it'll print garbage.
The source file has the following characteristics:
82x82 pixels
2.9x2.9 cm print size (needs to be 3 cm)
24 bits per pixel
2 colors
It's loaded into a kotlin Bitmap as such:
var bfo = BitmapFactory.Options()
bfo.outHeight = 20
bfo.outWidth = 20
bfo.inJustDecodeBounds = false
val fRawBmp = File(qrCodeRawFilePath)
val rawBmp = BitmapFactory.decodeFile(fRawBmp.absolutePath, bfo)
.outHeight and .outWidth don't seem to have any effect on dimensions (probably used for screen rendering?). The rawBmp object has the following characteristics:
82x82 px
total Bytes: 26896
bytes per row: 328
bytes per px: 4
Since the width is too small it must be scaled with:
if(inBmp.width < 264) {
val startTime = System.nanoTime()
qrBmp = Bitmap.createScaledBitmap(inBmp, 264, 264, true)
val endTime = System.nanoTime()
val duration = endTime - startTime
wasScaled = true
}
This changes the characteristics to
264x264px
total bytes 278784
bytes per row 1056
bytes per px 4
Since the width is a multiple of 8 it doesn't need to be padded.
I then setup the GS v 0 header:
val bytesPerLine = ceil((widthInPx.toFloat() / 8f).toDouble()).toInt()
val m = 0 // 0-3
val xH = bytesPerLine / 256
val xL = bytesPerLine - xH * 256
val yH = heightInPx / 256
val yL = heightInPx - yH * 256
val imageBytes = ByteArray(8 + bytesPerLine * heightInPx)
System.arraycopy(byteArrayOf(0x1D, 0x76, 0x30, m.toByte(), xL.toByte(), xH.toByte(), yL.toByte(), yH.toByte()), 0, imageBytes, 0, 8)
I must have 1 bit per pixel or the image will be distorted. I achieve it with this (adapted from ESCPOS-ThermalPrinter):
var i = 8
for (posY in 0 until heightInPx) {
var jj = 0
while (jj < widthInPx) {
val stringBinary = StringBuilder()
for (k in 0..7) {
val posX = jj + k
if (posX < widthInPx) {
val color: Int = qrBmp.getPixel(posX, posY)
val r = color shr 16 and 0xff
val g = color shr 8 and 0xff
val b = color and 0xff
if (r > 160 && g > 160 && b > 160) {
stringBinary.append("0")
} else {
stringBinary.append("1")
}
} else {
stringBinary.append("0")
}
}
imageBytes[i++] = stringBinary.toString().toInt(2).toByte()
jj += 8
}
}
The final parameters are:
m: 0
xL: 33 bytes
xH: 0 bytes
yL: 8 dots
yH: 1 dots
k: 8712
data: 8720 bytes (8+k)
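As a cross-check of the header and packing logic, here is a small Python sketch that builds the same GS v 0 byte sequence from rows of booleans (True meaning a black dot). The function name `gs_v0_raster` and the row-of-booleans input are assumptions made for this illustration; it packs bits with shifts instead of building a binary string per byte.

```python
def gs_v0_raster(rows, mode=0):
    """Build a GS v 0 raster command from rows of booleans (True = black)."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    bytes_per_line = (width + 7) // 8            # round up to whole bytes
    header = bytes([0x1D, 0x76, 0x30, mode,
                    bytes_per_line % 256, bytes_per_line // 256,  # xL, xH
                    height % 256, height // 256])                 # yL, yH
    body = bytearray()
    for row in rows:
        for x0 in range(0, width, 8):
            byte = 0
            for k in range(8):
                x = x0 + k
                bit = 1 if x < width and row[x] else 0  # pad with white
                byte = (byte << 1) | bit                # leftmost pixel = MSB
            body.append(byte)
    return header + bytes(body)


data = gs_v0_raster([[True] * 264 for _ in range(264)])
print(len(data), list(data[:8]))
```

For a 264x264 image this gives xL = 33, xH = 0, yL = 8, yH = 1 and 8720 bytes in total, matching the parameters listed above.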
I then send it to the OutputStream of the Bluetooth socket and the printer chokes on the image.
I'm testing with multiple devices with different Android versions, ABIs, Bluetooth versions and architectures. Occasionally it'll print on one device or another, but it mostly fails.
If using some demo apps from the net, the printer does print images, so I assume I'm doing something wrong.
Perhaps the image is too big for the buffer?
Edit 1
On a simple test using text1 + image + text2, it'll print text1 and the image if I flush the stream, but won't print text2, i.e.:
bt.outStream!!.write(byteArrayOf(0x1B, 0x74, 0x02)) // ESC t codepage PC437 USA Standard Europe
bt.outStream?.write("text1\n".toByteArray(Charsets.ISO_8859_1))
bt.outStream?.flush()
var bfo = BitmapFactory.Options()
bfo.inJustDecodeBounds = false
val fRawBmp = File(path2file)
val rawBmp = BitmapFactory.decodeFile(fRawBmp.absolutePath, bfo)
bt.outStream?.write(bmp2Bytes(rawBmp))
bt.outStream?.flush()
bt.outStream?.write("text2\n\n\n".toByteArray(Charsets.ISO_8859_1))
bt.outStream?.flush()
bt.outStream?.close()
bt.inStream?.close()
bt.socket?.close()
The QR code is readable but I must still restart the printer. So I must be overflowing something...
Turns out the problem wasn't in the printer buffer, missing ESC/POS command or data size.
I must wait before closing the Bluetooth socket otherwise there may be unsent data.
So,
Thread.sleep(400) // 200ms is enough for _most_ devices I tested
bt.outStream?.write("text2\n\n\n".toByteArray(Charsets.ISO_8859_1))
bt.outStream?.flush()
bt.outStream?.close()
bt.inStream?.close()
bt.socket?.close()

How to propagate `\n` to sympy.latex()

The goal is to format a polynomial with more than 6 parameters into a plot title.
Here is my polynomial parameter to string expression function, inspired by this answer, followed by sym.latex():
def func(p_list):
    str_expr = ""
    for i in range(len(p_list)-1,-1,-1):
        if (i%2 == 0 and i !=len(p_list)):
            str_expr =str_expr + " \n "
        if p_list[i]>0:
            sign = " +"
        else:
            sign = ""
        if i > 1:
            str_expr = str_expr+" + %s*x**%s"%(p_list[i],i)
        if i == 1:
            str_expr = str_expr+" + %s*x"%(p_list[i])
        if i == 0:
            str_expr = str_expr+sign+" %s"%(p_list[i])
    print("str_expr",str_expr)
    return sym.sympify(str_expr)
popt = [-2,1,1] # some toy data
tex = sym.latex(func(popt))
print("tex",tex)
Outputs:
str_expr
+ -1*x**2 + 1*x
-2
tex - x^{2} + x - 2
In str_expr the line breaks from \n are visible, yet in the sympy.latex output they are gone.
How to propagate this linebreak?
Edit: I took @wsdookadr's answer and modified it so that plt.title takes the result of the function as the text argument:
def tex_multiline_poly(e, chunk_size=2, separator="\n"):
    tex = ""
    # split into monomials
    print("reversed(e.args)",reversed(e.args))
    mono = list(e.args)
    print("mono",mono)
    mono.reverse()
    print("mono",mono)
    # we're going to split the list of monomials into chunks of chunk_size,
    # serialize each chunk, and insert separators between the chunks
    for i in range(0,len(mono),chunk_size):
        chunk = mono[i:i + chunk_size]
        print("sum(chunk)",sum(chunk))
        print("sym.latex(sum(chunk))",sym.latex(sum(chunk)))
        if i == 0:
            tex += r'$f(x)= %s$'%(sym.latex(sum(chunk)))+separator
        else:
            tex += '$%s$'%(sym.latex(sum(chunk))) + separator
    return tex
popt = est.params
x = sym.symbols('x')
p = sym.Poly.from_list(reversed(popt),gens=x)
tex = tex_multiline_poly(p.as_expr(),chunk_size=2)
plt.title(text=tex)
In your code, you're inserting a linebreak for every even-power monomial, except for the last one.
if (i%2 == 0 and i !=len(p_list)):
str_expr =str_expr + " \n "
Since you are just building a polynomial from a list of coefficients, your code can be simplified.
Generally what we want is to build/transform/handle things symbolically, and only at the end serialize them and print the result in some specific format
import sympy as sym
x = sym.symbols('x')
def func(p_list):
    expr = 0
    for i in range(len(p_list)-1,-1,-1):
        expr += p_list[i] * (x ** i)
    return sym.sympify(expr)
popt = [-2,1,1]
p = func(popt)
p_tex = sym.latex(p)
p_str = str(p)
print("str:", p_str)
print("tex:", p_tex)
Output:
str: x**2 + x - 2
tex: x^{2} + x - 2
We could simplify this even further by using SymPy's built-in functions to build the poly from a list of coefficients:
import sympy as sym
from sympy import symbols
popt = [-2,1,1]
x = symbols('x')
p = sym.Poly.from_list(reversed(popt),gens=x)
p_tex = sym.latex(p.as_expr())
p_str = str(p.as_expr())
print("str:", p_str)
print("tex:", p_tex)
Output:
str: x**2 + x - 2
tex: x^{2} + x - 2
Does the output look like what you would expect?
UPDATE:
After learning more about the use-case, here's a version that inserts separators every N=2 monomials in the latex form of your expression.
import sympy as sym
from sympy import symbols
popt = [-2,1,1]
x = symbols('x')
p = sym.Poly.from_list(reversed(popt),gens=x)
def tex_multiline_poly(e, chunk_size=2, separator="\n"):
    tex = ""
    # split into monomials
    mono = list(reversed(e.args))
    # we're going to split the list of monomials into chunks of chunk_size,
    # serialize each chunk, and insert separators between the chunks
    for i in range(0,len(mono),chunk_size):
        chunk = mono[i:i + chunk_size]
        tex += sym.latex(sum(chunk)) + separator
    return tex
p_tex = tex_multiline_poly(p.as_expr(),chunk_size=2)
p_str = str(p.as_expr())
print("str:",p_str)
print("tex:",p_tex)
Output:
str: x**2 + x - 2
tex: x^{2} + x
-2
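If SymPy is only needed for the formatting, the same chunking idea can be sketched in plain Python straight from the coefficient list. This swaps SymPy's printer for simple string formatting, so it is a different technique from the answer above; the function name `poly_title` is made up, and each line is wrapped in its own $...$ pair, as in the edited question, on the assumption that the title is rendered with matplotlib's mathtext.

```python
def poly_title(coeffs, chunk_size=2, var="x"):
    """Format a polynomial as mathtext lines, highest power first.

    coeffs[i] is the coefficient of var**i; zero terms are skipped.
    """
    terms = []
    for power in range(len(coeffs) - 1, -1, -1):
        c = coeffs[power]
        if c == 0:
            continue
        sign = "-" if c < 0 else "+"
        mag = abs(c)
        if power == 0:
            body = f"{mag}"
        else:
            base = var if power == 1 else f"{var}^{{{power}}}"
            body = base if mag == 1 else f"{mag} {base}"
        terms.append((sign, body))
    lines = []
    for i in range(0, len(terms), chunk_size):
        text = " ".join(f"{s} {b}" for s, b in terms[i:i + chunk_size])
        if i == 0 and text.startswith("+ "):
            text = text[2:]          # no leading plus on the first line
        lines.append(f"${text}$")
    return "\n".join(lines)


print(poly_title([-2, 1, 1]))
```

The returned string could then be passed as the title text, for example plt.title(poly_title(popt)).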
Edit: wrong edit

Faster way to structure operations on offset neighborhoods in OpenCL

How can an operation on many overlapping but offset blocks of a 2D array be structured for more efficient execution in OpenCL?
For example, I have the following OpenCL kernel:
__kernel void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
}
}
write_imageui(dest, pos, diff);
}
It produces correct results, but is slow... only ~25 GFLOPS on NVS4200M with 1k by 1k input. (The hardware spec is 155 GFLOPS). I'm guessing this has to do with the memory access patterns. Each work item reads one 16x16 block of data, which is the same block read by all its neighbors in a 16x16 area, and also another offset block of data, which most of the time overlaps with those of its immediate neighbors. All reads are through samplers. The host program is PyOpenCL (I don't think that actually changes anything) and the work-group size is 16x16.
EDIT: New version of kernel per suggestion below, copy work area to local variables:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int dx = pos.x % 16;
int dy = pos.y % 16;
__local uint4 local_src[16*16];
__local uint4 local_src2[32*32];
local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
barrier(CLK_LOCAL_MEM_FENCE);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
}
}
write_imageui(dest, pos, diff);
}
Result: output is correct, running time is 56% slower. If using local_src only (not local_src2), the result is ~10% faster.
EDIT: Benchmarked on much more powerful hardware, AMD Radeon HD 7850 gets 420GFLOPS, spec is 1751GFLOPS. To be fair the spec is for multiply-add, and there is no multiply here so the expected is ~875GFLOPS, but this is still off by quite a lot compared to the theoretical performance.
EDIT: To ease running tests for anyone who would like to try this out, the host-side program in PyOpenCL below:
import pyopencl as cl
import numpy
import numpy.random
from time import time
CL_SOURCE = '''
// kernel goes here
'''
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()
h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255
mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))
# warmup
for n in range(10):
    event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
# benchmark
t1 = time()
for n in range(100):
    event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
t2 = time()
print("Duration (host): ", (t2-t1)/100)
print("Duration (event): ", (event.profile.end-event.profile.start)*1e-9)
EDIT: Thinking about the memory access patterns, the original naive version may be pretty good; when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) all work-items in a work group are reading the same location (so this is just one read??), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they are reading sequential locations (so the reads can be coalesced perfectly??).
This is definitely a memory access problem. Neighbouring work items' pixels can overlap by as much as 15x16, and worse yet, each work item will overlap at least 225 others.
I would use local memory and get work groups to cooperatively process many 16x16 blocks. I like to use a large, square block for each work group. Rectangular blocks are a bit more complicated, but can get better memory utilization for you.
If you read blocks of n by n pixels from your source image, the borders will overlap by n x 15 (or 15 x n). You need to calculate the largest possible value for n based on your available local memory size (LDS). If you are using OpenCL 1.1 or greater, the LDS is at least 32 KB; OpenCL 1.0 promises 16 KB per work group.
n <= sqrt(32kb / sizeof(uint4))
n <= sqrt(32768 / 16)
n ~ 45
Using n=45 will use 32400 out of 32768 bytes of the LDS, and let you use 900 work items per group (45-15)^2 = 900. Note: Here's where a rectangular block would help out; for example 64x32 would use all of the LDS, but with group size = (64-15)*(32-15) = 833.
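The arithmetic above can be reproduced with a few lines of Python, which makes it easy to try other local memory sizes or element sizes. The helper name `tile_budget` and its parameters are made up for this sketch; it assumes a square tile and the 16x16 window from the kernel.

```python
import math


def tile_budget(lds_bytes=32768, elem_bytes=16, window=16):
    """Largest square tile that fits in local memory, and what it allows.

    Returns (n, bytes_used, work_items), where work_items is the number of
    window positions that fit entirely inside the n x n tile.
    """
    n = math.isqrt(lds_bytes // elem_bytes)   # floor(sqrt(number of elements))
    bytes_used = n * n * elem_bytes
    work_items = (n - (window - 1)) ** 2
    return n, bytes_used, work_items


print(tile_budget())                            # square tile, 32 KB of LDS
print(64 * 32 * 16, (64 - 15) * (32 - 15))      # the 64x32 rectangular variant
```

Keep in mind that the device's maximum work-group size also limits how many work items can actually share one tile.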
steps to use LDS for your kernel:
allocate a 1D or 2D local array for your cached block of the image. I use a #define constant, and it rarely has to change.
read the uint values from your image, and store locally.
adjust 'pos' for each work item to relate to the local memory
execute the same i,j loops you have, but using the local memory to read values. remember that the i and j loops stop 15 short of n.
Each step can be searched online if you are not sure how to implement it, or you can ask me if you need a hand.
Chances are good that the LDS on your device will outperform the texture read speed. This is counter-intuitive, but remember that you are reading tiny amounts of data at a time, so the gpu may not be able to cache the pixels effectively. The LDS usage will guarantee that the pixels are available, and given the number of times each pixel is read, I expect this to make a huge difference.
Please let me know what kind of results you observe.
UPDATE: Here's my attempt to better explain my solution. I used graph paper for my drawings, because I'm not all that great with image manipulation software.
Above is a sketch of how the values were read from src in your first code snippet. The big problem is that the pos0 rectangle -- 16x16 uint4 values -- is being read in its entirety for each work item in the group (256 of them). My solution involves reading a large area and sharing the data among all 256 work items.
If you store a 31x31 region of your image in local memory, all 256 work items' data will be available.
steps:
use work group dimensions: (16,16)
read the values of src into a large local buffer ie: uint4 buff[31][31]; The buffer needs to be translated such that 'pos0' is at buff[0][0]
barrier(CLK_LOCAL_MEM_FENCE) to wait for memory copy operations
do the same i,j for loops you had originally, except you leave out the pos and pos0 values. only use i and j for the location. Accumulate 'diff' in the same way you were doing so originally.
write the solution to 'dest'
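The steps above can be checked on the CPU with a small pure-Python model before porting them to OpenCL. This is only a reference sketch under simplifying assumptions: single-channel integer pixels instead of uint4, ordinary signed arithmetic instead of unsigned wraparound, and clamp-to-edge addressing emulated by the hypothetical helper `pixel`. `naive_item` mirrors the original kernel for one work item, and `tiled_group` mirrors one work group sharing a (2b-1) x (2b-1) tile, which is 31x31 for b = 16.

```python
def pixel(src, x, y):
    """Read src[y][x] with clamp-to-edge addressing."""
    h, w = len(src), len(src[0])
    return src[max(0, min(y, h - 1))][max(0, min(x, w - 1))]


def naive_item(src, x, y, b):
    """What one work item computes in the original kernel."""
    x0, y0 = x - x % b, y - y % b
    return sum(pixel(src, x0 + i, y0 + j) - pixel(src, x + i, y + j)
               for j in range(b) for i in range(b))


def tiled_group(src, gx, gy, b):
    """What one b x b work group computes from a single shared tile."""
    n = 2 * b - 1
    # Cooperative load: the tile origin is the group's pos0.
    tile = [[pixel(src, gx + i, gy + j) for i in range(n)] for j in range(n)]
    out = {}
    for ly in range(b):              # local work-item coordinates
        for lx in range(b):
            diff = 0
            for j in range(b):
                for i in range(b):
                    diff += tile[j][i] - tile[ly + j][lx + i]
            out[(gx + lx, gy + ly)] = diff
    return out


src = [[(7 * x + 13 * y) % 251 for x in range(8)] for y in range(8)]
group = tiled_group(src, 4, 4, 4)
print(all(v == naive_item(src, x, y, 4) for (x, y), v in group.items()))
```

Both versions read exactly the same pixels; the tiled one just fetches each of them once per group instead of once per work item.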
This is the same as my first response to your question, except I use n=16. This value does not utilize the local memory fully, but will probably work well for most platforms. 256 tends to be a common maximum work group size.
I hope this clears things up for you.
Some suggestions:
Compute more than 1 output pixel in each work item. It will increase data reuse.
Benchmark different work-group sizes to maximize the usage of texture cache.
Maybe there is a way to separate the kernel into two passes (horizontal and vertical).
Update: more suggestions
Instead of loading everything in local memory, try loading only the local_src values, and use read_image for the other one.
Since you do almost no computations, you should measure read speed in GB/s, and compare to the peak memory speed.