Thermal printer stalls when printing image - Kotlin

I have two Bluetooth thermal printers as well as an integrated device.
One of the printers doesn't support QR codes via GS ( k .. 49, so I'm printing them by loading a .bmp file into an Android Bitmap and sending it as an image via GS v 0.
The problem I'm facing is that when I print the QR image, the printer stalls mid-image.
I must restart the printer for it to work properly; otherwise it'll print garbage.
The source file has the following characteristics:
82x82 pixels
2.9x2.9 cm print size (needs to be 3 cm)
24 bits per pixel
2 colors
It's loaded into an Android Bitmap like so:
var bfo = BitmapFactory.Options()
bfo.outHeight = 20
bfo.outWidth = 20
bfo.inJustDecodeBounds = false
val fRawBmp = File(qrCodeRawFilePath)
val rawBmp = BitmapFactory.decodeFile(fRawBmp.absolutePath, bfo)
.outHeight and .outWidth don't seem to have any effect on the dimensions. (They are actually output fields: the decoder fills them in with the decoded image's size; decoding is influenced by the in* options such as inSampleSize.) The rawBmp object has the following characteristics:
82x82 px
total Bytes: 26896
bytes per row: 328
bytes per px: 4
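For reference, a minimal sketch of how those options are meant to be used (my own illustration; note inSampleSize only downsamples, by powers of two, so it can't upscale this 82x82 source):

val boundsOpts = BitmapFactory.Options().apply { inJustDecodeBounds = true }
BitmapFactory.decodeFile(qrCodeRawFilePath, boundsOpts) // returns null, only fills outWidth/outHeight
val srcW = boundsOpts.outWidth  // 82 for this file
val srcH = boundsOpts.outHeight // 82
val decodeOpts = BitmapFactory.Options().apply { inSampleSize = 1 } // 2 would halve each dimension
val bmp = BitmapFactory.decodeFile(qrCodeRawFilePath, decodeOpts)

Since this QR needs upscaling rather than downscaling, createScaledBitmap is the appropriate tool anyway.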
Since the width is too small it must be scaled with:
if (inBmp.width < 264) {
    val startTime = System.nanoTime()
    qrBmp = Bitmap.createScaledBitmap(inBmp, 264, 264, true)
    val endTime = System.nanoTime()
    val duration = endTime - startTime // timing kept for profiling
    wasScaled = true
}
This changes the characteristics to
264x264px
total bytes 278784
bytes per row 1056
bytes per px 4
Since the width is a multiple of 8 it doesn't need to be padded.
I then set up the GS v 0 header:
// bytes per raster line: widthInPx / 8, rounded up
val bytesPerLine = (widthInPx + 7) / 8
val m = 0 // raster mode, 0-3 (0 = normal)
val xH = bytesPerLine / 256
val xL = bytesPerLine - xH * 256
val yH = heightInPx / 256
val yL = heightInPx - yH * 256
val imageBytes = ByteArray(8 + bytesPerLine * heightInPx)
// GS v 0 m xL xH yL yH, followed by the raster data
System.arraycopy(byteArrayOf(0x1D, 0x76, 0x30, m.toByte(), xL.toByte(), xH.toByte(), yL.toByte(), yH.toByte()), 0, imageBytes, 0, 8)
I must have 1 bit per pixel or the image will be distorted. I achieve it with this (adapted from ESCPOS-ThermalPrinter):
var i = 8
for (posY in 0 until heightInPx) {
    var jj = 0
    while (jj < widthInPx) {
        // pack 8 horizontal pixels into one byte, MSB = leftmost pixel
        val stringBinary = StringBuilder()
        for (k in 0..7) {
            val posX = jj + k
            if (posX < widthInPx) {
                val color: Int = qrBmp.getPixel(posX, posY)
                val r = color shr 16 and 0xff
                val g = color shr 8 and 0xff
                val b = color and 0xff
                // light pixels stay white (0), everything else prints black (1)
                if (r > 160 && g > 160 && b > 160) {
                    stringBinary.append("0")
                } else {
                    stringBinary.append("1")
                }
            } else {
                stringBinary.append("0") // pad past the right edge with white
            }
        }
        imageBytes[i++] = stringBinary.toString().toInt(2).toByte()
        jj += 8
    }
}
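Incidentally, the string round-trip per byte can be replaced with plain bit operations. A behavior-equivalent sketch of my own (same 160 threshold, MSB = leftmost pixel):

var i = 8
for (posY in 0 until heightInPx) {
    var posX = 0
    while (posX < widthInPx) {
        var packed = 0
        for (k in 0..7) {
            val x = posX + k
            if (x < widthInPx) {
                val color = qrBmp.getPixel(x, posY)
                val r = color shr 16 and 0xff
                val g = color shr 8 and 0xff
                val b = color and 0xff
                if (!(r > 160 && g > 160 && b > 160)) {
                    packed = packed or (1 shl (7 - k)) // dark pixel -> 1 bit
                }
            }
        }
        imageBytes[i++] = packed.toByte()
        posX += 8
    }
}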
The final parameters are:
m: 0
xL: 33 (low byte of bytes per line)
xH: 0 (high byte of bytes per line)
yL: 8 (low byte of the height in dots)
yH: 1 (high byte of the height in dots; 1*256 + 8 = 264)
k: 8712 raster bytes (33 * 264)
data: 8720 bytes total (8-byte header + k)
I then send it to the OutputStream of the Bluetooth socket, and the printer chokes on the image.
I'm testing with multiple devices with different Android versions, ABIs, Bluetooth versions and architectures - occasionally it'll print on one device or another, but it mostly fails.
If I use some demo apps from the net, the printer does print images, so I assume I'm doing something wrong.
Perhaps the image is too big for the buffer?
Edit 1
On a simple test using text1 + image + text2, it'll print text1 and the image if I flush the stream, but it won't print text2, i.e.:
bt.outStream?.write(byteArrayOf(0x1B, 0x74, 0x02)) // ESC t: codepage PC437 USA Standard Europe
bt.outStream?.write("text1\n".toByteArray(Charsets.ISO_8859_1))
bt.outStream?.flush()
val bfo = BitmapFactory.Options()
bfo.inJustDecodeBounds = false
val fRawBmp = File(path2file)
val rawBmp = BitmapFactory.decodeFile(fRawBmp.absolutePath, bfo)
bt.outStream?.write(bmp2Bytes(rawBmp))
bt.outStream?.flush()
bt.outStream?.write("text2\n\n\n".toByteArray(Charsets.ISO_8859_1))
bt.outStream?.flush()
bt.outStream?.close()
bt.inStream?.close()
bt.socket?.close()
The QR code is readable, but I must still restart the printer. So I must be overflowing something...

Turns out the problem wasn't the printer buffer, a missing ESC/POS command, or the data size.
I must wait before closing the Bluetooth socket; otherwise there may be unsent data.
So,
Thread.sleep(400) // 200ms is enough for _most_ devices I tested
bt.outStream?.write("text2\n\n\n".toByteArray(Charsets.ISO_8859_1))
bt.outStream?.flush()
bt.outStream?.close()
bt.inStream?.close()
bt.socket?.close()
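For what it's worth, the same pattern as a reusable helper; a minimal sketch where PrinterConnection is a hypothetical stand-in for my bt object, not a real API:

import android.bluetooth.BluetoothSocket
import java.io.InputStream
import java.io.OutputStream

// Hypothetical wrapper mirroring the bt object used above.
class PrinterConnection(
    val socket: BluetoothSocket?,
    val inStream: InputStream?,
    val outStream: OutputStream?
)

// flush() only hands the bytes to the Bluetooth stack; it does not guarantee
// they reached the printer, so wait before tearing the socket down.
fun PrinterConnection.closeAfterDrain(drainMillis: Long = 400) {
    outStream?.flush()
    Thread.sleep(drainMillis)
    outStream?.close()
    inStream?.close()
    socket?.close()
}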

Related

Using groupBy/groupingBy/aggregate to sum into smaller buckets in parallel?

I've got a collection of "stuff" that I'd like to sum into smaller buckets. (In my particular case, I'm downsampling a luma channel of an image by 8x.)
I'd like it to be as fast as possible on your average multi-core Android device, which I think means coroutine-per-bucket (there isn't any reason to play with IntAdders if I don't have to).
The naive linear solution:
val SCALE = 8
image.planes[0].buffer.toByteArray().forEachIndexed { index, byte ->
    val x1 = index % image.width
    val y1 = index / image.width
    val x2 = x1 / SCALE
    val y2 = y1 / SCALE
    val quadIdx = y2 * (image.width / SCALE) + x2
    summedQuadLum[quadIdx] += (byte.toInt() and 0xFF)
}
That isn't great: it needs the summedQuadLum collection pre-declared, and it doesn't have any chance of parallel work.
I'd love to use groupBy (or groupingBy? or aggregate?), but those all seem to use the values to determine the new keys, and I need to use the index to determine the new keys. I think the lowest-overhead way to get the index is withIndex, which could be done as:
val thumbSums = bufferArray.withIndex().groupingBy { (idx, _) ->
    val x1 = idx % previewImageDimension.width
    val y1 = idx / previewImageDimension.width
    val x2 = x1 / SCALE
    val y2 = y1 / SCALE
    y2 * (previewImageDimension.width / SCALE) + x2
}.aggregate { _, acc: Int?, (_, lum), _ ->
    (acc ?: 0) + (lum.toInt() and 0xFF)
}.values.toIntArray()
Much better, it is really close - if I could figure out how to sum each bucket in a coroutine, I think it would be as good as can be expected.
So after groupingBy we have a Grouping object, which we can use to aggregate the values. It's important to notice that the grouping itself has not been done yet; we basically have a description of how to group the values and an iterator over the original array. From here we have a few options:
Create a Channel from the iterator and launch a few worker coroutines to consume it in parallel. Channels support fan-out, so every item in the source is processed by one worker only. The problem is that the workers all need to update items in the same resulting array, so synchronization is required, and that's where it gets tricky and likely inefficient.
To avoid multiple workers writing to the same item, we need to tell each of them which items to process. That means either each worker scans all the items, picking only the suitable ones, or we precalculate the groups in advance and feed the workers with the groups. Both approaches have pretty much the same performance as the serial algorithm, so neither makes any sense.
So to parallelize efficiently we want to avoid shared mutable state, because it requires synchronization, and we have already ruled out precalculating the groups.
My suggestion is to come at it from the other side: instead of mapping the original array to the sampled one, map the sampled array onto the original. That is, for each pixel of the sampled image, read and average the corresponding SCALE x SCALE block of source pixels.
This approach makes each output value be calculated independently by one worker, so no synchronization is needed. Now we can implement it like this:
suspend fun sample() = coroutineScope { // async needs a scope; coroutineScope provides one
    val asyncFactor = 8
    val src = Image(bufferArray, width)
    val dst = Image(src.width / SCALE, src.height / SCALE)
    // split the destination image into chunks, one coroutine per chunk
    val chunkSize = dst.sizeBytes / asyncFactor
    val jobs = Array(asyncFactor) { idx ->
        async(Dispatchers.Default) {
            val chunkStartIdx = chunkSize * idx
            val chunkEndIdxExclusive = min(chunkStartIdx + chunkSize, dst.sizeBytes)
            calculateSampledImageForIndexes(src, dst, chunkStartIdx, chunkEndIdxExclusive, SCALE)
        }
    }
    awaitAll(*jobs)
}
private fun calculateSampledImageForIndexes(src: Image, dst: Image, startIdx: Int, exclusiveEndIdx: Int, scaleFactor: Int) {
    for (i in startIdx until exclusiveEndIdx) {
        val destX = i % dst.width
        val destY = i / dst.width
        val srcX = destX * scaleFactor
        val srcY = destY * scaleFactor
        var sum = 0
        for (xi in 0 until scaleFactor) {
            for (yi in 0 until scaleFactor) {
                sum += src[srcX + xi, srcY + yi].toInt() and 0xFF // mask to treat luma as unsigned
            }
        }
        dst[destX, destY] = sum / (scaleFactor * scaleFactor) // average of the block
    }
}
Where Image is a convenient wrapper around the image data buffer:
class Image(val buffer: ByteArray, val width: Int) {
    val height = buffer.size / width
    val sizeBytes get() = buffer.size

    constructor(w: Int, h: Int) : this(ByteArray(w * h), w)

    // row-major layout: index = y * width + x
    operator fun get(x: Int, y: Int): Byte = buffer[clampY(y) * width + clampX(x)]
    operator fun set(x: Int, y: Int, value: Int) {
        buffer[y * width + x] = (value and 0xFF).toByte()
    }

    private fun clampX(x: Int) = max(min(x, width - 1), 0)
    private fun clampY(y: Int) = max(min(y, height - 1), 0)
}
Also, with this approach you can easily implement many image-processing functions that are based on the convolution operation, like blur and edge detection.
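For example, a 3x3 box blur on top of the same Image wrapper (my own sketch, single-threaded for brevity; the clamped reads take care of the borders):

// dst must have the same dimensions as src, e.g. Image(src.width, src.height)
fun boxBlur3x3(src: Image, dst: Image) {
    for (y in 0 until src.height) {
        for (x in 0 until src.width) {
            var sum = 0
            for (dy in -1..1) {
                for (dx in -1..1) {
                    sum += src[x + dx, y + dy].toInt() and 0xFF
                }
            }
            dst[x, y] = sum / 9 // average of the 3x3 neighborhood
        }
    }
}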

Different FFT results from Matlab fft and Objective-C fft

Here is my code in Matlab:
x = [1 2 3 4];
result = fft(x);
a = real(result);
b = imag(result);
Result from Matlab:
a = [10,-2,-2,-2]
b = [ 0, 2, 0,-2]
And my runnable code in Objective-C:
int length = 4;
float* x = (float *)malloc(sizeof(float) * length);
x[0] = 1;
x[1] = 2;
x[2] = 3;
x[3] = 4;
// Setup the length
vDSP_Length log2n = log2f(length);
// Calculate the weights array. This is a one-off operation.
FFTSetup fftSetup = vDSP_create_fftsetup(log2n, FFT_RADIX2);
// For an FFT, numSamples must be a power of 2, i.e. is always even
int nOver2 = length/2;
// Define complex buffer
COMPLEX_SPLIT A;
A.realp = (float *) malloc(nOver2*sizeof(float));
A.imagp = (float *) malloc(nOver2*sizeof(float));
// Generate a split complex vector from the sample data
vDSP_ctoz((COMPLEX*)x, 2, &A, 1, nOver2);
// Perform a forward FFT using fftSetup and A
vDSP_fft_zrip(fftSetup, &A, 1, log2n, FFT_FORWARD);
//Take the fft and scale appropriately
Float32 mFFTNormFactor = 0.5;
vDSP_vsmul(A.realp, 1, &mFFTNormFactor, A.realp, 1, nOver2);
vDSP_vsmul(A.imagp, 1, &mFFTNormFactor, A.imagp, 1, nOver2);
printf("After FFT: \n");
printf("%.2f | %.2f \n",A.realp[0], 0.0);
for (int i = 1; i < nOver2; i++) {
    printf("%.2f | %.2f \n", A.realp[i], A.imagp[i]);
}
printf("%.2f | %.2f \n",A.imagp[0], 0.0);
The output from objective c:
After FFT:
10.0 | 0.0
-2.0 | 2.0
The results are close, but where is the rest? I know I missed something, but I don't know what.
Updated: I found another answer here. I updated the output:
After FFT:
10.0 | 0.0
-2.0 | 2.0
-2.0 | 0.0
but even then there's still one element missing: -2.0 | -2.0
Performing an FFT delivers a right-hand spectrum and a left-hand spectrum.
If you have N samples the frequencies you will return are:
( -f(N/2), -f(N/2-1), ... -f(1), f(0), f(1), f(2), ..., f(N/2-1) )
If A(f(i)) is the complex amplitude A of the frequency component f(i), the following relation is true:
Real{A(f(i))} = Real{A(-f(i))} and Imag{A(f(i))} = -Imag{A(-f(i))}
This means the information of the right-hand spectrum and the left-hand spectrum is the same. However, the sign of the imaginary part is different.
Matlab returns the frequencies in a different order.
Matlab's order is:
( f(0), f(1), f(2), ..., f(N/2-1), -f(N/2), -f(N/2-1), ..., -f(1) )
To get the order above, use the Matlab function fftshift().
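For illustration, a tiny sketch of my own (Kotlin) of the reordering fftshift performs, rotating the spectrum so the negative frequencies come first:

fun <T> fftshift(x: List<T>): List<T> {
    val firstHalf = (x.size + 1) / 2 // f(0)..f(N/2-1) for even N
    return x.drop(firstHalf) + x.take(firstHalf)
}
// fftshift(listOf(10, -2, -2, -2)) == [-2, -2, 10, -2], the real parts of the example below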
In the case of 4 samples you have got in Matlab:
a = [10,-2,-2,-2]
b = [ 0, 2, 0,-2]
This means:
A(f(0)) = 10 (DC value)
A(f(1)) = -2 + 2i (first frequency component of the right-hand spectrum)
A(-f(2)) = -2 (second frequency component of the left-hand spectrum)
A(-f(1)) = -2 - 2i (first frequency component of the left-hand spectrum)
I do not understand your Objective-C code.
However, it seems to me that the program returns the right-hand spectrum only.
So everything is fine.
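To make the symmetry concrete, here is a small sketch (my own, in Kotlin) that rebuilds the full N-point spectrum from the right-hand half, reproducing the Matlab arrays from the three components above:

fun fullSpectrum(halfRe: DoubleArray, halfIm: DoubleArray): Pair<DoubleArray, DoubleArray> {
    val n = (halfRe.size - 1) * 2 // the half arrays cover f(0)..f(N/2)
    val re = DoubleArray(n)
    val im = DoubleArray(n)
    for (k in 0..n / 2) {
        re[k] = halfRe[k]
        im[k] = halfIm[k]
        if (k in 1 until n / 2) { // mirror interior bins with conjugate symmetry
            re[n - k] = halfRe[k]
            im[n - k] = -halfIm[k]
        }
    }
    return re to im
}

fun main() {
    // right-hand half of the example: f(0) = 10, f(1) = -2 + 2i, f(2) = -2
    val (a, b) = fullSpectrum(doubleArrayOf(10.0, -2.0, -2.0), doubleArrayOf(0.0, 2.0, 0.0))
    println(a.toList()) // [10.0, -2.0, -2.0, -2.0], Matlab's a
    println(b.toList()) // [0.0, 2.0, 0.0, -2.0], Matlab's b
}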

Script interface for the Fit Image Palette introduced in GMS 2.3?

The Fit Image Palette is quite nice and powerful. Is there a script interface through which we can access it directly?
There is a script interface, and the example script below will get you started. However, the script interface is not officially supported, so it may be buggy or change in future GMS versions.
For GMS 2.3 the following script works:
// create the input image:
Image input := NewImage("formula test", 2, 100)
input = 500.5 - icol*11.1 + icol*icol*0.11
// add some random noise:
input += (random()-0.5)*sqrt(abs(input))
// create image with error data (not required)
Image errors := input.ImageClone()
errors = tert(input > 1, sqrt(input), 1)
// setup fit:
Image pars := NewImage("pars", 2, 3)
Image parsToFit := NewImage("pars to fit", 2, 3)
pars = 10; // starting values
parsToFit = 1;
Number chiSqr = 1e6
Number conv_cond = 0.00001
Result("\n starting pars = {")
Number xSize = pars.ImageGetDimensionSize(0)
Number i = 0
for (i = 0; i < xSize; i++)
{
Result(GetPixel(pars, i, 0))
if (i < (xSize-1)) Result(", ")
}
Result("}")
// fit:
String formulaStr = "p0 + p1*x + p2*x**2"
Number ok = FitFormula(formulaStr, input, errors, pars, parsToFit, chiSqr, conv_cond)
Result("\n results pars = {")
for (i = 0; i < xSize; i++)
{
Result(GetPixel(pars, i, 0))
if (i < (xSize-1)) Result(", ")
}
Result("}")
Result(", chiSqr ="+ chiSqr)
// plot results of fit:
Image plot := PlotFormula(formulaStr, input, pars)
// compare the plot and original data:
Image compare := NewImage("Compare Fit", 2, 100, 3)
compare[icol, 0] = input // original data
compare[icol, 1] = plot // fit function
compare[icol, 2] = input - plot // residuals
ImageDocument linePlotDoc = CreateImageDocument("Test Fitting")
ImageDisplay linePlotDsp = linePlotDoc.ImageDocumentAddImageDisplay(compare, 3)
linePlotDoc.ImageDocumentShow()

Removing the spacing between tiles in tilesheet

So I have an image which contains a tile-sheet, where each tile is approximately 16 pixels wide and high, but the tiles are spaced out with a transparent spacer between them.
Like so:
But this is ugly, makes displaying the sprites in the program annoying, and wastes valuable image space. Is there any easy way (besides manually moving each individual tile in Photoshop) to make it look like this?
I looked through Photoshop macros as well as other programs, and I didn't find anything that would do this directly.
Google also suggests I go to Home Depot and get tile caulk remover.
Try this snippet. As you said, it assumes the tiles are always going to be 16 pixels, that the top-left tile is already in the correct position, and that there's a single-pixel gap between tiles. The script also assumes the document will be opened with the layer containing your tiles set as the active layer.
#target photoshop
app.preferences.rulerUnits = Units.PIXELS;
app.preferences.typeUnits = TypeUnits.PIXELS;

var gap = 1;
var tileSize = 16;
var doc = app.activeDocument.duplicate();
var sourceLyr = doc.activeLayer;
var xTilePosition = 0;
var yTilePosition = 0;

// Walk the sheet tile by tile; the top-left tile is already in place.
for (var x = 0; x < sourceLyr.bounds[2]; x = x + tileSize + gap) {
    for (var y = 0; y < sourceLyr.bounds[3]; y = y + tileSize + gap) {
        if (x > 0 || y > 0) {
            app.activeDocument = doc;
            doc.activeLayer = sourceLyr;
            // select the tile at its current (spaced) position
            var selRegion = Array(Array(x, y),
                                  Array(x + tileSize, y),
                                  Array(x + tileSize, y + tileSize),
                                  Array(x, y + tileSize),
                                  Array(x, y));
            doc.selection.select(selRegion);
            // shift it back to its gapless position
            var dx = x - (xTilePosition * tileSize);
            var dy = y - (yTilePosition * tileSize);
            doc.selection.translate(0 - dx, 0 - dy);
        }
        yTilePosition++;
    }
    xTilePosition++;
    yTilePosition = 0;
}

Faster way to structure operations on offset neighborhoods in OpenCL

How can an operation on many overlapping but offset blocks of a 2D array be structured for more efficient execution in OpenCL?
For example, I have the following OpenCL kernel:
__kernel void test_kernel(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i = 0; i < 16; i++)
    {
        for (int j = 0; j < 16; j++)
        {
            diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
                    read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
        }
    }
    write_imageui(dest, pos, diff);
}
It produces correct results, but is slow: only ~25 GFLOPS on an NVS4200M with a 1k-by-1k input (the hardware spec is 155 GFLOPS). I'm guessing this has to do with the memory access patterns: each work item reads one 16x16 block of data that is the same for all its neighbors in a 16x16 area, and also another offset block that mostly overlaps those of its immediate neighbors. All reads go through samplers. The host program is PyOpenCL (I don't think that actually changes anything) and the work-group size is 16x16.
EDIT: New version of kernel per suggestion below, copy work area to local variables:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int dx = pos.x % 16;
    int dy = pos.y % 16;
    __local uint4 local_src[16*16];
    __local uint4 local_src2[32*32];
    local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
    local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
    local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
    local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
    local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
    barrier(CLK_LOCAL_MEM_FENCE);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i = 0; i < 16; i++)
    {
        for (int j = 0; j < 16; j++)
        {
            diff += local_src[j*16 + i] - local_src2[(j+dy)*32 + i+dx];
        }
    }
    write_imageui(dest, pos, diff);
}
Result: output is correct, running time is 56% slower. If using local_src only (not local_src2), the result is ~10% faster.
EDIT: Benchmarked on much more powerful hardware: an AMD Radeon HD 7850 gets 420 GFLOPS against a spec of 1751 GFLOPS. To be fair, the spec is for multiply-add, and there is no multiply here, so the expected figure is ~875 GFLOPS, but this is still off by quite a lot compared to the theoretical performance.
EDIT: To ease running tests for anyone who would like to try this out, the host-side program in PyOpenCL below:
import pyopencl as cl
import numpy
import numpy.random
from time import time

CL_SOURCE = '''
// kernel goes here
'''

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()

h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255
mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))

# warmup
for n in range(10):
    event = prg.test_kernel(queue, (w, h), (16, 16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()

# benchmark
t1 = time()
for n in range(100):
    event = prg.test_kernel(queue, (w, h), (16, 16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
t2 = time()
print "Duration (host): ", (t2-t1)/100
print "Duration (event): ", (event.profile.end-event.profile.start)*1e-9
EDIT: Thinking about the memory access patterns, the original naive version may actually be pretty good: when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)), all work items in a work group are reading the same location (so is this just one read?), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they are reading sequential locations (so can the reads be coalesced perfectly?).
This is definitely a memory access problem. Neighbouring work items' pixels can overlap by as much as 15x16, and worse yet, each work item will overlap at least 225 others.
I would use local memory and get work groups to cooperatively process many 16x16 blocks. I like to use a large, square block for each work group. Rectangular blocks are a bit more complicated, but can get better memory utilization for you.
If you read blocks of n by n pixels from your source image, the borders will overlap by n x 15 (or 15 x n). You need to calculate the largest possible value of n based on your available local memory size (LDS). If you are using OpenCL 1.1 or greater, the LDS is at least 32 kB; OpenCL 1.0 promises 16 kB per work group.
n <= sqrt(32kb / sizeof(uint4))
n <= sqrt(32768 / 16)
n ~ 45
Using n = 45 will use 32400 of the 32768 bytes of the LDS and lets you run 900 work items per group ((45-15)^2 = 900). Note: here's where a rectangular block would help out; for example, 64x32 would use all of the LDS, but with group size (64-15)*(32-15) = 833.
Steps to use the LDS for your kernel:
allocate a 1D or 2D local array for your cached block of the image. I use a #define constant, and it rarely has to change.
read the uint values from your image and store them locally.
adjust 'pos' for each work item to refer to the local memory.
execute the same i, j loops you have, but read the values from local memory. Remember that the i and j loops stop 15 short of n.
Each step can be searched online if you are not sure how to implement it, or you can ask me if you need a hand.
Chances are good that the LDS on your device will outperform the texture read speed. This is counter-intuitive, but remember that you are reading tiny amounts of data at a time, so the GPU may not be able to cache the pixels effectively. Using the LDS guarantees that the pixels are available, and given the number of times each pixel is read, I expect this to make a huge difference.
Please let me know what kind of results you observe.
UPDATE: Here's my attempt to better explain my solution. I used graph paper for my drawings, because I'm not all that great with image manipulation software.
Above is a sketch of how the values are read from src in your first code snippet. The big problem is that the pos0 rectangle -- 16x16 uint4 values -- is read in its entirety by every work item in the group (256 of them). My solution involves reading a large area once and sharing the data among all 256 work items.
If you store a 31x31 region of your image in local memory, the data needed by all 256 work items will be available.
Steps:
use work-group dimensions (16, 16)
read the values of src into a large local buffer, i.e. uint4 buff[31][31]; the buffer needs to be translated such that 'pos0' lands at buff[0][0]
barrier(CLK_LOCAL_MEM_FENCE) to wait for the memory copy operations to finish
run the same i, j loops you had originally, except leave out the pos and pos0 offsets and use only i and j for the location; accumulate 'diff' the same way you did originally
write the result to 'dest'
This is the same as my first response to your question, except with n = 16. That value does not utilize the local memory fully, but will probably work well for most platforms; 256 tends to be a common maximum work-group size.
I hope this clears things up for you.
Some suggestions:
Compute more than one output pixel in each work item; it will increase data reuse.
Benchmark different work-group sizes to maximize the usage of texture cache.
Maybe there is a way to separate the kernel into two passes (horizontal and vertical).
Update: more suggestions
Instead of loading everything into local memory, try loading only the local_src values and using read_imageui for the other block.
Since you do almost no computation, you should measure the read speed in GB/s and compare it to the peak memory speed.
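As a rough starting point, a back-of-the-envelope sketch of my own for that estimate, assuming the naive kernel's nominal traffic (two 16x16 uint4 blocks per work item on the 1024x1024 test image):

// nominal bytes read: 1024*1024 work items * 2 blocks * 16*16 texels * 16 bytes
fun effectiveReadBandwidthGBs(kernelSeconds: Double): Double {
    val bytesRead = 1024.0 * 1024.0 * 2.0 * 16 * 16 * 16
    return bytesRead / kernelSeconds / 1e9
}

fun main() {
    // feed in the event-profiled duration printed by the host script
    println(effectiveReadBandwidthGBs(0.05)) // ~171.8 GB/s nominal if one kernel run takes 50 ms
}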