Using CUDA to find the pixel-wise average value of a bunch of images - indexing

So I have a cube of images, 512x512x512. I want to sum up the images pixel-wise and save the result to a final image. So if all the pixels had value 1, the final image would be all 512s. I am having trouble understanding the indexing to do this in CUDA. I figure one thread's job will be to sum up all 512 values at its pixel, so the total thread count will be 512x512. I plan to do it with 512 blocks of 512 threads each. From here, I am having trouble coming up with the indexing for summing over the depth. Any help will be greatly appreciated.

One way to solve this problem is to imagine the cube as a set of Z slices. The X and Y coordinates refer to the width and height of the image, and the Z coordinate to each slice along the depth. Each thread will iterate over the Z coordinate to accumulate the values.
With this in mind, configure the kernel launch with blocks of 16x16 threads and a grid with enough blocks to cover the width and height of the image (I'm assuming a grayscale image with 1 byte per pixel):
#define THREADS 16
// kernel configuration
dim3 dimBlock = dim3 ( THREADS, THREADS, 1 );
dim3 dimGrid = dim3 ( WIDTH / THREADS, HEIGHT / THREADS );
// call the kernel
kernel<<<dimGrid, dimBlock>>>(i_data, o_Data, WIDTH, HEIGHT, DEPTH);
If you are clear on how to index a 2D array, looping through the Z dimension should also be clear:
__global__ void kernel(unsigned char* i_data, unsigned int* o_data, int WIDTH, int HEIGHT, int DEPTH)
{
    // in your kernel, map from threadIdx/blockIdx to the pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    // calculate the global index of the pixel into the image array;
    // this global index points into the first slice of the cube
    int idx = x + y * WIDTH;
    // partial result; the sum of up to 512 byte values does not fit in an
    // unsigned char, so the output image uses a wider type
    int r = 0;
    // iterate in the Z dimension
    for (int z = 0; z < DEPTH; ++z)
    {
        // WIDTH * HEIGHT is the offset of one slice
        int idx_z = z * WIDTH * HEIGHT + idx;
        r += i_data[ idx_z ];
    }
    // o_data is a 2D array, so you can reuse the global index idx
    o_data[ idx ] = r;
}
This is a naive implementation. In order to maximize memory throughput, the data should be properly aligned.
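For completeness, here is a minimal host-side sketch of how the kernel above might be driven; the buffer names (h_cube, h_sum) and the final division by DEPTH to turn the sum into an average are assumptions for illustration, not part of the original answer:
// minimal host-side sketch (assumed variable names, error checking omitted)
size_t slice = (size_t)WIDTH * HEIGHT;
unsigned char *d_in;
unsigned int *d_out;
cudaMalloc(&d_in, slice * DEPTH * sizeof(unsigned char));
cudaMalloc(&d_out, slice * sizeof(unsigned int));
cudaMemcpy(d_in, h_cube, slice * DEPTH * sizeof(unsigned char), cudaMemcpyHostToDevice);
kernel<<<dimGrid, dimBlock>>>(d_in, d_out, WIDTH, HEIGHT, DEPTH);
cudaMemcpy(h_sum, d_out, slice * sizeof(unsigned int), cudaMemcpyDeviceToHost);
// for the pixel-wise average instead of the sum, divide each h_sum value by DEPTH on the host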

This can be done easily using the ArrayFire GPU library (free). In ArrayFire, you can construct 3D arrays like the following:
Two approaches:
// Method 1:
array data = rand(x,y,z);
// Just reshaping the array, this is a noop
data = newdims(data,x*y, z, 1);
// Sum of pixels
res = sum(data);
// Method 2:
// Use ArrayFire "GFOR"
array data = rand(x,y,z);
array res = zeros(z,1);
gfor (array i, z) {
    res(i) = sum(sum(data(span, span, i)));
}
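If what you need is the pixel-wise sum over the Z dimension (a WIDTH x HEIGHT result), a reduction along the third dimension expresses this directly; a minimal sketch, assuming your ArrayFire version provides randu and the dim argument to sum:
// pixel-wise sum across slices (assumes the sum(array, dim) overload)
array data = randu(x, y, z);
array res = sum(data, 2); // reduce along dimension 2 -> x-by-y image of sums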

Related

Is there any way I can find, for every color (RGB) pixel, the matching depth (IR) pixel?

I use Kinect 2.0. I already have the intrinsic parameters of the depth camera and the color camera, and the extrinsic parameters between them.
With those, I already know, for every depth (IR) pixel, which color (RGB) pixel it maps to:
for (int i = 0; i < 424; i++)
{
    for (int j = 0; j < 512; j++)
    {
        fscanf(fp_dp, "%lf", &depthValue);
        if (depthValue == 0) continue;
        double Pir[3][1] = { j*depthValue, i*depthValue, depthValue };
        P_ir = Mat(3, 1, CV_64F, Pir);
        P_rgb = Mat(3, 1, CV_64F);
        P_rgb = Intrinsic_rgb*(R_ir2rgb*(Intrinsic_ir_inv*P_ir) + T_ir2rgb);
        int x = P_rgb.at<double>(0, 0) / depthValue;
        int y = P_rgb.at<double>(1, 0) / depthValue;
        //printf("(%d,%d)\n", x, y);
        if (x < 0 || y < 0 || x >= 1920 || y >= 1080)
        {
            continue;
        }
        img_mmap.at<Vec3b>(i, j)[0] = img_rgb.at<Vec3b>(y, x)[0];
        img_mmap.at<Vec3b>(i, j)[1] = img_rgb.at<Vec3b>(y, x)[1];
        img_mmap.at<Vec3b>(i, j)[2] = img_rgb.at<Vec3b>(y, x)[2];
        Color_depth[y][x] = depthValue;
    }
    fscanf(fp_dp, "\n");
}
fclose(fp_dp);
imwrite(ir_name, img_mmap);
As you can see, I want to get the color image's depth data. But with this method I only get 512x424 data points, not 1920x1080.
So is there any way to find, for every color (RGB) pixel, the matching depth (IR) pixel, given that I already have the intrinsic parameters of the two cameras and the extrinsic parameters between them?
Use MapColorFrameToDepthSpace.
Remark:
Allocate the depthSpacePoints array before calling this method. It should have the same number of elements as the color frame has pixels (1920px by 1080px). Each entry in the filled depthSpacePoints array contains the depth point to which the corresponding pixel belongs.
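A minimal sketch of how that call might look with the Kinect v2 SDK's ICoordinateMapper (variable names such as coordinateMapper and depthBuffer are assumptions; double-check the exact signature against your SDK headers):
// coordinateMapper: ICoordinateMapper* obtained from the IKinectSensor
// depthBuffer: UINT16 array holding the 512 x 424 depth values of the current frame
std::vector<DepthSpacePoint> depthSpacePoints(1920 * 1080);
HRESULT hr = coordinateMapper->MapColorFrameToDepthSpace(
    512 * 424, depthBuffer,
    (UINT)depthSpacePoints.size(), depthSpacePoints.data());
if (SUCCEEDED(hr))
{
    // depthSpacePoints[y * 1920 + x] holds the depth-image coordinates for color pixel (x, y);
    // entries with no valid mapping come back as -infinity
}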

PyCUDA large nonuniform matrix operations

I am working with large, nonuniform matrices and am having problems with what I believe to be a mismatch between elements.
In example.py, get_simulated_ipp() builds echo and tx, two linear arrays of size 250000 and 25000 respectively. The code also hardcodes sr = 25.
My code attempts to complex-multiply tx into echo along different stretches, depending on the specified ranges and the value of sr. The result is then stored in an array S.
After searching through some other people's examples, I found a way of building blocks and grids here that I thought would work well. I'm unfamiliar with C code, but have been trying to learn over the past week. Here is my code:
#!/usr/bin/python
#This iteration only works on the first and last elements, mismatching after that.
# However, this doesn't result in any empty elements in S
import numpy as np
import example as ex
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
#pull simulated data and get info about it
((echo,tx)) = ex.get_simulated_ipp()
ranges = np.arange(4000,6000).astype(np.int32)
S = np.zeros([len(ranges),len(tx)],dtype=np.complex64)
sr = ex.sr
#copying input to gpu
# will try this explicitly if in/out (in the function call) don't work
block_dim_x = 8 #thread number is product of block dims,
block_dim_y = 8 # want a multiple of 32 (warp multiple)
blocks_x = np.ceil(len(ranges)/block_dim_x).astype(np.int32).item()
blocks_y = np.ceil(len(tx)/block_dim_y).astype(np.int32).item()
kernel_code="""
#include <cuComplex.h>
__global__ void complex_mult(cuFloatComplex *tx, cuFloatComplex *echo, cuFloatComplex *result,
                             int *ranges, int sr)
{
    unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
    unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
    unsigned int threads_in_block = blockDim.x * blockDim.y;
    unsigned long int idx = threads_in_block * block_num + thread_num;
    //aligning the i,j to idx, something is mismatched?
    int i = ((idx % (threads_in_block * gridDim.x)) % blockDim.x) +
            ((block_num % gridDim.x) * blockDim.x);
    int j = ((idx - (threads_in_block * block_num)) / blockDim.x) +
            ((block_num / gridDim.x) * blockDim.y);
    result[idx] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);
}
"""
## want something to work like this:
## result[i][j] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);
#includes directory of where cuComplex.h is located
mod = SourceModule(kernel_code, include_dirs=['/usr/local/cuda-7.0/include/'])
complex_mult = mod.get_function("complex_mult")
complex_mult(cuda.In(tx), cuda.In(echo), cuda.Out(S), cuda.In(ranges), np.int32(sr),
block=(block_dim_x,block_dim_y,1),
grid=(blocks_x,blocks_y))
compare = np.zeros_like(S) #built to compare CPU vs GPU calcs
txidx = np.arange(len(tx))
for ri,r in enumerate(ranges):
    compare[ri,:] = echo[txidx+r*sr]*tx
print np.subtract(S, compare)
At the bottom, I've put in a CPU implementation of what I'm attempting to accomplish, along with a subtraction to compare the results. The outcome is that the very first and very last elements come out as 0+0j, but the rest do not. The kernel is attempting to align an i and j to the idx so that I can traverse echo, ranges, and tx more easily.
Is there a better way to implement something like this? Also, why might the result not come out as all 0+0j as I intend?
Edit:
Trying a little example to get a better grasp of how the arrays are indexed with this block/grid configuration, I stumbled upon something very strange. Before trying to index the elements, I just wanted to run a little test multiplication. It seems like my block/grid covers all of the ary_in matrix, but the result only doubles the top half of ary_in, while the bottom half returns whatever was left over from the previous calculation.
If I change blocks_x to 4 so that I cover more space than needed, however, the doubling works fine. If I then run it with a 4x4 grid, with * 3 instead, it works out fine with ary_out as ary_in tripled. When I run it again with a 2x4 grid and only doubling, the top half of ary_out returns the doubled ary_in, but the bottom half returns the previous result in memory, a tripled value instead. I suspect this is my index/block/grid mapping onto the values going wrong somewhere, but I can't figure out where.
import numpy as np
import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule

ary_in = np.arange(128).reshape((8,16))
print ary_in
ary_out = np.zeros_like(ary_in)
block_dim_x = 4
block_dim_y = 4
blocks_x = 2
blocks_y = 4
limit = block_dim_x * block_dim_y * blocks_x * blocks_y
mod = SourceModule("""
__global__ void indexing_order(int *ary_in, int *ary_out, int n)
{
    unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
    unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
    unsigned int threads_in_block = blockDim.x * blockDim.y;
    unsigned int idx = threads_in_block * block_num + thread_num;
    if (idx < n) {
        // ary_out[idx] = thread_num;
        ary_out[idx] = ary_in[idx] * 2;
    }
}
""")
indexing_order = mod.get_function("indexing_order")
indexing_order(drv.In(ary_in), drv.Out(ary_out), np.int32(limit),
block=(block_dim_x,block_dim_y,1),
grid=(blocks_x,blocks_y))
print ary_out
FINAL EDIT:
I figured out the problems. In the edit just above, ary_in defaults to int64, which mismatches the int (int32) declaration in the C code. This meant only half the amount of data needed for the entire array was allocated on the GPU, so only the top half was moved over and operated on. Adding .astype(np.int32) solved the problem.
This allowed me to figure out the ordering of the indexing in my case and fix the main code with:
int i = idx / row_len;
int j = idx % row_len;
I still don't understand how to get this working when the block dimensions don't divide evenly into the output array (e.g. 16x16), even with an if (idx < n) guard.
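For reference, a minimal sketch of what the corrected kernel might look like with this mapping; the extra row_len (here len(tx)) and n (total number of output elements) parameters are additions for illustration and would need to be passed as np.int32 from the host:
__global__ void complex_mult(cuFloatComplex *tx, cuFloatComplex *echo, cuFloatComplex *result,
                             int *ranges, int sr, int row_len, int n)
{
    unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
    unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
    unsigned int idx = blockDim.x * blockDim.y * block_num + thread_num;
    if (idx < n)                      // guard threads beyond the output array
    {
        int i = idx / row_len;        // row -> index into ranges
        int j = idx % row_len;        // column -> index into tx
        result[idx] = cuCmulf(echo[j + ranges[i] * sr], tx[j]);
    }
}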

Faster way to structure operations on offset neighborhoods in OpenCL

How can an operation on many overlapping but offset blocks of a 2D array be structured for more efficient execution in OpenCL?
For example, I have the following OpenCL kernel:
__kernel void test_kernel(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i=0; i<16; i++)
    {
        for (int j=0; j<16; j++)
        {
            diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
                    read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
        }
    }
    write_imageui(dest, pos, diff);
}
It produces correct results, but is slow... only ~25 GFLOPS on an NVS4200M with 1k by 1k input (the hardware spec is 155 GFLOPS). I'm guessing this has to do with the memory access patterns. Each work item reads one 16x16 block of data, which is the same as for all its neighbors in a 16x16 area, and also another offset block of data that mostly overlaps with that of its immediate neighbors. All reads are through samplers. The host program is PyOpenCL (I don't think that actually changes anything) and the work-group size is 16x16.
EDIT: New version of the kernel per the suggestion below, copying the work area to local memory:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int dx = pos.x % 16;
    int dy = pos.y % 16;
    __local uint4 local_src[16*16];
    __local uint4 local_src2[32*32];
    local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
    local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
    local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
    local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
    local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
    barrier(CLK_LOCAL_MEM_FENCE);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i=0; i<16; i++)
    {
        for (int j=0; j<16; j++)
        {
            diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
        }
    }
    write_imageui(dest, pos, diff);
}
Result: output is correct, running time is 56% slower. If using local_src only (not local_src2), the result is ~10% faster.
EDIT: Benchmarked on much more powerful hardware: an AMD Radeon HD 7850 gets 420 GFLOPS against a spec of 1751 GFLOPS. To be fair, the spec is for multiply-add, and there is no multiply here, so the expectation is ~875 GFLOPS, but this is still off by quite a lot compared to the theoretical performance.
EDIT: To make it easier for anyone to run tests, here is the host-side program in PyOpenCL:
import pyopencl as cl
import numpy
import numpy.random
from time import time
CL_SOURCE = '''
// kernel goes here
'''
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()
h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255
mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))
# warmup
for n in range(10):
    event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
    event.wait()
# benchmark
t1 = time()
for n in range(100):
    event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
    event.wait()
t2 = time()
print "Duration (host): ", (t2-t1)/100
print "Duration (event): ", (event.profile.end-event.profile.start)*1e-9
EDIT: Thinking about the memory access patterns, the original naive version may be pretty good; when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) all work-items in a work group are reading the same location (so this is just one read??), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they are reading sequential locations (so the reads can be coalesced perfectly??).
This is definitely a memory access problem. Neighbouring work items' pixels can overlap by as much as 15x16, and worse yet, each work item will overlap at least 225 others.
I would use local memory and get work groups to cooperatively process many 16x16 blocks. I like to use a large, square block for each work group. Rectangular blocks are a bit more complicated, but can get better memory utilization for you.
If you read blocks of n by n pixels from your source image, the borders will overlap by n x 15 (or 15 x n). You need to calculate the largest possible value for n based on your available local memory size (LDS). If you are using OpenCL 1.1 or greater, the LDS is at least 32 kB; OpenCL 1.0 promises 16 kB per work group.
n <= sqrt(32kb / sizeof(uint4))
n <= sqrt(32768 / 16)
n ~ 45
Using n=45 will use 32400 out of 32768 bytes of the LDS, and let you use 900 work items per group (45-15)^2 = 900. Note: Here's where a rectangular block would help out; for example 64x32 would use all of the LDS, but with group size = (64-15)*(32-15) = 833.
steps to use LDS for your kernel:
allocate a 1D or 2D local array for your cached block of the image. I use a #define constant, and it rarely has to change.
read the uint values from your image, and store locally.
adjust 'pos' for each work item to relate to the local memory
execute the same i,j loops you have, but using the local memory to read values. remember that the i and j loops stop 15 short of n.
Each step can be searched online if you are not sure how to implement it, or you can ask me if you need a hand.
Chances are good that the LDS on your device will outperform the texture read speed. This is counter-intuitive, but remember that you are reading tiny amounts of data at a time, so the gpu may not be able to cache the pixels effectively. The LDS usage will guarantee that the pixels are available, and given the number of times each pixel is read, I expect this to make a huge difference.
Please let me know what kind of results you observe.
UPDATE: Here's my attempt to better explain my solution. I used graph paper for my drawings, because I'm not all that great with image manipulation software.
Above is a sketch of how the values were read from src in your first code snippet. The big problem is that the pos0 rectangle -- 16x16 uint4 values -- is being read in its entirety for each work item in the group (256 of them). My solution involves reading a large area once and sharing the data among all 256 work items.
If you store a 31x31 region of your image in local memory, all 256 work items' data will be available.
steps:
use work group dimensions: (16,16)
read the values of src into a large local buffer, i.e. uint4 buff[31][31]; the buffer needs to be translated such that 'pos0' is at buff[0][0]
barrier(CLK_LOCAL_MEM_FENCE) to wait for memory copy operations
do the same i,j for loops you had originally, except leave out the pos and pos0 values and use only i and j for the location; accumulate 'diff' in the same way you were doing originally
write the solution to 'dest'
This is the same as my first response to your question, except I use n=16. This value does not utilize the local memory fully, but will probably work well for most platforms. 256 tends to be a common maximum work group size.
I hope this clears things up for you.
Some suggestions:
Compute more than 1 output pixel in each work item. It will increase data reuse.
Benchmark different work-group sizes to maximize the usage of texture cache.
Maybe there is a way to separate the kernel into two passes (horizontal and vertical).
Update: more suggestions
Instead of loading everything in local memory, try loading only the local_src values, and use read_image for the other one.
Since you do almost no computations, you should measure read speed in GB/s, and compare to the peak memory speed.

OpenCL Memory Optimization - Nearest Neighbour

I'm writing a program in OpenCL that receives two arrays of points, and calculates the nearest neighbour for each point.
I have two programs for this. One of them will calculate distance for 4 dimensions, and one for 6 dimensions. They are below:
4 dimensions:
kernel void BruteForce(
    global read_only float4* m,
    global float4* y,
    global write_only ushort* i,
    read_only uint mx)
{
    int index = get_global_id(0);
    float4 curY = y[index];
    float minDist = MAXFLOAT;
    ushort minIdx = -1;
    int x = 0;
    int mmx = mx;
    for(x = 0; x < mmx; x++)
    {
        float dist = fast_distance(curY, m[x]);
        if (dist < minDist)
        {
            minDist = dist;
            minIdx = x;
        }
    }
    i[index] = minIdx;
    y[index] = minDist;
}
6 dimensions:
kernel void BruteForce(
    global read_only float8* m,
    global float8* y,
    global write_only ushort* i,
    read_only uint mx)
{
    int index = get_global_id(0);
    float8 curY = y[index];
    float minDist = MAXFLOAT;
    ushort minIdx = -1;
    int x = 0;
    int mmx = mx;
    for(x = 0; x < mmx; x++)
    {
        float8 mval = m[x];
        float d0 = mval.s0 - curY.s0;
        float d1 = mval.s1 - curY.s1;
        float d2 = mval.s2 - curY.s2;
        float d3 = mval.s3 - curY.s3;
        float d4 = mval.s4 - curY.s4;
        float d5 = mval.s5 - curY.s5;
        float dist = sqrt(d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3 + d4 * d4 + d5 * d5);
        if (dist < minDist)
        {
            minDist = dist;
            minIdx = x;
        }
    }
    i[index] = minIdx;
    y[index] = minDist;
}
I'm looking for ways to optimize this program for GPGPU. I've read some articles (including http://www.macresearch.org/opencl_episode6, which comes with source code) about GPGPU optimization using local memory. I've tried applying it and came up with this code:
kernel void BruteForce(
    global read_only float4* m,
    global float4* y,
    global write_only ushort* i,
    __local float4 * shared)
{
    int index = get_global_id(0);
    int lsize = get_local_size(0);
    int lid = get_local_id(0);
    float4 curY = y[index];
    float minDist = MAXFLOAT;
    ushort minIdx = 64000;
    int x = 0;
    for(x = 0; x < {0}; x += lsize)
    {
        if((x+lsize) > {0})
            lsize = {0} - x;
        if ( (x + lid) < {0})
        {
            shared[lid] = m[x + lid];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int x1 = 0; x1 < lsize; x1++)
        {
            float dist = distance(curY, shared[x1]);
            if (dist < minDist)
            {
                minDist = dist;
                minIdx = x + x1;
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    i[index] = minIdx;
    y[index] = minDist;
}
I'm getting garbage results for my 'i' output (e.g. many values that are the same). Can anyone point me in the right direction? I'd appreciate any answer that helps me improve this code, or find the problem with the optimized version above.
Thank you very much
Cauê
One way to get a big speed up here is to use local data structures and compute entire blocks of data at a time. You should also only need a single read/write global vector (float4). The same idea can be applied to the 6d version using smaller blocks. Each work group is able to work freely through the block of data it is crunching. I will leave the exact implementation to you because you will know the specifics of your application.
some pseudo-ish-code (4d):
computeBlockSize is the size of the blocks to read from global memory and crunch. This value should be a multiple of your work group size. I like 64 as a WG size; it tends to perform well on most platforms. We will be allocating 2 * float4 * computeBlockSize + uint * computeBlockSize of shared memory (max value for OpenCL 1.0 ~448, for OpenCL 1.1 ~896).
#define computeBlockSize 256
__local float4 blockA[computeBlockSize];
__local float4 blockB[computeBlockSize];
__local uint blockAnearestIndex[computeBlockSize];
Now blockA gets computed against all blockB combinations; this is the job of a single work group.
*Important*: only blockA ever gets written to. blockB is stored in local memory, but never changed or copied back to global memory.
steps:
load blockA into local memory with async_work_group_copy
    blockA is located at get_group_id(0) * computeBlockSize in the global vector
optional: set all blockA 'w' values to MAXFLOAT
optional: load blockAnearestIndex into local memory with async_work_group_copy if needed
we need to compute blockA against itself first, then go into the blockB's
be careful to only write to blockA[j], NOT blockA[k]; j is exclusive to this work item
for(j=get_local_id(0); j<computeBlockSize; j+=get_local_size(0))
    for(k=0; k<computeBlockSize; k++)
        if(j==k) continue; //no self-comparison
        calculate distance of blockA[j] vs blockA[k]
        store min distance in blockA[j].w
        store global index (= get_group_id(0)*computeBlockSize + k) of nearest in blockAnearestIndex[j]
barrier(local_mem_fence)
for (i=0; i<get_num_groups(0); i++)
    if (i==get_group_id(0)) continue;
    load blockB into local memory: async_work_group_copy(...)
    for(j=get_local_id(0); j<computeBlockSize; j+=get_local_size(0))
        for(k=0; k<computeBlockSize; k++)
            calculate distance of blockA[j] vs blockB[k]
            store min distance in blockA[j].w
            store global index (= i*computeBlockSize + k) of nearest in blockAnearestIndex[j]
    barrier(local_mem_fence)
write blockA and blockAnearestIndex to global memory using two async_work_group_copy
There should be no problem in reading a blockB while another work group writes the same block (as its own blockA), because only the W values may have changed. If there happens to be trouble with this -- or if you do require two different vectors of points, you could use two global vectors like you have above, one with the A's (writeable) and the other with the B's (read only).
This algorithm works best when your global data size is a multiple of computeBlockSize. To handle the edges, two solutions come to mind. I recommend writing a second kernel for the non-square edge blocks that works in a similar manner as above. The new kernel can execute after the first, and you save a second PCI-e transfer. Alternatively, you can use a distance of -1 to signify a skip in the comparison of two elements (i.e. if either blockA[j].w == -1 or blockB[k].w == -1, continue). This second solution would result in a lot more branching in your kernel though, which is why I recommend writing a new kernel. A very small percentage of your data points will actually fall in an edge block.

Resizing N squares to be as big as possible while still fitting into a box of X by Y dimensions (thumbnails!)

I have N squares.
I have a Rectangular box.
I want all the squares to fit in the box.
I want the squares to be as large as possible.
How do I calculate the largest size for the squares such that they all fit in the box?
This is for thumbnails in a thumbnail gallery.
int function thumbnailSize(
iItems, // The number of items to fit.
iWidth, // The width of the container.
iHeight, // The height of the container.
iMin // The smallest an item can be.
)
{
// if there are no items we don't care how big they are!
if (iItems = 0) return 0;
// Max size is whichever dimension is smaller, height or width.
iDimension = (iWidth min iHeight);
// Add .49 so that we always round up, even if the square root
// is something like 1.2. If the square root is whole (1, 4, etc..)
// then it won't round up.
iSquare = (round(sqrt(iItems) + 0.49));
// If we arrange our items in a square pattern we have the same
// number of rows and columns, so we can just divide by the number
// iSquare, because iSquare = iRows = iColumns.
iSize = (iDimension / iSquare);
// Don't use a size smaller than the minimum.
iSize = (iSize max iMin);
return iSize;
}
This code currently works OK. The idea behind it is to take the smallest dimension of the rectangular container, pretend the container is a square of that dimension, and then assume we have an equal number of rows and columns, just enough to fit iItems squares inside.
This function works great if the container is mostly squarish. If you have a long rectangle, though, the thumbnails come out smaller than they could be. For instance, if my rectangle is 100 x 300, and I have three thumbnails, it should return 100, but instead returns 33.
Probably not optimal (if it works; I haven't tried it), but I think better than your current approach:
w: width of rectangle
h: height of rectangle
n: number of images
a = w*h : area of the rectangle.
ia = a/n : max area of an image in the ideal case.
il = sqrt(ia) : max side length of an image in the ideal case.
nw = round_up(w/il) : number of images side by side across the width.
nh = round_up(h/il) : number of images stacked on top of each other down the height.
l = min(w/nw, h/nh) : side length of the images to use.
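Here is a minimal sketch of that heuristic in C++ (the function and variable names are mine, not from the answer above):
#include <cmath>
#include <algorithm>

// Heuristic square side for n thumbnails in a w-by-h box, following the steps above.
double thumbnailSide(double w, double h, int n)
{
    double ia = (w * h) / n;            // ideal area per image
    double il = std::sqrt(ia);          // ideal side length
    double nw = std::ceil(w / il);      // columns needed at that side length
    double nh = std::ceil(h / il);      // rows needed at that side length
    return std::min(w / nw, h / nh);    // shrink so both directions still fit
}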
The solution on https://math.stackexchange.com/a/466248 works perfectly.
An unoptimized JavaScript implementation:
var getMaxSizeOfSquaresInRect = function(n,w,h)
{
var sw, sh;
var pw = Math.ceil(Math.sqrt(n*w/h));
if (Math.floor(pw*h/w)*pw < n) sw = h/Math.ceil(pw*h/w);
else sw = w/pw;
var ph = Math.ceil(Math.sqrt(n*h/w));
if (Math.floor(ph*w/h)*ph < n) sh = w/Math.ceil(w*ph/h);
else sh = h/ph;
return Math.max(sw,sh);
}
I was looking for a similar solution, but instead of squares I had to fit rectangles in the container. Since a square is also a rectangle, my solution also answers this question.
I combined the answers from Neptilo and mckeed into the fitToContainer() function. Give it the number of rectangles to fit n, the containerWidth and containerHeight and the original itemWidth and itemHeight. In case items have no original width and height, use itemWidth and itemHeight to specify the desired ratio of the items.
For example, fitToContainer(10, 1920, 1080, 16, 9) results in {nrows: 4, ncols: 3, itemWidth: 480, itemHeight: 270}, so three columns and four rows of 480 x 270 (pixels, or whatever the unit is).
And to fit 10 squares in the same example area of 1920x1080 you could call fitToContainer(10, 1920, 1080, 1, 1) resulting in {nrows: 2, ncols: 5, itemWidth: 384, itemHeight: 384}.
function fitToContainer(n, containerWidth, containerHeight, itemWidth, itemHeight) {
    // We're not necessarily dealing with squares but rectangles (itemWidth x itemHeight),
    // temporarily compensate the containerWidth to handle as rectangles
    containerWidth = containerWidth * itemHeight / itemWidth;
    // Compute number of rows and columns, and cell size
    var ratio = containerWidth / containerHeight;
    var ncols_float = Math.sqrt(n * ratio);
    var nrows_float = n / ncols_float;
    // Find best option filling the whole height
    var nrows1 = Math.ceil(nrows_float);
    var ncols1 = Math.ceil(n / nrows1);
    while (nrows1 * ratio < ncols1) {
        nrows1++;
        ncols1 = Math.ceil(n / nrows1);
    }
    var cell_size1 = containerHeight / nrows1;
    // Find best option filling the whole width
    var ncols2 = Math.ceil(ncols_float);
    var nrows2 = Math.ceil(n / ncols2);
    while (ncols2 < nrows2 * ratio) {
        ncols2++;
        nrows2 = Math.ceil(n / ncols2);
    }
    var cell_size2 = containerWidth / ncols2;
    // Find the best values
    var nrows, ncols, cell_size;
    if (cell_size1 < cell_size2) {
        nrows = nrows2;
        ncols = ncols2;
        cell_size = cell_size2;
    } else {
        nrows = nrows1;
        ncols = ncols1;
        cell_size = cell_size1;
    }
    // Undo compensation on width, to make squares into desired ratio
    itemWidth = cell_size * itemWidth / itemHeight;
    itemHeight = cell_size;
    return { nrows: nrows, ncols: ncols, itemWidth: itemWidth, itemHeight: itemHeight }
}
The JavaScript implementation from mckeed gave me better results than the other answers I found. The idea to first stretch the rectangle into a square came from Neptilo.
You want something more like:
n = number of thumbnails
x = one side of a rect
y = the other side
l = length of a side of a thumbnail
l = sqrt( (x * y) / n )
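For example, for the asker's 100 x 300 box with three thumbnails this gives l = sqrt((100 * 300) / 3) = 100.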
Here is my final code based off of unknown (google)'s reply:
For the guy who wanted to know what language my first post is in, this is VisualDataflex:
Function ResizeThumbnails Integer iItems Integer iWidth Integer iHeight Integer iMinSize Returns Integer
Integer iArea iIdealArea iIdealSize iRows iCols iSize
// If there are no items we don't care how big the thumbnails are!
If (iItems = 0) Procedure_Return
// Area of the container.
Move (iWidth * iHeight) to iArea
// Max area of an image in the ideal case (1 image).
Move (iArea / iItems) to iIdealArea
// Max size of an image in the ideal case.
Move (sqrt(iIdealArea)) to iIdealSize
// Number of rows.
Move (round((iHeight / iIdealSize) + 0.50)) to iRows
// Number of cols.
Move (round((iWidth / iIdealSize) + 0.50)) to iCols
// Optimal size of an image.
Move ((iWidth / iCols) min (iHeight / iRows)) to iSize
// Check to make sure it is at least the minimum.
Move (iSize max iMinSize) to iSize
// Return the size
Function_Return iSize
End_Function
This should work. It is solved with an algorithm rather than an equation. The algorithm is as follows:
Span the entire short side of the rectangle with all of the squares
Decrease the number of squares in this span (as a result, increasing the size) until the depth of the squares exceeds the long side of the rectangle
Stop when the span reaches 1, because this is as good as we can get.
Here is the code, written in JavaScript:
function thumbnailSize(items, width, height, min) {
    var minSide = Math.min(width, height),
        maxSide = Math.max(width, height);
    // let's start by spanning the short side of the rectangle
    // size: the size of the squares
    // span: the number of squares spanning the short side of the rectangle
    // stack: the number of rows of squares filling the rectangle
    // depth: the total depth of stack of squares
    var size = 0;
    for (var span = items; span > 0; span--) {
        var newSize = minSide / span;
        var stack = Math.ceil(items / span);
        var depth = stack * newSize;
        if (depth <= maxSide)
            size = newSize;
        else
            break;
    }
    return Math.max(size, min);
}
In Objective C ... the length of a square side for the given count of items in a containing rectangle.
int count = 8;          // number of items in containing rectangle
int width = 90;         // width of containing rectangle
int height = 50;        // height of containing rectangle
float sideLength = 0;   // side length to use
float containerArea = width * height;
float maxArea = containerArea / count;
float idealSideLength = sqrtf(maxArea);
float rows = ceilf(height / idealSideLength);    // round up
float columns = ceilf(width / idealSideLength);  // round up
float minSideLength = MIN((width / columns), (height / rows));
float maxSideLength = MAX((width / columns), (height / rows));
// Use max side length unless this causes overlap
if (((rows * maxSideLength) > height) && (((rows-1) * columns) < count) ||
    (((columns * maxSideLength) > width) && (((columns-1) * rows) < count))) {
    sideLength = minSideLength;
}
else {
    sideLength = maxSideLength;
}
My JavaScript implementation:
var a = Math.floor(Math.sqrt(w * h / n));
return Math.floor(Math.min(w / Math.ceil(w / a), h / Math.ceil(h / a)));
Where w is width of the rectangle, h is height, and n is number of squares you want to squeeze in.
double _getMaxSizeOfSquaresInRect(double numberOfItems, double parentWidth, double parentHeight) {
  double sw, sh;
  var pw = (sqrt(numberOfItems * parentWidth / parentHeight)).ceil();
  if ((pw * parentHeight / parentWidth).floor() * pw < numberOfItems) {
    sw = parentHeight / (pw * parentHeight / parentWidth).ceil();
  } else {
    sw = parentWidth / pw;
  }
  var ph = (sqrt(numberOfItems * parentHeight / parentWidth)).ceil();
  if ((ph * parentWidth / parentHeight).floor() * ph < numberOfItems) {
    sh = parentWidth / (parentWidth * ph / parentHeight).ceil();
  } else {
    sh = parentHeight / ph;
  }
  return max(sw, sh);
}
}