How to speed up Kinect Face Tracking at full depth resolution

I am working on an application with Kinect and I need the full depth resolution. Face tracking works fast (30 fps) at 320x240 depth resolution, but when I switch to 640x480 I get a third of the speed (~10 fps). I have tried extracting a 320x240 depth image for face tracking (every second column and row), but no face was detected.
depthImageSmall = new short[this.depthImage.Length / 4];
for (int y = 0; y < 240; y++)
    for (int x = 0; x < 320; x++)
        // take every second row and every second column of the 640x480 frame
        depthImageSmall[y * 320 + x] = this.depthImage[(y * 2) * 640 + (x * 2)];
How can I speed up face tracking?
Thank you

Related

Get centroid of point cloud data based on color using kinect v2 in ROS

I want to get the centroid of point cloud data based on color using Kinect v2. Even after searching for a long time I was not able to find a package that can do this task, but since this is a general problem I think an existing package should exist.
Please help. Thanks in advance!
If you are using PCL you can do
pcl::PointXYZRGB centroid;
pcl::computeCentroid(*cloud, centroid);
Otherwise it is just the average of the points. For example:
pcl::PointXYZI centroid;
float x = 0, y = 0, z = 0;
for (int k = 0; k < cloud->size(); k++)
{
    x += cloud->at(k).x;
    y += cloud->at(k).y;
    z += cloud->at(k).z;
}
centroid.x = x / (cloud->size() + 0.0);
centroid.y = y / (cloud->size() + 0.0);
centroid.z = z / (cloud->size() + 0.0);
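The "based on color" part of the question is not covered above; a common approach is to keep only the points whose color is near a target RGB value and average those. A rough sketch, assuming a pcl::PointCloud<pcl::PointXYZRGB> already filled from the Kinect v2 driver (the colorCentroid name and the tolerance parameter are made up for illustration):
#include <cstdint>
#include <cstdlib>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/common/centroid.h>

// Average only the points whose color lies within `tolerance` of (r, g, b).
pcl::PointXYZRGB
colorCentroid(const pcl::PointCloud<pcl::PointXYZRGB> &cloud,
              uint8_t r, uint8_t g, uint8_t b, int tolerance)
{
    pcl::PointCloud<pcl::PointXYZRGB> selected;
    for (const auto &p : cloud.points)
    {
        if (std::abs(int(p.r) - int(r)) <= tolerance &&
            std::abs(int(p.g) - int(g)) <= tolerance &&
            std::abs(int(p.b) - int(b)) <= tolerance)
            selected.push_back(p);   // keep points matching the target color
    }
    pcl::PointXYZRGB centroid;
    if (!selected.empty())
        pcl::computeCentroid(selected, centroid);   // average of the kept points
    return centroid;
}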

Is there any way I can get, for every color (RGB) image pixel, the matching depth (IR) image pixel?

I use a Kinect 2.0. I already have the intrinsic parameters of the depth camera and the color camera, and the extrinsic parameters between them.
With the code below, I can already map every depth (IR) pixel to its matching color (RGB) pixel:
for (int i = 0; i < 424; i++)
{
    for (int j = 0; j < 512; j++)
    {
        fscanf(fp_dp, "%lf", &depthValue);
        if (depthValue == 0) continue;
        double Pir[3][1] = { j*depthValue, i*depthValue, depthValue };
        P_ir = Mat(3, 1, CV_64F, Pir);
        P_rgb = Mat(3, 1, CV_64F);
        P_rgb = Intrinsic_rgb*(R_ir2rgb*(Intrinsic_ir_inv*P_ir) + T_ir2rgb);
        int x = P_rgb.at<double>(0, 0) / depthValue;
        int y = P_rgb.at<double>(1, 0) / depthValue;
        //printf("(%d,%d)\n", x, y);
        if (x < 0 || y < 0 || x >= 1920 || y >= 1080)
        {
            continue;
        }
        img_mmap.at<Vec3b>(i, j)[0] = img_rgb.at<Vec3b>(y, x)[0];
        img_mmap.at<Vec3b>(i, j)[1] = img_rgb.at<Vec3b>(y, x)[1];
        img_mmap.at<Vec3b>(i, j)[2] = img_rgb.at<Vec3b>(y, x)[2];
        Color_depth[y][x] = depthValue;
    }
    fscanf(fp_dp, "\n");
}
fclose(fp_dp);
imwrite(ir_name, img_mmap);
As you can see, I want to get the color image's depth data, but with this method I only get 512x424 values, not 1920x1080.
So is there any way I can find, for every color (RGB) pixel, the matching depth (IR) pixel, given that I already have the intrinsic parameters of the two cameras and the extrinsic parameters between them?
Use MapColorFrameToDepthSpace.
Remark:
Allocate the depthSpacePoints array before calling this method. It should have the same number of elements as the color frame has pixels (1920px by 1080px). Each entry in the filled depthSpacePoints array contains the depth point to which the corresponding pixel belongs.
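A rough sketch of that reverse lookup with the C++ Kinect v2 API (assuming an ICoordinateMapper* named m_pCoordinateMapper and a 512x424 UINT16 depth buffer named depthBuffer are already available; both names are placeholders and error handling is abbreviated):
#include <vector>
#include <Kinect.h>
#include <opencv2/opencv.hpp>

const int depthWidth = 512, depthHeight = 424;
const int colorWidth = 1920, colorHeight = 1080;

// One DepthSpacePoint per color pixel, as the remark above requires.
std::vector<DepthSpacePoint> depthSpacePoints(colorWidth * colorHeight);
HRESULT hr = m_pCoordinateMapper->MapColorFrameToDepthSpace(
    depthWidth * depthHeight, depthBuffer,
    (UINT)depthSpacePoints.size(), depthSpacePoints.data());

if (SUCCEEDED(hr))
{
    cv::Mat colorDepth(colorHeight, colorWidth, CV_16U, cv::Scalar(0));
    for (int y = 0; y < colorHeight; y++)
    {
        for (int x = 0; x < colorWidth; x++)
        {
            DepthSpacePoint p = depthSpacePoints[y * colorWidth + x];
            // Color pixels with no depth correspondence come back as -infinity.
            if (p.X >= 0 && p.Y >= 0)
            {
                int dx = (int)(p.X + 0.5f), dy = (int)(p.Y + 0.5f);
                if (dx < depthWidth && dy < depthHeight)
                    colorDepth.at<UINT16>(y, x) = depthBuffer[dy * depthWidth + dx];
            }
        }
    }
}
This yields a 1920x1080 depth map aligned to the color image, which is what the question asks for.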

C/Obj-C noise generators always return 0 after the first run?

Unfortunately the simplex/Perlin noise generator I've always used is very bloated and Java-based, and would be a pain to port to C/Obj-C. I'm looking for better classes to use in an iOS version of a game, but I have an odd problem.
I have code that loops through each "tile" of a 2D background - it should calculate a noise value for each tile. In my Java implementations it works fine.
However, each time I run the code, it appears to print a proper value the first time the breakpoint is hit, but from then on only ever returns zero:
for (double x = 0; x < 2; x++){
    for (double y = 0; y < 2; y++){
        double tileNoise = PerlinNoise2D(x,y,2,2,1);
    }
}
I've tried two different implementations, the current one being this C Perlin library.
The breakpoint shows a value like 1.88858049852505e-308 the first time, but when I continue execution all subsequent breaks show "0".
What am I missing?
Perlin noise is defined to be zero at integer locations. Try rotating, scaling, or translating your space and see what happens. For example:
double u = 0.1;
double v = 0.1;
for (double x = 0; x < 2; x++){
    for (double y = 0; y < 2; y++){
        double tileNoise = PerlinNoise2D(x+u,y+v,2,2,1);
    }
}
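Scaling works the same way; a small sketch that samples at fractional noise coordinates instead of the raw tile indices (the 1/16 scale and 0.5 offset are arbitrary illustrative values):
// Sample noise at fractional lattice positions derived from integer tile
// indices, so the loop never lands exactly on the zero-valued integer grid.
const double scale = 1.0 / 16.0;   // assumed tiles-per-noise-period
for (int tx = 0; tx < 2; tx++){
    for (int ty = 0; ty < 2; ty++){
        double tileNoise = PerlinNoise2D((tx + 0.5) * scale, (ty + 0.5) * scale, 2, 2, 1);
        // use tileNoise for tile (tx, ty) ...
    }
}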

Faster way to structure operations on offset neighborhoods in OpenCL

How can an operation on many overlapping but offset blocks of a 2D array be structured for more efficient execution in OpenCL?
For example, I have the following OpenCL kernel:
__kernel void test_kernel(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i=0; i<16; i++)
    {
        for (int j=0; j<16; j++)
        {
            diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
                    read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
        }
    }
    write_imageui(dest, pos, diff);
}
It produces correct results, but it is slow... only ~25 GFLOPS on an NVS4200M with 1k by 1k input (the hardware spec is 155 GFLOPS). I'm guessing this has to do with the memory access patterns. Each work item reads one 16x16 block of data which is the same for all its neighbors in a 16x16 area, and also reads another offset block of data that mostly overlaps with those of its immediate neighbors. All reads are through samplers. The host program is PyOpenCL (I don't think that actually changes anything) and the work-group size is 16x16.
EDIT: New version of kernel per suggestion below, copy work area to local variables:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int dx = pos.x % 16;
    int dy = pos.y % 16;
    __local uint4 local_src[16*16];
    __local uint4 local_src2[32*32];
    local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
    local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
    local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
    local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
    local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
    barrier(CLK_LOCAL_MEM_FENCE);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i=0; i<16; i++)
    {
        for (int j=0; j<16; j++)
        {
            diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
        }
    }
    write_imageui(dest, pos, diff);
}
Result: output is correct, running time is 56% slower. If using local_src only (not local_src2), the result is ~10% faster.
EDIT: Benchmarked on much more powerful hardware, AMD Radeon HD 7850 gets 420GFLOPS, spec is 1751GFLOPS. To be fair the spec is for multiply-add, and there is no multiply here so the expected is ~875GFLOPS, but this is still off by quite a lot compared to the theoretical performance.
EDIT: To make it easy for anyone who would like to try this out, here is the host-side program in PyOpenCL:
import pyopencl as cl
import numpy
import numpy.random
from time import time
CL_SOURCE = '''
// kernel goes here
'''
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()
h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255
mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))
# warmup
for n in range(10):
    event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
    event.wait()
# benchmark
t1 = time()
for n in range(100):
    event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
    event.wait()
t2 = time()
print "Duration (host): ", (t2-t1)/100
print "Duration (event): ", (event.profile.end-event.profile.start)*1e-9
EDIT: Thinking about the memory access patterns, the original naive version may be pretty good; when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) all work-items in a work group are reading the same location (so this is just one read??), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they are reading sequential locations (so the reads can be coalesced perfectly??).
This is definitely a memory access problem. Neighbouring work items' pixels can overlap by as much as 15x16, and worse yet, each work item will overlap at least 225 others.
I would use local memory and get work groups to cooperatively process many 16x16 blocks. I like to use a large, square block for each work group. Rectangular blocks are a bit more complicated, but can get better memory utilization for you.
If you read blocks of n by n pixels from your source image, the borders will overlap by n x 15 (or 15 x n). You need to calculate the largest possible value for n based on your available local memory size (LDS). If you are using OpenCL 1.1 or greater, the LDS is at least 32 KB; OpenCL 1.0 promises 16 KB per work group.
n <= sqrt(32kb / sizeof(uint4))
n <= sqrt(32768 / 16)
n ~ 45
Using n=45 will use 32400 out of 32768 bytes of the LDS, and let you use 900 work items per group (45-15)^2 = 900. Note: Here's where a rectangular block would help out; for example 64x32 would use all of the LDS, but with group size = (64-15)*(32-15) = 833.
steps to use LDS for your kernel:
allocate a 1D or 2D local array for your cached block of the image. I use a #define constant, and it rarely has to change.
read the uint values from your image, and store locally.
adjust 'pos' for each work item to relate to the local memory
execute the same i,j loops you have, but using the local memory to read values. remember that the i and j loops stop 15 short of n.
Each step can be searched online if you are not sure how to implement it, or you can ask me if you need a hand.
Chances are good that the LDS on your device will outperform the texture read speed. This is counter-intuitive, but remember that you are reading tiny amounts of data at a time, so the gpu may not be able to cache the pixels effectively. The LDS usage will guarantee that the pixels are available, and given the number of times each pixel is read, I expect this to make a huge difference.
Please let me know what kind of results you observe.
UPDATE: Here's my attempt to better explain my solution. I used graph paper for my drawings, because I'm not all that great with image manipulation software.
Above is a sketch of how the values were read from src in your first code snippet. The big problem is that the pos0 rectangle -- 16x16 uint4 values -- is read in its entirety by every work item in the group (256 of them). My solution involves reading a larger area once and sharing the data among all 256 work items.
If you store a 31x31 region of your image in local memory, all 256 work items' data will be available.
steps:
use work group dimensions: (16,16)
read the values of src into a large local buffer ie: uint4 buff[31][31]; The buffer needs to be translated such that 'pos0' is at buff[0][0]
barrier(CLK_LOCAL_MEM_FENCE) to wait for memory copy operations
do the same i,j for loops you had originally, except you leave out the pos and pos0 values. only use i and j for the location. Accumulate 'diff' in the same way you were doing so originally.
write the solution to 'dest'
This is the same as my first response to your question, except I use n=16. This value does not utilize the local memory fully, but will probably work well for most platforms. 256 tends to be a common maximum work group size.
I hope this clears things up for you.
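To make the steps concrete, here is an untested sketch of that 31x31 local-buffer variant, written in the same OpenCL C style as the kernels above (the width/height arguments are dropped for brevity); it is meant to illustrate the cooperative load, not as a drop-in replacement:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel_lds(
    read_only image2d_t src,
    write_only image2d_t dest
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
    int lx = get_local_id(0);
    int ly = get_local_id(1);
    int lid = ly * 16 + lx;

    // 31x31 block translated so that pos0 lands at buff[0]; 961 texels are
    // needed, so the 256 work items each fetch up to 4 of them.
    __local uint4 buff[31 * 31];
    for (int k = lid; k < 31 * 31; k += 256)
    {
        int bx = k % 31;
        int by = k / 31;
        buff[k] = read_imageui(src, sampler, (int2)(pos0.x + bx, pos0.y + by));
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int j = 0; j < 16; j++)
    {
        for (int i = 0; i < 16; i++)
        {
            // buff[j*31 + i] is the pos0 block; buff[(j+ly)*31 + (i+lx)] is the
            // same pixel the original kernel read at (pos.x + i, pos.y + j).
            diff += buff[j * 31 + i] - buff[(j + ly) * 31 + (i + lx)];
        }
    }
    write_imageui(dest, pos, diff);
}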
Some suggestions:
Compute more than 1 output pixel in each work item. It will increase data reuse.
Benchmark different work-group sizes to maximize the usage of texture cache.
Maybe there is a way to separate the kernel into two passes (horizontal and vertical).
Update: more suggestions
Instead of loading everything in local memory, try loading only the local_src values, and use read_image for the other one.
Since you do almost no computations, you should measure read speed in GB/s, and compare to the peak memory speed.

Directly draw a rectangle on IMFMediaBuffer using C++ code

Currently, I am trying to modify a sample from Media Foundation Transforms. I intend to achieve the following:
Perform face detection in C++ code.
Pass back the face detected coordinate to C# code.
Let C# draw detected face rectangle on the screen.
I completed step 1. However, I am stuck on step 2. I am facing a similar problem to this one: How to get feedback from MediaCapture API in Windows 8. I cannot find a way to make my C++ MFT code talk to my C# code.
I was thinking of another workaround: directly draw a rectangle on the IMFMediaBuffer using C++ code.
However, I do not see that Microsoft provides an API to do so. If not, what is the correct approach?
If you can set pixel colors it should be fairly simple to draw a rectangle with a loop.
for (int y = top; y <= bottom; y++)
    for (int x = left; x <= right; x++)
        pixels[y * width + x] = color; // pseudocode
Drawing just borders of a rectangle is just 4 separate loops.
Simplest way to draw a circle:
for (int y = -r; y <= r; y++)
    for (int x = -r; x <= r; x++)
        if (x * x + y * y < r * r)
            pixels[(center.y + y) * width + center.x + x] = color; // pseudocode
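To apply this to the IMFMediaBuffer inside the MFT, the buffer has to be locked first and the pixel format has to be known. A rough sketch assuming the stream is MFVideoFormat_RGB32 (DrawRectBorder and its parameters are illustrative names; bounds checking is omitted):
#include <mfobjects.h>

// Draw the border of a rectangle directly into an RGB32 frame held by the
// sample's IMFMediaBuffer. `stride` is the row pitch in bytes.
HRESULT DrawRectBorder(IMFMediaBuffer *pBuffer, int stride,
                       int left, int top, int right, int bottom, DWORD color)
{
    BYTE *pData = nullptr;
    HRESULT hr = pBuffer->Lock(&pData, nullptr, nullptr);
    if (FAILED(hr))
        return hr;

    DWORD *pixels = (DWORD *)pData;
    int rowStride = stride / 4;          // 4 bytes per RGB32 pixel

    for (int x = left; x <= right; x++)  // top and bottom edges
    {
        pixels[top * rowStride + x] = color;
        pixels[bottom * rowStride + x] = color;
    }
    for (int y = top; y <= bottom; y++)  // left and right edges
    {
        pixels[y * rowStride + left] = color;
        pixels[y * rowStride + right] = color;
    }
    return pBuffer->Unlock();
}
If the negotiated format is NV12 or YUY2 instead of RGB32, the same idea applies, but the pixel writes have to follow that layout.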