Is the raw depth data from the Kinect 2 completely unfiltered? - kinect

As the title says, is the raw data really raw, or does the Kinect apply some sort of filtering (median, bilateral, etc.) to reduce the noise? I am comparing the data with other non-consumer ToF cameras, and the raw values from the Kinect 2 seem to be pretty smooth.

No, some filters are applied.
But Microsoft doesn't publish any information about what's going on inside their Kinect SDK/hardware, so we can only guess.
The best information about this comes from libfreenect2, the open source driver for Kinect v2. One of the developers said:
[libfreenect's] current depth processing code [...] is doing the same things as the shader shipped with the K4W2 Preview SDK (might have changed in the meantime). The bilateral filter is applied to the complex valued images before computing the amplitude/phase(depth). Its only aware of intensity edges in these images. The "edge-aware" filter basically tries to filter the flying pixels at the object boundaries by calculating some statistics in a local neighborhood. Both filters can be disabled in libfreenect2.
(emphasis mine, Source)
Of course we don't know if there's anything else going on or if something changed in the release version of the Microsoft SDK.
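If you switch to libfreenect2 you can turn both of those stages off and get at the unfiltered depth yourself. Roughly like this; this is a sketch from memory of the libfreenect2 API, so treat the exact headers and field names as an assumption and check the current Config struct and the Protonect example:

```cpp
#include <libfreenect2/libfreenect2.hpp>
#include <libfreenect2/packet_pipeline.h>
// DepthPacketProcessor::Config may need <libfreenect2/depth_packet_processor.h>
// depending on the libfreenect2 version.

int main()
{
    libfreenect2::Freenect2 freenect2;

    // Any pipeline works (CPU/OpenGL/OpenCL); the depth processor config is the same.
    libfreenect2::CpuPacketPipeline *pipeline = new libfreenect2::CpuPacketPipeline();

    // Disable the two filters described in the quote to get the "rawest" depth available.
    libfreenect2::DepthPacketProcessor::Config config;
    config.EnableBilateralFilter = false;
    config.EnableEdgeAwareFilter = false;
    pipeline->getDepthPacketProcessor()->setConfiguration(config);

    libfreenect2::Freenect2Device *dev = freenect2.openDefaultDevice(pipeline);
    // ... start the device and register frame listeners as usual ...
    return 0;
}
```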
Btw. here's a recent paper comparing some current ToF sensors:
A Comparative Error Analysis of Current Time-of-Flight Sensors - Peter Fürsattel et al.

Related

Custom rendering with GPU, Direct3D or OpenGL

I have a Windows application that currently renders graphics largely using MFC that I'd like to change to make better use of the GPU. Most of the graphics are straightforward and could easily be built up into a scene graph, but some of the graphics could prove very difficult. Specifically, in addition to the normal mesh-type objects, I'm also dealing with point clouds which are liable to contain billions of Cartesian points stored in a very compact manner, and which need quite a lot of custom culling techniques to be displayed in real time (Example). What I'm looking for is a mechanism that does the bulk of the scene rendering to a buffer and then gives me access to that buffer, a z-buffer, and camera parameters such that I can modify them before putting them out to the display. I'm wondering whether this is possible with Direct3D, OpenGL, or possibly a higher-level framework like OpenSceneGraph, and what would be the best starting point? Given the software is Windows based, I'd probably prefer to use Direct3D as this is likely to lead to the fewest driver issues, which I'm eager to avoid. OpenSceneGraph seems to provide custom culling via octrees, which are close but not identical to what I'm using.
Edit: To clarify a bit more, currently I have the following;
A display list / scene in memory which will typically contain up to a few million triangles, lines, and pieces of text, which I cull in software and output to a bitmap using low-performing drawing primitives
A point cloud in memory which may contain billions of points in a highly compressed format (~4.5 bytes per 3d point) which I cull and output to the same bitmap
Cursor information that gets added to the bitmap prior to output
A camera, z-buffer and attribute buffers for navigation and picking purposes
The slow bit is the software culling and primitive drawing in item 1, which I'd like to replace with GPU rendering of some kind. The solution I envisage is to build a scene for the GPU, render it to a bitmap (with matching z-buffer) based on my current camera parameters, and then add my point cloud prior to output.
Alternatively, I could move to a scene-based framework that manages the cameras and navigation for me, and provide points in view as spheres or splats based on volume and level of detail during the rendering loop. In this scenario I'd also need to be able to add cursor information to the view.
In either scenario, the hosting application will be MFC C++ based on VS2017 which would require too much work to change for the purposes of this exercise.
It's hard to say exactly based on your description of a complex problem.
OSG can probably do what you're looking for.
Depending on your timeframe, I'd consider eschewing both OpenGL (OSG) and DirectX in favor of the newer Vulkan 3D API. It's intended as a successor to both D3D and OGL, and was designed with close involvement of the GPU manufacturers themselves to deliver performance exceeding both of its predecessors.
The OSG project is currently developing a Vulkan scenegraph known as VSG, which already demonstrates superior performance to OSG and will have more generalized culling ability.
I've worked a bunch with point clouds and am pretty experienced with them, but I'm not exactly clear on what you're proposing to do.
If you want to actually have a verbal discussion about the matter, I'm pretty easy to find (my company is AlphaPixel -- AlphaPixel.com) and you could call us. I'm in the European time zone right now; it's not clear from your question where you are, but you sound US-based.
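For the specific workflow described in the question (render the mesh scene to an off-screen color buffer plus z-buffer, then composite the CPU-rendered point cloud before display), a minimal OpenGL sketch could look roughly like this. The width/height parameters and the scene draw call are placeholders; Direct3D has an equivalent path via render-target textures copied to a staging texture for readback.

```cpp
#include <vector>
// Assumes an OpenGL 3.x context and function loader (GLEW/GLAD) are already set up.

void renderSceneToBuffers(int width, int height,
                          std::vector<unsigned char> &color,  // RGBA8 output image
                          std::vector<float> &depth)          // z-buffer output
{
    GLuint fbo = 0, colorTex = 0, depthTex = 0;
    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);

    // Color attachment the scene is rasterized into.
    glGenTextures(1, &colorTex);
    glBindTexture(GL_TEXTURE_2D, colorTex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, colorTex, 0);

    // Depth attachment that becomes the z-buffer you can read back.
    glGenTextures(1, &depthTex);
    glBindTexture(GL_TEXTURE_2D, depthTex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT32F, width, height, 0,
                 GL_DEPTH_COMPONENT, GL_FLOAT, nullptr);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_TEXTURE_2D, depthTex, 0);

    glViewport(0, 0, width, height);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    // ... draw the triangle/line/text scene here with the current camera ...

    // Read back both buffers so the CPU point-cloud renderer can depth-test
    // its splats against 'depth' and write into 'color' before presentation.
    color.resize(size_t(width) * height * 4);
    depth.resize(size_t(width) * height);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, color.data());
    glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, depth.data());

    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    glDeleteTextures(1, &colorTex);
    glDeleteTextures(1, &depthTex);
    glDeleteFramebuffers(1, &fbo);
}
```

Note that the depth values read back this way are in normalized window space [0, 1], so you would linearize them with your near/far planes before comparing them against the point cloud's eye-space depths.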

Is it possible to recognize all objects in a room with Microsoft Kinect?

I have a project where I have to recognize an entire room so I can calculate the distances between objects (big ones, e.g. bed, table, etc.) and a person in that room. Is something like that possible using Microsoft Kinect?
Thank you!
Kinect provides you with the following:
Depth Stream
Color Stream
Skeleton information
It's up to you how you use this data.
To answer your question - the official Microsoft Kinect SDK doesn't provide shape detection out of the box. But it does provide skeleton data and face tracking, with which you can detect the distance of a user from the Kinect.
Also, by mapping the color stream to the depth stream you can detect how far a particular pixel is from the Kinect. In your implementation, if you have unique characteristics of different objects, like color, shape and size, you can probably detect them and also detect the distance.
OpenCV is one of the libraries I use for computer vision, etc.
Again, it's up to you how you use this data.
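For example, once the depth stream tells you the depth at a pixel, back-projecting two pixels through the depth camera's pinhole model gives their 3D positions, and the person-to-object distance is just the distance between those points. A minimal sketch; the intrinsic values below are placeholders, and in a real application the SDK's coordinate mapper provides the actual color-to-depth mapping:

```cpp
#include <cmath>

struct Point3 { float x, y, z; };

// Hypothetical depth-camera intrinsics; the real values come from the
// Kinect SDK / camera calibration.
const float fx = 365.0f, fy = 365.0f, cx = 256.0f, cy = 212.0f;

// Back-project a depth pixel (u, v) with depth d (in meters) to a 3D point.
Point3 depthPixelToPoint(int u, int v, float d)
{
    return { (u - cx) * d / fx, (v - cy) * d / fy, d };
}

// Euclidean distance between two back-projected points,
// e.g. a person and a piece of furniture.
float distance(const Point3 &a, const Point3 &b)
{
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}
```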
The Kinect camera provides depth and consequently 3D information (a point cloud) about matte objects in the range 0.5-10 meters. With this information it is possible to segment out the floor of the room (by fitting a plane) and possibly the walls and the ceiling. This step is important since these surfaces often connect separate objects, making them appear as one big object.
The remaining parts of the point cloud can be segmented by depth if they don't physically touch each other. Using color, one can separate the objects even further. Note that we implicitly define an object as a dense, color-consistent 3D entity, while other definitions are also possible.
As soon as you have your objects segmented, you can measure the distances between your segments, analyse their shape, recognize artifacts or humans, etc. To the best of my knowledge, however, the skeleton library can only recognize humans after they have moved for a few seconds. Below is a simple depth map that was broken into a few segments using depth but not color information.
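A minimal PCL sketch of that floor-removal-plus-clustering idea (RANSAC plane fit, remove the plane, then Euclidean clustering of what remains); the thresholds are illustrative and need tuning for a real room:

```cpp
#include <vector>
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/ModelCoefficients.h>
#include <pcl/segmentation/sac_segmentation.h>
#include <pcl/segmentation/extract_clusters.h>
#include <pcl/filters/extract_indices.h>
#include <pcl/search/kdtree.h>

void segmentRoom(pcl::PointCloud<pcl::PointXYZ>::Ptr cloud,
                 std::vector<pcl::PointIndices> &clusters)
{
    // 1. Fit the dominant plane (typically the floor) with RANSAC.
    pcl::SACSegmentation<pcl::PointXYZ> seg;
    pcl::ModelCoefficients coeffs;
    pcl::PointIndices::Ptr inliers(new pcl::PointIndices);
    seg.setModelType(pcl::SACMODEL_PLANE);
    seg.setMethodType(pcl::SAC_RANSAC);
    seg.setDistanceThreshold(0.02);   // 2 cm tolerance, tune for your sensor
    seg.setInputCloud(cloud);
    seg.segment(*inliers, coeffs);

    // 2. Remove the plane inliers, keeping only candidate objects.
    pcl::PointCloud<pcl::PointXYZ>::Ptr objects(new pcl::PointCloud<pcl::PointXYZ>);
    pcl::ExtractIndices<pcl::PointXYZ> extract;
    extract.setInputCloud(cloud);
    extract.setIndices(inliers);
    extract.setNegative(true);
    extract.filter(*objects);

    // 3. Cluster the remaining points by Euclidean distance.
    pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);
    tree->setInputCloud(objects);
    pcl::EuclideanClusterExtraction<pcl::PointXYZ> ec;
    ec.setClusterTolerance(0.05);     // points closer than 5 cm join a cluster
    ec.setMinClusterSize(100);
    ec.setSearchMethod(tree);
    ec.setInputCloud(objects);
    ec.extract(clusters);
}
```

Each resulting cluster is one candidate object; computing centroids per cluster then gives you the object-to-person distances the question asks about.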

Kinect skeleton tracking

I am working with Kinect and I implemented a skeleton tracking mechanism to record joint trajectories of a human gait sequence with a single Kinect. The problem is the accuracy of the measurements. I want to overcome occlusion and jitter problems. I found some filter implementations to deal with jittering (more advanced than averaging filters), but occlusion is more difficult. Is there a good open source project for skeleton tracking with good (more accurate) results? Microsoft SDK or OpenNI, it doesn't matter.
Thanks in advance
Actually, I did gait recognition using Kinect with the official Microsoft SDK. The accuracy of the measurements depends on your calculations. Check this as an example.
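As an illustration of the jitter side (not the occlusion side), smoothing each joint's position is only a few lines. This is a plain exponential smoother, a simpler stand-in for the double-exponential (Holt) filter that is often used on Kinect joint data:

```cpp
#include <array>

// Per-joint exponential smoothing to damp Kinect joint jitter.
// 'alpha' trades responsiveness against smoothness: 1 = raw input, 0 = frozen.
struct JointSmoother
{
    std::array<float, 3> state{};
    bool initialized = false;
    float alpha = 0.5f;

    std::array<float, 3> update(const std::array<float, 3> &raw)
    {
        if (!initialized) { state = raw; initialized = true; return state; }
        for (int i = 0; i < 3; ++i)
            state[i] = alpha * raw[i] + (1.0f - alpha) * state[i];
        return state;
    }
};
```

You would keep one JointSmoother per tracked joint and feed it the raw joint position every frame; lower alpha gives smoother but laggier trajectories.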

Transformation between point clouds

I hope to find some hints where to start with a problem I am dealing with.
I am using a Kinect sensor to capture 3d point clouds. I created a 3d object detector which is already working.
Here my task:
Let's say I have point cloud 1. I detected an object in cloud 1 and I know the centroid position of my object (x1, y1, z1). Now I move my sensor along a path and create new clouds (e.g. cloud 2). In cloud 2 I see the same object, but e.g. from the side, where the object detection does not work well.
I would like to transform the detected object from cloud 1 to cloud 2, to get the centroid also in cloud 2. To me it sounds like I need a matrix (translation, rotation) to transform points from cloud 1 to cloud 2.
Any ideas how I could solve my problem?
Maybe ICP? Are there better solutions?
THX!
In general, this task is called registration. It relies on having a good estimate of which points in cloud 1 correspond to which points in cloud 2 (more specifically: given a point in cloud 1, which point in cloud 2 represents the same location on the detected object). There's a good overview in the PCL library documentation.
If you have such a correspondence, you're in luck and you can directly compute a rotation and translation as demonstrated here.
If not, you'll need to estimate that correspondence. ICP does that for approximately aligned point clouds, but if your point clouds are not already fairly well aligned, you may want to start by estimating "key points" (such as book corners, distinct colors, etc.) in your point clouds, computing a rotation and translation as above, and then performing ICP. As D.J.Duff mentioned, ICP works better in practice on point clouds that are already approximately aligned because it estimates correspondences using one of two metrics: minimal point-to-point distance or minimal point-to-plane distance. According to Wikipedia, the latter works better in practice, but it involves estimating normals, which can be tricky. If the correspondences are far off, the transforms likely will be as well.
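Assuming the two clouds are already roughly aligned (small sensor motion between captures), a minimal PCL sketch of the ICP step could look like this; the iteration count is arbitrary, and in practice you would also check hasConverged() and getFitnessScore():

```cpp
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/registration/icp.h>
#include <Eigen/Core>

// Align cloud 1 to cloud 2 with ICP and return the rigid transform,
// which can then carry the detected centroid from cloud 1 into cloud 2.
Eigen::Matrix4f alignClouds(pcl::PointCloud<pcl::PointXYZ>::Ptr cloud1,
                            pcl::PointCloud<pcl::PointXYZ>::Ptr cloud2)
{
    pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
    icp.setInputSource(cloud1);       // cloud we want to move
    icp.setInputTarget(cloud2);       // cloud we align to
    icp.setMaximumIterations(50);

    pcl::PointCloud<pcl::PointXYZ> aligned;
    icp.align(aligned);

    // 4x4 rotation + translation from cloud 1's frame to cloud 2's frame.
    // Applying it to the homogeneous centroid (x1, y1, z1, 1) gives the
    // centroid's position in cloud 2.
    return icp.getFinalTransformation();
}
```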
I think what you were asking about relates in particular to the Kinect sensor and the API Microsoft released for it.
If you are not planning to do reconstruction, you can look into the AlignPointClouds function in the Kinect Fusion namespace. This should take care of it automatically, using methods similar to the answer given by #pnhgiol.
On the other hand, if you are looking at doing reconstruction as well as point cloud transforms, the Reconstruction class is what you are looking for. All of this can be read about here.

What frameworks for depth cameras are out there?

I want to evaluate the performance of several SDKs / frameworks for depth cameras. These cameras can either be using Time-of-Flight or structured light.
The framework should be capable (at least) of person tracking / blob detection and gesture recognition.
So far I found the following frameworks:
OpenNI (structured light only)
Microsoft Kinect SDK (Kinect only)
Beckon SDK by Omek Interactive (ToF and structured light)
iisu by SoftKinetic (ToF and structured light)
Are there any other frameworks I should be aware of?
EDIT: I found this article by Techradar that seems to indicate that these are indeed the only options currently available.
Any feedback would be very much appreciated!
I have found some interesting links on this. You can take MIT's approach using CoDAC. They list lots of facts in this post; the most important ones I will repost here.
9. What are limitations of this technique?
The main limitation of our framework is inapplicability to scenes with curvilinear objects, which would require extensions of the current mathematical model. Another limitation is that a periodic light source creates a wrap-around error as it does in other TOF devices. For scenes in which surfaces have high reflectance or texture variations, availability of a traditional 2D image prior to our data acquisition allows for improved depth map reconstruction as discussed in our paper.

10. What are advantages of this technique/device and how does it compare with existing TOF-based range sensing techniques?
In laser scanning, spatial resolution is limited by the scanning time. TOF cameras do not provide high spatial resolution because they rely on a low-resolution 2D pixel array of range-sensing pixels. CoDAC is a single-sensor, high spatial resolution depth camera which works by exploiting the sparsity of natural scene structure.

11. What is the range resolution and spatial resolution of the CoDAC system?
We have demonstrated sub-centimeter range resolution in our experiments. This is significantly better than fundamental limit of about 10 cm that would arise from using a detector with 0.7 nanosecond rise time if we were not using parametric signal modeling. The improvement in range resolution comes from the parametric modeling and deconvolution in our framework. We refer the reader to our publications for complete details and analysis. We have demonstrated 64-by-64 pixel spatial resolution, as this is the spatial resolution of our spatial light modulator. Spatially patterning with a digital micromirror device (DMD) will enable much higher spatial resolution. Our experiments use only 205 projection patterns, which correspond to just 5% of number of pixels in the reconstructed depth map. This is a significant improvement over raster scanning in LIDAR, and it is obtained without the 2D sensor array used in TOF cameras.
Another interesting project I found on YouTube uses libfreenect and libusb.
There is also dSensingNI, which is described as:
This work presents an approach to overcome the disadvantages of existing interaction frameworks and technologies for touch detection and object interaction. The robust and easy to use framework dSensingNI (Depth Sensing Natural Interaction) is described, which supports multitouch and tangible interaction with arbitrary objects. It uses images from a depth-sensing camera and provides tracking of users' fingers or palms of hands and combines this with object interaction, such as grasping, grouping and stacking, which can be used for advanced interaction techniques.
So you have hit most of the options out there, especially those that use Kinect, but there are a few others! Hope this helps!