Pointcloud and RGB Image alignment on RealSense ROS - tensorflow

I am working on a dog detection system using deep learning (TensorFlow Object Detection) and a RealSense D425 camera. I am using the Intel(R) RealSense(TM) ROS Wrapper to get images from the camera.
I run "roslaunch rs_rgbd.launch", and my Python code subscribes to the "/camera/color/image_raw" topic to get the RGB image. Using this image and the object detection library, I can infer (at 20 fps) the location of a dog in the image as a bounding box (xmin, xmax, ymin, ymax).
I would like to crop the point cloud using the bounding box from the object detection (xmin, xmax, ymin, ymax)
and determine whether the dog is far from or near the camera. I would like to use the pixel-by-pixel alignment between the RGB image and the point cloud.
How can I do it? Is there a topic for that?
Thanks in advance

Intel publishes a Python notebook on the same problem at: https://github.com/IntelRealSense/librealsense/blob/jupyter/notebooks/distance_to_object.ipynb
What they do is as follows:
Get the color frame and the depth frame (the point cloud in your case)
Align the depth frame to the color frame
Use an SSD to detect the dog in the color frame
Get the average depth for the detected dog and convert it to meters
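import cv2
import numpy as np
# aligned_depth_frame, profile, className and the *_depth box limits come from the earlier notebook steps
depth = np.asanyarray(aligned_depth_frame.get_data())
# Crop depth data to the detection box (NumPy images are indexed [row, column], i.e. [y, x]):
depth = depth[ymin_depth:ymax_depth, xmin_depth:xmax_depth].astype(float)
# Get the depth scale from the device and convert to meters
depth_scale = profile.get_device().first_depth_sensor().get_depth_scale()
depth = depth * depth_scale
# cv2.mean returns a 4-tuple; the first element is the mean of the single channel
dist, _, _, _ = cv2.mean(depth)
print("Detected a {0} {1:.3} meters away.".format(className, dist))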
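On the ROS side, the same idea can be sketched without pyrealsense2. This is only a sketch under assumptions: it assumes the wrapper publishes a depth image already aligned to the color image (with align_depth enabled this is typically a topic such as /camera/aligned_depth_to_color/image_raw, but check "rostopic list" on your setup), that you have already converted that message to a NumPy array, and that raw depth values are in millimeters (depth_scale=0.001). The function name box_distance_m is made up for illustration. It uses the median of the valid pixels instead of the mean, so background pixels inside the box and zero (missing) readings have less influence.

```python
import numpy as np

def box_distance_m(depth, xmin, xmax, ymin, ymax, depth_scale=0.001):
    """Median distance in meters inside a detection box.

    depth       : 2D array of raw depth values aligned to the color image
    depth_scale : meters per raw unit (0.001 is an assumption: millimeters)
    Returns None when the box contains no valid (non-zero) depth pixel.
    """
    # NumPy images are indexed [row, column], i.e. [y, x]
    roi = depth[ymin:ymax, xmin:xmax].astype(np.float64)
    valid = roi[roi > 0]  # zero is treated as "no reading"
    if valid.size == 0:
        return None
    return float(np.median(valid)) * depth_scale

# Illustrative data only: a fake 480x640 depth image with an object 2.5 m away
fake_depth = np.zeros((480, 640), dtype=np.uint16)
fake_depth[100:200, 300:400] = 2500
print(box_distance_m(fake_depth, xmin=300, xmax=400, ymin=100, ymax=200))
```

Because the depth image is aligned to the color image, the box coordinates from the detector can be used on it directly, without going through the point cloud.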
Hope this helps.

Related

Camera calibration using Direct Linear Transformation in python

I'm using a NumPy implementation of camera calibration by direct linear transformation (DLT) in Python.
I'm trying to use it for 3-dimensional camera calibration.
My problem is that the mean error of the DLT (the mean residual of the DLT transformation, in units of camera coordinates) is very high in my example: in the thousands of pixels, far higher than in the examples provided by the original author (see here).
These are the 3D points I use:
objpoints = [[86.438, -174.922,51.316],[-27.519,-215.460,39.154],
[73.601, 107.800,120.455],[87.602,133.413,34.023],
[101.276,-55.204,108.884],[88.509,-68.038,116.634],
[27.518,-215.460,39.154],[-31.355,-207.334,85.184],
[87.601,-131.059,33.881],[-60.234,-23.833,148.269],[62.162,-23.042,148.715]]
These are the pixels I use:
imgpoints = [[576.0,861.0],[660.0,996.0],[253.0,1383.0],[575.0,1481.0],
[276.0,1217.0],[241.0,1139.0],[665.0,461.0],[231.0, 411.0],
[660.0,226.0],[141.0,684.0],[111.0,1123.0]]
I extracted these points manually: the 3D points from a point cloud model (.ply format) and the matching 2D points from the image, in pixels.
Something must be wrong with my coordinates at a very basic level, but I'm not sure what it is or how to find it.
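One way to narrow this kind of problem down is to run a DLT on synthetic points generated from a known camera, where the reprojection error should be close to zero, and then compare with the real correspondences. The following is a minimal sketch, not the implementation referred to above: the function names are made up, the camera matrix is arbitrary, and coordinate normalisation (often recommended for DLT) is left out for brevity. At least 6 non-coplanar points are needed.

```python
import numpy as np

def dlt_calibrate(xyz, uv):
    """Estimate a 3x4 projection matrix from 3D points and their pixels."""
    rows = []
    for (X, Y, Z), (u, v) in zip(np.asarray(xyz, float), np.asarray(uv, float)):
        # Two linear equations per correspondence
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    # The solution is the right singular vector of the smallest singular value
    _, _, vt = np.linalg.svd(np.asarray(rows))
    return vt[-1].reshape(3, 4)

def reproject(P, xyz):
    """Project 3D points with P and return pixel coordinates."""
    xyz = np.asarray(xyz, float)
    h = np.hstack([xyz, np.ones((len(xyz), 1))]) @ P.T
    return h[:, :2] / h[:, 2:3]

def mean_reprojection_error(P, xyz, uv):
    """Mean Euclidean distance in pixels between reprojected and given points."""
    diff = reproject(P, xyz) - np.asarray(uv, float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

# Synthetic sanity check with an arbitrary, made-up camera
rng = np.random.default_rng(0)
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
Rt = np.hstack([np.eye(3), np.array([[0.0], [0.0], [500.0]])])
P_true = K @ Rt
points_3d = rng.uniform(-100.0, 100.0, size=(8, 3))
points_2d = reproject(P_true, points_3d)
P_est = dlt_calibrate(points_3d, points_2d)
print(mean_reprojection_error(P_est, points_3d, points_2d))
```

If the synthetic case gives a small error but the real data does not, the correspondences themselves (point order, swapped u/v, sign or axis conventions, a mistyped coordinate) are the first thing to check.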

How to calculate the Horizontal and Vertical FOV for the KITTI cameras from the camera intrinsic matrix?

I would like to calculate the Horizontal and Vertical field of view from the camera intrinsic matrix for the cameras used in the KITTI dataset. The reason I need the Field of view is to convert a depth map into 3D point clouds.
Though this question was asked quite a long time ago, I felt it needed an answer, as I ran into the same issue and was unable to find any info on it.
I have, however, solved it using the information available in this document and some more general camera calibration documents.
Firstly, we need to convert the supplied disparity into distance. This can be done by first converting the disparity map into floats using the method in the dev_kit, which states:
disp(u,v) = ((float)I(u,v))/256.0;
This disparity can then be converted into a distance through the standard stereo vision equation:
Depth = Baseline * Focal length / Disparity
Now come some tricky parts. I searched high and low for the focal length and was unable to find it in the documentation.
I realised just now, while writing this, that the baseline is documented in the aforementioned source; however, from section IV.B we can see that it can also be found indirectly in P(i)rect.
The P_rects can be found in the calibration files and are used both for calculating the baseline and for the translation from uv in the image to xyz in the real world.
The steps are as follows:
For each pixel in the depth map:
xyz_normalised = P_rect \ [u,v,1]
where u and v are the x and y coordinates of the pixel, respectively.
This gives an xyz_normalised of shape [x,y,z,0] with z = 1.
You can then multiply it by the depth given at that pixel to obtain an xyz coordinate.
For completeness, as P_rect is the depth map here, you need to use P_3 from the cam_cam calibration txt files to get the baseline (as it contains the baseline between the colour cameras), and P_2 belongs to the left camera, which is used as the reference for the occ_0 files.
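The steps above can be sketched in NumPy as follows. This is a sketch under assumptions, not the dev_kit code: it assumes the rectified projection matrices have the form [[f, 0, cx, -f*b], [0, f, cy, 0], [0, 0, 1, 0]], so the focal length and principal point are read from the left 3x3 block and the baseline from the difference of the fourth-column entries. It back-projects with the explicit pinhole formulas rather than the matrix left-division shown above. All numeric values are illustrative placeholders, not values taken from a KITTI calibration file. The last helper covers the original question: under the pinhole model the field of view follows from the focal length and the image size.

```python
import numpy as np

def baseline_from_projections(P_left, P_right):
    """Baseline in meters, assuming P[0, 3] = -f * b for each camera."""
    return (P_left[0, 3] - P_right[0, 3]) / P_left[0, 0]

def disparity_png_to_depth(disp_raw, focal_px, baseline_m):
    """uint16 disparity image (value / 256 = pixels) to depth in meters."""
    disp = disp_raw.astype(np.float32) / 256.0
    depth = np.zeros_like(disp)
    valid = disp > 0  # zero disparity is treated as "no measurement"
    depth[valid] = focal_px * baseline_m / disp[valid]
    return depth

def backproject(depth, K):
    """Depth map (H, W) to an (H, W, 3) array of camera-frame xyz points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)

def fov_degrees(focal_px, size_px):
    """Pinhole field of view along one axis: 2 * atan(size / (2 * f))."""
    return float(np.degrees(2.0 * np.arctan(size_px / (2.0 * focal_px))))

# Placeholder calibration, for illustration only
P_left = np.array([[700.0, 0.0, 600.0, 0.0],
                   [0.0, 700.0, 180.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
P_right = P_left.copy()
P_right[0, 3] = -700.0 * 0.5  # second camera 0.5 m to the right

baseline = baseline_from_projections(P_left, P_right)
disp_raw = np.full((360, 1200), 35 * 256, dtype=np.uint16)
depth = disparity_png_to_depth(disp_raw, P_left[0, 0], baseline)
cloud = backproject(depth, P_left[:, :3])
print(cloud.shape, fov_degrees(P_left[0, 0], 1200), fov_degrees(P_left[1, 1], 360))
```

Note that the field of view is not actually needed to build the point cloud: the focal length and principal point from the projection matrix are enough.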

What is the unit for raw data in Kinect V2?

I am trying to figure out what the raw data in Kinect V2 is. I know we can convert the raw data to meters and to gray levels to represent the depth, but what is the unit of the raw data?
And why are all the images captured by Kinect mirrored?
The raw values stored in the depth image are in millimeters. You can get the X and Y values using the pixel position along with the depth camera intrinsic parameters. If you want, I could share MATLAB code that converts the depth image into X, Y, Z values.
Yes, the images are mirrored in the Windows SDK and in "libfreenect2", which is an open-source version of the SDK. I couldn't get a solid answer as to why this is so, but you could look at the discussion available in the link.
There are different kinds of frames that can be captured with Kinect V2, and each kind of raw data has a different unit. For example, for the depth frame it is millimeters, for color it is RGB (0-255, 0-255, 0-255), and for bodyFrames it is 0 or 1 (same resolution as the depth frame, but it can identify only up to a maximum number of human bodies at a time), and so on.
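The conversion from a depth pixel to X, Y, Z mentioned above can be sketched in Python instead of MATLAB. This is a minimal sketch under assumptions: the intrinsics (fx, fy, cx, cy) below are made-up placeholders, not calibrated Kinect V2 values, and a raw value of 0 is treated as "no reading". The function name is hypothetical.

```python
import numpy as np

def depth_mm_to_xyz(depth_mm, fx, fy, cx, cy):
    """Convert a depth image in millimeters to an (H, W, 3) array in meters.

    fx, fy, cx, cy are the depth camera intrinsics in pixels.
    Pixels with a raw value of 0 are returned as NaN.
    """
    z = depth_mm.astype(np.float64) / 1000.0  # millimeters to meters
    v, u = np.indices(depth_mm.shape)         # v = row index, u = column index
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    xyz = np.stack([x, y, z], axis=-1)
    xyz[depth_mm == 0] = np.nan
    return xyz

# Illustrative data: a flat wall 2 m away on a 424x512 depth frame
frame = np.full((424, 512), 2000, dtype=np.uint16)
frame[0, 0] = 0  # simulate a missing reading
points = depth_mm_to_xyz(frame, fx=365.0, fy=365.0, cx=256.0, cy=212.0)
print(points[212, 256])
```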
Ref: https://developer.microsoft.com/en-us/windows/kinect

Get face coordinates using affdex sdk

Is it possible to get the face coordinates in the image source file or frame? Something like:
face.Height = Affdex.Face[0].PositionHeight;
face.Left = Affdex.Face[0].PositionLeft;
face.Top = Affdex.Face[0].PositionTop;
face.Width = Affdex.Face[0].PositionWidth;
http://developer.affectiva.com/fpi/
The bounding box for each face is not provided directly via the Affdex SDKs, but they do provide coordinates for all the face points, so to determine a face's bounding box, all you need to do is iterate through its face points and track the max/min values in each dimension.
As an example, see the drawFacePoints method in the DrawingView class of the AffdexMe sample app: https://github.com/Affectiva/affdexme-android/blob/master/app/src/main/java/com/affectiva/affdexme/DrawingView.java
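The min/max tracking described above is independent of the SDK. A minimal sketch in Python, assuming the face points have already been read out as (x, y) pairs; how they are obtained from the Affdex SDK is not shown, and the function name and the returned keys are made up to mirror the fields in the question:

```python
def bounding_box(points):
    """Axis-aligned bounding box of a sequence of (x, y) points."""
    points = list(points)
    if not points:
        raise ValueError("at least one point is required")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    left, top = min(xs), min(ys)
    return {
        "left": left,
        "top": top,
        "width": max(xs) - left,
        "height": max(ys) - top,
    }

# Illustrative face points in pixel coordinates
face_points = [(120.0, 80.0), (180.0, 82.0), (150.0, 140.0), (132.0, 170.0)]
print(bounding_box(face_points))
```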

face alignment algorithm on images

How can I do a basic face alignment on a 2-dimensional image, assuming that I have the positions/coordinates of the mouth and eyes?
Is there any algorithm that I could implement to correct the face alignment on images?
Face (or image) alignment refers to aligning one image (or face in your case) with respect to another (or a reference image/face). It is also referred to as image registration. You can do that using either appearance (intensity-based registration) or key-point locations (feature-based registration). The second category stems from image motion models where one image is considered a displaced version of the other.
In your case the landmark locations (3 points for eyes and nose?) provide a good reference set for straightforward feature-based registration. Assuming you have the locations of a set of corresponding points in both 2D images, x_1 and x_2, you can estimate a similarity transform (rotation, translation, scaling), i.e. a planar 2D transform S that maps x_2 onto x_1. You can additionally add reflection to that, though for faces this will most likely be unnecessary.
Estimation can be done by forming the normal equations and solving a linear least-squares (LS) problem for the system x_1 = S x_2. A 2D similarity transform has 4 unknown parameters (1 rotation, 2 translation, 1 scaling), and each point correspondence gives 2 equations, so 2 points are the minimum and 3 landmarks give an overdetermined system. The solution to the above LS problem can be obtained through the Direct Linear Transform (e.g. by applying SVD or a matrix pseudo-inverse). With a sufficiently large number of reference points (i.e. automatically detected ones), a RANSAC-type method can be used for point filtering and outlier removal (though this is not your case here).
After estimating S, apply image warping on the second image to get the transformed grid (pixel) coordinates of the entire image 2. The transform will change pixel locations but not their appearance. Unavoidably some of the transformed regions of image 2 will lie outside the grid of image 1, and you can decide on the values for those null locations (e.g. 0, NaN etc.).
For more details: R. Szeliski, "Image Alignment and Stitching: A Tutorial" (Section 4.3 "Geometric Registration")
In OpenCV see: Geometric Image Transformations, e.g. cv::getRotationMatrix2D, cv::getAffineTransform and cv::warpAffine. Note though that you should estimate and apply a similarity transform (a special case of an affine transform) in order to preserve angles and shapes.
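The least-squares estimate of the similarity transform can be sketched with NumPy alone. This is a minimal sketch, not a reference implementation: it parameterises the transform as x' = a*x - b*y + tx, y' = b*x + a*y + ty (with a = s*cos(theta), b = s*sin(theta)), which makes the problem linear in the four unknowns. The function name and the landmark values are made up for illustration.

```python
import numpy as np

def estimate_similarity(src, dst):
    """Least-squares 2D similarity transform mapping src points onto dst.

    src, dst: (N, 2) arrays of corresponding points, N >= 2.
    Returns a 2x3 matrix [[a, -b, tx], [b, a, ty]].
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    A = np.zeros((2 * n, 4))
    # Rows for the x' equations: a*x - b*y + tx
    A[0::2, 0] = src[:, 0]
    A[0::2, 1] = -src[:, 1]
    A[0::2, 2] = 1.0
    # Rows for the y' equations: b*x + a*y + ty
    A[1::2, 0] = src[:, 1]
    A[1::2, 1] = src[:, 0]
    A[1::2, 3] = 1.0
    rhs = dst.reshape(-1)  # [x0', y0', x1', y1', ...]
    a, b, tx, ty = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return np.array([[a, -b, tx], [b, a, ty]])

# Illustrative landmarks: eyes and mouth in image 2 and in the reference image 1
landmarks_2 = np.array([[110.0, 95.0], [170.0, 105.0], [135.0, 165.0]])
landmarks_1 = np.array([[100.0, 100.0], [160.0, 100.0], [130.0, 160.0]])
S = estimate_similarity(landmarks_2, landmarks_1)
print(S)
```

The resulting 2x3 matrix has the layout cv::warpAffine expects, so in Python it could be applied with cv2.warpAffine(image_2, S, (width, height)). OpenCV also provides cv2.estimateAffinePartial2D, which estimates the same kind of 4-degree-of-freedom transform with optional RANSAC.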
For faces there is a lot of variability in the feature points, so it won't be possible to fit all feature points perfectly with just an affine transform. The only way to align all the points perfectly is to warp the image given the points. Basically, you can triangulate the image given the points and apply an affine warp to each triangle to get a warped image in which all the points are aligned.
Face alignment can be handled based on just the eye positions.
OpenCV, Dlib and MTCNN all offer face and eye detection. deepface, a Python-based framework, wraps those methods and offers an out-of-the-box detection and alignment function.
Its detectFace function applies detection and then alignment in the background.
#!pip install deepface
from deepface import DeepFace
backends = ['opencv', 'ssd', 'dlib', 'mtcnn']
DeepFace.detectFace("img.jpg", detector_backend = backends[0])
You can also apply detection and alignment manually.
import matplotlib.pyplot as plt
from deepface.commons import functions
img = functions.load_image("img.jpg")
backends = ['opencv', 'ssd', 'dlib', 'mtcnn']
# Detect the face with the MTCNN backend (backends[3])
detected_face = functions.detect_face(img = img, detector_backend = backends[3])
plt.imshow(detected_face)
# Align the original image
aligned_face = functions.align_face(img = img, detector_backend = backends[3])
plt.imshow(aligned_face)
# Detect again on the aligned image to get the final crop
processed_img = functions.detect_face(img = aligned_face, detector_backend = backends[3])
plt.imshow(processed_img)
There's a section Aligning Face Images in OpenCV's Face Recognition guide:
http://docs.opencv.org/trunk/modules/contrib/doc/facerec/facerec_tutorial.html#aligning-face-images
The script aligns given images at the eyes. It's written in Python, but should be easy to translate to other languages. I know of a C# implementation by Sorin Miron:
http://code.google.com/p/stereo-face-recognition/
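Aligning at the eyes comes down to rotating the image so that the line between the two eye centers becomes horizontal. A minimal sketch of that geometry with NumPy, which is not the script linked above: the function names are made up, the eye coordinates are illustrative, and "left" and "right" refer to positions in the image (x to the right, y down).

```python
import numpy as np

def eye_alignment_matrix(left_eye, right_eye):
    """2x3 matrix that rotates about the eye midpoint to level the eyes."""
    left = np.asarray(left_eye, dtype=float)
    right = np.asarray(right_eye, dtype=float)
    dx, dy = right - left
    theta = np.arctan2(dy, dx)       # tilt of the eye line
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, s], [-s, c]])  # rotation by -theta
    center = (left + right) / 2.0
    t = center - R @ center          # keeps the eye midpoint fixed
    return np.hstack([R, t[:, None]])

def apply_affine(M, points):
    """Apply a 2x3 affine matrix to an (N, 2) array of points."""
    pts = np.asarray(points, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]

# Illustrative eye centers in pixel coordinates
left_eye, right_eye = (100.0, 120.0), (180.0, 150.0)
M = eye_alignment_matrix(left_eye, right_eye)
print(apply_affine(M, [left_eye, right_eye]))
```

The matrix maps source pixels to aligned pixels, so in Python it could be passed to cv2.warpAffine(image, M, (width, height)) to produce the aligned image.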