How can the auto-focus of a camera be explained using the pinhole camera model?

Shifting the auto-focus on a real-world camera doesn't change the focal length, rotation, or any other parameter of the pinhole camera model. However, it does shift the image plane and affect the depth of field. How is this possible?
I understand that the complex mechanism of a real-world camera cannot be fully explained by the pinhole camera model. However, I believe there should be some link between them, since we use this simplified model in various real-world computer vision applications.

Short answer: it cannot. The pinhole camera model has no notion of 'focus'.
A more interesting question is, I think, the effect of changing the focusing distance on a pinhole approximation of the lens+camera combination, the approximation itself being estimated, for example, through a camera calibration procedure.
With "ordinary" consumer-type lenses having moderate non-linear distortion, usually one observes significant changes in:
The location of the principal point (which is anyway hard to estimate precisely, and confused with the center of the distortion)
The amount of nonlinear distortion (especially with cheaper lenses and wide FOV).
The "effective" field of view - due to the fact that a change in nonlinear distortion will "pull-in" a wider or thinner view at the edges.
The last item implies a change of the calibrated focal length, and this is sometimes "surprising" for novices, who are taught that a lens's focus and focal length do not mix. To convince yourself that the FOV change is in fact happening, visualize the bounding box of the undistorted image, which is "butterfly"-shaped in the common case of barrel distortion. The pinhole model FOV angle is twice the arctangent of the ratio between the image half-width and the calibrated approximation to the physical focal length (which is the distance between the sensor and the lens's last optical surface). Changing the distortion stretches or squeezes that half-width value.
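As a quick illustration of that last formula, here is a minimal Python sketch (my own, with made-up numbers): the horizontal FOV is twice the arctangent of the image half-width over the calibrated focal length, both measured in pixels.

```python
import math

# fx: calibrated focal length in pixels (e.g. from your own cv2.calibrateCamera run),
# image_width: image width in pixels. Both values below are made up for illustration.
def horizontal_fov_deg(fx, image_width):
    return math.degrees(2.0 * math.atan((image_width / 2.0) / fx))

print(horizontal_fov_deg(1400.0, 1920))  # roughly 69 degrees for these numbers
```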

Why does SSD resize random crops during data augmentation?

The SSD paper details its random-crop data augmentation scheme as:
Data augmentation: To make the model more robust to various input object sizes and shapes, each training image is randomly sampled by one of the following options:
– Use the entire original input image.
– Sample a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
– Randomly sample a patch.
The size of each sampled patch is [0.1, 1] of the original image size, and the aspect ratio is between 1/2 and 2. We keep the overlapped part of the ground truth box if the center of it is in the sampled patch. After the aforementioned sampling step, each sampled patch is resized to fixed size and is horizontally flipped with probability of 0.5, in addition to applying some photo-metric distortions similar to those described in [14].
https://arxiv.org/pdf/1512.02325.pdf
My question is: what is the reasoning for resizing crops whose aspect ratios range between 0.5 and 2.0?
For instance, if your input image is 300x300, reshaping a crop with AR=2.0 back to a square resolution will severely stretch objects (square features become rectangular, circles become ellipses, etc.). I understand small distortions may improve generalization, but training the network on objects distorted up to 2x in either dimension seems counter-productive. Am I misunderstanding how random-crop works?
[Edit] I completely understand that augmented images need to be the same size as the original -- I'm just wondering why the authors don't fix the aspect ratio to 1.0 to preserve object proportions.
GPU architectures force us to use batches to speed up training, and the images in a batch must all be the same size. Using less-distorted crops could make training more effective per sample, but much slower.
Personally, I consider that any transformation makes sense as long as you, as a human, can still identify the object/subject, and as long as it makes sense within the receptive field of the network. I also guess that varying the aspect ratio might help the network learn some kind of perspective distortion (look at the cow in Fig. 5, it is kind of "compressed"). Objects like a cup, a tree or a chair are still identifiable even when stretched. Otherwise you could equally argue that some point-controlled or skew transforms don't make sense either.
Then again, if you are working with images other than natural images, without perspective, it is probably not a good idea to do so. If your images show objects of a fixed, known size, as in a microscope or another medical imaging device, and your object has a more or less fixed size (say, a cell), then it is probably not a good idea to apply a strong scale distortion (a cell twice as large); a cell stretched into an ellipse may actually make more sense in that case.
With an image-augmentation library you can perform strong augmentations, but not all of them make sense.
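For reference, here is a toy Python sketch (my own code, not the SSD authors') of the sampling described in the quoted passage: a crop area in [0.1, 1] of the image, an aspect ratio in [1/2, 2], then a resize back to the fixed square input, which is the step that stretches objects. The jaccard-overlap constraint on ground-truth boxes is omitted for brevity.

```python
import numpy as np
import cv2

def random_ssd_crop(image, out_size=300, rng=np.random):
    """Toy SSD-style sampling: crop, then resize to a fixed square input."""
    h, w = image.shape[:2]
    for _ in range(50):                        # retry until a crop fits inside the image
        area_frac = rng.uniform(0.1, 1.0)      # fraction of the original area
        ar = rng.uniform(0.5, 2.0)             # aspect ratio (width / height) of the crop
        cw = int(round(np.sqrt(area_frac * w * h * ar)))
        ch = int(round(np.sqrt(area_frac * w * h / ar)))
        if 0 < cw <= w and 0 < ch <= h:
            x = rng.randint(0, w - cw + 1)
            y = rng.randint(0, h - ch + 1)
            crop = image[y:y + ch, x:x + cw]
            return cv2.resize(crop, (out_size, out_size))   # AR != 1 crops get stretched here
    return cv2.resize(image, (out_size, out_size))
```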

Compute road plane normal with an embedded camera

I am developing some computer vision algorithms for vehicle applications.
I am in front of a problem and some help would be appreciated.
Let's say we have a calibrated camera attached to a vehicle, which captures a frame of the road in front of the vehicle:
Initial frame
We apply a first filter to keep only the road markers and return a binary image:
Filtered image
Once the road lanes are separated, we can approximate them with linear expressions and detect the vanishing point:
Objective
What I am looking to recover is the equation of the normal n in the image, without any prior knowledge of the rotation matrix or the translation vector. Nevertheless, I assume L1, L2 and L3 lie on the same plane.
In 3D space the problem is quite simple. In the 2D image plane it is more complicated, since the camera's projective transformation does not preserve angles, and I am not able to find a way to work out the equation of the normal.
Do you have any idea about how I could compute the normal?
Thanks,
Pm
No can do, you need a minimum of two independent vanishing points (i.e. vanishing points representing the images of the points at infinity of two different pencils of parallel lines).
If you have them, the answer is simple: express the image positions of said vanishing points in homogeneous coordinates; their cross product gives (up to scale) the horizon line l of the plane defined by said pencils, and K^T l, with K the calibration matrix, is the normal vector of that plane decomposed in camera coordinates.
Your information is insufficient, as the others have stated. If your data comes from a video, a common way to get the road ground plane is to take two or more images, compute the associated homography, and then decompose the homography matrix into the surface normal and the relative camera motion. You can do the decomposition with OpenCV's decomposeHomographyMat method, and you can compute the homography by associating four or more point correspondences using OpenCV's findHomography method. If it is hard to determine these correspondences, it is also possible to do it with a combination of point and line correspondences (there is a paper on this), but that is not implemented in OpenCV.
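As a rough sketch of that pipeline (my own code; pts_prev and pts_curr are hypothetical matched road points between two frames, and K is your calibration matrix):

```python
import cv2
import numpy as np

def road_normal_candidates(pts_prev, pts_curr, K):
    """pts_prev, pts_curr: Nx2 float arrays of matched road points in two frames.
    K: 3x3 intrinsic matrix from calibration. Returns candidate plane normals."""
    H, mask = cv2.findHomography(pts_prev, pts_curr, cv2.RANSAC, 3.0)
    num, Rs, Ts, Ns = cv2.decomposeHomographyMat(H, K)
    # Up to four (R, t, n) candidates come back; the physically valid one still has to
    # be selected, e.g. by requiring positive depth for the tracked points.
    return [n.ravel() for n in Ns]
```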
You do not have sufficient information in the example you provide.
If you are wondering "which way is up", one thing you might be able to do is detect the line on the horizon. If K is the calibration matrix and l is that line in homogeneous coordinates, then K^T l will give you the plane normal in 3D relative to your camera. (The general equation for the back-projection of an image line l to a plane E through the center of projection is E = P^T l, with P the 3x4 projection matrix.)
A better alternative might be to establish a homography to rectify the ground plane. To do so, however, you need at least four points with known coordinates, no three of them collinear - or four lines, no three of which may be parallel.
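Combining the two answers above (cross the two vanishing points to get the horizon line, then apply K^T), here is a minimal sketch, assuming the two vanishing points and K are already available:

```python
import numpy as np

def road_plane_normal(v1, v2, K):
    """v1, v2: vanishing points of two different directions lying in the road plane,
    as homogeneous pixel coordinates (3-vectors). K: 3x3 calibration matrix.
    Returns the unit normal of the plane in camera coordinates (sign is ambiguous)."""
    horizon = np.cross(v1, v2)        # image of the plane's line at infinity (horizon)
    n = K.T @ horizon                 # back-project the line: n ~ K^T l
    return n / np.linalg.norm(n)
```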

Setup requirements of a stereo camera

In a stereo camera system, two cameras are needed and should be mounted side by side. I have seen people simply glue two cameras to a wooden board. However, one mobile phone manufacturer claimed that the two lenses of the dual camera module on their phone are parallel to within 0.3 degrees. Why do the two lenses on mobile phones need such high-precision assembly? Does this bring any benefit?
I have not worked on a stereo setup, but I would like to answer from what I studied during my course. Camera setups that are not parallel are usually called converged or toe-in setups.
A parallel stereo camera setup does provide some advantages over a toe-in setup, but the advantage is not absolute and depends on what is required.
In a toe-in setup there is a problem of keystoning. Keystoning means that when the two images (left and right) are placed side by side, they are aligned at the meeting point but tilt as you move towards the edges. This leads to depth-plane curvature, which makes farther objects appear curved. It can be corrected in post-processing, which is called keystone correction. There is no keystone problem in a parallel setup, so if your requirement is to avoid the keystone effect, that is an advantage ;)
In a parallel setup you can decide the image convergence in post-processing by slightly shifting the images horizontally (Horizontal Image Translation, HIT; see the sketch below). In a toe-in setup you need to decide the convergence area during the shoot. The convergence region is the part of the image that is the same in both the left and right views. As you can imagine, in a parallel setup there is no convergence and you get a stereo effect for the whole image. Is this good? It depends.
In stereo we have a zero plane, a near plane and a far plane. The zero plane is where the image is perceived as lying on the screen (the screen on which the image is projected in the theatre); the near field is close to the viewer (imagine objects popping out of the screen towards the viewer); the far field is further from the viewer. Since there is no convergence in a parallel setup, the whole scene has a stereo effect (near or far field) and convergence lies at infinity. Now imagine the sky, which really is very deep, i.e. at infinity: in a parallel setup the sky converges, precisely because it is at infinity, and appears to lie on the screen, while a person close to the viewer seems to float in stereo space, which messes with the brain. Therefore people usually prefer a slight convergence angle to avoid this, or use HIT so that the convergence point falls on the zero plane. Hope this helps :)
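Here is a toy sketch of HIT (my own; the crop direction depends on your disparity sign convention): trim opposite edges of the left and right views so that a chosen disparity value becomes zero, i.e. that depth falls on the zero plane.

```python
def apply_hit(left, right, shift_px):
    """left, right: HxW(xC) image arrays from a parallel rig. Crop opposite edges so
    that features with disparity shift_px end up with zero disparity (on the screen).
    Swap which side is cropped on which view if your disparity convention is reversed."""
    s = int(abs(shift_px))
    if s == 0:
        return left, right
    h, w = left.shape[:2]
    return left[:, s:], right[:, : w - s]   # both views keep the same width, w - s
```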

How to segment a depth image faster?

I need to segment a depth image captured from a Kinect device in real time (30 fps).
Currently I am using EuclideanClusterExtraction from PCL; it works, but it is very slow (1 fps).
Here is a paragraph in the PCL tutorial:
“Unorganized” point clouds are characterized by non-existing point references between points from different point clouds due to varying size, resolution, density and/or point ordering. In case of “organized” point clouds often based on a single 2D depth/disparity images with fixed width and height, a differential analysis of the corresponding 2D depth data might be faster.
So I think there should be a faster method to segment the depth image.
The project doesn't use the RGB camera, so I need a segmentation method that uses only the depth image.
PCL provides segmentation algorithms optimised for organised point clouds.
For details see:
The tutorial here, describing them and showing how to use them:
http://www.pointclouds.org/assets/icra2012/segmentation.pdf
The example code in the PCL distribution (relatively recent versions): organized_segmentation_demo and openni_organized_multi_plane_segmentation
In the API, OrganizedConnectedComponentSegmentation and OrganizedMultiPlaneSegmentation; the latter builds on the former.
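If PCL is not a hard requirement, the "differential analysis of the corresponding 2D depth data" mentioned in the quoted tutorial can also be sketched directly on the depth image. This is my own minimal illustration of that idea, not the PCL classes listed above:

```python
import numpy as np
from scipy import ndimage

def segment_depth(depth, jump_threshold_m=0.05):
    """depth: HxW float array in metres (0 where invalid). Splits the organized depth
    map wherever neighbouring pixels jump in depth, then labels the connected pieces.
    Returns an HxW label image (0 = jump/invalid pixels)."""
    valid = depth > 0
    dz_y = np.abs(np.diff(depth, axis=0, prepend=depth[:1]))      # vertical depth steps
    dz_x = np.abs(np.diff(depth, axis=1, prepend=depth[:, :1]))   # horizontal depth steps
    smooth = (dz_y < jump_threshold_m) & (dz_x < jump_threshold_m) & valid
    labels, num = ndimage.label(smooth)   # 4-connected components of "smooth" pixels
    return labels
```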

Fundamental matrix to be computed or known a priori, for real-world applications

If you are designing a real-world application of a stereo vision algorithm, let's say for a UAV or a spacecraft that computes elevation maps from two images, is the fundamental matrix known a priori, or will I have to compute it along with the disparity map?
If the fundamental matrix can be obtained a priori, is it correct that knowledge of the calibration matrix and the projection matrices is sufficient to compute it?
Regarding your first question:
In my experience, this depends on the mechanical design of your camera system and on whether you use a fixed focal length. If you are able to mount your cameras rigidly, and if your focal length does not change, then you can pre-calibrate the whole thing.
If the relative position of your cameras is likely to change (because they are mounted, for example, on a not perfectly rigid structure), or if you are zooming or using autofocus (!), then you must think about dynamic calibration (or about fixing your cameras better). The depth error induced by a calibration error depends on the baseline of your stereo setup and the distance to your scene, so you can compute your tolerances.
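A rough way to compute those tolerances (my own sketch, not part of the answer above): for a rectified pair, depth from disparity is Z = f*B/d, so a residual disparity error delta_d, into which a calibration error ultimately translates, costs roughly delta_Z ~ Z^2 * delta_d / (f*B).

```python
# Rough stereo tolerance estimate, assuming a rectified pair where depth Z = f*B/d
# (f: focal length in pixels, B: baseline in metres, d: disparity in pixels).
def depth_error(Z, f_px, baseline_m, delta_d_px):
    return (Z ** 2) * delta_d_px / (f_px * baseline_m)

# Example with made-up numbers: f = 1000 px, B = 0.2 m, 0.25 px residual error at 20 m.
print(depth_error(20.0, 1000.0, 0.2, 0.25))  # ~0.5 m of depth error
```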
Regarding your second question:
Yes, it is sufficient.
You should be aware that there are many ways of computing an F-matrix. I highly recommend looking into Hartley & Zisserman, which is the de facto reference for these topics.
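For the second question, here is a minimal sketch of building F from known calibration and relative pose, following the standard Hartley & Zisserman relation F = K2^{-T} [t]_x R K1^{-1} (the function name and conventions below are my own):

```python
import numpy as np

def fundamental_from_calibration(K1, K2, R, t):
    """F for cameras P1 = K1 [I | 0] and P2 = K2 [R | t], where (R, t) is the pose of
    camera 1's frame expressed in camera 2. F is only defined up to scale."""
    tx = np.array([[0.0,  -t[2],  t[1]],
                   [t[2],  0.0,  -t[0]],
                   [-t[1], t[0],  0.0]])        # skew-symmetric cross-product matrix [t]_x
    F = np.linalg.inv(K2).T @ tx @ R @ np.linalg.inv(K1)
    return F / np.linalg.norm(F)                # normalize the arbitrary scale
```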