Object detection with R-CNN? - tensorflow

What does R-CNN actually do? Is it like using features extracted by CNN to detect classes in a specified window area?
Is there any tensorflow implementation for this?

R-CNN is using the following algorithm:
Get region proposals for object detection (using selective search).
For each region crop the area from the image and run it thorough a CNN which classify the object.
There are more advanced algorithms that are built upon this like fast-R-CNN and faster R-CNN.
fast-R-CNN:
Run the entire image through the CNN
For each region from the region proposals extract the area using "roi polling" layer and than classify the object.
faster R-CNN:
Run the entire image through the CNN
Using the features detected using the CNN find region proposals using a object proposals network.
For each object proposal extract the area using "roi polling" layer and than classify the object.
There are a lot of implantation in tensorflow specifically for faster R-CNN which is the most recent variant just google faster R-CNN tensorflow.
Good luck

R-CNN is the daddy-algorithm for all the mentioned algos, it really provided the path for researchers to build more complex and better algorithm on top of it. I am trying to explain R-CNN and the other variants of it.
R-CNN, or Region-based Convolutional Neural Network
R-CNN consist of 3 simple steps:
Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000 region proposals
Run a convolutional neural net (CNN) on top of each of these region proposals
Take the output of each CNN and feed it into a) an SVM to classify the region and b) a linear regressor to tighten the bounding box of the object, if such an object exists.
Fast R-CNN:
Fast R-CNN was immediately followed R-CNN. Fast R-CNN is faster and better by the virtue of following points:
Performing feature extraction over the image before proposing regions, thus only running one CNN over the entire image instead of 2000 CNN’s over 2000 overlapping regions
Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model.
Intuitively it makes a lot of sense to remove 2000 conv layers and instead take once Convolution and make boxes on top of that.
Faster R-CNN:
One of the drawbacks of Fast R-CNN was the slow selective search algorithm and Faster R-CNN introduced something called Region Proposal network(RPN).
Here’s is the working of the RPN:
At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d) For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes)
Each region proposal consists of:
An “objectness” score for that region and
4 coordinates representing the bounding box of the region
In other words, we look at each location in our last feature map and consider k different boxes centered around it: a tall box, a wide box, a large box, etc.
For each of those boxes, we output whether or not we think it contains an object, and what the coordinates for that box are. This is what it looks like at one sliding window location:
The 2k scores represent the softmax probability of each of the k bounding boxes being on “object.” Notice that although the RPN outputs bounding box coordinates, it does not try to classify any potential objects: its sole job is still proposing object regions. If an anchor box has an “objectness” score above a certain threshold, that box’s coordinates get passed forward as a region proposal.
Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We add a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.
Linking some Tensorflow implementation:
https://github.com/smallcorgi/Faster-RCNN_TF
https://github.com/CharlesShang/FastMaskRCNN
You can find a lot of implementation of Github.
P.S. I borrowed a lot of material from Joyce Xu Medium blog.

Related

Using Tensorflow Object Detection API: RPN losses keep increasing. Are there ways to make RPN losses decrease?

I am using Tensorflow Object Detection API for fine-tuning, using my own data. The goal is to detect 2 classes of objects. I am using the pre-trained faster_rcnn_resnet101_coco model.
The various detection box precision and recall measures are generally increasing (see screenshots below) and are fairly high:
The box classifier losses are decreasing. HOWEVER, the RPN losses are increasing (see screenshots below) -- It looks that the model is having a hard time distinguishing foregrounds from backgrounds (hence, the increasing RPN losses), but once the model is able to identify and locate the right foreground, it classifies well (hence, the decreasing box classifier losses)? I think this can be observed in the model's performance on test images: the false positive rate (on images that do not contain any of the two classes of target objects) is rather high. On the other hand, on images that do contain those target objects, the model does a fantastic job in accurately identifying and locating those objects.
So my question is essentially: what are some of the things I could try to help make sure RPN losses are also decreasing.

Where are the filter image data in this TensorFlow example?

I'm trying to consume this tutorial by Google to use TensorFlow Estimator to train and recognise images: https://www.tensorflow.org/tutorials/estimators/cnn
The data I can see in the tutorial are: train_data, train_labels, eval_data, eval_labels:
((train_data,train_labels),(eval_data,eval_labels)) =
tf.keras.datasets.mnist.load_data();
In the convolutional layers, there should be feature filter image data to multiply with the input image data? But I don't see them in the code.
As from this guide, the input image data matmul with filter image data to check for low-level features (curves, edges, etc.), so there should be filter image data too (the right matrix in the image below)?: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks
The filters are the weight matrices of the Conv2d layers used in the model, and are not pre-loaded images like the "butt curve" you gave in the example. If this were the case, we would need to provide the CNN with all possible types of shapes, curves, colours, and hope that any unseen data we feed the model contains this finite sets of images somewhere in them which the model can recognise.
Instead, we allow the CNN to learn the filters it requires to sucessfully classify from the data itself, and hope it can generalise to new data. Through multitudes of iterations and data( which they require a lot of), the model iteratively crafts the best set of filters for it to succesfully classify the images. The random initialisation at the start of training ensures that all filters per layer learn to identify a different feature in the input image.
The fact that earlier layers usually corresponds to colour and edges (like above) is not predefined, but the network has realised that looking for edges in the input is the only way to create context in the rest of the image, and thereby classify (humans do the same initially).
The network uses these primitive filters in earlier layers to generate more complex interpretations in deeper layers. This is the power of distributed learning: representing complex functions through multiple applications of much simpler functions.

How to sweep a neural-network through an image with tensorflow?

My question is about finding an efficient (mostly in term of parameters count) way to implement a sliding window in tensorflow (1.4) in order to apply a neural network through the image and produce a 2-d map with each pixel (or region) representing the network output for the corresponding receptive field (which in this case is the sliding window itself).
In practice, I'm trying to implement either a MTANN or a PatchGAN using tensorflow, but I cannot understand the implementation I found.
The two architectures can be briefly described as:
MTANN: A linear neural network with input size of [1,N,N,1] and output size [ ] is applied to an image of size [1,M,M,1] to produce a map of size [1,G,G,1], in which every pixel of the generated map corresponds to a likelihood of the corresponding NxN patch to belong to a certain class.
PatchGAN Discriminator: More general architecture, as I can understand the network that is strided through the image outputs a map itself instead of a single value, which then is combined with adjacent maps to produce the final map.
While I cannot find any tensorflow implementation of MTANN, I found the PatchGAN implementation, which is considered as a convolutional network, but I couldn't figure out how to implement this in practice.
Let's say I got a pre-trained network of which I got the output tensor. I understand that convolution is the way to go, since a convolutional layer operates over a local region of the input and what is I'm trying to do can be clearly represented as a convolutional network. However, what if I already have the network that generates the sub-maps from a given window of fixed-size?
E.g. I got a tensor
sub_map = network(input_patch)
which returns a [1,2,2,1] maps from a [1,8,8,3] image (corresponding to a 3-layer FCN with input size 8, filter size 3x3).
How can I sweep this network on [1,64,64,3] images, in order to produce a [1,64,64,1] map composed of each spatial contribution, like it happens in a convolution?
I've considered these solutions:
Using tf.image.extract_image_patches which explicitly extract all the image patches and channels in the depth dimension, but I think it would consume too many resources, as I'm switching to PatchGAN Discriminator from a full convolutional network due to memory constraints - also the composition of the final map is not so straight-forward.
Adding a convolutional layer before the network I got, but I cannot figure out what the filter (and its size) should be in this case in order to keep the pretrained model work on 8x8 images while integrating it in a model which works on bigger images.
For what I can get it should be something like whole_map = tf.nn.convolution(input=x64_images, filter=sub_map, ...) but I don't think this would work as the filter is an operator which depends on the receptive field itself.
The ultimate goal is to apply this small network to big images (eg. 1024x1024) in an efficient way, since my current model downscales progressively the images and wouldn't fit in memory due to the huge number of parameters.
Can anyone help me to get a better understanding of what I am missing?
Thank you
I found an interesting video by Andrew Ng exactly on how to implement a sliding window using a convolutional layer.
The problem here was that I was thinking at the number of layers as a variable that is dependent on a fixed input/output shape, while it should be the opposite.
In principle, a saved model should only contain the learned filters for each level and as long as the filter shapes are compatible with the layers' input/output depth. Thus, applying a different (ie. bigger) spatial resolution to the network input produces a different output shape, which can be seen as an application of the neural network to a sliding windows sweeping across the input image.

Tensorflow object detection: why is the location in image affecting detection accuracy when using ssd mobilnet v1?

I'm training a model to detect meteors within a picture of the night sky and I have a fairly small dataset with about 85 images and each image is annotated with a bounding box. I'm using the transfer learning technique starting with the ssd_mobilenet_v1_coco_11_06_2017 checkpoint and Tensorflow 1.4. I'm resizing images to 600x600pixels during training. I'm using data augmentation in the pipeline configuration to randomly flip the images horizontally, vertically and rotate 90 deg. After 5000 steps, the model converges to a loss of about 0.3 and will detect meteors but it seems to matter where in the image the meteor is located. Do I have to train the model by giving examples of every possible location? I've attached a sample of a detection run where I tiled a meteor over the entire image and received various levels of detection (filtered to 50%). How can I improve this?detected meteors in image example
It could very well be your data and I think you are making a prudent move by improving the heterogeneity of your dataset, BUT it could also be your choice of model.
It is worth noting that ssd_mobilenet_v1_coco has the lowest COCO mAP relative to the other models in the TensorFlow Object Detection API model zoo. You aren't trying to detect a COCO object, but the mAP numbers are a reasonable aproximation for generic model accuracy.
At the highest possible level, the choice of model is largely a tradeoff between speed/accuracy. The model you chose, ssd_mobilenet_v1_coco, favors speed over accuracy. Consequently, I would reccomend you try one of the Faster RCNN models (e.g., faster_rcnn_inception_v2_coco) before you spend a signifigant amount of time preprocessing images.

deep learning for shape localization and recognition

There is a set of images, each of which contains different shape entities, such as shown in the following figure. I am trying to localize and recognize these different shapes. For instance, adding a bounding box for each different shape and maybe even label it. What are the major research papers/deep learning models that have been able to solve this kind of problem?
Object detection papers such as rcnn, faster rcnn, yolo and ssd would help you solve this if you were bent on using a deep learning approach.
It’s easy to say this is a trivial problem that can be solved with tools in OpenCV and deep learning is overkill, but I can see many reasons to use deep learning tools and that does not answer your question.
We assume that your shapes has different scales and rotations. Actually your main image shown above is very large for training process and it needs a lot of training samples to generate a good accuracy at the end on test samples. In this case it is better to train a Convolutional Neural Network on a short images (like 128x128) with only one shape per each image and then use slide trick!
This project will have three main steps:
Generate test and train samples, each image should have only one shape
Train a classifier to recognize a single shape within each input image
Use slide trick! Break your original image containing many shapes to overlapping blocks of size 128x128. Pass each block to your model trained in the second step.
In this way at the end you will have label for each shape from your trained model, and also you will have location of each shape using slide trick.
For the classifier you can use exactly CNN structure of Tensorflow's MNIST tutorial.
Here is a paper with exactly same method applied to finger print images to extract local features.
A direct fingerprint minutiae extraction approach based on convolutional neural networks