Using Tensorflow Object Detection API: RPN losses keep increasing. Are there ways to make RPN losses decrease? - tensorflow

I am using Tensorflow Object Detection API for fine-tuning, using my own data. The goal is to detect 2 classes of objects. I am using the pre-trained faster_rcnn_resnet101_coco model.
The various detection box precision and recall measures are generally increasing (see screenshots below) and are fairly high:
The box classifier losses are decreasing. However, the RPN losses are increasing (see screenshots below). It looks like the model is having a hard time distinguishing foreground from background (hence the increasing RPN losses), but once it identifies and locates the right foreground, it classifies well (hence the decreasing box classifier losses). I think this can be observed in the model's performance on test images: the false positive rate (on images that do not contain either of the two classes of target objects) is rather high. On the other hand, on images that do contain those target objects, the model does a fantastic job of accurately identifying and locating them.
So my question is essentially: what are some things I could try to make the RPN losses decrease as well?

Related

Is there any way to resize input shapes in SNPE (dlc)?

I have trained a model with TensorFlow. It is supposed to run on a mobile phone, but I ran into a problem when converting the frozen graph (pb) to a deep learning container (DLC): I have to set the input size to a constant, which means the model cannot work with arbitrary input sizes.
I am trying to find a way to resize the input shape of a DLC model without re-converting the model with "snpe-tensorflow-to-dlc --input_dims 1,512,512,3", because that approach is time-consuming.
In short, I want to resize the input shape of an existing DLC model. Can anybody help me?
Deployment solutions usually work with fixed input shapes because they assume a widely used usage model: resize all pictures to the same fixed size and then run inference. Because of this usage model, developers of deployment solutions usually prioritize inference time over model loading time. The same is true of SNPE, OpenVINO, TFLite, etc.
To illustrate the times, here are some results from a Snapdragon 820. Loading Inception v3 on the CPU takes 715 ms; loading it on the DSP takes 3 seconds. Inference on the CPU takes 1 second; inference on the DSP takes 100 ms. Loading takes longer on the DSP than on the CPU, but inference is much faster.
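To make the trade-off concrete, here is a small sketch that uses only the four timings quoted above and asks after how many images the slower-loading DSP overtakes the CPU. The function names are made up for illustration, and it assumes the model is loaded once and every inference takes the same time.

```python
import math


def total_ms(load_ms, infer_ms, n_images):
    """Total time: one model load plus n_images inferences."""
    return load_ms + infer_ms * n_images


def break_even_images(load_a, infer_a, load_b, infer_b):
    """Smallest image count at which option B is at least as fast as option A.

    Assumes B loads slower than A but runs each inference faster.
    """
    return math.ceil((load_b - load_a) / (infer_a - infer_b))


# (load ms, inference ms) as quoted in the answer above.
CPU = (715, 1000)
DSP = (3000, 100)

n = break_even_images(*CPU, *DSP)
print(n, total_ms(*CPU, n), total_ms(*DSP, n))
```

With these figures the DSP already wins from the third image onward, which is why load time is rarely the priority when the input shape stays fixed.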
At the same time, it is usually possible to change the shape before loading the model, on the assumption that all input pictures will have a size different from the one the model was trained on (but, again, the same size for all pictures). In SNPE this is done with SNPEBuilder::setInputDimensions.
If the model allows reshaping and there are no bugs in the SNPE implementation, the model can be reshaped and loaded.
I am not sure whether your usage model fits the one described in the first paragraph. Also, to benefit from different input sizes you would need to develop a special topology that is unlikely to be supported by SNPE. If you take a regular SSD, reshape it to different sizes and measure accuracy on a validation set, you will most likely get the best result at the shape the model was trained on.

Training SSD-MOBILENET V1 and the loss does not decrease

I'm new to everything about CNNs and TensorFlow. I have been training a pretrained model with ssd-mobilenev1-pets.config to detect columns of buildings for about one day, but the loss stays between 1 and 2 and has not decreased for the last 10 hours.
I realized that my input images are 128x128 and that SSD resizes the images to 300x300.
Does the size of the input images affect the training?
If so, should I retrain the network with larger input images? Or what would be another option to decrease the loss? My training dataset has 660 images and my test dataset has 166. I don't know whether that is enough images.
I really appreciate your help.
Loss values of ssd_mobilenet can be different from faster_rcnn. From EdjeElectronics' TensorFlow Object Detection Tutorial:
For my training on the Faster-RCNN-Inception-V2 model, it started at
about 3.0 and quickly dropped below 0.8. I recommend allowing your
model to train until the loss consistently drops below 0.05, which
will take about 40,000 steps, or about 2 hours (depending on how
powerful your CPU and GPU are). Note: The loss numbers will be
different if a different model is used. MobileNet-SSD starts with a
loss of about 20, and should be trained until the loss is consistently
under 2.
For more information: https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10#6-run-the-training
The SSD Mobilnet architecture demands additional training to suffice
the loss accuracy values of the R-CNN model, however, offers
practicality, scalability, and easy accessibility on smaller devices
which reveals the SSD model as a promising candidate for further
assessment (Fleury and Fleury, 2018).
For more information: Fleury, D. & Fleury, A. (2018). Implementation of Regional-CNN and SSD machine learning object detection architectures for the real time analysis of blood borne pathogens in dark field microscopy. MDPI AG.
I would recommend holding out 15%-20% of your images for testing, chosen so that they cover all the variety present in the training data. You said you have 650+ images for training and 150+ for testing, which is roughly 20% of the full dataset (about 25% of the size of the training set). It looks like you have enough images to start with. The more, the merrier, but make sure your model also has sufficient data to learn from!
Resizing the images does not contribute to the loss. It makes sure there is consistency across all images for the model to recognize them without bias. The loss has nothing to do with image resizing as long as every image is resized identically.
You have to stop and resume from checkpoints again and again if you want your model to fit well. Usually, you can get good accuracy by training the SSD MobileNet until the loss is consistently under 1. Ideally we want the loss to be as low as possible, but we also want to make sure the model is not overfitting. It is all about trial and error. (A loss between 0.5 and 1 seems to do the job well, but again it all depends on you.)
I think your model is underperforming because your test data contains more variety than your training data covers.
The training data has not given the model enough knowledge to handle the new variety in the test data (for example, your test data has images of buildings from new angles that are not sufficiently represented in the training data). In that case, I recommend pooling the full variety of images into the training data and then picking the test images from it, making sure you still have sufficient training data for the new poses. That is why I recommend holding out 15%-20% as test data.
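One simple way to act on this advice is to pool all images and draw the test set at random, so both sets sample the same variety. The sketch below is only an illustration: the file names are placeholders, `split_dataset` is a hypothetical helper, and a plain random split is assumed to be good enough (it does not guarantee that every angle or pose appears in both sets).

```python
import random


def split_dataset(items, test_fraction=0.2, seed=0):
    """Shuffle items reproducibly and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder names standing in for the 660 + 166 images from the question.
image_names = [f"img_{i:04d}.jpg" for i in range(826)]
train, test = split_dataset(image_names, test_fraction=0.2)
print(len(train), len(test))
```

Fixing the seed keeps the split stable between runs, so evaluation numbers stay comparable while you retrain.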

Tensorflow object detection: why is the location in image affecting detection accuracy when using ssd mobilenet v1?

I'm training a model to detect meteors within a picture of the night sky. I have a fairly small dataset of about 85 images, each annotated with a bounding box. I'm using transfer learning, starting from the ssd_mobilenet_v1_coco_11_06_2017 checkpoint with Tensorflow 1.4. I'm resizing images to 600x600 pixels during training. I'm using data augmentation in the pipeline configuration to randomly flip the images horizontally and vertically and to rotate them 90 degrees. After 5000 steps, the model converges to a loss of about 0.3 and will detect meteors, but it seems to matter where in the image the meteor is located. Do I have to train the model by giving examples of every possible location? I've attached a sample of a detection run where I tiled a meteor over the entire image and received various levels of detection (filtered to 50%). How can I improve this?
It could very well be your data and I think you are making a prudent move by improving the heterogeneity of your dataset, BUT it could also be your choice of model.
It is worth noting that ssd_mobilenet_v1_coco has the lowest COCO mAP relative to the other models in the TensorFlow Object Detection API model zoo. You aren't trying to detect a COCO object, but the mAP numbers are a reasonable approximation of generic model accuracy.
At the highest possible level, the choice of model is largely a tradeoff between speed and accuracy. The model you chose, ssd_mobilenet_v1_coco, favors speed over accuracy. Consequently, I would recommend you try one of the Faster RCNN models (e.g., faster_rcnn_inception_v2_coco) before you spend a significant amount of time preprocessing images.

Training keras with tensorflow: Redundancy in labelling the object or multiple labels on same object

I was training keras with tensorflow for person detection. After training, when I ran testing, many images contained redundant labels for a person, i.e., a single person in an image was labeled as a person multiple times. What is the actual reason behind this?
My training set contains nearly 2000 images, a single class person, batch=32, epoch=100, threshold=0.55 and testing images=250.
Overtraining may lead to redundant detections. Also, if you train on samples of people taken from different angles, the detector may show errors in real cases. If neither of these is the issue, non-maximum suppression, which keeps the highest-scoring box and discards the boxes that overlap it heavily, is the better option.
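For reference, here is a minimal NumPy sketch of greedy non-maximum suppression. It assumes boxes are given as [x1, y1, x2, y2] with positive area, and the 0.5 IoU threshold and the example boxes are illustrative values, not taken from the question.

```python
import numpy as np


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) as [x1, y1, x2, y2]; returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop boxes that overlap the best box too much.
        order = rest[iou <= iou_threshold]
    return keep


# Two near-duplicate boxes on one person plus one separate detection.
boxes = [[10, 10, 110, 210], [12, 8, 108, 205], [300, 50, 380, 220]]
scores = [0.9, 0.8, 0.7]
keep = non_max_suppression(boxes, scores)
print(keep)
```

Here the second box overlaps the first almost completely and is suppressed, while the third is kept. TensorFlow also provides tf.image.non_max_suppression, which may be more convenient inside a Keras/TensorFlow pipeline.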

Object detection with R-CNN?

What does R-CNN actually do? Is it like using features extracted by CNN to detect classes in a specified window area?
Is there any tensorflow implementation for this?
R-CNN uses the following algorithm:
Get region proposals for object detection (using selective search).
For each region, crop the area from the image and run it through a CNN that classifies the object.
There are more advanced algorithms built upon this, such as Fast R-CNN and Faster R-CNN.
Fast R-CNN:
Run the entire image through the CNN.
For each region from the region proposals, extract the area using an "RoI pooling" layer and then classify the object.
Faster R-CNN:
Run the entire image through the CNN.
Using the features computed by the CNN, find region proposals with a region proposal network.
For each object proposal, extract the area using an "RoI pooling" layer and then classify the object.
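To give a feel for the RoI pooling step in the two lists above, here is a rough NumPy sketch. It assumes the region is already expressed in integer feature-map cells with exclusive x2/y2, and it uses simple floor-based bins; real implementations differ in how they quantize and scale coordinates.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one region of a (H, W, C) feature map to a fixed grid.

    roi is (x1, y1, x2, y2) in feature-map cells, x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2, :]
    h, w, c = region.shape
    out_h, out_w = output_size
    # Bin edges that split the region into out_h x out_w cells.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    pooled = np.zeros((out_h, out_w, c), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Each bin covers at least one cell, even for tiny regions.
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1), :]
            pooled[i, j] = cell.max(axis=(0, 1))
    return pooled


feature_map = np.arange(36).reshape(6, 6, 1)  # toy 6x6 map, one channel
pooled = roi_max_pool(feature_map, roi=(1, 1, 5, 5))
print(pooled[:, :, 0])
```

Whatever the size of the proposal, the output has the same fixed shape, which is what lets the fully connected classifier run on every region.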
There are a lot of TensorFlow implementations, specifically of Faster R-CNN, which is the most recent variant; just google "Faster R-CNN TensorFlow".
Good luck
R-CNN is the daddy algorithm of all the algorithms mentioned; it paved the way for researchers to build better, more complex algorithms on top of it. I will try to explain R-CNN and its variants.
R-CNN, or Region-based Convolutional Neural Network
R-CNN consists of 3 simple steps:
Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000 region proposals
Run a convolutional neural net (CNN) on top of each of these region proposals
Take the output of each CNN and feed it into a) an SVM to classify the region and b) a linear regressor to tighten the bounding box of the object, if such an object exists.
Fast R-CNN:
Fast R-CNN immediately followed R-CNN. It is faster and better by virtue of the following points:
Performing feature extraction over the image before proposing regions, thus running only one CNN over the entire image instead of 2000 CNNs over 2000 overlapping regions
Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model.
Intuitively, it makes a lot of sense to replace 2000 CNN passes with a single convolutional pass and build the boxes on top of it.
Faster R-CNN:
One of the drawbacks of Fast R-CNN was the slow selective search algorithm, so Faster R-CNN introduced the Region Proposal Network (RPN).
Here is how the RPN works:
At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d). For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes).
Each region proposal consists of:
An “objectness” score for that region and
4 coordinates representing the bounding box of the region
In other words, we look at each location in our last feature map and consider k different boxes centered around it: a tall box, a wide box, a large box, etc.
For each of those boxes, we output whether or not we think it contains an object, and what the coordinates for that box are. This is what it looks like at one sliding window location:
The 2k scores represent the softmax probability of each of the k bounding boxes being an “object.” Notice that although the RPN outputs bounding box coordinates, it does not try to classify any potential objects: its sole job is still proposing object regions. If an anchor box has an “objectness” score above a certain threshold, that box’s coordinates get passed forward as a region proposal.
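The anchor idea can be sketched in a few lines of NumPy. The stride of 16, the three scales and the three aspect ratios (k = 9) are assumptions chosen for illustration, as is the 0.7 threshold; the sketch also passes the raw anchors forward, whereas a real RPN first refines them with its predicted box offsets.

```python
import numpy as np


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors as [x1, y1, x2, y2] in pixels."""
    # k anchor shapes: each has area scale**2 and height/width equal to ratio.
    shapes = np.array([(s / np.sqrt(r), s * np.sqrt(r))
                       for s in scales for r in ratios])  # (k, 2) as (w, h)
    # One anchor centre per feature-map cell, mapped back to image pixels.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)  # (H*W, 2)
    half = shapes / 2.0
    top_left = centers[:, None, :] - half[None, :, :]
    bottom_right = centers[:, None, :] + half[None, :, :]
    return np.concatenate([top_left, bottom_right], axis=2).reshape(-1, 4)


def select_proposals(anchors, objectness, threshold=0.7):
    """Keep the anchors whose objectness score exceeds the threshold."""
    return anchors[np.asarray(objectness) > threshold]


anchors = generate_anchors(feat_h=2, feat_w=3)  # 2 * 3 locations, 9 anchors each
scores = np.zeros(len(anchors))
scores[[0, 5]] = 0.9  # pretend the network is confident about two anchors
proposals = select_proposals(anchors, scores)
print(anchors.shape, proposals.shape)
```

Each feature-map location contributes the same k shapes, so the number of candidate boxes grows with the feature map size, not with the number of objects in the image.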
Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We add a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.
Some TensorFlow implementations:
https://github.com/smallcorgi/Faster-RCNN_TF
https://github.com/CharlesShang/FastMaskRCNN
You can find a lot of implementations on GitHub.
P.S. I borrowed a lot of material from Joyce Xu's Medium blog.