TensorFlow high false-positive rate and non-max-suppression issue

I am training the TensorFlow Object Detection API on Windows 10, using faster_rcnn_inception_v2_coco as the pretrained model. I'm running tensorflow-gpu 1.6 on an NVIDIA GeForce GTX 1080, with CUDA 9.0 and cuDNN 7.0.
My dataset contains only one object class, "Pistol", and 3000 images (2700 in the train set, 300 in the test set). The image sizes range from about 100x200 to about 800x600.
I trained the model for 55k iterations, where the mAP was ~0.8 and the TotalLoss seemed to converge to 0.001. However, looking at the evaluation, there are many overlapping bounding boxes on the same detected object (e.g. this and this) and a lot of false positives (a house detected as a pistol). For example, in this photo taken by me (a blur filter was applied later), the model detects a person and a car as pistols, in addition to the correct detection.
The dataset is uploaded here, together with the tfrecords and the label map.
I used this config file, where the only things that I changed are: num_classes to 1, the fine_tune_checkpoint, input_path and label_map_path for train and eval, and num_examples.
Since I thought the multiple boxes were a non-max-suppression problem, I changed the score_threshold (line 73) from 0 to 0.01 and the iou_threshold (line 74) from 1 to 0.6. With the standard values the outcome was even worse.
What can I do to get good detections? What should I change? Maybe I am missing something about parameter tuning...
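For reference, the block I edited is the second-stage post-processing part of the pipeline config; roughly like this (sketched from memory, so the exact line numbers and default values may differ between versions of the sample config, and there are also separate first_stage_nms_score_threshold / first_stage_nms_iou_threshold fields for the RPN):
second_stage_post_processing {
  batch_non_max_suppression {
    score_threshold: 0.01
    iou_threshold: 0.6
    max_detections_per_class: 100
    max_total_detections: 100
  }
  score_converter: SOFTMAX
}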
Thanks

I think that before diving into parameter tuning (i.e. the mentioned score_threshold) you will have to review your dataset.
I didn't check the entire dataset you shared, but from a high-level view the main problem I found is that most of the images are really small and have highly variable aspect ratios.
In my opinion this conflicts with this part of your configuration file:
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}
If you take one of the images of your dataset and manually apply that transformation, you will see that the result is very noisy for small images and very deformed for the many images that have a different aspect ratio.
I would highly recommend rebuilding your dataset with higher-resolution images, and maybe preprocessing the images with unusual aspect ratios with padding, cropping or other strategies.
If you want to stick with the small images you'd at least have to change the min and max dimensions of the image_resizer, but, from my experience, the biggest problem here is the dataset and I would invest the time in trying to fix that.
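A small OpenCV sketch of the padding idea (just an illustration, the filename is a placeholder, and remember that padding also means shifting your bounding-box annotations by the same offsets):
import cv2

def pad_to_square(img, value=0):
    # Pad the shorter side with a constant border so every image reaches the
    # image_resizer with the same aspect ratio instead of being deformed.
    h, w = img.shape[:2]
    diff = abs(h - w)
    if h < w:   # wider than tall -> pad top and bottom
        return cv2.copyMakeBorder(img, diff // 2, diff - diff // 2, 0, 0,
                                  cv2.BORDER_CONSTANT, value=value)
    # taller than wide -> pad left and right
    return cv2.copyMakeBorder(img, 0, 0, diff // 2, diff - diff // 2,
                              cv2.BORDER_CONSTANT, value=value)

padded = pad_to_square(cv2.imread('pistol_0001.jpg'))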
PS.
I don't see the house false positive as a big problem if we consider that it comes from a totally different domain than your dataset.
You could probably adjust the minimum confidence required to consider a detection a true positive and remove it.
If you take the current winner of COCO and feed it with strange images, like frames from a cartoon, you will see that it generates a lot of false positives.
So it's more of a problem with current object detection approaches, which are not robust to domain changes.

A lot of people I see online have been running into the same issue using the Tensorflow API. I think there are some inherent problems with the idea/process of using pretrained models with custom classifier(s) at home. For example, people want to use SSD Mobile or Faster RCNN Inception to detect objects like "person with helmet," "pistol," or "tool box," etc. The general process is to feed in images of that object, but most of the time, no matter how many images (200 to 2000), you still end up with false positives when you actually run it at your desk.
The object classifier works great when you show it the object in its own context, but you end up getting 99% matches on everyday items like your bedroom window, your desk, your computer monitor, keyboard, etc. People have mentioned the strategy of introducing negative images or soft images. I think the problem has to do with the limited context in the images that most people use.
The pretrained models were trained with over a dozen classes in a wide variety of environments. In one example it could be a car on the street: the CNN sees the car, and then everything in that image that is not a car is a negative example, which includes the street, buildings, sky, etc. In another image, it can see a bottle and everything else in that image, which includes desks, tables, windows, etc. I think the problem with training custom classifiers is a negative-image problem. Even if you have enough images of the object itself, there isn't enough data of that same object in different contexts and backgrounds. So in a sense, there are not enough negative images, even though conceptually you shouldn't need negative images. When you run the algorithm at home you get false positives all over the place identifying objects around your own room.
I think the idea of transfer learning in this way is flawed. We just end up seeing a lot of great tutorials online of people identifying playing cards, Millennium Falcons, etc., but none of those models are deployable in the real world, as they would all generate a bunch of false positives when they see anything outside of their image pool. The best strategy would be to retrain the CNN from scratch with multiple classes and add the desired ones in there as well. I suggest re-introducing a previous dataset from ImageNet or Pascal with 10-20 pre-existing classes, adding your own, and retraining.
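If you go that route with the TF Object Detection API, the label map would look roughly like the sketch below (just an illustration: the ids and names are placeholders, and every class you list also needs annotated boxes in your tfrecords):
item {
  id: 1
  name: 'person'
}
item {
  id: 2
  name: 'car'
}
# ... more pre-existing Pascal/COCO classes ...
item {
  id: 20
  name: 'bottle'
}
item {
  id: 21
  name: 'pistol'
}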

Related

Object detection project (root architecture) using Tensorflow + Keras. Image sample size for accurate training of model?

I'm currently working on a project at university, where we are using Python + TensorFlow and Keras to train an image object detector to detect different parts of the root system of Arabidopsis.
Our current results are pretty bad, as we only have about 100 images to train the model with at the moment, but we are currently working on cultivating more plants in order to get more images (more data) to train the TensorFlow model.
We have implemented the following Mask_RCNN model: Github - Mask_RCNN tensorflow
We are looking to detect three object classes: stem, main root and secondary root.
But the model detects main roots incorrectly where the secondary roots are located.
It should be able to detect something like this: Root detection example
Training root data set that we are using right now: training images
What is the usual sample size used to train a neural network for accurate results?
First off: I think there is no simple rule to estimate the sample size, but at the very least it depends on:
1. Quality of your images
I downloaded the images and I think you need to preprocess them before you can use them, to reduce the "problem complexity". In some projects in which I worked with biological data, background removal (image minus a low-pass filter) was the key to getting better results. You should also definitely remove/crop the area outside your region of interest (like the tape and the ruler). I would try to get the cleanest dataset possible (including manual adjustments with cv2/gimp/etc.) to focus the network on solving "the right problem". After that you could apply some random distortion so it also works on fuzzy/bad/realistic images.
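To make the "image minus low-pass filter" idea concrete, a rough OpenCV sketch (assumptions: grayscale scans with roots darker than the background, a placeholder filename, and a kernel size you would have to tune; swap the subtraction operands if your roots are lighter than the background):
import cv2

img = cv2.imread('root_scan.jpg', cv2.IMREAD_GRAYSCALE)

# Estimate the slowly varying background with a strong low-pass (Gaussian) filter.
background = cv2.GaussianBlur(img, (101, 101), 0)

# "image - low pass filter": keeps the fine root structures, removes uneven lighting.
# Roots here are assumed darker than the background, hence background - image.
foreground = cv2.subtract(background, img)
foreground = cv2.normalize(foreground, None, 0, 255, cv2.NORM_MINMAX)

cv2.imwrite('root_scan_clean.jpg', foreground)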
2. The way you work with your data
There are a few tricks that enable you to "expand" your dataset.
Sometimes it's very helpful to let a generator method crop random small patches from your input data. This allows you to work with more batches (on small GPUs) and gives your network more "variety" (just think about the conv2d task: if you don't use random cropping, your filters will slide over the same areas over and over again on the same image). For the same reason: apply random distortion, and flip and rotate your images.
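A minimal sketch of such a generator (assuming the images and their segmentation masks are already loaded as numpy arrays larger than the patch size; patch and batch sizes are arbitrary placeholders):
import numpy as np

def random_patch_generator(images, masks, patch=128, batch_size=8):
    # images: list of (H, W, 3) arrays, masks: list of (H, W, n_classes) arrays
    while True:
        xs, ys = [], []
        for _ in range(batch_size):
            i = np.random.randint(len(images))
            img, msk = images[i], masks[i]
            y0 = np.random.randint(0, img.shape[0] - patch + 1)
            x0 = np.random.randint(0, img.shape[1] - patch + 1)
            xi = img[y0:y0 + patch, x0:x0 + patch]
            yi = msk[y0:y0 + patch, x0:x0 + patch]
            if np.random.rand() < 0.5:          # random horizontal flip
                xi, yi = xi[:, ::-1], yi[:, ::-1]
            k = np.random.randint(4)            # random 90 degree rotation
            xi, yi = np.rot90(xi, k), np.rot90(yi, k)
            xs.append(xi)
            ys.append(yi)
        yield np.stack(xs), np.stack(ys)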
3. Network architecture
In your case I would prefer a U-Net architecture with a final conv2d output of 3 feature maps (your classes), a final softmax activation and a categorical_crossentropy loss. This lets you play with the depth: sometimes you need sophisticated architectures to solve a problem (close to 100%), but in your case you just want to see a first working result, so fewer layers and a simple architecture could also help you get things working. Maybe there are trained network weights for a U-Net that meet your requirements (search on kaggle, for example), because it is also helpful (to reduce the data you need) to use "transfer learning", i.e. reuse the first layers (weights) of a network that is already trained. In semantic segmentation the first filters become something like an edge detector for most problems/images.
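To illustrate, a very small U-Net-style sketch in Keras (assuming TensorFlow's Keras, 3 classes and one-hot encoded masks; the depth, filter counts and input size are just placeholders for a first working result):
from tensorflow.keras import layers, models

def tiny_unet(input_shape=(256, 256, 3), n_classes=3):
    inp = layers.Input(input_shape)
    # encoder
    c1 = layers.Conv2D(16, 3, activation='relu', padding='same')(inp)
    c1 = layers.Conv2D(16, 3, activation='relu', padding='same')(c1)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(32, 3, activation='relu', padding='same')(p1)
    c2 = layers.Conv2D(32, 3, activation='relu', padding='same')(c2)
    p2 = layers.MaxPooling2D()(c2)
    # bottleneck
    b = layers.Conv2D(64, 3, activation='relu', padding='same')(p2)
    # decoder with skip connections
    u1 = layers.concatenate([layers.UpSampling2D()(b), c2])
    c3 = layers.Conv2D(32, 3, activation='relu', padding='same')(u1)
    u2 = layers.concatenate([layers.UpSampling2D()(c3), c1])
    c4 = layers.Conv2D(16, 3, activation='relu', padding='same')(u2)
    # one feature map per class plus softmax, as described above
    out = layers.Conv2D(n_classes, 1, activation='softmax')(c4)
    model = models.Model(inp, out)
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model
Once something like this trains at all, you can deepen it or swap in pretrained encoder weights step by step.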
4. Your mental model of "accurate results"
This is the hardest part, because it evolves during your project. E.g. the moment your network starts to perform well on preprocessed input images, you will start to think about architecture/data changes to make it work on fuzzy images as well. This is why you should start with a feasible problem, but always improve your dataset (including rare kinds of roots) and tune your network architecture step by step.

Tensorflow faster rcnn giving good detection but still detecting false positives with coco objects

I have used the tensorflow API to detect the Guinness harp using the process described here - https://pythonprogramming.net/introduction-use-tensorflow-object-detection-api-tutorial/.
I have mostly good results; whenever the logo is clear in the image it finds it nicely -
However, after retraining from a coco checkpoint, it still detects what I think are coco objects with a very high confidence rating, i.e. people, magazines. I cannot work out why this is.
(see below)
I am using the faster_rcnn_inception_v2_coco.config found here - https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/faster_rcnn_inception_v2_coco.config
Training for more steps does not seem to help, as the total loss averages out. The above screenshots were from 10,000 training steps. I am training on a CPU.
I am augmenting my training images using imgaug, and an example training image can be seen below (I have included the debug bounding box around the target) -
However, if the training images were the problem, wouldn't the graph have trouble detecting the target altogether?
I had a similar issue recently; from what I saw it looks somewhat like a case of underfitting. I tried multiple things to improve the results.
The thing that worked for me was actually augmenting the data using the imgaug library. You can augment the images as well as the bounding boxes using a simple script; try to increase the dataset by, say, 10-12 fold.
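For reference, bounding-box-aware augmentation with imgaug looks roughly like this (a sketch written against the imgaug 0.2.x-style API; the augmenter choices, the dummy image and the box coordinates are placeholders):
import numpy as np
import imgaug as ia
import imgaug.augmenters as iaa

image = np.zeros((300, 300, 3), dtype=np.uint8)       # stand-in for one training image
bbs = ia.BoundingBoxesOnImage(
    [ia.BoundingBox(x1=65, y1=100, x2=200, y2=150)],  # placeholder ground-truth box
    shape=image.shape)

seq = iaa.Sequential([
    iaa.Fliplr(0.5),                                  # horizontal flips
    iaa.Affine(rotate=(-10, 10), scale=(0.9, 1.1)),   # small rotations and zooms
    iaa.Multiply((0.8, 1.2)),                         # brightness changes
])

seq_det = seq.to_deterministic()    # apply the same random transform to image and boxes
image_aug = seq_det.augment_image(image)
bbs_aug = seq_det.augment_bounding_boxes([bbs])[0]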
I would also suggest adding some background images, i.e. images with no object; it was recommended by a few people in the tensorflow discussion in the issues.
Try training on the dataset again and monitor it using TensorBoard. I think you will be able to reduce the number of false positives significantly.

convolutional neural network image recognition

Currently I am working on a project with a convolutional network using TensorFlow. I have set up the network and now I need to train it. I don't have a clue how the images should look for training, e.g. what percentage of the image the object should occupy.
It's a cigarette that I have to detect, and I have tried around 280 individual pictures where the cigarette is about 2-5% of the image. I'm thinking of scrapping those pictures and taking new ones where the cigarette is about 30-50% of the image.
All the pictures are taken outside in a street environment.
So my question is: is there any kind of rule regarding good pictures in a training set?
I will report back when I have tried my own solution
The object you are trying to recognise is too small. Of the samples, I think the first one will be the best bet for you. A convolutional neural network works by doing convolution operations on image pixels. In the second picture, the background is too large compared to the object you are trying to recognise. Training on such data will not help you.
Just trying to answer your rule question:
Make sure that the cigarette occupies the maximum region of the image. It can be 50% to 90% (with experience). You can still identify cigarettes with 2 to 3% area, but you would need millions of images with varying backgrounds.
A CNN learns from the input image. Looking at the sample images you shared (I guess all the images are taken from roadside platforms and grass areas), the CNN may not learn to find the cigarette; instead it will learn to detect the common background if the background occupies the maximum area of the image. Please make sure to use different background patterns.

Counting Pedestrians Using TensorFlow's Object Detection

I am new to the machine learning field, and based on what I have seen on YouTube and read on the internet I conjectured that it might be possible to count pedestrians in a video using TensorFlow's Object Detection API.
Consequently, I did some research on TensorFlow, read the documentation about how to install it, and finally downloaded and installed it. Using the sample files provided on GitHub, I adapted the code from the object_detection notebook provided here -> https://github.com/tensorflow/models/tree/master/research/object_detection.
I executed the adapted code on the videos that I collected, while making changes to the visualization_utils.py script so as to report the number of objects that cross a defined region of interest on the screen. That is, I collected the bounding box dimensions (left, right, top, bottom) of the person class and counted all the detections that crossed the defined region of interest (imagine a set of two virtual vertical lines on the video frame with left and right pixel values, and then comparing the detected bounding box's left and right values with the predefined values).
However, when I use this procedure I am missing a lot of pedestrians even though they are detected by the program. That is, the program correctly classifies them as persons, but sometimes they don't meet the criteria that I defined for counting, and as such they are not counted. I want to know if there is a better way of counting unique pedestrians than the simplistic method that I am trying to develop. Is the approach that I am using the right one? Could there be other, better approaches? I would appreciate any kind of help.
Please go easy on me as I am not a machine learning expert and just a novice.
You are using a pretrained model which is trained to identify people in general. I think you're saying that some people are pedestrians whereas other people are not; for example, someone standing waiting at the light is a pedestrian, but someone standing in their garden behind the street is not.
If I'm right, then you've reached the limitations of what you'll get with this model and you will probably have to train a model yourself to do what you want.
Since you're new to ML, building your own dataset and training your own model probably sounds like a tall order; there's a learning curve, to be sure. So I'll suggest the easiest way forward: use the object detection model to identify people, then train a new binary classification model (about the easiest model to train) to identify whether a particular person is a pedestrian or not (you will create a dataset of images and 1/0 values to label them as pedestrian or not). I suggest this because a boolean classification model is about as easy a model as you can get, and there are dozens of tutorials you can follow. Here's a good one:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/neural_network.ipynb
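To make the suggestion concrete, here is a rough Keras transfer-learning sketch (this is not the linked tutorial's code; it assumes TF 2.x, ImageNet weights, and that you have already cropped detected people into a hypothetical crops/pedestrian and crops/not_pedestrian folder structure):
import tensorflow as tf

# Expected layout (placeholder paths): crops/pedestrian/*.jpg, crops/not_pedestrian/*.jpg
base = tf.keras.applications.MobileNetV2(input_shape=(128, 128, 3), include_top=False,
                                          weights='imagenet', pooling='avg')
base.trainable = False                  # keep the ImageNet features frozen at first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation='sigmoid')    # pedestrian vs. not pedestrian
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255,
                                                       validation_split=0.2)
train = gen.flow_from_directory('crops', target_size=(128, 128),
                                class_mode='binary', subset='training')
val = gen.flow_from_directory('crops', target_size=(128, 128),
                              class_mode='binary', subset='validation')
model.fit(train, validation_data=val, epochs=10)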
A few things to note when doing this:
When you build your dataset you will want a set of images, at least a few thousand, along with the 1/0 classification for each (pedestrian or not pedestrian).
You will get much better results if you start with a model that is pretrained on ImageNet than if you train it from scratch (though this might be a reasonable step 2, as it's an extra task), especially if you only have a few thousand images to train on.
Since your images will have multiple people in them, you have the problem of identifying which person you want the model to classify as a pedestrian or not. There's no one right way to do this necessarily. If you draw a yellow box around the target person, the network may be successful in learning this notation. Another valid approach might be to remove the other people that were detected in the image by deleting them and leaving those areas black. Centering on the target person may also be a reasonable approach.
My last bullet point illustrates a problem with the idea as I've proposed it. The best solution would be to alter the object detection network to output both a bounding box per person and a pedestrian/non-pedestrian classification with it, or to train the model to identify only pedestrians in the first place. I mention this as more optimal, but I consider it a more advanced task than my first suggestion, with a more complex dataset to manage. It's probably not the first thing you want to tackle as you learn your way around ML.

small object detection with faster-RCNN in tensorflow-models

I'm attempting to train a faster-rcnn model for small digit detection. I'm using the newly released TensorFlow Object Detection API and so far have been fine-tuning a pretrained faster_rcnn_resnet101_coco from the model zoo. All my training attempts have resulted in models with high precision but low recall. Out of the ~120 objects (digits) on each image, only ~20 objects are ever detected, but when detected the classification is accurate. (Also, I am able to train a simple convnet from scratch on my cropped images with high accuracy, so the problem is in the detection aspect of the model.) Each digit is on average 60x30 in the original images (and probably about half that size after the image is resized before being fed into the model). Here is an example image with detected boxes of what I'm seeing:
What is odd to me is how it is able to correctly detect neighboring digits but completely miss the rest that are very similar in terms of pixel dimensions.
I have tried adjusting the hyperparameters around anchor box generation and first_stage_max_proposals, but nothing has improved the results so far. Here is an example config file I have used. What other hyperparameters should I try adjusting? Any other suggestions on how to diagnose the problem? Should I be looking into other architectures, or does my task look doable with faster-rcnn and/or SSD?
In the end the immediate problem was that I was not using the visualizer correctly. By updating the parameters for visualize_boxes_and_labels_on_image_array as described by Johnathan in the comments, I was able to see that I am at least detecting more boxes than I had thought.
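For anyone hitting the same thing, the relevant parameters are on the visualization call in the inference notebook; roughly like this (a sketch, assuming the variable names from the standard object_detection tutorial notebook; exact defaults may differ between API versions):
import numpy as np
from object_detection.utils import visualization_utils as vis_util

# image_np, boxes, scores, classes and category_index come from the standard
# object_detection inference notebook.
vis_util.visualize_boxes_and_labels_on_image_array(
    image_np,
    np.squeeze(boxes),
    np.squeeze(classes).astype(np.int32),
    np.squeeze(scores),
    category_index,
    use_normalized_coordinates=True,
    max_boxes_to_draw=200,       # default is only 20, which can hide detections
    min_score_thresh=0.3,        # default is 0.5; lower it to see weaker boxes
    line_thickness=2)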
I checked your config file: you are decreasing the resolution of your images to 1024. The region of each digit will not contain many pixels and you are losing information. What I suggest is to train the model on another dataset with smaller images. You could, for example, crop each image into four areas.
If you have a good GPU, increase the max dimension in the image_resizer, but I guess you will run out of memory.
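A quick sketch of the cropping idea (the paths are hypothetical, and keep in mind the bounding-box annotations have to be recomputed for each crop):
import os
import cv2

def crop_into_quadrants(path, out_dir):
    # Split one large image into four quadrants so each digit keeps more pixels
    # after the detector's image_resizer step.
    img = cv2.imread(path)
    h, w = img.shape[:2]
    name = os.path.splitext(os.path.basename(path))[0]
    for i, (y0, y1) in enumerate([(0, h // 2), (h // 2, h)]):
        for j, (x0, x1) in enumerate([(0, w // 2), (w // 2, w)]):
            out = os.path.join(out_dir, '%s_q%d%d.jpg' % (name, i, j))
            cv2.imwrite(out, img[y0:y1, x0:x1])

crop_into_quadrants('digits/img_0001.jpg', 'digits_cropped/')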