convolutional neural network image recognition - tensorflow

I am currently working on a project with a convolutional network in TensorFlow. I have set up the network and now need to train it, but I don't have a clue what the training images should look like, in particular what percentage of the image the object should occupy.
I have to detect a cigarette, and I have tried around 280 individual pictures in which the cigarette takes up about 2-5% of the image. I'm thinking of scrapping those pictures and taking new ones in which the cigarette takes up about 30-50% of the image.
All the pictures are taken outside in a street environment.
So my question is: are there any rules for what makes a good picture in a training set?
I will report back when I have tried my own solution.

The object you are trying to recognise is too small. Of the samples, I think the first one is your best bet. A convolutional neural network works by applying convolution operations to the image pixels. In the second picture, the background is far too large compared to the object you are trying to recognise, and training on such data will not help you.

To answer your question about rules:
Make sure the cigarette occupies most of the image. In my experience, 50% to 90% works. You can still detect cigarettes that cover only 2 to 3% of the area, but you would need millions of images with varying backgrounds.
A CNN learns from whatever is in the input image. Looking at the sample images you shared, I guess they were all taken on roadside pavements and grass. If the background occupies most of the image, the CNN may not learn to find the cigarette and may instead learn to detect the common background. Please make sure to include a variety of background patterns.
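To check where an existing picture stands against the percentages above, here is a minimal sketch. It assumes you have a bounding box for the cigarette as (xmin, ymin, xmax, ymax) in pixels; that box format and both function names are my own assumptions, not part of any library. The second function computes a crop window in which the object covers roughly a target share, which may let you reuse the 2-5% pictures instead of reshooting them.

```python
import math


def object_area_fraction(box, image_width, image_height):
    """Fraction of the image area covered by box = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    # Clip to the image so annotations that run off the frame do not inflate the ratio.
    xmin, xmax = max(0, xmin), min(image_width, xmax)
    ymin, ymax = max(0, ymin), min(image_height, ymax)
    box_area = max(0, xmax - xmin) * max(0, ymax - ymin)
    return box_area / float(image_width * image_height)


def crop_around_box(box, image_width, image_height, target_fraction=0.5):
    """Crop window (left, top, right, bottom) in which the box covers
    roughly target_fraction of the area (more if the window hits the border)."""
    xmin, ymin, xmax, ymax = box
    # Growing both sides by sqrt(1 / f) makes the box cover a share f of the window.
    scale = math.sqrt(1.0 / target_fraction)
    center_x, center_y = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    half_w = (xmax - xmin) * scale / 2.0
    half_h = (ymax - ymin) * scale / 2.0
    left = max(0, int(round(center_x - half_w)))
    top = max(0, int(round(center_y - half_h)))
    right = min(image_width, int(round(center_x + half_w)))
    bottom = min(image_height, int(round(center_y + half_h)))
    return left, top, right, bottom


# Example: a 100x50 pixel object in a 1000x500 picture covers 1% of it.
print(object_area_fraction((100, 100, 200, 150), 1000, 500))
print(crop_around_box((100, 100, 200, 150), 1000, 500, target_fraction=0.5))
```

Cropping this tightly also removes most of the shared background, so it only helps with the background problem if the remaining context still varies between pictures.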

Related

Deep Learning Model for Complicated Pattern Recognition

I am using transfer learning with ResNet50 to recognize snack packets.
The packets are similar to one another in dominant color and shape, like those in the images below.
I have about 33 items to recognize.
I used Faster R-CNN and SSD with ResNet50.
Neither does well, and many items are confused with each other.
Which deep learning architecture is suitable for recognizing such objects?
Or are there any special tricks for getting better recognition of such objects?
I think we need an architecture that can recognize detailed patterns.
Make sure you are actually loading the original pre-trained network in Caffe; otherwise you are starting network training from the beginning!
If you're looking to increase your dataset size: I frequently take the same image set and rotate each image a few times.
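Here is a rough sketch of the rotation trick, using plain nested lists as a stand-in for image arrays so it runs without any image library. The function names and the (xmin, ymin, xmax, ymax) box format are assumptions of mine. Since the question uses detectors (Faster R-CNN, SSD), the bounding boxes have to be rotated along with the pixels, so that is included too.

```python
def rotate_90_clockwise(image):
    """Rotate an image given as a list of rows by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotate_box_90_clockwise(box, image_height):
    """Rotate box = (xmin, ymin, xmax, ymax) with the image.

    image_height is the height of the image BEFORE rotation.
    """
    xmin, ymin, xmax, ymax = box
    # A point (x, y) moves to (image_height - y, x) under a clockwise quarter turn.
    return (image_height - ymax, xmin, image_height - ymin, xmax)


def augment_with_rotations(image):
    """Return the image at 0, 90, 180 and 270 degrees."""
    rotations = [image]
    for _ in range(3):
        rotations.append(rotate_90_clockwise(rotations[-1]))
    return rotations


# Tiny 2x3 "image" so the effect is easy to see by eye.
sample = [[1, 2, 3],
          [4, 5, 6]]
for rotated in augment_with_rotations(sample):
    print(rotated)
```

With a real image library you would use its own rotate call instead; the point is only that each source picture yields several training samples.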
Definitely decrease your image size, and consider giving your images less background noise to work with (people, variable backgrounds, etc.).
In the past I have used AlexNet for similar problems with small feature differences.
Best of luck!

Tensorflow high false-positive rate and non-max-suppression issue

I am training the TensorFlow Object Detection API using faster_rcnn_inception_v2_coco as the pretrained model. I'm on Windows 10, with tensorflow-gpu 1.6 on an NVIDIA GeForce GTX 1080, CUDA 9.0 and cuDNN 7.0.
My dataset contains only one object class, "Pistol", and 3000 images (2700 in the train set, 300 in the test set). Image sizes range from ~100x200 to ~800x600.
I trained this model for 55k iterations, at which point the mAP was ~0.8 and the TotalLoss seemed to have converged to 0.001. However, looking at the evaluation, there are many cases of multiple bounding boxes on the same detected object (e.g. this and this) and a lot of false positives (a house detected as a pistol). For example, in this photo taken by me (a blur filter was applied later), the model detects a person and a car as pistols, as well as making the correct detection.
The dataset is uploaded here, together with the tfrecords and the label map.
I used this config file, where the only things I changed are num_classes (set to 1), the fine_tune_checkpoint, the input_path and label_map_path for train and eval, and num_examples.
Since I thought the multiple boxes were a non-max-suppression problem, I changed the score_threshold (line 73) from 0 to 0.01 and the iou_threshold (line 74) from 1 to 0.6. With the standard values the outcome was much worse than this.
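For reference, this is roughly what those two thresholds do. It is a simplified pure-Python sketch of greedy non-max suppression, not the code the Object Detection API actually runs; the function names and the (xmin, ymin, xmax, ymax) box format are assumptions for illustration. It shows why an iou_threshold of 1 keeps every overlapping box while 0.6 removes near-duplicates.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.01, iou_threshold=0.6):
    """Return indices of the boxes that survive greedy NMS."""
    # Drop low-confidence boxes first, then visit the rest from best to worst.
    candidates = sorted(
        (i for i, score in enumerate(scores) if score >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores, iou_threshold=0.6))  # near-duplicate removed
print(non_max_suppression(boxes, scores, iou_threshold=1.0))  # nothing is suppressed
```

The first two boxes overlap with an IoU of about 0.68, so only a threshold below that merges them into one detection.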
What can I do to get good detections? What should I change? Maybe I am missing something about parameter tuning...
Thanks
I think that before diving into parameter tuning (i.e. the score_threshold you mention) you will have to review your dataset.
I didn't check the entire dataset you shared, but from a high-level view the main problem I found is that most of the images are really small and have highly variable aspect ratios.
In my opinion this conflicts with this part of your configuration file:
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}
If you take one of the images of your dataset and manually apply that transformation, you will see that the result is very noisy for small images and very deformed for many images that have a different aspect ratio.
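To get a feel for how much the small images are blown up, here is a sketch of the size computation as I understand it: the shorter side is scaled to min_dimension unless that would push the longer side past max_dimension, in which case the longer side is scaled to max_dimension instead. The function name is made up and the rounding may differ from the real resizer, so treat the exact numbers as approximate.

```python
def keep_aspect_ratio_size(width, height, min_dimension=600, max_dimension=1024):
    """Approximate output size and scale factor of an aspect-preserving resize."""
    # First try: bring the shorter side up (or down) to min_dimension.
    scale = min_dimension / float(min(width, height))
    # If the longer side would then exceed max_dimension, cap it instead.
    if max(width, height) * scale > max_dimension:
        scale = max_dimension / float(max(width, height))
    return int(round(width * scale)), int(round(height * scale)), scale


# The two extremes mentioned in the question.
for width, height in [(100, 200), (800, 600)]:
    new_w, new_h, scale = keep_aspect_ratio_size(width, height)
    print("%dx%d -> %dx%d (scale %.2f)" % (width, height, new_w, new_h, scale))
```

Under these assumptions the ~100x200 images are enlarged about five times per side, which would go a long way toward explaining the noisy results, while the ~800x600 images pass through unchanged.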
I would highly recommend rebuilding your dataset with higher-resolution images, and maybe preprocessing the images with unusual aspect ratios using padding, cropping or other strategies.
If you want to stick with the small images, you would at least have to change the min and max dimensions of the image_resizer. From my experience, though, the biggest problem here is the dataset, and I would invest the time in trying to fix that.
P.S.
I don't see the house false positive as a big problem, considering that it comes from a totally different domain than your dataset.
You could probably raise the minimum confidence required to count a detection as a true positive and remove it that way.
If you take the current winner of COCO and feed it strange images, such as cartoons, you will see that it generates a lot of false positives.
So it's more a problem with current object detection approaches, which are not robust to domain changes.
A lot of people I see online have been running into the same issue with the TensorFlow Object Detection API. I think there are some inherent problems with the idea of fine-tuning the pretrained models on custom classes at home. For example, people want to use SSD MobileNet or Faster R-CNN Inception to detect objects like "person with helmet", "pistol" or "tool box". The general process is to feed in images of that object, but most of the time, no matter how many images you use (200 to 2000), you still end up with false positives when you actually run it at your desk.
The object detector works great when you show it the object in its own context, but you end up getting 99% matches on everyday items like your bedroom window, your desk, your computer monitor, your keyboard, etc. People have mentioned the strategy of introducing negative images or soft images. I think the problem has to do with the limited context in the images that most people use. The pretrained models were trained on over a dozen classes in a wide variety of environments. One example could be a car on the street: the CNN sees the car, and everything in that image that is not a car serves as a negative example, including the street, buildings, sky, etc. In another image it sees a bottle, and everything else in that image, including desks, tables and windows, is a negative example.
I think the problem with training custom classes is a negative image problem. Even if you have enough images of the object itself, there isn't enough data showing that same object in different contexts and against different backgrounds. So in a sense there are not enough negative images, even if conceptually you shouldn't need negative images. When you run the model at home, you get false positives all over the place on objects around your own room.
I think the idea of transfer learning in this way is flawed. We end up seeing a lot of great tutorials online of people identifying playing cards, Millennium Falcons, etc., but none of those models are deployable in the real world, as they would all generate a bunch of false positives on anything outside their image pool. The best strategy would be to retrain the CNN from scratch with multiple classes and add the desired ones as well. I suggest taking an existing dataset such as ImageNet or Pascal with 10-20 pre-existing classes, adding your own classes, and retraining on that.

How do different input image sizes/resolutions affect the output quality of semantic image segmentation networks?

While trying to perform image segmentation on images from one dataset (KITTI) with a deep learning network trained on another dataset (Cityscapes) I realized that there is a big difference in subjectively perceived quality of the output (and probably also when benchmarking the (m)IoU).
This raised my question: does the size or resolution of an input image affect the output of a semantic segmentation network that was trained on images of a different size or resolution, and if so, how?
I attached two images and their corresponding output images from this network: https://github.com/hellochick/PSPNet-tensorflow (using provided weights).
The first image is from the CityScapes dataset (test set) with a width and height of (2048,1024). The network has been trained with training and validation images from this dataset.
CityScapes original image
CityScapes output image
The second image is from the KITTI dataset with a width and height of (1242,375):
KITTI original image
KITTI output image
As one can see, the shapes in the first segmented image are clearly defined while in the second one a detailed separation of objects is not possible.
Neural networks in general are fairly robust to variations in scale, but they certainly aren't perfect. Although I don't have references available off the top of my head, there have been a number of papers showing that scale does indeed affect accuracy.
In fact, training your network on a dataset with images at varying scales is almost certainly going to improve it.
Also, many of the image segmentation networks used today explicitly build constructs into the network architecture to handle this better.
Since you probably don't know exactly how these networks were trained, I would suggest resizing your images to match the approximate shape that the network you are using was trained on. Resizing an image with ordinary image resize functions is quite a normal preprocessing step.
Since the images you are referencing are large, I would also guess that whatever data input pipeline you're feeding them through is already resizing the images on your behalf. Many neural networks of this type are trained on much smaller inputs, often around 256x256, with the input image cropped and centered as necessary before training or prediction. Processing very large images like that is extremely compute-intensive and hasn't been found to improve the accuracy much.
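If you want to put a number on the quality difference instead of judging it by eye, the (m)IoU mentioned in the question can be computed along these lines. This is a minimal sketch that assumes both label maps are nested lists of integer class ids in the range [0, num_classes) and that you have ground-truth labels for the images being compared; it has no handling for "ignore" labels, and the function name is my own.

```python
def mean_iou(prediction, ground_truth, num_classes):
    """Mean intersection over union across the classes present in either map."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, true_row in zip(prediction, ground_truth):
        for pred_label, true_label in zip(pred_row, true_row):
            if pred_label == true_label:
                # The pixel belongs to both the predicted and the true region.
                intersection[pred_label] += 1
                union[pred_label] += 1
            else:
                # The pixel enlarges the union of both classes involved.
                union[pred_label] += 1
                union[true_label] += 1
    # Classes that appear in neither map are left out of the average.
    ious = [i / float(u) for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


predicted = [[0, 0],
             [1, 1]]
actual = [[0, 1],
          [1, 1]]
print(mean_iou(predicted, actual, num_classes=2))
```

Running the same network on an image at its original size and again after resizing it to the training resolution, then comparing the two scores, would show how much of the gap is due to scale alone.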

TensorFlow Custom Object Detection Disappointing Result - Why?

I started with the TF Object Detection API two weeks ago and managed to train a model to recognize a custom object, in my case a Mecanum wheel.
Here are the details:
No. of training images = 125
All training images are around 500 x 500 (plus minus)
Transfer Learning
Model used = ssd_mobilenet_v1_coco
batch size = 2
total steps ran = 12715
loss is around 0.5000 - 2.5000; sometimes it fluctuates to more than 10, I am not sure why
Here's the result:
The first image is encouraging.
The second image starts to disappoint me a little. I expected the model to detect FOUR Mecanum wheels (four boxes). Why doesn't it?
Then I suspected that there was something wrong with my trained model. I tried the sample test images (the third and fourth images), and now I am sure that this is not at all the model I was aiming for.
I have been reading this post, which I think describes a problem quite similar to mine (and he managed to solve it). He mentioned that the input image needs to be smaller than 600 x 1024, so I tried with the fifth image and, unsurprisingly, the result is again disappointing.
I went through the tutorial series by sentdex, and in the comment sections I noticed that many people face this problem too. So, what should I do now?
Can someone please help me to edit the list? Why can't I make it to one paragraph one list?
125 images? You will not be able to get very good results with that few images. If you want to validate that this is indeed the problem, try training with just subsets of your original 125 images.
For example, how bad is the output when you train on 10 images?
Does it get better when you use 50 images?
Does it get better yet when you use 125 images?
If the accuracy improves with increasing dataset size, you can extrapolate and guess that with 1000 images, you will be able to do even better. I would guess that that is your problem.
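The subset experiment could be organised something like this. It is only a sketch: train_and_evaluate is a hypothetical callable that you would have to supply yourself (train on the given images, return a score such as mAP on a fixed test set), and the subsets are nested so that each larger run contains all images of the smaller ones.

```python
import random


def nested_subsets(items, sizes, seed=0):
    """Return {size: subset}, where each smaller subset is contained in the larger ones."""
    shuffled = list(items)
    # A fixed seed keeps the subsets reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    return {size: shuffled[:size] for size in sorted(sizes) if size <= len(shuffled)}


def learning_curve(items, sizes, train_and_evaluate, seed=0):
    """Run the (hypothetical) training callable once per subset size."""
    subsets = nested_subsets(items, sizes, seed)
    return [(size, train_and_evaluate(subsets[size])) for size in sorted(subsets)]


if __name__ == "__main__":
    image_ids = range(125)

    def fake_train_and_evaluate(subset):
        # Placeholder so the sketch runs; replace with real training and evaluation.
        return len(subset) / 1000.0

    for size, score in learning_curve(image_ids, [10, 50, 125], fake_train_and_evaluate):
        print(size, score)
```

If the score keeps climbing from 10 to 50 to 125 images without flattening out, that supports the guess that more data would help.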

small object detection with faster-RCNN in tensorflow-models

I'm attempting to train a Faster R-CNN model for small digit detection. I'm using the newly released TensorFlow Object Detection API and so far have been fine-tuning a pre-trained faster_rcnn_resnet101_coco from the zoo. All my training attempts have resulted in models with high precision but low recall. Out of the ~120 objects (digits) on each image, only ~20 are ever detected, but when a digit is detected the classification is accurate. (Also, I am able to train a simple convnet from scratch on my cropped images with high accuracy, so the problem is in the detection aspect of the model.) Each digit is on average 60x30 in the original images (and probably about half that size after the image is resized before being fed into the model). Here is an example image with detected boxes showing what I'm seeing:
What is odd to me is how it is able to correctly detect neighboring digits but completely miss the rest, which are very similar in terms of pixel dimensions.
I have tried adjusting the hyperparameters around anchor box generation and first_stage_max_proposals, but nothing has improved the results so far. Here is an example config file I have used. What other hyperparameters should I try adjusting? Any other suggestions on how to diagnose the problem? Should I be looking into other architectures, or does my task look doable with Faster R-CNN and/or SSD?
In the end the immediate problem was that I was not using the visualizer correctly. By updating the parameters for visualize_boxes_and_labels_on_image_array as described by Johnathan in the comments, I was able to see that I am at least detecting more boxes than I had thought.
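For anyone hitting the same thing: as far as I can tell, the visualizer limits what it draws through its max_boxes_to_draw and min_score_thresh parameters, and I believe the defaults are 20 and 0.5, which would match the ~20 boxes I was seeing. Treat those defaults as an assumption and check the signature in your installed version. The sketch below is not the library code, just a small stand-in (the helper name is made up) that mimics that filtering on a list of scores sorted from highest to lowest.

```python
def indices_that_get_drawn(scores, max_boxes_to_draw=20, min_score_thresh=0.5):
    """Indices of detections a visualizer with these two limits would draw.

    Assumes scores are already sorted in descending order.
    Passing max_boxes_to_draw=None removes the cap on the number of boxes.
    """
    limit = len(scores) if max_boxes_to_draw is None else min(max_boxes_to_draw, len(scores))
    # Only the first `limit` detections are considered, then filtered by score.
    return [i for i in range(limit) if scores[i] > min_score_thresh]


# 120 confident detections, as in the digit images described above.
scores = [0.9] * 120
print(len(indices_that_get_drawn(scores)))
print(len(indices_that_get_drawn(scores, max_boxes_to_draw=None)))
```

So before concluding that recall is low, it is worth counting the raw detections above your score threshold rather than the boxes in the rendered image.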
I checked your config file: you are decreasing the resolution of your images to 1024. The region of each digit will not contain many pixels, so you are losing some information. What I suggest is training the model on another dataset made of smaller images. For example, you can crop each image into four areas.
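A sketch of the four-way crop, assuming boxes are (xmin, ymin, xmax, ymax) in pixels; the function names are made up for illustration. Each quadrant gets its own list of boxes, shifted into the quadrant's coordinates. Digits that straddle a cut are simply dropped here, so in practice you may want the tiles to overlap a little.

```python
def split_into_quadrants(image_width, image_height):
    """Return the four crop windows (left, top, right, bottom) of a 2x2 grid."""
    half_w, half_h = image_width // 2, image_height // 2
    return [
        (0, 0, half_w, half_h),                        # top-left
        (half_w, 0, image_width, half_h),              # top-right
        (0, half_h, half_w, image_height),             # bottom-left
        (half_w, half_h, image_width, image_height),   # bottom-right
    ]


def boxes_in_tile(boxes, tile):
    """Keep boxes fully inside the tile and shift them to tile coordinates."""
    left, top, right, bottom = tile
    shifted = []
    for xmin, ymin, xmax, ymax in boxes:
        if xmin >= left and ymin >= top and xmax <= right and ymax <= bottom:
            shifted.append((xmin - left, ymin - top, xmax - left, ymax - top))
    return shifted


digit_boxes = [(10, 10, 40, 70), (1500, 600, 1530, 660), (990, 100, 1020, 160)]
for tile in split_into_quadrants(2000, 1000):
    print(tile, boxes_in_tile(digit_boxes, tile))
```

Each crop then goes through the resizer at a quarter of the original area, so the digits keep roughly twice as many pixels per side as before.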
If you have a good GPU, increase the max dimension in the image_resizer, but I guess you will run out of memory.