Which model should I use for object recognition on mobile devices? - tensorflow

I am working on a project where the app needs to recognize a special character, just one small graphic, in a photographed document. More specifically, the app would use this character to determine the corners of the document.
Which model would be suitable for that: MobileNet SSD, YOLO, or something completely different? Approximately how many photos and how much time would it take to train the model to 90%+ detection? And is TensorFlow Model Maker a good option?
I already tried training it with Model Maker, but the results were really disappointing. I used the efficientdet_lite0 model. The photos were taken with a phone at high resolution and tagged with labelImg: about 40 for training, and five each for validation and test.
It would mean a lot to me if someone could tell me whether I am at least on the right track. Thank you very much in advance.

EfficientDet or YOLO should be good enough for your use case. YOLOv5 recommends 10,000 annotated instances of each class for good results; that figure assumes large variability in how the class is represented. I would not start with anything less than a few hundred.
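If you stick with Model Maker, a minimal training sketch might look like this, assuming Pascal VOC XML annotations from labelImg (all paths and the 'corner_mark' label are placeholders):

```python
from tflite_model_maker import model_spec, object_detector

# Hypothetical paths; labelImg writes Pascal VOC XML annotation files.
train_data = object_detector.DataLoader.from_pascal_voc(
    'images/train', 'annotations/train', label_map={1: 'corner_mark'})
val_data = object_detector.DataLoader.from_pascal_voc(
    'images/val', 'annotations/val', label_map={1: 'corner_mark'})

spec = model_spec.get('efficientdet_lite0')
model = object_detector.create(train_data, model_spec=spec, epochs=50,
                               batch_size=8, train_whole_model=True,
                               validation_data=val_data)
model.export(export_dir='.')  # writes a .tflite file for mobile deployment
```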

Related

How to train your own (w/o YOLO etc.) object detector in tf/keras

I successfully trained a multi-class classification model; that was really easy with a simple class-per-folder structure and keras.preprocessing.image.ImageDataGenerator with flow_from_directory (no one-hot encoding by hand, btw!). After that I just compile, fit and evaluate - an extremely well done pipeline by Keras!
BUT! When I decided to make my own (not cats, not dogs, not you-name-it) object detector, it became a nightmare...
TFRecord and tf.Example are just madness! But OK, I almost get it (my dataset is small and I have plenty of RAM, but who cares, write the boilerplate, so much meh...).
The main thing: I just can't find any docs or tutorials on how to do it with plain, simple tf/keras. Everyone just wants to build it on top of someone else's model (YOLO, SSD, FRCNN), even when they are trying to detect completely new objects!!!
There are two links about OD in the official docs, and they both use some model underneath.
So my main question is WHY??? Or am I just blind..? -__-
It becomes a nightmare because object detection is way, way harder than classification. The most simple object detector is this: first train a classifier on all your objects. Then, when you want to detect objects in your image, slide a window over the image and classify each window. If the classifier is confident that a certain window contains one of the objects, mark it as a successful detection.
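A minimal sketch of that sliding-window idea, assuming `classifier` is any trained Keras classifier over fixed-size crops (window size, stride and threshold are placeholders):

```python
import numpy as np

def sliding_window_detect(image, classifier, window=64, stride=32, threshold=0.9):
    """Naive detector: classify every window crop, keep confident ones."""
    detections = []
    height, width = image.shape[:2]
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            crop = image[top:top + window, left:left + window]
            # classify a batch of one crop
            probs = classifier.predict(crop[np.newaxis, ...], verbose=0)[0]
            best = int(np.argmax(probs))
            if probs[best] >= threshold:
                detections.append((left, top, window, best, float(probs[best])))
    return detections

# usage (hypothetical): boxes = sliding_window_detect(img, my_classifier)
```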
But this approach has a lot of problems; mainly, it's way (like waaaay) too slow. So researchers improved it and invented R-CNNs. Those had their problems, so they invented Faster R-CNN, YOLO and SSD, all to make detection faster and more accurate.
You won't find any tutorials online on how to implement the sliding-window technique because it's not useful anyway, and you won't find any tutorials on how to implement the more advanced stuff because, well, the networks get complicated pretty quickly.
Also note that using YOLO doesn't mean you have to use the same weights as in YOLO. You can always train YOLO from scratch on your own data if you want, by randomly initializing all the weights in the network layers. So the "even when they are trying to detect completely new objects!!!" you mentioned isn't really valid. That said, I would still advise you to use the weights from the pretrained YOLO network. Transfer learning is generally considered a good idea, especially when starting out and especially in the image-processing world, as many images share common features (like edges, for example).
I am having pretty much the same problem: my images are B/W diagrams, quite different from regular pictures, and I want to train a custom model on diagrams only.
I have found this documentation section in Tensorflow models repo:
https://github.com/tensorflow/models/blob/master/research/object_detection/README.md
It has a couple of sections in "extras" explaining how to bring your own model and dataset; that could be a starting point.

Deep Learning Model for Complicated Pattern Recognition

I am using transfer learning with ResNet50 for snack-packet recognition.
The packets are similar to one another in dominant color and shape.
I have about 33 items to recognize.
I used Faster R-CNN and SSD with a ResNet50 backbone.
They are not doing well, and a lot of items are confused with each other.
Which deep learning architecture is suitable for recognizing such objects?
Or are there any special tricks to get better recognition for such objects?
I think we need an architecture that can recognize detailed patterns.
Make sure you are loading the original pre-trained network weights in Caffe; otherwise you're starting network training from scratch!
If you're looking to increase your dataset size, I frequently take the same image set and rotate each image a few times (see the sketch below).
Definitely decrease your image size, and consider giving your images less background noise to work with (people, variable backgrounds, etc.).
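A minimal sketch of that rotate-and-downsize augmentation, using standard tf.keras preprocessing layers (the target size and rotation factor are placeholders):

```python
import tensorflow as tf

# Hypothetical on-the-fly augmentation: downsize, then random rotations/flips.
augment = tf.keras.Sequential([
    tf.keras.layers.Resizing(224, 224),       # decrease image size
    tf.keras.layers.RandomRotation(0.1),      # rotate up to +/- 36 degrees
    tf.keras.layers.RandomFlip("horizontal"),
])

# Usage inside a tf.data pipeline:
# ds = ds.map(lambda x, y: (augment(x, training=True), y))
```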
In the past I have used AlexNet for similar issues with small feature differences.
Best of luck!

How to know what TensorFlow actually "sees"?

I'm using a CNN built with Keras (TensorFlow) to do visual recognition.
I wonder if there is a way to know what my own TensorFlow model "sees".
Google had a news story showing the cat face in the AI's "brain":
https://www.smithsonianmag.com/innovation/one-step-closer-to-a-brain-79159265/
Can anybody tell me how to extract such an image from my own CNN network?
For example, what does my own CNN model see when it recognizes a car?
We have to distinguish between what TensorFlow actually sees:
As we go deeper into the network, the feature maps look less like the original image and more like an abstract representation of it. As you can see in block3_conv1, the cat is somewhat visible, but after that it becomes unrecognizable. The reason is that deeper feature maps encode high-level concepts like “cat nose” or “dog ear” while lower-level feature maps detect simple edges and shapes. That's why deeper feature maps contain less information about the image and more about the class of the image. They still encode useful features, but they are less visually interpretable by us.
and what we can reconstruct from it as the result of some kind of reverse, deconvolution-like process (which is not a real mathematical deconvolution, in fact).
To answer your real question: there are a lot of good example solutions out there; one you can study with success is Visualizing output of convolutional layer in tensorflow.
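For a concrete starting point, here is a minimal feature-map inspection sketch in Keras, assuming `model` is your trained CNN and `image` is a preprocessed batch of shape (1, H, W, C); the layer name is just an example:

```python
import tensorflow as tf

def feature_maps(model, image, layer_name):
    # Build a sub-model that stops at the layer we want to inspect.
    probe = tf.keras.Model(inputs=model.input,
                           outputs=model.get_layer(layer_name).output)
    return probe(image)  # shape: (1, h, w, num_filters)

# maps = feature_maps(model, image, "block3_conv1")
# import matplotlib.pyplot as plt
# plt.imshow(maps[0, :, :, 0]); plt.show()  # one filter's activation
```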
When you build a model to perform visual recognition, you give it labelled data (pictures, in this case) so that it can adjust its weights according to the training data. If you wish to build a model that can recognize a car, you have to train it on a large training set of labelled pictures. This type of recognition is basically categorical recognition.
You can experiment with the MNIST dataset, which provides pictures of digits for image recognition.
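For instance, a minimal MNIST sketch in Keras looks like this (the architecture here is just an example):

```python
import tensorflow as tf

# Train a small classifier on labelled digit images; the model adjusts
# its weights to map pictures to their categories.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # ten digit classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
```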

YOLO vs Inception on unique images

I have images of unique products that are used at my workplace. I can't imagine that the Inception database has already been trained on similar items.
I tried to train a model using YOLO. It was taking a very, very long time: maybe 7 minutes between epochs, and I wanted to do 1000 epochs due to the small dataset size.
I used the tiny-yolov2-voc cfg/weights on one GPU. I had a video of the item, but I broke it up into frames so I could annotate. I then attempted to train on the images (not the video). The products are healthcare related; basically anything that a hospital would use.
I've also used the Inception method on images I got from Google. I noticed that the Inception method was very fast and resulted in accurate predictions. However, I'm worried that my images are too unique for Inception to work.
Which method is best to use?
If you recommend YOLO, can you please provide suggestions on how to speed up the training phase?
If you recommend Inception, can you please explain why it would work on unique images? I guess I'm having trouble understanding how Inception knows which item I'm trying to train on without me providing annotations.
Thanks in advance
Just my impression (no recommendation, and no related experience):
Having a look at the hardware recommendations related to darknet, my assumption is that you might need to stock up on hardware of your own to get faster results.
I read about the three current versions of YOLO and expect a lot of GFLOPS went into the training included in the recommended downloads, but if the models never fit your products, then for you they might never be very helpful.
I must admit I've been active with neither YOLO nor TensorFlow, so my impression might not be helpful at all.
If you watch some YOLO videos you can notice that sometimes a camel is labeled as a horse and the accuracy seems bad, but it depends on the threshold applied to the images. The videos look amazing because the recognition is done so fast, but with higher accuracy the process would slow down - also depending on what it was trained on.
They never hide this, though: they explain, on an image where a dog is labeled as a cow and a horse as a sheep (version 2), that in combination with darknet it gets much faster but less accurate too, so the use of darknet is an important aspect as well.
The detailed information seems quite sparse on the YOLO websites; they present it more the way you'd present a pop star. In comparison, the TensorFlow website looks more academic and informs you about the mathematics behind the framework.
Concerning TensorFlow I don't know the hardware recommendations, but since you wrote that your results are useful, they are probably somewhat or even much lower.
My impression is that YOLO is primarily intended for real-time detection in (live) videos and needs a lot of training for high accuracy. So depending on your use case it might be right, but you'd probably have to invest in hardware for professional usage.
This is not an opinion against TensorFlow; it's just that I would have to verify more, and it seems to take more time to get an impression. Concerning TensorFlow, at the moment I can't even say whether it can be used for real-time detection, how accurate it is then, and whether the results would still be better than those of YOLO.
My assumption is that for both solutions it's a matter of the involved elements (like the decision whether to include darknet for speed), configuration, training, and adjustments. There is probably always something to improve in speed and accuracy, so investing in a recognition system won't be a static process with a fixed end date, but a steady one.
This is just a short overview of my impressions; I've never had any experience with recognition software, and I strongly advise against making any decision based on my words.
But if you want to use recognition software professionally, especially for real-time recognition, then you'll probably have to invest in hardware.
To my understanding of your problem, you need Inception with the capability of identifying your unique images. In this circumstance you can use transfer learning on the Inception model. With transfer learning you can train Inception on your own pictures while retaining Inception's previous knowledge.
More on transfer-learning
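A minimal transfer-learning sketch with Keras's InceptionV3 might look like this (the dataset path and class count are placeholders):

```python
import tensorflow as tf

NUM_CLASSES = 10  # hypothetical: the number of your unique products

# Load Inception pre-trained on ImageNet, without its classification head.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False  # freeze: retain Inception's previous knowledge

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # new head
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds = tf.keras.utils.image_dataset_from_directory(
#     "images/train", image_size=(299, 299))  # hypothetical path
# model.fit(train_ds, epochs=10)
```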

Counting Pedestrians Using TensorFlow's Object Detection

I am new to the machine learning field, and based on what I have seen on YouTube and read on the internet, I conjectured that it might be possible to count pedestrians in a video using TensorFlow's Object Detection API.
Consequently, I did some research on TensorFlow, read the documentation on how to install it, and finally downloaded and installed it. Using the sample files provided on GitHub, I adapted the code from the object_detection notebook provided here: https://github.com/tensorflow/models/tree/master/research/object_detection.
I executed the adapted code on the videos that I collected, making changes to the visualization_utils.py script so as to report the number of objects that cross a defined region of interest on the screen. That is, I collected the bounding-box dimensions (left, right, top, bottom) of the person class and counted all the detections that crossed the defined region of interest: imagine a set of two virtual vertical lines on the video frame, with left and right pixel values, and then comparing each detected bounding box's left and right values with those predefined values (a simplified sketch of this check is below).
However, with this procedure I am missing a lot of pedestrians even though they are detected by the program. That is, the program correctly classifies them as persons, but sometimes they don't meet the criteria I defined for counting, and as such they are not counted. I want to know if there is a better way of counting unique pedestrians than the simplistic method I am trying to develop. Is the approach I am using the right one? Could there be better approaches? I would appreciate any kind of help.
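Here is a simplified sketch of that check, with hypothetical pixel values (boxes are (left, right, top, bottom) in pixels):

```python
# Hypothetical line-crossing check for detections from the
# TF Object Detection API.
ROI_LEFT, ROI_RIGHT = 300, 340   # the two virtual vertical lines (x-values)

def crosses_roi(box):
    left, right, top, bottom = box
    # count a person whose box overlaps the vertical band
    return left < ROI_RIGHT and right > ROI_LEFT

# example detections for one frame (placeholder boxes)
person_boxes = [(290, 335, 50, 200), (10, 60, 80, 210)]
count = sum(crosses_roi(b) for b in person_boxes)
print(count)  # 1: only the first box overlaps the band
```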
Please go easy on me, as I am not a machine learning expert, just a novice.
You are using a pretrained model that was trained to identify people in general. I think you're saying that some people are pedestrians whereas others are not; for example, someone standing waiting at the light is a pedestrian, but someone standing in their garden behind the street is not.
If I'm right, then you've reached the limits of what you'll get with this model, and you will probably have to train a model yourself to do what you want.
Since you're new to ML, building your own dataset and training your own model probably sounds like a tall order; there's a learning curve, to be sure. So I'll suggest the easiest way forward: use the object detection model to identify people, then train a new binary classification model (about the easiest model to train) to identify whether a particular person is a pedestrian or not (you will create a dataset of images with 1/0 values identifying each as pedestrian or not). I suggest this because a boolean classification model is about as easy a model as you can get, and there are dozens of tutorials you can follow. Here's a good one:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/neural_network.ipynb
A few things to note when doing this:
When you build your dataset you will want a set of images, at least a few thousand, along with the 1/0 classification for each (pedestrian or not pedestrian).
You will get much better results if you start with a model that is pretrained on ImageNet than if you train it from scratch (though this might be a reasonable step 2, as it's an extra task), especially if you only have a few thousand images to train on.
Since your images will have multiple people in them, you have the problem of indicating which person you want the model to classify as a pedestrian or not. There's no single right way to do this. If you draw a yellow box around the target person, the network may succeed in learning that notation. Another valid approach might be to remove the other people detected in the image by blacking out their regions. Centering the crop on the target person may also be a reasonable approach.
My last bullet point illustrates a problem with the idea as I've proposed it. The best solution would be to alter the object detection network to output both a bounding box per person and a pedestrian/non-pedestrian classification with it, or to train the model to identify only pedestrians in the first place. I mention this as the more optimal route, but I consider it a more advanced task than my first suggestion, with a more complex dataset to manage. It's probably not the first thing you want to tackle as you learn your way around ML.
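That said, here is a minimal sketch of the suggested binary pedestrian/non-pedestrian classifier, starting from ImageNet weights (the MobileNetV2 backbone and all names here are just one possible choice):

```python
import tensorflow as tf

# Hypothetical binary classifier: pedestrian (1) vs. not pedestrian (0),
# starting from ImageNet-pretrained weights as suggested above.
base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the pretrained features frozen at first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # single 1/0 output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# person_crops: images of detected people; labels: 1/0 pedestrian flags
# model.fit(person_crops, labels, epochs=10)
```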