How to train your own (w/o YOLO etc.) object detector in tf/keras - tensorflow

I successfully trained a multi-class classification model. That was really easy with a simple class-per-folder structure and keras.preprocessing.image.ImageDataGenerator with flow_from_directory (no one-hot encoding by hand, btw!); after that I just compile, fit and evaluate - an extremely well done pipeline by Keras!
BUT! When I decided to make my own (not cats, not dogs, not you-name-it) object detector, it became a nightmare...
TFRecord and tf.Example are just madness! But OK, I almost get it (my dataset is small and I have plenty of RAM, but who cares - write the boilerplate, so much meh...).
The main thing: I just can't find any docs/tutorials on how to do it with plain, simple tf/keras. Everyone just wants to build it on top of some existing model - YOLO, SSD, Faster R-CNN - even if they are trying to detect completely new objects!!!
There are two links about object detection in the official docs, and they both use some model underneath.
So my main question is: WHY??? Or am I just blind..? -__-

It becomes a nightmare because object detection is way, way harder than classification. The simplest object detector is this: first, train a classifier on all your objects. Then, when you want to detect objects in your image, slide a window over the image and classify each window. Finally, if your classifier is certain that a certain window contains one of the objects, mark it as a successful detection.
But this approach has a lot of problems; mainly, it's way (like waaaay) too slow. So researchers improved on it and invented R-CNN. That had its own problems, so they invented Faster R-CNN, YOLO and SSD, all to make detection faster and more accurate.
You won't find many tutorials online on how to implement the sliding-window technique because it isn't useful in practice, and you won't find tutorials on how to implement the more advanced stuff because, well, the networks get complicated pretty quickly.
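That said, for intuition only, here is a minimal sketch of the sliding-window baseline, assuming you already have a trained Keras classifier and made-up window, stride and threshold values:

```python
import numpy as np

# Hypothetical parameters: match WINDOW to your classifier's input size.
WINDOW = 64      # window side length in pixels
STRIDE = 32      # step between consecutive windows
THRESHOLD = 0.9  # confidence needed to count a window as a detection

def sliding_window_detect(model, image):
    """Classify every window of `image` and return confident hits.

    `model` is assumed to be a trained Keras classifier that takes
    (WINDOW, WINDOW, 3) inputs and returns class probabilities.
    """
    h, w, _ = image.shape
    boxes, patches = [], []
    for y in range(0, h - WINDOW + 1, STRIDE):
        for x in range(0, w - WINDOW + 1, STRIDE):
            patches.append(image[y:y + WINDOW, x:x + WINDOW])
            boxes.append((x, y, x + WINDOW, y + WINDOW))
    probs = model.predict(np.stack(patches), verbose=0)
    detections = []
    for box, p in zip(boxes, probs):
        if p.max() >= THRESHOLD:
            detections.append((box, int(p.argmax()), float(p.max())))
    return detections  # [(box, class_id, confidence), ...]
```

A real detector would additionally run this at several window scales and merge overlapping hits with non-maximum suppression, which is exactly the kind of cost the single-pass networks mentioned above were designed to avoid.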
Also note that using YOLO doesn't mean you have to use the same weights as in YOLO. You can always train YOLO from scratch on your own data if you want, by randomly initializing all the weights in the network layers. So the "even if they are trying to detect completely new objects!!!" objection you mentioned isn't really valid. That said, I would still advise you to use the pre-trained YOLO weights. Transfer learning is generally considered a good idea, especially when starting out and especially in the image-processing world, as many images share common features (like edges, for example).

I am having pretty much the same problem: my images are B/W diagrams, quite different from regular pictures, and I want to train a custom model on diagrams only.
I have found this documentation section in the TensorFlow models repo:
https://github.com/tensorflow/models/blob/master/research/object_detection/README.md
It has a couple of sections in "extras" explaining how to bring your own model and dataset; that could be a starting point.

Related

How does custom object detection actually work?

I am currently testing custom object detection using the TensorFlow API, but I don't quite understand the theory behind it.
So if I, for example, download a version of MobileNet and train it on, let's say, red and green apples: does it forget all the things it has already been trained on? And if so, why is it then beneficial to use MobileNet over building a CNN from scratch?
Thanks for any answers!
Does it forget all the things that it has already been trained on?
Yes, if you re-train a CNN previously trained on a large database with a new database containing fewer classes, it will "forget" the old classes. However, the old pre-training can help with learning the new classes; this is a training strategy called "transfer learning" or "fine-tuning", depending on the exact approach.
As a rule of thumb, it is generally not a good idea to create a new network architecture from scratch, as better networks probably already exist. You may want to implement a custom architecture if:
You are learning CNNs and deep learning, or
You have a specific need and you have proved that other architectures won't fit or will perform poorly.
Usually, one takes an existing pre-trained network and specializes it for their specific task using transfer learning.
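As a minimal sketch of that recipe in Keras, assuming MobileNetV2 as the backbone and a made-up number of classes:

```python
import tensorflow as tf

NUM_CLASSES = 2  # hypothetical: e.g. red apples vs. green apples

# Load a backbone pre-trained on ImageNet, without its classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained features for now

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # new head
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)  # train only the new head first;
# optionally unfreeze part of `base` afterwards and fine-tune with a low LR.
```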
A lot of scientific literature is available for free online if you want to learn. You can start with the YOLO series, and with R-CNN, Fast R-CNN and Faster R-CNN for detection networks.
The main concept behind object detection is that the model divides the input image into a grid of N patches and then, for each patch, generates a set of sub-patches with different aspect ratios; let's say it generates M rectangular sub-patches per patch. In total you need to classify M×N images.
In general, the idea is then to analyze each sub-patch within each patch. You pass the sub-patch to the classifier in your model and, depending on the model's training, it will classify it as containing a green apple / a red apple / nothing. If it is classified as a red apple, then this sub-patch is the bounding box of the detected object.
So actually, there are two parts you are interested in:
Generating as many sub-patches as possible to cover as many portions of the image as possible (of course, the more sub-patches, the slower your model will be) - see the sketch after this list - and,
The classifier. The classifier is normally an already existing network (MobileNet, VGG, ResNet...). This part is commonly called the "backbone", and it extracts the features of the input image. With the classifier you can either train it from scratch, so that the weights are adjusted to your specific problem, or you can load the weights from another known problem and use them in yours, so you won't need to spend time training them. In the latter case, the backbone will also classify the objects the original classifier was trained for.
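To make the M×N idea concrete, here is a purely illustrative sketch of sub-patch generation; the grid size and aspect ratios are made up:

```python
def generate_sub_patches(img_w, img_h, n_grid=4, aspect_ratios=(0.5, 1.0, 2.0)):
    """Return candidate boxes (x0, y0, x1, y1) on an n_grid x n_grid grid,
    one box per aspect ratio centred on each grid cell.

    N = n_grid * n_grid patches and M = len(aspect_ratios) sub-patches
    per patch, so M*N boxes in total, as described above.
    """
    cell_w, cell_h = img_w / n_grid, img_h / n_grid
    boxes = []
    for row in range(n_grid):
        for col in range(n_grid):
            cx, cy = (col + 0.5) * cell_w, (row + 0.5) * cell_h
            for ar in aspect_ratios:
                # Keep the box area equal to one cell, vary width:height.
                w = cell_w * ar ** 0.5
                h = cell_h / ar ** 0.5
                boxes.append((max(0.0, cx - w / 2), max(0.0, cy - h / 2),
                              min(img_w, cx + w / 2), min(img_h, cy + h / 2)))
    return boxes

print(len(generate_sub_patches(640, 480)))  # 4 * 4 * 3 = 48 candidate boxes
```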
Take a look at the Mask R-CNN implementation; I find the way they explain the process very interesting. In this architecture you not only generate a bounding box, but also segment the object of interest.

Multi-label image classification vs. object detection

For my next TF2-based computer vision project I need to classify images into a pre-defined set of classes. However, multiple objects of different classes can occur in one such image. That sounds like an object detection task, so I guess I could go for that.
But: I don't need to know where in an image each of these objects is; I just need to know which classes of objects are visible in an image.
Now I am wondering which route I should take. I am particularly interested in a high accuracy/quality of the solution, so I would prefer the approach that leads to better results. From your experience, should I still go for an object detector even though I don't need the locations of the detected objects, or should I rather build an image classifier that outputs all the classes present in an image? Is this even an option - can a "normal" classifier output multiple classes?
Since you don't need object localization, stick to classification only.
You will be tempted to use a standard off-the-shelf multi-class, multi-label object detection network because of its re-usability, but realize that you are then asking the model to do more than you need. If you have tons of data, that is not a problem. Likewise, if your objects are similar to the ones in ImageNet/COCO etc., you can simply use a standard off-the-shelf object detection architecture and fine-tune it on your dataset.
However, if you have little data and need to train from scratch (e.g. medical images, unusual objects), then object detection will be overkill and will give you inferior results.
Remember, most object detection networks recycle classification architectures, with modifications to the last layers to add outputs for the detection coordinates. There is a loss term associated with those additional outputs, and during training, in order to minimize the total loss, some classification accuracy is traded away for better localization coordinates. You don't need that compromise, so you can take an object detection network, modify its last layer and remove the outputs for coordinates.
Again, all this hassle is only worth it if you have little data and really need to train from scratch.
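And yes, a "normal" classifier can output multiple classes: use one sigmoid per class instead of a softmax, with a binary cross-entropy loss, so each class becomes an independent yes/no decision. A minimal sketch, with the backbone and class count as placeholders:

```python
import tensorflow as tf

NUM_CLASSES = 10  # placeholder for your pre-defined set of classes

base = tf.keras.applications.ResNet50(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    # Sigmoid (not softmax): each class is scored independently,
    # so several classes can be "on" for the same image.
    tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",  # one binary decision per class
              metrics=["binary_accuracy"])
# Labels are multi-hot vectors, e.g. [1, 0, 1, 0, ...] when the
# classes with indices 0 and 2 are both visible in the image.
```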

Recognize scene with deep learning

What is the approach to recognizing a scene with deep learning (preferably Keras)?
There are many examples showing how to classify images of limited size, e.g. dogs/cats, hand-written letters, etc. There are also some examples for detecting a searched object within a big image.
But what is the best approach to recognize whether it is, e.g., a classroom, a bedroom or a dining room? Create a dataset with such images? I think not. I think one should train a model on the many things that may appear in a scene, build a vector of the things found in the analyzed image, and then classify the scene with a second classifier (an SVM or a simple NN). Is that the right approach?
P.S.: Actually, I'm facing another problem, which IMHO is the same. My "scene" is a microscope image. The images contain different sets of cells and artifacts, and depending on the set, a doctor makes a diagnosis. So I aim to train a CNN on the artifacts, which I extract with simple morphological methods. These artifacts (e.g. biological cells) will be my features. So the first level of the recognition, feature extraction, is done by a CNN, and the later classification by an SVM. I just want to be sure that I'm not reinventing the wheel.
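For concreteness, the two-stage pipeline described above (CNN features fed into an SVM) might look like this sketch; the backbone choice, shapes and variable names are illustrative:

```python
import tensorflow as tf
from sklearn.svm import SVC

# Stage 1: a pre-trained CNN as a fixed feature extractor.
# pooling="avg" collapses the feature maps to one vector per image.
extractor = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False,
    weights="imagenet", pooling="avg")

def extract_features(images):
    """images: preprocessed float array of shape (n, 224, 224, 3)."""
    return extractor.predict(images, verbose=0)  # shape (n, 1280)

# Stage 2: classify the scene from the feature vectors with an SVM.
# X_train / y_train are your images and scene labels (illustrative names):
# svm = SVC(kernel="rbf").fit(extract_features(X_train), y_train)
# predictions = svm.predict(extract_features(X_test))
```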
In my opinion, your room-scene example and the biological scenes are different cases, especially since your scene is a microscope image (probably from a limited, predefined domain).
In this case, pure classification should work (said without having seen the data). In other words, the neural network should be able to figure out what it is seeing without you having to hand-craft features (if you need interpretability, that's a whole new discussion).
Also, there are lots of approaches to scene understanding in this paper.

Deep Learning Model for Complicated Pattern Recognition

I am using transfer learning with ResNet50 for snack-packet recognition.
The packets are similar to one another in dominant color and shape, like the ones in the images below.
I have about 33 items to recognize.
I used Faster R-CNN and SSD with a ResNet50 backbone.
They are not doing well, and a lot of items are confused with each other.
Which deep learning architecture is suitable to recognize such objects?
Or are there any special tricks to get better recognition for such objects?
I think we need an architecture that recognizes detailed patterns.
Make sure you are loading the original pre-trained network weights (the Caffe ones), otherwise you're starting network training from the beginning!
If you're looking to increase your dataset size, I frequently take the same image set and rotate each image a few times (see the augmentation sketch below).
Definitely decrease your image size, and consider giving your images less background noise to work with (people, variable backgrounds, etc.).
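A hypothetical augmentation pipeline along those lines, using the Keras preprocessing layers available in recent TF 2.x versions:

```python
import tensorflow as tf

# Random rotations plus a resize, mirroring the "rotate each image a few
# times" and "decrease your image size" advice above.
augment = tf.keras.Sequential([
    tf.keras.layers.Resizing(160, 160),    # smaller inputs train faster
    tf.keras.layers.RandomRotation(0.1),   # up to +/- 10% of a full turn
    tf.keras.layers.RandomFlip("horizontal"),
])

# Applied on the fly while training, e.g. on a tf.data pipeline:
# train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```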
In the past I have used AlexNet for similar issues with small feature differences.
Best of luck!

How to locate multiple objects in the same image?

I am a newbie in TensorFlow.
Currently, I am testing some classification examples ("Convolutional Neural Network") from the TensorFlow website. They explain how to classify input images into pre-defined classes, but the problem is: I can't figure out how to locate multiple objects in the same image. For example, if I had an input image with a cat and a dog, I want my graph to report in the output that both of them, "a cat and a dog", are in the image.
Great question. Detecting multiple objects in the same image is essentially an object detection problem. Two nice and popular algorithms are
YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector). I have included links to them at the bottom.
I would watch a few videos on how YOLO works and see if you grasp the idea. Then read the paper on SSD and see if you get why that algorithm is even faster and more precise.
Both algorithms are single-pass: they only look at the image "once" and predict bounding boxes for the categories they spot. There are more precise algorithms, but they are slower: they first pick many spots they want to look at, and then run a classifier on only those spots. The result is that they run this classifier many times per image, which is slow.
As you stated you are a newbie to TensorFlow, you can try this code other people made: https://github.com/thtrieu/darkflow. The very extensive readme shows you how to get started on your own dataset.
Good luck, and let us know if you have other questions, or if these algorithms do not fit your use-case.
YOLO 9000 (https://pjreddie.com/darknet/yolo/)
SSD (Single shot multibox detector) (https://arxiv.org/abs/1512.02325)
A naive approach to what you are trying to do would be to classify parts of the image independently.
But there are better techniques for object detection. In fact, there is the TensorFlow Object Detection API, which gives you access to the most common object detection methods, like Faster R-CNN and SSD.
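For a quick start, a hedged sketch of running a pre-trained detector from TensorFlow Hub; the model handle and output keys follow the TF2 detection-model convention, but verify them against the model's page before relying on them:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Assumed handle: an SSD MobileNet trained on COCO, published on TF Hub.
detector = hub.load("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2")

# The model expects a uint8 batch of shape (1, H, W, 3).
image = tf.io.decode_jpeg(tf.io.read_file("cat_and_dog.jpg"), channels=3)
result = detector(image[tf.newaxis, ...])

boxes = result["detection_boxes"][0].numpy()      # (n, 4), normalized coords
classes = result["detection_classes"][0].numpy()  # COCO ids: 17 = cat, 18 = dog
scores = result["detection_scores"][0].numpy()
for box, cls, score in zip(boxes, classes, scores):
    if score > 0.5:
        print(int(cls), float(score), box)  # should list both the cat and the dog
```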