how to use tensorflow object detection API for face detection - tensorflow

Open CV provides a simple API to detect and extract faces from given images. ( I do not think it works perfectly fine though because I experienced that it cuts frames from the input pictures that have nothing to do with face images. )
I wonder if tensorflow API can be used for face detection. I failed finding relevant information but hoping that maybe an experienced person in the field can guide me on this subject. Can tensorflow's object detection API be used for face detection as well in the same way as Open CV does? (I mean, you just call the API function and it gives you the face image from the given input image.)

You can, but some work is needed.
First, take a look at the object detection README. There are some useful articles you should follow. Specifically: (1) Configuring an object detection pipeline, (3) Preparing inputs and (3) Running locally. You should start with an existing architecture with a pre-trained model. Pretrained models can be found in Model Zoo, and their corresponding configuration files can be found here.
The most common pre-trained models in Model Zoo are on COCO dataset. Unfortunately this dataset doesn't contain face as a class (but does contain person).
Instead, you can start with a pre-trained model on Open Images, such as faster_rcnn_inception_resnet_v2_atrous_oid, which does contain face as a class.
Note that this model is larger and slower than common architectures used on COCO dataset, such as SSDLite over MobileNetV1/V2. This is because Open Images has a lot more classes than COCO, and therefore a well working model need to be much more expressive in order to be able to distinguish between the large amount of classes and localizing them correctly.
Since you only want face detection, you can try the following two options:
If you're okay with a slower model which will probably result in better performance, start with faster_rcnn_inception_resnet_v2_atrous_oid, and you can only slightly fine-tune the model on the single class of face.
If you want a faster model, you should probably start with something like SSDLite-MobileNetV2 pre-trained on COCO, but then fine-tune it on the class of face from a different dataset, such as your own or the face subset of Open Images.
Note that the fact that the pre-trained model isn't trained on faces doesn't mean you can't fine-tune it to be, but rather that it might take more fine-tuning than a pre-trained model which was pre-trained on faces as well.

just increase the shape of the input, I tried and it's work much better

Related

How does custom object detection actually work?

I am currently testing out custom object detection using the Tensorflow API. But I don't quite seem to understand the theory behind it.
So if I for example download a version of MobileNet and use it to train on, lets say, red and green apples. Does it forget all the things that is has already been trained on? And if so, why does it then benefit to use MobileNet over building a CNN from scratch.
Thanks for any answers!
Does it forget all the things that is has already been trained on?
Yes, if you re-train a CNN previously trained on a large database with a new database containing fewer classes it will "forget" the old classes. However, the old pre-training can help learning the new classes, this is a training strategy called "transfert learning" of "fine tuning" depending on the exact approach.
As a rule of thumb it is generally not a good idea to create a new network architecture from scratch as better networks probably already exist. You may want to implement your custom architecture if:
You are learning CNN's and deep learning
You have a specific need and you proved that other architectures won't fit or will perform poorly
Usually, one take an existing pre-trained network and specialize it for their specific task using transfert learning.
A lot of scientific literature is available for free online if you want to learn. you can start with the Yolo series and R-CNN, Fast-RCNN and Faster-RCNN for detection networks.
The main concept behind object detection is that it divides the input image in a grid of N patches, and then for each patch, it generates a set of sub-patches with different aspect ratios, let's say it generates M rectangular sub-patches. In total you need to classify MxN images.
In general the idea is then analyze each sub-patch within each patch . You pass the sub-patch to the classifier in your model and depending on the model training, it will classify it as containing a green apple/red apple/nothing. If it is classified as a red apple, then this sub-patch is the bounding box of the object detected.
So actually, there are two parts you are interested in:
Generating as many sub-patches as possible to cover as many portions of the image as possible (Of course, the more sub-patches, the slower your model will be) and,
The classifier. The classifier is normally an already exisiting network (MobileNeet, VGG, ResNet...). This part is commonly used as the "backbone" and it will extract the features of the input image. With the classifier you can either choose to training it "from zero", therefore your weights will be adjusted to your specific problem, OR, you can load the weigths from other known problem and use them in your problem so you won't need to spend time training them. In this case, they will also classify the objects for which the classifier was training for.
Take a look at the Mask-RCNN implementation. I find very interesting how they explain the process. In this architecture, you will not only generate a bounding box but also segment the object of interest.

retrain model when I am only interested of a subcategory of the existing classes

I want to use a trained model form the tensorflow object detection API, specifically I want to use faster_rcnn_inception_resnet_v2_atrous_oid_v4 trained on google open imaged. I am not interested in detecting all the 601 classes, but rather would like to detect 10 subclasses. Will I gain improvement in accuracy if I retain the last layer or is it better to filter the layers I am not interested after the model is done with prediction. If I went with retaining, is it ok to retain the model with images form google open images again or it is better to use different data.
This official example seems to help you.

How to know what Tensorflow actually "see"?

I'm using cnn built by keras(tensorflow) to do visual recognition.
I wonder if there is a way to know what my own tensorflow model "see".
Google had a news showing the cat face in the AI brain.
https://www.smithsonianmag.com/innovation/one-step-closer-to-a-brain-79159265/
Can anybody tell me how to take out the image in my own cnn networks.
For example, what my own cnn model recognize a car?
We have to distinguish between what Tensorflow actually see:
As we go deeper into the network, the feature maps look less like the
original image and more like an abstract representation of it. As you
can see in block3_conv1 the cat is somewhat visible, but after that it
becomes unrecognizable. The reason is that deeper feature maps encode
high level concepts like “cat nose” or “dog ear” while lower level
feature maps detect simple edges and shapes. That’s why deeper feature
maps contain less information about the image and more about the class
of the image. They still encode useful features, but they are less
visually interpretable by us.
and what we can reconstruct from it as a result of some kind of reverse deconvolution (which is not a real math deconvolution in fact) process.
To answer to your real question, there is a lot of good example solution out there, one you can study it with success: Visualizing output of convolutional layer in tensorflow.
When you are building a model to perform visual recognition, you actually give it similar kinds of labelled data or pictures in this case to it to recognize so that it can modify its weights according to the training data. If you wish to build a model that can recognize a car, you have to perform training on a large train data containing labelled pictures. This type of recognition is basically a categorical recognition.
You can experiment with the MNIST dataset which provides with a dataset of pictures of digits for image recognition.

Fixing error output from seq2seq model

I want to ask you how we can effectively re-train a trained seq2seq model to remove/mitigate a specific observed error output. I'm going to give an example about Speech Synthesis, but any idea from different domains, such as Machine Translation and Speech Recognition, using seq2seq model will be appreciated.
I learned the basics of seq2seq with attention model, especially for Speech Synthesis such as Tacotron-2.
Using a distributed well-trained model showed me how naturally our computer could speak with the seq2seq (end-to-end) model (you can listen to some audio samples here). But still, the model fails to read some words properly, e.g., it fails to read "obey [əˈbā]" in multiple ways like [əˈbī] and [əˈbē].
The reason is obvious because the word "obey" appears too little, only three times out of 225,715 words, in our dataset (LJ Speech), and the model had no luck.
So, how can we re-train the model to overcome the error? Adding extra audio clips containing the "obey" pronunciation sounds impractical, but reusing the three audio clips has the danger of overfitting. And also, I suppose we use a well-trained model and "simply training more" is not an effective solution.
Now, this is one of the drawbacks of seq2seq model, which is not talked much. The model successfully simplified the pipelines of the traditional models, e.g., for Speech Synthesis, it replaced an acoustic model and a text analysis frontend etc by a single neural network. But we lost the controllability of our model at all. It's impossible to make the system read in a specific way.
Again, if you use a seq2seq model in any field and get an undesirable output, how do you fix that? Is there a data-scientific workaround to this problem, or maybe a cutting-edge Neural Network mechanism to gain more controllability in seq2seq model?
Thanks.
I found an answer to my own question in Section 3.2 of the paper (Deep Voice 3).
So, they trained both of phoneme-based model and character-based model, using phoneme inputs mainly except that character-based model is used if words cannot be converted to their phoneme representations.

Is there a way to recognise an object in an image?

I am looking for some pre-trained deep learning model which can recognise an object in an image. Usually the images are of type used in shopping websites for products. I want to recognise what is the product in the image. I have come across some pre-trained models like VGG, Inception but they seems to be trained on some few general objects like 1000 objects. I am looking for something which is trained on more like 10000 or more.
I think the best way to do this is to build your own training set with the labels that you need to predict, then take an existing pre-trained model like VGG, remove the last fully connected layers and train the mode with your data, the process called transfer learning. Some more info here.