Process to build our own model for image detection - tensorflow

Currently, I am working on deep neural network for image detection and I founded a model called YOLO Network, and it's very powerful to make objects detections, but I have a question:
How can we design and concept our own model? Do we use a brut force for that, for example "I use 2 convolutional and 1 pooling layer and 1 fully connected layer" after that if the result is'nt good I change the number of layers and change the parameter until I find the best model, Please if there is anyone who knows some informations about that, show me how ?
I use Tensorflow.
Thanks,

There are a couple of papers addressing this issue. For example in http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf some general principles are mentioned, like preserving information by not having too rapid changes in any cut of the graph seperating the output from the input.
Another paper is https://arxiv.org/pdf/1606.02228.pdf where specific hyperparameter combinations are tried.
The remainder are just what you observe in practice and depends on your dataset and on your requirement. Maybe you have performance requirements because you want to deploy to mobile or you need more than 90 % accuracy. Then you will have to choose your model accordingly.

Related

How does custom object detection actually work?

I am currently testing out custom object detection using the Tensorflow API. But I don't quite seem to understand the theory behind it.
So if I for example download a version of MobileNet and use it to train on, lets say, red and green apples. Does it forget all the things that is has already been trained on? And if so, why does it then benefit to use MobileNet over building a CNN from scratch.
Thanks for any answers!
Does it forget all the things that is has already been trained on?
Yes, if you re-train a CNN previously trained on a large database with a new database containing fewer classes it will "forget" the old classes. However, the old pre-training can help learning the new classes, this is a training strategy called "transfert learning" of "fine tuning" depending on the exact approach.
As a rule of thumb it is generally not a good idea to create a new network architecture from scratch as better networks probably already exist. You may want to implement your custom architecture if:
You are learning CNN's and deep learning
You have a specific need and you proved that other architectures won't fit or will perform poorly
Usually, one take an existing pre-trained network and specialize it for their specific task using transfert learning.
A lot of scientific literature is available for free online if you want to learn. you can start with the Yolo series and R-CNN, Fast-RCNN and Faster-RCNN for detection networks.
The main concept behind object detection is that it divides the input image in a grid of N patches, and then for each patch, it generates a set of sub-patches with different aspect ratios, let's say it generates M rectangular sub-patches. In total you need to classify MxN images.
In general the idea is then analyze each sub-patch within each patch . You pass the sub-patch to the classifier in your model and depending on the model training, it will classify it as containing a green apple/red apple/nothing. If it is classified as a red apple, then this sub-patch is the bounding box of the object detected.
So actually, there are two parts you are interested in:
Generating as many sub-patches as possible to cover as many portions of the image as possible (Of course, the more sub-patches, the slower your model will be) and,
The classifier. The classifier is normally an already exisiting network (MobileNeet, VGG, ResNet...). This part is commonly used as the "backbone" and it will extract the features of the input image. With the classifier you can either choose to training it "from zero", therefore your weights will be adjusted to your specific problem, OR, you can load the weigths from other known problem and use them in your problem so you won't need to spend time training them. In this case, they will also classify the objects for which the classifier was training for.
Take a look at the Mask-RCNN implementation. I find very interesting how they explain the process. In this architecture, you will not only generate a bounding box but also segment the object of interest.

Is this the correct way of using YOLO for image classification in a custom project?

I'm a beginner in computer vision. Could anyone tell me whether what I'm considering to do is correct or not? I wanted to detect a certain cyst in teeth. So my dataset consists of a part of the dental x-ray that contains that cyst. I train my model with these pictures. The one with the colored area contains cyst (infected teeth), and the one below it is the uninfected teet.
Image with cyst
Uninfected teeth
After training my model, I want to use it on a full dental x-ray, and determine if this picture has the cyst or not. A full dental x-ray is shown below.
Full dental X-Ray
Does this work? Or I'm completely wrong?
Instead of treating this as an object detection problem, you would get far better results if you were to treat this as a classification problem.
There are already various architectures for such classification tasks.
There are various architectures in TensorFlow to get you started.
Take a look at this. If you have enough data you can train them from scratch instead of using pre-trained weights
Note - The architecture provided in TensorFlow will almost always give you better results than the architectures that you create.
Object detection is suitable for cases where you have well-defined objects. If you take a look at recently published research papers you can see that these types of problems are considered as classification problems instead of object detection problems.

Which model (GPT2, BERT, XLNet and etc) would you use for a text classification task? Why?

I'm trying to train a model for a sentence classification task. The input is a sentence (a vector of integers) and the output is a label (0 or 1). I've seen some articles here and there about using Bert and GPT2 for text classification tasks. However, I'm not sure which one should I pick to start with. Which of these recent models in NLP such as original Transformer model, Bert, GPT2, XLNet would you use to start with? And why? I'd rather to implement in Tensorflow, but I'm flexible to go for PyTorch too.
Thanks!
It highly depends on your dataset and is part of the data scientist's job to find which model is more suitable for a particular task in terms of selected performance metric, training cost, model complexity etc.
When you work on the problem you will probably test all of the above models and compare them. Which one of them to choose first? Andrew Ng in "Machine Learning Yearning" suggest starting with simple model so you can quickly iterate and test your idea, data preprocessing pipeline etc.
Don’t start off trying to design and build the perfect system.
Instead, build and train a basic system quickly—perhaps in just a few
days
According to this suggestion, you can start with a simpler model such as ULMFiT as a baseline, verify your ideas and then move on to more complex models and see how they can improve your results.
Note that modern NLP models contain a large number of parameters and it is difficult to train them from scratch without a large dataset. That's why you may want to use transfer learning: you can download pre-trained model and use it as a basis and fine-tune it to your task-specific dataset to achieve better performance and reduce training time.
I agree with Max's answer, but if the constraint is to use a state of the art large pretrained model, there is a really easy way to do this. The library by HuggingFace called pytorch-transformers. Whether you chose BERT, XLNet, or whatever, they're easy to swap out. Here is a detailed tutorial on using that library for text classification.
EDIT: I just came across this repo, pytorch-transformers-classification (Apache 2.0 license), which is a tool for doing exactly what you want.
Well like others mentioned, it depends on the dataset and multiple models should be tried and best one must be chosen.
However, sharing my experience, XLNet beats all other models so far by a good margin. Hence if learning is not the objective, i would simple start with XLNET and then try a few more down the line and conclude. It just saves time in exploring.
Below repo is excellent to do all this quickly. Kudos to them.
https://github.com/microsoft/nlp-recipes
It uses hugging face transformers and makes them dead simple. 😃
I have used XLNet, BERT, and GPT2 for summarization tasks (English only). Based on my experience, GPT2 works the best among all 3 on short paragraph-size notes, while BERT performs better for longer texts (up to 2-3 pages). You can use XLNet as a benchmark.

Misclassification using VGG16 Net

I use Faster RCNN to classify 33 items. But most of them are misclassified among each other. All items are snack packets and sweet packets like in the link below.
https://redmart.com/product/lays-salt-and-vinegar-potato-chips
https://www.google.com/search?q=ice+breakers&source=lnms&tbm=isch&sa=X&ved=0ahUKEwj5qqXMofHfAhUQY48KHbIgCO8Q_AUIDigB&biw=1855&bih=953#imgrc=TVDtryRBYCPlnM:
https://www.google.com/search?biw=1855&bih=953&tbm=isch&sa=1&ei=S5g-XPatEMTVvATZgLiwDw&q=disney+frozen+egg&oq=disney+frozen+egg&gs_l=img.3..0.6353.6886..7047...0.0..0.43.116.3......1....1..gws-wiz-img.OSreIYZziXU#imgrc=TQVYPtSi--E7eM:
So color and shape are similar.
What could be the best way to solve this misclassification problem?
Fine tuning is a way to use features, learned on some big dataset, in our problem, which means instead of training the complete network again, we freeze out weights of the lower layer of the network and add few layers at the end of network, as per requirement. Now we train it on our data-set again. So the advantage here is that, we don't need to train all-millions of parameters, but few only. Another is that we don't need large-dataset to fine-tune.
More you can find here. This is another-useful resource, where author has explained this in more detail(with code).
Note: This is also known as transfer-learning.

How to know what Tensorflow actually "see"?

I'm using cnn built by keras(tensorflow) to do visual recognition.
I wonder if there is a way to know what my own tensorflow model "see".
Google had a news showing the cat face in the AI brain.
https://www.smithsonianmag.com/innovation/one-step-closer-to-a-brain-79159265/
Can anybody tell me how to take out the image in my own cnn networks.
For example, what my own cnn model recognize a car?
We have to distinguish between what Tensorflow actually see:
As we go deeper into the network, the feature maps look less like the
original image and more like an abstract representation of it. As you
can see in block3_conv1 the cat is somewhat visible, but after that it
becomes unrecognizable. The reason is that deeper feature maps encode
high level concepts like “cat nose” or “dog ear” while lower level
feature maps detect simple edges and shapes. That’s why deeper feature
maps contain less information about the image and more about the class
of the image. They still encode useful features, but they are less
visually interpretable by us.
and what we can reconstruct from it as a result of some kind of reverse deconvolution (which is not a real math deconvolution in fact) process.
To answer to your real question, there is a lot of good example solution out there, one you can study it with success: Visualizing output of convolutional layer in tensorflow.
When you are building a model to perform visual recognition, you actually give it similar kinds of labelled data or pictures in this case to it to recognize so that it can modify its weights according to the training data. If you wish to build a model that can recognize a car, you have to perform training on a large train data containing labelled pictures. This type of recognition is basically a categorical recognition.
You can experiment with the MNIST dataset which provides with a dataset of pictures of digits for image recognition.