What is the purpose of a pre-trained network in Faster R-CNN? - faster-rcnn

I am not able to understand the purpose of a pre-trained network. From what I read, it is used for the RPN and the Classification Network. But I dont't understand how.

CNNs take a notoriously long time to train, especially for more complex models with higher resolutions. In order to avoid the days of training on a high-end GPU, pre-trained models have been made available. You then just have to train on your specific data (assuming your data is similar to the pre-trained data). For instance, if you want to train a CNN to recognize cats in high resolution images, you might want to start with a pre-trained model that recognizes dogs. The training should take a lot, lot less time due to the fact that a lot of the same underlying patterns have already been learned and all your training needs to do is differentiate cats from dogs.

Related

Training SSD-MOBILENET V1 and the loss does not deacrease

I'm new in everithing about CNN and tensorflow. Im training a pretrained ssd-mobilenev1-pets.config to detect columns of buildings, about one day but the loss is between 2-1 and doesnt decrease since 10 hours ago.
I realized that my input images are 128x128 and SSD resize de image to 300*300.
Does the size of the input images affect the training?
If that is the case, should I retrain the network with larger input images? or what would be another option to decrease the loss? my train dataset has 660 images and test 166 I dont Know if there are enough images
I really aprecciate your help ....
Loss values of ssd_mobilenet can be different from faster_rcnn. From EdjeElectronics' TensorFlow Object Detection Tutorial:
For my training on the Faster-RCNN-Inception-V2 model, it started at
about 3.0 and quickly dropped below 0.8. I recommend allowing your
model to train until the loss consistently drops below 0.05, which
will take about 40,000 steps, or about 2 hours (depending on how
powerful your CPU and GPU are). Note: The loss numbers will be
different if a different model is used. MobileNet-SSD starts with a
loss of about 20, and should be trained until the loss is consistently
under 2.
For more information: https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10#6-run-the-training
The SSD Mobilnet architecture demands additional training to suffice
the loss accuracy values of the R-CNN model, however, offers
practicality, scalability, and easy accessibility on smaller devices
which reveals the SSD model as a promising candidate for further
assessment (Fleury and Fleury, 2018).
For more information: Fleury, D. & Fleury, A. (2018). Implementation of Regional-CNN and SSD machine learning object detection architectures for the real time analysis of blood borne pathogens in dark field microscopy. MDPI AG.
I would recommend you to take 15%-20% images for testing which cover all the variety present in training data. As you said you have 650+ images for training and 150+ for testing. That is roughly 25% of testing images. It looks like you have enough images to start with. I know the more, the merrier but make sure your model also has sufficient data to learn from!
Resizing the images does not contribute to the loss. It makes sure there is consistency across all images for the model to recognize them without bias. The loss has nothing to do with image resizing as long as every image is resized identically.
You have to make stops and recover checkpoints again and again if you want your model to be perfectly fit. Usually, you can get away with good accuracy by re-training the ssd mobilenet until the loss consistently becomes under 1.Ideally we want the loss to be as lower as possible but we want to make sure the model is not over-fitting. It is all about trial and error. (Loss between 0.5 and 1 seems to be doing the job well but again it all depends on you.)
The reason I think your model is underperforming is due to the fact that you have variety of testing data and not enough training data to suffice.
The model has not been given enough knowledge in training data to make the model learn for new variety of testing data. (For example : Your test data has some images of new angles of buildings which are not sufficiently present in training data). In that case, I recommend you to put variety of all images in training data and then picking images to test making sure you still have sufficient training data of new postures. That's why I recommend you to take 15%-20% test data.

Is that a good idea to use transfer learning in real world projects?

SCENARIO
What if my intention is to train for a dataset of medical images and I have chosen a coco pre-trained model.
My Doubts
1 Since I have chosen medical images there is no point of train it on COCO dataset, right? if so what is a possible solution to do the same?
2 Adding more layers to a pre-trained model will screw the entire model? with classes of around 10 plus and 10000's of training datasets?
3 Without train from scratch what are the possible solutions , like fine-tuning the model?
PS - let's assume this scenario is based on deploying the model for business purposes.
Thanks-
Yes, it is a good idea to reuse the Pre-Trained Models or Transfer Learning in Real World Projects, as it saves Computation Time and as the Architectures are proven.
If your use case is to classify the Medical Images, that is, Image Classification, then
Since I have chosen medical images there is no point of train it on
COCO dataset, right? if so what is a possible solution to do the same?
Yes, COCO Dataset is not a good idea for Image Classification as it is efficient for Object Detection. You can reuse VGGNet or ResNet or Inception Net or EfficientNet. For more information, refer TF HUB Modules.
Adding more layers to a pre-trained model will screw the entire model?
with classes of around 10 plus and 10000's of training datasets?
No. We can remove the Top Layer of the Pre-Trained Model and can add our Custom Layers, without affecting the performance of the Pre-Trained Model.
Without train from scratch what are the possible solutions , like
fine-tuning the model?
In addition to using the Pre-Trained Models, you can Tune the Hyper-Parameters of the Model (Custom Layers added by you) using HParams of Tensorboard.

When should I stop the object detection model training while mAP are not stable?

I am re-training the SSD MobileNet with 900 images from the Berkeley Deep Drive dataset, and eval towards 100 images from that dataset.
The problem is that after about 24 hours of training, the totalloss seems unable to go below 2.0:
And the corresponding mAP score is quite unstable:
In fact, I have actually tried to train for about 48 hours, and the TotoalLoss just cannot go below 2.0, something ranging from 2.5~3.0. And during that time, mAP is even lower..
So here is my question, given my situation (I really don't need any "high-precision" model, as you can see, I pick 900 images for training and would like to simply do a PoC model training/predication and that's it), when should I stop the training and obtain a reasonably performed model?
indeed for detection you need to finetune the network, since you are using SSD, there are already some sources out there:
https://gluon-cv.mxnet.io/build/examples_detection/finetune_detection.html (This one specifically for an SSD Model, uses mxnet but you can use the same with TF)
You can watch a very nice finetuning intro here
This repo has a nice fine tuning option enabled as long as you write your dataloader, check it out here
In general your error can be attributed to many factors, the learning rate you are using, the characteristics of the images themselves (are they normalized?) If the ssd network you are using was trained with normalized data and you don't normalize to retrain then you'll get stuck while learning. Also what learning rate are they using?
From the model zoo I can see that for SSD there are models trained on COCO
And models trained on Open Images:
If for example you are using ssd_inception_v2_coco, there is a truncated_normal_initializer in the input layers, so take that into consideration, also make sure the input sizes are the same that the ones you provide to the model.
You can get very good detections even with little data if you also include many augmentations and take into account the rest of the things I mentioned, more details on your code would help to see where the problem lies.

CNN : Fine tuning small network vs feature extracting from a big network

To elaborate : Under what circumstances would fine tuning all layers of a small network (say SqueezeNet) perform better than feature extracting or fine tuning only last 1 or 2 Convolution layer of a big network (e.g inceptionV4)?
My understanding is computing resource required for both is somewhat comparable. And I remember reading in a paper that extreme options i.e fine tuning 90% or 10% of network is far better compared to more moderate like 50%. So, what should be the default choice when experimenting extensively is not an option?
Any past experiments and intuitive description of their result, research paper or blog would be specially helpful. Thanks.
I don't have much experience in training models like SqueezeNet, but I think it is much easier to finetune only the last 1 or 2 layers of a big network: you don't have to extensively search for many optimal hyperparameters. Transfer learning works amazingly well out of the box with the LR finder and the cyclical learning rate from fast.ai.
If you want fast inference after the training, then it is preferable to train SqueezeNet. It might also be the case if the new task is very different from ImageNet.
Some intuition from http://cs231n.github.io/transfer-learning/
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.
New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier form the top of the network, which contains more dataset-specific features. Instead, it might work better to train the SVM classifier from activations somewhere earlier in the network.
New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune through the entire network.

Training Resnet deep neural network from scratch

I need to gain some knowledge about deep neural networks.
For a 'ResNet' very deep neural network, we can use transfer learning to train a model.
But Resnet has been trained over the ImageNet dataset. So their pre-trained weights can be used to train the model with another dataset. (for an example training a model for lung cancer detection with CT lung images)
I feels that this approach will be not accurate as the pre-trained weights has been completely trained over other objects but not with medical data.
Instead of transfer learning, is it possible to train the resnet from scratch? (but the available number of images to train the resnet is around 1500) . Is it something possible to do with a normal computer.
Can someone please share your valuable ideas with me
is it possible to train the resnet from scratch?
Yes, it is possible, but the amount of time one needs to get to good accuracy greatly depends on the data. For instance, training original ResNet-50 on a NVIDIA M40 GPU took 14 days (10^18 single precision ops). The most expensive operation in CNN is the convolution in the early layers.
ImageNet contains 14m 226x226x3 images. Since your dataset is ~10000x smaller, each epoch will take ~10000x less ops. On top of that, if you pass gray-scale instead of RGB images, the first convolution will take 3x less ops. Likewise spatial image size affects the training time as well. Training on smaller images can also increase the batch size, which usually speeds things up due to vectorization.
All in all, I estimate that a machine with a single consumer GPU, such as 1080 or 1080ti, can train ~100 epochs of ResNet-50 model in a day. Obviously, training on a 2-GPU machine would be even faster. If that is what you mean by a normal computer, the answer is yes.
But since your dataset is very small, there's a big chance of overfitting. This looks like the biggest issue that your approach faces.