Why does the loss explode during training from scratch? - Tensorflow Object Detection Models - tensorflow

First of all, I want to state that I am familiar with the benefits of transfer learning. Moreover, I am able to train a pretrained model from the model zoo on my dataset. But for research purposes I want to train my model from scratch, without transfer learning.
I want to adapt the Faster R-CNN ResNet-101 implementation from TensorFlow's Object Detection API to my dataset. If I use one of the pretrained models, training goes as expected and the loss always stays in a 'normal' range (never above about 6). But if I do not use transfer learning, the loss very frequently jumps to extremely high values (about 80,000,000), although between those spikes it is in the normal range. In addition, I do not see any predictions of the network on images in TensorBoard; it seems like the network does not make any predictions at all. The only thing I change is to comment out these two lines in the model.config file:
# fine_tune_checkpoint: 'path'
# from_detection_checkpoint: true
I tried a lot of things to find the reason: I changed the optimizer, changed the learning rate, used gradient clipping, changed the initializer, and trained on different machines, but nothing helps. I also inspected my label_map as well as my record file. To make sure those files are correct, I repeated the steps above using the Pascal VOC dataset, with the record-creation script and the label map that ship with the API. Even with this unmodified code from the Object Detection API, the loss explodes (Tensorflow Object Detection API own inputs).


Tensorflow Object Detection API - "fine_tune" vs "detection" vs "classification"

I am following this tutorial: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html
It contains the following snippet in the file pipeline.config:
fine_tune_checkpoint_type: "detection" # Set this to "detection" since we want to be training the full detection model
Further investigation leads to the following discoveries:
There are at least 3 options for the field fine_tune_checkpoint_type: fine_tune, detection and classification.
Not all models from the model zoo allow all options.
My questions are:
What does each of fine_tune, detection and classification mean in this context, and more importantly, when is it appropriate to use each one?
How do I tell which options are compatible with models in the model zoo?
Ultimately I wish to do transfer learning - e.g. take an existing trained model and train it to draw boxes for one or more novel classes.
These options control how checkpoints are restored; the description comes from the API's own documentation.
I copy the interesting part here:
This option controls how variables are restored from the (pre-trained) fine_tune_checkpoint. For TF2 models, 3 different types are supported:
"classification": Restores only the classification backbone part of the feature extractor. This option is typically used when you want to train a detection model starting from a pre-trained image classification model, e.g. a ResNet model pre-trained on ImageNet.
"detection": Restores the entire feature extractor. The only parts of the full detection model that are not restored are the box and class prediction heads. This option is typically used when you want to use a pre-trained detection model and train on a new dataset or task which requires different box and class prediction heads.
"full": Restores the entire detection model, including the feature extractor, its classification backbone, and the prediction heads. This option should only be used when the pre-training and fine-tuning tasks are the same. Otherwise, the model's parameters may have incompatible shapes, which will cause errors when attempting to restore the checkpoint. For more details about this parameter, see the restore_map (TF1) or restore_from_object (TF2) function documentation in the /meta_architectures/*meta_arch.py files.
I guess fine_tune has since been replaced by "full". Based on your needs, the right choice appears to be "detection". To find out which models support which options, look, as indicated above, at the restore_from_object function definition in the relevant /meta_architectures/*meta_arch.py file.
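If you want to switch this field from a script, here is a minimal sketch. It assumes the config is handled as plain text and that the field is already present in it. The helper name set_fine_tune_checkpoint_type and the sample config string are hypothetical and not part of the Object Detection API; the three allowed values are the TF2 ones quoted above.

```python
import re

# The three TF2 values quoted from the documentation above.
VALID_TYPES = {"classification", "detection", "full"}

# Matches a line such as:   fine_tune_checkpoint_type: "classification"
_FIELD_RE = re.compile(
    r'^(\s*fine_tune_checkpoint_type\s*:\s*)"[^"]*"', re.MULTILINE
)


def set_fine_tune_checkpoint_type(config_text: str, checkpoint_type: str) -> str:
    """Return config_text with fine_tune_checkpoint_type set to checkpoint_type.

    Hypothetical helper: it edits the text with a regex and does not parse
    the protobuf, so it only works if the field already exists.
    """
    if checkpoint_type not in VALID_TYPES:
        raise ValueError(f"unsupported checkpoint type: {checkpoint_type!r}")
    new_text, count = _FIELD_RE.subn(
        lambda m: f'{m.group(1)}"{checkpoint_type}"', config_text
    )
    if count == 0:
        raise ValueError("fine_tune_checkpoint_type not found in config")
    return new_text


# Made-up config fragment, for illustration only.
sample = (
    "train_config {\n"
    '  fine_tune_checkpoint: "path/to/checkpoint"\n'
    '  fine_tune_checkpoint_type: "classification"\n'
    "}\n"
)
print(set_fine_tune_checkpoint_type(sample, "detection"))
```

Passing the old fine_tune value raises a ValueError here, which is deliberate: the sketch only accepts the values listed in the quoted documentation.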

How to best transfer learning using Dopamine for Reinforcement Learning?

I am using Google's Dopamine framework to train a specific reinforcement learning use-case. I am using an auto-encoder to pre-train the convolutional layers of the Deep Q Network and then transfer those pre-trained weights into the final network.
To that end, I have created a separate model (in this case an auto-encoder), which I train and then save together with its weights.
The DQN model is created using Keras's model sub-classing method, and the model used to save the trained convolutional layers' weights was built using the Sequential API. My issue arises when I try to load the pre-trained weights into my final DQN model. Depending on whether I use the load_model() or the load_weights() functionality from Tensorflow's API, I get two different overall behaviors of my network, and I would like to understand why. Specifically, I have the two following scenarios:
1) Loading the weights into the final model with the load_weights() method. The weights are those of the encoder plus one additional layer (added just before saving the weights) to fit the architecture of the final network implemented in Dopamine, where they are loaded.
2) First loading the saved model with load_model() and then, when defining the new model in the __init__() method, extracting the relevant layers from the loaded model and using them for the final model.
Overall, I would expect the two approaches to yield similar results with regard to the average reward achieved per episode when I use the same pre-trained weights. However, the two approaches differ (approach 1 yields a higher average reward than approach 2, despite using the same pre-trained weights) and I don't understand why.
Furthermore, in order to validate this behavior, I have tried loading random weights with the two aforementioned approaches to see whether the behavior changes. In both cases, whichever of the two loading methods I use, I end up with behavior very similar to the respective case with the trained weights. It seems like the pre-trained weights have no effect on the resulting training behavior in either case. This might be irrelevant to the issue I am trying to investigate here, though, as it is also possible that the pre-trained weights simply don't offer any benefit overall.
Any thoughts and ideas on this would be much appreciated.

How to continue training an object detection model using Tensorflow Object Detection API?

I'm using Tensorflow Object Detection API to train an object detection model using transfer learning. Specifically, I'm using ssd_mobilenet_v1_fpn_coco from the model zoo, and using the sample pipeline provided, having of course replaced the placeholders with actual links to my training and eval tfrecords and labels.
I was able to successfully train a model on my ~5000 images (and corresponding bounding boxes) using the above pipeline (I'm mainly using Google's ML Engine on TPU, if relevant).
Now I have prepared an additional ~2000 images and would like to continue training my model with those new images, without restarting from scratch (training the initial model took ~6h of TPU time). How can I do that?
You have two options; in both you need to change the input_path of the train_input_reader to point to your new dataset:
1) When specifying a checkpoint to fine-tune in the training configuration, specify the checkpoint of your trained model:
fine_tune_checkpoint: <path_to_your_checkpoint>
fine_tune_checkpoint_type: "detection"
load_all_detection_checkpoint_vars: true
2) Simply keep using the same configuration (except for the train_input_reader) with the same model_dir as your previous model. That way, the API will create a graph and check whether a checkpoint already exists in model_dir and fits the graph. If so, it will restore it and continue training from it.
Edit: fine_tune_checkpoint_type was previously set as true by mistake, while it should be "detection" or "classification" in general, and "detection" in this specific case. Thanks Krish for noticing.
I haven't retrained the object detection model on a new dataset, but it looks like increasing the number of training steps (train_config.num_steps in the config file) and adding the new images to the tfrecord files should be enough for that.
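To check how much training is left when reusing the same model_dir, something like the following might help. It is only a sketch, not what the API itself does: it assumes TF1-style checkpoint files named model.ckpt-<global_step>.index in model_dir, and that num_steps counts total global steps rather than additional ones (which would explain why it may need to be increased, as suggested above). The function names are made up for illustration.

```python
import os
import re

# Assumed TF1-style checkpoint index file name: model.ckpt-<global_step>.index
_CKPT_INDEX_RE = re.compile(r"^model\.ckpt-(\d+)\.index$")


def latest_checkpoint_step(model_dir):
    """Return the highest global step found in model_dir, or None if there is none."""
    steps = []
    for name in os.listdir(model_dir):
        match = _CKPT_INDEX_RE.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


def remaining_steps(model_dir, num_steps):
    """Return how many steps are left until num_steps, never less than 0.

    Assumption: num_steps is the total number of global steps, so a model_dir
    that already holds a checkpoint at num_steps has nothing left to train.
    """
    latest = latest_checkpoint_step(model_dir)
    done = 0 if latest is None else latest
    return max(num_steps - done, 0)
```

If remaining_steps returns 0 for your current num_steps, continuing in the same model_dir would presumably stop right away, so num_steps would have to be raised first.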

TensorFlow model saving to be approached differently during training Vs. deployment?

Assume that I have a CNN which I am training on some dataset. The most important part of the model is the CNN architecture.
Now, when I write the code, I define the model structure in a Python class. However, outside that class, I define a number of other nodes such as the loss, the accuracy, a tf.Variable to keep count of epochs, and so on.
During training, in order to resume properly, I'd like to save all these nodes (e.g. loss, epoch variable etc.), and not just the CNN structure.
However, once I am done with training, I would like to save only the CNN architecture and no nodes for loss, accuracy etc. This would give people using my model the freedom to write their own fine-tuning code.
How can I achieve this in TF code? Can someone show an example?
Do others also follow this approach to saving? I just want to know if my approach is right.

Tensorflow Object Detection API - What's actually test.record being used for?

I have a few doubts about the Tensorflow Object Detection API. Hopefully someone can help me out... Before that, I need to mention that I am following what sentdex is doing, so basically the steps come from him.
First doubt: Why do we need test.record for training? What does it do during training?
Second doubt: Sentdex takes images from test.record to test the newly trained model. Doesn't the model already know those images, since they come from test.record?
Third doubt: In which situations do we need to activate dropout (in the .config file)?
1) It does nothing during training itself; you don't need it in order to train. But at a certain point the model begins to overfit: the loss on the training images continues to go down, but the accuracy on the test images stops improving and begins to decline. That is the moment to stop training, and to recognise it you need test.record.
2) The images were used only to evaluate the model during training, not to train the net.
3) You do not need to activate it, but with dropout you usually achieve higher accuracy, because it helps prevent the net from overfitting.
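Point 1) can be sketched as a simple early-stopping check. This is an illustration only, assuming you collect one evaluation score per evaluation run on test.record (e.g. mAP, higher is better). The function name and the patience and min_delta parameters are hypothetical and not part of the Object Detection API.

```python
def should_stop(eval_scores, patience=3, min_delta=0.0):
    """Return True if the last `patience` evaluations brought no improvement.

    eval_scores: evaluation scores in chronological order, higher is better.
    patience:    number of consecutive evaluations without a new best score
                 that we tolerate before stopping.
    min_delta:   minimum increase over the best score that counts as an
                 improvement.
    """
    best = float("-inf")
    evals_since_best = 0
    for score in eval_scores:
        if score > best + min_delta:
            # New best score: reset the counter.
            best = score
            evals_since_best = 0
        else:
            evals_since_best += 1
    return evals_since_best >= patience


# Made-up scores: accuracy on the test images peaks, then declines.
history = [0.41, 0.55, 0.62, 0.61, 0.60, 0.58]
if should_stop(history):
    print("Eval score stopped improving; consider stopping training.")
```

The training loss is deliberately not used here: as described in 1), it can keep going down while the score on the test images gets worse.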