I've trained an SSD MobileNet v2 320x320 model for around 4k steps, which produced quite a few checkpoints that are saved in my training folder. The issue I am experiencing now is that the evaluation only covers the latest checkpoint, but I'd like to evaluate all of them at once.
Ideally I would like to see the results in TensorBoard, with the validation accuracy (mAP) of the different checkpoints shown as a graph. It does that already, but just for the one checkpoint.
I have tried running my evaluation code to generate a graph of my mAP, but it shows the mAP as a single dot.
Each checkpoint is a snapshot of your model at an earlier point in training. The mAP graph you see in TensorBoard passes through the same values as the dots you get when you run the evaluation once on a single checkpoint, because the checkpoints are not different models but the same model at different times during training. So the graph for the latest model is what you need.
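If you still want each checkpoint evaluated separately, the first step is to enumerate them in order. Here is a minimal sketch, assuming TF2-style checkpoint files named ckpt-<N>.index in the training folder (the helper name list_checkpoint_prefixes is made up for illustration and is not part of the Object Detection API):

```python
import re
from pathlib import Path


def list_checkpoint_prefixes(model_dir):
    """Return (number, prefix) pairs for every 'ckpt-N.index' file, sorted by N.

    Assumption: checkpoints follow the 'ckpt-<N>' naming pattern.
    """
    found = []
    for index_file in Path(model_dir).glob("ckpt-*.index"):
        match = re.fullmatch(r"ckpt-(\d+)\.index", index_file.name)
        if match:
            prefix = str(index_file.with_suffix(""))  # drop the ".index" suffix
            found.append((int(match.group(1)), prefix))
    return sorted(found)
```

Each returned prefix is what you would restore before running your evaluation on it. How that is wired into the evaluation job depends on your setup and is not shown here.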
How can I understand which layers are frozen when fine-tuning a detection model from the TensorFlow 2 Model Zoo?
I have already successfully set the path for fine_tune_checkpoint and set fine_tune_checkpoint_type: detection, and in the proto file I have read that "detection" means:
// 2. "detection": Restores the entire feature extractor.
The only parts of the full detection model that are not restored are the box and class prediction heads.
This option is typically used when you want to use a pre-trained detection model
and train on a new dataset or task which requires different box and class prediction heads.
I didn't really understand what that means. Does "restored" mean "frozen" in this context?
As I understand it, the TensorFlow 2 Object Detection API currently does not freeze any layers when training from a fine-tune checkpoint. There is an open issue requesting support for specifying which layers to freeze in the pipeline config. If you look at the training step function, you can see that all trainable variables are used when applying gradients during training.
Restored here means that the model weights are copied from the checkpoint to be used as a starting point for training. Frozen would mean that the weights are not changed (i.e. no gradient is applied) during training.
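The distinction can be illustrated with a small framework-free sketch. It is a toy, not the Object Detection API's code: the names restore, sgd_step and frozen are hypothetical, and each "variable" is a single float.

```python
def restore(weights, checkpoint):
    """Copy checkpoint values into the model for every name found in both."""
    for name in weights:
        if name in checkpoint:
            weights[name] = checkpoint[name]


def sgd_step(weights, grads, lr, frozen=()):
    """Apply one gradient step to every variable not listed in `frozen`."""
    for name, grad in grads.items():
        if name not in frozen:
            weights[name] -= lr * grad


checkpoint = {"backbone": 1.0}            # the prediction head is not in the checkpoint
grads = {"backbone": 0.5, "head": 0.5}

# Restored only: the backbone starts from the checkpoint, then both variables train.
weights = {"backbone": 0.0, "head": 0.0}
restore(weights, checkpoint)
sgd_step(weights, grads, lr=0.5)
restored_only = dict(weights)

# Restored and frozen: the backbone keeps its checkpoint value, only the head trains.
weights = {"backbone": 0.0, "head": 0.0}
restore(weights, checkpoint)
sgd_step(weights, grads, lr=0.5, frozen=("backbone",))
restored_and_frozen = dict(weights)
```

In the first case the backbone value moves away from the checkpoint after one step; in the second it stays exactly where the checkpoint put it.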
First of all, I want to state that I am familiar with the benefits of transfer learning. Moreover, I am able to train a pretrained model from the model zoo on my dataset. But for research purposes I want to train my model from scratch, without transfer learning.
I want to adapt the Faster R-CNN ResNet-101 implementation from TensorFlow's Object Detection API to my dataset. If I use one of the pretrained models, training goes as expected and the loss is always in a 'normal' range (never above about 6). But if I do not use transfer learning, the loss very frequently jumps to extremely high values (about 80,000,000), although between those spikes the loss is in the normal range. In addition, I do not see any predictions of the network on images in TensorBoard. It seems like the network does not make any predictions at all. The only thing I change is to comment out these two lines in the model.config file:
# fine_tune_checkpoint: 'path'
# from_detection_checkpoint: true
I tried a lot of things to find the reason: I changed the optimizer, changed the learning rate, used gradient clipping, changed the initializer, and used different machines to train on, but nothing helps. Moreover, I inspected my label_map as well as my record file. To make sure those files are correct, I repeated the steps above using the Pascal VOC dataset with the record-creation script and the label map from the API, but even with this code from the Object Detection API, without any code changes, the loss explodes (TensorFlow Object Detection API's own inputs).
I usually just use train.py to train with the TensorFlow Object Detection API. However, I read at https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/discussion/68581 that you can also use model_main.py to train your model and see real-time plots and images in TensorBoard.
How exactly do you use model_main.py with TensorBoard?
What is the difference between train.py and model_main.py?
In TensorBoard, model_main.py produces graphs similar to those of train.py, but with model_main.py the performance of the model on the evaluation dataset is measured too.
model_main.py is the newer version in TensorFlow Object Detection API. It is used for training and also evaluating the model. When using train.py we have to run a separate program for evaluation (eval.py), while model_main.py executes both. For example, training code will be running for a certain time (for example 5 mins or every 2000 steps), then the training will be stopped and evaluation will be run. After the evaluation has finished, the training will be continued again. Then the same cycle is repeated again.
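The alternating cycle described above can be sketched as a plain loop. This is only an illustration of the schedule, not how model_main.py is implemented; train_step and evaluate are hypothetical callables supplied by the caller.

```python
def train_and_evaluate(train_step, evaluate, total_steps, eval_every):
    """Alternate training and evaluation; return a list of (step, metric) pairs."""
    history = []
    for step in range(1, total_steps + 1):
        train_step(step)
        # Pause training at every interval (and at the very end) to evaluate.
        if step % eval_every == 0 or step == total_steps:
            history.append((step, evaluate(step)))
    return history
```

Because the metric is recorded at several steps of one run, plotting the history gives a curve rather than a single point.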
The newer version of the TensorFlow Object Detection API offers model_main.py, which both trains and evaluates the model using the configured pre-conditions and preprocessing, whereas older versions of the API use train.py for training and eval.py for evaluation.
Reference: https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10
I've successfully trained an Inception v3 model on 200 custom classes from scratch. Now I have ckpt files in my output dir. How do I use those models to run inference?
Preferably, load the model on GPU and pass images whenever I want while the model persists on GPU. Using TensorFlow serving is not an option for me.
Note: I've tried to freeze these models but failed to specify the output_nodes correctly while freezing. I used ImagenetV3/Predictions/Softmax but couldn't use it with feed_dict, as I couldn't get the required tensors from the frozen model.
The documentation on the TF site and repo is poor on this inference part.
It sounds like you're on the right track. You don't really do anything different at inference time from what you do at training time, except that you don't ask it to compute the optimizer step, and because of that no weights are ever updated.
The save and restore guide in tensorflow documentation explains how to restore a model from checkpoint:
https://www.tensorflow.org/programmers_guide/saved_model
You have two options when restoring a model. Either you build the ops again from code (usually a build_graph() method) and then load the variables from the checkpoint, which is the method I use most commonly, or you load both the graph definition and the variables from the checkpoint, if the graph definition was saved with it.
Once you've loaded the graph, you create a session and ask the graph to compute just the output. The tensor ImagenetV3/Predictions/Softmax looks right to me (I'm not immediately familiar with the particular model you're working with). You will need to pass in the appropriate inputs: your images, and possibly whatever other parameters the graph requires. Sometimes an is_train boolean is needed, along with other such details.
Since you aren't asking tensorflow to compute the optimizer operation no weights will be updated. There's really no difference between training and inference other than what operations you request the graph to compute.
Tensorflow will use the GPU by default just as it did with training, so all of that is pretty much handled behind the scenes for you.
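The point that training and inference differ only in which operations you request can be shown with a framework-free toy. TinyModel is a hypothetical stand-in, not TensorFlow code: predict plays the role of fetching only the output tensor, and train_step the role of also fetching the optimizer op.

```python
class TinyModel:
    """One-weight linear model, y = weight * x."""

    def __init__(self, weight):
        self.weight = weight

    def predict(self, x):
        """Forward pass only: reads the weight, never writes it."""
        return self.weight * x

    def train_step(self, x, target, lr):
        """Forward pass plus an update: the only place the weight changes."""
        error = self.predict(x) - target
        self.weight -= lr * 2 * error * x  # gradient of the squared error
        return error ** 2
```

As long as the object stays alive you can call predict as often as you like and the weight stays put, which mirrors keeping one session open and running only the output tensor.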
I've trained a seq2seq model for machine translation (DE-EN) and saved the trained model checkpoint. Now I'd like to fine-tune this checkpoint on some domain-specific data samples that were not seen in the previous training phase. Is there a way to achieve this in TensorFlow, for example by modifying the embedding matrix somehow?
I couldn't find any relevant papers or works addressing this issue.
Also, I'm aware that the vocabulary files need to be updated according to the new sentence pairs. But do we then have to start training from scratch again? Isn't there an easy way to dynamically update the vocabulary files and embedding matrix according to the new samples and continue training from the latest checkpoint?