Tensorflow Object Detection API w/ TPU Training - Display more granular Tensorboard plots - tensorflow

I've been following this tutorial on the Tensorflow Object Detection API, and I've successfully trained my own object detection model using Google's Cloud TPUs.
However, the problem is that on Tensorboard, the plots I'm seeing only have 2 data points each (so it just plots a straight line), like this:
...whereas I want to see more "granular" plots like these below, which are much more detailed:
The tutorial I've been following acknowledges that this issue is caused by the fact that TPU training requires very few steps to train:
Note that these graphs only have 2 points plotted since the model
trains quickly in very few steps (if you’ve used TensorBoard before
you may be used to seeing more of a curve here)
I tried adding save_checkpoints_steps=50 in the file model_tpu_main.py (see code fragment below), and when I re-ran training, I was able to get a more granular plot, with 1 data point every 300 steps or so.
config = tf.contrib.tpu.RunConfig(
    # I added this line below:
    save_checkpoints_steps=50,
    master=tpu_grpc_url,
    evaluation_master=tpu_grpc_url,
    model_dir=FLAGS.model_dir,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_shards))
However, my training job is actually saving a checkpoint every 100 steps, rather than every 300 steps. Looking at the logs, my evaluation job is running every 300 steps. Is there a way I can make my evaluation job run every 100 steps (whenever there's a new checkpoint) so that I can get more granular plots on Tensorboard?

Code which addresses this issue is explained by a technical lead for Google Cloud Platform in a Medium blog post. Alternatively, go directly to the GitHub code.
The train_and_evaluate function (about 81 lines) defines a TPUEstimator, a train_input_fn and an eval_input_fn. It then iterates over the training steps, calling estimator.train and estimator.evaluate in each iteration. The metrics can be defined in the model_fn, which is called image_classifier. Note that adding tf.summary calls in the model function currently has no effect, since the TPU does not support custom summaries:
"TensorBoard summaries are a great way see inside your model. A minimal set of basic summaries are automatically recorded by the TPUEstimator, to event files in the model_dir. Custom summaries, however, are currently unsupported when training on a Cloud TPU. So while the TPUEstimator will still run locally with summaries, it will fail if used on a TPU." (source)
If summaries are important it might be more convenient to switch to training on GPU.
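As a rough illustration of the loop described above (a minimal sketch, not the exact code from the blog post; image_classifier, train_input_fn, eval_input_fn and the step counts are placeholders, and tpu_grpc_url / FLAGS are the same as in the question):
import tensorflow as tf

run_config = tf.contrib.tpu.RunConfig(
    master=tpu_grpc_url,
    model_dir=FLAGS.model_dir,
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100))

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=image_classifier,      # your model_fn, including the eval metrics
    config=run_config,
    use_tpu=True,
    train_batch_size=64,
    eval_batch_size=64)

# Alternate training and evaluation so TensorBoard gets a data point every 100 steps.
total_steps = 3000
eval_every = 100
for _ in range(0, total_steps, eval_every):
    estimator.train(input_fn=train_input_fn, steps=eval_every)
    metrics = estimator.evaluate(input_fn=eval_input_fn, steps=50)
    tf.logging.info('eval metrics: %s', metrics)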
Personally I think writing this code is quite a hassle for something which should be handled by the API. Please update this answer if better solutions exist! I'm looking forward to it.

Set save_summary_steps in RunConfig to 100, so you get the statistics you want.
Also set iterations_per_loop to 100 so that the training loop doesn't run more steps between summaries.
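For example, reusing the RunConfig from the question (a sketch; the values are illustrative):
config = tf.contrib.tpu.RunConfig(
    save_checkpoints_steps=100,
    save_summary_steps=100,       # write summaries every 100 steps
    master=tpu_grpc_url,
    evaluation_master=tpu_grpc_url,
    model_dir=FLAGS.model_dir,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=100,  # don't run more than 100 steps per inner TPU loop
        num_shards=FLAGS.num_shards))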
p.s. I hope you realize that checkpointing is very slow. You are probably raising the cost of your job just for the sake of a pretty graph :)

You can try adding throttle_secs=100 to the tf.estimator.EvalSpec constructor here. The default is 600 seconds, so otherwise evaluation won't run more often than every 10 minutes.
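For example (a sketch; eval_input_fn is a placeholder for your evaluation input function):
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=None,           # evaluate on the full eval set
    throttle_secs=100)    # re-evaluate at most every 100 seconds instead of 600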

Related

Tensorflow Object Detection API low loss low confidence - checkpoint not saving weights

A few months ago I trained a custom object detector on stanford dogs using efficientdet_d0_512x512 and only 2 classes of dogs with success. Not changing the code, I tried doing that again and the model was pumping out really low confidence scores (<1%), even though the loss in the training process was low.
I then tried resuming the training using the checkpoint generated after the initial training and the loss starts high as if the checkpoint did not exist.
I also tried working with faster rcnn, getting the same results. Here's the code:
https://colab.research.google.com/drive/1fE3TYRyRrvKI2sVSQOPUaOzA9JItkuFL?usp=sharing
I'm thinking that the exporting is not working and the trained weights are not saved. Any ideas?
It seems indeed that your checkpoint cannot be loaded, as shown by the many warnings you're getting:
WARNING:tensorflow:Unresolved object in checkpoint:
(root). .......
After some research I found this issue on the model repository of the Tensorflow Object Detection API: https://github.com/tensorflow/models/issues/8892#issuecomment-680207038.
Basically it's saying that in your pipeline.config you should change:
fine_tune_checkpoint_type: "detection"
to:
fine_tune_checkpoint_type: "fine_tune"
so that the checkpoint loaded will be used in fine-tuning and the number of classes doesn't cause issues between your configuration file and the one you're starting with.
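In context, the relevant part of pipeline.config would look roughly like this (the path and batch size are placeholders):
train_config {
  batch_size: 8
  fine_tune_checkpoint: "path/to/efficientdet_d0/checkpoint/ckpt-0"
  fine_tune_checkpoint_type: "fine_tune"   # was "detection"
  num_steps: 800
}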
It also suggests being careful about whether your model_dir (where the custom checkpoints and tensorboard events will be saved) is empty or not, for the same reasons, but it seems that you're good with that on your colab notebook.
Also, on a different subject, you should be careful with your learning rate. Right now you're using a cosine_decay_learning_rate which requires some warmup steps, 2500 in that case. However you are training for only 800 steps, so the warmup steps are not even completed when you stop the training! If for some reason you want to keep the number of steps low, you should change your learning rate to a manual_step_learning_rate or exponential_decay_learning_rate. Otherwise you should keep your training going for much longer.
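To make the mismatch concrete, the schedule in the config looks roughly like this (the exact learning rate and total step values are illustrative):
optimizer {
  momentum_optimizer {
    learning_rate {
      cosine_decay_learning_rate {
        learning_rate_base: 0.08      # illustrative value
        total_steps: 300000           # illustrative value
        warmup_learning_rate: 0.001   # illustrative value
        warmup_steps: 2500            # never finished if training stops after 800 steps
      }
    }
  }
}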
EDIT: After further investigation, the problem might be a bit deeper, according to this issue on github: https://github.com/tensorflow/models/issues/9229
You might want to keep an eye on this to see where this is going.

Tensorflow inference run time high on first data point, decreases on subsequent data points

I am running inference using one of the models from TensorFlow's object detection module. I'm looping over my test images in the same session and doing sess.run(). However, on profiling these runs, I realize the first run always has a higher time as compared to the subsequent runs.
I found an answer here, as to why that happens, but there was no solution on how to fix.
I'm deploying the object detection inference pipeline on an Intel i7 CPU. The time for one session.run(), for the 1st, 2nd, 3rd, and 4th image, looks something like this (in seconds):
1. 84.7132628
2. 1.495621681
3. 1.505012751
4. 1.501652718
Just a background on what all I have tried:
I tried using the TFRecords approach TensorFlow gave as a sample here. I hoped it would work better because it doesn't use a feed_dict. But since more I/O operations are involved, I'm not sure it'll be ideal. I tried making it work without writing to the disk, but always got some error regarding the encoding of the image.
I tried using TensorFlow datasets to feed the data, but I wasn't sure how to provide the input, since during inference I need to provide input for the "image_tensor" key in the graph. Any ideas on how to use this to provide input to a frozen graph?
Any help will be greatly appreciated!
TLDR: Looking to reduce the run time of inference for the first image - for deployment purposes.
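For context, this is roughly what my timing loop looks like (a simplified sketch; the frozen graph loading and the image list are placeholders):
import time
import numpy as np
import tensorflow as tf

with tf.Session(graph=detection_graph) as sess:    # detection_graph: the loaded frozen graph
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
    for i, image in enumerate(test_images):        # test_images: list of HxWx3 uint8 arrays
        start = time.time()
        sess.run(boxes, feed_dict={image_tensor: np.expand_dims(image, 0)})
        print('image {}: {:.3f} s'.format(i + 1, time.time() - start))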
Even though I have seen that the first inference takes longer, the difference (84 vs 1.5) shown there seems a bit unbelievable. Are you also counting the time to load the model inside this time metric? Could that account for the large time difference? Is the topology complex enough that this time difference could be justified?
My suggestions:
Try OpenVINO: see if the topology you are working on is supported in OpenVINO. OpenVINO is known to speed up inference workloads by a substantial amount due to its capability to optimize network operations. Also, the time taken to load an OpenVINO model is comparatively lower in most cases.
Regarding the TFRecords Approach, could you please share the exact error and at what stage you got the error?
Regarding Tensorflow datasets, could you please check out https://github.com/tensorflow/tensorflow/issues/23523 & https://www.tensorflow.org/guide/datasets. Regarding the "image_tensor" key in the graph, your original inference pipeline should give you some clues.
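As a rough sketch of how a tf.data pipeline could feed a frozen detection graph without a feed_dict (assuming the graph uses the standard "image_tensor:0" input name; the path and the image array are placeholders):
import tensorflow as tf

# Dataset over pre-loaded images (images: NxHxWx3 uint8 numpy array, placeholder).
dataset = tf.data.Dataset.from_tensor_slices(images).batch(1)
next_image = dataset.make_one_shot_iterator().get_next()

# Load the frozen graph and map its input tensor to the dataset output.
graph_def = tf.GraphDef()
with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as f:   # placeholder path
    graph_def.ParseFromString(f.read())

boxes, scores = tf.import_graph_def(
    graph_def,
    input_map={'image_tensor:0': next_image},
    return_elements=['detection_boxes:0', 'detection_scores:0'])

with tf.Session() as sess:
    while True:
        try:
            b, s = sess.run([boxes, scores])   # no feed_dict needed
        except tf.errors.OutOfRangeError:
            break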

Tensorflow object detection API not displaying global steps

I am new here. I recently started working with object detection and decided to use the Tensorflow object detection API. But, when I start training the model, it does not display the global step like it should, although it's still training in the background.
Details:
I am training on a server and accessing it using OpenSSH on Windows. I trained on a custom dataset that I created by collecting pictures and labeling them. I trained it using model_main.py. Also, until a couple of months back, the API was a little different, and only recently they changed to the latest version. For instance, earlier it used train.py for training instead of model_main.py. All the online tutorials I can find use train.py, so it might be a problem with the latest commit. But I can't find anyone else facing this problem.
Thanks in advance!
Add tf.logging.set_verbosity(tf.logging.INFO) after the import section of the model_main.py script. It will display a summary after every 100th step.
As Thommy257 suggested, adding tf.logging.set_verbosity(tf.logging.INFO) after the import section of model_main.py prints the summary after every 100 steps by default.
Further, to specify the frequency of the summary, change
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)
to
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, log_step_count_steps=k)
where it will print after every k steps.
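Put together, the relevant part of model_main.py would look roughly like this (a sketch; k=10 is just an example):
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)   # print the loss and global step to the console

# ...later, inside main(), replace the default RunConfig:
config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,
    log_step_count_steps=10)                # log the global step every 10 steps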
Regarding the recent change to model_main, the previous version is available in the "legacy" folder. I use train.py and eval.py from this legacy folder with the same functionality as before.

Deep Learning with TensorFlow on Compute Engine VM

I'm actually new to Machine Learning, but this topic is very interesting for me, so I'm using TensorFlow to classify some images from the MNIST dataset... I run this code on a Compute Engine (VM) on Google Cloud, because my computer is too weak for this. The code actually runs well, but the problem is that each time I log into my VM and run the same code, I need to wait while my CNN model trains before I can run some tests, experiment with my data, plot, or import some external images to improve my accuracy, etc.
Is there some way to save the result of training the model just once, somewhere, so that when I decide to log into the same VM tomorrow, for example, I don't have to wait anymore while my model trains? Is it possible to do this?
Or is there maybe another way to do something similar?
You can save a trained model in TensorFlow and then use it later by loading it; that way you only have to train your model once, and use it as many times as you want. To do that, you can follow the TensorFlow documentation on that topic, where you can find information on how to save and load the model. In short, you will have to use the SavedModelBuilder class to define the type and location of your saved model, and then add the MetaGraphs and variables you want to save. Loading the saved model for later usage is even easier, as you will only have to run a command pointing to the location of the file in which the model was exported.
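A minimal sketch of that save/load cycle (export_dir and the training graph are placeholders):
import tensorflow as tf

export_dir = '/path/to/saved_model'   # placeholder location

# Save once, after training has finished.
builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
with tf.Session(graph=train_graph) as sess:   # train_graph: the graph you trained
    # ... run your training here ...
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING])
builder.save()

# Load in a later session, e.g. the next day, without retraining.
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING], export_dir)
    # the trained variables are restored; run your tests and experiments here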
On the other hand, I would strongly recommend that you change your working environment in a way that is more profitable for you. In Google Cloud you have the Cloud ML Engine service, which might be a good fit for the type of work you are developing. It allows you to train your models and perform predictions without needing an instance running all the required software. I happen to have worked a little bit with TensorFlow recently, and at first I was also working with a virtualized instance, but after following some tutorials I was able to save some money by migrating my work to ML Engine, as you are only charged for usage. If you are using your VM only for that purpose, take a look at it.
You can of course consult all the available documentation, but as a first quickstart, if you are interested in ML Engine, I recommend you to have a look at how to train your models and how to get your predictions.

Sharing Queue between two graphs in tensorflow

Is it possible to share a queue between two graphs in TensorFlow? I'd like to do a kind of bootstrapping to select "hard negative" examples during training.
To speed up the process, I want separate threads for hard negative example selection, and for the training process. The hard negative selection is based on the evaluation of the current model, and it will load its graph from a checkpoint file. The training graph is run on another thread and writes the checkpoint file. The two graphs should share the same queue: the training graph will consume examples and the hard negative selection will produce them.
Currently there's no support for sharing state between different graphs in the open-source version of TensorFlow: each graph runs in a separate session, and each session uses an isolated set of devices.
However, it seems like it would be possible to achieve your goal using a queue in a single graph. Simply construct a queue (using e.g. tf.FIFOQueue) and use tf.import_graph_def() to import the graph from the checkpoint file into the current graph. Using the return_elements argument to tf.import_graph_def() you can specify the name of the tensor that will contain the negative examples, and then add a q.enqueue_many() operation to add them to your queue. You would then fork a thread to run the enqueue_many operation in a loop. In your training graph, you can use q.dequeue_many() to get a batch of negative examples, and use them as the input to your training process.
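A rough sketch of that single-graph setup (a sketch only; the GraphDef, tensor names, shapes and the training op are placeholders):
import threading
import tensorflow as tf

# Shared queue for "hard negative" examples (dtype/shape are placeholders).
queue = tf.FIFOQueue(capacity=1000, dtypes=[tf.float32], shapes=[[64, 64, 3]])

# Import the selection model and grab the tensor that yields a batch of negatives
# ('negatives:0' is a placeholder name for that tensor in your graph).
hard_negatives, = tf.import_graph_def(
    eval_graph_def, return_elements=['negatives:0'], name='selector')
enqueue_op = queue.enqueue_many(hard_negatives)

# Training side: dequeue a batch of negatives as input to the training graph.
negative_batch = queue.dequeue_many(32)
train_op = build_training_op(negative_batch)   # placeholder for your training graph

def selection_loop(sess):
    while True:                                # keep producing negatives in the background
        sess.run(enqueue_op)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    threading.Thread(target=selection_loop, args=(sess,), daemon=True).start()
    for _ in range(10000):
        sess.run(train_op)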