TensorFlow inference run time high on first data point, decreases on subsequent data points

I am running inference using one of the models from TensorFlow's object detection module. I'm looping over my test images in the same session and calling sess.run(). However, on profiling these runs, I notice that the first run always takes far longer than the subsequent runs.
I found an answer here as to why that happens, but there was no solution for how to fix it.
I'm deploying the object detection inference pipeline on an Intel i7 CPU. The time for one session.run() for the 1st, 2nd, 3rd, and 4th images looks something like this (in seconds):
1. 84.7132628
2. 1.495621681
3. 1.505012751
4. 1.501652718
Just some background on what I have tried:
I tried the TFRecords approach that TensorFlow provides as a sample here. I hoped it would work better because it doesn't use a feed_dict, but since more I/O operations are involved, I'm not sure it'll be ideal. I tried making it work without writing to disk, but I always got some error regarding the encoding of the image.
I tried using TensorFlow datasets to feed the data, but I wasn't sure how to provide the input, since during inference I need to provide input for the "image tensor" key in the graph. Any ideas on how to use this to provide input to a frozen graph?
Any help will be greatly appreciated!
TL;DR: I'm looking to reduce the inference run time for the first image, for deployment purposes.

Even though I have seen that the first inference takes longer, the difference shown here (84 s vs. 1.5 s) seems a bit hard to believe. Are you also counting the time to load the model inside this metric? Could that account for the large difference? Is the topology complex enough to justify it?
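One quick way to check is to time the graph loading and the per-image session.run() separately. A rough sketch follows; the frozen-graph path and tensor names are assumptions based on a typical object-detection export, not your exact pipeline:

import time
import numpy as np
import tensorflow as tf

start = time.time()
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("frozen_inference_graph.pb", "rb") as f:   # hypothetical path
    graph_def.ParseFromString(f.read())
graph = tf.Graph()
with graph.as_default():
    tf.compat.v1.import_graph_def(graph_def, name="")
print("graph load:", time.time() - start, "s")

with tf.compat.v1.Session(graph=graph) as sess:
    image_tensor = graph.get_tensor_by_name("image_tensor:0")
    boxes = graph.get_tensor_by_name("detection_boxes:0")
    for i in range(4):
        image = np.random.randint(0, 255, (1, 300, 300, 3), dtype=np.uint8)   # dummy image
        start = time.time()
        sess.run(boxes, feed_dict={image_tensor: image})
        print("run", i + 1, ":", time.time() - start, "s")   # run 1 includes one-time graph/kernel setup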
My suggestions:
Try OpenVINO: check whether the topology you are working on is supported in OpenVINO. OpenVINO is known to speed up inference workloads substantially due to its ability to optimize network operations. Also, the time taken to load an OpenVINO model is comparatively lower in most cases.
Regarding the TFRecords approach, could you please share the exact error and at what stage you got it?
Regarding TensorFlow datasets, please check out https://github.com/tensorflow/tensorflow/issues/23523 and https://www.tensorflow.org/guide/datasets. Regarding the "image tensor" key in the graph, your original inference pipeline should give you some clues.
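On the datasets question: one hedged option is to wire the dataset's output directly into the frozen graph's "image_tensor" input via input_map, so no feed_dict is needed. A sketch, where the file paths, decode step, and tensor names are assumptions:

import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("frozen_inference_graph.pb", "rb") as f:   # hypothetical path
    graph_def.ParseFromString(f.read())

def load_image(path):
    return tf.io.decode_jpeg(tf.io.read_file(path), channels=3)   # uint8 HxWx3

with tf.Graph().as_default() as graph:
    dataset = tf.data.Dataset.from_tensor_slices(["img1.jpg", "img2.jpg"])   # hypothetical paths
    dataset = dataset.map(load_image).batch(1)
    images = tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()

    # Map the dataset output onto the graph's "image_tensor" input.
    tf.compat.v1.import_graph_def(graph_def, input_map={"image_tensor:0": images}, name="")
    boxes = graph.get_tensor_by_name("detection_boxes:0")

    with tf.compat.v1.Session(graph=graph) as sess:
        while True:
            try:
                print(sess.run(boxes))   # no feed_dict needed
            except tf.errors.OutOfRangeError:
                break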

Related

Model training using Kaggle kernels

I have been using Kaggle to work on a GPU. I have found that whenever I train my model again, my accuracy on the validation data changes; I don't get a consistent result. Is it because of the GPU I am accessing?
The initial weights of a network are generally random, and this randomness causes variation: training a second time will lead to a slightly different solution. To ensure exact reproducibility you need two things: i) your code to be deterministic, and ii) the same seed for the random number generator (RNG).
The random number generator is library-specific. For numpy and tensorflow you can set it like this at the start (!) of your program:
import numpy as np
import tensorflow as tf
np.random.seed(1337)      # seed NumPy's RNG
tf.random.set_seed(1337)  # seed TensorFlow's RNG
This means that repeatedly training the network should give you the same result. To get a different sample, you would have to initialize the RNG differently.
The non-deterministic aspect is important to consider for some optimizers. As an example, Adam was non-deterministic in some previous TensorFlow/CUDA versions, meaning that you were unable to reproduce the same run no matter how hard you tried.
That being said, slight differences can of course also be caused if your code executes cuDNN- or CUDA-optimized methods. Switching between versions will then also potentially cause minor differences.
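A fuller seeding routine along those lines might look like this; the op-determinism switch only exists in newer TensorFlow releases, so treat that line as optional:

import random
import numpy as np
import tensorflow as tf

def make_reproducible(seed=1337):
    random.seed(seed)           # Python's built-in RNG
    np.random.seed(seed)        # NumPy's RNG
    tf.random.set_seed(seed)    # TensorFlow's RNG
    # In recent TF versions you can additionally force deterministic op kernels:
    # tf.config.experimental.enable_op_determinism()

make_reproducible()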

Object Detection model stuck at low mAP

I am trying to reproduce the results of the SSDLite model reported in the MobileNetV2 paper (arXiv:1801.04381), which should achieve about 22.1% mAP on the COCO detection challenge. However, I am stuck at 9% mAP. This is strange behavior because the model does work somewhat, but it is still far off from the reported result. Can this much of a gap be caused by hyperparameters/optimizer choices (I am using Adam instead of SGD), or is it almost certain that there is a bug in my implementation?
It is also worth mentioning that the model successfully overfits a small subset of the training set, but on the whole training set the loss seems to reach a plateau fairly quickly.
Has anyone encountered a problem similar to this?
"Can this much of a gap be caused by hyperparameters/optimizer choices (I am using Adam instead of SGD), or is it almost certain that there is a bug in my implementation?"
Even small changes in the hyperparameters and a different optimizer choice can have a large impact on training and on the resulting precision. So your low precision is not necessarily due to a bug; it could also be due to a suboptimal parameterization.
"It is also worth mentioning that the model successfully overfits a small subset of the training set, but on the whole training set the loss seems to reach a plateau fairly quickly."
It seems like you have run into a local optimum that only works for a subset of your data, which could also be a pointer to a suboptimal parameterization.
As @Matias Valdenegro also mentioned, to reproduce the exact result you might have to use the same parameters as the original implementation.
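As an illustration, switching the optimizer family is a one-line change in Keras; the learning rate, momentum, and schedule below are placeholders rather than the paper's values:

import tensorflow as tf

adam = tf.keras.optimizers.Adam(learning_rate=1e-3)   # what the question uses (placeholder LR)

# Closer to typical detection training: SGD with momentum and a decaying learning rate.
schedule = tf.keras.optimizers.schedules.CosineDecay(initial_learning_rate=0.01,
                                                     decay_steps=100000)
sgd = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

# model.compile(optimizer=sgd, loss=ssdlite_loss)   # ssdlite_loss: your detection loss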

Training a segmentation model: 4 GPUs are working, one fills up, and I get "CUDA error: out of memory"

I'm trying to build a segmentation model, and I keep getting "CUDA error: out of memory". After investigating, I realized that all 4 GPUs are working, but one of them keeps filling up.
Some technical details:
My Model:
The model is written in PyTorch and has 3.8M parameters.
My Hardware:
I have 4 GPUs (Titan V) with 12 GB of RAM each.
I'm trying to understand why one of my GPUs is filling up, and what am I doing wrong.
Evidence:
As can be seen from the screenshot, all the GPUs are working, but one of them just keeps filling up until it hits its limit.
Code:
I'll try to explain what I did in the code:
First my model:
model = model.cuda()                              # move the model to the default GPU
model = nn.DataParallel(model, device_ids=None)   # device_ids=None means "use all visible GPUs"
Second, Inputs and targets:
inputs = inputs.to('cuda')
masks = masks.to('cuda')
Those are the lines that work with the GPUs. If I missed something and you need anything else, please let me know.
I'm feeling like I'm missing something so basic, that will affect not only this model but also the models in the future, I'll be more than happy for some help.
Thanks a lot!
Without knowing much of the details, I can say the following:
nvidia-smi is not the most reliable and up-to-date measurement mechanism
the PyTorch GPU allocator does not help either - it caches blocks of memory, artificially blowing up the reported usage (though that is not the issue here)
I believe there is still a "master" GPU which is the one data is loaded to directly (and then broadcast to other GPUs in DataParallel)
I don't know enough about PyTorch to answer reliably, but you can definitely check whether a single-GPU setup works with the batch size divided by 4, and perhaps whether you can load the model plus one batch at once (without processing it).
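A hedged sketch of that sanity check, reusing the model, inputs, and masks from the question (the batch size is just an example):

import torch

device = torch.device("cuda:0")             # one GPU, no DataParallel
model = model.to(device)

small = 8                                   # e.g. your original batch size divided by 4
inputs_small = inputs[:small].to(device)
masks_small = masks[:small].to(device)

with torch.no_grad():                       # first just check that model + batch fit in memory
    outputs = model(inputs_small)
print(torch.cuda.memory_allocated(device) / 1024 ** 2, "MiB allocated")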

How to fix incorrect guess in image recognition

I'm very new to this stuff, so please bear with me. I followed a quick, simple video about image recognition/classification on YouTube, and the program could indeed classify the image with a high percentage. But then I have some other images that were incorrectly classified.
From the TensorFlow site: https://www.tensorflow.org/tutorials/image_retraining#distortions
"However, one should generally avoid point-fixing individual errors in the test set, since they are likely to merely reflect more general problems in the (much larger) training set."
so here are my questions:
What would be the best way to correct the program's guess? E.g. the image is B, but the app returned the result "A - 70%, B - 30%".
If the answer to 1 is to retrain, how do I go about retraining the program without deleting the previously created bottleneck files? I.e. I want the program to keep learning while retaining the data I have already trained it to recognize.
Unfortunately there is often no easy fix, because the model you are training is highly complex and very hard for a human to interpret.
However, there are techniques you can use to try and reduce your test error. First make sure your model isn't overfitting or underfitting by observing the difference between train and test errors. If either is the case then try applying standard techniques, such as choosing a deeper model and/or using more filters if underfitting or adding regularization if overfitting.
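For example, if you do see overfitting, dropout or an L2 weight penalty are the usual first steps in Keras; the layer sizes here are arbitrary, not a recommendation for your model:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3),
                           kernel_regularizer=tf.keras.regularizers.l2(1e-4)),   # L2 penalty
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),                     # randomly drops units during training
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 = number of classes (placeholder)
])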
Since you say you are already classifying correctly a high percentage of the time, I would start inspecting misclassified examples directly to try and gain insight into what you might be able to improve.
If possible, try and observe what your misclassified images have in common. If you are lucky they will all fall into one or a small number of categories. Here are some examples of what you might see and possible solutions:
Problem: Dogs facing left are misclassified as cats
Solution: Try augmenting your training set with rotations
Problem: Darker images are being misclassified
Solution: Make sure you are normalizing your images properly
It is also possible that you have reached the limits of your current approach. If you still need to do better consider trying a different approach like using a pretrained network for image recognition, such as VGG.
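A hedged sketch of that last idea, combining a pretrained VGG16 backbone with the augmentation and normalization suggestions above; the input size and class count are placeholders, and RandomRotation lives under tf.keras.layers only in recent TF versions:

import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                                       # keep the pretrained features frozen at first

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.RandomRotation(0.1)(inputs)              # augmentation: small random rotations
x = tf.keras.applications.vgg16.preprocess_input(x)          # VGG's own input normalization
x = base(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)  # 5 = your class count (placeholder)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])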

Why is my transfer learning implementation on TensorFlow throwing an error after a few iterations?

I am using the Inception v1 architecture for transfer learning. I have downloaded the checkpoint file, nets, and pre-processing file from the GitHub repository below:
https://github.com/tensorflow/models/tree/master/slim
I have 3700 images, and I am pulling the last pooling layer's filters out of the graph for each image and appending them to a list. With every iteration the RAM usage increases, finally killing the run at around 2000 images. Can you tell me what mistake I have made?
https://github.com/Prakashvanapalli/TensorFlow/blob/master/Transfer_Learning/inception_v1_finallayer.py
Even if I remove the list appending and just try to print the results, this still happens. I guess the mistake is in the way I am calling the graph. When I look at my RAM usage, it grows with every iteration, and I don't know why, since I am not saving anything and there is no difference between the 1st iteration and the later ones.
From my point of view, I am just sending one image, getting the outputs, and saving them, so it should work irrespective of how many images I send.
I have tried it on both a GPU (6 GB) and a CPU (32 GB).
You seem to be storing images in your graph as tf.constants. These will be persistent, and will cause memory issues like you're experiencing. Instead, I would recommend either placeholders or queues. Queues are very flexible, and can be very high performance, but can also get quite complicated. You may want to start with just a placeholder.
For a full-complexity example of an image input pipeline, you could look at the Inception model.
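A minimal sketch of the placeholder approach: build the graph once, restore the weights once, and then only feed data inside the loop, so the graph never grows. The shapes and the feature tensor here are stand-ins, not the actual inception_v1 code from the linked script:

import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# Build the graph ONCE, outside the image loop.
image_ph = tf.compat.v1.placeholder(tf.float32, shape=[1, 224, 224, 3], name="input_image")
# ... build inception_v1 on top of image_ph and restore the checkpoint here ...
features = tf.reduce_mean(image_ph, axis=[1, 2])        # stand-in for the last pooling layer

with tf.compat.v1.Session() as sess:
    outputs = []
    for image in images:                                # images: your 3700 preprocessed arrays
        # Only data flows through the graph here; no tf.constant() per image,
        # so memory stays flat across iterations.
        outputs.append(sess.run(features, feed_dict={image_ph: image[np.newaxis]}))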