Avoiding exhausting GPU resources in convNN Tensorflow - tensorflow

I'm trying to run a hyperparameter optimization script for a convNN using TensorFlow.
As you may know, TF's handling of GPU memory isn't that sophisticated (and I don't think it ever will be, thanks to the TPU). So my question is: how do I choose the filter dimensions and the batch size so that the GPU memory doesn't get exhausted?
Here's the equation that I'm thinking of:
image_shape = 128x128x3 (3 color channels)
batchSize = 20 (the smallest possible batch size, since I have 20 classes)
filter_shape = fw_fh_fd [filter_width=4, filter_height=4, filter_depth=32]
As far as I understand, using the tf.nn.conv2d function will need the following amount of memory:
image_width * image_height * numberofchannels * batchSize * filter_height * filter_width * filter_depth * 32bit
since we're using the tf.float32 type for each pixel.
In the given example, the needed memory will be:
128x128x3x20x4x4x32x32 = 16106127360 (bits), which is almost 16 Gbit (about 2 GB) of memory.
I'm not sure the formula is correct, so I hope to get a validation or a correction of what I'm missing.

Actually, this will take only about 44MB of memory, mostly taken by the output.
Your input is 20x128x128x3
The convolution kernel is 4x4x3x32
The output is 20x128x128x32
When you sum up the total, you get
(20*128*128*3 + 4*4*3*32 + 20*128*128*32) * 4 / 1024**2 ≈ 44MB
(In the above, 4 is for the size in bytes of float32 and 1024**2 is to get the result in MB).
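The same arithmetic as a small Python sketch. The function name and its parameters are made up for illustration, and it assumes a stride-1 convolution whose output keeps the input's spatial size (as in the shapes above). It counts only the input, kernel and output of the forward pass, so gradients and any library workspace memory are not included.

```python
def conv_layer_memory_mb(batch, height, width, in_ch, k_h, k_w, out_ch,
                         bytes_per_value=4):
    """Rough memory in MB (1024**2 bytes) for one conv layer's tensors.

    Hypothetical helper: assumes the output has the same height and width
    as the input, and float32 values (4 bytes) by default.
    """
    input_values = batch * height * width * in_ch
    kernel_values = k_h * k_w * in_ch * out_ch
    output_values = batch * height * width * out_ch
    total_bytes = (input_values + kernel_values + output_values) * bytes_per_value
    return total_bytes / 1024**2


# The example from the question: 20 images of 128x128x3, 4x4 kernel, 32 filters.
print(round(conv_layer_memory_mb(20, 128, 128, 3, 4, 4, 32), 2))  # 43.76
```

Almost all of it comes from the output tensor (40 MB), which is why the filter count and batch size matter far more than the kernel width and height here.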
Your batch size can be smaller than your number of classes. Think about ImageNet and its 1000 classes: people are training with batch sizes 10 times smaller.
EDIT
Here is a tensorboard screenshot of the net. It reports 40MB rather than 44MB, probably because it excludes the input, and it also shows all the tensor sizes I mentioned earlier.

Related

Understanding 2D convolution output size

I am a beginner in convolutional DL. I saw the following architecture in the paper Simultaneous Feature Learning and Hash Coding with Deep Neural Networks, for images of size 256*256:
I do not understand the output size of the first 2D convolution: 96*54*54. 96 seems fine as the number of filters is 96. But if we apply the following formula for the output size: size = [(W−K+2P)/S]+1 = [(256 - 11 + 2*0)/4] + 1 = 62.25 ~ 62. I have assumed the padding P to be 0, as it is not mentioned anywhere in the paper. The Keras Conv2D API produces the same 96*62*62 output size. Then why does the paper report 96*54*54? What am I missing?
Well, it reminded me of the AlexNet paper, where there was a similar mistake. Your calculation is correct. I think they mistakenly wrote 256x256 instead of 224x224, in which case the calculation for the input layer is
(224-11+2*0)/4 + 1 = 54.25 ~ 54
It's highly likely that the authors mistakenly wrote 256x256 when the real architecture input size was 224x224 (that was the case in AlexNet also). The other, less likely option is that 256x256 was the real architecture input size but they did the calculations for 224x224. I would rule the latter out, as it would be a very silly mistake.
Thus, I believe the true input size was 224x224 instead of 256x256.
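For checking such numbers quickly, here is a minimal sketch of the output-size formula used above. The function name is hypothetical, and it assumes the usual floor rounding for strided convolutions.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution: floor((W - K + 2P) / S) + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


print(conv_output_size(256, 11, stride=4))  # 62, what the question computed
print(conv_output_size(224, 11, stride=4))  # 54, what the paper reports
```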

how to make embedding column through features directly?

I'm learning the wide & deep model for CTR. My data has a feature user_id which has more than 2**26 values. How can I get an embedding column for this feature? I used
user_id = tf.feature_column.categorical_column_with_hash_bucket('user_id', hash_bucket_size=2**26)
user_id_emb = tf.feature_column.embedding_column(user_id, dimension=95)
but it runs out of memory.
So, 2**26 is about 64M. You want 95 embedding dimensions. Each will be a float32 by default. That is 4 bytes. 4 * 95 ~= 400 bytes per user_id. So you need 64M * 400 ~= 25.6 Gbytes of memory to store the embedding.
Make sure you can allocate that much on your system. It should all be in RAM (swap will make everything much slower). If you place this on a GPU it won't work, since most GPUs don't have that much memory available. An embedding of only 20 dimensions should use about 5 Gbytes, which is more likely to fit in memory.
The easiest thing is to lower the number of embedding dimensions.
If you have multiple systems available you can shard the embedding (see the partitioner parameter of the variable-related functions).
Another thing you can do is cluster some user_ids together (lower the hash_bucket_size). Or replace user_ids by a combination of other features that would describe the user sufficiently for your model.
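To try out different combinations of hash_bucket_size and dimension before building the model, the estimate above can be written as a small sketch. The helper name is made up, and it only counts the embedding table itself, assuming float32 values; optimizer slots and other variables would come on top.

```python
def embedding_table_gib(num_buckets, dimension, bytes_per_value=4):
    """Size of a dense embedding table in GiB (1024**3 bytes)."""
    return num_buckets * dimension * bytes_per_value / 1024**3


print(embedding_table_gib(2**26, 95))  # 23.75
print(embedding_table_gib(2**26, 20))  # 5.0
print(embedding_table_gib(2**22, 95))  # 1.484375, with 16x fewer hash buckets
```

The exact figure for 95 dimensions is a bit below the rounded 25.6 Gbytes because 4 * 95 is 380 bytes rather than 400.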

TensorFlow Object Detection API: evaluation mAP behaves weirdly?

I am training an object detector for my own data using Tensorflow Object Detection API. I am following the (great) tutorial by Dat Tran https://towardsdatascience.com/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9. I am using the provided ssd_mobilenet_v1_coco-model pre-trained model checkpoint as the starting point for the training. I have only one object class.
I exported the trained model, ran it on the evaluation data and looked at the resulting bounding boxes. The trained model worked nicely; I would say that if there were 20 objects, typically there were 13 objects with spot-on predicted bounding boxes ("true positives"); 7 where the objects were not detected ("false negatives"); and 2 cases where problems occur when two or more objects are close to each other: the bounding boxes get drawn between the objects in some of these cases ("false positives"; of course, calling these "false positives" etc. is inaccurate, but this is just for me to understand the concept of precision here). There are almost no other "false positives". This seems a much better result than what I was hoping to get, and while this kind of visual inspection does not give the actual mAP (which is calculated based on the overlap of the predicted and tagged bounding boxes?), I would roughly estimate the mAP as something like 13/(13+2) > 80%.
However, when I run the evaluation (eval.py) (on two different evaluation sets), I get the following mAP graph (0.7 smoothed):
mAP during training
This would indicate a huge variation in mAP, and a level of about 0.3 at the end of the training, which is way worse than what I would assume based on how well the bounding boxes are drawn when I use the exported output_inference_graph.pb on the evaluation set.
Here is the total loss graph for the training:
total loss during training
My training data consists of 200 images with about 20 labeled objects each (I labeled them using the labelImg app); the images are extracted from a video and the objects are small and kind of blurry. The original image size is 1200x900, so I reduced it to 600x450 for the training data. The evaluation data (which I used both as the evaluation data set for eval.py and to visually check what the predictions look like) is similar and consists of 50 images with 20 objects each, but is still in the original size (the training data is extracted from the first 30 min of the video and the evaluation data from the last 30 min).
Question 1: Why is the mAP so low in evaluation when the model appears to work so well? Is it normal for the mAP graph to fluctuate so much? I did not touch the default values for how many images the tensorboard uses to draw the graph (I read this question: Tensorflow object detection api validation data size and have some vague idea that there is some default value that can be changed?)
Question 2: Can this be related to the different sizes of the training data and the evaluation data (1200x900 vs 600x450)? If so, should I resize the evaluation data, too? (I did not want to do this as my application uses the original image size, and I want to evaluate how well the model does on that data).
Question 3: Is it a problem to form the training and evaluation data from images where there are multiple tagged objects per image (i.e. surely the evaluation routine compares all the predicted bounding boxes in one image to all the tagged bounding boxes in one image, and not all the predicted boxes in one image to one tagged box, which would produce many "false false positives"?)
(Question 4: it seems to me the model training could have been stopped after around 10000 timesteps, where the mAP kind of leveled out. Is it now overtrained? It's kind of hard to tell when it fluctuates so much.)
I am a newbie with object detection so I very much appreciate any insight anyone can offer! :)
Question 1: This is the tough one... First, I think you don't understand correctly what mAP is, since your rough calculation is off. Here is, briefly, how it is computed:
For each class of object, using the overlap between the real objects and the detected ones, the detections are tagged as "True positive" or "False positive"; all the real objects with no "True positive" associated to them are labelled "False Negative".
Then, iterate through all your detections (on all images of the dataset) in decreasing order of confidence. Compute the precision (TP/(TP+FP)) and recall (TP/(TP+FN)), only counting the detections that you've already seen (with confidence bigger than the current one) for TP and FP. This gives you a point (precision, recall) that you can put on a precision-recall graph.
Once you've added all possible points to your graph, you compute the area under the curve: this is the Average Precision for this category.
If you have multiple categories, the mAP is the standard mean of all APs.
Applying that to your case: in the best case your true positives are the detections with the best confidence. In that case your precision/recall curve will look like a rectangle: you'd have 100% precision up to (13/20) recall, and then points with 13/20 recall and <100% precision; this gives you mAP=AP(category 1)=13/20=0.65. And this is the best case; you can expect less in practice due to false positives with higher confidence.
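The steps above can be sketched in a few lines of Python. This is a simplified, non-interpolated version of the area under the precision-recall curve, written only for illustration: the function name and the input format are made up, the confidences are invented for the example, and the evaluation scripts actually used for PASCAL VOC or COCO additionally interpolate the curve.

```python
def average_precision(detections, num_ground_truth):
    """Area under the precision-recall curve for one class.

    detections: list of (confidence, is_true_positive) over the whole dataset.
    num_ground_truth: total number of real objects (TP + FN).
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        # Only true positives increase recall, so only they add area.
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


# Best case from the text: the 13 true positives are the most confident.
best = [(0.9 - 0.01 * i, True) for i in range(13)] + [(0.2, False), (0.1, False)]
# Same detections, but the 2 false positives are the most confident ones.
worse = [(0.99, False), (0.98, False)] + [(0.9 - 0.01 * i, True) for i in range(13)]

print(round(average_precision(best, 20), 2))  # 0.65
print(average_precision(worse, 20) < average_precision(best, 20))  # True
```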
Other reasons why yours could be lower:
maybe among the bounding boxes that appear to be good, some are still rejected in the calculations because the overlap between the detection and the real object is not quite big enough. The criterion is that Intersection over Union (IoU) of the two bounding boxes (real one and detection) should be over 0.5. While it seems like a gentle threshold, it's not really; you should probably try and write a script to display the detected bounding boxes with a different color depending on whether they're accepted or not (if not, you'll get both a FP and a FN).
maybe you're only visualizing the first 10 images of the evaluation. If so, change that, for 2 reasons: 1. maybe you're just very lucky on these images, and they're not representative of what follows. 2. Actually, more than luck, if these images are the first from the evaluation set, they come right after the end of the training set in your video, so they are probably quite similar to some images in the training set, so they are easier to predict, so they're not representative of your evaluation set.
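To see how strict the 0.5 IoU threshold mentioned above can be, here is a minimal IoU sketch. It assumes boxes given as (x_min, y_min, x_max, y_max) tuples, which is a convention chosen for this example and not necessarily the one your tools use.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) tuples (assumed format).
    """
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


truth = (0, 0, 100, 100)
print(round(iou(truth, (20, 0, 120, 100)), 3))   # 0.667, shifted 20% in x only
print(round(iou(truth, (20, 20, 120, 120)), 3))  # 0.471, shifted 20% in x and y
```

A box of the right size that is off by a fifth of its width in both directions already falls below 0.5, and for small objects that is only a few pixels.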
Question 2: If you have not changed that part of the config file for mobilenet_v1_coco-model, all your images (both for training and testing) are rescaled to 300x300 pixels at the start of the network, so your preprocessing doesn't matter.
Question 3: no it's not a problem at all, all these algorithms were designed to detect multiple objects in images.
Question 4: Given the fluctuations, I'd actually keep training until you see either improvement or clear overtraining. 10k steps is actually quite few; maybe it's enough because your task is relatively easy, or maybe it's not and you need to wait ten times that to see significant improvement...

Fitting Large Matrix Calculations into Memory when using Tensorflow

I am attempting to build a model which has two phases.
The first takes an input image and passes it through a conv-deconv network. The resulting Tensor has entries corresponding to pixels in a desired output image (same size as the input image).
To calculate the final output image I want to take the value generated at each pixel location from the first phase and use it as an additional input to a reduction function that is applied over the entire input image. This second step has no trainable variables, but it does have computation/memory costs that grow very quickly with the size of the input (each output pixel is a function of all input pixels, so the cost is roughly quadratic in the number of pixels).
I'm currently using the tf.map_fn to calculate the output image. I'm mapping the output pixel calculation function onto the results from the first phase. My desire is that tensorflow would allocate the memory to store the intermediate tensors needed for each pixel calculation and then free that memory before moving on to the next pixel calculation. But instead it seems to never free the intermediate calculations causing OOM errors.
Is there some way to tell tensorflow (either explicitly or implicitly) that it should free the memory allocated to hold the data of a Tensor that is no longer needed in the calculation?
TensorFlow deallocates memory for the tensor as soon as the tensor is no longer needed for any future calculations. You can verify this by looking at memory deallocation messages as shown in this notebook.
It's possible you are running out of memory because TensorFlow executes nodes in a memory-inefficient order.
As an example, consider the following computation:
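import tensorflow as tf  # TF 1.x API
n = 20  # number of chained matmuls; value not given in the original, any positive integer works
k = 2000
a = tf.random_uniform(shape=(k,k))
for i in range(n):
    a = tf.matmul(a, tf.random_uniform(shape=(k,k)))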
The order in which it is evaluated can be shown below
All the circles (tf.random_uniform) nodes are evaluated first, followed by squares (tf.matmul). This has O(n) memory requirement compared to O(1) for the optimal order.
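The O(n) versus O(1) difference can be illustrated without TensorFlow by a toy simulation that counts how many tensors are alive at once under each execution order. This is only a model of the idea, with made-up helper names, and it assumes a tensor is freed right after its last consumer has run; it is not how TensorFlow's scheduler is implemented.

```python
def peak_live_tensors(schedule):
    """schedule: list of (output_name, [input_names]) in execution order."""
    last_use = {}
    for step, (_, inputs) in enumerate(schedule):
        for name in inputs:
            last_use[name] = step
    live, peak = set(), 0
    for step, (out, inputs) in enumerate(schedule):
        live.add(out)
        peak = max(peak, len(live))
        for name in inputs:
            if last_use[name] == step:
                live.discard(name)  # no later op needs this tensor
    return peak


def build_ops(n):
    """r0 is the initial matrix; each m_i multiplies the previous result by r_i."""
    rands = [(f"r{i}", []) for i in range(n + 1)]
    matmuls, prev = [], "r0"
    for i in range(1, n + 1):
        matmuls.append((f"m{i}", [prev, f"r{i}"]))
        prev = f"m{i}"
    return rands, matmuls


def all_random_first(n):
    rands, matmuls = build_ops(n)
    return rands + matmuls


def interleaved(n):
    rands, matmuls = build_ops(n)
    schedule = [rands[0]]
    for r, m in zip(rands[1:], matmuls):
        schedule += [r, m]
    return schedule


print(peak_live_tensors(all_random_first(10)))  # 12, grows with n
print(peak_live_tensors(interleaved(10)))       # 3, independent of n
```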
You can use control dependencies to force a specific execution order, i.e., using a helper function like the one below:
import tensorflow.contrib.graph_editor as ge
def run_after(a_tensor, b_tensor):
    """Force a to run after b"""
    ge.reroute.add_control_inputs(a_tensor.op, [b_tensor.op])

Getting each example exactly once

For monitoring my model's performance on my evaluation dataset, I'm using tf.train.string_input_producer for the filename queue on .tfr files, then I feed the parsed examples to the tf.train.batch function, which produces batches of a fixed size.
Assume my evaluation dataset contains exactly 761 examples (a prime number). To read all the examples exactly once, I have to have a batch size that divides 761, but there is none except 1, which will be too slow, and 761, which will not fit on my GPU. Is there a standard way to read each example exactly once?
Actually, my dataset size is not 761, but there is no number in the reasonable range of 50-300 that divides it exactly. Also I'm working with many different datasets, and finding a number that approximately divides the number of examples in each dataset can be a hassle.
Note that using the num_epochs parameter to tf.train.string_input_producer does not solve the issue.
Thanks!
You can use reader.read_up_to as in this example. Your last batch will be smaller, so you need to make sure your network doesn't hard-wire the batch size anywhere.
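The idea of a smaller last batch, independent of the TensorFlow reader API, looks like this in plain Python. The generator below is a hypothetical stand-in for the input pipeline, used only to show the batch sizes you should expect.

```python
def iter_batches(examples, batch_size):
    """Yield consecutive batches; the last one may be smaller than batch_size."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


examples = list(range(761))  # stand-in for 761 parsed examples
batch_sizes = [len(batch) for batch in iter_batches(examples, 100)]
print(len(batch_sizes), batch_sizes[-1])  # 8 61
print(sum(batch_sizes))                   # 761, each example seen exactly once
```

When averaging a metric over such batches, weight each batch by its size, otherwise the 61 examples in the last batch count more than the others.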