I am training a fairly complex network and noticed that the event file keeps growing during training and can reach a size of 2 GB or more. I assume the size of the event file should stay approximately constant over a training session -- correct? I tried placing tf.get_default_graph().finalize() just before the training loop, per mrry's suggestion, but that did not raise any errors. How else can I go about debugging this?
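For reference, this is roughly how the finalize() check is placed (a minimal TF 1.x sketch with a toy model, not my actual code):

import numpy as np
import tensorflow as tf  # TF 1.x, where tf.get_default_graph() and sessions exist

# Build the entire graph up front.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(y, pred)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
tf.summary.scalar("loss", loss)
summary_op = tf.summary.merge_all()
init_op = tf.global_variables_initializer()
writer = tf.summary.FileWriter("logs")

# Any op created after this point raises a RuntimeError, which would pinpoint
# code that silently adds nodes (and therefore graph data) to the event file.
tf.get_default_graph().finalize()

with tf.Session() as sess:
    sess.run(init_op)
    for step in range(1000):
        bx = np.random.rand(32, 10).astype(np.float32)
        by = np.random.rand(32, 1).astype(np.float32)
        _, summ = sess.run([train_op, summary_op], feed_dict={x: bx, y: by})
        writer.add_summary(summ, step)  # each call appends events to the file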
I profiled a model that I am running and the vast majority of the time in each step (295 ms of 320 ms) is taken up by "device-to-device" operations (see image). I assume this means that moving data from my CPU onto my GPU and back is the bottleneck.
I am running this on a single machine. The data is stored on an SSD and being fed into a GPU.
I am using TensorFlow's tf.data.Dataset API and doing all the recommended things, like prefetching and num_parallel_calls=tf.data.experimental.AUTOTUNE.
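Roughly, the pipeline follows the standard pattern (a sketch with a placeholder file name and a placeholder parse function, not the exact code):

import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def parse_fn(record):
    # placeholder: the real function decodes the actual features
    features = {"image": tf.io.FixedLenFeature([], tf.string)}
    return tf.io.parse_single_example(record, features)

dataset = (
    tf.data.TFRecordDataset(["train.tfrecord"])      # placeholder file name
    .map(parse_fn, num_parallel_calls=AUTOTUNE)      # parallel preprocessing
    .batch(128)
    .prefetch(AUTOTUNE)                              # overlap input work with GPU work
)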
My questions are:
(1) Is my assumption correct?
(2) How do I reduce this huge burden on my model?
[Image: TensorBoard Profiling Overview]
Not a proper answer, but it's something: by using TensorFlow's mixed-precision training I was able to reduce the "device-to-device" time to ~145 ms. This is still an immense burden compared to everything else profiled, and I'd love to be able to reduce it further.
I don't know why this helped either. I assume that mixed-precision training means fewer bytes are being passed around, so maybe that helps.
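For anyone who wants to try the same thing, this is roughly how mixed precision is turned on in Keras (a sketch; the tiny model is only illustrative, and older TF versions use tf.keras.mixed_precision.experimental.set_policy instead):

import tensorflow as tf

# Stable API in TF 2.4+.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(32,)),
    # keep the final layer in float32 so the outputs and loss stay in full precision
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")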
I have encountered this phrase a few times before, mostly in the context of neural networks and TensorFlow, but I get the impression it's something more general and not restricted to these environments.
Here, for example, they say that this "convolution warmup" process takes about 10k iterations.
Why do convolutions need to warm up? What prevents them from reaching their top speed right away?
One thing I can think of is memory allocation. If so, I would expect it to be resolved after one (or at least fewer than ten) iterations. Why 10k?
Edit for clarification: I understand that the warmup is a time period, or a number of iterations, that has to pass before the convolution operator reaches its top speed (time per operation).
What I am asking is: why is it needed, and what happens during this time that makes the convolution faster?
Training neural networks works by presenting training data, calculating the output error, and backpropagating the error back to the individual connections. To break symmetry, training doesn't start with all weights at zero, but with random connection strengths.
It turns out that with the random initialization, the first training iterations aren't really effective. The network isn't anywhere near to the desired behavior, so the errors calculated are large. Backpropagating these large errors would lead to overshoot.
A warmup phase is intended to move the initial network away from a random network and towards a first approximation of the desired network. Once that approximation has been reached, the learning rate can be increased.
This is an empirical result. The number of iterations will depend on the complexity of your problem domain, and therefore also on the complexity of the necessary network. Convolutional neural networks are fairly complex, so warmup is more important for them.
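To make that concrete, here is a minimal sketch of such a warmup schedule (a simple linear ramp in Keras; the class name and the numbers are illustrative, not a prescription):

import tensorflow as tf

class LinearWarmup(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Ramps the learning rate linearly up to target_lr over warmup_steps, then holds it."""

    def __init__(self, target_lr, warmup_steps):
        self.target_lr = target_lr
        self.warmup_steps = float(warmup_steps)

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_lr = self.target_lr * (step + 1.0) / self.warmup_steps
        # after the warmup period, hold the learning rate at its target value
        return tf.minimum(warmup_lr, self.target_lr)

optimizer = tf.keras.optimizers.Adam(learning_rate=LinearWarmup(1e-3, warmup_steps=10000))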
You are not alone in observing that the time per iteration varies.
I ran the same example and had the same question. I can say the main reason is the different input image shapes and the number of objects to detect.
I offer my test results for discussion.
I enabled tracing and looked at the timeline first; I found that the number of Conv2D occurrences varies between steps in the GPU stream:all Compute lane. Then I used export TF_CUDNN_USE_AUTOTUNE=0 to disable autotuning.
After that there is the same number of Conv2D ops in the timeline, and the time is about 0.4 s.
The time cost still differs between steps, but it is much closer!
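For completeness, the same switch can presumably also be set from Python, as long as it happens before TensorFlow is imported (a minimal sketch):

import os

# Must be set before TensorFlow is imported for it to take effect.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

import tensorflow as tf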
Here is the situation: I am working with a large TensorFlow record file. It's about 50 GB. However, the machine I'm doing this training on has 128 GB of RAM. 50 is less than 128, so even though this is a large file you would think it would be possible to keep it in memory and save on slow I/O operations. But I'm using the TFRecordDataset class, and it seems like the whole TFRecord system is designed specifically not to do that; I don't see any way to force it to keep the records in memory. And since it reloads them every epoch, I am wasting an inordinate amount of time on slow I/O operations reading from that 50 GB file.
I suppose I could load the records into memory in python and then load them into my model one by one with a feed_dict, bypassing the whole Dataset class. But that seems like a less elegant way to handle things and would require some redesign. Everything would be much simpler if I could just force the TFRecordDataset to load everything into memory and keep it there between epochs...
You need tf.data.Dataset.cache() operation. To achieve the desired effect (keeping the file in memory), put it right after the TFRecordDataset and don't provide any arguments to it:
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.cache()
When the cache() operation is invoked without arguments, caching is done in memory.
Also, if you do some postprocessing of these records, e.g. with dataset.map(...), it could be even more beneficial to put the cache() operation at the end of the input pipeline, as in the sketch below.
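For example (a sketch; parse_fn stands in for whatever decoding your records need, and the shuffle/batch sizes are illustrative):
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parse_fn)   # parse/decode each record once ...
dataset = dataset.cache()         # ... then keep the decoded records in memory across epochs
dataset = dataset.shuffle(10000)
dataset = dataset.batch(32)
dataset = dataset.prefetch(1)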
More information can be found in the "Input Pipeline Performance Guide" Map and Cache section.
I am using the new Dataset API to train a simple feed-forward DL model. I am interested in maximizing training speed. Since my network size isn't huge, I see low GPU utilization, as expected. That is fine. But what I don't understand is why CPU usage is also far from 100%. I am using a multi-CPU/GPU machine. Currently I get up to 140 steps/sec with batch_size = 128. If I cache the dataset I can get up to 210 steps/sec (after the initial scan). So I expect that with sufficient prefetching, I should be able to reach the same speed without caching. However, with various prefetch and prefetch_to_device parameters, I cannot get more than 140 steps/sec. I also set num_parallel_calls to the number of CPU cores, which improves things by about 20%.
Ideally I'd like the prefetching thread to be on a disjoint CPU core from the rest of the input pipeline, so that whatever benefit it provides is strictly additive. But from the CPU usage profiling I suspect that the prefetching and input processing occur on every core.
Is there a way to have more control over CPU allocation? I have tried prefetch(1), prefetch(500), and several other values (right after batch or at the end of the dataset construction), as well as in combination with prefetch_to_device(gpu_device, buffer_size=None, 1, 500, etc.). So far prefetch(500) without prefetch_to_device works best.
Why doesn't prefetch try to exhaust all the cpu power on my machine? What are other possible bottlenecks in training speed?
Many thanks!
The Dataset.prefetch(buffer_size) transformation adds pipeline parallelism and (bounded) buffering to your input pipeline. Therefore, increasing the buffer_size may increase the fraction of time when the input to the Dataset.prefetch() is running (because the buffer is more likely to have free space), but it does not increase the speed at which the input runs (and hence the CPU usage).
Typically, to increase the speed of the pipeline and increase CPU usage, you would add data parallelism by adding num_parallel_calls=N to any Dataset.map() transformations, and you might also consider using tf.contrib.data.parallel_interleave() to process many input sources concurrently and avoid blocking on I/O.
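For example (a sketch; the file pattern, parse_fn, and the parallelism values are placeholders to adapt to your machine):
files = tf.data.Dataset.list_files("/data/train-*.tfrecord")
dataset = files.apply(tf.contrib.data.parallel_interleave(
    tf.data.TFRecordDataset, cycle_length=8))          # read several files concurrently
dataset = dataset.map(parse_fn, num_parallel_calls=8)  # N parallel preprocessing calls
dataset = dataset.batch(128)
dataset = dataset.prefetch(1)                          # overlap the pipeline with training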
The tf.data Performance Guide has more details about how to improve the performance of input pipelines, including these suggestions.
I recently tried running TensorFlow object detection with the faster_rcnn_nas model on a K80 graphics card, which has 11 GB of usable memory. It still crashed, and based on the console errors it appears to have required more memory.
My training data set has 1000 images of size 400x500 (approx.) and the test data has 200 images of the same size.
I am wondering what the approximate memory requirement is for running the faster_rcnn_nas model, or, more generally, whether it is possible to know the memory requirements for any other model?
TensorFlow doesn't have an easy way to compute memory requirements (yet), and it's a bit of a job to work it out by hand. You could probably just run it on the CPU and look at the process to get a ballpark number.
But the solution to your problem is straightforward. Trying to pass 1000 images of size 400x500 at once is insanity. That would probably exceed the capacity of even the largest 24 GB GPUs. You can probably only pass through tens or hundreds of images per batch. You need to split up your batch and process it in multiple iterations.
In fact, during training you should be taking a random sample of images and training on that (this is the "stochastic" part of stochastic gradient descent); the size of that sample is known as the "batch size". For the test set you might get all 200 images through at once (since you don't run backprop), but if not, you'll have to split up the test set too (this is quite common), as in the sketch below.
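For example, running the test set in smaller chunks might look roughly like this (a sketch; test_images, run_inference, and the batch size are hypothetical placeholders for your own data and inference call):

import numpy as np

batch_size = 8  # illustrative; pick whatever fits in the K80's 11 GB
num_batches = int(np.ceil(len(test_images) / float(batch_size)))
for i in range(num_batches):
    batch = test_images[i * batch_size:(i + 1) * batch_size]
    detections = run_inference(batch)  # placeholder for the model's inference call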