I profiled a model that I am running and the vast majority of the time in each step (295 of 320ms) is being taken up by "device-to-device" operations (see image). I assume this means loading data from my cpu onto my gpu and back is the bottleneck.
I am running this on a single machine. The data is stored on an SSD and being fed into a GPU.
I am using TensorFlow's tf.data.Dataset API and doing all the recommended things, like prefetching and setting num_parallel_calls=tf.data.experimental.AUTOTUNE.
My questions are:
(1) Is my assumption correct?
(2) How do I reduce this huge burden on my model?
[Image: TensorBoard profiling overview]
Not a proper answer, but it's something: by using TensorFlow's mixed precision training I was able to reduce the "device-to-device" time to ~145 ms. This is still an immense burden compared to everything else profiled, and I'd love to be able to reduce it further.
I don't know why this helped, either. I assume that mixed-precision training means fewer bytes are being passed around, so maybe that helps.
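To make the "fewer bytes" guess concrete, here is a back-of-the-envelope sketch in plain Python. It only compares the storage size of a 32-bit and a 16-bit float; the tensor size is a made-up example, and nothing here shows that copy volume is what the profiler is actually measuring.

```python
import struct


def tensor_bytes(num_elements, format_char):
    """Bytes needed to store `num_elements` values of the given struct format."""
    return num_elements * struct.calcsize(format_char)


# Hypothetical tensor size, chosen only for illustration.
num_elements = 128 * 1024

fp32_bytes = tensor_bytes(num_elements, "f")  # "f" = 32-bit float, 4 bytes each
fp16_bytes = tensor_bytes(num_elements, "e")  # "e" = 16-bit float, 2 bytes each

print(fp32_bytes, fp16_bytes)
```

The same tensor takes half the bytes in float16, which is at least consistent with the copy time dropping by roughly half (295 ms to ~145 ms), but that is a correlation, not a diagnosis.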
I am using the new Dataset API to train a simple feed-forward DL model. I am interested in maximizing training speed. Since my network isn't huge, I see low GPU utilization, as expected. That is fine. But what I don't understand is why CPU usage is also far from 100%. I am using a machine with multiple CPU cores and a GPU. Currently I get up to 140 steps / sec with batch_size = 128. If I cache the dataset I can get up to 210 steps / sec (after the initial scan). So I expect that with sufficient prefetching, I should be able to reach the same speed without caching. However, with various prefetching and prefetch_to_device parameters, I cannot get more than 140 steps / sec. I also set num_parallel_calls to the number of CPU cores, which improves speed by about 20%.
Ideally I'd like the prefetching thread to be on a disjoint CPU core from the rest of the input pipeline, so that whatever benefit it provides is strictly additive. But from the CPU usage profiling I suspect that the prefetching and input processing occur on every core.
Is there a way to have more control over cpu allocation? I have tried prefetch(1), prefetch(500), and several other values (right after batch or at the end of the dataset construction), as well as in combination with prefetch_to_device(gpu_device, batch_size = None, 1, 500, etc). So far prefetch(500) without prefetch_to_device works the best.
Why doesn't prefetch try to exhaust all the cpu power on my machine? What are other possible bottlenecks in training speed?
Many thanks!
The Dataset.prefetch(buffer_size) transformation adds pipeline parallelism and (bounded) buffering to your input pipeline. Therefore, increasing the buffer_size may increase the fraction of time when the input to the Dataset.prefetch() is running (because the buffer is more likely to have free space), but it does not increase the speed at which the input runs (and hence the CPU usage).
Typically, to increase the speed of the pipeline and increase CPU usage, you would add data parallelism by adding num_parallel_calls=N to any Dataset.map() transformations, and you might also consider using tf.contrib.data.parallel_interleave() to process many input sources concurrently and avoid blocking on I/O.
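The difference between the two kinds of parallelism can be modelled in plain Python. This is only a toy model, not how tf.data is implemented: `prefetch`, `parallel_map` and `preprocess` below are hypothetical helpers, a bounded queue stands in for the prefetch buffer, and a thread pool stands in for `num_parallel_calls`. (Python threads will not speed up CPU-bound work; the point is the structure.)

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel marking the end of the stream


def prefetch(iterable, buffer_size):
    """Pipeline parallelism: a background thread fills a bounded buffer
    while the consumer drains it. The producer itself gets no faster."""
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            buf.put(item)  # blocks while the buffer is full
        buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


def parallel_map(fn, iterable, num_parallel_calls):
    """Data parallelism: up to `num_parallel_calls` elements are processed
    at once, and results come back in input order."""
    with ThreadPoolExecutor(max_workers=num_parallel_calls) as pool:
        yield from pool.map(fn, iterable)


def preprocess(x):
    return x * x  # placeholder for real per-element work


pipeline = prefetch(
    parallel_map(preprocess, range(8), num_parallel_calls=4),
    buffer_size=2,
)
result = list(pipeline)
print(result)
```

In this model, raising `buffer_size` only lets the producer run further ahead of the consumer, while raising `num_parallel_calls` is what puts more workers on the per-element work.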
The tf.data Performance Guide has more details about how to improve the performance of input pipelines, including these suggestions.
I recently tried running TensorFlow object detection with the faster_rcnn_nas model on a K80 graphics card, which has about 11 GB of usable memory. It still crashed, and the console errors suggest that it required more memory.
My training data set has 1000 images of approximately 400x500 pixels, and the test data has 200 images of the same size.
I am wondering roughly how much memory is needed to run the faster_rcnn_nas model, or, more generally, whether it is possible to know the memory requirements of any other model.
Tensorflow doesn't have an easy way to compute memory requirements (yet), and it's a bit of a job to work it out by hand. You could probably just run it on the CPU and look at the process to get a ballpark number.
But the solution to your problem is straightforward. Trying to pass 1000 images of size 400x500 at once is insanity. That would probably exceed the capacity of the largest 24GB GPUs. You can probably only pass through tens or hundreds of images per batch. You need to split up your data into batches and process them over multiple iterations.
In fact, during training you should be taking a random sample of images and training on that (this is the "stochastic" part of stochastic gradient descent). The number of images in each sample is known as the "batch size". For the test set you might get all 200 images to go through at once (since you don't run backprop), but if not, you'll have to split up the test set too (this is quite common).
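As a rough illustration of the numbers involved, the sketch below counts only the bytes of the input tensor, assuming 3 colour channels and 32-bit floats (both are assumptions, not stated in the question). The model's activations and gradients, which usually dominate, are not counted. The helper names are made up for this example.

```python
import random


def input_bytes(num_images, height, width, channels=3, bytes_per_value=4):
    """Bytes for the raw input tensor only (assumes float32 RGB by default)."""
    return num_images * height * width * channels * bytes_per_value


def minibatches(items, batch_size, seed=None):
    """Shuffle `items` once and yield consecutive slices of `batch_size`."""
    rng = random.Random(seed)
    order = list(items)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


full_set = input_bytes(1000, 400, 500)  # all 1000 images at once
one_batch = input_bytes(32, 400, 500)   # a batch of 32 images

batches = list(minibatches(range(1000), batch_size=32, seed=0))

print(full_set / 1e9, "GB for the whole set")
print(one_batch / 1e6, "MB for one batch of 32")
print(len(batches), "batches per epoch")
```

Even before counting activations, the input alone shrinks from gigabytes to tens of megabytes per step once it is split into batches.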
Recently I looked into reinforcement learning, and there was one question bugging me that I could not find an answer to: how is training done effectively using GPUs? To my understanding, constant interaction with an environment is required, which seems to me like a huge bottleneck, since this task is often non-mathematical / non-parallelizable. Yet AlphaGo, for example, uses multiple TPUs/GPUs. So how are they doing it?
Indeed, you will often have interactions with the environment in between learning steps, which will often be better off running on CPU than GPU. So, if your code for taking actions and your code for running an update / learning step are very fast (as in, for example, tabular RL algorithms), it won't be worth the effort of trying to get those on the GPU.
However, when you have a big neural network that you need to go through whenever you select an action or run a learning step (as is the case in most of the Deep Reinforcement Learning approaches that are popular these days), the speedup of running these on GPU instead of CPU is often enough for it to be worth the effort (even if it means you're quite regularly "switching" between CPU and GPU, and may need to copy some things from RAM to VRAM or the other way around).
When doing off-policy reinforcement learning (which means you can use transition samples generated by a "behavioral" policy, different from the one you are currently learning), an experience replay buffer is generally used. You can therefore grab a batch of transitions from this large buffer and use a GPU to optimize the learning objective with SGD (cf. DQN, DDPG).
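A minimal sketch of such a buffer, in plain Python, might look like the following. The class name, the fixed capacity and the uniform sampling are illustrative choices, not taken from DQN or DDPG specifically; the sampled batch is what would be stacked into arrays and sent to the GPU for one SGD step.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest are dropped when full."""

    def __init__(self, capacity, seed=None):
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniformly sample `batch_size` distinct stored transitions."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=100, seed=0)
for t in range(150):
    # Placeholder transition: integer states and a constant reward.
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = buffer.sample(32)
print(len(buffer), len(batch))
```

Because sampling is decoupled from acting, the environment can keep filling the buffer on the CPU while the learner draws large batches that keep the GPU busy.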
One instance of CPU-GPU hybrid approach for RL is this - https://github.com/NVlabs/GA3C.
Here, multiple CPUs are used to interact with different instances of the environment. "Trainer" and "Predictor" processes then collect the interactions using multi-process queues, and pass them to a GPU for back-propagation.
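The queue-based hand-off can be sketched roughly as below. This is not GA3C's code: threads stand in for its separate processes, the "transitions" are placeholder tuples, and `actor` and `collect_batches` are hypothetical names. It only shows several environment workers feeding one consumer that groups their output into batches.

```python
import queue
import threading


def actor(actor_id, num_steps, out_queue):
    """Stand-in for a CPU worker stepping its own environment instance."""
    for step in range(num_steps):
        out_queue.put((actor_id, step))  # placeholder transition


def collect_batches(in_queue, total, batch_size):
    """Stand-in for the trainer: drain the queue and group into batches."""
    batches, current = [], []
    for _ in range(total):
        current.append(in_queue.get())
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


experience = queue.Queue()
actors = [
    threading.Thread(target=actor, args=(i, 25, experience)) for i in range(4)
]
for t in actors:
    t.start()

# Each batch would be handed to the GPU for one back-propagation step.
batches = collect_batches(experience, total=4 * 25, batch_size=10)

for t in actors:
    t.join()
print(len(batches), "batches collected")
```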
I am interested in the costs of using config.gpu_options.allow_growth=True, which I read about here.
I understand that there are some performance losses initially, as tensorflow allocates memory in multiple steps, but are there long run consequences?
E.g. if I have a computer that only runs tensorflow with config.gpu_options.allow_growth=True, will it after say an hour of training run slower (batches per second) than if I didn't use the option?
When you use allow_growth = True, the GPU memory is not pre-allocated and will be able to grow as you need it. This leads to lower memory usage (the default is otherwise to use the whole of the memory) but decreases performance if not used properly, as it requires more complex handling of the memory.
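For reference, a minimal sketch of setting the option is shown below. It uses `tf.compat.v1.ConfigProto`, which is the TensorFlow 2 spelling of the TF1-style config object in the question; the helper name `make_growth_config` is made up, and the guarded import is there only so the snippet can be loaded on a machine without TensorFlow. It says nothing about the performance effect.

```python
try:
    import tensorflow as tf
except ImportError:  # allows the sketch to load without TensorFlow installed
    tf = None


def make_growth_config():
    """Return a session config with allow_growth enabled, or None without TF."""
    if tf is None:
        return None
    config = tf.compat.v1.ConfigProto()
    # Allocate GPU memory on demand instead of reserving it all up front.
    config.gpu_options.allow_growth = True
    return config


config = make_growth_config()
# With TensorFlow available, the config would then be passed to a session:
#   session = tf.compat.v1.Session(config=config)
```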
I've implemented the game where the user must spot 5 differences in two side by side images, and I've made the image comparison engine to find the different regions first. The performance is pretty good (4-10 ms to compare 800x600), but I'm aware GPUs have so much power.
My question is could a performance gain be realized by using all those cores (just to compare each pixel once)... at the cost of copying the images in. My hunch says it may be worthwhile, but my understanding of GPUs is foggy.
Yes, implementing this process to run on the GPU can result in much faster processing time. The amount of performance increase you get is, as you allude to, related to the size of the images you use. The bigger the images, the faster the GPU will complete the process compared to the CPU.
In the case of processing just two images, with dimensions of 800 x 600, the GPU will still be faster. Relatively, that is a very small amount of memory and can be written to the GPU memory quickly.
The algorithm for performing this process on the GPU is not overly complicated, but for someone with no experience of writing code for the graphics card, the cost of learning GPU programming is potentially not worth the benefit of having this one algorithm run on a GPU. If, however, the goal is to learn GPU programming, this could be a good early exercise. I would recommend learning GPU programming first, which will take some time and should start with even simpler exercises.
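To show why the problem maps well onto many cores, here is the per-pixel comparison written in plain Python for grayscale images stored as nested lists. This is a CPU sketch of the idea, not GPU code, and `diff_mask` and `threshold` are illustrative names. Each output value depends on exactly one pair of input pixels, so every comparison could run independently on a GPU, one thread per pixel.

```python
def diff_mask(image_a, image_b, threshold=0):
    """Return a 2-D list of booleans: True where the pixels differ
    by more than `threshold`. Each entry is computed independently."""
    return [
        [abs(a - b) > threshold for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


# Two tiny hypothetical 3x4 grayscale images that differ in two pixels.
left = [
    [10, 10, 10, 10],
    [10, 200, 10, 10],
    [10, 10, 10, 10],
]
right = [
    [10, 10, 10, 10],
    [10, 50, 10, 10],
    [10, 10, 10, 90],
]

mask = diff_mask(left, right, threshold=5)
changed = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
print(changed)
```

Grouping the changed pixels into regions is a separate step and is harder to parallelise than the comparison itself.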