I am training a large network like ResNet with a very small batch size, say 25. When I do that, I get very low and oscillating GPU utilization. I have seen several posts regarding low GPU utilization in PyTorch, but they suggest one of the following:
“Increase the batch size.”: But the batch size is not a computational choice here, and I want it to stay small.
“Increase the number of workers, as data loading might be the bottleneck.”: First of all, data loading is not the bottleneck, as it takes much less time than the computation (see the timing sketch after this question). Secondly, increasing the number of workers actually increases the running time of my code. Thirdly, the low and oscillating GPU utilization persists even after increasing the number of workers. Hence, this suggestion does not apply either.
“Set shuffle = False.”: Again, not a feasible solution, as I have to shuffle my data somehow.
Do you have any other suggestions for using the GPU more effectively when the batch size is small?
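For what it's worth, here is a minimal sketch of how I check that the data loader is not the bottleneck; the ResNet-18 model and FakeData dataset are stand-ins for my actual setup:

```python
import time
import torch
import torchvision

# Time data loading and the forward/backward pass separately.
device = torch.device("cuda")
model = torchvision.models.resnet18().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

dataset = torchvision.datasets.FakeData(
    size=1000, transform=torchvision.transforms.ToTensor())
loader = torch.utils.data.DataLoader(dataset, batch_size=25,
                                     shuffle=True, num_workers=2)

load_time, gpu_time = 0.0, 0.0
end = time.time()
for images, labels in loader:
    load_time += time.time() - end          # time spent waiting on the loader

    start = time.time()
    images, labels = images.to(device), labels.to(device)
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                # make the GPU timing meaningful
    gpu_time += time.time() - start

    end = time.time()

print(f"data loading: {load_time:.2f}s, GPU step: {gpu_time:.2f}s")
```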
I profiled a model that I am running, and the vast majority of the time in each step (295 of 320 ms) is taken up by "device-to-device" operations (see image). I assume this means that moving data from my CPU onto my GPU and back is the bottleneck.
I am running this on a single machine. The data is stored on an SSD and being fed into a GPU.
I am using TensorFlow's tf.data.Dataset API and doing all the recommended things, like prefetching and num_parallel_calls=tf.data.experimental.AUTOTUNE.
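Concretely, the pipeline looks roughly like this (a simplified sketch; the file pattern and parse function are placeholders for my real ones):

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def parse_example(serialized):
    # Placeholder parse function; my real one decodes my actual features.
    spec = {"x": tf.io.FixedLenFeature([32], tf.float32),
            "y": tf.io.FixedLenFeature([], tf.int64)}
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed["x"], parsed["y"]

dataset = (tf.data.TFRecordDataset(tf.io.gfile.glob("data/train-*.tfrecord"))
           .map(parse_example, num_parallel_calls=AUTOTUNE)
           .shuffle(10_000)
           .batch(128)
           .prefetch(AUTOTUNE))   # overlap host-side preparation with GPU work
```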
My questions are:
(1) Is my assumption correct?
(2) How do I reduce this huge burden on my model?
[Image: TensorBoard profiling overview]
Not a proper answer, but it's something: by using TensorFlow's mixed precision training I was able to reduce the "device-to-device" time to ~145 ms. This is still an immense burden compared to everything else profiled, and I'd love to be able to reduce it further.
I don't know why this helped either. I assume that mixed-precision training means fewer bytes are being passed around, so maybe that helps.
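For anyone who wants to try the same thing, enabling mixed precision with Keras looks roughly like this (assuming TF 2.4 or newer; older versions use the experimental policy API instead):

```python
import tensorflow as tf

# Compute in float16 where safe, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    # Keep the final softmax in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```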
I am using the new Dataset API to train a simple feed-forward DL model, and I am interested in maximizing training speed. Since my network isn't huge, I see low GPU utilization, as expected; that is fine. What I don't understand is why CPU usage is also far from 100%. I am using a machine with multiple CPU cores and GPUs. Currently I get up to 140 steps/sec with batch_size = 128. If I cache the dataset, I can get up to 210 steps/sec (after the initial scan), so I expect that with sufficient prefetching I should be able to reach the same speed without caching. However, with various prefetch and prefetch_to_device parameters, I cannot get more than 140 steps/sec. Setting num_parallel_calls to the number of CPU cores improves things by about 20%.
Ideally I'd like the prefetching thread to run on a CPU core disjoint from the rest of the input pipeline, so that whatever benefit it provides is strictly additive. But from the CPU usage profiling, I suspect that prefetching and input processing occur on every core.
Is there a way to have more control over CPU allocation? I have tried prefetch(1), prefetch(500), and several other values (right after batching or at the end of the dataset construction), as well as combinations with prefetch_to_device(gpu_device, buffer_size=None, 1, 500, etc.). So far, prefetch(500) without prefetch_to_device works best.
Why doesn't prefetch try to exhaust all the CPU power on my machine? What are the other possible bottlenecks in training speed?
Many thanks!
The Dataset.prefetch(buffer_size) transformation adds pipeline parallelism and (bounded) buffering to your input pipeline. Therefore, increasing the buffer_size may increase the fraction of time when the input to the Dataset.prefetch() is running (because the buffer is more likely to have free space), but it does not increase the speed at which the input runs (and hence the CPU usage).
Typically, to increase the speed of the pipeline and increase CPU usage, you would add data parallelism by adding num_parallel_calls=N to any Dataset.map() transformations, and you might also consider using tf.contrib.data.parallel_interleave() to process many input sources concurrently and avoid blocking on I/O.
The tf.data Performance Guide has more details about how to improve the performance of input pipelines, including these suggestions.
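In current tf.data terms (Dataset.interleave with num_parallel_calls is the successor to tf.contrib.data.parallel_interleave()), the suggestions above look roughly like this sketch; the file pattern, feature spec, and parallelism values are placeholders:

```python
import tensorflow as tf

def parse_example(serialized):
    # Placeholder parser; replace with your actual feature spec.
    spec = {"x": tf.io.FixedLenFeature([32], tf.float32),
            "y": tf.io.FixedLenFeature([], tf.int64)}
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed["x"], parsed["y"]

files = tf.data.Dataset.list_files("data/shard-*.tfrecord")

dataset = (
    # Read several files concurrently so the pipeline does not block on I/O
    # (the role tf.contrib.data.parallel_interleave() played in TF 1.x).
    files.interleave(tf.data.TFRecordDataset,
                     cycle_length=8, num_parallel_calls=8)
    # Data parallelism: parse records on several CPU cores at once.
    .map(parse_example, num_parallel_calls=8)
    .batch(128)
    # Pipeline parallelism with a bounded buffer of ready batches.
    .prefetch(2)
)
```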
If one GPU/CPU has twice as many GFLOPS as another, does that mean a neural network will train twice as fast on that device?
FLOPS, or floating-point operations per second, is a measure of performance, i.e. how fast the device can perform calculations; a GFLOPS is simply a billion FLOPS. So a GPU with a 2x higher GFLOPS figure is very likely to speed up training. However, a factor of 2 would be an upper bound, because other parts do not depend on compute power, such as memory speed and RAM, and even conditions like the cooling of your GPU/CPU (yes, this can affect the speed of calculations). So you should first ask what percentage of the training time is actually taken up by GPU/CPU calculations: if it's 80%, you can speed up training significantly; if it's 20%, probably not.
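To make that 80% / 20% point concrete, here is a rough back-of-the-envelope estimate (my own illustration, assuming only the compute part of each step gets faster):

```python
def overall_speedup(compute_fraction, compute_speedup):
    """Amdahl's-law style estimate: only the compute part speeds up."""
    return 1.0 / ((1.0 - compute_fraction) + compute_fraction / compute_speedup)

# A device with 2x the GFLOPS, i.e. the compute part runs up to 2x faster:
print(overall_speedup(0.8, 2.0))   # ~1.67x overall if 80% of a step is compute
print(overall_speedup(0.2, 2.0))   # ~1.11x overall if only 20% is compute
```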
If you are sure that most of the time is spent on GPU calculations, the next thing to know is what determines the FLOPS figure:
Number of cores. If the device has more cores, it has more FLOPS (more parallel computations), but this only helps if your code is highly parallelizable and the GPU with, say, half as many cores was not enough to perform all those operations at once. If that is the case and you now run twice as many computations in parallel, training time decreases. This applies mostly to large convolutional networks and is less effective for fully connected or recurrent ones.
Core frequency. If the GPU's cores run at a higher clock frequency, it can calculate faster. This factor is important: a higher frequency speeds up training for any type of neural network.
Architecture. You have probably heard of different GPU architectures, such as Pascal, Tesla, and others. The architecture affects the number of instructions performed in a single clock cycle, and the clock frequency tells you how many such cycles happen per second. So if an architecture delivers twice the FLOPS, it is also highly likely to reduce training time, similar to the previous point.
Thus it is hard to say how much you will gain from a higher FLOPS figure. If you use two GPUs, you double the available FLOPS, similar to point 1. Using two GPUs also doubles the available GPU memory, which helps if a single GPU did not have enough and the code had to read data from host memory frequently.
In short, the effect of FLOPS on training speed is complex: it depends on factors such as how parallelizable your network is, how the extra FLOPS are achieved, memory usage, and more.
I have used MPI+CUDA to implement a computation on multiple GPUs. The GPU cluster has 12 nodes, each with 6 K40 GPUs; when I use up to 6 GPUs, they are all on the same compute node. However, when I measure the execution time while varying the number of GPUs, I get almost no speedup when I use 4 GPUs instead of 2, or 6 instead of 2.
Below is the graph of execution time versus the number of GPUs for two different input sizes. Strangely, the application does achieve speedup when the number of GPUs is increased further; the initial flat part is unexplained, though.
I also measured the communication time via nvprof. As GPUs are added, the number of calls to cudaMemcpy increases, as expected. Surprisingly, however, the average completion time of the cudaMemcpy calls decreases as GPUs are added. This should not happen, since the size of each data transfer stays the same; only the number of transfers increases.
So there are mainly two questions:
1) Does anybody have a possible explanation for the initial flat part of the graph?
2) Why does the cudaMemcpy time decrease as more GPUs are added to the system?
Any help will be highly appreciated.
I've implemented a game where the user must spot 5 differences between two side-by-side images, and I've written the image-comparison engine that finds the differing regions first. The performance is pretty good (4-10 ms to compare 800x600 images), but I'm aware GPUs have so much more power.
My question is: could a performance gain be realized by using all those cores (just to compare each pixel once), at the cost of copying the images onto the GPU? My hunch says it may be worthwhile, but my understanding of GPUs is foggy.
Yes, implementing this process on the GPU can result in much faster processing. The performance increase you get is, as you allude to, related to the size of the images you use: the bigger the images, the greater the GPU's advantage over the CPU.
In the case of processing just two images with dimensions of 800x600, the GPU will still be faster. That is a relatively small amount of memory and can be written to GPU memory quickly.
The algorithm for performing this comparison on the GPU is not overly complicated, but for someone with no experience writing code for the graphics card, the cost of learning GPU programming may not be worth having this one algorithm run on the GPU. If, however, the goal is to learn GPU programming, this could be a good early exercise, though I would recommend starting with even simpler exercises, since learning GPU programming takes some time.
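As a rough illustration of the idea rather than a tuned implementation, the per-pixel comparison can be expressed in a few lines on the GPU with a library such as PyTorch; the host-to-device copy near the top is exactly the cost the question is weighing:

```python
import torch

# Hypothetical 800x600 RGB images already in host memory as uint8 tensors.
img_a = torch.randint(0, 256, (600, 800, 3), dtype=torch.uint8)
img_b = img_a.clone()
img_b[100:150, 200:260] = 0          # fake a "difference" region

# The host-to-device copy is the overhead in question (~1.4 MB per image).
a = img_a.to("cuda", non_blocking=True)
b = img_b.to("cuda", non_blocking=True)

# One comparison per pixel, all evaluated in parallel on the GPU.
diff_mask = (a.int() - b.int()).abs().sum(dim=-1) > 30   # per-pixel threshold
print(diff_mask.nonzero().shape)     # coordinates of differing pixels
```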