The performance guide advises to do the preprocessing on CPU rather that on GPU. The listed reasons are
This prevent the data from going from CPU to GPU to CPU to GPU back again.
This frees the GPU of these tasks to focus on training.
I am not sure to understand either arguments.
Why would preprocessing send the result back to the CPU, esp. if all nodes are on GPU? Why preprocessing operations and not any other operation on the graph, why are they/should they be special?
Even though I understand the rationale behind putting the CPU to work rather than keeping it idle, compared to the huge convolutions and other gradient backpropagation a training step has to do, I would have assumed that random cropping, flip and other standard preprocessing steps on the input images should be nowhere near in term of computation needs, and should be executed in a fraction of the time. Even if we think of preprocessing as mostly moving things around (crop, flips), I think GPU memory should be faster for that. Yet doing preprocessing on the CPU can yield a 6+-fold increase in throughput according to the same guide.
I am assuming of course that preprocessing does not result in a drastic decrease in size of the data (e.g. supersampling or cropping to a much smaller size), in which case the gain in transfer time to the device is obvious. I suppose these are rather extreme cases and do not constitute the basis for the above recommendation.
Can somebody make sense out of this?
It is based on the same logic on how CPU and GPU works. GPU is good at doing repetitive parallelised tasks very well, whereas CPU is good at other computations, which require more processing capabilities.
For example, consider a program, which accepts inputs of two integers from the user and runs a for-loop for 1 Million times to sum the two numbers.
How we can achieve this with the combination of CPU and GPU processing?
We do the initial data (two user input integers) intercept part from the user on CPU and then send the two numbers to GPU and the for-loop to sum the numbers runs on the GPU because that is the repetitive, parallelizable yet simple computation part, which GPU is better at. [Although this example wasn't really exactly related to tensorflow but this concept is the heart of all CPU and GPU processing. Regarding your query: Processing abilities like random cropping, flip and other standard preprocessing steps on the input images might not be computational intensive but GPU doesn't excel in such kind of interrupt related computation either.]
Another thing we need to keep in mind that the latency between CPU and GPU also plays a key role here. Copying and transferring data to and fro CPU and GPU is expensive if compared to the transfer of data between different cache levels inside CPU.
As Dey, 2014 [1] have mentioned:
When a parallelized program is computed on the GPGPU, first the data
is copied from the memory to the GPU and after computation the data is
written back to the memory from the GPU using the PCI-e bus (Refer to
Fig. 1.8). Thus for every computation, data has to be copied to and
fro device-host-memory. Although the computation is very fast in
GPGPU, but because of the gap between the device-host-memory due to
communication via PCI-e, the bottleneck in the performance is
generated.
For this reason it is advisable that:
You do the preprocessing on CPU, where the CPU does the initial
computation, prepares and sends the rest of the repetitive
parallelised tasks to the GPU for further processing.
I once developed a buffer mechanism to increase the data processing between CPU and GPU, and henceforth reduce the negative effects of latency between CPU and GPU. Have a look at this thesis to gain a better understanding of this issue:
EFFICIENT DATA INPUT/OUTPUT (I/O) FOR FINITE DIFFERENCE TIME DOMAIN (FDTD) COMPUTATION ON GRAPHICS PROCESSING UNIT (GPU)
Now, to answer your question:
Why would preprocessing send the result back to the CPU, esp. if all nodes are on GPU?
As quoted from the performance guide of Tensorflow [2],
When preprocessing occurs on the GPU the flow of data is CPU -> GPU
(preprocessing) -> CPU -> GPU (training). The data is bounced back and
forth between the CPU and GPU.
If you remember the dataflow diagram between the CPU-Memory-GPU mentioned above, the reason for doing the preprocessing on CPU improves performance because:
After computation of nodes on GPU, data is sent back on the memory
and CPU fetches that memory for further processing. GPU does not have
enough memory on-board (on GPU itself) to keep all the data on it for computational prupose. So
back-and-forth of data is inevitable. To optimise this data flow, you
do preprocessing on CPU, then the data (for training purposes), which is prepared for parallelizable tasks, is sent to the memory and GPU
fetches that preprocessed data and work on it.
In the performance guide itself it also mentions that by doing this, and having an efficient input pipeline, you won't starve either CPU or GPU or both, which itself proves the aforementioned logic. Again, in the same performance doc, you will also see the mentioning of
If your training loop runs faster when using SSDs vs HDDs for storing
your input data, you could could be I/O bottlenecked.If this is the
case, you should pre-process your input data, creating a few large
TFRecord files.
This again tries to mention the same CPU-Memory-GPU performance bottleneck, which is mentioned above.
Hope this helps and in case you need more clarification (on CPU-GPU performance), do not hesitate to drop a message!
References:
[1] Somdip Dey, EFFICIENT DATA INPUT/OUTPUT (I/O) FOR FINITE DIFFERENCE TIME DOMAIN (FDTD) COMPUTATION ON GRAPHICS PROCESSING UNIT (GPU), 2014
[2] Tensorflow Performance Guide: https://www.tensorflow.org/performance/performance_guide
I quote at first two arguments from the performance guide and I think
your two questions concern these two arguments respectively.
The data is bounced back and forth between the CPU and GPU. ...
Another benefit is preprocessing on the CPU frees GPU time to focus on training.
(1) Operations like file reader, queue and dequeue can only be performed in CPU, operations like reshape, cast, per_image_standardization can be in CPU or GPU. So a wild guess for your first question: if the code doesn't specify /cpu:0, the program will perform in CPU by readers, then pre-process images in GPU, and finally enqueue and dequeue in CPU. (Not sure I am correct. waiting for an expert to verify...)
(2) For the second question, I have the same doubt too. When you train a large network, most of time is spent on the huge convolutions and the gradient computation, not on preprocessing images. However, when they mean 6X+ increase in samples/sec processed, I think they mean the training on MNIST, where a small network is usually used. So that makes sense. Smaller convolutions spend much less time so the time spent on preprocessing is relatively large. 6X+ increase is possible for this case. But preprocessing on the CPU frees GPU time to focus on training is a reasonable explanation.
Hope this could help you.
Related
I'm running some GPU-accelerated PyTorch code and training it against a custom dataset, but while monitoring the state of my workstation during the process, I see GPU usage along the following lines:
I have never written my own GPU primitives, but I have a long history of doing low-level optimizations for CPU-intensive workloads and my experience there makes me concerned that while pytorch/torchvision are offloading the work to the GPU, it may not be an ideal workload for GPU acceleration.
When optimizing CPU-bound code, the goal is to try and get the CPU to perform as much (meaningful) work as possible in a unit of time: a supposedly CPU-bound task that shows only 20% CPU utilization (of a single core or of all cores, depending on whether the task is parallelizable or not) is a task that is not being performed efficiently because the CPU is sitting idle when ideally it would be working towards your goal. Low CPU usage means that something other than number crunching is taking up your wall clock time, whether it's inefficient locking, heavy context switching, pipeline flushes, locking IO in the main loop, etc. which prevents the workload from properly saturating the CPU.
When looking at the GPU utilization in the chart above, and again speaking as a complete novice when it comes to GPU utilization, it strikes me that the GPU usage is extremely low and appears to be limited by the rate at which data is being copied into the GPU memory. Is this assumption correct? I would expect to see a spike in copy (to GPU) followed by an extended period of calculations/transforms, followed by a brief copy (back from the GPU), repeated ad infinitum.
I notice that despite the low (albeit constant) copy utilization, the GPU memory is constantly peaking at the 8GB limit. Can I assume the workload is being limited by the low GPU memory available (i.e. not maxing out the copy bandwidth because there's only so much that can be copied)?
Does that mean this is a workload better suited for the CPU (in this particular case with this RTX 2080 and in general with any card)?
I'm running a tensorflow code on an Intel Xeon machine with 2 physical CPU each with 8 cores and hyperthreading, for a grand total of 32 available virtual cores. However, I run the code keeping the system monitor open and I notice that just a small fraction of these 32 vCores are used and that the average CPU usage is below 10%.
I'm quite the tensorflow beginner and I haven't configured the session in any way. My question is: should I somehow tell tensorflow how many cores it can use? Or should I assume that it is already trying to use all of them but there is a bottleneck somewhere else? (for example, slow access to the hard disk)
TensorFlow will attempt to use all available CPU resources by default. You don't need to configure anything for it. There can be many reasons why you might be seeing low CPU usage. Here are some possibilities:
The most common case, as you point out, is the slow input pipeline.
Your graph might be mostly linear, i.e. a long narrow chain of operations on relatively small amounts of data, each depending on outputs of the previous one. When a single operation is running on smallish inputs, there is little benefit in parallelizing it.
You can also be limited by the memory bandwidth.
A single session.run() call takes little time. So, you end up going back and forth between python and the execution engine.
You can find useful suggestions here
Use the timeline to see what is executed when
I am avaluating Spin using Promela for model checking, but processing time is an issue to me.
I have seen that I can use Multi Core to improve the calculation but what about GPU/Cuda support to speed up the calculations ? Can I do this at all ?
regards
Adrian
GPU support is not included in Spin but is an active area of research. Most SPIN problems that are slow enough to seek a speed up are also large enough to exceed the local memory on a GPU. As a result the CPU memory needs to be used to store the explored state space and then memory bandwidth, CPU <==> GPU, swamp any computational speed increases. If however, your state space is small then the GPU may be amenable to use; yet, Spin does not include such support.
If there is a matrix addition application that is implemented by hybrid CPU-GPU (in CUDA (i.e) using pthreads where each thread performs a partial matrix addition in host CPU and in GPU), for instance, if the matrix size is 1000, first 500 will be computed by host-CPU and the rest by GPU, basically the computation is split between cpu and gpu, so is this the best when compared to CPU only computation and GPU only computation.
Please, help me understand this concept.
Is there any profiling tool that will help find such kind of computation performance between those 3 ?. I'm new to CUDA so any help/guidance will be appreciated.
Thank you!
The problem with CPU-GPU hybrid computations where you need the result back on CPU is the latency between the two. If you expect to do some computation on GPU and have the result back on CPU, there can be easily several milliseconds of delay from starting the computation on GPU to get the results back on CPU, so the amount of work done on GPU should be significant. Or you need significant amount of work on CPU between starting GPU computation and getting the results back from GPU. Performing 1000 element matrix addition is tiny amount of work thus you would be better off performing the entire computation on CPU instead. You also have the overhead of transferring the data back and forth between the CPU & GPU across the PCI bus which adds to the overhead, so computations which require small amount of data transferred between the two lean more towards hybrid solution.
If you never need to read the result back from GPU to CPU, then you don't have the latency issue though. For example you could do N-body simulation on GPU and perform visualization on GPU as well thus never needing the result on CPU. But the moment you need the result of the simulation back to CPU you have to deal with the latency issue.
The CUDA programming guide states that
"Bandwidth is one of the most important gating factors for performance. Almost all changes to code should be made in the context of how they affect bandwidth."
It goes on to calculate theoretical bandwidth which is in the order of hundreds of gigabytes per second. I am at a loss as to why how many bytes one can read/write to global memory is a reflection of how well optimised a kernel is.
If I have a kernel which does intensive computation on data stored in shared memory and/or registers, with only a single read at the start and write out at the end from and to global memory, surely the effective bandwidth will be small, while the kernel itself may be very efficient.
Could any one further explain bandwidth in this context?
Thanks
most all nontrivial computational kernels, in CPU and GPU land, memory bound.
GPU has very high computational intensity and throughput, but access to main memory is very slow and has high latency, few hundred cycles per read/store versus four cycles for mmany arithmetic operations.
It sounds like your kernel is computation bound, so your luck. However you still have to watch out for shared memory bank conflict, which can serialize portions of code unexpectedly.
Most kernels are memory bound so maximising memory throughput is critical. If you're lucky enough to have a compute bound kernel then optimizing for computation is generally easier. You do need to look out for divergence and you should still ensure you have enough threads to hide memory latency.
Check out the Advanced CUDA C presentation for more information, including some tips for how to compare your realised performance with theoretical performance. The CUDA Best Practices Gude also has some good information, it's available as part of the CUDA toolkit (download from the NVIDIA site).
Typically kernels are fairly small and simple and perform the same operation on a lot of data. You might have a bunch of kernels that you invoke in sequence to perform some more complex operation (think of it as a processing pipeline). Obviously the throughput of your pipeline will depend both on how efficient your kernels are and whether you are limited by memory bandwidth in any way.