Tensorflow: GPU memory exhuastion and max pool layers

Tensorflow: GPU memory exhuastion and max pool layers - gpu

I have a pretty simple CNN setup (3 convolution layers into a fully connected layer) , using a Geforce GTX 960 GPU with 4GB of memory. I was hitting a GPU resource exhausted issue, and while playing around I discovered if I put a pooling layer (which reduces the overall size of the layer by a factor of 4) it actually made the problem worse. In other words, I was hitting the resource exhausted, and then after removing the pooling layer, I was able to train. I intuitively thought that since a pooling layer would reduce the number of operations downstream during the optimization piece of the training that it would help the memory constraints. Does any know how a max pooling layer could potentially add operations and increase memory like this?

Related

Prediction with GPU is much slower than with CPU?

curiously I just found out that my CPU is much faster for predictions.
Doing inference with GPU is much slower then with CPU.
I have tf.keras (tf2) NN model with a simple dense layer:
input = tf.keras.layers.Input(shape=(100,), dtype='float32')
X = X = tf.keras.layers.Dense(2)(input)
model = tf.keras.Model(input,X)
#also initiialized with weights from a file
weights = np.load("weights.npy", allow_pickle=True )
model.layers[-1].set_weights(weights)
scores = model.predict_on_batch(data)
For 100 samples doing predictions I get:
2 s for GPU
0.07 s for CPU (!)
I am using a simple geforce mx150 with 2gb
I also tried the predict_on_batch(x) as someone suggested this as it is more faster than just predict. But here it is of same time.
Refer: Why does keras model predict slower after compile?
Has anyone an idea, what is going on there? What could be an issue possibly?

Using the GPU puts a lot of overhead to load data on the GPU memory (through the relatively slow PCI bus) and to get the results back.
In order for the GPU to be more efficient than the CPU, the model must to be very big, have plenty of data and use algorithms that can run fully inside the GPU, without requiring partial results to be moved back to the CPU.
The optimal configuration depends on the quantity of memory and of cores inside your GPU, so you must do some tests, but the following rules apply:
Your NN must have at least >10k parameters, training data set must have at least 10k records. Otherwise your overhead will probably kill the performances of GPU
When you model.fit, use a large batch_size (pay attention, the default is only 32), possibly to contain your whole dataset, or at least a multiple of 1024. Do some test to find the optimum for you.
For some GPUs, it might help performing computations in float16 instead of float32. Follow this tutorial to see how to activate it.
If your GPU has specific Tensor Cores, in order to use efficiently its hardware, several data must be multiples of 8. In the preceding tutorial, see at the paragraph "Ensuring GPU Tensor Cores are used" what parameters must be changed and how. In general, it's a bad idea to use layers which contain a number of neurons not multiple of 8.
Some type of layers, namely RNNs, have an architecture which cannot be solved directly by the GPU. In this case, data must be moved constantly back and forth to CPU and the speed is lost. If a RNN is really needed, Tensorflow v2 has an implementation of the LSTM layer which is optimized for GPU, but some limitations on the parameters are present: see this thread and the documetation.
If you are training a Reinforcement Learning, activate an Experience Replay and use a memory buffer for the experience which is at least >10x your batch_size. This way, you will activate the NN training only when a big bunch of data is ready.
Deactivate as much verbosity as possible
If everything is set up correctly, you should be able to train your model faster with GPU than with CPU.

GPU is good if you have compute-intensive tasks (large models) due to the overhead of copying your data and results between the host and GPU. In your case, the model is very small. It means it will take you longer to copy data than to predict. Even if the CPU is slower than the GPU, you don't have to copy the data, so it's ultimately faster.

Assign Keras/TF/PyTorch layer to hardware type

Suppose we have the following architecture:
Multiple CNN layers
RNN layer
(Time-distributed) Dense classification layer
We want to train this architecture now. Our fancy GPU is very fast at solving the CNN layers. Although using a lower clockrate, it can perform many convolutions in parallel, thus the speed. Our fancy CPU however is faster for the (very long) resulting time series, because the time steps cannot be parallelized, and the processing profits from the higher CPU clockrate. So the (supposedly) smart idea for execution would look like this:
Multiple CNN layers (run on GPU)
RNN layer (run on CPU)
(Time-distributed) Dense classification layer (run on GPU/CPU)
This lead me to two important questions:
Is it possible, with any of the frameworks mentioned in the title, to distribute certain layers to certain hardware, and how?
If it is possible, would the overhead for the additional memory operations, e.g. tranferring between GPU-/CPU-RAM, render the whole idea useless?

Basically, in Pytorch you can control the device on which variables/parameters reside. AFAIK, it is your responsibility to make sure that for each operation all the arguments reside on the same device: i.e., you cannot conv(x, y) where x is on GPU and y is on CPU.
This is done via pytorch's .to() method that moves a module/variable .to('cpu') or .to('cuda:0')

As Shai mentioned you can control this yourself in pytorch so in theory you can have parts of your model on different devices. You have then to move data between devices in your forward pass.
I think the overhead as you mentioned would make the performance worst. The cuda RNN implementation benefits greatly from running on a gpu anyways :)

Can Tensorflow saturate a GPU with a batch size of 1?

I'm running a modest 5 layer convolutional network in tensorflow here.
If I have a large enough batch size I can pretty much saturate the GPU. I've implemented asynchronous loading with the tf.StagingArea, and I print a warning message if it's ever starved because it's waiting on data loading. I've confirmed that it is not waiting on the data loading at any point.
As I reduce my batch size I notice that the GPU utilization drops. When I have a batch size of 1 I get 35% utilization on a 1080 TI GPU. At a batch size of 20 I get 50% utilization, and so on. It's also less notable on slower GPUs.
I've tested it with a loop over nothing but the main sess.run call to ensure I don't have any other code slowing things down.
I've reviewed the TF high-performance models documentation, but don't note any reference to small batch sizes.
I wonder if anyone has any insight on improving the performance, other than, "duh, just increase the batch size". I include batch size as a hyperparameter because it affects model regularization. I'd like to test with small batch sizes but still utilize the GPU fully if possible. Also, this is generally the case when a single train step takes sufficiently less than ~0.1 sec to process.

Low GPU usage by Keras / Tensorflow?

I'm using keras with tensorflow backend on a computer with a nvidia Tesla K20c GPU. (CUDA 8)
I'm tranining a relatively simple Convolutional Neural Network, during training I run the terminal program nvidia-smi to check the GPU use. As you can see in the following output, the GPU utilization commonly shows around 7%-13%
My question is: during the CNN training shouldn't the GPU usage be higher? is this a sign of a bad GPU configuration or usage by keras/tensorflow?
nvidia-smi output

Could be due to several reasons but most likely you're having a bottleneck when reading the training data. As your GPU has processed a batch it requires more data. Depending on your implementation this can cause the GPU to wait for the CPU to load more data resulting in a lower GPU usage and also a longer training time.
Try loading all data into memory if it fits or use a QueueRunner which will make an input pipeline reading data in the background. This will reduce the time that your GPU is waiting for more data.
The Reading Data Guide on the TensorFlow website contains more information.

You should find the bottleneck:
On windows use Task-Manager> Performance to monitor how you are using your resources
On Linux use nmon, nvidia-smi, and htop to monitor your resources.
The most possible scenarios are:
If you have a huge dataset, take a look at the disk read/write rates; if you are accessing your hard-disk frequently, most probably you need to change they way you are dealing with the dataset to reduce number of disk access
Use the memory to pre-load everything as much as possible.
If you are using a restful API or any similar services, make sure that you do not wait much for receiving what you need. For restful services, the number of requests per second might be limited (check your network usage via nmon/Task manager)
Make sure you do not use swap space in any case!
Reduce the overhead of preprocessing by any means (e.g. using cache, faster libraries, etc.)
Play with the bach_size (however, it is said that higher values (>512) for batch size might have negative effects on accuracy)

The reason may be that your network is "relatively simple". I had a MNIST network with 60k training examples.
with 100 neurons in 1 hidden layer, CPU training was faster and GPU utilization on GPU training was around 10%
with 2 hidden layers, 2000 neurons each, GPU was significantly faster(24s vs 452s on CPU) and its utilization was around 39%
I have a pretty old PC (24GB DDR3-1333, i7 3770k) but a modern graphic card(RTX 2070 + SSDs if that matters) so there is a memory-GPU data transfer bottleneck.
I'm not yet sure how much room for improvement is here. I'd have to train a bigger network and compare it with better CPU/memory configuration + same GPU.
I guess that for smaller networks it doesn't matter that much anyway because they are relatively easy for the CPU.

Measuring GPU performance and utilization is not as straightforward as CPU or Memory. GPU is an extreme parallel processing unit and there are many factors. The GPU utilization number shown by nvidia-smi means what percentage of the time at least one gpu multiprocessing group was active. If this number is 0, it is a sign that none of your GPU is being utilized but if this number is 100 does not mean that the GPU is being used at its full potential.
These two articles have lots of interesting information on this topic:
https://www.imgtec.com/blog/a-quick-guide-to-writing-opencl-kernels-for-rogue/
https://www.imgtec.com/blog/measuring-gpu-compute-performance/

Low GPU utilization might be due to the small batch size. Keras has a habit of occupying the whole memory size whether, for example, you use batch size x or batch size 2x. Try using a bigger batch size if possible and see if it changes.

Minimizing GPU's idle time when using TensorFlow

How can we minimize the idle time of a GPU when training a network using Tensorflow ?
To do this :-
I used multiple Python threads to preprocess data and feed it to a tf.RandomShuffleQueue from where the TensorFlow took the data.
I thought that this will be more efficient than the feed_dict method.
However I still find on doing nvidia-smi that my GPU still goes from 100% utilization to 0% utilization and back to 100% quite often.
Since my network is large and the dataset is also large 12 million, any fruitful advice on speeding up would be very helpful.
Is my thinking that reading data directly from a tf.Queue is better than feed_dict correct ?
NOTE: I am using a 12 GB Titan X GPU (Maxwell architecture)

You are correct on assuming that feeding through a queue is better than feed_dict, for multiple reasons (mainly loading and preprocessing done on CPU, and not on the main thread). But one thing that can undermine this is if the GPU consume the data faster than it is loaded. You should therefore monitor the size of your queue to check if you have times where the queue size is 0.
If this is the case, I would recommand you to move your threading process into the graph, tensorflow as some nice mecanismes to allow batch loading (your loading batchs should be larger than your training batchs to maximise your loading efficiency, I personnaly use training batchs of 128 and loading batchs of 1024) in threads on CPU very efficiently. Moreover, you should place your queue on CPU and give it a large maximum size, you will be able to take advantage of the large size of RAM memory (I always have more than 16000 images loaded in RAM, waiting for training).
If you still have troubles, you should check tensorflow's performance guide:
https://www.tensorflow.org/guide/data_performance

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas