Training on multi-GPUs with a small batch size - tensorflow

I am running TensorFlow on a machine which has two GPUs, each with 3 GB memory. My batch size is only 2GB, and so can fit on one GPU. Is there any point in training with both GPUs (using CUDA_VISIBLE_DEVICES)? If I did, how would TensorFlow distribute the training?

Regarding memory: I assume you mean that one data batch is 2 GB. However, TensorFlow also needs memory to store variables as well as intermediate results such as hidden-layer activations (needed to compute gradients). Whether the memory will be enough therefore also depends on your specific model. Your best bet is to just try with one GPU and see whether the program crashes with an out-of-memory error.
Regarding distribution: TensorFlow doesn't do this automatically at all. Each op is placed on some device. By default, if you have any number of GPUs available, all GPU-compatible ops are placed on the first GPU and the rest on the CPU. This is despite TensorFlow reserving all memory on all GPUs by default.
You should have a look at the GPU guide on the TensorFlow website. The most important thing is that you can use the with tf.device context manager to place ops on other GPUs. Using this, the idea is to split your batch into X chunks (X = number of GPUs) and define your model on each device, each time taking the respective chunk as input and making sure to reuse variables.
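The batch-splitting step can be sketched in plain Python. This is only an illustration of the chunking logic, not TensorFlow code: split_batch and the device strings are hypothetical names chosen for this example, and in a real graph each chunk would feed a model replica built inside a with tf.device(...) block.

```python
def split_batch(batch, num_devices):
    """Split a batch (any sliceable sequence) into num_devices nearly equal chunks."""
    base, extra = divmod(len(batch), num_devices)
    chunks, start = [], 0
    for i in range(num_devices):
        # The first `extra` chunks get one additional element.
        size = base + (1 if i < extra else 0)
        chunks.append(batch[start:start + size])
        start += size
    return chunks


batch = list(range(10))            # stand-in for one batch of 10 samples
devices = ["/gpu:0", "/gpu:1"]     # assumed device names, one per GPU

for device, chunk in zip(devices, split_batch(batch, len(devices))):
    # In TensorFlow, the model replica consuming `chunk` would be defined
    # under `with tf.device(device):`, reusing the same variables.
    print(device, chunk)
```

Each device then sees only its own chunk, so the per-device memory needed for inputs and activations shrinks roughly in proportion to the number of GPUs.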
If you are using tf.Estimator, there is some information in this question. It is very easy to do distributed execution here using just two simple wrappers, but I personally haven't been able to use it successfully (pretty slow and crashes randomly with a segfault).

Related

Prediction with GPU is much slower than with CPU?

Curiously, I just found out that my CPU is much faster for predictions.
Doing inference with the GPU is much slower than with the CPU.
I have tf.keras (tf2) NN model with a simple dense layer:
import numpy as np
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(100,), dtype='float32')
outputs = tf.keras.layers.Dense(2)(inputs)
model = tf.keras.Model(inputs, outputs)
# also initialized with weights from a file
weights = np.load("weights.npy", allow_pickle=True)
model.layers[-1].set_weights(weights)
# data: the samples to score, each of shape (100,)
scores = model.predict_on_batch(data)
For 100 samples doing predictions I get:
2 s for GPU
0.07 s for CPU (!)
I am using a simple GeForce MX150 with 2 GB.
I also tried predict_on_batch(x), as someone suggested it is faster than plain predict. But here it takes the same time.
Refer: Why does keras model predict slower after compile?
Does anyone have an idea what is going on here? What could be the issue?
Using the GPU adds a lot of overhead to load data into GPU memory (through the relatively slow PCI bus) and to get the results back.
For the GPU to be more efficient than the CPU, the model must be very big, have plenty of data and use algorithms that can run fully inside the GPU, without requiring partial results to be moved back to the CPU.
The optimal configuration depends on the amount of memory and the number of cores in your GPU, so you must run some tests, but the following rules apply:
Your NN should have at least 10k parameters, and the training data set at least 10k records. Otherwise the overhead will probably kill the performance of the GPU.
When you call model.fit, use a large batch_size (note that the default is only 32), possibly large enough to contain your whole dataset, or at least a multiple of 1024. Run some tests to find the optimum for your setup.
For some GPUs, it might help to perform computations in float16 instead of float32. Follow this tutorial to see how to activate it.
If your GPU has dedicated Tensor Cores, several dimensions must be multiples of 8 for the hardware to be used efficiently. In the preceding tutorial, see the paragraph "Ensuring GPU Tensor Cores are used" for which parameters must be changed and how. In general, it's a bad idea to use layers whose number of neurons is not a multiple of 8.
Some types of layers, notably RNNs, have a sequential architecture that the GPU cannot process efficiently. In this case, data must be moved constantly back and forth to the CPU and the speed advantage is lost. If an RNN is really needed, TensorFlow v2 has an implementation of the LSTM layer which is optimized for GPU, but it places some limitations on the parameters: see this thread and the documentation.
If you are training a reinforcement learning agent, activate Experience Replay and use a memory buffer for the experience which is at least 10x your batch_size. This way, you will trigger NN training only when a big chunk of data is ready.
Deactivate as much verbosity as possible.
If everything is set up correctly, you should be able to train your model faster with the GPU than with the CPU.
A GPU pays off only for compute-intensive tasks (large models), because of the overhead of copying your data and results between the host and the GPU. In your case, the model is very small, so copying the data takes longer than predicting. Even though the CPU is slower than the GPU at the computation itself, it does not have to copy the data, so it's ultimately faster.
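The trade-off between transfer overhead and compute speed can be illustrated with a toy cost model. This is a rough sketch under strong assumptions: the functions are hypothetical, the whole 2 s GPU time from the question is treated as fixed overhead, the CPU per-sample cost is taken as 0.07 s / 100 samples, and the GPU per-sample cost is an invented placeholder, not a measurement.

```python
def total_time(fixed_overhead, per_sample, num_samples):
    """Toy model: time = fixed overhead + per-sample cost * number of samples."""
    return fixed_overhead + per_sample * num_samples


def break_even_samples(gpu_overhead, gpu_per_sample, cpu_per_sample):
    """Number of samples at which GPU and CPU take equal time (CPU overhead assumed 0).

    Returns None when the GPU is not faster per sample, so it never catches up.
    """
    if cpu_per_sample <= gpu_per_sample:
        return None
    return gpu_overhead / (cpu_per_sample - gpu_per_sample)


CPU_PER_SAMPLE = 0.07 / 100   # from the question: 0.07 s for 100 samples
GPU_OVERHEAD = 2.0            # assumption: the 2 s GPU time is mostly fixed overhead
GPU_PER_SAMPLE = 0.0001       # hypothetical placeholder, not measured

print(total_time(0.0, CPU_PER_SAMPLE, 100))
print(break_even_samples(GPU_OVERHEAD, GPU_PER_SAMPLE, CPU_PER_SAMPLE))
```

Under these assumed numbers the GPU would only win for batches of a few thousand samples; the real crossover point has to be measured on the actual hardware.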

Will adding GPU cards automatically scale tensorflow usage?

Suppose I can train with sample size N, batch size M and network depth L on my GTX 1070 card with tensorflow. Now suppose I want to train with a larger sample 2N and/or a deeper network 2L and I get an out-of-memory error.
Will plugging in additional GPU cards automatically solve this problem (supposing that the total amount of memory of all GPU cards is sufficient to hold the batch and its gradients)? Or is it impossible with pure tensorflow?
I've read that there are Bitcoin or Ethereum miners who build mining farms with multiple GPU cards, and that such a farm will mine faster.
Will a mining farm also perform better for deep learning?
Will plugging additional GPU cards automatically solve this problem?
No. You have to change your TensorFlow code to explicitly compute different operations on different devices (e.g. compute the gradients over a single batch on every GPU, then send the computed gradients to a coordinator that accumulates them and updates the model parameters by averaging these gradients).
Also, TensorFlow is flexible enough to let you specify different operations for every device (or for different remote nodes, it's the same).
You could do data augmentation on a single computational node and let the others process the data without applying this function. You can execute a certain operation on one device or on a set of devices only.
it is impossible with pure tensorflow?
It's possible with tensorflow, but you have to change the code you wrote for a single train/inference device.
I've read that there are Bitcoin or Ethereum miners who build mining farms with multiple GPU cards, and that such a farm will mine faster.
Will a mining farm also perform better for deep learning?
Blockchains that work using PoW (Proof of Work) require solving a difficult problem with a brute-force-like approach (they compute lots of hashes with different inputs until they find a valid hash).
That means that if your single GPU can guess 1000 hash/s, 2 identical GPUs can guess 2 x 1000 hash/s.
The computations the GPUs are doing are completely uncorrelated: the data produced by GPU:0 is not used by GPU:1 and there are no synchronization points between the computations. This means that the task one GPU does can be executed in parallel by another GPU (obviously with different inputs per GPU, so the devices compute hashes to solve different problems given by the network).
Back to TensorFlow: once you have modified your code to work with different GPUs, you could train your network faster (in short, because you're using bigger batches).
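The coordinator scheme described in this answer (per-device gradients, averaged before one parameter update) can be sketched without TensorFlow. This is a minimal plain-Python illustration on a one-parameter linear model; the function names and the toy data are hypothetical, and the batch is assumed to divide evenly across devices.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_step(w, xs, ys, num_devices, lr):
    """One update: each 'device' computes a gradient on its chunk, the coordinator averages."""
    chunk = len(xs) // num_devices  # assumes len(xs) is divisible by num_devices
    grads = [
        mse_gradient(w, xs[i * chunk:(i + 1) * chunk], ys[i * chunk:(i + 1) * chunk])
        for i in range(num_devices)
    ]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad, avg_grad


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # toy data generated by y = 2 * x
w = 0.0

new_w, avg_grad = data_parallel_step(w, xs, ys, num_devices=2, lr=0.01)
full_grad = mse_gradient(w, xs, ys)
print(avg_grad, full_grad, new_w)
```

With equal-sized chunks the averaged gradient matches the gradient over the whole batch, which is why the devices need a synchronization point that hash mining does not.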

Low GPU usage by Keras / Tensorflow?

I'm using keras with the tensorflow backend on a computer with an NVIDIA Tesla K20c GPU (CUDA 8).
I'm training a relatively simple Convolutional Neural Network. During training I run the terminal program nvidia-smi to check the GPU use. As you can see in the following output, the GPU utilization commonly shows around 7%-13%.
My question is: during CNN training, shouldn't the GPU usage be higher? Is this a sign of a bad GPU configuration or usage by keras/tensorflow?
nvidia-smi output
This could be due to several reasons, but most likely you have a bottleneck when reading the training data. Once your GPU has processed a batch, it requires more data. Depending on your implementation, this can cause the GPU to wait for the CPU to load more data, resulting in lower GPU usage and also a longer training time.
Try loading all the data into memory if it fits, or use a QueueRunner, which creates an input pipeline that reads data in the background. This will reduce the time your GPU spends waiting for more data.
The Reading Data Guide on the TensorFlow website contains more information.
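The idea of reading data in the background can be sketched with the Python standard library alone. This is not the QueueRunner API, only an illustration of the same producer/consumer pattern; prefetch and load_batches are hypothetical names, and load_batches stands in for slow disk reads.

```python
import queue
import threading


def prefetch(generator, buffer_size=4):
    """Yield items from `generator` while a background thread loads the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the data

    def producer():
        for item in generator:
            buffer.put(item)   # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()    # blocks only if the loader has fallen behind
        if item is sentinel:
            return
        yield item


def load_batches(num_batches):
    """Hypothetical loader standing in for reading batches from disk."""
    for i in range(num_batches):
        yield [i] * 3


batches = list(prefetch(load_batches(5)))
print(batches)
```

While the consumer (the training step) works on one batch, the producer thread is already filling the buffer with the next ones, so the GPU waits less often.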
You should find the bottleneck:
On Windows, use Task Manager > Performance to monitor how you are using your resources.
On Linux, use nmon, nvidia-smi, and htop to monitor your resources.
The most likely scenarios are:
If you have a huge dataset, take a look at the disk read/write rates; if you are accessing your hard disk frequently, you most probably need to change the way you are dealing with the dataset to reduce the number of disk accesses.
Use the memory to pre-load as much as possible.
If you are using a RESTful API or any similar service, make sure that you do not wait long to receive what you need. For RESTful services, the number of requests per second might be limited (check your network usage via nmon/Task Manager).
Make sure you do not use swap space in any case!
Reduce the overhead of preprocessing by any means (e.g. using a cache, faster libraries, etc.).
Play with the batch_size (however, it is said that higher values (>512) for batch size might have negative effects on accuracy).
The reason may be that your network is "relatively simple". I had a MNIST network with 60k training examples.
with 100 neurons in 1 hidden layer, CPU training was faster and GPU utilization on GPU training was around 10%
with 2 hidden layers, 2000 neurons each, GPU was significantly faster (24s vs 452s on CPU) and its utilization was around 39%
I have a pretty old PC (24GB DDR3-1333, i7 3770k) but a modern graphics card (RTX 2070, plus SSDs if that matters), so there is a memory-GPU data transfer bottleneck.
I'm not yet sure how much room for improvement is here. I'd have to train a bigger network and compare it with better CPU/memory configuration + same GPU.
I guess that for smaller networks it doesn't matter that much anyway because they are relatively easy for the CPU.
Measuring GPU performance and utilization is not as straightforward as for CPU or memory. A GPU is an extremely parallel processing unit and there are many factors. The GPU utilization number shown by nvidia-smi is the percentage of time during which at least one kernel was executing on the GPU. If this number is 0, it is a sign that your GPU is not being used at all, but if this number is 100, it does not mean that the GPU is being used at its full potential.
These two articles have lots of interesting information on this topic:
https://www.imgtec.com/blog/a-quick-guide-to-writing-opencl-kernels-for-rogue/
https://www.imgtec.com/blog/measuring-gpu-compute-performance/
Low GPU utilization might be due to a small batch size. Keras has a habit of occupying the whole GPU memory regardless of whether, for example, you use batch size x or batch size 2x. Try using a bigger batch size if possible and see if it changes.

How to tell if my neural network is crashing due to memory errors?

I am trying to calibrate my expectations around a single laptop's ability to train a neural network. I am using tensorflow and keras and after about say 10 minutes, it crashes. I've seen killsignal 9 exit code 137, and I'm wondering if this is due to insufficient memory? Other times, when one-hot encoding using np_utils.to_categorical() I've seen the words memoryerror in the console, and that's it and my script crashes. This is just trying to transform the outputs into what a neural net expects before it even runs.
I have 6400 inputs and 1500 outputs and a small hidden layer of 100 nodes. Batch size 128.
That's it. It's not even deep. It crashes whether using an nvidia gpu or a 4 core cpu. For you pros, is my network too big to train on my system (i7 4 cores, 16gb ram, nvidia GT 750m , compute capability 3.0). Is my neural network considered a large one? I have 3 million samples, btw.
1) How do I estimate the amount of memory required for my network? Is it 6400 (# inputs) * 1500 (#outputs) * 4 bytes (per parameter) = 38.4 gb? Can I see how much memory is being used in real time on a mac somewhere? I've used activity monitor and the memory pressure gauge is normal.
2) GPUs typically are maxing out at 8gb-12gb of RAM, whereas CPUs on desktops could easily have 64 gb. So if the memory requirements of my network exceed 8gb of RAM, would it be impossible to train on a single GPU?
3) what is the difference, especially memory wise, between batch_size and batch_training?
Thank you!
Your multiplication was correct, except that you are dealing with megabytes, not gigabytes. The actual requirement for the weights is 6400*100*4 + 100*1500*4 bytes, which is about 3.2 MB if you use the default float32. You multiply the sizes of two subsequent layers because every neuron is connected to every neuron in the subsequent layer. On top of the weights, the activations and their gradients are stored per sample, so that part of the memory requirement is multiplied by the batch size. Fully connected layers grow quickly in this way, which is one reason convolutional layers are used to train deep networks.
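The arithmetic for the 6400-100-1500 network from the question can be checked with a short sketch. The helper names are hypothetical, and the estimate is deliberately a lower bound: it counts float32 weights, biases and one copy of the layer activations per sample, and ignores gradients, optimizer state and framework overhead.

```python
def dense_param_bytes(layer_sizes, bytes_per_value=4, include_biases=True):
    """Bytes needed for the weights (and biases) of a fully connected network."""
    count = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        count += n_in * n_out          # one weight per connection
        if include_biases:
            count += n_out             # one bias per output neuron
    return count * bytes_per_value


def activation_bytes(layer_sizes, batch_size, bytes_per_value=4):
    """Bytes needed to hold every layer's values for one batch (forward pass only)."""
    return sum(layer_sizes) * batch_size * bytes_per_value


sizes = [6400, 100, 1500]   # inputs, hidden layer, outputs (from the question)

weights_only = dense_param_bytes(sizes, include_biases=False)
params = dense_param_bytes(sizes)
acts = activation_bytes(sizes, batch_size=128)

print(weights_only / 1e6, "MB of weights")
print(params / 1e6, "MB of weights and biases")
print(acts / 1e6, "MB of activations for a batch of 128")
```

Only the activation term grows with the batch size; the parameter term stays the same however many samples are processed at once.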
For the GPU, I use nvidia-smi to monitor the memory requirements on Linux. A Google search gave me this for Mac: https://phvu.net/2015/03/30/nvidia-smi-on-macos/. If the memory requirements exceed the GPU memory, you cannot train it on the GPU. You could train it on a CPU, but that will take ages.
There are multiple ways to train with a large training set. Normally generators are used to train on batches. This means only loading the parts of the training set you actually need (https://keras.io/getting-started/faq/#how-can-i-use-keras-with-datasets-that-dont-fit-in-memory).
The memory requirements of your neural network depend not only on the size of the network or the number of parameters itself. For calculating the memory footprint of a neural network, one document that I always go to is the Stanford CS231n Convolutional Neural Networks for Visual Recognition course notes. Please take a look at the portion where they work out the memory requirements for each and every layer of the network.
To add to that, batch size (the number of inputs per batch) is a crucial factor in deciding the memory usage. For example, on a newer NVIDIA P100 GPU, I can go as high as 2048 images per batch if I train a CIFAR10 model, but fewer than 512 or 256 images if I train AlexNet on the ImageNet dataset. The input size matters a lot, and so does the batch size, since the GPU memory needs to hold the whole batch of inputs.
One way to test which batch size works is to run nvidia-smi and see how much memory is used. Since doing it every now and then is boring, I usually run watch nvidia-smi on my Linux machine. On my Mac I do not have an NVIDIA GPU installed, so I seldom use these tricks. When I want to, I write quick bash scripts like this:
while true; do nvidia-smi; sleep 0.5; clear; done
You can also install watch on a Mac with MacPorts (port install watch).
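The same polling can be done from Python when the numbers are needed inside a script rather than on screen. A minimal sketch, assuming nvidia-smi is on the PATH and supports its CSV query mode; parse_memory and gpu_memory are hypothetical helper names, and the sample string below is made up to show the parsing.

```python
import subprocess

# nvidia-smi query mode: one CSV line per GPU, values in MiB without units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse lines of 'used, total' into a list of (used_mib, total_mib) tuples."""
    result = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        result.append((used, total))
    return result


def gpu_memory():
    """Run nvidia-smi once and return the memory usage of every GPU."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(output)


# Made-up output for two GPUs, used here only to demonstrate the parser.
sample = "1234, 8192\n10, 8192\n"
print(parse_memory(sample))
```

Calling gpu_memory() in a loop with time.sleep gives the same effect as the bash loop, with the values available for logging next to the training metrics.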
Also, two of my most favorite tools of all time are htop and dstat.
htop gives you a much better graphical interface to the famous top command in Linux. It gives you real-time information regarding your memory and processor usage, along with the different processes. If you give sudo access to htop, you can change the niceness and other parameters directly from the interface.
dstat gives you real time information about your I/O. In most cases, I will add two flags -d and -n to specify disk and network usage only.
Fortunately, htop can be brew installed on Mac by running:
brew install htop
dstat on the other hand is not directly available. Please look into ifstat or iostat for similar functionalities.
A screenshot of htop command in Mac.

TensorFlow: How to measure how much GPU memory each tensor takes?

I'm currently implementing YOLO in TensorFlow and I'm a little surprised at how much memory it is taking. On my GPU I can train YOLO using their Darknet framework with batch size 64. In TensorFlow I can only do it with batch size 6; with 8 I already run out of memory. For the test phase I can run with batch size 64 without running out of memory.
I am wondering how I can calculate how much memory is being consumed by each tensor? Are all tensors by default saved in the GPU? Can I simply calculate the total memory consumption as the shape * 32 bits?
I noticed that since I'm using momentum, all my tensors also have a /Momentum tensor. Could that also be using a lot of memory?
I am augmenting my dataset with a method distorted_inputs, very similar to the one defined in the CIFAR-10 tutorial. Could it be that this part is occupying a huge chunk of memory? I believe Darknet does the modifications in the CPU.
Now that issue 1258 has been closed, you can enable memory logging in Python by setting an environment variable before importing TensorFlow:
import os
os.environ['TF_CPP_MIN_VLOG_LEVEL']='3'
import tensorflow as tf
There will be a lot of logging as a result of this. You'll want to grep the results to find the appropriate lines. For example:
grep MemoryLogTensorAllocation train.log
Sorry for the slow reply. Unfortunately right now the only way to set the log level is to edit tensorflow/core/platform/logging.h and recompile with e.g.
#define VLOG_IS_ON(lvl) ((lvl) <= 1)
There is an open bug, 1258, to control logging more elegantly.
MemoryLogTensorOutput entries are logged at the end of each Op execution, and indicate the tensors that hold the outputs of the Op. It's useful to know these tensors since the memory is not released until the downstream Op consumes the tensors, which may be much later on in a large graph.
See the description in this (commit).
The raw memory allocation info is there, although it needs a script to collect the information into an easy-to-read form.
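A collecting script of that kind might look like the sketch below. It rests on an assumption about the log format: that each MemoryLogTensorAllocation line contains a kernel_name: "..." field and an allocated_bytes: N field, as in protobuf text output. The sample lines are hypothetical and should be replaced with lines from a real train.log after checking that the field names match.

```python
import re
from collections import defaultdict

# Assumed field names; verify them against your own log before relying on this.
KERNEL_RE = re.compile(r'kernel_name: "([^"]*)"')
BYTES_RE = re.compile(r"allocated_bytes: (\d+)")


def summarize_allocations(lines):
    """Sum allocated bytes per kernel name, largest total first."""
    totals = defaultdict(int)
    for line in lines:
        if "MemoryLogTensorAllocation" not in line:
            continue
        kernel = KERNEL_RE.search(line)
        size = BYTES_RE.search(line)
        if kernel and size:
            totals[kernel.group(1)] += int(size.group(1))
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical log lines, shaped like the assumed format.
sample_log = [
    'MemoryLogTensorAllocation { step_id: 1 kernel_name: "conv1" '
    "tensor { allocation_description { allocated_bytes: 1024 } } }",
    'MemoryLogTensorAllocation { step_id: 1 kernel_name: "fc1" '
    "tensor { allocation_description { allocated_bytes: 512 } } }",
    'MemoryLogTensorAllocation { step_id: 2 kernel_name: "conv1" '
    "tensor { allocation_description { allocated_bytes: 2048 } } }",
    "some unrelated log line",
]

summary = summarize_allocations(sample_log)
for kernel_name, total_bytes in summary:
    print(kernel_name, total_bytes)
```

The totals are cumulative allocations per op, not peak usage, because memory released after downstream ops consume a tensor is not subtracted here.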