How to split TensorFlow graph (model) onto multiple GPUs to avoid OOM? - tensorflow

So I have this very large and deep model I implemented with TensorFlow r1.2, running on an NVIDIA Tesla K40 with 12 GB of memory. The model consists of several RNNs, a bunch of weight and embedding matrices as well as bias vectors. When I launched the training program, it first took about 2-3 hours to build the model, and then crashed due to OOM issues. I tried reducing the batch size to even 1 data sample per batch, but still ran into the same issue.
If I google "tensorflow multiple gpu", the examples I found mainly focus on utilizing multiple GPUs through data parallelism, which means having each GPU run the same graph and having the CPU compute the total gradient, which is then propagated back to the parameters.
I know one possible solution might be running the model on a GPU with more memory. But I wonder if there's a way to split my graph (model) into different parts sequentially and assign them to different GPUs?

The official guide on using GPUs shows an example of exactly this in the "Using multiple GPUs" section. You just need to create the operations within different tf.device contexts; the nodes will still be added to the same graph, but they will be annotated with device directives indicating where they should run. For example:
with tf.device("/gpu:0"):
net0 = make_subnet0()
with tf.device("/gpu:1"):
net1 = make_subnet1()
result = combine_subnets(net0, net1)
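
For a model like the one described (several stacked RNNs plus embedding matrices), a rough sketch of splitting the layers sequentially across two GPUs could look like the following; it assumes TF 1.x graph-mode APIs, and the input shape, cell sizes and output size are made up:

import tensorflow as tf

# Hypothetical input: [batch, time, features]
inputs = tf.placeholder(tf.float32, [None, 100, 256])

# First half of the network on GPU 0
with tf.device("/gpu:0"):
    cell0 = tf.nn.rnn_cell.LSTMCell(512)
    out0, _ = tf.nn.dynamic_rnn(cell0, inputs, dtype=tf.float32, scope="rnn0")

# Second half of the network on GPU 1
with tf.device("/gpu:1"):
    cell1 = tf.nn.rnn_cell.LSTMCell(512)
    out1, _ = tf.nn.dynamic_rnn(cell1, out0, dtype=tf.float32, scope="rnn1")
    logits = tf.layers.dense(out1[:, -1, :], 10)

# allow_soft_placement lets ops without a GPU kernel fall back to another device
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))

The activations crossing the device boundary (out0 here) are copied between GPUs automatically, so it pays to choose split points where the tensors passed between the parts are relatively small.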

Related

Prediction with GPU is much slower than with CPU?

Curiously, I just found out that my CPU is much faster for predictions.
Doing inference with the GPU is much slower than with the CPU.
I have tf.keras (tf2) NN model with a simple dense layer:
import numpy as np
import tensorflow as tf

input = tf.keras.layers.Input(shape=(100,), dtype='float32')
X = tf.keras.layers.Dense(2)(input)
model = tf.keras.Model(input, X)
# also initialized with weights from a file
weights = np.load("weights.npy", allow_pickle=True)
model.layers[-1].set_weights(weights)
scores = model.predict_on_batch(data)
Predicting on 100 samples I get:
2 s on the GPU
0.07 s on the CPU (!)
I am using a simple GeForce MX150 with 2 GB of memory.
I also tried predict_on_batch(x), since someone suggested it is faster than plain predict, but here it takes the same time.
Refer: Why does keras model predict slower after compile?
Does anyone have an idea what is going on here? What could the issue be?
Using the GPU adds a lot of overhead for loading the data into GPU memory (through the relatively slow PCI bus) and for getting the results back.
For the GPU to be more efficient than the CPU, the model must be very big, there must be plenty of data, and the algorithms must be able to run fully inside the GPU, without requiring partial results to be moved back to the CPU.
The optimal configuration depends on the amount of memory and the number of cores inside your GPU, so you must run some tests, but the following rules of thumb apply:
Your NN should have at least ~10k parameters and your training data set at least 10k records. Otherwise the overhead will probably kill the GPU's performance.
When you call model.fit, use a large batch_size (pay attention, the default is only 32), possibly large enough to contain your whole dataset, or at least a multiple of 1024. Run some tests to find the optimum for you (see the sketch at the end of this answer).
For some GPUs, it can help to perform computations in float16 instead of float32. Follow this tutorial to see how to activate it.
If your GPU has Tensor Cores, several sizes must be multiples of 8 in order to use its hardware efficiently. In the preceding tutorial, see the paragraph "Ensuring GPU Tensor Cores are used" for which parameters must be changed and how. In general, it's a bad idea to use layers whose number of neurons is not a multiple of 8.
Some types of layers, namely RNNs, have an architecture which cannot be run entirely on the GPU. In that case, data must constantly be moved back and forth between CPU and GPU, and the speedup is lost. If an RNN is really needed, TensorFlow 2 has an implementation of the LSTM layer which is optimized for GPU, but there are some limitations on the parameters: see this thread and the documentation.
If you are training a reinforcement learning agent, activate experience replay and use a memory buffer for the experience which is at least 10x your batch_size. This way, NN training is triggered only when a big chunk of data is ready.
Deactivate as much verbosity as possible
If everything is set up correctly, you should be able to train your model faster with GPU than with CPU.
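
To make some of the tips above concrete, here is a minimal sketch combining the large batch size, float16 and multiples-of-8 suggestions; it assumes a recent TF 2.x (the mixed-precision API shown is the 2.4+ one), and the dataset and layer sizes are invented for illustration:

import numpy as np
import tensorflow as tf

# Mixed precision: float16 compute with float32 variables
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Made-up dataset: more than 10k records, feature size a multiple of 8
x = np.random.rand(20000, 256).astype("float32")
y = np.random.randint(0, 10, size=(20000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(256,)),  # multiple of 8
    tf.keras.layers.Dense(10),
    tf.keras.layers.Activation("softmax", dtype="float32"),  # keep the outputs in float32
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Large batch size instead of the default 32, and reduced verbosity
model.fit(x, y, batch_size=1024, epochs=2, verbose=0)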
A GPU is good if you have compute-intensive tasks (large models), because of the overhead of copying your data and results between the host and the GPU. In your case, the model is very small, which means it takes longer to copy the data than to predict. Even if the CPU is slower than the GPU, it doesn't have to copy the data, so it ends up being faster.

Non-Deterministic Results Using GPUs with Tensorflow and Tensorflow Serving ... Why?

We have an object detection model developed in Tensorflow (1.10 and 1.3) that uses a standard CNN and some extra layers. We host the model in Tensorflow Serving 1.13.0 using the saved model format, on Nvidia Tesla V100 GPUs with CUDA 10 and cuDNN 7.4.x. (We use the Google container images and/or Dockerfiles for Tensorflow Serving.)
We run unit tests to ensure that prediction results are what we expect. These all work great on CPU. But when we run them on the above GPU/CUDA/CUDNN configuration, we get differences in the prediction probabilities ranging from .001 to .0005.
Our goals are to understand:
Why does this happen?
Is there anything we can do to prevent it?
If there is something we can do to prevent it, does that entail some sort of trade off, such as performance?
We have tried the following experiments:
Multiple runs of the same model on TensorFlow GPU using a checkpoint with a batch size of 1: results identical.
Multiple runs of the same model on GPU using a checkpoint with various batch sizes: results off by .001.
Multiple runs of the same model on CPU using a checkpoint with various batch sizes: results identical.
Multiple runs of the same model on TensorFlow Serving GPU using a checkpoint with a batch size of 1: results identical.
Comparing runs with a checkpoint to runs with a saved model on GPU: results off by .005.
Comparing runs with a checkpoint to runs with a saved model on CPU: results identical.
Changing the batch_size and setting TF_CUDNN_USE_AUTOTUNE=0 on GPU: reduces the max difference from .001 to .0005.
Adding intra_op_parallelism_threads=1, inter_op_parallelism_threads=1 together with TF_CUDNN_USE_AUTOTUNE=0: results no different than the above.
IN SUMMARY: We have a few cases where the results of running inference on GPU are different:
Using a checkpoint versus a saved model.
Batch size = 1 versus various batch sizes
Setting TF_CUDNN_USE_AUTOTUNE=0 reduces the difference when using various batch sizes
This happens with TF 1.10 AND 1.13.1
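
For reference, a rough sketch of how the settings mentioned above (autotune off, single-threaded op scheduling) can be applied in TF 1.x Session-based inference; this is an illustration, not the exact serving configuration used here:

import os
import tensorflow as tf

# Disable cuDNN autotuning so the same convolution algorithms are picked on every run
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

# Force single-threaded op scheduling inside TensorFlow
config = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1,
)

with tf.Session(config=config) as sess:
    # ... restore the checkpoint / load the SavedModel and run inference here ...
    pass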
Again, our goals are to understand:
Why does this happen?
Is there anything we can do to prevent it?
If there is something we can do to prevent it, does that entail some sort of trade off, such as performance?
I had some crazy nondeterministic behavior going on that didn't occur on my laptop's GPU but happened on the server's GPUs.
Solution: Now I call cudaDeviceSynchronize() every time after a call to a cuBLAS, cuSOLVER, etc., function, and the nondeterminism disappeared! :) It drove me really crazy and angry, but apparently, because those libraries use streams, you can end up using the contents of a device pointer before those libraries' functions have finished writing the results.

Training on multi-GPUs with a small batch size

I am running TensorFlow on a machine which has two GPUs, each with 3 GB of memory. A single batch of my data is only 2 GB, and so fits on one GPU. Is there any point in training with both GPUs (using CUDA_VISIBLE_DEVICES)? If I did, how would TensorFlow distribute the training?
With regards to memory: I assume you mean that one data batch is 2 GB. However, TensorFlow also requires memory to store variables as well as hidden layer results, etc. (to compute gradients). For this reason it also depends on your specific model whether the memory will be enough. Your best bet would be to just try with one GPU and see if the program crashes due to memory errors.
With regards to distribution: TensorFlow doesn't do this automatically at all. Each op is placed on some device. By default, if you have any number of GPUs available, all GPU-compatible ops will be placed on the first GPU and the rest on the CPU. This is despite TensorFlow reserving all memory on all GPUs by default.
You should have a look at the GPU guide on the Tensorflow website. The most important thing is that you can use the with tf.device context manager to place ops on other GPUs. Using this, the idea would be to split your batch into X chunks (X = number of GPUs) and define your model on each device, each time taking the respective chunk as input and making sure to reuse variables.
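A rough sketch of that idea (split the batch into per-GPU chunks, build the model once per device, and reuse the variables), assuming TF 1.x graph mode; the layer sizes, placeholders and two-GPU setup are made up:

import tensorflow as tf

def tower(x, reuse):
    # reuse=True makes the later towers share the variables created by the first one
    h = tf.layers.dense(x, 128, activation=tf.nn.relu, name="fc1", reuse=reuse)
    return tf.layers.dense(h, 10, name="fc2", reuse=reuse)

features = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.int64, [None])

# Split the incoming batch into one chunk per GPU
feature_chunks = tf.split(features, num_or_size_splits=2, axis=0)
label_chunks = tf.split(labels, num_or_size_splits=2, axis=0)

losses = []
for i, (x, y) in enumerate(zip(feature_chunks, label_chunks)):
    with tf.device("/gpu:%d" % i):
        logits = tower(x, reuse=True if i > 0 else None)
        losses.append(tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)))

# Minimizing the averaged loss is equivalent to averaging the per-tower gradients
train_op = tf.train.AdamOptimizer().minimize(tf.add_n(losses) / len(losses))

In practice the shared variables are often pinned to the CPU so that both GPUs read them equally, but the sketch above keeps things minimal.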
If you are using tf.Estimator, there is some information in this question. It is very easy to do distributed execution there using just two simple wrappers, but I personally haven't been able to use it successfully (it was pretty slow and crashed randomly with a segfault).

Data Parallelism for RNN in tensorflow

Recently, I used TensorFlow to develop an NMT system. I tried to train this system on multiple GPUs using the data-parallelism method to speed it up. I follow the standard data-parallelism approach widely used with TensorFlow. For example, if we want to run it on an 8-GPU machine: first, we construct a large batch which is 8 times the size of the batch used on a single GPU. Then we split this large batch equally into 8 mini-batches and train them separately on different GPUs. In the end, we collect the gradients to update the parameters. But I find that when I use dynamic_rnn, the average time taken by one iteration on 8 GPUs is twice as long as one iteration trained on a single GPU. I made sure the batch size for each GPU is the same. Does anyone have a better way to speed up the training of RNNs in TensorFlow?

Will adding GPU cards automatically scale tensorflow usage?

Suppose I can train with sample size N, batch size M and network depth L on my GTX 1070 card with TensorFlow. Now suppose I want to train with a larger sample 2N and/or a deeper network 2L and I get an out-of-memory error.
Will plugging in additional GPU cards automatically solve this problem (supposing that the total amount of memory across all GPU cards is sufficient to hold the batch and its gradients)? Or is it impossible with pure TensorFlow?
I've read that there are Bitcoin or Ethereum miners who build mining farms with multiple GPU cards, and that such a farm mines faster.
Will a mining farm also perform better for deep learning?
Will plugging in additional GPU cards automatically solve this problem?
No. You have to change your TensorFlow code to explicitly compute different operations on different devices (e.g. compute the gradients over a single batch on every GPU, then send the computed gradients to a coordinator that accumulates the received gradients and updates the model parameters by averaging these gradients).
Also, TensorFlow is flexible enough to let you specify different operations for every device (or different remote nodes, which is the same thing).
For example, you could do data augmentation on a single computational node and let the others process the data without applying this function. You can execute a certain operation on one device, or on a set of devices, only.
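To illustrate the coordinator approach described above, here is a rough sketch of per-GPU gradient computation followed by averaging, assuming TF 1.x graph mode; build_loss() is a hypothetical helper that builds the model for one chunk and returns a scalar loss:

import tensorflow as tf

NUM_GPUS = 2
optimizer = tf.train.GradientDescentOptimizer(0.01)

big_batch = tf.placeholder(tf.float32, [None, 128])        # made-up input
chunks = tf.split(big_batch, NUM_GPUS, axis=0)

tower_grads = []
for i, chunk in enumerate(chunks):
    with tf.device("/gpu:%d" % i), \
         tf.variable_scope("net", reuse=True if i > 0 else None):
        loss = build_loss(chunk)                            # hypothetical model/loss builder
        tower_grads.append(optimizer.compute_gradients(loss))

# "Coordinator" step: average each variable's gradient over the towers,
# then apply a single update to the shared parameters
averaged = []
for grads_and_vars in zip(*tower_grads):                    # one tuple per variable
    grads = tf.stack([g for g, _ in grads_and_vars], axis=0)
    averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
train_op = optimizer.apply_gradients(averaged)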
Or is it impossible with pure tensorflow?
It's possible with TensorFlow, but you have to change the code you wrote for a single training/inference device.
I've read that there are Bitcoin or Ethereum miners who build mining farms with multiple GPU cards, and that such a farm mines faster.
Will a mining farm also perform better for deep learning?
Blockchains that use POW (Proof of Work) require solving a difficult problem with a brute-force-like approach (miners compute lots of hashes with different inputs until they find a valid hash).
That means that if a single GPU can compute 1000 hashes/s, 2 identical GPUs can compute 2 x 1000 hashes/s.
The computations the GPUs are doing are completely uncorrelated: the data produced by GPU:0 is not used by GPU:1, and there are no synchronization points between the computations. This means that the task one GPU does can be executed in parallel by another GPU (obviously with different inputs per GPU, so the devices compute hashes to solve different problems given by the network).
Back to TensorFlow: once you've modified your code to work with different GPUs, you can train your network faster (in short, because you're using bigger batches).