tensorflow wide linear model inference on gpu slow - tensorflow

I am training a sparse logistic regression model on tensorflow. This problem is specifically about the inference part. I am trying to benchmark inference on on cpu and gpu. I am using the Nvidia P100 gpu (4 dies) on my current GCE box. I am new to gpu so sorry for naive questions.
The model is pretty big ~54k operation (is it considered big compared to dnn or imagenet models ? ) . When i log device placement , i only see gpu:0 being used , and rest of them unused ? I don't do any device placement during training , but during inference i want it to optimally place and use gpu.
Few things i observed : my input node placehoolder (feed_dict) is placed on cpu, so i assume my data is being copied from cpu to gpu ? how does feed_dict exactly work behind the scene ?
1) How can i place my data on which i want to run prediction directly on gpu ? Note : my training runs on distributed cpu with multiple terabytes so i cannot have constant or variable directly in my graph during training , but my inference i can definitely have small batches of data that i would directly like to place on gpu. Are there ways i can achieve this ?
2) Since i am using P100 gpu , i think it has unified memory with host , is it possible to have zerocopy and directly have my data loaded into gpu ? How can i do this from python , java and c++ code. Currently i use feed_dict which from various google sources i think is not at all optimal .
3) Is there some tool or profiler i can use to see when i profile code like :
for epoch_step in epochs:
start_time = time.time()
for i in range(epoch_step):
result = session.run(output, feed_dict={input_example: records_batch})
end_time = time.time()
print("Batch {} epochs {} :time {}".format(batch_size, epoch_step, str(end_time - start_time)))
how much time is being spent on 1) cpu to gpu data transfer 2) session run overhead 3) gpu utilization (currently i use nvidia-smi periodically to monitor
4) kernel call overhead on cpu vs gpu (I assume each invokation of sess.run invokes 1 kernel call right ?
my current bench marking results :
CPU :
Batch size : 10
NumberEpochs TimeGPU TimeCPU
10 5.473 0.484
20 11.673 0.963
40 22.716 1.922
100 56.998 4.822
200 113.483 9.773
Batch size : 100
NumberEpochs TimeGPU TimeCPU
10 5.904 0.507
20 11.708 1.004
40 23.046 1.952
100 58.493 4.989
200 118.272 9.912
Batch size : 1000
NumberEpochs TimeGPU TimeCPU
10 5.986 0.653
20 12.020 1.261
40 23.887 2.530
100 59.598 6.312
200 118.561 12.518
Batch size : 10k
NumberEpochs TimeGPU TimeCPU
10 7.542 0.969
20 14.764 1.923
40 29.308 3.838
100 72.588 9.822
200 146.156 19.542
Batch size : 100k
NumberEpochs TimeGPU TimeCPU
10 11.285 9.613
20 22.680 18.652
40 44.065 35.727
100 112.604 86.960
200 225.377 174.652
Batch size : 200k
NumberEpochs TimeGPU TimeCPU
10 19.306 21.587
20 38.918 41.346
40 78.730 81.456
100 191.367 202.523
200 387.704 419.223
Some notable observations:
As batch size increase i see my gpu utilization increase (reaches to 100% for the only gpu it uses , is there a way i can tell tf to use other gpu too)
at batch size 200k is the only time i see my naive benchmarking shows gpu has minor gain as compared to cpu.
Increasing batch size for a given epoch has minimal effect on time both cpu and gpu until batch size <= 10k. But after that increasing batch size from 10k -> 100k -> 200k the time also increase quite fast i.e for a given epoch let us say 10 batch size 10, 100, 1k, 10k the cpu time and gpu time remain pretty stable ~5-7 sec for gpu and 0.48-0.96 sec for cpu (meaning that sess.run has much higher overhead than computing of graph themselves ?), but increasing batch size further the compute time increase at much faster rate i.e for epoch 10 100k->200k gputime increase from 11 -> 19 sec and cpu time also doubles , why so ? It seems for larger batch size even though i have just one sess.run , but internally it splits that into smaller batch and calls sess.run twice because epoch 20 batch size 100k matches more closely with epoch 10 batch 200k ..
How can i improve my inference further , i believe i am not usding all gpus optimally.
Are there any ideas around how can i benchmark better to get better breakdowns of time for cpu-> gpu transfer and actual speedup for graph computation from moving from cpu to gpu ?
Loading data better directly if possible zero copy into gpu ?
Can i place some nodes to gpu only during inference to get better performance ?
Ideas around quantization or optimizing inference graph ?
Any more ideas to improve gpu based inference . May be xla based optimization or tensrort ? i want to have high performance inference code to run these computations on gpu while the application server crunches on cpu.

One source of information are the TensorFlow docs on performance, including Optimizing for GPU and High Performance Models.
That said, those guides tend to target training more than batch inference, though certainly some of the principles still apply.
I will note that, unless you are using DistributionStrategy, TensorFlow will not automatically put ops on more than a single GPU (source).
In your particularly case, I don't believe GPUs are yet well-tuned to do the type of sparse operation required for your model, so I don't actually expect it to do that well on a GPU (if you log the device placement there's a chance the lookup is done on the CPU). A logistic regression model has only an (sparse) input layer and an output layer, so generally there are very few math ops. GPUs excel the most when they are doing lots of matrix multiplies, convolutions, etc.
Finally, I would encourage you to use TensorRT to optimize your graph, though for your particular model there's no guarantee it does much better.

Related

Multi-GPU training does not reduce training time

I have tried training three UNet models using keras for image segmentation to assess the effect of multi-GPU training.
First model was trained using 1 batch size on 1 GPU (P100). Each training step took ~254ms. (Note it is step, not epoch).
Second model was trained using 2 batch size using 1 GPU (P100). Each training step took ~399ms.
Third model was trained using 2 batch size using 2 GPUs (P100). Each training step took ~370ms. Logically it should have taken the same time as the first case, since both GPUs process 1 batch in parallel but it took more time.
Anyone who can tell whether multi-GPU training results in reduced training time or not? For reference, I tried all the models using keras.
I presume that this is due to the fact that you use a very small batch_size; in this case, the cost of distributing the gradients/computations over two GPUs and fetching them back (as well as CPU to GPU(2) data distribution) outweigh the parallel time advantage that you might gain versus the sequential training(on 1 GPU).
Expect to see a bigger difference for a batch size of 8/16 for instance.

Tuning Tensorflow Estimators based on batch size, hash bucket size, memory etc on CPU?

We're testing out various estimators such as LinearEstimator, DNNClassifier etc. Right now we are restricted to use only CPU for training, and we're testing out parameters and levers such as
CPU: 8~32 cpu's
Memory: 16~48 GB
Batch/Buffer size(dataset.batch(n)) : n=128~512
Hash bucket_size: 10,000 ~ 500,000
Number of threads: Tensorflow default, which should be number of logical cores
Optimizer: GradientDescent, FtrlOptimizer
Result: global steps per second * batch_size of around 20~50
So we're getting via Tensorboard global steps per second * bucket_size of around 20~50, and increasing CPU and memory has its limits.
We see similar results regardkess of Optimizer and its configurations.
Are we doing something wrong, and are there other levers we can use? Is there a limit as to how much you can optimize your model training methods, and should we move on to GPU's and take advantage of its matrix multiplication efficiencies?
You can try optimizing your input pipeline with Dataset API. Consider converting your data to tfrecords, it can give substantial improvements. If you have multiple CPUs you can setup a cluster. But it all depends heavily on what data you have. And take a look
https://www.tensorflow.org/guide/performance/datasets
https://www.tensorflow.org/guide/performance/overview

How to fix low volatile GPU-Util with Tensorflow-GPU and Keras?

I have a 4 GPU machine on which I run Tensorflow (GPU) with Keras. Some of my classification problems take several hours to complete.
nvidia-smi returns Volatile GPU-Util which never exceeds 25% on any of my 4 GPUs.
How can I increase GPU Util% and speed up my training?
If your GPU util is below 80%, this is generally the sign of an input pipeline bottleneck. What this means is that the GPU sits idle much of the time, waiting for the CPU to prepare the data:
What you want is the CPU to keep preparing batches while the GPU is training to keep the GPU fed. This is called prefetching:
Great, but if the batch preparation is still way longer than the model training, the GPU will still remain idle, waiting for the CPU to finish the next batch. To make the batch preparation faster we can parallelize the different preprocessing operations:
We can go even further by parallelizing I/O:
Now to implement this in Keras, you need to use the Tensorflow Data API with Tensorflow version >= 1.9.0. Here is an example:
Let's assume, for the sake of this example that you have two numpy arrays x and y. You can use tf.data for any type of data but this is simpler to understand.
def preprocessing(x, y):
# Can only contain TF operations
...
return x, y
dataset = tf.data.Dataset.from_tensor_slices((x, y)) # Creates a dataset object
dataset = dataset.map(preprocessing, num_parallel_calls=64) # parallel preprocessing
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None) # Will automatically prefetch batches
....
model = tf.keras.model(...)
model.fit(x=dataset) # Since tf 1.9.0 you can pass a dataset object
tf.data is very flexible, but as anything in Tensorflow (except eager), it uses a static graph. This can be a pain sometimes but the speed up is worth it.
To go further, you can have a look at the performance guide and the Tensorflow data guide.
I've got similar issue - the memory of all the GPUs were allocated by Keras, but Volatile was around 0% and training was taking almost the same amount of time as on CPU. I was using ImageDataGenerator, which turned out to be a bottleneck. When I increased the number of workers in fit_generator method from default value 1 to all available CPUs, then the training time dropped rapidly.
You can also load the data to the memory and then use flow method to prepare batches with augmented images.

GPU temperature reads 88 C when training a LSTM on tensorflow

I've got a 1 layer LSTM model in tensorflow and the temperature reading of my GPU gets rather high during the training phase. Always varying between 80 C and 90 C. My GPU is a water cooled gtx 1080 "Super-clocked" edition in a 24/7 refrigerated room. The model works, but this temperature worries me. I'd like to know if this is normal and safe.
I'm training the LSTM for a next-word-prediction problem with tokenized reddit comments. I got the idea from different tutorials in wildml.com. Here are some details about it:
Tensorflow 1.2.1, Cuda tk 8.0, Cudnn 6.0, Nvidia Driver 375.66
My training data consists of 200 K reddit comments.
My word dictionary consists of 8000 words, which means 8000 classes of classification for each prediction
I use GLOVE pre-trained 100 Dimensions embeddings of Wikipedia words
I'm not using placeholders to feed my input. It's all done with TFRecordfiles readers, which input the examples to a 100k capacity random shuffle queue
From the random shuffle queue, it goes to a padding FIFO queue, where I generated zero-paddaded mini-batches of 20
The 20 size mini batches go to a tf.dynamic_rnn() with LSTM cell with Hidden dimension of 150
I mask the losses using tf.sign() and minimize the result with Adam optimizer
I've noticed that the temperature rises a lot when I raise the mini-batch size. 1 size mini-batches (single examples), it reads between 72-75 C. With 10 size mini-batches, it immediately goes to 78 C and stays in the range of 78-84 C. With 20 size mini-batches, 84-88 C. With 30 size mini-batches, 87-92 C.
If I raise the hidden dimension to 200, 250, 300, etc, while maintaining the minibatch size fixed, I also get similar temperature raises.
I've also trained the same model, but feeding the data with placeholders only, i.e, not using TFRecord, Queues and mini-batches. It stays around 65 C, but it's obviously far from optimized and ideal to use placeholders for feeding the net.
I really appreciate your help, I'm kinda desperate, to be honest.
-----------------EDIT---------------------
It turns out the water cooler pump was configured on my bios to variate according to the CPU temp...Obviously the GPU temp wouldn't affect it and thats what happened. It was running on 50 % of its capacity. Well, I've ajusted it to stay 100% all the time and now the same model runs with max temp of approx. 83 C. Still not perfect, but a huge improvement. I guess that with the complexity of my model + the really high 1.8 GHz clock of my GPU there's not much I can do.
The maximum design temperature of the GTX 1080 according to nvidia is 94 C. Anything below that and you should be safe.
Maximum GPU Temperature (in C) 94
The fact that the GPU temperature rises when you raise the mini-batch sizes is a good sign, this means that your GPU is working as hard as it can. In fact, if your GPU is not at ~80-90 C, this means that it is not working at full power, and you are losing some performance.

Tensorflow batching is very slow

I tried to setup a very simple Mnist example with an Estimator.
First I used the estimator's deprecated fit() parameters x, y and batch_size. This executed very fast and utilized about 100% of my GPU while not effecting the CPU much (about 10% utilization). So it worked as expected.
Because the x, y and batch_size parameters are deprecated, I wanted to use the input_fn parameter for the fit() function. To build the input_fn, I used a tf.slice_input_producer and batched it with tf.train.batch. This is my code https://gist.github.com/andreas-eberle/11f650fca0dce4c9d3d6c0955145e80d. You should be able to just run it with tensorflow 1.0.
My problem is that the training now runs very slow and only utilizes about 30 % of my GPU (shown in nvidia-smi).
I also tried to increase the queue capacity of the slice_input_producer and to increase the number of threads used for batching. However, this only helped to get to about 45% of GPU utilization and resulted in a 100 % GPU utilization.
What am I doing wrong? Is there a better way for feeding the inputs and batching them? I do not want to create the batches manually (creating subarrays of the numpy input array) because I want to use this example for a more complex input queue where I'll be reading and preprocessing the images in the graph.
I don't think my hardware should be the problem:
List item
Windows 10
NVidia GTX 960M
i7-6700HQ
32 GB RAM