Bfloat16 training on GPUs - TensorFlow

Hi, I am trying to train a model using the new bfloat16 data type for variables. I know this is supported on Google TPUs. I was wondering if anyone has tried training with it on GPUs (for example, a GTX 1080 Ti). Is that even possible, and do GPU tensor cores support it? If anyone has any experience, please share your thoughts.
Many thanks!

I posted this question in the TensorFlow GitHub community. Here is their response so far:
"
bfloat16 support isn't complete for GPUs, as it's not supported natively by the devices.
For performance you'll want to use float32 or float16 for GPU execution (though float16 can be difficult to train models with). TPUs support bfloat16 for effectively all operations (but you currently have to migrate your model to work on the TPU).
"

Related

Does xgboost use the TPU for `gpu-hist` (if TPU is available)?

I am curious whether xgboost will use a TPU in Google Colab if one is available. It certainly makes very good use of the GPU...
Thanks for your question. TPUs are typically great at neural-network-based models. I have not heard of a way to do tree-based computations on TPUs, and I highly doubt that TPUs would be performant at that today as things stand.

How to get the exact GPU memory usage for Keras

I recently started learning Keras and TensorFlow. I am currently testing a few models on the MNIST dataset (pretty basic stuff). I wanted to know exactly how much memory my model consumes during training and inference. I tried googling but did not find much info.
I came across nvidia-smi. I tried using the config.gpu_options.allow_growth = True option, but I am still not able to see the exact memory python.exe is consuming due to some issues with nvidia-smi. I know that I could run separate passes of training and inference, but this is too cumbersome. It would be very easy if I could just find the right API to do the job.
TensorFlow being such a well-known and well-used library, I am hoping to find a better and faster way to get these numbers.
Finally, once again, my question is:
How do I get the exact memory usage of a Keras model during training and inference?
Relevant specs:
OS: Windows 10
GPU: GTX 1050
TensorFlow version: 1.14
Please let me know if any other details are required.
Thanks!
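Not a full answer, but one thing worth trying on TF 1.x is the tf.contrib.memory_stats ops, which report what TensorFlow's own GPU allocator is using (tf.contrib was removed in TF 2.x, so this sketch only applies to setups like the TF 1.14 one above):

import tensorflow as tf

# Build ops that report TensorFlow's GPU allocator usage (TF 1.x only).
with tf.device("/gpu:0"):
    current_bytes = tf.contrib.memory_stats.BytesInUse()
    peak_bytes = tf.contrib.memory_stats.MaxBytesInUse()

sess = tf.keras.backend.get_session()  # reuse the session Keras is training in
# ... after (or during) training / inference ...
print("Current GPU bytes in use:", sess.run(current_bytes))
print("Peak GPU bytes in use:", sess.run(peak_bytes))

Note this measures TensorFlow's allocator rather than the whole python.exe process, which is usually what you want anyway, since TF reserves GPU memory up front unless allow_growth is set.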

How to do parallel GPU inferencing in Tensorflow 2.0 + Keras?

Let's begin with the premise that I'm new to TensorFlow and deep learning in general.
I have a TF 2.0 Keras-style model trained using tf.keras.Model.fit(), two available GPUs, and I'm looking to reduce inference times.
I trained the model distributed across the GPUs using the extremely handy tf.distribute.MirroredStrategy().scope() context manager:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model.compile(...)
    model.fit(...)
Both GPUs get used effectively (even if I'm not quite happy with the resulting accuracy).
I can't seem to find a similar strategy for distributing inference between GPUs with the tf.keras.Model.predict() method: when I run model.predict() I get (obviously) usage from only one of the two GPUs.
Is it possible to instantiate the same model on both GPUs and feed them different chunks of data in parallel?
There are posts that suggest how to do it in TF 1.x, but I can't seem to replicate the results in TF 2.0:
https://medium.com/#sbp3624/tensorflow-multi-gpu-for-inferencing-test-time-58e952a2ed95
Tensorflow: simultaneous prediction on GPU and CPU
My mental struggles with the question are mainly:
TF 1.x is tf.Session()-based, while sessions are implicit in TF 2.0. If I get it correctly, the solutions I read use separate sessions for each GPU, and I don't really know how to replicate that in TF 2.0.
I don't know how to use the model.predict() method with a specific session.
I know that the question is probably not well formulated, but I summarize it as:
Does anybody have a clue how to run Keras-style model.predict() on multiple GPUs (running inference on a different batch of data on each GPU, in parallel) in TF 2.0?
Thanks in advance for any help.
Try loading the model inside a tf.distribute.MirroredStrategy scope and using a larger batch_size:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.models.load_model(saved_model_path)
    result = model.predict(input_data, batch_size=greater_batch_size)  # input_data: your inference data
There still does not seem to be an official example for distributed inference. There is a potential solution using tf.distribute.MirroredStrategy here: https://github.com/tensorflow/tensorflow/issues/37686. However, it does not seem to fully utilize multiple GPUs.
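If MirroredStrategy's predict() doesn't spread the load in your TF 2.0 setup, one workaround is to load a separate copy of the model on each GPU and run the halves of the data in parallel threads. A rough sketch under that assumption (the model path, input shape, and batch size below are placeholders):

import threading
import numpy as np
import tensorflow as tf

saved_model_path = "my_saved_model"                            # placeholder path
data = np.random.rand(2048, 224, 224, 3).astype("float32")     # placeholder inputs

devices = ["/GPU:0", "/GPU:1"]
chunks = np.array_split(data, len(devices))
results = [None] * len(devices)

def predict_on_device(i, device, chunk):
    # One full copy of the model per GPU; each thread handles its own chunk.
    with tf.device(device):
        model = tf.keras.models.load_model(saved_model_path)
        results[i] = model.predict(chunk, batch_size=64)

threads = [threading.Thread(target=predict_on_device, args=(i, d, c))
           for i, (d, c) in enumerate(zip(devices, chunks))]
for t in threads:
    t.start()
for t in threads:
    t.join()

predictions = np.concatenate(results)

The cost is holding two copies of the weights in GPU memory; the heavy TensorFlow ops release the GIL, so the two threads can genuinely overlap.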

What's the impact of using a GPU in the performance of serving a TensorFlow model?

I trained a neural network using a GPU (1080 Ti). The training speed on the GPU is far better than on the CPU.
Currently, I want to serve this model using TensorFlow Serving. I am just interested to know whether using a GPU in the serving process has the same impact on performance.
Since training operates on batches but inference (serving) handles asynchronous requests, do you suggest using a GPU when serving a model with TensorFlow Serving?
You still need to do a lot of tensor operations on the graph to predict something, so the GPU still provides a performance improvement for inference. Take a look at this NVIDIA paper; they have not tested their stuff on TF, but it is still relevant:
"Our results show that GPUs provide state-of-the-art inference performance and energy efficiency, making them the platform of choice for anyone wanting to deploy a trained neural network in the field. In particular, the Titan X delivers between 5.3 and 6.7 times higher performance than the 16-core Xeon E5 CPU while achieving 3.6 to 4.4 times higher energy efficiency."
The short answer is yes, you'll get roughly the same speedup for running on the GPU after training, with a few minor qualifications.
In training you're running two passes over the data (forward and backward), all of which happens on the GPU; during feed-forward inference you're doing less work per sample, so more time is spent transferring data to GPU memory relative to computation than in training. This is probably a minor difference, though, and you can now asynchronously load the GPU if that's an issue (https://github.com/tensorflow/tensorflow/issues/7679).
Whether you'll actually need a GPU to do inference depends on your workload. If your workload isn't overly demanding, you might get away with using the CPU anyway; after all, the computational workload is less than half per sample. So consider the number of requests per second you'll need to serve, and test whether your CPU gets overloaded trying to achieve that. If it does, time to get the GPU out!
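If you want a rough sense of whether your CPU alone can keep up with the expected request rate before committing to a GPU for serving, you can time batched predictions on each device. A sketch, assuming a TF 2.x Keras model (the model path, input shape, and iteration count are placeholders):

import time
import numpy as np
import tensorflow as tf

batch = np.random.rand(32, 224, 224, 3).astype("float32")      # placeholder input shape

for device in ["/CPU:0", "/GPU:0"]:
    with tf.device(device):
        model = tf.keras.models.load_model("my_saved_model")   # placeholder path
        model.predict(batch)                                    # warm-up run
        start = time.time()
        for _ in range(100):
            model.predict(batch)
        elapsed = time.time() - start
    print(device, round(100 * len(batch) / elapsed, 1), "samples/sec")

Note that device placement here is best-effort; for a strict CPU-only number it's safer to hide the GPU entirely (e.g. run the script with CUDA_VISIBLE_DEVICES=-1) and compare the two runs.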

Strategies for improving performance when using TensorFlow with C++?

I'm fairly new to TensorFlow and ML in general and am wondering what strategies I can use to increase the performance of an application I am building.
My app is using the TensorFlow C++ interface, with a source-compiled TF 0.11 libtensorflow_cc.so (built with bazel build -c opt --copt=-mavx, optionally adding --config=cuda) for either AVX or AVX + CUDA, on Mac OS X 10.12.1, on a MacBook Pro 2.8 GHz Intel Core i7 (2 cores, 8 threads) with 16 GB RAM and an NVIDIA 750M with 2 GB VRAM.
My application uses the Inception V3 model and pulls feature vectors from the pool_3 layer. I'm decoding video frames via native APIs and passing them, in memory buffers, to the C++ interface for TF and running them through a session.
I'm not currently batching, but I am caching my session and re-using it for each individual decoded frame / tensor submission. I've noticed that both CPU and GPU performance is about the same, taking about 40 to 50 seconds to process 222 frames, which seems very slow to me. I've confirmed CUDA is being invoked and loaded, and the GPU is functioning (or appears to be).
Some questions:
In general, what should I expect as a reasonable per-frame time for TF running Inception on a consumer laptop?
How much of a difference does batching make for these operations? For tensors of 1x299x299x3, I imagine I am spending more time waiting on PCIe transfers than on meaningful work from the GPU?
If so, is there a good example of batching under C++ for Inception V3?
Are there operations that cause additional CPU->GPU synchronization that might otherwise be avoided?
Is there a way to ensure my sessions / graphs share resources? Can I use nested scopes somehow in this manner? I couldn't quite get that to work but likely missed something.
Any good documentation of general strategies for things to do / avoid?
My code is below:
https://github.com/Synopsis/Synopsis/blob/TensorFlow/Synopsis/TensorFlowAnalyzer/TensorFlowAnalyzer.mm
Thank you very much
For reference, OpenCV analysis using perceptual hash, histogram, dense optical flow, sparse optical flow for point tracking, and simple saliency detection takes 4 to 5 seconds for the same 222 frames using CPU or CPU + OpenCL.
https://github.com/Synopsis/Synopsis/tree/TensorFlow/Synopsis/StandardAnalyzer
Answering your last question first: if there's documentation about performance optimization, yes:
The TensorFlow Performance Guide
The TensorFlow GPU profiling hints
Laptop performance is highly variable, and TF isn't particularly optimized for laptop GPUs. The numbers you're getting (222 frames in 40-50 seconds, roughly 5 fps) don't seem crazy on a laptop platform, using the 2016 version of TensorFlow, with Inception. With some of the performance improvements outlined in the performance guide above, that could probably be doubled in late 2017.
For batching, yes - the newer example Inception model code allows a variable batch size at inference time. This is mostly about whether the model itself was defined to handle a batch size, which is something that has improved since 2016.
Batching for inference will make a pretty big difference on GPU. Whether it helps on CPU depends a lot -- for example, if you build with MKL-DNN support, batching should be considered mandatory, but basic TensorFlow may not benefit as much.
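To illustrate the batching point in Python (the question's pipeline is C++, so this is only a sketch of the idea using the Keras InceptionV3 rather than the exact pool_3 graph from the question): stacking frames into one Nx299x299x3 tensor replaces hundreds of tiny transfers and kernel launches with a few large ones.

import numpy as np
import tensorflow as tf

model = tf.keras.applications.InceptionV3(weights=None)          # random weights, just for the sketch
frames = np.random.rand(222, 299, 299, 3).astype("float32")      # stand-in for decoded video frames

# Unbatched: one 1x299x299x3 call per frame, as described in the question.
for frame in frames:
    model.predict(frame[np.newaxis, ...])

# Batched: a handful of Nx299x299x3 calls; on GPU this amortizes transfer and launch overhead.
model.predict(frames, batch_size=32)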