How to do parallel GPU inferencing in Tensorflow 2.0 + Keras? - tensorflow

Let me start by saying that I'm new to TensorFlow and deep learning in general.
I have a TF 2.0 Keras-style model trained using tf.Model.train(), two available GPUs, and I'm looking to reduce inference times.
I trained the model with distribution across the GPUs using the extremely handy tf.distribute.MirroredStrategy().scope() context manager:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model.compile(...)
    model.train(...)
Both GPUs get used effectively (even if I'm not quite happy with the resulting accuracy).
I can't seem to find a similar strategy for distributing inference across the GPUs with the tf.Model.predict() method: when I run model.predict() I get (obviously) usage from only one of the two GPUs.
Is it possible to instantiate the same model on both GPUs and feed them different chunks of data in parallel?
There are posts that suggest how to do it in TF 1.x, but I can't seem to replicate the results in TF 2.0:
https://medium.com/#sbp3624/tensorflow-multi-gpu-for-inferencing-test-time-58e952a2ed95
Tensorflow: simultaneous prediction on GPU and CPU
My mental struggles with the question are mainly:
TF 1.x is tf.Session()-based while sessions are implicit in TF 2.0; if I understand correctly, the solutions I have read use a separate session for each GPU, and I don't really know how to replicate that in TF 2.0.
I don't know how to use the model.predict() method with a specific session.
I know that the question is probably not well formulated, but I can summarize it as:
Does anybody have a clue on how to run Keras-style model.predict() on multiple GPUs (inferencing on a different batch of data on each GPU in parallel) in TF 2.0?
Thanks in advance for any help.

Try loading the model inside a tf.distribute.MirroredStrategy scope and using a larger batch_size:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.models.load_model(saved_model_path)
    # input_data stands in for your own inference data
    result = model.predict(input_data, batch_size=greater_batch_size)

There still does not seem to be an official example for distributed inference. There is a potential solution using tf.distribute.MirroredStrategy here: https://github.com/tensorflow/tensorflow/issues/37686. However, it does not seem to fully utilize multiple GPUs.
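For reference, here is a minimal sketch along the lines of that issue, using strategy.run on a distributed dataset so that each GPU processes its own shard of every batch. The names saved_model_path, input_data and global_batch_size are placeholders for your own values.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.models.load_model(saved_model_path)

# Let the strategy split each global batch across the available GPUs.
dataset = tf.data.Dataset.from_tensor_slices(input_data).batch(global_batch_size)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def predict_step(batch):
    return model(batch, training=False)

results = []
for batch in dist_dataset:
    # strategy.run executes predict_step once per replica (GPU), each on its own shard.
    per_replica = strategy.run(predict_step, args=(batch,))
    # Collect the per-replica outputs back on the host.
    results.append(tf.concat(strategy.experimental_local_results(per_replica), axis=0))

predictions = tf.concat(results, axis=0)

Whether a plain model.predict() call spreads work across both GPUs appears to depend on the TF version, which is why the explicit strategy.run loop above can be worth trying.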

Related

Data augmentation on GPU

tf.data augmentations are executed only on the CPU, and I need a way to run certain augmentations on the TPU for an audio project.
For example,
CPU: TFRecord read -> audio crop -> noise addition.
TPU: spectrogram -> MixUp augmentation.
Most augmentations can be done as a Keras layer on top of the model, but MixUp requires changes to both the input and the label.
Is there a way to do this using tf.keras APIs?
And if there is any way to transfer part of the tf.data pipeline to run on the TPU, that would also be helpful.
As you have rightly mentioned, and as per the TensorFlow documentation, the preprocessing in tf.data is done on the CPU only.
However, you can work around this and preprocess using the TPU/GPU by applying the transformation function directly in your model, with something like the code below.
inputs = tf.keras.layers.Input((512, 512, 3))
x = tf.keras.layers.Lambda(transform)(inputs)  # transform is your augmentation function
You can follow this Kaggle post for a detailed discussion of this topic.
See the TensorFlow guide that discusses preprocessing data before the model or inside the model. Including preprocessing inside the model leverages the GPU instead of the CPU, makes the model portable, and helps reduce training/serving skew. The guide also has multiple recipes to get you started. It doesn't explicitly state that this works on a TPU, but it can be tried.
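As a rough sketch of that idea for the audio case: the spectrogram can be computed inside the model (here via tf.signal.stft wrapped in a Lambda layer) so that it runs on the accelerator together with the rest of the graph. The input length (16000 samples) and the small head below are arbitrary placeholders; MixUp would still need separate handling, since it also modifies the labels.

import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(16000,))  # raw audio, e.g. 1 s at 16 kHz (assumed)
x = tf.keras.layers.Lambda(
    lambda a: tf.abs(tf.signal.stft(a, frame_length=256, frame_step=128))
)(inputs)  # spectrogram computed on-device
x = tf.keras.layers.Reshape((-1, 129, 1))(x)  # 129 = 256 // 2 + 1 frequency bins
x = tf.keras.layers.Conv2D(16, 3, activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)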

Tensorflow Serving Performance Very Slow vs Direct Inference

I am running in the following scenario:
Single Node Kubernetes Cluster (1x i7-8700K, 1x RTX 2070, 32GB RAM)
1 Tensorflow Serving Pod
4 Inference Client Pods
The inference clients get images from 4 separate cameras (1 each) and pass them to TF-Serving for inference, in order to understand what is seen on the video feeds.
I was previously doing inference inside each Inference Client Pod individually by calling TensorFlow directly, but that wasn't good for the graphics card's RAM. TensorFlow Serving was introduced to the mix quite recently to optimize RAM, since we no longer load duplicated models onto the graphics card.
The performance is not looking good; for 1080p images it looks like this:
Direct TF: 20ms for input tensor creation, 70ms for inference.
TF-Serving: 80ms for GRPC serialization, 700-800ms for inference.
The TF-Serving pod is the only one that has access to the GPU and it is bound exclusively. Everything else operates on CPU.
Are there any performance tweaks I could do?
The model I'm running is Faster R-CNN Inception V2 from the TF Model Zoo.
Many thanks in advance!
This is from TF Serving documentation:
Please note, while the average latency of performing inference with TensorFlow Serving is usually not lower than using TensorFlow directly, where TensorFlow Serving shines is keeping the tail latency down for many clients querying many different models, all while efficiently utilizing the underlying hardware to maximize throughput.
From my own experience, I've found TF Serving to be useful in providing an abstraction over model serving which is consistent, and does not require implementing custom serving functionalities. Model versioning and multi-model which come out-of-the-box save you lots of time and are great additions.
Additionally, I would recommend batching your requests if you haven't already, and playing around with the TENSORFLOW_INTER_OP_PARALLELISM, TENSORFLOW_INTRA_OP_PARALLELISM, and OMP_NUM_THREADS arguments to TF Serving. Here is an explanation of what they are.
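For the batching suggestion, here is a sketch of what a batched gRPC call could look like; the model name "detector" and input name "input_tensor" are placeholders for whatever your SavedModel's serving signature actually exposes.

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Send all 4 camera frames in one request so TF Serving can run them in a single GPU pass.
channel = grpc.insecure_channel("tf-serving:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

frames = np.zeros((4, 1080, 1920, 3), dtype=np.uint8)  # one frame per camera (placeholder data)

request = predict_pb2.PredictRequest()
request.model_spec.name = "detector"
request.model_spec.signature_name = "serving_default"
request.inputs["input_tensor"].CopyFrom(tf.make_tensor_proto(frames))

result = stub.Predict(request, timeout=10.0)

Server-side batching (the --enable_batching flag plus a batching parameters file) can additionally merge requests coming from the four client pods.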
Maybe you could try OpenVINO? It's a heavily optimized toolkit for inference. You could utilize your i7-8700K and run some frames in parallel. Here are some performance benchmarks for the very similar i7-8700T.
There is even OpenVINO Model Server which is very similar to Tensorflow Serving.
Disclaimer: I work on OpenVINO.

How to get the exact GPU memory usage for Keras

I recently started learning Keras and TensorFlow. I am currently testing out a few models on the MNIST dataset (pretty basic stuff). I wanted to know exactly how much memory my model consumes during training and inference. I tried googling but did not find much info.
I came across nvidia-smi. I tried using the config.gpu_options.allow_growth = True option, but I still can't see exactly how much memory python.exe is consuming, due to some issues with nvidia-smi. I know that I could run separate passes of training and inference, but this is too cumbersome. It would be very easy if I could just find the right API to do the job.
TensorFlow being such a well-known and widely used library, I am hoping to find a better and faster way to get these numbers.
Finally, once again my question is:
How do I get the exact memory usage of a Keras model during training and inference?
Relevant specs:
OS: Windows 10
GPU: GTX 1050
TensorFlow version: 1.14
Please let me know if any other details are required.
Thanks!
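One option, as a sketch: newer TF 2.x releases (2.5+) expose per-device memory counters directly through tf.config.experimental.get_memory_info; on TF 1.14 as used here, the closest equivalent was tf.contrib.memory_stats, which has to be evaluated inside a session.

import tensorflow as tf

# TF 2.5+: current and peak GPU memory allocated by TensorFlow on the first GPU.
info = tf.config.experimental.get_memory_info("GPU:0")
print(f"current: {info['current'] / 1e6:.1f} MB, peak: {info['peak'] / 1e6:.1f} MB")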

Bfloat16 training in GPUs

Hi, I am trying to train a model using the new bfloat16 datatype for variables. I know this is supported on Google TPUs. I was wondering if anyone has tried training on GPUs (for example, a GTX 1080 Ti). Is that even possible? Do the GPU tensor cores support it? If anyone has any experience, please share your thoughts.
Many thanks!
I posted this question to the TensorFlow GitHub community. Here is their response so far:
"
bfloat16 support isn't complete for GPUs, as it's not supported natively by the devices.
For performance you'll want to use float32 or float16 for GPU execution (though float16 can be difficult to train models with). TPUs support bfloat16 for effectively all operations (but you currently have to migrate your model to work on the TPU).
"

Running Keras Sequential model in tf session

Given a Keras Sequential model (specifically a 2-layer LSTM): how do we run it in a tf session?
I have to train the model multiple times in a single script and I run out of memory pretty fast. Is running it in a tf session the right solution? If not, what is?
In Keras, model.fit() or model.fit_generator() are usually used for model training. An example can be found here. If you don't need to stick with a tf session, a pure Keras implementation is also a good choice.
For the out-of-memory issue, there may be multiple potential causes. Maybe you can check whether your dataset is too large? Without further info about your code, it's hard to say.
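If the memory growth comes from building a new model on every training run in the same script (an assumption about your setup), clearing the Keras/TF graph state between runs often helps more than managing sessions by hand. A sketch:

import tensorflow as tf

for run in range(5):
    # Drop the previous graph, layers and optimizer state so memory doesn't accumulate.
    tf.keras.backend.clear_session()

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(100, 8)),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    # model.fit(x_train, y_train, epochs=...)  # train as usual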