Execution of Inference Workloads on Coral Dev Board on CPU, GPU and TPU simultaneously - tensorflow

I am currently working on executing inference workloads on the Coral Dev Board with TensorFlow Lite. I am trying to run inference on the CPU, GPU and TPU simultaneously to reduce inference latency.
Could you help me understand how I can execute inference on all the devices simultaneously? I could divide the layers of the network between the CPU and GPU for the training phase, but I am having trouble assigning layers of the network to each device for inference. The code is written in Python with the Keras API in TensorFlow.
Thanks.

As of now, if you compile your CPU TFLite model with the Edge TPU compiler (https://coral.ai/docs/edgetpu/compiler/), the compiler tries to map the operations onto the TPU only (as long as the operations are supported by the TPU).
The Edge TPU compiler cannot partition the model more than once: as soon as an unsupported operation occurs, that operation and everything after it executes on the CPU, even if supported operations occur later.
So partitioning a single TFLite model across CPU, GPU and TPU is not feasible as of now.
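For reference, the usual workflow is roughly the sketch below: compile the quantized TFLite model offline with edgetpu_compiler, then load it with the Edge TPU delegate; any ops the compiler could not map simply run on the CPU inside the same interpreter. The model path and input here are placeholders.

# Compile offline on the host: `edgetpu_compiler model.tflite` -> model_edgetpu.tflite
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

# Load the compiled model with the Edge TPU delegate; unmapped ops fall back to the CPU.
interpreter = Interpreter(
    model_path="model_edgetpu.tflite",  # placeholder path
    experimental_delegates=[load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# One forward pass on dummy data of the expected shape/dtype.
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
result = interpreter.get_tensor(out["index"])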

Related

How to do inference with TensorFlow 2 with multiple GPUs

I have a large dataset to run inference on. There are 10 GPUs in my machine, but when I do inference only one GPU works. The framework I use is TensorFlow 2.6. I used to use PyTorch, but now I have to use TensorFlow, which I am not familiar with, for some reasons.
I want to know how to use all the GPUs and keep the order of the dataset at the same time during the inference process.
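One rough sketch (not a definitive answer, and assuming each GPU can hold its own replica of the model): split the dataset into contiguous shards, predict each shard on its own device, and concatenate the shard outputs in order. build_model() and data below are placeholders for your own model and dataset.

import numpy as np
import tensorflow as tf
from concurrent.futures import ThreadPoolExecutor

def build_model():
    return tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder model

data = np.random.rand(10000, 32).astype("float32")  # placeholder dataset
gpus = tf.config.list_logical_devices("GPU")         # assumes at least one GPU is visible

# One model replica per GPU (they could all load the same checkpoint).
replicas = []
for gpu in gpus:
    with tf.device(gpu.name):
        replicas.append(build_model())

# Contiguous shards: concatenating their outputs in shard order preserves the input order.
shards = np.array_split(data, len(gpus))

def predict_on(i):
    with tf.device(gpus[i].name):
        return replicas[i].predict(shards[i], verbose=0)

with ThreadPoolExecutor(max_workers=len(gpus)) as pool:
    outputs = list(pool.map(predict_on, range(len(replicas))))  # map preserves order

result = np.concatenate(outputs, axis=0)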

Does BytesInUse in TensorFlow get all the GPU memory used on one GPU device, or only the GPU memory used by the model where BytesInUse is?

If there are several TensorFlow models running on gpu0 and I run 'tf.contrib.memory_stats.BytesInUse()' in model1, will the result be the GPU memory used by all the models, or only the memory used by model1?
And if I already have a TensorFlow model running on the GPU, how can I get the amount of GPU memory used by that model with BytesInUse from another Python script?
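For context, this is roughly how BytesInUse is invoked in TF 1.x; the small graph below is just a stand-in for "model1".

import tensorflow as tf  # TF 1.x, where tf.contrib is available

# Stand-in graph for "model1".
a = tf.random_normal([1000, 1000])
b = tf.matmul(a, a)

# BytesInUse is itself an op and reports allocator usage for the device it is placed on.
bytes_in_use = tf.contrib.memory_stats.BytesInUse()

with tf.Session() as sess:
    sess.run(b)
    print(sess.run(bytes_in_use))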

Can a device running TensorFlow Lite be used as a worker task when performing distributed training?

Can a device running TensorFlow Lite be used as a worker task or parameter server when performing distributed training?
At this point TensorFlow Lite performs only the forward pass (aka inference), not back-propagation (BP), so it doesn't fit into the training pattern (many iterations of forward pass and BP).
Plus, TensorFlow Lite is designed to be small and fast on resource-constrained devices, so it does not make much sense to try to use it for training.
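To make that concrete, the TFLite Python API only exposes a forward pass, roughly as in this minimal sketch (the one-layer model is just a placeholder):

import numpy as np
import tensorflow as tf

# Convert a (placeholder) Keras model to TFLite.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=np.float32))
interpreter.invoke()  # forward pass only; no gradient or weight-update API here
print(interpreter.get_tensor(out["index"]))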

Is it possible to train an H2O model with a GPU and predict with a CPU?

For training speed, it would be nice to be able to train an H2O model with GPUs, take the model file, and then predict on a machine without GPUs.
It seems like that should be possible in theory, but with H2O release 3.13.0.341 that doesn't seem to happen, except for the XGBoost model.
When I run gpustat -cup I can see the GPUs kick in when I train H2O's XGBoost model. This doesn't happen with DL, DRF, GLM, or GBM.
I wouldn't be surprised if a difference in floating-point precision (16, 32, 64 bit) could cause some inconsistency, not to mention the vagaries of multiprocessor modeling, but I think I could live with that.
(This is related to my question here, but now that I understand the environment better I can see that the GPUs aren't used all the time.)
How can I tell if H2O 3.11.0.266 is running with GPUs?
The new XGBoost integration in H2O is the only GPU-capable algorithm in H2O (proper) at this time. So you can train an XGBoost model on GPUs and score on CPUs, but that's not true for the other H2O algorithms.
There is also the H2O Deep Water project, which provides integration between H2O and three third-party deep learning backends (MXNet, Caffe and TensorFlow), all of which are GPU-capable. So you can train those models using a GPU and score on a CPU as well. You can download the H2O Deep Water jar file (or R package, or Python module) at the Deep Water link above, and you can find out more info in the Deep Water GitHub repo README.
Yes, you do the heavy job of training on a GPU, save the weights, and then your CPU only does the matrix multiplications for predictions.
In Keras you can train your model and save the neural network weights:
model.save_weights('your_model_weights.h5')   # after training, on the GPU machine
model.load_weights('your_model_weights.h5')   # on the CPU-only machine, into the same architecture
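On the CPU-only machine the flow is then roughly as below; build_model() and x_test are placeholders, and the same architecture has to be recreated before loading the weights.

model = build_model()                # placeholder: same architecture as on the GPU machine
model.load_weights('your_model_weights.h5')
predictions = model.predict(x_test)  # x_test is a placeholder for your input data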

Tensorflow 0.6 GPU Issue

I am using an Nvidia DIGITS box with a GPU (Nvidia GeForce GTX Titan X) and TensorFlow 0.6 to train the neural network, and everything works. However, when I check the volatile GPU utilization using nvidia-smi -l 1, I notice that it's only 6%, and I think most of the computation is on the CPU, since the process running TensorFlow has about 90% CPU usage. As a result, the training process is very slow. I wonder if there are ways to make full use of the GPU instead of the CPU to speed up the training process. Thanks!
I suspect you have a bottleneck somewhere (like in this github issue) -- you have some operation which doesn't have a GPU implementation, so it's placed on the CPU, and the GPU is idling because of data transfers. For instance, until recently reduce_mean was not implemented on GPU, and before that Rank was not implemented on GPU, and it was implicitly being used by many ops.
At one point, I saw a network from fully_connected_preloaded.py being slow because there was a Rank op that got placed on CPU, and hence triggering the transfer of entire dataset from GPU to CPU at each step.
To solve this I would first recommend upgrading to 0.8 since it had a few more ops implemented for GPU (reduce_prod for integer inputs, reduce_mean and others).
Then you can create your session with log_device_placement=True and see if there are any ops placed on CPU or GPU that would cause excessive transfers per step.
There are often ops in the input pipeline (such as parse_example) which don't have GPU implementations. I find it helpful sometimes to pin the whole input pipeline to the CPU using a with tf.device("/cpu:0"): block.
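As a rough sketch in the TF 0.x/1.x style the question uses (the input file name is a placeholder):

import tensorflow as tf

# Log every op's placement so CPU-placed ops (and the transfers they trigger) stand out.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

# Pin the input pipeline to the CPU so only the actual math competes for the GPU.
with tf.device("/cpu:0"):
    filename_queue = tf.train.string_input_producer(["data.tfrecords"])  # placeholder file
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)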