I have a large dataset to run inference on. There are 10 GPUs in my machine, but when I do inference only one GPU is used. The framework I am using is TensorFlow 2.6. I used to use PyTorch, but for various reasons I now have to use TensorFlow, which I am not familiar with.
I want to know how to use all the GPUs during inference while keeping the order of the dataset.
Related
I am currently working on executing inference workloads on a Coral Dev Board with TensorFlow Lite. I am trying to run inference on the CPU, GPU and TPU simultaneously to reduce inference latency.
Could you help me understand how I can execute inference on all the devices simultaneously? I could divide the layers of the network between the CPU and GPU for the training phase, but I am having trouble assigning layers of the network to each device for inference. The code is written in Python with the Keras API in TensorFlow.
Thanks.
As of now, if you compile your CPU TFLite model with the Edge TPU compiler (https://coral.ai/docs/edgetpu/compiler/), the compiler tries to map the operations onto the TPU only (as long as the operations are supported by the TPU).
The Edge TPU compiler cannot partition the model more than once, and as soon as an unsupported operation occurs, that operation and everything after it executes on the CPU, even if supported operations occur later.
So partitioning a single TFLite model into CPU, GPU and TPU is not feasible as of now.
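For concreteness, running a model that has been through the edgetpu_compiler typically looks like the sketch below; the filename is a placeholder, and this follows the tflite_runtime delegate pattern from the Coral docs rather than anything specific to your project:
import tflite_runtime.interpreter as tflite

# Load the compiled model and attach the Edge TPU delegate; any ops the compiler
# could not map (and everything after the first unmapped op) fall back to the CPU.
interpreter = tflite.Interpreter(
    model_path='model_edgetpu.tflite',
    experimental_delegates=[tflite.load_delegate('libedgetpu.so.1')])
interpreter.allocate_tensors()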
Curiously, I just found out that my CPU is much faster for predictions.
Doing inference on the GPU is much slower than on the CPU.
I have a tf.keras (TF2) NN model with a simple dense layer:
import numpy as np
import tensorflow as tf

input = tf.keras.layers.Input(shape=(100,), dtype='float32')
X = tf.keras.layers.Dense(2)(input)
model = tf.keras.Model(input, X)
# the last layer is also initialized with weights from a file
weights = np.load("weights.npy", allow_pickle=True)
model.layers[-1].set_weights(weights)
scores = model.predict_on_batch(data)
For predictions on 100 samples I get:
2 s for GPU
0.07 s for CPU (!)
I am using a simple GeForce MX150 with 2 GB.
I also tried predict_on_batch(x), as someone suggested it is faster than predict, but here it takes the same time.
Refer to: Why does keras model predict slower after compile?
Does anyone have an idea what is going on? What could the issue be?
Using the GPU adds a lot of overhead to load data into GPU memory (through the relatively slow PCI bus) and to get the results back.
In order for the GPU to be more efficient than the CPU, the model must be very big, there must be plenty of data, and the algorithms must run fully inside the GPU without requiring partial results to be moved back to the CPU.
The optimal configuration depends on the amount of memory and the number of cores in your GPU, so you will need to run some tests, but the following rules apply:
Your NN should have at least 10k parameters and your training data set at least 10k records; otherwise the overhead will probably kill the GPU's performance.
When you call model.fit, use a large batch_size (pay attention, the default is only 32), possibly large enough to contain your whole dataset, or at least a multiple of 1024. Do some tests to find the optimum for you (see the sketch after this list).
For some GPUs, it may help to perform computations in float16 instead of float32. Follow this tutorial to see how to activate it.
If your GPU has Tensor Cores, then to use its hardware efficiently several sizes must be multiples of 8. In the preceding tutorial, see the paragraph "Ensuring GPU Tensor Cores are used" for which parameters must be changed and how. In general, it is a bad idea to use layers whose number of neurons is not a multiple of 8.
Some types of layers, namely RNNs, have an architecture which cannot be solved directly by the GPU. In this case, data must be moved constantly back and forth between GPU and CPU and the speed is lost. If an RNN is really needed, TensorFlow 2 has an implementation of the LSTM layer which is optimized for GPU, but there are some limitations on the parameters: see this thread and the documentation.
If you are training a reinforcement learning agent, activate experience replay and use a memory buffer for the experience which is at least 10x your batch_size. This way, you will trigger NN training only when a big batch of data is ready.
Deactivate as much verbosity as possible.
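As a concrete illustration of the batch-size, float16 and verbosity points above, a minimal sketch (build_model, x_train and y_train are placeholders for your own code):
import tensorflow as tf

# Mixed precision must be set BEFORE the model is built; the benefit is GPU-dependent.
tf.keras.mixed_precision.set_global_policy('mixed_float16')

model = build_model()
model.fit(x_train, y_train,
          batch_size=4096,   # far larger than the default of 32; tune to your GPU memory
          epochs=10,
          verbose=0)         # cut verbosity overhead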
If everything is set up correctly, you should be able to train your model faster with GPU than with CPU.
A GPU pays off for compute-intensive tasks (large models), because of the overhead of copying your data and results between the host and the GPU. In your case the model is very small, which means copying the data takes longer than the prediction itself. Even though the CPU is slower than the GPU, you don't have to copy the data, so it ends up being faster.
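If you want to verify this on your own machine, one hedged way is to hide the GPUs before the model is built and time the same call again; model and data refer to the snippet from the question above:
import time
import tensorflow as tf

# Hide the GPUs *before* building the model, so both the variables and the
# compute stay on the CPU, then time the same prediction call.
tf.config.set_visible_devices([], 'GPU')
# ... rebuild the model exactly as in the question ...
start = time.time()
scores = model.predict_on_batch(data)
print('elapsed seconds:', time.time() - start)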
Let's begin with the premise that I'm new to TensorFlow and to deep learning in general.
I have a TF 2.0 Keras-style model trained using tf.Model.train(), two available GPUs, and I'm looking to scale down inference times.
I trained the model distributed across the GPUs using the extremely handy tf.distribute.MirroredStrategy().scope() context manager:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model.compile(...)
    model.train(...)
Both GPUs get effectively used (even if I'm not quite happy with the resulting accuracy).
I can't seem to find a similar strategy for distributing inference between GPUs with the tf.Model.predict() method: when I run model.predict() I get (obviously) usage from only one of the two GPUs.
Is it possible to instantiate the same model on both GPUs and feed them different chunks of data in parallel?
There are posts that suggest how to do it in TF 1.x, but I can't seem to replicate the results in TF 2.0:
https://medium.com/#sbp3624/tensorflow-multi-gpu-for-inferencing-test-time-58e952a2ed95
Tensorflow: simultaneous prediction on GPU and CPU
My mental struggles with the question are mainly:
TF 1.x is tf.Session()-based, while sessions are implicit in TF 2.0; if I understand correctly, the solutions I have read use separate sessions for each GPU, and I don't really know how to replicate that in TF 2.0.
I don't know how to use the model.predict() method with a specific session.
I know that the question is probably not well-formulated but I summarize it as:
Does anybody have a clue how to run Keras-style model.predict() on multiple GPUs (inferring on a different batch of data on each GPU in parallel) in TF 2.0?
Thanks in advance for any help.
Try loading the model inside a tf.distribute.MirroredStrategy scope and use a larger batch_size:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.models.load_model(saved_model_path)

result = model.predict(data, batch_size=greater_batch_size)  # data: your input array or tf.data.Dataset
There still does not seem to be an official example for distributed inference. There is a potential solution using tf.distribute.MirroredStrategy here: https://github.com/tensorflow/tensorflow/issues/37686. However, it does not seem to fully utilize multiple GPUs.
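For reference, a rough sketch of what that issue-based approach looks like in TF 2.x is below; saved_model_path and dataset are placeholders, and the order-preservation claim should be verified on a small dataset before relying on it:
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.models.load_model(saved_model_path)

@tf.function
def predict_step(inputs):
    # plain forward pass on each replica's shard of the batch
    return model(inputs, training=False)

dist_dataset = strategy.experimental_distribute_dataset(dataset)
outputs = []
for batch in dist_dataset:
    per_replica = strategy.run(predict_step, args=(batch,))
    # experimental_local_results returns one tensor per GPU; concatenating them
    # in replica order should restore the original batch order under the default
    # batch-splitting behavior.
    outputs.append(tf.concat(strategy.experimental_local_results(per_replica), axis=0))
predictions = tf.concat(outputs, axis=0)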
TensorFlow uses different PRNGs for efficiency, but I want reproducible/deterministic results across TensorFlow devices. Is there any way I can do that with TensorFlow? Also, which PRNG functions does TensorFlow use when it runs on the CPU and on the GPU?
Edit: Setting the random seed with tf.set_random_seed(1) will give you deterministic results if you train/infer on the same device, but it will not give you the same results if you load the model and continue training on a different device.
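For what it's worth, the TF2 counterparts of these seeding APIs look like the sketch below; whether the resulting streams match bit-for-bit between CPU and GPU is something to verify on your own hardware rather than assume:
import tensorflow as tf

# TF2 replacement for tf.set_random_seed: global op-level seed
tf.random.set_seed(1)

# Stateless RNGs take the seed explicitly, so repeated calls with the same seed
# return the same values within a given setup.
noise = tf.random.stateless_uniform(shape=[3], seed=[1, 2])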
I have been training my TensorFlow retraining algorithm using a single GTX Titan and it works just fine, but when I try to use multiple GPUs in the flowers retraining example it does not work, and nvidia-smi shows that only one GPU is being utilized.
Why is this happening, given that multiple GPUs do work when retraining an Inception model from scratch, but not during retraining?
TensorFlow's flower retraining example does not work with multiple GPUs at all, even if you set --num_gpus > 1. It should support a single GPU as you noted.
The model needs to be modified to utilize multiple GPUs in parallel. Unfortunately, a single TensorFlow operation like the flower retraining example can't automatically be split over multiple GPUs at this time.
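For context, the kind of modification the answer is pointing at is the classic multi-tower pattern, sketched very roughly below in TF1-style graph code; num_gpus, input_splits and build_tower are placeholders, and the real retrain.py would need considerably more work:
import tensorflow as tf

tower_logits = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        with tf.name_scope('tower_%d' % i):
            # each GPU builds its own copy of the network on its slice of the batch
            tower_logits.append(build_tower(input_splits[i]))
# the per-tower outputs (or gradients) are then combined on the CPU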