Distributed training in TensorFlow using multiple GPUs in Google Colab

I have recently become interested in incorporating distributed training into my TensorFlow projects. I am using Google Colab and Python 3 to implement a neural network with custom distributed training loops, as described in this guide:
https://www.tensorflow.org/tutorials/distribute/training_loops
In that guide, under the section 'Create a strategy to distribute the variables and the graph', there is a picture of some code that sets up a MirroredStrategy and then prints the number of model replicas it created; see below.
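The code shown there boils down to something like this (a sketch based on the description above; the exact snippet in the guide may differ):

import tensorflow as tf

# Mirror the model's variables across all GPUs visible to the runtime.
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))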
Console output
From what I can understand, the output indicates that the MirroredStrategy has only created one replica of the model, and therefore only one GPU will be used to train the model. My question: is Google Colab limited to training on a single GPU?
I have tried calling MirroredStrategy() both with and without GPU acceleration, but I get only one model replica every time. This is a bit surprising, because when I use the multiprocessing package in Python I get four threads, so I expected it would be possible to train four models in parallel in Google Colab. Are there issues with TensorFlow's implementation of distributed training?

On Google Colab you can only use one GPU; that is the limit set by Google. You can run different programs on different GPU instances by creating separate Colab notebooks, each connected to its own GPU, but you cannot place the same model on many GPU instances and train it in parallel.
There is no problem with MirroredStrategy itself; speaking from personal experience, it works fine if you have more than one GPU.
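If you want to confirm what the Colab runtime actually exposes, a quick sanity check is to list the visible GPUs before creating the strategy (a minimal sketch):

import tensorflow as tf

# MirroredStrategy can only replicate across the GPUs TensorFlow can see;
# on a standard Colab runtime this list contains at most one entry.
gpus = tf.config.experimental.list_physical_devices('GPU')
print('Visible GPUs:', gpus)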

Related

How to use Model Parallelism with a custom Tensorflow 2.0 model on TPUs?

To replicate Multimodal Few-Shot Learning with Frozen Language Models, I am trying to train a ~7B-parameter subclassed TF2 model on a TPUv3-32. Of those 7B parameters, roughly 6B are frozen.
I want to use model and data parallelism to train it as efficiently as possible. As far as I know, MeshTensorflow can only be used for models written in TF1.
I tried using experimental_device_assignment from TPUStrategy, but it placed all the variables on only the first (0th) core of the TPU, which quickly ran out of memory.
On a TPUv3-8, I tried to keep computation_shape = [2, 2, 1, 2] and [1, 1, 1, 2] and num_replicas = 1 but it didn't work.
I am also open to using GPUs to train it.
According to the Cloud TPU documentation, there is no official support:
Does Cloud TPU support model parallelism?
Model parallelism (or executing non-identical programs on the multiple cores within a single Cloud TPU device) is not currently supported.
https://cloud.google.com/tpu/docs/faq
The underlying issue may be that TPUStrategy does no automatic sharding of the computation graph, so the whole graph is placed on one device, unless (in the model code) you manually assign device placements for weights and operations to the logical devices created by DeviceAssignment.build and carefully handle communication across the devices.
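For reference, the kind of setup this refers to looks roughly like the following in recent TF 2.x (a sketch of the DeviceAssignment approach mentioned in the question; the TPU name and shapes are placeholders):

import tensorflow as tf

# Connect to the TPU and build a device assignment that gives a single
# replica several logical cores (here 2 x 2 x 1 x 2 = 8 cores on a v3-8).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='your-tpu-name')
tf.config.experimental_connect_to_cluster(resolver)
topology = tf.tpu.experimental.initialize_tpu_system(resolver)

device_assignment = tf.tpu.experimental.DeviceAssignment.build(
    topology, computation_shape=[2, 2, 1, 2], num_replicas=1)

strategy = tf.distribute.TPUStrategy(
    resolver, experimental_device_assignment=device_assignment)

# Without explicit placement of weights and ops onto the logical devices
# inside the model code, everything still lands on the first core, which is
# the behaviour described in the question.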
That said, there is another TF2-compatible library (also from Google) that could help if you are building a large Transformer and want layers that are friendly to graph sharding: Lingvo. Its GitHub repository includes an example of sharding a model on a TPU v3-512 node. The library also includes Google's open-sourced GPipe, which can help speed up model-parallel training loops. Lingvo should work with GPUs as well.

Tensorflow Serving Performance Very Slow vs Direct Inference

I am running in the following scenario:
Single Node Kubernetes Cluster (1x i7-8700K, 1x RTX 2070, 32GB RAM)
1 Tensorflow Serving Pod
4 Inference Client Pods
The inference clients each grab images from one of 4 separate cameras and pass them to TF Serving for inference, in order to understand what is seen on the video feeds.
I had previously been doing inference inside the Inference Client Pods individually, by calling TensorFlow directly, but that was hard on the graphics card's RAM. TensorFlow Serving was introduced to the mix quite recently to optimize RAM usage, since we no longer load duplicate copies of the model onto the graphics card.
The performance, however, is not looking good; for 1080p images it looks like this:
Direct TF: 20ms for input tensor creation, 70ms for inference.
TF-Serving: 80ms for GRPC serialization, 700-800ms for inference.
The TF Serving pod is the only one that has access to the GPU, and it is bound to it exclusively. Everything else operates on the CPU.
Are there any performance tweaks I could do?
The model I'm running is Faster R-CNN Inception V2 from the TF Model Zoo.
Many thanks in advance!
This is from the TF Serving documentation:
Please note, while the average latency of performing inference with TensorFlow Serving is usually not lower than using TensorFlow directly, where TensorFlow Serving shines is keeping the tail latency down for many clients querying many different models, all while efficiently utilizing the underlying hardware to maximize throughput.
From my own experience, I've found TF Serving useful for providing a consistent abstraction over model serving that does not require implementing custom serving functionality. Model versioning and multi-model serving come out of the box and save you a lot of time.
Additionally, I would recommend batching your requests if you haven't already. I would also suggest playing around with the TENSORFLOW_INTER_OP_PARALLELISM, TENSORFLOW_INTRA_OP_PARALLELISM and OMP_NUM_THREADS arguments to TF Serving. Here is an explanation of what they are.
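On the batching point: whether it makes sense depends on your latency requirements, but sending several frames per PredictRequest instead of one at a time usually keeps the GPU much busier. A rough sketch of that idea over gRPC (the model name, input key, and get_frame helper are placeholders for your setup):

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel('tf-serving:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Stack several frames into one (N, H, W, 3) batch instead of sending them
# individually; get_frame is a placeholder for your camera read.
frames = np.stack([get_frame() for _ in range(4)])

request = predict_pb2.PredictRequest()
request.model_spec.name = 'faster_rcnn'            # placeholder model name
request.model_spec.signature_name = 'serving_default'
# The input key depends on your exported signature.
request.inputs['inputs'].CopyFrom(tf.make_tensor_proto(frames))

response = stub.Predict(request, 5.0)              # 5 second deadline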
Maybe you could try OpenVINO? It's a heavily optimized toolkit for inference. You could utilize your i7-8700K and run some frames in parallel. Here are some performance benchmarks for the very similar i7-8700T.
There is even OpenVINO Model Server which is very similar to Tensorflow Serving.
Disclaimer: I work on OpenVINO.

Can we run training and validation on separate GPUs using tensorflow object detection API running on tensorflow 1.12?

I have two Nvidia Titan X cards in my machine and want to fine-tune the COCO-pretrained Inception V2 model on a single specific class. I have created the train/val TFRecords and changed the config to run the TensorFlow Object Detection training pipeline.
I am able to start the training, but it hangs (without any OOM) whenever it tries to evaluate a checkpoint. Currently it uses only GPU 0, with other resources (RAM, CPU, I/O, etc.) in the normal range, so I am guessing the GPU is the bottleneck. I wanted to try splitting training and validation onto separate GPUs and see if that works.
I tried to look for a place where I could do something like setting CUDA_VISIBLE_DEVICES differently for the two processes, but unfortunately the latest TensorFlow Object Detection API code (using TensorFlow 1.12) makes this very difficult. I am also unable to verify my assumption that training and validation run in the same process, since my machine hangs. Could someone please suggest where to look to solve this?
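For reference, the per-process restriction being described is the kind of thing below; each process pins itself to a different GPU before TensorFlow is imported (a sketch of the idea, not a ready-made integration with the Object Detection API):

import os

# Must be set before TensorFlow is imported anywhere in the process.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # training process; use '1' in the evaluation process

import tensorflow as tf  # TensorFlow now only sees the selected GPU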

Already implemented neural network on Google Cloud Platform

I have implemented a neural network model using Python and TensorFlow, which normally runs on my own computer.
Now I would like to train it on new datasets on the Google Cloud Platform. Do you think it is possible? Do I need to change my code?
Thank you very much for your help!
Google Cloud offers the Cloud ML Engine service, which allows you to train your models and perform predictions without the need to run and maintain an instance with the required software.
To run the TensorFlow NN models you already have, you will not need to change your code; you will only have to package the trainer appropriately, as described in the documentation, and run an ML Engine job that performs the training itself. Once you have your model, you can also deploy it on the same service and later get predictions, with different options depending on your requirements (urgency in getting the predictions, data set sources, etc.).
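The packaging step mentioned above usually amounts to a small setup.py next to a trainer/ package containing your existing training script (a minimal sketch; the names and dependencies are illustrative):

from setuptools import find_packages, setup

# Typical trainer package layout for an ML Engine training job:
#   setup.py
#   trainer/__init__.py
#   trainer/task.py   <- entry point wrapping your existing training code
setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    install_requires=[],  # add any extra pip dependencies your model needs
)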
Alternatively, as suggested in the comments, you can always launch a Compute Engine instance and run your TensorFlow model there, as if you were doing it locally on your computer. However, I would strongly recommend the approach proposed above, as you will save some money (you are only charged for your usage, i.e. training jobs and/or predictions) and do not need to configure an instance from scratch.

How to split TensorFlow graph (model) onto multiple GPUs to avoid OOM?

So I have this very large and deep model I implemented with TensorFlow r1.2, running on an NVIDIA Tesla K40 with 12 GB of memory. The model consists of several RNNs, a bunch of weight and embedding matrices, as well as bias vectors. When I launched the training program, it first took about 2-3 hours to build the model, and then crashed due to OOM issues. I tried reducing the batch size to even 1 data sample per batch, but still ran into the same issue.
If I google 'tensorflow multiple gpu', the examples I find mainly focus on utilizing multiple GPUs by running the model in parallel, which means having each GPU run the same graph and having the CPU calculate the total gradient and propagate it back to the parameters.
I know one possible solution might be running the model on a GPU with more memory. But I wonder: is there a way to split my graph (model) into different parts sequentially and assign them to different GPUs?
The official guide on using GPUs shows an example of this in the 'Using multiple GPUs' section. You just need to create the operations within different tf.device contexts; the nodes will still be added to the same graph, but they will be annotated with device directives indicating where they should run. For example:
with tf.device("/gpu:0"):
net0 = make_subnet0()
with tf.device("/gpu:1"):
net1 = make_subnet1()
result = combine_subnets(net0, net1)