Which distributed strategy should I use when running training without GPU - tensorflow

There are many distribution strategies in TensorFlow:
MirroredStrategy
TPUStrategy
MultiWorkerMirroredStrategy
CentralStorageStrategy
ParameterServerStrategy
OneDeviceStrategy
Some of them run on one machine and mirror the model across multiple GPUs, and some of them use GPUs on different machines.
My question is: if I don't have a GPU on my server, which strategy should I use so that I still get a benefit from distributed training?
I tried running MirroredStrategy and ParameterServerStrategy on four machines, but it seems to be slower than running on a single machine.
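For concreteness, the kind of four-worker setup I mean follows the usual TF_CONFIG pattern, roughly like the sketch below (host names and the model are placeholders, not my actual code):

    import json
    import os
    import tensorflow as tf

    # Each machine gets the same cluster description plus its own task index
    # (host names below are placeholders).
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "worker": ["host1:12345", "host2:12345", "host3:12345", "host4:12345"]
        },
        "task": {"type": "worker", "index": 0},  # 0..3, one value per machine
    })

    # Collective-ops data parallelism; also runs on CPU-only workers.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

    # model.fit(per_worker_dataset, ...)  # input pipeline omitted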

Related

Distributed training in Tensorflow using multiple GPUs in Google Colab

I have recently become interested in incorporating distributed training into my TensorFlow projects. I am using Google Colab and Python 3 to implement a neural network with custom distributed training loops, as described in this guide:
https://www.tensorflow.org/tutorials/distribute/training_loops
In that guide under section 'Create a strategy to distribute the variables and the graph', there is a picture of some code that basically sets up a 'MirroredStrategy' and then prints the number of generated replicas of the model, see below.
[Screenshot: console output]
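The code in that picture is essentially the following (my paraphrase of the guide, not an exact copy):

    import tensorflow as tf

    # Create the strategy; it is supposed to pick up all visible GPUs.
    strategy = tf.distribute.MirroredStrategy()
    print("Number of devices: {}".format(strategy.num_replicas_in_sync))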
From what I can understand, the output indicates that the MirroredStrategy has only created one replica of the model, and therefore only one GPU will be used to train the model. My question: is Google Colab limited to training on a single GPU?
I have tried to call MirroredStrategy() both with and without GPU acceleration, but I only get one model replica every time. This is a bit surprising because when I use the multiprocessing package in Python, I get four threads. I therefore expected that it would be possible to train four models in parallel in Google Colab. Are there issues with TensorFlow's implementation of distributed training?
On Google Colab you can only use one GPU; that is the limit set by Google. However, you can run different programs on different GPU instances by creating separate Colab notebooks and connecting each of them to a GPU, but you cannot place the same model on many GPU instances in parallel.
There are no problems with MirroredStrategy; speaking from personal experience, it works fine if you have more than one GPU.

Multi gpu training with estimators

In this link https://www.tensorflow.org/beta/tutorials/distribute/multi_worker_with_estimator they say that when using Estimator for multi-worker training, it is necessary to shard the dataset by the number of workers to ensure model convergence. By multi-worker, do they mean multiple GPUs in one system or distributed training? I have 2 GPUs in one system; do I have to shard the dataset?
No, you don't: "multiple workers" refers to a cluster of machines.
For a single machine with multiple GPUs, you don't need to shard the dataset.
This tutorial explains the MirroredStrategy which you want for multiple GPUs: https://www.tensorflow.org/beta/tutorials/distribute/keras
For different distributed strategies for different setups you can refer here for more information: https://www.tensorflow.org/beta/guide/distribute_strategy#types_of_strategies
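If you later move to the true multi-worker case, the manual sharding that tutorial talks about boils down to something like this sketch (num_workers and worker_index would come from your cluster configuration; the data here is a stand-in):

    import numpy as np
    import tensorflow as tf

    def make_input_fn(num_workers, worker_index):
        """Estimator input_fn for multi-worker training: each worker reads a disjoint shard."""
        features = np.random.rand(1000, 10).astype("float32")  # stand-in data
        labels = np.random.rand(1000, 1).astype("float32")

        def input_fn():
            ds = tf.data.Dataset.from_tensor_slices((features, labels))
            ds = ds.shard(num_workers, worker_index)  # disjoint slice per worker
            return ds.shuffle(100).batch(32).repeat()

        return input_fn

    # On a single machine with 2 GPUs and MirroredStrategy, skip the shard()
    # call: the strategy splits each batch across the GPUs for you.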

TensorFlow on multiple machines with multiple GPUs?

I'm new to machine learning and TensorFlow. I have a question about distributed training in TensorFlow. I've read about multi-GPU environments and it looks like that is quite possible (https://www.tensorflow.org/guide/using_gpu).
But what about multiple machines with multiple GPUs? Is it possible to divide training tasks between a few machines? Are there specific algorithms/tasks which require such distribution, or are multiple GPUs enough for machine learning? Will there be demand for this?
Thanks
It is possible.
You can run the same model on multiple machines using data parallelism with distribution strategies or Horovod to speed up your training. In that case you are running the same model across multiple machines to emulate a larger batch.
You can also take a less conventional route with GPipe or TF-Mesh and split a single model across multiple machines, in order to increase the number of model layers or even split individual layers across multiple workers.
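To make the data-parallel option concrete, a Horovod + Keras setup looks roughly like the sketch below (launched with horovodrun across your machines; the model and hyperparameters are placeholders):

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each local process to one GPU, if any are present.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])

    # Scale the learning rate with the number of workers and wrap the optimizer
    # so gradients are averaged across all processes on every step.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(optimizer=opt, loss="mse")

    # Keep all workers' initial weights in sync with worker 0.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    # model.fit(dataset, callbacks=callbacks, ...)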

tensorflow: difference between multi GPUs and distributed tensorflow

I am a little confused about these two concepts.
I saw some examples about multi GPU without using clusters and servers in the code.
Are these two different? What is the difference?
Thanks a lot!
It depends a little on the perspective from which you look at it. In any multi-* setup, either multi-GPU or multi-machine, you need to decide how to split up your computation across the parallel resources. In a single-node, multi-GPU setup, there are two very reasonable choices:
(1) Intra-model parallelism. If a model has long, independent computation paths, then you can split the model across multiple GPUs and have each compute a part of it. This requires careful understanding of the model and the computational dependencies.
(2) Replicated training. Start up multiple copies of the model, train them, and then synchronize their learning (the gradients applied to their weights & biases).
Our released Inception model has some good diagrams in the readme that show how both multi-GPU and distributed training work.
But to tl;dr that source: In a multi-GPU setup, it's often best to synchronously update the model by storing the weights on the CPU (well, in its attached DRAM). But in a multi-machine setup, we often use a separate "parameter server" that stores and propagates the weight updates. To scale that to a lot of replicas, you can shard the parameters across multiple parameter servers.
With multiple GPUs and parameter servers, you'll find yourself being more careful about device placement using constructs such as with tf.device('/gpu:1'), or placing weights on the parameter servers using tf.train.replica_device_setter to assign it on /job:ps or /job:worker.
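As a rough illustration (TF 1.x graph-mode API, placeholder host names), the parameter-server placement described above looks something like this:

    import tensorflow as tf  # TF 1.x style API

    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # Variables land on /job:ps (round-robin across parameter servers);
    # the ops themselves run on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:0", cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])
        w = tf.get_variable("w", [784, 10])
        b = tf.get_variable("b", [10])
        logits = tf.matmul(x, w) + b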
In general, training on a bunch of GPUs in a single machine is much more efficient -- it takes more than 16 distributed GPUs to equal the performance of 8 GPUs in a single machine -- but distributed training lets you scale to even larger numbers, and harness more CPU.
Well, until recently there was no open-source cluster version of TensorFlow, just a single machine with zero or more GPUs.
The new release v0.9 may or may not have changed things.
The article in the original release documentation (Oct 2015) showed that Google has cluster-based solutions - but they had not open-sourced them.
Here is what the whitepaper says:
3.2 Multi-Device Execution
Once a system has multiple devices, there are two main complications: deciding which device to place the computation for each node in the graph, and then managing the required communication of data across device boundaries implied by these placement decisions. This subsection discusses these two issues.

Distributed tensorflow allocations

I have two related questions on controlling the distributed training for an experiment with 2 machines each having multiple GPUs.
Following the TensorFlow Distributed Inception guidelines, I see that each process implements the data preprocessing queues and readers. Now, to achieve data parallelism with either synchronous or asynchronous replicated training, how does TF make sure that each worker processes a minibatch that no other worker has processed or will process in a particular epoch? Since all queue runners point to the same dataset, is there some built-in coordination between workers so that the same examples are not processed more than once per epoch (e.g. in sync SGD)?
Is it possible to specify the GPU device for each worker process as well, as part of the cluster spec? Or does it need to be specified in code when running the training op? Or is this not recommended?
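To make the second question concrete, what I have in mind is something along these lines (task and GPU indices are hypothetical):

    import tensorflow as tf  # TF 1.x style API, as in the Distributed Inception example

    task_index = 0  # hypothetical: which worker this process is
    gpu_index = 1   # hypothetical: which local GPU it should use

    # Pin this worker's part of the graph to one specific local GPU.
    with tf.device("/job:worker/task:%d/gpu:%d" % (task_index, gpu_index)):
        a = tf.constant([[1.0, 2.0]])
        b = tf.constant([[3.0], [4.0]])
        product = tf.matmul(a, b)  # stand-in for the real training graph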