I'm new to machine learning and TensorFlow. I have a question about distributed training in TensorFlow. I've read about multi-GPU environments and it looks like it is quite possible (https://www.tensorflow.org/guide/using_gpu).
But what about multiple machines with multiple GPUs? Is it possible to divide training tasks between a few machines? Are there specific algorithms/tasks that require such distribution, or are multiple GPUs enough for machine learning? Will there be demand for this?
Thanks
It is possible.
You can run the same model on multiple machines using data parallelism, with distribution strategies (tf.distribute) or Horovod, to speed up your training. In that case you are running the same model on every machine to emulate a larger batch.
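For example, a minimal data-parallel sketch with MultiWorkerMirroredStrategy could look like the following (assuming a recent TF 2.x; the cluster addresses, model, and data are placeholders, and every machine runs the same script with its own task index):

    import json
    import os
    import tensorflow as tf

    # Each machine describes the cluster and its own role via TF_CONFIG
    # (addresses and index below are placeholders for your own cluster).
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {"worker": ["host1:12345", "host2:12345"]},
        "task": {"type": "worker", "index": 0},
    })

    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    with strategy.scope():
        # Variables created here are replicated and kept in sync across workers.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

    # The global batch is split across the workers, emulating a larger batch.
    x = tf.random.normal((1024, 10))
    y = tf.random.normal((1024, 1))
    model.fit(x, y, batch_size=256, epochs=1)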
You can also go a slightly less conventional route with GPipe or TF-Mesh and split a single model across multiple machines, to increase the number of model layers or even split individual layers across multiple workers.
Related
I have a large model (GPT-J 6B) and two 16G GPUs (V100 with no NVLink). I would like to do inference (generation). GPT-J needs 24G memory, so I need to split the model across my two GPUs.
For me, simplicity is much more important than maximizing utilization/throughput.
What is the easiest way to distribute the network? Is it possible to put the first half of the network on one GPU and the second half on the other? (essentially pipeline parallelism)
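For what it's worth, the basic idea of putting the first half on one device and the second half on the other can be sketched with explicit tf.device scopes (a toy TF 2.x example, not GPT-J itself; for Hugging Face checkpoints, loading with device_map="auto" via Accelerate is commonly used to get this kind of split automatically):

    import tensorflow as tf

    # Toy pipeline split: the first half's weights live on GPU 0, the second
    # half's on GPU 1, and only the intermediate activation crosses devices.
    with tf.device("/GPU:0"):
        w0 = tf.Variable(tf.random.normal([1024, 4096]))
    with tf.device("/GPU:1"):
        w1 = tf.Variable(tf.random.normal([4096, 16]))

    @tf.function
    def forward(x):
        with tf.device("/GPU:0"):
            h = tf.nn.relu(tf.matmul(x, w0))   # first half on GPU 0
        with tf.device("/GPU:1"):
            return tf.matmul(h, w1)            # second half on GPU 1

    out = forward(tf.random.normal([8, 1024]))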
There are several distribution strategies in TensorFlow:
MirroredStrategy
TPUStrategy
MultiWorkerMirroredStrategy
CentralStorageStrategy
ParameterServerStrategy
OneDeviceStrategy
Some of them run on one machine and replicate the model across its GPUs, and some of them use GPUs on different machines.
My question is: if I don't have GPUs in my servers, which strategy should I use so that I still get a benefit from distributed training?
I tried running MirroredStrategy and ParameterServerStrategy on four machines, but it seems slower than running on a single machine.
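For reference, all of the strategies listed above share the same scope-based API, so switching between them is mostly a one-line change (a minimal sketch, assuming TF 2.x and placeholder data):

    import tensorflow as tf

    # Pick the strategy that matches your hardware; the rest stays the same.
    strategy = tf.distribute.MirroredStrategy()              # one machine, its local GPUs (or CPU)
    # strategy = tf.distribute.MultiWorkerMirroredStrategy() # several machines (needs TF_CONFIG)

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")

    model.fit(tf.random.normal((256, 10)), tf.random.normal((256, 1)), epochs=1)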
In this link https://www.tensorflow.org/beta/tutorials/distribute/multi_worker_with_estimator they say that when using Estimator for multi-worker training, it is necessary to shard the dataset by the number of workers to ensure model convergence. By multi-worker, do they mean multiple GPUs in one system or distributed training? I have 2 GPUs in one system; do I have to shard the dataset?
No, you don't: multiple workers refers to a cluster of machines.
For a single machine with multiple GPUs you don't need to shard the dataset.
This tutorial explains MirroredStrategy, which is what you want for multiple GPUs on one machine: https://www.tensorflow.org/beta/tutorials/distribute/keras
For the different distribution strategies suited to different setups, you can refer here for more information: https://www.tensorflow.org/beta/guide/distribute_strategy#types_of_strategies
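If you later move to an actual multi-worker setup, manual sharding is a one-liner with tf.data (a sketch; the worker count and index are placeholders you would normally derive from TF_CONFIG):

    import tensorflow as tf

    num_workers = 2      # total workers in the cluster (placeholder)
    worker_index = 0     # this worker's index (placeholder)

    dataset = tf.data.Dataset.range(1000)
    # Each worker keeps every num_workers-th example, so together the workers
    # cover the full dataset without overlap.
    dataset = dataset.shard(num_workers, worker_index).batch(32)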
By searching on Google, I can find the following two types of deployment for TensorFlow training:
Training on a single node and multiple GPUs, such as CNN;
Distributed training on multiple nodes, such as between-graph replica training;
Is there any example of using multi-node multi-GPU? To be specific, there exist two levels of parallelism:
On the first level, the parameter servers and workers are distributed among different nodes;
On the second level, each worker on a single machine will use multiple GPUs for training;
The TensorFlow Inception model documentation on GitHub (link) has a very good explanation of the different types of training; make sure to check it out, along with its source code.
Also, you can have a look at this code; it does distributed training in a slightly different way.
I am a little confused about these two concepts.
I saw some multi-GPU examples that don't use clusters and servers in the code.
Are these two different? What is the difference?
Thanks a lot!
It depends a little on the perspective from which you look at it. In any multi-* setup, either multi-GPU or multi-machine, you need to decide how to split up your computation across the parallel resources. In a single-node, multi-GPU setup, there are two very reasonable choices:
(1) Intra-model parallelism. If a model has long, independent computation paths, then you can split the model across multiple GPUs and have each compute a part of it. This requires careful understanding of the model and the computational dependencies.
(2) Replicated training. Start up multiple copies of the model, train them, and then synchronize their learning (the gradients applied to their weights & biases).
Our released Inception model has some good diagrams in the readme that show how both multi-GPU and distributed training work.
But to tl;dr that source: In a multi-GPU setup, it's often best to synchronously update the model by storing the weights on the CPU (well, in its attached DRAM). But in a multi-machine setup, we often use a separate "parameter server" that stores and propagates the weight updates. To scale that to a lot of replicas, you can shard the parameters across multiple parameter servers.
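As a rough illustration of the synchronous multi-GPU pattern described above, here is a toy sketch in TF 1.x-style graph code (the tiny linear "model", random batches, and two-GPU loop are all stand-ins, not the Inception code itself):

    import tensorflow as tf

    # Variables are created on the CPU so every GPU tower shares one copy
    # of the weights; each tower computes gradients on its own batch slice.
    def build_loss(batch):
        with tf.device("/cpu:0"):
            w = tf.get_variable("w", shape=[10, 1])
        return tf.reduce_mean(tf.square(tf.matmul(batch, w)))

    opt = tf.train.GradientDescentOptimizer(0.01)
    batches = [tf.random_normal([32, 10]), tf.random_normal([32, 10])]  # one slice per GPU

    tower_grads = []
    for i, batch in enumerate(batches):
        with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
            loss = build_loss(batch)                         # forward + loss on GPU i
            tower_grads.append(opt.compute_gradients(loss))

    # Average each variable's gradients across towers and apply one shared update.
    avg_grads = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        avg_grads.append((tf.reduce_mean(tf.stack(grads), axis=0), grads_and_vars[0][1]))
    train_op = opt.apply_gradients(avg_grads)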
With multiple GPUs and parameter servers, you'll find yourself being more careful about device placement, using constructs such as with tf.device('/gpu:1'), or placing weights on the parameter servers using tf.train.replica_device_setter to assign them to /job:ps or /job:worker.
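A minimal sketch of what that placement looks like with the TF 1.x API (the cluster addresses and task index below are placeholders for your own cluster):

    import tensorflow as tf

    # Hypothetical cluster: two parameter servers and two workers.
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0:2222", "ps1:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
    })
    task_index = 0  # this process's index within the worker job (placeholder)

    # Variables get assigned to /job:ps (spread across the ps tasks), while
    # the compute ops below are pinned to this worker's first GPU.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d/gpu:0" % task_index,
            cluster=cluster)):
        w = tf.get_variable("w", shape=[10, 1])
        x = tf.placeholder(tf.float32, shape=[None, 10])
        y = tf.matmul(x, w)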
In general, training on a bunch of GPUs in a single machine is much more efficient -- it takes more than 16 distributed GPUs to equal the performance of 8 GPUs in a single machine -- but distributed training lets you scale to even larger numbers, and harness more CPU.
Well, until recently there was no open-source cluster version of TensorFlow, just a single machine with zero or more GPUs.
The new release v0.9 may or may not have changed things.
The article in the original release documentation (Oct 2015) showed that Google has cluster-based solutions - but they had not open-sourced them.
Here is what the whitepaper says:
3.2 Multi-Device Execution
Once a system has multiple devices, there are two main complications: deciding which device to place the computation for each node in the graph, and then managing the required communication of data across device boundaries implied by these placement decisions. This subsection discusses these two issues.