In this tutorial, https://www.tensorflow.org/beta/tutorials/distribute/multi_worker_with_estimator, they say that when using Estimator for multi-worker training, it is necessary to shard the dataset by the number of workers to ensure model convergence. By "multi-worker", do they mean multiple GPUs in one system, or distributed training across machines? I have 2 GPUs in one system; do I have to shard the dataset?
No, you don't - "multiple workers" refers to a cluster of machines.
For a single machine with multiple GPUs you don't need to shard the dataset.
This tutorial explains MirroredStrategy, which is what you want for multiple GPUs on one machine: https://www.tensorflow.org/beta/tutorials/distribute/keras
For the distribution strategies that fit other setups, you can refer here for more information: https://www.tensorflow.org/beta/guide/distribute_strategy#types_of_strategies
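A minimal sketch of that single-machine, multi-GPU setup (the toy model and random data are placeholders, not taken from the tutorial):

    import numpy as np
    import tensorflow as tf

    # MirroredStrategy replicates the model on every GPU visible on this machine
    strategy = tf.distribute.MirroredStrategy()
    print('Replicas in sync:', strategy.num_replicas_in_sync)  # 2 with two GPUs

    # Variables created inside the scope are mirrored across the GPUs
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
            tf.keras.layers.Dense(1)
        ])
        model.compile(optimizer='adam', loss='mse')

    # Placeholder data; model.fit splits each batch across the replicas,
    # so no manual sharding is needed here
    x_train = np.random.rand(1024, 10).astype('float32')
    y_train = np.random.rand(1024, 1).astype('float32')
    model.fit(x_train, y_train, batch_size=64, epochs=2)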
Related
Suppose we have a simple TensorFlow model with a few convolutional layers. We would like to train this model on a cluster of computers that is not equipped with GPUs. Each computational node of this cluster might have one or multiple cores. Is it possible out of the box?
If not, which packages are able to do that? Are those packages able to perform data and model parallelism?
According to the TensorFlow documentation:
tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines or TPUs.
As mentioned above, it supports CPUs for distributed training, provided all the devices are on the same network.
Yes, you can use multiple machines to train the model; you need to set up the cluster and worker configuration on each of the devices, as shown below.
    import json
    import os

    tf_config = {
        'cluster': {
            'worker': ['localhost:1234', 'localhost:6789']
        },
        'task': {'type': 'worker', 'index': 0}  # this process is worker 0
    }
    # Each worker reads its role from the TF_CONFIG environment variable (as JSON)
    os.environ['TF_CONFIG'] = json.dumps(tf_config)
For details on the configuration and on training the model, please refer to Multi-worker training with Keras.
According to this SO answer:
tf.distribute.Strategy is integrated into tf.keras, so when model.fit is used with a tf.distribute.Strategy instance, building the model under strategy.scope() creates distributed variables. This allows TensorFlow to divide your input data equally across your devices.
Note: distributed training is mainly beneficial, performance-wise, when dealing with huge data and complex models.
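Putting the pieces together, a minimal multi-worker sketch (the toy model and random data are placeholders; the same script would run on every worker, each with its own TF_CONFIG as set above):

    import numpy as np
    import tensorflow as tf

    # MultiWorkerMirroredStrategy also works on CPU-only machines;
    # gradients are all-reduced over the network between the workers.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    # Variables created under the scope become distributed variables
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer='adam', loss='mse')

    # Placeholder data; Keras shards the input across the workers automatically
    x = np.random.rand(256, 10).astype('float32')
    y = np.random.rand(256, 1).astype('float32')
    model.fit(x, y, batch_size=32, epochs=2)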
There are many distribution strategies in TensorFlow:
MirroredStrategy
TPUStrategy
MultiWorkerMirroredStrategy
CentralStorageStrategy
ParameterServerStrategy
OneDeviceStrategy
Some of them run on one machine and broadcast the model to its different GPUs, and some of them use GPUs on different machines.
My question: if I don't have a GPU in my servers, which strategy should I use so that I still benefit from distributed training?
I tried running MirroredStrategy and ParameterServerStrategy on four machines, but it seems to be slower than running on a single machine.
I'm new to machine learning and TensorFlow. I have a question about distributed training in TensorFlow. I've read about multi-GPU environments and it looks like it is quite possible (https://www.tensorflow.org/guide/using_gpu).
But what about multiple machines with multiple GPUs? Is it possible to divide the training workload between a few machines? Are there specific algorithms/tasks that require such distribution, or are multiple GPUs in one machine enough for machine learning? Will there be demand for this?
Thanks
It is possible.
You can run the same model on multiple machines using data parallelism, with the distribution strategies or with Horovod, to speed up your training. In that case you run the same model across multiple machines, each on a different slice of the data, to emulate a larger batch.
You can also go a slightly less conventional way with GPipe or Mesh TensorFlow and split a single model across multiple machines, to fit a larger number of model layers or even to split individual layers across multiple workers.
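A hedged sketch of the Horovod route, following Horovod's standard Keras pattern (the toy model and random data are placeholders; the script would be launched on every machine with horovodrun or mpirun):

    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one Horovod process per worker

    # Pin each local process to one GPU; skipped automatically on CPU-only nodes
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])

    # Scale the learning rate with the number of workers (larger effective batch)
    # and wrap the optimizer so gradients are averaged across all workers
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(optimizer=opt, loss='mse')

    x = np.random.rand(512, 10).astype('float32')
    y = np.random.rand(512, 1).astype('float32')
    model.fit(x, y, batch_size=32, epochs=2,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
              verbose=1 if hvd.rank() == 0 else 0)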
By searching on Google, I can find the following two types of deployment for TensorFlow training:
Training on a single node with multiple GPUs, e.g. a CNN;
Distributed training on multiple nodes, e.g. between-graph replication;
Is there any example of multi-node, multi-GPU training? To be specific, there are two levels of parallelism:
On the first level, the parameter servers and workers are distributed among different nodes;
On the second level, each worker on a single machine will use multiple GPUs for training;
The TensorFlow Inception model documentation on GitHub (link) has a very good explanation of the different types of training; make sure to check it out, along with the source code.
You can also have a look at this code, which does distributed training in a slightly different way.
I have two related questions about controlling distributed training for an experiment with 2 machines, each with multiple GPUs.
Following the TensorFlow Distributed Inception guidelines, I see that each process implements the data preprocessing queues and readers. To achieve data parallelism with either synchronous or asynchronous replicated training, how does TF make sure that each worker processes a minibatch that no other worker has processed or will process in a particular epoch? Since all queue runners point to the same dataset, is there some built-in coordination between workers so that the same examples are not processed more than once per epoch (e.g. in sync SGD)?
Is it also possible to specify the GPU device for each worker process as part of the cluster spec? Or does it need to be set in code when running the training op? Or is this not recommended?