I'm trying to build a distributed TensorFlow framework template, but several problems are confusing me.
When I use --sync_replicas=True in the script, does it mean I am using synchronous training, as described in the docs?
Why is the global step in worker_0.log and worker_1.log not incremented consecutively?
Why does the global step not start at 0, but instead looks like this:
1499169072.773628: Worker 0: training step 1 done (global step: 339)
What is the relation between the training step and the global step?
As you can see from the create-cluster script, I created an independent cluster. Can I run multiple different models on this cluster at the same time?
Probably, but it depends on the particular library.
During distributed training it's possible to have race conditions, so the increments and reads of the global step are not fully ordered. This is fine.
This is probably because you're loading from a checkpoint.
Unclear; it depends on the library you're using.
One model per cluster is much easier to manage. It's fine to create multiple TF clusters on the same set of machines, though.
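To expand on the checkpoint answer above: in TF 1.x, tf.train.MonitoredTrainingSession restores the latest checkpoint from its checkpoint_dir, including the global_step variable, so the "first" training step a worker runs can report a global step like 339. A minimal sketch, assuming a TF 1.x script and a hypothetical ./ckpt directory:

import tensorflow as tf  # TF 1.x graph-mode API

global_step = tf.train.get_or_create_global_step()
bump = tf.assign_add(global_step, 1)  # stand-in for an optimizer step

# If ./ckpt already contains a checkpoint, the session restores it,
# so the first value printed continues from the saved step.
with tf.train.MonitoredTrainingSession(checkpoint_dir="./ckpt") as sess:
    for _ in range(3):
        print(sess.run(bump))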
I am trying to distribute my workload across multiple GPUs with AWS SageMaker. I am using a custom algorithm for a DCGAN with TensorFlow 2.0. The code so far works perfectly on a single GPU. I decided to implement the same code with Horovod distribution across multiple GPUs to reduce run time. The code, when changed from the original to Horovod, seems to work the same, and the training time is roughly the same. However, when I print out hvd.size() I only get a size of 1, regardless of the multiple GPUs present. TensorFlow recognizes all the present GPUs; Horovod does not.
I've tried running my code both on SageMaker and on an EC2 instance in a Docker container, and in both environments the same issue persists.
Here is a link to my GitHub repo:
Here
I've also tried using an entirely different neural network from the Horovod repository, updated to TF 2.0:
hvdmnist
At this point I am only trying to get the GPUs within one instance to be utilized; I am not trying to utilize multiple instances.
I think I might be missing a dependency of some sort in the Docker image, or there is some prerequisite command I need to run. I don't really know.
Thanks.
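One thing worth checking (a guess, not a confirmed diagnosis): hvd.size() reports the number of processes the launcher spawned, not the number of GPUs TensorFlow can see, so starting the script with a plain python train.py will always report a size of 1. A minimal sanity check, assuming horovod.tensorflow is installed in the image (the file name check_hvd.py is hypothetical):

# check_hvd.py -- launch with: horovodrun -np 4 python check_hvd.py
import horovod.tensorflow as hvd

hvd.init()
# size() == number of launched processes; local_rank() is this
# process's GPU index on the machine.
print("rank", hvd.rank(), "of", hvd.size(), "local_rank", hvd.local_rank())

If horovodrun -np 4 still prints a size of 1 in every process (or fails to start), the MPI/Gloo side of the Horovod build inside the Docker image is the likely suspect.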
By searching on Google, I can find the following two types of deployment for TensorFlow training:
Training on a single node and multiple GPUs, such as CNN;
Distributed training on multiple nodes, such as between-graph replicated training;
Is there any example of using multi-node, multi-GPU training? To be specific, there are two levels of parallelism:
On the first level, the parameter servers and workers are distributed among different nodes;
On the second level, each worker on a single machine will use multiple GPUs for training;
The TensorFlow Inception model documentation on GitHub (link) has a very good explanation of the different types of training; make sure to check it out, along with the source code.
Also, you can have a look at this code, which does distributed training in a slightly different way.
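For the two-level setup you describe, the pieces compose roughly like this in TF 1.x: tf.train.replica_device_setter sends variables to the ps job (level 1), while the worker_device it is given pins each tower's ops to one local GPU (level 2). A rough sketch, with hypothetical host names and a toy model standing in for a real one:

import tensorflow as tf  # TF 1.x distributed API

# Level 1: the cluster. Host names and ports are hypothetical.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Level 2: one tower per local GPU on this worker. The device setter
# sends variables to /job:ps and other ops to the given worker_device.
outputs = []
for i in range(2):  # 2 local GPUs assumed
    with tf.device(tf.train.replica_device_setter(
            cluster=cluster,
            worker_device="/job:worker/task:0/gpu:%d" % i)), \
         tf.variable_scope("model", reuse=tf.AUTO_REUSE):
        x = tf.random_normal([32, 10])     # stand-in for a real input batch
        w = tf.get_variable("w", [10, 1])  # placed on /job:ps
        outputs.append(tf.matmul(x, w))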
I want to train 3 networks of different scales on 3 GPUs in parallel, then make an ensemble of their outputs (with TensorFlow).
Do I need to create a graph and a corresponding session for each network? And what do I do to make these sessions run concurrently rather than sequentially?
Do I need to create a graph and a corresponding session for each network?
Yes. Since the networks are not exactly the same, you need to create a separate graph and session for each.
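As a minimal illustration of what that looks like in TF 1.x (the constants are placeholders for your real networks):

import tensorflow as tf  # TF 1.x graph-mode API

g1, g2 = tf.Graph(), tf.Graph()
with g1.as_default():
    out1 = tf.constant(1.0)  # build network 1 here
with g2.as_default():
    out2 = tf.constant(2.0)  # build network 2 here

sess1 = tf.Session(graph=g1)  # each session is bound to one graph
sess2 = tf.Session(graph=g2)
print(sess1.run(out1), sess2.run(out2))

Note that calling sess1.run(...) and sess2.run(...) from a single Python loop still executes them one after the other, which is why the separate-scripts approach below is the simplest way to get real concurrency.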
And what do I do to make these sessions run concurrently rather than sequentially?
Suppose you have three training scripts: train_graph1.py, train_graph2.py, and train_graph3.py. You need to run them separately at the same time for them to execute concurrently, pinning each to its own GPU:
CUDA_VISIBLE_DEVICES=0 python train_graph1.py ... &
CUDA_VISIBLE_DEVICES=1 python train_graph2.py ... &
CUDA_VISIBLE_DEVICES=2 python train_graph3.py ... &
Currently I'm implementing a large custom model, referencing the multi-GPU example for CIFAR-10 that ships with TensorFlow. However, the code I ended up writing based on it was not clean and was error-prone. For example, I had to find every trainable variable and wrap it in with tf.device('/cpu:0').
Are there more efficient/cleaner ways of adapting code for multi-GPU execution?
Many thanks for any support.
Here's an example from Rafal:
You make a loop over towers, with the body constructing the i-th tower under with tf.device(assign_to_gpu(i)). The function assign_to_gpu treats variables differently and assigns them to a "ps" device.
Note: we found that when GPUs are P2P-connected, training was faster when variables were kept on gpu:0 rather than cpu:0.
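A sketch of that pattern, assuming TF 1.x graph mode (the exact set of variable op names to match can vary, so treat this as illustrative rather than Rafal's verbatim code):

import tensorflow as tf  # TF 1.x graph-mode API

def assign_to_gpu(gpu=0, ps_device="/gpu:0"):
    """Device function: variable ops go to ps_device, all other ops to /gpu:<gpu>."""
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        if node_def.op in ("Variable", "VariableV2"):
            return ps_device
        return "/gpu:%d" % gpu
    return _assign

# Loop over towers: the i-th tower's ops land on GPU i, while its
# variables stay on the shared ps_device (gpu:0 here, per the note above).
num_gpus = 4  # hypothetical
for i in range(num_gpus):
    with tf.device(assign_to_gpu(i)), tf.variable_scope("model", reuse=tf.AUTO_REUSE):
        x = tf.random_normal([32, 10])
        w = tf.get_variable("w", [10, 1])  # placed on ps_device
        y = tf.matmul(x, w)                # placed on /gpu:i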
I am a little confused about these two concepts.
I saw some multi-GPU examples that don't use clusters and servers in the code.
Are these two approaches different? What is the difference?
Thanks a lot!
It depends a little on the perspective from which you look at it. In any multi-* setup, either multi-GPU or multi-machine, you need to decide how to split up your computation across the parallel resources. In a single-node, multi-GPU setup, there are two very reasonable choices:
(1) Intra-model parallelism. If a model has long, independent computation paths, then you can split the model across multiple GPUs and have each compute a part of it. This requires careful understanding of the model and the computational dependencies.
(2) Replicated training. Start up multiple copies of the model, train them, and then synchronize their learning (the gradients applied to their weights & biases).
Our released Inception model has some good diagrams in the readme that show how both multi-GPU and distributed training work.
But to tl;dr that source: In a multi-GPU setup, it's often best to synchronously update the model by storing the weights on the CPU (well, in its attached DRAM). But in a multi-machine setup, we often use a separate "parameter server" that stores and propagates the weight updates. To scale that to a lot of replicas, you can shard the parameters across multiple parameter servers.
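A condensed sketch of that synchronous multi-GPU pattern in TF 1.x (toy loss, two GPUs assumed): the weights are pinned to the CPU, each GPU computes gradients for its own batch, and the averaged gradients are applied in a single update:

import tensorflow as tf  # TF 1.x graph-mode API

opt = tf.train.GradientDescentOptimizer(0.1)
tower_grads = []
for i in range(2):  # one tower per GPU
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=tf.AUTO_REUSE):
        x = tf.random_normal([32, 10])         # stand-in for a real input batch
        with tf.device("/cpu:0"):
            w = tf.get_variable("w", [10, 1])  # weights live in host DRAM
        loss = tf.reduce_mean(tf.matmul(x, w) ** 2)
        tower_grads.append(opt.compute_gradients(loss, var_list=[w]))

# Average each variable's gradient across towers, then apply one update.
avg_grads = []
for gvs in zip(*tower_grads):  # the same variable's (grad, var) pair from each tower
    grads = [g for g, _ in gvs]
    avg_grads.append((tf.add_n(grads) / len(grads), gvs[0][1]))
train_op = opt.apply_gradients(avg_grads)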
With multiple GPUs and parameter servers, you'll find yourself being more careful about device placement, using constructs such as with tf.device('/gpu:1'), or placing weights on the parameter servers using tf.train.replica_device_setter to assign them to /job:ps or /job:worker.
In general, training on a bunch of GPUs in a single machine is much more efficient -- it takes more than 16 distributed GPUs to equal the performance of 8 GPUs in a single machine -- but distributed training lets you scale to even larger numbers, and harness more CPU.
Well, until recently there was no open-source cluster version of TensorFlow - just a single machine with zero or more GPUs.
The new v0.9 release may or may not have changed things.
The article in the original release documentation (Oct 2015) showed that Google has cluster-based solutions - but they had not open-sourced them.
Here is what the whitepaper says:
3.2 Multi-Device Execution
"Once a system has multiple devices, there are two main complications: deciding which device to place the computation for each node in the graph, and then managing the required communication of data across device boundaries implied by these placement decisions. This subsection discusses these two issues."