I have two related questions on controlling the distributed training for an experiment with 2 machines each having multiple GPUs.
Following the tensorflow Distributed Inception guidelines, I see that each process implements the data preprocessing queues and readers; now to achieve data parallelism with either synchronous or asynchronous replicated training, how does TF make sure that each worker processes a minibatch that no other worker has or will process for a particular epoch? Since all queue-runners point to the same dataset, is there some built-in coordination between workers to not process the same examples more than once in one epoch (e.g. in sync SGD)?
Is it also possible to specify the GPU device for each worker process as part of the cluster spec? Or does it need to be set in code when running the training op? Or is this not recommended?
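To make the question concrete, here is a sketch of the two things I think I might need (the worker count, task index, file pattern, and device string are placeholders, not from my actual setup): sharding the dataset by worker index so no two workers read the same examples, and pinning the replica to a GPU in code rather than in the cluster spec.

```python
import tensorflow as tf

num_workers = 2   # placeholder: size of the worker job in the cluster spec
worker_index = 0  # placeholder: this process's task index

# Each worker keeps only every num_workers-th file, so no two workers
# read the same examples within an epoch.
dataset = tf.data.Dataset.list_files("/data/train-*.tfrecord", shuffle=False)
dataset = dataset.shard(num_workers, worker_index)

# The cluster spec only names jobs and tasks; pinning ops to a particular GPU
# of a task is done with tf.device while building the graph.
with tf.device("/job:worker/task:%d/gpu:0" % worker_index):
    pass  # build this replica's model here
```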
Does each layer in a TensorFlow/Keras sequential neural network (NN) stay busy all the time? In a CPU there are multiple pipeline stages, and each stage keeps working rather than waiting for the previous stage.
[Image: depth of a pipeline in a CPU's architecture]
Suppose there is a network:
[matmul(0) -> batch-norm(1) -> activation(2) -> matmul(3) -> loss(4)].
While batch i is being processed in the batch-norm(1) layer, could the next batch i+1 already be processed in matmul(0), like a stage in a CPU pipeline? I wonder whether such concurrent execution happens, or whether the whole GPU/CPU is dedicated to a single layer at a time.
I see that TensorFlow uses graphs and tf.function for execution, and I suppose parallel/concurrent execution would be scheduled based on the graph. How is layer execution planned from the graph's perspective?
Excellent question! AFAIK, TensorFlow doesn't execute layers separately. Once the model is compiled, the graph is executed as a whole; on a CPU this happens without the kind of cross-batch pipelining you describe. In GPU environments, the parts of the graph that can be processed in parallel are in fact processed in parallel. The image that you shared is a low-level illustration of how a CPU works, and I don't think it can be applied directly here. Please correct me if I am wrong.
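To make the op-level scheduling a bit more concrete, here is a minimal sketch (the shapes and names are just illustrative): inside one tf.function call, ops with no data dependency on each other, like the two matmuls below, may be dispatched concurrently by the runtime, but a single batch still flows through matmul -> batch-norm -> activation in dependency order, so there is no cross-batch pipelining of layers.

```python
import tensorflow as tf

@tf.function
def two_branches(x):
    # a and b have no dependency on each other, so the runtime is free to
    # schedule them concurrently (on one device or on different devices).
    a = tf.matmul(x, tf.ones([64, 64]))
    b = tf.matmul(x, 2.0 * tf.ones([64, 64]))
    # c depends on both a and b, so it can only run after both have finished.
    return a + b

print(two_branches(tf.random.normal([8, 64])).shape)
```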
In this link https://www.tensorflow.org/beta/tutorials/distribute/multi_worker_with_estimator they say that when using Estimator for multi-worker training, it is necessary to shard the dataset by the number of workers to ensure model convergence. By multi-worker, do they mean multiple GPUs in one system, or distributed training? I have 2 GPUs in one system; do I have to shard the dataset?
No, you don't - multiple workers refers to a cluster of machines.
For a single machine with multiple GPUs you don't need to shard the dataset.
This tutorial explains the MirroredStrategy which you want for multiple GPUs: https://www.tensorflow.org/beta/tutorials/distribute/keras
For different distributed strategies for different setups you can refer here for more information: https://www.tensorflow.org/beta/guide/distribute_strategy#types_of_strategies
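As a rough illustration of what that first tutorial sets up (the model, data, and sizes below are placeholders, not taken from the tutorial), MirroredStrategy replicates the model across the local GPUs and splits each batch between them, so no manual sharding is needed:

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every local GPU and splits each
# batch across the replicas automatically.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")

# Placeholder data just to show the call; fit() distributes each batch.
x = tf.random.normal([256, 10])
y = tf.random.normal([256, 1])
model.fit(x, y, batch_size=64, epochs=1)
```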
I've used TensorFlow but am new to distributed TensorFlow for training models. My understanding is that current best practices favor the data-parallel model with asynchronous updates:
A paper published by the Google Brain team in April 2016 benchmarked various approaches and found that data parallelism with synchronous updates using a few spare replicas was the most efficient, not only converging faster but also producing a better model. -- Chapter 12 of Hands-On Machine Learning with Scikit-Learn and TensorFlow
Now, my confusion from reading further about this architecture is figuring out which component applies the parameter updates: the workers or the parameter server?
In my illustration below, it's clear to me that the workers compute the gradients dJ/dw (the gradient of the loss J with respect to the parameter weights w). But who applies the gradient descent update rule?
What's a bit confusing is that this O'Reilly article on Distributed TensorFlow states the following:
In the more centralized architecture, the devices send their output in the form of gradients to the parameter servers. These servers collect and aggregate the gradients. In synchronous training, the parameter servers compute the latest up-to-date version of the model, and send it back to devices. In asynchronous training, parameter servers send gradients to devices that locally compute the new model. In both architectures, the loop repeats until training terminates.
The above paragraph suggests that in asynchronous training:
The workers compute gradients and send them to the parameter server.
The parameter server broadcasts the gradients to the workers.
Each worker receives the broadcasted gradients and applies the update rule.
Is my understanding correct? If it is, then that doesn't seem very asynchronous to me because the workers have to wait for the parameter server to broadcast the gradients. Any explanation would be appreciated.
I realize this was asked in 2018, but let's give it a shot.
Each worker computes gradients.
When a worker is done computing gradients, it sends them to the parameter server.
The worker then receives the new parameters from the parameter server, without waiting for the other workers.
In the synchronous case, the workers will not continue training before every worker has sent its update to the server.
What this means in the asynchronous case is that every worker can be working with slightly different parameters, because each worker fetches them without waiting for every other worker to update the parameter server.
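A minimal TF1-style sketch of this asynchronous set-up, assuming a between-graph replication layout with one parameter-server task (the host names, shapes, and learning rate are placeholders): variables live on the parameter server, each worker builds its own copy of this graph, and because every worker runs its own train_op with no barrier, updates reach the parameter server asynchronously.

```python
import tensorflow as tf  # TF1.x-style API

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter places variables on the ps job and ops on this worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    x = tf.random_normal([32, 10])   # placeholder for this worker's input batch
    y = tf.random_normal([32, 1])
    w = tf.get_variable("w", shape=[10, 1])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
    # Each worker applies its own gradients; nothing makes it wait for the
    # other workers, which is what makes the updates asynchronous.
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.train.MonitoredTrainingSession(master=server.target) as sess:
    for _ in range(100):
        sess.run(train_op)
```

For synchronous updates one would typically wrap the optimizer in tf.train.SyncReplicasOptimizer instead, so an aggregated update is applied only after enough workers have reported their gradients.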
By searching on Google, I can find the following two types of deployment about tensorflow training:
Training on a single node and multiple GPUs, such as CNN;
Distributed training on multiple nodes, such as between-graph replica training;
Is there any example of using multi-node multi-GPU? To be specific, there exist two levels of parallelism:
On the first level, the parameter servers and workers are distributed among different nodes;
On the second level, each worker on a single machine will use multiple GPUs for training;
The TensorFlow Inception model documentation on GitHub (link) has a very good explanation of the different types of training; make sure to check it out, along with its source code.
Also, you can have a look at this code; it also does distributed training, in a slightly different way.
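To sketch the second level (the per-worker multi-GPU part) along the lines of the tower pattern in that Inception example - the shapes, GPU count, and variable names here are illustrative assumptions - each local GPU computes gradients on its own slice of the batch, the gradients are averaged, and a single update is applied to the shared variables (which, in the distributed case, would live on the parameter servers via replica_device_setter).

```python
import tensorflow as tf  # TF1.x-style API

NUM_GPUS = 2  # assumption: GPUs visible to this worker

w = tf.get_variable("w", shape=[10, 1])
opt = tf.train.GradientDescentOptimizer(0.01)

tower_grads = []
for gpu in range(NUM_GPUS):
    with tf.device("/gpu:%d" % gpu):
        # Each tower gets its own slice of the minibatch (random data here).
        x = tf.random_normal([16, 10])
        y = tf.random_normal([16, 1])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
        tower_grads.append(opt.compute_gradients(loss, var_list=[w]))

# Average the per-GPU gradients variable-by-variable and apply one update.
averaged = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars]
    averaged.append((tf.add_n(grads) / len(grads), grads_and_vars[0][1]))
train_op = opt.apply_gradients(averaged)
```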
I have access to a computer with multiple CPU cores (i.e., 56) and when training models using Tensorflow I would like to make the maximum usage of the aforementioned cores, by making each one of the cores an independent trainer of the model.
In TensorFlow's documentation, I found these two parameters (inter- and intra-op parallelism) that control the parallelism while training models. However, these two parameters do not let me do what I intend.
How can I make each core an independent worker? That is, batches of samples are sharded across the workers, each worker computes gradients based on the samples it was assigned, and finally each worker updates the variables (which are shared by all the workers) according to the gradients it has calculated.
To parallelize across all 56 CPU cores effectively you will have to use Distributed TensorFlow. It is also possible to parallelize using threading, but it will not scale well to that many cores.
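One possible shape for that, as a hedged sketch (the port numbers, worker count, and thread-pool sizes are arbitrary assumptions): run several worker processes on localhost, each created with its own task index from a shared cluster spec, and bound each process's thread pools so the workers don't compete for the same cores.

```python
import tensorflow as tf  # TF1.x-style API

NUM_WORKERS = 4   # assumption: e.g. one worker per group of cores
TASK_INDEX = 0    # each launched process passes its own index

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2220"],
    "worker": ["localhost:%d" % (2221 + i) for i in range(NUM_WORKERS)],
})

# One server per process; the thread-pool limits keep each worker's ops from
# oversubscribing the machine's cores.
server = tf.train.Server(
    cluster, job_name="worker", task_index=TASK_INDEX,
    config=tf.ConfigProto(intra_op_parallelism_threads=14,
                          inter_op_parallelism_threads=2))

# The model itself is then built with tf.train.replica_device_setter, exactly as
# in the asynchronous parameter-server sketch above, so the variables are shared
# and each worker trains on its own shard of the data.
```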