Performance improvement using Between-Graph replication in distributed tensorflow

I have gone through this
answer, but it didn't give the rationale for choosing multiple clients in Between-Graph replication for improving performance. How will using Between-Graph replication improve performance, when compared to In-Graph replication?

In-graph replication works fine for multiple devices on the same machine, but it doesn't scale well to cluster-size, because one client has to take care of coordination between all devices (even those located on different nodes).
Say, for example, that you have two GPUs, one on the client's machine and another on a second machine. Thanks to Tensorflow's magic, a simple with tf.device('address_of_the_gpu_on_the_other_machine'): will place operations on the remote computer's GPU. The graph will then run on both machines, but data will need to be gathered from both before the computation can proceed (loss computation, etc.). Network communication will slow down your training (and of course, the more machines, the more communication needed).
Between-graph replication, on the other hand, scales much better because each machine has its own client that only needs to coordinate communication to the parameter server and execution of its own operations. Graphs "overlap" on the parameter server, which updates one set of variables that are shared among all the worker graphs. Moreover, communication overhead is also greatly reduced, because now you only need to have fast communication to the parameter servers, but no machine needs to wait for other machines to complete before moving on to the next training iteration.
How are the graphs different between the two methods?
In-graph replication:
In this method, you have only one graph, managed by the client. This graph has nodes that are spread over multiple devices, even across different machines. This means that, for example, having two machines PC1 and PC2 on a network, the client will explicitly dispatch operations to one machine or the other. The graph technically is not "replicated", only some parts of it are distributed. Typically, the client has a big batch of data that is split into sub-batches, each of which is fed to a compute-intensive part of the graph. Only this compute-intensive part is replicated; everything before the split and after the computation (e.g., loss calculation) runs on the client. This is a bottleneck.
Note, also, that it's the client that decides which operations go to which machine, so theoretically one could have different parts of the graph on different nodes. You could decide to replicate the compute-intensive part identically on all your nodes, or you could, in principle, say "all the convolutions go to PC1, all dense layers go to PC2". Tensorflow's magic will insert data transfers where appropriate to make things work for you.
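To make this concrete, here is a minimal sketch (TF 1.x style) of what an in-graph setup looks like; the machine names, cluster spec and the toy dense layer are purely illustrative, not a working recipe. One client owns the whole graph, pins the replicated towers to remote devices, and keeps the split and gathering for itself.

    # Rough sketch of in-graph replication (TF 1.x API); addresses, job names
    # and the toy model are illustrative assumptions.
    import tensorflow as tf

    # the servers running on pc1/pc2 would be started with this ClusterSpec
    cluster = tf.train.ClusterSpec({"worker": ["pc1:2222", "pc2:2222"]})

    x = tf.placeholder(tf.float32, shape=[None, 128])        # big batch, lives on the client
    sub_batches = tf.split(x, num_or_size_splits=2, axis=0)  # the client splits it

    tower_outputs = []
    for i, sub_batch in enumerate(sub_batches):
        # replicate only the compute-intensive part, one tower per machine
        with tf.device("/job:worker/task:%d/gpu:0" % i):
            tower_outputs.append(tf.layers.dense(sub_batch, 10))

    # everything after the towers (gathering, loss) runs back on the client,
    # which is exactly the coordination bottleneck described above
    logits = tf.concat(tower_outputs, axis=0)
    loss = tf.reduce_mean(logits)

    # the single client drives both machines through one session
    sess = tf.Session("grpc://pc1:2222")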
Between-graph replication:
Here you have multiple similar copies of the same graph. Why similar? Because all of them have the compute-intensive part (as above), but also the input pipeline, the loss calculation and their own optimizer (assuming you're using asynchronous training, the default; synchronous training is another layer of complexity that I'll leave aside). Delving deeper into Tensorflow's distributed framework, you'll also find that not all workers (and their graphs) are equal: there is one "chief" worker that does initialization, checkpointing and summary logging, but this is not critical to understanding the general idea.
Unlike above, here you need a special machine, the parameter server (PS), that acts as a central repository for the graph's variables (caveat: not all the variables, only the global ones, like global_step and the weights of your network). You need this because at each iteration of the training loop every worker fetches the most recent values of the variables. It then sends to the PS the updates that must be applied to the variables, and the PS actually performs the update.
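As a rough illustration (TF 1.x style; the cluster addresses and the toy model are made up), this is the shape of the program that every worker runs in between-graph replication: tf.train.replica_device_setter sends the variables to the PS and keeps the compute ops on the local worker.

    # Sketch of the per-worker program in between-graph replication (TF 1.x API).
    # Every worker process runs this same code with its own task_index.
    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "ps": ["ps0:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
    })
    task_index = 0  # would normally come from a command-line flag or environment
    server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

    # variables go to /job:ps, everything else stays on this worker
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()
        inputs = tf.placeholder(tf.float32, [None, 128])
        logits = tf.layers.dense(inputs, 10)
        loss = tf.reduce_mean(logits)                  # placeholder loss
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)             # update is applied on the PS

    # each worker gets its own session against its own server
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0)) as sess:
        pass  # run train_op in a loop here, feeding this worker's own data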
How is this different from the method above?
For one thing, there is no "big batch" that gets split among workers. Every worker processes as much data as it can handle; there is no need for splitting and putting things back together afterwards. This means there is no need for synchronization between workers, because the training loops are entirely independent. The training, however, is not independent, because the updates that worker A makes to the variables will be seen by worker B, since they both share the same variables. This means that the more workers you have, the faster the training (subject to diminishing returns), because effectively the variables are updated more often (approximately every time_for_a_train_loop/number_of_workers seconds). Again, this happens without coordination between workers, which incidentally also makes the training more robust: if a worker dies, the others can continue (with some caveats due to having a chief worker).
One last cool feature of this method is that, in principle, there is no loss in performance using a heterogeneous cluster. Every machine runs as fast as it can and awaits nobody. Should you try running in-graph replication on a heterogeneous cluster, you'd be limited in speed by the slowest machine (because you collect all results before continuing).

Related

When is TensorFlow's ParameterServerStrategy preferable to its MultiWorkerMirroredStrategy?

When training a neural network across multiple servers and GPUs, I can't think of a scenario where the ParameterServerStrategy would be preferable to the MultiWorkerMirroredStrategy.
What are the ParameterServerStrategy's main use cases and why would it be better than using MultiWorkerMirroredStrategy?
MultiWorkerMirroredStrategy is intended for synchronous distributed training across multiple workers, each of which can have multiple GPUs.
ParameterServerStrategy: Supports parameter servers. It can be used for multi-GPU synchronous local training or asynchronous multi-machine training.
One of the key differences is that ParameterServerStrategy can be used for asynchronous training, while MultiWorkerMirroredStrategy is intended for synchronous distributed training. In MultiWorkerMirroredStrategy a copy of all variables in the model is kept on each device across all workers, and a communication method is needed to keep all variables in sync. In contrast, in ParameterServerStrategy each variable of the model is placed on one parameter server.
This matters because:
In synchronous training, all the workers are kept in sync in terms of training epochs and steps, so if a worker fails or is preempted, the other workers have to wait for it to restart before they can continue. If the failed or preempted worker does not restart for some reason, your workers will keep waiting.
In contrast, in ParameterServerStrategy, each worker is running the same code independently, but the parameter servers are running a standard server. This means that while each worker will synchronously compute a single gradient update across all its GPUs, updates between workers proceed asynchronously. Operations that occur only on the first replica (such as incrementing the global step) will occur on the first replica of every worker. Hence, unlike MultiWorkerMirroredStrategy, different workers are not waiting on each other.
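For reference, here is a rough TF 2.x sketch of how the two strategies are set up. It assumes the cluster is described by the TF_CONFIG environment variable and that the worker/ps processes are already running; the exact module path of ParameterServerStrategy varies between TF versions.

    # Rough TF 2.x sketch; TF_CONFIG is assumed to describe the cluster.
    import tensorflow as tf

    use_async = False  # illustrative switch between the two strategies

    if use_async:
        # Asynchronous: variables live on parameter servers, workers push
        # updates independently (module path may differ by TF version).
        resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
        strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)
    else:
        # Synchronous: every worker keeps a full copy of the variables and
        # gradients are all-reduced across workers each step.
        strategy = tf.distribute.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    # The actual training loop differs: Model.fit works for both in recent
    # versions, while ParameterServerStrategy can also drive a custom loop
    # through a ClusterCoordinator.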
I guess the question is: do you expect workers to fail, and will the delay in restarting them slow down training when using MultiWorkerMirroredStrategy? If that is the case, maybe ParameterServerStrategy is better.
EDIT: Answers to questions in comments:
So is the only benefit of PSS the fact that it resists better to
failing workers than MWMS?
Not exactly - even if workers do not fail in MWMS, the workers still need to stay in sync, so there could be network bottlenecks.
If so, then I imagine it would only be useful when training on many
workers, say 20 or more, or else the probability that a worker will
fail during training is low (and it can be avoided by saving regular
snapshots).
Maybe not, it depends on the situation. Perhaps in your scenario the probability of failure is low. In someone else's scenario there may be a higher probability. For the same number of workers, the longer a job runs, the more likely it is that a failure occurs in the middle of it. To illustrate further (with an oversimplified example): if I have the same number of nodes, but they're simply slower, they will take much longer to do a job, and hence there is a greater likelihood of any kind of interruption / failure occurring during the job.
(and it can be avoided by saving regular snapshots).
Not sure I understand what you mean - if a worker fails, and you've saved a snapshot, then you haven't lost data. But the worker still needs to restart. In the interim between failure and restarting, other workers may be waiting.
Isn't there a possible benefit with I/O saturation? If the updates are
asynchronous, I/O would be more spread out in time, right? But maybe
this benefit is cancelled by the fact that it uses more I/O? Could you
please detail this a bit?
I will first try to answer it from a conceptual point of view.
I would say try looking at it from a different angle - in a synchronous operation, you're waiting for something else to finish, and you may be idle till that something gives you what you need.
In contrast, in an asynchronous operation, you do your own work and when you need more you ask for it.
There is no hard and fast rule about whether synchronous operations or asynchronous operations are better. It depends on the situation.
I will now try to answer it from an optimization point of view:
Isn't there a possible benefit with I/O saturation? If the updates are
asynchronous, I/O would be more spread out in time, right? But maybe
this benefit is cancelled by the fact that it uses more I/O? Could you
please detail this a bit?
In a distributed system it is possible that your bottleneck could be CPU / GPU, disk or network. Nowadays networks are really fast, and in some cases faster than disk. Depending on your workers' configuration, CPU / GPU could be the bottleneck. So it really depends on the configuration of your hardware and network.
Hence I would do some performance testing to determine where the bottlenecks in your system are, and optimize for your specific problem.
EDIT: Additional follow up questions:
One last thing: in your experience, in what use cases is PSS used? I
mean, both PSS and MWMS are obviously for use with large datasets (or
else a single machine would suffice), but what about the model? Would
PSS be better for larger models? And in your experience, is MWMS more
frequently used?
I think cost and the type of problem being worked on may influence the choice. For example, both AWS and GCP offer “spot instances” / “preemptible instances”, which are heavily discounted servers that can be taken away at any moment. In such a scenario, it may make sense to use PSS - even though machine failure is unlikely, an instance may simply be taken away without notice because it is a “spot instance”. If you use PSS, then the performance impact of servers disappearing may not be as large as when using MWMS.
If you’re using dedicated instances, the instances are dedicated to you, and will not be taken away - the only risk of interruption is machine failure. In such cases MWMS may be more attractive if you can take advantage of performance optimisations or plugin architecture.

Roles of parameter servers and workers

What exact role do parameter servers and workers have during the distributed training of neural networks? (e.g. in Distributed TensorFlow)
Perhaps breaking it down as follows:
During the forward pass
During the backward pass
For example:
Are parameter servers only responsible for storing and providing variable values in an ACID store?
Do different parameter servers manage different variables in the graph?
Do parameter servers receive gradients themselves (and thus add them up)?
Parameter Servers — This is actually the same as a worker. Typically it's a CPU task where you store the variables you need in the workers. In my case this is where I defined the weight variables needed for my networks.
Workers — This is where we do most of our computation intensive work.
In the forward pass — We take variables from the parameter servers and do something with them on our workers.
In the backward pass — We send the current state back to the parameter servers, which do some update operation and give us the new weights to try out.
Are parameter servers only responsible for storing and providing variable values in an ACID store? ==> Yes, as per the Tensorflow documentation and a Medium article.
Do different parameter servers manage different variables in the graph? ==> Yes, inferred from the statement,
In addition to that, you can decide to have more than one parameter
server for efficiency reasons. Using parameter servers can provide
better network utilization, and it allows scaling models to more
parallel machines. It is possible to allocate more than one parameter
server.
from this link.
Do parameter servers receive gradients themselves (and thus add them up)? ==> No. AFAIK, they receive the updated weights, because computing the gradients and modifying the weights using the formula
W1 = W0 - learning_rate * gradients
happens in the workers.
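To illustrate that division of labour, here is a deliberately framework-free sketch (plain NumPy, all names hypothetical) of one worker step for a linear least-squares model: the worker fetches the current weights, computes the gradients locally, applies W1 = W0 - learning_rate * gradients, and pushes the result back.

    # Framework-free sketch of the worker/PS split described above; the
    # "parameter server" is just a dict and the model is linear least squares.
    import numpy as np

    parameter_server = {"W": np.zeros((128, 10))}   # central copy of the weights

    def compute_gradients(W, x, y):
        # gradient of 0.5 * ||xW - y||^2 with respect to W (done on the worker)
        return x.T @ (x @ W - y)

    def worker_step(x_batch, y_batch, learning_rate=0.01):
        W0 = parameter_server["W"]                        # fetch current weights (forward pass)
        grads = compute_gradients(W0, x_batch, y_batch)   # backward pass, on the worker
        W1 = W0 - learning_rate * grads                   # W1 = W0 - learning_rate * gradients
        parameter_server["W"] = W1                        # send the updated weights back to the PS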

In Distributed Tensorflow, what is the effect of having multiple parameter servers?

When we have a parameter server which is updated by its workers, what is the effect of having multiple parameter servers for the same number of workers?
i.e. what happens when we have multiple parameter servers instead of one parameter server?
Thank you.
This is known as having multiple parameter server shards. This paper gives some more details, especially section 4.1:
https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf
To apply SGD to large data sets, we introduce Downpour SGD, a variant
of asynchronous stochastic gradient descent that uses multiple
replicas of a single DistBelief model. The basic approach is as
follows: We divide the training data into a number of subsets and run
a copy of the model on each of these subsets. The models communicate
updates through a centralized parameter server, which keeps the
current state of all parameters for the model, sharded across many
machines (e.g., if we have 10 parameter server shards, each shard is
responsible for storing and applying updates to 1/10th of the model
parameters) (Figure 2)
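In TF 1.x terms, sharding falls out of tf.train.replica_device_setter almost for free: with several ps tasks it spreads the variables over them (round-robin by default), so each shard stores and applies updates to only part of the parameters. The addresses in the sketch below are illustrative.

    # Sketch: three parameter server shards; replica_device_setter spreads the
    # variables over them (round-robin by default). Addresses are illustrative.
    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "ps": ["ps0:2222", "ps1:2222", "ps2:2222"],
        "worker": ["worker0:2222"],
    })

    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        w1 = tf.get_variable("w1", shape=[1024, 1024])   # placed on /job:ps/task:0
        w2 = tf.get_variable("w2", shape=[1024, 1024])   # placed on /job:ps/task:1
        w3 = tf.get_variable("w3", shape=[1024, 1024])   # placed on /job:ps/task:2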

Between-graph replication in tensorflow: sessions and variables

A question about between-graph replication in distributed Tensorflow, because I didn't understand a few points from the tutorials. As I understand it, the current model is:
We have a parameter server which we just launch in a separate process and call server.join() on.
We have workers, each of which builds a similar computational graph that contains parameter nodes linked to the parameter server (through tf.train.replica_device_setter) and calculation nodes placed on the worker itself.
What I didn't find:
How sessions are working in this model? Because in examples/tutorials it is hidden behind tf.train.Supervisor.
Do we have separate sessions on each worker, or just one huge session that accumulates the graphs from all the workers and the parameter server?
How are global variables initialized on the parameter server? I wonder whether I can initialize them in one of the worker processes (chosen as a "master"), given that I linked these parameters in the worker through tf.train.replica_device_setter. Is that correct?
In the following gist:
https://gist.github.com/yaroslavvb/ea1b1bae0a75c4aae593df7eca72d9ca
global variables are initialized just in the parameter server process, and all the workers consider them initialized. How is that possible, given that they even work in different sessions? I could not replicate it in a simpler example.
I have a main session in the core program where I perform training of the model. Part of the training loop is the collection of data, which in turn requires calculation on the tensorflow cluster. So I need to create this cluster, put the current state of the trained model on the parameter server, then collect data from the calculation and continue with the training loop. How can I: 1) pass the current trained model to the cluster? 2) extract the collected data from the cluster and pass it to the main program?
Thanks in advance!
EDIT:
To q.3:
It was answered previously (In tensorflow, is variable value the only context information a session stores?) that in distributed runtime variables are shared between sessions.
Does it mean that when I create sessions with some "target", all the variables will be shared between those sessions that run on the same graph?
Guess I can try answering these questions by myself; at least it may be helpful for other newbies trying to harness distributed Tensorflow, because as of now there is a lack of concise and clear blog posts on the topic.
Hope more knowledgeable people will correct me if needed.
We have separate sessions on all the servers, and these sessions share their resources (variables, queues, and readers), but only in a distributed setting (i.e. you pass server.target to the tf.Session constructor); see the sketch below.
Ref: https://www.tensorflow.org/api_docs/python/client/session_management#Session
Parameter variables are usually initialized in one "master" process. It can be the process where the parameter server is launched, but it is not strictly necessary to do it in just one process.
Because of p.1. Replicated :)
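A minimal self-contained sketch of point 1 (TF 1.x style, using an in-process cluster on localhost for simplicity): two sessions created against servers of the same cluster see the same variable, so initializing it once is enough.

    # Two sessions against servers of the same cluster share variable state
    # in the distributed runtime (TF 1.x API, in-process cluster for simplicity).
    import tensorflow as tf

    cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
    server0 = tf.train.Server(cluster, job_name="local", task_index=0)
    server1 = tf.train.Server(cluster, job_name="local", task_index=1)

    with tf.device("/job:local/task:0"):     # the variable lives on task 0
        v = tf.get_variable("v", initializer=10.0)

    sess_a = tf.Session(server0.target)
    sess_b = tf.Session(server1.target)

    sess_a.run(v.initializer)                # initialized through one session...
    print(sess_b.run(v))                     # ...and already visible from the other: 10.0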
Thanks to ideas from #YaroslavBulatov, I came to the following approach, which appears to be the simplest possible:
Cluster: one local "calculation server" and N "workers".
"Calculation server" keeps all the parameters of global network and performs training steps. All training ops are assigned to it.
"Workers" collect data in parallel and then put it in Queue; these data are used by "calculation server" when doing training steps.
So, high-level algorithm:
1. launch all the units in the cluster
2. build the comp graph and training ops on the calculation server
3. build the comp graph on the workers (variables are linked to the calculation server)
4. collect data with the workers
5. perform a training step on the calculation server and update the global network
6. repeat 4-5 till convergence :)
As of now I do the coordination between the calculation server and the workers through Queues (when to start collecting data and when to start a training step), which is definitely not the most elegant solution. Any feedback is very welcome.
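For what it's worth, the queue-based hand-off looks roughly like this (TF 1.x style); the job name "calc", the shapes and the capacity are hypothetical.

    # Rough sketch of the queue hand-off: workers enqueue collected samples,
    # the calculation server dequeues batches for its training step.
    import tensorflow as tf

    with tf.device("/job:calc/task:0"):
        data_queue = tf.FIFOQueue(capacity=1000, dtypes=[tf.float32],
                                  shapes=[[128]], shared_name="collected_data")

    # built in each worker's graph: push one collected sample
    sample = tf.placeholder(tf.float32, [128])
    enqueue_op = data_queue.enqueue([sample])

    # built in the calculation server's graph: pull a batch for training
    batch = data_queue.dequeue_many(32)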
I also stumbled upon these and very similar and related questions. I tried to clarify all of this in my overview of distributed TensorFlow. Maybe that is useful for some.
More specifically, let me try to answer your questions:
You say you do between-graph replication, i.e. you build a separate computation graph for every worker. This implies that you also have a separate session everywhere, because there would be no way to use that computation graph otherwise. The server (tf.distribute.Server) will not use the local computation graph. It will just execute things when remote sessions (clients) connect to it. The session has the graph. If there were only a single session, there would also be only a single graph, and then you would have in-graph replication.
If you share the variables (e.g. they live on a parameter server), it is enough if one of the workers does the initialization (e.g. the parameter server itself). Otherwise it depends on the specific distribution strategy and how you do the synchronization of the variables. E.g. a mirrored variable has separate copies on every replica, and you would need to make sure in some way that they are synchronized.
There is only a single copy of the variable in this case, which lives on the parameter server. All read and write on this variable would be a RPC call to the parameter server.
I'm not exactly sure what you mean by main program. You would have multiple instances of your program, one for each worker. But you will likely mark one of the workers as the chief worker, which has some further responsibilities like saving the checkpoint. Otherwise, all the workers are equal and do the same thing (this is again for between-graph replication). What your gradient accumulation or parameter update looks like depends on your strategy (e.g. whether you do sync training or async training, etc.).

Algorithm for distributed messaging?

I have a distributed application across which I'd like to replicate a single, eventually consistent state. The data is suitable for a CRDT (http://pagesperso-systeme.lip6.fr/Marc.Shapiro/papers/RR-6956.pdf) which has the excellent property that each node, given the same set of messages, will deterministically converge to the same value without complicated consensus protocols.
However, I need another messaging/log layer that will ensure that each node actually sees every message, even in the face of adverse network conditions.
Specifically, I'm looking for an algorithm that has the following properties:
Works on an asynchronous network.
Nodes are only necessarily aware of their neighbors, not the whole network.
Nodes may be added or dropped at any time (that is, the network is not of a fixed size or topology).
The network can be acyclic (this can be a requirement, if necessary).
Is capable of bringing up to date a node that has become behind due to temporary network outage or dropped messages.
Is capable of bringing a new, empty node joining the cluster up to date.
There is not a hard limit on the time taken for the network to converge on a value (that is, for every node to receive every message), but given no partitions it should be fairly quick (in fuzzy terms, a matter of seconds, not minutes).
Is bounded in size. Algorithms that keep the entire message history (which will grow boundlessly) are unsuitable.
Is anyone aware of an algorithm with these properties?