What exact role do parameter servers and workers have during distributed training of neural networks (e.g., in Distributed TensorFlow)?
Perhaps breaking it down as follows:
During the forward pass
During the backward pass
For example:
Are parameter servers only responsible for storing and providing variable values in an ACID store?
Do different parameter servers manage different variables in the graph?
Do parameter servers receive gradients themselves (and thus add them up)?
Parameter Servers — These are actually set up the same way as workers. Typically a parameter server runs on a CPU and stores the variables the workers need. In my case this is where I defined the weight variables needed for my networks.
Workers — This is where we do most of our computation-intensive work.
In the forward pass — we fetch the variables from the parameter servers and use them in the computation on our workers.
In the backward pass — we send the current state back to the parameter servers, which perform the update operation and give us the new weights to try out.
Are parameter servers only responsible for storing and providing variable values in an ACID store? ==> Yes, as per the TensorFlow documentation and a Medium article.
Do different parameter servers manage different variables in the graph? ==> Yes, inferred from the statement:
In addition to that, you can decide to have more than one parameter
server for efficiency reasons. Using multiple parameter servers can
provide better network utilization, and it allows models to scale to
more parallel machines. It is possible to allocate more than one
parameter server.
from this link.
Do parameter servers receive gradients themselves (and thus add them up)? ==> No. AFAIK, they receive the updated weights, because the computation of the gradients and the modification of the weights using the formula
W1 = W0 - learning_rate * gradients
happen on the workers.
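This division of labor can be sketched in plain Python (a toy model, not TensorFlow code; the `ParamServer` and `worker_step` names are made up for illustration): the parameter server only stores weights and applies the values it is handed, while the worker fetches weights, computes the gradient, and pushes the updated weights back.

```python
class ParamServer:
    """Toy parameter server: stores variables, applies updates it receives."""
    def __init__(self, weights):
        self.weights = dict(weights)

    def fetch(self):
        return dict(self.weights)         # forward pass: worker pulls values

    def apply(self, new_weights):
        self.weights.update(new_weights)  # backward pass: worker pushes W1

def worker_step(ps, lr, grad_fn):
    w = ps.fetch()["w"]
    grad = grad_fn(w)                     # gradient is computed on the worker
    ps.apply({"w": w - lr * grad})        # W1 = W0 - learning_rate * gradients

# Minimize f(w) = w**2 (gradient 2w): the weight converges toward 0.
ps = ParamServer({"w": 4.0})
for _ in range(100):
    worker_step(ps, lr=0.1, grad_fn=lambda w: 2 * w)
```

The point of the sketch is that `ParamServer` never sees a gradient, only finished weight values, matching the answer above.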
Related
Part of federated learning research is based on operations performed on the communications between the server and clients, such as dropping part of the updates (dropping some of the gradients describing a model) exchanged between clients and server, or discarding the update from a specific client in a certain communication round. I want to know whether such capabilities are supported by the TensorFlow Federated (TFF) framework and how, because at first glance it seems to me the level of abstraction of the TFF API does not allow such operations. Thank you.
TFF's language design intentionally avoids a notion of client identity; there is a desire to avoid making a "Client X" addressable, discarding its update, or sending it different data.
However, there may be a way to run simulations of the type of computations mentioned. TFF does support expressing the following:
Computations that condition on properties of tensors, for example ignoring an update that has NaN values. One way this could be accomplished would be to write a tff.tf_computation that conditionally zeros out the weight of updates before tff.federated_mean. This technique is used in tff.learning.build_federated_averaging_process().
Simulations that run different computations on different sets of clients (where a set may be a single client). Since the reference executor parameterizes clients by the data they possess, a TFF author could write two tff.federated_computations, apply them to different simulation data, and combine the results.
I have gone through this answer, but it didn't give the rationale for choosing multiple clients in between-graph replication for improving performance. How does between-graph replication improve performance, compared with in-graph replication?
In-graph replication works fine for multiple devices on the same machine, but it doesn't scale well to cluster size, because one client has to take care of coordination between all devices (even those located on different nodes).
Say, for example, that you have two GPUs, one on the client's machine and another on a second machine. Thanks to Tensorflow's magic, a simple with tf.device('address_of_the_gpu_on_the_other_machine'): will place operations on the remote computer's GPU. The graph will then run on both machines, but data will then need to be gathered from both before being able to proceed in the computation (loss computation, etc). Network communication will slow down your training (and of course, the more machines, the more communication needed).
Between-graph replication, on the other hand, scales much better because each machine has its own client that only needs to coordinate communication to the parameter server and execution of its own operations. Graphs "overlap" on the parameter server, which updates one set of variables that are shared among all the worker graphs. Moreover, communication overhead is also greatly reduced, because now you only need to have fast communication to the parameter servers, but no machine needs to wait for other machines to complete before moving on to the next training iteration.
How are the graphs different between the two methods?
In-graph replication:
In this method, you have only one graph, managed by the client. This graph has nodes that are spread over multiple devices, even across different machines. This means that, for example, having two machines PC1 and PC2 on a network, the client will explicitly dispatch operations to one machine or the other. The graph technically is not "replicated"; only some parts of it are distributed. Typically, the client has a big batch of data that is split into sub-batches, each of which is fed to a compute-intensive part of the graph. Only this compute-intensive part is replicated; everything before the split and after the computation (e.g., loss calculation) runs on the client. This is a bottleneck.
Note, also, that it's the client that decides which operations go to which machine, so theoretically one could have different parts of the graph on different nodes. You can decide to replicate the compute-intensive part identically on all your nodes, or you could, in principle, say "all the convolutions are on PC1, all dense layers go to PC2". Tensorflow's magic will insert data transfers where appropriate to make things work for you.
Between-graph replication:
Here you have multiple similar copies of the same graph. Why similar? Because all of them have the compute-intensive part (as above), but also the input pipeline, the loss calculation, and their own optimizer (assuming you're using asynchronous training, the default; synchronous training is another layer of complexity that I'll leave aside). Delving deeper into Tensorflow's distributed framework, you'll also find out that not all workers (and their graphs) are equal: there is one "chief" worker that does initialization, checkpointing, and summary logging. But this is not critical to understanding the general idea.
Unlike above, here you need a special machine, the parameter server (PS), that acts as a central repository for the graph's variables (caveat: not all the variables, only the global ones, like global_step and the weights of your network). You need this because now, at each iteration of the training step, every worker fetches the most recent values of the variables. It then sends the PS the updates that must be applied to the variables, and the PS actually performs the update.
How is this different from the method above?
For one thing, there is no "big batch" that gets split among workers. Every worker processes as much data as it can handle; there is no need for splitting and putting things back together afterwards. This means there is no need for synchronization of workers, because the training loops are entirely independent. The training, however, is not independent, because the updates that worker A makes to the variables will be seen by worker B, since they share the same variables. This means that the more workers you have, the faster the training (subject to diminishing returns), because effectively the variables are updated more often (approximately every time_for_a_train_loop/number_of_workers seconds). Again, this happens without coordination between workers, which incidentally also makes the training more robust: if a worker dies, the others can continue (with some caveats due to having a chief worker).
One last cool feature of this method is that, in principle, there is no loss in performance using a heterogeneous cluster. Every machine runs as fast as it can and awaits nobody. Should you try running in-graph replication on a heterogeneous cluster, you'd be limited in speed by the slowest machine (because you collect all results before continuing).
When we have a parameter server which is updated by its workers, what is the effect of having multiple parameter servers for the same number of workers?
i.e. what happens when we have multiple parameter servers instead of one parameter server?
Thank you.
This is known as having multiple parameter server shards. This paper gives some more details, especially section 4.1:
https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf
To apply SGD to large data sets, we introduce Downpour SGD, a variant
of asynchronous stochastic gradient descent that uses multiple
replicas of a single DistBelief model. The basic approach is as
follows: We divide the training data into a number of subsets and run
a copy of the model on each of these subsets. The models communicate
updates through a centralized parameter server, which keeps the
current state of all parameters for the model, sharded across many
machines (e.g., if we have 10 parameter server shards, each shard is
responsible for storing and applying updates to 1/10th of the model
parameters) (Figure 2)
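The sharding scheme described in the quote can be sketched in plain Python (a toy model under the paper's 1/N-per-shard assumption; `shard_for` is a made-up helper): each shard owns a disjoint, contiguous slice of the parameters and is the only one that stores and applies updates to that slice.

```python
def shard_for(index, num_shards, num_params):
    """Map a parameter index to the shard that owns it
    (contiguous slices: shard k owns indices [k*size, (k+1)*size))."""
    size = num_params // num_shards
    return min(index // size, num_shards - 1)

num_params, num_shards = 10, 2
shards = [dict() for _ in range(num_shards)]
for i in range(num_params):
    shards[shard_for(i, num_shards, num_params)][i] = 0.0

# Shard 0 now stores parameters 0-4, shard 1 stores parameters 5-9,
# so each shard is responsible for 1/2 of the model, as in the quote
# (with 10 shards it would be 1/10th each).
```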
Question about between-graph replication in distributed Tensorflow, because there are a few points I didn't get from the tutorials. As I understand the current model:
We have a parameter server, which we just launch in a separate process and call server.join().
We have workers, each of which builds a similar computational graph containing parameter nodes linked to the parameter server (through tf.train.replica_device_setter) and computation nodes placed on the worker itself.
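The placement in point 2 can be sketched roughly as follows (a simplified plain-Python model of the round-robin strategy tf.train.replica_device_setter uses by default; `make_device_setter` and the op-type strings are made up for illustration, and this is not TensorFlow code): variables go round-robin to the ps tasks, everything else stays on the local worker.

```python
def make_device_setter(ps_tasks, worker_device):
    """Return a function mapping an op type to a device string:
    variables cycle through the ps tasks, other ops stay on the worker."""
    state = {"next_ps": 0}
    def setter(op_type):
        if op_type == "Variable":
            ps = state["next_ps"] % ps_tasks
            state["next_ps"] += 1
            return f"/job:ps/task:{ps}"
        return worker_device
    return setter

setter = make_device_setter(ps_tasks=2, worker_device="/job:worker/task:0")
placements = [setter(t) for t in ["Variable", "Variable", "MatMul", "Variable"]]
# → ["/job:ps/task:0", "/job:ps/task:1", "/job:worker/task:0", "/job:ps/task:0"]
```

This is why every worker's graph "overlaps" on the parameter server: each one asks for the same variables and gets the same ps placements.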
What I didn't find:
How do sessions work in this model? In the examples/tutorials it is hidden behind tf.train.Supervisor.
Do we have separate sessions on each worker, or just one huge session that accumulates the graphs from all the workers and the parameter server?
How are global variables initialized on the parameter server? I wonder whether I can initialize them in one of the worker processes (chosen as a "master"), given that I linked these parameters on the worker through tf.train.replica_device_setter. Is that correct?
In the following gist:
https://gist.github.com/yaroslavvb/ea1b1bae0a75c4aae593df7eca72d9ca
global variables are initialized just in the parameter server process, and all the workers consider them initialized. How is that possible, given that they even work in different sessions? I could not replicate it in a simpler example.
I have a main session in the core program where I perform training of the model. Part of the training loop is collection of data, which in turn requires computation on the Tensorflow cluster. So I need to create this cluster, put the current state of the trained model on the parameter server, collect data from the computation, and then continue with the training loop. How can I: 1) pass the current trained model to the cluster? 2) extract the collected data from the cluster and pass it to the main program?
Thanks in advance!
EDIT:
To q.3:
It was answered previously (In tensorflow, is variable value the only context information a session stores?) that in the distributed runtime variables are shared between sessions.
Does that mean that when I create a session with some "target", all the variables will be shared between those sessions that run on the same graph?
Guess I can try answering these questions myself; at least it may be helpful for other newbies trying to harness distributed Tensorflow, because as of now there is a lack of concise and clear blog posts on that topic.
Hope more knowledgeable people will correct me if needed.
We have separate sessions on all the servers, and these sessions share their resources (variables, queues, and readers), but only in the distributed setting (i.e., when you pass server.target to the tf.Session constructor).
Ref: https://www.tensorflow.org/api_docs/python/client/session_management#Session
Parameter variables are usually initialized in one "master" process. This can be the process where the parameter server is launched, but it is not strictly necessary to do it in just one process.
Because of point 1. Replicated :)
Thanks to ideas from @YaroslavBulatov, I came to the following approach, which appears to be the simplest possible:
Cluster: one local "calculation server" and N "workers".
"Calculation server" keeps all the parameters of global network and performs training steps. All training ops are assigned to it.
"Workers" collect data in parallel and then put it in Queue; these data are used by "calculation server" when doing training steps.
So, high-level algorithm:
1. Launch all the units in the cluster.
2. Build the computation graph and training ops on the calculation server.
3. Build the computation graph on the workers (variables are linked to the calculation server).
4. Collect data with the workers.
5. Perform a training step on the calculation server and update the global network.
6. Repeat 4-5 till convergence :)
As of now I implemented the coordination between the calculation server and the workers through queues (when to start collecting data and when to start a training step), which is definitely not the most elegant solution. Any feedback is very welcome.
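The scheme above can be sketched with Python's standard threading and queue modules (threads stand in for cluster processes, and the names `worker`/`calc_server` are made up; the real version would use TensorFlow queues across processes): worker threads put collected data on a queue, and the calculation server drains it to perform training steps on the shared parameters.

```python
import queue
import threading

data_q = queue.Queue()
params = {"w": 0.0}          # stands in for the global network

def worker(n_items):
    """Collect data in parallel and put it on the queue."""
    for _ in range(n_items):
        data_q.put(1.0)      # pretend this is collected experience

def calc_server(total_items, lr=0.01):
    """Consume data and perform 'training steps' on the shared params."""
    for _ in range(total_items):
        x = data_q.get()
        params["w"] += lr * x   # stand-in for a real training step
        data_q.task_done()

workers = [threading.Thread(target=worker, args=(10,)) for _ in range(3)]
trainer = threading.Thread(target=calc_server, args=(30,))
for t in workers:
    t.start()
trainer.start()
for t in workers:
    t.join()
trainer.join()
# params["w"] is approximately 0.3 (30 items * lr of 0.01)
```

The queue is doing double duty here, carrying both the data and the implicit "data is ready" signal, which mirrors the (admittedly inelegant) coordination described above.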
I also stumbled upon these and very similar and related questions. I tried to clarify all of this in my overview of distributed TensorFlow. Maybe that is useful for some.
More specifically, let me try to answer your questions:
You say you do between-graph replication, i.e. you build a separate computation graph for every worker. This implies that you also have a separate session everywhere, because there would be no way to use that computation graph otherwise. The server (tf.distribute.Server) will not use the local computation graph; it just executes things when remote sessions (clients) connect to it. The session has the graph. If there were only a single session, there would also be only a single graph, and then you would have in-graph replication.
If you share the variables (e.g. they live on a parameter server), it is enough if one of the workers does the initialization (e.g. the parameter server itself). Otherwise it depends on the specific distribution strategy and how you synchronize the variables. E.g. a mirrored variable has separate copies on every replica, and you would need to make sure in some way that they are synchronized.
There is only a single copy of the variable in this case, which lives on the parameter server. All read and write on this variable would be a RPC call to the parameter server.
I'm not exactly sure what you mean by main program. You would have multiple instances of your program, one for each worker. But you will likely mark one of the workers as the chief worker, which has some further responsibilities like saving the checkpoint. Otherwise, all the workers are equal and do the same thing (this is again for between-graph replication). What your gradient accumulation or parameter update looks like depends on your strategy (e.g. whether you do sync training or async training, etc.).
I read the document of Distributed TensorFlow and have a question about between-graph replication.
https://www.tensorflow.org/versions/master/how_tos/distributed/index.html
In my understanding, between-graph replication training creates the same number of graphs as workers, and the graphs share tf.Variables on parameter servers.
That is, one worker creates one session and one graph, and all graphs share same tf.Variable.
However, I just thought two different sessions can not share the same tf.Variable.
Is it misunderstanding?
For your last question:
"Can two different sessions share the same tf.Variable?"
For distributed sessions (e.g. Session("grpc://..")), they can.
For direct sessions, they can't.
In distributed training, variables are managed by tf.train.Server() and persist across sessions. Remember: servers are created before sessions, and a server lives longer than the tf.Sessions that connect to it.
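That lifetime relationship can be pictured with a toy plain-Python model (the `Server` and `Session` classes here are made up; this is not the TensorFlow API): the server owns the variable store, and sessions are short-lived views onto it, so a value written through one session is visible through the next.

```python
class Server:
    """Toy server: owns the variables for as long as it lives."""
    def __init__(self):
        self.variables = {}

class Session:
    """Toy session: a short-lived view onto a server's variables,
    like tf.Session('grpc://...') targeting a tf.train.Server."""
    def __init__(self, target):
        self.target = target

    def assign(self, name, val):
        self.target.variables[name] = val

    def read(self, name):
        return self.target.variables[name]

server = Server()
sess1 = Session(server)
sess1.assign("w", 42)
del sess1                    # the session goes away...

sess2 = Session(server)
value = sess2.read("w")      # ...but the variable survived on the server
# value == 42
```

A "direct" session would correspond to each Session owning its own private variables dict, which is why variables cannot be shared in that case.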