When training a neural network across multiple servers and GPUs, I can't think of a scenario where the ParameterServerStrategy would be preferable to the MultiWorkerMirroredStrategy.
What are the ParameterServerStrategy's main use cases and why would it be better than using MultiWorkerMirroredStrategy?
MultiWorkerMirroredStrategy is intended for synchronous distributed training across multiple workers, each of which can have multiple GPUs.
ParameterServerStrategy supports parameter servers. It can be used for multi-GPU synchronous local training or asynchronous multi-machine training.
One of the key differences is that ParameterServerStrategy can be used for asynchronous training, while MultiWorkerMirroredStrategy is intended for synchronous distributed training. In MultiWorkerMirroredStrategy, a copy of all variables in the model is kept on each device across all workers, and a communication method is needed to keep all variables in sync. In contrast, in ParameterServerStrategy each variable of the model is placed on one parameter server.
This matters because:
In synchronous training, all the workers are kept in sync in terms of training epochs and steps; if a worker fails or is preempted, the other workers have to wait for it to restart before they can continue. If the failed or preempted worker does not restart for some reason, your workers will keep waiting.
In contrast, in ParameterServerStrategy each worker runs the same code independently, while the parameter servers run a standard server process. This means that although each worker synchronously computes a single gradient update across all of its GPUs, updates between workers proceed asynchronously. Operations that occur only on the first replica (such as incrementing the global step) occur on the first replica of every worker. Hence, unlike MultiWorkerMirroredStrategy, different workers are not waiting on each other.
I guess the question is: do you expect workers to fail, and will the delay in restarting them slow down training when using MultiWorkerMirroredStrategy? If that is the case, maybe ParameterServerStrategy is better.
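For reference, here is a rough sketch of how the two strategies are created in TF 2.x. This is just an illustration: the TF_CONFIG cluster definition, build_model and train_dataset are assumed to exist, and depending on your TF version ParameterServerStrategy may still live under tf.distribute.experimental.

    import tensorflow as tf

    # Synchronous: every worker keeps a full copy of the variables,
    # kept consistent via collective all-reduce communication.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    # Asynchronous alternative: variables live on parameter servers and
    # workers push their updates independently of one another.
    # resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
    # strategy = tf.distribute.ParameterServerStrategy(resolver)

    with strategy.scope():
        model = build_model()              # hypothetical model-building helper
        model.compile(optimizer="adam", loss="mse")

    model.fit(train_dataset, epochs=10)    # train_dataset assumed to be defined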
EDIT: Answers to questions in comments:
So is the only benefit of PSS the fact that it resists better to
failing workers than MWMS?
Not exactly - even if workers do not fail in MWMS, the workers still need to stay in sync, so there can be network bottlenecks.
If so, then I imagine it would only be useful when training on many
workers, say 20 or more, or else the probability that a worker will
fail during training is low (and it can be avoided by saving regular
snapshots).
Maybe not, it depends on the situation. Perhaps in your scenario the probability of failure is low; in someone else's scenario there may be a higher probability. For the same number of workers, the longer a job runs, the more likely it is that a failure occurs in the middle of it. To illustrate further (with an oversimplified example): if I have the same number of nodes but they're simply slower, they could take much longer to do a job, and hence there is a greater likelihood of some kind of interruption or failure occurring during the job.
(and it can be avoided by saving regular snapshots).
Not sure I understand what you mean - if a worker fails and you've saved a snapshot, then you haven't lost data, but the worker still needs to restart. In the interim between failure and restart, the other workers may be waiting.
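If you go the snapshot route with Keras, one option (assuming TF 2.6+ and Model.fit; earlier versions have this under tf.keras.callbacks.experimental) is the BackupAndRestore callback, which checkpoints training state so a restarted worker resumes instead of starting over. A minimal sketch, with an illustrative backup path and the model/train_dataset placeholders from the earlier sketch:

    import tensorflow as tf

    # Periodically saves the training state; after a restart, training
    # resumes from the last backup instead of from scratch.
    backup_cb = tf.keras.callbacks.BackupAndRestore(backup_dir="/tmp/backup")

    model.fit(train_dataset, epochs=10, callbacks=[backup_cb])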
Isn't there a possible benefit with I/O saturation? If the updates are
asynchronous, I/O would be more spread out in time, right? But maybe
this benefit is cancelled by the fact that it uses more I/O? Could you
please detail this a bit?
I will first try to answer it from a conceptual point of view.
I would say try looking at it from a different angle: in a synchronous operation, you're waiting for something else to finish, and you may be idle until that something gives you what you need.
In contrast, in an asynchronous operation you do your own work, and when you need more you ask for it.
There is no hard and fast rule about whether synchronous operations or asynchronous operations are better. It depends on the situation.
I will now try to answer it from an optimization point of view:
In a distributed system it is possible that your bottleneck is the CPU/GPU, the disk or the network. Nowadays networks are really fast, and in some cases faster than disk. Depending on your workers' configuration, the CPU/GPU could be the bottleneck. So it really depends on the configuration of your hardware and network.
Hence I would do some performance testing to determine where the bottlenecks in your system are, and optimize for your specific problem.
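For example (assuming TF 2.x), you could capture a profile of a few training steps and inspect it in TensorBoard to see whether the time goes to compute, the input pipeline or cross-worker communication. A sketch, with an illustrative log directory and the model/train_dataset placeholders from above:

    import tensorflow as tf

    # Trace a handful of training steps; open the resulting logs in
    # TensorBoard's Profile tab to see where the time is actually spent.
    tf.profiler.experimental.start("/tmp/profile_logs")
    model.fit(train_dataset, epochs=1, steps_per_epoch=20)
    tf.profiler.experimental.stop()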
EDIT: Additional follow-up questions:
One last thing: in your experience, in what use cases is PSS used? I
mean, both PSS and MWMS are obviously for use with large datasets (or
else a single machine would suffice), but what about the model? Would
PSS be better for larger models? And in your experience, is MWMS more
frequently used?
I think cost and the type of problem being worked on may influence the choice. For example, both AWS and GCP offer "spot instances" / "preemptible instances", which are heavily discounted servers that can be taken away at any moment. In such a scenario it may make sense to use PSS: even though machine failure is unlikely, an instance may simply be taken away without notice because it is a spot instance. If you use PSS, the performance impact of servers disappearing may not be as large as when using MWMS.
If you're using dedicated instances, the instances are dedicated to you and will not be taken away; the only risk of interruption is machine failure. In such cases MWMS may be more attractive if you can take advantage of its performance optimisations or plugin architecture.
Redis can be scaled using replicas and shards. However:
replicas scale only reads, but can provide HA
shards scale both reads and writes, and have the added benefit of requiring less memory than adding a replica.
Based on these facts, if I'm not interested in HA, does it make sense to always use shards and not replicas, since I get the benefit of scaling both reads and writes with a smaller memory footprint (and lower costs)?
Yes, you can.
Regarding HA, you have to be sure you define/know what the application behaviour will be if a shard becomes unavailable (data loss, service unavailable, ...).
On the replica-read question, it is hard to tell without information about your application, but most of the time a single Redis instance (shard) is enough to deal with a lot of load. A very rough rule of thumb is that a shard can deal with 25 GB of data and 25,000 operations/second with sub-millisecond latency without any problem. Obviously this depends on the type of operations, data and commands you are running; it could be a lot more ops/sec if you do basic SET/GET.
And usually, when we have more than this, we use clustering to distribute the load.
So before going down the "replica-read" route (which I try to avoid as much as possible), take a look at your application and do some benchmarking on a single shard: you will probably see that everything is fine (at least from the workload point of view, although you will have a SPOF if you do not replicate).
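If you want a quick sanity check from application code (in addition to the bundled redis-benchmark tool), a rough Python sketch with redis-py against a single shard could look like the following; the connection details and key names are placeholders, and a real benchmark should also exercise your actual data sizes and commands.

    import time
    import redis

    r = redis.Redis(host="localhost", port=6379)   # placeholder connection details

    n = 100_000
    start = time.time()
    for i in range(n):
        r.set(f"bench:key:{i}", "value")           # simple SET workload
    elapsed = time.time() - start
    print(f"{n / elapsed:.0f} SET ops/sec")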
Recently I looked into reinforcement learning, and there was one question bugging me that I could not find an answer for: how is training effectively done using GPUs? To my understanding, constant interaction with an environment is required, which seems like a huge bottleneck to me, since this task is often non-mathematical / non-parallelizable. Yet, for example, AlphaGo uses multiple TPUs/GPUs. So how are they doing it?
Indeed, you will often have interactions with the environment in between learning steps, and these will often be better off running on the CPU than on the GPU. So, if your code for taking actions and your code for running an update / learning step are very fast (as in, for example, tabular RL algorithms), it won't be worth the effort of trying to get them onto the GPU.
However, when you have a big neural network that you need to go through whenever you select an action or run a learning step (as is the case in most of the Deep Reinforcement Learning approaches that are popular these days), the speedup of running it on the GPU instead of the CPU is often enough to make the effort worthwhile (even if it means you're quite regularly "switching" between CPU and GPU, and may need to copy some things from RAM to VRAM or the other way around).
When doing off-policy reinforcement learning (which means you can use transition samples generated by a "behavioral" policy, different from the one you are currently learning), an experience replay buffer is generally used. You can therefore grab a bunch of transitions from this large buffer and use a GPU to optimize the learning objective with SGD (cf. DQN, DDPG).
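As a rough illustration of that pattern (not any particular paper's implementation), the CPU side fills the buffer while interacting with the environment, and the GPU only ever sees randomly sampled mini-batches:

    import random
    import numpy as np

    replay_buffer = []            # transitions collected on the CPU
    BUFFER_LIMIT = 100_000

    def store(state, action, reward, next_state, done):
        if len(replay_buffer) >= BUFFER_LIMIT:
            replay_buffer.pop(0)  # drop the oldest transition
        replay_buffer.append((state, action, reward, next_state, done))

    def sample_batch(batch_size=64):
        batch = random.sample(replay_buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        # Only this sampled mini-batch is shipped to the GPU for the SGD step
        # (e.g. a Q-network update as in DQN).
        return states, actions, rewards, next_states, dones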
One instance of CPU-GPU hybrid approach for RL is this - https://github.com/NVlabs/GA3C.
Here, multiple CPUs are used to interact with different instances of the environment. "Trainer" and "Predictor" processes then collect the interactions using multi-process queues, and pass them to a GPU for back-propagation.
I have gone through this answer, but it didn't give the rationale for choosing multiple clients in Between-Graph replication for improving performance. How will using Between-Graph replication improve performance, when compared to In-Graph replication?
In-graph replication works fine for multiple devices on the same machine, but it doesn't scale well to cluster size, because one client has to take care of the coordination between all devices (even those located on different nodes).
Say, for example, that you have two GPUs, one on the client's machine and another on a second machine. Thanks to TensorFlow's magic, a simple with tf.device('address_of_the_gpu_on_the_other_machine'): will place operations on the remote computer's GPU. The graph will then run on both machines, but data will need to be gathered from both before the computation can proceed (loss computation, etc.). Network communication will slow down your training (and of course, the more machines, the more communication is needed).
Between-graph replication, on the other hand, scales much better because each machine has its own client that only needs to coordinate communication with the parameter server and the execution of its own operations. The graphs "overlap" on the parameter server, which updates one set of variables that are shared among all the worker graphs. Communication overhead is also greatly reduced, because now you only need fast communication to the parameter servers, and no machine has to wait for other machines to complete before moving on to the next training iteration.
How are the graphs different between the two methods?
In-graph replication:
In this method, you have only one graph, managed by the client. This graph has nodes that are spread over multiple devices, even across different machines. This means that, for example, having two machines PC1 and PC2 on a network, the client will explicitly dispatch operations to one machine or the other. The graph technically is not "replicated"; only some parts of it are distributed. Typically, the client has a big batch of data that is split into sub-batches, each of which is fed to a compute-intensive part of the graph. Only this compute-intensive part is replicated, while everything before the split and after the computation (e.g., loss calculation) runs on the client. This is a bottleneck.
Note also that it's the client that decides which operations go to which machine, so theoretically one could have different parts of the graph on different nodes. You can decide to replicate the compute-intensive part identically on all your nodes, or you could, in principle, say "all the convolutions go to PC1, all dense layers go to PC2". TensorFlow's magic will insert data transfers where appropriate to make things work for you.
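A bare-bones illustration of that kind of explicit placement by the single client, written in TF1-compat style (the device strings and layer sizes are placeholders, not taken from the question):

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    # One client builds a single graph and pins parts of it to specific
    # devices, possibly on other machines of the cluster.
    inputs = tf.placeholder(tf.float32, [None, 784])

    with tf.device("/job:worker/task:0/device:GPU:0"):
        hidden = tf.layers.dense(inputs, 128, activation=tf.nn.relu)

    with tf.device("/job:worker/task:1/device:GPU:0"):
        logits = tf.layers.dense(hidden, 10)

    # Everything before the split and after the computation (e.g. the loss)
    # still runs on the client, which is the bottleneck described above.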
Between-graph replication:
Here you have multiple similar copies of the same graph. Why similar? Because all of them have the compute-intensive part (as above), but also the input pipeline, the loss calculation and their own optimizer (assuming you're using asynchronous training, the default; this is another layer of complexity that I'll leave aside). (Delving deeper into TensorFlow's distributed framework, you'll also find out that not all workers (and their graphs) are equal; there is one "chief" worker that does initialization, checkpointing and summary logging, but this is not critical to understanding the general idea.)
Unlike above, here you need a special machine, the parameter server (PS), that acts as a central repository for the graph's variables (caveat: not all the variables, only the global ones, like global_step and the weights of your network). You need this because, at each iteration of the training loop, every worker fetches the most recent values of the variables, then sends the PS the updates that must be applied to them, and the PS actually performs the update.
How is this different from the method above?
For one thing, there is no "big batch" that gets split among workers. Every worker processes as much data as it can handle, and there is no need for splitting and putting things back together afterwards. This means there is no need for synchronization of workers, because the training loops are entirely independent. The training, however, is not independent, because the updates that worker A applies to the variables will be seen by worker B, since they both share the same variables. This means that the more workers you have, the faster the training (subject to diminishing returns), because effectively the variables are updated more often (approximately every time_for_a_train_loop/number_of_workers seconds). Again, this happens without coordination between workers, which incidentally also makes the training more robust: if a worker dies, the others can continue (with some caveats due to having a chief worker).
One last cool feature of this method is that, in principle, there is no loss in performance when using a heterogeneous cluster. Every machine runs as fast as it can and waits for nobody. Should you try running in-graph replication on a heterogeneous cluster, you'd be limited in speed by the slowest machine (because you collect all results before continuing).
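For contrast with the in-graph sketch above, here is a between-graph sketch in the same TF1-compat style. Each worker runs this same code itself, and tf.train.replica_device_setter places the variables on the parameter server job while the compute ops stay on the local worker; the addresses and task index are placeholders.

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],                  # placeholder addresses
        "worker": ["worker0.example.com:2222",
                   "worker1.example.com:2222"],
    })
    task_index = 0   # each worker runs this same script with its own index

    # Variables are placed on the PS job; compute ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            cluster=cluster,
            worker_device="/job:worker/task:%d" % task_index)):
        global_step = tf.train.get_or_create_global_step()
        inputs = tf.placeholder(tf.float32, [None, 784])
        logits = tf.layers.dense(inputs, 10)
        # ... each worker builds its own loss, optimizer and train_op ...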
Just reading a bit about what the advantage of a GPU is, and I want to verify that I understand it on a practical level. Let's say I have 10,000 arrays, each containing a billion simple equations to run. On a CPU it would need to go through every single equation, one at a time, but with a GPU I could run all 10,000 arrays as 10,000 different threads, all at the same time, so it would finish a ton faster... Is this example spot on, or have I misunderstood something?
I wouldn't call it spot on, but I think you're headed in the right direction. Mainly, a GPU is optimized for graphics-related calculations. This does not, however, mean that's all it is capable of.
Without knowing how much detail you want me to go into here, I can say at the very least the concept of running things in parallel is relevant. The GPU is very good at performing many tasks simultaneously in one go (known as running in parallel). CPUs can do this too, but the GPU is specifically optimized to handle much larger numbers of specific calculations with preset data.
For example, to render every pixel on your screen requires a calculation, and the GPU will attempt to do as many of these calculations as it can all at the same time. The more powerful the GPU, the more of these it can handle at once and the faster its clock speed. The end result is a higher-end GPU can run your OS and games in 4k resolution, whereas other cards (or integrated graphics) might only be able to handle 1080p or less.
There's a lot more to this as well, but I figured you weren't looking for the insanely technical explanation.
The bottom line is this: for running a single task on one piece of data, the CPU will normally be faster. A single CPU core is generally much faster than a single GPU core. However, GPUs typically have many cores, and for running a single task on many pieces of data (so you have to run it once for each piece), the GPU will usually be faster. But these are data-driven situations, and as such each situation should be assessed on an individual basis to determine which to use and how to use it.
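As a loose illustration of that "one task, many pieces of data" case (scaled down from the billion-equation example, and not a rigorous benchmark), the same elementwise formula expressed as one batched call can be dispatched across a GPU's cores in a single kernel launch:

    import tensorflow as tf

    # A scaled-down stand-in for "10,000 arrays of simple equations".
    x = tf.random.uniform([10_000, 1_000])

    # One batched call: with a GPU available, TensorFlow evaluates all of
    # these elements in parallel across the GPU's cores.
    y = 3.0 * x * x + 2.0 * x + 1.0

    # The CPU-style mental picture is a loop touching one element at a time;
    # parallel hardware removes that per-element serialization.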
I wrote an MPI program that seems to run OK, but I wonder about performance. The master thread needs to call MPI_Send 10 or more times, and the worker receives data 10 or more times and sends it. I wonder whether this incurs a performance penalty, and whether I could transfer everything in a single struct, or which other techniques I could benefit from.
Another general question: once an MPI program more or less works, what are the best optimization techniques?
It's usually the case that sending one large message is faster than sending 10 small messages. The time cost of sending a message is well modelled by a latency term (how long it would take to send an empty message, which is non-zero because of the overhead of function calls, network latency, etc.) plus a bandwidth term (how much longer it takes to send each extra byte once the communication has already started). By bundling messages up into one message, you only incur the latency cost once, and this is often a win (although it's always possible to come up with cases where it isn't). The best way to know for any particular code is simply to try it. Note that MPI datatypes give you very powerful ways to describe the layout of your data in memory, so that you can take it almost directly from memory to the network without having to do an intermediate copy into some buffer (so-called "marshalling" of the data).
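To make the bundling idea concrete, here is a rough sketch using the mpi4py Python bindings (the same reasoning applies to the C API); the payload sizes and tags are made up. For large numeric arrays you would normally use the buffer-based Send/Recv (or MPI derived datatypes in C) to avoid extra copies, as mentioned above.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        chunks = [np.random.rand(1000) for _ in range(10)]

        # Ten small sends: the latency cost is paid ten times.
        # for i, chunk in enumerate(chunks):
        #     comm.send(chunk, dest=1, tag=i)

        # One bundled send: the latency cost is paid once.
        comm.send(chunks, dest=1, tag=0)

    elif rank == 1:
        chunks = comm.recv(source=0, tag=0)
        # ... the worker now processes all ten chunks ...

Run with something like mpirun -n 2 python bundle_demo.py (the script name is illustrative).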
As for more general optimization questions about MPI: without knowing more, all we can do is give you advice so general as to not be very useful. Minimize the amount of communication that needs to be done, and wherever possible use built-in MPI tools (collectives, etc.) rather than implementing your own.
One way to fully understand the performance of your MPI application is to run it within the SimGrid platform simulator. The tooling and models provided are sufficient to get realistic timing predictions for mid-range applications (say, a few tens of thousands of lines of C or Fortran), and it can be combined with suitable visualization tools that help you fully understand what is going on in your application and the actual performance trade-offs you have to consider.
For a demo, please refer to this screencast: https://www.youtube.com/watch?v=NOxFOR_t3xI