Algorithm for distributed messaging / replication?

I have a distributed application across which I'd like to replicate a single, eventually consistent state. The data is suitable for a CRDT (http://pagesperso-systeme.lip6.fr/Marc.Shapiro/papers/RR-6956.pdf) which has the excellent property that each node, given the same set of messages, will deterministically converge to the same value without complicated consensus protocols.
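(For illustration, here is a minimal sketch of a state-based CRDT, a grow-only counter: merge is commutative, associative and idempotent, so any two nodes that have seen the same set of updates report the same value, regardless of order or duplication. The class and method names are just my sketch, not any particular library.)

    class GCounter:
        def __init__(self, node_id):
            self.node_id = node_id
            self.counts = {}          # node_id -> count contributed by that node

        def increment(self, n=1):
            self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

        def merge(self, other):
            # element-wise max: safe to apply repeatedly or out of order
            for node, count in other.counts.items():
                self.counts[node] = max(self.counts.get(node, 0), count)

        def value(self):
            return sum(self.counts.values())

    a, b = GCounter("a"), GCounter("b")
    a.increment(3); b.increment(2)
    a.merge(b); b.merge(a)            # both nodes now report value() == 5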
However, I need another messaging/log layer that will ensure that each node actually sees every message, even in the face of adverse network conditions.
Specifically, I'm looking for an algorithm that has the following properties:
Works on an asynchronous network.
Nodes are only necessarily aware of their neighbors, not the whole network.
Nodes may be added or dropped at any time (that is, the network is not of a fixed size or topology).
The network can be acyclic (this can be a requirement, if necessary).
Is capable of bringing up to date a node that has fallen behind due to a temporary network outage or dropped messages.
Is capable of bringing a new, empty node joining the cluster up to date.
There is no hard limit on the time taken for the network to converge on a value (that is, for every node to receive every message), but given no partitions it should be fairly quick (in fuzzy terms, a matter of seconds, not minutes).
Is bounded in size. Algorithms that keep the entire message history (which will grow boundlessly) are unsuitable.
Is anyone aware of an algorithm with these properties?

Related

When is TensorFlow's ParameterServerStrategy preferable to its MultiWorkerMirroredStrategy?

When training a neural network across multiple servers and GPUs, I can't think of a scenario where the ParameterServerStrategy would be preferable to the MultiWorkerMirroredStrategy.
What are the ParameterServerStrategy's main use cases and why would it be better than using MultiWorkerMirroredStrategy?
MultiWorkerMirroredStrategy is intended for synchronous distributed training across multiple workers, each of which can have multiple GPUs.
ParameterServerStrategy: supports parameter servers. It can be used for multi-GPU synchronous local training or asynchronous multi-machine training.
One of the key differences is that ParameterServerStrategy can be used for asynchronous training, while MultiWorkerMirroredStrategy is intended for synchronous distributed training. In MultiWorkerMirroredStrategy a copy of all variables in the model is kept on each device across all workers, and a communication method is needed to keep all variables in sync. In contrast, in ParameterServerStrategy each variable of the model is placed on one parameter server.
This matters because:
In synchronous training, all the workers are kept in sync in terms of training epochs and steps, so if a worker fails or is preempted, the other workers must wait for it to restart before they can continue. If the failed or preempted worker does not restart for some reason, your workers will keep waiting.
In contrast, in ParameterServerStrategy each worker runs the same code independently, while the parameter servers run a standard server. This means that while each worker will synchronously compute a single gradient update across all of its GPUs, updates between workers proceed asynchronously. Operations that occur only on the first replica (such as incrementing the global step) will occur on the first replica of every worker. Hence, unlike MultiWorkerMirroredStrategy, different workers are not waiting on each other.
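For concreteness, here is a minimal sketch of how the two strategies are set up, assuming TF 2.x. With ParameterServerStrategy the ps/worker cluster layout would come from the TF_CONFIG environment variable, and newer releases also expose the class as tf.distribute.ParameterServerStrategy; a real job would create only one of the two strategies.

    import tensorflow as tf

    # Option A - synchronous: every worker holds a replica of all variables
    # and gradients are all-reduced each step, so workers wait for each other.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    # Option B - asynchronous: variables live on parameter servers and each
    # worker pushes updates independently. The ps/worker layout is read from
    # TF_CONFIG (a job would pick B *instead of* A):
    # resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
    # strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)

    # Either way, the model-building side looks the same:
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.compile(optimizer="sgd", loss="mse")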
I guess the question is: do you expect workers to fail, and will the delay in restarting them slow down training when using MultiWorkerMirroredStrategy? If that is the case, maybe ParameterServerStrategy is better.
EDIT: Answers to questions in comments:
So is the only benefit of PSS the fact that it resists better to
failing workers than MWMS?
Not exactly - even if workers do not fail in MWMS, the workers still need to stay in sync, so there can be network bottlenecks.
If so, then I imagine it would only be useful when training on many
workers, say 20 or more, or else the probability that a worker will
fail during training is low (and it can be avoided by saving regular
snapshots).
Maybe not, it depends on the situation. Perhaps in your scenario the probability of failure is low; in someone else's scenario there may be a higher probability. For the same number of workers, the longer a job runs, the more likely it is that a failure occurs in the middle of it. To illustrate further (with an oversimplified example), if I have the same number of nodes but they're simply slower, they could take much longer to do a job, and hence there is a greater likelihood of some kind of interruption/failure occurring during the job.
(and it can be avoided by saving regular snapshots).
Not sure I understand what you mean - if a worker fails and you've saved a snapshot, then you haven't lost data. But the worker still needs to restart, and in the interim between failure and restart the other workers may be waiting.
Isn't there a possible benefit with I/O saturation? If the updates are
asynchronous, I/O would be more spread out in time, right? But maybe
this benefit is cancelled by the fact that it uses more I/O? Could you
please detail this a bit?
I will first try to answer it from a conceptual point of view.
I would say try looking at it from a different angle - in a synchronous operation, you're waiting for something else to finish, and you may be idle till that something gives you what you need.
In contrast, in an asynchronous operation you do your own work, and when you need more you ask for it.
There is no hard and fast rule about whether synchronous operations or asynchronous operations are better. It depends on the situation.
I will now try to answer it from an optimization point of view:
Isn't there a possible benefit with I/O saturation? If the updates are
asynchronous, I/O would be more spread out in time, right? But maybe
this benefit is cancelled by the fact that it uses more I/O? Could you
please detail this a bit?
In a distributed system it is possible that your bottleneck could be CPU/GPU, disk or network. Nowadays networks are really fast, and in some cases faster than disk. Depending on your workers' configuration, CPU/GPU could be the bottleneck. So it really depends on the configuration of your hardware and network.
Hence I would do some performance testing to determine where the bottlenecks in your system are, and optimize for your specific problem.
EDIT: Additional follow up questions:
One last thing: in your experience, in what use cases is PSS used? I
mean, both PSS and MWMS are obviously for use with large datasets (or
else a single machine would suffice), but what about the model? Would
PSS be better for larger models? And in your experience, is MWMS more
frequently used?
I think cost and the type of problem being worked on may influence the choice. For example, both AWS and GCP offer “spot instances” / “preemptible instances”, which are heavily discounted servers that can be taken away at any moment. In such a scenario it may make sense to use PSS - even though machine failure is unlikely, an instance may simply be taken away without notice because it is a “spot instance”. If you use PSS, then the performance impact of servers disappearing may not be as large as when using MWMS.
If you’re using dedicated instances, the instances are dedicated to you, and will not be taken away - the only risk of interruption is machine failure. In such cases MWMS may be more attractive if you can take advantage of performance optimisations or plugin architecture.

Performance improvement using Between-Graph replication in distributed tensorflow

I have gone through this answer, but it didn't give the rationale for choosing multiple clients in Between-Graph replication for improving performance. How will using Between-Graph replication improve performance, when compared to In-Graph replication?
In-graph replication works fine for multiple devices on the same machine, but it doesn't scale well to cluster-size, because one client has to take care of coordination between all devices (even those located on different nodes).
Say, for example, that you have two GPUs, one on the client's machine and another on a second machine. Thanks to Tensorflow's magic, a simple with tf.device('address_of_the_gpu_on_the_other_machine'): will place operations on the remote computer's GPU. The graph will then run on both machines, but data needs to be gathered from both before the computation can proceed (loss computation, etc.). Network communication will slow down your training (and of course, the more machines, the more communication needed).
Between-graph replication, on the other hand, scales much better because each machine has its own client that only needs to coordinate communication to the parameter server and execution of its own operations. Graphs "overlap" on the parameter server, which updates one set of variables that are shared among all the worker graphs. Moreover, communication overhead is also greatly reduced, because now you only need to have fast communication to the parameter servers, but no machine needs to wait for other machines to complete before moving on to the next training iteration.
How are the graphs different between the two methods?
In-graph replication:
In this method, you have only one graph, managed by the client. This graph has nodes that are spread over multiple devices, even across different machines. This means that, for example, having two machines PC1 and PC2 on a network, the client will explicitly dispatch operations to one machine or the other. The graph technically is not "replicated"; only some parts of it are distributed. Typically, the client has a big batch of data that is split into sub-batches, each of which is fed to a compute-intensive part of the graph. Only this compute-intensive part is replicated; all the parts before the split and after the computation (e.g., loss calculation) run on the client. This is a bottleneck.
Note, also, that it's the client that decides which operations go to which machine, so theoretically one could have different parts of the graph on different nodes. You can decide to replicate identically the compute-intensive part on all your nodes, or you could, in principle, say "all the convolutions are on PC1, all dense layers go to PC2". Tensorflow's magic will insert data transfers where appropriate to make things work for you.
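Roughly, in-graph placement with the classic (TF 1.x / tf.compat.v1) API might look like the sketch below; the device strings, shapes and the 16-sample split are made up for illustration.

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    # One client, one graph: the client itself pins ops to (possibly remote) devices.
    x = tf.placeholder(tf.float32, shape=[None, 10])
    w = tf.get_variable("w", shape=[10, 32])              # shared weights, one copy

    with tf.device("/job:worker/task:0/device:GPU:0"):    # GPU on PC1
        h0 = tf.matmul(x[:16], w)
    with tf.device("/job:worker/task:1/device:GPU:0"):    # GPU on PC2
        h1 = tf.matmul(x[16:], w)

    # The split above and the gather below happen at the client: this is the
    # coordination bottleneck described here. To actually run it, you would open
    # a session against the cluster, e.g. tf.Session("grpc://pc1:2222").
    loss = tf.reduce_mean(tf.concat([h0, h1], axis=0))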
Between-graph replication:
Here you have multiple similar copies of the same graph. Why similar? Because all of them have the compute-intensive part (as above), but also the input pipeline, the loss calculation and their own optimizer (assuming you're using asynchronous training, the default; synchronous training adds another layer of complexity that I'll leave aside). (Delving deeper into Tensorflow's distributed framework, you'll also find out that not all workers (and their graphs) are equal: there is one "chief" worker that does initialization, checkpointing and summary logging, but this is not critical to understanding the general idea.)
Unlike above, here you need a special machine, the parameter server (PS), that acts as a central repository for the graph's variables (caveat: not all the variables, only the global ones, like global_step and the weights of your network). You need this because at each iteration of the training step every worker fetches the most recent values of the variables, sends the PS the updates that must be applied to them, and the PS actually performs the update.
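A rough sketch of this between-graph setup with the classic tf.train API is below; the addresses, shapes and the hard-coded task_index are placeholders, and every worker would run this same script with its own index (plus a similar script for the PS task).

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    # Hypothetical local cluster: one PS plus two workers.
    cluster = tf.train.ClusterSpec({
        "ps":     ["localhost:2222"],
        "worker": ["localhost:2223", "localhost:2224"],
    })
    task_index = 0  # would come from flags / TF_CONFIG in a real job
    server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

    # replica_device_setter places variables on the PS and ops on this worker,
    # so each worker builds its own (similar) graph that "overlaps" on the PS.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()
        x = tf.placeholder(tf.float32, [None, 10])
        y = tf.placeholder(tf.float32, [None, 1])
        loss = tf.losses.mean_squared_error(y, tf.layers.dense(x, 1))
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

    # Each worker then trains independently; the chief (task 0) also handles
    # checkpointing. The session below blocks until the PS task is up:
    # with tf.train.MonitoredTrainingSession(master=server.target,
    #                                        is_chief=(task_index == 0)) as sess:
    #     sess.run(train_op, feed_dict=...)   # normal training loop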
How is this different from the method above?
For one thing, there is no "big batch" that gets split among workers. Every worker processes as much data as it can handle; there is no need for splitting and putting things back together afterwards. This means there is no need for synchronization of workers, because the training loops are entirely independent. The training, however, is not independent, because the updates that worker A makes to the variables will be seen by worker B, since they both share the same variables. This means that the more workers you have, the faster the training (subject to diminishing returns), because effectively the variables are updated more often (approximately every time_for_a_train_loop/number_of_workers seconds). Again, this happens without coordination between workers, which incidentally also makes the training more robust: if a worker dies, the others can continue (with some caveats due to having a chief worker).
One last cool feature of this method is that, in principle, there is no loss in performance using a heterogeneous cluster. Every machine runs as fast as it can and awaits nobody. Should you try running in-graph replication on a heterogeneous cluster, you'd be limited in speed by the slowest machine (because you collect all results before continuing).

Why are hotspots on my redis cluster bad?

I have a redis cluster and I am planning to add keys which I know will have a much heavier read/update frequency than other keys. I assume this might cause hotspots on my cluster. Why is this bad, and how can I avoid it?
A hotspot on keys is OK if those keys are sharded to different redis nodes. But a hotspot on some redis nodes/machines is bad: the memory/CPU load of those machines will be quite heavy, while other nodes are not efficiently used.
If you know exactly what these keys are, you can calculate their slots yourself first, using CRC16 of the key modulo 16384.
And then you can distribute these slots to different redis nodes.
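For reference, a small sketch of that slot calculation (redis cluster uses the CRC16/XMODEM variant); key_slot is a made-up helper name, and hash tags ({...}) are ignored here for brevity, although real redis honors them.

    def crc16_xmodem(data: bytes) -> int:
        # CRC16/XMODEM: polynomial 0x1021, initial value 0x0000
        crc = 0
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                if crc & 0x8000:
                    crc = ((crc << 1) ^ 0x1021) & 0xFFFF
                else:
                    crc = (crc << 1) & 0xFFFF
        return crc

    def key_slot(key: str) -> int:
        return crc16_xmodem(key.encode()) % 16384

    # compare against the server's own answer: CLUSTER KEYSLOT user:1001
    print(key_slot("user:1001"), key_slot("user:1002"))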
Whether or not items will cause a hot spot on a particular node or nodes depends on a bunch of factors. As already mentioned, hotspotting on a single key isn't necessarily a problem if the overall cluster traffic is still relatively even and the node that key is on isn't taxed. If each of your cluster nodes is handling 1000 commands/sec and on one of those nodes all of the commands relate to one key, it's not really going to matter: since all of the commands are processed serially on a single thread, it's all the same.
However, if you have 5 nodes, all of which are handling 1000 commands/sec, but you add a new key to one node which makes that single node incur another 3000 commands/sec, one of your 5 nodes is now handling 50% of the processing. This means that it's going to take longer for that node to handle all of its normal 1000 commands/sec, and 1/5 of your traffic is now arbitrarily much slower.
Part of the general idea of distributing/sharding a database is not only to increase storage capacity but to balance work as well. Unbalancing that work will end up unbalancing or screwing up your application performance. It'll either cause 1/Nth of your request load to be arbitrarily slower due to accessing your bogged-down node, or it'll increase processing time across the board if your requests potentially access multiple nodes. Evenly spreading load gives an application better capacity to handle higher load without adversely affecting performance.
But there's also the practical consideration of whether the access to your new keys is proportionally large relative to your ongoing traffic. If your cluster is handling 1000+ commands/sec/node and a single key will add 10 req/sec to one particular node, you'll probably be just fine either way.

What is cyclic and acyclic communication?

So I've already searched if there was a question like this posted before, but I wasn't able to find the answer I liked.
I've been working with some PLCs and variable frequency drives lately and thought it was about time I finally found out what cyclic and non-cyclic communication is.
So correct me if I'm wrong, but when I think of cyclic data, I think of data that is continuously being updated and is able to be sent/sampled to other devices. With relation to what I'm doing, I'm thinking that the variable frequency drive is able to update information such as speed and frequency that can be sampled/read from a PLC. This is what I would consider cyclic communication, something that is always updating a certain type of information that can be sent as data.
So I might be completely wrong with this assumption, and that leaves me with the question of what exactly would be considered non-cyclic or acyclic communication.
Any help?
Forenote: This is mostly a programming based site, and while your question does have an answer within the contexts of programming, I happen to know that in your industrial application, the importance of cyclic vs acyclic tends to be very hardware/protocol specific, and is really more of a networking problem than a programming one.
Cyclic data is not simply "continuous" data. In industry, it refers to data delivered on a guaranteed (or at least highly predictable) schedule. If the data stream were to violate the schedule, it could have disastrous consequences (a VFD misses its shutdown command by a fraction of a second, and you lose your arm!).
Acyclic data is still reliable for machine control; it is just delivered in a less deterministic way (on the order of milliseconds, sometimes up to several seconds). When accessing a single VFD with a single PLC, you will probably never notice this bursting behavior, and in fact you may perceive smoother and quicker data transmissions. From the hardware interface perspective, acyclic data transfer does not provide as strong a guarantee about if or when one machine will respond to the request of another.
Both forms of data transfer deliver data at speeds much faster than humans can deal with, but in certain applications they will each have their own consequences.
Cyclic networks usually must take the form of master/slave, where only one device is allowed to speak at a time, and answers are always returned, even if just to confirm that the message was received. Cyclic networks usually do not allow as many devices on the same wire, and often they will pass larger amounts of data at slower rates.
Acyclic networks might be thought of as a bit more chaotic, but since they skip handshaking formalities, they can often cheat more devices onto the network and get higher speeds, all at the same time. This comes at the cost of occasional data collisions/bottlenecks, and sometimes requests for critical data are simply ignored/lost with no indication of failure or success from the target (in that case the sender will likely sit waiting desperately for a message it will not get, which often then trips process watchdogs that shut down the system).
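As a toy illustration (not any particular fieldbus protocol), here is a sketch of the scheduling difference; read_drive_status is a made-up stand-in for a real bus read.

    import random
    import time

    def read_drive_status():
        # made-up stand-in for reading speed/frequency registers from a drive
        return {"speed_rpm": random.uniform(0.0, 1750.0)}

    def cyclic_master(period_s=0.01, cycles=10):
        # cyclic style: poll the same data on a fixed, predictable schedule
        deadline = time.monotonic()
        for _ in range(cycles):
            status = read_drive_status()            # refreshed every cycle, needed or not
            deadline += period_s
            time.sleep(max(0.0, deadline - time.monotonic()))

    def acyclic_master(requests=10):
        # acyclic style: ask only when the application needs something,
        # so timing depends on demand (and on what else shares the network)
        for _ in range(requests):
            time.sleep(random.uniform(0.0, 0.05))
            status = read_drive_status()

    cyclic_master()
    acyclic_master()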
From a programmer perspective, not much is different between these two transmission types.
What will usually dictate the choice:
how many devices are running on the wire (sometimes this forces the answer right away)
how sensitive/volatile the data they want to share is (how useful are messages if they are a little late)
how much data they might be required to send at any given time (shifting demands on a network that already produces race conditions can be hard to anticipate/avoid if you don't see them coming beforehand).
Hope that helps :)

Travelling Salesman and Map/Reduce: Abandon Channel

This is an academic rather than practical question. In the Traveling Salesman Problem, or any other problem that involves finding a minimum, a map/reduce approach seems like it would benefit from some means of broadcasting the current minimum result to all of the computational nodes, so that they can abandon computations which exceed it.
In other words if we map the problem out we'd like each node to know when to give up on a given partial result before it's complete but when it's already exceeded some other solution.
One approach that comes immediately to mind would be if the reducer had a means to provide feedback to the mapper. Consider if we had 100 nodes, and millions of paths being fed to them by the mapper. If the reducer feeds the best result to the mapper, then that value could be included as an argument along with each new path (problem subset). In this approach the granularity is fairly rough ... the 100 nodes will each keep grinding away on their partition of the problem to completion and only get the new minimum with their next request from the mapper. (For a small number of nodes and a huge number of problem partitions/subsets to work across, this granularity would be inconsequential; also it's likely that one could apply heuristics to the sequence in which the possible routes or problem subsets are fed to the nodes to get a rapid convergence towards the optimum and thus minimize the amount of "wasted" computation performed by the nodes.)
Another approach that comes to mind would be for the nodes to be actively subscribed to some sort of channel, or multicast or even broadcast from which they could glean new minimums from their computational loop. In that case they could immediately abandon a bad computation when notified of a better solution (by one of their peers).
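To make the idea concrete, here is a rough single-machine sketch of what I mean, using a multiprocessing shared value as a stand-in for the broadcast channel (the helper names and the toy distance matrix are made up; in a real cluster the "channel" would be some distributed counter or pub/sub mechanism):

    from itertools import permutations
    from multiprocessing import Pool, Value

    best = None  # shared best-so-far tour cost, set per worker in init_worker

    def init_worker(shared_best):
        global best
        best = shared_best

    def tour_cost(tour, dist):
        total = 0.0
        for a, b in zip(tour, tour[1:] + tour[:1]):
            total += dist[a][b]
            if total >= best.value:            # abandon a bad partial result early
                return None
        return total

    def evaluate(args):
        tour, dist = args
        cost = tour_cost(tour, dist)
        if cost is not None:
            with best.get_lock():              # "broadcast" a new minimum
                if cost < best.value:
                    best.value = cost
        return cost

    if __name__ == "__main__":
        dist = [[0, 2, 9, 10], [1, 0, 6, 4], [15, 7, 0, 8], [6, 3, 12, 0]]
        shared_best = Value("d", float("inf"))
        tours = [(p, dist) for p in permutations(range(4))]
        with Pool(2, initializer=init_worker, initargs=(shared_best,)) as pool:
            pool.map(evaluate, tours)
        print("best tour cost:", shared_best.value)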
So, my questions are:
Is this concept covered by any terms of art in relation to existing map/reduce discussions?
Do any of the current map/reduce frameworks provide features to support this sort of dynamic feedback?
Is there some flaw with this idea ... some reason why it's stupid?
That's a cool topic which doesn't have that much literature written about it yet. So this is pretty much a brainstorming post, rather than an answer to all your problems ;)
So every TSP can be expressed as a graph, possibly looking like the example taken from the German Wikipedia.
Now you can run a graph algorithm on it. MapReduce can be used for graph processing quite well, although it has much overhead.
You need a paradigm called "message passing"; it was described in this paper: Paper.
I also blogged about it in terms of graph exploration; the post explains quite simply how it works: My Blogpost.
This is how you can tell the mapper what the current minimum result is (maybe just for the vertex itself).
With all that knowledge in the back of your mind, it should be pretty standard to think of a branch-and-bound algorithm (like the one you described) to get to the goal: pick a random start vertex and branch to every adjacent vertex. This causes a message to be sent to each of these adjacent vertices with the cost at which it can be reached from the start vertex (map step). Each vertex only updates its cost if it is lower than the currently stored cost (reduce step). Initially this should be set to infinity.
You do this over and over again until you've reached the start vertex again (obviously after you've visited every other one). So you have to somehow keep track of the currently best way to reach a vertex; this can be stored in the vertex itself, too. And every now and then you have to bound this branching and cut off branches that are too costly, which can be done in the reduce step after reading the messages.
Basically this is just a mix of graph algorithms in MapReduce and a kind of shortest paths.
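To make those rounds concrete, here is a toy single-machine sketch of the relaxation idea (map: each vertex re-emits its own cost and sends cost-plus-edge-weight to its neighbours; reduce: each vertex keeps the minimum it received). The graph and helper names are made up, and a real Hadoop job would run one such round per job until the costs stop changing.

    from collections import defaultdict

    INF = float("inf")

    # adjacency: vertex -> list of (neighbor, edge_weight)
    graph = {"A": [("B", 4), ("C", 1)], "B": [("D", 1)],
             "C": [("B", 2), ("D", 5)], "D": []}
    cost = {"A": 0, "B": INF, "C": INF, "D": INF}

    def map_step(vertex):
        # emit our own state plus candidate costs for the neighbours
        yield vertex, cost[vertex]
        if cost[vertex] < INF:
            for neighbor, weight in graph[vertex]:
                yield neighbor, cost[vertex] + weight

    def reduce_step(vertex, candidates):
        # keep the cheapest way to reach this vertex seen so far
        return min(candidates)

    def iterate():
        messages = defaultdict(list)
        for v in graph:
            for target, c in map_step(v):
                messages[target].append(c)
        return {v: reduce_step(v, cands) for v, cands in messages.items()}

    while True:
        new_cost = iterate()
        if new_cost == cost:       # converged: no vertex got cheaper
            break
        cost = new_cost
    print(cost)                    # shortest-path costs from A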
Note that this won't necessarily yield the optimal route between the nodes; it is still a heuristic. And you're just parallelizing an NP-hard problem.
BUT, a little self-advertising again: maybe you've read it already in the blog post I've linked, but there exists an abstraction over MapReduce that has far less overhead in this kind of graph processing. It is called BSP (Bulk Synchronous Parallel), and it is more flexible in its communication and computing model. So I'm sure this can be implemented a lot better with BSP than with MapReduce, and you can realize those channels you've spoken about better with it.
I'm currently involved in a Summer of Code project which targets these SSSP problems with BSP. Maybe you want to take a look if you're interested; it could be a partial solution to this, and it is described very well in my blog, too: SSSP's in my blog.
I'm excited to hear some feedback ;)
It seems that Storm implements what I was thinking of. It's essentially a computational topology (think of how each compute node might be routing results based on a key/hashing function to the specific reducers).
This is not exactly what I described, but it might be useful if one had a sufficiently low-latency way to propagate the current bound (i.e. local optimum information) which each node in the topology could update/receive in order to know which results to discard.