Distributed system design

In a distributed system, one node distributes 'X' units of work equally across 'N' worker nodes (via socket message passing).
As we increase the number of worker nodes, each node completes its share faster, but we have to set up more connections.
In a real situation, it would be similar to replacing 10 nodes in a Hadoop-like system, each processing 100 GB, with 1,000,000 nodes, each processing 1 MB.
What's the impact of setting up more connections in this case? Is poll() a significant source of overhead here?
What's the best approach?

Sounds like you will need to consult Amdahl's Law.
At least, that's how I computed how many machines on a high-speed switch were optimal for my parallel computations.
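A minimal sketch of that calculation, assuming a parallel fraction p of the job and a hypothetical per-connection setup cost that inflates the serial part as workers are added (both numbers are made up for illustration):

```cpp
#include <cstdio>

// Amdahl's Law: with parallel fraction p and N workers,
// speedup = 1 / ((1 - p) + p / N).
// Here a hypothetical per-connection setup cost is folded into the serial
// fraction to show why adding workers eventually stops paying off.
double speedup(double p, int n, double setup_cost_per_connection) {
    double serial = (1.0 - p) + n * setup_cost_per_connection;
    return 1.0 / (serial + p / n);
}

int main() {
    const int workers[] = {10, 100, 1000, 10000};
    for (int n : workers) {
        // p = 0.99 (99% of the job parallelizes), setup cost = 1e-4 per connection
        std::printf("N=%5d  speedup=%.2f\n", n, speedup(0.99, n, 1e-4));
    }
    return 0;
}
```

With these made-up numbers the speedup peaks at a few hundred workers and then collapses, which is exactly the trade-off the question describes.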

Does it have to use sockets and message passing between Supervisor and Worker?
You could use some type of queuing to avoid putting load on the Supervisor, or a distributed file system similar to HDFS to distribute the tasks and collect the results.
It also depends on the number of nodes you are planning to deploy the Workers on. 1,000,000 nodes is a very big number, so in that case you'll have to distribute the tasks across multiple queues.
The thing to be careful about is what will happen if all the nodes finish their tasks at the same time. It would be worth putting some variability into when they can request a new task. ZooKeeper (http://hadoop.apache.org/zookeeper/) is also potentially something you can use to synchronise the jobs.
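A minimal sketch of that variability, assuming each worker pulls tasks in a loop; fetch_task() and process_task() are hypothetical placeholders:

```cpp
#include <chrono>
#include <random>
#include <thread>

// Sketch of the "variability" idea: each worker waits a small random amount
// of time before asking for its next task, so workers that finish together
// don't all hammer the Supervisor (or queue) at once.
void worker_loop() {
    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> jitter_ms(0, 500);

    while (true) {
        // auto task = fetch_task();   // hypothetical: ask the queue for work
        // process_task(task);         // hypothetical: do the work

        // Random back-off before the next request smooths out the burst
        // when many workers complete at the same moment.
        std::this_thread::sleep_for(std::chrono::milliseconds(jitter_ms(rng)));
    }
}
```

Even a few hundred milliseconds of random jitter is usually enough to spread out the load spike.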

Can you measure your network cost? The time spent doing the actual work on a worker machine is only part of the total; the message send and receive have a cost of their own.
Also, can you describe the big-O complexity of merging each worker's result into the master's result?
Does your master collect the expected responses round-robin?
By the way, if your worker nodes are finishing quicker but underutilizing their CPU resources, you may be missing a design trade-off.
Of course, you could be the rule or the exception to any law (argument / out-of-date research). ;-)

Underlying hardware mapping of Vulkan queues

Vulkan is intended to be thin and explicit to the user, but queues are a big exception to this rule: queues may be multiplexed by the driver, and it's not always obvious whether using multiple queues from a family will improve performance or not.
After one of the driver updates, I got 2 transfer-only queues instead of one, but I'm pretty sure there will be no benefit in using them in parallel for data streaming compared to just using one of them (I'll be happy to be proven wrong).
So why not just say "we have N separate hardware queues, and if you want to use some of them in parallel, just mutex it yourself"? Now it looks like there's no way to know how independent the queues in a family really are.
GPUs these days have to contend with a multi-processed world. Different programs can access the same hardware, and GPUs have to be able to deal with that. As such, having parallel input streams for a single piece of actual hardware is no different from being able to create more CPU threads than you have actual CPU cores.
That is, a queue from a family is probably not "mutexing" access to the actual hardware. At least, not in a CPU way. If multiple queues from a family are different paths to execute stuff on the same hardware, then the way that hardware gets populated from these multiple queues probably happens at the GPU level. That is, it's an actual hardware feature.
And you could never get performance equivalent to that hardware feature by "mutexing it yourself". For example:
I've got 2 transfer-only queues instead of one, but I'm pretty sure that there will be no benefit in using them in parallel for data streaming compared to just using one of them
Let's assume that there really is only one hardware DMA channel with a fixed bandwidth behind that transfer queue. This means that only one thing can be DMA'd from CPU memory to GPU memory at any one time.
Now, let's say you have some DMA work to do. You want to upload a bunch of stuff. But every now and then, you need to download some rendering product. And that download needs to complete ASAP, because you need to reuse the image that stores those bytes.
With prioritized queues, you can give the download transfer queue much higher priority than the upload queue. If the hardware permits it, then it can interrupt the upload to perform the download, then get back to the upload.
With your way, you'd have to upload each item one at a time at regular intervals, in a process that has to be interruptible by a possible download. To do that, you'd basically have to have a recurring task that shows up to perform and submit a single upload to the transfer queue.
It'd be much more efficient to just throw the work at the GPU and let its priority system take care of it. Even if there is no priority system, then it'll probably perform operations round-robin, jumping back and forth between the input transfer queue operations rather than waiting for one queue to run dry before trying another.
But of course, this is all hypothetical. You'd need to do profiling work to make sure that these things pan out.
The main issue with queues within families is that they sometimes represent distinct hardware with their own dedicated resources and sometimes they don't. AMD's hardware for example offers two transfer queues, but these actually use separate DMA channels. Granted, they probably still share the same overall bandwidth, but it's not a simple case of one queue having to wait to execute work until the other queue has executed a transfer command.
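For what it's worth, the relative priority of queues is declared when the logical device is created. A minimal sketch of requesting two transfer queues with different priorities; the family index is assumed to have been found beforehand, and error handling plus the rest of device setup are omitted:

```cpp
#include <vulkan/vulkan.h>

// Sketch: request two queues from one transfer-capable family, giving the
// "download" queue a higher priority than the "upload" queue.
VkDevice create_device_with_prioritized_transfer_queues(
        VkPhysicalDevice physical_device, uint32_t transfer_family_index) {
    // Priorities are normalized floats in [0.0, 1.0]; higher means the
    // implementation *may* favor that queue when scheduling work.
    const float priorities[2] = { 1.0f /* downloads */, 0.5f /* uploads */ };

    VkDeviceQueueCreateInfo queue_info{};
    queue_info.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queue_info.queueFamilyIndex = transfer_family_index;
    queue_info.queueCount = 2;
    queue_info.pQueuePriorities = priorities;

    VkDeviceCreateInfo device_info{};
    device_info.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    device_info.queueCreateInfoCount = 1;
    device_info.pQueueCreateInfos = &queue_info;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(physical_device, &device_info, nullptr, &device);

    VkQueue download_queue, upload_queue;
    vkGetDeviceQueue(device, transfer_family_index, 0, &download_queue);
    vkGetDeviceQueue(device, transfer_family_index, 1, &upload_queue);
    return device;
}
```

Whether the implementation actually honors the priority difference is up to the driver, which is why profiling is still required.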

Scheduling: real-time systems vs. online systems vs. batch systems

I'm trying to understand how their scheduling criteria work.
Why is a mix of I/O-bound and CPU-bound processes more important for batch systems?
Is preemptive scheduling important to all of them?
Thanks a lot for the help.
A mixed system typically allows the system manager to create batch queues into which those with appropriate privileges may submit jobs. The usual purpose of the batch queues is to use the CPU when interactive processes are not using it.
Batch queues are usually assigned priorities that override the normal user priorities. The system manager typically assigns batch queues priority levels that are lower than the normal interactive priority; if you set the priority low enough, the batch queue does not interfere with interactive users.
It is also possible to schedule the batch queues so that they only run at specified times (e.g., between 2AM and 6AM).
The system manager does not generally concern himself with whether jobs are I/O-bound or CPU-bound.

How to do performance testing of a RabbitMQ cluster for further fine-tuning?

I have created a RabbitMQ cluster which is successfully queuing messages generated by the application. I need to do performance testing of the cluster to find out its overall efficiency and make decisions about further fine-tuning to enhance performance. We tried the PerfTest Java tool but could not achieve much.
I guess the questions begin with: which interface are you looking to test? That will determine which tool(s) support that interface.
Are you looking to both push and pop?
How many queues?
How many producers and how many consumers? Will you create a slight vacuum with the consumers so that the queue set stays always or nearly empty?
How will you define efficiency? Is it defined by the number of items in the queue, the time to push to or pop from the queue, or some combination of the previous?

Should I try to use as many queues as possible?

On my machine I have two queue families, one that supports everything and one that only supports transfer.
The queue family that supports everything has a queueCount of 16.
Now the spec states
Command buffers submitted to different queues may execute in parallel or even out of order with respect to one another
Does that mean I should try to use all available queues for maximal performance?
Yes, if you have workloads that are highly independent, use separate queues.
If the queues need a lot of synchronization between themselves, it may kill any potential benefit you may get.
Basically, in the case of the same queue family, what you are doing is supplying the GPU with some alternative work it can do (to fill stalls, bubbles, and idle time with, giving the GPU the choice). And there is some potential to make better use of the CPU (e.g. single-threaded vs. one queue per thread).
Using separate transfer queues (or another specialized family) even seems to be the recommended approach.
That is generally speaking. A more realistic, empirical, sceptical, and practical view was already presented by the SW and NB answers. In reality one does have to be a bit more cautious, as those queues target the same resources, have the same limits, and share other common restrictions, which limits the potential benefit gained from this. Notably, if the driver does the wrong thing with multiple queues, it may be very bad for the cache.
AMD's Leveraging asynchronous queues for concurrent execution (2016) discusses a bit how this maps to their HW/driver. It shows the potential benefits of using separate queue families. It says that although they offer two queues in the compute family, they did not observe benefits in apps at that time. They also explain why they have only one graphics queue.
NVIDIA seems to have a similar idea of "async compute", shown in Moving to Vulkan: Asynchronous compute.
To be safe, it seems we should still stick with only one graphics queue and one async compute queue on current HW. 16 queues seem like a trap and a way to hurt yourself.
With transfer queues it is not as simple as it seems either. You should use the dedicated ones for Host->Device transfers, and the non-dedicated ones for Device->Device transfer ops.
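A minimal sketch of locating such a dedicated transfer family (one that reports transfer capability but neither graphics nor compute); note that a device is not guaranteed to expose one, so the caller has to handle the fallback:

```cpp
#include <vulkan/vulkan.h>
#include <vector>
#include <cstdint>

// Returns the index of a transfer-only queue family, or UINT32_MAX if the
// device exposes no such family. Such families often map to DMA engines.
uint32_t find_dedicated_transfer_family(VkPhysicalDevice physical_device) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physical_device, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(physical_device, &count, families.data());

    for (uint32_t i = 0; i < count; ++i) {
        const VkQueueFlags flags = families[i].queueFlags;
        if ((flags & VK_QUEUE_TRANSFER_BIT) &&
            !(flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)))
            return i;
    }
    return UINT32_MAX; // no dedicated transfer family on this device
}
```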
To what end?
Take the typical structure of a deferred renderer. You build your g-buffers, do your lighting passes, do some post-processing and tone mapping, maybe throw in some transparent stuff, and then present the final image. Each process depends on the previous process having completed before it can begin. You can't do your lighting passes until you've finished your g-buffer. And so forth.
How could you parallelize that across multiple queues of execution? You can't parallelize the g-buffer building or the lighting passes, since all of those commands are writing to the same attached images (and you can't do that from multiple queues). And if they're not writing to the same images, then you're going to have to pick a queue in which to combine the resulting images into the final one. Also, I have no idea how depth buffering would work without using the same depth buffer.
And that combination step would require synchronization.
Now, there are many tasks which can be parallelized. Doing frustum culling. Particle system updates. Memory transfers. Things like that; data which is intended for the next frame. But how many queues could you realistically keep busy at once? 3? Maybe 4?
Not to mention, you're going to need to build a rendering system which can scale. Vulkan does not require that implementations provide more than 1 queue. So your code needs to be able to run reasonably on a system that only offers one queue as well as a system that offers 16. And to take advantage of a 16 queue system, you might need to render very differently.
Oh, and be advised that if you ask for a bunch of queues, but don't use them, performance could be impacted. If you ask for 8 queues, the implementation has no choice but to assume that you intend to be able to issue 8 concurrent sets of commands. Which means that the hardware cannot dedicate all of its resources to a single queue. So if you only ever use 3 of them... you may be losing over 50% of your potential performance to resources that the implementation is waiting for you to use.
Granted, the implementation could scale such things dynamically. But unless you profile this particular case, you'll never know. Oh, and if it does scale dynamically... then you won't be gaining a whole lot from using multiple queues like this either.
Lastly, there has been some research into how effective multiple queue submissions can be at keeping the GPU fed, on several platforms (read all of the parts). The general long and short of it seems to be that:
Having multiple queues executing genuine rendering operations isn't helpful.
Having a single rendering queue with one or more compute queues (either as actual compute queues or graphics queues you submit compute work to) is useful at keeping execution units well saturated during rendering operations.
That strongly depends on your actual scenario and setup. It's hard to tell without any details.
If you submit command buffers to multiple queues you also need to do proper synchronization, and if that's not done right you may get actually worse performance than just using one queue.
Note that even if you submit to only one queue, an implementation may execute command buffers in parallel and even out of order (aka "in-flight"); see the details on this in chapter 2.2 of the spec or this AMD presentation.
If you do compute and graphics, using separate queues with simultaneous submissions (and proper synchronization) will improve performance on hardware that supports async compute.
So there is no definitive yes or no on this without knowing about your actual use case.
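As an illustration of that compute + graphics split, here is a sketch of one common pattern: the compute queue signals a semaphore, and the graphics submission waits on it at the first stage that consumes the compute results. All handles are assumed to have been created elsewhere, and the wait stage is just an example:

```cpp
#include <vulkan/vulkan.h>

// Sketch of an "async compute" frame: compute work signals `compute_done`,
// and the graphics submission waits on that semaphore before the stage that
// reads the compute output.
void submit_async_compute_frame(VkQueue compute_queue, VkQueue graphics_queue,
                                VkCommandBuffer compute_cmd, VkCommandBuffer graphics_cmd,
                                VkSemaphore compute_done) {
    // Compute submission: signal the semaphore when finished.
    VkSubmitInfo compute_submit{};
    compute_submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    compute_submit.commandBufferCount = 1;
    compute_submit.pCommandBuffers = &compute_cmd;
    compute_submit.signalSemaphoreCount = 1;
    compute_submit.pSignalSemaphores = &compute_done;
    vkQueueSubmit(compute_queue, 1, &compute_submit, VK_NULL_HANDLE);

    // Graphics submission: wait for the compute results before the shader
    // stage that consumes them (vertex shader here, purely as an example).
    VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT;
    VkSubmitInfo graphics_submit{};
    graphics_submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    graphics_submit.waitSemaphoreCount = 1;
    graphics_submit.pWaitSemaphores = &compute_done;
    graphics_submit.pWaitDstStageMask = &wait_stage;
    graphics_submit.commandBufferCount = 1;
    graphics_submit.pCommandBuffers = &graphics_cmd;
    vkQueueSubmit(graphics_queue, 1, &graphics_submit, VK_NULL_HANDLE);
}
```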
Since you can submit multiple independent workloads to the same queue, and there doesn't seem to be any implicit ordering guarantee among them, you don't really need more than one queue to saturate a queue family. So I guess the sole purpose of multiple queues is to allow for different priorities among the queues, as specified during device creation.
I know this answer is in direct contradiction to the accepted answer, but that answer fails to address the issue that you don't need more queues to send more parallel work to the device.
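To illustrate the point: several independent command buffers can go to one queue in a single call, and the implementation is free to overlap or reorder them unless you add synchronization yourself. The queue, fence, and command buffers here are assumed to exist already:

```cpp
#include <vulkan/vulkan.h>

// Sketch: three unrelated workloads handed to a single queue in one submit.
// Only explicit synchronization (semaphores, fences, barriers) imposes
// ordering between them.
void submit_independent_work(VkQueue queue,
                             VkCommandBuffer cmds[3],
                             VkFence fence) {
    VkSubmitInfo submit{};
    submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.commandBufferCount = 3;
    submit.pCommandBuffers = cmds;
    vkQueueSubmit(queue, 1, &submit, fence);
}
```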

Redis Cluster vs ZeroMQ in Pub/Sub, for horizontally scaled distributed systems

If I were to design a huge distributed system whose throughput should scale linearly with the number of subscribers and the number of channels in the system, which would be better?
1) Redis Cluster (only for the Redis 3.0 alpha; if it's in cluster mode, you can publish on one node and subscribe on a completely different node, and the messages will propagate and reach you). The complexity of PUBLISH is O(N+M), where N is the number of subscribed clients and M is the number of subscribed patterns in the system, but how does it scale in a Redis Cluster? I accept educated guesses on this.
2) ZeroMQ: since 3.x it does server-side filtering, so it also has some time complexity there, but I have not seen anything about it in the documentation. If I wanted to scale it, I could just have lots of servers publishing to whatever channels, and each subscriber would connect to all the servers and subscribe to the desired channel. That seems nice.
So which of those is better for horizontal scaling of a huge publisher system? What are other solutions I should look into? Remember, I want to minimize latency and maximize throughput while being able to scale horizontally.
You want to minimize latency, I guess. The number of channels is irrelevant. The key factors are, roughly, the number of publishers and subscribers, the message size, the number of messages per second per publisher, and the number of messages received by each subscriber. ZeroMQ can do several million small messages per second from one node to another; your bottleneck will be the network long before it's the software. Most high-volume pub/sub architectures therefore use something like PGM multicast, which ZeroMQ supports.
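For reference, a minimal sketch of the fan-in layout described in the question: one SUB socket connects to several publisher endpoints and subscribes to a single channel, relying on ZeroMQ's server-side filtering (3.x and later). The endpoints and the channel prefix are illustrative:

```cpp
#include <zmq.h>
#include <cstdio>
#include <cstring>

int main() {
    void *ctx = zmq_ctx_new();
    void *sub = zmq_socket(ctx, ZMQ_SUB);

    // One subscriber connects to every publisher (illustrative endpoints).
    const char *endpoints[] = { "tcp://pub1:5556", "tcp://pub2:5556" };
    for (const char *ep : endpoints)
        zmq_connect(sub, ep);

    // Subscribe to one channel prefix; since ZeroMQ 3.x, messages on other
    // channels are filtered on the publisher side.
    zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "prices.", strlen("prices."));

    char buf[256];
    while (true) {
        int n = zmq_recv(sub, buf, sizeof(buf) - 1, 0);
        if (n < 0) break;
        int len = n < (int)sizeof(buf) - 1 ? n : (int)sizeof(buf) - 1;
        buf[len] = '\0';
        std::printf("received: %s\n", buf);
    }

    zmq_close(sub);
    zmq_ctx_term(ctx);
    return 0;
}
```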
In Redis, as in ZeroMQ, the bottleneck will be the network. Redis can reach millions of messages per second, at least as many as, if not more than, ZeroMQ.
You should be aware that the current implementation of Redis Cluster distributes PUBLISH messages across all cluster nodes using the inter-node bus. This approach assumes that PUBLISH is extremely cheap on Redis (as explained in this issue on GitHub).
However, there is a small overhead involved, namely the inter-node communication, and as you scale up this overhead becomes more significant. There is another Redis Cluster implementation I'm aware of - please note it's a commercial one - in which channels or patterns are distributed across cluster nodes in a fashion similar to the way Redis keys are distributed. At least according to the vendor, this should save the overhead of inter-node communication and increase performance, but I have not benchmarked it myself.