We are looking into creating a distributed system for task execution in .NET (C#), where the tasks have priorities. There are a lot of options, and I would like to get your take on them. The options & their disadvantages are:
1) Amazon's SWF (Simple Workflow) - in .NET we can't use a framework such as Java's Flow framework, which simplifies things. This means a lot of boilerplate code. In addition, this offering from Amazon doesn't seem to be very popular (so: no community support, and it might eventually disappear)
2) Building our own on top of a queuing system
2.a) SQS - not really a FIFO, and using 2 queues (normal and high priority) won't give us granular control over the priorities (we might be able to live with that)
2.b) RabbitMQ - administrative overhead (setting it up, configuring it in cluster mode for reliability, etc)
3) I have received another suggestion to use "event driven" without queues. I can't see how that's possible, maybe someone can help clarify it for me? (Oh, and is it related to a technology called Akka (actor based)?)
Thank you
SQS is probably going to be the simplest - very little code is required, and the cost is extremely low and the setup time is minimal.
If 2 queues and hi/low priority is not enough then create 3 queues, or 5 queues or 10 queues - you can be as granular as you need to be.
You can have multiple worker machines scanning all the queues in priority order, or have some machines dedicated to processing just the high-priority queues; these machines could be bigger/faster if you want to process even quicker.
Another option is to have separate auto-scaling policies that spin up more/faster machines based on a small increase in the length of the high-priority queues, but only scale up smaller/cheaper machines when the low-priority queue gets very long. There are lots of options to choose from to fine-tune your solution.
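A minimal sketch of the "scan the queues in priority order" idea, assuming in-memory deques as hypothetical stand-ins for the SQS queues (the class and function names here are made up for illustration; a real worker would call the SQS receive API instead):

```python
from collections import deque


class PriorityQueues:
    """Hypothetical stand-in for N separate queues; index 0 is the highest priority."""

    def __init__(self, levels):
        self.queues = [deque() for _ in range(levels)]

    def send(self, priority, message):
        self.queues[priority].append(message)

    def receive(self):
        """Return (priority, message) from the first non-empty queue, or None."""
        for priority, q in enumerate(self.queues):
            if q:
                return priority, q.popleft()
        return None


def drain(pq):
    """Worker loop: rescan from the top each time so urgent work is taken first."""
    processed = []
    while (item := pq.receive()) is not None:
        processed.append(item)
    return processed


pq = PriorityQueues(levels=3)
pq.send(2, "rebuild-report")
pq.send(0, "charge-card")
pq.send(1, "send-email")
pq.send(0, "refund")
order = drain(pq)
print(order)
```

Adding more priority levels only changes the `levels` argument, which is the "as granular as you need" point made above. Within one level, messages come out in arrival order here; that ordering is a property of the deque, not something SQS promises.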
Related
I have created a RabbitMQ cluster which is successfully queuing messages generated by our application. I need to performance-test the cluster to find out its overall efficiency and decide what further fine-tuning would improve performance. We tried the PerfTest Java tool, but could not achieve much.
I guess the questions begin with which interface are you looking to test? That will decide your tool(s) which support that interface.
Are you looking to both push and pop?
How many queues?
How many producers and how many consumers? Will you create a slight vacuum with the consumers, so that the queues are always empty or nearly empty?
How will you define efficiency? Is this defined by the number of items in the queue, the time to push or pop from the queue, or some combination of these?
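To make the "how will you define efficiency" question concrete, here is a rough sketch that measures throughput and per-message latency for a given number of consumers. It assumes an in-process `queue.Queue` as a placeholder for the broker, so the numbers say nothing about a real RabbitMQ cluster; only the shape of the measurement carries over. The function name and parameters are hypothetical.

```python
import queue
import threading
import time


def run_benchmark(num_messages=1000, num_consumers=2):
    q = queue.Queue()
    latencies = []
    lock = threading.Lock()

    def consumer():
        while True:
            sent_at = q.get()
            if sent_at is None:          # sentinel: stop this consumer
                break
            with lock:
                latencies.append(time.perf_counter() - sent_at)

    workers = [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for w in workers:
        w.start()

    start = time.perf_counter()
    for _ in range(num_messages):
        q.put(time.perf_counter())       # the payload is the send timestamp
    for _ in workers:
        q.put(None)                      # one sentinel per consumer
    for w in workers:
        w.join()
    elapsed = time.perf_counter() - start

    latencies.sort()
    return {
        "messages": len(latencies),
        "throughput_per_s": num_messages / elapsed,
        "p50_latency_s": latencies[len(latencies) // 2],
        "max_latency_s": latencies[-1],
    }


result = run_benchmark()
print(result)
```

Varying `num_consumers` relative to the producer rate is one way to explore the "nearly empty queue" case versus a growing backlog.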
???
On my machine I have two queue families, one that supports everything and one that only supports transfer.
The queue family that supports everything has a queueCount of 16.
Now the spec states
Command buffers submitted to different queues may execute in parallel or even out of order with respect to one another
Does that mean I should try to use all available queues for maximal performance?
Yes, if you have workload that is highly independent use separate queues.
If the queues need a lot of synchronization between themselves, it may kill any potential benefit you may get.
Basically, when the queues are in the same queue family, what you are doing is supplying the GPU with alternative work it can choose from to fill stalls, bubbles and idle periods. There is also some potential to make better use of the CPU (e.g. one queue per thread instead of single-threaded submission).
Using separate transfer queues (or another specialized family) even seems to be the recommended approach.
That is generally speaking. A more realistic, empirical, sceptical and practical view was already presented in the answers by SW and NB. In reality one has to be a bit more cautious, as those queues target the same resources, have the same limits, and share other common restrictions, which limits the potential benefits. Notably, if the driver does the wrong thing with multiple queues, it may be very bad for the cache.
AMD's Leveraging asynchronous queues for concurrent execution (2016) discusses a bit how this maps to their hardware and driver. It shows the potential benefits of using separate queue families. It says that although they offer two queues in the compute family, they did not observe benefits from that in apps at that time. It also says that they have only one graphics queue, and explains why.
NVIDIA seems to have a similar idea of "async compute", shown in Moving to Vulkan: Asynchronous compute.
To be safe, it seems we should still stick with only one graphics queue and one async compute queue on current HW. 16 queues seem like a trap and a way to hurt yourself.
With transfer queues it is not as simple as it seems either. You should use the dedicated ones for host->device transfers, and the non-dedicated ones for device->device transfer ops.
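The "one graphics queue, one async compute queue, dedicated transfer for uploads" advice can be sketched as a family-selection routine. This is only an illustration in Python, assuming the queue family properties have already been read into `(flags, count)` tuples; the function name and the returned role names are hypothetical. The flag values are the ones defined by `VkQueueFlagBits`.

```python
# Flag values from the Vulkan spec's VkQueueFlagBits enum.
VK_QUEUE_GRAPHICS_BIT = 0x1
VK_QUEUE_COMPUTE_BIT = 0x2
VK_QUEUE_TRANSFER_BIT = 0x4


def pick_queue_families(families):
    """families: list of (queue_flags, queue_count), indexed by family index.

    Returns the family index to use for each (hypothetical) role.
    """
    graphics = async_compute = dedicated_transfer = None
    for index, (flags, count) in enumerate(families):
        if count == 0:
            continue
        has_graphics = bool(flags & VK_QUEUE_GRAPHICS_BIT)
        has_compute = bool(flags & VK_QUEUE_COMPUTE_BIT)
        has_transfer = bool(flags & VK_QUEUE_TRANSFER_BIT)
        if has_graphics and graphics is None:
            graphics = index
        if has_compute and not has_graphics and async_compute is None:
            async_compute = index
        if (has_transfer and not has_graphics and not has_compute
                and dedicated_transfer is None):
            dedicated_transfer = index
    if graphics is None:
        raise ValueError("no graphics-capable queue family found")
    # Assumption: missing specialized families fall back to the graphics
    # family, which is taken to be general-purpose in this sketch.
    return {
        "graphics": graphics,
        "async_compute": graphics if async_compute is None else async_compute,
        "upload_transfer": graphics if dedicated_transfer is None else dedicated_transfer,
        "device_copy": graphics,   # device->device copies stay on the non-dedicated queue
    }


# The asker's layout: one do-everything family with 16 queues, plus a
# transfer-only family (its queue count of 1 is an assumption).
roles = pick_queue_families([(0x7, 16), (0x4, 1)])
print(roles)
```

Note that the sketch never asks for more than one queue per role, regardless of the reported count of 16.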
To what end?
Take the typical structure of a deferred renderer. You build your g-buffers, do your lighting passes, do some post-processing and tone mapping, maybe throw in some transparent stuff, and then present the final image. Each process depends on the previous process having completed before it can begin. You can't do your lighting passes until you've finished your g-buffer. And so forth.
How could you parallelize that across multiple queues of execution? You can't parallelize the g-buffer building or the lighting passes, since all of those commands are writing to the same attached images (and you can't do that from multiple queues). And if they're not writing to the same images, then you're going to have to pick a queue in which to combine the resulting images into the final one. Also, I have no idea how depth buffering would work without using the same depth buffer.
And that combination step would require synchronization.
Now, there are many tasks which can be parallelized. Doing frustum culling. Particle system updates. Memory transfers. Things like that; data which is intended for the next frame. But how many queues could you realistically keep busy at once? 3? Maybe 4?
Not to mention, you're going to need to build a rendering system which can scale. Vulkan does not require that implementations provide more than 1 queue. So your code needs to be able to run reasonably on a system that only offers one queue as well as a system that offers 16. And to take advantage of a 16 queue system, you might need to render very differently.
Oh, and be advised that if you ask for a bunch of queues, but don't use them, performance could be impacted. If you ask for 8 queues, the implementation has no choice but to assume that you intend to be able to issue 8 concurrent sets of commands. Which means that the hardware cannot dedicate all of its resources to a single queue. So if you only ever use 3 of them... you may be losing over 50% of your potential performance to resources that the implementation is waiting for you to use.
Granted, the implementation could scale such things dynamically. But unless you profile this particular case, you'll never know. Oh, and if it does scale dynamically... then you won't be gaining a whole lot from using multiple queues like this either.
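The "over 50%" figure can be checked with a back-of-the-envelope calculation, assuming (as the worst case described above does) that the implementation statically splits its resources evenly across every requested queue. That even-split model is an assumption for illustration, not documented driver behaviour.

```python
def idle_fraction(requested_queues, used_queues):
    """Share of resources left idle under a static, even split per requested queue."""
    if requested_queues <= 0 or not 0 <= used_queues <= requested_queues:
        raise ValueError("need 0 <= used_queues <= requested_queues and requested_queues > 0")
    return (requested_queues - used_queues) / requested_queues


wasted = idle_fraction(requested_queues=8, used_queues=3)
print(wasted)  # 0.625, i.e. 5 of the 8 shares are never used
```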
Lastly, there has been some research into how effective multiple queue submissions can be at keeping the GPU fed, on several platforms (read all of the parts). The general long and short of it seems to be that:
Having multiple queues executing genuine rendering operations isn't helpful.
Having a single rendering queue with one or more compute queues (either as actual compute queues or graphics queues you submit compute work to) is useful at keeping execution units well saturated during rendering operations.
That strongly depends on your actual scenario and setup. It's hard to tell without any details.
If you submit command buffers to multiple queues you also need to do proper synchronization, and if that's not done right you may actually get worse performance than with just one queue.
Note that even if you submit to only one queue, an implementation may execute command buffers in parallel and even out of order (aka "in-flight"); see the details on this in chapter 2.2 of the spec or this AMD presentation.
If you do compute and graphics, using separate queues with simultaneous submissions (and synchronization) will improve performance on hardware that supports async compute.
So there is no definitive yes or no on this without knowing about your actual use case.
Since you can submit multiple independent workloads to the same queue, and there doesn't seem to be any implicit ordering guarantee among them, you don't really need more than one queue to saturate the queue family. So I guess the sole purpose of multiple queues is to allow for different priorities among the queues, as specified during device creation.
I know this answer is in direct contradiction to the accepted answer, but that answer fails to address the issue that you don't need more queues to send more parallel work to the device.
If I were to design a huge distributed system whose throughput should scale linearly with the number of subscribers and the number of channels in the system, which would be better?
1) Redis Cluster (only for Redis 3.0 alpha; if it's in cluster mode, you can publish on one node and subscribe on a completely different node, and the messages will propagate and reach you). The complexity of PUBLISH is O(N+M), where N is the number of subscribed clients and M is the number of subscribed patterns in the system, but how does it scale in a Redis Cluster? I accept educated guesses on this.
2) ZeroMQ. Since 3.x it does server-side filtering, so it also has some time complexity there, but I have not seen anything about it in the documentation. If I wanted to scale it, I could just have lots of servers publishing to whatever channels, and each subscriber would connect to all the servers and subscribe to the desired channel. That seems nice.
So which of those is better for horizontal scaling of a huge publisher system? What other solutions should I look into? Remember, I want to minimize latency and maximize throughput, while being able to scale horizontally.
You want to minimize latency, I guess. The number of channels is irrelevant. The key factors are the number of publishers and number of subscribers, message size, number of messages per second per publisher, number of messages received by each subscriber, roughly. ZeroMQ can do several million small messages per second from one node to another; your bottleneck will be the network long before it's the software. Most high-volume pubsub architectures therefore use something like PGM multicast, which ZeroMQ supports.
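A rough worked example of why the network becomes the bottleneck, and why multicast helps. It assumes a 1 Gbit/s link and 100-byte messages (both illustrative figures, not from the answer), ignores protocol overhead unless you pass `overhead_bytes`, and models unicast fan-out as one copy sent per subscriber.

```python
def max_messages_per_second(link_bits_per_s, message_bytes, overhead_bytes=0):
    """Upper bound on messages/s that fit through one link."""
    return link_bits_per_s / 8 / (message_bytes + overhead_bytes)


def publish_rate(link_bits_per_s, message_bytes, subscribers, multicast):
    """Distinct messages/s one publisher can emit before saturating its link."""
    per_link = max_messages_per_second(link_bits_per_s, message_bytes)
    # Unicast: every message is sent once per subscriber.
    # Multicast: every message is sent once, whatever the subscriber count.
    return per_link if multicast else per_link / subscribers


link_limit = max_messages_per_second(1_000_000_000, 100)
unicast = publish_rate(1_000_000_000, 100, subscribers=10, multicast=False)
multicast = publish_rate(1_000_000_000, 100, subscribers=10, multicast=True)
print(link_limit, unicast, multicast)
```

Under these assumptions the link tops out at about 1.25 million small messages per second, already below the "several million" the software can handle, and unicast fan-out to 10 subscribers cuts the publish rate by a factor of 10.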
In Redis, like in ZeroMQ, the bottleneck will be the network. Redis can reach millions of messages per second, at least as much if not more than ZeroMQ.
You should be aware that the current implementation of Redis Cluster distributes PUBLISH messages across all cluster nodes using the inter-node bus. This approach assumes that PUBLISH is extremely cheap on Redis (as explained in this issue on Github).
However, there is a small overhead involved which is inter-node communication. As you scale up this overhead will be more significant. There is another Redis Cluster implementation I'm aware of - please note it's a commercial one - in which channels or patterns are distributed across cluster nodes in a similar fashion to the way Redis keys are distributed. At least according to the vendor, this should save the overhead of inter-node communication and increase performance, but I have not benchmarked it myself.
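To illustrate what "distributed in a similar fashion to the way Redis keys are distributed" could look like, here is a sketch that maps a channel name to a hash slot the way Redis Cluster maps keys: CRC16 (XMODEM variant) modulo 16384. Applying this to channels, and the even slot-range split across nodes, are assumptions for illustration; the vendor's actual scheme is not described in the answer. Hash tags (`{...}`) are ignored here.

```python
import binascii

HASH_SLOTS = 16384  # number of hash slots in Redis Cluster


def key_slot(name):
    """Slot for a key or channel name, ignoring {hash tag} handling."""
    # binascii.crc_hqx with an initial value of 0 computes CRC-16/XMODEM,
    # the CRC16 variant described in the Redis Cluster specification.
    return binascii.crc_hqx(name.encode("utf-8"), 0) % HASH_SLOTS


def node_for_channel(channel, nodes):
    """Hypothetical routing: split the slot range evenly across the nodes."""
    return nodes[key_slot(channel) * len(nodes) // HASH_SLOTS]


nodes = ["node-a", "node-b", "node-c"]
print(key_slot("123456789"), node_for_channel("123456789", nodes))
```

With this kind of routing a PUBLISH only needs to reach the node that owns the channel's slot, instead of being broadcast over the inter-node bus.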
Quartz and RabbitMQ: what is the difference between these technologies?
Can they be used together?
Could these technologies be installed on the hardware that hosts the web server or is it better to have dedicated hardware for them?
Let's first assume you mean Quartz, a scheduler, not Quartz, a Mac OS X graphics layer. ;)
RabbitMQ is a message queue. Message queues make sure messages reach their destination, persisting during downtimes and load-balancing between multiple worker processes. You generally want a message queue if you have several processes doing different types of work and you need a way to distribute the work load.
Quartz is a scheduler. Schedulers make sure events happen at the right time, possibly ensuring one event is properly executed before another may start, or catching up with the schedule after a downtime. You generally want a scheduler if the basic OS capabilities like crontab etc. are not sufficient for your needs.
Combining the two concepts can be powerful: have the scheduler trigger events or chains of events into the message queue, and have many workers listen on their respective queues to perform the assigned tasks.
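A small sketch of that combination, using Python's standard-library `sched` and `queue` modules as stand-ins for Quartz and RabbitMQ. The job names, delays and worker count are made up for illustration.

```python
import queue
import sched
import threading
import time

work_queue = queue.Queue()       # stand-in for the message queue
results = []
results_lock = threading.Lock()


def worker(name):
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = work_queue.get()
        if job is None:
            break
        with results_lock:
            results.append((name, job))


def enqueue(job):
    """What the scheduler does when a job is due: hand it to the queue."""
    work_queue.put(job)


scheduler = sched.scheduler(time.monotonic, time.sleep)   # stand-in for the scheduler
scheduler.enter(0.01, 1, enqueue, argument=("nightly-report",))
scheduler.enter(0.02, 1, enqueue, argument=("cache-cleanup",))
scheduler.enter(0.03, 1, enqueue, argument=("send-digest",))

workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(2)]
for w in workers:
    w.start()

scheduler.run()                  # blocks until every scheduled event has fired
for _ in workers:
    work_queue.put(None)         # one sentinel per worker
for w in workers:
    w.join()

jobs_done = sorted(job for _, job in results)
print(jobs_done)
```

The scheduler only decides when a job is due; which worker picks it up is left to the queue, which is the division of labour described above.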
Depending on what you want to achieve, it may be perfectly ok to have everything on the same machine. When you experience poor performance you can decide if you want a bigger machine or distribute the work load on many smaller ones.
You may want to look at the tutorials on RabbitMQ's and Quartz's web sites to see if either or both things are right for your purpose.
I am looking for a message queue which would replicate messages across a cluster of servers. I am aware that this will cause a performance hit, but that's what the requirements are - message persistence is very important.
The replication can be asynchronous, but it should be there - if there's a large backlog of messages waiting for processing, they shouldn't be lost.
So far I didn't manage to find anything from the well-known MQs. HornetQ for example supported message replication in 2.0 but in 2.2 it seems to be removed. RabbitMQ doesn't replicate messages at all, etc.
Is there anything out there that could meet my requirements?
There are at least three ways of tackling this that come to mind, depending upon how robust you need the solution to be.
One: pick any messaging tech, then replicate your disk-storage. Using something like DRBD you can have the file-backed storage copied to another machine under the covers. If your primary box dies, you should be able to restart on your second machine from the replicated files.
Two: Keep looking. There are various commercial systems that definitely do this, two such (no financial benefit on my part) are Informatica Ultra Messaging (formerly 29West) and Solace. These are commonly used in the financial community.
Three: build your own. ZeroMQ is one such toolkit that you could use to roll-your-own system from pre-built messaging blocks. Even a system that does not officially support it could fairly easily be configured to publish all messages to two queues. Your reader would have to drain both somehow, so this may well be a non-starter, but possible in any case.
Overall: do test your performance assumptions, as all of these will have various performance implications in various scenarios.
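The "publish all messages to two queues" idea from option three could look roughly like this. It assumes in-memory deques as hypothetical stand-ins for two independent brokers, and a message id attached by the publisher so that the reader can drop duplicates while draining both.

```python
import uuid
from collections import deque

primary, replica = deque(), deque()   # stand-ins for two independent queues


def publish(body):
    """Write the same message, tagged with a unique id, to both queues."""
    msg = {"id": str(uuid.uuid4()), "body": body}
    primary.append(msg)
    replica.append(msg)
    return msg["id"]


def drain(queues, seen):
    """Read every queue, delivering each message id at most once."""
    delivered = []
    for q in queues:
        while q:
            msg = q.popleft()
            if msg["id"] in seen:
                continue              # already delivered from the other queue
            seen.add(msg["id"])
            delivered.append(msg["body"])
    return delivered


publish("order-1")
publish("order-2")
primary.pop()                         # simulate the primary losing its newest message

delivered = drain([primary, replica], seen=set())
print(delivered)
```

In a real system the `seen` set would itself need to be persisted and bounded, which is part of why this may well be a non-starter.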
Amazon SQS is designed with this very thing in mind, but because of the consistency model (which is a part of messaging anyway), you're responsible for de-duplicating messages on the consumer side. Granted, SQS may be somewhat slow and the costs can add up for lots of messages, but if you want to guarantee that no messages are lost, then it's a pretty solid way to go.
The new Kafka 0.8.1 offers replication!