Quartz and RabbitMQ: what is the difference between these technologies?
Can they be used together?
Could these technologies be installed on the hardware that hosts the web server or is it better to have dedicated hardware for them?
Let's first assume you mean Quartz, a scheduler, not Quartz, a Mac OS X graphics layer. ;)
RabbitMQ is a message queue. Message queues make sure messages reach their destination, persisting during downtimes and load-balancing between multiple worker processes. You generally want a message queue if you have several processes doing different types of work and you need a way to distribute the work load.
Quartz is a scheduler. Schedulers make sure events happen at the right time, possibly ensuring one event is properly executed before another may start, or catching up with the schedule after a downtime. You generally want a scheduler if the basic OS capabilities like crontab etc. are not sufficient for your needs.
Combining the two concepts can be powerful: have the scheduler trigger events or chains of events into the message queue, and have many workers listen on their respective queues to perform the assigned tasks.
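To make the combination concrete, here is a minimal sketch using only the Python standard library. It assumes `sched.scheduler` as a stand-in for Quartz and an in-process `queue.Queue` as a stand-in for a RabbitMQ queue; the function name `run_demo` and the job names are made up for illustration, and none of this is the Quartz or RabbitMQ API.

```python
import queue
import sched
import threading
import time


def run_demo(num_workers=2, num_jobs=4):
    """Schedule num_jobs events, push each into a queue, let workers consume them."""
    jobs = queue.Queue()      # stand-in for a message queue (assumption)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = jobs.get()
            if job is None:   # sentinel: no more work for this worker
                return
            with lock:
                results.append(job)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # Stand-in for the scheduler: each event only enqueues a message.
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    for i in range(num_jobs):
        scheduler.enter(0.01 * i, 1, jobs.put, argument=(f"job-{i}",))
    scheduler.run()           # blocks until every scheduled event has fired

    for _ in threads:         # one sentinel per worker
        jobs.put(None)
    for t in threads:
        t.join()
    return sorted(results)
```

The point of the split is that the scheduler never does the work itself: it only decides when a message is emitted, and the queue decides which worker picks it up.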
Depending on what you want to achieve, it may be perfectly ok to have everything on the same machine. When you experience poor performance you can decide if you want a bigger machine or distribute the work load on many smaller ones.
You may want to look at the tutorials on RabbitMQ's and Quartz's web sites to see if either or both things are right for your purpose.
Vulkan is intended to be thin and explicit to the user, but queues are a big exception to this rule: queues may be multiplexed by the driver, and it's not always obvious whether using multiple queues from a family will improve performance.
After one of the driver updates, I've got 2 transfer-only queues instead of one, but I'm pretty sure that there will be no benefit in using them in parallel for data streaming compared to just using one of them (I'd be happy to be proved wrong).
So why not just say "we have N separate hardware queues and if you want to use some of them in parallel, just mutex it yourself"? As it stands, there seems to be no way to know how independent the queues in a family really are.
GPUs these days have to contend with a multi-processed world. Different programs can access the same hardware, and GPUs have to be able to deal with that. As such, having parallel input streams for a single piece of actual hardware is no different from being able to create more CPU threads than you have actual CPU cores.
That is, a queue from a family is probably not "mutexing" access to the actual hardware. At least, not in a CPU way. If multiple queues from a family are different paths to execute stuff on the same hardware, then the way that hardware gets populated from these multiple queues probably happens at the GPU level. That is, it's an actual hardware feature.
And you could never get performance equivalent to that hardware feature by "mutexing it yourself". For example:
I've got 2 transfer-only queues instead of one, but I'm pretty sure that there will be no benefit in using them in parallel for data streaming compared to just using one of them
Let's assume that there really is only one hardware DMA channel with a fixed bandwidth behind that transfer queue. This means that, at any one time, only one thing can be DMA'd from CPU memory to GPU memory.
Now, let's say you have some DMA work to do. You want to upload a bunch of stuff. But every now and then, you need to download some rendering product. And that download needs to complete ASAP, because you need to reuse the image that stores those bytes.
With prioritized queues, you can give the download transfer queue much higher priority than the upload queue. If the hardware permits it, then it can interrupt the upload to perform the download, then get back to the upload.
With your way, you'd have to upload each item one at a time at regular intervals, in a process that can be interrupted by a possible download. To do that, you'd basically need a recurring task that shows up to perform and submit a single upload to the transfer queue.
It'd be much more efficient to just throw the work at the GPU and let its priority system take care of it. Even if there is no priority system, then it'll probably perform operations round-robin, jumping back and forth between the input transfer queue operations rather than waiting for one queue to run dry before trying another.
But of course, this is all hypothetical. You'd need to do profiling work to make sure that these things pan out.
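The upload/download scenario above can be put into a toy model. This is only a sketch under loud assumptions: one DMA channel, arbitrary time units, all uploads submitted at t=0, and a download that can cut in only at an upload boundary when priorities are honoured. The function name and parameters are made up for illustration; real hardware scheduling is not specified anywhere here.

```python
def download_finish_time(upload_chunks, download_arrival, download_cost, prioritized):
    """Return when the download completes on a single (hypothetical) DMA channel.

    upload_chunks    -- cost of each upload, all submitted at t=0
    download_arrival -- time at which the download is submitted
    download_cost    -- time the download itself takes
    prioritized      -- True: download jumps ahead at the next upload boundary
                        False: download waits behind every queued upload
    """
    t = 0
    pending = list(upload_chunks)
    while pending:
        if prioritized and t >= download_arrival:
            break                 # the high-priority download goes first
        t += pending.pop(0)       # channel is busy with one upload
    start = max(t, download_arrival)
    return start + download_cost
```

With ten uploads of 4 units each and a 2-unit download arriving at t=5, the prioritized case finishes the download at t=10, while the single-FIFO case finishes it at t=42. The numbers mean nothing by themselves; they only show why handing the ordering decision to the device can beat pre-ordering everything on the CPU.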
The main issue with queues within families is that they sometimes represent distinct hardware with their own dedicated resources and sometimes they don't. AMD's hardware for example offers two transfer queues, but these actually use separate DMA channels. Granted, they probably still share the same overall bandwidth, but it's not a simple case of one queue having to wait to execute work until the other queue has executed a transfer command.
I have created a RabbitMQ cluster which is successfully queuing messages generated by our application. I need to performance-test the cluster to find out its overall efficiency and decide what further fine-tuning would improve performance. We tried the PerfTest Java tool, but could not achieve much.
I guess the questions begin with which interface are you looking to test? That will decide your tool(s) which support that interface.
Are you looking to both push and pop?
How many queues?
How many producers and how many consumers? Will you create a slight vacuum with the consumers to keep the queue set always or nearly empty?
How will you define efficiency? Is this defined by the number of items in the queue, the time to push to or pop from the queue, or some combination of these?
???
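Once those questions are answered, the measurement itself can be quite small. Below is a rough sketch of a harness that records throughput and per-message latency for a given number of consumers. It assumes an in-process `queue.Queue` as a stand-in for the broker, so it does not speak AMQP and says nothing about a real cluster; the function name `measure` and the result keys are made up for illustration.

```python
import queue
import threading
import time


def measure(num_messages=1000, num_consumers=2):
    """Push timestamps through a queue and report throughput and latency."""
    q = queue.Queue()             # stand-in for the broker (assumption)
    latencies = []
    lock = threading.Lock()

    def consumer():
        while True:
            sent_at = q.get()
            if sent_at is None:   # sentinel: stop this consumer
                return
            elapsed = time.perf_counter() - sent_at
            with lock:
                latencies.append(elapsed)

    threads = [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for t in threads:
        t.start()

    start = time.perf_counter()
    for _ in range(num_messages):
        q.put(time.perf_counter())    # the message body is its send time
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
    total = time.perf_counter() - start

    latencies.sort()
    return {
        "messages": len(latencies),
        "throughput_per_s": len(latencies) / total,
        "median_latency_s": latencies[len(latencies) // 2],
        "max_latency_s": latencies[-1],
    }
```

Running it with different producer/consumer ratios gives a feel for which of the definitions of "efficiency" above you actually care about, before pointing a real load generator at the cluster.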
On my machine I have two queue families, one that supports everything and one that only supports transfer.
The queue family that supports everything has a queueCount of 16.
Now the spec states
Command buffers submitted to different queues may execute in parallel or even out of order with respect to one another
Does that mean I should try to use all available queues for maximal performance?
Yes, if you have workloads that are highly independent, use separate queues.
If the queues need a lot of synchronization between themselves, it may kill any potential benefit you may get.
Basically, within the same queue family, what you are doing is supplying the GPU with some alternative work it can choose from to fill stalls, bubbles and idle time. There is also some potential to make better use of the CPU (e.g. single-threaded submission vs. one queue per thread).
Using separate transfer queues (or another specialized family) even seems to be the recommended approach.
That is generally speaking. A more realistic, empirical, sceptical and practical view was already presented in the answers by SW and NB. In reality one has to be a bit more cautious, as those queues target the same resources and have the same limits and other common restrictions, which limits the potential benefit. Notably, if the driver does the wrong thing with multiple queues, it may be very bad for the cache.
AMD's Leveraging asynchronous queues for concurrent execution (2016) discusses a bit how this maps to their HW/driver. It shows the potential benefits of using separate queue families. It says that although they offer two queues in the compute family, they did not observe benefits in apps at that time. It also says they have only one graphics queue, and explains why.
NVIDIA seems to have a similar idea of "async compute", shown in Moving to Vulkan: Asynchronous compute.
To be safe, it seems we should still stick with only one graphics queue and one async compute queue on current HW. 16 queues seem like a trap and a way to hurt yourself.
With transfer queues it is not as simple as it seems either. You should use the dedicated ones for host->device transfers, and the non-dedicated ones for device->device transfer ops.
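The transfer-queue advice can be sketched as a small selection routine. The three bit values are the ones Vulkan defines for `VkQueueFlagBits`; everything else is an assumption for illustration: the input is just a list of `queueFlags` integers, one per family index, as you would have read them from `vkGetPhysicalDeviceQueueFamilyProperties`, and the function name is made up.

```python
VK_QUEUE_GRAPHICS_BIT = 0x1
VK_QUEUE_COMPUTE_BIT = 0x2
VK_QUEUE_TRANSFER_BIT = 0x4


def pick_transfer_families(family_flags):
    """Return (host_to_device_family, device_to_device_family) as family indices.

    family_flags -- list of queueFlags integers, one per queue family
                    (hypothetical input; a real app reads these from the driver)
    """
    dedicated = None   # transfer-only family, preferred for host->device copies
    general = None     # graphics/compute family, used for device->device copies
    for index, flags in enumerate(family_flags):
        gfx_or_compute = flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)
        if (flags & VK_QUEUE_TRANSFER_BIT) and not gfx_or_compute:
            if dedicated is None:
                dedicated = index
        elif gfx_or_compute:
            # Graphics/compute families can run transfer commands even when
            # the transfer bit is not reported separately.
            if general is None:
                general = index
    # Fall back to whatever exists if one kind of family is missing.
    if dedicated is None:
        dedicated = general
    if general is None:
        general = dedicated
    return dedicated, general
```

For the setup in the question (one family that supports everything, one that only supports transfer), `pick_transfer_families([0x7, 0x4])` gives family 1 for host->device and family 0 for device->device.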
To what end?
Take the typical structure of a deferred renderer. You build your g-buffers, do your lighting passes, do some post-processing and tone mapping, maybe throw in some transparent stuff, and then present the final image. Each process depends on the previous process having completed before it can begin. You can't do your lighting passes until you've finished your g-buffer. And so forth.
How could you parallelize that across multiple queues of execution? You can't parallelize the g-buffer building or the lighting passes, since all of those commands are writing to the same attached images (and you can't do that from multiple queues). And if they're not writing to the same images, then you're going to have to pick a queue in which to combine the resulting images into the final one. Also, I have no idea how depth buffering would work without using the same depth buffer.
And that combination step would require synchronization.
Now, there are many tasks which can be parallelized. Doing frustum culling. Particle system updates. Memory transfers. Things like that; data which is intended for the next frame. But how many queues could you realistically keep busy at once? 3? Maybe 4?
Not to mention, you're going to need to build a rendering system which can scale. Vulkan does not require that implementations provide more than 1 queue. So your code needs to be able to run reasonably on a system that only offers one queue as well as a system that offers 16. And to take advantage of a 16 queue system, you might need to render very differently.
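One way to keep the code scalable is to think in logical work streams and map them onto however many queues you actually got. A minimal sketch, assuming the stream names and the function `assign_queues` are purely illustrative and that streams are listed in order of importance:

```python
def assign_queues(streams, available_queues):
    """Map logical work streams to queue indices, sharing when queues run out.

    streams          -- stream names, most important first (hypothetical names)
    available_queues -- number of queues the implementation actually offers
    """
    if available_queues < 1:
        raise ValueError("need at least one queue")
    # Streams beyond the available count all share the last queue.
    return {name: min(i, available_queues - 1) for i, name in enumerate(streams)}
```

With `["graphics", "compute", "transfer"]` this yields everything on queue 0 for a one-queue device, `{graphics: 0, compute: 1, transfer: 1}` for two queues, and one queue each from three queues upward. Note that streams sharing a queue must not submit to it from several threads at once without external synchronization.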
Oh, and be advised that if you ask for a bunch of queues, but don't use them, performance could be impacted. If you ask for 8 queues, the implementation has no choice but to assume that you intend to be able to issue 8 concurrent sets of commands. Which means that the hardware cannot dedicate all of its resources to a single queue. So if you only ever use 3 of them... you may be losing over 50% of your potential performance to resources that the implementation is waiting for you to use.
Granted, the implementation could scale such things dynamically. But unless you profile this particular case, you'll never know. Oh, and if it does scale dynamically... then you won't be gaining a whole lot from using multiple queues like this either.
Lastly, there has been some research into how effective multiple queue submissions can be at keeping the GPU fed, on several platforms (read all of the parts). The general long and short of it seems to be that:
Having multiple queues executing genuine rendering operations isn't helpful.
Having a single rendering queue with one or more compute queues (either as actual compute queues or graphics queues you submit compute work to) is useful at keeping execution units well saturated during rendering operations.
That strongly depends on your actual scenario and setup. It's hard to tell without any details.
If you submit command buffers to multiple queues you also need to do proper synchronization, and if that's not done right you may actually get worse performance than with just one queue.
Note that even if you submit to only one queue, an implementation may execute command buffers in parallel and even out of order (aka "in-flight"); see the details on this in chapter 2.2 of the spec or this AMD presentation.
If you do compute and graphics, using separate queues with simultaneous submissions (and synchronization) will improve performance on hardware that supports async compute.
So there is no definitive yes or no on this without knowing about your actual use case.
Since you can submit multiple independent workloads to the same queue, and there doesn't seem to be any implicit ordering guarantee among them, you don't really need more than one queue to saturate the queue family. So I guess the sole purpose of multiple queues is to allow for different priorities among the queues, as specified during device creation.
I know this answer is in direct contradiction to the accepted answer, but that answer fails to address the issue that you don't need more queues to send more parallel work to the device.
We are looking into creating a distributed system for task execution, where the tasks have priorities in .NET (C#). There are a lot of options, I would like to get your take on it. The options & their disadvantages are:
1) Amazon's SWF (Simple Workflow) - in .NET we can't use a framework such as Java's Flow, which simplifies things, so this means a lot of boilerplate code. In addition, this offering from Amazon doesn't seem to be very popular (so: no community support, and it might eventually disappear)
2) Building our own on top of a queuing system
2.a) SQS - not really a FIFO, and using 2 queues (normal and high priority) won't give us granular control over the priorities (we might be able to live with that)
2.b) RabbitMQ - administrative overhead (setting it up, configuring it in cluster mode for reliability, etc)
3) I have received another suggestion to use "event driven" without queues. I can't see how that's possible; maybe someone can help clarify it for me? (Oh, and is it related to a technology called Akka (actor-based)?)
Thank you
SQS is probably going to be the simplest - very little code is required, and the cost is extremely low and the setup time is minimal.
If 2 queues and hi/low priority is not enough then create 3 queues, or 5 queues or 10 queues - you can be as granular as you need to be.
You can have multiple worker machines scanning all the queues in priority order, or have some machines dedicated to processing only the high-priority queues, and these machines could be bigger/faster if you want to process even quicker.
Another option is to have separate auto-scaling policies that spin up more/faster machines on a small increase in the length of the high-priority queues, but only scale up smaller/cheaper machines when the low-priority queue gets very long. There are lots of options to choose from to fine-tune your solution.
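The "scan all the queues in priority order" worker described above boils down to a few lines. This sketch assumes in-memory `deque` objects as stand-ins for the SQS queues; a real worker would call the SQS receive API at that point, which is not shown here, and the function name `next_task` is made up.

```python
from collections import deque


def next_task(queues_by_priority):
    """Return (priority_index, task) from the first non-empty queue, else None.

    queues_by_priority -- list of deques, highest priority first
                          (stand-ins for separate SQS queues)
    """
    for priority, q in enumerate(queues_by_priority):
        if q:
            return priority, q.popleft()
    return None
```

Because the scan restarts from the top on every call, a task that lands in a higher-priority queue is always picked up before anything still waiting in a lower one. Adding a priority level is just adding another queue to the list.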
I am looking for a message queue which would replicate messages across a cluster of servers. I am aware that this will cause a performance hit, but that's what the requirements are - message persistence is very important.
The replication can be asynchronous, but it should be there - if there's a large backlog of messages waiting for processing, they shouldn't be lost.
So far I haven't managed to find anything among the well-known MQs. HornetQ, for example, supported message replication in 2.0, but in 2.2 it seems to have been removed. RabbitMQ doesn't replicate messages at all, etc.
Is there anything out there that could meet my requirements?
There are at least three ways of tackling this that come to mind, depending upon how robust you need the solution to be.
One: pick any messaging tech, then replicate your disk-storage. Using something like DRBD you can have the file-backed storage copied to another machine under the covers. If your primary box dies, you should be able to restart on your second machine from the replicated files.
Two: Keep looking. There are various commercial systems that definitely do this, two such (no financial benefit on my part) are Informatica Ultra Messaging (formerly 29West) and Solace. These are commonly used in the financial community.
Three: build your own. ZeroMQ is one such toolkit that you could use to roll-your-own system from pre-built messaging blocks. Even a system that does not officially support it could fairly easily be configured to publish all messages to two queues. Your reader would have to drain both somehow, so this may well be a non-starter, but possible in any case.
Overall: do test your performance assumptions, as all of these will have various performance implications in various scenarios.
Amazon SQS is designed with this very thing in mind, but because of the consistency model (which is a part of messaging anyway), you're responsible for de-duplicating messages on the consumer side. Granted, SQS may be somewhat slow and the costs can add up for lots of messages, but if you want to guarantee that no messages are lost, then it's a pretty solid way to go.
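Consumer-side de-duplication can be as simple as remembering which message IDs have already been handled. A minimal sketch, assuming every message carries a unique ID and that an in-memory set is good enough; the class name `DedupingConsumer` and its method are made up for illustration and are not part of any SQS client library.

```python
class DedupingConsumer:
    """Run a handler at most once per message ID (illustrative only)."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()    # assumption: fits in memory and need not survive restarts

    def receive(self, message_id, body):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(body)
        # Mark as seen only after the handler succeeded, so a failed
        # message can be redelivered and retried.
        self._seen.add(message_id)
        return True
```

The same idea covers the "publish to two queues and drain both" approach: feed both queues into one consumer and the second copy of each message is dropped. In production the seen-set would need to be persistent and to expire old IDs, otherwise it grows without bound.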
The new Kafka 0.8.1 offers replication!