Should I try to use as many queues as possible? - vulkan

On my machine I have two queue families, one that supports everything and one that only supports transfer.
The queue family that supports everything has a queueCount of 16.
Now the spec states:
"Command buffers submitted to different queues may execute in parallel or even out of order with respect to one another"
Does that mean I should try to use all available queues for maximal performance?

Yes, if you have workloads that are highly independent, use separate queues.
If the queues need a lot of synchronization between themselves, it may kill any potential benefit you may get.
Within a single queue family, what you are doing is giving the GPU alternative work to choose from, so it can fill stalls, bubbles, and idle time. There is also some potential to make better use of the CPU (e.g. single-threaded submission vs. one queue per thread).
Using separate transfer queues (or another specialized family) even seems to be the recommended approach.
That is generally speaking. A more realistic, empirical, sceptical and practical view was already presented in SW's and NB's answers. In reality one has to be a bit more cautious, as those queues target the same resources, have the same limits, and share other common restrictions, all of which limit the potential benefit. Notably, if the driver does the wrong thing with multiple queues, it may be very bad for the cache.
AMD's Leveraging asynchronous queues for concurrent execution (2016) discusses a bit how this maps to their HW/driver. It shows potential benefits of using separate queue families. It says that although they offer two queues in the compute family, they did not observe benefits in apps at that time. They say they have only one graphics queue, and explain why.
NVIDIA seems to have a similar idea of "async compute", shown in Moving to Vulkan: Asynchronous compute.
To be safe, it seems we should still stick with only one graphics queue and one async compute queue on current HW. 16 queues seem like a trap and a way to hurt yourself.
With transfer queues it is not as simple as it seems either. You should use the dedicated ones for host->device transfers, and the non-dedicated ones for device->device transfer ops.
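For what it's worth, here is a minimal sketch of how one might look for a dedicated transfer family like the second one on the asker's machine. It does not include the Vulkan headers: QueueFamily is a hypothetical stand-in for VkQueueFamilyProperties, and the three flag constants are assumed to mirror VK_QUEUE_GRAPHICS_BIT, VK_QUEUE_COMPUTE_BIT and VK_QUEUE_TRANSFER_BIT. In a real app you would fill the array from vkGetPhysicalDeviceQueueFamilyProperties.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror VK_QUEUE_GRAPHICS_BIT, VK_QUEUE_COMPUTE_BIT and
   VK_QUEUE_TRANSFER_BIT from the Vulkan headers. */
enum { QUEUE_GRAPHICS = 0x1, QUEUE_COMPUTE = 0x2, QUEUE_TRANSFER = 0x4 };

/* Hypothetical stand-in for VkQueueFamilyProperties. */
typedef struct {
    uint32_t flags;
    uint32_t queue_count;
} QueueFamily;

/* Returns the index of the first family that supports transfer but neither
   graphics nor compute, or -1 if the device exposes no such family. */
int find_dedicated_transfer_family(const QueueFamily *families, size_t n)
{
    assert(families != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        uint32_t f = families[i].flags;
        if (families[i].queue_count > 0 &&
            (f & QUEUE_TRANSFER) &&
            !(f & (QUEUE_GRAPHICS | QUEUE_COMPUTE))) {
            return (int)i;
        }
    }
    return -1;
}
```

If this returns -1, the host->device uploads simply go through a general-purpose family instead.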

To what end?
Take the typical structure of a deferred renderer. You build your g-buffers, do your lighting passes, do some post-processing and tone mapping, maybe throw in some transparent stuff, and then present the final image. Each process depends on the previous process having completed before it can begin. You can't do your lighting passes until you've finished your g-buffer. And so forth.
How could you parallelize that across multiple queues of execution? You can't parallelize the g-buffer building or the lighting passes, since all of those commands are writing to the same attached images (and you can't do that from multiple queues). And if they're not writing to the same images, then you're going to have to pick a queue in which to combine the resulting images into the final one. Also, I have no idea how depth buffering would work without using the same depth buffer.
And that combination step would require synchronization.
Now, there are many tasks which can be parallelized. Doing frustum culling. Particle system updates. Memory transfers. Things like that; data which is intended for the next frame. But how many queues could you realistically keep busy at once? 3? Maybe 4?
Not to mention, you're going to need to build a rendering system which can scale. Vulkan does not require that implementations provide more than 1 queue. So your code needs to be able to run reasonably on a system that only offers one queue as well as a system that offers 16. And to take advantage of a 16 queue system, you might need to render very differently.
Oh, and be advised that if you ask for a bunch of queues, but don't use them, performance could be impacted. If you ask for 8 queues, the implementation has no choice but to assume that you intend to be able to issue 8 concurrent sets of commands. Which means that the hardware cannot dedicate all of its resources to a single queue. So if you only ever use 3 of them... you may be losing over 50% of your potential performance to resources that the implementation is waiting for you to use.
Granted, the implementation could scale such things dynamically. But unless you profile this particular case, you'll never know. Oh, and if it does scale dynamically... then you won't be gaining a whole lot from using multiple queues like this either.
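To make the "don't ask for queues you won't use" point concrete, here is a small sketch of how the request size could be computed. plan_queue_request is a hypothetical helper, not part of Vulkan; its result and the filled array are what you would then put into the queueCount and pQueuePriorities fields of VkDeviceQueueCreateInfo. Using 1.0f for every priority is just an assumption for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decides how many queues to request from one family.
   The result is never more than the application will really submit to,
   never more than the family offers, and never more than `cap` (the
   capacity of `priorities`). Each requested queue gets priority 1.0f. */
uint32_t plan_queue_request(uint32_t queues_used, uint32_t family_queue_count,
                            float *priorities, uint32_t cap)
{
    assert(priorities != NULL || cap == 0);
    uint32_t n = queues_used < family_queue_count ? queues_used
                                                  : family_queue_count;
    if (n > cap) {
        n = cap;
    }
    for (uint32_t i = 0; i < n; ++i) {
        priorities[i] = 1.0f;
    }
    return n;
}
```

With the asker's 16-queue family and a renderer that only ever submits to 3 queues, this asks for 3; on an implementation that offers a single queue it asks for 1, and the renderer has to cope with that.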
Lastly, there has been some research into how effective multiple queue submissions can be at keeping the GPU fed, on several platforms (read all of the parts). The general long and short of it seems to be that:
Having multiple queues executing genuine rendering operations isn't helpful.
Having a single rendering queue with one or more compute queues (either as actual compute queues or graphics queues you submit compute work to) is useful at keeping execution units well saturated during rendering operations.
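The "one rendering queue plus a compute queue" layout from that summary could be planned roughly like this. Again this is only a sketch without the Vulkan headers: QueueFamily and QueuePlan are hypothetical types, and the flag constants are assumed to mirror VK_QUEUE_GRAPHICS_BIT and VK_QUEUE_COMPUTE_BIT. The fallback order (compute-only family first, then a second queue of the graphics family, then the graphics queue itself) is my assumption, not something the linked research prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror VK_QUEUE_GRAPHICS_BIT and VK_QUEUE_COMPUTE_BIT. */
enum { QUEUE_GRAPHICS = 0x1, QUEUE_COMPUTE = 0x2 };

/* Hypothetical stand-in for VkQueueFamilyProperties. */
typedef struct {
    uint32_t flags;
    uint32_t queue_count;
} QueueFamily;

/* Hypothetical result: which family/queue to use for each kind of work.
   A family index of -1 means "nothing suitable found". */
typedef struct {
    int graphics_family;
    int compute_family;
    uint32_t compute_queue_index;
} QueuePlan;

QueuePlan plan_graphics_and_compute(const QueueFamily *fams, size_t n)
{
    QueuePlan plan = { -1, -1, 0 };
    assert(fams != NULL || n == 0);

    /* One graphics queue: queue 0 of the first graphics-capable family. */
    for (size_t i = 0; i < n; ++i) {
        if (fams[i].queue_count > 0 && (fams[i].flags & QUEUE_GRAPHICS)) {
            plan.graphics_family = (int)i;
            break;
        }
    }
    /* Prefer a compute family without graphics for async compute. */
    for (size_t i = 0; i < n; ++i) {
        if (fams[i].queue_count > 0 && (fams[i].flags & QUEUE_COMPUTE) &&
            !(fams[i].flags & QUEUE_GRAPHICS)) {
            plan.compute_family = (int)i;
            return plan;
        }
    }
    /* Fallback: submit compute to the graphics family, on a second queue
       if there is one, otherwise on the graphics queue itself. */
    if (plan.graphics_family >= 0 &&
        (fams[plan.graphics_family].flags & QUEUE_COMPUTE)) {
        plan.compute_family = plan.graphics_family;
        plan.compute_queue_index =
            fams[plan.graphics_family].queue_count > 1 ? 1u : 0u;
    }
    return plan;
}
```

Whether the fallback cases buy you anything is exactly the thing you would have to profile.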

That strongly depends on your actual scenario and setup. It's hard to tell without any details.
If you submit command buffers to multiple queues you also need to do proper synchronization, and if that's not done right you may actually get worse performance than just using one queue.
Note that even if you submit to only one queue, an implementation may execute command buffers in parallel and even out of order (aka "in-flight"); see details on this in chapter 2.2 of the spec or this AMD presentation.
If you do compute and graphics, using separate queues with simultaneous submissions (and synchronization) will improve performance on hardware that supports async compute.
So there is no definitive yes or no on this without knowing about your actual use case.

Since you can submit multiple independent workloads to the same queue, and there doesn't seem to be any implicit ordering guarantee among them, you don't really need more than one queue to saturate the queue family. So I guess the sole purpose of multiple queues is to allow for different priorities among the queues, as specified during device creation.
I know this answer is in direct contradiction to the accepted answer, but that answer fails to address the issue that you don't need more queues to send more parallel work to the device.


Does Vulkan parallel rendering rely on multiple queues?

I'm new to Vulkan and not very clear on how parallel rendering works. Here are some questions (the "queue" mentioned below refers specifically to the graphics queue):
Does parallel rendering rely on a device which supports more than one queue?
If the answer to question 1 is yes, what if the physical device only has one queue, but Vulkan abstracts it as 4 queues (which is the real case for my MacBook's GPU)? Will the rendering in this case really be parallel?
If the answer to question 1 is yes, what if there is only one queue in Vulkan's abstraction? Does that mean the device definitely can render objects in parallel?
P.S. About question 2: when I use the Metal API, there is only one queue, but when using the Vulkan API, the number is 4, so I'm not sure it is right to say "the physical device only has one queue".
I have the sneaking suspicion you are abusing the word "parallel". Make sure you know what it means.
Rendering on GPU is by nature embarrassingly parallel. Typically one queue can feed the entire GPU, and typically apps assume that is true.
In all likelihood they made the number of queues equal to the CPU core count. In Vulkan, submissions to a single queue always need to be externally synchronized. Having more queues allows submitting from multiple threads without synchronization.
If there is only one Vulkan queue, you can only submit to that one queue, and any submission has to be synchronized with a mutex or come from a single thread in the first place.
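A rough sketch of what "synchronized with a mutex" means in practice, assuming POSIX threads. SharedQueue and its functions are hypothetical names, and the counter only stands in for the real submission; in an actual app the marked line is where the vkQueueSubmit call on the shared VkQueue would go.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical wrapper around the one queue every thread has to share. */
typedef struct {
    pthread_mutex_t lock;
    unsigned submitted;   /* stand-in for the effect of real submissions */
} SharedQueue;

void shared_queue_init(SharedQueue *q)
{
    assert(q != NULL);
    pthread_mutex_init(&q->lock, NULL);
    q->submitted = 0;
}

/* Any thread may call this; the mutex provides the external
   synchronization that access to a single queue requires. */
void shared_queue_submit(SharedQueue *q, unsigned batch_count)
{
    assert(q != NULL);
    pthread_mutex_lock(&q->lock);
    /* A real application would call vkQueueSubmit here. */
    q->submitted += batch_count;
    pthread_mutex_unlock(&q->lock);
}

void shared_queue_destroy(SharedQueue *q)
{
    assert(q != NULL);
    pthread_mutex_destroy(&q->lock);
}
```

With one queue per submitting thread, the lock (and any contention on it) goes away, which is the benefit described above.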

Underlying hardware mapping of Vulkan queues

Vulkan is intended to be thin and explicit to the user, but queues are a big exception to this rule: queues may be multiplexed by the driver, and it's not always obvious whether using multiple queues from a family will improve performance or not.
After one of the driver updates, I've got 2 transfer-only queues instead of one, but I'm pretty sure that there will be no benefit in using them in parallel for data streaming compared to just using one of them (will be happy to be proved wrong).
So why not just say "we have N separate hardware queues and if you want to use some of them in parallel, just mutex it yourself"? Now it looks like there's no way to know how independent the queues in a family really are.
GPUs these days have to contend with a multi-processed world. Different programs can access the same hardware, and GPUs have to be able to deal with that. As such, having parallel input streams for a single piece of actual hardware is no different from being able to create more CPU threads than you have actual CPU cores.
That is, a queue from a family is probably not "mutexing" access to the actual hardware. At least, not in a CPU way. If multiple queues from a family are different paths to execute stuff on the same hardware, then the way that hardware gets populated from these multiple queues probably happens at the GPU level. That is, it's an actual hardware feature.
And you could never get performance equivalent to that hardware feature by "mutexing it yourself". For example:
I've got 2 transfer-only queues instead of one, but I'm pretty sure that there will be no benefit in using them in parallel for data streaming compared to just using one of them
Let's assume that there really is only one hardware DMA channel with a fixed bandwidth behind that transfer queue. This means that, at any one time, only one thing can be DMA'd from CPU memory to GPU memory.
Now, let's say you have some DMA work to do. You want to upload a bunch of stuff. But every now and then, you need to download some rendering product. And that download needs to complete ASAP, because you need to reuse the image that stores those bytes.
With prioritized queues, you can give the download transfer queue much higher priority than the upload queue. If the hardware permits it, then it can interrupt the upload to perform the download, then get back to the upload.
With your way, you'd have to upload each item one at a time at regular intervals, a process that would have to be interruptible by a possible download. To do that, you'd basically need a recurring task that shows up to perform and submit a single upload to the transfer queue.
It'd be much more efficient to just throw the work at the GPU and let its priority system take care of it. Even if there is no priority system, then it'll probably perform operations round-robin, jumping back and forth between the input transfer queue operations rather than waiting for one queue to run dry before trying another.
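Purely as a toy model of that hypothetical (this is not how any particular driver or GPU is known to work), the arbitration between several queues feeding one DMA channel could look like the function below: the highest-priority queue with pending work goes first, and queues of equal priority take turns round-robin. pick_next_queue and its parameters are made-up names for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy arbiter: returns the index of the queue that gets the channel next,
   or -1 if no queue has pending work.
   priority[i] - priority of queue i (higher wins)
   pending[i]  - number of operations waiting in queue i
   last        - queue served previously, or -1 if none yet */
int pick_next_queue(const float *priority, const unsigned *pending,
                    size_t n, int last)
{
    assert(n > 0 && priority != NULL && pending != NULL);
    assert(last >= -1 && (last < 0 || (size_t)last < n));

    int best = -1;
    size_t start = (size_t)(last + 1) % n;   /* begin after the last one */
    for (size_t k = 0; k < n; ++k) {
        size_t i = (start + k) % n;
        if (pending[i] == 0) {
            continue;
        }
        /* Strictly higher priority wins; on a tie the queue met first in
           circular order stays chosen, which gives round-robin. */
        if (best < 0 || priority[i] > priority[best]) {
            best = (int)i;
        }
    }
    return best;
}
```

With an upload queue at priority 0.2 and a download queue at priority 1.0, the download is picked as soon as it has work, and the uploads resume once it is drained. With equal priorities the two queues alternate.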
But of course, this is all hypothetical. You'd need to do profiling work to make sure that these things pan out.
The main issue with queues within families is that they sometimes represent distinct hardware with their own dedicated resources and sometimes they don't. AMD's hardware for example offers two transfer queues, but these actually use separate DMA channels. Granted, they probably still share the same overall bandwidth, but it's not a simple case of one queue having to wait to execute work until the other queue has executed a transfer command.

Why do queues in a queue family in Vulkan need priority if we can't distinguish between them?

As asked in the title. My main point is "why", as in what's the benefiting factor in such logical structure for queues and queue families.
Do chip/card makers actually etch multiple independent queues onto their chips? That are at the same time separately distinguishable?
Does implementing separate processing units/streams provide any benefit to implementations? And by extension, does it retroactively benefit older APIs such as OpenCL?
I've observed an interesting fact: on my "Intel(R) Core(TM) i3-8100B CPU @ 3.60GHz" Mac Mini, there are 2 GPUs listed in "" (from LunarG SDK). My bad, the app linked against 2 copies of libMoltenVK.dylib (1 in "Contents/Frameworks", 1 in "/usr/local/lib").
"Why" is not a great question for SO format. It leads to speculation.
The queues are distinguishable in Vulkan. They each have an index by which they can be distinguished. Keep in mind they are rather a driver thing. Even when the driver exposes more queues, a single one can typically use all of the GPU's computing resources.
Furthermore, the Vulkan specification does not really say what should happen when you supply a specific priority value. It is perfectly valid for the driver/GPU to ignore it.
Chip makers do have compute units that are independent. They can theoretically execute different code from each other. But it is not usually advantageous. In the usual work rendering some regular W × H image, it saturates all the compute units with the same work.
Why: because you can submit different types of work that are of different importance, and you can give the Vulkan implementation a hint about what you want done first.
Everything else in the question is pointless:
Do chip/card makers actually etch multiple independent queues onto their chips? That are at the same time separately distinguishable?
Not necessarily, those may be logical queues that're time-sliced.
Does implementing separate processing units/streams provide any benefit to implementations? And by extension, does it retroactively benefit older APIs such as OpenCL?
No. A contemporary API, Metal (from Apple), doesn't have a queue count or the concept of a queue family at all.

How can I speed up a Mac app processing 5000 independent tasks?

I have a long running (5-10 hours) Mac app that processes 5000 items. Each item is processed by performing a number of transforms (using Saxon), running a bunch of scripts (in Python and Racket), collecting data, and serializing it as a set of XML files, a SQLite database, and a CoreData database. Each item is completely independent from every other item.
In summary, it does a lot, takes a long time, and appears to be highly parallelizable.
After loading up all the items that need processing, the app uses GCD to parallelize the work, using dispatch_apply:
dispatch_apply(numberOfItems, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^(size_t i) {
@autoreleasepool {
I'm running the app on a Mac Pro with 12 cores (24 virtual). So I would expect to have 24 items being processed at all times. However, I found through logging that the number of items being processed varies between 8 and 24. This is literally adding hours to the run time (assuming it could work on 24 items at a time).
On the one hand, perhaps GCD is really, really smart and it is already giving me the maximum throughput. But I'm worried that, because much of the work happens in scripts that are spawned by this app, maybe GCD is reasoning from incomplete information and isn't making the best decisions.
Any ideas how to improve performance? After correctness, the number one desired attribute is shortening how long it takes this app to run. I don't care about power consumption, hogging the Mac Pro, or anything else.
UPDATE: In fact, this looks alarming in the docs: "The actual number of tasks executed by a concurrent queue at any given moment is variable and can change dynamically as conditions in your application change. Many factors affect the number of tasks executed by the concurrent queues, including the number of available cores, the amount of work being done by other processes, and the number and priority of tasks in other serial dispatch queues." (emphasis added) It looks like having other processes doing work will adversely affect scheduling in the app.
It'd be nice to be able to just say "run these blocks concurrently, one per core, don't try to do anything smarter".
If you are bound and determined, you can explicitly spawn 24 threads using the NSThread API, and have each of those threads pull from a synchronized queue of work items. I would bet money that performance would get noticeably worse.
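For reference, the "synchronized queue of work items" part of that approach can be as small as an atomic counter that each worker thread bumps to claim the next item. This is only a C11 sketch of that one piece; WorkList and its functions are hypothetical names, the thread creation itself (NSThread or otherwise) is left out, and none of this changes the expectation that it would perform worse than GCD here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared work list: items are identified by index 0..total-1. */
typedef struct {
    atomic_size_t next;   /* index of the next unclaimed item */
    size_t total;         /* number of items, e.g. 5000 */
} WorkList;

void work_list_init(WorkList *w, size_t total)
{
    assert(w != NULL);
    atomic_init(&w->next, 0);
    w->total = total;
}

/* Called by each worker in a loop. Stores a unique item index in
   *out_index and returns true, or returns false once all items have
   been handed out. The atomic increment is the only synchronization. */
bool work_list_claim(WorkList *w, size_t *out_index)
{
    assert(w != NULL && out_index != NULL);
    size_t i = atomic_fetch_add(&w->next, 1);
    if (i >= w->total) {
        return false;
    }
    *out_index = i;
    return true;
}
```

Each of the 24 threads would loop on work_list_claim and process the item it got until the call returns false.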
GCD works at its most efficient when the work items submitted to it never block. That said, the workload you're describing is rather complex and rife with opportunities for your threads to block. For starters, you're spawning a bunch of other processes. Right here, this means that you're already relying on the OS to divvy up time/resources between your master task and these slave tasks. Other than setting the OS priority of each subprocess, the OS scheduler has no way to know which processes are more important than others, and by default, your subprocesses are going to have the same priority as their parent. That said, it doesn't sound like you have anything to gain by tweaking process priorities. I'm assuming you're blocking the master task thread that's waiting for the slave tasks to complete. That is effectively parking that thread -- it can do no useful work. But like I said, I don't think there's much to be gained by tweaking the OS priorities of your slave tasks, because this really sounds like it's an I/O bound workflow...
You go on to describe three I/O-heavy operations ("serializing it as a set of XML files, a SQLite database, and a CoreData database.") So now you have all these different threads and processes vying for what is presumably a shared bulk storage device. (i.e. unless you're writing to 24 different databases, on 24 separate hard drives, one for each core, your process is ultimately going to be serialized at the disk accesses.) Even if you had 24 different hard drives, writing to a hard drive (even an SSD) is comparatively slow. Your threads are going to be taken off of the CPU they were running on (so that another thread that's waiting can run) for virtually any blocking disk write.
If you wanted to maximize the performance you're getting out of GCD, you would probably want to rewrite all the stuff you're doing in subtasks in C/C++/Objective-C, bringing them in-process, and then conducting all the associated I/O using dispatch_io primitives. For API where you don't control the low-level reads and writes, you would want to carefully manage and tune your workload to optimize it for the hardware you have. For instance, if you have a bunch of stuff to write to a single, shared SQLite database, there's no point in ever having more than one thread trying to write to that database at once. You'd be better off making one thread (or a serial GCD queue) to write to SQLite and submitting tasks to that after pre-processing is done.
I could go on for quite a while here, but the bottom line is that you've got a complex, seemingly I/O bound workflow here. At the highest-level, CPU utilization or "number of running threads" is going to be a particularly poor measure of performance for such a task. By using sub-processes (i.e. scripts), you're putting a lot of control into the hands of the OS, which knows effectively nothing about your workload a priori, and therefore can do nothing except use its general scheduler to divvy up resources. GCD's opaque thread pool management is really the least of your problems.
On a practical level, if you want to speed things up, go buy multiple, faster (i.e. SSD) hard drives, and rework your task/workflow to utilize them separately and in parallel. I suspect that would yield the biggest bang for your buck (for some equivalence relation of time == money == hardware.)

Grand Central Dispatch vs NSThreads?

I searched a variety of sources but don't really understand the difference between using NSThreads and GCD. I'm completely new to the OS X platform so I might be completely misinterpreting this.
From what I read online, GCD seems to do the exact same thing as basic threads (POSIX, NSThreads etc.) while adding much more technical jargon ("blocks"). It seems to just overcomplicate the basic thread creation system (create thread, run function).
What exactly is GCD and why would it ever be preferred over traditional threading? When should traditional threads be used rather than GCD? And finally is there a reason for GCD's strange syntax? ("blocks" instead of simply calling functions).
I am on Mac OS X 10.6.8 Snow Leopard and I am not programming for iOS - I am programming for Macs. I am using Xcode 3.6.8 in Cocoa, creating a GUI application.
Advantages of Dispatch
The advantages of dispatch are mostly outlined here:
Migrating Away from Threads
The idea is that you eliminate work on your part, since the paradigm fits MOST code more easily.
It reduces the memory penalty your application pays for storing thread stacks in the application’s memory space.
It eliminates the code needed to create and configure your threads.
It eliminates the code needed to manage and schedule work on threads.
It simplifies the code you have to write.
Empirically, using GCD-type locking instead of @synchronized is about 80% faster or more, though micro-benchmarks may be deceiving. Read more here, though I think the advice to go async with writes does not apply in many cases, and it's slower (but it's asynchronous).
Advantages of Threads
Why would you continue to use Threads? From the same document:
It is important to remember that queues are not a panacea for
replacing threads. The asynchronous programming model offered by
queues is appropriate in situations where latency is not an issue.
Even though queues offer ways to configure the execution priority of
tasks in the queue, higher execution priorities do not guarantee the
execution of tasks at specific times. Therefore, threads are still a
more appropriate choice in cases where you need minimal latency, such
as in audio and video playback.
Another place where I haven't personally found an ideal solution using queues is daemon processes that need to be constantly rescheduled. Not that you cannot reschedule them, but looping within a NSThread method is simpler (I think). Edit: Now I'm convinced that even in this context, GCD-style locking would be faster, and you could also do a loop within a GCD-dispatched operation.
Blocks in Objective-C?
Blocks are really horrible in Objective-C due to the awful syntax (though Xcode can sometimes help with autocompletion, at least). If you look at blocks in Ruby (or any other language, pretty much) you'll see how simple and elegant they are for dispatching operations. I'd say that you'll get used to the Objective-C syntax, but I really think that you'll get used to copying from your examples a lot :)
You might find my examples from here to be helpful, or just distracting. Not sure.
While the answers so far are about threads vs GCD inside the domain of a single application and the differences it makes for programming, the reason you should always prefer GCD is multitasking environments (since you are on Mac OS X and not iOS). Threads are OK if your application is running alone on your machine. Say you have a video editing program and want to apply some effect to the video. The render is going to take 10 minutes on a machine with eight cores. Fine.
Now, while the video app is churning in the background, you open an image editing program and play with some high resolution image, decide to apply some special image filter, and your image application, being clever, detects you have eight cores and starts eight threads to process the image. Nice, isn't it? Except that's terrible for performance. The image editing app doesn't know anything about the video app (and vice versa) and therefore both will request their respective optimum number of threads. And there will be pain and blood while the cores try to switch from one thread to another, because to avoid starvation the CPU will eventually let all threads run, even though in this situation it would be better to run only 4 threads for the video app and 4 threads for the image app.
For a more detailed reference, take a look at where you can see a benchmark of an HTTP server using GCD versus thread, and see how it scales. Once you understand the problem threads have for multicore machines in multi-app environments, you will always want to use GCD, simply because threads are not always optimal, while GCD potentially can be since the OS can scale thread usage per app depending on load.
Please, remember we won't have more GHz in our machines any time soon. From now on we will only have more cores, so it's your duty to use the best tool for this environment, and that is GCD.
Blocks allow for passing a block of code to execute. Once you get past the "strange syntax", they are quite powerful.
GCD also uses queues which, if used properly, can help with lock-free concurrency as long as the code executing in the separate queues is isolated. It's a simpler way to offer background work and concurrency while minimizing the chance of deadlocks (if used right).
The "strange syntax" comes from the choice of the caret (^), which was one of the few symbols that wasn't overloaded as an operator in C++.
When it comes to adding concurrency to an application, dispatch queues
provide several advantages over threads. The most direct advantage is
the simplicity of the work-queue programming model. With threads, you
have to write code both for the work you want to perform and for the
creation and management of the threads themselves. Dispatch queues let
you focus on the work you actually want to perform without having to
worry about the thread creation and management. Instead, the system
handles all of the thread creation and management for you. The
advantage is that the system is able to manage threads much more
efficiently than any single application ever could. The system can
scale the number of threads dynamically based on the available
resources and current system conditions. In addition, the system is
usually able to start running your task more quickly than you could if
you created the thread yourself.
Although you might think rewriting your code for dispatch queues would
be difficult, it is often easier to write code for dispatch queues
than it is to write code for threads. The key to writing your code is
to design tasks that are self-contained and able to run
asynchronously. (This is actually true for both threads and dispatch queues.)
Although you would be right to point out that two tasks running in a
serial queue do not run concurrently, you have to remember that if two
threads take a lock at the same time, any concurrency offered by the
threads is lost or significantly reduced. More importantly, the
threaded model requires the creation of two threads, which take up
both kernel and user-space memory. Dispatch queues do not pay the same
memory penalty for their threads, and the threads they do use are kept
busy and not blocked.
GCD (Grand Central Dispatch): GCD provides and manages FIFO queues to which your application can submit tasks in the form of block objects. Work submitted to dispatch queues is executed on a pool of threads fully managed by the system. No guarantee is made as to the thread on which a task executes. Why GCD over threads:
How much work your CPU cores are doing
How many CPU cores you have.
How many threads should be spawned.
If GCD needs it can go down into the kernel and communicate about resources, thus better scheduling.
Less load on kernel and better sync with OS
GCD uses existing threads from thread pool instead of creating and then destroying.
Best advantage of the system’s hardware resources, while allowing the operating system to balance the load of all the programs currently running along with considerations like heating and battery life.