How to know the limitations of a specific VxWorks version? - embedded

I was trying to download the VxWorks 5.x (specifically 5.11) manual to understand the limitations with respect to the number of tasks, message queues, stack sizes, memory restrictions, and semaphores.
Could anyone please post a link to download it, or share the above-mentioned limitations?

There are not, as far as I am aware, any inherent limits on the number of tasks, message queues, stack sizes, semaphores, etc.
While there clearly will be an upper bound, this is primarily dictated by the hardware platform, in particular the amount of memory.
This is especially true of message queues, and the stack size for various tasks.
If you are after particular documentation, you should contact your Wind River support rep; documentation is freely available to all licensed users.


Vulkan Queue Families Clarification

Is it specified somewhere in the Vulkan spec that presentation capability (vkGetPhysicalDeviceSurfaceSupportKHR returning true) is tied only to families with the VK_QUEUE_GRAPHICS_BIT flag, or can a transfer-exclusive family possibly return true too?
I was probably a little confused by the naming in the Vulkan tutorial, but I assume that what is named presentQueue and presentFamily there is in fact just a graphics queue and graphics family (another one or the same one), and that it has no relation to queue families that have only VK_QUEUE_TRANSFER_BIT (i.e. families that do not have both flags).
Are my assumptions right, or am I misunderstanding something?
Strictly speaking, there is no such thing as a "present family", nor is there a "graphics family". There are just different queue families which support different types of operations, like presentation or graphics.
How many different queue families are supported and which capabilities they have depends on your specific GPU model. To get a good overview of all this information, I can recommend the tool Hardware Capability Viewer by Sascha Willems. (It is also used to populate the tool's online database if you choose to upload your data.)
On an NVIDIA RTX 2080, for example, I get the following different queue families with the following capabilities:
Queue family #0 supports transfer, graphics, compute, and presentation
Queue family #1 supports only transfer, and nothing else
Queue family #2 supports transfer, compute, and presentation
As you can see, I could use queues from queue families #0 or #2 to send images to the presentation engine, but not queues from queue family #1.
I find the capabilities quite interesting. They mean that for certain use cases, using one of the specialized queue families (i.e. #1 or #2) can lead to better performance than using queues from family #0, which are able to perform any operation. Using different queues can enable your application to parallelize work better, but it will generally also require some sort of work synchronization between the different queues.
Queues from family #2 are often referred to as "async compute queues" (I think this terminology came mostly from the DirectX world), and they have been available as an option in games' graphics settings for quite a while now (if supported). What I have spotted recently is the option to enable "present from compute" (Doom Eternal offers this setting), and again, this would refer to queues from family #2. I would guess that this does not automatically lead to increased performance (which is why it can be enabled/disabled), but on some GPUs it definitely will.
Answering your specific question: A queue family does not have to support graphics capabilities in order to support presentation. There are queue families (e.g. on an RTX 2080) which support compute and presentation, but not graphics. All of this depends on the specific GPU model. I don't know if there are any GPUs that offer transfer-only queue families with presentation support. That may not make much sense, so I would guess not.
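To make the point concrete, here is a minimal sketch in plain Python (not real Vulkan bindings) that models the selection logic. The flag values are the real VkQueueFlagBits values, but the QueueFamily class, the FAMILIES table (which mirrors the RTX 2080 listing above), and the helper names are all hypothetical. In a real application the flags would come from vkGetPhysicalDeviceQueueFamilyProperties and the present field from vkGetPhysicalDeviceSurfaceSupportKHR.

```python
from dataclasses import dataclass

# Real VkQueueFlagBits values from the Vulkan headers.
VK_QUEUE_GRAPHICS_BIT = 0x1
VK_QUEUE_COMPUTE_BIT = 0x2
VK_QUEUE_TRANSFER_BIT = 0x4


@dataclass(frozen=True)
class QueueFamily:
    """Hypothetical stand-in for one queue family of a physical device."""
    index: int
    flags: int      # what VkQueueFamilyProperties.queueFlags would hold
    present: bool   # what vkGetPhysicalDeviceSurfaceSupportKHR would report


# Illustrative table mirroring the RTX 2080 listing above.
FAMILIES = [
    QueueFamily(0, VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT | VK_QUEUE_TRANSFER_BIT, True),
    QueueFamily(1, VK_QUEUE_TRANSFER_BIT, False),
    QueueFamily(2, VK_QUEUE_COMPUTE_BIT | VK_QUEUE_TRANSFER_BIT, True),
]


def present_capable(families):
    """Presentation support is queried per family, independently of its flags."""
    return [f.index for f in families if f.present]


def pick_graphics_family(families):
    """Return the first family that has the graphics bit, or None."""
    for f in families:
        if f.flags & VK_QUEUE_GRAPHICS_BIT:
            return f.index
    return None


def pick_present_family(families, graphics_index):
    """Prefer the graphics family if it can present, else any presenting family."""
    candidates = present_capable(families)
    if graphics_index in candidates:
        return graphics_index
    return candidates[0] if candidates else None


graphics = pick_graphics_family(FAMILIES)
present = pick_present_family(FAMILIES, graphics)
```

With this table, the graphics and present roles land on the same family, which is the common case, but the code never assumes it: presentation is treated as its own per-family property.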

Why do queues in a queue family in Vulkan need priority if we can't distinguish between them?

As asked in the title. My main point is "why": what is the benefit of such a logical structure for queues and queue families?
Do chip/card makers actually etch multiple independent queues onto their chips? That are at the same time separately distinguishable?
Does implementing separate processing units/streams provide any benefit to implementations? And by extension, does it retroactively benefit older APIs such as OpenCL?
I've observed an interesting fact: on my "Intel(R) Core(TM) i3-8100B CPU @ 3.60GHz" Mac Mini, there are 2 GPUs listed in "" (from LunarG SDK). My bad: the app linked against 2 copies of libMoltenVK.dylib (1 in "Contents/Frameworks", 1 in "/usr/local/lib").
"Why" is not a great question for SO format. It leads to speculation.
The queues are distinguishable in Vulkan. Each has its own index by which it can be distinguished. Keep in mind that they are rather a driver thing. Even when the driver exposes more queues, even a single one can typically use all of the GPU's computing resources.
Furthermore, the Vulkan specification does not really say what should happen when you supply a specific priority value. It is perfectly valid for the driver/GPU to ignore it.
Chip makers do have compute units that are independent. They can theoretically execute different code from each other. But that is not usually advantageous. The usual workload of rendering some regular W × H image saturates all the compute units with the same work.
Why: because you can submit different types of work that are of different importance, and you can give a hint to the Vulkan implementation about what you want to be done first.
Everything else in the question is pointless:
Do chip/card makers actually etch multiple independent queues onto their chips? That are at the same time separately distinguishable?
Not necessarily; those may be logical queues that are time-sliced.
Does implementing separate processing units/streams provide any benefit to implementations? And by extension, does it retroactively benefit older APIs such as OpenCL?
No, a contemporary API called Metal (from Apple) doesn't have a queue count or the concept of a queue family at all.
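As a rough illustration of what "priority as a hint" looks like at device creation, the sketch below models the data in plain Python (not real Vulkan bindings). The spec requires each entry of pQueuePriorities to be a normalized float between 0.0 and 1.0, with higher values meaning higher priority. The queue_priorities helper, the weights, and the dictionary layout are hypothetical; only the field names echo VkDeviceQueueCreateInfo.

```python
def queue_priorities(weights):
    """Scale non-negative importance weights into the 0.0..1.0 range.

    The most important queue gets 1.0 and the rest are scaled relative to it.
    This is just one possible convention, not something the spec prescribes.
    """
    if not weights:
        raise ValueError("at least one queue is required")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    top = max(weights)
    if top == 0:
        return [0.0 for _ in weights]
    return [w / top for w in weights]


def queue_create_info(family_index, weights):
    """Hypothetical dict mirroring the fields of VkDeviceQueueCreateInfo."""
    priorities = queue_priorities(weights)
    return {
        "queueFamilyIndex": family_index,
        "queueCount": len(priorities),
        "pQueuePriorities": priorities,
    }


# Three queues from family 0: frame rendering matters most, then
# streaming, then background work (weights are made up).
info = queue_create_info(0, [4, 2, 1])
```

Whether the implementation acts on these values at all is up to the driver, as noted above, so the numbers should be read as a request rather than a scheduling guarantee.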

Should I try to use as many queues as possible?

On my machine I have two queue families, one that supports everything and one that only supports transfer.
The queue family that supports everything has a queueCount of 16.
Now the spec states
Command buffers submitted to different queues may execute in parallel or even out of order with respect to one another
Does that mean I should try to use all available queues for maximal performance?
Yes, if you have workloads that are highly independent, use separate queues.
If the queues need a lot of synchronization between themselves, it may kill any potential benefit you may get.
Basically, in the case of queues from the same family, what you are doing is supplying the GPU with alternative work it can do, giving it the choice of what to fill stalls, bubbles, and idle time with. There is also some potential to make better use of the CPU (e.g. single-threaded submission vs. one queue per thread).
Using separate transfer queues (or another specialized family) even seems to be the recommended approach.
That is generally speaking. A more realistic, empirical, sceptical, and practical view was already presented in the answers by SW and NB. In reality, one has to be a bit more cautious, as those queues target the same resources and share the same limits and other common restrictions, which limits the potential benefit. Notably, if the driver does the wrong thing with multiple queues, it may be very, very bad for the cache.
AMD's Leveraging asynchronous queues for concurrent execution (2016) discusses a bit how this maps to their HW/driver. It shows the potential benefits of using separate queue families. It says that although they offer two queues in the compute family, they did not observe benefits in apps at that time. It also says that they have only one graphics queue, and explains why.
NVIDIA seems to have a similar idea of "asynch compute", as shown in Moving to Vulkan: Asynchronous compute.
To be safe, it seems we should still stick with only one graphics, and one async compute queue though on current HW. 16 queues seem like a trap and a way to hurt yourself.
With transfer queues it is not as simple as it seems either. You should use the dedicated ones for Host->Device transfers. And the non-dedicated should be used for device->device transfer ops.
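The advice above (one graphics queue, one async compute queue, a dedicated transfer queue for host-to-device uploads, and a graceful fallback when the device offers less) can be sketched as a small planning function. This is plain Python modelling the decision only, not real Vulkan code. The flag values are the real VkQueueFlagBits values, while plan_queues, the role names, and the family tables are hypothetical.

```python
GRAPHICS, COMPUTE, TRANSFER = 0x1, 0x2, 0x4  # VkQueueFlagBits values


def plan_queues(families):
    """Map roles to (family_index, queue_index) pairs.

    families is a list of (queue_flags, queue_count) tuples, indexed by
    queue family index, as vkGetPhysicalDeviceQueueFamilyProperties would
    report them.
    """
    def first(pred):
        for index, (flags, count) in enumerate(families):
            if count > 0 and pred(flags):
                return index
        return None

    graphics = first(lambda f: bool(f & GRAPHICS))
    if graphics is None:
        raise ValueError("no graphics-capable queue family")

    next_free = {}

    def take(family):
        # Hand out the next unused queue of the family; once the family is
        # exhausted, roles share its last queue.
        queue = min(next_free.get(family, 0), families[family][1] - 1)
        next_free[family] = queue + 1
        return (family, queue)

    plan = {"graphics": take(graphics)}

    # Async compute: prefer a compute family without the graphics bit.
    compute = first(lambda f: bool(f & COMPUTE) and not f & GRAPHICS)
    plan["async_compute"] = take(compute if compute is not None else graphics)

    # Host-to-device uploads: prefer a transfer-only family.
    transfer = first(lambda f: bool(f & TRANSFER) and not f & (GRAPHICS | COMPUTE))
    plan["upload"] = take(transfer if transfer is not None else graphics)
    return plan


# The machine from the question: one do-everything family with 16 queues
# and one transfer-only family (its queue count of 1 is an assumption).
asker = plan_queues([(GRAPHICS | COMPUTE | TRANSFER, 16), (TRANSFER, 1)])

# A minimal device that exposes a single queue.
minimal = plan_queues([(GRAPHICS | COMPUTE | TRANSFER, 1)])
```

On the 16-queue machine this requests only three queues in total, and on the single-queue device every role collapses onto the same queue, so the rest of the renderer can be written against roles rather than against a fixed queue count.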
To what end?
Take the typical structure of a deferred renderer. You build your g-buffers, do your lighting passes, do some post-processing and tone mapping, maybe throw in some transparent stuff, and then present the final image. Each process depends on the previous process having completed before it can begin. You can't do your lighting passes until you've finished your g-buffer. And so forth.
How could you parallelize that across multiple queues of execution? You can't parallelize the g-buffer building or the lighting passes, since all of those commands are writing to the same attached images (and you can't do that from multiple queues). And if they're not writing to the same images, then you're going to have to pick a queue in which to combine the resulting images into the final one. Also, I have no idea how depth buffering would work without using the same depth buffer.
And that combination step would require synchronization.
Now, there are many tasks which can be parallelized. Doing frustum culling. Particle system updates. Memory transfers. Things like that; data which is intended for the next frame. But how many queues could you realistically keep busy at once? 3? Maybe 4?
Not to mention, you're going to need to build a rendering system which can scale. Vulkan does not require that implementations provide more than 1 queue. So your code needs to be able to run reasonably on a system that only offers one queue as well as a system that offers 16. And to take advantage of a 16 queue system, you might need to render very differently.
Oh, and be advised that if you ask for a bunch of queues, but don't use them, performance could be impacted. If you ask for 8 queues, the implementation has no choice but to assume that you intend to be able to issue 8 concurrent sets of commands. Which means that the hardware cannot dedicate all of its resources to a single queue. So if you only ever use 3 of them... you may be losing over 50% of your potential performance to resources that the implementation is waiting for you to use.
Granted, the implementation could scale such things dynamically. But unless you profile this particular case, you'll never know. Oh, and if it does scale dynamically... then you won't be gaining a whole lot from using multiple queues like this either.
Lastly, there has been some research into how effective multiple queue submissions can be at keeping the GPU fed, on several platforms (read all of the parts). The general long and short of it seems to be that:
Having multiple queues executing genuine rendering operations isn't helpful.
Having a single rendering queue with one or more compute queues (either as actual compute queues or graphics queues you submit compute work to) is useful at keeping execution units well saturated during rendering operations.
That strongly depends on your actual scenario and setup. It's hard to tell without any details.
If you submit command buffers to multiple queues you also need to do proper synchronization, and if that's not done right you may get actually worse performance than just using one queue.
Note that even if you submit to only one queue, an implementation may execute command buffers in parallel and even out of order (aka "in-flight"); see the details on this in chapter 2.2 of the spec or this AMD presentation.
If you do compute and graphics, using separate queues with simultaneous submissions (and a synchronization) will improve performance on hardware that supports async compute.
So there is no definitive yes or no on this without knowing about your actual use case.
Since you can submit multiple independent workloads to the same queue, and there doesn't seem to be any implicit ordering guarantee among them, you don't really need more than one queue to saturate the queue family. So I guess the sole purpose of multiple queues is to allow for different priorities among the queues, as specified during device creation.
I know this answer is in direct contradiction to the accepted answer, but that answer fails to address the issue that you don't need more queues to send more parallel work to the device.

Is it meaningful to monitor physical memory usage on AIX?

Due to AIX's special memory-using algorithm, is it meaningful to monitor physical memory usage in order to find memory bottlenecks during performance tuning?
If not, then what kind of KPI am I supposed to keep an eye on to determine whether we need to increase the RAM capacity or not?
If a program requires more memory than is available as RAM, the OS will start swapping memory sections to disk as it sees fit. You'll need to monitor the output of vmstat and look for paging activity. I don't have access to an AIX machine now to illustrate with an example, but I recall the man page is pretty good at explaining what data is represented there.
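As a rough sketch of what "look for paging activity" could mean in practice, the Python snippet below picks the pi (paged in from paging space) and po (paged out to paging space) columns out of vmstat-style text. The SAMPLE text is made up for illustration and only follows the general column layout of AIX vmstat; the function names and the zero threshold are assumptions. The real column set can differ by AIX level and options, so check it against your own output and the man page.

```python
# Made-up sample in the style of AIX vmstat output (not real measurements).
SAMPLE = """\
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 1  0 22478  1677   0   0   0   0    0   0 188 1380 157 57 32  0 10
 2  1 22506  1609   0   4  12  96  210   0 214 1476 186 48 37  0 16
"""


def paging_samples(vmstat_text):
    """Return a list of (pi, po) pairs, one per data row."""
    header = None
    samples = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "pi" in fields and "po" in fields:
            header = fields  # remember where the columns are
            continue
        # Data rows have one integer per header column; skip anything else.
        if header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            pi = int(fields[header.index("pi")])
            po = int(fields[header.index("po")])
            samples.append((pi, po))
    return samples


def is_paging(samples, threshold=0):
    """True if any interval shows page-ins or page-outs above the threshold."""
    return any(pi > threshold or po > threshold for pi, po in samples)


samples = paging_samples(SAMPLE)
paging = is_paging(samples)
```

Sustained non-zero pi/po over many intervals is the signal worth tracking over time; a single spike is much less telling than a steady trend.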
Also, this looks to be a good writeup about another AIX-specific system monitoring tool for watching your system's overall memory (svmon).
To track the size of your individual application instance(s), there are several options, with the most common being ps. Again, you'll have to check the man page to get information on which options to use. There are several columns for memory size per process. You can compare those values to the overall memory that's available on your machine and understand, by tracking over time, whether your application only ever increases its memory or releases memory when it is done with a task.
Finally, there's quite a body of information from IBM on performance tuning for AIX, but I was never able to find a road map guide to reading that information. A lot of it assumes you know facts and features that aren't explained in the current doc set, so you then have to try to find an explanation, which often leads to searching for yet another layer of explanations. :^/

Spread vs MPI vs zeromq?

In one of the answers to Broadcast like UDP with the Reliability of TCP, a user mentions the Spread messaging API. I've also run across one called ØMQ. I also have some familiarity with MPI.
So, my main question is: why would I choose one over the other? More specifically, why would I choose to use Spread or ØMQ when there are mature implementations of MPI to be had?
MPI was designed for tightly coupled compute clusters with fast, reliable networks. Spread and ØMQ are designed for large distributed systems. If you're designing a parallel scientific application, go with MPI, but if you are designing a persistent distributed system that needs to be resilient to faults and network instability, use one of the others.
MPI has very limited facilities for fault tolerance; the default error handling behavior in most implementations is a system-wide fail. Also, the semantics of MPI require that all messages sent eventually be consumed. This makes a lot of sense for simulations on a cluster, but not for a distributed application.
I have not used any of these libraries, but I may be able to give some hints.
MPI is a communication protocol, while Spread and ØMQ are actual implementations.
MPI comes from "parallel" programming while Spread comes from "distributed" programming.
So, it really depends on whether you are trying to build a parallel system or a distributed system. They are related to each other, but the implied connotations/goals are different. Parallel programming deals with increasing computational power by using multiple computers simultaneously. Distributed programming deals with building a reliable (consistent, fault-tolerant, and highly available) group of computers.
The concept of "reliability" is slightly different from that of TCP. TCP's reliability is "give this packet to the end program no matter what." Distributed programming's reliability is "even if some machines die, the system as a whole continues to work in a consistent manner." To really guarantee that all participants got the message, one would need something like two-phase commit or one of its faster alternatives.
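For readers unfamiliar with two-phase commit, here is a minimal in-process sketch in Python. It only shows the shape of the protocol (a voting phase, then a commit or abort phase) and deliberately leaves out everything that makes it hard in practice: networking, timeouts, durable logs, and coordinator failure. The Participant class and the can_commit switch are hypothetical, and none of this reflects how Spread, ØMQ, or MPI work internally.

```python
class Participant:
    """Hypothetical group member that can accept or refuse a message."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a healthy or failing node
        self.pending = None
        self.delivered = []

    def prepare(self, message):
        # Phase 1: stage the message and vote yes or no.
        if self.can_commit:
            self.pending = message
        return self.can_commit

    def commit(self):
        # Phase 2 (success): make the staged message visible.
        self.delivered.append(self.pending)
        self.pending = None

    def abort(self):
        # Phase 2 (failure): discard the staged message.
        self.pending = None


def two_phase_commit(participants, message):
    """Deliver message to all participants or to none of them."""
    # A list (not a generator) so that every participant is asked to vote.
    votes = [p.prepare(message) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


healthy = [Participant("a"), Participant("b"), Participant("c")]
ok = two_phase_commit(healthy, "hello")

mixed = [Participant("a"), Participant("b", can_commit=False)]
failed = two_phase_commit(mixed, "hello")
```

The all-or-nothing outcome is the kind of guarantee meant above: either every member has the message, or none of them does, which is a stronger property than TCP delivering bytes to a single peer.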
You're addressing very different APIs here, with different notions about the kind of services provided and infrastructure for each of them. I don't know enough about MPI and Spread to answer for them, but I can help a little more with ZeroMQ.
ZeroMQ is a simple messaging communication library. It does nothing other than send messages to different peers (including local ones) based on a restricted set of common messaging patterns (PUSH/PULL, REQUEST/REPLY, PUB/SUB, etc.). It handles client connection, retrieval, and basic congestion strictly based on those patterns, and you have to do the rest yourself.
Although it appears very restricted, this simple behavior is mostly what you would need for the communication layer of your application. It lets you scale very quickly from a simple prototype, all in memory, to more complex distributed applications in various environments, using simple proxies and gateways between nodes. However, don't expect it to do node deployment, network discovery, or server monitoring; you will have to do that yourself.
Briefly, use ZeroMQ if you have an application that you want to scale from a simple multithreaded process to a distributed and variable environment, or if you want to experiment and prototype quickly and no existing solution seems to fit your model. Expect, however, to have to put some effort into the deployment and monitoring of your network if you want to scale to a very large cluster.