Can I get performance gain by putting the codes which hit more often inside if block when writing an if/else branch in a shader program? - gpu

I learned that gpu can optimize the else block out if the branch condition remains the same in a gpu thread group(a warp or whatever). In my assumption, when gpu is dealing with a branch, it always do the if block first, then the else block. When the if block is done, gpu checks if the condtion expression evals to true on every threads among a thread group. If this is the case, the else block can be skipped in that thread group.
So I guess that if I put the codes which hit more often inside the if block, I can benefit more from this branch skipping mechanism. Is that true?

Related

What is the first synchronization scope of vkCmdWaitEvents?

In 7.5, the Vulkan spec says about vkCmdWaitEvents
The first synchronization scope only includes event signal operations that operate on members of pEvents, and the operations that happened-before the event signal operations. Event signal operations performed by vkCmdSetEvent that occur earlier in submission order are included in the first synchronization scope, if the logically latest pipeline stage in their stageMask parameter is logically earlier than or equal to the logically latest pipeline stage in srcStageMask.
I'm confused by this phrasing. Does this mean the first synchronization scope is the signalling of events that are passed in to pEvents, plus any events that are submitted earlier and meet the stage mask and submission order requirement, or is it event signals are both passed in and meet the requirement?
In either case, since you can just pass in events with pEvents, what is srcStageMask is useful for?
The first synchronization scope only includes event signal operations that operate on members of pEvents, and the operations that happened-before the event signal operations.
The first scope of vkCmdWaitEvents is only the hypothetical signal on the pEvent (and all the stuff that happens-before it transitively, as would be defined by whatever signaled the event).
Event signal operations performed by vkCmdSetEvent that occur earlier in submission order are included in the first synchronization scope, [...]
vkCmdSetEvent cannot be reordered past vkCmdWaitEvents by the driver. It would basically be a broken if it did. I.e. if you call:
vkCmdSetEvent(e);
vkCmdWaitEvents(e);
then the driver is not allowed to execute it as:
vkCmdWaitEvents(e);
vkCmdSetEvent(e);
if the logically latest pipeline stage in their stageMask parameter is logically earlier than or equal to the logically latest pipeline stage in srcStageMask.
The reordering prohibition only applies if certain rules are followed.
If you record:
vkCmdSetEvent(e, stageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT);
vkCmdWaitEvents(e, srcStageMask = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT);
then you would read that as: "signal event after all BOTTOM_OF_PIPEs, and then wait on all events that signal before TOP_OF_PIPE." But that is an empty set! Nothing can be after BOTTOM_OF_PIPE and before TOP_OF_PIPE at the same time! So the wait might not register such a signal.
In either case, since you can just pass in events with pEvents, what is srcStageMask is useful for?
Imagine a signal being nothing other than a bit being flipped somewhere in the memory. Then you might as well be asking why there are pipeline stages in Vulkan pipeline barriers at all. Worst case scenario, some driver might need where in pipeline stuff originates and where stuff is consumed.
Usually I think that stageMask == srcStageMask, but as a matter of design, Vulkan driver is not forced to remember your own stuff for you. It will simply ask you again.

Vulkan's execution model and sycnhronization [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am trying to clear up my confusion around Vulkan's execution model and I would like to have my understanding verified and get answers to questions that still remain unclear to me.
So my understanding is following:
The host and the device execute completely asynchronously with respect to each other. I have to use VkFence to synchronize between them, i.e. when I want to know that a particular submission has finished executing on the device, I have to wait on the host for the appropriate VkFence to be signaled.
Different command queues execute asynchronously with respect to each other. Vulkan specification does not provide any guarantees about the order in which submissions to these queues start or finish execution. So vkQueueSubmit on queue A executes completely independently from vkQueueSubmit on queue B and I have to use VkSemaphore in order to make sure that for example submission to queue B starts executing after the submission to queue A is finished.
However different commands submitted to the same command queue respect their submission order, which means that commands submitted later won't start execution unless commands submitted earlier have already started their execution, but on the other hand this does not mean that these later commands cannot finish execution before earlier commands.
State setting commands (e.g. vkCmdBindPipeline, vkCmdBindVertexBuffers ...) are not asynchronous and delayed for later (like e.g. vkCmdDraw). Actually they execute right away on the host (not on the device) and modify the state of VkCommandBuffer and this cumulatively modified state is used in recording action commands that come after.
From the perspective of synchronization VkRenderPass can be thought of as just a simpler interface to pipeline barriers. It can be thought of as having one pipeline barrier in the beginning of render pass instance (in place of vkCmdBeginRenderPass), one at the end of render pass instance (in place of vkCmdEndRenderPass) and one pipeline barrier after each subpass (in place of vkCmdNextSubpass).
In my head the mental model of how commands execute on a single command queue is as one huge stream of commands (ordered in the order that they were recorded to command buffer and the order that these command buffers were submitted to the queue) split by pipeline barriers. Each pipeline barrier splits the stream into two sections, commands before the barrier (section A) and commands after the barrier (section B). Commands in section B are allowed to start (or rather continue their execution with pipeline stage Y) only after all commands in section A have finished executing pipeline stage X.
Questions:
The Vulkan specification (section 2.2.1. Queue Operation) states:
Command buffer submissions to a single queue respect submission order
and other implicit ordering guarantees, but otherwise may overlap or
execute out of order. Other types of batches and queue submissions
against a single queue (e.g. sparse memory binding) have no implicit
ordering constraints with any other queue submission or batch.
Lets say that in my program I have only one general queue, that can issue all kinds of commands (graphics, compute, transfer, presentation, ...), so does the above statement mean the following ?
vkQueueSubmit #3 starts execution only after vkQueueSubmit #2 has already started execution, which starts only after vkQueueSubmit #1 has already started, ... but vkQueueBindSparse or vkQueuePresentKHR can start at any time regardless of when they were issued by the host ... In other words I always have to use VkSemaphore to ensure that presentation (vkQueuePresentKHR) starts at the right time (only after all my graphics work was submitted and executed and thus is ready to be presented).
I am a little bit confused with the definition of submission order within command buffers themselves. Specification states (section 6.2. Implicit Synchronization Guarantees):
1)
For commands recorded outside a render pass, this includes all other
commands recorded outside a render pass, including
vkCmdBeginRenderPass and vkCmdEndRenderPass commands; it does not
directly include commands inside a render pass.
2)
For commands recorded inside a render pass, this includes all other
commands recorded inside the same subpass, including the
vkCmdBeginRenderPass and vkCmdEndRenderPass commands that delimit the
same render pass instance; it does not include commands recorded to
other subpasses.
The first bullet point seems to be clear. The submission order is the order in which commands were recorded to command buffers, whilst whatever is inside of a vkCmdBeginRenderPass and vkCmdEndRenderPass block is considered as one command for the purpose of this bullet point. The second bullet point is a bit unclear to me though. How is the submission order defined here ? It is clear that any command within a specific subpass does not start its execution unless a previous command has already started its execution or unless vkCmdBeginRenderPass was executed. But what about different subpasses ? Does this mean that subpass 1 can start its execution before subpass 0 has started its execution ? This does not make sense to me. What would make sense, is if later subpasses are only allowed to start once previous subpasses have finished.
Vulkan specification (section 6.1.2. Pipeline Stages) states:
Execution of operations across pipeline stages must adhere to implicit
ordering guarantees, particularly including pipeline stage order.
Does this mean that for example Vertex shader stage from draw call 2 is not allowed to begin execution unless vertex shader stage from draw call 1 has already started its execution ?
My mental model of Vulkan's command queue execution (number 6 of my understanding) provokes the question, whether a pipeline barrier submitted to the beginning of a command buffer (B) would affect an earlier command buffer (A). I mean would it make the commands in command buffer B wait to start execution until commands in command buffer A are finished ? I read somewhere that synchronization between different command buffers is the job for events, but according to my understanding this should also be possible with barriers.
Also if I used VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT as source stage and VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT as destination stage of a pipeline barrier that should basically disable any overlap between the commands before and after the barrier, right ?
So as I see it, there are several different parallelisms in Vulkan:
Between CPU and GPU, these are synchronized with VkFence
Between different commands queues on the GPU, these are synchronized with VkSemaphore
Between different submissions to the same queue, exception seem to be submissions with vkQueueSubmit. These are also synchronized with VkSemaphore.
Between different draw calls. These are synchronized with pipeline barrier.
This one is the most confusing to me. So if I have a drawcall that in some way uses the results of any previous drawcall or writes to the same render target (framebuffer), then as far as I understand, I need to make sure that the later drawcall sees the memory effects of all previous drawcalls. But what about, when I am rendering a scene with a bunch of game characters, trees and buildings. Lets say that each such object is one drawcall and all these drawcalls write to the same framebuffer. Do I need to issue a memory barrier after every drawcall ? Intuitively this feels redundant and the demos that I checked out did not issue any barriers in this case, but are there any guarantees that drawcalls logically following after will see the memory effects of drawcalls logically before them ? The question is, when do I need to synchronize between different drawcalls ?
Within a single draw call. Synchronization on this level is possible with shader atomic instructions.
However as far as I am not doing anything unusual, like writing to the same memory address from multiple shader instances or reading from the same memory that I have just written to (e.g. implementation of custom blending in fragment shader), I should be fine. In other words if every fragment shader reads and writes only its corresponding pixel or vertex data, I do not need to worry about synchronization within the same drawcall.
The host and the device execute completely asynchronously with respect to each other.
Yes.
Unless explicit synchronization is used (that is VkFence, vk*WaitIdle, VkEvent). Or the one rare implicit synchronization ( host writes are visible to device access from any subsequent vkQueueSubmit).
Do note there also has to be a "memory domain operation". I.e. you must use VK_PIPELINE_STAGE_HOST_BIT when reading output of GPU on the CPU. (VkFence alone, doing the execution and memory dependency, does not suffice).
Different command queues execute asynchronously with respect to each other.
Correct. In other words, commands from any two queues may run serially, next to each other (in parallel), or even be pre-empted and time-shared, or some combination of the above. Anything goes. Unless explicit synchronization (VkSemaphore or VkFence) is used.
However different commands submitted to the same command queue respect their submission order
Yes. But it is only specification formalism that has no real-world effect. It is only specified so we have formal linguistic framework in which to describe other stuff in the specification (e.g. it specifies nomenclature necessary to describe the behavior of pipeline barriers).
State setting commands (e.g. vkCmdBindPipeline, vkCmdBindVertexBuffers ...) are not asynchronous and delayed for later (like e.g. vkCmdDraw).
No, that is not exactly how I would describe it.
They are not "delayed". They are simply executed exactly where they are recorded in the command buffers.
This is perhaps one of the things where we need the "submission order" formalism. All commands later in submission order after state command see the new state. (I.e. only the commands recorded after the state command see the new state).
From the perspective of synchronization VkRenderPass can be thought of as just a simpler interface to pipeline barriers.
I don't think so. It is actually perhaps bit more complex.
What it does is more efficient synchronization, although it perhaps defines functionally the same synchronization as pipeline barriers could. What it does differently is that (among other things) it defines this synchronization as a monolith (i.e. you tell the driver upfront what resources you are gonna use, and you outline all the things you are gonna do to them later).
Render Pass is a harness necessitated by mobile tiling architecture GPU. On desktop it is also useful if they have some architectural inspiration from the mobile GPUs, or simply as an oracle for driver optimization.
so does the above statement mean the following ? vkQueueSubmit #3 starts execution only after vkQueueSubmit #2 has already started execution, which starts only after vkQueueSubmit #1 has already started
Yes, and no. Read above about the formalism of submission order.
Technically, yes, the commands are guaranteed to execute its VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT in order. But that stage does nothing.
AIS, it is only specification formalism used for other things. It does not say anything in of itself.
I am a little bit confused with the definition of submission order within command buffers themselves.
Yes, the language is bit tricky. The part that trips you up is the subpasses. Note that subpasses are by definition also asynchronous. Therefore we cannot use the simple rule in quote "1)".
If I decode it, what the spec quote means is:
a) Any command recorded before the Render Pass Instance (i.e. before vkCmdBeginRenderPass) is earlier in submission order than the vkCmdBeginRenderPass, and earlier than any and all the commands in the subpasses. (And vice versa, anything in the subpasses is later in submission order.)
b) Similarly any command recorded after the Render Pass Instance (i.e. after vkCmdEndRenderPass) is later in submission order than the vkCmdEndRenderPass, and later than any and all the commands in the subpasses.
c) The commands in a single subpass have the submission order same as the order they were recorded in (vkCmd*).
d) Commands in any two subpasses do not have submission order wrt each other.
Remember submission order is only a formalism. What "d)" means in reality is only that you cannot execute vkCmdPipelineBarrier in subpass 1 and expect that barrier to cover anything from subpass 0. (What you must do is use the VkSubpassDependency instead of vkCmdPipelineBarrier to achieve dependency between subpass 0 and 1.)
Execution of operations across pipeline stages must adhere to implicit ordering guarantees, particularly including pipeline stage order.
This is only an introductory statement linking to some of the other stuff in the specification. It does not say anything in of itself.
"implicit ordering guarantees" links to the submission order we covered.
"pipeline stage order" simply links to pipeline stage ordering. This simply specifies "logical order" between pipeline stages (e.g. Vertex Shader is before Fragment Shader). What it means is whenever you use stage flag bit in any srcStage parameter, Vulkan will implicitly assume you also mean any logically earlier stage flag bit. (And similarly for dstStage).
My mental model of Vulkan's command queue execution (number 6 of my understanding) provokes the question, whether a pipeline barrier submitted to the beginning of a command buffer (B) would affect an earlier command buffer (A)
Yes, that is the general idea.
Think of it like this: vkQueueSubmit concatenates the commands from command buffer at the end of the Queue. It is called "queue" for a reason. Therefore a pipeline barrier affects the command buffer that was submitted earlier. (And BTW that's why it is called submission order)
Also if I used VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT as source stage and VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT as destination stage of a pipeline barrier that should basically disable any overlap between the commands before and after the barrier, right ?
Yes, but that is a code rot.
In this case use VK_PIPELINE_STAGE_ALL_COMMANDS_BIT instead. It is much easier to understand for anyone reading such code.
So as I see it, there are several different parallelisms in Vulkan:
Asynchrony.
Parallelism is not guaranteed. I.e. the driver is allowed to serialize the workload, or time-share it.
But e.g. with some common sense you can guess there will be (notable) parallelism between CPU and GPU, if it is a dedicated GPU.
The question is, when do I need to synchronize between different drawcalls ?
Yes, I think no framebuffer sync between draw commands is one of the exceptions\simplifications Vulkan has.
I believe people support it by the specification of Primitive Order and Rasterization Order.
I.e. in a single subpass you should not need a pipeline barrier between two vkCmdDraw* to synchronize the color and depth buffer. (I think) you still need to explicitly synchronize draw in a subpass with other subpasses and with outside of the render pass instance.
However as far as I am not doing anything unusual, like writing to the same memory address from multiple shader instances or reading from the same memory that I have just written to (e.g. implementation of custom blending in fragment shader), I should be fine.
Yes. The pipeline and the fixed and programmable stages should work similarly as in OpenGL. You should for most part be able to use OpenGL's shaders with little to no modification and achieve the same behavior.

How do you change the maximum count of a semaphore after you've created it?

If you have a semaphore that is being used to restrict access to a shared resource or limit the number of concurrent actions, what is the locking algorithm to be able to change the maximum value of that semaphore once it's in use?
Example 1
In NSOperationQueue, there is a property named maxConcurrentOperationCount. This value can be changed after the queue has been created. The documentation notes that changing this value doesn't affect any operations already running, but it does affect pending jobs, which presumably are waiting on a lock or semaphore to execute.
Since that semaphore is potentially being held by pending operations, you can just replace it with one with a new count. So another lock must be needed in the change somewhere, but where?
Example 2:
In most of Apple's Metal sample code, they use a semaphore with an initial count of 3 to manage in-flight buffers. I'd like to experiment changing that number while my application is running, just to see how big of a difference it makes. I could tear down the entire class that uses that semaphore and then rebuild the Metal pipeline, but that's a bit heavy handed. Like above, I'm curious how I can structure a sequence of locks or semaphores to allow me to swap out that semaphore for a different one while everything is running.
My experience is with Grand Central Dispatch, but I'm equally interested in a C++ implementation that might use those locking or atomic constructs.
I should add that I'm aware I can technically just make unbalanced calls to signal and wait but that doesn't seem right to me. Namely, whatever code that is making these changes needs to be able to block itself if wait takes awhile to reduce the count...

Is an Empty VkCommandBuffer executed when submitted to a queue?

Hey guys I wonder if we submit a VkSubmitInfo containing one empty VkCommandBuffer to the queue, if it will be executed or ignored. I mean will the semaphores in VkSubmitInfo::pWaitSemaphore and VkSubmitInfo::pDestSemaphore still be considered when submitting an empty VkCommandBuffer ?
Looks a stupid question but what I want is to "multiply" the only one semaphore that gets out of the vkAcquireNextImageKHR.
I mean, I want to submit an empty commandbuffer with VkSubmitInfo::pWaitSemaphore pointing to "acquire_semaphore", and having VkSubmitInfo::pDstSemaphore having as many semaphores as I need.
if it will be executed or ignored.
What would be the difference? If there are no commands in the command buffer, then executing it will do nothing.
I mean, I want to submit an empty commandbuffer with VkSubmitInfo::pWaitSemaphore pointing to "acquire_semaphore", and having VkSubmitInfo::pDstSemaphore having as many semaphores as I need.
This has nothing to do with the execution of the CB itself. The behavior of a batch doesn't change just because the CB doesn't do anything.
However, unless you have multiple queues waiting on the completion of this queue's operations, there's really no reason to have multiple destination semaphores. The batch containing the real work could just wait on the pWaitSemaphores.
Also, there's no reason to have empty batches that only wait on a single semaphore. Let's say you have batch Q, which signals the pWaitSemaphores that this empty batch signals. Well, there's no reason that batch Q's pDstSemaphores couldn't signal the semaphores that you want the empty batch to signal. After all, vkQueueSubmit semaphore wait operations have, as its destination command scope, all subsequent commands for that queue from vkQueueSubmit calls, the current one or subsequent ones.
So you would only need an empty batch if you have to wait on multiple semaphores that are signals from different batches on different queues. And such a complex dependency layout strongly suggests an over-complicated dependency design that will lead to reduced performance.
Even waiting on acquire makes no sense for this. You only need to wait on acquire if that queue is going to manipulate to the acquired image. Well, you can't manipulate an image from multiple queues simultaneously. So there's no point in signaling a bunch of semaphores when acquire completes; that's why acquire only takes one.
So I want to simulate a Fence only with semaphores and see what goes faster.
This suggests strongly that you're thinking about things incorrectly.
You use a fence when you want the CPU to detect the completion of a GPU operation. For vkAcquireNextImageKHR, you would use a fence if you need the CPU to know when the image has been acquired.
Semaphores are about the GPU detecting when a GPU operation has completed, regardless of whether the operation comes from a queue or not. So if the GPU needs to wait until an image is acquired, you use a semaphore.
It doesn't matter which is faster because they do different things.

Timed simulation with SystemC

With reference to question SystemC module not working with SC_THREAD, a timed simulation is imitated using next_trigger(). As I understood from this article, this restarts the thread after the specified time:
next_trigger(double, sc_time_unit): The process shall be triggered when specified time has elapsed.
I.e. it effectively executes the operations after the occurrence of this instruction after the time specified, but also executes the operations found before that instruction. I have the feeling that the repeated utilization of next_trigger within an SC_THREAD may result in 'glitches' in the simulation.
Q1: Is my feeling correct?
Q2: Is there another possibility to delay execution (something that suspending the thread for the given time, rather than restarting it)
First of all next_trigger can only be used with SC_METHOD's as mentioned here:
next_trigger() is used with process methods, one's which are not threads.
Here are a few pointer's in term of SystemC processes:
SC_METHOD's are processes which must complete it's execution at one pass.(e.g.: a simple function call)
Note: Do not use while(1) loops in SC_METHOD's.
SC_THREAD's are processes which are separate thread of execution, one must explicitly use wait() statements here to synchronize the SystemC kernel simulation. This is the place where you will mostly find while(1) (infinite) loops in use.
For suspending the thread for some simulation time you can use the wait() statement to introduce the perceived delay.
But for better understanding you need to understand the difference between static and dynamic sensitivity in SystemC refer here for more information.