In a 2d application where you're drawing a lot of individual sprites, will the rasterization stage inevitably become a bottleneck? [duplicate] - vulkan

I'm in the process of learning Vulkan, and I have just integrated ImGui into my code using the Vulkan-GLFW example in the original ImGui repo, and it works fine.
Now I want to render both the GUI and my 3D model on the screen at the same time, and since the GUI and the model definitely need different shaders, I need to use multiple pipelines and submit multiple commands. The GUI is partly transparent, so I would like it to be rendered after the model. The Vulkan spec states that commands are not necessarily executed in the order they are recorded, so I need synchronization of some kind. In this Reddit post several methods of achieving exactly this were proposed, and I once believed that I had to use multiple subpasses (together with subpass dependencies), barriers, or other synchronization methods like that to solve this problem.
Then I had a look at SaschaWillems' Vulkan examples; in the ImGui example, though, I see no synchronization between the two draw calls: it just records the commands to draw the model first, and then the commands to draw the GUI.
I am confused. Is synchronization really needed in this case, or did I misunderstand something about command re-ordering or blending? Thanks.

Think about what you're doing for a second. Why do you think there needs to be synchronization between the two sets of commands? Because the second set of commands needs to blend with the data in the first set, right? And therefore, it needs to do a read/modify/write (RMW), which must be able to read data written by the previous set of commands. The data being read has to have been written, and that typically requires synchronization.
But think a bit more about what that means. Blending has to read from the framebuffer to do its job. But... so does the depth test, right? It has to read the existing sample's depth value, compare it with the incoming fragment, and then discard the fragment or not based on the depth test. So basically every draw call that uses a depth test contains a framebuffer read/modify/write.
And yet... your depth tests work. Not only do they work between draw calls without explicit synchronization, they also work within a draw call. If two triangles in a draw call overlap, you don't have any problem with seeing the bottom one through the top one, right? You don't have to do inter-triangle synchronization to make sure that the previous triangles' writes are finished before the reads.
So somehow, the depth test's RMW works without any explicit synchronization. So... why do you think that this is untrue of the blend stage's RMW?
The Vulkan specification states that commands, and stages within commands, will execute in a largely unordered way, with several exceptions. The most obvious being the presence of explicit execution barriers/dependencies. But it also says that the fixed-function per-sample testing and blending stages will always execute (as if) in submission order (within a subpass). Not only that, it requires that the triangles generated within a command also execute these stages (as if) in a specific, well-defined order.
That's why your depth test doesn't need synchronization; Vulkan requires that this is handled. This is also why your blending will not need synchronization (within a subpass).
So you have plenty of options (in order from fastest to slowest):
Render your UI in the same subpass as the non-UI. Just change pipelines as appropriate (see the sketch after this list).
Render your UI in a subpass with an explicit dependency on the framebuffer images of the non-UI subpass. While this is technically slower, it probably won't be slower by much if at all. Also, this is useful for deferred rendering, since your UI needs to happen after your lighting pass, which will undoubtedly be its own subpass.
Render your UI in a different render pass. This would only really be needed for cases where you need to do some full-screen work (SSAO) that would force your non-UI render pass to terminate anyway.
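A minimal sketch of the first option, assuming the render pass, pipelines, and draw parameters have been created elsewhere (all names here are hypothetical):

```cpp
#include <vulkan/vulkan.h>

// Sketch: model and GUI recorded into the same subpass. Per-sample
// blending is guaranteed to execute in submission order within a
// subpass, so the transparent UI blends over the model without any
// explicit barrier between the draws.
void recordSceneAndUi(VkCommandBuffer cmd,
                      const VkRenderPassBeginInfo& beginInfo,
                      VkPipeline scenePipeline, uint32_t sceneIndexCount,
                      VkPipeline uiPipeline, uint32_t uiIndexCount)
{
    vkCmdBeginRenderPass(cmd, &beginInfo, VK_SUBPASS_CONTENTS_INLINE);

    // Draw the 3D model first...
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, scenePipeline);
    // ... bind scene descriptor sets / vertex and index buffers here ...
    vkCmdDrawIndexed(cmd, sceneIndexCount, 1, 0, 0, 0);

    // ...then the GUI, with a different pipeline but no synchronization.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, uiPipeline);
    // ... bind UI descriptor sets / vertex and index buffers here ...
    vkCmdDrawIndexed(cmd, uiIndexCount, 1, 0, 0, 0);

    vkCmdEndRenderPass(cmd);
}
```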

Related

Why can some commands be recorded only outside of a render pass?

I don't know whether it is an API feature (I'm almost sure it's not) or a GPU specific, but why, for example, can vkCmdWaitEvents be recorded inside and outside of a render pass, while vkCmdResetEvent can be recorded only outside? The same applies to other commands.
When it comes to event setting in particular, it plays havoc with how the render pass model interacts with tile-based renderers.
Recall that the whole point of the complexity of the render pass model is to service the needs of tile-based renderers (TBRs). When a TBR encounters a complex series of subpasses, the way it wants to execute them is as follows.
It does all of the vertex processing stages for all of the rendering commands for all of the subpasses, all at once, storing the resulting vertex data in a buffer for later consumption. Then for each tile, it executes the rasterization stages for each subpass on the primitives that are involved in the building of that tile.
Note that this is the ideal case; specific things can make it fail to various degrees, but even then, it tends to fail in batches, where you can still execute several subpasses of a render pass like this.
So let's say you want to set an event in the middle of a subpass. OK... when does that actually happen? Remember that a set-event command actually sets the event after all of the preceding commands have completed. In a TBR, if everything is proceeding as above, when does it get set? Well, ideally, all vertex processing for the entire render pass is supposed to happen before any rasterization, so setting the event would have to happen after the vertex processing is done. And all rasterization processing happens on a tile-by-tile basis, processing whichever primitives overlap that tile. Because of this fragmented rendering process, it's difficult to know when an individual rendering command has completed.
So the only place the set-event call could happen is... after the entire renderpass has completed. That is obviously not very useful.
The alternative is to have the act of issuing a vkCmdSetEvent call fundamentally reshape how the implementation builds the entire render pass: to break up the subpass into the stuff that happens before the event and the stuff that happens after the event.
But the reason why VkRenderPass is so big and complex, the reason why VkPipelines have to reference a specific subpass of a render pass, and the reason why vkCmdPipelineBarrier within a render pass requires you to specify a subpass self-dependency, is so that a TBR implementation can know up front when and where it will have to break the ideal TBR rendering scheme. Having a function introduce that breakup without forewarning works against this idea.
Furthermore, Vulkan is designed so that, if something is going to have to be implemented highly inefficiently, then it is either impossible to do directly or the API makes it look really inefficient. vkCmd(Re)SetEvent cannot be efficiently implemented within a render pass on TBR hardware, so you can't do it, period.
Note that vkCmdWaitEvents doesn't have this problem, because the system knows that the wait is waiting on something outside of a render pass. So it's just some particular stage that has to wait on the event to complete. If it's a vertex stage doing the waiting, it's easy enough to set that wait at the beginning of that command's processing. If it's a fragment stage, it can just insert the wait at the beginning of all rasterization processing; it's not the most efficient way to handle it, but since all vertex processing has executed, odds are good that the event has been set by then.
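As a concrete illustration of that, here is a hedged sketch that sets an event outside a render pass and waits on it inside one; the stage masks and handle names are assumptions, and only the execution dependency is shown (any memory dependencies would need to be expressed separately):

```cpp
#include <vulkan/vulkan.h>

// Sketch: the event is set outside the render pass, so waiting on it
// inside the pass is legal; the implementation just stalls the chosen
// destination stage. All handles are assumed to be created elsewhere.
void recordEventWait(VkCommandBuffer cmd, VkEvent event,
                     const VkRenderPassBeginInfo& beginInfo)
{
    // Signal the event once all transfer work recorded so far completes.
    vkCmdSetEvent(cmd, event, VK_PIPELINE_STAGE_TRANSFER_BIT);

    vkCmdBeginRenderPass(cmd, &beginInfo, VK_SUBPASS_CONTENTS_INLINE);

    // Make fragment shading wait for the event; no barriers attached,
    // just the execution dependency described in the text above.
    vkCmdWaitEvents(cmd, 1, &event,
                    VK_PIPELINE_STAGE_TRANSFER_BIT,        // srcStageMask
                    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, // dstStageMask
                    0, nullptr, 0, nullptr, 0, nullptr);

    // ... draws that consume the transferred data ...

    vkCmdEndRenderPass(cmd);
}
```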
For other kinds of commands, recall that the dependency graph of everything that happens within a render pass is defined within VkRenderPass itself. The subpass dependency graph is there. You can't even issue a normal vkCmdPipelineBarrier within a render pass, not unless that subpass has an explicit self-dependency in the subpass dependency graph.
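For reference, a subpass self-dependency is just a VkSubpassDependency whose source and destination are the same subpass. A sketch, with illustrative stage and access masks for a fragment-shader feedback loop:

```cpp
#include <vulkan/vulkan.h>

// Sketch: srcSubpass == dstSubpass declares a self-dependency, which is
// what makes a matching vkCmdPipelineBarrier legal inside that subpass.
// The masks below are illustrative assumptions.
VkSubpassDependency makeSelfDependency(uint32_t subpass)
{
    VkSubpassDependency dep{};
    dep.srcSubpass      = subpass;  // same subpass on both sides
    dep.dstSubpass      = subpass;
    dep.srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    dep.dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    dep.srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    dep.dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
    // Framebuffer-space self-dependencies must be by-region.
    dep.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
    return dep;
}
```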
So what good would it be to issue a compute shader dispatch or a memory transfer operation in the middle of a subpass if you cannot wait for the operation to finish in that subpass or a later one? If you can't wait on the operation to end, then you cannot use its results. And if you can't use its results... you may as well have issued it before the render pass.
And the reason you can't have other dependencies goes back to TBRs. The dependency graph is an inseparable part of the render pass to allow TBRs to know up-front what the relationship between subpasses is. That allows them to know whether they can build their ideal renderer, and when/where that might break down.
Since the TBR model of render passes makes such waiting impractical, there's no point in allowing you to issue such commands.
Because a renderpass is a special construct that implies focusing work solely on the framebuffer.
In addition, each of the subpasses is allowed to run in parallel unless there is an explicit dependency between them.
This affects how they would need to be synchronized with instructions in other subpasses.
Copies dominate use of the memory bus and would stall render work that depends on them. Doing that inside the render pass creates a big GPU bubble that can easily be avoided by putting the copy outside the render pass and making sure it's finished by the time you start the render pass.
Some hardware also has dedicated copy units that are separate from the graphics hardware, so the less synchronizing you need to do between them, the better.

vulkan: pWaitDstStageMask member in VkSubmitInfo

In VkSubmitInfo, when pWaitDstStageMask[0] is VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, the Vulkan implementation executes pipeline stages without waiting for pWaitSemaphores[0] until it reaches the color attachment output stage.
However, if the command buffer has multiple subpasses and multiple draw commands, does pWaitDstStageMask apply to those stages of all draw commands?
If I want the Vulkan implementation to wait on the semaphore when it reaches the color attachment output stage of the last subpass, what should I do?
You probably don't actually want to do this. On hardware that benefits from multi-subpass render passes, the fragment work for the entire render pass will be scheduled and executed as essentially a single monolithic chunk of work. E.g., all subpasses will execute for some pixel (x,y) regions before any subpasses are executed for other pixel (x,y) regions. So it doesn't really make sense to, say, insert a synchronization point on an external event between two subpasses. So you need to think about what your render pass is doing and whether it is really open to the kinds of optimizations subpasses were designed for.
If not, then treating the subpasses (or at least the final one) as independent renderpasses isn't going to be a loss anyway, so you might as well just put it in a separate renderpass in a separate submit batch, and put the semaphore wait before it.
If so, then you just want to do the semaphore wait before the COLOR_ATTACHMENT stage for the whole renderpass anyway.
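A minimal sketch of that whole-batch wait, with hypothetical handle names; the wait stage applies to every command buffer in the submission:

```cpp
#include <vulkan/vulkan.h>

// Sketch: vertex and other early work may start before the semaphore is
// signaled; only color attachment output waits for it.
void submitWithColorWait(VkQueue queue, VkCommandBuffer cmd,
                         VkSemaphore imageAvailable, VkSemaphore renderDone,
                         VkFence fence)
{
    VkPipelineStageFlags waitStage =
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

    VkSubmitInfo submit{};
    submit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.waitSemaphoreCount   = 1;
    submit.pWaitSemaphores      = &imageAvailable;
    submit.pWaitDstStageMask    = &waitStage;
    submit.commandBufferCount   = 1;
    submit.pCommandBuffers      = &cmd;
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores    = &renderDone;

    vkQueueSubmit(queue, 1, &submit, fence);
}
```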
In such a situation you have (I think) two options:
You can split the render pass: exclude the last subpass and submit its commands in a separate render pass recorded in a separate command buffer, so you can specify a semaphore for it to wait on (but this doesn't sound too reasonable), or...
You can use events: you should signal an event after the commands that generate the results later commands require, and then, in the last subpass, wait on that event just before the commands that actually need to wait.
The second approach is probably preferable (even though you are not using the submission's pWaitSemaphores and pWaitDstStageMask fields), but it also has its restrictions:
vkCmdWaitEvents must not be used to wait on event signal operations occurring on other queues.
And I'm not sure, but maybe subpass dependencies may also help you here. Clever definitions of the submission's pWaitSemaphores and the render pass's subpass dependencies may be enough to do the job. But I'm not too confident in explaining subpass dependencies (I'm not sure I fully understand them), so don't rely on this. Maybe someone can confirm it. But the above two options will definitely do the trick.

Multiple instances of same Vulkan subpass

I have been reading through many online tutorials on creating a Vulkan renderer, however, the idea of subpasses is still very unclear to me.
Say I have the following scenario: I need to do a first subpass for setup (fill a depth buffer for testing etc) then have a subpass for every light in the scene (the number of which could change at any time). Because each lighting subpass is exactly the same, would it be possible to declare 2 subpasses and have multiple instances of the same subpass?
The term "pass" here does not mean "full-screen pass" or something like that. Subpasses only matter in terms of what you're rendering to (and reading from previous subpass renderings as input attachments). Where your data comes from (descriptors/push constants), what vertex data they get, what shaders they use, none of that matters to the subpass. The only things the subpass controls are render targets.
So unless different lights are rendering to different images, then there's no reason to give each light a subpass. You simply issue the rendering commands for all of your lights within the same subpass.
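For example, here is a hedged sketch of per-light draws inside one subpass, assuming additive blending is set up in the pipeline and a hypothetical LightData struct is fed to the fragment shader via push constants:

```cpp
#include <vulkan/vulkan.h>

// Hypothetical per-light parameters; the layout must match the shader.
struct LightData { float position[4]; float color[4]; };

// Sketch: every light is just another draw in the same subpass; only
// the pushed data changes between draws, not the render targets.
void drawLights(VkCommandBuffer cmd, VkPipelineLayout layout,
                const LightData* lights, uint32_t lightCount)
{
    for (uint32_t i = 0; i < lightCount; ++i) {
        vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_FRAGMENT_BIT,
                           0, sizeof(LightData), &lights[i]);
        vkCmdDraw(cmd, 3, 1, 0, 0);  // e.g. a full-screen triangle per light
    }
}
```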

When to use VK_IMAGE_LAYOUT_GENERAL

It isn't clear to me when it's a good idea to use VK_IMAGE_LAYOUT_GENERAL as opposed to transitioning to the optimal layout for whatever action I'm about to perform. Currently, my policy is to always transition to the optimal layout.
But VK_IMAGE_LAYOUT_GENERAL exists. Maybe I should be using it when I'm only going to use a given layout for a short period of time.
For example, right now, I'm writing code to generate mipmaps using vkCmdBlitImage. As I loop through the sub-resources performing the vkCmdBlitImage commands, should I transition to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL as I scale down into a mip, then transition to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL when that mip will be the source for the next one, before finally transitioning to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL when I'm all done? It seems like a lot of transitioning, and maybe generating the mips in VK_IMAGE_LAYOUT_GENERAL is better.
I appreciate the answer might be to measure, but it's hard to measure on all my target GPUs (especially because I haven't got anything running on Android yet) so if anyone has any decent rule of thumb to apply it would be much appreciated.
FWIW, I'm writing Vulkan code that will run on desktop GPUs and Android, but I'm mainly concerned about performance on the latter.
You would use it when:
You are lazy
You need to map the memory to host (unless you can use PREINITIALIZED)
You use the image as multiple incompatible attachments and have no choice
For storage images
Other cases where you would otherwise switch layouts too often relative to the work done on the images (and you don't even need barriers). Measurement is needed to confirm GENERAL is better in that case; most likely a premature optimization even then.
PS: You could transition all the mipmaps together to TRANSFER_DST with a single command beforehand and then transition only the one you need to SRC. With a decent HDD, it would be even better to store the images with their mipmaps already generated, if that's an option (and perhaps even get better quality using some sophisticated algorithm).
PS2: Too bad there's no mipmap-creation command. vkCmdBlitImage most likely does it anyway under the hood for images smaller than half resolution...
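A hedged sketch of the suggestion from the PS above, assuming level 0 has just been filled by a buffer copy (so it sits in TRANSFER_DST_OPTIMAL), the image was created with TRANSFER_SRC and TRANSFER_DST usage, and the final transition to SHADER_READ_ONLY_OPTIMAL happens elsewhere:

```cpp
#include <vulkan/vulkan.h>

// Sketch: one barrier puts levels 1..n-1 into TRANSFER_DST up front;
// after each blit, only the freshly written level flips to TRANSFER_SRC
// so the next iteration can read from it.
void generateMips(VkCommandBuffer cmd, VkImage image,
                  int32_t w, int32_t h, uint32_t mipLevels)
{
    if (mipLevels <= 1) return;

    VkImageMemoryBarrier barrier{};
    barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.image = image;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;

    // Levels 1..n-1 -> TRANSFER_DST in a single command beforehand.
    barrier.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 1, mipLevels - 1, 0, 1};
    barrier.oldLayout     = VK_IMAGE_LAYOUT_UNDEFINED;
    barrier.newLayout     = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    barrier.srcAccessMask = 0;
    barrier.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,
                         0, 0, nullptr, 0, nullptr, 1, &barrier);

    barrier.subresourceRange.levelCount = 1;
    for (uint32_t i = 1; i < mipLevels; ++i) {
        // Flip the level just written (i-1) to TRANSFER_SRC for reading.
        barrier.subresourceRange.baseMipLevel = i - 1;
        barrier.oldLayout     = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
        barrier.newLayout     = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
        barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
        barrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
        vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TRANSFER_BIT,
                             VK_PIPELINE_STAGE_TRANSFER_BIT,
                             0, 0, nullptr, 0, nullptr, 1, &barrier);

        VkImageBlit blit{};
        blit.srcSubresource = {VK_IMAGE_ASPECT_COLOR_BIT, i - 1, 0, 1};
        blit.srcOffsets[1]  = {w, h, 1};
        w = w > 1 ? w / 2 : 1;
        h = h > 1 ? h / 2 : 1;
        blit.dstSubresource = {VK_IMAGE_ASPECT_COLOR_BIT, i, 0, 1};
        blit.dstOffsets[1]  = {w, h, 1};
        vkCmdBlitImage(cmd, image, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                       image, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
                       1, &blit, VK_FILTER_LINEAR);
    }
}
```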
If you read from the mipmap[n] image to create the mipmap[n+1] image, then you should use the transfer image layouts if you want your code to run on all Vulkan implementations and get the most performance across all of them, as the layout may be used by the GPU to optimize the image for reads or writes.
So if you want to go cross-vendor, only use VK_IMAGE_LAYOUT_GENERAL for setting up the descriptor that uses the final image, and not for image reads or writes.
If you don't want to use that many transitions you may copy from a buffer instead of an image, though you obviously wouldn't get the format conversion, scaling and filtering that vkCmdBlitImage does for you for free.
Also don't forget to check if the target format actually supports the BLIT_SRC or BLIT_DST bits. This is independent of whether you use the transfer or general layout for copies.
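That capability check might look like this hedged sketch:

```cpp
#include <vulkan/vulkan.h>

// Sketch: verify a format supports blitting before relying on
// vkCmdBlitImage for mip generation (optimal tiling assumed).
bool supportsBlit(VkPhysicalDevice gpu, VkFormat format)
{
    VkFormatProperties props;
    vkGetPhysicalDeviceFormatProperties(gpu, format, &props);
    const VkFormatFeatureFlags needed =
        VK_FORMAT_FEATURE_BLIT_SRC_BIT | VK_FORMAT_FEATURE_BLIT_DST_BIT;
    return (props.optimalTilingFeatures & needed) == needed;
}
```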

Working around WebGL readPixels being slow

I'm trying to use WebGL to speed up computations in a simulation of a small quantum circuit, like what the Quantum Computing Playground does. The problem I'm running into is that readPixels takes ~10ms, but I want to call it several times per frame while animating in order to get information out of gpu-land and into javascript-land.
As an example, here's my exact use case. The following circuit animation was created by computing things about the state between each column of gates, in order to show the inline-with-the-wire probability-of-being-on graphing.
The way I'm computing those things now, I'd need to call readPixels eight times for the above circuit (once after each column of gates). This is waaaaay too slow at the moment, easily taking 50ms when I profile it (bleh).
What are some tricks for speeding up readPixels in this kind of use case?
Are there configuration options that significantly affect the speed of readPixels? (e.g. the pixel format, the size, not having a depth buffer)
Should I try to make the readPixel calls all happen at once, after all the render calls have been made (maybe allows some pipelining)?
Should I try to aggregate all the textures I'm reading into a single megatexture and sort things out after a single big read?
Should I be using a different method to get the information back out of the textures?
Should I be avoiding getting the information out at all, and doing all the layout and rendering gpu-side (urgh...)?
Should I try to make the readPixel calls all happen at once, after all the render calls have been made (maybe allows some pipelining)?
Yes, yes, yes. readPixels is fundamentally a blocking, pipeline-stalling operation, and it is always going to kill your performance wherever it happens, because it's sending a request for data to the GPU and then waiting for it to respond, which normal draw calls don't have to do.
Do readPixels as few times as you can (use a single combined buffer to read from). Do it as late as you can. Everything else hardly matters.
Should I be avoiding getting the information out at all, and doing all the layout and rendering gpu-side (urgh...)?
This will get you immensely better performance.
If your graphics are all like you show above, you shouldn't need to do any “layout” at all (which is good, because it'd be very awkward to implement). Everything but the text is some kind of color or boundary animation which could easily be done in a shader, and all the layout can be just a static vertex buffer (each vertex has attributes pointing at the simulation-state texel it should depend on).
The text will be more tedious merely because you need to load all the digits into a texture to use as a spritesheet and do the lookups into that, but that's a standard technique. (Oh, and divide/modulo to get the digits.)
I don't know enough about your use case but just guessing, Why do you need to readPixels at all?
First, you don't need to draw text or the static parts of your diagram in WebGL. Put another canvas, svg, or img over the WebGL canvas and set the CSS so they overlap. Let the browser composite them; then you don't have to do it.
Second, let's assume you have a texture that has your computed results in it. Can't you just then make some geometry that matches the places in your diagram that need to have colors, and use texture coords to look up the results from the correct places in the results texture? Then you don't need to call readPixels at all. That shader can use a ramp texture lookup or any other technique to convert the results to other colors to shade the animated parts of your diagram.
If you want to draw numbers based on the result, you can use a technique like this: make a shader that references the result texture, looks up a result value, and then indexes glyphs from another texture based on that.
Am I making any sense?