HOST_WRITE useful or validation layers wrong?

In the spec, this is written about HOST_WRITE_BIT:
For host writes to be seen by subsequent command buffer operations, a pipeline barrier from a source of VK_ACCESS_HOST_WRITE_BIT and VK_PIPELINE_STAGE_HOST_BIT to a destination of the relevant device pipeline stages and access types must be performed. Note that such a barrier is performed implicitly upon each command buffer submission, so an explicit barrier is only rarely needed
However, when you transition an image (through vkCmdPipelineBarrier, so in a command buffer) from the PREINITIALIZED layout with a srcAccessMask of 0 instead of HOST_WRITE_BIT, you get an error:
validation layer: Source AccessMask 0 [None] must have required access bit 16384 [VK_ACCESS_HOST_WRITE_BIT] when layout is VK_IMAGE_LAYOUT_PREINITIALIZED, unless the app has previously added a barrier for this transition.
Is there an error in the spec? In the validation layers? Is the barrier we are talking about a pure execution barrier rather than a memory barrier? Am I missing something?
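For reference, a minimal sketch of the transition in question (cmdBuf and image are hypothetical handles; a linear image created in PREINITIALIZED layout after a host write). Setting srcAccessMask to VK_ACCESS_HOST_WRITE_BIT instead of 0 is what the layer asks for:

VkImageMemoryBarrier barrier = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_HOST_WRITE_BIT, /* 0 here triggers the layer message */
    .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
    .oldLayout = VK_IMAGE_LAYOUT_PREINITIALIZED,
    .newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
    .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .image = image,
    .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
};
vkCmdPipelineBarrier(cmdBuf,
                     VK_PIPELINE_STAGE_HOST_BIT,            /* source: host writes */
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, /* destination: shader reads */
                     0, 0, NULL, 0, NULL, 1, &barrier);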

My private opinion is: validation layers bug.
It simply checks layout vs access flags and does not seem to be aware of this corner case:
https://github.com/KhronosGroup/Vulkan-LoaderAndValidationLayers/blob/master/layers/core_validation.cpp#L9005
There's also the case where you reconsider and do not host-write to the VK_IMAGE_LAYOUT_PREINITIALIZED image at all (and thus need no barrier), isn't there?
I believe the layer message is a WARNING, not an ERROR. That may mean it is "only" a heuristic and some false positives are expected (until they improve the layers, which seems possible, but not so trivial for this case).
They also only recently fixed the layers to allow 0 access flags for presentation, so it is not far-fetched that they would (in a similar vein) overlook something like this too.
I would report an issue there. They do not bite, and the worst that can happen is that some Khronos insider more knowledgeable than me explains why you are wrong.
That being said, perhaps VK_PIPELINE_STAGE_HOST_BIT is unnecessary too (and TOP_OF_PIPE should suffice)?

Is synchronization needed between draw calls for a 3D model and a GUI?

I'm in the process of learning Vulkan, and I have just integrated ImGui into my code using the Vulkan-GLFW example in the original ImGui repo, and it works fine.
Now I want to render both the GUI and my 3D model on the screen at the same time, and since the GUI and the model definitely need different shaders, I need to use multiple pipelines and submit multiple commands. The GUI is partly transparent, so I would like it to be rendered after the model. The Vulkan spec states that commands are not necessarily executed in the order I record them, so I need synchronization of some kind. In this Reddit post several methods for achieving exactly my goal were proposed, and I once believed that I had to use multiple subpasses (together with subpass dependencies), barriers, or other such synchronization methods to solve this problem.
Then I had a look at SaschaWillems' Vulkan examples; in the ImGui example, though, I see no synchronization between the two draw calls. It just records the command to draw the model first, and then the command to draw the GUI.
I am confused. Is synchronization really needed in this case, or did I misunderstand something about command re-ordering or blending? Thanks.
Think about what you're doing for a second. Why do you think there needs to be synchronization between the two sets of commands? Because the second set of commands needs to blend with the data in the first set, right? And therefore, it needs to do a read/modify/write (RMW), which must be able to read data written by the previous set of commands. The data being read has to have been written, and that typically requires synchronization.
But think a bit more about what that means. Blending has to read from the framebuffer to do its job. But... so does the depth test, right? It has to read the existing sample's depth value, compare it with the incoming fragment, and then discard the fragment or not based on the depth test. So basically every draw call that uses a depth test contains a framebuffer read/modify/write.
And yet... your depth tests work. Not only do they work between draw calls without explicit synchronization, they also work within a draw call. If two triangles in a draw call overlap, you don't have any problem with seeing the bottom one through the top one, right? You don't have to do inter-triangle synchronization to make sure that the previous triangles' writes are finished before the reads.
So somehow, the depth test's RMW works without any explicit synchronization. So... why do you think that this is untrue of the blend stage's RMW?
The Vulkan specification states that commands, and stages within commands, will execute in a largely unordered way, with several exceptions. The most obvious is the presence of explicit execution barriers/dependencies. But it also says that the fixed-function per-sample testing and blending stages will always execute (as if) in submission order (within a subpass). Not only that, it requires that the triangles generated within a command also execute these stages (as if) in a specific, well-defined order.
That's why your depth test doesn't need synchronization; Vulkan requires that this is handled. This is also why your blending will not need synchronization (within a subpass).
So you have plenty of options (in order from fastest to slowest):
Render your UI in the same subpass as the non-UI. Just change pipelines as appropriate (see the sketch after this list).
Render your UI in a subpass with an explicit dependency on the framebuffer images of the non-UI subpass. While this is technically slower, it probably won't be slower by much if at all. Also, this is useful for deferred rendering, since your UI needs to happen after your lighting pass, which will undoubtedly be its own subpass.
Render your UI in a different render pass. This would only really be needed for cases where you need to do some full-screen work (SSAO) that would force your non-UI render pass to terminate anyway.
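A minimal sketch of the first option (cmdBuf, modelPipeline, uiPipeline, and the draw counts are hypothetical; the render pass is assumed to have already begun). No barrier is needed between the two draws, because blending executes in submission order:

/* Both draws in one subpass; blending honors submission order. */
vkCmdBindPipeline(cmdBuf, VK_PIPELINE_BIND_POINT_GRAPHICS, modelPipeline);
/* ...bind the model's descriptor sets and vertex/index buffers (omitted)... */
vkCmdDrawIndexed(cmdBuf, modelIndexCount, 1, 0, 0, 0);

vkCmdBindPipeline(cmdBuf, VK_PIPELINE_BIND_POINT_GRAPHICS, uiPipeline);
/* ...bind the UI's descriptor sets and vertex buffer (omitted)... */
vkCmdDraw(cmdBuf, uiVertexCount, 1, 0, 0);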

What is the difference between Symbolic and Concrete model checking when the search is bounded in time?

Could someone please spend a few words explaining, to someone who does not come from a formal-methods background, the difference between verifying a specification using Symbolic Model Checking and doing the same using Concrete Model Checking, when the search is bounded in time? I am referring to the definitions of symbolic and concrete model checking used in UPPAAL.
In particular, I wrote a program that uses the UPPAAL Java API to verify a query against a network of timed automata. If the query is verified, UPPAAL returns a symbolic trace to parse, or something else if it is not. If the verification takes too long, I halt the verification process, return a message, and move on to the next query to verify. Everything is good so far.
Recently, I have been playing around with UPPAAL Stratego, which natively offers the possibility of choosing a maximum time or depth of exploration to bound the search. However, this option can only be applied when the verification is carried out using Concrete Model Checking.
My question is: is there any difference between halting the symbolic verification process, as I do in my Java program, and what UPPAAL Stratego does natively? In both cases I don't get an answer (or a trace), but what about the "reliability" of the exploration?
Which would be better (i.e. more complete) between the two options? Halting the symbolic verification or halting the concrete verification?
My understanding so far is that in Symbolic Model Checking the possible states are defined using intervals of variables, whilst in Concrete Model Checking variables assume actual values. My view is that, in terms of "completeness", halting the SMC after some time is more "complete", since the exploration of the state space happens systematically using a BFS or DFS algorithm, and, if I use BFS, I can be "sure" that within N steps nothing bad happens. But again, my background in model checking is not rich, so I might have gotten it completely wrong.
Also, if you could drop any references about these strategies, it would be really appreciated.
Thanks!

Zero derivative calculation during optimization (no impact to objective)

I am writing an optimization code for a finite-difference radiation solver model. I started to use "src_indices" for connecting parameters rather than promoting all the variables. But when I changed the connection style, the optimization does not calculate derivatives, gives a "no impact to objective" error, and successfully terminates after the first iteration. I could not find any clue for locating the error in the logs (the bug may have a completely different cause).
Any suggestions on where I should start?
I uploaded the code to GitHub https://github.com/TufanAkba/opt_question
The first thing that comes to mind when you mention "design variables have no impact on objective" is that there may be a missing connection. Since this behavior only started after you changed the connection style, I think this is even more likely.
There are a couple of tools you can use to diagnose this. The first is the n2 viewer, which you can launch by typing the following at your command prompt:
openmdao n2 receiver_opt.py
This will launch a browser window that contains a graphical model viewer which is described in detail here. You can use this to explore the structure of your model. To find unconnected inputs in your model, look for any input blocks that are colored orange. These are technically connected to a hidden IndepVarComp called _auto_ivc, and will include design variables, which are set by the optimizer. You will want to look for any that should be connected to other component outputs.
OpenMDAO also has a connection viewer that just shows connections.
openmdao view_connections receiver_opt.py
You can use this tool to focus solely on the connections. It is described here. If you choose to use it, filter for _auto_ivc in the source output string to see the unconnected inputs.
If you reach this point, and are satisfied that all the connections are correct, then there are a couple of other possibilities:
Are all of your src_indices correct? Maybe some of them are an empty set, or maybe some create a "degenerate" case. For example, if you have a set of cascading components that each multiply an incoming vector by a diagonal matrix, and if your indices are [0] in one connection, and [4] in another connection, then you've effectively severed the entire model. None of our visualization tools can pick that up, and you will need to inspect the indices manually.
It could also be a derivative problem, though what you describe sounds like connections. In that case, I recommend using check_partials to look for any missing or incorrect derivatives.
Are you computing any derivatives using complex step? It is possible that you are losing the complex part of the calculation through a complex-unsafe operation. Checking your derivatives against 'fd' can help to find these.

What would be the value of VkAccessFlags for a VkBuffer or VkImage after being allocated memory?

So I create a bunch of buffers and images, and I need to set up a memory barrier for some reason.
How do I know what to specify in the srcAccessMask field for the barrier struct of a newly created buffer or image, seeing as at that point I wouldn't have specified the access flags for it? How do I decide what initial access flags to specify for the first memory barrier applied to a buffer or image?
Specifying initial values for other parameters in Vk*MemoryBarrier is easy since I can clearly know, say, the original layout of an image, but it isn't apparent to me what the value of srcAccessMask could be the first time I set up a barrier.
Is it based on the usage flags specified during creation of the object concerned? Or is there some other way that can be used to find out?
So, let's assume vkCreateImage and VK_IMAGE_LAYOUT_UNDEFINED.
Nowhere does the specification say that vkCreateImage defines some scheduled operation. So it is healthy to assume all its work is done as soon as it returns. Plus, the image does not even have memory yet.
So any synchronization needs would be those of the memory you bind to it. Let's assume it is just fresh memory from vkAllocateMemory. Similarly, nowhere does the specification say that it defines some scheduled operation.
Even so, there are really only two options. Either the implementation does nothing with the memory, or it null-fills it (for security reasons). In the case it null-fills it, that must be done in a way such that you cannot access the original data (even by exploiting synchronization errors). So it is healthy to assume the memory has no "synchronization baggage" on it.
So simply srcStageMask = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT (no previous outstanding scheduled operation) and srcAccessMask = 0 (no previous writes) should be correct.
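A minimal sketch under those assumptions (cmdBuf and image are hypothetical handles), preparing a freshly created image as a transfer destination:

/* First barrier on a brand-new image: nothing to wait on, nothing to flush. */
VkImageMemoryBarrier first = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
    .srcAccessMask = 0,                     /* no previous writes */
    .dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
    .oldLayout = VK_IMAGE_LAYOUT_UNDEFINED, /* old contents are discardable */
    .newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
    .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .image = image,
    .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
};
vkCmdPipelineBarrier(cmdBuf,
                     VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, /* no prior scheduled work */
                     VK_PIPELINE_STAGE_TRANSFER_BIT,
                     0, 0, NULL, 0, NULL, 1, &first);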

When to use VK_IMAGE_LAYOUT_GENERAL

It isn't clear to me when it's a good idea to use VK_IMAGE_LAYOUT_GENERAL as opposed to transitioning to the optimal layout for whatever action I'm about to perform. Currently, my policy is to always transition to the optimal layout.
But VK_IMAGE_LAYOUT_GENERAL exists. Maybe I should be using it when I'm only going to use a given layout for a short period of time.
For example, right now, I'm writing code to generate mipmaps using vkCmdBlitImage. As I loop through the sub-resources performing the vkCmdBlitImage commands, should I transition to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL as I scale down into a mip, then transition to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL when I'll be the source for the next mip before finally transitioning to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL when I'm all done? It seems like a lot of transitioning, and maybe generating the mips in VK_IMAGE_LAYOUT_GENERAL is better.
I appreciate the answer might be to measure, but it's hard to measure on all my target GPUs (especially because I haven't got anything running on Android yet) so if anyone has any decent rule of thumb to apply it would be much appreciated.
FWIW, I'm writing Vulkan code that will run on desktop GPUs and Android, but I'm mainly concerned about performance on the latter.
You would use it when:
1. You are lazy
2. You need to map the memory to host (unless you can use PREINITIALIZED)
3. When you use the image as multiple incompatible attachments and you have no choice
4. For storage images
5. Other cases where you would otherwise switch layouts too often (and don't even need barriers) relative to the work done on the images. Measurement is needed to confirm GENERAL is better in that case; most likely it is a premature optimization even then.
PS: You could transition all the mip levels together to TRANSFER_DST with a single command beforehand, and then transition only the one you need to SRC, as sketched below. With a decent HDD, it would probably be even better to store the textures with mip-maps already generated, if that's an option (and you might even get better quality using some sophisticated algorithm).
PS2: Too bad there's no dedicated mip-map creation command. vkCmdBlitImage most likely does it under the hood anyway for images smaller than half resolution...
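A rough sketch of that scheme (cmdBuf, image, mipLevels, baseWidth, and baseHeight are hypothetical): every level starts in TRANSFER_DST, and each level is flipped to TRANSFER_SRC just before it feeds the next one.

/* Assumes all mip levels are already in TRANSFER_DST_OPTIMAL. */
int32_t w = baseWidth, h = baseHeight;
for (uint32_t i = 1; i < mipLevels; ++i) {
    /* Flip level i-1 from DST to SRC so the blit can read it. */
    VkImageMemoryBarrier toSrc = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT,
        .oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
        .newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .image = image,
        .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, i - 1, 1, 0, 1 },
    };
    vkCmdPipelineBarrier(cmdBuf, VK_PIPELINE_STAGE_TRANSFER_BIT,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,
                         0, 0, NULL, 0, NULL, 1, &toSrc);

    /* Blit level i-1 down into level i at half size. */
    VkImageBlit blit = {
        .srcSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, i - 1, 0, 1 },
        .srcOffsets = { { 0, 0, 0 }, { w, h, 1 } },
        .dstSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, i, 0, 1 },
        .dstOffsets = { { 0, 0, 0 }, { w > 1 ? w / 2 : 1, h > 1 ? h / 2 : 1, 1 } },
    };
    vkCmdBlitImage(cmdBuf, image, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                   image, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
                   1, &blit, VK_FILTER_LINEAR);

    if (w > 1) w /= 2;
    if (h > 1) h /= 2;
}
/* A final pass transitioning all levels to SHADER_READ_ONLY_OPTIMAL is omitted. */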
If you read from the mipmap[n] image to create the mipmap[n+1] image, then you should use the transfer layouts if you want your code to run on all Vulkan implementations and get the most performance across all of them, as the layout may be used by the GPU to optimize the image for reads or writes.
So if you want to go cross-vendor, only use VK_IMAGE_LAYOUT_GENERAL for setting up the descriptor that uses the final image, and not for the image reads or writes.
If you don't want to use that many transitions you may copy from a buffer instead of an image, though you obviously wouldn't get the format conversion, scaling and filtering that vkCmdBlitImage does for you for free.
Also don't forget to check if the target format actually supports the BLIT_SRC or BLIT_DST bits. This is independent of whether you use the transfer or general layout for copies.
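A quick way to check that (physicalDevice and format are hypothetical):

/* Query format features once at startup; blits need these bits. */
VkFormatProperties props;
vkGetPhysicalDeviceFormatProperties(physicalDevice, format, &props);
if (!(props.optimalTilingFeatures & VK_FORMAT_FEATURE_BLIT_SRC_BIT) ||
    !(props.optimalTilingFeatures & VK_FORMAT_FEATURE_BLIT_DST_BIT)) {
    /* Fall back: generate mips offline or copy from a staging buffer. */
}
/* VK_FILTER_LINEAR blits additionally need
   VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT. */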