Combined image samplers vs separate sampled image and sampler - Vulkan

I want to access multiple textures with the same sampling parameters from a fragment shader (for instance, a texture and a normal map). Moreover, the images change frequently while the sampler stays the same (suppose the texture is a video). I've found contradictory information about how this should be done. The Vulkan Cookbook states that using combined image samplers might have a performance benefit on some platforms, but this Reddit answer states that combined image samplers don't make any sense.
My question is: Is there any reason not to use separate sampled images and one sampler (for both images), considering it makes the program's logic simpler?

Odds are good that which one you pick will not be the primary limiting factor in your application's performance. Its speed is more likely to be determined by your own factors: how efficiently you build command buffers, walk through your data structures, and so forth.
So use whichever works best for your needs and move on.
this Reddit answer states that combined image samplers don't make any sense.
Considering that said "answer" claims that this statement from the specification:
On some implementations, it may be more efficient to sample from an image using
a combination of sampler and sampled image that are stored together in the
descriptor set in a combined descriptor.
"warns you that [combined image samplers] may not be as efficient on some platforms", it's best to just ignore whatever they said and move on.

Related

Let's get swapchain's image count straight

After getting familiar with tons of books, tutorials and documentation regarding Vulkan, I am still really confused about how swapchain image count works.
Documentation on swapchain image count:
VkSwapchainCreateInfoKHR::minImageCount is the minimum number of presentable images that the application needs. The implementation will either create the swapchain with at least that many images, or it will fail to create the swapchain.
After reading this field's description, my understanding is that if I create a swapchain with a minImageCount value greater than or equal to VkSurfaceCapabilitiesKHR::minImageCount and less than or equal to VkSurfaceCapabilitiesKHR::maxImageCount, then I will be able to acquire minImageCount images, because that is the number of images the application needs.
Let's assume the following values:
VkSurfaceCapabilitiesKHR::minImageCount == 2
VkSurfaceCapabilitiesKHR::maxImageCount == 8
VkSwapchainCreateInfoKHR::minImageCount == 3
In such a case I expect to be able to acquire 3 images from the swapchain: say, one being presented, one waiting to be presented, and one for drawing (just like in the triple-buffering case).
On the other hand, many tutorials advise setting VkSwapchainCreateInfoKHR::minImageCount to VkSurfaceCapabilitiesKHR::minImageCount + 1, explaining that not all images created in the swapchain can be acquired by the application, because some of them might be used by the driver internally.
Example: Discussion
Is there any reliable explanation of how to pick the number of images in the swapchain so that the application won't be forced to wait for image acquisition?
Ultimately, the details of image presentation are not in your control. Asking for more images makes it less likely that you will encounter a CPU-blocking situation, but there is no count, or other parameter, that can guarantee it can't happen.
However, you can easily tell when blocking would happen simply by looking at how vkAcquireNextImageKHR behaves with a timeout of 0. If it reports that no image could be acquired, then you know you would have to wait. This gives you the opportunity to decide what to do with that information.
When something like this happens, you can note that it happened. If it happens frequently enough, it may be worthwhile to recreate the swapchain with more images. Obviously, this is not a lightweight solution, but it would be fairly hardware-neutral.
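A rough sketch of that polling approach, assuming the device, swapchain and semaphore already exist (the names are illustrative):

```cpp
// Try to acquire without blocking; with a timeout of 0 the call returns
// immediately with VK_NOT_READY if no image is currently available.
uint32_t imageIndex = 0;
VkResult result = vkAcquireNextImageKHR(device, swapchain,
                                        0 /* timeout */,
                                        imageAvailableSemaphore,
                                        VK_NULL_HANDLE, &imageIndex);

if (result == VK_NOT_READY) {
    // No image right now: note it (e.g. bump a counter) and do other CPU
    // work instead of stalling. If this happens frequently, consider
    // recreating the swapchain with more images.
} else if (result == VK_SUCCESS || result == VK_SUBOPTIMAL_KHR) {
    // imageIndex is valid: record and submit the frame as usual.
}
```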

Why do I need resources per swapchain image

I have been following different tutorials and I don't understand why I need resources per swapchain image instead of per frame in flight.
This tutorial:
https://vulkan-tutorial.com/Uniform_buffers
has a uniform buffer per swapchain image. Why would I need that if different images are not in flight at the same time? Can I not start rewriting it once the previous frame has completed?
Also lunarg tutorial on depth buffers says:
And you need only one for rendering each frame, even if the swapchain has more than one image. This is because you can reuse the same depth buffer while using each image in the swapchain.
This doesn't explain anything, it basically says you can because you can. So why can I reuse the depth buffer but not other resources?
It is to minimize synchronization in the case of the simple Hello Cube app.
Let's say your uniforms change each frame. That means the main loop is something like:
1. Poll (or simulate)
2. Update (e.g. your uniforms)
3. Draw
4. Repeat
If step #2 did not have its own uniform, then it would need to write a uniform the previous frame is still reading. That means it has to sync with a fence, which would mean the previous frame is no longer considered "in-flight".
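To make that concrete, here is a minimal sketch of "one uniform buffer per frame in flight"; the buffer, mapped-pointer and fence arrays are assumed to have been created elsewhere (fences created signaled), and the FrameUniforms contents are purely illustrative:

```cpp
#include <cstring>
#include <vulkan/vulkan.h>

constexpr uint32_t kFramesInFlight = 2;

struct FrameUniforms { float mvp[16]; };              // illustrative contents

VkBuffer uniformBuffers[kFramesInFlight];             // one copy per in-flight frame
void*    uniformMapped[kFramesInFlight];              // persistently mapped pointers
VkFence  inFlightFences[kFramesInFlight];             // created with VK_FENCE_CREATE_SIGNALED_BIT
uint32_t currentFrame = 0;

void drawFrame(VkDevice device, const FrameUniforms& newUniforms)
{
    // Wait only for the frame that last used this slot (two frames ago when
    // kFramesInFlight == 2), so the CPU and GPU can overlap.
    vkWaitForFences(device, 1, &inFlightFences[currentFrame], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &inFlightFences[currentFrame]);

    // Safe to overwrite now: the GPU is no longer reading this copy.
    std::memcpy(uniformMapped[currentFrame], &newUniforms, sizeof(newUniforms));

    // ... acquire a swapchain image, record a command buffer that binds
    // uniformBuffers[currentFrame], submit it with inFlightFences[currentFrame],
    // then present ...

    currentFrame = (currentFrame + 1) % kFramesInFlight;
}
```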
It all depends on the way you are using your resources and the performance you want to achieve.
If, after each frame, you are willing to wait for the rendering to finish and you are still happy with the final performance, you can use only one copy of each resource. Waiting is the easiest synchronization: you are sure the resources are not used anymore, so you can reuse them for the next frame. But if you want to efficiently utilize both the CPU's and the GPU's power, and you don't want to wait after each frame, then you need to look at how each resource is being used.
The depth buffer is usually used only temporarily. If you don't perform any postprocessing, and if your render pass setup uses depth data only internally (you don't specify STORE for storeOp), then you can use a single depth buffer (depth image) all the time. This is because when rendering is done, the depth data isn't used anymore and can be safely discarded. The same applies to all other resources that don't need to persist between frames.
But if different data needs to be used for each frame, or if generated data is used in the next frame, then you usually need another copy of a given resource. Updating data requires synchronization; to avoid waiting in such situations you need to have a copy of the resource. So in the case of uniform buffers, you update the data in a given buffer and use it for a given frame. You cannot modify its contents until that frame is finished, so to prepare another frame of animation while the previous one is still being processed on the GPU, you need to use another copy.
The same goes for generated data that is required in the next frame (for example, a framebuffer used for screen-space reflections). Reusing the same resource would cause its contents to be overwritten. That's why you need another copy.
You can find more information here: https://software.intel.com/en-us/articles/api-without-secrets-the-practical-approach-to-vulkan-part-1

Does a 3D engine need to analyse all the map objects before rendering

Does a 3D engine need to analyse every single object on the map to see whether it's going to be rendered or not? My understanding is that, for a line from the center of projection through a pixel in the view plane, the engine will find the closest surface that intersects with it. But wouldn't that mean that, for each pixel, the engine needs to analyse all objects in the map? Is there a way to limit the objects analysed?
Thanks for your help.
Such a procedure is called frustum culling.
You can find more information about it here:
https://en.wikipedia.org/wiki/Viewing_frustum (wiki)
http://www.lighthouse3d.com/tutorials/view-frustum-culling/
http://www.cse.chalmers.se/~uffe/vfc.pdf (better but hard to read)
IMHO, this last link is similar to what Nico Schertler mentioned in a comment.
Beware: what you are looking for is not the same as "occlusion culling" (see the related question "Most efficient algorithm for mesh-level, optimal occlusion culling?"), which is another optimization for the case where an object is totally hidden behind another one.
Note that most game engines render by object (a pack of many triangles submitted via draw calls, roughly speaking), not by tracing a ray for each pixel (ray tracing) as your description suggests.
Ray tracing is too expensive for most real-time applications.
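As a rough illustration (not taken from any particular engine), a bounding-sphere frustum test might look like the sketch below; real engines usually combine it with a spatial structure such as an octree or BVH, so most objects are rejected without even being tested individually:

```cpp
struct Plane  { float nx, ny, nz, d; };          // plane n·p + d = 0, normal pointing into the frustum
struct Sphere { float cx, cy, cz, radius; };     // bounding sphere of an object

// frustum: six planes (left, right, top, bottom, near, far) extracted
// from the camera's view-projection matrix.
bool isVisible(const Plane frustum[6], const Sphere& s)
{
    for (int i = 0; i < 6; ++i) {
        float dist = frustum[i].nx * s.cx + frustum[i].ny * s.cy +
                     frustum[i].nz * s.cz + frustum[i].d;
        if (dist < -s.radius)
            return false;    // entirely outside this plane: cull the object
    }
    return true;             // inside or intersecting the frustum: submit a draw call for it
}
```

Per frame, only the objects passing this test are turned into draw calls; the per-pixel "which surface is closest" question from the post is then resolved by the depth buffer during rasterization, not by testing every object per pixel.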

When to use VK_IMAGE_LAYOUT_GENERAL

It isn't clear to me when it's a good idea to use VK_IMAGE_LAYOUT_GENERAL as opposed to transitioning to the optimal layout for whatever action I'm about to perform. Currently, my policy is to always transition to the optimal layout.
But VK_IMAGE_LAYOUT_GENERAL exists. Maybe I should be using it when I'm only going to use a given layout for a short period of time.
For example, right now, I'm writing code to generate mipmaps using vkCmdBlitImage. As I loop through the sub-resources performing the vkCmdBlitImage commands, should I transition to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL as I scale down into a mip, then transition it to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL when it will be the source for the next mip, before finally transitioning to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL when I'm all done? It seems like a lot of transitioning, and maybe generating the mips in VK_IMAGE_LAYOUT_GENERAL is better.
I appreciate the answer might be to measure, but it's hard to measure on all my target GPUs (especially because I haven't got anything running on Android yet) so if anyone has any decent rule of thumb to apply it would be much appreciated.
FWIW, I'm writing Vulkan code that will run on desktop GPUs and Android, but I'm mainly concerned about performance on the latter.
You would use it when:
1. You are lazy.
2. You need to map the memory to the host (unless you can use PREINITIALIZED).
3. You use the image as multiple incompatible attachments and have no choice.
4. For storage images.
5. Other cases where you would otherwise switch layouts too often (and don't even need barriers) relative to the work done on the images. Measurement is needed to confirm GENERAL is better in that case, and it is most likely a premature optimization even then.
PS: You could transition all the mip levels together to TRANSFER_DST with a single command beforehand, and then transition only the one you need to SRC. With a decent HDD, it would arguably be even better to store the textures with their mip levels already generated, if that's an option (and you could perhaps even get better quality using a more sophisticated downsampling algorithm).
PS2: Too bad there's no dedicated mip-map generation command. vkCmdBlitImage most likely does it under the hood anyway for images smaller than half the resolution...
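To illustrate the PS, here is a sketch of that per-level loop. It assumes the whole image (every mip level) was already transitioned to TRANSFER_DST_OPTIMAL in one barrier, that cmd, image, mipLevels, width and height are in scope, and it omits the final transition to SHADER_READ_ONLY_OPTIMAL as well as clamping mip dimensions to at least 1:

```cpp
VkImageMemoryBarrier barrier = {};
barrier.sType                           = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
barrier.image                           = image;
barrier.srcQueueFamilyIndex             = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex             = VK_QUEUE_FAMILY_IGNORED;
barrier.subresourceRange.aspectMask     = VK_IMAGE_ASPECT_COLOR_BIT;
barrier.subresourceRange.baseArrayLayer = 0;
barrier.subresourceRange.layerCount     = 1;
barrier.subresourceRange.levelCount     = 1;

for (uint32_t level = 1; level < mipLevels; ++level) {
    // The previous level becomes the blit source.
    barrier.subresourceRange.baseMipLevel = level - 1;
    barrier.oldLayout     = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    barrier.newLayout     = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
    barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT,
                         0, 0, nullptr, 0, nullptr, 1, &barrier);

    // Blit level-1 into level at half the resolution.
    VkImageBlit blit = {};
    blit.srcSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, level - 1, 0, 1 };
    blit.dstSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, level,     0, 1 };
    blit.srcOffsets[1]  = { int32_t(width >> (level - 1)), int32_t(height >> (level - 1)), 1 };
    blit.dstOffsets[1]  = { int32_t(width >> level),       int32_t(height >> level),       1 };
    vkCmdBlitImage(cmd,
                   image, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                   image, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
                   1, &blit, VK_FILTER_LINEAR);
}
```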
If you read from the mipmap[n] image to create the mipmap[n+1] image, then you should use the transfer layouts if you want your code to run on all Vulkan implementations and get the most performance across them, as those layouts may be used by the GPU to optimize the image for reads or writes.
So if you want to go cross-vendor, only use VK_IMAGE_LAYOUT_GENERAL for setting up the descriptor that uses the final image, not for the image reads or writes during generation.
If you don't want to use that many transitions you may copy from a buffer instead of an image, though you obviously wouldn't get the format conversion, scaling and filtering that vkCmdBlitImage does for you for free.
Also, don't forget to check whether the target format actually supports the BLIT_SRC and BLIT_DST feature bits. This is independent of whether you use the transfer or general layout for the copies.
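That check is a quick query of the format properties; a sketch, assuming physicalDevice and format are in scope:

```cpp
VkFormatProperties props;
vkGetPhysicalDeviceFormatProperties(physicalDevice, format, &props);

// Both bits must be present (for optimal tiling here) for vkCmdBlitImage to be legal.
bool canBlit = (props.optimalTilingFeatures & VK_FORMAT_FEATURE_BLIT_SRC_BIT) &&
               (props.optimalTilingFeatures & VK_FORMAT_FEATURE_BLIT_DST_BIT);
if (!canBlit) {
    // Fall back to precomputed mip levels or a compute-shader downsample.
}
```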

Working around WebGL readPixels being slow

I'm trying to use WebGL to speed up computations in a simulation of a small quantum circuit, like what the Quantum Computing Playground does. The problem I'm running into is that readPixels takes ~10ms, but I want to call it several times per frame while animating in order to get information out of gpu-land and into javascript-land.
As an example, here's my exact use case. The following circuit animation was created by computing things about the state between each column of gates, in order to show the inline-with-the-wire probability-of-being-on graphing:
The way I'm computing those things now, I'd need to call readPixels eight times for the above circuit (once after each column of gates). This is waaaaay too slow at the moment, easily taking 50ms when I profile it (bleh).
What are some tricks for speeding up readPixels in this kind of use case?
Are there configuration options that significantly affect the speed of readPixels? (e.g. the pixel format, the size, not having a depth buffer)
Should I try to make the readPixel calls all happen at once, after all the render calls have been made (maybe allows some pipelining)?
Should I try to aggregate all the textures I'm reading into a single megatexture and sort things out after a single big read?
Should I be using a different method to get the information back out of the textures?
Should I be avoiding getting the information out at all, and doing all the layout and rendering gpu-side (urgh...)?
Should I try to make the readPixel calls all happen at once, after all the render calls have been made (maybe allows some pipelining)?
Yes, yes, yes. readPixels is fundamentally a blocking, pipeline-stalling operation, and it is always going to kill your performance wherever it happens, because it's sending a request for data to the GPU and then waiting for it to respond, which normal draw calls don't have to do.
Do readPixels as few times as you can (use a single combined buffer to read from). Do it as late as you can. Everything else hardly matters.
Should I be avoiding getting the information out at all, and doing all the layout and rendering gpu-side (urgh...)?
This will get you immensely better performance.
If your graphics are all like you show above, you shouldn't need to do any "layout" at all (which is good, because it'd be very awkward to implement): everything but the text is some kind of color or boundary animation which could easily be done in a shader, and all the layout can be just a static vertex buffer (where each vertex has attributes pointing at the simulation-state texel it should depend on).
The text will be more tedious merely because you need to load all the digits into a texture to use as a spritesheet and do the lookups into that, but that's a standard technique. (Oh, and divide/modulo to get the digits.)
I don't know enough about your use case but just guessing, Why do you need to readPixels at all?
First, you don't need to draw text or the static parts of your diagram in WebGL. Put another canvas, svg, or img over the WebGL canvas and set the CSS so they overlap, then let the browser composite them. That way you don't have to do it yourself.
Second, let's assume you have a texture that has your computed results in it. Can't you just make some geometry that matches the places in your diagram that need to have colors, and use texture coordinates to look up the results from the correct places in the results texture? Then you don't need to call readPixels at all. That shader can use a ramp-texture lookup or any other technique to convert the results to other colors to shade the animated parts of your diagram.
If you want to draw numbers based on the result, you can use a technique like this: make a shader that reads a result value from the results texture and then indexes glyphs from another texture based on that.
Am I making any sense?