3 vs 2 VkSwapchain images? - vulkan

So a Vulkan swapchain basically has a pool of images, user-defined in number, that it allocates when it is created, and images from the pool cycle like this:
1. An unused image is acquired by the program from the swapchain
2. The image is rendered by the program
3. The image is delivered for presentation to the surface.
4. The image is returned to the pool.
Given that, of what advantage is having 3 images in the swapchain rather than 2?
(I'm asking because I received a BestPractice validation complaint that I was using 2 rather than 3.)
With 2 images, isn't it the case that one is being presented (step 3) while the other is being rendered (step 2), and that the two then swap roles back and forth?
With 3 images, isn't it the case that one of the three will always be idle? Particularly if the loop is locked to the refresh rate anyway?
I'm probably missing something.
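For concreteness, here is a minimal sketch of how I understand that cycle mapping onto the API; the handles (device, swapchain, queues, semaphores, fence, command buffers) are assumed to have been created during init, and error handling is omitted:

```cpp
#include <vulkan/vulkan.h>

// Hypothetical per-frame function: all handles are created elsewhere.
void drawFrame(VkDevice device, VkSwapchainKHR swapchain,
               VkQueue graphicsQueue, VkQueue presentQueue,
               VkSemaphore imageAvailable, VkSemaphore renderFinished,
               VkFence inFlightFence, const VkCommandBuffer* commandBuffers)
{
    // 1. Acquire an unused image from the pool (may wait until one is free).
    uint32_t imageIndex = 0;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          imageAvailable, VK_NULL_HANDLE, &imageIndex);

    // 2. Render to it; the GPU waits on the acquire semaphore first.
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    VkSubmitInfo submit{};
    submit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.waitSemaphoreCount   = 1;
    submit.pWaitSemaphores      = &imageAvailable;
    submit.pWaitDstStageMask    = &waitStage;
    submit.commandBufferCount   = 1;
    submit.pCommandBuffers      = &commandBuffers[imageIndex];
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores    = &renderFinished;
    vkQueueSubmit(graphicsQueue, 1, &submit, inFlightFence);

    // 3. Hand the image to the presentation engine once rendering signals.
    VkPresentInfoKHR present{};
    present.sType              = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    present.waitSemaphoreCount = 1;
    present.pWaitSemaphores    = &renderFinished;
    present.swapchainCount     = 1;
    present.pSwapchains        = &swapchain;
    present.pImageIndices      = &imageIndex;
    vkQueuePresentKHR(presentQueue, &present);

    // 4. When presentation completes, the image returns to the pool and a
    //    later vkAcquireNextImageKHR can hand it out again.
}
```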

So what happens if, with two images, one is being presented, the other has finished being rendered by the GPU, and you have more work to render for a later frame? If that work renders to a swapchain image, it cannot start until the first image has finished being presented and returned to the pool.
Double buffering will therefore stall the GPU whenever the time to present a frame is greater than the time to render a frame. Yes, triple buffering will do the same if the render time is consistently shorter than the present time, but if your frame time is right on the edge of the present time, double buffering has a greater chance of wasting GPU time.
Of course, the downside is latency; with triple buffering, the image you display will be seen one frame later than with double buffering.
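Concretely, the only change triple buffering needs at creation time is a larger requested image count. A minimal sketch, assuming surfaceCaps was queried via vkGetPhysicalDeviceSurfaceCapabilitiesKHR and that the format/colour space match what the surface actually supports:

```cpp
#include <vulkan/vulkan.h>

VkSwapchainKHR makeSwapchain(VkDevice device, VkSurfaceKHR surface,
                             const VkSurfaceCapabilitiesKHR& surfaceCaps)
{
    VkSwapchainCreateInfoKHR info{};
    info.sType            = VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR;
    info.surface          = surface;
    info.minImageCount    = 3;  // triple buffering; a floor, not an exact count
    info.imageFormat      = VK_FORMAT_B8G8R8A8_SRGB;
    info.imageColorSpace  = VK_COLOR_SPACE_SRGB_NONLINEAR_KHR;
    info.imageExtent      = surfaceCaps.currentExtent;
    info.imageArrayLayers = 1;
    info.imageUsage       = VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT;
    info.imageSharingMode = VK_SHARING_MODE_EXCLUSIVE;
    info.preTransform     = surfaceCaps.currentTransform;
    info.compositeAlpha   = VK_COMPOSITE_ALPHA_OPAQUE_BIT_KHR;
    info.presentMode      = VK_PRESENT_MODE_FIFO_KHR;  // vsync-locked
    info.clipped          = VK_TRUE;

    VkSwapchainKHR swapchain = VK_NULL_HANDLE;
    vkCreateSwapchainKHR(device, &info, nullptr, &swapchain);
    return swapchain;
}
```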

Related

What do you do if you don't want a depth buffer in your render pass sometimes?

My question relates to any situation where you sometimes write to an attachment and other times don't. So let's just say that my render pass usually writes to a depth buffer, but in some cases I don't need the depth buffer. As far as I can see there are a couple of ways to handle this:
- Disable depth buffer writing and attach a dummy image (apparently the dummy image has to be as big as the colour buffer; I'm not sure if this is a waste of memory).
- Create a separate render pass that only writes to the colour buffer.
In my case I want to treat the colour writing, depth writing and picking writing as separate things. Is it better to use dummy images or to have three render passes (colour, colour-depth, colour-depth-picking)? How does this apply generally? I'm guessing in the future there may be extra attachments that I may want to switch on/off.
Edit: Can you have a render pass that writes to 3 attachments, but only have 1 attachment in the framebuffer if you disable certain attachments? I'm guessing not, as I've been told that you need an image attached to each one, of size at least that of the biggest attachment, or the size of the framebuffer.
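For concreteness, the separate-render-pass option would look roughly like this; a minimal sketch with a hypothetical makeColourOnlyPass helper, a hard-coded format, and no subpass dependencies:

```cpp
#include <vulkan/vulkan.h>

// Variant A: colour only. Variant B (colour + depth) reuses the same colour
// description and adds a depth attachment, as noted in the comments below.
VkRenderPass makeColourOnlyPass(VkDevice device)
{
    VkAttachmentDescription colour{};
    colour.format         = VK_FORMAT_B8G8R8A8_SRGB;
    colour.samples        = VK_SAMPLE_COUNT_1_BIT;
    colour.loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR;
    colour.storeOp        = VK_ATTACHMENT_STORE_OP_STORE;
    colour.stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
    colour.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
    colour.initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED;
    colour.finalLayout    = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;

    VkAttachmentReference colourRef{0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

    VkSubpassDescription subpass{};
    subpass.pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpass.colorAttachmentCount = 1;
    subpass.pColorAttachments    = &colourRef;
    // Variant B would additionally set:
    //   subpass.pDepthStencilAttachment = &depthRef;  // attachment 1

    VkRenderPassCreateInfo rp{};
    rp.sType           = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
    rp.attachmentCount = 1;  // 2 in variant B (colour + depth)
    rp.pAttachments    = &colour;
    rp.subpassCount    = 1;
    rp.pSubpasses      = &subpass;

    VkRenderPass pass = VK_NULL_HANDLE;
    vkCreateRenderPass(device, &rp, nullptr, &pass);
    return pass;
}
```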

In Vulkan, using FIFO, what happens when you don't have an image ready to present when the vertical blank arrives?

I'm new to graphics, and I've been looking at Vulkan presentation modes. I was wondering: in a situation where we've only got 2 images in our swapchain (one that the screen's currently reading from and one that's free), what happens if we don't manage to finish drawing to the currently free image before the next vertical blank? Do we do the presentation and get weird tearing, or do we skip the presentation and draw the same image again (I guess giving a "stuttering" effect)? Do we need to define what happens, or is it automatic?
As a side note, is this why people use longer swapchains? I.e., so that if you managed to draw 2 images into your swapchain while the screen was displaying the last image, but now you're running late, at least you can present the newer of those 2 images?
I'm not sure how much of this is specific to FIFO or mailbox mode: I guess with mailbox you'll already have used the newest image you've got, so you're stuck again?
[2-image swapchain][1]
[1]: https://i.stack.imgur.com/rxe51.png
Tearing never happens in regular FIFO (or mailbox) mode. When you present an image, this image will be used for all subsequent vblanks until a new image is presented. And since FIFO disallows tearing, in your case, the image will be fully displayed twice.
If you are using a 2-deep swapchain with FIFO, you have to produce each image on time in order to avoid stuttering. With longer swapchains and FIFO, you have more leeway to avoid visible stuttering. With longer swapchains and mailbox, you can get a similar effect, but there will be less visible latency when your application is running on-time.
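If you want the mailbox behaviour where it's available, selection looks roughly like this; a minimal sketch, with physicalDevice and surface assumed to already exist:

```cpp
#include <vulkan/vulkan.h>
#include <vector>

// Pick MAILBOX when the surface offers it; otherwise fall back to FIFO.
VkPresentModeKHR choosePresentMode(VkPhysicalDevice physicalDevice,
                                   VkSurfaceKHR surface)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceSurfacePresentModesKHR(physicalDevice, surface,
                                              &count, nullptr);
    std::vector<VkPresentModeKHR> modes(count);
    vkGetPhysicalDeviceSurfacePresentModesKHR(physicalDevice, surface,
                                              &count, modes.data());

    for (VkPresentModeKHR mode : modes)
        if (mode == VK_PRESENT_MODE_MAILBOX_KHR)
            return mode;              // present the newest image at each vblank
    return VK_PRESENT_MODE_FIFO_KHR;  // the only mode the spec guarantees
}
```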

How do Swapchain minImageCount and VK_PRESENT_MODE_IMMEDIATE_KHR relate to each other?

If we set VkSwapchainCreateInfoKHR.presentMode = VK_PRESENT_MODE_IMMEDIATE_KHR; then, as I understand it, we have a single buffer and the image is presented to the surface immediately,
but at the same time SwapChainDetails.surfaceCapabilities.minImageCount = 2; on my GTX 1070.
Does minImageCount mean that the minimum number of buffers should be 2?
How does it even work?
They don't relate to each other at all. How many buffers you have has nothing to do with how they get displayed.
Immediate presentation means that, when you present an image, it is displayed immediately. As stated in the standard:
> the presentation engine does not wait for a vertical blanking period to update the current image
If the presentation engine only allowed you one swapchain image, then this would mean that the acquire would have to wait until the image is finished being shown. This would make immediate presentation pointless.
Now, you might think that acquiring such an image ought to happen immediately. But that would mean that you could render to the image while the presentation engine is still reading from it, causing potential corruption. But that's not what immediate presentation is for. Immediate presentation changes nothing about the ownership dynamics between the presentation engine and user code; it only changes the timing of how swapchain images are presented.
The ability to write to an image that is actually being presented is governed by the "shared" present modes (VK_PRESENT_MODE_SHARED_DEMAND_REFRESH_KHR and VK_PRESENT_MODE_SHARED_CONTINUOUS_REFRESH_KHR).
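In practice, then, you treat minImageCount as a floor on the image count when creating the swapchain, independent of whichever present mode you pick. A minimal sketch, with physicalDevice and surface assumed:

```cpp
#include <vulkan/vulkan.h>
#include <algorithm>

// minImageCount is only a floor on the swapchain's image count; the present
// mode is chosen independently and only changes presentation timing.
uint32_t chooseImageCount(VkPhysicalDevice physicalDevice, VkSurfaceKHR surface)
{
    VkSurfaceCapabilitiesKHR caps{};
    vkGetPhysicalDeviceSurfaceCapabilitiesKHR(physicalDevice, surface, &caps);

    uint32_t imageCount = caps.minImageCount + 1;  // e.g. 3 when the floor is 2
    if (caps.maxImageCount > 0)                    // 0 means "no upper limit"
        imageCount = std::min(imageCount, caps.maxImageCount);
    return imageCount;
}
// ...then, in VkSwapchainCreateInfoKHR:
//   info.minImageCount = chooseImageCount(physicalDevice, surface);
//   info.presentMode   = VK_PRESENT_MODE_IMMEDIATE_KHR;  // timing only
```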

How to improve MTKView rendering when using MPSImageScale and MTLBlitCommandEncoder

TL;DR: From within my MTKView delegate's drawInMTKView: method, part of my rendering pass involves adding an MPSImageBilinearScale performance shader and zero or more MTLBlitCommandEncoder requests for generateMipmapsForTexture. Is that a smart thing to do from within drawInMTKView:, which happens on the main thread? Does either of them block the main thread while running, or are they merely encoded there and then executed later, entirely on the GPU?
Longer Version:
I'm playing around with Metal within the context of an imaging application. I use Core Image to load an image and apply filters. The output image is displayed as a 2D plane in a metal view with a single texture. This works, but to improve performance I wanted to experiment with Core Image's ability to render out smaller tiles at a time. Each tile is rendered into its own IOSurface.
On each render pass, I check if there are any tiles that have been recently rendered. For each rendered tile (which is now an IOSurface), I create a Metal texture from a CVMetalTextureCache that is backed by the surface.
I then use a scaling MPS to copy from the tile-texture into the "master" texture. If a tile was copied over, then I issue a blit command to generate the mipmaps on the master texture.
What I'm seeing is that if my master texture is quite large, then generating the mipmaps can take "a bit of time". The same is true if I have a lot of tiles. This appears to be blocking the main thread, because my FPS drops significantly. (The MTKView is running at the standard 60 fps.)
If I play around with tile sizes, then I can improve performance in some areas but decrease it in others. For example, increasing the tile size that Core Image renders creates fewer tiles, and thus fewer calls to generate mipmaps and fewer blits, but at the cost of Core Image taking longer to render each region.
If I decrease the size of my "master" texture, then mipmap generation goes faster since only the dirty textures are updated, but there appears to be a lower bound on how small I should make the master texture, because if I make it too small I need to pass a large number of textures to the fragment shader. (And it looks like that limit might be 128?)
What's not entirely clear to me is how much of this I can move off the main thread while still using MTKView. If part of the rendering pass is going to block the main thread, then I'd prefer to move it to a background thread so that UI elements (like sliders and checkboxes) remain fully responsive.
Or maybe this isn't the right strategy in the first place? Is there a better way to display really large images in Metal other than tiling? (i.e.: Images larger than Metal's texture size limit of 16384?)
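For concreteness, here's the shape of the encode-then-commit flow in question, written as a standalone sketch against the metal-cpp C++ bindings rather than the Objective-C API the app actually uses; the texture size, pixel format, and completion-handler logging are illustrative only:

```cpp
// Single-file metal-cpp sketch (the *_PRIVATE_IMPLEMENTATION defines must
// appear in exactly one translation unit).
#define NS_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#define CA_PRIVATE_IMPLEMENTATION
#include <Foundation/Foundation.hpp>
#include <Metal/Metal.hpp>
#include <cstdio>

int main()
{
    MTL::Device*       device = MTL::CreateSystemDefaultDevice();
    MTL::CommandQueue* queue  = device->newCommandQueue();

    MTL::TextureDescriptor* desc = MTL::TextureDescriptor::texture2DDescriptor(
        MTL::PixelFormatRGBA8Unorm, 4096, 4096, /*mipmapped=*/true);
    MTL::Texture* texture = device->newTexture(desc);

    // Encoding runs on the calling (here: main) thread and is cheap...
    MTL::CommandBuffer*      cmd  = queue->commandBuffer();
    MTL::BlitCommandEncoder* blit = cmd->blitCommandEncoder();
    blit->generateMipmaps(texture);  // recorded now, executed later on the GPU
    blit->endEncoding();

    // ...the actual work happens on the GPU after commit(), asynchronously.
    cmd->addCompletedHandler([](MTL::CommandBuffer*) {
        std::printf("mipmap generation finished on the GPU\n");
    });
    cmd->commit();
    cmd->waitUntilCompleted();  // for this demo only; never block drawInMTKView:

    texture->release();
    queue->release();
    device->release();
    return 0;
}
```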

Metal drawable musings... ugh

I have two issues in my Metal App.
1. My call to currentPassDescriptor is stalling. I have too many drawables, apparently.
2. I'm wholly confused about how to most performantly configure the multiple MTKViews I am using.
Issue (1)
I have a problem with currentPassDescriptor in my app. It occasionally blocks (for 1.00 s), which, according to the docs, happens because no currentDrawable is available.
Background: I have 4 HD 1920x1080 videos playing concurrently, tiled out onto a 3840x2160 second external display as a debugging configuration. The pixel buffers of these AVPlayer instances are captured by 4 independent CVDisplayLink callbacks and, from within each callback, there is a draw call to its assigned MTKView. A total of 4 MTKViews are subviews tiled on a single NSWindow, and are configured for manual drawing.
I'm using CVDisplayLink callbacks manually. If I don't, then I get stutter when mousing up on the app’s menus, for example.
Within each draw call, I do a bit of kernel shader work then attempt to obtain the currentPassDescriptor. If successful, I do one pass of a fragment/vertex shader and then present the drawable. My code flow follows Apple’s sample code as well as published examples.
According to the Metal System Trace, most draw calls take under 5 ms. The GPU is about 20-25% utilized, and about 25% of the GPU memory is free. I can also make the main thread usleep() for 1 second without any hiccups.
Without any user interaction, there's about a 5% chance of the videos stalling out in the first minute. If there's some UI work going on, I see it as windowServer work in Instruments. I also note that AVFoundation seems to cache about 15 frames of video onto the GPU for each AVPlayer.
If the cadence of the draw calls is upset, there's about a 10% chance that things stall, completely or in part: some videos will stall completely, some will stall to 1 Hz updates, and some won't stall at all. There's also less chance of stalling while Metal System Trace is running. The movies that have stalled seem to have done so while obtaining a currentPassDescriptor.
It seems like really poor design for currentPassDescriptor to block for ≈1 s during a render loop, so much so that I'm thinking of eschewing MTKView altogether and just drawing to a CAMetalLayer myself. But the docs on CAMetalLayer seem to indicate the same blocking behaviour will occur.
I also grab these 4 pixel buffers on the fly and render sub-size regions of interest to 4 smaller MTKViews on the main monitor; but the stutters still occur even if this code is removed.
Is the drawable buffer limit per MTKView or per the backing CALayer? The docs for maximumDrawableCount on CAMetalLayer say the number needs to be 2 or 3. This question ties into the configuration of the views.
Issue (2)
My current setup is a 3840x2160 NSWindow with a single content view. This subclass of NSView does some hiding/revealing of the mouse cursor by introducing an NSTrackingRectTag. The MTKViews are tiled subviews on this content view.
Is this the best configuration? Namely, one NSWindow with tiled MTKViews… or should I do one MTKView per window?
I'm also not sure how best to configure these windows/layers, i.e. by setting (or clearing) wantsLayer, wantsUpdateLayer, and/or canDrawSubviewsIntoLayer. I'm currently just setting wantsLayer to YES on the single content view. Any hints on this would be great.
Does adjusting these properties collapse all the available drawables to the backing layer only, or are there still 2 or 3 per MTKView?
NB: I've attached a sample run of my Metal app. The longest 'work' on the top graph is just under 5ms. The clumps of green/blue are rendering on the 4 MTKViews. The 'work' alternates a bit because one of the videos is a 60fps source; the others are all 30fps.