I want to enable heterogenous multi-GPU support for my Vulkan Application. My goal is to enhance the amount of draw calls (and thereby visible objects) without increasing frame times. I decided it would be best to divide the geometry to multiple GPUs and let each one render its part to the GPU's local Framebuffer. Each local Framebuffer (including depth buffer) would be copied to the logical device containing the presentation engine at the end of the frame.
Is there any way to "join" those Framebuffers depending on the depth values? The pixel value of the framebuffer containing the least depth value should be copied to the final output Frambuffer. A version of vkCmdBlitImage containing a parameter of type VkCompareOp would be nice to have.
Related
I'm trying to draw geometry for multiple models using a single draw call. All the geometry, thusly, resizes within the same vertex/index buffers. The geometry for the different models share the same vertex format, but the vertex amounts for each model can be different.
In the vertex/fragment shaders, what's a technique that can be used to differentiate between the different models, to access their appropriate transforms/textures/etc ?
Are these static models? For traditional static batching:
You only need a single transform relative to the batch origin (position the individual models relative to the batch origin as part of the offline data packaging step).
You can batch your textures in to a single atlas (either a single 2D image with different coordinates for each object, or a texture array with a different layer for each object).
If you do it this way you don't need to different component models - they are effectively just "one large model". Which has nice performance properties ...
For more modern methods, you can try indirect draws with multiple "drawCount" values to index the settings you want. This allows variable buffer offsets and triangle counts to be used, but the rest of the state used needs to be the same.
As an alternative to texture arrays, with bindless texturing you can just programmatically select which texture to use in the shader at runtime. BUT you generally still want it to be at least warp-uniform to avoid a performance hit.
I have a set of 6 video clips shot by 6 cameras atop a vehicle. I'm trying to use CERES to bundle adjust O(1.4K) frames from these 6 cameras. Presumably the intrinsic parameters for each of the 6 video cameras are constant to very good approximation. So I'm trying to figure out a reasonable way to incorporate this physical constraint in CERES' framework.
As a CERES newbie, I've tried searching previous bundle adjustment examples & questions. But surprisingly, I haven't found CERES-solver code examples where camera intrinsics are constrained due to their corresponding to a small number of physical cameras.
Starting with the simple_bundle_adjuster.cc example code, I first tried separating apart the extrinsic, intrinsic and 3D point variables within the double* parameters_ member of the BALProblem class. I hoped that it's possible to construct separate ResidualBlocks for video frames which would have distinct extrinsic and 3D point parameters but shared intrinsic parameters depending upon the video camera's physical ID.
Here is my slightly modified version of the Create() method from the simple_bundle_adjuster.cc example:
static ceres::CostFunction* Create(
int i, const double observed_u, const double observed_v)
{
if (use_common_intrinsics_flag)
{
return (new ceres::AutoDiffCostFunction<
PinholeReprojectionError, 2,
n_extrinsic_params,
n_physical_cameras * n_intrinsic_params, 3>(
new PinholeReprojectionError(i, observed_u, observed_v)));
}...
Unfortunately, I haven't been able to find any way to share the common intrinsic parameters from the 6 physical video cameras which is consistent with CERES.
Q1: Does CERES require combining all extrinsic and intrinsic parameters + 3D points into one large Residual Block for each physical video camera? Is this possible given that the number of frames collected by a video camera is not generally known at compile time?
Q2: Can constancy of intrinsic parameters for each physical video camera be approximately enforced via Upper/Lower Parameter Bounds? Maybe some iterative approach where parameter limits are tightened might work.
If all six cameras have known constant intrinsics. Then there is no reason to have parameter blocks for them, just create a CostFunction object that stores and applies the intrinsics.
If you would like to have them be variable but say with a strong prior, add a parameter block - one for each of the six cameras, and then use that as one of the parameter blocks to the cost function that computes the residual error.
The same idea applies if all six cameras have the same intrinsics, in which case just have have one parameter block.
Let me say the scenario, we have several meshes with the same shader (material type, e.g. PBR material), but the difference between meshes materials are the uniform buffer and textures for rendering them.
For uniform buffer we have a dynamic uniform buffer technique that uniform buffer offsets can be specify for each draw in the command buffer, but for the image till here I didn't find a way of specifying image view in command buffer for descriptor set. In all the sample codes I have seen till now, for every mesh and every material of that mesh they have a new pipeline, descriptor sets and etc.
I think it is not the best way, there must be a way to only have one pipeline and descriptor set and etc for a material type and only change the uniform buffer offset and texture image-view and sampler, am I right?
If I'm wrong, are these samples doing the best way?
How should I specify the VkDescriptorPoolCreateInfo.maxSets (or other limits like that) for dynamic scene that every minute meshes will add and remove?
Update:
I think it is possible to have a same pipeline and descriptor set layout for all of the objects but problem with VkDescriptorPoolCreateInfo.maxSets (or other limits like that) and the best practice still exist.
It is not duplicate
I was seeking for a way of specifying textures like what we can do with dynamic uniform buffer (to reduce number of descriptor sets) and along with this question there were complementary questions mostly to find best practices for the way that's gonna be suggested with an answer.
You have many options.
The simplest mechanism is to divide your descriptor set layout into sets based on the frequency of changes. Things that change per-scene would be in set 0, things that change per-kind-of-object (character, static mesh, etc), would be in set 1, and things that change per-object would be in set 2. Or whatever. The point is that the things that change with greater frequency go in higher numbered sets.
This texture is per-object, so it would be in the highest numbered set. So you would give each object its own descriptor set containing that texture, then apply that descriptor set when you go to render.
As for VkDescriptorPoolCreateInfo.maxSets, you set that to whatever you feel is appropriate for your system. And if you run out, you can always create another pool; nobody's forcing you to use just one.
However, this is only one option. You can also employ array textures or arrays of textures (depending on your hardware capabilities). In either method, you have an array of different images (either as a single image view or multiple views bound to the same arrayed descriptor). Your per-object uniform data would have that object's texture index, so that it can fetch the index from the array texture/array of textures.
I am trying to use a compute shader for image processing. Being new to Vulkan I have some (possibly naive) questions:
I try to look at neighborhood of a pixel. So AFAIK I have 2 possiblities:
a, Pass one image to the compute shader and sample the neighborhood pixels directly (x +/- i, y +/- j)
b, Pass multiple images to the compute shader (each being offset) and sample only the current position (x, y)
Is there any difference in sample performance a vs b (aside from b needing way more memory to being passed to GPU)?
I need to pass on pixel information (+ meta info) from one pipeline stage to another (and read it back out once command is done).
a, can I do this in any other way than passing a image with storage bit set?
b, when reading back information from host I probably need to use a framebuffer?
Using a single image and sampling at offsets (maybe using textureGather?) is going to be more efficient, probably by a lot. Each texturing operation has a cost, and this uses fewer. More importantly, the texture cache in GPUs generally loads a small region around your sample point, so sampling the adjacent pixels is likely going to hit in the cache.
Even better would be to load all the pixels once into shared memory, and then work from there. Then instead of fetching pixel (i,j) from thread (i,j) and all of that thread's eight neighbors, you only fetch it once. You still need extra fetches on the edge of the region handled by a single workgroup. (For what it's worth, this technique is not Vulkan specific: you'll see it used in CUDA, OpenCL, D3D Compute, and GL Compute too).
The only way to persist data out of a compute shader is to write it to a storage buffer or storage image. To read that on the CPU, use vkCmdCopyImageToBuffer or vkCmdCopyBuffer to a host-readable resource, and then map that.
My system is composed of several objects that represent quads. Each quad is represented by the same vertices and therefore, each object only stores matrices that represent the object's transformation through the world, and its own object space. During each render pass, after these matrices are updated with their frame transforms, they are multiplied with the current view and projection matrices to form the MVP matrix for that object. The objects vertices are then sent with the MVP matrix to the shader, where the vertices are multiplied by the MVP matrix. The inefficiency here is that each quad is drawn separately, meaning there is a separate call glDrawElements for each quad. At any given moment, there may be 50 or 60 quads in existence, some move out of scope and are destroyed or their animation may complete, so they're also destroyed, but more will randomly enter existence. Would there be a significant performance gain to storing all the necessary values in a VBO and just calling glDrawElements once during each pass?
Let's first reason about it with some simple mathematics:
At the moment you don't need to push any vertex data onto the GPU (each frame), but 12-16 floats matrix data per quad, and perform a matrix-matrix multiplication per quad on the CPU.
When putting all in one VBO, you have to transfer 4 vertices (~12 floats) per quad, but no matrix data (except for the global VP, of course) and you have to do 4 matrix-vector multiplies (~1 matrix-matrix multiply) on the CPU.
So the amount of work and data transferred doesn't really change much. But what changes is, that the transferred data is shifted from many many small uniform updates to a single large VBO update, which is very likely to be faster (both because a buffer update is likely to be faster from the hardware side than multiple uniform updates, but don't nail me on that, and second because of the much reduced driver overhead). And on top of that comes the even more reduced overhead by using a single large draw call instead of many smaller.
So yes, it will certainly be worth a try, though it has to be evaluated if it is really a "significant" improvement in your particular application.
Would there be a significant performance gain to storing all the
necessary values in a VBO and just calling glDrawElements once during
each pass?
Yes, it would be much faster. First reason, as you correctly identified, will be a single glDrawElements call. And second being the fact that VBO keeps the data in the GPU itself.
If quads move out of scope you can reuse their memory for the new quads. VBO's can be used to draw subregions of the buffer, so you can get big flexibility without memory allocations.
By using VBO's you are minimising interaction with the GPU and so getting the performance benefit.