I've looked through a bunch of tutorials that talk about push constants and allude to possible benefits, but I've never actually seen, even in the Vulkan docs, what a "push constant" actually is... I don't understand what they are supposed to be or what their purpose is. The closest thing I can find is this post, which unfortunately doesn't ask what they are but how they differ from another concept, and it didn't help me much.
What is a push constant, why does it exist, and what is it used for? Where did its name come from?
Push constants are a way to quickly provide a small amount of uniform data to shaders. They should be much quicker than UBOs, but a huge limitation is the size of the data: the spec only requires 128 bytes to be available for a push constant range. Hardware vendors may support more (256 bytes, for example), but compared to other means it is still very little.
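You can check the real limit on a given device at runtime. A minimal sketch, assuming a valid physicalDevice handle:

VkPhysicalDeviceProperties properties{};
vkGetPhysicalDeviceProperties( physicalDevice, &properties );
// The spec guarantees at least 128 bytes; many devices report more.
uint32_t maxPushConstantBytes = properties.limits.maxPushConstantsSize;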
Because push constants are much quicker to update than descriptors (the resources through which we provide data to shaders), they are convenient for data that changes between draw calls, like transformation matrices.
From the shader perspective, push constants are declared with the layout( push_constant ) qualifier on a block of uniform data. For example:
layout( push_constant ) uniform ColorBlock {
    vec4 Color;
} PushConstant;
From the application perspective, if shaders want to use push constants, the push constant ranges must be specified during pipeline layout creation. Then the vkCmdPushConstants() command must be recorded into a command buffer. Among other arguments, this function takes a pointer to the memory from which data should be copied into the push constant range.
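To illustrate both steps for the ColorBlock above, here is a minimal sketch; device, descriptorSetLayout, and commandBuffer are assumed to already exist, and the fragment stage is assumed to be the one reading the range:

// Declare the range when creating the pipeline layout.
VkPushConstantRange range{};
range.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT; // stages that read the range
range.offset     = 0;
range.size       = sizeof( float ) * 4;          // matches vec4 Color above

VkPipelineLayoutCreateInfo layoutInfo{};
layoutInfo.sType                  = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
layoutInfo.setLayoutCount         = 1;
layoutInfo.pSetLayouts            = &descriptorSetLayout;
layoutInfo.pushConstantRangeCount = 1;
layoutInfo.pPushConstantRanges    = &range;

VkPipelineLayout pipelineLayout;
vkCreatePipelineLayout( device, &layoutInfo, nullptr, &pipelineLayout );

// Later, while recording the command buffer:
float color[4] = { 1.0f, 0.0f, 0.0f, 1.0f };
vkCmdPushConstants( commandBuffer, pipelineLayout, VK_SHADER_STAGE_FRAGMENT_BIT, 0, sizeof( color ), color );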
Different shader stages of a given pipeline can use the same push constant block (similarly to UBOs) or smaller parts of the whole range. But, importantly, each shader stage can use only one push constant block (it can contain multiple members, though). Another important thing is that the total data size (across all shader stages that use push constants) must fit into the size constraint, so the constraint applies to the whole range, not per stage.
There is an example in the Vulkan Cookbook's repository showing a simple push constant usage scenario. Sascha Willems's Vulkan examples also contain a sample showing how to use push constants.
Related
My application has a max of 2 frames in flight. The vertex shader takes a uniform buffer containing matrices. Because 2 frames could potentially be rendering at the same time, I believe I need separate memory for each frame's uniform buffer.
Is it preferable to create a single buffer that holds the uniforms of both frames and use an offset within the buffer to do updates, or is it better for each frame to have its own buffer?
Technically you can do either; it ultimately boils down to what makes the most sense implementation-wise. Provided you can ensure that you're not overwriting buffer data that's still active (and being consumed on the GPU), one memory allocation would be sufficient. It's a big advantage of the Vulkan API (since you have full control) but it does make life more complicated.
In my use-case, I use pages of allocations (kinda like a heap model) where I allocate on demand and return blocks when they're done (basically, if a reference is removed, I age out for however many frames of buffering I have and then free the block). This is targeted at uniform buffers that change infrequently.
For per-draw uniform data, I use push constants - this might be worth looking at for your use-case. Since the per-instance matrix is "use once then discard" and is relatively small, this actually makes life simpler, and could even have a performance benefit, to boot.
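As a rough sketch of what that looks like per draw (objects, viewProj, pipelineLayout, and the glm types here are my assumptions; the pipeline layout must declare a vertex-stage push constant range of at least sizeof(glm::mat4)):

for ( const auto& object : objects )
{
    // Compute and push the per-draw matrix; no descriptor updates needed.
    glm::mat4 mvp = viewProj * object.worldMatrix;
    vkCmdPushConstants( commandBuffer, pipelineLayout, VK_SHADER_STAGE_VERTEX_BIT, 0, sizeof( mvp ), &mvp );
    vkCmdDrawIndexed( commandBuffer, object.indexCount, 1, object.firstIndex, 0, 0 );
}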
Let me describe the scenario: we have several meshes with the same shader (material type, e.g. a PBR material), but the meshes' materials differ in the uniform buffer and textures used to render them.
For the uniform buffer we have the dynamic uniform buffer technique, where a uniform buffer offset can be specified for each draw in the command buffer; but for the image, so far I haven't found a way of specifying the image view for a descriptor set in the command buffer. In all the sample code I have seen so far, every mesh and every material of that mesh gets its own pipeline, descriptor sets, and so on.
I think that is not the best way; there must be a way to have only one pipeline, descriptor set, etc. per material type and change only the uniform buffer offset, texture image view, and sampler. Am I right?
If I'm wrong, are these samples doing the best way?
How should I specify VkDescriptorPoolCreateInfo.maxSets (or other limits like that) for a dynamic scene where meshes are added and removed every minute?
Update:
I think it is possible to have the same pipeline and descriptor set layout for all of the objects, but the problem with VkDescriptorPoolCreateInfo.maxSets (and other limits like that) and the question of best practice still remain.
It is not a duplicate: I was looking for a way of specifying textures similar to what we can do with a dynamic uniform buffer (to reduce the number of descriptor sets), and along with this question there were complementary questions, mostly about best practices for whatever approach gets suggested in an answer.
You have many options.
The simplest mechanism is to divide your descriptor set layout into sets based on the frequency of changes. Things that change per-scene would be in set 0, things that change per-kind-of-object (character, static mesh, etc.) would be in set 1, and things that change per-object would be in set 2. Or whatever. The point is that the things that change with greater frequency go in higher numbered sets.
This texture is per-object, so it would be in the highest numbered set. So you would give each object its own descriptor set containing that texture, then bind that descriptor set when you go to render.
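For example (a sketch; the handle names are illustrative), rebinding only the per-object set between draws leaves the lower-frequency sets untouched:

// Bind only set 2 (the per-object set); sets 0 and 1 stay bound.
vkCmdBindDescriptorSets( commandBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout,
                         2,                        // firstSet
                         1, &objectDescriptorSet,  // the object's own set
                         0, nullptr );             // no dynamic offsets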
As for VkDescriptorPoolCreateInfo.maxSets, you set that to whatever you feel is appropriate for your system. And if you run out, you can always create another pool; nobody's forcing you to use just one.
However, this is only one option. You can also employ array textures or arrays of textures (depending on your hardware's capabilities). In either method, you have an array of different images (either as a single image view or multiple views bound to the same arrayed descriptor). Your per-object uniform data would contain that object's texture index, so that the shader can fetch that object's texture from the array texture / array of textures.
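A minimal sketch of the arrayed-descriptor variant (the binding number and count are illustrative):

// One binding holding an array of sampled images; per-object uniform data
// then carries an index into this array.
VkDescriptorSetLayoutBinding textureArray{};
textureArray.binding         = 0;
textureArray.descriptorType  = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
textureArray.descriptorCount = 64; // one slot per texture
textureArray.stageFlags      = VK_SHADER_STAGE_FRAGMENT_BIT;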
In DirectX12, you render multiple objects in different locations using the equivalent of a single uniform buffer for the world transform like:
// Basic simplified pseudocode
SetRootSignature();
SetPrimitiveTopology();
SetPipelineState();
SetDepthStencilTarget();
SetViewportAndScissor();
for (auto object : objects)
{
    SetIndexBuffer();
    SetVertexBuffer();
    struct VSConstants
    {
        QEDx12::Math::Matrix4 modelToProjection;
    } vsConstants;
    vsConstants.modelToProjection = ViewProjMat * object->GetWorldProj();
    SetDynamicConstantBufferView(0, sizeof(vsConstants), &vsConstants);
    DrawIndexed();
}
However, in Vulkan, if you do something similar with a single uniform buffer, all the objects are rendered at the location of the last world matrix:
for (auto object : objects)
{
    SetIndexBuffer();
    SetVertexBuffer();
    UploadUniformBuffer(object->GetWorldProj());
    DrawIndexed();
}
Is there a way to draw multiple objects with a single uniform buffer in Vulkan, just like in DirectX12?
I'm aware of Sascha Willems's dynamic uniform buffer example (https://github.com/SaschaWillems/Vulkan/tree/master/dynamicuniformbuffer) where he packs many matrices into one big uniform buffer, and while useful, it is not exactly what I am looking for.
Thanks in advance for any help.
I cannot find a function called SetDynamicConstantBufferView in the D3D 12 API. I presume this is some function of your invention, but without knowing what it does, I can only really guess.
It looks like you're uploading data to the buffer object while rendering. If that's the case, well, Vulkan can't do that. And that's a good thing. Uploading to memory that you're currently reading from requires synchronization. You have to issue a barrier between the last rendering command that was reading the data you're about to overwrite, and the next rendering command. It's just not a good idea if you like performance.
But again, I'm not sure exactly what that function is doing, so my understanding may be wrong.
In Vulkan, descriptors are generally not meant to be changed in the middle of rendering a frame. However, the makers of Vulkan realized that users sometimes want to draw using different subsets of the same VkBuffer object. This is what dynamic uniform/storage buffers are for.
You technically don't have multiple uniform buffers; you just have one. But you can use the offset(s) provided to vkCmdBindDescriptorSets to shift where in that buffer the next rendering command(s) will get their data from. So it's a light-weight way to supply different rendering commands with different data.
Basically, you rebind your descriptor sets, but with different pDynamicOffsets array values. To make these work, you need to plan ahead: your pipeline layout has to explicitly declare those descriptors as dynamic descriptors, and every time you bind the set, you'll need to provide the offset into the buffer used by that descriptor.
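A sketch of that rebind (assuming the set layout uses VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC, and that alignedBlockSize respects the device's minUniformBufferOffsetAlignment; the other names are illustrative):

for ( uint32_t i = 0; i < objectCount; ++i )
{
    // One dynamic offset per dynamic descriptor in the set.
    uint32_t dynamicOffset = i * alignedBlockSize;
    vkCmdBindDescriptorSets( commandBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout,
                             0, 1, &descriptorSet, 1, &dynamicOffset );
    vkCmdDrawIndexed( commandBuffer, indexCounts[i], 1, firstIndices[i], 0, 0 );
}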
That being said, it would probably be better to make your uniform buffer store larger arrays of matrices, using the dynamic offset to jump from one block of matrices to the next.
The point of that is that the uniform data you provide (depending on hardware) will remain in shader memory unless you do something to change the offset or shader. There is some small cost to uploading such data, so minimizing the need for such uploads is probably not a bad idea.
So you should upload all of your objects' buffer data in a single DMA operation, then issue a barrier, and do your rendering, using dynamic offsets to point each rendering command at its block of data.
You either have to use push constants or have separate uniform buffers for each location. These can be bound either with a descriptor per location or with a dynamic offset.
In Sascha's example you can have more than just the one matrix inside the uniform buffer.
That means that inside UploadUniformBuffer you append the new matrix to the buffer and bind the new location.
In a vertex shader there is, of course, a limited amount of uniform storage allowed, and it is my understanding that different systems may implement GLSL in slightly different ways when it comes to compiling code. I've heard the recommendation to use constants instead of writing out literals in vertex shader code.
For instance, the following code could supposedly result in a reduction in available uniform storage. (I don't quite understand how.)
Example 1: With literals
vec4 myVector = vec4(1.0, 0.0, 0.0, 1.0);
To my understanding, there is the possibility that each use of 1.0 or 0.0 takes up some amount of the uniform storage space. The recommendation is therefore to turn the previous code into something like the following:
Example 2: With constants instead of literals
const float zero = 0.0;
const float one = 1.0;
vec4 myVector = vec4(one, zero, zero, one);
Does anyone understand the argument behind what's going on? I'm not having any problems with code, I'm just trying to understand the stuff properly so that I don't have problems in the future.
My formal question is the following: specifically for the iOS platform using OpenGL ES 2.0, is the best practice to write things out with literals (example 1) or with constants (example 2)? Should I spend my time writing things out with constants each and every time, or should I write out literals and only use constants if the vertex shader fails to compile properly?
Thanks!
Regarding Kimi's mention about not finding anything in the spec, Appendix A-7 of The OpenGL® ES Shading Language spec does include the following:
When calculating the number of uniform variables used, any literal constants present in the shader source after preprocessing are included when calculating the storage requirements. Multiple instances of identical constants should count multiple times.
This is probably the source of the recommendation in OpenGL® ES 2.0 Programming Guide that Kimi quotes.
However, the spec does not mandate this restriction, and presumably any implementation is free to improve on it, but I cannot find anything either way regarding the iOS GL drivers.
I'm curious, did anyone actually follow up on the ideas of overloading a sample shader with literals, in an attempt to reach any potential maximum uniform limit?
(Sorry...I had intended to post this answer as a comment to Kimi's answer, but don't have the required 50 Rep points yet).
From the OpenGL® ES 2.0 Programming Guide
As far as literal values are concerned, the OpenGL ES 2.0 shading language spec states that no constant propagation is assumed. This means that multiple instances of the same literal value(s) will be counted multiple times. Instead of using literal values, appropriate const variables should be declared. This avoids having the same literal value count multiple times, which might cause the vertex shader to fail to compile if vertex uniform storage requirements exceed what the implementation supports.
I could not find anything related to this in the actual spec, and there is no information specific to iOS either.
You can also check out the GLSL Optimizer tool, written to tackle this issue (and lots of others).
I am working on embedded software projects in the automotive domain. In one of my projects, the application software consumes almost 99% of the RAM; the actual RAM size available is 12 KB. We use the TMS470R1B1 Titan F05 microcontroller. I have done some optimisation, like finding unused messages in the software and deleting them, but it still hasn't reduced RAM usage enough. Could you please suggest some good ways to reduce RAM through software optimisation?
Unlike speed optimisation, RAM optimisation might be something that requires "a little bit here, a little bit there" all through the code. On the other hand, there may turn out to be some "low hanging fruit".
Arrays and Lookup Tables
Arrays and look-up tables can be good "low-hanging fruit". If you can get a memory map from the linker, check that for large items in RAM.
Check for look-up tables that haven't used the const declaration properly, which puts them in RAM instead of ROM. Especially look out for look-up tables of pointers, which need the const on the correct side of the *, or may need two const declarations. E.g.:
const my_struct_t * param_lookup[] = {...}; // Table is in RAM!
my_struct_t * const param_lookup[] = {...}; // In ROM
const char * const strings[] = {...}; // Two const may be needed; also in ROM
Stack and heap
Perhaps your linker config reserves large amounts of RAM for heap and stack, larger than necessary for your application.
If you don't use heap, you can possibly eliminate that.
If you measure your stack usage and it's well under the allocation, you may be able to reduce the allocation. For ARM processors, there can be several stacks, for several of the operating modes, and you may find that the stacks allocated for the exception or interrupt operating modes are larger than needed.
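One common way to measure stack usage is "painting": fill the stack region with a known pattern at boot, then later check how much of the pattern survives. A rough sketch; the linker symbol and size here are stand-ins for whatever your toolchain actually provides:

#include <stddef.h>
#include <stdint.h>

extern uint8_t __stack_start__[];   /* lowest address of the stack region (assumption) */
#define STACK_SIZE 1024u            /* illustrative */
#define STACK_FILL 0xAAu

/* Call from startup code, before the stack grows down into the painted region. */
void stack_paint(void)
{
    for (size_t i = 0; i < STACK_SIZE; ++i)
        __stack_start__[i] = STACK_FILL;
}

/* On a descending stack, low bytes still holding the pattern were never used. */
size_t stack_unused_bytes(void)
{
    size_t n = 0;
    while (n < STACK_SIZE && __stack_start__[n] == STACK_FILL)
        ++n;
    return n;
}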
Other
If you've checked for the easy savings, and still need more, you might need to go through your code and save "here a little, there a little". You can check things like:
Global vs local variables
Check for unnecessary use of static or global variables, where a local variable (on the stack) can be used instead. I've seen code that needed a small temporary array in a function, which was declared static, evidently because "it would take too much stack space". If this happens enough times in the code, it would actually save total memory usage overall to make such variables local again. It might require an increase in the stack size, but will save more memory on reduced global/static variables. (As a side benefit, the functions are more likely to be re-entrant, thread-safe.)
Smaller variables
Variables that can be smaller, e.g. int16_t (short) or int8_t (char) instead of int32_t (int).
Enum variable size
enum variable size may be bigger than necessary. I can't remember what ARM compilers typically do, but some compilers I've used in the past by default made enum variables 2 bytes, even though the enum definition really only required 1 byte to store its range. Check your compiler settings.
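If the compiler can't be told to shrink enums, one workaround (names illustrative) is to keep the enum for readability but store values in an explicitly sized integer:

#include <stdint.h>

enum mode { MODE_IDLE, MODE_RUN, MODE_FAULT };

struct state {
    uint8_t mode;    /* holds an enum mode value in exactly 1 byte */
    uint8_t flags;
};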
Algorithm implementation
Rework your algorithms. Some algorithms have a range of possible implementations with a speed/memory trade-off. E.g. AES encryption can use on-the-fly key calculation, which means you don't have to keep the entire expanded key in memory. That saves memory, but it's slower.
Deleting unused string literals won't have any effect on RAM usage because they aren't stored in RAM but in ROM. The same goes for code.
What you need to do is cut back on actual variables and possibly the size of your stack/stacks. I'd look for arrays that can be resized and for unused variables. Also, it's best to avoid dynamic allocation because of the danger of memory fragmentation.
Aside from that, you'll want to make sure that constant data such as lookup tables are stored in ROM. This can usually be achieved with the const keyword.
Make sure the linker produces a MAP file - it will show you where the RAM is used. Sometimes you can find things like string literals/constants that are kept in RAM. Sometimes you'll find there are unused arrays/variables put there by someone else.
If you have the linker map file, it's also easy to attack first the modules which are using the most RAM.
Here are the tricks I've used on the Cell:
Start with the obvious: squeeze 32-bit words into 16s where possible, rearrange structures to eliminate padding, cut down on slack in any arrays. If you've got any arrays of more than eight structures, it's worth using bitfields to pack them down tighter.
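For instance, a sketch of the bit-field packing (field widths illustrative; note that bit-field layout is compiler-dependent, so treat this as a RAM optimisation, not a portable wire format):

#include <stdint.h>

/* Unpacked: 8 bytes or more once padding is counted. */
struct sample_unpacked {
    uint32_t id;       /* only ever 0..4095 */
    uint16_t level;    /* only ever 0..1023 */
    uint8_t  valid;    /* 0 or 1            */
};

/* Packed with bit-fields: the same data fits in 4 bytes. */
struct sample_packed {
    uint32_t id    : 12;
    uint32_t level : 10;
    uint32_t valid : 1;
};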
Do away with dynamic memory allocation and use static pools. A constant memory footprint is much easier to optimize and you'll be sure of having no leaks.
Scope local allocations tightly so that they don't stay on the stack longer than they have to. Some compilers are very bad at recognizing when you're done with a variable, and will leave it on the stack until the function returns. This can be bad with large objects in outer functions, which then eat up stack they no longer need while the outer function calls deeper into the tree.
alloca() doesn't clean up until a function returns, so can waste stack longer than you expect.
Enable function body and constant merging in the compiler, so that if it sees eight different consts with the same value, it'll put just one in the text segment and alias them with the linker.
Optimize executable code for size. If you've got a hard realtime deadline, you know exactly how fast your code needs to run, so if you've any spare performance you can make speed/size tradeoffs until you hit that point. Roll loops, pull common code into functions, etc. In some cases you may actually get a space improvement by inlining some functions, if the prolog/epilog overhead is larger than the function body.
The last one is only relevant on architectures that store code in RAM, I guess.
With respect to functions, the following are ways to optimise RAM usage:
Make sure that the number of parameters passed to a function is carefully analysed. Per the AAPCS (ARM Architecture Procedure Call Standard), a maximum of 4 parameters can be passed in registers on ARM; any further parameters are pushed onto the stack.
Also consider using a global rather than passing the data to a function that is most frequently called with the same parameter.
The deeper the function calls, the heavier the use of the stack. Use a static analysis tool to find the worst-case function call path and look for ways to reduce it. For example, when function A calls function B, B calls C, C calls D, and D in turn calls E, the registers can't carry the parameters at every level, so the stack inevitably gets used.
Look for opportunities to club two parameters into one wherever applicable. Remember that all ARM registers are 32-bit, so further packing is possible:
void abc(bool a, bool b, uint16_t c, uint32_t d, uint8_t e); // 5 parameters: uses registers plus the stack
void abc(uint8_t ab, uint16_t c, uint32_t d, uint8_t e);     // first two params clubbed into one byte, so all 4 parameters fit in registers
Have another look at nested interrupt vectors. In any architecture there are scratch registers and preserved registers, and the preserved registers need to be saved before servicing an interrupt. With nested interrupts, a lot of stack space can be needed to save and restore the preserved registers.
If objects such as structures are passed to a function by value, a lot of data (depending on the struct size) is pushed, which eats up stack space easily. This can be changed to pass by reference, as sketched below.
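A small sketch of the difference (names illustrative):

#include <stdint.h>

typedef struct {
    uint32_t samples[32];
    uint16_t count;
} big_block_t;

/* By value: copies the whole ~132-byte struct onto the stack per call. */
uint32_t sum_by_value(big_block_t block);

/* By reference: passes a single 32-bit pointer in a register. */
uint32_t sum_by_ref(const big_block_t *block);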
Regards,
Barani Kumar Venkatesan
Adding to the previous answers.
If you are running your program from RAM for faster execution, you can create a user-defined section containing all the initialisation routines that you are sure won't run more than once after the system boots. After all the initialisation functions have executed, you can reuse that region for the heap.
The same can be applied to data sections that are identified as no longer needed after a certain stage in your program.