Correctly computing index for offsetting into frame-resource buffers - Vulkan

Let's say, for a rendering setup, that we have a renderer that can be either double or triple buffered, and this buffering can be changed dynamically at runtime. So, at any time, there are 2 or 3 frames in flight (FIF).
Frame resources, like buffers, are duplicated to match the FIF count. Buffers are created large enough to hold (frame-data-size * FIF-count), and reading/writing into those buffers is offset accordingly.
For the purpose of offsetting into buffers, is it good enough to use a monotonically cycling index, which would go like so:
double buffer: 0, 1, 0, 1, 0, 1, ...
triple buffer: 0, 1, 2, 0, 1, 2, ...
Then, if a change is made to the FIF-count at runtime, we first WaitIdle() on the GPU, and then reset this index to 0.
Is this a safe way of offsetting into buffers, such that we don't trample on data still being used by the GPU?
I'm particularly unsure how this may play with a triple-buffer setup with a swapchain mailbox present-mode.
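Concretely, here is a minimal sketch of the indexing scheme I'm describing (variable names like framesInFlight, frameIndex and frameDataSize are purely illustrative):
#include <cstdint>
#include <vulkan/vulkan.h>

// Hypothetical state, for illustration only.
static uint32_t     framesInFlight = 2;           // 2 or 3, may change at runtime
static uint32_t     frameIndex     = 0;           // the cycling index described above
static VkDeviceSize frameDataSize  = 64 * 1024;   // size of one frame's slice

// Called once per frame: return the offset of this frame's slice, then advance the index.
VkDeviceSize nextFrameOffset()
{
    // The buffer was created with size frameDataSize * framesInFlight.
    VkDeviceSize offset = frameIndex * frameDataSize;
    frameIndex = (frameIndex + 1) % framesInFlight;   // 0,1,0,1,... or 0,1,2,0,1,2,...
    return offset;
}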

we first WaitIdle() on the GPU, and then reset this index to 0.
Presumably, you're talking about vkDeviceWaitIdle (which is a function you should almost never use). If so, then as far as non-swapchain assets are concerned, this is safe. vkDeviceWaitIdle will halt the CPU until the GPU device has done all of the work it has been given.
The rules for swapchains don't change just because you waited. You still need to acquire the next image before trying to use it and so forth.
However, waiting doesn't really make sense here. If all you do is reset the index to 0, that means you didn't reallocate any memory. So if you went from 3 buffers to 2, you still have three buffers' worth of storage, and you're only using two of them.
So what did the wait accomplish? You can just keep cycling through your 3 buffers' worth of storage even if you only have 2 swapchain images.
The only point in waiting would be if you need to release storage (or allocate more if you're going from 2 to 3 buffers). And even then, the only reason to wait is if it is absolutely imperative to delete your existing memory before allocating new memory.
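A hedged sketch of the one case that last paragraph describes, where the per-frame storage really is reallocated (createFrameBuffer and destroyFrameBuffer are hypothetical helpers, not from the question):
#include <cstdint>
#include <vulkan/vulkan.h>

void destroyFrameBuffer();                    // hypothetical: frees the old buffer and its memory
void createFrameBuffer(VkDeviceSize size);    // hypothetical: allocates a buffer of the given size

void resizeFrameStorage(VkDevice device, uint32_t newFifCount, VkDeviceSize frameDataSize,
                        uint32_t& framesInFlight, uint32_t& frameIndex)
{
    vkDeviceWaitIdle(device);                          // GPU has finished every submitted frame
    destroyFrameBuffer();                              // now safe to release the old allocation
    createFrameBuffer(frameDataSize * newFifCount);    // allocate the smaller/larger replacement
    framesInFlight = newFifCount;
    frameIndex     = 0;                                // nothing in flight references the old slices
}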

Related

Vulkan, what variables does an object need? As in a separate mesh that can be updated individually

So I have been experimenting, and I can add a new "object" by adding every model in the scene to the same vertex buffer, but this isn't good for a voxel game because I don't want to have to reorganize the entire world's vertices every time a player destroys a block.
And it appears I can also add a new "object" by creating a new vertex and index buffer for it, and simply binding both it and all other vertex buffers to the command buffers array at the same time like this:
vkCmdBeginRenderPass(commandBuffers[i], &renderPassInfo, VK_SUBPASS_CONTENTS_INLINE);
vkCmdBindPipeline(commandBuffers[i], VK_PIPELINE_BIND_POINT_GRAPHICS, graphicsPipeline);
vkCmdBindDescriptorSets(commandBuffers[i], VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout, 0, 1, &descriptorSets[i], 0, nullptr);
// mesh 1
VkBuffer vertexBuffers[] = { vertexBuffer };
VkDeviceSize offsets[] = { 0 };
vkCmdBindVertexBuffers(commandBuffers[i], 0, 1, vertexBuffers, offsets);
vkCmdBindIndexBuffer(commandBuffers[i], indexBuffer, 0, VK_INDEX_TYPE_UINT32);
vkCmdDrawIndexed(commandBuffers[i], static_cast<uint32_t>(indices.size()), 1, 0, 0, 0);
// mesh 2
VkBuffer vertexBuffers2[] = { vertexBuffer2 };
vkCmdBindVertexBuffers(commandBuffers[i], 0, 1, vertexBuffers2, offsets);
vkCmdBindIndexBuffer(commandBuffers[i], indexBuffer2, 0, VK_INDEX_TYPE_UINT32);
vkCmdDrawIndexed(commandBuffers[i], static_cast<uint32_t>(indices.size()), 1, 0, 0, 0);
vkCmdEndRenderPass(commandBuffers[i]);
But then this requires me to bind ALL vertex buffers to the command buffers array every time, even when only a single one of those meshes is updated or created/destroyed. So how would I "add" a new "game object," the vertices and indices of which can be updated without having to loop through everything else in the scene too? Or is it relatively quick to bind an already calculated vertex and index buffer, and is this standard?
And I have tried this with a command buffer per object:
VkSubmitInfo submits[] = { submitInfo, submitInfo2 };
if (vkQueueSubmit(graphicsQueue, 2, submits, inFlightFences[currentFrame]) != VK_SUCCESS) {
    throw std::runtime_error("failed to submit draw command buffer!");
}
But it only renders the last object in the queue (it will render the first object if I say the submit size is 1).
I have tried adding a separate descriptor set, descriptor pool, and pipeline as well, and it still only renders the last command buffer in the queue. I tried adding a new command pool for each object, but commandPool is used by dozens of other functions and it really seems like there is supposed to be only one of those.
You split your world into chunks, and draw one chunk at a time. All chunks have some space reserved for them in (a single) vertex buffer, and when something has changed, you only update that one chunk. If a chunk grows too large... well, you will probably need some sort of memory allocation system.
Do NOT create separate buffers for every little thing. Buffers just hold data. Any data. You can even store different vertex formats for different pipelines in one and the same buffer - just in different places within it, bound with an offset. Do not rebind just to draw a different mesh if all your vertices are packed neatly into one array (they most likely are). If you want to draw only a part of a buffer - just use what the draw commands give you.
Command buffers are just a block of instructions for the GPU. You don't need one per object. However, one cannot be executed and written to at the same time, so you will need at least one per frame in flight, plus one to write to. Pipelines (descriptor sets, and pretty much whatever else you bind) are just a bunch of state that your GPU starts using once you bind it. At the start of a command buffer, the state is undefined - it is NOT inherited between command buffers in any way.
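A rough sketch of that chunk approach, assuming each chunk has a region reserved inside one shared vertex/index buffer pair (the Chunk struct and recordChunks function are made up for illustration):
#include <cstdint>
#include <vector>
#include <vulkan/vulkan.h>

// Hypothetical per-chunk bookkeeping: where this chunk's data lives in the shared buffers.
struct Chunk {
    uint32_t firstIndex;    // offset into the shared index buffer, in indices
    uint32_t indexCount;    // number of indices the chunk currently uses
    int32_t  vertexOffset;  // offset into the shared vertex buffer, in vertices
};

void recordChunks(VkCommandBuffer cmd, VkBuffer sharedVertexBuffer, VkBuffer sharedIndexBuffer,
                  const std::vector<Chunk>& chunks)
{
    VkDeviceSize zero = 0;
    vkCmdBindVertexBuffers(cmd, 0, 1, &sharedVertexBuffer, &zero);           // bind once
    vkCmdBindIndexBuffer(cmd, sharedIndexBuffer, 0, VK_INDEX_TYPE_UINT32);   // bind once

    for (const Chunk& c : chunks)   // one draw per chunk; updating a chunk never touches the others
        vkCmdDrawIndexed(cmd, c.indexCount, 1, c.firstIndex, c.vertexOffset, 0);
}
When a block is destroyed, only the owning chunk's region is rewritten and only its counts change; the next frame's command buffer is simply re-recorded with the updated values.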

Exiting after N threads in a compute shader

So I have a compute shader kernel with the following logic:
[numthreads(64,1,1)]
void CVProjectOX(uint3 t : SV_DispatchThreadID){
    if(t.x >= TotalN)
        return;
    uint compt = DbMap[t.x];
    ....
I understand that it's not ideal to have if/else branching in compute shaders. If so, what is the best way to limit thread work when the total number of threads needed doesn't exactly match a multiple of the kernel's numthreads?
For instance, in my example the kernel group is 64 threads. Let's say I expect 961 threads in total (it could be anything really): if I dispatch 960, one db slot won't be processed; if I dispatch 1024, there will be 63 threads doing unnecessary work, or maybe work pointing to non-existing db slots (the number of db slots will vary).
Is the if(t.x >= TotalN)/return fine and the right approach here?
Should I just do min, tx = min(t.x, TotalN) and keep writing on the final db slot?
Should I just modulo? tx = t.x % TotalN and rewrite the first db slots?
What other solutions?
Limiting the number of threads this way is fine, yes. But be aware that an early return like this doesn't actually save as much work as you'd expect:
The hardware utilizes SIMD-like thread collections (wavefronts; DirectX calls them waves). Depending on the hardware, the size of such a wavefront is usually 4 (Intel iGPUs), 32 (NVIDIA and most AMD GPUs) or 64 (a few AMD GPUs). Due to the nature of SIMD, all threads in such a wavefront always do exactly the same work; you can only "mask out" some of them (meaning their writes will be ignored and they are fine reading out-of-bounds memory).
This means that, in the worst case (when the wavefront size is 64), when you need to execute 961 threads and therefore dispatch 1024, there will still be 63 threads executing the code; they just behave as if they didn't exist. If the wave size is smaller, the hardware might at least early out on some wavefronts, so in those cases the early return does actually save some work.
So it would be best if you never actually needed a number of threads that is not a multiple of your group size (which, in turn, is hopefully a multiple of the hardware's wavefront size). But if that's not possible, limiting the number of threads in this way is the next best option, especially because all threads that do reach the early return are next to each other, which maximizes the chance that a whole wavefront can early out.
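For completeness, the dispatch count on the host side is usually derived with a ceiling division, so the guard in the shader only ever masks out threads of the final, partially filled group (a small sketch; the Dispatch call in the comment stands in for whatever API is being used):
#include <cstdint>

// Number of 64-thread groups needed to cover totalN items (round up, never down).
uint32_t groupCountFor(uint32_t totalN, uint32_t groupSize = 64)
{
    return (totalN + groupSize - 1) / groupSize;   // e.g. 961 items -> 16 groups = 1024 threads
}

// context->Dispatch(groupCountFor(961), 1, 1);    // the extra 63 threads hit the early return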

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy (ignoring all sorts of wrappers, concept checks etc.) with the simple loop:
for (; __first != __last; ++__result, ++__first)
*__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
    T*       __restrict__  destination,
    const T* __restrict__  source,
    size_t                 length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capability 3.0 or better (i.e. Kepler or newer microarchitectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is 1 or 2, and we have each lane write a single element, the entire warp would write less than 128B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that.
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements (up to the first multiple of 32B) and the last elements (after the last such multiple) differently.
Slack due to output address mis-alignment - The output is written in transactions of up to 128B (or is it just 32B?); we thus have to handle the first elements (up to the first multiple of this number) and the last elements (after the last such multiple) differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode
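For reference, a naive baseline that ignores nearly all of the factors listed above (no vectorization, no alignment handling) might look roughly like this; it simply strides the warp's lanes over the range, assuming a warp size of 32:
// Naive warp-collaborative copy: lane i handles elements i, i+32, i+64, ...
// Sketch only; none of the coalescing/alignment optimizations discussed above are applied.
template <typename T>
__device__ T* warp_copy_naive(T* __restrict__ destination,
                              const T* __restrict__ source,
                              size_t length)
{
    const unsigned lane = threadIdx.x % 32;   // lane index within the warp
    for (size_t i = lane; i < length; i += 32)
        destination[i] = source[i];
    return destination + length;
}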

How ext4 works with fallocate

Recently, I have been testing the proper usage of the ext4 filesystem. What I expect is that:
when the system crashes, data whose write has already returned OK must not be lost, but metadata may be.
Here is my usage:
1. Call fallocate to allocate a certain amount of space:
fallocate(fd, 0, 0, 4*1024*1024); // 4MB
2. Call fsync(fd) so that data and metadata are written to disk.
3. Then I randomly write the file in 4k chunks (random data, not 0) with the O_DIRECT flag, but without calling fsync. I log the offsets for which the write returned OK.
4. Check the offsets that were logged. At some of those offsets, reading 4k of data returns 0. It seems to mean that the offset isn't in use, like a hole in the file.
My questions are:
1. Why, after calling fallocate and fsync, does the metadata of the file still seem to indicate that some blocks are not used, so that reading them returns null? That is my understanding of what is happening.
2. Is there another API I can call to make sure the allocated space in the file has no holes, so that afterwards, when a write with O_DIRECT returns OK, the data will not be lost even if the system crashes?
Thanks.
Only writing to the file space can eliminate the hole. Without writing, there is no dirty page and fsync simply does nothing.
I am wondering how you executed your step 4. It seems that you did it with a manual crash, did you? If you read the data back after writing, without a crash, it should not be zero, provided you wrote non-zeros. If you read it after a crash, zeros can happen if a disk cache was involved. However, this kind of zero is not like a hole; these are zeros read from the disk (very probably the disk actually contains zeros).
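A hedged sketch of the write-then-fsync preparation the first paragraph describes (error handling omitted; 4096 is assumed to satisfy the O_DIRECT alignment requirements of the device): pre-write the whole region once and fsync it, so later O_DIRECT writes land on blocks that are already allocated and recorded in the metadata:
#define _GNU_SOURCE          /* for O_DIRECT on glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int prepare_file(const char* path, size_t size)            /* size: a multiple of 4096 */
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);

    void* block;
    posix_memalign(&block, 4096, 4096);                    /* O_DIRECT needs aligned buffers */
    memset(block, 0, 4096);

    for (off_t off = 0; (size_t)off < size; off += 4096)   /* writing is what removes the holes */
        pwrite(fd, block, 4096, off);

    fsync(fd);                                              /* persist data and metadata once */
    free(block);
    return fd;                                              /* later O_DIRECT writes hit allocated blocks */
}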

Static data-heavy Rust library seems bloated

I've been developing a Rust library recently to try to provide fast access to a large database (the Unicode character database, which as a flat XML file is 160MB). I also want it to have a small footprint so I've used various approaches to reduce the size. The end result is that I have a series of static slices that look like:
#[derive(Clone, Copy, Eq, PartialEq, Debug)]
pub enum UnicodeCategory {
    UppercaseLetter,
    LowercaseLetter,
    TitlecaseLetter,
    ModifierLetter,
    OtherLetter,
    NonspacingMark,
    SpacingMark,
    EnclosingMark,
    DecimalNumber,
    // ...
}
pub static UCD_CAT: &'static [((u8, u8, u8), (u8, u8, u8), UnicodeCategory)] =
    &[((0, 0, 0), (0, 0, 31), UnicodeCategory::Control),
      ((0, 0, 32), (0, 0, 32), UnicodeCategory::SpaceSeparator),
      ((0, 0, 33), (0, 0, 35), UnicodeCategory::OtherPunctuation),
      /* ... */];
// ...
pub static UCD_DECOMP_MAP: &'static [((u8, u8, u8), &'static [(u8, u8, u8)])] =
    &[((0, 0, 160), &[(0, 0, 32)]),
      ((0, 0, 168), &[(0, 0, 32), (0, 3, 8)]),
      ((0, 0, 170), &[(0, 0, 97)]),
      ((0, 0, 175), &[(0, 0, 32), (0, 3, 4)]),
      ((0, 0, 178), &[(0, 0, 50)]),
      /* ... */];
In total, all the data should only take up around 600kB max (assuming extra space for alignment etc), but the library produced is 3.3MB in release mode. The source code itself (almost all data) is 2.6MB, so I don't understand why the result would be more. I don't think the extra size is intrinsic as the size was <50kB at the beginning of the project (when I only had ~2kB of data). If it makes a difference, I'm also using the #![no_std] feature.
Is there any reason for the extra binary bloat, and is there a way to reduce the size? In theory I don't see why I shouldn't be able to reduce the library to a megabyte or less.
As per Matthieu's suggestion, I tried analysing the binary with nm.
Because all my tables were represented as borrowed slices, this wasn't very useful for calculating table sizes as they were all in anonymous _refs. What I could determine was the maximum address, 0x1208f8, which would be consistent with a filesize of ~1MB rather than 3.3MB. I also looked through the hex dump to see if there were any null blocks that might explain it, but there weren't.
To see if it was the borrowed slices that were the problem, I turned them into fixed-size arrays ([T; N] form). The filesize didn't change much, but now I could interpret the nm data quite easily. Weirdly, the tables took up exactly how much I expected them to (even more weirdly, they matched my lower bounds when not accounting for alignment, and there was no space between the tables).
I also looked at the tables with nested borrowed slices, e.g. UCD_DECOMP_MAP above. When I removed all of these (about 2/3 of the data), the filesize was ~1MB when it should have only been ~250kB (by my calculations and the highest nm address, 0x3d1d0), so it doesn't look like these tables were the problem either.
I tried extracting the individual files from the .rlib file (which is a simple ar-format archive). It turns out that 40% of the library is just metadata files, and that the actual object file is 1.9MB. Further, when I do this to the library without the borrowed references the object file is 261kB! I then went back to the original library and looked at the sizes of the individual _refs and found that for a table like UCD_DECOMP_MAP: &'static [((u8,u8,u8),&'static [(u8,u8,u8)])], each value of type ((u8,u8,u8),&'static [(u8,u8,u8)]) takes up 24 bytes (3 bytes for the u8 triplet, 5 bytes of padding and 16 bytes for the pointer), and that as a result these tables take up a lot more room than I would have thought. I think I can now fully account for all the filesize.
Of course, 3MB is still quite small, I just wanted to keep the file as small as possible!
Thanks to Matthieu M. and Chris Emerson for pointing me towards the solution. This is a summary of the updates in the question, sorry for the duplication!
It seems that there are two reasons for the supposed bloat:
The .rlib file that is output is not a pure object file, but an ar archive. Usually such a file would consist entirely of one or more object files, but Rust also includes metadata. Part of the reason for this seems to be to obviate the need for separate header files. This accounted for around 40% of the final filesize.
My calculations turned out to not be accurate for some of the tables, which also happened to be the largest ones. Using nm I was able to find that for normal tables such as UCD_CAT: &'static [((u8,u8,u8), (u8,u8,u8), UnicodeCategory)], the size was 7 bytes for each item (which is actually less than I originally anticipated, assuming 8 bytes for alignment). The total of all these tables was about 230kB, and the object file including just these came in at 260kB (after extraction), so this was all consistent.
However, examining the nm output more closely for the other tables (such as UCD_DECOMP_MAP: &'static [((u8,u8,u8),&'static [(u8,u8,u8)])]) was more difficult because they appear as anonymous borrowed objects. Nevertheless, it turned out that each ((u8,u8,u8),&'static [(u8,u8,u8)]) actually takes up 24 bytes: 3 bytes for the first tuple, 5 bytes of padding, and an unexpected 16 bytes for the pointer. I believe this is because the pointer also includes the size of the referenced array. This added around a megabyte of bloat to the library, but does seem to account for the entire filesize.