Vulkan compute shader caches and barriers

I'm trying to understand how L1/L2 cache flushing works. Suppose I have a compute shader like this one:
layout(std430, set = 0, binding = 2) buffer Particles{
    Particle particles[];
};
layout(std430, set = 0, binding = 4) buffer Constraints{
    Constraint constraints[];
};
void main(){
    const uint gID = gl_GlobalInvocationID.x;
    for (int pass = 0; pass < GAUSS_SEIDEL_PASSES; pass++){
        // first query the constraint, which contains particle_id_1 and particle_id_2
        const Constraint c = constraints[gID*GAUSS_SEIDEL_PASSES+pass];
        // read newest positions
        vec3 position1 = particles[c.particle_id_1].position;
        vec3 position2 = particles[c.particle_id_2].position;
        // modify position1 and position2
        position1 += something;
        position2 -= something;
        // update positions
        particles[c.particle_id_1].position = position1;
        particles[c.particle_id_2].position = position2;
        // in the next iteration, different constraints may use the updated positions
    }
}
From what I understand, initially all data resides in L2. When I read particles[c.particle_id_1].position, I copy some of the data from L2 to L1 (or directly to a register).
Then in position1 += something I modify L1 (or the register). Finally, in particles[c.particle_id_1].position = position1, I flush the data from L1 (or a register) back to L2, right? So if I have a second compute shader that I want to run after this one, and that second shader will read the positions of particles, I do not need to synchronize Particles. It would be enough to just put an execution barrier, without a memory barrier:
void vkCmdPipelineBarrier(
    VkCommandBuffer commandBuffer,
    VkPipelineStageFlags srcStageMask,                   // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
    VkPipelineStageFlags dstStageMask,                   // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
    VkDependencyFlags dependencyFlags,                   // here nothing
    uint32_t memoryBarrierCount,                         // here 0
    const VkMemoryBarrier* pMemoryBarriers,              // nullptr
    uint32_t bufferMemoryBarrierCount,                   // 0
    const VkBufferMemoryBarrier* pBufferMemoryBarriers,  // nullptr
    uint32_t imageMemoryBarrierCount,                    // 0
    const VkImageMemoryBarrier* pImageMemoryBarriers);   // nullptr

Vulkan's memory model does not care about "caches" as caches. Its model is built on the notion of availability and visibility. A value produced by GPU command/stage A is "available" to GPU command/stage B if the command/stage A has an execution dependency with command/stage B. A value produced by GPU command/stage A is "visible" to GPU command/stage B if command/stage A has a memory dependency with command/stage B with regard to the particular memory in question and the access modes that A wrote it and B will access it.
If a value is not both available and visible to a command/stage, then attempting to access it yields undefined behavior.
The implementation of availability and visibility will involve clearing caches and the like. But as far as the Vulkan memory model is concerned, this is an implementation detail it doesn't care about. Nor should you: understand the Vulkan memory model and write code that works within it.
Your pipeline barrier creates an execution dependency, but not a memory dependency. Therefore, values written by CS processes before the barrier are available to CS processes afterwards, but not visible to them. You need to have a memory dependency to establish visibility.
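For a compute-to-compute dependency, that could look like the following sketch (a global VkMemoryBarrier; `cmd` is a placeholder command buffer handle, not from the question):

// Establish both an execution dependency and a memory dependency
// between two compute dispatches.
VkMemoryBarrier memoryBarrier = {};
memoryBarrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
memoryBarrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT; // writes from the first dispatch
memoryBarrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;  // reads in the second dispatch

vkCmdPipelineBarrier(
    cmd,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // srcStageMask
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // dstStageMask
    0,                                    // dependencyFlags
    1, &memoryBarrier,                    // the global memory barrier
    0, nullptr,                           // no buffer barriers
    0, nullptr);                          // no image barriers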
However, if you want a GPU-level understanding... it all depends on the GPU. Does the GPU have a cache hierarchy, an L1/L2 split? Maybe some do, maybe some don't.
It's kind of irrelevant anyway, because merely writing a value to an address in memory is not equivalent to a "flush" of the appropriate caches around that memory. Even using the coherent qualifier would only cause a flush for compute shader operations executing within that same dispatch call. It would not be guaranteed to affect later dispatch calls.

Implementation-dependent. For all we know, a device might have no cache at all, or in the future it might be some quantum magic.
A shader assignment operation does not imply anything about caches. There is no "L1" or "L2" mentioned anywhere in the Vulkan specification; it is a concept that does not exist there.
Let's completely divorce ourselves from the cache stuff, and all the mental baggage that comes with it.
What is important here is that when you read something, then that thing needs to be "visible to" the reading agent (irrespective of what kind of device you use, and what obscure memory architecture it might have). If it is not "visible to", then you might be reading garbage.
When you write something, this does not happen automatically. The writes are not "visible to" anyone.
First, you put your writes into the src* part of a memory dependency (e.g. via a pipeline barrier). That makes your writes "available from".
Then you put your reader into the dst* part; that takes all referenced writes that are "available from" and makes them "visible to" the second synchronization scope.
If you really want to shoehorn this into a cache system concept, don't think of it as levels of cache. Think of it as separate caches. That something is already in some cache does not mean it is in the particular cache the consumer needs.

Related

Vulkan memory barrier for indirect compute shader dispatch

I have two compute shaders, and the first one modifies a DispatchIndirectCommand buffer, which is later used by the second one.
// This is the code of the first shader
struct DispatchIndirectCommand{
    uint x;
    uint y;
    uint z;
};
restrict layout(std430, set = 0, binding = 5) buffer Indirect{
    DispatchIndirectCommand dispatch_indirect[1];
};
void main(){
    // some stuff
    dispatch_indirect[0].x = number_of_groups; // I used debugPrintfEXT to make sure that this number is correct
}
I execute them as follows
vkCmdBindPipeline(cmd_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, first_shader);
vkCmdDispatch(cmd_buffer, x, 1, 1);
vkCmdBindPipeline(cmd_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, second_shader);
vkCmdDispatchIndirect(cmd_buffer, indirect_buffer, 0);
The problem is that the changes made by the first shader are not reflected by the second one.
// This is the code of the second shader
void main(){
    debugPrintfEXT("%d", gl_GlobalInvocationID.x); // this seems to never be called
}
I initialise the indirect_buffer with VkDispatchIndirectCommand{.x=0,.y=1,.z=1}, and it seems that the second shader always executes with x==0, because the debugPrintfEXT never prints anything. I tried to add a memory barrier like
VkBufferMemoryBarrier barrier;
barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
barrier.pNext = nullptr;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
barrier.srcQueueFamilyIndex = queue_idx;
barrier.dstQueueFamilyIndex = queue_idx;
barrier.buffer = indirect_buffer;
barrier.offset = 0;
barrier.size = sizeof_indirect_buffer;
However, this does not seem to make any difference. What does seem to work is when I use
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
When I use these access flags, all the compute shaders work properly. However, I get a validation error:
Validation Error: [ VUID-vkCmdPipelineBarrier-dstAccessMask-02816 ] Object 0: handle = 0x5561b60356c8, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x69c8467d | vkCmdPipelineBarrier(): .pMemoryBarriers[1].dstAccessMask bit VK_ACCESS_INDIRECT_COMMAND_READ_BIT is not supported by stage mask (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT). The Vulkan spec states: The dstAccessMask member of each element of pMemoryBarriers must only include access flags that are supported by one or more of the pipeline stages in dstStageMask, as specified in the table of supported access types (https://vulkan.lunarg.com/doc/view/1.2.182.0/linux/1.2-extensions/vkspec.html#VUID-vkCmdPipelineBarrier-dstAccessMask-02816)
Vulkan's documentation states that
VK_ACCESS_INDIRECT_COMMAND_READ_BIT specifies read access to indirect command data read as part of an indirect build, trace, drawing or dispatching command. Such access occurs in the VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT pipeline stage
It looks really confusing. It clearly does mention "dispatching command", but at the same time it says that the stage must be VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT and not VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT. Is the specification contradictory/imprecise or am I missing something?
You seem to be employing a trial-and-error strategy. Do not use such a strategy with low-level APIs, and preferably not in computer engineering in general. Sooner or later you will encounter something that looks like it works when you test it, but is invalid code anyway. The spec did tell you the exact thing to do, so you never had a legitimate reason to try any other flags, or no barrier at all.
As you discovered, the appropriate access flag for indirect read is VK_ACCESS_INDIRECT_COMMAND_READ_BIT. And as the spec also says, the appropriate stage is VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT.
So the barrier for your case should probably be:
srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT
dstStageMask = VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT
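Put together, the call might look like this sketch (reusing cmd_buffer, indirect_buffer, and sizeof_indirect_buffer from the question):

VkBufferMemoryBarrier barrier = {};
barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED; // no ownership transfer
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.buffer = indirect_buffer;
barrier.offset = 0;
barrier.size = sizeof_indirect_buffer;

vkCmdPipelineBarrier(
    cmd_buffer,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // producer: the first dispatch
    VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,  // consumer: the indirect command read
    0,
    0, nullptr,
    1, &barrier,
    0, nullptr);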
The name of the stage flag is slightly confusing, but the spec is otherwise very clear that it applies to compute:
VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT specifies the stage of the pipeline where VkDrawIndirect* / VkDispatchIndirect* / VkTraceRaysIndirect* data structures are consumed.
and also:
For the compute pipeline, the following stages occur in this order:
VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
PS: Also relevant GitHub Issue: KhronosGroup/Vulkan-Docs#176

What does "VkImageMemoryBarrier::srcAccessMask = 0" mean?

I just read the Images chapter of the Vulkan tutorial, and I didn't understand "VkImageMemoryBarrier::srcAccessMask = 0".
code:
barrier.srcAccessMask = 0;
barrier.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
and the tutorial says:
Since the transitionImageLayout function executes a command buffer with only a single command, you could use this implicit synchronization and set srcAccessMask to 0 if you ever needed a VK_ACCESS_HOST_WRITE_BIT dependency in a layout transition.
Q1: If the function's command buffer contains multiple commands, can't this implicit synchronization be used?
Q2: According to the manual page, VK_ACCESS_HOST_WRITE_BIT is 0x00004000, but the tutorial uses "0". Why?
Does "0" mean implicit,
and "VK_ACCESS_HOST_WRITE_BIT" mean explicit?
Am I understanding correctly?
0 access mask means "nothing". As in, there is no memory dependency the barrier introduces.
Implicit synchronization means Vulkan does it for you. As the tutorial says:
One thing to note is that command buffer submission results in implicit VK_ACCESS_HOST_WRITE_BIT synchronization
Specifically this is Host Write Ordering Guarantee.
Implicit means you don't have to do anything. Any host write to mapped memory is already automatically visible to any device access of any vkQueueSubmit called after the mapped memory write.
Explicit in this case would mean to submit a barrier with VK_PIPELINE_STAGE_HOST_BIT and VK_ACCESS_HOST_*_BIT.
Note the sync guarantees only work one way. CPU → GPU is automatic/implicit, but GPU → CPU always needs to be explicit (you need a barrier with dst = VK_PIPELINE_STAGE_HOST_BIT to perform the memory domain transfer operation).
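For example, a GPU → CPU domain transfer could be recorded like this sketch (assuming the device writes happen in the transfer stage; `cmd` is a placeholder command buffer handle):

VkMemoryBarrier toHost = {};
toHost.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
toHost.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT; // device writes (e.g. a copy)
toHost.dstAccessMask = VK_ACCESS_HOST_READ_BIT;      // subsequent host reads

vkCmdPipelineBarrier(
    cmd,
    VK_PIPELINE_STAGE_TRANSFER_BIT, // srcStageMask: where the writes happened
    VK_PIPELINE_STAGE_HOST_BIT,     // dstStageMask: the host domain
    0,
    1, &toHost,
    0, nullptr,
    0, nullptr);
// After the fence for this submission signals, the host may read the memory
// (plus vkInvalidateMappedMemoryRanges if the memory is not HOST_COHERENT).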

compute writes and transfer sync

I have a compute shader that writes into a storage buffer. As soon as the compute queue becomes idle, the storage buffer is transferred to an image. Pipeline barriers before and after the transfer take care of layout transitions.
The relevant code is as follows:
vkCmdDispatch(...);
...
vkQueueWaitIdle(...);
...
...
VkImageMemoryBarrier i = {};
i.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
i.srcAccessMask = 0;
i.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT; // the copy writes the image
i.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
i.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
...
i.image = texture;
i.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
i.subresourceRange....
vkCmdPipelineBarrier(
    commandbuffer,
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
    VK_PIPELINE_STAGE_TRANSFER_BIT,
    0, 0, nullptr, 0, nullptr, 1, &i
);
...
vkCmdCopyBufferToImage(...);
...
i.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
i.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
i.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
i.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
vkCmdPipelineBarrier(
    commandbuffer,
    VK_PIPELINE_STAGE_TRANSFER_BIT,
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    0, 0, nullptr, 0, nullptr, 1, &i
);
The image is then used in a subsequent render pass, and everything works just fine. However, I am a bit concerned that I might be invoking undefined behaviour, because even if waiting for the compute queue ensures execution order between the buffer writes and the buffer transfer, there is no explicit barrier that ensures the writes from the compute shader are actually available and visible to the buffer transfer.

Is there an implicit buffer or memory barrier (at least in this case) that I cannot find in the specs (1.1.123 as of today), or any other kind of mechanism, such that the above code is correct and the compute shader writes are always available to the buffer transfer? If not, would I be right to assume there should be a VkBufferMemoryBarrier right before the first layout-transition pipeline barrier?

I am a bit confused, because reading the specs, I find:
"vkCmdPipelineBarrier is a synchronization command that inserts a dependency between commands submitted to the same queue, or between commands in the same subpass."
but here I would need to insert a memory dependency between two different queues and two distinct pipelines, so I'm not really sure which pipeline would have to have the barrier... if a barrier is even needed in the first place.
You are thinking of the wrong synchronization tool. For synchronization between queues there is VkSemaphore.
There is an additional complication in this case: the concept of queue-family resource ownership. In the case of VK_SHARING_MODE_EXCLUSIVE and differing queue families, you must proceed as written in the Queue Family Ownership Transfer section of the specification, i.e. use special releasing and acquiring barriers + a semaphore.
Otherwise, only a semaphore is needed, as explained in the Semaphore Signaling and Semaphore Waiting & Unsignaling sections:
The first access scope includes all memory access performed by the device.
The second access scope includes all memory access performed by the device.
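So all device writes before the signal are made available, and all device reads after the wait see them; no extra memory barrier is needed. A minimal sketch of the two submissions (assuming placeholder handles compute_queue, transfer_queue, compute_cmd, transfer_cmd, and an already-created semaphore):

// First submission: the compute work signals the semaphore when done.
VkSubmitInfo computeSubmit = {};
computeSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
computeSubmit.commandBufferCount = 1;
computeSubmit.pCommandBuffers = &compute_cmd;
computeSubmit.signalSemaphoreCount = 1;
computeSubmit.pSignalSemaphores = &semaphore;
vkQueueSubmit(compute_queue, 1, &computeSubmit, VK_NULL_HANDLE);

// Second submission: the transfer work waits on it before the transfer stage.
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TRANSFER_BIT;
VkSubmitInfo transferSubmit = {};
transferSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
transferSubmit.waitSemaphoreCount = 1;
transferSubmit.pWaitSemaphores = &semaphore;
transferSubmit.pWaitDstStageMask = &waitStage;
transferSubmit.commandBufferCount = 1;
transferSubmit.pCommandBuffers = &transfer_cmd;
vkQueueSubmit(transfer_queue, 1, &transferSubmit, VK_NULL_HANDLE);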

Can I implement Branch target buffer in two stage pipelined RISC architecture?

I am trying to implement a BTB in a low-level microcontroller such as the PIC16. I don't know whether it is feasible or not, so I wanted your suggestions. Thanks.
The basic BTB is fairly simple, and is the equivalent of
BTBEntry be = BTB[curAddr & BTBBitMask];
nextFetch = be.addr;
which, implemented as electronics, takes the lower curAddr bits, feeds them into a BTB memory, and gets the next address out.
And when the branch is resolved, the result is written back into the BTB.
The lookup can be done in parallel with the instruction fetch, and must be faster than it, since additional steps must be done with the result.
struct BTBEntry {
    int addr;    // predicted target address
    int curAddr; // upper address bits (tag)
};
To avoid just jumping randomly around the program because the stored addr does not correspond to the curAddr, we also need to check that the address we are looking up belongs to the correct branch.
if ((curAddr & ~BTBBitMask) == be.curAddr)
    nextFetch = be.addr; // found in the BTB
else
    nextFetch = curAddr + instructionSize; // not found, take next instruction
So it can be done, if the BTB is small enough and the total time used is less than an instruction fetch. But the effect might not be as large as you would want.
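For illustration, here is a tiny software model of such a direct-mapped BTB; the sizes and names are assumptions, and real hardware does the lookup combinationally rather than in code:

#include <cstdint>

constexpr uint32_t kBTBBits = 6;                 // 64-entry BTB (assumed size)
constexpr uint32_t kBTBBitMask = (1u << kBTBBits) - 1;
constexpr uint32_t kInstructionSize = 2;         // e.g. a 2-byte instruction word

struct BTBEntry {
    uint32_t addr;  // predicted branch target
    uint32_t tag;   // upper bits of the branch's own address
    bool     valid;
};

BTBEntry BTB[1u << kBTBBits]; // zero-initialized: all entries start invalid

// Predict the next fetch address for the instruction at curAddr.
uint32_t predictNextFetch(uint32_t curAddr) {
    const BTBEntry& be = BTB[curAddr & kBTBBitMask];
    if (be.valid && be.tag == (curAddr & ~kBTBBitMask))
        return be.addr;                 // hit: follow the stored target
    return curAddr + kInstructionSize;  // miss: fall through
}

// Once the branch resolves, write the actual target back into the BTB.
void updateBTB(uint32_t branchAddr, uint32_t targetAddr) {
    BTBEntry& be = BTB[branchAddr & kBTBBitMask];
    be.addr  = targetAddr;
    be.tag   = branchAddr & ~kBTBBitMask;
    be.valid = true;
}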

Processes in Operating Systems

When I read a source about processes and threads in operating systems, I came across this sentence, and it sounded weird to me:
When a program is executed and handled by the processor, it converts into a process. A process needs to use the data and code segment in the memory.
I think the first sentence is naturally true. However, I cannot understand why a process would need to use solely the data and code segments.
#include <stdio.h>
#include <stdlib.h>

int x = 10; // initialized global: data segment
int y;      // uninitialized global: BSS segment

int main(void){
    int *array = (int*)malloc(sizeof(int) * 4); // heap
    printf("x and y are %d %d\n", x, y);
    free(array);
    return 0;
}
I think that when this code is executed, the generated process uses the BSS, data, heap, and code segments. In my opinion, a process can benefit from any segment of the memory.
If my thoughts are wrong, can anyone explain the reason?
A process has to store in memory:
Code.
Heap.
Stack.
Data.
BSS.
Except for really trivial ones, a program will use all of these segments. Take a look at Wikipedia's explanation of what the segments contain.
I think that in the sentence the author didn't want to go into details, and refers to stack/heap/data/BSS as the data of your program, not the actual data segment.
This statement is not correct.
When a program is executed and handled by the processor, it converts into a process. A process needs to use the data and code segment in the memory.
A process has to exist before a program can be executed. On many non-Unix systems, a single process runs multiple programs.
I think that when this code is executed, the generated process use bss, data, heap and code segment. In my opinion, a process can benefit from any segment of the memory.
The LINKER defines program segments. The loader follows the instructions of the linker to create the address space.
"bss, data, heap, and code" is a bad way to envision the address space.
There is:
Executable data
Read-only data
Read/write data that can be:
    initialized
    uninitialized
Heap and stack are just read/write data. The operating system cannot even tell what data is stack and what is heap. It's all just memory.
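As a quick illustration, a small program can print an address from each of these regions; where exactly they land is up to the linker and the OS (a sketch, not authoritative):

#include <cstdio>
#include <cstdlib>

int initialized_global = 10; // initialized read/write data ("data")
int uninitialized_global;    // zero-initialized read/write data ("BSS")
const int read_only = 42;    // typically placed in a read-only section

void code_example() {}       // a function, to get an executable-section address

int main() {
    int local = 0;                              // stack
    int* heap = (int*)malloc(4 * sizeof(int));  // heap

    // casting a function pointer to void* is implementation-specific,
    // but works on common platforms
    printf("code:  %p\n", (void*)&code_example);
    printf("data:  %p\n", (void*)&initialized_global);
    printf("bss:   %p\n", (void*)&uninitialized_global);
    printf("ro:    %p\n", (void*)&read_only);
    printf("heap:  %p\n", (void*)heap);
    printf("stack: %p\n", (void*)&local);

    free(heap);
    return 0;
}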