Vulkan queue submission synchronization - vkWaitForFences vs vkQueueWaitIdle [duplicate]

I have a function that copies data from one buffer to another, and I need to synchronize its execution.
Here is my current (admittedly bad) option:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
    VkCommandBuffer commandBuffer;
    // allocInfo, beginInfo, copyRegion and submitInfo are filled in elsewhere (omitted here)
    vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
    //Start recording
    vkBeginCommandBuffer(commandBuffer, &beginInfo);
    vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
    vkEndCommandBuffer(commandBuffer);
    //Run command buffer
    vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
    //Waiting for completion
    vkQueueWaitIdle(graphicsQueue);
    vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
}
This option is bad because if I call copyBuffer() several times, the buffers will be copied strictly one at a time.
I want to use a fence for each call so that multiple calls can run in parallel.
So far, I have only this solution:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
    VkCommandBuffer commandBuffer;
    vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
    //Create fence
    VkFenceCreateInfo fenceInfo{};
    fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
    fenceInfo.flags = VK_FENCE_CREATE_SIGNALED_BIT;
    VkFence executionCompleteFence = VK_NULL_HANDLE;
    if (vkCreateFence(logicalDevice, &fenceInfo, VK_NULL_HANDLE, &executionCompleteFence) != VK_SUCCESS) {
        throw MakeErrorInfo("Failed to create fence");
    }
    //Start recording
    vkBeginCommandBuffer(commandBuffer, &beginInfo);
    vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
    vkEndCommandBuffer(commandBuffer);
    //Run command buffer
    vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
    //Waiting for completion
    vkWaitForFences(logicalDevice, 1, &executionCompleteFence, VK_TRUE, UINT64_MAX);
    vkResetFences(logicalDevice, 1, &executionCompleteFence);
    vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
    vkDestroyFence(logicalDevice, executionCompleteFence, VK_NULL_HANDLE);
}
Which of these options is better?
Is the second option written correctly?

Both functions are bad in the same way: they block the CPU from doing anything until the transfer is done. And both could end up submitting multiple CBs to the same queue in the same frame, but with separate submit calls.
Neither is desirable if you care about performance.
Ultimately, what you need to do is have your copyBuffer function not actually perform the copy. You should have a function which builds a command buffer to do a copy. That CB is then stored in a place to be submitted later with other copying CBs. Or better yet, you can have just one copying CB that each command adds to (the first one called in a frame will create the CB).
At some point in the future, before you've submitted the work that will use this data, you need to submit the transfer operations. And the way this works depends on if you're submitting the transfer operations on the same queue as the work that will consume them or not.
If they're on the same queue, then all you need to do is have an event in a command buffer at the end of your batch that synchronizes the transfer operations with their receivers. If you want to be more clever, each transfer operation can have its own event, which the receiving operations will wait on.
And in same-queue transfers, you also want to make sure that you submit the transfers in the same vkQueueSubmit call as the rest of your work. Or to put it another way, you should never make more than one call to vkQueueSubmit for a particular queue in a particular frame.
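As a rough illustration of that single-submit idea, here is a sketch of batching a transfer command buffer and the rendering command buffer into one vkQueueSubmit call. The handles transferCmd, renderCmd and frameFence are hypothetical and assumed to be recorded/created elsewhere:
/* Hypothetical handles: transferCmd and renderCmd are the pre-recorded command
   buffers for this frame, frameFence signals when the whole frame is done. */
VkCommandBuffer cmdBuffers[2] = { transferCmd, renderCmd };

VkSubmitInfo submit = {0};
submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.commandBufferCount = 2;       /* transfers first, then the work that consumes them */
submit.pCommandBuffers = cmdBuffers;

/* One vkQueueSubmit per queue per frame; the events/barriers recorded inside
   the command buffers express the actual dependency between the two. */
vkQueueSubmit(graphicsQueue, 1, &submit, frameFence);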
If you're dealing with separate queues, then things change. A bit. If timeline semaphores aren't an option, you'll need to submit your transfer work before you submit the receiving operations. This is because the transfer batch will need to signal a semaphore that the receiving operation will wait on. And a binary semaphore cannot be waited on until the operation that signals it has been submitted to a queue.
But otherwise, everything else stays the same. Of course, you don't need events since you're synchronizing by semaphore.
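For the separate-queue case, a minimal sketch of that submission order with a binary semaphore follows. The handles transferQueue, graphicsQueue, transferCmd, renderCmd, copyDone and frameFence are assumptions, and the wait stage should be whatever stage first reads the data:
/* Submit the transfer batch first, signaling the semaphore. */
VkSubmitInfo transferSubmit = {0};
transferSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
transferSubmit.commandBufferCount = 1;
transferSubmit.pCommandBuffers = &transferCmd;
transferSubmit.signalSemaphoreCount = 1;
transferSubmit.pSignalSemaphores = &copyDone;
vkQueueSubmit(transferQueue, 1, &transferSubmit, VK_NULL_HANDLE);

/* The consuming batch waits on the semaphore at the first stage that reads the data. */
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
VkSubmitInfo renderSubmit = {0};
renderSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
renderSubmit.waitSemaphoreCount = 1;
renderSubmit.pWaitSemaphores = &copyDone;
renderSubmit.pWaitDstStageMask = &waitStage;
renderSubmit.commandBufferCount = 1;
renderSubmit.pCommandBuffers = &renderCmd;
vkQueueSubmit(graphicsQueue, 1, &renderSubmit, frameFence);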

The two functions are semantically identical and exhibit exactly the same blocking behavior.
The second is slightly better: vkQueueWaitIdle is more of a debug / out-of-hotspot feature, and it may incur a hidden second submit to signal the implicit fence.
You don't need to reset a fence that you subsequently destroy anyway. And you are creating it pre-signaled, which is a bug. Also, you forgot to pass it to vkQueueSubmit.
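For reference, a minimal sketch of the fence path with those three issues fixed (it still blocks the CPU, so the batching advice from the other answer still applies). The allocInfo/beginInfo/copyRegion/submitInfo setup is the same as in the question:
VkFenceCreateInfo fenceInfo = {0};
fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;        /* flags left at 0: created unsignaled */
VkFence copyDoneFence = VK_NULL_HANDLE;
vkCreateFence(logicalDevice, &fenceInfo, NULL, &copyDoneFence);

vkQueueSubmit(graphicsQueue, 1, &submitInfo, copyDoneFence);  /* fence actually passed to the submit */
vkWaitForFences(logicalDevice, 1, &copyDoneFence, VK_TRUE, UINT64_MAX);

vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
vkDestroyFence(logicalDevice, copyDoneFence, NULL);           /* no reset needed before destroying */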

Related

Do I need two different queue submissions here?

I have command buffer 1 and command buffer 2. Both have finished recording, and now I want to submit both of them, preferably in an efficient way (one vkQueueSubmit call).
The second command buffer needs to wait until the first one has completed. I thought I could do this with one submit:
/* I THINK THE FOLLOWING IS WRONG */
VkSubmitInfo info;
info.pCommandBuffers = /* Pointer to the first command buffer */;
info.commandBufferCount = 2;
info.pSignalSemaphores = /* The first pointer value points to the semaphore
                            that the first command buffer will signal when completed */;
info.pWaitSemaphores = /* The second pointer value points to the same semaphore as above */;
So I'm pretty sure the above is a wrong understanding. Does that mean the only way to make the second command buffer depend on the first is to use two different queue submissions, i.e. two different VkSubmitInfos?
Binary semaphores are meant for interqueue communication. And while you can use timeline semaphores for intraqueue communication, for something like this it's best to take a step back and ask yourself what is really going on.
Because it's not true that CB2 needs to wait for CB1 to complete. What's actually happening is that the result of some operation within CB1 needs to be complete before some operation within CB2 can start. This is best accomplished with a VkEvent: CB1's operation can set an event that CB2's corresponding operation will wait on.
There's no need to separate these operations into different batches, let alone different queue submissions.
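A rough sketch of that event-based approach, assuming the producing operation in cb1 is a transfer write into a hypothetical sharedBuffer that a compute shader in cb2 then reads (adjust the stage and access masks to your actual operations; transferEvent is a VkEvent created with vkCreateEvent):
/* In cb1, right after the producing copy/write: */
vkCmdSetEvent(cb1, transferEvent, VK_PIPELINE_STAGE_TRANSFER_BIT);

/* In cb2, right before the consuming dispatch: */
VkBufferMemoryBarrier barrier = {0};
barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.buffer = sharedBuffer;
barrier.offset = 0;
barrier.size = VK_WHOLE_SIZE;
vkCmdWaitEvents(cb2, 1, &transferEvent,
                VK_PIPELINE_STAGE_TRANSFER_BIT,        /* srcStageMask: where cb1 signals */
                VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  /* dstStageMask: where cb2 consumes */
                0, NULL, 1, &barrier, 0, NULL);

/* Both command buffers can then go into the same VkSubmitInfo of one vkQueueSubmit. */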

compute writes and transfer sync

I have a compute shader that writes into a storage buffer. As soon as the compute queue becomes idle, the storage buffer is transferred to an image. Pipeline barriers before and after the transfer take care of the layout transitions.
The relevant code is as follows:
vkCmdDispatch(...);
...
vkQueueWaitIdle(...);
...
...
VkImageMemoryBarrier i = {};
i.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
i.srcAccessMask = 0;
i.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
i.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
i.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
...
i.image = texture;
i.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
i.subresourceRange....
vkCmdPipelineBarrier(
    commandbuffer,
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
    VK_PIPELINE_STAGE_TRANSFER_BIT,
    0, 0, nullptr, 0, nullptr, 1, &i
);
...
vkCmdCopyBufferToImage(...);
...
i.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
i.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
i.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
i.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
vkCmdPipelineBarrier(
    commandbuffer,
    VK_PIPELINE_STAGE_TRANSFER_BIT,
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    0, 0, nullptr, 0, nullptr, 1, &i
);
The image is then used in a subsequent render pass, and everything works just fine.
However, I am a bit concerned that I might be hitting undefined behaviour, because even if waiting for the compute queue ensures execution order between the buffer writes and the buffer transfer, there is no explicit barrier that ensures the writes from the compute shader are actually available and visible to the buffer transfer.
Is there an implicit buffer or memory barrier (at least in this case) that I cannot find in the specs (1.1.123 as of today), or any other kind of mechanism, such that the above code is correct and the compute shader writes are always available to the buffer transfer?
If not, would I be right to assume there should be a VkBufferMemoryBarrier right before the first layout-transition pipeline barrier?
I am a bit confused, because reading the specs I find:
"vkCmdPipelineBarrier is a synchronization command that inserts a dependency between commands submitted to the same queue, or between commands in the same subpass."
but here I would need to insert a memory dependency between two different queues and two distinct pipelines, so I'm not really sure which pipeline would have to have the barrier... if a barrier is even needed in the first place.
You are thinking of the wrong synchronization tool. For synchronization between queues there is VkSemaphore.
There is an additional complication in this case: the concept of queue-family resource ownership. With VK_SHARING_MODE_EXCLUSIVE and differing queue families you must proceed as described in the Queue Family Ownership Transfer section of the specification, i.e. use a special releasing and acquiring barrier pair plus a semaphore.
Otherwise, only the semaphore is needed, as explained in the Semaphore Signaling and Semaphore Waiting & Unsignaling sections:
Signaling: "The first access scope includes all memory accesses performed by the device."
Waiting: "The second access scope includes all memory accesses performed by the device."
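For the exclusive-sharing, different-queue-family case, here is a rough sketch of the release/acquire barrier pair. The names computeCmd/transferCmd, computeFamily/transferFamily and storageBuffer are hypothetical, and the two submits are still linked by a semaphore exactly as in the simple cross-queue case:
/* Release the buffer on the compute queue (recorded at the end of the compute
   command buffer), then acquire it on the other queue before the copy. */
VkBufferMemoryBarrier release = {0};
release.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
release.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
release.dstAccessMask = 0;                         /* ignored for a release operation */
release.srcQueueFamilyIndex = computeFamily;
release.dstQueueFamilyIndex = transferFamily;
release.buffer = storageBuffer;
release.offset = 0;
release.size = VK_WHOLE_SIZE;
vkCmdPipelineBarrier(computeCmd,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
    0, 0, NULL, 1, &release, 0, NULL);

VkBufferMemoryBarrier acquire = release;           /* same queue family indices */
acquire.srcAccessMask = 0;                         /* ignored for an acquire operation */
acquire.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
vkCmdPipelineBarrier(transferCmd,
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT,
    0, 0, NULL, 1, &acquire, 0, NULL);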

Chronicle Queue - reader/tailer latency when reading and writing at the same time

I'm setting up market data back-testing using Chronicle Queue (CQ): reading data from a binary file, then writing it into a single CQ, and simultaneously reading the data from that CQ and dumping the statistics. I am doing a POC to replace our existing real-time market data feed handler worker queue.
While doing basic read/write testing on a Linux/SSD setup, I see that reads are lagging behind writes - in fact, latency is accumulating. Both the Appender and the Tailer are running as separate processes on the same host.
I would like to know if there is any issue in the code I am using.
Below is the code snippet -
Writer -
In constructor -
myQueue = SingleChronicleQueueBuilder.binary(queueName).build();
myAppender = myQueue.acquireAppender();
In data callback -
myAppender.writeDocument(myDataPacket);
myQueue.close();
where myDataPacket is Java object wrapping the byte[] and other fields.
Tailer -
In Constructor -
myQueue = SingleChronicleQueueBuilder.binary(queueName).build();
myTailer = myQueue.createTailer();
In Read method -
while (notLastRecord)
{
    if (myTailer.readDocument(myDataPacket))
    {
        notLastRecord = ...; // update based on whether this was the last record
        //do stuff
    }
}
myQueue.close();
Any help is highly appreciated.
Thanks,
Pavan
First of all, I assume that by "reads are lagging behind writes - in fact latency is accumulating" you mean that for every subsequent message, the time at which the message is read from the queue drifts further from the time the event was written to the queue.
If you see latency accumulating like that, most likely the data is produced much quicker than you can consume it, which, given the use case you described, is very much possible: if all you need to do on the write side is parse a simple text line and dump it into a queue file, that's quick, but if you do some processing when you read the entry from the queue, it might be slower.
From the code it's not clear what or how much work your code is doing. The code looks OK to me, except that you probably shouldn't call queue.close() after each appender.writeDocument() call - but most likely you are not doing that, otherwise it would blow up.
Without seeing actual code or a test case it's impossible to say more.

How to implement prioritized lock with only compare_and_swap?

Given only compare-and-swap, I know how to implement a lock.
However, how do I implement a spin lock such that
1) multiple threads can block on it while trying to lock, and
2) the threads are then un-blocked (and acquire the lock) in the order in which they blocked on it?
Is it even possible? If not, what other primitives do I need?
If so, how do I do it?
Thanks!
You are going to need a list for the waiting threads. You need to add and remove items from the list in a thread-safe manner. You will need to be able to sleep threads that fail to acquire the lock, and to wake one thread when the lock becomes available. On Linux you can accomplish the sleeping and waking by having the thread wait on a signal.
Now, there is a lazy way to do this where you might not need to care about waking threads. Here is pseudo-code from our skiplist; this is what we do to add an item.
cFails = 0
while (1) {
    NewState = OldState = State;
    if (cFails > 3 || OldState.Lock) {
        sleep(); // not too sophisticated, because they can't be awoken
        cFails = 0;
        continue;
    }
    Look for item in skiplist
    return item if we found it

    // to add the item to the list we need to lock it
    // ABA lock uses a version number
    NewState.Lock = 1;
    NewState.nVer++;
    if (!CAS(&State, OldState, NewState)) {
        ++cFails;
        continue;
    }
    // if the thread gets preempted right here, the lock is left on, and other threads
    // spinning would waste their entire time slice.

    // unlock
    OldState = NewState;
    NewState.Lock = 0;
    NewState.nVer++;
    CAS(&State, OldState, NewState);
}
We expect the skiplist to usually find the item and only rarely have to add it. We rarely have a race to add, even with a lot of threads. We tested this with a worst-case scenario consisting of lots of threads adding and searching for millions of items in a single list. The result is that we rarely saw threads fail to get the lock, so this simple approach, which is high-performance for the expected case, works for us. There is one bad thing that can happen: a thread gets preempted while holding the lock. That's what the cFails > 3 check catches - it puts waiting threads to sleep so we don't waste their timeslices on a million useless spins. cFails is set just high enough to detect that the owner of the lock is not active.
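For the strict FIFO requirement in the question, a common alternative to a wait list is a ticket lock. Here is a minimal C11 sketch that uses only compare-and-swap; note that the waiters spin rather than sleep, so it only suits short critical sections:
#include <stdatomic.h>

/* FIFO "ticket" spinlock built from compare-and-swap only. */
typedef struct {
    atomic_uint next_ticket;   /* next ticket number to hand out */
    atomic_uint now_serving;   /* ticket currently allowed to hold the lock */
} ticket_lock;

void ticket_lock_acquire(ticket_lock *l)
{
    /* Take a ticket with a CAS loop (the question allows only compare-and-swap). */
    unsigned int t = atomic_load(&l->next_ticket);
    while (!atomic_compare_exchange_weak(&l->next_ticket, &t, t + 1)) {
        /* t was reloaded by the failed CAS; just retry */
    }
    /* Spin until it is our turn; tickets are served in order, which gives FIFO. */
    while (atomic_load(&l->now_serving) != t) {
        /* optionally pause/yield here instead of burning the whole timeslice */
    }
}

void ticket_lock_release(ticket_lock *l)
{
    /* Only the lock holder updates now_serving, so this CAS cannot race with
       another releaser; a plain atomic store would also do. */
    unsigned int t = atomic_load(&l->now_serving);
    atomic_compare_exchange_strong(&l->now_serving, &t, t + 1);
}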

Keeping time using timer interrupts on an embedded microcontroller

This question is about programming small microcontrollers without an OS. In particular, I'm interested in PICs at the moment, but the question is general.
I've seen several times the following pattern for keeping time:
Timer interrupt code (say the timer fires every second):
...
if (sec_counter > 0)
    sec_counter--;
...
Mainline code (non-interrupt):
sec_counter = 500; // 500 seconds
while (sec_counter)
{
    // .. do stuff
}
The mainline code may repeat, set the counter to various values (not just seconds) and so on.
It seems to me there's a race condition here because the assignment to sec_counter in the mainline code isn't atomic. For example, on the PIC18 the assignment is translated into four assembly instructions (loading one byte at a time, with a bank select before each access). If the interrupt fires in the middle of this, the final value may be corrupted.
Curiously, if the value assigned is less than 256, the assignment is atomic, so there's no problem.
Am I right about this problem?
What patterns do you use to implement such behavior correctly? I see several options:
Disable interrupts before each assignment to sec_counter and enable after - this isn't pretty
Don't use an interrupt, but a separate timer which is started and then polled. This is clean, but uses up a whole timer (in the previous case the 1-sec firing timer can be used for other purposes as well).
Any other ideas?
The PIC architecture is as atomic as it gets. It ensures that all read-modify-write operations to a register file location are 'atomic'. Although it takes four clocks to perform the entire read-modify-write, all four clocks are consumed in a single instruction, and the next instruction uses the next four-clock cycle. That is how the pipeline works: in eight clocks, two instructions are in the pipeline.
If the value is larger than 8 bits, it becomes an issue, as the PIC is an 8-bit machine and larger operands are handled in multiple instructions. That introduces atomicity issues.
You definitely need to disable the interrupt before setting the counter. Ugly as it may be, it is necessary. It is good practice to ALWAYS disable the interrupt before configuring hardware registers or software variables that affect the ISR. If you are writing in C, you should consider all operations as non-atomic. If you find that you have to look at the generated assembly too many times, it may be better to abandon C and program in assembly. In my experience, this is rarely the case.
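A minimal sketch of that disable/restore pattern, assuming a PIC18 toolchain (e.g. XC8) that exposes the global interrupt enable bit as INTCONbits.GIE; adjust for your device:
#include <stdint.h>
#include <xc.h>   /* assumption: Microchip XC8, provides INTCONbits */

volatile uint16_t sec_counter;

void set_sec_counter(uint16_t value)
{
    unsigned char gie_was_set = INTCONbits.GIE;  /* remember current interrupt state */
    INTCONbits.GIE = 0;                          /* disable interrupts */
    sec_counter = value;                         /* multi-byte write is now safe */
    if (gie_was_set)
        INTCONbits.GIE = 1;                      /* restore */
}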
Regarding the issue discussed, this is what I suggest:
ISR:
if (countDownFlag)
{
    sec_counter--;
}
and setting the counter:
// make sure the countdown isn't running
sec_counter = 500;
countDownFlag = true;
...
// Countdown finished
countDownFlag = false;
You need an extra variable, and it is better to wrap everything in a function:
void startCountDown(int startValue)
{
    sec_counter = startValue;
    countDownFlag = true;
}
This way you abstract the starting method (and hide ugliness if needed). For example you can easily change it to start a hardware timer without affecting the callers of the method.
Writing the value and then checking that it is the value required would seem to be the simplest alternative.
do {
    sec_counter = value;
} while (sec_counter != value);
BTW you should make the variable volatile if using C.
If you need to read the value then you can read it twice.
do {
    value = sec_counter;
} while (value != sec_counter);
Because accesses to the sec_counter variable are not atomic, there's really no way to avoid disabling interrupts before accessing this variable in your mainline code and restoring interrupt state after the access if you want deterministic behavior. This would probably be a better choice than dedicating a HW timer for this task (unless you have a surplus of timers, in which case you might as well use one).
If you download Microchip's free TCP/IP Stack there are routines in there that use a timer overflow to keep track of elapsed time. Specifically "tick.c" and "tick.h". Just copy those files over to your project.
Inside those files you can see how they do it.
It's not so curious that moves of values under 256 are atomic - moving an 8-bit value is one opcode, so that's as atomic as you get.
The best solution on such a microcontroller as the PIC is to disable interrupts before you change the timer value. You can even check the value of the interrupt flag when you change the variable in the main loop and handle it if you want. Make it a function that changes the value of the variable and you could even call it from the ISR as well.
Well, what does the comparison assembly code look like?
Taking into account that it counts downwards and that it's just a compare with zero, it should be safe if it first checks the MSB and then the LSB. There could be corruption, but it doesn't really matter if it happens in the middle, between 0x100 and 0xff, and the corrupted compare value is 0x1ff.
The way you are using your timer now, it won't count whole seconds anyway, because you might change it in the middle of a cycle.
So, if you don't care about that, the best way, in my opinion, would be to read the value and then just compare the difference (sketched just below). It takes a couple more instructions, but has no multi-threading problems (since the timer has priority).
If you are stricter about the time value, I would automatically disable the timer once it counts down to 0, clear the internal counter of the timer, and activate it only when you need it.
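A sketch of that difference-based approach, assuming a free-running 8-bit tick counter that the ISR increments once per second (the names are illustrative, and the loop body must run at least once every 255 ticks so the 8-bit difference never wraps more than once):
volatile unsigned char tick;   /* free-running, incremented once per second in the ISR */

void wait_seconds(unsigned int seconds)
{
    unsigned char last = tick;       /* single-byte read: atomic on the PIC */
    unsigned int  elapsed = 0;
    while (elapsed < seconds) {
        unsigned char now = tick;
        elapsed += (unsigned char)(now - last);   /* wrap-safe 8-bit difference */
        last = now;
        /* .. do stuff */
    }
}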
Move the code portion that would be in main() to a proper function, and have it conditionally called by the ISR.
Also, to avoid any delayed or missed ticks, make this timer ISR a high-priority interrupt (the PIC18 has two levels).
One approach is to have the interrupt keep a byte-sized counter, and have something else which gets called at least once every 256 counter ticks do something like:
// ub==unsigned char; ui==unsigned int; ul==unsigned long
ub now_ctr;   // This one is hit by the interrupt
ub prev_ctr;  // Last value of now_ctr seen by poll_counter()
ul big_ctr;   // Extended tick count maintained by the mainline code

void poll_counter(void)
{
    ub delta_ctr;
    delta_ctr = (ub)(now_ctr - prev_ctr);  // ticks since the last poll (wrap-safe)
    big_ctr += delta_ctr;
    prev_ctr += delta_ctr;
}
A slight variation, if you don't mind forcing the interrupt's counter to stay in sync with the LSB of your big counter:
ul big_ctr;

void poll_counter(void)
{
    big_ctr += (ub)(now_ctr - big_ctr);
}
No one has addressed the issue of reading multi-byte hardware registers (for example a timer).
The timer could roll over and carry into its upper bytes while you're reading it.
Say it's 0x0001ffff and you read it: you might get 0x0002ffff, or 0x00010000.
The 16-bit peripheral register is volatile to your code.
For any volatile "variables", I use the double read technique.
do {
    t = timer;
} while (t != timer);