Is dispatch_apply synchronous or asynchronous? - objective-c

I was told that I could use Grand Central Dispatch to run n processes simultaneously, in an asynchronous fashion. The documentation said that if the processes were in a for loop, I could use the function dispatch_apply. But now the documentation is saying:
Note that dispatch_apply is synchronous, so all the applied blocks
will have completed by the time it returns.
Does this mean the blocks that are submitted to a queue using dispatch_apply are executed in order? If so, what is the point of using concurrency? Won't it be just as slow?

dispatch_apply is, as stated in the docs, synchronous. It runs a block on the specified queue in parallel (if possible) and waits until all the blocks return. If you want to run a block just once asynchronously, use dispatch_async; if you want to run a block multiple times in parallel without blocking your current queue, call dispatch_apply within dispatch_async:
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_BACKGROUND, 0), ^{
    dispatch_apply(10, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_BACKGROUND, 0), ^(size_t size) {
        NSLog(@"%zu", size);
    });
});

The purpose of the synchronous dispatch_apply is to asynchronously dispatch the inner-loop iterations to available parallel processing resources. Thus, the overall loop performance may speed up.
Faster loop performance? Very possibly, yes (see the caveat below).
Blocks the thread calling dispatch_apply? Yes, just as a plain loop blocks until it has completed.
For GCD, dispatch_apply is synchronous, since dispatch_apply does not return until all of the asynchronous, parallel tasks it creates have completed.
However, each individual task enqueued by dispatch_apply can run as a concurrent, asynchronous task if the target queue is concurrent.
For example in Swift:
let batchCount: Int = 10
let queue = dispatch_get_global_queue(QOS_CLASS_UTILITY, 0)
dispatch_apply(batchCount, queue) { (i: Int) -> Void in
    print(i, terminator: " ")
}
print("\ndispatch_apply QOS_CLASS_UTILITY queue completed")
yields unordered output like:
0 8 1 9 2 3 4 5 6 7
dispatch_apply QOS_CLASS_UTILITY queue completed
So, dispatch_apply synchronously blocks when called, but the "batch" of tasks generated by dispatch_apply can run concurrently, asynchronously, in parallel to each other.
Keep in mind the caveat that ...
the work performed during each iteration is distinct from the work
performed during all other iterations, and the order in which each
successive loop finishes is unimportant
Also, note that using a serial queue for the inner-loop tasks will not yield any performance gain; the sketch after the following quote illustrates this.
Although using a serial queue is permissible and does the right thing
for your code, using such a queue has no real performance advantages
over leaving the loop in place.
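To make that concrete, here is a minimal sketch in plain C with libdispatch (built with clang's blocks support on macOS; the queue label is an arbitrary placeholder). On a serial queue, dispatch_apply still works and still blocks until done, but the iterations cannot overlap, so there is no speedup:

#include <dispatch/dispatch.h>
#include <stdio.h>

int main(void) {
    // A serial queue: iterations run one at a time, so this is no faster
    // than a plain for loop.
    dispatch_queue_t serial =
        dispatch_queue_create("com.example.serial", DISPATCH_QUEUE_SERIAL);
    dispatch_apply(10, serial, ^(size_t i) {
        printf("%zu ", i); // always prints 0 1 2 3 4 5 6 7 8 9, in order
    });
    printf("\n");
    dispatch_release(serial); // needed when building without ARC
    return 0;
}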

You can get a performance speed-up by using gcl_create_dispatch_queue() with dispatch_apply().
For example:
@import Foundation;
@import OpenCL; // for gcl_create_dispatch_queue()

int main() {
    dispatch_queue_t queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_ALL, NULL);
    dispatch_apply(10, queue, ^(size_t t) {
        // some code here
    });
    return 0;
}
More info:
OpenCL Programming Guide for Mac

Related

Is it guaranteed that only one callback is executed when the wait condition is true?

This is regarding the statement WAIT FOR ASYNCHRONOUS TASKS and the corresponding part of the documentation:
If the result of log_exp is false and there is an asynchronous function call with callback routine, the program waits until a callback routine of a previous function (called asynchronously) has been executed and then checks the logical expression again:
Let's say I spawn 4 tasks, each reducing availability attribute by one, reaching 0. In the callback, they increase the availability attribute by one.
Now when I reach WAIT FOR ASYNCHRONOUS TASKS UNTIL availability > 0 UP TO 6000 SECONDS, the program waits until the counter is increased by a callback.
Question: When the logical expression is checked again, is it guaranteed that the order is
callback->check->callback->check?
Or could it be that availability is e.g. already 3, since it did
callback->callback->callback->check?
It works as per the documentation:
WAIT -> CALLBACK -> CHECK, WAIT -> CALLBACK -> CHECK,
until the wait condition is true or no outstanding callbacks remain open.
It is important that the callback form/method has finished before the check is performed, as that routine is responsible for changing the variable(s) in the WAIT UNTIL condition.
An extract from the documentation:
If the result of log_exp is false and there is an asynchronous
function call with callback routine, the program waits until a
callback routine of a previous function (called asynchronously) has
been executed and then checks the logical expression again:
If you are concerned about two callbacks occurring concurrently:
the callbacks are handled by the kernel sequentially.
There is no guarantee of order, just that the callbacks are processed sequentially. Note that the waiting program is only executed in one work process at a time. From my tests, it is always the same work process.

Vulkan Queue submission synchronization - vkWaitForFences vs vkQueueWaitIdle [duplicate]

I have a function that copies data from one buffer to another, and I need to synchronize its execution.
Here is my current (bad) option:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
    VkCommandBuffer commandBuffer;
    vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
    // Start recording
    vkBeginCommandBuffer(commandBuffer, &beginInfo);
    vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
    vkEndCommandBuffer(commandBuffer);
    // Run command buffer
    vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
    // Wait for completion
    vkQueueWaitIdle(graphicsQueue);
    vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
}
This option is bad because if I want to execute the copyBuffer() function several times, all the buffers will be copied strictly one at a time.
I want to use a fence for each function call so that multiple calls can run in parallel.
So far, this is the only solution I have:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
    VkCommandBuffer commandBuffer;
    vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
    // Create fence
    VkFenceCreateInfo fenceInfo{};
    fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
    fenceInfo.flags = VK_FENCE_CREATE_SIGNALED_BIT;
    VkFence executionCompleteFence = VK_NULL_HANDLE;
    if (vkCreateFence(logicalDevice, &fenceInfo, VK_NULL_HANDLE, &executionCompleteFence) != VK_SUCCESS) {
        throw MakeErrorInfo("Failed to create fence");
    }
    // Start recording
    vkBeginCommandBuffer(commandBuffer, &beginInfo);
    vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
    vkEndCommandBuffer(commandBuffer);
    // Run command buffer
    vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
    vkWaitForFences(logicalDevice, 1, &executionCompleteFence, VK_TRUE, UINT64_MAX);
    vkResetFences(logicalDevice, 1, &executionCompleteFence);
    vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
    vkDestroyFence(logicalDevice, executionCompleteFence, VK_NULL_HANDLE);
}
Which of these options is better?
Is the second option written correctly?
Both functions are bad in the same way. They both block the CPU from doing anything until the transfer is done. And both could potentially be used to submit multiple CBs to the same queue in the same frame, but with different submit commands.
Neither is desirable if performance is something you care about.
Ultimately, what you need to do is have your copyBuffer function not actually perform the copy. You should have a function which builds a command buffer to do a copy. That CB is then stored in a place to be submitted later with other copying CBs. Or better yet, you can have just one copying CB that each command adds to (the first one called in a frame will create the CB).
At some point in the future, before you've submitted the work that will use this data, you need to submit the transfer operations. And the way this works depends on if you're submitting the transfer operations on the same queue as the work that will consume them or not.
If they're on the same queue, then all you need to do is have an event in a command buffer at the end of your batch that synchronizes the transfer operations with their receivers. If you want to be more clever, each transfer operation can have its own event, which the receiving operations will wait on.
And in same-queue transfers, you also want to make sure that you submit the transfers in the same vkQueueSubmit call as the rest of your work. Or to put it another way, you should never make more than one call to vkQueueSubmit for a particular queue in a particular frame.
If you're dealing with separate queues, then things change. A bit. If timeline semaphores aren't an option, you'll need to submit your transfer work before you submit the receiving operations. This is because the transfer batch will need to signal a semaphore that the receiving operation will wait on. And a binary semaphore cannot be waited on until the operation that signals it has been submitted to a queue.
But otherwise, everything else stays the same. Of course, you don't need events since you're synchronizing by semaphore.
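For the separate-queue case, here is a minimal sketch in C of the binary-semaphore handoff described above. transferCB, drawCB, transferQueue, and graphicsQueue are hypothetical handles you would already have; the wait stage depends on what consumes the data (VERTEX_INPUT is just an example), and error checking is omitted:

// The transfer batch signals transferDone; the consuming batch waits on it.
// The transfer must be submitted first, since a binary semaphore cannot be
// waited on before its signaling operation has been submitted to a queue.
VkSemaphoreCreateInfo semInfo = { VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO };
VkSemaphore transferDone;
vkCreateSemaphore(device, &semInfo, NULL, &transferDone);

VkSubmitInfo xfer = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
xfer.commandBufferCount   = 1;
xfer.pCommandBuffers      = &transferCB;
xfer.signalSemaphoreCount = 1;
xfer.pSignalSemaphores    = &transferDone;
vkQueueSubmit(transferQueue, 1, &xfer, VK_NULL_HANDLE); // submit first

VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
VkSubmitInfo draw = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
draw.commandBufferCount = 1;
draw.pCommandBuffers    = &drawCB;
draw.waitSemaphoreCount = 1;
draw.pWaitSemaphores    = &transferDone;
draw.pWaitDstStageMask  = &waitStage;
vkQueueSubmit(graphicsQueue, 1, &draw, VK_NULL_HANDLE);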
The two functions are semantically identical and exhibit exactly the same blocking behavior.
The second is slightly better. vkQueueWaitIdle is kind of a debug and out-of-hotspot feature, and it might incur a hidden second submit in order to signal the implicit fence.
You don't need to reset a fence that you subsequently destroy anyway. And you are creating it pre-signaled, which is a bug. Also, you forgot to pass it to vkQueueSubmit.
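For reference, a minimal sketch in plain C of the fence path with those issues fixed: the fence is created unsignaled, passed to vkQueueSubmit, and destroyed without a redundant reset. allocInfo, beginInfo, copyRegion, and submitInfo are assumed to be filled in as in the question, and this still blocks the CPU, so the advice above about batching submissions still applies.

VkFenceCreateInfo fenceInfo = {0};
fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
fenceInfo.flags = 0; // unsignaled, so vkWaitForFences actually waits

VkFence fence = VK_NULL_HANDLE;
if (vkCreateFence(logicalDevice, &fenceInfo, NULL, &fence) != VK_SUCCESS) {
    /* handle creation failure */
}

vkBeginCommandBuffer(commandBuffer, &beginInfo);
vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
vkEndCommandBuffer(commandBuffer);

// Pass the fence so the driver signals it when this submission finishes.
vkQueueSubmit(graphicsQueue, 1, &submitInfo, fence);
vkWaitForFences(logicalDevice, 1, &fence, VK_TRUE, UINT64_MAX);

// No vkResetFences needed: the fence is destroyed immediately.
vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
vkDestroyFence(logicalDevice, fence, NULL);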

Why is a while loop not needed in sem_wait?

I am trying to compare implementations of the producer-consumer problem using a condition variable and using semaphores.
Implementation using cond variable:
acquire(m);      // Acquire this monitor's lock.
while (!p) {     // While the condition/predicate/assertion that we are waiting for is not true...
    wait(m, cv); // Wait on this monitor's lock and condition variable.
}
// ... Critical section of code goes here ...
signal(cv2); -- OR -- notifyAll(cv2); // cv2 might be the same as cv or different.
release(m);
Implementation using semaphore:
produce:
    P(emptyCount)
    P(useQueue)
    putItemIntoQueue(item)
    V(useQueue)
    V(fullCount)
Why is the semaphore implementation not using a while loop to check the condition, like the condition variable implementation does?
while (!p) {     // While the condition/predicate/assertion that we are waiting for is not true...
    wait(m, cv); // Wait on this monitor's lock and condition variable.
}
Why do you need a while loop while waiting for a condition variable
Grabbing a semaphore does use a tight loop internally, just like the condition-variable version, but it yields execution back to the scheduler in each iteration so as not to waste resources the way a busy loop would.
When the scheduler has run some other process for a while, it hands execution back to your thread. If the semaphore is available now, it is grabbed; otherwise the thread yields back to the scheduler to let some other process run some more before retrying.
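To make that concrete, here is a minimal C sketch of the producer from the pseudocode above using POSIX semaphores; putItemIntoQueue and QUEUE_CAPACITY are placeholders for your own queue code. Note that there is no while loop around sem_wait: the count itself carries the condition, and sem_wait simply does not return until it has decremented it.

#include <semaphore.h>

sem_t emptyCount; // initialized with sem_init(&emptyCount, 0, QUEUE_CAPACITY)
sem_t fullCount;  // initialized with sem_init(&fullCount, 0, 0)
sem_t useQueue;   // binary semaphore: sem_init(&useQueue, 0, 1)

void putItemIntoQueue(int item); // placeholder for the real queue code

void produce(int item) {
    sem_wait(&emptyCount); // P(emptyCount): blocks while the queue is full
    sem_wait(&useQueue);   // P(useQueue): take the queue lock
    putItemIntoQueue(item);
    sem_post(&useQueue);   // V(useQueue)
    sem_post(&fullCount);  // V(fullCount): a consumer may now proceed
}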

Semaphore wait() and signal()

I am going through process synchronization and am facing difficulty in understanding semaphores. So here is my doubt:
The source says that
"Semaphore S is an integer variable that is accessed through standard atomic operations, i.e. wait() and signal()."
It also provided the basic definition of wait():
wait(Semaphore S)
{
    while (S <= 0)
        ; // no operation
    S--;
}
Definition of signal():
signal(S)
{
    S++;
}
Let the initial value of the semaphore be 1, and say there are two concurrent processes P0 and P1 which are not supposed to execute their critical sections simultaneously.
Now say P0 is in its critical section, so the semaphore S must have the value 0. Now say P1 wants to enter its critical section, so it executes wait(), and in wait() it loops continuously. To exit the loop, the semaphore value must be incremented, but that may not be possible, because according to the source wait() is an atomic operation that can't be interrupted, and thus process P0 can't call signal() on a single-processor system.
I want to know whether my understanding so far is correct, and if it is, how can process P0 call signal() while process P1 is stuck in the while loop?
I think the top-voted answer is inaccurate!
The operations wait() and signal() must be completely atomic; no two processes can execute wait() or signal() simultaneously, because they are implemented in the kernel, and processes in kernel mode cannot be preempted.
If several processes attempt a P(S) simultaneously, only one process will be allowed to proceed (a non-preemptive kernel is free of this race condition).
For the above implementation to work, however, preemption is necessary (a preemptive kernel).
Read about the atomicity of semaphore operations:
http://personal.kent.edu/~rmuhamma/OpSystems/Myos/semaphore.htm
https://en.wikibooks.org/wiki/Operating_System_Design/Processes/Semaphores
I think it's an inaccuracy in your source. Atomicity for the wait() operation means that each iteration of it is atomic: the test of S and the S-- are performed without interruption, but the whole operation can be interrupted between iterations of the while loop.
I don't think keeping an infinite while loop inside the wait() operation is wise. I would go for Stallings' version:
void semWait(semaphore s)
{
    s.count--;
    if (s.count < 0) {
        /* place this process in s.queue and block this process */
    }
}
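For completeness, the matching signal operation in the same style would look something like this sketch (the queue handling and the unblocking step are placeholders, as in semWait above):

void semSignal(semaphore s)
{
    s.count++;
    if (s.count <= 0) {
        /* remove a process from s.queue and place it on the ready list */
    }
}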
I think what the book means by an atomic operation is that testing S <= 0 and performing S-- happen together, just like the testAndSet() it mentioned before.
If the two separate operations, the test S <= 0 and the S--, are each atomic but can be interleaved by another process, this method won't work.
Imagine two processes P0 and P1: if P0 wants to enter the critical section and finds the test S <= 0 false (so it may proceed), and it is then interrupted by P1, which also finds the test false, then both processes will enter the critical section. And that's wrong.
The part that is actually not atomic is the spinning inside the while loop: even though the loop body is empty, other processes can still preempt the current one between tests of S <= 0, which lets the process inside the critical section continue its work and release the lock.
However, I think the code from the book cannot actually be used in an OS, since I don't know how to make the test S <= 0 and the S-- atomic together; a more practical way is to put the S-- inside the while loop, as SomeWittyUsername said.
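One way to make the test and the decrement a single atomic step in user code is a compare-and-swap loop. Here is a minimal C11 sketch (a spinning semaphore for illustration only; a real kernel would block the caller instead of spinning):

#include <stdatomic.h>

atomic_int S = 1;

void wait_sem(void) {
    for (;;) {
        int cur = atomic_load(&S);
        if (cur > 0 && atomic_compare_exchange_weak(&S, &cur, cur - 1))
            return; // the test and the S-- succeeded as one atomic step
        // otherwise another process got there first; retry
    }
}

void signal_sem(void) {
    atomic_fetch_add(&S, 1); // S++
}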
When a task attempts to acquire a semaphore that is unavailable, the semaphore places the task onto a wait queue and puts the task to sleep. The processor is then free to execute other code. When the semaphore becomes available, one of the tasks on the wait queue is awakened so that it can then acquire the semaphore.
while (S <= 0)
    ; // no operation
This doesn't mean that the processor is actually running this code. The process/task is blocked until it gets the semaphore.
I think that when process P1 is stuck in the while loop, it will be in the wait state. The processor will switch between the processes P0 and P1 (context switching), so P0 gets to run and calls signal(), and S is incremented by 1. P0 then exits its critical section, so process P1 can enter its critical section, and mutual exclusion is preserved.

How to implement prioritized lock with only compare_and_swap?

Given only compare-and-swap, I know how to implement a lock.
However, how do I implement a spin lock such that
1) multiple threads can block on it while trying to lock,
2) and the threads are then unblocked (and acquire the lock) in the order in which they blocked on it?
Is this even possible? If not, what other primitives do I need?
If so, how do I do it?
Thanks!
You are going to need a list for the waiting threads. You need to add and remove items from the list in a thread-safe manner. You will need to be able to put threads that fail to acquire the lock to sleep, and to wake one thread when the lock becomes available. On Linux you can accomplish the sleeping and waking by having the thread wait on a signal.
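One concrete way to get the FIFO property with only compare-and-swap is a ticket lock. Here is a minimal C11 sketch (the names are mine; this version spins with a yield instead of sleeping on a signal as described above):

#include <stdatomic.h>
#include <sched.h>

typedef struct {
    atomic_uint next;    // next ticket to hand out
    atomic_uint serving; // ticket currently allowed to hold the lock
} ticket_lock;

void ticket_lock_acquire(ticket_lock *l) {
    // Take a ticket with a CAS loop (an atomic fetch-add would also work).
    unsigned my = atomic_load(&l->next);
    while (!atomic_compare_exchange_weak(&l->next, &my, my + 1))
        ; // `my` is refreshed with the current value on failure; just retry

    // Threads are served strictly in ticket order: FIFO acquisition.
    while (atomic_load(&l->serving) != my)
        sched_yield(); // give up the timeslice instead of burning it
}

void ticket_lock_release(ticket_lock *l) {
    atomic_fetch_add(&l->serving, 1); // admit the next ticket holder
}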
Now, there is a lazy way to do this where you might not need to care about waking threads. Here is pseudocode from our skiplist; this is what we do to add an item:
cFails = 0
while (1) {
    NewState = OldState = State;
    if (cFails > 3 || OldState.Lock) {
        sleep(); // not too sophisticated, because they can't be awoken
        cFails = 0;
        continue;
    }
    Look for item in skiplist
    return item if we found it

    // to add the item to the list we need to lock it
    // the ABA lock uses a version number
    NewState.Lock = 1;
    NewState.nVer++;
    if (!CAS(&State, OldState, NewState)) {
        ++cFails;
        continue;
    }

    // if the thread gets preempted right here, the lock is left on, and other
    // threads spinning would waste their entire time slice.

    // unlock
    OldState = NewState;
    NewState.Lock = 0;
    NewState.nVer++;
    CAS(&State, OldState, NewState);
}
We expect the skiplist to usually find the item and only rarely have to add it. We rarely have a race to add, even with a lot of threads. We tested this with a worst-case scenario consisting of lots of threads adding and searching for millions of items in a single list. The result is that we rarely saw threads fail to get the lock, so the simple approach that is high-performance for the expected case works for us. There is one bad thing that can happen: a thread gets preempted while holding the lock. That's when cFails > 3 catches it and puts waiting threads to sleep so we don't waste their timeslices on a million useless spins. So cFails is set just high enough to detect that the owner of the lock is not active.