Can the numbers of pthread_cond_signal()s and pthread_cond_wait()s mismatch for condition variables?

There is one producer and n consumers.
The producer assigns n jobs to the n consumers and calls pthread_cond_wait() n times to wait for the assigned jobs to be completed by the consumers.
Each consumer, after consuming a job, calls pthread_cond_signal() to notify the producer.
My question is: "Will n calls to pthread_cond_signal() by the consumers make the producer come out of pthread_cond_wait() n times? Or is there any case where multiple signals are merged into a single signal, so that pthread_cond_wait() returns fewer than n times?"

If the producer isn't actually waiting inside a call to pthread_cond_wait() when a consumer thread calls pthread_cond_signal(), then that signal will get 'lost' (i.e., if the producer thread later rolls into the pthread_cond_wait(), it will block until another signal is sent).
That is why condition variables must be used in conjunction with some other "boolean predicate" that is checked while holding the mutex used with the condition variable. That predicate is the final word on whether the thread should wait or not. Another reason the predicate is the final word is that a thread blocked in pthread_cond_wait() can be spuriously awakened.
From the POSIX docs on pthread_cond_wait():
When using condition variables there is always a Boolean predicate
involving shared variables associated with each condition wait that is
true if the thread should proceed. Spurious wakeups from the
pthread_cond_timedwait() or pthread_cond_wait() functions may occur.
Since the return from pthread_cond_timedwait() or pthread_cond_wait()
does not imply anything about the value of this predicate, the
predicate should be re-evaluated upon such return.
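
For illustration, here is a minimal sketch of that pattern applied to the producer/consumer handshake in the question (the names jobs_done, done_cv, and lock are made up for this example). The producer waits on the counter, not on a count of signals, so coalesced or spurious wakeups cannot make it miscount:

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cv = PTHREAD_COND_INITIALIZER;
static int jobs_done = 0;          /* the shared state behind the predicate */

void consumer_finished_job(void) {
    pthread_mutex_lock(&lock);
    jobs_done++;                   /* update the predicate under the mutex */
    pthread_cond_signal(&done_cv);
    pthread_mutex_unlock(&lock);
}

void producer_wait_all(int n) {
    pthread_mutex_lock(&lock);
    while (jobs_done < n)          /* re-check: signals may coalesce or be spurious */
        pthread_cond_wait(&done_cv, &lock);
    pthread_mutex_unlock(&lock);
}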

Related

Is it guaranteed that only one callback is executed when the wait condition is true?

This is regarding the statement WAIT FOR ASYNCHRONOUS TASKS and the corresponding part of the documentation:
If the result of log_exp is false and there is an asynchronous function call with callback routine, the program waits until a callback routine of a previous function (called asynchronously) has been executed and then checks the logical expression again:
Let's say I spawn 4 tasks, each reducing an availability attribute by one, so that it reaches 0. In the callback, each task increases the availability attribute by one.
Now when I reach WAIT FOR ASYNCHRONOUS TASKS UNTIL availability > 0 UP TO 6000 SECONDS, the program waits until the counter is increased by a callback.
Question: When the logical expression is checked again, is it guaranteed that the order is
callback->check->callback->check?
Or could it be that availability is e.g. already 3, since it did
callback->callback->callback->check?
It works as per the documentation:
WAIT -> CALLBACK -> CHECK, WAIT -> CALLBACK -> CHECK,
until the wait condition is true or no more outstanding callbacks are open.
It is important that the callback form/method has finished before the check is performed, as that routine is responsible for changing the variable(s) in the WAIT UNTIL condition.
An extract from the documentation:
If the result of log_exp is false and there is an asynchronous
function call with callback routine, the program waits until a
callback routine of a previous function (called asynchronously) has
been executed and then checks the logical expression again:
If you are concerned about two callbacks occurring concurrently: the callbacks are handled by the kernel sequentially.
There is no guarantee of order, just that the callbacks are processed sequentially. Note that the waiting program is only executed in one work process at a time. From my tests, it is always the same work process.

In Redis, how can I guarantee fetching N items from a list in a multi-client environment?

Assume that there is a key K in Redis that is holding a list of values.
Many producer clients are adding elements to this list, one by one using LPUSH or RPUSH.
On the other hand, another set of consumer clients are popping elements from the list, though with a certain restriction: consumers will only attempt to pop N items, and only if the list contains at least N items. This ensures that a consumer holds N items in hand after finishing the popping process.
If the list contains fewer than N items, consumers shouldn't even attempt to pop elements from the list at all, because they wouldn't end up with N items.
If there is only one consumer client, it can simply run the LLEN command to check whether the list contains at least N items, and then remove N of them using LPOP/RPOP.
However, if there are many consumer clients, there can be a race condition: they can simultaneously pop items from the list after each reading LLEN >= N. So we might end up in a state where each consumer pops fewer than N elements, and no item is left in the list in Redis.
Using a separate locking system seems to be one way to tackle this issue, but I was curious if this type of operation can be done only using Redis commands, such as Multi/Exec/Watch etc.
I checked the MULTI/EXEC approach, and it seems that it does not support rollback. Also, all commands executed inside a MULTI/EXEC transaction return 'QUEUED', so I won't be able to know whether the N LPOPs I execute in the transaction will all return elements or not.
So all you need is an atomic way to check the list length and pop conditionally.
This is what Lua scripts are for; see the EVAL command.
Here is a Lua script to get you started (it pops exactly ARGV[1] elements, and only when at least that many are present):
local n = tonumber(ARGV[1])
local len = redis.call('LLEN', KEYS[1])
if len >= n then
    local res = {}
    for i = 1, n do
        res[i] = redis.call('LPOP', KEYS[1])
    end
    return res
else
    return false
end
Use it as:
EVAL "local n = tonumber(ARGV[1]) \n local len = redis.call('LLEN', KEYS[1]) \n if len >= n then \n local res = {} \n for i = 1, n do \n res[i] = redis.call('LPOP', KEYS[1]) \n end \n return res \n else \n return false \n end" 1 list 3
This will only pop ARGV[1] elements (the number after the key name) from the list if the list has at least that many elements.
Lua scripts are run atomically, so there is no race condition between competing clients.
As the OP pointed out in the comments, there is a risk of data loss, say because of a power failure between the LPOPs and the processing of the script's return value. You can use RPOPLPUSH instead of LPOP, storing the elements on a temporary list. Then you also need some tracking, deleting, and recovery logic. Note your client could also die, leaving some elements unprocessed.
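To sketch that variant (assuming the hiredis C client; the key names jobs and jobs:processing are made up for illustration), each popped element is moved to a backup list atomically, so a crash cannot lose items that were already popped:

#include <stdio.h>
#include <hiredis/hiredis.h>

/* Variant of the script above: each element is moved to a backup list
   with RPOPLPUSH instead of being removed outright by LPOP. */
static const char *script =
    "local n = tonumber(ARGV[1]) "
    "local len = redis.call('LLEN', KEYS[1]) "
    "if len >= n then "
    "  local res = {} "
    "  for i = 1, n do "
    "    res[i] = redis.call('RPOPLPUSH', KEYS[1], KEYS[2]) "
    "  end "
    "  return res "
    "end "
    "return false";

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) return 1;

    /* EVAL <script> 2 jobs jobs:processing 3 */
    redisReply *r = redisCommand(c, "EVAL %s 2 %s %s %s",
                                 script, "jobs", "jobs:processing", "3");
    if (r && r->type == REDIS_REPLY_ARRAY) {
        for (size_t i = 0; i < r->elements; i++)
            printf("claimed: %s\n", r->element[i]->str);
        /* ... process the items, then clean up jobs:processing ... */
    } else {
        printf("fewer than N items available, nothing popped\n");
    }
    freeReplyObject(r);
    redisFree(c);
    return 0;
}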
You may want to take a look at Redis Streams. This data structure is ideal for distributing load among many clients. When used with consumer groups, it has a pending entries list (PEL) that acts as that temporary list.
Clients then issue an XACK to remove elements from the PEL once processed, so you are also protected from client failures.
Redis Streams are very useful for the kind of problem you are trying to solve. You may want to do the free course about this.
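A rough sketch of that flow (hiredis again; the stream name jobs, group workers, and consumer name consumer-1 are made up): read a batch of 6 entries into this consumer's PEL, process them, then XACK each one.

#include <hiredis/hiredis.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) return 1;

    /* Create the group once; MKSTREAM creates the stream if it is absent. */
    redisReply *r = redisCommand(c, "XGROUP CREATE jobs workers $ MKSTREAM");
    freeReplyObject(r);

    /* Claim up to 6 new entries; they stay in the PEL until acknowledged. */
    r = redisCommand(c, "XREADGROUP GROUP workers consumer-1 COUNT 6 STREAMS jobs >");
    if (r && r->type == REDIS_REPLY_ARRAY && r->elements > 0) {
        redisReply *entries = r->element[0]->element[1];  /* [stream name, entries] */
        for (size_t i = 0; i < entries->elements; i++) {
            const char *id = entries->element[i]->element[0]->str;
            /* ... process this entry's field/value payload here ... */
            redisReply *ack = redisCommand(c, "XACK jobs workers %s", id);
            freeReplyObject(ack);
        }
    }
    freeReplyObject(r);
    redisFree(c);
    return 0;
}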
You could use a prefetcher.
Instead of each consumer greedily picking an item from the queue, which leads to the problem of 'water, water everywhere, but not a drop to drink', you could have a prefetcher that builds a packet of size = 6. When the prefetcher has a full packet, it places the packet in a separate packet queue (another Redis key holding a list of packets) and pops the items from the main queue in a single transaction. Essentially, what you wrote:
If there is only 1 Consumer client, the client can simply run LLEN
command to check if the list contains at least N items, and subtract N
using LPOP/RPOP.
If the prefetcher doesn't have a full packet, it does nothing and keeps waiting for the main queue size to reach 6.
On the consumer side, consumers just query the prefetched packets queue, pop the top packet, and go. It is always one pre-built packet (size = 6 items). If there are no packets available, they wait.
On the producer side, no changes are required; producers can keep inserting into the main queue.
BTW, there can be more than one prefetcher task running concurrently, and they can synchronize access to the main queue among themselves.
Implementing a scalable prefetcher
The prefetcher implementation can be described using a buffet-table analogy. Think of the main queue as a restaurant buffet table where guests pick up their food and leave. Etiquette demands that the guests follow a queue system and wait for their turn. Prefetchers follow something analogous. Here's the algorithm:
Algorithm Prefetch
Begin
  while true
    check = main queue has 6 items or more   // this is a queue read; no locks required
    if check == true
      obtain an exclusive lock on the main queue
      if lock successful
        begin a transaction
          create a packet and fill it with the top 6 items
          popped from the main queue
          add the packet to the prefetch queue
        if packet added to prefetch queue successfully
          commit the transaction
        else
          rollback the transaction
        end if
        release the lock
      else
        // someone else has the exclusive lock; we should just wait
        sleep for xx milliseconds
      end if
    end if
  end while
End
I am just showing an infinite polling loop here for simplicity, but this could be implemented using the pub/sub pattern through Redis notifications: the prefetcher waits for a notification that the main queue key has received an LPUSH and then executes the logic inside the while-loop body above.
There are other ways you could do this, but this should give you some ideas.

Understanding CUDA serialization and reconvergence point

EDIT: I realized that I, unfortunately, overlooked a semicolon at the end of the while statement in the first example code and misinterpreted it myself. So there is in fact an empty loop for threads with threadIdx.x != s, a convergence point after that loop, and a thread waiting at this point for all the others without incrementing the s variable. I am leaving the original (uncorrected) question below for anyone interested in it. Be aware that a semicolon is missing at the end of the second line in the first example, and thus s++ has nothing in common with the loop body.
--
We were studying serialization in our CUDA lesson, and our teacher told us that code like this:
__shared__ int s = 0;
while (s != threadIdx.x)
    s++; // serialized code
would end up in a HW deadlock because the nvcc compiler puts a reconvergence point between the while (s != threadIdx.x) condition and the s++ statement. If I understand it correctly, this means that once the reconvergence point is reached by a thread, that thread stops execution and waits for the other threads until they reach the point too. In this example, however, this never happens, because thread #0 enters the body of the while loop and reaches the reconvergence point without incrementing the s variable, while the other threads get stuck in an endless loop.
A working solution should be the following:
__shared__ int s = 0;
while (s < blockDim.x)
    if (threadIdx.x == s)
        s++; // serialized code
Here, all threads within a block enter the body of the loop, all evaluate the condition, and only thread #0 increments the s variable in the first iteration (and the loop goes on).
My question is, why does the second example work if the first one hangs? To be more specific, the if statement is just another point of divergence, and in terms of assembly language it should be compiled into the same conditional jump instruction as the condition in the loop. So why isn't there any reconvergence point before s++ in the second example; has it in fact moved to immediately after the statement?
In other sources I have only found that divergent code is computed independently for every branch - e.g. in an if/else statement, first the if branch is computed with all else-branch threads masked within the same warp, and then the remaining threads compute the else branch while the first ones wait. There is a reconvergence point after the if/else statement. Why then does the first example freeze, the loop not having been split into two branches (a true branch for one thread and a waiting false branch for all the others in a warp)?
Thank you.
It does not make sense to put the reconvergence point between the while (s != threadIdx.x) check and s++;. It disrupts the program flow, since the reconvergence point for a piece of code should be reachable by all threads at compile time. (The original answer included a flowchart of the first piece of code marking possible and impossible points of reconvergence; it is not reproduced here.)
Regarding this answer about recording the convergence point via the SSY instruction, I created the simple kernel below, resembling your first piece of code,
__global__ void kernel_1() {
    __shared__ int s;
    if (threadIdx.x == 0)
        s = 0;
    __syncthreads();
    while (s == threadIdx.x)
        s++; // serialized code
}
and compiled it for CC=3.5 with -O3. Below is the result of using the cuobjdump binary tool on the output to observe the CUDA assembly. [assembly listing not reproduced here]
I'm not an expert in reading CUDA assembly, but I can see the while-loop condition checks at lines 0038 and 00a0. At line 00a8, it branches to 0x80 if the while-loop condition is satisfied and executes the code block again. The reconvergence point is introduced at line 0058, designating line 0xb8 as the reconvergence point, which is after the loop condition check near the exit.
Overall, it is not clear what you're trying to achieve with this piece of code. Also, in the second piece of code, the reconvergence point should again be after the while-loop code block (I don't mean between the while and the if).
The reason why it "hangs" is neither a HW deadlock nor branching, at least not directly. You produce an endless loop for one or more threads (as already suspected).
In your example, there isn't really a convergence point. Since you do not use any synchronization, there aren't any threads that actually wait. What happens here with the while loop is pretty much a busy-wait.
A kernel only finishes when all threads have returned. Since you have one (or more) endless loops (by accident maybe even none - this is unlikely, however), the kernel will never finish.
You declared a shared variable s. This variable is known to all threads within a block.
With your while statement you basically say (to each thread): increment s until it reaches the value of your (local) thread ID. Since all threads are incrementing s in parallel, you introduce race conditions.
Example:
Thread 5 is looping and checking for s to become 5.
s is 4.
Two threads increment s; it becomes 6.
At the same time, thread 5 has only reached the end of its loop iteration.
Now it starts the next iteration, checks s, and it's not 5.
Thread 5 will never be able to finish, since you check via == and the value of s has already exceeded the thread's ID.
Also, your solution is quite confusing, because each thread executes the serialized code consecutively (which probably was the intention after all, even though that is strange):
Thread 0 will execute the serialized code
After that, thread 1 will execute the serialized code
and so on
Most examples show a program where each thread works on some code, then all threads are synchronized, and only a single thread executes some more code (maybe it needed the results of all threads); a sketch of that pattern follows below.
So, your second example "works" because no thread is stuck in an endless loop; however, I can't think of a reason why anyone would use such code, since it is confusing and, well, not parallel at all.
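
For contrast, here is a minimal sketch of that usual pattern (the block-wide sum is just an illustrative stand-in, not code from the question): every thread does its share of the work, __syncthreads() makes the whole block converge, and then one thread runs the serial tail.

__global__ void block_sum(int *data) {
    __shared__ int partial[256];               /* assumes blockDim.x <= 256 */
    partial[threadIdx.x] = data[threadIdx.x];  /* every thread does its share */
    __syncthreads();                           /* all threads converge here */
    if (threadIdx.x == 0) {                    /* one thread runs the serial tail */
        int sum = 0;
        for (int i = 0; i < blockDim.x; i++)
            sum += partial[i];
        data[0] = sum;
    }
}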

Semaphore wait() and signal()

I am going through process synchronization and am facing difficulty in understanding semaphores. Here is my doubt:
The source says:
"Semaphore S is an integer variable that is accessed through standard atomic operations, i.e. wait() and signal()."
It also provides a basic definition of wait():
wait(Semaphore S)
{
    while (S <= 0)
        ;   // no operation
    S--;
}
and a definition of signal():
signal(S)
{
    S++;
}
Let the initial value of the semaphore be 1, and say there are two concurrent processes P0 and P1 which are not supposed to perform the operations of their critical sections simultaneously.
Now say P0 is in its critical section, so the semaphore S must have the value 0. Now say P1 wants to enter its critical section, so it executes wait(), and in wait() it loops continuously. To exit the loop, the semaphore value must be incremented, but that may not be possible, because according to the source wait() is an atomic operation that can't be interrupted, and thus process P0 can't call signal() on a single-processor system.
I want to know whether my understanding so far is correct, and if it is, how can process P0 call signal() while process P1 is stuck in the while loop?
I think the top-voted answer is inaccurate!
The wait() and signal() operations must be completely atomic; no two processes can execute wait() or signal() simultaneously, because they are implemented in the kernel, and processes in kernel mode cannot be preempted.
If several processes attempt a P(S) simultaneously, only one process will be allowed to proceed (a non-preemptive kernel, which is free of race conditions).
For the above implementation to work, preemption is necessary (a preemptive kernel).
Read about the atomicity of semaphore operations:
http://personal.kent.edu/~rmuhamma/OpSystems/Myos/semaphore.htm
https://en.wikibooks.org/wiki/Operating_System_Design/Processes/Semaphores
I think it's an inaccuracy in your source. Atomic for the wait() operation means each iteration of it is atomic, meaning S-- is performed without interruption, but the whole operation can be interrupted between iterations of the while loop.
I don't think keeping an infinite while loop inside the wait() operation is wise. I would go with Stallings' example:
void semWait(semaphore s)
{
    s.count--;
    if (s.count < 0) {
        /* place this process in s.queue and block this process */
    }
}
I think what the book means by an atomic operation is that testing S<=0 and performing S-- together are atomic, just like the testAndSet() it mentions before.
If the two operations S<=0 and S-- were each atomic separately but could be interrupted by another process in between, this method wouldn't work.
Imagine two processes P0 and P1: if P0 wants to enter the critical section and tests S<=0 to be false, and it is then interrupted by P1, which also tests S<=0 to be false, then both processes will enter the critical section. And that's wrong.
The part that is actually not atomic is inside the while loop: even though the loop body is empty, another process can still interrupt the current one between tests of S<=0, which lets other processes continue their work in the critical section and release the lock.
However, I think the code from the book cannot actually be used in an OS, since I don't know how to make the test S<=0 and the decrement S-- atomic together. A more practical way is to put the S-- inside the while loop, as SomeWittyUsername said.
When a task attempts to acquire a semaphore that is unavailable, the semaphore places the task onto a wait queue and puts the task to sleep. The processor is then free to execute other code. When the semaphore becomes available, one of the tasks on the wait queue is awakened so that it can then acquire the semaphore.
while (S <= 0)
    ;   // no operation
This doesn't mean that the processor keeps running this code. The process/task is blocked until it gets the semaphore.
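To make that concrete, here is a minimal sketch (my own illustration, not the book's code) of a blocking counting semaphore built on a pthread mutex and condition variable; the test and the decrement are atomic with respect to each other because both happen while holding the mutex:

#include <pthread.h>

typedef struct {
    int count;                 /* the semaphore value S */
    pthread_mutex_t lock;
    pthread_cond_t  nonzero;
} csem;

void csem_wait(csem *s) {
    pthread_mutex_lock(&s->lock);
    while (s->count <= 0)                 /* re-test: wakeups may be spurious */
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->count--;                           /* still holding the lock, so the
                                             test and decrement act as one */
    pthread_mutex_unlock(&s->lock);
}

void csem_signal(csem *s) {
    pthread_mutex_lock(&s->lock);
    s->count++;
    pthread_cond_signal(&s->nonzero);     /* wake one waiter, if any */
    pthread_mutex_unlock(&s->lock);
}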
I think that when process P1 is stuck in the while loop, it will be in the wait state. The processor will switch between the processes P0 and P1 (context switching), so P0 gets to run and calls signal(); then S is incremented by 1 and P0 exits its critical section, so process P1 can enter its critical section while mutual exclusion is preserved.

One producer, two consumers, and usage of pthread_cond_signal & pthread_mutex_lock

I am fairly new to pthread programming and am trying to get my head around cond_signal & mutex_lock. I am writing a sample program which has one producer thread and two consumer threads.
There is a queue between the producer and the first consumer, and a different queue between the producer and the second consumer. My producer is basically a communication interface which reads packets from the network and, based on a configured filter, delivers the packets to either of the consumers.
I am trying to use pthread_cond_signal & pthread_mutex_lock in the following way between the producer and a consumer.
[At producer]
0) Wait for packets to arrive
1) Lock the mutex pthread_mutex_lock(&cons1Mux)
2) Add the packet to the tail of the consumer queue
3) Signal the Consumer 1 process pthread_cond_signal(&msgForCons1)
4) Unlock the mutex pthread_mutex_unlock(&cons1Mux)
5) Go to step 0
[At consumer]
1) Lock the mutex pthread_mutex_lock(&cons1Mux)
2) Wait for signal pthread_cond_wait(&msgForCons1,&cons1Mux)
3) After waking up, read the packet
4) Delete from queue.
5) Unlock the mutex pthread_mutex_unlock(&cons1Mux)
6) Go to step 1
Are the above steps correct? If a switch from the consumer thread to the producer thread happens exactly after step 5, is it possible that the producer signals that a packet is waiting even though the consumer hasn't yet started listening for that signal? Will that cause a "missed signal"?
Are there any other problems with these steps?
Yes, you're correct, you could have a problem there: if there are no threads waiting, the pthread_cond_signal is a no-op. It isn't queued up somewhere to trigger a subsequent wait.
What you're supposed to do is, in the consumer, once you've acquired the mutex, test whether there is any work to do. If there is, well, you hold the mutex: take ownership, update state, and do it. You only need to wait if there is nothing to do.
The canonical example is:
decrement_count()
{
    pthread_mutex_lock(&count_lock);
    while (count == 0)
        pthread_cond_wait(&count_nonzero, &count_lock);
    count = count - 1;
    pthread_mutex_unlock(&count_lock);
}

increment_count()
{
    pthread_mutex_lock(&count_lock);
    if (count == 0)
        pthread_cond_signal(&count_nonzero);
    count = count + 1;
    pthread_mutex_unlock(&count_lock);
}
Note how the "consumer" decrementing thread doesn't wait around if there's something to decrement. The pattern applies equally well to the case where count is replaced by a queue size or the validity of a struct containing a message.
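
Applied to the queue in the question, a consumer loop might look like the sketch below (not the poster's code; queue_t, packet_t, queue_empty(), queue_pop(), and process_packet() are hypothetical stand-ins for the packet queue described above):

#include <pthread.h>

pthread_mutex_t cons1Mux = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  msgForCons1 = PTHREAD_COND_INITIALIZER;
/* queue_t, packet_t, and the helpers below are hypothetical. */

void *consumer1(void *arg) {
    queue_t *cons1Queue = arg;
    for (;;) {
        pthread_mutex_lock(&cons1Mux);
        while (queue_empty(cons1Queue))        /* wait only if there is no work */
            pthread_cond_wait(&msgForCons1, &cons1Mux);
        packet_t *pkt = queue_pop(cons1Queue); /* mutex held: safe to modify */
        pthread_mutex_unlock(&cons1Mux);
        process_packet(pkt);                   /* heavy work done unlocked */
    }
    return NULL;
}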