A few years ago I had an Operating Systems seminar. I was tasked with creating an algorithm for process synchronization using as few semaphores as possible. The required ordering looked like this:
P1 -> P2 -> P3 -> P4 -> P5
P(n) - process n
Only one process may run at a time, and strict ordering is required.
Last year I came up with a solution using 3 semaphores (effectively creating a barrier).
Here is my algorithm:
P S1 S1 S1 S1
4W1 W0 W0 W0 W0
4S0 P S2 S2 S2
3W2 W1 W1 W1
3S1 P S1 S1
2W1 W0 W0
2S0 P S2
W2 W1
S1 P
(execution is from top to bottom, each lane is a single process)
P - the real work, which needs to be done serially
W(n) - wait(n)
S(n) - signal(n)
4W1 means "do 4 wait1 operations"
wait1 and signal1 operate on semaphore 1, and so on...
Explanation of the algorithm:
Every process lane starts.
The first process runs its work while all the others do signal1().
Every process except the first then waits on semaphore 0 (doing wait0).
After process 1 has completed its 4 wait1 operations, it sends 4 signal0s, creating a barrier: the other processes wait for the first one to complete successfully before proceeding.
The problem is I can't figure out how to make it work using only 2 semaphores.
PS: this is not an assignment, it's a problem that's been stuck in my head for too long.
It can't be done with 2 semaphores; 3 is the minimum.
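For the record, the 3-semaphore bound can also be reached with a different, standard construction: a reusable barrier built from one mutex and two turnstiles (the scheme from Downey's The Little Book of Semaphores), plus an ordinary shared counter. The C sketch below is my own illustration, not the asker's algorithm; POSIX threads stand in for processes, and names like proc and barrier are mine. It runs P1..P5 strictly in order, each exactly once:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 5

/* 3 semaphores in total: a mutex plus two turnstiles. */
static sem_t mutex, turnstile1, turnstile2;
static int count = 0;               /* plain shared counter, not a semaphore */

static void barrier(void)
{
    sem_wait(&mutex);               /* phase 1: count arrivals */
    if (++count == N)
        for (int i = 0; i < N; i++) sem_post(&turnstile1);
    sem_post(&mutex);
    sem_wait(&turnstile1);          /* everyone blocks until all have arrived */

    sem_wait(&mutex);               /* phase 2: count departures */
    if (--count == 0)
        for (int i = 0; i < N; i++) sem_post(&turnstile2);
    sem_post(&mutex);
    sem_wait(&turnstile2);          /* keeps anyone from racing into the next round */
}

static void *proc(void *arg)
{
    int id = (int)(long)arg;
    for (int turn = 0; turn < N; turn++) {
        if (turn == id)
            printf("P%d runs\n", id + 1);  /* the serialized work P */
        barrier();                  /* nobody passes turn k before P(k+1) finished */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[N];
    sem_init(&mutex, 0, 1);
    sem_init(&turnstile1, 0, 0);
    sem_init(&turnstile2, 0, 0);
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, proc, (void *)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return 0;
}

The two turnstiles are what make the barrier reusable across the N rounds; collapsing them into one semaphore reopens a race where a fast process laps a slow one, which is consistent with 3 being the minimum here.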
On an 8085 processor, an efficient algorithm for dividing a BCD value by 2 comes in handy when converting BCD to binary representation. You might think of recursive subtraction or multiplication by 0.5; however, these algorithms require lengthy arithmetic.
Therefore, I would like to share with you the following code (in 8085 assembler) that does it more efficiently. The code has been thoroughly tested on the GNUSim8085 and ASM80 emulators. If this code is helpful to you, please share your experience with me.
Before running the code, put the BCD in register A. Set the carry flag if there is a remainder to be received from a more significant byte (worth 50). After execution, register A will contain the result. The carry flag is used to pass the remainder, if any, to the next less significant byte.
The algorithm uses the DAA instruction after manipulating the C and AC flags in a very special way, thus taking into account that any remainder passed down to the next nibble (i.e. half-octet) is worth 5 instead of 8.
;Division of BCD by 2 on an 8085 processor
;Set initial values.
;Register A contains a two-digit BCD. Carry flag contains remainder.
stc
cmc ;STC followed by CMC clears the carry: no remainder from a more significant byte
mvi a, 85H ;sample input: BCD 85 (expected result: 42, carry set as remainder 1)
;Do modified decimal adjust before division.
cmc
cma
rar
adc a
cma
daa
cmc
;Divide by 2.
rar
;Save quotient and remainder to registers B and C.
mov b, a
mvi a, 00H
rar
mov c, a
;Continue working on decimal adjust.
mov a, b
sui 33H
mov b, a
mov a, c
ral
mov a, b
hlt
Suppose a two-digit BCD number is represented as D7D6D5D4 D3D2D1D0.
For division by 2 in binary (or hex), simply right-shift the number by one place. If a bit is shifted out, the remainder is 1, and 0 otherwise. The same applies to two-digit (8-bit) BCD numbers when D4 is 0, i.e., when no bit shifts over from the higher-order four bits. Now if D4 is 1 (before the shift), the shift introduces an 8 (1000) into the lower-order four bits, which apparently jeopardizes this process. Observe that in BCD, the shifted-in bit should contribute 10/2 = 5, not 16/2 = 8. Thus we can simply adjust by subtracting 8 - 5 = 3 from the lower-order four bits, i.e., 03H from the entire number. The following code implements this strategy. We assume the accumulator holds the data; after the division, the result is kept in the accumulator and the remainder in register B.
MVI B,00H ; remainder = 0
STC
CMC ; clear the carry flag
RAR ; right shift the data
JNC SKIP
INR B ; CY=1 so, remainder = 1
SKIP: MOV D,A ; backup A
ANI 08H ; test D3 after the shift (i.e. D4 before the shift)
MOV A,D ; restore A from the backup (MOV does not affect flags)
JZ FIN ; if D4 before the shift was 0, no adjustment is needed
SUI 03H ; adjustment for the shift
FIN: HLT ; A has the result, B has the remainder
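To convince yourself that the subtract-3 rule is right for every two-digit value, here is a small C check (my own illustration, not part of the original answer; the function name bcd_div2 is mine). It halves each packed BCD byte with a plain shift plus the conditional 03H correction and compares the result against ordinary decimal division:

#include <stdio.h>

/* Halve a packed two-digit BCD byte using the shift-and-adjust strategy. */
static unsigned bcd_div2(unsigned bcd, unsigned *rem)
{
    *rem = bcd & 1;                /* the bit shifted out is the remainder */
    unsigned half = bcd >> 1;      /* plain binary right shift */
    if (bcd & 0x10)                /* D4 was 1: the low nibble gained 8, */
        half -= 0x03;              /* but in BCD it should gain 5, so subtract 3 */
    return half;
}

int main(void)
{
    for (unsigned d = 0; d < 100; d++) {
        unsigned bcd = (d / 10) << 4 | (d % 10);   /* encode d as packed BCD */
        unsigned rem;
        unsigned half = bcd_div2(bcd, &rem);
        unsigned got = (half >> 4) * 10 + (half & 0x0F);
        if (got != d / 2 || rem != d % 2)
            printf("mismatch at %u\n", d);         /* never triggers */
    }
    return 0;
}

Running it prints nothing: the correction gives the exact quotient and remainder for all one hundred two-digit values.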
BFS(G, s)
 1  for each vertex u ∈ G.V - {s}
 2      u.color = WHITE
 3      u.d = ∞
 4      u.π = NIL
 5  s.color = GRAY
 6  s.d = 0
 7  s.π = NIL
 8  Q = Ø
 9  ENQUEUE(Q, s)
10  while Q ≠ Ø
11      u = DEQUEUE(Q)
12      for each v ∈ G.Adj[u]
13          if v.color == WHITE
14              v.color = GRAY
15              v.d = u.d + 1
16              v.π = u
17              ENQUEUE(Q, v)
18      u.color = BLACK
The above breadth-first search pseudocode assumes the graph is represented using adjacency lists.
Notation:
G : graph
s : source vertex
u.color : stores the color of each vertex u ∈ V
u.π : stores the predecessor of u
u.d : stores the distance from the source s to vertex u, as computed by the algorithm
My understanding of the code (correct me if I'm wrong):
1. As far as I can tell, the ENQUEUE(Q, s) and DEQUEUE(Q) operations take O(1) time.
2. Since each vertex is enqueued exactly once, enqueuing takes O(V) time in total.
3. Since the sum of the lengths of all adjacency lists is |E|, the total time spent scanning adjacency lists is O(E).
4. So why is the running time of BFS O(V + E)?
Please do not refer me to some website; I've gone through many articles and I'm still finding it difficult to understand.
Can anyone answer by writing down the time complexity of each of the 18 lines?
Lines 1-4: O(V) in total
Lines 5-9: O(1) (constant)
Line 11: O(V) for all executions of line 11 across the loop (each vertex can be dequeued at most once)
Lines 12-13: O(E) in total, as you check every possible edge once; O(2E) if the edges are bidirectional
Lines 14-17: O(V) in total, since of the E edge checks, at most V vertices will be white
Line 18: O(V) in total
Summing the complexities gives you
O(4V + E + 1) which simplifies to O(V+E)
It is not O(VE), because at each iteration of the loop starting at line 10, lines 12-13 only loop over the edges incident to the current node, not all the edges in the entire graph. Thus, looking at it from the point of view of the edges, each edge is looped over at most twice in a bidirectional graph, once from each of the nodes it connects.
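If it helps, here is the pseudocode transcribed into runnable C (my own sketch; the adjacency-list representation and the small graph in main are made up for illustration). The comments mark where each O(V) and O(E) contribution comes from:

#include <stdio.h>
#include <stdlib.h>

#define V 6

typedef struct Edge { int to; struct Edge *next; } Edge;

static Edge *adj[V];
static int d[V], pi[V], color[V];   /* 0 = WHITE, 1 = GRAY, 2 = BLACK */

static void add_edge(int u, int v)  /* undirected: insert both arcs */
{
    Edge *e = malloc(sizeof *e); e->to = v; e->next = adj[u]; adj[u] = e;
    Edge *f = malloc(sizeof *f); f->to = u; f->next = adj[v]; adj[v] = f;
}

static void bfs(int s)
{
    int queue[V], head = 0, tail = 0;   /* each vertex enqueued at most once */
    for (int u = 0; u < V; u++) {       /* initialization: O(V) */
        color[u] = 0;
        d[u] = -1;                      /* -1 stands in for ∞ and NIL */
        pi[u] = -1;
    }
    color[s] = 1; d[s] = 0;
    queue[tail++] = s;                  /* ENQUEUE: O(1) */
    while (head < tail) {
        int u = queue[head++];          /* DEQUEUE: O(1); V dequeues total */
        for (Edge *e = adj[u]; e; e = e->next) {  /* all lists together: O(E) */
            if (color[e->to] == 0) {
                color[e->to] = 1;
                d[e->to] = d[u] + 1;
                pi[e->to] = u;
                queue[tail++] = e->to;
            }
        }
        color[u] = 2;
    }
}

int main(void)
{
    add_edge(0, 1); add_edge(0, 2); add_edge(1, 3); add_edge(2, 3); add_edge(3, 4);
    bfs(0);
    for (int u = 0; u < V; u++)
        printf("d[%d] = %d\n", u, d[u]);   /* vertex 5 stays at -1: unreachable */
    return 0;
}

The two cost pools are visibly separate: the initialization and the dequeues account for O(V), the adjacency scans for O(E), and neither multiplies the other, giving O(V + E).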
It's known that the Intel Core 2 Duo has 3 SSE units. These 3 units allow 3 SSE instructions to run in parallel (1), for example:
rA0 = mulps(rB0, rC0); \
rA1 = mulps(rB1, rC1);  > All 3 take 1 cycle to be scheduled (* - see Remarks).
rA2 = mulps(rB2, rC2); /
It's also known that each SSE unit consists of 2 modules: one for addition (subtraction) and one for multiplication (division). This allows mulps-addps instruction sequences to run in parallel (2), for example:
rA0 = mulps(rB0, rC0); \
                         > Both take 1 cycle to be scheduled on 1 SSE unit.
rA1 = addps(rB1, rC1); /
The question is the following: how many cycles does each of the following 2 code snippets take to be scheduled?
Code listing A:
rA0 = mulps(rB0, rC0); \
rA1 = mulps(rB1, rC1); |
rA2 = mulps(rB2, rC2); \ Do all 6 execute in one step? (See paragraph (2))
rA3 = addps(rB3, rC3); /
rA4 = addps(rB4, rC4); |
rA5 = addps(rB5, rC5); /
Code listing B:
rA0 = mulps(rB0, rC0); \
rA1 = addps(rB1, rC1); |
rA2 = mulps(rB2, rC2); \ Do all 6 execute in one step? (See paragraph (1))
rA3 = addps(rB3, rC3); /
rA4 = mulps(rB4, rC4); |
rA5 = addps(rB5, rC5); /
Which instruction ordering should I prefer, A or B?
More specifically:
Is it possible to distribute 3 mulps instructions to the 3 SSE multiplication modules (1), and at the same time (2) distribute 3 addps instructions to their respective SSE addition modules, resulting in a total of 6 instructions per scheduling cycle?
If I run N mulps first and then N addps, which N is optimal?
Remarks
By 'scheduled' I mean throughput rate.
See Agner Fog's instruction tables for which instructions can run on which execution units. And/or use Intel's code analyzer (IACA) to find throughput bottlenecks (dependency chains or port contention).
As the commenters say, not all of the execution ports can handle FP multiplies. They can all handle vector-integer logical operations (AND/OR/XOR), but only one or two ports have a vector shuffle unit, a vector shift unit, and so on.
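To experiment yourself, here is a sketch along these lines using SSE intrinsics in C (the function and variable names are mine): two multiplies and two adds with no dependencies between them, interleaved so that the multiply pipe and the add pipe can both be kept busy. Whether they actually co-issue on your CPU is exactly what the instruction tables or IACA will tell you; time a long loop over such a body to measure throughput rather than latency.

#include <xmmintrin.h>

/* Four independent SSE operations: the two _mm_mul_ps can go to the multiply
   unit while the two _mm_add_ps go to the add unit. With no dependencies
   between them, an out-of-order core is free to schedule them in parallel. */
__m128 interleaved(__m128 b0, __m128 c0, __m128 b1, __m128 c1,
                   __m128 b2, __m128 c2, __m128 b3, __m128 c3)
{
    __m128 m0 = _mm_mul_ps(b0, c0);
    __m128 a0 = _mm_add_ps(b1, c1);
    __m128 m1 = _mm_mul_ps(b2, c2);
    __m128 a1 = _mm_add_ps(b3, c3);
    /* combine the results so the compiler cannot discard the work */
    return _mm_add_ps(_mm_mul_ps(m0, m1), _mm_add_ps(a0, a1));
}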
I have found the time complexity of Prim's algorithm given everywhere as O((V + E) log V) = O(E log V). But looking at the algorithm below, the for loop is nested inside the while loop, so it seems like the time complexity should be O(V(log V + E log V)). For the complexity to be O((V + E) log V), the two loops would have to be sequential rather than nested, which does not match the pseudocode. What am I missing?
MST-PRIM(G, w, r)
 1  for each u ∈ G.V
 2      u.key ← ∞
 3      u.π ← NIL
 4  r.key ← 0
 5  Q ← G.V
 6  while Q ≠ Ø
 7      u ← EXTRACT-MIN(Q)
 8      for each v ∈ G.Adjacent[u]
 9          if v ∈ Q and w(u, v) < v.key
10              v.π ← u
11              v.key ← w(u, v)
Using a Binary Heap
The time complexity required for one call to EXTRACT-MIN(Q) is O(log V), using a min-priority queue. The while loop at line 6 executes V times in total, so EXTRACT-MIN(Q) is called V times. The total cost of all EXTRACT-MIN(Q) calls is O(V log V).
The for loop at line 8 executes 2E times in total, since the sum of the lengths of all adjacency lists is 2E for an undirected graph. The time required to execute line 11 is O(log V), using the DECREASE-KEY operation on the min-heap. Line 11 also executes at most 2E times in total, so the total time required for line 11 is O(2E log V) = O(E log V).
The for loop at line 1 executes V times, so performing lines 1 to 5 requires O(V) time.
The total time complexity of MST-PRIM is the sum of the three parts above: O(V log V + E log V + V) = O(E log V), since |E| ≥ |V| - 1 for a connected graph.
Using a Fibonacci Heap
EXTRACT-MIN(Q): same as above, O(V log V) in total.
Executing line 11 requires O(1) amortized time per DECREASE-KEY on a Fibonacci heap. Line 11 executes at most 2E times in total, so the total time is O(E).
Lines 1 to 5: same as above, O(V).
So the total time complexity of MST-PRIM is the sum of executing steps 1 through 3: O(V log V + E + V) = O(E + V log V).
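As a concrete reference, here is a compact C sketch of Prim's algorithm with a binary heap (my own illustration; the graph in main is made up). Instead of CLRS's DECREASE-KEY it uses the "lazy" variant that pushes duplicate heap entries and skips stale ones on extraction; since each edge pushes at most two entries, the heap holds O(E) items and the bound is still O(E log V):

#include <stdio.h>
#include <stdlib.h>

#define V 5
#define MAXHEAP 64                  /* must be at least 2E */

typedef struct Edge { int to, w; struct Edge *next; } Edge;
typedef struct { int key, u; } Item;

static Edge *adj[V];
static Item heap[MAXHEAP];
static int hn = 0;

static void add_edge(int u, int v, int w)   /* undirected: insert both arcs */
{
    Edge *e = malloc(sizeof *e); *e = (Edge){v, w, adj[u]}; adj[u] = e;
    Edge *f = malloc(sizeof *f); *f = (Edge){u, w, adj[v]}; adj[v] = f;
}

static void push(Item it)                   /* O(log n) sift-up */
{
    int i = hn++;
    while (i > 0 && heap[(i - 1) / 2].key > it.key) {
        heap[i] = heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap[i] = it;
}

static Item pop(void)                       /* EXTRACT-MIN: O(log n) sift-down */
{
    Item top = heap[0], last = heap[--hn];
    int i = 0;
    for (;;) {
        int c = 2 * i + 1;
        if (c >= hn) break;
        if (c + 1 < hn && heap[c + 1].key < heap[c].key) c++;
        if (last.key <= heap[c].key) break;
        heap[i] = heap[c];
        i = c;
    }
    heap[i] = last;
    return top;
}

int main(void)
{
    int in_mst[V] = {0}, total = 0;
    add_edge(0, 1, 4); add_edge(0, 2, 1); add_edge(1, 2, 2);
    add_edge(1, 3, 5); add_edge(2, 3, 8); add_edge(3, 4, 3);
    push((Item){0, 0});                     /* r.key = 0 */
    while (hn > 0) {
        Item it = pop();
        if (in_mst[it.u]) continue;         /* stale entry: vertex already left Q */
        in_mst[it.u] = 1;
        total += it.key;
        for (Edge *e = adj[it.u]; e; e = e->next)  /* scanning Adj[u]: O(E) overall */
            if (!in_mst[e->to])
                push((Item){e->w, e->to});  /* stands in for DECREASE-KEY */
    }
    printf("MST weight = %d\n", total);     /* prints 11 for this graph */
    return 0;
}

The point to notice is that push is called once per directed arc, not once per (vertex, edge) pair, which is exactly why the edge work totals O(E log V) rather than O(V·E log V).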
Your idea seems correct. Let's take the complexity as
V(lg V + E lg V)
Then notice that in the inner for loop we are actually going through the current vertex's neighbors, of which there are at most V, not through all E edges, so let's modify it a little to
V(lg V + V lg V)
which means
V lg V + V·V lg V
But for worst-case analysis (dense graphs), V·V is roughly equal to the number of edges, E:
V lg V + E lg V
(V + E) lg V
and since V << E in a dense graph, this leaves
E lg V
The time complexity of Prim's algorithm is O(V log V + E log V). It seems like you understand how the V log V came about, so let's skip over that. So where does the E log V come from? Let's start by looking at Prim's algorithm's pseudocode:
  | MST-PRIM(Graph, weights, r)
 1|   for each u ∈ Graph.V
 2|       u.key ← ∞
 3|       u.π ← NIL
 4|   r.key ← 0
 5|   Q ← Graph.V
 6|   while Q ≠ Ø
 7|       u ← EXTRACT-MIN(Q)
 8|       for each v ∈ Graph.Adj[u]
 9|           if v ∈ Q and weights(u, v) < v.key
10|               v.π ← u
11|               v.key ← weights(u, v)
Lines 8-11 are executed for every element extracted from Q, and we know that there are V elements in Q (representing the set of all vertices). Line 8's loop iterates through all the neighbors of the currently extracted vertex; we then do the same for the next extracted vertex, and for the one after that. Like Dijkstra's algorithm, Prim's never processes a vertex twice, so it eventually visits every connected vertex and explores all of its neighbors. In other words, this loop ends up going through all the edges of the graph twice at some point (2E).
Why twice? Because at some point we come back to a previously explored edge from the other direction, and we can't rule it out until we've actually checked it. Fortunately, that constant 2 is dropped during time complexity analysis, so the loop is really just doing O(E) work.
Why isn't it V*V? You might reach that term if you consider that we have to check each vertex and its neighbors, and that in the worst-case graph the number of neighbors approaches V. Indeed, in a dense graph V*V ≈ E. But the more accurate description of the work done by these two loops is "going through all the edges twice", so we refer to E instead. It's up to the reader to connect how sparse their graph is to this term's time complexity.
Let's look at a small example graph with 4 vertices:
1--2
|\ |
| \|
3--4
Assume that Q will give us the nodes in the order 1, 2, 3, and then 4.
In the first iteration of the outer loop, the inner loop will run 3 times (for 2, 3, and 4).
In the second iteration of the outer loop, the inner loop runs 2 times (for 1 and 4).
In the third iteration of the outer loop, the inner loop runs 2 times (for 1 and 4).
In the last iteration of the outer loop, the inner loop runs 3 times (for 1, 2, and 3).
The total number of iterations was 10, which is twice the number of edges (2 × 5).
Extracting the minimum and tracking the updated minimum edges (usually done with a Fibonacci heap, giving log(V) time complexity) occurs inside the loop iterations; these operations happen inside the inner loop often enough that they are governed by the time complexity of both loops. Therefore, the complete time complexity for this phase of the algorithm is O(2·E·log(V)). Dropping the constant yields O(E·log(V)).
Given that the total time complexity of the algorithm is O(VlogV + ElogV), we can simplify to O((V+E)logV). In a dense graph E > V, so we can conclude O(ElogV).
Actually, as you say, since the for loop is nested inside the while loop, V·E·lg V is what a naive nesting-based analysis gives. But CLRS uses aggregate (amortized) counting across all iterations, which is why it comes out to O(E log V).
This is an exercise that was suggested for my upcoming exam; below is what I have gathered thus far. All constructive input will be appreciated.
P1, P2 and P3 share three semaphores (x, y and z) each with 1 as initial value and three variables (a, b and c).
P1:
(1.1) wait(x);
(1.2) a = a + 1;
(1.3) wait(y);
(1.4) b = b - a;
(1.5) wait(z);
(1.6) c = a + 2*b -c;
(1.7) signal(z);
(1.8) signal(y);
(1.9) signal(x)
Code for P2:
(2.1) wait(y);
(2.2) b = b*2;
(2.3) wait(z);
(2.4) c = c - b;
(2.5) signal(y);
(2.6) wait(x);
(2.7) a = a + c;
(2.8) signal(x);
(2.9) signal(z)
Code for P3:
(3.1) wait(y);
(3.2) b = b*2;
(3.3) wait(z);
(3.4) c = c - b;
(3.5) signal(z);
(3.6) signal(y);
(3.7) wait(x);
(3.8) a = a / 10;
(3.9) signal(x)
A. If P1 and P2 run concurrently on a computer with only a single CPU, is it possible for these two processes to get into a deadlock? If so, show one execution sequence of the code that results in the deadlock, and show how to revise P2 only (P1 is not changed) to prevent deadlock.
B. If P1 and P3 are run concurrently on a computer with only a single CPU, is it possible for these two processes to get into a deadlock? If so, show one execution sequence of the code that results in the deadlock, and show how to revise P3 only (P1 is not changed) to prevent deadlock.
The changes you make should not violate the mutual exclusion requirement on shared variable access.
A) I'm not sure what is meant by an example of when they would get into a deadlock. To me, it appears that y will cause a deadlock, because line 1.3 will cause y to become -1, and y would not be released until 2.5 of P2.
To resolve this, 1.3 should be moved below 1.5, because that is when y is released in P2. It looks like there will be other conflicts with x, though, and I don't know a good way to rearrange P1 to resolve this without changing P2.
B) Here it appears 1.3 (wait(y)) causes a problem again, since it is not signaled until 3.6. Would the resolution then be to move it to after 1.6?
I'm trying to use Wikipedia's pages on Deadlock and Semaphore programming to do this exercise.
Well, in the first case, as an example:
Sequence Holds lock on
(1.1) wait(x); P1 x, P2 -
(1.2) a = a + 1; P1 x, P2 -
(2.1) wait(y); P1 x, P2 y
(2.2) b = b*2; P1 x, P2 y
(2.3) wait(z); P1 x, P2 yz
(2.4) c = c - b; P1 x, P2 yz
(2.5) signal(y); P1 x, P2 z
(2.6) wait(x); P1 x, P2 z - P2 blocks waiting for x
(1.3) wait(y); P1 xy, P2 z
(1.4) b = b - a; P1 xy, P2 z
(1.5) wait(z); P1 xy, P2 z - P1 blocks waiting for z: deadlock
It could, for example, be fixed by having P2 take the locks in the same order as P1 (x, y, z instead of y, z, x). It may mean that you have to lock x before you really need it for mutual exclusion, but it prevents deadlock.
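For instance, one possible revision of P2 along those lines (my suggestion, keeping every shared-variable access protected by the same semaphore as before, but requesting semaphores only in the x, y, z order) would be:

(2.1) wait(x);
(2.2) wait(y);
(2.3) b = b*2;
(2.4) wait(z);
(2.5) c = c - b;
(2.6) signal(y);
(2.7) a = a + c;
(2.8) signal(x);
(2.9) signal(z)

Here x is acquired earlier than mutual exclusion alone requires, but since both processes now request x, y and z in the same global order, no circular wait can arise.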
As far as I can see, P1 and P3 can't deadlock with each other, since P3 takes its locks in the sequence y, z (which is consistent with the x, y, z order, just skipping the lock on x), then releases both before it locks/unlocks x.