I'm reading the Redis in Action e-book chapter about semaphores. Here is the Python code to implement a semaphore using Redis:
import time
import uuid

def acquire_semaphore(conn, semname, limit, timeout=10):
    identifier = str(uuid.uuid4())
    now = time.time()
    pipeline = conn.pipeline(True)
    # Drop holders whose timestamps are older than the timeout.
    pipeline.zremrangebyscore(semname, '-inf', now - timeout)
    # Try to acquire the semaphore. (Newer redis-py versions take a
    # mapping instead: pipeline.zadd(semname, {identifier: now}).)
    pipeline.zadd(semname, identifier, now)
    pipeline.zrank(semname, identifier)
    if pipeline.execute()[-1] < limit:
        return identifier
    # We failed to acquire the semaphore; remove our identifier.
    conn.zrem(semname, identifier)
    return None
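For context, here is a minimal usage sketch; the key name, limit, and the do-work placeholder are my own illustration, and releasing amounts to removing your identifier from the sorted set:

import redis

conn = redis.Redis()
ident = acquire_semaphore(conn, 'semaphore:remote', limit=5)
if ident:
    try:
        ...  # do the work guarded by the semaphore
    finally:
        conn.zrem('semaphore:remote', ident)  # release: drop our identifier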
This basic semaphore works well—it’s simple, and it’s very fast. But
relying on every process having access to the same system time in
order to get the semaphore can cause problems if we have multiple
hosts. This isn’t a huge problem for our specific use case, but if we
had two systems A and B, where A ran even 10 milliseconds faster than
B, then if A got the last semaphore, and B tried to get a semaphore
within 10 milliseconds, B would actually “steal” A’s semaphore without
A knowing it.
I didn't catch what this means: if A ran even 10 milliseconds faster than B, then B would actually "steal" A's semaphore without A knowing it.
My thoughts: A's time is 10:10:10.200 and B's time is 10:10:10.190, and A got the semaphore. Then B tries to get the semaphore within 10 ms (now B's local time is 10:10:10.200). B will delete expired items and add itself. How can B steal A's semaphore? On the other hand, if A's time is 10:59 and B's time is 11:02, then B could remove A's semaphore because of the time difference, but that's not the case described in the book.
If B runs 10 ms slower than A, then B's score is smaller than A's, since we use the local time as the score in the sorted set.
So B's rank, i.e. pipeline.zrank(semname, identifier), is smaller than A's rank. Since A held the last slot, B's insertion gives B a rank below limit and pushes A's rank up to limit, so the check if pipeline.execute()[-1] < limit: passes for B. B thinks it got the semaphore, i.e. return identifier, while A's identifier no longer counts as a holder. In effect, B steals the semaphore from A.
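To make the steal concrete, here is a small self-contained sketch (my own illustration, simulating the sorted set with a plain Python list; limit is 1, so A holds the only slot):

import bisect

zset = []  # simulated ZSET: (score, member) pairs kept sorted by score

def zadd(member, score):
    bisect.insort(zset, (score, member))

def zrank(member):
    return [m for _, m in zset].index(member)

limit = 1

zadd('A', 100.010)   # A acquires at its local time 100.010
print(zrank('A'))    # 0 -> 0 < limit, so A believes it holds the semaphore

zadd('B', 100.000)   # B's clock runs 10 ms behind, so B's score is lower
print(zrank('B'))    # 0 -> B also passes the rank check and "acquires"
print(zrank('A'))    # 1 -> A's rank is now >= limit: A's slot was stolen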
Looking at the screenshot from this video, around 27:20
https://www.youtube.com/watch?v=JEpsBg0AO6o
When server S5 sends out the prepare RPC, it uses the roundId (proposal number) 4 and server number 5 (hence 4.5) as well as value "Y".
But how does it know 4 is the roundId to use? Earlier, S1 used up roundId 3, but there's no way for S5 to know about that, as there hasn't been any communication between S5 and anybody else at the time S5 chose 4 as its roundId.
In theory, there is no need to know the latest number, as every proposer will keep increasing its round number until it gets it right.
In your example S5 knows nothing, so it will start with the smallest number and then keep going up.
In a practical application, when a proposer proposes a number and the proposal is declined by an acceptor, the decline message will contain the largest round number seen so far by that acceptor; this helps the proposer retry with a larger round number (instead of increasing its current number one by one).
-- Edit: Max posted a link with an answer claiming (as of now) that N has to be unique per proposer.
Let me explain why there is no global uniqueness requirement by an example.
Let's say we have a system with two proposers and three acceptors (and a few learners):
Both proposers sent PREPARE(1) - same number - to all acceptors.
Based on Paxos rules, only one of the proposers will get the majority of PROMISE messages; this follows from the rule that an acceptor promises only if the PREPARE has a number strictly greater than any other it has previously seen.
Now we are in a state where one proposer has two (or three) PROMISES for N=1 and the other proposer has one (or zero) PROMISE with N = 1.
Only the first proposer may issue ACCEPT(1, V), as it got the majority. The other proposer does not have a majority of PROMISEs and has to retry with a larger N.
When the other proposer retries, it will use an N larger than any it has seen before - hence it will try with N=2.
From here on it all works the same way: a proposer PREPAREs, and if it gets a majority of PROMISEs for its N, it issues ACCEPT(N, VALUE_ACCORDING_TO_PROTOCOL).
The key insight of Paxos is that there is no way for two ACCEPT(N, V) messages to be sent where the same N carries different V, hence there is no issue with two proposers using the same N.
As for initializing every node with some unique ID - that's fine; whether it improves performance is a big question, and I haven't seen a formal proof of that yet.
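To illustrate the retry rule, here is a minimal single-decree Paxos sketch (my own toy code, not from the video; in-memory calls stand in for RPCs, and a NACK carries the acceptor's highest promised round):

class Acceptor:
    def __init__(self):
        self.promised = 0      # highest round promised so far
        self.accepted = None   # (round, value) of the highest accepted proposal

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ('promise', self.accepted)
        return ('nack', self.promised)  # tell the proposer the round to beat

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return ('accepted', None)
        return ('nack', self.promised)

def propose(acceptors, value, n=1):
    while True:  # keep retrying with a larger round until accepted
        replies = [a.prepare(n) for a in acceptors]
        promises = [r[1] for r in replies if r[0] == 'promise']
        if 2 * len(promises) > len(acceptors):
            # Adopt the value of the highest-numbered accepted proposal, if any.
            prior = [acc for acc in promises if acc is not None]
            if prior:
                value = max(prior)[1]
            oks = [a.accept(n, value) for a in acceptors]
            if 2 * sum(r[0] == 'accepted' for r in oks) > len(acceptors):
                return n, value
        # Retry above the largest round any acceptor reported.
        n = 1 + max([r[1] for r in replies if r[0] == 'nack'] + [n])

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 'Y'))   # -> (1, 'Y')
print(propose(acceptors, 'X'))   # also starts at 1, gets NACKed, retries with 2, still decides 'Y'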
The documentation on Introduction to Reservations: Idle Slots states that idle slots from reservations can be used by other reservations if required:
By default, queries running in a reservation automatically use idle slots from other reservations. That means a job can always run as long as there's capacity. Idle capacity is immediately preemptible back to the original assigned reservation as needed, regardless of the priority of the query that needs the resources. This happens automatically in real time.
However, I'm wondering if this can have a negative effect on other reservations in a scenario where idle slots are used but are required by the "owning" reservation shortly afterwards.
To be concrete, I would like to understand whether I can regard assigned slots as a guarantee OR as best effort.
Example:
Reserved slots: 100
Reservation A: 50 Slots
Reservation B: 50 Slots
"A" starts a query at 14:00:00 and the computation takes 300 seconds if 100 slots are used.
All slots are idle at the start of the query, thus all 100 slots are made available to A.
5 seconds later at 14:00:05 "B" starts a query that takes 30 seconds if 50 slots are used.
Note:
For the sake of simplicity let's assume that both queries have exactly 1 stage and each computation unit ("job") in the stage takes the full time of the query. I.e. the stage is divided into 100 jobs, and if a slot starts the computation it takes the full 300 seconds to finish successfully.
I'm fairly certain that with "multiple stages" or "shorter computation times" (e.g. if the computation can be broken down into 1000 jobs) GBQ would be smart enough to dynamically re-assign a freed-up slot to the reservation it belongs to.
Questions:
does "B" now have to wait until a slot in "A" finishes?
this would mean ~5 min waiting time
I'm not sure how "realistic" the 5 min are, but I feel this is an important variable since I wouldn't worry about a couple of seconds - but I would worry about a couple of minutes!
or might an already started computation of "A" also be killed mid-flight?
the documentation Introduction to Reservations: Slot Scheduling seems to suggest something like this:
The goal of the scheduler is to find a medium between being too aggressive with evicting running tasks (which results in wasting slot time) and being too lenient (which results in jobs with long running tasks getting a disproportionate share of the slot time).
Answer via Reddit
A stage may run for quite some time (minutes, even hours in really bad cases), but a stage is run by many workers, and most workers complete their work within a very short time, e.g. milliseconds or seconds. Hence rebalancing, i.e. reallocating slots from one job to another, is very fast.
So if a rebalancing happens and a job loses a large part of its slots, it will run a lot slower, and the one that gains slots will run fast. And this change is quick.
So in the above example, as job B starts 5 seconds in, within a second or so it would have acquired most of its slots.
So bottom line:
a query is broken up into "a lot" of units of work
each unit of work finishes pretty fast
this gives GBQ the opportunity to re-assign slots
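A toy back-of-the-envelope model of this point (my own sketch, assuming all of A's units start together and B can only pick up slots as units finish; real BigQuery scheduling is more involved):

def wait_for_slots(unit_seconds, arrival=5):
    """Seconds job B waits for its first batch of slots after arriving."""
    # A's units all started at t=0 and each holds a slot for unit_seconds,
    # so the next wave of slots frees up at the next multiple of unit_seconds.
    next_finish = ((arrival // unit_seconds) + 1) * unit_seconds
    return next_finish - arrival

print(wait_for_slots(300))  # 295 -> the worst case in the question: ~5 minutes
print(wait_for_slots(1))    # 1   -> short units: B gets slots within a second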
I have a quick, simple question.
Assume that if the server receives 10 messages from a user within 10 minutes, the server sends a push email.
At first I thought it would be very simple using Redis:
incr("foo"), expire("foo",60*10)
and in Java, handle the occurrence count like below:
if (Integer.parseInt(jedis.get("foo")) >= 10) { sendEmail(); jedis.del("foo"); }
But imagine the user sends one message in the first minute and 8 messages in the 10th minute,
and the key expires, and the user sends 3 more messages in the next minute.
The Redis key will be created again with value 3, which will not trigger sendEmail(), even though the user actually sent 11 messages within 2 minutes.
We're going to use Redis, and we don't want to put receive-time values into Redis.
Is there any solution?
So, there are two ways of solving this: one optimized for space and the other optimized for speed (though really the speed difference should be marginal).
Optimizing for Space:
Keep up to 9 different counters: foo1 ... foo9. Basically, we keep one counter for each of the up to 9 messages that may arrive before we email the user, and let each one expire as it hits the 10-minute mark. This works like a circular queue. Now do this (in Python for simplicity, assuming we have a connection to Redis called r):
new_created = False
for i in range(1, 10):
    var_name = 'foo%d' % i
    # Create the first missing counter (one new counter per message).
    if not (new_created or r.exists(var_name)):
        r.set(var_name, 0)
        r.expire(var_name, 600)
        new_created = True
    if not r.exists(var_name):
        continue
    # Every live counter sees this message.
    r.incr(var_name, 1)
    if int(r.get(var_name)) >= 10:
        send_email(user)
        r.delete(var_name)
If you go with this approach, put the above logic in a Lua script instead of the example Python, and it should be quite fast. Since you'll be storing at most 9 counters per user, it'll also be quite space efficient.
Optimizing for speed:
Keep one Redis Sorted Set per user. Every time a user sends a message, add an entry to his sorted set with a score equal to the timestamp and an arbitrary unique member. Then just do a ZCOUNT(key, now - 10 minutes, now) and send an email if that's at least 10. Then ZREMRANGEBYSCORE(key, -inf, now - 10 minutes) to prune the old entries. I know you said you didn't want to keep timestamps in Redis, but IMO this is a better solution, and you're going to have to hold some variant of timestamps somewhere no matter what.
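A minimal redis-py sketch of this second approach (my own illustration; the key format and the uniqueness trick for members are assumptions):

import time

def record_message(r, user_id, limit=10, window=600):
    """Record one message; return True when an email should be sent."""
    key = 'msgs:%s' % user_id          # hypothetical key naming scheme
    now = time.time()
    # Members must be unique or messages with equal timestamps collapse,
    # so tack on a per-user sequence number.
    member = '%f:%d' % (now, r.incr(key + ':seq'))
    r.zadd(key, {member: now})
    r.zremrangebyscore(key, '-inf', now - window)   # prune old entries
    if r.zcount(key, now - window, now) >= limit:
        r.delete(key)
        return True
    return False

As with the counter approach, wrapping this in a Lua script would make the count-check-delete step atomic.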
Personally I'd go with the latter approach because the space differences are probably not that big, and the code can be done quickly in pure Redis, but up to you.
In timestamp-based concurrency control, why do you have to reject a write by transaction T_i on element x_k if some transaction T_j with j > i has already read it?
As stated in the document.
If T_j is not planning to do any update at all, why is it necessary to be so restrictive with T_i's actions?
Assume that T_i occurs first and T_j goes second, so T_i is older than T_j (it has the smaller timestamp). Assume T_i also writes to x. If T_j has already read the last committed version of x, then allowing T_i's write would mean T_j used a stale value: in timestamp order T_j comes after T_i, so it should have seen T_i's write.
You need to abort the writing transaction during a read, a write, or at commit time due to the potential for a stale value being used. If the writing transaction didn't abort, and someone else read and used the old value, the database would not be serializable: you would get a different result if you ran the transactions in a different order. This is what the quoted text means by timestamp order.
Any two reads of the same value at the same time are dangerous, as they give an inaccurate view of the database: they can reveal a non-serializable order. If three transactions are running and all use x, then the serializable order is undefined. You need to enforce one read of x at a time, and this forces the transactions to proceed single file and see the last transaction's x. So T_i, then T_j, then T_k in order, each finishing before the next one starts.
Think about what could happen even if T_j were not to write: it would use a stale value that technically doesn't exist in the database, and it would have ignored the outcome of T_i if T_i wrote.
If three transactions all read x and don't write x, then it is safe to run them at the same time. You would need to know in advance that all three transactions don't write to x.
As the whitepaper Serializable Snapshot Isolation attests, the dangerous structure is two read-write dependencies. But a read-write of x followed by a read of x is also dangerous, due to the value being stale if both transactions run at the same time; the schedule needs to be serializable, so you abort the second read of x, as there is a younger transaction using x.
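For reference, the basic timestamp-ordering rules the question is about can be sketched like this (my own illustration, not from the quoted document):

class Abort(Exception):
    pass

class Element:
    def __init__(self, value=None):
        self.read_ts = 0    # largest timestamp of any transaction that read x
        self.write_ts = 0   # largest timestamp of any transaction that wrote x
        self.value = value

def read(tx_ts, x):
    if tx_ts < x.write_ts:
        raise Abort('x was already overwritten by a younger transaction')
    x.read_ts = max(x.read_ts, tx_ts)
    return x.value

def write(tx_ts, x, value):
    # Reject W_i(x) if a younger transaction already read (or wrote) x:
    # in timestamp order that reader should have seen this write.
    if tx_ts < x.read_ts or tx_ts < x.write_ts:
        raise Abort('restart with a new, larger timestamp')
    x.write_ts = tx_ts
    x.value = value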
I wrote a multiversion concurrency control implementation in a simulation. See the simulation runner. My simulation runs 100 threads all trying to read and write two numbers, A and B, each wanting to increment the number by 1. We set A to 1 and B to 2 at the beginning of the simulation.
The desired outcome is that A and B are set to 101 and 102 at the end of the simulation. This can only happen if there is locking or serialization due to multiversion concurrency control. Without concurrency control or locking, these numbers would be less than 101 and 102 due to data races.
When a thread reads A or B, we iterate over the versions of key A or B to see if there is a version that is <= transaction.getTimestamp() and for which committed.get(key) == that version. If successful, we record this transaction as the last reader of that value: rts.put("A", transaction).
At commit time, we check that rts.get("A").getTimestamp() != committingTransaction.getTimestamp(). If this check is true, we abort the transaction and try again.
We also check if someone committed since the transaction began - we don't want to overwrite their commit.
We also check, for each write, whether the other writing transaction is younger than us; if so, we abort. The if statement is in a method called shouldRestart, and this is called on reads, at commit time, and on all transactions that touched a value.
public boolean shouldRestart(Transaction transaction, Transaction peek) {
    boolean defeated = (((peek.getTimestamp() < transaction.getTimestamp() ||
            (transaction.getNumberOfAttempts() < peek.getNumberOfAttempts())) && peek.getPrecommit()) ||
            peek.getPrecommit() && (peek.getTimestamp() > transaction.getTimestamp() ||
            (peek.getNumberOfAttempts() > transaction.getNumberOfAttempts() && peek.getPrecommit())
            && !peek.getRestart()));
    return defeated;
}
See the code here. The OR'ed && peek.getPrecommit() clause means that a younger transaction can abort if a later transaction gets ahead and the later transaction hasn't been restarted (aborted). Precommit occurs at the beginning of a commit.
During a read of a key, we check the RTS to see if it is lower than our transaction's timestamp. If so, we abort the transaction and restart: someone is ahead of us in the queue and they need to commit.
On average, the system reaches 101 and 102 after fewer than around 300 transaction aborts, with many runs finishing well below 200 attempts.
EDIT: I changed the formula for deciding which transaction wins. If the other transaction is younger, or the other transaction has a higher number of attempts, the current transaction aborts. This reduces the number of attempts.
EDIT: the reason there were high abort counts was that a committing thread would be starved by reading threads that would abort and restart because of the committing thread. I added a Thread.yield when a read fails due to a transaction being ahead; this reduces restart counts to under 200.
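Re-expressed as a compact Python sketch (names are mine; the author's Java is linked above), the read rule and commit check described here look roughly like:

def read(versions, committed, rts, key, tx):
    """Return the newest committed version visible to tx, recording tx as
    the last reader (the rts.put("A", transaction) step above)."""
    for ver in sorted(versions[key], reverse=True):
        if ver <= tx.timestamp and committed.get(key) == ver:
            rts[key] = tx
            return ver
    return None

def can_commit(rts, key, tx):
    """Commit-time check: abort unless we are still the last reader of key."""
    last_reader = rts.get(key)
    return last_reader is None or last_reader.timestamp == tx.timestamp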
My query is regarding engineering the priority value of a process. In my system, process A is running with the RR (round-robin, SCHED_RR) policy at priority 83. Now I have another process B in RR, and I want B's priority to be higher than A's (i.e. I want B to always be scheduled in preference to A).
To do this, what value should I choose for B? I have read in the code that there is a penalty/bonus of 5 depending on a process's history.
Also, if I choose value 84 or 85, is there any chance that in some situations my process is ignored?
Please help me engineer this value.
Now I get it. Real-time tasks (FF/RR, i.e. SCHED_FIFO/SCHED_RR) are not governed by the penalty/bonus rules. With the O(1) scheduler, the task with the higher priority will be chosen, so in my case process B will be scheduled whenever its priority is greater than process A's.
The penalty/bonus applies only to SCHED_OTHER/SCHED_NORMAL tasks.
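For completeness, a hedged example of setting B's priority from Python on Linux (needs root or CAP_SYS_NICE; the PID is a placeholder). Since real-time priorities are compared directly, any value in 84..99 keeps B ahead of A's 83:

import os

pid_of_b = 12345  # placeholder: process B's actual PID

# Put B under SCHED_RR at priority 84, one above A's 83.
os.sched_setscheduler(pid_of_b, os.SCHED_RR, os.sched_param(84))

print(os.sched_getscheduler(pid_of_b) == os.SCHED_RR)  # True
print(os.sched_getparam(pid_of_b).sched_priority)      # 84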