Distributed workers that ensure a single instance of a task is running - locking

I need to design a distributed system in which a scheduler sends tasks to workers on multiple nodes. Each task is assigned an id and may be executed more than once, scheduled by the scheduler (usually once per hour).
My only requirement is that a task with a specific id should not be executed twice at the same time by the cluster. I can think of a design where the scheduler holds a lock for each task id and sends the task to an appropriate worker. Once the worker has finished, the lock is released and the scheduler may schedule the task again.
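Roughly the flow I have in mind (pseudo-code sketch; TaskLockStore, Worker and pickWorker() are just placeholders, not an existing API):
// Hypothetical scheduler-side sketch: one lock per task id, released when the worker reports completion.
public void schedule(String taskId) {
    if (!taskLockStore.tryAcquire(taskId)) {
        return; // a previous run of this task id is still executing somewhere
    }
    Worker worker = pickWorker();
    worker.submit(taskId, /* onFinished = */ () -> taskLockStore.release(taskId));
}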
What should my design include to ensure this? I'm concerned about cases where a task is sent to a worker, which starts the task but then fails to inform the scheduler about it.
What would be the best practice in this scenario to ensure that only a single instance of a job is executed at a time?

You could use a solution that implements a consensus protocol. Say, for example, that all the nodes in your cluster communicate using the Raft protocol. Whenever a node X wants to start working on a task Y, it would attempt to commit a message "X starts working on Y" to the replicated log. Once such messages are committed to the log, all the nodes see all the messages in the log in the same order.
When node X finishes or aborts the task, it would attempt to commit "X no longer works on Y" so that another node can start/continue working on it.
It could happen that two nodes (X and Z) may try to commit their start messages concurrently, and the log would then look something like this:
...
N-1: ...
N+0: "X starts working on Y"
...
N+k: "Z starts working on Y"
...
But since there is no "X no longer works on Y" message between the N+0 and N+k entries, every node (including Z) knows that Z must not start working on Y.
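A sketch of the decision each node can make from its copy of the committed log (pseudo-Java; LogEntry and the message format are assumptions, not part of any Raft library):
// Returns true if, according to the committed entries seen so far, nobody currently owns the task,
// i.e. the last "starts working" entry for it was followed by a matching "no longer works" entry.
boolean mayStart(List<LogEntry> committedLog, String taskId) {
    String currentOwner = null;
    for (LogEntry entry : committedLog) {
        if (entry.isStart() && entry.taskId().equals(taskId)) {
            currentOwner = entry.nodeId();   // "X starts working on Y"
        } else if (entry.isStop() && entry.taskId().equals(taskId)) {
            currentOwner = null;             // "X no longer works on Y"
        }
    }
    return currentOwner == null;
}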
The only remaining problem is if node X gets partitioned from the cluster before it can commit its "X no longer works on Y" message, for which I believe there is no perfect solution.
A work-around could be that X periodically commits a message "X still works on Y at time T", and if no such message has been committed to the log for some threshold duration, the cluster assumes that no one is working on that task anymore.
With this work-around, however, you'd be allowing the possibility that two or more nodes work on the same task (the partitioned node X and some new node that picks up the task after the timeout).

After a thorough search, I came to the conclusion that this problem can be solved through a method called fencing.
In essence, when you suspect that a node (worker) has failed, the only way to ensure that it will not corrupt the rest of the system is to provide a fence that stops the node from accessing the shared resource you need to protect. That must be a radical method, like resetting the machine that runs the failed process or setting up a firewall rule that prevents the process from accessing the shared resource. Once the fence is in place, you can safely break the lock that was being held by the failed process and start a new process.
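As a rough illustration of that sequence (all the helpers below are hypothetical; the fencing action itself could be a power reset or a firewall rule):
// Hypothetical recovery path; fence(), breakLock() and reschedule() are made-up helpers.
void recover(Node suspectedNode, String taskId) {
    // 1. Fence first: make sure the suspected node can no longer touch the shared resource
    //    (e.g. power-cycle the machine or install a firewall rule that blocks it).
    fence(suspectedNode);
    // 2. Only after the fence is in place is it safe to break the lock it was holding.
    breakLock(taskId);
    // 3. Now another worker can safely pick the task up again.
    reschedule(taskId);
}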

Another possibility is to use a relational database to store the task metadata, combined with a proper isolation level (you can't go wrong with SERIALIZABLE if performance is not your #1 priority).
SERIALIZABLE
This isolation level specifies that all transactions occur in a completely isolated fashion; i.e., as if all transactions in the system had executed serially, one after the other. The DBMS may execute two or more transactions at the same time only if the illusion of serial execution can be maintained.
Either optimistic or pessimistic locking should work too. https://learning-notes.mistermicheels.com/data/sql/optimistic-pessimistic-locking-sql/
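For instance, a minimal claim-the-task sketch over a hypothetical tasks table (JDBC; the schema and status values are made up):
// Hypothetical schema: tasks(id bigint primary key, status varchar)
boolean claimTask(DataSource dataSource, long taskId) throws SQLException {
    try (Connection conn = dataSource.getConnection();
         PreparedStatement claim = conn.prepareStatement(
             "UPDATE tasks SET status = 'RUNNING' WHERE id = ? AND status = 'IDLE'")) {
        claim.setLong(1, taskId);
        return claim.executeUpdate() == 1; // the guarded UPDATE succeeds for at most one concurrent run
    }
}
If claimTask() returns true, this scheduler run owns the task; the worker (or a cleanup job) sets the status back to 'IDLE' when it finishes so the task can be claimed again.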
In case you need a rerun of the task, simply update the metadata (or, better, create a new task with different metadata so you keep track of its execution history).

Related

Implementing a mutual exclusion system / distributed queue in Postgres

I want to implement a mutual exclusion system in PostgreSQL where multiple worker processes will temporarily lock resources (rows) from a table (queue) while they work on them. If the worker processes crash, I want the lock to be cleanly released and not have to rely on another process to clean up the leaked locks.
What I have come up with so far is to use a SELECT ... FOR UPDATE SKIP LOCKED query within a transaction, which locks the row it finds and skips any other locked row.
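Roughly like this (JDBC sketch; the tasks table, its columns and doWork() are just an example):
// Example schema: tasks(id bigint primary key, payload text, done boolean default false)
void claimAndProcessOne(DataSource dataSource) throws SQLException {
    try (Connection conn = dataSource.getConnection()) {
        conn.setAutoCommit(false); // the row lock lives only as long as this transaction
        try (PreparedStatement ps = conn.prepareStatement(
                 "SELECT id, payload FROM tasks WHERE NOT done " +
                 "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED");
             ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                long id = rs.getLong("id");
                doWork(rs.getString("payload")); // placeholder for the actual task
                try (PreparedStatement done = conn.prepareStatement(
                         "UPDATE tasks SET done = true WHERE id = ?")) {
                    done.setLong(1, id);
                    done.executeUpdate();
                }
            }
        }
        conn.commit(); // releases the lock; a crash before this point also releases it and undoes the update
    }
}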
It works well but one of the issues is that the worker might take a while to do its task and I need to keep the transaction open for the entire duration of its task.
Another problem is that the workers work incrementally and persist their state to the database so that if they're stopped or crash, they can resume quickly where they were. The row being locked makes it impossible to persist their state in the same table (though I think I can get around that by using another table to persist the state).
I've searched on the Web on how to implement a semaphore or a resource borrowing system in SQL/PostgreSQL but I haven't found something that fits my needs. Is there a simple way of achieving this with PostgreSQL?

Apache Ignite Spring Data save method transaction behaviour with map parameter

As per the Apache Ignite Spring Data documentation, there are two methods to save data in the Ignite cache:
1. org.apache.ignite.springdata.repository.IgniteRepository.save(key, value)
and
2. org.apache.ignite.springdata.repository.IgniteRepository.save(Map<ID, S> entities)
So I just want to understand the transaction behavior of the 2nd method. Suppose we save 100 records using the save(Map<ID, S>) method and, for some reason, some nodes go down after 70 records. In this case, will it roll back those 70 records?
Note: as per the 1st method's behavior, if we use @Transactional at the method level, it rolls back the particular entity.
First of all, you should read about the transaction mechanism used in Apache Ignite. It is well described in the articles presented here:
https://apacheignite.readme.io/v1.0/docs/transactions#section-two-phase-commit-2pc
The most interesting part for you is "Backup Node Failures" and "Primary Node Failures":
Backup Node Failures
If a backup node fails during either "Prepare" phase or "Commit" phase, then no special handling is needed. The data will still be committed on the nodes that are alive. GridGain will then, in the background, designate a new backup node and the data will be copied there outside of the transaction scope.
Primary Node Failures
If a primary node fails before or during the "Prepare" phase, then the coordinator will designate one of the backup nodes to become primary and retry the "Prepare" phase. If the failure happens before or during the "Commit" phase, then the backup nodes will detect the crash and send a message to the Coordinator node to find out whether to commit or rollback. The transaction still completes and the data within distributed cache remains consistent.
In your case, all updates for all values in the map should either be committed in one transaction or rolled back. I guess these articles answer your question.
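For reference, a minimal usage sketch of the second save() variant the question refers to (the repository and entity types are illustrative, not from the Ignite docs):
// Illustrative only: PersonRepository extends IgniteRepository<Person, Long>.
Map<Long, Person> batch = new TreeMap<>();
for (long id = 1; id <= 100; id++) {
    batch.put(id, new Person(id, "person-" + id));
}
personRepository.save(batch); // the save(Map<ID, S>) overload: all 100 entries written in one call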

How to create a distributed 'debounce' task to drain a Redis List?

I have the following use case: multiple clients push to a shared Redis List. A separate worker process should drain this list (process and delete). Wait/multi-exec is in place to make sure this goes smoothly.
For performance reasons I don't want to call the 'drain' process right away, but after x milliseconds, starting from the moment the first client pushes to the (then empty) list.
This is akin to a distributed underscore/lodash debounce function, for which the timer starts to run the moment the first item comes in (i.e.: 'leading' instead of 'trailing')
I'm looking for the best way to do this reliably in a fault tolerant way.
Currently I'm leaning to the following method:
Use Redis SET with the NX and PX options. This allows me:
to set a value (a mutex) on a dedicated key only if it doesn't yet exist; this is what the NX argument is used for
to expire the key after x milliseconds; this is what the PX argument is used for
This command succeeds only if no value previously existed (the reply is OK, or nil otherwise). Success means the current client is the first client to run the process since the Redis List was drained. Therefore,
this client puts a job on a distributed queue which is scheduled to run in x milliseconds.
After x milliseconds, the worker to receive the job starts the process of draining the list.
This works on paper, but feels a bit complicated. Any other ways to make this work in a distributed fault-tolerant way?
Btw: Redis and a distributed queue are already in place so I don't consider it an extra burden to use it for this issue.
Sorry for that, but a proper response would require a bunch of text/theory. Because your question is so good, you've already written most of a good answer yourself :)
First of all, we should define the terms. 'Debounce' in the underscore/lodash sense is best understood through David Corbacho's article:
Debounce: Think of it as "grouping multiple events in one". Imagine that you go home, enter in the elevator, doors are closing... and suddenly your neighbor appears in the hall and tries to jump on the elevator. Be polite! and open the doors for him: you are debouncing the elevator departure. Consider that the same situation can happen again with a third person, and so on... probably delaying the departure several minutes.
Throttle: Think of it as a valve, it regulates the flow of the executions. We can determine the maximum number of times a function can be called in certain time. So in the elevator analogy you are polite enough to let people in for 10 secs, but once that delay passes, you must go!
You are asking about debounce since the first element is pushed to the list:
So, by analogy with the elevator: the elevator should go up 10 minutes after the first person got in. It does not matter how many more people cram into the elevator.
In the case of a distributed fault-tolerant system, this should be viewed as a set of requirements:
Processing of the new list must begin within X time after the first element is inserted (i.e. the creation of the list).
A worker crash should not break anything.
It must be deadlock-free.
The first requirement must be fulfilled regardless of the number of workers, be it 1 or N.
I.e. you need to know (in a distributed way) whether the group of workers has to wait or whether list processing can start. As soon as we utter the words "distributed" and "fault-tolerant", these concepts always bring along their friends:
Atomicity (e.g. by locking)
Reservation
In practice
In practice, I am afraid your system needs to be a little bit more complicated (maybe you just haven't written it down, and you already have it).
Your method:
Pessimistic locking with a mutex via SET NX PX. NX guarantees that only one process at a time is doing the work (atomicity). PX ensures that if something happens to this process, the lock is eventually released by Redis (the part of fault tolerance that prevents deadlocks).
All workers try to grab the one mutex (per list key), so just one succeeds and processes the list after X time. This process can update the TTL of the mutex (if it needs more time than originally planned). If the process crashes, the mutex is unlocked after the TTL and can be grabbed by another worker.
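A sketch of that mutex with a Jedis-style client (assumes Jedis 3.x; key name, value, and timing are arbitrary):
// SET myList:drainLock <workerId> NX PX 10000
String reply = jedis.set("myList:drainLock", workerId,
        SetParams.setParams().nx().px(10_000));
if ("OK".equals(reply)) {
    // This worker won the mutex: schedule/perform the drain after X ms,
    // extending the TTL (PEXPIRE) if the work takes longer than expected.
} else {
    // Another worker already holds the mutex: back off and retry later.
}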
My suggestion
Fault-tolerant, reliable queue processing in Redis is built around RPOPLPUSH:
RPOPLPUSH the item from the main list to a special processing list (per worker per list).
Process the item.
Remove the item from the special list.
Requirements
So, if a worker crashes, we can always return the broken message from the special list to the main list. And Redis guarantees the atomicity of RPOPLPUSH/RPOP. That leaves only the problem of the group of workers having to wait for a while.
Then there are two options. First: if you have many clients and fewer workers, use locking on the worker side. A worker tries to lock the mutex and, if it succeeds, starts processing.
And vice versa: use SET NX PX each time you execute LPUSH/RPUSH (to get a "wait N time before popping from me" solution if you have many workers and only a few push clients). So a push is:
SET myListLock 1 PX 10000 NX
LPUSH myList value
And each worker just checks whether myListLock exists; if it does, it waits at least the key's TTL before setting the processing mutex and starting to drain.
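And a sketch of the RPOPLPUSH side for a worker (same Jedis-style client; the list names are illustrative):
// Reliable-queue drain loop: move each item to a per-worker processing list, handle it, then drop it.
String processingList = "myList:processing:" + workerId;
String item;
while ((item = jedis.rpoplpush("myList", processingList)) != null) {
    process(item);                       // placeholder for the real work
    jedis.lrem(processingList, 1, item); // done: remove it from the per-worker processing list
}
// If this worker crashes, items left in processingList can later be pushed back onto myList.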

How to ensure Hazelcast migration is finished

Consider the following scenario.
There are 2 Hazelcast nodes. One is stopped, another is running under quite heavy load.
Now, the second node comes up. The application starts up and its Hazelcast instance hooks up to the first. Hazelcast starts data repartitioning. For 2 nodes, it essentially means
that each entry in IMap gets copied to the new node and two nodes are assigned to be master/backup arbitrarily.
PROBLEM:
If the first node is brought down during this process, and the replication is not done completely, part of the IMap contents and ITopic subscriptions may be lost.
QUESTION:
How to ensure that the repartitioning process has finished, and it is safe to turn off the first node?
(The whole setup is made to enable software updates without downtime, while preserving current application state).
I tried using getPartitionService().addMigrationListener(...), but the listener does not seem to be hooked up to the complete migration process. Instead, I get tens to hundreds of calls to migrationStarted()/migrationCompleted(), one for each chunk of the replication.
1- When you gracefully shut down the first node, the shutdown process will wait (block) until the data is safely backed up:
hazelcastInstance.getLifecycleService().shutdown();
2- If you use Hazelcast Management Center, it shows the count of ongoing migration/repartitioning operations on the home screen.
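A minimal sketch combining the graceful shutdown with an explicit safety check (PartitionService#isClusterSafe() is available in Hazelcast 3.3+; the polling interval is arbitrary):
void shutdownWhenSafe(HazelcastInstance hazelcastInstance) throws InterruptedException {
    PartitionService partitionService = hazelcastInstance.getPartitionService();
    // Wait until every partition has its backups in sync before taking this member down.
    while (!partitionService.isClusterSafe()) {
        Thread.sleep(1000);
    }
    // Graceful shutdown blocks until data owned by this member is safely backed up elsewhere.
    hazelcastInstance.getLifecycleService().shutdown();
}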

SSIS 2005 Control Flow Priority

The short version is that I am looking for a way to prioritize certain tasks in SSIS 2005 control flows. That is, I want to be able to set it up so that Task B does not start until Task A has started, but Task B does not need to wait for Task A to complete. The goal is to reduce the amount of time I have idle threads hanging around waiting for Task A to complete so that they can move on to Tasks C, D & E.
The issue I am dealing with is converting a data warehouse load from a linear job that calls a bunch of SPs to an SSIS package calling the same SPs but running multiple threads in parallel. So basically I have a bunch of Execute SQL Task and Sequence Container objects with Precedence Constraints mapping out the dependencies. So far no problems; things are working great and it cut our load time a bunch.
However I noticed that tasks with no downstream dependencies are commonly being sequenced before those that do have dependencies. This is causing a lot of idle time in certain spots that I would like to minimize.
For example: I have about 60 procs involved with this load, ~10 of them have no dependencies at all and can run at any time. Then I have another one with no upstream dependencies but almost every other task in the job is dependent on it. I would like to make sure that the task with the dependencies is running before I pick up any of the tasks with no dependencies. This is just one example, there are similar situations in other spots as well.
Any ideas?
I am late in updating over here, but I also raised this issue over on the MSDN forums and we were able to devise a partial work-around. See here for the full thread, or here for the feature request asking Microsoft to give us a way to do this cleanly...
The short version is that you use a series of Boolean variables to control loops that act like roadblocks and prevent the flow from reaching the lower priority tasks until the higher priority items have started.
The steps involved are:
Declare a bool variable for each of the high priority tasks and default the values to false.
Create a pre-execute event for each of the high priority tasks.
In the pre-execute event create a script task which sets the appropriate bool to true.
At each choke point insert a for each loop that will loop while the appropriate bool(s) are false. (I have a script with a 1 second sleep inside each loop but it also works with empty loops.)
If done properly this gives you a tool where at each choke point the package has some number of high priority tasks ready to run and a blocking loop that keeps it from proceeding down the lower priority branches until said high priority items are running. Once all of the high priority tasks have been started the loop clears and allows any remaining threads to move on to lower priority tasks. Worst case is one thread sits in the loop while waiting for other threads to come along and pick up the high priority tasks.
The major drawback to this approach is the risk of deadlocking the package if too many blocking loops get queued up at the same time, or if you misread your dependencies and have loops waiting for tasks that never start. Careful analysis is needed to decide which items deserve higher priority and where exactly to insert the blocks.
I don't know of any elegant way to do this, but my first shot would be something like this:
Put the proc that has to run first in a Sequence Container. In that same Sequence Container, put a Script Task that just waits 5-10 seconds or so before each of the 10 independent steps can run. Then chain the rest of the procs below that Sequence Container.