I have a Task Queue table that stores tasks that need to be executed. I am using multi-threading: each thread gets the top 1 record from the Task Queue, executes that task, and then deletes the record. The problem is that if another thread runs its SELECT before the previous thread has deleted the task it picked, both threads may pick the same task. I want to know if there is a way to stop other threads from reading from the table until the current thread has deleted the task it picked.
Rather than doing a SELECT followed by a DELETE, you can instead perform a DELETE with an OUTPUT clause. The OUTPUT clause produces a result set, but you're now obtaining that result set directly from the DELETE, so it's a single atomic operation: two independent executions will not produce the same output row.
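For example, here is a minimal sketch in T-SQL (the TaskQueue table and its Id and Payload columns are assumed names; READPAST lets competing workers skip rows another session has already locked instead of waiting on them):
;WITH next_task AS (
    SELECT TOP (1) *
    FROM TaskQueue WITH (ROWLOCK, READPAST)
    ORDER BY Id
)
DELETE FROM next_task
OUTPUT deleted.Id, deleted.Payload; -- the caller receives the row it just removed
Each worker that runs this gets a different row (or no row at all once the queue is empty), without any further coordination between threads.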
I want to ensure that only one worker node (out of a cluster of nodes) picks up any given job from a Jobs table and processes it. I'm using the following database transaction on my Jobs table to achieve this:
BEGIN TRANSACTION;
SELECT id FROM jobs
WHERE status = 'NEW'
ORDER BY created_at
LIMIT 1;
-- rollback if id is null
UPDATE jobs
SET status = 'IN_PROCESS'
WHERE id = 'id';
COMMIT;
Will the above TRANSACTION ensure that only one node will pick up any given job? Is it possible that two nodes will read the SELECT statement simultaneously as NEW, and then both will run the UPDATE statement (after the first one releases the row lock) and start processing the same job?
In other words, will the TRANSACTION provide a lock for the SELECT statement as well, or only for the UPDATE statement?
No, transactions won't help you here unless you raise the isolation level to SERIALIZABLE. That's an option of last resort, and I would avoid it if possible.
The possibilities I see are:
Pessimistic Locks. Add FOR UPDATE to the SELECT. These limit performance.
Optimistic Locks. Seems like the best fit for your needs.
Use a predefined queue manager linked to your table.
Implement a queue, or a queue process.
Seems to me that #2 is the best fit for it. If that's the case, you'll need to add an extra column to the table: version. With that column added, your queries will need to change to:
SELECT id, version FROM jobs
WHERE status = 'NEW'
ORDER BY created_at
LIMIT 1;
UPDATE jobs
SET status = 'IN_PROCESS', version = version + 1
WHERE id = 'id_retrieved_before' and version = 'version_retrieved_before';
The update above returns the number of updated rows. If the count is 1 then this thread got the row. If it's 0, then another competing thread got the row and you'll need to retry this strategy again.
As you can see:
No transactions are necessary.
All engines return the number of updated rows, so this strategy will work well in virtually any database.
No locks are necessary. This offers great performance.
The downside is that if another thread got the row, the logic needs to start again from the beginning. No big deal, but it doesn't ensure optimal response time when there are many competing threads.
Finally, if your database is PostgreSQL, you can pack both SQL statements into a single one. Isn't PostgreSQL awesome?
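For example, a sketch of the single-statement form, using the same columns as above (SKIP LOCKED, available from PostgreSQL 9.5, is an extra assumption here; it lets a competing worker skip a row that is already being claimed):
UPDATE jobs
SET status = 'IN_PROCESS', version = version + 1
WHERE id = (SELECT id FROM jobs
            WHERE status = 'NEW'
            ORDER BY created_at
            LIMIT 1
            FOR UPDATE SKIP LOCKED)
RETURNING id, version;
If the RETURNING clause produces a row, this worker owns the job; if it produces nothing, there was no job available to claim.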
Let's say I have a table with some jobs that need to be executed, that is read by many instances of the same service.
I have a status column, and each instance queries the first row which is not taken yet (filtering by status) and then changes this status within the same transaction.
How do I prevent multiple instances reading the same job and then updating its status and executing it simultaneously?
I have a set of triggers which create an entry in a separate events table whenever rows are updated or deleted. This table contains an ID, the table name and the data which has been changed / deleted.
I have a multi-threaded .NET Core 3.0 application which periodically selects the record with the lowest ID, processes the data and sends the processed data off to an API. Once this has completed it then deletes the row from the table.
The problem is that the same row could be read twice by separate processes if the SELECT is completed by Process 1 and then Process 2 completes a SELECT before the DELETE has been completed by Process 1.
As the events table does not have a 'locked' column, I was hoping to complete this with some kind of row locking and WITH (readpast). However as the SELECT and DELETE are in separate transactions, I'm not sure if this is suitable.
Any advice on how I could achieve this given the current set up, or would introducing a lock column be the ideal way?
An alternative, and I would argue safer, approach to David's suggestion is to segment your queue.
For a two process system you can segment your queue into odd and even items (based on ID) and have one process handle even items and the other - odd items.
This way you will never have a deadlock / race condition between the two processes.
Example:
DECLARE @ProcessID INT = 0 -- 0 for "Process 1" and 1 for "Process 2"
SELECT TOP 1 * FROM [Queue]
WHERE ID % 2 = @ProcessID -- 2 is the number of processes
ORDER BY ID ASC
From the above you can see that this can be scaled to any number of processes.
Advantages
Completely avoids deadlocking and race conditions.
No requirement for atomic operations.
Disadvantages
No guarantee of processing order.
If one process stops, then half the items will not be processed.
I have an Items and Jobs table:
Items
id = PK
job_id = Jobs FK
status = IN_PROGRESS | COMPLETE
Jobs
id = PK
Items start out as IN_PROGRESS, work is performed on them, and they are handed off to a worker to update. I have an updater process that updates Items as they come in with a new status. The approach I have taken so far has been (in pseudocode):
def work(item: Item) = {
  insideTransaction {
    updateItemWithNewStatus(item)
    jobs, items = getParentJobAndAllItems(item)
    newJobStatus = computeParentJobStatus(jobs, items)
    // do some stuff depending on newJobStatus
  }
}
Does that make sense? I want this to work in a concurrent environment. The issue I have right now is that a job reaches COMPLETE multiple times, when I only want to run the COMPLETE logic once.
If I change my transaction level to SERIALIZABLE, I do get the "ERROR: could not serialize access due to read/write dependencies among transactions" error as described.
So my questions are:
Do I need SERIALIZABLE?
Can I get away with SELECT FOR UPDATE, and where?
Can someone explain to me what is happening, and why?
Edit: I have reopened this question because I was not satisfied with the previous answers explanation. Is anyone able to explain this for me? Specifically, I want some example queries for that pseudocode.
You can use a SELECT FOR UPDATE on items and jobs and work on the affected rows in both tables within a single transaction. That should be enough to enforce the integrity of the whole operation without the overhead of SERIALIZABLE or a table lock.
I would suggest you create a function that is called after an insert or update is made on the items table, passing the PK of the item:
CREATE FUNCTION process_item(p_item integer) RETURNS void AS $$
DECLARE
  item items%ROWTYPE;
  job jobs%ROWTYPE;
BEGIN -- Runs inside the calling transaction
  SELECT * INTO job FROM jobs
  WHERE id = (SELECT job_id FROM items WHERE id = p_item)
  FOR UPDATE; -- Lock the job row against other sessions
  FOR item IN SELECT * FROM items WHERE job_id = job.id FOR UPDATE LOOP -- Rows locked
    -- Work on items individually
    UPDATE items
    SET status = 'COMPLETE'
    WHERE id = item.id;
  END LOOP;
  -- Do any work on the job itself
END; -- Locks are released when the calling transaction commits
$$ LANGUAGE plpgsql;
If some other process is already working on the job or any of its associated items, the execution will halt until that other lock is released. This is different from SERIALIZABLE, which will work until it fails, after which you'd have to redo all of the processing in a second try.
If you want the jobs to be able to run concurrently, neither SERIALIZABLE nor SELECT FOR UPDATE will work directly.
If you lock the row using SELECT FOR UPDATE, then another process will simply block when it executes the SELECT FOR UPDATE until the first process commits the transaction.
If you do SERIALIZABLE, both processes could run concurrently (processing the same row) but at least one should be expected to fail by the time it does a COMMIT since the database will detect the conflict. Also SERIALIZABLE might fail if it conflicts with any other queries going on in the database at the same time which affect related rows. The real reason to use SERIALIZABLE is precisely if you are trying to protect against concurrent database updates made by other jobs, as opposed to blocking the same job from executing twice.
Note there are tricks to make SELECT FOR UPDATE skip locked rows. If you do that then you can have actual concurrency. See Select unlocked row in Postgresql.
Another approach I see more often is to change your "status" column to have a third, temporary state which is used while a job is being processed. Typically one would have states like 'PENDING', 'IN_PROGRESS', 'COMPLETE'. When your process searches for work to do, it finds a 'PENDING' job, immediately moves it to 'IN_PROGRESS' and commits the transaction, then carries on with the work and finally moves it to 'COMPLETE'. The disadvantage is that if the process dies while processing a job, the job will be left in 'IN_PROGRESS' indefinitely.
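A sketch of that three-state pattern (the table and column names are assumptions, and SKIP LOCKED from the trick mentioned above keeps workers from blocking each other):
BEGIN;
SELECT id FROM jobs
WHERE status = 'PENDING'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED; -- remember the returned id as :claimed_id
UPDATE jobs SET status = 'IN_PROGRESS' WHERE id = :claimed_id;
COMMIT; -- other workers now see the job as taken
-- ... do the actual processing outside of any long-running transaction ...
UPDATE jobs SET status = 'COMPLETE' WHERE id = :claimed_id;
The window in which a crash leaves the job stuck in 'IN_PROGRESS' is exactly the disadvantage described above.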
I have a list of 100 entries that I want to process with multiple threads. Each thread will take up to 20 entries to process.
I'm currently using global temp tables to store the entries that meet certain criteria -- I also do not want threads to overlap on the entries they process.
How do I do this (preventing the overlap)?
Thanks!
If on 11g, I'd use the SELECT ... FOR UPDATE SKIP LOCKED.
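A sketch of that approach (the entries table, its processed flag and the batch size of 20 are assumptions): each session opens a cursor that skips rows locked by other sessions and locks up to 20 unprocessed rows for itself.
DECLARE
  CURSOR c_entries IS
    SELECT id FROM entries
    WHERE processed = 'N'
    FOR UPDATE SKIP LOCKED;
  v_id entries.id%TYPE;
BEGIN
  OPEN c_entries;
  FOR i IN 1 .. 20 LOOP
    FETCH c_entries INTO v_id;
    EXIT WHEN c_entries%NOTFOUND;
    -- process the entry identified by v_id, then mark it as done
    UPDATE entries SET processed = 'Y' WHERE id = v_id;
  END LOOP;
  CLOSE c_entries;
  COMMIT; -- releases the locks
END;
/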
If on a previous version, I'd use Advanced Queuing to populate a queue with the primary key values of the entries to be processed, and have your threads dequeue those keys to process those records. Because the dequeue can be (but doesn't have to be, if memory serves) part of the processing's transactional scope, the dequeue commits or rolls back with the processing, and no two threads can get the same records to process.
There are two issues here, so let's handle them separately:
How do you split some work among several threads/sessions?
You could use Advanced Queuing or the SKIP LOCKED feature as suggested by Adam.
You could also use a column that contains processing information, for example a STATE column that is empty when not processed. Each thread would start work on a row with:
UPDATE your_table
SET state='P'
WHERE STATE IS NULL
AND rownum = 1
RETURNING id INTO :id;
At this point the thread would commit, to prevent other threads from being blocked. Then you would do your processing and select another row when you're done.
Alternatively, you could also split the work beforehand and assign each process a range of ids that need to be processed.
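A sketch of such a split, assuming four processes and the same table and STATE column as above (NTILE assigns every unprocessed row to one of four evenly sized buckets, and each process works only on its own bucket number):
SELECT id,
       NTILE(4) OVER (ORDER BY id) AS bucket
FROM your_table
WHERE state IS NULL;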
How will temporary tables behave with multiple threads?
Most likely each thread will have its own Oracle session (else you couldn't run queries in parallel). This means that each thread will have its own virtual copy of the temporary table. If you stored data in this table beforehand, the threads will not be able to see it (the temp table will always be empty at the beginning of a session).
You will need regular tables if you want to store data accessible to multiple sessions. Temporary tables are fine for storing data that is private to a single session, for example intermediate data in a complex process.
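A small illustration of that per-session behaviour (the table name is made up):
CREATE GLOBAL TEMPORARY TABLE work_entries (
  id NUMBER PRIMARY KEY
) ON COMMIT PRESERVE ROWS;
-- Session A:
INSERT INTO work_entries VALUES (1);
COMMIT;
-- Session B sees none of session A's rows:
SELECT COUNT(*) FROM work_entries; -- returns 0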
Easiest will be to use DBMS_SCHEDULER to schedule a job for every row that you want to process. You have to pass a key into a permanent table to identify the row that you want to process, or put the full row in the arguments for the job, since a temporary table's content is not visible in different sessions. The number of concurrent jobs is controlled by the resource manager, mostly limited by the number of CPUs.
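A minimal sketch of that approach (the row_queue table and the process_row procedure are assumptions): one scheduler job is created per row, and each job calls the procedure with that row's key.
BEGIN
  FOR r IN (SELECT id FROM row_queue) LOOP
    DBMS_SCHEDULER.CREATE_JOB(
      job_name   => 'PROCESS_ROW_' || r.id,
      job_type   => 'PLSQL_BLOCK',
      job_action => 'BEGIN process_row(' || r.id || '); END;',
      enabled    => TRUE,  -- run as soon as the resource manager allows
      auto_drop  => TRUE); -- drop the job once it has completed
  END LOOP;
END;
/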
Why would you want to process row by row anyway? Set-based operations are in most cases a lot faster.