Using database transactions to ensure only one process picks up a job from jobs table - sql

I want to ensure that only one worker node (out of a cluster of nodes) picks up any given job from a Jobs table and processes it. I'm using the following database transaction on my Jobs table to achieve this:
BEGIN TRANSACTION;
SELECT id FROM jobs
WHERE status = 'NEW'
ORDER BY created_at
LIMIT 1;
-- rollback if id is null
UPDATE jobs
SET status = 'IN_PROCESS'
WHERE id = 'id';
COMMIT;
Will the above TRANSACTION ensure that only one node will pick up any given job? Is it possible that two nodes will read the SELECT statement simultaneously as NEW, and then both will run the UPDATE statement (after the first one releases the row lock) and start processing the same job?
In other words, will the TRANSACTION provide a lock for the SELECT statement as well, or only for the UPDATE statement?

No, the transaction alone won't help you here unless you raise the isolation level to SERIALIZABLE. That's a last-resort option, and I would avoid it if possible.
The possibilities I see are:
Pessimistic Locks. Add FOR UPDATE to the SELECT. These limit performance.
Optimistic Locks. Seems like the best fit for your needs.
Use a predefined queue manager linked to your table.
Implement a queue, or a queue process.
Seems to me that #2 is the best fit. If that's the case, you'll need to add an extra column to the table: version. With that column added, your queries change to:
SELECT id, version FROM jobs
WHERE status = 'NEW'
ORDER BY created_at
LIMIT 1;
UPDATE jobs
SET status = 'IN_PROCESS', version = version + 1
WHERE id = 'id_retrieved_before' and version = 'version_retrieved_before';
The update above returns the number of updated rows. If the count is 1, then this thread got the row. If it's 0, then another competing thread got the row and you'll need to retry the strategy.
As you can see:
No transactions are necessary.
All engines return the number of updated rows, so this strategy will work in virtually any database.
No locks are necessary. This offers great performance.
The downside is that if another thread got the row, the logic needs to start again from the beginning. That's no big deal, but it doesn't guarantee optimal response time when there are many competing threads.
Finally, if your database is PostgreSQL, you can pack both SQL statements into a single one. Isn't PostgreSQL awesome?
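For example, a single-statement version of the optimistic claim might look like the sketch below (my phrasing, not necessarily the exact statement the answer had in mind). A worker that gets no row back simply retries, just as with the two-statement form:
UPDATE jobs
SET status = 'IN_PROCESS', version = version + 1
WHERE id = (SELECT id FROM jobs
            WHERE status = 'NEW'
            ORDER BY created_at
            LIMIT 1)
  AND status = 'NEW'   -- guard in case another worker grabbed the row first
RETURNING id, version;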

Related

How to establish read-only-once implement within SAP HANA?

Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.
High-level pseudo-code:
Collect request via db proc (query)
Call API with request
Store results of the request (response)
I have a table (A) that is the source of inputs to a process. Once a process has completed it will write results to another table (B).
Perhaps this is all solved if I just add a column to table A to keep concurrent processors from selecting the same records from A?
I am wondering how to do this without adding the column to source table A.
What I have tried is a left outer join between tables A and B to get rows from A that have no corresponding rows (yet) in B. This doesn't work, or at least I haven't implemented it in a way that ensures rows are processed only once across the processors.
I have a stored proc to handle batch selection:
/*
* getBatch.sql
*
* SYNOPSIS: Retrieve the next set of criteria to be used in a search
* request. Use left outer join between input source table
* and results table to determine the next set of inputs, and
* provide support so that concurrent processes may call this
* proc and get their inputs exclusively.
*/
alter procedure "ACOX"."getBatch" (
    in in_limit int
    ,in in_run_group_id varchar(36)
    ,out ot_result table (
        id bigint
        ,runGroupId varchar(36)
        ,sourceTableRefId integer
        ,name nvarchar(22)
        ,location nvarchar(13)
        ,regionCode nvarchar(3)
        ,countryCode nvarchar(3)
    )
) language sqlscript sql security definer as
begin
    -- insert new records:
    insert into "ACOX"."search_result_v4" (
        "RUN_GROUP_ID"
        ,"BEGIN_DATE_TS"
        ,"SOURCE_TABLE"
        ,"SOURCE_TABLE_REFID"
    )
    select
        in_run_group_id as "RUN_GROUP_ID"
        ,CURRENT_TIMESTAMP as "BEGIN_DATE_TS"
        ,'acox.searchCriteria' as "SOURCE_TABLE"
        ,fp.descriptor_id as "SOURCE_TABLE_REFID"
    from
        acox.searchCriteria fp
        left join "ACOX"."us_state_codes" st
            on trim(fp.region) = trim(st.usps)
        left outer join "ACOX"."search_result_v4" r
            on fp.descriptor_id = r.source_table_refid
    where
        st.usps is not null
        and r.BEGIN_DATE_TS is null
    limit :in_limit;

    -- select records inserted for return:
    ot_result =
        select
            r.ID id
            ,r.RUN_GROUP_ID runGroupId
            ,fp.descriptor_id sourceTableRefId
            ,fp.merch_name name
            ,fp.Location location
            ,st.usps regionCode
            ,'USA' countryCode
        from
            acox.searchCriteria fp
            left join "ACOX"."us_state_codes" st
                on trim(fp.region) = trim(st.usps)
            inner join "ACOX"."search_result_v4" r
                on fp.descriptor_id = r.source_table_refid
                and r.COMPLETE_DATE_TS is null
                and r.RUN_GROUP_ID = in_run_group_id
        where
            st.usps is not null
        limit :in_limit;
end;
When running 7 concurrent processors, I get a 35% overlap. That is to say that out of 5,000 input rows, the resulting row count is 6,755. Running time is about 7 mins.
Currently my solution includes adding a column to the source table. I wanted to avoid that, but it seems to make for a simpler implementation. I will update the code shortly; it includes an update statement prior to the insert.
Useful references:
SAP HANA Concurrency Control
Exactly-Once Semantics Are Possible: Here’s How Kafka Does It
First off: there is no "read-only-once" in any RDBMS, including MS SQL.
Literally, this would mean that a given record can only be read once and would then "disappear" for all subsequent reads. (that's effectively what a queue does, or the well-known special-case of a queue: the pipe)
I assume that that is not what you are looking for.
Instead, I believe you want to implement a processing semantic analogous to "once-and-only-once" aka "exactly-once" message delivery. While this is impossible to achieve in potentially partitioned networks, it is possible within the transaction context of databases.
This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).
Sorry for the long pre-text, but any solution for this will depend on being clear on what we want to actually achieve. I will get to an approach for that now.
The major RDBMS have long figured out that blocking readers is generally a terrible idea if the goal is to enable high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup).
The main issue with the "exactly-once" processing requirement really is not the reading of the records, but the possibility of processing more than once or not at all.
Both of these potential issues can be addressed with the following approach:
Step 1: SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even/odd IDs, zip code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock them - neither for UPDATE, DELETE, nor for SELECT ... FOR UPDATE.
Step 2: Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log entries...
Step 3: As the final step of the processing, you "mark" the records as processed. How exactly this is implemented does not really matter. One could create a processed column in the table and set it to TRUE when records have been processed, or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job ID to keep track of multiple load jobs). In whatever way this is implemented, this is the point in time where the processed status needs to be captured.
Step 4: COMMIT or ROLLBACK (in case something went wrong). The COMMIT persists the records written to the target table and the processed-status information, and it releases the exclusive locks on the source table.
As you can see, Step 1 takes care of records being missed, by selecting all wanted records that can currently be processed (i.e. that are not exclusively locked by any other process).
Step 3 takes care of records potentially being processed more than once, by keeping track of the processed records. Obviously this tracking has to be checked in Step 1 - both steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB transaction context, allowing for a guaranteed COMMIT or ROLLBACK across the whole transaction. That means no "record marker" will ever be lost when the processing of the records was committed.
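A minimal sketch of those four steps in SQL. The table and column names (source_data, processed_log, load_job_id) are illustrative, not from the post, and the exact clause syntax may differ slightly on HANA:
-- Step 1: claim the unprocessed rows for this session. Readers are never
-- blocked, but no other session can lock these rows until we commit.
SELECT s.id
FROM source_data s
WHERE NOT EXISTS (SELECT 1 FROM processed_log p WHERE p.id = s.id)
FOR UPDATE;

-- Step 2: process the returned ids (merge, insert into target tables,
-- write log entries, ...).

-- Step 3: record which rows were handled, one row per processed id.
INSERT INTO processed_log (id, load_job_id) VALUES (?, ?);

-- Step 4: persist the target writes and the tracking rows,
-- and release the exclusive locks taken in Step 1.
COMMIT;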
Now, why is this approach preferable to making records "un-readable"?
Because of the other processes in the system.
Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.
Or maybe, somebody wants to do some analytics on the source data and also needs to read those records.
Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.
Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.

Postgres concurrency and serializability. Do I need a SERIALIZABLE isolation level?

I have an Items and Jobs table:
Items
id = PK
job_id = Jobs FK
status = IN_PROGRESS | COMPLETE
Jobs
id = PK
Items start out as IN_PROGRESS, but work is performed on them, and handed off to a worker to update. I have an updater process that is updating Items as they come in, with a new status. The approach I have been doing so far has been (in pseudocode):
def work(item: Item) = {
insideTransaction {
updateItemWithNewStatus(item)
jobs, items = getParentJobAndAllItems(item)
newJobStatus = computeParentJobStatus(jobs, items)
// do some stuff depending on newJobStatus
}
}
Does that make sense? I want this to work in a concurrent environment. The issue I have right now is that a job arrives at COMPLETE multiple times, when I only want to run the on-COMPLETE logic once.
If I change my transaction level to SERIALIZABLE, I do get the "ERROR: could not serialize access due to read/write dependencies among transactions" error as described.
So my questions are:
Do I need SERIALIZABLE?
Can I get away with SELECT FOR UPDATE, and where?
Can someone explain to me what is happening, and why?
Edit: I have reopened this question because I was not satisfied with the previous answers explanation. Is anyone able to explain this for me? Specifically, I want some example queries for that pseudocode.
You can use a SELECT FOR UPDATE on items and jobs and work on the affected rows in both tables within a single transaction. That should be enough to enforce the integrity of the whole operation without the overhead of SERIALIZABLE or a table lock.
I would suggest you create a function that is called after an insert or update is made on the items table, passing the PK of the item:
CREATE FUNCTION process_item(p_item_id integer) RETURNS void AS $$
DECLARE
    item items%ROWTYPE;
    job  jobs%ROWTYPE;
BEGIN -- Implicitly starting a transaction
    SELECT * INTO job FROM jobs
     WHERE id = (SELECT job_id FROM items WHERE id = p_item_id)
       FOR UPDATE;                -- Lock the job row against other users
    -- Lock and walk only the items that belong to this job
    FOR item IN SELECT * FROM items WHERE job_id = job.id FOR UPDATE LOOP
        -- Work on items individually
        UPDATE items
           SET status = 'COMPLETE'
         WHERE id = item.id;
    END LOOP;
    -- Do any work on the job itself
END; -- Implicitly close the transaction, releasing the locks
$$ LANGUAGE plpgsql;
If some other process is already working on the job or any of its associated items, then execution will halt until that other lock is released. This is different from SERIALIZABLE, which will work until it fails, at which point you'd have to redo all of the processing in a second try.
If you want the jobs to be able to run concurrently, neither SERIALIZABLE nor SELECT FOR UPDATE will work directly.
If you lock the row using SELECT FOR UPDATE, then another process will simply block when it executes the SELECT FOR UPDATE until the first process commits the transaction.
If you do SERIALIZABLE, both processes could run concurrently (processing the same row) but at least one should be expected to fail by the time it does a COMMIT since the database will detect the conflict. Also SERIALIZABLE might fail if it conflicts with any other queries going on in the database at the same time which affect related rows. The real reason to use SERIALIZABLE is precisely if you are trying to protect against concurrent database updates made by other jobs, as opposed to blocking the same job from executing twice.
Note there are tricks to make SELECT FOR UPDATE skip locked rows. If you do that then you can have actual concurrency. See Select unlocked row in Postgresql.
Another approach I see more often is to change your "status" column to have a third, temporary state which is used while a job is being processed. Typically one would have states like 'PENDING', 'IN_PROGRESS', 'COMPLETE'. When your process searches for work to do, it finds a 'PENDING' job, immediately moves it to 'IN_PROGRESS', and commits the transaction, then carries on with the work and finally moves it to 'COMPLETE'. The disadvantage is that if the process dies while processing a job, the job is left in 'IN_PROGRESS' indefinitely.
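A rough sketch of that three-state pattern in PostgreSQL. The status values are as named above; the table name work_queue, the bind parameter, and the use of SKIP LOCKED (PostgreSQL 9.5+) are my additions:
-- Transaction 1: claim a pending item and commit straight away,
-- so the claim is visible to other workers immediately.
BEGIN;
UPDATE work_queue
SET status = 'IN_PROGRESS'
WHERE id = (SELECT id FROM work_queue
            WHERE status = 'PENDING'
            ORDER BY id
            LIMIT 1
            FOR UPDATE SKIP LOCKED)
RETURNING id;
COMMIT;

-- ... do the actual work, outside any long-running transaction ...

-- Transaction 2: record completion for the id claimed above.
UPDATE work_queue SET status = 'COMPLETE' WHERE id = $1;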

In Oracle, a way for Updates to not lock rows?

I have an Update query that recalculates -every- column value in exactly one row, each time.
I've been seeing more row-level lock contention, due to these Update queries occurring on the same row.
I'm thinking maybe one solution would be to have subsequent Updates simply preempt any Updates already in progress. Is this possible? Does Oracle support this kind of Update?
To spell out the idea in full:
1. update query #1 begins, in its own transaction
2. needs to update row X
3. acquires lock on row X
4. update query #2 begins, again in its own transaction
5. blocks, waiting for query #1 to release the lock on row X.
My thought is, can step 5 simply be: query #1 is aborted, query #2 proceeds. Or maybe dispense with acquiring the row-level lock in the first place.
I realize this logic would be disastrously wrong should the update query be updating only a subset of columns in a given row. But it's not -- every column gets recalculated, each time.
I'd ask whether a physical table is the right mechanism for whatever you are doing. One factor is how transactions need to be handled. Anything that means "Don't lock for the duration of the transaction" will run into transactional issues.
There are a couple of non-transactional options:
Global context values might be useful (it depends whether you are on RAC, and on how you handle persistence after a restart).
Another option is DBMS_PIPE where you'd have a background process maintaining that table and the separate sessions send messages to that process rather than update the table directly.
Queuing is another thought.
If you just need to reduce the time the record is locked, autonomous transactions could be the answer.
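For illustration, a hedged sketch of the autonomous-transaction idea; the procedure name, table name, and columns are assumptions, not from the question:
CREATE OR REPLACE PROCEDURE refresh_row(p_id IN NUMBER, p_value IN NUMBER) AS
    PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
    UPDATE stats_row
       SET col_a = p_value   -- recalculate every column of the row here
     WHERE id = p_id;
    COMMIT;                  -- commits independently of the caller's
                             -- transaction, releasing the row lock at once
END;
/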
It's possible to do the opposite of what you're asking: have query #2 fail if query #1 is in progress, using SELECT FOR UPDATE with NOWAIT.
Alternatively, you could try to see if you can get the desired effect by adjusting the isolation level, but I do not recommend this without extensive testing, as you don't know what knock-on effects it may have.
Oracle's UPDATE doesn't support any locking hints.
But the OraFAQ forum suggests this hacky workaround:
DECLARE
    x CHAR(1);
BEGIN
    SELECT 'x' INTO x
    FROM tablea
    WHERE ...                -- your update condition
    FOR UPDATE OF cola NOWAIT;

    UPDATE tablea
    SET cola = value
    WHERE ...;               -- your update condition (same as above)
EXCEPTION
    WHEN OTHERS THEN
        NULL;                -- handle the exception (NOWAIT raises ORA-00054 if the row is locked)
END;

Using SQL Server as a DB queue with multiple clients

Given a table that is acting as a queue, how can I best configure the table/queries so that multiple clients process from the queue concurrently?
For example, the table below indicates a command that a worker must process. When the worker is done, it will set the processed value to true.
| ID | COMMAND | PROCESSED |
| 1 | ... | true |
| 2 | ... | false |
| 3 | ... | false |
The clients might obtain one command to work on like so:
select top 1 COMMAND
from EXAMPLE_TABLE
with (UPDLOCK, ROWLOCK)
where PROCESSED=false;
However, if there are multiple workers, each tries to get the row with ID=2. Only the first will get the pessimistic lock, the rest will wait. Then one of them will get row 3, etc.
What query/configuration would allow each worker client to get a different row each and work on them concurrently?
EDIT:
Several answers suggest variations on using the table itself to record an in-process state. I thought that this would not be possible within a single transaction. (i.e., what's the point of updating the state if no other worker will see it until the txn is committed?) Perhaps the suggestion is:
# start transaction
update to 'processing'
# end transaction
# start transaction
process the command
update to 'processed'
# end transaction
Is this the way people usually approach this problem? It seems to me that the problem would be better handled by the DB, if possible.
I recommend you go over Using tables as Queues.
Properly implemented queues can handle thousands of concurrent users and service as many as half a million enqueue/dequeue operations per minute. Until SQL Server 2005 the solution was cumbersome: it involved mixing a SELECT and an UPDATE in a single transaction with just the right mix of lock hints, as in the article linked by gbn. Luckily, since SQL Server 2005, with the advent of the OUTPUT clause, a much more elegant solution is available, and MSDN now recommends using the OUTPUT clause:
You can use OUTPUT in applications that use tables as queues, or to hold intermediate result sets. That is, the application is constantly adding or removing rows from the table.
Basically there are 3 parts of the puzzle you need to get right in order for this to work in a highly concurrent manner:
You need to dequeue atomically. You have to find the row, skip any locked rows, and mark it as 'dequeued' in a single, atomic operation, and this is where the OUTPUT clause comes into play:
with CTE as (
    SELECT TOP(1) COMMAND, PROCESSED
    FROM EXAMPLE_TABLE WITH (READPAST)
    WHERE PROCESSED = 0
)
UPDATE CTE
SET PROCESSED = 1
OUTPUT INSERTED.*;
You must structure your table with the leftmost clustered index key on the PROCESSED column. If ID was the primary key, move it to the second column in the clustered key. The debate whether to keep a non-clustered index on the ID column is open, but I strongly favor not having any secondary non-clustered indexes on queues:
CREATE CLUSTERED INDEX cdxTable on EXAMPLE_TABLE(PROCESSED, ID);
You must not query this table by any other means but by Dequeue. Trying to do Peek operations or trying to use the table both as a Queue and as a store will very likely lead to deadlocks and will slow down throughput dramatically.
The combination of an atomic dequeue, the READPAST hint when searching for elements to dequeue, and a leftmost clustered-index key on the processing bit ensures very high throughput under highly concurrent load.
My answer here shows you how to use tables as queues... SQL Server Process Queue Race Condition
You basically need "ROWLOCK, READPAST, UPDLOCK" hints
If you want to serialize your operations for multiple clients, you can simply use application locks.
BEGIN TRANSACTION
    EXEC sp_getapplock @Resource = 'app_token', @LockMode = 'Exclusive'
    -- perform operation
    EXEC sp_releaseapplock @Resource = 'app_token'
COMMIT TRANSACTION
Rather than using a boolean value for Processed you could use an int to define the state of the command:
1 = not processed
2 = in progress
3 = complete
Each worker would then get the next row with Processed = 1, update Processed to 2, and then begin work. When the work is complete, Processed is updated to 3. This approach also allows for other Processed outcomes; for example, rather than just recording that the work is complete, you could add statuses for 'Completed Successfully' and 'Completed with Errors'.
Probably the better option would be to use a tri-state Processed column along with a version/timestamp column. The three values in the Processed column then indicate whether the row is unprocessed, under processing, or processed.
For example
CREATE TABLE Queue (
    ID INT NOT NULL PRIMARY KEY,
    Command NVARCHAR(100),
    Processed INT NOT NULL CHECK (Processed IN (0, 1, 2)),
    Version timestamp
)
You grab the top 1 unprocessed row, set its status to under-processing, and set it to processed when the work is done. Base your update on the Version and primary key columns; if the update affects no rows, someone else has already been there.
You might want to add a client identifier as well, so that if a client dies while processing, it can restart, look at its last row, and continue from where it left off.
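A hedged T-SQL sketch of that check, using the Queue table above and an assumed encoding of Processed (0 = unprocessed, 1 = under processing, 2 = processed):
DECLARE @id int, @version binary(8);

-- read one candidate row (0 = unprocessed)
SELECT TOP (1) @id = ID, @version = Version
FROM Queue
WHERE Processed = 0
ORDER BY ID;

-- claim it only if nobody touched it since the read (1 = under processing)
UPDATE Queue
SET Processed = 1
WHERE ID = @id AND Version = @version;

IF @@ROWCOUNT = 0
    PRINT 'Another worker claimed this row first; pick a different one.';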
I would stay away from messing with locks in a table. Just create two extra columns like IsProcessing (bit/boolean) and ProcessingStarted (datetime). When a worker crashes or doesn't update his row after a timeout you can have another worker try to process the data.
One way is to mark the row with a single update statement. If you read the status in the where clause and change it in the set clause, no other process can come in between, because the row will be locked. For example:
declare @pickup_id int
set @pickup_id = -1
set rowcount 1
update YourTable
set status = 'picked up'
    , @pickup_id = id
where status = 'new'
set rowcount 0
return @pickup_id
This uses rowcount to update one row at most. If no row was found, @pickup_id will be -1.

Getting a Chunk of Work

Recently I had to deal with a problem that I imagined would be pretty common: given a database table with a large (million+) number of rows to be processed, and various processors running in various machines / threads, how to safely allow each processor instance to get a chunk of work (say 100 items) without interfering with one another?
The reason I am getting a chunk at a time is for performance reasons - I don't want to go to the database for each item.
There are a few approaches - you could associate a token with each processor, and have a SPROC that sets that token against the next [n] available items; perhaps something like:
(note - needs suitable isolation-level; perhaps serializable: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE)
(edited to fix TSQL)
UPDATE TOP (1000) WORK
SET [Owner] = @processor, Expiry = @expiry
OUTPUT INSERTED.Id -- etc
WHERE [Owner] IS NULL
You'd also want a timeout (@expiry) on this, so that when a processor goes down you don't lose work. You'd also need a task to clear the owner on things that are past their Expiry.
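A sketch of that cleanup task, reusing the WORK table and columns from the snippet above:
-- release items whose owner let the claim expire
UPDATE WORK
SET [Owner] = NULL, Expiry = NULL
WHERE [Owner] IS NOT NULL
  AND Expiry < GETUTCDATE();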
You can have a special table to queue work up, where the consumers delete (or mark) work as being handled, or use a middleware queuing solution, like MSMQ or ActiveMQ.
Middleware comes with its own set of problems, so if possible I'd stick with a special table (keep it as small as possible, hopefully just with an id, so the workers can fetch the rest of the information from the main database themselves and not lock the queue table for too long).
You'd fill this table up at regular intervals and let processors grab what they need from the top.
Related questions on SQL table queues:
Queue using table
Working out the SQL to query a priority queue table
Related questions on queuing middleware:
Building a high performance and automatically backupped queue
Messaging platform
You didn't say which database server you're using, but there are a couple of options.
MySQL includes an extension to SQL99's UPDATE that limits the number of rows updated. You can assign each worker a unique token, update a number of rows, then query to get that worker's batch. Marc used the UPDATE TOP syntax, but didn't specify the database server.
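A hypothetical MySQL sketch of that token approach (the table and column names are assumptions):
-- tag the next 100 unclaimed rows with this worker's token
UPDATE work_items
SET owner_token = 'worker-42'
WHERE owner_token IS NULL
ORDER BY id
LIMIT 100;

-- then fetch this worker's batch
SELECT id, payload
FROM work_items
WHERE owner_token = 'worker-42';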
Another option is to designate a table used for locking. Don't use the same table as the data, since you don't want to lock it for reading. Your lock table likely only needs a single row holding the next ID needing work. A worker locks the table, gets the current ID, increments it by whatever your batch size is, updates the table, then releases the lock. Then it can go query the data table and pull the rows it reserved. This option assumes the data table has a monotonically increasing ID, and it isn't very fault-tolerant if a worker dies or otherwise can't finish a batch.
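And a sketch of that lock-table idea, again with assumed names; the worker keeps the value it read and queries the data table for its reserved range:
-- single-row table holding the next unassigned ID
START TRANSACTION;
SELECT next_id FROM work_cursor FOR UPDATE;       -- serialize reservations
UPDATE work_cursor SET next_id = next_id + 100;   -- reserve a batch of 100
COMMIT;
-- this worker now owns IDs [next_id, next_id + 100) and can read them
-- from the data table without holding any locks on it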
Quite similar to this question: SQL Server Process Queue Race Condition
You run a query to assign 100 rows to a given ProcessorID. If you use these locking hints then it's "safe" in the concurrency sense. And it's a single SQL statement with no SET statements needed.
This is taken from the other question:
UPDATE TOP (100)
    foo
SET
    ProcessorID = @PROCID
FROM
    OrderTable foo WITH (ROWLOCK, READPAST, UPDLOCK)
WHERE
    ProcessorID = 0 --Or whatever unassigned is