Postgres concurrency and serializability. Do I need a SERIALIZABLE isolation level? - sql

I have an Items and Jobs table:
Items
id = PK
job_id = Jobs FK
status = IN_PROGRESS | COMPLETE
Jobs
id = PK
Items start out as IN_PROGRESS; work is performed on them and then handed off to a worker to update. I have an updater process that updates Items with their new status as they come in. The approach I have been using so far is (in pseudocode):
def work(item: Item) = {
  insideTransaction {
    updateItemWithNewStatus(item)
    jobs, items = getParentJobAndAllItems(item)
    newJobStatus = computeParentJobStatus(jobs, items)
    // do some stuff depending on newJobStatus
  }
}
Does that make sense? I want this to work in a concurrent environment. The issue I have right now is that COMPLETE is arrived at multiple times for a job, when I only want to run the COMPLETE logic once.
If I change my transaction isolation level to SERIALIZABLE, I do get the "ERROR: could not serialize access due to read/write dependencies among transactions" error as described.
So my questions are:
Do I need SERIALIZABLE?
Can I get away with SELECT FOR UPDATE, and where?
Can someone explain to me what is happening, and why?
Edit: I have reopened this question because I was not satisfied with the previous answer's explanation. Is anyone able to explain this for me? Specifically, I want some example queries for that pseudocode.

You can use a SELECT FOR UPDATE on items and jobs and work on the affected rows in both tables within a single transaction. That should be enough to enforce the integrity of the whole operation without the overhead of SERIALIZABLE or a table lock.
I would suggest you create a function that is called after an insert or update is made on the items table, passing the PK of the item:
CREATE FUNCTION process_item(item_id integer) RETURNS void AS $$
DECLARE
  item items%ROWTYPE;
  job  jobs%ROWTYPE;
BEGIN  -- Runs inside the caller's transaction
  SELECT * INTO job FROM jobs
  WHERE id = (SELECT job_id FROM items WHERE id = item_id)
  FOR UPDATE;  -- Lock the job row against other sessions
  FOR item IN
    SELECT * FROM items WHERE job_id = job.id FOR UPDATE  -- Lock the job's item rows
  LOOP
    -- Work on items individually
    UPDATE items
    SET status = 'COMPLETE'
    WHERE id = item.id;
  END LOOP;
  -- Do any work on the job itself
END;  -- The locks are released when the caller's transaction ends
$$ LANGUAGE plpgsql;
If some other process is already working on the job or any of its associated items, execution will block until that other lock is released. This is different from SERIALIZABLE, which lets the transaction proceed until it fails at commit time, at which point you would have to redo all of the processing in a second attempt.
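For completeness, here is a minimal sketch of how a caller might use it; the row locks are held until the caller's transaction ends, so any follow-up logic that depends on the new job status should run before the COMMIT (the literal item id 42 is just a placeholder):
BEGIN;
SELECT process_item(42);  -- locks the parent job row and its items, updates their status
-- do some stuff depending on the new job status, while the locks are still held
COMMIT;  -- releases the locks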

If you want the jobs to be able to run concurrently, neither SERIALIZABLE nor SELECT FOR UPDATE will work directly.
If you lock the row using SELECT FOR UPDATE, then another process will simply block when it executes the SELECT FOR UPDATE until the first process commits the transaction.
If you use SERIALIZABLE, both processes can run concurrently (processing the same row), but at least one of them should be expected to fail when it COMMITs, since the database will detect the conflict. SERIALIZABLE transactions can also fail if they conflict with any other queries running in the database at the same time that touch related rows. The real reason to use SERIALIZABLE is to protect against concurrent database updates made by other jobs, as opposed to blocking the same job from executing twice.
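To make the trade-off concrete, here is a minimal sketch of running the question's logic under SERIALIZABLE, with the retry obligation it implies:
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- update the item's status, read the parent job and its items, compute the new job status...
COMMIT;  -- any statement, or the COMMIT itself, may fail with SQLSTATE 40001
         -- (serialization_failure); the application must then retry the whole transaction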
Note there are tricks to make SELECT FOR UPDATE skip locked rows. If you do that then you can have actual concurrency. See Select unlocked row in Postgresql.
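On PostgreSQL 9.5 and later this is built in as SKIP LOCKED; on older versions the linked question describes how to emulate it. A sketch against the question's items table:
-- Each worker locks a different batch of unfinished items; rows already
-- locked by another worker are skipped instead of waited on.
SELECT * FROM items
WHERE status = 'IN_PROGRESS'
LIMIT 10
FOR UPDATE SKIP LOCKED;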
Another approach I see more often is to change your "status" column to have a third, temporary state that is used while a job is being processed. Typically one would have states like 'PENDING', 'IN_PROGRESS', 'COMPLETE'. When your process searches for work to do, it finds a 'PENDING' job, immediately moves it to 'IN_PROGRESS' and commits the transaction, then carries on with the work and finally moves it to 'COMPLETE'. The disadvantage is that if the process dies while processing a job, it will be left in 'IN_PROGRESS' indefinitely.
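A sketch of that pattern against the question's items table, assuming the extra 'PENDING' state has been added and $1 is the id of the item the worker picked:
BEGIN;
-- Claim the item; the status check makes the claim safe under concurrency:
-- if another worker got there first, 0 rows are updated and we pick another item.
UPDATE items SET status = 'IN_PROGRESS'
WHERE id = $1 AND status = 'PENDING';
COMMIT;

-- ...do the actual work outside of any long-running transaction...

BEGIN;
UPDATE items SET status = 'COMPLETE' WHERE id = $1;
COMMIT;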

Related

Using database transactions to ensure only one process picks up a job from jobs table

I want to ensure that only one worker node (out of a cluster of nodes) picks up any given job from a Jobs table and processes it. I'm using the following database transaction on my Jobs table to achieve this:
BEGIN TRANSACTION;
SELECT id FROM jobs
WHERE status = 'NEW'
ORDER BY created_at
LIMIT 1;
-- rollback if id is null
UPDATE jobs
SET status = 'IN_PROCESS'
WHERE id = 'id';
COMMIT;
Will the above TRANSACTION ensure that only one node will pick up any given job? Is it possible that two nodes will both see the job as 'NEW' when they run the SELECT simultaneously, and then both run the UPDATE statement (after the first one releases the row lock) and start processing the same job?
In other words, will the TRANSACTION provide a lock for the SELECT statement as well, or only for the UPDATE statement?
No, transactions alone won't help you here unless you raise the isolation level to SERIALIZABLE. That is an option of last resort, and I would avoid it if possible.
The possibilities I see are:
Pessimistic Locks. Add FOR UPDATE to the SELECT. These limit performance.
Optimistic Locks. Seems like the best fit for your needs.
Use a predefined queue manager linked to your table.
Implement a queue, or a queue process.
Seems to me that #2 is the best fit for it. If that's the case, you'll need to add an extra column to the table: version. With that column added, your queries will need to change to:
SELECT id, version FROM jobs
WHERE status = 'NEW'
ORDER BY created_at
LIMIT 1;
UPDATE jobs
SET status = 'IN_PROCESS', version = version + 1
WHERE id = 'id_retrieved_before' and version = 'version_retrieved_before';
The update above returns the number of updated rows. If the count is 1, then this thread got the row. If it's 0, then another competing thread got the row and you'll need to start the strategy over with a new SELECT.
As you can see:
No transactions are necessary.
All engines return the number of updated rows, so this strategy will work well in virtually any database.
No locks are necessary. This offers great performance.
The downside is that if another thread got the row, the logic needs to start again from the beginning. No big deal, but it doesn't ensure optimal response time in the scenario when there are many competing threads.
Finally, if your database is PostgreSQL, you can pack both SQL statements into a single one. Isn't PostgreSQL awesome?
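The answer doesn't show the packed statement, but one way it might look is an UPDATE with a subquery plus RETURNING, which does the compare-and-set in a single round trip (a sketch, not necessarily what the author had in mind):
-- Claim the oldest NEW job and report which one we got, all in one statement.
-- The status re-check in the outer WHERE means that if a competing worker
-- already flipped the row, 0 rows come back and we simply try again.
UPDATE jobs
SET status = 'IN_PROCESS', version = version + 1
WHERE id = (SELECT id FROM jobs
            WHERE status = 'NEW'
            ORDER BY created_at
            LIMIT 1)
  AND status = 'NEW'
RETURNING id, version;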

How to establish read-only-once implement within SAP HANA?

Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.
High-level pseudo-code:
Collect request via db proc (query)
Call API with request
Store results of the request (response)
I have a table (A) that is the source of inputs to a process. Once a process has completed it will write results to another table (B).
Perhaps this is all solved if I just add a column to table A to avoid concurrent processors from selecting the same records from A?
I am wondering how to do this without adding the column to source table A.
What I have tried is a left outer join between tables A and B to get rows from A that have no corresponding rows (yet) in B. This doesn't work, or at least I haven't implemented it such that rows are processed only once across all of the processors.
I have a stored proc to handle batch selection:
/*
* getBatch.sql
*
* SYNOPSIS: Retrieve the next set of criteria to be used in a search
* request. Use left outer join between input source table
* and results table to determine the next set of inputs, and
* provide support so that concurrent processes may call this
* proc and get their inputs exclusively.
*/
alter procedure "ACOX"."getBatch" (
    in in_limit int
   ,in in_run_group_id varchar(36)
   ,out ot_result table (
        id bigint
       ,runGroupId varchar(36)
       ,sourceTableRefId integer
       ,name nvarchar(22)
       ,location nvarchar(13)
       ,regionCode nvarchar(3)
       ,countryCode nvarchar(3)
    )
) language sqlscript sql security definer as
begin
    -- insert new records:
    insert into "ACOX"."search_result_v4" (
        "RUN_GROUP_ID"
       ,"BEGIN_DATE_TS"
       ,"SOURCE_TABLE"
       ,"SOURCE_TABLE_REFID"
    )
    select
        in_run_group_id as "RUN_GROUP_ID"
       ,CURRENT_TIMESTAMP as "BEGIN_DATE_TS"
       ,'acox.searchCriteria' as "SOURCE_TABLE"
       ,fp.descriptor_id as "SOURCE_TABLE_REFID"
    from
        acox.searchCriteria fp
        left join "ACOX"."us_state_codes" st
            on trim(fp.region) = trim(st.usps)
        left outer join "ACOX"."search_result_v4" r
            on fp.descriptor_id = r.source_table_refid
    where
        st.usps is not null
        and r.BEGIN_DATE_TS is null
    limit :in_limit;

    -- select records inserted for return:
    ot_result =
        select
            r.ID id
           ,r.RUN_GROUP_ID runGroupId
           ,fp.descriptor_id sourceTableRefId
           ,fp.merch_name name
           ,fp.Location location
           ,st.usps regionCode
           ,'USA' countryCode
        from
            acox.searchCriteria fp
            left join "ACOX"."us_state_codes" st
                on trim(fp.region) = trim(st.usps)
            inner join "ACOX"."search_result_v4" r
                on fp.descriptor_id = r.source_table_refid
                and r.COMPLETE_DATE_TS is null
                and r.RUN_GROUP_ID = in_run_group_id
        where
            st.usps is not null
        limit :in_limit;
end;
When running 7 concurrent processors, I get a 35% overlap. That is to say that out of 5,000 input rows, the resulting row count is 6,755. Running time is about 7 mins.
Currently my solution includes adding a column to the source table. I wanted to avoid that, but it seems to make for a simpler implementation. I will update the code shortly, but it includes an update statement prior to the insert.
Useful references:
SAP HANA Concurrency Control
Exactly-Once Semantics Are Possible: Here’s How Kafka Does It
First off: there is no "read-only-once" in any RDBMS, including MS SQL.
Literally, this would mean that a given record can only be read once and would then "disappear" for all subsequent reads. (that's effectively what a queue does, or the well-known special-case of a queue: the pipe)
I assume that that is not what you are looking for.
Instead, I believe you want to implement a processing semantic analogous to "once-and-only-once" aka "exactly-once" message delivery. While this is impossible to achieve in potentially partitioned networks, it is possible within the transaction context of databases.
This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).
Sorry for the long pre-text, but any solution for this will depend on being clear on what we want to actually achieve. I will get to an approach for that now.
The major RDBMS have long figured out that blocking readers is generally a terrible idea if the goal is to enable high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup).
The main issue with the "exactly-once" processing requirement really is not the reading of the records, but the possibility of processing more than once or not at all.
Both of these potential issues can be addressed with the following approach:
SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even-odd-IDs, zip-code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock those records - neither for UPDATE, DELETE, nor for SELECT ... FOR UPDATE ... .
Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log-entries...
As the final step of the processing, you want to "mark" the records as processed. How exactly this is implemented, does not really matter.
One could create a processed-column in the table and set it to TRUE when records have been processed. Or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job-id to keep track of multiple load jobs).
In whatever way this is implemented, this is the point in time where the processed status needs to be captured.
COMMIT or ROLLBACK (in case something went wrong). This will COMMIT the records written to the target table, the processed-status information, and it will release the exclusive locks from the source table.
As you see, Step 1 takes care of the issue that records may be missed by selecting all wanted records that can be processed (i.e. they are not exclusively locked by any other process).
Step 3 takes care of the issue of records potentially being processed more than once by keeping track of the processed records. Obviously, this tracking has to be checked in Step 1 - both steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB transaction context, allowing for a guaranteed COMMIT or ROLLBACK across the whole transaction. That means that no "record marker" will ever be lost when the processing of the records was committed.
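A minimal, dialect-neutral sketch of those steps, assuming a hypothetical source table source_items with a processed flag column (the table and column names are illustrative, not from the question):
-- Step 1: select and exclusively lock a batch of unprocessed records
SELECT id FROM source_items
WHERE processed = 0
LIMIT 100
FOR UPDATE;

-- Step 2: do the processing (merge/insert/update target tables, write log entries, ...)

-- Step 3: capture the processed status for exactly the records locked in step 1
UPDATE source_items
SET processed = 1
WHERE id IN (/* the ids selected in step 1 */);

COMMIT;  -- releases the locks and makes the processed flags visible atomically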
Now, why is this approach preferable to making records "un-readable"?
Because of the other processes in the system.
Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.
Or maybe, somebody wants to do some analytics on the source data and also needs to read those records.
Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.
Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.

How to prevent database deadlocks in concurrent transactions?

The scenario:
transaction A starts...
START TRANSACTION;
UPDATE table_name SET column_name=column_name+1 WHERE id = 1 LIMIT 1;
At the same time, transaction B starts...
START TRANSACTION;
UPDATE table_name SET column_name=column_name+1 WHERE id = 2 LIMIT 1;
UPDATE table_name SET column_name=column_name-1 WHERE id = 1 LIMIT 1;
COMMIT;
Right now, transaction B is waiting for row 1, which is locked by transaction A.
And transaction A continues...
UPDATE table_name SET column_name=column_name-1 WHERE id = 2 LIMIT 1;
COMMIT;
And now we have a deadlock: both transactions are waiting for each other to unlock a row that they want to update :'(
As I asked in the title, how can we prevent deadlocks in RDBMS transactions?
I think the only way to fix this situation is to roll back transaction B and re-execute it. But how can we find out that we are in a deadlock and get out of it immediately, and how can we guarantee that we do not get stuck in an endless loop (on very heavy web applications, for example)?
If it is necessary: I'm using MySQL. But any solutions for other RDBMSes are welcome - to help other people coming here from Google :)
Most databases (if not all) will automatically detect a deadlock, pick one session to be the victim, and automatically roll back that session's transaction to break the deadlock. For example, here is the MySQL deadlock detection and rollback documentation.
Deadlocks are programming errors. One simple solution to avoiding deadlocks is to ensure that you always lock rows in a particular order. For example, if you have a transaction that wants to update two different rows, always update the row with the smaller id first and the larger id second. If your code always does that, you at least won't have row-level deadlocks. Beyond that, implement appropriate serialization for critical sections in your code. What, exactly, that entails is very dependent on your application.
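Applied to the scenario above, transaction B just needs to touch the rows in the same order as transaction A (smaller id first), and the deadlock disappears:
START TRANSACTION;
-- Same work as before, but row 1 is locked before row 2,
-- so B can never hold row 2 while waiting on row 1.
UPDATE table_name SET column_name=column_name-1 WHERE id = 1 LIMIT 1;
UPDATE table_name SET column_name=column_name+1 WHERE id = 2 LIMIT 1;
COMMIT;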

In Oracle, a way for Updates to not lock rows?

I have an Update query that recalculates -every- column value in exactly one row, each time.
I've been seeing more row-level lock contention, due to these Update queries occurring on the same row.
I'm thinking maybe one solution would be to have subsequent Updates simply preempt any Updates already in progress. Is this possible? Does Oracle support this kind of Update?
To spell out the idea in full:
update query #1 begins, in its own transaction
needs to update row X
acquires lock on row X
update query #2 begins, again in its own transaction
blocks, waiting for query #1 to release the lock on row X.
My thought is, can step 5 simply be: query #1 is aborted, query #2 proceeds. Or maybe dispense with acquiring the row-level lock in the first place.
I realize this logic would be disastrously wrong should the update query be updating only a subset of columns in a given row. But it's not -- every column gets recalculated, each time.
I'd ask whether a physical table is the right mechanism for whatever you are doing. One factor is how transactions need to be handled. Anything that means "don't lock for the duration of the transaction" will run into transactional issues.
There are a couple of non-transactional options:
Global context values might be useful (it depends on whether you are on RAC, and on how you handle persistence after a restart).
Another option is DBMS_PIPE where you'd have a background process maintaining that table and the separate sessions send messages to that process rather than update the table directly.
Queuing is another thought.
If you just need to reduce the time the record is locked, autonomous transactions could be the answer.
It's possible to do the opposite of what you're asking, have query 2 fail if query 1 is in progress using SELECT FOR UPDATE and NOWAIT.
Alternatively, you could try to see if you can get the desired effect by adjusting the isolation level, but I do not recommend this without extensive testing, as you don't know what knock-on effects it may have.
Oracle's UPDATE doesn't support any locking hints.
But the OraFAQ forum suggests this hacky workaround:
DECLARE
  x CHAR(1);
BEGIN
  -- Try to lock the row; NOWAIT makes this fail immediately (ORA-00054)
  -- instead of blocking if another session already holds the lock.
  SELECT 'x' INTO x
    FROM tablea
   WHERE ...            -- your update condition
     FOR UPDATE OF cola NOWAIT;

  UPDATE tablea
     SET cola = value
   WHERE ...;           -- your update condition
EXCEPTION
  WHEN OTHERS THEN
    NULL;               -- handle the exception (e.g. skip the row because it is locked)
END;

Incremented DB Field

Let's say that I have an article on a website, and I want to track the number of views to the article. In the Articles table, there's the PK ID - int, Name - nvarchar(50), and ViewCount - int. Every time the page is viewed, I increment the ViewCount field. I'm worried about collisions when updating the field. I can run it in a sproc with a transaction like:
CREATE PROCEDURE IncrementView
(
    @ArticleID int
)
as
BEGIN TRANSACTION

UPDATE Article set ViewCount = ViewCount + 1 where ID = @ArticleID

IF @@ERROR <> 0
BEGIN
    -- Rollback the transaction
    ROLLBACK
    -- Raise an error and return
    RAISERROR ('Error Incrementing', 16, 1)
    RETURN
END

COMMIT
My fear is that I'm going to end up with PageViews not being counted in this model. The other possible solution is a log type of model where I actually log views to the articles and use a combination of a function and view to grab data about number of views to an article.
Probably a better model is to cache the number of views hourly in the app somewhere, and then update them in a batch-style process.
-- Edit:
To elaborate more, a simple model for you may be:
On each page load, for the given page, increment a counter in a static hashmap. Also on each load, check whether enough time has elapsed since 'Last Update', and if so, perform an update.
Be tricky, and put the base value in the asp.net cache (http://msdn.microsoft.com/en-us/library/aa478965.aspx) and, when it times out, [implement the cache removal handler as described in the link] do the update. Set the timeout for an hour.
In both models, you'll have the static map of pages to counts; you'll update this each view, and you'll also use this - and the cached db amount - to get the current 'live' count.
The database should be able to handle a simple increment atomically. Queries waiting on the lock will be handled in order in the case where there might be a conflict. Your bigger issue, if there is enough volume, will be handling all of the writes to the same row. Each write will block the reads and writes behind it. If you are worried, I would create a simple program that issues the SQL update in a loop and run it with a few hundred concurrent threads (increase the thread count until your hardware is saturated). Make sure the number of attempts equals the final count.
Finding a mechanism to cache and/or perform batch updates as silky suggests sounds like a winner.
Jacob
You don't need to worry about concurrency within a single update statement in SQL Server.
But if you are worried about 2 users hitting a table in the same tenth of a second, keep in mind that there are 864,000 tenths of a second in a day. That doesn't sound like something that is going to be an issue for a page that serves up articles.
Have no fear!
This update is a single (atomic) transaction - you cannot get 'collisions'. Even if 5,000,000 calls to IncrementView all hit the database at the exact same moment, they will each be processed in a serial, queue-like fashion - that's what you are using a database engine for: consistency. Each call will gain an exclusive update lock on the row (at least), so no subsequent query can update the row until the current one has committed.
You don't even need to use BEGIN TRAN...COMMIT. If the update fails, there is nothing to rollback anyway.
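In other words, the whole increment can be a single, self-contained statement:
UPDATE Article SET ViewCount = ViewCount + 1 WHERE ID = @ArticleID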
I don't see the need for any app caching - there's no reason why this update would take a long time, and therefore it should have no impact on the performance of your app.
[Assuming it's relatively well designed!]