Get distinct sets of rows for an INSERT executed in concurrent transactions - sql

I am implementing a simple pessimistic locking mechanism using Postgres as a medium. The goal is that multiple instances of an application can simultaneously acquire locks on distinct sets of users.
The app instances are not trying to lock specific users. Instead they will take any user locks they can get.
Say, for example, we have three instances of the app running and there are currently 5 users that are not locked. All three instances attempt to acquire locks on up to three users at the same time. Their requests are served in arbitrary order.
Ideally, the first instance served would acquire 3 user locks, the second would acquire 2 and the third would acquire no locks.
So far I have not been able to write a query that accomplishes this. Here is my best attempt.
Here are the tables for the example:
CREATE TABLE my_locks (
id bigserial PRIMARY KEY,
user_id bigint NOT NULL UNIQUE
);
CREATE TABLE my_users (
id bigserial PRIMARY KEY,
user_name varchar(50) NOT NULL
);
And this is the query for acquiring locks:
INSERT INTO my_locks(user_id)
SELECT u.id
FROM my_users AS u
LEFT JOIN my_locks AS l ON u.id = l.user_id
WHERE l.user_id IS NULL
LIMIT 3
RETURNING *
I had hoped that folding the collection of lockable users and the insertion of the locks into a single query would ensure that multiple simultaneous requests are processed in their entirety, one after the other.
It doesn't work that way. If applied to the above example, where three instances use this query to simultaneously acquire locks on a pool of 5 users, one instance acquires three locks and the other instances receive an error for attempting to insert locks with non-unique user IDs.
This is not ideal, because it prevents the locking mechanism from scaling. There are a number of workarounds to deal with this, but what I am looking for is a database-level solution. Is there a way to tweak the query or the DB configuration such that multiple app instances can (near-)simultaneously acquire the maximum available number of locks in perfectly distinct sets?

The locking clause SKIP LOCKED should be perfect for you. Added with Postgres 9.5.
The manual:
With SKIP LOCKED, any selected rows that cannot be immediately locked are skipped.
FOR NO KEY UPDATE should be strong enough for your purpose. (Still allows other, non-exclusive locks.) And ideally, you take the weakest lock that's strong enough.
Work with just locks
If you can do your work while a transaction locking involved users stays open, then that's all you need:
BEGIN;
SELECT id FROM my_users
LIMIT 3
FOR NO KEY UPDATE SKIP LOCKED;
-- do some work on selected users here !!!
COMMIT;
Locks are gathered along the way and kept till the end of the current transaction. Since the order can be arbitrary, we don't even need ORDER BY. No waiting, and no deadlock is possible with SKIP LOCKED. Each transaction scans over the table and locks the first 3 rows still up for grabs. Very cheap and fast.
Since the transaction might stay open for a while, don't put anything else into the same transaction, so as not to block more than necessary.
Work with lock table additionally
If you can't do your work while a transaction locking involved users stays open, register users in that additional table my_locks.
Before work:
INSERT INTO my_locks(user_id)
SELECT id FROM my_users u
WHERE NOT EXISTS (
    SELECT FROM my_locks l
    WHERE l.user_id = u.id
)
LIMIT 3
FOR NO KEY UPDATE SKIP LOCKED
RETURNING *;
No explicit transaction wrapper needed.
Users in my_locks are excluded in addition to those currently locked exclusively. That works under concurrent load. While each transaction is open, its locks are active. Once those are released at the end of the transaction, the rows have already been written to the locks table - and are visible to other transactions at the same time.
There's a theoretical race condition for concurrent statements that don't see newly committed rows in the locks table just yet and grab the same users right after their locks have been released. But that attempt would fail when trying to write to the locks table. A UNIQUE constraint is absolute and will not allow duplicate entries, regardless of visibility.
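If you would rather have that rare collision skipped silently instead of raising a unique-violation error, one option (my addition, not part of the original answer) is ON CONFLICT DO NOTHING; RETURNING then only reports the rows that were actually inserted:
INSERT INTO my_locks(user_id)
SELECT id FROM my_users u
WHERE NOT EXISTS (
    SELECT FROM my_locks l
    WHERE l.user_id = u.id
)
LIMIT 3
FOR NO KEY UPDATE SKIP LOCKED
ON CONFLICT (user_id) DO NOTHING  -- silently skip users someone else just registered
RETURNING *;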
Users won't be eligible again until deleted from your locks table.
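Releasing a user once the work is done is just a delete from the lock table. A minimal sketch, with example IDs standing in for whatever user_id values this instance had acquired:
DELETE FROM my_locks
WHERE user_id IN (101, 102, 103);  -- the user IDs this instance had locked (example values)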
Further reading:
Postgres UPDATE ... LIMIT 1
Select rows which are not present in other table
Aside:
... multiple simultaneous requests would be processed in their entirety one after the other.
It doesn't work that way.
To understand how it actually works, read about the Multiversion Concurrency Control (MVCC) of Postgres in the manual, starting here.

Related

Postgres design solution to user limiting in rooms (race condition)

I have a chatting application with rooms in which users can freely join/leave any time they want. Rooms are limited to 8 users at a single time.
For simplification purposes, I have this relationship right now:
User -> (Many to one) -> Room
I'm checking if the room is full by querying
SELECT COUNT(*) FROM users WHERE room_id = x
before inserting the user, which works in normal cases.
However, if everybody joins at the same time, it runs into a race condition and the limit is bypassed. How should I tackle this issue? Is Postgres suitable for this kind of operation?
While not wishing to be unkind, the previous answer is far from optimal.
While you can do this...
LOCK TABLE users IN ACCESS EXCLUSIVE MODE;
This is clearly quite heavy-handed and prevents all updates to the users table, whether related to room changes or not.
A lighter approach would be instead to lock just the data you care about.
-- Move user 456 from room 122 to room 123
BEGIN;
SELECT true FROM rooms WHERE id = 123 FOR UPDATE;
SELECT true FROM users WHERE id = 456 AND room_id = 122 FOR UPDATE;
-- If either of the above failed to return a row, the starting condition of your database is not what you thought. Rollback
SELECT count(*) FROM users WHERE room_id = 123;
-- check count looks ok
UPDATE users SET room_id = 123 WHERE id = 456;
COMMIT;
This will lock two crucial items:
The new room being moved to.
The user being moved between rooms.
Since you don't care about people being moved out of rooms, this should be enough.
If two separate connections try to move different people into the same room simultaneously, one will be forced to wait until the other transaction commits or rolls back and releases its locks. If two separate connections try to update the same person, the same will happen.
As far as the other answer is concerned:
Of course during the period the table is locked, if some thread in your application is trying to write users, it will get back an error.
No, it won't. Not unless you hold the lock for long enough to time out or try to take your own lock and tell it not to wait.
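For completeness, a sketch of the "tell it not to wait" variant in Postgres, building on the room-move example above; NOWAIT makes the statement raise an error immediately instead of blocking:
BEGIN;
-- raises an error right away if another transaction already holds a lock on this room row
SELECT true FROM rooms WHERE id = 123 FOR UPDATE NOWAIT;
-- ... proceed with the count check and the UPDATE as above ...
COMMIT;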
I think that you should probably synchronise the access and the writing from your calling application
Relational databases have been expressly designed to handle concurrent access from multiple clients. It is one of the things they are inarguably good at. If you are already using a RDBMS and implementing your own concurrency control then you are either in an exceptional situation or aren't using your RDBMS properly.

How to implement read-only-once within SAP HANA?

Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.
High-level pseudo-code:
Collect request via db proc (query)
Call API with request
Store results of the request (response)
I have a table (A) that is the source of inputs to a process. Once a process has completed it will write results to another table (B).
Perhaps this is all solved if I just add a column to table A to prevent concurrent processors from selecting the same records from A?
I am wondering how to do this without adding the column to source table A.
What I have tried is a left outer join between tables A and B to get rows from A that have no corresponding rows (yet) in B. This doesn't work, or at least I haven't implemented it in such a way that rows are processed only once across the processors.
I have a stored proc to handle batch selection:
/*
* getBatch.sql
*
* SYNOPSIS: Retrieve the next set of criteria to be used in a search
* request. Use left outer join between input source table
* and results table to determine the next set of inputs, and
* provide support so that concurrent processes may call this
* proc and get their inputs exclusively.
*/
alter procedure "ACOX"."getBatch" (
    in in_limit int
    ,in in_run_group_id varchar(36)
    ,out ot_result table (
        id bigint
        ,runGroupId varchar(36)
        ,sourceTableRefId integer
        ,name nvarchar(22)
        ,location nvarchar(13)
        ,regionCode nvarchar(3)
        ,countryCode nvarchar(3)
    )
) language sqlscript sql security definer as
begin
    -- insert new records:
    insert into "ACOX"."search_result_v4" (
        "RUN_GROUP_ID"
        ,"BEGIN_DATE_TS"
        ,"SOURCE_TABLE"
        ,"SOURCE_TABLE_REFID"
    )
    select
        in_run_group_id as "RUN_GROUP_ID"
        ,CURRENT_TIMESTAMP as "BEGIN_DATE_TS"
        ,'acox.searchCriteria' as "SOURCE_TABLE"
        ,fp.descriptor_id as "SOURCE_TABLE_REFID"
    from
        acox.searchCriteria fp
        left join "ACOX"."us_state_codes" st
            on trim(fp.region) = trim(st.usps)
        left outer join "ACOX"."search_result_v4" r
            on fp.descriptor_id = r.source_table_refid
    where
        st.usps is not null
        and r.BEGIN_DATE_TS is null
    limit :in_limit;

    -- select records inserted for return:
    ot_result =
        select
            r.ID id
            ,r.RUN_GROUP_ID runGroupId
            ,fp.descriptor_id sourceTableRefId
            ,fp.merch_name name
            ,fp.Location location
            ,st.usps regionCode
            ,'USA' countryCode
        from
            acox.searchCriteria fp
            left join "ACOX"."us_state_codes" st
                on trim(fp.region) = trim(st.usps)
            inner join "ACOX"."search_result_v4" r
                on fp.descriptor_id = r.source_table_refid
                and r.COMPLETE_DATE_TS is null
                and r.RUN_GROUP_ID = in_run_group_id
        where
            st.usps is not null
        limit :in_limit;
end;
When running 7 concurrent processors, I get a 35% overlap. That is to say that out of 5,000 input rows, the resulting row count is 6,755. Running time is about 7 mins.
Currently my solution includes adding a column to the source table. I wanted to avoid that, but it seems to make for a simpler implementation. I will update the code shortly, but it includes an update statement prior to the insert.
Useful references:
SAP HANA Concurrency Control
Exactly-Once Semantics Are Possible: Here’s How Kafka Does It
First off: there is no "read-only-once" in any RDBMS, including MS SQL.
Literally, this would mean that a given record can only be read once and would then "disappear" for all subsequent reads. (that's effectively what a queue does, or the well-known special-case of a queue: the pipe)
I assume that that is not what you are looking for.
Instead, I believe you want to implement a processing-semantic analogous to "once-and-only-once" aka "exactly-once" message delivery. While this is impossible to achieve in potentially partitioned networks it is possible within the transaction context of databases.
This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).
Sorry for the long pre-text, but any solution for this will depend on being clear on what we want to actually achieve. I will get to an approach for that now.
The major RDBMS have long figured out that blocking readers is generally a terrible idea if the goal is to enable high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup).
The main issue with the "exactly-once" processing requirement really is not the reading of the records, but the possibility of processing more than once or not at all.
Both of these potential issues can be addressed with the following approach:
1. SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even-odd-IDs, zip-code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock those records - neither for UPDATE, DELETE, nor for SELECT ... FOR UPDATE ... .
2. Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log-entries...
3. As the final step of the processing, you want to "mark" the records as processed. How exactly this is implemented does not really matter. One could create a processed-column in the table and set it to TRUE when records have been processed. Or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job-id to keep track of multiple load jobs). In whatever way this is implemented, this is the point in time where the processed status needs to be captured.
4. COMMIT or ROLLBACK (in case something went wrong). This will COMMIT the records written to the target table, the processed-status information, and it will release the exclusive locks from the source table.
As you see, Step 1 takes care of the issue that records may be missed by selecting all wanted records that can be processed (i.e. they are not exclusively locked by any other process).
Step 3 takes care of the issue of records potentially being processed more than once by keeping track of the processed records. Obviously, this tracking has to be checked in Step 1 - both steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB-transaction context, allowing for guaranteed COMMIT or ROLLBACK across the whole transaction. That means that no "record marker" will ever be lost when the processing of the records was committed.
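Putting the four steps together, here is a minimal sketch; it assumes a BOOLEAN processed flag column on the source table (which the original question wanted to avoid) and a hard-coded ID list standing in for the IDs returned by Step 1:
-- Step 1: claim up to 100 unprocessed records and lock them exclusively
SELECT descriptor_id
FROM   acox.searchCriteria
WHERE  processed = FALSE
LIMIT  100
FOR UPDATE;

-- Step 2: do the actual work in the same transaction
--         (call the API, write to the results table, ...)

-- Step 3: mark the claimed records as processed
UPDATE acox.searchCriteria
SET    processed = TRUE
WHERE  descriptor_id IN (1001, 1002, 1003);  -- stand-in for the IDs returned by Step 1

-- Step 4: commit the writes and release the locks taken in Step 1
COMMIT;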
Now, why is this approach preferable to making records "un-readable"?
Because of the other processes in the system.
Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.
Or maybe, somebody wants to do some analytics on the source data and also needs to read those records.
Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.
Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.

Is it possible to lock on a value of a column in SQL Server?

I have a table that looks like that:
Id  GroupId
1   G1
2   G1
3   G2
4   G2
5   G2
It should at any time be possible to read all of the rows (committed only). When there is an update, I want to have a transaction that locks on the group id, i.e. at any given time there should be only one transaction attempting an update per GroupId.
It should ideally still be possible to read all committed rows (i.e. other transactions/ordinary reads that do not try to acquire the "update per group" lock should still be able to read).
The reason I want to do this is that an update cannot rely on "outdated" data. I.e. I make some calculations in a transaction, and another transaction must not edit the row with id 1 or add a new row with the same GroupId after these rows were read by the first transaction (even though the first transaction would never modify the row itself, it will be dependent on its value).
Another "nice to have" requirement is that sometimes I would need the same requirement "cross group", i.e. the update transaction would have to lock 2 groups at the same time. (This is not a dynamic number of groups, but rather just 2)
Here are some ideas. I don't think any of them are perfect - I think you will need to give yourself a set of use-cases and try them. Some of the situations I tried after applying locks:
SELECTs with the WHERE filter as another group
SELECTs with the WHERE filter as the locked group
UPDATES on the table with the WHERE clause as another group
UPDATEs on the table where ID (not GrpID!) was not locked
UPDATEs on the table where the row was locked (e.g., IDs 1 and 2)
INSERTs into the table with that GrpId
I have the funny feeling that none of these will be 100%, but the most likely answer is the second one (setting the transaction isolation level). It will probably lock more than desired, but will give you the isolation you need.
Also one thing to remember: if you lock many rows (e.g., there are thousands of rows with the GrpId you want) then SQL Server can escalate the lock to be a full-table lock. (I believe the tipping point is 5000 locks, but not sure).
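As an aside: if escalation to a table lock does become a problem, SQL Server lets you switch it off per table, at the cost of holding many more individual locks:
-- stop SQL Server from escalating row/page locks on this table to a full-table lock
ALTER TABLE YourTable SET (LOCK_ESCALATION = DISABLE);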
Old-school hackjob
At the start of your transaction, update all the relevant rows somehow e.g.,
BEGIN TRAN
UPDATE YourTable
SET GrpId = GrpId
WHERE GrpId = N'G1';
-- Do other stuff
COMMIT TRAN;
Nothing else can use them because (bravo!) they are a write within a transaction.
Convenient - set isolation level
See https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-transaction-locking-and-row-versioning-guide?view=sql-server-ver15#isolation-levels-in-the-
Before your transaction, set the isolation level high e.g., SERIALIZABLE.
You may want to read all the relevant rows at the start of your transaction (e.g., SELECT Grp FROM YourTable WHERE Grp = N'Grp1') to lock them from being updated.
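A minimal sketch of that approach, reusing the table and group names from the example above; how much SERIALIZABLE over-locks for your workload is something to test:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRAN;
    -- under SERIALIZABLE the locks taken by this read are held until COMMIT,
    -- so no other transaction can modify or insert G1 rows in the meantime
    SELECT Id FROM YourTable WHERE GrpId = N'G1';
    -- ... do the calculations and updates that depend on those rows ...
COMMIT TRAN;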
Flexible but requires a lot of coding
Use resource locking with sp_getapplock and sp_releaseapplock.
These are used to lock resources, not tables or rows.
What is a resource? Well, anything you want it to be. In this case, I'd suggest 'Grp1', 'Grp2' etc. It doesn't actually lock rows. Instead, you ask (via sp_getapplock, or APPLOCK_TEST) whether you can get the resource lock. If so, continue. If not, then stop.
Any code referring to these tables needs to be reviewed and potentially modified to ask whether it's allowed to run or not. If something doesn't ask for permission and just does it, there are no actual locks stopping it (except via any transactions you've explicitly specified).
You also need to ensure that errors are handled appropriately (e.g., still releasing the app_lock) and that processes that are blocked are re-tried.
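A sketch of that approach with a hypothetical resource name per group; sp_getapplock returns 0 or higher when the lock was granted, and a transaction-owned lock is released automatically at COMMIT/ROLLBACK:
BEGIN TRAN;

DECLARE @rc int;
-- ask for an exclusive application lock on the resource 'G1'; give up immediately if it is taken
EXEC @rc = sp_getapplock @Resource = N'G1',
                         @LockMode = 'Exclusive',
                         @LockOwner = 'Transaction',
                         @LockTimeout = 0;

IF @rc >= 0
BEGIN
    -- we "own" group G1: do the reads, calculations and updates for that group here
    COMMIT TRAN;    -- releases the applock together with the transaction
END
ELSE
BEGIN
    ROLLBACK TRAN;  -- someone else holds G1; retry later
END
For the "lock 2 groups at the same time" requirement, take two such applocks at the start of the transaction, always in the same order (e.g. alphabetically by group name), to avoid deadlocking against another session doing the reverse.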

How to prevent table locking with multiple users writing and reading from table at the same time?

I have a program I wrote for a group that is constantly writing, reading and deleting per session per user to our SQL server. I do not need to worry about what is being written or deleted as all the data being written/deleted by an individual will never be needed by another. Each users writes are separated by a unique ID and all queries are based on that unique ID.
I want to be able to write/delete rows from the same table by multiple users at the same time. I know I can set up the session to be able to read while data is being written using SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED.
The question:
That said, can I have multiple users write/delete from the same table at the same time?
Currently my only idea, if this is not possible, is to set up the tool to write to temp tables per user session. But I don't think constantly creating and deleting temp tables hundreds of times a day is an efficient option.
Yes, you can make this multi-tenant approach work fine.
Ensure the leading column of all indexes is UserId, so a query for one user never needs to scan rows belonging to a different user (a sketch of this and the snapshot-isolation point follows after this list).
Ensure all queries have an equality predicate on UserId and verify execution plans to ensure that they are seeking on it.
Ensure no use of serializable isolation level as this can take range locks affecting adjacent users.
Ensure that row locking is not disabled on any index, and restrict DML operations to <= 5,000 rows (to prevent lock escalation)
Consider using read committed snapshot isolation for your reading queries
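A minimal sketch of the index and snapshot-isolation points; the table, column and database names are assumptions for illustration:
-- leading UserId column so each user's queries seek straight to their own rows
CREATE INDEX IX_SessionData_UserId
    ON dbo.SessionData (UserId, CreatedAt);

-- readers get the last committed version instead of blocking behind writers
ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;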

postgresql: delete all locks

question: there is a table with over 9000 rows. It must be cleaned, but without any locks (the table is in active use). I tried to use pg_advisory_unlock_all, but it had no effect.
select pg_advisory_unlock_all();
start transaction();
delete from table where id='1';
DELETE 1
start transaction();
delete from table where id='1';
(waiting for finish first transaction)
There is no way to delete data from a table without locking the rows you want to delete.
That shouldn't be a problem as long as concurrent access doesn't try to modify these rows or insert new ones with id = '1', because writers never block readers and vice versa in PostgreSQL.
If concurrent transactions keep modifying the rows you want to delete, that's a little funny (why would you want to delete data you need?). You'd have to wait for the exclusive locks, and you might well run into deadlocks. In that case, it might be best to lock the whole table with the LOCK statement before you start. Deleting from a table that small should then only take a very short time.
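A minimal sketch of that last suggestion; the table name is a placeholder, and EXCLUSIVE mode is chosen as an assumption because it keeps other writers out while still allowing plain reads:
BEGIN;
-- blocks concurrent writers for the duration of the transaction; readers are unaffected
LOCK TABLE my_table IN EXCLUSIVE MODE;
DELETE FROM my_table WHERE id = '1';
COMMIT;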