Have "select for update" block on nonrexisting rows - sql

We have some persistent data in an application that is queried from a server and then stored in a database so we can keep track of additional information. Because we do not want to query the server again while an object is in use in memory, we do a SELECT FOR UPDATE so that other threads that want to get the same data are blocked.
I am not sure how SELECT FOR UPDATE handles non-existing rows. If the row does not exist and another thread does another SELECT FOR UPDATE on the same row, will that thread be blocked until the first transaction finishes, or will it also just get an empty result set? If it only gets an empty result set, is there any way to make it block as well, for example by inserting the missing row immediately?
EDIT:
Because there was a remark that we might be locking too much, here are some more details on the concrete usage in our case. In reduced pseudocode, our program flow looks like this:
d = queue.fetch();
r = SELECT * FROM table WHERE key = d.key() FOR UPDATE;
if r.empty() then
    r = get_data_from_somewhere_else();
new_r = process_stuff(r);
if data was present then
    update row to new_r
else
    insert new_r
This code is run in multiple threads, and the data fetched from the queue may concern the same row in the database (hence the lock). If multiple threads work on data that needs the same row, those threads have to be serialized (the order does not matter). However, this serialization fails if the row is not present, because then we do not get a lock.
EDIT:
For now I have the following solution, which seems like an ugly hack to me.
select the data for update
if zero rows match then
    insert some dummy data // this will block if multiple transactions try to insert
    if insertion failed then
        // somebody beat us to it in the race
        select the data for update
do processing
if data was changed then
    update the old or dummy data
else
    rollback the whole transaction
However, I am not 100% sure that this actually solves the problem, and it does not seem like good style either. So if anybody has something more usable to offer, that would be great.

I am not sure how select for update handles non-existing rows.
It doesn't.
The best you can do is to use an advisory lock if you know something unique about the new row. (Use hashtext() if needed, and the table's oid to lock it.)
The next best thing is a table lock.
That being said, your question makes it sound like you're locking way more than you should. Only lock rows when you actually need to, i.e. for write operations.
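For the advisory-lock route, something along these lines could work in PostgreSQL (table, column and key value here are placeholders; the two-integer form of pg_advisory_xact_lock is keyed on the table's oid plus hashtext() of the natural key):
BEGIN;
-- serialize all workers that deal with this (possibly not yet existing) key
SELECT pg_advisory_xact_lock('mytable'::regclass::oid::int, hashtext('the-key'));
SELECT * FROM mytable WHERE key = 'the-key' FOR UPDATE;  -- may return zero rows
-- insert or update here; concurrent transactions on the same key wait at the advisory lock
COMMIT;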

Example solution (I haven't found a better one :/)
Thread A:
BEGIN;
SELECT pg_advisory_xact_lock(42); -- database semaphore arbitrary ID
SELECT * FROM t WHERE id = 1;
DELETE FROM t WHERE id = 1;
INSERT INTO t (id, value) VALUES (1, 'thread A');
SELECT 1 FROM pg_sleep(10); -- only for race condition simulation
COMMIT;
Thread B:
BEGIN;
SELECT pg_advisory_xact_lock(42); -- database semaphore arbitrary ID
SELECT * FROM t WHERE id = 1;
DELETE FROM t WHERE id = 1;
INSERT INTO t (id, value) VALUES (1, 'thread B');
SELECT 1 FROM pg_sleep(10); -- only for race condition simulation
COMMIT;
This always enforces a correct ordering of the transactions.

Looking at the code added in the second edit, it looks right.
As for it looking like a hack, there are a couple of options; basically it's all about moving the database logic into the database.
One is simply to put the whole "select for update, insert if it does not exist" logic in a function, and do select get_object(key1,key2,etc) instead.
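A rough sketch of such a function in PL/pgSQL, assuming a table my_table with a unique key column whose other columns are nullable or have defaults (all names here are placeholders):
CREATE OR REPLACE FUNCTION get_object(p_key text)
RETURNS my_table
LANGUAGE plpgsql AS $$
DECLARE
    result my_table;
BEGIN
    SELECT * INTO result FROM my_table WHERE key = p_key FOR UPDATE;
    IF NOT FOUND THEN
        BEGIN
            -- insert a placeholder row; the new row is locked by our transaction
            INSERT INTO my_table (key) VALUES (p_key) RETURNING * INTO result;
        EXCEPTION WHEN unique_violation THEN
            -- somebody else inserted it first; lock their row instead
            SELECT * INTO result FROM my_table WHERE key = p_key FOR UPDATE;
        END;
    END IF;
    RETURN result;
END;
$$;
Callers then just do SELECT * FROM get_object('the-key'); inside their transaction and always end up holding a row lock.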
Alternatively, you could make an insert trigger that will ignore attempts to add an entry if it already exists, and simply do an insert before you do the select for update. This does have more potential to interfere with other code already in place, though.
(If I remember to, I'll edit and add example code later on when I'm in a position to check what I'm doing.)

Related

Simultaneous updates to SQL table

I have a table that needs to be updated based on values within the table. The whole table will be updated, and I want to ensure that SQL Server handles the locking for me. I know I can use WITH(LOCKTYPE), but my update will be something like:
SELECT MyTable.ID, MyTable.Cost
INTO #tempTable
FROM MyTable
-- More complex calculations here
UPDATE MyTable
SET MyTable.Cost = #tempTable.Cost + 1
FROM #tempTable
WHERE MyTable.ID = #tempTable.ID
I wanted to be sure that I could lock the entire table so that if another request comes in to update the table it has to wait, but ideally, if a read request comes in it is processed. Is that possible?
Edit
Process:
1. Read from Table X
2. Perform calculations
3. Write to temp table Y
4. Average values between X and Y
5. Write back to X
Within the SP, once step 1 has happened, I don't want any other calls to the SP to be processed; I want them to sit in a queue. Because each update is dependent on the values already in the table, I cannot allow another call to the SP to read from the table.
In general, the SP needs to run serially; it should not be allowed to run twice at the same time. I don't mind if the call goes through and the SELECT is queued, but once a SELECT happens, another SELECT (within the SP) cannot be allowed to happen in another thread.
You can use a SQL transaction: http://msdn.microsoft.com/en-us/library/ms188929.aspx#fbid=-CQjt8-slkk
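If the whole procedure really has to run one call at a time, one way to do that inside a transaction is an application lock (sp_getapplock), which acts like a named mutex and does not block ordinary readers of the table. A rough sketch (procedure, table and column names are assumptions, not from the original post):
CREATE PROCEDURE dbo.RecalculateCosts
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    -- Wait here until no other call of this procedure holds the lock.
    EXEC sp_getapplock @Resource = 'RecalculateCosts',
                       @LockMode = 'Exclusive',
                       @LockOwner = 'Transaction';

    SELECT MyTable.ID, MyTable.Cost
    INTO #tempTable
    FROM MyTable;

    -- More complex calculations here

    UPDATE m
    SET m.Cost = t.Cost + 1
    FROM MyTable AS m
    JOIN #tempTable AS t ON t.ID = m.ID;

    COMMIT TRANSACTION;   -- releases the application lock
END;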

SELECT COUNT(SomeId) with INSERT later to same SomeId: Appropriate locking strategy?

I am using SQL Server 2012. I have a repeatable read transaction where I perform this query:
select count(SomeId)
from dbo.MyTable
where SomeId = @SomeId
SomeId is a column whose value may repeat in the table (think foreign key). However, SomeId is not a member of any index nor is it a foreign key.
Later in the transaction, I insert a record into dbo.MyTable with the same @SomeId, thus changing what the select count(*) would return were I to run it again:
insert into dbo.MyTable (SomeId, ...)
values (@SomeId, ...)
Several threads in my application can execute this transaction at the same time. Because of this, I'm getting deadlocks on the insert statement. At first, I thought an updlock would be appropriate on the select statement, but I quickly realized that it wouldn't work because I'm not actually updating the rows selected by the select count(SomeId).
My question is this: is there any way to avoid a potentially expensive table lock? Is there any way to lock just rows that involve SomeId, even if they haven't been inserted yet (strange, I know)? I want to force other threads to wait while the original transaction completes its work but I don't want to lock rows unnecessarily.
EDIT
Here's what I'm trying to accomplish:
I only want to insert up to eight rows for a particular SomeId. There are several unrelated processes that can start one of these transactions, potentially at the same time. The select count detects whether there are already eight rows and causes the operation to fail for that process. If the count is less than eight, that same transaction performs additional work, then inserts a record at the end, thus effectively incrementing the count were the select count to be run again. I am hitting the deadlock on the insert statement.
If you have several processes that try to do the same thing and you don't want to have more records than some number, you will need to actually prevent those processes from running at the same time.
One way would be to read the counts with exclusive lock:
select count(SomeId)
from dbo.MyTable with (xlock)
where SomeId = @SomeId
This way those records will be blocked until the transaction completes.
You should create an index on the SomeId column, though, as the locks will then most likely be held at the index level.
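For example (the index name is just a placeholder):
CREATE NONCLUSTERED INDEX IX_MyTable_SomeId
    ON dbo.MyTable (SomeId);
With the index in place, the XLOCK is limited to the matching rows rather than whatever a full scan touches.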

Oracle MERGE deadlock

I want to insert rows with a MERGE statement in a specified order to avoid deadlocks. Deadlocks could otherwise happen because multiple transactions will call this statement with overlapping sets of keys. Note that this code is also sensitive to duplicate value exceptions, but I handle that by retrying, so that is not my question. I was doing the following:
MERGE INTO targetTable
USING (
    SELECT ...
    FROM sourceCollection
    ORDER BY <desiredUpdateOrder>
)
WHEN MATCHED THEN
    UPDATE ...
WHEN NOT MATCHED THEN
    INSERT ...
Now I'm still getting the deadlock, so I'm becoming unsure whether Oracle maintains the order of the sub-query. Does anyone know how to best make sure that Oracle locks the rows in targetTable in the same order in this case? Do I have to do a SELECT FOR UPDATE before the merge? In which order does the SELECT FOR UPDATE lock the rows? The Oracle UPDATE statement has an ORDER BY clause that MERGE seems to be missing. Is there another way to avoid deadlocks other than locking the rows in the same order every time?
[Edit]
This query is used to maintain a count of how often a certain action has taken place. When the action happens the first time a row is inserted, when it happens a second time the "count" column is incremented. There are millions of different actions and they happen very often. A table lock wouldn't work.
Controlling the order in which the target table rows are modified requires that you control the query execution plan of the USING subquery. That's a tricky business, and depends on what sort of execution plans your query is likely to be getting.
If you're getting deadlocks, then I'd guess that you're getting a nested loop join from the source collection to the target table. A hash join would probably be based on hashing the source collection and would modify the target table roughly in target-table rowid order, because the target would be full scanned; in any case, the access order would be consistent across all of the query executions.
Likewise, if there was a sort-merge between the two data sets you'd get consistency in the order in which target table rows are accessed.
Ordering of the source collection seems to be desirable, but the optimiser might not be applying it, so check the execution plan. If it is not, then try inserting your data into a global temporary table using APPEND and with an ORDER BY clause, then selecting from there without an ORDER BY clause, and explore the use of hints to force a nested loop join.
I don't believe the ORDER BY will affect anything (though I'm more than willing to be proven wrong); I think MERGE will lock everything it needs to.
Assume I'm completely wrong, assume that you get row-by-row locks with MERGE. Your problem still isn't solved as you have no guarantees that your two MERGE statements won't hit the same row simultaneously. In fact, from the information given, you have no guarantees that an ORDER BY improves the situation; it might make it worse.
Despite there being no "skip locked rows" syntax as there is with UPDATE, there is still a simple answer: stop trying to update the same row from within different transactions. If feasible, you can use some form of parallel execution, for instance the DBMS_PARALLEL_EXECUTE subprogram CREATE_CHUNKS_BY_ROWID, and ensure that your transactions only work on a specific sub-set of the rows in the table.
As an aside I'm a little worried by your description of the problem. You say there's some duplicate erroring that you fix by rerunning the MERGE. If the data in these duplicates is different you need to ensure that the ORDER BY is done not only on the data to be merged but the data being merged into. If you don't then there's no guarantee that you don't overwrite the correct data with older, incorrect, data.
First, locks are not really managed at the row level but at the block level, so you may encounter an ORA-00060 error even without modifying the same row. This can be tricky, and managing it is the developer's job.
One possible workaround is to index-organize (cluster) your table (never do that on huge tables or tables with heavy change rates):
https://use-the-index-luke.com/sql/clustering/index-organized-clustered-index
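For example, a small counter table kept index-organized might look like this (names here are only illustrative):
CREATE TABLE action_counts (
    action_key VARCHAR2(100) PRIMARY KEY,
    cnt        NUMBER        NOT NULL
) ORGANIZATION INDEX;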
Rather than doing a MERGE, I suggest that you try to lock the row: if successful, update it; if not, insert a new row. By default the lock will wait if another process has a lock on the same thing.
CREATE TABLE brianl.deleteme_table
(
    id  INTEGER PRIMARY KEY,
    cnt INTEGER NOT NULL
);

CREATE OR REPLACE PROCEDURE brianl.deleteme_table_proc (
    p_id IN deleteme_table.id%TYPE)
    AUTHID DEFINER
AS
    l_id deleteme_table.id%TYPE;
    -- This isolates this procedure so that it doesn't commit
    -- anything outside of the procedure.
    PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
    -- Select the row for update;
    -- this will pause if someone already has the row locked.
    SELECT id
      INTO l_id
      FROM deleteme_table
     WHERE id = p_id
       FOR UPDATE;

    -- The row exists and is now locked: update it.
    UPDATE deleteme_table
       SET cnt = cnt + 1
     WHERE id = p_id;

    COMMIT;
EXCEPTION
    WHEN NO_DATA_FOUND
    THEN
        -- No matching row to lock; insert a new one instead.
        INSERT INTO deleteme_table (id, cnt)
             VALUES (p_id, 1);

        COMMIT;
END deleteme_table_proc;
/

CREATE OR REPLACE PROCEDURE brianl.deleteme_proc_test
    AUTHID CURRENT_USER
AS
BEGIN
    -- This resets the table to empty for the test.
    EXECUTE IMMEDIATE 'TRUNCATE TABLE brianl.deleteme_table';

    brianl.deleteme_table_proc (p_id => 1);
    brianl.deleteme_table_proc (p_id => 2);
    brianl.deleteme_table_proc (p_id => 3);
    brianl.deleteme_table_proc (p_id => 2);

    FOR eachrec IN (SELECT id, cnt
                      FROM brianl.deleteme_table
                     ORDER BY id)
    LOOP
        DBMS_OUTPUT.put_line (
            a => 'id: ' || eachrec.id || ', cnt: ' || eachrec.cnt);
    END LOOP;
END;
/

BEGIN
    -- Runs the test.
    brianl.deleteme_proc_test;
END;
/

Using a database table as a queue

I want to use a database table as a queue. I want to insert into it and take elements from it in insertion order (FIFO). My main consideration is performance, because I have thousands of these transactions each second. So I want to use a SQL query that gives me the first element without searching the whole table. I do not remove a row when I read it.
Does SELECT TOP 1 ..... help here?
Should I use any special indexes?
I'd use an IDENTITY field as the primary key to provide the uniquely incrementing ID for each queued item, and stick a clustered index on it. This would represent the order in which the items were queued.
To keep the items in the queue table while you process them, you'd need a "status" field to indicate the current status of a particular item (e.g. 0=waiting, 1=being processed, 2=processed). This is needed to prevent an item being processed twice.
When processing items in the queue, you'd need to find the next item in the table NOT currently being processed. This would need to be done in such a way as to prevent multiple processes picking up the same item to process at the same time, as demonstrated below. Note the table hints UPDLOCK and READPAST, which you should be aware of when implementing queues.
e.g. within a sproc, something like this:
DECLARE @NextID INTEGER

BEGIN TRANSACTION

-- Find the next queued item that is waiting to be processed
SELECT TOP 1 @NextID = ID
FROM MyQueueTable WITH (UPDLOCK, READPAST)
WHERE Status = 0
ORDER BY ID ASC

-- If we've found one, mark it as being processed
IF @NextID IS NOT NULL
    UPDATE MyQueueTable SET Status = 1 WHERE ID = @NextID

COMMIT TRANSACTION

-- If we've got an item from the queue, return it to whatever is going to process it
IF @NextID IS NOT NULL
    SELECT * FROM MyQueueTable WHERE ID = @NextID
If processing an item fails, do you want to be able to try it again later? If so, you'll need to either reset the status back to 0 or something. That will require more thought.
Alternatively, don't use a database table as a queue, but something like MSMQ - just thought I'd throw that in the mix!
If you do not remove your processed rows, then you are going to need some sort of flag that indicates that a row has already been processed.
Put an index on that flag, and on the column you are going to order by.
Partition your table over that flag, so the dequeued transactions are not clogging up your queries.
If you would really get 1,000 messages every second, that would result in 86,400,000 rows a day. You might want to think of some way to clean up old rows.
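For the cleanup, a periodic job along these lines would do (exact date arithmetic varies by engine; table and column names are assumptions):
DELETE FROM queue_table
WHERE processed = 1
  AND processed_at < CURRENT_TIMESTAMP - INTERVAL '7' DAY;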
Everything depends on your database engine/implementation.
For me, simple queues on tables with the following columns:
id / task / priority / date_added
usually work.
I used priority and task to group tasks, and in case of a duplicated task I chose the one with the higher priority.
And don't worry - for modern databases "thousands" is nothing special.
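A generic sketch of that layout (column types and names are only an assumption):
CREATE TABLE task_queue (
    id         INTEGER PRIMARY KEY,
    task       VARCHAR(200) NOT NULL,
    priority   INTEGER      NOT NULL DEFAULT 0,
    date_added TIMESTAMP    NOT NULL
);

-- collapse duplicated tasks, keeping the biggest priority, oldest first
SELECT task, MAX(priority) AS priority, MIN(date_added) AS first_added
FROM task_queue
GROUP BY task
ORDER BY MIN(date_added);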
This will not be any trouble at all as long as you use something to keep track of the datetime of the insert. See here for the MySQL options. The question is whether you only ever need the absolute most recently submitted item or whether you need to iterate. If you need to iterate, then what you need to do is grab a chunk with an ORDER BY statement, loop through, and remember the last datetime so that you can use that when you grab your next chunk.
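For example, each pass could look like this (MySQL-flavoured, matching the linked options; table and column names are assumptions):
SELECT id, payload, created_at
FROM queue_table
WHERE created_at > @last_seen       -- datetime remembered from the previous chunk
ORDER BY created_at
LIMIT 100;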
perhaps adding LIMIT 1 to your select statement would help ... forcing the return after a single match ...
Since you don't delete the records from the table, you need to have a composite index on (processed, id), where processed is the column that indicates whether the current record has been processed.
The best thing would be to create a partitioned table for your records and make the PROCESSED field the partitioning key. This way, you can keep three or more local indexes.
However, if you always process the records in id order and have only two states, updating the record would mean just taking the record from the first leaf of the index and appending it to the last leaf.
The currently processed record would always have the least id of all unprocessed records and the greatest id of all processed records.
Create a clustered index over a date (or autoincrement) column. This will keep the rows in the table roughly in index order and allow fast index-based access when you ORDER BY the indexed column. Using TOP X (or LIMIT X, depending on your RDBMS) will then only retrieve the first X items from the index.
Performance warning: you should always review the execution plans of your queries (on real data) to verify that the optimizer doesn't do unexpected things. Also try to benchmark your queries (again on real data) to be able to make informed decisions.
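Concretely, something like this (SQL Server flavour; table, column and index names are assumptions):
CREATE CLUSTERED INDEX IX_Queue_ID ON dbo.QueueTable (ID);

SELECT TOP (10) *
FROM dbo.QueueTable
WHERE Processed = 0
ORDER BY ID;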
I had the same general question of "how do I turn a table into a queue" and couldn't find the answer I wanted anywhere.
Here is what I came up with for Node/SQLite/better-sqlite3.
Basically just modify the inner WHERE and ORDER BY clauses for your use case.
const crypto = require("crypto");

// `db` is a better-sqlite3 Database; `status` holds the instruction status
// constants defined elsewhere in this module.
module.exports.pickBatchInstructions = (db, batchSize) => {
  const buf = crypto.randomBytes(8); // Create a unique batch identifier

  const q_pickBatch = `
    UPDATE instructions
    SET
      status = '${status.INSTRUCTION_INPROGRESS}',
      run_id = '${buf.toString("hex")}',
      mdate  = datetime(datetime(), 'localtime')
    WHERE id IN (
      SELECT id
      FROM instructions
      WHERE status IS NOT '${status.INSTRUCTION_COMPLETE}'
        AND run_id IS NULL
      ORDER BY length(targetpath), id
      LIMIT ${batchSize}
    );
  `;
  db.prepare(q_pickBatch).run(); // Claim the batch: set the status and the run id

  const q_getInstructions = `
    SELECT *
    FROM instructions
    WHERE run_id = '${buf.toString("hex")}'
  `;
  const rows = db.prepare(q_getInstructions).all(); // Get all rows with this batch id
  return rows;
};
A very easy solution, in order not to need transactions, locks, etc., is to use the change tracking mechanism (not change data capture). It uses versioning for each added/updated/removed row, so you can track what changes happened after a specific version.
So, you persist the last version and query the new changes.
If a query fails, you can always go back and query data from the last version.
Also, if you do not want to get all changes with one query, you can get the top n ordered by version and store the greatest version you received, to use when you query again.
See, for example, Using Change Tracking in SQL Server 2008.
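A rough sketch of the setup and the incremental query (SQL Server; database, table and column names are assumptions):
ALTER DATABASE MyDb
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.MyQueueTable ENABLE CHANGE_TRACKING;

-- later, fetch everything that changed after the version we last processed
DECLARE @last_sync_version BIGINT = 0;   -- persisted by the consumer between runs

SELECT ct.ID, ct.SYS_CHANGE_VERSION, ct.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES dbo.MyQueueTable, @last_sync_version) AS ct
ORDER BY ct.SYS_CHANGE_VERSION;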

SQL trigger for deleting old results

We have a database that we are using to store test results for an embedded device. There's a table with columns for different types of failures (details not relevant), along with a primary key 'keynum' and a 'NUM_FAILURES' column that lists the number of failures. We store passes and failures, so a pass has a '0' in 'NUM_FAILURES'.
In order to keep the database from growing without bounds, we want to keep the last 1000 results, plus any of the last 50 failures that fall outside of the 1000. So, worst case, the table could have 1050 entries in it. I'm trying to find the most efficient SQL insert trigger to remove extra entries. I'll give what I have so far as an answer, but I'm looking to see if anyone can come up with something better, since SQL isn't something I do very often.
We are using SQLite3 on a non-Windows platform, if it's relevant.
EDIT: To clarify, the part that I am having problems with is the DELETE, and specifically the part related to the last 50 failures.
The reason you want to remove these entries is to keep the database from growing too big, not to keep it in some special state. For that I would really not use triggers, and instead set up a job to run at some interval, cleaning up the table.
So far, I have ended up using a View combined with a Trigger, but I'm not sure it's going to work for other reasons.
CREATE VIEW tablename_view AS
    SELECT keynum FROM tablename WHERE NUM_FAILURES != '0'
    ORDER BY keynum DESC LIMIT 50;

CREATE TRIGGER tablename_trig
AFTER INSERT ON tablename
WHEN (((SELECT COUNT(*) FROM tablename) >= 1000) OR
      ((SELECT COUNT(NUM_FAILURES) FROM tablename WHERE NUM_FAILURES != '0') >= 50))
BEGIN
    DELETE FROM tablename
    WHERE ((((SELECT MAX(keynum) FROM tablename) - keynum) >= 1000)
           AND
           ((NUM_FAILURES == '0') OR ((SELECT MIN(keynum) FROM tablename_view) > keynum)));
END;
I think you may be using the wrong data structure. Instead, I'd create two tables and pre-populate one with 1000 rows (successes) and the other with 50 (failures). Put a primary ID on each. Then, when you record a result, instead of inserting a new row, find the ID+1 value of the last timestamped record entered (looping back to 0 if it is greater than max(id) in the table) and update it with your new values.
This has the advantage of pre-allocating your storage, not requiring a trigger, and internally consistent logic. You can also adjust the size of the log very simply by just pre-populating more records rather than to have to change program logic.
There's several variations you can use on this, but the idea of using a closed loop structure rather than an open list would appear to match the problem domain more closely.
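In SQLite, the update-in-place step could look roughly like this for the 1000-row success table (slot numbering, column names and the fixed size are assumptions; the 50-row failure table would work the same way):
UPDATE successes
SET keynum       = :keynum,
    num_failures = 0,
    recorded_at  = :now
WHERE slot = (
    -- the slot right after the most recently written one, wrapping back to 1
    SELECT (slot % 1000) + 1
    FROM successes
    ORDER BY recorded_at DESC
    LIMIT 1
);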
How about this:
DELETE
FROM table
-- delete old passes, and anything older than the newest 1050 rows
WHERE ( id <= ( SELECT max(id) - 1000 FROM table )
        AND num_failures = 0
      )
   OR id <= ( SELECT max(id) - 1050 FROM table )
If performance is a concern, it might be better to delete on a periodic basis, rather than on each insert.