According to the UPDATE documentation, an UPDATE always acquires an exclusive lock on the whole table. However, I am wondering whether the exclusive lock is acquired before the rows to be updated are determined, or only just before the actual update.
My concrete problem is that I have a nested SELECT in my UPDATE like this:
UPDATE Tasks
SET Status = 'Active'
WHERE Id = (SELECT TOP 1 Id
            FROM Tasks
            WHERE Type = 1
              AND (SELECT COUNT(*)
                   FROM Tasks
                   WHERE Status = 'Active') = 0
            ORDER BY Id)
Now I am wondering whether it is really guaranteed that there is exactly one task with Status = 'Active' afterwards if the same statement may be executed in parallel with another Type:
UPDATE Tasks
SET Status = 'Active'
WHERE Id = (SELECT TOP 1 Id
            FROM Tasks
            WHERE Type = 2 -- <== The only difference
              AND (SELECT COUNT(*)
                   FROM Tasks
                   WHERE Status = 'Active') = 0
            ORDER BY Id)
If both statements determined the rows to change before acquiring locks, I could end up with two active tasks, which I must prevent.
If this is the case, how can I prevent it? Can I prevent it without setting the transaction level to SERIALIZABLE or messing with lock hints?
From the answer to Is a single SQL Server statement atomic and consistent? I learned that the problem arises when the nested SELECT accesses another table. However, I'm not sure if I have to care about this issue if only the updated table is concerned.
If you want exactly one task with Status = 'Active', then set up the table to ensure this is true. Use a filtered unique index:
create unique index unq_tasks_status_filter_active on tasks(status)
where status = 'Active';
A second concurrent update might fail, but you will be ensured of uniqueness. Your application code can process such failed updates, and re-try.
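For illustration, a minimal sketch of the behavior once the index exists (assuming the Tasks table from the question, starting with no active tasks):
-- Session 1: succeeds, making task 1 the single active task
UPDATE Tasks SET Status = 'Active' WHERE Id = 1;
-- Session 2: fails with a duplicate key error (2601/2627), because
-- unq_tasks_status_filter_active already contains one 'Active' row
UPDATE Tasks SET Status = 'Active' WHERE Id = 2;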
Relying on the actual execution plans of the updates might be dangerous. That is why it is safer to have the database do such validations. Underlying implementation details could vary, depending on the environment and version of SQL Server. For instance, what works in a single threaded, single processor environment may not work in a parallel environment. What works with one isolation level may not work with another.
EDIT:
And, I cannot resist. For efficiency purposes, consider writing the query as:
UPDATE Tasks
SET Status = 'Active'
WHERE NOT EXISTS (SELECT 1
                  FROM Tasks
                  WHERE Status = 'Active'
                 ) AND
      Id = (SELECT TOP 1 Id
            FROM Tasks
            WHERE Type = 2 -- <== The only difference
            ORDER BY Id
           );
Then place indexes on Tasks(Status) and Tasks(Type, Id). In fact, with the right indexes, you might find that the query is so fast (despite the update on the index) that your worry about concurrent updates is greatly mitigated. This would not solve the race condition, but it might at least make it rare.
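As a sketch, the two suggested indexes might look like this (the index names are illustrative):
CREATE INDEX ix_tasks_status ON Tasks(Status);
CREATE INDEX ix_tasks_type_id ON Tasks(Type, Id);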
And if you are capturing errors, then with the unique filtered index, you could just do:
UPDATE Tasks
SET Status = 'Active'
WHERE Id = (SELECT TOP 1 Id
            FROM Tasks
            WHERE Type = 2 -- <== The only difference
            ORDER BY Id
           );
This will return an error if a row is already active.
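A minimal sketch of capturing that error in T-SQL (the retry policy is an assumption left to the application; 2601/2627 are the unique index/constraint violation errors):
BEGIN TRY
    UPDATE Tasks
    SET Status = 'Active'
    WHERE Id = (SELECT TOP 1 Id
                FROM Tasks
                WHERE Type = 2
                ORDER BY Id);
END TRY
BEGIN CATCH
    IF ERROR_NUMBER() IN (2601, 2627)
        PRINT 'Another task is already active; re-try later.';  -- or queue a retry
    ELSE
        THROW;  -- unexpected error: re-raise (SQL Server 2012+)
END CATCH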
Note: all these queries and concepts can be applied to "one active per group". This answer is addressing the question that you asked. If you have a "one active per group" problem, then consider asking another question.
This is not an answer to your question... but your query is painful to my eyes :)
;WITH cte AS
(
    SELECT *, RowNum = ROW_NUMBER() OVER (PARTITION BY [type] ORDER BY id)
    FROM Tasks
)
UPDATE cte
SET [Status] = 'Active'
WHERE RowNum = 1
  AND [type] = 1
  AND NOT EXISTS (
      SELECT 1
      FROM Tasks
      WHERE [Status] = 'Active'
  )
No; at the least, the nested SELECT statement can be processed before the update is started and locks are acquired. To make sure that no other query interferes with this update, it is required to set the transaction isolation level to SERIALIZABLE.
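A minimal sketch of what that could look like, wrapping the statement from the question (under SERIALIZABLE, the range locks taken by the nested SELECTs are held until COMMIT):
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

BEGIN TRANSACTION;

UPDATE Tasks
SET Status = 'Active'
WHERE Id = (SELECT TOP 1 Id
            FROM Tasks
            WHERE Type = 1
              AND (SELECT COUNT(*)
                   FROM Tasks
                   WHERE Status = 'Active') = 0
            ORDER BY Id);

COMMIT TRANSACTION;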
This article (and the series it is part of) explains very well the subtleties of concurrency in SQL server:
http://sqlperformance.com/2014/02/t-sql-queries/confusion-caused-by-trusting-acid
I have an HRUser table and an Audit table; both are in production with a large number of rows.
Now I have added one more column to my HRUser table called IsActivated.
I need to create a one-time script that will be executed in production to populate this IsActivated column. Once this one-time script has run, whenever a user activates their account, the HRUser table's IsActivated column will be updated automatically.
To update the IsActivated column in the HRUser table, I need to check the Audit table to see whether the user has ever logged in.
UPDATE [dbo].HRUser
SET IsActivated = 1
FROM dbo.[UserAudit] A
JOIN dbo.[HRUser] U ON A.UserId = U.UserId
WHERE A.AuditTypeId = 14
AuditTypeId = 14 means the user has logged in. A user can log in any number of times, and every login is captured in the UserAudit table...
The logic is that if the user has logged in at least once means the user is activated.
This cannot be tested in lower environments and needs to be executed directly in production, because in lower environments we don't have any data in the UserAudit table.
I am not really sure whether this works, as I have never used joins in an UPDATE statement. I am looking for suggestions for any better approach to accomplishing this task.
You could use EXISTS and a correlated subquery to filter for rows whose UserId has at least one audit event of type 14:
UPDATE h
SET IsActivated = 1
FROM [dbo].HRUser h
WHERE EXISTS (
    SELECT 1
    FROM dbo.[UserAudit] a
    WHERE a.UserId = h.UserId AND a.AuditTypeId = 14
)
Note that there is no point reopening the target table in the subquery; you just need to correlate it with the outer query.
Two methods below. Method 1 is NOT recommended for tables "in production with large number of rows". But it is much easier to code. Method 2 works in production with no downtime.
Whichever method you choose: TEST it outside production. Copy the data from production. If you cannot do that, then build your own. Build a toy system. Highly recommended that you test at some level before running either method in production.
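One quick way to build such a test copy in T-SQL is SELECT ... INTO, which copies columns and data but no indexes, constraints, or triggers (the _test table names here are illustrative):
SELECT * INTO dbo.HRUser_test FROM dbo.HRUser;
SELECT * INTO dbo.UserAudit_test FROM dbo.UserAudit;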
METHOD 1:
Updating on a join is straightforward. Use an alias. Reminder, this is NOT RECOMMENDED "with large number of rows" AND production running. The SQL Server optimizer will most likely escalate locks on both tables and block them until the update completes. IF you are taking an outage and are not concerned with how long the update takes, this method works.
UPDATE U
SET IsActivated = 1
FROM dbo.[UserAudit] A
JOIN dbo.[HRUser] U ON A.UserId = U.UserId
WHERE A.AuditTypeId = 14
METHOD 2:
IF you cannot afford to stop your production systems for this update (and most of us cannot), then I recommend that you do 2 things:
Set up a loop with the transaction inside the loop. This means that the optimizer will use row locks and not block the entire table. This method may take longer, but it will not block production. A longer-running update is not a concern as long as the devops team never calls because production is blocked.
Capture rows to be updated outside the transaction. THEN, update based on a primary key (fastest). The total transaction time is how long the rows updated will be blocked.
Here is a toy example for looping.
-- STEP 1: get data to be updated
CREATE TABLE #selected ( ndx INT IDENTITY(1,1), UserId INT )

INSERT INTO #selected (UserId)
SELECT DISTINCT U.UserId   -- DISTINCT: a user may have many AuditTypeId = 14 rows
FROM dbo.[UserAudit] A
JOIN dbo.[HRUser] U ON A.UserId = U.UserId
WHERE A.AuditTypeId = 14

-- STEP 2: update on primary key in steps of 1000
DECLARE @RowsToUpdate INT = 1000
      , @LastId INT = 0
      , @RowCnt INT = 0
DECLARE @v TABLE (ndx INT, UserId INT)

WHILE 1 = 1
BEGIN
    DELETE @v

    INSERT INTO @v (ndx, UserId)
    SELECT TOP (@RowsToUpdate) ndx, UserId
    FROM #selected
    WHERE ndx > @LastId
    ORDER BY ndx

    SET @RowCnt = @@ROWCOUNT
    IF @RowCnt = 0
        BREAK;

    BEGIN TRANSACTION
    UPDATE a
    SET IsActivated = 1
    FROM @v v
    JOIN dbo.HRUser a ON a.UserId = v.UserId
    COMMIT TRANSACTION

    SELECT @LastId = MAX(ndx) FROM @v
END
I'm using sqlite3, but its SQL support is rather standard, so as long as the SQL doesn't contain any proprietary extensions all should be good. My schema is the following:
create table test (
    _id integer primary key,
    name text,
    enabled integer not null default 1
);

create table task (
    _id integer primary key,
    _test_id integer not null references test on delete cascade,
    favorite integer not null default 0,
    comment text
);
In short: there are tests which may be enabled or not; tests have multiple tasks, which can be favorite and may have a comment.
The two most complex queries I need to write are the following:
A select which retrieves whether the database contains at least one favorite and at least one commented task across all enabled tests (i.e., without grouping by test). I came up with the following monstrosity:
select
    exists(select task._id
           from task as task
           inner join test as test on task._test_id = test._id
           where task.favorite = 1 and test.enabled = 1
           limit 1) as has_favorite,
    exists(select task._id
           from task as task
           inner join test as test on task._test_id = test._id
           where task.comment is not null and test.enabled = 1
           limit 1) as has_commented;
A select which retrieves test core data (id, name, etc.) along with its task count and whether the test contains at least one favorite and at least one commented task. I came up with this:
select
    test.*,
    (select count(*) from task where _test_id = test._id) as task_count,
    exists(select _id from task where favorite = 1 and _test_id = test._id limit 1) as has_favorite,
    exists(select _id from task where comment is not null and _test_id = test._id limit 1) as has_commented
from test as test
where test.enabled = 1
order by test._id asc
Actually, 'has_favorite' and 'has_commented' are not the only pieces of information I need, but they illustrate my doubts: these queries are pretty big, contain a fair number of subqueries (and I have read that subselects are bad for performance), and duplicate a lot.
The question: would it be possible to write the queries more easily? Make them better, more concise? Not duplicate so much? For example, I'm thinking maybe there is a way to perform only one join between the task and test tables and somehow derive the data from there.
Edit: so it appears I can write this for the first one:
select
    count(*) as task_count,
    max(task.favorite) as has_favorite,
    count(task.comment) as has_commented
from task as task
inner join test as test on task._test_id = test._id
where test.enabled = 1;
and this for the second one:
select
    test.*,
    count(*) as task_count,
    max(task.favorite) as has_favorite,
    count(task.comment) as has_commented
from task as task
inner join test as test on task._test_id = test._id
where test.enabled = 1
group by test._id;
If max(task.favorite) is anything > 0, it means at least one task is a favorite. I could replace it with sum(task.favorite); if the sum is > 0, there is a favorite.
Is this any better than the original proposals (with exists(subselect))? It seems way easier.
I eventually went with joins similar to the ones in my edits, as they worked quite nicely and also allowed me to gather other information in one go.
Problem
I'm trying to understand why what seems like a minor difference between these two Oracle UPDATE queries causes a radically different execution plan.
Query 1:
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (SELECT *
              FROM tempTable tmp
              WHERE s.key1 = tmp.key1
                AND s.key2 = tmp.key2
                AND s.key3 = tmp.key3)
Query 2:
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (SELECT rownum
              FROM tempTable tmp
              WHERE s.key1 = tmp.key1
                AND s.key2 = tmp.key2 -- <== The only difference
                AND s.key3 = tmp.key3)
As you can see the only difference between the two is that the subquery in Query 2 returns a rownum instead of the values of every row.
The execution plans for these two couldn't be more different:
Query 1 - Pulls the total results from both tables and uses a sort and a hash join to return the results. This performs well, with a favorable cost of 2,346 (despite the use of the EXISTS clause and the correlated subquery).
Query 2 - Pulls both tables' results as well, but uses a count and a filter to accomplish the same task, and returns an execution plan with an astonishing cost of 77,789,696! I should note that this query just hangs on me, so I'm not actually positive it returns the same results (though I believe it should).
From my understanding of the EXISTS clause, it is just a simple boolean check that runs per row of the main table. It doesn't matter whether a single row or 100,000 rows are returned by my EXISTS condition; if any result is returned for the row being checked, the EXISTS check passes. So why would it matter what my subquery's SELECT statement returns?
--------------------EDIT----------------------
Per request, below are the execution plans I'm running in TOAD. Please note I edited the table names in my example above for ease; in these plans, ALSS_SALES2 = sales above and SALESEXT_TMP = tempTable above.
I also should have mentioned that neither of the two tables has indexes at this point. I haven't yet added them to my tempTable, and I'm testing with a cheap copy of the sales table which contains only the fields and data but no indexes, constraints, or security.
Thanks for the assistance everyone!
Query 1 Execution Plan
Query 2 Execution Plan
------------------------------------------------
Questions
1) Why did the call for rownum cause the execution plan to change?
2) What is it about the Filter that is so incredibly inefficient?
3) Am I missing something fundamental with the way the Exists clause works that is causing this change?
Posting the actual query plans would be quite helpful.
In general, though, when the optimizer sees a subquery with rownum, that radically limits its ability to transform the query and merge the results from the subquery with the main query because doing so potentially affects the results. That can be a quick way to force Oracle to materialize a subquery if that happens to be more efficient than the plan chosen by the optimizer. In this case, though, it is probably causing the optimizer to forego a transform step that makes the query more efficient.
Occasionally, you'll see someone take a query like
SELECT b.*
FROM (SELECT <<columns>>
      FROM driving_table
      WHERE <<conditions>>) a,
     b
WHERE a.id = b.id
and tack on a rownum to the a subquery
SELECT b.*
FROM (SELECT <<columns>>, rownum
      FROM driving_table
      WHERE <<conditions>>) a,
     b
WHERE a.id = b.id
in order to force the optimizer to evaluate the a subquery before executing the join. Normally, of course, the optimizer should do this by default if it is more efficient. But if the optimizer makes a mistake, adding rownum can be quicker than figuring out the right set of hints to force a plan or digging in to the underlying problem to figure out the right solution.
Of course, in the particular case that you have a subquery in a WHERE EXISTS where the only use of rownum comes in the SELECT list, we humans can detect that the rownum shouldn't prevent any query transform step that the optimizer would care to use. The optimizer, though, is probably using a more general rule that says that subqueries that reference a function like rownum must be completely executed (this may depend on the exact Oracle version and/or the optimizer settings). So the optimizer is realistically doing a bunch of extra work because it's not smart enough to recognize that the rownum you added cannot possibly affect the results of the query.
Just a question, what's the execution plan for this query:
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (SELECT NULL
              FROM tempTable tmp
              WHERE s.key1 = tmp.key1
                AND s.key2 = tmp.key2
                AND s.key3 = tmp.key3);
It visualizes what is needed in an EXISTS (...) expression: actually nothing! As already stated, Oracle just has to check whether anything is returned, not what is returned in the subquery.
In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (e.g., a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed; actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use a normal index. And since a record is created every time an action is performed, I expect this table to grow large rather quickly.
Here's the current query (the followings table joins users to feeders, i.e., users and groups):
SELECT DISTINCT feed_items.*
FROM "feed_items"
INNER JOIN "followings"
    ON (
        (followings.feeder_id = feed_items.subject_id
         AND followings.feeder_type = feed_items.subject_type)
        OR
        (followings.feeder_id = feed_items.actor_id
         AND followings.feeder_type = feed_items.actor_type)
    )
WHERE followings.follower_id = 42
ORDER BY feed_items.created_at DESC
LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?
What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f
    ON f.feeder_id = fi.id
    AND f.feeder_type = fi.type
    AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way to lend itself to having performant queries being written against it.
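As a hedged sketch of that restructured model (column names and types are illustrative, chosen to match the join above):
CREATE TABLE feed_items (
    id         BIGINT      NOT NULL,  -- the followed user's or group's id
    type       CHAR(1)     NOT NULL
               CHECK (type IN ('A', 'S')),  -- A = Actor, S = Subject
    subtype    VARCHAR(32) NOT NULL,  -- replaces actor_type and subject_type
    created_at TIMESTAMP   NOT NULL
);

-- a single composite index can now serve the join with no OR condition
CREATE INDEX idx_feed_items_lookup ON feed_items (id, type, subtype);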
EXPLAIN ANALYZE and time the query to see whether there is a problem.
Also, you could try expressing the query as a union:
SELECT x.*
FROM
(
    SELECT feed_items.*
    FROM feed_items
    INNER JOIN followings
        ON followings.feeder_id = feed_items.subject_id
        AND followings.feeder_type = feed_items.subject_type
    WHERE followings.follower_id = 42
    UNION
    SELECT feed_items.*
    FROM feed_items
    INNER JOIN followings
        ON followings.feeder_id = feed_items.actor_id
        AND followings.feeder_type = feed_items.actor_type
    WHERE followings.follower_id = 42
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again explain analyze and benchmark.
To find out whether there is a performance problem, measure it. PostgreSQL can explain it for you.
I don't think that the query needs simplifying, if you identify a performance problem then you may need to revise your indexes.
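For example, a minimal sketch of measuring the original query in PostgreSQL; the output shows the chosen plan plus actual timings and row counts, which tells you whether the OR join or the DISTINCT is the expensive step:
EXPLAIN ANALYZE
SELECT DISTINCT feed_items.*
FROM feed_items
INNER JOIN followings
    ON ((followings.feeder_id = feed_items.subject_id
         AND followings.feeder_type = feed_items.subject_type)
        OR (followings.feeder_id = feed_items.actor_id
            AND followings.feeder_type = feed_items.actor_type))
WHERE followings.follower_id = 42
ORDER BY feed_items.created_at DESC
LIMIT 30;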
I would like to compare two tables and then update if some logic is true. In pseudocode:
SELECT * FROM users, usersold IF users.id=usersold.id THEN UPDATE users.status=1;
Is there a way to do it in MySQL?
UPDATE users u
SET status = 1
WHERE EXISTS (SELECT id FROM usersold WHERE id = u.id)
Alternate version:
UPDATE users
SET status = 1
WHERE id IN (SELECT id FROM usersold)
You should test and, depending on your database, you may find one performs better than the other, although I expect any decent database will optimize them to be much the same anyway.
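For completeness, MySQL also supports a multi-table UPDATE with an explicit join, which is worth benchmarking alongside the two versions above (a sketch using the tables from the question):
UPDATE users u
JOIN usersold o ON o.id = u.id
SET u.status = 1;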