Efficiently delete from one table where ID matches another table

Efficiently delete from one table where ID matches another table - sql

I have two tables with few million records in a PostgreSQL database.
I'm trying to delete rows from one table where ID matches ID of another table. I have used the following command:
delete from table1 where id in (select id from table2)
The above command has been taking lot of time (few hours) which got me wondering is there a faster way to do this operation. Will creating indices help?
I have also tried the delete using join as suggested by few people:
delete from table1 join table2 on table1.id = table2.id
But the above command returned a syntax error. Can this be modified to avoid the error?

Syntax
You second attempt is not legal DELETE syntax in PostgreSQL. This is:
DELETE FROM table1 t1
USING table2 t2
WHERE t2.id = t1.id;
Consider the chapter "Notes" for the DELETE command:
PostgreSQL lets you reference columns of other tables in the WHERE condition by specifying the other tables in the USING clause. For example,
[...]
This syntax is not standard.
[...]
In some cases the join style is easier to write or faster to execute than the sub-select style.
Index
Will creating indices help?
The usefulness of indexes always depends on the complete situation. If table1 is big, and much bigger than table2, an index on table1.id should typically help. Typically, id would be your PRIMARY KEY, which is indexed implicitly anyway ...
Also typically, an index on table2 would not help (and not be used even if it exists.)
But like I said: Depends on the complete situation, and you disclosed preciously little.
Other details of your setup might make the deletes expensive. FK constraints, triggers, indexes, locks held by concurrent transactions, table and index bloat ...
Or non-unique rows in table2. (But I would assume ìd to be unique?) Then you would first extract a unique set of IDs from table2. Depending on cardinalities, a simple DISTINCT or more sophisticated query techniques would be in order ...

Related

Delete rows that don't have a reference anymore

I want to delete all rows of the seconde table that don't have a reference to the first one anymore or are 'NULL'. And that in a single query if possible.
I thought of an outer join but that didn't work since I got the missing rows of both tables. I can't do it in code either because the 'IN' query would become to big for Oracle and I can't just loop through because that would take too long. That's why I was hoping that it could be done with a single query?

In most databases, not exists is a supported solution:
delete from table2
where ref is null or not exists (
select 1
from table1
where table1.id = table2.ref
)
Note that this would be much simpler solved by setting up a proper foreign key between child table/column table2(re) and parent table1(id): foreign keys come with useful features such as on delete cascade, that perform such operation for you under the hood.

Delete Statement:
DELETE FROM Table2 LEFT JOIN Table1 ON ref=id WHERE Table1.id IS NULL

Count rows with column varbinary NOT NULL tooks a lot of time

This query
SELECT COUNT(*)
FROM Table
WHERE [Column] IS NOT NULL
takes a lot of time. The table has 5000 rows, and the column is of type VARBINARY(MAX).
What can I do?

Your query needs to do a table scan on a column that can potentially be very large without any way to index it. There isn't much you can do to fix this without changing your approach.
One option is to split the table into two tables. The first table could have all the details you have now in it and the second table would have just the file. You can make this a 1-1 table to ensure data is not duplicated.
You would only add the binary data as needed into the second table. If it is not needed anymore, you simply delete the record. This will allow you to simply write a JOIN query to get the information you are looking for.
SELECT
COUNT(*)
FROM dbo.Table1
INNER JOIN dbo.Table2
ON Table1.Id = Table2.Id

Will a SQL DELETE with a sub query execute inefficiently if there are many rows in the source table?

I am looking at an application and I found this SQL:
DELETE FROM Phrase
WHERE PhraseId NOT IN(SELECT Id FROM PhraseSource)
The intention of the SQL is to delete rows from Phrase that are not in the PhraseSource table.
The two tables are identical and have the following structure
Id - GUID primary key
...
...
...
Modified int
the ... columns are about ten columns containing text and numeric data. The PhraseSource table may or may not contain more recent rows with a higher number in the Modified column and different text and numeric data.
Can someone tell me will this query execute the SELECT Id from PhraseSource for every row in the Phrase table? If so is there a more efficient way that this could be coded.

1. Will this query execute the SELECT Id from PhraseSource for every row?
No.
In SQL you express what you want to do, not how you want it to be done1. The engine will create an execution plan to do what you want in the most performant way it can.
For your query, executing the query for each row is not necessary. Instead the engine will create an execution plan that executes the subquery once, then does a left anti-semi join to determine what IDs are not present in the PhraseSource table.
You can verify this when you include the Actual Execution Plan in SQL Server Management Studio.
2. Is there a more efficient way that this could be coded?
A little bit more efficient, as follows:
DELETE
p
FROM
Phrase AS p
WHERE
NOT EXISTS (
SELECT
1
FROM
PhraseSource AS ps
WHERE
ps.Id=p.PhraseId
);
This has been shown in tests done by user Aaron Bertrand on sqlperformance.com: Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?:
Conclusion
[...] for the pattern of finding all rows in table A where some condition does not exist in table B, NOT EXISTS is typically going to be your best choice.
Another benefit of using NOT EXISTS with a correlated subquery is that it does not have problems when PhraseSource.Id can be NULL. I suggest you read up on IN/NOT IN vs NULL values in the subquery. E.g. you can read more about that on sqlbadpractices.com: Using NOT IN operator with null values.
The PhraseSource.Id column is probably not nullable in your schema, but I prefer using a method that is resilient in all possible schemas.
1. Exceptions exist when forcing the engine to use a specific path, e.g. with Table Hints or Query Hints. The engine doesn't always get things right.

In this case the sub-query could be evaluated for each row if the database system is not smart enough (but in case of MS SQL Server, I suppose it should be able to recognize the fact that you don't need to evaluate the subquery more than once).
Still there is a better solution:
DELETE p
FROM Phrase p
LEFT JOIN PhraseSource ps ON ps.Id = p.PhraseId
WHERE ps.Id IS NULL
This uses the LEFT JOIN which matches the rows of both tables, but in case there is no match it leaves the ps entry NULL. Now you just check for NULLs on the left side to see which Phrases do not have a match and will delete those.
All types of JOIN statements are very nicely described in this answer.
Here you can see three different approaches for a similar issue compared on MySQL. As #Drammy mentions, to actually see the performance of a given approach, you could see the execution plan on your target database and do performance testing on different approaches of the same problem.

That query should optimise into a join. Have you looked at the execution plan?
If you're experiencing poor performance it is likely because of the guid primary keys.
A primary key is clustered by default. If the guid primary key is clustered on your table that means the data in the tables is ordered by the primary key. The problem with guids as clustered keys is that when you delete one record the table has to be reordered and shuffled around on disk.
This article is a good read on the topic..
https://blog.codinghorror.com/primary-keys-ids-versus-guids/

deleting a big table based on condition

i have a huge table which has no indexing on it. and indexing cant be added. i need to delete rows like this :
delete from table1 where id in (
select id from table2 inner join table3 on table2.col1 = table3.col1);
but since it has huge number of rows its taking too much time. what i can do to make it faster other than indexing (not permitted).
I am using oracle db.

Performance of "NOT IN" in SQL query

I'm quite new to SQL query analysis. Recently I stumbled upon a performance issue with one of the queries and I'm wondering whether my thought process is correct here and why Query Optimizer works the way it works in this case.
I'm om SQL Server 2012.
I've got a SQL query that looks like
SELECT * FROM T1
WHERE Id NOT IN
(SELECT DISTINCT T1_Id from T2);
It takes around 30 seconds to run on my test server.
While trying to understand what is taking so long I rewrote it using a temp table, like this:
SELECT DISTINCT T1_Id
INTO #temp from T2;
SELECT * FROM T1
WHERE Id NOT IN
(SELECT T1_Id from #temp);
It runs a hundred times faster than the first one.
Some info about the tables:
T2 has around 1 million rows, and there are around 1000 distinct values of T1_id there. T1 has around 1000+ rows. Initially I only had a clustered index on T2 on a column other than T1_Id, so T1_id wasn't indexed at all.
Looking at the execution plans, I saw that for the first query there were as many index scans as there are distinct T1_id values, so basically SQL Server performs about 1000 index scans in this case.
That made me realize that adding a non-clustered index on T1_id may be a good idea (the index should've been there from the start, admittedly), and adding an index indeed made the original query run much faster since now it does nonclustered index seeks.
What I'm looking for is to understand the Query optimizer behavior for the original query - does it look reasonable? Are there any ways to make it work in a way similar to the temporary table variant that I posted here rather than doing multiple scans? Am I just misunderstanding something here?
Thanks in advance for any links to the similar discussion as I haven't really found anything useful.

Not in is intuitive but slow. This construct will generally run quicker.
where id in
(select id from t1
except select t1_id from t2)

The actual performance will likely vary from the estimates, but neither of your queries will out-perform this query, which is the de facto standard approach:
SELECT T1.* FROM T1
LEFT JOIN T2 ON T1.Id = T2.T1_Id
WHERE T2.T1_Id IS NULL
This uses a proper join, which will perform very well (assuming the foreign key column is indexed) and being an left (outer) join the WHERE condition selects only those rows from T1 that don't join (all columns of the right side table are null when the join misses).
Note also that DISTINCT is not required, since there is only ever one row returned from T1 for missed joins.

The SQL Server optimizer needs to understand the size if tables for some of its decisions.
When doing a NOT IN with a subquery, those estimates may not be entirely accurate. When the table is actually materialized, the count would be highly accurate.
I think the first would be faster with an index on
Table2(t1_id)

This is just a guess, but hopefully an educated one...
The DBMS probably concluded that searching a large table small number of times is faster than searching a small table large number of times. That's why you had ~1000 searches on T2, instead of ~1000000 searches on T1.
When you added an index on T2.T1_Id, that turned ~1000 table scans (or full clustered index scans if the table is clustered) into ~1000 index seeks, which made things much faster, as you already noted.
I'm not sure why it didn't attempt a hash join (or a merge join after the index was added) - perhaps it had stale statistics and badly overestimated the number of distinct values?
One more thing: is there a FOREIGN KEY on T2.T1_Id referencing T1.Id? I know Oracle can use FKs to improve the accuracy of cost estimates (in this case, it could infer that the cardinality of T2.T1_Id cannot be greater than T1.Id). If MS SQL Server does something similar, and the FK is missing (or is untrusted), that could contribute to the MS SQL Server thinking there are more distinct values than there really are.
(BTW, it would have helped if you posted the actual query plans and the database structure.)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas