How to optimize this SQL Delete statement for faster speed - sql

Is there a faster way to run this SQL delete than the following?
DELETE FROM TABLE
WHERE TABLE.key NOT IN (SELECT DISTINCT(MAIN_TABLE.key) FROM MAIN_TABLE)

Prefer NOT EXISTS:
delete from TABLE t
where not exists ( select 0 from MAIN_TABLE m where m.key = t.key )
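This generally performs at least as well as NOT IN, and often better.
The two forms are also not equivalent when MAIN_TABLE.key can be NULL:
if the subquery returns even one NULL, NOT IN evaluates to unknown for every non-matching row, so nothing is deleted, while
NOT EXISTS only tests whether a matching row exists. That NULL handling is also what can keep some optimizers from treating NOT IN as a simple anti-join.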
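The difference between the two forms is easiest to see with a NULL key. The following is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine; the table names child_table and main_table and the column k are made up for illustration, and other engines may plan these statements differently even though the NULL semantics are standard.

```python
import sqlite3

def count_after_delete(delete_sql):
    """Run one DELETE on a fresh in-memory database; return rows left in child_table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE main_table (k INTEGER);
        CREATE TABLE child_table (k INTEGER);
        INSERT INTO main_table VALUES (1), (NULL);   -- note the NULL key
        INSERT INTO child_table VALUES (1), (2), (3);
    """)
    conn.execute(delete_sql)
    remaining = conn.execute("SELECT COUNT(*) FROM child_table").fetchone()[0]
    conn.close()
    return remaining

NOT_IN = "DELETE FROM child_table WHERE k NOT IN (SELECT k FROM main_table)"
NOT_EXISTS = """
    DELETE FROM child_table
    WHERE NOT EXISTS (SELECT 1 FROM main_table m WHERE m.k = child_table.k)
"""

# With a NULL in the subquery, NOT IN never evaluates to true, so no row is deleted.
left_not_in = count_after_delete(NOT_IN)
# NOT EXISTS ignores the NULL and deletes the two keys that have no match.
left_not_exists = count_after_delete(NOT_EXISTS)
print(left_not_in, left_not_exists)
```

If the key column is declared NOT NULL on both sides, the two statements delete the same rows and the choice is purely about the execution plan.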

Related

DELETE command with SELECT JOIN

I've tried the delete command below, but I don't know why it's not working:
DELETE FROM beteg
WHERE beteg.taj IN (
SELECT beteg.taj COUNT (ellatas.id) as "ellátások száma"
FROM ellatas RIGHT JOIN beteg ON ellatas.beteg = beteg.taj
WHERE "ellátások száma" = 0
);
I suspect that you want:
delete from beteg
where not exists (select 1
from ellatas e
where e.beteg = beteg.taj
);
This deletes everything in beteg that does not have a corresponding row in ellatas.
Your query has multiple issues:
A column alias is used in the where clause.
You are using IN with two columns in the subquery and one in the outer query.
The subquery has no GROUP BY, but is using COUNT().
In any case, the above query is simpler and should have better performance.

Join within an "exists" subquery

I am wondering why when you join two tables on a key within an exists subquery the join has to happen within the WHERE clause instead of the FROM clause.
This is my example:
Join within FROM clause:
SELECT payer_id
FROM Population1
WHERE NOT EXISTS
(Select *
From Population2 join Population1
On Population2.payer_id = Population1.payer_id)
Join within WHERE clause:
SELECT payer_id
FROM Population1
WHERE NOT EXISTS
(Select *
From Population2
WHERE Population2.payer_id = Population1.payer_id)
The first query gives me 0 results, which I know is incorrect, while the second query gives me the thousands of results I am expecting to see.
Could someone just explain to me why where the join happens in an EXISTS subquery matters? If you take the subqueries without the parent query and run them they literally give you the same result.
It would help me a lot to remember to not continue to make this mistake when using exists.
Thanks in advance.
You need to understand the distinction between a regular subquery and a correlated subquery.
Using your examples, this should be easy. The first where clause is:
where not exists (Select 1
from Population2 join
Population1
on Population2.payer_id = Population1.payer_id
)
This condition does exactly what it says it is doing. The subquery has no connection to the outer query. So, the not exists will either filter out all rows or keep all rows.
In this case, the engine runs the subquery and determines that at least one row is returned. Hence, not exists returns false in all cases, and nothing is returned.
In the second case, the subquery is a correlated subquery. So, for each row in population1 the subquery is run using the value of Population1.payer_id. In some cases, matching rows exist in Population2; these are filtered out. In other cases, matching rows do not exist; these are in the result set.
The subquery in the first example never refers to the outer query's row: the Population1 inside it is a separate instance of the table, so the subquery gives the same answer for every outer row.
Another way to do the same logic would be:
SELECT P1.payer_id
FROM Population1 P1
LEFT JOIN Population2 P2 ON
P2.Payer_Id = P1.Payer_Id
WHERE
P2.Payer_Id IS NULL
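To see the three forms side by side, here is a small sketch that runs them through Python's built-in sqlite3 module. The sample rows are invented for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Population1 (payer_id INTEGER);
    CREATE TABLE Population2 (payer_id INTEGER);
    INSERT INTO Population1 VALUES (1), (2), (3);
    INSERT INTO Population2 VALUES (2);
""")

def payer_ids(sql):
    """Return the payer_id values produced by a query, sorted."""
    return sorted(row[0] for row in conn.execute(sql))

# Uncorrelated: the inner Population1 is a second copy of the table,
# so the subquery returns the same rows for every outer row.
uncorrelated = payer_ids("""
    SELECT payer_id FROM Population1
    WHERE NOT EXISTS (SELECT * FROM Population2
                      JOIN Population1
                        ON Population2.payer_id = Population1.payer_id)
""")

# Correlated: the subquery refers to the outer row's payer_id.
correlated = payer_ids("""
    SELECT payer_id FROM Population1
    WHERE NOT EXISTS (SELECT * FROM Population2
                      WHERE Population2.payer_id = Population1.payer_id)
""")

# LEFT JOIN ... IS NULL: the same anti-join written as a join.
anti_join = payer_ids("""
    SELECT P1.payer_id
    FROM Population1 P1
    LEFT JOIN Population2 P2 ON P2.payer_id = P1.payer_id
    WHERE P2.payer_id IS NULL
""")

print(uncorrelated, correlated, anti_join)
```

The uncorrelated form returns nothing because its subquery finds a row regardless of the outer row, while the other two return the payers missing from Population2.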
Your subquery returns a row whenever the join produces at least one row, no matter which outer row is being tested:
Select *
from Population2
join Population1 on Population2.payer_id = Population1.payer_id
If at least one row comes out of this join (and here one certainly does), you can imagine that your subquery looks like:
select 'ROW EXISTS'
And the whole query becomes:
select *
from Population1
where not exists (select 'ROW EXISTS')
So your anti-semijoin returns:
payer_id 1 --> some ROW EXISTS -> don't return this row
payer_id 2 --> some ROW EXISTS -> don't return this row

SQL select IN (select) process too long why?

Let's say TABLE1 has 1 million entries in it.
Table2 has 50k entries in it.
SELECT stringVal
FROM TABLE2
WHERE idTable2=5
Result of select:
5
4
That select takes 0.02s to process
But when I use it within IN, it takes up to 20.20s:
SELECT count(*)
FROM TABLE1
WHERE stringVal IN (
SELECT stringVal FROM TABLE2 where idTable2=5);
If I use it like this, it processes in 0.02s:
SELECT count(*)
FROM TABLE1
WHERE stringVal IN (5,4);
Can anyone explain how things work here?
I think your RDBMS is doing a poor job of executing your query. Other RDBMSs (e.g. SQL Server) can see that a subquery is not correlated with the outer query, materialize its result internally, and avoid executing the subquery repeatedly. For example:
select *
, (select count(*) from tbl) -- a smart RDBMS won't execute this repeatedly
from tbl
A good RDBMS would not execute the count repeatedly, since it is an independent query (not correlated to the outer query).
Try all of the options; there are only a few of them anyway.
1st, try EXISTS. Your RDBMS's EXISTS might be faster than its IN. I have also seen IN be faster than EXISTS, though; see "Why the most natural query (i.e. using INNER JOIN instead of LEFT JOIN) is very slow". Quassnoi made the same observation (IN is faster than EXISTS): http://explainextended.com/2009/06/16/in-vs-join-vs-exists/
SELECT count(*)
FROM TABLE1
WHERE
-- stringVal IN
EXISTS(
SELECT * -- please, don't bikeshed ;-)
FROM TABLE2
where
table1.stringVal = table2.stringVal -- simulated IN
and table2.idTable2 = 5);
2nd, try INNER JOIN. Use it as-is if there are no duplicates, or use DISTINCT to remove them.
SELECT count(*)
FROM TABLE1
JOIN (
SELECT DISTINCT stringVal -- remove duplicates
FROM TABLE2
where table2.idTable2 = 5 ) as x
ON X.stringVal = table1.stringVal
3rd, try to materialize the rows yourself. I encountered the same problem with SQL Server: querying the materialized rows was faster than querying the result of another query.
For an example of materializing a query result into a table and then using IN on the result, which I found faster than using IN on another query, read the bottom part of this post: http://www.ienablemuch.com/2012/05/recursive-cte-is-evil-and-cursor-is.html
Example:
SELECT distinct stringVal -- remove duplicates
into anotherTable
FROM TABLE2
where idTable2 = 5;
SELECT count(*)
FROM TABLE1 where stringVal in (select stringVal from anotherTable);
The above works on SQL Server and PostgreSQL; on other RDBMSs it might look like this:
create table anotherTable as
SELECT distinct stringVal -- remove duplicates
FROM TABLE2
where table2.idTable2 = 5;
select count(*)
from table1 where stringVal in (select stringVal from anotherTable)
While I love subqueries and they are immensely powerful, they can also be quite slow, since (depending on the implementation) the query may have to be completely evaluated at each iteration. Ouch!
This is why they are my last resort.
Some SQL implementations are quite good and will cache the subquery result, though I'm not sure how safe that would be. Even then the engine still has to traverse the cached structure, and if that structure isn't properly optimized, the query can take quadratic or even cubic time if you nest enough of them ...
SELECT stringVal
FROM TABLE2
WHERE idTable2=5
Without an index on idTable2 this is a linear-time O(n) scan: the engine reads every row and returns those that match the WHERE clause. With an index on idTable2 the lookup is closer to O(log n), but we will assume there is none.
SELECT count(*)
FROM TABLE1
WHERE stringVal IN (
SELECT stringVal FROM TABLE2 where idTable2=5);
Assuming the subquery result isn't cached, it is evaluated for each row, and if you have a lot of rows that's a lot of evaluations: many wasted, repeated calculations. Even if it is cached, the structure may not be optimal for searching, not to mention that you are comparing strings against a list of strings.
SELECT count(*)
FROM TABLE1
WHERE stringVal IN (5,4);
Here there is no subquery to evaluate: the IN list is a constant expression, so there is basically no overhead at all. It doesn't need to do any I/O or deal with locks or anything :)
Try this
SELECT count(*) FROM TABLE1 where EXISTS
(SELECT 1 FROM TABLE2 where idTable2=5 and stringVal = TABLE1.stringVal );
And you should create indexes on stringVal in both TABLE1 and TABLE2.
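The rewrites suggested in this thread (IN, EXISTS, and a join on a DISTINCT subquery) should all return the same count. The sketch below checks that on a small data set using Python's built-in sqlite3 module; the sample rows and index names are invented, and it says nothing about which form is fastest on your engine. It also shows why the DISTINCT matters for the join form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (stringVal TEXT);
    CREATE TABLE TABLE2 (idTable2 INTEGER, stringVal TEXT);
    -- hypothetical index names; the advice is simply to index stringVal on both sides
    CREATE INDEX ix_table1_stringval ON TABLE1 (stringVal);
    CREATE INDEX ix_table2_stringval ON TABLE2 (stringVal);
""")
# 1000 rows in TABLE1: each of the values '0'..'9' appears 100 times.
conn.executemany("INSERT INTO TABLE1 VALUES (?)",
                 [(str(i % 10),) for i in range(1000)])
# TABLE2 deliberately contains a duplicate ('4' twice) for idTable2 = 5.
conn.executemany("INSERT INTO TABLE2 VALUES (?, ?)",
                 [(5, "5"), (5, "4"), (5, "4"), (6, "7")])

def scalar(sql):
    """Return the single value produced by a one-row, one-column query."""
    return conn.execute(sql).fetchone()[0]

in_count = scalar("""
    SELECT count(*) FROM TABLE1
    WHERE stringVal IN (SELECT stringVal FROM TABLE2 WHERE idTable2 = 5)
""")
exists_count = scalar("""
    SELECT count(*) FROM TABLE1
    WHERE EXISTS (SELECT 1 FROM TABLE2
                  WHERE TABLE2.idTable2 = 5
                    AND TABLE2.stringVal = TABLE1.stringVal)
""")
join_distinct_count = scalar("""
    SELECT count(*) FROM TABLE1
    JOIN (SELECT DISTINCT stringVal FROM TABLE2 WHERE idTable2 = 5) AS x
      ON x.stringVal = TABLE1.stringVal
""")
# Without DISTINCT, the duplicate '4' row makes the join count those rows twice.
plain_join_count = scalar("""
    SELECT count(*) FROM TABLE1
    JOIN TABLE2 ON TABLE2.stringVal = TABLE1.stringVal AND TABLE2.idTable2 = 5
""")
print(in_count, exists_count, join_distinct_count, plain_join_count)
```

To compare speed rather than results, run each form against your real tables and inspect the plan your RDBMS reports.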
Here is a simple join that will give you the same kind of result you were looking for. It can be applied in many different situations and avoids the subquery. Note that if the matching rows in TABLE2 are not unique, the join counts each TABLE1 row once per match.
SELECT COUNT(*)
FROM TABLE1 INNER JOIN TABLE2 ON TABLE1.'COLUMN' = TABLE2.'COLUMN' AND TABLE2.IDTABLE2=5
WHERE 'WHATEVER YOU WANT'
Replace 'COLUMN' with a column that is referenced in both tables, normally an ID or primary key.

quickest way of deleting data from mysql

I am doing this:
delete from calibration_2009 where rowid in
(SELECT rowid FROM `qcvalues`.`batchinfo_2009` where reporttime like "%2010%");
I need to have about 250k rows deleted.
Why does it take such a long time? Is there a quicker way to do this?
DELETE c
FROM `qcvalues`.`batchinfo_2009` b
JOIN calibration_2009 c
ON c.rowid = b.rowid
WHERE b.reporttime LIKE '%2010%';
You should have an index on calibration_2009 (rowid) for this to work fast.
I assume it's a PRIMARY KEY anyway, but you better check.
If reporttime is a DATETIME, use a range condition instead of LIKE. A half-open range also covers the whole of December 31:
DELETE FROM calibration_2009
WHERE rowid IN (SELECT rowid
FROM `qcvalues`.`batchinfo_2009`
WHERE reporttime >= STR_TO_DATE('2010-01-01', '%Y-%m-%d')
AND reporttime < STR_TO_DATE('2011-01-01', '%Y-%m-%d'))
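The upper bound of a date range on a DATETIME column is easy to get wrong: BETWEEN with '2010-12-31' as the upper bound stops at midnight at the start of that day. The sketch below illustrates this with Python's built-in sqlite3 module. Note the assumptions: SQLite stores these timestamps as text and compares them as strings, which for this data gives the same answer as a MySQL DATETIME comparison, and the id column and sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batchinfo_2009 (id INTEGER, reporttime TEXT)")
conn.executemany(
    "INSERT INTO batchinfo_2009 VALUES (?, ?)",
    [
        (1, "2009-12-31 23:59:59"),  # before 2010
        (2, "2010-01-01 00:00:00"),  # first instant of 2010
        (3, "2010-06-15 12:00:00"),
        (4, "2010-12-31 15:30:00"),  # afternoon of the last day of 2010
        (5, "2011-01-01 00:00:00"),  # after 2010
    ],
)

def ids(sql):
    """Return the ids selected by a query, sorted."""
    return sorted(row[0] for row in conn.execute(sql))

# BETWEEN with a date-only upper bound misses rows later on December 31.
between_ids = ids("""
    SELECT id FROM batchinfo_2009
    WHERE reporttime BETWEEN '2010-01-01' AND '2010-12-31'
""")

# A half-open range up to (but excluding) the next year catches them.
half_open_ids = ids("""
    SELECT id FROM batchinfo_2009
    WHERE reporttime >= '2010-01-01' AND reporttime < '2011-01-01'
""")

print(between_ids, half_open_ids)
```

Either kind of range condition can use an index on reporttime, which a pattern with a leading wildcard such as LIKE '%2010%' cannot.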
Depending on how many rows the inner select returns, you may have to do that many comparisons per row in calibration_2009 to see if it has to be deleted.
If the inner select returns 250k rows, then in the worst case you're doing up to 250k comparisons per row in calibration_2009 just to see if that row should be deleted.
I'd say a faster approach would be to add a column to calibration_2009, if at all possible, called to_be_deleted. Then update that table
UPDATE calibration_2009 SET to_be_deleted = EXISTS (
SELECT 1 FROM `qcvalues`.`batchinfo_2009`
WHERE batchinfo_2009.rowid = calibration_2009.rowid AND batchinfo_2009.reporttime like "%2010%"
);
That should be pretty quick if both tables are indexed by rowid AND reporttime in batchinfo_2009.
Then just
DELETE FROM calibration_2009 WHERE to_be_deleted = 1;
Then you can delete that new field, or leave it there for future updates.
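The flag-then-delete approach can be tried end to end in a few lines. This is only a sketch using Python's built-in sqlite3 module, not MySQL: the id column stands in for rowid, the sample rows are invented, and timings on a real 250k-row delete will depend on your indexes and engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batchinfo_2009 (id INTEGER PRIMARY KEY, reporttime TEXT);
    CREATE TABLE calibration_2009 (
        id INTEGER PRIMARY KEY,
        to_be_deleted INTEGER DEFAULT 0   -- the extra flag column
    );
    INSERT INTO batchinfo_2009 VALUES
        (1, '2009-05-01'), (2, '2010-03-01'), (3, '2010-11-20');
    INSERT INTO calibration_2009 (id) VALUES (1), (2), (3), (4);
""")

# Step 1: flag every calibration row whose batch was reported in 2010.
conn.execute("""
    UPDATE calibration_2009
    SET to_be_deleted = EXISTS (
        SELECT 1 FROM batchinfo_2009 b
        WHERE b.id = calibration_2009.id
          AND b.reporttime LIKE '%2010%')
""")

# Step 2: delete the flagged rows with a simple, indexable predicate.
deleted = conn.execute(
    "DELETE FROM calibration_2009 WHERE to_be_deleted = 1"
).rowcount
remaining = sorted(row[0] for row in conn.execute("SELECT id FROM calibration_2009"))
print(deleted, remaining)
```

Splitting the work this way also lets you inspect the flagged rows before anything is removed.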
Not sure if this is valid in MySQL, but have you tried a join? (MySQL's multi-table DELETE expects the alias of the target table right after DELETE.)
delete a from calibration_2009 a
inner join qcvalues.batchinfo_2009 b
on a.rowid = b.rowid where reporttime like '%2010%'
Alternatively, if the year part is at a fixed position within reporttime, try using SUBSTRING with an equals sign instead of LIKE.

SQL: Optimization problem, has rows?

I have a query with five joins on some rather large tables (the largest table has 10 million records), and I want to know if any rows exist. So far I've done this to check whether rows exist:
SELECT TOP 1 tbl.Id
FROM table tbl
INNER JOIN ... ON ... = ... (x5)
WHERE tbl.xxx = ...
Using this query in a stored procedure takes 22 seconds, and I would like it to be close to "instant". Is this even possible? What can I do to speed it up?
I have indexes on the fields that I'm joining on and the fields in the WHERE clause.
Any ideas?
Switch to the EXISTS predicate. In general I have found it to be faster than selecting TOP 1, etc.
So you could write it like this: IF EXISTS (SELECT * FROM table tbl INNER JOIN table tbl2 .. do your stuff
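As a rough illustration of an existence check, here is a sketch using Python's built-in sqlite3 module. SQLite has neither TOP nor T-SQL's IF EXISTS, so SELECT EXISTS (...) is used as a stand-in; the parent and child tables, their columns, and the rows_exist helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY, xxx TEXT);
    CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER);
    INSERT INTO parent VALUES (1, 'a'), (2, 'b');
    INSERT INTO child VALUES (10, 1);
""")

def rows_exist(xxx):
    """Return True if at least one joined row matches the filter."""
    sql = """
        SELECT EXISTS (
            SELECT 1
            FROM parent p
            JOIN child c ON c.parent_id = p.id
            WHERE p.xxx = ?)
    """
    # EXISTS yields 1 or 0; the engine is free to stop at the first match.
    return bool(conn.execute(sql, (xxx,)).fetchone()[0])

has_a = rows_exist("a")   # parent 'a' has a child row
has_b = rows_exist("b")   # parent 'b' has none
print(has_a, has_b)
```

Whether this beats TOP 1 on a particular server is something only the execution plan can confirm.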
Depending on your RDBMS you can check what parts of the query are taking a long time and which indexes are being used (so you can know they're being used properly).
In MSSQL, you can view a diagram of the execution plan of any query you submit.
In MySQL you can use the EXPLAIN keyword, and in Oracle EXPLAIN PLAN FOR, to get details about how the query is executed.
But it might just be that 22 seconds is the best you can do with your query. We can't answer that, only the execution details provided by your RDBMS can. If you tell us which RDBMS you're using we can tell you how to find the information you need to see what the bottleneck is.
4 options
Try COUNT(*) in place of TOP 1 tbl.id
An index per column may not be good enough: you may need to use composite indexes
Are you on SQL Server 2005? If so, you can find missing indexes. Or try the Database Tuning Advisor.
Also, it's possible that you don't need 5 joins.
Assuming parent-child-grandchild etc, then grandchild rows can't exist without the parent rows (assuming you have foreign keys)
So your query could become
SELECT TOP 1
tbl.Id --or count(*)
FROM
grandchildtable tbl
INNER JOIN
anothertable ON ... = ...
WHERE
tbl.xxx = ...
Try EXISTS.
Either for the 5 tables or for the assumed hierarchy:
SELECT TOP 1 --or count(*)
tbl.Id
FROM
grandchildtable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
-- or
SELECT TOP 1 --or count(*)
tbl.Id
FROM
mytable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
AND
EXISTS (SELECT *
FROM
yetanothertable T3
WHERE
tbl.key = T3.key /* AND T3 condition*/)
Doing a filter early on your first select will help if you can do it; as you filter the data in the first instance all the joins will join on reduced data.
Select top 1 tbl.id
From
(
Select top 1 * from
table tbl1
Where Key = Key
) tbl1
inner join ...
After that you will likely need to provide more of the query to understand how it works.
Maybe you could offload/cache this fact-finding mission. Like if it doesn't need to be done dynamically or at runtime, just cache the result into a much smaller table and then query that. Also, make sure all the tables you're querying to have the appropriate clustered index. Granted you may be using these tables for other types of queries, but for the absolute fastest way to go, you can tune all your clustered indexes for this one query.
Edit: Yes, what other people said. Measure, measure, measure! Your query plan estimate can show you what your bottleneck is.
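Put the table with the most rows first in every join. If the WHERE clause has more than one condition, the order of the conditions matters: put first the condition that returns the most rows. Many modern optimizers choose join and predicate order themselves, so check the execution plan to see whether this makes a difference.
Use filters very carefully when optimizing a query.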