Quickest way of deleting data from MySQL

I am doing this:
delete from calibration_2009 where rowid in
(SELECT rowid FROM `qcvalues`.`batchinfo_2009` where reporttime like "%2010%");
I need to have about 250k rows deleted.
Why does it take such a long time? Is there a quicker way to do this?

DELETE c
FROM `qcvalues`.`batchinfo_2009` b
JOIN calibration_2009 c
ON c.rowid = b.rowid
WHERE b.reporttime LIKE '%2010%';
You should have an index on calibration_2009 (rowid) for this to work fast.
I assume it's a PRIMARY KEY anyway, but you better check.
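If you need to check (assuming MySQL, as tagged), you can list the existing indexes and add one only if rowid turns out not to be indexed; the index name below is just an example:
SHOW INDEX FROM calibration_2009;
-- only if rowid is not already indexed:
ALTER TABLE calibration_2009 ADD INDEX idx_calibration_rowid (rowid);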

If reporttime is a DATETIME, use:
DELETE FROM calibration_2009
WHERE rowid IN (SELECT rowid
FROM `qcvalues`.`batchinfo_2009`
WHERE reporttime BETWEEN STR_TO_DATE('2010-01-01', '%Y-%m-%d')
AND STR_TO_DATE('2010-12-31', '%Y-%m-%d'))
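One caveat: BETWEEN with a date-only upper bound stops at midnight on 2010-12-31, so rows stamped later that day would be missed if reporttime has a time component. A half-open range variant of the same query avoids that:
DELETE FROM calibration_2009
WHERE rowid IN (SELECT rowid
FROM `qcvalues`.`batchinfo_2009`
WHERE reporttime >= '2010-01-01'
AND reporttime < '2011-01-01');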

Depending on how many rows the inner SELECT returns, you may have to do up to that many comparisons per row in calibration_2009 just to decide whether that row should be deleted; if the inner SELECT returns 250k rows, that's up to 250k comparisons per row.
I'd say a faster approach would be to add a column called to_be_deleted to calibration_2009, if at all possible, and then update that table:
UPDATE calibration_2009 SET to_be_deleted = EXISTS (
SELECT 1 FROM `qcvalues`.`batchinfo_2009`
WHERE batchinfo_2009.rowid = calibration_2009.rowid AND batchinfo_2009.reporttime LIKE '%2010%'
);
That should be pretty quick if both tables are indexed on rowid, and batchinfo_2009 also has an index on reporttime.
Then just
DELETE FROM calibration_2009 WHERE to_be_deleted = 1;
Then you can delete that new field, or leave it there for future updates.
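For completeness, a sketch of adding and later dropping the helper column (the column type and index name are my own choices):
ALTER TABLE calibration_2009
ADD COLUMN to_be_deleted TINYINT(1) NOT NULL DEFAULT 0,
ADD INDEX idx_to_be_deleted (to_be_deleted);
-- after the UPDATE and DELETE above, optionally:
ALTER TABLE calibration_2009 DROP COLUMN to_be_deleted;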

Not sure if this is valid in MySQL, but have you tried
DELETE a
FROM calibration_2009 a
INNER JOIN qcvalues.batchinfo_2009 b
ON a.rowid = b.rowid WHERE b.reporttime LIKE '%2010%'
Alternatively, if the year part is at a fixed position within reporttime, try using SUBSTRING with an equality comparison instead of LIKE.
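For example, assuming the year occupies the first four characters of reporttime (adjust the offset to your actual format):
DELETE a
FROM calibration_2009 a
INNER JOIN qcvalues.batchinfo_2009 b ON a.rowid = b.rowid
WHERE SUBSTRING(b.reporttime, 1, 4) = '2010';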

Related

Clean up parent table rows that are no longer referenced by any child

The obvious queries are
delete from in_pipe
where id in (
select id
from in_pipe
where id not in
(select distinct inpipeid from out_pipe)
fetch first 1000 rows only
)
or
delete from in_pipe
where id in (
select i.id
from in_pipe i
left join out_pipe o on o.inpipeid = i.id
where o.id is null
fetch first 1000 rows only
)
There is a primary key index on in_pipe.id, and out_pipe.inpipeid is indexed: CREATE INDEX ix_outpipe_inpipeid ON out_pipe(inpipeid)
Both of these queries would do the job and the execution plans look fine.
But I'm worried about the performance of these queries once we get to production and the tables have millions of rows (tens of millions). The performance of the clean-up is not critical, but I'm afraid these queries would never finish.
The clean-up should not affect the performance of deletes/inserts on out_pipe or in_pipe, so I would not use a trigger for this. I'd rather have the clean-up done in the background during idle hours. It can (and should) be done little by little.
So I guess I'm looking for clever ideas...
Edit: I'm thinking of browsing in_pipe ids in batches, starting from the lowest and moving up, checking each batch for existence in out_pipe, until I reach the end, and then starting from the beginning again.
How about two and a half steps?
First step: table of IDs which aren't used:
create table not_used as
select id from in_pipe
minus
select inpipeid from out_pipe;
A half of a step: index:
create index i1nu on not_used (id);
The second step: delete IDs which aren't used:
delete from in_pipe a
where exists (select null
from not_used n
where n.id = a.id
);
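And since the clean-up should happen little by little, a hedged follow-up (assuming Oracle, which the MINUS syntax suggests): cap each pass with ROWNUM and repeat until no rows are deleted.
-- delete at most 1000 unused ids per pass; rerun until no rows are affected
delete from in_pipe a
where exists (select null
from not_used n
where n.id = a.id)
and rownum <= 1000;
commit;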
I would recommend not exists:
delete from in_pipe i
where not exists (
select 1 from out_pipe o where o.inpipeid = i.id
)

How to optimize this SQL Delete statement for faster speed

Is there a faster way to run this SQL delete than this one:
DELETE FROM TABLE
WHERE TABLE.key NOT IN (SELECT DISTINCT(MAIN_TABLE.key) FROM MAIN_TABLE)
You can prefer using NOT EXISTS:
delete from TABLE t
where not exists ( select 0 from MAIN_TABLE m where m.key = t.key )
It is mostly preferable to NOT IN from a performance point of view.
I think this is because NOT IN has to compare each outer row against every value returned by the subquery (and it matches nothing at all if the subquery returns a NULL), while NOT EXISTS only has to find a single matching row before it can stop.
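There is also a correctness difference worth keeping in mind: if key is nullable in MAIN_TABLE, NOT IN matches nothing at all, while NOT EXISTS still behaves as expected. A small illustration, reusing the placeholder names from the question:
-- suppose MAIN_TABLE contains at least one row where key IS NULL
DELETE FROM TABLE
WHERE TABLE.key NOT IN (SELECT key FROM MAIN_TABLE);  -- deletes nothing: NOT IN never evaluates to true here
DELETE FROM TABLE t
WHERE NOT EXISTS (SELECT 0 FROM MAIN_TABLE m WHERE m.key = t.key);  -- deletes the intended rows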

The most efficient way to check whether a specific number of rows exists from a table

I have an Oracle query as below which is working well:
INSERT /*+APPEND*/ INTO historical
SELECT a.* FROM TEMP_DATA a WHERE NOT EXISTS(SELECT 1 FROM historical WHERE KEY=a.KEY)
With this query, when I run EXPLAIN PLAN, I notice that the optimizer chooses a HASH JOIN plan and the cost is fairly low.
However, there's a new requirement to specify how many matching rows may already exist in the historical table when checking against TEMP_DATA, and hence the query is changed to:
INSERT /*+APPEND*/ INTO historical
SELECT a.* FROM TEMP_DATA a WHERE (SELECT COUNT(1) FROM historical WHERE KEY=a.KEY) < 2
This means that if only 1 row exists in the historical data for a given key (not a primary key), the data can still be inserted.
However, with this approach the query slows down a lot, with a cost more than 10 times the original. I also noticed that the optimizer now chooses a NESTED LOOPS plan.
Note that the historical table is a partitioned table with indexes.
Is there any way I can optimize this?
Thanks.
The following query (used as the SELECT for your INSERT /*+APPEND*/) should do the same thing and should be more performant:
select a.*
from temp_data a
left join (select key, count(*) cnt
           from historical
           group by key) b
  on a.key = b.key
where nvl(b.cnt, 0) < 2;
Hope it helps
An alternative to @DirkNM's answer would be:
select a.*
from temp_data a
where not exists (select null
from historical h
where h.key = a.key
and rownum <= 2
group by h.key
having count(*) > 1);
You would have to test with your data sets to work out which is the best solution for you.
NB: I wouldn't expect the new query (whichever one you choose) to be as performant as your original query.

How to update table based on row index?

I made a copy of an existing table like this:
select * into table_copy from table
Since then I've made some schema changes to table (added/removed columns, changed order of columns etc). Now I need to run an update statement to populate a new column I added like this:
update t
set t.SomeNewColumn = copy.SomeOldColumn
from t
However, how do I get the second table in here based on row index instead of some column value matching up?
Note: Both tables still have equal number of rows in their original positions.
You cannot join the tables without a key that defines each row uniquely; the position of the data in the table has no bearing on the situation.
If your tables do not have a primary key, you need to define one.
If you have an ID on it, you can do this:
update t set
t.SomeNewColumn = copy.SomeOldColumn
from
table t
inner join table_copy copy on
t.id = copy.id
If you have no way to uniquely identify the row and are relying on the order of the rows, you're out of luck, as row order is not reliable in any version of SQL Server (nor most other RDBMSes).
You could use this to update them by matching ids
UPDATE t
SET t.SomeNewColumn = copy.SomeOldColumn
FROM original_table t
INNER JOIN other_table copy
ON t.id = copy.id
Or, if you don't have the ids, you might be able to pull something off by using the ROW_NUMBER function to enumerate the records, but that's a long shot (I haven't checked whether it's possible).
If you're updating, you'll need a primary key to join on. Usually in that case, the others' answers will suffice. If for some reason you still need to update the table with a resultset in a certain order, you can do this:
UPDATE t SET t.SomeNewColumn = copy.SomeOldColumn
FROM table t
JOIN (SELECT ROW_NUMBER() OVER(ORDER BY id) AS row, id, SomeNewColumn FROM table) t2
ON t2.Id = t.Id
JOIN (SELECT ROW_NUMBER() OVER(ORDER BY id) AS row, SomeOldColumn FROM copytable) copy
ON copy.row = t2.row
You get the new table and its row numbers in the order you want, join the old table and its row numbers in the order you want, and join back to the new table so the query has something to directly update.

SQL: Optimization problem, has rows?

I've got a query with five joins on some rather large tables (the largest has 10 million records), and I want to know if rows exist. So far I've done this to check whether rows exist:
SELECT TOP 1 tbl.Id
FROM table tbl
INNER JOIN ... ON ... = ... (x5)
WHERE tbl.xxx = ...
Using this query in a stored procedure takes 22 seconds, and I would like it to be close to "instant". Is this even possible? What can I do to speed it up?
I have indexes on the fields that I'm joining on and on the fields in the WHERE clause.
Any ideas?
Switch to the EXISTS predicate. In general I have found it to be faster than selecting TOP 1, etc.
So you could write something like IF EXISTS (SELECT * FROM table tbl INNER JOIN table tbl2 ...) and then do your stuff.
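A rough sketch of that shape (assuming SQL Server, given TOP and the stored procedure; the second table, its join column and @value are placeholders of mine):
IF EXISTS (SELECT *
           FROM table tbl
           INNER JOIN table2 tbl2 ON tbl2.tblId = tbl.Id  -- repeat for the remaining joins
           WHERE tbl.xxx = @value)
BEGIN
    -- do your stuff here
    SELECT 1 AS HasRows;
END
ELSE
    SELECT 0 AS HasRows;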
Depending on your RDBMS you can check what parts of the query are taking a long time and which indexes are being used (so you can know they're being used properly).
In MSSQL, you can see a diagram of the execution path of any query you submit.
In MySQL you can use the EXPLAIN keyword (and in Oracle, EXPLAIN PLAN) to get details about how the query is working.
But it might just be that 22 seconds is the best you can do with your query. We can't answer that, only the execution details provided by your RDBMS can. If you tell us which RDBMS you're using we can tell you how to find the information you need to see what the bottleneck is.
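For example (tbl and tbl.xxx are the placeholder names from the question):
-- MySQL: prefix the statement with EXPLAIN
EXPLAIN SELECT tbl.Id FROM tbl WHERE tbl.xxx = 1;
-- Oracle: EXPLAIN PLAN FOR, then display the plan
EXPLAIN PLAN FOR SELECT tbl.Id FROM tbl WHERE tbl.xxx = 1;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);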
4 options
Try COUNT(*) in place of TOP 1 tbl.id
An index per column may not be good enough: you may need to use composite indexes
Are you on SQL Server 2005? If so, you can find missing indexes, or try the Database Engine Tuning Advisor.
Also, it's possible that you don't need 5 joins.
Assuming a parent-child-grandchild hierarchy (and so on), grandchild rows can't exist without the parent rows (assuming you have foreign keys).
So your query could become
SELECT TOP 1
tbl.Id --or count(*)
FROM
grandchildtable tbl
INNER JOIN
anothertable ON ... = ...
WHERE
tbl.xxx = ...
Try EXISTS.
For either the 5 tables or the assumed hierarchy:
SELECT TOP 1 --or count(*)
tbl.Id
FROM
grandchildtable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
-- or
SELECT TOP 1 --or count(*)
tbl.Id
FROM
mytable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
AND
EXISTS (SELECT *
FROM
yetanothertable T3
WHERE
tbl.key = T3.key /* AND T3 condition*/)
Filtering early in your first select will help if you can do it; since you filter the data in the first instance, all the joins will operate on reduced data.
Select top 1 tbl1.id
From
(
Select top 1 * from
table tbl1
Where Key = Key
) tbl1
inner join ...
Beyond that, you will likely need to provide more of the query for us to understand how it works.
Maybe you could offload/cache this fact-finding mission. Like if it doesn't need to be done dynamically or at runtime, just cache the result into a much smaller table and then query that. Also, make sure all the tables you're querying to have the appropriate clustered index. Granted you may be using these tables for other types of queries, but for the absolute fastest way to go, you can tune all your clustered indexes for this one query.
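If caching is an option, a minimal sketch (assuming SQL Server; every table, column and parameter name here is a placeholder of mine, not from the question):
-- refresh this table on a schedule instead of computing the joins at call time
SELECT DISTINCT tbl.Id
INTO dbo.RowExistsCache
FROM dbo.BigTable tbl
INNER JOIN dbo.ChildTable c ON c.BigTableId = tbl.Id  -- stand-in for the five joins
WHERE tbl.xxx = 1;
CREATE UNIQUE CLUSTERED INDEX IX_RowExistsCache_Id ON dbo.RowExistsCache (Id);
-- the runtime existence check then becomes a single index seek
IF EXISTS (SELECT * FROM dbo.RowExistsCache WHERE Id = @Id)
    PRINT 'rows exist';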
Edit: Yes, what other people said. Measure, measure, measure! Your query plan estimate can show you what your bottleneck is.
Use the table with the most rows first in every join. If there is more than one condition in the WHERE clause, the order of the conditions matters: put first the condition that filters the most rows.
Use filters very carefully when optimizing a query.