I'm trying to delete duplicate rows from my table of 3.5 million rows, keeping only the row with the highest ID in each group of duplicates. There are around 1,300 rows to delete, and I am currently using the following query:
delete from Data
where exists (select 1 from Data t2
where data.code = t2.code and data.issue = t2.issue
and data.id < t2.id);
The query has run for more than 15 minutes. Is there any way I can optimize this to decrease the time taken? I'm using SQLite version 3.22.0.
Often, deleting a lot of rows in a table is simply inefficient. It can be faster to reconstruct the table.
The idea is to select the rows you want into another table:
create table temp_data as
select t.*
from data t
where t.id = (select max(t2.id)
from data t2
where t2.code = t.code and t2.issue = t.issue
);
For this query, you want an index on (code, issue, id).
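In SQLite, that index could be created like this (the index name here is just a placeholder):
create index idx_data_code_issue_id on data (code, issue, id);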
Then when the data is safely tucked away and validated, you can empty the existing table and re-insert:
delete from data;
Be sure you have removed any triggers on the table first: with no WHERE clause and no triggers, the delete from data above can use SQLite's "truncate" optimization, which you can read about in the documentation. In most other databases, you would use the command truncate table data.
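If you are unsure whether any triggers exist, you can check the schema first with a standard catalog query:
select name from sqlite_master where type = 'trigger' and tbl_name = 'data';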
Then, you can re-insert the data:
insert into data
select *
from temp_data;
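Once the re-inserted data has been verified, the staging table can be dropped:
drop table temp_data;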
Consider two very large tables: Table A with 20 million rows, and Table B with 10 million rows that overlaps heavily with Table A. Both have an identifier column and a bunch of other data. I need to move all items from Table B into Table A, updating those rows that already exist.
Both tables have the same structure:
- Identifier int
- Date DateTime
- Identifier A
- Identifier B
- General decimal data (maybe 10 columns)
I can get the items in Table B that are new, and the items in Table B that need to be updated in Table A, very quickly, but I can't get an update or a delete/insert to work quickly. What options are available to merge the contents of Table B into Table A (i.e. updating existing records instead of inserting) in the shortest time?
I've tried pulling out the existing records in Table B and running a large update on Table A to update just those rows (i.e. an update statement per row), and performance is pretty bad, even with a good index on it.
I've also tried a one-shot delete from Table A of the values that exist in Table B, and the performance of the delete is also poor, even with the indexes dropped.
I appreciate that this may be difficult to perform quickly, but I'm looking for other options that are available to achieve this.
Since you are dealing with two large tables, in-place updates/inserts/merges can be time-consuming operations. I would recommend using a bulk-logged technique: load the desired content into a new table and then perform a table swap.
Example using SELECT INTO:
SELECT *
INTO NewTableA
FROM (
    SELECT * FROM dbo.TableB
    UNION ALL
    SELECT * FROM dbo.TableA a WHERE NOT EXISTS (SELECT * FROM dbo.TableB b WHERE b.id = a.id)
) d
exec sp_rename 'TableA', 'BackupTableA'
exec sp_rename 'NewTableA', 'TableA'
Simple or at least Bulk-Logged recovery is highly recommended for such an approach. Also, I assume this has to be done outside business hours, since plenty of objects will need to be recreated on the new table: indexes, default constraints, the primary key, etc.
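For example, if the original table had a clustered primary key on Identifier and an index on the date column, something like this would have to be re-applied to the new table after the swap (the constraint and index names here are just placeholders):
ALTER TABLE dbo.TableA ADD CONSTRAINT PK_TableA PRIMARY KEY CLUSTERED (Identifier);
CREATE INDEX IX_TableA_Date ON dbo.TableA ([Date]);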
A MERGE is probably your best bet if you want both inserts and updates.
MERGE dbo.TableA AS Tgt
USING (SELECT * FROM dbo.TableB) AS Src
ON (Tgt.Identifier = Src.Identifier)
WHEN MATCHED THEN
    UPDATE SET Date = Src.Date, ...
WHEN NOT MATCHED THEN
    INSERT (Identifier, Date, ...)
    VALUES (Src.Identifier, Src.Date, ...);
Note that the MERGE statement must be terminated with a semicolon.
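The MERGE also benefits from an index on the join column of both tables. If one is missing, something along these lines helps (the index names are placeholders, and this assumes Identifier uniquely identifies a row in each table):
CREATE UNIQUE INDEX IX_TableA_Identifier ON dbo.TableA (Identifier);
CREATE UNIQUE INDEX IX_TableB_Identifier ON dbo.TableB (Identifier);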
This query
SELECT COUNT(*)
FROM Table
WHERE [Column] IS NOT NULL
takes a lot of time. The table has 5000 rows, and the column is of type VARBINARY(MAX).
What can I do?
Your query needs to do a table scan on a column that can potentially be very large without any way to index it. There isn't much you can do to fix this without changing your approach.
One option is to split the table into two tables. The first table could have all the details you have now in it and the second table would have just the file. You can make this a 1-1 table to ensure data is not duplicated.
You would only add the binary data as needed into the second table. If it is not needed anymore, you simply delete the record. This will allow you to simply write a JOIN query to get the information you are looking for.
SELECT COUNT(*)
FROM dbo.Table1
INNER JOIN dbo.Table2
    ON Table1.Id = Table2.Id
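A minimal sketch of that split, using hypothetical column names since the question does not give the schema (FileData stands for the VARBINARY(MAX) column):
CREATE TABLE dbo.Table1 (
    Id INT NOT NULL PRIMARY KEY
    -- ... plus all of the existing detail columns ...
);

CREATE TABLE dbo.Table2 (
    Id INT NOT NULL PRIMARY KEY REFERENCES dbo.Table1 (Id), -- 1-1 with Table1
    FileData VARBINARY(MAX) NOT NULL                        -- the large binary payload
);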
I have a huge table which has no indexes on it, and indexes can't be added. I need to delete rows like this:
delete from table1 where id in (
select id from table2 inner join table3 on table2.col1 = table3.col1);
But since the table has a huge number of rows, it's taking too much time. What can I do to make it faster, other than indexing (which is not permitted)?
I am using an Oracle database.
I need to delete duplicate records from a table. The table contains 33 columns, of which only PK_NUM is the primary key column. Since PK_NUM is unique for every record, we can keep either the minimum or the maximum PK_NUM from each group of duplicates.
Total records in the table : 1766799022
Distinct records in the table : 69237983
Duplicate records in the table : 1697561039
Column details :
4 : Date data type
4 : Number data type
1 : Char data type
24 : Varchar2 data type
Size of table : 386 GB
DB details : Oracle Database 11g EE::11.2.0.2.0 ::64bit Production
Sample data :
col1 ,col2,col3
1,ABC,123
2,PQR,456
3,ABC,123
Expected data should contain only 2 records:
col1,col2,col3
1,ABC,123
2,PQR,456
* 1 can be replaced by 3, and vice versa.
My plan here is to:
- Pull the distinct records and store them in a backup table (i.e. using INSERT INTO ... SELECT).
- Truncate the existing table and move the records from the backup table back into it.

As the data size is huge, I want to know:
- What is the optimal SQL for retrieving the distinct records?
- Is there any estimate of how long the INSERT INTO ... SELECT and the truncate of the existing table will take?

Please let me know if there is any better way to achieve this. My ultimate goal is to remove the duplicates.
One option for making this memory efficient is to insert (nologging, append) all of the rows into a table that is hash partitioned on the list of columns on which duplicates are to be detected, or if there is a limitation on the number of columns, then on as many as you can use (aiming for those with maximum selectivity). With something like 1024 partitions, each one will ideally hold around 386 GB / 1024, i.e. roughly 400 MB of data.
You have then isolated all of the potential duplicates of each row into the same partition, and standard deduplication methods can run on each partition without as much memory consumption.
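A sketch of that staging step, with source_table as a placeholder for the original table and col1, col2, col3 standing in for the duplicate-detection columns used in the query below:
-- empty, hash-partitioned copy of the source table, loaded without redo logging
create table temp_table
nologging
partition by hash (col1, col2, col3)
partitions 1024
as
select * from source_table where 1 = 0;

-- direct-path load of all rows into the partitioned staging table
insert /*+ append */ into temp_table
select * from source_table;
commit;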
So for each partition you can do something like ...
-- keep one copy (the one with the highest rowid) from each group of duplicates
insert /*+ append */ into new_table
select *
from temp_table partition (p1) t1
where not exists (
      select null
      from temp_table partition (p1) t2
      where t1.col1 = t2.col1 and
            t1.col2 = t2.col2 and
            t1.col3 = t2.col3 and
            ... etc ...
            t1.rowid < t2.rowid);
The key to good performance here is that the hash table created to perform the anti-join in that query, which is going to be nearly as big as the partition itself, must be able to fit in memory. So if you can manage a 2 GB sort area, you need at least 386/2 = approximately 200 table partitions. Round up to the nearest power of two, so make it 256 table partitions in that case.
Try this:
rename table_name to table_name_dup;
and then:
create table table_name
as
select
min(col1) as col1
, col2
, col3
from table_name_dup
group by
col2
, col3;
As far as I know, not much temp tablespace is used, since the whole GROUP BY takes place in the target tablespace where the new table is created. Once finished, you can just drop the table with the duplicates:
drop table table_name_dup;
I have a table that contains log entries for a program I'm writing. I'm looking for ideas on an SQL query (I'm using SQL Server Express 2005) that will keep the newest X number of records, and delete the rest. I have a datetime column that is a timestamp for the log entry.
I figure something like the following would work, but I'm not sure of the performance with the IN clause for larger numbers of records. Performance isn't critical, but I might as well do the best I can the first time.
DELETE FROM MyTable WHERE PrimaryKey NOT IN
(SELECT TOP 10000 PrimaryKey FROM MyTable ORDER BY TimeStamp DESC)
I should mention that this query will run 3-4 times a day (as part of another process), so the number of records that will be deleted with each query will be small in comparison to the number of records that will be kept.
Try this:
DECLARE @X int
SELECT @X = COUNT(*) FROM MyTable
SET @X = @X - 10000
DELETE MyTable
WHERE PrimaryKey IN (SELECT TOP (@X) PrimaryKey
                     FROM MyTable
                     ORDER BY TimeStamp ASC
                     )
It kind of depends on whether you are deleting fewer than 10,000 rows; if so, this might run faster, as it identifies the rows to delete rather than the rows to keep.
Try this: it uses a CTE to get the row ordinal number, and then deletes only X rows at a time. You can alter this variable to suit your server. Adding the READPAST table hint should keep the delete from blocking on rows locked by other sessions:
DECLARE @numberToDelete INT;
DECLARE @ROWSTOKEEP INT;

SET @ROWSTOKEEP = 50000;
SET @numberToDelete = 1000;

WHILE 1 = 1
BEGIN
    WITH ROWSTODELETE AS
    (
        SELECT ROW_NUMBER() OVER (ORDER BY dtsTimeStamp DESC) rn,
               *
        FROM MyTable WITH (READPAST)
    )
    DELETE TOP (@numberToDelete) FROM ROWSTODELETE
    WHERE rn > @ROWSTOKEEP;

    IF @@ROWCOUNT = 0
        BREAK;
END;
The query you have is about as efficient as it gets, and is readable.
NOT IN and NOT EXISTS are more efficient than LEFT JOIN/IS NULL, but only because both columns can never be null. You can read this link for a more in-depth comparison.
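For reference, a NOT EXISTS version of the same delete would look something like this (same table and column names as in the question):
DELETE FROM MyTable
WHERE NOT EXISTS (SELECT 1
                  FROM (SELECT TOP 10000 PrimaryKey
                        FROM MyTable
                        ORDER BY TimeStamp DESC) k
                  WHERE k.PrimaryKey = MyTable.PrimaryKey)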
This depends on your scenario (whether it's feasible for you) and how many rows you have, but there is a potentially far more optimal approach.
- Create a new copy of the log table with a new name
- Insert into the new table the most recent 10,000 records from the original table
- Drop the original table (or rename it)
- Rename the new table to the proper name
This obviously requires more thought than just deleting rows (e.g. if the table has an IDENTITY column, this needs to be set up on the new table, etc.). But if you have a large table, it is more efficient to copy 10,000 rows to a new table and drop the original than to delete millions of rows just to leave 10,000 behind.
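A rough sketch of those steps, using the question's table name; note that nothing else (IDENTITY seed, constraints, permissions) is carried across here:
-- copy the newest 10,000 rows into a new table
SELECT TOP 10000 *
INTO MyTable_New
FROM MyTable
ORDER BY TimeStamp DESC;

-- swap the tables, keeping the old one around until verified
EXEC sp_rename 'MyTable', 'MyTable_Old';
EXEC sp_rename 'MyTable_New', 'MyTable';

-- once happy with the result
DROP TABLE MyTable_Old;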
DELETE FROM MyTable
WHERE TimeStamp < (SELECT MIN(TimeStamp)
                   FROM (SELECT TOP 10000 TimeStamp
                         FROM MyTable
                         ORDER BY TimeStamp DESC) t)
or
DELETE FROM MyTable
WHERE TimeStamp < (SELECT MIN(TimeStamp)
                   FROM MyTable
                   WHERE PrimaryKey IN (SELECT TOP 10000 PrimaryKey
                                        FROM MyTable
                                        ORDER BY TimeStamp DESC))
Not sure whether these are an improvement as far as efficiency goes, though.