Delete duplicates in a table with a huge number of rows - SQL

I have a table with 19 million records. I want to delete duplicates, but the query I am using takes a very long time and the connection eventually times out.
This is the query I am using:
DELETE FROM [TableName]
WHERE id NOT IN
(SELECT MAX(id) FROM [TableName] GROUP BY field)
where id is the primary key and auto-increments.
I want to delete the rows that have duplicate values in field.
Is there a faster alternative to this query?
Any help would be appreciated.

I suggest temporarily adding an index on field to speed things up (a sketch follows). Maybe use the statement below it to delete (even though yours should work fine with the index).
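A sketch of that temporary index, assuming SQL Server; the index name is illustrative, and including id as a second key column lets the de-duplication read from the index alone:
-- Temporary covering index for the de-duplication; drop it when finished.
CREATE INDEX IX_TableName_field_id ON [TableName] (field, id);
-- ... run the duplicate delete here ...
DROP INDEX IX_TableName_field_id ON [TableName];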
My statement generates the list of ids that should be deleted. Assuming that id, as the primary key, is indexed, this is probably faster; it should also perform a little better than NOT IN.
with candidates as (
    SELECT id
         , ROW_NUMBER() over (PARTITION by field order by id desc) rn
    FROM [TableName]
)
delete
from candidates
where rn > 1

My answer is a spin on Brett Schneider's, with a batched approach (including a small wait) to avoid contention and alleviate explosive log file growth.
Set your initial @batchcount to something you think the server can handle -- you can also increase or decrease the wait time as needed. Once @@ROWCOUNT = 0, the loop will terminate.
declare @batchcount int, @totalrows int
set @totalrows = 0
set @batchcount = 10000 -- set this to some initial value
while @batchcount > 0
begin
    ;with dupes as (
        SELECT id
             , ROW_NUMBER() over (PARTITION by field order by id desc) rownum
        FROM [TableName]
    )
    delete top (@batchcount) t1
    from [TableName] t1
    join dupes c
      on c.id = t1.id
     and c.rownum > 1

    set @batchcount = @@ROWCOUNT -- record how many just got nuked
    set @totalrows = @totalrows + @batchcount -- track progress
    print cast(@totalrows as varchar) + ' rows have been deleted' -- show progress
    waitfor delay '00:00:05' -- wait 5 seconds for log writes, other queries etc
end
The print statement may not "show" on every loop in SSMS, but every so often you'll see SQL messages appear showing hundreds of iterations completed... be patient.

Create another heap table and insert into it the ids you want to delete. Then delete the records in the main table (where they exist in the heap table) in chunks of 1,000-5,000 each to avoid the timeout. Good luck!
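A minimal sketch of that approach, assuming SQL Server; dupe_ids is an illustrative name for the staging heap table:
-- Stage the ids of the duplicate rows (keep only the highest id per field value).
SELECT id
INTO dupe_ids
FROM [TableName]
WHERE id NOT IN (SELECT MAX(id) FROM [TableName] GROUP BY field);
-- Delete in small chunks so each transaction stays short.
WHILE 1 = 1
BEGIN
    DELETE TOP (5000) t
    FROM [TableName] t
    WHERE EXISTS (SELECT 1 FROM dupe_ids d WHERE d.id = t.id);
    IF @@ROWCOUNT = 0 BREAK;
END
DROP TABLE dupe_ids;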

Related

SQL Delete with group by clause

I want to write a SQL command that will delete all rows with a 0 as the last digit in a column, then a 1 as the last digit, a 2 as the last digit, and so on.
delete from BASE_TABLE where SET = 'ABCD'
This statement would delete 400,000+ rows at once and my service can't handle this.
To break it up, I wanted to say something like
delete from BASE_TABLE
where BASE_TABLE.SET = 'ABCD'
and BASE_TABLE.TAG LIKE '%0'
After I delete everything with LIKE '%0' I would want to delete everything with LIKE '%1', all the way through LIKE '%9'
Is this possible?
You can do that in a while loop. However, a more typical approach would be:
delete top (10) percent from BASE_TABLE
where BASE_TABLE.SET = 'ABCD';
Or, just a fixed number that you can handle:
delete top (1000) from BASE_TABLE
where BASE_TABLE.SET = 'ABCD';
You can then put this in a loop:
declare @x int;
set @x = 1;
while @x > 0 begin
    delete top (1000) from BASE_TABLE
    where BASE_TABLE.SET = 'ABCD';
    set @x = @@ROWCOUNT;
end;
Don't put the percent method in the while loop. It will keep going for a long time -- because the percent is based on the rows in the table at the time. I believe it will eventually delete the last row, but there will be a lot of iterations.
If system load is all you care about, just delete the top X rows. Here's the syntax
DELETE TOP (top_value) [ PERCENT ]
FROM table
[WHERE conditions];
You could apply an order by if you want to delete in a certain order. If you want the process to do all the work for you, stick it in a loop until the table is empty.
You need [0-9]% in the WHERE clause.
This will delete all the rows beginning with a digit 0 to 9; if you want rows ending with a digit, shift the % to the left:
%[0-9]
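For completeness, a sketch of the digit-by-digit loop the question asks for, using the column names from the question (SET is bracketed because it is a reserved word) and assuming SQL Server 2008+ for the inline DECLARE initialization:
-- Delete rows ending in 0, then 1, ... then 9, one digit per pass.
DECLARE @d int = 0;
WHILE @d <= 9
BEGIN
    DELETE FROM BASE_TABLE
    WHERE BASE_TABLE.[SET] = 'ABCD'
      AND BASE_TABLE.TAG LIKE '%' + CAST(@d AS varchar(1));
    SET @d = @d + 1;
END;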

SQL Server Concurrency in update

I have a TABLE:
id status mod runat
1 0 null null
2 0 null null
3 0 null null
4 0 null null
And I call this query two times, at the same time.
UPDATE TABLE
SET
status = 1,
mod = GETDATE()
OUTPUT INSERTED.id
WHERE id = (
SELECT TOP (1) id
FROM TABLE
WHERE STATUS = 0
AND NOT EXISTS(SELECT * FROM TABLE WHERE STATUS = 1)
AND COALESCE(runat, GETDATE()) <= GETDATE()
ORDER BY ID ASC)
And... sometimes I get:
1
1
instead of
1
NULL
Why? Isn't the UPDATE query transactional?
Short answer
Add WITH (UPDLOCK, HOLDLOCK) to the SELECT:
UPDATE TABLE
SET
status = 1,
mod = GETDATE()
OUTPUT INSERTED.id
WHERE id = (
SELECT TOP (1) id
FROM TABLE WITH (UPDLOCK, HOLDLOCK)
WHERE STATUS = 0
AND NOT EXISTS(SELECT * FROM TABLE WHERE STATUS = 1)
AND COALESCE(runat, GETDATE()) <= GETDATE()
ORDER BY ID ASC)
Explanation
Because you are using a subquery to get the id, there are basically two statements being run here: a SELECT and an UPDATE. When 1 is returned twice, it just means both SELECTs ran before either UPDATE completed. If you add UPDLOCK, then when the first statement runs it holds the update lock, and the second SELECT has to wait for that lock to be released before it can execute.
More information
Exactly what happens will depend on the locking scheme of your database and the locks issued by other statements. This kind of update can even lead to deadlocks under certain circumstances.
Because the statements run so fast, it's hard to see what locks they are holding. To effectively slow things down, a good trick is to:
1. Open a session and run the first statement with a BEGIN TRAN at the start of it (don't include a COMMIT or ROLLBACK).
2. Run a query against sys.dm_tran_locks to see what locks are being held (see the sketch below).
3. Open a second session, run the second statement, and see what happens. If your locking scheme is set up correctly, it should wait for the first one to finish before it does anything.
4. Switch back to the first session and COMMIT to simulate it finishing.
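A minimal sketch of the lock inspection in step 2; run it in another session while the first transaction is still open:
-- Show the locks held in the current database, one row per lock request.
SELECT request_session_id,
       resource_type,
       request_mode,
       request_status
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID();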
Locking and data contention are complex areas with lots of possible solutions, but the link below should give you everything you need to know to decide how to approach this issue.
https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-transaction-locking-and-row-versioning-guide?view=sql-server-2017

Limit query size to control transaction log size

This query causes our transaction log to grow to 25 GB. The database is in SIMPLE recovery mode.
INSERT INTO updbl.dbo.PopulationRelatives
( personid,
personsex,
relativeid,
relativesex,
degree,
relationship,
maternalpaternal )
SELECT DISTINCT
personid = relative1,
relative1sex,
relative2,
relative2sex,
degree,
relationship = Rel1Rel2,
maternalpaternal
FROM UPDBwork.dbo.DegreeRelationship
By looping I was able to limit the growth to 8GB.
DECLARE @PID int, @MaxPID int, @BatchSize int, @ROWCOUNT int

SELECT @PID = 0, @BatchSize = 1000000, @ROWCOUNT = 0
SELECT @MaxPID = MAX(relative1) FROM UPDBwork.dbo.DegreeRelationship

WHILE @PID < @MaxPID + @BatchSize
BEGIN
    INSERT INTO updbl.dbo.PopulationRelatives
           ( personid,
             personsex,
             relativeid,
             relativesex,
             degree,
             relationship,
             maternalpaternal )
    SELECT DISTINCT
           personid = relative1,
           relative1sex,
           relative2,
           relative2sex,
           degree,
           relationship = Rel1Rel2,
           maternalpaternal
    FROM UPDBwork.dbo.DegreeRelationship
    WHERE relative1 BETWEEN @PID + 1 AND @PID + @BatchSize

    SET @PID = @PID + @BatchSize
    CHECKPOINT
END
This isn't the best strategy, as each loop produces a different number of rows depending on the DISTINCT values. Unfortunately, there is no good ID to partition the data on. Is there some way I could control the size of each batch? I was thinking of adding TOP(X), but the engine would still have to do a large calculation to satisfy the DISTINCT. A cursor would be great, but again, how do I find my DISTINCT values? I am just hoping for some brainstorming here.
Thanks.
Sounds like a bulk operation... if changing the recovery model is an option, temporarily change it to BULK_LOGGED. Here is a link that may be of help: http://technet.microsoft.com/en-us/library/ms175987(v=SQL.105).aspx
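A sketch of that recovery model switch, assuming updbl is the database being written to and that you switch back afterwards:
ALTER DATABASE updbl SET RECOVERY BULK_LOGGED;
-- ... run the INSERT ... SELECT (or the batched loop) here ...
ALTER DATABASE updbl SET RECOVERY SIMPLE;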

Fastest check if row exists in PostgreSQL

I have a bunch of rows that I need to insert into a table, but these inserts are always done in batches. So I want to check whether a single row from the batch exists in the table, because then I know they all were inserted.
So it's not a primary key check, but that shouldn't matter too much. I would like to check only a single row, so count(*) probably isn't good; it's something like EXISTS, I guess.
But since I'm fairly new to PostgreSQL, I'd rather ask people who know.
My batch contains rows with following structure:
userid | rightid | remaining_count
So if the table contains any rows with the provided userid, it means they are all present there.
Use the EXISTS keyword for a TRUE / FALSE return:
select exists(select 1 from contact where id=12)
How about simply:
select 1 from tbl where userid = 123 limit 1;
where 123 is the userid of the batch that you're about to insert.
The above query will return either an empty set or a single row, depending on whether there are records with the given userid.
If this turns out to be too slow, you could look into creating an index on tbl.userid.
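For example (the index name is illustrative):
CREATE INDEX tbl_userid_idx ON tbl (userid);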
if even a single row from batch exists in table, in that case I don't have to insert my rows because I know for sure they all were inserted.
For this to remain true even if your program gets interrupted mid-batch, I'd recommend that you make sure you manage database transactions appropriately (i.e. that the entire batch gets inserted within a single transaction).
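A minimal sketch of that, borrowing the user_right table name used further down this page, with illustrative values:
BEGIN;
INSERT INTO user_right (userid, rightid, remaining_count)
VALUES (123, 1, 10),
       (123, 2, 5);
COMMIT;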
INSERT INTO target( userid, rightid, count )
SELECT userid, rightid, count
FROM batch
WHERE NOT EXISTS (
SELECT * FROM target t2, batch b2
WHERE t2.userid = b2.userid
-- ... other keyfields ...
)
;
BTW: if you want the whole batch to fail in case of a duplicate, then (given a primary key constraint)
INSERT INTO target( userid, rightid, count )
SELECT userid, rightid, count
FROM batch
;
will do exactly what you want: either it succeeds, or it fails.
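That assumes the target already has a primary key covering the batch's identifying columns; a hypothetical example:
ALTER TABLE target ADD PRIMARY KEY (userid, rightid);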
If you think about the performance, maybe you can use PERFORM in a function, just like this:
PERFORM 1 FROM skytf.test_2 WHERE id=i LIMIT 1;
IF FOUND THEN
RAISE NOTICE ' found record id=%', i;
ELSE
RAISE NOTICE ' not found record id=%', i;
END IF;
As @MikeM pointed out:
select exists(select 1 from contact where id=12)
With an index on contact, it can usually reduce the time cost to 1 ms.
CREATE INDEX index_contact on contact(id);
SELECT 1 FROM user_right where userid = ? LIMIT 1
If your resultset contains a row then you do not have to insert. Otherwise insert your records.
select true from tablename where condition limit 1;
I believe that this is the query that postgres uses for checking foreign keys.
In your case, you could do this in one go too:
insert into yourtable select $userid, $rightid, $count where not exists (select true from yourtable where userid = $userid limit 1);

Paging in Pervasive SQL

How do I do paging in Pervasive SQL (version 9.1)? I need to do something similar to:
//MySQL
SELECT foo FROM table LIMIT 10, 10
But I can't find a way to define offset.
Tested query in PSQL:
select top n *
from tablename
where id not in (
    select top k id
    from tablename
)
where n = the number of records you need to fetch at a time, and k = a multiple of n (e.g. n = 5; k = 0, 5, 10, 15, ...); a concrete instance is sketched below.
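For instance, a hypothetical fetch of rows 11-15 (n = 5, k = 10) from a table keyed by id would look like:
select top 5 *
from tablename
where id not in (
    select top 10 id
    from tablename
)
Note that without an ORDER BY the page boundaries are only stable if the engine happens to return rows in key order.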
Our paging required that we be able to pass in the current page number and page size (along with some additional filter parameters) as variables. Since a SELECT TOP @page_size doesn't work in MS SQL, we came up with creating a temporary table or table variable to assign each row's primary key an identity that can later be filtered on for the desired page number and size.
** Note that if you have a GUID primary key or a compound key, you just have to change the objectid on the temporary table to a uniqueidentifier or add the additional key columns to the table.
The downside to this is that it still has to insert all of the results into the temporary table, but at least it is only the keys. This works in MS SQL, but should be able to work for any DB with minimal tweaks.
DECLARE @page_number int, @page_size int
-- add any additional search parameters here

--create the temporary table with the identity column and the id
--of the record that you'll be selecting. This is an in-memory
--table, so if the number of rows you'll be inserting is greater
--than 10,000, then you should use a temporary table in tempdb
--instead. To do this, use
--CREATE TABLE #temp_table (row_num int IDENTITY(1,1), objectid int)
--and change all the references to @temp_table to #temp_table
DECLARE @temp_table TABLE (row_num int IDENTITY(1,1), objectid int)

--insert into the temporary table the ids of the records
--we want to return. It's critical to make sure the order by
--reflects the order of the records to return so that the row_num
--values are set in the correct order and we are selecting the
--correct records based on the page
INSERT INTO @temp_table (objectid)
/* Example: Select that inserts records into the temporary table
SELECT personid FROM person WITH (NOLOCK)
inner join degree WITH (NOLOCK) on degree.personid = person.personid
WHERE person.lastname = @last_name
ORDER BY person.lastname asc, person.firstname asc
*/

--get the total number of rows that we matched
DECLARE @total_rows int
SET @total_rows = @@ROWCOUNT

--calculate the total number of pages based on the number of
--rows that matched and the page size passed in as a parameter
DECLARE @total_pages int

--add @page_size - 1 to the total number of rows to
--calculate the total number of pages. This is because sql
--always rounds down for division of integers
SET @total_pages = (@total_rows + @page_size - 1) / @page_size

--return the result set we are interested in by joining
--back to the @temp_table and filtering by row_num
/* Example: Selecting the data to return. If the insert was done
properly, then you should always be joining the table that contains
the rows to return to the objectid column on the @temp_table
SELECT person.* FROM person WITH (NOLOCK)
INNER JOIN @temp_table tt ON person.personid = tt.objectid
*/

--return only the rows in the page that we are interested in
--and order by the row_num column of the @temp_table to make sure
--we are selecting the correct records
WHERE tt.row_num < (@page_size * @page_number) + 1
  AND tt.row_num > (@page_size * @page_number) - @page_size
ORDER BY tt.row_num
I face this problem in MS SQL too... no LIMIT or row-number functions. What I do is insert the keys for my final query result (or sometimes the entire list of fields) into a temp table with an identity column... then I delete from the temp table everything outside the range I want... then use a join against the keys and the original table to bring back the items I want. This works if you have a nice unique key - if you don't, well... that's a design problem in itself.
Alternative with slightly better performance is to skip the deleting step and just use the row numbers in your final join. Another performance improvement is to use the TOP operator so that at the very least, you don't have to grab the stuff past the end of what you want.
So... in pseudo-code... to grab items 80-89...
create table #keys (rownum int identity(1,1), key varchar(10))
insert #keys (key)
select TOP 89 key from myTable ORDER BY whatever
delete #keys where rownum < 80
select <columns> from #keys join myTable on #keys.key = myTable.key
I ended up doing the paging in code. I just skip the first records in a loop.
I thought I had come up with an easy way of doing the paging, but it seems that Pervasive SQL doesn't allow ORDER BY clauses in subqueries. This should work on other DBs, though (I tested it on Firebird):
select *
from (
    select top [rows] *
    from (
        select top [rows * pagenumber] *
        from mytable
        order by id
    )
    order by id desc
)
order by id