Deleting duplicate rows from sqlite database - sql

I have a huge table - 36 million rows - in SQLite3. In this very large table, there are two columns:
hash - text
d - real
Some of the rows are duplicates. That is, both hash and d have the same values. If two hashes are identical, then so are the values of d. However, two identical values of d do not imply two identical hashes.
I want to delete the duplicate rows. I don't have a primary key column.
What's the fastest way to do this?

You need a way to distinguish the rows. Based on your comment, you could use the special rowid column for that.
To delete duplicates by keeping the lowest rowid per (hash,d):
delete from YourTable
where rowid not in
(
select min(rowid)
from YourTable
group by
hash
, d
)
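To see what that statement does on a small scale, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (hash TEXT, d REAL)")
# Made-up sample data: 'a' appears twice, 'c' appears three times.
conn.executemany(
    "INSERT INTO YourTable (hash, d) VALUES (?, ?)",
    [("a", 1.0), ("a", 1.0), ("b", 2.0), ("c", 2.0), ("c", 2.0), ("c", 2.0)],
)

# Keep the lowest rowid in each (hash, d) group and delete everything else.
with conn:  # commits on success, rolls back on error
    cur = conn.execute(
        """
        DELETE FROM YourTable
        WHERE rowid NOT IN (
            SELECT MIN(rowid) FROM YourTable GROUP BY hash, d
        )
        """
    )
    deleted = cur.rowcount

remaining = conn.execute(
    "SELECT hash, d FROM YourTable ORDER BY hash"
).fetchall()
print(deleted, remaining)
```

On a 36-million-row table the GROUP BY will probably benefit from an index on (hash, d), but that is worth measuring rather than assuming.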

I guess the fastest would be to let the database itself do the work: create a new table with the same columns but with proper constraints (for example, a unique index on the hash/d pair), then iterate through the original table and try to insert each record into the new table, ignoring constraint violation errors (i.e. continue iterating when exceptions are raised).
Then delete the old table and rename the new to the old one.
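In SQLite the "iterate and ignore errors" part can probably be left to the engine with INSERT OR IGNORE, so no client-side loop is needed. A minimal sketch, assuming hypothetical table names old_table and new_table and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_table (hash TEXT, d REAL)")
conn.executemany(
    "INSERT INTO old_table (hash, d) VALUES (?, ?)",
    [("a", 1.0), ("a", 1.0), ("b", 2.0), ("b", 2.0), ("c", 3.0)],
)

with conn:
    # Same columns, plus a uniqueness constraint on the pair.
    conn.execute("CREATE TABLE new_table (hash TEXT, d REAL, UNIQUE (hash, d))")
    # OR IGNORE silently skips any row that would violate the constraint.
    # Note: rows containing NULLs are never considered equal by UNIQUE.
    conn.execute(
        "INSERT OR IGNORE INTO new_table (hash, d) SELECT hash, d FROM old_table"
    )
    conn.execute("DROP TABLE old_table")
    conn.execute("ALTER TABLE new_table RENAME TO old_table")

rows = conn.execute("SELECT hash, d FROM old_table ORDER BY hash").fetchall()
print(rows)
```

A side effect of this variant is that the unique constraint stays in place afterwards, so new duplicates are rejected at insert time.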

If adding a primary key is not an option, then one approach would be to store the duplicates DISTINCT in a temp table, delete all of the duplicated records from the existing table, and then add the records back into the original table from the temp table.
For example (written for SQL Server 2008, but the technique is the same for any database):
DECLARE @original AS TABLE([hash] varchar(20), [d] float)
INSERT INTO @original VALUES('A', 1)
INSERT INTO @original VALUES('A', 2)
INSERT INTO @original VALUES('A', 1)
INSERT INTO @original VALUES('B', 1)
INSERT INTO @original VALUES('C', 1)
INSERT INTO @original VALUES('C', 1)
-- Collect one copy of every (hash, d) pair that occurs more than once
DECLARE @temp AS TABLE([hash] varchar(20), [d] float)
INSERT INTO @temp
SELECT [hash], [d] FROM @original
GROUP BY [hash], [d]
HAVING COUNT(*) > 1
-- Remove every copy of the duplicated rows from the original table
DELETE O
FROM @original O
JOIN @temp T ON T.[hash] = O.[hash] AND T.[d] = O.[d]
-- Put back a single copy of each duplicated row
INSERT INTO @original
SELECT [hash], [d] FROM @temp
SELECT * FROM @original
I'm not sure if sqlite has a ROW_NUMBER() type function, but if it does you could also try some of the approaches listed here: Delete duplicate records from a SQL table without a primary key
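For what it's worth, SQLite has supported window functions, including ROW_NUMBER(), since version 3.25.0. A rough sketch of the ROW_NUMBER() approach through Python's sqlite3 module follows; the table name YourTable is borrowed from the other answer, the rows mirror the sample data above, and the version check is there because the bundled SQLite library may be older.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (hash TEXT, d REAL)")
conn.executemany(
    "INSERT INTO YourTable (hash, d) VALUES (?, ?)",
    [("A", 1.0), ("A", 2.0), ("A", 1.0), ("B", 1.0), ("C", 1.0), ("C", 1.0)],
)

# Window functions need the SQLite library itself to be 3.25.0 or newer.
has_window_functions = sqlite3.sqlite_version_info >= (3, 25, 0)

if has_window_functions:
    with conn:
        # Number the rows inside each (hash, d) group; every row numbered
        # 2 or higher is a duplicate of the first one and gets deleted.
        conn.execute(
            """
            DELETE FROM YourTable
            WHERE rowid IN (
                SELECT rid FROM (
                    SELECT rowid AS rid,
                           ROW_NUMBER() OVER (
                               PARTITION BY hash, d ORDER BY rowid
                           ) AS rn
                    FROM YourTable
                ) AS numbered
                WHERE rn > 1
            )
            """
        )

rows = conn.execute("SELECT hash, d FROM YourTable ORDER BY hash, d").fetchall()
print(rows)
```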

The proposed solution was not working for me, so I ended up doing this:
CREATE TABLE temp_table AS SELECT DISTINCT * FROM your_table;
DROP TABLE your_table;
ALTER TABLE temp_table RENAME TO your_table;
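One thing to keep in mind with this swap: a table created with CREATE TABLE ... AS SELECT carries no constraints, and dropping the old table drops its indexes too, so they have to be recreated afterwards. A small sketch to illustrate (the index idx_hash and the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (hash TEXT, d REAL)")
conn.execute("CREATE INDEX idx_hash ON your_table (hash)")  # hypothetical index
conn.executemany(
    "INSERT INTO your_table (hash, d) VALUES (?, ?)",
    [("a", 1.0), ("a", 1.0), ("b", 2.0)],
)

with conn:
    conn.execute("CREATE TABLE temp_table AS SELECT DISTINCT * FROM your_table")
    conn.execute("DROP TABLE your_table")  # idx_hash disappears with the table
    conn.execute("ALTER TABLE temp_table RENAME TO your_table")

rows = conn.execute("SELECT hash, d FROM your_table ORDER BY hash").fetchall()
indexes = conn.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'index' AND tbl_name = 'your_table'"
).fetchall()
print(rows, indexes)
```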

Related

Using sequence while inserting data into 2 tables at same time

I am trying to insert data using a SELECT statement. The table I am inserting into has a foreign key whose value comes from a sequence. How do I accomplish this? If I insert the sequence values into the associated table first, how do I then get the list of all those sequence IDs to insert into this table?
Please note I am using INSERT with a SELECT statement, so is there a way to accomplish this without using a cursor?
I think you can extract sequence value and then re-use it as many times as you want:
DECLARE @NextValue INT
SELECT @NextValue = NEXT VALUE FOR MySequence
SELECT NextValue = @NextValue
INSERT INTO PrimaryTable(PK_ID) VALUES (@NextValue);
INSERT INTO SecondaryTable(FK_ID) VALUES (@NextValue);
Here is what I have tried.
DECLARE @MyTabVaR TABLE
(
FOREIGNKEY_ID INT,
COMMON_COL INT
);
INSERT INTO @MyTabVaR
SELECT NEXT VALUE FOR DBO.MY_SEQ,COMMON_COL FROM another_table2
INSERT INTO actual_table
SELECT FOREIGNKEY_ID FROM @MyTabVaR
INSERT INTO another_table
SELECT * FROM copy_table C
LEFT JOIN actual_table A
ON C.COMMON_COL=A.COMMON_COL
WHERE A.FOREIGNKEY_ID IS NOT NULL

SQL Server: return joined data from insert select

I perform these steps:
Create a temporary table and fill it with data and a unique order column [_oid]
Insert everything from the temporary table into the real table except the fictional [_oid], outputting the generated [Id]s
Return those generated [Id]s along with the corresponding [_oid]
SQL:
CREATE TABLE #temp
(
[Hash] INT NOT NULL,
[Size] INT NOT NULL,
[Data] NVARCHAR(MAX),
[_oid] INT NOT NULL
)
--here insert data into #temp--
INSERT [dbo].[TestObjects]
OUTPUT INSERTED.[Id]
SELECT [Hash], [Size], [Data]
FROM #temp
DROP TABLE #temp
How can I return ([Id], [_oid]) rows? Or at least return [Id] ordered by [_oid]?
I know INSERT does not preserve the order of inserted items in its output, but still...
I think what you are asking for is INSERT INTO, like so:
INSERT INTO [dbo].[TestObjects]
SELECT Hash, Size, Data FROM #temp
ORDER BY _oid
But as you say, there's no guarantee about order when you select from TestObjects, so if it's important can you not have a field in TestObjects you can ORDER BY when you SELECT from it?
If your insert into #temp is such that both _oid and (Hash, Size, Data) are unique for each row (i.e. keys), then you could retrieve the inserted _oid from #temp:
select t.[_oid], o.[Id]
from #temp t
inner join [dbo].[TestObjects] o
on t.Hash=o.Hash and t.Size=o.Size and t.Data=o.Data
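The join-back idea is not specific to SQL Server. Here is a minimal sketch of it against SQLite through Python's sqlite3 module; the staging table name and the sample rows are hypothetical, and it only works under the stated assumption that (Hash, Size, Data) is unique per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TestObjects (
        Id   INTEGER PRIMARY KEY AUTOINCREMENT,
        Hash INTEGER NOT NULL,
        Size INTEGER NOT NULL,
        Data TEXT
    );
    -- 'staging' stands in for #temp; _oid is the caller's own ordering key.
    CREATE TEMP TABLE staging (
        Hash INTEGER NOT NULL,
        Size INTEGER NOT NULL,
        Data TEXT,
        _oid INTEGER NOT NULL
    );
    INSERT INTO staging VALUES (10, 1, 'x', 3), (20, 2, 'y', 1), (30, 3, 'z', 2);
    INSERT INTO TestObjects (Hash, Size, Data)
        SELECT Hash, Size, Data FROM staging;
    """
)

# Join back on the natural key to pair each _oid with its generated Id.
# Rows whose Data is NULL would not match here, because NULL = NULL is not true.
pairs = conn.execute(
    """
    SELECT s._oid, o.Id
    FROM staging AS s
    JOIN TestObjects AS o
      ON s.Hash = o.Hash AND s.Size = o.Size AND s.Data = o.Data
    ORDER BY s._oid
    """
).fetchall()
print(pairs)
```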
As suggested by George Menoutis, I used MERGE:
MERGE [dbo].[TestObjects] AS T_Base
USING #temp AS T_Source
ON (0<>0)
WHEN NOT MATCHED THEN INSERT ([Hash],[Size],[Data]) VALUES (T_Source.[Hash],T_Source.[Size],T_Source.[Data])
OUTPUT INSERTED.[Id], T_Source.[_oid];
If anyone has a better approach, feel free to contribute to this answer.

Inserting multiple rows in temp table without loop

This question has already been asked several times, but the solutions are not working for me, and I don't know why.
I am trying to create a temp table in a SQL query and insert some records into it using INSERT ... SELECT, but every time it returns an empty result:
Here is what I am trying:
Create Table #TempTable
(
EntityID BIGINT
)
INSERT INTO #TempTable (EntityID)
SELECT pkEntityID FROM Employee WHERE EmpID = 45
Select * from #TempTable
There are 10 rows in the Employee table for EmpID = 45. Do I have to do something else, such as a loop, because only one row can be inserted into a table at a time?
This has been stated in the comments, all of which I upvoted, but to answer your question... there isn't anything else you have to do. There clearly isn't an EmpID = 45 in your source table. Here's a reproducible example:
Declare @Employee Table (pkEntityID bigint, EmpID int)
insert into @Employee (pkEntityID, EmpID)
values
(32168123,45),
(89746541,45),
(55566331,45),
(45649224,12)
Create Table #TempTable
(
EntityID BIGINT
)
INSERT INTO #TempTable (EntityID)
SELECT pkEntityID FROM @Employee WHERE EmpID = 45
Select * from #TempTable
drop table #TempTable
Have you accidentally also created the Employee table in the master database and you are currently connected to the master database?

How to delete duplicate rows in sybase, when you have no unique key?

Yes, you can find similar questions numerous times, but:
The most elegant solutions posted here work for SQL Server, but not for Sybase (in my case Sybase Anywhere 11). I have even found some Sybase-related questions marked as duplicates of SQL Server questions, which doesn't help.
One example of a solution I liked, but which didn't work, is the WITH ... DELETE ... construct.
I have found working solutions using cursors or while-loops, but I hope it is possible without loops.
I am hoping for a nice, simple and fast query that deletes all but one of each set of exact duplicates.
Here is a little framework for testing:
IF OBJECT_ID( 'tempdb..#TestTable' ) IS NOT NULL
DROP TABLE #TestTable;
CREATE TABLE #TestTable (Column1 varchar(1), Column2 int);
INSERT INTO #TestTable VALUES ('A', 1);
INSERT INTO #TestTable VALUES ('A', 1); -- duplicate
INSERT INTO #TestTable VALUES ('A', 1); -- duplicate
INSERT INTO #TestTable VALUES ('A', 2);
INSERT INTO #TestTable VALUES ('B', 1);
INSERT INTO #TestTable VALUES ('B', 2);
INSERT INTO #TestTable VALUES ('B', 2); -- duplicate
INSERT INTO #TestTable VALUES ('C', 1);
INSERT INTO #TestTable VALUES ('C', 2);
SELECT * FROM #TestTable ORDER BY Column1,Column2;
DELETE <your solution here>
SELECT * FROM #TestTable ORDER BY Column1,Column2;
If all fields are identical, you can just do this:
select distinct *
into #temp_table
from table_with_duplicates
delete table_with_duplicates
insert into table_with_duplicates select * from #temp_table
If not all fields are identical, for example if you have an id that differs, then you'll need to list all the fields in the select statement and hard-code a value for the id to make the rows identical, provided that is a field you don't care about.
For example:
insert into #temp_table (field1, field2, id)
select distinct field1, field2, 999
from table_with_duplicates
This works well and fast:
DELETE FROM #TestTable
WHERE ROWID(#TestTable) IN (
SELECT rowid FROM (
SELECT ROWID(#TestTable) rowid,
ROW_NUMBER() OVER(PARTITION BY Column1,Column2 ORDER BY Column1,Column2) rownum
FROM #TestTable
) sub
WHERE rownum > 1
);
If you don't know OVER(PARTITION BY ...), just execute the inner SELECT statement to see what it does.
Here is another interesting one I found and adapted:
DELETE FROM #TestTable dupes
FROM #TestTable dupes, #TestTable fullTable
WHERE dupes.Column1 = fullTable.Column1
AND dupes.Column2 = fullTable.Column2
AND ROWID(dupes) > ROWID(fullTable);
or, if you like explicit joins more (I do):
DELETE FROM #TestTable dupes
FROM #TestTable dupes
INNER JOIN #TestTable fullTable
ON dupes.Column1 = fullTable.Column1
AND dupes.Column2 = fullTable.Column2
AND ROWID(dupes) > ROWID(fullTable);
or the short form (a "natural" join incorporates identical column names automatically):
DELETE FROM #TestTable dupes
FROM #TestTable dupes
NATURAL JOIN #TestTable fullTable
ON ROWID(dupes) > ROWID(fullTable);
...if someone finds a solution not requiring ROWID(), I would be interested to see it.
Please try this:
create clustered index i1 on table table_name(column_name) with ignore_dup_row
create table #test(id int,name char(9))
insert into #test values(1,"A")
insert into #test values(1,"A")
create clustered index i1 on #test(id) with ignore_dup_row
select * from #test
OK, now that I know the ROWID() function, solutions for tables with a primary key (PK) can easily be adapted. This one first selects all rows to keep and then deletes the remaining ones:
DELETE FROM #TestTable
FROM #TestTable
LEFT OUTER JOIN (
SELECT MIN(ROWID(#TestTable)) rowid
FROM #TestTable
GROUP BY Column1, Column2
) AS KeepRows ON ROWID(#TestTable) = KeepRows.rowid
WHERE KeepRows.rowid IS NULL;
...or how about this shorter variant? I like!
DELETE FROM #TestTable
WHERE ROWID(#TestTable) NOT IN (
SELECT MIN(ROWID(#TestTable))
FROM #TestTable
GROUP BY Column1, Column2
);
The post that inspired me most contains a comment that NOT IN might be slower. But that's for SQL Server, and sometimes elegance is more important :) - I also think it all depends on good indexes.
Anyway, it is usually bad design to have tables without a PK. You should at least add an "autoinc" ID, and if you do, you can use that ID instead of the ROWID() function, which is a non-standard extension by Sybase (some others have it, too).

Using temporary table in where clause

I want to delete many rows with the same set of field values in some (6) tables. I could do this by deleting the result of one subquery in every table (Solution 1), which would be redundant, because the subquery would be the same every time; so I want to store the result of the subquery in a temporary table and delete the value of each row (of the temp table) in the tables (Solution 2). Which solution is the better one?
First solution:
DELETE FROM dbo.SubProtocols
WHERE ProtocolID IN (
SELECT ProtocolID
FROM dbo.Protocols
WHERE WorkplaceID = @WorkplaceID
)
DELETE FROM dbo.ProtocolHeaders
WHERE ProtocolID IN (
SELECT ProtocolID
FROM dbo.Protocols
WHERE WorkplaceID = @WorkplaceID
)
-- ...
DELETE FROM dbo.Protocols
WHERE WorkplaceID = @WorkplaceID
Second Solution:
DECLARE @Protocols table(ProtocolID int NOT NULL)
INSERT INTO @Protocols
SELECT ProtocolID
FROM dbo.Protocols
WHERE WorkplaceID = @WorkplaceID
DELETE FROM dbo.SubProtocols
WHERE ProtocolID IN (
SELECT ProtocolID
FROM @Protocols
)
DELETE FROM dbo.ProtocolHeaders
WHERE ProtocolID IN (
SELECT ProtocolID
FROM @Protocols
)
-- ...
DELETE FROM dbo.Protocols
WHERE WorkplaceID = @WorkplaceID
Is it possible to do solution 2 without the subquery? Say, doing WHERE ProtocolID IN @Protocols (but syntactically correct)?
I am using Microsoft SQL Server 2005.
While you can avoid the subquery in SQL Server with a join, like so:
delete from sp
from subprotocols sp
inner join protocols p on
sp.protocolid = p.protocolid
and p.workplaceid = @workplaceid
You'll find that this doesn't really gain you any performance over either of your approaches. Generally, SQL Server 2005 optimizes your subquery into an inner join, since it doesn't depend on each row of the outer query. Also, SQL Server will probably cache the subquery in your case, so shoving it into a temp table is most likely unnecessary.
The first way, though, would be susceptible to changes in Protocols during the transactions, where the second one wouldn't. Just something to think about.
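To make the "snapshot" point concrete, here is a minimal sketch of the second approach against SQLite via Python's sqlite3 module. It is not T-SQL; the temp table name doomed and the sample rows are made up. Unlike the question's Solution 2, the final delete also goes through the snapshot, so every table is cleaned using exactly the same set of keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Protocols (ProtocolID INTEGER PRIMARY KEY, WorkplaceID INTEGER);
    CREATE TABLE SubProtocols (ProtocolID INTEGER);
    CREATE TABLE ProtocolHeaders (ProtocolID INTEGER);
    INSERT INTO Protocols VALUES (1, 7), (2, 7), (3, 8);
    INSERT INTO SubProtocols VALUES (1), (2), (3);
    INSERT INTO ProtocolHeaders VALUES (1), (3);
    """
)

workplace_id = 7
with conn:  # one transaction around the whole clean-up
    # Snapshot the keys once; later changes to Protocols cannot alter this set.
    conn.execute("CREATE TEMP TABLE doomed (ProtocolID INTEGER PRIMARY KEY)")
    conn.execute(
        "INSERT INTO doomed SELECT ProtocolID FROM Protocols WHERE WorkplaceID = ?",
        (workplace_id,),
    )
    # Child tables first, parent last, all driven by the same snapshot.
    for table in ("SubProtocols", "ProtocolHeaders", "Protocols"):
        conn.execute(
            f"DELETE FROM {table} "
            "WHERE ProtocolID IN (SELECT ProtocolID FROM doomed)"
        )

sub = conn.execute("SELECT ProtocolID FROM SubProtocols").fetchall()
headers = conn.execute("SELECT ProtocolID FROM ProtocolHeaders").fetchall()
protocols = conn.execute("SELECT ProtocolID, WorkplaceID FROM Protocols").fetchall()
print(sub, headers, protocols)
```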
You can try this:
DELETE FROM dbo.ProtocolHeaders
FROM dbo.ProtocolHeaders INNER JOIN
dbo.Protocols ON ProtocolHeaders.ProtocolID = Protocols.ProtocolID
WHERE Protocols.WorkplaceID = @WorkplaceID
DELETE ... FROM is a T-SQL extension to the standard SQL DELETE that provides an alternative to using a subquery. From the help:
D. Using DELETE based on a subquery and using the Transact-SQL extension
The following example shows the Transact-SQL extension used to delete records from a base table that is based on a join or correlated subquery. The first DELETE statement shows the SQL-2003-compatible subquery solution, and the second DELETE statement shows the Transact-SQL extension. Both queries remove rows from the SalesPersonQuotaHistory table based on the year-to-date sales stored in the SalesPerson table.
-- SQL-2003 Standard subquery
USE AdventureWorks;
GO
DELETE FROM Sales.SalesPersonQuotaHistory
WHERE SalesPersonID IN
(SELECT SalesPersonID
FROM Sales.SalesPerson
WHERE SalesYTD > 2500000.00);
GO
-- Transact-SQL extension
USE AdventureWorks;
GO
DELETE FROM Sales.SalesPersonQuotaHistory
FROM Sales.SalesPersonQuotaHistory AS spqh
INNER JOIN Sales.SalesPerson AS sp
ON spqh.SalesPersonID = sp.SalesPersonID
WHERE sp.SalesYTD > 2500000.00;
GO
You would want, in your second solution, something like
-- untested!
DELETE FROM
dbo.SubProtocols -- ProtocolHeaders, etc
FROM
dbo.SubProtocols
INNER JOIN @Protocols p ON SubProtocols.ProtocolID = p.ProtocolID
However!!
Is it not possible to alter your design so that all the subsidiary protocol tables have a FOREIGN KEY with DELETE CASCADE to the main Protocols table? Then you could just DELETE from Protocols and the rest would be taken care of...
edit to add:
If you already have FOREIGN KEYs set up, you would need to use DDL to alter them (I think a drop and recreate is required) in order for them to have DELETE CASCADE turned on. Once that is in place, a DELETE from the main table will automatically DELETE related records from the child table.
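As a rough illustration of what cascading deletes buy you, here is a sketch using SQLite through Python's sqlite3 module rather than SQL Server. The SubProtocolID column and the sample rows are made up, and the PRAGMA line is a SQLite-specific requirement, not something SQL Server needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is enabled on the connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE Protocols (
        ProtocolID  INTEGER PRIMARY KEY,
        WorkplaceID INTEGER NOT NULL
    );
    CREATE TABLE SubProtocols (
        SubProtocolID INTEGER PRIMARY KEY,
        ProtocolID    INTEGER NOT NULL
            REFERENCES Protocols (ProtocolID) ON DELETE CASCADE
    );
    INSERT INTO Protocols VALUES (1, 7), (2, 7), (3, 8);
    INSERT INTO SubProtocols VALUES (10, 1), (11, 2), (12, 3);
    """
)

# A single DELETE on the parent; the child rows go away automatically.
with conn:
    conn.execute("DELETE FROM Protocols WHERE WorkplaceID = ?", (7,))

children = conn.execute(
    "SELECT SubProtocolID, ProtocolID FROM SubProtocols"
).fetchall()
print(children)
```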
Without the temp table you risk deleting different rows in the second delete, but with it the job takes three operations.
You could instead delete from the first table and use the OUTPUT INTO clause to insert all the deleted IDs into a temp table, and then use that temp table to delete from the second table. This makes sure you delete only the same keys, and with only two statements.
declare @x table(RowID int identity(1,1) primary key, ValueData varchar(3))
declare @y table(RowID int identity(1,1) primary key, ValueData varchar(3))
declare @temp table (RowID int)
insert into @x values ('aaa')
insert into @x values ('bab')
insert into @x values ('aac')
insert into @x values ('bad')
insert into @x values ('aae')
insert into @x values ('baf')
insert into @x values ('aag')
insert into @y values ('aaa')
insert into @y values ('bab')
insert into @y values ('aac')
insert into @y values ('bad')
insert into @y values ('aae')
insert into @y values ('baf')
insert into @y values ('aag')
DELETE @x
OUTPUT DELETED.RowID
INTO @temp
WHERE ValueData like 'a%'
DELETE y
FROM @y y
INNER JOIN @temp t ON y.RowID=t.RowID
select * from @x
select * from @y
SELECT OUTPUT:
RowID ValueData
----------- ---------
2 bab
4 bad
6 baf
(3 row(s) affected)
RowID ValueData
----------- ---------
2 bab
4 bad
6 baf
(3 row(s) affected)