Best self join technique when checking for duplicates

Best self join technique when checking for duplicates - sql

i'm trying to optimize a query that is in production which is taking a long time. The goal is to find duplicate records based on matching field values criteria and then deleting them. The current query uses a self join via inner join on t1.col1 = t2.col1 then a where clause to check the values.
select * from table t1
inner join table t2 on t1.col1 = t2.col1
where t1.col2 = t2.col2 ...
What would be a better way to do this? Or is it all the same based on indexes? Maybe
select * from table t1, table t2
where t1.col1 = t2.col1, t2.col2 = t2.col2 ...
this table has 100m+ rows.
MS SQL, SQL Server 2008 Enterprise
select distinct t2.id
from table1 t1 with (nolock)
inner join table1 t2 with (nolock) on t1.ckid=t2.ckid
left join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
where
t2.id > #Max_id and
t2.timestamp > t1.timestamp and
t2.rid = 2 and
isnull(t1.col1,'') = isnull(t2.col1,'') and
isnull(t1.cid,-1) = isnull(t2.cid,-1) and
isnull(t1.rid,-1) = isnull(t2.rid,-1)and
isnull(t1.typeid,-1) = isnull(t2.typeid,-1) and
isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1) and
isnull(t1.oid,'') = isnull(t2.oid,'') and
isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1)
and (
(
t3.uniqueoid = 1
)
or
(
t3.uniqueoid is null and
isnull(t1.col1,'') = isnull(t2.col1,'') and
isnull(t1.col2,'') = isnull(t2.col2,'') and
isnull(t1.rdid,-1) = isnull(t2.rdid,-1) and
isnull(t1.stid,-1) = isnull(t2.stid,-1) and
isnull(t1.huaid,-1) = isnull(t2.huaid,-1) and
isnull(t1.lpid,-1) = isnull(t2.lpid,-1) and
isnull(t1.col3,-1) = isnull(t2.col3,-1)
)
)

Why self join: this is an aggregate question.
Hope you have an index on col1, col2, ...
--DELETE table
--WHERE KeyCol NOT IN (
select
MIN(KeyCol) AS RowToKeep,
col1, col2,
from
table
GROUP BY
col12, col2
HAVING
COUNT(*) > 1
--)
However, this will take some time. Have a look at bulk delete techniques

You can use ROW_NUMBER() to find duplicate rows in one table.
You can check here

The two methods you give should be equivalent. I think most SQL engines would do exactly the same thing in both cases.
And, by the way, this won't work. You have to have at least one field that is differernt or every record will match itself.
You might want to try something more like:
select col1, col2, col3
from table
group by col1, col2, col3
having count(*)>1

For table with 100m+ rows, Using GROUPBY functions and using holding table will be optimized. Even though it translates into four queries.
STEP 1: create a holding key:
SELECT col1, col2, col3=count(*)
INTO holdkey
FROM t1
GROUP BY col1, col2
HAVING count(*) > 1
STEP 2: Push all the duplicate entries into the holddups. This is required for Step 4.
SELECT DISTINCT t1.*
INTO holddups
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2
STEP 3: Delete the duplicate rows from the original table.
DELETE t1
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2
STEP 4: Put the unique rows back in the original table. For example:
INSERT t1 SELECT * FROM holddups

To detect duplicates, you don't need to join:
SELECT col1, col2
FROM table
GROUP BY col1, col2
HAVING COUNT(*) > 1
That should be much faster.

In my experience, SQL Server performance is really bad with OR conditions. Probably it is not the self join but that with table3 that causes the bad performance. But without seeing the plan, I would not be sure.
In this case, it might help to split your query into two:
One with a WHERE condition t3.uniqueoid = 1 and one with a WHERE condition for the other conditons on table3, and then use UNION ALL to append one to the other.

Related

Combine two tables on 4 similar columns and keep the unique columns

I have two tables, one table has columns
"DATE","PRODUCT","SUBPRODUCT","ACTUALS","RANDOM1"
The 2nd table has
"DATE,"PRODUCT","SUBPRODUCT","ACTUALS","RANDOM2"
Date, product, subproduct, actuals are the same values for both tables, and random1 and random 2 are unique values.
I want the final table to be
"DATE","PRODUCT","SUBPRODUCT","ACTUALS","RANDOM1","RANDOM2"

A simple way in standard SQL uses join with the using clause:
select *
from table1 t1 join
table2 t2
using (col1, col2, col3, col4) ;
SQL Server doesn't support the using clause, so alas, you need to list columns. A typical method is:
select t1.*, t2.random2
from table1 t1 join
table2 t2
on t1.date = t2.date and
t1.product = t2.product and
t1.subproduct = t2.subproduct and
t1.actuals = t2.actuals;
Note: This only returns rows that match between both tables. If you want all rows, you can use a full join and tweak the select.

You can join tables using where clause like below
select t1.DATE, t1.PRODUCT, t1.SUBPRODUCT, t1.ACTUALS, t1.RANDOM1, t2.RANDOM2
from table1 t1, table2 t2
where t1.DATE = t2.DATE and t1.PRODUCT = t2.PRODUCT and t1.SUBPRODUCT = t2.SUBPRODUCT and t1.ACTUALS = t2.ACTUALS
It will return data like this

Compare a two column with the same table to remove duplicate

A sample table with two column and I need to compare the column 1 and column 2 to the same table records and need to remove the column 1 + column 2 = column 2+column 1.
I tried to do self join and case condition. But its not working

If I understand correctly, you can run a simple select like this if you have all reversed pairs in the table:
select col1, col2
from t
where col1 < col2;
If you have some singletons, then:
select col1, col2
from t
where col1 < col2 or
(col1 > col2 and
not exists (select 1
from t t2
where t2.col1 = t.col2 and
t2.col2 = t.col1
)
);

You can use the except operator.
"EXCEPT returns distinct rows from the left input query that aren't output by the right input query."
SELECT C1, C2 FROM table
Except
SELECT C2, C1 FROM table
Example with your given data set : dbfiddle

I am posting the answer based on oracle database and also the columns are string/varchar:
delete from table where rowid in (
select rowid from table
where column1 || column2 =column2 || column1 )
Feel free to provide more input and we can tweak the answer.

Okay. There might be a simpler way of doing this but this might work as well. {table} is to be replaced with your table name.
;with orderedtable as (select t1.col1, t1.col2, ROW_NUMBER() OVER(ORDER BY t1.col1, t1.col2 ASC) AS rownum
from (select distinct t2.col1, t2.col2 from {table} t2) as t1)
select f1.col1, f1.col2
from orderedtable f1
left join orderedtable f2 on f1.col1 = f2.col2 and f1.col2 = f2.col1 and f1.rownum < f2.rownum
where f2.rownum is null

The SQL below will get the reversed col1 and col2 rows:
select
distinct t2.col1,t1.col2
from
table t1
join
table t2 on t1.col1 = t2.col2 and t1.col2 = t2.col1
And when we get these reversed rows, we can except them with the left join clause, the complete SQL is:
select
t.col1,t.col2
from
table t
left join
(
select
distinct t2.col1,t1.col2
from
table t1
join
table t2 on t1.col1 = t2.col2 and t1.col2 = t2.col1
) tmp on t.col1 = tmp.col1 and t.col2 = tmp.col2
where
tmp.col1 is null
Is it clear?

Remove inner query in SQL

We have a SQL query which is not written as per the sql guideline. We have to change the query but if we change the logic and remove the inner query then it take to much time to execute. Below is the query:
select col1,
col2,
case
when col1 <> '' then(select top 1
col1
from table1 as BP
where bp.col1 = FD.col1 order by BP.col1)
when col2 <> '' then(select top 1
BP.col2
from table1 as BP
where BP.col2 = FD.col2 order by BP.col2)
else ''
end
from table2 FD
The above query is being used to insert the data into a temp table. The table1 has almost 100 million of data. Is there any way to remove the inline query along with the good performance. We have already created the indexes on table1. Any thought?

Try this
;WITH CTE
AS
(
SELECT
RN = ROW_NUMBER() OVER(ORDER BY COALESCE(T2.col1,T2.col2)),
T2.col1,
T2.col2,
T1Val = COALESCE(T2.col1,T2.col2,'')
FROM table2 T2
LEFT JOIN table1 T1
ON
(
(
ISNULL(T2.col1,'')<>'' AND T1.col1 = T2.col1
)
OR
(
ISNULL(T2.col2,'')<>'' AND T1.col2 = T2.col2
)
)
)
SELECT
*
FROM CTE
WHERE RN = 1

Here is my modest help:
You can already prepare and materialize your subquery1 and subquery2 (Group BY col1 or col2) <-- It will reduce the size of your table 1)
Split your main query (from table2 into 3 different queries)
1 with SELECT .. FROM table2 WHERE col1 <> ''
1 with SELECT .. FROM table2 WHERE col1 = '' AND col2 <> ''
1 with SELECT .. FROM table2 WHERE col1 = '' AND col2 = ''
Use an INNER JOIN with your table created in the first point.
(If you use SSIS you can // and use your inner join table into a Lookup)
If your col1 or col2 use a NVARCHAR(MAX) or a big size, you should have a look to a HashFunction (MD5 for example) and compare Hash instead.
Be sure to have all your indexes
At least if it is not performant, you can try with:
OUTER APPLY (SELECT TOP 1 .. )
Another idea should be:
SELECT col1, col2, col1 AS yourNewCol
FROM table2 T2
WHERE EXISTS( SELECT 1 FROM table1 T1 WHERE T1.col1 = T2.col1)
UNION ALL
SELECT col1, col2, col2 AS yourNewCol
FROM table2 T2
WHERE
NOT EXISTS( SELECT 1 FROM table1 T1 WHERE T1.col1 = T2.col1)
AND EXISTS( SELECT 1 FROM table1 T1 WHERE T1.col2 = T2.col2)
UNION ALL
...
I don't have a clean solution for you, but some ideas.
Let me know if it helps you.
Regards,
Arnaud

SQL command usage of in / or

I have an sql command similar to below one.
select * from table1
where table1.col1 in (select columnA from table2 where table2.keyColumn=3)
or table1.col2 in (select columnA from table2 where table2.keyColumn=3)
Its performance is really bad so how can I change this command? (pls note that the two sql commands in the paranthesis are exactly same.)

Try
select distinct t1.* from table1 t1
inner join table2 t2 ON t1.col1 =t2.columnA OR t1.col2 = t2.columnA

This is your query:
select *
from table1
where table1.col1 in (select columnA from table2 and t2.keyColumn = 3) or
table1.col2 in (select columnA from table2 and t2.keyColumn = 3);
Probably the best approach is to build an index on table2(keyColumn, columnA).
It is also possible that in has poor performance characteristics. So, you can try rewriting this as an exists query:
select *
from table1 t1
where exists (select 1 from table2 t2 where t2.columnA = t1.col1 and t2.keyColumn = 3) or
exists (select 1 from table2 t2 where t2.columnA = t2.col1 and t2.keyColumn = 3);
In this case, the appropriate index is table2(columnA, keyColumn).

Assuming you're doing this in VFP, use SYS(3054) to see how the query is being optimized and what part is not.

Are the main query and subqueries fully Rushmore-optimisable?
Since the subqueries do not appear to be correlated (i.e. they don't refer to table1 then as long as everything is fully supported by indexes you should be fine.

compare two table values

I have 2 tables table A and table B. In table B we have to check if all the column entered is exactly as in table A, means if a row exists in table B then the same row will be there in table A too. also table A may have rows which are not in table B. if there is a row which is not in table A and is there in table B, an alert should be displayed showing which element is extra in table B.
Can we do this using join? if so what will be the sql code?

this is the best picture about joins i've ever seen :)

You probably want to have a look at the following article
SQL SERVER – Introduction to JOINs – Basic of JOINs
This should give you a very clear understanding of JOINs in Sql.
From there you should be able to find the solution.
As an example, you would have to look at something like
TABLE1
Col1
Col2
Col3
Col4
TABLE2
Col1
Col2
Col3
Col4
--all rows that match
SELECT *
FROM TABLE1 t1 INNER JOIN
TABLE2 t2 ON t1.Col1 = t2.Col1
AND t1.Col2 = t2.Col2
...
AND t1.Col3 = t2.Col3
--rows only in TABLE1
SELECT *
FROM TABLE1 t1 LEFT JOIN
TABLE2 t2 ON t1.Col1 = t2.Col1
AND t1.Col2 = t2.Col2
...
AND t1.Col3 = t2.Col3
WHERE t2.Col1 IS NULL
--rows only in TABLE2
SELECT *
FROM TABLE1 t2 LEFT JOIN
TABLE2 t1 ON t1.Col1 = t2.Col1
AND t1.Col2 = t2.Col2
...
AND t1.Col3 = t2.Col3
WHERE t1.Col1 IS NULL

If you want to compare based on single column, then you can do something like this:
SELECT ID FROM B LEFT JOIN A ON B.ID = A.ID WHERE A.ID IS NULL;
The above query will give you the list of records that are not present in A but in B.
Instead if you want to compare the entire row, you can use the following approach:
SELECT COUNT(*) FROM B;
SELECT COUNT(*) FROM A;
SELECT COUNT(*) FROM (
SELECT * FROM B UNION SELECT * FROM A
)
If all the queries returns the same count then you can assume that both the tables are exactly equal.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Best self join technique when checking for duplicates - sql

Why self join: this is an aggregate question. Hope you have an index on col1, col2, ... --DELETE table --WHERE KeyCol NOT IN ( select MIN(KeyCol) AS RowToKeep, col1, col2, from table GROUP BY col12, col2 HAVING COUNT(*) > 1 --) However, this will take some time. Have a look at bulk delete techniques

You can use ROW_NUMBER() to find duplicate rows in one table. You can check here

To detect duplicates, you don't need to join: SELECT col1, col2 FROM table GROUP BY col1, col2 HAVING COUNT(*) > 1 That should be much faster.

Related

Combine two tables on 4 similar columns and keep the unique columns

Compare a two column with the same table to remove duplicate

Remove inner query in SQL

SQL command usage of in / or

compare two table values

Categories

Resources