Compare two columns within the same table to remove duplicates - sql

I have a sample table with two columns. I need to compare column 1 and column 2 against the other records in the same table and remove rows that are reversed duplicates, i.e. where (column 1, column 2) of one row equals (column 2, column 1) of another.
I tried a self join and a CASE condition, but it's not working.

If I understand correctly, you can run a simple select like this if you have all reversed pairs in the table:
select col1, col2
from t
where col1 < col2;
If you have some singletons, then:
select col1, col2
from t
where col1 < col2 or
(col1 > col2 and
not exists (select 1
from t t2
where t2.col1 = t.col2 and
t2.col2 = t.col1
)
);
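To check the second query against a small case, here is a minimal sketch in Python using the standard-library sqlite3 module. The table name t and the columns col1 and col2 come from the answer above; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
# Made-up sample: (a, b) has a reversed partner (b, a); (d, c) is a singleton.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", "b"), ("b", "a"), ("d", "c")],
)

query = """
SELECT col1, col2
FROM t
WHERE col1 < col2
   OR (col1 > col2
       AND NOT EXISTS (SELECT 1
                       FROM t t2
                       WHERE t2.col1 = t.col2
                         AND t2.col2 = t.col1))
ORDER BY col1, col2
"""
deduped = conn.execute(query).fetchall()
print(deduped)
```

One row of the reversed pair survives and the singleton is kept. Note that a row with col1 = col2 satisfies neither branch of the WHERE clause, so it would need a separate condition if such rows can occur.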

You can use the except operator.
"EXCEPT returns distinct rows from the left input query that aren't output by the right input query."
SELECT C1, C2 FROM table
Except
SELECT C2, C1 FROM table
Example with your given data set : dbfiddle
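Since the dbfiddle is not reproduced here, this is a quick check of what the EXCEPT query returns on a small made-up sample, sketched with Python's sqlite3 module. The table name pairs is an assumption (table itself is a reserved word), and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (C1 TEXT, C2 TEXT)")
# Made-up sample: one reversed pair and one row without a partner.
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?)",
    [("a", "b"), ("b", "a"), ("c", "d")],
)

rows = conn.execute("""
SELECT C1, C2 FROM pairs
EXCEPT
SELECT C2, C1 FROM pairs
""").fetchall()
print(rows)
```

On this sample only ('c', 'd') is returned: both (a, b) and (b, a) appear in the swapped right-hand input, so both are removed. If one row of each reversed pair should survive, the col1 < col2 filter from the earlier answer may be a better fit.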

I am posting the answer based on oracle database and also the columns are string/varchar:
delete from table where rowid in (
select rowid from table
where column1 || column2 =column2 || column1 )
Feel free to provide more input and we can tweak the answer.

Okay. There might be a simpler way of doing this but this might work as well. {table} is to be replaced with your table name.
;with orderedtable as (select t1.col1, t1.col2, ROW_NUMBER() OVER(ORDER BY t1.col1, t1.col2 ASC) AS rownum
from (select distinct t2.col1, t2.col2 from {table} t2) as t1)
select f1.col1, f1.col2
from orderedtable f1
left join orderedtable f2 on f1.col1 = f2.col2 and f1.col2 = f2.col1 and f1.rownum < f2.rownum
where f2.rownum is null

The SQL below will get the reversed col1 and col2 rows:
select
distinct t2.col1,t1.col2
from
table t1
join
table t2 on t1.col1 = t2.col2 and t1.col2 = t2.col1
Once we have these reversed rows, we can exclude them with a left join. The complete SQL is:
select
t.col1,t.col2
from
table t
left join
(
select
distinct t2.col1,t1.col2
from
table t1
join
table t2 on t1.col1 = t2.col2 and t1.col2 = t2.col1
) tmp on t.col1 = tmp.col1 and t.col2 = tmp.col2
where
tmp.col1 is null
Is it clear?

Related

Alternatives to the "Partition By" Statement in SQL

If I am interested in selecting records that have the largest value in each group of duplicate records (and some general conditions), I normally do this with the following SQL code:
select a.id, a.col2, a.col3
from (select id, col2, col3,
rank() over (partition by id order by col2 desc, col3 desc) as r1 from my_table where col2 > 5 and col3 > 5) a
where a.r1 =1
I am interested in learning alternate ways to do this.
Are there other popular ways to do this in (netezza) SQL?
Thank you!
One way to do it is with NOT EXISTS:
SELECT t1.id, t1.col2, t1.col3
FROM my_table t1
WHERE t1.col2 > 5 AND t1.col3 > 5
AND NOT EXISTS (
SELECT 1
FROM my_table t2
WHERE t2.id = t1.id AND t2.col2 > 5 AND t2.col3 > 5
AND (t2.col2 > t1.col2 OR (t2.col2 = t1.col2 AND t2.col3 > t1.col3))
);
Or, if you use a CTE to preselect from the table:
WITH cte AS (
SELECT id, col2, col3
FROM my_table
WHERE col2 > 5 AND col3 > 5
)
SELECT c1.*
FROM cte c1
WHERE NOT EXISTS (
SELECT 1
FROM cte c2
WHERE c2.id = c1.id
AND (c2.col2 > c1.col2 OR (c2.col2 = c1.col2 AND c2.col3 > c1.col3))
);
Depending on the requirement, the WHERE clause inside the subquery may be a lot more complex. This is one of the reasons that, if you can, you should use a window function.
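As a sanity check that the NOT EXISTS form picks the same rows as the rank() version, here is a small sketch using Python's sqlite3 module. It assumes an SQLite build with window-function support (3.25 or later); my_table and its columns come from the question, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_table (id INTEGER, col2 INTEGER, col3 INTEGER);
INSERT INTO my_table VALUES
  (1, 6, 7), (1, 8, 6), (1, 8, 9),
  (2, 6, 6), (2, 3, 10),
  (3, 4, 4);
""")

# Window-function version, as in the question.
window_sql = """
SELECT a.id, a.col2, a.col3
FROM (SELECT id, col2, col3,
             RANK() OVER (PARTITION BY id ORDER BY col2 DESC, col3 DESC) AS r1
      FROM my_table
      WHERE col2 > 5 AND col3 > 5) a
WHERE a.r1 = 1
ORDER BY a.id
"""

# NOT EXISTS version: keep a row only if no "larger" row exists in its group.
not_exists_sql = """
SELECT t1.id, t1.col2, t1.col3
FROM my_table t1
WHERE t1.col2 > 5 AND t1.col3 > 5
  AND NOT EXISTS (
    SELECT 1
    FROM my_table t2
    WHERE t2.id = t1.id AND t2.col2 > 5 AND t2.col3 > 5
      AND (t2.col2 > t1.col2 OR (t2.col2 = t1.col2 AND t2.col3 > t1.col3))
  )
ORDER BY t1.id
"""

by_window = conn.execute(window_sql).fetchall()
by_not_exists = conn.execute(not_exists_sql).fetchall()
print(by_window, by_not_exists)
```

Both queries return the same rows here. Ties behave the same way too: rank() gives tied top rows the value 1, and NOT EXISTS keeps every row that has no strictly larger row.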

inserting into a table records from another table with a condition

Let's say I have two tables t1 and t2.
t1 has two integer cols: col1 (primary key) and col2.
t2 has two cols: col1 (a foreign key to t1.col1) and col2.
I want to do the following:
Retrieve only the records where t1.col2 is unique, OR, if t1.col2 is duplicated, only those where t2.col2 is not null.
Insert the above records into another summary table, let's say t3.
This is what I tried:
insert into t3 (col1,col2)
select col1, col2
from t1
where t.col1 in (select A.col1 from t1 as A
group by 1
having count(*) > 1
union
select col1, col2
from t1, t2
where t.col1 in (select A.col1 from t1 as A
group by 1
having count(*) > 1
and t2.col2 is not null;
While the 'union qry' works on its own, the insert is not happening.
Any ideas or any other efficient way to achieve this please
You can select the records you want using:
select t1.*
from (select t1.*, count(*) over (partition by col2) as cnt
from t1
) t1
where cnt = 1 or
exists (select 1 from t2 where t2.col1 = t1.col1 and t2.col2 is not null);
The rest is just an insert.
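For completeness, a minimal sketch of the whole insert, run through Python's sqlite3 module (assuming an SQLite build with window functions, 3.25 or later). The question asks for duplicated t1.col2 values to be kept only when t2.col2 is not null, which is the condition used here; the sample data and the derived-table alias x are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (col1 INTEGER PRIMARY KEY, col2 INTEGER);
CREATE TABLE t2 (col1 INTEGER REFERENCES t1(col1), col2 TEXT);
CREATE TABLE t3 (col1 INTEGER, col2 INTEGER);
INSERT INTO t1 VALUES (1, 10), (2, 20), (3, 20), (4, 30), (5, 30);
INSERT INTO t2 VALUES (2, 'x'), (3, NULL), (4, NULL);
""")

# cnt = 1 means t1.col2 is unique; otherwise require a non-null t2.col2.
conn.execute("""
INSERT INTO t3 (col1, col2)
SELECT x.col1, x.col2
FROM (SELECT t1.*, COUNT(*) OVER (PARTITION BY col2) AS cnt
      FROM t1) AS x
WHERE x.cnt = 1
   OR EXISTS (SELECT 1
              FROM t2
              WHERE t2.col1 = x.col1
                AND t2.col2 IS NOT NULL)
""")

summary = conn.execute("SELECT col1, col2 FROM t3 ORDER BY col1").fetchall()
print(summary)
```

Row 1 is kept because its col2 is unique, and row 2 because its duplicate col2 has a non-null t2.col2. Rows 3, 4 and 5 are dropped.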

Remove inner query in SQL

We have a SQL query that does not follow our SQL guidelines. We have to change the query, but if we change the logic and remove the inner query, it takes too much time to execute. Below is the query:
select col1,
col2,
case
when col1 <> '' then(select top 1
col1
from table1 as BP
where bp.col1 = FD.col1 order by BP.col1)
when col2 <> '' then(select top 1
BP.col2
from table1 as BP
where BP.col2 = FD.col2 order by BP.col2)
else ''
end
from table2 FD
The above query is used to insert data into a temp table. table1 has almost 100 million rows. Is there any way to remove the inline query while keeping good performance? We have already created indexes on table1. Any thoughts?
Try this
;WITH CTE
AS
(
SELECT
RN = ROW_NUMBER() OVER(ORDER BY COALESCE(T2.col1,T2.col2)),
T2.col1,
T2.col2,
T1Val = COALESCE(T2.col1,T2.col2,'')
FROM table2 T2
LEFT JOIN table1 T1
ON
(
(
ISNULL(T2.col1,'')<>'' AND T1.col1 = T2.col1
)
OR
(
ISNULL(T2.col2,'')<>'' AND T1.col2 = T2.col2
)
)
)
SELECT
*
FROM CTE
WHERE RN = 1
Here is my modest help:
You can prepare and materialize your two subqueries in advance (GROUP BY col1 or col2). This reduces the size of table1.
Split your main query on table2 into 3 different queries:
1 with SELECT .. FROM table2 WHERE col1 <> ''
1 with SELECT .. FROM table2 WHERE col1 = '' AND col2 <> ''
1 with SELECT .. FROM table2 WHERE col1 = '' AND col2 = ''
Use an INNER JOIN with the table you created in the first point.
(If you use SSIS, you can run these in parallel and use your materialized table in a Lookup.)
If col1 or col2 is an NVARCHAR(MAX) or otherwise large, consider a hash function (MD5, for example) and compare the hashes instead.
Be sure to have all your indexes.
If that still does not perform well, you can try:
OUTER APPLY (SELECT TOP 1 .. )
Another idea would be:
SELECT col1, col2, col1 AS yourNewCol
FROM table2 T2
WHERE EXISTS( SELECT 1 FROM table1 T1 WHERE T1.col1 = T2.col1)
UNION ALL
SELECT col1, col2, col2 AS yourNewCol
FROM table2 T2
WHERE
NOT EXISTS( SELECT 1 FROM table1 T1 WHERE T1.col1 = T2.col1)
AND EXISTS( SELECT 1 FROM table1 T1 WHERE T1.col2 = T2.col2)
UNION ALL
...
I don't have a clean solution for you, but some ideas.
Let me know if it helps you.
Regards,
Arnaud

compare two table values

I have 2 tables, table A and table B. We have to check that every row entered in table B matches table A exactly, meaning that if a row exists in table B, the same row must also exist in table A. Table A may also have rows that are not in table B. If there is a row in table B that is not in table A, an alert should be displayed showing which element is extra in table B.
Can we do this using a join? If so, what would the SQL code be?
this is the best picture about joins i've ever seen :)
You probably want to have a look at the following article
SQL SERVER – Introduction to JOINs – Basic of JOINs
This should give you a very clear understanding of JOINs in Sql.
From there you should be able to find the solution.
As an example, you would have to look at something like
TABLE1
Col1
Col2
Col3
Col4
TABLE2
Col1
Col2
Col3
Col4
--all rows that match
SELECT *
FROM TABLE1 t1 INNER JOIN
TABLE2 t2 ON t1.Col1 = t2.Col1
AND t1.Col2 = t2.Col2
...
AND t1.Col3 = t2.Col3
--rows only in TABLE1
SELECT *
FROM TABLE1 t1 LEFT JOIN
TABLE2 t2 ON t1.Col1 = t2.Col1
AND t1.Col2 = t2.Col2
...
AND t1.Col3 = t2.Col3
WHERE t2.Col1 IS NULL
--rows only in TABLE2
SELECT *
FROM TABLE2 t2 LEFT JOIN
TABLE1 t1 ON t1.Col1 = t2.Col1
AND t1.Col2 = t2.Col2
...
AND t1.Col3 = t2.Col3
WHERE t1.Col1 IS NULL
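Here is a small runnable sketch of the "rows only in TABLE2" case, which is the one the question asks about (extra rows in table B). It uses Python's sqlite3 module; the two-column layout and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE1 (Col1 INTEGER, Col2 TEXT);
CREATE TABLE TABLE2 (Col1 INTEGER, Col2 TEXT);
INSERT INTO TABLE1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO TABLE2 VALUES (1, 'a'), (2, 'z');
""")

# Start from TABLE2 and keep only the rows that found no match in TABLE1.
only_in_table2 = conn.execute("""
SELECT t2.Col1, t2.Col2
FROM TABLE2 t2
LEFT JOIN TABLE1 t1
       ON t1.Col1 = t2.Col1
      AND t1.Col2 = t2.Col2
WHERE t1.Col1 IS NULL
""").fetchall()
print(only_in_table2)
```

The table you want to report on goes on the left side of the LEFT JOIN, and the IS NULL test goes on a column of the other table. Keep in mind that NULL values never compare equal, so rows containing NULLs in the compared columns will always show up as unmatched.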
If you want to compare based on single column, then you can do something like this:
SELECT B.ID FROM B LEFT JOIN A ON B.ID = A.ID WHERE A.ID IS NULL;
The above query will give you the list of records that are not present in A but in B.
If you instead want to compare entire rows, you can use the following approach:
SELECT COUNT(*) FROM B;
SELECT COUNT(*) FROM A;
SELECT COUNT(*) FROM (
SELECT * FROM B UNION SELECT * FROM A
) AS u;
If all three queries return the same count, and neither table contains duplicate rows, you can assume that both tables are exactly equal.
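A rough sketch of the three-count comparison, using Python's sqlite3 module. The column val, the derived-table alias u, and the sample rows are made up; the helper scalar is a hypothetical convenience function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (ID INTEGER, val TEXT);
CREATE TABLE B (ID INTEGER, val TEXT);
INSERT INTO A VALUES (1, 'x'), (2, 'y');
INSERT INTO B VALUES (1, 'x'), (2, 'CHANGED');
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

count_a = scalar("SELECT COUNT(*) FROM A")
count_b = scalar("SELECT COUNT(*) FROM B")
# UNION removes duplicates, so identical tables collapse to one copy of each row.
count_union = scalar(
    "SELECT COUNT(*) FROM (SELECT * FROM B UNION SELECT * FROM A) AS u"
)

tables_match = count_a == count_b == count_union
print(count_a, count_b, count_union, tables_match)
```

Here both tables have 2 rows but the union has 3, so the tables differ. The check only tells you that something differs, not which row, so the LEFT JOIN approach is still needed to name the extra element.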

Best self join technique when checking for duplicates

I'm trying to optimize a query in production that is taking a long time. The goal is to find duplicate records based on matching field values and then delete them. The current query uses a self join via inner join on t1.col1 = t2.col1, then a where clause to check the values.
select * from table t1
inner join table t2 on t1.col1 = t2.col1
where t1.col2 = t2.col2 ...
What would be a better way to do this? Or is it all the same based on indexes? Maybe
select * from table t1, table t2
where t1.col1 = t2.col1 and t1.col2 = t2.col2 ...
This table has 100m+ rows.
MS SQL, SQL Server 2008 Enterprise
select distinct t2.id
from table1 t1 with (nolock)
inner join table1 t2 with (nolock) on t1.ckid=t2.ckid
left join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
where
t2.id > #Max_id and
t2.timestamp > t1.timestamp and
t2.rid = 2 and
isnull(t1.col1,'') = isnull(t2.col1,'') and
isnull(t1.cid,-1) = isnull(t2.cid,-1) and
isnull(t1.rid,-1) = isnull(t2.rid,-1)and
isnull(t1.typeid,-1) = isnull(t2.typeid,-1) and
isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1) and
isnull(t1.oid,'') = isnull(t2.oid,'') and
isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1)
and (
(
t3.uniqueoid = 1
)
or
(
t3.uniqueoid is null and
isnull(t1.col1,'') = isnull(t2.col1,'') and
isnull(t1.col2,'') = isnull(t2.col2,'') and
isnull(t1.rdid,-1) = isnull(t2.rdid,-1) and
isnull(t1.stid,-1) = isnull(t2.stid,-1) and
isnull(t1.huaid,-1) = isnull(t2.huaid,-1) and
isnull(t1.lpid,-1) = isnull(t2.lpid,-1) and
isnull(t1.col3,-1) = isnull(t2.col3,-1)
)
)
Why a self join? This is an aggregation problem.
Hope you have an index on col1, col2, ...
--DELETE table
--WHERE KeyCol NOT IN (
select
MIN(KeyCol) AS RowToKeep,
col1, col2
from
table
GROUP BY
col1, col2
HAVING
COUNT(*) > 1
--)
However, this will take some time. Have a look at bulk delete techniques
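Here is a runnable variation on the commented-out DELETE, sketched with Python's sqlite3 module. It is not the exact statement above: the subquery returns only MIN(KeyCol) so that NOT IN is valid, and the HAVING clause is dropped so that non-duplicated rows also keep their key. The table name t and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (KeyCol INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
INSERT INTO t VALUES
  (1, 'a', 'x'), (2, 'a', 'x'), (3, 'b', 'y'), (4, 'a', 'x'), (5, 'b', 'z');
""")

# Keep the lowest key in each (col1, col2) group and delete everything else.
conn.execute("""
DELETE FROM t
WHERE KeyCol NOT IN (SELECT MIN(KeyCol)
                     FROM t
                     GROUP BY col1, col2)
""")

remaining = conn.execute(
    "SELECT KeyCol, col1, col2 FROM t ORDER BY KeyCol"
).fetchall()
print(remaining)
```

On a 100m+ row table a single DELETE like this can be slow and produce a large transaction, which is why batching or the bulk techniques mentioned above are worth a look.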
You can use ROW_NUMBER() to find duplicate rows in one table.
You can check here
The two methods you give should be equivalent. I think most SQL engines would do exactly the same thing in both cases.
And, by the way, this won't work as written. You need at least one field that is different, or every record will match itself.
You might want to try something more like:
select col1, col2, col3
from table
group by col1, col2, col3
having count(*)>1
For a table with 100m+ rows, using GROUP BY together with a holding table should perform better, even though it translates into four queries.
STEP 1: create a holding key:
SELECT col1, col2, col3=count(*)
INTO holdkey
FROM t1
GROUP BY col1, col2
HAVING count(*) > 1
STEP 2: Push all the duplicate entries into the holddups. This is required for Step 4.
SELECT DISTINCT t1.*
INTO holddups
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2
STEP 3: Delete the duplicate rows from the original table.
DELETE t1
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2
STEP 4: Put the unique rows back in the original table. For example:
INSERT t1 SELECT * FROM holddups
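The four steps can be tried end to end on a toy table. This is a sketch in Python's sqlite3 module, adapted to SQLite syntax: CREATE TABLE ... AS replaces SELECT ... INTO, and the DELETE uses EXISTS instead of the T-SQL DELETE ... FROM join. The sample rows and the column name cnt are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (col1 INTEGER, col2 TEXT);
INSERT INTO t1 VALUES (1, 'a'), (1, 'a'), (2, 'b'), (3, 'c'), (3, 'c'), (3, 'c');

-- Step 1: collect the keys that occur more than once.
CREATE TABLE holdkey AS
SELECT col1, col2, COUNT(*) AS cnt
FROM t1
GROUP BY col1, col2
HAVING COUNT(*) > 1;

-- Step 2: keep one copy of each duplicated row.
CREATE TABLE holddups AS
SELECT DISTINCT t1.*
FROM t1
JOIN holdkey ON t1.col1 = holdkey.col1 AND t1.col2 = holdkey.col2;

-- Step 3: delete every duplicated row from the original table.
DELETE FROM t1
WHERE EXISTS (SELECT 1
              FROM holdkey
              WHERE holdkey.col1 = t1.col1
                AND holdkey.col2 = t1.col2);

-- Step 4: put the single copies back.
INSERT INTO t1 SELECT * FROM holddups;
""")

rows = conn.execute("SELECT col1, col2 FROM t1 ORDER BY col1").fetchall()
print(rows)
```

Note that DISTINCT in step 2 only collapses rows that are identical in every column. If the table has other columns that differ between duplicates, holddups will still contain more than one row per key.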
To detect duplicates, you don't need to join:
SELECT col1, col2
FROM table
GROUP BY col1, col2
HAVING COUNT(*) > 1
That should be much faster.
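A minimal sketch of the join-free duplicate check, run through Python's sqlite3 module. The table name t, the extra count column n, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("b", "y"), ("c", "z"), ("c", "z"), ("c", "z")],
)

# One pass over the table: group on the candidate key and keep groups with > 1 row.
dupes = conn.execute("""
SELECT col1, col2, COUNT(*) AS n
FROM t
GROUP BY col1, col2
HAVING COUNT(*) > 1
ORDER BY col1, col2
""").fetchall()
print(dupes)
```

Unlike the self join, this also avoids every row matching itself, and it returns each duplicated key once instead of once per pair of matching rows.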
In my experience, SQL Server performance is really bad with OR conditions. It is probably not the self join but the join with table3 that causes the bad performance. But without seeing the plan, I can't be sure.
In this case, it might help to split your query into two:
One with a WHERE condition t3.uniqueoid = 1 and one with a WHERE condition for the other conditions on table3, and then use UNION ALL to append one to the other.
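To illustrate the shape of that rewrite, here is a heavily simplified sketch in Python's sqlite3 module. It only shows that the OR form and the UNION ALL form return the same rows; it says nothing about SQL Server's plans. The tables are cut down to a few made-up columns, and the col2 = 'x' test is a stand-in for the long list of isnull comparisons.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER PRIMARY KEY, cid INTEGER, col2 TEXT);
CREATE TABLE table2 (cid INTEGER, uniqueoid INTEGER);
INSERT INTO table1 VALUES (1, 10, 'x'), (2, 20, 'x'), (3, 30, 'y'), (4, 40, 'x');
INSERT INTO table2 VALUES (10, 1), (40, 0);
""")

# Original shape: one query with an OR across the two cases.
or_sql = """
SELECT t1.id
FROM table1 t1
LEFT JOIN table2 t3 ON t1.cid = t3.cid
WHERE t3.uniqueoid = 1
   OR (t3.uniqueoid IS NULL AND t1.col2 = 'x')
ORDER BY 1
"""

# Rewritten shape: one query per case, appended with UNION ALL.
union_sql = """
SELECT t1.id
FROM table1 t1
LEFT JOIN table2 t3 ON t1.cid = t3.cid
WHERE t3.uniqueoid = 1
UNION ALL
SELECT t1.id
FROM table1 t1
LEFT JOIN table2 t3 ON t1.cid = t3.cid
WHERE t3.uniqueoid IS NULL AND t1.col2 = 'x'
ORDER BY 1
"""

with_or = conn.execute(or_sql).fetchall()
with_union = conn.execute(union_sql).fetchall()
print(with_or, with_union)
```

UNION ALL is safe here because the two branches are mutually exclusive: a joined row cannot have uniqueoid = 1 and uniqueoid IS NULL at the same time, so no row is returned twice. If the branches could overlap, UNION (or an extra exclusion condition) would be needed instead.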