Finding partial and exact duplicates in a SQL table - sql

I am trying to read duplicates from a table. There are some partial duplicates based on the values of Col1 and Col2, and some full duplicates based on Col1, Col2 and Col3, as in the table below.
Col1 Col2 Col3
1    John 100
1    John 200
2    Tom  150
3    Bob  100
3    Bob  100
4    Sam  500
I want to capture partial and exact duplicates in two separate outputs and ignore the non-repeated rows (Col1 = 2 and 4), e.g.:
Partial Duplicates
Col1 Col2 Col3
1    John 100
1    John 200
Full Duplicates
Col1 Col2 Col3
3    Bob  100
3    Bob  100
What is the best way to achieve this with SQL?
I tried using a self join with Spark SQL but am getting an error:
val source_df = sql("select col1, col2, col3 from sample_table")
source_df.as("df1")
  .join(inter_df.as("df2"),
    $"df1.Col3" === $"df2.Col3" and $"df1.Col2" === $"df2.Col2" and $"df1.Col1" === $"df2.Col1")
  .select($"df1.Col1", $"df1.Col2", $"df1.Col3", $"df2.Col3")
  .show()
Error
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(Col3#1957, 200)

For partial duplicates:
SELECT *
FROM tbl
WHERE EXISTS (
    SELECT *
    FROM tbl t2
    WHERE tbl.col1 = t2.col1 AND tbl.col2 = t2.col2 AND tbl.col3 <> t2.col3
)
Returns:
col1 col2 col3
1    John 100
1    John 200
For full duplicates, add a unique identifier per combination of col1, col2, and col3, and look for cases where there's another record with the same col1, col2, and col3, but a different unique identifier:
;WITH cte AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col1, col2, col3) AS uniqueid,
           col1, col2, col3
    FROM tbl
)
SELECT col1, col2, col3
FROM cte
WHERE EXISTS (
    SELECT *
    FROM cte t2
    WHERE cte.col1 = t2.col1 AND cte.col2 = t2.col2 AND cte.col3 = t2.col3 AND cte.uniqueid <> t2.uniqueid
)
Returns:
col1 col2 col3
3    Bob  100
3    Bob  100
http://sqlfiddle.com/#!18/f1d78/2
CREATE TABLE tbl (col1 INT, col2 VARCHAR(5), col3 INT)
INSERT INTO tbl VALUES
(1, 'John', 100),
(1, 'John', 200),
(2, 'Tom', 150),
(3, 'Bob', 100),
(3, 'Bob', 100),
(4, 'Sam', 500)

Partial duplicates - use EXISTS. Here is the demo.
select *
from myTable m1
where exists (
    select *
    from myTable m2
    where m1.Col1 = m2.Col1
      and m1.Col2 = m2.Col2
      and m1.Col3 <> m2.Col3
)
output:
----------------------
Col1 Col2 Col3
----------------------
1    John 100
1    John 200
----------------------
Full duplicates - you can use count(*) as a window function. Here is the demo.
with cte as (
    select Col1, Col2, Col3,
           count(*) over (partition by Col1, Col2, Col3) as cnt
    from myTable
)
select Col1, Col2, Col3
from cte
where cnt > 1
output:
----------------------
Col1 Col2 Col3
----------------------
3    Bob  100
3    Bob  100
----------------------
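Since the original question used Spark SQL, note that both answers above should run there as well. As a sketch, the two checks can also be folded into a single scan with window functions (this assumes the sample_table name from the question, and treats a row as a full duplicate only when the entire row repeats):
select col1, col2, col3,
       case when full_cnt > 1 then 'full' else 'partial' end as dup_type
from (
    select col1, col2, col3,
           count(*) over (partition by col1, col2)       as pair_cnt,  -- occurrences of the (col1, col2) key
           count(*) over (partition by col1, col2, col3) as full_cnt   -- occurrences of the whole row
    from sample_table
) t
where pair_cnt > 1  -- drops the non-repeated rows (2/Tom, 4/Sam)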

Related

Oracle self join starting with minimum value for each partition

I have this table:
COL1 COL2 COL3
--------------------
A 1 VAL1
A 2 VAL2
A 4 VAL3
B 2 VAL4
B 4 VAL5
B 5 VAL6
And I would like to obtain this output:
COL1 COL2 COL3
--------------------
A 1 VAL1
A 2 VAL2
A 3 NULL
B 2 VAL4
B 3 NULL
B 4 VAL5
Logic:
starting from the smallest COL2 value in each partition of COL1, take 3 consecutive numbers and, if the combination of COL1 and COL2 is present in the first table, show COL3, otherwise NULL.
Your question is a good example of what PARTITIONED OUTER JOIN was created for: DBFiddle
with top3 as (
select *
from (
select
col1, col2, col3
,min(col2)over(partition by col1) min_col2
,col2 - min(col2)over(partition by col1) + 1 as rn
from t
)
where col2 < min_col2 + 3
)
select
top3.col1
,r3.n as col2
,top3.col3
from
top3
partition by (col1)
right join
(select level n from dual connect by level<=3) r3
on r3.n=top3.rn;
As you can see, the first step is to get top3 and then just use partition by (col1) right join r3, where r3 is just a generator of 3 rows.
Results:
COL1 COL2 COL3
----- ---------- ----
A 1 VAL1
A 2 VAL2
A 3
B 1 VAL4
B 2
B 3 VAL5
6 rows selected.
Note, this approach allows you to scan your table just once!
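For readers unfamiliar with the r3 subquery above: it is the standard Oracle CONNECT BY row generator, which on its own simply produces the numbers 1 through 3:
select level n from dual connect by level <= 3;

N
----------
1
2
3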
Let's see. Here is the table
select * from t order by col1, col2;
COL1 COL2 COL3
----- ---------- -----
A 1 VAL1
A 2 VAL2
A 4 VAL3
B 2 VAL4
B 4 VAL5
B 5 VAL6
6 rows selected
and now let's try to apply the described logic
with offsets as
(select level - 1 offset from dual connect by level <= 3),
smallest_col2 as
(select col1, min(col2) min_col2 from t group by col1)
select sc2.col1, sc2.min_col2 + o.offset col2, t.col3
from smallest_col2 sc2
cross join offsets o
left join t
on t.col1 = sc2.col1
and t.col2 = sc2.min_col2 + o.offset
order by 1, 2;
COL1 COL2 COL3
----- ---------- -----
A 1 VAL1
A 2 VAL2
A 3
B 2 VAL4
B 3
B 4 VAL5
6 rows selected
Use a recursive CTE to get the COL2s from the min of each COL1 up to the next 2 and then a left join to the table:
WITH cte(COL1, COL2, max_col2) AS (
SELECT COL1, MIN(COL2), MIN(COL2) + 2
FROM tablename
GROUP BY COL1
UNION ALL
SELECT COL1, COL2 + 1, max_col2
FROM cte
WHERE COL2 < max_col2
)
SELECT c.COL1, c.COL2, t.COL3
FROM cte c LEFT JOIN tablename t
ON t.COL1 = c.COL1 AND t.COL2 = c.COL2
ORDER BY c.COL1, c.COL2
See the demo.
The partitioned outer join, already demonstrated in Sayan's answer, is probably the best approach for that part of the assignment (data densification).
For the first part, in Oracle 12.1 and higher you can use the match_recognize clause:
select col1, col2, col3
from this_table
match_recognize(
partition by col1
order by col2
measures col2 - a.col2 + 1 as rn
all rows per match
pattern ( ^ a b* )
define b as col2 <= a.col2 + 2
)
partition by (col1)
right outer join
(select level as rn from dual connect by level <= 3) using (rn)
;
Another solution with the "recursive WITH clause"
With rws_numbered (COL1, COL2, COL3, rn) as (
select COL1, COL2, COL3
, row_number() over (order by col1, col3) rn
from Your_table
)
, cte ( COL1, COL2, COL3, rn ) as (
select COL1, COL2, COL3, rn
from rws_numbered
where rn = 1
union all
select
t.COL1
, case when t.col1 = c.col1 then c.col2 + 1 else t.col2 end COL2
, t.COL3
, t.rn
from rws_numbered t
join cte c
on c.rn + 1 = t.rn
)
select COL1, COL2,
       case when exists (select null from Your_table t where t.COL1 = cte.COL1 and t.COL2 = cte.COL2)
            then COL3 else null end COL3
from cte
order by 1, 2
;
db<>fiddle

Start calculation after the offset in LEAD

I have a weird scenario to work on. I have data in the form below.
Col1 col2 col3
a b 201921
a b 201923
a b 201924
a b 201925
a b 201927
Col1 and col2 define the partition, and there are many more columns like them. I have a dynamic parameter which feeds the offset to the LEAD function.
What LEAD does is, for every row, find the next value based on the offset. What I need is a little different: once a row has used the offset to find its value, the next offset rows should be skipped.
For example, LEAD of 201921 with offset 1 is 201923. So the next calculation should start from the row that has 201924, then 201927, and so on.
What I wrote currently is
Lead(col3,<dynamic param>,col3) over (partition by col1,col2 order by col3)
Is there a way to skip rows and continue from the next one? I am a bit curious.
Expected output( for offset 1):
Col1 col2 col3 col4
a b 201921 201923
a b 201923 skip
a b 201924 201925
a b 201925 skip
a b 201927 201927
Expected output( for offset 2):
Col1 col2 col3 col4
a b 201921 201924
a b 201923 skip
a b 201924 skip
a b 201925 201925
a b 201927 201927
You can implement this with a query that determines COL4 using a CASE expression. This will be the base query; <dynamic param> is what needs to be replaced with your dynamic parameter.
SELECT col1,
       col2,
       col3,
       CASE
           WHEN ROW_NUMBER () OVER (PARTITION BY col1, col2 ORDER BY col3) + <dynamic param> >
                COUNT (*) OVER (PARTITION BY col1, col2)
           THEN col3
           WHEN MOD (ROW_NUMBER () OVER (PARTITION BY col1, col2 ORDER BY col3), <dynamic param> + 1) = 1
           THEN LEAD (col3, <dynamic param>) OVER (PARTITION BY col1, col2 ORDER BY col3)
       END AS col4
FROM t;
Here are examples using the samples you provided
-- offset of 1
WITH t (col1, col2, col3) AS
(
    SELECT 'a', 'b', 201921 FROM DUAL UNION ALL
    SELECT 'a', 'b', 201923 FROM DUAL UNION ALL
    SELECT 'a', 'b', 201924 FROM DUAL UNION ALL
    SELECT 'a', 'b', 201925 FROM DUAL UNION ALL
    SELECT 'a', 'b', 201927 FROM DUAL
)
SELECT col1,
       col2,
       col3,
       CASE
           WHEN ROW_NUMBER () OVER (PARTITION BY col1, col2 ORDER BY col3) + 1 >
                COUNT (*) OVER (PARTITION BY col1, col2)
           THEN col3
           WHEN MOD (ROW_NUMBER () OVER (PARTITION BY col1, col2 ORDER BY col3), 1 + 1) = 1
           THEN LEAD (col3, 1) OVER (PARTITION BY col1, col2 ORDER BY col3)
       END AS col4
FROM t;
COL1 COL2 COL3 COL4
_______ _______ _________ _________
a b 201921 201923
a b 201923
a b 201924 201925
a b 201925
a b 201927 201927
-- offset of 2
WITH t (col1, col2, col3) AS
(
    SELECT 'a', 'b', 201921 FROM DUAL UNION ALL
    SELECT 'a', 'b', 201923 FROM DUAL UNION ALL
    SELECT 'a', 'b', 201924 FROM DUAL UNION ALL
    SELECT 'a', 'b', 201925 FROM DUAL UNION ALL
    SELECT 'a', 'b', 201927 FROM DUAL
)
SELECT col1,
       col2,
       col3,
       CASE
           WHEN ROW_NUMBER () OVER (PARTITION BY col1, col2 ORDER BY col3) + 2 >
                COUNT (*) OVER (PARTITION BY col1, col2)
           THEN col3
           WHEN MOD (ROW_NUMBER () OVER (PARTITION BY col1, col2 ORDER BY col3), 2 + 1) = 1
           THEN LEAD (col3, 2) OVER (PARTITION BY col1, col2 ORDER BY col3)
       END AS col4
FROM t;
COL1 COL2 COL3 COL4
_______ _______ _________ _________
a b 201921 201924
a b 201923
a b 201924
a b 201925 201925
a b 201927 201927
If I'm following your logic, you can use a case expression and an analytic row_number() calculation to only populate col4 on every (offset + 1)th row; something like this (with n as the dynamic value):
select col1, col2, col3,
       case when mod(row_number() over (partition by col1, col2 order by col3) + n, n + 1) = 0
            then lead(col3, n) over (partition by col1, col2 order by col3)
       end as col4
from your_table
order by col1, col2, col3;
db<>fiddle
But that leaves the value in the last row for the partition null. Based on your second example you seem to actually want the last n rows to always have their own col3 value, which you could determine from lead() being null and then coalescing:
case when mod(row_number() over (partition by col1, col2 order by col3) + n, n + 1) = 0
       or lead(col3, n) over (partition by col1, col2 order by col3) is null
     then coalesce(lead(col3, n) over (partition by col1, col2 order by col3), col3)
end as col4
db<>fiddle
or using additional case branches:
case when lead(col3, n) over (partition by col1, col2 order by col3) is null
     then col3
     when mod(row_number() over (partition by col1, col2 order by col3) + n, n + 1) = 0
     then lead(col3, n) over (partition by col1, col2 order by col3)
end as col4
db<>fiddle
If col3 can be nullable then you could always set the last n rows in the partition to their col3 answer instead of checking if the lead is null:
case when row_number() over (partition by col1, col2 order by col3 desc) <= n
     then col3
     when mod(row_number() over (partition by col1, col2 order by col3) + n, n + 1) = 0
     then lead(col3, n) over (partition by col1, col2 order by col3)
end as col4
db<>fiddle
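As a usage sketch for the dynamic parameter: assuming the offset arrives as a bind variable (a hypothetical :n holding a positive integer; Oracle accepts expressions for the LEAD offset), the last variant can be embedded in a full query like so:
select col1, col2, col3,
       case when row_number() over (partition by col1, col2 order by col3 desc) <= :n
            then col3  -- the last :n rows in the partition keep their own value
            when mod(row_number() over (partition by col1, col2 order by col3) + :n, :n + 1) = 0
            then lead(col3, :n) over (partition by col1, col2 order by col3)
       end as col4
from your_table
order by col1, col2, col3;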

concatenate and de-dupe multiple rows

I have some incoming rows in the below format.
| Col1 | Col2 | Col3 |
| 1 | A | 1 |
| 1 | A | 1,2 |
| 1 | A | 1,3 |
| 1 | A | 2,4 |
Desired output is
| Col1 | Col2 | Col3 |
| 1 | A | 1,2,3,4 |
Basically, group all rows based on Col1 and Col2 and then concatenate and remove duplicates from Col3.
SELECT COL1, COL2, {?????}
FROM TABLEA
GROUP BY COL1, COL2;
I could not think of much at this moment. Any pointers would be much appreciated. I am on the WX2 database, but any ANSI-compliant snippet would be helpful.
For Postgres use this:
select col1, col2, string_agg(distinct col3, ',') as col3
from (
select col1, col2, x.col3
from tablea, unnest(string_to_array(col3, ',')) as x(col3)
) t
group by col1, col2;
This is largely ANSI compliant except for the string_to_array() and string_agg() functions.
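For convenience, here is the same query as a self-contained sketch with the sample rows inlined (assuming PostgreSQL; the order by inside string_agg is optional and only makes the output deterministic):
with tablea (col1, col2, col3) as (
    values (1, 'A', '1'), (1, 'A', '1,2'), (1, 'A', '1,3'), (1, 'A', '2,4')
)
select col1, col2,
       string_agg(distinct x.col3, ',' order by x.col3) as col3  -- '1,2,3,4'
from tablea, unnest(string_to_array(col3, ',')) as x(col3)
group by col1, col2;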
You could try with transpose or concatenation functions. The difficulty comes from the fact that col3 is varchar and a conversion is needed to get the distinct values.
With MySQL :
SELECT col1, col2, GROUP_CONCAT(DISTINCT col3) AS col3 FROM
(SELECT col1, col2, CONVERT(SUBSTR(col3, 1), UNSIGNED INTEGER) AS col3 FROM (
SELECT 1 AS col1, 'A' AS col2, '1' AS col3 UNION ALL
SELECT 1 AS col1, 'A' AS col2, '1,2' AS col3 UNION ALL
SELECT 1 AS col1, 'A' AS col2, '1,3' AS col3 UNION ALL
SELECT 1 AS col1, 'A' AS col2, '2,4' AS col3
) AS t
UNION ALL
SELECT col1, col2, CONVERT(SUBSTR(col3, 3), UNSIGNED INTEGER) AS col3 FROM (
SELECT 1 AS col1, 'A' AS col2, '1' AS col3 UNION ALL
SELECT 1 AS col1, 'A' AS col2, '1,2' AS col3 UNION ALL
SELECT 1 AS col1, 'A' AS col2, '1,3' AS col3 UNION ALL
SELECT 1 AS col1, 'A' AS col2, '2,4' AS col3
) AS t1
) AS t2
WHERE col3 <> 0
Result :
col1 | col2 | col3
1 | A | 1,2,3,4
For SQL Server: first concatenate all Col3 values using STUFF with FOR XML PATH and insert the result into a temp table. Then, based on that table, split the concatenated string into individual rows using a recursive CTE. Finally, concatenate all DISTINCT values back together with STUFF.
CREATE TABLE #table ( Col1 INT , Col2 VARCHAR(10) , Col3 VARCHAR(10))
INSERT INTO #table ( Col1 , Col2 , Col3 )
SELECT 1 , 'A' , '1' UNION ALL
SELECT 1 , 'A' , '1,2' UNION ALL
SELECT 1 , 'A' , '1,3' UNION ALL
SELECT 1 , 'A' , '2,4'
;WITH CTEValues ( Colval ) AS
(
    SELECT STUFF ( ( SELECT ',' + Col3 FROM #table T2 WHERE T2.Col2 = T1.col2 FOR XML PATH('') ), 1, 1, '' )
    FROM #table T1
    GROUP BY Col2
)
SELECT * INTO #CTEValues
FROM CTEValues

;WITH CTEDistinct ( SplitValues , SplitRemain ) AS
(
    SELECT SUBSTRING(Colval, 0, CHARINDEX(',', Colval)),
           SUBSTRING(Colval, CHARINDEX(',', Colval) + 1, LEN(Colval))
    FROM #CTEValues
    UNION ALL
    SELECT CASE WHEN CHARINDEX(',', SplitRemain) = 0 THEN SplitRemain
                ELSE SUBSTRING(SplitRemain, 0, CHARINDEX(',', SplitRemain)) END,
           CASE WHEN CHARINDEX(',', SplitRemain) = 0 THEN ''
                ELSE SUBSTRING(SplitRemain, CHARINDEX(',', SplitRemain) + 1, LEN(SplitRemain)) END
    FROM CTEDistinct
    WHERE SplitRemain <> ''
)
SELECT STUFF ( ( SELECT DISTINCT ',' + SplitValues FROM CTEDistinct T2 FOR XML PATH('') ), 1, 1, '' )
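On newer SQL Server versions the recursion is not needed: STRING_SPLIT (2016+) and STRING_AGG (2017+) express the same idea directly. A sketch against the same #table (the dedupe happens in the subquery, since STRING_AGG does not support DISTINCT):
SELECT Col1, Col2,
       STRING_AGG(value, ',') WITHIN GROUP (ORDER BY value) AS Col3
FROM (
    SELECT DISTINCT t.Col1, t.Col2, s.value
    FROM #table t
    CROSS APPLY STRING_SPLIT(t.Col3, ',') s
) d
GROUP BY Col1, Col2;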

Postgres pivot columns to rows

How can I convert a table with the following 5-column structure:
Id, name, col1, col2, col3
1 aaa 10 20 30
2 bbb 100 200 300
to the following structure, where the Col1, Col2 and Col3 column names now appear as strings in a new column Colx.
Id, name, Colx, Value
1 aaa Col1 10
1 aaa Col2 20
1 aaa Col3 30
2 bbb Col1 100
2 bbb Col2 200
2 bbb Col3 300
Thanks!
Avi
You can use a subquery with a UNION ALL of one branch per column (UNION ALL rather than UNION, so duplicate values are not collapsed):
select id, name, colx, val from (
    select id, name, 'col1' as colx, col1 as val from test
    UNION ALL
    select id, name, 'col2' as colx, col2 as val from test
    UNION ALL
    select id, name, 'col3' as colx, col3 as val from test
) as query
order by id, colx
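An alternative sketch that reads the table only once: Postgres can also unpivot with a LATERAL VALUES list (assuming the table is named test with columns id, name, col1, col2, col3):
select t.id, t.name, v.colx, v.value
from test t
cross join lateral (
    values ('col1', t.col1),
           ('col2', t.col2),
           ('col3', t.col3)
) as v(colx, value)
order by t.id, v.colx;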

Grouping multiple rows from a table into columns

I have two table as below.
Table 1
+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
| 1 | 1.5 | 1.5 | 2.5 |
| 1 | 2.5 | 3.5 | 1.5 |
+------+------+------+------+
Table 2
+------+--------+
| Col1 | Col2 |
+------+--------+
| 1 | 12345 |
| 1 | 678910 |
+------+--------+
I want the result as below.
+------+------+------+------+-------+--------+
| Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |
+------+------+------+------+-------+--------+
| 1 | 4 | 5 | 4 | 12345 | 678910 |
+------+------+------+------+-------+--------+
Here Col2, Col3 and Col4 are the aggregates of the values from Col2, Col3 and Col4 in Table 1, and the rows from Table 2 are transposed to columns in the result.
I use Oracle 11g and tried the PIVOT option, but I couldn't aggregate the values from Col2, Col3 and Col4 in Table 1.
Is there any function available in Oracle which provides a direct solution without any dirty workaround?
Thanks in advance.
Since you will always have only 2 records in the second table, simple grouping and a join will do.
Since I don't have the tables, I am using CTEs and inline views.
with cte1 as (
    select 1 as col1, 1.5 as col2, 1.5 as col3, 2.5 as col4 from dual
    union all
    select 1, 2.5, 3.5, 1.5 from dual
),
cte2 as (
    select 1 as col1, 12345 as col2 from dual
    union all
    select 1, 678910 from dual
)
select *
from (select col1, sum(col2) as col2, sum(col3) as col3, sum(col4) as col4
      from cte1
      group by col1) x
inner join (select col1, min(col2) as col5, max(col2) as col6
            from cte2
            group by col1) y
    on x.col1 = y.col1
with
mytab1 as (select col1, col2, col3, col4, 0 col5, 0 col6 from tab1),
mytab2 as
(
    select col1, 0 col2, 0 col3, 0 col4, "1_COL2" col5, "2_COL2" col6
    from
    (
        select row_number() over (partition by col1 order by rowid) rn, col1, col2
        from tab2
    )
    pivot
    (
        max(col2) col2
        for rn in (1, 2)
    )
)
select col1,
       sum(col2) col2,
       sum(col3) col3,
       sum(col4) col4,
       sum(col5) col5,
       sum(col6) col6
from (select * from mytab1 union all select * from mytab2)
group by col1
Hello, you can use the query below:
with t1 (col1, col2, col3, col4) as
(
    select 1, 1.5, 1.5, 2.5 from dual
    union
    select 1, 2.5, 3.5, 1.5 from dual
),
t2 (col1, col2) as
(
    select 1, 12345 from dual
    union
    select 1, 678910 from dual
)
select *
from
(
    select col1,
           max(decode(col2, 12345, 12345)) as col5,
           max(decode(col2, 678910, 678910)) as col6
    from t2
    group by col1
) a
inner join
(
    select col1, sum(col2) as col2, sum(col3) as col3, sum(col4) as col4
    from t1
    group by col1
) b
on a.col1 = b.col1
Pivot only the second table. You can then GROUP BY over the nested UNION ALL of table1 (where col5 and col6 are 0 for the subsequent group by) and the pivoted table2 (where col2, col3 and col4 are 0 for the subsequent group by).