Identifying rows for deletion/update based on criteria from matching rows - sql

I have a data set that contains rows considered duplicates based on certain fields. I need to match the duplicate rows, evaluate the non-matching fields, and flag one of them for deletion. A sample table is:
ID  Col1  Col2  Col3
1   A     B     CC
2   A     B     DD
3   E     F     GG
4   E     F     HH
So I need to identify rows 1 & 2 as duplicates based on Col1 and Col2 matching, and compare the Col3 fields, ultimately flagging either row 1 or 2 for deletion. And the same for rows 3 & 4. This table consists entirely of rows that match at least one other row across Col1 and Col2.
My first thought was to join onto itself to flatten the rows into this format:
t1.ID  t2.ID  t1.Col1  t1.Col2  TableOneCol3  TableTwoCol3
1      2      A        B        CC            DD
3      4      E        F        GG            HH
Then it would be simple to evaluate TableOneCol3 and TableTwoCol3 for each row.
I tried to do this with a self join:
select t1.ID, t2.ID, t1.Col1, t1.Col2, t1.Col3 as TableOneCol3, t2.Col3 as TableTwoCol3
into #temptable
from tableOne t1
join tableOne t2
  on t1.Col1 = t2.Col1
  and t1.Col2 = t2.Col2
  and t1.ID <> t2.ID
But of course this doesn't remove duplicates - just adds the duplicate field information to each row.
I went down the path of pivoting the data, but I end up with a similar result: I pivot the duplicates as well.
I dug through SO but not sure if I have the specific words for what I need to do (the admittedly vague title might be a giveaway - apologies for that). I found many examples of flattening data into single columns and pivots, but nothing that would flatten paired rows and remove one of them from the resultset.
Not sure if I'm going down the wrong road for this or not. It seems I need to evaluate each row in the context of what has been evaluated prior - but I'm not sure how to do this without resorting to a cursor.

It is extremely unclear what you are trying to do, so I tossed together a couple of quick ideas that might be what you're after.
if OBJECT_ID('tempdb..#Something') is not null
drop table #Something
create table #Something
(
ID int
, Col1 char(1)
, Col2 char(1)
, Col3 char(2)
)
insert #Something
(
ID
, Col1
, Col2
, Col3
)
VALUES
(1, 'A', 'B', 'CC'),
(2, 'A', 'B', 'DD'),
(3, 'E', 'F', 'GG'),
(4, 'E', 'F', 'HH');
with SortedResults as
(
select *
, ROW_NUMBER() over(partition by Col1, Col2 order by Col3) as RowNum
from #Something
)
delete SortedResults
where RowNum > 1;

select *
from #Something;
--OR maybe you want to cross tab the data???
drop table #Something
GO
create table #Something
(
ID int
, Col1 char(1)
, Col2 char(1)
, Col3 char(2)
)
insert #Something
(
ID
, Col1
, Col2
, Col3
)
VALUES
(1, 'A', 'B', 'CC'),
(2, 'A', 'B', 'DD'),
(3, 'E', 'F', 'GG'),
(4, 'E', 'F', 'HH');
with SortedResults as
(
select *
, ROW_NUMBER() over(partition by Col1, Col2 order by Col3) as RowNum
from #Something
)
select
MAX(case when RowNum = 1 then ID end) as ID_1
, MAX(case when RowNum = 2 then ID end) as ID_2
, Col1
, Col2
, MAX(case when RowNum = 1 then Col3 end) as Col3_1
, MAX(case when RowNum = 2 then Col3 end) as Col3_2
from SortedResults
group by
Col1
, Col2
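The ROW_NUMBER() dedupe above is T-SQL, but the idea carries to any engine with window functions. A minimal runnable sketch using Python's sqlite3 (table and column names taken from the sample data; SQLite can't DELETE through a CTE the way SQL Server can, so the same row-numbering is pushed into a subquery):

```python
import sqlite3

# In-memory table mirroring the #Something sample data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Something (ID INT, Col1 TEXT, Col2 TEXT, Col3 TEXT)")
con.executemany("INSERT INTO Something VALUES (?,?,?,?)",
                [(1, "A", "B", "CC"), (2, "A", "B", "DD"),
                 (3, "E", "F", "GG"), (4, "E", "F", "HH")])

# Keep only row 1 of each (Col1, Col2) group, ordered by Col3,
# and delete everything else.
con.execute("""
    DELETE FROM Something WHERE ID NOT IN (
        SELECT ID FROM (
            SELECT ID,
                   ROW_NUMBER() OVER (PARTITION BY Col1, Col2 ORDER BY Col3) AS rn
            FROM Something
        ) WHERE rn = 1
    )
""")
survivors = con.execute("SELECT ID FROM Something ORDER BY ID").fetchall()
print(survivors)  # [(1,), (3,)]
```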

You could obtain a table in a form similar to the one you describe with use of the LEAD() analytic function. This will have the benefit that it works reasonably well when your dupes come in groups larger than two. For example:
select
ID,
lead(ID) over (partition by col1, col2 order by col3) as nextId,
Col1,
Col2,
Col3,
lead(Col3) over (partition by col1, col2 order by col3) as nextCol3
into #temptable
from tableOne
Results would be of the form
ID nextId Col1 Col2 Col3 nextCol3
1 2 A B CC DD
2 NULL A B DD NULL
3 4 E F GG HH
4 NULL E F HH NULL
If you are confident that you don't need to deal with groups larger than two then you could get the exact table you wanted by afterward filtering out, say, the rows having nextId IS NULL.
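As a quick sanity check of the LEAD() pairing plus the nextId filter, here is a minimal sqlite3 sketch (table and column names assumed from the question; any engine with window functions behaves the same way):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tableOne (ID INT, Col1 TEXT, Col2 TEXT, Col3 TEXT)")
con.executemany("INSERT INTO tableOne VALUES (?,?,?,?)",
                [(1, "A", "B", "CC"), (2, "A", "B", "DD"),
                 (3, "E", "F", "GG"), (4, "E", "F", "HH")])

# Pair each row with the next one in its (Col1, Col2) group, then drop the
# trailing rows whose nextId is NULL, leaving one row per duplicate pair.
pairs = con.execute("""
    SELECT ID, nextId, Col1, Col2, Col3, nextCol3 FROM (
        SELECT ID,
               LEAD(ID)   OVER (PARTITION BY Col1, Col2 ORDER BY Col3) AS nextId,
               Col1, Col2, Col3,
               LEAD(Col3) OVER (PARTITION BY Col1, Col2 ORDER BY Col3) AS nextCol3
        FROM tableOne
    )
    WHERE nextId IS NOT NULL
    ORDER BY ID
""").fetchall()
print(pairs)  # [(1, 2, 'A', 'B', 'CC', 'DD'), (3, 4, 'E', 'F', 'GG', 'HH')]
```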

Related

Merge into (in SQL), but ignore the duplicates

I am trying to merge two tables in Snowflake with:
ON CONCAT(tab1.column1, tab1.column2) = CONCAT(tab2.column1, tab2.column2)
The problem here is that there are duplicates, meaning rows where column1 and column2 in table2 are identical; the only difference is the timestamp column. Therefore I would like to have two options: either ignore the duplicate and take only one row (the one with the largest timestamp), or distinguish again based on the timestamp. The second would be nicer.
But I have no clue how to do it.
Example:
Table1:
Col1  Col2  Col3  Timestamp
24    10    3     05.05.2022
34    19    2     04.05.2022
24    10    4     06.05.2022
Table2:
Col1  Col2  Col3
24    10    Null
34    19    Null
What I want to do:
MERGE INTO table1 AS dest USING
(SELECT * FROM table2) AS src
ON CONCAT(dest.col1, dest.col2) = CONCAT(src.col1, src.col2)
WHEN MATCHED THEN UPDATE
SET dest.col3 = src.col3
It feels like you want to update from TABLE1 to TABLE2, not the other way around, because as your example stands there are no duplicates in table2.
It also feels like you want to use two equi-join predicates on col1 AND col2, not concatenate them together.
Thus, given how I read your data and the words you used, I think you should do this:
create or replace table table1(Col1 number, Col2 number, Col3 number, timestamp date);
insert into table1 values
(24, 10, 3, '2022-05-05'::date),
(34, 19, 2, '2022-05-04'::date),
(24, 10, 4, '2022-05-06'::date);
create or replace table table2(Col1 number, Col2 number, Col3 number);
insert into table2 values
(24, 10 ,Null),
(34, 19 ,Null);
MERGE INTO table2 AS d
USING (
select *
from table1
qualify row_number() over (partition by col1, col2 order by timestamp desc) = 1
) AS s
ON d.col1 = s.col1 AND d.col2 = s.col2
WHEN MATCHED THEN UPDATE
SET d.col3 = s.col3;
which runs fine:
number of rows updated
2
select * from table2;
shows it has been updated:
COL1  COL2  COL3
24    10    4
34    19    2
That said, the JOIN does work written your way with CONCAT, if that is correct for your application, albeit it feels very wrong to me (concatenating without a separator can produce false matches: (24, 10) and (2, 410) both concatenate to '2410').
MERGE INTO table2 AS d
USING (
select *
from table1
qualify row_number() over (partition by col1, col2 order by timestamp desc) = 1
) AS s
ON concat(d.col1, d.col2) = concat(s.col1, s.col2)
WHEN MATCHED THEN UPDATE
SET d.col3 = s.col3;
This is it:
WITH CTE AS
(
SELECT *,
RANK() OVER (PARTITION BY col1,col2
ORDER BY Timestamp desc) AS rn
FROM table1
)
UPDATE CTE
SET col3 = (select col3 from table2 where CONCAT(table2.col1,table2.col2) = CONCAT(CTE.col1, CTE.col2))
where CTE.rn =1;
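Snowflake's QUALIFY is essentially sugar over ROW_NUMBER(); the "latest row per key, then update" step can be sketched in any engine. A minimal sqlite3 sketch (table names taken from the question, dates stored as ISO strings, and a correlated subquery standing in for the MERGE):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (Col1 INT, Col2 INT, Col3 INT, ts TEXT)")
con.execute("CREATE TABLE table2 (Col1 INT, Col2 INT, Col3 INT)")
con.executemany("INSERT INTO table1 VALUES (?,?,?,?)",
                [(24, 10, 3, "2022-05-05"), (34, 19, 2, "2022-05-04"),
                 (24, 10, 4, "2022-05-06")])
con.executemany("INSERT INTO table2 VALUES (?,?,?)",
                [(24, 10, None), (34, 19, None)])

# For each table2 row, take Col3 from the table1 row with the same key and
# the greatest timestamp -- the same row QUALIFY row_number() = 1 would keep.
con.execute("""
    UPDATE table2 SET Col3 = (
        SELECT t1.Col3 FROM table1 t1
        WHERE t1.Col1 = table2.Col1 AND t1.Col2 = table2.Col2
        ORDER BY t1.ts DESC LIMIT 1
    )
""")
rows = con.execute("SELECT * FROM table2 ORDER BY Col1").fetchall()
print(rows)  # [(24, 10, 4), (34, 19, 2)]
```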

How to make a condition to select A and B in a column but not A or B alone?

I have a table A:
col1   col2
nameA  A
nameA  B
nameB  A
nameC  B
.........
I want to make a condition so that only values of col1 having both A and B, but not A or B alone, will be selected. How can I do this?
Using IN does not work, since the condition still returns results when only A or only B is present.
Expected result:
col1   col2
nameA  A
nameA  B
Not expected result:
col1   col2
nameA  A
or
col1   col2
nameA  B
You may try the following aggregation query:
WITH cte AS (
SELECT col1
FROM yourTable
WHERE col2 IN ('A', 'B')
GROUP BY col1
HAVING MIN(col2) <> MAX(col2) AND COUNT(*) = 2
)
SELECT col1, col2
FROM yourTable t1
WHERE EXISTS (SELECT 1 FROM cte t2 WHERE t1.col1 = t2.col1);
As a note about the aggregation logic in the CTE: I restrict to col1 groups whose col2 values are exactly A and B, and to groups having exactly two records (i.e. multiple A's or B's are not acceptable). With the CTE having done most of the heavy lifting, finding the full matching records only needs a simple select against your table with an EXISTS clause.
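A minimal sqlite3 sketch of this aggregation approach (the table name yourTable is assumed from the answer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE yourTable (col1 TEXT, col2 TEXT)")
con.executemany("INSERT INTO yourTable VALUES (?,?)",
                [("nameA", "A"), ("nameA", "B"),
                 ("nameB", "A"), ("nameC", "B")])

# Groups with MIN <> MAX and exactly two rows must contain one A and one B.
rows = con.execute("""
    WITH cte AS (
        SELECT col1
        FROM yourTable
        WHERE col2 IN ('A', 'B')
        GROUP BY col1
        HAVING MIN(col2) <> MAX(col2) AND COUNT(*) = 2
    )
    SELECT col1, col2
    FROM yourTable t1
    WHERE EXISTS (SELECT 1 FROM cte t2 WHERE t1.col1 = t2.col1)
    ORDER BY col2
""").fetchall()
print(rows)  # [('nameA', 'A'), ('nameA', 'B')]
```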
DECLARE #Test TABLE (
col1 VARCHAR(32),
col2 VARCHAR(1)
)
INSERT #Test (col1, col2)
VALUES
('nameA', 'A'),
('nameA', 'B'),
('nameB', 'A'),
('nameC', 'B')
SELECT col1, col2
FROM #Test t
WHERE EXISTS (
SELECT 1
FROM #Test
WHERE col1 = t.col1
AND col2 != t.col2
)
Assuming no duplicates in your table, window functions are an easy solution:
select t.*
from (select t.*,
             sum(case when col2 in ('A', 'B') then 1 else 0 end)
                 over (partition by col1) as cnt
      from t
     ) t
where cnt = 2;
This nicely generalizes to more than two values.
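A runnable sqlite3 sketch of this windowed-count idea; note the SUM needs an OVER (PARTITION BY col1) clause to act as a window function:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
con.executemany("INSERT INTO t VALUES (?,?)",
                [("nameA", "A"), ("nameA", "B"),
                 ("nameB", "A"), ("nameC", "B")])

# Count matching values per col1 group without collapsing rows,
# then keep only the groups that contain both A and B.
rows = con.execute("""
    SELECT col1, col2 FROM (
        SELECT t.*,
               SUM(CASE WHEN col2 IN ('A','B') THEN 1 ELSE 0 END)
                   OVER (PARTITION BY col1) AS cnt
        FROM t
    ) WHERE cnt = 2
    ORDER BY col2
""").fetchall()
print(rows)  # [('nameA', 'A'), ('nameA', 'B')]
```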
where col1 = 'nameA' and (col2 = 'A' or col2 = 'B')

Oracle SQL How to find duplicate values in different columns?

I have a set of rows with many columns. For example,
ID | Col1 | Col2 | Col3 | Duplicate
------------------------------------
81 | 101 | 102 | 101 | YES
82 | 101 | 103 | 104 | NO
I need to calculate the "Duplicate" column. It is a duplicate because it has the same value in Col1 and Col3. I know there is the LEAST function, which is similar to the MIN function but operates across columns. Does something similar exist to achieve this?
The approach I have in mind is to write all possible combinations in a case like this:
SELECT ID, col1, col2, col3,
CASE WHEN col1 = col2 or col1 = col3 or col2 = col3 then 1 else 0 end as Duplicate
FROM table
But, I wish to avoid that, since I have too many columns in some cases, and is very prone to errors.
What is the best way to solve this?
Hmmm. You are looking for within-row duplicates. This is painful. More recent versions of Oracle support lateral joins. But for just a handful of non-NULL columns, you can do:
select id, col1, col2, col3,
(case when col1 in (col2, col3) or col2 in (col3) then 1 else 0 end) as Duplicate
from t;
For each additional column, you need to add one more IN comparison and extend the existing in-lists.
Something like this... note that in the lateral clause we still need to unpivot, but only one row at a time, resulting in possibly much faster execution than a simple unpivot with standard aggregation.
with
input_data ( id, col1, col2, col3 ) as (
select 81, 101, 102, 101 from dual union all
select 82, 101, 103, 104 from dual
)
-- End of simulated input data (for testing purposes only).
-- Solution (SQL query) begins BELOW THIS LINE.
select i.id, i.col1, i.col2, i.col3, l.duplicates
from input_data i,
lateral ( select case when count (distinct val) = count(val)
then 'NO' else 'YES'
end as duplicates
from input_data
unpivot ( val for col in ( col1, col2, col3 ) )
where id = i.id
) l
;
ID COL1 COL2 COL3 DUPLICATES
-- ---- ---- ---- ----------
81 101 102 101 YES
82 101 103 104 NO
You can do this by unpivoting and then counting the distinct values per id and checking if it equals the number of rows for that id. Equal means there are no duplicates. Then left join this result to the original table to calculate the duplicate column.
SELECT t.*,
CASE WHEN x.id IS NOT NULL THEN 'Yes' ELSE 'No' END AS duplicate
FROM t
LEFT JOIN
(SELECT id
FROM
(SELECT *
FROM t
unpivot (val FOR col IN (col1,col2,col3)) u
) t
GROUP BY id
HAVING count(*)<>count(DISTINCT val)
) x ON x.id=t.id
The best way† is to avoid storing repeating groups of columns. If you have multiple columns that essentially store comparable data (i.e. a multi-valued attribute), move the data to a dependent table, and use one column.
CREATE TABLE child (
ref_id INT,
col INT
);
INSERT INTO child VALUES
(81, 101), (81, 102), (81, 101),
(82, 101), (82, 103), (82, 104);
Then it's easier to find cases where a value occurs more than once:
SELECT ref_id, col, COUNT(*)
FROM child
GROUP BY ref_id, col
HAVING COUNT(*) > 1;
If you can't change the structure of the table, you could simulate it using UNIONs:
SELECT id, col, COUNT(*)
FROM (
SELECT id, col1 AS col FROM mytable
UNION ALL SELECT id, col2 FROM mytable
UNION ALL SELECT id, col3 FROM mytable
... for more columns ...
) t
GROUP BY id, col
HAVING COUNT(*) > 1;
† Best for the query you are trying to run. A denormalized storage strategy might be better for some other types of queries.
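The UNION ALL "simulated unpivot" runs in most engines; a minimal sqlite3 sketch using the sample data (the table name mytable is assumed from the answer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (id INT, col1 INT, col2 INT, col3 INT)")
con.executemany("INSERT INTO mytable VALUES (?,?,?,?)",
                [(81, 101, 102, 101), (82, 101, 103, 104)])

# Stack the columns into (id, col) pairs, then count repeats per id.
dupes = con.execute("""
    SELECT id, col, COUNT(*) AS n
    FROM (
        SELECT id, col1 AS col FROM mytable
        UNION ALL SELECT id, col2 FROM mytable
        UNION ALL SELECT id, col3 FROM mytable
    )
    GROUP BY id, col
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [(81, 101, 2)] -- only row 81 has a within-row duplicate
```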
SELECT ID, col1, col2,
NVL2(NULLIF(col1, col2), 'Not duplicate', 'Duplicate')
FROM table;
If you want to compare more than 2 columns, you can implement the same logic with COALESCE.
If you just want the table's data without any fully duplicated rows, you can use a SELECT DISTINCT statement like
SELECT DISTINCT * FROM TABLE_NAME
It will return duplicate-free data.
Note: it is also applicable to a particular column, like

SQL script for retrieving 5 unique values in a table ( google big query )

I am looking for a query that returns five unique values from a table. For example:
The table consists of 100+ columns. Is there any way I can get unique values?
I am using Google BigQuery and tried this option:
select col1, col2, ... coln
from tablename
where col1 is not null and col2 is not null
group by col1, col2, ... coln
order by col1, col2, ... coln
limit 5
But the problem is it returns zero records if all the columns are null.
I think you might be able to do this in Google BigQuery, assuming that the types for the columns are compatible:
select colname, colvalue
from (select 'col1' as colname, col1 as colvalue
from t
where col1 is not null
group by col1
limit 5
),
(select 'col2' as colname, col2 as colvalue
from t
where col2 is not null
group by col2
limit 5
),
. . .
For those not familiar with the syntax: a comma in the from clause means union all, not cross join, in this dialect. Why did they have to change this?
Try this one; I hope it works:
;With CTE as (
select * ,ROW_NUMBER () over (partition by isnull(col1,''),isnull(col2,'')... isnull(coln,'') order by isnull(col1,'')) row_id
from tablename
) select * from CTE where row_id =1

How to pivot rows without grouping, counting, averaging

I am reworking some tables from a screwed-up database. A few of the tables had the same data under different table names, and each of them also had similar data under different column names. Anyway, this is a weird request, but it has to be done like this.
I need to pivot rows up to simulate one row so I can create one record from two different tables.
I have attached a photo. The table on the left will pull a single row and the table on the right will supply 1 - n rows based on the id from the left table. I need to pivot the rows up to simulate one row and create one record from the two results.
From my checking online, the pivot seems to be the way to go, but it seems to want me to group or do some type of aggregating.
What is the best way to go about doing this?
table1 ---Produces one row
table1id | col1 | col2 | col3
1        | Wow  | Wee  | Zee
table2 ---Produces 1 - n rows
table2id | table1id | col1 | col2  | col3
1        | 1        | sock | cloth | sup
2        | 1        | bal  | baa   | zak
3        | 1        | x    | y     | fooZ
needs to look like this (the below is not column names; it's the result set):
Wow,Wee,Zee,sock,cloth,sup,bal,baa,zak,x,y,fooZ
If using MySQL:
SELECT a.table1id, GROUP_CONCAT(a.col) AS col_values
FROM
(
SELECT table1id, col1 col FROM table1 UNION ALL
SELECT table1id, col2 FROM table1 UNION ALL
SELECT table1id, col3 FROM table1 UNION ALL
SELECT table1id, col1 FROM table2 UNION ALL
SELECT table1id, col2 FROM table2 UNION ALL
SELECT table1id, col3 FROM table2
) a
GROUP BY a.table1id
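MySQL's GROUP_CONCAT has a counterpart in SQLite (group_concat), so the unpivot-then-concatenate idea can be sketched locally. Note that neither engine guarantees concatenation order without an explicit ORDER BY, so the check at the end compares the multiset of values rather than their order:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (table1id INT, col1 TEXT, col2 TEXT, col3 TEXT)")
con.execute("CREATE TABLE table2 (table2id INT, table1id INT, "
            "col1 TEXT, col2 TEXT, col3 TEXT)")
con.execute("INSERT INTO table1 VALUES (1, 'Wow', 'Wee', 'Zee')")
con.executemany("INSERT INTO table2 VALUES (?,?,?,?,?)",
                [(1, 1, "sock", "cloth", "sup"),
                 (2, 1, "bal", "baa", "zak"),
                 (3, 1, "x", "y", "fooZ")])

# Stack every column from both tables into (table1id, col) pairs,
# then concatenate each group into a single comma-separated string.
row = con.execute("""
    SELECT a.table1id, group_concat(a.col) AS col_values
    FROM (
        SELECT table1id, col1 AS col FROM table1 UNION ALL
        SELECT table1id, col2 FROM table1 UNION ALL
        SELECT table1id, col3 FROM table1 UNION ALL
        SELECT table1id, col1 FROM table2 UNION ALL
        SELECT table1id, col2 FROM table2 UNION ALL
        SELECT table1id, col3 FROM table2
    ) a
    GROUP BY a.table1id
""").fetchone()
values = sorted(row[1].split(","))
print(row[0], values)
```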
If using SQL-Server:
SELECT a.table1id, b.colnames
FROM table1 a
CROSS APPLY
(
SELECT STUFF((
SELECT ',' + aa.col
FROM
(
SELECT table1id, col1 col FROM table1 UNION ALL
SELECT table1id, col2 FROM table1 UNION ALL
SELECT table1id, col3 FROM table1 UNION ALL
SELECT table1id, col1 FROM table2 UNION ALL
SELECT table1id, col2 FROM table2 UNION ALL
SELECT table1id, col3 FROM table2
) aa
WHERE aa.table1id = a.table1id
FOR XML PATH('')
), 1, 1, '') AS colnames
) b