Need help finding duplicate values for Data Quality checks

I have a table in which each combination of certain attributes must map to exactly one value.
col1 col2 col3
a b x
a b y
a c x
a d z
e b w
How do I ensure that each col1+col2 combination has only one col3 value? Here the combination a+b has both x and y as col3 values. I have to send such rows to a reject file, and I am looking for the right filter query.

We can use an aggregation approach. To identify the rows that fail the uniqueness requirement, use:
WITH cte AS (
    -- col1/col2 groups that carry more than one distinct col3 value
    SELECT col1, col2
    FROM yourTable
    GROUP BY col1, col2
    HAVING MIN(col3) <> MAX(col3)
)
SELECT t1.*
FROM yourTable t1
INNER JOIN cte t2
    ON t2.col1 = t1.col1 AND
       t2.col2 = t1.col2;
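As a quick check of the aggregation approach, here is a minimal sketch that loads the question's sample rows into an in-memory SQLite database through Python's sqlite3 module and runs the same query. SQLite and the Python harness are assumptions made only so the example is runnable; the question does not name a database.

```python
import sqlite3

# In-memory SQLite database holding the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?)",
    [("a", "b", "x"), ("a", "b", "y"), ("a", "c", "x"),
     ("a", "d", "z"), ("e", "b", "w")],
)

# A group whose smallest and largest col3 differ holds more than one value.
REJECT_SQL = """
WITH cte AS (
    SELECT col1, col2
    FROM yourTable
    GROUP BY col1, col2
    HAVING MIN(col3) <> MAX(col3)
)
SELECT t1.col1, t1.col2, t1.col3
FROM yourTable t1
INNER JOIN cte t2
    ON t2.col1 = t1.col1 AND t2.col2 = t1.col2
ORDER BY t1.col1, t1.col2, t1.col3
"""
rejected = conn.execute(REJECT_SQL).fetchall()
print(rejected)  # [('a', 'b', 'x'), ('a', 'b', 'y')]
```

Note that MIN and MAX ignore NULLs, so a group whose col3 values are one real value plus NULL would not be flagged by this filter; whether that matters depends on the data quality rule.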

Related

Get random value sets from table without using cursor or While loop

I have a table with 5 columns:
ID (int identity), col1, col2, col3, col4 (the last four columns are all varchar)
There are approx. 68,000 unique col1/col2 values. For each of these, there can be between 1 and approx. 214,000 unique col3/col4 values.
My task is to retrieve one random col3 and col4 (from the same row) for each of the unique col1/col2 values.
Is it possible to accomplish this without using a While loop or a cursor? I've done some research and know how to get random values (and the identity column helps with that), but the only way I can see to do this is to go through the 68,000 unique col1/col2 values one by one and grab a random col3/col4 value for each.
Also, these row counts are for preliminary development/testing (collected from 4 previous months of data). When this goes live we will be going back 27 months. So obviously, we are talking about a massive amount of data.
I've seen some mentions of using CTEs, but have not been successful in finding an example or explanation.
Thanks for your help.

I figured out a solution involving temp tables, ROW_NUMBER() over..., and RAND().
First, I selected the distinct col1 and col2 values into #temp1.
SELECT DISTINCT col1, col2
INTO #temp1
FROM sourceTable
Next, I selected the distinct col3 and col4 values for each col1/col2, along with a row number, and put in temp table #temp2:
SELECT t.col1, t.col2, a.col3, a.col4,
       ROW_NUMBER() OVER (
           PARTITION BY t.col1, t.col2
           ORDER BY t.col1, t.col2, a.col3, a.col4) AS RowNumber
INTO #temp2
FROM #temp1 t
JOIN sourceTable a ON a.col1 = t.col1 AND a.col2 = t.col2
GROUP BY t.col1, t.col2, a.col3, a.col4
ORDER BY t.col1, t.col2, RowNumber
Then, I selected one of the rows at random from each set of col1/col2's into a 3rd temp table:
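-- Note: in SQL Server, RAND() without a seed is evaluated once per query,
-- so every col1/col2 group shares the same random fraction.
SELECT x.col1, x.col2,
       (SELECT TOP 1 y.RowNumber
        FROM #temp2 y
        WHERE y.col1 = x.col1
          AND y.col2 = x.col2
          AND y.RowNumber >= RAND() *
              (SELECT MAX(z.RowNumber)
               FROM #temp2 z
               WHERE z.col1 = x.col1
                 AND z.col2 = x.col2)
        ORDER BY y.RowNumber) AS Random_RowNumber  -- smallest row number at or above the threshold
INTO #temp3
FROM #temp1 x
ORDER BY x.col1, x.col2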
Lastly, I join the tables to get the random rows:
SELECT t3.col1, t3.col2, t2.col3, t2.col4
FROM #temp3 t3
JOIN #temp2 t2 on t2.col1 = t3.col1 AND t2.col2 = t3.col2 AND t2.RowNumber = t3.Random_RowNumber
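The question asked for a CTE example, so here is a minimal sketch of a single-statement alternative: number the rows of each col1/col2 group in random order and keep row 1. It runs in SQLite through Python's sqlite3 module (an assumption made for testability, and it needs SQLite 3.25 or later for window functions); the sample rows are hypothetical. In SQL Server the same shape should work with ORDER BY NEWID() in place of random().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sourceTable ("
    "ID INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT)"
)
# Hypothetical sample data: three col1/col2 groups of different sizes.
rows = [
    ("k1", "a", "p", "1"), ("k1", "a", "q", "2"), ("k1", "a", "r", "3"),
    ("k2", "b", "s", "4"),
    ("k3", "c", "t", "5"), ("k3", "c", "u", "6"),
]
conn.executemany(
    "INSERT INTO sourceTable (col1, col2, col3, col4) VALUES (?, ?, ?, ?)",
    rows,
)

# Shuffle the rows inside each group, then keep the first one.
PICK_SQL = """
WITH ranked AS (
    SELECT col1, col2, col3, col4,
           ROW_NUMBER() OVER (
               PARTITION BY col1, col2
               ORDER BY random()
           ) AS rn
    FROM sourceTable
)
SELECT col1, col2, col3, col4
FROM ranked
WHERE rn = 1
ORDER BY col1, col2
"""
picked = conn.execute(PICK_SQL).fetchall()
print(picked)  # one row per col1/col2 group; col3/col4 vary between runs
```

One difference from the temp-table solution: this sketch picks uniformly among rows, while the temp-table version first collapses each group to its distinct col3/col4 pairs. If repeated pairs should not be weighted more heavily, select the distinct pairs in a first CTE.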

Filter rows if value in one column exists in another column

I have following table in Postgres 11:
col1 col2 col3 col4
1 trial_1 ag-270 ag
2 trial_2 ag ag
3 trial_3 methotexate (mtx) mtx
4 trial_4 mtx mtx
5 trial_5 hep-nor-b nor-b
I would like to search for each value of col4 throughout the column col3. If the value in col4 exists anywhere in col3, I would like to keep the row; otherwise the row should be excluded.
Desired output is:
col1 col2 col3 col4
1 trial_1 ag-270 ag
2 trial_2 ag ag
3 trial_3 methotexate (mtx) mtx
4 trial_4 mtx mtx
I could not try anything on this as I am unable to find a solution to this yet.

If the value in col4 exists in col3, I would like to keep the rows.
... translates to:
SELECT *
FROM tbl a
WHERE EXISTS (SELECT FROM tbl b WHERE b.col3 = a.col4);
Produces your desired result.

This can be done as an inner join:
select distinct t.col1, t.col2, t.col3, t.col4
from T t inner join T t2 on t2.col3 = t.col4

select a.*
from myTable a
where exists (
    select 1
    from myTable b
    where b.col3 = a.col4)
If your table has many rows, you should ensure that col3 is indexed.
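To see how the EXISTS and inner-join answers differ, here is a minimal sketch using the question's rows in an in-memory SQLite database via Python's sqlite3 module (an assumption; the question uses Postgres 11). Row 6 is a hypothetical addition that gives col3 a second 'ag' value, which is what makes the plain join return repeated rows and explains the DISTINCT in the join answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 INTEGER, col2 TEXT, col3 TEXT, col4 TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [(1, "trial_1", "ag-270", "ag"),
     (2, "trial_2", "ag", "ag"),
     (3, "trial_3", "methotexate (mtx)", "mtx"),
     (4, "trial_4", "mtx", "mtx"),
     (5, "trial_5", "hep-nor-b", "nor-b"),
     (6, "trial_6", "ag", "zz")],  # hypothetical row, not in the question
)

# Semi-join: each outer row appears at most once, however many matches exist.
exists_ids = [r[0] for r in conn.execute(
    "SELECT a.col1 FROM tbl a "
    "WHERE EXISTS (SELECT 1 FROM tbl b WHERE b.col3 = a.col4) "
    "ORDER BY a.col1")]

# Plain inner join: one output row per matching pair, so duplicates appear.
join_ids = [r[0] for r in conn.execute(
    "SELECT a.col1 FROM tbl a "
    "INNER JOIN tbl b ON b.col3 = a.col4 "
    "ORDER BY a.col1")]

print(exists_ids)  # [1, 2, 3, 4]
print(join_ids)    # [1, 1, 2, 2, 3, 4]
```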

Remove Duplicates from my SQL Server table

I have a table with the below data.
Col1 Col2
A B
B A
C D
D C
E F
F E
If the (Col1, Col2) pair of one row equals the (Col2, Col1) pair of another row, the two rows are considered duplicates. In the above example, Row 1 and Row 2 hold the same two values in swapped order, so they are duplicates. We need only one row from each such pair.
So the output for the above example will be,
Col1 Col2
A B
C D
E F
or
Col1 Col2
B A
D C
F E
Please help me.
Thanks..

Try this:
rextester: http://rextester.com/XCYU52032
create table tb (col1 char(1), col2 char(1))
insert into tb (col1, col2) values
('a','b')
,('b','a')
,('c','d')
,('d','c')
,('e','f')
,('f','e');
with cte as (
    select col1, col2, rn = row_number() over (order by col1, col2)
    from tb
)
/* preview: swap the delete below for this select to see the rows that are kept
select x.*
from cte as x
where not exists (
    select 1
    from cte as y
    where y.col1 = x.col2
      and y.col2 = x.col1
      and y.rn < x.rn
)
*/
delete x
from cte as x
where exists (
    select 1
    from cte as y
    where y.col1 = x.col2
      and y.col2 = x.col1
      and y.rn < x.rn   -- x is the later row of a mirrored pair; keeps ('a','b'), ('c','d'), ('e','f')
);
select * from tb;

Try
delete t1
from myTable t1
where t1.col1 > t1.col2
  and exists (select 1
              from myTable t2
              where t2.col1 = t1.col2 and t2.col2 = t1.col1);
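The col1 > col2 rule can be tried out with a minimal sketch in SQLite through Python's sqlite3 module (an assumption; the question targets SQL Server, whose DELETE syntax differs slightly). The row ('H', 'G') is a hypothetical addition with no mirrored partner, included to show that unpaired rows survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb (col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO tb VALUES (?, ?)",
    [("A", "B"), ("B", "A"), ("C", "D"), ("D", "C"), ("E", "F"), ("F", "E"),
     ("H", "G")],  # hypothetical unpaired row, not in the question
)

# Of each mirrored pair, delete the row whose col1 sorts after its col2.
# The EXISTS check keeps rows that have no mirrored partner at all.
conn.execute("""
DELETE FROM tb
WHERE col1 > col2
  AND EXISTS (
      SELECT 1
      FROM tb AS m
      WHERE m.col1 = tb.col2 AND m.col2 = tb.col1
  )
""")
remaining = conn.execute(
    "SELECT col1, col2 FROM tb ORDER BY col1, col2"
).fetchall()
print(remaining)  # [('A', 'B'), ('C', 'D'), ('E', 'F'), ('H', 'G')]
```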

Join Tables SQL Server with duplicates

I have a table
col1
1
2
and other table
col1 col2 col3
1 1 data value one
1 2 data value one
2 3 data value two
and I want to join both tables to obtain the following result
col1 col2 col3
1 1 data value one
2 3 data value two
The second table has duplicates, but I need to join only one of them (chosen at random). I've tried INNER JOIN, LEFT JOIN, and RIGHT JOIN, and each always returns all rows. I am currently using SQL Server 2008.

select t1.col1, t2.col2, t2.col3 from table1 t1
cross apply
(select top 1 col2, col3 from table2 where col1 = t1.col1 order by newid()) t2

You can use the ROW_NUMBER function along with ORDER BY NEWID() to get one random row for each value in col1:
WITH CTE AS
( SELECT Col1,
Col2,
Col3,
[RowNumber] = ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY NEWID())
FROM Table2
)
SELECT *
FROM Table1
INNER JOIN CTE
ON CTE.Col1 = table1.Col1
AND CTE.RowNumber = 1 -- ONLY GET ONE ROW FOR EACH VALUE

Use DISTINCT; it will eliminate duplicates. But are you sure both rows will contain the same data?
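Whether DISTINCT helps depends on that last point. A minimal sketch with the question's sample data, run in SQLite through Python's sqlite3 module (an assumption; the question targets SQL Server 2008), suggests it does not help here, because the two rows for col1 = 1 differ in col2 and so are not duplicates of each other:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 INTEGER)")
conn.execute("CREATE TABLE table2 (col1 INTEGER, col2 INTEGER, col3 TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?)",
    [(1, 1, "data value one"),
     (1, 2, "data value one"),
     (2, 3, "data value two")],
)

# DISTINCT compares whole output rows, and col2 differs (1 vs 2),
# so both rows for col1 = 1 are still returned.
distinct_rows = conn.execute("""
SELECT DISTINCT t1.col1, t2.col2, t2.col3
FROM table1 t1
INNER JOIN table2 t2 ON t2.col1 = t1.col1
ORDER BY t1.col1, t2.col2
""").fetchall()
print(len(distinct_rows))  # 3
```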

Sort multiple query results in a single query

I have a select statement returning 5 columns:
select col1,col2,col3,col4,col5 from table1;
col1 col2 col3 col4 col5
9 A B C D
8 E F G H
I have another select statement from table2 which returns col1 alone;
col1
8
9
Based on the two select queries, is there a way to write a single select query to return the result as:
col1 col2 col3 col4 col5
8 E F G H
9 A B C D
That is, sort the output of the first query by the order of col1 in the second query. (This is in MySQL.)
PS: The second query's col1 drives the sort, and it comes from table2. Table2's col1 is not static; it changes with every user action, so on each call I get table2's col1 and need to sort table1's output by it.

Use an ORDER BY:
SELECT col1,col2,col3,col4,col5
FROM table1
ORDER BY col1
By default, ORDER BY is ASC.
SELECT col1,col2,col3,col4,col5
FROM table1
ORDER BY col1 DESC
...will put 9 from col1 as the first record returned.

For this to work, you seriously need a sort column on table2. Just having the IDs in table2 is not enough. You can have the records 7,8,9, then delete 8 and add it back. But no, that doesn't order it as 7,9,8. Maybe temporarily if there is no primary key on the table, but when the table gets large, even that "implicit" order is lost.
So, assuming you have such a sort column
Table2
Sort, Col1
1, 9
2, 8
Your query becomes
SELECT a.*
FROM table1 a
INNER JOIN table2 b ON a.col1 = b.col1
ORDER BY b.sort ASC
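The explicit sort column can be checked with a minimal sketch in SQLite through Python's sqlite3 module (an assumption; the question targets MySQL). The table1 rows come from the question, and the sort values are the ones assumed just above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 "
    "(col1 INTEGER, col2 TEXT, col3 TEXT, col4 TEXT, col5 TEXT)"
)
conn.execute("CREATE TABLE table2 (sort INTEGER, col1 INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?)",
    [(9, "A", "B", "C", "D"), (8, "E", "F", "G", "H")],
)
# table2 records the wanted order explicitly: 9 first, then 8.
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [(1, 9), (2, 8)])

# The join brings the sort key alongside each table1 row;
# ORDER BY then makes the order explicit instead of incidental.
ordered = conn.execute("""
SELECT a.*
FROM table1 a
INNER JOIN table2 b ON a.col1 = b.col1
ORDER BY b.sort ASC
""").fetchall()
print(ordered)  # [(9, 'A', 'B', 'C', 'D'), (8, 'E', 'F', 'G', 'H')]
```

Changing the order then only requires updating the sort values in table2, not the query.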
If you still want to rely on MySQL undocumented features or the way it currently works, then you can try this.
# test tables
create table table1 (col1 int, col2 int, col3 int);
insert table1 select 8, 1,2; # in this order
insert table1 select 9, 3,4;
create table table2 (col1 int);
insert table2 select 9; # in this order
insert table2 select 8;
# select
SELECT a.*
FROM table1 a
INNER JOIN table2 b ON a.col1 = b.col1
----output----
col1 col2 col3
9 3 4
8 1 2
This works at least for small tables, only because size(table2) < size(table1) so it collects in that order, preserving the filesort on table2.col1.

Not sure what the relationship is between t1.col1 and t2.col1. Probably looking for something like this though:
SELECT t2.col1, t1.col2, t1.col3, t1.col4, t1.col5
FROM table2 t2
INNER JOIN table1 t1 ON t1.col1 = t2.col1
ORDER BY t2.col1 ASC
