Join Tables SQL Server with duplicates - sql

I have a table
col1
1
2
and another table
col1 col2 col3
1 1 data value one
1 2 data value one
2 3 data value two
and I want to join both tables to obtain the following result
col1 col2 col3
1 1 data value one
2 3 data value two
The second table has duplicates, but I need to join only one row per value (chosen at random). I've tried INNER JOIN, LEFT JOIN and RIGHT JOIN, and they always return all rows. I am currently using SQL Server 2008.

select t1.col1, t2.col2, t2.col3 from table1 t1
cross apply
(select top 1 col2, col3 from table2 where col1 = t1.col1 order by newid()) t2

You can use the ROW_NUMBER function along with ORDER BY NEWID() to get one random row for each value in col1:
WITH CTE AS
( SELECT Col1,
Col2,
Col3,
[RowNumber] = ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY NEWID())
FROM Table2
)
SELECT *
FROM Table1
INNER JOIN CTE
ON CTE.Col1 = table1.Col1
AND CTE.RowNumber = 1 -- ONLY GET ONE ROW FOR EACH VALUE
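A minimal, runnable sketch of the ROW_NUMBER approach above. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, which is an assumption made only so the example is self-contained. SQLite's RANDOM() replaces NEWID(), window functions need SQLite 3.25 or later, and the function name one_random_match_per_key is hypothetical.

```python
import sqlite3


def one_random_match_per_key(conn):
    """Join table1 to exactly one randomly chosen table2 row per col1 value."""
    query = """
        WITH ranked AS (
            SELECT col1, col2, col3,
                   -- number the rows of each col1 group in random order
                   ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY RANDOM()) AS rn
            FROM table2
        )
        SELECT t1.col1, r.col2, r.col3
        FROM table1 AS t1
        INNER JOIN ranked AS r
                ON r.col1 = t1.col1
               AND r.rn = 1          -- keep only one row per col1 value
        ORDER BY t1.col1
    """
    return conn.execute(query).fetchall()


# Sample data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 INTEGER);
    CREATE TABLE table2 (col1 INTEGER, col2 INTEGER, col3 TEXT);
    INSERT INTO table1 VALUES (1), (2);
    INSERT INTO table2 VALUES (1, 1, 'data value one'),
                              (1, 2, 'data value one'),
                              (2, 3, 'data value two');
""")
rows = one_random_match_per_key(conn)
print(rows)  # two rows; which col2 is picked for col1 = 1 varies per run
```

The row for col1 = 2 is always the same because that group has a single candidate; only the col1 = 1 group varies between runs.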

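You could use DISTINCT to eliminate the duplicates, but that only works if the duplicate rows contain identical data in every selected column. Are you sure both rows contain the same data?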

Related

Need help finding duplicate values for Data Quality checks

I have a table which requires me to ensure that a combination of attributes should have a unique record against it.
col1 col2 col3
a b x
a b y
a c x
a d z
e b w
How do I ensure that a col1+col2 combination only has unique col3 values? Here ab has both x and y as col3 values. I have to send such rows to a reject file and I am looking for the right filter query.
We can use an aggregation approach. To identify rows which are failing the unique requirement use:
WITH cte AS (
SELECT col1, col2
FROM yourTable
GROUP BY col1, col2
HAVING MIN(col3) <> MAX(col3)
)
SELECT t1.*
FROM yourTable t1
INNER JOIN cte t2
ON t2.col1 = t1.col1 AND
t2.col2 = t1.col2;
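The aggregation query above can be tried against the sample rows from the question. This is a sketch that runs the same SQL through Python's sqlite3 module (an assumption, chosen only to make it self-contained); the function name find_conflicting_rows and the CTE name conflicts are hypothetical.

```python
import sqlite3


def find_conflicting_rows(conn):
    """Return every row whose (col1, col2) pair maps to more than one col3."""
    query = """
        WITH conflicts AS (
            SELECT col1, col2
            FROM yourTable
            GROUP BY col1, col2
            HAVING MIN(col3) <> MAX(col3)   -- at least two distinct col3 values
        )
        SELECT t.col1, t.col2, t.col3
        FROM yourTable AS t
        INNER JOIN conflicts AS c
                ON c.col1 = t.col1
               AND c.col2 = t.col2
        ORDER BY t.col1, t.col2, t.col3
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yourTable (col1 TEXT, col2 TEXT, col3 TEXT);
    INSERT INTO yourTable VALUES ('a', 'b', 'x'),
                                 ('a', 'b', 'y'),
                                 ('a', 'c', 'x'),
                                 ('a', 'd', 'z'),
                                 ('e', 'b', 'w');
""")
rejects = find_conflicting_rows(conn)
print(rejects)  # [('a', 'b', 'x'), ('a', 'b', 'y')]
```

Note that MIN and MAX ignore NULLs, so a group whose col3 values are one non-NULL value plus NULLs would not be reported by this filter.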

Get random value sets from table without using cursor or While loop

I have a table with 5 columns:
ID - int identity,col1,col2,col3,col4,(all 4 cols are varchar)
There are approx. 68,000 unique col1/col2 values. For each of these, there can be between 1 and approx. 214,000 unique col3/col4 values.
My task is to retrieve one random col3 and col4 (from the same row) for each of the unique col1/col2 values.
Is it possible to accomplish this without using a While loop or a cursor? I've done some research and know how to get random values (and the identity column helps with that), but the only way I can see to do this is to go through the 68,000 unique col1/col2 values one by one and grab a random col3/col4 value for each.
Also, these row counts are for preliminary development/testing (collected from 4 previous months of data). When this goes live we will be going back 27 months. So obviously, we are talking about a massive amount of data.
I've seen some mentions of using CTEs, but have not been successful in finding an example or explanation.
Thanks for your help.
I figured out a solution involving temp tables, ROW_NUMBER() over..., and RAND().
First, I selected the distinct col1 and col2 values into #temp1.
SELECT DISTINCT col1, col2
INTO #temp1
FROM sourceTable
Next, I selected the distinct col3 and col4 values for each col1/col2, along with a row number, and put in temp table #temp2:
SELECT t.COL1, t.COL2, a.col3, a.col4,
ROW_NUMBER() OVER (
PARTITION BY t.col1, t.col2
ORDER BY t.col1, t.col2, a.col3, a.col4) as RowNumber
INTO #temp2
FROM #temp1 t
JOIN sourceTable a ON a.col1 = t.col1 AND a.col2 = t.col2
GROUP BY t.col1, t.col2, a.col3, a.col4
ORDER BY t.col1, t.col2, RowNumber
Then, I selected one of the rows at random from each set of col1/col2's into a 3rd temp table:
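-- Note: an unseeded RAND() is evaluated once per query in SQL Server,
-- so every col1/col2 group is scaled by the same random fraction.
SELECT x.col1, x.col2,
(SELECT TOP 1 y.RowNumber
FROM #temp2 y
WHERE y.col1 = x.col1
AND y.col2 = x.col2
AND y.RowNumber >= RAND() *
(SELECT MAX(z.RowNumber)
FROM #temp2 z
WHERE z.col1 = x.col1
AND z.col2 = x.col2)
ORDER BY y.RowNumber) AS Random_RowNumber -- without ORDER BY, TOP 1 may return any qualifying row
INTO #temp3
FROM #temp1 x
ORDER BY x.col1, x.col2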
Lastly, I join the tables to get the random rows:
SELECT t3.col1, t3.col2, t2.col3, t2.col4
FROM #temp3 t3
JOIN #temp2 t2 on t2.col1 = t3.col1 AND t2.col2 = t3.col2 AND t2.RowNumber = t3.Random_RowNumber
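Since the question asked for a CTE example, here is a sketch of how the temp-table steps above could be collapsed into one statement. This is a different technique from the RAND() approach in the answer: it ranks the distinct col3/col4 pairs of each col1/col2 group in random order and keeps the first. It runs on SQLite through Python's sqlite3 module as a stand-in for SQL Server (an assumption; RANDOM() replaces NEWID(), and SQLite 3.25 or later is needed). The names random_row_per_group, distinct_pairs and ranked are hypothetical.

```python
import sqlite3


def random_row_per_group(conn):
    """Pick one random distinct (col3, col4) pair for each (col1, col2)."""
    query = """
        WITH distinct_pairs AS (
            -- mirrors the DISTINCT / GROUP BY steps of the temp-table version
            SELECT DISTINCT col1, col2, col3, col4
            FROM sourceTable
        ),
        ranked AS (
            SELECT col1, col2, col3, col4,
                   ROW_NUMBER() OVER (PARTITION BY col1, col2
                                      ORDER BY RANDOM()) AS rn
            FROM distinct_pairs
        )
        SELECT col1, col2, col3, col4
        FROM ranked
        WHERE rn = 1
        ORDER BY col1, col2
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sourceTable (id INTEGER PRIMARY KEY,
                              col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT);
    INSERT INTO sourceTable (col1, col2, col3, col4)
    VALUES ('a', 'x', 'p1', 'q1'),
           ('a', 'x', 'p2', 'q2'),
           ('a', 'x', 'p1', 'q1'),   -- repeated pair, removed by DISTINCT
           ('b', 'y', 'p3', 'q3');
""")
picked = random_row_per_group(conn)
print(picked)  # one row per (col1, col2); the pick for ('a', 'x') varies
```

Whether this scales to the 27-month data set is not something the sketch can show; sorting every group by a random value touches all rows once, so it is worth measuring against the temp-table version.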

Compare a two column with the same table to remove duplicate

I have a sample table with two columns. I need to compare column 1 and column 2 against the other records in the same table and remove the rows where (column 1, column 2) is the reverse of another row's (column 2, column 1).
I tried a self join and a CASE condition, but it's not working.
If I understand correctly, you can run a simple select like this if you have all reversed pairs in the table:
select col1, col2
from t
where col1 < col2;
If you have some singletons, then:
select col1, col2
from t
where col1 < col2 or
(col1 > col2 and
not exists (select 1
from t t2
where t2.col1 = t.col2 and
t2.col2 = t.col1
)
);
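To see what the second query keeps, here is a small sketch that runs it on made-up data through Python's sqlite3 module (an assumption, used only to make it runnable). The sample rows and the function name drop_reversed_pairs are hypothetical.

```python
import sqlite3


def drop_reversed_pairs(conn):
    """Keep one row per unordered pair; singletons survive in either order."""
    query = """
        SELECT col1, col2
        FROM t
        WHERE col1 < col2
           OR (col1 > col2
               AND NOT EXISTS (SELECT 1
                               FROM t AS t2
                               WHERE t2.col1 = t.col2
                                 AND t2.col2 = t.col1))
        ORDER BY col1, col2
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (col1 INTEGER, col2 INTEGER);
    INSERT INTO t VALUES (1, 2),   -- kept: col1 < col2
                         (2, 1),   -- dropped: its reverse (1, 2) exists
                         (3, 4),   -- kept: col1 < col2
                         (5, 3);   -- kept: singleton, no (3, 5) row
""")
pairs = drop_reversed_pairs(conn)
print(pairs)  # [(1, 2), (3, 4), (5, 3)]
```

Rows where col1 equals col2 match neither branch of the WHERE clause, so they are dropped; add an explicit `OR col1 = col2` if those should be kept.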
You can use the except operator.
"EXCEPT returns distinct rows from the left input query that aren't output by the right input query."
SELECT C1, C2 FROM table
Except
SELECT C2, C1 FROM table
Example with your given data set: dbfiddle
This answer assumes an Oracle database and that the columns are strings (VARCHAR):
delete from table where rowid in (
select rowid from table
where column1 || column2 =column2 || column1 )
Feel free to provide more input and we can tweak the answer.
Okay. There might be a simpler way of doing this but this might work as well. {table} is to be replaced with your table name.
;with orderedtable as (select t1.col1, t1.col2, ROW_NUMBER() OVER(ORDER BY t1.col1, t1.col2 ASC) AS rownum
from (select distinct t2.col1, t2.col2 from {table} t2) as t1)
select f1.col1, f1.col2
from orderedtable f1
left join orderedtable f2 on f1.col1 = f2.col2 and f1.col2 = f2.col1 and f1.rownum < f2.rownum
where f2.rownum is null
The SQL below will get the reversed col1 and col2 rows:
select
distinct t2.col1, t2.col2
from
table t1
join
table t2 on t1.col1 = t2.col2 and t1.col2 = t2.col1
Once we have these reversed rows, we can exclude them with a LEFT JOIN. The complete SQL is:
select
t.col1,t.col2
from
table t
left join
(
select
distinct t2.col1, t2.col2
from
table t1
join
table t2 on t1.col1 = t2.col2 and t1.col2 = t2.col1
) tmp on t.col1 = tmp.col1 and t.col2 = tmp.col2
where
tmp.col1 is null
Is it clear?

Remove inner query in SQL

We have a SQL query that does not follow our SQL guidelines. We have to change the query, but if we change the logic and remove the inner queries it takes too much time to execute. Below is the query:
select col1,
col2,
case
when col1 <> '' then(select top 1
col1
from table1 as BP
where bp.col1 = FD.col1 order by BP.col1)
when col2 <> '' then(select top 1
BP.col2
from table1 as BP
where BP.col2 = FD.col2 order by BP.col2)
else ''
end
from table2 FD
The above query is used to insert data into a temp table. table1 has almost 100 million rows. Is there any way to remove the inline queries while keeping good performance? We have already created indexes on table1. Any thoughts?
Try this
;WITH CTE
AS
(
SELECT
RN = ROW_NUMBER() OVER(ORDER BY COALESCE(T2.col1,T2.col2)),
T2.col1,
T2.col2,
T1Val = COALESCE(T2.col1,T2.col2,'')
FROM table2 T2
LEFT JOIN table1 T1
ON
(
(
ISNULL(T2.col1,'')<>'' AND T1.col1 = T2.col1
)
OR
(
ISNULL(T2.col2,'')<>'' AND T1.col2 = T2.col2
)
)
)
SELECT
*
FROM CTE
WHERE RN = 1
Here is my modest help:
1. Prepare and materialize your two subqueries in advance (GROUP BY col1 or col2). This will reduce the size of table1.
2. Split your main query on table2 into 3 different queries:
   one with SELECT .. FROM table2 WHERE col1 <> ''
   one with SELECT .. FROM table2 WHERE col1 = '' AND col2 <> ''
   one with SELECT .. FROM table2 WHERE col1 = '' AND col2 = ''
3. Use an INNER JOIN with the table created in the first point. (If you use SSIS, you can run these in parallel and use the materialized table in a Lookup.)
4. If col1 or col2 is NVARCHAR(MAX) or otherwise large, consider a hash function (MD5, for example) and compare the hashes instead.
5. Make sure all your indexes are in place.
If that still does not perform well, you can try:
OUTER APPLY (SELECT TOP 1 .. )
Another idea would be:
SELECT col1, col2, col1 AS yourNewCol
FROM table2 T2
WHERE EXISTS( SELECT 1 FROM table1 T1 WHERE T1.col1 = T2.col1)
UNION ALL
SELECT col1, col2, col2 AS yourNewCol
FROM table2 T2
WHERE
NOT EXISTS( SELECT 1 FROM table1 T1 WHERE T1.col1 = T2.col1)
AND EXISTS( SELECT 1 FROM table1 T1 WHERE T1.col2 = T2.col2)
UNION ALL
...
I don't have a clean solution for you, but some ideas.
Let me know if it helps you.
Regards,
Arnaud
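One way to read the first and third ideas above (reduce table1 to its distinct keys, then join) is sketched below. It relies on an observation about the original query: each correlated TOP 1 subquery can only return the key value it was filtered on, or NULL when there is no match, so a LEFT JOIN to the distinct keys gives the same result without multiplying rows. The sketch runs on SQLite through Python's sqlite3 module as a stand-in for SQL Server (an assumption), and the names resolve_keys, k1, k2 and matched are hypothetical. It says nothing about performance on 100 million rows, which would need to be measured.

```python
import sqlite3


def resolve_keys(conn):
    """Replace the correlated subqueries with joins to the distinct keys."""
    query = """
        SELECT fd.col1, fd.col2,
               CASE
                   WHEN fd.col1 <> '' THEN k1.col1   -- NULL when no match
                   WHEN fd.col2 <> '' THEN k2.col2   -- NULL when no match
                   ELSE ''
               END AS matched
        FROM table2 AS fd
        LEFT JOIN (SELECT DISTINCT col1 FROM table1) AS k1
               ON k1.col1 = fd.col1
        LEFT JOIN (SELECT DISTINCT col2 FROM table1) AS k2
               ON k2.col2 = fd.col2
        ORDER BY fd.col1, fd.col2
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 TEXT, col2 TEXT);
    CREATE TABLE table2 (col1 TEXT, col2 TEXT);
    INSERT INTO table1 VALUES ('a', 'x'), ('a', 'y'), ('b', 'z');
    INSERT INTO table2 VALUES ('a', 'q'),   -- col1 found in table1
                              ('',  'z'),   -- falls through to col2
                              ('',  ''),    -- both empty
                              ('c', 'x');   -- col1 set but not in table1
""")
resolved = resolve_keys(conn)
print(resolved)
```

In SQL Server the two DISTINCT derived tables could instead be materialized into indexed temp tables, as the first idea suggests.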

Sort multiple query results in a single query

I have a select statement returning 5 columns:
select col1,col2,col3,col4,col5 from table1;
col1 col2 col3 col4 col5
9 A B C D
8 E F G H
I have another select statement from table2 which returns col1 alone;
col1
8
9
Based on the two select queries, is there a way to write a single select query to return the result as:
col1 col2 col3 col4 col5
8 E F G H
9 A B C D
In other words, I want to sort the output of the first query based on col1 from the second query. (This is in MySQL.)
PS: col1 of table2 is what drives the sorting. table2's col1 is not static; it changes with every user action, and on each call I get table2's col1 and need to sort table1's output by it.
Use an ORDER BY:
SELECT col1,col2,col3,col4,col5
FROM table1
ORDER BY col1
By default, ORDER BY is ASC.
SELECT col1,col2,col3,col4,col5
FROM table1
ORDER BY col1 DESC
...will put 9 from col1 as the first record returned.
For this to work, you seriously need a sort column on table2. Just having the IDs in table2 is not enough. You can have the records 7,8,9, then delete 8 and add it back. But no, that doesn't order it as 7,9,8. Maybe temporarily if there is no primary key on the table, but when the table gets large, even that "implicit" order is lost.
So, assuming you have such a sort column
Table2
Sort, Col1
1, 9
2, 8
Your query becomes
SELECT a.*
FROM table1 a
INNER JOIN table2 b ON a.col1 = b.col1
ORDER BY b.sort ASC
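A runnable sketch of the sort-column join above, using the rows from the question and the Sort values from this answer. It uses Python's sqlite3 module rather than MySQL (an assumption, made only so the example is self-contained); the function name rows_in_table2_order is hypothetical.

```python
import sqlite3


def rows_in_table2_order(conn):
    """Return table1 rows ordered by the explicit sort column of table2."""
    query = """
        SELECT a.col1, a.col2, a.col3, a.col4, a.col5
        FROM table1 AS a
        INNER JOIN table2 AS b ON a.col1 = b.col1
        ORDER BY b.sort ASC
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 INTEGER, col2 TEXT, col3 TEXT,
                         col4 TEXT, col5 TEXT);
    CREATE TABLE table2 (sort INTEGER, col1 INTEGER);
    INSERT INTO table1 VALUES (9, 'A', 'B', 'C', 'D'),
                              (8, 'E', 'F', 'G', 'H');
    INSERT INTO table2 VALUES (1, 9), (2, 8);
""")
ordered = rows_in_table2_order(conn)
print(ordered)  # [(9, 'A', 'B', 'C', 'D'), (8, 'E', 'F', 'G', 'H')]
```

Because the order comes from an explicit column, it does not depend on insertion order or on how the engine happens to execute the join.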
If you still want to rely on MySQL undocumented features or the way it currently works, then you can try this.
# test tables
create table table1 (col1 int, col2 int, col3 int);
insert table1 select 8, 1,2; # in this order
insert table1 select 9, 3,4;
create table table2 (col1 int);
insert table2 select 9; # in this order
insert table2 select 8;
# select
SELECT a.*
FROM table1 a
INNER JOIN table2 b ON a.col1 = b.col1
----output----
col1 col2 col3
9 3 4
8 1 2
This works at least for small tables, only because size(table2) < size(table1) so it collects in that order, preserving the filesort on table2.col1.
Not sure what the relationship is between t1.col1 and t2.col1. You are probably looking for something like this, though:
SELECT t2.col1, t1.col2, t1.col3, t1.col4, t1.col5
FROM table2 t2
INNER JOIN table1 t1 ON t1.col1 = t2.col1
ORDER BY t2.col1 ASC