joining 2 tables with non unique IDs for merge

I need to migrate data from another table into mine. However, the ID I have to use to join them is not unique (about 10% of the IDs in either table are duplicated; they are not primary keys).
For example, table A has IDs (1, 1, 2, 3) and table B has IDs (1, 2, 2, 3, 4). How can I join them so that the result either omits the duplicates or takes ANY row from the other table as the correct link?
My goal is to return a view with no duplicate values at all in the ID column I am joining on.

How about using row_number() for the query:
select a.*, b.*
from (select a.*, row_number() over (partition by id order by id) as seqnum
      from tablea a
     ) a join
     (select b.*, row_number() over (partition by id order by id) as seqnum
      from tableb b
     ) b
     on a.id = b.id and a.seqnum = 1 and b.seqnum = 1;
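
To see what the row_number() join above returns on the IDs from the question, here is a minimal sketch that runs the same query through Python's sqlite3 module. The payload columns a_val and b_val and their values are made up for illustration, and it assumes the bundled SQLite is version 3.25 or newer (needed for window functions). Because the ordering inside each partition is arbitrary, only the set of IDs is predictable, not which duplicate row survives.

```python
import sqlite3

# Hypothetical sample data: the ids come from the question,
# the a_val / b_val payload columns are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablea (id INTEGER, a_val TEXT);
    CREATE TABLE tableb (id INTEGER, b_val TEXT);
    INSERT INTO tablea VALUES (1, 'a1'), (1, 'a1-dup'), (2, 'a2'), (3, 'a3');
    INSERT INTO tableb VALUES (1, 'b1'), (2, 'b2'), (2, 'b2-dup'), (3, 'b3'), (4, 'b4');
""")

# Number the rows inside each id on both sides, then join only the rows
# numbered 1, so every id appears at most once in the result.
query = """
    SELECT a.id, a.a_val, b.b_val
    FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS seqnum
          FROM tablea t) a
    JOIN (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS seqnum
          FROM tableb t) b
      ON a.id = b.id AND a.seqnum = 1 AND b.seqnum = 1
    ORDER BY a.id
"""
rows = conn.execute(query).fetchall()
ids = [row[0] for row in rows]
print(ids)
```

ID 4 drops out because it exists only in table B; switching to a left or full outer join would keep it if that is what the merge needs.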

Related

How do I join two tables together (one to many relationship), but only select the 3rd match from the second table?

I have two tables, table A and table B. There are multiple entries in table B for each entry in table A when joining them together, but I only want to match the 3rd value from table B, which is neither the maximum nor the minimum of the values. The values can be ordered, and it will always be the 3rd value after ordering. Is there a way to do this? Thank you!

WITH
ranked_b AS
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY key ORDER BY val) AS key_rank
FROM
table_b
)
SELECT
*
FROM
table_a
INNER JOIN
ranked_b
ON ranked_b.key = table_a.key
AND ranked_b.key_rank = 3
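
As a rough check of the ranked CTE above, the sketch below runs it against a small in-memory SQLite database from Python. The sample rows are invented, and it assumes SQLite 3.25 or newer for ROW_NUMBER(). Key 1 has four values, so its third value after ordering is returned; key 2 has only two values, so the inner join drops it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: key 1 has four values, key 2 has only two.
conn.executescript("""
    CREATE TABLE table_a (key INTEGER);
    CREATE TABLE table_b (key INTEGER, val INTEGER);
    INSERT INTO table_a VALUES (1), (2);
    INSERT INTO table_b VALUES (1, 50), (1, 10), (1, 30), (1, 20), (2, 5), (2, 7);
""")

# Rank table_b rows per key by val, then keep only the row ranked 3.
third = conn.execute("""
    WITH ranked_b AS (
        SELECT key, val,
               ROW_NUMBER() OVER (PARTITION BY key ORDER BY val) AS key_rank
        FROM table_b
    )
    SELECT table_a.key, ranked_b.val
    FROM table_a
    INNER JOIN ranked_b
        ON ranked_b.key = table_a.key
       AND ranked_b.key_rank = 3
    ORDER BY table_a.key
""").fetchall()
print(third)
```

If keys with fewer than three matches should still appear (with a NULL value), a LEFT JOIN with the same ON clause would do that.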

Consider the approach below (BigQuery syntax):
select key,
array_agg(value order by value limit 3)[safe_ordinal(3)] as value
from tableA
left join tableB
on key = foreignkey
group by key

You can use a correlated subquery. Add an ORDER BY so that "the 3rd value" is well defined:
select a.*,
       (select b.value
        from b
        where b.key = a.key
        order by b.value
        limit 1 offset 2
       ) as third_value
from a;

Get one record per ID

I'm trying to retrieve data from two tables, A and B:
select * from tableA a LEFT JOIN tableB b on a.idA = b.idA
There are multiple rows in B for each primary key from A, but I want only the first record for every ID from tableA. How can I achieve this?

SQL tables represent unordered sets, so there is no first row. You can, however, get an arbitrary row, or a specific row based on an ordering column, using window functions:
select *
from tableA a LEFT JOIN
(select b.*,
row_number() over (partition by idA order by <ordering col>) as seqnum
from tableB b
) b
on a.idA = b.idA and seqnum = 1
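
A small runnable sketch of the left join above, using Python's sqlite3 (SQLite 3.25 or newer assumed). The columns created and note are hypothetical stand-ins for the ordering column and the payload; they are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: id 1 has two rows in tableB, id 2 has one, id 3 has none.
conn.executescript("""
    CREATE TABLE tableA (idA INTEGER PRIMARY KEY);
    CREATE TABLE tableB (idA INTEGER, created INTEGER, note TEXT);
    INSERT INTO tableA VALUES (1), (2), (3);
    INSERT INTO tableB VALUES (1, 30, 'late'), (1, 10, 'early'), (2, 20, 'only');
""")

# "First" is defined here by the smallest value of the (assumed) created column.
first_rows = conn.execute("""
    SELECT a.idA, b.note
    FROM tableA a
    LEFT JOIN (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY idA ORDER BY created) AS seqnum
        FROM tableB t
    ) b
        ON a.idA = b.idA AND b.seqnum = 1
    ORDER BY a.idA
""").fetchall()
print(first_rows)
```

Keeping seqnum = 1 in the ON clause rather than in a WHERE clause is what preserves the rows of tableA that have no match in tableB.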

How to compare two tables in Hive based on counts

I have the Hive tables below:
Table_1
ID
1
1
2
Table_2
ID
1
2
2
I am comparing the two tables based on the count of each ID in both tables. I need output like this:
ID
1 - 2 records in Table_1 and 1 record in Table_2
2 - 1 record in Table_1 and 2 records in Table_2
Table_1 is the parent table.
I am using the queries below:
select count(*),ID from Table_1 group by ID;
select count(*),ID from Table_2 group by ID;

Just do a full outer join on your two queries with the on condition X.id = Y.id, and use coalesce to handle the nulls that appear when an ID exists on only one side.
select coalesce(X.id, Y.id) as id,
       concat(coalesce(X.cnt1, 0), ' entries in table 1, ', coalesce(Y.cnt2, 0), ' entries in table 2')
from (select count(*) as cnt1, id from Table_1 group by id) X
full outer join
     (select count(*) as cnt2, id from Table_2 group by id) Y
on X.id = Y.id;

Try this. You may use a CASE expression to choose between "record" and "records".
SELECT m.id,
CONCAT (COALESCE(a.ct, 0), ' record in table 1, ', COALESCE(b.ct, 0),
' record in table 2')
FROM (SELECT id
FROM table_1
UNION
SELECT id
FROM table_2) m
LEFT JOIN (SELECT Count(*) AS ct,
id
FROM table_1
GROUP BY id) a
ON m.id = a.id
LEFT JOIN (SELECT Count(*) AS ct,
id
FROM table_2
GROUP BY id) b
ON m.id = b.id;
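
The union-plus-left-joins query above can be tried outside Hive. The sketch below runs an equivalent query in SQLite through Python and does the record/records wording in Python instead of a CASE expression. The data is the question's, plus one made-up ID (3) that exists only in table_2 to show the COALESCE to 0; the plural helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Data from the question, plus a hypothetical id 3 present only in table_2.
conn.executescript("""
    CREATE TABLE table_1 (id INTEGER);
    CREATE TABLE table_2 (id INTEGER);
    INSERT INTO table_1 VALUES (1), (1), (2);
    INSERT INTO table_2 VALUES (1), (2), (2), (3);
""")

# Build the full list of ids, then attach the per-table counts to it.
counts = conn.execute("""
    SELECT m.id, COALESCE(a.ct, 0), COALESCE(b.ct, 0)
    FROM (SELECT id FROM table_1 UNION SELECT id FROM table_2) m
    LEFT JOIN (SELECT id, COUNT(*) AS ct FROM table_1 GROUP BY id) a ON m.id = a.id
    LEFT JOIN (SELECT id, COUNT(*) AS ct FROM table_2 GROUP BY id) b ON m.id = b.id
    ORDER BY m.id
""").fetchall()


def plural(n: int) -> str:
    """Hypothetical helper: '1 record' but '0 records' / '2 records'."""
    return f"{n} record" if n == 1 else f"{n} records"


messages = [
    f"{id_} - {plural(c1)} in table 1 and {plural(c2)} in table 2"
    for id_, c1, c2 in counts
]
for line in messages:
    print(line)
```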

You could use this Python program to do a full comparison of 2 Hive tables:
https://github.com/bolcom/hive_compared_bq
If you want a quick comparison just based on counts, then pass the "--just-count" option (you can also specify the group by column with "--group-by-column").
The script also allows you to visually see all the differences on all rows and all columns if you want a complete validation.

SQL Server 2012 writing duplicate entries into table from CTE

So I am writing the output of a few sequential CTEs to a table. When I changed a join in one of the CTEs from an inner join to a left join, duplicated entries appeared in the table that do not show up if I just run the query without the insert.
Is there something I need to understand about creating and inserting into a table with regard to joins in a CTE?
EDIT
create table MYTABLE
(
ID int,
Date smalldatetime,
Val1 int,
Val2 int
)
; with cte1 as (
select
a.ID,
a.Date,
a.Val1,
b.Val2
from table1 a
left join table2 b
on a.ID = b.ID
and a.Date = b.Date
)
insert into MYTABLE
(ID, Date, Val1, Val2)
select * from cte1
When creating the table on the inner join there is no problem with duplicates; on the left join (as shown above), rows where there are NULLs appear to be duplicated many times.

Check your right table (table2). My guess is that more than one record has the same ID and Date.
If that is the case, the records are not technically duplicates: if you select all columns (*) in the CTE, you will see the other fields that differ.
If you do not care about the rest of the fields being different, try adding a Row_Number to your CTE and selecting where Row_Number = 1 outside of the CTE.
For instance:
create table MYTABLE
(
ID int,
Date smalldatetime,
Val1 int,
Val2 int
)
; with cte1 as (
select
a.ID,
a.Date,
a.Val1,
b.Val2,
Rnum = ROW_NUMBER() OVER(PARTITION BY a.ID, a.Date, a.Val1, b.Val2 ORDER BY a.ID)
from table1 a
left join table2 b
on a.ID = b.ID
and a.Date = b.Date
)
insert into MYTABLE
(ID, Date, Val1, Val2)
select ID, Date, Val1, Val2 from cte1
where Rnum = 1
The row_number acts as a "distinct"; depending on which combination of fields you do not want duplicated, you will get different results.
For instance, if you do not want the IDs to duplicate, then
Rnum = ROW_NUMBER() OVER(PARTITION BY a.ID ORDER BY a.ID)
If you do not care about the IDs duplicating, but you do not want the same ID on the same date, then
Rnum = ROW_NUMBER() OVER(PARTITION BY a.ID, a.Date ORDER BY a.ID)
It just depends on which columns you do not want duplicated.
Hope this helps
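
To make the effect concrete, here is a minimal sketch in Python with sqlite3 (SQLite 3.25 or newer assumed) that reproduces the fan-out and the Rnum = 1 filter. The rows are invented, and the syntax is SQLite's (AS Rnum rather than SQL Server's Rnum = ...). Ordering by b.Val2 is an assumption made here so the surviving row is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: table2 holds two rows for the same (ID, Date),
# and no row at all for ID 2.
conn.executescript("""
    CREATE TABLE table1 (ID INTEGER, Date TEXT, Val1 INTEGER);
    CREATE TABLE table2 (ID INTEGER, Date TEXT, Val2 INTEGER);
    INSERT INTO table1 VALUES (1, '2020-01-01', 10), (2, '2020-01-01', 20);
    INSERT INTO table2 VALUES (1, '2020-01-01', 100), (1, '2020-01-01', 101);
""")

# Plain left join: ID 1 comes back twice because table2 matches it twice.
joined = conn.execute("""
    SELECT a.ID, a.Date, a.Val1, b.Val2
    FROM table1 a
    LEFT JOIN table2 b ON a.ID = b.ID AND a.Date = b.Date
""").fetchall()

# Same join, but keep only the first row per (ID, Date).
deduped = conn.execute("""
    WITH cte1 AS (
        SELECT a.ID, a.Date, a.Val1, b.Val2,
               ROW_NUMBER() OVER (PARTITION BY a.ID, a.Date ORDER BY b.Val2) AS Rnum
        FROM table1 a
        LEFT JOIN table2 b ON a.ID = b.ID AND a.Date = b.Date
    )
    SELECT ID, Date, Val1, Val2 FROM cte1 WHERE Rnum = 1 ORDER BY ID
""").fetchall()
print(len(joined), deduped)
```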

How to replace TOP 1000 rows of select columns indiscriminately

Basically, I have a table (TABLE A) that contains 1000 rows with three columns.
I have ANOTHER table (TABLE B) with 200 columns and 1 million+ records.
I am trying to replace the THREE COLUMNS in 1000 rows of TABLE B with those of TABLE A. I've read a lot of solutions where you INSERT into TABLE B from TABLE A, but that's useless because I'll get NULLs in the remaining 197 columns that I need data for.
So the task is to overwrite certain columns in rows of one table with the corresponding columns from another table. There are NO conditions; just the top rows, or whatever order you can think of, is fine. If you can give an answer that takes ORDER BY something into account, that'd be a bonus! Thank you so much!

If I understood your requirements correctly:
WITH TA
AS (SELECT *,
ROW_NUMBER()
OVER (
ORDER BY col1) AS RN
FROM TableA),
TB
AS (SELECT *,
ROW_NUMBER()
OVER (
ORDER BY col1) AS RN
FROM TableB)
UPDATE TB
SET TB.col1 = TA.col1,
TB.col2 = TA.col2,
TB.col3 = TA.col3
FROM TB
JOIN TA
ON TB.RN = TA.RN

Try something like this:
WITH topB AS (
SELECT TOP 1000 row_number() OVER(ORDER BY field_x) rn, b.* FROM table_b b
ORDER BY field_x),
topA AS (
SELECT row_number() OVER(ORDER BY field_m) rn, a.*
FROM table_a a)
UPDATE b
SET
b.Field_1 = a.Field_1,
b.Field_2 = a.Field_2,
b.Field_3 = a.Field_3
FROM
TopB b JOIN TopA a ON b.rn = a.rn
The idea here is to assign row numbers in both tables, join them by these numbers, and update the B part of the join with values from A.
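
The row-number pairing can be sketched outside SQL Server as well. The example below is not the UPDATE ... FROM statement from the answers: it swaps in a two-step variant (compute the pairs with a SELECT, then apply them with executemany) so that it runs on any SQLite 3.25 or newer through Python. The pk and other columns and all values are hypothetical, and only one of the three columns is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: table_a supplies 2 replacement values, table_b has 3 rows.
conn.executescript("""
    CREATE TABLE table_a (field_1 TEXT);
    CREATE TABLE table_b (pk INTEGER PRIMARY KEY, field_1 TEXT, other TEXT);
    INSERT INTO table_a VALUES ('new-x'), ('new-y');
    INSERT INTO table_b VALUES (1, 'old-1', 'keep-1'), (2, 'old-2', 'keep-2'),
                               (3, 'old-3', 'keep-3');
""")

# Step 1: number the rows on both sides and pair them up by row number.
pairs = conn.execute("""
    WITH top_a AS (
        SELECT field_1, ROW_NUMBER() OVER (ORDER BY field_1) AS rn FROM table_a
    ),
    top_b AS (
        SELECT pk, ROW_NUMBER() OVER (ORDER BY pk) AS rn FROM table_b
    )
    SELECT top_a.field_1, top_b.pk
    FROM top_b
    JOIN top_a ON top_b.rn = top_a.rn
    ORDER BY top_b.pk
""").fetchall()

# Step 2: overwrite only the paired rows; every other column is left untouched.
conn.executemany("UPDATE table_b SET field_1 = ? WHERE pk = ?", pairs)

result = conn.execute(
    "SELECT pk, field_1, other FROM table_b ORDER BY pk"
).fetchall()
print(result)
```

Rows of table_b beyond the number of rows in table_a have no partner in the join, so they keep their old values.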
