Hive join understanding issue - sql

I have created two tables in Hive as shown below:
create table test1(id string);
create table test2(id string);
test1 has values as given below
1
1
test2 has values as given below
1
1
When I join these two tables, I get this output:
1
1
1
1
This is the query I used:
select a.id from test1 a,test2 b where a.id=b.id;
Please help. I expected the output to be:
1
1
I am using the Cloudera distribution.

It is better to use ANSI join syntax:
select a.id
from test1 a
inner join test2 b on a.id=b.id
Your expected output cannot be the result of this join, because for each row in a, all matching rows from b are selected. The first row in a has two matching rows in b, and so does the second row, so the join returns four rows in total.
You can, for example, apply distinct to the second table before joining:
select a.id
from test1 a
inner join (select distinct b.id from test2 b) b on a.id=b.id
In this case each row in table a has a single matching row in table b.
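The row counts described above can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example is self-contained; the join semantics shown are standard SQL), with the same test1 and test2 tables from the question.

```python
import sqlite3

# In-memory SQLite database, used here as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table test1(id text);
    create table test2(id text);
    insert into test1 values ('1'), ('1');
    insert into test2 values ('1'), ('1');
""")

# Plain inner join: each of the 2 rows in test1 matches both rows in test2.
plain_join = conn.execute(
    "select a.id from test1 a inner join test2 b on a.id = b.id"
).fetchall()

# Deduplicate test2 first: each row in test1 now matches exactly one row.
dedup_join = conn.execute(
    "select a.id from test1 a "
    "inner join (select distinct id from test2) b on a.id = b.id"
).fetchall()

print(len(plain_join))  # 4
print(len(dedup_join))  # 2
conn.close()
```

The first query returns 2 x 2 = 4 rows, and the second returns 2 x 1 = 2 rows, which is the output the question expected.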
See this lesson to understand JOINS better: https://www.coursera.org/learn/analytics-mysql/lecture/kydcf/joins-with-many-to-many-relationships-and-duplicates

Related

Almost equal table with different running time

I’m using oracle. I have two table A and B
Table A has 8000 rows and 5 columns
Table B has 5500 rows and same 5 columns
All of 5500 rows in table B are contained in Table A and they are the same
I have a query like
With t1 as (select distinct(id) from table A/B)
,T2 as (select a.id, c.value, d.value from Table A/B a
Join table c c on “conditions”
Join table d d on “conditions”
) select * from t2
The query with Table A runs fine, but with Table B it hangs indefinitely.
Data types and other properties are the same in table A and table B.
Where should I look for the problem?
I compared the explain plans, but the only difference is the row “PX PARTITION HASH JOIN-FILTER” for Table A versus “PX BLOCK ITERATOR ADAPTIVE” for Table B.

Duplicate rows in left join

I have 2 tables. One column in the first table holds about 200000 values in total, of which about 100000 are null and the rest are integers. The second table has only integer values in that column. When I left join on this column, I get a lot of duplicate rows. Is it OK to use a left join here?
Table 1:
Column 1
2
3
5
null
null
Table 2:
Column 1
1
2
3
so on
Your example is really odd. Why would anyone have null values in an ID field? But anyway.
If you need fields from table 2 in the result set, as you say above, then you must use an INNER JOIN, not a LEFT JOIN.
Something like:
SELECT DISTINCT a.id, a.name, b.someOtherField
FROM Table1 a
INNER JOIN Table2 b ON a.id = b.id
Please note: since only the ID field of table 1 has null values, no records from table 1 with id IS NULL will be selected, because they have no equivalent in table 2 (and a NULL never compares equal in a join condition anyway). Adding the DISTINCT keyword helps in case this query still produces duplicates.
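To see how null keys behave in each join type, here is a minimal sketch using Python's sqlite3 module. The table names t1 and t2 and the single id column are hypothetical, modelled on the sample data in the question; SQLite is only a stand-in for whatever database the asker uses.

```python
import sqlite3

# Hypothetical tables t1/t2 mirroring the sample values in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t1(id integer);
    create table t2(id integer);
    insert into t1 values (2), (3), (5), (null), (null);
    insert into t2 values (1), (2), (3);
""")

# INNER JOIN: rows with a null id never match, so they are dropped.
inner_rows = conn.execute(
    "select a.id, b.id from t1 a inner join t2 b on a.id = b.id "
    "order by a.id"
).fetchall()

# LEFT JOIN: every t1 row is kept; unmatched rows get NULL on the right.
left_rows = conn.execute(
    "select a.id, b.id from t1 a left join t2 b on a.id = b.id "
    "order by a.id"
).fetchall()

print(inner_rows)      # [(2, 2), (3, 3)]
print(len(left_rows))  # 5
conn.close()
```

With unique values in t2, the left join returns exactly one row per t1 row. Extra rows would only appear if a key were repeated in t2, not because of the nulls.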

How does JOIN work when it sees two tables with duplicate entries

I have two tables. Say A and B.
Table A:
Id value
1 A
2 B
1 C
Table B:
Id value
2 AA
2 BB
4 CC
Now I write a simple left join
select A.Id
,A.Value
,B.Id
,B.Value
from A
left join B
on A.Id = b.Id
This shows me multiple entries. Why so?
When you left join two tables together in SQL, you are asking the database to look at all the values in the first table's specified column and find any and all that match in the specified column of your second table.
This means that if the database finds two rows of data with ID = 2 as per your example, it will bring both back.
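The behaviour can be reproduced with the exact tables from the question. The sketch below uses Python's sqlite3 module as a convenient stand-in database (an assumption; the question does not say which engine is used), and adds an ORDER BY only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table A(Id integer, Value text);
    create table B(Id integer, Value text);
    insert into A values (1, 'A'), (2, 'B'), (1, 'C');
    insert into B values (2, 'AA'), (2, 'BB'), (4, 'CC');
""")

# Same left join as in the question, with a deterministic sort order.
rows = conn.execute("""
    select A.Id, A.Value, B.Id, B.Value
    from A
    left join B on A.Id = B.Id
    order by A.Value, B.Value
""").fetchall()

for row in rows:
    print(row)
# (1, 'A', None, None)
# (2, 'B', 2, 'AA')
# (2, 'B', 2, 'BB')
# (1, 'C', None, None)
conn.close()
```

The row with Id = 2 in A appears twice because it matches two rows in B, while the rows with Id = 1 appear once each with NULLs, since B has no Id = 1.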

Hive Query is not working as expected

I am trying a left join in a Hive query, but it does not seem to work. It returns columns only from the left table:
create table mb.spt_new_var as select distinct customer_id ,target from mb.spt_201603 A
left outer join mb.temp B
on (A.customer_id=B.cust_id);
I tried selecting a few records from table B based on some random customer_id values from table A, and it returns some records. But when I run the left join on table A, it returns only columns from table A. The data type of both IDs is the same (int). What could be the reason for this?
Sample Table A:
Customer_account_id target
12356 1
34245 0
12356 1
.... ..
Sample Table B:
Cust_id col1 col2 col3
12356 ..
12567 ..
24426 ..
...
Table A has about 1 million records, while table B has about 30 million. Both tables may contain some duplicate IDs.
I'm a bit confused. Hive is returning the columns that you specify in the query:
select distinct a.customer_id, a.target
from mb.spt_201603 a left outer join
mb.temp b
on a.customer_id = b.cust_id;
If you want columns from the second table, you need to select them:
select distinct a.customer_id, a.target, b.col1, b.col2
from mb.spt_201603 a left outer join
mb.temp b
on a.customer_id = b.cust_id;
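As a quick check that the select list, not the join, determines which columns come back, here is a small sketch in Python's sqlite3 module. The table names spt_201603 and temp_b, the reduced column set, and the sample values are simplified, hypothetical stand-ins for the Hive tables in the question.

```python
import sqlite3

# Simplified, hypothetical versions of the two tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table spt_201603(customer_id integer, target integer);
    create table temp_b(cust_id integer, col1 text, col2 text);
    insert into spt_201603 values (12356, 1), (34245, 0), (12356, 1);
    insert into temp_b values (12356, 'x', 'y'), (12567, 'p', 'q');
""")

# Only columns from a are listed, so only those columns are returned.
narrow = conn.execute("""
    select distinct a.customer_id, a.target
    from spt_201603 a left outer join temp_b b
    on a.customer_id = b.cust_id
""")
narrow_cols = len(narrow.description)

# Listing b.col1 and b.col2 adds them to the result.
wide = conn.execute("""
    select distinct a.customer_id, a.target, b.col1, b.col2
    from spt_201603 a left outer join temp_b b
    on a.customer_id = b.cust_id
""")
wide_cols = len(wide.description)
wide_rows = sorted(wide.fetchall(), key=lambda r: r[0])

print(narrow_cols, wide_cols)  # 2 4
conn.close()
```

Customers without a match in the second table still appear in the wider result, with NULL in the columns taken from b.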

left join on MS SQL 2008 R2

I'm trying to left join two tables. Table A contains 100 unique records with field_a_1, field_a_2, field_a_3. The combination of field_a_1 and field_a_2 is unique.
Table B has multi-million records with multiple fields. field_b_1 is same as field_a_1 and field_b_2 is same as field_a_2.
I join the two tables together like this:
select a.*, b.*
from a
left join b
on field_a_1 = field_b_1
and field_a_2 = field_b_2
Instead of getting 100 records, I get multi-million records. Why is this?
Because table B has multiple rows for each table A entry.
For example:
TableA (ID)
1
2
3
TableB (ID, data)
1 hello
1 world
1 foo
1 bar
2 data
2 words
2 more
3 words
3 boring
If you left join from TableA to TableB, you will get a row for every TableB record that matches a TableA record, i.e. all of them.
Can you explain what results you are looking for?
Because a left join returns all of the rows from the first table + all of the matching rows from the second table. Which of the millions of matching rows did you expect to get?
Whether you use a left join or an inner join doesn't really make a difference here. A JOIN returns all rows that match the join condition, so if table b has millions of rows that match the JOIN criteria, all of those rows will be returned.
Depending on what you wish to accomplish, you should consider using the DISTINCT keyword, or GROUP BY with aggregate functions.
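A minimal sketch of the GROUP BY suggestion, using the TableA and TableB sample data from the earlier answer. It runs on Python's sqlite3 module rather than SQL Server (an assumption made so the example is self-contained), and count is just one possible aggregate; which one fits depends on what the asker actually needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table TableA(id integer);
    create table TableB(id integer, data text);
    insert into TableA values (1), (2), (3);
    insert into TableB values
        (1, 'hello'), (1, 'world'), (1, 'foo'), (1, 'bar'),
        (2, 'data'), (2, 'words'), (2, 'more'),
        (3, 'words'), (3, 'boring');
""")

# Plain left join: one output row per matching TableB row.
joined = conn.execute(
    "select a.id, b.data from TableA a left join TableB b on a.id = b.id"
).fetchall()

# GROUP BY collapses the matches back to one row per TableA record.
counts = conn.execute("""
    select a.id, count(b.id)
    from TableA a left join TableB b on a.id = b.id
    group by a.id
    order by a.id
""").fetchall()

print(len(joined))  # 9
print(counts)       # [(1, 4), (2, 3), (3, 2)]
conn.close()
```

The plain join returns 9 rows for 3 records in TableA, while the grouped query returns one row per TableA record together with the number of matches.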