How to take distinct values in hive join - hive

I need to take the distinct values from Table 2 while joining with Table 1 in Hive. Because the table 2 has duplicate records.
Considering below join condition is it possible to take only distinct key_col from table 2? i dont want to use select distinct * from ...
select * from Table_1 a left join Table_2 b on a.key_col = b.key_col
Note: This is in Hive

Use Left semi join. This will give you all the record in table1 which exist in table2(duplicate record) without duplicates.
select a.* from Table_1 a left semi join Table_2 b on a.key_col = b.key_col

Related

Is there any alternative way of achieving below logic in sql?

I have two tables -> tb1 and tb2.
I am performing left join operation on these tables using ID column and also i have one more condition such as one column is not equal to other column .
Below is sample code
select * from tb1 LEFT JOIN tb2 ON tb1.id=tb2.id AND tb1.pid!=tb2.pid;
I am able to get results from above query.
But i need to know is there any alternate ways to get same result using sql.?
The actually SQL standard uses <> instead of !=.
select * from tb1 LEFT JOIN tb2 ON tb1.id=tb2.id AND tb1.pid<>tb2.pid;
It seems to you not equal not working because of your join and join condition.
if we create two tables and create like your query
create table t1(id int,pid int);
create table t2 (id int,pid int );
insert into t1 values(1,2),(2,3),(3,4);
insert into t2 values(1,2),(2,3),(3,4);
select t1.* from t1 left join
t2 on t1.id=t2.id and
t1.pid!=t2.pid
order by t1.id
id pid
1 2
2 3
3 4
It returns all the values of 1st table, because LEFT JOIN returns all records from the left table (table1), and the matched records from the right table (table2). The result is NULL from the right side, if there is no match.
But if you put inner join in the same it will not return any row. so i think problem is not in the "not equal operator"

Hive Query is not working as expected

I am trying a left join in Hive Query, but it does not seem to work. It returns me columns only from left table:
create table mb.spt_new_var as select distinct customer_id ,target from mb.spt_201603 A
left outer join mb.temp B
on (A.customer_id=B.cust_id);
I tried selecting few records from table B based on the some random customer_id from table A and it returns some records. But if I try the left join on table A, it returns me only columns from table A. The data-type of both the IDs is same(int). what could be the possible reason behind this?
Sample Table A:
Customer_account_id target
12356 1
34245 0
12356 1
.... ..
Sample Table B:
Cust_id col1 col2 col3
12356 ..
12567 ..
24426 ..
...
Table A has some 1m records, while table B has some 30m records. There is possibility of some duplicate IDs in table A and Table B.
I'm a bit confused. Hive is returning the columns that you specify in the query:
select distinct a.customer_id, a.target
from mb.spt_201603 a left outer join
mb.temp b
on a.customer_id = b.cust_id;
If you want columns from the second table, you need to select them:
select distinct a.customer_id, a.target, b.col1, b.col2
from mb.spt_201603 a left outer join
mb.temp b
on a.customer_id = b.cust_id;

Count rows after joining three tables in PostgreSQL

Suppose I have three tables in PostgreSQL:
table1 - id1, a_id, updated_by_id
table2 - id2, a_id, updated_by_id
Users - id, display_name
Suppose I am using the using the following query:
select count(t1.id1) from table1 t1
left join table2 t2 on (t1.a_id=t2.a_id)
full outer join users u1 t1.updated_by_id=u1.id)
full outer join users u2 t2.updated_by_id=u2.id)
where u1.id=100;
I get 50 as count.
Whereas with:
select count(t1.id1) from table1 t1
left join table2 t2 on (t1.a_id=t2.a_id)
full outer join users u1 t1.updated_by_id=u1.id)
full outer join users u2 t2.updated_by_id=u2.id)
where u2.id=100;
I get only 25 as count.
What is my mistake in the second query? What can I do to get the same count?
My requirement is that there is a single user table, referenced by multiple tables. I want to take the complete list of users and get the count of ids from different tables.
But the table on which I have joined alone returns the proper count but rest of them don't return the proper count. Can anybody suggest a way to modify my second query to get the proper count?
To simplify your logic, aggregate first, join later.
Guessing missing details, this query would give you the exact count, how many times each user was referenced in table1 and table2 respectively for all users:
SELECT *
FROM users u
LEFT JOIN (
SELECT updated_by_id AS id, count(*) AS t1_ct
FROM table1
GROUP BY 1
) t1 USING (id)
LEFT JOIN (
SELECT updated_by_id AS id, count(*) AS t2_ct
FROM table2
GROUP BY 1
) t2 USING (id);
In particular, avoid multiple 1-n relationships multiplying each other when joined together:
Two SQL LEFT JOINS produce incorrect result
To retrieve a single or few users only, LATERAL joins will be faster (Postgres 9.3+):
SELECT *
FROM users u
LEFT JOIN LATERAL (
SELECT count(*) AS t1_ct
FROM table1
WHERE updated_by_id = u.id
) ON true
LEFT JOIN LATERAL (
SELECT count(*) AS t2_ct
FROM table2
WHERE updated_by_id = u.id
) ON true
WHERE u.id = 100;
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Explain perceived difference
The particular mismatch you report is due to the specifics of a FULL OUTER JOIN:
First, an inner join is performed. Then, for each row in T1 that does
not satisfy the join condition with any row in T2, a joined row is
added with null values in columns of T2. Also, for each row of T2 that
does not satisfy the join condition with any row in T1, a joined row
with null values in the columns of T1 is added.
So you get NULL values appended on the respective other side for missing matches. count() does not count NULL values. So you can get a different result depending on whether you filter on u1.id=100 or u2.id=100.
This is just to explain, you don't need a FULL JOIN here. Use the presented alternatives instead.

Query from table based on some condition

I have a scenario like this, there are 2 tables table_1 and table_2. Both table have a common column called column_1(no foreign_Key constraints!!). Table_1 can have some extra rows which are not present in table_2(In other words, table_2 is a sub-set of table_1). I want to list all those items which are only present in table_1 but not in table_2.
Kindly help in writing the sql query for the same.
Thanks in advance.
SELECT a.*
FROM table1 a
LEFT JOIN table2 b
on a.column_1 = b.column_1
WHERE b.column_1 IS NULL
if those two tables are not related with each other, better add an index on table1.column_1 and table2.column_1 so that it won't require full table scan (which slows the performance)
select * from table1
inner join table2 on table1.column1=table2.column1
select a.* from table1 a left outer join table2 b on a.col1=b.col1;

Outer Join with Where returning Nulls

Hi I have 2 tables. I want to list
all records in table1 which are present in
table2
all records in table2 which are not present in table1 with a where condition
Null rows will be returned by table1 in second condition but I am unable to get the query working correctly. It is only returning null rows
SELECT
A.CLMSRNO,A.CLMPLANO,A.GENCURRCODE,A.CLMNETLOSSAMT,
A.CLMLOSSAMT,A.CLMCLAIMPRCLLOSSSHARE
FROM
PAKRE.CLMCLMENTRY A
RIGHT OUTER JOIN (
SELECT
B.CLMSRNO,B.UWADVICETYPE,B.UWADVICENO,B.UWADVPREMCURRCODE,
B.GENSUBBUSICLASS,B.UWADVICENET,B.UWADVICEKIND,B.UWADVYEAR,
B.UWADVQTR,B.ISMANUAL,B.UWCLMNOREFNO
FROM
PAKRE.UWADVICE B
WHERE
B.ISMANUAL=1
) r
ON a.CLMSRNO=r.CLMSRNO
ORDER BY
A.CLMSRNO DESC;
Which OS are you using ?
Table aliases are case sensistive on some platforms, which is why your join condition ON a.CLMSRNO=r.CLMSRNO fails.
Try with A.CLMSRNO=r.CLMSRNO and see if that works
I'm not understanding your first attempt, but here's basically what you need, I think:
SELECT *
FROM TABLE1
INNER JOIN TABLE2
ON joincondition
UNION ALL
SELECT *
FROM TABLE2
LEFT JOIN TABLE1
ON joincondition
AND TABLE1.wherecondition
WHERE TABLE1.somejoincolumn IS NULL
I think you may want to remove the subquery and put its columns into the main query e.g.
SELECT A.CLMSRNO, A.CLMPLANO, A.GENCURRCODE, A.CLMNETLOSSAMT,
A.CLMLOSSAMT, A.CLMCLAIMPRCLLOSSSHARE,
B.CLMSRNO, B.UWADVICETYPE, B.UWADVICENO, B.UWADVPREMCURRCODE,
B.GENSUBBUSICLASS, B.UWADVICENET, B.UWADVICEKIND, B.UWADVYEAR,
B.UWADVQTR, B.ISMANUAL, B.UWCLMNOREFNO
FROM PAKRE.CLMCLMENTRY A
RIGHT OUTER JOIN PAKRE.UWADVICE B
ON A.CLMSRNO = B.CLMSRNO
WHERE B.ISMANUAL = 1
ORDER
BY A.CLMSRNO DESC;