Join two tables using HiveQL

These are the two tables:
-- Table1 is the MAIN table against which comparisons need to be made
CREATE EXTERNAL TABLE IF NOT EXISTS Table1
(
ITEM_ID BIGINT,
CREATED_TIME STRING,
BUYER_ID BIGINT
);
CREATE EXTERNAL TABLE IF NOT EXISTS Table2
(
USER_ID BIGINT,
PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
);
BUYER_ID and USER_ID refer to the same thing.
I need to find all the BUYER_IDs in Table1 that are not present in Table2, along with their total count. I think this calls for a LEFT OUTER JOIN. I am new to HiveQL, so I am having trouble working out the actual syntax. I wrote the query below. Can anyone tell me whether it is the right way to achieve this?
SELECT COUNT(BUYER_ID), BUYER_ID
FROM Table1 dw
LEFT OUTER JOIN Table2 dps ON (dw.BUYER_ID = dps.USER_ID)
GROUP BY BUYER_ID;

If I understand your requirements correctly, I think you are almost there. It seems you only need to add a condition checking if there's no match between the two tables:
SELECT COUNT(BUYER_ID), BUYER_ID
FROM Table1 dw
LEFT OUTER JOIN Table2 dps ON (dw.BUYER_ID = dps.USER_ID)
WHERE dps.USER_ID IS NULL
GROUP BY BUYER_ID;
The above will filter out BUYER_IDs that do have matches in Table2, and will show the remaining BUYER_IDs and their corresponding count values. (Well, that's what I understand you want.)
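To check the behaviour of this anti-join pattern on a small data set, here is a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for Hive, drops the PURCHASED_ITEM column because SQLite has no ARRAY type, and uses invented sample rows.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ITEM_ID INTEGER, CREATED_TIME TEXT, BUYER_ID INTEGER);
    CREATE TABLE Table2 (USER_ID INTEGER);
    INSERT INTO Table1 VALUES (10, '2012-07-01', 1), (11, '2012-07-02', 1),
                              (12, '2012-07-03', 2), (13, '2012-07-04', 3);
    INSERT INTO Table2 VALUES (2);
""")

# Keep only Table1 rows that found no partner in Table2, then count per buyer.
unmatched = conn.execute("""
    SELECT COUNT(dw.BUYER_ID), dw.BUYER_ID
    FROM Table1 dw
    LEFT OUTER JOIN Table2 dps ON (dw.BUYER_ID = dps.USER_ID)
    WHERE dps.USER_ID IS NULL
    GROUP BY dw.BUYER_ID
    ORDER BY dw.BUYER_ID
""").fetchall()

print(unmatched)
```

In this sample, buyer 2 exists in Table2 and is dropped, while buyers 1 and 3 remain with their row counts from Table1.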

Related

Left Join two tables with one common column and other diff columns

I have two tables: the main table has 10+ columns and the second table has 3 columns, with one field in common. My problem is that I cannot get the exact count with a left outer join from the main table; I am seeing a higher count than the actual one. It might be because one of the fields I am trying to get is only in the second table, not in the main table.
Table 1: master_table
Table 2: manager_table
master_table:
ID,
Column1,
Column2,
...
Column10
manager_table:
ID,
Column2_different,
Column3_different
I am trying to use a LEFT JOIN to get the same records as are present in the master table.
Select table1.columns, table2.columns
From table1
Left join table2 on table1.ID = table2.ID
The above does not give me the same record count as the master table (table1); it gives me a higher count, as the other field from table 2 is not present in table 1.
Can someone help me here?
TIA
An INNER JOIN may be a better fit than a LEFT JOIN, though I would need some sample data to be sure. A LEFT JOIN returns every row from the left table plus the matching rows from the right table, so a higher count than expected usually means that some IDs match more than one row in manager_table. An INNER JOIN only returns rows whose ID appears in both tables; it drops unmatched master rows but does not remove those duplicate matches.
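To see where extra rows in a join can come from, here is a small sketch. It assumes SQLite via Python's sqlite3 module, a cut-down version of the two tables, and invented rows in which ID 1 appears three times in manager_table and ID 3 not at all. The de-duplicating subquery at the end is one possible workaround, not something taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master_table (ID INTEGER, Column1 TEXT);
    CREATE TABLE manager_table (ID INTEGER, Column2_different TEXT);
    INSERT INTO master_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO manager_table VALUES (1, 'x'), (1, 'y'), (1, 'w'), (2, 'z');
""")

def count(sql):
    """Run a SELECT COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]

master_rows = count("SELECT COUNT(*) FROM master_table")
left_rows = count("""
    SELECT COUNT(*) FROM master_table m
    LEFT JOIN manager_table g ON m.ID = g.ID
""")
inner_rows = count("""
    SELECT COUNT(*) FROM master_table m
    INNER JOIN manager_table g ON m.ID = g.ID
""")
# Collapse manager_table to one row per ID before joining.
dedup_rows = count("""
    SELECT COUNT(*) FROM master_table m
    LEFT JOIN (SELECT ID, MIN(Column2_different) AS Column2_different
               FROM manager_table GROUP BY ID) g ON m.ID = g.ID
""")

print(master_rows, left_rows, inner_rows, dedup_rows)
```

Only the join against the de-duplicated subquery returns the same number of rows as master_table in this sample; both plain joins repeat the master row for ID 1 once per matching manager row.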

How to map each distinct value of a column in one table with each distinct value of a column in another table in Hive

I have two tables in Hive, Table1 and Table2. I want to get each distinct customerID in Table1 and map it to each distinct value in a column called category of Table2. However, I am a bit lost on how to do this in Hive. A better example of what I am trying to do is the following: let's say Table1 contains 5 distinct customerIDs and Table2 contains 3 distinct categories. I want my query result to look something like the following:
However, Table1 and Table2 do not have any columns in common, so I am a bit lost on how to perform a join on these two tables in Hive. Is this task possible in Hive? Any insights on this would be greatly appreciated!
You can do that with a cross join of distinct values from both tables.
select t1.customerID, t2.category
from (select distinct customerID from Table1) t1
cross join (select distinct category from Table2) t2;
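As a quick check of the cross-join idea, the sketch below pairs every distinct customer with every distinct category. It assumes SQLite via Python's sqlite3 module instead of Hive, uses the table and column names from the question, and the rows (including the deliberate duplicates) are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (customerID INTEGER);
    CREATE TABLE Table2 (category TEXT);
    INSERT INTO Table1 VALUES (1), (1), (2), (3), (4), (5);
    INSERT INTO Table2 VALUES ('books'), ('books'), ('games'), ('music');
""")

# DISTINCT removes the duplicates first; CROSS JOIN then builds every pairing.
pairs = conn.execute("""
    SELECT t1.customerID, t2.category
    FROM (SELECT DISTINCT customerID FROM Table1) t1
    CROSS JOIN (SELECT DISTINCT category FROM Table2) t2
    ORDER BY t1.customerID, t2.category
""").fetchall()

print(len(pairs))
for customer_id, category in pairs[:3]:
    print(customer_id, category)
```

With 5 distinct customers and 3 distinct categories the result has 5 x 3 = 15 rows, so the output grows quickly as either side gets larger.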

Creating a table out of two tables in SQL

I'm trying to create a table based off of two tables. For example, I have a column in one table called Customer_ID and a column in another table called Debit_Card_Number. How can I make it so I can get the Customer_ID column from one table and the Debit_card_number from the other table and make a table? Thanks
Assuming the two tables are named TableOne and TableTwo, and Customer_ID is a common attribute:
CREATE TABLE NEW_TABLE_NAME AS (
SELECT
TableOne.Customer_ID,
TableTwo.Debit_Card_Number
FROM
TableOne,
TableTwo
WHERE
TableOne.Customer_ID = TableTwo.Customer_ID
);
Look into using a join. Use a LEFT JOIN to get the id even if there isn't a matching card number for that id. The value to match on will probably be the id, assuming that value is also in the table with the card number.
create table joined_table as (
select t1.customer_id, t2.debit_card_number
from t1
left join t2
on t1.matchValue = t2.matchValue
);
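Here is a small runnable sketch of building a table from a join. It assumes SQLite via Python's sqlite3 module, assumes customer_id is the shared matching column, and uses invented rows and card numbers. SQLite's CREATE TABLE ... AS SELECT is written without the surrounding parentheses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (customer_id INTEGER, name TEXT);
    CREATE TABLE t2 (customer_id INTEGER, debit_card_number TEXT);
    INSERT INTO t1 VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO t2 VALUES (1, '4000-0000-0000-0001');

    -- LEFT JOIN keeps customer 2 even though no card is on file.
    CREATE TABLE joined_table AS
    SELECT t1.customer_id, t2.debit_card_number
    FROM t1
    LEFT JOIN t2 ON t1.customer_id = t2.customer_id;
""")

rows = conn.execute("""
    SELECT customer_id, debit_card_number
    FROM joined_table
    ORDER BY customer_id
""").fetchall()

print(rows)
```

Customer 2 ends up in joined_table with a NULL card number (None in Python); an inner join would leave that customer out.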

Querying multiple tables at once

Let's say I have many tables with different structures that all have one common column. How can I query for all rows from all these tables based on a condition?
Example:
table1:
column1 | column2 | user_id
table2:
columna | columnb | columnc | user_id
...
The condition would be user_id = <some number>. I don't want to query each table individually as there are about 30 tables. There may not be a record for each user_id in each table. What's the best option to do this?
Sounds as if you are looking for a full outer join
select *
from (
select *
from table1 t1
full outer join table2 t2 using (user_id)
full outer join table3 t3 using (user_id)
) t
where user_id = 42;
The using (user_id) syntax will make sure that the common column user_id is only present once in the result. So even though the query uses select *, there will only be a single user_id column in the result on which you can apply the where condition.
You can join the tables on user_id.
An INNER JOIN would be good.
You should set primary keys and foreign keys.
If the tables have a relationship or a common column, such as user_id in your case, you can perform a simple JOIN like
select t1.*, t2.*
from table1 t1 join table2 t2 on t1.user_id = t2.user_id
where t1.user_id = <some number>;
But if no relationship exists, or you can't join them, then there is no other way than querying them individually.
I think that in this case an outer join would serve the purpose better than an inner join, as there may not be a record for each user_id in each table.
On another note, I strongly feel that joining many tables (especially 30) could lock the tables for a long time, which can hamper your DB as well as your application. Restructuring the DB is one option. If you can't change it, group similar data into clusters of 4-5 tables each, which gives 6-7 queries in total, and use multithreading on the application side: each thread fetches the data for its own query, and the results are then combined into the required set of information. This should improve your application's performance.
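If the tables end up being queried one at a time, the loop can live in application code. The sketch below assumes SQLite via Python's sqlite3 module, only two of the tables, invented rows, and a hypothetical helper named rows_for_user; it runs the queries sequentially and leaves out the multithreading mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (column1 TEXT, column2 TEXT, user_id INTEGER);
    CREATE TABLE table2 (columna TEXT, columnb TEXT, columnc TEXT, user_id INTEGER);
    INSERT INTO table1 VALUES ('a', 'b', 42), ('c', 'd', 7);
    INSERT INTO table2 VALUES ('x', 'y', 'z', 7);
""")

# Hypothetical fixed list of table names; never build this from user input.
TABLES = ["table1", "table2"]

def rows_for_user(conn, user_id):
    """Return {table_name: [rows]} for one user_id, querying each table separately."""
    result = {}
    for table in TABLES:
        # The table name comes from the trusted list above;
        # the user_id value is bound as a query parameter.
        cursor = conn.execute(f"SELECT * FROM {table} WHERE user_id = ?", (user_id,))
        result[table] = cursor.fetchall()
    return result

found = rows_for_user(conn, 42)
print(found)
```

Tables with no row for the user simply map to an empty list, so the differing column layouts never need to be merged into one result set.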

How to use sql join where table1 has rows not present in table2

I have two tables, record and share. record has the columns name and id. share has the column id.
I want to find the rows which are present in record but not present in share.
How can I do this?
An SQL LEFT JOIN returns all rows from the left table, even if there are no matches in the right table:
SELECT r.name, r.id
FROM record r LEFT JOIN share s ON r.id = s.id
WHERE s.id IS NULL
You have these tables:
RECORD (ID, NAME)
SHARE (ID, VALUE)
If your SQL engine supports LEFT OUTER JOINs, the best way is:
SELECT RECORD.* FROM RECORD LEFT OUTER JOIN SHARE ON RECORD.ID=SHARE.ID
WHERE SHARE.ID IS NULL
Important:
Place an index on SHARE.ID.
How it works:
The SQL engine scans the whole RECORD table, looking for a matching record in SHARE for each row. If a "linked" record is found in SHARE, the WHERE clause is false, so the row is not included in the result set. If no record is found in SHARE, the WHERE clause is true and RECORD.* is included in the result set. This works thanks to the LEFT OUTER JOIN.
Note:
Another way of doing the same thing is to use WHERE RECORD.ID NOT IN (SELECT ID FROM SHARE).
Be aware that, depending on the SQL engine you are using, this may lead to serious performance issues, because the engine may run (SELECT ID FROM SHARE) once per record in the RECORD table.
select Id from t1 where id not in (select id from t2)
SELECT * from RECORD where ID not in (SELECT DISTINCT ID FROM SHARE);
SELECT DISTINCT ID FROM SHARE - will get all the distinct IDs in the table share.
SELECT * from RECORD where ID not in (SELECT DISTINCT ID FROM SHARE); would display all records whose ID is not in the result of that inner query.
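The two approaches from the answers above can be compared side by side. This sketch assumes SQLite via Python's sqlite3 module and invented rows. It also shows one caveat that follows from standard SQL NULL semantics: if the subquery returns a NULL, NOT IN matches no rows, while the LEFT JOIN form is unaffected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE record (id INTEGER, name TEXT);
    CREATE TABLE share (id INTEGER);
    INSERT INTO record VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
    INSERT INTO share VALUES (2);
""")

LEFT_JOIN_SQL = """
    SELECT r.name, r.id
    FROM record r LEFT JOIN share s ON r.id = s.id
    WHERE s.id IS NULL
    ORDER BY r.id
"""
NOT_IN_SQL = """
    SELECT name, id FROM record
    WHERE id NOT IN (SELECT id FROM share)
    ORDER BY id
"""

via_join = conn.execute(LEFT_JOIN_SQL).fetchall()
via_not_in = conn.execute(NOT_IN_SQL).fetchall()

# Caveat: once share.id contains a NULL, "id NOT IN (...)" is never true.
conn.execute("INSERT INTO share VALUES (NULL)")
via_join_with_null = conn.execute(LEFT_JOIN_SQL).fetchall()
via_not_in_with_null = conn.execute(NOT_IN_SQL).fetchall()

print(via_join, via_not_in)
print(via_join_with_null, via_not_in_with_null)
```

If share.id can be NULL, filtering the subquery with WHERE id IS NOT NULL keeps the NOT IN form in line with the join form.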