Hive: nonequality join not working in hive

I am having an issue with LEFT OUTER JOIN in Hive.
I had all my tables in SQL Server and used Sqoop to migrate them to Hive.
This is the original query from SQL Server, which contains a non-equi LEFT
OUTER JOIN. Both tables contain cartesian data.
SELECT
vss.company_id,vss.shares_ship_id,vss.seatmap_cd,vss.cabin,vss.seat,
vss.seat_loc_dscr, vss.ep_seat AS EPlus_Seat, vss.ep_win_seat,
vss.ep_asle_seat, vss.ep_mid_seat, vss.em_win_seat,
vss.em_mid_seat,vss.em_asle_seat,vss.y_win_seat, vss.y_mid_seat,
vss.y_asle_seat, vss.fj_win_seat, vss.fj_mid_seat,
vss.fj_asle_seat,vss.exit_row, vss.bulkhead_row, vss.eff_dt, vss.disc_dt
FROM rvsed11 zz
LEFT OUTER JOIN rvsed22 vss
ON zz.company_id = vss.company_id
AND zz.shares_ship_id = vss.shares_ship_id
AND zz.report_dt >= vss.eff_dt
AND zz.report_dt < vss.disc_dt;
As far as I know, non-equi joins do not work in Hive (non-equi conditions
work in the WHERE clause, but they cannot be used in the ON clause of a LEFT OUTER JOIN).
Below is the Hive query with the non-equi conditions moved to the WHERE clause.
SELECT
vss.company_id,vss.shares_ship_id,vss.seatmap_cd,vss.cabin,vss.seat,
vss.seat_loc_dscr, vss.ep_seat AS EPlus_Seat, vss.ep_win_seat,
vss.ep_asle_seat, vss.ep_mid_seat, vss.em_win_seat,
vss.em_mid_seat,vss.em_asle_seat,vss.y_win_seat, vss.y_mid_seat,
vss.y_asle_seat, vss.fj_win_seat, vss.fj_mid_seat,
vss.fj_asle_seat,vss.exit_row, vss.bulkhead_row, vss.eff_dt, vss.disc_dt
FROM rvsed11 zz
LEFT OUTER JOIN rvsed22 vss
ON zz.company_id = vss.company_id
AND zz.shares_ship_id = vss.shares_ship_id
WHERE zz.report_dt >= vss.eff_dt AND zz.report_dt < vss.disc_dt;
The original query returns 1162 records on SQL Server, but this Hive query
returns 46240 records.
I tried multiple workarounds to get the same logic, but did not get the same result
on Hive.
Can you please help me identify the issue and get the query working
on Hive with the same result set?
Let me know if you need other details.

Before version 2.2.0, Hive does not allow non-equality comparisons such as <, <=, > or >= between columns of different tables in the ON clause.
Here is an excerpt from the Hive Manual:
Version 2.2.0+: Complex expressions in ON clause
Complex expressions in ON clause are supported, starting with Hive 2.2.0 (see HIVE-15211, HIVE-15251). Prior to that, Hive did not support join conditions that are not equality conditions.
In particular, syntax for join conditions was restricted as follows:
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression
Also see this as an alternative: Non equi Left outer join in hive workaround
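One common way to keep LEFT OUTER JOIN semantics without a range condition in the ON clause is to do the range filtering in an inner-join subquery and then left-join that subquery back to the driving table on a unique key. The sketch below is not Hive itself: it uses Python's sqlite3 with an in-memory database, cut-down stand-ins for rvsed11 and rvsed22, and a hypothetical `id` column on zz that the original tables may not have. It only illustrates that moving the range test to WHERE drops unmatched rows, while the subquery rewrite reproduces the ON-clause result.

```python
import sqlite3

# Hypothetical, cut-down versions of rvsed11 (zz) and rvsed22 (vss).
# The `id` column is an assumption: the rewrite needs a unique key on zz.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE zz  (id INTEGER, company_id TEXT, shares_ship_id INTEGER, report_dt TEXT);
CREATE TABLE vss (company_id TEXT, shares_ship_id INTEGER, seat TEXT,
                  eff_dt TEXT, disc_dt TEXT);
INSERT INTO zz VALUES
  (1, 'A', 10, '2020-01-15'),
  (2, 'A', 10, '2021-06-01'),
  (3, 'B', 20, '2020-01-15');
INSERT INTO vss VALUES
  ('A', 10, '1A', '2020-01-01', '2020-12-31'),
  ('A', 10, '1B', '2020-01-01', '2020-12-31'),
  ('A', 10, '2A', '2019-01-01', '2020-01-01');
""")

# Range conditions inside ON (the SQL Server form; Hive needs 2.2.0+ for this).
ON_RANGE = """
SELECT zz.id, vss.seat
FROM zz
LEFT OUTER JOIN vss
  ON zz.company_id = vss.company_id
 AND zz.shares_ship_id = vss.shares_ship_id
 AND zz.report_dt >= vss.eff_dt
 AND zz.report_dt < vss.disc_dt
ORDER BY zz.id, vss.seat
"""

# Range conditions moved to WHERE: zz rows without a match are filtered out.
WHERE_RANGE = """
SELECT zz.id, vss.seat
FROM zz
LEFT OUTER JOIN vss
  ON zz.company_id = vss.company_id
 AND zz.shares_ship_id = vss.shares_ship_id
WHERE zz.report_dt >= vss.eff_dt AND zz.report_dt < vss.disc_dt
ORDER BY zz.id, vss.seat
"""

# Rewrite: filter the range in an inner join, then left-join back on the key.
WORKAROUND = """
SELECT zz.id, m.seat
FROM zz
LEFT OUTER JOIN (
  SELECT z2.id AS id, v.seat AS seat
  FROM zz z2
  JOIN vss v
    ON z2.company_id = v.company_id
   AND z2.shares_ship_id = v.shares_ship_id
  WHERE z2.report_dt >= v.eff_dt AND z2.report_dt < v.disc_dt
) m ON zz.id = m.id
ORDER BY zz.id, m.seat
"""

on_rows = conn.execute(ON_RANGE).fetchall()
where_rows = conn.execute(WHERE_RANGE).fetchall()
workaround_rows = conn.execute(WORKAROUND).fetchall()

print(on_rows)          # unmatched zz rows are kept, with NULL seat
print(where_rows)       # unmatched zz rows are gone
print(workaround_rows)  # same rows as the ON-clause version
```

If zz has no single unique column, the same idea should work with whatever combination of columns identifies a row, at the cost of a wider join key.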

Related

BigQuery : WITH clause behavior in multiple JOIN conditions

For readability, I have defined "org_location_ext" clause in the query as follows.
This "org_location_ext" is first used to join with the main fact-table "LOCATION_SALES".
It is used in other JOIN conditions as well.
According to the BigQuery documentation : https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#with_clause
The WITH clause contains one or more named subqueries which execute
every time a subsequent SELECT statement references them
I want to know the behavior for this case.
Does this query execute the "org_location_ext" WITH clause multiple times?
Or, when the SELECT query is executed, is a temporary table created for "org_location_ext" and used for all the JOINs?
Basically, after the first JOIN with the fact table, do later joins use that "filtered" result, or do they rerun the WITH clause?
WITH org_location_ext AS (
SELECT *
FROM ORG_LOC_MASTER AS loc_master
JOIN LOC_REGN1 as regn1 ON loc_master.id = regn1.id
JOIN ...
JOIN ...
)
SELECT
..
org_location_ext.store_class,
org_location_ext.country,
org_location_ext.
..
..
FROM LOCATION_SALES AS sales
JOIN org_location_ext ON org_location_ext.area_id = sales.area_id AND org_location_ext.date = sales.date
JOIN ....
JOIN ....
JOIN COUNTRY_VAT AS vat ON vat.key1 =TBL_Y.key1 AND vat.country_code = org_location_ext.country_code

It depends on the query plan. Check the plan for your query: it shows how many times each table is accessed.

Hive - Multiple sub-queries in where clause is failing

I am trying to create a table by checking two sub-query expressions within the where clause but my query fails with the below error :
Unsupported sub query expression. Only 1 sub query expression is
supported
Code snippet is as follows (Not the exact code. Just for better understanding) :
Create table winners row format delimited fields terminated by '|' as
select
games,
players
from olympics
where
exists (select 1 from dom_sports where dom_sports.players = olympics.players)
and not exists (select 1 from dom_sports where dom_sports.games = olympics.games)
If I execute the same command with only one sub-query in the where clause, it executes successfully. Is there an alternative way to achieve the same result?

Of course. You can use joins instead.
An inner join acts as EXISTS, and a left join plus a WHERE ... IS NULL filter mimics NOT EXISTS.
The joins can change the row granularity (duplicate rows), depending on your data, which is why the query uses DISTINCT.
select distinct
olympics.games,
olympics.players
from olympics
inner join dom_sports dom_sports on dom_sports.players = olympics.players
left join dom_sports dom_sports2 on dom_sports2.games = olympics.games
where dom_sports2.games is null
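To check that the join rewrite returns the same rows as the EXISTS / NOT EXISTS version, the two forms can be compared on a small data set. The sketch below uses Python's sqlite3 (which accepts both subqueries, unlike the Hive version in the question) with made-up sample rows; the aliases d1 and d2 are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE olympics   (games TEXT, players TEXT);
CREATE TABLE dom_sports (games TEXT, players TEXT);
-- Made-up sample data.
INSERT INTO olympics VALUES
  ('archery', 'ann'), ('judo', 'bob'), ('rowing', 'cat'), ('archery', 'dan');
INSERT INTO dom_sports VALUES
  ('judo', 'ann'), ('chess', 'cat'), ('chess', 'cat');
""")

# Original intent: two subquery predicates in one WHERE clause.
EXISTS_FORM = """
SELECT o.games, o.players
FROM olympics o
WHERE EXISTS (SELECT 1 FROM dom_sports d WHERE d.players = o.players)
  AND NOT EXISTS (SELECT 1 FROM dom_sports d WHERE d.games = o.games)
ORDER BY o.games, o.players
"""

# Join rewrite: inner join for EXISTS, left join + IS NULL for NOT EXISTS.
JOIN_FORM = """
SELECT {distinct} o.games, o.players
FROM olympics o
INNER JOIN dom_sports d1 ON d1.players = o.players
LEFT JOIN dom_sports d2 ON d2.games = o.games
WHERE d2.games IS NULL
ORDER BY o.games, o.players
"""

exists_rows = conn.execute(EXISTS_FORM).fetchall()
join_rows = conn.execute(JOIN_FORM.format(distinct="DISTINCT")).fetchall()
# Without DISTINCT the duplicate ('chess', 'cat') row multiplies the output.
join_rows_no_distinct = conn.execute(JOIN_FORM.format(distinct="")).fetchall()

print(exists_rows)
print(join_rows)
print(len(join_rows_no_distinct))
```

Note that DISTINCT also collapses genuine duplicate rows in olympics, which the EXISTS form would keep, so the two forms are only equivalent when olympics has no duplicate (games, players) pairs.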

RedShift SQL subquery with Inner join

I am using AWS Redshift SQL. I want to inner join a sub-query that has a GROUP BY and an inner join inside it. When I join the sub-query in the outer query, I get an error that a column does not exist.
Query:
SELECT si.package_weight
FROM "packageproduct" ub "clearpathpin" cp ON ub.cpipr_number = cp.pin_number
INNER JOIN "clearpathpin" cp ON ub.cpipr_number = cp.pin_number
INNER JOIN (
SELECT sf."AWB", SUM(up."weight") AS package_weight
FROM "productweight" up ON up."product_id" = sf."item_id"
GROUP BY sf."AWB"
HAVING sf."AWB" IS NOT NULL
) AS si ON si.item_id = ub.order_item_id
LIMIT 100;
Result:
ERROR: column si.item_id does not exist

It's simply because column si.item_id does not exist
Include item_id in the select statement for the table productweight
and it should work.

There are many things wrong with this query.
In your subquery, you have an ON clause, but no JOIN for it to belong to:
FROM "productweight" up ON up."product_id" = sf."item_id"
When you join the results of this subquery, you are referencing a field that does not exist within the subquery:
SELECT sf."AWB", SUM(up."weight") AS package_weight
...
) AS si ON si.item_id = ub.order_item_id
You should imagine the subquery as creating a new, separate, briefly-existing table. The outer query then joins that temporary table to the rest of the query. So anything not explicitly selected in the subquery will not be available to the outer query.
I would recommend when developing you write and run the subquery on its own first. Only after it returns the results you expect (no errors, appropriate columns, etc) then you can copy/paste it in as a subquery and start developing the main query.
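The point about the subquery acting as a temporary table can be reproduced on a tiny example. This is a sketch with Python's sqlite3 rather than Redshift, and the schema is a simplified assumption (the real tables, including the one aliased sf, are not shown in the question): the outer query fails while item_id is missing from the subquery's SELECT list and works once it is exposed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified, hypothetical schema for illustration only.
CREATE TABLE productweight  (product_id INTEGER, weight REAL);
CREATE TABLE packageproduct (order_item_id INTEGER);
INSERT INTO productweight VALUES (1, 2.0), (1, 3.0), (2, 1.5);
INSERT INTO packageproduct VALUES (1), (2), (3);
""")

# The subquery does not select item_id, so the outer ON clause cannot see it.
BROKEN = """
SELECT ub.order_item_id, si.package_weight
FROM packageproduct ub
INNER JOIN (
  SELECT SUM(up.weight) AS package_weight
  FROM productweight up
  GROUP BY up.product_id
) AS si ON si.item_id = ub.order_item_id
"""

# Exposing the grouping column under the name the outer query expects.
FIXED = """
SELECT ub.order_item_id, si.package_weight
FROM packageproduct ub
INNER JOIN (
  SELECT up.product_id AS item_id, SUM(up.weight) AS package_weight
  FROM productweight up
  GROUP BY up.product_id
) AS si ON si.item_id = ub.order_item_id
ORDER BY ub.order_item_id
"""

try:
    conn.execute(BROKEN).fetchall()
    broken_error = None
except sqlite3.OperationalError as exc:
    broken_error = str(exc)  # SQLite's wording differs from Redshift's

fixed_rows = conn.execute(FIXED).fetchall()

print(broken_error)
print(fixed_rows)
```

Running the inner SELECT on its own first, as suggested above, makes the missing column obvious before the outer join is written.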

When to cross or join two tables?

Example 1:
SELECT name
FROM Customer, Order
WHERE Customer.id = Order.cid
Example 2:
SELECT name
FROM Customer JOIN Order
ON Customer.id = Order.cid
What is the difference between these two queries? When should I cross two tables vs JOIN?

Both will give you identical results, so there is no situation where one is required over the other.
The comma-separated join is the ANSI 89 standard join; INNER JOIN is the newer ANSI 92 standard join.
However, the comma-separated syntax is generally discouraged and INNER JOIN is preferred. When you join more than two tables, join conditions buried in the WHERE clause are hard to follow, whereas the INNER JOIN syntax keeps each condition next to the table it applies to.

A CROSS JOIN is a Cartesian product. The result of a CROSS JOIN
between set A and set B is the set of all pairs (a, b) with a from A and b from B.
The WHERE clause then filters this result set.
An INNER JOIN finds the pairs of rows from both tables that satisfy the predicate
after the keyword ON. Those rows go to the result set.
In practice, the SQL engine can choose its own physical implementation of the CROSS JOIN operator.
It doesn't have to build the huge result set and then filter it, so its behaviour will
be similar to that of an INNER JOIN.
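The equivalence of the two examples can be checked on a small data set. This is a sketch using Python's sqlite3 with made-up rows; the table name Order is quoted because ORDER is a reserved word, and the `item` column is an assumption added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (id INTEGER, name TEXT);
CREATE TABLE "Order"  (cid INTEGER, item TEXT);
-- Made-up sample data.
INSERT INTO Customer VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
INSERT INTO "Order"  VALUES (1, 'pen'), (1, 'ink'), (2, 'cup');
""")

# Example 1: comma-separated tables, join condition in WHERE.
COMMA_FORM = """
SELECT name
FROM Customer, "Order"
WHERE Customer.id = "Order".cid
ORDER BY name
"""

# Example 2: explicit JOIN ... ON.
JOIN_FORM = """
SELECT name
FROM Customer JOIN "Order"
  ON Customer.id = "Order".cid
ORDER BY name
"""

# Unfiltered Cartesian product: every Customer row paired with every Order row.
CROSS_COUNT = 'SELECT COUNT(*) FROM Customer CROSS JOIN "Order"'

comma_rows = conn.execute(COMMA_FORM).fetchall()
join_rows = conn.execute(JOIN_FORM).fetchall()
cross_count = conn.execute(CROSS_COUNT).fetchone()[0]

print(comma_rows)
print(join_rows)
print(cross_count)  # 3 customers x 3 orders
```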

HQL: Is it possible to perform an INNER JOIN on a subquery?

The diagram above is a simplified version of the database structure that I use to log item locations through time. I wrote the following SQL query which returns the current item inventory of each location:
select *
from ItemLocationLog l
inner join
(select g.idItem, max(g.dateTime) as latest
from ItemLocationLog g
group by g.idItem)
as i
on l.idItem = i.idItem and l.dateTime = i.latest
The problem I'm having is that I want to convert that to HQL, but I haven't found the syntax to perform an INNER JOIN on a subquery, and it seems like this is not supported. Is there a way to convert the above to HQL (or a Criteria) or will I have to use a standard SQL query in this case? Thanks.

From the Hibernate documentation (http://docs.jboss.org/hibernate/orm/3.3/reference/en/html/queryhql.html#queryhql-subqueries):
Note that HQL subqueries can occur only in the select or where clauses.
You can rewrite the query so that the subquery is part of the where clause instead, referencing l.idItem in the subquery.
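The rewrite can be illustrated in plain SQL: a correlated subquery in the WHERE clause selects the same "latest row per item" as the join on a grouped subquery. The sketch below runs both forms with Python's sqlite3 on made-up rows; the idLocation column is an assumption, since the diagram is not available. An HQL version would presumably look like the comment in the code, but that string is untested and assumes the entity and property names match the table and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ItemLocationLog (
  idItemLocationLog INTEGER, idItem INTEGER, idLocation TEXT, dateTime TEXT);
-- Made-up sample data: item 100 moved from A to B, item 200 stayed in A.
INSERT INTO ItemLocationLog VALUES
  (1, 100, 'A', '2020-01-01'),
  (2, 100, 'B', '2020-02-01'),
  (3, 200, 'A', '2020-01-15');
""")

# Join on a grouped subquery (not expressible in HQL's FROM clause).
JOIN_FORM = """
SELECT l.idItemLocationLog
FROM ItemLocationLog l
INNER JOIN (SELECT g.idItem, MAX(g.dateTime) AS latest
            FROM ItemLocationLog g
            GROUP BY g.idItem) AS i
  ON l.idItem = i.idItem AND l.dateTime = i.latest
ORDER BY l.idItemLocationLog
"""

# Same logic with the subquery moved into WHERE and correlated on idItem.
# Untested HQL assumption:
#   from ItemLocationLog l
#   where l.dateTime = (select max(g.dateTime)
#                       from ItemLocationLog g
#                       where g.idItem = l.idItem)
WHERE_FORM = """
SELECT l.idItemLocationLog
FROM ItemLocationLog l
WHERE l.dateTime = (SELECT MAX(g.dateTime)
                    FROM ItemLocationLog g
                    WHERE g.idItem = l.idItem)
ORDER BY l.idItemLocationLog
"""

join_rows = conn.execute(JOIN_FORM).fetchall()
where_rows = conn.execute(WHERE_FORM).fetchall()

print(join_rows)
print(where_rows)
```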
