Subqueries in hive with where clause - hive

Can we join two tables with a WHERE clause in Hive?
I tried this in SQL and it works, but it does not work in Hive, since Hive doesn't support this kind of subquery in the WHERE clause:
select t.name,t2.addr from trail t
join trail2 t2
on t.name=t2.name
where marks > (select marks from trail where name='sa');

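This query will not work in Hive.
As of Hive 0.13, subqueries in the WHERE clause are supported only with IN/NOT IN or EXISTS/NOT EXISTS, so a scalar comparison such as marks > (subquery) is rejected.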
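One common way around this restriction is to move the scalar subquery into the FROM clause as a one-row derived table and compare against its column. The sketch below only checks the logic of that rewrite, using Python's sqlite3 as a stand-in for Hive. The sample rows and the alias names s and sa_marks are made up, it assumes the subquery returns exactly one row, and it has not been run on Hive (where a cross join may also be blocked by strict-mode settings).

```python
import sqlite3

# SQLite stands in for Hive here; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trail  (name TEXT, marks INTEGER);
    CREATE TABLE trail2 (name TEXT, addr TEXT);
    INSERT INTO trail  VALUES ('sa', 50), ('bob', 70), ('amy', 40);
    INSERT INTO trail2 VALUES ('sa', 'addr1'), ('bob', 'addr2'), ('amy', 'addr3');
""")

# The scalar subquery becomes a one-row derived table in FROM, so the
# WHERE clause only compares two columns and contains no subquery.
rewritten = """
    SELECT t.name, t2.addr
    FROM trail t
    JOIN trail2 t2 ON t.name = t2.name
    CROSS JOIN (SELECT marks AS sa_marks FROM trail WHERE name = 'sa') s
    WHERE t.marks > s.sa_marks
"""
scalar_rows = conn.execute(rewritten).fetchall()
print(scalar_rows)
```

If the subquery could return more than one row, each outer row would be compared against every one of them, so the filter on name needs to be unique for this to match the original intent.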

Related

BigQuery select, join from multiple datasets and avoid name conflicts

Imagine I have several datasets and tables.
Format: dataset.table.field
dataset01.table_xxx.field_z
dataset02.table_xxx.field_z
I tried to write something like
select
dataset01.table_xxx.field_z as dataset01_table_xxx_field_z,
dataset02.table_xxx.field_z as dataset02_table_xxx_field_z
from dataset01.table_xxx
join dataset02.table_xxx on dataset02.table_xxx.field_z = dataset01.table_xxx.field_z
to avoid conflicting names.
BigQuery says that dataset01.table_xxx.field_z is an unrecognized name in the SELECT clause, and it complains about an unrecognized name in the join clause too.
The query works if I remove dataset01 and dataset02 from the SELECT clause and the ON condition.
What is the right way to refer to fields in such a case?
select
t1.field_z as dataset01_table_xxx_field_z,
t2.field_z as dataset02_table_xxx_field_z
from dataset01.table_xxx t1
join dataset02.table_xxx t2
on t2.field_z = t1.field_z
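The aliasing pattern above can be tried out locally. This is only a sketch: it uses Python's sqlite3 with two attached in-memory databases standing in for the two BigQuery datasets, and the inserted values are made up. SQLite resolves names differently from BigQuery, so it shows that the aliased query is unambiguous, not how BigQuery itself behaves.

```python
import sqlite3

# Two attached in-memory databases play the role of dataset01 and dataset02.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS dataset01")
conn.execute("ATTACH DATABASE ':memory:' AS dataset02")
conn.execute("CREATE TABLE dataset01.table_xxx (field_z INTEGER)")
conn.execute("CREATE TABLE dataset02.table_xxx (field_z INTEGER)")
conn.executemany("INSERT INTO dataset01.table_xxx VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO dataset02.table_xxx VALUES (?)", [(2,), (3,)])

# Each table gets a short alias; columns are qualified by alias only.
cur = conn.execute("""
    SELECT t1.field_z AS dataset01_table_xxx_field_z,
           t2.field_z AS dataset02_table_xxx_field_z
    FROM dataset01.table_xxx t1
    JOIN dataset02.table_xxx t2
      ON t2.field_z = t1.field_z
""")
alias_rows = cur.fetchall()
alias_columns = [d[0] for d in cur.description]
print(alias_columns, alias_rows)
```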

Hive: nonequality join not working in hive

I am having an issue with a Hive LEFT OUTER JOIN.
I had all my tables in SQL Server, then used Sqoop to migrate them to
Hive.
This is the original query from SQL Server, which contains a non-equi LEFT
OUTER JOIN. Both tables have cartesian data.
SELECT
vss.company_id,vss.shares_ship_id,vss.seatmap_cd,vss.cabin,vss.seat,
vss.seat_loc_dscr, vss.ep_seat AS EPlus_Seat, vss.ep_win_seat,
vss.ep_asle_seat, vss.ep_mid_seat, vss.em_win_seat,
vss.em_mid_seat,vss.em_asle_seat,vss.y_win_seat, vss.y_mid_seat,
vss.y_asle_seat, vss.fj_win_seat, vss.fj_mid_seat,
vss.fj_asle_seat,vss.exit_row, vss.bulkhead_row, vss.eff_dt, vss.disc_dt
FROM rvsed11 zz
LEFT OUTER JOIN rvsed22 vss
ON zz.company_id = vss.company_id
AND zz.shares_ship_id = vss.shares_ship_id
AND zz.report_dt >= vss.eff_dt
AND zz.report_dt < vss.disc_dt;
As we know, non-equi joins do not work in Hive (non-equi conditions
work in the WHERE clause, but we cannot use them with LEFT OUTER JOIN).
See the Hive query below, with the non-equi condition moved to the WHERE clause.
SELECT
vss.company_id,vss.shares_ship_id,vss.seatmap_cd,vss.cabin,vss.seat,
vss.seat_loc_dscr, vss.ep_seat AS EPlus_Seat, vss.ep_win_seat,
vss.ep_asle_seat, vss.ep_mid_seat, vss.em_win_seat,
vss.em_mid_seat,vss.em_asle_seat,vss.y_win_seat, vss.y_mid_seat,
vss.y_asle_seat, vss.fj_win_seat, vss.fj_mid_seat,
vss.fj_asle_seat,vss.exit_row, vss.bulkhead_row, vss.eff_dt, vss.disc_dt
FROM rvsed11 zz
LEFT OUTER JOIN rvsed22 vss
ON zz.company_id = vss.company_id
AND zz.shares_ship_id = vss.shares_ship_id
WHERE zz.report_dt >= vss.eff_dt AND zz.report_dt < vss.disc_dt;
The original query returns 1162 records on SQL Server, but this Hive query
returns 46240 records.
I tried multiple workarounds to get the same logic, but didn't get the same result
on Hive.
Can you please help me identify the issue and get the query working
on Hive with the same result set?
Let me know if you need other details.
Before version 2.2.0, Hive does not allow comparison operators such as <= or >= in the ON clause to compare columns across tables.
Here is an excerpt from the Hive Manual:
Version 2.2.0+: Complex expressions in ON clause
Complex expressions in ON clause are supported, starting with Hive 2.2.0 (see HIVE-15211, HIVE-15251). Prior to that, Hive did not support join conditions that are not equality conditions.
In particular, syntax for join conditions was restricted as follows:
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression
Also see this as an alternate: Non equi Left outer join in hive workaround
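Note that moving a condition from ON to WHERE changes what a LEFT OUTER JOIN returns: a condition in ON only decides which right-hand rows match, while a condition in WHERE filters the joined result and drops left-hand rows that have no match in range. The sketch below illustrates this on made-up rows, using Python's sqlite3 as a stand-in for Hive and SQL Server, with simplified hypothetical tables zz and vss. It also shows one equality-only workaround: compute the matches in an inner-join subquery and left join them back. That workaround assumes zz is unique on (company_id, ship_id, report_dt) and has not been run on Hive.

```python
import sqlite3

# SQLite stands in for both engines; tables and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zz  (company_id INT, ship_id INT, report_dt TEXT);
    CREATE TABLE vss (company_id INT, ship_id INT,
                      eff_dt TEXT, disc_dt TEXT, seat TEXT);
    INSERT INTO zz VALUES (1, 10, '2020-01-15'),
                          (1, 10, '2020-03-15'),
                          (2, 20, '2020-01-15');
    INSERT INTO vss VALUES (1, 10, '2020-01-01', '2020-02-01', 'A'),
                           (1, 10, '2020-02-01', '2020-03-01', 'B');
""")
cols = "zz.company_id, zz.ship_id, zz.report_dt"
equi = "zz.company_id = vss.company_id AND zz.ship_id = vss.ship_id"
rng = "zz.report_dt >= vss.eff_dt AND zz.report_dt < vss.disc_dt"
order = "ORDER BY zz.company_id, zz.ship_id, zz.report_dt"

# 1) Range condition in ON: every zz row is kept, unmatched ones get NULL.
on_rows = conn.execute(
    f"SELECT {cols}, vss.seat FROM zz LEFT OUTER JOIN vss "
    f"ON {equi} AND {rng} {order}").fetchall()

# 2) Range condition in WHERE: unmatched zz rows are filtered out.
where_rows = conn.execute(
    f"SELECT {cols}, vss.seat FROM zz LEFT OUTER JOIN vss "
    f"ON {equi} WHERE {rng} {order}").fetchall()

# 3) Equality-only ON clauses: inner join plus WHERE finds the matches,
#    then zz is left-joined to those matches on its own key columns.
workaround_rows = conn.execute(f"""
    SELECT {cols}, m.seat
    FROM zz
    LEFT OUTER JOIN (
        SELECT {cols}, vss.seat
        FROM zz JOIN vss ON {equi}
        WHERE {rng}
    ) m
      ON zz.company_id = m.company_id
     AND zz.ship_id = m.ship_id
     AND zz.report_dt = m.report_dt
    {order}""").fetchall()

print(len(on_rows), len(where_rows), len(workaround_rows))
```

On this sample data, queries 1 and 3 return the same rows, while query 2 returns only the single in-range match.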

Querying a Partitioned table in BigQuery using a reference from a joined table

I would like to run a query that partitions table A using a value from table B.
For example:
#standardSQL
select A.user_id
from my_project.xxx A
inner join my_project.yyy B
on A._partitiontime = timestamp(B.date)
where B.date = '2018-01-01'
This query will scan all the partitions in table A and will not take into consideration the date I specified in the where clause (for partitioning purposes). I have tried running this query in several different ways but all produced the same result - scanning all partitions in table A.
Is there any way around it?
Thanks in advance.
With BigQuery scripting (Beta now), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then in subsequent query, scripting variable is used as a filter to prune the partitions to be scanned.
DECLARE date_filter ARRAY<DATETIME>
DEFAULT (SELECT ARRAY_AGG(date) FROM B WHERE ...);
select A.user_id
from my_project.xxx A
inner join my_project.yyy B
on A._partitiontime = timestamp(B.date)
where A._partitiontime IN UNNEST(date_filter)
The doc says this about your use case:
Express the predicate filter as closely as possible to the table
identifier. Complex queries that require the evaluation of multiple
stages of a query in order to resolve the predicate (such as inner
queries or subqueries) will not prune partitions from the query.
The following query does not prune partitions (note the use of a subquery):
#standardSQL
SELECT
t1.name,
t2.category
FROM
table1 t1
INNER JOIN
table2 t2
ON
t1.id_field = t2.field2
WHERE
t1.ts = (SELECT timestamp from table3 where key = 2)

Correlated-subquery in INSERT statement - PostgreSQL

I am trying to populate a table using a query that contains a subquery.
The format is the following:
INSERT INTO table_C
SELECT columns FROM table_A, table_B
The subquery is present in one of the columns of the select statement and it refers to "table_A" again (there is a join between table_A and table_B).
Here is the code, but before reading it please consider that the select statement works perfectly if run alone (i.e. with no INSERT):
INSERT INTO hypercube_2015 (date, hour, name, rel_val)
SELECT t1.date, t1.hour, t2.name,
CAST(sum(t1.num) as float)/(SELECT sum(t11.num) FROM hc_num t11 WHERE t11.date = t1.date AND t11.hour = t1.hour)
FROM hc_num t1, names t2
WHERE date between '2015-01-01' AND '2015-12-31'
AND t1.id = t2.id
GROUP BY t1.date, t1.hour, t2.name
The issue is related to the subquery in the 3rd line, in particular to the WHERE condition. If I change it into the following it works:
SELECT sum(t11.num) FROM hc_num t11 WHERE t11.date = '2015-01-01' AND t11.hour=0
The error message is (I am working on a Redshift db via DBVis):
[Code: 500310, SQL State: XX000] Amazon Invalid operation:
This type of correlated subquery pattern is not supported due to
internal error;
I've got no solution to propose but an answer that explains why you have this error.
On Redshift there are several cases where the optimiser can't resolve a correlated subquery and triggers this error. One of them is precisely your kind of subquery:
Correlated Subquery Patterns That Are Not Supported
The query planner uses a query rewrite method called subquery
decorrelation to optimize several patterns of correlated subqueries
for execution in an MPP environment. A few types of correlated
subqueries follow patterns that Amazon Redshift cannot decorrelate and
does not support. Queries that contain the following correlation
references return errors:
References in a GROUP BY column to the results of a correlated subquery. For example:
select listing.listid,
(select count (sales.listid) from sales where sales.listid=listing.listid) as list
from listing
group by list, listing.listid;
Source : Amazon webservices Correlated Subqueries
In your subquery:
(SELECT sum(t11.num) FROM hc_num t11 WHERE t11.date = t1.date AND t11.hour = t1.hour)
you do make a reference to t1.hour which is present in the final GROUP BY:
GROUP BY t1.date, t1.hour, t2.name
Note that I might have a deeper look at your query later to propose an alternative, if nobody else does. Got no time at the moment.
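One possible alternative, offered as a sketch and not tested on Redshift: compute the per-(date, hour) totals once in a derived table and join to it, so no correlated subquery is left for the planner to decorrelate. The example runs the SELECT part on made-up rows with Python's sqlite3 as a stand-in; the alias names tot and total_num are hypothetical.

```python
import sqlite3

# SQLite stands in for Redshift; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hc_num (id INTEGER, date TEXT, hour INTEGER, num INTEGER);
    CREATE TABLE names  (id INTEGER, name TEXT);
    INSERT INTO hc_num VALUES (1, '2015-01-01', 0, 3),
                              (2, '2015-01-01', 0, 1),
                              (1, '2015-01-01', 1, 5);
    INSERT INTO names VALUES (1, 'a'), (2, 'b');
""")

# The derived table "tot" holds one total per (date, hour); joining to it
# replaces the correlated subquery in the SELECT list.
decorrelated = """
    SELECT t1.date, t1.hour, t2.name,
           CAST(SUM(t1.num) AS float) / tot.total_num AS rel_val
    FROM hc_num t1
    JOIN names t2 ON t1.id = t2.id
    JOIN (SELECT date, hour, SUM(num) AS total_num
          FROM hc_num
          GROUP BY date, hour) tot
      ON tot.date = t1.date AND tot.hour = t1.hour
    WHERE t1.date BETWEEN '2015-01-01' AND '2015-12-31'
    GROUP BY t1.date, t1.hour, t2.name, tot.total_num
    ORDER BY t1.date, t1.hour, t2.name
"""
rel_rows = conn.execute(decorrelated).fetchall()
print(rel_rows)
```

As in the original subquery, the totals are taken over all rows of hc_num for that date and hour, whether or not their id appears in names.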

What does a (+) sign mean in an Oracle SQL WHERE clause? [duplicate]

Possible Duplicate:
Oracle: What does (+) do in a WHERE clause?
Consider the simplified SQL query below, in an Oracle database environment (although I'm not sure that it's Oracle-specific):
SELECT
t0.foo, t1.bar
FROM
FIRST_TABLE t0, SECOND_TABLE t1
WHERE
t0.ID (+) = t1.ID;
What is that (+) notation for in the WHERE clause? I'm sorry if this is an ignorant newbie question, but it's been extremely difficult to search for on Google or StackOverflow... because even when using quote marks, search engines see a '+' sign and seem to want to treat it as some kind of a logical directive.
This is an Oracle-specific notation for an outer join. It means that the query will include all rows from t1, and use NULLs in the t0 columns if there is no corresponding row in t0.
In standard SQL one would write:
SELECT t0.foo, t1.bar
FROM FIRST_TABLE t0
RIGHT OUTER JOIN SECOND_TABLE t1
ON t0.ID = t1.ID;
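To see the effect on data, here is a small sketch using Python's sqlite3 with made-up rows. SQLite does not understand the (+) operator, and older SQLite versions lack RIGHT JOIN, so the query is written as the equivalent LEFT OUTER JOIN with the table order swapped: all rows of SECOND_TABLE are kept, and foo is NULL where FIRST_TABLE has no matching ID.

```python
import sqlite3

# Invented sample rows: ID 2 exists only in SECOND_TABLE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FIRST_TABLE  (ID INTEGER, foo TEXT);
    CREATE TABLE SECOND_TABLE (ID INTEGER, bar TEXT);
    INSERT INTO FIRST_TABLE  VALUES (1, 'f1');
    INSERT INTO SECOND_TABLE VALUES (1, 'b1'), (2, 'b2');
""")

# "t0 RIGHT OUTER JOIN t1" and "t1 LEFT OUTER JOIN t0" keep the same rows:
# everything from t1, with NULLs for t0 columns when there is no match.
outer_rows = conn.execute("""
    SELECT t0.foo, t1.bar
    FROM SECOND_TABLE t1
    LEFT OUTER JOIN FIRST_TABLE t0
      ON t0.ID = t1.ID
    ORDER BY t1.ID
""").fetchall()
print(outer_rows)
```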
Oracle recommends against using this operator if your version supports ANSI joins (LEFT/RIGHT JOIN):
Oracle recommends that you use the FROM clause OUTER JOIN syntax rather than the Oracle join operator. Outer join queries that use the Oracle join operator (+) are subject to the following rules and restrictions […]