Print physical plan of a query in Hive on Spark

I'm using Hive 2.3.7 with Spark 2.0.0 as the execution engine.
I was wondering how I can print the physical plan, to see, for example, which join algorithm Calcite chooses to execute a query.

You can use explain.
In Pyspark:
df = df1.join(df2, 'id')
df.explain()
In Spark SQL / Hive QL:
EXPLAIN SELECT * FROM table1 JOIN table2 ON table1.id = table2.id;
See more details at
http://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
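The general technique is the same in any engine: ask the planner for its plan instead of running the query. As a self-contained sketch of the idea (using SQLite's EXPLAIN QUERY PLAN purely as a stand-in; this is not Hive's or Spark's planner, and the table names are made up):

```python
import sqlite3

# In-memory database standing in for the warehouse (analogy only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE table2 (id INTEGER, val TEXT);
""")

# Ask the engine how it plans to execute the join, without running it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM table1 JOIN table2 ON table1.id = table2.id"
).fetchall()

for row in plan:
    # The last column of each row is a human-readable plan step,
    # e.g. which table is scanned and which index is searched.
    print(row[-1])
```

Hive's EXPLAIN (and EXPLAIN EXTENDED) and Spark's df.explain() print the analogous information: join strategy, scan order, and any pushed-down filters.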

Related

driving_site hint for multiple remote tables

I have a query of the following format. It uses two remote tables and a local table.
SELECT *
FROM table1#db2 t1 INNER JOIN table2#db2 t2 -- two large remote tables on the same DB
ON t1.id = t2.id
WHERE t1.prop = '1'
AND t2.prop = '2'
AND t1.prop2 IN (SELECT val FROM tinylocaltable)
I'm wondering how to properly use the DRIVING_SITE query hint to push the bulk of the work to db2 (i.e. ensure the join and conditions are applied on db2). Most of the examples I see of DRIVING_SITE reference only one remote table. Is SELECT /*+DRIVING_SITE(t1)*/ * sufficient or do I need to list both remote tables (t1 and t2) in the hint? If the latter, what is the proper syntax?
(If you're wondering why this isn't being executed on db2 to start with, it's because this is actually one UNION ALL section of a larger query, where the other UNION ALL sections use the local DB).
The DRIVING_SITE hint instructs the optimizer to execute the query at a different site than that selected by the database
Your query uses
FROM table1#db2 t1 INNER JOIN table2#db2 t2
where both tables are on the same "different site", so
SELECT /*+ DRIVING_SITE(t1)*/
should be fine (in my opinion; I can't find anything in the documentation that suggests otherwise).

Hive: non-equality join not working in Hive

I am having an issue with a Hive LEFT OUTER JOIN.
I had all tables in SQL Server, then used Sqoop to migrate them to Hive.
This is the original query from SQL Server, which contains a non-equi LEFT OUTER JOIN. Both tables have Cartesian data.
SELECT
vss.company_id,vss.shares_ship_id,vss.seatmap_cd,vss.cabin,vss.seat,
vss.seat_loc_dscr, vss.ep_seat AS EPlus_Seat, vss.ep_win_seat,
vss.ep_asle_seat, vss.ep_mid_seat, vss.em_win_seat,
vss.em_mid_seat,vss.em_asle_seat,vss.y_win_seat, vss.y_mid_seat,
vss.y_asle_seat, vss.fj_win_seat, vss.fj_mid_seat,
vss.fj_asle_seat,vss.exit_row, vss.bulkhead_row, vss.eff_dt, vss.disc_dt
FROM rvsed11 zz
LEFT OUTER JOIN rvsed22 vss
ON zz.company_id = vss.company_id
AND zz.shares_ship_id = vss.shares_ship_id
AND zz.report_dt >= vss.eff_dt
AND zz.report_dt < vss.disc_dt;
As we know, non-equi joins do not work in Hive (non-equi conditions work in the WHERE clause, but we cannot use them with a LEFT OUTER JOIN).
See below the Hive query with the non-equi conditions moved to the WHERE clause.
SELECT
vss.company_id,vss.shares_ship_id,vss.seatmap_cd,vss.cabin,vss.seat,
vss.seat_loc_dscr, vss.ep_seat AS EPlus_Seat, vss.ep_win_seat,
vss.ep_asle_seat, vss.ep_mid_seat, vss.em_win_seat,
vss.em_mid_seat,vss.em_asle_seat,vss.y_win_seat, vss.y_mid_seat,
vss.y_asle_seat, vss.fj_win_seat, vss.fj_mid_seat,
vss.fj_asle_seat,vss.exit_row, vss.bulkhead_row, vss.eff_dt, vss.disc_dt
FROM rvsed11 zz
LEFT OUTER JOIN rvsed22 vss
ON zz.company_id = vss.company_id
AND zz.shares_ship_id = vss.shares_ship_id
WHERE zz.report_dt >= vss.eff_dt AND zz.report_dt < vss.disc_dt;
The original query gives 1,162 records on SQL Server, but this Hive query gives 46,240 records.
I tried multiple workarounds to get the same logic, but didn't get the same result in Hive.
Can you please help me identify the issue and get the query working in Hive with the same result set?
Let me know if you need other details.
Before Hive 2.2.0, Hive did not allow the use of <= or >= in the ON clause to compare columns across tables.
Here is an excerpt from the Hive Manual:
Version 2.2.0+: Complex expressions in ON clause
Complex expressions in ON clause are supported, starting with Hive 2.2.0 (see HIVE-15211, HIVE-15251). Prior to that, Hive did not support join conditions that are not equality conditions.
In particular, syntax for join conditions was restricted as follows:
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression
Also see this as an alternate: Non equi Left outer join in hive workaround
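The row-count difference has a second cause worth spelling out: moving a predicate from the ON clause to the WHERE clause changes the meaning of a LEFT OUTER JOIN. The WHERE filter runs after the join, so it discards the NULL-padded unmatched rows (turning the outer join into an inner join), while the equality-only ON condition can multiply matches when keys repeat ("Cartesian data"). A minimal, self-contained illustration of the first effect (SQLite here, only because it accepts non-equi ON conditions; the tables and dates are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zz  (id INTEGER, report_dt TEXT);
    CREATE TABLE vss (id INTEGER, eff_dt TEXT, disc_dt TEXT);
    INSERT INTO zz  VALUES (1, '2020-06-01'), (2, '2020-06-01');
    -- id 2 has no vss row whose [eff_dt, disc_dt) range covers its report_dt
    INSERT INTO vss VALUES (1, '2020-01-01', '2021-01-01'),
                           (2, '2021-01-01', '2022-01-01');
""")

# Non-equi predicate in ON: unmatched left rows are kept, NULL-padded.
on_rows = conn.execute("""
    SELECT zz.id, vss.eff_dt FROM zz
    LEFT OUTER JOIN vss
      ON zz.id = vss.id
     AND zz.report_dt >= vss.eff_dt AND zz.report_dt < vss.disc_dt
""").fetchall()

# Same predicate moved to WHERE: the NULL-padded rows are filtered out,
# so the LEFT OUTER JOIN silently behaves like an INNER JOIN.
where_rows = conn.execute("""
    SELECT zz.id, vss.eff_dt FROM zz
    LEFT OUTER JOIN vss ON zz.id = vss.id
    WHERE zz.report_dt >= vss.eff_dt AND zz.report_dt < vss.disc_dt
""").fetchall()

print(len(on_rows), len(where_rows))  # prints: 2 1
```

The two variants disagree even on this tiny dataset, which is why the Hive rewrite cannot reproduce the SQL Server result by itself.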

Querying a Partitioned table in BigQuery using a reference from a joined table

I would like to run a query that partitions table A using a value from table B.
For example:
#standard SQL
select A.user_id
from my_project.xxx A
inner join my_project.yyy B
on A._partitiontime = timestamp(B.date)
where B.date = '2018-01-01'
This query will scan all the partitions in table A and will not take into consideration the date I specified in the where clause (for partitioning purposes). I have tried running this query in several different ways but all produced the same result - scanning all partitions in table A.
Is there any way around it?
Thanks in advance.
With BigQuery scripting (currently in beta), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then, in the subsequent query, the scripting variable is used as a filter to prune the partitions to be scanned.
DECLARE date_filter ARRAY<TIMESTAMP>
DEFAULT (SELECT ARRAY_AGG(TIMESTAMP(date)) FROM B WHERE ...);
select A.user_id
from my_project.xxx A
inner join my_project.yyy B
on A._partitiontime = timestamp(B.date)
where A._partitiontime IN UNNEST(date_filter);
The doc says this about your use case:
Express the predicate filter as closely as possible to the table
identifier. Complex queries that require the evaluation of multiple
stages of a query in order to resolve the predicate (such as inner
queries or subqueries) will not prune partitions from the query.
The following query does not prune partitions (note the use of a subquery):
#standardSQL
SELECT
t1.name,
t2.category
FROM
table1 t1
INNER JOIN
table2 t2
ON
t1.id_field = t2.field2
WHERE
t1.ts = (SELECT timestamp from table3 where key = 2)

Spark SQL statement broadcast

Is there a way to use broadcast in a Spark SQL statement?
For example:
SELECT
Column
FROM
broadcast (Table 1)
JOIN
Table 2
ON
Table1.key = Table2.key
And in my case, Table 1 is also a subquery.
Below is the syntax for a broadcast join:
SELECT /*+ BROADCAST(Table2) */ COLUMN
FROM Table1 JOIN Table2
ON Table1.key = Table2.key
To check whether the broadcast join actually occurs, look in the Spark UI (the history server typically listens on port 18080), under the SQL tab.
The reason we need to verify that the broadcast join is actually happening is that earlier we were using the syntax
/* BROADCASTJOIN(Table2) */, which is a plain comment rather than a hint (note the missing +); it did not throw a syntax error, but the UI showed a sort-merge join being performed.
Hence it is essential to ensure our joins are working as expected.
In Spark 2.2 or later you can use planner hints:
SELECT /*+ MAPJOIN(Table1) */ COLUMN
FROM Table1 JOIN Table2
ON Table1.key = Table2.key

Joining tables from different keyspaces in Hive on DSE

Is it possible to run a JOIN query in DSE Hive across tables in different Cassandra keyspaces?
I'm trying to execute the query below with no success:
hive> select * from mykeyspace1.table1 a JOIN keyspace_185.table_508 b on a.companyid=b.companyid limit 10;
There are two keyspaces, mykeyspace1 and keyspace_508.
In my case, the MapReduce job runs with no error but shows no results.
Thanks in advance!
It works for me in a simple test:
select a.name, b.state from test7.test1 a join test8.test1 b on a.name = b.name;
Maybe something is wrong with the data or the join condition.
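The pattern here is simply qualifying each table with its keyspace (schema) name. The same idea can be sketched in a self-contained way with SQLite, where ATTACH DATABASE stands in for the two keyspaces (an analogy only; the names and data below are invented):

```python
import os
import sqlite3
import tempfile

# Two separate database files standing in for two keyspaces (analogy only).
tmp = tempfile.mkdtemp()
db1 = os.path.join(tmp, "test7.db")
db2 = os.path.join(tmp, "test8.db")

with sqlite3.connect(db1) as c:
    c.execute("CREATE TABLE test1 (name TEXT)")
    c.execute("INSERT INTO test1 VALUES ('alice')")

with sqlite3.connect(db2) as c:
    c.execute("CREATE TABLE test1 (name TEXT, state TEXT)")
    c.execute("INSERT INTO test1 VALUES ('alice', 'CA')")

# Open one database and ATTACH the other under a schema name, then
# qualify each table with its schema -- just like keyspace.table in Hive.
conn = sqlite3.connect(db1)
conn.execute("ATTACH DATABASE ? AS test8", (db2,))
rows = conn.execute(
    "SELECT a.name, b.state "
    "FROM main.test1 a JOIN test8.test1 b ON a.name = b.name"
).fetchall()
print(rows)  # prints: [('alice', 'CA')]
```

If the cross-keyspace query returns no rows in Hive, the schema qualification itself is usually fine, and the join keys or data are the first things to check.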