hive skew join questions

hive skew join questions - hive

I have some doubts about skew join in hive .
1.when will hive use a common join to process the data , because I only see map join after I set blow properties
set hive.optimize.skewjoin=true;
set hive.mapjoin.smalltable.filesize=2;
2.why dosn`t skew join work with left join
below is table and sql:
tmp.skew_large_table 字段 imei,imsi,mac,phone,data_date;
total rows:290,0808
skew key : 868407035454956 670081
-----------
tmp.test_skew_small_table 字段 imei,package,data_date
total rows:857,6164
skew key : 868407035454956 10461
-----------
sql:
select a.*,b.*
from tmp.skew_large_table a
join
tmp.test_skew_small_table b
on a.imei=b.imei;

After reading source code of hive . I got answers
Q1:
hive.mapjoin.smalltable.filesize and hive.auto.convert.join dosn`t work for skew join
For every skey join , hive will use map-joins to handle it.
Q2
Outer-join will not trigger skew-join ,source code shows blow
// We are trying to adding map joins to handle skew keys, and map join righ
// now does not work with outer joins
if (!GenMRSkewJoinProcessor.skewJoinEnabled(parseCtx.getConf(), joinOp)) {
return;
}

Related

Redshift: is a join to a subquery/CTE consisting of SELECT * from a table equivalent to joining the table itself, or a performance hit?

On Redshift, does a CTE/subquery used in a join incur a performance hit if it is doing a SELECT * from a source table, vs. code that just references and joins to the source table directly? That is, is there any difference in performance between this code:
WITH cte_source_2 AS (SELECT * FROM source_2)
SELECT
s1.field_1, s2.field_2
FROM
source_1 AS s1
LEFT JOIN
cte_source_2 AS s2
ON
s1.key_field = s2.key_field
And this code:
SELECT
s1.field_1, s2.field_2
FROM
source_1 AS s1
LEFT JOIN
source_2 AS s2
ON
s1.key_field = s2.key_field
I would think not, that the query optimizer would reduce the first version to the second, but am getting conflicting results (mostly I think due to caching).
Another way of phrasing this question is, CTEs aside, and on Redshift specifically, does this:
SELECT
.....
FROM
(SELECT * FROM source_1) AS s1
LEFT JOIN
.......
Perform the same as this:
SELECT
.....
FROM
source_1 AS s1
LEFT JOIN
.......
Unfortunately I do not have the kind of access to get any profiling info. Thanks!

On Redshift, cte's are great for convenience but the query still resolves as a sub-select.
See second paragraph of https://docs.aws.amazon.com/redshift/latest/dg/r_WITH_clause.html
Because of that, you are correct. Performance will be the same either way.
This is not the case on postgres where cte's are resolved as temp tables. See first paragraph of https://www.postgresql.org/docs/current/queries-with.html

SQL is evaluated Right to Left or Left to Right?

I have a SQL query Like below -
A left JOIN B Left Join C Left JOIN D
Say table A is a big table whereas tables B, C, D are small.
Will Spark join will execute like-
A with B and subsequent results will be joined with C then D
or,
Spark will automatically optimize i.e it will join B, C and D and then
results will be joined with A.
My question is what is order of execution or join evaluation? Does it go left to right or right to left?

Spark can optimize join order, if it has access to information about cardinialities of those joins.
For example, if those are parquet tables or cached dataframes, then it has estimates on total counts of the tables, and can reorder join order to make it less expensive. If a "table" is a jdbc dataframe, Spark may not have information on row counts.
Spark Query Optimizer can also choose a different join type in case if it has statistics (e.g. it can broadcast all smaller tables, and run broadcast hash join instead of sort merge join).
If statistics aren't available, then it'll will just follow the order as in the SQL query, e.g. from left to right.
Update:
I originally missed that all the joins in your query are OUTER joins (left is equivalent to left outer).
Normally outer joins can't be reordered, because this would change result of the query. I said "normally" because sometimes Spark Optimizer can convert an outer join to an inner join (e.g. if you have a WHERE clause that filters out NULLs - see conversion logic here).
For completeness of the answer, reordering of joins is driven by two different codepaths, depending is Spark CBO is enabled or not (spark.sql.cbo.enabled first appeared in Spark 2.2 and is off by default). If spark.sql.cbo.enabled=true and spark.sql.cbo.joinReorder.enabled=true (also off by default), and statistics are available/collected manually through ANALYZE TABLE .. COMPUTE STATISTICS then reordering is based on estimated cardinality of the join I mentioned above.
Proof that reordering only works for INNER JOINS is here (on example of CBO).
Update 2: Sample queries that show that reordering of outer joins produce different results, so outer joins are never reordered :

The order of interpretation of joins does not matter for inner joins. However, it can matter for outer joins.
Your logic is equivalent to:
FROM ((A LEFT JOIN
B
) ON . . . LEFT JOIN
C
ON . . . LEFT JOIN
)
D
ON . . .
The simplest way to think about chains of LEFT JOIN is that they keep all rows in the first table and columns from matching rows in the subsequent tables.
Note that this is the interpretation of the code. The SQL optimizer is free to rearrange the JOINs in any order to arrive at the same result set (although with outer joins this is generally less likely than with inner joins).

LEFT OUTER JOIN behaving as CROSS JOIN in SPARK SQL

I am using Apache Spark 1.6.1, and am trying to perform a simple left outer join between two tables. One of these tables are loaded from an HDFS folder directly, and the other by directly querying a Hive table.
This is what I have done:
data = sqlContext.read.format("parquet").option("basePath","/path_to_folder").load("path_to_folder/partition_date=20170501
print data.count()
data2 = sqlContext.sql("select * from hive_table")
print data2.count()
join5 = data.join(data2,data["session_id"] == data2["key"],"left_outer")
print join5.count()
The first two counts return the correct count number (around 2 million)
But the count after the join, returns around 2 billion.
In a left join, I should never have more records than the table on the left side. I assume this join is behaving like a cross join.
How is this possible?
Thank you.

Hive query stuck at 99%

I am inserting records using left joining in Hive.When I set limit 1 query works but for all records query get stuck at 99% reduce job.
Below query works
Insert overwrite table tablename select a.id , b.name from a left join b on a.id = b.id limit 1;
But this does not
Insert overwrite table tablename select table1.id , table2.name from table1 left join table2 on table1.id = table2.id;
I have increased number of reducers but still it doesn't work.

Here are a few Hive optimizations that might help the query optimizer and reduce overhead of data sent across the wire.
set hive.exec.parallel=true;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set hive.exec.parallel=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
However, I think there's a greater chance that the underlying problem is key in the join. For a full description of skew and possible work arounds see this https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
You also mentioned that table1 is much smaller than table2. You might try a map-side join depending on your hardware constraints. (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins)

If your query is getting stuck at 99% check out following options -
Data skewness, if you have skewed data it might possible 1 reducer is doing all the work
Duplicates keys on both side - If you have many duplicate join keys on both side your output might explode and query might get stuck
One of your table is small try to use map join or if possible SMB join which is a huge performance gain over reduce side join
Go to resource manager log and see amount of data job is accessing and writing.

Hive automatically does some optimizations when it comes to joins and loads one side of the join to memory if it fits the requirements. However in some cases these jobs get stuck at 99% and never really finish.
I have faced this multiple times and the way I have avoided this by explicitly specifying some settings to hive. Try with the settings below and see if it works for you.
hive.auto.convert.join=false
mapred.compress.map.output=true
hive.exec.parallel=true

Make sure you don't have rows with duplicate id values in one of your data tables!
I recently encountered the same issue with a left join's map-reduce process getting stuck on 99% in Hue.
After a little snooping I discovered the root of my problem: there were rows with duplicate member_id matching variables in one of my tables. Left joining all of the duplicate member_ids would have created a new table containing hundreds of millions of rows, consuming more than my allotted memory on our company's Hadoop server.

use these configuration and try
hive> set mapreduce.map.memory.mb=9000;
hive> set mapreduce.map.java.opts=-Xmx7200m;
hive> set mapreduce.reduce.memory.mb=9000;
hive> set mapreduce.reduce.java.opts=-Xmx7200m

I faced the same problem with a left outer join similar to:
select bt.*, sm.newparam from
big_table bt
left outer join
small_table st
on bt.ident = sm.ident
and bt.cate - sm.cate
I made an analysis based on the already given answers and I saw two of the given problems:
Left table was more than 100x bigger than the right table
select count(*) from big_table -- returned 130M
select count(*) from small_table -- returned 1.3M
I also detected that one of the join variable was rather skewed in the right table:
select count(*), cate
from small_table
group by cate
-- returned
-- A 70K
-- B 1.1M
-- C 120K
I tried most of the solutions given in other answers plus some extra parameters I found here Without success.:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
Lastly I found out that the left table had a really high % of null values for the join columns: bt.ident and bt.cate
So I tried one last thing, which finally worked for me: to split the left table depending on bt.ident and bt.cate being null or not, to later make a union all with both branches:
select * from
(select bt.*, sm.newparam from
select * from big_table bt where ident is not null or cate is not null
left outer join
small_table st
on bt.ident = sm.ident
and bt.cate - sm.cate
union all
select *, null as newparam from big_table nbt where ident is null and cate is null) combined

multiple sql joins not producing desired results

I'm new to sql and trying to tweak someone else's huge stored procedure to get a subset of the results. The code below is maybe 10% of the whole procedure. I added the lp.posting_date, last left join, and the where clause. Trying to get records where the posting date is between the start date and the end date. Am I doing this right? Apparently not because the results are unaffected by the change. UPDATE: I CHANGED THE LAST JOIN. The results are correct if there's only one area allocation term. If there is more than one area allocation term, the results are duplicated for each term.
SELECT Distinct
l.lease_id ,
l.property_id as property_id,
l.lease_number as LeaseNumber,
l.name as LeaseName,
lty.name as LeaseType,
lst.name as LeaseStatus,
l.possession_date as PossessionDate,
l.rent as RentCommencementDate,
l.store_open_date as StoreOpenDate,
msr.description as MeasureUnit,
l.comments as Comments ,
lat.start_date as atStartDate,
lat.end_date as atEndDate,
lat.rentable_area as Rentable,
lat.usable_area as Usable,
laat.start_date as aatStartDate,
laat.end_date as aatEndDate,
MK.Path as OrgPath,
CAST(laa.percentage as numeric(9,2)) as Percentage,
laa.rentable_area as aaRentable,
laa.usable_area as aaUsable,
laa.headcounts as Headcount,
laa.area_allocation_term_id,
lat.area_term_id,
laa.area_allocation_id,
lp.posting_date
INTO #LEASES FROM la_tbl_lease l
INNER JOIN #LEASEID on l.lease_id=#LEASEID.lease_id
INNER JOIN la_tbl_lease_term lt on lt.lease_id=l.lease_id and lt.IsDeleted=0
LEFT JOIN la_tlu_lease_type lty on lty.lease_type_id=l.lease_type_id and lty.IsDeleted=0
LEFT JOIN la_tlu_lease_status lst on lst.status_id= l.status_id
LEFT JOIN la_tbl_area_group lag on lag.lease_id=l.lease_id
LEFT JOIN fnd_tlu_unit_measure msr on msr.unit_measure_key=lag.unit_measure_key
LEFT JOIN la_tbl_area_term lat on lat.lease_id=l.lease_id and lat.isDeleted=0
LEFT JOIN la_tbl_area_allocat_term laat on laat.area_term_id=lat.area_term_id and laat.isDeleted=0
LEFT JOIN dbo.la_tbl_area_allocation laa on laa.area_allocation_term_id=laat.area_allocation_term_id and laa.isDeleted=0
LEFT JOIN vw_FND_TLU_Menu_Key MK on menu_type_id_key=2 and isActive=1 and id=laa.menu_id_key
INNER JOIN la_tbl_lease_projection lp on lp.lease_projection_id = #LEASEID.lease_projection_id
where lp.posting_date <= laat.end_date and lp.posting_date >= laat.start_date

As may have already been hinted at you should be careful when using the WHERE clause with an OUTER JOIN.
The idea of the OUTER JOIN is to optionally join that table and provide access to the columns.
The JOINS will generate your set and then the WHERE clause will run to restrict your set. If you are using a condition in the WHERE clause that says one of the columns in your outer joined table must exist / equal a value then by the nature of your query you are no longer doing a LEFT JOIN since you are only retrieving rows where that join occurs.

Shorten it and copy it out as a new query in ssms or whatever you are using for testing. Use an inner join unless you want to preserve the left side set even when there is no matching lp.lease_id. Try something like
if object_id('tempdb..#leases) is not null
drop table #leases;
select distinct
l.lease_id
,l.property_id as property_id
,lp.posting_date
into #leases
from la_tbl_lease as l
inner join la_tbl_lease_projection as lp on lp.lease_id = l.lease_id
where lp.posting_date <= laat.end_date and lp.posting_date >= laat.start_date
select * from #leases
drop table #leases
If this gets what you want then you can work from there and add the other left joins to the query (getting rid of the select * and 'drop table' if you copy it back into your proc). If it doesn't then look at your Boolean date logic or provide more detail for us. If you are new to sql and its procedural extensions, try using the object explorer to examine the properties of the columns you are querying, and try selecting the top 1000 * from the tables you are using to get a feel for what the data looks like when building queries. -Mike

You can try the BETWEEN operator as well
Where lp.posting_date BETWEEN laat.start_date AND laat.end_date
Reasoning: You can have issues wheres there is no matching values in a table. In that instance on a left join the table will populate with null. Using the 'BETWEEN' operator insures that all returns have a value that is between the range and no nulls can slip in.

As it turns out, the problem was easier to solve and it was in a different place in the stored procedure. All I had to do was add one line to one of the cursors to include area term allocations by date.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas