Issue with broadcast join in Spark 1.6.0 - SQL

Spark creates different SQL execution plans in versions 1.5.2 and 1.6.0.
The database contains one large fact table (f) and several small dimension tables (d1, d2).
The dimension tables (d1, d2) are cached in Spark using:
sqlContext.cacheTable("d1");
sqlContext.sql("SELECT * from d1").count();
sqlContext.cacheTable("d2");
sqlContext.sql("SELECT * from d2").count();
The size of tables d1 and d2 is ~100 MB (the Storage tab in the Spark Web UI shows this size for the tables).
Spark SQL is configured with a broadcast threshold (1 GB) that is larger than the size of the tables:
spark.sql.autoBroadcastJoinThreshold=1048576000
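The same value can also be applied at the session level; as a sketch (mirroring the configuration above), the threshold can be set with a plain SQL statement issued over the connection:
-- Session-level override of the broadcast threshold (same 1 GB value as above)
SET spark.sql.autoBroadcastJoinThreshold=1048576000;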
The SQL is:
SELECT d1.name, d2.name, SUM(f.clicks)
FROM f
JOIN d1 on f.d1_id = d1.id
JOIN d2 on f.d2_id = d2.id
WHERE d1.name='A' AND f.date=20160101
GROUP BY d1.name, d2.name
ORDER BY SUM(f.clicks) DESC
The query is executed using ThriftServer.
When we run this query on Spark 1.5.2, it executes quickly and the execution plan contains a BroadcastJoin with the dimension tables.
But when we run the same query on Spark 1.6.0, it executes very slowly and the execution plan contains a SortMergeJoin with the dimension tables.
The environment for both executions is the same (the same YARN cluster).
The executor count, tasks per executor, and executor memory are the same for both executions.
What can I do to configure Spark 1.6.0 to use BroadcastJoin in SQL executions?
Thanks.
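One way to confirm which join strategy 1.6.0 actually picks is to prefix the query with EXPLAIN and look for BroadcastHashJoin vs. SortMergeJoin in the printed plan, e.g.:
-- Print the physical plan for the query above and check the join operators
EXPLAIN
SELECT d1.name, d2.name, SUM(f.clicks)
FROM f
JOIN d1 ON f.d1_id = d1.id
JOIN d2 ON f.d2_id = d2.id
WHERE d1.name='A' AND f.date=20160101
GROUP BY d1.name, d2.name
ORDER BY SUM(f.clicks) DESC;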

Related

Remove Hash Match to Increase SQL Query Performance

I have an SSIS package that has the following query as its OLE DB Source:
SELECT SESM.[Id]
,SE.SessionId
,SESM.[SegmentId]
,SE.[Id] as 'SessionEntryId'
,V.[Number] as 'VehicleNumber'
,SELS.[SessionEntryLapId]
,SEL.[LapNumber]
,SESM.[Name]
,SESM.[NotNulls]
,SESM.[OutOfRange]
,SESM.[Nulls]
,SESM.[Mean]
,SESM.[Variance]
,SESM.[Min]
,SESM.[Max]
,SESM.[P05]
,SESM.[P10]
,SESM.[P20]
,SESM.[P25]
,SESM.[P50]
,SESM.[P75]
,SESM.[P80]
,SESM.[P90]
,SESM.[P95]
,SESM.[Value]
,SESM.[Percentage]
,SESM.[Discriminator]
FROM [dbo].[SessionEntrySegmentMetrics] SESM
LEFT JOIN [SessionEntryLapSegments] SELS on SESM.SegmentId = SELS.Id
LEFT JOIN SessionEntryLaps SEL on SELS.SessionEntryLapId = SEL.Id
LEFT JOIN SessionEntries SE on SEL.SessionEntryId = SE.Id
LEFT JOIN Vehicles V on SE.VehicleId = V.Id
This query returns 140M rows, which the package processes (a small data conversion) and transfers to our warehouse. The package is averaging 1M rows per hour, which is unacceptably slow.
Looking at the query in SSMS, this is the execution plan.
The two big points are the Index Scan on SessionEntrySegmentMetrics with a cost of 71% and the Hash Match at 16%. SessionEntrySegmentMetrics has 140M rows in it, and the index it is using is 70% fragmented with 60% page fullness.
The memory on the SQL Server box executing the SSIS package is pegged at 97%.
Besides the fragmentation issue, any ideas on how to eliminate that Hash Match and increase the performance of this query?
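For the fragmentation part specifically, a possible starting point (a sketch only; ALL rebuilds every index on the table, which may be more than is needed) is to rebuild the indexes and refresh statistics before the package runs:
-- Rebuild the indexes on the 140M-row table to address the ~70% fragmentation
-- and low page fullness; adjust FILLFACTOR to suit the workload.
ALTER INDEX ALL ON [dbo].[SessionEntrySegmentMetrics] REBUILD WITH (FILLFACTOR = 90);
-- Refresh optimizer statistics so the plan reflects the rebuilt indexes.
UPDATE STATISTICS [dbo].[SessionEntrySegmentMetrics];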

How do I enable execution of multiple jobs in parallel in Hive

I am running the SQL below on a table that has around 434,836,959 records. It is taking more than 3 minutes to get the result.
select distinct col_1,col_2,col_3,col_4,col_5,
to_date(concat(year(col_6),'-',month(col_6), '-1')) as col_6_new,col_7,
cast(first_value(col_11) over (partition by col_1,col_2,col_5,col_4,concat(year(col_6),'-',month(col_6)) order by col_6) as double) as col_9,
cast(first_value(col_11) over (partition by col_1,col_2,col_5,col_4,concat(year(col_6),'-',month(col_6)) order by col_6 desc) as double) as col_10,
min(to_date(concat(year(col_6),'-',month(col_6), '-1'))) over (partition by col_1,col_2,col_5,col_4) as col_8
from my_table
When I checked the execution in the background, I could see that only 1 job and 1 stage were running at a time. Is there a way to parallelize this?
I even tried the settings below, but the jobs/stages still do not run in parallel.
spark.sql("set hive.exec.parallel=true")
spark.sql("set hive.exec.parallel.thread.number=16")
spark.sql("set hive.vectorized.execution = true")
spark.sql("set hive.vectorized.execution.enabled = true")
The Spark version I am using is 2.3.
Any help is greatly appreciated.
Every stage in a Spark execution plan corresponds to a set of operations that do not require shuffling.
As far as I can see, you need at least 2 shuffles in your query:
For calculating window functions with clauses like partition by col_1,col_2,col_5,col_4,concat(year(col_6),'-',month(col_6)) order by col_6
For calculating the final distinct operation.
Thus, 2 shuffles result in 3 Spark stages.
Since you cannot compute the distinct before all the window functions have been calculated, the stages have to be executed one after another and cannot be parallelized.
To check this on your side, you can find the execution DAG in the Spark UI.
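As a quick check, a reduced form of the query (shown here only to illustrate where the shuffles appear; the column names are the ones used above) can be prefixed with EXPLAIN, and each Exchange operator in the printed plan marks a shuffle boundary:
-- Reduced illustration: one window function plus DISTINCT over the same table.
-- Each Exchange in the resulting plan corresponds to a shuffle (stage boundary).
EXPLAIN
SELECT DISTINCT col_1, col_2,
       first_value(col_11) OVER (PARTITION BY col_1, col_2 ORDER BY col_6) AS first_val
FROM my_table;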

Pig ParquetLoader : Column Pruning

I read Parquet files which have a schema of 12 columns.
I do a group by and a sum aggregation over a single long column,
then join with another dataset. After the join I only take a single column (the sum) from the Parquet dataset.
But Pig constantly keeps giving me the error:
"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2000: Error processing rule ColumnMapKeyPrune. Try -t ColumnMapKeyPrune"
Does the Pig Parquet loader not support column pruning?
If I run with column pruning disabled, the job works.
Pseudo code for what I am trying to achieve:
REGISTER /<path>/parquet*.jar;
res1 = load '<path>' using parquet.pig.ParquetLoader() as (c1:chararray,c2:chararray,c3:int, c4:int, c5:chararray, c6:chararray, c7:chararray, c8:chararray, c9:chararray, c10:chararray, c11:chararray, c12:long);
res2 = group res1 by (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11);
res3 = foreach res2 generate flatten(group) as (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11),SUM(res1.c12) as counts;

BUG in SQL query "select * in table" using RODBC package with ODBC Driver 13 for SQL Server in R

There seems to be a problem with ODBC Driver 13 for SQL Server (running on local Ubuntu 16.04) with the RODBC package (version 1.3-15) in R (version 3.4.1, 2017-06-30). First off, we make a query to see the size of the SQL table called TableName.
library(RODBC)
connectionString <- "Driver={ODBC Driver 13 for SQL Server};Server=tcp:<DATABASE-URL>,<NUMBER>;Database=<DATABASE NAME>;Uid=<USER ID>;Pwd=<PASSWORD>;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
connection <- odbcDriverConnect(connectionString)
count <- sqlQuery(connection, 'select count(*) from TableName')
odbcGetErrMsg(connection)
The output of the above gives a count value of 200,000 (odbcGetErrMsg returns no errors), which is known to be the correct size of the SQL table TableName.
Now comes the strange part.
TableName <- sqlQuery(connection, 'select * from TableName')
count = dim(TableName)[1]
odbcGetErrMsg(connection)
The output of the above first gives a count value of 700 (odbcGetErrMsg returns no errors). But when the code is executed again it returns another count value, e.g. 2300, i.e. it is random. Repeating the code multiple times, I see that the returned count ranges approximately between 700 and 8,000 (TableName has 8 columns).
None of the above outputs changes when setting Connection Timeout to either 0 or some absurdly high number.
Does anybody know what is going on here? The goal is to store the full SQL table called TableName as a dataframe in R for further data processing.
Any help is much appreciated.
Note for others with a similar problem:
I did not solve this bug; however, switching to Microsoft JDBC Driver 6.2 for SQL Server with the R package RJDBC returns the correct result. With this setup I am now able to load the full SQL table (200,000 rows and counting) into R as a dataframe for further processing.

Left Semi Join on Geo-Spatial tables in Spark-SQL & GeoMesa

Problem:
I have 2 tables (d1 and d2) containing geospatial points. I want to carry out the following query:
select * from table1 where table1.point is within 50km of any point in table2.point
I am using Spark SQL with GeoMesa and Accumulo to achieve this (Spark as the processing engine, Accumulo as the data store, and GeoMesa for the geospatial libraries).
The above query is a kind of left semi join, but I was not sure how to achieve it using Spark SQL because, as far as I have read, subqueries can't be used in the WHERE clause.
I was able to achieve this using:
select * from d1 left semi join d2 on st_contains(st_bufferPoint(d1.point, 10000.0), d2.point)
Spark broadcasts d2 and carries out the join, but it still takes a long time because d1 has 5 billion rows and d2 has 10 million.
I am not sure, though, whether there is a more efficient way to achieve the same result.