Understanding Query Plan of a Spark SQL Query

I am trying to understand the physical plan of a Spark SQL query. I am using Spark SQL 2.4.7.
Below is a partial query plan generated for a large query.
: +- ReusedQueryStage 16
: +- BroadcastQueryStage 7
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
: +- AdaptiveSparkPlan(isFinalPlan=true)
: +- *(11) HashAggregate(keys=[src_clmorigid#21055], functions=[], output=[src_clmorigid#21055])
: +- ShuffleQueryStage 21, true
: +- Exchange hashpartitioning(src_clmorigid#21055, 10)
: +- *(10) HashAggregate(keys=[src_clmorigid#21055], functions=[], output=[src_clmorigid#21055])
: +- *(10) Project [src_clmorigid#21055]
: +- *(10) BroadcastHashJoin [tgt_clmorigid#21152], [tgt_clmorigid#20756], Inner, BuildRight
: :- *(10) Project [src_clmorigid#21055, tgt_clmorigid#21152]
: : +- *(10) Filter (isnotnull(tgt_clmorigid#21152) && isnotnull(src_clmorigid#21055))
: : +- *(10) FileScan parquet default.vw_exclude_latest_set_frm_clm[src_clmorigid#21055,tgt_clmorigid#21152] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://dm_bucket...
: +- ReusedQueryStage 20
: +- BroadcastQueryStage 6
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
: +- AdaptiveSparkPlan(isFinalPlan=true)
: +- *(9) HashAggregate(keys=[tgt_clmorigid#20756], functions=[], output=[tgt_clmorigid#20756])
: +- ShuffleQueryStage 19, true
: +- Exchange hashpartitioning(tgt_clmorigid#20756, 10)
: +- *(8) HashAggregate(keys=[tgt_clmorigid#20756], functions=[], output=[tgt_clmorigid#20756])
: +- *(8) Project [tgt_clmorigid#20756]
: +- *(8) Filter ((((isnotnull(tgt_clm_line_type_ind#20783) && isnotnull(src_clm_line_type_ind#20686))
: +- *(8) FileScan parquet default.vw_exclude_latest_set_frm_clm[src_clm_line_type_ind#20686,tgt_clmorigid#20756,tgt_clm_line_type_ind#20783] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://...PushedFilters: [IsNotNull(tgt_clm_line_type_ind),
+- *(41) Project [vw_clm_base_fact_sk#21807, source_system#21808, eff_date#21809, frst_sales_crtn_dt#21810, clmorigid#21811, ... 59 more fields]
+- *(41) FileScan parquet default.vw_to_be_merged_data[vw_clm_base_fact_sk#21807,source_system#21808,eff_date#21809,frst_sales_crtn_dt#21810,... 56 more fields], ...
Can anyone please help me answer the following questions I have:
What do the numbers inside the parentheses signify, e.g. *(41), *(8)? Do they represent a Stage Id or a WholeStageCodegen Id?
I understand that the asterisk '*' represents WholeStageCodegen. What exactly does that mean, and what is its significance in terms of query performance? A few days ago I saw a Spark SQL physical plan that did not contain any asterisks, i.e. no WholeStageCodegen. Does that mean the SQL query was poorly written and hence the performance of that query will be suboptimal? What causes WholeStageCodegen not to be utilized by the Spark optimizer, as in that specific query? (A sketch of how these plans can be printed and compared follows the questions below.)
What does "ReusedQueryStage 20" mean in the above query plan, and what does the number 20 signify?
What does "BroadcastQueryStage 6" mean in the above query plan, and what does the number 6 signify?
What does "ShuffleQueryStage 21" mean in the above query plan, and what does the number 21 signify?
What does "AdaptiveSparkPlan(isFinalPlan=true)" mean in the above query plan?
I once saw an execution plan of a query in the SQL tab of the Spark UI which had an operator called "BloomFilter". What does that mean? Is it something to do with reading a Parquet file? Can you please explain.
A question regarding the Spark UI:
In the Stages tab of the Spark UI (Spark 2.4.7), the DAGs often contain boxes labelled "WholeStageCodegen" that group several operators into the same box. What does this signify in Spark with respect to query performance?
The DAGs shown in the Stages tab of the Spark UI do not indicate which part of the actual query they pertain to, since they do not show any specific table names etc. Hence, with big queries it often becomes very difficult to pinpoint the exact part of the query that a given DAG corresponds to. Is there any way to identify the exact part of the code that a specific stage in the DAG pertains to?
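For reference, here is a minimal sketch of how a plan like the one above can be printed, and how whole-stage codegen can be switched off to compare the same query with and without the asterisks (PySpark 2.4; the query and table name below are just placeholders):
# "my_table" and the query are placeholders.
df = spark.sql("SELECT some_col, count(*) AS cnt FROM my_table GROUP BY some_col")
# Physical plan only; the *(n) prefixes are the whole-stage-codegen markers discussed above.
df.explain()
# Parsed, analyzed, optimized and physical plans.
df.explain(True)
# Switching whole-stage codegen off removes the asterisks from the plan,
# which makes it easy to compare the two forms of the same query.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.sql("SELECT some_col, count(*) AS cnt FROM my_table GROUP BY some_col").explain()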
Note: I am using Spark 2.4.7.
Any help is appreciated.
Thanks.

Related

Push data to mongoDB using spark from hive

I want to extract data from Hive using a SQL query, convert it to a nested dataframe and push it into MongoDB using Spark.
Can anyone suggest an efficient way to do that?
eg:
Flat query result -->
{"columnA":123213 ,"Column3 : 23,"Column4" : null,"Column5" : "abc"}
Nested Record to be pushed to mongo -->
{
  "columnA": 123213,
  "newcolumn": {
    "Column3": 23,
    "Column4": null,
    "Column5": "abc"
  }
}
You may use the map function in Spark SQL to achieve the desired transformation, e.g.
df.selectExpr("ColumnA","map('Column3',Column3,'Column4',Column4,'Column5',Column5) as newcolumn")
or you may run the following SQL on your Spark session after creating a temp view:
df.createOrReplaceTempView("my_temp_view")
sparkSession.sql("<insert sql below here>")
SELECT
ColumnA,
map(
"Column3",Column3,
"Column4",Column4,
"Column5",Column5
) as newcolumn
FROM
my_temp_view
Moreover, if this is the only transformation that you wish to use, you may run this query on hive also.
Additional resources:
Spark Writing to Mongo
Let me know if this works for you.
To build the nested column on your Hive dataframe with the DataFrame API, we can try something like:
from pyspark.sql import functions as F

# Wrap Column3, Column4 and Column5 into a single struct column.
df = df.withColumn(
    "newcolumn",
    F.struct(
        F.col("Column3").alias("Column3"),
        F.col("Column4").alias("Column4"),
        F.col("Column5").alias("Column5")
    )
)

followed by a groupBy and F.collect_list if you need a nested array wrapped in a single record.
We can then write this to Mongo:
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
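If the output URI is not configured on the Spark session (spark.mongodb.output.uri), the target can also be passed explicitly. A rough sketch, with placeholder URI, database and collection names, using the MongoDB Spark connector 2.x option names:
# Placeholders: replace the uri, database and collection values with your own.
(df.write
    .format("com.mongodb.spark.sql.DefaultSource")
    .mode("append")
    .option("uri", "mongodb://host:27017")
    .option("database", "my_db")
    .option("collection", "my_collection")
    .save())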

Apache Spark: using plain SQL queries vs using Spark SQL methods

I'm very new to Apache Spark.
I have a very basic question: which of the two approaches below is better in terms of performance: using plain SQL queries, or using Spark SQL methods like select, filter, etc.?
Here's a short example in Java that should make my question clearer.
import static org.apache.spark.sql.functions.col;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

private static void queryVsSparkSQL() throws AnalysisException {
    SparkConf conf = new SparkConf();
    SparkSession spark = SparkSession
        .builder()
        .master("local[4]")
        .config(conf)
        .appName("queryVsSparkSQL")
        .getOrCreate();

    // Using a predefined SQL query passed to the database.
    Dataset<Row> ds1 = spark
        .read()
        .format("jdbc")
        .option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl")
        .option("user", "hr")
        .option("password", "hr")
        .option("query", "select * from hr.employees t where t.last_name = 'King'")
        .load();

    ds1.show();

    // Using Spark SQL methods: select, filter.
    Dataset<Row> ds2 = spark
        .read()
        .format("jdbc")
        .option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl")
        .option("user", "hr")
        .option("password", "hr")
        .option("dbtable", "hr.employees")
        .load()
        .select("*")
        .filter(col("last_name").equalTo("King"));

    ds2.show();
}
Try .explain and check whether a pushdown predicate is used for your second query.
It should be in that second case. If so, it is technically equivalent in performance to passing an explicit query that already contains the filter via the query option.
See below a simulated version against MySQL, based on your approach.
CASE 1: select statement via passed query containing filter
val dataframe_mysql = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam")
  .option("driver", "org.mariadb.jdbc.Driver")
  .option("query", "select * from family where rfam_acc = 'RF01527'")
  .option("user", "rfamro")
  .load()
  .explain()
== Physical Plan ==
*(1) Scan JDBCRelation((select * from family where rfam_acc = 'RF01527') SPARK_GEN_SUBQ_4) [numPartitions=1] [rfam_acc#867,rfam_id#868,auto_wiki#869L,description#870,author#871,seed_source#872,gathering_cutoff#873,trusted_cutoff#874,noise_cutoff#875,comment#876,previous_id#877,cmbuild#878,cmcalibrate#879,cmsearch#880,num_seed#881L,num_full#882L,num_genome_seq#883L,num_refseq#884L,type#885,structure_source#886,number_of_species#887L,number_3d_structures#888,num_pseudonokts#889,tax_seed#890,... 11 more fields] PushedFilters: [], ReadSchema: struct<rfam_acc:string,rfam_id:string,auto_wiki:bigint,description:string,author:string,seed_sour...
Here PushedFilters is not used, because the whole query is passed down; the filter is contained in the query that is actually sent to the database.
CASE 2: No select statement passed; instead the Spark SQL API is used with a filter
val dataframe_mysql = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam")
  .option("driver", "org.mariadb.jdbc.Driver")
  .option("dbtable", "family")
  .option("user", "rfamro")
  .load()
  .select("*")
  .filter(col("rfam_acc").equalTo("RF01527"))
  .explain()
== Physical Plan ==
*(1) Scan JDBCRelation(family) [numPartitions=1] [rfam_acc#1149,rfam_id#1150,auto_wiki#1151L,description#1152,author#1153,seed_source#1154,gathering_cutoff#1155,trusted_cutoff#1156,noise_cutoff#1157,comment#1158,previous_id#1159,cmbuild#1160,cmcalibrate#1161,cmsearch#1162,num_seed#1163L,num_full#1164L,num_genome_seq#1165L,num_refseq#1166L,type#1167,structure_source#1168,number_of_species#1169L,number_3d_structures#1170,num_pseudonokts#1171,tax_seed#1172,... 11 more fields] PushedFilters: [*IsNotNull(rfam_acc), *EqualTo(rfam_acc,RF01527)], ReadSchema: struct<rfam_acc:string,rfam_id:string,auto_wiki:bigint,description:string,author:string,seed_sour...
PushedFilters is set to the filter criteria, so the filtering is applied in the database itself before data is returned to Spark. Note the * on the PushedFilters entries; it signifies that the filtering happens at the data source, i.e. the database.
Summary
I ran both options and the timing was quick in both cases. They are equivalent in terms of the database processing performed: only the filtered results are returned to Spark, just via two different mechanisms that lead to the same performance and the same physical results.

Neo4j Cypher- Equivalent of ADD_DATE(date,INTERVAL expr unit)

I have some sql queries that I want to translate to cypher. One of my queries contains the function DATE_ADD :
WHERE s_date<=DATE_ADD('2000-12-01',INTERVAL -90 DAY);
Is there any equivalent function in cypher please?
Thanks,
You can use APOC for that : https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_adding_subtracting_time_unit_values_to_timestamps
Or, if you are using the temporal features of Neo4j 3.4, you can add a Duration to a Date: RETURN date({year:2018, month:3, day: 31}) + duration('P1D'). For your example, something like date('2000-12-01') - duration('P90D') should give you the equivalent of DATE_ADD('2000-12-01', INTERVAL -90 DAY).
For more information, see the documentation: https://neo4j.com/docs/developer-manual/3.4/cypher/syntax/temporal/#cypher-temporal-specifying-durations
Cheers

Fuzzy search with stop words produces unexpected results with Lucene / ElasticSearch

I am noticing that the fuzzy operator on stop words does not produce the results I'd expect.
Here's my configuration:
index :
    analysis :
        analyzer :
            my_analyzer :
                tokenizer : my_tokenizer
                filter : [standard, my_stop_english_filter]
        tokenizer :
            my_tokenizer :
                type : standard
                max_token_length : 512
        filter :
            my_stop_english_filter :
                type : stop
                stopwords : [the]
                ignore_case : true
And suppose I have indexed:
the brown fox
If I search for:
the brown~ fox~, then I get a hit as expected.
However, if I search for: the~ brown~ fox~, then I do not get a hit, presumably because the fuzzy operator prevents "the" from being treated as a stop word.
Is there a way I can combine stop words with fuzzy search?
Thanks,
Eric
If I recall correctly, this is the way Lucene is supposed to work as it is currently written: using a fuzzy search disables the stopping of stop words. It would take some work, but you could create a modified version of the query parser so that stop words are ignored when applying a fuzzy search (but then how do you handle a fuzzy search on something that merely looks like a stop word?).

How to boost search based on index type in elasticsearch or lucene?

I have three food type indices: "Italian", "Spanish", "American".
When a user searches for "Cheese", documents from "Italian" come up at the top. Is it possible to "boost" the results if I want to give preference to, say, "Spanish"? (I should still get results for Italian, but based on some numeric boost value for the index "Spanish", the ordering of the documents returned should favour the "Spanish" index.) Is this possible with user-input Lucene and/or an ES query? If so, how?
Add a term query with a boost for either the _type field or the _index (or both).
Use a script_score as part of the function score query:
function_score: {
  script_score: {
    script: "doc['_type'].value == '<your _type>' ? _score * <boost_factor> : _score"
  }
}
If querying several indices at once, it is possible to specify indices boost at the top level of object passed to Search API:
curl -XGET localhost:9200/italian,spanish,american/_search -d '
{
  "query": { "term": { "food_type": "cheese" } },
  "indices_boost": {
    "italian": 1.4,
    "spanish": 1.3,
    "american": 1.1
  }
}'
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-index-boost.html#search-request-index-boost
For query-time boosting, queries (ex. query string) generally have a boost attribute you can set. Alternatively, you can wrap queries in a custom boost factor. I would probably prefer the former, usually.