ShuffleQueryStage and ReusedQueryStage in Spark SQL Query Plans - apache-spark-sql

What do ShuffleQueryStage 20 and ReusedQueryStage 16 mean in the Spark SQL query plan below? I have shared a part of the query plan generated for my query.
I am using Spark 2.4.7.
: +- ReusedQueryStage 16
: +- BroadcastQueryStage 7
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
: +- AdaptiveSparkPlan(isFinalPlan=true)
: +- *(11) HashAggregate(keys=[src_clmorigid#21055], functions=[], output=[src_clmorigid#21055])
: +- ShuffleQueryStage 21, true
: +- Exchange hashpartitioning(src_clmorigid#21055, 10)
: +- *(10) HashAggregate(keys=[src_clmorigid#21055], functions=[], output=[src_clmorigid#21055])
: +- *(10) Project [src_clmorigid#21055]
: +- *(10) BroadcastHashJoin [tgt_clmorigid#21152], [tgt_clmorigid#20756], Inner, BuildRight
: :- *(10) Project [src_clmorigid#21055, tgt_clmorigid#21152]
: : +- *(10) Filter (isnotnull(tgt_clmorigid#21152) && isnotnull(src_clmorigid#21055))
: : +- *(10) FileScan parquet default.vw_exclude_latest_set_frm_clm[src_clmorigid#21055,tgt_clmorigid#21152] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://dm_bucket...
: +- ReusedQueryStage 20
: +- BroadcastQueryStage 6
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
: +- AdaptiveSparkPlan(isFinalPlan=true)
: +- *(9) HashAggregate(keys=[tgt_clmorigid#20756], functions=[], output=[tgt_clmorigid#20756])
: +- ShuffleQueryStage 19, true
: +- Exchange hashpartitioning(tgt_clmorigid#20756, 10)
: +- *(8) HashAggregate(keys=[tgt_clmorigid#20756], functions=[], output=[tgt_clmorigid#20756])
: +- *(8) Project [tgt_clmorigid#20756]
: +- *(8) Filter ((((isnotnull(tgt_clm_line_type_ind#20783) && isnotnull(src_clm_line_type_ind#20686))
: +- *(8) FileScan parquet default.vw_exclude_latest_set_frm_clm[src_clm_line_type_ind#20686,tgt_clmorigid#20756,tgt_clm_line_type_ind#20783] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://...PushedFilters: [IsNotNull(tgt_clm_line_type_ind),
+- *(41) Project [vw_clm_base_fact_sk#21807, source_system#21808, eff_date#21809, frst_sales_crtn_dt#21810, clmorigid#21811, ... 59 more fields]
+- *(41) FileScan parquet default.vw_to_be_merged_data[vw_clm_base_fact_sk#21807,source_system#21808,eff_date#21809,frst_sales_crtn_dt#21810,... 56 more fields], ...
Happy to provide additional information if required.

These stages are connected to AQE (Adaptive Query Execution). If you execute your code without AQE, they should disappear.
ShuffleQueryStage - this is added after an Exchange; it is used to materialize the results of the previous stage (the exchange) so that AQE can use runtime statistics to re-optimize the plan.
ReusedQueryStage - this means that this branch of execution already exists elsewhere in your plan and can be reused. In the source code I found a comment which says that the ordering of these stages may not be intuitive.
I know it may sound weird because AQE was officially released in Spark 3, but it was actually available from Spark 2.2.x (in a different shape than it works now), so it is possible that someone enabled AQE on your cluster with Spark 2.4.7 and that is why you see these stages.
I added an example with an analysis of ShuffleQueryStage in another answer, you may take a look: Other answer
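If you want to confirm that AQE is indeed what is adding these nodes, here is a minimal sketch (Spark 2.4.x, Scala; yourQuery is a placeholder for your own SQL) for checking and disabling the flag in the current session:
val aqeEnabled = spark.conf.get("spark.sql.adaptive.enabled", "false")  // AQE switch, defaults to false
println(s"spark.sql.adaptive.enabled = $aqeEnabled")
// Turn AQE off for this session and compare the plans: the *QueryStage wrappers
// should be replaced by plain Exchange nodes.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.sql(yourQuery).explain(true)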

Related

Pydrill Heap memory Issue while Executing Spark submit commands

Exception :
[MainThread ] [ERROR] : Message : TransportError(500, '{\n "errorMessage" : "RESOURCE ERROR: There is not enough heap memory to run this query using the web interface. \n\nPlease try a query with fewer columns or with a filter or limit condition to limit the data returned. \nYou can also try an ODBC/JDBC client. \n\n[Error Id: 8dc824fc-b1b5-4352-85fe-84e2eb5ff71d on

What does an asterisk preceding a PushedFilters entry entail in a Spark SQL Explain Plan

Regarding Spark PushedFilters shown in Spark physical explain plans, it is stated that (ref.
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkPredicatePushdown.html):
The asterisk indicates that push down filter will be handled only at the datasource level.
What does this mean? More importantly, when you see a PushedFilters array entry without an asterisk, is the filter still being pushed down to the data source level, or is it handled outside of it? And if the latter, why call it a pushed filter in the first place?
This is very confusing, and googling it I couldn't find a real answer to the question.
Thanks!
Jan
Pushdown of predicates always happens at the data source level: the data source selectively scans the pieces of data that the predicates apply to. Spark is just a processing engine that hands the query over to the data source for final execution, and the data source executes the query as it sees fit. The Spark SQL connectors are aware of the behavior of data sources (based on the schema), so they can predict a physical plan with pushdown predicates, but they cannot guarantee that it will actually run that way, hence the asterisk.
I ran a query against a local Parquet file. The physical plan has a pushed-down predicate and no asterisk. It is a local Parquet file which Spark reads itself, so the physical plan is 100% accurate.
val df = spark.read.parquet("/Users/Documents/temp/temp1")
df.filter($"income" >= 30).explain(true)
== Physical Plan ==
*(1) Project [client#0, type#1, address#2, type_2#3, income#4]
+- *(1) Filter (isnotnull(income#4) && (income#4 >= 30))
+- *(1) FileScan parquet [client#0,type#1,address#2,type_2#3,income#4] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/User/Documents/temp/temp1], PartitionFilters: [], PushedFilters: [IsNotNull(income), GreaterThanOrEqual(income,30)], ReadSchema: struct<client:string,type:string,address:string,type_2:string,income:int>
Here a table is read from Oracle using Spark SQL. The Oracle database uses predicate pushdown and index access, but Spark has no idea about it.
== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand InsertIntoHadoopFsRelationCommand file:/data/.., false, Parquet, Map(codec -> org.apache.hadoop.io.compress.snappyCodec, path -> /data/...), Overwrite, [COLUMN_01, COLUMN_02, COLUMN_03, COLUMN_04, COLUMN_05, COLUMN_06, COLUMN_07, COLUMN_08, COLUMN_09, COLUMN_10, COLUMN_11, COLUMN_12, COLUMN_13, COLUMN_14, COLUMN_15, COLUMN_16, COLUMN_17, COLUMN_18, ... 255 more fields]
+- Project [COLUMN_01#1246, COLUMN_02#1247, COLUMN_03#1248, COLUMN_04#1249, COLUMN_05#1250, COLUMN_06#1251, COLUMN_07#1252, COLUMN_08#1253, COLUMN_09#1254, COLUMN_10#1255, COLUMN_11#1256, COLUMN_12#1257, COLUMN_13#1258, COLUMN_14#1259, COLUMN_15#1260, COLUMN_16#1261, COLUMN_17#1262, COLUMN_18#1263, ... 255 more fields]
+- Scan JDBCRelation((select cu.*, ROWIDTONCHAR(t.rowid) as ROW_ID from table t where (column1 in (786567473,786567520,786567670,786567570,...........)) and column2 in (10,11, ...) and t.result is null)t) [numPartitions=20] [COLUMN_87#1332,COLUMN_182#1427,COLUMN_128#1373,COLUMN_104#1349,COLUMN_189#1434,COLUMN_108#1353,COLUMN_116#1361,COLUMN_154#1399,COLUMN_125#1370,COLUMN_120#1365,COLUMN_267#1512,COLUMN_54#1299,COLUMN_100#1345,COLUMN_230#1475,COLUMN_68#1313,COLUMN_44#1289,COLUMN_53#1298,COLUMN_97#1342,COLUMN_03#1248,COLUMN_16#1261,COLUMN_43#1288,COLUMN_50#1295,COLUMN_174#1419,COLUMN_20#1265,... 254 more fields] PushedFilters: [], ReadSchema: struct<COLUMN_87:string,COLUMN_182:string,COLUMN_128:string,COLUMN_104:string,COLUMN_189:string,C...
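For reference, a minimal sketch (hypothetical URL, credentials, table and column names) of the JDBC read pattern behind the plan above; the filtering lives in the subquery handed to Oracle, which is why Spark reports PushedFilters: [] even though the database prunes rows and can use its indexes:
val oracleDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  // hypothetical connection string
  .option("dbtable", "(select t.*, ROWIDTONCHAR(t.rowid) as ROW_ID from some_table t where t.column2 in (10, 11)) q")  // Oracle evaluates this predicate
  .option("user", "db_user")          // hypothetical credentials
  .option("password", "db_password")
  .option("partitionColumn", "some_numeric_column")  // together with the three options below, controls read parallelism
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "20")
  .load()
oracleDF.explain(true)  // Scan JDBCRelation(...) [numPartitions=20] ... PushedFilters: []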

"java.lang.OutOfMemoryError: Java heap spaceā€ error in Apache Spark?

I have a DataFrame and I want to convert a column to a Set in Spark Scala, but I get a java.lang.OutOfMemoryError: Java heap space error during the conversion. My data is large, about 50,000,000 records (100 GB) in MongoDB. I read it into a DataFrame:
val readConf1 = ReadConfig(Map("uri" -> "mongodb://127.0.0.1/product-repository.detail?readPreference=primaryPreferred"))
val dff = MongoSpark.load(sc, readConf1).toDF()
val setUrl=dff.select(collect_list("DetailUrl")).first().getList[String](0).toSet
This is my Spark conf:
spark.executor.memory 4G
spark.executor.instances 12
spark.executor.cores 4
spark.executor.extraJavaOptions -Xms3600m
It runs normally with small data, below 1,000,000 records. How can I solve this error?
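Not an authoritative fix, but one sketch of a less driver-hungry variant of the same operation (assuming DetailUrl is a string column and that the number of distinct URLs fits in driver memory): deduplicate on the executors first and collect only the distinct values, since collect_list(...).first() builds the entire list inside the driver JVM.
import org.apache.spark.sql.functions.col
// Deduplicate in a distributed way, then bring only distinct values to the driver.
val setUrl = dff.select(col("DetailUrl"))
  .distinct()
  .collect()
  .map(_.getString(0))
  .toSet
// The driver still has to hold the distinct set, so raise spark.driver.memory
// (e.g. 8G); spark.executor.memory does not help for a collect on the driver.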

Spark cannot query Hive tables it can see?

I'm running the prebuilt version of Spark 1.2 for CDH 4 on CentOS. I have copied the hive-site.xml file into the conf directory in Spark so it should see the Hive metastore.
I have three tables in Hive (facility, newpercentile, percentile), all of which I can query from the Hive CLI. After I log into Spark and create the Hive Context like so: val hiveC = new org.apache.spark.sql.hive.HiveContext(sc) I am running into an issue querying these tables.
If I run the following command: val tableList = hiveC.hql("show tables") and do a collect() on tableList, I get this result: res0: Array[org.apache.spark.sql.Row] = Array([facility], [newpercentile], [percentile])
If I then run this command to get the count of the facility table: val facTable = hiveC.hql("select count(*) from facility"), I get the following output, which I take to mean that it cannot find the facility table to query it:
scala> val facTable = hiveC.hql("select count(*) from facility")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
14/12/26 10:27:26 WARN HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
14/12/26 10:27:26 INFO ParseDriver: Parsing command: select count(*) from facility
14/12/26 10:27:26 INFO ParseDriver: Parse Completed
14/12/26 10:27:26 INFO MemoryStore: ensureFreeSpace(355177) called with curMem=0, maxMem=277842493
14/12/26 10:27:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 346.9 KB, free 264.6 MB)
14/12/26 10:27:26 INFO MemoryStore: ensureFreeSpace(50689) called with curMem=355177, maxMem=277842493
14/12/26 10:27:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 49.5 KB, free 264.6 MB)
14/12/26 10:27:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.2.15:45305 (size: 49.5 KB, free: 264.9 MB)
14/12/26 10:27:26 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
14/12/26 10:27:26 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:68
facTable: org.apache.spark.sql.SchemaRDD =
SchemaRDD[2] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Aggregate false, [], [Coalesce(SUM(PartialCount#38L),0) AS _c0#5L]
Exchange SinglePartition
Aggregate true, [], [COUNT(1) AS PartialCount#38L]
HiveTableScan [], (MetastoreRelation default, facility, None), None
Any assistance would be appreciated. Thanks.
scala> val facTable = hiveC.hql("select count(*) from facility")
Great! You have an RDD, now what do you want to do with it?
scala> facTable.collect()
Remember that an RDD is an abstraction on top of your data and is not materialized until you invoke an action on it such as collect() or count().
You would get a very obvious error if you tried to use a non-existent table name.
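As a minimal sketch of that point (using the deprecated Spark 1.2-era hql API from the question): defining the query is lazy, and only the action actually runs the Hive scan.
val hiveC = new org.apache.spark.sql.hive.HiveContext(sc)
val facTable = hiveC.hql("select count(*) from facility")  // lazy: just builds a SchemaRDD
val rows = facTable.collect()                              // action: the Hive table scan runs here
println(rows.head.getLong(0))                              // the row count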

db2 error in dataview SQL0956C

I am receiving this error on DB2 9.7.8
Database handling error - 4003: Database error in dataview MATCHTAB: SQL0956C Not enough storage is available in the database heap to process the statement.
I have already increased the Heap size as recommended here:
http://www-01.ibm.com/support/knowledgecenter/SSEPGG_9.5.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00956c.html?lang=en
but it hasn't resolved the error.
I increased the heap size from 6000 to 8000.
Is there any way to determine the appropriate heap size? I do not want to increase it arbitrarily.
Thanks,
Joe
db2diag log, at mustaccio's suggestion:
DATA #1 : <preformatted>
Out of memory failure for Database Heap (DBHEAP) on node 0.
Requested block size : 8208 bytes.
Physical heap size : 48955392 bytes.
Configured heap size : 49872896 bytes.
Unreserved memory used by heap : 0 bytes.
Unreserved memory left in set : 61014016 bytes.
2014-01-20-15.16.30.573000+060 I90041446H637 LEVEL: Error
PID : 2504 TID : 2488 PROC : db2syscs.exe
INSTANCE: DB2 NODE : 000 DB : Database1
APPHDL : 0-4062 APPID: *LOCAL.DB2.140120140000
AUTHID : CORONA
EDUID : 2488 EDUNAME: db2agent (CORONA) 0
FUNCTION: DB2 UDB, access plan manager, sqlra_add_pkg_id_to_ejected_list, probe:252
RETCODE : ZRC=0x8B0F0002=-1961951230=SQLO_NOMEM_DBH
"No memory available in 'Database Heap'"
DIA8302C No memory available in the database heap.
DATA #1 : signed integer, 8 bytes
7360