I am seeking suggestions on how to expose RDDs from a Spark SQL physical plan (with Spark-3.2.1). Here let's take TPCH Query 1 as a concrete example.
This is the physical plan generated by sql(query).explain():
== Physical Plan ==
*(3) Sort [l_returnflag#17 ASC NULLS FIRST, l_linestatus#18 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(l_returnflag#17 ASC NULLS FIRST, l_linestatus#18 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#97]
+- *(2) HashAggregate(keys=[l_returnflag#17, l_linestatus#18], functions=[sum(l_quantity#13), sum(l_extendedprice#14), sum(CheckOverflow((promote_precision(cast(l_extendedprice#14 as decimal(13,2))) * promote_precision(CheckOverflow((1.00 - promote_precision(cast(l_discount#15 as decimal(13,2)))), DecimalType(13,2), true))), DecimalType(26,4), true)), sum(CheckOverflow((promote_precision(CheckOverflow((promote_precision(cast(l_extendedprice#14 as decimal(13,2))) * promote_precision(CheckOverflow((1.00 - promote_precision(cast(l_discount#15 as decimal(13,2)))), DecimalType(13,2), true))), DecimalType(26,4), true)) * promote_precision(cast(CheckOverflow((1.00 + promote_precision(cast(l_tax#16 as decimal(13,2)))), DecimalType(13,2), true) as decimal(26,4)))), DecimalType(38,6), true)), avg(l_quantity#13), avg(l_extendedprice#14), avg(l_discount#15), count(1)], output=[l_returnflag#17, l_linestatus#18, sum_qty#114, sum_base_price#115, sum_disc_price#116, sum_charge#117, avg_qty#118, avg_price#119, avg_disc#120, count_order#121L])
+- Exchange hashpartitioning(l_returnflag#17, l_linestatus#18, 200), ENSURE_REQUIREMENTS, [id=#93]
+- *(1) HashAggregate(keys=[l_returnflag#17, l_linestatus#18], functions=[partial_sum(l_quantity#13), partial_sum(l_extendedprice#14), partial_sum(CheckOverflow((promote_precision(cast(l_extendedprice#14 as decimal(13,2))) * promote_precision(CheckOverflow((1.00 - promote_precision(cast(l_discount#15 as decimal(13,2)))), DecimalType(13,2), true))), DecimalType(26,4), true)), partial_sum(CheckOverflow((promote_precision(CheckOverflow((promote_precision(cast(l_extendedprice#14 as decimal(13,2))) * promote_precision(CheckOverflow((1.00 - promote_precision(cast(l_discount#15 as decimal(13,2)))), DecimalType(13,2), true))), DecimalType(26,4), true)) * promote_precision(cast(CheckOverflow((1.00 + promote_precision(cast(l_tax#16 as decimal(13,2)))), DecimalType(13,2), true) as decimal(26,4)))), DecimalType(38,6), true)), partial_avg(l_quantity#13), partial_avg(l_extendedprice#14), partial_avg(l_discount#15), partial_count(1)], output=[l_returnflag#17, l_linestatus#18, sum#155, isEmpty#156, sum#157, isEmpty#158, sum#159, isEmpty#160, sum#161, isEmpty#162, sum#163, count#164L, sum#165, count#166L, sum#167, count#168L, count#169L])
+- *(1) Project [l_quantity#13, l_extendedprice#14, l_discount#15, l_tax#16, l_returnflag#17, l_linestatus#18]
+- *(1) ColumnarToRow
+- FileScan parquet tpch_100.lineitem[l_quantity#13,l_extendedprice#14,l_discount#15,l_tax#16,l_returnflag#17,l_linestatus#18,l_shipdate#24] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(2436 paths)[hdfs://node13-opa:8020/user/spark_benchmark/tpch_100/dataset/lineit..., PartitionFilters: [isnotnull(l_shipdate#24), (l_shipdate#24 <= 1998-09-02)], PushedFilters: [], ReadSchema: struct<l_quantity:decimal(12,2),l_extendedprice:decimal(12,2),l_discount:decimal(12,2),l_tax:deci...
These are the RDDs generated by the plan, as shown by sql(query).rdd.toDebugString:
(4) MapPartitionsRDD[17] at rdd at <console>:24 []
| SQLExecutionRDD[16] at rdd at <console>:24 []
| MapPartitionsRDD[15] at rdd at <console>:24 []
| MapPartitionsRDD[14] at rdd at <console>:24 []
| ShuffledRowRDD[13] at rdd at <console>:24 []
+-(200) MapPartitionsRDD[12] at rdd at <console>:24 []
| MapPartitionsRDD[8] at rdd at <console>:24 []
| ShuffledRowRDD[7] at rdd at <console>:24 []
+-(265) MapPartitionsRDD[6] at rdd at <console>:24 []
| MapPartitionsRDD[5] at rdd at <console>:24 []
| MapPartitionsRDD[4] at rdd at <console>:24 []
| FileScanRDD[3] at rdd at <console>:24 []
I know the RDD generation process is to 1) create the input RDD (e.g. FileScanRDD[3]) from the input data, and 2) apply a transformation to the input RDD for each physical operator.
My question is: can this internal "transformation" code be exposed outside of Spark SQL, and if so, how?
For example, suppose a user creates the initial RDD, like FileScanRDD[3], from the input data. Can the user (rather than Spark SQL internally) apply the same "transformation" to FileScanRDD[3] to produce MapPartitionsRDD[4], and likewise for the rest of the RDDs?
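So far the closest thing I have found is to poke at Spark's internal query-execution objects from the driver. Below is a minimal sketch from PySpark via the py4j bridge; queryExecution, executedPlan and execute() are internal/developer APIs that may change between versions (the Scala equivalent would be sql(query).queryExecution.executedPlan.execute()):
# Sketch only: internal Spark APIs reached through PySpark's py4j bridge.
df = spark.sql(query)                       # `query` holds TPCH Query 1
jqe = df._jdf.queryExecution()              # JVM QueryExecution of this Dataset
jplan = jqe.executedPlan()                  # JVM SparkPlan (the plan printed above)
jrdd = jplan.execute()                      # RDD[InternalRow] built by the root operator
print(jrdd.toDebugString())                 # lineage comparable to sql(query).rdd above
# With AQE off (as in the plan above), child operators can be walked the same way:
print(jplan.children().apply(0).execute().toDebugString())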
Appreciate your help in advance!
Related
Question: When joining two datasets, why is the isnotnull filter applied twice on the joining key column? In the physical plan, it is applied once as a PushedFilter and then applied again explicitly right after it. Why is that so?
code:
import os
import pandas as pd, numpy as np
import pyspark
spark=pyspark.sql.SparkSession.builder.getOrCreate()
save_loc = "gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/"
df1 = spark.createDataFrame(pd.DataFrame({'a': np.random.choice([1, 2, None], size=1000, p=[0.47, 0.48, 0.05]),
                                          'b': np.random.random(1000)}))
df2 = spark.createDataFrame(pd.DataFrame({'a': np.random.choice([1, 2, None], size=1000, p=[0.47, 0.48, 0.05]),
                                          'b': np.random.random(1000)}))
df1.write.parquet(os.path.join(save_loc,"dfl_key_int"))
df2.write.parquet(os.path.join(save_loc,"dfr_key_int"))
dfl_int = spark.read.parquet(os.path.join(save_loc,"dfl_key_int"))
dfr_int = spark.read.parquet(os.path.join(save_loc,"dfr_key_int"))
dfl_int.join(dfr_int,on='a',how='inner').explain()
output:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [a#23L, b#24, b#28]
+- BroadcastHashJoin [a#23L], [a#27L], Inner, BuildRight, false
:- Filter isnotnull(a#23L)
: +- FileScan parquet [a#23L,b#24] Batched: true, DataFilters: [isnotnull(a#23L)], Format: Parquet, Location: InMemoryFileIndex[gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/dfl_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,b:double>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75]
+- Filter isnotnull(a#27L)
+- FileScan parquet [a#27L,b#28] Batched: true, DataFilters: [isnotnull(a#27L)], Format: Parquet, Location: InMemoryFileIndex[gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/dfr_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,b:double>
The reason is that a PushedFilter does not guarantee that all the data is filtered as requested before it is read into memory by Spark. For more context on what a PushedFilter is, check out this SO answer.
Parquet files
Let's have a look at Parquet files like in your example. Parquet files are stored in a columnar format, and they are also organized in Row Groups (or chunks). The following picture comes from the Apache Parquet docs:
You see that the data is stored in a columnar fashion and chopped up into chunks (row groups). Now, for each column/row-group combination, Parquet stores some metadata. In that picture, you can see that it contains a bunch of metadata and also extra key/value pairs. These also contain statistics about your data (depending on the type of your column).
Some examples of these statistics are:
what the min/max value of the chunk is (in case it makes sense for the data type of the column)
whether the chunk has non-null values
...
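You can inspect these per-row-group statistics yourself. A minimal sketch with pyarrow, assuming pyarrow is installed and that 'part-0000.parquet' is a placeholder path to one of the Parquet files written above:
import pyarrow.parquet as pq

meta = pq.ParquetFile("part-0000.parquet").metadata    # placeholder path
for rg in range(meta.num_row_groups):
    stats = meta.row_group(rg).column(0).statistics    # column 0 would be 'a' here
    if stats is not None:                              # writers may omit statistics
        print(rg, stats.min, stats.max, stats.null_count)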
Back to your example
You are joining on the a column. To be able to do that we need to be sure that a has no null values. Let's imagine that your a column (disregarding the other columns) is stored like this:
a column:
chunk 1: 0, 1, None, 1, 1, None
chunk 2: 0, 0, 0, 0, 0, 0
chunk 3: None, None, None, None, None, None
Now, using the PushedFilter we can immediately (just by looking at the metadata of the chunks) disregard chunk 3; we don't even have to read it in!
But as you can see, chunk 1 still contains null values. This is something we can't filter out by looking only at the chunk's metadata. So we'll have to read in that whole chunk and then filter out the remaining null values afterwards within Spark, using that second Filter node in your physical plan.
I have two columns in a data frame df in PySpark:
+----------+----------+
| features |  center  |
+----------+----------+
| [0,1,0]  | [1.5,2,1]|
| [5,7,6]  | [10,7,7] |
+----------+----------+
I want to create a function which calculates the Euclidean distance between df['features'] and df['center'] and map it to a new column in df, distance.
Let's say our function looks like the following:
from pyspark.sql.functions import udf
import numpy as np

@udf
def dist(feat, cent):
    return np.linalg.norm(feat - cent)
How would I actually apply this to do what I want it to do? I was trying things like
df.withColumn("distance", dist(col("features"), col("center"))).show()
but that gives me the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 869.0 failed 4 times, most recent failure: Lost task 0.3 in stage 869.0 (TID 26423) (10.50.91.134 executor 35): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
I am really struggling with understanding how to do basic Python mappings in a Spark context, so I really appreciate any help.
You have truly chosen a difficult topic. In Spark, 95%+ of things can be done without Python UDFs, and you should always try to find a way not to create one.
I've attempted your UDF and got the same error, and I cannot really tell why. I think it is something with data types, as you pass a Spark array into a function which expects numpy data types. I really can't tell much more...
Euclidean distance, however, can be calculated in Spark. It's not an easy one, though.
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [([0, 1, 0], [1.5, 2., 1.]),
     ([5, 7, 6], [10., 7., 7.])],
    ['features', 'center'])

distance = F.aggregate(
    F.transform(
        F.arrays_zip('features', 'center'),
        lambda x: (x['features'] - x['center'])**2
    ),
    F.lit(0.0),
    lambda acc, x: acc + x,
    lambda x: x**.5
)
df = df.withColumn('distance', distance)
df.show()
# +---------+----------------+------------------+
# | features| center| distance|
# +---------+----------------+------------------+
# |[0, 1, 0]| [1.5, 2.0, 1.0]|2.0615528128088303|
# |[5, 7, 6]|[10.0, 7.0, 7.0]|5.0990195135927845|
# +---------+----------------+------------------+
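As an aside, if you do want to keep a UDF: this particular PickleException usually comes from returning a numpy scalar (numpy.float64) to Spark. A hedged sketch of a workaround, not verified against your data, converts the inputs to numpy arrays and returns a plain Python float with an explicit return type:
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(returnType=DoubleType())
def dist(feat, cent):
    # Build numpy arrays from the Spark arrays and hand back a plain float,
    # not a numpy.float64, so it can be pickled back to the JVM.
    return float(np.linalg.norm(np.array(feat, dtype=float) - np.array(cent, dtype=float)))

df.withColumn('distance', dist('features', 'center')).show()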
from typing import Iterator
import pandas as pd
from sklearn.metrics.pairwise import paired_distances
from pyspark.sql.functions import lit
Alter df's schema to accommodate the dist column:
sch = df.withColumn('dist', lit(90.087654623)).schema
Create a pandas UDF (used via mapInPandas) that calculates the distance:
def euclidean_dist(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        yield pdf.assign(dist=paired_distances(pdf['features'].to_list(), pdf['center'].to_list()))
df.mapInPandas(euclidean_dist, schema=sch).show()
Solution
+---------+----------------+------------------+
| features| center| dist|
+---------+----------------+------------------+
|[0, 1, 0]| [1.5, 2.0, 1.0]|2.0615528128088303|
|[5, 7, 6]|[10.0, 7.0, 7.0]|5.0990195135927845|
+---------+----------------+------------------+
You can calculate the distance using only PySpark and spark sql APIs:
import pyspark.sql.functions as f
df = (
    df
    .withColumn('distance', f.sqrt(f.expr('aggregate(transform(features, (element, idx) -> pow(element - element_at(center, idx + 1), 2)), cast(0 as double), (acc, val) -> acc + val)')))
)
These two achieve almost the same result (the only difference is the order of the rows):
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
def windowSpec = Window.partitionBy("key")
val df1 = Seq((3, "A", 5), (1, "A", 2), (3, "A", 5), (3, "B", 13)).toDF("key", "Categ1", "value")
df1.withColumn("avg", avg("value").over(windowSpec)).show
+---+------+-----+-----------------+
|key|Categ1|value| avg|
+---+------+-----+-----------------+
| 1| A| 2| 2.0|
| 3| A| 5|7.666666666666667|
| 3| A| 5|7.666666666666667|
| 3| B| 13|7.666666666666667|
+---+------+-----+-----------------+
and
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df1 = Seq((3, "A", 5), (1, "A", 2), (3, "A", 5), (3, "B", 13)).toDF("key", "Categ1", "value")
val df2 = df1.groupBy("key").agg(avg("value") as "avg")
df1.join(df2, Seq("key")).show
+---+------+-----+-----------------+
|key|Categ1|value| avg|
+---+------+-----+-----------------+
| 3| A| 5|7.666666666666667|
| 1| A| 2| 2.0|
| 3| A| 5|7.666666666666667|
| 3| B| 13|7.666666666666667|
+---+------+-----+-----------------+
At first glance, I thought the first one should be faster because it avoids the join, but in my experience the conclusion sometimes turns out to be different.
Is there additional overhead to the window function? Is there a complexity difference between the window function and groupBy? Thanks.
Side question
Is it impossible to ensure type safety when using window functions?
Related: How to use Window aggregates on strongly typed Spark Datasets?
You can take a look at the window solution's physical plan with the explain method:
== Physical Plan ==
Window [avg(cast(value#9 as bigint)) windowspecdefinition(key#7, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS avg#14], [key#7]
+- *(1) Sort [key#7 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(key#7, 200)
+- LocalTableScan [key#7, Categ1#8, value#9]
It will load the data and shuffle it so that all the rows for a given key end up in the same partition, then calculate the average.
For the second solution:
== Physical Plan ==
*(3) Project [key#7, Categ1#8, value#9, avg#24]
+- *(3) BroadcastHashJoin [key#7], [key#27], Inner, BuildRight
:- LocalTableScan [key#7, Categ1#8, value#9]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- *(2) HashAggregate(keys=[key#27], functions=[avg(cast(value#29 as bigint))], output=[key#27, avg#24])
+- Exchange hashpartitioning(key#27, 200)
+- *(1) HashAggregate(keys=[key#27], functions=[partial_avg(cast(value#29 as bigint))], output=[key#27, sum#37, count#38L])
+- LocalTableScan [key#27, value#29]
The execution plan is way more complex: it loads the data and shuffles it so that all the rows for a given key end up in the same partition. Your example data is very small, so my Spark uses a broadcast join instead of a sort-merge join or shuffled hash join, but it can cause another shuffle and then the join.
It will also load your df1 twice because your data is not persisted, and the first solution seems easier to understand.
The first solution is better.
Leaving aside that 1) no explicit setting of the number of 'shuffle partitions' is evident, 2) whether AQE is set to true or false is not stated, and 3) there is only a small amount of data;
then noting that there is no caching (which should, imho, not be necessary);
and noting that the second case has an element of 'self-join', for which 'reuse exchange' should be used according to https://issues.apache.org/jira/browse/SPARK-2183 so as to avoid re-reading from rest - but this does not occur:
For approach 1 I ran:
val dfA = spark.table("ZZZ")
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
def windowSpec = Window.partitionBy("number2")
dfA.withColumn("avg", avg("number").over(windowSpec)).explain
shows:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Window [number#73, lit#74, number2#75, avg(_w0#85L) windowspecdefinition(number2#75, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS avg#80], [number2#75]
+- Sort [number2#75 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(number2#75, 200), true, [id=#141]
+- Project [number#73, lit#74, number2#75, cast(number#73 as bigint) AS _w0#85L]
+- FileScan parquet default.zzz[number#73,lit#74,number2#75] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex[dbfs:/user/hive/warehouse/zzz], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<number:int,lit:string>
The sort is there so that, within each partition, rows with the same key value are read in sequence; the values can then be summed and divided by the number of rows for that key to get the avg whenever a new key value starts (and at the end). This is the quickest method, as it has no JOIN as in the 2nd option.
Then for approach 2 I ran:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold",-1)
spark.conf.set("spark.sql.adaptive.enabled",false)
val dfB = spark.table("ZZZ")
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val dfC = dfB.groupBy("number2").agg(avg("number") as "avg")
dfB.join(dfC, Seq("number2")).explain
shows:
== Physical Plan ==
*(4) Project [number2#75, number#73, lit#74, avg#500]
+- *(4) SortMergeJoin [number2#75], [number2#505], Inner
:- Sort [number2#75 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(number2#75, 200), true, [id=#1484]
: +- *(1) ColumnarToRow
: +- FileScan parquet default.zzz[number#73,lit#74,number2#75] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/zzz/number2=0, dbfs:/user/hive/warehouse/zzz/number2=..., PartitionFilters: [isnotnull(number2#75)], PushedFilters: [], ReadSchema: struct<number:int,lit:string>
+- *(3) Sort [number2#505 ASC NULLS FIRST], false, 0
+- *(3) HashAggregate(keys=[number2#505], functions=[finalmerge_avg(merge sum#512, count#513L) AS avg(cast(number#503 as bigint))#499])
+- Exchange hashpartitioning(number2#505, 200), true, [id=#1491]
+- *(2) HashAggregate(keys=[number2#505], functions=[partial_avg(cast(number#503 as bigint)) AS (sum#512, count#513L)])
+- *(2) ColumnarToRow
+- FileScan parquet default.zzz[number#503,number2#505] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/zzz/number2=0, dbfs:/user/hive/warehouse/zzz/number2=..., PartitionFilters: [isnotnull(number2#505)], PushedFilters: [], ReadSchema: struct<number:int>
This option involves a JOIN and two reads from data at rest. As indicated, a 'self-join' (and in some cases a Union, with some help) should use 'reuse exchange' according to the ticket above, but that is not the case here. We have an element of 'self-join' imho, but Catalyst sees it differently. This is a non-AQE approach, as you can see from the settings. It is far more complex.
Conclusion:
This type of query with AQE on can adapt itself to use a broadcast hash join, hence the disabling of AQE here.
The first option is the way to go, as it does not read from rest twice and does not need a JOIN.
Which of the following is the better way in PySpark?
Does the second query have any advantage/performance gain over the first query in PySpark (in cluster mode)?
#1) without using aggr
total_distance_df = spark.sql("SELECT sum(distance) FROM flights")\
.withColumnRenamed('sum(CAST(distance AS DOUBLE))', 'total_distance')
total_distance_df.show()
Vs
#2) using aggr
total_distance_df = spark.sql("SELECT distance FROM flights")\
.agg({"distance":"sum"})\
.withColumnRenamed("sum(distance)","total_distance")
total_distance_df.show()
Both are the same. Check the explain plans of the queries to see if there are any differences.
Example:
#sample df
df1.show()
+---+--------+
| id|distance|
+---+--------+
| a| 1|
| b| 2|
+---+--------+
df1.createOrReplaceTempView("tmp")
spark.sql("SELECT sum(distance) FROM tmp").withColumnRenamed('sum(CAST(distance AS DOUBLE))', 'total_distance').explain()
#== Physical Plan ==
#*(2) HashAggregate(keys=[], functions=[sum(distance#179L)])
#+- Exchange SinglePartition
# +- *(1) HashAggregate(keys=[], functions=[partial_sum(distance#179L)])
# +- *(1) Project [distance#179L]
# +- Scan ExistingRDD[id#178,distance#179L]
spark.sql("SELECT distance FROM tmp").agg({"distance":"sum"}).explain()
#== Physical Plan ==
#*(2) HashAggregate(keys=[], functions=[sum(distance#179L)])
#+- Exchange SinglePartition
# +- *(1) HashAggregate(keys=[], functions=[partial_sum(distance#179L)])
# +- *(1) Project [distance#179L]
# +- Scan ExistingRDD[id#178,distance#179L]
As you can see, the plans are identical for both the SQL sum and the agg version.
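As an aside, a third equivalent way that avoids renaming the auto-generated column is to alias the aggregate directly (a sketch, assuming the same tmp temp view as above):
from pyspark.sql import functions as F

total_distance_df = spark.table("tmp").agg(F.sum("distance").alias("total_distance"))
total_distance_df.explain()   # should show the same HashAggregate plan as above
total_distance_df.show()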
I need some help with this Apache Spark (PySpark) issue.
I have a DataFrame (df1) which has a single column and a single row; it contains max_timestamp:
+-------------------+
|max_timestamp      |
+-------------------+
|2019-10-24 21:18:26|
+-------------------+
I have another DataFrame, which contains 2 columns - EmpId & Timestamp:
from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

masterData = [(1, '1999-10-24 21:18:23',), (1, '2019-10-24 21:18:26',), (2, '2020-01-24 21:18:26',)]
df_masterdata = spark.createDataFrame(masterData, ['dsid', 'txnTime_str'])
df_masterdata = df_masterdata.withColumn('txnTime_ts', col('txnTime_str').cast(TimestampType())).drop('txnTime_str')
df_masterdata.show(5, False)
+----+-------------------+
|dsid|txnTime_ts         |
+----+-------------------+
|1   |1999-10-24 21:18:23|
|1   |2019-10-24 21:18:26|
|2   |2020-01-24 21:18:26|
+----+-------------------+
The objective is to filter the records in the 2nd DataFrame based on the condition txnTime_ts < max_timestamp.
What I'm trying to do: add the column 'max_timestamp' to the 2nd DataFrame and filter the records by comparing the two values.
df_masterdata1 = df_masterdata.withColumn('maxTime', maxTS2['TEMP_MAX'])
PySpark does not let me add the column from maxTS2 to the DataFrame df_masterdata.
Error -
AnalysisException: 'Resolved attribute(s) TEMP_MAX#207255 missing from dsid#207263L,txnTime_ts#207267 in operator
!Project [dsid#207263L, txnTime_ts#207267, TEMP_MAX#207255 AS maxTime#207280].;;\n!Project [dsid#207263L,
txnTime_ts#207267, TEMP_MAX#207255 AS maxTime#207280]\n+- Project [dsid#207263L, txnTime_ts#207267]\n +- Project
[dsid#207263L, txnTime_str#207264, cast(txnTime_str#207264 as timestamp) AS txnTime_ts#207267]\n +- LogicalRDD
[dsid#207263L, txnTime_str#207264], false\n'
Any ideas on how to resolve this issue?
If you actually have a DF with a single row/column, the most efficient way to accomplish this would be to extract the value from the DataFrame and then filter df_masterdata against it. If you nevertheless need to do this within the context of a DataFrame, you should use a join, e.g.:
df_masterdata1 = df_masterdata.join(df1, df_masterdata.txnTime_ts <= df1.max_timestamp)
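For completeness, a minimal sketch of the scalar-extraction approach mentioned above (assuming df1 really has exactly one row and one column):
from pyspark.sql.functions import col, lit

max_ts = df1.first()['max_timestamp']                                      # pull the single value onto the driver
df_masterdata1 = df_masterdata.filter(col('txnTime_ts') <= lit(max_ts))    # no join needed
df_masterdata1.show(5, False)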