pyspark sql sum vs aggr - apache-spark-sql

Which of the following is the better way in PySpark?
Does the second query have any advantage/performance gain over the first query in PySpark (in cluster mode)?
#1) without using agg()
total_distance_df = spark.sql("SELECT sum(distance) FROM flights") \
    .withColumnRenamed('sum(CAST(distance AS DOUBLE))', 'total_distance')
total_distance_df.show()
Vs
#2) using agg()
total_distance_df = spark.sql("SELECT distance FROM flights") \
    .agg({"distance": "sum"}) \
    .withColumnRenamed("sum(distance)", "total_distance")
total_distance_df.show()

Both are the same. Check the explain plans of the queries to see that there is no difference.
Example:
#sample df
df1.show()
+---+--------+
| id|distance|
+---+--------+
|  a|       1|
|  b|       2|
+---+--------+
df1.createOrReplaceTempView("tmp")
spark.sql("SELECT sum(distance) FROM tmp").withColumnRenamed('sum(CAST(distance AS DOUBLE))', 'total_distance').explain()
#== Physical Plan ==
#*(2) HashAggregate(keys=[], functions=[sum(distance#179L)])
#+- Exchange SinglePartition
#   +- *(1) HashAggregate(keys=[], functions=[partial_sum(distance#179L)])
#      +- *(1) Project [distance#179L]
#         +- Scan ExistingRDD[id#178,distance#179L]
spark.sql("SELECT distance FROM tmp").agg({"distance":"sum"}).explain()
#== Physical Plan ==
#*(2) HashAggregate(keys=[], functions=[sum(distance#179L)])
#+- Exchange SinglePartition
#   +- *(1) HashAggregate(keys=[], functions=[partial_sum(distance#179L)])
#      +- *(1) Project [distance#179L]
#         +- Scan ExistingRDD[id#178,distance#179L]
As you can see, the plans are identical for both the SQL sum and the agg() version.
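As a side note, both withColumnRenamed calls can be avoided by aliasing the aggregate up front; a minimal sketch, assuming the same flights view as in the question:
from pyspark.sql import functions as F

# SQL version: alias the aggregate in the query itself
total_distance_df = spark.sql("SELECT sum(distance) AS total_distance FROM flights")

# DataFrame API version: pass an aliased expression to agg() instead of the dict form
total_distance_df = spark.sql("SELECT distance FROM flights") \
    .agg(F.sum("distance").alias("total_distance"))

total_distance_df.show()
Either way the physical plan stays the same; only the column naming changes.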

Related

How to expose RDDs from a Spark SQL plan

I am seeking suggestions on how to expose RDDs from a Spark SQL physical plan (with Spark 3.2.1). Let's take TPC-H Query 1 as a concrete example.
This is the physical plan generated by sql(query).explain():
== Physical Plan ==
*(3) Sort [l_returnflag#17 ASC NULLS FIRST, l_linestatus#18 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(l_returnflag#17 ASC NULLS FIRST, l_linestatus#18 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#97]
+- *(2) HashAggregate(keys=[l_returnflag#17, l_linestatus#18], functions=[sum(l_quantity#13), sum(l_extendedprice#14), sum(CheckOverflow((promote_precision(cast(l_extendedprice#14 as decimal(13,2))) * promote_precision(CheckOverflow((1.00 - promote_precision(cast(l_discount#15 as decimal(13,2)))), DecimalType(13,2), true))), DecimalType(26,4), true)), sum(CheckOverflow((promote_precision(CheckOverflow((promote_precision(cast(l_extendedprice#14 as decimal(13,2))) * promote_precision(CheckOverflow((1.00 - promote_precision(cast(l_discount#15 as decimal(13,2)))), DecimalType(13,2), true))), DecimalType(26,4), true)) * promote_precision(cast(CheckOverflow((1.00 + promote_precision(cast(l_tax#16 as decimal(13,2)))), DecimalType(13,2), true) as decimal(26,4)))), DecimalType(38,6), true)), avg(l_quantity#13), avg(l_extendedprice#14), avg(l_discount#15), count(1)], output=[l_returnflag#17, l_linestatus#18, sum_qty#114, sum_base_price#115, sum_disc_price#116, sum_charge#117, avg_qty#118, avg_price#119, avg_disc#120, count_order#121L])
+- Exchange hashpartitioning(l_returnflag#17, l_linestatus#18, 200), ENSURE_REQUIREMENTS, [id=#93]
+- *(1) HashAggregate(keys=[l_returnflag#17, l_linestatus#18], functions=[partial_sum(l_quantity#13), partial_sum(l_extendedprice#14), partial_sum(CheckOverflow((promote_precision(cast(l_extendedprice#14 as decimal(13,2))) * promote_precision(CheckOverflow((1.00 - promote_precision(cast(l_discount#15 as decimal(13,2)))), DecimalType(13,2), true))), DecimalType(26,4), true)), partial_sum(CheckOverflow((promote_precision(CheckOverflow((promote_precision(cast(l_extendedprice#14 as decimal(13,2))) * promote_precision(CheckOverflow((1.00 - promote_precision(cast(l_discount#15 as decimal(13,2)))), DecimalType(13,2), true))), DecimalType(26,4), true)) * promote_precision(cast(CheckOverflow((1.00 + promote_precision(cast(l_tax#16 as decimal(13,2)))), DecimalType(13,2), true) as decimal(26,4)))), DecimalType(38,6), true)), partial_avg(l_quantity#13), partial_avg(l_extendedprice#14), partial_avg(l_discount#15), partial_count(1)], output=[l_returnflag#17, l_linestatus#18, sum#155, isEmpty#156, sum#157, isEmpty#158, sum#159, isEmpty#160, sum#161, isEmpty#162, sum#163, count#164L, sum#165, count#166L, sum#167, count#168L, count#169L])
+- *(1) Project [l_quantity#13, l_extendedprice#14, l_discount#15, l_tax#16, l_returnflag#17, l_linestatus#18]
+- *(1) ColumnarToRow
+- FileScan parquet tpch_100.lineitem[l_quantity#13,l_extendedprice#14,l_discount#15,l_tax#16,l_returnflag#17,l_linestatus#18,l_shipdate#24] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(2436 paths)[hdfs://node13-opa:8020/user/spark_benchmark/tpch_100/dataset/lineit..., PartitionFilters: [isnotnull(l_shipdate#24), (l_shipdate#24 <= 1998-09-02)], PushedFilters: [], ReadSchema: struct<l_quantity:decimal(12,2),l_extendedprice:decimal(12,2),l_discount:decimal(12,2),l_tax:deci...
These are the RDDs generated by sql(query).rdd.toDebugString:
(4) MapPartitionsRDD[17] at rdd at <console>:24 []
 |  SQLExecutionRDD[16] at rdd at <console>:24 []
 |  MapPartitionsRDD[15] at rdd at <console>:24 []
 |  MapPartitionsRDD[14] at rdd at <console>:24 []
 |  ShuffledRowRDD[13] at rdd at <console>:24 []
 +-(200) MapPartitionsRDD[12] at rdd at <console>:24 []
    |  MapPartitionsRDD[8] at rdd at <console>:24 []
    |  ShuffledRowRDD[7] at rdd at <console>:24 []
    +-(265) MapPartitionsRDD[6] at rdd at <console>:24 []
       |  MapPartitionsRDD[5] at rdd at <console>:24 []
       |  MapPartitionsRDD[4] at rdd at <console>:24 []
       |  FileScanRDD[3] at rdd at <console>:24 []
I know the RDD generation process is to 1) create the input RDD (e.g. FileScanRDD[3]) from the input data, and 2) apply transformations to that RDD for each physical operator.
My question is: can the internal "transformation" code be exposed outside of Spark SQL, and if so, how?
For example, given that a user can create the initial RDD, like FileScanRDD[3], from the input data, can the user (rather than Spark SQL internally) apply the same "transformation" to FileScanRDD[3] to produce MapPartitionsRDD[4], and so on for the rest of the RDDs?
Appreciate your help in advance!

Equivalent of `takeWhile` for Spark dataframe

I have a dataframe looking like this:
scala> val df = Seq((1,.5), (2,.3), (3,.9), (4,.0), (5,.6), (6,.0)).toDF("id", "x")
scala> df.show()
+---+---+
| id|  x|
+---+---+
|  1|0.5|
|  2|0.3|
|  3|0.9|
|  4|0.0|
|  5|0.6|
|  6|0.0|
+---+---+
I would like to take the first rows of the data as long as the x column is nonzero (note that the dataframe is sorted by id, so talking about the first rows is meaningful). For this given dataframe, it would give something like this:
+---+---+
| id|  x|
+---+---+
|  1|0.5|
|  2|0.3|
|  3|0.9|
+---+---+
I only kept the first 3 rows, as the 4th row was zero.
For a simple Seq, I can do something like Seq(0.5, 0.3, 0.9, 0.0, 0.6, 0.0).takeWhile(_ != 0.0). So for my dataframe I thought of something like this:
df.takeWhile('x =!= 0.0)
But unfortunately, the takeWhile method is not available for dataframes.
I know that I can transform my dataframe to a Seq to solve my problem, but I would like to avoid gathering all the data to the driver as it will likely crash it.
The take and limit methods allow getting the first n rows of a dataframe, but I can't specify a predicate. Is there a simple way to do this?
Can you guarantee that IDs will be in ascending order? New data is not necessarily guaranteed to be added in a specific order. If you can guarantee the order, then you can use this query to achieve what you want. It's not going to perform well on large data sets, but it may be the only way to achieve what you are interested in.
We'll mark all 0's as '1' and everything else as '0'. We'll then compute a rolling total over the entire data set. As the running total only increases in value at a zero, it partitions the dataset into the sections of rows that fall between zeros.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum, when}
val windowSpec = Window.partitionBy().orderBy("id")
df.select(
    col("id"),
    col("x"),
    sum( // creates a running total which will be 0 for the first section --> all numbers before the first 0
      when(col("x") === lit(0), lit(1)).otherwise(lit(0)) // mark 0's to help partition the data set
    ).over(windowSpec).as("partition")
  )
  .where(col("partition") === lit(0))
  .show()
+---+---+---------+
| id|  x|partition|
+---+---+---------+
|  1|0.5|        0|
|  2|0.3|        0|
|  3|0.9|        0|
+---+---+---------+
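For reference, the same running-total idea in PySpark; a rough, untested sketch assuming a df with the same id and x columns:
from pyspark.sql import Window
from pyspark.sql import functions as F

# unpartitioned window ordered by id: fine for illustration, but it pulls
# all rows through a single partition, so it will not scale well either
w = Window.partitionBy().orderBy("id")

result = (
    df.withColumn("partition", F.sum(F.when(F.col("x") == 0, 1).otherwise(0)).over(w))
      .where(F.col("partition") == 0)
      .drop("partition")  # drop the helper column once the filter is applied
)
result.show()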

Performance comparison between groupBy + join vs window func Spark

These two achieve almost the same result (the only difference is the order of the rows):
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
def windowSpec = Window.partitionBy("key")
val df1 = Seq((3, "A", 5), (1, "A", 2), (3, "A", 5), (3, "B", 13)).toDF("key", "Categ1", "value")
df1.withColumn("avg", avg("value").over(windowSpec)).show
+---+------+-----+-----------------+
|key|Categ1|value| avg|
+---+------+-----+-----------------+
|  1|     A|    2|              2.0|
|  3|     A|    5|7.666666666666667|
|  3|     A|    5|7.666666666666667|
|  3|     B|   13|7.666666666666667|
+---+------+-----+-----------------+
and
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df1 = Seq((3, "A", 5), (1, "A", 2), (3, "A", 5), (3, "B", 13)).toDF("key", "Categ1", "value")
val df2 = df1.groupBy("key").agg(avg("value") as "avg")
df1.join(df2, Seq("key")).show
+---+------+-----+-----------------+
|key|Categ1|value| avg|
+---+------+-----+-----------------+
|  3|     A|    5|7.666666666666667|
|  1|     A|    2|              2.0|
|  3|     A|    5|7.666666666666667|
|  3|     B|   13|7.666666666666667|
+---+------+-----+-----------------+
At first glance, I thought the first one should be faster because it avoids the join, but in my experience it sometimes leads to a different conclusion.
Is there additional overhead for the window function? Is there a complexity difference between the window function and groupBy? Thanks.
Side question
Is it impossible to ensure type safety when using a window function?
Related: How to use Window aggregates on strongly typed Spark Datasets?
You can take a look at the window solution's physical plan with the explain method:
== Physical Plan ==
Window [avg(cast(value#9 as bigint)) windowspecdefinition(key#7, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS avg#14], [key#7]
+- *(1) Sort [key#7 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(key#7, 200)
      +- LocalTableScan [key#7, Categ1#8, value#9]
It will load the data, shuffle it so that all rows for a given key land in the same partition, then calculate the average.
For the second solution:
== Physical Plan ==
*(3) Project [key#7, Categ1#8, value#9, avg#24]
+- *(3) BroadcastHashJoin [key#7], [key#27], Inner, BuildRight
   :- LocalTableScan [key#7, Categ1#8, value#9]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- *(2) HashAggregate(keys=[key#27], functions=[avg(cast(value#29 as bigint))], output=[key#27, avg#24])
         +- Exchange hashpartitioning(key#27, 200)
            +- *(1) HashAggregate(keys=[key#27], functions=[partial_avg(cast(value#29 as bigint))], output=[key#27, sum#37, count#38L])
               +- LocalTableScan [key#27, value#29]
The execution plan is far more complex: load the data, then shuffle so that all rows for a given key land in the same partition. Your example data is very small, so my Spark uses a broadcast join instead of a sort-merge join or shuffled hash join, but on larger data it can cause another shuffle before the join.
It will also load your df1 twice because your data is not persisted, and the first solution seems easier to understand.
First solution is better.
Leaving aside that 1) no explicit setting of the number of 'shuffle partitions' is evident, 2) whether AQE is set to true or false is not stated, and 3) there is only a small amount of data,
then noting that there is no caching (which should, imho, not be necessary),
and noting that the second case has an element of 'self-join' for which 'reuse exchange' should be used according to https://issues.apache.org/jira/browse/SPARK-2183 so as to avoid re-reading from rest - but this does not occur:
For approach 1 I ran:
val dfA = spark.table("ZZZ")
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
def windowSpec = Window.partitionBy("number2")
dfA.withColumn("avg", avg("number").over(windowSpec)).explain
shows:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Window [number#73, lit#74, number2#75, avg(_w0#85L) windowspecdefinition(number2#75, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS avg#80], [number2#75]
   +- Sort [number2#75 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(number2#75, 200), true, [id=#141]
         +- Project [number#73, lit#74, number2#75, cast(number#73 as bigint) AS _w0#85L]
            +- FileScan parquet default.zzz[number#73,lit#74,number2#75] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex[dbfs:/user/hive/warehouse/zzz], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<number:int,lit:string>
The sort is there so that, within each partition, rows with the same key value can be read in sequence, summed, and divided by the total number of rows for that key to get the avg whenever a new key value appears, and at the end. This is the quickest method, as it has no JOIN as in the 2nd option.
For approach 2 I ran:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold",-1)
spark.conf.set("spark.sql.adaptive.enabled",false)
val dfB = spark.table("ZZZ")
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val dfC = dfB.groupBy("number2").agg(avg("number") as "avg")
dfB.join(dfC, Seq("number2")).explain
shows:
== Physical Plan ==
*(4) Project [number2#75, number#73, lit#74, avg#500]
+- *(4) SortMergeJoin [number2#75], [number2#505], Inner
   :- Sort [number2#75 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(number2#75, 200), true, [id=#1484]
   :     +- *(1) ColumnarToRow
   :        +- FileScan parquet default.zzz[number#73,lit#74,number2#75] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/zzz/number2=0, dbfs:/user/hive/warehouse/zzz/number2=..., PartitionFilters: [isnotnull(number2#75)], PushedFilters: [], ReadSchema: struct<number:int,lit:string>
   +- *(3) Sort [number2#505 ASC NULLS FIRST], false, 0
      +- *(3) HashAggregate(keys=[number2#505], functions=[finalmerge_avg(merge sum#512, count#513L) AS avg(cast(number#503 as bigint))#499])
         +- Exchange hashpartitioning(number2#505, 200), true, [id=#1491]
            +- *(2) HashAggregate(keys=[number2#505], functions=[partial_avg(cast(number#503 as bigint)) AS (sum#512, count#513L)])
               +- *(2) ColumnarToRow
                  +- FileScan parquet default.zzz[number#503,number2#505] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/zzz/number2=0, dbfs:/user/hive/warehouse/zzz/number2=..., PartitionFilters: [isnotnull(number2#505)], PushedFilters: [], ReadSchema: struct<number:int>
This option involves a JOIN and 2 reads from data at rest. As indicated, a 'self-join' (and in some cases a Union, with some help) should use 'reuse exchange' according to the ticket above, but that is not the case here. We have an element of 'self-join' here imho, but Catalyst sees it differently. This is a non-AQE approach, as you can see from the settings. It is far more complex.
Conclusion:
This type of query with AQE on can adapt itself to using a broadcast hash join, hence the disabling of AQE here.
The first option is the way to go, as it does not read from rest twice and does not need a JOIN.

Adding column from dataframe(df1) to another dataframe (df2)

I need some help with this Apache Spark (pyspark) issue.
I have a DataFrame (df1) which has a single column and a single row; it contains max_timestamp:
+-------------------+
|max_timestamp      |
+-------------------+
|2019-10-24 21:18:26|
+-------------------+
I have another DataFrame, which contains 2 columns - EmpId & Timestamp:
from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

masterData = [(1, '1999-10-24 21:18:23',), (1, '2019-10-24 21:18:26',), (2, '2020-01-24 21:18:26',)]
df_masterdata = spark.createDataFrame(masterData, ['dsid', 'txnTime_str'])
df_masterdata = df_masterdata.withColumn('txnTime_ts', col('txnTime_str').cast(TimestampType())).drop('txnTime_str')
df_masterdata.show(5, False)
+----+-------------------+
|dsid|txnTime_ts         |
+----+-------------------+
|1   |1999-10-24 21:18:23|
|1   |2019-10-24 21:18:26|
|2   |2020-01-24 21:18:26|
+----+-------------------+
The objective is to filter the records in the 2nd DataFrame based on the condition txnTime_ts < max_timestamp.
What I'm trying to do: add the column 'max_timestamp' to the 2nd DataFrame, and filter records by comparing the 2 values.
df_masterdata1 = df_masterdata.withColumn('maxTime', maxTS2['TEMP_MAX'])
PySpark does not let me add the column from maxTS2 to the DataFrame df_masterdata.
Error -
AnalysisException: 'Resolved attribute(s) TEMP_MAX#207255 missing from dsid#207263L,txnTime_ts#207267 in operator
!Project [dsid#207263L, txnTime_ts#207267, TEMP_MAX#207255 AS maxTime#207280].;;\n!Project [dsid#207263L,
txnTime_ts#207267, TEMP_MAX#207255 AS maxTime#207280]\n+- Project [dsid#207263L, txnTime_ts#207267]\n +- Project
[dsid#207263L, txnTime_str#207264, cast(txnTime_str#207264 as timestamp) AS txnTime_ts#207267]\n +- LogicalRDD
[dsid#207263L, txnTime_str#207264], false\n'
Any ideas on how to resolve this issue?
If you actually have a DF with a single row/column, the most efficient way to accomplish this would be to extract the value from the dataframe and then filter df_masterdata against it (see the sketch after the join example below). If you nevertheless need to do this within the context of a dataframe, you should use a join, e.g.:
df_masterdata1 = df_masterdata.join(df1, df_masterdata.txnTime_ts <= df1.max_timestamp)
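For the first suggestion (extract the value, then filter), a minimal sketch, assuming df1 is the single-row/single-column DataFrame from the question and using the strict < comparison it asks for:
from pyspark.sql.functions import col

# pull the single timestamp value out of df1 (one row, one column)
max_ts = df1.first()["max_timestamp"]

# filter the second DataFrame against that literal value
df_filtered = df_masterdata.where(col("txnTime_ts") < max_ts)
df_filtered.show()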

Statistics of columns computed in parallel

Best way to get the max value in a Spark dataframe column
That post shows how to run an aggregation (distinct, min, max) on a table, something like this:
for colName in df.columns:
    dt = df[[colName]].distinct().count()
    mx = df.agg({colName: "max"}).collect()[0][0]
    mn = df.agg({colName: "min"}).collect()[0][0]
    print(colName, dt, mx, mn)
This can be easily done with compute statistics. The stats from Hive and Spark are different:
Hive gives - distinct, max, min, nulls, length, version
Spark gives - count, mean, stddev, min, max
It looks like quite a few statistics are calculated. How do I get all of them for all columns using one command?
However, I have 1000s of columns and doing this serially is very slow. Suppose I want to compute some other function, say standard deviation, on each of the columns - how can that be done in parallel?
You can use pyspark.sql.DataFrame.describe() to get aggregate statistics like count, mean, min, max, and standard deviation for all columns where such statistics are applicable. (If you don't pass in any arguments, stats for all columns are returned by default)
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "a"), (4, None), (None, "c")], ["id", "name"]
)
df.describe().show()
#+-------+------------------+----+
#|summary|                id|name|
#+-------+------------------+----+
#|  count|                 4|   4|
#|   mean|               2.5|null|
#| stddev|1.2909944487358056|null|
#|    min|                 1|   a|
#|    max|                 4|   c|
#+-------+------------------+----+
As you can see, these statistics ignore any null values.
If you're using Spark version 2.3 or later, there is also pyspark.sql.DataFrame.summary(), which supports the following aggregates:
count - mean - stddev - min - max - arbitrary approximate percentiles specified as a percentage (eg, 75%)
df.summary("count", "min", "max").show()
#+-------+------------------+----+
#|summary|                id|name|
#+-------+------------------+----+
#|  count|                 4|   4|
#|    min|                 1|   a|
#|    max|                 4|   c|
#+-------+------------------+----+
If you wanted some other aggregate statistic for all columns, you could also use a list comprehension with pyspark.sql.DataFrame.agg(). For example, if you wanted to replicate what you say Hive gives (distinct, max, min and nulls - I'm not sure what length and version mean):
import pyspark.sql.functions as f
from itertools import chain
agg_distinct = [f.countDistinct(c).alias("distinct_"+c) for c in df.columns]
agg_max = [f.max(c).alias("max_"+c) for c in df.columns]
agg_min = [f.min(c).alias("min_"+c) for c in df.columns]
agg_nulls = [f.count(f.when(f.isnull(c), c)).alias("nulls_"+c) for c in df.columns]
df.agg(
    *(chain.from_iterable([agg_distinct, agg_max, agg_min, agg_nulls]))
).show()
#+-----------+-------------+------+--------+------+--------+--------+----------+
#|distinct_id|distinct_name|max_id|max_name|min_id|min_name|nulls_id|nulls_name|
#+-----------+-------------+------+--------+------+--------+--------+----------+
#|          4|            3|     4|       c|     1|       a|       1|         1|
#+-----------+-------------+------+--------+------+--------+--------+----------+
Note, though, that this method returns one row with all the values, rather than one row per statistic as describe() and summary() do.
You can put as many expressions into an agg as you want; when you collect, they all get computed at once. The result is a single row with all the values. Here's an example:
from pyspark.sql.functions import min, max, countDistinct
r = df.agg(
    min(df.col1).alias("minCol1"),
    max(df.col1).alias("maxCol1"),
    (max(df.col1) - min(df.col1)).alias("diffMinMax"),
    countDistinct(df.col2).alias("distinctItemsInCol2"))
r.printSchema()
# root
# |-- minCol1: long (nullable = true)
# |-- maxCol1: long (nullable = true)
# |-- diffMinMax: long (nullable = true)
# |-- distinctItemsInCol2: long (nullable = false)
row = r.collect()[0]
print(row.distinctItemsInCol2, row.diffMinMax)
# (10, 9)
You can also use the dictionary syntax here, but it's harder to manage for more complex things.
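For completeness, a minimal sketch of that dictionary syntax, assuming the same df with col1 and col2; note the auto-generated result column names, which is part of what makes it harder to manage:
# one aggregate per column; result columns come back named max(col1) and count(col2)
r = df.agg({"col1": "max", "col2": "count"})
r.show()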