These two approaches achieve almost the same result (the only difference is the order of the rows):
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
def windowSpec = Window.partitionBy("key")
val df1 = Seq((3, "A", 5), (1, "A", 2), (3, "A", 5), (3, "B", 13)).toDF("key", "Categ1", "value")
df1.withColumn("avg", avg("value").over(windowSpec)).show
+---+------+-----+-----------------+
|key|Categ1|value| avg|
+---+------+-----+-----------------+
| 1| A| 2| 2.0|
| 3| A| 5|7.666666666666667|
| 3| A| 5|7.666666666666667|
| 3| B| 13|7.666666666666667|
+---+------+-----+-----------------+
and
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df1 = Seq((3, "A", 5), (1, "A", 2), (3, "A", 5), (3, "B", 13)).toDF("key", "Categ1", "value")
val df2 = df1.groupBy("key").agg(avg("value") as "avg")
df1.join(df2, Seq("key")).show
+---+------+-----+-----------------+
|key|Categ1|value| avg|
+---+------+-----+-----------------+
| 3| A| 5|7.666666666666667|
| 1| A| 2| 2.0|
| 3| A| 5|7.666666666666667|
| 3| B| 13|7.666666666666667|
+---+------+-----+-----------------+
At first glance, I thought the first one should be faster because it avoids the join, but in my experience it sometimes turns out otherwise.
Is there additional overhead to the window function? Is there a complexity difference between the window function and groupBy? Thanks.
Side question
Is it impossible to ensure type safety when using window functions?
Related: How to use Window aggregates on strongly typed Spark Datasets?
You can take a look at the window solution's physical plan with the explain method:
== Physical Plan ==
Window [avg(cast(value#9 as bigint)) windowspecdefinition(key#7, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS avg#14], [key#7]
+- *(1) Sort [key#7 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(key#7, 200)
+- LocalTableScan [key#7, Categ1#8, value#9]
It loads the data, shuffles it so that all rows for a given key end up in the same partition, and then calculates the average.
For the second solution:
== Physical Plan ==
*(3) Project [key#7, Categ1#8, value#9, avg#24]
+- *(3) BroadcastHashJoin [key#7], [key#27], Inner, BuildRight
:- LocalTableScan [key#7, Categ1#8, value#9]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- *(2) HashAggregate(keys=[key#27], functions=[avg(cast(value#29 as bigint))], output=[key#27, avg#24])
+- Exchange hashpartitioning(key#27, 200)
+- *(1) HashAggregate(keys=[key#27], functions=[partial_avg(cast(value#29 as bigint))], output=[key#27, sum#37, count#38L])
+- LocalTableScan [key#27, value#29]
The execution plan is way more complex: it loads the data and shuffles it so that all rows for a given key end up in the same partition. Your example data is very small, so my Spark uses a broadcast join instead of a sort-merge join or shuffled hash join, but with more data it could shuffle again and then join.
It will also scan your df1 twice because your data is not persisted, and the first solution seems easier to understand.
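If you keep the join variant, persisting the source DataFrame avoids scanning it twice. A minimal sketch of that idea, written in PySpark (the Scala cache() call works the same way) and rebuilding the toy data from the question:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same toy data as the question, recreated here so the snippet stands alone.
df1 = spark.createDataFrame(
    [(3, "A", 5), (1, "A", 2), (3, "A", 5), (3, "B", 13)],
    ["key", "Categ1", "value"],
)

df1.cache()  # materialized on the first action, so the join scans df1 only once
df2 = df1.groupBy("key").agg(F.avg("value").alias("avg"))
df1.join(df2, ["key"]).explain()  # the plan should now show an in-memory scan of df1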
First solution is better.
Leaving aside that 1) no explicit setting of the number of 'shuffle partitions' is evident, 2) whether AQE is enabled is not stated, and 3) the amount of data is small,
and noting that there is no caching (which imho should not be necessary),
and that the second case has an element of 'self-join', for which 'reuse exchange' should be used according to https://issues.apache.org/jira/browse/SPARK-2183 so as to avoid re-reading from rest - but this does not occur:
For approach 1 I ran:
val dfA = spark.table("ZZZ")
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
def windowSpec = Window.partitionBy("number2")
dfA.withColumn("avg", avg("number").over(windowSpec)).explain
shows:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Window [number#73, lit#74, number2#75, avg(_w0#85L) windowspecdefinition(number2#75, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS avg#80], [number2#75]
+- Sort [number2#75 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(number2#75, 200), true, [id=#141]
+- Project [number#73, lit#74, number2#75, cast(number#73 as bigint) AS _w0#85L]
+- FileScan parquet default.zzz[number#73,lit#74,number2#75] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex[dbfs:/user/hive/warehouse/zzz], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<number:int,lit:string>
The sort ensures that, within each partition, rows with the same key are read in sequence, so they can be summed and divided by the total number of values for that key to get the avg whenever a new key value is encountered and at the end. This is the quickest method, as it has no JOIN as in the 2nd option.
For approach 2 I ran:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold",-1)
spark.conf.set("spark.sql.adaptive.enabled",false)
val dfB = spark.table("ZZZ")
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val dfC = dfB.groupBy("number2").agg(avg("number") as "avg")
dfB.join(dfC, Seq("number2")).explain
shows:
== Physical Plan ==
*(4) Project [number2#75, number#73, lit#74, avg#500]
+- *(4) SortMergeJoin [number2#75], [number2#505], Inner
:- Sort [number2#75 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(number2#75, 200), true, [id=#1484]
: +- *(1) ColumnarToRow
: +- FileScan parquet default.zzz[number#73,lit#74,number2#75] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/zzz/number2=0, dbfs:/user/hive/warehouse/zzz/number2=..., PartitionFilters: [isnotnull(number2#75)], PushedFilters: [], ReadSchema: struct<number:int,lit:string>
+- *(3) Sort [number2#505 ASC NULLS FIRST], false, 0
+- *(3) HashAggregate(keys=[number2#505], functions=[finalmerge_avg(merge sum#512, count#513L) AS avg(cast(number#503 as bigint))#499])
+- Exchange hashpartitioning(number2#505, 200), true, [id=#1491]
+- *(2) HashAggregate(keys=[number2#505], functions=[partial_avg(cast(number#503 as bigint)) AS (sum#512, count#513L)])
+- *(2) ColumnarToRow
+- FileScan parquet default.zzz[number#503,number2#505] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/zzz/number2=0, dbfs:/user/hive/warehouse/zzz/number2=..., PartitionFilters: [isnotnull(number2#505)], PushedFilters: [], ReadSchema: struct<number:int>
This option involves a JOIN and 2 reads from data at rest. As indicated, a 'self-join' (and in some cases a Union, with some help) should use reuse exchange according to the ticket above, but that is not the case here. We have an element of 'self-join' imho, but Catalyst sees it differently. This is a non-AQE approach, as you can see from the settings. It is far more complex.
Conclusion:
This type of query with AQE on can adapt itself to using broadcast hash join, hence the disabling of AQE here.
First option is the way to go as it does not read from rest twice and does not need a JOIN.
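For reference, the settings discussed above can be pinned explicitly before running either variant. A minimal sketch (PySpark shown; the Scala spark.conf.set calls look the same), with illustrative values rather than recommendations:
# Illustrative session settings only; tune for your own cluster and data volume.
spark.conf.set("spark.sql.shuffle.partitions", "200")         # default is 200; lower it for tiny data
spark.conf.set("spark.sql.adaptive.enabled", "false")         # disable AQE to see the raw plan
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # force sort-merge join instead of broadcast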
Related
I have a streaming dataframe and I am not sure what the best way is to solve this issue
+---+--------+---------+
| ID|latitude|longitude|
+---+--------+---------+
|  A|      28|       30|
|  B|      40|       52|
+---+--------+---------+
Transform to:
+-------+-------+------------------+
|      A|      B|          Distance|
+-------+-------+------------------+
|(28,30)|(40,52)|calculate distance|
+-------+-------+------------------+
I need to transform it to this and add a distance column in which I pass the coordinates.
I am thinking about producing 2 data streams that are filtered with all the A coordinates and B coordinates. I would then A.join(B).withColumn(distance) and stream the output. Is this the way to go about solving this problem?
Is there a way I could pivot the readStream data without aggregation into the format needed, which could be faster than making 2 filtered streaming dataframes and merging them?
Can I add an array column of coordinates in a streaming dataset?
I am not sure how performant this will be, but you can use pivot to force rows of the ID column to become new columns and sum the individual latitude and longitude as a way to obtain the value itself (since there is no F.identity). This will get you the following result:
import pyspark.sql.functions as F

streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
).show()
+----------+-----------+----------+-----------+
|A_latitude|A_longitude|B_latitude|B_longitude|
+----------+-----------+----------+-----------+
| 28| 30| 40| 52|
+----------+-----------+----------+-----------+
Then you can use F.struct to create columns A and B using the latitude and longitude columns:
streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
).withColumn(
    'A', F.struct(F.col('A_latitude'), F.col('A_longitude'))
).withColumn(
    'B', F.struct(F.col('B_latitude'), F.col('B_longitude'))
).show()
+----------+-----------+----------+-----------+--------+--------+
|A_latitude|A_longitude|B_latitude|B_longitude| A| B|
+----------+-----------+----------+-----------+--------+--------+
| 28| 30| 40| 52|{28, 30}|{40, 52}|
+----------+-----------+----------+-----------+--------+--------+
The last step is to use a udf to calculate geographic distance, which has been answered here. Putting this all together:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
from geopy.distance import geodesic

@F.udf(returnType=FloatType())
def geodesic_udf(a, b):
    return geodesic(a, b).m

streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
).withColumn(
    'A', F.struct(F.col('A_latitude'), F.col('A_longitude'))
).withColumn(
    'B', F.struct(F.col('B_latitude'), F.col('B_longitude'))
).withColumn(
    'distance', geodesic_udf(F.array('B.B_longitude', 'B.B_latitude'), F.array('A.A_longitude', 'A.A_latitude'))
).select(
    'A', 'B', 'distance'
).show()
+--------+--------+---------+
| A| B| distance|
+--------+--------+---------+
|{28, 30}|{40, 52}|2635478.5|
+--------+--------+---------+
EDIT: When I answered your question, I let pyspark infer the datatype of each column, but I also tried to more closely reproduce the schema for your streaming dataframe by specifying the column types:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

streaming_df = spark.createDataFrame(
    [
        ("A", 28., 30.),
        ("B", 40., 52.),
    ],
    StructType([
        StructField("ID", StringType(), True),
        StructField("latitude", DoubleType(), True),
        StructField("longitude", DoubleType(), True),
    ])
)
streaming_df.printSchema()
root
|-- ID: string (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
The end result is still the same:
+------------+------------+---------+
| A| B| distance|
+------------+------------+---------+
|{28.0, 30.0}|{40.0, 52.0}|2635478.5|
+------------+------------+---------+
Question: When joining two datasets, why is the isnotnull filter applied twice on the joining key column? In the physical plan it is applied once as a PushedFilter and then applied again explicitly right after it. Why is that so?
code:
import os
import pandas as pd, numpy as np
import pyspark
spark=pyspark.sql.SparkSession.builder.getOrCreate()
save_loc = "gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/"
df1 = spark.createDataFrame(pd.DataFrame({'a':np.random.choice([1,2,None],size = 1000, p = [0.47,0.48,0.05]),
'b': np.random.random(1000)}))
df2 = spark.createDataFrame(pd.DataFrame({'a':np.random.choice([1,2,None],size = 1000, p = [0.47,0.48,0.05]),
'b': np.random.random(1000)}))
df1.write.parquet(os.path.join(save_loc,"dfl_key_int"))
df2.write.parquet(os.path.join(save_loc,"dfr_key_int"))
dfl_int = spark.read.parquet(os.path.join(save_loc,"dfl_key_int"))
dfr_int = spark.read.parquet(os.path.join(save_loc,"dfr_key_int"))
dfl_int.join(dfr_int,on='a',how='inner').explain()
output:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [a#23L, b#24, b#28]
+- BroadcastHashJoin [a#23L], [a#27L], Inner, BuildRight, false
:- Filter isnotnull(a#23L)
: +- FileScan parquet [a#23L,b#24] Batched: true, DataFilters: [isnotnull(a#23L)], Format: Parquet, Location: InMemoryFileIndex[gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/dfl_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,b:double>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75]
+- Filter isnotnull(a#27L)
+- FileScan parquet [a#27L,b#28] Batched: true, DataFilters: [isnotnull(a#27L)], Format: Parquet, Location: InMemoryFileIndex[gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/dfr_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,b:double>
The reason is that a PushedFilter does not guarantee that all the data is filtered as you want before it is read into memory by Spark. For more context on what a PushedFilter is, check out this SO answer.
Parquet files
Let's have a look at Parquet files like in your example. Parquet files are stored in a columnar format, and they are also organized in Row Groups (or chunks). The following picture comes from the Apache Parquet docs:
You see that the data is stored in a columnar fashion and chopped up into chunks (row groups). Now, for each column chunk in each row group, Parquet stores some metadata. In that picture, you see that it contains a bunch of metadata and also extra key/value pairs. These also contain statistics about your data (depending on the type of your column).
Some examples of these statistics are:
what the min/max value is of the chunk (in case it makes sense for the data type of the column)
whether the chunk has non-null values
...
Back to your example
You are joining on the a column. To be able to do that we need to be sure that a has no null values. Let's imagine that your a column (disregarding the other columns) is stored like this:
a column:
chunk 1: 0, 1, None, 1, 1, None
chunk 2: 0, 0, 0, 0, 0, 0
chunk 3: None, None, None, None, None, None
Now, using the PushedFilter we can immediately (just by looking at the metadata of the chunks) disregard chunk 3; we don't even have to read it in!
But as you see, chunk 1 still contains null values. This is something we can't filter out by only looking at the chunk's metadata. So we'll have to read in that whole chunk and then filter those other null values afterwards within Spark using that second Filter node in your Physical Plan.
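If you want to see these per-row-group statistics yourself, you can read the Parquet footer directly. A minimal sketch using pyarrow; the part-file path is a placeholder for whatever file names Spark actually wrote in the example above:
import pyarrow.parquet as pq

# Placeholder path: point this at one of the part files under dfl_key_int.
pf = pq.ParquetFile("dfl_key_int/part-00000-xxxx.snappy.parquet")
meta = pf.metadata

for rg in range(meta.num_row_groups):
    col = meta.row_group(rg).column(0)  # column 'a' from the example
    stats = col.statistics
    if stats is not None:
        print(rg, col.path_in_schema, stats.min, stats.max, stats.null_count)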
I am seeking suggestions on how to expose RDDs from a Spark SQL physical plan (with Spark-3.2.1). Here let's take TPCH Query 1 as a concrete example.
This is the physical plan generated by sql(query).explain():
== Physical Plan ==
*(3) Sort [l_returnflag#17 ASC NULLS FIRST, l_linestatus#18 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(l_returnflag#17 ASC NULLS FIRST, l_linestatus#18 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#97]
+- *(2) HashAggregate(keys=[l_returnflag#17, l_linestatus#18], functions=[sum(l_quantity#13), sum(l_extendedprice#14), sum(CheckOverflow((promote_precision(cast(l_extendedprice#14 as decimal(13,2))) * promote_precision(CheckOverflow((1.00 - promote_precision(cast(l_discount#15 as decimal(13,2)))), DecimalType(13,2), true))), DecimalType(26,4), true)), sum(CheckOverflow((promote_precision(CheckOverflow((promote_precision(cast(l_extendedprice#14 as decimal(13,2))) * promote_precision(CheckOverflow((1.00 - promote_precision(cast(l_discount#15 as decimal(13,2)))), DecimalType(13,2), true))), DecimalType(26,4), true)) * promote_precision(cast(CheckOverflow((1.00 + promote_precision(cast(l_tax#16 as decimal(13,2)))), DecimalType(13,2), true) as decimal(26,4)))), DecimalType(38,6), true)), avg(l_quantity#13), avg(l_extendedprice#14), avg(l_discount#15), count(1)], output=[l_returnflag#17, l_linestatus#18, sum_qty#114, sum_base_price#115, sum_disc_price#116, sum_charge#117, avg_qty#118, avg_price#119, avg_disc#120, count_order#121L])
+- Exchange hashpartitioning(l_returnflag#17, l_linestatus#18, 200), ENSURE_REQUIREMENTS, [id=#93]
+- *(1) HashAggregate(keys=[l_returnflag#17, l_linestatus#18], functions=[partial_sum(l_quantity#13), partial_sum(l_extendedprice#14), partial_sum(CheckOverflow((promote_precision(cast(l_extendedprice#14 as decimal(13,2))) * promote_precision(CheckOverflow((1.00 - promote_precision(cast(l_discount#15 as decimal(13,2)))), DecimalType(13,2), true))), DecimalType(26,4), true)), partial_sum(CheckOverflow((promote_precision(CheckOverflow((promote_precision(cast(l_extendedprice#14 as decimal(13,2))) * promote_precision(CheckOverflow((1.00 - promote_precision(cast(l_discount#15 as decimal(13,2)))), DecimalType(13,2), true))), DecimalType(26,4), true)) * promote_precision(cast(CheckOverflow((1.00 + promote_precision(cast(l_tax#16 as decimal(13,2)))), DecimalType(13,2), true) as decimal(26,4)))), DecimalType(38,6), true)), partial_avg(l_quantity#13), partial_avg(l_extendedprice#14), partial_avg(l_discount#15), partial_count(1)], output=[l_returnflag#17, l_linestatus#18, sum#155, isEmpty#156, sum#157, isEmpty#158, sum#159, isEmpty#160, sum#161, isEmpty#162, sum#163, count#164L, sum#165, count#166L, sum#167, count#168L, count#169L])
+- *(1) Project [l_quantity#13, l_extendedprice#14, l_discount#15, l_tax#16, l_returnflag#17, l_linestatus#18]
+- *(1) ColumnarToRow
+- FileScan parquet tpch_100.lineitem[l_quantity#13,l_extendedprice#14,l_discount#15,l_tax#16,l_returnflag#17,l_linestatus#18,l_shipdate#24] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(2436 paths)[hdfs://node13-opa:8020/user/spark_benchmark/tpch_100/dataset/lineit..., PartitionFilters: [isnotnull(l_shipdate#24), (l_shipdate#24 <= 1998-09-02)], PushedFilters: [], ReadSchema: struct<l_quantity:decimal(12,2),l_extendedprice:decimal(12,2),l_discount:decimal(12,2),l_tax:deci...
These are the RDDs generated, as shown by sql(query).rdd.toDebugString:
(4) MapPartitionsRDD[17] at rdd at <console>:24 []
| SQLExecutionRDD[16] at rdd at <console>:24 []
| MapPartitionsRDD[15] at rdd at <console>:24 []
| MapPartitionsRDD[14] at rdd at <console>:24 []
| ShuffledRowRDD[13] at rdd at <console>:24 []
+-(200) MapPartitionsRDD[12] at rdd at <console>:24 []
| MapPartitionsRDD[8] at rdd at <console>:24 []
| ShuffledRowRDD[7] at rdd at <console>:24 []
+-(265) MapPartitionsRDD[6] at rdd at <console>:24 []
| MapPartitionsRDD[5] at rdd at <console>:24 []
| MapPartitionsRDD[4] at rdd at <console>:24 []
| FileScanRDD[3] at rdd at <console>:24 []
I know the RDD generation process is to 1) create the input RDD (e.g. FileScanRDD[3]) from the input data, and 2) apply transformations to the input RDD for each physical operator.
My question is: whether/how can the internal "transformation" code be exposed outside of Spark SQL?
For example, if a user creates the initial RDD, like FileScanRDD[3], from the input data, whether/how can the user (rather than Spark SQL internals) implement the same "transformation" on FileScanRDD[3] to produce MapPartitionsRDD[4], and likewise for the rest of the RDDs?
Appreciate your help in advance!
Which of the following is the better way in PySpark?
Does the second query have any advantage/performance gain over the first query in PySpark (in cluster mode)?
#1) without using agg
total_distance_df = spark.sql("SELECT sum(distance) FROM flights")\
.withColumnRenamed('sum(CAST(distance AS DOUBLE))', 'total_distance')
total_distance_df.show()
Vs
#2) using agg
total_distance_df = spark.sql("SELECT distance FROM flights")\
.agg({"distance":"sum"})\
.withColumnRenamed("sum(distance)","total_distance")
total_distance_df.show()
Both are the same. Check the explain plan of the queries to see that there are no differences.
Example:
#sample df
df1.show()
+---+--------+
| id|distance|
+---+--------+
| a| 1|
| b| 2|
+---+--------+
df1.createOrReplaceTempView("tmp")
spark.sql("SELECT sum(distance) FROM tmp").withColumnRenamed('sum(CAST(distance AS DOUBLE))', 'total_distance').explain()
#== Physical Plan ==
#*(2) HashAggregate(keys=[], functions=[sum(distance#179L)])
#+- Exchange SinglePartition
# +- *(1) HashAggregate(keys=[], functions=[partial_sum(distance#179L)])
# +- *(1) Project [distance#179L]
# +- Scan ExistingRDD[id#178,distance#179L]
spark.sql("SELECT distance FROM tmp").agg({"distance":"sum"}).explain()
#== Physical Plan ==
#*(2) HashAggregate(keys=[], functions=[sum(distance#179L)])
#+- Exchange SinglePartition
# +- *(1) HashAggregate(keys=[], functions=[partial_sum(distance#179L)])
# +- *(1) Project [distance#179L]
# +- Scan ExistingRDD[id#178,distance#179L]
As you can see, the plans are identical for both the SQL sum() and the agg() version.
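For completeness, the same aggregation can also be written with the DataFrame API and an explicit alias, which avoids the withColumnRenamed step entirely. A minimal sketch assuming the sample df1 above:
import pyspark.sql.functions as F

total_distance_df = df1.agg(F.sum("distance").alias("total_distance"))
total_distance_df.explain()  # same HashAggregate + Exchange SinglePartition plan as above
total_distance_df.show()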
I need some help with this Apache Spark (pyspark) issue.
I have a DataFrame (df1) which has a single column and a single row; it contains max_timestamp:
+-------------------+
|max_timestamp      |
+-------------------+
|2019-10-24 21:18:26|
+-------------------+
I have another DataFrame, which contains 2 columns - EmpId & Timestamp:
masterData = [(1, '1999-10-24 21:18:23',), (1, '2019-10-24 21:18:26',), (2, '2020-01-24 21:18:26',)]
df_masterdata = spark.createDataFrame(masterData, ['dsid', 'txnTime_str'])
df_masterdata = df_masterdata.withColumn('txnTime_ts', col('txnTime_str').cast(TimestampType())).drop('txnTime_str')
df_masterdata.show(5, False)
+----+-------------------+
|dsid|txnTime_ts |
+----+-------------------+
|1 |1999-10-24 21:18:23|
|1 |2019-10-24 21:18:26|
|2 |2020-01-24 21:18:26|
+----+-------------------+
The objective is to filter the records in the 2nd DataFrame based on the condition txnTime_ts < max_timestamp.
What I'm trying to do -> add the column 'max_timestamp' to the 2nd DataFrame and filter records by comparing the 2 values.
df_masterdata1 = df_masterdata.withColumn('maxTime', maxTS2['TEMP_MAX'])
PySpark does not let me add the column from maxTS2 to the DataFrame df_masterdata.
Error:
AnalysisException: 'Resolved attribute(s) TEMP_MAX#207255 missing from dsid#207263L,txnTime_ts#207267 in operator
!Project [dsid#207263L, txnTime_ts#207267, TEMP_MAX#207255 AS maxTime#207280].;;\n!Project [dsid#207263L,
txnTime_ts#207267, TEMP_MAX#207255 AS maxTime#207280]\n+- Project [dsid#207263L, txnTime_ts#207267]\n +- Project
[dsid#207263L, txnTime_str#207264, cast(txnTime_str#207264 as timestamp) AS txnTime_ts#207267]\n +- LogicalRDD
[dsid#207263L, txnTime_str#207264], false\n'
Any ideas on how to resolve this issue?
If you actually have a DF with a single row/column, the most efficient way to accomplish this would be to extract the value from the dataframe and then filter df_masterdata against it. If you nevertheless need to do this within the context of a dataframe, you should use a join, e.g.:
df_masterdata1 = df_masterdata.join(df1, df_masterdata.txnTime_ts <= df1.max_timestamp)
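For reference, a minimal sketch of the "extract the value first" approach (assuming df1 is the single-row DataFrame holding max_timestamp, as in the question):
from pyspark.sql import functions as F

# Pull the single scalar out of df1, then filter df_masterdata against it.
max_ts = df1.first()["max_timestamp"]

df_filtered = df_masterdata.filter(F.col("txnTime_ts") < F.lit(max_ts))
df_filtered.show(truncate=False)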