Joining large data in spark streaming

We have a large customer table with 7 million records, and we are processing transaction data (500K messages per batch) coming from a Kafka stream.
During processing, we need to join the transaction data with the customer data. This currently takes around 10s, and the requirement is to bring it down to 5s. Since the customer table is too large, we cannot use a broadcast join. Is there any other optimization we can make?
== Parsed Logical Plan ==
Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#119L]
+- Join Inner, Some((custId#110 = rowkey#0))
   :- Subquery custProfile
   :  +- Project [rowkey#0,no_mkt_opto_flag#5,thrd_party_ads_opto_flag#4]
   :     +- Filter ((no_mkt_opto_flag#5 = N) && (thrd_party_ads_opto_flag#4 = N))
   :        +- Subquery jz_view_sub_cust_profile
   :           +- Project [rowkey#0,thrd_party_ads_opto_flag#4,no_mkt_opto_flag#5]
   :              +- MetastoreRelation db_localhost, ext_sub_cust_profile, None
   +- LogicalRDD [key#109,custId#110,mktOptOutFlag#117,thirdPartyOptOutFlag#118], MapPartitionsRDD[190] at rddToDataFrameHolder at custStream.scala:166
== Analyzed Logical Plan ==
count: bigint
Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#119L]
+- Join Inner, Some((custId#110 = rowkey#0))
   :- Subquery custProfile
   :  +- Project [rowkey#0,no_mkt_opto_flag#5,thrd_party_ads_opto_flag#4]
   :     +- Filter ((no_mkt_opto_flag#5 = N) && (thrd_party_ads_opto_flag#4 = N))
   :        +- Subquery jz_view_sub_cust_profile
   :           +- Project [rowkey#0,thrd_party_ads_opto_flag#4,no_mkt_opto_flag#5]
   :              +- MetastoreRelation db_localhost, ext_sub_cust_profile, None
   +- LogicalRDD [key#109,custId#110,mktOptOutFlag#117,thirdPartyOptOutFlag#118], MapPartitionsRDD[190] at rddToDataFrameHolder at custStream.scala:166
== Optimized Logical Plan ==
Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#119L]
+- Project
   +- Join Inner, Some((custId#110 = rowkey#0))
      :- Project [rowkey#0]
      :  +- Filter ((no_mkt_opto_flag#5 = N) && (thrd_party_ads_opto_flag#4 = N))
      :     +- MetastoreRelation db_localhost, ext_sub_cust_profile, None
      +- Project [custId#110]
         +- LogicalRDD [key#109,custId#110,mktOptOutFlag#117,thirdPartyOptOutFlag#118], MapPartitionsRDD[190] at rddToDataFrameHolder at custStream.scala:166
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#119L])
+- TungstenExchange SinglePartition, None
   +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#122L])
      +- Project
         +- SortMergeJoin [rowkey#0], [custId#110]
            :- Sort [rowkey#0 ASC], false, 0
            :  +- TungstenExchange hashpartitioning(rowkey#0,200), None
            :     +- Project [rowkey#0]
            :        +- Filter ((no_mkt_opto_flag#5 = N) && (thrd_party_ads_opto_flag#4 = N))
            :           +- HiveTableScan [rowkey#0,no_mkt_opto_flag#5,thrd_party_ads_opto_flag#4], MetastoreRelation db_localhost, ext_sub_cust_profile, None
            +- Sort [custId#110 ASC], false, 0
               +- TungstenExchange hashpartitioning(custId#110,200), None
                  +- Project [custId#110]
                     +- Scan ExistingRDD[key#109,custId#110,mktOptOutFlag#117,thirdPartyOptOutFlag#118]

Assuming the customer data is constant across mini-batches, partition it on the customer ID using a hash partitioner and cache it as an RDD/DataFrame.
Since the transaction data comes from Kafka, it can also be partitioned on the same key, using a hash partitioner, when it is published to Kafka:
https://www.javaworld.com/article/3066873/big-data/big-data-messaging-with-kafka-part-2.html
This should reduce the time spent joining the two datasets. The one condition is that the partition key must be the same in both datasets (transaction data and customer data).
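To make the idea concrete, here is a minimal Spark-free sketch in plain Python of why co-partitioning helps: when both sides use the same hash function and the same number of partitions, partition i of the transactions only ever needs partition i of the customers. The function names, the partition count, and the sample rows are all made up for illustration; in Spark itself you would use the RDD/DataFrame partitioning and caching APIs rather than code like this.

```python
NUM_PARTITIONS = 4  # hypothetical; must be identical for both datasets


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Hash partitioner: the same key always maps to the same partition index."""
    return hash(key) % num_partitions


def partition_by_key(rows, key_index, num_partitions=NUM_PARTITIONS):
    """Split rows into num_partitions buckets using the key at key_index."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[partition_of(row[key_index], num_partitions)].append(row)
    return parts


def copartitioned_join(customer_parts, txn_parts):
    """Join partition i of one side only with partition i of the other side."""
    joined = []
    for cust_part, txn_part in zip(customer_parts, txn_parts):
        lookup = {cust[0]: cust for cust in cust_part}  # per-partition hash table
        for txn in txn_part:
            cust = lookup.get(txn[1])
            if cust is not None:
                joined.append((txn, cust))
    return joined


# Made-up sample data: (custId, optOutFlag) and (txnKey, custId)
customers = [(1, "N"), (2, "N"), (3, "Y"), (6, "N")]
transactions = [("t1", 2), ("t2", 6), ("t3", 9)]

# The customer side is partitioned once and reused for every mini-batch
customer_parts = partition_by_key(customers, key_index=0)
txn_parts = partition_by_key(transactions, key_index=1)

joined = copartitioned_join(customer_parts, txn_parts)
print(joined)
```

If the two sides used different partition counts or different hash functions, matching keys could land in different partitions and the per-partition join would silently miss them, which is why the partition key and scheme have to agree.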

Related

Pyspark dropped column not gone

I have a spark dataframe. I attempt to drop a column, but in some situations the column appears to still be there.
my_range = spark.range(1000).toDF("number")
new_range = my_range.withColumn('num2', my_range.number*2).drop('number')
# can still sort by "number" column
new_range.sort('number')
Is this a bug? Or am I missing something?
Spark version is v3.3.1
python 3
I'm on a Mbook pro M1 20221

I called .explain(True) on your sample dataset. Let's take a look at the output:
== Parsed Logical Plan ==
'Sort ['number ASC NULLS FIRST], true
+- Project [num2#61L]
   +- Project [number#59L, (number#59L * cast(2 as bigint)) AS num2#61L]
      +- Project [id#57L AS number#59L]
         +- Range (0, 1000, step=1, splits=Some(8))
The Parsed Logical Plan is the first, "raw" version of the query plan. Here you can see Project [num2#61L] before the sort: this is your drop.
But at the next stage (the Analyzed Logical Plan) it is different:
== Analyzed Logical Plan ==
num2: bigint
Project [num2#61L]
+- Sort [number#59L ASC NULLS FIRST], true
   +- Project [num2#61L, number#59L]
      +- Project [number#59L, (number#59L * cast(2 as bigint)) AS num2#61L]
         +- Project [id#57L AS number#59L]
            +- Range (0, 1000, step=1, splits=Some(8))
Spark was smart enough to figure out that you need this column, so the Project before the Sort now includes it. To stay consistent with your code, a new Project is added after the Sort.
Now the last stage, the Optimized Logical Plan:
== Optimized Logical Plan ==
Project [num2#61L]
+- Sort [number#59L ASC NULLS FIRST], true
   +- Project [(id#57L * 2) AS num2#61L, id#57L AS number#59L]
      +- Range (0, 1000, step=1, splits=Some(8))
In my opinion it's not a bug but part of Spark's design. Keep in mind that your code is executed within the same action, so because Spark is lazy it can adjust and optimize your code during planning.

The first answer is obviously correct, and whether or not the Spark approach is a good implementation is open to debate - I think it is.
As an embellishment: A checkpoint, if used, will mean an error:
spark.sparkContext.setCheckpointDir("/foo2/bar")
new_range = new_range.checkpoint()
new_range.sort('number').show()
returns:
AnalysisException: Column 'number' does not exist. Did you mean one of the following? [num2];
'Sort ['number ASC NULLS FIRST], true
+- LogicalRDD [num2#69L], false

Pyspark Broadcast join

Can anyone help me understand the behaviour of the query below? Why is there a broadcast join in the physical plan when I am not doing any broadcast join in the query?
query:
SELECT count(*) FROM table
WHERE date_id in (select max(date_id) from table)
== Physical Plan ==
*(3) HashAggregate(keys=[], functions=[count(1)], output=[count(1)#17L])
+- Exchange SinglePartition
   +- *(2) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#20L])
      +- *(2) Project
         +- *(2) BroadcastHashJoin [date_id#14], [max(date_id)#16], LeftSemi, BuildRight
            :- *(2) FileScan parquet table[date_id#14] Batched: true, Format: Parquet, Location: CatalogFileIndex[gs://data/features/smart_subs/pipeline/s..., PartitionCount: 14, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
            +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
               +- SortAggregate(key=[], functions=[max(date_id#14)], output=[max(date_id)#16])
                  +- Exchange SinglePartition
                     +- SortAggregate(key=[], functions=[partial_max(date_id#14)], output=[max#22])
                        +- *(1) FileScan parquet table[date_id#14] Batched: true, Format: Parquet, Location: CatalogFileIndex[gs:/data/features/smart_subs/pipeline/s..., PartitionCount: 14, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

Databricks explains that using subqueries like this produces a Broadcast Nested Loop join.
For further information, you can read this article:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html

PushDownPredicate in Spark SQL and Exchange reuse

I am executing the given query containing a UNION. My intention is to reuse the exchange between both the query branches by disabling PushDownPredicate configuration setting in the Spark-shell.
With PushDownPredicate enabled, Spark will push down the filter condition close to the source and hence insert 1 Exchange (Shuffle) on each branch of the queries - resulting in total 2 Exchanges.
However, the expectation is that - with the PushDownPredicate disabled, Spark will not push the filter close to the source by holding the filter condition at its original place in the query - i.e. after the group by clause. This will let Spark use only a single Exchange (for both the queries) thereby reducing 1 shuffle.
Unfortunately, I have not been able to produce this behavior with Spark SQL. The query and the commands are given below:
Configuration:
spark-sql> SET "spark.sql.optimizer.excludeRules", org.apache.spark.sql.catalyst.optimizer.PushDownPredicate;
Query:
select prodId, count(*) as cnt
from test_db.product
group by 1
having count(*) > 1000
and prodId = '1234'
union all
select prodId, count(*) as cnt
from test_db.product
group by 1
having count(*) < 100;
Physical Plan:
Union
:- *(2) Project [prodId#5539, count(1)#5576L]
:  +- *(2) Filter (count(1)#5579L > 1000)
:     +- *(2) HashAggregate(keys=[prodId#5539], functions=[count(1)])
:        +- Exchange hashpartitioning(prodId#5539, 200)
:           +- *(1) HashAggregate(keys=[prodId#5539], functions=[partial_count(1)])
:              +- *(1) Project[prodId#5539]
:                 +- *(1) Filter (isnotnull(prodId#5539) && (prodId#5539 = 1234))
:                    +- *(1) FileScan parquet testdb.product[prodId#5539,eff_dt#5546] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://path], PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(prodId), EqualTo(ProdId, 1234)], ReadSchema: struct<prodId:string>
+- *(4) Project[prodId#5568, count(1)#5577L]
   +- *(4) Filter (count(1)#5581L < 100)
      +- *(4) HashAggregate(keys=[prodId#5568], functions=[count(1)])
         +- Exchange hashpartitioning(prodId#5568, 200)
            +- *(3) HashAggregate(keys=[prodId#5568], functions=[partial_count(1)])
               +- *(3) Project [prodId#5568]
                  +- *(3) FileScan parquet testdb.product[prodId#5568,eff_dt#5575] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://path], PartitionCount: 1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<prodId:string>
As seen from the above Query Plan, Spark is using 2 Exchanges -> 1 on each query branch. My intention is to reduce the 2 Exchanges into 1 shared Exchange, by using the configuration setting mentioned above.
I am using Spark v2.4.0.
Can anyone please help me see where I may be going wrong? Am I setting the configuration properly?
Any help is appreciated.
Thanks

You can try this (the property is named excludedRules and is assigned with =):
SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.PushDownPredicate

SQL LIKE in Spark SQL

I'm trying to implement a join in Spark SQL using a LIKE condition.
The column I am joining on is called 'revision' and looks like this:
Table A:
8NXDPVAE
Table B:
[4,8]NXD_V%
Performing the join on SQL server (A.revision LIKE B.revision) works just fine, but when doing the same in Spark SQL, the join returns no rows (if using inner join) or null values for Table B (if using outer join).
This is the query I am running:
val joined = spark.sql("SELECT A.revision, B.revision FROM RAWDATA A LEFT JOIN TPTYPE B ON A.revision LIKE B.revision")
The plan looks like this:
== Physical Plan ==
BroadcastNestedLoopJoin BuildLeft, LeftOuter, revision#15 LIKE revision#282, false
:- BroadcastExchange IdentityBroadcastMode
:  +- *Project [revision#15]
:     +- *Scan JDBCRelation(RAWDATA) [revision#15] PushedFilters: [EqualTo(bulk_id,2016092419270100198)], ReadSchema: struct<revision>
+- *Scan JDBCRelation(TPTYPE) [revision#282] ReadSchema: struct<revision>
Is it possible to perform a LIKE join like this or am I way off?

You are only a little bit off. Spark SQL and Hive follow SQL standard conventions, where the LIKE operator accepts only two special characters:
_ (underscore), which matches any single character.
% (percent), which matches any sequence of characters.
Square brackets have no special meaning, so [4,8] matches only the literal [4,8]:
spark.sql("SELECT '[4,8]' LIKE '[4,8]'").show
+----------------+
|[4,8] LIKE [4,8]|
+----------------+
|            true|
+----------------+
To match more complex patterns, you can use the RLIKE operator, which supports Java regular expressions:
spark.sql("SELECT '8NXDPVAE' RLIKE '^[4,8]NXD.V.*$'").show
+-----------------------------+
|8NXDPVAE RLIKE ^[4,8]NXD.V.*$|
+-----------------------------+
|                         true|
+-----------------------------+
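If the patterns in the lookup table are stored in the SQL Server style, one option is to translate them into regular expressions before joining with RLIKE. The following is a rough sketch in plain Python, not part of any Spark API: the function name tsql_like_to_regex is hypothetical, and it handles only %, _ and simple [...] classes. Bracket contents are copied through unchanged, so unusual characters inside them are not escaped.

```python
import re


def tsql_like_to_regex(pattern: str) -> str:
    """Translate a SQL Server style LIKE pattern into an anchored regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%":
            out.append(".*")  # any sequence of characters
        elif ch == "_":
            out.append(".")   # exactly one character
        elif ch == "[":
            end = pattern.find("]", i + 1)
            if end == -1:
                out.append(re.escape(ch))  # unmatched bracket: treat as literal
            else:
                out.append(pattern[i:end + 1])  # keep the character class as-is
                i = end
        else:
            out.append(re.escape(ch))  # everything else is a literal
        i += 1
    return "^" + "".join(out) + "$"


regex = tsql_like_to_regex("[4,8]NXD_V%")
print(regex)
print(bool(re.match(regex, "8NXDPVAE")))
```

For this particular pattern the translated expression is the same one used in the RLIKE example above. The check here uses Python's re module, whose syntax agrees with Java regular expressions for this simple subset; more exotic patterns would need testing against Spark directly.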

Syntax for LIKE in the Spark Scala API (the argument is a SQL LIKE pattern; use rlike for a regular expression):
dataframe.filter(col("column_name").like("pattern"))

Filter Spark DataFrame based on another DataFrame that specifies denylist criteria

I have a largeDataFrame (multiple columns and billions of rows) and a smallDataFrame (single column and 10,000 rows).
I'd like to filter out all the rows from the largeDataFrame whose some_identifier column matches one of the rows in the smallDataFrame.
Here's an example:
largeDataFrame
some_identifier,first_name
111,bob
123,phil
222,mary
456,sue
smallDataFrame
some_identifier
123
456
desiredOutput
111,bob
222,mary
Here is my ugly solution.
val smallDataFrame2 = smallDataFrame.withColumn("is_bad", lit("bad_row"))
val desiredOutput = largeDataFrame.join(broadcast(smallDataFrame2), Seq("some_identifier"), "left").filter($"is_bad".isNull).drop("is_bad")
Is there a cleaner solution?

You'll need to use a left_anti join in this case.
The left anti join is the opposite of a left semi join.
It keeps only the rows of the left table whose key has no match in the right table:
largeDataFrame
  .join(smallDataFrame, Seq("some_identifier"), "left_anti")
  .show
// +---------------+----------+
// |some_identifier|first_name|
// +---------------+----------+
// |            222|      mary|
// |            111|       bob|
// +---------------+----------+
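For readers unfamiliar with anti joins, the semantics can be sketched without Spark in a few lines of plain Python. This is only an illustration of what left_anti computes on the sample data from the question; the helper name left_anti_join is made up, and building a set from the small side is loosely analogous to broadcasting it.

```python
def left_anti_join(left_rows, right_keys, key):
    """Keep the left rows whose key does not appear among right_keys."""
    deny = set(right_keys)  # small side held in memory as a hash set
    return [row for row in left_rows if row[key] not in deny]


large = [
    {"some_identifier": 111, "first_name": "bob"},
    {"some_identifier": 123, "first_name": "phil"},
    {"some_identifier": 222, "first_name": "mary"},
    {"some_identifier": 456, "first_name": "sue"},
]
small = [123, 456]

result = left_anti_join(large, small, "some_identifier")
print(result)
```

Only columns from the left side appear in the result, which is why no extra marker column or drop is needed.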

A version in pure Spark SQL (using PySpark as an example, but with small changes the same applies to the Scala API):
def string_to_dataframe(df_name, csv_string):
    # Parse an in-memory CSV string and register the result as a temp view
    rdd = spark.sparkContext.parallelize(csv_string.split("\n"))
    df = spark.read.option('header', 'true').option('inferSchema', 'true').csv(rdd)
    df.createOrReplaceTempView(df_name)
string_to_dataframe("largeDataFrame", '''some_identifier,first_name
111,bob
123,phil
222,mary
456,sue''')
string_to_dataframe("smallDataFrame", '''some_identifier
123
456
''')
anti_join_df = spark.sql("""
select *
from largeDataFrame L
where NOT EXISTS (
select 1 from smallDataFrame S
WHERE L.some_identifier = S.some_identifier
)
""")
print(anti_join_df.take(10))
anti_join_df.explain()
As expected, this outputs mary and bob:
[Row(some_identifier=222, first_name='mary'),
Row(some_identifier=111, first_name='bob')]
and the physical execution plan shows that it uses a Sort Merge Join:
== Physical Plan ==
SortMergeJoin [some_identifier#252], [some_identifier#264], LeftAnti
:- *(1) Sort [some_identifier#252 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(some_identifier#252, 200)
:     +- Scan ExistingRDD[some_identifier#252,first_name#253]
+- *(3) Sort [some_identifier#264 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(some_identifier#264, 200)
      +- *(2) Project [some_identifier#264]
         +- Scan ExistingRDD[some_identifier#264]
Notice that Sort Merge Join is more efficient for joining / anti-joining data sets of approximately the same size.
Since you mentioned that the small dataframe is much smaller, we should make sure the Spark optimizer chooses Broadcast Hash Join, which will be much more efficient in this scenario.
I will change the NOT EXISTS to a NOT IN clause for this:
anti_join_df = spark.sql("""
select *
from largeDataFrame L
where L.some_identifier NOT IN (
select S.some_identifier
from smallDataFrame S
)
""")
anti_join_df.explain()
Let's see what it gave us :
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftAnti, ((some_identifier#302 = some_identifier#314) || isnull((some_identifier#302 = some_identifier#314)))
:- Scan ExistingRDD[some_identifier#302,first_name#303]
+- BroadcastExchange IdentityBroadcastMode
   +- Scan ExistingRDD[some_identifier#314]
Notice that the Spark optimizer actually chose Broadcast Nested Loop Join and not Broadcast Hash Join. The former is okay since we have just two records to exclude from the left side.
Also notice that both execution plans contain LeftAnti, so this is similar to @eliasah's answer but implemented in pure SQL. It also shows that you can have more control over the physical execution plan.
PS. Also keep in mind that if the right dataframe is much smaller than the left one but bigger than just a few records, you want Broadcast Hash Join rather than Broadcast Nested Loop Join or Sort Merge Join. If that doesn't happen, you may need to raise spark.sql.autoBroadcastJoinThreshold: it defaults to 10 MB and has to be bigger than the size of the "smallDataFrame".
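One thing worth knowing before swapping NOT EXISTS for NOT IN: the two are not equivalent when NULLs are involved, and the "|| isnull(...)" term in the NOT IN plan above reflects that. It is likely also why the join condition is no longer a plain equality there. The sketch below is plain Python, not Spark; None stands in for NULL, and the function names are hypothetical.

```python
def sql_eq(a, b):
    """SQL equality: a NULL (None) on either side gives unknown (None)."""
    if a is None or b is None:
        return None
    return a == b


def not_exists_filter(left_keys, right_keys):
    """NOT EXISTS: drop a row only if some comparison is definitely true."""
    return [l for l in left_keys
            if not any(sql_eq(l, r) is True for r in right_keys)]


def not_in_filter(left_keys, right_keys):
    """NOT IN: keep a row only if every comparison is definitely false."""
    return [l for l in left_keys
            if all(sql_eq(l, r) is False for r in right_keys)]


left = [111, 123, 222, 456]

# Without NULLs both forms agree
print(not_exists_filter(left, [123, 456]), not_in_filter(left, [123, 456]))

# A single NULL on the right side makes every NOT IN comparison unknown
print(not_exists_filter(left, [123, 456, None]), not_in_filter(left, [123, 456, None]))
```

So if the key column of the small table can contain NULLs, NOT IN may return no rows at all, while NOT EXISTS (and left_anti) keep the unmatched rows.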
