Using Spark window with more than one partition when there is no obvious partitioning column

Here is the scenario. Assuming I have the following table:
identifier     line
51169081604    2
00034886044    22
51168939455    52
The challenge is, for every value of line, to select the next biggest line value, which I have accomplished with the following SQL:
SELECT i1.line, i1.identifier,
       MAX(i1.line) OVER (
           ORDER BY i1.line ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
       ) AS parent
FROM global_temp.documentIdentifiers i1
That partially solves the challenge; the problem is that when I execute this code on Spark, the performance is terrible. The warning message is very clear about it:
No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Partitioning by either of the two fields does not work; it breaks the result, of course, since each partition created is unaware of the lines in the others.
Does anyone have a clue how I can "select the next biggest line value" without performance issues?
Thanks

Using your "next" approach, and assuming the data is generated in ascending line order, the following does work in parallel; whether it is actually faster you will have to tell me, as I do not know your volume of data. In any event, you cannot solve this with SQL (%sql) alone.
Here goes:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

case class X(identifier: Long, line: Long) // Too hard to explain, just gets around issues with df --> rdd --> df.

// Gen some more data.
val df = Seq(
  (1000000, 23), (1200, 56), (1201, 58), (1202, 60),
  (8200, 63), (890000, 67), (990000, 99), (33000, 123),
  (33001, 124), (33002, 126), (33009, 132), (33019, 133),
  (33029, 134), (33039, 135), (800, 201), (1800, 999),
  (1801, 1999), (1802, 2999), (1800444, 9999)
).toDF("identifier", "line")

// Add a partition index so as to be able to apply parallelism - except for the upper boundary record of each partition, handled below.
val df2 = df.as[X]
  .rdd
  .mapPartitionsWithIndex((index, iter) => iter.map(x => (index, x)))
  .mapValues(v => (v.identifier, v.line))
  .map(x => (x._1, x._2._1, x._2._2))
  .toDF("part", "identifier", "line")

// Process per partition.
@transient val w = Window.partitionBy("part").orderBy("line")
val df3 = df2.withColumn("next", lead("line", 1, null).over(w))

// Process the upper boundary: the "next" of the last row in a partition is the lowest line of the following partition.
val df4 = df3.filter(df3("part") =!= 0).groupBy("part").agg(min("line").as("nxt")).toDF("pt", "nxt")
val df5 = df3.join(df4, df3("part") === df4("pt") - 1, "outer")
val df6 = df5.withColumn("next", when(col("next").isNull, col("nxt")).otherwise(col("next")))
  .select("identifier", "line", "next")

// Display. Sort accordingly.
df6.show(false)
returns:
+----------+----+----+
|identifier|line|next|
+----------+----+----+
|1000000 |23 |56 |
|1200 |56 |58 |
|1201 |58 |60 |
|1202 |60 |63 |
|8200 |63 |67 |
|890000 |67 |99 |
|990000 |99 |123 |
|33000 |123 |124 |
|33001 |124 |126 |
|33002 |126 |132 |
|33009 |132 |133 |
|33019 |133 |134 |
|33029 |134 |135 |
|33039 |135 |201 |
|800 |201 |999 |
|1800 |999 |1999|
|1801 |1999|2999|
|1802 |2999|9999|
|1800444 |9999|null|
+----------+----+----+
You can add additional sorting, etc. This relies on a narrow transformation when adding the partition index. How you load the data may be an issue, and caching is not considered here.
If the data is not ordered as stated above, range partitioning needs to be applied first.
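For the unordered case, one way to get that range partitioning (and, incidentally, to derive the part column without the RDD round-trip) is repartitionByRange plus spark_partition_id. A minimal PySpark sketch of the idea (the Scala API is analogous), assuming a DataFrame df with the identifier and line columns; the partition count of 4 is arbitrary:
from pyspark.sql import functions as F

# Range-partition by "line" so each partition holds a contiguous range of line values,
# then tag every row with its partition id; the boundary-fixing joins proceed as above.
df2 = (
    df.repartitionByRange(4, "line")
      .withColumn("part", F.spark_partition_id())
)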

Related

Spark window aggregate function not working intuitively with records ordering

I have the following example, which I am running on Spark 3.3:
import pyspark.sql.functions as F
from pyspark.sql import Window

inputData = [
    ("1", 333),
    ("1", 222),
    ("1", 111),
    ("2", 334)
]

inputDf = spark.createDataFrame(inputData, schema=["id", "val"])

window = Window.partitionBy("id")

aggregatedDf = (
    inputDf.withColumn("min_val", F.min(F.col("val")).over(window))
    .withColumn("max_val", F.max(F.col("val")).over(window))
).show()
The output is as expected; I am getting the correct min/max value for each window:
+---+---+-------+-------+
| id|val|min_val|max_val|
+---+---+-------+-------+
| 1|333| 111| 333|
| 1|222| 111| 333|
| 1|111| 111| 333|
| 2|334| 334| 334|
+---+---+-------+-------+
When I add orderBy to the window, the output is different:
window = Window.partitionBy("id").orderBy(F.col("val").desc())
+---+---+-------+-------+
| id|val|min_val|max_val|
+---+---+-------+-------+
| 1|333| 333| 333|
| 1|222| 222| 333|
| 1|111| 111| 333|
| 2|334| 334| 334|
+---+---+-------+-------+
As you can see, with descending ordering max_val is fine, but min_val changes from record to record.
I tried to find more information in the docs and here on SO, but no luck. To me it's not intuitive at all.
My expectation was that Spark would scan all records in a given partition and assign the min/max value to every record within that partition, which is true without ordering within the window, but it works differently when ordering is added.
Does anyone know why it works like this?
You need to add a frame specification to get the output you expect.
As per Docs:
Note When ordering is not defined, an unbounded window frame
(rowFrame, unboundedPreceding, unboundedFollowing) is used by default.
When ordering is defined, a growing window frame (rangeFrame,
unboundedPreceding, currentRow) is used by default.
Essentially, once an ordering is defined, Spark (like any SQL engine) will by default only consider the window up to the current row when evaluating the function for that row. By specifying the frame as unboundedPreceding to unboundedFollowing, we ask Spark to consider the whole partition instead.
For example, while evaluating the min function for the second row of your dataframe (ordered by val descending), Spark considers the window for id=1 to be only the first and second rows (between unboundedPreceding and currentRow).
This would work
window = Window.partitionBy("id")\
    .orderBy(F.col("val").desc())\
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
Output: with the full frame, min_val and max_val now match the first (unordered) example: 111/333 for id=1 and 334/334 for id=2.
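The same default-frame rule applies when the query is written as Spark SQL; a small sketch spelling the frame out explicitly, assuming the dataframe is registered as a temp view named t:
inputDf.createOrReplaceTempView("t")

spark.sql("""
    SELECT id, val,
           MIN(val) OVER (PARTITION BY id ORDER BY val DESC
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS min_val,
           MAX(val) OVER (PARTITION BY id ORDER BY val DESC
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max_val
    FROM t
""").show()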
To understand more about window frames, consider reading:
https://docs.oracle.com/cd/E17952_01/mysql-8.0-en/window-functions-frames.html
https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-window-functions-7b4e39ad3c86

fbprophet at scale with applyInPandas resulting in unexpected count values [PySpark]

I am using applyInPandas to implement a forecast function over sampled data, grouped by ID. The end goal is to calculate MAPE for each ID.
def forecast_balance(history_pd: pd.DataFrame) -> pd.DataFrame:
    anonym_cis = history_pd.at[0, 'ID']

    # instantiate the model, configure the parameters
    model = Prophet(
        interval_width=0.95,
        growth='linear',
        daily_seasonality=True,
        weekly_seasonality=True,
        yearly_seasonality=False,
        seasonality_mode='multiplicative'
    )

    # fit the model
    model.fit(history_pd)

    # configure predictions
    future_pd = model.make_future_dataframe(
        periods=30,
        freq='d',
        include_history=True
    )

    # make predictions
    results_pd = model.predict(future_pd)
    results_pd.loc[:, 'ID'] = anonym_cis

    # . . .

    # return predictions
    return results_pd[['ds', 'ID', 'yhat', 'yhat_upper', 'yhat_lower']]

results = (
    fr_sample
    .groupBy('ID')
    .applyInPandas(forecast_balance, schema=result_schema)
)
I am getting the expected prediction results. However, when I count the number of rows per ID in the input data and in the output data, they don't match. I would like to know where/how these extra 30 (292 - 262) rows per ID are getting created in the process.
Input counts:
+----------+-----+
| ID|count|
+----------+-----+
| 482726| 262|
| 482769| 262|
| 483946| 262|
| 484124| 262|
| 484364| 262|
| 485103| 262|
+----------+-----+
Output counts:
+----------+-----+
| ID|count|
+----------+-----+
| 482726| 292|
| 482769| 292|
| 483946| 292|
| 484124| 292|
| 484364| 292|
| 485103| 292|
+----------+-----+
Note:
This is how I am calculating MAPE as of now; it is not per ID but over all the data, hence it results in a single value (e.g. 1.4382).
def gr_mape_val(pd_sample_df, result_df):
    result_df = result_df.toPandas()
    actuals_pd = pd_sample_df[pd_sample_df['ds'] < date(2022, 3, 19)]['y']
    predicted_pd = result_df[result_df['ds'] < pd.to_datetime('2022-03-19')]['yhat']
    mape = mean_absolute_percentage_error(actuals_pd, predicted_pd)
    return mape
To use it in a groupBy fashion for each ID, I need the two count values mentioned above to match, but I am not able to figure out how.
I just found out what was going on there:
Basically, with make_future_dataframe I am creating 30 extra data points per ID, which was changing the total count of predicted_pd.
This can simply be solved by using df.na.drop() after an outer join:
pd_sample_df.join(result_df, on=['ID', 'ds'], how='outer').na.drop()
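With the counts aligned, the per-ID MAPE can then be computed with a second applyInPandas pass. A rough sketch (not the asker's exact code): it assumes pd_sample_df is a Spark DataFrame holding ID, ds and y, that result_df holds ID, ds and yhat as above, that scikit-learn is available, and that the ID type in the output schema matches your data; mape_per_id is a hypothetical helper name.
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error

def mape_per_id(pdf: pd.DataFrame) -> pd.DataFrame:
    # one row per ID: MAPE of predictions vs. actuals for that ID
    return pd.DataFrame({
        "ID": [pdf["ID"].iloc[0]],
        "mape": [mean_absolute_percentage_error(pdf["y"], pdf["yhat"])],
    })

aligned = pd_sample_df.join(result_df, on=["ID", "ds"], how="outer").na.drop()
# "ID long" is an assumption; adjust the type to match your data
mape_df = aligned.groupBy("ID").applyInPandas(mape_per_id, schema="ID long, mape double")
mape_df.show()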

Running show() twice gives the same results for the rand() function on a DataFrame

The code below produces random numbers that differ per row, as expected. So far, so fine. But I am apparently missing some basic aspect in my thinking.
from pyspark.sql import functions as F
df = spark.range(10).withColumn("randomNum",F.rand())
df.show(truncate=False)
returning:
+---+-------------------+
|id |randomNum |
+---+-------------------+
|0 |0.8128581612050234 |
|1 |0.40656852491856355|
|2 |0.9444869347865689 |
|3 |0.10391423680687417|
|4 |0.05285485891027453|
|5 |0.5140906081158558 |
|6 |0.900727341820192 |
|7 |0.11046600268909801|
|8 |0.6509183512961298 |
|9 |0.5060097759646045 |
+---+-------------------+
Then, invoking show() again (the Action below), why do we get the same random number sequence as the first time round? Is show() overriding the traditional Action approach in some respects because it sees that it is the same DF? If so, that is not true for all methods in pyspark, etc. I am running this in 2 cells in a Databricks notebook.
Looking at the Spark UI, it uses the same seed twice. Why? This deterministic aspect seems to be at odds with the concept of an Action as we are taught.
df.show(truncate=False)
This is the result of the random generator being initialized only once per partition (source).
Hence, if the partition layout does not change, the same initial seed is used in every consecutive execution, and thus rand generates the same sequence.
When partitions are NOT fixed, rand's behavior becomes non-deterministic, which is documented (in the source, at least) through SPARK-13380.
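If you want repeated actions to be guaranteed to agree regardless of partitioning, or conversely want to control when a new draw happens, here is a small sketch of both options (the seed value 42 is arbitrary):
from pyspark.sql import functions as F

# Option 1: materialise the values once; later actions reuse the cached rows.
df_pinned = spark.range(10).withColumn("randomNum", F.rand()).cache()
df_pinned.count()  # force materialisation

# Option 2: pass an explicit seed so the sequence is chosen by you rather than baked
# into the plan; change the seed to force a different sequence.
df_seeded = spark.range(10).withColumn("randomNum", F.rand(seed=42))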

Spark SQL window function with complex condition

This is probably easiest to explain through example. Suppose I have a DataFrame of user logins to a website, for instance:
scala> df.show(5)
+----------------+----------+
| user_name|login_date|
+----------------+----------+
|SirChillingtonIV|2012-01-04|
|Booooooo99900098|2012-01-04|
|Booooooo99900098|2012-01-06|
| OprahWinfreyJr|2012-01-10|
|SirChillingtonIV|2012-01-11|
+----------------+----------+
only showing top 5 rows
I would like to add to this a column indicating when they became an active user on the site. But there is one caveat: there is a time period during which a user is considered active, and after this period, if they log in again, their became_active date resets. Suppose this period is 5 days. Then the desired table derived from the above table would be something like this:
+----------------+----------+-------------+
| user_name|login_date|became_active|
+----------------+----------+-------------+
|SirChillingtonIV|2012-01-04| 2012-01-04|
|Booooooo99900098|2012-01-04| 2012-01-04|
|Booooooo99900098|2012-01-06| 2012-01-04|
| OprahWinfreyJr|2012-01-10| 2012-01-10|
|SirChillingtonIV|2012-01-11| 2012-01-11|
+----------------+----------+-------------+
So, in particular, SirChillingtonIV's became_active date was reset because their second login came after the active period expired, but Booooooo99900098's became_active date was not reset the second time he/she logged in, because it fell within the active period.
My initial thought was to use window functions with lag, and then using the lagged values to fill the became_active column; for instance, something starting roughly like:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val window = Window.partitionBy("user_name").orderBy("login_date")
val df2 = df.withColumn("tmp", lag("login_date", 1).over(window))
Then, the rule to fill in the became_active date would be, if tmp is null (i.e., if it's the first ever login) or if login_date - tmp >= 5 then became_active = login_date; otherwise, go to the next most recent value in tmp and apply the same rule. This suggests a recursive approach, which I'm having trouble imagining a way to implement.
My questions: Is this a viable approach, and if so, how can I "go back" and look at earlier values of tmp until I find one where I stop? I can't, to my knowledge, iterate through values of a Spark SQL Column. Is there another way to achieve this result?
Spark >= 3.2
Recent Spark releases provide native support for session windows in both batch and structured streaming queries (see SPARK-10816 and its sub-tasks, especially SPARK-34893).
The official documentation provides a nice usage example.
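For reference, a hedged PySpark sketch of what the native session window API looks like for this problem (the Scala API is analogous); note it aggregates per session, so reproducing the per-row became_active column still requires joining the session start back onto the individual logins, and the exact behaviour at a gap of exactly 5 days should be checked against your definition:
from pyspark.sql import functions as F

sessions = (
    df.withColumn("login_ts", F.col("login_date").cast("timestamp"))
      .groupBy("user_name", F.session_window("login_ts", "5 days"))
      .agg(F.min("login_date").alias("became_active"))
)
sessions.select("user_name", "session_window.start", "became_active").show(truncate=False)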
Spark < 3.2
Here is the trick. Import a bunch of functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, datediff, lag, lit, min, sum}
Define windows:
val userWindow = Window.partitionBy("user_name").orderBy("login_date")
val userSessionWindow = Window.partitionBy("user_name", "session")
Find the points where new sessions start:
val newSession = (coalesce(
  datediff($"login_date", lag($"login_date", 1).over(userWindow)),
  lit(0)
) > 5).cast("bigint")

val sessionized = df.withColumn("session", sum(newSession).over(userWindow))
Find the earliest date per session:
val result = sessionized
  .withColumn("became_active", min($"login_date").over(userSessionWindow))
  .drop("session")
With dataset defined as:
val df = Seq(
  ("SirChillingtonIV", "2012-01-04"), ("Booooooo99900098", "2012-01-04"),
  ("Booooooo99900098", "2012-01-06"), ("OprahWinfreyJr", "2012-01-10"),
  ("SirChillingtonIV", "2012-01-11"), ("SirChillingtonIV", "2012-01-14"),
  ("SirChillingtonIV", "2012-08-11")
).toDF("user_name", "login_date")
The result is:
+----------------+----------+-------------+
| user_name|login_date|became_active|
+----------------+----------+-------------+
| OprahWinfreyJr|2012-01-10| 2012-01-10|
|SirChillingtonIV|2012-01-04| 2012-01-04| <- The first session for user
|SirChillingtonIV|2012-01-11| 2012-01-11| <- The second session for user
|SirChillingtonIV|2012-01-14| 2012-01-11|
|SirChillingtonIV|2012-08-11| 2012-08-11| <- The third session for user
|Booooooo99900098|2012-01-04| 2012-01-04|
|Booooooo99900098|2012-01-06| 2012-01-04|
+----------------+----------+-------------+
Refactoring the other answer to work with PySpark
In PySpark you can do it as below.
Create the data frame:
df = sqlContext.createDataFrame(
    [
        ("SirChillingtonIV", "2012-01-04"),
        ("Booooooo99900098", "2012-01-04"),
        ("Booooooo99900098", "2012-01-06"),
        ("OprahWinfreyJr", "2012-01-10"),
        ("SirChillingtonIV", "2012-01-11"),
        ("SirChillingtonIV", "2012-01-14"),
        ("SirChillingtonIV", "2012-08-11")
    ],
    ("user_name", "login_date"))
The above code creates a data frame like the one below:
+----------------+----------+
| user_name|login_date|
+----------------+----------+
|SirChillingtonIV|2012-01-04|
|Booooooo99900098|2012-01-04|
|Booooooo99900098|2012-01-06|
| OprahWinfreyJr|2012-01-10|
|SirChillingtonIV|2012-01-11|
|SirChillingtonIV|2012-01-14|
|SirChillingtonIV|2012-08-11|
+----------------+----------+
Now we first want to find out where the difference between consecutive login_date values is more than 5 days.
For this, do as below.
Necessary imports:
from pyspark.sql import functions as f
from pyspark.sql import Window

# defining window partitions
login_window = Window.partitionBy("user_name").orderBy("login_date")
session_window = Window.partitionBy("user_name", "session")

session_df = df.withColumn(
    "session",
    f.sum(
        (f.coalesce(f.datediff("login_date", f.lag("login_date", 1).over(login_window)), f.lit(0)) > 5).cast("int")
    ).over(login_window)
)
When we run the above line of code, if the datediff is NULL then the coalesce function replaces the NULL with 0.
+----------------+----------+-------+
| user_name|login_date|session|
+----------------+----------+-------+
| OprahWinfreyJr|2012-01-10| 0|
|SirChillingtonIV|2012-01-04| 0|
|SirChillingtonIV|2012-01-11| 1|
|SirChillingtonIV|2012-01-14| 1|
|SirChillingtonIV|2012-08-11| 2|
|Booooooo99900098|2012-01-04| 0|
|Booooooo99900098|2012-01-06| 0|
+----------------+----------+-------+
# add the became_active column by finding the min login_date over each window partitioned by user_name and the session created in the step above
final_df = session_df.withColumn("became_active", f.min("login_date").over(session_window)).drop("session")
+----------------+----------+-------------+
| user_name|login_date|became_active|
+----------------+----------+-------------+
| OprahWinfreyJr|2012-01-10| 2012-01-10|
|SirChillingtonIV|2012-01-04| 2012-01-04|
|SirChillingtonIV|2012-01-11| 2012-01-11|
|SirChillingtonIV|2012-01-14| 2012-01-11|
|SirChillingtonIV|2012-08-11| 2012-08-11|
|Booooooo99900098|2012-01-04| 2012-01-04|
|Booooooo99900098|2012-01-06| 2012-01-04|
+----------------+----------+-------------+

How to apply a custom filtering function on a Spark DataFrame

I have a DataFrame of the form:
A_DF = |id_A: Int|concatCSV: String|
and another one:
B_DF = |id_B: Int|triplet: List[String]|
Examples of concatCSV could look like:
"StringD, StringB, StringF, StringE, StringZ"
"StringA, StringB, StringX, StringY, StringZ"
...
while a triplet is something like:
("StringA", "StringF", "StringZ")
("StringB", "StringU", "StringR")
...
I want to produce the Cartesian set of A_DF and B_DF, e.g.:
| id_A: Int | concatCSV: String | id_B: Int | triplet: List[String] |
| 14 | "StringD, StringB, StringF, StringE, StringZ" | 21 | ("StringA", "StringF", "StringZ")|
| 14 | "StringD, StringB, StringF, StringE, StringZ" | 45 | ("StringB", "StringU", "StringR")|
| 18 | "StringA, StringB, StringX, StringY, StringG" | 21 | ("StringA", "StringF", "StringZ")|
| 18 | "StringA, StringB, StringX, StringY, StringG" | 45 | ("StringB", "StringU", "StringR")|
| ... | | | |
Then keep just the records that have at least two substrings (e.g. StringA, StringB) from A_DF("concatCSV") that appear in B_DF("triplet"), i.e. use filter to exclude those that don't satisfy this condition.
First question is: can I do this without converting the DFs into RDDs?
Second question is: can I ideally do the whole thing in the join step--as a where condition?
I have tried experimenting with something like:
val cartesianRDD = A_DF
  .join(B_DF, "right")
  .where($"triplet".exists($"concatCSV".contains(_)))
but where cannot be resolved. I tried it with filter instead of where, but still no luck. Also, for some strange reason, the type annotation for cartesianRDD is SchemaRDD and not DataFrame. How did I end up with that? Finally, what I am trying above (the short code I wrote) is incomplete, as it would keep records with just one substring from concatCSV found in triplet.
So, third question is: Should I just change to RDDs and solve it with a custom filtering function?
Finally, last question: Can I use a custom filtering function with DataFrames?
Thanks for the help.
The function CROSS JOIN is implemented in Hive, so you could first do the cross-join using Hive SQL:
A_DF.registerTempTable("a")
B_DF.registerTempTable("b")
// sqlContext should be really a HiveContext
val result = sqlContext.sql("SELECT * FROM a CROSS JOIN b")
Then you can filter down to your expected output using two UDFs: one that converts the string into an array of words, and a second one that gives the length of the intersection of the resulting array column and the existing triplet column:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}

val splitArr = udf { (s: String) => s.split(",").map(_.trim) }
val commonLen = udf { (a: WrappedArray[String], b: WrappedArray[String]) => a.intersect(b).length }

val temp = result
  .withColumn("concatArr", splitArr(col("concatCSV")))
  .select(col("*"), commonLen(col("triplet"), col("concatArr")).alias("comm"))
  .filter(col("comm") >= 2)
  .drop("comm")
  .drop("concatArr")

temp.show
+----+--------------------+----+--------------------+
|id_A| concatCSV|id_B| triplet|
+----+--------------------+----+--------------------+
| 14|StringD, StringB,...| 21|[StringA, StringF...|
| 18|StringA, StringB,...| 21|[StringA, StringF...|
+----+--------------------+----+--------------------+
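On current Spark versions the same result can be obtained without a Hive context or UDFs, using crossJoin (Spark 2.1+) together with the built-in split, array_intersect and size functions (Spark 2.4+). A hedged PySpark sketch, assuming A_DF and B_DF as described in the question:
from pyspark.sql import functions as F

result = (
    A_DF.crossJoin(B_DF)
        .withColumn("concatArr", F.split(F.col("concatCSV"), r",\s*"))  # CSV string -> array of strings
        .filter(F.size(F.array_intersect(F.col("concatArr"), F.col("triplet"))) >= 2)
        .drop("concatArr")
)
result.show(truncate=False)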