PySpark SQL: How to use multiple conditions in "Window -> PartitionBy -> Range Between"

Here is our test input data (85 rows):
+-------------+-------+---------+
| auth_dttm|acc_num|tranAmnts|
+-------------+-------+---------+
|11/8/20 11:20| 123| 100|
|11/8/20 11:19| 123| 100|
|11/8/20 11:18| 123| 100|
|11/8/20 11:17| 123| 100|
|11/8/20 11:16| 123| 100|
|11/8/20 11:15| 123| 100|
|11/8/20 11:14| 123| 100|
|11/8/20 11:13| 123| 100|
|11/8/20 11:12| 123| 100|
|11/8/20 11:11| 123| 100|
|11/8/20 11:10| 123| 100|
|11/8/20 11:09| 123| 100|
|11/8/20 11:08| 123| 100|
|11/8/20 11:07| 123| 100|
|11/8/20 11:06| 123| 100|
|11/8/20 11:05| 123| 100|
|11/8/20 11:04| 123| 100|
|11/8/20 11:03| 123| 100|
|11/8/20 11:02| 123| 100|
|11/8/20 11:01| 123| 100|
|11/8/20 11:00| 123| 100|
|11/8/20 10:59| 123| 100|
|11/8/20 10:58| 123| 100|
|11/8/20 10:57| 123| 100|
|11/8/20 10:56| 123| 100|
|11/8/20 10:55| 123| 100|
|11/8/20 10:54| 123| 100|
|11/8/20 10:53| 123| 100|
|11/8/20 10:52| 123| 100|
|11/8/20 10:51| 123| 100|
|11/8/20 10:50| 123| 100|
|11/8/20 11:20| 321| 10000|
|11/8/20 11:19| 321| 10000|
|11/8/20 11:18| 321| 10000|
|11/8/20 11:17| 321| 10000|
|11/8/20 11:16| 321| 10000|
|11/8/20 11:15| 321| 10000|
|11/8/20 11:14| 321| 10000|
|11/8/20 11:13| 321| 10000|
|11/8/20 11:12| 321| 10000|
|11/8/20 11:11| 321| 10000|
|11/8/20 11:10| 321| 10000|
|11/8/20 11:09| 321| 10000|
|11/8/20 11:08| 321| 10000|
|11/8/20 11:07| 321| 10000|
|11/8/20 11:06| 321| 10000|
|11/8/20 11:05| 321| 10000|
|11/8/20 11:04| 321| 10000|
|11/8/20 11:03| 321| 10000|
|11/8/20 11:02| 321| 10000|
|11/8/20 11:01| 321| 10000|
|11/8/20 11:00| 321| 10000|
|11/8/20 10:59| 321| 10000|
|11/8/20 10:58| 321| 10000|
|11/8/20 10:57| 321| 10000|
|11/8/20 10:56| 321| 10000|
|11/8/20 10:55| 321| 10000|
|11/8/20 10:54| 321| 10000|
|11/8/20 10:53| 321| 10000|
|11/8/20 10:52| 321| 10000|
|11/8/20 10:51| 321| 10000|
|11/8/20 10:50| 321| 10000|
|11/8/20 10:49| 321| 10000|
|11/8/20 10:48| 321| 10000|
|11/8/20 10:47| 321| 10000|
|11/8/20 10:46| 321| 10000|
|11/8/20 10:45| 321| 10000|
|11/8/20 10:44| 321| 10000|
|11/8/20 10:43| 321| 10000|
|11/8/20 10:42| 321| 10000|
|11/8/20 10:41| 321| 10000|
|11/8/20 10:40| 321| 10000|
|11/8/20 10:39| 321| 10000|
|11/8/20 10:38| 321| 10000|
|11/8/20 10:37| 321| 10000|
|11/8/20 10:36| 321| 10000|
|11/8/20 10:35| 321| 10000|
|11/8/20 10:34| 321| 10000|
|11/8/20 10:33| 321| 10000|
| 9/1/20 11:18| 321| 10000|
| 7/1/20 11:18| 321| 10000|
| 5/1/20 11:18| 321| 10000|
| 3/1/20 11:18| 321| 10000|
| 1/1/20 11:18| 321| 10000|
+-------------+-------+---------+
What I am trying to do is:
For every transaction, count and sum the top 45 transactions from the last 24 hours.
I am able to do this with the following approach:
1. Self join on the account number.
2. Add a new time-difference column, DiffInSeconds.
3. Filter by DiffInSeconds >= 0 and DiffInSeconds <= 86400.
4. Add a new row_number column, partitioning by acc_num and datetime and ordering by DiffInSeconds.
5. Sum and count the top 45 rows.
# PySpark imports
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window

# Create a Spark session
spark = (
    SparkSession
    .builder
    .appName("test")
    .config('spark.sql.legacy.timeParserPolicy', 'LEGACY')
    .getOrCreate()
)

# Read the input file
sparkDf = spark.read.csv('input.csv', header=True)

# DateTime conversion (converting string dates to PySpark timestamps/dates)
sparkDf = (
    sparkDf
    .withColumn('datetime', F.to_timestamp(sparkDf.auth_dttm, 'M/d/yyyy HH:mm'))
    .withColumn('date', F.to_date(F.to_timestamp(sparkDf.auth_dttm, 'M/d/yyyy HH:mm')))
    .drop('auth_dttm')
)

# Self join on account, keep only pairs whose time difference is within the last 24 hours
joinedDf = (
    sparkDf.alias('df1')
    .join(sparkDf.alias('df2'), F.col('df1.acc_num') == F.col('df2.acc_num'), 'inner')
    .withColumn('DiffInSeconds', F.unix_timestamp(F.col('df1.datetime')) - F.unix_timestamp(F.col('df2.datetime')))
    .select(
        F.col('df1.acc_num'),
        F.col('df1.datetime'),
        F.col('df1.tranAmnts'),
        F.col('df2.datetime').alias('trailing_datetime'),
        F.col('df2.tranAmnts').alias('trailing_tranAmnts'),
        'DiffInSeconds'
    )
    .filter('DiffInSeconds >= 0 and DiffInSeconds <= 86400')
)

# Define a window partitioned by account and datetime, ordered by DiffInSeconds
window = Window.partitionBy("acc_num", "datetime").orderBy(F.col("DiffInSeconds").asc())

# Add a row_count column using the window above
windowDf = joinedDf.withColumn("row_count", F.row_number().over(window))

# Final aggregation: keep the 45 nearest trailing rows per transaction and sum/count their amounts
aggDf = (
    windowDf
    .filter("row_count <= 45")
    .groupBy('acc_num', 'datetime')
    .agg(
        F.sum('trailing_tranAmnts').alias('sumTranAmnts'),
        F.count('trailing_tranAmnts').alias('countTranAmnts')
    )
)
aggDf.show()
Output:
+-------+-------------------+------------+--------------+
|acc_num| datetime|sumTranAmnts|countTranAmnts|
+-------+-------------------+------------+--------------+
| 321|0020-11-08 11:18:00| 450000.0| 45|
| 321|0020-11-08 10:44:00| 120000.0| 12|
| 123|0020-11-08 10:56:00| 700.0| 7|
| 321|0020-11-08 11:09:00| 370000.0| 37|
| 123|0020-11-08 11:14:00| 2500.0| 25|
| 321|0020-11-08 10:51:00| 190000.0| 19|
| 321|0020-11-08 10:53:00| 210000.0| 21|
| 123|0020-11-08 10:55:00| 600.0| 6|
| 321|0020-07-01 11:18:00| 10000.0| 1|
| 123|0020-11-08 11:00:00| 1100.0| 11|
| 123|0020-11-08 11:19:00| 3000.0| 30|
| 321|0020-11-08 11:20:00| 450000.0| 45|
| 321|0020-11-08 11:01:00| 290000.0| 29|
| 321|0020-11-08 10:46:00| 140000.0| 14|
| 321|0020-11-08 10:34:00| 20000.0| 2|
| 321|0020-11-08 10:36:00| 40000.0| 4|
| 123|0020-11-08 11:09:00| 2000.0| 20|
| 123|0020-11-08 11:13:00| 2400.0| 24|
| 321|0020-11-08 11:17:00| 450000.0| 45|
| 321|0020-11-08 10:50:00| 180000.0| 18|
+-------+-------------------+------------+--------------+
only showing top 20 rows
My concern: this works well on a small dataset, but I am pretty sure the self join will blow up when the number of rows is in the millions.
I am trying to solve this without a self join; this is what I have so far:
sparkDf.createOrReplaceTempView('input')
df = spark.sql("""
    SELECT
        acc_num,
        tranAmnts,
        datetime,
        sum(tranAmnts) OVER (
            PARTITION BY acc_num
            ORDER BY datetime
            RANGE BETWEEN INTERVAL 24 HOURS PRECEDING AND CURRENT ROW) AS totalAmnt,
        count(tranAmnts) OVER (
            PARTITION BY acc_num
            ORDER BY datetime
            RANGE BETWEEN INTERVAL 24 HOURS PRECEDING AND CURRENT ROW) AS totalCount
    FROM input
""")
But I am unable to figure out how to use multiple conditions in "RANGE BETWEEN", so that I can specify both the past-24-hour condition and the top-45 limit.
Edit: As I haven't received an answer on how to use multiple conditions in the "RANGE BETWEEN" clause, I would like to see if someone has suggestions on how I can improve the working self join to make it more performant.
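One refinement I am considering (a rough sketch, not benchmarked) is to push the 24-hour bound into the join condition itself, so that rows outside the window are never materialized; it still compares every pair of rows per account, so it only trims the shuffle rather than removing the quadratic blow-up:
joinedDf = (
    sparkDf.alias('df1')
    .join(
        sparkDf.alias('df2'),
        (F.col('df1.acc_num') == F.col('df2.acc_num'))
        # keep only trailing rows from the previous 24 hours
        & (F.unix_timestamp(F.col('df1.datetime')) - F.unix_timestamp(F.col('df2.datetime'))).between(0, 86400),
        'inner'
    )
)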
Thanks in Advance,
Hussain Bohra

Related

Append new rows to a Spark dataframe based on a condition

I need some help resolving this tricky transformation.
My Spark dataframe looks like this:
+---+---+--------+---------+-------+--------+---------+
| A| B|rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
+---+---+--------+---------+-------+--------+---------+
|123| 1| 500| 10| 2| 1001| false|
|123| 1| 501| 10| 2| 1001| false|
|123| 4| 502| 60| 5| 1001| true|
|123| 2| 504| 40| 30| 1003| false|
|123| 5| 510| 10| 10| 1003| true|
+---+---+--------+---------+-------+--------+---------+
I need to add new records (rows) for a combination of A and B (having edit_flag as false) if the item_nbr matches another A and B combination having edit_flag as true.
The new row will have every column copied from its parent row except rcv_qty and rcvr_nbr. So the final output will look like:
+---+---+--------+---------+-------+--------+---------+
| A| B|rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
+---+---+--------+---------+-------+--------+---------+
|123| 1| 500| 10| 2| 1001| false|
|123| 1| 501| 10| 2| 1001| false|
|123| 1| 502| 10| 5| 1001| false|
|123| 4| 502| 60| 5| 1001| true|
|123| 2| 504| 40| 30| 1003| false|
|123| 2| 510| 40| 10| 1003| false|
|123| 5| 510| 10| 10| 1003| true|
+---+---+--------+---------+-------+--------+---------+
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import spark.implicits._

case class Source(
    A: Int,
    B: Int,
    rcvr_nbr: Int,
    order_qty: Int,
    rcv_qty: Int,
    item_nbr: Int,
    edit_flag: Boolean
)

val sourceDF = Seq(
    Source(123, 1, 500, 10, 2, 1001, false),
    Source(123, 1, 501, 10, 2, 1001, false),
    Source(123, 4, 502, 60, 5, 1001, true),
    Source(123, 2, 504, 40, 30, 1003, false),
    Source(123, 5, 510, 10, 10, 1003, true)
).toDF()
sourceDF.printSchema()
// root
// |-- A: integer (nullable = false)
// |-- B: integer (nullable = false)
// |-- rcvr_nbr: integer (nullable = false)
// |-- order_qty: integer (nullable = false)
// |-- rcv_qty: integer (nullable = false)
// |-- item_nbr: integer (nullable = false)
// |-- edit_flag: boolean (nullable = false)
sourceDF.show(false)
// +---+---+--------+---------+-------+--------+---------+
// |A |B |rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
// +---+---+--------+---------+-------+--------+---------+
// |123|1 |500 |10 |2 |1001 |false |
// |123|1 |501 |10 |2 |1001 |false |
// |123|4 |502 |60 |5 |1001 |true |
// |123|2 |504 |40 |30 |1003 |false |
// |123|5 |510 |10 |10 |1003 |true |
// +---+---+--------+---------+-------+--------+---------+
val sourceDFTrueF = sourceDF.filter(col("edit_flag").equalTo(true))
val sourceDFTrue = sourceDFTrueF.columns.foldLeft(sourceDFTrueF) { (tmpDF, col) =>
    tmpDF.withColumnRenamed(col, s"${col}_true")
}
val sourceDFFalse = sourceDF
    .filter(col("edit_flag").equalTo(false))
    .dropDuplicates("item_nbr")
val resDF =
    sourceDFFalse
        .join(
            sourceDFTrue,
            sourceDFFalse.col("item_nbr") === sourceDFTrue.col("item_nbr_true"),
            "inner"
        )
        .select(
            sourceDFFalse.col("A"),
            sourceDFFalse.col("B"),
            sourceDFTrue.col("rcvr_nbr_true").alias("rcvr_nbr"),
            sourceDFFalse.col("order_qty"),
            sourceDFTrue.col("rcv_qty_true").alias("rcv_qty"),
            sourceDFFalse.col("item_nbr"),
            sourceDFFalse.col("edit_flag")
        )
        .union(sourceDF)
        .orderBy(col("A"), col("item_nbr"), col("edit_flag"))
resDF.show(false)
// +---+---+--------+---------+-------+--------+---------+
// |A |B |rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
// +---+---+--------+---------+-------+--------+---------+
// |123|1 |501 |10 |2 |1001 |false |
// |123|1 |500 |10 |2 |1001 |false |
// |123|1 |502 |10 |5 |1001 |false |
// |123|4 |502 |60 |5 |1001 |true |
// |123|2 |504 |40 |30 |1003 |false |
// |123|2 |510 |40 |10 |1003 |false |
// |123|5 |510 |10 |10 |1003 |true |
// +---+---+--------+---------+-------+--------+---------+
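A rough PySpark equivalent of the join above, for readers working in Python (a sketch using the same column names; source_df is my own name for the input DataFrame):
from pyspark.sql import functions as F

# split the input into the edit_flag=true rows and the deduplicated edit_flag=false rows
src_true = source_df.filter(F.col("edit_flag"))
src_false = source_df.filter(~F.col("edit_flag")).dropDuplicates(["item_nbr"])

res_df = (
    src_false.alias("f")
    .join(src_true.alias("t"), F.col("f.item_nbr") == F.col("t.item_nbr"), "inner")
    .select(
        F.col("f.A"), F.col("f.B"),
        F.col("t.rcvr_nbr").alias("rcvr_nbr"),   # rcvr_nbr and rcv_qty come from the true row
        F.col("f.order_qty"),
        F.col("t.rcv_qty").alias("rcv_qty"),
        F.col("f.item_nbr"), F.col("f.edit_flag"),
    )
    .union(source_df)
    .orderBy("A", "item_nbr", "edit_flag")
)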

Replication of DataFrame rows based on Date

I have a Dataframe which has the following structure and data
Source:
Column1(String), Column2(String), Date
-----------------------
1, 2, 01/01/2021
A, B, 05/01/2021
M, N, 10/01/2021
I want to transform it to the following (the first two columns are replicated and the date is incremented until the subsequent date), as follows:
Column1(String), Column2(String), Date
-----------------------
1, 2, 01/01/2021
1, 2, 02/01/2021
1, 2, 03/01/2021
1, 2, 04/01/2021
A, B, 05/01/2021
A, B, 06/01/2021
A, B, 07/01/2021
A, B, 08/01/2021
A, B, 09/01/2021
M, N, 10/01/2021
Any idea how this can be achieved in Scala Spark?
Here is the working solution:
val dfp1 = List(("1001", 11, "01/10/2021"), ("1002", 21, "05/10/2021"), ("1001", 12, "10/10/2021"), ("1002", 22, "15/10/2021")).toDF("SerialNumber","SomeVal", "Date")
val dfProducts = dfp1.withColumn("Date", to_date($"Date","dd/MM/yyyy"))
dfProducts.show
+------------+-------+----------+
|SerialNumber|SomeVal| Date|
+------------+-------+----------+
| 1001| 11|2021-10-01|
| 1002| 21|2021-10-05|
| 1001| 12|2021-10-10|
| 1002| 22|2021-10-15|
+------------+-------+----------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val overColumns = Window.partitionBy("SerialNumber").orderBy("Date").rowsBetween(1, Window.unboundedFollowing)
val dfProduct1 = dfProducts.withColumn("NextSerialDate",first("Date", true).over(overColumns)).orderBy("Date")
dfProduct1.show
+------------+-------+----------+--------------+
|SerialNumber|SomeVal| Date|NextSerialDate|
+------------+-------+----------+--------------+
| 1001| 11|2021-10-01| 2021-10-10|
| 1002| 21|2021-10-05| 2021-10-15|
| 1001| 12|2021-10-10| null|
| 1002| 22|2021-10-15| null|
+------------+-------+----------+--------------+
val dfProduct2= dfProduct1.withColumn("NextSerialDate", when(col("NextSerialDate").isNull, col("Date")).otherwise(date_sub(col("NextSerialDate"), 1))).orderBy("SerialNumber")
dfProduct2.show
+------------+-------+----------+--------------+
|SerialNumber|SomeVal| Date|NextSerialDate|
+------------+-------+----------+--------------+
| 1001| 11|2021-10-01| 2021-10-09|
| 1001| 12|2021-10-10| 2021-10-10|
| 1002| 21|2021-10-05| 2021-10-14|
| 1002| 22|2021-10-15| 2021-10-15|
+------------+-------+----------+--------------+
val dfProduct3= dfProduct2.withColumn("ExpandedDate", explode_outer(sequence($"Date", $"NextSerialDate")))
dfProduct3.show
+------------+-------+----------+--------------+------------+
|SerialNumber|SomeVal| Date|NextSerialDate|ExpandedDate|
+------------+-------+----------+--------------+------------+
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-01|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-02|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-03|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-04|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-05|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-06|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-07|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-08|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-09|
| 1001| 12|2021-10-10| 2021-10-10| 2021-10-10|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-05|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-06|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-07|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-08|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-09|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-10|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-11|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-12|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-13|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-14|
+------------+-------+----------+--------------+------------+
only showing top 20 rows
val dfProduct4 = dfProduct3.drop("Date", "NextSerialDate").withColumn("Date", col("ExpandedDate")).drop("ExpandedDate")
dfProduct4.show(50, false)
+------------+-------+----------+
|SerialNumber|SomeVal|Date |
+------------+-------+----------+
|1001 |11 |2021-10-01|
|1001 |11 |2021-10-02|
|1001 |11 |2021-10-03|
|1001 |11 |2021-10-04|
|1001 |11 |2021-10-05|
|1001 |11 |2021-10-06|
|1001 |11 |2021-10-07|
|1001 |11 |2021-10-08|
|1001 |11 |2021-10-09|
|1001 |12 |2021-10-10|
|1002 |21 |2021-10-05|
|1002 |21 |2021-10-06|
|1002 |21 |2021-10-07|
|1002 |21 |2021-10-08|
|1002 |21 |2021-10-09|
|1002 |21 |2021-10-10|
|1002 |21 |2021-10-11|
|1002 |21 |2021-10-12|
|1002 |21 |2021-10-13|
|1002 |21 |2021-10-14|
|1002 |22 |2021-10-15|
+------------+-------+----------+
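For completeness, the same technique translates fairly directly to PySpark (a sketch, assuming a PySpark DataFrame df_products with the same SerialNumber/SomeVal/Date columns as dfProducts above):
from pyspark.sql import functions as F, Window

w = Window.partitionBy("SerialNumber").orderBy("Date").rowsBetween(1, Window.unboundedFollowing)

expanded = (
    df_products
    # date of the next row for the same SerialNumber, null for the last row
    .withColumn("NextSerialDate", F.first("Date", ignorenulls=True).over(w))
    # the last row keeps its own date, others stop one day before the next date
    .withColumn("NextSerialDate",
                F.when(F.col("NextSerialDate").isNull(), F.col("Date"))
                 .otherwise(F.date_sub(F.col("NextSerialDate"), 1)))
    # expand each row into one row per date in the range
    .withColumn("Date", F.explode(F.sequence(F.col("Date"), F.col("NextSerialDate"))))
    .drop("NextSerialDate")
)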

Divide function in pyspark

Suppose I have this dataframe on PySpark:
df = spark.createDataFrame([
    ['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
    ['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
    ['red', 'banana', 7, 70], ['red', 'grape', 8, 80]], schema=['color', 'fruit', 'v1', 'v2'])
I want to create a function that divides column v2 by column v1, with the condition that a zero divisor returns NaN:
import numpy as np
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR)
def pandas_div(a, b):
    if b == 0:
        return np.nan
    else:
        return (a / b)
However, the result turns out to be this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The output that I want should be like this:
+---------+---------+---+
|color_new|fruit_new|div|
+---------+---------+---+
| red| banana|10 |
| blue| banana|20 |
| red| carrot|30 |
| blue| grape|40 |
| red| carrot|50 |
| black| carrot|60 |
| red| banana|70 |
| red| grape|80 |
+---------+---------+---+
All you needed was a WHEN and OTHERWISE. See the example below:
# Create data frame
df = spark.createDataFrame([
    ['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
    ['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
    ['red', 'banana', 7, 70], ['red', 'grape', 8, 80], ['orange', 'grapefruit', 0, 100]], schema=['color', 'fruit', 'v1', 'v2'])
# display result
df.show()
+------+----------+---+---+
| color| fruit| v1| v2|
+------+----------+---+---+
| red| banana| 1| 10|
| blue| banana| 2| 20|
| red| carrot| 3| 30|
| blue| grape| 4| 40|
| red| carrot| 5| 50|
| black| carrot| 6| 60|
| red| banana| 7| 70|
| red| grape| 8| 80|
|orange|grapefruit| 0|100|
+------+----------+---+---+
# Import functions
import pyspark.sql.functions as f
# apply case when
df1 = df.withColumn("divide", f.when(f.col("v1") == 0, None).otherwise(f.lit(f.col("v2")/f.col("v1"))))
# display result
df1.show()
+------+----------+---+---+------+
| color| fruit| v1| v2|divide|
+------+----------+---+---+------+
| red| banana| 1| 10| 10.0|
| blue| banana| 2| 20| 10.0|
| red| carrot| 3| 30| 10.0|
| blue| grape| 4| 40| 10.0|
| red| carrot| 5| 50| 10.0|
| black| carrot| 6| 60| 10.0|
| red| banana| 7| 70| 10.0|
| red| grape| 8| 80| 10.0|
|orange|grapefruit| 0|100| null|
+------+----------+---+---+------+
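As an aside, the original pandas_udf can also be made to work by operating on whole pandas Series instead of scalars (a sketch, assuming Spark 3.x with PyArrow installed):
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('double')
def pandas_div(a: pd.Series, b: pd.Series) -> pd.Series:
    # element-wise division; positions where the divisor is 0 become NaN
    return (a / b).where(b != 0)

df1 = df.withColumn('div', pandas_div(df['v2'], df['v1']))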

Spark: how to do aggregation operations on string array in dataframe

I want to group by some columns and do aggregation operations like count, count_distinct, or nunique.
For example,
# the sample values in the `date` column are all unique
df.show(7)
+--------------------+---------------------------------+-------------------+---------+
| category| tags| datetime| date|
+--------------------+---------------------------------+-------------------+---------+
| null| ,industry,display,Merchants|2018-01-08 14:30:32| 20200704|
| social,smart| smart,swallow,game,Experience|2019-06-17 04:34:51| 20200705|
| ,beauty,social| social,picture,social|2017-08-19 09:01:37| 20200706|
| default| default,game,us,adventure|2019-10-02 14:18:56| 20200707|
|financial management|financial management,loan,product|2018-07-17 02:07:39| 20200708|
| system| system,font,application,setting|2015-07-18 00:45:57| 20200709|
| null| ,system,profile,optimization|2018-09-07 19:59:03| 20200710|
df.printSchema()
root
|-- category: string (nullable = true)
|-- tags: string (nullable = true)
|-- datetime: string (nullable = true)
|-- date: string (nullable = true)
# I want to do some group aggregations in PySpark like the following pandas code
group_date_tags_cnt_df = df.groupby('date')['tags'].count()
group_date_tags_nunique_df = df.groupby('date')['tags'].nunique()
group_date_category_cnt_df = df.groupby('date')['category'].count()
group_date_category_nunique_df = df.groupby('date')['category'].nunique()
# expected output below
# all results should ignore the empty entries produced by ',' in the split result and `null` values in the aggregations
group_date_tags_cnt_df.show(4)
+---------+---------+
| date| count|
+---------+---------+
| 20200704| 3|
| 20200705| 4|
| 20200706| 3|
| 20200707| 4|
group_date_tags_nunique_df.show(4)
+---------+---------------------------------+
| date| count(DISTINCT tag)|
+---------+---------------------------------+
| 20200704| 3|
| 20200705| 4|
| 20200706| 3|
| 20200707| 4|
# It should ignore `null` here
group_date_category_cnt_df.show(4)
+---------+---------+
| date| count|
+---------+---------+
| 20200704| 0|
| 20200705| 2|
| 20200706| 2|
| 20200707| 1|
group_date_category_nunique_df.show(4)
+---------+----------------------------+
| date| count(DISTINCT category)|
+---------+----------------------------+
| 20200704| 1|
| 20200705| 2|
| 20200706| 2|
| 20200707| 1|
But the tags and category columns are string type here,
so I think I should split them first and then do the group aggregation operations on the result.
I am not sure how to implement it.
Could anyone help me?
case class d(
    category: Option[String],
    tags: String,
    datetime: String,
    date: String
)

val sourceDF = Seq(
    d(None, ",industry,display,Merchants", "2018-01-08 14:30:32", "20200704"),
    d(Some("social,smart"), "smart,swallow,game,Experience", "2019-06-17 04:34:51", "20200704"),
    d(Some(",beauty,social"), "social,picture,social", "2017-08-19 09:01:37", "20200704")
).toDF("category", "tags", "datetime", "date")

val df1 = sourceDF
    .withColumn("category", split('category, ","))
    .withColumn("tags", split('tags, ","))

val df2 = df1.select('datetime, 'date, 'tags,
    explode(
        when(col("category").isNotNull, col("category"))
            .otherwise(array(lit(null).cast("string")))).alias("category")
)

val df3 = df2.select('category, 'datetime, 'date,
    explode(
        when(col("tags").isNotNull, col("tags"))
            .otherwise(array(lit(null).cast("string")))).alias("tags")
)
val resDF = df3.select('category, 'tags, 'datetime, 'date)
resDF.show
// +--------+----------+-------------------+--------+
// |category| tags| datetime| date|
// +--------+----------+-------------------+--------+
// | null| |2018-01-08 14:30:32|20200704|
// | null| industry|2018-01-08 14:30:32|20200704|
// | null| display|2018-01-08 14:30:32|20200704|
// | null| Merchants|2018-01-08 14:30:32|20200704|
// | social| smart|2019-06-17 04:34:51|20200704|
// | social| swallow|2019-06-17 04:34:51|20200704|
// | social| game|2019-06-17 04:34:51|20200704|
// | social|Experience|2019-06-17 04:34:51|20200704|
// | smart| smart|2019-06-17 04:34:51|20200704|
// | smart| swallow|2019-06-17 04:34:51|20200704|
// | smart| game|2019-06-17 04:34:51|20200704|
// | smart|Experience|2019-06-17 04:34:51|20200704|
// | | social|2017-08-19 09:01:37|20200704|
// | | picture|2017-08-19 09:01:37|20200704|
// | | social|2017-08-19 09:01:37|20200704|
// | beauty| social|2017-08-19 09:01:37|20200704|
// | beauty| picture|2017-08-19 09:01:37|20200704|
// | beauty| social|2017-08-19 09:01:37|20200704|
// | social| social|2017-08-19 09:01:37|20200704|
// | social| picture|2017-08-19 09:01:37|20200704|
// +--------+----------+-------------------+--------+
val group1DF = resDF.groupBy('date, 'category).count()
group1DF.show
// +--------+--------+-----+
// | date|category|count|
// +--------+--------+-----+
// |20200704| social| 7|
// |20200704| | 3|
// |20200704| smart| 4|
// |20200704| beauty| 3|
// |20200704| null| 4|
// +--------+--------+-----+
val group2DF = resDF.groupBy('datetime, 'category).count()
group2DF.show
// +-------------------+--------+-----+
// | datetime|category|count|
// +-------------------+--------+-----+
// |2017-08-19 09:01:37| social| 3|
// |2017-08-19 09:01:37| beauty| 3|
// |2019-06-17 04:34:51| smart| 4|
// |2019-06-17 04:34:51| social| 4|
// |2018-01-08 14:30:32| null| 4|
// |2017-08-19 09:01:37| | 3|
// +-------------------+--------+-----+
Here is PySpark code which solves your problem; I have taken data for the 3 dates 20200702, 20200704 and 20200705.
from pyspark.sql import Row
from pyspark.sql.functions import *
drow = Row("category","tags","datetime","date")
data = [drow("", ",industry,display,Merchants","2018-01-08 14:30:32","20200704"),drow("social,smart","smart,swallow,game,Experience","2019-06-17 04:34:51","20200702"),drow(",beauty,social", "social,picture,social", "2017-08-19 09:01:37", "20200705")]
df = spark.createDataFrame(data)
final_df = (df.withColumn("category", split(df['category'], ","))
            .withColumn("tags", split(df['tags'], ","))
            .select('datetime', 'date', 'tags',
                    explode(when(col("category").isNotNull(), col("category"))
                            .otherwise(array(lit("").cast("string")))).alias("category"))
            .select('datetime', 'date', 'category',
                    explode(when(col("tags").isNotNull(), col("tags"))
                            .otherwise(array(lit("").cast("string")))).alias("tags")))
final_df.show()
'''
+-------------------+--------+--------+----------+
| datetime| date|category| tags|
+-------------------+--------+--------+----------+
|2018-01-08 14:30:32|20200704| | |
|2018-01-08 14:30:32|20200704| | industry|
|2018-01-08 14:30:32|20200704| | display|
|2018-01-08 14:30:32|20200704| | Merchants|
|2019-06-17 04:34:51|20200702| social| smart|
|2019-06-17 04:34:51|20200702| social| swallow|
|2019-06-17 04:34:51|20200702| social| game|
|2019-06-17 04:34:51|20200702| social|Experience|
|2019-06-17 04:34:51|20200702| smart| smart|
|2019-06-17 04:34:51|20200702| smart| swallow|
|2019-06-17 04:34:51|20200702| smart| game|
|2019-06-17 04:34:51|20200702| smart|Experience|
|2017-08-19 09:01:37|20200705| | social|
|2017-08-19 09:01:37|20200705| | picture|
|2017-08-19 09:01:37|20200705| | social|
|2017-08-19 09:01:37|20200705| beauty| social|
|2017-08-19 09:01:37|20200705| beauty| picture|
|2017-08-19 09:01:37|20200705| beauty| social|
|2017-08-19 09:01:37|20200705| social| social|
|2017-08-19 09:01:37|20200705| social| picture|
+-------------------+--------+--------+----------+
only showing top 20 rows'''
final_df.groupBy('date','tags').count().show()
'''
+--------+----------+-----+
| date| tags|count|
+--------+----------+-----+
|20200702| smart| 2|
|20200705| picture| 3|
|20200702| swallow| 2|
|20200704| industry| 1|
|20200704| display| 1|
|20200702| game| 2|
|20200704| | 1|
|20200704| Merchants| 1|
|20200702|Experience| 2|
|20200705| social| 6|
+--------+----------+-----+
'''
final_df.groupBy('date','category').count().show()
'''
+--------+--------+-----+
| date|category|count|
+--------+--------+-----+
|20200702| smart| 4|
|20200702| social| 4|
|20200705| | 3|
|20200705| beauty| 3|
|20200704| | 4|
|20200705| social| 3|
+--------+--------+-----+
'''
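The distinct counts the question asks for (the nunique equivalents) can then be taken on the same exploded frame, for example (a sketch along the same lines):
from pyspark.sql import functions as F

(final_df
 .filter(F.col('tags') != '')                # drop the empty entries produced by leading commas
 .groupBy('date')
 .agg(F.count('tags').alias('tags_count'),
      F.countDistinct('tags').alias('tags_nunique'))
 .show())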

Pyspark: Drop/Filter rows based on Summing of columns and Rank

I have a dataframe like this:
df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11"],
                   "Slot_Length": [30, 30, 30, 30, 30, 30, 30, 30, 30],
                   "Total_Space": [60, 60, 60, 120, 120, 120, 120, 120, 120],
                   "Amount_Over": [-30, -30, -30, -60, -60, -60, -60, -60, -60],
                   "Rank": [1, 1, 2, 1, 1, 1, 1, 2, 2]})
df = spark.createDataFrame(df)
+----------+-----------+-----------+-----------+----+
| Date|Slot_Length|Total_Space|Amount_Over|Rank|
+----------+-----------+-----------+-----------+----+
|2020-05-10| 30| 60| -30| 1|
|2020-05-10| 30| 60| -30| 1|
|2020-05-10| 30| 60| -30| 2|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 2|
|2020-05-11| 30| 120| -60| 2|
+----------+-----------+-----------+-----------+----+
For each Date I have a Total_Space that can be filled. So for 2020-05-10, I have 60 seconds, and for 2020-05-11 I have 120 seconds.
Each Date also already has assigned slots with a certain Slot_Length.
For each Date I have already calculated the amount of space that Date is over in the Amount_Over column and have ranked them appropriately based on a priority column not shown here.
What I would like to do is drop the rows with the lowest Rank for a Date until the Slot_Lengths add up to the Total_Space for that Date.
+----------+-----------+-----------+-----------+----+
| Date|Slot_Length|Total_Space|Amount_Over|Rank|
+----------+-----------+-----------+-----------+----+
|2020-05-10| 30| 60| -30| 1|
|2020-05-10| 30| 60| -30| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
+----------+-----------+-----------+-----------+----+
In this example it is as easy as dropping all rows with Rank equal to 2, but there will be cases with ties between ranks, so take the highest ranks first and then pick one at random if there is a tie.
What is the best way to do this? I already understand it will need a Window function over the Date to do each calculation over the Slot_Length, Total_Space, and Amount_Over columns correctly.
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-11",
                            "2020-05-11", "2020-05-11", "2020-05-11"],
                   "Slot_Length": [30, 30, 30, 30, 30, 30, 30, 30, 30],
                   "Total_Space": [60, 60, 60, 120, 120, 120, 120, 120, 120],
                   "Amount_Over": [-30, -30, -30, -60, -60, -60, -60, -60, -60],
                   "Rank": [1, 1, 2, 1, 1, 1, 1, 2, 2]})
df = spark.createDataFrame(df)

w = Window.partitionBy("Date").orderBy("Rank").rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn(
    "Cumulative_Sum", F.sum("Slot_Length").over(w)
).filter(
    F.col("Cumulative_Sum") <= F.col("Total_Space")
).orderBy("Date", "Rank", "Cumulative_Sum").show()
which results in:
+----------+-----------+-----------+-----------+----+--------------+
| Date|Slot_Length|Total_Space|Amount_Over|Rank|Cumulative_Sum|
+----------+-----------+-----------+-----------+----+--------------+
|2020-05-10| 30| 60| -30| 1| 30|
|2020-05-10| 30| 60| -30| 1| 60|
|2020-05-11| 30| 120| -60| 1| 30|
|2020-05-11| 30| 120| -60| 1| 60|
|2020-05-11| 30| 120| -60| 1| 90|
|2020-05-11| 30| 120| -60| 1| 120|
+----------+-----------+-----------+-----------+----+--------------+
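If ties within a Rank should also be broken randomly, as the question mentions, the window ordering can include a random key (a sketch of one option):
w = (Window.partitionBy("Date")
     .orderBy(F.col("Rank"), F.rand(seed=42))   # random tie-break among rows with equal Rank
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))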