PySpark generate missing dates and fill data with previous value - dataframe

I need help filling in missing values with new rows for this case.
This is just an example; I have a lot of rows with different IDs.
Input dataframe:
ID   FLAG  DATE
123  1     01/01/2021
123  0     01/02/2021
123  1     01/03/2021
123  0     01/06/2021
123  0     01/08/2021
777  0     01/01/2021
777  1     01/03/2021
So I have a finite set of dates and I want to generate every date up to the last one for each ID (in the example, for ID = 123: 01/01/2021, 01/02/2021, 01/03/2021, ... up to 01/08/2021). Basically I could do a cross join with a calendar, but I don't know how to fill the missing values with a rule or a filter after the cross join.
Expected output (generated rows are marked with *):
ID   FLAG  DATE
123  1     01/01/2021
123  0     01/02/2021
123  1     01/03/2021
123  1     01/04/2021  *
123  1     01/05/2021  *
123  0     01/06/2021
123  0     01/07/2021  *
123  0     01/08/2021
777  0     01/01/2021
777  0     01/02/2021  *
777  1     01/03/2021

You can first group by id to compute the min and max dates, then use the sequence function to generate all the dates from min_date to max_date. Finally, join with the original dataframe and fill the nulls with the last non-null value per id. Here's a complete working example:
Your input dataframe:
from pyspark.sql import Window
import pyspark.sql.functions as F

df = spark.createDataFrame([
    (123, 1, "01/01/2021"), (123, 0, "01/02/2021"),
    (123, 1, "01/03/2021"), (123, 0, "01/06/2021"),
    (123, 0, "01/08/2021"), (777, 0, "01/01/2021"),
    (777, 1, "01/03/2021")
], ["id", "flag", "date"])
Group by id and generate all possible dates for each id:
all_dates_df = df.groupBy("id").agg(
    F.date_trunc("mm", F.max(F.to_date("date", "dd/MM/yyyy"))).alias("max_date"),
    F.date_trunc("mm", F.min(F.to_date("date", "dd/MM/yyyy"))).alias("min_date")
).select(
    "id",
    F.expr("sequence(min_date, max_date, interval 1 month)").alias("date")
).withColumn(
    "date", F.explode("date")
).withColumn(
    "date", F.date_format("date", "dd/MM/yyyy")
)
Now, left join with df and use last function over a Window partitioned by id to fill null values:
w = Window.partitionBy("id").orderBy("date")

result = all_dates_df.join(df, ["id", "date"], "left").select(
    "id",
    "date",
    *[F.last(F.col(c), ignorenulls=True).over(w).alias(c)
      for c in df.columns if c not in ("id", "date")]
)
result.show()
#+---+----------+----+
#| id| date|flag|
#+---+----------+----+
#|123|01/01/2021| 1|
#|123|01/02/2021| 0|
#|123|01/03/2021| 1|
#|123|01/04/2021| 1|
#|123|01/05/2021| 1|
#|123|01/06/2021| 0|
#|123|01/07/2021| 0|
#|123|01/08/2021| 0|
#|777|01/01/2021| 0|
#|777|01/02/2021| 0|
#|777|01/03/2021| 1|
#+---+----------+----+
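One caveat, which is my own note rather than part of the answer above: the window orders by the formatted dd/MM/yyyy string, which only happens to sort correctly here because every date falls on the first of the month within a single year. A safer variant of the last step, a minimal sketch assuming the same all_dates_df and df, orders by the parsed date instead:
w = Window.partitionBy("id").orderBy(F.to_date("date", "dd/MM/yyyy"))

result = all_dates_df.join(df, ["id", "date"], "left").select(
    "id",
    "date",
    *[F.last(F.col(c), ignorenulls=True).over(w).alias(c)   # forward-fill flag within each id
      for c in df.columns if c not in ("id", "date")]
)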

You can find the range of dates between the DATE value in the current row and the following row, then use sequence to generate all the intermediate dates and explode this array to fill in values for the missing dates.
from pyspark.sql import functions as F
from pyspark.sql import Window

data = [(123, 1, "01/01/2021",),
        (123, 0, "01/02/2021",),
        (123, 1, "01/03/2021",),
        (123, 0, "01/06/2021",),
        (123, 0, "01/08/2021",),
        (777, 0, "01/01/2021",),
        (777, 1, "01/03/2021",), ]

df = spark.createDataFrame(data, ("ID", "FLAG", "DATE",)).withColumn("DATE", F.to_date(F.col("DATE"), "dd/MM/yyyy"))

window_spec = Window.partitionBy("ID").orderBy("DATE")

# Date of the following row; for the last row per ID, pretend the next row is one month later
next_date = F.coalesce(F.lead("DATE", 1).over(window_spec), F.col("DATE") + F.expr("interval 1 month"))

# Generate dates up to, but not including, the following row's date
end_date_range = next_date - F.expr("interval 1 month")

df.withColumn("Ranges", F.sequence(F.col("DATE"), end_date_range, F.expr("interval 1 month")))\
  .withColumn("DATE", F.explode("Ranges"))\
  .withColumn("DATE", F.date_format("DATE", "dd/MM/yyyy"))\
  .drop("Ranges").show(truncate=False)
Output
+---+----+----------+
|ID |FLAG|DATE |
+---+----+----------+
|123|1 |01/01/2021|
|123|0 |01/02/2021|
|123|1 |01/03/2021|
|123|1 |01/04/2021|
|123|1 |01/05/2021|
|123|0 |01/06/2021|
|123|0 |01/07/2021|
|123|0 |01/08/2021|
|777|0 |01/01/2021|
|777|0 |01/02/2021|
|777|1 |01/03/2021|
+---+----+----------+

Related

Joining 2 dataframes pyspark

I am new to Pyspark.
I have data like this in 2 tables as below. I am using data frames.
Table1:
Id  Amount  Date
1   £100    01/04/2021
1   £50     08/04/2021
2   £60     02/04/2021
2   £20     06/05/2021
Table2:
Id  Status  Date
1   S1      01/04/2021
1   S2      05/04/2021
1   S3      10/04/2021
2   S1      02/04/2021
2   S2      10/04/2021
I need to join those 2 dataframes to produce the output below.
For every record in Table1, we need the record from Table2 that is valid as of that Date, and vice versa. For example, Table1 has £50 for Id=1 on 08/04/2021, but Table2's most recent record for Id=1 on or before that date is from 05/04/2021, where the status changed to S2; so for 08/04/2021 the status is S2. That's the part I'm not sure how to express in the join condition.
What's an efficient way of achieving this?
Expected Output:
Id  Status  Date        Amount
1   S1      01/04/2021  £100
1   S2      05/04/2021  £100
1   S2      08/04/2021  £50
1   S3      10/04/2021  £50
2   S1      02/04/2021  £60
2   S2      10/04/2021  £60
2   S2      06/05/2021  £20
Use a full join on Id and Date, then the lag window function to pull the Status and Amount values from the closest preceding Date row:
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("Id").orderBy(F.to_date("Date", "dd/MM/yyyy"))

joined_df = df1.join(df2, ["Id", "Date"], "full").withColumn(
    "Status",
    F.coalesce(F.col("Status"), F.lag("Status").over(w))
).withColumn(
    "Amount",
    F.coalesce(F.col("Amount"), F.lag("Amount").over(w))
)
joined_df.show()
#+---+----------+------+------+
#| Id| Date|Amount|Status|
#+---+----------+------+------+
#| 1|01/04/2021| £100| S1|
#| 1|05/04/2021| £100| S2|
#| 1|08/04/2021| £50| S2|
#| 1|10/04/2021| £50| S3|
#| 2|02/04/2021| £60| S1|
#| 2|10/04/2021| £60| S2|
#| 2|06/05/2021| £20| S2|
#+---+----------+------+------+
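One caveat, my own note rather than part of the answer: lag only looks one row back, so if two or more consecutive rows are missing Status (or Amount), the coalesce still leaves nulls. A minimal sketch that tolerates such runs of gaps, assuming the same df1 and df2, forward-fills with last(..., ignorenulls=True) instead:
w = Window.partitionBy("Id").orderBy(F.to_date("Date", "dd/MM/yyyy"))

joined_df = df1.join(df2, ["Id", "Date"], "full").withColumn(
    "Status",
    F.last("Status", ignorenulls=True).over(w)   # last non-null Status up to the current row
).withColumn(
    "Amount",
    F.last("Amount", ignorenulls=True).over(w)   # last non-null Amount up to the current row
)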

Window & Aggregate functions in Pyspark SQL/SQL

After the answer by @Vaebhav I realized the question was not set up correctly, hence I am editing it with his code snippet.
I have the following table
from pyspark.sql.types import IntegerType, TimestampType, DoubleType
import pyspark.sql.functions as F

input_str = """
4219,2018-01-01 08:10:00,3.0,50.78,
4216,2018-01-02 08:01:00,5.0,100.84,
4217,2018-01-02 20:00:00,4.0,800.49,
4139,2018-01-03 11:05:00,1.0,400.0,
4170,2018-01-03 09:10:00,2.0,100.0,
4029,2018-01-06 09:06:00,6.0,300.55,
4029,2018-01-06 09:16:00,2.0,310.55,
4217,2018-01-06 09:36:00,5.0,307.55,
1139,2018-01-21 11:05:00,1.0,400.0,
2170,2018-01-21 09:10:00,2.0,100.0,
4218,2018-02-06 09:36:00,5.0,307.55,
4218,2018-02-06 09:36:00,5.0,307.55
""".split(",")

input_values = list(map(lambda x: x.strip() if x.strip() != '' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "customer_id,timestamp,quantity,price".split(',')))

n = len(input_values)
n_cols = 4

input_list = [tuple(input_values[i:i + n_cols]) for i in range(0, n, n_cols)]

sparkDF = sqlContext.createDataFrame(input_list, cols)
sparkDF = sparkDF.withColumn('customer_id', F.col('customer_id').cast(IntegerType()))\
                 .withColumn('timestamp', F.col('timestamp').cast(TimestampType()))\
                 .withColumn('quantity', F.col('quantity').cast(IntegerType()))\
                 .withColumn('price', F.col('price').cast(DoubleType()))
I want to calculate the aggregate as follows:
trxn_date   unique_cust_visits  next_7_day_visits  next_30_day_visits
2018-01-01  1                   7                  9
2018-01-02  2                   6                  8
2018-01-03  2                   4                  6
2018-01-06  2                   2                  4
2018-01-21  2                   2                  3
2018-02-06  1                   1                  1
where
trxn_date is the date taken from the timestamp column,
unique_cust_visits is the unique count of customers per day,
next_7_day_visits is the count of customer visits over the next 7 days (rolling window),
next_30_day_visits is the count of customer visits over the next 30 days (rolling window).
I want to write this as a single SQL query.
You can achieve this by using a ROWS frame rather than a RANGE frame type; a good explanation can be found here:
ROW - based on physical offsets from the position of the current input row
RANGE - based on logical offsets from the position of the current input row
Also, in your implementation a PARTITION BY clause would be redundant, as it won't create the frames required for a look-ahead.
Data Preparation
input_str = """
4219,2018-01-02 08:10:00,3.0,50.78,
4216,2018-01-02 08:01:00,5.0,100.84,
4217,2018-01-02 20:00:00,4.0,800.49,
4139,2018-01-03 11:05:00,1.0,400.0,
4170,2018-01-03 09:10:00,2.0,100.0,
4029,2018-01-06 09:06:00,6.0,300.55,
4029,2018-01-06 09:16:00,2.0,310.55,
4217,2018-01-06 09:36:00,5.0,307.55
""".split(",")

input_values = list(map(lambda x: x.strip() if x.strip() != '' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "customer_id,timestamp,quantity,price".split(',')))

n = len(input_values)
n_cols = 4

input_list = [tuple(input_values[i:i + n_cols]) for i in range(0, n, n_cols)]

sparkDF = sqlContext.createDataFrame(input_list, cols)
sparkDF = sparkDF.withColumn('customer_id', F.col('customer_id').cast(IntegerType()))\
                 .withColumn('timestamp', F.col('timestamp').cast(TimestampType()))\
                 .withColumn('quantity', F.col('quantity').cast(IntegerType()))\
                 .withColumn('price', F.col('price').cast(DoubleType()))
sparkDF.show()
+-----------+-------------------+--------+------+
|customer_id| timestamp|quantity| price|
+-----------+-------------------+--------+------+
| 4219|2018-01-02 08:10:00| 3| 50.78|
| 4216|2018-01-02 08:01:00| 5|100.84|
| 4217|2018-01-02 20:00:00| 4|800.49|
| 4139|2018-01-03 11:05:00| 1| 400.0|
| 4170|2018-01-03 09:10:00| 2| 100.0|
| 4029|2018-01-06 09:06:00| 6|300.55|
| 4029|2018-01-06 09:16:00| 2|310.55|
| 4217|2018-01-06 09:36:00| 5|307.55|
+-----------+-------------------+--------+------+
Window Aggregates
sparkDF.createOrReplaceTempView("transactions")

sqlContext.sql("""
SELECT
    TO_DATE(timestamp) as trxn_date
    ,COUNT(DISTINCT customer_id) as unique_cust_visits
    ,SUM(COUNT(DISTINCT customer_id)) OVER (
        ORDER BY TO_DATE(timestamp)
        ROWS BETWEEN CURRENT ROW AND 7 FOLLOWING
    ) as next_7_day_visits
FROM transactions
GROUP BY 1
""").show()
+----------+------------------+-----------------+
| trxn_date|unique_cust_visits|next_7_day_visits|
+----------+------------------+-----------------+
|2018-01-02| 3| 7|
|2018-01-03| 2| 4|
|2018-01-06| 2| 2|
+----------+------------------+-----------------+
Building upon @Vaebhav's answer, the required query in this case is:
sqlContext.sql("""
SELECT
    TO_DATE(timestamp) as trxn_date
    ,COUNT(DISTINCT customer_id) as unique_cust_visits
    ,SUM(COUNT(DISTINCT customer_id)) OVER (
        ORDER BY CAST(TO_DATE(timestamp) AS TIMESTAMP) DESC
        RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
    ) as next_7_day_visits
    ,SUM(COUNT(DISTINCT customer_id)) OVER (
        ORDER BY CAST(TO_DATE(timestamp) AS TIMESTAMP) DESC
        RANGE BETWEEN INTERVAL 30 DAYS PRECEDING AND CURRENT ROW
    ) as next_30_day_visits
FROM transactions
GROUP BY 1
ORDER BY trxn_date
""").show()
trxn_date   unique_cust_visits  next_7_day_visits  next_30_day_visits
2018-01-01  1                   7                  9
2018-01-02  2                   6                  8
2018-01-03  2                   4                  6
2018-01-06  2                   2                  4
2018-01-21  2                   2                  3
2018-02-06  1                   1                  1
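For reference, here is a minimal DataFrame-API sketch of the same look-ahead counts using rangeBetween; it is my own addition rather than part of the answers above, and it assumes the sparkDF built earlier (like the SQL, it counts distinct customers per day before summing across days):
from pyspark.sql import Window
import pyspark.sql.functions as F

daily = sparkDF.groupBy(F.to_date("timestamp").alias("trxn_date")) \
               .agg(F.countDistinct("customer_id").alias("unique_cust_visits"))

# Order by the date expressed in epoch seconds so rangeBetween can take day-sized offsets
day = 86400
w7 = Window.orderBy(F.col("trxn_date").cast("timestamp").cast("long")).rangeBetween(0, 7 * day)
w30 = Window.orderBy(F.col("trxn_date").cast("timestamp").cast("long")).rangeBetween(0, 30 * day)

daily.withColumn("next_7_day_visits", F.sum("unique_cust_visits").over(w7)) \
     .withColumn("next_30_day_visits", F.sum("unique_cust_visits").over(w30)) \
     .orderBy("trxn_date").show()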

Pyspark: Filter dataframe and apply function to offset time

I have a dataframe like this:
import time
import datetime
import pandas as pd

df = pd.DataFrame({'Number': ['1', '2', '1', '1'],
                   'Letter': ['A', 'A', 'B', 'A'],
                   'Time': ['2019-04-30 18:15:00', '2019-04-30 18:15:00', '2019-04-30 18:15:00', '2019-04-30 18:15:00'],
                   'Value': [30, 30, 30, 60]})
df['Time'] = pd.to_datetime(df['Time'])
Number Letter Time Value
0 1 A 2019-04-30 18:15:00 30
1 2 A 2019-04-30 18:15:00 30
2 1 B 2019-04-30 18:15:00 30
3 1 A 2019-04-30 18:15:00 60
I would like to do something similar in Pyspark as I do in Pandas where I filter on a specific set of data:
#: Want to target only rows where the Number = '1' and the Letter is 'A'.
target_df = df[
    (df['Number'] == '1') &
    (df['Letter'] == 'A')
]
And apply a change to a value based on another column:
#: Loop over these rows and subtract the offset value from the Time.
for index, row in target_df.iterrows():
    offset = row['Value']
    df.loc[index, 'Time'] = row['Time'] - datetime.timedelta(seconds=row['Value'])
To get a final output like so:
Number Letter Time Value
0 1 A 2019-04-30 18:14:30 30
1 2 A 2019-04-30 18:15:00 30
2 1 B 2019-04-30 18:15:00 30
3 1 A 2019-04-30 18:14:00 60
What is the best way to go about this in Pyspark?
I was thinking something along the lines of this:
pyspark_df = spark.createDataFrame(df)
pyspark_df.withColumn('new_time', F.when(
F.col('Number') == '1' & F.col('Letter' == 'A'), F.col('Time') - datetime.timedelta(seconds=(F.col('Value')))).otherwise(
F.col('Time')))
But that doesn't seem to work for me.
You can try with unix timestamp:
import pyspark.sql.functions as F

cond_val = (F.when((F.col("Number") == 1) & (F.col("Letter") == "A"),
                   F.from_unixtime(F.unix_timestamp(F.col("Time")) - F.col("Value")))
            .otherwise(F.col("Time")))

df.withColumn("Time", cond_val).show()
+------+------+-------------------+-----+
|Number|Letter| Time|Value|
+------+------+-------------------+-----+
| 1| A|2019-04-30 18:14:30| 30|
| 2| A|2019-04-30 18:15:00| 30|
| 1| B|2019-04-30 18:15:00| 30|
| 1| A|2019-04-30 18:14:00| 60|
+------+------+-------------------+-----+
Just an addition: you don't need iterrows in pandas, just do:
c = df['Number'].eq('1') & df['Letter'].eq('A')
df.loc[c, 'Time'] = df['Time'].sub(pd.to_timedelta(df['Value'], unit='s'))

# or faster (requires numpy imported as np)
# df['Time'] = np.where(c, df['Time'].sub(pd.to_timedelta(df['Value'], unit='s')),
#                       df['Time'])
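As another option, my own addition rather than part of the answer: on Spark 3.0+ you can subtract the seconds directly with make_interval, which keeps the column as a timestamp instead of converting through unix_timestamp. A minimal sketch, assuming the pyspark_df from the question:
import pyspark.sql.functions as F

result = pyspark_df.withColumn(
    "Time",
    F.when((F.col("Number") == "1") & (F.col("Letter") == "A"),
           # make_interval(years, months, weeks, days, hours, mins, secs)
           F.expr("Time - make_interval(0, 0, 0, 0, 0, 0, Value)"))
     .otherwise(F.col("Time"))
)
result.show()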

Aggregate while dropping duplicates in pyspark

I want to groupby aggregate a pyspark dataframe, while removing duplicates (keep last value) based on another column of this dataframe.
In summary, I would like to apply a dropDuplicates to a GroupedData object. So, for each group, I could keep only one row by some column, dynamically.
Example
The straightforward group aggregation, for the dataframe below, would be:
from pyspark.sql import functions
dataframe = spark.createDataFrame(
    [
        (1, "2020-01-01", 1, 1),
        (2, "2020-01-01", 2, 1),
        (3, "2020-01-02", 1, 1),
        (2, "2020-01-02", 1, 1)
    ],
    ("id", "ts", "feature", "h3")
).withColumn("ts", functions.col("ts").cast("timestamp"))
# +---+-------------------+-------+---+
# | id| ts|feature| h3|
# +---+-------------------+-------+---+
# | 1|2020-01-01 00:00:00| 1| 1|
# | 2|2020-01-01 00:00:00| 2| 1|
# | 3|2020-01-02 00:00:00| 1| 1|
# | 2|2020-01-02 00:00:00| 1| 1|
# +---+-------------------+-------+---+
aggregated = dataframe.groupby(
    "h3",
    functions.window(
        timeColumn="ts",
        windowDuration="3 days",
        slideDuration="1 day",
    )
).agg(
    functions.sum("feature")
)
aggregated.show(truncate=False)
resulting in the following dataframe:
+---+------------------------------------------+------------+
|h3 |window |sum(feature)|
+---+------------------------------------------+------------+
|1 |[2019-12-30 00:00:00, 2020-01-02 00:00:00]|3 |
|1 |[2019-12-31 00:00:00, 2020-01-03 00:00:00]|5 |
|1 |[2020-01-01 00:00:00, 2020-01-04 00:00:00]|5 |
|1 |[2020-01-02 00:00:00, 2020-01-05 00:00:00]|2 |
+---+------------------------------------------+------------+
The problem
I want the aggregation to use only the latest state of each id. In this case, id=2 was updated to feature=1 at ts=2020-01-02 00:00:00, so all aggregations whose base timestamp is greater than 2020-01-02 00:00:00 should use only this state for column feature when id=2. The expected aggregated dataframe is:
+---+------------------------------------------+------------+
|h3 |window |sum(feature)|
+---+------------------------------------------+------------+
|1 |[2019-12-30 00:00:00, 2020-01-02 00:00:00]|3 |
|1 |[2019-12-31 00:00:00, 2020-01-03 00:00:00]|3 |
|1 |[2020-01-01 00:00:00, 2020-01-04 00:00:00]|3 |
|1 |[2020-01-02 00:00:00, 2020-01-05 00:00:00]|2 |
+---+------------------------------------------+------------+
How can I do this with pyspark?
Update
I have assumed that a MapType variable should not have duplicate keys in Spark. With that assumption, I thought I could aggregate the column creating a map id -> feature and then just aggregate the map values with sum (or whatever the final aggregation should be).
So I did:
aggregated = dataframe.groupby(
    "h3",
    functions.window(
        timeColumn="ts",
        windowDuration="3 days",
        slideDuration="1 day",
    )
).agg(
    functions.map_from_entries(
        functions.collect_list(
            functions.struct("id", "feature")
        )
    ).alias("id_feature")
)
aggregated.show(truncate=False)
But then I've found that maps can have duplicate keys:
+---+------------------------------------------+--------------------------------+
|h3 |window |id_feature |
+---+------------------------------------------+--------------------------------+
|1 |[2020-01-01 00:00:00, 2020-01-04 00:00:00]|[1 -> 1, 2 -> 2, 3 -> 1, 2 -> 1]|
|1 |[2019-12-31 00:00:00, 2020-01-03 00:00:00]|[1 -> 1, 2 -> 2, 3 -> 1, 2 -> 1]|
|1 |[2019-12-30 00:00:00, 2020-01-02 00:00:00]|[1 -> 1, 2 -> 2] |
|1 |[2020-01-02 00:00:00, 2020-01-05 00:00:00]|[3 -> 1, 2 -> 1] |
+---+------------------------------------------+--------------------------------+
so it doesn't solve my problem. Instead, I just found another problem: when using the display function in a Databricks notebook, it shows the MapType column without the duplicated keys.
First, you can find the latest record for each id and time window, and then join the original dataframe with those latest records.
# note: max, sum and window here are the Spark SQL functions, not the Python builtins
from pyspark.sql.functions import col, max, sum, window

time_window = window(timeColumn="ts", windowDuration="3 days", slideDuration="1 day")

df2 = df.groupBy("h3", time_window, "id").agg(max("ts").alias("latest"))

df2.alias("a").join(df.alias("b"), (col("a.id") == col("b.id")) & (col("a.latest") == col("b.ts")), "left") \
   .select("a.*", "feature") \
   .groupBy("h3", "window") \
   .agg(sum("feature")) \
   .orderBy("window") \
   .show(truncate=False)
Then, the result is the same as your expected one.
+---+------------------------------------------+------------+
|h3 |window |sum(feature)|
+---+------------------------------------------+------------+
|1 |[2019-12-29 00:00:00, 2020-01-01 00:00:00]|3 |
|1 |[2019-12-30 00:00:00, 2020-01-02 00:00:00]|3 |
|1 |[2019-12-31 00:00:00, 2020-01-03 00:00:00]|3 |
|1 |[2020-01-01 00:00:00, 2020-01-04 00:00:00]|2 |
+---+------------------------------------------+------------+
Since you are using Spark 2.4+, one way you can try is the Spark SQL aggregate function; see below:
aggregated = dataframe.groupby(
    "h3",
    functions.window(
        timeColumn="ts",
        windowDuration="3 days",
        slideDuration="1 day",
    )
).agg(
    functions.sort_array(functions.collect_list(
        functions.struct("ts", "id", "feature")
    ), False).alias("id_feature")
)
I added the ts field into the resulting array of structs from functions.collect_list and used functions.sort_array to sort the list by ts in descending order (so the latest record comes first when duplicates exist). In the following aggregate function, we set the zero value using a named_struct containing two fields: ids (a MapType) to cache all processed ids, and total to do the sum only when the new id does not already exist in the cached ids.
aggregated.selectExpr("h3", "window", """
    aggregate(
        id_feature,
        /* zero_value */
        (map() as ids, 0L as total),
        /* merge */
        (acc, y) -> named_struct(
            /* add y.id into the ids map */
            'ids', map_concat(acc.ids, map(y.id, 1)),
            /* sum to total only when y.id doesn't exist in acc.ids map */
            'total', acc.total + IF(acc.ids[y.id] is null, y.feature, 0)
        ),
        /* finish: take only acc.total, discard the acc.ids map */
        acc -> acc.total
    ) as id_feature
""").show()
+---+--------------------+----------+
| h3| window|id_feature|
+---+--------------------+----------+
| 1|[2020-01-01 00:00...| 3|
| 1|[2019-12-31 00:00...| 3|
| 1|[2019-12-30 00:00...| 3|
| 1|[2020-01-02 00:00...| 2|
+---+--------------------+----------+
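A different route, my own addition rather than part of either answer: deduplicate before aggregating, without a join or a higher-order function. A minimal sketch assuming the dataframe from the question; max over a struct compares its fields in order, so taking the max of struct(ts, feature) per (h3, window, id) keeps the feature of the latest row:
from pyspark.sql import functions as F

latest = dataframe.groupby(
    "h3",
    F.window(timeColumn="ts", windowDuration="3 days", slideDuration="1 day"),
    "id"
).agg(F.max(F.struct("ts", "feature")).alias("latest"))

# Sum the latest feature value of each id per (h3, window)
result = latest.groupBy("h3", "window").agg(F.sum("latest.feature").alias("sum(feature)"))
result.orderBy("window").show(truncate=False)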

Visits during the last 2 years

I have a list of users and the dates of their visits. For every visit, I want to know how many times they visited over the previous 2 years.
# Create toy example
import pandas as pd
import numpy as np

date_range = pd.date_range(pd.to_datetime('2010-01-01'),
                           pd.to_datetime('2016-01-01'), freq='D')
date_range = np.random.choice(date_range, 8)

visits = {'user': list(np.repeat(1, 4)) + list(np.repeat(2, 4)),
          'time': list(date_range)}

df = pd.DataFrame(visits)
df.sort_values(by=['user', 'time'], axis=0)

df = spark.createDataFrame(df).repartition(1).cache()
df.show()
What I am looking for is something like this:
time user nr_visits_during_2_previous_years
0 2010-02-27 1 0
2 2012-02-21 1 1
3 2013-04-30 1 1
1 2013-06-20 1 2
6 2010-06-23 2 0
4 2011-10-19 2 1
5 2011-11-10 2 2
7 2014-02-06 2 0
Suppose you create a dataframe with these values and you need to check for visits after 2015-01-01.
import pyspark.sql.functions as f
import pyspark.sql.types as t

df = spark.createDataFrame([("2014-02-01", "1"), ("2015-03-01", "2"), ("2017-12-01", "3"),
                            ("2014-05-01", "2"), ("2016-10-12", "1"), ("2016-08-21", "1"),
                            ("2017-07-01", "3"), ("2015-09-11", "1"), ("2016-08-24", "1"),
                            ("2016-04-05", "2"), ("2014-11-19", "3"), ("2016-03-11", "3")], ["date", "id"])
Now, change the date column from StringType to DateType, then filter the rows where the user visited on or after 2015-01-01.
df2 = df.withColumn("date",f.to_date('date', 'yyyy-MM-dd'))
df3 = df2.where(df2.date >= f.lit('2015-01-01'))
Finally, group by the id column and use count to get the number of visits by each user after 2015-01-01:
df3.groupby('id').count().show()
+---+-----+
| id|count|
+---+-----+
| 3| 3|
| 1| 4|
| 2| 2|
+---+-----+
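That gives a fixed cutoff rather than the rolling "previous 2 years" count the question asks for. A minimal sketch of that rolling count, my own addition, uses a range window per user over the visit time, approximating 2 years as 730 days; it assumes the Spark df built from the toy example, with columns user and time:
import pyspark.sql.functions as F
from pyspark.sql import Window

two_years = 730 * 86400  # 730 days expressed in seconds

# All earlier visits of the same user within the previous 2 years,
# excluding the current visit itself (upper bound is -1 second).
w = Window.partitionBy("user") \
          .orderBy(F.col("time").cast("timestamp").cast("long")) \
          .rangeBetween(-two_years, -1)

df.withColumn("nr_visits_during_2_previous_years", F.count("*").over(w)) \
  .orderBy("user", "time").show()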