GroupBy using Time Frequency on PySpark DataFrame Issue - apache-spark-sql

I am new to PySpark.
I am trying to perform a groupBy operation to get an aggregated count, but I am not able to group by a time frequency. I need to group by the fields "CAPTUREDTIME, NODE, CHANNEL, LOCATION, TACK", and within that groupBy I need to bucket the "CAPTUREDTIME" field hourly, daily, weekly and monthly.
Please find the sample data below.
+-----------------+--------+-------+--------+--------+
|CAPTUREDTIME     |NODE    |CHANNEL|LOCATION|TACK    |
+-----------------+--------+-------+--------+--------+
|20-05-09 03:06:21|PUSC_RES|SIMPLEX|NORTH_AL|UE220034|
|20-05-09 04:33:04|PUSC_RES|SIMPLEX|SOUTH_AL|UE220034|
|20-05-09 12:04:52|TESC_RES|SIMPLEX|NORTH_AL|UE220057|
|20-05-10 04:24:09|TESC_RES|SIMPLEX|NORTH_AL|UE220057|
|20-05-10 04:33:04|PUSC_RES|SIMPLEX|SOUTH_AL|UE220034|
|20-04-09 10:57:48|TESC_RES|SIMPLEX|NORTH_AL|UE220057|
|20-04-09 12:12:26|TESC_RES|SIMPLEX|NORTH_AL|UE220057|
|20-04-09 03:26:33|PUSC_RES|SIMPLEX|NORTH_AL|UE220071|
+-----------------+--------+-------+--------+--------+
I have used the following PySpark code:
df = df.groupby("CAPTUREDTIME", "NODE", "CHANNEL", "LOCATION", "TACK").agg(
    func.count("TACK").alias("count")
)
How can I extend the above code to group hourly, daily, weekly and monthly?
I require the output in the format below (sample output shared):
HOURLY:
|CAPTUREDTIME     |NODE    |CHANNEL|LOCATION|TACK    |COUNT|
|20-05-09 03:00:00|PUSC_RES|SIMPLEX|NORTH_AL|UE220034|2    |
|20-05-09 04:00:00|PUSC_RES|SIMPLEX|SOUTH_AL|UE220034|2    |

DAILY:
|CAPTUREDTIME     |NODE    |CHANNEL|LOCATION|TACK    |COUNT|
|20-05-09 00:00:00|PUSC_RES|SIMPLEX|NORTH_AL|UE220034|1    |
|20-05-09 00:00:00|PUSC_RES|SIMPLEX|SOUTH_AL|UE220034|2    |
|20-05-09 00:00:00|TESC_RES|SIMPLEX|NORTH_AL|UE220057|3    |

WEEKLY:
|CAPTUREDTIME     |NODE    |CHANNEL|LOCATION|TACK    |COUNT|
|20-05-09 00:00:00|PUSC_RES|SIMPLEX|NORTH_AL|UE220034|1    |

MONTHLY:
|CAPTUREDTIME     |NODE    |CHANNEL|LOCATION|TACK    |COUNT|
|20-05-09 00:00:00|PUSC_RES|SIMPLEX|NORTH_AL|UE220034|1    |

You have two ways to solve your issue: either you cast your timestamps to the date granularity you want to group by, or (as you said in the comments) you use the SQL window function to group by the interval you'd like.
Just know that monthly aggregation is not possible through the window SQL function in Spark, since a calendar month is not a fixed-length interval.
Here you can see the code: the first example uses the window SQL function, and the others cast or truncate the timestamp to the desired granularity and then group by every column.
df = spark.createDataFrame(
    [
        ("20-05-09 03:06:21", "PUSC_RES", "SIMPLEX", "NORTH_AL", "UE220034"),
        ("20-05-09 04:33:04", "PUSC_RES", "SIMPLEX", "SOUTH_AL", "UE220034"),
        ("20-05-09 12:04:52", "TESC_RES", "SIMPLEX", "NORTH_AL", "UE220057"),
        ("20-05-10 04:24:09", "TESC_RES", "SIMPLEX", "NORTH_AL", "UE220057"),
        ("20-05-10 04:33:04", "PUSC_RES", "SIMPLEX", "SOUTH_AL", "UE220034"),
        ("20-04-09 10:57:48", "TESC_RES", "SIMPLEX", "NORTH_AL", "UE220057"),
        ("20-04-09 12:12:26", "TESC_RES", "SIMPLEX", "NORTH_AL", "UE220057"),
        ("20-04-09 03:26:33", "PUSC_RES", "SIMPLEX", "NORTH_AL", "UE220071")
    ],
    ['CAPTUREDTIME', 'NODE', 'CHANNEL', 'LOCATION', 'TACK']
)
from pyspark.sql.functions import col, count, date_format, date_sub, date_trunc, month, next_day, to_timestamp, weekofyear, window, year
Hourly
I keep the window logic just for this one, so that every possibility in Spark is covered here for reference. At the end, I select the start of the window before showing the dataframe.
hourly = (
    df
    .withColumn("captured_time", to_timestamp(col('CAPTUREDTIME'), 'yy-MM-dd HH:mm:ss'))
    .groupBy(window(col("captured_time"), "1 hour").alias("captured_time"), "NODE", "CHANNEL", "LOCATION", "TACK")
    .agg(count("*"))
    .withColumn("captured_time_hour", col("captured_time.start"))
    .drop("captured_time")
)
hourly.sort("captured_time_hour").show(100, False)
Daily
Through the date_trunc function, I can truncate the timestamp to the day:
daily = (
    df
    .withColumn("captured_time", to_timestamp(col('CAPTUREDTIME'), 'yy-MM-dd HH:mm:ss'))
    .withColumn("captured_time_day", date_trunc("day", col("captured_time")))
    .groupBy("captured_time_day", "NODE", "CHANNEL", "LOCATION", "TACK")
    .agg(count("*"))
)
daily.sort("captured_time_day").show(100, False)
Weekly
This one is a bit more tricky. First I use the next_day function with "monday". If you consider Sunday the start of the week, update this code accordingly, but I consider Monday the start of the week (it depends on SQL dialects and regions, I believe).
Then we can also add a weekofyear function to retrieve the week number, as you wanted.
weekly = (
    df
    .withColumn("captured_time", to_timestamp(col('CAPTUREDTIME'), 'yy-MM-dd HH:mm:ss'))
    .withColumn("start_day", date_sub(next_day(col("captured_time"), "monday"), 7))
    .groupBy("start_day", "NODE", "CHANNEL", "LOCATION", "TACK")
    .agg(count("*"))
    .withColumn("start_day", to_timestamp(col("start_day")))
    .withColumn("week_of_year", weekofyear(col("start_day")))
)
weekly.sort("start_day").show(100, False)
Monthly
We just format the timestamp as a date and then cast it back to a timestamp. This is only done to show another way of doing it; we could also just truncate the timestamp as in the daily use case (a sketch of that simpler variant follows the code below). I also show two ways of extracting the month name and abbreviation. Just be aware of your Spark version, as this was tested on Spark 3.0.0.
monthly = (
    df
    .withColumn("captured_time", to_timestamp(col('CAPTUREDTIME'), 'yy-MM-dd HH:mm:ss'))
    .withColumn("captured_time_month", date_format(col('captured_time'), '1/M/yyyy'))
    .groupBy(col("captured_time_month"), "NODE", "CHANNEL", "LOCATION", "TACK")
    .agg(count("*").alias("Count TACK"))
    .withColumn("captured_time_month", to_timestamp(col("captured_time_month"), '1/M/yyyy'))
    .withColumn("month", month(col("captured_time_month")))
    .withColumn("month_abbr", date_format(col("captured_time_month"), 'MMM'))
    .withColumn("full_month_name", date_format(col("captured_time_month"), 'MMMM'))
)
monthly.sort("captured_time_month").show(100, False)
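As mentioned, the month case can also be done by simply truncating the timestamp, exactly like the daily case. A minimal sketch of that variant (same imports and dataframe as above):

monthly_trunc = (
    df
    .withColumn("captured_time", to_timestamp(col('CAPTUREDTIME'), 'yy-MM-dd HH:mm:ss'))
    .withColumn("captured_time_month", date_trunc("month", col("captured_time")))
    .groupBy("captured_time_month", "NODE", "CHANNEL", "LOCATION", "TACK")
    .agg(count("*").alias("Count TACK"))
)

monthly_trunc.sort("captured_time_month").show(100, False)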
Ciao !

Spark provides a relatively rich library for date manipulation. The answer to your question is a combination of extraction of date parts and date formatting for display.
I re-created your data as follows:
val capturesRaw = spark.read
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.option("delimiter", "|")
.option("header", "true")
.csv(spark.sparkContext.parallelize("""
CAPTUREDTIME| NODE| CHANNEL | LOCATION| TACK
20-05-09 03:06:21| PUSC_RES| SIMPLEX| NORTH_AL| UE220034
20-05-09 04:33:04| PUSC_RES| SIMPLEX| SOUTH_AL| UE220034
20-05-09 12:04:52| TESC_RES| SIMPLEX| NORTH_AL| UE220057
20-05-10 04:24:09| TESC_RES| SIMPLEX| NORTH_AL| UE220057
20-05-10 04:33:04| PUSC_RES| SIMPLEX| SOUTH_AL| UE220034
20-04-09 10:57:48| TESC_RES| SIMPLEX| NORTH_AL| UE220057
20-04-09 12:12:26| TESC_RES| SIMPLEX| NORTH_AL| UE220057
20-04-09 03:26:33| PUSC_RES| SIMPLEX| NORTH_AL| UE220071"""
.split("\n")).toDS)
Note: I use Scala, but the difference in the code is so small that I hope you find it understandable. I believe the val at the beginning is in fact the only difference.
I assume the first two digits represent a two-digit year? To proceed, we need to make sure capturedtime is a timestamp. I prefer to use SQL to manipulate dataframes, as I find it more readable.
capturesRaw.createOrReplaceTempView("captures_raw")  // register the raw data so SQL can see it

spark.sql("""select to_timestamp('20' || capturedtime) capturedtime, NODE, CHANNEL,
             LOCATION, TACK from captures_raw""")
  .createOrReplaceTempView("captures_raw")
The same thing can be done on the dataframe directly, if you prefer:
capturesRaw.withColumn("capturedtimestamp",
  to_timestamp(col("capturedtime"), "yy-MM-dd HH:mm:ss"))
At this point, we can create the fields you requested:
spark.sql("""select capturedtime,
month(capturedtime) cap_month,
weekofyear(capturedtime) cap_week,
day(capturedtime) cap_day,
hour(capturedtime) cap_hr, NODE, CHANNEL, LOCATION, TACK
from captures_raw""").createOrReplaceTempView("captures")
With the fields created, we are ready to answer your question. To aggregate by month alone (without the rest of the timestamp), for instance, proceed as follows:
spark.sql("""select date_format(capturedtime, "yyyy-MM") year_month, cap_month,
cap_week, cap_day, cap_hr, count(*) count
from captures
group by 1,2,3,4,5""").show
Which returns
+----------+---------+--------+-------+------+-----+
|year_month|cap_month|cap_week|cap_day|cap_hr|count|
+----------+---------+--------+-------+------+-----+
| 2020-04| 4| 15| 9| 3| 1|
| 2020-04| 4| 15| 9| 10| 1|
| 2020-05| 5| 19| 9| 4| 1|
| 2020-05| 5| 19| 9| 12| 1|
| 2020-04| 4| 15| 9| 12| 1|
| 2020-05| 5| 19| 9| 3| 1|
| 2020-05| 5| 19| 10| 4| 2|
+----------+---------+--------+-------+------+-----+
A daily summary can be produced as follows:
spark.sql("""select date_format(capturedtime, "yyyy-MM-dd") captured_date,
cap_day, cap_hr, count(*) count
from captures
group by 1,2,3""").show
+-------------+-------+------+-----+
|captured_date|cap_day|cap_hr|count|
+-------------+-------+------+-----+
| 2020-05-10| 10| 4| 2|
| 2020-04-09| 9| 12| 1|
| 2020-05-09| 9| 4| 1|
| 2020-05-09| 9| 12| 1|
| 2020-04-09| 9| 3| 1|
| 2020-04-09| 9| 10| 1|
| 2020-05-09| 9| 3| 1|
+-------------+-------+------+-----+
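Since the question itself is about PySpark: the SQL above runs unchanged from Python; only the surrounding calls differ. A sketch of the daily summary, assuming the captures view has already been created as shown above:

daily = spark.sql("""
    select date_format(capturedtime, 'yyyy-MM-dd') captured_date,
           cap_day, cap_hr, count(*) count
    from captures
    group by 1, 2, 3
""")
daily.show()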

Related

How to Pivot multiple columns in pyspark similar to pandas

I want to perform a similar operation in PySpark to what is possible with pandas.
My dataframe is (a single row, shown transposed here):
Year: 2021
win_loss_date: 2021-03-08 00:00:00
Deal: 1-2JZONGU
L2 GFCID Name: TEST GFCID CREATION
L2 GFCID: P-1-P1DO
GFCID: P-1-P5O
GFCID Name: TEST GFCID CREATION
Client Priority: None
Location: UNITED STATES
Deal Location: UNITED STATES
Revenue: 4567.0000000
Deal Conclusion: Won
New/Rebid: New
In pandas, the code to pivot is:
df = pd.pivot_table(deal_df_pandas,
                    index=['GFCID', 'GFCID Name', 'Client Priority'],
                    columns=['New/Rebid', 'Year', 'Deal Conclusion'],
                    aggfunc={'Deal': 'count',
                             'Revenue': 'sum',
                             'Location': lambda x: set(x),
                             'Deal Location': lambda x: set(x)}).reset_index()
columns=['New/Rebid', 'Year', 'Deal Conclusion'] are the columns being pivoted.
Output I get and expected:
GFCID GFCID Name Client Priority Deal Revenue
New/Rebid New Rebid New Rebid
Year 2020 2021 2020 2021 2020 2021 2020 2021
Deal Conclusion Lost Won Lost Won Lost Won Lost Won Lost Won Lost Won Lost Won Lost Won
0 0000000752 ARAMARK SERVICES INC Bronze NaN 1.0 1.0 2.0 NaN NaN NaN NaN NaN 1600000.0000000 20.0000000 20000.0000000 NaN NaN NaN NaN
What I want is to convert the above code to PySpark.
What I am trying is not working:
from pyspark.sql import functions as F

df_pivot2 = (df_d1
             .groupby('GFCID', 'GFCID Name', 'Client Priority')
             .pivot('New/Rebid')
             .agg(F.first('Year'), F.first('Deal Conclusion'), F.count('Deal'), F.sum('Revenue')))
This operation is not possible in PySpark:
(df_d1
 .groupby('GFCID', 'GFCID Name', 'Client Priority')
 .pivot('New/Rebid', 'Year', 'Deal Conclusion'))  # --error
You can concatenate the multiple columns into a single column, which can then be used within pivot.
Consider the following example:
data_sdf.show()
# +---+-----+--------+--------+
# | id|state| time|expected|
# +---+-----+--------+--------+
# | 1| A|20220722| 1|
# | 1| A|20220723| 1|
# | 1| B|20220724| 2|
# | 2| B|20220722| 1|
# | 2| C|20220723| 2|
# | 2| B|20220724| 3|
# +---+-----+--------+--------+
from pyspark.sql import functions as func

data_sdf. \
withColumn('pivot_col', func.concat_ws('_', 'state', 'time')). \
groupBy('id'). \
pivot('pivot_col'). \
agg(func.sum('expected')). \
fillna(0). \
show()
# +---+----------+----------+----------+----------+----------+
# | id|A_20220722|A_20220723|B_20220722|B_20220724|C_20220723|
# +---+----------+----------+----------+----------+----------+
# | 1| 1| 1| 0| 2| 0|
# | 2| 0| 0| 1| 3| 2|
# +---+----------+----------+----------+----------+----------+
The input dataframe had 2 fields, state and time, that were to be pivoted. They were concatenated with a '_' delimiter and used within pivot. After that, you can apply multiple aggregations within the agg, per your requirements, as sketched below.
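For the question's own columns, a rough sketch along the same lines could look like the following (this assumes the dataframe df_d1 and the column names shown in the question, and uses collect_set as a stand-in for the pandas set lambdas):

from pyspark.sql import functions as func

# Concatenate the three pivot fields into one column, pivot once, and apply
# several aggregations. Column names are taken from the question's dataframe.
deal_pivot = (df_d1
              .withColumn('pivot_col', func.concat_ws('_', 'New/Rebid', 'Year', 'Deal Conclusion'))
              .groupBy('GFCID', 'GFCID Name', 'Client Priority')
              .pivot('pivot_col')
              .agg(func.count('Deal').alias('Deal'),
                   func.sum('Revenue').alias('Revenue'),
                   func.collect_set('Location').alias('Location'),
                   func.collect_set('Deal Location').alias('Deal Location')))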

Joining a dataframe on 2 different columns gives error "column are ambiguous"

I have 2 dataframes:
df contains all train routes with origin and arrival columns (both ids and names)
df_relation contains Station (Gare) name, relation and API number.
Goal: I need to join these two dataframes twice on both origin and arrival columns.
I tried this:
df.groupBy("origin", "origin_id", "arrival", "direction") \
  .agg({'time_travelled': 'avg'}) \
  .filter(df.direction == 0) \
  .join(df_relation, df.origin == df_relation.Gare, "inner") \
  .join(df_relation, df.arrival == df_relation.Gare, "inner") \
  .orderBy("Relation") \
  .show()
But I got the following AnalysisException
AnalysisException: Column Gare#1708 are ambiguous.
It's probably because you joined several Datasets together, and some of these Datasets are the same.
This column points to one of the Datasets but Spark is unable to figure out which one.
Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id").
You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
How to rewrite this?
I have unsuccessfully tried to blindly follow the error recommendation like this
.as("a").join(df_relation.as("b"), $"a.arrival" == $"b.Gare", "inner")
This is my first dataframe (df):
+--------------------+----------+--------------------+---------+-------------------+
| origin| origin_id| arrival|direction|avg(time_travelled)|
+--------------------+----------+--------------------+---------+-------------------+
| Gare du Nord|IDFM:71410|La Plaine Stade d...| 1.0| 262.22222222222223|
| Gare du Nord|IDFM:71410|Aéroport CDG 1 (T...| 1.0| 1587.7551020408164|
|Villeparisis - Mi...|IDFM:68916| Mitry - Claye| 1.0| 240.0|
| Villepinte|IDFM:73547|Parc des Expositions| 1.0| 90.33898305084746|
| Le Blanc-Mesnil|IDFM:72648| Aulnay-sous-Bois| 1.0| 105.04273504273505|
|Aéroport CDG 1 (T...|IDFM:73596|Aéroport Charles ...| 1.0| 145.27777777777777|
This is my second dataframe (df_relation):
+-----------------------------------------+--------+--------+
|Gare |Relation|Gare Api|
+-----------------------------------------+--------+--------+
|Aéroport Charles de Gaulle 2 (Terminal 2)|1 |87001479|
|Aéroport CDG 1 (Terminal 3) - RER |2 |87271460|
|Parc des Expositions |3 |87271486|
|Villepinte |4 |87271452|
And this is what I am trying to achieve:
+--------------------+-----------+--------------------+---------+-------------------+--------+----------+-----------+
| origin| origin_id| arrival|direction|avg(time_travelled)|Relation|Api origin|Api arrival|
+--------------------+-----------+--------------------+---------+-------------------+--------+----------+-----------+
|Aéroport Charles ...| IDFM:73699|Aéroport CDG 1 (T...| 0.0| 110.09345794392523| 1| 87001479| 87271460|
|Aéroport CDG 1 (T...| IDFM:73596|Parc des Expositions| 0.0| 280.17543859649123| 2| 87271460| 87271486|
|Aéroport CDG 1 (T...| IDFM:73596| Gare du Nord| 0.0| 1707.4| 2| 87271460| 87271007|
|Parc des Expositions| IDFM:73568| Villepinte| 0.0| 90.17543859649123| 3| 87271486| 87271452|
| Villepinte| IDFM:73547| Sevran Beaudottes| 0.0| 112.45614035087719| 4| 87271452| 87271445|
| Sevran Beaudottes| IDFM:73491| Aulnay-sous-Bois| 0.0| 168.24561403508773| 5| 87271445| 87271411|
| Mitry - Claye| IDFM:69065|Villeparisis - Mi...| 0.0| 210.51724137931035| 6| 87271528| 87271510|
|Villeparisis - Mi...| IDFM:68916| Vert Galant| 0.0| 150.0| 7| 87271510| 87271510|
You take the original df and join df_relation twice. This way you create duplicate columns for every column in df_relation; the column "Gare" just happens to be the first of them, so it is the one shown in the error message.
To avoid the error, you will have to create aliases for your dataframes. Notice how I create them
df_agg.alias("agg")
df_relation.alias("rel_o")
df_relation.alias("rel_a")
and how I later reference them before every column.
from pyspark.sql import functions as F
df_agg = (
df.filter(F.col("direction") == 0)
.groupBy("origin", "origin_id", "arrival", "direction")
.agg({"time_travelled": "avg"})
)
df_result = (
df_agg.alias("agg")
.join(df_relation.alias("rel_o"), F.col("agg.origin") == F.col("rel_o.Gare"), "inner")
.join(df_relation.alias("rel_a"), F.col("agg.arrival") == F.col("rel_a.Gare"), "inner")
.orderBy("rel_o.Relation")
.select(
"agg.*",
"rel_o.Relation",
F.col("rel_o.Gare Api").alias("Api origin"),
F.col("rel_a.Gare Api").alias("Api arrival"),
)
)
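A possible alternative sketch, if you would rather avoid aliases entirely: rename the df_relation columns per role before joining, so every column name is unique (names below follow the question's dataframes):

rel_o = (df_relation
         .withColumnRenamed("Gare", "origin_gare")
         .withColumnRenamed("Gare Api", "Api origin"))
rel_a = (df_relation
         .withColumnRenamed("Gare", "arrival_gare")
         .withColumnRenamed("Gare Api", "Api arrival")
         .drop("Relation"))  # dropped so "Relation" stays unambiguous after both joins

df_result = (
    df_agg
    .join(rel_o, df_agg["origin"] == rel_o["origin_gare"], "inner")
    .join(rel_a, df_agg["arrival"] == rel_a["arrival_gare"], "inner")
    .drop("origin_gare", "arrival_gare")
    .orderBy("Relation")
)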

pyspark extra column where dates are transformed to 1, 2, 3

I have a dataframe with dates in the format YYYYMM.
These start from 201801.
I now want to add a column where 201801 = 1, 201802 = 2, and so on up to the most recent month, which is updated every month.
Kind regards,
wokter
months_between can be used:
from pyspark.sql import functions as F
from pyspark.sql import types as T

# some test data
data = [
    [201801],
    [201802],
    [201804],
    [201812],
    [202001],
    [202010]
]
df = spark.createDataFrame(data, schema=["yyyymm"])

df.withColumn("months", F.months_between(
    F.to_date(F.col("yyyymm").cast(T.StringType()), "yyyyMM"), F.lit("2017-12-01")
).cast(T.IntegerType())).show()
Output:
+------+------+
|yyyymm|months|
+------+------+
|201801| 1|
|201802| 2|
|201804| 4|
|201812| 12|
|202001| 25|
|202010| 34|
+------+------+
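If you prefer to avoid the date round trip, the same index can also be derived with plain integer arithmetic on the yyyymm value; a sketch, assuming 201801 should always map to 1:

# (yyyy - 2018) * 12 + mm, computed directly on the integer yyyymm column
df.withColumn(
    "months",
    ((F.floor(F.col("yyyymm") / 100) - 2018) * 12 + F.col("yyyymm") % 100).cast(T.IntegerType())
).show()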

PySpark Filter between - provide a list of upper and lower bounds, based on groups

I have a PySpark dataframe and would like to filter for rows between an upper bound and lower bound.
Typically, I would just use a filter with between:
import pandas as pd
from pyspark.sql import functions as F
# ... sql_context creation ...
pdfRaw = pd.DataFrame([{"vehicleID": 'A', "Segment": 'State Hwy', "speed": 68.0},
                       {"vehicleID": 'B', "Segment": 'State Hwy', "speed": 76.0}])
dfRaw = sql_context.createDataFrame(pdfRaw).select("vehicleID", "Segment", "speed")
dfRaw.show()
+-----------+------------+-----+
|  vehicleID|     Segment|value|
+-----------+------------+-----+
| A| State Hwy| 68.0|
| B| State Hwy| 73.0|
+-----------+------------+-----+
dfRaw.filter(F.col("speed").between(70,75)).show()
+-----------+------------+-----+
|  vehicleID|     Segment|value|
+-----------+------------+-----+
| B| State Hwy| 73.0|
+-----------+------------+-----+
However, I have multiple speed ranges that I would like to filter between:
Speeds_Curious = [
    [25, 30],
    [55, 60],
    [60, 65],
    [70, 75]
]
And I actually want to take it one step further: the upper and lower bounds for the between filter depend on the result of a groupBy on a previous dataframe.
df_RoadSegments.groupby('Segment')\
.agg(F.min('SpeedLimit').alias('minSpeed'),\
F.max('SpeedLimit').alias('maxSpeed'))\
.show()
+-----------+----------+----------+
|    Segment|  minSpeed|  maxSpeed|
+-----------+----------+----------+
| Urban| 25.0| 30.0|
| State Hwy| 55.0| 60.0|
|I-State Hwy| 60.0| 65.0|
|I-State Hwy| 70.0| 75.0|
+-----------+----------+----------+
So basically I would like to filter a dataframe between values that are available as columns on a different dataframe.
Something like:
dfLimits = df_RoadSegments.groupby('Segment')\
.agg(F.min('SpeedLimit').alias('minSpeed'), F.max('SpeedLimit').alias('maxSpeed'))
dfRaw.groupby('Segment')\
.filter(F.col("speed")\
.between(dfLimits.where(dfLimits.Segment=="State Hwy"(??)).select('minSpeed')),\
dfLimits.where(dfLimits.Segment=="State Hwy"(??)).select('maxSpeed'))))\
.show()
Any thoughts?
The following approach will get you all the vehicles whose speed is between the min and max speed for the particular segment they belong to.
You can join the two dataframes:
df_joined = dfRaw.join(dfLimits, on="Segment", how="left")
+---------+---------+-----+--------+--------+
| Segment|vehicleID|speed|minSpeed|maxSpeed|
+---------+---------+-----+--------+--------+
|State Hwy| A| 68.0| 55| 60|
|State Hwy| B| 76.0| 55| 60|
+---------+---------+-----+--------+--------+
If you want a further flag for whether the speed is between the mentioned bounds, then you can write:
flag_df = df_joined.withColumn("flag", F.when((F.col("speed") > F.col("minSpeed")) & (F.col("speed") < F.col("maxSpeed")), 1).otherwise(0))
flag_df.show()
+---------+---------+-----+--------+--------+----+
| Segment|vehicleID|speed|minSpeed|maxSpeed|flag|
+---------+---------+-----+--------+--------+----+
|State Hwy| A| 68.0| 55| 60| 0|
|State Hwy| B| 76.0| 55| 60| 0|
+---------+---------+-----+--------+--------+----+
You can then simply filter on the flag:
df_final = flag_df.filter(F.col("flag") == 1)
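If you do not need the flag column itself, the join and the filter can also be done in one go with between (which is inclusive of both bounds); a sketch using the dataframes above:

df_final = (
    dfRaw.join(dfLimits, on="Segment", how="left")
         .filter(F.col("speed").between(F.col("minSpeed"), F.col("maxSpeed")))
)
df_final.show()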

SQL/PySpark: Create a new column consisting of a number of rows in the past n days

Currently, I have a table consisting of encounter_id and date field like so:
+---------------------------+--------------------------+
|encounter_id |date |
+---------------------------+--------------------------+
|random_id34234 |2018-09-17 21:53:08.999999|
|this_can_be_anything2432432|2018-09-18 18:37:57.000000|
|423432 |2018-09-11 21:00:36.000000|
+---------------------------+--------------------------+
encounter_id is a random string.
I'm aiming to create a column which consists of the total number of encounters in the past 30 days.
+---------------------------+--------------------------+---------------------------+
|encounter_id |date | encounters_in_past_30_days|
+---------------------------+--------------------------+---------------------------+
|random_id34234 |2018-09-17 21:53:08.999999| 2 |
|this_can_be_anything2432432|2018-09-18 18:37:57.000000| 3 |
|423432 |2018-09-11 21:00:36.000000| 1 |
+---------------------------+--------------------------+---------------------------+
Currently, I'm thinking of somehow using window functions and specifying an aggregate function.
Thanks for your time.
Here is one possible solution; I added some sample data. It indeed uses a window function, as you suggested yourself. Hope this helps!
import pyspark.sql.functions as F
from pyspark.sql.window import Window

df = sqlContext.createDataFrame(
    [
        ('A', '2018-10-01 00:15:00'),
        ('B', '2018-10-11 00:30:00'),
        ('C', '2018-10-21 00:45:00'),
        ('D', '2018-11-10 00:00:00'),
        ('E', '2018-12-20 00:15:00'),
        ('F', '2018-12-30 00:30:00')
    ],
    ("encounter_id", "date")
)
df = df.withColumn('timestamp', F.col('date').astype('Timestamp').cast("long"))
w = Window.orderBy('timestamp').rangeBetween(-60*60*24*30, 0)
df = df.withColumn('encounters_past_30_days', F.count('encounter_id').over(w))
df.show()
Output:
+------------+-------------------+----------+-----------------------+
|encounter_id| date| timestamp|encounters_past_30_days|
+------------+-------------------+----------+-----------------------+
| A|2018-10-01 00:15:00|1538345700| 1|
| B|2018-10-11 00:30:00|1539210600| 2|
| C|2018-10-21 00:45:00|1540075500| 3|
| D|2018-11-10 00:00:00|1541804400| 2|
| E|2018-12-20 00:15:00|1545261300| 1|
| F|2018-12-30 00:30:00|1546126200| 2|
+------------+-------------------+----------+-----------------------+
EDIT: If you want to have days as the granularity, you could first convert your date column to the Date type. Example below, assuming that a window of five days means today and the four days before. If it should be today and the past five days just remove the -1.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

n_days = 5

df = sqlContext.createDataFrame(
    [
        ('A', '2018-10-01 23:15:00'),
        ('B', '2018-10-02 00:30:00'),
        ('C', '2018-10-05 05:45:00'),
        ('D', '2018-10-06 00:15:00'),
        ('E', '2018-10-07 00:15:00'),
        ('F', '2018-10-10 21:30:00')
    ],
    ("encounter_id", "date")
)
df = df.withColumn('timestamp', F.to_date(F.col('date')).astype('Timestamp').cast("long"))
w = Window.orderBy('timestamp').rangeBetween(-60*60*24*(n_days-1), 0)
df = df.withColumn('encounters_past_n_days', F.count('encounter_id').over(w))
df.show()
Output:
+------------+-------------------+----------+----------------------+
|encounter_id| date| timestamp|encounters_past_n_days|
+------------+-------------------+----------+----------------------+
| A|2018-10-01 23:15:00|1538344800| 1|
| B|2018-10-02 00:30:00|1538431200| 2|
| C|2018-10-05 05:45:00|1538690400| 3|
| D|2018-10-06 00:15:00|1538776800| 3|
| E|2018-10-07 00:15:00|1538863200| 3|
| F|2018-10-10 21:30:00|1539122400| 3|
+------------+-------------------+----------+----------------------+