How to select rows with max values in categories? - dataframe

For each ID key, I would like to aggregate and select the row with the maximum Day.
ID  | col1 | col2 | month | Day
AI1 | 5    | 2    | janv  | 15
AI2 | 6    | 0    | Dec   | 16
AI1 | 1    | 7    | March | 16
AI3 | 9    | 4    | Nov   | 18
AI2 | 3    | 20   | Fev   | 20
AI3 | 10   | 8    | June  | 06
Desired result:
ID  | col1 | col2 | month | Day
AI1 | 1    | 7    | March | 16
AI2 | 3    | 20   | Fev   | 20
AI3 | 9    | 4    | Nov   | 18

The only solution that comes to my mind is to:
Get the highest day for each ID (using groupBy)
Append the value of the highest day to each line (with matching ID) using a join
Then apply a simple filter keeping the rows where day equals maxDay
# select the max value for each of the ID
maxDayForIDs = df.groupBy("ID").max("day").withColumnRenamed("max(day)", "maxDay")
# now add the max value of the day for each line (with matching ID)
df = df.join(maxDayForIDs, "ID")
# keep only the lines where it matches "day" equals "maxDay"
df = df.filter(df.day == df.maxDay)

Usually this kind of operation is done using window functions like rank, dense_rank or row_number.
from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [('AI1', 5, 2, 'janv', '15'),
     ('AI2', 6, 0, 'Dec', '16'),
     ('AI1', 1, 7, 'March', '16'),
     ('AI3', 9, 4, 'Nov', '18'),
     ('AI2', 3, 20, 'Fev', '20'),
     ('AI3', 10, 8, 'June', '06')],
    ['ID', 'col1', 'col2', 'month', 'Day']
)

w = W.partitionBy('ID').orderBy(F.desc('Day'))
df = df.withColumn('_rn', F.row_number().over(w))
df = df.filter('_rn = 1').drop('_rn')
df.show()
# +---+----+----+-----+---+
# | ID|col1|col2|month|Day|
# +---+----+----+-----+---+
# |AI1| 1| 7|March| 16|
# |AI2| 3| 20| Fev| 20|
# |AI3| 9| 4| Nov| 18|
# +---+----+----+-----+---+
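If several rows tie for the maximum Day within an ID and all of them should be kept, rank or dense_rank (mentioned above) can be swapped in for row_number. A minimal sketch, assuming df is the DataFrame as originally created (before the filter above) and w is the same window:

# dense_rank assigns 1 to every row tied for the highest Day, so ties survive the filter
df_ties = df.withColumn('_rk', F.dense_rank().over(w)).filter('_rk = 1').drop('_rk')
df_ties.show()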

Make it simple
# assumes df is the original (unfiltered) DataFrame created above
from pyspark.sql.functions import col, first
from pyspark.sql import Window as W

w = W.partitionBy('ID').orderBy(col('Day').desc())
new = (df.withColumn('max', first('Day').over(w))  # order by Day descending and keep the first value per group as max
         .where(col('Day') == col('max'))          # filter where Day equals max
         .drop('max'))                             # drop the helper column
new.show()

Related

Group ids by 2 date interval columns and 2 other columns

I have the following dataframe:
ID | Fruit  | Price | Location | Start_Date | End_Date
01 | Orange | 12    | ABC      | 01-03-2015 | 01-05-2015
01 | Orange | 9.5   | ABC      | 01-03-2015 | 01-05-2015
02 | Apple  | 10    | PQR      | 04-09-2019 | 04-11-2019
06 | Orange | 11    | ABC      | 01-04-2015 | 01-06-2015
05 | Peach  | 15    | XYZ      | 07-11-2021 | 07-13-2021
08 | Apple  | 10.5  | PQR      | 04-09-2019 | 04-11-2019
10 | Apple  | 10    | LMN      | 04-10-2019 | 04-12-2019
03 | Peach  | 14.5  | XYZ      | 07-11-2020 | 07-13-2020
11 | Peach  | 12.5  | ABC      | 01-04-2015 | 01-05-2015
12 | Peach  | 12.5  | ABC      | 01-03-2015 | 01-05-2015
I want to form a group of IDs that belong to the same location, fruit, and range of start date and end date.
The date interval condition is that we only group those ids together whose start_date and end_date are no more than 3 days apart.
Eg. ID 06 start_date is 01-04-2015 and end_date is 01-06-2015.
ID 01 start_date is 01-03-2015 and end_date is 01-05-2015.
So ID 06 and 01's start_date and end_date are only 1 day apart so the merge is acceptable (i.e. these two ids can be grouped together if other variables like location and fruit match).
Also, I only want to output groups with more than 1 unique ID.
My output should be (the start date and end date are merged):
ID | Fruit  | Price | Location | Start_Date | End_Date
01 | Orange | 12    | ABC      | 01-03-2015 | 01-06-2015
01 | Orange | 9.5   |          |            |
06 | Orange | 11    |          |            |
11 | Peach  | 12.5  |          |            |
12 | Peach  | 12.5  |          |            |
02 | Apple  | 10    | PQR      | 04-09-2019 | 04-11-2019
08 | Apple  | 10.5  |          |            |
IDs 05 and 03 get filtered out because they are single records (they don't meet the date interval condition).
ID 10 gets filtered out because it's from a different location.
I have no idea how to merge intervals for 2 such date columns. I have tried a few techniques to test out grouping (without the date merge).
My latest one is using grouper.
output = df.groupby([pd.Grouper(key='Start_Date', freq='D'),
                     pd.Grouper(key='End_Date', freq='D'),
                     pd.Grouper(key='Location'),
                     pd.Grouper(key='Fruit'),
                     'ID']).agg(unique_emp=('ID', 'nunique'))
Need help getting the output. Thank you!!
This is essentially a gap-and-island problem. If you sort your dataframe by Fruit, Location and Start_Date, you can create islands (i.e. fruit groups) as follows:
If the current row's Fruit or Location is not the same as the previous row's, start a new island
If the current row's End Date is more than 3 days after the island's Start Date, make a new island
The code:
for col in ["Start_Date", "End_Date"]:
    df[col] = pd.to_datetime(df[col])

# This algorithm requires a sorted dataframe
df = df.sort_values(["Fruit", "Location", "Start_Date"])

# Assign each row to an island
i = 0
islands = []
last_fruit, last_location, last_start = None, None, df["Start_Date"].iloc[0]
for _, (fruit, location, start, end) in df[["Fruit", "Location", "Start_Date", "End_Date"]].iterrows():
    if (fruit != last_fruit) or (location != last_location) or (end - last_start > pd.Timedelta(days=3)):
        i += 1
        last_fruit, last_location, last_start = fruit, location, start
    else:
        last_fruit, last_location = fruit, location
    islands.append(i)
df["Island"] = islands

# Filter for islands having more than 1 row
idx = pd.Series(islands).value_counts().loc[lambda c: c > 1].index
df[df["Island"].isin(idx)]
Here is a slow, non-vectorized approach where we "manually" walk through the sorted date values and assign them to bins, incrementing to the next bin when the gap is too large. It uses a function to add new columns to the df. Edited so that the ID column is the index.
from datetime import timedelta
import pandas as pd

# Setup
df = pd.DataFrame(
    columns=['ID', 'Fruit', 'Price', 'Location', 'Start_Date', 'End_Date'],
    data=[
        [1, 'Orange', 12.0, 'ABC', '01-03-2015', '01-05-2015'],
        [1, 'Orange', 9.5, 'ABC', '01-03-2015', '01-05-2015'],
        [2, 'Apple', 10.0, 'PQR', '04-09-2019', '04-11-2019'],
        [6, 'Orange', 11.0, 'ABC', '01-04-2015', '01-06-2015'],
        [5, 'Peach', 15.0, 'XYZ', '07-11-2021', '07-13-2021'],
        [8, 'Apple', 10.5, 'PQR', '04-09-2019', '04-11-2019'],
        [10, 'Apple', 10.0, 'LMN', '04-10-2019', '04-12-2019'],
        [3, 'Peach', 14.5, 'XYZ', '07-11-2020', '07-13-2020'],
        [11, 'Peach', 12.5, 'ABC', '01-04-2015', '01-05-2015'],
        [12, 'Peach', 12.5, 'ABC', '01-03-2015', '01-05-2015'],
    ]
)
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df['End_Date'] = pd.to_datetime(df['End_Date'])
df = df.set_index('ID')

# Function to bin the dates
def create_date_bin_series(dates, max_span=timedelta(days=3)):
    orig_order = zip(dates, range(len(dates)))
    sorted_order = sorted(orig_order)
    curr_bin = 1
    curr_date = min(dates)
    date_bins = []
    for date, i in sorted_order:
        if date - curr_date > max_span:
            curr_bin += 1
            curr_date = date
        date_bins.append((curr_bin, i))
    # sort the date_bins to match the original order
    date_bins = [v for v, _ in sorted(date_bins, key=lambda x: x[1])]
    return date_bins

# Apply the function to group each date into a bin with other dates within 3 days of it
start_bins = create_date_bin_series(df['Start_Date'])
end_bins = create_date_bin_series(df['End_Date'])

# Group by the new columns
df['fruit_group'] = df.groupby(['Fruit', 'Location', start_bins, end_bins]).ngroup()

# Print the table sorted by these new groups
print(df.sort_values('fruit_group'))
# you can use the new fruit_group column to filter and agg etc
Output
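As the last comment suggests, the fruit_group column can then be used to keep only the groups with more than one unique ID; a minimal sketch, assuming the df built above (where ID is the index):

# keep only fruit groups that contain more than one distinct ID
multi_id = df.groupby('fruit_group').filter(lambda g: g.index.nunique() > 1)
print(multi_id.sort_values('fruit_group'))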

PySpark generate missing dates and fill data with previous value

I need help with this case, where missing values have to be filled in by adding new rows:
This is just an example, but I have a lot of rows with different IDs.
Input dataframe:
ID  | FLAG | DATE
123 | 1    | 01/01/2021
123 | 0    | 01/02/2021
123 | 1    | 01/03/2021
123 | 0    | 01/06/2021
123 | 0    | 01/08/2021
777 | 0    | 01/01/2021
777 | 1    | 01/03/2021
So I have a finite set of dates, and for each ID I want to take every date up to its last one (in the example, for ID = 123: 01/01/2021, 01/02/2021, 01/03/2021, ... until 01/08/2021). Basically I could do a cross join with a calendar, but I don't know how to fill the missing values with a rule or a filter after the cross join.
Expected output (rows marked with * are the generated missing values):
ID  | FLAG | DATE
123 | 1    | 01/01/2021
123 | 0    | 01/02/2021
123 | 1    | 01/03/2021
123 | 1    | 01/04/2021  *
123 | 1    | 01/05/2021  *
123 | 0    | 01/06/2021
123 | 0    | 01/07/2021  *
123 | 0    | 01/08/2021
777 | 0    | 01/01/2021
777 | 0    | 01/02/2021  *
777 | 1    | 01/03/2021
You can first group by id to calculate the max and min date, then use the sequence function to generate all the dates from min_date to max_date. Finally, join with the original dataframe and fill the nulls with the last non-null value per group of id. Here's a complete working example:
Your input dataframe:
from pyspark.sql import Window
import pyspark.sql.functions as F
df = spark.createDataFrame([
    (123, 1, "01/01/2021"), (123, 0, "01/02/2021"),
    (123, 1, "01/03/2021"), (123, 0, "01/06/2021"),
    (123, 0, "01/08/2021"), (777, 0, "01/01/2021"),
    (777, 1, "01/03/2021")
], ["id", "flag", "date"])
Groupby id and generate all possible dates for each id:
all_dates_df = df.groupBy("id").agg(
    F.date_trunc("mm", F.max(F.to_date("date", "dd/MM/yyyy"))).alias("max_date"),
    F.date_trunc("mm", F.min(F.to_date("date", "dd/MM/yyyy"))).alias("min_date")
).select(
    "id",
    F.expr("sequence(min_date, max_date, interval 1 month)").alias("date")
).withColumn(
    "date", F.explode("date")
).withColumn(
    "date", F.date_format("date", "dd/MM/yyyy")
)
Now, left join with df and use last function over a Window partitioned by id to fill null values:
w = Window.partitionBy("id").orderBy("date")

result = all_dates_df.join(df, ["id", "date"], "left").select(
    "id",
    "date",
    *[F.last(F.col(c), ignorenulls=True).over(w).alias(c)
      for c in df.columns if c not in ("id", "date")]
)
result.show()
#+---+----------+----+
#| id| date|flag|
#+---+----------+----+
#|123|01/01/2021| 1|
#|123|01/02/2021| 0|
#|123|01/03/2021| 1|
#|123|01/04/2021| 1|
#|123|01/05/2021| 1|
#|123|01/06/2021| 0|
#|123|01/07/2021| 0|
#|123|01/08/2021| 0|
#|777|01/01/2021| 0|
#|777|01/02/2021| 0|
#|777|01/03/2021| 1|
#+---+----------+----+
You can find the ranges of dates between the DATE value in the current row and the following row and then use sequence to generate all intermediate dates and explode this array to fill in values for the missing dates.
from pyspark.sql import functions as F
from pyspark.sql import Window

data = [(123, 1, "01/01/2021",),
        (123, 0, "01/02/2021",),
        (123, 1, "01/03/2021",),
        (123, 0, "01/06/2021",),
        (123, 0, "01/08/2021",),
        (777, 0, "01/01/2021",),
        (777, 1, "01/03/2021",), ]

df = spark.createDataFrame(data, ("ID", "FLAG", "DATE",)).withColumn("DATE", F.to_date(F.col("DATE"), "dd/MM/yyyy"))

window_spec = Window.partitionBy("ID").orderBy("DATE")

next_date = F.coalesce(F.lead("DATE", 1).over(window_spec), F.col("DATE") + F.expr("interval 1 month"))
end_date_range = next_date - F.expr("interval 1 month")

df.withColumn("Ranges", F.sequence(F.col("DATE"), end_date_range, F.expr("interval 1 month")))\
  .withColumn("DATE", F.explode("Ranges"))\
  .withColumn("DATE", F.date_format("DATE", "dd/MM/yyyy"))\
  .drop("Ranges").show(truncate=False)
Output
+---+----+----------+
|ID |FLAG|DATE |
+---+----+----------+
|123|1 |01/01/2021|
|123|0 |01/02/2021|
|123|1 |01/03/2021|
|123|1 |01/04/2021|
|123|1 |01/05/2021|
|123|0 |01/06/2021|
|123|0 |01/07/2021|
|123|0 |01/08/2021|
|777|0 |01/01/2021|
|777|0 |01/02/2021|
|777|1 |01/03/2021|
+---+----+----------+

Window & Aggregate functions in Pyspark SQL/SQL

After the answer by @Vaebhav, I realized the question was not set up correctly, hence I am editing it with his code snippet.
I have the following table
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, TimestampType, DoubleType

input_str = """
4219,2018-01-01 08:10:00,3.0,50.78,
4216,2018-01-02 08:01:00,5.0,100.84,
4217,2018-01-02 20:00:00,4.0,800.49,
4139,2018-01-03 11:05:00,1.0,400.0,
4170,2018-01-03 09:10:00,2.0,100.0,
4029,2018-01-06 09:06:00,6.0,300.55,
4029,2018-01-06 09:16:00,2.0,310.55,
4217,2018-01-06 09:36:00,5.0,307.55,
1139,2018-01-21 11:05:00,1.0,400.0,
2170,2018-01-21 09:10:00,2.0,100.0,
4218,2018-02-06 09:36:00,5.0,307.55,
4218,2018-02-06 09:36:00,5.0,307.55
""".split(",")

input_values = list(map(lambda x: x.strip() if x.strip() != '' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "customer_id,timestamp,quantity,price".split(',')))

n = len(input_values)
n_cols = 4

input_list = [tuple(input_values[i:i + n_cols]) for i in range(0, n, n_cols)]

sparkDF = sqlContext.createDataFrame(input_list, cols)
sparkDF = sparkDF.withColumn('customer_id', F.col('customer_id').cast(IntegerType()))\
                 .withColumn('timestamp', F.col('timestamp').cast(TimestampType()))\
                 .withColumn('quantity', F.col('quantity').cast(IntegerType()))\
                 .withColumn('price', F.col('price').cast(DoubleType()))
I want to calculate the aggregate as follows:
trxn_date  | unique_cust_visits | next_7_day_visits | next_30_day_visits
2018-01-01 | 1                  | 7                 | 9
2018-01-02 | 2                  | 6                 | 8
2018-01-03 | 2                  | 4                 | 6
2018-01-06 | 2                  | 2                 | 4
2018-01-21 | 2                  | 2                 | 3
2018-02-06 | 1                  | 1                 | 1
where
trxn_date is the date from the timestamp column,
unique_cust_visits is the unique count of customers per day,
next_7_day_visits is a count of customer visits over a 7-day forward-looking rolling window,
next_30_day_visits is a count of customer visits over a 30-day forward-looking rolling window.
I want to write the code as a single SQL query.
You can achieve this by using a ROWS rather than a RANGE frame type; a good explanation can be found here:
ROW - based on physical offsets from the position of the current input row
RANGE - based on logical offsets from the position of the current input row
Also, in your implementation a PARTITION BY clause would be redundant, as it won't create the required frames for a look-ahead.
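As a quick illustration of the two frame types (a hedged sketch using the PySpark Window API rather than SQL; the column names here are only placeholders):

from pyspark.sql import Window

# ROWS: the frame is counted in physical rows - the current row plus the next 7 rows
rows_frame = Window.orderBy("trxn_date").rowsBetween(Window.currentRow, 7)

# RANGE: the frame is defined on the ORDER BY value - all rows whose ordering value
# lies within 7 units of the current row's value
range_frame = Window.orderBy("day_number").rangeBetween(Window.currentRow, 7)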
Data Preparation
input_str = """
4219,2018-01-02 08:10:00,3.0,50.78,
4216,2018-01-02 08:01:00,5.0,100.84,
4217,2018-01-02 20:00:00,4.0,800.49,
4139,2018-01-03 11:05:00,1.0,400.0,
4170,2018-01-03 09:10:00,2.0,100.0,
4029,2018-01-06 09:06:00,6.0,300.55,
4029,2018-01-06 09:16:00,2.0,310.55,
4217,2018-01-06 09:36:00,5.0,307.55
""".split(",")
input_values = list(map(lambda x: x.strip() if x.strip() != '' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "customer_id timestamp quantity price".split('\t')))
n = len(input_values)
n_cols = 4
input_list = [tuple(input_values[i:i+n_cols]) for i in range(0,n,n_cols)]
sparkDF = sql.createDataFrame(input_list,cols)
sparkDF = sparkDF.withColumn('customer_id',F.col('customer_id').cast(IntegerType()))\
.withColumn('timestamp',F.col('timestamp').cast(TimestampType()))\
.withColumn('quantity',F.col('quantity').cast(IntegerType()))\
.withColumn('price',F.col('price').cast(DoubleType()))
sparkDF.show()
+-----------+-------------------+--------+------+
|customer_id| timestamp|quantity| price|
+-----------+-------------------+--------+------+
| 4219|2018-01-02 08:10:00| 3| 50.78|
| 4216|2018-01-02 08:01:00| 5|100.84|
| 4217|2018-01-02 20:00:00| 4|800.49|
| 4139|2018-01-03 11:05:00| 1| 400.0|
| 4170|2018-01-03 09:10:00| 2| 100.0|
| 4029|2018-01-06 09:06:00| 6|300.55|
| 4029|2018-01-06 09:16:00| 2|310.55|
| 4217|2018-01-06 09:36:00| 5|307.55|
+-----------+-------------------+--------+------+
Window Aggregates
sparkDF.createOrReplaceTempView("transactions")

sql.sql("""
SELECT
    TO_DATE(timestamp) as trxn_date
    ,COUNT(DISTINCT customer_id) as unique_cust_visits
    ,SUM(COUNT(DISTINCT customer_id)) OVER (
        ORDER BY 'timestamp'
        ROWS BETWEEN CURRENT ROW AND 7 FOLLOWING
    ) as next_7_day_visits
FROM transactions
GROUP BY 1
""").show()
+----------+------------------+-----------------+
| trxn_date|unique_cust_visits|next_7_day_visits|
+----------+------------------+-----------------+
|2018-01-02| 3| 7|
|2018-01-03| 2| 4|
|2018-01-06| 2| 2|
+----------+------------------+-----------------+
Building upon @Vaebhav's answer, the required query in this case is the following (ordering the dates descending so that a trailing RANGE frame effectively looks forward):
sqlContext.sql("""
SELECT
TO_DATE(timestamp) as trxn_date
,COUNT(DISTINCT customer_id) as unique_cust_visits
,SUM(COUNT(DISTINCT customer_id)) OVER (
ORDER BY CAST(TO_DATE(timestamp) AS TIMESTAMP) DESC
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) as next_7_day_visits
,SUM(COUNT(DISTINCT customer_id)) OVER (
ORDER BY CAST(TO_DATE(timestamp) AS TIMESTAMP) DESC
RANGE BETWEEN INTERVAL 30 DAYS PRECEDING AND CURRENT ROW
) as next_30_day_visits
FROM transactions
GROUP BY 1
ORDER by trxn_date
""").show()
trxn_date  | unique_cust_visits | next_7_day_visits | next_30_day_visits
2018-01-01 | 1                  | 7                 | 9
2018-01-02 | 2                  | 6                 | 8
2018-01-03 | 2                  | 4                 | 6
2018-01-06 | 2                  | 2                 | 4
2018-01-21 | 2                  | 2                 | 3
2018-02-06 | 1                  | 1                 | 1

Pyspark: Filter dataframe and apply function to offset time

I have a dataframe like this:
import time
import datetime
import pandas as pd

df = pd.DataFrame({'Number': ['1', '2', '1', '1'],
                   'Letter': ['A', 'A', 'B', 'A'],
                   'Time': ['2019-04-30 18:15:00', '2019-04-30 18:15:00', '2019-04-30 18:15:00', '2019-04-30 18:15:00'],
                   'Value': [30, 30, 30, 60]})
df['Time'] = pd.to_datetime(df['Time'])
Number Letter Time Value
0 1 A 2019-04-30 18:15:00 30
1 2 A 2019-04-30 18:15:00 30
2 1 B 2019-04-30 18:15:00 30
3 1 A 2019-04-30 18:15:00 60
I would like to do something similar in Pyspark as I do in Pandas where I filter on a specific set of data:
#: Want to target only rows where the Number = '1' and the Letter is 'A'.
target_df = df[
    (df['Number'] == '1') &
    (df['Letter'] == 'A')
]
And apply a change to a value based on another column:
#: Loop over these rows and subtract the offset value from the Time.
for index, row in target_df.iterrows():
    offset = row['Value']
    df.loc[index, 'Time'] = row['Time'] - datetime.timedelta(seconds=row['Value'])
To get a final output like so:
Number Letter Time Value
0 1 A 2019-04-30 18:14:30 30
1 2 A 2019-04-30 18:15:00 30
2 1 B 2019-04-30 18:15:00 30
3 1 A 2019-04-30 18:14:00 60
What is the best way to go about this in Pyspark?
I was thinking something along the lines of this:
pyspark_df = spark.createDataFrame(df)
pyspark_df.withColumn('new_time', F.when(
F.col('Number') == '1' & F.col('Letter' == 'A'), F.col('Time') - datetime.timedelta(seconds=(F.col('Value')))).otherwise(
F.col('Time')))
But that doesn't seem to work for me.
You can try with unix timestamp:
import pyspark.sql.functions as F

cond_val = (F.when((F.col("Number") == 1) & (F.col("Letter") == "A"),
                   F.from_unixtime(F.unix_timestamp(F.col("Time")) - F.col("Value")))
            .otherwise(F.col("Time")))

df.withColumn("Time", cond_val).show()
+------+------+-------------------+-----+
|Number|Letter| Time|Value|
+------+------+-------------------+-----+
| 1| A|2019-04-30 18:14:30| 30|
| 2| A|2019-04-30 18:15:00| 30|
| 1| B|2019-04-30 18:15:00| 30|
| 1| A|2019-04-30 18:14:00| 60|
+------+------+-------------------+-----+
Just an addition, you don't need iterrows in pandas, just do:
c = df['Number'].eq('1') & df['Letter'].eq('A')
df.loc[c, 'Time'] = df['Time'].sub(pd.to_timedelta(df['Value'], unit='s'))
# or faster (requires numpy imported as np):
# df['Time'] = np.where(c, df['Time'].sub(pd.to_timedelta(df['Value'], unit='s'))
#                       , df['Time'])

How to sort a dataframe by values from a list

I have a list with numbers:
[18, 22, 20]
and a dataframe:
Id | node_id
UC5E9-r42JlymhLPnDv2wHuA | 20
UCFqcNI0NaAA21NS9W3ExCRg | 18
UCrb6U1FuOP5EZ7n7LfOJMMQ | 22
The list numbers map to node_id values, and the order of the node_id values matters: they must appear in the order given by the list.
So the dataframe is in the wrong order, and I need to sort it by the list values.
End result should be:
Id | node_id
UCFqcNI0NaAA21NS9W3ExCRg | 18
UCrb6U1FuOP5EZ7n7LfOJMMQ | 22
UC5E9-r42JlymhLPnDv2wHuA | 20
How can I do this?
Use an ordered Categorical, so you can use DataFrame.sort_values:
L = [18, 22, 20]
df['node_id'] = pd.Categorical(df['node_id'], ordered=True, categories=L)
df = df.sort_values('node_id')
print (df)
Id node_id
1 UCFqcNI0NaAA21NS9W3ExCRg 18
2 UCrb6U1FuOP5EZ7n7LfOJMMQ 22
0 UC5E9-r42JlymhLPnDv2wHuA 20
If you want to avoid a Categorical column:
df = df.iloc[df['node_id'].map({v: k for k, v in enumerate(L)}).argsort()]
I will do
l=[18, 22, 20]
df=df.iloc[pd.Categorical(df.node_id, l).argsort()]
Out[79]:
Id node_id
1 UCFqcNI0NaAA21NS9W3ExCRg 18
2 UCrb6U1FuOP5EZ7n7LfOJMMQ 22
0 UC5E9-r42JlymhLPnDv2wHuA 20