Group ids by 2 date interval columns and 2 other columns - pandas

I have the following dataframe:
ID    Fruit    Price    Location    Start_Date    End_Date
01    Orange   12       ABC         01-03-2015    01-05-2015
01    Orange   9.5      ABC         01-03-2015    01-05-2015
02    Apple    10       PQR         04-09-2019    04-11-2019
06    Orange   11       ABC         01-04-2015    01-06-2015
05    Peach    15       XYZ         07-11-2021    07-13-2021
08    Apple    10.5     PQR         04-09-2019    04-11-2019
10    Apple    10       LMN         04-10-2019    04-12-2019
03    Peach    14.5     XYZ         07-11-2020    07-13-2020
11    Peach    12.5     ABC         01-04-2015    01-05-2015
12    Peach    12.5     ABC         01-03-2015    01-05-2015
I want to form groups of IDs that belong to the same location and fruit and whose start and end dates fall in the same range.
The date interval condition is that we only group those IDs together whose start_date and end_date are no more than 3 days apart.
E.g. ID 06's start_date is 01-04-2015 and its end_date is 01-06-2015, while ID 01's start_date is 01-03-2015 and its end_date is 01-05-2015. The two start dates and the two end dates are each only 1 day apart, so the merge is acceptable (i.e. these two IDs can be grouped together if the other variables, location and fruit, match).
Also, I only want to output groups with more than one unique ID.
My output should be as follows (the start and end dates are merged; blank cells repeat the merged value above):
ID    Fruit    Price    Location    Start_Date    End_Date
01    Orange   12       ABC         01-03-2015    01-06-2015
01    Orange   9.5
06    Orange   11
11    Peach    12.5
12    Peach    12.5
02    Apple    10       PQR         04-09-2019    04-11-2019
08    Apple    10.5
IDs 05 and 03 get filtered out because each is a single record (they don't meet the date interval condition with any other row).
ID 10 gets filtered out because it's from a different location.
I have no idea how to merge intervals for 2 such date columns. I have tried a few techniques to test out the grouping (without the date merge). My latest one uses pd.Grouper:

output = df.groupby([pd.Grouper(key='Start_Date', freq='D'),
                     pd.Grouper(key='End_Date', freq='D'),
                     pd.Grouper(key='Location'),
                     pd.Grouper(key='Fruit'),
                     'ID']).agg(unique_emp=('ID', 'nunique'))
Need help getting the output. Thank you!!

This is essentially a gaps-and-islands problem. If you sort your dataframe by Fruit, Location and Start_Date, you can create islands (i.e. fruit groups) as follows:
- If the current row's Fruit or Location is not the same as the previous row's, start a new island.
- If the current row's End_Date is more than 3 days after the island's Start_Date, start a new island.
The code:

import pandas as pd

for col in ["Start_Date", "End_Date"]:
    df[col] = pd.to_datetime(df[col])

# This algorithm requires a sorted dataframe
df = df.sort_values(["Fruit", "Location", "Start_Date"])

# Assign each row to an island
i = 0
islands = []
last_fruit, last_location, last_start = None, None, df["Start_Date"].iloc[0]
for _, (fruit, location, start, end) in df[["Fruit", "Location", "Start_Date", "End_Date"]].iterrows():
    if (fruit != last_fruit) or (location != last_location) or (end - last_start > pd.Timedelta(days=3)):
        # Start a new island and remember its first Start_Date
        i += 1
        last_fruit, last_location, last_start = fruit, location, start
    else:
        last_fruit, last_location = fruit, location
    islands.append(i)
df["Island"] = islands

# Filter for islands having more than 1 row
idx = pd.Series(islands).value_counts().loc[lambda c: c > 1].index
df[df["Island"].isin(idx)]
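A vectorized sketch of the same islands idea (my addition, not part of the answer above). Caveat: it compares each row's End_Date with the previous row's Start_Date rather than the island's first Start_Date, so it can differ from the loop when an island spans more than two rows:

import pandas as pd

# Assumes df is already sorted by Fruit, Location, Start_Date with datetime columns
new_island = (
    (df["Fruit"] != df["Fruit"].shift())
    | (df["Location"] != df["Location"].shift())
    | (df["End_Date"] - df["Start_Date"].shift() > pd.Timedelta(days=3))
)
df["Island"] = new_island.cumsum()  # cumulative sum of the booleans yields island ids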

Here is a slow, non-vectorized approach in which we "manually" walk through the sorted date values and assign them to bins, incrementing to the next bin when the gap is too large. It uses a function to add the new bin columns to the df. Edited so that the ID column is the index.
from datetime import timedelta
import pandas as pd

# Setup
df = pd.DataFrame(
    columns=['ID', 'Fruit', 'Price', 'Location', 'Start_Date', 'End_Date'],
    data=[
        [1, 'Orange', 12.0, 'ABC', '01-03-2015', '01-05-2015'],
        [1, 'Orange', 9.5, 'ABC', '01-03-2015', '01-05-2015'],
        [2, 'Apple', 10.0, 'PQR', '04-09-2019', '04-11-2019'],
        [6, 'Orange', 11.0, 'ABC', '01-04-2015', '01-06-2015'],
        [5, 'Peach', 15.0, 'XYZ', '07-11-2021', '07-13-2021'],
        [8, 'Apple', 10.5, 'PQR', '04-09-2019', '04-11-2019'],
        [10, 'Apple', 10.0, 'LMN', '04-10-2019', '04-12-2019'],
        [3, 'Peach', 14.5, 'XYZ', '07-11-2020', '07-13-2020'],
        [11, 'Peach', 12.5, 'ABC', '01-04-2015', '01-05-2015'],
        [12, 'Peach', 12.5, 'ABC', '01-03-2015', '01-05-2015'],
    ]
)
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df['End_Date'] = pd.to_datetime(df['End_Date'])
df = df.set_index('ID')
# Function to bin the dates
def create_date_bin_series(dates, max_span=timedelta(days=3)):
    orig_order = zip(dates, range(len(dates)))
    sorted_order = sorted(orig_order)
    curr_bin = 1
    curr_date = min(dates)
    date_bins = []
    for date, i in sorted_order:
        if date - curr_date > max_span:
            curr_bin += 1
            curr_date = date
        date_bins.append((curr_bin, i))
    # Sort the date_bins to match the original order
    date_bins = [v for v, _ in sorted(date_bins, key=lambda x: x[1])]
    return date_bins

# Apply the function to group each date into a bin with other dates within 3 days of it
start_bins = create_date_bin_series(df['Start_Date'])
end_bins = create_date_bin_series(df['End_Date'])

# Group by the new bin keys
df['fruit_group'] = df.groupby(['Fruit', 'Location', start_bins, end_bins]).ngroup()

# Print the table sorted by these new groups
print(df.sort_values('fruit_group'))
# You can use the new fruit_group column to filter, agg, etc.
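As a hedged follow-up sketch (my addition, not part of the answer), the question's "more than one unique ID" filter could then be applied like this:

# Keep only fruit_groups that contain more than one unique ID
# (ID is the index here, so reset it first)
counts = df.reset_index().groupby('fruit_group')['ID'].nunique()
print(df[df['fruit_group'].isin(counts[counts > 1].index)])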

Related

Write text in a column based on ascending dates. Pandas Python

There are three distinct dates in a df Date column, sorted in ascending order. How can I write the text 'Short' for the nearest date, 'Mid' for the next date, and 'Long' for the farthest date in a new column adjacent to the Date column? I.e. 2021-04-23 = Short, 2021-05-11 = Mid and 2021-10-08 = Long.
data = {"product_name": ["Keyboard", "Mouse", "Monitor", "CPU", "CPU", "Speakers"],
        "Unit_Price": [500, 200, 5000.235, 10000.550, 10000.550, 250.50],
        "No_Of_Units": [5, 5, 10, 20, 20, 8],
        "Available_Quantity": [5, 6, 10, 1, 3, 2],
        "Date": ['11-05-2021', '23-04-2021', '08-10-2021', '23-04-2021', '08-10-2021', '11-05-2021']}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df = df.sort_values(by='Date')
Convert with to_datetime, rank the dates, then map your values in the desired order:

df['New'] = (pd.to_datetime(df['Date']).rank(method='dense')
               .map(dict(enumerate(['Short', 'Mid', 'Long'], start=1))))
Output:
product_name Unit_Price No_Of_Units Available_Quantity Date New
1 Mouse 200.000 5 6 2021-04-23 Short
3 CPU 10000.550 20 1 2021-04-23 Short
0 Keyboard 500.000 5 5 2021-05-11 Mid
5 Speakers 250.500 8 2 2021-05-11 Mid
2 Monitor 5000.235 10 10 2021-10-08 Long
4 CPU 10000.550 20 3 2021-10-08 Long
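Since there are exactly three distinct dates, an equivalent sketch (my illustration, not part of the answer) builds the mapping from the sorted unique dates directly:

# Map each of the three sorted unique dates to its label
mapping = pd.Series(['Short', 'Mid', 'Long'], index=sorted(df['Date'].unique()))
df['New'] = df['Date'].map(mapping)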

How to select rows with max values in categories?

I would like to aggregate over each ID key and select the rows with max(Day).
ID   col1  col2  month  Day
AI1  5     2     janv   15
AI2  6     0     Dec    16
AI1  1     7     March  16
AI3  9     4     Nov    18
AI2  3     20    Fev    20
AI3  10    8     June   06
Desired result:

ID   col1  col2  month  Day
AI1  1     7     March  16
AI2  3     20    Fev    20
AI3  9     4     Nov    18
The only solution that comes to my mind is to:
1. Get the highest day for each ID (using groupBy)
2. Append the value of the highest day to each line (with matching ID) using join
3. Apply a simple filter where the values of the two columns match
# select the max value for each of the ID
maxDayForIDs = df.groupBy("ID").max("day").withColumnRenamed("max(day)", "maxDay")
# now add the max value of the day for each line (with matching ID)
df = df.join(maxDayForIDs, "ID")
# keep only the lines where it matches "day" equals "maxDay"
df = df.filter(df.day == df.maxDay)
Usually this kind of operation is done using window functions like rank, dense_rank or row_number.
from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [('AI1', 5, 2, 'janv', '15'),
     ('AI2', 6, 0, 'Dec', '16'),
     ('AI1', 1, 7, 'March', '16'),
     ('AI3', 9, 4, 'Nov', '18'),
     ('AI2', 3, 20, 'Fev', '20'),
     ('AI3', 10, 8, 'June', '06')],
    ['ID', 'col1', 'col2', 'month', 'Day']
)
w = W.partitionBy('ID').orderBy(F.desc('Day'))
df = df.withColumn('_rn', F.row_number().over(w))
df = df.filter('_rn=1').drop('_rn')
df.show()
# +---+----+----+-----+---+
# | ID|col1|col2|month|Day|
# +---+----+----+-----+---+
# |AI1| 1| 7|March| 16|
# |AI2| 3| 20| Fev| 20|
# |AI3| 9| 4| Nov| 18|
# +---+----+----+-----+---+
Make it simple:

from pyspark.sql.functions import col, first

# w is the window defined above: W.partitionBy('ID').orderBy(F.desc('Day'))
new = (df.withColumn('max', first('Day').over(w))  # order by Day descending and keep the first value per group
         .where(col('Day') == col('max'))          # filter where Day equals max
         .drop('max'))                             # drop the helper column
new.show()
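A variant sketch of the same filter without the ordered window, using max over a plain partition (my illustration; note Day is a string in this sample, so the comparison is lexicographic, which happens to work for these two-digit days):

from pyspark.sql import functions as F, Window as W

w2 = W.partitionBy('ID')  # no ordering needed for max
(df.withColumn('maxDay', F.max('Day').over(w2))
   .where(F.col('Day') == F.col('maxDay'))
   .drop('maxDay')
   .show())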

how to create monthly and season 24 hours average table using pandas

I have a dataframe with 2 columns, Date and LMP, and there are 8760 rows in total. This is the dummy dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': pd.date_range('2023-01-01 00:00', '2023-12-31 23:00', freq='1H'), 'LMP': np.random.randint(10, 20, 8760)})
I extracted the month from the date and then created the season column for the specific dates, like this:

df['month'] = pd.DatetimeIndex(df['Date']).month
season = []
for i in df['month']:
    if i <= 2 or i == 12:
        season.append('Winter')
    elif 2 < i <= 5:
        season.append('Spring')
    elif 5 < i <= 8:
        season.append('Summer')
    else:
        season.append('Autumn')
df['Season'] = season
df2 = df.groupby(['month']).mean()
df3 = df.groupby(['Season']).mean()
print(df2['LMP'])
print(df3['LMP'])
Output:
month
1 20.655113
2 20.885532
3 19.416946
4 22.025248
5 26.040606
6 19.323863
7 51.117965
8 51.434093
9 21.404680
10 14.701989
11 20.009590
12 38.706160
Season
Autumn 18.661426
Spring 22.499365
Summer 40.856845
Winter 26.944382
But I want the output to be a 24-hour average, both monthly and seasonal.
Desired output: a seasonal 24-hour average table and a monthly 24-hour average table.
Note: in the monthly 24-hour average table the columns are the months (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) and the rows are the hours (starting from 0).
Can anyone help?
Try:

df['hour'] = pd.DatetimeIndex(df['Date']).hour
dft = df[['Season', 'hour', 'LMP']]
dftg = dft.groupby(['hour', 'Season'])['LMP'].mean()
dftg.reset_index().pivot(index='hour', columns='Season')
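The monthly 24-hour table follows the same pattern (a sketch of mine, assuming the month column created earlier):

# Average LMP per (hour, month), then pivot months into columns
dftm = df.groupby(['hour', 'month'])['LMP'].mean().reset_index()
print(dftm.pivot(index='hour', columns='month', values='LMP'))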

how to perform an operation similar to group by on the first index of a multi indexed dataframe

The code to generate a sample dataframe is as follows:

fruits = pd.DataFrame()
fruits['month'] = ['jan', 'feb', 'feb', 'march', 'jan', 'april', 'april', 'june', 'march', 'march', 'june', 'april']
fruits['fruit'] = ['apple', 'orange', 'pear', 'orange', 'apple', 'pear', 'cherry', 'pear', 'orange', 'cherry', 'apple', 'cherry']
fruits['price'] = [30, 20, 40, 25, 30, 45, 60, 45, 25, 55, 37, 60]
ind = fruits.index
fruits_grp = fruits.set_index(['month', ind], drop=False)
The output dataframe should look something like this:
fruits_new1 = pd.DataFrame()
fruits_new1['month'] = ['jan', 'jan', 'feb', 'feb', 'march', 'march', 'march', 'apr', 'apr', 'apr', 'jun', 'jun']
fruits_new1['fruit'] = ['apple', 'apple', 'orange', 'pear', 'orange', 'orange', 'cherry', 'pear', 'cherry', 'cherry', 'pear', 'apple']
fruits_new1['price'] = [30, 30, 20, 40, 25, 25, 55, 45, 60, 60, 45, 37]
ind1 = fruits_new1.index
fruits_grp1 = fruits_new1.set_index(['month', ind1], drop=False)
fruits_grp1
Thank you
Use:

d = {'Jan': 0, 'Feb': 1, 'Mar': 2, 'Apr': 3, 'May': 4, 'Jun': 5, 'Jul': 6, 'Aug': 7, 'Sep': 8, 'Oct': 9, 'Nov': 10, 'Dec': 11}
idx = fruits_grp['month'].str.title().str[:3].map(d).sort_values().index
fruits_grp = fruits_grp.reindex(idx)
fruits_grp['s'] = list(range(len(fruits_grp)))
fruits_grp = fruits_grp.set_index('s', append=True).droplevel(1).rename_axis(index=['month', None])
Update:
Sample dataframe:

fruits = pd.DataFrame()
fruits['month'] = [1, 2, 2, 3, 1, 4, 4, 6, 3, 3, 6, 4]
fruits['fruit'] = ['apple', 'orange', 'pear', 'orange', 'apple', 'pear', 'cherry', 'pear', 'orange', 'cherry', 'apple', 'cherry']
fruits['price'] = [30, 20, 40, 25, 30, 45, 60, 45, 25, 55, 37, 60]
ind = fruits.index
fruits_grp = fruits.set_index(['month', ind], drop=False)

Then simply use:

idx = fruits_grp['month'].sort_values().index
fruits_grp = fruits_grp.reindex(idx)
fruits_grp['s'] = list(range(len(fruits_grp)))
fruits_grp = fruits_grp.set_index('s', append=True).droplevel(1).rename_axis(index=['month', None])
fruits = pd.DataFrame()
fruits['month'] = ['jan', 'feb', 'feb', 'mar', 'jan', 'apr', 'apr', 'jun', 'mar', 'mar', 'jun', 'apr']
fruits['fruit'] = ['apple', 'orange', 'pear', 'orange', 'apple', 'pear', 'cherry', 'pear', 'orange', 'cherry', 'apple', 'cherry']
fruits['price'] = [30, 20, 40, 25, 30, 45, 60, 45, 25, 55, 37, 60]
fruits["month"] = fruits["month"].str.capitalize()
fruits["month"] = pd.to_datetime(fruits.month, format='%b', errors='coerce').dt.month
fruits = fruits.sort_values(by="month")
fruits["month"] = pd.to_datetime(fruits['month'], format='%m').dt.strftime('%b')
ind1 = fruits.index
fruits_grp1 = fruits.set_index(['month', ind1], drop=False)
print(fruits_grp1)
Thank you so much for all the answers. I've figured out that sort_values() can make this happen. The reproducible code is as follows:

fruit_grp_srt = fruits_grp.sort_values(by='month')

But this sorts the rows in alphabetical order, not in the original order of the 1st index. Still looking for a better solution, thank you.
To me, it looks like a simple sorting by month. First, you need to eliminate the month column (as there is already a month level in the index), followed by a reset_index:

del fruits_grp['month']
df = fruits_grp.reset_index()

Then, it is important to set the months as an ordered categorical datatype and define the custom order:

df.month = df.month.astype('category')
df.month = df.month.cat.reorder_categories(['jan', 'feb', 'march', 'april', 'june'])

Now, it is simply a matter of sorting by month:

df.sort_values(by='month')
Output
month level_1 fruit price
0 jan 0 apple 30
4 jan 4 apple 30
1 feb 1 orange 20
2 feb 2 pear 40
3 march 3 orange 25
8 march 8 orange 25
9 march 9 cherry 55
5 april 5 pear 45
6 april 6 cherry 60
11 april 11 cherry 60
7 june 7 pear 45
10 june 10 apple 37
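A sketch that avoids hard-coding the category list, using the first-appearance order of the month values instead (my addition; the key argument requires pandas >= 1.1):

# Rank each month by the order in which it first appears in the frame
order = {m: i for i, m in enumerate(dict.fromkeys(fruits_grp['month']))}
fruits_sorted = fruits_grp.sort_values('month', key=lambda s: s.map(order))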

Apply rolling function to groupby over several columns

I'd like to apply rolling functions to a dataframe grouped by two columns, with repeated date entries. Specifically, with both "freq" and "window" given as date offsets, not simply ints.
In principle, I'm trying to combine the methods from How to apply rolling functions in a group by object in pandas and pandas rolling sum of last five minutes.
Input
Here is a sample of the data with a single id=33, although we expect several ids.
X = [{'date': '2017-02-05', 'id': 33, 'item': 'A', 'points': 20},
     {'date': '2017-02-05', 'id': 33, 'item': 'B', 'points': 10},
     {'date': '2017-02-06', 'id': 33, 'item': 'B', 'points': 10},
     {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
     {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
     {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
     {'date': '2017-02-13', 'id': 33, 'item': 'A', 'points': 4}]
# df = pd.DataFrame(X) and reindex df to pd.to_datetime(df['date'])
df
id item points
date
2017-02-05 33 A 20
2017-02-05 33 B 10
2017-02-06 33 B 10
2017-02-11 33 A 1
2017-02-11 33 A 1
2017-02-11 33 A 1
2017-02-13 33 A 4
Goal
Sample each 'id' every 2 days (freq='2d') and return the sum of total points for each item over the previous three days (window='3D'), end-date inclusive.
Desired Output
id A B
date
2017-02-05 33 20 10
2017-02-07 33 20 30
2017-02-09 33 0 10
2017-02-11 33 3 0
2017-02-13 33 7 0
E.g. on the right-inclusive end-date 2017-02-13, we sample the 3-day period 2017-02-11 to 2017-02-13. In this period, id=33 had a sum of A points equal to 1+1+1+4 = 7
Attempts
An attempt at groupby with pd.rolling_sum as follows didn't work, due to the repeated dates:
df.groupby(['id', 'item'])['points'].apply(pd.rolling_sum, freq='4D', window=3)
ValueError: cannot reindex from a duplicate axis
Also note that, per the documentation (http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_apply.html), 'window' is an int giving the number of observations in the sample period, not the number of days to sample.
We can also try resampling and using last; however, the desired look-back of 3 days doesn't seem to be used:

df.groupby(['id', 'item'])['points'].resample('2D', label='right', closed='right').\
    apply(lambda x: x.last('3D').sum())

id  item  date
33  A     2017-02-05    20
          2017-02-07     0
          2017-02-09     0
          2017-02-11     3
          2017-02-13     4
    B     2017-02-05    10
          2017-02-07    10
Of course, setting up a loop over unique id's ID, selecting df_id = df[df['id']==ID], and summing over the periods does work, but it is computationally intensive and doesn't exploit groupby's nice vectorization.
Thanks to @jezrael for the good suggestions so far.
Notes
Pandas version = 0.20.1
I'm a little confused as to why the documentation on rolling() (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html) suggests that the "window" parameter can be an int or an offset, yet on attempting df.rolling(window='3D', ...) I get ValueError("window must be an integer").
It appears that the above documentation is not consistent with the latest code for rolling's window in ./core/window.py (https://github.com/pandas-dev/pandas/blob/master/pandas/core/window.py):

elif not is_integer(self.window):
    raise ValueError("window must be an integer")
1. It's easiest to handle resample and rolling with date frequencies when we have a single-level datetime index.
2. However, I can't pivot/unstack appropriately without dealing with the duplicate A/Bs, so I groupby and sum.
3. I unstack one level, date, so I can fill_value=0. Currently, I can't fill_value=0 when I unstack more than one level at a time. I make up for it with a transpose T.
4. Now that I've got a single level in the index, I reindex with a date range from the min to max values in the index.
5. Finally, I do a rolling 3-day sum and resample that result every 2 days with resample.
6. I clean this up with a bit of renaming of indices and one more pivot.
s = df.set_index(['id', 'item'], append=True).points
s = s.groupby(level=['date', 'id', 'item']).sum()
d = s.unstack('date', fill_value=0).T
tidx = pd.date_range(d.index.min(), d.index.max())
d = d.reindex(tidx, fill_value=0)
d1 = d.rolling('3D').sum().resample('2D').first().astype(d.dtypes).stack(0)
d1 = d1.rename_axis(['date', 'id']).rename_axis(None, 1)
print(d1)
A B
date id
2017-02-05 33 20 10
2017-02-07 33 20 20
2017-02-09 33 0 0
2017-02-11 33 3 0
2017-02-13 33 7 0
df = pd.DataFrame(X)

# group sum by day
df = df.groupby(['date', 'id', 'item'])['points'].sum().reset_index().sort_values(['date', 'id', 'item'])

# convert the index to a datetime index
df = df.set_index('date')
df.index = pd.DatetimeIndex(df.index)

# rolling sum over a 3D window
df['pointsum'] = df.groupby(['id', 'item']).transform(lambda x: x.rolling(window='3D').sum())

# reshape the dataframe
df = df.reset_index().set_index(['date', 'id', 'item'])['pointsum'].unstack().reset_index().set_index('date').fillna(0)
df