populate new rows by comparing two dataframes - pandas

I have two dataframes:
import pandas as pd

df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '3', '4', '4'],
                   'ward': ['icu', 'surgery', 'icu', 'neurology', 'neurology', 'obstetrics', 'OPD', 'surgery'],
                   'start_date': ['2016-10-22 18:19:19', '2016-10-24 10:20:00', '2016-10-24 12:41:30', '2016-11-09 19:41:30', '2016-11-09 23:20:00', '2016-11-08 09:45:00', '2016-10-15 09:15:00', '2016-10-15 12:15:01'],
                   'end_date': ['2016-10-24 10:10:19', '2016-10-24 12:40:30', '2016-10-26 11:15:00', '2016-11-09 22:11:00', '2016-11-11 13:30:00', '2016-11-09 07:25:00', '2016-10-15 12:15:00', '2016-10-17 17:25:00']})
df1 = pd.DataFrame({'ID': ['1', '2', '4'],
                    'ward': ['radiology', 'rehabilitation', 'radiology'],
                    'date': ['2016-10-23 10:50:00', '2016-11-24 10:20:00', '2016-10-15 18:41:30']})
I want to merge the data from df1 into df by matching on ID and checking whether the date in df1 falls between the start_date and end_date of a row in df. If both conditions match, I would like to insert a new row (taken from df1) into df for that specific ID. Where I insert the new row, I would also like to adjust the date/times on the previous and the next row.
What I want is the following as an end result:
ID ward start_date end_date
0 1 icu 2016-10-22 18:19:19 2016-10-23 10:50:00
1 1 radiology 2016-10-23 10:50:00 2016-10-23 10:50:00
2 1 icu 2016-10-23 10:50:00 2016-10-24 10:10:19
3 1 surgery 2016-10-24 10:20:00 2016-10-24 12:40:30
4 1 icu 2016-10-24 12:41:30 2016-10-26 11:15:00
5 2 neurology 2016-11-09 19:41:30 2016-11-09 22:11:00
6 2 neurology 2016-11-09 23:20:00 2016-11-11 13:30:00
7 3 obstetrics 2016-11-08 09:45:00 2016-11-09 07:25:00
8 4 OPD 2016-10-15 09:15:00 2016-10-15 12:15:00
9 4 surgery 2016-10-15 12:15:01 2016-10-15 18:41:30
10 4 radiology 2016-10-15 18:41:30 2016-10-15 18:41:30
11 4 surgery 2016-10-15 18:41:30 2016-10-17 17:25:00
In this example, ID 1 and ID 4 meet the condition in both dataframes. Taking ID 1: initially ID 1 moved icu -> surgery -> icu, but after comparing and inserting the new row, the final data shows ID 1 moving icu -> radiology -> icu -> surgery -> icu. ID 1 now has five rows instead of three, and the start_date and end_date are updated in the affected rows.
The dataset (df) is large, about 1 million rows, and I do not know which method to use to get the right result efficiently. Any help will be appreciated.

By interpreting the guidance from here, I have the following method:
import pandas as pd

df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '3', '4', '4'],
                   'ward': ['icu', 'surgery', 'icu', 'neurology', 'neurology', 'obstetrics', 'OPD', 'surgery'],
                   'start_date': ['2016-10-22 18:19:19', '2016-10-24 10:20:00', '2016-10-24 12:41:30', '2016-11-09 19:41:30', '2016-11-09 23:20:00', '2016-11-08 09:45:00', '2016-10-15 09:15:00', '2016-10-15 12:15:01'],
                   'end_date': ['2016-10-24 10:10:19', '2016-10-24 12:40:30', '2016-10-26 11:15:00', '2016-11-09 22:11:00', '2016-11-11 13:30:00', '2016-11-09 07:25:00', '2016-10-15 12:15:00', '2016-10-17 17:25:00']})
df1 = pd.DataFrame({'ID': ['1', '2', '4'],
                    'ward': ['radiology', 'rehabilitation', 'radiology'],
                    'date': ['2016-10-23 10:50:00', '2016-11-24 10:20:00', '2016-10-15 18:41:30']})
# Converting str datetime to datetime objects
df.start_date = pd.to_datetime(df.start_date)
df.end_date = pd.to_datetime(df.end_date)
df1.date = pd.to_datetime(df1.date)
# Change the index to intervals
df_temp = df.copy()
df_temp.index = pd.IntervalIndex.from_arrays(df_temp['start_date'],df_temp['end_date'],closed='both')
# Find the interval to split
def find_interval(row):
    try:
        return df_temp.loc[row.date].loc[df_temp.ID == row.ID].iloc[0]
    except KeyError:
        # This date does not fall within any interval in df for this ID;
        # return an empty Series so apply() still expands into a DataFrame,
        # whose all-NaN rows dropna() then removes.
        return pd.Series(dtype='object')

# These are all the rows to be altered:
to_remove = df1.apply(find_interval, axis=1).dropna()
"""
to_remove
ID ward start_date end_date
0 1 icu 2016-10-22 18:19:19 2016-10-24 10:10:19
2 4 surgery 2016-10-15 12:15:01 2016-10-17 17:25:00 """
# Create 3 new rows for every match
def new_rows(row):
    try:
        # Build the new rows by taking information from the existing row
        existing = df_temp.loc[row.date].loc[df_temp.ID == row.ID].iloc[0]
        out = pd.DataFrame(dict(
            ID=[row.ID] * 3,
            ward=[existing.ward, row.ward, existing.ward],
            start_date=[existing.start_date, row.date, row.date],
            end_date=[row.date, row.date, existing.end_date]
        ))
        return out
    except KeyError:
        # No matching interval: pd.concat() silently drops the None
        return None

to_add = pd.concat(df1.apply(new_rows, axis=1).values)
"""
to_add
ID ward start_date end_date
0 1 icu 2016-10-22 18:19:19 2016-10-23 10:50:00
1 1 radiology 2016-10-23 10:50:00 2016-10-23 10:50:00
2 1 icu 2016-10-23 10:50:00 2016-10-24 10:10:19
0 4 surgery 2016-10-15 12:15:01 2016-10-15 18:41:30
1 4 radiology 2016-10-15 18:41:30 2016-10-15 18:41:30
2 4 surgery 2016-10-15 18:41:30 2016-10-17 17:25:00 """
# Remove the 'to_remove' rows: concatenating them onto df makes them exact
# duplicates, which keep=False then drops entirely
new = pd.concat([df, to_remove]).drop_duplicates(keep=False)
# Add the 'to_add'
new = pd.concat([new, to_add])
# Sort the finished dataframe
new = new.sort_values(['ID', 'start_date']).reset_index(drop=True)
new
ID ward start_date end_date
0 1 icu 2016-10-22 18:19:19 2016-10-23 10:50:00
1 1 radiology 2016-10-23 10:50:00 2016-10-23 10:50:00
2 1 icu 2016-10-23 10:50:00 2016-10-24 10:10:19
3 1 surgery 2016-10-24 10:20:00 2016-10-24 12:40:30
4 1 icu 2016-10-24 12:41:30 2016-10-26 11:15:00
5 2 neurology 2016-11-09 19:41:30 2016-11-09 22:11:00
6 2 neurology 2016-11-09 23:20:00 2016-11-11 13:30:00
7 3 obstetrics 2016-11-08 09:45:00 2016-11-09 07:25:00
8 4 OPD 2016-10-15 09:15:00 2016-10-15 12:15:00
9 4 surgery 2016-10-15 12:15:01 2016-10-15 18:41:30
10 4 radiology 2016-10-15 18:41:30 2016-10-15 18:41:30
11 4 surgery 2016-10-15 18:41:30 2016-10-17 17:25:00
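Since the question mentions roughly 1 million rows, a merge-based variant that avoids the Python-level apply may scale better. This is only a sketch under assumptions: df keeps its default RangeIndex, datetime columns are already converted, and each stay matches at most one visit in df1; the ward_visit column name comes from the merge suffix and is not part of the original data.
# Pair every stay with every visit of the same ID, then keep stays containing the visit date
m = df.reset_index().rename(columns={'index': 'row'}).merge(df1, on='ID', suffixes=('', '_visit'))
hits = m[(m.date >= m.start_date) & (m.date <= m.end_date)]
# Split each matched stay into before / visit / after rows, all vectorised
before = hits.assign(end_date=hits.date)
during = hits.assign(ward=hits.ward_visit, start_date=hits.date, end_date=hits.date)
after = hits.assign(start_date=hits.date)
to_add = pd.concat([before, during, after])[['ID', 'ward', 'start_date', 'end_date']]
# Drop the stays that were split and append their replacements
new = (pd.concat([df.drop(hits.row), to_add])
       .sort_values(['ID', 'start_date'])
       .reset_index(drop=True))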

Related

Merging two series with alternating dates into one grouped Pandas dataframe

Given are two series, like this:
#period1
DATE
2020-06-22 310.62
2020-06-26 300.05
2020-09-23 322.64
2020-10-30 326.54
#period2
DATE
2020-06-23 312.05
2020-09-02 357.70
2020-10-12 352.43
2021-01-25 384.39
These two series are correlated to each other, i.e. they each mark either the beginning or the end of a date period. The first series marks the end of a period1 period, the second series marks the end of period2 period. The end of a period2 period is at the same time also the start of a period1 period, and vice versa.
I've been looking for a way to aggregate these periods as date ranges, but apparently this is not easily possible with Pandas dataframes. Suggestions extremely welcome.
In the easiest case, the output layout should reflect the end dates of periods, which period type it was, and the amount of change between start and stop of the period.
Explicit output:
DATE CHG PERIOD
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.0 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
However, if there is any possibility of actually grouping by a date range consisting of start AND stop date, that would be much more favorable.
Thank you!
p1 = pd.DataFrame(data={'Date': ['2020-06-22', '2020-06-26', '2020-09-23', '2020-10-30'], 'val':[310.62, 300.05, 322.64, 326.54]})
p2 = pd.DataFrame(data={'Date': ['2020-06-23', '2020-09-02', '2020-10-12', '2021-01-25'], 'val':[312.05, 357.7, 352.43, 384.39]})
p1['period'] = 1
p2['period'] = 2
df = pd.concat([p1, p2]).sort_values('Date').reset_index(drop=True)
df['CHG'] = abs(df['val'].diff(periods=1))
df.drop('val', axis=1)
Output:
Date period CHG
0 2020-06-22 1 NaN
1 2020-06-23 2 1.43
2 2020-06-26 1 12.00
3 2020-09-02 2 57.65
4 2020-09-23 1 35.06
5 2020-10-12 2 29.79
6 2020-10-30 1 25.89
7 2021-01-25 2 57.85
EDIT: matching the format START - STOP - CHANGE - PERIOD
Starting from the above data frame:
df['Start'] = df.Date.shift(periods=1)
df.rename(columns={'Date': 'Stop'}, inplace=True)
df = df[['Start', 'Stop', 'CHG', 'period']]
df
Output:
Start Stop CHG period
0 NaN 2020-06-22 NaN 1
1 2020-06-22 2020-06-23 1.43 2
2 2020-06-23 2020-06-26 12.00 1
3 2020-06-26 2020-09-02 57.65 2
4 2020-09-02 2020-09-23 35.06 1
5 2020-09-23 2020-10-12 29.79 2
6 2020-10-12 2020-10-30 25.89 1
7 2020-10-30 2021-01-25 57.85 2
# If needed (here df1 and df2 are the period1 and period2 series, indexed by DATE):
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df = pd.concat([df1, df2], axis=1)
df.columns = ['start', 'stop']
# Each date appears in only one column; backfill to line the values up, then diff
df['CNG'] = df.bfill(axis=1)['start'].diff().abs()
df['PERIOD'] = 1
df.loc[df.stop.notna(), 'PERIOD'] = 2
df = df[['CNG', 'PERIOD']]
print(df)
Output:
CNG PERIOD
Date
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.00 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
2021-01-29 14.32 1
2021-02-12 22.57 2
2021-03-04 15.94 1
2021-05-07 45.42 2
2021-05-12 16.71 1
2021-09-02 47.78 2
2021-10-04 24.55 1
2021-11-18 41.09 2
2021-12-01 19.23 1
2021-12-10 20.24 2
2021-12-20 15.76 1
2022-01-03 22.73 2
2022-01-27 46.47 1
2022-02-09 26.30 2
2022-02-23 35.59 1
2022-03-02 15.94 2
2022-03-08 21.64 1
2022-03-29 45.30 2
2022-04-29 49.55 1
2022-05-04 17.06 2
2022-05-12 36.72 1
2022-05-17 15.98 2
2022-05-19 18.86 1
2022-06-02 27.93 2
2022-06-17 51.53 1
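If grouping by an explicit date range (start AND stop) is still wanted, one option is to move the Start/Stop columns of the edited frame into a pandas IntervalIndex. A sketch, assuming the START-STOP frame from the EDIT above is named df; the first row, whose Start is NaN, is dropped:
d = df.dropna(subset=['Start']).copy()
d.index = pd.IntervalIndex.from_arrays(pd.to_datetime(d['Start']),
                                       pd.to_datetime(d['Stop']),
                                       closed='right')
# Rows can now be looked up by any date inside a period:
print(d.loc[pd.Timestamp('2020-09-10')])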

How to calculate monthly normals?

I have this df:
CODE TMAX TMIN PP
DATE
1991-01-01 000130 32.6 23.4 0.0
1991-01-02 000130 31.2 22.4 0.0
1991-01-03 000130 32.0 NaN 0.0
1991-01-04 000130 32.2 23.0 0.0
1991-01-05 000130 30.5 22.0 0.0
... ... ... ...
2020-12-27 158328 NaN NaN NaN
2020-12-28 158328 NaN NaN NaN
2020-12-29 158328 NaN NaN NaN
2020-12-30 158328 NaN NaN NaN
2020-12-31 158328 NaN NaN NaN
I have data for 30 years (1991-2020) for each CODE, and I want to calculate monthly normals of TMAX, TMIN and PP. For TMAX and TMIN I should calculate the average for every month: if January has 31 days, I take the mean of those 31 values and get one value for January 1991, one for January 1992, etc. So I will have 30 Januarys (January 1991, January 1992, ..., January 2020), 30 Februarys, and so on. After this I should average every group of months (Januarys with Januarys, Februarys with Februarys, etc.), so I will have 12 values (one for every month). Example:
(January1991 + January1992 + ..... + January 2020) /30
(February1991 + February1992 + ..... + February 2020) /30
.... same for every group of months.
So I'm using this code, but I don't know if it's OK:
from datetime import date
normalstemp=df[['CODE','TMAX','TMIN']].groupby([df.CODE, df.index.month]).mean().round(1)
For PP (precipitation) I should instead sum the PP values within every month: if January has 31 days, I sum all of their values and get one value for January 1991, one for January 1992, etc. So I will again have 30 Januarys (January 1991, ..., January 2020), 30 Februarys (February 1991, ..., February 2020), and so on. After this I should average every group of months (Januarys with Januarys, Februarys with Februarys, etc.), so I will have 12 values (one for every month, the same as TMAX and TMIN).
Example:
(January1991 + January1992 + ..... + January 2020) /30
(February1991 + February1992 + ..... + February 2020) /30
.... same for every group of months.
So I'm using this code, but I know it isn't correct, because I'm not getting the mean of the Januarys, Februarys, etc.:
normalspp=df[['CODE','PP']].groupby([df.CODE, df.index.month]).sum().round(1)
I only have basic knowledge of Python, so I would appreciate it if you can help me.
Thanks in advance.
Ver 2: Average by Year-Month and by Month
import pandas as pd
import numpy as np

x = pd.date_range(start='1/1/1991', end='12/31/2020', freq='D')
df = pd.DataFrame({'Date': x.tolist()*2,
                   'Code': ['000130']*10958 + ['158328']*10958,
                   'TMAX': np.random.randint(6, 10, size=21916),
                   'TMIN': np.random.randint(1, 5, size=21916)
                   })
# Create a Month column to get Average by Month for all years
df['Month'] = df.Date.dt.month
# Create a Year-Month column to get Average of each Month within the Year
df['Year_Mon'] = df.Date.dt.strftime('%Y-%m')
# Print the Average of each Month within each Year for each code
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean())
print (df.groupby(['Code','Year_Mon'])['TMIN'].mean())
# Print the Average of each Month irrespective of the year (for each code)
print (df.groupby(['Code','Month'])['TMAX'].mean())
print (df.groupby(['Code','Month'])['TMIN'].mean())
If you want to give a name to the TMAX average value, you can add reset_index and rename the column. Here's code to do that:
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean().reset_index().rename(columns={'TMAX':'TMAX_Avg'}))
The outputs of the groupby calls above will be:
Average of TMAX for each Year-Month for each Code
Code Year_Mon
000130 1991-01 7.225806
1991-02 7.678571
1991-03 7.354839
1991-04 7.500000
1991-05 7.516129
...
158328 2020-08 7.387097
2020-09 7.300000
2020-10 7.516129
2020-11 7.500000
2020-12 7.451613
Name: TMAX, Length: 720, dtype: float64
Average of TMIN for each Year-Month for each Code
Code Year_Mon
000130 1991-01 2.419355
1991-02 2.571429
1991-03 2.193548
1991-04 2.366667
1991-05 2.451613
...
158328 2020-08 2.451613
2020-09 2.566667
2020-10 2.612903
2020-11 2.666667
2020-12 2.580645
Name: TMIN, Length: 720, dtype: float64
Average of TMAX for each Month for each Code (all years combined)
Code Month
000130 1 7.540860
2 7.536557
3 7.482796
4 7.486667
5 7.444086
6 7.570000
7 7.507527
8 7.529032
9 7.501111
10 7.401075
11 7.482222
12 7.517204
158328 1 7.532258
2 7.563679
3 7.490323
4 7.555556
5 7.500000
6 7.497778
7 7.545161
8 7.483871
9 7.526667
10 7.529032
11 7.547778
12 7.524731
Name: TMAX, dtype: float64
Average of TMIN for each Month for each Code (all years combined): analogous output, with TMIN-scale means around 2.5 (TMIN is drawn from randint(1, 5)).
Ver 1: Average by Year and Month for each Code
Here is one way to do this.
You can create a Year-Month column, then get the average of TMAX, TMIN, and PP for each month within the year by doing a groupby on ('Code', 'Year_Mon').
See code for more details.
import pandas as pd
import numpy as np

# create a range of dates from 1/1/2018 thru 12/31/2020 for each day
x = pd.date_range(start='1/1/2018', end='12/31/2020', freq='D')
# create a dataframe with the date ranges x 2 for two codes
# TMIN is a random value from 1 thru 5 - you can put your actual data here
# TMAX is a random value from 6 thru 10 - you can put your actual data here
df = pd.DataFrame({'Date': x.tolist()*2,
                   'Code': ['000130']*1096 + ['158328']*1096,
                   'TMAX': np.random.randint(6, 10, size=2192),
                   'TMIN': np.random.randint(1, 5, size=2192)
                   })
# Create a Year-Month column using df.Date.dt.strftime
df['Year_Mon'] = df.Date.dt.strftime('%Y-%m')
# Calculate the Average of TMAX and TMIN using groupby Code and Year_Mon
df['TMAX_Avg'] = df.groupby(['Code','Year_Mon'])['TMAX'].transform('mean')
df['TMIN_Avg'] = df.groupby(['Code','Year_Mon'])['TMIN'].transform('mean')
The output of this will be:
Date Code TMAX TMIN Year_Mon TMAX_Avg TMIN_Avg
0 2018-01-01 000130 8 2 2018-01 7.451613 2.129032
1 2018-01-02 000130 7 4 2018-01 7.451613 2.129032
2 2018-01-03 000130 9 2 2018-01 7.451613 2.129032
3 2018-01-04 000130 6 1 2018-01 7.451613 2.129032
4 2018-01-05 000130 9 4 2018-01 7.451613 2.129032
5 2018-01-06 000130 6 1 2018-01 7.451613 2.129032
6 2018-01-07 000130 9 2 2018-01 7.451613 2.129032
7 2018-01-08 000130 9 2 2018-01 7.451613 2.129032
8 2018-01-09 000130 7 2 2018-01 7.451613 2.129032
9 2018-01-10 000130 8 2 2018-01 7.451613 2.129032
10 2018-01-11 000130 8 3 2018-01 7.451613 2.129032
11 2018-01-12 000130 7 2 2018-01 7.451613 2.129032
12 2018-01-13 000130 7 1 2018-01 7.451613 2.129032
13 2018-01-14 000130 8 1 2018-01 7.451613 2.129032
14 2018-01-15 000130 7 3 2018-01 7.451613 2.129032
15 2018-01-16 000130 6 1 2018-01 7.451613 2.129032
16 2018-01-17 000130 6 3 2018-01 7.451613 2.129032
17 2018-01-18 000130 9 3 2018-01 7.451613 2.129032
18 2018-01-19 000130 7 2 2018-01 7.451613 2.129032
19 2018-01-20 000130 8 1 2018-01 7.451613 2.129032
20 2018-01-21 000130 9 4 2018-01 7.451613 2.129032
21 2018-01-22 000130 6 2 2018-01 7.451613 2.129032
22 2018-01-23 000130 9 4 2018-01 7.451613 2.129032
23 2018-01-24 000130 6 2 2018-01 7.451613 2.129032
24 2018-01-25 000130 8 3 2018-01 7.451613 2.129032
25 2018-01-26 000130 6 2 2018-01 7.451613 2.129032
26 2018-01-27 000130 8 1 2018-01 7.451613 2.129032
27 2018-01-28 000130 8 3 2018-01 7.451613 2.129032
28 2018-01-29 000130 6 1 2018-01 7.451613 2.129032
29 2018-01-30 000130 6 1 2018-01 7.451613 2.129032
30 2018-01-31 000130 8 1 2018-01 7.451613 2.129032
31 2018-02-01 000130 7 1 2018-02 7.250000 2.428571
32 2018-02-02 000130 6 2 2018-02 7.250000 2.428571
33 2018-02-03 000130 6 4 2018-02 7.250000 2.428571
34 2018-02-04 000130 8 3 2018-02 7.250000 2.428571
35 2018-02-05 000130 8 2 2018-02 7.250000 2.428571
36 2018-02-06 000130 6 3 2018-02 7.250000 2.428571
37 2018-02-07 000130 6 3 2018-02 7.250000 2.428571
38 2018-02-08 000130 7 1 2018-02 7.250000 2.428571
39 2018-02-09 000130 9 4 2018-02 7.250000 2.428571
40 2018-02-10 000130 8 2 2018-02 7.250000 2.428571
41 2018-02-11 000130 7 4 2018-02 7.250000 2.428571
42 2018-02-12 000130 8 1 2018-02 7.250000 2.428571
43 2018-02-13 000130 6 4 2018-02 7.250000 2.428571
44 2018-02-14 000130 6 1 2018-02 7.250000 2.428571
45 2018-02-15 000130 6 4 2018-02 7.250000 2.428571
46 2018-02-16 000130 8 2 2018-02 7.250000 2.428571
47 2018-02-17 000130 7 3 2018-02 7.250000 2.428571
48 2018-02-18 000130 9 3 2018-02 7.250000 2.428571
49 2018-02-19 000130 8 2 2018-02 7.250000 2.428571
If you want only the Code, Year-Month, and TMIN and TMAX values, you can do:
TMAX average for each month within the year:
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean())
Output will be:
Code Year_Mon
000130 2018-01 7.451613
2018-02 7.250000
2018-03 7.774194
2018-04 7.366667
2018-05 7.451613
...
158328 2020-08 7.935484
2020-09 7.666667
2020-10 7.548387
2020-11 7.333333
2020-12 7.580645
TMIN average for each month within the year:
print (df.groupby(['Code','Year_Mon'])['TMIN'].mean())
Output will be:
Code Year_Mon
000130 2018-01 2.129032
2018-02 2.428571
2018-03 2.451613
2018-04 2.500000
2018-05 2.677419
...
158328 2020-08 2.709677
2020-09 2.166667
2020-10 2.161290
2020-11 2.366667
2020-12 2.548387
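The question also asks about PP, which needs a monthly sum first and only then the 30-year average of those sums. A sketch along the same lines, assuming a hypothetical PP column added to the random test frame above (the question's real data has one):
# Monthly totals: one PP sum per Code and Year-Month
pp_monthly = df.groupby(['Code', df.Date.dt.to_period('M')])['PP'].sum().reset_index()
# Normals: average the ~30 monthly totals that share a calendar month
pp_normals = pp_monthly.groupby(['Code', pp_monthly.Date.dt.month])['PP'].mean().round(1)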

pandas: get range between date columns

I have a pandas DataFrame:
start_date finish_date progress_id
0 2018-06-23 08:28:50.681065+00 2018-06-23 08:28:52.439542+00 a387ab916f402cb3fbfffd29f68fd0ce
1 2019-03-18 14:23:17.328374+00 2019-03-18 14:54:50.979612+00 3b9dce04f32da32763124602557f92a3
2 2019-07-09 09:18:46.19862+00 2019-07-11 08:03:09.222385+00 73e17a05355852fe65b785c82c37d1ad
3 2018-07-27 15:39:17.666629+00 2018-07-27 16:13:55.086871+00 cc3eb34ae49c719648352c4175daee88
4 2019-04-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00 04ace4fe130d90c801e24eea13ee808e
I converted the columns to datetime.date because I don't need the time part in df:
df['start_date'] = pd.to_datetime(df['start_date']).dt.date
df['finish_date'] = pd.to_datetime(df['finish_date']).dt.date
So, I need a new column which will contain the year-month if start_date and finish_date share the same month, and the range between them if they differ. For example, with start_date = 06-2020 and finish_date = 08-2020, the result is [06-2020, 07-2020, 08-2020]. Then I need to explode the dataframe by that column.
I tried:
df['range'] = df.apply(lambda x: pd.date_range(x['start_date'], x['finish_date'], freq="M"), axis=1)
df = df.explode('range')
but as a result I got many NaTs in the column (date_range with freq="M" yields nothing when both dates fall inside the same month, so explode produces NaT for those rows).
Any solution will be great.
One alternative is the following. Assume you have the following dataframe, df:
start_date finish_date \
0 2018-06-23 08:28:50.681065+00 2018-06-23 08:28:52.439542+00
1 2019-03-18 14:23:17.328374+00 2019-03-18 14:54:50.979612+00
2 2019-07-09 09:18:46.19862+00 2019-07-11 08:03:09.222385+00
3 2018-07-27 15:39:17.666629+00 2018-07-27 16:13:55.086871+00
4 2019-04-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00
5 2019-05-24 18:42:40.272854+00 2019-10-24 18:44:57.507857+00
progress_id
0 a387ab916f402cb3fbfffd29f68fd0ce
1 3b9dce04f32da32763124602557f92a3
2 73e17a05355852fe65b785c82c37d1ad
3 cc3eb34ae49c719648352c4175daee88
4 04ace4fe130d90c801e24eea13ee808e
5 04ace4fe130d90c801e24eea13ee808e
It is the same one you shared, plus one row where the dates (year and month) differ.
Then applying this:
df['start_date'] = pd.to_datetime(df['start_date'],format='%Y-%m-%d')
df['finish_date'] = pd.to_datetime(df['finish_date'],format='%Y-%m-%d')
df['finish_M_Y'] = df['finish_date'].dt.strftime('%Y-%m')
df['Start_M_Y'] = df['start_date'].dt.strftime('%Y-%m')
def month_range(row):  # renamed from range() to avoid shadowing the builtin
    if row['Start_M_Y'] == row['finish_M_Y']:
        val = row['Start_M_Y']
    elif row['Start_M_Y'] != row['finish_M_Y']:
        val = pd.date_range(row['Start_M_Y'], row['finish_M_Y'], freq='M')
    else:
        val = -1
    return val

df['Range'] = df.apply(month_range, axis=1)
df.explode('Range').drop(['Start_M_Y', 'finish_M_Y'], axis=1)
gives you
start_date finish_date \
0 2018-06-23 08:28:50.681065+00:00 2018-06-23 08:28:52.439542+00:00
1 2019-03-18 14:23:17.328374+00:00 2019-03-18 14:54:50.979612+00:00
2 2019-07-09 09:18:46.198620+00:00 2019-07-11 08:03:09.222385+00:00
3 2018-07-27 15:39:17.666629+00:00 2018-07-27 16:13:55.086871+00:00
4 2019-04-24 18:42:40.272854+00:00 2019-04-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
progress_id Range
0 a387ab916f402cb3fbfffd29f68fd0ce 2018-06
1 3b9dce04f32da32763124602557f92a3 2019-03
2 73e17a05355852fe65b785c82c37d1ad 2019-07
3 cc3eb34ae49c719648352c4175daee88 2018-07
4 04ace4fe130d90c801e24eea13ee808e 2019-04
5 04ace4fe130d90c801e24eea13ee808e 2019-05-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-06-30 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-07-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-08-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-09-30 00:00:00
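Note that the exploded Range column above mixes plain 'YYYY-MM' strings with month-end Timestamps. A variant sketch that keeps a uniform 'YYYY-MM' format for every row, using pd.period_range in place of the helper function above:
df['Range'] = df.apply(
    lambda r: list(pd.period_range(r['start_date'], r['finish_date'], freq='M').strftime('%Y-%m')),
    axis=1)
df = df.explode('Range')
Because period_range is inclusive of both end months, a same-month pair yields exactly one label, so no NaT rows appear.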

Is there a way to group by month in Pandas starting at a specific day number?

I'm trying to group some data by month in Python, but I need the month to start at the 25th of each month. Is there a way to do that in Pandas?
For weeks there is a way of starting on Monday, Tuesday, ... but for months it's always the full month.
pd.Grouper(key='date', freq='M')
You could offset the dates by 24 days and groupby:
import numpy as np
import pandas as pd

np.random.seed(1)
dates = pd.date_range('2019-01-01', '2019-04-30', freq='D')
df = pd.DataFrame({'date': dates,
                   'val': np.random.uniform(0, 1, len(dates))})
# for groupby
s = df['date'].sub(pd.DateOffset(24))
(df.groupby([s.dt.year, s.dt.month], as_index=False)
.agg({'date':'min', 'val':'sum'})
)
gives
date val
0 2019-01-01 10.120368
1 2019-01-25 14.895363
2 2019-02-25 14.544506
3 2019-03-25 17.228734
4 2019-04-25 3.334160
Another example:
np.random.seed(1)
dates = pd.date_range('2019-01-20', '2019-01-30', freq='D')
df = pd.DataFrame({'date': dates,
                   'val': np.random.uniform(0, 1, len(dates))})
s = df['date'].sub(pd.DateOffset(24))
df['groups'] = df.groupby([s.dt.year, s.dt.month]).cumcount()
gives
date val groups
0 2019-01-20 0.417022 0
1 2019-01-21 0.720324 1
2 2019-01-22 0.000114 2
3 2019-01-23 0.302333 3
4 2019-01-24 0.146756 4
5 2019-01-25 0.092339 0
6 2019-01-26 0.186260 1
7 2019-01-27 0.345561 2
8 2019-01-28 0.396767 3
9 2019-01-29 0.538817 4
10 2019-01-30 0.419195 5
And you can see how the cumcount restarts at day 25.
I prepared the following test DataFrame:
Dat Val
0 2017-03-24 0
1 2017-03-25 0
2 2017-03-26 1
3 2017-03-27 0
4 2017-04-24 0
5 2017-04-25 0
6 2017-05-24 0
7 2017-05-25 2
8 2017-05-26 0
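(For reference, a snippet reconstructing this test frame from the table above:)
import pandas as pd

df = pd.DataFrame({
    'Dat': pd.to_datetime(['2017-03-24', '2017-03-25', '2017-03-26', '2017-03-27',
                           '2017-04-24', '2017-04-25', '2017-05-24', '2017-05-25',
                           '2017-05-26']),
    'Val': [0, 0, 1, 0, 0, 0, 0, 2, 0]})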
The first step is to compute a "shifted date" column:
df['Dat2'] = df.Dat + pd.DateOffset(days=-24)
The result is:
Dat Val Dat2
0 2017-03-24 0 2017-02-28
1 2017-03-25 0 2017-03-01
2 2017-03-26 1 2017-03-02
3 2017-03-27 0 2017-03-03
4 2017-04-24 0 2017-03-31
5 2017-04-25 0 2017-04-01
6 2017-05-24 0 2017-04-30
7 2017-05-25 2 2017-05-01
8 2017-05-26 0 2017-05-02
As you can see, the March dates in Dat2 begin with the original date 2017-03-25, and so on.
The value of 1 is in March (Dat2) and the value of 2 is in May (also Dat2).
Then, to compute e.g. a sum by month, we can run:
df.groupby(pd.Grouper(key='Dat2', freq='MS'))[['Val']].sum()
getting:
Val
Dat2
2017-02-01 0
2017-03-01 1
2017-04-01 0
2017-05-01 2
So we have the correct grouping:
1 is in March,
2 is in May.
The advantage over the other answer is that all dates land on the first day of a month, bearing in mind that e.g. 2017-03-01 in the result means the period from 2017-03-25 to 2017-04-24 (inclusive).
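If the group labels should instead show the real period starts (the 25th), the 24-day offset can simply be added back after grouping. A small sketch:
out = df.groupby(pd.Grouper(key='Dat2', freq='MS'))[['Val']].sum()
out.index = out.index + pd.DateOffset(days=24)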

How do I access only specific entries of a dataframe having date as index?

This is the tail of my DataFrame (around 1000 entries):
Open Close High Change mx_profitable
Date
2018-06-06 263.00 270.15 271.4 7.15 8.40
2018-06-08 268.95 273.00 273.9 4.05 4.95
2018-06-11 273.30 274.00 278.4 0.70 5.10
2018-06-12 274.00 282.85 284.4 8.85 10.40
I have to select the entries for certain dates only, for example the 25th of every month.
I think you need DatetimeIndex.day with boolean indexing:
df[df.index.day == 25]
Sample:
rng = pd.date_range('2017-04-03', periods=1000)
df = pd.DataFrame({'a': range(1000)}, index=rng)
print (df.head())
a
2017-04-03 0
2017-04-04 1
2017-04-05 2
2017-04-06 3
2017-04-07 4
df1 = df[df.index.day == 25]
print (df1.head())
a
2017-04-25 22
2017-05-25 52
2017-06-25 83
2017-07-25 113
2017-08-25 144
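If some months have no row exactly on the 25th (for trading data, the 25th can fall on a weekend or holiday), a hedged variant is to take the last available row on or before the 25th of each month:
# Keep rows up to the 25th, then take the latest such row per (year, month)
candidates = df[df.index.day <= 25]
picked = candidates.groupby([candidates.index.year, candidates.index.month]).tail(1)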