Merging two series with alternating dates into one grouped Pandas dataframe - pandas

Given are two series, like this:
#period1
DATE
2020-06-22 310.62
2020-06-26 300.05
2020-09-23 322.64
2020-10-30 326.54
#period2
DATE
2020-06-23 312.05
2020-09-02 357.70
2020-10-12 352.43
2021-01-25 384.39
These two series are correlated to each other, i.e. they each mark either the beginning or the end of a date period. The first series marks the end of a period1 period, the second series marks the end of period2 period. The end of a period2 period is at the same time also the start of a period1 period, and vice versa.
I've been looking for a way to aggregate these periods as date ranges, but apparently this is not easily possible with Pandas dataframes. Suggestions extremely welcome.
In the easiest case, the output layout should reflect the end dates of periods, which period type it was, and the amount of change between start and stop of the period.
Explicit output:
DATE CHG PERIOD
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.0 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
However, if there is any possibility of actually grouping by a date range consisting of start AND stop date, that would be much more favorable
Thank you!

p1 = pd.DataFrame(data={'Date': ['2020-06-22', '2020-06-26', '2020-09-23', '2020-10-30'], 'val':[310.62, 300.05, 322.64, 326.54]})
p2 = pd.DataFrame(data={'Date': ['2020-06-23', '2020-09-02', '2020-10-12', '2021-01-25'], 'val':[312.05, 357.7, 352.43, 384.39]})
p1['period'] = 1
p2['period'] = 2
df = p1.append(p2).sort_values('Date').reset_index(drop=True)
df['CHG'] = abs(df['val'].diff(periods=1))
df.drop('val', axis=1)
Output:
Date period CHG
0 2020-06-22 1 NaN
1 2020-06-23 2 1.43
2 2020-06-26 1 12.00
3 2020-09-02 2 57.65
4 2020-09-23 1 35.06
5 2020-10-12 2 29.79
6 2020-10-30 1 25.89
7 2021-01-25 2 57.85
EDIT: matching the format START - STOP - CHANGE - PERIOD
Starting from the above data frame:
df['Start'] = df.Date.shift(periods=1)
df.rename(columns={'Date': 'Stop'}, inplace=True)
df = df1[['Start', 'Stop', 'CHG', 'period']]
df
Output:
Start Stop CHG period
0 NaN 2020-06-22 NaN 1
1 2020-06-22 2020-06-23 1.43 2
2 2020-06-23 2020-06-26 12.00 1
3 2020-06-26 2020-09-02 57.65 2
4 2020-09-02 2020-09-23 35.06 1
5 2020-09-23 2020-10-12 29.79 2
6 2020-10-12 2020-10-30 25.89 1
7 2020-10-30 2021-01-25 57.85 2

# If needed:
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df = pd.concat([df1, df2], axis=1)
df.columns = ['start','stop']
df['CNG'] = df.bfill(axis=1)['start'].diff().abs()
df['PERIOD'] = 1
df.loc[df.stop.notna(), 'PERIOD'] = 2
df = df[['CNG', 'PERIOD']]
print(df)
Output:
CNG PERIOD
Date
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.00 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
2021-01-29 14.32 1
2021-02-12 22.57 2
2021-03-04 15.94 1
2021-05-07 45.42 2
2021-05-12 16.71 1
2021-09-02 47.78 2
2021-10-04 24.55 1
2021-11-18 41.09 2
2021-12-01 19.23 1
2021-12-10 20.24 2
2021-12-20 15.76 1
2022-01-03 22.73 2
2022-01-27 46.47 1
2022-02-09 26.30 2
2022-02-23 35.59 1
2022-03-02 15.94 2
2022-03-08 21.64 1
2022-03-29 45.30 2
2022-04-29 49.55 1
2022-05-04 17.06 2
2022-05-12 36.72 1
2022-05-17 15.98 2
2022-05-19 18.86 1
2022-06-02 27.93 2
2022-06-17 51.53 1

Related

filter data based on month start and month end

Given a dataframe with date column in this format.
Date Group
2020-05-18 1
2020-06-22 1
2019-07-11 1
2018-03-01 1
2021-01-21 2
2021-05-05 2
2021-09-11 2
And two strings;
Start = 2020-05 (indicating month start)
End = 2021-09 (indicating month end)
I want to filter out the data so that only the dates that fall within the start and end date are available in the dataframe.
Expected output:
Date Group
2020-05-18 1
2020-06-22 1
2021-01-21 2
2021-05-05 2
2021-09-11 2
# Creating dummy data
d = {'dt':['2020-05-18',
'2020-06-22',
'2019-07-11',
'2018-03-01',
'2021-01-21',
'2021-05-05',
'2021-09-11'],
'group':[1,1,1,1,2,2,2]}
dt_df = pd.DataFrame(data=d)
dt_df
dt_df['dt'] = pd.to_datetime(dt_df['dt'])
dt_df
Inital Input:
0 2020-05-18
1 2020-06-22
2 2019-07-11
3 2018-03-01
4 2021-01-21
5 2021-05-05
6 2021-09-11
Name: dt, dtype: datetime64[ns]
Start = '2020-05'
End = '2021-09'
Start = pd.to_datetime(Start)
End = pd.to_datetime(End)
End = End+np.timedelta64(1, 'M')
Use loc to select only dates between Start and End timestamp.
dt_df.loc[(dt_df['dt'] - Start >= np.timedelta64(0,'D')) & (dt_df['dt'] - End <= np.timedelta64(0, 'D'))]
Output:
dt group
0 2020-05-18 1
1 2020-06-22 1
4 2021-01-21 2
5 2021-05-05 2
6 2021-09-11 2

Pandas groupby and subtract each element of a column by element in nth row

I have a dataframe df as given below:
country_code count_date confirmed_cases
0 AFG 2020-09-13 38641.0
1 AFG 2020-09-12 38606.0
2 AFG 2020-09-11 38572.0
3 AFG 2020-09-10 38544.0
4 AFG 2020-09-09 38520.0
... ... ... ...
19521 ZWE 2020-06-03 206.0
19522 ZWE 2020-06-02 203.0
19523 ZWE 2020-06-01 178.0
19524 ZWE 2020-05-31 174.0
19525 ZWE 2020-05-30 149.0
After groupby country_code how do I create a new column which has confirmed_cases of each date subtracted by confirmed_cases n days earlier.
I tried
n = 7
df.groupby('country_code').confirmed_cases.transform(lambda x:x-x.iloc[::n])
which doesnt work.
You can do shift
n = 7
out = df['confirmed_cases'] - df.groupby('country_code').confirmed_cases.shift(n)
Update :
df.groupby('country_code').confirmed_cases.apply(lambda x:x-x.shift(n))

Is there a way of group by month in Pandas starting at specific day number?

I'm trying to group by month some data in python, but i need the month to start at the 25 of each month, is there a way to do that in Pandas?
For weeks there is a way of starting on Monday, Tuesday, ... But for months it's always full month.
pd.Grouper(key='date', freq='M')
You could offset the dates by 24 days and groupby:
np.random.seed(1)
dates = pd.date_range('2019-01-01', '2019-04-30', freq='D')
df = pd.DataFrame({'date':dates,
'val': np.random.uniform(0,1,len(dates))})
# for groupby
s = df['date'].sub(pd.DateOffset(24))
(df.groupby([s.dt.year, s.dt.month], as_index=False)
.agg({'date':'min', 'val':'sum'})
)
gives
date val
0 2019-01-01 10.120368
1 2019-01-25 14.895363
2 2019-02-25 14.544506
3 2019-03-25 17.228734
4 2019-04-25 3.334160
Another example:
np.random.seed(1)
dates = pd.date_range('2019-01-20', '2019-01-30', freq='D')
df = pd.DataFrame({'date':dates,
'val': np.random.uniform(0,1,len(dates))})
s = df['date'].sub(pd.DateOffset(24))
df['groups'] = df.groupby([s.dt.year, s.dt.month]).cumcount()
gives
date val groups
0 2019-01-20 0.417022 0
1 2019-01-21 0.720324 1
2 2019-01-22 0.000114 2
3 2019-01-23 0.302333 3
4 2019-01-24 0.146756 4
5 2019-01-25 0.092339 0
6 2019-01-26 0.186260 1
7 2019-01-27 0.345561 2
8 2019-01-28 0.396767 3
9 2019-01-29 0.538817 4
10 2019-01-30 0.419195 5
And you can see the how the cumcount restarts at day 25.
I prepared the following test DataFrame:
Dat Val
0 2017-03-24 0
1 2017-03-25 0
2 2017-03-26 1
3 2017-03-27 0
4 2017-04-24 0
5 2017-04-25 0
6 2017-05-24 0
7 2017-05-25 2
8 2017-05-26 0
The first step is to compute a "shifted date" column:
df['Dat2'] = df.Dat + pd.DateOffset(days=-24)
The result is:
Dat Val Dat2
0 2017-03-24 0 2017-02-28
1 2017-03-25 0 2017-03-01
2 2017-03-26 1 2017-03-02
3 2017-03-27 0 2017-03-03
4 2017-04-24 0 2017-03-31
5 2017-04-25 0 2017-04-01
6 2017-05-24 0 2017-04-30
7 2017-05-25 2 2017-05-01
8 2017-05-26 0 2017-05-02
As you can see, March dates in Dat2 start just from original date 2017-03-25,
and so on.
The value of 1 is in March (Dat2) and the value of 2 is in May (also Dat2).
Then, to compute e.g. a sum by month, we can run:
df.groupby(pd.Grouper(key='Dat2', freq='MS')).sum()
getting:
Val
Dat2
2017-02-01 0
2017-03-01 1
2017-04-01 0
2017-05-01 2
So we have correct groupping:
1 is in March,
2 is in May.
The advantage over the other answer is that you have all dates on the first
day of a month, of course bearing in mind that e.g. 2017-03-01 in the
result means the period from 2017-03-25 to 2017-04-24 (including).

Handle Perpetual Maturity Bonds with Maturity date of 31-12-9999 12:00:00 AM

I have a number of records in a dataframe where the maturity date
column is 31-12-9999 12:00:00 AM as the bonds never mature. This
naturally raises the error:
Out of bounds nanosecond timestamp: 9999-12-31 00:00:00
I see the max date is:
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
I just wanted to clarify what the best approach to clean all date columns in the datframe and fix my bug? My code modelled off the docs:
df_Fix_Date = df_Date['maturity_date'].head(8)
display(df_Fix_Date)
display(df_Fix_Date.dtypes)
0 2020-08-15 00:00:00.000
1 2022-11-06 00:00:00.000
2 2019-03-15 00:00:00.000
3 2025-01-15 00:00:00.000
4 2035-05-29 00:00:00.000
5 2027-06-01 00:00:00.000
6 2021-04-01 00:00:00.000
7 2022-04-03 00:00:00.000
Name: maturity_date, dtype: object
def conv(x):
return pd.Period(day = x%100, month = x//100 % 100, year = x // 10000, freq='D')
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date']) # convert to datetype
df_Fix_Date['maturity_date'] = pd.PeriodIndex(df_Fix_Date['maturity_date'].apply(conv)) # fix error
display(df_Fix_Date)
Output:
KeyError: 'maturity_date'
There is problem you cannot convert to out of bounds datetimes.
One solution is replace 9999 to 2261:
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].replace('^9999','2261',regex=True)
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Another solution is replace all dates with year higher as 2261 to 2261:
m = df_Fix_Date['maturity_date'].str[:4].astype(int) > 2261
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].mask(m, '2261' + df_Fix_Date['maturity_date'].str[4:])
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Or replace problematic dates to NaTs by parameter errors='coerce':
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'], errors='coerce')
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 NaT

Subtract day column from date column in pandas data frame

I have two columns in my data frame.One column is date(df["Start_date]) and other is number of days.I want to subtract no of days column(df["days"]) from Date column.
I was trying something like this
df["new_date"]=df["Start_date"]-datetime.timedelta(days=df["days"])
I think you need to_timedelta:
df["new_date"]=df["Start_date"]-pd.to_timedelta(df["days"], unit='D')
Sample:
np.random.seed(120)
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=10)
df = pd.DataFrame({'Start_date': rng, 'days': np.random.choice(np.arange(10), size=10)})
print (df)
Start_date days
0 2015-02-24 7
1 2015-02-25 0
2 2015-02-26 8
3 2015-02-27 4
4 2015-02-28 1
5 2015-03-01 7
6 2015-03-02 1
7 2015-03-03 3
8 2015-03-04 8
9 2015-03-05 9
df["new_date"]=df["Start_date"]-pd.to_timedelta(df["days"], unit='D')
print (df)
Start_date days new_date
0 2015-02-24 7 2015-02-17
1 2015-02-25 0 2015-02-25
2 2015-02-26 8 2015-02-18
3 2015-02-27 4 2015-02-23
4 2015-02-28 1 2015-02-27
5 2015-03-01 7 2015-02-22
6 2015-03-02 1 2015-03-01
7 2015-03-03 3 2015-02-28
8 2015-03-04 8 2015-02-24
9 2015-03-05 9 2015-02-24