I have a time-based feature in my pandas data frame of 5-minute interval data, so it looks something like this:
dataDate TimeconinSec
2020-11-11 22:25:00 302
2020-11-11 23:25:00 605
2020-11-12 00:25:00 302
A few times this feature may have a value beyond 5 minutes (300 seconds), so I want it to look like the following output, going back in time and distributing the time feature:
dataDate TimeconinSec
2020-11-11 22:20:00 300
2020-11-11 22:25:00 002
2020-11-11 23:15:00 300
2020-11-11 23:20:00 300
2020-11-11 23:25:00 005
2020-11-12 00:20:00 300
2020-11-12 00:25:00 002
I have tried different pandas date range functions, but how can I partition my time-based feature across the intervals?
Let’s first convert everything to proper timestamps, and compute the beginning and end of every interval:
>>> df['date'] = pd.to_datetime(df['dataDate'])
>>> df['since'] = (df['date'] - df['TimeconinSec'].astype('timedelta64[s]')).dt.floor(freq='300s')
>>> df['until'] = df['since'] + df['TimeconinSec'].astype('timedelta64[s]')
Then we can use pd.date_range to generate all the proper intermediate interval bounds:
>>> bounds = df.apply(lambda s: [*pd.date_range(s['since'], s['until'], freq='300s'), s['until']], axis='columns')
>>> bounds
0 [2020-11-11 22:15:00, 2020-11-11 22:20:00, 202...
1 [2020-11-11 23:10:00, 2020-11-11 23:15:00, 202...
2 [2020-11-12 00:15:00, 2020-11-12 00:20:00, 202...
dtype: object
Then with explode we can turn these into their own series. I’m using the series twice, once for the beginning of each interval and once, shifted, for the end. Note the groupby().shift(), which performs the shift only within the same index.
>>> interval_ends = pd.concat([bounds.explode(), bounds.explode().groupby(level=0).shift(-1)], axis='columns', keys=['start', 'end'])
>>> interval_ends
start end
0 2020-11-11 22:15:00 2020-11-11 22:20:00
0 2020-11-11 22:20:00 2020-11-11 22:20:02
0 2020-11-11 22:20:02 NaT
1 2020-11-11 23:10:00 2020-11-11 23:15:00
1 2020-11-11 23:15:00 2020-11-11 23:20:00
1 2020-11-11 23:20:00 2020-11-11 23:20:05
1 2020-11-11 23:20:05 NaT
2 2020-11-12 00:15:00 2020-11-12 00:20:00
2 2020-11-12 00:20:00 2020-11-12 00:20:02
2 2020-11-12 00:20:02 NaT
After that we can discard the indexes and simply compute the time inside each interval:
>>> interval_ends.reset_index(drop=True, inplace=True)
>>> delays = (interval_ends['end'] - interval_ends['start']).astype('timedelta64[s]')
>>> delays
0 300.0
1 2.0
2 NaN
3 300.0
4 300.0
5 5.0
6 NaN
7 300.0
8 2.0
9 NaN
dtype: float64
Finally we just have to join the interval starts with these delays and drop the rows containing NaNs, and we’ve got your final result:
>>> delays = delays.rename('time_in_secs').dropna().astype('int')
>>> interval_ends[['start']].join(delays, how='inner')
start time_in_secs
0 2020-11-11 22:15:00 300
1 2020-11-11 22:20:00 2
3 2020-11-11 23:10:00 300
4 2020-11-11 23:15:00 300
5 2020-11-11 23:20:00 5
7 2020-11-12 00:15:00 300
8 2020-11-12 00:20:00 2
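A side note, not part of the original answer: on recent pandas versions (2.x), casting an integer column with .astype('timedelta64[s]') no longer behaves the same, and the float-seconds shortcut used for delays further down also changed. If I'm not mistaken, pd.to_timedelta and .dt.total_seconds() are the stable spellings; a minimal sketch of the affected steps under that assumption:

import pandas as pd

df = pd.DataFrame({'dataDate': ['2020-11-11 22:25:00',
                                '2020-11-11 23:25:00',
                                '2020-11-12 00:25:00'],
                   'TimeconinSec': [302, 605, 302]})

df['date'] = pd.to_datetime(df['dataDate'])
dur = pd.to_timedelta(df['TimeconinSec'], unit='s')   # instead of astype('timedelta64[s]')
df['since'] = (df['date'] - dur).dt.floor('300s')
df['until'] = df['since'] + dur
# and later, for float seconds per sub-interval:
# delays = (interval_ends['end'] - interval_ends['start']).dt.total_seconds()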
Given are two series, like this:
#period1
DATE
2020-06-22 310.62
2020-06-26 300.05
2020-09-23 322.64
2020-10-30 326.54
#period2
DATE
2020-06-23 312.05
2020-09-02 357.70
2020-10-12 352.43
2021-01-25 384.39
These two series are related to each other: each marks either the beginning or the end of a date period. The first series marks the end of a period1 period, and the second series marks the end of a period2 period. The end of a period2 period is at the same time the start of a period1 period, and vice versa.
I've been looking for a way to aggregate these periods as date ranges, but apparently this is not easily possible with Pandas dataframes. Suggestions extremely welcome.
In the easiest case, the output layout should reflect the end dates of periods, which period type it was, and the amount of change between start and stop of the period.
Explicit output:
DATE CHG PERIOD
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.0 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
However, if there is any possibility of actually grouping by a date range consisting of start AND stop date, that would be much more favorable.
Thank you!
p1 = pd.DataFrame(data={'Date': ['2020-06-22', '2020-06-26', '2020-09-23', '2020-10-30'], 'val':[310.62, 300.05, 322.64, 326.54]})
p2 = pd.DataFrame(data={'Date': ['2020-06-23', '2020-09-02', '2020-10-12', '2021-01-25'], 'val':[312.05, 357.7, 352.43, 384.39]})
p1['period'] = 1
p2['period'] = 2
df = pd.concat([p1, p2]).sort_values('Date').reset_index(drop=True)
df['CHG'] = abs(df['val'].diff(periods=1))
df.drop('val', axis=1)
Output:
Date period CHG
0 2020-06-22 1 NaN
1 2020-06-23 2 1.43
2 2020-06-26 1 12.00
3 2020-09-02 2 57.65
4 2020-09-23 1 35.06
5 2020-10-12 2 29.79
6 2020-10-30 1 25.89
7 2021-01-25 2 57.85
EDIT: matching the format START - STOP - CHANGE - PERIOD
Starting from the above data frame:
df['Start'] = df.Date.shift(periods=1)
df.rename(columns={'Date': 'Stop'}, inplace=True)
df = df[['Start', 'Stop', 'CHG', 'period']]
df
Output:
Start Stop CHG period
0 NaN 2020-06-22 NaN 1
1 2020-06-22 2020-06-23 1.43 2
2 2020-06-23 2020-06-26 12.00 1
3 2020-06-26 2020-09-02 57.65 2
4 2020-09-02 2020-09-23 35.06 1
5 2020-09-23 2020-10-12 29.79 2
6 2020-10-12 2020-10-30 25.89 1
7 2020-10-30 2021-01-25 57.85 2
Alternative solution, starting from the two original series held in df1 and df2:
# If needed:
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df = pd.concat([df1, df2], axis=1)
df.columns = ['start','stop']
df['CNG'] = df.bfill(axis=1)['start'].diff().abs()
df['PERIOD'] = 1
df.loc[df.stop.notna(), 'PERIOD'] = 2
df = df[['CNG', 'PERIOD']]
print(df)
Output:
CNG PERIOD
Date
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.00 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
2021-01-29 14.32 1
2021-02-12 22.57 2
2021-03-04 15.94 1
2021-05-07 45.42 2
2021-05-12 16.71 1
2021-09-02 47.78 2
2021-10-04 24.55 1
2021-11-18 41.09 2
2021-12-01 19.23 1
2021-12-10 20.24 2
2021-12-20 15.76 1
2022-01-03 22.73 2
2022-01-27 46.47 1
2022-02-09 26.30 2
2022-02-23 35.59 1
2022-03-02 15.94 2
2022-03-08 21.64 1
2022-03-29 45.30 2
2022-04-29 49.55 1
2022-05-04 17.06 2
2022-05-12 36.72 1
2022-05-17 15.98 2
2022-05-19 18.86 1
2022-06-02 27.93 2
2022-06-17 51.53 1
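As for the question of grouping by a date range made of a start AND a stop date: one way to sketch that (an addition of mine, not part of the answers above) is to turn the Start/Stop pairs from the EDIT table into a pd.IntervalIndex and bucket other timestamped data into those ranges with pd.cut. The obs frame below is purely hypothetical sample data:

import pandas as pd

# Start/Stop pairs from the EDIT output above (the first row has no Start, so it is omitted)
starts = pd.to_datetime(['2020-06-22', '2020-06-23', '2020-06-26', '2020-09-02',
                         '2020-09-23', '2020-10-12', '2020-10-30'])
stops = pd.to_datetime(['2020-06-23', '2020-06-26', '2020-09-02', '2020-09-23',
                        '2020-10-12', '2020-10-30', '2021-01-25'])
intervals = pd.IntervalIndex.from_arrays(starts, stops, closed='right')

# hypothetical observations to assign to those start/stop ranges
obs = pd.DataFrame({'Date': pd.to_datetime(['2020-06-25', '2020-09-10', '2020-11-15']),
                    'value': [1.0, 2.0, 3.0]})
obs['period_range'] = pd.cut(obs['Date'], bins=intervals)
print(obs.groupby('period_range', observed=True)['value'].sum())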
I have pandas DataFrame:
start_date finish_date progress_id
0 2018-06-23 08:28:50.681065+00 2018-06-23 08:28:52.439542+00 a387ab916f402cb3fbfffd29f68fd0ce
1 2019-03-18 14:23:17.328374+00 2019-03-18 14:54:50.979612+00 3b9dce04f32da32763124602557f92a3
2 2019-07-09 09:18:46.19862+00 2019-07-11 08:03:09.222385+00 73e17a05355852fe65b785c82c37d1ad
3 2018-07-27 15:39:17.666629+00 2018-07-27 16:13:55.086871+00 cc3eb34ae49c719648352c4175daee88
4 2019-04-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00 04ace4fe130d90c801e24eea13ee808e
I converted columns to datetime.date because I don't need time in df:
df['start_date'] = pd.to_datetime(df['start_date']).dt.date
df['finish_date'] = pd.to_datetime(df['finish_date']).dt.date
So, I need a new column which will contain the year-month if start_date and finish_date have the same month, and if they differ, the range of months between them. For example, start_date = 06-2020 and finish_date = 08-2020 results in [06-2020, 07-2020, 08-2020]. Then I need to explode the dataframe on that column.
I tried:
df['range'] = df.apply(lambda x: pd.date_range(x['start_date'], x['finish_date'], freq="M"), axis=1)
df = df.explode('range')
but as a result I had many NaT's in the column.
Any solutions will be great.
One alternative is the following. Assume you have this dataframe, df:
start_date finish_date \
0 2018-06-23 08:28:50.681065+00 2018-06-23 08:28:52.439542+00
1 2019-03-18 14:23:17.328374+00 2019-03-18 14:54:50.979612+00
2 2019-07-09 09:18:46.19862+00 2019-07-11 08:03:09.222385+00
3 2018-07-27 15:39:17.666629+00 2018-07-27 16:13:55.086871+00
4 2019-04-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00
5 2019-05-24 18:42:40.272854+00 2019-10-24 18:44:57.507857+00
progress_id
0 a387ab916f402cb3fbfffd29f68fd0ce
1 3b9dce04f32da32763124602557f92a3
2 73e17a05355852fe65b785c82c37d1ad
3 cc3eb34ae49c719648352c4175daee88
4 04ace4fe130d90c801e24eea13ee808e
5 04ace4fe130d90c801e24eea13ee808e
It is the same as the one you shared, plus one row where the dates (year and month) differ.
Then applying this:
df['start_date'] = pd.to_datetime(df['start_date'],format='%Y-%m-%d')
df['finish_date'] = pd.to_datetime(df['finish_date'],format='%Y-%m-%d')
df['finish_M_Y'] = df['finish_date'].dt.strftime('%Y-%m')
df['Start_M_Y'] = df['start_date'].dt.strftime('%Y-%m')
def month_range(row):
    # same year-month: keep the single 'YYYY-MM' string
    if row['Start_M_Y'] == row['finish_M_Y']:
        return row['Start_M_Y']
    # different months: month-end stamps between the two bounds
    return pd.date_range(row['Start_M_Y'], row['finish_M_Y'], freq='M')

df['Range'] = df.apply(month_range, axis=1)
df.explode('Range').drop(['Start_M_Y', 'finish_M_Y'], axis=1)
gives you
start_date finish_date \
0 2018-06-23 08:28:50.681065+00:00 2018-06-23 08:28:52.439542+00:00
1 2019-03-18 14:23:17.328374+00:00 2019-03-18 14:54:50.979612+00:00
2 2019-07-09 09:18:46.198620+00:00 2019-07-11 08:03:09.222385+00:00
3 2018-07-27 15:39:17.666629+00:00 2018-07-27 16:13:55.086871+00:00
4 2019-04-24 18:42:40.272854+00:00 2019-04-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
progress_id Range
0 a387ab916f402cb3fbfffd29f68fd0ce 2018-06
1 3b9dce04f32da32763124602557f92a3 2019-03
2 73e17a05355852fe65b785c82c37d1ad 2019-07
3 cc3eb34ae49c719648352c4175daee88 2018-07
4 04ace4fe130d90c801e24eea13ee808e 2019-04
5 04ace4fe130d90c801e24eea13ee808e 2019-05-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-06-30 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-07-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-08-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-09-30 00:00:00
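On the NaTs from the original attempt: date_range(..., freq='M') emits month-end timestamps, so when start_date and finish_date fall in the same month the range is empty and explodes to NaT. A possible variant (just a sketch, not part of the answer above) avoids the branching by building a monthly period_range per row, which always contains both endpoint months:

import pandas as pd

df = pd.DataFrame({'start_date': ['2019-04-24 18:42:40.272854+00', '2019-05-24 18:42:40.272854+00'],
                   'finish_date': ['2019-04-24 18:44:57.507857+00', '2019-10-24 18:44:57.507857+00'],
                   'progress_id': ['a387ab916f402cb3fbfffd29f68fd0ce', '04ace4fe130d90c801e24eea13ee808e']})

# drop the timezone so the monthly periods can be built without warnings
df['start_date'] = pd.to_datetime(df['start_date']).dt.tz_localize(None)
df['finish_date'] = pd.to_datetime(df['finish_date']).dt.tz_localize(None)

# period_range includes both endpoint months, so a same-month row yields exactly one period
df['Range'] = df.apply(lambda r: pd.period_range(r['start_date'], r['finish_date'], freq='M'),
                       axis=1)
out = df.explode('Range')
out['Range'] = out['Range'].astype(str)   # '2019-04', '2019-05', ..., '2019-10'
print(out[['progress_id', 'Range']])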
I have a number of records in a dataframe where the maturity date
column is 31-12-9999 12:00:00 AM as the bonds never mature. This
naturally raises the error:
Out of bounds nanosecond timestamp: 9999-12-31 00:00:00
I see the max date is:
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
I just wanted to clarify what the best approach is to clean all date columns in the dataframe and fix my bug. My code, modelled on the docs:
df_Fix_Date = df_Date['maturity_date'].head(8)
display(df_Fix_Date)
display(df_Fix_Date.dtypes)
0 2020-08-15 00:00:00.000
1 2022-11-06 00:00:00.000
2 2019-03-15 00:00:00.000
3 2025-01-15 00:00:00.000
4 2035-05-29 00:00:00.000
5 2027-06-01 00:00:00.000
6 2021-04-01 00:00:00.000
7 2022-04-03 00:00:00.000
Name: maturity_date, dtype: object
def conv(x):
    return pd.Period(day = x%100, month = x//100 % 100, year = x // 10000, freq='D')
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date']) # convert to datetype
df_Fix_Date['maturity_date'] = pd.PeriodIndex(df_Fix_Date['maturity_date'].apply(conv)) # fix error
display(df_Fix_Date)
Output:
KeyError: 'maturity_date'
The problem is that you cannot convert out-of-bounds datetimes.
One solution is to replace 9999 with 2261:
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].replace('^9999','2261',regex=True)
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Another solution is to replace all dates with a year greater than 2261 with 2261:
m = df_Fix_Date['maturity_date'].str[:4].astype(int) > 2261
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].mask(m, '2261' + df_Fix_Date['maturity_date'].str[4:])
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Or replace problematic dates with NaT via the parameter errors='coerce':
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'], errors='coerce')
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 NaT
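If those 9999-12-31 maturities need to be kept rather than clamped or coerced to NaT, one further option (a sketch of mine, not part of the answer above) is to store the column as a daily Period, since Period is not limited to the nanosecond Timestamp range:

import pandas as pd

s = pd.Series(['2020-08-15 00:00:00.000', '9999-12-31 00:00:00.000'], name='maturity_date')

# Period('9999-12-31', freq='D') is representable even though the
# equivalent Timestamp would be out of bounds
maturity = pd.PeriodIndex(s.str[:10], freq='D')
print(maturity)

Comparisons and sorting still work on Periods, but anything that needs Timestamp-specific functionality (timezones, nanosecond arithmetic) would not.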
I have a time series. I'd like to group it into 24-hour blocks, from 8am to 7:59am the next day. I know how to group by date, but I've tried and failed to handle this 8-hour offset using TimeGroupers and DateOffsets.
I think you can use Grouper with parameter base:
print(df)
date name
0 2015-06-13 00:21:25 1
1 2015-06-14 01:00:25 2
2 2015-06-14 02:54:48 3
3 2015-06-15 14:38:15 2
4 2015-06-15 15:29:28 1
print(df.groupby(pd.Grouper(key='date', freq='24h', base=8)).sum())
name
date
2015-06-12 08:00:00 1.0
2015-06-13 08:00:00 5.0
2015-06-14 08:00:00 NaN
2015-06-15 08:00:00 3.0
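Note for newer pandas (the base argument was deprecated in 1.1 and later removed): as far as I recall, offset is its replacement, so the equivalent grouping would look roughly like this:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2015-06-13 00:21:25', '2015-06-14 01:00:25',
                                           '2015-06-14 02:54:48', '2015-06-15 14:38:15',
                                           '2015-06-15 15:29:28']),
                   'name': [1, 2, 3, 2, 1]})

# offset='8h' shifts the 24h bins to start at 08:00, like base=8 did
print(df.groupby(pd.Grouper(key='date', freq='24h', offset='8h')).sum())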
Alternatively to @jezrael's method, you can use your own custom grouper function:
start_ts = '2016-01-01 07:59:59'
df = pd.DataFrame({'Date': pd.date_range(start_ts, freq='10min', periods=1000)})
def my_grouper(df, idx):
    return df.loc[idx, 'Date'].date() if df.loc[idx, 'Date'].hour >= 8 else df.loc[idx, 'Date'].date() - pd.Timedelta('1day')
df.groupby(lambda x: my_grouper(df, x)).size()
Test:
In [468]: df.head()
Out[468]:
Date
0 2016-01-01 07:59:59
1 2016-01-01 08:09:59
2 2016-01-01 08:19:59
3 2016-01-01 08:29:59
4 2016-01-01 08:39:59
In [469]: df.tail()
Out[469]:
Date
995 2016-01-08 05:49:59
996 2016-01-08 05:59:59
997 2016-01-08 06:09:59
998 2016-01-08 06:19:59
999 2016-01-08 06:29:59
In [470]: df.groupby(lambda x: my_grouper(df, x)).size()
Out[470]:
2015-12-31 1
2016-01-01 144
2016-01-02 144
2016-01-03 144
2016-01-04 144
2016-01-05 144
2016-01-06 144
2016-01-07 135
dtype: int64
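A vectorized variant of the same idea (an addition of mine, not from the answers above): subtract 8 hours from every timestamp and group by the resulting calendar date, so everything from 08:00 up to 07:59:59 the next day ends up in one bucket:

import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2016-01-01 07:59:59', freq='10min', periods=1000)})

# shifting the clock back 8 hours makes the plain date the 08:00-to-07:59 bucket
key = (df['Date'] - pd.Timedelta(hours=8)).dt.date
print(df.groupby(key).size())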
I'm using pandas 0.12.0. I have a DataFrame that looks like:
date ms
0 2013-06-03 00:10:00 75.846318
1 2013-06-03 00:20:00 78.408277
2 2013-06-03 00:30:00 75.807990
3 2013-06-03 00:40:00 70.509438
4 2013-06-03 00:50:00 71.537499
I want to generate a third column, "tod", which contains just the time portion of the date (i.e. call .time() on each value). I'm somewhat of a pandas newbie, so I suspect this is trivial but I'm just not seeing how to do it.
Just apply the Timestamp time method to items in the date column:
In [11]: df['date'].apply(lambda x: x.time())
# equivalently .apply(pd.Timestamp.time)
Out[11]:
0 00:10:00
1 00:20:00
2 00:30:00
3 00:40:00
4 00:50:00
Name: date, dtype: object
In [12]: df['tod'] = df['date'].apply(lambda x: x.time())
This gives a column of datetime.time objects.
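On more recent pandas versions, assuming the column already has a datetime64 dtype, the .dt accessor gives, I believe, the same result without an explicit apply:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2013-06-03 00:10:00', '2013-06-03 00:20:00']),
                   'ms': [75.846318, 78.408277]})
df['tod'] = df['date'].dt.time   # column of datetime.time objects, same as the apply above
print(df)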
Using the method Andy created, on an Index, is faster than apply:
In [93]: df = DataFrame(randn(5,1),columns=['A'])
In [94]: df['date'] = date_range('20130101 9:05',periods=5)
In [95]: df['time'] = Index(df['date']).time
In [96]: df
Out[96]:
A date time
0 0.053570 2013-01-01 09:05:00 09:05:00
1 -0.382155 2013-01-02 09:05:00 09:05:00
2 0.357984 2013-01-03 09:05:00 09:05:00
3 -0.718300 2013-01-04 09:05:00 09:05:00
4 0.531953 2013-01-05 09:05:00 09:05:00
In [97]: df.dtypes
Out[97]:
A float64
date datetime64[ns]
time object
dtype: object
In [98]: df['time'][0]
Out[98]: datetime.time(9, 5)