How to resample intra-day intervals and use .idxmax()? - pandas

I am using data from yfinance, which returns a pandas DataFrame.
Volume
Datetime
2021-09-13 09:30:00-04:00 951104
2021-09-13 09:35:00-04:00 408357
2021-09-13 09:40:00-04:00 498055
2021-09-13 09:45:00-04:00 466363
2021-09-13 09:50:00-04:00 315385
2021-12-06 15:35:00-05:00 200748
2021-12-06 15:40:00-05:00 336136
2021-12-06 15:45:00-05:00 473106
2021-12-06 15:50:00-05:00 705082
2021-12-06 15:55:00-05:00 1249763
There are 5-minute intra-day intervals in the DataFrame. I want to resample to daily data and get the idxmax of the maximum volume for each day.
df.resample("B")["Volume"].idxmax()
Returns an error:
ValueError: attempt to get argmax of an empty sequence
I used B (business days) as the resampling period, so there shouldn't be any empty sequences.
I should say .max() works fine.
Also using .agg as was suggested in another question returns an error:
df["Volume"].resample("B").agg(lambda x : np.nan if x.count() == 0 else x.idxmax())
error:
IndexError: index 77 is out of bounds for axis 0 with size 0

You can use groupby as an alternative to resample:
>>> df.groupby(df.index.normalize())['Volume'].agg(Datetime='idxmax', Volume='max')
Datetime Volume
Datetime
2021-09-13 2021-09-13 09:30:00 951104
2021-12-06 2021-12-06 15:55:00 1249763

For me, testing whether all values per group are NaN in the if-else works:
df = df.resample("B")["Volume"].agg(lambda x: np.nan if x.isna().all() else x.idxmax())
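As a runnable sketch of that guard (toy data standing in for the yfinance frame, so only a few of the question's rows are reproduced):

```python
import numpy as np
import pandas as pd

# Toy 5-minute volume data on two non-consecutive business days,
# standing in for the yfinance frame in the question.
idx = pd.to_datetime(["2021-09-13 09:30", "2021-09-13 09:35",
                      "2021-12-06 15:50", "2021-12-06 15:55"])
df = pd.DataFrame({"Volume": [951104, 408357, 705082, 1249763]}, index=idx)

# Business days between the two dates produce empty groups, so guard
# before calling idxmax, then drop the placeholder NaNs.
daily = df.resample("B")["Volume"].agg(
    lambda x: np.nan if x.isna().all() else x.idxmax()
).dropna()
print(daily)
```

The dropna() removes the business days that had no rows at all, which is exactly what made the plain resample(...).idxmax() raise.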

Related

How to sort a column that have dates in a dataframe?

I have a dataframe like this:
SEMANAS HIDROLOGICAS METEOROLOGICAS
0 02042020 36.00583090379008 31.284418529316522
1 05032020 86.91690962099126 77.01136731748973
2 12032020 87.31778425655976 77.24180581323434
3 19032020 59.2201166180758 54.57343110404338
4 26032020 32.39795918367347 29.049238743116323
I used this code to change df.SEMANAS to datetime
Semanas_Oper['SEMANAS']=pd.to_datetime(Semanas_Oper['SEMANAS'], format='%d%m%Y').dt.strftime('%d/%m/%Y')
SEMANAS HIDROLOGICAS METEOROLOGICAS
02/04/2020 36.01 31.28
05/03/2020 86.92 77.01
12/03/2020 87.32 77.24
19/03/2020 59.22 54.57
26/03/2020 32.4 29.05
But pd.to_datetime is not sorting the dates of the column df.SEMANAS.
Can you tell me how to sort this column? 02/04/2020 must be in the last row.
dt.strftime() undoes the datetime conversion and brings you back to strings. If you sort on this, you'll be left with lexicographic sorting, which is not what you want given your format is '%d/%m/%Y' (it would be fine with '%Y/%m/%d').
When working with dates in pandas you should keep the datetime64[ns] dtype. It's the easiest way to perform all datetime operations. Only use .strftime when you need to move to some other library or file output that requires a very specific string format.
df['SEMANAS'] = pd.to_datetime(df['SEMANAS'], format='%d%m%Y')
df.dtypes
#SEMANAS datetime64[ns]
#HIDROLOGICAS object
#METEOROLOGICAS object
df = df.sort_values('SEMANAS')
# SEMANAS HIDROLOGICAS METEOROLOGICAS
#1 2020-03-05 86.91690962099126 77.01136731748973
#2 2020-03-12 87.31778425655976 77.24180581323434
#3 2020-03-19 59.2201166180758 54.57343110404338
#4 2020-03-26 32.39795918367347 29.049238743116323
#0 2020-04-02 36.00583090379008 31.284418529316522
You need to sort while the column is still in datetime64[ns] format, then change it back to dd/mm/yyyy if you want:
df['SEMANAS'] = pd.to_datetime(df['SEMANAS'], format='%d%m%Y')
df.sort_values(by=['SEMANAS'], inplace=True)
df['SEMANAS'] = df['SEMANAS'].dt.strftime('%d/%m/%Y')
print(df)
SEMANAS HIDROLOGICAS METEOROLOGICAS
1 05/03/2020 86.916910 77.011367
2 12/03/2020 87.317784 77.241806
3 19/03/2020 59.220117 54.573431
4 26/03/2020 32.397959 29.049239
0 02/04/2020 36.005831 31.284419
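Putting the two steps together as a self-contained sketch (values truncated to two decimals, and only two of the question's columns kept, for brevity):

```python
import pandas as pd

# Rebuild the question's frame with ddmmyyyy strings.
df = pd.DataFrame({
    "SEMANAS": ["02042020", "05032020", "12032020", "19032020", "26032020"],
    "HIDROLOGICAS": [36.01, 86.92, 87.32, 59.22, 32.40],
})

# Sort while the column is a real datetime, then format for display.
df["SEMANAS"] = pd.to_datetime(df["SEMANAS"], format="%d%m%Y")
df = df.sort_values("SEMANAS")
df["SEMANAS"] = df["SEMANAS"].dt.strftime("%d/%m/%Y")
print(df)
```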

How to move the timestamp bounds for datetime in pandas (working with historical data)?

I'm working with historical data, and have some very old dates that are outside the timestamp bounds for pandas. I've consulted the pandas Time series/date functionality documentation, which has some information on out-of-bounds spans, but from this information it still wasn't clear to me what, if anything, I could do to convert my data into a datetime type.
I've also seen a few threads on Stack Overflow on this, but they either just point out the problem (i.e. nanosecond resolution, max range of 570-something years), or suggest setting errors='coerce', which turns 80% of my data into NaTs.
Is it possible to turn dates lower than the default Pandas lower bound into dates? Here's a sample of my data:
import pandas as pd
df = pd.DataFrame({'id': ['836', '655', '508', '793', '970', '1075', '1119', '969', '1166', '893'],
                   'date': ['1671-11-25', '1669-11-22', '1666-05-15', '1673-01-18', '1675-05-07',
                            '1677-02-08', '1678-02-08', '1675-02-15', '1678-11-28', '1673-12-23']})
You can create daily periods with a lambda function:
df['date'] = df['date'].apply(lambda x: pd.Period(x, freq='D'))
Or, as @Erfan mentioned in a comment (thank you):
df['date'] = df['date'].apply(pd.Period)
print (df)
id date
0 836 1671-11-25
1 655 1669-11-22
2 508 1666-05-15
3 793 1673-01-18
4 970 1675-05-07
5 1075 1677-02-08
6 1119 1678-02-08
7 969 1675-02-15
8 1166 1678-11-28
9 893 1673-12-23
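Since a Period is backed by an integer ordinal rather than a nanosecond timestamp, arithmetic on these out-of-bounds dates still works; a small sketch:

```python
import pandas as pd

# Dates below the Timestamp lower bound (~1677) are fine as day Periods.
df = pd.DataFrame({"date": ["1671-11-25", "1666-05-15"]})
df["date"] = df["date"].apply(pd.Period)  # freq 'D' inferred from the strings

# Period.ordinal is an integer day number, so day differences are easy.
delta_days = df["date"].iloc[0].ordinal - df["date"].iloc[1].ordinal
print(delta_days)
```

Periods also compare and sort like dates, so df.sort_values('date') works on this column as well.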

how to plot bar gaps in pandas dataframe with timedelta and timestamp

Given a timestamped df with timedelta showing time covered such as:
df = pd.DataFrame(pd.to_timedelta(['00:45:00','01:00:00','00:30:00']).rename('span'),
index=pd.to_datetime(['2019-09-19 18:00','2019-09-19 19:00','2019-09-19 21:00']).rename('ts'))
# span
# ts
# 2019-09-19 18:00:00 00:45:00
# 2019-09-19 19:00:00 01:00:00
# 2019-09-19 21:00:00 00:30:00
How can I plot a bar graph showing drop outs every 15 minutes? What I want is a bar graph that will show 0 or 1 on the Y axis with a 1 for each 15 minute segment in the time periods covered above, and a 0 for all the 15 minute segments not covered.
Per this answer I tried:
df['span'].astype('timedelta64[m]').plot.bar()
However, this plots each timespan vertically and does not show that the whole hour of 2019-09-19 20:00 is missing.
I tried
df['span'].astype('timedelta64[m]').plot()
The resulting plot is not very useful either.
I also tried this answer to no avail.
Update
Based on lostCode's answer I was able to further modify the DataFrame as follows:
def isvalid(period):
    for ndx, row in df.iterrows():
        if (period.start_time >= ndx) and (period.start_time < row.end):
            return 1
    return 0

df['end'] = df.index + df.span
ds = pd.period_range(df.index.min(), df.end.max(), freq='15T')
df_valid = pd.DataFrame(ds.map(isvalid).rename('valid'), index=ds.rename('period'))
Is there a better, more efficient way to do it?
You can use DataFrame.resample to create a new DataFrame in which to check for missing time spans; to do the check, use Series.isin:
import numpy as np

check = df.resample('H')['span'].sum().reset_index()
d = df.reset_index('ts').sort_values('ts')
check['valid'] = np.where(check['ts'].isin(d['ts']), 1, 0)
check.set_index('ts')['valid'].plot(kind='bar', figsize=(10, 10))
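On the "more efficient way" follow-up: the 15-minute grid can also be built without iterrows, by expanding each row into the slots it covers and reindexing against the full grid. A sketch using the question's df (the variable names covered, grid, and valid are mine):

```python
import pandas as pd

df = pd.DataFrame(
    pd.to_timedelta(["00:45:00", "01:00:00", "00:30:00"]).rename("span"),
    index=pd.to_datetime(["2019-09-19 18:00", "2019-09-19 19:00",
                          "2019-09-19 21:00"]).rename("ts"),
)

# Expand each (start, span) row into the 15-minute slots it covers.
slot = pd.Timedelta(minutes=15)
covered = pd.DatetimeIndex(
    [t for start, span in df["span"].items()
     for t in pd.date_range(start, start + span - slot, freq="15min")]
)

# Full grid from first start to last end; uncovered slots become 0.
grid = pd.date_range(df.index.min(), (df.index + df["span"]).max(), freq="15min")
valid = pd.Series(grid.isin(covered).astype(int), index=grid)
# valid.plot(kind="bar") then shows the 20:00 hour as a run of zeros.
```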

matplotlib, pandas, how to generate a histogram of timedeltas?

I have a pandas.DataFrame df that contains the following series:
Time
2182447 0 days 05:44:00
2182447 0 days 05:49:00
3129563 0 days 22:09:00
13341029 0 days 16:49:00
13341029 0 days 16:58:00
25622668 0 days 08:24:00
25622668 0 days 08:28:00
30077018 24 days 15:01:00
30077018 24 days 15:09:00
20131954 0 days 06:18:00
I would like to plot a histogram of the timedeltas. However:
hist(df)
df.Time.hist()
# both functions give the same error
>>> TypeError: Cannot cast ufunc less input from dtype('float64') to dtype('<m8[ns]') with casting rule 'same_kind'
The following works:
hist(df.Time.astype('timedelta64[h]'))
You can use different units in the astype argument; here I use 'h' for hours.
A more detailed description can be found in the pandas timedelta documentation.
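Note that recent pandas versions (2.x) reject astype('timedelta64[h]') as an unsupported unit; converting through dt.total_seconds() works across versions and keeps fractional hours. A sketch with a few of the values above:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.to_timedelta(["0 days 05:44:00", "0 days 22:09:00",
                               "24 days 15:01:00", "0 days 08:24:00"]))

# Float hours instead of a timedelta dtype: plottable by any backend.
hours = s.dt.total_seconds() / 3600
counts, edges = np.histogram(hours, bins=4)  # or hours.hist() with matplotlib
```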

'NaTType' object has no attribute 'days'

I have a column in my dataset which represents a date in ms, and sometimes its value is nan (actually my column is of type str and sometimes its value is 'nan'). I want to compute the epoch in days of this column. The problem is that when taking the difference of two dates:
(pd.to_datetime('now') - pd.to_datetime(np.nan)).days
if one is nan it is converted to NaT and the difference is of type NaTType which hasn't the attribute days.
In my case I would like to have nan as a result.
Other approaches I have tried: np.datetime64 cannot be used, since it cannot take nan as an argument, and my data cannot be converted to int, since int has no nan.
It will just work even if you filter first:
In [201]:
df = pd.DataFrame({'date':[dt.datetime.now(), pd.NaT, dt.datetime(2015,1,1)]})
df
Out[201]:
date
0 2015-08-28 12:12:12.851729
1 NaT
2 2015-01-01 00:00:00.000000
In [203]:
df.loc[df['date'].notnull(), 'days'] = (pd.to_datetime('now') - df['date']).dt.days
df
Out[203]:
date days
0 2015-08-28 12:12:12.851729 -1
1 NaT NaN
2 2015-01-01 00:00:00.000000 239
For me upgrading to pandas 0.20.3 from pandas 0.19.2 helped resolve this error.
pip install --upgrade pandas
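For reference, on current pandas the subtraction itself already propagates NaT, and .dt.days turns it into NaN, so an explicit filter is often unnecessary; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": [pd.NaT, pd.Timestamp("2015-01-01")]})

# NaT flows through the subtraction, and .dt.days maps it to NaN.
df["days"] = (pd.Timestamp("2015-08-28") - df["date"]).dt.days
print(df)
```

A fixed reference timestamp is used here instead of pd.to_datetime('now') so the day counts are reproducible.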