Resample TimedeltaIndex and normalize to frequency - pandas

For example, I have this Series:
17:50:51.050929 5601
17:52:15.429169 5601
17:52:19.538702 5601
17:53:44.776350 5601
17:53:51.870372 5598
17:55:33.952417 5600
17:56:48.736539 5596
17:57:01.205767 5593
17:57:26.066097 5593
17:57:30.644398 5591
I want to resample it, but I want the index to start at a rounded frequency.
So in the case above, I want the first index to be 17:51:00 if I resample at minute frequency.
However, pandas does it like this:
a.resample('1T').mean()
Out[125]:
17:50:51.050929 5601.000000
17:51:51.050929 5601.000000
17:52:51.050929 5601.000000
17:53:51.050929 5598.000000
17:54:51.050929 5600.000000
17:55:51.050929 5596.000000
17:56:51.050929 5592.333333
17:57:51.050929 NaN
How can I get a TimedeltaIndex that starts from a rounded value, as with Timestamp resampling?

A quick way to do it is to normalise the index before resampling (using either floor, ceil, or round):
a.index = a.index.floor(freq='1T')
a = a.resample('1T').mean()
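A minimal, self-contained sketch of that fix (the sample values are taken from the question; note that floor snaps the first bin to 17:50:00, while ceil would start it at 17:51:00):
import pandas as pd

idx = pd.to_timedelta(['17:50:51.050929', '17:52:15.429169', '17:53:44.776350'])
a = pd.Series([5601, 5601, 5601], index=idx)

a.index = a.index.floor('1T')   # snap each timedelta down to a whole minute
a = a.resample('1T').mean()     # bins now fall on round minutes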

Related

How to resample intra-day intervals and use .idxmax()?

I am using data from yfinance, which returns a pandas DataFrame.
Volume
Datetime
2021-09-13 09:30:00-04:00 951104
2021-09-13 09:35:00-04:00 408357
2021-09-13 09:40:00-04:00 498055
2021-09-13 09:45:00-04:00 466363
2021-09-13 09:50:00-04:00 315385
...
2021-12-06 15:35:00-05:00 200748
2021-12-06 15:40:00-05:00 336136
2021-12-06 15:45:00-05:00 473106
2021-12-06 15:50:00-05:00 705082
2021-12-06 15:55:00-05:00 1249763
The DataFrame contains 5-minute intra-day intervals. I want to resample to daily data and get the idxmax of the maximum volume for each day.
df.resample("B")["Volume"].idxmax()
Returns an error:
ValueError: attempt to get argmax of an empty sequence
I used B (business days) as the resampling period, so there shouldn't be any empty sequences.
I should say .max() works fine.
Using .agg, as suggested in another question, also returns an error:
df["Volume"].resample("B").agg(lambda x : np.nan if x.count() == 0 else x.idxmax())
error:
IndexError: index 77 is out of bounds for axis 0 with size 0
You can use groupby as an alternative to resample:
>>> df.groupby(df.index.normalize())['Volume'].agg(Datetime='idxmax', Volume='max')
Datetime Volume
Datetime
2021-09-13 2021-09-13 09:30:00 951104
2021-12-06 2021-12-06 15:55:00 1249763
The B bins include business days with no data (market holidays, for instance), and idxmax fails on those empty groups, so test for an all-NaN group inside the lambda:
import numpy as np
df = df.resample("B")["Volume"].agg(lambda x: np.nan if x.isna().all() else x.idxmax())
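A minimal reproduction of the empty-bin issue (assumed toy data; 2021-11-25 is Thanksgiving, which pandas' B frequency still counts as a business day even though there are no rows for it):
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2021-11-24 09:30', '2021-11-24 09:35', '2021-11-26 09:30'])
df = pd.DataFrame({'Volume': [100, 300, 200]}, index=idx)

# a plain df.resample('B')['Volume'].idxmax() hits the empty 2021-11-25 bin
out = df.resample('B')['Volume'].agg(lambda x: np.nan if x.isna().all() else x.idxmax())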

Seasonal Decomposition plots won't show despite pandas recognizing DateTime Index

I loaded the data as follows
og_data = pd.read_excel(r'C:\Users\hp\Downloads\dataframe.xlsx', index_col='DateTime')
And it looks like this:
DateTime A
2019-02-04 10:37:54 0
2019-02-04 10:47:54 1
2019-02-04 10:57:54 2
2019-02-04 11:07:54 3
2019-02-04 11:17:54 4
The problem is that I'm trying to set the data to a fixed frequency, but there are NaN values I have to drop, and even if I don't, the spacing is irregular. I've got pandas to recognize the DateTime index:
og_data.index
DatetimeIndex([...], dtype='datetime64[ns]', name='DateTime', length=15536, freq=None)
but when I try doing this:
og_data.index.freq = '10T'
That should mean 10min, right?
But I get the following error instead:
ValueError: Inferred frequency None from passed values does not conform to passed frequency 10T
Even if I set the frequency to days:
og_data.index.freq = 'D'
I get a similar error.
The goal is to get seasonal decomposition plots, because I want to forecast the data. But I get the following error when I try to do so:
result=seasonal_decompose(og_data['A'],model='add')
result.plot();
ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None
Which makes sense, since I can't set the datetime index to a specified frequency. I need it to be every 10 minutes; please advise.
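One common fix (a sketch, not from the original thread): .index.freq can only be set when the timestamps already conform to that frequency, which is exactly what the ValueError is saying. So resample the irregular data onto a regular 10-minute grid first, then fill the bins that had no observations:
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

og_data = og_data.resample('10T').mean()     # conform to a regular 10-minute grid; freq is now set
og_data['A'] = og_data['A'].interpolate()    # fill gaps, since seasonal_decompose rejects NaNs
result = seasonal_decompose(og_data['A'], model='add')
result.plot()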

How to move the timestamp bounds for datetime in pandas (working with historical data)?

I'm working with historical data and have some very old dates that are outside the timestamp bounds for pandas. I've consulted the pandas Time series/date functionality documentation, which has some information on out-of-bounds spans, but it still wasn't clear to me what, if anything, I could do to convert my data into a datetime type.
I've also seen a few threads on Stack Overflow about this, but they either just point out the problem (i.e. nanosecond resolution, with a maximum range of roughly 584 years) or suggest setting errors='coerce', which turns 80% of my data into NaTs.
Is it possible to turn dates lower than the default Pandas lower bound into dates? Here's a sample of my data:
import pandas as pd
df = pd.DataFrame({'id': ['836', '655', '508', '793', '970', '1075', '1119', '969', '1166', '893'],
                   'date': ['1671-11-25', '1669-11-22', '1666-05-15', '1673-01-18', '1675-05-07', '1677-02-08', '1678-02-08', '1675-02-15', '1678-11-28', '1673-12-23']})
You can create day periods with a lambda function:
df['date'] = df['date'].apply(lambda x: pd.Period(x, freq='D'))
Or, as @Erfan mentioned in a comment (thank you):
df['date'] = df['date'].apply(pd.Period)
print(df)
id date
0 836 1671-11-25
1 655 1669-11-22
2 508 1666-05-15
3 793 1673-01-18
4 970 1675-05-07
5 1075 1677-02-08
6 1119 1678-02-08
7 969 1675-02-15
8 1166 1678-11-28
9 893 1673-12-23
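Period sidesteps the Timestamp limit because it stores an integer ordinal at the chosen frequency rather than nanoseconds since the epoch, so day periods reach far outside the nanosecond range. A vectorized alternative to apply (a sketch; it should behave like the lambda above):
df['date'] = pd.PeriodIndex(df['date'], freq='D')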

Scatter plot of Multiindex GroupBy()

I'm trying to make a scatter plot of a GroupBy() with a MultiIndex (http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-with-multiindex). That is, I want to plot one label on the x-axis, another label on the y-axis, and the mean() as the size of each point.
df['RMSD'].groupby([df['Sigma'],df['Epsilon']]).mean() returns:
Sigma_ang Epsilon_K
3.4 30 0.647000
40 0.602071
50 0.619786
3.6 30 0.646538
40 0.591833
50 0.607769
3.8 30 0.616833
40 0.590714
50 0.578364
Name: RMSD, dtype: float64
And I'd like to plot something like: plt.scatter(x=Sigma, y=Epsilon, s=RMSD)
What's the best way to do this? I'm having trouble getting the proper Sigma and Epsilon values for each RMSD value.
Based on Vaishali Garg's comment (+1), the following works. reset_index() turns the two index levels back into ordinary Sigma and Epsilon columns, so scatter can address them by name:
import matplotlib.pyplot as plt

df_mean = df['RMSD'].groupby([df['Sigma'], df['Epsilon']]).mean().reset_index()
plt.scatter(df_mean['Sigma'], df_mean['Epsilon'], s=100. * df_mean['RMSD'])
plt.show()

Pandas: fancy indexing a dataframe

I have a Pandas dataframe, df1, that is a year-long 5-minute timeseries with columns A-Z.
df1.shape
(105121, 26)
df1.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2002-01-02 00:00:00, ..., 2003-01-02 00:00:00]
Length: 105121, Freq: 5T, Timezone: None
I have a second dataframe, df2, that is a year-long daily timeseries (over the same period) with matching columns. The values of this second frame are Booleans.
df2.shape
(365, 26)
df2.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2002-01-02 00:00:00, ..., 2003-01-01 00:00:00]
Length: 365, Freq: D, Timezone: None
I want to use df2 as a fancy index into df1, i.e. "df1.ix[df2]" or some such, so that for each date I get back the subset of df1's columns that df2 marks True on that date (with all timestamps thereon). Thus the shape of the result should be (105121, width), where width is the number of distinct columns the Booleans imply (width <= 26).
Currently, df1.ix[df2] only partially works: only the 00:00 values for each day are picked out, which makes sense in light of df2's 'point-like' time series.
I next tried time spans as the df2 index:
df2.index
PeriodIndex: 365 entries, 2002-01-02 to 2003-01-01
This time, I get an error:
/home/wchapman/.local/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/index.pyc in get_indexer(self, target, method, limit)
844 this = self.astype(object)
845 target = target.astype(object)
--> 846 return this.get_indexer(target, method=method, limit=limit)
847
848 if not self.is_unique:
AttributeError: 'numpy.ndarray' object has no attribute 'get_indexer'
My interim solution is to loop by date, but this seems inefficient. Is Pandas capable of this kind of fancy indexing? I don't see examples anywhere in the documentation.
Here's one way to do this:
t_index = df1.index
d_index = df2.index
mask = t_index.map(lambda t: t.date() in d_index)
df1[mask]
And slightly faster (but with the same idea) would be to use:
import datetime

mask = pd.to_datetime([datetime.date(*t_tuple)
                       for t_tuple in zip(t_index.year,
                                          t_index.month,
                                          t_index.day)]).isin(d_index)
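On modern pandas, DatetimeIndex.normalize() gives a vectorized one-liner with the same effect (a sketch, not part of the original answer):
mask = df1.index.normalize().isin(df2.index)
df1[mask]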