Seasonal Decomposition plots won't show despite pandas recognizing DateTime Index - pandas

I loaded the data as follows
og_data = pd.read_excel(r'C:\Users\hp\Downloads\dataframe.xlsx', index_col='DateTime')
And it looks like this:
                     A
DateTime
2019-02-04 10:37:54  0
2019-02-04 10:47:54  1
2019-02-04 10:57:54  2
2019-02-04 11:07:54  3
2019-02-04 11:17:54  4
The problem is that I'm trying to set the data to a frequency, but there are NaN values I have to drop, and even if I don't, the timestamps are irregular. I've gotten pandas to recognize the DateTime column as the index:
og_data.index
DatetimeIndex([...],
              dtype='datetime64[ns]', name='DateTime', length=15536, freq=None)
but when I try doing this:
og_data.index.freq = '10T'
That should mean 10min, right?
But I get the following error instead:
ValueError: Inferred frequency None from passed values does not conform to passed frequency 10T
Even if I set the frequency to days:
og_data.index.freq = 'D'
I get a similar error.
The goal is to get seasonal decomposition plots, because I want to forecast the data. But I get the following error when I try:
result = seasonal_decompose(og_data['A'], model='add')
result.plot();
ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None
Which makes sense: I can't set the datetime index to a specific frequency. I need it to be every 10 minutes. Please advise.
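A minimal sketch of one way out (my addition, not from the thread), assuming the readings are meant to be 10 minutes apart: resample onto a regular 10-minute grid, which sets freq, fill the resulting gaps, and then decompose. Interpolation is just one possible fill strategy.

from statsmodels.tsa.seasonal import seasonal_decompose

# snap the irregular timestamps onto a regular 10-minute grid;
# intervals without a reading become NaN
og_data = og_data.resample('10T').mean()

# fill the gaps so seasonal_decompose sees a complete series
og_data['A'] = og_data['A'].interpolate()

result = seasonal_decompose(og_data['A'], model='add')
result.plot()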

Related

Stumped. How do I convert my datetime format to an acceptable format for pandas and mplfinance?

I am trying to use various charting packages for OHLC bar charting. I've had some success, but I keep getting stuck on "TypeError: Expect data.index as DatetimeIndex". The samples that I copy work perfectly fine, like the one below:
import yfinance as yf
import mplfinance as mpf
symbol = 'AAPL'
df = yf.download(symbol, period='6mo')
mpf.plot(df, type='candle')
which has the following type of index for the df :
DatetimeIndex(['2022-06-30', '2022-07-01', '2022-07-05', '2022-07-06',
               ...
               '2022-12-29', '2022-12-30'],
              dtype='datetime64[ns]', name='Date', length=128, freq=None)
So I am trying to get my dataframe index to look the same, with a DatetimeIndex format. My index looks like this:
0 2022-11-09 14:30:00+00:00
1 2022-11-09 14:35:00+00:00
2 2022-11-09 14:40:00+00:00
3 2022-11-09 14:45:00+00:00
4 2022-11-09 14:50:00+00:00
...
2299 2022-12-21 20:35:00+00:00
2300 2022-12-21 20:40:00+00:00
2301 2022-12-21 20:45:00+00:00
2302 2022-12-21 20:50:00+00:00
2303 2022-12-21 20:55:00+00:00
Name: date, Length: 2304, dtype: object
Note the default integer index on the left. I believe I don't need to format it exactly the same, as long as the internal datatype is datetime64 and the index is a DatetimeIndex.
Thanks for any help.
So I tried this (and a whole lot of other ideas):
df['timestamp'] = pd.to_datetime(df.date)
new = pd.DataFrame(index=[df.timestamp])
which gives
MultiIndex([('2022-11-09 14:30:00+00:00',),
            ...
            ('2022-12-21 20:55:00+00:00',)],
           names=['timestamp'], length=2304)
as well as this:
df['timestamp'] = mpl_dates.datestr2num(df.date)
which gives:
MultiIndex([(19305.604166666668,),
            ( 19305.60763888889,),
            (19347.868055555555,),
            (19347.871527777777,)],
           names=['timestamp'], length=2304)
and neither works.
Am I on the right track, and what is the correct way to do this? How do I get rid of the MultiIndex? And how do I get it to be of type DatetimeIndex?
Responding to the question about the source of the data: it's from IBKR, via its API routines, and I am storing the data in an intermediary CSV file. It has the following format:
,date,open,high,low,close,volume,barCount,average
0,2022-11-09 14:30:00+00:00,174.44,174.44,173.8,174.05,994,64,174.408
1,2022-11-09 14:35:00+00:00,174.11,174.38,173.58,173.62,160,123,173.95
2,2022-11-09 14:40:00+00:00,173.59,173.6,173.14,173.56,98,73,173.363
3,2022-11-09 14:45:00+00:00,173.55,174.02,173.52,173.96,88,53,173.716
I was reading it in with the following:
bars = pd.read_csv(name, header=0, index_col=0, sep=",")
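A minimal sketch of the usual fix (my addition, not from the thread), assuming the CSV matches the sample above: have read_csv parse the date column, then promote it to the index so it becomes a proper DatetimeIndex rather than an object-dtype column.

import pandas as pd
import mplfinance as mpf

# keep the original read, but parse the 'date' column as datetimes
# and promote it to the index, replacing the default integer index
bars = pd.read_csv(name, header=0, index_col=0, sep=",", parse_dates=['date'])
bars = bars.set_index('date')
print(bars.index)  # now a DatetimeIndex (tz-aware, UTC)

# mplfinance looks for Open/High/Low/Close column names (capitalized,
# as far as I know), so rename the lowercase columns before plotting
bars = bars.rename(columns=str.capitalize)
mpf.plot(bars, type='candle')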

Pandas Interpolation: {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear

I am trying to interpolate time series data, df, which looks like:
         id      data        lat notes             analysis_date
0  17358709       NaN  26.125979  None 2019-09-20 12:00:00+00:00
1  17358709       NaN  26.125979  None 2019-09-20 12:00:00+00:00
2  17352742 -2.331365  26.125979  None 2019-09-20 12:00:00+00:00
3  17358709 -4.424366  26.125979  None 2019-09-20 12:00:00+00:00
I try:
df.groupby(['lat', 'lon']).apply(lambda group: group.interpolate(method='linear'))
and it throws:
ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear
I suspect the issue is with the fact that I have None values, and I do not want to interpolate those. What is the solution?
df.dtypes gives me:
id int64
data float64
lat float64
notes object
analysis_date datetime64[ns, psycopg2.tz.FixedOffsetTimezone...
dtype: object
DataFrame.interpolate has issues with timezone-aware datetime64[ns] columns, which leads to that rather cryptic error message. For example:
import pandas as pd
df = pd.DataFrame({'time': pd.to_datetime(['2010', '2011', 'foo', '2012', '2013'],
                                          errors='coerce')})
df['time'] = df.time.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
df.interpolate()
ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear
In this case interpolating that column is unnecessary, so only interpolate the column you need. We still want DataFrame.interpolate, so select with [[ ]] (Series.interpolate leads to some odd reshaping):
df['data'] = df.groupby(['lat', 'lon']).apply(lambda x: x[['data']].interpolate())
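As an alternative sketch (my addition, not from the original answer): groupby(...).transform returns a result aligned with the original index, which avoids alignment surprises when assigning back to the column.

# transform keeps the original row order and index, so the
# assignment lines up one-to-one with df['data']
df['data'] = df.groupby(['lat', 'lon'])['data'].transform(lambda s: s.interpolate())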
This error happens because one of the columns you are interpolating is of object data type. Interpolating works only for numerical data types such as integer or float.
If you need to interpolate an object or categorical column, first convert it to a numerical type by encoding it. The following should resolve the problem:
from sklearn.preprocessing import LabelEncoder

notes_encoder = LabelEncoder()
df['notes'] = notes_encoder.fit_transform(df['notes'])
After doing this, check the column's data type; it should be int. If it is categorical, change its type to int with the following code:
df['notes'] = df['notes'].astype('int32')

How to move the timestamp bounds for datetime in pandas (working with historical data)?

I'm working with historical data, and have some very old dates that are outside the timestamp bounds for pandas. I've consulted the Pandas Time series/date functionality documentation, which has some information on out of bounds spans, but from this information, it still wasn't clear to me what, if anything I could do to convert my data into a datetime type.
I've also seen a few threads on Stack Overflow about this, but they either just point out the problem (i.e. nanosecond resolution gives a max range of roughly 584 years) or suggest setting errors='coerce', which turns 80% of my data into NaTs.
Is it possible to turn dates lower than the default Pandas lower bound into dates? Here's a sample of my data:
import pandas as pd
df = pd.DataFrame({'id': ['836', '655', '508', '793', '970', '1075', '1119', '969', '1166', '893'],
                   'date': ['1671-11-25', '1669-11-22', '1666-05-15', '1673-01-18', '1675-05-07',
                            '1677-02-08', '1678-02-08', '1675-02-15', '1678-11-28', '1673-12-23']})
You can create day periods with a lambda function:
df['date'] = df['date'].apply(lambda x: pd.Period(x, freq='D'))
Or, as @Erfan mentioned in a comment (thank you):
df['date'] = df['date'].apply(pd.Period)
print(df)
     id        date
0   836  1671-11-25
1   655  1669-11-22
2   508  1666-05-15
3   793  1673-01-18
4   970  1675-05-07
5  1075  1677-02-08
6  1119  1678-02-08
7   969  1675-02-15
8  1166  1678-11-28
9   893  1673-12-23
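A small follow-on sketch (my addition, not from the original answer): building a period-dtype column via PeriodIndex keeps the vectorized .dt accessor available, which a plain object column of Period values does not have.

# period[D] dtype: day precision, and not subject to the
# nanosecond Timestamp bounds
df['date'] = pd.PeriodIndex(df['date'], freq='D')
df['year'] = df['date'].dt.year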

Pandas dataframe datetime timestamp from string

I am trying to convert a column in a pandas dataframe from a string to a timestamp.
Due to a slightly annoying constraint (I am limited by my employer's software & IT policy) I am running an older version of pandas (0.14.1). This version does include pd.Timestamp.
Essentially, I want to pass a dataframe column formatted as a string to pd.Timestamp to create a column of Timestamps. Here is an example dataframe:
'Date/Time String' 'timestamp'
0 2017-01-01 01:02:03 NaN
1 2017-01-02 04:05:06 NaN
2 2017-01-03 07:08:09 NaN
My DataFrame is very big, so iterating through it is really inefficient. But this is what I came up with:
for i in range(len(df['Date/Time String'])):
    df['timestamp'].iloc[i] = pd.Timestamp(df['Date/Time String'].iloc[i])
What would be the sensible way to make this operation much faster?
You can use pd.to_datetime, which is vectorized:
import pandas as pd
df['Date/Time Timestamp'] = pd.to_datetime(df['Date/Time String'])
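If all the strings share one layout, passing an explicit format usually speeds parsing up further (the format below matches the sample data; adjust it if yours differs):

# an explicit format avoids per-element format inference
df['Date/Time Timestamp'] = pd.to_datetime(df['Date/Time String'], format='%Y-%m-%d %H:%M:%S')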

SARIMAX with series without time as index, or how to make row sequential index as time index?

I have a pandas series ts of numbers that I want to forecast into the future (the next 600 points). The series is indexed only by sequence number, not by date or time.
Here is the example content of ts:
          0    1
0 -0.801552  1.0
1 -0.997606  2.0
2 -3.659062  3.0
3 -1.193043  4.0
4 -2.858001  5.0
When I ran statsmodels.tsa.statespace.SARIMAX with the series, I got the following exception:
ValueError('Given a pandas object and the index does not contain dates',)
But according to SARIMAX's documentation, for the first argument:
endog : array_like
The observed time-series process y
It seems that it only expects the argument to be array_like.
How can I make the series work with SARIMAX and the eventual ARIMA model without date and time as the index?
Could I fake the row index as minutes or seconds from some arbitrary start time? How?
You can try converting the series into a numpy array.
newts = ts.values
Another method that worked for me, similar to what you suggested, is to replace the index with timestamps:
ts.index = pd.date_range('1/1/2011', periods=72, freq='H')
See the documentation for pd.date_range; it will help you generate the exact timestamps your series requires (note that periods must match the length of your series).
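Putting it together, a minimal sketch (my addition, with made-up data; the frequency and model order are arbitrary assumptions):

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# hypothetical stand-in for ts: 200 observations with no time index
ts = pd.Series(np.sin(np.linspace(0, 20, 200)))

# fake a regular time index, one observation per minute from an
# arbitrary start date; the length matches the series exactly
ts.index = pd.date_range('2011-01-01', periods=len(ts), freq='T')

# order=(1, 0, 1) is a placeholder; pick it from your own analysis
model = SARIMAX(ts, order=(1, 0, 1))
res = model.fit(disp=False)
forecast = res.forecast(steps=600)  # the next 600 points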