Stumped. How do I convert my datetime format to an acceptable format for pandas and mplfinance? - dataframe

I am trying to use various charting packages for OHLC bar charting. I've had some success, but I keep getting stuck on "TypeError: Expect data.index as DatetimeIndex". The samples that I copy work perfectly fine, like the one below:
import yfinance as yf
import mplfinance as mpf
symbol = 'AAPL'
df = yf.download(symbol, period='6mo')
mpf.plot(df, type='candle')
which has the following type of index for the df:
DatetimeIndex(['2022-06-30', '2022-07-01', '2022-07-05', '2022-07-06',
'2022-12-29', '2022-12-30'],
dtype='datetime64[ns]', name='Date', length=128, freq=None)
So I am trying to get my dataframe index to look the same, with a DatetimeIndex format. My index looks like this:
0 2022-11-09 14:30:00+00:00
1 2022-11-09 14:35:00+00:00
2 2022-11-09 14:40:00+00:00
3 2022-11-09 14:45:00+00:00
4 2022-11-09 14:50:00+00:00
...
2299 2022-12-21 20:35:00+00:00
2300 2022-12-21 20:40:00+00:00
2301 2022-12-21 20:45:00+00:00
2302 2022-12-21 20:50:00+00:00
2303 2022-12-21 20:55:00+00:00
Name: date, Length: 2304, dtype: object
Note the default integer index on the left. I believe that I don't need to format it exactly the same, as long as the internal datatype is datetime64 in a DatetimeIndex.
Thanks for any help.
So I tried this (and a whole lot of other ideas):
df['timestamp'] = pd.to_datetime(df.date)
new = pd.DataFrame(index=[df.timestamp])
which gives
MultiIndex([('2022-11-09 14:30:00+00:00',),
...
('2022-12-21 20:55:00+00:00',)],
names=['timestamp'], length=2304)
as well as this:
df['timestamp'] = mpl_dates.datestr2num(df.date)
which gives:
MultiIndex([(19305.604166666668,),
( 19305.60763888889,),
(19347.868055555555,),
(19347.871527777777,)],
names=['timestamp'], length=2304)
and neither works.
Am I on the right track, and what is the correct way to do this? How do I get rid of the MultiIndex? And how do I get it to be of type DatetimeIndex?
Responding to the question on the source of the data: it's from IBKR, using API routines, and I am storing the data in an intermediary CSV file. It has the following format:
,date,open,high,low,close,volume,barCount,average
0,2022-11-09 14:30:00+00:00,174.44,174.44,173.8,174.05,994,64,174.408
1,2022-11-09 14:35:00+00:00,174.11,174.38,173.58,173.62,160,123,173.95
2,2022-11-09 14:40:00+00:00,173.59,173.6,173.14,173.56,98,73,173.363
3,2022-11-09 14:45:00+00:00,173.55,174.02,173.52,173.96,88,53,173.716
I was reading in with the following:
bars = pd.read_csv(name, header=0, index_col=0, sep=",")
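For the record, a minimal sketch of one way to end up with a proper DatetimeIndex from that CSV (assuming the layout shown above; the rename is only there because mplfinance expects capitalised Open/High/Low/Close/Volume column names, and the read_csv options may need adjusting for your pandas version):

import pandas as pd
import mplfinance as mpf

# Parse the 'date' column while reading and make it the index; with a
# uniform '+00:00' offset this should come back as a tz-aware DatetimeIndex.
bars = pd.read_csv(name, index_col='date', parse_dates=True)

# Equivalent after-the-fact conversion (note: pass the Series itself to
# set_index, not a list around it -- the list is what produced the MultiIndex):
# bars['date'] = pd.to_datetime(bars['date'], utc=True)
# bars = bars.set_index('date')

# If the timezone-aware index causes trouble, it can be dropped:
# bars.index = bars.index.tz_localize(None)

bars = bars.rename(columns=str.capitalize)   # open -> Open, high -> High, ...
mpf.plot(bars, type='candle')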

Related

To prevent automatic type change in Pandas

I have an Excel (.xlsx) file with 4 columns:
pmid (int)
gene (string)
disease (string)
label (string)
I attempt to load this directly into Python with pandas.read_excel:
df = pd.read_excel(path, parse_dates=False)
[screenshot: capture from Excel]
[screenshot: capture from pandas, viewed in my IDE debugger]
As shown above, pandas tries to be smart, automatically converting some of the gene fields such as 3.Oct, 4.Oct to a datetime type. The issue is that 3.Oct or 4.Oct is an abbreviation of a gene name with a totally different meaning, so I don't want pandas to do this. How can I prevent pandas from converting types automatically?
Update:
In fact, there is no conversion. The value appears as 2020-10-03 00:00:00 in Pandas because it is the real value stored in the cell; Excel simply shows this value in another format.
Update 2:
To keep the same format as Excel, you can use pd.to_datetime and a custom function to reformat the date.
# Sample
>>> df
gene
0 PDGFRA
1 2021-10-03 00:00:00 # Want: 3.Oct
2 2021-10-04 00:00:00 # Want: 4.Oct
>>> import calendar
>>> import numpy as np
>>> df['gene'] = (pd.to_datetime(df['gene'], errors='coerce')
...               .apply(lambda dt: f"{dt.day}.{calendar.month_abbr[dt.month]}"
...                      if dt is not pd.NaT else np.NaN)
...               .fillna(df['gene']))
>>> df
gene
0 PDGFRA
1 3.Oct
2 4.Oct
Old answer
Force dtype=str to prevent pandas from trying to transform your dataframe:
df = pd.read_excel(path, dtype=str)
Or use converters={'colX': str, ...} to map the dtype for each column.
pd.read_excel has a dtype argument you can use to specify data types explicitly.
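For instance, a small sketch of the converters route, scoped to the one problematic column (column name taken from the question; note the caveat from the update above, that a cell which genuinely stores a date will come back as the string of that date, not as Excel's '3.Oct' display format):

import pandas as pd

# Only the 'gene' column is forced to str; the other columns keep
# their inferred dtypes.
df = pd.read_excel(path, converters={'gene': str})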

Seasonal Decomposition plots won't show despite pandas recognizing DateTime Index

I loaded the data as follows
og_data=pd.read_excel(r'C:\Users\hp\Downloads\dataframe.xlsx',index_col='DateTime')
And it looks like this:
DateTime A
2019-02-04 10:37:54 0
2019-02-04 10:47:54 1
2019-02-04 10:57:54 2
2019-02-04 11:07:54 3
2019-02-04 11:17:54 4
The problem is, I'm trying to set the data to a frequency, but there are NaN values that I have to drop, and even if I don't, the spacing seems to be irregular. I've got pandas to recognize the DateTime column as the index:
og_data.index
DatetimeIndex([...],
              dtype='datetime64[ns]', name='DateTime', length=15536, freq=None)
but when I try doing this:
og_data.index.freq = '10T'
That should mean 10min, right?
But I get the following error instead:
ValueError: Inferred frequency None from passed values does not conform to passed frequency 10T
Even if I set the frequency to days:
og_data.index.freq = 'D'
I get a similar error.
The goal is to get a seasonal decomposition plots because I want to forecast the data. But I get the following error when I try to do so:
result=seasonal_decompose(og_data['A'],model='add')
result.plot();
ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None
Which makes sense: I can't set the datetime index to a specified frequency. I need it to be every 10 minutes; please advise.
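A hedged sketch of one common workaround (assuming the data really is meant to sit on a 10-minute grid and has a daily cycle; the gap-filling strategy is a judgement call):

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

og_data = og_data.sort_index()

# Put the series on a regular 10-minute grid; resample sets freq='10T'
# on the resulting index, which is what seasonal_decompose checks for.
regular = og_data['A'].resample('10T').mean()

# Resampling creates NaNs wherever a 10-minute slot had no observation,
# so fill them somehow before decomposing (interpolation is one option).
regular = regular.interpolate()

# Either rely on the index freq set by resample, or pass period explicitly,
# e.g. 144 ten-minute bars per day (24 * 60 / 10) for a daily seasonality.
result = seasonal_decompose(regular, model='add', period=144)
result.plot()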

Pandas Interpolation: {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear

I am trying to interpolate time series data, df, which looks like:
id data lat notes analysis_date
0 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
1 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
2 17352742 -2.331365 26.125979 None 2019-09-20 12:00:00+00:00
3 17358709 -4.424366 26.125979 None 2019-09-20 12:00:00+00:00
I try: df.groupby(['lat', 'lon']).apply(lambda group: group.interpolate(method='linear')), and it throws {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear
I suspect the issue is with the fact that I have None values, and I do not want to interpolate those. What is the solution?
df.dtypes gives me:
id int64
data float64
lat float64
notes object
analysis_date datetime64[ns, psycopg2.tz.FixedOffsetTimezone...
dtype: object
DataFrame.interpolate has issues with timezone-aware datetime64[ns, tz] columns, which leads to that rather cryptic error message. E.g.
import pandas as pd
df = pd.DataFrame({'time': pd.to_datetime(['2010', '2011', 'foo', '2012', '2013'],
                                          errors='coerce')})
df['time'] = df.time.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
df.interpolate()
ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear
In this case interpolating that column is unnecessary, so only interpolate the column you need. We still want DataFrame.interpolate, so select with [[ ]] (Series.interpolate leads to some odd reshaping):
df['data'] = df.groupby(['lat', 'lon']).apply(lambda x: x[['data']].interpolate())
This error happens because one of the columns you are interpolating is of object data type. Interpolating works only for numerical data types such as integer or float.
If you need to interpolate an object or categorical column, first convert it to a numerical data type by encoding it. The following piece of code will resolve your problem:
from sklearn.preprocessing import LabelEncoder

# Encode the text values as integers so that interpolation can operate on them
notes_encoder = LabelEncoder()
df['notes'] = notes_encoder.fit_transform(df['notes'])
After doing this, check the column's data type. It must be int. If it is categorical, then change its type to int using the following code:
df['notes']=df['notes'].astype('int32')

'<' not supported between instances of 'datetime.date' and 'str'

I get a TypeError:
TypeError: '<' not supported between instances of 'datetime.date' and 'str'
While running the following piece of code:
import requests
import re
import json
import pandas as pd
def retrieve_quotes_historical(stock_code):
    quotes = []
    url = 'https://finance.yahoo.com/quote/%s/history?p=%s' % (stock_code, stock_code)
    r = requests.get(url)
    m = re.findall('"HistoricalPriceStore":{"prices":(.*?), "isPending"', r.text)
    if m:
        quotes = json.loads(m[0])
        quotes = quotes[::-1]
    return [item for item in quotes if 'type' not in item]
quotes = retrieve_quotes_historical('INTC')
df = pd.DataFrame(quotes)
s = pd.Series(pd.to_datetime(df.date, unit='s'))
df.date = s.dt.date
df = df.set_index('date')
This piece runs all smooth, but when I try to run this piece of code:
df['2017-07-07':'2017-07-10']
I get the TypeError.
How can I fix this?
The thing is that you want to slice using strings like '2017-07-07' while your index is of type datetime.date. Your slice bounds should be of this type too.
You can do this by defining your startdate and enddate as follows:
import pandas as pd
startdate = pd.to_datetime("2017-7-7").date()
enddate = pd.to_datetime("2017-7-10").date()
df.loc[startdate:enddate]
startdate & enddate are now of type datetime.date and your slice will work:
adjclose close high low open volume
date
2017-07-07 33.205006 33.880001 34.119999 33.700001 33.700001 18304500
2017-07-10 32.979588 33.650002 33.740002 33.230000 33.250000 29918400
It is also possible to create datetime.date type without pandas:
import datetime
startdate = datetime.datetime.strptime('2017-07-07', "%Y-%m-%d").date()
enddate = datetime.datetime.strptime('2017-07-10', "%Y-%m-%d").date()
In addition to Paul's answer, a few things to note:
pd.to_datetime(df['date'],unit='s') already returns a Series so you do not need to wrap it.
Besides, when parsing is successful, the Series returned by pd.to_datetime has dtype datetime64[ns] (timezone-naïve) or datetime64[ns, tz] (timezone-aware). If parsing fails, it may still return a Series without error, of dtype O for "object" (at least in pandas 1.2.4), denoting a fallback to Python's stdlib datetime.datetime.
Filtering using strings as in df['2017-07-07':'2017-07-10'] only works when the dtype of the index is datetime64[...], not when it is O (object).
So with all of this, your example can be made to work by only changing the last lines:
df = pd.DataFrame(quotes)
s = pd.to_datetime(df['date'],unit='s') # no need to wrap in Series
assert str(s.dtype) == 'datetime64[ns]' # VERY IMPORTANT!!!!
df.index = s
print(df['2020-08-01':'2020-08-10']) # it now works!
It yields:
date open ... volume adjclose
date ...
2020-08-03 13:30:00 1596461400 48.270000 ... 31767100 47.050617
2020-08-04 13:30:00 1596547800 48.599998 ... 29045800 47.859154
2020-08-05 13:30:00 1596634200 49.720001 ... 29438600 47.654583
2020-08-06 13:30:00 1596720600 48.790001 ... 23795500 47.634968
2020-08-07 13:30:00 1596807000 48.529999 ... 36765200 47.105358
2020-08-10 13:30:00 1597066200 48.200001 ... 37442600 48.272457
Finally, also note that if your datetime format contains a time offset, there seems to be a mandatory utc=True argument to add (in pandas 1.2.4) to pd.to_datetime, otherwise the returned dtype will be 'O' even if parsing is successful. I hope that this will improve in the future, as it is not intuitive at all.
See to_datetime documentation for details.
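A small illustration of that last point (hypothetical values; the exact behaviour without utc=True varies across pandas versions, especially with mixed offsets):

import pandas as pd

s = pd.Series(['2020-08-03 13:30:00+00:00',
               '2020-08-04 13:30:00+01:00'])

# Without utc=True, strings carrying (mixed) offsets have historically come
# back as dtype 'O'; newer pandas versions may warn or raise instead.
# pd.to_datetime(s)

# With utc=True the result is a tz-aware datetime64 Series, and an index
# built from it supports string slicing like df['2020-08-01':'2020-08-10'].
print(pd.to_datetime(s, utc=True).dtype)   # datetime64[ns, UTC]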

Pandas not detecting the datatype of a Series properly

I'm running into something a bit frustrating with pandas Series. I have a DataFrame with several columns, with numeric and non-numeric data. For some reason, however, pandas thinks some of the numeric columns are non-numeric, and ignores them when I try to run aggregating functions like .describe(). This is a problem, since pandas raises errors when I try to run analyses on these columns.
I've copied some commands from the terminal as an example. When I slice the 'ND_Offset' column (the problematic column in question), pandas tags it with the dtype of object. Yet, when I call .describe(), pandas tags it with the dtype float64 (which is what it should be). The 'Dwell' column, on the other hand, works exactly as it should, with pandas giving float64 both times.
Does anyone know why I'm getting this behavior?
In [83]: subject.phrases['ND_Offset'][:3]
Out[83]:
SubmitTime
2014-06-02 22:44:44 0.3607049
2014-06-02 22:44:44 0.2145484
2014-06-02 22:44:44 0.4031347
Name: ND_Offset, dtype: object
In [84]: subject.phrases['ND_Offset'].describe()
Out[84]:
count 1255.000000
unique 432.000000
top 0.242308
freq 21.000000
dtype: float64
In [85]: subject.phrases['Dwell'][:3]
Out[85]:
SubmitTime
2014-06-02 22:44:44 111
2014-06-02 22:44:44 81
2014-06-02 22:44:44 101
Name: Dwell, dtype: float64
In [86]: subject.phrases['Dwell'].describe()
Out[86]:
count 1255.000000
mean 99.013546
std 30.109327
min 21.000000
25% 81.000000
50% 94.000000
75% 111.000000
max 291.000000
dtype: float64
And when I use the .groupby function to group the data by another attribute (when these Series are a part of a DataFrame), I get the DataError: No numeric types to aggregate error when I try to call .agg(np.mean) on the group. When I try to call .agg(np.sum) on the same data, on the other hand, things work fine.
It's a bit bizarre -- can anyone explain what's going on?
Thank you!
It might be because the ND_Offset column (what I call A below) contains a non-numeric value such as an empty string. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0.36, ''], 'B': [111, 81]})
print(df['A'].describe())
# count 2.00
# unique 2.00
# top 0.36
# freq 1.00
# dtype: float64
try:
    print(df.groupby(['B']).agg(np.mean))
except Exception as err:
    print(err)
# No numeric types to aggregate
print(df.groupby(['B']).agg(np.sum))
# A
# B
# 81
# 111 0.36
Aggregation using np.sum works because
In [103]: np.sum(pd.Series(['']))
Out[103]: ''
whereas np.mean(pd.Series([''])) raises
TypeError: Could not convert to numeric
To debug the problem, you could try to find the non-numeric value(s) using:
for val in df['A']:
    if not isinstance(val, float):
        print('Error: val = {!r}'.format(val))
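Once the offending values have been located, one possible cleanup (a sketch, not the only option) is pd.to_numeric with errors='coerce', which turns anything non-numeric into NaN and leaves the column as float64, so that describe() and mean() treat it as numeric:

import pandas as pd

# Coerce to numeric: empty strings and other junk become NaN.
df['A'] = pd.to_numeric(df['A'], errors='coerce')
print(df['A'].dtype)   # float64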