'<' not supported between instances of 'datetime.date' and 'str' - pandas

I get a TypeError:
TypeError: '<' not supported between instances of 'datetime.date' and 'str'
While running the following piece of code:
import requests
import re
import json
import pandas as pd
def retrieve_quotes_historical(stock_code):
    quotes = []
    url = 'https://finance.yahoo.com/quote/%s/history?p=%s' % (stock_code, stock_code)
    r = requests.get(url)
    m = re.findall('"HistoricalPriceStore":{"prices":(.*?), "isPending"', r.text)
    if m:
        quotes = json.loads(m[0])
        quotes = quotes[::-1]
    return [item for item in quotes if 'type' not in item]
quotes = retrieve_quotes_historical('INTC')
df = pd.DataFrame(quotes)
s = pd.Series(pd.to_datetime(df.date, unit='s'))
df.date = s.dt.date
df = df.set_index('date')
This part runs smoothly, but when I try to run this line:
df['2017-07-07':'2017-07-10']
I get the TypeError.
How can I fix this?

The problem is that you are slicing with strings such as '2017-07-07' while your index holds datetime.date objects. The slice bounds must be of the same type as the index.
You can do this by defining your startdate and enddate as follows:
import pandas as pd
startdate = pd.to_datetime("2017-7-7").date()
enddate = pd.to_datetime("2017-7-10").date()
df.loc[startdate:enddate]
startdate & enddate are now of type datetime.date and your slice will work:
adjclose close high low open volume
date
2017-07-07 33.205006 33.880001 34.119999 33.700001 33.700001 18304500
2017-07-10 32.979588 33.650002 33.740002 33.230000 33.250000 29918400
It is also possible to create datetime.date type without pandas:
import datetime
startdate = datetime.datetime.strptime('2017-07-07', "%Y-%m-%d").date()
enddate = datetime.datetime.strptime('2017-07-10', "%Y-%m-%d").date()
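To try this without hitting the Yahoo endpoint, here is a minimal sketch on a small hand-built frame. The prices and the variable names (prices, window) are made up for illustration; only the slicing pattern matters.

```python
import datetime

import pandas as pd

# Hypothetical closing prices, indexed by plain datetime.date objects
# (this gives an object-dtype index, like the one in the question).
prices = pd.DataFrame(
    {"close": [34.0, 33.88, 33.65, 33.5]},
    index=[
        datetime.date(2017, 7, 6),
        datetime.date(2017, 7, 7),
        datetime.date(2017, 7, 10),
        datetime.date(2017, 7, 11),
    ],
)

# Build the slice bounds as datetime.date, the same type as the index labels.
start = datetime.date.fromisoformat("2017-07-07")
end = datetime.date.fromisoformat("2017-07-10")

window = prices.loc[start:end]  # label-based slice, both ends inclusive
print(window)
```

Both bounds are inclusive because .loc slices by label, not by position.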

In addition to Paul's answer, a few things to note:
pd.to_datetime(df['date'],unit='s') already returns a Series, so you do not need to wrap it.
Besides, when parsing is successful, the Series returned by pd.to_datetime has dtype datetime64[ns] (timezone-naïve) or datetime64[ns, tz] (timezone-aware). If parsing fails, it may still return a Series without raising an error, with dtype O for "object" (at least in pandas 1.2.4), meaning it fell back to Python's stdlib datetime.datetime.
Filtering with strings, as in df['2017-07-07':'2017-07-10'], only works when the dtype of the index is datetime64[...], not when it is O (object).
So with all of this, your example can be made to work by only changing the last lines:
df = pd.DataFrame(quotes)
s = pd.to_datetime(df['date'],unit='s') # no need to wrap in Series
assert str(s.dtype) == 'datetime64[ns]' # VERY IMPORTANT!!!!
df.index = s
print(df['2020-08-01':'2020-08-10']) # it now works!
It yields:
date open ... volume adjclose
date ...
2020-08-03 13:30:00 1596461400 48.270000 ... 31767100 47.050617
2020-08-04 13:30:00 1596547800 48.599998 ... 29045800 47.859154
2020-08-05 13:30:00 1596634200 49.720001 ... 29438600 47.654583
2020-08-06 13:30:00 1596720600 48.790001 ... 23795500 47.634968
2020-08-07 13:30:00 1596807000 48.529999 ... 36765200 47.105358
2020-08-10 13:30:00 1597066200 48.200001 ... 37442600 48.272457
Finally, note that if your datetime format somehow contains the time offset, there seems to be a mandatory utc=True argument to add to pd.to_datetime (in pandas 1.2.4); otherwise the returned dtype will be 'O' even if parsing is successful. I hope that this will improve in the future, as it is not intuitive at all.
See to_datetime documentation for details.
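For anyone who wants to check the dtype point without downloading anything, here is a small sketch along the same lines. The close values and the names raw, frame and window are made up; the epoch seconds are of the same kind as in the output above. It uses .loc for the string slice and a dtype check that does not depend on the exact datetime64 unit.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded quotes: epoch seconds plus a price.
raw = pd.DataFrame(
    {
        "date": [1596461400, 1596547800, 1597066200, 1597152600],
        "close": [48.3, 49.1, 49.3, 48.2],
    }
)

idx = pd.to_datetime(raw["date"], unit="s")  # already a Series, no wrapping needed

# Check the dtype before relying on string slicing.
assert pd.api.types.is_datetime64_any_dtype(idx)

frame = raw.set_index(idx)

# String slicing works on a DatetimeIndex; the end bound covers the whole day.
window = frame.loc["2020-08-01":"2020-08-10"]
print(window)
```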

Related

To prevent automatic type change in Pandas

I have an Excel (.xlsx) file with 4 columns:
pmid (int)
gene (string)
disease (string)
label (string)
I attempt to load this directly into python with pandas.read_excel
df = pd.read_excel(path, parse_dates=False)
[Screenshot: the data as shown in Excel]
[Screenshot: the same data in pandas, viewed in my IDE debugger]
As shown above, pandas tries to be smart and automatically converts some of the gene fields, such as 3.Oct and 4.Oct, to a datetime type. The issue is that 3.Oct and 4.Oct are abbreviations of gene names and mean something totally different, so I don't want pandas to do this. How can I prevent pandas from converting types automatically?
Update:
In fact, there is no conversion. The value appears as 2020-10-03 00:00:00 in pandas because that is the real value stored in the cell. Excel just shows this value in another format.
Update 2:
To keep the same format as Excel, you can use pd.to_datetime and a custom function to reformat the date.
# Sample
>>> df
gene
0 PDGFRA
1 2021-10-03 00:00:00 # Want: 3.Oct
2 2021-10-04 00:00:00 # Want: 4.Oct
>>> import calendar
>>> import numpy as np
>>> df['gene'] = (pd.to_datetime(df['gene'], errors='coerce')
...               .apply(lambda dt: f"{dt.day}.{calendar.month_abbr[dt.month]}"
...                      if dt is not pd.NaT else np.nan)
...               .fillna(df['gene']))
>>> df
gene
0 PDGFRA
1 3.Oct
2 4.Oct
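The same idea can also be written as a small helper applied element by element. This is only a sketch: restore_gene_symbol and MONTH_ABBR are names invented here, and the fixed English month table is an assumption that avoids depending on the current locale.

```python
import datetime

import pandas as pd

# Fixed English abbreviations, so the result does not depend on the locale.
MONTH_ABBR = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def restore_gene_symbol(value):
    """Turn a date that Excel stored for a gene name back into e.g. '3.Oct'."""
    if isinstance(value, datetime.datetime):  # pd.Timestamp is a subclass
        return f"{value.day}.{MONTH_ABBR[value.month - 1]}"
    return value  # ordinary gene names pass through unchanged


genes = pd.Series(
    ["PDGFRA", datetime.datetime(2021, 10, 3), datetime.datetime(2021, 10, 4)],
    dtype=object,
)
restored = genes.map(restore_gene_symbol)
print(restored.tolist())
```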
Old answer
Force dtype=str to prevent pandas from trying to convert your data:
df = pd.read_excel(path, dtype=str)
Or use converters={'colX': str, ...} to set the type for each column individually.
pd.read_excel has a dtype argument you can use to specify data types explicitly.

Pandas: Series and DataFrames handling of datetime objects

Pandas behaves in an unusual way when interacting with datetime information when the data type is a Series vs when it is not. Specifically, either .dt is required (if it's a Series) or .dt will throw an error (if it's not a Series.) I've spent the better part of an hour tracking the behavior down.
import pandas as pd
data = {'dates':['2019-03-01','2019-03-02'],'event':[0,1]}
df = pd.DataFrame(data)
df['dates'] = pd.to_datetime(df['dates'])
Pandas Series:
df['dates'][0:1].dt.year
>>>
0 2019
Name: dates, dtype: int64
df['dates'][0:1].year
>>>
AttributeError: 'Series' object has no attribute 'year'
Not Pandas Series:
df['dates'][0].year
>>>
2019
df['dates'][0].dt.year
>>>
AttributeError: 'Timestamp' object has no attribute 'dt'
Does anyone know why pandas behaves this way? Is this a "feature, not a bug", i.e. is it actually useful in some setting?
This behaviour is consistent with python. A collection of datetimes is fundamentally different than a single datetime.
We can see this simply with list vs datetime object:
from datetime import datetime
a = datetime.now()
print(a.year)
# 2021
list_of_datetimes = [datetime.now(), datetime.now()]
print(list_of_datetimes.year)
# AttributeError: 'list' object has no attribute 'year'
Naturally a list does not have a year attribute, because in python we cannot guarantee the list contains only datetimes.
We would have to apply some function to each element in the list to access the year, for example:
from datetime import datetime
list_of_datetimes = [datetime.now(), datetime.now()]
print(*map(lambda d: d.year, list_of_datetimes))
# 2021 2021
This concept of "applying an operation over a collection of datetimes" is fundamentally what the dt accessor does. By extension, the accessor is unnecessary (and unavailable) on a single element, because there you are working with a single datetime and can read its attributes directly.
In pandas we can only use the dt accessor on a Series with a datetime-like dtype.
Several guarantees have to hold before year can be applied to all elements in the Series; the datetime64 dtype provides them:
import pandas as pd
data = {'dates': ['2019-03-01', '2019-03-02'], 'event': [0, 1]}
df = pd.DataFrame(data)
df['dates'] = pd.to_datetime(df['dates'])
print(df['dates'].dt.year)
0 2019
1 2019
Name: dates, dtype: int64
Again, however, since a column of object type could contain both datetimes and non-datetimes we may need to access the individual elements. Like:
import pandas as pd
data = {'dates': ['2019-03-01', 87], 'event': [0, 1]}
df = pd.DataFrame(data)
print(df)
# dates event
# 0 2019-03-01 0
# 1 87 1
# Convert only 1 value to datetime
df.loc[0, 'dates'] = pd.to_datetime(df.loc[0, 'dates'])
print(df.loc[0, 'dates'].year)
# 2019
print(df.loc[1, 'dates'].year)
# AttributeError: 'int' object has no attribute 'year'
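To make the equivalence concrete, here is a short sketch showing that .dt.year on the Series gives the same values as reading .year from each element yourself. The variable names are made up for illustration.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2019-03-01", "2019-03-02"]))

# Vectorised: the dt accessor applies .year across the whole Series.
vectorized = dates.dt.year

# Element-wise: each element is a single Timestamp, which has .year directly.
elementwise = dates.map(lambda ts: ts.year)

single = dates.iloc[0]

print(vectorized.tolist())
print(elementwise.tolist())
print(single.year)
print(hasattr(single, "dt"))  # a single Timestamp has no dt accessor
```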

Find local datetime at a specific zipcode given UTC time

I have a dataframe that includes a US Zip Code and a datetimeoffset field that is UTC time. I want to add a column to the dataframe that shows the local time based on the Zip. It looks like pyzipcode might have what I need but I can't figure out how to code it. Here is an example of the dataframe I have:
import pandas as pd
from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
data = [{'Zip':78745, 'DateTimeOffsetUTC':'7/8/2020 5:17:48 PM +00:00'}]
df = pd.DataFrame(data)
df["LocalDatetime"] = ???
Thanks in advance! I think this might be a really simple thing but I'm really new to python.
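Assuming you have no invalid data, it should suffice to map a lambda over the zip codes to extract each timezone offset. Note that this adds a fixed offset: the result below uses -6 hours, although in July zip code 78745 (Austin) observes daylight saving time (UTC-5), so DST is not accounted for.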
dt = pd.to_datetime(df['DateTimeOffsetUTC'])
offset = df['Zip'].map(lambda z: pd.Timedelta(hours=zcdb[z].timezone))
df['LocalDatetime'] = (dt + offset).dt.tz_localize(None)
df
Zip DateTimeOffsetUTC LocalDatetime
0 78745 7/8/2020 5:17:48 PM +00:00 2020-07-08 11:17:48
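If daylight saving time matters, one possible alternative is to convert with a named time zone instead of a fixed offset. This is only a sketch: ZIP_TO_TZ is a hypothetical lookup table that you would have to fill in yourself (it is not something pyzipcode provides here), and it assumes the IANA name America/Chicago for this zip code.

```python
import pandas as pd

# Hypothetical zip-to-timezone table; extend it with your own zip codes.
ZIP_TO_TZ = {78745: "America/Chicago"}

df = pd.DataFrame([{"Zip": 78745, "DateTimeOffsetUTC": "7/8/2020 5:17:48 PM +00:00"}])

# Parse as timezone-aware UTC timestamps.
utc = pd.to_datetime(df["DateTimeOffsetUTC"], utc=True)

# Convert each timestamp to its zip code's zone, then drop the tz info
# so the column holds plain local wall-clock times.
df["LocalDatetime"] = [
    ts.tz_convert(ZIP_TO_TZ[int(z)]).tz_localize(None)
    for ts, z in zip(utc, df["Zip"])
]
print(df)
```

With a named zone, the July timestamp comes out one hour later than with the fixed -6 offset, because the conversion applies the DST rules for that date.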

Converting Date Time index in Pandas [duplicate]

My dataframe has a DOB column (example format 1/1/2016) which by default gets converted to Pandas dtype 'object'.
Converting this to date format with df['DOB'] = pd.to_datetime(df['DOB']), the date gets converted to: 2016-01-26 and its dtype is: datetime64[ns].
Now I want to convert this date format to 01/26/2016 or any other general date format. How do I do it?
(Whatever the method I try, it always shows the date in 2016-01-26 format.)
You can use dt.strftime if you need to convert datetime to other formats (but note that then dtype of column will be object (string)):
import pandas as pd
df = pd.DataFrame({'DOB': {0: '26/1/2016', 1: '26/1/2016'}})
print (df)
DOB
0 26/1/2016
1 26/1/2016
df['DOB'] = pd.to_datetime(df.DOB, dayfirst=True)  # the input is day/month/year
print (df)
DOB
0 2016-01-26
1 2016-01-26
df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')
print (df)
DOB DOB1
0 2016-01-26 01/26/2016
1 2016-01-26 01/26/2016
Keeping the datetime type (note that this does not change how dates are displayed; it truncates every date to the first day of its month):
df['date'] = pd.to_datetime(df["date"].dt.strftime('%Y-%m'))
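A small sketch of what the two approaches actually produce, using made-up dates and variable names: strftime returns text in the requested format, while the round trip through '%Y-%m' returns datetimes again, with the day reset to 1.

```python
import pandas as pd

dob = pd.to_datetime(pd.Series(["2016-01-26", "2016-02-14"]))

# strftime gives formatted text, so the result is no longer a datetime column.
as_text = dob.dt.strftime("%m/%d/%Y")

# Formatting to year-month and parsing back keeps a datetime dtype,
# but the day of month is lost and becomes 1.
month_start = pd.to_datetime(dob.dt.strftime("%Y-%m"), format="%Y-%m")

print(as_text.tolist())
print(month_start.tolist())
```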
There is a difference between
the content of a dataframe cell (a binary value) and
its presentation (displaying it) for us, humans.
So the question is: how do I get the presentation I want without changing the data / data types themselves?
Here is the answer:
If you use the Jupyter notebook for displaying your dataframe, or
if you want to reach a presentation in the form of an HTML file (even with many prepared superfluous id and class attributes for further CSS styling — you may or you may not use them),
use styling. Styling doesn't change the data / data types of the columns of your dataframe.
Now I will show you how to do it in the Jupyter notebook; for a presentation in the form of an HTML file, see the note near the end of this answer.
I will suppose that your column DOB already has the datetime64 type (you have shown that you know how to reach it). I prepared a simple dataframe (with only one column) to show you some basic styling:
Not styled:
df
DOB
0 2019-07-03
1 2019-08-03
2 2019-09-03
3 2019-10-03
Styling it as mm/dd/yyyy:
df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
DOB
0 07/03/2019
1 08/03/2019
2 09/03/2019
3 10/03/2019
Styling it as dd-mm-yyyy:
df.style.format({"DOB": lambda t: t.strftime("%d-%m-%Y")})
DOB
0 03-07-2019
1 03-08-2019
2 03-09-2019
3 03-10-2019
Be careful!
The returning object is NOT a dataframe — it is an object of the class Styler, so don't assign it back to df:
Don't do this:
df = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")}) # Don't do this!
(Every dataframe has its Styler object accessible by its .style property, and we changed this df.style object, not the dataframe itself.)
Questions and Answers:
Q: Why does a Styler object (or an expression returning one), used as the last command in a Jupyter notebook cell, display your (styled) table and not the Styler object itself?
A: Because every Styler object has a callback method ._repr_html_() which returns the HTML code for rendering your dataframe (as a nice HTML table).
Jupyter Notebook calls this method automatically to render objects which have it.
Note:
You don't need the Jupyter notebook for styling (i.e., for nicely outputting a dataframe without changing its data / data types).
A Styler object also has a to_html() method (pandas 1.3 and later; the older render() method was deprecated in pandas 1.4 and removed in 2.0) if you want to obtain a string with the HTML code (e.g., for publishing your formatted dataframe on the Web, or simply presenting your table in HTML format):
df_styler = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
HTML_string = df_styler.to_html()
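Compared to the first answer, I recommend using dt.strftime() first and then pd.to_datetime(). This way the column ends up with the datetime data type again (although it will once more be displayed as 2016-01-26).
For example,
import pandas as pd
df = pd.DataFrame({'DOB': {0: '26/1/2016', 1: '26/1/2016'}})
df['DOB'] = pd.to_datetime(df['DOB'], dayfirst=True)  # parse first; .dt needs a datetime column
print(df.dtypes)
df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')  # string column
print(df.dtypes)
df['DOB1'] = pd.to_datetime(df['DOB1'], format='%m/%d/%Y')  # back to a datetime64 dtype
print(df.dtypes)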
The below code worked for me instead of the previous one:
df['DOB']=pd.to_datetime(df['DOB'].astype(str), format='%m/%d/%Y')
You can try this. dayfirst=True tells pandas to read the input as day first (DD/MM/YYYY); the result is still a datetime64 column, displayed as YYYY-MM-DD:
df['DOB'] = pd.to_datetime(df['DOB'], dayfirst = True)
The code below converts to the 'datetime' type. Note that the format string only affects the intermediate strings; the resulting datetime column is still displayed as YYYY-MM-DD.
df['DOB'] = pd.to_datetime(df['DOB'].dt.strftime('%m/%d/%Y'))
Below is the code that worked for me. You need to be very careful with the format string: it must describe your existing format. The strftime() and strptime() format codes (see "strftime() and strptime() Behavior" in the Python documentation) are very useful for this:
data['date_new_format'] = pd.to_datetime(data['date_to_be_changed'] , format='%b-%y')
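To illustrate why the format string has to match the existing text, here is a minimal sketch with made-up sample values. It parses day-first strings with a matching format, shows that a mismatched format raises an error, and parses the '%b-%y' style used above (this assumes English month abbreviations in the current locale).

```python
import pandas as pd

day_first = pd.Series(["26/1/2016", "3/2/2016"])

# The format describes how the text is written now: day/month/year.
parsed = pd.to_datetime(day_first, format="%d/%m/%Y")

# A format that does not match the text fails instead of guessing.
try:
    pd.to_datetime(day_first, format="%m/%d/%Y")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Abbreviated month name and two-digit year, e.g. "Jan-16".
month_year = pd.to_datetime(pd.Series(["Jan-16", "Feb-17"]), format="%b-%y")

print(parsed.tolist())
print(mismatch_detected)
print(month_year.tolist())
```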
