Problems getting two columns into datetime.datetime format - pandas

I have code at the moment written to change two columns of my dataframe from strings into datetime.datetime objects similar to the following:
from datetime import datetime as dt
import pandas as pd

def converter(date):
    date = dt.strptime(date, '%m/%d/%Y %H:%M:%S')
    return date

df = pd.DataFrame({'A': ['12/31/9999 0:00:00', '1/1/2018 0:00:00'],
                   'B': ['4/1/2015 0:00:00', '11/1/2014 0:00:00']})
df['A'] = df['A'].apply(converter)
df['B'] = df['B'].apply(converter)
When I run this code and print the dataframe, it comes out like this:
                     A          B
0  9999-12-31 00:00:00 2015-04-01
1  2018-01-01 00:00:00 2014-11-01
When I checked the data types of each column, they read
A            object
B    datetime64[ns]
But when I check the format of the actual cells of the first row, they read
<class 'datetime.datetime'>
<class 'pandas._libs.tslib.Timestamp'>
After experimenting, I think I've run into an out-of-bounds error because of the date '12/31/9999 0:00:00' in column 'A': it exceeds pandas' maximum Timestamp, which causes the column to be kept as datetime.datetime objects. My question is: how can I also convert column 'B' of my dataframe to datetime.datetime objects, so that I can run a query on the columns similar to
df.query('A > B')
without getting an error or the wrong output.
Thanks in advance

Since '9999' is just some dummy year, you can simplify your life by choosing a dummy year which is in bounds (or one that makes more sense given your actual data):
import pandas as pd
df.replace('9999', '2060', regex=True).apply(pd.to_datetime)
Output:
           A          B
0 2060-12-31 2015-04-01
1 2018-01-01 2014-11-01

A    datetime64[ns]
B    datetime64[ns]
dtype: object
As @coldspeed points out, it's perhaps better to remove those bad dates:
df.apply(pd.to_datetime, errors='coerce')
# A B
#0 NaT 2015-04-01
#1 2018-01-01 2014-11-01
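The out-of-bounds behaviour can be checked directly: pandas Timestamps are 64-bit nanosecond counts, so the representable range ends in the year 2262, well before 9999. A small sketch:

```python
import pandas as pd

# The representable Timestamp range is roughly 1677-09-21 to 2262-04-11
print(pd.Timestamp.min, pd.Timestamp.max)

# Parsing a year-9999 date therefore fails; errors='coerce' turns it into NaT
out_of_bounds = pd.to_datetime('12/31/9999 0:00:00',
                               format='%m/%d/%Y %H:%M:%S',
                               errors='coerce')
print(out_of_bounds)  # NaT
```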

Related

Inconsistent output for pandas groupby-resample with missing values in first time bin

I am seeing inconsistent output from pandas groupby-resample.
Take this dataframe, in which category A has samples on the first and second day and category B has a sample only on the second day:
df1 = pd.DataFrame(index=pd.DatetimeIndex(
    ['2022-1-1 1:00', '2022-1-2 1:00', '2022-1-2 1:00']),
    data={'category': ['A', 'A', 'B']})
# Output:
# category
#2022-01-01 01:00:00 A
#2022-01-02 01:00:00 A
#2022-01-02 01:00:00 B
When I groupby-resample I get a Series with multiindex on category and time:
res1 = df1.groupby('category').resample('1D').size()
#Output:
#category
#A 2022-01-01 1
# 2022-01-02 1
#B 2022-01-02 1
#dtype: int64
But if I add one more data point so that B has a sample on day 1, the return value is a dataframe with single-index in category and columns corresponding to the time bins:
df2 = pd.DataFrame(index=pd.DatetimeIndex(
    ['2022-1-1 1:00', '2022-1-2 1:00', '2022-1-2 1:00', '2022-1-1 1:00']),
    data={'category': ['A', 'A', 'B', 'B']})
res2 = df2.groupby('category').resample('1D').size()
# Output:
# 2022-01-01 2022-01-02
# category
# A 1 1
# B 1 1
Is this expected behavior? I reproduced this behavior in pandas 1.4.2 and was unable to find a bug report.
I submitted bug report 46826 to pandas.
The result should be a Series with a MultiIndex in both cases. There was a bug which caused df.groupby.resample.size to return a wide DF for cases in which all groups had the same index. This has been fixed on the master branch. Thank you for opening the issue.
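Until a fixed release is available, one way to get a MultiIndex Series in both cases is to group on both keys explicitly with pd.Grouper instead of chaining .resample() (a workaround sketch, not part of the original answer):

```python
import pandas as pd

df2 = pd.DataFrame(index=pd.DatetimeIndex(
    ['2022-1-1 1:00', '2022-1-2 1:00', '2022-1-2 1:00', '2022-1-1 1:00']),
    data={'category': ['A', 'A', 'B', 'B']})

# Grouping on the category and a daily Grouper over the index always
# yields a Series with a (category, date) MultiIndex, regardless of
# whether every group spans the same time bins
res = df2.groupby(['category', pd.Grouper(freq='1D')]).size()
print(res)
```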

To prevent automatic type change in Pandas

I have an Excel (.xlsx) file with 4 columns:
pmid (int)
gene (string)
disease (string)
label (string)
I attempt to load this directly into python with pandas.read_excel
df = pd.read_excel(path, parse_dates=False)
[screenshot: capture from Excel]
[screenshot: capture from pandas in my IDE debugger]
As shown above, pandas tries to be smart and automatically converts some of the gene fields, such as 3.Oct and 4.Oct, to a datetime type. The issue is that 3.Oct or 4.Oct is an abbreviation of a gene name with a totally different meaning, so I don't want pandas to do this. How can I prevent pandas from converting types automatically?
Update:
In fact, there is no conversion. The value appears as 2020-10-03 00:00:00 in pandas because that is the real value stored in the cell; Excel merely displays it in another format.
Update 2:
To keep the same format as Excel, you can use pd.to_datetime and a custom function to reformat the date.
# Sample
>>> df
gene
0 PDGFRA
1 2021-10-03 00:00:00 # Want: 3.Oct
2 2021-10-04 00:00:00 # Want: 4.Oct
>>> import calendar
>>> import numpy as np
>>> df['gene'] = (pd.to_datetime(df['gene'], errors='coerce')
...               .apply(lambda dt: f"{dt.day}.{calendar.month_abbr[dt.month]}"
...                      if dt is not pd.NaT else np.nan)
...               .fillna(df['gene']))
>>> df
gene
0 PDGFRA
1 3.Oct
2 4.Oct
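The session above can be condensed into a runnable sketch; a small in-memory frame stands in for the Excel file here:

```python
import calendar

import numpy as np
import pandas as pd

# Stand-in for the column read from Excel: one real gene name and
# two cells that Excel stored as dates
df = pd.DataFrame({'gene': ['PDGFRA', '2021-10-03 00:00:00', '2021-10-04 00:00:00']})

# Parse whatever parses as a date, reformat it as 'day.MonthAbbr',
# and fall back to the original string for everything else
parsed = pd.to_datetime(df['gene'], errors='coerce')
df['gene'] = parsed.apply(
    lambda d: f'{d.day}.{calendar.month_abbr[d.month]}' if d is not pd.NaT else np.nan
).fillna(df['gene'])

print(df['gene'].tolist())  # ['PDGFRA', '3.Oct', '4.Oct']
```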
Old answer
Force dtype=str to prevent pandas from trying to transform your dataframe:
df = pd.read_excel(path, dtype=str)
Or use converters={'colX': str, ...} to map the dtype for each column.
pd.read_excel has a dtype argument you can use to specify data types explicitly.

unable to fetch row where index is of type dtype='datetime64[ns]'

I have a pandas main_df dataframe with date as index
DatetimeIndex(['2021-05-11', '2021-05-12', '2021-05-13'],
              dtype='datetime64[ns]', name='date', freq=None)
What I am trying to do is fetch a row for a certain date.
I tried like this:
main_df.loc['2021-05-11'] and it works fine.
But if I pass a date object, it fails:
main_df.loc[datetime.date(2021, 5, 12)] raises a KeyError.
The index is a DatetimeIndex, so why does it throw an error when the key is not a string?
The reason is that a DatetimeIndex is an array of datetimes, not dates, so selecting by a datetime.date fails. You need to select by datetimes:
import datetime
import pandas as pd

main_df = pd.DataFrame({'a': range(3)},
                       index=pd.to_datetime(['2021-05-11', '2021-05-12', '2021-05-13']))
print (main_df)
a
2021-05-11 0
2021-05-12 1
2021-05-13 2
print (main_df.index)
DatetimeIndex(['2021-05-11', '2021-05-12', '2021-05-13'], dtype='datetime64[ns]', freq=None)
print (main_df.loc[datetime.datetime(2021, 5, 12)])
a 1
Name: 2021-05-12 00:00:00, dtype: int64
If you need to select by dates, first convert the datetimes to dates with DatetimeIndex.date:
main_df.index = main_df.index.date
print (main_df.index)
Index([2021-05-11, 2021-05-12, 2021-05-13], dtype='object')
print (main_df.loc[datetime.date(2021, 5, 12)])
a 1
Name: 2021-05-12, dtype: int64
If you use a string, pandas applies exact string indexing, so selection on a DatetimeIndex works the correct way.
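Alternatively, instead of converting the whole index to dates, you can wrap the datetime.date in a pd.Timestamp, which normalizes it to midnight and therefore matches the stored datetime64 values (a sketch):

```python
import datetime

import pandas as pd

main_df = pd.DataFrame({'a': range(3)},
                       index=pd.to_datetime(['2021-05-11', '2021-05-12', '2021-05-13']))

# pd.Timestamp(date) is midnight of that day, which is exactly what
# the DatetimeIndex stores for date-only entries
row = main_df.loc[pd.Timestamp(datetime.date(2021, 5, 12))]
print(row['a'])  # 1
```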

Error: "not found in axis" when dropping a list of index values

I am trying to drop two days every year from a dataframe with hourly values from 6am-8pm for dates from the 15.07 to 20.10. Therefore, I created a list with all the dates that should be dropped like this:
drop_list = []
for i in range(0, 6):
    Year = 1999 + i
    drop_list.append(str(Year) + '-07-15')
    drop_list.append(str(Year) + '-07-16')
My data looks like this:
[screenshot of the dataframe]
When I now call:
y = y.drop(drop_list)
I get:
KeyError: "['1999-07-15' '1999-07-16' '2000-07-15' '2000-07-16' '2001-07-15'\n '2001-07-16' '2002-07-15' '2002-07-16' '2003-07-15' '2003-07-16'\n '2004-07-15' '2004-07-16'] not found in axis"
Any suggestions what I am missing?
If you want to drop from a datetime index, you need to pass datetimes.
E.g.
>>> s = pd.Series([1,2,3], index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']))
>>> s
2020-01-01 1
2020-01-02 2
2020-01-03 3
dtype: int64
>>> s.drop(pd.to_datetime(['2020-01-01']))
2020-01-02 2
2020-01-03 3
dtype: int64
In your case, y.drop(pd.to_datetime(drop_list)) might work. For the future, please see https://stackoverflow.com/help/how-to-ask
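Note that y.drop(pd.to_datetime(drop_list)) only works if those exact (midnight) timestamps exist in the index. With an hourly 6am-8pm index, as described in the question, no label falls on midnight, so a boolean mask on the normalized index is a safer way to drop whole days (a sketch with made-up data):

```python
import pandas as pd

# Made-up hourly data, kept to 06:00-20:00, for four July days in 1999
idx = pd.date_range('1999-07-14 06:00', '1999-07-17 20:00', freq='h')
idx = idx[(idx.hour >= 6) & (idx.hour <= 20)]
y = pd.Series(1.0, index=idx)

drop_list = ['1999-07-15', '1999-07-16']

# Normalize each timestamp to midnight and keep only rows whose day
# is not in drop_list
y = y[~y.index.normalize().isin(pd.to_datetime(drop_list))]
print(y.index.normalize().unique())
```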

pandas datetime is shown as numbers in plot

I have a datetime variable in a pandas dataframe [1]. When I check the dtypes, it shows the right format (datetime) [2]; however, when I try to plot this variable, it is plotted as numbers and not datetimes [3].
Most surprising is that this variable was working fine until yesterday; I do not know what has changed today :( As the dtype shows fine, I am clueless about what else could go wrong.
I would highly appreciate your feedback.
thank you,
[1]
df.head()
reactive_power current timeofmeasurement
0 0 0.000 2018-12-12 10:43:41
1 0 0.000 2018-12-12 10:44:32
2 0 1.147 2018-12-12 10:46:16
3 262 1.135 2018-12-12 10:47:30
4 1159 4.989 2018-12-12 10:49:47
[2]
df.dtypes
reactive_power int64
current float64
timeofmeasurement datetime64[ns]
dtype: object
[3]
[plot screenshot]
You need to convert your datetime column from string type into datetime type, and then set it as the index. I don't have your original code, but something along these lines:
# Convert to datetime
df["timeofmeasurement"] = pd.to_datetime(df["timeofmeasurement"], format="%Y-%m-%d %H:%M:%S")
# Set the date as index
df = df.set_index("timeofmeasurement")
# Then you can plot easily
df.plot()
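A runnable sketch of the conversion step, using the rows shown in [1]; the plot call itself is left out so the snippet needs only pandas:

```python
import pandas as pd

# The sample rows from the question, with the timestamps still as strings
df = pd.DataFrame({
    'reactive_power': [0, 0, 0, 262, 1159],
    'current': [0.0, 0.0, 1.147, 1.135, 4.989],
    'timeofmeasurement': ['2018-12-12 10:43:41', '2018-12-12 10:44:32',
                          '2018-12-12 10:46:16', '2018-12-12 10:47:30',
                          '2018-12-12 10:49:47'],
})

# Parse the string column into datetime64 and use it as the index;
# df.plot() would then render real dates on the x-axis instead of row numbers
df['timeofmeasurement'] = pd.to_datetime(df['timeofmeasurement'],
                                         format='%Y-%m-%d %H:%M:%S')
df = df.set_index('timeofmeasurement')
print(df.index.dtype)
```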