I have code at the moment written to change two columns of my dataframe from strings into datetime.datetime objects similar to the following:
from datetime import datetime as dt
import pandas as pd

def converter(date):
    return dt.strptime(date, '%m/%d/%Y %H:%M:%S')

df = pd.DataFrame({'A': ['12/31/9999 0:00:00', '1/1/2018 0:00:00'],
                   'B': ['4/1/2015 0:00:00', '11/1/2014 0:00:00']})
df['A'] = df['A'].apply(converter)
df['B'] = df['B'].apply(converter)
When I run this code and print the dataframe, it comes out like this
A B
0 9999-12-31 00:00:00 2015-04-01
1 2018-01-01 00:00:00 2014-11-01
When I checked the data types of each column, they read
A object
B datetime64[ns]
But when I check the format of the actual cells of the first row, they read
<class 'datetime.datetime'>
<class 'pandas._libs.tslib.Timestamp'>
After experimenting around, I think I've run into an out-of-bounds error because the date '12/31/9999 0:00:00' in column 'A' exceeds the maximum supported pandas Timestamp, and this causes that column to stay as datetime.datetime objects. My question is: how can I also convert column 'B' of my dataframe to datetime.datetime objects, so that I can run a query on the columns similar to
df.query('A > B')
without getting an error or the wrong output.
Thanks in advance
Since '9999' is just some dummy year, you can simplify your life by choosing a dummy year that is in bounds (or one that makes more sense given your actual data):
import pandas as pd

df = df.replace('9999', '2060', regex=True).apply(pd.to_datetime)
print(df)
print(df.dtypes)
Output:
A B
0 2060-12-31 2015-04-01
1 2018-01-01 2014-11-01
A datetime64[ns]
B datetime64[ns]
dtype: object
As @coldspeed points out, it's perhaps better to remove those bad dates:
df.apply(pd.to_datetime, errors='coerce')
# A B
#0 NaT 2015-04-01
#1 2018-01-01 2014-11-01
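Either way, once both columns are genuine datetime64 columns, the query from the question works without errors. A minimal sketch using the dummy-year replacement (the format string is assumed from the question's data):

```python
import pandas as pd

# Rebuild the question's frame; year 9999 is beyond pd.Timestamp.max
# (2262-04-11), so swap in an in-bounds dummy year before parsing.
df = pd.DataFrame({'A': ['12/31/9999 0:00:00', '1/1/2018 0:00:00'],
                   'B': ['4/1/2015 0:00:00', '11/1/2014 0:00:00']})
df = (df.replace('9999', '2060', regex=True)
        .apply(pd.to_datetime, format='%m/%d/%Y %H:%M:%S'))

result = df.query('A > B')  # rows where A is later than B
```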
Related
I am finding an inconsistent output with pandas groupby-resample behavior.
Take this dataframe, in which category A has samples on the first and second day and category B has a sample only on the second day:
df1 = pd.DataFrame(index=pd.DatetimeIndex(
['2022-1-1 1:00','2022-1-2 1:00','2022-1-2 1:00']),
data={'category':['A','A','B']})
# Output:
# category
#2022-01-01 01:00:00 A
#2022-01-02 01:00:00 A
#2022-01-02 01:00:00 B
When I groupby-resample I get a Series with multiindex on category and time:
res1 = df1.groupby('category').resample('1D').size()
#Output:
#category
#A 2022-01-01 1
# 2022-01-02 1
#B 2022-01-02 1
#dtype: int64
But if I add one more data point so that B has a sample on day 1, the return value is a dataframe with single-index in category and columns corresponding to the time bins:
df2 = pd.DataFrame(index=pd.DatetimeIndex(
['2022-1-1 1:00','2022-1-2 1:00','2022-1-2 1:00','2022-1-1 1:00']),
data={'category':['A','A','B','B']})
res2 = df2.groupby('category').resample('1D').size()
# Output:
# 2022-01-01 2022-01-02
# category
# A 1 1
# B 1 1
Is this expected behavior? I reproduced this behavior in pandas 1.4.2 and was unable to find a bug report.
I submitted bug report 46826 to pandas.
The result should be a Series with a MultiIndex in both cases. There was a bug which caused df.groupby(...).resample(...).size() to return a wide DataFrame for cases in which all groups had the same index. This has been fixed on the master branch. Thank you for opening the issue.
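On versions without the fix, one workaround (my sketch, not from the bug thread) is to group on a pd.Grouper over the index instead of chaining .resample(), which yields a MultiIndexed Series in both cases:

```python
import pandas as pd

df2 = pd.DataFrame(index=pd.DatetimeIndex(
    ['2022-1-1 1:00', '2022-1-2 1:00', '2022-1-2 1:00', '2022-1-1 1:00']),
    data={'category': ['A', 'A', 'B', 'B']})

# Grouping on the category column plus a daily Grouper over the index
# sidesteps groupby().resample() and always returns a MultiIndex Series.
res = df2.groupby(['category', pd.Grouper(freq='1D')]).size()
```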
I have an Excel (.xlsx) file with 4 columns:
pmid (int)
gene (string)
disease (string)
label (string)
I attempt to load this directly into Python with pandas.read_excel:
df = pd.read_excel(path, parse_dates=False)
(screenshot: capture from Excel)
(screenshot: capture from pandas in my IDE debugger)
As shown above, pandas tries to be smart and automatically converts some of the gene fields, such as 3.Oct or 4.Oct, to a datetime type. The issue is that 3.Oct or 4.Oct is an abbreviation of a gene name with a totally different meaning, so I don't want pandas to do this. How can I prevent pandas from converting types automatically?
Update:
In fact, there is no conversion. The value appears as 2020-10-03 00:00:00 in pandas because that is the real value stored in the cell; Excel just shows this value in another format.
Update 2:
To keep the same format as Excel, you can use pd.to_datetime and a custom function to reformat the date.
# Sample
>>> df
gene
0 PDGFRA
1 2021-10-03 00:00:00 # Want: 3.Oct
2 2021-10-04 00:00:00 # Want: 4.Oct
>>> import calendar
>>> import numpy as np
>>> df['gene'] = (pd.to_datetime(df['gene'], errors='coerce')
...               .apply(lambda dt: f"{dt.day}.{calendar.month_abbr[dt.month]}"
...                      if dt is not pd.NaT else np.nan)
...               .fillna(df['gene']))
>>> df
gene
0 PDGFRA
1 3.Oct
2 4.Oct
Old answer
Force dtype=str to prevent pandas from trying to transform your dataframe:
df = pd.read_excel(path, dtype=str)
Or use converters={'colX': str, ...} to map the dtype for each column.
pd.read_excel has a dtype argument you can use to specify data types explicitly.
I have a pandas main_df dataframe with date as index:
DatetimeIndex(['2021-05-11', '2021-05-12', '2021-05-13'],
              dtype='datetime64[ns]', name='date', freq=None)
What I am trying to do is fetch a row based on a certain date.
I tried
main_df.loc['2021-05-11'] and it works fine.
But if I pass a date object it fails:
main_df.loc[datetime.date(2021, 5, 12)] raises a KeyError.
The index is a DatetimeIndex, so why does it throw an error when I don't pass the key as a string?
The reason is that a DatetimeIndex is an array of datetime values, so selecting with date objects fails. You need to select with datetimes:
import datetime
import pandas as pd

main_df = pd.DataFrame({'a': range(3)},
                       index=pd.to_datetime(['2021-05-11', '2021-05-12', '2021-05-13']))
print (main_df)
a
2021-05-11 0
2021-05-12 1
2021-05-13 2
print (main_df.index)
DatetimeIndex(['2021-05-11', '2021-05-12', '2021-05-13'], dtype='datetime64[ns]', freq=None)
print (main_df.loc[datetime.datetime(2021, 5, 12)])
a 1
Name: 2021-05-12 00:00:00, dtype: int64
If need select by dates first convert datetimes to dates by DatetimeIndex.date:
main_df.index = main_df.index.date
print (main_df.index)
Index([2021-05-11, 2021-05-12, 2021-05-13], dtype='object')
print (main_df.loc[datetime.date(2021, 5, 12)])
a 1
Name: 2021-05-12, dtype: int64
If you use a string, pandas parses it and matches it against the DatetimeIndex (exact or partial string indexing), so selection works correctly.
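To illustrate the string path: with a DatetimeIndex, a full date string selects a single row, and a partial string selects a whole period (a quick sketch):

```python
import pandas as pd

main_df = pd.DataFrame({'a': range(3)},
                       index=pd.to_datetime(['2021-05-11', '2021-05-12', '2021-05-13']))

one_day = main_df.loc['2021-05-12']   # exact match on a full date string
may_rows = main_df.loc['2021-05']     # partial string: all rows in May 2021
```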
I am trying to drop two days every year from a dataframe with hourly values from 6am-8pm, for dates from 15.07 to 20.10. Therefore, I created a list with all the dates that should be dropped, like this:
drop_list = []
for i in range(0, 6):
    year = 1999 + i
    drop_list.append(str(year) + '-07-15')
    drop_list.append(str(year) + '-07-16')
My data looks like this:
(screenshot of the data)
When I now call:
y = y.drop(drop_list)
I get:
KeyError: "['1999-07-15' '1999-07-16' '2000-07-15' '2000-07-16' '2001-07-15'\n '2001-07-16' '2002-07-15' '2002-07-16' '2003-07-15' '2003-07-16'\n '2004-07-15' '2004-07-16'] not found in axis"
Any suggestions what I am missing?
If you want to drop from a datetime index, you need to pass datetimes.
E.g.
>>> s = pd.Series([1,2,3], index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']))
>>> s
2020-01-01 1
2020-01-02 2
2020-01-03 3
dtype: int64
>>> s.drop(pd.to_datetime(['2020-01-01']))
2020-01-02 2
2020-01-03 3
dtype: int64
In your case, y.drop(pd.to_datetime(drop_list)) might work. For the future, please see https://stackoverflow.com/help/how-to-ask
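One caveat (my assumption, based on "hourly values from 6am-8pm"): the labels in drop_list parse to midnight timestamps, which may not exist in an hourly index, so drop could still raise a KeyError. A sketch that filters by the date part instead:

```python
import pandas as pd

# Hypothetical hourly index: 6:00-20:00 on two consecutive days
idx = pd.date_range('1999-07-15 06:00', '1999-07-16 20:00', freq='h')
idx = idx[(idx.hour >= 6) & (idx.hour <= 20)]
y = pd.Series(range(len(idx)), index=idx)

drop_list = ['1999-07-15']
# normalize() zeroes out the time, so timestamps compare as plain dates
y = y[~y.index.normalize().isin(pd.to_datetime(drop_list))]
```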
I have a datetime variable in a pandas dataframe [1]. When I check the dtypes, it shows the right format (datetime) [2]; however, when I try to plot this variable, it is plotted as numbers and not as datetimes [3].
The most surprising part is that this variable was working fine until yesterday, and I do not know what has changed today :( Since the dtype looks fine, I am clueless about what else could go wrong.
I would highly appreciate your feedback.
thank you,
[1]
df.head()
reactive_power current timeofmeasurement
0 0 0.000 2018-12-12 10:43:41
1 0 0.000 2018-12-12 10:44:32
2 0 1.147 2018-12-12 10:46:16
3 262 1.135 2018-12-12 10:47:30
4 1159 4.989 2018-12-12 10:49:47
[2]
df.dtypes
reactive_power int64
current float64
timeofmeasurement datetime64[ns]
dtype: object
[3]
(plot screenshot)
You need to convert your datetime column from string type into datetime type, and then set it as the index. I don't have your original code, but something along these lines:
# Convert to datetime
df["timeofmeasurement"] = pd.to_datetime(df["timeofmeasurement"], format="%Y-%m-%d %H:%M:%S")
# Set date as index
df = df.set_index("timeofmeasurement")
# Then you can plot easily
df.plot()
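A self-contained sketch of the same idea, using a few rows made up from the question's sample:

```python
import pandas as pd

df = pd.DataFrame({
    'reactive_power': [0, 262, 1159],
    'current': [0.000, 1.135, 4.989],
    'timeofmeasurement': ['2018-12-12 10:43:41',
                          '2018-12-12 10:47:30',
                          '2018-12-12 10:49:47'],
})

# Parse the strings into real datetimes and put them on the index,
# so plotting gets timestamps on the x-axis instead of row numbers.
df['timeofmeasurement'] = pd.to_datetime(df['timeofmeasurement'],
                                         format='%Y-%m-%d %H:%M:%S')
df = df.set_index('timeofmeasurement')
# df.plot() would now label the x-axis with datetimes
```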