unable to fetch row where index is of type dtype='datetime64[ns]' - pandas

I have a pandas dataframe main_df with dates as the index:
DatetimeIndex(['2021-05-11', '2021-05-12', '2021-05-13'],
dtype='datetime64[ns]', name='date', freq=None)
What I am trying to do is fetch a row for a certain date.
I tried
main_df.loc['2021-05-11'] and it works fine.
But if I pass a date object it fails:
main_df.loc[datetime.date(2021, 5, 12)] raises a KeyError.
The index is a DatetimeIndex, so why does it throw an error when I don't pass the key as a string?

The reason is that a DatetimeIndex is essentially an array of datetimes, so selecting by datetime.date objects fails. You need to select by datetime.datetime (or pd.Timestamp) objects instead:
import datetime
import pandas as pd

main_df = pd.DataFrame({'a': range(3)},
                       index=pd.to_datetime(['2021-05-11', '2021-05-12', '2021-05-13']))
print (main_df)
a
2021-05-11 0
2021-05-12 1
2021-05-13 2
print (main_df.index)
DatetimeIndex(['2021-05-11', '2021-05-12', '2021-05-13'], dtype='datetime64[ns]', freq=None)
print (main_df.loc[datetime.datetime(2021, 5, 12)])
a 1
Name: 2021-05-12 00:00:00, dtype: int64
If you need to select by date objects, first convert the datetimes to dates with DatetimeIndex.date:
main_df.index = main_df.index.date
print (main_df.index)
Index([2021-05-11, 2021-05-12, 2021-05-13], dtype='object')
print (main_df.loc[datetime.date(2021, 5, 12)])
a 1
Name: 2021-05-12, dtype: int64
If you pass a string, pandas parses it and resolves it against the DatetimeIndex for you, which is why the string key works.
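As a side note, strings also enable partial string indexing on a DatetimeIndex. A small sketch, assuming main_df still has its original DatetimeIndex (i.e. before the .date conversion above):
# partial string indexing: select every row in May 2021
print (main_df.loc['2021-05'])
# a full date string is parsed to a timestamp and matched exactly
print (main_df.loc['2021-05-11'])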

Related

Pandas reindex Dates To Subset of Dates from List

I am sorry, but there are online documentation and examples and I'm still not understanding. I have a pandas df with an index of dates in datetime format (yyyy-mm-dd), and I'm trying to resample or reindex this dataframe based on a subset of dates in the same format (yyyy-mm-dd) that are in a list. I have converted the df.index values to datetime using:
dfmla.index = pd.to_datetime(dfmla.index)
I've tried various things and I keep getting NaN's after applying the reindex. I know this must be a datatypes problem and my df is in the form of:
df.dtypes
Out[30]:
month int64
mean_mon_flow float64
std_mon_flow float64
monthly_flow_ln float64
std_anomaly float64
dtype: object
My data looks like this:
df.head(5)
Out[31]:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1949-10-01 10 8.565828 0.216126 8.848631 1.308506
1949-11-01 11 8.598055 0.260254 8.368006 -0.883938
1949-12-01 12 8.612080 0.301156 8.384662 -0.755149
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
My month_list (list datatype) looks like this:
month_list[0:2]
Out[37]: ['1950-08-01', '1950-09-01']
I need my condensed, new reindexed df to look like this:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
thank you for your suggestions,
If you're certain that all the dates in month_list are in the index, you can do df.loc[month_list]; otherwise, you can use reindex:
df.reindex(pd.to_datetime(month_list))
Output:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
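Note that reindex inserts all-NaN rows for any label that is not found in the index, so reindexing with the raw strings in month_list (instead of converted Timestamps) is the likely source of the NaN's you saw. A small sketch:
# string labels don't match the Timestamps in the index -> all-NaN rows
print(df.reindex(month_list))
# converting first makes the labels match -> the real rows come back
print(df.reindex(pd.to_datetime(month_list)))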

Groupby two columns one of them is datetime

I have a data frame that I want to group by two columns, one of which is of datetime type. How can I do this?
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.random.randn(6),
    'b': np.random.choice([5, 7, np.nan], 6),
    'c': np.random.choice(['panda', 'python', 'shark'], 6),
    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid(), see note 2 below
    'd': np.repeat(range(3), 2),
    'e': np.tile(range(2), 3),
    # a date range and set of random dates
    'f': pd.date_range('1/1/2011', periods=6, freq='D'),
    'g': np.random.choice(pd.date_range('1/1/2011', periods=365,
                                        freq='D'), 6, replace=False)
})
You can use pd.Grouper to specify groupby instructions. It can be used with a pd.DatetimeIndex to group data at a specified frequency via the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the index to the date column, and it will be converted to a pd.DatetimeIndex. Then you can use pd.Grouper along with other columns; in the following example I use the category column.
The freq='M' parameter groups the index at month-end frequency. There are a number of offset aliases (frequency strings) that can be passed to pd.Grouper.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date category
2020-05-31 a 1
2020-06-30 a 2
b 6
2020-07-31 a 4
Name: value, dtype: int64
Another example, with your MCVE:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g c
2011-01-31 panda 0
2011-04-30 shark 2
2011-06-30 panda 2
2011-07-31 panda 0
2011-09-30 panda 1
2011-12-31 python 1
Name: d, dtype: int32
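As a side note, pd.Grouper also accepts a key parameter, so (assuming the same df) you could group on the 'g' datetime column directly without setting it as the index first. A minimal sketch:
# group by month on the 'g' datetime column without touching the index
df.groupby([pd.Grouper(key='g', freq='M'), 'c'])['d'].sum()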

Problems getting two columns into datetime.datetime format

I have code at the moment written to change two columns of my dataframe from strings into datetime.datetime objects similar to the following:
from datetime import datetime as dt

import pandas as pd

def converter(date):
    date = dt.strptime(date, '%m/%d/%Y %H:%M:%S')
    return date

df = pd.DataFrame({'A': ['12/31/9999 0:00:00', '1/1/2018 0:00:00'],
                   'B': ['4/1/2015 0:00:00', '11/1/2014 0:00:00']})
df['A'] = df['A'].apply(converter)
df['B'] = df['B'].apply(converter)
When I run this code and print the dataframe, it comes out like this
A B
0 9999-12-31 00:00:00 2015-04-01
1 2018-01-01 00:00:00 2014-11-01
When I checked the data types of each column, they read
A object
B datetime64[ns]
But when I check the format of the actual cells of the first row, they read
<class 'datetime.datetime'>
<class 'pandas._libs.tslib.Timestamp'>
After experimenting, I think I've run into an out-of-bounds error because of the date '12/31/9999 0:00:00' in column 'A', and this is causing that column to be cast as datetime.datetime objects. My question is how I can also convert column 'B' of my dataframe to datetime.datetime objects, so that I can run a query on the columns similar to
df.query('A > B')
without getting an error or the wrong output.
Thanks in advance
Since '9999' is just some dummy year, you can simplify your life by choosing a dummy year which is in bounds (or one that makes more sense given your actual data):
import pandas as pd
df.replace('9999', '2060', regex=True).apply(pd.to_datetime)
Output:
A B
0 2060-12-31 2015-04-01
1 2018-01-01 2014-11-01
A datetime64[ns]
B datetime64[ns]
dtype: object
As #coldspeed points out, it's perhaps better to remove those bad dates:
df.apply(pd.to_datetime, errors='coerce')
# A B
#0 NaT 2015-04-01
#1 2018-01-01 2014-11-01
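For context, the out-of-bounds problem exists because datetime64[ns] can only represent a span of roughly 584 years centered on 1970, which is why '12/31/9999' cannot be stored at nanosecond resolution. You can check the exact bounds yourself:
import pandas as pd
# the representable range for datetime64[ns]
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807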

Understanding resampling of datetime in pandas

I have a question regarding resampling of DataFrames.
import pandas as pd
df = pd.DataFrame([['2005-01-20', 10], ['2005-01-21', 20],
['2005-01-27', 40], ['2005-01-28', 50]],
columns=['date', 'num'])
# Convert the column to datetime
df['date'] = pd.to_datetime(df['date'])
# Resample and aggregate results by week
df = df.resample('W', on='date')['num'].sum().reset_index()
print(df.head())
# OUTPUT:
# date num
# 0 2005-01-23 30
# 1 2005-01-30 90
Everything works as expected, but I would like to better understand what exactly resample(), ['num'] and sum() do here.
QUESTION #1
Why does the following happen:
The result of df.resample('W', on='date') is DatetimeIndexResampler.
The result of df.resample('W', on='date')['num'] is pandas.core.groupby.SeriesGroupBy.
The result of df.resample('W', on='date')['num'].sum() is
date
2005-01-23 30
2005-01-30 90
Freq: W-SUN, Name: num, dtype: int64
QUESTION #2
Is there a way to produce the same results without resampling? For example, using groupby.
Answer1
As the docs say, .resample returns a Resampler object. Hence you get a DatetimeIndexResampler, because date holds datetime values.
Next, you get pandas.core.groupby.SeriesGroupBy because you are selecting a single Series from the dataframe through the Resampler object.
Oh by the way,
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num']
Would return
pandas.core.groupby.SeriesGroupBy as well.
Now when you do .sum(), you are computing the sum over the requested axis. You get a Series because the sum is computed group by group over a pandas.core.series.Series.
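To make the chain concrete, here is a small sketch that just inspects the intermediate objects, assuming the original df from the question (before the resample/reset_index reassignment). The exact class names printed vary by pandas version; older versions show SeriesGroupBy for the column selection, newer ones a Resampler:
r = df.resample('W', on='date')
print(type(r))   # a DatetimeIndexResampler
s = r['num']     # select one column from the resampler
print(type(s))
print(s.sum())   # the weekly sums, as a Series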
Answer2
You can achieve the same results using groupby with the help of Grouper, as follows:
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num'].sum()
Output:
date
2005-01-23 30
2005-01-30 90
Name: num, dtype: int64
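And a quick sketch to confirm the two approaches agree on this data (again assuming the original df):
a = df.resample('W', on='date')['num'].sum()
b = df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num'].sum()
print(a.equals(b))  # True: same weekly bins, same sums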

How do I get pandas update function to correctly handle numpy.datetime64?

I have a dataframe with a column that may contain None, and another dataframe with the same index that has datetime values populated. I am trying to update the first from the second using DataFrame.update.
import numpy as np
import pandas as pd
df = pd.DataFrame([{'id': 0, 'as_of_date': np.datetime64('2017-05-08')}])
print(df.as_of_date)
df2 = pd.DataFrame([{'id': 0, 'as_of_date': None}])
print(df2.as_of_date)
df2.update(df)
print(df2.as_of_date)
print(df2.apply(lambda x: x['as_of_date'] - np.timedelta64(1, 'D'), axis=1))
This results in
0 2017-05-08
Name: as_of_date, dtype: datetime64[ns]
0 None
Name: as_of_date, dtype: object
0 1494201600000000000
Name: as_of_date, dtype: object
0 -66582 days +10:33:31.122941
dtype: timedelta64[ns]
So basically update converts the datetime to an integer (nanoseconds since the epoch) but keeps the column's dtype as object. Then if I try to do date math on it, I get wacky results because numpy doesn't know how to treat it.
I was hoping df2 would look like df after updating. How can I fix this?
Try this:
In [391]: df2 = df2.combine_first(df)
In [392]: df2
Out[392]:
as_of_date id
0 2017-05-08 0
In [396]: df2.dtypes
Out[396]:
as_of_date datetime64[ns]
id int64
dtype: object
A two-step approach:
Fill the None data in df2 using dates from df:
df2 = df2.combine_first(df)
Update all elements in df2 using the elements from df:
df2.update(df)
Without the 2nd step, df2 would only take values from df to fill its Nones.
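Alternatively, a sketch if you want to keep using update() on its own: re-coerce the dtype afterwards, since update() leaves the column as object holding raw nanosecond values that pd.to_datetime can parse back.
df2.update(df)
# the column is now object dtype with nanosecond integers;
# pd.to_datetime interprets them as nanoseconds since the epoch
df2['as_of_date'] = pd.to_datetime(df2['as_of_date'])
print(df2.dtypes)  # as_of_date is datetime64[ns] again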