Panda Index Datetime Switching Months and Days - pandas

I have a pandas df.index in the format below.
It's a string of day/month/year, so the first item is 05Sep2017, etc.:
05/09/17 #05Sep2017
07/09/17 #07Sep2017
...
18/10/17 #18Oct2017
Applying
df.index = pd.to_datetime(df.index)
to the above, transforms it to:
2017-05-09 #09May2017
2017-07-09 #09Jul2017
...
2017-10-18 #18Oct2017
What seems to be happening is that the first entries are having the Day and Month switched. The last entry instead, where the day is greater than 12, is converted correctly.
I tried to swap the day and month by converting the index to a column and applying:
df['date'] = df.index
df['date'].apply(lambda x: dt.datetime.strftime(x, '%Y-%d-%m'))
as well as:
df['date'].apply(lambda x: dt.datetime.strftime(x, '%Y-%m-%d'))
but to no avail.
How can I convert the index to datetime, with all entries parsed as day/month/year, please?

The default display format for dates in pandas is YYYY-MM-DD.
df = df.set_index('date_col')
df.index = pd.to_datetime(df.index)
print (df)
val
2017-05-09 4
2017-07-09 8
2017-10-18 2
print (df.index)
DatetimeIndex(['2017-05-09', '2017-07-09', '2017-10-18'], dtype='datetime64[ns]', freq=None)
You can use strftime for a different display format, but you lose the datetimes, because you get strings:
df.index = pd.to_datetime(df.index).strftime('%Y-%d-%m')
print (df.index)
Index(['2017-09-05', '2017-09-07', '2017-18-10'], dtype='object')
df.index = pd.to_datetime(df.index).strftime('%d-%b-%Y')
print (df)
val
09-May-2017 4
09-Jul-2017 8
18-Oct-2017 2
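That said, if the goal is a true DatetimeIndex parsed as day/month/year (rather than formatted strings), the usual fix is to tell pd.to_datetime the order explicitly, either with dayfirst=True or with a full format string. A minimal sketch:

```python
import pandas as pd

idx = pd.Index(['05/09/17', '07/09/17', '18/10/17'])

# An explicit format is unambiguous: day/month/two-digit year
parsed = pd.to_datetime(idx, format='%d/%m/%y')
print(parsed)
# DatetimeIndex(['2017-09-05', '2017-09-07', '2017-10-18'], dtype='datetime64[ns]', freq=None)

# dayfirst=True is a looser hint that gives the same result here
assert pd.to_datetime(idx, dayfirst=True).equals(parsed)
```

Unlike the strftime approach, this keeps the index as datetime64, so date slicing and resampling still work.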

Related

Parsing date in pandas.read_csv

I am trying to read a CSV file which has in its first column date values specified in this format:
"Dec 30, 2021","1.1","1.2","1.3","1"
While I can define the types for the remaining columns using dtype= clause, I do not know how to handle the Date.
I have tried the obvious np.datetime64 without success.
Is there any way to specify a format to parse this date directly using read_csv method?
You may use parse_dates:
df = pd.read_csv('data.csv', parse_dates=['date'])
But in my experience it is a frequent source of errors; I think it is better to specify the date format and manually convert the date column. For example, in your case:
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format = '%b %d, %Y')
Just specify a list of columns that should be converted to dates in the parse_dates= argument of pd.read_csv:
>>> df = pd.read_csv('file.csv', parse_dates=['date'])
>>> df
date a b c d
0 2021-12-30 1.1 1.2 1.3 1
>>> df.dtypes
date datetime64[ns]
a float64
b float64
c float64
d int64
Update
What if I want to further specify the format for a, b, c and d? I used a simplified example; in my file, numbers are formatted like "2,345.55", and those are read as object by read_csv, not as float64 or int64 as in your example.
from datetime import datetime

converters = {
'Date': lambda x: datetime.strptime(x, "%b %d, %Y"),
'Number': lambda x: float(x.replace(',', ''))
}
df = pd.read_csv('data.csv', converters=converters)
Output:
>>> df
Date Number
0 2021-12-30 2345.55
>>> df.dtypes
Date datetime64[ns]
Number float64
dtype: object
# data.csv
Date,Number
"Dec 30, 2021","2,345.55"
Old answer
If you have a particular format, you can pass a custom function to date_parser parameter:
from datetime import datetime
custom_date_parser = lambda x: datetime.strptime(x, "%b %d, %Y")
df = pd.read_csv('data.csv', parse_dates=['Date'], date_parser=custom_date_parser)
print(df)
# Output
Date A B C D
0 2021-12-30 1.1 1.2 1.3 1
Or let Pandas try to determine the format, as suggested by @richardec.
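As a side note: on pandas 2.0 and later, date_parser is deprecated in favor of a date_format argument that takes the format string directly. A hedged sketch, assuming pandas >= 2.0, with io.StringIO standing in for a real file:

```python
import io
import pandas as pd

csv_text = 'Date,A\n"Dec 30, 2021",1.1\n'

# date_format replaces the deprecated date_parser in pandas 2.0+
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], date_format='%b %d, %Y')
print(df.dtypes)
```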

unable to fetch row where index is of type dtype='datetime64[ns]'

I have a pandas main_df dataframe with date as index
<bound method Index.get_value of DatetimeIndex(['2021-05-11', '2021-05-12','2021-05-13'],
dtype='datetime64[ns]', name='date', freq=None)>
What I am trying to do is fetch a row based on a certain date.
I tried like this
main_df.loc['2021-05-11'] and it works fine.
But if I pass a date object, it fails:
main_df.loc[datetime.date(2021, 5, 12)] raises a KeyError.
The index is a DatetimeIndex, so why does it throw an error when I don't pass the key as a string?
The reason is that a DatetimeIndex is essentially an array of datetimes, so selecting by date objects fails.
You need to select by datetimes:
main_df = pd.DataFrame({'a':range(3)},
index=pd.to_datetime(['2021-05-11', '2021-05-12','2021-05-13']))
print (main_df)
a
2021-05-11 0
2021-05-12 1
2021-05-13 2
print (main_df.index)
DatetimeIndex(['2021-05-11', '2021-05-12', '2021-05-13'], dtype='datetime64[ns]', freq=None)
print (main_df.loc[datetime.datetime(2021, 5, 12)])
a 1
Name: 2021-05-12 00:00:00, dtype: int64
If you need to select by dates, first convert the datetimes to dates with DatetimeIndex.date:
main_df.index = main_df.index.date
print (main_df.index)
Index([2021-05-11, 2021-05-12, 2021-05-13], dtype='object')
print (main_df.loc[datetime.date(2021, 5, 12)])
a 1
Name: 2021-05-12, dtype: int64
If you use a string, pandas uses exact (string) indexing, so the lookup in the DatetimeIndex works correctly.
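An alternative that keeps the DatetimeIndex intact is to wrap the date in pd.Timestamp, which normalizes it to midnight and matches the index directly — a minimal sketch:

```python
import datetime
import pandas as pd

main_df = pd.DataFrame({'a': range(3)},
                       index=pd.to_datetime(['2021-05-11', '2021-05-12', '2021-05-13']))

# pd.Timestamp accepts a datetime.date and normalizes it to midnight
row = main_df.loc[pd.Timestamp(datetime.date(2021, 5, 12))]
print(row['a'])  # 1
```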

Add fractional number of years to date in pandas Python

I have a pandas df that includes two columns: time_in_years (float64) and date (datetime64).
import pandas as pd
df = pd.DataFrame({
'date': ['2009-12-25','2005-01-09','2010-10-31'],
'time_in_years': ['10.3434','5.0977','3.3426']
})
df['date'] = pd.to_datetime(df['date'])
df["time_in_years"] = df.time_in_years.astype(float)
I need to create date2 as a datetime64 column by adding the number of years to the date.
I tried the following but with no luck:
df['date_2'] = df['date'] + datetime.timedelta(years=df['time_in_years'])
I know that with fractions I will not be able to get the exact date, but I want to get the closest new date as possible.
Try the dateutil package:
from dateutil.relativedelta import relativedelta
First convert the fractional years to a number of days, then use a lambda function and apply it to the dataframe:
df['date_2'] = df.apply(lambda x: x['date'] + relativedelta(days = int(x['time_in_years']*365)), axis = 1)
Result:
date time_in_years date_2
0 2009-12-25 10.3434 2020-04-26
1 2005-01-09 5.0977 2010-02-12
2 2010-10-31 3.3426 2014-03-04
datetime.timedelta also works fine:
df['date_2'] = df.apply(lambda x: x['date'] + datetime.timedelta(days = int(x['time_in_years']*365)), axis = 1)
Note the conversion to int: it rounds down to whole days. (Strictly speaking, datetime.timedelta does accept fractional days, and recent versions of relativedelta normalize them, but integer days keep the result predictable.)
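A vectorized alternative avoids apply entirely: pd.to_timedelta accepts fractional day counts, so the whole column can be converted at once. A sketch, with 365.25 as an assumed average year length:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2009-12-25', '2005-01-09', '2010-10-31']),
    'time_in_years': [10.3434, 5.0977, 3.3426],
})

# to_timedelta handles fractional day counts, so no rounding to int is needed
df['date_2'] = df['date'] + pd.to_timedelta(df['time_in_years'] * 365.25, unit='D')
print(df['date_2'].dtype)  # datetime64[ns]
```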

Understanding resampling of datetime in pandas

I have a question regarding resampling of DataFrames.
import pandas as pd
df = pd.DataFrame([['2005-01-20', 10], ['2005-01-21', 20],
['2005-01-27', 40], ['2005-01-28', 50]],
columns=['date', 'num'])
# Convert the column to datetime
df['date'] = pd.to_datetime(df['date'])
# Resample and aggregate results by week
df = df.resample('W', on='date')['num'].sum().reset_index()
print(df.head())
# OUTPUT:
# date num
# 0 2005-01-23 30
# 1 2005-01-30 90
Everything works as expected, but I would like to better understand what exactly resample(),['num'] and sum() do here.
QUESTION #1
Why does the following happen?
The result of df.resample('W', on='date') is DatetimeIndexResampler.
The result of df.resample('W', on='date')['num'] is pandas.core.groupby.SeriesGroupBy.
The result of df.resample('W', on='date')['num'].sum() is
date
2005-01-23 30
2005-01-30 90
Freq: W-SUN, Name: num, dtype: int64
QUESTION #2
Is there a way to produce the same results without resampling? For example, using groupby.
Answer1
As the docs say, .resample returns a Resampler object. Hence you get a DatetimeIndexResampler, because date is a datetime column.
Next, you get pandas.core.groupby.SeriesGroupBy because you are selecting a single Series ('num') out of the Resampler object.
Oh by the way,
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num']
Would return
<pandas.core.groupby.SeriesGroupBy as well.
Now, when you do .sum(), you are summing within each group. You get a Series because you are summing a SeriesGroupBy.
Answer2
You can achieve the same result using groupby with the help of Grouper, as follows:
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num'].sum()
Output:
date
2005-01-23 30
2005-01-30 90
Name: num, dtype: int64
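To verify the two spellings really are equivalent, a quick sketch comparing them on the example data:

```python
import pandas as pd

df = pd.DataFrame([['2005-01-20', 10], ['2005-01-21', 20],
                   ['2005-01-27', 40], ['2005-01-28', 50]],
                  columns=['date', 'num'])
df['date'] = pd.to_datetime(df['date'])

via_resample = df.resample('W', on='date')['num'].sum()
via_groupby = df.groupby(pd.Grouper(key='date', freq='W'))['num'].sum()

# Same values: [30, 90] for the weeks ending Jan 23 and Jan 30
assert list(via_resample) == list(via_groupby) == [30, 90]
```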

pandas resampling dataframe and keep datetime index as a column

I'm trying to resample daily data to weekly data using pandas.
I'm using the following:
weekly_start_date =pd.Timestamp('01/05/2011')
weekly_end_date =pd.Timestamp('05/28/2013')
daily_data = daily_data[(daily_data["date"] >= weekly_start_date) & (daily_data["date"] <= weekly_end_date)]
daily_data = daily_data.set_index('date',drop=False)
weekly_data = daily_data.resample('7D',how=np.sum,closed='left',label='left')
The problem is weekly_data doesn't have the date column anymore.
What did I miss?
Thanks,
If I understand your question, it looks like you're doing the resampling correctly (pandas docs on resampling here: http://pandas.pydata.org/pandas-docs/stable/timeseries.html).
weekly_data = daily_data.resample('7D',how=np.sum,closed='left',label='left')
If the only issue is that you'd like the DatetimeIndex replicated in a column, you can just do this:
weekly_data['date'] = weekly_data.index.values
Apologies if I misunderstood the question. :)
You can only resample by numeric columns:
In [11]: df = pd.DataFrame([[pd.Timestamp('1/1/2012'), 1, 'a', [1]], [pd.Timestamp('1/2/2012'), 2, 'b', [2]]], columns=['date', 'no', 'letter', 'li'])
In [12]: df1 = df.set_index('date', drop=False)
In [13]: df1
Out[13]:
date no letter li
date
2012-01-01 2012-01-01 00:00:00 1 a [1]
2012-01-02 2012-01-02 00:00:00 2 b [2]
In [15]: df1.resample('M', how=np.sum)
Out[15]:
no
date
2012-01-31 3
We can see that it uses the dtype to determine whether it's numeric:
In [16]: df1.no = df1.no.astype(object)
In [17]: df1.resample('M', how=sum)
Out[17]:
date no letter li
date
2012-01-31 0 0 0 0
An awful hack for actual summing:
In [21]: rng = pd.date_range(weekly_start_date, weekly_end_date, freq='M')
In [22]: g = df1.groupby(rng.asof)
In [23]: g.apply(lambda t: t.apply(lambda x: x.sum(1))).unstack()
Out[23]:
date no letter li
2011-12-31 2650838400000000000 3 ab [1, 2]
The date is the sum of the epoch nanoseconds...
(Hopefully I'm doing something silly, and there's an easier way!)
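For readers on modern pandas: the how= keyword was removed long ago; the equivalent today is a chained aggregation, and resampling with on='date' plus reset_index() keeps the date as a regular column. A sketch with made-up daily data:

```python
import pandas as pd

daily_data = pd.DataFrame({
    'date': pd.date_range('2011-01-05', periods=14, freq='D'),
    'value': range(14),
})

# Modern spelling of resample(...).sum(); numeric_only skips non-numeric columns
weekly_data = (daily_data.resample('7D', on='date', closed='left', label='left')
               .sum(numeric_only=True)
               .reset_index())
print(weekly_data)
# Two weekly rows: 2011-01-05 -> 21, 2011-01-12 -> 70
```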