Add business days to pandas dataframe with dates and skip over holidays python - pandas

I have a dataframe with dates, as shown in the tables below. The first block is what the result should look like; the second block is what I get when I simply add the BDays. I want to take the first column and add 5 business days to each date, but if those 5 business days span a holiday (like 15 Feb 2021), I need to add one additional day. Adding 5 business days is fairly simple with from pandas.tseries.offsets import BDay, but I cannot work out how to skip the holidays when applying it to the dataframe.
I have tried from pandas.tseries.holiday import USFederalHolidayCalendar, as well as the workdays and workalendar modules, but cannot figure it out. Does anyone have an idea what I can do?
Correct Example
DATE        EXIT DATE +5
2021/02/09  2021/02/17
2021/02/10  2021/02/18
Wrong Example
DATE        EXIT DATE +5
2021/02/09  2021/02/16
2021/02/10  2021/02/17
Here are some examples of code I tried:
import pandas as pd
from workdays import workday
...
df['DATE'] = workday(df['EXIT DATE +5'], days=5, holidays=holidays)
Next Example:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt = df['DATE']
df['EXIT DATE +5'] = dt + bday_us
=========================================
Final code:
Below is the code I finally settled on. I had to define the holidays manually because of the days the NYSE actually trades, for instance the day President Bush was laid to rest.
import datetime as dt
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import BDay
from pandas.tseries.holiday import AbstractHolidayCalendar, Holiday, nearest_workday, \
    USMartinLutherKingJr, USPresidentsDay, GoodFriday, USMemorialDay, \
    USLaborDay, USThanksgivingDay
class USTradingCalendar(AbstractHolidayCalendar):
    rules = [
        Holiday('NewYearsDay', month=1, day=1, observance=nearest_workday),
        USMartinLutherKingJr,
        USPresidentsDay,
        GoodFriday,
        USMemorialDay,
        Holiday('USIndependenceDay', month=7, day=4, observance=nearest_workday),
        Holiday('BushDay', year=2018, month=12, day=5),
        USLaborDay,
        USThanksgivingDay,
        Holiday('Christmas', month=12, day=25, observance=nearest_workday)
    ]
offset = 5
df = pd.DataFrame(['2019-10-11', '2019-10-14', '2017-04-13', '2018-11-28', '2021-07-02'], columns=['DATE'])
df['DATE'] = pd.to_datetime(df['DATE'])
def offset_date(start, offset):
    return start + pd.offsets.CustomBusinessDay(n=offset, calendar=USTradingCalendar())
df['END'] = df.apply(lambda x: offset_date(x['DATE'], offset), axis=1)
print(df)

Input data
df = pd.DataFrame(['2021-02-09', '2021-02-10', '2021-06-28', '2021-06-29', '2021-07-02'], columns=['DATE'])
df['DATE'] = pd.to_datetime(df['DATE'])
Suggested solution using apply
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import BDay
def offset_date(start, offset):
    return start + pd.offsets.CustomBusinessDay(n=offset, calendar=USFederalHolidayCalendar())
offset = 5
df['END'] = df.apply(lambda x: offset_date(x['DATE'], offset), axis=1)
DATE END
2021-02-09 2021-02-17
2021-02-10 2021-02-18
2021-06-28 2021-07-06
2021-06-29 2021-07-07
2021-07-02 2021-07-12
PS: If you want to use a particular calendar such as the NYSE, instead of the default USFederalHolidayCalendar, I recommend following the instructions on this answer, about creating a custom calendar.
Alternative solution which I do not recommend
Currently, to the best of my knowledge, pandas does not support a vectorized approach to your problem. But if you want to follow an approach similar to the one you mentioned, here is what you should do.
First, you will have to define an arbitrary far away end date that includes all the periods you might need and use it to create a list of holidays.
holidays = USFederalHolidayCalendar().holidays(start='2021-02-09', end='2030-02-09')
Then, you pass the holidays list to CustomBusinessDay through the holidays parameter instead of the calendar to generate the desired offset.
offset = 5
bday_us = pd.offsets.CustomBusinessDay(n=offset, holidays=holidays)
df['END'] = df['DATE'] + bday_us
However, this type of approach is not a true vectorized solution, even though it might seem like it. See the following SO answer for further clarification. Under the hood, this approach is probably doing a conversion that is not efficient. This is why it yields the following warning.
PerformanceWarning: Non-vectorized DateOffset being applied to Series or DatetimeIndex
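If the PerformanceWarning is a concern, one possible way around it is to drop down to NumPy, whose busday_offset function operates on whole arrays. The following is only a sketch under a few assumptions: the holiday window (2021 to 2030) is arbitrary, the column names DATE and END are taken from the examples above, and roll='backward' is chosen so that a start date falling on a weekend or holiday is first moved to the previous business day before counting.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

df = pd.DataFrame({'DATE': pd.to_datetime(
    ['2021-02-09', '2021-02-10', '2021-06-28', '2021-06-29', '2021-07-02'])})

# Holidays as day-resolution NumPy dates (the end date is an arbitrary assumption).
holidays = (USFederalHolidayCalendar()
            .holidays(start='2021-01-01', end='2030-12-31')
            .values.astype('datetime64[D]'))

# busday_offset processes the whole array in one call (Mon-Fri week by default).
start_days = df['DATE'].values.astype('datetime64[D]')
end_days = np.busday_offset(start_days, 5, roll='backward', holidays=holidays)

df['END'] = pd.to_datetime(end_days)
print(df)
```

For the five sample dates this gives the same END values as the apply-based solution above. A custom trading calendar could be used instead by passing its own holidays() result.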

Here's one way to do it
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from datetime import timedelta as td
def get_exit_date(date):
    holiday_list = cals.holidays(start=date, end=date + td(weeks=2)).tolist()
    # 6 periods since start date is included in set
    n_bdays = pd.bdate_range(start=date, periods=6, freq='C', holidays=holiday_list)
    return n_bdays[-1]
df = pd.read_clipboard()
cals = USFederalHolidayCalendar()
# I would convert this to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
df['EXIT DATE +5'] = df['DATE'].apply(get_exit_date)
This uses bdate_range, which returns a DatetimeIndex.
Results:
DATE EXIT DATE +5
0 2021-02-09 2021-02-17
1 2021-02-10 2021-02-18
Another option, instead of dynamically creating the holiday list, is to choose a start date and build the list once outside the function, like so:
def get_exit_date(date):
    # 6 periods since start date is included in set
    n_bdays = pd.bdate_range(start=date, periods=6, freq='C', holidays=holiday_list)
    return n_bdays[-1]
df = pd.read_clipboard()
cals = USFederalHolidayCalendar()
holiday_list = cals.holidays(start='2021-01-01').tolist()
# I would convert this to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
df['EXIT DATE +5'] = df['DATE'].apply(get_exit_date)

Related

How to categorize a range of hours in Pandas?

In my project I am trying to create a new column that categorizes records by hour. I have a column in the dataframe called 'TowedTime' with time values, and I want another column that categorizes each record by the full hour, without minutes. For example, if the value in 'TowedTime' is 09:32:10, it should be categorized as 9 AM; if it is 12:45:10, it should be categorized as 12 PM, and so on for all the other values. I've read about the .cut function and bins, but I can't get the result I want.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
df = pd.read_excel("Baltimore Towing Division.xlsx",sheet_name="TowingData")
df['Month'] = pd.DatetimeIndex(df['TowedDate']).strftime("%b")
df['Week day'] = pd.DatetimeIndex(df['TowedDate']).strftime("%a")
monthOrder = ['Jan', 'Feb', 'Mar', 'Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
dayOrder = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
pivotHours = pd.pivot_table(df, values='TowedDate',index='TowedTime',
                            columns='Week day',
                            fill_value=0,
                            aggfunc= 'count',
                            margins = False, margins_name='Total').reindex(dayOrder,axis=1)
print(pivotHours)
First, make sure the type of the column 'TowedTime' is datetime. Second, you can easily extract the hour from this data type.
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S')
df['hour'] = df['TowedTime'].dt.hour
hope it answers your question
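To get labels like "9 AM" or "12 PM" as asked in the question, rather than the bare hour number, the parsed times can be formatted with strftime. This is a minimal sketch: the sample values and the HourLabel column name are made up for illustration, and the AM/PM text produced by %p depends on the system locale.

```python
import pandas as pd

# Hypothetical sample values in the same HH:MM:SS shape as the question.
df = pd.DataFrame({'TowedTime': ['09:32:10', '12:45:10', '00:05:00', '17:59:59']})

parsed = pd.to_datetime(df['TowedTime'], format='%H:%M:%S')

# Numeric hour (0-23), handy for sorting or pivoting.
df['Hour'] = parsed.dt.hour

# 12-hour label; %I is zero-padded, so strip the leading zero.
df['HourLabel'] = parsed.dt.strftime('%I %p').str.lstrip('0')
print(df)
```

The numeric Hour column is usually the better pivot index because it sorts correctly, while HourLabel is only meant for display.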
With the help of @Fabien C I was able to solve the problem.
First, I checked the data type of the values in the 'TowedTime' column with dtypes and found that they were object.
I then tried converting 'TowedTime' to datetime:
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.time
Then to create a new column in the df, for only the hours:
df['Hour'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.hour
And the result was this:
You can see in the image that the 'TowedTime' column remains an object, but the new 'Hour' column correctly returns the hour value.
Originally, the dataset already had the date and time separated into different columns. I think some method was used in Excel to separate date and time, and this made the time ('TowedTime') an object that I could not convert, or at least that's what dtypes shows me.
I tried all of these pandas methods for converting the object to datetime:
df['TowedTime'] = pd.to_datetime(df['TowedTime'])
df['TowedTime'] = df['TowedTime'].astype('datetime64[ns]')
df['TowedTime'] = pd.to_datetime(df['TowedTime'], format='%H:%M:%S')

Daily to Weekly Pandas conversion

I am trying to convert my 15 years' worth of daily data into weekly data by taking the mean, diff and count of certain features. I tried using .resample, but I was not sure whether that is the most efficient way.
My sample data:
Date,Product,New Quantity,Price,Refund Flag
8/16/1994,abc,10,0.5,
8/17/1994,abc,11,0.9,1
8/18/1994,abc,15,0.6,
8/19/1994,abc,19,0.4,
8/22/1994,abc,22,0.2,1
8/23/1994,abc,19,0.1,
8/16/1994,xyz,16,0.5,1
8/17/1994,xyz,10,0.9,1
8/18/1994,xyz,12,0.6,1
8/19/1994,xyz,19,0.4,
8/22/1994,xyz,26,0.2,1
8/23/1994,xyz,30,0.1,
8/16/1994,pqr,0,0,
8/17/1994,pqr,0,0,
8/18/1994,pqr,1,1,
8/19/1994,pqr,2,0.6,
8/22/1994,pqr,9,0.1,
8/23/1994,pqr,12,0.2,
This is the output I am looking for:
Date,Product,Net_Quantity_diff,Price_avg,Refund
8/16/1994,abc,9,0.6,1
8/22/1994,abc,-3,0.15,0
8/16/1994,xyz,3,0.6,3
8/22/1994,xyz,4,0.15,1
8/16/1994,pqr,2,0.4,0
8/22/1994,pqr,3,0.15,0
I think the pandas resample method is indeed ideal for this. You can pass a dictionary to the agg method, defining which aggregation function to use for each column. For example:
import numpy as np
import pandas as pd
df = pd.read_csv('sales.txt') # your sample data
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index(df['Date'])
del df['Date']
df['Refund Flag'] = df['Refund Flag'].fillna(0).astype(bool)
def span(s):
    return np.max(s) - np.min(s)
df_weekly = df.resample('W').agg({'New Quantity': span,
                                  'Price': 'mean',
                                  'Refund Flag': 'sum'})
df_weekly
            New Quantity     Price  Refund Flag
Date
1994-08-21            19  0.533333            4
1994-08-28            21  0.150000            2
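The result above aggregates all products together, whereas the desired output in the question has one row per product per week. One way to get there, sketched here under some assumptions, is to group by Product and a weekly pd.Grouper at the same time. The output column names are borrowed from the question, "diff" is interpreted as last value minus first value within the week, and the weeks are labelled by their Sunday end date rather than by the first date in the week.

```python
import io
import pandas as pd

# A subset of the sample data from the question (products abc and pqr).
csv_text = """Date,Product,New Quantity,Price,Refund Flag
8/16/1994,abc,10,0.5,
8/17/1994,abc,11,0.9,1
8/18/1994,abc,15,0.6,
8/19/1994,abc,19,0.4,
8/22/1994,abc,22,0.2,1
8/23/1994,abc,19,0.1,
8/16/1994,pqr,0,0,
8/17/1994,pqr,0,0,
8/18/1994,pqr,1,1,
8/19/1994,pqr,2,0.6,
8/22/1994,pqr,9,0.1,
8/23/1994,pqr,12,0.2,
"""
df = pd.read_csv(io.StringIO(csv_text))
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['Refund Flag'] = df['Refund Flag'].fillna(0).astype(int)

def last_minus_first(s):
    # Assumed meaning of "diff": change from first to last day of the week.
    return s.iloc[-1] - s.iloc[0]

weekly = (
    df.sort_values('Date')
      .groupby(['Product', pd.Grouper(key='Date', freq='W')])
      .agg(Net_Quantity_diff=('New Quantity', last_minus_first),
           Price_avg=('Price', 'mean'),
           Refund=('Refund Flag', 'sum'))
      .reset_index()
)
print(weekly)
```

If the week should be labelled by its first day instead, the Date column can be shifted afterwards, but that choice is left open here.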

Add fractional number of years to date in pandas Python

I have a pandas df that includes two columns: time_in_years (float64) and date (datetime64).
import pandas as pd
df = pd.DataFrame({
'date': ['2009-12-25','2005-01-09','2010-10-31'],
'time_in_years': ['10.3434','5.0977','3.3426']
})
df['date'] = pd.to_datetime(df['date'])
df["time_in_years"] = df.time_in_years.astype(float)
I need to create date2 as a datetime64 column by adding the number of years to the date.
I tried the following but with no luck:
df['date_2'] = df['date'] + datetime.timedelta(years=df['time_in_years'])
I know that with fractions I will not be able to get the exact date, but I want to get the closest new date as possible.
Try package dateutil:
from dateutil.relativedelta import relativedelta
First convert fractional years to number of days, then use lambda function and apply it to dataframe:
df['date_2'] = df.apply(lambda x: x['date'] + relativedelta(days = int(x['time_in_years']*365)), axis = 1)
Result:
date time_in_years date_2
0 2009-12-25 10.3434 2020-04-26
1 2005-01-09 5.0977 2010-02-12
2 2010-10-31 3.3426 2014-03-04
datetime.timedelta also works fine:
df['date_2'] = df.apply(lambda x: x['date'] + datetime.timedelta(days = int(x['time_in_years']*365)), axis = 1)
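Please note that the conversion to int only truncates the result to whole days. datetime.timedelta accepts fractional days, so int() is optional there; relativedelta, on the other hand, rejects fractional years and months, which is why the years are converted to days first.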
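The same calculation can also be done without apply, by converting the whole column to timedeltas at once with pd.to_timedelta. This is a sketch that keeps the 365 days per year approximation used above; flooring to the day is an optional step added here so that the result matches the truncated values shown in the table.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2009-12-25', '2005-01-09', '2010-10-31']),
    'time_in_years': [10.3434, 5.0977, 3.3426],
})

# Convert fractional years to a timedelta column (365 days per year is an approximation).
offset = pd.to_timedelta(df['time_in_years'] * 365, unit='D')

# Add the offsets column-wise and drop the time-of-day part.
df['date_2'] = (df['date'] + offset).dt.floor('D')
print(df)
```

Using 365.25 instead of 365 would account for leap years on average, at the cost of no longer matching the table above exactly.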

Pandas- conditional information retrieval with on a date range

I'm still fairly new to pandas, and the script I wrote to accomplish a seemingly easy task seems needlessly complicated. If you know of an easier way to accomplish this, I would be extremely grateful.
Task:
I have two spreadsheets (df1 & df2), each with an identifier (mrn) and a date. My task is to retrieve a value from df2 for each row in df1 if the following conditions are met:
The identifier for a given row in df1 exists in df2.
If the above is true, retrieve the value in df2 if the associated date is within a +/-5 day range of the date in df1.
I have written the following code which accomplishes this:
#%%housekeeping
import numpy as np
import pandas as pd
import csv
import datetime
from datetime import datetime, timedelta
import sys
from io import StringIO
#%%dataframe import
df1=',mrn,date,foo\n0,1,2015-03-06,n/a\n1,11,2009-08-14,n/a\n2,14,2009-05-18,n/a\n3,20,2010-06-19,n/a\n'
df2=',mrn,collection Date,Report\n0,1,2015-03-06,report to import1\n1,11,2009-08-12,report to import11\n2,14,2009-05-21,report to import14\n3,20,2010-06-25,report to import20\n'
df1 = pd.read_csv(StringIO(df1))
df2 = pd.read_csv(StringIO(df2))
#converting to date-time format
df1['date']=pd.to_datetime(df1['date'])
df2['collection Date']=pd.to_datetime(df2['collection Date'])
#%%mask()
def mask(df2, rangeTime):
    # True where the df2 date lies within 5 days of rangeTime
    in_range = (df2 > rangeTime - timedelta(days=5)) & (df2 <= rangeTime + timedelta(days=5))
    return in_range
#%% detailLoop()
i=0
for element in df1["mrn"]:
    # .ix has been removed from pandas; .loc does the same label-based lookup here
    df1DateIter = df1.loc[i, 'date']
    df2MRNmatch= df2.loc[df2['mrn']==element, ['collection Date', 'Report']]
    df2Date= df2MRNmatch['collection Date']
    df2Report= df2MRNmatch['Report']
    maskOut= mask(df2Date, df1DateIter)
    dateBoolean= maskOut.iloc[0]
    if dateBoolean:
        df1.loc[i, 'foo'] = df2Report.iloc[0]
    i+=1
#: once the script has been run the df1 looks like:
Out[824]:
mrn date foo
0 1 2015-03-06 report to import1
1 11 2009-08-14 report to import11
2 14 2009-05-18 report to import14
3 20 2010-06-19 NaN
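A possibly simpler way to get the same result is to merge the two frames on mrn and then keep the report only where the date difference falls inside the window. This is a sketch, not a drop-in replacement: it assumes at most one df2 row per mrn (as in the sample data), and it reuses the same bounds as the mask() function above (more than 5 days before is excluded, exactly 5 days after is included).

```python
import pandas as pd
from io import StringIO

df1 = pd.read_csv(StringIO(
    'mrn,date\n1,2015-03-06\n11,2009-08-14\n14,2009-05-18\n20,2010-06-19\n'),
    parse_dates=['date'])
df2 = pd.read_csv(StringIO(
    'mrn,collection Date,Report\n'
    '1,2015-03-06,report to import1\n'
    '11,2009-08-12,report to import11\n'
    '14,2009-05-21,report to import14\n'
    '20,2010-06-25,report to import20\n'),
    parse_dates=['collection Date'])

# Left merge keeps every df1 row, even if its mrn is missing from df2.
merged = df1.merge(df2, on='mrn', how='left')

# Same window as mask(): date - 5 days < collection date <= date + 5 days.
delta = merged['collection Date'] - merged['date']
window = pd.Timedelta(days=5)
within = (delta > -window) & (delta <= window)

# Keep the report only where the date condition holds, NaN otherwise.
merged['foo'] = merged['Report'].where(within)
result = merged[['mrn', 'date', 'foo']]
print(result)
```

If df2 can contain several rows per mrn, the merge would produce several candidate rows per df1 row and an extra step would be needed to pick one of them.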

subset a data frame based on date range [duplicate]

I have a Pandas DataFrame with a 'date' column. Now I need to filter out all rows in the DataFrame that have dates outside of the next two months. Essentially, I only need to retain the rows that are within the next two months.
What is the best way to achieve this?
If date column is the index, then use .loc for label based indexing or .iloc for positional indexing.
For example:
df.loc['2014-01-01':'2014-02-01']
See details here http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection
If the column is not the index you have two choices:
1. Make it the index (either temporarily or permanently if it's time-series data)
2. df[(df['date'] > '2013-01-01') & (df['date'] < '2013-02-01')]
See here for the general explanation
Note: .ix is deprecated.
The previous answer is not correct in my experience: you can't pass it a simple string, it needs to be a datetime object. So:
import datetime
df.loc[datetime.date(year=2014,month=1,day=1):datetime.date(year=2014,month=2,day=1)]
And if your dates are standardized by importing datetime package, you can simply use:
df[(df['date']>datetime.date(2016,1,1)) & (df['date']<datetime.date(2016,3,1))]
To standardize your date strings using the datetime package, you can use this function:
import datetime
datetime.datetime.strptime
If you have already converted the string to a date format using pd.to_datetime you can just use:
df = df[(df['Date'] > "2018-01-01") & (df['Date'] < "2019-07-01")]
The shortest way to filter your dataframe by date:
Let's suppose your date column is of type datetime64[ns]:
# filter by single day
df_filtered = df[df['date'].dt.strftime('%Y-%m-%d') == '2014-01-01']
# filter by single month
df_filtered = df[df['date'].dt.strftime('%Y-%m') == '2014-01']
# filter by single year
df_filtered = df[df['date'].dt.strftime('%Y') == '2014']
If your datetime column has the pandas datetime type (e.g. datetime64[ns]), for proper filtering you need a pd.Timestamp object, for example:
from datetime import date
import pandas as pd
value_to_check = pd.Timestamp(date.today().year, 1, 1)
filter_mask = df['date_column'] < value_to_check
filtered_df = df[filter_mask]
If the dates are in the index then simply:
df['20160101':'20160301']
You can use pd.Timestamp to perform a query and a local reference
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame()
ts = pd.Timestamp
df['date'] = np.array(np.arange(10) + datetime.now().timestamp(), dtype='M8[s]')
print(df)
print(df.query('date > @ts("20190515T071320")'))
with the output
date
0 2019-05-15 07:13:16
1 2019-05-15 07:13:17
2 2019-05-15 07:13:18
3 2019-05-15 07:13:19
4 2019-05-15 07:13:20
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25
date
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25
Have a look at the pandas documentation for DataFrame.query, specifically the mention of local variables referenced using the @ prefix. In this case we reference pd.Timestamp using the local alias ts to be able to supply a timestamp string.
So when loading the csv data file, we'll need to set the date column as index now as below, in order to filter data based on a range of dates. This was not needed for the now deprecated method: pd.DataFrame.from_csv().
If you just want to show the data for two months from Jan to Feb, e.g. 2020-01-01 to 2020-02-29, you can do so:
import pandas as pd
mydata = pd.read_csv('mydata.csv', index_col='date', parse_dates=True) # or its index number, e.g. index_col=[0]
mydata.loc['2020-01-01':'2020-02-29'] # will pull all the columns
#if just need one column, e.g. Cost, can be done:
mydata.loc['2020-01-01':'2020-02-29','Cost']
This has been tested working for Python 3.7. Hope you will find this useful.
I'm not allowed to write any comments yet, so I'll write an answer, if somebody will read all of them and reach this one.
If the index of the dataset is a datetime and you want to filter that just by (for example) months, you can do following:
df.loc[df.index.month == 3]
That will filter the dataset for you by March.
How about using pyjanitor
It has cool features.
After pip install pyjanitor
import janitor
df_filtered = df.filter_date(your_date_column_name, start_date, end_date)
You could just select the time range by doing: df.loc['start_date':'end_date']
In pandas version 1.1.3 I encountered a situation where the python datetime based index was in descending order. In this case
df.loc['2021-08-01':'2021-08-31']
returned empty. Whereas
df.loc['2021-08-31':'2021-08-01']
returned the expected data.
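One way to avoid having to think about the slice direction is to sort the index first. The sketch below builds a small hypothetical frame with a descending DatetimeIndex (the dates and the value column are made up) and sorts it before slicing.

```python
import pandas as pd

# Hypothetical frame whose DatetimeIndex is in descending order.
idx = pd.date_range('2021-07-25', '2021-09-05', freq='D')[::-1]
df = pd.DataFrame({'value': range(len(idx))}, index=idx)

# After sort_index() the usual ascending slice works as expected.
august = df.sort_index().loc['2021-08-01':'2021-08-31']
print(august.head())
```

Sorting returns a new frame, so the original descending order of df is left untouched.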
Another solution if you would like to use the .query() method.
It allows you to write readable code like .query(f"{start} < MyDate < {end}"). The trade-off is that .query() parses strings, and the column values must be in pandas date format (so that .query() can also understand them).
df = pd.DataFrame({
'MyValue': [1,2,3],
'MyDate': pd.to_datetime(['2021-01-01','2021-01-02','2021-01-03'])
})
start = datetime.date(2021,1,1).strftime('%Y%m%d')
end = datetime.date(2021,1,3).strftime('%Y%m%d')
df.query(f"{start} < MyDate < {end}")
(following the comment from @Phillip Cloud, answer from @Retozi)
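As a variation on the f-string version, the bounds can be kept as pd.Timestamp objects and referenced inside the query string with the @ prefix, which avoids formatting dates into text. This is a sketch using the same sample frame; the variable names start and end are just local names chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    'MyValue': [1, 2, 3],
    'MyDate': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03'])
})

# Local variables are referenced inside query() with the @ prefix.
start = pd.Timestamp('2021-01-01')
end = pd.Timestamp('2021-01-03')
result = df.query('@start < MyDate < @end')
print(result)
```

Both bounds are exclusive here, so only the row dated 2021-01-02 is kept.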
import the pandas library
import pandas as pd
STEP 1: convert the date column to datetime using the pd.to_datetime() method
df['date']=pd.to_datetime(df["date"],unit='s')
STEP 2: perform the filtering in any predetermined manner (i.e. 2 months)
df = df[(df["date"] > "2022-03-01") & (df["date"] < "2022-05-03")]
STEP 3 : Check the output
print(df)
# 60 days from today, as a pandas Timestamp so it compares cleanly with a datetime64 column
after_60d = pd.Timestamp('today').normalize() + pd.Timedelta(days=60)
# filter date col less than 60 days date
df[df['date_col'] < after_60d]
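For the original question (keep only the rows within the next two months), a calendar-aware variant is to add pd.DateOffset(months=2) to today's date and filter with Series.between. In this sketch the reference date is fixed so the example is reproducible, and the sample frame and its column names are made up; in real use today would be pd.Timestamp.today().normalize().

```python
import pandas as pd

# Fixed reference date for reproducibility (an assumption for this example).
today = pd.Timestamp('2021-01-15')

df = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-10', '2021-01-20', '2021-03-10', '2021-04-01']),
    'value': [1, 2, 3, 4],
})

# Two calendar months ahead, rather than a fixed 60 days.
end = today + pd.DateOffset(months=2)

# between() is inclusive on both ends by default.
next_two_months = df[df['date'].between(today, end)]
print(next_two_months)
```

Rows before the reference date and rows after 2021-03-15 are dropped, which leaves the two middle rows of the sample frame.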