Business day counts between two date series (for countries with a Friday-Saturday weekend) in a DataFrame

I am trying to calculate the number of working days between two date columns for each row. My data covers countries all over the world. For European countries (Saturday-Sunday weekend) I get the working-day counts with:
df['count'] = np.busday_count(df['Start_Date_column'].tolist(), df['Final_Date_column'].tolist())
However, some Muslim-majority countries such as Oman, Bahrain, Kuwait and Qatar have a Friday-Saturday weekend. How can I handle these countries?

In the end I solved the problem by combining a few methods, and I want to share the result with anyone who may need it.
I'm fairly new to Python, so the code can surely be improved, but here is the solution that works for me.
P.S.: I used some European holidays for testing; customize the list to suit your needs.
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

start_date = df['start_date']
end_date = df['final_date']
data = pd.DataFrame(list(zip(start_date, end_date)), columns=['Start Date', 'End Date'])
# Working days for a Friday-Saturday weekend
dubai_workdays = "Sun Mon Tue Wed Thu"
# pd.datetime is no longer available in recent pandas versions, so pd.Timestamp is used
dubai_hol = CustomBusinessDay(
    holidays=[pd.Timestamp(2022, 10, 3),
              pd.Timestamp(2023, 1, 6),
              pd.Timestamp(2022, 12, 26),
              pd.Timestamp(2022, 12, 31),
              pd.Timestamp(2022, 12, 25)],
    weekmask=dubai_workdays)
# Count the business days in each row (both endpoints are included)
data['Bus_Days'] = data.apply(
    lambda x: len(pd.bdate_range(x['Start Date'], x['End Date'], freq=dubai_hol)),
    axis=1)
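On larger frames the row-wise apply can get slow. As a possible alternative, np.busday_count accepts a weekmask and a holiday list and works on whole arrays. The sketch below uses made-up dates and a single illustrative holiday (both are assumptions, not data from the question). Note that np.busday_count excludes the end date, whereas pd.bdate_range includes it, so one day is added to the end dates to get an inclusive count.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the column names follow the answer above
df = pd.DataFrame({
    "start_date": pd.to_datetime(["2022-12-25", "2023-01-01"]),
    "final_date": pd.to_datetime(["2022-12-31", "2023-01-08"]),
})

# Hypothetical holiday list; replace with the real holidays of the country
holidays = np.array(["2022-12-26"], dtype="datetime64[D]")

# np.busday_count needs day-resolution datetime64 values
starts = df["start_date"].to_numpy().astype("datetime64[D]")
ends = df["final_date"].to_numpy().astype("datetime64[D]")

weekmask = "Sun Mon Tue Wed Thu"  # Friday-Saturday weekend

# End date excluded (NumPy's convention)
df["bus_days"] = np.busday_count(starts, ends, weekmask=weekmask, holidays=holidays)

# End date included, comparable to len(pd.bdate_range(...))
df["bus_days_inclusive"] = np.busday_count(
    starts, ends + np.timedelta64(1, "D"), weekmask=weekmask, holidays=holidays
)
print(df)
```

For several countries, one option is to keep a weekmask per country in a dictionary and run this once per group.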

Related

Capturing the Timestamp values from resampled DataFrame

Calling .resample() followed by an aggregation such as .mean() yields a DataFrame whose DatetimeIndex has a frequency.
How can I iterate over the values of that DatetimeIndex?
df = pd.DataFrame(
data=np.random.randint(0, 10, 100),
index=pd.date_range('20220101', periods=100),
columns=['a'],
)
df.resample('M').mean()
If you iterate, each entry has the form Timestamp('2022-11-XX…', freq='M'), but I did not manage to get the date only.
df.resample('M').mean().index[0]
Timestamp('2022-01-31 00:00:00', freq='M')
My aim is to collect all the dates in a list, for instance.
Thanks for your help!
You can convert each entry in the index into a datetime.date object using .date, and then into a list using .tolist(), as below:
>>> df.resample('M').mean().index.date.tolist()
[datetime.date(2022, 1, 31), datetime.date(2022, 2, 28), datetime.date(2022, 3, 31), datetime.date(2022, 4, 30)]
You can also truncate the timestamps to day precision as follows (reference solution):
>>> df.resample('M').mean().index.values.astype('<M8[D]')
array(['2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30'],
dtype='datetime64[D]')
This solution seems to work fine for both dates and periods (g is the DataFrame being resampled):
I = [k.strftime('%Y-%m') for k in g.resample('M').groups]
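To put the approaches above side by side, here is a minimal self-contained sketch. The data is made up, and pd.offsets.MonthEnd() is passed instead of the 'M' string only because the string alias has been renamed in newer pandas releases; that substitution is my assumption, not part of the original answers.

```python
import numpy as np
import pandas as pd

# 100 daily observations starting on 2022-01-01 (illustrative data)
df = pd.DataFrame(
    {"a": np.arange(100)},
    index=pd.date_range("2022-01-01", periods=100),
)

# Month-end resampling; the offset object avoids version-specific aliases
monthly = df.resample(pd.offsets.MonthEnd()).mean()

# Iterating over the index yields pandas Timestamp objects
dates = [ts.date() for ts in monthly.index]

# The same information as plain strings
labels = monthly.index.strftime("%Y-%m-%d").tolist()

print(dates)
print(labels)
```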

How do you iterate two variables in DataFrame, one of which is autoincrementing year?

import pandas as pd
df = pd.DataFrame(
[['New York', 1995, 160000],
['Philadelphia', 1995, 115000],
['Boston', 1995, 145000],
['New York', 1996, 167500],
['Philadelphia', 1996, 125000],
['Boston', 1996, 148000],
['New York', 1997, 180000],
['Philadelphia', 1997, 135000],
['Boston', 1997, 185000],
['New York', 1998, 200000],
['Philadelphia', 1998, 145000],
['Boston', 1998, 215000]],
index = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ,11, 12],
columns = ['city', 'year', 'average_price'])
def percent_change(d):
    y1995 = float(d['average_price'][d['year']==1995])
    y1996 = float(d['average_price'][d['year']==1996])
    ratio = str(round(((y1996 / y1995)-1)*100,2)) + '%'
    return ratio

city = df[df['city']=='New York']
percent_change(city)

my_final = {}
for c in df['city'].unique():
    city = df[df['city'] == c]
    my_final[c] = percent_change(city)
print(my_final)
My goal is to get the year-over-year percentage change for each city so that I can plot the changes on a line chart. So far I can only work out (crudely) how to do it for one pair of years, and even then I don't think I'm properly attaching the year to the result. I don't know how to iterate over ALL the years. I'm so confused, but if someone can help me out, I feel like I can truly start to learn.
So, from 1995 to 1996 the percentage change in price is as follows:
{'New York': '4.69%', 'Philadelphia': '8.7%', 'Boston': '2.07%'}
Going through examples was easy, but the data was so abstract to me. Now that I have actual information that I want, I don't know how to process it.
We can use pivoting and rolling windows to achieve the desired output:
relative_changes = (
    df
    # recent pandas versions require keyword arguments for pivot
    .pivot(index='year', columns='city', values='average_price')
    .rolling(window=2)
    .apply(lambda price: price.iloc[1]/price.iloc[0] - 1)
    .dropna()
)
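As a cross-check, the same year-over-year ratios can be obtained with DataFrame.pct_change, swapped in here for the rolling window. This is a minimal sketch on a subset of the question's data, assuming the years are consecutive and sorted after pivoting.

```python
import pandas as pd

# Subset of the question's data, enough to check the numbers
df = pd.DataFrame(
    [["New York", 1995, 160000],
     ["Boston", 1995, 145000],
     ["New York", 1996, 167500],
     ["Boston", 1996, 148000],
     ["New York", 1997, 180000],
     ["Boston", 1997, 185000]],
    columns=["city", "year", "average_price"],
)

# One row per year, one column per city
prices = df.pivot(index="year", columns="city", values="average_price")

# pct_change compares each row with the previous one; the first year has
# no predecessor and is dropped
relative_changes = prices.pct_change().dropna()

print((relative_changes * 100).round(2))
```

The 1996 row reproduces the 4.69% (New York) and 2.07% (Boston) figures from the question.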
I prefer not to hardcode the formatting into the data, so that the values can still be used in further calculations. Formatting can be applied later when needed, for example when displaying the data on screen:
display(
relative_changes
.style
.format("{:.2%}")
.set_caption("Relative changes")
)
The same with charts:
ax = relative_changes.plot(kind='bar', figsize=(10,6))
ax.xaxis.set_tick_params(labelrotation=0)
ax.yaxis.set_major_formatter(lambda y, pos: f'{y:.0%}')
ax.yaxis.grid(linestyle='--', linewidth=0.8)
ax.set_title("Relative changes of the average price")

How to convert Multi-Index into a Heatmap

I'm new to pandas/Python and have managed to build an index like the one below:
MultiIndex([( 1, 1, 4324),
( 1, 2, 8000),
( 1, 3, 8545),
( 1, 4, 8544),
( 1, 5, 7542),
(12, 30, 7854),
(12, 31, 7511)],
names=['month', 'day', 'count'], length=366)
I'm struggling to work out how to store the first number (the month, 1-12) in one list, the second number (the day, 1-31) in another list, and the third number (the count, 0-9000) in a third, separate list.
I am trying to build a heatmap with month and day on the axes and count as the values, and I am failing horribly! I assume I have to separate month, day and count into separate lists to make the heatmap?
data1 = pd.read_csv("a2data/Data1.csv")
data2 = pd.read_csv("a2data/Data2.csv")
merged_df = pd.concat([data1, data2])
merged_df.set_index(['month', 'day'], inplace=True)
merged_df.sort_index(inplace=True)
# select the 'count' column with ['count']; .count is the groupby count() method
merged_df2 = merged_df.groupby(['month', 'day'])['count'].mean().reset_index()
merged_df2.set_index(['month', 'day', 'count'], inplace=True)
# struggling here to separate out month, day and count in order to make a heatmap
Are you looking for this?
# start from the grouped means (['count'] selects the column)
merged_df2 = merged_df.groupby(['month', 'day'])['count'].mean()
# use seaborn
import seaborn as sns
sns.heatmap(merged_df2.unstack('day'))
Or you can use matplotlib:
import matplotlib.pyplot as plt
import numpy as np
merged_df2 = merged_df.groupby(['month', 'day'])['count'].mean().unstack('day')
plt.imshow(merged_df2)
plt.xticks(np.arange(merged_df2.shape[1]), merged_df2.columns)
plt.yticks(np.arange(merged_df2.shape[0]), merged_df2.index)
plt.show()
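The reshaping that both plotting calls rely on can be checked without any plotting library. The sketch below uses a tiny made-up table (the values are assumptions) and also shows how to pull the index levels into separate lists, which is what the question originally asked for, even though the heatmap does not need them.

```python
import pandas as pd

# Tiny illustrative stand-in for the merged CSV data
records = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2],
    "day":   [1, 2, 2, 1, 1, 2],
    "count": [10, 20, 40, 5, 15, 8],
})

# Mean count per (month, day); the result is a Series with a MultiIndex
mean_counts = records.groupby(["month", "day"])["count"].mean()

# Separate lists, if they are really needed
months = mean_counts.index.get_level_values("month").tolist()
days = mean_counts.index.get_level_values("day").tolist()
values = mean_counts.tolist()

# Month x day grid: this 2-D table is what a heatmap function expects
grid = mean_counts.unstack("day")
print(grid)
```

Missing (month, day) combinations, such as 30 February, appear as NaN in the grid.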

Calculate the integral of a Pandas DataFrame column for specific time intervals (e.g. per day) using time index

I have a DataFrame (df) that contains one year of power sensor data, sampled at irregular intervals. It looks similar to this:
rng = pd.date_range('2020-07-30 12:00:00', periods=24, freq='6H')
df = pd.DataFrame(np.array([1, 4, 5, 2, 1, 6, 1, 4, 5, 2, 1, 6, 1, 4, 5, 2, 1, 6, 1, 4, 5, 2, 1, 6]), rng, columns=['power'])
df.index.name = 'Date'
df["month"] = df.index.month
df["week"] = df.index.isocalendar().week  # DatetimeIndex.week is not available in recent pandas versions
I want to calculate the integral for each day and then sum these daily integrals over longer periods, e.g. weekly or monthly.
For the whole DataFrame the following give correct answers (they use the time index as the x-axis):
np.trapz(df["power"], df.index, axis=0)/np.timedelta64(1, 'h')
or
df.apply(integrate.trapz, args=(df.index,))/np.timedelta64(1, 'h')
To integrate per day, I tried:
df.groupby(df.index.date)["power"].apply(np.trapz)
This has two problems:
1. It assumes that the "power" measurements are equally spaced, one unit of time apart.
2. It ignores the contribution of the interval that crosses midnight at the start of each day (e.g. for 31/7/2020 the value should be 13, but it calculates 8.5).
I also tried:
df.groupby(df.index.date)["power"].apply(integrate.trapz, args=(df.index,))
but I get: TypeError: trapz() got an unexpected keyword argument 'args'
I would like my results to look like:
Date Energy(kWh)
2020-07-30 15
2020-07-31 78
2020-08-01 84
2020-08-02 66
2020-08-03 78
2020-08-04 84
2020-08-05 30
and then to be able to groupby e.g.
df = df.groupby(["month", "week"])["power"].sum()
and the result looks like:
month week Energy(kWh)
7 31 93
8 31 150
32 192
So how can I use the index of my original DataFrame as the time axis in the per-day integration?
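Try this (note that the rolling window is a trailing 24-hour window rather than a calendar day, and that trapz is called without x values, so samples are treated as one unit apart):
from scipy import integrate
dp = df.set_index('Date')  # skip this step if 'Date' is already the index, as in the question's example
dp['Energy(kWh)'] = dp["power"].rolling('1D').apply(integrate.trapz)  # '1D' is a one-day window; '1H' or '1S' also work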
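Another way to bring the time index into the integration is to compute the trapezoid area of every interval between consecutive samples and assign it to the day in which the interval ends. This is only a sketch under two assumptions of mine: power is in kW, and the "interval belongs to the day it ends in" rule is what the expected table in the question implies (it does reproduce that table). The names segment_energy, daily and by_week are hypothetical.

```python
import pandas as pd

# Same data as in the question; a Timedelta is used as the frequency
rng = pd.date_range("2020-07-30 12:00:00", periods=24, freq=pd.Timedelta(hours=6))
df = pd.DataFrame({"power": [1, 4, 5, 2, 1, 6] * 4}, index=rng)
df.index.name = "Date"

# Length of each interval in hours (NaN for the first sample)
hours = df.index.to_series().diff().dt.total_seconds() / 3600

# Trapezoid rule: mean of the two endpoint powers times the interval length
mean_power = (df["power"] + df["power"].shift()) / 2
segment_energy = (hours * mean_power).fillna(0)

# Each interval is labelled by its end timestamp, so a daily resample
# includes the interval that crosses midnight into that day
daily = segment_energy.resample("D").sum()

# Further aggregation, e.g. by month and ISO week
summary = daily.to_frame("energy_kwh")
summary["month"] = summary.index.month
summary["week"] = summary.index.isocalendar().week.astype(int)
by_week = summary.groupby(["month", "week"])["energy_kwh"].sum()

print(daily)
print(by_week)
```

Because the interval lengths come from the index itself, the same code should also work for irregularly sampled data.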

How to find the number of hours between two dates excluding weekends and certain holidays in Python? BusinessHours package

I'm trying to find a very clean method to calculate the number of hours between two dates excluding weekends and certain holidays.
I found that the package BusinessHours (https://pypi.python.org/pypi/BusinessHours/1.01) can do this. However, I could not find any instructions on how to use the package (the syntax, in particular), especially how to pass in the holidays.
I found the source code of the package (https://github.com/dnel/BusinessHours/blob/master/BusinessHours.py), but I am still not sure.
I guess it could be something like this:
date1 = pd.to_datetime('2017-01-01 00:00:00')
date2 = pd.to_datetime('2017-01-22 12:00:00')
import BusinessHour
gethours(date1, date2, worktiming=[8, 17], weekends=[6, 7])
Still, where can I input the holidays? And if I do not want to exclude non-office hours, do I just set worktiming=[0,23]?
If anyone knows how to use this package, please tell me about it. I appreciate it.
P.S.: I know NumPy has a function for the number of business days between two dates (busday_count), but nothing that returns the result in hours. Any other pandas or NumPy functions that can do the task are welcome too.
Thank you
Try the business-duration package, available on PyPI.
Example Code
from business_duration import businessDuration
import pandas as pd
from datetime import time,datetime
import holidays as pyholidays
startdate = pd.to_datetime('2017-01-01 00:00:00')
enddate = pd.to_datetime('2017-01-22 12:00:00')
holidaylist = pyholidays.Australia()
unit='hour'
#By default Saturday and Sunday are excluded
print(businessDuration(startdate,enddate,holidaylist=holidaylist,unit=unit))
Output: 335.99611
holidaylist:
{datetime.date(2017, 1, 1): "New Year's Day",
datetime.date(2017, 1, 2): "New Year's Day (Observed)",
datetime.date(2017, 1, 26): 'Australia Day',
datetime.date(2017, 3, 6): 'Canberra Day',
datetime.date(2017, 4, 14): 'Good Friday',
datetime.date(2017, 4, 15): 'Easter Saturday',
datetime.date(2017, 4, 17): 'Easter Monday',
datetime.date(2017, 4, 25): 'Anzac Day',
datetime.date(2017, 6, 12): "Queen's Birthday",
datetime.date(2017, 9, 26): 'Family & Community Day',
datetime.date(2017, 10, 2): 'Labour Day',
datetime.date(2017, 12, 25): 'Christmas Day',
datetime.date(2017, 12, 26): 'Boxing Day'}
Reusing code from various sources, I assembled the code below, which seems to work (for UK holidays); comments on how to improve it are welcome.
I know it is not particularly elegant, but it may help someone.
By the way, I would like to find a way to plug calendars from the Holiday library into it.
In any case, it currently needs only pandas and datetime, which is possibly a plus.
import pandas as pd
import datetime
from pandas.tseries.offsets import CDay
from pandas.tseries.holiday import (
AbstractHolidayCalendar, DateOffset, EasterMonday,
GoodFriday, Holiday, MO,
next_monday, next_monday_or_tuesday)
# This function will calculate the number of working minutes by first
# generating a time series of business days. Then it will calculate the
# precise working minutes for the start and end date, and use the total
# working hours for each day in-between.
def count_mins(starttime, endtime, bus_day_series, bus_start_time, bus_end_time):
    mins_in_working_day = (bus_end_time - bus_start_time) * 60
    # Take the pre-calculated series of business days and select the
    # period given by the function arguments. The calendar could be
    # built inside the function, but to improve performance it is
    # calculated separately and passed in as an argument, provided the
    # dates fall within the calculated range, of course.
    days = bus_day_series[starttime.date():endtime.date()]
    daycount = len(days)
    if len(days) == 0:
        return 0
    else:
        # .iloc gives positional access; plain days[0] relies on a
        # fallback that newer pandas versions deprecate
        first_day_start = days.iloc[0].replace(hour=bus_start_time, minute=0)
        first_day_end = days.iloc[0].replace(hour=bus_end_time, minute=0)
        first_period_start = max(first_day_start, starttime)
        first_period_end = min(first_day_end, endtime)
        if first_period_end <= first_period_start:
            first_day_mins = 0
        else:
            first_day_sec = first_period_end - first_period_start
            first_day_mins = first_day_sec.seconds / 60
        if daycount == 1:
            return first_day_mins
        else:
            # we know the last day will always start at bus_start_time
            last_period_start = days.iloc[-1].replace(hour=bus_start_time, minute=0)
            last_day_end = days.iloc[-1].replace(hour=bus_end_time, minute=0)
            last_period_end = min(last_day_end, endtime)
            if last_period_end <= last_period_start:
                last_day_mins = 0
            else:
                last_day_sec = last_period_end - last_period_start
                last_day_mins = last_day_sec.seconds / 60
            middle_days_mins = 0
            if daycount > 2:
                middle_days_mins = (daycount - 2) * mins_in_working_day
            return first_day_mins + last_day_mins + middle_days_mins
# Calculates the date series with all the business days
# of the period we are interested on
class EnglandAndWalesHolidayCalendar(AbstractHolidayCalendar):
    rules = [
        Holiday('New Years Day', month=1, day=1, observance=next_monday),
        GoodFriday,
        EasterMonday,
        Holiday('Early May bank holiday',
                month=5, day=1, offset=DateOffset(weekday=MO(1))),
        Holiday('Spring bank holiday',
                month=5, day=31, offset=DateOffset(weekday=MO(-1))),
        Holiday('Summer bank holiday',
                month=8, day=31, offset=DateOffset(weekday=MO(-1))),
        Holiday('Christmas Day', month=12, day=25, observance=next_monday),
        Holiday('Boxing Day',
                month=12, day=26, observance=next_monday_or_tuesday)
    ]
# From this point on, this is how we use the function
# Here we hardcode a start/end date to create the list of business days
cal = EnglandAndWalesHolidayCalendar()
dayindex = pd.bdate_range(datetime.date(2019,1,1), datetime.date.today(), freq=CDay(calendar=cal))
day_series = dayindex.to_series()

# Convenience function to simplify how we call the main function
# It takes a pre-calculated day_series.
def bus_hr(ts_start, ts_end, day_series):
    BUS_START = 8
    BUS_END = 20
    minutes = count_mins(ts_start, ts_end, day_series, BUS_START, BUS_END)
    return int(round(minutes/60, 0))
#A set of checks that the function is working properly
assert bus_hr( pd.Timestamp(2019,9,30,6,1,0) , pd.Timestamp(2019,10,1,9,0,0),day_series) == 13
assert bus_hr( pd.Timestamp(2019,10,3,10,30,0) , pd.Timestamp(2019,10,3,23,30,0),day_series)==10
assert bus_hr( pd.Timestamp(2019,8,25,10,30,0) , pd.Timestamp(2019,8,27,10,0,0),day_series) ==2
assert bus_hr( pd.Timestamp(2019,12,25,8,0,0) , pd.Timestamp(2019,12,25,17,0,0),day_series) ==0
assert bus_hr( pd.Timestamp(2019,12,26,8,0,0) , pd.Timestamp(2019,12,26,17,0,0),day_series) ==0
assert bus_hr( pd.Timestamp(2019,12,27,8,0,0) , pd.Timestamp(2019,12,27,17,0,0),day_series) ==9
assert bus_hr( pd.Timestamp(2019,6,24,5,10,44) , pd.Timestamp(2019,6,24,7,39,17),day_series)==0
assert bus_hr( pd.Timestamp(2019,6,24,5,10,44) , pd.Timestamp(2019,6,24,8,29,17),day_series)==0
assert bus_hr( pd.Timestamp(2019,6,24,5,10,44) , pd.Timestamp(2019,6,24,10,0,0),day_series)==2
assert bus_hr(pd.Timestamp(2019,4,30,21,19,0) , pd.Timestamp(2019,5,1,16,17,56),day_series)==8
assert bus_hr(pd.Timestamp(2019,4,30,21,19,0) , pd.Timestamp(2019,5,1,20,17,56),day_series)==12
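For comparison, here is a more compact pandas-only sketch of the same idea: generate the working days with CustomBusinessDay and add up the overlap between each day's working window and the requested interval. The function name business_hours and its parameters are hypothetical, and timezone-naive timestamps are assumed.

```python
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay


def business_hours(start, end, open_hour=8, close_hour=17,
                   weekmask="Mon Tue Wed Thu Fri", holidays=()):
    """Hours between start and end that fall inside working hours on working days."""
    if end <= start:
        return 0.0
    working_day = CustomBusinessDay(weekmask=weekmask, holidays=list(holidays))
    # One midnight timestamp per working day touched by the interval
    days = pd.date_range(start.normalize(), end.normalize(), freq=working_day)
    total = pd.Timedelta(0)
    for day in days:
        # Clip the day's working window to the requested interval
        window_start = max(start, day + pd.Timedelta(hours=open_hour))
        window_end = min(end, day + pd.Timedelta(hours=close_hour))
        if window_end > window_start:
            total += window_end - window_start
    return total / pd.Timedelta(hours=1)


start = pd.Timestamp("2017-01-01 00:00:00")
end = pd.Timestamp("2017-01-22 12:00:00")

hours_no_holidays = business_hours(start, end)
hours_with_holiday = business_hours(start, end, holidays=[pd.Timestamp("2017-01-02")])
hours_fri_sat_weekend = business_hours(start, end, weekmask="Sun Mon Tue Wed Thu")
print(hours_no_holidays, hours_with_holiday, hours_fri_sat_weekend)
```

Setting open_hour=0 and close_hour=24 counts whole working days, which covers the case in the question where office hours should not be excluded.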
The latest pip release of this package (1.2) has an error on line 51: "extraday" needs to be changed to "extradays".
I too have been scouring the internet for workable code to calculate business hours and business days. This package needed a little tweaking, but it works just fine once you get it up and running.
This is what I have in my notebook:
#import BusinessHours
from BusinessHours import BusinessHours as bh
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
date1 = pd.to_datetime('2017-01-01 00:00:00')
date2 = pd.to_datetime('2017-01-22 12:00:00')
bh(date1, date2, worktiming=[8, 17], weekends=[6, 7]).gethours()
This was also in the source code:
'''
holidayfile - A file consisting of the predetermined office holidays.
Each date starts in a new line and currently must only be in the format
dd-mm-yyyy
'''
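Based only on that docstring, the content of a holiday file could be generated as in the sketch below. The helper name and the example dates are assumptions, and how the file is then handed to BusinessHours is not shown because the source quoted here does not confirm it.

```python
import datetime


def holiday_file_text(dates):
    """Return one dd-mm-yyyy date per line, in chronological order."""
    return "".join(d.strftime("%d-%m-%Y") + "\n" for d in sorted(dates))


# Hypothetical holidays, for illustration only
example_holidays = [datetime.date(2017, 12, 25), datetime.date(2017, 1, 2)]
text = holiday_file_text(example_holidays)
print(text)

# To write the file (the file name is an assumption):
# with open("holidays.txt", "w") as fh:
#     fh.write(text)
```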
Hope this helps