pyspark equivalent of pd.offsets.MonthEnd(x) - pandas

In pandas there is the function pd.offsets.MonthEnd(x), such that
Date + pd.offsets.MonthEnd(x) returns the end of the month x months after Date.
pd.to_datetime('2010-01-15') + pd.offsets.MonthEnd(3) = Timestamp('2010-03-31 00:00:00')
I am aware of the function last_day(Date) in pyspark. However, this does not take an offset argument, but simply returns the end of the month. How can I achieve behaviour similar to pd.offsets.MonthEnd(x) in pyspark?

You can combine last_day with add_months:
import pyspark.sql.functions as F
df \
    .withColumn('add_months', F.add_months('my_timestamp', 3)) \
    .withColumn('result', F.last_day('add_months')) \
    .show()
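
One subtlety if you need to match the pandas example exactly: for a date that is not already a month end, pd.offsets.MonthEnd(x) counts the roll-forward to the current month's end as the first of the x steps, so add x - 1 months before taking last_day. A minimal sketch, assuming an active SparkSession named spark (the DataFrame and column name are illustrative):

import pyspark.sql.functions as F

df = spark.createDataFrame([('2010-01-15',)], ['my_timestamp'])
df = df.withColumn('my_timestamp', F.to_date('my_timestamp'))

# add x - 1 = 2 months, then jump to that month's last day, mirroring MonthEnd(3)
df.withColumn('result', F.last_day(F.add_months('my_timestamp', 3 - 1))).show()
# +------------+----------+
# |my_timestamp|    result|
# +------------+----------+
# |  2010-01-15|2010-03-31|
# +------------+----------+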

Related

Get last business day of the month in PySpark without UDF

I would like to get the last business day (LBD) of the month and use it to filter records in a dataframe. I did come up with Python code, but to achieve this functionality I need to use a UDF. Is there any way to get the last business day of the month without using a PySpark UDF?
import calendar

def last_business_day_in_month(calendarYearMonth):
    # calendarYearMonth is in the format YYYYMM
    year = int(calendarYearMonth[0:4])
    month = int(calendarYearMonth[4:])
    # last row of monthcalendar() is the month's final week; [:5] keeps Mon-Fri
    return str(year) + str(month).zfill(2) + str(max(calendar.monthcalendar(year, month)[-1][:5]))

last_business_day_in_month(calendarYearMonth)
Ref: https://stackoverflow.com/a/62392077/6187792
You can calculate it using last_day and checking its dayofweek.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([(202010,), (202201,)]).toDF(['yrmth']). \
    withColumn('lastday_mth', func.last_day(func.to_date(func.col('yrmth').cast('string'), 'yyyyMM'))). \
    withColumn('dayofwk', func.dayofweek('lastday_mth')). \
    withColumn('lastbizday_mth',
               func.when(func.col('dayofwk') == 7, func.date_add('lastday_mth', -1)).
               when(func.col('dayofwk') == 1, func.date_add('lastday_mth', -2)).
               otherwise(func.col('lastday_mth'))
               ). \
    show()
# +------+-----------+-------+--------------+
# | yrmth|lastday_mth|dayofwk|lastbizday_mth|
# +------+-----------+-------+--------------+
# |202010| 2020-10-31| 7| 2020-10-30|
# |202201| 2022-01-31| 2| 2022-01-31|
# +------+-----------+-------+--------------+
Create a small sequence of the last few dates of the month, filter out weekends, and use array_max to return the max date.
from pyspark.sql import functions as F

df = spark.createDataFrame([('202010',), ('202201',)], ['yrmth'])

last_day = F.last_day(F.to_date('yrmth', 'yyyyMM'))
last_days = F.sequence(F.date_sub(last_day, 3), last_day)

df = df.withColumn(
    'last_business_day_in_month',
    F.array_max(F.filter(last_days, lambda x: ~F.dayofweek(x).isin([1, 7])))
)
df.show()
# +------+--------------------------+
# | yrmth|last_business_day_in_month|
# +------+--------------------------+
# |202010| 2020-10-30|
# |202201| 2022-01-31|
# +------+--------------------------+
For lower Spark versions:
last_day = "last_day(to_date(yrmth, 'yyyyMM'))"
df = df.withColumn(
'last_business_day_in_month',
F.expr(f"array_max(filter(sequence(date_sub({last_day}, 3), {last_day}), x -> weekday(x) < 5))")
)
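
The weekday-only logic above can still land on a holiday, for example when an exchange closes on the last weekday of a month. If holidays must be skipped as well, a hedged extension of the sequence approach is sketched below; the hard-coded holiday dates and the 7-day window are illustrative assumptions, not part of the original answer:

from pyspark.sql import functions as F

# Illustrative, hard-coded holidays; in practice these would come from a reference table.
holidays = ['2020-12-25', '2021-12-31']

last_day = F.last_day(F.to_date('yrmth', 'yyyyMM'))
df = df.withColumn(
    'last_business_day_in_month',
    F.array_max(F.filter(
        F.sequence(F.date_sub(last_day, 6), last_day),  # wider window absorbs runs of holidays
        lambda x: (~F.dayofweek(x).isin([1, 7])) & (~x.cast('string').isin(holidays))
    ))
)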

Add business days to pandas dataframe with dates and skip over holidays python

I have a dataframe with dates as seen in the tables below. The 1st block is what it should look like when completed, and the 2nd block is what I get when just adding the BDays. I want to use the 1st column and add 5 business days to the dates, but if the 5 BDays overlap a holiday (like 15 Feb '21) then I need to add one additional day. It is fairly simple to add the 5 BDays using BDay from pandas.tseries.offsets, but I cannot skip the holidays while using the dataframe.
I have tried to use USFederalHolidayCalendar from pandas.tseries.holiday as well as the workdays and workalendar modules, but cannot figure it out. Does anyone have an idea what I can do?
Correct Example
DATE        EXIT DATE +5
2021/02/09  2021/02/17
2021/02/10  2021/02/18

Wrong Example
DATE        EXIT DATE +5
2021/02/09  2021/02/16
2021/02/10  2021/02/17
Here are some examples of code I tried:
import pandas as pd
from workdays import workday
...
df['DATE'] = workday(df['EXIT DATE +5'], days=5, holidays=holidays)
Next Example:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt = df['DATE']
df['EXIT DATE +5'] = dt + bday_us
=========================================
Final code:
Below is the code I finally settled on. I had to define the holidays manually to match the days the NYSE actually trades, like, for instance, the day President Bush was laid to rest.
import datetime as dt
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import BDay
from pandas.tseries.holiday import AbstractHolidayCalendar, Holiday, nearest_workday, \
    USMartinLutherKingJr, USPresidentsDay, GoodFriday, USMemorialDay, \
    USLaborDay, USThanksgivingDay

class USTradingCalendar(AbstractHolidayCalendar):
    rules = [
        Holiday('NewYearsDay', month=1, day=1, observance=nearest_workday),
        USMartinLutherKingJr,
        USPresidentsDay,
        GoodFriday,
        USMemorialDay,
        Holiday('USIndependenceDay', month=7, day=4, observance=nearest_workday),
        Holiday('BushDay', year=2018, month=12, day=5),
        USLaborDay,
        USThanksgivingDay,
        Holiday('Christmas', month=12, day=25, observance=nearest_workday)
    ]

offset = 5
df = pd.DataFrame(['2019-10-11', '2019-10-14', '2017-04-13', '2018-11-28', '2021-07-02'], columns=['DATE'])
df['DATE'] = pd.to_datetime(df['DATE'])

def offset_date(start, offset):
    return start + pd.offsets.CustomBusinessDay(n=offset, calendar=USTradingCalendar())

df['END'] = df.apply(lambda x: offset_date(x['DATE'], offset), axis=1)
print(df)
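With this calendar, for instance, 2018-11-28 plus 5 business days should land on 2018-12-06 rather than 2018-12-05, since 2018-12-05 (BushDay) is skipped.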
Input data
df = pd.DataFrame(['2021-02-09', '2021-02-10', '2021-06-28', '2021-06-29', '2021-07-02'], columns=['DATE'])
df['DATE'] = pd.to_datetime(df['DATE'])
Suggested solution using apply
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import BDay

def offset_date(start, offset):
    return start + pd.offsets.CustomBusinessDay(n=offset, calendar=USFederalHolidayCalendar())

offset = 5
df['END'] = df.apply(lambda x: offset_date(x['DATE'], offset), axis=1)
DATE        END
2021-02-09  2021-02-17
2021-02-10  2021-02-18
2021-06-28  2021-07-06
2021-06-29  2021-07-07
2021-07-02  2021-07-12
PS: If you want to use a particular calendar such as the NYSE's, instead of the default USFederalHolidayCalendar, I recommend following the instructions in this answer about creating a custom calendar.
Alternative solution which I do not recommend
Currently, to the best of my knowledge, pandas does not support a vectorized approach to your problem. But if you want to follow an approach similar to the one you mentioned, here is what you should do.
First, you will have to define an arbitrarily far-away end date that covers all the periods you might need, and use it to create a list of holidays.
holidays = USFederalHolidayCalendar().holidays(start='2021-02-09', end='2030-02-09')
Then, you pass the holidays list to CustomBusinessDay through the holidays parameter instead of the calendar to generate the desired offset.
offset = 5
bday_us = pd.offsets.CustomBusinessDay(n=offset, holidays=holidays)
df['END'] = df['DATE'] + bday_us
However, this type of approach is not a true vectorized solution, even though it might seem like it. See the following SO answer for further clarification. Under the hood, this approach is probably doing a conversion that is not efficient. This is why it yields the following warning.
PerformanceWarning: Non-vectorized DateOffset being applied to Series
or DatetimeIndex
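For a truly vectorized route that avoids the warning, one option (not from the original answer, but a sketch using NumPy's np.busday_offset, which operates on whole arrays at once) is:

import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

df = pd.DataFrame({'DATE': pd.to_datetime(['2021-02-09', '2021-02-10'])})
# Precompute the holiday list over a wide, arbitrary horizon, as above
holidays = USFederalHolidayCalendar().holidays(start='2021-01-01', end='2030-12-31').values.astype('datetime64[D]')

# busday_offset works on datetime64[D] arrays; roll='forward' first moves any
# start date that falls on a weekend or holiday to the next business day
end = np.busday_offset(df['DATE'].values.astype('datetime64[D]'), 5, roll='forward', holidays=holidays)
df['END'] = pd.to_datetime(end)
# 2021-02-09 -> 2021-02-17 (skips Washington's Birthday on 2021-02-15)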
Here's one way to do it
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from datetime import timedelta as td
def get_exit_date(date):
    holiday_list = cals.holidays(start=date, end=date + td(weeks=2)).tolist()
    # 6 periods since the start date is included in the set
    n_bdays = pd.bdate_range(start=date, periods=6, freq='C', holidays=holiday_list)
    return n_bdays[-1]
df = pd.read_clipboard()
cals = USFederalHolidayCalendar()
# I would convert this to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
df['EXIT DATE +5'] = df['DATE'].apply(get_exit_date)
This uses bdate_range, which returns a DatetimeIndex.
Results:
DATE EXIT DATE +5
0 2021-02-09 2021-02-17
1 2021-02-10 2021-02-18
Another option, instead of dynamically creating the holiday list inside the function, is to choose a start date and build the list once outside the function, like so:
def get_exit_date(date):
    # 6 periods since the start date is included in the set
    n_bdays = pd.bdate_range(start=date, periods=6, freq='C', holidays=holiday_list)
    return n_bdays[-1]
df = pd.read_clipboard()
cals = USFederalHolidayCalendar()
holiday_list = cals.holidays(start='2021-01-01').tolist()
# I would convert this to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
df['EXIT DATE +5'] = df['DATE'].apply(get_exit_date)
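Building the holiday list once outside the function avoids regenerating the calendar for every row, which matters on large dataframes; the trade-off is that the precomputed list must extend far enough to cover every date you will ever offset.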

How to replace the Timedelta Pandas function with a pure PySpark function?

I am developing a small script in PySpark that generates a date sequence (36 months before today's date), applying a truncate so each date is the first day of the month. Overall I succeeded at this task, but only with the help of the pandas Timedelta function to calculate the time delta.
Is there a way to replace this pandas Timedelta with a pure PySpark function?
import pandas as pd
from datetime import date, timedelta, datetime
from pyspark.sql.functions import col, date_trunc
today = datetime.today()
data = [((date(today.year, today.month, 1) - pd.Timedelta(36,'M')),date(today.year, today.month, 1))] # I want to replace this Pandas function
df = spark.createDataFrame(data, ["minDate", "maxDate"])
+----------+----------+
| minDate| maxDate|
+----------+----------+
|2016-10-01|2019-10-01|
+----------+----------+
import pyspark.sql.functions as f

df.withColumn("monthsDiff", f.months_between("maxDate", "minDate")) \
    .withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')")) \
    .select("*", f.posexplode("repeat").alias("date", "val")) \
    .withColumn("date", f.expr("add_months(minDate, date)")) \
    .select('date') \
    .show(n=50)
+----------+
| date|
+----------+
|2016-10-01|
|2016-11-01|
|2016-12-01|
|2017-01-01|
|2017-02-01|
|2017-03-01|
etc...
+----------+
You can use PySpark's built-in trunc function.
pyspark.sql.functions.trunc(date, format)
Returns date truncated to the unit specified by the format.
Parameters:
format – ‘year’, ‘YYYY’, ‘yy’ or ‘month’, ‘mon’, ‘mm’
Imagine I have the below dataframe.
data = [(1,)]
df = spark.createDataFrame(data, ['id'])

import pyspark.sql.functions as f
df = df.withColumn("start_date", f.add_months(f.trunc(f.current_date(), "month"), -36))
df = df.withColumn("max_date", f.trunc(f.current_date(), "month"))
>>> df.show()
+---+----------+----------+
| id|start_date| max_date|
+---+----------+----------+
| 1|2016-10-01|2019-10-01|
+---+----------+----------+
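To expand that [start_date, max_date] pair into the full month-by-month sequence without the posexplode trick, here is a hedged sketch using Spark's sequence function with a month interval (available from Spark 2.4):

# one row per month between start_date and max_date, inclusive
df.select(f.explode(f.expr("sequence(start_date, max_date, interval 1 month)")).alias("date")).show(n=50)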
Here's a link with more details on Spark date functions.
Pyspark date Functions

Equivalent concept in pandas to R 'frequency()' command?

In R, I can determine the frequency of a dataframe using the frequency() command, e.g.
myts = ts(x[1:240], frequency = 12)
frequency(myts)
> 12
As per the docs:
frequency returns the number of samples per unit time and deltat the time interval between observations (see ts).
Is there a similar concept for verifying pandas timeseries dataframes?
It only works with datetime or timedelta indices, but you can use pd.infer_freq:
import pandas as pd
df = pd.DataFrame(index=pd.date_range('2010-01-01', periods=10, freq='13.2min'))
pd.infer_freq(df.index)
#'792S'
df = pd.DataFrame(index=pd.timedelta_range(start='00:00:00', freq='1H', periods=20))
pd.infer_freq(df.index)
#'H'
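
There is no direct pandas equivalent of R's numeric frequency() (samples per unit time), but a hedged sketch that approximates samples per year from the average spacing of a DatetimeIndex:

import pandas as pd

idx = pd.date_range('2010-01-01', periods=240, freq='MS')  # monthly series
step = (idx[1:] - idx[:-1]).mean()  # average spacing between observations
samples_per_year = pd.Timedelta(days=365.25) / step
print(round(samples_per_year))  # 12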

pandas to_datetime function default year

I am a newbie with pandas. When I run the code below, I get different results:
import pandas as pd
ts = pd.to_datetime("2014-6-10 10:10:10.30",format="%Y-%m-%d %H:%M:%S.%f")
print ts
ts = pd.to_datetime("6-10 10:10:10.30",format="%m-%d %H:%M:%S.%f")
print ts
The output is:
2014-06-10 10:10:10.300000
1900-06-10 10:10:10.300000
That means the default year is 1900. How can I change it to 2014 for the second one?
You cannot write to the year attribute of a datetime, so the easiest thing to do is use replace:
In [57]:
ts = ts.replace(year=2014)
ts
Out[57]:
Timestamp('2014-06-10 10:10:10.300000')
Another possibility is to store the current year as a string and prepend it as required; this has the advantage that you can use the same format string for all dates:
In [68]:
this_year = str(datetime.datetime.now().year)
datestr = this_year +'-' + '6-10 10:10:10.30'
pd.to_datetime(datestr,format="%Y-%m-%d %H:%M:%S.%f")
Out[68]:
Timestamp('2014-06-10 10:10:10.300000')
I can't think of a better way, but you could wrap the above in a function that tests whether you need to set the year.
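For example, a hypothetical wrapper along those lines (the name and signature are illustrative; the 1900 check relies on strptime's documented default year):

import datetime
import pandas as pd

def to_datetime_with_default_year(datestr, fmt='%m-%d %H:%M:%S.%f', year=None):
    ts = pd.to_datetime(datestr, format=fmt)
    if ts.year == 1900:  # strptime's placeholder when %Y is absent from the format
        ts = ts.replace(year=year or datetime.datetime.now().year)
    return ts

to_datetime_with_default_year('6-10 10:10:10.30')
# Timestamp('2014-06-10 10:10:10.300000')  (when run in 2014)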