I would like to get the last business day (LBD) of the month and use it to filter records in a DataFrame. I came up with the Python code below, but it relies on a UDF. Is there any way to get the last business day of the month without using a PySpark UDF?
import calendar

def last_business_day_in_month(calendarYearMonth):
    year = int(calendarYearMonth[0:4])
    month = int(calendarYearMonth[4:])
    # the last row of monthcalendar() is the final week; columns 0-4 are Mon-Fri
    last_day = max(calendar.monthcalendar(year, month)[-1][:5])
    return '{}{:02d}{:02d}'.format(year, month, last_day)

last_business_day_in_month('202010')  # returns '20201030'
calendarYearMonth is in the format YYYYMM.
Ref: https://stackoverflow.com/a/62392077/6187792
You can calculate it using last_day and the dayofweek of that last day: dayofweek returns 1 for Sunday and 7 for Saturday, so those two cases are rolled back to the preceding Friday; any other day is already a business day.
from pyspark.sql import functions as func
spark.sparkContext.parallelize([(202010,), (202201,)]).toDF(['yrmth']). \
    withColumn('lastday_mth', func.last_day(func.to_date(func.col('yrmth').cast('string'), 'yyyyMM'))). \
    withColumn('dayofwk', func.dayofweek('lastday_mth')). \
    withColumn('lastbizday_mth',
               func.when(func.col('dayofwk') == 7, func.date_add('lastday_mth', -1)).
               when(func.col('dayofwk') == 1, func.date_add('lastday_mth', -2)).
               otherwise(func.col('lastday_mth'))
               ). \
    show()
# +------+-----------+-------+--------------+
# | yrmth|lastday_mth|dayofwk|lastbizday_mth|
# +------+-----------+-------+--------------+
# |202010| 2020-10-31| 7| 2020-10-30|
# |202201| 2022-01-31| 2| 2022-01-31|
# +------+-----------+-------+--------------+
Create a small sequence of dates ending at the last day of the month, filter out the weekend days, and use array_max to return the latest remaining date.
from pyspark.sql import functions as F
df = spark.createDataFrame([('202010',), ('202201',)], ['yrmth'])
last_day = F.last_day(F.to_date('yrmth', 'yyyyMM'))
last_days = F.sequence(F.date_sub(last_day, 3), last_day)
df = df.withColumn(
    'last_business_day_in_month',
    F.array_max(F.filter(last_days, lambda x: ~F.dayofweek(x).isin([1, 7])))
)
df.show()
# +------+--------------------------+
# | yrmth|last_business_day_in_month|
# +------+--------------------------+
# |202010| 2020-10-30|
# |202201| 2022-01-31|
# +------+--------------------------+
For lower Spark versions, the same can be expressed with expr:
last_day = "last_day(to_date(yrmth, 'yyyyMM'))"
df = df.withColumn(
    'last_business_day_in_month',
    F.expr(f"array_max(filter(sequence(date_sub({last_day}, 3), {last_day}), x -> weekday(x) < 5))")
)
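This gives the same output as above: weekday returns 0 for Monday through 6 for Sunday, so weekday(x) < 5 keeps only the weekdays.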
I'm looking for a way to translate this chunk of SQL code into PySpark syntax.
SELECT MEAN(some_value) OVER (
ORDER BY yyyy_mm_dd
RANGE BETWEEN INTERVAL 3 MONTHS PRECEDING AND CURRENT ROW
) AS mean
FROM
df
If the above were a range expressed in days, this could easily be done using something like
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
(See also ZygD's solution here: Spark Window Functions - rangeBetween dates)
For a range in months, however, this doesn't work, since the number of days in a month is not constant. Any idea how to express a range over months using PySpark syntax?
You can "borrow" the full SQL column expression and use it in PySpark.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('2022-05-01', 1),
     ('2022-06-01', 2),
     ('2022-07-01', 3),
     ('2022-08-01', 4),
     ('2022-09-01', 5)],
    ['yyyy_mm_dd', 'some_value']
).withColumn('yyyy_mm_dd', F.to_date('yyyy_mm_dd'))
Script:
df = df.withColumn('mean', F.expr("mean(some_value) over (order by yyyy_mm_dd range between interval 3 months preceding and current row)"))
df.show()
# +----------+----------+----+
# |yyyy_mm_dd|some_value|mean|
# +----------+----------+----+
# |2022-05-01| 1| 1.0|
# |2022-06-01| 2| 1.5|
# |2022-07-01| 3| 2.0|
# |2022-08-01| 4| 2.5|
# |2022-09-01| 5| 3.5|
# +----------+----------+----+
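If you prefer the DataFrame Window API over an SQL expression, a rough alternative sketch is to order the window by a month counter and use rangeBetween. Note this is only equivalent to the interval-based window when all dates fall on the same day of the month, as in this example; month_nr is just a helper column name used here:
from pyspark.sql import Window, functions as F

# count months since year 0 so that a 3-month range becomes a simple integer range
df = df.withColumn('month_nr', F.year('yyyy_mm_dd') * 12 + F.month('yyyy_mm_dd'))
w = Window.orderBy('month_nr').rangeBetween(-3, 0)
df = df.withColumn('mean', F.mean('some_value').over(w)).drop('month_nr')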
In pandas there is the function pd.offsets.MonthEnd(x), such that Date + pd.offsets.MonthEnd(x) returns the end of the month x months after Date.
pd.to_datetime('2010-01-15') + pd.offsets.MonthEnd(3) = Timestamp('2010-03-31 00:00:00')
I am aware of the function last_day(Date) in PySpark. However, this does not take an offset argument but simply returns the end of the month. How do I get behaviour similar to pd.offsets.MonthEnd(x) in PySpark?
You can combine last_day with add_months:
import pyspark.sql.functions as F
df \
    .withColumn('add_months', F.add_months('my_timestamp', 3)) \
    .withColumn('result', F.last_day('add_months')) \
    .show()
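If you only need the final column, the two calls can also be nested in a single expression (a minimal sketch; my_timestamp stands for your date or timestamp column, and the output column name is just a suggestion):
import pyspark.sql.functions as F

# last day of the month that is 3 months after my_timestamp
df = df.withColumn('month_end_plus_3', F.last_day(F.add_months('my_timestamp', 3)))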
I am developing a small script in PySpark that generates a date sequence (36 months before today's date), truncating each date to the first day of the month. Overall I succeeded at this task, but only with the help of pandas' Timedelta to calculate the time delta.
Is there a way to replace this pandas Timedelta with a pure PySpark function?
import pandas as pd
from datetime import date, timedelta, datetime
from pyspark.sql.functions import col, date_trunc
today = datetime.today()
data = [((date(today.year, today.month, 1) - pd.Timedelta(36,'M')),date(today.year, today.month, 1))] # I want to replace this Pandas function
df = spark.createDataFrame(data, ["minDate", "maxDate"])
+----------+----------+
| minDate| maxDate|
+----------+----------+
|2016-10-01|2019-10-01|
+----------+----------+
import pyspark.sql.functions as f

df = df.withColumn("monthsDiff", f.months_between("maxDate", "minDate"))\
    .withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
    .select("*", f.posexplode("repeat").alias("date", "val"))\
    .withColumn("date", f.expr("add_months(minDate, date)"))\
    .select('date')\
    .show(n=50)
+----------+
| date|
+----------+
|2016-10-01|
|2016-11-01|
|2016-12-01|
|2017-01-01|
|2017-02-01|
|2017-03-01|
etc...
+----------+
You can use PySpark's built-in trunc function.
pyspark.sql.functions.trunc(date, format)
Returns date truncated to the unit specified by the format.
Parameters:
format – ‘year’, ‘YYYY’, ‘yy’ or ‘month’, ‘mon’, ‘mm’
Imagine I have the below dataframe.
data = [(1,)]
df = spark.createDataFrame(data, ['id'])

import pyspark.sql.functions as f

df = df.withColumn("start_date", f.add_months(f.trunc(f.current_date(), "month"), -36))
df = df.withColumn("max_date", f.trunc(f.current_date(), "month"))
>>> df.show()
+---+----------+----------+
| id|start_date| max_date|
+---+----------+----------+
| 1|2016-10-01|2019-10-01|
+---+----------+----------+
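On Spark 2.4+ you can also materialise the month sequence itself without the repeat/posexplode trick from the question, using sequence with a month interval (a sketch based on the start_date and max_date columns above):
import pyspark.sql.functions as f

# one row per month between start_date and max_date (inclusive)
df.select(
    f.explode(f.expr("sequence(start_date, max_date, interval 1 month)")).alias("date")
).show(n=50)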
Here's a link with more details on Spark date functions: Pyspark date Functions
For example, if I have a table with transaction number and transaction date [as timestamp] columns, how do I find the total number of transactions on an hourly basis?
Are there any Spark SQL functions available for this kind of range calculation?
You can use the from_unixtime function.
val sqlContext = new SQLContext(sc)
import org.apache.spark.sql.functions._
import sqlContext.implicits._
val df = // your dataframe, assuming transaction_date is timestamp in seconds
df.select('transaction_number, hour(from_unixtime('transaction_date)) as 'hour)
  .groupBy('hour)
  .agg(count('transaction_number) as 'transactions)
Result:
+----+------------+
|hour|transactions|
+----+------------+
| 10| 1000|
| 12| 2000|
| 13| 3000|
| 14| 4000|
| ..| ....|
+----+------------+
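For a Python/PySpark version of the same aggregation, a sketch could look like this (assuming, as above, that transaction_date is a Unix timestamp in seconds):
from pyspark.sql import functions as F

# count transactions per hour of the day
(df
 .select('transaction_number', F.hour(F.from_unixtime('transaction_date')).alias('hour'))
 .groupBy('hour')
 .agg(F.count('transaction_number').alias('transactions'))
 .show())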
Here I'm trying to give some pointers on the approach rather than complete code; please see this:
Time Interval Literals:
Using interval literals, it is possible to perform subtraction or addition of an arbitrary amount of time from a date or timestamp value. This representation can be useful when you want to add or subtract a time period from a fixed point in time. For example, users can now easily express queries like
“Find all transactions that have happened during the past hour”.
An interval literal is constructed using the following syntax:
INTERVAL value unit
Below is the way to do it in Python. You can modify the example to match your requirement, i.e. your transaction date, start time and end time; instead of id, in your case it's the transaction number.
# Import functions.
from pyspark.sql.functions import *

# Create a simple DataFrame.
data = [
    ("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1),
    ("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2),
    ("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3)]
df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
df = df.select(
    df.start_time.cast("timestamp").alias("start_time"),
    df.end_time.cast("timestamp").alias("end_time"),
    df.id)
# Get all records that have a start_time and end_time in the
# same day, and the difference between the end_time and start_time
# is less or equal to 1 hour.
condition = \
    (to_date(df.start_time) == to_date(df.end_time)) & \
    (df.start_time + expr("INTERVAL 1 HOUR") >= df.end_time)
df.filter(condition).show()
+---------------------+---------------------+---+
|start_time           |end_time             |id |
+---------------------+---------------------+---+
|2015-01-02 23:00:00.0|2015-01-02 23:59:59.0|2  |
+---------------------+---------------------+---+
Using this method, you can apply a group function to find the total number of transactions in your case.
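For example, counting the transactions per hour on the DataFrame above could look like this (a sketch; here id stands in for the transaction number):
from pyspark.sql.functions import hour, count

# number of transactions per hour of the day, based on start_time
(df
 .groupBy(hour('start_time').alias('hour'))
 .agg(count('id').alias('transactions'))
 .show())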
The above is Python code. What about Scala?
The expr function used above is available in Scala as well.
Also have a look at spark-scala-datediff-of-two-columns-by-hour-or-minute, which describes the below:
import org.apache.spark.sql.functions._
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
val df2 = df1
  .withColumn("diff_secs", diff_secs_col)
  .withColumn("diff_mins", diff_secs_col / 60D)
  .withColumn("diff_hrs", diff_secs_col / 3600D)
  .withColumn("diff_days", diff_secs_col / (24D * 3600D))
I have a DataFrame of phone calls that contains the timestamp and duration of each call. How would I sum the total duration per day across all phone calls? The timestamp is a string, so I am having trouble parsing it to an actual date. I'm not sure whether Spark has any support for timestamps.
DataFrame table
timestamp | duration
1414592818364 | 210
1414575535061 | 110
1411328461890 | 140
1434606396339 | 90
You can use a UDF to parse timestamps. Below you can find a Python solution, but it should be pretty easy to do the same thing in another supported language:
With raw SQL:
from datetime import datetime
df = sqlContext.createDataFrame(sc.parallelize([
    {'timestamp': 1414592818364, 'duration': 210},
    {'timestamp': 1414575535061, 'duration': 110},
    {'timestamp': 1411328461890, 'duration': 140},
    {'timestamp': 1434606396339, 'duration': 90}]))
def parse_timestamp(tm):
    dt = datetime.fromtimestamp(tm / 1000)
    return '{0}-{1}-{2}'.format(dt.year, dt.month, dt.day)
sqlContext.registerFunction('parse_timestamp', parse_timestamp)
df.registerTempTable('df')
query = '''
    SELECT parse_timestamp(timestamp) AS date, sum(duration) AS total_duration
    FROM df GROUP BY parse_timestamp(timestamp)'''

(sqlContext
    .sql(query)
    .show())
or SQL DSL:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
(df
    .withColumn('date', udf(parse_timestamp, StringType())(df.timestamp))
    .select('date', 'duration')
    .groupby('date')
    .sum()
    .show())
EDIT:
Since Spark 1.5 there is no need for a custom udf.
from pyspark.sql.functions import from_unixtime, col, sum
(df
    .groupBy(from_unixtime(df.timestamp / 1000, "yyyy-MM-dd").alias("date"))
    .agg(sum(col("duration"))))
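To actually display the daily totals, finish the chain with an action, for example (a small usage sketch of the snippet above; the total_duration alias and the ordering are just suggestions):
from pyspark.sql.functions import from_unixtime, col, sum

# total call duration per calendar day, sorted by date
(df
 .groupBy(from_unixtime(df.timestamp / 1000, "yyyy-MM-dd").alias("date"))
 .agg(sum(col("duration")).alias("total_duration"))
 .orderBy("date")
 .show())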