Using rangeBetween considering months rather than days in PySpark - sql

I'm looking for a way to translate this chunk of SQL code into PySpark syntax.
SELECT MEAN(some_value) OVER (
ORDER BY yyyy_mm_dd
RANGE BETWEEN INTERVAL 3 MONTHS PRECEDING AND CURRENT ROW
) AS mean
FROM
df
If the above was a range expressed in days, this could easily have been done using something like
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
(See also ZygD's solution here: Spark Window Functions - rangeBetween dates)
For a range in months, however, this doesn't work, because the number of days in a month is not constant. Any idea how to express a month-based range using PySpark syntax?

You can "borrow" the full SQL column expression and use it in PySpark.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('2022-05-01', 1),
('2022-06-01', 2),
('2022-07-01', 3),
('2022-08-01', 4),
('2022-09-01', 5)],
['yyyy_mm_dd', 'some_value']
).withColumn('yyyy_mm_dd', F.to_date('yyyy_mm_dd'))
Script:
df = df.withColumn('mean', F.expr("mean(some_value) over (order by yyyy_mm_dd range between interval 3 months preceding and current row)"))
df.show()
# +----------+----------+----+
# |yyyy_mm_dd|some_value|mean|
# +----------+----------+----+
# |2022-05-01| 1| 1.0|
# |2022-06-01| 2| 1.5|
# |2022-07-01| 3| 2.0|
# |2022-08-01| 4| 2.5|
# |2022-09-01| 5| 3.5|
# +----------+----------+----+
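To see which rows each window covers, the same arithmetic can be reproduced in plain Python. This is only a sketch of the window logic, not Spark code: the helper names minus_months and rolling_mean are made up for illustration, and clamping the day to the length of the target month is an assumption about how a calendar-month shift should behave.

```python
import calendar
from datetime import date


def minus_months(d, months):
    # Step back whole calendar months, clamping the day to the
    # length of the target month (assumed behaviour for this sketch).
    index = d.year * 12 + (d.month - 1) - months
    year, month = divmod(index, 12)
    month += 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


rows = [
    (date(2022, 5, 1), 1),
    (date(2022, 6, 1), 2),
    (date(2022, 7, 1), 3),
    (date(2022, 8, 1), 4),
    (date(2022, 9, 1), 5),
]


def rolling_mean(rows, months=3):
    # For each row, average every value whose date lies in
    # [current date - N months, current date], both ends inclusive.
    out = []
    for current, _ in rows:
        lower = minus_months(current, months)
        window = [v for d, v in rows if lower <= d <= current]
        out.append(sum(window) / len(window))
    return out


means = rolling_mean(rows)
print(means)  # [1.0, 1.5, 2.0, 2.5, 3.5]
```

The last value is 3.5 rather than 3.0 because the window for 2022-09-01 starts at 2022-06-01, so the May row falls outside it.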

Related

Get last business day of the month in PySpark without UDF

I would like to get the last business day (LBD) of the month and use it to filter records in a dataframe. I came up with the Python code below, but to use it in PySpark I would need a UDF. Is there any way to get the last business day of the month without using a PySpark UDF?
import calendar
def last_business_day_in_month(calendarYearMonth):
    year = int(calendarYearMonth[0:4])
    month = int(calendarYearMonth[4:])
    return str(year) + str(month) + str(max(calendar.monthcalendar(year, month)[-1:][0][:5]))
last_business_day_in_month(calendarYearMonth)
calendarYearMonth is in format YYYYMM
Ref: https://stackoverflow.com/a/62392077/6187792
You can calculate it using last_day and its dayofweek.
from pyspark.sql import functions as func
spark.sparkContext.parallelize([(202010,), (202201,)]).toDF(['yrmth']). \
withColumn('lastday_mth', func.last_day(func.to_date(func.col('yrmth').cast('string'), 'yyyyMM'))). \
withColumn('dayofwk', func.dayofweek('lastday_mth')). \
withColumn('lastbizday_mth',
func.when(func.col('dayofwk') == 7, func.date_add('lastday_mth', -1)).
when(func.col('dayofwk') == 1, func.date_add('lastday_mth', -2)).
otherwise(func.col('lastday_mth'))
). \
show()
# +------+-----------+-------+--------------+
# | yrmth|lastday_mth|dayofwk|lastbizday_mth|
# +------+-----------+-------+--------------+
# |202010| 2020-10-31| 7| 2020-10-30|
# |202201| 2022-01-31| 2| 2022-01-31|
# +------+-----------+-------+--------------+
Create a small sequence of last dates of the month, filter out weekends and use array_max to return the max date.
from pyspark.sql import functions as F
df = spark.createDataFrame([('202010',), ('202201',)], ['yrmth'])
last_day = F.last_day(F.to_date('yrmth', 'yyyyMM'))
last_days = F.sequence(F.date_sub(last_day, 3), last_day)
df = df.withColumn(
'last_business_day_in_month',
F.array_max(F.filter(last_days, lambda x: ~F.dayofweek(x).isin([1, 7])))
)
df.show()
# +------+--------------------------+
# | yrmth|last_business_day_in_month|
# +------+--------------------------+
# |202010| 2020-10-30|
# |202201| 2022-01-31|
# +------+--------------------------+
For lower Spark versions:
last_day = "last_day(to_date(yrmth, 'yyyyMM'))"
df = df.withColumn(
'last_business_day_in_month',
F.expr(f"array_max(filter(sequence(date_sub({last_day}, 3), {last_day}), x -> weekday(x) < 5))")
)
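The rule both answers rely on (take the last calendar day, then step back over Saturday and Sunday) can be checked outside Spark with a few lines of plain Python. This is a minimal sketch: the function name last_business_day is made up for illustration, and public holidays are not considered, just as in the answers above.

```python
import calendar
from datetime import date, timedelta


def last_business_day(yrmth):
    # yrmth is a string in YYYYMM format, e.g. "202010".
    year, month = int(yrmth[:4]), int(yrmth[4:])
    last = date(year, month, calendar.monthrange(year, month)[1])
    # date.weekday(): Monday = 0 ... Friday = 4, Saturday = 5, Sunday = 6,
    # so the offset is 1 for Saturday, 2 for Sunday and 0 otherwise.
    offset = max(0, last.weekday() - 4)
    return last - timedelta(days=offset)


print(last_business_day("202010"))  # 2020-10-30
print(last_business_day("202201"))  # 2022-01-31
```

Note that Python's weekday() numbering matches Spark's weekday (Monday = 0), not dayofweek (Sunday = 1).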

Better way to add column values combinations to data frame in PySpark

I have a dataset that contains 3 columns, id, day, value. I need to add rows with zeros in value for all combinations of id and day.
# Simplified version of my data frame
data = [("1", "2020-04-01", 5),
("2", "2020-04-01", 5),
("3", "2020-04-02", 4)]
df = spark.createDataFrame(data,['id','day', 'value'])
What I have come up with is:
# Create all combinations of id and day
ids= df.select('id').distinct()
days = df.select('day').distinct()
full = ids.crossJoin(days)
# Add combinations back to df filling value with zeros
df_full = df.join(full, ['id', 'day'], 'rightouter')\
.na.fill(value=0,subset=['value'])
Which outputs what I need:
>>> df_full.orderBy(['id','day']).show()
+---+----------+-----+
| id| day|value|
+---+----------+-----+
| 1|2020-04-01| 5|
| 1|2020-04-02| 0|
| 2|2020-04-01| 5|
| 2|2020-04-02| 0|
| 3|2020-04-01| 0|
| 3|2020-04-02| 4|
+---+----------+-----+
The problem is that both of these operations are very computationally expensive. When I run it on my full data, it produces a job that is an order of magnitude larger than something that usually takes a couple of hours to run.
Is there a more efficient way of doing this? Or is there something I'm missing?
This is how I would implement it. One caveat: the values stacked into the same output column must have matching types, otherwise the stack function will raise an error.
import pyspark.sql.functions as f
# Simplified version of my data frame
data = [("1", "2020-04-01", 5),
("2", "2020-04-01", 5),
("3", "2020-04-02", 4)]
df = spark.createDataFrame(data, ['id', 'day', 'value'])
# Creating a dataframe with all distinct days
df_days = df.select(f.col('day').alias('r_day')).distinct()
# Self Join to find all combinations
df_final = df.join(df_days, on=df['day'] != df_days['r_day'])
# +---+----------+-----+----------+
# | id| day|value| r_day|
# +---+----------+-----+----------+
# | 1|2020-04-01| 5|2020-04-02|
# | 2|2020-04-01| 5|2020-04-02|
# | 3|2020-04-02| 4|2020-04-01|
# +---+----------+-----+----------+
# Unpivot dataframe
df_final = df_final.select('id', f.expr('stack(2, day, value, r_day, cast(0 as bigint)) as (day, value)'))
df_final.orderBy('id', 'day').show()
Output:
+---+----------+-----+
| id| day|value|
+---+----------+-----+
| 1|2020-04-01| 5|
| 1|2020-04-02| 0|
| 2|2020-04-01| 5|
| 2|2020-04-02| 0|
| 3|2020-04-01| 0|
| 3|2020-04-02| 4|
+---+----------+-----+
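For reference, the target result is simply every (id, day) pair, with the observed value where one exists and 0 otherwise. A plain-Python sketch of that definition, using the sample data from the question, can serve as a check for whichever Spark approach is used. It is not a Spark implementation and says nothing about performance on large data.

```python
from itertools import product

data = [("1", "2020-04-01", 5),
        ("2", "2020-04-01", 5),
        ("3", "2020-04-02", 4)]

# Look-up table of the values that actually exist.
observed = {(id_, day): value for id_, day, value in data}

# Distinct ids and days, sorted so the output order is deterministic.
ids = sorted({id_ for id_, _, _ in data})
days = sorted({day for _, day, _ in data})

# Every combination of id and day; missing pairs get a value of 0.
full = [(id_, day, observed.get((id_, day), 0))
        for id_, day in product(ids, days)]

for row in full:
    print(row)
```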
Something like this. I keep the first row separate since it makes it clearer what happens. You could add it to the "main loop" though.
from datetime import date
from typing import Callable, Iterator, List
from pyspark.sql import Row
data = [
    ("1", date(2020, 4, 1), 5),
    ("2", date(2020, 4, 2), 5),
    ("3", date(2020, 4, 3), 5),
    ("1", date(2020, 4, 3), 5),
]
df = spark.createDataFrame(data, ["id", "date", "value"])
row_dates = df.select("date").distinct().collect()
dates = [item.asDict()["date"] for item in row_dates]
def map_row(dates: List[date]) -> Callable[[Iterator[Row]], Iterator[Row]]:
    dates.sort()
    def inner(partition):
        last_row = None
        for row in partition:
            # fill in missing dates for first row in partition
            if last_row is None:
                for day in dates:
                    if day < row.date:
                        yield Row(row.id, day, 0)
                    else:
                        # set current row as last row, yield current row and break out of the loop
                        last_row = row
                        yield row
                        break
            else:
                # if current row has same id as last row
                if last_row.id == row.id:
                    # yield dates between last and current
                    for day in dates:
                        if day > last_row.date and day < row.date:
                            yield Row(row.id, day, 0)
                    # set current as last and yield current
                    last_row = row
                    yield row
                else:
                    # if current row is new id
                    for day in dates:
                        # run potential remaining dates for last_row.id
                        if day > last_row.date:
                            yield Row(last_row.id, day, 0)
                    for day in dates:
                        # fill in missing dates before row.date
                        if day < row.date:
                            yield Row(row.id, day, 0)
                        else:
                            # and so on
                            last_row = row
                            yield row
                            break
        # fill in remaining dates for the final id in the partition
        if last_row is not None:
            for day in dates:
                if day > last_row.date:
                    yield Row(last_row.id, day, 0)
    return inner
rdd = (
    df.repartition(1, "id")
    .sortWithinPartitions("id", "date")
    .rdd.mapPartitions(map_row(dates))
)
new_df = spark.createDataFrame(rdd)
new_df.show(10, False)

Pyspark number of unique values in dataframe is different compared with Pandas result

I have a large dataframe with 4 million rows. One of the columns is a variable called "name".
When I check the number of unique values in Pandas with df['name'].nunique(), I get a different answer than from PySpark with df.select("name").distinct().show() (around 1800 in Pandas versus 350 in PySpark). How can this be? Is this a data partitioning thing?
EDIT:
The record "name" in the dataframe looks like: name-{number}, for example: name-1, name-2, etc.
In Pandas:
df['name'] = df['name'].str.lstrip('name-').astype(int)
df['name'].nunique() # 1800
In Pyspark:
import pyspark.sql.functions as f
df = df.withColumn("name", f.split(df['name'], '\-')[1].cast("int"))
df.select(f.countDistinct("name")).show()
IIUC, this is most likely caused by non-numeric characters (e.g. spaces) in the name column. Pandas forces the type conversion, while with Spark you get NULL; see the example below:
df = spark.createDataFrame([(e,) for e in ['name-1', 'name-22 ', 'name- 3']],['name'])
for PySpark:
import pyspark.sql.functions as f
df.withColumn("name1", f.split(df['name'], '\-')[1].cast("int")).show()
#+--------+-----+
#| name|name1|
#+--------+-----+
#| name-1| 1|
#|name-22 | null|
#| name- 3| null|
#+--------+-----+
for Pandas:
df.toPandas()['name'].str.lstrip('name-').astype(int)
#Out[xxx]:
#0 1
#1 22
#2 3
#Name: name, dtype: int64
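The effect on the distinct count can be reproduced in plain Python, without Pandas or Spark. In this sketch, int() stands in for the lenient conversion on the Pandas side, and a strict digits-only parse stands in for the NULL results shown in the Spark output above. That strict parser is an assumption made for illustration, not a statement about how Spark's cast behaves in every version.

```python
names = ["name-1", "name-22 ", "name- 3"]

# str.lstrip treats its argument as a set of characters to strip,
# so it removes leading 'n', 'a', 'm', 'e' and '-' and stops at the
# first character outside that set (a digit or a space here).
suffixes = [n.lstrip("name-") for n in names]
print(suffixes)  # ['1', '22 ', ' 3']

# Lenient parse: int() ignores surrounding whitespace.
lenient = [int(s) for s in suffixes]
print(lenient)  # [1, 22, 3]

# Strict parse: anything that is not purely digits becomes None.
strict = [int(s) if s.isdigit() else None for s in suffixes]
print(strict)  # [1, None, None]

# A distinct count that skips None, as countDistinct skips NULL.
print(len(set(lenient)))                         # 3
print(len({v for v in strict if v is not None}))  # 1
```

Every value that fails the strict parse collapses into NULL and drops out of the distinct count, which would explain a much lower number on the Spark side.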

Statistics of Columns computed parallely

Best way to get the max value in a Spark dataframe column
This post shows how to run an aggregation (distinct, min, max) on a table something like:
for colName in df.columns:
    dt = df[[colName]].distinct().count()
    mx = df.agg({colName: "max"}).collect()[0][0]
    mn = df.agg({colName: "min"}).collect()[0][0]
    print(colName, dt, mx, mn)
This can easily be done with compute statistics. The stats from Hive and Spark are different:
Hive gives - distinct, max, min, nulls, length, version
Spark Gives - count, mean, stddev, min, max
It looks like quite a few statistics are calculated. How can I get all of them for all columns using one command?
However, I have thousands of columns and doing this serially is very slow. Suppose I want to compute some other function, say standard deviation, on each of the columns: how can that be done in parallel?
You can use pyspark.sql.DataFrame.describe() to get aggregate statistics like count, mean, min, max, and standard deviation for all columns where such statistics are applicable. (If you don't pass in any arguments, stats for all columns are returned by default)
df = spark.createDataFrame(
[(1, "a"),(2, "b"), (3, "a"), (4, None), (None, "c")],["id", "name"]
)
df.describe().show()
#+-------+------------------+----+
#|summary| id|name|
#+-------+------------------+----+
#| count| 4| 4|
#| mean| 2.5|null|
#| stddev|1.2909944487358056|null|
#| min| 1| a|
#| max| 4| c|
#+-------+------------------+----+
As you can see, these statistics ignore any null values.
If you're using Spark version 2.3 or later, there is also pyspark.sql.DataFrame.summary(), which supports the following aggregates:
count, mean, stddev, min, max, and arbitrary approximate percentiles specified as a percentage (e.g., 75%)
df.summary("count", "min", "max").show()
#+-------+------------------+----+
#|summary| id|name|
#+-------+------------------+----+
#| count| 4| 4|
#| min| 1| a|
#| max| 4| c|
#+-------+------------------+----+
If you wanted some other aggregate statistic for all columns, you could also use a list comprehension with pyspark.sql.DataFrame.agg(). For example, if you wanted to replicate what you say Hive gives (distinct, max, min and nulls - I'm not sure what length and version mean):
import pyspark.sql.functions as f
from itertools import chain
agg_distinct = [f.countDistinct(c).alias("distinct_"+c) for c in df.columns]
agg_max = [f.max(c).alias("max_"+c) for c in df.columns]
agg_min = [f.min(c).alias("min_"+c) for c in df.columns]
agg_nulls = [f.count(f.when(f.isnull(c), 1)).alias("nulls_"+c) for c in df.columns]
df.agg(
*(chain.from_iterable([agg_distinct, agg_max, agg_min, agg_nulls]))
).show()
#+-----------+-------------+------+--------+------+--------+--------+----------+
#|distinct_id|distinct_name|max_id|max_name|min_id|min_name|nulls_id|nulls_name|
#+-----------+-------------+------+--------+------+--------+--------+----------+
#| 4| 3| 4| c| 1| a| 1| 1|
#+-----------+-------------+------+--------+------+--------+--------+----------+
Though this method will return one row, rather than one row per statistic as describe() and summary() do.
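As a sanity check on the numbers in the table above, the same four statistics can be computed in plain Python over the sample rows. This is only an illustrative sketch (the helper column_stats is a made-up name) and it mirrors the convention that distinct, max and min ignore nulls.

```python
rows = [(1, "a"), (2, "b"), (3, "a"), (4, None), (None, "c")]
columns = ["id", "name"]


def column_stats(rows, columns):
    # Compute distinct, max, min and null counts for every column.
    stats = {}
    for i, name in enumerate(columns):
        values = [r[i] for r in rows]
        present = [v for v in values if v is not None]  # nulls are ignored
        stats[name] = {
            "distinct": len(set(present)),
            "max": max(present),
            "min": min(present),
            "nulls": len(values) - len(present),
        }
    return stats


stats = column_stats(rows, columns)
print(stats["id"])    # {'distinct': 4, 'max': 4, 'min': 1, 'nulls': 1}
print(stats["name"])  # {'distinct': 3, 'max': 'c', 'min': 'a', 'nulls': 1}
```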
You can put as many expressions into an agg as you want; when you collect, they all get computed at once. The result is a single row with all the values. Here's an example:
from pyspark.sql.functions import min, max, countDistinct
r = df.agg(
min(df.col1).alias("minCol1"),
max(df.col1).alias("maxCol1"),
(max(df.col1) - min(df.col1)).alias("diffMinMax"),
countDistinct(df.col2).alias("distinctItemsInCol2"))
r.printSchema()
# root
# |-- minCol1: long (nullable = true)
# |-- maxCol1: long (nullable = true)
# |-- diffMinMax: long (nullable = true)
# |-- distinctItemsInCol2: long (nullable = false)
row = r.collect()[0]
print(row.distinctItemsInCol2, row.diffMinMax)
# (10, 9)
You can also use the dictionary syntax here, but it's harder to manage for more complex things.

Spark SQL - How to find total number of transactions on an hourly basis

For example, if I have a table with transaction number and transaction date (as timestamp) columns, how do I find the total number of transactions on an hourly basis?
Are there any Spark SQL functions available for this kind of range calculation?
You can use from_unixtime function.
val sqlContext = new SQLContext(sc)
import org.apache.spark.sql.functions._
import sqlContext.implicits._
val df = // your dataframe, assuming transaction_date is timestamp in seconds
df.select('transaction_number, hour(from_unixtime('transaction_date)) as 'hour)
.groupBy('hour)
.agg(count('transaction_number) as 'transactions)
Result:
+----+------------+
|hour|transactions|
+----+------------+
| 10| 1000|
| 12| 2000|
| 13| 3000|
| 14| 4000|
| ..| ....|
+----+------------+
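One thing worth noticing about grouping on hour(...) alone is that it buckets by hour of day, so 10:00 on two different days lands in the same group. The plain-Python sketch below illustrates the difference using made-up transactions. The data, the epoch helper and the choice of UTC are all assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def epoch(y, m, d, hh, mm):
    # Helper for the sample data: epoch seconds for a UTC wall-clock time.
    return int(datetime(y, m, d, hh, mm, tzinfo=timezone.utc).timestamp())


# Hypothetical (transaction_number, transaction_date in epoch seconds) rows.
transactions = [
    (1, epoch(2015, 1, 1, 10, 0)),
    (2, epoch(2015, 1, 1, 10, 30)),
    (3, epoch(2015, 1, 1, 12, 15)),
    (4, epoch(2015, 1, 2, 10, 5)),
]

# Group by hour of day only: both days' 10 o'clock rows are merged.
per_hour = Counter(
    datetime.fromtimestamp(ts, tz=timezone.utc).hour for _, ts in transactions
)
print(dict(per_hour))  # {10: 3, 12: 1}

# Group by (date, hour) to keep each calendar hour separate.
per_day_hour = Counter(
    (datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat(),
     datetime.fromtimestamp(ts, tz=timezone.utc).hour)
    for _, ts in transactions
)
print(dict(per_day_hour))
```

In Spark, the equivalent of the second grouping would be to group on the date as well as the hour.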
Here I'm trying to give some pointers on the approach rather than complete code. Please see this:
Time Interval Literals:
Using interval literals, it is possible to perform subtraction or addition of an arbitrary amount of time from a date or timestamp value. This representation can be useful when you want to add or subtract a time period from a fixed point in time. For example, users can now easily express queries like
“Find all transactions that have happened during the past hour”.
An interval literal is constructed using the following syntax:
INTERVAL value unit
Below is how to do it in Python. You can modify the example to match your requirement, i.e. set the transaction date start and end times accordingly. Instead of id, in your case it is the transaction number.
# Import functions.
from pyspark.sql.functions import *
# Create a simple DataFrame.
data = [
("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1),
("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2),
("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3)]
df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
df = df.select(
df.start_time.cast("timestamp").alias("start_time"),
df.end_time.cast("timestamp").alias("end_time"),
df.id)
# Get all records that have a start_time and end_time in the
# same day, and the difference between the end_time and start_time
# is less or equal to 1 hour.
condition = \
(to_date(df.start_time) == to_date(df.end_time)) & \
(df.start_time + expr("INTERVAL 1 HOUR") >= df.end_time)
df.filter(condition).show()
+---------------------+---------------------+---+
|start_time           |end_time             |id |
+---------------------+---------------------+---+
|2015-01-02 23:00:00.0|2015-01-02 23:59:59.0|2  |
+---------------------+---------------------+---+
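To see why only id 2 survives the filter, the same condition can be evaluated in plain Python, with timedelta(hours=1) playing the role of INTERVAL 1 HOUR. This is a sketch of the logic only, not Spark code.

```python
from datetime import datetime, timedelta

fmt = "%Y-%m-%d %H:%M:%S"
data = [
    ("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1),
    ("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2),
    ("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3),
]
rows = [(datetime.strptime(s, fmt), datetime.strptime(e, fmt), i)
        for s, e, i in data]

# Keep rows that start and end on the same day and last at most one hour.
kept = [i for start, end, i in rows
        if start.date() == end.date()
        and start + timedelta(hours=1) >= end]
print(kept)  # [2]
```

Row 1 is dropped because it crosses midnight, and row 3 because it lasts one hour and one second.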
Using this method, you can apply a group function to find the total number of transactions in your case.
The above is Python code; what about Scala?
The expr function used above is available in Scala as well.
Also have a look at spark-scala-datediff-of-two-columns-by-hour-or-minute,
which describes the following:
import org.apache.spark.sql.functions._
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
val df2 = df1
.withColumn( "diff_secs", diff_secs_col )
.withColumn( "diff_mins", diff_secs_col / 60D )
.withColumn( "diff_hrs", diff_secs_col / 3600D )
.withColumn( "diff_days", diff_secs_col / (24D * 3600D) )
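The same unit conversions can be checked in plain Python, where subtracting two datetimes gives a timedelta whose total_seconds() corresponds to the difference of the two long-cast timestamps above. The two sample timestamps are made up for illustration.

```python
from datetime import datetime

# Hypothetical sample values standing in for the ts1 and ts2 columns.
ts1 = datetime(2015, 1, 2, 23, 59, 59)
ts2 = datetime(2015, 1, 2, 23, 0, 0)

diff_secs = (ts1 - ts2).total_seconds()
diff_mins = diff_secs / 60
diff_hrs = diff_secs / 3600
diff_days = diff_secs / (24 * 3600)

print(diff_secs)  # 3599.0
print(round(diff_mins, 4), round(diff_hrs, 4), round(diff_days, 6))
```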