Better way to add column value combinations to a data frame in PySpark

I have a dataset that contains 3 columns, id, day, value. I need to add rows with zeros in value for all combinations of id and day.
# Simplified version of my data frame
data = [("1", "2020-04-01", 5),
("2", "2020-04-01", 5),
("3", "2020-04-02", 4)]
df = spark.createDataFrame(data,['id','day', 'value'])
What I have come up with is:
# Create all combinations of id and day
ids = df.select('id').distinct()
days = df.select('day').distinct()
full = ids.crossJoin(days)
# Add combinations back to df, filling value with zeros
df_full = df.join(full, ['id', 'day'], 'rightouter')\
    .na.fill(value=0, subset=['value'])
Which outputs what I need:
>>> df_full.orderBy(['id','day']).show()
+---+----------+-----+
| id| day|value|
+---+----------+-----+
| 1|2020-04-01| 5|
| 1|2020-04-02| 0|
| 2|2020-04-01| 5|
| 2|2020-04-02| 0|
| 3|2020-04-01| 0|
| 3|2020-04-02| 4|
+---+----------+-----+
The problem is that both of these operations are very computationally expensive. When I run it on my full data, it produces a job an order of magnitude larger than something that usually takes a couple of hours to run.
Is there a more efficient way of doing this? Or is there something I'm missing?

That's the way I would implement it. Just one point: both dataframes must have the same schema, otherwise the stack function will raise an error.
import pyspark.sql.functions as f
# Simplified version of my data frame
data = [("1", "2020-04-01", 5),
("2", "2020-04-01", 5),
("3", "2020-04-02", 4)]
df = spark.createDataFrame(data, ['id', 'day', 'value'])
# Creating a dataframe with all distinct days
df_days = df.select(f.col('day').alias('r_day')).distinct()
# Self Join to find all combinations
df_final = df.join(df_days, on=df['day'] != df_days['r_day'])
# +---+----------+-----+----------+
# | id| day|value| r_day|
# +---+----------+-----+----------+
# | 1|2020-04-01| 5|2020-04-02|
# | 2|2020-04-01| 5|2020-04-02|
# | 3|2020-04-02| 4|2020-04-01|
# +---+----------+-----+----------+
# Unpivot dataframe
df_final = df_final.select('id', f.expr('stack(2, day, value, r_day, cast(0 as bigint)) as (day, value)'))
df_final.orderBy('id', 'day').show()
Output:
+---+----------+-----+
| id| day|value|
+---+----------+-----+
| 1|2020-04-01| 5|
| 1|2020-04-02| 0|
| 2|2020-04-01| 5|
| 2|2020-04-02| 0|
| 3|2020-04-01| 0|
| 3|2020-04-02| 4|
+---+----------+-----+
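If you prefer to stay with the cross-join approach from the question, broadcasting the small distinct-days dataframe may also help. A minimal sketch, assuming the same df as above and that the number of distinct days is small enough to broadcast:
from pyspark.sql import functions as F
# Sketch only: broadcast the (small) distinct-days dataframe so the
# cross join avoids shuffling it, then left-join the original data back in.
ids = df.select('id').distinct()
days = df.select('day').distinct()
full = ids.crossJoin(F.broadcast(days))
df_full = (full.join(df, ['id', 'day'], 'left')
               .na.fill(value=0, subset=['value']))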

Something like this. I keep the first row separate since it's clearer what happens; you could fold it into the "main loop" though.
from datetime import date
from typing import Callable, Iterator, List

from pyspark.sql import Row

data = [
    ("1", date(2020, 4, 1), 5),
    ("2", date(2020, 4, 2), 5),
    ("3", date(2020, 4, 3), 5),
    ("1", date(2020, 4, 3), 5),
]
df = spark.createDataFrame(data, ["id", "date", "value"])

row_dates = df.select("date").distinct().collect()
dates = [item.asDict()["date"] for item in row_dates]

def map_row(dates: List[date]) -> Callable[[Iterator[Row]], Iterator[Row]]:
    dates.sort()

    def inner(partition):
        last_row = None
        for row in partition:
            # fill in missing dates for first row in partition
            if last_row is None:
                for day in dates:
                    if day < row.date:
                        yield Row(row.id, day, 0)
                    else:
                        # set current row as last row, yield current row and break out of the loop
                        last_row = row
                        yield row
                        break
            else:
                # if current row has same id as last row
                if last_row.id == row.id:
                    # yield dates between last and current
                    for day in dates:
                        if day > last_row.date and day < row.date:
                            yield Row(row.id, day, 0)
                    # set current as last and yield current
                    last_row = row
                    yield row
                else:
                    # if current row is a new id
                    for day in dates:
                        # yield potential remaining dates for last_row.id
                        if day > last_row.date:
                            yield Row(last_row.id, day, 0)
                    for day in dates:
                        # fill in missing dates before row.date
                        if day < row.date:
                            yield Row(row.id, day, 0)
                        else:
                            # and so on: set current as last, yield it and break
                            last_row = row
                            yield row
                            break

    return inner

rdd = (
    df.repartition(1, "id")
    .sortWithinPartitions("id", "date")
    .rdd.mapPartitions(map_row(dates))
)
new_df = spark.createDataFrame(rdd)
new_df.show(10, False)

Related

Using rangeBetween considering months rather than days in PySpark

I'm looking how to translate this chunk of SQL code into PySpark syntax.
SELECT MEAN(some_value) OVER (
ORDER BY yyyy_mm_dd
RANGE BETWEEN INTERVAL 3 MONTHS PRECEDING AND CURRENT ROW
) AS mean
FROM
df
If the above were a range expressed in days, this could easily have been done using something like
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
(See also ZygD's solution here: Spark Window Functions - rangeBetween dates)
For a range in months, this however doesn't work as the number of days in a month is not a constant. Any idea how to perform a range considering months using PySpark syntax?
You can "borrow" the full SQL column expression and use it in PySpark.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('2022-05-01', 1),
     ('2022-06-01', 2),
     ('2022-07-01', 3),
     ('2022-08-01', 4),
     ('2022-09-01', 5)],
    ['yyyy_mm_dd', 'some_value']
).withColumn('yyyy_mm_dd', F.to_date('yyyy_mm_dd'))
Script:
df = df.withColumn('mean', F.expr("mean(some_value) over (order by yyyy_mm_dd range between interval 3 months preceding and current row)"))
df.show()
# +----------+----------+----+
# |yyyy_mm_dd|some_value|mean|
# +----------+----------+----+
# |2022-05-01| 1| 1.0|
# |2022-06-01| 2| 1.5|
# |2022-07-01| 3| 2.0|
# |2022-08-01| 4| 2.5|
# |2022-09-01| 5| 3.5|
# +----------+----------+----+

How can I find the average of every nth number of rows in PySpark

I have 1440 rows in my dataframe (one row for every minute of the day). I want to convert this into hours so that I have 24 values (rows) left in total.
This is a 2-column dataframe: the first column is minutes, the second column is integers. I would like a 24-row, 2-column dataframe where the first column is hours and the second column is an average of 60 values.
If your minutes column is an integer starting at 0, something along these lines should work:
hour = F.floor(F.col('minute') / 60).alias('hour')
df = df.groupBy(hour).agg(F.avg('integer').alias('average'))
An example where I assume that every hour has 3 minutes:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(0, 5),
     (1, 5),
     (2, 5),
     (3, 0),
     (4, 0),
     (5, 1)],
    ['minute', 'integer'])
hour = F.floor(F.col('minute') / 3).alias('hour')
df = df.groupBy(hour).agg(F.avg('integer').alias('average'))
df.show()
# +----+------------------+
# |hour| average|
# +----+------------------+
# | 0| 5.0|
# | 1|0.3333333333333333|
# +----+------------------+

PySpark SQL loop optimization

I have a scenario where I need to filter a date column on a date condition, and likewise I need to do it for the entire month. The problem is that looping over each date takes time. I want to do the entire month in one go. Following is the code.
target_date = [1,2,3...30]
for i in target_date:
    df = spark.sql(f'select * from table where x_date <= {i} and y_date >= {i}')
    df = df.withColumn('load_date', f.lit(i))
    df.write.partitionBy('load_date').mode('append').parquet(output_path)
Any approaches to make this faster?
Maybe you can move the write to outside the loop. Something like
target_date = [1,2,3...30]
df_final = None
for i in target_date:
    df = spark.sql(f'select * from table where x_date <= {i} and y_date >= {i}')
    df = df.withColumn('load_date', f.lit(i))
    df_final = df if df_final is None else df_final.union(df)
df_final.write.partitionBy('load_date').parquet(output_path)
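A sketch of the same move-the-write-outside-the-loop idea, using functools.reduce instead of a running union (assuming the same table, x_date, y_date and output_path as in the question):
from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as f
# Sketch only: build one dataframe per load_date, then union them all at once.
target_date = list(range(1, 31))  # assuming load dates 1..30 as in the question
dfs = [
    spark.sql(f'select * from table where x_date <= {i} and y_date >= {i}')
         .withColumn('load_date', f.lit(i))
    for i in target_date
]
df_final = reduce(DataFrame.union, dfs)
df_final.write.partitionBy('load_date').parquet(output_path)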
I believe you could solve it with a kind of cross-join like this:
load_dates = spark.createDataFrame([[i,] for i in range(1,31)], ['load_date'])
load_dates.show()
+---------+
|load_date|
+---------+
| 1|
| 2|
| 3|
| ...|
| 30|
+---------+
df = spark.sql(f'select * from table')
df = df.join(
    load_dates,
    on=(F.col('x_date') <= F.col('load_date')) & (F.col('y_date') >= F.col('load_date')),
    how='inner',
)
df.write.partitionBy('load_date').parquet(output_path)
You should be able to do it by
Creating an array of load_dates in each row
Exploding the array so that you have a unique load_date per original row
Filtering to get just the load_dates you want
For example
target_dates = [1,2,3...30]
df = spark.sql(f'select * from table')
# create an array of all load_dates in each row
df = df.withColumn("load_date", F.array([F.lit(i) for i in target_dates]))
# Explode the load_dates so that you get a new row for each load_date
df = df.withColumn("load_date", F.explode("load_date"))
# Filter only the load_dates you want to keep
df = df.filter("x_date <= load_date and y_date >= load_date")
df.write.partitionBy('load_date').mode('append').parquet(output_path)

Filtering pyspark dataframe rows based on a column condition

I have two dataframes, and I want to select rows in the first dataframe whose timestamp field is bigger (more recent) than the max(timestamp) of the second dataframe.
I tried this:
df1 = sqlContext.table("db.table1") # FIRST DATAFRAME
max_timestamp = sqlContext.sql("select max(timestamp) as max from db.table2") # MAX TIMESTAMP IN THE SECOND DATAFRAME
df1.where(df1.timestamp > max_timestamp.max).show(10,False)
but it says: AttributeError: 'DataFrame' object has no attribute '_get_object_id'
Any ideas/solutions?
Your issue is that you are comparing against a DataFrame column (max_timestamp.max) from another DataFrame. You need to either collect the result as a string or crossJoin it to the first dataframe as a new column to compare against.
Reproducible example
data1 = [("1", "2020-01-01 00:00:00"), ("2", "2020-02-01 23:59:59")]
data2 = [("1", "2020-01-15 00:00:00"), ("2", "2020-01-16 23:59:59")]
df1 = spark.createDataFrame(data1, ["id", "timestamp"])
df2 = spark.createDataFrame(data2, ["id", "timestamp"])
collect as String
from pyspark.sql.functions import col, max
max_timestamp = df2.select(max(col("timestamp")).alias("max")).distinct().collect()[0][0]
max_timestamp
# '2020-01-16 23:59:59'
df1.where(col("timestamp") > max_timestamp).show(10, truncate=False)
# +---+-------------------+
# |id |timestamp |
# +---+-------------------+
# |2 |2020-02-01 23:59:59|
# +---+-------------------+
crossJoin as new column
from pyspark.sql.functions import col, max
intermediate = (
df2.
agg(max(col("timestamp")).alias("start_date_filter"))
)
intermediate.show(1, truncate=False)
# +-------------------+
# |start_date_filter |
# +-------------------+
# |2020-01-16 23:59:59|
# +-------------------+
(
df1.
crossJoin(intermediate).
where(col("timestamp") > col("start_date_filter")).
show(10, truncate=False)
)
# +---+-------------------+-------------------+
# |id |timestamp |start_date_filter |
# +---+-------------------+-------------------+
# |2 |2020-02-01 23:59:59|2020-01-16 23:59:59|
# +---+-------------------+-------------------+

Spark SQL - How to find total number of transactions on an hourly basis

For example, if I have a table with transaction number and transaction date [as timestamp] columns, how do I find out the total number of transactions on an hourly basis?
Is there any Spark sql functions available for this kind of range calculation?
You can use the from_unixtime function.
val sqlContext = new SQLContext(sc)
import org.apache.spark.sql.functions._
import sqlContext.implicits._
val df = // your dataframe, assuming transaction_date is timestamp in seconds
df.select('transaction_number, hour(from_unixtime('transaction_date)) as 'hour)
  .groupBy('hour)
  .agg(count('transaction_number) as 'transactions)
Result:
+----+------------+
|hour|transactions|
+----+------------+
| 10| 1000|
| 12| 2000|
| 13| 3000|
| 14| 4000|
| ..| ....|
+----+------------+
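For a PySpark version of the same idea, a rough equivalent of the Scala snippet above could look like the following (a sketch, assuming as above a dataframe df with transaction_number and transaction_date columns, where transaction_date is a unix timestamp in seconds):
from pyspark.sql import functions as F
# Sketch only: extract the hour from the timestamp and count transactions per hour.
hourly = (
    df.select('transaction_number',
              F.hour(F.from_unixtime('transaction_date')).alias('hour'))
      .groupBy('hour')
      .agg(F.count('transaction_number').alias('transactions'))
)
hourly.show()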
Here I'm trying to give some pointers on the approach rather than complete code; please see this.
Time Interval Literals:
Using interval literals, it is possible to perform subtraction or addition of an arbitrary amount of time from a date or timestamp value. This representation can be useful when you want to add or subtract a time period from a fixed point in time. For example, users can now easily express queries like
“Find all transactions that have happened during the past hour”.
An interval literal is constructed using the following syntax:
INTERVAL value unit
Below is the way to do it in Python. You can modify the example to match your requirement, i.e. the transaction date, start time, and end time; instead of id, in your case it's the transaction number.
# Import functions.
from pyspark.sql.functions import *
# Create a simple DataFrame.
data = [
    ("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1),
    ("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2),
    ("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3)]
df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
df = df.select(
    df.start_time.cast("timestamp").alias("start_time"),
    df.end_time.cast("timestamp").alias("end_time"),
    df.id)
# Get all records that have a start_time and end_time in the
# same day, and the difference between the end_time and start_time
# is less or equal to 1 hour.
condition = \
    (to_date(df.start_time) == to_date(df.end_time)) & \
    (df.start_time + expr("INTERVAL 1 HOUR") >= df.end_time)
df.filter(condition).show()
+---------------------+---------------------+---+
|start_time           |end_time             |id |
+---------------------+---------------------+---+
|2015-01-02 23:00:00.0|2015-01-02 23:59:59.0|2  |
+---------------------+---------------------+---+
Using this method, you can apply a group function to find the total number of transactions in your case.
The above is Python code; what about Scala?
The expr function used above is available in Scala as well.
Also have a look at spark-scala-datediff-of-two-columns-by-hour-or-minute, which describes the following:
import org.apache.spark.sql.functions._
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
val df2 = df1
  .withColumn("diff_secs", diff_secs_col)
  .withColumn("diff_mins", diff_secs_col / 60D)
  .withColumn("diff_hrs", diff_secs_col / 3600D)
  .withColumn("diff_days", diff_secs_col / (24D * 3600D))