PySpark SQL loop optimization

I have a scenario where I need to filter a date column on a date condition, and I need to do this for every date in a month. The problem is that looping over each date takes a long time; I want to process the entire month in one go. Following is the code:
from pyspark.sql import functions as f

target_date = list(range(1, 31))  # days 1..30
for i in target_date:
    df = spark.sql(f'select * from table where x_date <= {i} and y_date >= {i}')
    df = df.withColumn('load_date', f.lit(i))
    df.write.partitionBy('load_date').mode('append').parquet(output_path)
Are there any approaches to make this faster?

Maybe you can move the write outside the loop. Something like:
target_date = list(range(1, 31))
df_final = None
for i in target_date:
    df = spark.sql(f'select * from table where x_date <= {i} and y_date >= {i}')
    df = df.withColumn('load_date', f.lit(i))
    df_final = df if df_final is None else df_final.union(df)

df_final.write.partitionBy('load_date').parquet(output_path)

I believe you could solve it with a kind of cross-join like this:
from pyspark.sql import functions as F

load_dates = spark.createDataFrame([[i] for i in range(1, 31)], ['load_date'])
load_dates.show()
+---------+
|load_date|
+---------+
| 1|
| 2|
| 3|
| ...|
| 30|
+---------+
df = spark.sql('select * from table')
df = df.join(
    load_dates,
    on=(F.col('x_date') <= F.col('load_date')) & (F.col('y_date') >= F.col('load_date')),
    how='inner',
)
df.write.partitionBy('load_date').parquet(output_path)

You should be able to do it by:
- creating an array of load_dates in each row,
- exploding the array so that you have a unique load_date per original row,
- filtering to get just the load_dates you want.
For example:
from pyspark.sql import functions as F

target_dates = list(range(1, 31))  # days 1..30

df = spark.sql('select * from table')

# Create an array of all load_dates in each row
df = df.withColumn("load_date", F.array([F.lit(i) for i in target_dates]))

# Explode the load_dates so that you get a new row for each load_date
df = df.withColumn("load_date", F.explode("load_date"))

# Filter only the load_dates you want to keep
df = df.filter("x_date <= load_date and y_date >= load_date")

df.write.partitionBy('load_date').mode('append').parquet(output_path)


How to add multiple columns dynamically based on a filter condition

I am trying to create multiple columns dynamically, based on a filter condition, after comparing two data frames with the code below.
source_df
+---+-----+-----+------+
|key|val11|val12|  date|
+---+-----+-----+------+
|abc|  1.1| john|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+

dest_df
+---+-----+-----+------+
|key|val11|val12|  date|
+---+-----+-----+------+
|abc|  2.1| jack|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
columns = source_df.columns[1:]
joined_df = source_df.join(dest_df, 'key', 'full')
for column in columns:
    column_name = "difference_in_" + str(column)
    report = joined_df\
        .filter(source_df[column] != dest_df[column])\
        .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column],
                                          F.lit(',dst:'), dest_df[column], F.lit(']')))
The output I expect is:
#Expected
+---+-------------------+-------------------+
|key|difference_in_val11|difference_in_val12|
+---+-------------------+-------------------+
|abc|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-------------------+-------------------+
But I get only one difference column in the result:
#Actual
+---+-------------------+
|key|difference_in_val12|
+---+-------------------+
|abc|[src:john,dst:jack]|
+---+-------------------+
How can I generate multiple columns dynamically based on a filter condition?
DataFrames are immutable objects. That said, you need to create the next DataFrame from the one that was generated in the previous iteration, rather than starting from joined_df every time. Something like below:
from pyspark.sql import functions as F

columns = source_df.columns[1:]
joined_df = source_df.join(dest_df, 'key', 'full')

for column in columns:
    if column != columns[-1]:
        column_name = "difference_in_" + str(column)
        report = joined_df\
            .filter(source_df[column] != dest_df[column])\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column],
                                              F.lit(',dst:'), dest_df[column], F.lit(']')))
    else:
        column_name = "difference_in_" + str(column)
        report1 = report\
            .filter(source_df[column] != dest_df[column])\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column],
                                              F.lit(',dst:'), dest_df[column], F.lit(']')))

report1.show()
#report.show()
Output -
+---+-----+-----+-----+-----+-------------------+-------------------+
|key|val11|val12|val11|val12|difference_in_val11|difference_in_val12|
+---+-----+-----+-----+-----+-------------------+-------------------+
|abc| 1.1| john| 2.1| jack| [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-----+-----+-----+-----+-------------------+-------------------+
You could also do this with a union of both dataframes and then collect_list only where the collect_set size is greater than 1; this avoids joining the dataframes:
from pyspark.sql import functions as F

cols = source_df.drop("key").columns

output = (
    source_df.withColumn("ref", F.lit("src:"))
    .unionByName(dest_df.withColumn("ref", F.lit("dst:")))
    .groupBy("key")
    .agg(*[F.when(F.size(F.collect_set(i)) > 1,
                  F.collect_list(F.concat("ref", i))).alias(i)
           for i in cols])
    .dropna(subset=cols, how='all')
)
output.show()
+---+------------------+--------------------+
|key| val11| val12|
+---+------------------+--------------------+
|abc|[src:1.1, dst:2.1]|[src:john, dst:jack]|
+---+------------------+--------------------+

Better way to add column value combinations to a data frame in PySpark

I have a dataset that contains three columns: id, day, and value. I need to add rows with zeros in value for all missing combinations of id and day.
# Simplified version of my data frame
data = [("1", "2020-04-01", 5),
("2", "2020-04-01", 5),
("3", "2020-04-02", 4)]
df = spark.createDataFrame(data,['id','day', 'value'])
What I have come up with is:
# Create all combinations of id and day
ids= df.select('id').distinct()
days = df.select('day').distinct()
full = ids.crossJoin(days)
# Add combinations back to df filling value with zeros
df_full = df.join(full, ['id', 'day'], 'rightouter')\
    .na.fill(value=0, subset=['value'])
Which outputs what I need:
>>> df_full.orderBy(['id','day']).show()
+---+----------+-----+
| id| day|value|
+---+----------+-----+
| 1|2020-04-01| 5|
| 1|2020-04-02| 0|
| 2|2020-04-01| 5|
| 2|2020-04-02| 0|
| 3|2020-04-01| 0|
| 3|2020-04-02| 4|
+---+----------+-----+
The problem is that both of these operations are very computationally expensive. When I run it on my full data, it produces a job an order of magnitude larger than one that usually takes a couple of hours to run.
Is there a more efficient way of doing this? Or is there something I'm missing?
That's the way I would implement it. Just one point: both dataframes must have the same schema, otherwise the stack function will raise an error.
import pyspark.sql.functions as f
# Simplified version of my data frame
data = [("1", "2020-04-01", 5),
("2", "2020-04-01", 5),
("3", "2020-04-02", 4)]
df = spark.createDataFrame(data, ['id', 'day', 'value'])
# Creating a dataframe with all distinct days
df_days = df.select(f.col('day').alias('r_day')).distinct()
# Self Join to find all combinations
df_final = df.join(df_days, on=df['day'] != df_days['r_day'])
# +---+----------+-----+----------+
# | id| day|value| r_day|
# +---+----------+-----+----------+
# | 1|2020-04-01| 5|2020-04-02|
# | 2|2020-04-01| 5|2020-04-02|
# | 3|2020-04-02| 4|2020-04-01|
# +---+----------+-----+----------+
# Unpivot dataframe
df_final = df_final.select('id', f.expr('stack(2, day, value, r_day, cast(0 as bigint)) as (day, value)'))
df_final.orderBy('id', 'day').show()
Output:
+---+----------+-----+
| id| day|value|
+---+----------+-----+
| 1|2020-04-01| 5|
| 1|2020-04-02| 0|
| 2|2020-04-01| 5|
| 2|2020-04-02| 0|
| 3|2020-04-01| 0|
| 3|2020-04-02| 4|
+---+----------+-----+
Something like this. I keep the handling of the first row separate since it makes it clearer what happens; you could fold it into the "main loop" though.
from datetime import date
from typing import Callable, Iterator, List

from pyspark.sql import Row

data = [
    ("1", date(2020, 4, 1), 5),
    ("2", date(2020, 4, 2), 5),
    ("3", date(2020, 4, 3), 5),
    ("1", date(2020, 4, 3), 5),
]
df = spark.createDataFrame(data, ["id", "date", "value"])

row_dates = df.select("date").distinct().collect()
dates = [item.asDict()["date"] for item in row_dates]

def map_row(dates: List[date]) -> Callable[[Iterator[Row]], Iterator[Row]]:
    dates.sort()

    def inner(partition):
        last_row = None
        for row in partition:
            # fill in missing dates for first row in partition
            if last_row is None:
                for day in dates:
                    if day < row.date:
                        yield Row(row.id, day, 0)
                    else:
                        # set current row as last row, yield current row and break out of the loop
                        last_row = row
                        yield row
                        break
            else:
                # if current row has same id as last row
                if last_row.id == row.id:
                    # yield dates between last and current
                    for day in dates:
                        if day > last_row.date and day < row.date:
                            yield Row(row.id, day, 0)
                    # set current as last and yield current
                    last_row = row
                    yield row
                else:
                    # if current row is a new id,
                    # emit remaining dates for last_row.id
                    for day in dates:
                        if day > last_row.date:
                            yield Row(last_row.id, day, 0)
                    for day in dates:
                        # fill in missing dates before row.date
                        if day < row.date:
                            yield Row(row.id, day, 0)
                        else:
                            # and so on
                            last_row = row
                            yield row
                            break
    return inner

rdd = (
    df.repartition(1, "id")
    .sortWithinPartitions("id", "date")
    .rdd.mapPartitions(map_row(dates))
)
new_df = spark.createDataFrame(rdd)
new_df.show(10, False)

Is there a way in pyspark to count unique values

I have a Spark dataframe (12m x 132) and I am trying to calculate the number of unique values per column, and to remove columns that have only 1 unique value.
So far, I have used the pandas nunique function, like this:
import pandas as pd

df = sql_dw.read_table(<table>)
df_p = df.toPandas()

nun = df_p.nunique(axis=0)
nundf = pd.DataFrame({'atr': nun.index, 'countU': nun.values})

dropped = []
for i, j in nundf.values:
    if j == 1:
        dropped.append(i)
        df = df.drop(i)
print(dropped)
Is there a way to do this that is more native to spark - i.e. not using pandas?
Please have a look at the commented example below. The solution requires more Python than PySpark-specific knowledge.
import pyspark.sql.functions as F

# creating a dataframe
columns = ['asin', 'ctx', 'fo']
l = [('ASIN1', 'CTX1', 'FO1'),
     ('ASIN1', 'CTX1', 'FO1'),
     ('ASIN1', 'CTX1', 'FO2'),
     ('ASIN1', 'CTX2', 'FO1'),
     ('ASIN1', 'CTX2', 'FO2'),
     ('ASIN1', 'CTX2', 'FO2'),
     ('ASIN1', 'CTX2', 'FO3'),
     ('ASIN1', 'CTX3', 'FO1'),
     ('ASIN1', 'CTX3', 'FO3')]
df = spark.createDataFrame(l, columns)
df.show()

# we create a list of functions we want to apply,
# in this case countDistinct for each column
expr = [F.countDistinct(c).alias(c) for c in df.columns]

# we apply those functions; this df has just one row
countdf = df.select(*expr)
countdf.show()

# we extract the columns which have just one value
cols2drop = [k for k, v in countdf.collect()[0].asDict().items() if v == 1]
df.drop(*cols2drop).show()
Output:
+-----+----+---+
| asin| ctx| fo|
+-----+----+---+
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO2|
|ASIN1|CTX2|FO1|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO3|
|ASIN1|CTX3|FO1|
|ASIN1|CTX3|FO3|
+-----+----+---+
+----+---+---+
|asin|ctx| fo|
+----+---+---+
| 1| 3| 3|
+----+---+---+
+----+---+
| ctx| fo|
+----+---+
|CTX1|FO1|
|CTX1|FO1|
|CTX1|FO2|
|CTX2|FO1|
|CTX2|FO2|
|CTX2|FO2|
|CTX2|FO3|
|CTX3|FO1|
|CTX3|FO3|
+----+---+
My apologies, as I don't have the solution in PySpark but in Scala Spark, which may be transferable or useful in case you can't find a PySpark way.
You can create an empty list and then, using a foreach over the columns, check which columns have a distinct count of 1 and append them to the list.
From there you can use the list as a filter and drop those columns from your dataframe.
var list_of_columns: List[String] = List()

df_p.columns.foreach { c =>
  if (df_p.select(c).distinct.count == 1)
    list_of_columns ++= List(c)
}

val df_p_new = df_p.drop(list_of_columns: _*)
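For completeness, a rough PySpark sketch of the same idea (assuming df is the DataFrame in question; like the Scala loop above, this launches one small job per column):
# collect the columns whose distinct count is 1, then drop them
cols_to_drop = []
for c in df.columns:
    if df.select(c).distinct().count() == 1:
        cols_to_drop.append(c)

df_new = df.drop(*cols_to_drop)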
You can group your df by that column and count the distinct values of the column:
from pyspark.sql.functions import countDistinct
df = df.groupBy("column_name").agg(countDistinct("column_name").alias("distinct_count"))
And then filter your df to rows which have a distinct_count greater than 1:
df = df.filter(df.distinct_count > 1)

Spark SQL - How to find total number of transactions on an hourly basis

For example, if I have a table with transaction number and transaction date [as timestamp] columns, how do I find the total number of transactions on an hourly basis?
Are there any Spark SQL functions available for this kind of range calculation?
You can use the from_unixtime function.
val sqlContext = new SQLContext(sc)
import org.apache.spark.sql.functions._
import sqlContext.implicits._

val df = // your dataframe, assuming transaction_date is a timestamp in seconds

df.select('transaction_number, hour(from_unixtime('transaction_date)) as 'hour)
  .groupBy('hour)
  .agg(count('transaction_number) as 'transactions)
Result:
+----+------------+
|hour|transactions|
+----+------------+
| 10| 1000|
| 12| 2000|
| 13| 3000|
| 14| 4000|
| ..| ....|
+----+------------+
Here I'm trying to give some pointers on the approach rather than complete code; please see the following.
Time Interval Literals:
Using interval literals, it is possible to perform subtraction or addition of an arbitrary amount of time from a date or timestamp value. This representation can be useful when you want to add or subtract a time period from a fixed point in time. For example, users can now easily express queries like
“Find all transactions that have happened during the past hour”.
An interval literal is constructed using the following syntax:
INTERVAL value unit
Below is the way to do it in Python. You can modify the example to match your requirement, i.e. your transaction date as the start time and end time accordingly; instead of id, in your case it is the transaction number.
# Import functions.
from pyspark.sql.functions import *

# Create a simple DataFrame.
data = [
    ("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1),
    ("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2),
    ("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3)]
df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
df = df.select(
    df.start_time.cast("timestamp").alias("start_time"),
    df.end_time.cast("timestamp").alias("end_time"),
    df.id)

# Get all records that have a start_time and end_time in the
# same day, and where the difference between end_time and
# start_time is less than or equal to 1 hour.
condition = \
    (to_date(df.start_time) == to_date(df.end_time)) & \
    (df.start_time + expr("INTERVAL 1 HOUR") >= df.end_time)
df.filter(condition).show()
+---------------------+---------------------+---+
|start_time           |end_time             |id |
+---------------------+---------------------+---+
|2015-01-02 23:00:00.0|2015-01-02 23:59:59.0|2  |
+---------------------+---------------------+---+
Using this approach, you can then apply a grouping function to find the total number of transactions per hour in your case.
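For example, a minimal sketch of that grouping, assuming a DataFrame txn_df with transaction_number and transaction_date (timestamp) columns (names taken from the question):
from pyspark.sql import functions as F

# count transactions per calendar day and hour of day
hourly = (
    txn_df
    .groupBy(F.to_date("transaction_date").alias("day"),
             F.hour("transaction_date").alias("hour"))
    .agg(F.count("transaction_number").alias("transactions"))
    .orderBy("day", "hour")
)
hourly.show()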
The above is Python code; what about Scala?
The expr function used above is available in Scala as well.
Also have a look at spark-scala-datediff-of-two-columns-by-hour-or-minute,
which describes the following:
import org.apache.spark.sql.functions._
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
val df2 = df1
.withColumn( "diff_secs", diff_secs_col )
.withColumn( "diff_mins", diff_secs_col / 60D )
.withColumn( "diff_hrs", diff_secs_col / 3600D )
.withColumn( "diff_days", diff_secs_col / (24D * 3600D) )

SQL DataFrame operation

I am stuck in a situation where I need to perform division on the output of two SQL DataFrames. Any suggestion on how it can be done?
scala> val TotalDie = sqlc.sql("select COUNT(DISTINCT XY) from Data")
TotalDie: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> TotalDie.show()
+---+
|_c0|
+---+
|887|
+---+
scala> val PassDie = sqlc.sql("select COUNT(DISTINCT XY) from Data where Sbin = '1'")
PassDie: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> PassDie.show()
+---+
|_c0|
+---+
|413|
+---+
I need to calculate the yield, which refers to (PassDie/TotalDie)*100.
I am new to spark-shell.
In case of multiple values (i.e. multiple rows): do you have a column (or key or id) to join the two dataframes (or tables) on?
In case of always a single value (i.e. a single row): something along the lines of 100 * PassDie.collect() / TotalDie.collect().
UPDATE
The exact syntax in case of 1 value:
100.0 * passdie.collect()(0).getLong(0) / totaldie.collect()(0).getLong(0)
res25: Double = 46.56144306651635
It is possible to do this with just Spark SQL, too.
Here's what I'd do to solve it that way:
>>> rdd1 = sc.parallelize([("a", 1.12), ("a", 2.22)])
>>> rdd2 = sc.parallelize([("b", 9.12), ("b", 12.22)])
>>> r1df = rdd1.toDF()
>>> r2df = rdd2.toDF()
>>> r1df.registerTempTable('r1')
>>> r2df.registerTempTable('r2')
>>> r3df = sqlContext.sql("SELECT * FROM r1 UNION SELECT * FROM r2")
>>> r3df.show()
>>> r3df.registerTempTable('r3')
>>> sqlContext.sql("SELECT * FROM r3")  # -------> do your aggregation / math here.
Now from here, in theory, you can do basic grouping and arithmetic just using SQL queries, since you've got this one grand table of data. I realize that in my example code I didn't declare a good schema with column names, which keeps the example from really working as written, but you have a schema, so you get the idea.
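To make that last step concrete, a rough sketch of the yield calculation done purely in SQL against the original Data table from the question (assuming a Spark version that supports multiple distinct aggregates in one query) could look like:
# yield = 100 * (distinct passing XY) / (distinct XY), computed in a single query
result = sqlContext.sql("""
    SELECT 100.0 * COUNT(DISTINCT CASE WHEN Sbin = '1' THEN XY END)
                 / COUNT(DISTINCT XY) AS yield
    FROM Data
""")
result.show()  # should give roughly 46.56 for the counts shown above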