Spark dataframe inner join without duplicate match - apache-spark-sql

I want to join two dataframes based on a certain condition in Spark Scala. However, the catch is that if a row in df1 matches any row in df2, it should not try to match the same row of df1 with any other row in df2. Below are sample data and the outcome I am trying to get.
DF1
--------------------------------
Emp_id | Emp_Name | Address_id
1 | ABC | 1
2 | DEF | 2
3 | PQR | 3
4 | XYZ | 1
DF2
-----------------------
Address_id | City
1 | City_1
1 | City_2
2 | City_3
REST | Some_City
Output DF
----------------------------------------
Emp_id | Emp_Name | Address_id | City
1 | ABC | 1 | City_1
2 | DEF | 2 | City_3
3 | PQR | 3 | Some_City
4 | XYZ | 1 | City_1
Note: REST is like a wild card. Any value can be equal to REST.
So in the above sample, emp_name "ABC" can match City_1, City_2 or Some_City. The output DF contains only City_1 because it finds it first.

You seem to have custom logic for your join. Basically I've come up with the UDF below.
Note that you may want to change the logic of the UDF as per your requirement.
import spark.implicits._
import org.apache.spark.sql.functions.to_timestamp
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.first
//dataframe 1
val df_1 = Seq(("1", "ABC", "1"), ("2", "DEF", "2"), ("3", "PQR", "3"), ("4", "XYZ", "1")).toDF("Emp_Id", "Emp_Name", "Address_Id")
//dataframe 2
val df_2 = Seq(("1", "City_1"), ("1", "City_2"), ("2", "City_3"), ("REST","Some_City")).toDF("Address_Id", "City_Name")
// UDF logic
val join_udf = udf((a: String, b: String) => {
  (a, b) match {
    case ("1", "1")  => true
    case ("1", _)    => false
    case ("2", "2")  => true
    case ("2", _)    => false
    case (_, "REST") => true
    case (_, _)      => false
  }
})
val dataframe_join = df_1.join(df_2, join_udf(df_1("Address_Id"), df_2("Address_Id")), "inner").drop(df_2("Address_Id"))
.orderBy($"City_Name")
.groupBy($"Emp_Id", $"Emp_Name", $"Address_Id")
.agg(first($"City_Name"))
.orderBy($"Emp_Id")
dataframe_join.show(false)
Basically, after applying the UDF, what you get is all possible combinations of the matches.
Then, when you apply groupBy and make use of the first function in agg, you get only the filtered values you are looking for.
+------+--------+----------+-----------------------+
|Emp_Id|Emp_Name|Address_Id|first(City_Name, false)|
+------+--------+----------+-----------------------+
|1 |ABC |1 |City_1 |
|2 |DEF |2 |City_3 |
|3 |PQR |3 |Some_City |
|4 |XYZ |1 |City_1 |
+------+--------+----------+-----------------------+
Note that I've made use of Spark 2.3 and hope this helps!
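If the Address_Id values are not known up front, the same groupBy/first trick can be driven by a plain join condition instead of a value-by-value UDF. A minimal sketch under that assumption, reusing df_1 and df_2 from above (exact matches are preferred over the REST wildcard; like the answer above, it relies on the orderBy before groupBy, which Spark does not strictly guarantee after a shuffle):
import spark.implicits._
import org.apache.spark.sql.functions.{first, when}
val generalJoin = df_1
  .join(df_2, df_1("Address_Id") === df_2("Address_Id") || df_2("Address_Id") === "REST", "inner")
  .withColumn("is_rest", when(df_2("Address_Id") === "REST", 1).otherwise(0))
  .orderBy($"is_rest", $"City_Name")                 // exact matches first, then alphabetically by city
  .groupBy($"Emp_Id", $"Emp_Name", df_1("Address_Id"))
  .agg(first($"City_Name").as("City_Name"))
  .orderBy($"Emp_Id")
generalJoin.show(false)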

import org.apache.spark.sql.{SparkSession}
import org.apache.spark.sql.functions._
object JoinTwoDataFrame extends App {
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
val df1 = Seq(
(1, "ABC", "1"),
(2, "DEF", "2"),
(3, "PQR", "3"),
(4, "XYZ", "1")
).toDF("Emp_id", "Emp_Name", "Address_id")
val df2 = Seq(
("1", "City_1"),
("1", "City_2"),
("2", "City_3"),
("REST", "Some_City")
).toDF("Address_id", "City")
val restCity: Option[String] = Some(df2.filter('Address_id.equalTo("REST")).select('City).first()(0).toString)
val res = df1.join(df2, df1.col("Address_id") === df2.col("Address_id") , "left_outer")
.select(
df1.col("Emp_id"),
df1.col("Emp_Name"),
df1.col("Address_id"),
df2.col("City")
)
.withColumn("city2", when('City.isNotNull, 'City).otherwise(restCity.getOrElse("")))
.drop("City")
.withColumnRenamed("city2", "City")
.orderBy("Address_id", "City")
.groupBy("Emp_id", "Emp_Name", "Address_id")
.agg(collect_list("City").alias("cityList"))
.withColumn("City", 'cityList.getItem(0))
.drop("cityList")
.orderBy("Emp_id")
res.show(false)
// +------+--------+----------+---------+
// |Emp_id|Emp_Name|Address_id|City |
// +------+--------+----------+---------+
// |1 |ABC |1 |City_1 |
// |2 |DEF |2 |City_3 |
// |3 |PQR |3 |Some_City|
// |4 |XYZ |1 |City_1 |
// +------+--------+----------+---------+
}

Related

data frame parsing column scala

I have a problem parsing a DataFrame:
val result = df_app_clickstream.withColumn(
"attributes",
explode(expr(raw"transform(attributes, x -> str_to_map(regexp_replace(x, '{\\}',''), ' '))"))
).select(
col("userId"),
col("attributes").getField("campaign_id").alias("app_campaign_id"),
col("attributes").getField("channel_id").alias("app_channel_id")
)
result.show()
I have input like this :
-------------------------------------------------------------------------------
| userId | attributes |
-------------------------------------------------------------------------------
| f6e8252f-b5cc-48a4-b348-29d89ee4fa9e |{'campaign_id':082,'channel_id':'Chnl'}|
-------------------------------------------------------------------------------
and need to get output like this :
--------------------------------------------------------------------
| userId | campaign_id | channel_id|
--------------------------------------------------------------------
| f6e8252f-b5cc-48a4-b348-29d89ee4fa9e | 082 | Facebook |
--------------------------------------------------------------------
but I get an error.
You can try the solution below:
import spark.implicits._
import org.apache.spark.sql.functions._
val data = Seq(("f6e8252f-b5cc-48a4-b348-29d89ee4fa9e", """{'campaign_id':082, 'channel_id':'Chnl'}""")).toDF("user_id", "attributes")
val out_df = data.withColumn("splitted_col", split(regexp_replace(col("attributes"),"'|\\}|\\{", ""), ","))
.withColumn("campaign_id", split(element_at(col("splitted_col"), 1), ":")(1))
.withColumn("channel_id", split(element_at(col("splitted_col"), 2), ":")(1))
out_df.show(truncate = false)
+------------------------------------+----------------------------------------+-----------------------------------+-----------+----------+
|user_id |attributes |splitted_col |campaign_id|channel_id|
+------------------------------------+----------------------------------------+-----------------------------------+-----------+----------+
|f6e8252f-b5cc-48a4-b348-29d89ee4fa9e|{'campaign_id':082, 'channel_id':'Chnl'}|[campaign_id:082, channel_id:Chnl]|082 |Chnl |
+------------------------------------+----------------------------------------+-----------------------------------+-----------+----------+
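For reference, here is a hedged alternative sketch that stays closer to the original str_to_map idea, assuming attributes is a plain string column (in which case transform, which expects an array, would fail):
import org.apache.spark.sql.functions.{col, expr}
// Strip braces, quotes and spaces, then let str_to_map split on ',' and ':'.
val out_df2 = data
  .withColumn("attr_map", expr("""str_to_map(regexp_replace(attributes, "[{}' ]", ""), ",", ":")"""))
  .select(
    col("user_id"),
    col("attr_map").getItem("campaign_id").alias("campaign_id"),
    col("attr_map").getItem("channel_id").alias("channel_id"))
out_df2.show(false)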

size function applied to empty array column in dataframe returns 1 after split

I noticed the following behavior of the size function on an array column in a dataframe, using the code below, which includes a split:
import spark.implicits._
import org.apache.spark.sql.functions.{trim, explode, split, size}
val df1 = Seq(
(1, "[{a},{b},{c}]"),
(2, "[]"),
(3, "[{d},{e},{f}]")
).toDF("col1", "col2")
df1.show(false)
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", size($"cola"))
df2.show(false)
we get:
+----+-------------+---------------+---+
|col1|col2 |cola |s |
+----+-------------+---------------+---+
|1 |[{a},{b},{c}]|[{a}, {b}, {c}]|3 |
|2 |[] |[] |1 |
|3 |[{d},{e},{f}]|[{d}, {e}, {f}]|3 |
+----+-------------+---------------+---+
I was hoping for a zero so as to be able to distinguish between 0 and 1 entries.
A few hints here and there on SO, but none that helped.
If I have the following entry: (2, null), then I get size -1, which is more helpful I guess.
On the other hand, this borrowed sample from the internet:
val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
val df3 = df2.withColumn("s", size($"numbers"))
df3.show()
does return 0 - as expected.
Looking for the correct approach here so as to get size = 0.
This behavior is inherited from the Java split function, which is used the same way in Scala and Spark. Empty input is a special case, and this is well discussed in this SO post.
Spark sets the default value of the second parameter (limit) of the split function to -1, and as of Spark 3 we can pass the limit parameter explicitly.
You can see this by comparing the Scala split function with the Spark SQL split function:
"".split(",").length
//res31: Int = 1
spark.sql("""select size(split("", '[,]'))""").show
//+----------------------+
//|size(split(, [,], -1))|
//+----------------------+
//| 1|
//+----------------------+
And
",".split(",").length // without setting limit=-1 this gives empty array
//res33: Int = 0
",".split(",", -1).length
//res34: Int = 2
spark.sql("""select size(split(",", '[,]'))""").show
//+-----------------------+
//|size(split(,, [,], -1))|
//+-----------------------+
//| 2|
//+-----------------------+
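For completeness, a sketch of the Spark 3.0+ DataFrame API, which exposes the limit argument directly; note it does not change the empty-string case above, since splitting "" still yields a single empty element:
import spark.implicits._
import org.apache.spark.sql.functions.{col, size, split}
// split(str, pattern, limit) is available from Spark 3.0 on.
val limited = Seq("a,b,c", "", ",").toDF("raw")
  .withColumn("parts", split(col("raw"), ",", -1))
  .withColumn("n", size(col("parts")))
limited.show(false)
// "a,b,c" -> 3, "" -> 1, "," -> 2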
I suppose the root cause is that split returns an array containing an empty string, instead of an empty array or null.
scala> df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", $"cola"(0)).select("s").collect()(1)(0)
res53: Any = ""
And the size of an array containing an empty string is, of course, 1.
To get around this, perhaps you could do
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ","))
.withColumn("s", when(length($"cola"(0)) =!= 0, size($"cola"))
.otherwise(lit(0)))
df2.show(false)
+----+-------------+---------------+---+
|col1|col2 |cola |s |
+----+-------------+---------------+---+
|1 |[{a},{b},{c}]|[{a}, {b}, {c}]|3 |
|2 |[] |[] |0 |
|3 |[{d},{e},{f}]|[{d}, {e}, {f}]|3 |
+----+-------------+---------------+---+
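Another option, a sketch assuming Spark 2.4+ and that empty-string elements should simply not be counted, is array_remove:
import spark.implicits._
import org.apache.spark.sql.functions.{array_remove, size, split, trim}
val df3 = df1
  .withColumn("cola", split(trim($"col2", "[]"), ","))
  .withColumn("s", size(array_remove($"cola", "")))   // drop empty-string elements before counting
df3.show(false)
// row 2 ("[]") now gives s = 0, the others keep s = 3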

How do I group records that are within a specific time interval using Spark Scala or sql?

I would like to group records in scala only if they have the same ID and their time is within 1 min of each other.
Conceptually I am thinking of something like this, but I am not really sure:
HAVING a.ID = b.ID AND a.time + 30 sec > b.time AND a.time - 30 sec < b.time
| ID | volume | Time |
|:-----------|------------:|:--------------------------:|
| 1 | 10 | 2019-02-17T12:00:34Z |
| 2 | 20 | 2019-02-17T11:10:46Z |
| 3 | 30 | 2019-02-17T13:23:34Z |
| 1 | 40 | 2019-02-17T12:01:02Z |
| 2 | 50 | 2019-02-17T11:10:30Z |
| 1 | 60 | 2019-02-17T12:01:57Z |
to this:
| ID | volume |
|:-----------|------------:|
| 1 | 50 | // (10+40)
| 2 | 70 | // (20+50)
| 3 | 30 |
df.groupBy($"ID", window($"Time", "1 minutes")).sum("volume")
The code above is one solution, but it always rounds.
For example 2019-02-17T12:00:45Z will have a range of
2019-02-17T12:00:00Z TO 2019-02-17T12:01:00Z.
I am looking for this instead:
2019-02-17T11:45:00Z TO 2019-02-17T12:01:45Z.
Is there a way?
org.apache.spark.sql.functions provides overloaded window functions as below.
1. window(timeColumn: Column, windowDuration: String) : Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).
The windows will look like:
{{{
09:00:00-09:01:00
09:01:00-09:02:00
09:02:00-09:03:00 ...
}}}
2. window(timeColumn: Column, windowDuration: String, slideDuration: String):
Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).
slideDuration: parameter specifying the sliding interval of the window, e.g. 1 minute. A new window will be generated every slideDuration. Must be less than or equal to the windowDuration.
The windows will look like:
{{{
09:00:00-09:01:00
09:00:10-09:01:10
09:00:20-09:01:20 ...
}}}
3. window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).
The windows will look like:
{{{
09:00:05-09:01:05
09:00:15-09:01:15
09:00:25-09:01:25 ...
}}}
For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15..., provide startTime as 15 minutes. This is the perfect overloaded window function, which suits your requirement.
Please find working code as below.
import org.apache.spark.sql.SparkSession
object SparkWindowTest extends App {
val spark = SparkSession
.builder()
.master("local")
.appName("File_Streaming")
.getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions._
//Prepare Test Data
val df = Seq((1, 10, "2019-02-17 12:00:49"), (2, 20, "2019-02-17 11:10:46"),
(3, 30, "2019-02-17 13:23:34"),(2, 50, "2019-02-17 11:10:30"),
(1, 40, "2019-02-17 12:01:02"), (1, 60, "2019-02-17 12:01:57"))
.toDF("ID", "Volume", "TimeString")
df.show()
df.printSchema()
+---+------+-------------------+
| ID|Volume| TimeString|
+---+------+-------------------+
| 1| 10|2019-02-17 12:00:49|
| 2| 20|2019-02-17 11:10:46|
| 3| 30|2019-02-17 13:23:34|
| 2| 50|2019-02-17 11:10:30|
| 1| 40|2019-02-17 12:01:02|
| 1| 60|2019-02-17 12:01:57|
+---+------+-------------------+
root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- TimeString: string (nullable = true)
//Converted String Timestamp into Timestamp
val modifiedDF = df.withColumn("Time", to_timestamp($"TimeString"))
//Dropped String Timestamp from DF
val modifiedDF1 = modifiedDF.drop("TimeString")
modifiedDF.show(false)
modifiedDF.printSchema()
+---+------+-------------------+-------------------+
|ID |Volume|TimeString |Time |
+---+------+-------------------+-------------------+
|1 |10 |2019-02-17 12:00:49|2019-02-17 12:00:49|
|2 |20 |2019-02-17 11:10:46|2019-02-17 11:10:46|
|3 |30 |2019-02-17 13:23:34|2019-02-17 13:23:34|
|2 |50 |2019-02-17 11:10:30|2019-02-17 11:10:30|
|1 |40 |2019-02-17 12:01:02|2019-02-17 12:01:02|
|1 |60 |2019-02-17 12:01:57|2019-02-17 12:01:57|
+---+------+-------------------+-------------------+
root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- TimeString: string (nullable = true)
|-- Time: timestamp (nullable = true)
modifiedDF1.show(false)
modifiedDF1.printSchema()
+---+------+-------------------+
|ID |Volume|Time |
+---+------+-------------------+
|1 |10 |2019-02-17 12:00:49|
|2 |20 |2019-02-17 11:10:46|
|3 |30 |2019-02-17 13:23:34|
|2 |50 |2019-02-17 11:10:30|
|1 |40 |2019-02-17 12:01:02|
|1 |60 |2019-02-17 12:01:57|
+---+------+-------------------+
root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- Time: timestamp (nullable = true)
//Main logic
val modifiedDF2 = modifiedDF1.groupBy($"ID", window($"Time", "1 minutes","1 minutes","45 seconds")).sum("Volume")
//Renamed all columns of DF.
val newNames = Seq("ID", "WINDOW", "VOLUME")
val finalDF = modifiedDF2.toDF(newNames: _*)
finalDF.show(false)
+---+---------------------------------------------+------+
|ID |WINDOW |VOLUME|
+---+---------------------------------------------+------+
|2 |[2019-02-17 11:09:45.0,2019-02-17 11:10:45.0]|50 |
|1 |[2019-02-17 12:01:45.0,2019-02-17 12:02:45.0]|60 |
|1 |[2019-02-17 12:00:45.0,2019-02-17 12:01:45.0]|50 |
|3 |[2019-02-17 13:22:45.0,2019-02-17 13:23:45.0]|30 |
|2 |[2019-02-17 11:10:45.0,2019-02-17 11:11:45.0]|20 |
+---+---------------------------------------------+------+
}

Spark Advanced Window with dynamic last

Problem:
Given time series data, a clickstream of user activity stored in Hive, the ask is to enrich the data with a session id using Spark.
Session Definition
Session expires after inactivity of 1 hour
Session remains active for a total duration of 2 hours
Data:
click_time,user_id
2018-01-01 11:00:00,u1
2018-01-01 12:10:00,u1
2018-01-01 13:00:00,u1
2018-01-01 13:50:00,u1
2018-01-01 14:40:00,u1
2018-01-01 15:30:00,u1
2018-01-01 16:20:00,u1
2018-01-01 16:50:00,u1
2018-01-01 11:00:00,u2
2018-01-02 11:00:00,u2
Below is a partial solution considering only the 1st point of the session definition:
val win1 = Window.partitionBy("user_id").orderBy("click_time")
val sessionnew = when((unix_timestamp($"click_time") - unix_timestamp(lag($"click_time",1,"2017-01-01 11:00:00.0").over(win1)))/60 >= 60, 1).otherwise(0)
userActivity
.withColumn("session_num",sum(sessionnew).over(win1))
.withColumn("session_id",concat($"user_id", $"session_num"))
.show(truncate = false)
Actual Output:
+---------------------+-------+-----------+----------+
|click_time |user_id|session_num|session_id|
+---------------------+-------+-----------+----------+
|2018-01-01 11:00:00.0|u1 |1 |u11 |
|2018-01-01 12:10:00.0|u1 |2 |u12 | -- session u12 starts
|2018-01-01 13:00:00.0|u1 |2 |u12 |
|2018-01-01 13:50:00.0|u1 |2 |u12 |
|2018-01-01 14:40:00.0|u1 |2 |u12 | -- this should be a new session as diff of session start of u12 and this row exceeds 2 hours
|2018-01-01 15:30:00.0|u1 |2 |u12 |
|2018-01-01 16:20:00.0|u1 |2 |u12 |
|2018-01-01 16:50:00.0|u1 |2 |u12 | -- now this has to be compared with row 5 to find difference
|2018-01-01 11:00:00.0|u2 |1 |u21 |
|2018-01-02 11:00:00.0|u2 |2 |u22 |
+---------------------+-------+-----------+----------+
To include the second condition, I tried to find the difference between the current time and the last session start time to check if it exceeds 2 hours, but the reference itself changes for the following rows. Some use cases like this can be achieved through a running sum, but that doesn't suit here.
Not a straightforward problem to solve, but here's one approach:
Use Window lag timestamp difference to identify sessions (with 0 = start of a session) per user for rule #1
Group the dataset to assemble the timestamp diff list per user
Process via a UDF the timestamp diff list to identify sessions for rule #2 and create all session ids per user
Expand the grouped dataset via Spark's explode
Sample code below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val userActivity = Seq(
("2018-01-01 11:00:00", "u1"),
("2018-01-01 12:10:00", "u1"),
("2018-01-01 13:00:00", "u1"),
("2018-01-01 13:50:00", "u1"),
("2018-01-01 14:40:00", "u1"),
("2018-01-01 15:30:00", "u1"),
("2018-01-01 16:20:00", "u1"),
("2018-01-01 16:50:00", "u1"),
("2018-01-01 11:00:00", "u2"),
("2018-01-02 11:00:00", "u2")
).toDF("click_time", "user_id")
def clickSessList(tmo: Long) = udf{ (uid: String, clickList: Seq[String], tsList: Seq[Long]) =>
def sid(n: Long) = s"$uid-$n"
val sessList = tsList.foldLeft( (List[String](), 0L, 0L) ){ case ((ls, j, k), i) =>
if (i == 0 || j + i >= tmo) (sid(k + 1) :: ls, 0L, k + 1) else
(sid(k) :: ls, j + i, k)
}._1.reverse
clickList zip sessList
}
Note that the accumulator for foldLeft in the UDF is a Tuple of (ls, j, k), where:
ls is the list of formatted session ids to be returned
j and k are for carrying over the conditionally changing timestamp value and session id number, respectively, to the next iteration
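To make that fold concrete, here is a minimal plain-Scala sketch with hypothetical diff values (a diff of 0 marks a rule #1 session start; an accumulated total of at least tmo forces a rule #2 split):
val tmo = 7200L                                  // 2 hours, as tmo2 below
val diffs = List(0L, 0L, 3000L, 3000L, 3000L)    // hypothetical ts_diff values for one user
val (ids, _, _) = diffs.foldLeft((List.empty[String], 0L, 0L)) {
  case ((ls, elapsed, k), d) =>
    if (d == 0 || elapsed + d >= tmo) (s"u1-${k + 1}" :: ls, 0L, k + 1)   // start a new session
    else (s"u1-$k" :: ls, elapsed + d, k)                                 // stay in the current one
}
println(ids.reverse)   // List(u1-1, u1-2, u1-2, u1-2, u1-3)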
Step 1:
val tmo1: Long = 60 * 60
val tmo2: Long = 2 * 60 * 60
val win1 = Window.partitionBy("user_id").orderBy("click_time")
val df1 = userActivity.
withColumn("ts_diff", unix_timestamp($"click_time") - unix_timestamp(
lag($"click_time", 1).over(win1))
).
withColumn("ts_diff", when(row_number.over(win1) === 1 || $"ts_diff" >= tmo1, 0L).
otherwise($"ts_diff")
)
df1.show
// +-------------------+-------+-------+
// | click_time|user_id|ts_diff|
// +-------------------+-------+-------+
// |2018-01-01 11:00:00| u1| 0|
// |2018-01-01 12:10:00| u1| 0|
// |2018-01-01 13:00:00| u1| 3000|
// |2018-01-01 13:50:00| u1| 3000|
// |2018-01-01 14:40:00| u1| 3000|
// |2018-01-01 15:30:00| u1| 3000|
// |2018-01-01 16:20:00| u1| 3000|
// |2018-01-01 16:50:00| u1| 1800|
// |2018-01-01 11:00:00| u2| 0|
// |2018-01-02 11:00:00| u2| 0|
// +-------------------+-------+-------+
Steps 2-4:
val df2 = df1.
groupBy("user_id").agg(
collect_list($"click_time").as("click_list"), collect_list($"ts_diff").as("ts_list")
).
withColumn("click_sess_id",
explode(clickSessList(tmo2)($"user_id", $"click_list", $"ts_list"))
).
select($"user_id", $"click_sess_id._1".as("click_time"), $"click_sess_id._2".as("sess_id"))
df2.show
// +-------+-------------------+-------+
// |user_id|click_time |sess_id|
// +-------+-------------------+-------+
// |u1 |2018-01-01 11:00:00|u1-1 |
// |u1 |2018-01-01 12:10:00|u1-2 |
// |u1 |2018-01-01 13:00:00|u1-2 |
// |u1 |2018-01-01 13:50:00|u1-2 |
// |u1 |2018-01-01 14:40:00|u1-3 |
// |u1 |2018-01-01 15:30:00|u1-3 |
// |u1 |2018-01-01 16:20:00|u1-3 |
// |u1 |2018-01-01 16:50:00|u1-4 |
// |u2 |2018-01-01 11:00:00|u2-1 |
// |u2 |2018-01-02 11:00:00|u2-2 |
// +-------+-------------------+-------+
Also note that click_time is "passed thru" in steps 2-4 so as to be included in the final dataset.
Though the answer provided by Leo works perfectly, I feel it is a complicated approach to solve the problem, using the collect and explode functions. This can be solved in Spark's way by using a UDAF, which also makes it feasible to modify in the near future. Please take a look at a solution along similar lines below.
scala> //Importing Packages
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> // Create UDAF to calculate total session duration based on SessionInactiveFlag and current session duration
scala> import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.MutableAggregationBuffer
scala> import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala>
scala> class TotalSessionDuration extends UserDefinedAggregateFunction {
| // This is the input fields for your aggregate function.
| override def inputSchema: org.apache.spark.sql.types.StructType =
| StructType(
| StructField("sessiondur", LongType) :: StructField(
| "inactivityInd",
| IntegerType
| ) :: Nil
| )
|
| // This is the internal fields you keep for computing your aggregate.
| override def bufferSchema: StructType = StructType(
| StructField("sessionSum", LongType) :: Nil
| )
|
| // This is the output type of your aggregation function.
| override def dataType: DataType = LongType
|
| override def deterministic: Boolean = true
|
| // This is the initial value for your buffer schema.
| override def initialize(buffer: MutableAggregationBuffer): Unit = {
| buffer(0) = 0L
| }
|
| // This is how to update your buffer schema given an input.
| override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
| if (input.getAs[Int](1) == 1)
| buffer(0) = 0L
| else if (buffer.getAs[Long](0) >= 7200L)
| buffer(0) = input.getAs[Long](0)
| else
| buffer(0) = buffer.getAs[Long](0) + input.getAs[Long](0)
| }
|
| // This is how to merge two objects with the bufferSchema type.
| override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
| if (buffer2.getAs[Int](1) == 1)
| buffer1(0) = 0L
| else if (buffer2.getAs[Long](0) >= 7200)
| buffer1(0) = buffer2.getAs[Long](0)
| else
| buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
| }
| // This is where you output the final value, given the final value of your bufferSchema.
| override def evaluate(buffer: Row): Any = {
| buffer.getLong(0)
| }
| }
defined class TotalSessionDuration
scala> //Create handle for using the UDAF defined above
scala> val sessionSum=spark.udf.register("sessionSum", new TotalSessionDuration)
sessionSum: org.apache.spark.sql.expressions.UserDefinedAggregateFunction = TotalSessionDuration@64a9719a
scala> //Create Session Dataframe
scala> val clickstream = Seq(
| ("2018-01-01T11:00:00Z", "u1"),
| ("2018-01-01T12:10:00Z", "u1"),
| ("2018-01-01T13:00:00Z", "u1"),
| ("2018-01-01T13:50:00Z", "u1"),
| ("2018-01-01T14:40:00Z", "u1"),
| ("2018-01-01T15:30:00Z", "u1"),
| ("2018-01-01T16:20:00Z", "u1"),
| ("2018-01-01T16:50:00Z", "u1"),
| ("2018-01-01T11:00:00Z", "u2"),
| ("2018-01-02T11:00:00Z", "u2")
| ).toDF("timestamp", "userid").withColumn("curr_timestamp",unix_timestamp($"timestamp", "yyyy-MM-dd'T'HH:mm:ss'Z'").cast(TimestampType)).drop("timestamp")
clickstream: org.apache.spark.sql.DataFrame = [userid: string, curr_timestamp: timestamp]
scala>
scala> clickstream.show(false)
+------+-------------------+
|userid|curr_timestamp |
+------+-------------------+
|u1 |2018-01-01 11:00:00|
|u1 |2018-01-01 12:10:00|
|u1 |2018-01-01 13:00:00|
|u1 |2018-01-01 13:50:00|
|u1 |2018-01-01 14:40:00|
|u1 |2018-01-01 15:30:00|
|u1 |2018-01-01 16:20:00|
|u1 |2018-01-01 16:50:00|
|u2 |2018-01-01 11:00:00|
|u2 |2018-01-02 11:00:00|
+------+-------------------+
scala> //Generate column SEF with values 0 or 1 depending on whether difference between current and previous activity time is greater than 1 hour=3600 sec
scala>
scala> //Window on Current Timestamp when last activity took place
scala> val windowOnTs = Window.partitionBy("userid").orderBy("curr_timestamp")
windowOnTs: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@41dabe47
scala> //Create Lag Expression to find previous timestamp for the User
scala> val lagOnTS = lag(col("curr_timestamp"), 1).over(windowOnTs)
lagOnTS: org.apache.spark.sql.Column = lag(curr_timestamp, 1, NULL) OVER (PARTITION BY userid ORDER BY curr_timestamp ASC NULLS FIRST unspecifiedframe$())
scala> //Compute Timestamp for previous activity and subtract the same from Timestamp for current activity to get difference between 2 activities
scala> val diff_secs_col = col("curr_timestamp").cast("long") - col("prev_timestamp").cast("long")
diff_secs_col: org.apache.spark.sql.Column = (CAST(curr_timestamp AS BIGINT) - CAST(prev_timestamp AS BIGINT))
scala> val UserActWindowed=clickstream.withColumn("prev_timestamp", lagOnTS).withColumn("last_session_activity_after", diff_secs_col ).na.fill(0, Array("last_session_activity_after"))
UserActWindowed: org.apache.spark.sql.DataFrame = [userid: string, curr_timestamp: timestamp ... 2 more fields]
scala> //Generate Flag Column SEF (Session Expiry Flag) to indicate Session Has Expired due to inactivity for more than 1 hour
scala> val UserSessionFlagWhenInactive=UserActWindowed.withColumn("SEF",when(col("last_session_activity_after")>3600, 1).otherwise(0)).withColumn("tempsessid",sum(col("SEF")) over windowOnTs)
UserSessionFlagWhenInactive: org.apache.spark.sql.DataFrame = [userid: string, curr_timestamp: timestamp ... 4 more fields]
scala> UserSessionFlagWhenInactive.show(false)
+------+-------------------+-------------------+---------------------------+---+----------+
|userid|curr_timestamp |prev_timestamp |last_session_activity_after|SEF|tempsessid|
+------+-------------------+-------------------+---------------------------+---+----------+
|u1 |2018-01-01 11:00:00|null |0 |0 |0 |
|u1 |2018-01-01 12:10:00|2018-01-01 11:00:00|4200 |1 |1 |
|u1 |2018-01-01 13:00:00|2018-01-01 12:10:00|3000 |0 |1 |
|u1 |2018-01-01 13:50:00|2018-01-01 13:00:00|3000 |0 |1 |
|u1 |2018-01-01 14:40:00|2018-01-01 13:50:00|3000 |0 |1 |
|u1 |2018-01-01 15:30:00|2018-01-01 14:40:00|3000 |0 |1 |
|u1 |2018-01-01 16:20:00|2018-01-01 15:30:00|3000 |0 |1 |
|u1 |2018-01-01 16:50:00|2018-01-01 16:20:00|1800 |0 |1 |
|u2 |2018-01-01 11:00:00|null |0 |0 |0 |
|u2 |2018-01-02 11:00:00|2018-01-01 11:00:00|86400 |1 |1 |
+------+-------------------+-------------------+---------------------------+---+----------+
scala> //Compute Total session duration using the UDAF TotalSessionDuration such that :
scala> //(i) counter will be reset to 0 if SEF is set to 1
scala> //(ii)or set it to current session duration if session exceeds 2 hours
scala> //(iii)If both of them are inapplicable accumulate the sum
scala> val UserSessionDur=UserSessionFlagWhenInactive.withColumn("sessionSum",sessionSum(col("last_session_activity_after"),col("SEF")) over windowOnTs)
UserSessionDur: org.apache.spark.sql.DataFrame = [userid: string, curr_timestamp: timestamp ... 5 more fields]
scala> //Generate Session Marker if SEF is 1 or sessionSum Exceeds 2 hours(7200) seconds
scala> val UserNewSessionMarker=UserSessionDur.withColumn("SessionFlagChangeIndicator",when(col("SEF")===1 || col("sessionSum")>7200, 1).otherwise(0) )
UserNewSessionMarker: org.apache.spark.sql.DataFrame = [userid: string, curr_timestamp: timestamp ... 6 more fields]
scala> //Create New Session ID based on the marker
scala> val computeSessionId=UserNewSessionMarker.drop("SEF","tempsessid","sessionSum").withColumn("sessid",concat(col("userid"),lit("-"),(sum(col("SessionFlagChangeIndicator")) over windowOnTs)+1.toLong))
computeSessionId: org.apache.spark.sql.DataFrame = [userid: string, curr_timestamp: timestamp ... 4 more fields]
scala> computeSessionId.show(false)
+------+-------------------+-------------------+---------------------------+--------------------------+------+
|userid|curr_timestamp |prev_timestamp |last_session_activity_after|SessionFlagChangeIndicator|sessid|
+------+-------------------+-------------------+---------------------------+--------------------------+------+
|u1 |2018-01-01 11:00:00|null |0 |0 |u1-1 |
|u1 |2018-01-01 12:10:00|2018-01-01 11:00:00|4200 |1 |u1-2 |
|u1 |2018-01-01 13:00:00|2018-01-01 12:10:00|3000 |0 |u1-2 |
|u1 |2018-01-01 13:50:00|2018-01-01 13:00:00|3000 |0 |u1-2 |
|u1 |2018-01-01 14:40:00|2018-01-01 13:50:00|3000 |1 |u1-3 |
|u1 |2018-01-01 15:30:00|2018-01-01 14:40:00|3000 |0 |u1-3 |
|u1 |2018-01-01 16:20:00|2018-01-01 15:30:00|3000 |0 |u1-3 |
|u1 |2018-01-01 16:50:00|2018-01-01 16:20:00|1800 |1 |u1-4 |
|u2 |2018-01-01 11:00:00|null |0 |0 |u2-1 |
|u2 |2018-01-02 11:00:00|2018-01-01 11:00:00|86400 |1 |u2-2 |
+------+-------------------+-------------------+---------------------------+--------------------------+------+
Simple approach :
import spark.implicits._
import scala.collection.mutable.ListBuffer
val userActivity = Seq(
("2018-01-01 11:00:00", "u1"),
("2018-01-01 12:10:00", "u1"),
("2018-01-01 13:00:00", "u1"),
("2018-01-01 13:50:00", "u1"),
("2018-01-01 14:40:00", "u1"),
("2018-01-01 15:30:00", "u1"),
("2018-01-01 16:20:00", "u1"),
("2018-01-01 16:50:00", "u1"),
("2018-01-01 11:00:00", "u2"),
("2018-01-02 11:00:00", "u2")
).toDF("click_time", "user_id")
var df1 = userActivity.orderBy(asc("click_time")).
groupBy("user_id").agg(collect_list("click_time").alias("ts_diff"))
df1=df1.withColumn("test",convert_Case()($"ts_diff")).drop("ts_diff")
df1=df1.withColumn("test",explode(col("test"))).
withColumn("click_time",$"test._1").withColumn("session_generated",
concat($"user_id",lit("_"),$"test._2")).drop("test")
df1.show(truncate = false)
def convert_Case()=
udf { (timeList: Seq[String]) =>
import java.text.SimpleDateFormat
val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val new_list = ListBuffer[(String, Int)]()
var session = 1
new_list.append((timeList(0), session))
var x = timeList(0)
var time = 0L
for(i<-1 until(timeList.length)){
val d1 = dateFormat.parse(x)
val d2 = dateFormat.parse(timeList(i))
val diff = ((d1.getTime - d2.getTime)/1000)*(-1)
val test = diff + time
if (diff >= 3600.0 || test >= 7200.0) {
session = session + 1
new_list.append((timeList(i), session))
time = 0L
}
else
{new_list.append((timeList(i), session))
time = test}
x = timeList(i)}
new_list
}
Complete solution
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import scala.collection.mutable.ListBuffer
import scala.util.control._
import spark.sqlContext.implicits._
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val interimSessionThreshold=60
val totalSessionTimeThreshold=120
val sparkSession = SparkSession.builder.master("local").appName("Window Function").getOrCreate()
val clickDF = sparkSession.createDataFrame(Seq(
("2018-01-01T11:00:00Z","u1"),
("2018-01-01T12:10:00Z","u1"),
("2018-01-01T13:00:00Z","u1"),
("2018-01-01T13:50:00Z","u1"),
("2018-01-01T14:40:00Z","u1"),
("2018-01-01T15:30:00Z","u1"),
("2018-01-01T16:20:00Z","u1"),
("2018-01-01T16:50:00Z","u1"),
("2018-01-01T11:00:00Z","u2"),
("2018-01-02T11:00:00Z","u2")
)).toDF("clickTime","user")
val newDF=clickDF.withColumn("clickTimestamp",unix_timestamp($"clickTime", "yyyy-MM-dd'T'HH:mm:ss'Z'").cast(TimestampType).as("timestamp")).drop($"clickTime")
val partitionWindow = Window.partitionBy($"user").orderBy($"clickTimestamp".asc)
val lagTest = lag($"clickTimestamp", 1, "0000-00-00 00:00:00").over(partitionWindow)
val df_test=newDF.select($"*", ((unix_timestamp($"clickTimestamp")-unix_timestamp(lagTest))/60D cast "int") as "diff_val_with_previous")
val distinctUser=df_test.select($"user").distinct.as[String].collect.toList
val rankTest = rank().over(partitionWindow)
val ddf = df_test.select($"*", rankTest as "rank")
case class finalClick(User:String,clickTime:Timestamp,session:String)
val rowList: ListBuffer[finalClick] = new ListBuffer()
distinctUser.foreach{x =>{
val tempDf= ddf.filter($"user" === x)
var cumulDiff:Int=0
var session_index=1
var startBatch=true
var dp=0
val len = tempDf.count.toInt
for(i <- 1 until len+1){
val r = tempDf.filter($"rank" === i).head()
dp = r.getAs[Int]("diff_val_with_previous")
cumulDiff += dp
if(dp <= interimSessionThreshold && cumulDiff <= totalSessionTimeThreshold){
startBatch=false
rowList += finalClick(r.getAs[String]("user"),r.getAs[Timestamp]("clickTimestamp"),r.getAs[String]("user")+session_index)
}
else{
session_index+=1
cumulDiff = 0
startBatch=true
dp=0
rowList += finalClick(r.getAs[String]("user"),r.getAs[Timestamp]("clickTimestamp"),r.getAs[String]("user")+session_index)
}
}
}}
val dataFrame = sc.parallelize(rowList.toList).toDF("user","clickTimestamp","session")
dataFrame.show
+----+-------------------+-------+
|user| clickTimestamp|session|
+----+-------------------+-------+
| u1|2018-01-01 11:00:00| u11|
| u1|2018-01-01 12:10:00| u12|
| u1|2018-01-01 13:00:00| u12|
| u1|2018-01-01 13:50:00| u12|
| u1|2018-01-01 14:40:00| u13|
| u1|2018-01-01 15:30:00| u13|
| u1|2018-01-01 16:20:00| u13|
| u1|2018-01-01 16:50:00| u14|
| u2|2018-01-01 11:00:00| u21|
| u2|2018-01-02 11:00:00| u22|
+----+-------------------+-------+
----- Solution without using explode -----
In my point of view, explode is a heavy process, and in order to apply it you have to use groupBy and collect_list.
import pyspark.sql.functions as f
from pyspark.sql.window import Window
streaming_data=[("U1","2019-01-01T11:00:00Z") ,
("U1","2019-01-01T11:15:00Z") ,
("U1","2019-01-01T12:00:00Z") ,
("U1","2019-01-01T12:20:00Z") ,
("U1","2019-01-01T15:00:00Z") ,
("U2","2019-01-01T11:00:00Z") ,
("U2","2019-01-02T11:00:00Z") ,
("U2","2019-01-02T11:25:00Z") ,
("U2","2019-01-02T11:50:00Z") ,
("U2","2019-01-02T12:15:00Z") ,
("U2","2019-01-02T12:40:00Z") ,
("U2","2019-01-02T13:05:00Z") ,
("U2","2019-01-02T13:20:00Z") ]
schema=("UserId","Click_Time")
window_spec=Window.partitionBy("UserId").orderBy("Click_Time")
df_stream=spark.createDataFrame(streaming_data,schema)
df_stream=df_stream.withColumn("Click_Time",df_stream["Click_Time"].cast("timestamp"))
df_stream=df_stream\
.withColumn("time_diff",
(f.unix_timestamp("Click_Time")-f.unix_timestamp(f.lag(f.col("Click_Time"),1).over(window_spec)))/(60*60)).na.fill(0)
df_stream=df_stream\
.withColumn("cond_",f.when(f.col("time_diff")>1,1).otherwise(0))
df_stream=df_stream.withColumn("temp_session",f.sum(f.col("cond_")).over(window_spec))
new_window=Window.partitionBy("UserId","temp_session").orderBy("Click_Time")
new_spec=new_window.rowsBetween(Window.unboundedPreceding,Window.currentRow)
cond_2hr=(f.unix_timestamp("Click_Time")-f.unix_timestamp(f.lag(f.col("Click_Time"),1).over(new_window)))
df_stream=df_stream.withColumn("temp_session_2hr",f.when(f.sum(f.col("2hr_time_diff")).over(new_spec)-(2*60*60)>0,1).otherwise(0))
new_window_2hr=Window.partitionBy(["UserId","temp_session","temp_session_2hr"]).orderBy("Click_Time")
hrs_cond_=(f.when(f.unix_timestamp(f.col("Click_Time"))-f.unix_timestamp(f.first(f.col("Click_Time")).over(new_window_2hr))-(2*60*60)>0,1).otherwise(0))
df_stream=df_stream\
.withColumn("final_session_groups",hrs_cond_)
df_stream=df_stream.withColumn("final_session",df_stream["temp_session_2hr"]+df_stream["temp_session"]+df_stream["final_session_groups"]+1)\
.drop("temp_session","final_session_groups","time_diff","temp_session_2hr","final_session_groups")
df_stream=df_stream.withColumn("session_id",f.concat(f.col("UserId"),f.lit(" session_val----->"),f.col("final_session")))
df_stream.show(20,0)
--- Steps taken to solve ---
1. First find the clickstream events that are within one hour of each other and form continuous groups from them.
2. Then find the click streams that satisfy the 2-hour condition; to apply it, two continuous groupings have to be created, as follows.
3. One grouping is based on the running sum of time differences (temp_session_2hr), and based on that the next grouping final_session_groups is found.
4. Sum these continuous groupings and add 1 to populate the final_session column at the end, then concatenate as per your requirement to produce the session_id.
The result will look like this:
+------+---------------------+-------------+---------------------+
|UserId|Click_Time |final_session|session_id |
+------+---------------------+-------------+---------------------+
|U2 |2019-01-01 11:00:00.0|1 |U2 session_val----->1|
|U2 |2019-01-02 11:00:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 11:25:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 11:50:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 12:15:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 12:40:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 13:05:00.0|3 |U2 session_val----->3|
|U2 |2019-01-02 13:20:00.0|3 |U2 session_val----->3|
|U1 |2019-01-01 11:00:00.0|1 |U1 session_val----->1|
|U1 |2019-01-01 11:15:00.0|1 |U1 session_val----->1|
|U1 |2019-01-01 12:00:00.0|2 |U1 session_val----->2|
|U1 |2019-01-01 12:20:00.0|2 |U1 session_val----->2|
|U1 |2019-01-01 15:00:00.0|3 |U1 session_val----->3|
+------+---------------------+-------------+---------------------+

Including null values in an Apache Spark Join

I would like to include null values in an Apache Spark join. Spark doesn't include rows with null by default.
Here is the default Spark behavior.
val numbersDf = Seq(
("123"),
("456"),
(null),
("")
).toDF("numbers")
val lettersDf = Seq(
("123", "abc"),
("456", "def"),
(null, "zzz"),
("", "hhh")
).toDF("numbers", "letters")
val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))
Here is the output of joinedDf.show():
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
+-------+-------+
This is the output I would like:
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
| null| zzz|
+-------+-------+
Spark provides a special NULL safe equality operator:
numbersDf
.join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
.drop(lettersDf("numbers"))
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| null| zzz|
| | hhh|
+-------+-------+
Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 - Fast null-safe join).
In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:
numbers_df = sc.parallelize([
("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])
letters_df = sc.parallelize([
("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])
numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
+-------+-------+-------+
|numbers|numbers|letters|
+-------+-------+-------+
| 456| 456| def|
| null| null| zzz|
| | | hhh|
| 123| 123| abc|
+-------+-------+-------+
and %<=>% in SparkR:
numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
numbers = c("123", "456", NA, ""),
letters = c("abc", "def", "zzz", "hhh")
))
head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
numbers numbers letters
1 456 456 def
2 <NA> <NA> zzz
3 hhh
4 123 123 abc
With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:
SELECT * FROM numbers JOIN letters
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
This can be used with the DataFrame API as well:
numbersDf.alias("numbers")
.join(lettersDf.alias("letters"))
.where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
val numbers2 = numbersDf.withColumnRenamed("numbers","num1") //rename columns so that we can disambiguate them in the join
val letters2 = lettersDf.withColumnRenamed("numbers","num2")
val joinedDf = numbers2.join(letters2, $"num1" === $"num2" || ($"num1".isNull && $"num2".isNull) ,"outer")
joinedDf.select("num1","letters").withColumnRenamed("num1","numbers").show //rename the columns back to the original names
Based on K L's idea, you could use foldLeft to generate the join column expression:
def nullSafeJoin(rightDF: DataFrame, columns: Seq[String], joinType: String)(leftDF: DataFrame): DataFrame =
{
val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
val fullExpr = columns.tail.foldLeft(colExpr) {
(colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
}
leftDF.join(rightDF, fullExpr, joinType)
}
Then you could call this function like:
aDF.transform(nullSafeJoin(bDF, columns, joinType))
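For instance, with the sample DataFrames from the question, a usage sketch would be (note both numbers columns are kept, which a later answer addresses):
val joined = numbersDf.transform(nullSafeJoin(lettersDf, Seq("numbers"), "inner"))
joined.show()
// the null row is matched against "zzz", alongside the "123", "456" and "" rows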
Complementing the other answers: for PySpark < 2.3.0 you would not have Column.eqNullSafe nor IS NOT DISTINCT FROM.
You can still build the <=> operator with a SQL expression and include it in the join, as long as you define aliases for the join queries:
from pyspark.sql.types import StringType
import pyspark.sql.functions as F
numbers_df = spark.createDataFrame (["123","456",None,""], StringType()).toDF("numbers")
letters_df = spark.createDataFrame ([("123", "abc"),("456", "def"),(None, "zzz"),("", "hhh") ]).\
toDF("numbers", "letters")
joined_df = numbers_df.alias("numbers").join(letters_df.alias("letters"),
F.expr('numbers.numbers <=> letters.numbers')).\
select('letters.*')
joined_df.show()
+-------+-------+
|numbers|letters|
+-------+-------+
| 456| def|
| null| zzz|
| | hhh|
| 123| abc|
+-------+-------+
Based on timothyzhang's idea one can further improve it by removing duplicate columns:
def dropDuplicateColumns(df: DataFrame, rightDf: DataFrame, cols: Seq[String]): DataFrame
= cols.foldLeft(df)((df, c) => df.drop(rightDf(c)))
def joinTablesWithSafeNulls(rightDF: DataFrame, leftDF: DataFrame, columns: Seq[String], joinType: String): DataFrame =
{
val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
val fullExpr = columns.tail.foldLeft(colExpr) {
(colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
}
val finalDF = leftDF.join(rightDF, fullExpr, joinType)
val filteredDF = dropDuplicateColumns(finalDF, rightDF, columns)
filteredDF
}
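A hedged usage sketch with the same sample DataFrames (note that in this signature the right DataFrame comes first):
val deduped = joinTablesWithSafeNulls(lettersDf, numbersDf, Seq("numbers"), "inner")
deduped.show()
// only the left-hand numbers column remains, and the null/"zzz" row is included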
Try the following method to include the null rows in the result of the JOIN operator:
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
var columnsExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
columns.drop(1).foreach(column => {
columnsExpr = columnsExpr && (leftDF(column) <=> rightDF(column))
})
var joinedDF: DataFrame = leftDF.join(rightDF, columnsExpr, joinType)
columns.foreach(column => {
joinedDF = joinedDF.drop(leftDF(column))
})
joinedDF
}