Check whether values in the first dataframe start with any of the values in the second dataframe - dataframe

I have two PySpark dataframes, as follows:
df1 = spark.createDataFrame(
    ["yes", "no", "yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")

df2 = spark.createDataFrame(
    ["yes", "no"],
    "string"
).toDF("location")
I want to check whether the values in the location column of df1 start with the values in the location column of df2, and vice versa.
Something like:
df1.select("location").startsWith(df2.location)
The following is the output I am expecting here:
+-------------+
| location|
+-------------+
| yes|
| no|
| yes23|
| no3|
+-------------+

Using Spark SQL looks the easiest to me:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')

joined = spark.sql("""
    select df1.*
    from df1
    join df2
      on df1.location rlike '^' || df2.location
""")

Related

Scala - Apply a function to each value in a dataframe column

I have a function that takes a LocalDate (it could take any other type) and returns a DataFrame, e.g.:
def genDataFrame(refDate: LocalDate): DataFrame = {
  Seq(
    (refDate, refDate.minusDays(7)),
    (refDate.plusDays(3), refDate.plusDays(7))
  ).toDF("col_A", "col_B")
}
genDataFrame(LocalDate.parse("2021-07-02")) output:
+----------+----------+
| col_A| col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-05|2021-07-09|
+----------+----------+
I want to apply this function to each element in a dataframe column (which obviously contains LocalDate values), such as:
val myDate = LocalDate.parse("2021-07-02")

val df = Seq(
  myDate,
  myDate.plusDays(1),
  myDate.plusDays(3)
).toDF("date")
df:
+----------+
| date|
+----------+
|2021-07-02|
|2021-07-03|
|2021-07-05|
+----------+
Required output:
+----------+----------+
| col_A| col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-05|2021-07-09|
|2021-07-03|2021-06-26|
|2021-07-06|2021-07-10|
|2021-07-05|2021-06-28|
|2021-07-08|2021-07-12|
+----------+----------+
How could I achieve that (without using collect)?
You can always convert your data frame to a lazily evaluated view and use Spark SQL:
import org.apache.spark.sql.functions.col
import spark.implicits._

val df_2 = df
  .map(x => x.getDate(0).toLocalDate())
  .withColumnRenamed("value", "col_A")
  .withColumn("col_B", col("col_A"))

df_2.createOrReplaceTempView("test")
With that you can create a view like this one:
+----------+----------+
| col_A| col_B|
+----------+----------+
|2021-07-02|2021-07-02|
|2021-07-03|2021-07-03|
|2021-07-05|2021-07-05|
+----------+----------+
And then you can use SQL, which I find more intuitive:
spark.sql(s"""SELECT col_A, date_add(col_B, -7) as col_B FROM test
UNION
SELECT date_add(col_A, 3), date_add(col_B, 7) as col_B FROM test""")
.show()
This gives your expected output as a DataFrame:
+----------+----------+
| col_A| col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-03|2021-06-26|
|2021-07-05|2021-06-28|
|2021-07-05|2021-07-09|
|2021-07-06|2021-07-10|
|2021-07-08|2021-07-12|
+----------+----------+
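For completeness, here is a sketch of the same two-branch union written with the DataFrame API instead of SQL, assuming the single-column df ("date") from the question. union here is a UNION ALL, which makes no difference because the rows are already distinct:

import org.apache.spark.sql.functions.{col, date_add}

// Sketch only: one select per branch, mirroring genDataFrame's two rows
val result = df
  .select(col("date").as("col_A"), date_add(col("date"), -7).as("col_B"))
  .union(df.select(date_add(col("date"), 3).as("col_A"), date_add(col("date"), 7).as("col_B")))

result.show()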

How to filter out elements in each row of a List[StringType] column in a spark Dataframe?

I want to retain, in each row of an ArrayType column, only the elements that are present in a dictionary list. For instance, my Spark dataframe is:
val df = sc.parallelize(Seq(Seq("A", "B"),Seq("C","X"),Seq("A", "B", "C", "D", "Z"))).toDF("column")
+---------------+
| column|
+---------------+
| [A, B]|
| [C, X]|
|[A, B, C, D, Z]|
+---------------+
and I have dictionary_list = ["A", "Z", "X", "Y"], I want the rows of my output dataframe to be:
val outp_df
+---------------+
| column|
+---------------+
| [A]|
| [X]|
| [A, Z]|
+---------------+
I tried array_contains, arrays_overlap, etc., but the output I'm getting is like this:
val result = df.where(array_contains(col("column"), "A"))
+---------------+
| column|
+---------------+
| [A, B]|
|[A, B, C, D, Z]|
+---------------+
The rows are getting filtered, but I want to filter inside the list/row itself. Any way I can do this?
The result you're getting makes perfect sense: you are ONLY selecting rows whose column array contains "A", which is the first row and the last row. What you need is a function (NOT a SQL row filter) that receives an input sequence of strings and returns a sequence containing only the values that exist in your dictionary. You can use a UDF like this:
// the dictionary from the question
val dictionaryList = Seq("A", "Z", "X", "Y")

// rename this as you wish
val myCustomFilter: Seq[String] => Seq[String] =
  input => input.filter(dictionaryList.contains)

// register it with your Spark session; rename this as you wish
spark.udf.register("custom_filter", myCustomFilter)
And then you need a select operator, not a filter operator! A filter operator only keeps rows that satisfy the predicate, which is not what you want.
Spark shell result:
scala> df.select(expr("custom_filter(column)")).show
+---------------------+
|custom_filter(column)|
+---------------------+
|                  [A]|
|                  [X]|
|               [A, Z]|
+---------------------+
Alternatively, use the built-in filter function:
val df1 = df.withColumn(
  "column",
  expr("filter(column, x -> x in ('A', 'Z', 'X', 'Y'))")
)
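Since the dictionary already lives in a Scala Seq (dictionaryList above), here is a sketch of another Spark 2.4+ option that avoids hardcoding the values inside the SQL expression; note that array_intersect also drops duplicate elements, which makes no difference for this data:

import org.apache.spark.sql.functions.{array_intersect, col, typedLit}

// keep only the elements of "column" that also appear in dictionaryList
val df2 = df.withColumn("column", array_intersect(col("column"), typedLit(dictionaryList)))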

Find a specific string in Spark SQL - PySpark

I'm trying to find an exact string match in a dataframe column of the employee dataframe:
Employee  days_present
Alex      1,2,11,23
John      21,23,25,28
I need to find which employees are present on the 2nd, based on the days_present column.
Expected output:
Alex
Below is what I have tried:
df = spark.sql("select * from employee where days_present RLIKE '2')
df.show()
This returns both Alex & John.
I would also like to find out who is present on both 2 & 11; in this case the expected output is only Alex.
We can use the array_intersect function (available in Spark 2.4+) and then check whether the array size is >= 2.
Example:
df.show()
+--------+------------+
|Employee|days_present|
+--------+------------+
| Alex| 1,2,11,23|
| John| 21,23,25,28|
+--------+------------+
#DataFrame[Employee: string, days_present: string]
df.withColumn("tmp",split(col("days_present"),",")).\
withColumn("intersect",array_intersect(col("tmp"),array(lit("2"),lit("11")))).\
filter(size("intersect") >= 2).\
drop("tmp","intersect").\
show()
#+--------+------------+
#|Employee|days_present|
#+--------+------------+
#| Alex| 1,2,11,23|
#+--------+------------+
In Spark SQL:
df.createOrReplaceTempView("tmp")

spark.sql("""
    select Employee, days_present
    from (
        select *,
               size(array_intersect(split(days_present, ","), array("2", "11"))) as size
        from tmp
    ) e
    where size >= 2
""").show()
#+--------+------------+
#|Employee|days_present|
#+--------+------------+
#| Alex| 1,2,11,23|
#+--------+------------+
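For the first requirement (present on the 2nd), RLIKE '2' also matches 21, 23, etc. Here is a sketch of an exact day match using split plus array_contains, assuming the same df as above:

from pyspark.sql.functions import array_contains, col, split

# exact match on day "2" instead of the substring match done by RLIKE '2'
df.filter(array_contains(split(col("days_present"), ","), "2")).select("Employee").show()
#+--------+
#|Employee|
#+--------+
#|    Alex|
#+--------+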

PySpark - how to update Dataframe by using join?

I have a dataframe a:
id,value
1,11
2,22
3,33
And another dataframe b:
id,value
1,123
3,345
I want to update dataframe a with all matching values from b (based on column 'id').
Final dataframe 'c' would be:
id,value
1,123
2,22
3,345
How do I achieve that using dataframe joins (or another approach)?
Tried:
a.join(b, a.id == b.id, "inner").drop(a.value)
Gives (not desired output):
+---+---+-----+
| id| id|value|
+---+---+-----+
| 1| 1| 123|
| 3| 3| 345|
+---+---+-----+
Thanks.
I don't think there is built-in update functionality, but this should work:
import pyspark.sql.functions as F

# here df1 is dataframe a and df2 is dataframe b from the question
df1.join(df2, df1.id == df2.id, "left_outer") \
   .select(df1.id, F.when(df2.value.isNull(), df1.value).otherwise(df2.value).alias("value"))
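An equivalent sketch using coalesce with the question's names a and b; renaming b's value column first avoids the ambiguous column reference:

from pyspark.sql import functions as F

# left join keeps every row of a; coalesce prefers b's value whenever a match exists
c = (a.join(b.withColumnRenamed("value", "value_b"), on="id", how="left")
      .select("id", F.coalesce(F.col("value_b"), F.col("value")).alias("value")))
c.show()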

Creating a column in a dataframe based on substring of another column, scala

I have a column in a dataframe (d1): MODEL_SCORE, which has values like nulll7880.
I want to create another column MODEL_SCORE1 in the dataframe which is a substring of MODEL_SCORE.
I am trying this. It's creating the column, but not giving the expected result:
val x = d1.withColumn("MODEL_SCORE1", substring(col("MODEL_SCORE"), 0, 4))
val y = d1.select(col("MODEL_SCORE"), substring(col("MODEL_SCORE"), 0, 4).as("MODEL_SCORE1"))
One way to do this is to define a UDF that slices your column's string value as per your need. Sample code would be as follows:
val df = sc.parallelize(List((1,"nulll7880"),(2,"null9000"))).toDF("id","col1")
df.show
//output
+---+---------+
| id| col1|
+---+---------+
| 1|nulll7880|
| 2| null9000|
+---+---------+
def splitString: (String => String) = { str => str.slice(0, 4) }

val splitStringUDF = org.apache.spark.sql.functions.udf(splitString)

df.withColumn("col2", splitStringUDF(df("col1"))).show
//output
+---+---------+----+
| id| col1|col2|
+---+---------+----+
| 1|nulll7880|null|
| 2| null9000|null|
+---+---------+----+
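If the goal is actually the digits after the leading text rather than the first four characters, here is a hedged sketch using regexp_extract, with no UDF needed; the regex is only an assumption about the data format:

import org.apache.spark.sql.functions.{col, regexp_extract}

// assumption: MODEL_SCORE ends with the digits we want to keep (e.g. nulll7880 -> 7880)
val z = d1.withColumn("MODEL_SCORE1", regexp_extract(col("MODEL_SCORE"), "(\\d+)$", 1))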