I have a function that takes a LocalDate (it could take any other type) and returns a DataFrame, e.g.:
def genDataFrame(refDate: LocalDate): DataFrame = {
  Seq(
    (refDate, refDate.minusDays(7)),
    (refDate.plusDays(3), refDate.plusDays(7))
  ).toDF("col_A", "col_B")
}
genDataFrame(LocalDate.parse("2021-07-02")) output:
+----------+----------+
| col_A| col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-05|2021-07-09|
+----------+----------+
I want to apply this function to each element in a DataFrame column (which, obviously, contains LocalDate values), such as:
val myDate = LocalDate.parse("2021-07-02")
val df = Seq(
  myDate,
  myDate.plusDays(1),
  myDate.plusDays(3)
).toDF("date")
df:
+----------+
| date|
+----------+
|2021-07-02|
|2021-07-03|
|2021-07-05|
+----------+
Required output:
+----------+----------+
| col_A| col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-05|2021-07-09|
|2021-07-03|2021-06-26|
|2021-07-06|2021-07-10|
|2021-07-05|2021-06-28|
|2021-07-08|2021-07-12|
+----------+----------+
How could I achieve that (without using collect)?
You can always convert your data frame to a lazily evaluated view and use Spark SQL:
import org.apache.spark.sql.functions.col

val df_2 = df.map(x => x.getDate(0).toLocalDate())
  .withColumnRenamed("value", "col_A")
  .withColumn("col_B", col("col_A"))
df_2.createOrReplaceTempView("test")
With that you can create a view like this one:
+----------+----------+
| col_A| col_B|
+----------+----------+
|2021-07-02|2021-07-02|
|2021-07-03|2021-07-03|
|2021-07-05|2021-07-05|
+----------+----------+
And then you can use SQL, which I find more intuitive:
spark.sql(s"""SELECT col_A, date_add(col_B, -7) as col_B FROM test
UNION
SELECT date_add(col_A, 3), date_add(col_B, 7) as col_B FROM test""")
.show()
This gives your expected output as a DataFrame:
+----------+----------+
| col_A| col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-03|2021-06-26|
|2021-07-05|2021-06-28|
|2021-07-05|2021-07-09|
|2021-07-06|2021-07-10|
|2021-07-08|2021-07-12|
+----------+----------+
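If you prefer to stay in the DataFrame API, the same union can be expressed directly. A minimal sketch built on the df_2 above (note that union() behaves like UNION ALL, so distinct() is added to mirror the SQL UNION):
import org.apache.spark.sql.functions.{col, date_add}

val result = df_2
  .select(col("col_A"), date_add(col("col_B"), -7).as("col_B"))
  .union(df_2.select(date_add(col("col_A"), 3).as("col_A"), date_add(col("col_B"), 7).as("col_B")))
  .distinct() // SQL UNION deduplicates; DataFrame union() does not
result.show()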
Related
I have two PySpark DataFrames as follows:
df1 = spark.createDataFrame(
    ["yes", "no", "yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")

df2 = spark.createDataFrame(
    ["yes", "no"],
    "string"
).toDF("location")
I want to check if values in the location column of df1 start with values in the location column of df2, and vice versa.
Something like:
df1.select("location").startsWith(df2.location)
The following is the output I am expecting:
+-------------+
| location|
+-------------+
| yes|
| no|
| yes23|
| no3|
+-------------+
Using Spark SQL looks the easiest to me:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
joined = spark.sql("""
    select df1.*
    from df1
    join df2
    on df1.location rlike '^' || df2.location
""")
I'm trying to find an exact string match in a DataFrame column from an employee DataFrame:
Employee  days_present
Alex      1,2,11,23
John      21,23,25,28
I need to find which employees are present on day 2, based on the days_present column.
Expected output:
Alex
Below is what I have tried:
df = spark.sql("select * from employee where days_present RLIKE '2'")
df.show()
This returns both Alex & John.
I would also like to find out who is present on days 2 & 11; in this case the expected output is only Alex.
We can use the array_intersect function, available from Spark 2.4+, and then check that the array size is >= 2.
Example:
df.show()
+--------+------------+
|Employee|days_present|
+--------+------------+
| Alex| 1,2,11,23|
| John| 21,23,25,28|
+--------+------------+
#DataFrame[Employee: string, days_present: string]
df.withColumn("tmp",split(col("days_present"),",")).\
withColumn("intersect",array_intersect(col("tmp"),array(lit("2"),lit("11")))).\
filter(size("intersect") >= 2).\
drop("tmp","intersect").\
show()
#+--------+------------+
#|Employee|days_present|
#+--------+------------+
#| Alex| 1,2,11,23|
#+--------+------------+
In Spark SQL:
df.createOrReplaceTempView("tmp")
spark.sql("""select Employee,days_present from (select *,size(array_intersect(split(days_present,","),array("2","11")))size from tmp)e where size >=2""").show()
#+--------+------------+
#|Employee|days_present|
#+--------+------------+
#| Alex| 1,2,11,23|
#+--------+------------+
I have a dataframe a:
id,value
1,11
2,22
3,33
And another dataframe b:
id,value
1,123
3,345
I want to update dataframe a with all matching values from b (based on column 'id').
Final dataframe 'c' would be:
id,value
1,123
2,22
3,345
How can I achieve that using DataFrame joins (or another approach)?
Tried:
a.join(b, a.id == b.id, "inner").drop(a.value)
This gives (not the desired output):
+---+---+-----+
| id| id|value|
+---+---+-----+
| 1| 1| 123|
| 3| 3| 345|
+---+---+-----+
Thanks.
I don't think there is an update functionality. But this should work:
import pyspark.sql.functions as F
df1.join(df2, df1.id == df2.id, "left_outer") \
   .select(df1.id, F.when(df2.value.isNull(), df1.value).otherwise(df2.value).alias("value"))
How can I append an item to an array in dataframe (spark 2.3)?
Here is an example with integers, but the real case is with struct.
Input:
+------+-------------+
| key| my_arr |
+------+-------------+
|5 |[3,14] |
|3 |[9,5.99] |
+------+-------------+
output:
+-------------+
| my_arr |
+-------------+
|[3,14,5] |
|[9,5.99,3] |
+-------------+
You must create a UDF to add elements; with integers it is easy, but with structs it is more complicated.
With integers the code is:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}

val udfConcat = udf((key: Int, my_arr: WrappedArray[Int]) => my_arr :+ key)
df.withColumn("my_arr", udfConcat(col("key"), col("my_arr"))).drop("key").show()
With structs the code is:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructType}

val schemaTyped = new StructType()
  .add("name", StringType)
  .add("age", IntegerType)
val schema = ArrayType(schemaTyped)

val udfConcatStruct = udf((key: Row, my_arr: Seq[Row]) => my_arr :+ key, schema)
df2.withColumn("my_arr", udfConcatStruct(col("key"), col("my_arr"))).drop("key").show(false)
When you create the UDF, you must pass the schema of the array; in this example it is an array of elements with names and ages.
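For context, a minimal sketch of an input this UDF would accept (the data and this df2 construction are hypothetical, assuming spark.implicits._ is in scope as in the question):
// Hypothetical sample data: `key` is a struct column and `my_arr` an array of the same struct.
// The tuple field names are positional (_1, _2); the UDF's return schema relabels them name/age.
val df2 = Seq(
  (("Ann", 30), Seq(("Bob", 25), ("Eve", 41)))
).toDF("key", "my_arr")

df2.withColumn("my_arr", udfConcatStruct(col("key"), col("my_arr"))).drop("key").show(false)
// my_arr now ends with the former key struct, e.g. [[Bob, 25], [Eve, 41], [Ann, 30]]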
Here is another way using Struct:
Input:
df.show()
+---+--------+
|Key|My_Array|
+---+--------+
| 5| [3,14]|
| 3| [9,45]|
+---+--------+
df.withColumn("My_Array", struct($"My_Array.*", $"Key")).show(false)
Output:
+---+--------+
|Key|My_Array|
+---+--------+
|5 |[3,14,5]|
|3 |[9,45,3]|
+---+--------+
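For reference, a minimal sketch of how such an input could be built (hypothetical data; this approach assumes My_Array is a struct column rather than a true array, which is what the $"My_Array.*" expansion relies on):
import org.apache.spark.sql.functions.struct
import spark.implicits._

// Hypothetical input: My_Array is a struct<_1:int,_2:int> column, not an array column.
val df = Seq((5, (3, 14)), (3, (9, 45))).toDF("Key", "My_Array")

// Expand the struct's fields and append Key as an extra field.
df.withColumn("My_Array", struct($"My_Array.*", $"Key")).show(false)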
Solution without UDF - PYSPARK
I was facing a similar kind of problem and definitely didn't want to use a UDF because of the performance degradation.
spark_df.show(3,False)
+---+-----------+
|key|myarr |
+---+-----------+
|5 |[3.0, 14.0]|
|3 |[9.0, 5.99]|
+---+-----------+
import pyspark.sql.functions as F

spark_df = spark_df.withColumn("myarr",
    F.split(F.concat(F.concat_ws(",", F.col("myarr")), F.lit(","), F.col("key")), r",\s*"))
spark_df.select("myarr").show(3, False)
Output:
+------------+
|myarr |
+------------+
|[3.0,14.0,5]|
|[9.0,5.99,3]|
+------------+
Method steps:
First convert the array column into a string using the concat_ws function
Use the concat function to append the required column ("key") to the original column ("myarr")
Use the split function to convert the string column from the step above back into an array
Hope this helps.
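For comparison, roughly the same trick sketched in Scala (spark_df and the column names are assumed to match the example above):
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit, split}

// Flatten the array to a comma-separated string, append key, then split back into an array.
// As in the PySpark version, the resulting elements are all strings.
val result = spark_df.withColumn("myarr",
  split(concat(concat_ws(",", col("myarr")), lit(","), col("key")), ",\\s*"))
result.select("myarr").show(false)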
I have a column in a DataFrame (d1): MODEL_SCORE, which has values like nulll7880.
I want to create another column, MODEL_SCORE1, in the DataFrame which is a substring of MODEL_SCORE.
I am trying this. It creates the column, but does not give the expected result:
val x=d1.withColumn("MODEL_SCORE1", substring(col("MODEL_SCORE"),0,4))
val y=d1.select(col("MODEL_SCORE"), substring(col("MODEL_SCORE"),0,4).as("MODEL_SCORE1"))
One way to do this is to define a UDF that will slice your column's string value as per your need. Sample code would be as follows:
val df = sc.parallelize(List((1,"nulll7880"),(2,"null9000"))).toDF("id","col1")
df.show
//output
+---+---------+
| id| col1|
+---+---------+
| 1|nulll7880|
| 2| null9000|
+---+---------+
def splitString: (String => String) = { str => str.slice(0, 4) }
val splitStringUDF = org.apache.spark.sql.functions.udf(splitString)
df.withColumn("col2", splitStringUDF(df("col1"))).show
//output
+---+---------+----+
| id| col1|col2|
+---+---------+----+
| 1|nulll7880|null|
| 2| null9000|null|
+---+---------+----+