Creating a column in a dataframe based on a substring of another column (Scala / Spark SQL)

I have a column MODEL_SCORE in a dataframe (d1), which has values like nulll7880.
I want to create another column MODEL_SCORE1 in the dataframe which is a substring of MODEL_SCORE.
I am trying this. It creates the column, but does not give the expected result:
val x=d1.withColumn("MODEL_SCORE1", substring(col("MODEL_SCORE"),0,4))
val y=d1.select(col("MODEL_SCORE"), substring(col("MODEL_SCORE"),0,4).as("MODEL_SCORE1"))

One way to do this is to define a UDF that slices your column's string value as per your need. Sample code is as follows:
val df = sc.parallelize(List((1,"nulll7880"),(2,"null9000"))).toDF("id","col1")
df.show
//output
+---+---------+
| id| col1|
+---+---------+
| 1|nulll7880|
| 2| null9000|
+---+---------+
val splitString: String => String = str => str.slice(0, 4)
val splitStringUDF = org.apache.spark.sql.functions.udf(splitString)
df.withColumn("col2",splitStringUDF(df("col1"))).show
//output
+---+---------+----+
| id| col1|col2|
+---+---------+----+
| 1|nulll7880|null|
| 2| null9000|null|
+---+---------+----+
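Worth noting: Spark's substring(col, pos, len) is 1-based, and pos 0 behaves like pos 1, so substring(col("MODEL_SCORE"), 0, 4) returns the first four characters, which for these sample values is the literal prefix "null". A plain-Python sketch of the slicing involved (no Spark needed; the index 5 used to drop the prefix is an assumption based on the sample value nulll7880):

```python
# Plain-Python sketch of the string slicing the UDF performs.
# slice(0, 4) keeps the first four characters, i.e. the "null" prefix;
# slicing from index 5 (hypothetical, based on the sample "nulll7880")
# would keep the numeric part instead.
def take_prefix(s, n=4):
    return s[:n]   # what substring(col, 0, 4) effectively returns

def drop_prefix(s, n=5):
    return s[n:]   # keeps "7880" for "nulll7880"

print(take_prefix("nulll7880"))  # null
print(drop_prefix("nulll7880"))  # 7880
```

Adjust the slice bounds to whatever part of the string you actually need.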

Related

Scala - Apply a function to each value in a dataframe column

I have a function that takes a LocalDate (it could take any other type) and returns a DataFrame, e.g.:
def genDataFrame(refDate: LocalDate): DataFrame = {
  Seq(
    (refDate, refDate.minusDays(7)),
    (refDate.plusDays(3), refDate.plusDays(7))
  ).toDF("col_A", "col_B")
}
genDataFrame(LocalDate.parse("2021-07-02")) output:
+----------+----------+
| col_A| col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-05|2021-07-09|
+----------+----------+
I want to apply this function to each element of a dataframe column (which, obviously, contains LocalDate values), such as:
val myDate = LocalDate.parse("2021-07-02")
val df = Seq(
  myDate,
  myDate.plusDays(1),
  myDate.plusDays(3)
).toDF("date")
df:
+----------+
| date|
+----------+
|2021-07-02|
|2021-07-03|
|2021-07-05|
+----------+
Required output:
+----------+----------+
| col_A| col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-05|2021-07-09|
|2021-07-03|2021-06-26|
|2021-07-06|2021-07-10|
|2021-07-05|2021-06-28|
|2021-07-08|2021-07-12|
+----------+----------+
How could I achieve that (without using collect)?
You can always convert your data frame to a lazily evaluated view and use Spark SQL:
val df_2 = df.map(x => x.getDate(0).toLocalDate()).withColumnRenamed("value", "col_A")
.withColumn("col_B", col("col_A"))
df_2.createOrReplaceTempView("test")
With that you can create a view like this one:
+----------+----------+
| col_A| col_B|
+----------+----------+
|2021-07-02|2021-07-02|
|2021-07-03|2021-07-03|
|2021-07-05|2021-07-05|
+----------+----------+
And then you can use SQL, which I find more intuitive:
spark.sql(s"""SELECT col_A, date_add(col_B, -7) as col_B FROM test
UNION
SELECT date_add(col_A, 3), date_add(col_B, 7) as col_B FROM test""")
.show()
This gives your expected output as a DataFrame:
+----------+----------+
| col_A| col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-03|2021-06-26|
|2021-07-05|2021-06-28|
|2021-07-05|2021-07-09|
|2021-07-06|2021-07-10|
|2021-07-08|2021-07-12|
+----------+----------+
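The UNION query above computes, for each input date d, the two rows (d, d - 7 days) and (d + 3 days, d + 7 days). A plain-Python sketch of that per-row transform (no Spark; only to illustrate the date arithmetic):

```python
from datetime import date, timedelta

# For each input date d, produce the two rows the UNION query generates:
# (d, d - 7 days) and (d + 3 days, d + 7 days).
def gen_rows(d):
    return [(d, d - timedelta(days=7)),
            (d + timedelta(days=3), d + timedelta(days=7))]

dates = [date(2021, 7, 2), date(2021, 7, 3), date(2021, 7, 5)]
rows = [row for d in dates for row in gen_rows(d)]
print(rows[0])  # (datetime.date(2021, 7, 2), datetime.date(2021, 6, 25))
```

Each of the 3 input dates yields 2 rows, giving the 6 rows of the required output.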

How to validate a particular column in a Dataframe without troubling other columns using spark-sql?

set.createOrReplaceTempView("input1");
String look = "select case when length(date)>0 then 'Y' else 'N' end as date from input1";
Dataset<Row> Dataset_op = spark.sql(look);
Dataset_op.show();
In the above code, the dataframe 'set' has 10 columns and I've done the validation for one of them, i.e. 'date'. It returns the date column alone.
My question is: how do I return all the columns, along with the validated date column, in a single dataframe?
Is there any way to get all the columns without manually selecting them all in the select statement? Please share your suggestions. TIA.
Data
df = spark.createDataFrame([
    (1, '2022-03-01'),
    (2, '2022-04-17'),
    (3, None)
], ('id', 'date'))
df.show()
+---+----------+
| id| date|
+---+----------+
| 1|2022-03-01|
| 2|2022-04-17|
| 3| null|
+---+----------+
You have two options.
Option 1: select without projecting a new column with Y and N:
df.createOrReplaceTempView("input1");
String_look = "select id, date from input1 where length(date)>0";
Dataset_op = spark.sql(String_look).show()
+---+----------+
| id| date|
+---+----------+
| 1|2022-03-01|
| 2|2022-04-17|
+---+----------+
Option 2: project Y and N into a new column. Remember that the WHERE clause is applied before column projection, so you can't use the newly created column in the WHERE clause:
String_look = "select id, date, case when length(date)>0 then 'Y' else 'N' end as status from input1 where length(date)>0";
+---+----------+------+
| id| date|status|
+---+----------+------+
| 1|2022-03-01| Y|
| 2|2022-04-17| Y|
+---+----------+------+
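The CASE WHEN logic above boils down to flagging each row while leaving every other column untouched. A plain-Python sketch of that row-wise validation (no Spark; note that length(NULL) is NULL in SQL, so a missing date falls through to 'N'):

```python
# Row-wise sketch of: case when length(date) > 0 then 'Y' else 'N' end,
# keeping every other column as-is.
rows = [{"id": 1, "date": "2022-03-01"},
        {"id": 2, "date": "2022-04-17"},
        {"id": 3, "date": None}]

validated = [{**r, "status": "Y" if r["date"] else "N"} for r in rows]
print([v["status"] for v in validated])  # ['Y', 'Y', 'N']
```

Because the flag is projected as a new column, the original 10 columns pass through unchanged; no need to enumerate them in the validation logic.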

Find a specific string in Spark SQL (PySpark)

I'm trying to find an exact string match in a dataframe column of the employee dataframe:
Employee  days_present
Alex      1,2,11,23,
John      21,23,25,28
I need to find which employees are present on day 2, based on the days_present column.
Expected output:
Alex
Below is what I have tried:
df = spark.sql("select * from employee where days_present RLIKE '2'")
df.show()
This returns both Alex & John.
I would also like to find out who is present on days 2 & 11; in that case the expected output is only Alex.
We can use the array_intersect function (available from Spark 2.4+) and then check that the array size is >= 2.
Example:
df.show()
+--------+------------+
|Employee|days_present|
+--------+------------+
| Alex| 1,2,11,23|
| John| 21,23,25,28|
+--------+------------+
#DataFrame[Employee: string, days_present: string]
df.withColumn("tmp",split(col("days_present"),",")).\
withColumn("intersect",array_intersect(col("tmp"),array(lit("2"),lit("11")))).\
filter(size("intersect") >= 2).\
drop("tmp","intersect").\
show()
#+--------+------------+
#|Employee|days_present|
#+--------+------------+
#| Alex| 1,2,11,23|
#+--------+------------+
In spark-sql:
df.createOrReplaceTempView("tmp")
spark.sql("""select Employee, days_present from (select *, size(array_intersect(split(days_present, ","), array("2", "11"))) as size from tmp) e where size >= 2""").show()
#+--------+------------+
#|Employee|days_present|
#+--------+------------+
#| Alex| 1,2,11,23|
#+--------+------------+
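The reason RLIKE '2' matches both rows is that it is a regex search over the whole comma-separated string, so "21", "23", etc. also hit. A plain-Python sketch contrasting that with the split-then-intersect approach used above (no Spark):

```python
import re

days = {"Alex": "1,2,11,23", "John": "21,23,25,28"}

# RLIKE '2' is a regex search over the raw string: "21", "23", ... also match.
rlike_hits = [emp for emp, d in days.items() if re.search("2", d)]

# split + intersect checks exact day tokens, mirroring array_intersect.
wanted = {"2", "11"}
exact_hits = [emp for emp, d in days.items() if wanted <= set(d.split(","))]

print(rlike_hits)  # ['Alex', 'John']
print(exact_hits)  # ['Alex']
```

Splitting first turns a substring match into an exact token match, which is what the question actually needs.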

How to append item to array in Spark 2.3

How can I append an item to an array in dataframe (spark 2.3)?
Here is an example with integers, but the real case is with struct.
Input:
+------+-------------+
| key| my_arr |
+------+-------------+
|5 |[3,14] |
|3 |[9,5.99] |
+------+-------------+
output:
+-------------+
| my_arr |
+-------------+
|[3,14,5] |
|[9,5.99,3] |
+-------------+
You must create a UDF to add the elements; with an integer it is easy, but with a struct it is more complicated.
With integers the code is:
import scala.collection.mutable.WrappedArray
val udfConcat = udf((key: Int, my_arr: WrappedArray[Int]) => my_arr :+ key)
df.withColumn("my_arr", udfConcat(col("key"), col("my_arr"))).drop("key").show()
With a struct the code is:
val schemaTyped = new StructType()
  .add("name", StringType)
  .add("age", IntegerType)
val schema = ArrayType(schemaTyped)
val udfConcatStruct = udf((key: Row, my_arr: Seq[Row]) => my_arr :+ key, schema)
df2.withColumn("my_arr", udfConcatStruct(col("key"), col("my_arr"))).drop("key").show(false)
When you create the UDF, you must pass the schema of the array; in this example it is an array of elements with name and age fields.
Here is another way using Struct:
Input:
df.show()
+---+--------+
|Key|My_Array|
+---+--------+
| 5| [3,14]|
| 3| [9,45]|
+---+--------+
df.withColumn("My_Array", struct($"My_Array.*", $"Key")).show(false)
Output:
+---+--------+
|Key|My_Array|
+---+--------+
|5 |[3,14,5]|
|3 |[9,45,3]|
+---+--------+
Solution without UDF - PYSPARK
I was facing a similar kind of problem and definitely didn't want to use a UDF because of the performance degradation:
spark_df.show(3,False)
+---+-----------+
|key|myarr |
+---+-----------+
|5 |[3.0, 14.0]|
|3 |[9.0, 5.99]|
+---+-----------+
spark_df = spark_df.\
    withColumn("myarr", F.split(F.concat(F.concat_ws(",", F.col("myarr")), F.lit(","), F.col("key")), ",\s*"))
spark_df.select("myarr").show(3, False)
Output:
+------------+
|myarr |
+------------+
|[3.0,14.0,5]|
|[9.0,5.99,3]|
+------------+
Method steps:
1) Convert the array column into a string using concat_ws.
2) Use concat to merge the required column ("key") with the column from the previous step.
3) Use split to convert the string column back into an array.
Hope this helps.
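All three answers above implement the same row-wise transform: append the row's key to the end of its array. A plain-Python sketch of that transform (no Spark):

```python
# Append each row's key to the end of its array column, then drop the key.
rows = [(5, [3.0, 14.0]), (3, [9.0, 5.99])]
appended = [arr + [key] for key, arr in rows]
print(appended)  # [[3.0, 14.0, 5], [9.0, 5.99, 3]]
```

The Spark-specific difficulty is only in expressing this per-element append without a UDF (or, for structs, in supplying the result schema); the logic itself is this one-liner. From Spark 2.4 onward, the built-in array_union / concat functions can do this natively.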

Spark: how to perform a loop function on dataframes

I have two dataframes, shown below. I'm trying to search the second df using the foreign key and then generate a new dataframe. I was thinking of doing spark.sql("""select history.value as previous_year_1 from df1, history where df1.key=history.key and history.date=add_months($currentdate,-1*12)"""), but then I would need to do it multiple times, say for 10 previous years, and join the results back together. How can I create a function for this? Many thanks. Quite new here.
dataframe one:
+---+---+-----------+
|key|val| date |
+---+---+-----------+
| 1|100| 2018-04-16|
| 2|200| 2018-04-16|
+---+---+-----------+
dataframe two : historical data
+---+---+-----------+
|key|val| date |
+---+---+-----------+
| 1|10 | 2017-04-16|
| 1|20 | 2016-04-16|
+---+---+-----------+
The result I want to generate is
+---+----------+-----------------+-----------------+
|key|date | previous_year_1 | previous_year_2 |
+---+----------+-----------------+-----------------+
| 1|2018-04-16| 10 | 20 |
| 2|null | null | null |
+---+----------+-----------------+-----------------+
To solve this, the following approach can be applied:
1) Join the two dataframes by key.
2) Filter out all the rows where previous dates are not exactly years before reference dates.
3) Calculate the years difference for the row and put the value in a dedicated column.
4) Pivot the DataFrame around the column calculated in the previous step and aggregate on the value of the respective year.
private def generateWhereForPreviousYears(nbYears: Int): Column =
(-1 to -nbYears by -1) // loop on each backwards year value
.map(yearsBack =>
/*
* Each year back count number is transformed in an expression
* to be included into the WHERE clause.
* This is equivalent to "history.date=add_months($currentdate,-1*12)"
* in your comment in the question.
*/
add_months($"df1.date", 12 * yearsBack) === $"df2.date"
)
/*
The previous .map call produces a sequence of Column expressions,
we need to concatenate them with "or" in order to obtain
a single Spark Column reference. .reduce() function is most
appropriate here.
*/
.reduce(_ or _) or $"df2.date".isNull // the last "or" is added to include empty lines in the result.
val nbYearsBack = 3
val result = sourceDf1.as("df1")
.join(sourceDf2.as("df2"), $"df1.key" === $"df2.key", "left")
.where(generateWhereForPreviousYears(nbYearsBack))
.withColumn("diff_years", concat(lit("previous_year_"), year($"df1.date") - year($"df2.date")))
.groupBy($"df1.key", $"df1.date")
.pivot("diff_years")
.agg(first($"df2.val"))
.drop("null") // drop the unwanted extra column with null values
The output is:
+---+----------+---------------+---------------+
|key|date |previous_year_1|previous_year_2|
+---+----------+---------------+---------------+
|1 |2018-04-16|10 |20 |
|2 |2018-04-16|null |null |
+---+----------+---------------+---------------+
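The pivot hinges on the diff_years label built in the withColumn step: each history row is tagged previous_year_N, where N is the year gap to the reference date, and the pivot turns those tags into columns. A plain-Python sketch of the labelling (no Spark; the hard-coded years are taken from the sample data):

```python
# Tag each history row with previous_year_<gap> and pivot into a dict per key.
ref_year = {1: 2018, 2: 2018}                 # year of df1.date, per key
history = [(1, 10, 2017), (1, 20, 2016)]      # (key, val, year of df2.date)

pivoted = {k: {} for k in ref_year}
for key, val, year in history:
    pivoted[key]["previous_year_{}".format(ref_year[key] - year)] = val

print(pivoted[1])  # {'previous_year_1': 10, 'previous_year_2': 20}
```

Key 2 has no history rows, which is why its pivoted columns come out null in the Spark output.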
Let me "read between the lines" and give you a "similar" solution to what you are asking:
val df1Pivot = df1.groupBy("key").pivot("date").agg(max("val"))
val df2Pivot = df2.groupBy("key").pivot("date").agg(max("val"))
val result = df1Pivot.join(df2Pivot, Seq("key"), "left")
result.show
+---+----------+----------+----------+
|key|2018-04-16|2016-04-16|2017-04-16|
+---+----------+----------+----------+
| 1| 100| 20| 10|
| 2| 200| null| null|
+---+----------+----------+----------+
Feel free to manipulate the data a bit if you really need to change the column names.
Or even better:
df1.union(df2).groupBy("key").pivot("date").agg(max("val")).show
+---+----------+----------+----------+
|key|2016-04-16|2017-04-16|2018-04-16|
+---+----------+----------+----------+
| 1| 20| 10| 100|
| 2| null| null| 200|
+---+----------+----------+----------+
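The union + pivot one-liner works because pivoting on date yields one row per key and one column per distinct date, holding that key's value. A plain-Python sketch of the same reshaping with dicts (no Spark):

```python
# Union the two tables, then pivot: one dict per key, one entry per date.
df1 = [(1, 100, "2018-04-16"), (2, 200, "2018-04-16")]
df2 = [(1, 10, "2017-04-16"), (1, 20, "2016-04-16")]

pivoted = {}
for key, val, d in df1 + df2:
    pivoted.setdefault(key, {})[d] = val

print(pivoted[1]["2016-04-16"])  # 20
```

Dates with no value for a key are simply absent here; Spark fills them with null, as in the table above.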