Replication of DataFrame rows based on Date

I have a Dataframe which has the following structure and data
Source:
Column1(String), Column2(String), Date
-----------------------
1, 2, 01/01/2021
A, B, 05/01/2021
M, N, 10/01/2021
I want to transform it to the following (the first two columns are replicated and the date is incremented up to the day before the next row's date), as shown below:
Column1(String), Column2(String), Date
-----------------------
1, 2, 01/01/2021
1, 2, 02/01/2021
1, 2, 03/01/2021
1, 2, 04/01/2021
A, B, 05/01/2021
A, B, 06/01/2021
A, B, 07/01/2021
A, B, 08/01/2021
A, B, 09/01/2021
M, N, 10/01/2021
Any idea on how this can be achieved in Scala Spark?

Here is the working solution:
val dfp1 = List(("1001", 11, "01/10/2021"), ("1002", 21, "05/10/2021"), ("1001", 12, "10/10/2021"), ("1002", 22, "15/10/2021")).toDF("SerialNumber","SomeVal", "Date")
val dfProducts = dfp1.withColumn("Date", to_date($"Date","dd/MM/yyyy"))
dfProducts.show
+------------+-------+----------+
|SerialNumber|SomeVal| Date|
+------------+-------+----------+
| 1001| 11|2021-10-01|
| 1002| 21|2021-10-05|
| 1001| 12|2021-10-10|
| 1002| 22|2021-10-15|
+------------+-------+----------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// For each serial number, NextSerialDate is the date of the next record (null for the last one)
val overColumns = Window.partitionBy("SerialNumber").orderBy("Date").rowsBetween(1, Window.unboundedFollowing)
val dfProduct1 = dfProducts.withColumn("NextSerialDate", first("Date", true).over(overColumns)).orderBy("Date")
dfProduct1.show
+------------+-------+----------+--------------+
|SerialNumber|SomeVal| Date|NextSerialDate|
+------------+-------+----------+--------------+
| 1001| 11|2021-10-01| 2021-10-10|
| 1002| 21|2021-10-05| 2021-10-15|
| 1001| 12|2021-10-10| null|
| 1002| 22|2021-10-15| null|
+------------+-------+----------+--------------+
// For the last record of each serial, keep its own date; otherwise end the range the day before the next record
val dfProduct2 = dfProduct1.withColumn("NextSerialDate", when(col("NextSerialDate").isNull, col("Date")).otherwise(date_sub(col("NextSerialDate"), 1))).orderBy("SerialNumber")
dfProduct2.show
+------------+-------+----------+--------------+
|SerialNumber|SomeVal| Date|NextSerialDate|
+------------+-------+----------+--------------+
| 1001| 11|2021-10-01| 2021-10-09|
| 1001| 12|2021-10-10| 2021-10-10|
| 1002| 21|2021-10-05| 2021-10-14|
| 1002| 22|2021-10-15| 2021-10-15|
+------------+-------+----------+--------------+
// Expand each row into one row per date in the range [Date, NextSerialDate]
val dfProduct3 = dfProduct2.withColumn("ExpandedDate", explode_outer(sequence($"Date", $"NextSerialDate")))
dfProduct3.show
+------------+-------+----------+--------------+------------+
|SerialNumber|SomeVal| Date|NextSerialDate|ExpandedDate|
+------------+-------+----------+--------------+------------+
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-01|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-02|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-03|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-04|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-05|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-06|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-07|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-08|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-09|
| 1001| 12|2021-10-10| 2021-10-10| 2021-10-10|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-05|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-06|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-07|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-08|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-09|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-10|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-11|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-12|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-13|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-14|
+------------+-------+----------+--------------+------------+
only showing top 20 rows
// Replace the original Date with the expanded date and drop the helper columns
val dfProduct4 = dfProduct3.drop("Date", "NextSerialDate").withColumn("Date", col("ExpandedDate")).drop("ExpandedDate")
dfProduct4.show(50, false)
+------------+-------+----------+
|SerialNumber|SomeVal|Date |
+------------+-------+----------+
|1001 |11 |2021-10-01|
|1001 |11 |2021-10-02|
|1001 |11 |2021-10-03|
|1001 |11 |2021-10-04|
|1001 |11 |2021-10-05|
|1001 |11 |2021-10-06|
|1001 |11 |2021-10-07|
|1001 |11 |2021-10-08|
|1001 |11 |2021-10-09|
|1001 |12 |2021-10-10|
|1002 |21 |2021-10-05|
|1002 |21 |2021-10-06|
|1002 |21 |2021-10-07|
|1002 |21 |2021-10-08|
|1002 |21 |2021-10-09|
|1002 |21 |2021-10-10|
|1002 |21 |2021-10-11|
|1002 |21 |2021-10-12|
|1002 |21 |2021-10-13|
|1002 |21 |2021-10-14|
|1002 |22 |2021-10-15|
+------------+-------+----------+
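For reference, the steps above can be collapsed into a single pass by using lead to look up the next date per serial number instead of first over an unbounded following window. This is only a sketch against the same dfProducts as above (sequence requires Spark 2.4+):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sketch: next date per serial via lead(); the last row of each serial keeps its own date
val w = Window.partitionBy("SerialNumber").orderBy("Date")
val dfExpanded = dfProducts
  .withColumn("EndDate", coalesce(date_sub(lead("Date", 1).over(w), 1), col("Date")))
  .withColumn("Date", explode(sequence(col("Date"), col("EndDate"))))
  .drop("EndDate")
dfExpanded.orderBy("SerialNumber", "Date").show(50, false)
This should yield the same rows as dfProduct4 above.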

Related

Append new rows to a Spark dataframe based on a condition

I need some help resolving this tricky transformation.
My Spark dataframe looks like this:
+---+---+--------+---------+-------+--------+---------+
| A| B|rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
+---+---+--------+---------+-------+--------+---------+
|123| 1| 500| 10| 2| 1001| false|
|123| 1| 501| 10| 2| 1001| false|
|123| 4| 502| 60| 5| 1001| true|
|123| 2| 504| 40| 30| 1003| false|
|123| 5| 510| 10| 10| 1003| true|
+---+---+--------+---------+-------+--------+---------+
I need to add new records (rows) for each combination of A and B having edit_flag as false, whenever its item_nbr matches another A and B combination having edit_flag as true.
The new row will have every column copied from its parent row except rcv_qty and rcvr_nbr, which come from the matching true row. So, the final output will look like:
+---+---+--------+---------+-------+--------+---------+
| A| B|rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
+---+---+--------+---------+-------+--------+---------+
|123| 1| 500| 10| 2| 1001| false|
|123| 1| 501| 10| 2| 1001| false|
|123| 1| 502| 10| 5| 1001| false|
|123| 4| 502| 60| 5| 1001| true|
|123| 2| 504| 40| 30| 1003| false|
|123| 2| 510| 40| 10| 1003| false|
|123| 5| 510| 10| 10| 1003| true|
+---+---+--------+---------+-------+--------+---------+
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import spark.implicits._
case class Source(
A: Int,
B: Int,
rcvr_nbr: Int,
order_qty: Int,
rcv_qty: Int,
item_nbr: Int,
edit_flag: Boolean
)
val sourceDF = Seq(
Source(123, 1, 500, 10, 2, 1001, false),
Source(123, 1, 501, 10, 2, 1001, false),
Source(123, 4, 502, 60, 5, 1001, true),
Source(123, 2, 504, 40, 30, 1003, false),
Source(123, 5, 510, 10, 10, 1003, true)
).toDF()
sourceDF.printSchema()
// root
// |-- A: integer (nullable = false)
// |-- B: integer (nullable = false)
// |-- rcvr_nbr: integer (nullable = false)
// |-- order_qty: integer (nullable = false)
// |-- rcv_qty: integer (nullable = false)
// |-- item_nbr: integer (nullable = false)
// |-- edit_flag: boolean (nullable = false)
sourceDF.show(false)
// +---+---+--------+---------+-------+--------+---------+
// |A |B |rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
// +---+---+--------+---------+-------+--------+---------+
// |123|1 |500 |10 |2 |1001 |false |
// |123|1 |501 |10 |2 |1001 |false |
// |123|4 |502 |60 |5 |1001 |true |
// |123|2 |504 |40 |30 |1003 |false |
// |123|5 |510 |10 |10 |1003 |true |
// +---+---+--------+---------+-------+--------+---------+
// Rows with edit_flag = true, with every column renamed (suffix _true) so they can be joined against the false rows
val sourceDFTrueF = sourceDF.filter(col("edit_flag").equalTo(true))
val sourceDFTrue = sourceDFTrueF.columns.foldLeft(sourceDFTrueF) { (tmpDF, col) =>
  tmpDF.withColumnRenamed(col, s"${col}_true")
}
// One representative false row per item_nbr, used as the template for the new rows
val sourceDFFalse = sourceDF
  .filter(col("edit_flag").equalTo(false))
  .dropDuplicates("item_nbr")
// Build the new rows by joining the false templates with the true rows, then union with the original data
val resDF =
sourceDFFalse
.join(
sourceDFTrue,
sourceDFFalse.col("item_nbr") === sourceDFTrue.col("item_nbr_true"),
"inner"
)
.select(
sourceDFFalse.col("A"),
sourceDFFalse.col("B"),
sourceDFTrue.col("rcvr_nbr_true").alias("rcvr_nbr"),
sourceDFFalse.col("order_qty"),
sourceDFTrue.col("rcv_qty_true").alias("rcv_qty"),
sourceDFFalse.col("item_nbr"),
sourceDFFalse.col("edit_flag")
)
.union(sourceDF)
.orderBy(col("A"), col("item_nbr"), col("edit_flag"))
resDF.show(false)
// +---+---+--------+---------+-------+--------+---------+
// |A |B |rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
// +---+---+--------+---------+-------+--------+---------+
// |123|1 |501 |10 |2 |1001 |false |
// |123|1 |500 |10 |2 |1001 |false |
// |123|1 |502 |10 |5 |1001 |false |
// |123|4 |502 |60 |5 |1001 |true |
// |123|2 |504 |40 |30 |1003 |false |
// |123|2 |510 |40 |10 |1003 |false |
// |123|5 |510 |10 |10 |1003 |true |
// +---+---+--------+---------+-------+--------+---------+
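As a design note, the column-renaming foldLeft can be avoided by aliasing both sides of the self-join and qualifying the columns. A minimal sketch against the same sourceDF (the alias names t and f are just illustrative):
// Sketch: alias-based self-join instead of renaming every column
val trueRows = sourceDF.filter(col("edit_flag") === true).alias("t")
val falseTemplates = sourceDF
  .filter(col("edit_flag") === false)
  .dropDuplicates("item_nbr")
  .alias("f")

val newRows = falseTemplates
  .join(trueRows, col("f.item_nbr") === col("t.item_nbr"), "inner")
  .select(
    col("f.A"), col("f.B"),
    col("t.rcvr_nbr").alias("rcvr_nbr"),
    col("f.order_qty"),
    col("t.rcv_qty").alias("rcv_qty"),
    col("f.item_nbr"), col("f.edit_flag"))

newRows.union(sourceDF).orderBy(col("A"), col("item_nbr"), col("edit_flag")).show(false)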

Spark: how to do aggregation operations on string array in dataframe

I want to group by some columns and do aggregation operations like count, count_distinct or nunique.
For example,
# the samples values in `date` column are all unique
df.show(7)
+--------------------+---------------------------------+-------------------+---------+
| category| tags| datetime| date|
+--------------------+---------------------------------+-------------------+---------+
| null| ,industry,display,Merchants|2018-01-08 14:30:32| 20200704|
| social,smart| smart,swallow,game,Experience|2019-06-17 04:34:51| 20200705|
| ,beauty,social| social,picture,social|2017-08-19 09:01:37| 20200706|
| default| default,game,us,adventure|2019-10-02 14:18:56| 20200707|
|financial management|financial management,loan,product|2018-07-17 02:07:39| 20200708|
| system| system,font,application,setting|2015-07-18 00:45:57| 20200709|
| null| ,system,profile,optimization|2018-09-07 19:59:03| 20200710|
df.printSchema()
root
|-- category: string (nullable = true)
|-- tags: string (nullable = true)
|-- datetime: string (nullable = true)
|-- date: string (nullable = true)
# I want to do some group aggregations by PySpark like follows in pandas
group_date_tags_cnt_df = df.groupby('date')['tags'].count()
group_date_tags_nunique_df = df.groupby('date')['tags'].nunique()
group_date_category_cnt_df = df.groupby('date')['category'].count()
group_date_category_nunique_df = df.groupby('date')['category'].nunique()
# expected output here
# AND all results should ignore empty entries from the split (the leading ',') and `null` values in the aggregation operations
group_date_tags_cnt_df.show(4)
+---------+---------+
| date| count|
+---------+---------+
| 20200704| 3|
| 20200705| 4|
| 20200706| 3|
| 20200707| 4|
group_date_tags_nunique_df.show(4)
+---------+---------------------------------+
| date| count(DISTINCT tag)|
+---------+---------------------------------+
| 20200704| 3|
| 20200705| 4|
| 20200706| 3|
| 20200707| 4|
# It should ignore `null` here
group_date_category_cnt_df.show(4)
+---------+---------+
| date| count|
+---------+---------+
| 20200704| 0|
| 20200705| 2|
| 20200706| 2|
| 20200707| 1|
group_date_category_nunique_df.show(4)
+---------+----------------------------+
| date| count(DISTINCT category)|
+---------+----------------------------+
| 20200704| 1|
| 20200705| 2|
| 20200706| 2|
| 20200707| 1|
But the tags and category columns are of string type here.
So I think I should split them first and then do the group aggregations based on the result.
But I am not sure how to implement it.
Could anyone help me?
case class d(
category: Option[String],
tags: String,
datetime: String,
date: String
)
val sourceDF = Seq(
d(None, ",industry,display,Merchants", "2018-01-08 14:30:32", "20200704"),
d(Some("social,smart"), "smart,swallow,game,Experience", "2019-06-17 04:34:51", "20200704"),
d(Some(",beauty,social"), "social,picture,social", "2017-08-19 09:01:37", "20200704")
).toDF("category", "tags", "datetime", "date")
// Split the comma-separated strings into arrays
val df1 = sourceDF.withColumn("category", split('category, ","))
  .withColumn("tags", split('tags, ","))
// Explode category, turning a null category into a single null entry so the row is kept
val df2 = df1.select('datetime, 'date, 'tags,
  explode(
    when(col("category").isNotNull, col("category"))
      .otherwise(array(lit(null).cast("string")))).alias("category")
)
// Explode tags in the same way
val df3 = df2.select('category, 'datetime, 'date,
  explode(
    when(col("tags").isNotNull, col("tags"))
      .otherwise(array(lit(null).cast("string")))).alias("tags")
)
val resDF = df3.select('category, 'tags, 'datetime, 'date)
resDF.show
// +--------+----------+-------------------+--------+
// |category| tags| datetime| date|
// +--------+----------+-------------------+--------+
// | null| |2018-01-08 14:30:32|20200704|
// | null| industry|2018-01-08 14:30:32|20200704|
// | null| display|2018-01-08 14:30:32|20200704|
// | null| Merchants|2018-01-08 14:30:32|20200704|
// | social| smart|2019-06-17 04:34:51|20200704|
// | social| swallow|2019-06-17 04:34:51|20200704|
// | social| game|2019-06-17 04:34:51|20200704|
// | social|Experience|2019-06-17 04:34:51|20200704|
// | smart| smart|2019-06-17 04:34:51|20200704|
// | smart| swallow|2019-06-17 04:34:51|20200704|
// | smart| game|2019-06-17 04:34:51|20200704|
// | smart|Experience|2019-06-17 04:34:51|20200704|
// | | social|2017-08-19 09:01:37|20200704|
// | | picture|2017-08-19 09:01:37|20200704|
// | | social|2017-08-19 09:01:37|20200704|
// | beauty| social|2017-08-19 09:01:37|20200704|
// | beauty| picture|2017-08-19 09:01:37|20200704|
// | beauty| social|2017-08-19 09:01:37|20200704|
// | social| social|2017-08-19 09:01:37|20200704|
// | social| picture|2017-08-19 09:01:37|20200704|
// +--------+----------+-------------------+--------+
val group1DF = resDF.groupBy('date, 'category).count()
group1DF.show
// +--------+--------+-----+
// | date|category|count|
// +--------+--------+-----+
// |20200704| social| 7|
// |20200704| | 3|
// |20200704| smart| 4|
// |20200704| beauty| 3|
// |20200704| null| 4|
// +--------+--------+-----+
val group2DF = resDF.groupBy('datetime, 'category).count()
group2DF.show
// +-------------------+--------+-----+
// | datetime|category|count|
// +-------------------+--------+-----+
// |2017-08-19 09:01:37| social| 3|
// |2017-08-19 09:01:37| beauty| 3|
// |2019-06-17 04:34:51| smart| 4|
// |2019-06-17 04:34:51| social| 4|
// |2018-01-08 14:30:32| null| 4|
// |2017-08-19 09:01:37| | 3|
// +-------------------+--------+-----+
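To get the count / nunique figures the question actually asks for, the exploded data can be aggregated with count and countDistinct. A sketch based on the same sourceDF; it explodes each column separately so the category x tags cross product in resDF does not inflate the tag counts, and it filters out the empty strings produced by leading commas:
import org.apache.spark.sql.functions._

// Sketch: per-date counts and distinct counts, ignoring empty entries and nulls
val tagsExploded = sourceDF
  .select(col("date"), explode(split(col("tags"), ",")).alias("tag"))
  .filter(col("tag") =!= "")
val tagsCnt     = tagsExploded.groupBy("date").agg(count("tag").alias("count"))
val tagsNunique = tagsExploded.groupBy("date").agg(countDistinct("tag").alias("tags_nunique"))

// explode() drops rows whose split() result is null, so null categories are ignored automatically
val catExploded = sourceDF
  .select(col("date"), explode(split(col("category"), ",")).alias("category"))
  .filter(col("category") =!= "")
val catCnt     = catExploded.groupBy("date").agg(count("category").alias("count"))
val catNunique = catExploded.groupBy("date").agg(countDistinct("category").alias("category_nunique"))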
Here is PySpark code which solves your problem; I have used data for three dates: 20200702, 20200704 and 20200705.
from pyspark.sql import Row
from pyspark.sql.functions import *
drow = Row("category","tags","datetime","date")
data = [drow("", ",industry,display,Merchants", "2018-01-08 14:30:32", "20200704"),
        drow("social,smart", "smart,swallow,game,Experience", "2019-06-17 04:34:51", "20200702"),
        drow(",beauty,social", "social,picture,social", "2017-08-19 09:01:37", "20200705")]
df = spark.createDataFrame(data)
final_df = (df
    .withColumn("category", split(df['category'], ","))
    .withColumn("tags", split(df['tags'], ","))
    .select('datetime', 'date', 'tags',
            explode(when(col("category").isNotNull(), col("category"))
                    .otherwise(array(lit("").cast("string")))).alias("category"))
    .select('datetime', 'date', 'category',
            explode(when(col("tags").isNotNull(), col("tags"))
                    .otherwise(array(lit("").cast("string")))).alias("tags")))
final_df.show()
'''
+-------------------+--------+--------+----------+
| datetime| date|category| tags|
+-------------------+--------+--------+----------+
|2018-01-08 14:30:32|20200704| | |
|2018-01-08 14:30:32|20200704| | industry|
|2018-01-08 14:30:32|20200704| | display|
|2018-01-08 14:30:32|20200704| | Merchants|
|2019-06-17 04:34:51|20200702| social| smart|
|2019-06-17 04:34:51|20200702| social| swallow|
|2019-06-17 04:34:51|20200702| social| game|
|2019-06-17 04:34:51|20200702| social|Experience|
|2019-06-17 04:34:51|20200702| smart| smart|
|2019-06-17 04:34:51|20200702| smart| swallow|
|2019-06-17 04:34:51|20200702| smart| game|
|2019-06-17 04:34:51|20200702| smart|Experience|
|2017-08-19 09:01:37|20200705| | social|
|2017-08-19 09:01:37|20200705| | picture|
|2017-08-19 09:01:37|20200705| | social|
|2017-08-19 09:01:37|20200705| beauty| social|
|2017-08-19 09:01:37|20200705| beauty| picture|
|2017-08-19 09:01:37|20200705| beauty| social|
|2017-08-19 09:01:37|20200705| social| social|
|2017-08-19 09:01:37|20200705| social| picture|
+-------------------+--------+--------+----------+
only showing top 20 rows'''
final_df.groupBy('date','tags').count().show()
'''
+--------+----------+-----+
| date| tags|count|
+--------+----------+-----+
|20200702| smart| 2|
|20200705| picture| 3|
|20200702| swallow| 2|
|20200704| industry| 1|
|20200704| display| 1|
|20200702| game| 2|
|20200704| | 1|
|20200704| Merchants| 1|
|20200702|Experience| 2|
|20200705| social| 6|
+--------+----------+-----+
'''
final_df.groupBy('date','category').count().show()
'''
+--------+--------+-----+
| date|category|count|
+--------+--------+-----+
|20200702| smart| 4|
|20200702| social| 4|
|20200705| | 3|
|20200705| beauty| 3|
|20200704| | 4|
|20200705| social| 3|
+--------+--------+-----+
'''

Replace empty array with null in Spark DataFrame

Consider a dataframe like the following:
+---+----+--------+----+
| c1| c2| c3| c4|
+---+----+--------+----+
| x| n1| [m1]| []|
| y| n3|[m2, m3]|[z3]|
| x| n2| []| []|
+---+----+--------+----+
I want to replace empty array with null.
+---+----+--------+----+
| c1| c2| c3| c4|
+---+----+--------+----+
| x| n1| [m1]|null|
| y| n3|[m2, m3]|[z3]|
| x| n2| null|null|
+---+----+--------+----+
What is the efficient way to achieve the above goal?
You could check the array length and return null using the when...otherwise function:
val df = Seq(
("x", "n1", Seq("m1"), Seq()),
("y", "n3", Seq("m2", "m3"), Seq("z3")),
("x", "n2", Seq(), Seq())
).toDF("c1", "c2", "c3", "c4")
df.show
df.select($"c1", $"c2",
when(size($"c3") > 0, $"c3").otherwise(lit(null)) as "c3",
when(size($"c4") > 0, $"c4").otherwise(lit(null)) as "c4"
).show
It returns:
df: org.apache.spark.sql.DataFrame = [c1: string, c2: string ... 2 more fields]
+---+---+--------+----+
| c1| c2| c3| c4|
+---+---+--------+----+
| x| n1| [m1]| []|
| y| n3|[m2, m3]|[z3]|
| x| n2| []| []|
+---+---+--------+----+
+---+---+--------+----+
| c1| c2| c3| c4|
+---+---+--------+----+
| x| n1| [m1]|null|
| y| n3|[m2, m3]|[z3]|
| x| n2| null|null|
+---+---+--------+----+
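If there are many array columns, the same when/size pattern can be applied generically to every ArrayType column instead of listing them by hand; a minimal sketch, assuming the df defined above:
import org.apache.spark.sql.functions.{col, size, when}
import org.apache.spark.sql.types.ArrayType

// Sketch: when() without otherwise() yields null for empty arrays
val arrayCols = df.schema.fields.collect { case f if f.dataType.isInstanceOf[ArrayType] => f.name }
val dfNulled = arrayCols.foldLeft(df) { (tmp, c) =>
  tmp.withColumn(c, when(size(col(c)) > 0, col(c)))
}
dfNulled.show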

How do I pass parameters to selectExpr? SparkSQL-Scala

:)
When you have a data frame, you can add columns and fill their rows with the method selectExpr.
Something like this:
scala> table.show
+------+--------+---------+--------+--------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|
+------+--------+---------+--------+--------+
| OlcM| h|999999999| J| 0|
| zOcQ| r|777777777| J| 1|
| kyGp| t|333333333| J| 2|
| BEuX| A|999999999| F| 3|
scala> var table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", "'hola' as Saludo")
tabla: org.apache.spark.sql.DataFrame = [idempr: string, tipperrd: string, codperrd: decimal(9,0), tipperrt: string, codperrt: decimal(9,0), Saludo: string]
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hola|
| zOcQ| r|777777777| J| 1| hola|
| kyGp| t|333333333| J| 2| hola|
| BEuX| A|999999999| F| 3| hola|
My point is:
I define a string and call a method which uses this String parameter to fill a column in the data frame. But I am not able to make the select expression pick up the string (I tried $, +, etc.). I want to achieve something like this:
scala> var english = "hello"
scala> def generar_informe(df: DataFrame, tabla: String) {
var selectExpr_df = df.selectExpr(
"TIPPERSCON_BAS as TIP.PERSONA CONTACTABILIDAD",
"CODPERSCON_BAS as COD.PERSONA CONTACTABILIDAD",
"'tabla' as PUNTO DEL FLUJO" )
}
scala> generar_informe(df,english)
.....
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hello|
| zOcQ| r|777777777| J| 1| hello|
| kyGp| t|333333333| J| 2| hello|
| BEuX| A|999999999| F| 3| hello|
I tried:
scala> var result = tabl.selectExpr("A", "B", "$tabla as C")
scala> var abc = tabl.selectExpr("A", "B", ${tabla} as C)
<console>:31: error: not found: value $
var abc = tabl.selectExpr("A", "B", ${tabla} as C)
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
scala> sqlContext.sql("set tabla='hello'")
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
SAME ERROR:
java.lang.RuntimeException: [1.1] failure: identifier expected
${tabla} as C
^
at scala.sys.package$.error(package.scala:27)
Thanks in advance!
Can you try this?
import org.apache.spark.sql.DataFrame

def generar_informe(df: DataFrame, english: String) = {
  df.selectExpr(
    "transactionId", "customerId", "itemId", "amountPaid", s"'${english}' as saludo")
}

val english = "hello"
generar_informe(data, english).show()
This is the output I got.
17/11/02 23:56:44 INFO CodeGenerator: Code generated in 13.857987 ms
+-------------+----------+------+----------+------+
|transactionId|customerId|itemId|amountPaid|saludo|
+-------------+----------+------+----------+------+
| 111| 1| 1| 100.0| hello|
| 112| 2| 2| 505.0| hello|
| 113| 3| 3| 510.0| hello|
| 114| 4| 4| 600.0| hello|
| 115| 1| 2| 500.0| hello|
| 116| 1| 2| 500.0| hello|
| 117| 1| 2| 500.0| hello|
| 118| 1| 2| 500.0| hello|
| 119| 2| 3| 500.0| hello|
| 120| 1| 2| 500.0| hello|
| 121| 1| 4| 500.0| hello|
| 122| 1| 2| 500.0| hello|
| 123| 1| 4| 500.0| hello|
| 124| 1| 2| 500.0| hello|
+-------------+----------+------+----------+------+
17/11/02 23:56:44 INFO SparkContext: Invoking stop() from shutdown hook
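Applied to the original generar_informe, the same string interpolation works; note that in selectExpr an alias containing spaces or dots has to be wrapped in backticks. A sketch reusing the column names from the question (TIPPERSCON_BAS, CODPERSCON_BAS):
import org.apache.spark.sql.DataFrame

// Sketch: interpolate the `tabla` parameter into a literal column;
// backticks quote aliases that contain spaces or dots
def generar_informe(df: DataFrame, tabla: String): DataFrame =
  df.selectExpr(
    "TIPPERSCON_BAS as `TIP.PERSONA CONTACTABILIDAD`",
    "CODPERSCON_BAS as `COD.PERSONA CONTACTABILIDAD`",
    s"'$tabla' as `PUNTO DEL FLUJO`")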

Find the date of each week from the week number in Spark Dataframe

I want to add a column with the date of each corresponding week in the Dataframe (the Friday of each week).
My Dataframe looks like this:
+----+------+---------+
|Week| City|sum(Sale)|
+----+------+---------+
| 29|City 2| 72|
| 28|City 3| 48|
| 28|City 2| 19|
| 27|City 2| 16|
| 28|City 1| 84|
| 28|City 4| 72|
| 29|City 4| 39|
| 27|City 3| 42|
| 26|City 3| 68|
| 27|City 1| 89|
| 27|City 4| 104|
| 26|City 2| 19|
| 29|City 3| 27|
+----+------+---------+
I need to convert it to the dataframe below:
+----+------+---------+---------------------------+
|Week|  City|sum(Sale)|particular day (MM/dd/yyyy)|
+----+------+---------+---------------------------+
| 29|City 2| 72|Friday(07/21/2017)|
| 28|City 3| 48|Friday(07/14/2017)|
| 28|City 2| 19|Friday(07/14/2017)|
| 27|City 2| 16|Friday(07/07/2017)|
| 28|City 1| 84|Friday(07/14/2017)|
| 28|City 4| 72|Friday(07/14/2017)|
| 29|City 4| 39|Friday(07/21/2017)|
| 27|City 3| 42|Friday(07/07/2017)|
| 26|City 3| 68|Friday(06/30/2017)|
| 27|City 1| 89|Friday(07/07/2017)|
| 27|City 4| 104|Friday(07/07/2017)|
| 26|City 2| 19|Friday(06/30/2017)|
| 29|City 3| 27|Friday(07/21/2017)|
+----+------+---------+---------------------------+
please help me
You can write a simple UDF that gets the date by adding the week number to a base date.
Here is a simple example:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
(29,"City 2", 72),
(28,"City 3", 48),
(28,"City 2", 19),
(27,"City 2", 16),
(28,"City 1", 84),
(28,"City 4", 72),
(29,"City 4", 39),
(27,"City 3", 42),
(26,"City 3", 68),
(27,"City 1", 89),
(27,"City 4", 104),
(26,"City 2", 19),
(29,"City 3", 27)
)).toDF("week", "city", "sale")
val getDateFromWeek = udf((week : Int) => {
//create a default date for week 1
val week1 = LocalDate.of(2016, 12, 30)
val day = "Friday"
//add week from the week column
val result = week1.plusWeeks(week).format(DateTimeFormatter.ofPattern("MM/dd/yyyy"))
//return result as Friday (date)
s"${day} (${result})"
})
//use the udf and create a new column named day
data.withColumn("day", getDateFromWeek($"week")).show
can anyone convert this to Pyspark?
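For reference, a sketch that avoids the UDF by using only built-in date functions (which also makes a PySpark translation almost line-for-line, since the same functions exist there). Like the UDF above, it assumes the Friday of week N is 2016-12-30 plus N weeks:
import org.apache.spark.sql.functions._

// Sketch: built-in date arithmetic instead of a UDF
val withDay = data.withColumn("day",
  concat(
    lit("Friday ("),
    date_format(expr("date_add(to_date('2016-12-30'), week * 7)"), "MM/dd/yyyy"),
    lit(")")))
withDay.show(false)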