Spark: how to do aggregation operations on a string array in a dataframe

I want to group by some columns and do aggregation operations like count, count_distinct or nunique.
For example,
# the sample values in the `date` column are all unique
df.show(7)
+--------------------+---------------------------------+-------------------+---------+
| category| tags| datetime| date|
+--------------------+---------------------------------+-------------------+---------+
| null| ,industry,display,Merchants|2018-01-08 14:30:32| 20200704|
| social,smart| smart,swallow,game,Experience|2019-06-17 04:34:51| 20200705|
| ,beauty,social| social,picture,social|2017-08-19 09:01:37| 20200706|
| default| default,game,us,adventure|2019-10-02 14:18:56| 20200707|
|financial management|financial management,loan,product|2018-07-17 02:07:39| 20200708|
| system| system,font,application,setting|2015-07-18 00:45:57| 20200709|
| null| ,system,profile,optimization|2018-09-07 19:59:03| 20200710|
df.printSchema()
root
|-- category: string (nullable = true)
|-- tags: string (nullable = true)
|-- datetime: string (nullable = true)
|-- date: string (nullable = true)
# I want to do some group aggregations in PySpark, like the following in pandas
group_date_tags_cnt_df = df.groupby('date')['tags'].count()
group_date_tags_nunique_df = df.groupby('date')['tags'].nunique()
group_date_category_cnt_df = df.groupby('date')['category'].count()
group_date_category_nunique_df = df.groupby('date')['category'].nunique()
# expected output here
# AND all results should ignore the empty strings produced by the leading ',' after splitting, and `null` values, in the aggregation operations
group_date_tags_cnt_df.show(4)
+---------+---------+
| date| count|
+---------+---------+
| 20200704| 3|
| 20200705| 4|
| 20200706| 3|
| 20200707| 4|
group_date_tags_nunique_df.show(4)
+---------+---------------------------------+
| date| count(DISTINCT tag)|
+---------+---------------------------------+
| 20200704| 3|
| 20200705| 4|
| 20200706| 3|
| 20200707| 4|
# It should ignore `null` here
group_date_category_cnt_df.show(4)
+---------+---------+
| date| count|
+---------+---------+
| 20200704| 0|
| 20200705| 2|
| 20200706| 2|
| 20200707| 1|
group_date_category_nunique_df.show(4)
+---------+----------------------------+
| date| count(DISTINCT category)|
+---------+----------------------------+
| 20200704| 1|
| 20200705| 2|
| 20200706| 2|
| 20200707| 1|
But the tags and category columns are string type here.
So I think I should split them first and then do the group aggregation operations on the result.
But I am not sure how to implement it.
Could anyone help me?

import org.apache.spark.sql.functions._
import spark.implicits._

case class d(
  category: Option[String],
  tags: String,
  datetime: String,
  date: String
)
val sourceDF = Seq(
  d(None, ",industry,display,Merchants", "2018-01-08 14:30:32", "20200704"),
  d(Some("social,smart"), "smart,swallow,game,Experience", "2019-06-17 04:34:51", "20200704"),
  d(Some(",beauty,social"), "social,picture,social", "2017-08-19 09:01:37", "20200704")
).toDF("category", "tags", "datetime", "date")

val df1 = sourceDF
  .withColumn("category", split('category, ","))
  .withColumn("tags", split('tags, ","))

val df2 = df1.select('datetime, 'date, 'tags,
  explode(
    when(col("category").isNotNull, col("category"))
      .otherwise(array(lit(null).cast("string")))).alias("category")
)

val df3 = df2.select('category, 'datetime, 'date,
  explode(
    when(col("tags").isNotNull, col("tags"))
      .otherwise(array(lit(null).cast("string")))).alias("tags")
)

val resDF = df3.select('category, 'tags, 'datetime, 'date)
resDF.show
// +--------+----------+-------------------+--------+
// |category| tags| datetime| date|
// +--------+----------+-------------------+--------+
// | null| |2018-01-08 14:30:32|20200704|
// | null| industry|2018-01-08 14:30:32|20200704|
// | null| display|2018-01-08 14:30:32|20200704|
// | null| Merchants|2018-01-08 14:30:32|20200704|
// | social| smart|2019-06-17 04:34:51|20200704|
// | social| swallow|2019-06-17 04:34:51|20200704|
// | social| game|2019-06-17 04:34:51|20200704|
// | social|Experience|2019-06-17 04:34:51|20200704|
// | smart| smart|2019-06-17 04:34:51|20200704|
// | smart| swallow|2019-06-17 04:34:51|20200704|
// | smart| game|2019-06-17 04:34:51|20200704|
// | smart|Experience|2019-06-17 04:34:51|20200704|
// | | social|2017-08-19 09:01:37|20200704|
// | | picture|2017-08-19 09:01:37|20200704|
// | | social|2017-08-19 09:01:37|20200704|
// | beauty| social|2017-08-19 09:01:37|20200704|
// | beauty| picture|2017-08-19 09:01:37|20200704|
// | beauty| social|2017-08-19 09:01:37|20200704|
// | social| social|2017-08-19 09:01:37|20200704|
// | social| picture|2017-08-19 09:01:37|20200704|
// +--------+----------+-------------------+--------+
val group1DF = resDF.groupBy('date, 'category).count()
group1DF.show
// +--------+--------+-----+
// | date|category|count|
// +--------+--------+-----+
// |20200704| social| 7|
// |20200704| | 3|
// |20200704| smart| 4|
// |20200704| beauty| 3|
// |20200704| null| 4|
// +--------+--------+-----+
val group2DF = resDF.groupBy('datetime, 'category).count()
group2DF.show
// +-------------------+--------+-----+
// | datetime|category|count|
// +-------------------+--------+-----+
// |2017-08-19 09:01:37| social| 3|
// |2017-08-19 09:01:37| beauty| 3|
// |2019-06-17 04:34:51| smart| 4|
// |2019-06-17 04:34:51| social| 4|
// |2018-01-08 14:30:32| null| 4|
// |2017-08-19 09:01:37| | 3|
// +-------------------+--------+-----+

PySpark code that solves your problem; I have used data for the 3 dates 20200702, 20200704 and 20200705.
from pyspark.sql import Row
from pyspark.sql.functions import *
drow = Row("category","tags","datetime","date")
data = [
    drow("", ",industry,display,Merchants", "2018-01-08 14:30:32", "20200704"),
    drow("social,smart", "smart,swallow,game,Experience", "2019-06-17 04:34:51", "20200702"),
    drow(",beauty,social", "social,picture,social", "2017-08-19 09:01:37", "20200705"),
]
df = spark.createDataFrame(data)
final_df = (df
    .withColumn("category", split(df['category'], ","))
    .withColumn("tags", split(df['tags'], ","))
    .select('datetime', 'date', 'tags',
            explode(when(col("category").isNotNull(), col("category"))
                    .otherwise(array(lit("").cast("string")))).alias("category"))
    .select('datetime', 'date', 'category',
            explode(when(col("tags").isNotNull(), col("tags"))
                    .otherwise(array(lit("").cast("string")))).alias("tags")))
final_df.show()
'''
+-------------------+--------+--------+----------+
| datetime| date|category| tags|
+-------------------+--------+--------+----------+
|2018-01-08 14:30:32|20200704| | |
|2018-01-08 14:30:32|20200704| | industry|
|2018-01-08 14:30:32|20200704| | display|
|2018-01-08 14:30:32|20200704| | Merchants|
|2019-06-17 04:34:51|20200702| social| smart|
|2019-06-17 04:34:51|20200702| social| swallow|
|2019-06-17 04:34:51|20200702| social| game|
|2019-06-17 04:34:51|20200702| social|Experience|
|2019-06-17 04:34:51|20200702| smart| smart|
|2019-06-17 04:34:51|20200702| smart| swallow|
|2019-06-17 04:34:51|20200702| smart| game|
|2019-06-17 04:34:51|20200702| smart|Experience|
|2017-08-19 09:01:37|20200705| | social|
|2017-08-19 09:01:37|20200705| | picture|
|2017-08-19 09:01:37|20200705| | social|
|2017-08-19 09:01:37|20200705| beauty| social|
|2017-08-19 09:01:37|20200705| beauty| picture|
|2017-08-19 09:01:37|20200705| beauty| social|
|2017-08-19 09:01:37|20200705| social| social|
|2017-08-19 09:01:37|20200705| social| picture|
+-------------------+--------+--------+----------+
only showing top 20 rows'''
final_df.groupBy('date','tags').count().show()
'''
+--------+----------+-----+
| date| tags|count|
+--------+----------+-----+
|20200702| smart| 2|
|20200705| picture| 3|
|20200702| swallow| 2|
|20200704| industry| 1|
|20200704| display| 1|
|20200702| game| 2|
|20200704| | 1|
|20200704| Merchants| 1|
|20200702|Experience| 2|
|20200705| social| 6|
+--------+----------+-----+
'''
final_df.groupBy('date','category').count().show()
'''
+--------+--------+-----+
| date|category|count|
+--------+--------+-----+
|20200702| smart| 4|
|20200702| social| 4|
|20200705| | 3|
|20200705| beauty| 3|
|20200704| | 4|
|20200705| social| 3|
+--------+--------+-----+
'''
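The question also asks for distinct counts (the pandas nunique equivalent) and wants the empty-string artifacts from the leading commas ignored, which the counts above still include. A minimal follow-up sketch on top of final_df, assuming the same session and imports (not part of the original answer):
from pyspark.sql import functions as F

# drop the empty strings that come from the leading commas and the missing-category placeholder
tags_df = final_df.filter(F.col("tags") != "")
cat_df = final_df.filter(F.col("category") != "")

# count and countDistinct per date, mirroring pandas count() / nunique()
tags_df.groupBy("date").agg(F.count("tags").alias("count")).show()
tags_df.groupBy("date").agg(F.countDistinct("tags").alias("count(DISTINCT tag)")).show()
cat_df.groupBy("date").agg(F.count("category").alias("count")).show()
cat_df.groupBy("date").agg(F.countDistinct("category").alias("count(DISTINCT category)")).show()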

Related

Append new rows to a Spark dataframe based on a condition

I need some help resolving this tricky transformation.
My Spark dataframe looks like this:
+---+---+--------+---------+-------+--------+---------+
| A| B|rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
+---+---+--------+---------+-------+--------+---------+
|123| 1| 500| 10| 2| 1001| false|
|123| 1| 501| 10| 2| 1001| false|
|123| 4| 502| 60| 5| 1001| true|
|123| 2| 504| 40| 30| 1003| false|
|123| 5| 510| 10| 10| 1003| true|
+---+---+--------+---------+-------+--------+---------+
I need to add new records (rows) for a combination of A and B (having edit_flag as false) if the item_nbr matches another A and B combination having edit_flag as true.
The new row will have every column copied from its parent row except rcv_qty and rcvr_nbr. So the final output will look like:
+---+---+--------+---------+-------+--------+---------+
| A| B|rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
+---+---+--------+---------+-------+--------+---------+
|123| 1| 500| 10| 2| 1001| false|
|123| 1| 501| 10| 2| 1001| false|
|123| 1| 502| 10| 5| 1001| false|
|123| 4| 502| 60| 5| 1001| true|
|123| 2| 504| 40| 30| 1003| false|
|123| 2| 510| 40| 10| 1003| false|
|123| 5| 510| 10| 10| 1003| true|
+---+---+--------+---------+-------+--------+---------+
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import spark.implicits._
case class Source(
  A: Int,
  B: Int,
  rcvr_nbr: Int,
  order_qty: Int,
  rcv_qty: Int,
  item_nbr: Int,
  edit_flag: Boolean
)
val sourceDF = Seq(
Source(123, 1, 500, 10, 2, 1001, false),
Source(123, 1, 501, 10, 2, 1001, false),
Source(123, 4, 502, 60, 5, 1001, true),
Source(123, 2, 504, 40, 30, 1003, false),
Source(123, 5, 510, 10, 10, 1003, true)
).toDF()
sourceDF.printSchema()
// root
// |-- A: integer (nullable = false)
// |-- B: integer (nullable = false)
// |-- rcvr_nbr: integer (nullable = false)
// |-- order_qty: integer (nullable = false)
// |-- rcv_qty: integer (nullable = false)
// |-- item_nbr: integer (nullable = false)
// |-- edit_flag: boolean (nullable = false)
sourceDF.show(false)
// +---+---+--------+---------+-------+--------+---------+
// |A |B |rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
// +---+---+--------+---------+-------+--------+---------+
// |123|1 |500 |10 |2 |1001 |false |
// |123|1 |501 |10 |2 |1001 |false |
// |123|4 |502 |60 |5 |1001 |true |
// |123|2 |504 |40 |30 |1003 |false |
// |123|5 |510 |10 |10 |1003 |true |
// +---+---+--------+---------+-------+--------+---------+
val sourceDFTrueF = sourceDF.filter(col("edit_flag").equalTo(true))
val sourceDFTrue = sourceDFTrueF.columns.foldLeft(sourceDFTrueF) {
  (tmpDF, col) => tmpDF.withColumnRenamed(col, s"${col}_true")
}

val sourceDFFalse = sourceDF
  .filter(col("edit_flag").equalTo(false))
  .dropDuplicates("item_nbr")

val resDF =
  sourceDFFalse
    .join(
      sourceDFTrue,
      sourceDFFalse.col("item_nbr") === sourceDFTrue.col("item_nbr_true"),
      "inner"
    )
    .select(
      sourceDFFalse.col("A"),
      sourceDFFalse.col("B"),
      sourceDFTrue.col("rcvr_nbr_true").alias("rcvr_nbr"),
      sourceDFFalse.col("order_qty"),
      sourceDFTrue.col("rcv_qty_true").alias("rcv_qty"),
      sourceDFFalse.col("item_nbr"),
      sourceDFFalse.col("edit_flag")
    )
    .union(sourceDF)
    .orderBy(col("A"), col("item_nbr"), col("edit_flag"))
resDF.show(false)
// +---+---+--------+---------+-------+--------+---------+
// |A |B |rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
// +---+---+--------+---------+-------+--------+---------+
// |123|1 |501 |10 |2 |1001 |false |
// |123|1 |500 |10 |2 |1001 |false |
// |123|1 |502 |10 |5 |1001 |false |
// |123|4 |502 |60 |5 |1001 |true |
// |123|2 |504 |40 |30 |1003 |false |
// |123|2 |510 |40 |10 |1003 |false |
// |123|5 |510 |10 |10 |1003 |true |
// +---+---+--------+---------+-------+--------+---------+
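For reference, a rough PySpark sketch of the same join-and-union approach; the DataFrame name source_df and the session are assumptions, not part of the original answer:
from pyspark.sql import functions as F

# rows flagged true, renamed so the join columns don't collide with the false side
true_df = source_df.filter(F.col("edit_flag") == True)
for c in true_df.columns:
    true_df = true_df.withColumnRenamed(c, c + "_true")

# one representative false row per item_nbr, as in the Scala answer
false_df = (source_df
            .filter(F.col("edit_flag") == False)
            .dropDuplicates(["item_nbr"]))

new_rows = (false_df
            .join(true_df, false_df["item_nbr"] == true_df["item_nbr_true"], "inner")
            .select(false_df["A"], false_df["B"],
                    true_df["rcvr_nbr_true"].alias("rcvr_nbr"),
                    false_df["order_qty"],
                    true_df["rcv_qty_true"].alias("rcv_qty"),
                    false_df["item_nbr"], false_df["edit_flag"]))

res_df = new_rows.union(source_df).orderBy("A", "item_nbr", "edit_flag")
res_df.show(truncate=False)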

Replace empty array with null in Spark DataFrame

Consider a dataframe like the following:
+---+----+--------+----+
| c1| c2| c3| c4|
+---+----+--------+----+
| x| n1| [m1]| []|
| y| n3|[m2, m3]|[z3]|
| x| n2| []| []|
+---+----+--------+----+
I want to replace empty array with null.
+---+----+--------+----+
| c1| c2| c3| c4|
+---+----+--------+----+
| x| n1| [m1]|null|
| y| n3|[m2, m3]|[z3]|
| x| n2| null|null|
+---+----+--------+----+
What is the efficient way to achieve the above goal?
You could check the array length and return null using the when...otherwise function:
val df = Seq(
  ("x", "n1", Seq("m1"), Seq()),
  ("y", "n3", Seq("m2", "m3"), Seq("z3")),
  ("x", "n2", Seq(), Seq())
).toDF("c1", "c2", "c3", "c4")
df.show

df.select($"c1", $"c2",
  when(size($"c3") > 0, $"c3").otherwise(lit(null)) as "c3",
  when(size($"c4") > 0, $"c4").otherwise(lit(null)) as "c4"
).show
It returns:
df: org.apache.spark.sql.DataFrame = [c1: string, c2: string ... 2 more fields]
+---+---+--------+----+
| c1| c2| c3| c4|
+---+---+--------+----+
| x| n1| [m1]| []|
| y| n3|[m2, m3]|[z3]|
| x| n2| []| []|
+---+---+--------+----+
+---+---+--------+----+
| c1| c2| c3| c4|
+---+---+--------+----+
| x| n1| [m1]|null|
| y| n3|[m2, m3]|[z3]|
| x| n2| null|null|
+---+---+--------+----+
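The same check translates almost directly to PySpark; a minimal sketch, assuming the DataFrame is named df:
from pyspark.sql import functions as F

# when() with no matching branch yields null, so empty arrays become null
df.select(
    F.col("c1"), F.col("c2"),
    F.when(F.size("c3") > 0, F.col("c3")).alias("c3"),
    F.when(F.size("c4") > 0, F.col("c4")).alias("c4"),
).show()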

Window Function Tie breaker on other field to get the Latest Record

I have the following data, where the data is partitioned by store and month id and ordered by amount in order to get the primary vendor for the store.
I need a tie breaker if the amount is equal between two vendors:
if one of the tied vendors was the previous month's top-selling vendor, make that vendor the top-selling vendor for this month.
The look-back increases if there is a tie again; a lag of 1 month will not work in that case. In the worst case we will have more duplicates in the previous month as well.
sample data
val data = Seq(
  (201801, 10941, 115, 80890.44900, 135799.66400),
  (201801, 10941, 3, 80890.44900, 135799.66400),
  (201712, 10941, 3, 517440.74500, 975893.79000),
  (201712, 10941, 115, 517440.74500, 975893.79000),
  (201711, 10941, 3, 371501.92100, 574223.52300),
  (201710, 10941, 115, 552435.57800, 746912.06700),
  (201709, 10941, 115, 1523492.60700, 1871480.06800),
  (201708, 10941, 115, 1027698.93600, 1236544.50900),
  (201707, 10941, 33, 1469219.86900, 1622949.53000)
).toDF("MTH_ID", "store_id", "brand", "brndSales", "TotalSales")
Code:
val window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales")
val res = data.withColumn("rank",rank over window)
Output:
+------+--------+-----+-----------+-----------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|rank|
+------+--------+-----+-----------+-----------+----+
|201801| 10941| 115| 80890.449| 135799.664| 1|
|201801| 10941| 3| 80890.449| 135799.664| 1|
|201712| 10941| 3| 517440.745| 975893.79| 1|
|201712| 10941| 115| 517440.745| 975893.79| 1|
|201711| 10941| 115| 371501.921| 574223.523| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1|
|201708| 10941| 115|1027698.936|1236544.509| 1|
|201707| 10941| 33|1469219.869| 1622949.53| 1|
+------+--------+-----+-----------+-----------+----+
My rank is 1 for both the first and second records, but the rank should be 1 only for the second record, based on the previous month's max dollars.
I am expecting the following output:
+------+--------+-----+-----------+-----------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|rank|
+------+--------+-----+-----------+-----------+----+
|201801| 10941| 115| 80890.449| 135799.664| 2|
|201801| 10941| 3| 80890.449| 135799.664| 1|
|201712| 10941| 3| 517440.745| 975893.79| 1|
|201712| 10941| 115| 517440.745| 975893.79| 1|
|201711| 10941| 3| 371501.921| 574223.523| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1|
|201708| 10941| 115|1027698.936|1236544.509| 1|
|201707| 10941| 33|1469219.869| 1622949.53| 1|
+------+--------+-----+-----------+-----------+----+
Should I write a UDAF? Any suggestions would help.
You can do this with 2 windows. First, you will need to use the lag() function to carry over the previous month's sales values so that you can use them in your rank window. Here's that part in PySpark:
from pyspark.sql.functions import lag, rank
from pyspark.sql.window import Window

lag_window = Window.partitionBy("store_id", "brand").orderBy("MTH_ID")
lag_df = data.withColumn("last_month_sales", lag("brndSales").over(lag_window))
Then edit your window to include that new column:
window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales", "last_month_sales")
lag_df.withColumn("rank",rank().over(window)).show()
+------+--------+-----+-----------+-----------+----------------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|last_month_sales|rank|
+------+--------+-----+-----------+-----------+----------------+----+
|201711| 10941| 99| 371501.921| 574223.523| null| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1027698.936| 1|
|201707| 10941| 33|1469219.869| 1622949.53| null| 1|
|201708| 10941| 115|1027698.936|1236544.509| null| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1523492.607| 1|
|201712| 10941| 3| 517440.745| 975893.79| null| 1|
|201801| 10941| 3| 80890.449| 135799.664| 517440.745| 1|
|201801| 10941| 115| 80890.449| 135799.664| 552435.578| 2|
+------+--------+-----+-----------+-----------+----------------+----+
For each row, collect an array of that brand's previous sales, in a (Month, Sales) struct.
val storeAndBrandWindow = Window.partitionBy("store_id", "brand").orderBy($"MTH_ID")
val df1 = data.withColumn("brndSales_list", collect_list(struct($"MTH_ID", $"brndSales")).over(storeAndBrandWindow))
Reverse that array with a UDF.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val returnType = ArrayType(StructType(Array(
  StructField("month", IntegerType),
  StructField("sales", DoubleType))))
val reverseUdf = udf((list: Seq[Row]) => list.reverse, returnType)
val df2 = df1.withColumn("brndSales_list", reverseUdf($"brndSales_list"))
And then sort by the array.
val window = Window.partitionBy("store_id", "MTH_ID").orderBy($"brndSales_list".desc)
val df3 = df2.withColumn("rank", rank over window).orderBy("MTH_ID", "brand")
df3.show(false)
Result
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
|MTH_ID|store_id|brand|brndSales |TotalSales |brndSales_list |rank|
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
|201707|10941 |33 |1469219.869|1622949.53 |[[201707, 1469219.869]] |1 |
|201708|10941 |115 |1027698.936|1236544.509|[[201708, 1027698.936]] |1 |
|201709|10941 |115 |1523492.607|1871480.068|[[201709, 1523492.607], [201708, 1027698.936]] |1 |
|201710|10941 |115 |552435.578 |746912.067 |[[201710, 552435.578], [201709, 1523492.607], [201708, 1027698.936]] |1 |
|201711|10941 |99 |371501.921 |574223.523 |[[201711, 371501.921]] |1 |
|201712|10941 |3 |517440.745 |975893.79 |[[201712, 517440.745]] |1 |
|201801|10941 |3 |80890.449 |135799.664 |[[201801, 80890.449], [201712, 517440.745]] |1 |
|201801|10941 |115 |80890.449 |135799.664 |[[201801, 80890.449], [201710, 552435.578], [201709, 1523492.607], [201708, 1027698.936]]|2 |
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
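A rough PySpark translation of the same idea, assuming the question's data DataFrame and Spark 2.4+, where the built-in reverse() works on arrays so the reversal UDF is not needed:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# running list of (MTH_ID, brndSales) structs per store and brand
store_brand_w = Window.partitionBy("store_id", "brand").orderBy("MTH_ID")
df1 = data.withColumn(
    "brndSales_list",
    F.collect_list(F.struct("MTH_ID", "brndSales")).over(store_brand_w))

# most recent month first (Spark 2.4+ reverse() handles arrays)
df2 = df1.withColumn("brndSales_list", F.reverse("brndSales_list"))

# rank within each store and month by the reversed sales history
rank_w = Window.partitionBy("store_id", "MTH_ID").orderBy(F.col("brndSales_list").desc())
df3 = df2.withColumn("rank", F.rank().over(rank_w)).orderBy("MTH_ID", "brand")
df3.show(truncate=False)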

How do I pass parameters to selectExpr? SparkSQL-Scala

:)
When you have a data frame, you can add columns and fill their rows with the selectExpr method.
Something like this:
scala> table.show
+------+--------+---------+--------+--------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|
+------+--------+---------+--------+--------+
| OlcM| h|999999999| J| 0|
| zOcQ| r|777777777| J| 1|
| kyGp| t|333333333| J| 2|
| BEuX| A|999999999| F| 3|
scala> var table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", "'hola' as Saludo")
tabla: org.apache.spark.sql.DataFrame = [idempr: string, tipperrd: string, codperrd: decimal(9,0), tipperrt: string, codperrt: decimal(9,0), Saludo: string]
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hola|
| zOcQ| r|777777777| J| 1| hola|
| kyGp| t|333333333| J| 2| hola|
| BEuX| A|999999999| F| 3| hola|
My point is:
I define strings and call a method which uses this String parameter to fill a column in the data frame. But I am not able to get the select expression to use the string (I tried $, +, etc.). I want to achieve something like this:
scala> var english = "hello"
scala> def generar_informe(df: DataFrame, tabla: String) {
var selectExpr_df = df.selectExpr(
"TIPPERSCON_BAS as TIP.PERSONA CONTACTABILIDAD",
"CODPERSCON_BAS as COD.PERSONA CONTACTABILIDAD",
"'tabla' as PUNTO DEL FLUJO" )
}
scala> generar_informe(df,english)
.....
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hello|
| zOcQ| r|777777777| J| 1| hello|
| kyGp| t|333333333| J| 2| hello|
| BEuX| A|999999999| F| 3| hello|
I tried:
scala> var result = tabl.selectExpr("A", "B", "$tabla as C")
scala> var abc = tabl.selectExpr("A", "B", ${tabla} as C)
<console>:31: error: not found: value $
var abc = tabl.selectExpr("A", "B", ${tabla} as C)
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
scala> sqlContext.sql("set tabla='hello'")
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
SAME ERROR:
java.lang.RuntimeException: [1.1] failure: identifier expected
${tabla} as C
^
at scala.sys.package$.error(package.scala:27)
Thanks in advance!
Can you try this?
import org.apache.spark.sql.DataFrame

def generar_informe(df: DataFrame, english: String) = {
  df.selectExpr(
    "transactionId", "customerId", "itemId", "amountPaid",
    s"'${english}' as saludo")
}

val english = "hello"
generar_informe(data, english).show()
This is the output I got.
17/11/02 23:56:44 INFO CodeGenerator: Code generated in 13.857987 ms
+-------------+----------+------+----------+------+
|transactionId|customerId|itemId|amountPaid|saludo|
+-------------+----------+------+----------+------+
| 111| 1| 1| 100.0| hello|
| 112| 2| 2| 505.0| hello|
| 113| 3| 3| 510.0| hello|
| 114| 4| 4| 600.0| hello|
| 115| 1| 2| 500.0| hello|
| 116| 1| 2| 500.0| hello|
| 117| 1| 2| 500.0| hello|
| 118| 1| 2| 500.0| hello|
| 119| 2| 3| 500.0| hello|
| 120| 1| 2| 500.0| hello|
| 121| 1| 4| 500.0| hello|
| 122| 1| 2| 500.0| hello|
| 123| 1| 4| 500.0| hello|
| 124| 1| 2| 500.0| hello|
+-------------+----------+------+----------+------+
17/11/02 23:56:44 INFO SparkContext: Invoking stop() from shutdown hook
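The key point is that selectExpr takes plain SQL strings, so the parameter has to be spliced into the string (here with Scala's s interpolator) before Spark parses it. For comparison, a minimal PySpark sketch of the same idea, with the column names assumed from the answer above:
def generar_informe(df, saludo):
    # build the literal inside the SQL string before handing it to selectExpr
    return df.selectExpr("transactionId", "customerId", "itemId", "amountPaid",
                         "'{}' as saludo".format(saludo))

generar_informe(df, "hello").show()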

Find date of each week in from week in Spark Dataframe

I want to add a column with the date of each corresponding week in the Dataframe (appending the Friday date for each week).
My Dataframe looks like this:
+----+------+---------+
|Week| City|sum(Sale)|
+----+------+---------+
| 29|City 2| 72|
| 28|City 3| 48|
| 28|City 2| 19|
| 27|City 2| 16|
| 28|City 1| 84|
| 28|City 4| 72|
| 29|City 4| 39|
| 27|City 3| 42|
| 26|City 3| 68|
| 27|City 1| 89|
| 27|City 4| 104|
| 26|City 2| 19|
| 29|City 3| 27|
+----+------+---------+
I need to convert it to the below dataframe:
+----+------+---------+--------------------------+
|Week|  City|sum(Sale)|particular day(dd/mm/yyyy)|
+----+------+---------+--------------------------+
|  29|City 2|       72|        Friday(07/21/2017)|
|  28|City 3|       48|        Friday(07/14/2017)|
|  28|City 2|       19|        Friday(07/14/2017)|
|  27|City 2|       16|        Friday(07/07/2017)|
|  28|City 1|       84|        Friday(07/14/2017)|
|  28|City 4|       72|        Friday(07/14/2017)|
|  29|City 4|       39|        Friday(07/21/2017)|
|  27|City 3|       42|        Friday(07/07/2017)|
|  26|City 3|       68|        Friday(06/30/2017)|
|  27|City 1|       89|        Friday(07/07/2017)|
|  27|City 4|      104|        Friday(07/07/2017)|
|  26|City 2|       19|        Friday(06/30/2017)|
|  29|City 3|       27|        Friday(07/21/2017)|
+----+------+---------+--------------------------+
please help me
You can write a simple UDF that gets the date by adding the number of weeks to an anchor date.
Here is a simple example:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
(29,"City 2", 72),
(28,"City 3", 48),
(28,"City 2", 19),
(27,"City 2", 16),
(28,"City 1", 84),
(28,"City 4", 72),
(29,"City 4", 39),
(27,"City 3", 42),
(26,"City 3", 68),
(27,"City 1", 89),
(27,"City 4", 104),
(26,"City 2", 19),
(29,"City 3", 27)
)).toDF("week", "city", "sale")
val getDateFromWeek = udf((week: Int) => {
  // create a default date for week 1
  val week1 = LocalDate.of(2016, 12, 30)
  val day = "Friday"
  // add the weeks from the week column
  val result = week1.plusWeeks(week).format(DateTimeFormatter.ofPattern("MM/dd/yyyy"))
  // return the result as Friday (date)
  s"${day} (${result})"
})
//use the udf and create a new column named day
data.withColumn("day", getDateFromWeek($"week")).show
Can anyone convert this to PySpark?
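A rough PySpark conversion of the UDF above, keeping the same anchor date (Friday 2016-12-30 as week 1) and output format, and assuming a data DataFrame with the same week/city/sale columns; treat it as a sketch, not a tested answer:
from datetime import date, timedelta

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def get_date_from_week(week):
    # week 1 is anchored to Friday 2016-12-30, as in the Scala answer
    friday = date(2016, 12, 30) + timedelta(weeks=week)
    return "Friday ({})".format(friday.strftime("%m/%d/%Y"))

# add a new column named day, as in the Scala example
data.withColumn("day", get_date_from_week(F.col("week"))).show()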