I need some help with this tricky transformation.
My Spark DataFrame looks like this:
| A| B|rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
|123| 1| 500| 10| 2| 1001| false|
|123| 1| 501| 10| 2| 1001| false|
|123| 4| 502| 60| 5| 1001| true|
|123| 2| 504| 40| 30| 1003| false|
|123| 5| 510| 10| 10| 1003| true|
I need to add new rows for each combination of A and B that has edit_flag set to false, whenever its item_nbr matches that of another A and B combination with edit_flag set to true.
The new row copies every column from its parent row except rcv_qty and rcvr_nbr, which come from the matching edit_flag = true row. So the final output will look like:
| A| B|rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
|123| 1| 500| 10| 2| 1001| false|
|123| 1| 501| 10| 2| 1001| false|
|123| 1| 502| 10| 5| 1001| false|
|123| 4| 502| 60| 5| 1001| true|
|123| 2| 504| 40| 30| 1003| false|
|123| 2| 510| 40| 10| 1003| false|
|123| 5| 510| 10| 10| 1003| true|
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import spark.implicits._
case class Source(
  A: Int,
  B: Int,
  rcvr_nbr: Int,
  order_qty: Int,
  rcv_qty: Int,
  item_nbr: Int,
  edit_flag: Boolean
)
val sourceDF = Seq(
  Source(123, 1, 500, 10, 2, 1001, false),
  Source(123, 1, 501, 10, 2, 1001, false),
  Source(123, 4, 502, 60, 5, 1001, true),
  Source(123, 2, 504, 40, 30, 1003, false),
  Source(123, 5, 510, 10, 10, 1003, true)
).toDF()
// root
// |-- A: integer (nullable = false)
// |-- B: integer (nullable = false)
// |-- rcvr_nbr: integer (nullable = false)
// |-- order_qty: integer (nullable = false)
// |-- rcv_qty: integer (nullable = false)
// |-- item_nbr: integer (nullable = false)
// |-- edit_flag: boolean (nullable = false)
// +---+---+--------+---------+-------+--------+---------+
// |A |B |rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
// +---+---+--------+---------+-------+--------+---------+
// |123|1 |500 |10 |2 |1001 |false |
// |123|1 |501 |10 |2 |1001 |false |
// |123|4 |502 |60 |5 |1001 |true |
// |123|2 |504 |40 |30 |1003 |false |
// |123|5 |510 |10 |10 |1003 |true |
// +---+---+--------+---------+-------+--------+---------+
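val sourceDFTrueF = sourceDF.filter(col("edit_flag").equalTo(true))
// Suffix every column of the edit_flag = true rows with "_true" to avoid name clashes in the join
val sourceDFTrue = sourceDFTrueF.columns.foldLeft(sourceDFTrueF) {
  (tmpDF, col) =>
    tmpDF.withColumnRenamed(col, s"${col}_true")
}
// Keep one edit_flag = false parent row per item_nbr
val sourceDFFalse = sourceDF.filter(col("edit_flag").equalTo(false)).dropDuplicates("item_nbr")
val resDF = sourceDFFalse
  .join(sourceDFTrue,
    sourceDFFalse.col("item_nbr") === sourceDFTrue.col("item_nbr_true"),
    "inner")
  .select(col("A"), col("B"), col("rcvr_nbr_true").alias("rcvr_nbr"), col("order_qty"), col("rcv_qty_true").alias("rcv_qty"), col("item_nbr"), col("edit_flag"))
  .union(sourceDF)
  .orderBy(col("A"), col("item_nbr"), col("edit_flag"))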
// +---+---+--------+---------+-------+--------+---------+
// |A |B |rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
// +---+---+--------+---------+-------+--------+---------+
// |123|1 |501 |10 |2 |1001 |false |
// |123|1 |500 |10 |2 |1001 |false |
// |123|1 |502 |10 |5 |1001 |false |
// |123|4 |502 |60 |5 |1001 |true |
// |123|2 |504 |40 |30 |1003 |false |
// |123|2 |510 |40 |10 |1003 |false |
// |123|5 |510 |10 |10 |1003 |true |
// +---+---+--------+---------+-------+--------+---------+
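To check the expected rows on a small sample without a Spark session, the rule from the question can be sketched in plain Python. This is only an illustration of the logic, not the Spark answer itself: the function name `add_edit_rows` is made up for this sketch, and it assumes one parent row per (A, B, item_nbr) combination is enough.

```python
from typing import Dict, List

COLS = ("A", "B", "rcvr_nbr", "order_qty", "rcv_qty", "item_nbr", "edit_flag")

def add_edit_rows(rows: List[Dict]) -> List[Dict]:
    # Rows flagged true supply the rcvr_nbr / rcv_qty values for the new rows.
    true_rows = [r for r in rows if r["edit_flag"]]
    # Assumption: the first edit_flag = false row per (A, B, item_nbr) is the parent.
    parents: Dict[tuple, Dict] = {}
    for r in rows:
        if not r["edit_flag"]:
            parents.setdefault((r["A"], r["B"], r["item_nbr"]), r)
    # Copy the parent and overwrite only rcvr_nbr and rcv_qty.
    new_rows = [
        {**p, "rcvr_nbr": t["rcvr_nbr"], "rcv_qty": t["rcv_qty"]}
        for p in parents.values()
        for t in true_rows
        if t["item_nbr"] == p["item_nbr"]
    ]
    return sorted(
        rows + new_rows,
        key=lambda r: (r["A"], r["item_nbr"], r["edit_flag"], r["rcvr_nbr"]),
    )

SAMPLE = [dict(zip(COLS, v)) for v in [
    (123, 1, 500, 10, 2, 1001, False),
    (123, 1, 501, 10, 2, 1001, False),
    (123, 4, 502, 60, 5, 1001, True),
    (123, 2, 504, 40, 30, 1003, False),
    (123, 5, 510, 10, 10, 1003, True),
]]

for row in add_edit_rows(SAMPLE):
    print(tuple(row[c] for c in COLS))
```

The sort key adds rcvr_nbr as a final tie-breaker so the printed order is deterministic.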
I have a Dataframe which has the following structure and data
Column1(String), Column2(String), Date
1, 2, 01/01/2021
A, B, 05/01/2021
M, N, 10/01/2021
I want to transform it as follows: the first two columns are replicated and the date is incremented by one day until the day before the subsequent date:
Column1(String), Column2(String), Date
1, 2, 01/01/2021
1, 2, 02/01/2021
1, 2, 03/01/2021
1, 2, 04/01/2021
A, B, 05/01/2021
A, B, 06/01/2021
A, B, 07/01/2021
A, B, 08/01/2021
A, B, 09/01/2021
M, N, 10/01/2021
Any idea how this can be achieved in Scala Spark?
Here is the working solution:
val dfp1 = List(("1001", 11, "01/10/2021"), ("1002", 21, "05/10/2021"), ("1001", 12, "10/10/2021"), ("1002", 22, "15/10/2021")).toDF("SerialNumber","SomeVal", "Date")
val dfProducts = dfp1.withColumn("Date", to_date($"Date","dd/MM/yyyy"))
|SerialNumber|SomeVal| Date|
| 1001| 11|2021-10-01|
| 1002| 21|2021-10-05|
| 1001| 12|2021-10-10|
| 1002| 22|2021-10-15|
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val overColumns = Window.partitionBy("SerialNumber").orderBy( "Date").rowsBetween(1, Window.unboundedFollowing)
val dfProduct1 = dfProducts.withColumn("NextSerialDate",first("Date", true).over(overColumns)).orderBy("Date")
|SerialNumber|SomeVal| Date|NextSerialDate|
| 1001| 11|2021-10-01| 2021-10-10|
| 1002| 21|2021-10-05| 2021-10-15|
| 1001| 12|2021-10-10| null|
| 1002| 22|2021-10-15| null|
val dfProduct2= dfProduct1.withColumn("NextSerialDate", when(col("NextSerialDate").isNull, col("Date")).otherwise(date_sub(col("NextSerialDate"), 1))).orderBy("SerialNumber")
|SerialNumber|SomeVal| Date|NextSerialDate|
| 1001| 11|2021-10-01| 2021-10-09|
| 1001| 12|2021-10-10| 2021-10-10|
| 1002| 21|2021-10-05| 2021-10-14|
| 1002| 22|2021-10-15| 2021-10-15|
val dfProduct3= dfProduct2.withColumn("ExpandedDate", explode_outer(sequence($"Date", $"NextSerialDate")))
|SerialNumber|SomeVal| Date|NextSerialDate|ExpandedDate|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-01|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-02|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-03|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-04|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-05|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-06|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-07|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-08|
| 1001| 11|2021-10-01| 2021-10-09| 2021-10-09|
| 1001| 12|2021-10-10| 2021-10-10| 2021-10-10|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-05|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-06|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-07|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-08|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-09|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-10|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-11|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-12|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-13|
| 1002| 21|2021-10-05| 2021-10-14| 2021-10-14|
only showing top 20 rows
val dfProduct4 = dfProduct3.drop("Date", "NextSerialDate").withColumn("Date", col("ExpandedDate")).drop("ExpandedDate")
|SerialNumber|SomeVal|Date |
|1001 |11 |2021-10-01|
|1001 |11 |2021-10-02|
|1001 |11 |2021-10-03|
|1001 |11 |2021-10-04|
|1001 |11 |2021-10-05|
|1001 |11 |2021-10-06|
|1001 |11 |2021-10-07|
|1001 |11 |2021-10-08|
|1001 |11 |2021-10-09|
|1001 |12 |2021-10-10|
|1002 |21 |2021-10-05|
|1002 |21 |2021-10-06|
|1002 |21 |2021-10-07|
|1002 |21 |2021-10-08|
|1002 |21 |2021-10-09|
|1002 |21 |2021-10-10|
|1002 |21 |2021-10-11|
|1002 |21 |2021-10-12|
|1002 |21 |2021-10-13|
|1002 |21 |2021-10-14|
|1002 |22 |2021-10-15|
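The same fill-forward logic can be checked on a handful of records in plain Python. This is a minimal sketch of the idea (next date per serial number, minus one day, then one row per day), not a replacement for the Spark pipeline above; the function name `expand_dates` and the tuple layout are assumptions made for the example.

```python
from datetime import date, timedelta
from itertools import groupby

def expand_dates(records):
    """records: (serial, value, date) tuples. Each record is repeated daily
    until the day before the next record with the same serial number."""
    out = []
    ordered = sorted(records, key=lambda r: (r[0], r[2]))
    for _, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        for i, (serial, value, start) in enumerate(group):
            # The last record of a serial has no successor, so it keeps only its own date.
            if i + 1 < len(group):
                end = group[i + 1][2] - timedelta(days=1)
            else:
                end = start
            day = start
            while day <= end:
                out.append((serial, value, day))
                day += timedelta(days=1)
    return out

RECORDS = [
    ("1001", 11, date(2021, 10, 1)),
    ("1002", 21, date(2021, 10, 5)),
    ("1001", 12, date(2021, 10, 10)),
    ("1002", 22, date(2021, 10, 15)),
]

for row in expand_dates(RECORDS):
    print(row)
```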
Suppose I have this dataframe on PySpark:
df = spark.createDataFrame([
['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
['red', 'banana', 7, 70], ['red', 'grape', 8, 80]], schema=['color', 'fruit', 'v1', 'v2'])
I want to create a function that divides column v2 by column v1, returning NaN when the divisor is 0:
import numpy as np
from pyspark.sql.functions import pandas_udf
#pandas_udf('long', PandasUDFType.SCALAR)
def pandas_div(a,b):
    if b == 0:
        return np.nan
    return (a/b)
However, the result turns out to be this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The output that I want should be like this:
| red| banana|10 |
| blue| banana|20 |
| red| carrot|30 |
| blue| grape|40 |
| red| carrot|50 |
| black| carrot|60 |
| red| banana|70 |
| red| grape|80 |
All you need is `when` and `otherwise`. See the example below:
# Create data frame
df = spark.createDataFrame([
['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
['red', 'banana', 7, 70], ['red', 'grape', 8, 80], ['orange', 'grapefruit', 0, 100]], schema=['color', 'fruit', 'v1', 'v2'])
# display result
| color| fruit| v1| v2|
| red| banana| 1| 10|
| blue| banana| 2| 20|
| red| carrot| 3| 30|
| blue| grape| 4| 40|
| red| carrot| 5| 50|
| black| carrot| 6| 60|
| red| banana| 7| 70|
| red| grape| 8| 80|
|orange|grapefruit| 0|100|
# Import functions
import pyspark.sql.functions as f
# apply case when
df1 = df.withColumn("divide", f.when(f.col("v1") == 0, None).otherwise(f.col("v2") / f.col("v1")))
# display result
| color| fruit| v1| v2|divide|
| red| banana| 1| 10| 10.0|
| blue| banana| 2| 20| 10.0|
| red| carrot| 3| 30| 10.0|
| blue| grape| 4| 40| 10.0|
| red| carrot| 5| 50| 10.0|
| black| carrot| 6| 60| 10.0|
| red| banana| 7| 70| 10.0|
| red| grape| 8| 80| 10.0|
|orange|grapefruit| 0|100| null|
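The ValueError in the question comes from evaluating `if b == 0` on a whole pandas Series: the comparison yields one boolean per row, so it has no single truth value. The condition has to be applied per element, which is what `when`/`otherwise` does. A minimal plain-Python sketch of that per-element behaviour follows; the helper names `safe_div` and `divide_columns` are made up for illustration and are not part of PySpark.

```python
def safe_div(numerator, denominator):
    # Mirror when(v1 == 0, None).otherwise(v2 / v1) for a single row.
    if denominator == 0:
        return None
    return numerator / denominator

def divide_columns(v2, v1):
    # Apply the check row by row instead of testing a whole column at once.
    return [safe_div(a, b) for a, b in zip(v2, v1)]

v1 = [1, 2, 3, 4, 5, 6, 7, 8, 0]
v2 = [10, 20, 30, 40, 50, 60, 70, 80, 100]
print(divide_columns(v2, v1))
```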
I want to group by a column and run some aggregations such as count, count_distinct or nunique.
For example:
# the samples values in `date` column are all unique
| category| tags| datetime| date|
| null| ,industry,display,Merchants|2018-01-08 14:30:32| 20200704|
| social,smart| smart,swallow,game,Experience|2019-06-17 04:34:51| 20200705|
| ,beauty,social| social,picture,social|2017-08-19 09:01:37| 20200706|
| default| default,game,us,adventure|2019-10-02 14:18:56| 20200707|
|financial management|financial management,loan,product|2018-07-17 02:07:39| 20200708|
| system| system,font,application,setting|2015-07-18 00:45:57| 20200709|
| null| ,system,profile,optimization|2018-09-07 19:59:03| 20200710|
|-- category: string (nullable = true)
|-- tags: string (nullable = true)
|-- datetime: string (nullable = true)
|-- date: string (nullable = true)
# I want to do group aggregations in PySpark like the following in pandas
group_date_tags_cnt_df = df.groupby('date')['tags'].count()
group_date_tags_nunique_df = df.groupby('date')['tags'].nunique()
group_date_category_cnt_df = df.groupby('date')['category'].count()
group_date_category_nunique_df = df.groupby('date')['category'].nunique()
# expected output here
# All results should ignore the empty strings produced by splitting on ',' and `null` values in the aggregations
| date| count|
| 20200704| 3|
| 20200705| 4|
| 20200706| 3|
| 20200707| 4|
| date| count(DISTINCT tag)|
| 20200704| 3|
| 20200705| 4|
| 20200706| 3|
| 20200707| 4|
# It should ignore `null` here
| date| count|
| 20200704| 0|
| 20200705| 2|
| 20200706| 2|
| 20200707| 1|
| date| count(DISTINCT category)|
| 20200704| 1|
| 20200705| 2|
| 20200706| 2|
| 20200707| 1|
But the tags and category columns are of string type here.
So I think I should split them first and then run the group aggregations on the split values.
But I am not sure how to implement it.
Could anyone help me?
case class d(
  category: Option[String],
  tags: String,
  datetime: String,
  date: String
)
val sourceDF = Seq(
d(None, ",industry,display,Merchants", "2018-01-08 14:30:32", "20200704"),
d(Some("social,smart"), "smart,swallow,game,Experience", "2019-06-17 04:34:51", "20200704"),
d(Some(",beauty,social"), "social,picture,social", "2017-08-19 09:01:37", "20200704")
).toDF("category", "tags", "datetime", "date")
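val df1 = sourceDF.withColumn("category", split('category, ","))
  .withColumn("tags", split('tags, ","))
// Explode each array into one row per element; a null array becomes a single null element
val df2 = df1.select('datetime, 'date, 'tags,
  explode(
    when(col("category").isNotNull, col("category"))
      .otherwise(array(lit(null).cast("string")))).alias("category"))
val df3 = df2.select('category, 'datetime, 'date,
  explode(
    when(col("tags").isNotNull, col("tags"))
      .otherwise(array(lit(null).cast("string")))).alias("tags"))
val resDF = df3.select('category, 'tags, 'datetime, 'date)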
// +--------+----------+-------------------+--------+
// |category| tags| datetime| date|
// +--------+----------+-------------------+--------+
// | null| |2018-01-08 14:30:32|20200704|
// | null| industry|2018-01-08 14:30:32|20200704|
// | null| display|2018-01-08 14:30:32|20200704|
// | null| Merchants|2018-01-08 14:30:32|20200704|
// | social| smart|2019-06-17 04:34:51|20200704|
// | social| swallow|2019-06-17 04:34:51|20200704|
// | social| game|2019-06-17 04:34:51|20200704|
// | social|Experience|2019-06-17 04:34:51|20200704|
// | smart| smart|2019-06-17 04:34:51|20200704|
// | smart| swallow|2019-06-17 04:34:51|20200704|
// | smart| game|2019-06-17 04:34:51|20200704|
// | smart|Experience|2019-06-17 04:34:51|20200704|
// | | social|2017-08-19 09:01:37|20200704|
// | | picture|2017-08-19 09:01:37|20200704|
// | | social|2017-08-19 09:01:37|20200704|
// | beauty| social|2017-08-19 09:01:37|20200704|
// | beauty| picture|2017-08-19 09:01:37|20200704|
// | beauty| social|2017-08-19 09:01:37|20200704|
// | social| social|2017-08-19 09:01:37|20200704|
// | social| picture|2017-08-19 09:01:37|20200704|
// +--------+----------+-------------------+--------+
val group1DF = resDF.groupBy('date, 'category).count()
// +--------+--------+-----+
// | date|category|count|
// +--------+--------+-----+
// |20200704| social| 7|
// |20200704| | 3|
// |20200704| smart| 4|
// |20200704| beauty| 3|
// |20200704| null| 4|
// +--------+--------+-----+
val group2DF = resDF.groupBy('datetime, 'category).count()
// +-------------------+--------+-----+
// | datetime|category|count|
// +-------------------+--------+-----+
// |2017-08-19 09:01:37| social| 3|
// |2017-08-19 09:01:37| beauty| 3|
// |2019-06-17 04:34:51| smart| 4|
// |2019-06-17 04:34:51| social| 4|
// |2018-01-08 14:30:32| null| 4|
// |2017-08-19 09:01:37| | 3|
// +-------------------+--------+-----+
Here is PySpark code that solves your problem. I have used sample data for three dates: 20200702, 20200704 and 20200705.
from pyspark.sql import Row
from pyspark.sql.functions import *
drow = Row("category","tags","datetime","date")
data = [drow("", ",industry,display,Merchants","2018-01-08 14:30:32","20200704"),drow("social,smart","smart,swallow,game,Experience","2019-06-17 04:34:51","20200702"),drow(",beauty,social", "social,picture,social", "2017-08-19 09:01:37", "20200705")]
df = spark.createDataFrame(data)
final_df = (df.withColumn("category", split(df['category'], ",")).withColumn("tags", split(df['tags'], ","))
    .select('datetime', 'date', 'tags', explode(when(col("category").isNotNull(), col("category")).otherwise(array(lit("").cast("string")))).alias("category"))
    .select('datetime', 'date', 'category', explode(when(col("tags").isNotNull(), col("tags")).otherwise(array(lit("").cast("string")))).alias("tags")))
final_df.show()
| datetime| date|category| tags|
|2018-01-08 14:30:32|20200704| | |
|2018-01-08 14:30:32|20200704| | industry|
|2018-01-08 14:30:32|20200704| | display|
|2018-01-08 14:30:32|20200704| | Merchants|
|2019-06-17 04:34:51|20200702| social| smart|
|2019-06-17 04:34:51|20200702| social| swallow|
|2019-06-17 04:34:51|20200702| social| game|
|2019-06-17 04:34:51|20200702| social|Experience|
|2019-06-17 04:34:51|20200702| smart| smart|
|2019-06-17 04:34:51|20200702| smart| swallow|
|2019-06-17 04:34:51|20200702| smart| game|
|2019-06-17 04:34:51|20200702| smart|Experience|
|2017-08-19 09:01:37|20200705| | social|
|2017-08-19 09:01:37|20200705| | picture|
|2017-08-19 09:01:37|20200705| | social|
|2017-08-19 09:01:37|20200705| beauty| social|
|2017-08-19 09:01:37|20200705| beauty| picture|
|2017-08-19 09:01:37|20200705| beauty| social|
|2017-08-19 09:01:37|20200705| social| social|
|2017-08-19 09:01:37|20200705| social| picture|
only showing top 20 rows
final_df.groupBy('date', 'tags').count().show()
| date| tags|count|
|20200702| smart| 2|
|20200705| picture| 3|
|20200702| swallow| 2|
|20200704| industry| 1|
|20200704| display| 1|
|20200702| game| 2|
|20200704| | 1|
|20200704| Merchants| 1|
|20200702|Experience| 2|
|20200705| social| 6|
final_df.groupBy('date', 'category').count().show()
| date|category|count|
|20200702| smart| 4|
|20200702| social| 4|
|20200705| | 3|
|20200705| beauty| 3|
|20200704| | 4|
|20200705| social| 3|
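The question also asks that empty strings from the split and `null` values be ignored, and that both a count and a distinct count be produced per date. A minimal plain-Python sketch of that rule, useful for checking expected numbers on a few rows, could look like this. The helper names `tokens` and `count_and_nunique` are hypothetical, and the rows are taken from the sample in the question.

```python
from collections import defaultdict

def tokens(value):
    """Split a comma-separated string, dropping None and empty pieces."""
    if value is None:
        return []
    return [t.strip() for t in value.split(",") if t.strip()]

def count_and_nunique(rows, column):
    """Return {date: (count, distinct_count)} for the tokens in `column`."""
    seen = defaultdict(list)
    for row in rows:
        # A date whose values are all null or empty still appears, with (0, 0).
        seen[row["date"]].extend(tokens(row[column]))
    return {d: (len(v), len(set(v))) for d, v in seen.items()}

ROWS = [
    {"date": "20200704", "category": None, "tags": ",industry,display,Merchants"},
    {"date": "20200705", "category": "social,smart", "tags": "smart,swallow,game,Experience"},
    {"date": "20200706", "category": ",beauty,social", "tags": "social,picture,social"},
]

print(count_and_nunique(ROWS, "tags"))
print(count_and_nunique(ROWS, "category"))
```

For the 20200706 row, `social` appears twice in tags, so this sketch reports a count of 3 but only 2 distinct values.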
I have a dataframe like this:
df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11"],
"Slot_Length": [30, 30, 30, 30, 30, 30, 30, 30, 30],
"Total_Space": [60, 60, 60, 120, 120, 120, 120, 120, 120],
"Amount_Over": [-30, -30, -30, -60, -60, -60, -60, -60, -60],
"Rank": [1, 1, 2, 1, 1, 1, 1, 2, 2]})
df = spark.createDataFrame(df)
| Date|Slot_Length|Total_Space|Amount_Over|Rank|
|2020-05-10| 30| 60| -30| 1|
|2020-05-10| 30| 60| -30| 1|
|2020-05-10| 30| 60| -30| 2|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 2|
|2020-05-11| 30| 120| -60| 2|
For each Date I have a Total_Space that can be filled. So for 2020-05-10, I have 60 seconds, and for 2020-05-11 I have 120 seconds.
Each Date also already has assigned slots with a certain Slot_Length.
For each Date I have already calculated the amount of space that Date is over in the Amount_Over column and have ranked them appropriately based on a priority column not shown here.
What I would like to do is to drop the rows with lowest Rank for a Date until the Slot_Lengths add up to the Total_Space for a Date.
| Date|Slot_Length|Total_Space|Amount_Over|Rank|
|2020-05-10| 30| 60| -30| 1|
|2020-05-10| 30| 60| -30| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
In this example, it is as easy as dropping all Rank equal to 2, but there will be examples where there is a tie between ranks, so first take the highest ranks, and then take a random one if there is a tie.
What is the best way to do this? I already understand it will need a Window function over the Date to do each calculation over the Slot_Length, Total_Space, and Amount_Over columns correctly.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import Window
df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-11",
                            "2020-05-11", "2020-05-11", "2020-05-11"],
                   "Slot_Length": [30, 30, 30, 30, 30, 30, 30, 30, 30],
                   "Total_Space": [60, 60, 60, 120, 120, 120, 120, 120, 120],
                   "Amount_Over": [-30, -30, -30, -60, -60, -60, -60, -60, -60],
                   "Rank": [1, 1, 2, 1, 1, 1, 1, 2, 2]})
df = spark.createDataFrame(df)
# Running total of Slot_Length per Date, taking the best-ranked rows first
w = Window.partitionBy("Date").orderBy("Rank").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn(
    "Cumulative_Sum", F.sum("Slot_Length").over(w)
).filter(
    F.col("Cumulative_Sum") <= F.col("Total_Space")
).show()
which results in:
| Date|Slot_Length|Total_Space|Amount_Over|Rank|Cumulative_Sum|
|2020-05-10| 30| 60| -30| 1| 30|
|2020-05-10| 30| 60| -30| 1| 60|
|2020-05-11| 30| 120| -60| 1| 30|
|2020-05-11| 30| 120| -60| 1| 60|
|2020-05-11| 30| 120| -60| 1| 90|
|2020-05-11| 30| 120| -60| 1| 120|
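The question also asks for a random pick when ranks tie. A minimal plain-Python sketch of the same running-total filter with a random tie-breaker is shown below; the function name `keep_within_space` and the `seed` parameter are assumptions for illustration. In Spark, a similar effect could presumably be obtained by adding a random column (for example from `F.rand()`) as a second ordering key of the window, though that is not part of the answer above.

```python
import random
from collections import defaultdict

def keep_within_space(rows, seed=None):
    """Keep rows per Date, best Rank first, while the running Slot_Length
    total stays within Total_Space. Ties in Rank are broken randomly."""
    rng = random.Random(seed)
    by_date = defaultdict(list)
    for r in rows:
        by_date[r["Date"]].append(r)
    kept = []
    for d in sorted(by_date):
        # Secondary random key shuffles rows that share the same Rank.
        ordered = sorted(by_date[d], key=lambda r: (r["Rank"], rng.random()))
        running = 0
        for r in ordered:
            running += r["Slot_Length"]
            if running <= r["Total_Space"]:
                kept.append({**r, "Cumulative_Sum": running})
    return kept

ROWS = (
    [{"Date": "2020-05-10", "Slot_Length": 30, "Total_Space": 60, "Amount_Over": -30, "Rank": k}
     for k in (1, 1, 2)]
    + [{"Date": "2020-05-11", "Slot_Length": 30, "Total_Space": 120, "Amount_Over": -60, "Rank": k}
       for k in (1, 1, 1, 1, 2, 2)]
)

for row in keep_within_space(ROWS, seed=0):
    print(row)
```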