PySpark: Column Is Not Iterable - dataframe

I have a Spark dataframe as follows:
from pyspark.sql import SparkSession, functions as F
df = spark.sql("SELECT transaction_id, transaction_label, module_name, length(transaction_label) as length FROM all_trans")
df.show()
+---------------+-----------------+-----------+------+
| transaction_id|transaction_label|module_name|length|
+---------------+-----------------+-----------+------+
|0P2117292543428| EDU| mcc| 3|
| 0P211729824944| EDU| mcc| 3|
| 0P31172950208| EDU| mcc| 3|
|0P2117294027213| FUN0402007| regex| 10|
|0P2117294027213| FUN04| mcc| 5|
|0P2117293581427| FUN0402007| regex| 10|
|0P2117293581427| FUN04| mcc| 5|
|0P2117292967336| FUN0402007| regex| 10|
|0P2117292967336| FUN04| mcc| 5|
|0P2117292659416| FUN0402007| regex| 10|
|0P2117292659416| FUN04| mcc| 5|
|0P2117293159304| FUN0402007| regex| 10|
|0P2117293159304| FUN04| mcc| 5|
|0P2117293237687| FUN0402007| regex| 10|
|0P2117293237687| FUN04| mcc| 5|
|0P2117293548610| FUN0402007| regex| 10|
|0P2117293548610| FUN04| mcc| 5|
|0P2117293678239| FUN0402007| regex| 10|
|0P2117293678239| FUN04| mcc| 5|
|0P2117293840924| FUN0402007| regex| 10|
+---------------+-----------------+-----------+------+
I want to compare transaction_label of the same transaction_id for different module_name.
I tried:
df = (df.filter("module_name = 'mcc'").alias('m')
.join(df.filter("module_name = 'regex'").alias('r'), 'transaction_id')
.withColumn('check', F.col('m.transaction_label') == F.substring('r.transaction_label', 1, F.col('m.length')))
)
df.show()
which has yielded:
TypeError: Column is not iterable

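The third argument of F.substring (the length) is expected to be a plain number here, but you passed a Column instead, which is what raises the error.
Wrap the substring call in F.expr so that it is evaluated as a SQL expression; in SQL the length can come from a column: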
df = (df.filter("module_name = 'mcc'").alias('m')
.join(df.filter("module_name = 'regex'").alias('r'), 'transaction_id')
.withColumn('check', F.col('m.transaction_label') == F.expr("substring(r.transaction_label, 1, m.length)"))
)
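To make the intent of that SQL expression concrete, here is a plain-Python sketch of the per-row check it performs, with no Spark session involved. The function name `label_matches` and the sample pairs are illustrative assumptions, not part of the original code. SQL `substring` positions are 1-based, which the slice below mirrors.

```python
def label_matches(mcc_label: str, regex_label: str) -> bool:
    """Mirror of: m.transaction_label == substring(r.transaction_label, 1, m.length)."""
    length = len(mcc_label)         # plays the role of the m.length column
    prefix = regex_label[0:length]  # SQL position 1 corresponds to Python index 0
    return mcc_label == prefix


# Hypothetical (mcc label, regex label) pairs for one transaction_id each.
pairs = [("FUN04", "FUN0402007"), ("EDU", "FUN0402007")]
checks = [label_matches(m, r) for m, r in pairs]
print(checks)
```

The point of the sketch is that the prefix length varies per row, which is why a fixed integer length does not fit this comparison.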

Related

Compare different rows in dataframe containing the same id

I have a Spark dataframe as follows:
spark.sql("""
SELECT * from all_trans
""").show()
+---------------+-------------+-----------+------------------+-----------------+-----------+----------+
| transaction_id|card_event_id|card_pos_id|card_point_country|transaction_label|module_name| post_date|
+---------------+-------------+-----------+------------------+-----------------+-----------+----------+
|0P2117292543428| 2502723025| null| CZ| EDU| mcc|2022-02-10|
| 0P211729824944| 2502723477| null| CZ| EDU| mcc|2022-02-10|
| 0P31172950208| 2502723587| null| CZ| EDU| mcc|2022-02-10|
|0P2117294027213| 2502726454| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117294027213| 2502726454| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117293581427| 2502729360| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117293581427| 2502729360| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117292967336| 2502729724| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117292967336| 2502729724| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117292659416| 2502730642| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117292659416| 2502730642| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117293159304| 2502731764| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117293159304| 2502731764| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117293237687| 2502732381| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117293237687| 2502732381| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117293548610| 2502733071| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117293548610| 2502733071| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117293678239| 2502736684| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117293678239| 2502736684| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117293840924| 2502737447| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
+---------------+-------------+-----------+------------------+-----------------+-----------+----------+
One transaction_id can have more than 1 transaction_label.
I want to be able to go automatically through all transaction_label in the dataframe for each transaction_id and compare whether they match on some level.
I had in mind logic like:
df.foreach(lambda x:
(transaction_id.transaction_label where module_name=='mcc') == (left(transaction_id.transaction_label, 5) where module_name=='regex')
But I don't know how to compare same transaction_id in Spark.
Conversion to Pandas fails due to limited driver memory.
You could do a self join. For this, it is best practice to give each dataframe an alias. Then you can create a column for the check:
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('0P2117292543428', 'EDU', 'mcc'),
( '0P211729824944', 'EDU', 'mcc'),
( '0P31172950208', 'EDU', 'mcc'),
('0P2117294027213', 'FUN0402007', 'regex'),
('0P2117294027213', 'FUN04', 'mcc'),
('0P2117293581427', 'FUN0402007', 'regex'),
('0P2117293581427', 'FUN04', 'mcc'),
('0P2117292967336', 'FUN0402007', 'regex'),
('0P2117292967336', 'FUN04', 'mcc'),
('0P2117292659416', 'FUN0402007', 'regex'),
('0P2117292659416', 'FUN04', 'mcc'),
('0P2117293159304', 'FUN0402007', 'regex'),
('0P2117293159304', 'FUN04', 'mcc'),
('0P2117293237687', 'FUN0402007', 'regex'),
('0P2117293237687', 'FUN04', 'mcc'),
('0P2117293548610', 'FUN0402007', 'regex'),
('0P2117293548610', 'FUN04', 'mcc'),
('0P2117293678239', 'FUN0402007', 'regex'),
('0P2117293678239', 'FUN04', 'mcc'),
('0P2117293840924', 'FUN0402007', 'regex')],
['transaction_id', 'transaction_label', 'module_name'])
Script:
df = (df.filter("module_name = 'mcc'").alias('m')
.join(df.filter("module_name = 'regex'").alias('r'), 'transaction_id')
.withColumn('check', F.col('m.transaction_label') == F.substring('r.transaction_label', 1, 5))
)
df.show()
# +---------------+-----------------+-----------+-----------------+-----------+-----+
# | transaction_id|transaction_label|module_name|transaction_label|module_name|check|
# +---------------+-----------------+-----------+-----------------+-----------+-----+
# |0P2117292659416| FUN04| mcc| FUN0402007| regex| true|
# |0P2117292967336| FUN04| mcc| FUN0402007| regex| true|
# |0P2117293159304| FUN04| mcc| FUN0402007| regex| true|
# |0P2117293237687| FUN04| mcc| FUN0402007| regex| true|
# |0P2117293548610| FUN04| mcc| FUN0402007| regex| true|
# |0P2117293581427| FUN04| mcc| FUN0402007| regex| true|
# |0P2117293678239| FUN04| mcc| FUN0402007| regex| true|
# |0P2117294027213| FUN04| mcc| FUN0402007| regex| true|
# +---------------+-----------------+-----------+-----------------+-----------+-----+
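For readers who want to check the join logic without a Spark session, the same idea can be sketched in plain Python. This is only an illustration under the assumption that each transaction_id has at most one row per module_name; the variable names `mcc`, `regex` and `joined` are made up for the sketch.

```python
rows = [
    ("0P2117292543428", "EDU", "mcc"),
    ("0P2117294027213", "FUN0402007", "regex"),
    ("0P2117294027213", "FUN04", "mcc"),
    ("0P2117293840924", "FUN0402007", "regex"),
]

# Split the rows into the two sides of the self join, keyed by transaction_id.
mcc = {tid: label for tid, label, module in rows if module == "mcc"}
regex = {tid: label for tid, label, module in rows if module == "regex"}

# Inner join: keep only ids present on both sides, then compare the mcc label
# with the first 5 characters of the regex label.
joined = [
    (tid, mcc[tid], regex[tid], mcc[tid] == regex[tid][:5])
    for tid in sorted(mcc.keys() & regex.keys())
]
print(joined)
```

Ids that appear on only one side (such as the EDU rows) drop out, which matches the inner join in the script above.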

Pyspark: how to solve complicated dataframe logic plus join

I have two dataframes to work on. The first one, df1, looks like this:
df1_schema = StructType([StructField("Date", StringType(), True),\
StructField("store_id", StringType(), True),\
StructField("warehouse_id", StringType(), True),\
StructField("class_id", StringType(), True) ,\
StructField("total_time", IntegerType(), True) ])
df_data = [('2020-08-01','110','1','11010',3),('2020-08-02','110','1','11010',2),\
('2020-08-03','110','1','11010',3),('2020-08-04','110','1','11010',3),\
('2020-08-05','111','1','11010',1),('2020-08-06','111','1','11010',-1)]
rdd = sc.parallelize(df_data)
df1 = sqlContext.createDataFrame(df_data, df1_schema)
df1 = df1.withColumn("Date",to_date("Date", 'yyyy-MM-dd'))
df1.show()
+----------+--------+------------+--------+----------+
| Date|store_id|warehouse_id|class_id|total_time|
+----------+--------+------------+--------+----------+
|2020-08-01| 110| 1| 11010| 3|
|2020-08-02| 110| 1| 11010| 2|
|2020-08-03| 110| 1| 11010| 3|
|2020-08-04| 110| 1| 11010| 3|
|2020-08-05| 111| 1| 11010| 1|
|2020-08-06| 111| 1| 11010| -1|
+----------+--------+------------+--------+----------+
I calculated something called arrival_date
#To calculate the arrival_date
#logic : add the Date + total_time so in first row, 2020-08-01 +3 would give me 2020-08-04
#if total_time is -1 then return blank
df1= df1.withColumn('arrival_date', F.when(col('total_time') != -1, expr("date_add(date, total_time)"))
.otherwise(''))
+----------+--------+------------+--------+----------+------------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|
+----------+--------+------------+--------+----------+------------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06|
|2020-08-06| 111| 1| 11010| -1| |
+----------+--------+------------+--------+----------+------------+
and what I want to calculate is this..
#to calculate the transit_date
#if arrival_date is same, ex) 2020-08-04 is repeated 2 or more times, then take min("Date")
#which will be 2020-08-01 otherwise just return the Date ex) 2020-08-07 would just return 2020-08-04
#we need to care about cloth_id too, we have arrival_date = 2020-08-06 repeated 2 times as well but since
#if one of store_id or warehouse_id is different we treat them separately. so at arrival_date = 2020-08-06 at date = 2020-08-03,
##we must return 2020-08-03
#so we treat them separately when one of (store_id, warehouse_id ) is different.
#*Note* we dont care about class_id, its not effective.
#if arrival_date = blank then leave it as blank..
#so our df would look something like this.
+----------+--------+------------+--------+----------+------------+------------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|transit_date|
+----------+--------+------------+--------+----------+------------+------------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04| 2020-08-01|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04| 2020-08-01|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06| 2020-08-03|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07| 2020-08-04|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06| 2020-08-05|
|2020-08-06| 111| 1| 11010| -1| | |
+----------+--------+------------+--------+----------+------------+------------+
Next, I have df2, which looks like this:
#we have another dataframe call it df2
df2_schema = StructType([StructField("Date", StringType(), True),\
StructField("store_id", StringType(), True),\
StructField("warehouse_id", StringType(), True),\
StructField("cloth_id", StringType(), True),\
StructField("class_id", StringType(), True) ,\
StructField("type", StringType(), True),\
StructField("quantity", IntegerType(), True)])
df_data = [('2020-08-01','110','1','M_1','11010','R',5),('2020-08-01','110','1','M_1','11010','R',2),\
('2020-08-02','110','1','M_1','11010','C',3),('2020-08-03','110','1','M_1','11010','R',1),\
('2020-08-04','110','1','M_1','11010','R',3),('2020-08-05','111','1','M_2','11010','R',5)]
rdd = sc.parallelize(df_data)
df2 = sqlContext.createDataFrame(df_data, df2_schema)
df2 = df2.withColumn("Date",to_date("Date", 'yyyy-MM-dd'))
df2.show()
+----------+--------+------------+--------+--------+----+--------+
| Date|store_id|warehouse_id|cloth_id|class_id|type|quantity|
+----------+--------+------------+--------+--------+----+--------+
|2020-08-01| 110| 1| M_1| 11010| R| 5|
|2020-08-01| 110| 1| M_1| 11010| R| 2|
|2020-08-02| 110| 1| M_1| 11010| C| 3|
|2020-08-03| 110| 1| M_1| 11010| R| 1|
|2020-08-04| 110| 1| M_1| 11010| R| 3|
|2020-08-05| 111| 1| M_2| 11010| R| 5|
+----------+--------+------------+--------+--------+----+--------+
and I calculated quantity2, this is just sum of quantity where type=R
df2 =df2.groupBy('Date','store_id','warehouse_id','cloth_id','class_id')\
.agg( F.sum(F.when(col('type')=='R', col('quantity'))\
.otherwise(col('quantity'))).alias('quantity2')).orderBy('Date')
+----------+--------+------------+--------+--------+---------+
| Date|store_id|warehouse_id|cloth_id|class_id|quantity2|
+----------+--------+------------+--------+--------+---------+
|2020-08-01| 110| 1| M_1| 11010| 7|
|2020-08-02| 110| 1| M_1| 11010| 3|
|2020-08-03| 110| 1| M_1| 11010| 1|
|2020-08-04| 110| 1| M_1| 11010| 3|
|2020-08-05| 111| 1| M_2| 11010| 5|
+----------+--------+------------+--------+--------+---------+
Now I have df1 and df2, and I want to join them.
I tried something like this:
df4 = df1.select('store_id','warehouse_id','class_id','arrival_date','transit_date')
df4= df4.filter(" transit_date != '' ")
df4=df4.withColumnRenamed('arrival_date', 'date')
df3 = df2.join(df1, on=['Date','store_id','warehouse_id','class_id'],how='inner').orderBy('Date')
df5 = df3.join(df4, on=['Date','store_id','warehouse_id','class_id'], how='left').orderBy('Date')
But I don't think this is the correct approach. The resulting dataframe should look like this:
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
| Date|store_id|warehouse_id|class_id|cloth_id|quantity2|total_time|arrival_date|transit_date|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
|2020-08-01| 110| 1| 11010| M_1| 7| 3| 2020-08-04| null|
|2020-08-02| 110| 1| 11010| M_1| 3| 2| 2020-08-04| null|
|2020-08-03| 110| 1| 11010| M_1| 1| 3| 2020-08-06| null|
|2020-08-04| 110| 1| 11010| M_1| 3| 3| 2020-08-07| 2020-08-01|
|2020-08-05| 111| 1| 11010| M_2| 5| 1| 2020-08-06| null|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
Note that the transit_date went to where Date = arrival_date. Of course, the null is replaced by a blank.
Lastly, if today is 2020-08-04, then look at where arrival_date == 2020-08-04, sum up the quantity, and place it at today. Rows where store_id = 111 will have separate dates (not shown here), so the logic needs to make sense when store_id = 111 as well; I've only shown the example where store_id = 110.
From my understanding of your question, and given that you already have the following df1 and df2:
df1.orderBy('Date').show() df2.orderBy('Date').show()
+----------+--------+------------+--------+----------+------------+ +----------+--------+------------+--------+--------+---------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date| | Date|store_id|warehouse_id|cloth_id|class_id|quantity2|
+----------+--------+------------+--------+----------+------------+ +----------+--------+------------+--------+--------+---------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04| |2020-08-01| 110| 1| M_1| 11010| 7|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04| |2020-08-02| 110| 1| M_1| 11010| 3|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06| |2020-08-03| 110| 1| M_1| 11010| 1|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07| |2020-08-04| 110| 1| M_1| 11010| 3|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06| |2020-08-05| 111| 1| M_2| 11010| 5|
|2020-08-06| 111| 1| 11010| -1| | +----------+--------+------------+--------+--------+---------+
+----------+--------+------------+--------+----------+------------+
you can try the following 5 steps:
Step-1: Set up the list of column names grp_cols for join:
from pyspark.sql import functions as F
grp_cols = ["Date", "store_id", "warehouse_id", "class_id"]
Step-2: create df3 containing transit_date which is the min Date on each combination of arrival_date, store_id, warehouse_id and class_id:
df3 = df1.filter('total_time != -1') \
.groupby("arrival_date", "store_id", "warehouse_id", "class_id") \
.agg(F.min('Date').alias('transit_date')) \
.withColumnRenamed("arrival_date", "Date")
df3.orderBy('Date').show()
+----------+--------+------------+--------+------------+
| Date|store_id|warehouse_id|class_id|transit_date|
+----------+--------+------------+--------+------------+
|2020-08-04| 110| 1| 11010| 2020-08-01|
|2020-08-06| 111| 1| 11010| 2020-08-05|
|2020-08-06| 110| 1| 11010| 2020-08-03|
|2020-08-07| 110| 1| 11010| 2020-08-04|
+----------+--------+------------+--------+------------+
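The grouping in Step-2 can be checked by hand with a small plain-Python sketch. It assumes the row with total_time = -1 has already been filtered out, and the names `df1_rows` and `transit` are illustrative only.

```python
from datetime import date

# (Date, store_id, warehouse_id, class_id, arrival_date)
df1_rows = [
    (date(2020, 8, 1), "110", "1", "11010", date(2020, 8, 4)),
    (date(2020, 8, 2), "110", "1", "11010", date(2020, 8, 4)),
    (date(2020, 8, 3), "110", "1", "11010", date(2020, 8, 6)),
    (date(2020, 8, 4), "110", "1", "11010", date(2020, 8, 7)),
    (date(2020, 8, 5), "111", "1", "11010", date(2020, 8, 6)),
]

# Equivalent of groupby(arrival_date, store_id, warehouse_id, class_id).agg(min(Date)).
transit = {}
for d, store, wh, cls, arrival in df1_rows:
    key = (arrival, store, wh, cls)
    transit[key] = min(d, transit.get(key, d))

for key in sorted(transit):
    print(key, "->", transit[key])
```

Because store_id is part of the key, the two rows arriving on 2020-08-06 stay separate, as in the table above.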
Step-3: set up df4 by joining df2 with df1, then left joining df3 using grp_cols, and persist df4:
df4 = df2.join(df1, grp_cols).join(df3, grp_cols, "left") \
.withColumn('transit_date', F.when(F.col('total_time') != -1, F.col("transit_date")).otherwise('')) \
.persist()
_ = df4.count()
df4.orderBy('Date').show()
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
| Date|store_id|warehouse_id|class_id|cloth_id|quantity2|total_time|arrival_date|transit_date|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
|2020-08-01| 110| 1| 11010| M_1| 7| 3| 2020-08-04| null|
|2020-08-02| 110| 1| 11010| M_1| 3| 2| 2020-08-04| null|
|2020-08-03| 110| 1| 11010| M_1| 1| 3| 2020-08-06| null|
|2020-08-04| 110| 1| 11010| M_1| 3| 3| 2020-08-07| 2020-08-01|
|2020-08-05| 111| 1| 11010| M_2| 5| 1| 2020-08-06| null|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
Step-4: calculate sum(quantity2) as want from df4 for each arrival_date + store_id + warehouse_id + class_id + cloth_id
df5 = df4 \
.groupby("arrival_date", "store_id", "warehouse_id", "class_id", "cloth_id") \
.agg(F.sum("quantity2").alias("want")) \
.withColumnRenamed("arrival_date", "Date")
df5.orderBy('Date').show()
+----------+--------+------------+--------+--------+----+
| Date|store_id|warehouse_id|class_id|cloth_id|want|
+----------+--------+------------+--------+--------+----+
|2020-08-04| 110| 1| 11010| M_1| 10|
|2020-08-06| 111| 1| 11010| M_2| 5|
|2020-08-06| 110| 1| 11010| M_1| 1|
|2020-08-07| 110| 1| 11010| M_1| 3|
+----------+--------+------------+--------+--------+----+
Step-5: create the final dataframe by left joining df4 with df5:
df_new = df4.join(df5, grp_cols+["cloth_id"], "left").fillna(0, subset=['want'])
df_new.orderBy("Date").show()
+----------+--------+------------+--------+--------+---------+----------+------------+------------+----+
| Date|store_id|warehouse_id|class_id|cloth_id|quantity2|total_time|arrival_date|transit_date|want|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+----+
|2020-08-01| 110| 1| 11010| M_1| 7| 3| 2020-08-04| null| 0|
|2020-08-02| 110| 1| 11010| M_1| 3| 2| 2020-08-04| null| 0|
|2020-08-03| 110| 1| 11010| M_1| 1| 3| 2020-08-06| null| 0|
|2020-08-04| 110| 1| 11010| M_1| 3| 3| 2020-08-07| 2020-08-01| 10|
|2020-08-05| 111| 1| 11010| M_2| 5| 1| 2020-08-06| null| 0|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+----+
df4.unpersist()
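As a cross-check of Step-4 and Step-5, the "sum by arrival date, then look it up by Date" idea can be sketched in plain Python. To keep it short, the sketch drops warehouse_id and class_id from the key (they are constant in the sample data); that simplification and the names `df4_rows`, `want` and `final` are assumptions of the sketch, not part of the answer.

```python
# (Date, store_id, cloth_id, quantity2, arrival_date), dates as ISO strings.
df4_rows = [
    ("2020-08-01", "110", "M_1", 7, "2020-08-04"),
    ("2020-08-02", "110", "M_1", 3, "2020-08-04"),
    ("2020-08-03", "110", "M_1", 1, "2020-08-06"),
    ("2020-08-04", "110", "M_1", 3, "2020-08-07"),
    ("2020-08-05", "111", "M_2", 5, "2020-08-06"),
]

# Step-4: total quantity2 per (arrival_date, store_id, cloth_id).
want = {}
for d, store, cloth, qty, arrival in df4_rows:
    key = (arrival, store, cloth)
    want[key] = want.get(key, 0) + qty

# Step-5: left join on Date == arrival_date; a missing match becomes 0 (the fillna).
final = [(d, store, cloth, want.get((d, store, cloth), 0))
         for d, store, cloth, _, _ in df4_rows]
for row in final:
    print(row)
```

Only 2020-08-04 for store 110 receives a non-zero value, because it is the only Date in the sample that is also some row's arrival_date for the same store and cloth.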
Here is the code for df1:
from pyspark.sql import Window
from pyspark.sql.functions import *
from pyspark.sql.types import *
import builtins as p
df1_schema = StructType(
[
StructField('Date', StringType(), True),
StructField('store_id', StringType(), True),
StructField('warehouse_id', StringType(), True),
StructField('class_id', StringType(), True),
StructField('total_time', IntegerType(), True)
]
)
df1_data = [
('2020-08-01','110','1','11010',3),
('2020-08-02','110','1','11010',2),
('2020-08-03','110','1','11010',3),
('2020-08-04','110','1','11010',3),
('2020-08-05','111','1','11010',1),
('2020-08-06','111','1','11010',-1)
]
df1 = spark.createDataFrame(df1_data, df1_schema)
df1 = df1.withColumn('Date', to_date('Date'))
df1 = df1.withColumn('arrival_date', when(col('total_time') != -1, expr("date_add(date, total_time)")).otherwise(''))
w = Window.partitionBy('arrival_date', 'store_id', 'warehouse_id').orderBy('Date')
df1 = df1.withColumn('transit_date', when(col('total_time') != -1, first('Date').over(w)).otherwise('')).orderBy('Date')
df1.show()
+----------+--------+------------+--------+----------+------------+------------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|transit_date|
+----------+--------+------------+--------+----------+------------+------------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04| 2020-08-01|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04| 2020-08-01|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06| 2020-08-03|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07| 2020-08-04|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06| 2020-08-05|
|2020-08-06| 111| 1| 11010| -1| | |
+----------+--------+------------+--------+----------+------------+------------+
and df2 as you did,
df2_schema = StructType(
[
StructField('Date', StringType(), True),
StructField('store_id', StringType(), True),
StructField('warehouse_id', StringType(), True),
StructField('cloth_id', StringType(), True),
StructField('class_id', StringType(), True),
StructField('type', StringType(), True),
StructField('quantity', IntegerType(), True)
]
)
df2_data = [
('2020-08-01','110','1','M_1','11010','R',5),
('2020-08-01','110','1','M_1','11010','R',2),
('2020-08-02','110','1','M_1','11010','C',3),
('2020-08-03','110','1','M_1','11010','R',1),
('2020-08-04','110','1','M_1','11010','R',3),
('2020-08-05','111','1','M_2','11010','R',5)
]
df2 = spark.createDataFrame(df2_data, df2_schema)
df2 = df2.withColumn('Date', to_date('Date'))
df2 = df2.groupBy('Date', 'store_id', 'warehouse_id', 'cloth_id', 'class_id') \
.agg(
sum(
when(col('type') == 'R', col('quantity')).otherwise(0)
).alias('quantity2')
).orderBy('Date')
df2.show()
+----------+--------+------------+--------+--------+---------+
| Date|store_id|warehouse_id|cloth_id|class_id|quantity2|
+----------+--------+------------+--------+--------+---------+
|2020-08-01| 110| 1| M_1| 11010| 7|
|2020-08-02| 110| 1| M_1| 11010| 0|
|2020-08-03| 110| 1| M_1| 11010| 1|
|2020-08-04| 110| 1| M_1| 11010| 3|
|2020-08-05| 111| 1| M_2| 11010| 5|
+----------+--------+------------+--------+--------+---------+
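The effect of `otherwise(0)` is easy to verify in plain Python: rows whose type is not 'R' still create a group, but contribute nothing to the sum. The sketch keys only on (Date, cloth_id) for brevity, which is an assumption of the sketch; the names `df2_rows` and `quantity2` are illustrative.

```python
from collections import defaultdict

# (Date, cloth_id, type, quantity)
df2_rows = [
    ("2020-08-01", "M_1", "R", 5),
    ("2020-08-01", "M_1", "R", 2),
    ("2020-08-02", "M_1", "C", 3),
    ("2020-08-03", "M_1", "R", 1),
]

quantity2 = defaultdict(int)
for d, cloth, typ, qty in df2_rows:
    # Conditional sum: count the quantity only when type is 'R', else add 0.
    quantity2[(d, cloth)] += qty if typ == "R" else 0

print(dict(quantity2))
```

This is why 2020-08-02 shows 0 here, whereas a version that falls back to the quantity itself for other types would show 3.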
and finally the join result.
df3 = df1.filter('total_time != -1') \
.join(df2, on=['Date', 'store_id', 'warehouse_id', 'class_id'], how='left') \
.drop('Date', 'total_time', 'cloth_id') \
.withColumnRenamed('arrival_date', 'Date')
df4 = df1.drop('transit_date') \
.join(df3, on=['Date', 'store_id', 'warehouse_id', 'class_id'], how='left') \
.groupBy('Date', 'store_id', 'warehouse_id', 'class_id', 'arrival_date', 'transit_date') \
.agg(sum('quantity2').alias('want')) \
.orderBy('Date')
df4.show()
+----------+--------+------------+--------+------------+------------+----+
| Date|store_id|warehouse_id|class_id|arrival_date|transit_date|want|
+----------+--------+------------+--------+------------+------------+----+
|2020-08-01| 110| 1| 11010| 2020-08-04| null|null|
|2020-08-02| 110| 1| 11010| 2020-08-04| null|null|
|2020-08-03| 110| 1| 11010| 2020-08-06| null|null|
|2020-08-04| 110| 1| 11010| 2020-08-07| 2020-08-01| 7|
|2020-08-05| 111| 1| 11010| 2020-08-06| null|null|
|2020-08-06| 111| 1| 11010| | 2020-08-05| 5|
+----------+--------+------------+--------+------------+------------+----+

Window Function Tie breaker on other field to get the Latest Record

I have following data, where the data is partitioned by the stores and month id and ordered by amount in order to get the primary vendor for the store.
I need a tie breaker if the amount is equal between two vendors,
then if one of the tied vendor was the previous months most sales vendor, make that vendor as the most sales vendor for the month.
The look back will increase if there is a tie again. Lag of 1 month will not work if there is tie again. Worst case scenario we will have more duplicates in previous month also.
sample data
val data = Seq((201801, 10941, 115, 80890.44900, 135799.66400),
(201801, 10941, 3, 80890.44900, 135799.66400) ,
(201712, 10941, 3, 517440.74500, 975893.79000),
(201712, 10941, 115, 517440.74500, 975893.79000),
(201711, 10941, 3 , 371501.92100, 574223.52300),
(201710, 10941, 115, 552435.57800, 746912.06700),
(201709, 10941, 115,1523492.60700,1871480.06800),
(201708, 10941, 115,1027698.93600,1236544.50900),
(201707, 10941, 33 ,1469219.86900,1622949.53000)
).toDF("MTH_ID", "store_id" ,"brand" ,"brndSales","TotalSales")
Code:
val window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales")
val res = data.withColumn("rank",rank over window)
Output:
+------+--------+-----+-----------+-----------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|rank|
+------+--------+-----+-----------+-----------+----+
|201801| 10941| 115| 80890.449| 135799.664| 1|
|201801| 10941| 3| 80890.449| 135799.664| 1|
|201712| 10941| 3| 517440.745| 975893.79| 1|
|201712| 10941| 115| 517440.745| 975893.79| 1|
|201711| 10941| 115| 371501.921| 574223.523| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1|
|201708| 10941| 115|1027698.936|1236544.509| 1|
|201707| 10941| 33|1469219.869| 1622949.53| 1|
+------+--------+-----+-----------+-----------+----+
The rank is 1 for both the first and second records, but only the second record should have rank 1, based on the previous month's max dollars.
I am expecting the following output.
+------+--------+-----+-----------+-----------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|rank|
+------+--------+-----+-----------+-----------+----+
|201801| 10941| 115| 80890.449| 135799.664| 2|
|201801| 10941| 3| 80890.449| 135799.664| 1|
|201712| 10941| 3| 517440.745| 975893.79| 1|
|201712| 10941| 115| 517440.745| 975893.79| 1|
|201711| 10941| 3| 371501.921| 574223.523| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1|
|201708| 10941| 115|1027698.936|1236544.509| 1|
|201707| 10941| 33|1469219.869| 1622949.53| 1|
+------+--------+-----+-----------+-----------+----+
Should I write a UDAF? Any suggestions would help.
You can do this with two windows. First, use the lag() function to carry over the previous month's sales values so that you can use them in your rank window. Here's that part in PySpark:
lag_window = Window.partitionBy("store_id", "brand").orderBy("MTH_ID")
lag_df = data.withColumn("last_month_sales", lag("brndSales").over(lag_window))
Then edit your window to include that new column:
window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales", "last_month_sales")
lag_df.withColumn("rank",rank().over(window)).show()
+------+--------+-----+-----------+-----------+----------------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|last_month_sales|rank|
+------+--------+-----+-----------+-----------+----------------+----+
|201711| 10941| 99| 371501.921| 574223.523| null| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1027698.936| 1|
|201707| 10941| 33|1469219.869| 1622949.53| null| 1|
|201708| 10941| 115|1027698.936|1236544.509| null| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1523492.607| 1|
|201712| 10941| 3| 517440.745| 975893.79| null| 1|
|201801| 10941| 3| 80890.449| 135799.664| 517440.745| 1|
|201801| 10941| 115| 80890.449| 135799.664| 552435.578| 2|
+------+--------+-----+-----------+-----------+----------------+----+
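The two-window idea can be traced without Spark using a plain-Python sketch. It uses the single-store data shown in the output above; the helper `sort_key` and the dictionaries `lag` and `rank` are names made up for the sketch, and placing None first is meant to resemble Spark's default null ordering for ascending sorts.

```python
from collections import defaultdict

# (MTH_ID, brand, brndSales) for store 10941
rows = [
    (201801, 115, 80890.449), (201801, 3, 80890.449),
    (201712, 3, 517440.745), (201711, 99, 371501.921),
    (201710, 115, 552435.578), (201709, 115, 1523492.607),
    (201708, 115, 1027698.936), (201707, 33, 1469219.869),
]

# First window: previous brndSales for the same brand, ordered by MTH_ID.
by_brand = defaultdict(list)
for mth, brand, sales in rows:
    by_brand[brand].append((mth, sales))
lag = {}
for brand, hist in by_brand.items():
    prev = None
    for mth, sales in sorted(hist):
        lag[(mth, brand)] = prev
        prev = sales

def sort_key(row):
    mth, brand, sales = row
    last = lag[(mth, brand)]
    # Missing lag values sort before present ones.
    return (sales, last is not None, last if last is not None else 0.0)

# Second window: rank within each month by (brndSales, last_month_sales).
by_month = defaultdict(list)
for row in rows:
    by_month[row[0]].append(row)
rank = {}
for mth, group in by_month.items():
    keys = [sort_key(r) for r in group]
    for r, k in zip(group, keys):
        rank[(mth, r[1])] = 1 + sum(1 for other in keys if other < k)

print(rank[(201801, 3)], rank[(201801, 115)])
```

One thing the sketch makes visible: the lag is the brand's previous row, not necessarily the previous calendar month. For brand 115 in 201801 the carried-over value comes from 201710, so whether this matches the intended tie-break depends on the data having a row for every month.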
For each row, collect an array of that brand's previous sales, as (Month, Sales) structs.
val storeAndBrandWindow = Window.partitionBy("store_id", "brand").orderBy($"MTH_ID")
val df1 = data.withColumn("brndSales_list", collect_list(struct($"MTH_ID", $"brndSales")).over(storeAndBrandWindow))
Reverse that array with a UDF.
val returnType = ArrayType(StructType(Array(StructField("month", IntegerType), StructField("sales", DoubleType))))
val reverseUdf = udf((list: Seq[Row]) => list.reverse, returnType)
val df2 = df1.withColumn("brndSales_list", reverseUdf($"brndSales_list"))
And then sort by the array.
val window = Window.partitionBy("store_id", "MTH_ID").orderBy($"brndSales_list".desc)
val df3 = df2.withColumn("rank", rank over window).orderBy("MTH_ID", "brand")
df3.show(false)
Result
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
|MTH_ID|store_id|brand|brndSales |TotalSales |brndSales_list |rank|
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
|201707|10941 |33 |1469219.869|1622949.53 |[[201707, 1469219.869]] |1 |
|201708|10941 |115 |1027698.936|1236544.509|[[201708, 1027698.936]] |1 |
|201709|10941 |115 |1523492.607|1871480.068|[[201709, 1523492.607], [201708, 1027698.936]] |1 |
|201710|10941 |115 |552435.578 |746912.067 |[[201710, 552435.578], [201709, 1523492.607], [201708, 1027698.936]] |1 |
|201711|10941 |99 |371501.921 |574223.523 |[[201711, 371501.921]] |1 |
|201712|10941 |3 |517440.745 |975893.79 |[[201712, 517440.745]] |1 |
|201801|10941 |3 |80890.449 |135799.664 |[[201801, 80890.449], [201712, 517440.745]] |1 |
|201801|10941 |115 |80890.449 |135799.664 |[[201801, 80890.449], [201710, 552435.578], [201709, 1523492.607], [201708, 1027698.936]]|2 |
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
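Sorting by the reversed history works because the lists are compared element by element. Python lists of tuples compare the same way, so a small sketch can show which element actually decides the tie for 201801. The variable names are illustrative, and the assumption is that Spark orders arrays of structs lexicographically in the same manner.

```python
# Reversed (month, sales) histories for the two tied brands in 201801.
hist_3 = [(201801, 80890.449), (201712, 517440.745)]
hist_115 = [(201801, 80890.449), (201710, 552435.578),
            (201709, 1523492.607), (201708, 1027698.936)]

# Descending order on the whole list, as in orderBy($"brndSales_list".desc).
ranked = sorted({"3": hist_3, "115": hist_115}.items(),
                key=lambda kv: kv[1], reverse=True)
order = [brand for brand, _ in ranked]
print(order)
```

The first elements are equal, so the second elements decide. Since the month is the first field of each pair, (201712, ...) beats (201710, ...) on the month alone, before sales are even compared. In other words, with this sample the tie goes to the brand with the more recent history entry, which may or may not be the intended rule.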

Grab last different data on Spark Dataframe?

I have this data on Spark Dataframe
+------+-------+-----+------------+----------+---------+
|sernum|product|state|testDateTime|testResult| msg|
+------+-------+-----+------------+----------+---------+
| 8| PA1| 1.0| 1.18| pass|testlog18|
| 7| PA1| 1.0| 1.17| fail|testlog17|
| 6| PA1| 1.0| 1.16| pass|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|
| 4| PA1| 2.0| 1.14| fail|testlog14|
| 3| PA1| 1.0| 1.13| pass|testlog13|
| 2| PA1| 2.0| 1.12| pass|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11|
+------+-------+-----+------------+----------+---------+
What I care about is the rows where testResult == "fail", and the hard part is that I need to get the last "pass" message as an extra column, grouped by product + state:
+------+-------+-----+------------+----------+---------+---------+
|sernum|product|state|testDateTime|testResult| msg| passMsg|
+------+-------+-----+------------+----------+---------+---------+
| 7| PA1| 1.0| 1.17| fail|testlog17|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|testlog13|
| 4| PA1| 2.0| 1.14| fail|testlog14|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11| null|
+------+-------+-----+------------+----------+---------+---------+
How can I achieve this using DataFrame or SQL?
The trick is to define groups where each group starts with a passed test. Then use window functions again, with group as an additional partition column:
val df = Seq(
(8, "PA1", 1.0, 1.18, "pass", "testlog18"),
(7, "PA1", 1.0, 1.17, "fail", "testlog17"),
(6, "PA1", 1.0, 1.16, "pass", "testlog16"),
(5, "PA1", 1.0, 1.15, "fail", "testlog15"),
(4, "PA1", 2.0, 1.14, "fail", "testlog14"),
(3, "PA1", 1.0, 1.13, "pass", "testlog13"),
(2, "PA1", 2.0, 1.12, "pass", "testlog12"),
(1, "PA1", 1.0, 1.11, "fail", "testlog11")
).toDF("sernum", "product", "state", "testDateTime", "testResult", "msg")
df
.withColumn("group", sum(when($"testResult" === "pass", 1)).over(Window.partitionBy($"product", $"state").orderBy($"testDateTime")))
.withColumn("passMsg", when($"group".isNotNull,first($"msg").over(Window.partitionBy($"product", $"state", $"group").orderBy($"testDateTime"))))
.drop($"group")
.where($"testResult"==="fail")
.orderBy($"product", $"state", $"testDateTime")
.show()
+------+-------+-----+------------+----------+---------+---------+
|sernum|product|state|testDateTime|testResult| msg| passMsg|
+------+-------+-----+------------+----------+---------+---------+
| 7| PA1| 1.0| 1.17| fail|testlog17|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|testlog13|
| 4| PA1| 2.0| 1.14| fail|testlog14|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11| null|
+------+-------+-----+------------+----------+---------+---------+
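The grouping trick is equivalent to walking each (product, state) partition in time order and remembering the most recent "pass" message. A plain-Python sketch of that single-pass view, using the same sample rows, is shown here; `last_pass` and `pass_msg` are names invented for the sketch.

```python
# (sernum, product, state, testDateTime, testResult, msg)
rows = [
    (8, "PA1", 1.0, 1.18, "pass", "testlog18"),
    (7, "PA1", 1.0, 1.17, "fail", "testlog17"),
    (6, "PA1", 1.0, 1.16, "pass", "testlog16"),
    (5, "PA1", 1.0, 1.15, "fail", "testlog15"),
    (4, "PA1", 2.0, 1.14, "fail", "testlog14"),
    (3, "PA1", 1.0, 1.13, "pass", "testlog13"),
    (2, "PA1", 2.0, 1.12, "pass", "testlog12"),
    (1, "PA1", 1.0, 1.11, "fail", "testlog11"),
]

last_pass = {}  # (product, state) -> msg of the latest pass seen so far
pass_msg = {}   # sernum of a failed test -> msg of the preceding pass, if any
for sernum, product, state, ts, result, msg in sorted(rows, key=lambda r: r[3]):
    key = (product, state)
    if result == "pass":
        last_pass[key] = msg  # a pass starts a new "group"
    else:
        pass_msg[sernum] = last_pass.get(key)  # None when no pass came before

print(pass_msg)
```

The running count of passes in the Spark version plays the role of `last_pass` here: every row after a given pass, up to the next pass, shares that pass's message.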
This is an alternate approach, by joining the passed logs with failed ones for previous times, and taking the latest "pass" message log.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy($"msg").orderBy($"p_testDateTime".desc)
val fDf = df.filter($"testResult" === "fail")
var pDf = df.filter($"testResult" === "pass")
pDf.columns.foreach(x => pDf = pDf.withColumnRenamed(x, "p_"+x))
val jDf = fDf.join(
pDf,
pDf("p_product") === fDf("product") &&
pDf("p_state") === fDf("state") &&
fDf("testDateTime") > pDf("p_testDateTime") ,
"left").
select(fDf("*"),
pDf("p_testResult"),
pDf("p_testDateTime"),
pDf("p_msg")
)
jDf.withColumn(
"rnk",
row_number().
over(window)
).
filter($"rnk" === 1).
drop("rnk","p_testResult","p_testDateTime").
show()
+---------+-------+------+-----+------------+----------+---------+
| msg|product|sernum|state|testDateTime|testResult| p_msg|
+---------+-------+------+-----+------------+----------+---------+
|testlog14| PA1| 4| 2| 1.14| fail|testlog12|
|testlog11| PA1| 1| 1| 1.11| fail| null|
|testlog15| PA1| 5| 1| 1.15| fail|testlog13|
|testlog17| PA1| 7| 1| 1.17| fail|testlog16|
+---------+-------+------+-----+------------+----------+---------+
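The join-based approach amounts to: for each failed row, collect the earlier passes with the same product and state, and keep the latest one. A plain-Python sketch of that logic follows; the helper `latest_earlier_pass` is a made-up name, and the nested scan is only meant to show the semantics, not to be efficient.

```python
# (sernum, product, state, testDateTime, testResult, msg)
rows = [
    (8, "PA1", 1.0, 1.18, "pass", "testlog18"),
    (7, "PA1", 1.0, 1.17, "fail", "testlog17"),
    (6, "PA1", 1.0, 1.16, "pass", "testlog16"),
    (5, "PA1", 1.0, 1.15, "fail", "testlog15"),
    (4, "PA1", 2.0, 1.14, "fail", "testlog14"),
    (3, "PA1", 1.0, 1.13, "pass", "testlog13"),
    (2, "PA1", 2.0, 1.12, "pass", "testlog12"),
    (1, "PA1", 1.0, 1.11, "fail", "testlog11"),
]
fails = [r for r in rows if r[4] == "fail"]
passes = [r for r in rows if r[4] == "pass"]

def latest_earlier_pass(fail_row):
    """Left join on product/state with an earlier-time condition, then keep rank 1."""
    _, product, state, ts, _, _ = fail_row
    candidates = [p for p in passes
                  if p[1] == product and p[2] == state and p[3] < ts]
    if not candidates:
        return None  # the left join keeps the failed row with a null p_msg
    return max(candidates, key=lambda p: p[3])[5]

p_msg = {f[5]: latest_earlier_pass(f) for f in fails}
print(p_msg)
```

Keying the result by msg mirrors the window partition on msg in the Scala code, which implicitly assumes that msg values are unique per failed test.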

How do I pass parameters to selectExpr? SparkSQL-Scala

When you have a data frame, you can add columns and fill their rows with the selectExpr method.
Something like this:
scala> table.show
+------+--------+---------+--------+--------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|
+------+--------+---------+--------+--------+
| OlcM| h|999999999| J| 0|
| zOcQ| r|777777777| J| 1|
| kyGp| t|333333333| J| 2|
| BEuX| A|999999999| F| 3|
scala> var table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", "'hola' as Saludo")
tabla: org.apache.spark.sql.DataFrame = [idempr: string, tipperrd: string, codperrd: decimal(9,0), tipperrt: string, codperrt: decimal(9,0), Saludo: string]
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hola|
| zOcQ| r|777777777| J| 1| hola|
| kyGp| t|333333333| J| 2| hola|
| BEuX| A|999999999| F| 3| hola|
My point is:
I define strings and call a method which uses this String parameter to fill a column in the data frame. But I am not able to get the select expression to pick up the string (I tried $, +, etc.). I want to achieve something like this:
scala> var english = "hello"
scala> def generar_informe(df: DataFrame, tabla: String) {
var selectExpr_df = df.selectExpr(
"TIPPERSCON_BAS as TIP.PERSONA CONTACTABILIDAD",
"CODPERSCON_BAS as COD.PERSONA CONTACTABILIDAD",
"'tabla' as PUNTO DEL FLUJO" )
}
scala> generar_informe(df,english)
.....
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hello|
| zOcQ| r|777777777| J| 1| hello|
| kyGp| t|333333333| J| 2| hello|
| BEuX| A|999999999| F| 3| hello|
I tried:
scala> var result = tabl.selectExpr("A", "B", "$tabla as C")
scala> var abc = tabl.selectExpr("A", "B", ${tabla} as C)
<console>:31: error: not found: value $
var abc = tabl.selectExpr("A", "B", ${tabla} as C)
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
scala> sqlContext.sql("set tabla='hello'")
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
SAME ERROR:
java.lang.RuntimeException: [1.1] failure: identifier expected
${tabla} as C
^
at scala.sys.package$.error(package.scala:27)
Thanks in advance!
Can you try this.
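// Define the helper first. The s-interpolator substitutes the value of `english`
// into the SQL expression, and the single quotes make it a string literal.
def generar_informe(df: DataFrame, english: String): DataFrame = {
  df.selectExpr(
    "transactionId", "customerId", "itemId", "amountPaid", s"""'${english}' as saludo""")
}
val english = "hello"
generar_informe(data, english).show()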
This is the output I got.
17/11/02 23:56:44 INFO CodeGenerator: Code generated in 13.857987 ms
+-------------+----------+------+----------+------+
|transactionId|customerId|itemId|amountPaid|saludo|
+-------------+----------+------+----------+------+
| 111| 1| 1| 100.0| hello|
| 112| 2| 2| 505.0| hello|
| 113| 3| 3| 510.0| hello|
| 114| 4| 4| 600.0| hello|
| 115| 1| 2| 500.0| hello|
| 116| 1| 2| 500.0| hello|
| 117| 1| 2| 500.0| hello|
| 118| 1| 2| 500.0| hello|
| 119| 2| 3| 500.0| hello|
| 120| 1| 2| 500.0| hello|
| 121| 1| 4| 500.0| hello|
| 122| 1| 2| 500.0| hello|
| 123| 1| 4| 500.0| hello|
| 124| 1| 2| 500.0| hello|
+-------------+----------+------+----------+------+
17/11/02 23:56:44 INFO SparkContext: Invoking stop() from shutdown hook
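The key point is that the expression string has to be built by the host language before Spark sees it. The same idea in Python uses an f-string; this sketch only builds the list of expression strings and does not call Spark, and the helper name `literal_as` is hypothetical.

```python
def literal_as(value: str, alias: str) -> str:
    """Build a SQL expression that selects a quoted string literal under an alias."""
    # Assumption: `value` contains no single quotes; otherwise it would need escaping.
    return f"'{value}' as {alias}"


english = "hello"
exprs = ["transactionId", "customerId", literal_as(english, "saludo")]
print(exprs)

# Without interpolation, the variable name itself ends up in the expression:
not_interpolated = "'english' as saludo"
print(not_interpolated)
```

The resulting list could then be passed to selectExpr. The second string shows the failure mode from the question, where the literal text of the variable name is selected instead of its value.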