Compare different rows in a dataframe containing the same id

I have a Spark dataframe as follows:
spark.sql("""
SELECT * from all_trans
""").show()
+---------------+-------------+-----------+------------------+-----------------+-----------+----------+
| transaction_id|card_event_id|card_pos_id|card_point_country|transaction_label|module_name| post_date|
+---------------+-------------+-----------+------------------+-----------------+-----------+----------+
|0P2117292543428| 2502723025| null| CZ| EDU| mcc|2022-02-10|
| 0P211729824944| 2502723477| null| CZ| EDU| mcc|2022-02-10|
| 0P31172950208| 2502723587| null| CZ| EDU| mcc|2022-02-10|
|0P2117294027213| 2502726454| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117294027213| 2502726454| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117293581427| 2502729360| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117293581427| 2502729360| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117292967336| 2502729724| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117292967336| 2502729724| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117292659416| 2502730642| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117292659416| 2502730642| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117293159304| 2502731764| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117293159304| 2502731764| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117293237687| 2502732381| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117293237687| 2502732381| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117293548610| 2502733071| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117293548610| 2502733071| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117293678239| 2502736684| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
|0P2117293678239| 2502736684| E3KB2938| CZ| FUN04| mcc|2022-02-10|
|0P2117293840924| 2502737447| E3KB2938| CZ| FUN0402007| regex|2022-02-10|
+---------------+-------------+-----------+------------------+-----------------+-----------+----------+
One transaction_id can have more than one transaction_label.
I want to go automatically through all transaction_label values in the dataframe for each transaction_id and compare whether they match on some level.
I had in mind pseudo-logic like:
df.foreach(lambda x:
    (transaction_id.transaction_label where module_name == 'mcc') == (left(transaction_id.transaction_label, 5) where module_name == 'regex')
)
But I don't know how to compare rows with the same transaction_id in Spark.
Converting to pandas fails due to limited driver memory.

You could do a self-join. For this, it is best practice to give the dataframes aliases. Then you can create a column for the check:
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('0P2117292543428', 'EDU', 'mcc'),
( '0P211729824944', 'EDU', 'mcc'),
( '0P31172950208', 'EDU', 'mcc'),
('0P2117294027213', 'FUN0402007', 'regex'),
('0P2117294027213', 'FUN04', 'mcc'),
('0P2117293581427', 'FUN0402007', 'regex'),
('0P2117293581427', 'FUN04', 'mcc'),
('0P2117292967336', 'FUN0402007', 'regex'),
('0P2117292967336', 'FUN04', 'mcc'),
('0P2117292659416', 'FUN0402007', 'regex'),
('0P2117292659416', 'FUN04', 'mcc'),
('0P2117293159304', 'FUN0402007', 'regex'),
('0P2117293159304', 'FUN04', 'mcc'),
('0P2117293237687', 'FUN0402007', 'regex'),
('0P2117293237687', 'FUN04', 'mcc'),
('0P2117293548610', 'FUN0402007', 'regex'),
('0P2117293548610', 'FUN04', 'mcc'),
('0P2117293678239', 'FUN0402007', 'regex'),
('0P2117293678239', 'FUN04', 'mcc'),
('0P2117293840924', 'FUN0402007', 'regex')],
['transaction_id', 'transaction_label', 'module_name'])
Script:
df = (df.filter("module_name = 'mcc'").alias('m')
.join(df.filter("module_name = 'regex'").alias('r'), 'transaction_id')
.withColumn('check', F.col('m.transaction_label') == F.substring('r.transaction_label', 1, 5))
)
df.show()
# +---------------+-----------------+-----------+-----------------+-----------+-----+
# | transaction_id|transaction_label|module_name|transaction_label|module_name|check|
# +---------------+-----------------+-----------+-----------------+-----------+-----+
# |0P2117292659416| FUN04| mcc| FUN0402007| regex| true|
# |0P2117292967336| FUN04| mcc| FUN0402007| regex| true|
# |0P2117293159304| FUN04| mcc| FUN0402007| regex| true|
# |0P2117293237687| FUN04| mcc| FUN0402007| regex| true|
# |0P2117293548610| FUN04| mcc| FUN0402007| regex| true|
# |0P2117293581427| FUN04| mcc| FUN0402007| regex| true|
# |0P2117293678239| FUN04| mcc| FUN0402007| regex| true|
# |0P2117294027213| FUN04| mcc| FUN0402007| regex| true|
# +---------------+-----------------+-----------+-----------------+-----------+-----+
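Note that after the self-join the result carries two transaction_label and two module_name columns with identical names. If that is inconvenient downstream, a minimal sketch (assuming the df and the aliases from above) is to select and rename the relevant columns before the check:
result = (df.filter("module_name = 'mcc'").alias('m')
    .join(df.filter("module_name = 'regex'").alias('r'), 'transaction_id')
    .select(
        'transaction_id',
        F.col('m.transaction_label').alias('mcc_label'),
        F.col('r.transaction_label').alias('regex_label'),
    )
    # same check as above, but on unambiguous column names
    .withColumn('check', F.col('mcc_label') == F.substring('regex_label', 1, 5))
)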

PySpark: Column Is Not Iterable

I have a Spark dataframe as follows:
from pyspark.sql import SparkSession, functions as F
df = spark.sql("SELECT transaction_id, transaction_label, module_name, length(transaction_label) as length FROM all_trans")
df.show()
+---------------+-----------------+-----------+------+
| transaction_id|transaction_label|module_name|length|
+---------------+-----------------+-----------+------+
|0P2117292543428| EDU| mcc| 3|
| 0P211729824944| EDU| mcc| 3|
| 0P31172950208| EDU| mcc| 3|
|0P2117294027213| FUN0402007| regex| 10|
|0P2117294027213| FUN04| mcc| 5|
|0P2117293581427| FUN0402007| regex| 10|
|0P2117293581427| FUN04| mcc| 5|
|0P2117292967336| FUN0402007| regex| 10|
|0P2117292967336| FUN04| mcc| 5|
|0P2117292659416| FUN0402007| regex| 10|
|0P2117292659416| FUN04| mcc| 5|
|0P2117293159304| FUN0402007| regex| 10|
|0P2117293159304| FUN04| mcc| 5|
|0P2117293237687| FUN0402007| regex| 10|
|0P2117293237687| FUN04| mcc| 5|
|0P2117293548610| FUN0402007| regex| 10|
|0P2117293548610| FUN04| mcc| 5|
|0P2117293678239| FUN0402007| regex| 10|
|0P2117293678239| FUN04| mcc| 5|
|0P2117293840924| FUN0402007| regex| 10|
+---------------+-----------------+-----------+------+
I want to compare the transaction_label values of the same transaction_id across different module_name values.
I tried:
df = (df.filter("module_name = 'mcc'").alias('m')
.join(df.filter("module_name = 'regex'").alias('r'), 'transaction_id')
.withColumn('check', F.col('m.transaction_label') == F.substring('r.transaction_label', 1, F.col('m.length')))
)
df.show()
which yielded:
TypeError: Column is not iterable
The third argument of F.substring expects an integer literal, but you provided a Column. Switch to a SQL expression (F.expr) for the substring; the SQL function can handle column arguments:
df = (df.filter("module_name = 'mcc'").alias('m')
.join(df.filter("module_name = 'regex'").alias('r'), 'transaction_id')
.withColumn('check', F.col('m.transaction_label') == F.expr("substring(r.transaction_label, 1, m.length)"))
)
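As an untested alternative sketch that stays in the DataFrame API, Column.substr accepts Column arguments for both the start position and the length:
df = (df.filter("module_name = 'mcc'").alias('m')
    .join(df.filter("module_name = 'regex'").alias('r'), 'transaction_id')
    # Column.substr takes Columns for both startPos and length
    .withColumn('check', F.col('m.transaction_label')
                == F.col('r.transaction_label').substr(F.lit(1), F.col('m.length')))
)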

Select not-null values from multiple columns in PySpark

Below is my dataframe
+---------+-----+--------+---------+-------+
| NAME|Actor| Doctor|Professor|Singer |
+---------+-----+--------+---------+-------+
| Samantha| null|Samantha| null| null|
|Christeen| null| null|Christeen| null|
| Meera| null| null| null| Meera|
| Julia|Julia| null| null| null|
| Priya| null| null| null| Priya|
| Ashley| null| null| Ashley| null|
| Jenny| null| Jenny| null| null|
| Maria|Maria| null| null| null|
| Jane| Jane| null| null| null|
| Ketty| null| null| Ketty| null|
+---------+-----+--------+---------+-------+
I want to select all not-null values from the Actor, Doctor, Professor and Singer columns.
This can be achieved with isNotNull: build a condition (condn) from your desired rules and then filter. You can modify condn further depending on your requirements.
Preparing Data
input_str = """
Samantha|None|Samantha| None| None|
Christeen| None| None|Christeen| None|
Meera| None| None| None| Meera|
Julia|Julia| None| None| None|
Priya| None| None| None| Priya|
Ashley| None| None| Ashley| None|
Jenny| None| Jenny| None| None|
Maria|Maria| None| None| None|
Jane| Jane| None| None| None|
Ketty| None| None| Ketty| None|
Aditya| None| None| None| None|
""".split("|")
input_values = list(map(lambda x:x.strip(),input_str))[:-1]
i = 0
n = len(input_values)
input_list = []
while i < n:
input_list += [ tuple(input_values[i:i+5]) ]
i += 5
input_list
[('Samantha', 'None', 'Samantha', 'None', 'None'),
('Christeen', 'None', 'None', 'Christeen', 'None'),
('Meera', 'None', 'None', 'None', 'Meera'),
('Julia', 'Julia', 'None', 'None', 'None'),
('Priya', 'None', 'None', 'None', 'Priya'),
('Ashley', 'None', 'None', 'Ashley', 'None'),
('Jenny', 'None', 'Jenny', 'None', 'None'),
('Maria', 'Maria', 'None', 'None', 'None'),
('Jane', 'Jane', 'None', 'None', 'None'),
('Ketty', 'None', 'None', 'Ketty', 'None'),
('Aditya', 'Aditya', 'Aditya', 'Aditya', 'Aditya')]
Converting None to Null
from functools import reduce
from pyspark.sql import functions as F

def blank_as_null(x):
    return F.when(F.col(x) == "None", None).otherwise(F.col(x))

sparkDF = spark.createDataFrame(input_list, ['Name', 'Actor', 'Doctor', 'Professor', 'Singer'])
to_convert = set(['Actor', 'Doctor', 'Professor', 'Singer'])
# apply blank_as_null to every column that needs the conversion
sparkDF = reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, sparkDF)
sparkDF.show()
+---------+------+--------+---------+------+
| Name| Actor| Doctor|Professor|Singer|
+---------+------+--------+---------+------+
| Samantha| null|Samantha| null| null|
|Christeen| null| null|Christeen| null|
| Meera| null| null| null| Meera|
| Julia| Julia| null| null| null|
| Priya| null| null| null| Priya|
| Ashley| null| null| Ashley| null|
| Jenny| null| Jenny| null| null|
| Maria| Maria| null| null| null|
| Jane| Jane| null| null| null|
| Ketty| null| null| Ketty| null|
| Aditya|Aditya| Aditya| Aditya|Aditya|
+---------+------+--------+---------+------+
Filtering Not Null Values
condn = (
(F.col('Actor').isNotNull())
& (F.col('Doctor').isNotNull())
& (F.col('Professor').isNotNull())
& (F.col('Singer').isNotNull())
)
sparkDF.filter(condn).show()
+------+------+------+---------+------+
| Name| Actor|Doctor|Professor|Singer|
+------+------+------+---------+------+
|Aditya|Aditya|Aditya| Aditya|Aditya|
+------+------+------+---------+------+
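If the set of columns may change, the same condition can also be built programmatically; a minimal sketch, assuming sparkDF and the column names from above:
from functools import reduce
from pyspark.sql import functions as F

cols = ['Actor', 'Doctor', 'Professor', 'Singer']
# AND together an isNotNull() check for every column in the list
condn = reduce(lambda acc, c: acc & F.col(c).isNotNull(),
               cols[1:], F.col(cols[0]).isNotNull())
sparkDF.filter(condn).show()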

Pyspark: how to solve complicated dataframe logic plus join

I have two dataframes to work on. The first one, df1, looks like the following:
from pyspark.sql import functions as F
from pyspark.sql.functions import col, expr, to_date
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

df1_schema = StructType([StructField("Date", StringType(), True),\
StructField("store_id", StringType(), True),\
StructField("warehouse_id", StringType(), True),\
StructField("class_id", StringType(), True) ,\
StructField("total_time", IntegerType(), True) ])
df_data = [('2020-08-01','110','1','11010',3),('2020-08-02','110','1','11010',2),\
('2020-08-03','110','1','11010',3),('2020-08-04','110','1','11010',3),\
('2020-08-05','111','1','11010',1),('2020-08-06','111','1','11010',-1)]
rdd = sc.parallelize(df_data)
df1 = sqlContext.createDataFrame(df_data, df1_schema)
df1 = df1.withColumn("Date",to_date("Date", 'yyyy-MM-dd'))
df1.show()
+----------+--------+------------+--------+----------+
| Date|store_id|warehouse_id|class_id|total_time|
+----------+--------+------------+--------+----------+
|2020-08-01| 110| 1| 11010| 3|
|2020-08-02| 110| 1| 11010| 2|
|2020-08-03| 110| 1| 11010| 3|
|2020-08-04| 110| 1| 11010| 3|
|2020-08-05| 111| 1| 11010| 1|
|2020-08-06| 111| 1| 11010| -1|
+----------+--------+------------+--------+----------+
I calculated something called arrival_date
#To calculate the arrival_date
#logic : add the Date + total_time so in first row, 2020-08-01 +3 would give me 2020-08-04
#if total_time is -1 then return blank
df1= df1.withColumn('arrival_date', F.when(col('total_time') != -1, expr("date_add(date, total_time)"))
.otherwise(''))
+----------+--------+------------+--------+----------+------------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|
+----------+--------+------------+--------+----------+------------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06|
|2020-08-06| 111| 1| 11010| -1| |
+----------+--------+------------+--------+----------+------------+
and what I want to calculate is this..
#to calculate the transit_date:
#if the same arrival_date appears 2 or more times (e.g. 2020-08-04), take min("Date"),
#which here is 2020-08-01; otherwise just return the Date (e.g. for arrival_date 2020-08-07 return 2020-08-04).
#rows are treated separately when one of (store_id, warehouse_id) differs: arrival_date = 2020-08-06
#also appears twice, but for the row with Date = 2020-08-03 we must return 2020-08-03.
#*Note* we don't care about class_id, it has no effect here.
#if arrival_date is blank then leave transit_date blank as well.
#so our df would look something like this:
+----------+--------+------------+--------+----------+------------+------------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|transit_date|
+----------+--------+------------+--------+----------+------------+------------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04| 2020-08-01|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04| 2020-08-01|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06| 2020-08-03|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07| 2020-08-04|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06| 2020-08-05|
|2020-08-06| 111| 1| 11010| -1| | |
+----------+--------+------------+--------+----------+------------+------------+
Next, I have df2 looks like the following..
#we have another dataframe call it df2
df2_schema = StructType([StructField("Date", StringType(), True),\
StructField("store_id", StringType(), True),\
StructField("warehouse_id", StringType(), True),\
StructField("cloth_id", StringType(), True),\
StructField("class_id", StringType(), True) ,\
StructField("type", StringType(), True),\
StructField("quantity", IntegerType(), True)])
df_data = [('2020-08-01','110','1','M_1','11010','R',5),('2020-08-01','110','1','M_1','11010','R',2),\
('2020-08-02','110','1','M_1','11010','C',3),('2020-08-03','110','1','M_1','11010','R',1),\
('2020-08-04','110','1','M_1','11010','R',3),('2020-08-05','111','1','M_2','11010','R',5)]
rdd = sc.parallelize(df_data)
df2 = sqlContext.createDataFrame(df_data, df2_schema)
df2 = df2.withColumn("Date",to_date("Date", 'yyyy-MM-dd'))
df2.show()
+----------+--------+------------+--------+--------+----+--------+
| Date|store_id|warehouse_id|cloth_id|class_id|type|quantity|
+----------+--------+------------+--------+--------+----+--------+
|2020-08-01| 110| 1| M_1| 11010| R| 5|
|2020-08-01| 110| 1| M_1| 11010| R| 2|
|2020-08-02| 110| 1| M_1| 11010| C| 3|
|2020-08-03| 110| 1| M_1| 11010| R| 1|
|2020-08-04| 110| 1| M_1| 11010| R| 3|
|2020-08-05| 111| 1| M_2| 11010| R| 5|
+----------+--------+------------+--------+--------+----+--------+
and I calculated quantity2, which is just the sum of quantity where type == 'R':
df2 =df2.groupBy('Date','store_id','warehouse_id','cloth_id','class_id')\
.agg( F.sum(F.when(col('type')=='R', col('quantity'))\
.otherwise(col('quantity'))).alias('quantity2')).orderBy('Date')
+----------+--------+------------+--------+--------+---------+
| Date|store_id|warehouse_id|cloth_id|class_id|quantity2|
+----------+--------+------------+--------+--------+---------+
|2020-08-01| 110| 1| M_1| 11010| 7|
|2020-08-02| 110| 1| M_1| 11010| 3|
|2020-08-03| 110| 1| M_1| 11010| 1|
|2020-08-04| 110| 1| M_1| 11010| 3|
|2020-08-05| 111| 1| M_2| 11010| 5|
+----------+--------+------------+--------+--------+---------+
Now I have df1 and df2. I want to join them so that the result looks something like the table further below.
I tried something like this:
df4 = df1.select('store_id','warehouse_id','class_id','arrival_date','transit_date')
df4= df4.filter(" transit_date != '' ")
df4=df4.withColumnRenamed('arrival_date', 'date')
df3 = df2.join(df1, on=['Date','store_id','warehouse_id','class_id'],how='inner').orderBy('Date')
df5 = df3.join(df4, on=['Date','store_id','warehouse_id','class_id'], how='left').orderBy('Date')
but I don't think this is the correct approach. The result df should look like below:
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
| Date|store_id|warehouse_id|class_id|cloth_id|quantity2|total_time|arrival_date|transit_date|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
|2020-08-01| 110| 1| 11010| M_1| 7| 3| 2020-08-04| null|
|2020-08-02| 110| 1| 11010| M_1| 3| 2| 2020-08-04| null|
|2020-08-03| 110| 1| 11010| M_1| 1| 3| 2020-08-06| null|
|2020-08-04| 110| 1| 11010| M_1| 3| 3| 2020-08-07| 2020-08-01|
|2020-08-05| 111| 1| 11010| M_2| 5| 1| 2020-08-06| null|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
Note that the transit_date moved to the rows where Date equals the arrival_date; of course the null is replaced by blank.
LASTLY, if today is 2020-08-04, then look at the rows where arrival_date == 2020-08-04, sum up their quantity and place the sum at today's row. Where store_id = 111 there will be separate dates (not shown here), so the logic needs to work for store_id = 111 as well; I've just shown the example where store_id = 110.
From my understanding of your question and of what you already have, with the following df1 and df2:
df1.orderBy('Date').show()
+----------+--------+------------+--------+----------+------------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|
+----------+--------+------------+--------+----------+------------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06|
|2020-08-06| 111| 1| 11010| -1| |
+----------+--------+------------+--------+----------+------------+

df2.orderBy('Date').show()
+----------+--------+------------+--------+--------+---------+
| Date|store_id|warehouse_id|cloth_id|class_id|quantity2|
+----------+--------+------------+--------+--------+---------+
|2020-08-01| 110| 1| M_1| 11010| 7|
|2020-08-02| 110| 1| M_1| 11010| 3|
|2020-08-03| 110| 1| M_1| 11010| 1|
|2020-08-04| 110| 1| M_1| 11010| 3|
|2020-08-05| 111| 1| M_2| 11010| 5|
+----------+--------+------------+--------+--------+---------+
you can try the following 5 steps:
Step-1: Set up the list of column names grp_cols for join:
from pyspark.sql import functions as F
grp_cols = ["Date", "store_id", "warehouse_id", "class_id"]
Step-2: create df3 containing transit_date which is the min Date on each combination of arrival_date, store_id, warehouse_id and class_id:
df3 = df1.filter('total_time != -1') \
.groupby("arrival_date", "store_id", "warehouse_id", "class_id") \
.agg(F.min('Date').alias('transit_date')) \
.withColumnRenamed("arrival_date", "Date")
df3.orderBy('Date').show()
+----------+--------+------------+--------+------------+
| Date|store_id|warehouse_id|class_id|transit_date|
+----------+--------+------------+--------+------------+
|2020-08-04| 110| 1| 11010| 2020-08-01|
|2020-08-06| 111| 1| 11010| 2020-08-05|
|2020-08-06| 110| 1| 11010| 2020-08-03|
|2020-08-07| 110| 1| 11010| 2020-08-04|
+----------+--------+------------+--------+------------+
Step-3: set up df4 by joining df2 with df1 and left-joining df3 on grp_cols, then persist df4:
df4 = df2.join(df1, grp_cols).join(df3, grp_cols, "left") \
.withColumn('transit_date', F.when(F.col('total_time') != -1, F.col("transit_date")).otherwise('')) \
.persist()
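# count() is an action, so it materializes the persisted df4 before it is reused in Step-4 and Step-5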
_ = df4.count()
df4.orderBy('Date').show()
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
| Date|store_id|warehouse_id|class_id|cloth_id|quantity2|total_time|arrival_date|transit_date|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
|2020-08-01| 110| 1| 11010| M_1| 7| 3| 2020-08-04| null|
|2020-08-02| 110| 1| 11010| M_1| 3| 2| 2020-08-04| null|
|2020-08-03| 110| 1| 11010| M_1| 1| 3| 2020-08-06| null|
|2020-08-04| 110| 1| 11010| M_1| 3| 3| 2020-08-07| 2020-08-01|
|2020-08-05| 111| 1| 11010| M_2| 5| 1| 2020-08-06| null|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
Step-4: from df4, calculate sum(quantity2) as want for each combination of arrival_date, store_id, warehouse_id, class_id and cloth_id:
df5 = df4 \
.groupby("arrival_date", "store_id", "warehouse_id", "class_id", "cloth_id") \
.agg(F.sum("quantity2").alias("want")) \
.withColumnRenamed("arrival_date", "Date")
df5.orderBy('Date').show()
+----------+--------+------------+--------+--------+----+
| Date|store_id|warehouse_id|class_id|cloth_id|want|
+----------+--------+------------+--------+--------+----+
|2020-08-04| 110| 1| 11010| M_1| 10|
|2020-08-06| 111| 1| 11010| M_2| 5|
|2020-08-06| 110| 1| 11010| M_1| 1|
|2020-08-07| 110| 1| 11010| M_1| 3|
+----------+--------+------------+--------+--------+----+
Step-5: create the final dataframe by left-joining df4 with df5:
df_new = df4.join(df5, grp_cols+["cloth_id"], "left").fillna(0, subset=['want'])
df_new.orderBy("Date").show()
+----------+--------+------------+--------+--------+---------+----------+------------+------------+----+
| Date|store_id|warehouse_id|class_id|cloth_id|quantity2|total_time|arrival_date|transit_date|want|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+----+
|2020-08-01| 110| 1| 11010| M_1| 7| 3| 2020-08-04| null| 0|
|2020-08-02| 110| 1| 11010| M_1| 3| 2| 2020-08-04| null| 0|
|2020-08-03| 110| 1| 11010| M_1| 1| 3| 2020-08-06| null| 0|
|2020-08-04| 110| 1| 11010| M_1| 3| 3| 2020-08-07| 2020-08-01| 10|
|2020-08-05| 111| 1| 11010| M_2| 5| 1| 2020-08-06| null| 0|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+----+
df4.unpersist()
Here is an alternative approach. First, for df1:
from pyspark.sql import Window
from pyspark.sql.functions import *
from pyspark.sql.types import *
import builtins as p
df1_schema = StructType(
[
StructField('Date', StringType(), True),
StructField('store_id', StringType(), True),
StructField('warehouse_id', StringType(), True),
StructField('class_id', StringType(), True),
StructField('total_time', IntegerType(), True)
]
)
df1_data = [
('2020-08-01','110','1','11010',3),
('2020-08-02','110','1','11010',2),
('2020-08-03','110','1','11010',3),
('2020-08-04','110','1','11010',3),
('2020-08-05','111','1','11010',1),
('2020-08-06','111','1','11010',-1)
]
df1 = spark.createDataFrame(df1_data, df1_schema)
df1 = df1.withColumn('Date', to_date('Date'))
df1 = df1.withColumn('arrival_date', when(col('total_time') != -1, expr("date_add(date, total_time)")).otherwise(''))
w = Window.partitionBy('arrival_date', 'store_id', 'warehouse_id').orderBy('Date')
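# first('Date') over w picks the earliest Date within each (arrival_date, store_id, warehouse_id) group, i.e. the transit_date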
df1 = df1.withColumn('transit_date', when(col('total_time') != -1, first('Date').over(w)).otherwise('')).orderBy('Date')
df1.show()
+----------+--------+------------+--------+----------+------------+------------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|transit_date|
+----------+--------+------------+--------+----------+------------+------------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04| 2020-08-01|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04| 2020-08-01|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06| 2020-08-03|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07| 2020-08-04|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06| 2020-08-05|
|2020-08-06| 111| 1| 11010| -1| | |
+----------+--------+------------+--------+----------+------------+------------+
and df2, similar to yours but summing only the rows with type == 'R':
df2_schema = StructType(
[
StructField('Date', StringType(), True),
StructField('store_id', StringType(), True),
StructField('warehouse_id', StringType(), True),
StructField('cloth_id', StringType(), True),
StructField('class_id', StringType(), True),
StructField('type', StringType(), True),
StructField('quantity', IntegerType(), True)
]
)
df2_data = [
('2020-08-01','110','1','M_1','11010','R',5),
('2020-08-01','110','1','M_1','11010','R',2),
('2020-08-02','110','1','M_1','11010','C',3),
('2020-08-03','110','1','M_1','11010','R',1),
('2020-08-04','110','1','M_1','11010','R',3),
('2020-08-05','111','1','M_2','11010','R',5)
]
df2 = spark.createDataFrame(df2_data, df2_schema)
df2 = df2.withColumn('Date', to_date('Date'))
df2 = df2.groupBy('Date', 'store_id', 'warehouse_id', 'cloth_id', 'class_id') \
.agg(
sum(
when(col('type') == 'R', col('quantity')).otherwise(0)
).alias('quantity2')
).orderBy('Date')
df2.show()
+----------+--------+------------+--------+--------+---------+
| Date|store_id|warehouse_id|cloth_id|class_id|quantity2|
+----------+--------+------------+--------+--------+---------+
|2020-08-01| 110| 1| M_1| 11010| 7|
|2020-08-02| 110| 1| M_1| 11010| 0|
|2020-08-03| 110| 1| M_1| 11010| 1|
|2020-08-04| 110| 1| M_1| 11010| 3|
|2020-08-05| 111| 1| M_2| 11010| 5|
+----------+--------+------------+--------+--------+---------+
and finally the join result.
df3 = df1.filter('total_time != -1') \
.join(df2, on=['Date', 'store_id', 'warehouse_id', 'class_id'], how='left') \
.drop('Date', 'total_time', 'cloth_id') \
.withColumnRenamed('arrival_date', 'Date')
df4 = df1.drop('transit_date') \
.join(df3, on=['Date', 'store_id', 'warehouse_id', 'class_id'], how='left') \
.groupBy('Date', 'store_id', 'warehouse_id', 'class_id', 'arrival_date', 'transit_date') \
.agg(sum('quantity2').alias('want')) \
.orderBy('Date')
df4.show()
+----------+--------+------------+--------+------------+------------+----+
| Date|store_id|warehouse_id|class_id|arrival_date|transit_date|want|
+----------+--------+------------+--------+------------+------------+----+
|2020-08-01| 110| 1| 11010| 2020-08-04| null|null|
|2020-08-02| 110| 1| 11010| 2020-08-04| null|null|
|2020-08-03| 110| 1| 11010| 2020-08-06| null|null|
|2020-08-04| 110| 1| 11010| 2020-08-07| 2020-08-01| 7|
|2020-08-05| 111| 1| 11010| 2020-08-06| null|null|
|2020-08-06| 111| 1| 11010| | 2020-08-05| 5|
+----------+--------+------------+--------+------------+------------+----+

Grab last different data on Spark Dataframe?

I have this data on Spark Dataframe
+------+-------+-----+------------+----------+---------+
|sernum|product|state|testDateTime|testResult| msg|
+------+-------+-----+------------+----------+---------+
| 8| PA1| 1.0| 1.18| pass|testlog18|
| 7| PA1| 1.0| 1.17| fail|testlog17|
| 6| PA1| 1.0| 1.16| pass|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|
| 4| PA1| 2.0| 1.14| fail|testlog14|
| 3| PA1| 1.0| 1.13| pass|testlog13|
| 2| PA1| 2.0| 1.12| pass|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11|
+------+-------+-----+------------+----------+---------+
What I care about are the rows with testResult == "fail", and the hard part is that I need to get the last "pass" message as an extra column, grouped by product + state:
+------+-------+-----+------------+----------+---------+---------+
|sernum|product|state|testDateTime|testResult| msg| passMsg|
+------+-------+-----+------------+----------+---------+---------+
| 7| PA1| 1.0| 1.17| fail|testlog17|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|testlog13|
| 4| PA1| 2.0| 1.14| fail|testlog14|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11| null|
+------+-------+-----+------------+----------+---------+---------+
How can I achieve this using DataFrame or SQL?
The trick is to define groups such that each group starts with a passed test. Then use window functions again, with group as an additional partition column:
val df = Seq(
(8, "PA1", 1.0, 1.18, "pass", "testlog18"),
(7, "PA1", 1.0, 1.17, "fail", "testlog17"),
(6, "PA1", 1.0, 1.16, "pass", "testlog16"),
(5, "PA1", 1.0, 1.15, "fail", "testlog15"),
(4, "PA1", 2.0, 1.14, "fail", "testlog14"),
(3, "PA1", 1.0, 1.13, "pass", "testlog13"),
(2, "PA1", 2.0, 1.12, "pass", "testlog12"),
(1, "PA1", 1.0, 1.11, "fail", "testlog11")
).toDF("sernum", "product", "state", "testDateTime", "testResult", "msg")
df
.withColumn("group", sum(when($"testResult" === "pass", 1)).over(Window.partitionBy($"product", $"state").orderBy($"testDateTime")))
.withColumn("passMsg", when($"group".isNotNull,first($"msg").over(Window.partitionBy($"product", $"state", $"group").orderBy($"testDateTime"))))
.drop($"group")
.where($"testResult"==="fail")
.orderBy($"product", $"state", $"testDateTime")
.show()
+------+-------+-----+------------+----------+---------+---------+
|sernum|product|state|testDateTime|testResult| msg| passMsg|
+------+-------+-----+------------+----------+---------+---------+
| 7| PA1| 1.0| 1.17| fail|testlog17|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|testlog13|
| 4| PA1| 2.0| 1.14| fail|testlog14|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11| null|
+------+-------+-----+------------+----------+---------+---------+
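The same idea translates to PySpark; a sketch, assuming a dataframe df with the columns shown above:
from pyspark.sql import Window, functions as F

w1 = Window.partitionBy('product', 'state').orderBy('testDateTime')
w2 = Window.partitionBy('product', 'state', 'group').orderBy('testDateTime')

result = (df
    # running count of passed tests defines the group each row belongs to
    .withColumn('group', F.sum(F.when(F.col('testResult') == 'pass', 1)).over(w1))
    # the first msg of each group is the pass message that opened it
    .withColumn('passMsg', F.when(F.col('group').isNotNull(), F.first('msg').over(w2)))
    .drop('group')
    .where(F.col('testResult') == 'fail')
    .orderBy('product', 'state', 'testDateTime'))
result.show()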
This is an alternate approach: join the failed logs with the passed logs from earlier times, and take the latest "pass" message log.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
Window.partitionBy($"msg").orderBy($"p_testDateTime".desc)
val fDf = df.filter($"testResult" === "fail")
var pDf = df.filter($"testResult" === "pass")
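// prefix every pass-side column with p_ so the join output has unambiguous column names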
pDf.columns.foreach(x => pDf = pDf.withColumnRenamed(x, "p_"+x))
val jDf = fDf.join(
pDf,
pDf("p_product") === fDf("product") &&
pDf("p_state") === fDf("state") &&
fDf("testDateTime") > pDf("p_testDateTime") ,
"left").
select(fDf("*"),
pDf("p_testResult"),
pDf("p_testDateTime"),
pDf("p_msg")
)
jDf.withColumn(
"rnk",
row_number().
over(window)
).
filter($"rnk" === 1).
drop("rnk","p_testResult","p_testDateTime").
show()
+---------+-------+------+-----+------------+----------+---------+
| msg|product|sernum|state|testDateTime|testResult| p_msg|
+---------+-------+------+-----+------------+----------+---------+
|testlog14| PA1| 4| 2| 1.14| fail|testlog12|
|testlog11| PA1| 1| 1| 1.11| fail| null|
|testlog15| PA1| 5| 1| 1.15| fail|testlog13|
|testlog17| PA1| 7| 1| 1.17| fail|testlog16|
+---------+-------+------+-----+------------+----------+---------+

How do I pass parameters to selectExpr? SparkSQL-Scala

:)
When you have a data frame, you can add columns and fill their rows with the method selectExpr.
Something like this:
scala> table.show
+------+--------+---------+--------+--------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|
+------+--------+---------+--------+--------+
| OlcM| h|999999999| J| 0|
| zOcQ| r|777777777| J| 1|
| kyGp| t|333333333| J| 2|
| BEuX| A|999999999| F| 3|
scala> var table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", "'hola' as Saludo")
table2: org.apache.spark.sql.DataFrame = [idempr: string, tipperrd: string, codperrd: decimal(9,0), tipperrt: string, codperrt: decimal(9,0), Saludo: string]
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hola|
| zOcQ| r|777777777| J| 1| hola|
| kyGp| t|333333333| J| 2| hola|
| BEuX| A|999999999| F| 3| hola|
My point is:
I define a String and call a method which uses this String parameter to fill a column in the data frame. But I am not able to make the select expression pick up the string (I tried $, +, etc.). I want to achieve something like this:
scala> var english = "hello"
scala> def generar_informe(df: DataFrame, tabla: String) {
var selectExpr_df = df.selectExpr(
"TIPPERSCON_BAS as TIP.PERSONA CONTACTABILIDAD",
"CODPERSCON_BAS as COD.PERSONA CONTACTABILIDAD",
"'tabla' as PUNTO DEL FLUJO" )
}
scala> generar_informe(df,english)
.....
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hello|
| zOcQ| r|777777777| J| 1| hello|
| kyGp| t|333333333| J| 2| hello|
| BEuX| A|999999999| F| 3| hello|
I tried:
scala> var result = tabl.selectExpr("A", "B", "$tabla as C")
scala> var abc = tabl.selectExpr("A", "B", ${tabla} as C)
<console>:31: error: not found: value $
var abc = tabl.selectExpr("A", "B", ${tabla} as C)
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
scala> sqlContext.sql("set tabla='hello'")
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
SAME ERROR:
java.lang.RuntimeException: [1.1] failure: identifier expected
${tabla} as C
^
at scala.sys.package$.error(package.scala:27)
Thanks in advance!
You can try this:
def generar_informe(df: DataFrame, english: String) = {
  df.selectExpr(
    "transactionId", "customerId", "itemId", "amountPaid", s"'${english}' as saludo")
}

val english = "hello"
generar_informe(data, english).show()
This is the output I got.
17/11/02 23:56:44 INFO CodeGenerator: Code generated in 13.857987 ms
+-------------+----------+------+----------+------+
|transactionId|customerId|itemId|amountPaid|saludo|
+-------------+----------+------+----------+------+
| 111| 1| 1| 100.0| hello|
| 112| 2| 2| 505.0| hello|
| 113| 3| 3| 510.0| hello|
| 114| 4| 4| 600.0| hello|
| 115| 1| 2| 500.0| hello|
| 116| 1| 2| 500.0| hello|
| 117| 1| 2| 500.0| hello|
| 118| 1| 2| 500.0| hello|
| 119| 2| 3| 500.0| hello|
| 120| 1| 2| 500.0| hello|
| 121| 1| 4| 500.0| hello|
| 122| 1| 2| 500.0| hello|
| 123| 1| 4| 500.0| hello|
| 124| 1| 2| 500.0| hello|
+-------------+----------+------+----------+------+
17/11/02 23:56:44 INFO SparkContext: Invoking stop() from shutdown hook
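For PySpark users, a comparable pattern (a sketch, assuming a dataframe df with the columns used above) is to build the expression string with ordinary Python formatting before handing it to selectExpr:
saludo = "hello"
df.selectExpr("transactionId", "customerId", "itemId", "amountPaid", f"'{saludo}' as saludo").show()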