Pyspark Dataframe Compare

I have 2 Spark dataframes with the same columns.
DF1:
ID KEY
1 A
1 A
2 B
3 C
3 C
DF2:
ID KEY
1 A
1 A
1 A
2 B
3 C
3 C
4 D
5 E
5 E
I want to compare these 2 dataframes and write out the records that are present in DF2 but not in DF1.
Expected output:
ID KEY
1 A
4 D
5 E
5 E

Use the .exceptAll function.
Example:
df1.show()
#+---+---+
#| ID|KEY|
#+---+---+
#| 1| A|
#| 1| A|
#| 2| B|
#| 3| c|
#| 3| c|
#+---+---+
df2.show()
#+---+---+
#| ID|KEY|
#+---+---+
#| 1| A|
#| 1| A|
#| 1| A|
#| 2| B|
#| 3| c|
#| 3| c|
#| 4| D|
#| 5| E|
#| 5| E|
#+---+---+
df2.exceptAll(df1).orderBy("ID").show()
#+---+---+
#| ID|KEY|
#+---+---+
#| 1| A|
#| 4| D|
#| 5| E|
#| 5| E|
#+---+---+
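Note that exceptAll keeps duplicate rows (EXCEPT ALL semantics), which is why the extra 1/A row and both 5/E rows survive. If you only needed distinct rows, subtract (EXCEPT DISTINCT semantics) would also drop the duplicates; a quick sketch of the difference on the same data:
# subtract() de-duplicates before taking the difference,
# so the extra "1 A" and the second "5 E" would not appear in the result
df2.subtract(df1).orderBy("ID").show()
# expected: only one 4/D row and one 5/E row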

Related

Pyspark sum of columns after union of dataframe

How can I sum all columns after unioning two dataframes?
I have this first df with one row per user:
df = sqlContext.createDataFrame([("2022-01-10", 3, 2,"a"),("2022-01-10",3,4,"b"),("2022-01-10", 1,3,"c")], ["date", "value1", "value2", "userid"])
df.show()
+----------+------+------+------+
| date|value1|value2|userid|
+----------+------+------+------+
|2022-01-10| 3| 2| a|
|2022-01-10| 3| 4| b|
|2022-01-10| 1| 3| c|
+----------+------+------+------+
The date value will always be today's date.
And I have another df, df2, with multiple rows per userid this time, one value per day:
df2 = sqlContext.createDataFrame([("2022-01-01", 13, 12,"a"),("2022-01-02",13,14,"b"),("2022-01-03", 11,13,"c"),
("2022-01-04", 3, 2,"a"),("2022-01-05",3,4,"b"),("2022-01-06", 1,3,"c"),
("2022-01-10", 31, 21,"a"),("2022-01-07",31,41,"b"),("2022-01-09", 11,31,"c")], ["date", "value3", "value4", "userid"])
df2.show()
+----------+------+------+------+
| date|value3|value4|userid|
+----------+------+------+------+
|2022-01-01| 13| 12| a|
|2022-01-02| 13| 14| b|
|2022-01-03| 11| 13| c|
|2022-01-04| 3| 2| a|
|2022-01-05| 3| 4| b|
|2022-01-06| 1| 3| c|
|2022-01-10| 31| 21| a|
|2022-01-07| 31| 41| b|
|2022-01-09| 11| 31| c|
+----------+------+------+------+
After unioning the two of them with this function, here is what I have:
import pyspark.sql.functions as f

def union_different_tables(df1, df2):
    columns_df1 = df1.columns
    columns_df2 = df2.columns
    data_types_df1 = [i.dataType for i in df1.schema.fields]
    data_types_df2 = [i.dataType for i in df2.schema.fields]
    # add the columns missing from each dataframe to the other, filled with nulls
    for col, _type in zip(columns_df1, data_types_df1):
        if col not in df2.columns:
            df2 = df2.withColumn(col, f.lit(None).cast(_type))
    for col, _type in zip(columns_df2, data_types_df2):
        if col not in df1.columns:
            df1 = df1.withColumn(col, f.lit(None).cast(_type))
    union = df1.unionByName(df2)
    return union
+----------+------+------+------+------+------+
| date|value1|value2|userid|value3|value4|
+----------+------+------+------+------+------+
|2022-01-10| 3| 2| a| null| null|
|2022-01-10| 3| 4| b| null| null|
|2022-01-10| 1| 3| c| null| null|
|2022-01-01| null| null| a| 13| 12|
|2022-01-02| null| null| b| 13| 14|
|2022-01-03| null| null| c| 11| 13|
|2022-01-04| null| null| a| 3| 2|
|2022-01-05| null| null| b| 3| 4|
|2022-01-06| null| null| c| 1| 3|
|2022-01-10| null| null| a| 31| 21|
|2022-01-07| null| null| b| 31| 41|
|2022-01-09| null| null| c| 11| 31|
+----------+------+------+------+------+------+
What I want to get is the sum of all the columns of df2 (I have 10 of them in the real case) up to today's date, for each userid, so one row per user:
+----------+------+------+------+------+------+
| date|value1|value2|userid|value3|value4|
+----------+------+------+------+------+------+
|2022-01-10| 3| 2| a| 47 | 35 |
|2022-01-10| 3| 4| b| 47 | 59 |
|2022-01-10| 1| 3| c| 23 | 47 |
+----------+------+------+------+------+------+
Since I have to do this operation for multiple tables, here is what I tried:
from pyspark.sql.window import Window

user_window = Window.partitionBy(['userid']).orderBy('date')
list_tables = [df2]
list_col_original = df.columns
for table in list_tables:
    df = union_different_tables(df, table)
    list_column = list(set(table.columns) - set(list_col_original))
    list_col_original.extend(list_column)
    df = df.select('userid',
                   *[f.sum(f.col(col_name)).over(user_window).alias(col_name)
                     for col_name in list_column])
df.show()
+------+------+------+
|userid|value4|value3|
+------+------+------+
| c| 13| 11|
| c| 16| 12|
| c| 47| 23|
| c| 47| 23|
| b| 14| 13|
| b| 18| 16|
| b| 59| 47|
| b| 59| 47|
| a| 12| 13|
| a| 14| 16|
| a| 35| 47|
| a| 35| 47|
+------+------+------+
But that gives me a sort of cumulative sum, and I didn't find a way to add all the columns to the resulting df.
The other constraint is that I can't do any join: my dataframes are very large and any join takes too long to compute.
Do you know how I can fix my code to get the result I want?
After the union of df1 and df2, you can group by userid and sum all the columns, except date for which you take the max.
Note that for the union part, you can simply use DataFrame.unionByName, provided the common columns have the same data types; only the sets of columns may differ:
df = df1.unionByName(df2, allowMissingColumns=True)
Then group by and agg:
import pyspark.sql.functions as F
result = df.groupBy("userid").agg(
    F.max("date").alias("date"),
    *[F.sum(c).alias(c) for c in df.columns if c not in ("date", "userid")]
)
result.show()
#+------+----------+------+------+------+------+
#|userid| date|value1|value2|value3|value4|
#+------+----------+------+------+------+------+
#| a|2022-01-10| 3| 2| 47| 35|
#| b|2022-01-10| 3| 4| 47| 59|
#| c|2022-01-10| 1| 3| 23| 47|
#+------+----------+------+------+------+------+
This assumes the second dataframe contains only dates prior to today's date in the first one; otherwise, you'll need to filter df2 before the union.
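For example, a minimal sketch of that pre-filter, assuming the date columns are strings in yyyy-MM-dd format as in the sample data (so lexicographic comparison matches chronological order; cutoff is an illustrative name):
import pyspark.sql.functions as F

# take the single date present in the first dataframe as the cutoff
cutoff = df1.select(F.max("date")).first()[0]
df2_filtered = df2.filter(F.col("date") <= cutoff)

df = df1.unionByName(df2_filtered, allowMissingColumns=True)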

How to join two dataframes together

I have two dataframes.
One comes from a groupBy and the other is the overall total:
a = data.groupBy("bucket").agg(sum("total"))
b = data.agg(sum("total"))
I want to put the total from b onto dataframe a so that I can calculate the percentage for each bucket.
What kind of join should I use?
Use .crossJoin; the total from b gets added to every row of dataframe a, and then you can calculate the percentage.
Example:
a.crossJoin(b).show()
#+------+----------+----------+
#|bucket|sum(total)|sum(total)|
#+------+----------+----------+
#| c| 4| 10|
#| b| 3| 10|
#| a| 3| 10|
#+------+----------+----------+
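Since both aggregates end up with the same sum(total) name, it can help to alias them before the crossJoin so the percentage is easy to express. A sketch along those lines, reusing data from the question (bucket_total, grand_total and pct are illustrative names):
from pyspark.sql.functions import sum as sum_, col

a = data.groupBy("bucket").agg(sum_("total").alias("bucket_total"))
b = data.agg(sum_("total").alias("grand_total"))

# every row of a gets the single grand_total value attached
result = a.crossJoin(b).withColumn("pct", col("bucket_total") / col("grand_total") * 100)
result.show()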
Instead of a crossJoin, you can try using window functions as shown below.
df.show()
#+-----+------+
#|total|bucket|
#+-----+------+
#| 1| a|
#| 2| a|
#| 3| b|
#| 4| c|
#+-----+------+
from pyspark.sql.functions import *
from pyspark.sql import *
from pyspark.sql.window import *
import sys
w = Window.partitionBy(col("bucket"))
w1 = Window.orderBy(lit("1")).rowsBetween(-sys.maxsize, sys.maxsize)
df.withColumn("sum_b", sum(col("total")).over(w))\
  .withColumn("sum_c", sum(col("total")).over(w1))\
  .show()
#+-----+------+-----+-----+
#|total|bucket|sum_b|sum_c|
#+-----+------+-----+-----+
#| 4| c| 4| 10|
#| 3| b| 3| 10|
#| 1| a| 3| 10|
#| 2| a| 3| 10|
#+-----+------+-----+-----+
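As a side note, the grand-total window can also be written without the lit("1") ordering trick: an empty partitionBy() puts every row into a single partition, with the same caveat that all data is moved to one partition. A sketch reusing the names above (pct is an illustrative extra column):
w1 = Window.partitionBy()  # one partition covering the whole dataframe
df.withColumn("sum_b", sum(col("total")).over(w))\
  .withColumn("sum_c", sum(col("total")).over(w1))\
  .withColumn("pct", col("sum_b") / col("sum_c"))\
  .show()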
You can also use collect(), since you are only returning a single, simple result to the driver:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.getOrCreate()
df = spark.sql("select 'A' as bucket, 5 as value union all select 'B' as bucket, 8 as value")
df_total = spark.sql("select 9 as total")
df = df.withColumn('total', lit(df_total.collect()[0]['total']))
+------+-----+-----+
|bucket|value|total|
+------+-----+-----+
| A| 5| 9|
| B| 8| 9|
+------+-----+-----+
df = df.withColumn('pourcentage', col('total') / col('value'))
+------+-----+-----+-----------+
|bucket|value|total|pourcentage|
+------+-----+-----+-----------+
| A| 5| 9| 1.8|
| B| 8| 9| 1.125|
+------+-----+-----+-----------+

How to get the last row's value into a new column when Flag is 0, and the current row's value when Flag is 1, in a PySpark dataframe

Scenario 1, when Flag is 1:
For the row where Flag is 1, copy trx_date to destination.
Scenario 2, when Flag is 0:
For the row where Flag is 0, copy the previous destination value.
Input :
+-----------+----+----------+
|customer_id|Flag| trx_date|
+-----------+----+----------+
| 1| 1| 12/3/2020|
| 1| 0| 12/4/2020|
| 1| 1| 12/5/2020|
| 1| 1| 12/6/2020|
| 1| 0| 12/7/2020|
| 1| 1| 12/8/2020|
| 1| 0| 12/9/2020|
| 1| 0|12/10/2020|
| 1| 0|12/11/2020|
| 1| 1|12/12/2020|
| 2| 1| 12/1/2020|
| 2| 0| 12/2/2020|
| 2| 0| 12/3/2020|
| 2| 1| 12/4/2020|
+-----------+----+----------+
Output :
+-----------+----+----------+-----------+
|customer_id|Flag| trx_date|destination|
+-----------+----+----------+-----------+
| 1| 1| 12/3/2020| 12/3/2020|
| 1| 0| 12/4/2020| 12/3/2020|
| 1| 1| 12/5/2020| 12/5/2020|
| 1| 1| 12/6/2020| 12/6/2020|
| 1| 0| 12/7/2020| 12/6/2020|
| 1| 1| 12/8/2020| 12/8/2020|
| 1| 0| 12/9/2020| 12/8/2020|
| 1| 0|12/10/2020| 12/8/2020|
| 1| 0|12/11/2020| 12/8/2020|
| 1| 1|12/12/2020| 12/12/2020|
| 2| 1| 12/1/2020| 12/1/2020|
| 2| 0| 12/2/2020| 12/1/2020|
| 2| 0| 12/3/2020| 12/1/2020|
| 2| 1| 12/4/2020| 12/4/2020|
+-----------+----+----------+-----------+
Code to generate the Spark dataframe:
df = spark.createDataFrame([(1,1,'12/3/2020'),(1,0,'12/4/2020'),(1,1,'12/5/2020'),
(1,1,'12/6/2020'),(1,0,'12/7/2020'),(1,1,'12/8/2020'),(1,0,'12/9/2020'),(1,0,'12/10/2020'),
(1,0,'12/11/2020'),(1,1,'12/12/2020'),(2,1,'12/1/2020'),(2,0,'12/2/2020'),(2,0,'12/3/2020'),
(2,1,'12/4/2020')], ["customer_id","Flag","trx_date"])
A PySpark way to do this: after converting trx_date to a date type, first take an incremental sum of Flag to create the groupings we need, then use the first function over a window partitioned by those groupings. We can use date_format to get both columns back to the desired date format. I assumed your format was MM/dd/yyyy; if it is different, change it to dd/MM/yyyy in the code.
df.show() #sample data
#+-----------+----+----------+
#|customer_id|Flag| trx_date|
#+-----------+----+----------+
#| 1| 1| 12/3/2020|
#| 1| 0| 12/4/2020|
#| 1| 1| 12/5/2020|
#| 1| 1| 12/6/2020|
#| 1| 0| 12/7/2020|
#| 1| 1| 12/8/2020|
#| 1| 0| 12/9/2020|
#| 1| 0|12/10/2020|
#| 1| 0|12/11/2020|
#| 1| 1|12/12/2020|
#| 2| 1| 12/1/2020|
#| 2| 0| 12/2/2020|
#| 2| 0| 12/3/2020|
#| 2| 1| 12/4/2020|
#+-----------+----+----------+
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().orderBy("customer_id", "trx_date")
w1 = Window().partitionBy("Flag2").orderBy("trx_date")\
             .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn("trx_date", F.to_date("trx_date", "MM/dd/yyyy"))\
  .withColumn("Flag2", F.sum("Flag").over(w))\
  .withColumn("destination", F.when(F.col("Flag") == 0, F.first("trx_date").over(w1)).otherwise(F.col("trx_date")))\
  .withColumn("trx_date", F.date_format("trx_date", "MM/dd/yyyy"))\
  .withColumn("destination", F.date_format("destination", "MM/dd/yyyy"))\
  .orderBy("customer_id", "trx_date").drop("Flag2").show()
#+-----------+----+----------+-----------+
#|customer_id|Flag| trx_date|destination|
#+-----------+----+----------+-----------+
#| 1| 1|12/03/2020| 12/03/2020|
#| 1| 0|12/04/2020| 12/03/2020|
#| 1| 1|12/05/2020| 12/05/2020|
#| 1| 1|12/06/2020| 12/06/2020|
#| 1| 0|12/07/2020| 12/06/2020|
#| 1| 1|12/08/2020| 12/08/2020|
#| 1| 0|12/09/2020| 12/08/2020|
#| 1| 0|12/10/2020| 12/08/2020|
#| 1| 0|12/11/2020| 12/08/2020|
#| 1| 1|12/12/2020| 12/12/2020|
#| 2| 1|12/01/2020| 12/01/2020|
#| 2| 0|12/02/2020| 12/01/2020|
#| 2| 0|12/03/2020| 12/01/2020|
#| 2| 1|12/04/2020| 12/04/2020|
#+-----------+----+----------+-----------+
You can use window functions. I am unsure whether Spark SQL supports the standard IGNORE NULLS option of lag().
If it does, you can just do:
select
    t.*,
    case when flag = 1
         then trx_date
         else lag(case when flag = 1 then trx_date end ignore nulls)
                  over (partition by customer_id order by trx_date)
    end as destination
from mytable t
Otherwise, you can build groups with a window sum first:
select
    customer_id,
    flag,
    trx_date,
    case when flag = 1
         then trx_date
         else min(trx_date) over (partition by customer_id, grp order by trx_date)
    end as destination
from (
    select t.*,
           sum(flag) over (partition by customer_id order by trx_date) as grp
    from mytable t
) t
You can achieve this in the following way if you are using the DataFrame API:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# convert the date format while defining the window itself
window = Window().orderBy("customer_id", f.to_date('trx_date', 'MM/dd/yyyy'))
df1 = df.withColumn('destination', f.when(f.col('Flag') == 1, f.col('trx_date')))\
        .withColumn('destination', f.last(f.col('destination'), ignorenulls=True).over(window))
df1.show()
+-----------+----+----------+-----------+
|customer_id|Flag| trx_date|destination|
+-----------+----+----------+-----------+
| 1| 1| 12/3/2020| 12/3/2020|
| 1| 0| 12/4/2020| 12/3/2020|
| 1| 1| 12/5/2020| 12/5/2020|
| 1| 1| 12/6/2020| 12/6/2020|
| 1| 0| 12/7/2020| 12/6/2020|
| 1| 1| 12/8/2020| 12/8/2020|
| 1| 0| 12/9/2020| 12/8/2020|
| 1| 0|12/10/2020| 12/8/2020|
| 1| 0|12/11/2020| 12/8/2020|
| 1| 1|12/12/2020| 12/12/2020|
| 2| 1| 12/1/2020| 12/1/2020|
| 2| 0| 12/2/2020| 12/1/2020|
| 2| 0| 12/3/2020| 12/1/2020|
| 2| 1| 12/4/2020| 12/4/2020|
+-----------+----+----------+-----------+
Hope it helps.

Reset counter on window functions

I have a dataset like the one below, and I want to create a new column C that acts as a counter/row number, which should reset every time column B is 0, partitioned by the value of column A.
Using Spark SQL / SQL only (I can do it using PySpark).
>>> rdd = sc.parallelize([
... [1, 0], [1, 1],[1, 1], [1, 0], [1, 1],
... [1, 1], [2, 1], [2, 1], [3, 0], [3, 1], [3, 1], [3, 1]])
>>> df = rdd.toDF(['A', 'B'])
>>>
>>> df.show()
+---+---+
| A| B|
+---+---+
| 1| 0|
| 1| 1|
| 1| 1|
| 1| 0|
| 1| 1|
| 1| 1|
| 2| 1|
| 2| 1|
| 3| 0|
| 3| 1|
| 3| 1|
| 3| 1|
+---+---+
What I would like to achieve
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 0| 1|
| 1| 1| 2|
| 1| 1| 3|
| 1| 0| 1|
| 1| 1| 2|
| 1| 1| 3|
| 2| 1| 1|
| 2| 1| 2|
| 3| 0| 1|
| 3| 1| 2|
| 3| 1| 3|
| 3| 1| 4|
+---+---+---+
What I have so far
>>> spark.sql('''
... select *, row_number() over(partition by A order by A) as C from df
... ''').show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 0| 1|
| 1| 1| 2|
| 1| 1| 3|
| 1| 0| 4|
| 1| 1| 5|
| 1| 1| 6|
| 3| 0| 1|
| 3| 1| 2|
| 3| 1| 3|
| 3| 1| 4|
| 2| 1| 1|
| 2| 1| 2|
+---+---+---+
SQL tables represent unordered sets, so you need a column that specifies the ordering of the data.
With such a column, you can accumulate the 0 values, because they mark the group breaks. So:
select df.*,
       row_number() over (partition by A, grp order by <ordering column>) as C
from (
    select df.*,
           sum(case when B = 0 then 1 else 0 end)
               over (partition by A order by <ordering column>) as grp
    from df
) df
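If the data really has no natural ordering column, one workaround in Spark is to capture the current row order with monotonically_increasing_id() before applying the window functions. This is only a sketch: the ids are increasing but not consecutive, and the "current order" of a dataframe is not guaranteed by Spark, so in practice the ordering should come from something like a timestamp or source offset. The seq column and df_seq view are illustrative names.
from pyspark.sql import functions as F

df.withColumn("seq", F.monotonically_increasing_id()).createOrReplaceTempView("df_seq")

spark.sql('''
    select A, B,
           row_number() over (partition by A, grp order by seq) as C
    from (
        select *,
               sum(case when B = 0 then 1 else 0 end)
                   over (partition by A order by seq) as grp
        from df_seq
    ) t
''').show()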

Getting Memory Error in PySpark during Filter & GroupBy computation

This is the error:
Job aborted due to stage failure: Task 12 in stage 37.0 failed 4 times, most recent failure: Lost task 12.3 in stage 37.0 (TID 325, 10.139.64.5, executor 20): ExecutorLostFailure (executor 20 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages
So is there any alternative, more efficient way to apply those functions without causing an out-of-memory error? I have billions of rows to compute.
Input Dataframe on which filtering is to be done:
+------+-------+-------+------+-------+-------+
|Pos_id|level_p|skill_p|Emp_id|level_e|skill_e|
+------+-------+-------+------+-------+-------+
| 1| 2| a| 100| 2| a|
| 1| 2| a| 100| 3| f|
| 1| 2| a| 100| 2| d|
| 1| 2| a| 101| 4| a|
| 1| 2| a| 101| 5| b|
| 1| 2| a| 101| 1| e|
| 1| 2| a| 102| 5| b|
| 1| 2| a| 102| 3| d|
| 1| 2| a| 102| 2| c|
| 2| 2| d| 100| 2| a|
| 2| 2| d| 100| 3| f|
| 2| 2| d| 100| 2| d|
| 2| 2| d| 101| 4| a|
| 2| 2| d| 101| 5| b|
| 2| 2| d| 101| 1| e|
| 2| 2| d| 102| 5| b|
| 2| 2| d| 102| 3| d|
| 2| 2| d| 102| 2| c|
| 2| 4| b| 100| 2| a|
| 2| 4| b| 100| 3| f|
+------+-------+-------+------+-------+-------+
Filtering Code:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
from pyspark.sql import functions as sf
function = udf(lambda item, items: 1 if item in items else 0, IntegerType())
df_result = new_df.withColumn('result', function(sf.col('skill_p'), sf.col('skill_e')))
df_filter = df_result.filter(sf.col("result") == 1)
df_filter.show()
res = df_filter.groupBy("Pos_id", "Emp_id").agg(
    sf.collect_set("skill_p").alias("SkillsMatch"),
    sf.sum("result").alias("SkillsMatchedCount"))
res.show()
This needs to be done on billions of rows.
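Not a full solution to the memory issue, but one thing worth trying first: the Python UDF can be replaced with a native column expression, which avoids shipping every row through a Python worker and often lowers executor memory pressure. A sketch that mirrors the "item in items" substring check of the original UDF, assuming skill_p and skill_e are plain strings as in the sample:
from pyspark.sql import functions as sf

# native containment check instead of a Python UDF
df_result = new_df.withColumn(
    "result", sf.col("skill_e").contains(sf.col("skill_p")).cast("int"))
df_filter = df_result.filter(sf.col("result") == 1)

res = df_filter.groupBy("Pos_id", "Emp_id").agg(
    sf.collect_set("skill_p").alias("SkillsMatch"),
    sf.sum("result").alias("SkillsMatchedCount"))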