Compare 2 Spark DataFrames and get the records which are not in both, based on multiple columns

I have two PySpark DataFrames. I am looking for records which are not in both datasets based on specific columns.
Sample datasets:
# Prepare Data
data_1 = [("A", 1, "data_1"),
          ("A", 1, "data_1"),
          ("A", 1, "data_1"),
          ("A", 2, "data_1")]

# Create DataFrame
columns = ["col_1", "col_2", "source"]
df_1 = spark.createDataFrame(data=data_1, schema=columns)
df_1.show(truncate=False)
# Prepare Data
data_2 = [("A", 1, "data_2"),
          ("A", 1, "data_2"),
          ("A", 1, "data_2"),
          ("A", 3, "data_2")]

# Create DataFrame
columns = ["col_1", "col_2", "source"]
df_2 = spark.createDataFrame(data=data_2, schema=columns)
df_2.show(truncate=False)
I want to compare above DataFrames based on columns col_1 & col_2 and get the records which are only in one of the DataFrames. The expected results are:
+-----+-----+------+
|col_1|col_2|source|
+-----+-----+------+
|    A|    2|data_1|
|    A|    3|data_2|
+-----+-----+------+
Any idea how to solve it?

You can do a LEFT_ANTI join on the two columns, which gives you the records present in one dataframe but missing from the other. You can then union both outputs.
// colList holds the join column names; in PySpark, pass a list of column names instead (see the sketch after the output below)
Dataset<Row> missingInRight = leftDF.join(rightDF, colList, "left_anti");
Dataset<Row> missingInLeft = rightDF.join(leftDF, colList, "left_anti");
missingInRight.union(missingInLeft).show();
Output:
+-----+-----+------+
|col_1|col_2|source|
+-----+-----+------+
| A| 2|data_1|
| A| 3|data_2|
+-----+-----+------+
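Since the question is PySpark, here is a minimal sketch of the same left_anti + union idea, assuming df_1 and df_2 from the question and joining on col_1 and col_2:
# PySpark sketch of the left_anti + union approach (assumes df_1 and df_2 from the question)
join_cols = ["col_1", "col_2"]
missing_in_right = df_1.join(df_2, join_cols, "left_anti")
missing_in_left = df_2.join(df_1, join_cols, "left_anti")
missing_in_right.unionByName(missing_in_left).show()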
You can also add a column to tell you which dataframe didn't have the record.
Dataset<Row> missingInRight = leftDF.join(rightDF, colList, "left_anti")
.withColumn("Comment", functions.lit("NOT_IN_RIGHT"));
Dataset<Row> missingInLeft = rightDF.join(leftDF, colList, "left_anti")
.withColumn("Comment", functions.lit("NOT_IN_LEFT"));
missingInRight.union(missingInLeft).show();
Output:
+-----+-----+------+------------+
|col_1|col_2|source| Comment|
+-----+-----+------+------------+
| A| 2|data_1|NOT_IN_RIGHT|
| A| 3|data_2| NOT_IN_LEFT|
+-----+-----+------+------------+
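The same Comment-column variant would look roughly like this in PySpark (a sketch, reusing join_cols, df_1 and df_2 from the sketch above):
from pyspark.sql import functions as F

# Tag each anti-join result with the side it was missing from, then union
missing_in_right = df_1.join(df_2, join_cols, "left_anti").withColumn("Comment", F.lit("NOT_IN_RIGHT"))
missing_in_left = df_2.join(df_1, join_cols, "left_anti").withColumn("Comment", F.lit("NOT_IN_LEFT"))
missing_in_right.unionByName(missing_in_left).show()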
If you are comparing all the columns, you can use except:
leftDF.except(rightDF)
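The closest PySpark equivalents are exceptAll and subtract (a sketch, assuming left_df and right_df are PySpark DataFrames with the same schema):
left_df.exceptAll(right_df).show()   # like EXCEPT ALL, keeps duplicate rows
left_df.subtract(right_df).show()    # like EXCEPT DISTINCT, returns distinct rows only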

Related

How to remove duplicate records from PySpark DataFrame based on a condition?

Assume that I have a PySpark DataFrame like below:
# Prepare Data
data = [('Italy', 'ITA'),
        ('China', 'CHN'),
        ('China', None),
        ('France', 'FRA'),
        ('Spain', None),
        ('Taiwan', 'TWN'),
        ('Taiwan', None)]

# Create DataFrame
columns = ['Name', 'Code']
df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)
As you can see, a few countries are repeated twice (China & Taiwan in the above example). I want to delete records that satisfy the following conditions:
The column 'Name' is repeated more than once
AND
The column 'Code' is Null.
Note that column 'Code' can be Null for countries which are not repeated, like Spain. I want to keep those records.
The expected output will be like:
+------+----+
|  Name|Code|
+------+----+
| Italy| ITA|
| China| CHN|
|France| FRA|
| Spain|null|
|Taiwan| TWN|
+------+----+
In fact, I want to have one record for every country. Any idea how to do that?
You can use Window.partitionBy to achieve your desired results:
from pyspark.sql import Window
import pyspark.sql.functions as f
df1 = df.select('Name', f.max('Code').over(Window.partitionBy('Name')).alias('Code')).distinct()
df1.show()
Output:
+------+----+
| Name|Code|
+------+----+
| China| CHN|
| Spain|null|
|France| FRA|
|Taiwan| TWN|
| Italy| ITA|
+------+----+
To obtain the non-null rows first, use the row_number window function: partition by the Name column and sort by the Code column. Since null is considered the smallest value in Spark's ORDER BY, descending order is used. Then take the first row of each group.
import pyspark.sql.functions as F

df = df.withColumn('rn', F.expr('row_number() over (partition by Name order by Code desc)')).filter('rn = 1').drop('rn')
Here is one approach: sort so the non-null codes come first, then drop duplicates on Name. Note that PySpark's dropDuplicates has no keep argument, and ordering before dropDuplicates is a common but not strictly guaranteed pattern:
from pyspark.sql import functions as F

df = df.orderBy(F.col('Code').desc()).dropDuplicates(subset=['Name'])
There will almost certainly be a cleverer way to do this, but for the sake of a lesson, what if you:
made a new dataframe with just 'Name'
dropped duplicates on that
deleted records where Code is null from the initial table
did a left join between the new table and the old table to bring back 'Code'
I've added Australia with no country code just so you can see it works for that case as well.
import pandas as pd

data = [('Italy', 'ITA'),
        ('China', 'CHN'),
        ('China', None),
        ('France', 'FRA'),
        ('Spain', None),
        ('Taiwan', 'TWN'),
        ('Taiwan', None),
        ('Australia', None)]

# Create DataFrame
columns = ['Name', 'Code']
df = pd.DataFrame(data=data, columns=columns)
print(df)
# get unique country names
uq_countries = df['Name'].drop_duplicates().to_frame()
print(uq_countries)
# remove None
non_na_codes = df.dropna()
print(non_na_codes)
# combine
final = pd.merge(left=uq_countries, right=non_na_codes, on='Name', how='left')
print(final)
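For reference, the same four steps translate fairly directly to PySpark (a sketch, assuming df is the Spark DataFrame from the question):
# Unique country names, non-null codes, then a left join to bring the codes back
uq_countries = df.select('Name').dropDuplicates()
non_na_codes = df.dropna(subset=['Code'])
final = uq_countries.join(non_na_codes, on='Name', how='left')
final.show()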

How to do joins and update together in a same query in PySpark?

I have a SQL query which I am trying to convert into PySpark. In SQL query, we are joining two tables and updating a column where condition is matching. The SQL query looks like this:
UPDATE [STUDENT_TABLE] INNER JOIN [COLLEGE_DATA]
ON ([STUDENT_TABLE].UNIQUEID = COLLEGE_DATA.PROFESSIONALID)
AND ([STUDENT_TABLE].[ADDRESS] = COLLEGE_DATA.STATE_ADDRESS)
SET STUDENT_TABLE.PROGRESS = "REGULAR"
WHERE (((STUDENT_TABLE.BLOCKERS) Is Null));
Example inputs:
from pyspark.sql import functions as F
df_stud = spark.createDataFrame(
    [(1, 'x', None, 'REG'),
     (2, 'y', 'qwe', 'REG')],
    ['UNIQUEID', 'ADDRESS', 'BLOCKERS', 'STUDENTINSTATE'])
df_college = spark.createDataFrame([(1, 'x'), (2, 'x')], ['PROFESSIONALID', 'STATE_ADDRESS'])
Your query would update just the first row of df_stud - the value in the column "STUDENTINSTATE" would become "REGULAR".
In the following script, we do the join, then select all the columns from df_stud, except the column "STUDENTINSTATE" which must be updated. This column gets value "REGULAR" if the column "PROFESSIONALID" (from df_college) is not null (i.e. join condition was satisfied). If the join condition is not satisfied, the value should not be updated, so it is taken from column "STUDENTINSTATE" as is.
join_on = (df_stud.UNIQUEID == df_college.PROFESSIONALID) & \
          (df_stud.ADDRESS == df_college.STATE_ADDRESS) & \
          df_stud.BLOCKERS.isNull()

df = (df_stud.alias('a')
      .join(df_college.alias('b'), join_on, 'left')
      .select(
          *[c for c in df_stud.columns if c != 'STUDENTINSTATE'],
          F.expr("nvl2(b.PROFESSIONALID, 'REGULAR', a.STUDENTINSTATE) STUDENTINSTATE")
      )
)
df.show()
# +--------+-------+--------+--------------+
# |UNIQUEID|ADDRESS|BLOCKERS|STUDENTINSTATE|
# +--------+-------+--------+--------------+
# | 1| x| null| REGULAR|
# | 2| y| qwe| REG|
# +--------+-------+--------+--------------+

How to add multiple column dynamically based on filter condition

I am trying to create multiple columns dynamically, based on a filter condition, after comparing two data frames with the code below:
source_df
+---+-----+-----+------+
|key|val11|val12|  date|
+---+-----+-----+------+
|abc|  1.1| john|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+

dest_df
+---+-----+-----+------+
|key|val11|val12|  date|
+---+-----+-----+------+
|abc|  2.1| jack|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
columns = source_df.columns[1:]
joined_df = source_df.join(dest_df, 'key', 'full')
for column in columns:
    column_name = "difference_in_" + str(column)
    report = joined_df \
        .filter(source_df[column] != dest_df[column]) \
        .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))
The output I expect is
#Expected
+---+-------------------+-------------------+
|key|difference_in_val11|difference_in_val12|
+---+-------------------+-------------------+
|abc|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-------------------+-------------------+
But I get only one column in the result:
#Actual
+---+-------------------+
|key|difference_in_val12|
+---+-------------------+
|abc|[src:john,dst:jack]|
+---+-------------------+
How to generate multiple columns based on filter condition dynamically?
DataFrames are immutable objects, so you need to build each new dataframe from the one generated in the previous iteration. Something like below:
from pyspark.sql import functions as F

columns = source_df.columns[1:]
joined_df = source_df.join(dest_df, 'key', 'full')
for column in columns:
    column_name = "difference_in_" + str(column)
    if column != columns[-1]:
        report = joined_df \
            .filter(source_df[column] != dest_df[column]) \
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))
    else:
        report1 = report \
            .filter(source_df[column] != dest_df[column]) \
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))

report1.show()
Output -
+---+-----+-----+-----+-----+-------------------+-------------------+
|key|val11|val12|val11|val12|difference_in_val11|difference_in_val12|
+---+-----+-----+-----+-----+-------------------+-------------------+
|abc| 1.1| john| 2.1| jack| [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-----+-----+-----+-----+-------------------+-------------------+
You could also do this with a union of both dataframes and then collect a list only where the collect_set size is greater than 1; this avoids joining the dataframes:
from pyspark.sql import functions as F

cols = source_df.drop("key").columns
output = (source_df.withColumn("ref", F.lit("src:"))
          .unionByName(dest_df.withColumn("ref", F.lit("dst:")))
          .groupBy("key")
          .agg(*[F.when(F.size(F.collect_set(i)) > 1, F.collect_list(F.concat("ref", i))).alias(i)
                 for i in cols])
          .dropna(subset=cols, how='all'))
output.show()
+---+------------------+--------------------+
|key| val11| val12|
+---+------------------+--------------------+
|abc|[src:1.1, dst:2.1]|[src:john, dst:jack]|
+---+------------------+--------------------+

Is there a way in PySpark to count unique values?

I have a spark dataframe (12m x 132) and I am trying to calculate the number of unique values by column, and remove columns that have only 1 unique value.
So far, I have used the pandas nunique function as such:
import pandas as pd
df = sql_dw.read_table(<table>)
df_p = df.toPandas()
nun = df_p.nunique(axis=0)
nundf = pd.DataFrame({'atr':nun.index, 'countU':nun.values})
dropped = []
for i, j in nundf.values:
    if j == 1:
        dropped.append(i)
        df = df.drop(i)
print(dropped)
Is there a way to do this that is more native to spark - i.e. not using pandas?
Please have a look at the commented example below. The solution requires more Python than PySpark-specific knowledge.
import pyspark.sql.functions as F
#creating a dataframe
columns = ['asin' ,'ctx' ,'fo' ]
l = [('ASIN1','CTX1','FO1')
,('ASIN1','CTX1','FO1')
,('ASIN1','CTX1','FO2')
,('ASIN1','CTX2','FO1')
,('ASIN1','CTX2','FO2')
,('ASIN1','CTX2','FO2')
,('ASIN1','CTX2','FO3')
,('ASIN1','CTX3','FO1')
,('ASIN1','CTX3','FO3')]
df=spark.createDataFrame(l, columns)
df.show()
#we create a list of functions we want to apply
#in this case countDistinct for each column
expr = [F.countDistinct(c).alias(c) for c in df.columns]
#we apply those functions
countdf = df.select(*expr)
#this df has just one row
countdf.show()
#we extract the columns which have just one value
cols2drop = [k for k,v in countdf.collect()[0].asDict().items() if v == 1]
df.drop(*cols2drop).show()
Output:
+-----+----+---+
| asin| ctx| fo|
+-----+----+---+
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO2|
|ASIN1|CTX2|FO1|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO3|
|ASIN1|CTX3|FO1|
|ASIN1|CTX3|FO3|
+-----+----+---+
+----+---+---+
|asin|ctx| fo|
+----+---+---+
| 1| 3| 3|
+----+---+---+
+----+---+
| ctx| fo|
+----+---+
|CTX1|FO1|
|CTX1|FO1|
|CTX1|FO2|
|CTX2|FO1|
|CTX2|FO2|
|CTX2|FO2|
|CTX2|FO3|
|CTX3|FO1|
|CTX3|FO3|
+----+---+
My apologies, as I don't have the solution in PySpark but in Scala Spark, which may be transferable or usable in case you can't find a PySpark way.
You can create a blank list and then using a foreach, check which columns have a distinct count of 1, then append them to the blank list.
From there you can use the list as a filter and drop those columns from your dataframe.
var list_of_columns: List[String] = List()
df_p.columns.foreach { c =>
  if (df_p.select(c).distinct.count == 1)
    list_of_columns ++= List(c)
}
val df_p_new = df_p.drop(list_of_columns: _*)
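A rough PySpark translation of the same idea (a sketch, assuming df is the Spark DataFrame from the question; each distinct().count() triggers a separate job, so this can be slow on wide tables):
# Find the columns with exactly one distinct value, then drop them
# (df_new is just an illustrative name)
cols_to_drop = [c for c in df.columns if df.select(c).distinct().count() == 1]
df_new = df.drop(*cols_to_drop)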
You can group your df by that column and count the distinct values of that column:
from pyspark.sql.functions import countDistinct

df = df.groupBy("column_name").agg(countDistinct("column_name").alias("distinct_count"))
Then filter your df to the rows that have a distinct_count greater than 1:
df = df.filter(df.distinct_count > 1)

How to concatenate all rows into one row of a multi-column DataFrame?

In Python, what is the best way to combine all the rows of each column of a multi-column DataFrame into a single row, with values separated by a ' | ' separator and null values included?
import pandas as pd

html = 'https://en.wikipedia.org/wiki/Visa_requirements_for_Norwegian_citizens'
df = pd.read_html(html, header=0)
df = df[1]
df.to_csv('norway.csv')
import pandas as pd

df = pd.DataFrame([
    {'A': 'x', 'B': 2, 'C': None},
    {'A': None, 'B': 2, 'C': 1},
    {'A': 'y', 'B': None, 'C': None},
])
pd.DataFrame(df.fillna('').apply(lambda x: '|'.join(x.astype(str)), axis=0)).transpose()
I believe you need to replace missing values with fillna if necessary, convert the values to strings with astype, and apply with join. This returns a Series, so to get a one-row DataFrame add to_frame and transpose:
df = df.fillna(' ').astype(str).apply('|'.join).to_frame().T
print (df)
Country Allowed_stay Visa_requirement
0 Albania|Afganistan|Andorra 30|30|60 visa free| | visa free
Or use a list comprehension with the DataFrame constructor:
L = ['|'.join(df[x].fillna(' ').astype(str)) for x in df]
df1 = pd.DataFrame([L], columns=df.columns)
print (df1)
Country Allowed_stay Visa_requirement
0 Albania|Afganistan|Andorra 30|30|60 visa free| | visa free