How to remove duplicate records from a PySpark DataFrame based on a condition?

Assume that I have a PySpark DataFrame like below:
# Prepare Data
data = [('Italy', 'ITA'), \
('China', 'CHN'), \
('China', None), \
('France', 'FRA'), \
('Spain', None), \
('Taiwan', 'TWN'), \
('Taiwan', None)
]
# Create DataFrame
columns = ['Name', 'Code']
df = spark.createDataFrame(data = data, schema = columns)
df.show(truncate=False)
As you can see, a few countries are repeated twice (China & Taiwan in the above example). I want to delete records that satisfy the following conditions:
The column 'Name' is repeated more than once
AND
The column 'Code' is Null.
Note that column 'Code' can be Null for countries which are not repeated, like Spain. I want to keep those records.
The expected output will be like:
Name       Code
'Italy'    'ITA'
'China'    'CHN'
'France'   'FRA'
'Spain'    Null
'Taiwan'   'TWN'
In fact, I want to have one record for every country. Any idea how to do that?

You can use Window.partitionBy to achieve your desired results:
from pyspark.sql import Window
import pyspark.sql.functions as f
df1 = df.select('Name', f.max('Code').over(Window.partitionBy('Name')).alias('Code')).distinct()
df1.show()
Output:
+------+----+
| Name|Code|
+------+----+
| China| CHN|
| Spain|null|
|France| FRA|
|Taiwan| TWN|
| Italy| ITA|
+------+----+

In order to get the non-null rows first, use the row_number window function: partition by the Name column and sort by the Code column. Since null is considered the smallest value in Spark's ORDER BY, descending order is used. Then take the first row of each group.
import pyspark.sql.functions as F

df = df.withColumn('rn', F.expr('row_number() over (partition by Name order by Code desc)')).filter('rn = 1').drop('rn')
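The same thing can be written with the DataFrame Window API instead of a SQL expression; a minimal sketch equivalent to the line above:
from pyspark.sql import Window
import pyspark.sql.functions as F

# Rank rows within each Name; desc ordering puts null Codes last
w = Window.partitionBy('Name').orderBy(F.col('Code').desc())
df1 = df.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn')
df1.show()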

Here is one approach:
df = df.dropDuplicates(subset=["Name"])
Note that PySpark's dropDuplicates has no keep argument (that is pandas), and it does not guarantee which of the duplicate rows is kept, so it may retain the row with the null Code.

There will almost certainly be a cleverer way to do this, but for the sake of a lesson, what if you:
made a new dataframe with just 'Name'
dropped duplicates on that
deleted records where Code is null from the initial table
did a left join between the new table and the old table on 'Name' to bring back 'Code'
I've added Australia with no country code just so you can see it works for that case as well.
import pandas as pd
data = [('Italy', 'ITA'), \
('China', 'CHN'), \
('China', None), \
('France', 'FRA'), \
('Spain', None), \
('Taiwan', 'TWN'), \
('Taiwan', None), \
('Australia', None)
]
# Create DataFrame
columns = ['Name', 'Code']
df = pd.DataFrame(data = data, columns = columns)
print(df)
# get unique country names
uq_countries = df['Name'].drop_duplicates().to_frame()
print(uq_countries)
# remove None
non_na_codes = df.dropna()
print(non_na_codes)
# combine
final = pd.merge(left=uq_countries, right=non_na_codes, on='Name', how='left')
print(final)
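The same idea translates back to PySpark; a sketch, assuming the original Spark df from the question:
# get unique country names
uq_countries = df.select('Name').dropDuplicates()
# keep only the rows where Code is not null
non_na_codes = df.dropna(subset=['Code'])
# combine with a left join so countries without a Code are kept
final = uq_countries.join(non_na_codes, on='Name', how='left')
final.show()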

Related

Compare 2 Spark dataframes, get the records which are not in both dataframes based on multiple columns

I have two PySpark DataFrames. I am looking for the records that appear in only one of the datasets, based on specific columns.
Sample datasets:
# Prepare Data
data_1 = [("A", 1, "data_1"), \
("A", 1, "data_1"), \
("A", 1, "data_1"), \
("A", 2, "data_1")
]
# Create DataFrame
columns= ["col_1", "col_2", "source"]
df_1 = spark.createDataFrame(data = data_1, schema = columns)
df_1.show(truncate=False)
# Prepare Data
data_2 = [("A", 1, "data_2"), \
("A", 1, "data_2"), \
("A", 1, "data_2"), \
("A", 3, "data_2")
]
# Create DataFrame
columns= ["col_1", "col_2", "source"]
df_2 = spark.createDataFrame(data = data_2, schema = columns)
df_2.show(truncate=False)
I want to compare above DataFrames based on columns col_1 & col_2 and get the records which are only in one of the DataFrames. The expected results are:
col_1  col_2  source
"A"    2      "data_1"
"A"    3      "data_2"
Any idea how to solve it?
You can do a LEFT_ANTI join based on the two columns, which will give you the records present in one dataframe but missing from the other. You can then union both outputs.
// Use comma separated string instead of colList in case of python
Dataset<Row> missingInRight = leftDF.join(rightDF, colList, "left_anti");
Dataset<Row> missingInLeft = rightDF.join(leftDF, colList, "left_anti");
missingInRight.union(missingInLeft).show();
Output:
+-----+-----+------+
|col_1|col_2|source|
+-----+-----+------+
| A| 2|data_1|
| A| 3|data_2|
+-----+-----+------+
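For reference, the same left-anti approach as a PySpark sketch, using df_1 and df_2 from the question:
join_cols = ["col_1", "col_2"]
missing_in_right = df_1.join(df_2, on=join_cols, how="left_anti")
missing_in_left = df_2.join(df_1, on=join_cols, how="left_anti")
missing_in_right.union(missing_in_left).show()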
You can also add a column to tell you which dataframe did not have the record.
Dataset<Row> missingInRight = leftDF.join(rightDF, colList, "left_anti")
.withColumn("Comment", functions.lit("NOT_IN_RIGHT"));
Dataset<Row> missingInLeft = rightDF.join(leftDF, colList, "left_anti")
.withColumn("Comment", functions.lit("NOT_IN_LEFT"));
missingInRight.union(missingInLeft).show();
Output:
+-----+-----+------+------------+
|col_1|col_2|source| Comment|
+-----+-----+------+------------+
| A| 2|data_1|NOT_IN_RIGHT|
| A| 3|data_2| NOT_IN_LEFT|
+-----+-----+------+------------+
If you are comparing all the columns, you can use "except":
leftDF.except(rightDF)
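On the PySpark side, the equivalent set-difference operations are subtract and exceptAll; a small sketch:
# rows of df_1 that do not appear in df_2, comparing all columns
df_1.subtract(df_2).show()      # de-duplicates the result
df_1.exceptAll(df_2).show()     # keeps duplicate rows (Spark 2.4+)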

(pyspark) How to make dataframes which have no user_id in common

I am trying to build two user_id dataframes that have no user_id in common in PySpark.
So I wrote the code below:
import pyspark.sql.functions as f
query = "select * from tb_original"
df_original = spark.sql(query)
df_original = df_original.select("user_id").distinct()
df_a = df_original.sort(f.rand()).limit(10000)
df_a.count()
# df_a: 10000
df_b = df_original.join(df_a,on="user_id",how="left_anti").sort(f.rand()).limit(10000)
df_b.count()
# df_b: 10000
df_a.join(df_b,on="user_id",how="left_anti").count()
# df_a - df_b = 9998
# What?????
As a result, df_a and df_b share 2 user_ids... sometimes 1, or 0.
The code looks fine. However, this probably happens because of Spark's lazy evaluation: sort(f.rand()).limit(10000) is re-evaluated on each action, so df_a is not a fixed set of rows.
I need a way to build two user_id dataframes that are guaranteed to have no user_id in common.
Since you want to generate two different sets of users from a given pool of users with no overlap, you can use this simple trick:
from pyspark.sql.functions import monotonically_increasing_id
import pyspark.sql.functions as f
#"Creation of Original DF"
query = "select * from tb_original"
df_original = spark.sql(query)
df_original = df_original.select("user_id").distinct()
df_original = df_original.withColumn("UNIQUE_ID", monotonically_increasing_id())
number_groups_needed=2 ## you can adjust the number of group you need for your use case
dfa=df_original.filter(df_original.UNIQUE_ID % number_groups_needed ==0)
dfb=df_original.filter(df_original.UNIQUE_ID % number_groups_needed ==1)
##dfa and dfb will not have any overlap for user_id
PS: if your user_id is itself an integer, you don't need to create a new UNIQUE_ID column; you can use it directly.
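For example, a sketch of that variant, assuming user_id is an integer column:
# split directly on the integer user_id instead of a generated UNIQUE_ID
dfa = df_original.filter(f.col("user_id") % number_groups_needed == 0)
dfb = df_original.filter(f.col("user_id") % number_groups_needed == 1)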
I chose the randomSplit function that PySpark provides.
df_a,df_b = df_original.randomSplit([0.6,0.4])
df_a = df_a.limit(10000)
df_a.count()
# 10000
df_b = df_b.limit(10000)
df_b.count()
# 10000
df_a.join(df_b,on="user_id",how="left_anti").count()
# 10000
There is no overlap between df_a and df_b anymore!

How do I specify a default value when the value is "null" in a spark dataframe?

I have a data frame built as shown below.
Where the value of the "item_param" column is null, I want to replace it with the string 'test'.
How can I do it?
df = sv_df.withColumn("srv_name", col('col.srv_name'))\
.withColumn("srv_serial", col('col.srv_serial'))\
.withColumn("col2",explode('col.groups'))\
.withColumn("groups_id", col('col2.group_id'))\
.withColumn("col3", explode('col2.items'))\
.withColumn("item_id", col('col3.item_id'))\
.withColumn("item_param", from_json(col("col3.item_param"), MapType(StringType(), StringType())) ) \
.withColumn("item_param", map_values(col("item_param"))[0])\
.withColumn("item_time", col('col3.item_time'))\
.withColumn("item_time", from_unixtime( col('col3.item_time')/10000000 - 11644473600))\
.withColumn("item_value",col('col3.item_value'))\
.drop("servers","col","col2", "col3")
df.show(truncate=False)
df.printSchema()
Use coalesce:
.withColumn("item_param", coalesce(col("item_param"), lit("someDefaultValue")))
It returns the first column/expression that is not null.
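As a self-contained sketch for this question, replacing nulls in item_param with the string 'test':
from pyspark.sql.functions import coalesce, col, lit

# keep item_param where it is not null, otherwise fall back to 'test'
df = df.withColumn("item_param", coalesce(col("item_param"), lit("test")))
df.show(truncate=False)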
You can use fillna, which allows you to replace the null values in all columns, a subset of columns, or each column individually.
# All values
df = df.fillna(0)
# Subset of columns
df = df.fillna(0, subset=['a', 'b'])
# Per selected column
df = df.fillna( { 'a':0, 'b':-1 } )
In your case it would be:
df = df.fillna( {'item_param': 'test'} )

Is there a way in pyspark to count unique values

I have a Spark dataframe (12m rows x 132 columns) and I am trying to calculate the number of unique values per column and remove columns that have only 1 unique value.
So far, I have used the pandas nunique function as such:
import pandas as pd
df = sql_dw.read_table(<table>)
df_p = df.toPandas()
nun = df_p.nunique(axis=0)
nundf = pd.DataFrame({'atr':nun.index, 'countU':nun.values})
dropped = []
for i, j in nundf.values:
    if j == 1:
        dropped.append(i)
        df = df.drop(i)
print(dropped)
Is there a way to do this that is more native to spark - i.e. not using pandas?
Please have a look at the commented example below. The solution requires more Python than PySpark-specific knowledge.
import pyspark.sql.functions as F
#creating a dataframe
columns = ['asin' ,'ctx' ,'fo' ]
l = [('ASIN1','CTX1','FO1')
,('ASIN1','CTX1','FO1')
,('ASIN1','CTX1','FO2')
,('ASIN1','CTX2','FO1')
,('ASIN1','CTX2','FO2')
,('ASIN1','CTX2','FO2')
,('ASIN1','CTX2','FO3')
,('ASIN1','CTX3','FO1')
,('ASIN1','CTX3','FO3')]
df=spark.createDataFrame(l, columns)
df.show()
#we create a list of functions we want to apply
#in this case countDistinct for each column
expr = [F.countDistinct(c).alias(c) for c in df.columns]
#we apply those functions
countdf = df.select(*expr)
#this df has just one row
countdf.show()
#we extract the columns which have just one value
cols2drop = [k for k,v in countdf.collect()[0].asDict().items() if v == 1]
df.drop(*cols2drop).show()
Output:
+-----+----+---+
| asin| ctx| fo|
+-----+----+---+
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO2|
|ASIN1|CTX2|FO1|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO3|
|ASIN1|CTX3|FO1|
|ASIN1|CTX3|FO3|
+-----+----+---+
+----+---+---+
|asin|ctx| fo|
+----+---+---+
| 1| 3| 3|
+----+---+---+
+----+---+
| ctx| fo|
+----+---+
|CTX1|FO1|
|CTX1|FO1|
|CTX1|FO2|
|CTX2|FO1|
|CTX2|FO2|
|CTX2|FO2|
|CTX2|FO3|
|CTX3|FO1|
|CTX3|FO3|
+----+---+
My apologies as I don't have the solution in pyspark but in Scala Spark, which may be transferable or used in case you can't find a pyspark way.
You can create an empty list and then, using a foreach, check which columns have a distinct count of 1 and append them to the list.
From there you can use the list as a filter and drop those columns from your dataframe.
var list_of_columns: List[String] = List()
df_p.columns.foreach { c =>
  if (df_p.select(c).distinct.count == 1)
    list_of_columns ++= List(c)
}
val df_p_new = df_p.drop(list_of_columns: _*)
You can also count the distinct values of a single column:
from pyspark.sql.functions import countDistinct
counts = df.agg(countDistinct("column_name").alias("distinct_count"))
And then filter, which returns a row only when that column has more than 1 distinct value:
counts.filter(counts.distinct_count > 1)

Duplicate row in PySpark Dataframe based off value in another column

I have a dataframe that looks like the following:
ID NumRecords
123 2
456 1
789 3
I want to create a new data frame that concatenates the two columns and duplicates the rows based on the value in NumRecords
So the output should be
ID_New 123-1
ID_New 123-2
ID_New 456-1
ID_New 789-1
ID_New 789-2
ID_New 789-3
I was looking into the "explode" function but it seemed to take only a constant based on the example I saw.
I had a similar issue; this code will duplicate the rows based on the value in the NumRecords column:
from pyspark.sql import Row

def duplicate_function(row):
    data = []  # list of rows to return
    to_duplicate = float(row["NumRecords"])
    i = 0
    while i < to_duplicate:
        row_dict = row.asDict()  # convert a Spark Row object to a Python dictionary
        row_dict["SERIAL_NO"] = str(i)
        new_row = Row(**row_dict)  # create a Spark Row object from the dictionary
        data.append(new_row)  # add this Row to the list of rows to return
        i += 1
    return data  # return the final list
# create final dataset based on value in NumRecords column
df_flatmap = df_input.rdd.flatMap(duplicate_function).toDF(df_input.schema)
You can use a udf:
from pyspark.sql.functions import udf, explode, concat_ws
from pyspark.sql.types import ArrayType, StringType

range_ = udf(lambda x: [str(y) for y in range(1, x + 1)], ArrayType(StringType()))

df.withColumn("records", range_("NumRecords")) \
    .withColumn("record", explode("records")) \
    .withColumn("ID_New", concat_ws("-", "ID", "record"))