How to fill in Null values in a column of a PySpark DataFrame using value from other records?

I have a PySpark DataFrame with columns 'Country' and 'Continent'. In several records, the value of 'Continent' is missing. However, there are other records that contain the 'Continent' for the same country (see "Italy" and "China" in the sample dataset below).
# Prepare Data
data = [
    (1, "Italy", "Europe"),
    (2, "Italy", "Europe"),
    (3, "Italy", None),
    (4, "Italy", None),
    (5, "China", "Asia"),
    (6, "China", None),
    (7, "Japan", "Asia"),
    (8, "France", None),
]

# Create DataFrame
columns = ["Code", "Country", "Continent"]
df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)
For records whose 'Continent' is Null, I want to:
- Look for any record of the same country whose 'Continent' is not Null
- Take that 'Continent' value and fill in the Null
The expected output will be like:
+----+-------+---------+
|Code|Country|Continent|
+----+-------+---------+
|1   |Italy  |Europe   |
|2   |Italy  |Europe   |
|3   |Italy  |Europe   |
|4   |Italy  |Europe   |
|5   |China  |Asia     |
|6   |China  |Asia     |
|7   |Japan  |Asia     |
|8   |France |null     |
+----+-------+---------+
Any idea how to do that?

Here is one solution. Assuming there is only one distinct 'Continent' value within each country, you can use max() over a window partitioned by country (max() ignores nulls, so it returns the known continent):
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df2 = df.withColumn(
    "Continent",
    func.when(
        func.col("Continent").isNull(),
        func.max("Continent").over(Window.partitionBy(func.col("Country"))),
    ).otherwise(func.col("Continent")),
).orderBy("Code")
df2.show(truncate=False)
+----+-------+---------+
|Code|Country|Continent|
+----+-------+---------+
|1 |Italy |Europe |
|2 |Italy |Europe |
|3 |Italy |Europe |
|4 |Italy |Europe |
|5 |China |Asia |
|6 |China |Asia |
|7 |Japan |Asia |
|8 |France |null |
+----+-------+---------+
Or you can create a smaller "reference" dataframe and do a broadcast join back to the main dataframe, as sketched below.
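A rough sketch of that reference-table idea, assuming each country maps to at most one continent (the names here mirror the example above):
from pyspark.sql import functions as F

# Build a distinct Country -> Continent lookup from the non-null rows,
# then broadcast-join it back; countries absent from the lookup stay null.
ref = (
    df.where(F.col("Continent").isNotNull())
    .select("Country", "Continent")
    .distinct()
)
df2 = (
    df.drop("Continent")
    .join(F.broadcast(ref), on="Country", how="left")
    .orderBy("Code")
)
df2.show(truncate=False)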

Another solution, same result:
from pyspark.sql import functions as F, Window

df.withColumn(
    "continent",
    F.coalesce(
        F.col("continent"),
        F.first("continent").over(Window.partitionBy("country")),
    ),
).show()
+----+-------+---------+
|Code|Country|continent|
+----+-------+---------+
| 5| China| Asia|
| 6| China| Asia|
| 8| France| null|
| 1| Italy| Europe|
| 2| Italy| Europe|
| 3| Italy| Europe|
| 4| Italy| Europe|
| 7| Japan| Asia|
+----+-------+---------+
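One caveat: over an unordered window, first() may evaluate a null row first and return null. pyspark.sql.functions.first accepts an ignorenulls flag, so a safer variant of the same idea is:
# ignorenulls=True makes first() skip null continents within each country partition
df.withColumn(
    "continent",
    F.coalesce(
        F.col("continent"),
        F.first("continent", ignorenulls=True).over(Window.partitionBy("country")),
    ),
).show()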

Related

Finding the value of a column based on other 2 columns

I have a specific problem, where I want to retrieve the value of the bu_id field from id and matched_id.
When there is some value in the matched_id column, bu_id should be shared by that particular id and the ids listed in its matched_id.
When matched_id is blank, bu_id should be the same as id.
input
+---+----------+
|id |matched_id|
+---+----------+
|0  |7,8       |
|1  |          |
|2  |4         |
|3  |5,9       |
|4  |2         |
|5  |3,9       |
|6  |          |
|7  |0,8       |
|8  |0,7       |
|9  |3,5       |
+---+----------+
output
+---+----------+-----+
|id |matched_id|bu_id|
+---+----------+-----+
|0  |7,8       |0    |
|1  |          |1    |
|2  |4         |2    |
|3  |5,9       |3    |
|4  |2         |2    |
|5  |3,9       |3    |
|6  |          |6    |
|7  |0,8       |0    |
|8  |0,7       |0    |
|9  |3,5       |3    |
+---+----------+-----+
Can anyone help me with how to approach this problem? Thanks in advance.
We should try to use functions from the pyspark.sql.functions module wherever possible, because these are optimized for PySpark dataframes (see here), whereas UDFs are not and should be avoided when possible.
To achieve the desired output, we can concatenate the "id" and "matched_id" columns, split the resulting string into a list of strings, cast the result to an array of integers, and take the minimum of the array. We don't have to worry about the blank strings, because the cast converts them into null and F.array_min drops nulls from consideration. This can be done with the following code (a little hard to read, but it gets the job done):
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        ("0", "7,8"),
        ("1", ""),
        ("2", "4"),
        ("3", "5,9"),
        ("4", "2"),
        ("5", "3,9"),
        ("6", ""),
        ("7", "0,8"),
        ("8", "0,7"),
        ("9", "3,5"),
    ],
    ["id", "matched_id"],
)
df.withColumn(
    "bu_id",
    F.array_min(
        F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ",").cast("array<int>")
    ),
).show()
Output:
+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
| 0| 7,8| 0|
| 1| | 1|
| 2| 4| 2|
| 3| 5,9| 3|
| 4| 2| 2|
| 5| 3,9| 3|
| 6| | 6|
| 7| 0,8| 0|
| 8| 0,7| 0|
| 9| 3,5| 3|
+---+----------+-----+
Update: in the case of non-numeric strings in columns "id" and "matched_id", we can no longer cast to an array of integers, so we can instead use the PySpark functions F.when and .otherwise (see here) to set our new column to the "id" column when "matched_id" is an empty string "", and apply our longer nested expression when "matched_id" is non-empty. Note that without the cast, F.array_min compares the strings lexicographically.
df2 = spark.createDataFrame(
    [
        ("0", "7,8"),
        ("1", ""),
        ("2", "4"),
        ("3", "5,9"),
        ("4", "2"),
        ("5", "3,9"),
        ("6", ""),
        ("7", "0,8"),
        ("8", "0,7"),
        ("9", "3,5"),
        ("x", ""),
        ("x", "y,z"),
    ],
    ["id", "matched_id"],
)
df2.withColumn(
    "bu_id",
    F.when(
        F.col("matched_id") != "",
        F.array_min(F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ",")),
    ).otherwise(F.col("id")),
).show()
Output:
+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
| 0| 7,8| 0|
| 1| | 1|
| 2| 4| 2|
| 3| 5,9| 3|
| 4| 2| 2|
| 5| 3,9| 3|
| 6| | 6|
| 7| 0,8| 0|
| 8| 0,7| 0|
| 9| 3,5| 3|
| x| | x|
| x| y,z| x|
+---+----------+-----+
To answer this question I assumed that the logic you are looking to implement is:
- If the matched_id column is null, then bu_id should be the same as id.
- If the matched_id column is not null, we should consider the values listed in both the id and matched_id columns, and bu_id should be the minimum of those values.
The Set-Up
# imports to include
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# making your dataframe
df = spark.createDataFrame(
    [
        ('0', '7,8'),
        ('1', ''),
        ('2', '4'),
        ('3', '5,9'),
        ('4', '2'),
        ('5', '3,9'),
        ('6', ''),
        ('7', '0,8'),
        ('8', '0,7'),
        ('9', '3,5'),
    ],
    ['id', 'matched_id'])
print(df.schema.fields)
df.show(truncate=False)
In this df, both the id and matched_id columns are StringType. The code that follows builds off this assumption; you can check the column types in your df by running print(df.schema.fields).
+---+----------+
|id |matched_id|
+---+----------+
|0  |7,8       |
|1  |          |
|2  |4         |
|3  |5,9       |
|4  |2         |
|5  |3,9       |
|6  |          |
|7  |0,8       |
|8  |0,7       |
|9  |3,5       |
+---+----------+
The Logic
To implement the logic for bu_id, we create a function called bu_calculation that defines the logic. We then wrap the function in a PySpark SQL UDF, and the bu_id column is created by passing the columns we need to evaluate (the id and matched_id columns) into the UDF.
# create custom function with the logic for bu_id
def bu_calculation(id_col, matched_id_col):
    id_int = int(id_col)
    # turn the string in the matched_id column into a list and remove empty values
    matched_id_list = list(filter(None, matched_id_col.split(",")))
    if len(matched_id_list) > 0:
        # if matched_id column has values, convert strings to ints
        all_ids = [int(x) for x in matched_id_list]
        # join id column value with matched_id column values
        all_ids.append(id_int)
        # return minimum value
        return min(all_ids)
    else:
        # if matched_id column is empty return the id column value
        return id_int

# wrap the custom bu_calculation function in a pyspark sql udf
# the use of IntegerType() here enforces that bu_calculation has to return an int
bu_udf = F.udf(bu_calculation, IntegerType())

# make a new column called bu_id using the pyspark sql udf we created called bu_udf
df = df.withColumn('bu_id', bu_udf('id', 'matched_id'))
df.show(truncate=False)
+---+----------+-----+
|id |matched_id|bu_id|
+---+----------+-----+
|0  |7,8       |0    |
|1  |          |1    |
|2  |4         |2    |
|3  |5,9       |3    |
|4  |2         |2    |
|5  |3,9       |3    |
|6  |          |6    |
|7  |0,8       |0    |
|8  |0,7       |0    |
|9  |3,5       |3    |
+---+----------+-----+
More about the pyspark sql udf function here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.udf.html

Pyspark how to compare row by row based on hash from two data frame and group the result

I have the two data frames below, with a hash added as an additional column to identify differences for the same id between the two data frames.
df1=
name |department|state|id |hash
-----+----------+-----+---+----
James|Sales     |NY   |101|c123
Maria|Finance   |CA   |102|d234
Jen  |Marketing |NY   |103|df34

df2=
name |department|state|id |hash
-----+----------+-----+---+----
James|Sales1    |null |101|4df2
Maria|Finance   |     |102|5rfg
Jen  |          |NY2  |103|2f34
# identify unmatched rows for the same id from both data frames
df1_un_match_indf2 = df1.join(df2, df1.hash == df2.hash, "leftanti")
df2_un_match_indf1 = df2.join(df1, df2.hash == df1.hash, "leftanti")
# the above lists every row of both data frames, since all hashes for the same id differ
Now I am trying to find the difference in row values for the same id from the df1_un_match_indf2 and df2_un_match_indf1 data frames, so that it shows the differences row by row:
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.join(df4, df3.id == df4.id, "inner")
common_diff.show()
but the result shows the differences like this:
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|name |department|state|id |hash|name |department|state|id |hash|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|James|Sales     |NY   |101|c123|James|Sales1    |null |101|4df2|
|Maria|Finance   |CA   |102|d234|Maria|Finance   |     |102|5rfg|
|Jen  |Marketing |NY   |103|df34|Jen  |          |NY2  |103|2f34|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
What I am expecting is:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales1']   |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['103','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
I tried different ways, but couldn't find the right solution to produce this expected format.
Can anyone give a solution or an idea for this?
Thanks
What you want to use is likely collect_list, or maybe collect_set.
This is really well described here:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext("local")
sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
(df
    .groupby("id")
    .agg(F.collect_set("code"),
         F.collect_list("name"))
    .show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+
In your case you need to slightly change your join into a union to enable you to group the data.
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.union(df4)

(common_diff
    .groupby("id")
    .agg(F.collect_set("name"),
         F.collect_list("department"))
    .show())
If you can't do a union, you can build arrays from the joined columns instead (this assumes the duplicated columns have been renamed, e.g. to thisState/thatState):
from pyspark.sql.functions import array

common_diff.select(
    common_diff.id,
    array(
        common_diff.thisState,
        common_diff.thatState,
    ).alias("State"),
    array(
        common_diff.thisDept,
        common_diff.thatDept,
    ).alias("Department"),
)
It's a lot more typing and a little more fragile. I suggest that renaming columns and using the groupby is likely cleaner and clearer; a sketch follows.
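For reference, a minimal sketch of that rename-then-union idea, assuming the column names from the example frames:
from pyspark.sql import functions as F

# Stack the two frames, then collect each attribute per id into a list.
# Note that collect_list drops nulls; if nulls must be preserved (as in the
# expected output above), use array() after a join instead.
cols = ["name", "department", "state", "id", "hash"]
common_diff = df3.select(cols).union(df4.select(cols))
result = common_diff.groupBy("id").agg(
    *[F.collect_list(c).alias(c) for c in cols if c != "id"]
)
result.show(truncate=False)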

Determine if pyspark DataFrame row value is present in other columns

I'm working with a dataframe in pyspark, and need to evaluate, row by row, whether a value is present in other columns of the dataframe. As an example, given this dataframe:
df:
+---------+--------------+-------+-------+-------+
|Subject |SubjectTotal |TypeA |TypeB |TypeC |
+---------+--------------+-------+-------+-------+
|Subject1 |10 |5 |3 |2 |
+---------+--------------+-------+-------+-------+
|Subject2 |15 |0 |15 |0 |
+---------+--------------+-------+-------+-------+
|Subject3 |5 |0 |0 |5 |
+---------+--------------+-------+-------+-------+
As an output, I need to determine which Type has 100% of the SubjectTotal. So my output would look like this:
df_output:
+---------+--------------+
|Subject |Type |
+---------+--------------+
|Subject2 |TypeB |
+---------+--------------+
|Subject3 |TypeC |
+---------+--------------+
Is it even possible?
Thanks!
You can try the when().otherwise() PySpark SQL function, or a CASE statement in SQL:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        ("Subject1", 10, 5, 3, 2),
        ("Subject2", 15, 0, 15, 0),
        ("Subject3", 5, 0, 0, 5),
    ],
    ("subject", "subjectTotal", "TypeA", "TypeB", "TypeC"),
)
df.show()
df.show()
+--------+------------+-----+-----+-----+
| subject|subjectTotal|TypeA|TypeB|TypeC|
+--------+------------+-----+-----+-----+
|Subject1| 10| 5| 3| 2|
|Subject2| 15| 0| 15| 0|
|Subject3| 5| 0| 0| 5|
+--------+------------+-----+-----+-----+
df.withColumn("Type", F.
when(F.col("subjectTotal") == F.col("TypeA"), "TypeA").
when(F.col("subjectTotal") == F.col("TypeB"), "TypeB").
when(F.col("subjectTotal") == F.col("TypeC"), "TypeC").
otherwise(None)).show()
+--------+------------+-----+-----+-----+-----+
| subject|subjectTotal|TypeA|TypeB|TypeC| Type|
+--------+------------+-----+-----+-----+-----+
|Subject1| 10| 5| 3| 2| null|
|Subject2| 15| 0| 15| 0|TypeB|
|Subject3| 5| 0| 0| 5|TypeC|
+--------+------------+-----+-----+-----+-----+
You can use a when expression within a list comprehension over all the TypeX columns, then coalesce the list of expressions:
from pyspark.sql import functions as F
df1 = df.select(
    F.col("Subject"),
    F.coalesce(
        *[F.when(F.col(c) == F.col("SubjectTotal"), F.lit(c)) for c in df.columns[2:]]
    ).alias("Type"),
).filter("Type is not null")
df1.show()
#+--------+-----+
#| Subject| Type|
#+--------+-----+
#|Subject2|TypeB|
#|Subject3|TypeC|
#+--------+-----+
You can unpivot the dataframe using stack and filter the rows where SubjectTotal is equal to the value in the type columns:
df2 = df.selectExpr(
    'Subject',
    'SubjectTotal',
    "stack(3, 'TypeA', TypeA, 'TypeB', TypeB, 'TypeC', TypeC) as (type, val)"
).filter('SubjectTotal = val').select('Subject', 'type')
df2.show()
df2.show()
+--------+-----+
| Subject| type|
+--------+-----+
|Subject2|TypeB|
|Subject3|TypeC|
+--------+-----+
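If there are many TypeX columns, the stack expression can also be generated from the column list rather than hand-written. A sketch, assuming the type columns start at the third position as in the example:
# Build the stack() expression dynamically from the type columns.
type_cols = df.columns[2:]
pairs = ", ".join(f"'{c}', {c}" for c in type_cols)
df3 = df.selectExpr(
    "Subject",
    "SubjectTotal",
    f"stack({len(type_cols)}, {pairs}) as (type, val)",
).filter("SubjectTotal = val").select("Subject", "type")
df3.show()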

Selecting 'Exclusive Rows' from a PySpark Dataframe

I have a PySpark dataframe like this:
+----------+-----+
|account_no|types|
+----------+-----+
| 1| K|
| 1| A|
| 1| S|
| 2| M|
| 2| D|
| 2| S|
| 3| S|
| 3| S|
| 4| M|
| 5| K|
| 1| S|
| 6| S|
+----------+-----+
and I am trying to pick the account numbers for which exclusively 'S' exists.
For example: even though '1' has type 'S', I will not pick it because it also has other types. But I will pick 3 and 6, because they have just the one type 'S'.
What I am doing right now is:
- First, get all accounts for which 'K' exists and remove them; in this example that removes '1' and '5'
- Second, find all accounts for which 'D' exists and remove them, which removes '2'
- Third, find all accounts for which 'M' exists and remove '4' ('2' also has 'M', but it was removed at step 2)
- Fourth, find all accounts for which 'A' exists and remove them
So now '1', '2', '4' and '5' are removed and I get '3' and '6', which have exclusively 'S'.
But this is a long process; how do I optimize it?
Thank you
Another alternative is to count distinct types over a window and then filter where the distinct count == 1 and types == 'S'. To preserve ordering, you can assign a monotonically increasing id and then orderBy it.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

W = Window.partitionBy('account_no')
out = (df.withColumn("idx", F.monotonically_increasing_id())
       .withColumn("Distinct", F.approx_count_distinct(F.col("types")).over(W))
       .orderBy("idx")
       .filter("Distinct == 1 AND types == 'S'")
       .drop('idx', 'Distinct'))
out.show()
out.show()
+----------+-----+
|account_no|types|
+----------+-----+
| 3| S|
| 3| S|
| 6| S|
+----------+-----+
One way to do this is to use Window functions. First we get a count of the 'S' entries in each account_no group. Then we compare that to the total number of entries for that group in the filter; if they match, we keep that account.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("account_no")
w1 = Window().partitionBy("account_no").orderBy("types")

df.withColumn("sum_S", F.sum(F.when(F.col("types") == 'S', F.lit(1)).otherwise(F.lit(0))).over(w))\
  .withColumn("total", F.max(F.row_number().over(w1)).over(w))\
  .filter('total=sum_S').drop("total", "sum_S").show()
#+----------+-----+
#|account_no|types|
#+----------+-----+
#| 6| S|
#| 3| S|
#| 3| S|
#+----------+-----+
You can simply count the number of distinct types an account has and then keep the 'S' accounts that have only one distinct type.
Here is my code for it:
from pyspark.sql.functions import col, countDistinct

data = [(1, 'k'),
        (1, 'a'),
        (1, 's'),
        (2, 'm'),
        (2, 'd'),
        (2, 's'),
        (3, 's'),
        (3, 's'),
        (4, 'm'),
        (5, 'k'),
        (1, 's'),
        (6, 's')]
df = spark.createDataFrame(data, ['account_no', 'types']).distinct()

exclusive_s_accounts = (df.groupBy('account_no').agg(countDistinct('types').alias('distinct_count'))
                        .join(df, 'account_no')
                        .where((col('types') == 's') & (col('distinct_count') == 1))
                        .drop('distinct_count'))
Another approach could be to collect all of an account's types into one column and then filter out the accounts that have any non-'S' value.
from pyspark.sql import functions as F
from pyspark.sql.functions import concat_ws, collect_list, col

df = spark.read.csv("/Users/Downloads/account.csv", header=True, inferSchema=True, sep=",")
type_df = df.groupBy("account_no").agg(
    concat_ws(",", collect_list("types")).alias("all_types")
).select(col("account_no"), col("all_types"))
type_df.show()
+----------+---------+
|account_no|all_types|
+----------+---------+
| 1| K,A,S,S|
| 6| S|
| 3| S,S|
| 5| K|
| 4| M|
| 2| M,D,S|
+----------+---------+
Further filtering using a regular expression (S_status is true when the account has any non-'S' type):
only_s_df = type_df.withColumn("S_status", F.col("all_types").rlike("K|A|M|D"))
only_s_df.show()
+----------+---------+----------+
|account_no|all_types|S_status |
+----------+---------+----------+
| 1| K,A,S,S| true|
| 6| S| false|
| 3| S,S| false|
| 5| K| true|
| 4| M| true|
| 2| M,D,S| true|
+----------+---------+----------+
Hopefully this way you can get the answer and do further processing.
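To finish the job, the boolean can be used to keep only the exclusive-'S' accounts; a sketch using the same regular expression as above:
# Keep only the accounts whose collected types contain no non-'S' letter.
exclusive_s = only_s_df.filter(~F.col("S_status")).select("account_no")
exclusive_s.show()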

How to create BinaryType Column using multiple columns of a pySpark Dataframe?

I have recently started working with PySpark, so I don't know many of the details yet.
I am trying to create a BinaryType column in a data frame, but I'm struggling to do it...
for example, let's take a simple df
df.show(2)
+----+----+
|col1|col2|
+----+----+
| "1"|null|
| "2"|"20"|
+----+----+
Now I want to have a third column "col3" with BinaryType, like:
+----+----+--------+
|col1|col2|col3    |
+----+----+--------+
| "1"|null|[1 null]|
| "2"|"20"|[2 20]  |
+----+----+--------+
How should I do it?
Try this:
from pyspark.sql import functions as F

a = [('1', None), ('2', '20')]
df = spark.createDataFrame(a, ['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| 1|null|
| 2| 20|
+----+----+
df = df.withColumn('col3', F.array(['col1', 'col2']))
df.show()
+----+----+-------+
|col1|col2| col3|
+----+----+-------+
| 1|null| [1,]|
| 2| 20|[2, 20]|
+----+----+-------+
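Note that array() yields an ArrayType column, not a true BinaryType. If the schema genuinely requires BinaryType (raw bytes), one option is a small UDF that packs the values into bytes. A sketch, assuming a simple comma-separated UTF-8 encoding is acceptable (nulls become the string "None" here and may need extra handling):
from pyspark.sql import functions as F
from pyspark.sql.types import BinaryType

# Pack the two string columns into a single bytes value.
pack = F.udf(lambda a, b: f"{a},{b}".encode("utf-8"), BinaryType())
df = df.withColumn("col3_bin", pack("col1", "col2"))
df.printSchema()  # col3_bin: binary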