PySpark: how to compare rows by hash from two data frames and group the result

I have the two data frames below, with a hash added as an additional column to identify differences for the same id across both data frames:
df1 =
name |department|state|id |hash
-----+----------+-----+---+----
James|Sales     |NY   |101|c123
Maria|Finance   |CA   |102|d234
Jen  |Marketing |NY   |103|df34
df2 =
name |department|state|id |hash
-----+----------+-----+---+----
James|Sales1    |null |101|4df2
Maria|Finance   |     |102|5rfg
Jen  |          |NY2  |103|2f34
# identify unmatched rows for the same id from both data frames
df1_un_match_indf2 = df1.join(df2, df1.hash == df2.hash, "leftanti")
df2_un_match_indf1 = df2.join(df1, df2.hash == df1.hash, "leftanti")
# The above lists all rows from both data frames, since every hash for the same id is different
Now I am trying to find the differences in row values for the same id between the df1_un_match_indf2 and df2_un_match_indf1 data frames, so that the differences are shown row by row:
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.join(df4, df3.id == df4.id, "inner")
common_diff.show()
but the result shows the differences like this:
+-----+----------+-----+---+-----+-----+----------+-----+---+-----+
|name |department|state|id |hash |name |department|state|id |hash |
+-----+----------+-----+---+-----+-----+----------+-----+---+-----+
|James|Sales     |NY   |101|c123 |James|Sales1    |null |101|4df2 |
|Maria|Finance   |CA   |102|d234 |Maria|Finance   |     |102|5rfg |
|Jen  |Marketing |NY   |103|df34 |Jen  |          |NY2  |103|2f34 |
+-----+----------+-----+---+-----+-----+----------+-----+---+-----+
What I am expecting is:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales1']   |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['103','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
I have tried different approaches but couldn't find the right solution to produce this expected format.
Can anyone suggest a solution or an idea for this?
Thanks

What you want to use is likely collect_list, or maybe collect_set.
The idea is illustrated by the following example:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext("local")
sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
(df
 .groupby("id")
 .agg(F.collect_set("code"),
      F.collect_list("name"))
 .show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+
In your case you need to slightly change your join into a union to enable you to group the data.
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.union(df4)

(common_diff
 .groupby("id")
 .agg(F.collect_set("name"),
      F.collect_list("department"))
 .show())
If you can't do a union, just use an array:
from pyspark.sql.functions import array

# assumes the joined columns were renamed to distinguish the two sides,
# e.g. thisState/thatState and thisDept/thatDept
common_diff.select(
    common_diff.id,
    array(
        common_diff.thisState,
        common_diff.thatState
    ).alias("State"),
    array(
        common_diff.thisDept,
        common_diff.thatDept
    ).alias("Department")
)
It's a lot more typing and a little more fragile. I suggest that renaming columns and using the groupby is likely cleaner and clearer.
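To get all the way to the expected output in the question, the same union + groupby pattern can collect every column into a list per id. A minimal sketch, assuming df3 and df4 are the unmatched rows defined above and share the same schema (name, department, state, id, hash):
from pyspark.sql import functions as F

# df3 and df4 hold the unmatched rows from df1 and df2 and have identical schemas
common_diff = df3.union(df4)

result = (common_diff
          .groupBy("id")
          .agg(F.collect_list("name").alias("name"),
               F.collect_list("department").alias("department"),
               F.collect_list("state").alias("state"),
               F.collect_list("hash").alias("hash")))

# note: collect_list drops nulls, so a null state ends up as a
# single-element array rather than ['NY', null]
result.show(truncate=False)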

Related

PySpark join dataframes with LIKE

I am trying to join dataframes using a LIKE expression in which the condition (the content of LIKE) is stored in a column. Is this possible in PySpark 2.3?
Source dataframe:
+---------+----------+
|firstname|middlename|
+---------+----------+
| James| |
| Michael| Rose|
| Robert| Williams|
| Maria| Anne|
+---------+----------+
Second dataframe
+---------+----+
|condition|dest|
+---------+----+
| %a%|Box1|
| %b%|Box2|
+---------+----+
Expected result:
+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
| James| | %a%|Box1|
| Michael| Rose| %a%|Box1|
| Robert| Williams| %b%|Box2|
| Maria| Anne| %a%|Box1|
+---------+----------+---------+----+
Let me reproduce the issue on the sample below.
Let's create a sample dataframe:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data = [("James", ""),
        ("Michael", "Rose"),
        ("Robert", "Williams"),
        ("Maria", "Anne")]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.show()
and the second one:
mapping = [("%a%", "Box1"), ("%b%", "Box2")]

schema = StructType([
    StructField("condition", StringType(), True),
    StructField("dest", StringType(), True)
])

map = spark.createDataFrame(data=mapping, schema=schema)
map.show()
If I am right, it is not possible to use LIKE when joining dataframes, so I created a crossJoin and tried to use a filter with like. But is it possible to take the condition from a column rather than a fixed string? The following is invalid syntax of course, but I am looking for another solution:
df.crossJoin(map).filter(df.firstname.like(map.condition)).show()
Any expression can be used as a join condition. It is true that with the DataFrame API the like function's parameter can only be a str, not a Column, so you can't write col("firstname").like(col("condition")). However, the SQL version does not have this limitation, so you can leverage expr:
from pyspark.sql.functions import expr
df.join(map, expr("firstname like condition")).show()
Or just plain SQL:
df.createOrReplaceTempView("df")
map.createOrReplaceTempView("map")
spark.sql("SELECT * FROM df JOIN map ON firstname like condition").show()
Both return the same result:
+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
| James| | %a%|Box1|
| Michael| Rose| %a%|Box1|
| Robert| Williams| %b%|Box2|
| Maria| Anne| %a%|Box1|
+---------+----------+---------+----+
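As a side note, the crossJoin attempt from the question also works once the condition is written with expr instead of the column-based like; a minimal sketch, assuming the same df and map dataframes as above:
from pyspark.sql.functions import expr

# keep the cross join, but evaluate the LIKE pattern stored in the condition column
df.crossJoin(map).filter(expr("firstname like condition")).show()
A non-equi join like this is typically planned as a (broadcast) nested loop join either way, so the expr-based join above is usually the cleaner choice.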

How to update a dataframe in PySpark with random values from another dataframe?

I have two dataframes in PySpark as below:
Dataframe A: total 1000 records
+-----+
|Name |
+-----+
| a|
| b|
| c|
+-----+
Dataframe B: Total 3 records
+-----+
|Zip |
+-----+
|06905|
|06901|
|06902|
+-----+
I need to add a new column named Zip to Dataframe A and populate it with values randomly selected from Dataframe B, so that Dataframe A will look something like this:
+-----+-----+
|Name |Zip |
+-----+-----+
| a|06901|
| b|06905|
| c|06902|
| d|06902|
+-----+-----+
I am running this on Azure Databricks, and apparently quinn isn't available there, so unfortunately I can't use quinn.
If b is small (3 rows), you can just collect it into a Python list and add it as an array column to a. Then you can get a random element using shuffle.
import pyspark.sql.functions as F

df = a.withColumn(
    'Zip',
    F.shuffle(
        F.array(*[F.lit(r[0]) for r in b.collect()])
    )[0]
)
df.show()
+----+-----+
|Name| Zip|
+----+-----+
| a|06901|
| b|06905|
| c|06902|
| d|06901|
+----+-----+
You can aggregate the dataframe with the zips and collect the values into one array column, then do a cross join and select a random element from the array of zips, for example by shuffling the array before picking the first element:
from pyspark.sql import functions as F

df_result = df_a.crossJoin(
    df_b.agg(F.collect_list("Zip").alias("Zip"))
).withColumn(
    "Zip",
    F.expr("shuffle(Zip)[0]")
)
#+----+-----+
#|Name| Zip|
#+----+-----+
#| a|06901|
#| b|06902|
#| c|06901|
#| d|06901|
#+----+-----+

Storing values of multiple columns in a PySpark dataframe under a new column

I am importing data from a csv file with columns Reading A and Reading B and storing it in a PySpark dataframe.
My objective is to have a new column named Reading whose value is an array containing the values of Reading A and Reading B. How can I achieve this in PySpark?
+---+----------+----------+
| id| Reading A| Reading B|
+---+----------+----------+
| 01|     0.123|     0.145|
| 02|     0.546|     0.756|
+---+----------+----------+
Desired Output:
+---+--------------+
| id|       Reading|
+---+--------------+
| 01|[0.123, 0.145]|
| 02|[0.546, 0.756]|
+---+--------------+
Try this:
import pyspark.sql.functions as f

df = df.withColumn('Reading', f.array(f.col("Reading A"), f.col("Reading B")))
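A quick check of the result on the sample data from the question (this assumes the dataframe is named df and the columns are Reading A and Reading B as shown above):
df.select("id", "Reading").show()

# expected output, matching the desired output above:
# +---+--------------+
# | id|       Reading|
# +---+--------------+
# | 01|[0.123, 0.145]|
# | 02|[0.546, 0.756]|
# +---+--------------+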

Merge two columns with different structures in Hive

I have loaded a parquet file and created a data frame as shown below:
+------+-------------------------+--------------------------------------+
| time | data1                   | data2                                |
+------+-------------------------+--------------------------------------+
| 1-40 | [lion -> 34, bear -> 2] | [monkey -> [9, 23], goose -> [4, 5]] |
+------+-------------------------+--------------------------------------+
So, the data type of the data1 column is a string->integer map, whereas the data type of the data2 column is a string->array map.
I want to explode the above data frame into the structure below:
+------+----------+-----+
| time | key      | val |
+------+----------+-----+
| 1-40 | lion     | 34  |
| 1-40 | bear     | 2   |
| 1-40 | monkey_0 | 9   |
| 1-40 | monkey_1 | 23  |
| 1-40 | goose_0  | 4   |
| 1-40 | goose_1  | 5   |
+------+----------+-----+
I tried to convert both data1 and data2 into the same datatype (a string->array map) by using UDFs in PySpark and then exploded the column, as shown below:
from pyspark.sql.functions import udf, explode, col
from pyspark.sql.types import MapType, StringType, ArrayType, IntegerType

def to_map(col1, col2):
    for i in col1.keys():
        col2[i] = [col1[i]]
    return col2

caster = udf(to_map, MapType(StringType(), ArrayType(IntegerType())))
pm_df = pm_df.withColumn("animals", caster('data1', 'data2'))
pm_df.select('time', explode(col('animals')))
I also tried using Hive SQL, assuming that Hive SQL performs better than PySpark UDFs.
import datetime

fields = ["time", "data1", "data2"]  # column names for the sample data
rdd = spark.sparkContext.parallelize([
    [datetime.datetime.now(), {'lion': 34, 'bear': 2}, {'monkey': [9, 23], 'goose': [4, 5]}]
])
df = rdd.toDF(fields)
df.createOrReplaceTempView("df")
df = spark.sql("select time, explode(data1), data2 from df")
df.createOrReplaceTempView("df")
spark.sql("select time, key as animal, value, posexplode(data2) from df").show(truncate=False)
But I am stuck with the result below and don't know how to merge the split columns as per my requirement. The output of the above Hive SQL is:
+--------------------------+------+-----+---+------+-------+
|time |animal|value|pos|key |value |
+--------------------------+------+-----+---+------+-------+
|2019-06-12 19:23:00.169739|bear |2 |0 |goose |[4, 5] |
|2019-06-12 19:23:00.169739|bear |2 |1 |monkey|[9, 23]|
|2019-06-12 19:23:00.169739|lion |34 |0 |goose |[4, 5] |
|2019-06-12 19:23:00.169739|lion |34 |1 |monkey|[9, 23]|
+--------------------------+------+-----+---+------+-------+
I know that when using Python UDFs there is a lot of overhead for communication between the Python process and the JVM. Is there any way to achieve the above expected result using built-in functions or Hive SQL?
I would process data1 and data2 separately and then union the resultset:
from pyspark.sql import functions as F
df1 = df.select('time', F.explode('data1').alias('key', 'value'))
>>> df1.show()
#+--------------------+----+-----+
#| time| key|value|
#+--------------------+----+-----+
#|2019-06-12 20:19:...|bear| 2|
#|2019-06-12 20:19:...|lion| 34|
#+--------------------+----+-----+
df2 = df.select('time', F.explode('data2').alias('key', 'values')) \
        .select('time', 'key', F.posexplode('values').alias('pos', 'value')) \
        .select('time', F.concat('key', F.lit('_'), 'pos').alias('key'), 'value')
>>> df2.show()
#+--------------------+--------+-----+
#| time| key|value|
#+--------------------+--------+-----+
#|2019-06-12 20:19:...| goose_0| 4|
#|2019-06-12 20:19:...| goose_1| 5|
#|2019-06-12 20:19:...|monkey_0| 9|
#|2019-06-12 20:19:...|monkey_1| 23|
#+--------------------+--------+-----+
df_new = df1.union(df2)
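Optionally, the combined result can be sorted so the rows come out in a stable order, since union itself does not guarantee ordering; a small usage sketch:
# sort purely for presentation; adjust the sort keys as needed
df_new.orderBy('time', 'key').show(truncate=False)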

Adding a column to a PySpark dataframe containing standard deviations of a column based on grouping by two other columns

Suppose that we have a csv file which has been imported as a dataframe in PySpark as follows:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("file path and name.csv", inferSchema = True, header = True)
df.show()
output
+-----+----+----+
|lable|year|val |
+-----+----+----+
| A|2003| 5.0|
| A|2003| 6.0|
| A|2003| 3.0|
| A|2004|null|
| B|2000| 2.0|
| B|2000|null|
| B|2009| 1.0|
| B|2000| 6.0|
| B|2009| 6.0|
+-----+----+----+
Now, we want to add another column to df which contains the standard deviation of val based on grouping by the two columns lable and year. So the output must be as follows:
+-----+----+----+-----+
|lable|year|val | std |
+-----+----+----+-----+
| A|2003| 5.0| 1.53|
| A|2003| 6.0| 1.53|
| A|2003| 3.0| 1.53|
| A|2004|null| null|
| B|2000| 2.0| 2.83|
| B|2000|null| 2.83|
| B|2009| 1.0| 3.54|
| B|2000| 6.0| 2.83|
| B|2009| 6.0| 3.54|
+-----+----+----+-----+
I have the following code, which works for a small dataframe, but it does not work for the very large dataframe (about 40 million rows) that I am working with now.
import pyspark.sql.functions as f
a = df.groupby('lable','year').agg(f.round(f.stddev("val"),2).alias('std'))
df = df.join(a, on = ['lable', 'year'], how = 'inner')
I get a Py4JJavaError Traceback (most recent call last) error after running it on my large dataframe.
Does anyone know an alternative way? I hope it works on my dataset.
I am using Python 3.7.1, PySpark 2.4, and Jupyter 4.4.0.
The join on the dataframe causes a lot of data shuffling between executors. In your case, you can do without the join.
Use a window specification to partition data by 'lable' and 'year' and aggregate on the window.
import pyspark.sql.functions as f
from pyspark.sql.window import Window

windowSpec = Window.partitionBy('lable', 'year') \
                   .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn("std", f.round(f.stddev("val").over(windowSpec), 2))
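As a side note, when a window specification has no orderBy, the frame already defaults to the whole partition, so the rowsBetween clause can likely be dropped; a minimal sketch with the same column names:
# same result under the default (unbounded) frame of an un-ordered window
windowSpec = Window.partitionBy('lable', 'year')
df = df.withColumn("std", f.round(f.stddev("val").over(windowSpec), 2))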