PySpark join on pipe-separated column - apache-spark-sql

I have two data frames which I want to join. The catch is that one of the tables has a pipe-separated string, and one of the values in that string is what I want to join on. How do I do it in PySpark? Below is an example.
TABLE A has
+-------+--------------------+
|id | name |
+-------+--------------------+
| 613760|123|test|test2 |
| 613740|456|ABC |
| 598946|OMG|567 |
TABLE B has
+-------+--------------------+
|join_id| prod_type|
+-------+--------------------+
| 123 |Direct De |
| 456 |Direct |
| 567 |In |
Expected result: join Table A and Table B when one of the pipe-separated values in Table A's name matches Table B's join_id. For instance, for TableA.id 613760 the name contains 123, so I want to join with Table B's join_id 123; likewise 456 and 567.
Resultant Table
+--------------------+-------+
| name |join_Id|
+--------------------+-------+
|123|test|test2 |123 |
|456|ABC |456 |
|OMG|567 |567 |
Can someone help me solve this? I am relatively new to PySpark and I am learning.

To solve your problem you need to:
split those pipe-separated strings,
then explode the resulting values into separate rows. posexplode does that for you: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.posexplode
From there an inner join and
finally a select will do the rest of the trick.
See the code below:
import pyspark.sql.functions as f
# First, create the dataframes to test the solution
table_A = spark.createDataFrame([(613760, '123|test|test2' ), (613740, '456|ABC'), (598946, 'OMG|567' )], ["id", "name"])
# +-------+--------------------+
# |id | name |
# +-------+--------------------+
# | 613760|123|test|test2 |
# | 613740|456|ABC |
# | 598946|OMG|567 |
table_B = spark.createDataFrame([('123', 'Direct De' ), ('456', 'Direct'), ('567', 'In' )], ["join_id", "prod_type"])
# +-------+--------------------+
# |join_id| prod_type|
# +-------+--------------------+
# | 123 |Direct De |
# | 456 |Direct |
# | 567 |In |
result = table_A \
    .select(
        'name',
        f.posexplode(f.split(f.col('name'), r'\|')).alias('pos', 'join_id')) \
    .join(table_B, on='join_id', how='inner') \
    .select('name', 'join_id')
result.show(10, False)
# +--------------+-------+
# |name |join_id|
# +--------------+-------+
# |123|test|test2|123 |
# |456|ABC |456 |
# |OMG|567 |567 |
# +--------------+-------+
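For intuition, the same split, posexplode, and inner-join steps can be sketched in plain Python (a toy illustration using the question's sample data, not Spark code):

```python
# Toy illustration of the Spark pipeline: split each name on "|",
# explode the tokens with their positions, then inner-join on table B's keys.
table_a = [(613760, "123|test|test2"), (613740, "456|ABC"), (598946, "OMG|567")]
table_b = {"123": "Direct De", "456": "Direct", "567": "In"}

exploded = []
for _id, name in table_a:
    for pos, token in enumerate(name.split("|")):  # split + posexplode
        exploded.append((name, pos, token))

# inner join on join_id, then "select" name and join_id
result = [(name, token) for name, pos, token in exploded if token in table_b]
# result: [('123|test|test2', '123'), ('456|ABC', '456'), ('OMG|567', '567')]
```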
Hope that works. As you continue getting better at PySpark, I would recommend going through the functions in pyspark.sql.functions; that will take your skills to the next level.

Related

PySpark lead based on condition

I have a dataset such as:
Condition | Date
0 | 2019/01/10
1 | 2019/01/11
0 | 2019/01/15
1 | 2019/01/16
1 | 2019/01/19
0 | 2019/01/23
0 | 2019/01/25
1 | 2019/01/29
1 | 2019/01/30
I would like to get the next value of the date column when condition == 1 was met.
The desired output would be something like:
Condition | Date | Lead
0 | 2019/01/10 | 2019/01/15
1 | 2019/01/11 | 2019/01/16
0 | 2019/01/15 | 2019/01/23
1 | 2019/01/16 | 2019/01/19
1 | 2019/01/19 | 2019/01/29
0 | 2019/01/23 | 2019/01/25
0 | 2019/01/25 | NaN
1 | 2019/01/29 | 2019/01/30
1 | 2019/01/30 | NaN
How can I perform that?
Please keep in mind it's a very large dataset, which I will have to partition and group by a UUID, so the solution has to be somewhat performant.
To get the next value of the date column where condition == 1, we can use the first window function over the following rows, combined with when() to keep only the condition == 1 dates; this emulates a conditional lead.
import sys
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

data_sdf. \
    withColumn('dt_w_cond1_lead',
               func.first(func.when(func.col('cond') == 1, func.col('dt')), ignorenulls=True).
               over(wd.partitionBy().orderBy('dt').rowsBetween(1, sys.maxsize))). \
    show()
# +----+----------+---------------+
# |cond| dt|dt_w_cond1_lead|
# +----+----------+---------------+
# | 0|2019-01-10| 2019-01-11|
# | 1|2019-01-11| 2019-01-16|
# | 0|2019-01-15| 2019-01-16|
# | 1|2019-01-16| 2019-01-19|
# | 1|2019-01-19| 2019-01-29|
# | 0|2019-01-23| 2019-01-29|
# | 0|2019-01-25| 2019-01-29|
# | 1|2019-01-29| 2019-01-30|
# | 1|2019-01-30| null|
# +----+----------+---------------+
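What the first/when combination computes can be mimicked in plain Python: for each row, take the first later date whose condition is 1 (a sketch of the window semantics, not Spark code):

```python
# For each row, find the first later date where cond == 1; this is what
# first(when(cond == 1, dt), ignorenulls=True) over the following rows returns.
rows = [(0, "2019-01-10"), (1, "2019-01-11"), (0, "2019-01-15"),
        (1, "2019-01-16"), (1, "2019-01-19"), (0, "2019-01-23"),
        (0, "2019-01-25"), (1, "2019-01-29"), (1, "2019-01-30")]

leads = [next((dt for c, dt in rows[i + 1:] if c == 1), None)
         for i in range(len(rows))]
# leads[0] == "2019-01-11", leads[-1] is None
```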
You can use the window function lead. As you said in the question, for more performance you will need more partitions; here the window is partitioned by Condition.
Input:
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [(0, '2019/01/10'),
     (1, '2019/01/11'),
     (0, '2019/01/15'),
     (1, '2019/01/16'),
     (1, '2019/01/19'),
     (0, '2019/01/23'),
     (0, '2019/01/25'),
     (1, '2019/01/29'),
     (1, '2019/01/30')],
    ['Condition', 'Date'])
Script:
w = W.partitionBy('Condition').orderBy('Date')
df = df.withColumn('Lead', F.lead('Date').over(w))
df.show()
# +---------+----------+----------+
# |Condition| Date| Lead|
# +---------+----------+----------+
# | 0|2019/01/10|2019/01/15|
# | 0|2019/01/15|2019/01/23|
# | 0|2019/01/23|2019/01/25|
# | 0|2019/01/25| null|
# | 1|2019/01/11|2019/01/16|
# | 1|2019/01/16|2019/01/19|
# | 1|2019/01/19|2019/01/29|
# | 1|2019/01/29|2019/01/30|
# | 1|2019/01/30| null|
# +---------+----------+----------+
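The partitioned lead can likewise be mimicked in plain Python: within each Condition group, each date's lead is simply the next date in that group (a sketch of the semantics, not Spark code):

```python
# lead('Date') over a window partitioned by Condition and ordered by Date:
# within each condition group, pair each date with the next one.
rows = [(0, "2019/01/10"), (1, "2019/01/11"), (0, "2019/01/15"),
        (1, "2019/01/16"), (1, "2019/01/19"), (0, "2019/01/23"),
        (0, "2019/01/25"), (1, "2019/01/29"), (1, "2019/01/30")]

leads = {}
for cond in (0, 1):
    dates = sorted(d for c, d in rows if c == cond)
    for cur, nxt in zip(dates, dates[1:] + [None]):
        leads[(cond, cur)] = nxt
# e.g. leads[(0, "2019/01/10")] == "2019/01/15"
```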

Pyspark get rows with max value for a column over a window

I have a dataframe as follows:
| created | id | date |value|
| 1650983874871 | x | 2020-05-08 | 5 |
| 1650367659030 | x | 2020-05-08 | 3 |
| 1639429213087 | x | 2020-05-08 | 2 |
| 1650983874871 | x | 2020-06-08 | 5 |
| 1650367659030 | x | 2020-06-08 | 3 |
| 1639429213087 | x | 2020-06-08 | 2 |
I want to get max of created for every date.
The table should look like :
| created | id | date |value|
| 1650983874871 | x | 2020-05-08 | 5 |
| 1650983874871 | x | 2020-06-08 | 5 |
I tried:
df2 = (
    df
    .groupby(['id', 'date'])
    .agg(
        F.max(F.col('created')).alias('created_max')
    )
)
df3 = df.join(df2, on=['id', 'date'], how='left')
But this is not working as expected.
Can anyone help me?
You need to make two changes.
The join condition needs to include created as well. Here I have changed the alias to alias("created") to make the join easier. This ensures a unique join condition (provided there are no duplicate created values).
The join type must be inner.
df2 = (
    df
    .groupby(['id', 'date'])
    .agg(
        F.max(F.col('created')).alias('created')
    )
)
df3 = df.join(df2, on=['id', 'date', 'created'], how='inner')
df3.show()
df3.show()
+---+----------+-------------+-----+
| id| date| created|value|
+---+----------+-------------+-----+
| x|2020-05-08|1650983874871| 5|
| x|2020-06-08|1650983874871| 5|
+---+----------+-------------+-----+
Instead of using the group by and joining, you can also use the Window in pyspark.sql:
from pyspark.sql import functions as func
from pyspark.sql.window import Window
df = df\
    .withColumn('max_created', func.max('created').over(Window.partitionBy('date', 'id')))\
    .filter(func.col('created') == func.col('max_created'))\
    .drop('max_created')
Steps:
Get the max value over the Window.
Filter the rows by matching created against that maximum.
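The max-over-window-then-filter semantics can be sketched in plain Python, using the question's sample rows (a toy illustration, not Spark code):

```python
rows = [(1650983874871, "x", "2020-05-08", 5),
        (1650367659030, "x", "2020-05-08", 3),
        (1639429213087, "x", "2020-05-08", 2),
        (1650983874871, "x", "2020-06-08", 5),
        (1650367659030, "x", "2020-06-08", 3),
        (1639429213087, "x", "2020-06-08", 2)]

# max('created') over a window partitioned by (id, date)
group_max = {}
for created, id_, date, _ in rows:
    key = (id_, date)
    group_max[key] = max(group_max.get(key, created), created)

# filter: keep only the rows matching their group's max
kept = [r for r in rows if r[0] == group_max[(r[1], r[2])]]
# kept has one row per (id, date), each with created == 1650983874871
```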

Check the elements in two columns of different dataframes

I have two dataframes.
Df1
Id | Name | Remarks
---------------------
1 | A | Not bad
1 | B | Good
2 | C | Very bad
Df2
Id | Name | Place |Job
-----------------------
1 | A | Can | IT
2 |C | Cbe | CS
4 |L | anc | ME
5 | A | cne | IE
Output
Id | Name | Remarks |Results
------------------------------
1 | A | Not bad |True
1 | B | Good |False
2 | C | Very bad |True
That is, the result should be True if the same Id and Name are present in both dataframes. I tried
df1['Results']=np.where(Df1['id','Name'].isin(Df2['Id','Name']),'true','false')
But it was not successful.
Use DataFrame.merge with indicator parameter and compare both values:
df = Df1[['Id', 'Name']].merge(Df2[['Id', 'Name']], indicator='Results', how='left')
df['Results'] = df['Results'].eq('both')
Your solution is possible by comparing index values with DataFrame.set_index and Index.isin:
Df1['Results'] = Df1.set_index(['Id', 'Name']).index.isin(Df2.set_index(['Id', 'Name']).index)
Or compare tuples from both columns:
Df1['Results'] = Df1[['Id', 'Name']].agg(tuple, 1).isin(Df2[['Id', 'Name']].agg(tuple, 1))
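All of these variants implement the same set-membership test: a row is True exactly when its (Id, Name) pair also occurs in Df2. A plain-Python sketch of that semantics (toy variable names, no pandas):

```python
# (Id, Name) pairs from the question's Df1 and Df2
df1_rows = [(1, "A"), (1, "B"), (2, "C")]
df2_pairs = {(1, "A"), (2, "C"), (4, "L"), (5, "A")}

# True when the pair from Df1 also appears in Df2
results = [pair in df2_pairs for pair in df1_rows]
# results: [True, False, True]
```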
You can easily achieve this by merge like @jezrael's answer.
You can also achieve it with np.where, a list comprehension, and zip like below:
df1['Results'] = np.where([str(i)+'_'+str(j) == str(k)+'_'+str(l) for i, j, k, l in zip(Df1['Id'], Df1['Name'], Df2['Id'], Df2['Name'])], True, False)

Using PySpark window functions with conditions to add rows

I need to be able to add new rows to a PySpark df, with values based upon the contents of other rows with a common id. There will eventually be millions of ids with lots of rows for each id. I have tried the method below, which works but seems overly complicated.
I start with a df in the format below (but in reality have more columns):
+-------+----------+-------+
| id | variable | value |
+-------+----------+-------+
| 1 | varA | 30 |
| 1 | varB | 1 |
| 1 | varC | -9 |
+-------+----------+-------+
Currently I am pivoting this df to get it in the following format:
+-----+------+------+------+
| id | varA | varB | varC |
+-----+------+------+------+
| 1 | 30 | 1 | -9 |
+-----+------+------+------+
On this df I can then use the standard withColumn and when functionality to add new columns based on the values in other columns. For example:
df = df.withColumn("varD", when((col("varA") > 16) & (col("varC") != -9), 2).otherwise(1))
Which leads to:
+-----+------+------+------+------+
| id | varA | varB | varC | varD |
+-----+------+------+------+------+
| 1 | 30 | 1 | -9 | 1 |
+-----+------+------+------+------+
I can then pivot this df back to the original format leading to this:
+-------+----------+-------+
| id | variable | value |
+-------+----------+-------+
| 1 | varA | 30 |
| 1 | varB | 1 |
| 1 | varC | -9 |
| 1 | varD | 1 |
+-------+----------+-------+
This works but seems like it could, with millions of rows, lead to expensive and unnecessary operations. It feels like it should be doable without the need to pivot and unpivot the data. Do I need to do this?
I have read about Window functions and it sounds as if they may be another way to achieve the same result but to be honest I am struggling to get started with them. I can see how they can be used to generate a value, say a sum, for each id, or to find a maximum value but have not found a way to even get started on applying complex conditions that lead to a new row.
Any help to get started with this problem would be gratefully received.
You can use a pandas_udf to add or delete rows/columns on grouped data, and implement your processing logic in the pandas UDF.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

row_schema = StructType(
    [StructField("id", IntegerType(), True),
     StructField("variable", StringType(), True),
     StructField("value", IntegerType(), True)]
)

@F.pandas_udf(row_schema, F.PandasUDFType.GROUPED_MAP)
def addRow(pdf):
    val = 1
    if (len(pdf.loc[(pdf['variable'] == 'varA') & (pdf['value'] > 16)]) > 0) & \
       (len(pdf.loc[(pdf['variable'] == 'varC') & (pdf['value'] != -9)]) > 0):
        val = 2
    return pdf.append(pd.Series([1, 'varD', val], index=['id', 'variable', 'value']),
                      ignore_index=True)

df = spark.createDataFrame([[1, 'varA', 30],
                            [1, 'varB', 1],
                            [1, 'varC', -9]
                            ], schema=['id', 'variable', 'value'])
df.groupBy("id").apply(addRow).show()
which results in:
+---+--------+-----+
| id|variable|value|
+---+--------+-----+
| 1| varA| 30|
| 1| varB| 1|
| 1| varC| -9|
| 1| varD| 1|
+---+--------+-----+
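The per-group logic inside the UDF can be sketched in plain Python: given one id's (variable, value) pairs, append varD = 2 when varA > 16 and varC != -9, otherwise 1 (an illustration of the rule, assuming each variable appears at most once per id; the helper name is made up):

```python
def add_var_d(rows):
    """rows: list of (variable, value) pairs for a single id."""
    vals = dict(rows)
    # defaults make a missing varA/varC row behave as "condition not met"
    var_d = 2 if vals.get("varA", 0) > 16 and vals.get("varC", -9) != -9 else 1
    return rows + [("varD", var_d)]

out = add_var_d([("varA", 30), ("varB", 1), ("varC", -9)])
# out[-1] == ("varD", 1) because varC == -9
```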

Transpose rows into columns

I have a requirement to transpose rows into columns. There are 2 tables (shown below). Each record in the product table matches 0, 1, or 2 records in the product_segment table. There can be 2 types of products: HOS and AMB. The requirement is to populate the segment values into their corresponding 2 columns (one for HOS and one for AMB) in the target, based on the product type.
Populate the value for HOS_segment or AMB_segment in the target based on whichever corresponding product-type record exists in the source. If both record types are present, populate both fields in the output; otherwise populate only the one that exists.
Assume the tables as :
Product:
product_id | eff_date
12345 | 10/01/2018
75852 | 22/05/2018
33995 | 15/02/2019
product_segment:
product_id | segment | type
12345 | KA | HOS
12345 | HM | AMB
75852 | GB | HOS
33995 | HD | AMB
Expected output:
product_id | eff_date | HOS_segment | AMB_segment
12345 | 10/01/2018 | KA | HM
75852 | 22/05/2018 | GB | Null
33995 | 15/02/2019 | Null | HD
For product 12345 both HOS and AMB records exists hence, in the output both the columns get populated with their corresponding segments.
For product 75852 only the HOS record exists, hence, HOS_segment gets populated but AMB_segment gets Null
And finally just the opposite happens for product 33995. AMB_segment gets populated but HOS_segment gets Null
Can anyone please help me solve this?
Instead of using joins and where, I would suggest a single join with pivot. Here is the code snippet; have a look.
>>> import pyspark.sql.functions as F
>>> df1= spark.createDataFrame([[12345,"10/01/2018"],[75852,"10/01/2018"],[33995,"10/01/2018"]],["product_id","eff_date"])
>>> df1.show()
+----------+----------+
|product_id| eff_date|
+----------+----------+
| 12345|10/01/2018|
| 75852|10/01/2018|
| 33995|10/01/2018|
+----------+----------+
>>> df2 = spark.createDataFrame([[12345,"KA","HOS"],[12345,"HM","AMB"],[75852,"GB","HOS"],[33995,"HD","AMB"]],["product_id","Segment","type"])
>>> df2.show()
+----------+-------+----+
|product_id|Segment|type|
+----------+-------+----+
| 12345| KA| HOS|
| 12345| HM| AMB|
| 75852| GB| HOS|
| 33995| HD| AMB|
+----------+-------+----+
>>> df1.join(df2,df1.product_id ==df2.product_id,"inner").groupBy(df2.product_id,df1.eff_date).pivot("type").agg(F.first(df2.Segment)).show()
+----------+----------+----+----+
|product_id| eff_date| AMB| HOS|
+----------+----------+----+----+
| 12345|10/01/2018| HM| KA|
| 33995|10/01/2018| HD|null|
| 75852|10/01/2018|null| GB|
+----------+----------+----+----+
Spark-sql 2.4+
>>> df1.registerTempTable("df1_temp")
>>> df2.registerTempTable("df2_temp")
>>> spark.sql("select * from(select a.*,b.segment,b.type from df1_temp a inner join df2_temp b on a.product_id =b.product_id) PIVOT( first(segment) for type in ('HOS' HOS_segment,'AMB' AMB_Segment )) " ).show()
+----------+----------+-----------+-----------+
|product_id| eff_date|HOS_segment|AMB_Segment|
+----------+----------+-----------+-----------+
| 12345|10/01/2018| KA| HM|
| 33995|10/01/2018| null| HD|
| 75852|10/01/2018| GB| null|
+----------+----------+-----------+-----------+
I hope it will help you. Let me know if you have any questions about it.
You can use a join with a filtered segment table.
import pyspark.sql.functions as F
product \
    .join(product_segment.where("type = 'HOS'").select("product_id", F.col("segment").alias("HOS_segment")), "product_id", "left_outer") \
    .join(product_segment.where("type = 'AMB'").select("product_id", F.col("segment").alias("AMB_segment")), "product_id", "left_outer")
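The effect of either approach (pivot, or two filtered left joins) can be sketched in plain Python: for each product, look up the HOS and AMB segments if they exist (a toy illustration using the question's sample tables, not Spark code):

```python
products = [(12345, "10/01/2018"), (75852, "22/05/2018"), (33995, "15/02/2019")]
segments = [(12345, "KA", "HOS"), (12345, "HM", "AMB"),
            (75852, "GB", "HOS"), (33995, "HD", "AMB")]

# index segments by (product_id, type), then "left join" twice
seg_by_type = {(pid, typ): seg for pid, seg, typ in segments}
result = [(pid, dt, seg_by_type.get((pid, "HOS")), seg_by_type.get((pid, "AMB")))
          for pid, dt in products]
# result[0] == (12345, "10/01/2018", "KA", "HM")
```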