PySpark join dataframes with LIKE

I am trying to join dataframes using a LIKE expression in which the condition (the content of the LIKE) is stored in a column. Is this possible in PySpark 2.3?
Source dataframe:
+---------+----------+
|firstname|middlename|
+---------+----------+
| James| |
| Michael| Rose|
| Robert| Williams|
| Maria| Anne|
+---------+----------+
Second dataframe
+---------+----+
|condition|dest|
+---------+----+
| %a%|Box1|
| %b%|Box2|
+---------+----+
Expected result:
+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
| James| | %a%|Box1|
| Michael| Rose| %a%|Box1|
| Robert| Williams| %b%|Box2|
| Maria| Anne| %a%|Box1|
+---------+----------+---------+----+
Let me reproduce the issue with the sample below.
Let's create a sample dataframe:
from pyspark.sql.types import StructType, StructField, StringType

data = [("James", ""),
        ("Michael", "Rose"),
        ("Robert", "Williams"),
        ("Maria", "Anne")]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.show()
and the second one:
mapping = [("%a%", "Box1"), ("%b%", "Box2")]

schema = StructType([
    StructField("condition", StringType(), True),
    StructField("dest", StringType(), True)
])

map = spark.createDataFrame(data=mapping, schema=schema)
map.show()
If I am right, it is not possible to use LIKE directly in a dataframe join, so I created a crossJoin and tried to use a filter with like. But is it possible to take the pattern from a column rather than a fixed string? The following is invalid syntax, of course, but it shows what I am looking for:
df.crossJoin(map).filter(df.firstname.like(map.condition)).show()

Any expression can be used as a join condition. It is true that with the DataFrame API the like function's parameter can only be a str, not a Column, so you can't write col("firstname").like(col("condition")). However, the SQL version does not have this limitation, so you can leverage expr:
from pyspark.sql.functions import expr

df.join(map, expr("firstname like condition")).show()
Or just plain SQL:
df.createOrReplaceTempView("df")
map.createOrReplaceTempView("map")
spark.sql("SELECT * FROM df JOIN map ON firstname like condition").show()
Both return the same result:
+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
| James| | %a%|Box1|
| Michael| Rose| %a%|Box1|
| Robert| Williams| %b%|Box2|
| Maria| Anne| %a%|Box1|
+---------+----------+---------+----+
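A side note on performance: a LIKE condition is a non-equi join, so Spark typically plans it as a cartesian product or nested-loop join. If the mapping dataframe is small, as it is here, an explicit broadcast hint keeps that cheap. A minimal sketch, assuming the df and map dataframes created above:
from pyspark.sql.functions import broadcast, expr

# Broadcast the small mapping table so each executor evaluates the
# LIKE condition locally instead of shuffling the large dataframe.
df.join(broadcast(map), expr("firstname like condition")).show()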

Related

Pyspark how to compare row by row based on hash from two data frame and group the result

I have the two data frames below, with a hash added as an additional column to identify differences for the same id across both data frames.
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen | |NY2 |103|234
# identify unmatched rows for the same id from both data frames
df1_un_match_indf2=df1.join(df2,df1.hash==df2.hash,"leftanti")
df2_un_match_indf1=df2.join(df1,df2.hash==df1.hash,"leftanti")
# The above lists rows from both data frames, since all hashes for the same id are different
Now I am trying to find the differences in row values for the same id from the df1_un_match_indf2 and df2_un_match_indf1 data frames, so that the differences show up row by row:
df3=df1_un_match_indf2
df4=df2_un_match_indf1
common_diff=df3.join(df4,df3.id==df4.id,"inner")
common_diff.show()
but the result shows the differences like this:
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|name |department|state|id |hash|name |department|state|id |hash|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|James|Sales     |NY   |101|c123|James|Sales1    |null |101|4df2|
|Maria|Finance   |CA   |102|d234|Maria|Finance   |     |102|5rfg|
|Jen  |Marketing |NY   |103|df34|Jen  |          |NY2  |103|2f34|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
What I am expecting is:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales']    |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['102','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
I tried different ways, but didn't find the right solution to produce this expected format.
Can anyone give a solution or an idea for this?
Thanks
What you want to use is likely collect_list or maybe collect_set.
This is illustrated well by the following example:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext("local")
sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])

df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
(df
 .groupby("id")
 .agg(F.collect_set("code"),
      F.collect_list("name"))
 .show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+
In your case, you need to change your join into a union so that you can group the data.
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.union(df4)

(common_diff
 .groupby("id")
 .agg(F.collect_set("name"),
      F.collect_list("department"))
 .show())
If you can't do a union, just use an array:
from pyspark.sql.functions import array

common_diff.select(
    df.id,
    array(
        common_diff.thisState,
        common_diff.thatState
    ).alias("State"),
    array(
        common_diff.thisDept,
        common_diff.thatDept
    ).alias("Department")
)
It's a lot more typing and a little more fragile. I suggest that renaming the columns and using the groupby is likely cleaner and clearer.
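For completeness, here is a rough sketch of the join-plus-array variant applied to the columns from the question. The this*/that* names are just illustrative aliases introduced here, not columns that exist in the original data frames:
from pyspark.sql import functions as F

# Rename the columns on each side so the joined result has unambiguous names.
left = df3.select([F.col(c).alias("this" + c) for c in df3.columns])
right = df4.select([F.col(c).alias("that" + c) for c in df4.columns])

joined = left.join(right, left.thisid == right.thatid, "inner")

# Pair the two versions of each attribute in a single array column.
result = joined.select(
    F.array("thisname", "thatname").alias("name"),
    F.array("thisdepartment", "thatdepartment").alias("department"),
    F.array("thisstate", "thatstate").alias("state"),
    F.array("thisid", "thatid").alias("id"),
    F.array("thishash", "thathash").alias("hash"),
)
result.show(truncate=False)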

How to do reshape operation on a single column in pyspark dataframe?

I have a long pyspark dataframe as shown below:
+------+
|number|
+------+
|12.4 |
|13.4 |
|42.3 |
|33.4 |
|42.3 |
|32.4 |
|44.2 |
|12.3 |
|45.4 |
+------+
Ideally, I want this to be reshaped into an n x n matrix, where n is the square root of the length of the pyspark dataframe.
There is a solution that converts it into a numpy array and then reshapes it into an n x n matrix, but I want this to be done in PySpark, because my data is very long (about 100 million rows).
So the expected output I am looking for is something along these lines:
+------+------+------+
|12.4 | 13.4 | 42.3 |
|33.4 | 42.3 | 32.4 |
|44.2 | 12.3 | 45.4 |
+------+------+------+
I was able to do it by converting to pandas, then to numpy, and then reshaping, but I want to do this transformation in PySpark itself, because the code below works fine only for a few thousand rows:
import numpy as np

covarianceMatrix_pd = covarianceMatrix_df.toPandas()
nrows = np.sqrt(len(covarianceMatrix_pd))
covarianceMatrix_pd = covarianceMatrix_pd.to_numpy().reshape((int(nrows), int(nrows)))
covarianceMatrix_pd
One way to do this is to use row_number with pivot after taking a count of the dataframe:
from pyspark.sql import functions as F, Window
from math import sqrt

c = int(sqrt(df.count()))  # this gives 3

rnum = F.row_number().over(Window.orderBy(F.lit(1)))

out = (df.withColumn("Rnum", ((rnum - 1) / c).cast("Integer"))
         .withColumn("idx", F.row_number().over(Window.partitionBy("Rnum").orderBy("Rnum")))
         .groupby("Rnum").pivot("idx").agg(F.first("number")))
out.show()
+----+----+----+----+
|Rnum| 1| 2| 3|
+----+----+----+----+
| 0|12.4|13.4|42.3|
| 1|33.4|42.3|32.4|
| 2|44.2|12.3|45.4|
+----+----+----+----+
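One caveat: Window.orderBy(F.lit(1)) does not guarantee that rows are numbered in their original order, so check that the resulting layout matches your data. To get closer to the expected output, the helper column can simply be dropped afterwards (a small follow-up on the out dataframe above):
# Sort by the row-group key, then drop it so only the pivoted values remain.
out.orderBy("Rnum").drop("Rnum").show()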

How to update a dataframe in PySpark with random values from another dataframe?

I have two dataframes in PySpark as below:
Dataframe A: total 1000 records
+-----+
|Name |
+-----+
| a|
| b|
| c|
+-----+
Dataframe B: Total 3 records
+-----+
|Zip |
+-----+
|06905|
|06901|
|06902|
+-----+
I need to add a new column named Zip in Dataframe A and populate the values with a randomly selected value from Dataframe B. So the Dataframe A will look something like this:
+-----+-----+
|Name |Zip |
+-----+-----+
| a|06901|
| b|06905|
| c|06902|
| d|06902|
+-----+-----+
I am running this on Azure Databricks and apparently quinn isn't available there, so unfortunately I can't use quinn.
If b is small (3 rows), you can just collect it into a Python list and add it as an array column to a. Then you can get a random element using shuffle.
import pyspark.sql.functions as F

df = a.withColumn(
    'Zip',
    F.shuffle(
        F.array(*[F.lit(r[0]) for r in b.collect()])
    )[0]
)
df.show()
+----+-----+
|Name| Zip|
+----+-----+
| a|06901|
| b|06905|
| c|06902|
| d|06901|
+----+-----+
You can agg the dataframe with zips and collect the values into one array column, then do a cross join and select a random element from the array of zips, for example by using shuffle on the array before picking the first element:
from pyspark.sql import functions as F

df_result = df_a.crossJoin(
    df_b.agg(F.collect_list("Zip").alias("Zip"))
).withColumn(
    "Zip",
    F.expr("shuffle(Zip)[0]")
)
#+----+-----+
#|Name| Zip|
#+----+-----+
#| a|06901|
#| b|06902|
#| c|06901|
#| d|06901|
#+----+-----+
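If you prefer to make the random pick explicit instead of relying on shuffle, an equivalent sketch picks a random 1-based index with rand and element_at, assuming the same df_a and df_b as above:
from pyspark.sql import functions as F

df_result = df_a.crossJoin(
    df_b.agg(F.collect_list("Zip").alias("Zip"))
).withColumn(
    # element_at is 1-based, so generate an index in [1, size(Zip)].
    "Zip",
    F.expr("element_at(Zip, cast(floor(rand() * size(Zip)) + 1 as int))")
)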

Adding a column to a PySpark dataframe containing standard deviations of a column based on grouping by two other columns

Suppose that we have a csv file which has been imported as a dataframe in PySpark as follows:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("file path and name.csv", inferSchema = True, header = True)
df.show()
output
+-----+----+----+
|lable|year|val |
+-----+----+----+
| A|2003| 5.0|
| A|2003| 6.0|
| A|2003| 3.0|
| A|2004|null|
| B|2000| 2.0|
| B|2000|null|
| B|2009| 1.0|
| B|2000| 6.0|
| B|2009| 6.0|
+-----+----+----+
Now, we want to add another column to df which contains the standard deviation of val based on grouping by the two columns lable and year. So the output must be as follows:
+-----+----+----+-----+
|lable|year|val | std |
+-----+----+----+-----+
| A|2003| 5.0| 1.53|
| A|2003| 6.0| 1.53|
| A|2003| 3.0| 1.53|
| A|2004|null| null|
| B|2000| 2.0| 2.83|
| B|2000|null| 2.83|
| B|2009| 1.0| 3.54|
| B|2000| 6.0| 2.83|
| B|2009| 6.0| 3.54|
+-----+----+----+-----+
I have the following code, which works for a small dataframe, but it does not work for the very large dataframe (about 40 million rows) that I am working with now.
import pyspark.sql.functions as f
a = df.groupby('lable','year').agg(f.round(f.stddev("val"),2).alias('std'))
df = df.join(a, on = ['lable', 'year'], how = 'inner')
I get a Py4JJavaError Traceback (most recent call last) error after running it on my large dataframe.
Does anyone know an alternative way? I hope your way works on my dataset.
I am using Python 3.7.1, PySpark 2.4, and Jupyter 4.4.0.
The join on the dataframe causes a lot of data shuffling between executors. In your case, you can do without the join.
Use a window specification to partition the data by 'lable' and 'year' and aggregate over the window.
import pyspark.sql.functions as f
from pyspark.sql.window import Window

windowSpec = Window.partitionBy('lable', 'year')\
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df = df.withColumn("std", f.round(f.stddev("val").over(windowSpec), 2))
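For reference, a self-contained version of the same idea on the sample rows from the question (building the dataframe inline instead of reading the csv, and assuming an existing spark session):
import pyspark.sql.functions as f
from pyspark.sql.window import Window

data = [("A", 2003, 5.0), ("A", 2003, 6.0), ("A", 2003, 3.0), ("A", 2004, None),
        ("B", 2000, 2.0), ("B", 2000, None), ("B", 2009, 1.0),
        ("B", 2000, 6.0), ("B", 2009, 6.0)]
df = spark.createDataFrame(data, ["lable", "year", "val"])

# Without an orderBy, the window frame defaults to the entire partition.
windowSpec = Window.partitionBy("lable", "year")
df = df.withColumn("std", f.round(f.stddev("val").over(windowSpec), 2))
df.show()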

How to extract hour part of timestamp to a new column in DataFrame Spark

I want to extract the hour from the current timestamp column and store the hour value in a new column in the data frame. Please help.
This should work:
val DF2 = DF1.withColumn("col_1", trim(DF1("col_1")))
Hope this will help
val df = Seq((" Virat ",18,"RCB"),("Rohit ",45,"MI "),(" DK",67,"KKR ")).toDF("captains","jersey_number","teams")
scala> df.show
+--------+-------------+-----+
|captains|jersey_number|teams|
+--------+-------------+-----+
| Virat | 18| RCB|
| Rohit | 45| MI |
| DK| 67| KKR |
+--------+-------------+-----+
scala>val trimmedDF = df.withColumn("captains",trim(df("captains"))).withColumn("teams",trim(df("teams")))
scala> trimmedDF.show
+--------+-------------+-----+
|captains|jersey_number|teams|
+--------+-------------+-----+
| Virat| 18| RCB|
| Rohit| 45| MI|
| DK| 67| KKR|
+--------+-------------+-----+
You can use one of the functions available for column operations:
For Scala:
import org.apache.spark.sql.functions._
val df2 = df.withColumn("hour", hour(col("timestamp_column")))
For Python:
from pyspark.sql.functions import *
df2 = df.withColumn('hour', hour(col('timestamp_column')))
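As a quick check, the Python version can be exercised on a dataframe that has a current timestamp column (a minimal sketch; timestamp_column is just an illustrative name and an existing spark session is assumed):
from pyspark.sql.functions import current_timestamp, hour, col

# Build a one-row dataframe with a timestamp column, then extract the hour.
df = spark.range(1).withColumn("timestamp_column", current_timestamp())
df.withColumn("hour", hour(col("timestamp_column"))).show(truncate=False)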
Reference:
org.apache.spark.sql.functions
pyspark.sql.functions