PySpark: filter dataframe based on row values of another dataframe

I have a master dataframe and a secondary dataframe. I want to go through the secondary dataframe row by row, filter the master dataframe based on the values in each row, run a function on the filtered master dataframe, and save the output.
The output could either be saved in a separate dataframe, or in a new column of the secondary dataframe.
import pandas as pd

# Master DF
df = pd.DataFrame({"Name": ["Mike", "Bob", "Steve", "Jim", "Dan"],
                   "Age": [22, 44, 66, 22, 66],
                   "Job": ["Doc", "Cashier", "Fireman", "Doc", "Fireman"]})
# Secondary DF
df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"]})

df = spark.createDataFrame(df)
+-----+---+-------+
| Name|Age| Job|
+-----+---+-------+
| Mike| 22| Doc|
| Bob| 44|Cashier|
|Steve| 66|Fireman|
| Jim| 22| Doc|
| Dan| 66|Fireman|
+-----+---+-------+
df1 = spark.createDataFrame(df1)
+---+-------+
|Age| Job|
+---+-------+
| 22| Doc|
| 66|Fireman|
+---+-------+
from pyspark.sql import functions as F

# Filter by values in first row of secondary DF
df_filt = df.filter(
    (F.col("Age") == 22) &
    (F.col("Job") == "Doc")
)
# Run the filtered DF through my function
def my_func(df_filt):
    my_list = df_filt.select('Name').rdd.flatMap(lambda x: x).collect()
    return '-'.join(my_list)
# Output of function
my_func(df_filt)
'Mike-Jim'
# Filter by values in second row of secondary DF
df_filt = df.filter(
    (F.col("Age") == 66) &
    (F.col("Job") == "Fireman")
)
# Output of function
my_func(df_filt)
'Steve-Dan'
# Desired output at the end of the iterations
new_df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"], "Returned_value": ['Mike-Jim', 'Steve-Dan']})
Basically, I want to take my master DF, filter it in a certain way, run an algorithm on that filtered dataset to get the output for that filtering, then go on to the next set of filter values and do the same.
What is the best way to go about this?

Try this with join, groupBy, concat_ws/array_join and collect_list.
from pyspark.sql import functions as F
df.join(df1, ['Age', 'Job'])\
  .groupBy("Age", "Job")\
  .agg(F.concat_ws('-', F.collect_list("Name")).alias("Returned_value"))\
  .show()
#+---+-------+--------------+
#|Age| Job|Returned_value|
#+---+-------+--------------+
#| 22| Doc| Mike-Jim|
#| 66|Fireman| Steve-Dan|
#+---+-------+--------------+
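The same aggregation can also be written with array_join (available in Spark 2.4+), and .toPandas() brings the small result back as the pandas dataframe the question calls new_df1; a minimal sketch assuming the same df and df1:
from pyspark.sql import functions as F

result = df.join(df1, ["Age", "Job"])\
    .groupBy("Age", "Job")\
    .agg(F.array_join(F.collect_list("Name"), "-").alias("Returned_value"))

# a pandas dataframe like new_df1, if you need the result outside Spark
new_df1 = result.toPandas()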

Related

Can I apply MERGE INTO on PySpark DataFrame?

I have two PySpark DataFrames and I want to merge them. When I try to use the MERGE INTO statement, I get an error that there is no table. I am running the code in Databricks.
Sample code:
import pandas as pd

target_data = {'id': [1100, 1200, 1300, 1400, 1500],
               'name': ["Person1", "Person2", "Person3", "Person4", "Person5"],
               'location': ["Location1", "Location2", "Location3", None, "Location5"],
               'contact': [None, "Contact2", None, "Contact4", None],
               }
pdf = pd.DataFrame(target_data)
target = spark.createDataFrame(pdf)

source_data = {'id': [1400, 1500, 1600],
               'name': ["Person4", "Person5", "Person6"],
               'location': ["Location4", "Location5", "Location6"],
               'contact': ["Contact4", "Contact5", "Contact6"],
               }
pdf = pd.DataFrame(source_data)
source = spark.createDataFrame(pdf)
And using SQL statement in the next cell:
%sql
MERGE INTO target as t
USING source as s
ON t.id = s.id
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
I get an error saying that there is no table.
Is there any way that I can merge two DataFrames? Should I convert them into Delta table first?
Extending on the other answers here: if you are looking to drop duplicates as well, you can leverage the dropDuplicates function.
>>> output_df = source.union(target).dropDuplicates(["id"])
>>> output_df.orderBy(["id"]).show()
+----+-------+---------+--------+
| id| name| location| contact|
+----+-------+---------+--------+
|1100|Person1|Location1| null|
|1200|Person2|Location2|Contact2|
|1300|Person3|Location3| null|
|1400|Person4|Location4|Contact4|
|1500|Person5|Location5|Contact5|
|1600|Person6|Location6|Contact6|
+----+-------+---------+--------+
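As for the Delta question: MERGE INTO generally needs the target to be a Delta table rather than a plain DataFrame, which is why Spark reports that there is no table. A hedged sketch of that route, assuming Databricks / delta-spark and a hypothetical table name target_table:
from delta.tables import DeltaTable

# write the target DataFrame out as a Delta table first (hypothetical name)
target.write.format("delta").saveAsTable("target_table")

# then merge the source DataFrame into it with the DeltaTable API
(DeltaTable.forName(spark, "target_table").alias("t")
    .merge(source.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())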

Read text file using information in separate dataframe

I have fixed width file as below
00120181120xyz12341
00220180203abc56792
00320181203pqr25483
And a corresponding dataframe that specifies the schema, i.e. the column name (_Name) and the column width (_Length):
How can I use PySpark to get the text file dataframe as follows:
#+---+-----+---+
#| C1|   C2| C3|
#+---+-----+---+
#|  0|02018| 11|
#|  0|02018| 02|
#|  0|02018| 12|
#+---+-----+---+
You could:
- collect your column names and lengths;
- use them to create a list of substring indexes for extracting the string parts you need;
- use that list of substring indexes to extract the string parts from every row.
Input:
rdd_data = spark.sparkContext.textFile(r'C:\Temp\sample.txt')
df_lengths = spark.createDataFrame([("1", "C1"), ("5", "C2"), ("2", "C3")], ["_Length", "_Name"])
Script:
from pyspark.sql import Row

lengths = df_lengths.collect()
ranges = [[0, 0]]
for x in lengths:
    ranges.append([ranges[-1][-1], ranges[-1][-1] + int(x["_Length"])])

Cols = Row(*[r["_Name"] for r in lengths])
df = rdd_data.map(lambda x: Cols(*[x[r[0]:r[1]] for r in ranges[1:]])).toDF()
df.show()
# +---+-----+---+
# | C1| C2| C3|
# +---+-----+---+
# | 0|01201| 81|
# | 0|02201| 80|
# | 0|03201| 81|
# +---+-----+---+
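As a quick sanity check of the index list built above (plain Python, no Spark needed), the sample _Length values 1, 5 and 2 give the slices [0:1], [1:6] and [6:8]:
ranges = [[0, 0]]
for length in [1, 5, 2]:  # the sample _Length values
    ranges.append([ranges[-1][-1], ranges[-1][-1] + length])

line = "00120181120xyz12341"
print([line[start:end] for start, end in ranges[1:]])  # ['0', '01201', '81']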
Something like this is also possible using only the DataFrame API, provided you have a column that you can use inside orderBy for the window function.
from pyspark.sql import functions as F, Window as W

df_data = spark.read.csv(r"C:\Temp\sample.txt")
df_lengths = spark.createDataFrame([("1", "C1"), ("5", "C2"), ("2", "C3")], ["_Length", "_Name"])

sum_col = F.sum("_Length").over(W.orderBy("_Name")) + 1
df_lengths = (df_lengths
    .withColumn("_Len", F.array((sum_col - F.col("_Length")).cast('int'), "_Length"))
    .groupBy().pivot("_Name").agg(F.first("_Len"))
)
df_data = df_data.select(
    [F.substring("_c0", int(c[0]), int(c[1])) for c in df_lengths.head()]
).toDF(*df_lengths.columns)
df_data.show()
# +---+-----+---+
# | C1| C2| C3|
# +---+-----+---+
# | 0|01201| 81|
# | 0|02201| 80|
# | 0|03201| 81|
# +---+-----+---+

How to compute hypotenuses with pandas udf, pyspark

I want to write a pandas UDF which will take two arguments, cathetus1 and cathetus2, from another dataframe and return the hypotenuse.
from pyspark.sql.types import StructType, StructField, DoubleType

# this data is a list of cathetus pairs
data = [(3.0, 4.0), (6.0, 8.0), (3.3, 5.6)]
schema = StructType([StructField("cathetus1", DoubleType(), True), StructField("cathetus2", DoubleType(), True)])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
# this creates a dataframe containing only the cathetuses
This is the function I have written so far:
def pandaUdf(cat1, cat2):
    leg1 = []
    leg2 = []
    for i in data:
        x = 0
        leg1.append(i[x])
        leg2.append(i[x+1])
    hypoData.append(np.hypot(leg1, leg2))
    return np.hypot(leg1, leg2)

# example_series = pd.Series(data)
and I'm trying to create a new column in df whose values will be the hypotenuses:
df.withColumn(col('Hypo'), pandaUdf(example_df.cathetus1, example_df.cathetus2)).show()
but this gives me an error --> col should be Column.
I don't understand how I can fix this error or why it's even there.
You can apply np.hypot on the two cathetus columns directly, without extracting individual values.
import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import *

data = [(3.0, 4.0), (6.0, 8.0), (3.3, 5.6)]
schema = StructType([StructField("cathetus1", DoubleType(), True), StructField("cathetus2", DoubleType(), True)])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
"""
+---------+---------+
|cathetus1|cathetus2|
+---------+---------+
| 3.0| 4.0|
| 6.0| 8.0|
| 3.3| 5.6|
+---------+---------+
"""
def hypot(cat1: pd.Series, cat2: pd.Series) -> pd.Series:
    return np.hypot(cat1, cat2)

hypot_pandas_df = F.pandas_udf(hypot, returnType=FloatType())
df.withColumn("Hypo", hypot_pandas_df("cathetus1", "cathetus2")).show()
"""
+---------+---------+----+
|cathetus1|cathetus2|Hypo|
+---------+---------+----+
| 3.0| 4.0| 5.0|
| 6.0| 8.0|10.0|
| 3.3| 5.6| 6.5|
+---------+---------+----+
"""

PySpark: how to use `StringIndexer` to do label encoding with the string array column

As we know, we can do label encoding with StringIndexer on a string column, but if we want to do it on a string array column, it is not as easy to implement.
# input
df.show()
+--------------------------------------+
| tags|
+--------------------------------------+
| [industry, display, Merchants]|
| [smart, swallow, game, Experience]|
| [social, picture, social]|
| [default, game, us, adventure]|
| [financial management, loan, product]|
| [system, profile, optimization]|
...
# After doing LabelEncoder() on the `tags` column
...
+--------------------------------------+
| tags|
+--------------------------------------+
| [0, 1, 2]|
| [3, 4, 4, 5]|
| [6, 7, 6]|
| [8, 4, 9, 10]|
| [11, 12, 13]|
| [14, 15, 16]|
Here is a Scala version; the Python version will be very similar:
// add unique id to each row
val df2 = df.withColumn("id", monotonically_increasing_id).select('id, explode('tags).as("tag"))

val indexer = new StringIndexer()
  .setInputCol("tag")
  .setOutputCol("tagIndex")

val indexed = indexer.fit(df2).transform(df2)

// in the final step you should convert tags back to an array of tags
val dfFinal = indexed.groupBy('id).agg(collect_list('tagIndex))
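For reference, a rough PySpark translation of the Scala snippet above (a sketch assuming df has the array column tags):
from pyspark.sql import functions as F
from pyspark.ml.feature import StringIndexer

# add a unique id to each row, then explode the tag array
df2 = (df.withColumn("id", F.monotonically_increasing_id())
         .select("id", F.explode("tags").alias("tag")))

indexer = StringIndexer(inputCol="tag", outputCol="tagIndex")
indexed = indexer.fit(df2).transform(df2)

# finally, collect the indexes back into one array per original row
df_final = indexed.groupBy("id").agg(F.collect_list("tagIndex").alias("tagsIndex"))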
You can create a class which will explode the array column, apply the StringIndexer, and collect the indexes back into a list. The benefit of using a class instead of step-by-step transformations is that it can be used in a pipeline or saved once fitted.
A class doing all the transformations and applying a StringIndexer:
from pyspark.ml import Estimator, Model
from pyspark.ml.feature import StringIndexer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.functions import monotonically_increasing_id

class ArrayStringIndexerModel(Model, DefaultParamsReadable, DefaultParamsWritable):
    def __init__(self, indexer, inputCol: str, outputCol: str):
        super(ArrayStringIndexerModel, self).__init__()
        self.indexer = indexer
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, df: DataFrame) -> DataFrame:
        # Creating an always increasing id (as in fit)
        df = df.withColumn("id_temp_added", monotonically_increasing_id())
        # Exploding "inputCol" and saving to a new dataframe (as in fit)
        df2 = df.withColumn('inputExpl', F.explode(self.inputCol)).select('id_temp_added', 'inputExpl')
        # Transforming with the fitted "indexer"
        indexed_df = self.indexer.transform(df2)
        # Converting the indexes back to an array per row
        indexed_df = indexed_df.groupby('id_temp_added').agg(
            F.collect_list(F.col(self.outputCol)).alias(self.outputCol))
        # Joining back to the main dataframe
        df = df.join(indexed_df, on='id_temp_added', how='left')
        # Dropping the created id column
        df = df.drop('id_temp_added')
        return df

class ArrayStringIndexer(Estimator, DefaultParamsReadable, DefaultParamsWritable):
    """
    A custom Estimator which applies a StringIndexer to an array of strings
    (explodes, applies StringIndexer, aggregates back)
    """
    def __init__(self, inputCol: str, outputCol: str):
        super(ArrayStringIndexer, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _fit(self, df: DataFrame) -> ArrayStringIndexerModel:
        # Creating an always increasing id
        df = df.withColumn("id_temp_added", monotonically_increasing_id())
        # Exploding "inputCol" and saving to a new dataframe
        df2 = df.withColumn('inputExpl', F.explode(self.inputCol)).select('id_temp_added', 'inputExpl')
        # Fitting a StringIndexer on the exploded input column
        indexer = StringIndexer(inputCol='inputExpl', outputCol=self.outputCol)
        indexer = indexer.fit(df2)
        # Return ArrayStringIndexerModel with the fitted StringIndexer, input and output columns
        return ArrayStringIndexerModel(indexer=indexer, inputCol=self.inputCol, outputCol=self.outputCol)
How to use the class in code:
tags_indexer = ArrayStringIndexer(inputCol="tags", outputCol="tagsIndex")
tags_indexer.fit(df).transform(df).show()
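Since ArrayStringIndexer is an Estimator, it can also be dropped into a Pipeline, as mentioned above; a small sketch assuming the class definitions are in scope:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[ArrayStringIndexer(inputCol="tags", outputCol="tagsIndex")])
pipeline_model = pipeline.fit(df)
pipeline_model.transform(df).show()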

How to create a column with all the values in a range given by another column in PySpark

I have a problem with the following scenario using PySpark version 2.0: I have a DataFrame with a column that contains an array holding a start and an end value, e.g.
[1000, 1010]
I would like to know how to create and compute another column which contains an array that holds all the values for the given range. The result for the generated range column would be:
+--------------+-------------+-----------------------------+
| Description| Accounts| Range|
+--------------+-------------+-----------------------------+
| Range 1| [101, 105]| [101, 102, 103, 104, 105]|
| Range 2| [200, 203]| [200, 201, 202, 203]|
+--------------+-------------+-----------------------------+
Try this.
Define the UDF:
from pyspark.sql import functions as F
from pyspark.sql import types as pt

def range_value(a):
    start = a[0]
    end = a[1] + 1
    return list(range(start, end))

df = spark.createDataFrame([("Range 1", list([101, 105])), ("Range 2", list([200, 203]))], ("Description", "Accounts"))

range_value = F.udf(range_value, pt.ArrayType(pt.IntegerType()))
df = df.withColumn('Range', range_value(F.col('Accounts')))
You should use a UDF (UDF sample).
Say your PySpark dataframe is named df; it could be created like this:
df = spark.createDataFrame(
    [("Range 1", list([101, 105])),
     ("Range 2", list([200, 203]))],
    ("Description", "Accounts"))
And your solution is like this:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, IntegerType
import numpy as np

def make_range_number(arr):
    number_range = np.arange(arr[0], arr[1] + 1, 1).tolist()
    return number_range

range_udf = F.udf(make_range_number, ArrayType(IntegerType()))

df = df.withColumn("Range", range_udf(F.col("Accounts")))
Have a fun time!:)
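If you can move beyond Spark 2.0, note that Spark 2.4+ ships a built-in sequence function that avoids the UDF entirely; a minimal sketch with the same dataframe:
from pyspark.sql import functions as F

# sequence(start, end) builds the inclusive integer range natively
df = df.withColumn("Range", F.sequence(F.col("Accounts")[0], F.col("Accounts")[1]))
df.show(truncate=False)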