Can I apply MERGE INTO on PySpark DataFrame? - dataframe

I have two PySpark DataFrames and I want to merge these DataFrames. When I try to use MERGE INTO statement, I get an error that there is no table. I am running the code in Databricks.
Sample code:
import pandas as pd
target_data = {'id': [1100, 1200, 1300, 1400, 1500],
'name': ["Person1", "Person2", "Person3", "Person4", "Person5"],
'location': ["Location1", "Location2", "Location3", None, "Location5"],
'contact': [None, "Contact2", None, "Contact4", None],
}
pdf = pd.DataFrame(target_data)
target = spark.createDataFrame(pdf)
source_data = {'id': [1400, 1500, 1600],
'name': ["Person4", "Person5", "Person6"],
'location': ["Location4", "Location5", "Location6"],
'contact': ["Contact4", "Contact5", "Contact6"],
}
pdf = pd.DataFrame(source_data)
source = spark.createDataFrame(pdf)
And using SQL statement in the next cell:
%sql
MERGE INTO target as t
USING source as s
ON t.id = s.id
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
I get the error:
Is there any way that I can merge two DataFrames? Should I convert them into Delta table first?

Extending on the other answers here and if you are looking to drop duplicates as well you can leverage dropDuplicates function.
>>> output_df=source.union(target).dropDuplicates(["id"])
>>> output_df.orderBy(["id"]).show()
+----+-------+---------+--------+
| id| name| location| contact|
+----+-------+---------+--------+
|1100|Person1|Location1| null|
|1200|Person2|Location2|Contact2|
|1300|Person3|Location3| null|
|1400|Person4|Location4|Contact4|
|1500|Person5|Location5|Contact5|
|1600|Person6|Location6|Contact6|
+----+-------+---------+--------+
Output:

Related

How to count hypothenuses with pandas udf, pyspark

I want to write a panda udf which will take two arguments cathetus1, and cathetus2 from other dataframe and return hypot.
# this data is list where cathetuses are.
data = [(3.0, 4.0), (6.0, 8.0), (3.3, 5.6)]
schema = StructType([StructField("cathetus1",DoubleType(),True),StructField("cathetus2",DoubleType(),True)])
df = spark.createDataFrame(data=data,schema=schema)
df.show()
#and this is creating dataframe where only cathetuses are showing.
this is function i have written so far.
def pandaUdf(cat1, cat2):
leg1 = []
leg2 = []
for i in data:
x = 0
leg1.append(i[x])
leg2.append(i[x+1])
hypoData.append(np.hypot(leg1,leg2))
return np.hypot(leg1,leg2)
#example_series = pd.Series(data)
and im trying to create a new column in df, which values will be hypos.
df.withColumn(col('Hypo'), pandaUdf(example_df.cathetus1,example_df.cathetus2)).show()
but this gives me an error --> col should be Column.
I dont understand how I can fix this error or why its even there.
You can apply np.hypot on the 2 cathetus directly without extracting individual values.
from pyspark.sql import functions as F
from pyspark.sql.types import *
data = [(3.0, 4.0), (6.0, 8.0), (3.3, 5.6)]
schema = StructType([StructField("cathetus1",DoubleType(),True),StructField("cathetus2",DoubleType(),True)])
df = spark.createDataFrame(data=data,schema=schema)
df.show()
"""
+---------+---------+
|cathetus1|cathetus2|
+---------+---------+
| 3.0| 4.0|
| 6.0| 8.0|
| 3.3| 5.6|
+---------+---------+
"""
def hypot(cat1: pd.Series, cat2: pd.Series) -> pd.Series:
return np.hypot(cat1,cat2)
hypot_pandas_df = F.pandas_udf(hypot, returnType=FloatType())
df.withColumn("Hypo", hypot_pandas_df("cathetus1", "cathetus2")).show()
"""
+---------+---------+----+
|cathetus1|cathetus2|Hypo|
+---------+---------+----+
| 3.0| 4.0| 5.0|
| 6.0| 8.0|10.0|
| 3.3| 5.6| 6.5|
+---------+---------+----+
"""

GroupBy Function Not Applying

I am trying to groupby for the following specializations but I am not getting the expected result (or any for that matter). The data stays ungrouped even after this step. Any idea what's wrong in my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg:
import pandas as pd
df = pd.DataFrame(
[
['john', 'eng', 'build'],
['john', 'math', 'build'],
['kevin', 'math', 'asp'],
['nick', 'sci', 'spi']
],
columns = ['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
resiults in:
if you need to preserve starting number of lines, use transform. transform returns one column:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in:

Pyspark -- Filter dataframe based on row values of another dataframe

I have a master dataframe and a secondary dataframe which I want to go through row by row, filter the master dataframe based on the values in each row, run a function on the filtered master dataframe, and save the output.
The output could either be saved in a separate dataframe, or in a new column of the secondary dataframe.
# Master DF
df = pd.DataFrame({"Name": ["Mike", "Bob", "Steve", "Jim", "Dan"], "Age": [22, 44, 66, 22, 66], "Job": ["Doc", "Cashier", "Fireman", "Doc", "Fireman"]})
#Secondary DF
df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"]})
df = spark.createDataFrame(df)
+-----+---+-------+
| Name|Age| Job|
+-----+---+-------+
| Mike| 22| Doc|
| Bob| 44|Cashier|
|Steve| 66|Fireman|
| Jim| 22| Doc|
| Dan| 66|Fireman|
+-----+---+-------+
df1 = spark.createDataFrame(df1)
+---+-------+
|Age| Job|
+---+-------+
| 22| Doc|
| 66|Fireman|
+---+-------+
​
# Filter by values in first row of secondary DF
df_filt = df.filter(
(F.col("Age") == 22) &
(F.col('Job') == 'Doc')
)
# Run the filtered DF through my function
def my_func(df_filt):
my_list = df_filt.select('Name').rdd.flatMap(lambda x: x).collect()
return '-'.join(my_list)
# Output of function
my_func(df_filt)
'Mike-Jim'
# Filter by values in second row of secondary DF
df_filt = df.filter(
(F.col("Age") == 66) &
(F.col('Job') == 'Fireman')
)
# Output of function
my_func(df_filt)
'Steve-Dan'
# Desired output at the end of the iterations
new_df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"], "Returned_value": ['Mike-Jim', 'Steve-Dan']})
Basically, I want to take my Master DF and filter it in certain ways, and run an algorithm on the filtered dataset and get the output for that filtering, then go on to the next set of filtering and do the same.
What is the best way to go about this?
Try this with join, groupBy, concat_ws/array_join and collect_list.
from pyspark.sql import functions as F
df.join(df1,['Age','Job'])\
.groupBy("Age","Job").agg(F.concat_ws('-',F.collect_list("Name")).alias("Returned_value")).show()
#+---+-------+--------------+
#|Age| Job|Returned_value|
#+---+-------+--------------+
#| 22| Doc| Mike-Jim|
#| 66|Fireman| Steve-Dan|
#+---+-------+--------------+

transpose and merge data in pandas

I would like to transpose some data in order to see the format shown by "table", but having the possibility to analyze the 10,20,50 columns, with sum, value_counts (), etc., instead it gives an error
raw_data = {'product': [10,20,50],
'key1': [1,2,3],
'key2': [51,52,53],
'tick': [1,1,1]}
df = pd.DataFrame(raw_data, columns = ['product','key1','key2','tick'])
table = pd.pivot_table(df, index=('key1','key2'),values=('tick'), columns=('product'))
table.reset_index(inplace=True)
table['10'].sum()
'''

How to create a column with all the values in a range given by another column in PySpark

I have a problem with the following scenario using PySpark version 2.0, I have a DataFrame with a column contains an array with start and end value, e.g.
[1000, 1010]
I would like to know how to create and compute another column which contains an array that holds all the values for the given range? the result of the generated range values column will be:
+--------------+-------------+-----------------------------+
| Description| Accounts| Range|
+--------------+-------------+-----------------------------+
| Range 1| [101, 105]| [101, 102, 103, 104, 105]|
| Range 2| [200, 203]| [200, 201, 202, 203]|
+--------------+-------------+-----------------------------+
Try this.
define the udf
def range_value(a):
start = a[0]
end = a[1] +1
return list(range(start,end))
from pyspark.sql import functions as F
from pyspark.sql import types as pt
df = spark.createDataFrame([("Range 1", list([101,105])), ("Range 2", list([200, 203]))],("Description", "Accounts"))
range_value= F.udf(range_value, pt.ArrayType(pt.IntegerType()))
df = df.withColumn('Range', range_value(F.col('Accounts')))
Output
you should use UDF (UDF sample)
Consider your pyspark data frame name is df, your data frame could be like this:
df = spark.createDataFrame(
[("Range 1", list([101,105])),
("Range 2", list([200, 203]))],
("Description", "Accounts"))
And your solution is like this:
import pyspark.sql.functions as F
import numpy as np
def make_range_number(arr):
number_range = np.arange(arr[0], arr[1]+1, 1).tolist()
return number_range
range_udf = F.udf(make_range_number)
df = df.withColumn("Range", range_udf(F.col("Accounts")))
Have a fun time!:)