How to implement a User Defined Aggregate Function (UDAF) in PySpark SQL?
pyspark version = 3.0.2
python version = 3.7.10
As a minimal example, I'd like to replace the AVG aggregate function with a UDAF:
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql = SQLContext(sc)
df = sql.createDataFrame(
    pd.DataFrame({'id': [1, 1, 2, 2], 'value': [1, 2, 3, 4]}))
df.createTempView('df')
rv = sql.sql('SELECT id, AVG(value) FROM df GROUP BY id').toPandas()
where rv will be:
In [2]: rv
Out[2]:
   id  avg(value)
0   1         1.5
1   2         3.5
How can a UDAF replace AVG in the query?
For example, this does not work:
import numpy as np
def udf_avg(x):
    return np.mean(x)
sql.udf.register('udf_avg', udf_avg)
rv = sql.sql('SELECT id, udf_avg(value) FROM df GROUP BY id').toPandas()
The idea is to implement a UDAF in pure Python for processing not supported by SQL aggregate functions (e.g. a low-pass filter).
A Pandas UDF can be used; the type-hint style of definition shown below is available from Spark 3.0 and Python 3.6+. See the issue and documentation for details.
Full implementation in Spark SQL:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    pd.DataFrame({'id': [1, 1, 2, 2], 'value': [1, 2, 3, 4]}))
df.createTempView('df')
@pandas_udf(DoubleType())
def avg_udf(s: pd.Series) -> float:
    return s.mean()
spark.udf.register('avg_udf', avg_udf)
rv = spark.sql('SELECT id, avg_udf(value) FROM df GROUP BY id').toPandas()
with return value:
In [2]: rv
Out[2]:
   id  avg_udf(value)
0   1             1.5
1   2             3.5
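Since the question mentions processing such as a low-pass filter, which produces a series per group rather than a single value, a grouped-map Pandas UDF (applyInPandas) may be the better fit for that case. A minimal sketch, reusing the df defined above and assuming a rolling mean stands in for the real filter:
# Sketch only: per-group processing that returns one row per input row.
# The rolling mean is a stand-in for an actual low-pass filter.
def smooth(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf contains all rows of one group as a Pandas DataFrame
    pdf['smoothed'] = pdf['value'].rolling(2, min_periods=1).mean()
    return pdf

rv = df.groupBy('id').applyInPandas(
    smooth, schema='id long, value long, smoothed double').toPandas()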
You can use a Pandas UDF with GROUPED_AGG type. It receives columns from Spark as Pandas Series, so that you can call Series.mean on the column.
import pyspark.sql.functions as F
@F.pandas_udf('float', F.PandasUDFType.GROUPED_AGG)
def avg_udf(s):
    return s.mean()
df2 = df.groupBy('id').agg(avg_udf('value'))
df2.show()
+---+--------------+
| id|avg_udf(value)|
+---+--------------+
| 1| 1.5|
| 2| 3.5|
+---+--------------+
Registering it for use in SQL is also possible:
df.createTempView('df')
spark.udf.register('avg_udf', avg_udf)
df2 = spark.sql("select id, avg_udf(value) from df group by id")
df2.show()
+---+--------------+
| id|avg_udf(value)|
+---+--------------+
| 1| 1.5|
| 2| 3.5|
+---+--------------+
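As a side note, a grouped-agg Pandas UDF can also be used as a window function. A small sketch reusing avg_udf, with the unbounded frame the Spark documentation uses for this case:
from pyspark.sql import Window

w = (Window.partitionBy('id')
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
# Every row gets its group's average attached as a new column
df.withColumn('group_avg', avg_udf(df['value']).over(w)).show()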
Related
I am having trouble creating a Pandas UDF that performs a calculation on a pd.Series based on a value in the same row of the underlying Spark DataFrame.
However, the most straightforward solution doesn't seem to be supported by the Pandas UDF API:
A very simple example like the one below
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
import pandas as pd
@F.pandas_udf(IntegerType())
def addition(arr: pd.Series, addition: int) -> pd.Series:
    return arr.add(addition)
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show()
df.withColumn("added", addition(F.col("array"),F.col("addition")))
throws the following exception on the UDF definition line:
NotImplementedError: Unsupported signature: (arr: pandas.core.series.Series, addition: int) -> pandas.core.series.Series.
Am I tackling this problem the wrong way? I could reimplement the whole "addition" function in native PySpark, but the real function I am talking about is terribly complex and would mean an enormous amount of rework.
Loading the example and adding an import for array:
import pyspark.sql.types as T
import pyspark.sql.functions as F
import pandas as pd
from array import array
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show(truncate=False)
print(df.schema.fields)
The output is:
+---------+--------+
| array|addition|
+---------+--------+
|[1, 2, 3]| 10|
|[4, 5, 6]| 20|
+---------+--------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True)]
If you must use a Pandas function to complete your task, here is an option that uses the Pandas function inside a regular PySpark UDF:
The Spark DF arr column is ArrayType, so convert it into a Pandas Series
Apply the Pandas function
Then convert the Pandas Series back to an array
@F.udf(T.ArrayType(T.LongType()))
def addition_pd(arr, addition):
    pd_arr = pd.Series(arr)
    added = pd_arr.add(addition)
    return array("l", added)
df = df.withColumn("added", addition_pd(F.col("array"),F.col("addition")))
df.show(truncate=False)
print(df.schema.fields)
This returns:
+---------+--------+------------+
|array |addition|added |
+---------+--------+------------+
|[1, 2, 3]|10 |[11, 12, 13]|
|[4, 5, 6]|20 |[24, 25, 26]|
+---------+--------+------------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True), StructField('added', ArrayType(LongType(), True), True)]
However, it is worth noting that, when possible, it is recommended to use built-in PySpark functions rather than a PySpark UDF (see here).
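As a further side note, and only as a sketch: the NotImplementedError in the question comes from annotating the second parameter as a plain int; with type hints, a scalar Pandas UDF expects every argument to arrive as a pd.Series (one value per row). Declaring both parameters as Series should therefore avoid the error, with the per-row addition done explicitly (addition_series is just a hypothetical name):
# Sketch: both inputs arrive as Pandas Series, one element per Spark row
import numpy as np

@F.pandas_udf(T.ArrayType(T.LongType()))
def addition_series(arr: pd.Series, addition: pd.Series) -> pd.Series:
    return pd.Series([np.asarray(a) + b for a, b in zip(arr, addition)])

df = df.withColumn("added", addition_series(F.col("array"), F.col("addition")))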
Strangely enough, I can't find anywhere on the internet whether this can be done.
I have a dataframe with an array column.
arr_col
[1,3,4]
[4,3,5]
I want this result:
Result
3
4
I want the median of each row.
I managed to do it with a pandas UDF, but it iterates over the column and applies np.median to each row one at a time.
I don't want that, as it is slow and works row by row; I want it to act on all rows at the same time.
Either in pandas or PySpark.
Use numpy:
import numpy as np
df['Result'] = np.median(np.vstack(df['arr_col']), axis=1)
Or explode and groupby.median:
df['Result'] = (df['arr_col'].explode()
                .groupby(level=0).median()
                )
Output:
     arr_col  Result
0  [1, 3, 4]     3.0
1  [4, 3, 5]     4.0
Used input:
df = pd.DataFrame({'arr_col': [[1,3,4], [4,3,5]]})
You can use a UDF in PySpark:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

m = udf(lambda x: int(np.median(x)), IntegerType())
df.withColumn('Result', m(col('arr_col'))).show()
+---+---------+------+
| Id|  arr_col|Result|
+---+---------+------+
|  1|[1, 3, 4]|     3|
|  1|[4, 3, 6]|     4|
+---+---------+------+
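Since the goal is to operate on the whole column at once rather than row by row, a scalar Pandas UDF is another option to consider: Spark hands the array column to the function as one pd.Series of arrays, so np.median can be applied in a single vectorized call. A sketch, assuming all arrays have the same length so they can be stacked:
# Sketch: vectorized row-wise median via a scalar Pandas UDF
# (assumes every array in arr_col has the same length)
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf, col

@pandas_udf('double')
def median_udf(arr: pd.Series) -> pd.Series:
    return pd.Series(np.median(np.vstack(arr), axis=1))

df.withColumn('Result', median_udf(col('arr_col'))).show()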
Currently I'm gathering the top 5 most frequent values with a UDF.
The goal is to achieve the same result without using a UDF and to have the most efficient solution (avoiding groupBy in loops).
Here's the code I'm currently using to get the result:
from pyspark.sql import functions as F
df = df.select('A', 'B', ...)
@F.udf
def get_top_5_udf(x):
    from collections import Counter
    return [elem[0] for elem in Counter(x).most_common(5)]
agg_expr = [get_top_5_udf(F.collect_list(col)).alias(col) for col in df.columns]
df_top5 = df.agg(*agg_expr)
The result looks like the following:
# result
#+-----------------+--------------+---------------+
#| A | B | ... |
#+-----------------+--------------+---------------+
#| [1, 2, 3, 4, 5] | [...] | ... |
#+-----------------+--------------+---------------+
You can try counting over a window partitioned by each column before aggregating: the count goes first inside a struct so that sort_array on the collected set orders the structs by frequency (descending), transform then keeps only the values, and slice takes the top 5:
from pyspark.sql import functions as F, Window
result = df.select(*[
    F.struct(
        F.count(c).over(Window.partitionBy(c)).alias("cnt"),
        F.col(c).alias("val")
    ).alias(c) for c in df.columns
]).agg(*[
    F.slice(
        F.expr(f"transform(sort_array(collect_set({c}), false), x -> x.val)"),
        1, 5
    ).alias(c) for c in df.columns
])
result.show()
I would like to split a column of a PySpark dataframe into two columns.
The dataframe:
price
90.16|USD
I need:
dollar_price  currency
90.16         USD
Pyspark code:
new_col = F.when(F.col("price").isNull() == False, F.substring(F.col('price'), 1, F.instr(F.col('retail_value'), '|') - 1)).otherwise(None)
new_df = df.withColumn('dollar_price', new_col)
new_col = F.when(F.col("price").isNull() == False, F.substring(F.col('price'), F.instr(F.col('retail_value'), '|') + 1, 3)).otherwise(None)
new_df_1 = new_df.withColumn('currency', new_col)
I got this error:
TypeError: Column is not iterable
Could you please tell me what I missed?
I have tried
Split a dataframe column's list into two dataframe columns
but it does not work.
Thanks.
Try with expr, since you are computing the position with the instr function: the SQL substring accepts an expression for the position, while the Python pyspark.sql.functions.substring expects plain integers, which is what triggers the TypeError.
Example:
df.show()
#+---------+
#| price|
#+---------+
#|90.16|USD|
#+---------+
from pyspark.sql.functions import *
from pyspark.sql.types import *
df.withColumn("dollar_price",when(col("price").isNull()==False,expr("substring(price,1,instr(price,'|')-1)")).otherwise(None)).\
withColumn("currency",when(col("price").isNull()==False,expr("substring(price,instr(price,'|')+1,3)")).otherwise(None)).\
show()
#+---------+------------+--------+
#| price|dollar_price|currency|
#+---------+------------+--------+
#|90.16|USD| 90.16| USD|
#+---------+------------+--------+
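As an alternative sketch, assuming the delimiter is always a single |, split avoids the substring/instr arithmetic altogether:
from pyspark.sql.functions import split, col

# "\\|" escapes the pipe, since split interprets its pattern as a regex
parts = split(col("price"), "\\|")
df.withColumn("dollar_price", parts.getItem(0)) \
  .withColumn("currency", parts.getItem(1)) \
  .show()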
I have a dataframe like the one below:
from pyspark import SparkContext, SparkConf,SQLContext
import numpy as np
config = SparkConf().setMaster("local")
sc = SparkContext(conf=config)
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 9, 9, 3)]) \
    .withColumnRenamed("_1", "doc").withColumnRenamed("_2", "word1") \
    .withColumnRenamed("_3", "word2").withColumnRenamed("_4", "word3")
Now I need to keep the first column and turn the remaining columns into a numpy array (two columns: "doc" and a numpy array column).
I know that
sdf = np.array(df.select([c for c in df.columns if c not in {'doc'}]).collect())
print(sdf)
translates all the columns into a numpy array, but how do I attach the numpy array back to the first column? Any help is appreciated.
Unfortunately you cannot have a numpy.array column in a PySpark dataframe, but you can use a regular Python list (an array column) instead and convert it back to numpy when you read the data:
>>> df = sqlContext.createDataFrame([("doc_3",[1,3,9]), ("doc_1",[9,6,0]), ("doc_2",[9,9,3]) ]).withColumnRenamed("_1","doc").withColumnRenamed("_2","words")
>>> df.show()
+-----+---------+
| doc| words|
+-----+---------+
|doc_3|[1, 3, 9]|
|doc_1|[9, 6, 0]|
|doc_2|[9, 9, 3]|
+-----+---------+
>>> df
DataFrame[doc: string, words: array<bigint>]
And to get this from the 4 columns you had, you can do:
>>> from pyspark.sql.functions import *
>>> df2=df.select("doc", array("word1", "word2", "word3").alias("words"))
>>> df2
DataFrame[doc: string, words: array<bigint>]
>>> df2.show()
+-----+---------+
| doc| words|
+-----+---------+
|doc_3|[1, 3, 9]|
|doc_1|[9, 6, 0]|
|doc_2|[9, 9, 3]|
+-----+---------+
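If you eventually need the values back as numpy arrays on the driver, one simple option is to collect and convert there (this assumes the data fits in driver memory):
import numpy as np

# Each row's "words" comes back as a plain Python list; stack them driver-side
rows = df2.select("doc", "words").collect()
docs = [r["doc"] for r in rows]
words = np.array([r["words"] for r in rows])   # shape: (number of docs, 3)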