Can the Spark window function first_value be used? - apache-spark-sql

The window function first_value (equivalent to first) seems to exist, as the code snippet below demonstrates. The function is not listed in
https://spark.apache.org/docs/3.1.2/sql-ref-functions-builtin.html#window-functions
however it is listed in
https://spark.apache.org/docs/latest/api/sql/#first_value
In any case, it seems to work as a window function:
# first_value window function
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.types import *
import pandas as pd
from time import perf_counter
# get a spark session
spark = SparkSession.builder.appName('learn').getOrCreate()
# create dataset
schema = StructType([
    StructField('c1', StringType(), nullable=True),
    StructField('c2', StringType(), nullable=True),
    StructField('value', DoubleType(), nullable=True),
])
import random
data = [(random.choice(list('ABC')), random.choice(list('abc')), random.random()) for _ in range(100)]
df = spark.createDataFrame(data, schema=schema).drop_duplicates()
df.createOrReplaceTempView('tmp_view')
# execute window function (using first() instead of first_value() gives the same result)
query ="""SELECT c1, first_value(value) OVER (PARTITION BY c1) as f FROM tmp_view"""
res = spark.sql(query)
res.drop_duplicates().show()
So the question is: is this simply a matter of the first_value/first window function being omitted from the window-function table in the documentation?
Looking at the DataFrame API, it seems first_value does not exist, whilst first is documented not as a window function but as an aggregate function:
import pyspark.sql.functions as f
f.first?
Signature: f.first(col, ignorenulls=False)
Docstring:
Aggregate function: returns the first value in a group.
The function by default returns the first values it sees. It will return the first non-null
value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
.. versionadded:: 1.3.0
However, it can be executed as a window function using the dataframe API:
from pyspark.sql.window import Window
w = Window.partitionBy("c1")
res = df.withColumn('f', f.first('value').over(w))
res = res.select(['c1', 'f']).drop_duplicates()
res.show()
It can also be used as an aggregation function
data = [('a', 3),
        ('a', 30),
        ('b', 7),
        ('b', 70)]
df = spark.createDataFrame(data, ['nam', 'value'])
res = df.groupby('nam').agg(f.first(f.col('value')))
res.show()
What is going on? Is the documentation confusing, or is my understanding somehow wrong?

Check the official API docs and their examples: you can define a windowing column from a DataFrame column with xxx.over(w).
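For instance, even though first_value is not exposed as a DataFrame function, it can still be reached from the DataFrame API through a SQL expression. A minimal sketch, reusing df from the question:
# first_value via a SQL expression column, no temp view needed
res = df.selectExpr("c1", "first_value(value) OVER (PARTITION BY c1) AS f")
res.drop_duplicates().show()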

Related

Filter dataframe based on condition before groupby

Suppose I have a dataframe like this
Create sample dataframe:
import pandas as pd
import numpy as np
data = {
    'gender': np.random.choice(['m', 'f'], size=100),
    'vaccinated': np.random.choice([0, 1], size=100),
    'got sick': np.random.choice([0, 1], size=100)
}
df = pd.DataFrame(data)
and I want to see, by gender, what proportion of vaccinated people got sick.
I've tried something like this:
df.groupby('gender').agg(lambda group: sum((group['vaccinated'] == 1) & (group['got sick'] == 1))
                                       / sum(group['got sick'] == 1))
but this doesn't work because agg works at the series level. The same applies to transform. apply doesn't work either, but I'm not clear why, or how apply operates on groupby objects.
Any ideas how to accomplish this with a single line of code?
You could first filter for the vaccinated people and then group by gender and calculate the proportion of people that got sick.
df[df.vaccinated == 1].groupby("gender").agg({"got sick": "mean"})
Output:
        got sick
gender
f       0.548387
m       0.535714
In this case the proportion is calculated based on the sample data created above.
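Equivalently, since "got sick" is a 0/1 column, taking its mean directly gives the same proportion (a one-liner sketch reusing the df above):
df[df.vaccinated == 1].groupby("gender")["got sick"].mean()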
The docs for GroupBy.apply state that the function is applied "group-wise". This means that the function is called on each group separately as a data frame.
That is, df.groupby(c).apply(f) is conceptually equivalent to:
results = {}
for val in df[c].unique():
    group = df.loc[df[c] == val]
    result = f(group)
    results[val] = result
pd.concat(results)
We can use this understanding to apply your custom aggregation function, using a top-level def just to make the code easier to read:
def calc_vax_sick_frac(group):
    vaccinated = group['vaccinated'] == 1
    sick = group['got sick'] == 1
    return (vaccinated & sick).sum() / sick.sum()
(
    df
    .groupby('gender')
    .apply(calc_vax_sick_frac)
)

pandas_udf with pd.Series and other object as arguments

I am having trouble creating a Pandas UDF that performs a calculation on a pd.Series based on a value in the same row of the underlying Spark DataFrame.
However, the most straightforward solution doesn't seem to be supported by the Pandas on Spark API.
A very simple example like the one below
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
import pandas as pd
@F.pandas_udf(IntegerType())
def addition(arr: pd.Series, addition: int) -> pd.Series:
    return arr.add(addition)
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show()
df.withColumn("added", addition(F.col("array"),F.col("addition")))
throws the following exception on the udf definition line
NotImplementedError: Unsupported signature: (arr: pandas.core.series.Series, addition: int) -> pandas.core.series.Series.
Am I tackling this problem in the wrong way? I could reimplement the whole "addition" function in native PySpark, but the real function I am talking about is terribly complex and reimplementing it would mean an enormous amount of rework.
Loading the example and adding an import of array:
import pyspark.sql.types as T
import pyspark.sql.functions as F
import pandas as pd
from array import array
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show(truncate=False)
print(df.schema.fields)
The response is,
+---------+--------+
| array|addition|
+---------+--------+
|[1, 2, 3]| 10|
|[4, 5, 6]| 20|
+---------+--------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True)]
If you must use a Pandas function to complete your task, here is an option that wraps the Pandas function in a PySpark UDF:
The Spark DF arr column is ArrayType; convert it into a Pandas Series
Apply the Pandas function
Then convert the Pandas Series back to an array
@F.udf(T.ArrayType(T.LongType()))
def addition_pd(arr, addition):
    pd_arr = pd.Series(arr)
    added = pd_arr.add(addition)
    return array("l", added)
df = df.withColumn("added", addition_pd(F.col("array"),F.col("addition")))
df.show(truncate=False)
print(df.schema.fields)
Returns
+---------+--------+------------+
|array |addition|added |
+---------+--------+------------+
|[1, 2, 3]|10 |[11, 12, 13]|
|[4, 5, 6]|20 |[24, 25, 26]|
+---------+--------+------------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True), StructField('added', ArrayType(LongType(), True), True)]
However, it is worth stating that, when possible, it is recommended to use built-in PySpark functions rather than a PySpark UDF (see here).
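For this particular case, a minimal sketch of what that could look like, using Spark's built-in higher-order function transform via a SQL expression (available since Spark 2.4), which avoids the Python UDF round trip entirely:
import pyspark.sql.functions as F
df = spark.createDataFrame([([1, 2, 3], 10), ([4, 5, 6], 20)], ["array", "addition"])
# add the row's "addition" value to every element of the "array" column
df = df.withColumn("added", F.expr("transform(array, x -> x + addition)"))
df.show(truncate=False)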

Using pandas udf to return a full column containing average

This is very weird: I tried using a pandas UDF on a Spark df and it works only if I do a select and return one value, which is the average of the column,
but if I try to fill the whole column with this value then it doesn't work.
the following works:
@pandas_udf(DoubleType())
def avg(col):
    cl = np.average(col)
    return cl
df.select(avg('col'))
This works and returns a df of one row containing the average of the column,
but the following doesn't work:
df.withColumn('avg', F.lit(avg('col')))
Why? If avg('col') is a value, then why can't I use it to fill the column with lit()?
Unlike the following example, which does work when I return a constant number:
@pandas_udf(DoubleType())
def avg(col):
    return 5
df.withColumn('avg', avg('col'))
I also tried returning a series, and it didn't work either:
@pandas_udf(DoubleType())
def avg(col):
    cl = np.average(col)
    return pd.Series([cl] * col.size)
df.withColumn('avg', avg('col'))
This doesn't work either, but it does work if I use a constant instead of cl.
So basically, how could I return a full column containing the same average value, so as to fill up the whole column with that value?
lit is evaluated on driver and not executed on the data on the executor.
The best way to achieve this would be to simply define a window spec for the entire dataset and call the aggregate function over the window. This would eliminate the need for an extra UDF.
from pyspark.sql import Window
windowSpec = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('avg', F.avg('col').over(windowSpec))
Type cast it to float().
I am not sure what you are trying to achieve here. The UDF is called for each row. So, inside the UDF, the "col" represents each individual cell value - it does not represent the entire column.
If your column is of type array/list:
df = spark.createDataFrame(
    [
        [[1.0, 2.0, 3.0, 4.0]],
        [[5.0, 6.0, 7.0, 8.0]],
    ],
    ["num"]
)
@F.udf(returnType=DoubleType())
def avg(col):
    import numpy as np
    return float(np.average(col))
df = df.withColumn("avg", avg("num"))
+--------------------+---+
| num|avg|
+--------------------+---+
|[1.0, 2.0, 3.0, 4.0]|2.5|
|[5.0, 6.0, 7.0, 8.0]|6.5|
+--------------------+---+
But if your column is a scalar type like double/float, then the average of it via UDF will always return the same column value:
df = spark.createDataFrame(
    [[1.0], [2.0], [3.0], [4.0]],
    ["num"]
)
@F.udf(returnType=DoubleType())
def avg(col):
    import numpy as np
    return float(np.average(col))
df = df.withColumn("avg", avg("num"))
+---+---+
|num|avg|
+---+---+
|1.0|1.0|
|2.0|2.0|
|3.0|3.0|
|4.0|4.0|
+---+---+
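Tying this back to the window-spec answer above: a minimal sketch (assuming the scalar "num" dataframe just shown) of filling the whole column with the overall average without any UDF:
from pyspark.sql import Window
import pyspark.sql.functions as F
# an unbounded window over the whole dataframe, so avg is the global average
w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn("avg", F.avg("num").over(w))
df.show()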

Creating PySpark UDFs from python method with numpy array input, to calculate and return a single float value

As input I have a csv file with int values in it.
spark_df = spark.read.option("header", "false").csv("../int_values.csv")
df = spark_df.selectExpr("_c0 as something")
_df = df.withColumn("values", df.something.cast(FloatType())).select("values")
I also have some python functions designed for numpy array inputs, that I need to apply on the Spark DataFrame.
The example one:
def calc_sum(float_array):
    return np.sum(float_array)
Real function:
def calc_rms(float_array):
    return np.sqrt(np.mean(np.diff(float_array)**2))
For the first example you could use the SQL sum, like:
_df.groupBy().sum().collect()
But what I need is a standard solution for turning these functions into Spark UDFs.
I tried many ways, like:
udf_sum = udf(lambda x : calc_sum(x), FloatType())
_df.rdd.flatMap(udf_sum).collect()
but it always failed with:
TypeError: Invalid argument, not a string or column:
Row(values=1114.0) of type <class 'pyspark.sql.types.Row'>. For column
literals, use 'lit', 'array', 'struct' or 'create_map' function.
Is it possible to transform the data in a way that works with these functions?
DataFrame sample:
In [6]: spark_df.show()
+----+
| _c0|
+----+
|1114|
|1113|
|1066|
|1119|
|1062|
|1089|
|1093|
| 975|
|1099|
|1062|
|1062|
|1162|
|1057|
|1123|
|1141|
|1089|
|1172|
|1096|
|1164|
|1146|
+----+
only showing top 20 rows
Expected output:
A Float value returned from the UDF.
For the Sum function it should be clear.
What you want is to group by and use collect_list to get all the integer values into an array column, then apply your UDF on that column. Also, you need to explicitly return a Python float from calc_rms:
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def calc_rms(float_array):
    return float(np.sqrt(np.mean(np.diff(float_array) ** 2)))

calc_rms_udf = F.udf(calc_rms, FloatType())

df.groupby().agg(F.collect_list("_c0").alias("_c0")) \
  .select(calc_rms_udf(F.col("_c0")).alias("rms")) \
  .show()
#+--------+
#| rms|
#+--------+
#|67.16202|
#+--------+
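The same collect_list pattern works for the calc_sum example from the question as well (a quick sketch; for a plain sum the built-in F.sum would of course be simpler):
calc_sum_udf = F.udf(lambda arr: float(np.sum(arr)), FloatType())
df.groupby().agg(F.collect_list("_c0").alias("_c0")) \
  .select(calc_sum_udf(F.col("_c0")).alias("sum")) \
  .show()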

Is there an equivalent of 'REGEXP_SUBSTR' of SnowFlake in PySpark?

Is there an equivalent of Snowflake's REGEXP_SUBSTR in PySpark/spark-sql?
REGEXP_EXTRACT exists, but that doesn't support as many parameters as are supported by REGEXP_SUBSTR.
Here is a link to REGEXP_SUBSTR.
Here is a link to REGEXP_EXTRACT.
More specifically, I'm looking for alternatives for position, occurrence and regex parameters which are supported by Snowflake's REGEXP_SUBSTR.
position: Number of characters from the beginning of the string where the function starts searching for matches.
occurrence: Specifies which occurrence of the pattern to match. The function skips the first (occurrence - 1) matches.
regex_parameters: I'm looking specifically for the parameter 'e', which does the following:
extract sub-matches.
So the query is something like:
REGEXP_SUBSTR(string, pattern, 1, 2, 'e', 2).
Sample Input: It was the best of times, it was the worst in times.
Expected output: worst
Assuming string1 = It was the best of times, it was the worst in times.
Equivalent SF query:
SELECT regexp_substr(string1, 'the(\\W+)(\\w+)', 1, 2, 'e', 2)
One of the best things about Spark is that you don't have to rely on a vendor to create a library of functions for you. You can create a User Defined Function in Python and use it in a Spark SQL statement, e.g. starting with:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.functions import broadcast,col, lit, concat, udf
from pyspark.sql.types import StructField, StructType, IntegerType, StringType
import re
def regexp_substr(subject: str, pattern: str, position: int = 1, occurrence: int = 1, group_num: int = 0) -> str:
    # Snowflake positions are 1-based, so start searching at position - 1
    matches = list(re.finditer(pattern, subject[position - 1:]))
    if len(matches) < occurrence:
        return None
    # return the requested capture group (0 = whole match) of the requested occurrence
    return matches[occurrence - 1].group(group_num)
# bench testing the python function
string1 = 'It was the best of times, it was the worst in times.'
pattern = r'the(\W+)(\w+)'
rv = regexp_substr(string1, pattern, 1, 2, 2)
print(rv)  # worst
# wrap for use with the DataFrame API
regexp_substr_udf = udf(regexp_substr, StringType())
# register for use in Spark SQL
spark.udf.register("REGEXP_SUBSTR", regexp_substr, StringType())
# create a Spark DataFrame with 100 rows of the sample string
df = spark.range(100).withColumn("s", lit(string1))
df.createOrReplaceTempView("df")
then you can run Spark SQL queries like
%%sql
select *, REGEXP_SUBSTR(s, 'the(\\W+)(\\w+)', 1, 2, 2) ex from df
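For completeness, a rough equivalent with the DataFrame API, using the regexp_substr_udf wrapper defined above (literal arguments wrapped in lit):
res = df.withColumn(
    "ex",
    regexp_substr_udf(col("s"), lit(r'the(\W+)(\w+)'), lit(1), lit(2), lit(2))
)
res.show(5, truncate=False)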