PySpark-numpy interoperability

I am working on a dataset where most of the columns are 1D sequences of doubles, like the following one:
from pyspark.sql import Row

source_data = [
    Row(ID="F0", P1=[-1.0, -2.0, -3.0], P2=[-4.0, -5.0, -6.0]),
    Row(ID="F1", P1=[1.0, 2.0, 3.0], P2=[4.0, 5.0, 6.0]),
]
df = spark.createDataFrame(source_data)
which looks like:
+---+------------------+------------------+
| ID| P1| P2|
+---+------------------+------------------+
| F0|[-1.0, -2.0, -3.0]|[-4.0, -5.0, -6.0]|
| F1| [1.0, 2.0, 3.0]| [4.0, 5.0, 6.0]|
+---+------------------+------------------+
The columns P1 and P2 have spark type ArrayType(DoubleType()).
In my real dataset, I have hundreds of such columns, holding sequences of different lengths.
My goal is to execute mathematical operations such as mean, median, quantiles, FFT, etc. on these sequences in a distributed manner on a cluster.
My approach is to wrap numpy functions as follows. First, I "hide"
the numpy types which do not seem to be supported by Spark:
import numpy as np

# wrap functions to "hide" numpy data types
def py_median(param_val):
    param_val = np.array(param_val)
    param_median = np.median(param_val)
    return float(param_median)

def py_abs(param_val):
    param_val = np.array(param_val)
    param_abs = np.abs(param_val)
    return param_abs.tolist()
Then, I turn these functions into PySpark UDFs:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# wrap functions to operate as udfs
median = F.udf(py_median, DoubleType())
abs = F.udf(py_abs, ArrayType(DoubleType()))
Finally, I can use them with PySpark to do the work:
from pyspark.sql.functions import col

df_processed = (
    df.withColumn("P1_mean", median(col("P1")))
      .withColumn("P2_abs", abs(col("P2")))
)
Note that I have to convert the results to plain Python types, such as:
a Python float (for functions returning a scalar)
a Python list of floats (for functions returning a 1D numpy array)
otherwise I get Spark errors.
I also need different wrapping strategies to handle functions returning a complex array (e.g. the FFT) or multiple return values. Is there a cleaner way to get my job done? Perhaps using pandas UDFs?
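For example, I imagine a pandas UDF version might look roughly like the sketch below (untested, assuming Spark 3.x with type-hinted pandas UDFs; pd_median and pd_abs are just illustrative names):
import pandas as pd
import numpy as np
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import ArrayType, DoubleType

# scalar pandas UDF: a batch of sequences in, a batch of medians out
@pandas_udf(DoubleType())
def pd_median(series_col: pd.Series) -> pd.Series:
    return series_col.apply(lambda seq: float(np.median(seq)))

# scalar pandas UDF returning an array column
@pandas_udf(ArrayType(DoubleType()))
def pd_abs(series_col: pd.Series) -> pd.Series:
    return series_col.apply(lambda seq: np.abs(np.asarray(seq)).tolist())

df_processed = (
    df.withColumn("P1_median", pd_median(col("P1")))
      .withColumn("P2_abs", pd_abs(col("P2")))
)
But I am not sure whether this is the intended way to use them.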
Thanks in advance,
Marco

Related

Using pandas udf to return a full column containing average

This is very weird: I tried using a pandas UDF on a Spark DataFrame, and it works only if I do a select and return one value, which is the average of the column.
But if I try to fill the whole column with this value, then it doesn't work.
The following works:
@pandas_udf(DoubleType())
def avg(col):
    cl = np.average(col)
    return cl

df.select(avg('col'))
This works and returns a DataFrame with one row containing the average of the column.
But the following doesn't work:
df.withColumn('avg', F.lit(avg(col)))
Why? If avg(col) is a value, then why can't I use it with lit() to fill the column?
It does work in the following example, where I return a constant number:
@pandas_udf(DoubleType())
def avg(col):
    return 5

df.withColumn('avg', avg(col))
I also tried returning a Series, and that didn't work either:
@pandas_udf(DoubleType())
def avg(col):
    cl = np.average(col)
    return pd.Series([cl] * col.size())

df.withColumn('avg', avg(col))
This doesn't work, but it does work if I use a constant instead of cl.
So basically, how can I return a full column containing the average, i.e. fill the whole column with that single value?
lit is evaluated on the driver; it is not executed against the data on the executors.
The best way to achieve this would be to simply define a window spec for the entire dataset and call the aggregate function over the window. This would eliminate the need for an extra UDF.
from pyspark.sql import Window

windowSpec = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('avg', avg(col).over(windowSpec))
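For contrast, a minimal sketch of what the lit route would have to look like, assuming the column is called 'col' as in the question: the aggregate is computed and collected on the driver first, and only then broadcast back as a literal.
from pyspark.sql import functions as F

avg_value = df.agg(F.avg('col')).first()[0]   # aggregate runs on the cluster, the result lands on the driver
df = df.withColumn('avg', F.lit(avg_value))   # lit wraps a plain Python value, not a Column expression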
Type cast it to float().
I am not sure what you are trying to achieve here. The UDF is called for each row. So, inside the UDF, the "col" represents each individual cell value - it does not represent the entire column.
If your column is of type array/list:
df = spark.createDataFrame(
    [
        [[1.0, 2.0, 3.0, 4.0]],
        [[5.0, 6.0, 7.0, 8.0]],
    ],
    ["num"]
)

@F.udf(returnType=DoubleType())
def avg(col):
    import numpy as np
    return float(np.average(col))

df = df.withColumn("avg", avg("num"))
+--------------------+---+
| num|avg|
+--------------------+---+
|[1.0, 2.0, 3.0, 4.0]|2.5|
|[5.0, 6.0, 7.0, 8.0]|6.5|
+--------------------+---+
But if your column is a scalar type like double/float, then the UDF sees only one value per row, so the "average" is just that row's own value:
df = spark.createDataFrame(
    [[1.0], [2.0], [3.0], [4.0]],
    ["num"]
)

@F.udf(returnType=DoubleType())
def avg(col):
    import numpy as np
    return float(np.average(col))

df = df.withColumn("avg", avg("num"))
+---+---+
|num|avg|
+---+---+
|1.0|1.0|
|2.0|2.0|
|3.0|3.0|
|4.0|4.0|
+---+---+
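If the goal is to put the overall average on every row of that scalar dataframe without a UDF, one alternative sketch (reusing the same "num" column) is to aggregate into a one-row dataframe and cross join it back:
from pyspark.sql import functions as F

avg_df = df.agg(F.avg("num").alias("avg_num"))  # single-row dataframe holding the overall average
df = df.crossJoin(avg_df)                       # attaches that value to every row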

Add a column from a function of 2 other columns in PySpark

I have two columns in a data frame df in PySpark:
+----------+----------+
| features |  center  |
+----------+----------+
| [0,1,0]  | [1.5,2,1]|
| [5,7,6]  | [10,7,7] |
+----------+----------+
I want to create a function which calculates the Euclidean distance between df['features'] and df['center'] and map it to a new column in df, distance.
Let's say our function looks like the following:
import numpy as np
from pyspark.sql.functions import udf

@udf
def dist(feat, cent):
    return np.linalg.norm(feat - cent)
How would I actually apply this to do what I want it to do? I was trying things like
df.withColumn("distance", dist(col("features"), col("center"))).show()
but that gives me the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 869.0 failed 4 times, most recent failure: Lost task 0.3 in stage 869.0 (TID 26423) (10.50.91.134 executor 35): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
I am really struggling with understanding how to do basic Python mappings in a Spark context, so I really appreciate any help.
You have truly chosen a difficult topic. In Spark, 95%+ of things can be done without Python UDFs. You should always try to find a way not to create a UDF.
I tried your UDF and got the same error, and I cannot really say why. I think it is a data type issue, as you pass a Spark array into a function which expects numpy data types; I can't tell much more than that.
Euclidean distance, however, can be calculated in Spark itself. It's not an easy one, though.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [([0, 1, 0], [1.5, 2., 1.]),
     ([5, 7, 6], [10., 7., 7.])],
    ['features', 'center'])

distance = F.aggregate(
    F.transform(
        F.arrays_zip('features', 'center'),
        lambda x: (x['features'] - x['center'])**2
    ),
    F.lit(0.0),
    lambda acc, x: acc + x,
    lambda x: x**.5
)
df = df.withColumn('distance', distance)
df.show()
# +---------+----------------+------------------+
# | features| center| distance|
# +---------+----------------+------------------+
# |[0, 1, 0]| [1.5, 2.0, 1.0]|2.0615528128088303|
# |[5, 7, 6]|[10.0, 7.0, 7.0]|5.0990195135927845|
# +---------+----------------+------------------+
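A slightly shorter variant of the same idea, assuming Spark 3.1+ where zip_with is available, is sketched below:
# squared element-wise differences via zip_with, then summed and square-rooted
distance = F.sqrt(
    F.aggregate(
        F.zip_with('features', 'center', lambda a, b: (a - b) ** 2),
        F.lit(0.0),
        lambda acc, x: acc + x,
    )
)
df = df.withColumn('distance', distance)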
Another option is sklearn's paired_distances applied through mapInPandas.
Alter the df's schema to accommodate the dist column:
from pyspark.sql.functions import lit
sch = df.withColumn('dist', lit(90.087654623)).schema
Create the pandas function that calculates the distance:
from typing import Iterator
import pandas as pd
from sklearn.metrics.pairwise import paired_distances

def euclidean_dist(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        yield pdf.assign(dist=paired_distances(pdf['features'].to_list(), pdf['center'].to_list()))

df.mapInPandas(euclidean_dist, schema=sch).show()
Output:
+---------+----------------+------------------+
| features| center| dist|
+---------+----------------+------------------+
|[0, 1, 0]| [1.5, 2.0, 1.0]|2.0615528128088303|
|[5, 7, 6]|[10.0, 7.0, 7.0]|5.0990195135927845|
+---------+----------------+------------------+
You can calculate the distance using only PySpark and Spark SQL APIs:
import pyspark.sql.functions as f

df = df.withColumn(
    'distance',
    f.sqrt(f.expr('aggregate(transform(features, (element, idx) -> pow(element - element_at(center, idx + 1), 2)), cast(0 as double), (acc, val) -> acc + val)'))
)

Creating PySpark UDFs from python method with numpy array input, to calculate and return a single float value

As input I have a CSV file with int values in it.
from pyspark.sql.types import FloatType

spark_df = spark.read.option("header", "false").csv("../int_values.csv")
df = spark_df.selectExpr("_c0 as something")
_df = df.withColumn("values", df.something.cast(FloatType())).select("values")
I also have some Python functions designed for numpy array inputs that I need to apply on the Spark DataFrame.
An example one:
import numpy as np

def calc_sum(float_array):
    return np.sum(float_array)
The real function:
def calc_rms(float_array):
    return np.sqrt(np.mean(np.diff(float_array)**2))
For the first example, you can use the SQL sum like:
_df.groupBy().sum().collect()
But what I need is a standard solution to transform such functions into Spark UDFs.
I tried many ways, for example:
udf_sum = udf(lambda x: calc_sum(x), FloatType())
_df.rdd.flatMap(udf_sum).collect()
but it always failed with:
TypeError: Invalid argument, not a string or column:
Row(values=1114.0) of type <class 'pyspark.sql.types.Row'>. For column
literals, use 'lit', 'array', 'struct' or 'create_map' function.
Is it possible to transform the data in a way that works with these functions?
DataFrame sample:
In [6]: spark_df.show()
+----+
| _c0|
+----+
|1114|
|1113|
|1066|
|1119|
|1062|
|1089|
|1093|
| 975|
|1099|
|1062|
|1062|
|1162|
|1057|
|1123|
|1141|
|1089|
|1172|
|1096|
|1164|
|1146|
+----+
only showing top 20 rows
Expected output:
A Float value returned from the UDF.
For the Sum function it should be clear.
What you want is to group by and use collect_list to get all the integer values into an array column, then apply your UDF on that column. Also, you need to explicitly return a float from calc_rms:
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def calc_rms(float_array):
    return float(np.sqrt(np.mean(np.diff(float_array) ** 2)))

calc_rms_udf = F.udf(calc_rms, FloatType())

df.groupby().agg(F.collect_list("_c0").alias("_c0")) \
  .select(calc_rms_udf(F.col("_c0")).alias("rms")) \
  .show()
#+--------+
#| rms|
#+--------+
#|67.16202|
#+--------+
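A pandas UDF can also express this without collect_list; a sketch reusing the same "_c0" column, assuming Spark 3.x, with the caveats that a grouped-aggregate pandas UDF loads the whole group into memory and that row order is not guaranteed without an explicit sort:
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("float")
def rms_udf(values: pd.Series) -> float:
    # the whole column arrives as one pandas Series, in whatever order Spark delivers it
    return float(np.sqrt(np.mean(np.diff(values.to_numpy()) ** 2)))

df.agg(rms_udf("_c0")).show()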

Performing different computations conditioned on a column value in a spark dataframe

I have a pyspark dataframe with 2 columns, A and B. I need rows of B to be processed differently, based on values of the A column. In plain pandas I might do this:
import pandas as pd

funcDict = {}
funcDict['f1'] = (lambda x: x + 1000)
funcDict['f2'] = (lambda x: x * x)
df = pd.DataFrame([['a', 1], ['b', 2], ['b', 3], ['a', 4]], columns=['A', 'B'])
df['newCol'] = df.apply(
    lambda x: funcDict['f1'](x['B']) if x['A'] == 'a' else funcDict['f2'](x['B']),
    axis=1
)
The easy ways I can think of to do this in (py)spark are:
Use files:
read the data into a dataframe
partition by column A and write to separate files (write.partitionBy)
read each file back in and process it separately
or else
Use expr:
read the data into a dataframe
write an unwieldy expr (from a readability/maintenance perspective) to conditionally do something different based on the value of the column
this will not look anywhere near as "clean" as the pandas code above
Is there another approach that is more appropriate for this requirement? From the efficiency perspective, I expect the first approach to be cleaner but to have a longer run time due to the partition-write-read cycle; the second approach is not as good from the code perspective, and is harder to extend and maintain.
More fundamentally, would you choose something completely different (e.g. message queues) instead (relative latency differences notwithstanding)?
EDIT 1
Based on my limited knowledge of pyspark, the solution proposed by user pissall (https://stackoverflow.com/users/8805315/pissall) works as long as the processing isn't very complex. When it is more complex, I don't know how to do it without resorting to UDFs, which come with their own disadvantages. Consider the simple example below:
# create a 2-column data frame
# where I wish to extract the city
# in column B differently based on
# the type given in column A.
# This requires taking a different
# substring (prefix or suffix) from column B
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = sparkSession.createDataFrame([
    (1, "NewYork_NY"),
    (2, "FL_Miami"),
    (1, "LA_CA"),
    (1, "Chicago_IL"),
    (2, "PA_Kutztown")
], ["A", "B"])

# create UDFs to get left and right substrings
# I do not know how to avoid creating UDFs
# for this type of processing
getCityLeft = udf(lambda x: x[0:-3], StringType())
getCityRight = udf(lambda x: x[3:], StringType())

# apply UDFs
df = df.withColumn("city", F.when(F.col("A") == 1, getCityLeft(F.col("B"))) \
                            .otherwise(getCityRight(F.col("B"))))
Is there a way to do this in a simpler manner without resorting to UDFs? If I use expr, I can do this, but as I mentioned earlier, it doesn't seem elegant.
What about using when?
import pyspark.sql.functions as F
df = df.withColumn("transformed_B", F.when(F.col("A") == "a", F.col("B") + 1000).otherwise(F.col("B") * F.col("B")))
EDIT after more clarity on the question:
You can use split on _ and take the first or the second part of it based on your condition.
Is this the expected output?
df.withColumn("city", F.when(F.col("A") == 1, F.split("B", "_")[0]).otherwise(F.split("B", "_")[1])).show()
+---+-----------+--------+
| A| B| city|
+---+-----------+--------+
| 1| NewYork_NY| NewYork|
| 2| FL_Miami| Miami|
| 1| LA_CA| LA|
| 1| Chicago_IL| Chicago|
| 2|PA_Kutztown|Kutztown|
+---+-----------+--------+
UDF approach:
def sub_string(ref_col, city_col):
    # ref_col is the reference column (A) and city_col is the string we want to sub (B)
    if ref_col == 1:
        return city_col[0:-3]
    return city_col[3:]

sub_str_udf = F.udf(sub_string, StringType())
df = df.withColumn("city", sub_str_udf(F.col("A"), F.col("B")))
Also, please look into: remove last few characters in PySpark dataframe column

Linear 1D interpolation on multiple datasets using loops

I'm interested in performing Linear interpolation using the scipy.interpolate library. The dataset looks somewhat like this:
DATAFRAME for interpolation between X, Y for different RUNs
I'd like to use this interpolated function to find the missing Y from this dataset:
DATAFRAME to use the interpolation function
The number of runs shown here is just 3, but I'm working with a dataset that will run into thousands of runs. I would therefore appreciate advice on how to loop the interpolation over all the runs.
from scipy.interpolate import interp1d

for RUNNumber in range(TotalRuns):
    InterpolatedFunction[RUNNumber] = interp1d(X, Y)
As I understand it, you want a separate interpolation function defined for each run. Then you want to apply these functions to a second dataframe. I defined a dataframe df with columns ['X', 'Y', 'RUN'], and a second dataframe, new_df with columns ['X', 'Y_interpolation', 'RUN'].
interpolating_functions = dict()

for run_number in range(1, max_runs):
    run_data = df[df['RUN'] == run_number][['X', 'Y']]
    interpolating_functions[run_number] = interp1d(run_data['X'], run_data['Y'])
Now that we have interpolating functions for each run, we can use them to fill in the 'Y_interpolation' column in a new dataframe. This can be done using the apply function, which takes a function and applies it to each row in a dataframe. So let's define an interpolate function that will take a row of this new df and use the X value and the run number to calculate an interpolated Y value.
def interpolate(row):
    int_func = interpolating_functions[row['RUN']]
    # the _call_linear method expects and returns an array
    interp_y = int_func._call_linear([row['X']])
    return interp_y[0]
Now we just use apply and our defined interpolate function.
new_df['Y_interpolation'] = new_df.apply(interpolate,axis=1)
I'm using pandas version 0.20.3, and this gives me a new_df that looks like this: