Add a column from a function of 2 other columns in PySpark - dataframe

I have two columns in a data frame df in PySpark:
+----------+-----------+
| features | center    |
+----------+-----------+
| [0,1,0]  | [1.5,2,1] |
| [5,7,6]  | [10,7,7]  |
+----------+-----------+
I want to create a function which calculates the Euclidean distance between df['features'] and df['center'] and map it to a new column in df, distance.
Let's say our function looks like the following:
@udf
def dist(feat, cent):
    return np.linalg.norm(feat - cent)
How would I actually apply this to do what I want it to do? I was trying things like
df.withColumn("distance", dist(col("features"), col("center"))).show()
but that gives me the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 869.0 failed 4 times, most recent failure: Lost task 0.3 in stage 869.0 (TID 26423) (10.50.91.134 executor 35): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
I am really struggling with understanding how to do basic Python mappings in a Spark context, so I really appreciate any help.

You have truly chosen a difficult topic. In Spark, 95%+ of things can be done without Python UDFs, and you should always try to find a way not to create one.
I've attempted your UDF and got the same error, and I can't really tell why. I think it's something with data types, as you pass a Spark array into a function which expects numpy data types. I really can't tell much more...
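As an aside, here is a minimal sketch of what a type-safe version of that UDF might look like, assuming the problem really is numpy types leaking in and out: convert the inputs to numpy arrays, cast the result back to a plain Python float, and declare the return type explicitly.
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
import numpy as np

@F.udf(returnType=DoubleType())
def dist(feat, cent):
    # feat and cent arrive as plain Python lists; cast the numpy result back to float
    return float(np.linalg.norm(np.array(feat) - np.array(cent)))

df.withColumn("distance", dist(F.col("features"), F.col("center"))).show()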
Still, Euclidean distance can be calculated in Spark itself. Not an easy one, though.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [([0, 1, 0], [1.5, 2., 1.]),
     ([5, 7, 6], [10., 7., 7.])],
    ['features', 'center'])

# F.arrays_zip pairs the two arrays element-wise, F.transform squares each
# difference, and F.aggregate sums the squares and takes the square root.
# (F.transform and F.aggregate exist in the Python API from Spark 3.1.)
distance = F.aggregate(
    F.transform(
        F.arrays_zip('features', 'center'),
        lambda x: (x['features'] - x['center'])**2
    ),
    F.lit(0.0),
    lambda acc, x: acc + x,
    lambda x: x**.5
)
df = df.withColumn('distance', distance)

df.show()
# +---------+----------------+------------------+
# | features|          center|          distance|
# +---------+----------------+------------------+
# |[0, 1, 0]| [1.5, 2.0, 1.0]|2.0615528128088303|
# |[5, 7, 6]|[10.0, 7.0, 7.0]|5.0990195135927845|
# +---------+----------------+------------------+
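If you are on Spark 3.1+, F.zip_with can replace the arrays_zip + transform pair; an assumed equivalent, shown only as a sketch:
from pyspark.sql import functions as F

distance_alt = F.aggregate(
    F.zip_with('features', 'center', lambda f, c: (f - c) ** 2),  # element-wise squared difference
    F.lit(0.0),
    lambda acc, x: acc + x,   # sum of squares
    lambda s: s ** 0.5        # square root
)
df = df.withColumn('distance', distance_alt)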

Another option is mapInPandas together with scikit-learn's paired_distances:
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import lit
from sklearn.metrics.pairwise import paired_distances

# Alter df's schema to accommodate the dist column
sch = df.withColumn('dist', lit(90.087654623)).schema

# Create the pandas function that calculates the distance batch by batch
def euclidean_dist(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        yield pdf.assign(dist=paired_distances(pdf['features'].to_list(),
                                               pdf['center'].to_list()))

df.mapInPandas(euclidean_dist, schema=sch).show()
Output:
+---------+----------------+------------------+
| features|          center|              dist|
+---------+----------------+------------------+
|[0, 1, 0]| [1.5, 2.0, 1.0]|2.0615528128088303|
|[5, 7, 6]|[10.0, 7.0, 7.0]|5.0990195135927845|
+---------+----------------+------------------+

You can calculate the distance using only the PySpark and Spark SQL APIs:
import pyspark.sql.functions as f

df = (
    df
    .withColumn(
        'distance',
        f.sqrt(f.expr("""
            aggregate(
                transform(features, (element, idx) -> pow(element - element_at(center, idx + 1), 2)),
                cast(0 as double),
                (acc, val) -> acc + val
            )
        """))
    )
)

Related

How do I use `pd.NamedAgg` with a lambda function inside a `pandas` aggregation?

I want to be able to feed a list as parameters to generate different aggregate functions in pandas. To make this more concrete, let's say I have this as data:
import numpy as np
import pandas as pd
np.random.seed(0)
df_data = pd.DataFrame({
    'group': np.repeat(['x', 'y'], 10),
    'val': np.random.randint(0, 10, 20)
})
So the first few rows of the data look like this:
group  val
    x    5
    x    0
    x    3
I have a list of per-group percentiles that I want to compute.
percentile_list = [10, 90]
And I tried to use dictionary comprehension with pd.NamedAgg that calls a lambda function to do this.
df_agg = df_data.groupby('group').agg(
    **{f'p{y}_by_dict': pd.NamedAgg('val', lambda x: np.quantile(x, y / 100)) for y in percentile_list},
)
But it doesn't work. Here I calculate both by hand and by dictionary comprehension.
df_agg = df_data.groupby('group').agg(
    p10_by_hand=pd.NamedAgg('val', lambda x: np.quantile(x, 0.1)),
    p90_by_hand=pd.NamedAgg('val', lambda x: np.quantile(x, 0.9)),
    **{f'p{y}_by_dict': pd.NamedAgg('val', lambda x: np.quantile(x, y / 100)) for y in percentile_list},
)
The result looks like this. The manually specified aggregations work but the dictionary comprehension ones have the same values across different aggregations. I guess they just took the last lambda function in the generated dictionary.
       p10_by_hand  p90_by_hand  p10_by_dict  p90_by_dict
group
x              1.8          7.2          7.2          7.2
y              1.0          8.0          8.0          8.0
How do I fix this? I don't have to use dictionary comprehension, as long as each aggregation can be specified programmatically.
In [23]: def agg_gen(y):
    ...:     return lambda x: np.quantile(x, y / 100)
    ...:

In [24]: df_data.groupby('group').agg(
    ...:     **{f'p{y}_by_dict': pd.NamedAgg('val', agg_gen(y)) for y in percentile_list},
    ...: )
Out[24]:
       p10_by_dict  p90_by_dict
group
x              1.8          7.2
y              1.0          8.0
The reason your initial attempt fails is lambda late binding; see: What do lambda function closures capture?
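A minimal illustration of that late-binding behavior (plain Python, nothing pandas-specific):
funcs = [lambda x: x + y for y in (1, 10)]
print(funcs[0](0), funcs[1](0))   # 10 10 -- every lambda sees the final value of y

# binding y at definition time with a default argument (or a factory like agg_gen) fixes it
funcs = [lambda x, y=y: x + y for y in (1, 10)]
print(funcs[0](0), funcs[1](0))   # 1 10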

Median of an array column in Spark or pandas, for all rows simultaneously

Strangely enough, I can't find anywhere on the internet whether this is possible.
I have a dataframe with an array column:
arr_col
[1,3,4]
[4,3,5]
I want this result:
Result
3
4
That is, I want the median for each row.
I managed to do it with a pandas UDF, but it iterates over the column and applies np.median to one row at a time.
I don't want that, because it's slow and works row by row. I want something that acts on all rows at the same time, either in pandas or PySpark.
Use numpy:
import numpy as np

df['Result'] = np.median(np.vstack(df['arr_col']), axis=1)

Or explode and groupby.median:
df['Result'] = (df['arr_col'].explode()
                .groupby(level=0).median()
                )

Output:
     arr_col  Result
0  [1, 3, 4]     3.0
1  [4, 3, 5]     4.0

Used input:
df = pd.DataFrame({'arr_col': [[1,3,4], [4,3,5]]})
You can also use a UDF in PySpark:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType
import numpy as np

m = udf(lambda x: int(np.median(x)), IntegerType())
df.withColumn('Result', m(col('arr_col'))).show()

+---+---------+------+
| Id|  arr_col|Result|
+---+---------+------+
|  1|[1, 3, 4]|   3.0|
|  1|[4, 3, 6]|   4.0|
+---+---------+------+
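If you want to avoid the Python UDF entirely in PySpark, here is a sketch (assuming Spark 2.4+ for sort_array/element_at, and the column name arr_col) that sorts each array and picks the middle element, or averages the two middle elements for arrays of even length:
from pyspark.sql import functions as F

median_expr = F.expr("""
    CASE WHEN size(arr_col) % 2 = 1
         THEN element_at(sort_array(arr_col), CAST((size(arr_col) + 1) / 2 AS INT))
         ELSE (element_at(sort_array(arr_col), CAST(size(arr_col) / 2 AS INT))
             + element_at(sort_array(arr_col), CAST(size(arr_col) / 2 AS INT) + 1)) / 2
    END
""")

df.withColumn('Result', median_expr).show()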

How to transform columns with method chaining?

What's the most fluent (or easy to read) method chaining solution for transforming columns in Pandas?
(“method chaining” or “fluent” is the coding style made popular by Tom Augspurger among others.)
For the sake of the example, let's set up some example data:
import pandas as pd
import seaborn as sns

df = sns.load_dataset("iris").astype(str)  # Just for this example
df.loc[1, :] = "NA"
df.head()
#
#   sepal_length sepal_width petal_length petal_width species
# 0          5.1         3.5          1.4         0.2  setosa
# 1           NA          NA           NA          NA      NA
# 2          4.7         3.2          1.3         0.2  setosa
# 3          4.6         3.1          1.5         0.2  setosa
# 4          5.0         3.6          1.4         0.2  setosa
Just for this example: I want to map certain columns through a function - sepal_length using pd.to_numeric - while keeping the other columns as they were. What's the easiest way to do that in a method chaining style?
I can already use assign, but I'm repeating the column name here, which I don't want.
new_result = (
    df.assign(sepal_length=lambda df_: pd.to_numeric(df_.sepal_length, errors="coerce"))
    .head()  # Further chaining methods, whatever they may be
)
I can use transform, but transform drops(!) the unmentioned columns. Transform with passthrough for the other columns would be ideal:
# Columns not mentioned in transform are lost
new_result = (
    df.transform({'sepal_length': lambda series: pd.to_numeric(series, errors="coerce")})
    .head()  # Further chaining methods...
)
Is there a “best” way to apply transformations to certain columns, in a fluent style, and pass the other columns along?
Edit: Below this line, a suggestion after reading Laurent's ideas.
Add a helper function that allows applying a mapping to just one column:
import functools

coerce_numeric = functools.partial(pd.to_numeric, errors='coerce')

def on_column(column, mapping):
    """
    Adaptor that takes a column transformation and returns a "whole dataframe"
    function suitable for .pipe().

    Notice that columns take the name of the returned series, if applicable.
    Columns mapped to None are removed from the result.
    """
    def on_column_(df):
        df = df.copy(deep=False)
        res = mapping(df[column])
        # drop the column if it is mapped to None
        if res is None:
            df.pop(column)
            return df
        df[column] = res
        # update the column name if the mapper changes its name
        if hasattr(res, 'name') and res.name != column:
            df = df.rename(columns={column: res.name})
        return df
    return on_column_
This now allows the following neat chaining in the previous example:
new_result = (
    df.pipe(on_column('sepal_length', coerce_numeric))
    .head()  # Further chaining methods...
)
However, I'm still open to ways to do this in native pandas, without the glue code.
Edit 2 to further adapt Laurent's ideas, as an alternative. Self-contained example:
import pandas as pd

df = pd.DataFrame(
    {"col1": ["4", "1", "3", "2"], "col2": [9, 7, 6, 5], "col3": ["w", "z", "x", "y"]}
)

def map_columns(mapping=None, /, **kwargs):
    """
    Transform the specified columns and let the rest pass through.

    Examples:
        df.pipe(map_columns(a=lambda x: x + 1, b=str.upper))

        # use a dict for non-string column names
        df.pipe(map_columns({(0, 0): np.sqrt, (0, 1): np.log10}))
    """
    if mapping is not None and kwargs:
        raise ValueError("Only one of a dict and kwargs can be used at the same time")
    mapping = mapping or kwargs

    def map_columns_(df: pd.DataFrame) -> pd.DataFrame:
        mapping_funcs = {**{k: lambda x: x for k in df.columns}, **mapping}
        # preserve the original order of columns
        return df.transform({key: mapping_funcs[key] for key in df.columns})

    return map_columns_

df2 = (
    df
    .pipe(map_columns(col2=pd.to_numeric))
    .sort_values(by="col1")
    .pipe(map_columns(col1=lambda x: x.astype(str) + "0"))
    .pipe(map_columns({'col2': lambda x: -x, 'col3': str.upper}))
    .reset_index(drop=True)
)
df2
#   col1  col2 col3
# 0   10    -7    Z
# 1   20    -5    Y
# 2   30    -6    X
# 3   40    -9    W
Here is my take on your interesting question.
I don't know of a more idiomatic way in Pandas to do method chaining than combining pipe, assign, or transform. But I understand that "transform with passthrough for the other columns would be ideal".
So, I suggest using it together with a higher-order function to handle the other columns, taking a more functional approach with the help of the Python standard library's functools module.
For example, with the following toy dataframe:
df = pd.DataFrame(
    {"col1": ["4", "1", "3", "2"], "col2": [9, 7, 6, 5], "col3": ["w", "z", "x", "y"]}
)
You can define the following partial object:
from functools import partial
from typing import Any, Callable

import pandas as pd

def helper(df: pd.DataFrame, col: str, method: Callable[..., Any]) -> dict:
    # dict union (|) requires Python 3.9+
    funcs = {col: method} | {k: lambda x: x for k in df.columns if k != col}
    # preserve the original order of columns
    return {key: funcs[key] for key in df.columns}

on = partial(helper, df)
And then do all sorts of chain assignments using transform, for instance:
df = (
    df
    .transform(on("col1", pd.to_numeric))
    .sort_values(by="col1")
    .transform(on("col2", lambda x: x.astype(str) + "0"))
    .transform(on("col3", str.upper))
    .reset_index(drop=True)
)
print(df)
# Output
#   col1 col2 col3
# 0    1   70    Z
# 1    2   50    Y
# 2    3   60    X
# 3    4   90    W
If I understand the question correctly, perhaps using ** within assign will be helpful. For example, if you just wanted to convert the numeric data types using pd.to_numeric, the following should work:
import numpy as np
df.assign(**df.select_dtypes(include=np.number).apply(pd.to_numeric, errors='coerce'))
By unpacking the df, you are essentially giving assign what it needs to assign each column. This would be equivalent to writing sepal_length = pd.to_numeric(df['sepal_length'],errors='coerce'), sepal_width = ... for each column.
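A related sketch, in case you want to name the columns explicitly rather than select them by dtype (the column list here is only illustrative):
cols_to_convert = ['sepal_length', 'sepal_width']  # illustrative subset
new_result = df.assign(
    **{c: pd.to_numeric(df[c], errors='coerce') for c in cols_to_convert}
)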

I need to return a value from a dataframe cell as a variable, not a Series

I have the following issue:
When I use the .loc function, it returns a Series, not a single value without an index, and I need to do some math operations with the selected cells. The code I am using is:
import pandas as pd

data = [[82, 1], [30, 2], [3.7, 3]]
df = pd.DataFrame(data, columns=['Ah-Step', 'State'])

df['Ah-Step'].loc[df['State']==2] + df['Ah-Step'].loc[df['State']==3]
.values[0] will do what the OP wants.
Assuming one wants to obtain the value 30, the following will do the work:
print(df.loc[df['State'] == 2, 'Ah-Step'].values[0])
[Out]: 30.0
So, in the OP's specific case, the operation 30 + 3.7 could be done as follows:
df.loc[df['State'] == 2, 'Ah-Step'].values[0] + df['Ah-Step'].loc[df['State']==3].values[0]
[Out]: 33.7
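Two equivalent alternatives, assuming the filter matches exactly one row:
df.loc[df['State'] == 2, 'Ah-Step'].iloc[0]   # positional access into the filtered Series
df.loc[df['State'] == 2, 'Ah-Step'].item()    # raises if the Series does not hold exactly one element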

PySpark-numpy interoperability

I am working on a dataset where most of the columns are 1D sequences of doubles, like the following one:
from pyspark.sql import Row

source_data = [
    Row(ID="F0", P1=[-1.0, -2.0, -3.0], P2=[-4.0, -5.0, -6.0]),
    Row(ID="F1", P1=[1.0, 2.0, 3.0], P2=[4.0, 5.0, 6.0]),
]
df = spark.createDataFrame(source_data)
which looks like:
+---+------------------+------------------+
| ID|                P1|                P2|
+---+------------------+------------------+
| F0|[-1.0, -2.0, -3.0]|[-4.0, -5.0, -6.0]|
| F1|   [1.0, 2.0, 3.0]|   [4.0, 5.0, 6.0]|
+---+------------------+------------------+
The columns P1 and P2 have the Spark type ArrayType(DoubleType()).
In my real dataset, I have hundreds of columns of different lengths.
My goal is to execute mathematical operations such as mean, median, quantiles, fft, etc on these sequences in a distributed manner on a cluster.
My approach is to wrap numpy functions as follows. First, I "hide"
the numpy types which do not seem to be supported by Spark:
# wrap functions to "hide" numpy data types
def py_median(param_val):
    param_val = np.array(param_val)
    param_median = np.median(param_val)
    return float(param_median)

def py_abs(param_val):
    param_val = np.array(param_val)
    param_abs = np.abs(param_val)
    return param_abs.tolist()
Then, I turn the functions into PySpark UDFs:
# wrap functions to operate as udfs
median = F.udf(py_median, DoubleType())
abs = F.udf(py_abs, ArrayType(DoubleType()))
Finally, I can use them with pySpark to do the work:
df_processed = (
df.withColumn("P1_mean", median(col("P1")))
.withColumn("P2_abs", abs(col("P2")))
)
Note that I have to convert the result to simple Python types, such as:
- a Python float (for functions returning a scalar)
- a Python list of floats (for functions returning a 1D numpy array)
otherwise I get Spark errors.
I need different wrapping strategies to handle functions returning a complex array (e.g. the FFT) or multiple return values. Is there a cleaner way to get my job done? Perhaps using pandas UDFs?
Thanks in advance,
Marco
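For what it's worth, here is a sketch of the pandas UDF direction mentioned above (a Spark 3.0+ type-hinted pandas_udf; the function and column names are illustrative, not a tested solution). It still loops over rows in Python, but Spark feeds the data in Arrow batches:
import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.pandas_udf(DoubleType())
def pd_median(s: pd.Series) -> pd.Series:
    # each element of s is the Python list backing an ArrayType(DoubleType()) cell
    return s.apply(lambda v: float(np.median(v)))

df_processed = df.withColumn("P1_median", pd_median("P1"))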