How to use Pandas UDF in Class - pandas

I am trying to figure out how to use self in PandasUDF.GroupBy.Apply in a class method in Python, and also how to pass arguments into it. I have tried a lot of different ways but couldn't make it work. I also searched the internet extensively looking for an example of a Pandas UDF used inside a class with self and arguments, but could not find anything like that. I know how to do all of the aforementioned things with Pandas.GroupBy.Apply.
The only way I could make it work was by declaring it a static method:
class Train:
    return_type = StructType([
        StructField("div_nbr", FloatType()),
        StructField("store_nbr", FloatType()),
        StructField("model_str", BinaryType())
    ])
    function_type = PandasUDFType.GROUPED_MAP

    def __init__(self):
        ............

    def run_train(self):
        output = sp_df.groupby(['A', 'B']).apply(self.model_train)
        output.show(10)

    @staticmethod
    @pandas_udf(return_type, function_type)
    def model_train(pd_df):
        features_name = ['days_into_year', 'months_into_year', 'minutes_into_day', 'hour_of_day', 'recency']
        X = pd_df[features_name].copy()
        Y = pd.DataFrame(pd_df['trans_type_value']).copy()
        estimator_1 = XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=300, verbosity=1,
                                   objective='reg:squarederror', booster='gbtree', n_jobs=-1, gamma=0,
                                   min_child_weight=5, max_delta_step=0, subsample=0.6, colsample_bytree=0.8,
                                   colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1,
                                   scale_pos_weight=1, base_score=0.5, random_state=1234, missing=None,
                                   importance_type='gain')
        estimator_1.fit(X, Y)
        df_to_return = pd_df[['div_nbr', 'store_nbr']].drop_duplicates().copy()
        df_to_return['model_str'] = pickle.dumps(estimator_1)
        return df_to_return
What I would like to achieve in reality is to declare return_type, function_type and features_name in __init__(), then use them in the Pandas UDF, and also pass parameters to be used inside the function when doing PandasUDF.GroupBy.Apply.
If anyone could help me out, I would highly appreciate that. I am a bit of a newbie to PySpark.

Background
Pandas UDF Lifecycle:
1. A Spark datatype (Column, DataFrame) is serialized into Arrow's Table format via PyArrow.
2. That data is sent to Python virtual machines (VMs), which are created just in time within each executor.
3. Upon reaching the Python VM, the data is deserialized into a pandas Column/DataFrame and your pandas_udf code is run on that Column/DataFrame.
4. The pandas output is serialized back to Arrow's Table format.
5. The Python VM sends the data back to the calling process.
6. Upon reaching the Spark environment, the Arrow Table is decoded back into a Spark datatype.
The Problem:
When working with extra data, such as a class's self, the pandas udf still needs to serialize and send that data. Serializing complex Python objects like classes is not in PyArrow's capabilities, so you must either 1) create a wrapper function and reference only specific serializable Python types within the pandas_udf, or 2) use a @staticmethod to negate the need for self.
The Solutions
1 - Pandas UDF with a Parameter in a Class: wrap the method with a function and create a local variable within that wrapper - src. Note that all variables that are referenced within the pandas_udf must be supported by PyArrow. Most python types are supported, but classes are not.
class my_class:
    def __init__(self, s):
        self.s = s

    def wrapper_add_s(self, column):
        local_s = self.s  # create a local copy of s to be referenced by the udf

        @pandas_udf("string")
        def add_s(column: pd.Series) -> pd.Series:
            return column + f'_{local_s}'

        return add_s(column)

    def add_col(self, df):
        return df.withColumn("Name", self.wrapper_add_s("Name"))

c = my_class(s='hi')
c.add_col(df)
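The same wrapper pattern also covers the grouped case from the question: copy whatever you need from self into local variables before defining the function, then hand that function to the groupby. A minimal sketch, assuming Spark 3's applyInPandas (which supersedes the GROUPED_MAP pandas_udf shown in the question); the class and column names mirror the question, but the training body is purely illustrative:

from pyspark.sql.types import StructType, StructField, FloatType
import pandas as pd

class Train:
    def __init__(self, features_name):
        self.features_name = features_name      # declared in __init__, as the question asks
        self.return_type = StructType([
            StructField("div_nbr", FloatType()),
            StructField("store_nbr", FloatType()),
        ])

    def run_train(self, sp_df):
        # local copy: only this plain list is captured by the UDF's closure, not self
        features_name = self.features_name

        def model_train(pd_df: pd.DataFrame) -> pd.DataFrame:
            # ... fit a model on pd_df[features_name] here ...
            return pd_df[['div_nbr', 'store_nbr']].drop_duplicates()

        # self.return_type is only used on the driver, so referencing self here is fine
        return sp_df.groupby('div_nbr', 'store_nbr').applyInPandas(model_train, schema=self.return_type)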
2 - Pandas UDF without a Parameter in a Class: use the @staticmethod decorator
class my_class:
    def __init__(self):
        pass

    @staticmethod
    @pandas_udf("string")
    def add_s(column: pd.Series) -> pd.Series:
        return column + ' static string'

    def add_col(self, df):
        return df.withColumn("Name", self.add_s("Name"))

c = my_class()
c.add_col(df)
Not in a Class
If you're looking for a simple structure to pass a parameter to a pandas_udf outside of a class, use this... - src
def wrapper_add_s(column, s):
    @pandas_udf("string")
    def add_s(column: pd.Series) -> pd.Series:
        return column + f'_{s}'

    return add_s(column)
df = df.withColumn("Name", wrapper_add_s("Name", s='hi'))

Related

Understand pandas' applymap argument

I'm trying to highlight specific columns in my dataframe using the guideline from this post, https://stackoverflow.com/a/41655055/5158984.
My question is on the use of the subset argument. My guess is that it's part of the **kwargs argument. However, the official documentation from Pandas, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html, only vaguely explains it.
So in general, how can I know which keywords I can use whenever I see **kwargs?
Thanks!
It seems that you are confusing pandas.DataFrame.applymap and df.style.applymap (where df is an instance of pd.DataFrame), for which subset stands on its own and is not part of the kwargs arguments.
Here is one way to find out (in your terminal or a Jupyter notebook cell) what are the named parameters of this method (or any other Pandas method for that matter):
import pandas as pd

df = pd.DataFrame()
help(df.style.applymap)

# Output
Help on method applymap in module pandas.io.formats.style:

applymap(func: 'Callable', subset: 'Subset | None' = None, **kwargs) -> 'Styler' method of pandas.io.formats.style.Styler instance
Apply a CSS-styling function elementwise.
Updates the HTML representation with the result.
Parameters
----------
func : function
``func`` should take a scalar and return a string.
subset : label, array-like, IndexSlice, optional
A valid 2d input to `DataFrame.loc[<subset>]`, or, in the case of a 1d input
or single key, to `DataFrame.loc[:, <subset>]` where the columns are
prioritised, to limit ``data`` to *before* applying the function.
**kwargs : dict
Pass along to ``func``.
...
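For instance, here is a small sketch of passing both subset and an extra keyword that ends up in **kwargs (the data, column choice and colors are made up, and it assumes a pandas version where Styler.applymap is still the current name):

import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3], 'B': [-4, 5, -6]})

def color_negative(value, color='red'):
    # the 'color' keyword arrives via **kwargs of Styler.applymap
    return f'color: {color}' if value < 0 else ''

# subset limits the styling to column 'A'; color='blue' is passed through to the function
styled = df.style.applymap(color_negative, subset=['A'], color='blue')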

Pandas: what are "string function names" called technically?

When using pandas you can in certain cases pass names of functions as strings instead of actual references to those functions. For example: df.transform('round').
In the pandas docs they call these "string function names" but is there another (perhaps more technical) name for these kinds of strings?
Well, pandas doesn't really require this; it's just that in some cases, e.g. with functions like mean that aren't available as plain names in scope, you have to put the name in quotes, otherwise an error would be raised.
With cases like round the quotes aren't actually needed, since round is already a builtin function. The "string function names" are really just a way of referring to those functions so that they don't get mixed up with other objects.
As mentioned in the documentation link you provided, they call it:
string function name
There is really no special term for it, IMO.
By passing an invalid string to the aggregate method (ex. df.agg('max2')) and following the Traceback I got to the following code (pandas version 1.1.4):
class SelectionMixin:
    """
    mixin implementing the selection & aggregation interface on a group-like
    object sub-classes need to define: obj, exclusions
    """
    # < some lines deleted >

    def _try_aggregate_string_function(self, arg: str, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function (or attribute) on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, str)

        f = getattr(self, arg, None)
        if f is not None:
            if callable(f):
                return f(*args, **kwargs)

            # people may try to aggregate on a non-callable attribute
            # but don't let them think they can pass args to it
            assert len(args) == 0
            assert len([kwarg for kwarg in kwargs if kwarg not in ["axis"]]) == 0
            return f

        f = getattr(np, arg, None)
        if f is not None:
            if hasattr(self, "__array__"):
                # in particular exclude Window
                return f(self, *args, **kwargs)

        raise AttributeError(
            f"'{arg}' is not a valid function for '{type(self).__name__}' object"
        )
It seems that we fall into this code whenever we pass a string function name to aggregate. If we were to look into the familiar pandas objects (Series, DataFrame, GroupBy) we would find that they inherit from SelectionMixin.
The string function names are looked up either in the pandas object itself (getattr(self, arg, None)) or in Numpy (getattr(np, arg, None)). So the string function names simply represent attributes of some object, either methods of a pandas object or functions defined in Numpy.
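As a quick illustration of that lookup (a minimal sketch; the data is made up):

import numpy as np
import pandas as pd

s = pd.Series([1.21, 4.84, 9.61])

# 'round' is found as a Series method, so this is equivalent to s.round()
print(s.transform('round'))

# 'sqrt' is not a Series method, so pandas falls back to getattr(np, 'sqrt')
print(s.transform('sqrt'))

# which is roughly what _try_aggregate_string_function does by hand:
print(getattr(np, 'sqrt')(s))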

How to initialize an indexed Parameter with data from a dictionary

I'm working with two blocks, one to use pandas to import data from a .csv file and another one to use this information to construct variable values.
The first block is working fine, and I'm able to construct a table with indexed values (in this case, energy tariff prices):
import pandas as pd

df_1 = pd.read_csv('time_tff_data.csv', sep=';', usecols=['time', 'tff'], index_col='time', header=0)

data_Tariff = {
    'tff': {'time': df_1['tff'].to_dict()}  # Set of tariff prices
}
data = {None: dict(tariff=data_Tariff)}
The problem is that in the other block, the one where I need to use the data, I'm not able to initialize a parameter with the data within the dictionary. Although I'm using Pyomo (for optimization), my question isn't about Pyomo itself, but about how to initialize a Parameter with the data stored in a dictionary (self.tff):
from pyomo.environ import *
from data import data_Tariff
from pyomo.environ import SimpleBlock

class tariff(SimpleBlock):
    def __init__(self, *args, **kwds):
        super().__init__(*args, **kwds)

        self.time = Set()
        self.Tmax = Param(self.time, doc='Maximum tariff price', default=1.06, multable=True)
        self.Tmin = Param(self.time, doc='Minimum tariff price', default=0.39, multable=True)
        self.tff = Param(self.time, doc='Set of tariff prices', default=data_Tariff['tff'], mutable=True)
        self.Tc = Var(self.time, doc='Tariff priority index', initialize=0)

        def _Tc(m, t):
            if tff is not None:
                return (m.Tc[t] == (m.Tmax-m.tff[t])/(m.Tmax-m.Tmin) for t in m.time)
            return Constraint.Skip
        self.Tc = Constraint(self.time, rule=_Tc, doc='Tariff priority index')
My question is: how do I import the tariff data "tff[t]" from the data block, since the set is indexed by time [t]?
Couple quick observations...
First, you should be using the keyword initialize not default to initialize from a collection. Also, I can't see why you would make this mutable, so you might remove that. Try:
self.tff = Param(self.time, doc='Set of tariff prices', initialize=data_Tariff['tff'])
This assumes that data_Tariff['tff'] returns a properly constructed dictionary that is indexed by self.time.
Backing up, I see that you also need to initialize the self.time:
self.time = Set(initialize=data_Tariff['tff'].keys())
Your constraint... Your constraint appears incorrect. The for t in m.time part is taken care of when you call the rule with a set: it will make a constraint for each value of t. And the check for tff is probably unnecessary, right? If it is necessary, you need to reference it as self.tff. So:
def _Tc(m, t):
    return m.Tc[t] == (m.Tmax-m.tff[t])/(m.Tmax-m.Tmin)

self.Tc = Constraint(self.time, rule=_Tc, doc='Tariff priority index')
Also, your Tmax and Tmin appear to be just constants (not indexed). If that is the case, you can simplify a little bit and just treat them as constants that are regular python variables and take them out of the model declaration, if desired.
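Putting those pieces together, a minimal sketch of initializing an indexed Param from a dictionary (using a plain ConcreteModel and a made-up tff dictionary rather than the question's SimpleBlock) might look like this:

from pyomo.environ import ConcreteModel, Set, Param

# stand-in for the dictionary built from the CSV, keyed by time
tff_data = {0: 0.45, 1: 0.52, 2: 0.61}

m = ConcreteModel()
m.time = Set(initialize=tff_data.keys())
m.tff = Param(m.time, initialize=tff_data, doc='Set of tariff prices')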

Multiprocessing with class functions and class attributes

I have a pandas Dataframe, that has millions of rows and I have to do row-wise operations. Since I have a Multicore CPU, I would like to speed up that process using Multiprocessing. The way I would like to do this is to just split up the dataframe in equally sized dataframes and process each of them within a separate process. So far so good...
The problem is that my code is written in OOP style and I get pickle errors when using a multiprocessing Pool. What I do is pass a reference to a class method self.X to the pool. I further use class attributes within X (read access only). I really don't want to switch back to a functional programming style... Hence, is it possible to do multiprocessing in an OOP environment?
It should be possible as long as all elements in your class (that you pass to the sub-processes) are picklable. That is the only thing you have to make sure of. If there are any elements in your class that are not, then you cannot pass it to a Pool. Even if you only pass self.x, everything else like self.y has to be picklable.
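For instance, a bound method can be handed to a Pool as long as the instance it drags along pickles cleanly. A minimal sketch with made-up names:

import multiprocessing as mp
import pandas as pd

class RowProcessor:
    def __init__(self, factor):
        self.factor = factor                       # plain float: picklable

    def process(self, chunk):
        # read-only access to a picklable attribute is fine inside a worker
        return chunk['A'] * self.factor

if __name__ == '__main__':
    df = pd.DataFrame({'A': range(10)})
    chunks = [df.iloc[i:i + 5] for i in range(0, len(df), 5)]
    proc = RowProcessor(factor=2.0)
    with mp.Pool(2) as pool:
        # pickling proc.process pickles the whole instance, so every attribute must pickle
        results = pool.map(proc.process, chunks)
    print(pd.concat(results))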
I do my pandas DataFrame processing like this:
import pandas as pd
import multiprocessing as mp
import numpy as np
import time

def worker(in_queue, out_queue):
    for row in iter(in_queue.get, 'STOP'):
        value = (row[1] * row[2] / row[3]) + row[4]
        time.sleep(0.1)
        out_queue.put((row[0], value))

if __name__ == "__main__":
    # fill a DataFrame
    df = pd.DataFrame(np.random.randn(int(1e5), 4), columns=list('ABCD'))

    in_queue = mp.Queue()
    out_queue = mp.Queue()

    # setup workers
    numProc = 2
    process = [mp.Process(target=worker,
                          args=(in_queue, out_queue)) for x in range(numProc)]

    # run processes
    for p in process:
        p.start()

    # iterator over rows
    it = df.itertuples()

    # fill queue and get data
    # code fills the queue until a new element is available in the output
    # fill blocks if no slot is available in the in_queue
    for i in range(len(df)):
        while out_queue.empty():
            # fill the queue
            try:
                row = next(it)
                in_queue.put((row[0], row[1], row[2], row[3], row[4]), block=True)  # row = (index, A, B, C, D) tuple
            except StopIteration:
                break

        row_data = out_queue.get()
        df.loc[row_data[0], "Result"] = row_data[1]

    # signals for processes to stop
    for p in process:
        in_queue.put('STOP')

    # wait for processes to finish
    for p in process:
        p.join()
This way I do not have to pass big chunks of DataFrames and I do not have to think about picklable elements in my class.

subclass ndarray in python numpy: change size and value of array

Someone asks the question here: Subclassing numpy ndarray problem but it is basically unanswered.
Here is my version of the question. Suppose you subclass numpy.ndarray to something that automatically expands when you try to set an element beyond the current shape. You would need to override __setitem__ and use some numpy.concatenate calls to construct a new array, then assign it to "self" somehow. How do you assign the array to "self"?
class myArray(numpy.ndarray):
    def __new__(cls, input_array):
        obj = numpy.asarray(input_array).view(cls)
        return(obj)

    def __array_finalize__(self, obj):
        if obj is None: return

    def __setitem__(self, coords, value):
        try:
            super(myArray, self).__setitem__(coords, value)
        except IndexError as e:
            logging.error("Adjusting array")
            ...
            self = new_array # THIS IS WRONG
Why subclass? Why not just give your wrapper object its own data member that is an ndarray and use __getitem__ and __setitem__ to operate on the wrapped data member? This is basically what ndarray already does, wrapping Python's built-in containers. Also have a look at Python Pandas, which already does a lot of what you're talking about, wrapped on top of ndarray.
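As a rough sketch of that wrapper idea (1-D only; the class and variable names are made up):

import numpy as np

class GrowableArray:
    """Wraps an ndarray and grows it when an assignment falls out of bounds."""
    def __init__(self, data):
        self._arr = np.asarray(data)

    def __getitem__(self, idx):
        return self._arr[idx]

    def __setitem__(self, idx, value):
        if idx >= self._arr.shape[0]:
            # pad with zeros up to the requested index, then assign as usual
            pad = np.zeros(idx + 1 - self._arr.shape[0], dtype=self._arr.dtype)
            self._arr = np.concatenate([self._arr, pad])
        self._arr[idx] = value

a = GrowableArray([1, 2, 3])
a[5] = 7
print(a[:])    # -> [1 2 3 0 0 7]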