When using pandas you can in certain cases pass names of functions as strings instead of actual references to those functions. For example: df.transform('round').
In the pandas docs they call these "string function names" but is there another (perhaps more technical) name for these kinds of strings?
Well, pandas doesn't really insist on this; it's just that in some cases, e.g. when using a function like mean, you have to put the quotes, because there is no built-in mean to reference directly and a bare name would raise a NameError.
With cases like round, the quotes aren't actually needed, since round is already a built-in function. The "string function names" are really just a way of referring to these functions by name so that they don't get mixed up with other objects.
As mentioned in the documentation link you provided, they call it:
string function name
There is really no more specialized term for it, IMO.
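A quick illustration of the two forms (a minimal sketch with a throwaway DataFrame):

import pandas as pd

df = pd.DataFrame({"a": [1.25, 2.75]})

df.transform(round)     # round is a Python builtin, so a bare reference works
df.transform("round")   # the string form resolves to the same operation

df.agg("mean")          # there is no builtin mean, so pass the string (or np.mean)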
By passing an invalid string to the aggregate method (ex. df.agg('max2')) and following the Traceback I got to the following code (pandas version 1.1.4):
class SelectionMixin:
    """
    mixin implementing the selection & aggregation interface on a group-like
    object sub-classes need to define: obj, exclusions
    """
    # < some lines deleted >

    def _try_aggregate_string_function(self, arg: str, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function (or attribute) on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, str)

        f = getattr(self, arg, None)
        if f is not None:
            if callable(f):
                return f(*args, **kwargs)

            # people may try to aggregate on a non-callable attribute
            # but don't let them think they can pass args to it
            assert len(args) == 0
            assert len([kwarg for kwarg in kwargs if kwarg not in ["axis"]]) == 0
            return f

        f = getattr(np, arg, None)
        if f is not None:
            if hasattr(self, "__array__"):
                # in particular exclude Window
                return f(self, *args, **kwargs)

        raise AttributeError(
            f"'{arg}' is not a valid function for '{type(self).__name__}' object"
        )
It seems that we fall into this code whenever we pass a string function name to aggregate. If we were to look into the familiar pandas objects (Series, DataFrame, GroupBy) we would find that they inherit from SelectionMixin.
The string function names are looked up either in the pandas object itself (getattr(self, arg, None)) or in Numpy (getattr(np, arg, None)). So the string function names simply represent attributes of some object, either methods of a pandas object or functions defined in Numpy.
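A minimal sketch of that lookup done by hand (the small DataFrame and the choice of 'mean' and 'ptp' are just illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

# First try: is the name an attribute of the pandas object itself?
f = getattr(df, "mean", None)   # found -> bound method DataFrame.mean
print(f())                      # effectively what df.agg("mean") does

# Fallback: look the name up in NumPy instead.
g = getattr(np, "ptp", None)    # np.ptp (peak-to-peak range)
print(g(df["a"]))               # 2.0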
I'm trying to highlight specific columns in my dataframe, following the guidance in this post: https://stackoverflow.com/a/41655055/5158984.
My question is about the use of the subset argument. My guess is that it's part of the **kwargs argument. However, the official pandas documentation, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html, only vaguely explains it.
So in general, how can I know which keywords I can use whenever I see **kwargs?
Thanks!
It seems that you are confusing pandas.DataFrame.applymap and df.style.applymap (where df is an instance of pd.DataFrame); in the latter, subset is a named parameter in its own right and is not part of **kwargs.
Here is one way to find out (in your terminal or a Jupyter notebook cell) what the named parameters of this method are (or of any other pandas method, for that matter):
import pandas as pd
df = pd.DataFrame()
help(df.style.applymap)
# Output
Help on method applymap in module pandas.io.formats.style:

applymap(func: 'Callable', subset: 'Subset | None' = None, **kwargs) -> 'Styler' method of pandas.io.formats.style.Styler instance
    Apply a CSS-styling function elementwise.

    Updates the HTML representation with the result.

    Parameters
    ----------
    func : function
        ``func`` should take a scalar and return a string.
    subset : label, array-like, IndexSlice, optional
        A valid 2d input to `DataFrame.loc[<subset>]`, or, in the case of a 1d input
        or single key, to `DataFrame.loc[:, <subset>]` where the columns are
        prioritised, to limit ``data`` to *before* applying the function.
    **kwargs : dict
        Pass along to ``func``.
    ...
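For completeness, a small example of subset in action (the colour_negative function and the column names are made up for illustration; any extra keyword arguments would be forwarded to it through **kwargs):

import pandas as pd

df = pd.DataFrame({"A": [1, -2], "B": [-3, 4]})

def colour_negative(v):
    # elementwise styling function: one scalar in, one CSS string out
    return "color: red" if v < 0 else ""

# subset restricts the styling to column "A"
styled = df.style.applymap(colour_negative, subset=["A"])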
I'm working with two blocks: one uses pandas to import data from a .csv file and the other uses this information to construct variable values.
The first block is working fine; I'm able to construct a table with indexed values (in this case, energy tariff prices):
import pandas as pd
df_1 = pd.read_csv('time_tff_data.csv', sep=';', usecols=['time', 'tff'], index_col='time', header=0)
data_Tariff = {
    'tff': {'time': df_1['tff'].to_dict()}  # Set of tariff prices
}
data = {None: dict(tariff=data_Tariff)}
The problem is that in the other block, the one where I need to use the data, I'm not able to initialize a parameter with the data inside the dictionary. Although I'm using Pyomo (for optimization), my question isn't about Pyomo itself, but about how to initialize a Param with the data stored in a dictionary (self.tff):
from pyomo.environ import *
from data import data_Tariff
from pyomo.environ import SimpleBlock

class tariff(SimpleBlock):
    def __init__(self, *args, **kwds):
        super().__init__(*args, **kwds)

        self.time = Set()
        self.Tmax = Param(self.time, doc='Maximum tariff price', default=1.06, multable=True)
        self.Tmin = Param(self.time, doc='Minimum tariff price', default=0.39, multable=True)
        self.tff = Param(self.time, doc='Set of tariff prices', default=data_Tariff['tff'], mutable=True)
        self.Tc = Var(self.time, doc='Tariff priority index', initialize=0)

        def _Tc(m, t):
            if tff is not None:
                return (m.Tc[t] == (m.Tmax-m.tff[t])/(m.Tmax-m.Tmin) for t in m.time)
            return Constraint.Skip
        self.Tc = Constraint(self.time, rule=_Tc, doc='Tariff priority index')
My question is: how do I import the tariff data tff[t] from the data block, since the set is indexed by time [t]?
A couple of quick observations...
First, you should be using the keyword initialize, not default, to initialize from a collection. Also, I can't see why you would make this mutable, so you might remove that. Try:
self.tff = Param(self.time, doc='Set of tariff prices', initialize=data_Tariff['tff'])
This assumes that data_Tariff['tff'] returns a properly constructed dictionary that is indexed by self.time.
Backing up, I see that you also need to initialize self.time:
self.time = Set(initialize=data_Tariff['tff'].keys())
Your constraint also appears incorrect. The for t in m.time part is taken care of when you construct the Constraint over a set: the rule is called once per value of t, making one constraint for each. And the check for tff is probably unnecessary, right? If it is necessary, you need to reference it as self.tff. So:
def _Tc(m, t):
    return m.Tc[t] == (m.Tmax - m.tff[t]) / (m.Tmax - m.Tmin)

self.Tc = Constraint(self.time, rule=_Tc, doc='Tariff priority index')
Also, your Tmax and Tmin appear to be just constants (not indexed). If that is the case, you can simplify a little bit and just treat them as constants that are regular python variables and take them out of the model declaration, if desired.
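Putting those pieces together, here is a minimal sketch on a plain ConcreteModel. It assumes data_Tariff['tff'] is a flat dict keyed by time, e.g. {1: 0.52, ...}, which differs slightly from the nested dict in the question, and it gives the constraint its own name to avoid clashing with the Tc variable:

from pyomo.environ import ConcreteModel, Set, Param, Var, Constraint

data_Tariff = {'tff': {1: 0.52, 2: 0.71, 3: 0.44}}   # stand-in data for the sketch

Tmax, Tmin = 1.06, 0.39   # plain Python constants, per the note above

m = ConcreteModel()
m.time = Set(initialize=list(data_Tariff['tff'].keys()))
m.tff = Param(m.time, initialize=data_Tariff['tff'], doc='Set of tariff prices')
m.Tc = Var(m.time, doc='Tariff priority index', initialize=0)

def _Tc(m, t):
    return m.Tc[t] == (Tmax - m.tff[t]) / (Tmax - Tmin)

m.Tc_rule = Constraint(m.time, rule=_Tc, doc='Tariff priority index')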
I am trying to figure out how to use self in a pandas UDF with GroupBy.apply in a class method in Python, and also how to pass arguments to it. I have tried a lot of different ways but couldn't make it work. I also searched the internet extensively looking for an example of a pandas UDF used inside a class with self and arguments, but could not find anything like that. I know how to do all of the aforementioned things with Pandas.GroupBy.Apply.
The only way I could make it work was by declaring it a static method:
class Train:
    return_type = StructType([
        StructField("div_nbr", FloatType()),
        StructField("store_nbr", FloatType()),
        StructField("model_str", BinaryType())
    ])
    function_type = PandasUDFType.GROUPED_MAP

    def __init__(self):
        ............

    def run_train(self):
        output = sp_df.groupby(['A', 'B']).apply(self.model_train)
        output.show(10)

    @staticmethod
    @pandas_udf(return_type, function_type)
    def model_train(pd_df):
        features_name = ['days_into_year', 'months_into_year', 'minutes_into_day', 'hour_of_day', 'recency']
        X = pd_df[features_name].copy()
        Y = pd.DataFrame(pd_df['trans_type_value']).copy()
        estimator_1 = XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=300, verbosity=1,
                                   objective='reg:squarederror', booster='gbtree', n_jobs=-1, gamma=0,
                                   min_child_weight=5, max_delta_step=0, subsample=0.6, colsample_bytree=0.8,
                                   colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1,
                                   scale_pos_weight=1, base_score=0.5, random_state=1234, missing=None,
                                   importance_type='gain')
        estimator_1.fit(X, Y)
        df_to_return = pd_df[['div_nbr', 'store_nbr']].drop_duplicates().copy()
        df_to_return['model_str'] = pickle.dumps(estimator_1)
        return df_to_return
What I would like to achieve, in reality, is to declare return_type, function_type, and features_name in __init__(), then use them in the pandas UDF, and also to pass parameters to be used inside the function when doing the GroupBy.apply.
If anyone could help me out, I would highly appreciate it. I am a bit of a newbie to PySpark.
Background
Pandas UDF Lifecycle:
A spark datatype (Column, DataFrame) is serialized into Arrow's Table format via PyArrow.
That data is sent to python virtual environments (VM), which are created JIT, within each executor.
Before reaching the python VM, the data is deserialized to a pandas Column/DataFrame and your pandas_udf code is run on that Column/DataFrame.
The Pandas output is serialized back to Arrow's Table format.
The python VM sends data back to the calling process.
Before reaching the Spark environment, the Arrow Table is decoded back to a spark datatype.
The Problem:
When working with extra data, such as a class's self, the pandas udf still needs to serialize and send that data. Serializing complex python objects like classes is not in PyArrow's capabilities, so you must either 1) create a wrapper function and reference only specific serializable python types within the pandas_udf, or 2) use a @staticmethod to negate the need for self.
The Solutions
1 - Pandas UDF with a Parameter in a Class: wrap the method with a function and create a local variable within that wrapper - src. Note that all variables that are referenced within the pandas_udf must be supported by PyArrow. Most python types are supported, but classes are not.
class my_class:
    def __init__(self, s):
        self.s = s

    def wrapper_add_s(self, column):
        local_s = self.s  # create a local copy of s to be referenced by the udf

        @pandas_udf("string")
        def add_s(column: pd.Series) -> pd.Series:
            return column + f'_{local_s}'

        return add_s(column)

    def add_col(self, df):
        return df.withColumn("Name", self.wrapper_add_s("Name"))

c = my_class(s='hi')
c.add_col(df)
2 - Pandas UDF without a Parameter in a Class: use the @staticmethod decorator
class my_class:
    def __init__(self):
        pass

    @staticmethod
    @pandas_udf("string")
    def add_s(column: pd.Series) -> pd.Series:
        return column + ' static string'

    def add_col(self, df):
        return df.withColumn("Name", self.add_s("Name"))

c = my_class()
c.add_col(df)
Not in a Class
If you're looking for a simple structure to pass a parameter to a pandas_udf outside of a class, use this... - src
def wrapper_add_s(column, s):
    @pandas_udf("string")
    def add_s(column: pd.Series) -> pd.Series:
        return column + f'_{s}'

    return add_s(column)

df = df.withColumn("Name", wrapper_add_s("Name", s='hi'))
I want to reduce memory usage when loading by filtering on some gid values:
reg_df = pd.read_parquet('/data/2010r.pq',
columns=['timestamp', 'gid', 'uid', 'flag'])
But the kwargs aren't shown in the docs.
For example:
gid=[100,101,102,103,104,105]
gid_i_want_load = [100,103,105]
So, how can I load only the gid values I want to calculate with?
The introduction of **kwargs to the pandas library is documented here. It looks like the original intent was to actually pass columns into the request to limit IO volume. The contributors took the next step and added a general pass-through for **kwargs.
For pandas/io/parquet.py the following is for read_parquet:
def read_parquet(path, engine='auto', columns=None, **kwargs):
    """
    Load a parquet object from the file path, returning a DataFrame.

    .. versionadded 0.21.0

    Parameters
    ----------
    path : string
        File path
    columns: list, default=None
        If not None, only these columns will be read from the file.

        .. versionadded 0.21.1
    engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
        Parquet library to use. If 'auto', then the option
        ``io.parquet.engine`` is used. The default ``io.parquet.engine``
        behavior is to try 'pyarrow', falling back to 'fastparquet' if
        'pyarrow' is unavailable.
    kwargs are passed to the engine

    Returns
    -------
    DataFrame
    """
    impl = get_engine(engine)
    return impl.read(path, columns=columns, **kwargs)
For pandas/io/parquet.py the following is for read on the pyarrow engine:
def read(self, path, columns=None, **kwargs):
    path, _, _, should_close = get_filepath_or_buffer(path)

    if self._pyarrow_lt_070:
        result = self.api.parquet.read_pandas(path, columns=columns,
                                              **kwargs).to_pandas()
    else:
        kwargs['use_pandas_metadata'] = True  # <-- only param for kwargs...
        result = self.api.parquet.read_table(path, columns=columns,
                                              **kwargs).to_pandas()
    if should_close:
        try:
            path.close()
        except:  # noqa: flake8
            pass

    return result
For pyarrow/parquet.py the following is for read_pandas:
def read_pandas(self, **kwargs):
    """
    Read dataset including pandas metadata, if any. Other arguments passed
    through to ParquetDataset.read, see docstring for further details

    Returns
    -------
    pyarrow.Table
        Content of the file as a table (of columns)
    """
    return self.read(use_pandas_metadata=True, **kwargs)  # <-- params being passed
For pyarrow/parquet.py the following is for read:
def read(self, columns=None, nthreads=1, use_pandas_metadata=False):  # <-- kwargs params at pyarrow
    """
    Read a Table from Parquet format

    Parameters
    ----------
    columns: list
        If not None, only these columns will be read from the file. A
        column name may be a prefix of a nested field, e.g. 'a' will select
        'a.b', 'a.c', and 'a.d.e'
    nthreads : int, default 1
        Number of columns to read in parallel. If > 1, requires that the
        underlying file source is threadsafe
    use_pandas_metadata : boolean, default False
        If True and file has custom pandas schema metadata, ensure that
        index columns are also loaded

    Returns
    -------
    pyarrow.table.Table
        Content of the file as a table (of columns)
    """
    column_indices = self._get_column_indices(
        columns, use_pandas_metadata=use_pandas_metadata)
    return self.reader.read_all(column_indices=column_indices,
                                nthreads=nthreads)
So, if I understand correctly, maybe you can access nthreads and use_pandas_metadata - but then again, neither is explicitly assigned (??). I haven't tested it, but it may be a start.
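As a practical aside for the original goal of loading only certain gid values, one option is to filter right after the read; another, which assumes a newer pyarrow engine than the one quoted above (one that accepts a filters keyword forwarded through **kwargs), pushes the row filter into the read itself:

import pandas as pd

gid_i_want_load = [100, 103, 105]

# Option 1: read only the needed columns, then filter in pandas (engine-agnostic).
reg_df = pd.read_parquet('/data/2010r.pq',
                         columns=['timestamp', 'gid', 'uid', 'flag'])
reg_df = reg_df[reg_df['gid'].isin(gid_i_want_load)]

# Option 2 (assumes a recent pyarrow engine that accepts a `filters` kwarg):
reg_df = pd.read_parquet('/data/2010r.pq',
                         engine='pyarrow',
                         columns=['timestamp', 'gid', 'uid', 'flag'],
                         filters=[('gid', 'in', gid_i_want_load)])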
Someone asks the question here: Subclassing numpy ndarray problem but it is basically unanswered.
Here is my version of the question. Suppose you subclass numpy.ndarray to something that automatically expands when you try to set an element beyond the current shape. You would need to override __setitem__ and use some numpy.concatenate calls to construct a new array, and then assign it to "self" somehow. How do you assign the array to "self"?
class myArray(numpy.ndarray):
    def __new__(cls, input_array):
        obj = numpy.asarray(input_array).view(cls)
        return(obj)

    def __array_finalize__(self, obj):
        if obj is None: return

    def __setitem__(self, coords, value):
        try:
            super(myArray, self).__setitem__(coords, value)
        except IndexError as e:
            logging.error("Adjusting array")
            ...
            self = new_array # THIS IS WRONG
Why subclass? Why not just give your wrapper object its own data member that is an ndarray and use __getitem__ and __setitem__ to operate on the wrapped data member? This is basically what ndarray already does, wrapping Python's built-in containers. Also have a look at pandas, which already does a lot of what you're talking about, built on top of ndarray.
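A rough sketch of that wrapper idea (not a drop-in ndarray replacement; it only handles plain integer-tuple indices and pads with zeros):

import numpy as np

class GrowingArray:
    """Hold an ndarray as a data member and grow it on out-of-bounds writes."""

    def __init__(self, initial):
        self._data = np.asarray(initial)

    def __getitem__(self, index):
        return self._data[index]

    def __setitem__(self, index, value):
        try:
            self._data[index] = value
        except IndexError:
            # Grow each axis just enough to hold the requested index.
            new_shape = tuple(max(s, i + 1)
                              for s, i in zip(self._data.shape, index))
            grown = np.zeros(new_shape, dtype=self._data.dtype)
            grown[tuple(slice(0, s) for s in self._data.shape)] = self._data
            self._data = grown
            self._data[index] = value

a = GrowingArray(np.zeros((2, 2)))
a[4, 4] = 1.0    # the wrapped array silently expands to shape (5, 5)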