alternative way to define a function inside a class method [closed] - pandas

I have the following class:
class Analysis():
def __init__(self, file_dir):
self.path = file_dir #file path directory
def getData(self):
return pd.read_csv(self.path) # create a pandas dataframe
def getStd(self):
return self.getData().loc['1':'5'].apply(lambda x: x.std()) # calculate the standard deviation of all columns
def getHighlight(self):
#a function to highlight df based on the given condition
def highlight(x):
c1 = 'background-color:red'
c2 = 'background-color:yellow'
c3 = 'background-color:green'
#rows over which the highlighting function should apply
r = ['1', '2', '3', '4', '5']
#first boolean mask for selecting the df elements
m1 = (x.loc[r]>x.loc['USL']) | (x.loc[r]<x.loc['LSL'])
#second boolean mask for selecting the df elements
m2 = (x.loc[r]==x.loc['USL']) | (x.loc[r]==x.loc['LSL'])
#DataFrame with same index and columns names as original filled empty strings
df1 = pd.DataFrame('', index=x.index, columns=x.columns)
#modify values of df1 columns by boolean mask
df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
return df1
#apply the highlight function on the df to get highlighted
return self.getData().style.apply(highlight, axis=None)
The getData method returns the df like this:
my_analysis = Analysis(path_to_file)
my_analysis.getData()
A-A A-B A-C A-D A-E
Tg 0.37 10.24 5.02 0.63 20.30
USL 0.39 10.26 5.04 0.65 20.32
LSL 0.35 10.22 5.00 0.63 20.28
1 0.35 10.23 5.05 0.65 20.45
2 0.36 10.19 5.07 0.67 20.25
3 0.34 10.25 5.03 0.66 20.33
4 0.35 10.20 5.08 0.69 20.22
5 0.33 10.17 5.05 0.62 20.40
Max 0.36 10.25 5.08 0.69 20.45
Min 0.33 10.17 5.03 0.62 20.22
The getHighlight method has an inner function which is applied to the df in order to highlight the df elements based on the given masks, and it would output something like this:
my_analysis.getHighlight()
My question is what is the pythonic or elegant way of defining the inner function inside the class method?

Disclaimer: the following remarks represent my opinion about the topic of pythonic code.
Avoid Inner Functions
You should avoid inner functions at all costs. Sometimes they're necessary, but most of the time they're an indication that you might want to refactor your code.
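As a minimal sketch of that refactoring (names and the helper body are placeholders, not the full logic from the question), the nested helper is simply promoted to module level:
import pandas as pd

def highlight(df: pd.DataFrame) -> pd.DataFrame:
    # Module-level helper: importable, testable, and reusable on its own.
    # Placeholder body; the real masking logic from the question goes here.
    return pd.DataFrame('', index=df.index, columns=df.columns)

class Analysis:
    def __init__(self, path):
        self.path = path

    def get_highlight(self):
        # The method only wires the helper to the data; nothing is defined inline.
        return pd.read_csv(self.path).style.apply(highlight, axis=None)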
Avoid re-reading multiple times
I would also avoid calling pd.read_csv every time I want to perform some operation on the data. Unless there's a good reason to read the file over and over again, it's more performant to read it once and store it in a class attribute, or property.
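A minimal sketch of that idea, assuming Python 3.8+ for functools.cached_property (the full code further down uses an explicit _data attribute instead, which achieves the same thing):
import pandas as pd
from functools import cached_property

class Analysis:
    def __init__(self, path: str):
        self.path = path

    @cached_property
    def data(self) -> pd.DataFrame:
        # Read the file on first access only; later accesses reuse the cached frame.
        return pd.read_csv(self.path)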
PEP-8 Naming Conventions
Another important thing to consider, if you're trying to make your code more pythonic, is to follow the PEP-8 naming conventions, unless you're working on legacy code that does not follow PEP-8.
Class Overkill
Finally, I think that creating a class for what you're doing seems a little overkill. Most of your methods are simply transformations that could easily be converted to functions. Aside from making your code less complex, it would improve its reusability.
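A minimal sketch of the function-based alternative (names and the file name are illustrative only):
import pandas as pd

def get_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def get_std(df: pd.DataFrame) -> pd.Series:
    # Standard deviation of every column, over the measurement rows only.
    return df.loc['1':'5'].std()

# Usage: each step is a plain function that can be composed or reused elsewhere.
# df = get_data('measurements.csv')
# std = get_std(df)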
How I would write the Analysis class
from __future__ import absolute_import, annotations
from pathlib import Path
from typing import Any, Collection, Iterable, Type, Union
import numpy as np
import pandas as pd
from pandas.core.dtypes.dtypes import ExtensionDtype # type: ignore
# Custom types for type hinting
Axes = Collection[Any]
NpDtype = Union[
str, np.dtype, Type[Union[str, float, int, complex, bool, object]]
]
Dtype = Union["ExtensionDtype", NpDtype]
# Auxiliary functions
def is_iterable_not_string(iterable: Any) -> bool:
"""Return True, if `iterable` is an iterable object, and not a string.
Parameters
----------
iterable: Any
The object to check whether it's an iterable except for strings,
or not.
Returns
-------
bool
True, if object is iterable, but not a string.
Otherwise, if object isn't an iterable, or if it's a string, return
False.
Examples
--------
>>> import numpy as np
>>> import pandas as pd
>>> class FakeIterable(int):
... def __iter__(self): pass
>>> print(is_iterable_not_string('abcde'))
False
>>> print(is_iterable_not_string(bytes(12345)))
False
>>> print(is_iterable_not_string(12345))
False
>>> print(is_iterable_not_string(123.45))
False
>>> print(is_iterable_not_string(type))
False
>>> print(is_iterable_not_string(list)) # Type list isn't iterable
False
>>> print(is_iterable_not_string(object))
False
>>> print(is_iterable_not_string(None))
False
>>> print(is_iterable_not_string(list())) # Empty list is still iterable
True
>>> # `FakeIterable` has a method `__iter__`, therefore it's considered
>>> # iterable, even though it isn't.
>>> print(is_iterable_not_string(FakeIterable(10)))
True
>>> print(is_iterable_not_string(list('abcde')))
True
>>> print(is_iterable_not_string(tuple('abcde')))
True
>>> print(is_iterable_not_string(set('abcde')))
True
>>> print(is_iterable_not_string(np.array(list('abcdef'))))
True
>>> print(is_iterable_not_string({col: [1, 2, 3, 4] for col in 'abcde'}))
True
>>> print(is_iterable_not_string(
... pd.DataFrame({col: [1, 2, 3, 4] for col in 'abcde'}))
... )
True
>>> print(is_iterable_not_string(pd.DataFrame()))
True
Notes
-----
In Python, any object that contains a method called `__iter__` is considered
an “iterable”. This means that you can, in theory, fake an “iterable”
object, by creating a method called `__iter__` that doesn't contain any
real implementation. For a concrete case, see the examples section.
Python common iterable objects:
- strings
- bytes
- lists
- tuples
- sets
- dictionaries
Python common non-iterable objects:
- integers
- floats
- None
- types
- objects
"""
return (not isinstance(iterable, (bytes, str))
and isinstance(iterable, Iterable))
def prepare_dict(data: dict) -> dict:
"""Transform non-iterable dictionary values into lists.
Parameters
----------
data : dict
The dictionary to convert non-iterable values into lists.
Returns
-------
dict
Dictionary with non-iterable values converted to lists.
Examples
--------
>>> import pandas as pd
>>> d = {'a': '1', 'b': 2}
>>> prepare_dict(d)
{'a': ['1'], 'b': [2]}
>>> pd.DataFrame(d) # doctest: +ELLIPSIS
Traceback (most recent call last):
...
ValueError: If using all scalar values, you must pass an index
>>> pd.DataFrame(prepare_dict(d))
a b
0 1 2
Notes
-----
Use this function to prepare dictionaries, before calling
`pandas.DataFrame`, to make sure all values have the correct format.
"""
return {
key: value if is_iterable_not_string(value) else [value]
for key, value in data.items()
}
def check_dict_value_lens(data: dict) -> bool:
"""Check whether all values of a dictionary have the same length.
Parameters
----------
data : dict
The dictionary whose value lengths to check.
Returns
-------
bool
True, if all `data` values have the same length. False otherwise.
"""
min_len = min(map(lambda value: len(value), data.values()))
return all(len(value) == min_len for value in data.values())
def read_file(path: Path | str, **kwargs: Any) -> pd.DataFrame:
"""
Read a DataFrame from a file.
Supported file types are:
- `.csv`
- `.xlsx`, `.xls`, `.xlsm`, `.xlsb` (Excel files)
- `.json`
- `.parquet`
- `.feather`
- `.html`
Parameters
----------
path : Path | str
The path to the file.
kwargs : Any
Keyword arguments to pass to pandas io functions.
Returns
-------
pd.DataFrame
The DataFrame read from the file.
Raises
------
ValueError
If the file type is not supported.
FileNotFoundError
If the file doesn't exist.
"""
_path = Path(path)
path = str(path)
if not _path.is_file():
raise FileNotFoundError(f"File {path} does not exist.")
if _path.suffix in [".csv", ".txt"]:
return pd.read_csv(path, **kwargs)
if ".xls" in _path.suffix:
return pd.read_excel(path, **kwargs)
if _path.suffix == ".json":
return pd.read_json(path, **kwargs)
if _path.suffix == ".pickle":
return pd.read_pickle(path, **kwargs)
if _path.suffix == ".html":
return pd.read_html(path, **kwargs)
if _path.suffix == ".feather":
return pd.read_feather(path, **kwargs)
if _path.suffix in [".parquet", ".pq"]:
return pd.read_parquet(path, **kwargs)
raise ValueError(f"File {path} has an unknown extension.")
def highlight(df: pd.DataFrame) -> pd.DataFrame:
"""Highlight a DataFrame.
Parameters
----------
df : pd.DataFrame
The DataFrame to highlight. Required indexes:
- ["USL", "LSL", "1", "2", "3", "4", "5"]
Returns
-------
pd.DataFrame
The DataFrame with highlighted rows.
"""
# The dataframe cells background colors.
c1: str = "background-color:red"
c2: str = "background-color:yellow"
c3: str = "background-color:green"
# Rows over which the highlighting function should apply
rows: list[str] = ["1", "2", "3", "4", "5"]
# First boolean mask for selecting the df elements
m1 = (df.loc[rows] > df.loc["USL"]) | (df.loc[rows] < df.loc["LSL"])
# Second boolean mask for selecting the df elements
m2 = (df.loc[rows] == df.loc["USL"]) | (df.loc[rows] == df.loc["LSL"])
# DataFrame with same index, and column names as the original,
# but with filled empty strings.
df_highlight = pd.DataFrame("", index=df.index, columns=df.columns)
# Change values of df1 columns by boolean mask
df_highlight.loc[rows, :] = np.select(
[m1, m2], [c1, c2], default=c3
)
return df_highlight
class Analysis:
"""
Read a dataframe, and help perform some analysis on the data.
Parameters
----------
path_or_data : str | Path | pd.DataFrame
The path to a file, or a dataframe to analyze.
Attributes
----------
_data : pd.DataFrame
The data read from the file.
_path : str | Path
The path to the file.
Examples
--------
>>> data = {
... 'A-A': [
... 0.37, 0.39, 0.35, 0.35, 0.36, 0.34, 0.35, 0.33, 0.36, 0.33,
... ],
... 'A-B': [
... 10.24, 10.26, 10.22, 10.23, 10.19, 10.25, 10.2, 10.17, 10.25,
... 10.17,
... ],
... 'A-C': [
... 5.02, 5.04, 5.0, 5.05, 5.07, 5.03, 5.08, 5.05, 5.08, 5.03,
... ],
... 'A-D': [
... 0.63, 0.65, 0.63, 0.65, 0.67, 0.66, 0.69, 0.62, 0.69, 0.62,
... ],
... 'A-E': [
... 20.3, 20.32, 20.28, 20.45, 20.25, 20.33, 20.22, 20.4,
... 20.45, 20.22,
... ],
... }
>>> index = ['Tg', 'USL', 'LSL', '1', '2', '3', '4', '5', 'Max', 'Min']
>>> analysis = Analysis.from_dict(data, index=index)
>>> analysis.get_std()
A-A 0.011402
A-B 0.031937
A-C 0.019494
A-D 0.025884
A-E 0.097211
dtype: float64
"""
_path: Path | str | None = None
_data: pd.DataFrame | None = None
@property
def path(self) -> str | Path:
"""Get the path to the file.
Returns
-------
str | Path
The path to the file.
Raises
------
ValueError
If `_path` is `None`.
"""
if self._path is None:
raise ValueError("Path not set.")
return str(self._path)
@path.setter
def path(self, path: str | Path):
"""Set the path of the file to analyze.
Parameters
----------
path : str | Path
The path of the file to analyze.
Path should point to a `.csv` file.
Raises
------
FileNotFoundError
If the path is not found.
"""
_path = Path(path)
if _path.is_file():
self._path = str(path)
else:
raise FileNotFoundError(f"Path {path} does not exist.")
@property
def data(self) -> pd.DataFrame:
"""Dataframe read from `path`.
Returns
-------
pd.DataFrame
The dataframe read from `path`.
"""
if self._data is None:
self._data = self.get_data()
return self._data
@data.setter
def data(self, data: pd.DataFrame):
"""Set the data to analyze.
Parameters
----------
data : pd.DataFrame
The data to analyze.
"""
self._data = data
def __init__(self, path_or_data: str | Path | pd.DataFrame):
"""Initialize the Analyzer.
Parameters
----------
path_or_data : str | Path | pd.DataFrame
The path to a file, or a dataframe to analyze.
Raises
------
ValueError
If `path_or_data` is not a `str`, `Path`, or `pd.DataFrame`.
"""
if isinstance(path_or_data, (str, Path)):
self.path = path_or_data
elif isinstance(path_or_data, pd.DataFrame):
self.data = path_or_data
else:
raise ValueError(f"Invalid type {type(path_or_data)}.")
def get_data(self) -> pd.DataFrame:
"""Read the data from the file.
Returns
-------
pd.DataFrame
The dataframe read from the `path` property.
"""
return read_file(self.path)
def get_std(self) -> pd.Series:
"""Calculate the standard deviation of every column.
Returns
-------
pd.Series
The standard deviation of every column.
"""
return self.data.loc["1":"5"].apply(lambda x: x.std()) # type: ignore
def highlight_frame(
self, round_values: int | None = None
) -> pd.io.formats.style.Styler: # type: ignore
"""Highlight dataframe, based on some condition.
Parameters
----------
round_values: int | None
If defined, sets the precision of the Styler object with the
highlighted dataframe.
Returns
-------
pd.io.formats.style.Styler
The Styler object with the highlighted dataframe.
"""
highlight_df = self.data.style.apply(highlight, axis=None)
if isinstance(round_values, int) and round_values >= 0:
return highlight_df.format(precision=round_values)
return highlight_df
@classmethod
def from_dict(
cls,
data: dict,
index: Axes | None = None,
columns: Axes | None = None,
dtype: Dtype | None = None,
) -> Analysis:
"""Create an Analysis object from a dictionary.
Parameters
----------
data : dict
The dictionary to create the Analysis object from.
index : Index or array-like
Index to use for resulting frame. Defaults to RangeIndex, if
no indexing information part of input data and no index provided.
columns : Index or array-like
Column labels to use for resulting frame when data doesn't have
them, defaulting to RangeIndex(0, 1, 2, ..., n).
If data contains column labels, will perform column selection
instead.
dtype : dtype, default None
Data type to force. Only a single dtype allowed. If None, infer.
Returns
-------
Analysis
An instance of the `Analysis` class.
Raises
------
ValueError
If dictionary values have different lengths.
"""
data = prepare_dict(data)
if check_dict_value_lens(data):
return cls(
pd.DataFrame(data, index=index, columns=columns, dtype=dtype)
)
raise ValueError(
f"Dictionary values don't have the same lengths.\nData: {data}"
)
if __name__ == "__main__":
import doctest
doctest.testmod()

Related

Pandas groupby "aggregate" does not see column

I am working on a huge database where I did a pandas apply to categorize the type of client based on the type of product they consumed:
Sample DF:
import pandas as pd
import numpy as np
from datetime import datetime
num_variables = 1000
rng = np.random.default_rng()
data = pd.DataFrame({
'id' : np.random.randint(1,999999999,num_variables),
'date' : [np.random.choice(pd.date_range(datetime(2021,1,1),datetime(2022,12,31))) for i in range(num_variables)],
'product' : [np.random.choice(['giftcards', 'afiliates']) for i in range(num_variables)],
'brand' : [np.random.choice(['brand_1', 'brand_2', 'brand_4', 'brand_6']) for i in range(num_variables)],
'gmv' : rng.random(num_variables) * 100,
'revenue' : rng.random(num_variables) * 100,})
data = data.astype({'product':'category', 'brand':'category'})
base = data.groupby(['id', 'product']).aggregate({'product' : 'count'})
base = base.unstack()
Now I need to group clients by the "type" column and just count how many there are in each group.
First, the categorization function and its application:
def setup(row):
if row[('product', 'afiliates')] >= 1 and row[('product', 'giftcards')] == 0:
return 'afiliates'
if row[('product', 'afiliates')] == 0 and row[('product', 'giftcards')] >= 1:
return 'gift'
if row[('product', 'afiliates')] >= 1 and row[('product', 'giftcards')] >= 1:
return 'both'
base['type'] = base.apply(setup, axis=1)
base.reset_index(inplace=True)
So far, so good. If I run a groupby.agg, I get these results:
results = base[['type','id']].groupby(['type'], dropna=False).agg('count')
But if instead of agg I try aggregate, it does not work.
results = base[['type','id']].groupby(['type']).aggregate({'id': 'count'})
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[10], line 2
1 #results = base[['type','id']].groupby(['type'], dropna=False).agg('count')
----> 2 results = base[['type','id']].groupby(['type']).aggregate({'id': 'count'})
File c:\Users\fabio\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\groupby\generic.py:894, in DataFrameGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
891 func = maybe_mangle_lambdas(func)
893 op = GroupByApply(self, func, args, kwargs)
--> 894 result = op.agg()
895 if not is_dict_like(func) and result is not None:
896 return result
File c:\Users\fabio\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\apply.py:169, in Apply.agg(self)
166 return self.apply_str()
168 if is_dict_like(arg):
--> 169 return self.agg_dict_like()
170 elif is_list_like(arg):
171 # we require a list, but not a 'str'
172 return self.agg_list_like()
File c:\Users\fabio\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\apply.py:478, in Apply.agg_dict_like(self)
475 selected_obj = obj._selected_obj
476 selection = obj._selection
--> 478 arg = self.normalize_dictlike_arg("agg", selected_obj, arg)
...
606 # eg. {'A' : ['mean']}, normalize all to
607 # be list-likes
608 # Cannot use func.values() because arg may be a Series
KeyError: "Column(s) ['id'] do not exist"
What am I missing?
I've asked the same question on the pandas GitHub. They helped me, and I will reproduce the answer here.
You can see how to access your columns using:
print(base.columns.tolist())
[('id', ''), ('product', 'afiliates'), ('product', 'giftcards'), ('type', '')]
When you have a MultiIndex for columns, you need to specify each level as a tuple. So you can do:
base[['type','id']].groupby(['type']).aggregate({('id', ''): 'count'})
Regarding the title of this issue: agg and aggregate are aliases; they do not behave differently.
I suppose there is a bit of an oddity here: why can you do base[['id']] but not specify {'id': ...} in agg? The reason is that column selection can return multiple columns (e.g. in the example here, base[['product']] returns a DataFrame with two columns), whereas agg must have one column and one column only. Thus, it is necessary to specify all levels in agg.
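A small self-contained illustration of that point (the frame here is made up, but has the same MultiIndex column shape as base):
import pandas as pd

base = pd.DataFrame({
    ('id', ''): [101, 102, 103],
    ('product', 'afiliates'): [1, 0, 2],
    ('product', 'giftcards'): [0, 1, 1],
    ('type', ''): ['afiliates', 'gift', 'both'],
})
# {'id': 'count'} raises KeyError, because 'id' is only the first level of the key.
# Spelling out the full tuple works:
print(base[['type', 'id']].groupby(['type']).aggregate({('id', ''): 'count'}))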

IndexError: invalid index to scalar variable error

For each gene, I want to perform McNemar's test and then evaluate if the p-value > 0.05. I want to calculate the number of genes that pass the test.
My code raised IndexError: invalid index to scalar variable.
import numpy as np
import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.gof import chisquare_effectsize
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.power import GofChisquarePower
def generate_gene_df(gene, n):
df = pd.DataFrame.from_dict(
{"Gene" : gene,
"Cells": (f'Cell{x}' for x in range(1, n+1)),
"Control": np.random.choice([1,0], p=[0.1, 0.9], size=n),
"Experimental": np.random.choice([1,0], p=[0.1+0.1, 0.9-0.1], size=n)},
orient='columns'
)
df = df.set_index(["Gene","Cells"])
return df
table = pd.crosstab([df["Gene"], df["Cells"]],
[df["Control"], df["Experimental"]]).to_numpy()
# List of simulated genes
gene_df_list = [generate_gene_df(gene, n) for gene in gene_list]
df = pd.concat(gene_df_list)
df = df.reset_index()
alpha=0.05
p_adjusted=[]
pass_test = []
# McNemar's test
result = mcnemar(table, exact=True)
# Bonferroni correction
p_adjusted = multipletests(result.pvalue, alpha=alpha, method="bonferroni")
for index, value in np.ndenumerate(table):
if result.pvalue > alpha:
np.append(pass_test, result.pvalue[index])
num_passed = len(pass_test)
print("Number of genes that failed to reject H0 is: %i" %num_passed)
Traceback:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/tmp/ipykernel_593/1521754442.py in <module>
11 for index, value in np.ndenumerate(table):
12 if result.pvalue > alpha:
---> 13 np.append(pass_test, result.pvalue[index])
14
15
IndexError: invalid index to scalar variable.
I haven't looked at mcnemar, so I don't know what it produces. But the error tells us/you that result.pvalue is a scalar value, and not indexable.
if result.pvalue > alpha:
np.append(pass_test, result.pvalue[index])
I can also deduce that from the fact that the if line works. It would have raised an ambiguity error if result.pvalue were an array.
Besides rereading the mcnemar docs, I'd suggest rereading the np.append docs (assuming you even did that!).
Generally we discourage the use of np.append in a loop; np.append is not a clone of list.append.
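A minimal sketch of the usual pattern instead (the p-values here are invented just to make it runnable):
import numpy as np

alpha = 0.05
pvalues = [0.20, 0.01, 0.73]      # hypothetical per-gene p-values
pass_test = []                    # plain Python list
for p in pvalues:
    if p > alpha:
        pass_test.append(p)       # list.append mutates in place and is cheap
pass_test = np.array(pass_test)   # convert once, after the loop
print("Number of genes that failed to reject H0 is: %i" % len(pass_test))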

pandas apply/map function throwing an error when the lower() method is applied on the data series

For the below code:
# Reading the input
import ast,sys
input_str = sys.stdin.read()
input_list = ast.literal_eval(input_str)
# Storing the names in a variable 'name'
name = input_list[0]
# Storing the responses in a variable 'repsonse'
response = input_list[1]
import pandas as pd
df = pd.DataFrame({'Name': name,'Response': response})
The input provided is:
['Reetesh', 'Shruti', 'Kaustubh', 'Vikas', 'Mahima', 'Akshay']
['No', 'Maybe', 'yes', 'Yes', 'maybe', 'Yes']
And the output needed is like this screenshot: https://i.stack.imgur.com/d3FIi.png
And when I tried something like below:
def res_map(x):
return x.map({'Yes': 1, 'yes': 1, 'No': 0, 'no': 0, 'Maybe': 0.5, 'maybe': 0.5})
df[['Response']] = df[['Response']].apply(res_map)
# Print the final DataFrame
print(df)
It worked well.
but when I tried something like this:
df['Response'] = df['Response'].apply(lambda x: x.str.lower().map({'yes':1.0, 'no':0.0, 'maybe':0.5}))
I got an error: "str" object does not have method "map"...
What did I do wrong here?
When you apply on a DataFrame like df[['Response']].apply(...), the function receives the whole df.Response column at once, so you can use Series methods like str and map.
If you instead apply on a Series like df['Response'].apply(...), the function receives the individual strings of df.Response instead of the whole column, so the mapping needs to be changed accordingly:
mapping = {'yes':1.0, 'no':0.0, 'maybe':0.5}
df['Response'] = df['Response'].apply(lambda x: mapping[x.lower()])
# Name Response
# 0 Reetesh 0.0
# 1 Shruti 0.5
# 2 Kaustubh 1.0
# 3 Vikas 1.0
# 4 Mahima 0.5
# 5 Akshay 1.0
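As an aside (not required for the fix), the same mapping can be done without apply at all, using vectorized Series methods on the original string column:
mapping = {'yes': 1.0, 'no': 0.0, 'maybe': 0.5}
df['Response'] = df['Response'].str.lower().map(mapping)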

PySpark: how to use `StringIndexer` to do label encoding with the string array column

As we know, we can do LabelEncoder()-style encoding with StringIndexer on a string column, but if we want to do it on a string array column, it is not easy to implement.
# input
df.show()
+--------------------------------------+
| tags|
+--------------------------------------+
| [industry, display, Merchants]|
| [smart, swallow, game, Experience]|
| [social, picture, social]|
| [default, game, us, adventure]|
| [financial management, loan, product]|
| [system, profile, optimization]|
...
# After doing LabelEncoder() on the `tags` column
...
+--------------------------------------+
| tags|
+--------------------------------------+
| [0, 1, 2]|
| [3, 4, 4, 5]|
| [6, 7, 6]|
| [8, 4, 9, 10]|
| [11, 12, 13]|
| [14, 15, 16]|
The Scala version is shown below; the Python version will be very similar:
// add unique id to each row
val df2 = df.withColumn("id", monotonically_increasing_id).select('id, explode('tags).as("tag"))
val indexer = new StringIndexer()
.setInputCol("tag")
.setOutputCol("tagIndex")
val indexed = indexer.fit(df2).transform(df2)
// in the final step you should convert tags back to array of tags
val dfFinal = indexed.groupBy('id).agg(collect_list('tagIndex))
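For reference, a rough PySpark translation of the Scala snippet above (assuming a DataFrame df with an array column named tags):
from pyspark.sql import functions as F
from pyspark.ml.feature import StringIndexer

# add a unique id to each row, then explode the tag array into one row per tag
df2 = (df.withColumn("id", F.monotonically_increasing_id())
         .select("id", F.explode("tags").alias("tag")))
indexer = StringIndexer(inputCol="tag", outputCol="tagIndex")
indexed = indexer.fit(df2).transform(df2)
# in the final step, collect the indexed tags back into an array per row
df_final = indexed.groupBy("id").agg(F.collect_list("tagIndex").alias("tagsIndex"))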
You can create a class that explodes the array column, applies the StringIndexer, and collects the indexes back into a list. The benefit of using a class instead of step-by-step transformations is that it can be used in a pipeline or saved as fitted.
A class doing all the transformations and applying a StringIndexer:
# Imports needed by the classes below
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.ml import Estimator, Model
from pyspark.ml.feature import StringIndexer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
class ArrayStringIndexerModel(Model, DefaultParamsReadable, DefaultParamsWritable):
def __init__(self, indexer, inputCol: str, outputCol: str):
super(ArrayStringIndexerModel, self).__init__()
self.indexer = indexer
self.inputCol = inputCol
self.outputCol = outputCol
def _transform(self, df: DataFrame=[]) -> DataFrame:
# Creating always increasing id (as in fit)
df = df.withColumn("id_temp_added", monotonically_increasing_id())
# Exploding "inputCol" and saving to the new dataframe (as in fit)
df2 = df.withColumn('inputExpl', F.explode(self.inputCol)).select('id_temp_added', 'inputExpl')
# Transforming with fitted "indexed"
indexed_df = self.indexer.transform(df2)
# Converting indexed back to array
indexed_df = indexed_df.groupby('id_temp_added').agg(F.collect_list(F.col(self.outputCol)).alias(self.outputCol))
# Joining to the main dataframe
df = df.join(indexed_df, on='id_temp_added', how='left')
# dropping created id column
df = df.drop('id_temp_added')
return df
class ArrayStringIndexer(Estimator, DefaultParamsReadable, DefaultParamsWritable):
"""
A custom Transformer which applies string indexer to the array of strings
(explodes, applies StringIndexer, aggregates back)
"""
def __init__(self, inputCol: str, outputCol: str):
super(ArrayStringIndexer, self).__init__()
# self.indexer = None
self.inputCol = inputCol
self.outputCol = outputCol
def _fit(self, df: DataFrame = []) -> ArrayStringIndexerModel:
# Creating always increasing id
df = df.withColumn("id_temp_added", monotonically_increasing_id())
# Exploding "inputCol" and saving to the new dataframe
df2 = df.withColumn('inputExpl', F.explode(self.inputCol)).select('id_temp_added', 'inputExpl')
# Indexing self.indexer and self.indexed dataframe with exploded input column
indexer = StringIndexer(inputCol='inputExpl', outputCol=self.outputCol)
indexer = indexer.fit(df2)
# Returns ArrayStringIndexerModel class with fitted StringIndexer, input and output columns
return ArrayStringIndexerModel(indexer=indexer, inputCol=self.inputCol, outputCol=self.outputCol)
How to use the class in a code:
tags_indexer = ArrayStringIndexer(inputCol="tags", outputCol="tagsIndex")
tags_indexer.fit(df).transform(df).show()

How to safely subclass ndarray and get behavior consistent with ndarray - odd nanmin/max results?

I'm trying to subclass an ndarray so that I can add some additional fields. When I do this, however, I get odd behavior in a variety of numpy functions. For example, nanmin now returns an object of the type of my new array class, whereas previously I'd get a float64. Why? Is this a bug with nanmin or my class?
import numpy as np
class NDArrayWithColumns(np.ndarray):
def __new__(cls, obj, columns=None):
obj = obj.view(cls)
obj.columns = tuple(columns)
return obj
def __array_finalize__(self, obj):
if obj is None: return
self.columns = getattr(obj, 'columns', None)
NAN = float("nan")
r = np.array([1.,0.,1.,0.,1.,0.,1.,0.,NAN, 1., 1.])
print "MIN", np.nanmin(r), type(np.nanmin(r))
gives:
MIN 0.0 <type 'numpy.float64'>
but
>>> r = NDArrayWithColumns(r, ["a"])
>>> print "MIN", np.nanmin(r), type(np.nanmin(r))
MIN 0.0 <class '__main__.NDArrayWithColumns'>
>>> print r.shape
(11,)
Note the change in type, and also that str(np.nanmin(r)) shows 1 field, not 11.
In case you're interested, I'm subclassing because I'd like to track column names in matrices of a single type (structured arrays and record-type arrays allow for varying types).
You need to implement the __array_wrap__ method that gets called at the end of ufuncs, per the docs:
def __array_wrap__(self, out_arr, context=None):
print('In __array_wrap__:')
print(' self is %s' % repr(self))
print(' arr is %s' % repr(out_arr))
# then just call the parent
return np.ndarray.__array_wrap__(self, out_arr, context)
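If the goal is specifically to get plain scalars back from reductions like nanmin, one common pattern (a sketch, not part of the original answer) is to unwrap 0-d results inside __array_wrap__:
def __array_wrap__(self, out_arr, context=None):
    wrapped = np.ndarray.__array_wrap__(self, out_arr, context)
    # Reductions such as min/nanmin produce 0-d arrays; return a plain scalar
    # (e.g. numpy.float64) instead of a 0-d NDArrayWithColumns.
    if wrapped.ndim == 0:
        return wrapped[()]
    return wrapped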