Understand pandas' applymap argument

I'm trying to highlight specific columns in my dataframe using the approach from this post: https://stackoverflow.com/a/41655055/5158984.
My question is about the use of the subset argument. My guess is that it's part of the **kwargs argument. However, the official pandas documentation, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html, explains it only vaguely.
So, in general, how can I know which keywords I can use whenever I see **kwargs?
Thanks!

It seems that you are confusing pandas.DataFrame.applymap and df.style.applymap (where df is an instance of pd.DataFrame). In the latter, subset is a named parameter in its own right, not part of **kwargs.
Here is one way to find out (in your terminal or a Jupyter notebook cell) what the named parameters of this method are (or of any other pandas method, for that matter):
import pandas as pd
df = pd.DataFrame()
help(df.style.applymap)
# Output
Help on method applymap in module pandas.io.formats.style:

applymap(func: 'Callable', subset: 'Subset | None' = None, **kwargs) -> 'Styler' method of pandas.io.formats.style.Styler instance
    Apply a CSS-styling function elementwise.

    Updates the HTML representation with the result.

    Parameters
    ----------
    func : function
        ``func`` should take a scalar and return a string.
    subset : label, array-like, IndexSlice, optional
        A valid 2d input to `DataFrame.loc[<subset>]`, or, in the case of a 1d input
        or single key, to `DataFrame.loc[:, <subset>]` where the columns are
        prioritised, to limit ``data`` to *before* applying the function.
    **kwargs : dict
        Pass along to ``func``.
    ...
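So subset limits which part of the DataFrame gets styled, while anything extra you pass in **kwargs is forwarded to func. A minimal sketch (the data and the color_negative helper are made up for illustration):

import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3], "b": [-4, 5, -6]})

def color_negative(val, color="red"):
    # func receives one scalar; extra keyword arguments such as
    # color arrive here via **kwargs
    return f"color: {color}" if val < 0 else ""

# Style only column "b"; subset is its own parameter, color travels via **kwargs
styled = df.style.applymap(color_negative, subset=["b"], color="blue")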

Related

Pandas: what are "string function names" called technically?

When using pandas you can in certain cases pass names of functions as strings instead of actual references to those functions. For example: df.transform('round').
In the pandas docs they call these "string function names" but is there another (perhaps more technical) name for these kinds of strings?
Well, pandas doesn't really insist on strings; it's just that for some functions, like mean, there is no bare name in scope to pass (writing df.transform(mean) would raise a NameError), so the quotes are required.
With cases like round the quotes aren't actually needed, since round is already a builtin function. The "string function names" are really just a way of referring to these operations by name so that they don't get mixed up with other functions.
As mentioned in the documentation link you provided, they call it:
string function name
There is really no special term for it, IMO.
By passing an invalid string to the aggregate method (e.g. df.agg('max2')) and following the traceback, I got to the following code (pandas version 1.1.4):
class SelectionMixin:
    """
    mixin implementing the selection & aggregation interface on a group-like
    object sub-classes need to define: obj, exclusions
    """
    # < some lines deleted >

    def _try_aggregate_string_function(self, arg: str, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function (or attribute) on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, str)

        f = getattr(self, arg, None)
        if f is not None:
            if callable(f):
                return f(*args, **kwargs)

            # people may try to aggregate on a non-callable attribute
            # but don't let them think they can pass args to it
            assert len(args) == 0
            assert len([kwarg for kwarg in kwargs if kwarg not in ["axis"]]) == 0
            return f

        f = getattr(np, arg, None)
        if f is not None:
            if hasattr(self, "__array__"):
                # in particular exclude Window
                return f(self, *args, **kwargs)

        raise AttributeError(
            f"'{arg}' is not a valid function for '{type(self).__name__}' object"
        )
It seems that we fall into this code whenever we pass a string function name to aggregate. If we look into the familiar pandas objects (Series, DataFrame, GroupBy), we find that they inherit from SelectionMixin.
The string function names are looked up either on the pandas object itself (getattr(self, arg, None)) or in NumPy (getattr(np, arg, None)). So a string function name simply names an attribute of some object: either a method of a pandas object or a function defined in NumPy.
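A quick way to see both lookup paths in action (a small sketch, mirroring the resolution order in the source above):

import pandas as pd

df = pd.DataFrame({"x": [1.4, 2.6]})

# 'round' is found on the DataFrame itself, so this resolves to df.round
print(df.transform('round'))

# 'sqrt' is not a DataFrame method, so the lookup falls back to np.sqrt
print(df.transform('sqrt'))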

Get python string from Tensor without the numpy function

I'm following the tutorial on https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/text/word_embeddings.ipynb
TextVectorization by default splits on whitespace, but I want to implement a custom split. I want to keep punctuation (which I have implemented in custom_standardization) and split between words and punctuation.
For instance, "fn(1,2)=1+2=3" needs to split into ["fn","(","1",",","2",")","=","1","+","2","=","3"].
def custom_split(input_data: tf.Tensor):
    assert input_data.dtype.name == 'string'
    assert hasattr(input_data, 'numpy') == False
    ???

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    split=custom_split,
    output_mode='int',
    output_sequence_length=sequence_length)
I'm confident in such splitting given a standard Python string. However, the input is a tf.Tensor and, following the aforementioned tutorial, input_data does not have a numpy() method.
What's the proper way to do such splitting? Is it possible to retrieve a Python string from a string Tensor?
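No answer appears above; one possible direction (my sketch, not from the thread, and the regex is an assumption) is to stay in graph mode with tf.strings ops, which work on string tensors directly and never need .numpy():

import tensorflow as tf

def custom_split(input_data: tf.Tensor):
    # Pad every character that is neither a word character nor whitespace
    # with spaces, then split on whitespace; both ops accept string tensors.
    padded = tf.strings.regex_replace(input_data, r"([^\w\s])", r" \1 ")
    return tf.strings.split(padded)

print(custom_split(tf.constant(["fn(1,2)=1+2=3"])))
# <tf.RaggedTensor [[b'fn', b'(', b'1', b',', b'2', b')', b'=', b'1', b'+', b'2', b'=', b'3']]>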

Writing data frame with object dtype to HDF5 only works after converting to string

I have a big dataframe and I want to write it to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way is to cope with this.
import pandas as pd
import numpy as np
length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length),})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting the various lines leads me to the following observations:
- Just filling the column with numbers leads to a data type of int32, and the frame stores without problems.
- Setting one element to "abc" changes the dtype to object, but it seems that to_hdf internally infers another data type and throws an error: TypeError: object of type 'int' has no len().
- Explicitly converting the column to str leads to success, and to_hdf stores the data.
Now I am wondering what is happening in the second case, and whether there is a way to prevent it. The only way I found was to go through all columns, check if they are dtype('O'), and explicitly convert them to str, as sketched below.
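A minimal version of that per-column workaround (a sketch; the helper name is made up):

import pandas as pd

def stringify_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Cast every object-dtype column to str so PyTables sees a uniform
    # string column instead of mixed int/str values it cannot measure.
    for col in df.columns[df.dtypes == object]:
        df[col] = df[col].astype(str)
    return df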
Instead of using HDF5, I have found a generic pickling library which seems to be perfect for the job: joblib.
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")

Vectorizing text from data frame column using pandas

I have a DataFrame which looks like this: [screenshot of the DataFrame in the original post]
I am trying to vectorize every row, but only from the text column. I wrote this code:
vectorizerCount = CountVectorizer(stop_words='english')
# tokenize and build vocab
allDataVectorized = allData.apply(vectorizerCount.fit_transform(allData.iloc[:]['headline_text']), axis=1)
The error says:
TypeError: ("'csr_matrix' object is not callable", 'occurred at index 0')
Doing some research and trying changes, I found out that fit_transform returns a scipy.sparse.csr.csr_matrix, which is not callable.
Is there another way to do this?
Thanks!
There are a number of problems with your code. apply is the wrong tool here: you call fit_transform once on the whole text column and then pass its result (a matrix, not a function) to apply, and CountVectorizer expects a one-dimensional iterable of documents. You probably need something like
allDataVectorized = pd.DataFrame.sparse.from_spmatrix(vectorizerCount.fit_transform(allData['headline_text']))
allData['headline_text'] (single brackets) is a Series of strings, one document per row, which is what fit_transform expects.
fit_transform returns a csr matrix, which is why calling its result inside apply failed.
pd.DataFrame.sparse.from_spmatrix(...) creates a DataFrame from a csr matrix without densifying it (pd.DataFrame(...) does not accept a sparse matrix directly).
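A self-contained sketch (the sample headlines are made up; get_feature_names_out assumes scikit-learn >= 1.0):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

allData = pd.DataFrame({"headline_text": ["dogs bark loudly", "cats sleep all day"]})

vectorizerCount = CountVectorizer(stop_words='english')
matrix = vectorizerCount.fit_transform(allData['headline_text'])  # csr_matrix

# Name the columns after the learned vocabulary
allDataVectorized = pd.DataFrame.sparse.from_spmatrix(
    matrix, columns=vectorizerCount.get_feature_names_out())
print(allDataVectorized)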

subclass ndarray in python numpy: change size and value of array

Someone asks the question here: Subclassing numpy ndarray problem, but it is basically unanswered.
Here is my version of the question. Suppose you subclass numpy.ndarray to something that automatically expands when you try to set an element beyond the current shape. You would need to override __setitem__ and use some numpy.concatenate calls to construct a new array, and then assign it to "self" somehow. How do you assign the array to "self"?
import logging
import numpy

class myArray(numpy.ndarray):
    def __new__(cls, input_array):
        obj = numpy.asarray(input_array).view(cls)
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return

    def __setitem__(self, coords, value):
        try:
            super(myArray, self).__setitem__(coords, value)
        except IndexError as e:
            logging.error("Adjusting array")
            ...
            self = new_array  # THIS IS WRONG
Why subclass? Why not just give your wrapper object its own data member that is an ndarray, and use __getitem__ and __setitem__ to operate on the wrapped data member? That is composition instead of inheritance, and it sidesteps the problem of reassigning self: you rebind an attribute instead. Also have a look at pandas, which already does a lot of what you're talking about, built on top of ndarray.
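A sketch of that delegation approach (1-D arrays and integer indices only; the class name is made up):

import numpy as np

class GrowableArray:
    # Wrap an ndarray and grow it on out-of-bounds writes, instead of
    # trying to rebind self inside an ndarray subclass.
    def __init__(self, input_array):
        self._data = np.asarray(input_array)

    def __getitem__(self, index):
        return self._data[index]

    def __setitem__(self, index, value):
        if index >= len(self._data):
            grown = np.zeros(index + 1, dtype=self._data.dtype)
            grown[:len(self._data)] = self._data
            self._data = grown  # rebinding an attribute works; rebinding self does not
        self._data[index] = value

a = GrowableArray([1, 2, 3])
a[5] = 9
print(a._data)  # [1 2 3 0 0 9]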