Error when using numpy to encode categorical features of dataset - pandas

I use the following function to encode the categorical features of my dataset (it has 27 features, 11 of which are categorical):
from sklearn import preprocessing

def features_encoding(data):
    columnsToEncode = list(data.select_dtypes(include=['category', 'object']))
    le = preprocessing.LabelEncoder()
    for feature in columnsToEncode:
        try:
            data[feature] = le.fit_transform(data[feature])
        except:
            continue
    return data
But I get this error:
FutureWarning: numpy not_equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (`is`)) and will change.
flag = np.concatenate(([True], aux[1:] != aux[:-1]))
I don't understand this error. Kindly, can someone explain what it is about and how to fix it?

This is almost certainly caused by np.nan appearing more than once in an array of dtype=object that is passed into np.unique.
This may help clarify what's going on:
>>> np.nan is np.nan
True
>>> np.nan == np.nan
False
>>> np.array([np.nan], dtype=object) == np.array([np.nan], dtype=object)
FutureWarning: numpy equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (`is`)) and will change.
array([ True], dtype=bool)
So when comparing two arrays of dtype=object, numpy checks whether the comparison function returns False even though the two objects being compared are the exact same object. Right now it assumes that every object compares equal to itself, but that behaviour will change at some point in the future.
All in all, it's just a warning, so you can ignore it, at least for now...
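If you would rather make the warning go away at its source, one option is to stop repeated np.nan objects from reaching np.unique in the first place, for example by casting the object columns to strings before encoding. A minimal sketch, assuming you are happy to treat missing values as just another category; the astype(str) step is a suggestion, not part of the original answer:
from sklearn import preprocessing

def features_encoding(data):
    columns_to_encode = list(data.select_dtypes(include=['category', 'object']))
    le = preprocessing.LabelEncoder()
    for feature in columns_to_encode:
        # Casting to str turns every NaN into the single string 'nan',
        # so np.unique never compares identical np.nan objects
        data[feature] = le.fit_transform(data[feature].astype(str))
    return data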

Related

Why does introducing a different datatype to the logistic regression from the statsmodels API throw an error?

import pandas as pd
import statsmodels.api as sm

X = pd.DataFrame([[True, 12.3], [False, 14.2], [True, 18.0]])
y = pd.Series([0, 1, 0])
log_reg = sm.Logit(y, X).fit()
log_reg.summary()
If I remove either the boolean or float variable it works.
However, when I leave them both in, the following ValueError gets raised:
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
You need to convert your variables to float:
sm.Logit(y, X.astype(float)).fit()
In your particular case the problem is the boolean variable. Maybe this statsmodels issue could be useful.
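If you prefer to convert only the offending boolean columns rather than the whole frame, here is a sketch along the same lines (the select_dtypes-based column selection is an addition, not from the original answer):
import pandas as pd

X = pd.DataFrame([[True, 12.3], [False, 14.2], [True, 18.0]])

# Cast only the boolean columns to float, leaving the float column untouched
bool_cols = X.select_dtypes(include=['bool']).columns
X[bool_cols] = X[bool_cols].astype(float)
print(X.dtypes)  # both columns are now float64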

How to compare numpy arrays of tuples?

Here's an MWE that illustrates the issue I have:
import numpy as np
arr = np.full((3, 3), -1, dtype="i,i")
doesnt_work = arr == (-1, -1)
n_arr = np.full((3, 3), -1, dtype=int)
works = n_arr == 10
arr is supposed to be an array of tuples, but it doesn't behave as expected.
works is an array of booleans, as expected, but doesnt_work is False. Is there a way to get numpy to do elementwise comparisons on more complex types, or do I have to resort to list comprehension, flatten and reshape?
There's a second problem:
f = arr[(0, 0)] == (-1, -1)
f is False, because arr[(0,0)] is of type numpy.void rather than a tuple. So even if the componentwise comparison worked, it would give the wrong result. Is there a clever numpy way to do this or should I just resort to list comprehension?
Both problems are actually the same problem, and both are related to the custom data type you created when you specified dtype="i,i".
If you run arr.dtype you will get dtype([('f0', '<i4'), ('f1', '<i4')]). That is, two signed integers placed in one contiguous block of memory. This is not a Python tuple, so it is clear why the naive comparison fails: (-1, -1) is a Python tuple and is not represented in memory the same way the numpy data type is.
However if you compare with a_comp = np.array((-1,-1), dtype="i,i") you get the exact behavior you are expecting!
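For completeness, a minimal sketch of that comparison, assuming a numpy version where elementwise comparison of matching structured dtypes is supported:
import numpy as np

arr = np.full((3, 3), -1, dtype="i,i")

# Build the comparison value with the same structured dtype as arr
a_comp = np.array((-1, -1), dtype="i,i")

print(arr == a_comp)        # elementwise comparison: a (3, 3) boolean array
print(arr[0, 0] == a_comp)  # the scalar (np.void) comparison now works too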
You can read more about how the custom dtype stuff works on the numpy docs:
https://numpy.org/doc/stable/reference/arrays.dtypes.html
Oh, and to address what np.void is: it comes from the idea of a void C pointer, which is essentially an address to a contiguous block of memory of unspecified type. Provided you (the programmer) know what is stored in that memory (in this case two back-to-back integers), it's fine, as long as you are careful and compare with the same custom data type.

How to control the display precision of a NumPy float64 scalar?

I'm writing a teaching document that uses lots of examples of Python code and includes the resulting numeric output. I'm working from inside IPython and a lot of the examples use NumPy.
I want to avoid print statements, explicit formatting or type conversions. They clutter the examples and detract from the principles I'm trying to explain.
What I know:
From IPython I can use %precision to control the displayed precision of any float results.
I can use np.set_printoptions() to control the displayed precision of elements within a NumPy array.
What I'm looking for is a way to control the displayed precision of a NumPy float64 scalar, which doesn't respond to either of the above. These get returned by a lot of NumPy functions.
>>> x = some_function()
Out[2]: 0.123456789
>>> type(x)
Out[3]: numpy.float64
>>> %precision 2
Out[4]: '%.2f'
>>> x
Out[5]: 0.123456789
>>> float(x) # that precision works for regular floats
Out[6]: 0.12
>>> np.set_printoptions(precision=2)
>>> x # but doesn't work for the float64
Out[8]: 0.123456789
>>> np.r_[x] # does work if it's in an array
Out[9]: array([0.12])
What I want is
>>> # some formatting command
>>> x = some_function() # that returns a float64 = 0.123456789
Out[2]: 0.12
but I'd settle for:
a way of telling NumPy to give me float scalars by default, rather than float64.
a way of telling IPython how to handle a float64, kind of like what I can do with _repr_pretty_ for my own classes.
IPython has formatters (core/formatters.py) which contain a dict that maps a type to a format method. There seems to be some knowledge of NumPy in the formatters but not for the np.float64 type.
There are a bunch of formatters, for HTML, LaTeX etc. but text/plain is the one for consoles.
We first get the IPython formatter for console text output
plain = get_ipython().display_formatter.formatters['text/plain']
and then set a formatter for the float64 type. We reuse the formatter that already exists for float, since it already knows about %precision:
plain.for_type(np.float64, plain.lookup_by_type(float))
Now
In [26]: a = float(1.23456789)
In [28]: b = np.float64(1.23456789)
In [29]: %precision 3
Out[29]: '%.3f'
In [30]: a
Out[30]: 1.235
In [31]: b
Out[31]: 1.235
While digging through the implementation I also found that %precision calls np.set_printoptions() with a suitable format string. I didn't know it did this, and it is potentially problematic if the user has already set print options themselves. Following the example above
In [32]: c = np.r_[a, a, a]
In [33]: c
Out[33]: array([1.235, 1.235, 1.235])
we see it is doing the right thing for array elements.
I can do this formatter initialisation explicitly in my own code, but a better fix might be to modify IPython's core/formatters.py around line 677
@default('type_printers')
def _type_printers_default(self):
    d = pretty._type_pprinters.copy()
    d[float] = lambda obj, p, cycle: p.text(self.float_format % obj)
    # suggested "fix"
    if 'numpy' in sys.modules:
        d[numpy.float64] = lambda obj, p, cycle: p.text(self.float_format % obj)
    # end suggested fix
    return d
to also handle np.float64 here if NumPy has been imported. Happy for feedback on this; if I feel brave I might submit a PR.
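In the meantime the registration can live in your own setup code, for example an IPython startup file. A minimal sketch; the file location is the standard startup directory, and only the two formatter calls come from the answer above:
# e.g. ~/.ipython/profile_default/startup/float64_format.py
import numpy as np
from IPython import get_ipython

ip = get_ipython()
if ip is not None:
    plain = ip.display_formatter.formatters['text/plain']
    # Reuse the existing float formatter, which already honours %precision
    plain.for_type(np.float64, plain.lookup_by_type(float))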

Python: What's the proper way to convert between python bool and numpy bool_?

I kept getting errors with a numpy ndarray of booleans not being accepted as a mask by a pandas structure when it occurred to me that I may have the 'wrong' booleans. Edit: it was not a raw numpy array but a pandas.Index.
While I was able to find a solution, the only one that worked was quite ugly:
mymask = mymask.astype(np.bool_)  # ver.1 does not work, elements remain <class 'bool'>
mymask = mymask == True           # ver.2 does work, elements become <class 'numpy.bool_'>
mypdstructure[mymask]
What's the proper way to typecast the values?
Ok, I found the problem. My original post was not fully correct: my mask was a pandas.Index.
It seems that pandas.Index.astype is behaving unexpectedly (for me), as I get different behavior for the following:
mask = pindex.map(myfun).astype(np.bool_)              # doesn't cast
mask = pindex.map(myfun).astype(np.bool_, copy=False)  # doesn't cast
mask = pindex.map(myfun).values.astype(np.bool_)       # does cast
Maybe it is actually a pandas bug? This result is surprising to me because I was under the impression that pandas usually just calls through to the numpy arrays it is built on. That is clearly not the case here.
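To illustrate the workaround on a toy Index (the example data here is made up; only the .values.astype() pattern comes from the answer):
import numpy as np
import pandas as pd

pindex = pd.Index(['a', '', 'bc'], dtype=object)

# Index.astype(np.bool_) did not cast the elements, but casting the
# underlying ndarray does
mask = pindex.map(len).values.astype(np.bool_)
print(mask.dtype)     # bool
print(type(mask[0]))  # <class 'numpy.bool_'>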

Native method to skip NaNs in a lambda function?

I was wondering if there's a native method to skip NaNs in a lambda function.
I have a dataframe y in the form below and I'm attempting to turn the Year column into ints, but the lambda function breaks because of the NaN. I've come up with the workaround below, but I'm wondering if there are better ways to deal with this pervasive issue. Thanks!
     Year
137  2005
138   NaN
To deal with it, I just used try/except. I wonder if there's a better way to deal with NaNs.
import numpy as np

def turn_int(x):
    try:
        return int(x)
    except:
        return np.nan

y.Year.apply(lambda x: turn_int(x))
int doesn't have a representation of NaN. The normal way to deal with it would be to drop all the NaNs first:
year = y.Year.dropna().astype(int)
I have done this for an int series:
import numpy as np
y['Year'] = y['Year'].apply(lambda x: x if np.isnan(x) else int(x))
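Putting both suggestions together on the question's toy data, a runnable sketch (the DataFrame construction is a reconstruction of the excerpt above):
import numpy as np
import pandas as pd

y = pd.DataFrame({'Year': [2005.0, np.nan]}, index=[137, 138])

# Option 1: drop the NaNs first, then cast to a plain int
year = y['Year'].dropna().astype(int)

# Option 2: keep the NaNs and convert only the non-NaN entries
y['Year'] = y['Year'].apply(lambda x: x if np.isnan(x) else int(x))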