a possible bug in numpy.isclose when comparing matrices with nans - numpy

consider the next piece of code:
In [90]: m1 = np.matrix([1,2,3], dtype=np.float32)
In [91]: m2 = np.matrix([1,2,3], dtype=np.float32)
In [92]: m3 = np.matrix([1,2,'nan'], dtype=np.float32)
In [93]: np.isclose(m1, m2, equal_nan=True)
Out[93]: matrix([[ True, True, True]], dtype=bool)
In [94]: np.isclose(m1, m3, equal_nan=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-94-5d2b979bc263> in <module>()
----> 1 np.isclose(m1, m3, equal_nan=True)
/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in isclose(a, b, rtol, atol, equal_nan)
2571 # Ideally, we'd just do x, y = broadcast_arrays(x, y). It's in
2572 # lib.stride_tricks, though, so we can't import it here.
-> 2573 x = x * ones_like(cond)
2574 y = y * ones_like(cond)
2575 # Avoid subtraction with infinite/nan values...
/usr/local/lib/python2.7/dist-packages/numpy/matrixlib/defmatrix.pyc in __mul__(self, other)
341 if isinstance(other, (N.ndarray, list, tuple)) :
342 # This promotes 1-D vectors to row vectors
--> 343 return N.dot(self, asmatrix(other))
344 if isscalar(other) or not hasattr(other, '__rmul__') :
345 return N.dot(self, other)
ValueError: shapes (1,3) and (1,3) not aligned: 3 (dim 1) != 1 (dim 0)
when comparing arrays with nans it's working as expected:
In [95]: np.isclose(np.array(m1), np.array(m3), equal_nan=True)
Out[95]: array([[ True, True, False]], dtype=bool)
why is np.isclose failing? from the documentation it seems that it should work
thanks

The problem comes from np.nan == np.nan, which is False in the float logic.
In [39]: np.nan == np.nan
Out[39]: False
The `equal_nan` parameter is to force two `nan` values to be considered as equal , not to consider any value to be equal to `nan`.
In [37]: np.isclose(m3,m3)
Out[37]: array([ True, True, False], dtype=bool)
In [38]: np.isclose(m3,m3,equal_nan=True)
Out[38]: array([ True, True, True], dtype=bool)

Related

Pandas: Find row with a ndarray

I am failry new to panda.
To find all rows with a certain value, I can run
data[data['category'] == 'name']
which would return a Series as expected.
Ony of my column is a 1x2 numpy array. However if I do
data[data['list'] == np.array([0, 0])]
I get ValueError: Lengths must match to compare
How would I find the row with a certain numpy array in it?
You can use apply with lambda function like df[df.list.apply(lambda x: (x == c).all())]
Ex.:
>>> df
list
0 [0, 0]
1 [1, 1]
2 [0, 0]
3 [1, 0]
>>> c
array([0, 0])
>>> df.list.apply(lambda x: x == c)
0 [True, True]
1 [False, False]
2 [True, True]
3 [False, True]
Name: list, dtype: object
>>> df.list.apply(lambda x: (x == c).all())
0 True
1 False
2 True
3 False
Name: list, dtype: bool
>>> df[df.list.apply(lambda x: (x == c).all())]
list
0 [0, 0]
2 [0, 0]

Numpy masked array initialization from another array

Is there a cleaner way to initialize a numpy masked array from a non-ma, with all masked values False, than this?
masked_array = np.ma.masked_array(array, mask=np.zeros_like(array, dtype='bool'))
The duplicate reference to array seems unnecessary and clunky. If you do not give the mask= parameter, the mask defaults to a scalar boolean, which prevents sliced access to the mask.
You should be able to just set the mask to False:
>>> array = np.array([1,2,3])
>>> masked_array = np.ma.masked_array(array, mask=False)
>>> masked_array
masked_array(data = [1 2 3],
mask = [False False False],
fill_value = 999999)
I saw hpaulj’s comment and played around with different ways of solving this issue and comparing performance. I can’t explain the difference, but #hpaulj seems to have a much deeper understanding of how numpy works. Any input on why m3() executes so much faster would be most appreciated.
def origM():
array = np.array([1,2,3])
return np.ma.masked_array(array, mask=np.zeros_like(array, dtype='bool'))
def m():
array = np.array([1,2,3])
return np.ma.masked_array(array, mask=False)
def m2():
array = np.array([1,2,3])
m = np.ma.masked_array(array)
m.mask = False
return m
def m3():
array = np.array([1,2,3])
m = array.view(np.ma.masked_array)
m.mask = False
return m
>>> origM()
masked_array(data = [1 2 3],
mask = [False False False],
fill_value = 999999)
All four return the same result:
>>> m()
masked_array(data = [1 2 3],
mask = [False False False],
fill_value = 999999)
>>> m2()
masked_array(data = [1 2 3],
mask = [False False False],
fill_value = 999999)
>>> m3()
masked_array(data = [1 2 3],
mask = [False False False],
fill_value = 999999)
m3() executes the fastest:
>>> timeit.timeit(origM, number=1000)
0.024451958015561104
>>> timeit.timeit(m, number=1000)
0.0393978749634698
>>> timeit.timeit(m2, number=1000)
0.024049583997111768
>>> timeit.timeit(m3, number=1000)
0.018082750029861927

Fill nan in numpy array

Is there a straight forward way of filling nan values in a numpy array when the left and right non nan values match?
For example, if I have an array that has False, False , NaN, NaN, False, I want the NaN values to also be False. If the left and right values do not match, I want it to keep the NaN
Your first task is to reliably identify the np.nan elements. Because it's a unique float value, testing isn't trivail. np.isnan is the best numpy tool.
To mix boolean and float (np.nan) you have to use object dtype:
In [68]: arr = np.array([False, False, np.nan, np.nan, False],object)
In [69]: arr
Out[69]: array([False, False, nan, nan, False], dtype=object)
converting to float changes the False to 0 (and True to 1):
In [70]: arr.astype(float)
Out[70]: array([ 0., 0., nan, nan, 0.])
np.isnan is a good test for nan, but it only works on floats:
In [71]: np.isnan(arr)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-71-25d2f1dae78d> in <module>
----> 1 np.isnan(arr)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
In [72]: np.isnan(arr.astype(float))
Out[72]: array([False, False, True, True, False])
You could test the object array (or a list) with a helper function and a list comprehension:
In [73]: def foo(x):
...: try:
...: return np.isnan(x)
...: except TypeError:
...: return x
...:
In [74]: [foo(x) for x in arr]
Out[74]: [False, False, True, True, False]
Having reliably identified the nan values, you can then apply the before/after logic. I'm not sure if it's easier with lists or array (your logic isn't entirely clear).

Numpy creating logical array in the presence of NaNs

I have an array x, from which I would like to extract a logical mask. x contains nan values, and the mask operation raises a warning, which is what I am trying to avoid.
Here is my code:
import numpy as np
x = np.array([[0, 1], [2.0, np.nan]])
mask = np.isfinite(x) & (x > 0)
The resulting mask is correct (array([[False, True], [ True, False]], dtype=bool)), but a warning is raised:
__main__:1: RuntimeWarning: invalid value encountered in greater
How can I construct the mask in a way that avoids comparing against NaNs? I am not trying to suppress the warning (which I know how to do).
We could do it in two steps - Create the mask of finite ones and then use the same mask to index into itself and also to select the valid mask of remaining finite elements off x for testing and setting into the remaining elements in that mask. So, we would have an implementation like so -
In [35]: x
Out[35]:
array([[ 0., 1.],
[ 2., nan]])
In [36]: mask = np.isfinite(x)
In [37]: mask[mask] = x[mask]>0
In [38]: mask
Out[38]:
array([[False, True],
[ True, False]], dtype=bool)
Looks like masked arrays works with this case:
In [214]: x = np.array([[0, 1], [2.0, np.nan]])
In [215]: xm = np.ma.masked_invalid(x)
In [216]: xm
Out[216]:
masked_array(data =
[[0.0 1.0]
[2.0 --]],
mask =
[[False False]
[False True]],
fill_value = 1e+20)
In [217]: xm>0
Out[217]:
masked_array(data =
[[False True]
[True --]],
mask =
[[False False]
[False True]],
fill_value = 1e+20)
In [218]: _.data
Out[218]:
array([[False, True],
[ True, False]], dtype=bool)
But other than propagating the masking I don't know how it handles element by element operations like this. The usual fill and compressed steps don't seem relevant.

Creating matrix out of an array of categories in numpy

I have a length-n numpy array, y, of integers in the range [0...k-1]. From this, I would like to create an n-by-k numpy matrix M, where M[i,j] is 1 if y[i]==j, and 0 else.
What is the best way to do this in numpy?
Use broadcasting:
a = np.array([1, 2, 3, 1, 2, 2, 3, 0])
m = a[:, None] == np.arange(max(a)+1)
the result is:
array([[False, True, False, False],
[False, False, True, False],
[False, False, False, True],
[False, True, False, False],
[False, False, True, False],
[False, False, True, False],
[False, False, False, True],
[ True, False, False, False]], dtype=bool)
Or create a zero array and fill, I think it's faster:
m2 = np.zeros((len(a), a.max()+1), np.bool)
m2[np.arange(len(a)), a] = True
print m2
This is maybe a bit out there, but its a pretty extensible solution and at least worth noting. If you've already got scikit-learn, the DictVectorizer class is used to transform categorical features in a dataset to column-wise binary representations just like you described:
import numpy as np
from sklearn.feature_extraction import DictVectorizer
# starting with your numpy array
y = np.array([1, 2, 3, 1, 2, 2, 3, 0])
# transform the array to a list of dicts, with original
# int values now as strings, and a throw-away key ''
y_dict = [{'':str(x)} for x in y.tolist()]
# create the vectorizer and transform the list of dicts
vec = DictVectorizer(sparse=False, dtype=int)
M = vec.fit_transform(y_dict)
print M
[[0 1 0 0]
[0 0 1 0]
[0 0 0 1]
[0 1 0 0]
[0 0 1 0]
[0 0 1 0]
[0 0 0 1]
[1 0 0 0]]
Again, probably overkill but it's kind of cute and I thought I'd throw it out there.