Is there a cleaner way to initialize a numpy masked array from a non-ma, with all masked values False, than this?
masked_array = np.ma.masked_array(array, mask=np.zeros_like(array, dtype='bool'))
The duplicate reference to array seems unnecessary and clunky. If you do not give the mask= parameter, the mask defaults to a scalar boolean, which prevents sliced access to the mask.
You should be able to just set the mask to False:
>>> array = np.array([1,2,3])
>>> masked_array = np.ma.masked_array(array, mask=False)
>>> masked_array
masked_array(data = [1 2 3],
mask = [False False False],
fill_value = 999999)
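As a minimal sketch of why this helps (the variable names here are illustrative, not from the question): with mask=False the mask is expanded to one boolean per element, so it can be sliced, whereas omitting mask= leaves a zero-dimensional placeholder.

```python
import numpy as np

array = np.array([1, 2, 3])

# Without mask=, the mask is a zero-dimensional "no mask" placeholder.
default = np.ma.masked_array(array)
default_mask_ndim = np.ndim(default.mask)

# With mask=False, the mask is expanded to one boolean per element.
masked = np.ma.masked_array(array, mask=False)
mask_shape = masked.mask.shape

# The full mask can be sliced (copied here so later changes do not affect it).
first_two = masked.mask[:2].copy()

# Individual elements can still be masked afterwards.
masked[1] = np.ma.masked
print(masked.mask)
```

The original array is left untouched by the masking step; only the mask changes.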
I saw hpaulj's comment and played around with different ways of solving this, comparing their performance. I can't explain the difference, but @hpaulj seems to have a much deeper understanding of how numpy works. Any input on why m3() executes so much faster would be most appreciated.
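import timeit
import numpy as np

def origM():
    # original approach: build an explicit all-False mask
    array = np.array([1,2,3])
    return np.ma.masked_array(array, mask=np.zeros_like(array, dtype='bool'))

def m():
    # pass mask=False to the constructor
    array = np.array([1,2,3])
    return np.ma.masked_array(array, mask=False)

def m2():
    # construct first, then assign the mask
    array = np.array([1,2,3])
    m = np.ma.masked_array(array)
    m.mask = False
    return m

def m3():
    # view the array as a masked array, then assign the mask
    array = np.array([1,2,3])
    m = array.view(np.ma.masked_array)
    m.mask = False
    return m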
All four return the same result:
>>> origM()
masked_array(data = [1 2 3],
mask = [False False False],
fill_value = 999999)
>>> m()
masked_array(data = [1 2 3],
mask = [False False False],
fill_value = 999999)
>>> m2()
masked_array(data = [1 2 3],
mask = [False False False],
fill_value = 999999)
>>> m3()
masked_array(data = [1 2 3],
mask = [False False False],
fill_value = 999999)
m3() executes the fastest:
>>> timeit.timeit(origM, number=1000)
0.024451958015561104
>>> timeit.timeit(m, number=1000)
0.0393978749634698
>>> timeit.timeit(m2, number=1000)
0.024049583997111768
>>> timeit.timeit(m3, number=1000)
0.018082750029861927
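The timings above include the cost of np.array([1,2,3]) in every call and come from a single run each. A rough sketch of a slightly more careful comparison follows; the function names and the best_time helper are made up for illustration, and the numbers will differ between machines and NumPy versions, so no expected output is shown.

```python
import timeit
import numpy as np

array = np.array([1, 2, 3])  # created once, outside the timed code

def with_zeros_like():
    return np.ma.masked_array(array, mask=np.zeros_like(array, dtype=bool))

def with_mask_false():
    return np.ma.masked_array(array, mask=False)

def with_mask_assignment():
    m = np.ma.masked_array(array)
    m.mask = False
    return m

def with_view():
    m = array.view(np.ma.masked_array)
    m.mask = False
    return m

def best_time(func, number=1000, repeat=5):
    # Take the minimum over several repeats to reduce noise from other processes.
    return min(timeit.repeat(func, number=number, repeat=repeat))

candidates = (with_zeros_like, with_mask_false, with_mask_assignment, with_view)
timings = {f.__name__: best_time(f) for f in candidates}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s per 1000 calls")
```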
Other than using a chain of or statements
isinstance(x, np.float64) or isinstance(x, np.float32) or isinstance(x, np.float16)
is there a cleaner way to check if a variable is a floating type?
You can use np.floating:
In [11]: isinstance(np.float16(1), np.floating)
Out[11]: True
In [12]: isinstance(np.float32(1), np.floating)
Out[12]: True
In [13]: isinstance(np.float64(1), np.floating)
Out[13]: True
Note: non-numpy types return False:
In [14]: isinstance(1, np.floating)
Out[14]: False
In [15]: isinstance(1.0, np.floating)
Out[15]: False
To include more types, e.g. Python floats, you can pass a tuple to isinstance:
In [16]: isinstance(1.0, (np.floating, float))
Out[16]: True
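The tuple check can be wrapped in a small helper. This is only a sketch; is_float_scalar is a hypothetical name, not a NumPy function.

```python
import numpy as np

def is_float_scalar(x):
    """Return True for Python floats and NumPy floating-point scalars."""
    return isinstance(x, (float, np.floating))

for value in (np.float16(1), np.float32(1), np.float64(1), 1.0, 1, np.int64(1), "1.0"):
    print(repr(value), is_float_scalar(value))
```

Integers and strings are rejected, while every NumPy float width and the builtin float are accepted.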
To check the numbers in a numpy array, use the dtype's 'character code', which identifies the general kind of data:
x = np.array([3.6, 0.3])
if x.dtype.kind == 'f':
    print('x is floating point')
The other kind codes are listed under dtype.kind in the NumPy manual.
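For illustration, here is a short sketch of the kind codes for a few common dtypes, together with np.issubdtype as an alternative test. The is_float_array helper is a hypothetical name introduced only for this example.

```python
import numpy as np

arrays = {
    "floats": np.array([3.6, 0.3]),
    "ints": np.array([1, 2]),
    "bools": np.array([True, False]),
    "complex": np.array([1 + 2j]),
}

# 'f' = floating point, 'i' = signed integer, 'b' = boolean, 'c' = complex
kinds = {name: arr.dtype.kind for name, arr in arrays.items()}
print(kinds)

def is_float_array(arr):
    # True for float16, float32, float64, and any other floating dtype
    return bool(np.issubdtype(arr.dtype, np.floating))

print(is_float_array(arrays["floats"]), is_float_array(arrays["ints"]))
```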
Edit: be careful when using isinstance and the is operator to determine the type of numbers. The array's dtype object, the scalar type, and a scalar value are three different things:
import numpy as np
a = np.array([1.2, 1.3], dtype=np.float32)
print(isinstance(a.dtype, np.float32)) # False
print(isinstance(type(a[0]), np.float32)) # False
print(a.dtype is np.float32) # False
print(type(a[0]) is np.dtype(np.float32)) # False
print(isinstance(a[0], np.float32)) # True
print(type(a[0]) is np.float32) # True
print(a.dtype == np.float32) # True
The best way to check whether a NumPy array has a floating-point dtype, whether 16, 32 or 64 bit, is as follows:
import numpy as np
a = np.random.rand(3).astype(np.float32)
print(issubclass(a.dtype.type,np.floating))
a = np.random.rand(3).astype(np.float64)
print(issubclass(a.dtype.type,np.floating))
a = np.random.rand(3).astype(np.float16)
print(issubclass(a.dtype.type,np.floating))
In this case all will be True.
The common solution, however, can give wrong output when isinstance is applied to the array itself, as shown below:
import numpy as np
a = np.random.rand(3).astype(np.float32)
print(isinstance(a,np.floating))
a = np.random.rand(3).astype(np.float64)
print(isinstance(a,np.floating))
a = np.random.rand(3).astype(np.float16)
print(isinstance(a,np.floating))
In this case all will be False, because a is an ndarray rather than a floating-point scalar.
The workaround is to test an element instead:
import numpy as np
a = np.random.rand(3).astype(np.float32)
print(isinstance(a[0],np.floating))
a = np.random.rand(3).astype(np.float64)
print(isinstance(a[0],np.floating))
a = np.random.rand(3).astype(np.float16)
print(isinstance(a[0],np.floating))
Now all will be True
I am fairly new to pandas.
To find all rows with a certain value, I can run
data[data['category'] == 'name']
which returns the matching rows as expected.
One of my columns holds a 1x2 numpy array in each row. However, if I do
data[data['list'] == np.array([0, 0])]
I get ValueError: Lengths must match to compare
How would I find the row with a certain numpy array in it?
You can use apply with a lambda function, like df[df.list.apply(lambda x: (x == c).all())]
Ex.:
>>> df
list
0 [0, 0]
1 [1, 1]
2 [0, 0]
3 [1, 0]
>>> c
array([0, 0])
>>> df.list.apply(lambda x: x == c)
0 [True, True]
1 [False, False]
2 [True, True]
3 [False, True]
Name: list, dtype: object
>>> df.list.apply(lambda x: (x == c).all())
0 True
1 False
2 True
3 False
Name: list, dtype: bool
>>> df[df.list.apply(lambda x: (x == c).all())]
list
0 [0, 0]
2 [0, 0]
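A self-contained sketch of the same idea follows. The DataFrame is rebuilt here from assumed data matching the printout above, and np.array_equal is swapped in for (x == c).all() so that rows holding arrays of a different shape compare as unequal.

```python
import numpy as np
import pandas as pd

# Assumed data: one column whose cells each hold a small numpy array.
df = pd.DataFrame({
    "list": [np.array([0, 0]), np.array([1, 1]), np.array([0, 0]), np.array([1, 0])]
})
c = np.array([0, 0])

# One boolean per row: does the stored array equal c?
matches = df["list"].apply(lambda x: np.array_equal(x, c))

result = df[matches]
print(result)
```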
Consider the following piece of code:
In [90]: m1 = np.matrix([1,2,3], dtype=np.float32)
In [91]: m2 = np.matrix([1,2,3], dtype=np.float32)
In [92]: m3 = np.matrix([1,2,'nan'], dtype=np.float32)
In [93]: np.isclose(m1, m2, equal_nan=True)
Out[93]: matrix([[ True, True, True]], dtype=bool)
In [94]: np.isclose(m1, m3, equal_nan=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-94-5d2b979bc263> in <module>()
----> 1 np.isclose(m1, m3, equal_nan=True)
/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in isclose(a, b, rtol, atol, equal_nan)
2571 # Ideally, we'd just do x, y = broadcast_arrays(x, y). It's in
2572 # lib.stride_tricks, though, so we can't import it here.
-> 2573 x = x * ones_like(cond)
2574 y = y * ones_like(cond)
2575 # Avoid subtraction with infinite/nan values...
/usr/local/lib/python2.7/dist-packages/numpy/matrixlib/defmatrix.pyc in __mul__(self, other)
341 if isinstance(other, (N.ndarray, list, tuple)) :
342 # This promotes 1-D vectors to row vectors
--> 343 return N.dot(self, asmatrix(other))
344 if isscalar(other) or not hasattr(other, '__rmul__') :
345 return N.dot(self, other)
ValueError: shapes (1,3) and (1,3) not aligned: 3 (dim 1) != 1 (dim 0)
When comparing arrays with NaNs, it works as expected:
In [95]: np.isclose(np.array(m1), np.array(m3), equal_nan=True)
Out[95]: array([[ True, True, False]], dtype=bool)
Why is np.isclose failing? From the documentation it seems that it should work.
Thanks
The problem comes from np.nan == np.nan, which is False in the float logic.
In [39]: np.nan == np.nan
Out[39]: False
The `equal_nan` parameter forces two `nan` values to be considered equal; it does not make every value equal to `nan`.
In [37]: np.isclose(m3,m3)
Out[37]: array([ True, True, False], dtype=bool)
In [38]: np.isclose(m3,m3,equal_nan=True)
Out[38]: array([ True, True, True], dtype=bool)
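The traceback in the question suggests that, on that NumPy version, the ValueError itself comes from x * ones_like(cond) being dispatched to matrix multiplication for np.matrix inputs. A hedged sketch of the workaround the question already hints at is to compare plain arrays; the names a, b and c below are illustrative.

```python
import numpy as np

# Plain 2-D arrays instead of np.matrix, so * stays element-wise.
a = np.array([[1, 2, 3]], dtype=np.float32)
b = np.array([[1, 2, np.nan]], dtype=np.float32)
c = np.array([[1, 2, np.nan]], dtype=np.float32)

# A NaN is never close to an ordinary number, even with equal_nan=True.
a_vs_b = np.isclose(a, b, equal_nan=True)

# Two NaNs in the same position count as equal only when equal_nan=True.
b_vs_c_strict = np.isclose(b, c)
b_vs_c_nan_ok = np.isclose(b, c, equal_nan=True)

print(a_vs_b, b_vs_c_strict, b_vs_c_nan_ok)
```

An existing matrix can be converted with np.asarray before the comparison.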
I tried to find entries in an Array containing a substring with np.where and an in condition:
import numpy as np
foo = "aa"
bar = np.array(["aaa", "aab", "aca"])
np.where(foo in bar)
This only returns an empty array.
Why is that so?
And is there a good alternative solution?
We can use np.core.defchararray.find to find the position of foo string in each element of bar, which would return -1 if not found. Thus, it could be used to detect whether foo is present in each element or not by checking for -1 on the output from find. Finally, we would use np.flatnonzero to get the indices of matches. So, we would have an implementation, like so -
np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Sample run -
In [91]: bar
Out[91]:
array(['aaa', 'aab', 'aca'],
dtype='|S3')
In [92]: foo
Out[92]: 'aa'
In [93]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[93]: array([0, 1])
In [94]: bar[2] = 'jaa'
In [95]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[95]: array([0, 1, 2])
Look at some examples of using in:
In [19]: bar = np.array(["aaa", "aab", "aca"])
In [20]: 'aa' in bar
Out[20]: False
In [21]: 'aaa' in bar
Out[21]: True
In [22]: 'aab' in bar
Out[22]: True
In [23]: 'aab' in list(bar)
It looks like in, when used with an array, works as though the array were a list. ndarray does have a __contains__ method, so in works, but it is probably simple.
But in any case, note that in alist does not check for substrings. The string's __contains__ does the substring test, but I don't know of any builtin class that propagates the test down to the component strings.
As Divakar shows there is a collection of numpy functions that applies string methods to individual elements of an array.
In [42]: np.char.find(bar, 'aa')
Out[42]: array([ 0, 0, -1])
Docstring:
This module contains a set of functions for vectorized string
operations and methods.
The preferred alias for defchararray is numpy.char.
For operations like this I think the np.char speeds are about the same as with:
In [49]: np.frompyfunc(lambda x: x.find('aa'), 1, 1)(bar)
Out[49]: array([0, 0, -1], dtype=object)
In [50]: np.frompyfunc(lambda x: 'aa' in x, 1, 1)(bar)
Out[50]: array([True, True, False], dtype=object)
Further tests suggest that the ndarray __contains__ operates on the flat version of the array - that is, shape doesn't affect its behavior.
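Putting the pieces together, a minimal sketch using the np.char alias might look like this (the variable names found and indices are illustrative):

```python
import numpy as np

foo = "aa"
bar = np.array(["aaa", "aab", "aca"])

# Position of foo inside each element, or -1 where it does not occur.
positions = np.char.find(bar, foo)

# Boolean mask of elements that contain foo, and the matching indices.
found = positions != -1
indices = np.flatnonzero(found)

print(indices)     # indices of the matching elements
print(bar[found])  # the matching elements themselves
```

The boolean mask can be used directly for indexing, so np.where is only needed when the integer positions are wanted.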
If using pandas is acceptable, the str.contains method can be used.
import numpy as np
entries = np.array(["aaa", "aab", "aca"])
import pandas as pd
pd.Series(entries).str.contains('aa') # <----
Results in:
0 True
1 True
2 False
dtype: bool
The method also accepts regular expressions for more complex patterns:
pd.Series(entries).str.contains(r'a.a')
Results in:
0 True
1 False
2 True
dtype: bool
The way you are trying to use np.where is incorrect. The first argument of np.where should be a boolean array, and you are simply passing it a boolean.
foo in bar
>>> False
np.where(False)
>>> (array([], dtype=int32),)
np.where(np.array([True, True, False]))
>>> (array([0, 1], dtype=int32),)
The problem is that numpy does not define the in operator as an element-wise boolean operation.
One way you could accomplish what you want is with a list comprehension.
foo = 'aa'
bar = np.array(['aaa', 'aab', 'aca'])
out = [i for i, v in enumerate(bar) if foo in v]
# out = [0, 1]
bar = ['aca', 'bba', 'baa', 'aaf', 'ccc']
out = [i for i, v in enumerate(bar) if foo in v]
# out = [2, 3]
You can also build a boolean mask and index with it directly:
mask = np.array([foo in x for x in bar])  # one boolean per element
filtered = np.asarray(bar)[mask]  # elements that contain foo
I have an array x, from which I would like to extract a logical mask. x contains nan values, and the mask operation raises a warning, which is what I am trying to avoid.
Here is my code:
import numpy as np
x = np.array([[0, 1], [2.0, np.nan]])
mask = np.isfinite(x) & (x > 0)
The resulting mask is correct (array([[False, True], [ True, False]], dtype=bool)), but a warning is raised:
__main__:1: RuntimeWarning: invalid value encountered in greater
How can I construct the mask in a way that avoids comparing against NaNs? I am not trying to suppress the warning (which I know how to do).
We could do it in two steps. First, create the mask of finite elements. Then use that mask to index into itself and into x, so that the comparison runs only on the finite elements of x and its result is written back into the corresponding positions of the mask. So, we would have an implementation like so -
In [35]: x
Out[35]:
array([[ 0., 1.],
[ 2., nan]])
In [36]: mask = np.isfinite(x)
In [37]: mask[mask] = x[mask]>0
In [38]: mask
Out[38]:
array([[False, True],
[ True, False]], dtype=bool)
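The two steps can be wrapped in a function, sketched below. The second variant is a different technique, not the one from this answer: it uses the where= and out= arguments of the np.greater ufunc so the comparison is skipped for non-finite elements. Both function names are hypothetical.

```python
import numpy as np

def positive_finite_mask(x):
    # Step 1: start from the mask of finite elements.
    mask = np.isfinite(x)
    # Step 2: compare only the finite elements and write the result back.
    mask[mask] = x[mask] > 0
    return mask

def positive_finite_mask_where(x):
    # Alternative: evaluate x > 0 only where x is finite;
    # positions that are skipped keep the initial False from `out`.
    out = np.zeros(x.shape, dtype=bool)
    np.greater(x, 0, out=out, where=np.isfinite(x))
    return out

x = np.array([[0, 1], [2.0, np.nan]])
print(positive_finite_mask(x))
print(positive_finite_mask_where(x))
```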
It looks like masked arrays work in this case:
In [214]: x = np.array([[0, 1], [2.0, np.nan]])
In [215]: xm = np.ma.masked_invalid(x)
In [216]: xm
Out[216]:
masked_array(data =
[[0.0 1.0]
[2.0 --]],
mask =
[[False False]
[False True]],
fill_value = 1e+20)
In [217]: xm>0
Out[217]:
masked_array(data =
[[False True]
[True --]],
mask =
[[False False]
[False True]],
fill_value = 1e+20)
In [218]: _.data
Out[218]:
array([[False, True],
[ True, False]], dtype=bool)
But other than propagating the masking I don't know how it handles element by element operations like this. The usual fill and compressed steps don't seem relevant.
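One possible way to get a plain boolean array out of the masked comparison is filled(False), which replaces the masked positions with False. This is only a sketch; it makes no claim about whether the underlying comparison still touches the NaN internally.

```python
import numpy as np

x = np.array([[0, 1], [2.0, np.nan]])

# Mask NaN and inf entries, compare, then fill the masked results with False.
xm = np.ma.masked_invalid(x)
mask = (xm > 0).filled(False)

print(mask)
```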