Is there a straight forward way of filling nan values in a numpy array when the left and right non nan values match?
For example, if I have an array that has False, False , NaN, NaN, False, I want the NaN values to also be False. If the left and right values do not match, I want it to keep the NaN
Your first task is to reliably identify the np.nan elements. Because it's a unique float value, testing isn't trivail. np.isnan is the best numpy tool.
To mix boolean and float (np.nan) you have to use object dtype:
In [68]: arr = np.array([False, False, np.nan, np.nan, False],object)
In [69]: arr
Out[69]: array([False, False, nan, nan, False], dtype=object)
converting to float changes the False to 0 (and True to 1):
In [70]: arr.astype(float)
Out[70]: array([ 0., 0., nan, nan, 0.])
np.isnan is a good test for nan, but it only works on floats:
In [71]: np.isnan(arr)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-71-25d2f1dae78d> in <module>
----> 1 np.isnan(arr)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
In [72]: np.isnan(arr.astype(float))
Out[72]: array([False, False, True, True, False])
You could test the object array (or a list) with a helper function and a list comprehension:
In [73]: def foo(x):
...: try:
...: return np.isnan(x)
...: except TypeError:
...: return x
...:
In [74]: [foo(x) for x in arr]
Out[74]: [False, False, True, True, False]
Having reliably identified the nan values, you can then apply the before/after logic. I'm not sure if it's easier with lists or array (your logic isn't entirely clear).
Related
I want to create an array with a NaN everywhere except at certain indices where I place some string.
def to_padded_array(vals, idxs, length, fill_na):
arr = np.array([fill_na] * length)
np.put(arr, idxs, vals)
ser = pd.Series(arr)
return tuple(ser.tolist())
I have examples of vals and idxs that look like this:
idxs = np.array([0,4,5]) # this was made to be a numpy array
vals = pd.Series(['a', 'b', np.nan], name='city') # this actually would come from a pd.agg function
Note that the initial input vals has NaN. If I try to set fill_na=np.nan, I get an error saying
could not convert string to float: 'a'
If I use fill_na=None, I get both None and NaN, which is not good:
>>> to_padded_array(vals, idxs, length=6,fill_na=None)
('a', None, None, None, 'b', nan)
I was thinking of using pandas to circumvent this issue, but I have yet to find an equivalent for numpy.put for pandas. What can I do about this?
You could use Series.reindex here:
Example
def to_padded_array(vals, idxs, length):
# Note that `vals` is a pd.Series object.
ser = pd.Series(vals.values, index=idxs).reindex(np.arange(length))
# if vals is an array, then vals can be used instead of vals.values
return tuple(ser.tolist())
to_padded_array(vals,idxs, 6)
[out]
('a', nan, nan, nan, 'b', nan)
consider the next piece of code:
In [90]: m1 = np.matrix([1,2,3], dtype=np.float32)
In [91]: m2 = np.matrix([1,2,3], dtype=np.float32)
In [92]: m3 = np.matrix([1,2,'nan'], dtype=np.float32)
In [93]: np.isclose(m1, m2, equal_nan=True)
Out[93]: matrix([[ True, True, True]], dtype=bool)
In [94]: np.isclose(m1, m3, equal_nan=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-94-5d2b979bc263> in <module>()
----> 1 np.isclose(m1, m3, equal_nan=True)
/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in isclose(a, b, rtol, atol, equal_nan)
2571 # Ideally, we'd just do x, y = broadcast_arrays(x, y). It's in
2572 # lib.stride_tricks, though, so we can't import it here.
-> 2573 x = x * ones_like(cond)
2574 y = y * ones_like(cond)
2575 # Avoid subtraction with infinite/nan values...
/usr/local/lib/python2.7/dist-packages/numpy/matrixlib/defmatrix.pyc in __mul__(self, other)
341 if isinstance(other, (N.ndarray, list, tuple)) :
342 # This promotes 1-D vectors to row vectors
--> 343 return N.dot(self, asmatrix(other))
344 if isscalar(other) or not hasattr(other, '__rmul__') :
345 return N.dot(self, other)
ValueError: shapes (1,3) and (1,3) not aligned: 3 (dim 1) != 1 (dim 0)
when comparing arrays with nans it's working as expected:
In [95]: np.isclose(np.array(m1), np.array(m3), equal_nan=True)
Out[95]: array([[ True, True, False]], dtype=bool)
why is np.isclose failing? from the documentation it seems that it should work
thanks
The problem comes from np.nan == np.nan, which is False in the float logic.
In [39]: np.nan == np.nan
Out[39]: False
The `equal_nan` parameter is to force two `nan` values to be considered as equal , not to consider any value to be equal to `nan`.
In [37]: np.isclose(m3,m3)
Out[37]: array([ True, True, False], dtype=bool)
In [38]: np.isclose(m3,m3,equal_nan=True)
Out[38]: array([ True, True, True], dtype=bool)
Where 'absent' can mean either nan or np.masked, whichever is easiest to implement this with.
For instance:
>>> from numpy import nan
>>> do_it([1, nan, nan, 2, nan, 3, nan, nan, 4, 3, nan, 2, nan])
array([1, 1, 1, 2, 2, 3, 3, 3, 4, 3, 3, 2, 2])
# each nan is replaced with the first non-nan value before it
>>> do_it([nan, nan, 2, nan])
array([nan, nan, 2, 2])
# don't care too much about the outcome here, but this seems sensible
I can see how you'd do this with a for loop:
def do_it(a):
res = []
last_val = nan
for item in a:
if not np.isnan(item):
last_val = item
res.append(last_val)
return np.asarray(res)
Is there a faster way to vectorize it?
Assuming there are no zeros in your data (in order to use numpy.nan_to_num):
b = numpy.maximum.accumulate(numpy.nan_to_num(a))
>>> array([ 1., 1., 1., 2., 2., 3., 3., 3., 4., 4.])
mask = numpy.isnan(a)
a[mask] = b[mask]
>>> array([ 1., 1., 1., 2., 2., 3., 3., 3., 4., 3.])
EDIT: As pointed out by Eric, an even better solution is to replace nans with -inf:
mask = numpy.isnan(a)
a[mask] = -numpy.inf
b = numpy.maximum.accumulate(a)
a[mask] = b[mask]
cumsumming over an array of flags provides a good way to determine which numbers to write over the NaNs:
def do_it(x):
x = np.asarray(x)
is_valid = ~np.isnan(x)
is_valid[0] = True
valid_elems = x[is_valid]
replacement_indices = is_valid.cumsum() - 1
return valid_elems[replacement_indices]
Working from #Benjamin's deleted solution, everything is great if you work with indices
def do_it(data, valid=None, axis=0):
# normalize the inputs to match the question examples
data = np.asarray(data)
if valid is None:
valid = ~np.isnan(data)
# flat array of the data values
data_flat = data.ravel()
# array of indices such that data_flat[indices] == data
indices = np.arange(data.size).reshape(data.shape)
# thanks to benjamin here
stretched_indices = np.maximum.accumulate(valid*indices, axis=axis)
return data_flat[stretched_indices]
Comparing solution runtime:
>>> import numpy as np
>>> data = np.random.rand(10000)
>>> %timeit do_it_question(data)
10000 loops, best of 3: 17.3 ms per loop
>>> %timeit do_it_mine(data)
10000 loops, best of 3: 179 µs per loop
>>> %timeit do_it_user(data)
10000 loops, best of 3: 182 µs per loop
# with lots of nans
>>> data[data > 0.25] = np.nan
>>> %timeit do_it_question(data)
10000 loops, best of 3: 18.9 ms per loop
>>> %timeit do_it_mine(data)
10000 loops, best of 3: 177 µs per loop
>>> %timeit do_it_user(data)
10000 loops, best of 3: 231 µs per loop
So both this and #user2357112's solution blow the solution in the question out of the water, but this has the slight edge over #user2357112 when there are high numbers of nans
I have an array x, from which I would like to extract a logical mask. x contains nan values, and the mask operation raises a warning, which is what I am trying to avoid.
Here is my code:
import numpy as np
x = np.array([[0, 1], [2.0, np.nan]])
mask = np.isfinite(x) & (x > 0)
The resulting mask is correct (array([[False, True], [ True, False]], dtype=bool)), but a warning is raised:
__main__:1: RuntimeWarning: invalid value encountered in greater
How can I construct the mask in a way that avoids comparing against NaNs? I am not trying to suppress the warning (which I know how to do).
We could do it in two steps - Create the mask of finite ones and then use the same mask to index into itself and also to select the valid mask of remaining finite elements off x for testing and setting into the remaining elements in that mask. So, we would have an implementation like so -
In [35]: x
Out[35]:
array([[ 0., 1.],
[ 2., nan]])
In [36]: mask = np.isfinite(x)
In [37]: mask[mask] = x[mask]>0
In [38]: mask
Out[38]:
array([[False, True],
[ True, False]], dtype=bool)
Looks like masked arrays works with this case:
In [214]: x = np.array([[0, 1], [2.0, np.nan]])
In [215]: xm = np.ma.masked_invalid(x)
In [216]: xm
Out[216]:
masked_array(data =
[[0.0 1.0]
[2.0 --]],
mask =
[[False False]
[False True]],
fill_value = 1e+20)
In [217]: xm>0
Out[217]:
masked_array(data =
[[False True]
[True --]],
mask =
[[False False]
[False True]],
fill_value = 1e+20)
In [218]: _.data
Out[218]:
array([[False, True],
[ True, False]], dtype=bool)
But other than propagating the masking I don't know how it handles element by element operations like this. The usual fill and compressed steps don't seem relevant.
For example, I have a array with value [1,2,4,3,6,7,33,2]. I want to get all the values which are bigger than 6. As I know the numpy.take can only get values with indices.
Which function should I use?
You can index into the array with a boolean array index:
>>> a = np.array([1,2,4,3,6,7,33,2])
>>> a > 6
array([False, False, False, False, False, True, True, False], dtype=bool)
>>> a[a > 6]
array([ 7, 33])
If you want the indices where this occurs, you can use np.where:
>>> np.where(a>6)
(array([5, 6]),)