Get specific elements from ndarray of ndarrays with shape (n,) - numpy

Given the ndarray:
A = np.array([np.array([1], dtype='f'),
              np.array([2, 3], dtype='f'),
              np.array([4, 5], dtype='f'),
              np.array([6], dtype='f'),
              np.array([7, 8, 9], dtype='f')], dtype=object)
which displays as:
A
array([array([ 1.], dtype=float32), array([ 2., 3.], dtype=float32),
array([ 4., 5.], dtype=float32), array([ 6.], dtype=float32),
array([ 7., 8., 9.], dtype=float32)], dtype=object)
I am trying to create a new array from the first elements of each "sub-array" of A. To show you what I mean, below is some code creating the array that I want using a loop. I would like to achieve the same thing but as efficiently as possible, since my array A is quite large (~50000 entries) and I need to do the operation many times.
B = np.zeros(len(A))
for i, val in enumerate(A):
    B[i] = val[0]
B
array([ 1., 2., 4., 6., 7.])

Here's an approach that concatenates all elements into a 1D array and then selects the first element of each sub-array by linear indexing. The implementation would look like this:
lens = np.array([len(item) for item in A])
out = np.concatenate(A)[np.append(0,lens[:-1].cumsum())]
The bottleneck would be the concatenation, but that cost might be offset if there is a huge number of sub-arrays with small lengths. So the efficiency depends on the format of the input array.
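As a minimal sketch of the concatenation approach, run on the small example from the question: the names `starts`, `flat` and `first` are illustrative, and `dtype=object` is passed explicitly on the assumption that a recent NumPy is in use, where ragged input is otherwise rejected.

```python
import numpy as np

# Small stand-in for the jagged array from the question.
A = np.array([np.array([1], dtype='f'),
              np.array([2, 3], dtype='f'),
              np.array([4, 5], dtype='f'),
              np.array([6], dtype='f'),
              np.array([7, 8, 9], dtype='f')], dtype=object)

# Length of every sub-array.
lens = np.array([len(item) for item in A])

# Offset of each sub-array's first element inside the flattened data:
# 0 for the first one, then the running total of the preceding lengths.
starts = np.concatenate(([0], lens[:-1].cumsum()))

# Flatten everything once, then pick the first element of each sub-array.
flat = np.concatenate(list(A))
first = flat[starts]

print(starts)  # [0 1 3 5 6]
print(first)   # [1. 2. 4. 6. 7.]
```

If the operation is repeated many times on the same `A`, `starts` and `flat` only need to be computed once.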

I suggest transforming your original jagged array of arrays into a single masked array:
B = np.ma.masked_all((len(A), max(map(len, A))))
for ii, row in enumerate(A):
    B[ii, :len(row)] = row
Now you have:
[[1.0 -- --]
[2.0 3.0 --]
[4.0 5.0 --]
[6.0 -- --]
[7.0 8.0 9.0]]
And you can get the first column this way:
B[:,0].data
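A self-contained sketch of the masked-array idea, using plain Python lists as a stand-in for the jagged input (the names `rows`, `first_col` and `row_counts` are illustrative only):

```python
import numpy as np

# Jagged input, written as nested lists for brevity.
rows = [[1], [2, 3], [4, 5], [6], [7, 8, 9]]

# Start from a fully masked rectangular array wide enough for the longest row.
width = max(map(len, rows))
B = np.ma.masked_all((len(rows), width))

# Assigning into a slice fills the values and unmasks those positions.
for i, row in enumerate(rows):
    B[i, :len(row)] = row

# First element of every row, as a plain ndarray.
first_col = B[:, 0].data

# Number of unmasked (real) entries per row.
row_counts = B.count(axis=1)

print(first_col)   # [1. 2. 4. 6. 7.]
print(row_counts)  # [1 2 2 1 3]
```

The padding loop is paid once; afterwards any column can be read with a single slice.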

Related

Astropy quantity in-place conversion

Is there a way to convert an astropy quantity to another set of units "in-place"? The to method always returns a copy so that's not so useful. Something like:
import astropy.units as u
data = [1, 2, 3]*u.g
data.convert_to('kg')
Both Pint and yt.units have in-place conversions:
from pint import UnitRegistry
u = UnitRegistry()
data = [1, 2, 3]*u.g
data.ito('kg')
and
from yt.units import g
data = [1, 2, 3]*g
data.convert_to_units('kg')
A cursory glance at the astropy docs and source code indicates that the answer is "no" but perhaps I'm missing something.
There are a few ways you can do it at the moment. Given your example:
>>> import astropy.units as u
>>> data = [1, 2, 3] * u.g
>>> data
<Quantity [1., 2., 3.] g>
You can do this:
>>> data.value * u.kg
<Quantity [1., 2., 3.] kg>
Or this:
>>> data * u.kg / data.unit
<Quantity [1., 2., 3.] kg>
Or this:
>>> data._unit = u.kg
>>> data
<Quantity [1., 2., 3.] kg>
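None of these approaches copies the NumPy array, so they are fine performance-wise for many applications. Note that, as the outputs show, all three only change the unit attached to the values; the numbers themselves are not rescaled.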
I don't think there is a method available so that setting data._unit becomes possible without reaching for the private data member. This was discussed a bit (in the context of Column and Quantity objects) here and I think the conclusion was that a set_unit method would be a useful addition, but it hasn't been implemented yet. So you could open an issue with that feature request here.

Find values in Array 1 (more than one column) that are greater than values in Array2 (Single column) along columns

Suppose I have a numpy ndarray of shape (2,4) as follows
>>> array1 = numpy.random.rand(2,4)
array([[ 0.87791012, 0.84566058, 0.73877908, 0.40377929],
[ 0.9669688 , 0.15913901, 0.70374509, 0.95776427]])
I have second array of shape (2,) as follows
>>> array2 = numpy.random.rand(2)
array([ 0.57126204, 0.67938752])
I would like to compare both the arrays along the column dimension to find the elements in array1 that are greater than array2 (elementwise). The desired result is
array([[ 1., 1., 1., 0.],
[ 1., 0., 1., 1.]])
If both have the same dimensions, I can directly use (array1 > array2).astype(int). In case of array1 being a multidimensional array with more than one column, I am using the following method involving a loop
results = np.zeros_like(array1)
for each in range(array1.shape[1]):
    results[:, each] = array1[:, each] > array2
Is there a more pythonic/numpy way of doing it?
Reshape array2 to a 2D array with shape (2, 1); the comparison then works thanks to NumPy broadcasting:
(array1 > array2[:,None]).astype(int)
#array([[1, 1, 1, 0],
# [1, 0, 1, 1]])
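A short sketch of why the reshape works, using the values printed in the question as fixed inputs: `array2[:, None]` has shape (2, 1), so it is stretched across the four columns of `array1` during the comparison. The variable names `broadcast` and `looped` are illustrative.

```python
import numpy as np

array1 = np.array([[0.87791012, 0.84566058, 0.73877908, 0.40377929],
                   [0.9669688,  0.15913901, 0.70374509, 0.95776427]])
array2 = np.array([0.57126204, 0.67938752])

# (2, 4) compared with (2, 1): each row of array1 is compared
# against the single matching entry of array2.
broadcast = (array1 > array2[:, None]).astype(int)

# The column-by-column loop from the question, for comparison.
looped = np.zeros_like(array1)
for col in range(array1.shape[1]):
    looped[:, col] = array1[:, col] > array2

print(array2[:, None].shape)  # (2, 1)
print(broadcast)
# [[1 1 1 0]
#  [1 0 1 1]]
```

`array2.reshape(-1, 1)` or `array2[:, np.newaxis]` are equivalent spellings of the same reshape.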

How can I efficiently "stretch" present values in an array over absent ones

Where 'absent' can mean either nan or np.masked, whichever is easiest to implement this with.
For instance:
>>> from numpy import nan
>>> do_it([1, nan, nan, 2, nan, 3, nan, nan, 4, 3, nan, 2, nan])
array([1, 1, 1, 2, 2, 3, 3, 3, 4, 3, 3, 2, 2])
# each nan is replaced with the first non-nan value before it
>>> do_it([nan, nan, 2, nan])
array([nan, nan, 2, 2])
# don't care too much about the outcome here, but this seems sensible
I can see how you'd do this with a for loop:
def do_it(a):
    res = []
    last_val = nan
    for item in a:
        if not np.isnan(item):
            last_val = item
        res.append(last_val)
    return np.asarray(res)
Is there a faster way to vectorize it?
Assuming the data contain no negative values (numpy.nan_to_num replaces NaN with 0) and that each NaN follows the largest value seen so far, since the running maximum is what gets written into the gaps:
b = numpy.maximum.accumulate(numpy.nan_to_num(a))
# b: array([ 1., 1., 1., 2., 2., 3., 3., 3., 4., 4.])
mask = numpy.isnan(a)
a[mask] = b[mask]
# a: array([ 1., 1., 1., 2., 2., 3., 3., 3., 4., 3.])
EDIT: As pointed out by Eric, an even better solution is to replace nans with -inf:
mask = numpy.isnan(a)
a[mask] = -numpy.inf
b = numpy.maximum.accumulate(a)
a[mask] = b[mask]
Taking the cumulative sum of an array of validity flags is a good way to determine which numbers to write over the NaNs:
def do_it(x):
    x = np.asarray(x)
    is_valid = ~np.isnan(x)
    is_valid[0] = True
    valid_elems = x[is_valid]
    replacement_indices = is_valid.cumsum() - 1
    return valid_elems[replacement_indices]
Working from @Benjamin's deleted solution, everything works well if you operate on indices instead of values:
def do_it(data, valid=None, axis=0):
    # normalize the inputs to match the question examples
    data = np.asarray(data)
    if valid is None:
        valid = ~np.isnan(data)
    # flat array of the data values
    data_flat = data.ravel()
    # array of indices such that data_flat[indices] == data
    indices = np.arange(data.size).reshape(data.shape)
    # thanks to benjamin here
    stretched_indices = np.maximum.accumulate(valid*indices, axis=axis)
    return data_flat[stretched_indices]
Comparing solution runtime:
>>> import numpy as np
>>> data = np.random.rand(10000)
>>> %timeit do_it_question(data)
10000 loops, best of 3: 17.3 ms per loop
>>> %timeit do_it_mine(data)
10000 loops, best of 3: 179 µs per loop
>>> %timeit do_it_user(data)
10000 loops, best of 3: 182 µs per loop
# with lots of nans
>>> data[data > 0.25] = np.nan
>>> %timeit do_it_question(data)
10000 loops, best of 3: 18.9 ms per loop
>>> %timeit do_it_mine(data)
10000 loops, best of 3: 177 µs per loop
>>> %timeit do_it_user(data)
10000 loops, best of 3: 231 µs per loop
So both this and @user2357112's solution blow the loop from the question out of the water, but this one has a slight edge over @user2357112's when there are many NaNs.
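As a rough correctness check to go with the timings, the sketch below restates the loop, the cumulative-sum approach and the index-based approach for 1D input and compares them. The function names (`ffill_loop`, `ffill_cumsum`, `ffill_indices`, `ffill_running_max`) are made up for this example. It also shows that a running maximum over the values themselves gives a different answer once the data decrease.

```python
import numpy as np

def ffill_loop(a):
    """Reference: carry the last non-NaN value forward with a Python loop."""
    res, last_val = [], np.nan
    for item in a:
        if not np.isnan(item):
            last_val = item
        res.append(last_val)
    return np.asarray(res)

def ffill_cumsum(x):
    """Cumulative sum of validity flags selects the replacement values."""
    x = np.asarray(x, dtype=float)
    is_valid = ~np.isnan(x)
    is_valid[0] = True  # a leading NaN stands in for itself
    return x[is_valid][is_valid.cumsum() - 1]

def ffill_indices(data):
    """Running maximum over indices of valid entries (1D case only)."""
    data = np.asarray(data, dtype=float)
    valid = ~np.isnan(data)
    indices = np.arange(data.size)
    return data[np.maximum.accumulate(valid * indices)]

def ffill_running_max(a):
    """Running maximum over the values; only right for non-decreasing data."""
    a = np.array(a, dtype=float)
    mask = np.isnan(a)
    a[mask] = -np.inf
    b = np.maximum.accumulate(a)
    a[mask] = b[mask]
    return a

nan = np.nan
example = [1, nan, nan, 2, nan, 3, nan, nan, 4, 3, nan, 2, nan]
expected = ffill_loop(example)

print(ffill_cumsum(example))       # [1. 1. 1. 2. 2. 3. 3. 3. 4. 3. 3. 2. 2.]
print(ffill_indices(example))      # [1. 1. 1. 2. 2. 3. 3. 3. 4. 3. 3. 2. 2.]
print(ffill_running_max(example))  # [1. 1. 1. 2. 2. 3. 3. 3. 4. 3. 4. 2. 4.]
```

The first two agree with the loop; the running maximum over values writes 4 into the gaps that follow the 3 and the 2.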

Setting a value in masked location with NaNs present in numpy

I have an array with NaNs, say
>>> a = np.random.randn(3, 3)
>>> a[1, 1] = a[2, 2] = np.nan
>>> a
array([[-1.68425874, 0.65435007, 0.55068277],
[ 0.71726307, nan, -0.09614409],
[-1.45679335, -0.12772348, nan]])
I would like to set negative numbers in this array to -1. Doing this the "straightforward" way results in a warning, which I am trying to avoid:
>>> a[a < 0] = -1
__main__:1: RuntimeWarning: invalid value encountered in less
>>> a
array([[-1. , 0.65435007, 0.55068277],
[ 0.71726307, nan, -1. ],
[-1. , -1. , nan]])
Applying AND to the masks results in the same warning because of course a < 0 is computed as a separate temp array:
>>> n = ~np.isnan(a)
>>> a[n & (a < 0)] = -1
__main__:1: RuntimeWarning: invalid value encountered in less
When I try to mask the NaNs out of a first, the modified values are not written back to the original array, because a[n] is a copy:
>>> n = ~np.isnan(a)
>>> a[n][a[n] < 0] = -1
>>> a
array([[-1.68425874, 0.65435007, 0.55068277],
[ 0.71726307, nan, -0.09614409],
[-1.45679335, -0.12772348, nan]])
The only way I could find to solve this is by using a gratuitous intermediate masked version of a:
>>> n = ~np.isnan(a)
>>> b = a[n]
>>> b[b < 0] = -1
>>> a[n] = b
>>> a
array([[-1. , 0.65435007, 0.55068277],
[ 0.71726307, nan, -1. ],
[-1. , -1. , nan]])
Is there a simpler way to perform this masked assignment with the presence of NaNs? I would like to solve this without the use of masked arrays if possible.
NOTE
The snippets above are best run with
import numpy as np
import warnings
np.seterr(all='warn')
warnings.simplefilter("always")
as per https://stackoverflow.com/a/30496556/2988730.
If you want to avoid the warning raised by a < 0 when a contains NaNs, one alternative is to find the positions of the non-NaN entries first, as flattened or row-column indices, and run the comparison only on those. That gives two approaches.
One with flattened indices:
idx = np.flatnonzero(~np.isnan(a))
a.ravel()[idx[a.ravel()[idx] < 0]] = -1
Another with row-column (subscript) indices:
r,c = np.nonzero(~np.isnan(a))
mask = a[r,c] < 0
a[r[mask],c[mask]] = -1
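A self-contained sketch of both index-based variants, run with warnings turned into errors so that any stray RuntimeWarning would surface. The input values are made up for the example, and it assumes `a` is C-contiguous so that `a.ravel()` returns a view that writes through to `a`.

```python
import warnings
import numpy as np

def make_array():
    # Example data with two NaNs (values chosen for illustration).
    return np.array([[-1.5,  0.6,     0.5],
                     [ 0.7,  np.nan, -0.1],
                     [-1.4, -0.1,     np.nan]])

with warnings.catch_warnings():
    warnings.simplefilter("error")  # any warning becomes an exception

    # Variant 1: flattened indices of the non-NaN entries.
    a = make_array()
    flat = a.ravel()                      # view of a (a is contiguous)
    idx = np.flatnonzero(~np.isnan(a))
    flat[idx[flat[idx] < 0]] = -1         # compare only non-NaN values

    # Variant 2: row and column indices of the non-NaN entries.
    b = make_array()
    r, c = np.nonzero(~np.isnan(b))
    neg = b[r, c] < 0
    b[r[neg], c[neg]] = -1

print(a)
# [[-1.   0.6  0.5]
#  [ 0.7  nan -1. ]
#  [-1.  -1.   nan]]
```

In both variants the comparison only ever sees finite values, so the NaNs are never passed to `<`.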
You can suppress the warning temporarily. Is this what you're after?
In [9]: a = np.random.randn(3, 3)
In [10]: a[1, 1] = a[2, 2] = np.nan
In [11]: with np.errstate(invalid='ignore'):
   ....:     a[a < 0] = -1
   ....:
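A small sketch showing that `np.errstate` only changes the floating-point error handling inside the `with` block and restores the previous setting afterwards. The array values and the names `before`, `inside` and `after` are illustrative.

```python
import numpy as np

a = np.array([[-1.5,  0.6,     0.5],
              [ 0.7,  np.nan, -0.1],
              [-1.4, -0.1,     np.nan]])

before = np.geterr()["invalid"]   # handling mode in effect outside the block

with np.errstate(invalid="ignore"):
    inside = np.geterr()["invalid"]   # "ignore" while the block is active
    a[a < 0] = -1                     # NaN < 0 is False, so NaNs are kept

after = np.geterr()["invalid"]    # restored on exit from the block

print(before, inside, after)
print(a)
```

Because the setting is scoped to the block, invalid-value warnings elsewhere in the program are unaffected.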
Poking around the np.nan... functions, I found np.nan_to_num:
In [569]: a=np.arange(9.).reshape(3,3)-5
In [570]: a[[1,2],[1,2]]=np.nan
In [571]: a
Out[571]:
array([[ -5., -4., -3.],
[ -2., nan, 0.],
[ 1., 2., nan]])
In [572]: np.nan_to_num(a) # replace nan with 0
Out[572]:
array([[-5., -4., -3.],
[-2., 0., 0.],
[ 1., 2., 0.]])
In [573]: np.nan_to_num(a)<0 # and safely do the <
Out[573]:
array([[ True, True, True],
[ True, False, False],
[False, False, False]], dtype=bool)
In [574]: a[np.nan_to_num(a)<0]=-1
In [575]: a
Out[575]:
array([[ -1., -1., -1.],
[ -1., nan, 0.],
[ 1., 2., nan]])
Looking at the nan_to_num code, it looks like it uses a masked copyto:
In [577]: a1=a.copy(); np.copyto(a1, 0.0, where=np.isnan(a1))
In [578]: a1
Out[578]:
array([[-1., -1., -1.],
[-1., 0., 0.],
[ 1., 2., 0.]])
So it's like your version with the 'gratuitous' mask, but it's hidden in the function.
np.place and np.putmask are other functions that use a mask.

Disable numpy fancy indexing and assignment?

This post identifies a "feature" that I would like to disable.
Current numpy behavior:
>>> a = arange(10)
>>> a[a>5] = arange(10)
array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3])
The reason it's a problem: say I wanted an array to have two different sets of values on either side of a breakpoint (e.g., for making a "broken power-law" or some other simple piecewise function). I might accidentally do something like this:
>>> x = empty(10)
>>> a = arange(10)
>>> x[a<=5] = 0 # this is fine
>>> x[a>5] = a**2 # this is not
# but what I really meant is this
>>> x[a>5] = a[a>5]**2
The first behavior, x[a>5] = a**2 yields something I would consider counterintuitive - the left side and right side shapes disagree and the right side is not scalar, but numpy lets me do this assignment. As pointed out on the other post, x[5:]=a**2 is not allowed.
So, my question: is there any way to make x[a>5] = a**2 raise an Exception instead of performing the assignment? I'm very worried that I have typos hiding in my code because I never before suspected this behavior.
I don't know of a way offhand to disable a core numpy feature. Instead of disabling the behavior you could try using np.select:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.select.html
In [110]: x = np.empty(10)
In [111]: a = np.arange(10)
In [112]: x[a<=5] = 0
In [113]: x[a>5] = a**2
In [114]: x
Out[114]: array([ 0., 0., 0., 0., 0., 0., 0., 1., 4., 9.])
In [117]: condlist = [a<=5,a>5]
In [119]: choicelist=[0,a**2]
In [120]: x = np.select(condlist,choicelist)
In [121]: x
Out[121]: array([ 0, 0, 0, 0, 0, 0, 36, 49, 64, 81])
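As a compact, self-contained version of the np.select suggestion: every choice is evaluated over the full array and the conditions only pick which one is used at each position, so the shapes line up by construction. `np.where` is shown as an equivalent for the two-branch case; the names `x_select` and `x_where` are illustrative.

```python
import numpy as np

a = np.arange(10)

# One condition per piece of the piecewise function, one choice per condition.
# Scalars in the choice list are broadcast against the array choices.
x_select = np.select([a <= 5, a > 5], [0, a**2])

# With only two branches, np.where expresses the same thing.
x_where = np.where(a > 5, a**2, 0)

print(x_select)  # [ 0  0  0  0  0  0 36 49 64 81]
print(x_where)   # [ 0  0  0  0  0  0 36 49 64 81]
```

Positions matched by no condition in `np.select` get its `default` value, which is 0 unless specified.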