Conditional ndarray argmin: How can I find the coordinates of the min of a subset of a multidimensional array? - numpy

I know I can use argmin and unravel_index to find the index of the smallest value in an ndarray, but what if I want to find the smallest nonzero element, or the smallest element which is not NaN?
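For reference, the unconditional case mentioned in the question is a two-liner (a minimal sketch):
import numpy as np
a = np.random.rand(4, 5, 3)
coords = np.unravel_index(a.argmin(), a.shape)  # coordinates of the overall minimum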

Here's an approach using flattened indices -
def flatnonzero_based(a, condition):  # condition = a != 0 or ~np.isnan(a)
    idx = np.flatnonzero(condition)
    return np.unravel_index(idx[np.take(a, idx).argmin()], a.shape)
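For example, to get the coordinates of the smallest nonzero element (a quick usage sketch on a made-up array):
a = np.array([[0., 3., 0.],
              [2., 0., 5.]])
flatnonzero_based(a, a != 0)
# (1, 0)  -- 2.0 is the smallest nonzero value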
Benchmarking
Approaches -
def flatnonzero_based(a, condition):  # Proposed soln
    idx = np.flatnonzero(condition)
    return np.unravel_index(idx[np.take(a, idx).argmin()], a.shape)

def where_based(a, condition):  # @Paul Panzer's soln
    nz = np.where(condition)
    return np.array(nz)[:, np.argmin(a[nz])]
Timings and verification -
In [233]: a = np.random.rand(40,50,30)
In [234]: nan_idx = np.random.choice(range(a.size), size = a.size//100, replace=0)
In [235]: a.ravel()[nan_idx] = np.nan
In [236]: condition = ~np.isnan(a)
In [237]: where_based(a, condition)
Out[237]: array([16, 10, 8])
In [238]: flatnonzero_based(a, condition)
Out[238]: (16, 10, 8)
In [239]: %timeit where_based(a, condition)
1000 loops, best of 3: 877 µs per loop
In [240]: %timeit flatnonzero_based(a, condition)
10000 loops, best of 3: 143 µs per loop
With 4D data -
In [255]: a = np.random.rand(40,50,30,30)
In [256]: nan_idx = np.random.choice(range(a.size), size = a.size//100, replace=0)
In [257]: a.ravel()[nan_idx] = np.nan
In [258]: condition = ~np.isnan(a)
In [259]: where_based(a, condition)
Out[259]: array([34, 14, 5, 10])
In [260]: flatnonzero_based(a, condition)
Out[260]: (34, 14, 5, 10)
In [261]: %timeit where_based(a, condition)
10 loops, best of 3: 64.9 ms per loop
In [262]: %timeit flatnonzero_based(a, condition)
100 loops, best of 3: 5.32 ms per loop
Incorporating @user7138814's suggestion -
In [267]: np.unravel_index(np.nanargmin(a), a.shape)
Out[267]: (34, 14, 5, 10)
In [268]: %timeit np.unravel_index(np.nanargmin(a), a.shape)
100 loops, best of 3: 4.54 ms per loop

This should work (condition is data != 0 or ~np.isnan(data))
nz = np.where(condition)
cond_arg_min = np.array(nz)[:, np.argmin(data[nz])]
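A quick usage sketch on a small made-up array with NaNs:
import numpy as np
data = np.array([[np.nan, 3.0],
                 [0.5, np.nan]])
nz = np.where(~np.isnan(data))
np.array(nz)[:, np.argmin(data[nz])]
# array([1, 0])  -- coordinates of 0.5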

Add to items, with multiple occurrences [duplicate]

I have an unsorted array of indexes:
i = np.array([1,5,2,6,4,3,6,7,4,3,2])
I also have an array of values of the same length:
v = np.array([2,5,2,3,4,1,2,1,6,4,2])
I have an array of zeros of the desired size:
d = np.zeros(10)
Now I want to add the values of v to the elements of d at the indices given by i.
If I do it in plain python I would do it like this:
for index, value in enumerate(v):
    idx = i[index]
    d[idx] += v[index]
It is ugly and inefficient. How can I change it?
np.add.at(d, i, v)
You'd think d[i] += v would work, but if you try to do multiple additions to the same cell that way, one of them overrides the others. The ufunc.at method avoids those problems.
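A small sketch of that difference on a toy input with a repeated index:
import numpy as np
i = np.array([1, 1, 3])
v = np.array([10., 20., 30.])
d = np.zeros(5)
d[i] += v           # the two writes to index 1 collide; only one survives
# d -> [ 0., 20.,  0., 30.,  0.]
d = np.zeros(5)
np.add.at(d, i, v)  # unbuffered: both additions to index 1 are applied
# d -> [ 0., 30.,  0., 30.,  0.]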
We can use np.bincount which is supposedly pretty efficient for such accumulative weighted counting, so here's one with that -
counts = np.bincount(i,v)
d[:counts.size] = counts
Alternatively, using the minlength argument, for the generic case where d could be any array and we want to add into it -
d += np.bincount(i,v,minlength=d.size).astype(d.dtype, copy=False)
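A quick sketch of that generic case, with d holding made-up pre-existing values instead of zeros:
import numpy as np
i = np.array([1, 5, 2, 6, 4, 3, 6, 7, 4, 3, 2])
v = np.array([2, 5, 2, 3, 4, 1, 2, 1, 6, 4, 2])
d = np.full(10, 100.0)  # hypothetical pre-existing values
d += np.bincount(i, v, minlength=d.size).astype(d.dtype, copy=False)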
Runtime tests
This section compares np.add.at based approach listed in the other post with the np.bincount based one listed earlier in this post.
In [61]: def bincount_based(d,i,v):
...:     counts = np.bincount(i,v)
...:     d[:counts.size] = counts
...:
...: def add_at_based(d,i,v):
...:     np.add.at(d, i, v)
...:
In [62]: # Inputs (random numbers)
...: N = 10000
...: i = np.random.randint(0,1000,(N))
...: v = np.random.randint(0,1000,(N))
...:
...: # Setup output arrays for two approaches
...: M = 12000
...: d1 = np.zeros(M)
...: d2 = np.zeros(M)
...:
In [63]: bincount_based(d1,i,v) # Run approaches
...: add_at_based(d2,i,v)
...:
In [64]: np.allclose(d1,d2) # Verify outputs
Out[64]: True
In [67]: # Setup output arrays for two approaches again for timing
...: M = 12000
...: d1 = np.zeros(M)
...: d2 = np.zeros(M)
...:
In [68]: %timeit add_at_based(d2,i,v)
1000 loops, best of 3: 1.83 ms per loop
In [69]: %timeit bincount_based(d1,i,v)
10000 loops, best of 3: 52.7 µs per loop

Vectorized running bin index calculation with TensorFlow or numpy

I have an integer array like this:
in=[1, 2, 6, 1, 3, 2, 1]
I would like to calculate a running index for the equal values in the array. For the array above, the output would be:
out=[0, 0, 0, 1, 0, 1, 2]
So the naive implementation would be to have a counter for all the values. I would like to have a vectorized solution to run it with tensorflow, perhaps with numpy.
I already thought of creating a 2D tensor of shape=(in.shape[0], tf.max(in), ), writing 1 to the tensor[i, in[i]] cell, then calling a cumsum column-wise and reading the result back row-wise. But my input array is quite big (several 100k entries) with a maximum value of ~500k, so this sparse matrix wouldn't even fit into memory.
Do you have better suggestions? Thank you!
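For what it's worth, a dense sketch of the one-hot + cumsum idea described in the question (only feasible for small inputs, since it allocates a full (n, max+1) matrix):
import numpy as np
x = np.array([1, 2, 6, 1, 3, 2, 1])
onehot = np.zeros((x.size, x.max() + 1), dtype=int)
onehot[np.arange(x.size), x] = 1                       # row i has a 1 in column x[i]
out = onehot.cumsum(axis=0)[np.arange(x.size), x] - 1  # column-wise running count, read back per row
# out -> [0, 0, 0, 1, 0, 1, 2]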
Here's a pandas solution:
s = pd.Series([1, 2, 6, 1, 3, 2, 1])
s.groupby(s).cumcount().values
Output:
array([0, 0, 0, 1, 0, 1, 2], dtype=int64)
Test on similar sized data:
s = pd.Series(np.random.randint(0,500000, 100000))
%timeit -n 100 s.groupby(s).cumcount().values
# 23.9 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use an actual sparse matrix, i.e. use sparse storage. With that an input like a = np.random.randint(0,5*10**5,10**6) is no problem:
import numpy as np
from scipy import sparse
def running(a):
    n, m = a.size, a.max()+1
    aux = sparse.csr_matrix((np.ones_like(a), a, np.arange(n+1)), (n, m)).tocsc()
    msk = aux.indptr[1:] != aux.indptr[:-1]
    indptr = aux.indptr[:-1][msk]
    aux.data[0] = 0
    aux.data[indptr[1:]] -= np.diff(indptr)
    out = np.empty_like(a)
    out[aux.indices] = aux.data.cumsum()
    return out

# alternative method for validation
def use_argsort(a):
    indices = a.argsort(kind="stable")
    ao = a[indices]
    indptr = np.concatenate([[0], (ao[1:] != ao[:-1]).nonzero()[0]+1])
    data = np.ones_like(a)
    data[0] = 0
    data[indptr[1:]] -= np.diff(indptr)
    out = np.empty_like(a)
    out[indices] = data.cumsum()
    return out
in_ = np.array([1, 2, 6, 1, 3, 2, 1])
print("OP example",in_,"->",running(in_))
print("second opinion","->",use_argsort(in_))
from timeit import timeit
A = np.random.randint(0,500_000,1_000_000)
print("large example (500k labels, 1M entries) takes",
timeit(lambda:running(A),number=10)*100,"ms")
print("using other method takes",
timeit(lambda:use_argsort(A),number=10)*100,"ms")
print("same result:",(use_argsort(A) == running(A)).all())
Sample run:
OP example [1 2 6 1 3 2 1] -> [0 0 0 1 0 1 2]
second opinion -> [0 0 0 1 0 1 2]
large example (500k labels, 1M entries) takes 84.1427305014804 ms
using other method takes 262.38483290653676 ms
same result: True

Convert numpy array with many dimensions into 2D array with nested numpy arrays

I would like to convert an array with many dimensions (more than 2) into a 2D array where other dimensions would be converted to nested stand-alone arrays.
So if I have an array like numpy.arange(3 * 4 * 5 * 5 * 5).reshape((3, 4, 5, 5, 5)), I would like to convert it to an array of shape (3, 4), where each element would be an array of shape (5, 5, 5). The dtype of the outer array would be object.
For example, for np.arange(8).reshape((1, 1, 2, 2, 2)), the output would be equivalent to:
a = np.ndarray(shape=(1,1), dtype=object)
a[0, 0] = np.arange(8).reshape((1, 1, 2, 2, 2))[0, 0, :, :, :]
How can I do this efficiently?
We can reshape and assign elements from the regular array into the output object dtype array in a single loop that seems to be a tad faster than with two loops, like so -
def reshape_approach(a):
    m, n = a.shape[:2]
    a.shape = (m*n,) + a.shape[2:]
    out = np.empty((m*n), dtype=object)
    for i in range(m*n):
        out[i] = a[i]
    out.shape = (m, n)
    a.shape = (m, n) + a.shape[1:]
    return out
Runtime test
Other approach(es) -
# @Scotty1-'s soln
def simply_assign(a):
    m, n = a.shape[:2]
    out = np.empty((m, n), dtype=object)
    for i in range(m):
        for j in range(n):
            out[i, j] = a[i, j]
    return out
Timings -
In [154]: m,n = 300,400
...: a = np.arange(m * n * 5 * 5 * 5).reshape((m,n, 5, 5, 5))
In [155]: %timeit simply_assign(a)
10 loops, best of 3: 39.4 ms per loop
In [156]: %timeit reshape_approach(a)
10 loops, best of 3: 32.9 ms per loop
With 7D data -
In [160]: m,n,p,q = 30,40,30,40
...: a = np.arange(m * n *p * q * 5 * 5 * 5).reshape((m,n,p,q, 5, 5, 5))
In [161]: %timeit simply_assign(a)
1000 loops, best of 3: 421 µs per loop
In [162]: %timeit reshape_approach(a)
1000 loops, best of 3: 316 µs per loop
Thanks for your hint, Mitar. This is how it should look using dtype=np.object arrays:
outer_array = np.empty((x.shape[0], x.shape[1]), dtype=np.object)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        outer_array[i, j] = x[i, j]
Looping may not be the most efficient way to do it, but there is afaik no vectorized operation for this task.
(Using some more reshaping, this should be even faster than Divakar's solution: ;)) ---> No, Divakar is faster.... Nice solution Divakar!
def advanced_reshape_solution(x):
    m, n = x.shape[:2]
    sub_arr_size = np.prod(x.shape[2:])
    out_array = np.empty((m * n), dtype=object)
    x_flat_view = x.reshape(-1)
    for i in range(m*n):
        out_array[i] = x_flat_view[i * sub_arr_size:(i + 1) * sub_arr_size].reshape(x.shape[2:])
    return out_array.reshape((m, n))

Selecting rows from ndarray via bytearray

I have a bytearray that is pulled from redis.
r.set('a', '')
r.setbit('a', 0, 1)
r.setbit('a', 1, 1)
r.setbit('a', 12, 1)
a_raw = r.get('a')
# b'\xc0\x08'
a_bin = bin(int.from_bytes(a_raw, byteorder="big"))
# 0b1100000000001000
I want to use that bytearray to select rows from an ndarray.
arr = np.arange(12)
arr[a_raw]
# array([0, 1, 12])
Edit: Both solutions work, but I found @Paul Panzer's to be faster
import timeit
setup = '''import numpy as np; a = b'\\xc0\\x08'; '''
t1 = timeit.timeit('idx = np.unpackbits(np.frombuffer(a, np.uint8)); np.where(idx)',
                   setup=setup, number=10000)
t2 = timeit.timeit('idx = np.array(list(bin(int.from_bytes(a, byteorder="big"))[2:])) == "1"; np.where(idx)',
                   setup=setup, number=10000)
print(t1, t2)
#0.019560601096600294 0.054518797900527716
Edit 2: Actually, the from_bytes method doesn't return what I'm looking for:
redis_db.delete('timeit_test')
redis_db.setbit('timeit_test', 12666, 1)
redis_db.setbit('timeit_test', 14379, 1)
by = redis_db.get('timeit_test')
idx = np.unpackbits(np.frombuffer(by, np.uint8))
indices = np.where(idx)
idx = np.array(list(bin(int.from_bytes(by, byteorder="big"))[2:])) == "1"
indices_2 = np.where(idx)
print(indices, indices_2)
#(array([12666, 14379]),) (array([ 1, 1714]),)
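That happens because bin(int.from_bytes(...)) drops the leading zero bits, so every position shifts whenever the first set bit isn't bit 0, while np.unpackbits keeps every bit of every byte. A minimal illustration with a made-up two-byte value:
import numpy as np
by = b'\x00\x08'   # only bit 12 is set
np.unpackbits(np.frombuffer(by, np.uint8)).nonzero()
# (array([12]),)   -- position preserved
bin(int.from_bytes(by, byteorder="big"))
# '0b1000'         -- leading zero bits are gone, so the set bit looks like position 0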
Here is a way using unpackbits:
>>> a = b'\xc0\x08'
>>> b = np.arange(32).reshape(16, 2)
>>> c = np.arange(40).reshape(20, 2)
>>>
>>> idx = np.unpackbits(np.frombuffer(a, np.uint8))
>>>
# if the sizes match, boolean indexing can be used
>>> b[idx.view(bool)]
array([[ 0, 1],
[ 2, 3],
[24, 25]])
>>>
# non matching sizes can be worked around using where
>>> c[np.where(idx)]
array([[ 0, 1],
[ 2, 3],
[24, 25]])
>>>
Here's one way:
In [57]: b = 0b1100000000001000
In [58]: mask = np.array(list(bin(b)[2:])) == '1'
In [59]: arr = np.arange(13)
In [60]: arr[mask[:len(arr)]]
Out[60]: array([ 0, 1, 12])
Additionally, a simple check demonstrates that the __getitem__ implementation for ndarray does not support indexing directly with a bytes object:
In [61]: by = b'\xc0\x08'
In [62]: arr[by]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-111-6cd68003b176> in <module>()
----> 1 arr[by]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`)
and integer or boolean arrays are valid indices
So unless you subclass ndarray or create an extension module with customized __getitem__ behavior, there is no way to do it directly from the bytes, and you must convert the bytes into a boolean mask based on bitwise conditions.
Here's an example comparing the timing for a few different approaches that work directly from the original bytes object:
In [171]: %timeit np.array(list(bin(int.from_bytes(by, byteorder='big'))[2:])) == '1'
3.51 µs ± 38 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [172]: %timeit np.unpackbits(np.frombuffer(by, np.uint8))
2.05 µs ± 29.59 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [173]: %timeit np.array(list(bin(struct.unpack('>H', by)[0])[2:])) == '1'
2.65 µs ± 6.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

What is the fastest way to get the mode of a numpy array?

I have to find the mode of a NumPy array that I read from an hdf5 file. The NumPy array is 1d and contains floating point values.
my_array=f1[ds_name].value
mod_value=scipy.stats.mode(my_array)
My array is 1d and contains around 1M values. It takes about 15 min for my script to return the mode value. Is there any way to make this faster?
Another question is why scipy.stats.median(my_array) does not work while mode works?
AttributeError: module 'scipy.stats' has no attribute 'median'
The implementation of scipy.stats.mode has a Python loop for handling the axis argument with multidimensional arrays. The following simple implementation, for one-dimensional arrays only, is faster:
def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]
Here's an example. First, make an array of integers with length 1000000.
In [40]: x = np.random.randint(0, 1000, size=(2, 1000000)).sum(axis=0)
In [41]: x.shape
Out[41]: (1000000,)
Check that scipy.stats.mode and mode1 give the same result.
In [42]: from scipy.stats import mode
In [43]: mode(x)
Out[43]: ModeResult(mode=array([1009]), count=array([1066]))
In [44]: mode1(x)
Out[44]: (1009, 1066)
Now check the performance.
In [45]: %timeit mode(x)
2.91 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [46]: %timeit mode1(x)
39.6 ms ± 83.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.91 seconds for mode(x) and only 39.6 milliseconds for mode1(x).
Here's one approach based on sorting -
def mode1d(ar_sorted):
    ar_sorted.sort()
    idx = np.flatnonzero(ar_sorted[1:] != ar_sorted[:-1])
    count = np.empty(idx.size+1, dtype=int)
    count[1:-1] = idx[1:] - idx[:-1]
    count[0] = idx[0] + 1
    count[-1] = ar_sorted.size - idx[-1] - 1
    argmax_idx = count.argmax()
    if argmax_idx == len(idx):
        modeval = ar_sorted[-1]
    else:
        modeval = ar_sorted[idx[argmax_idx]]
    modecount = count[argmax_idx]
    return modeval, modecount
Note that this mutates the input array, since it sorts it in place. So, if you want to keep the input array unchanged, or if you mind it being sorted, pass a copy.
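For example, a quick sketch of calling it on a copy so the original stays unsorted:
x = np.random.randint(0, 1000, size=1000000).astype(float)
modeval, modecount = mode1d(x.copy())  # x itself is left untouched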
Sample run on 1M elements -
In [65]: x = np.random.randint(0, 1000, size=(1000000)).astype(float)
In [66]: from scipy.stats import mode
In [67]: mode(x)
Out[67]: ModeResult(mode=array([ 295.]), count=array([1098]))
In [68]: mode1d(x)
Out[68]: (295.0, 1098)
Runtime test
In [75]: x = np.random.randint(0, 1000, size=(1000000)).astype(float)
# Scipy's mode
In [76]: %timeit mode(x)
1 loop, best of 3: 1.64 s per loop
# @Warren Weckesser's soln
In [77]: %timeit mode1(x)
10 loops, best of 3: 52.7 ms per loop
# Proposed in this post
In [78]: %timeit mode1d(x)
100 loops, best of 3: 12.8 ms per loop
With a copy, the timings for mode1d would be comparable to mode1.
I added the two functions mode1 and mode1d from the replies above to my script and tried to compare them with scipy.stats.mode.
dir_name="C:/Users/test_mode"
file_name="myfile2.h5"
ds_name="myds"
f_in=os.path.join(dir_name,file_name)
def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]

def mode1d(ar_sorted):
    ar_sorted.sort()
    idx = np.flatnonzero(ar_sorted[1:] != ar_sorted[:-1])
    count = np.empty(idx.size+1, dtype=int)
    count[1:-1] = idx[1:] - idx[:-1]
    count[0] = idx[0] + 1
    count[-1] = ar_sorted.size - idx[-1] - 1
    argmax_idx = count.argmax()
    if argmax_idx == len(idx):
        modeval = ar_sorted[-1]
    else:
        modeval = ar_sorted[idx[argmax_idx]]
    modecount = count[argmax_idx]
    return modeval, modecount
startTime=time.time()
with h5py.File(f_in, "a") as f1:
    myds = f1[ds_name].value
time1=time.time()
file_read_time=time1-startTime
print(str(file_read_time)+"\t"+"s"+"\t"+str((file_read_time)/60)+"\t"+"min")
print("mode_scipy=")
mode_scipy=scipy.stats.mode(myds)
print(mode_scipy)
time2=time.time()
mode_scipy_time=time2-time1
print(str(mode_scipy_time)+"\t"+"s"+"\t"+str((mode_scipy_time)/60)+"\t"+"min")
print("mode1=")
mode1=mode1(myds)
print(mode1)
time3=time.time()
mode1_time=time3-time2
print(str(mode1_time)+"\t"+"s"+"\t"+str((mode1_time)/60)+"\t"+"min")
print("mode1d=")
mode1d=mode1d(myds)
print(mode1d)
time4=time.time()
mode1d_time=time4-time3
print(str(mode1d_time)+"\t"+"s"+"\t"+str((mode1d_time)/60)+"\t"+"min")
The result of running the script on a numpy array of around 1M elements is:
mode_scipy=
ModeResult(mode=array([ 1.11903353e-06], dtype=float32), count=array([304909]))
938.8368742465973 s
15.647281237443288 min
mode1=(1.1190335e-06, 304909)
0.06500649452209473 s
0.0010834415753682455 min
mode1d=(1.1190335e-06, 304909)
0.06200599670410156 s
0.0010334332784016928 min