Vectorwise iteration of nD array with nditer - numpy

Given I have two arrays, say A (shape K,L,M) and B (shape K,M).
I want to iterate vectorwise and construct an output C (same shape as A) by running a function f on each input vector a and scalar b and reassembling the results into the output, i.e. c = f(a, b) for each pair, where a = A[i, :, j], b = B[i, j], and c has the same shape as a. In this example the vector axis is axis 1 of A, but in general it could be any other axis.
After reading the documentation page for nditer, I thought it should be appropriate and elegant to use here, since apparently it can allocate everything for you, allows a separate external loop, and easily allows reassembly of the output.
However, I cannot even get something as simple as a vector-wise copy of an existing array (again along an arbitrary axis) to work properly with nditer. Is what I want to do simply not possible with nditer, or am I using it wrong?
def test(arr, offsets, axis=0):
    #out = np.zeros_like(arr)
    with np.nditer([arr, None], flags=['external_loop'],  #[arr, out]
                   op_flags=[['readonly'], ['writeonly', 'allocate']],
                   op_axes=[[axis], None],  #[[axis], [axis]]
                   ) as ndit:
        for i, o in ndit:
            print(i.shape, o.shape)
            o[...] = i
    return ndit.operands[1]
tested = test(xam.data, shifts, axis=1)
print('test output shape', tested.shape)
>>> (<L>,) (<L>,)
>>> test output shape (<L>,)
This gives an output consisting only of the very first input vector. Even if I explicitly provide an output with the same shape as the input (e.g. via the commented-out changes), nditer still only runs once, on the very first length-L vector.
>>> (<L>,) (<L>,)
>>> test output shape (<N>, <L>, <M>)
I have made an alternative version using rollaxis views, but it is not particularly pretty or intuitive, so I am wondering whether this should also be possible with nditer somehow...
def test2(arr, offsets, axis=0):
    arr_r = np.rollaxis(arr, axis).reshape((arr.shape[axis], -1)).T
    out = np.zeros_like(arr)
    out_r = np.rollaxis(out, axis).reshape((arr.shape[axis], -1)).T  # create view
    for i, o in zip(arr_r, out_r):
        o[...] = i
    return out
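(An aside, not part of the original post: np.moveaxis is the recommended replacement for np.rollaxis these days, and combined with np.ndindex it gives a fairly direct, if still loop-based, generalization to the (K,L,M)/(K,M) case. A hedged sketch, with apply_vectorwise as a made-up name:)

import numpy as np

def apply_vectorwise(f, A, B, axis=1):
    # Move the vector axis of A to the end so its leading axes line up with B,
    # loop over those positions, and move the axis back afterwards.
    A_m = np.moveaxis(A, axis, -1)       # shape (..., L), a view of A
    out = np.empty_like(A_m)             # assumes f returns arrays of A's dtype
    for idx in np.ndindex(B.shape):      # B.shape == A_m.shape[:-1]
        out[idx] = f(A_m[idx], B[idx])   # vector and scalar in, vector out
    return np.moveaxis(out, -1, axis)    # back to A's original layout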

Changing your function to work with a list/tuple of axes:
In [378]: def test(arr, offsets, axis=0):
     ...:     #out = np.zeros_like(arr)
     ...:     with np.nditer([arr, None], flags=['external_loop'],  #[arr, out]
     ...:                    op_flags=[['readonly'], ['writeonly', 'allocate']],
     ...:                    op_axes=[axis, None],  #[[axis], [axis]]
     ...:                    ) as ndit:
     ...:         for i, o in ndit:
     ...:             print(i.shape, o.shape)
     ...:             print(i)
     ...:             o[...] = i
     ...:     return ndit.operands[1]
     ...:
Now it iterates on the whole 2d array. With external_loop it passes a whole (flat) array.
In [379]: test(np.arange(12).reshape((3,4)),0,axis=[0,1])
(12,) (12,)
[ 0 1 2 3 4 5 6 7 8 9 10 11]
Out[379]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [380]: test(np.arange(12).reshape((3,4)),0,axis=[1,0])
(12,) (12,)
[ 0 1 2 3 4 5 6 7 8 9 10 11]
Out[380]:
array([[ 0, 4, 8],
[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11]])
test2
Adding the print to test2 to better see what's passed:
In [385]: def test2(arr, offsets, axis=0):
     ...:     arr_r = np.rollaxis(arr, axis).reshape((arr.shape[axis], -1)).T
     ...:     out = np.zeros_like(arr)
     ...:     out_r = np.rollaxis(out, axis).reshape((arr.shape[axis], -1)).T  # create view
     ...:     for i, o in zip(arr_r, out_r):
     ...:         print(i.shape, i)
     ...:         o[...] = i
     ...:     return out
     ...:
In [386]: test2(np.arange(12).reshape((3,4)),0,axis=0)
(3,) [0 4 8]
(3,) [1 5 9]
(3,) [ 2 6 10]
(3,) [ 3 7 11]
Out[386]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [387]: test2(np.arange(12).reshape((3,4)),0,axis=1)
(4,) [0 1 2 3]
(4,) [4 5 6 7]
(4,) [ 8 9 10 11]
Out[387]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
timings
Taking out the prints to do timings:
nditer:
In [391]: timeit test0(np.arange(12).reshape((3,4)),0,axis=(0,1))
11.6 µs ± 36.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
iteration:
In [392]: timeit test20(np.arange(12).reshape((3,4)),0,axis=0)
26.5 µs ± 732 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
nditer, but without external_loop
In [395]: timeit test01(np.arange(12).reshape((3,4)),0,axis=(0,1))
17.9 µs ± 700 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Often in time tests nditer performs slower. Here though the external_loop case only has to iterate once, passing the whole flattened array to the body.
big picture
So far we are just trying to iterate through a 2d array. In the intro you talk of using
A (shape K,L,M) and B (shape K,M).
Normally in numpy we try to avoid any iteration. Since B is (K,M), B[:,None] (the same as B[:,None,:]) has shape (K,1,M) and broadcasts against A, so we can do all kinds of things with them
C = A + B[:,None]
C = A * B[:,None]
without needing to iterate. Any Python-level iteration over arrays slows down the code.
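For instance, when f itself is built from broadcasting operations, the whole (K,L,M) problem collapses to a couple of array expressions. A minimal sketch, where f is just an illustrative placeholder:

import numpy as np

K, L, M = 3, 4, 5
A = np.random.rand(K, L, M)
B = np.random.rand(K, M)

# f(a, b) = a * b + b, applied to every vector a = A[i, :, j] and scalar b = B[i, j]
C = A * B[:, None, :] + B[:, None, :]    # shape (K, L, M), no Python loop

# spot-check one vector against the loop formulation
i, j = 1, 2
assert np.allclose(C[i, :, j], A[i, :, j] * B[i, j] + B[i, j])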

Related

Numpy: apply function that creates an array

Numpy apply_along_axis/apply_over_axes assume that the applied function returns a scalar, but what if I want to use a function that returns an array (thus adding new dimensions)?
Below is a simplified example. I want to apply my_func to each row of an array. I could do this in pandas but expect numpy to be faster.
Function:
def my_func(k):
    x = np.arange(3)
    y = x ** k
    return y
Original array:
array([[1],
[2],
[3]])
Expected result:
array([[ 0, 1, 2, 3],
[ 0, 1, 4, 9],
[ 0, 1, 8, 27]], dtype=int32)
Update: it was an oversimplified example. I should have said the real function can only take a scalar as input. But the solution proposed by Michael Szczesny in comments works for such functions too.
Update2: I should have said a function that does not broadcast, like this:
def my_func(k):
    return np.random.randint(1, 4, 5) + k
I am sharing the code for your reference,
import numpy as np

def my_func(k):
    x = np.arange(4)
    y = x ** k
    return y

inp = np.array([[1], [2], [3]])
print(my_func(inp))
Output:
[[ 0 1 2 3]
[ 0 1 4 9]
[ 0 1 8 27]]
See if it helps?
Your function, with an added print to see exactly what k is:
In [39]: def my_func(k):
    ...:     print(k)
    ...:     x = np.arange(4)   # range to match your expected result
    ...:     y = x ** k
    ...:     return y
    ...:
As written the function works with your (3,1) array, arr = np.arange(1,4)[:,None]:
In [40]: my_func(arr)
[[1]
[2]
[3]]
Out[40]:
array([[ 0, 1, 2, 3],
[ 0, 1, 4, 9],
[ 0, 1, 8, 27]])
Note that the whole 2d array is passed. The x**k step works by broadcasting, using a (4,) array with a (3,1) one, to produce a (3,4) result. You should, if possible, write functions that work like this, taking full advantage of numpy methods and operators.
apply... can be used as here:
In [41]: np.apply_along_axis(my_func, 1, arr)
[1]
[2]
[3]
Out[41]:
array([[ 0, 1, 2, 3],
[ 0, 1, 4, 9],
[ 0, 1, 8, 27]])
Note that it passes (1,) arrays to the function. The docs should make it clear that this is designed to pass a 1d array to the function, NOT a scalar.
The equivalent for a 2d arr array is:
In [42]: np.array([my_func(i) for i in arr])
[1]
[2]
[3]
Out[42]:
array([[ 0, 1, 2, 3],
[ 0, 1, 4, 9],
[ 0, 1, 8, 27]])
Now let's comment out the print and do some time tests:
In [44]: timeit my_func(arr)
7.41 µs ± 6.75 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [45]: timeit np.apply_along_axis(my_func, 1, arr)
89.2 µs ± 649 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [46]: timeit np.array([my_func(i) for i in arr])
28.9 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
The broadcasted approach is fastest. apply_along_axis is slowest.
I claim that apply_along_axis is only useful when the array dimensions are greater than 2, and even then it just makes the code prettier, not faster.
For example with a 3d array, that still broadcasts with the (4,) shape x:
In [47]: arr = np.arange(24).reshape(2,3,4)
In [49]: np.apply_along_axis(my_func, 2, arr).shape
Out[49]: (2, 3, 4)
In [50]: my_func(arr).shape
Out[50]: (2, 3, 4)
In [51]: np.array([[my_func(arr[i,j,:]) for j in range(3)] for i in range(2)]).shape
Out[51]: (2, 3, 4)
The list iteration requires a double loop. apply_along_axis hides this, but does not reduce the total number of calls to my_func.
If your function really requires a scalar (e.g. it uses math.cos or an if test), then you might consider np.vectorize. For smallish examples it's slower than the equivalent list comprehension, but it does scale better for large ones. But again, if you can write the function to work directly with arrays, you'll be much happier with the performance.
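A minimal np.vectorize sketch, assuming the non-broadcasting function from Update2 above (the names are just for illustration):

import numpy as np

def my_func(k):
    return np.random.randint(1, 4, 5) + k        # a (5,) result for a scalar k

vec_func = np.vectorize(my_func, signature='()->(n)')

arr = np.array([[1], [2], [3]])
print(vec_func(arr).shape)                       # (3, 1, 5): one length-5 result per scalar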

Find duplicated sequences in numpy.array or pandas column

For example, I have got an array like this:
([ 1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5 ])
I need to find all duplicated sequences, not values: sequences of at least two consecutive values.
The result should be like this:
of length 2: [1, 5] with indexes (0, 16);
of length 3: [3, 3, 7] with indexes (6, 12); [7, 9, 4] with indexes (2, 8)
The long sequences should be excluded, if they are not duplicated. ([5, 5, 5, 5]) should NOT be taken as [5, 5] on indexes (0, 1, 2)! It's not a duplicate sequence, it's one long sequence.
I can do it with a pandas.apply function, but it is far too slow; swifter did not help me.
And in real life I need to find all of them, with lengths from 10 up to 100 consecutive values, in a database with 1500 columns of 700,000 values each. So I really do need a vectorized solution.
Is there a vectorized solution for finding all of them at once? Or at least for finding only 10-value sequences? Or only 4-value sequences? Anything that is fully vectorized?
One possible implementation (although not fully vectorized) that finds all sequences of size n that appear more than once is the following:
import numpy as np
def repeated_sequences(arr, n):
    Na = arr.size
    r_seq = np.arange(n)
    n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]
    unique_seqs = np.unique(n_seqs, axis=0)
    comp = n_seqs == unique_seqs[:, None]
    M = np.all(comp, axis=-1)
    if M.any():
        matches = np.array(
            [np.convolve(M[i], np.ones((n), dtype=int)) for i in range(M.shape[0])]
        )
        repeated_inds = np.count_nonzero(matches, axis=-1) > n
        repeated_matches = matches[repeated_inds]
        idxs = np.argwhere(repeated_matches > 0)[::n]
        grouped_idxs = np.split(
            idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
        )
    else:
        return [], []
    return unique_seqs[repeated_inds], grouped_idxs
In theory, you could replace
matches = np.array(
    [np.convolve(M[i], np.ones((n), dtype=int)) for i in range(M.shape[0])]
)
with
matches = scipy.signal.convolve(
    M, np.ones((1, n), dtype=int), mode="full"
).astype(int)
which would make the whole thing "fully vectorized", but my tests showed that this was 3 to 4 times slower than the for-loop. So I'd stick with that. Or simply,
matches = np.apply_along_axis(np.convolve, -1, M, np.ones((n), dtype=int))
which does not have any significant speed-up, since it's basically a hidden loop (see this).
This is based on @Divakar's answer here that dealt with a very similar problem, in which the sequence to look for was provided. I simply made it follow that procedure for all possible sequences of size n, which are found inside the function with n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]; unique_seqs = np.unique(n_seqs, axis=0).
For example,
>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> n = 3
>>> repeated_seqs, inds = repeated_sequences(a, n)
>>> for i, seq in enumerate(repeated_seqs[:10]):
...: print(f"{seq} with indexes {inds[i]}")
...:
[3 3 7] with indexes [ 6 12]
[7 9 4] with indexes [2 8]
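As an aside (not part of the original answer), on numpy >= 1.20 the window construction n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq] can also be written with sliding_window_view:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
n = 3
n_seqs = sliding_window_view(arr, n)      # shape (arr.size - n + 1, n)
unique_seqs = np.unique(n_seqs, axis=0)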
Disclaimer
The long sequences should be excluded, if they are not duplicated. ([5, 5, 5, 5]) should NOT be taken as [5, 5] on indexes (0, 1, 2)! It's not a duplicate sequence, it's one long sequence.
This is not directly taken into account, and the sequence [5, 5] would appear more than once according to this algorithm. You could do something like the following, based on @Paul's answer here, but it involves a loop:
import numpy as np
repeated_matches = np.array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
                             [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])
idxs = np.argwhere(repeated_matches > 0)
grouped_idxs = np.split(
    idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
)
>>> print(grouped_idxs)
[array([ 6, 7, 8, 12, 13, 14], dtype=int64),
array([ 7, 8, 9, 10], dtype=int64)]
# If there are consecutive numbers in grouped_idxs, that means that there is a long
# sequence that should be excluded. So, you'd have to check for consecutive numbers
filtered_idxs = []
for idx in grouped_idxs:
    if not all((idx[1:] - idx[:-1]) == 1):
        filtered_idxs.append(idx)
>>> print(filtered_idxs)
[array([ 6, 7, 8, 12, 13, 14], dtype=int64)]
Some tests:
>>> n = 3
>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> %timeit repeated_sequences(a, n)
414 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> n = 4
>>> a = np.random.randint(0, 10, (10000,))
>>> %timeit repeated_sequences(a, n)
3.88 s ± 54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> result, _ = repeated_sequences(a, n)
>>> result.shape
(2637, 4)
This is not the most efficient implementation by far, but it works as a 2D approach. Plus, if there aren't any repeated sequences, it returns empty lists.
EDIT: Full implementation
I vectorized the routine I added in the Disclaimer section as a possible solution to the long sequence problem and ended up with the following:
import numpy as np
# Taken from:
# https://stackoverflow.com/questions/53051560/stacking-numpy-arrays-of-different-length-using-padding
def stack_padding(it):
    def resize(row, size):
        new = np.array(row)
        new.resize(size)
        return new
    row_length = max(it, key=len).__len__()
    mat = np.array([resize(row, row_length) for row in it])
    return mat
def repeated_sequences(arr, n):
    Na = arr.size
    r_seq = np.arange(n)
    n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]
    unique_seqs = np.unique(n_seqs, axis=0)
    comp = n_seqs == unique_seqs[:, None]
    M = np.all(comp, axis=-1)
    repeated_seqs = []
    idxs_repeated_seqs = []
    if M.any():
        matches = np.apply_along_axis(np.convolve, -1, M, np.ones((n), dtype=int))
        repeated_inds = np.count_nonzero(matches, axis=-1) > n
        if repeated_inds.any():
            repeated_matches = matches[repeated_inds]
            idxs = np.argwhere(repeated_matches > 0)
            grouped_idxs = np.split(
                idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
            )
            # Additional routine
            # Pad this uneven array with zeros so that we can use it normally
            grouped_idxs = np.array(grouped_idxs, dtype=object)
            padded_idxs = stack_padding(grouped_idxs)
            # Find the indices where there are padded zeros
            pad_positions = padded_idxs == 0
            # Perform the "consecutive-numbers check" (this will take one
            # item off the original array, so we have to correct for its shape).
            idxs_to_remove = np.pad(
                (padded_idxs[:, 1:] - padded_idxs[:, :-1]) == 1,
                [(0, 0), (0, 1)],
                constant_values=True,
            )
            pad_positions = np.argwhere(pad_positions)
            i = pad_positions[:, 0]
            j = pad_positions[:, 1] - 1  # Shift by one (shape correction)
            idxs_to_remove[i, j] = True  # Masking, since we don't want pad indices
            # Obtain a final mask (boolean opposite of indices to remove)
            final_mask = ~idxs_to_remove.all(axis=-1)
            grouped_idxs = grouped_idxs[final_mask]  # Filter the long sequences
            repeated_seqs = unique_seqs[repeated_inds][final_mask]
            # In order to get the correct indices, we must first limit the
            # search to a shape (on axis=1) of the closest multiple of n.
            # This will avoid taking more indices than we should to show where
            # each repeated sequence begins
            to = padded_idxs.shape[1] & (-n)
            # Build the final list of indices (going from index 0 up to 'to'
            # in steps of n)
            idxs_repeated_seqs = [
                grouped_idxs[i][:to:n] for i in range(grouped_idxs.shape[0])
            ]
    return repeated_seqs, idxs_repeated_seqs
For example,
n = 2
examples = [
    # First example is your original example array.
    np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
    # Second example has a long sequence of 5's, and since there aren't
    # any [5, 5] anywhere else, it's not taken into account and therefore
    # should not come out.
    np.array([1, 5, 5, 5, 5, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
    # Third example has the same long sequence but since there is a [5, 5]
    # later, then it should take it into account and this sequence should
    # be found.
    np.array([1, 5, 5, 5, 5, 6, 5, 5, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
    # Fourth example has a [5, 5] first and later it has a long sequence of
    # 5's which are uneven and the previous implementation got confused with
    # the indices to show as the starting indices. In this case, it should be
    # 1, 13 and 15 for [5, 5].
    np.array([1, 5, 5, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 5, 5, 5, 5, 5]),
]
for a in examples:
    print(f"\nExample: {a}")
    repeated_seqs, inds = repeated_sequences(a, n)
    for i, seq in enumerate(repeated_seqs):
        print(f"\t{seq} with indexes {inds[i]}")
Output (as expected):
Example: [1 5 7 9 4 6 3 3 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [0 16]
[3 3] with indexes [6 12]
[3 7] with indexes [7 13]
[7 9] with indexes [2 8]
[9 4] with indexes [3 9]
Example: [1 5 5 5 5 6 3 3 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [0 16]
[3 3] with indexes [6 12]
[3 7] with indexes [7 13]
Example: [1 5 5 5 5 6 5 5 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [ 0 16]
[5 5] with indexes [1 3 6]
Example: [1 5 5 9 4 6 3 3 7 9 4 0 3 5 5 5 5 5]
[5 5] with indexes [ 1 13 15]
[9 4] with indexes [3 9]
You can test it out yourself with more examples and more cases. Keep in mind this is what I understood from your disclaimer. If you want to count the long sequences as one, even if multiple sequences are in there (for example, [5, 5] appears twice in [5, 5, 5, 5]), this won't work for you and you'd have to come up with something else.

Group Pandas dataframe Age column by Age groups [duplicate]

I have a data frame column with numeric values:
df['percentage'].head()
46.5
44.2
100.0
42.12
I want to see the column as bin counts:
bins = [0, 1, 5, 10, 25, 50, 100]
How can I get the result as bins with their value counts?
[0, 1] bin amount
[1, 5] etc
[5, 10] etc
...
You can use pandas.cut:
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print (df)
percentage binned
0 46.50 (25, 50]
1 44.20 (25, 50]
2 100.00 (50, 100]
3 42.12 (25, 50]
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
Or numpy.searchsorted:
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
...and then value_counts or groupby and aggregate size:
s = pd.cut(df['percentage'], bins=bins).value_counts()
print (s)
(25, 50] 3
(50, 100] 1
(10, 25] 0
(5, 10] 0
(1, 5] 0
(0, 1] 0
Name: percentage, dtype: int64
s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print (s)
percentage
(0, 1] 0
(1, 5] 0
(5, 10] 0
(10, 25] 0
(25, 50] 3
(50, 100] 1
dtype: int64
By default, cut returns a Categorical.
Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data; see the pandas documentation on operations with categoricals.
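If the bins should be listed in interval order rather than by count, a sort_index() on the value_counts result does it (a small convenience sketch):
s = pd.cut(df['percentage'], bins=bins).value_counts().sort_index()
print(s)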
Using the Numba module for speed up.
On big datasets (more than 500k), pd.cut can be quite slow for binning data.
I wrote my own function in Numba with just-in-time compilation, which is roughly six times faster:
from numba import njit

@njit
def cut(arr):
    bins = np.empty(arr.shape[0])
    for idx, x in enumerate(arr):
        if (x >= 0) & (x < 1):
            bins[idx] = 1
        elif (x >= 1) & (x < 5):
            bins[idx] = 2
        elif (x >= 5) & (x < 10):
            bins[idx] = 3
        elif (x >= 10) & (x < 25):
            bins[idx] = 4
        elif (x >= 25) & (x < 50):
            bins[idx] = 5
        elif (x >= 50) & (x < 100):
            bins[idx] = 6
        else:
            bins[idx] = 7
    return bins
cut(df['percentage'].to_numpy())
# array([5., 5., 7., 5.])
Optional: you can also map it to bins as strings:
a = cut(df['percentage'].to_numpy())
conversion_dict = {1: 'bin1',
                   2: 'bin2',
                   3: 'bin3',
                   4: 'bin4',
                   5: 'bin5',
                   6: 'bin6',
                   7: 'bin7'}
bins = list(map(conversion_dict.get, a))
# ['bin5', 'bin5', 'bin7', 'bin5']
Speed comparison:
# Create a dataframe of 8 million rows for testing
dfbig = pd.concat([df]*2000000, ignore_index=True)
dfbig.shape
# (8000000, 1)
%%timeit
cut(dfbig['percentage'].to_numpy())
# 38 ms ± 616 µs per loop (mean ± standard deviation of 7 runs, 10 loops each)
%%timeit
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
pd.cut(dfbig['percentage'], bins=bins, labels=labels)
# 215 ms ± 9.76 ms per loop (mean ± standard deviation of 7 runs, 10 loops each)
We could also use np.select:
bins = [0, 1, 5, 10, 25, 50, 100]
df['groups'] = np.select([df['percentage'].between(i, j, inclusive='right')
                          for i, j in zip(bins, bins[1:])],
                         [1, 2, 3, 4, 5, 6])
Output:
percentage groups
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
Convenient and fast version using Numpy
np.digitize is a convenient and fast option:
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': [1,2,3,4,5]})
df['y'] = np.digitize(df['x'], bins=[3,5])
print(df)
returns
x y
0 1 0
1 2 0
2 3 1
3 4 1
4 5 2
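For the question's own bins, np.digitize can also reproduce the labelled pd.cut result shown earlier when told to use right-closed intervals (a sketch, assuming the percentage DataFrame from the earlier examples):
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.digitize(df['percentage'], bins, right=True)
# 46.50 -> 5, 44.20 -> 5, 100.00 -> 6, 42.12 -> 5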

Calculate statistics of one numpy array based on the values in a second numpy array

Let's say I have a 2-D numpy array
a = np.array([[1, 1, 2, 2],
              [1, 1, 2, 2],
              [3, 3, 4, 4],
              [3, 3, 4, 4]])
and a 3-d numpy array like
b = np.array([[[1, 2, 8, 8],
               [3, 4, 8, 8],
               [8, 7, 0, 1],
               [6, 5, 3, 2]],
              [[1, 1, 1, 3],
               [1, 1, 4, 2],
               [0, 3, 2, 1],
               [3, 2, 3, 9]]])
I want to calculate the statistics (mean, median, majority, sum, count,...) of b according to the "IDs" in a.
Example: sum should result in another array (or a list if that is easier), that gives the sum of the values in b. There are 4 unique "IDs" in a: 1,2,3,4, and 2 'layers' in b. For the 1's in a that is a sum of 10 (layer 0) and 4 (layer 1). For the 2's
it's 32 (layer 0) and 10 (layer 1), and so on...
Expected result for sum:
sums = [[1, 10, 4],
[2, 32, 10],
[3, 26, 8],
[4, 6, 15]]
Expected result for mean:
avgs = [[1, 2.5, 1.0 ],
[2, 8.0, 2.5 ],
[3, 6.5, 2.0 ],
[4, 1.5, 3.75]]
My guess is that there is a handy function in numpy that does this already, but I am not sure what to search for exactly. Any pointers on how to do it, or what to search for, are much appreciated.
Update:
I came up with this for-loop, which is fine for very small arrays. However, my arrays are much larger than 4 by 4 and a faster implementation is needed.
result = []
ids = np.unique(a)
for id in ids:
    line = [id]
    for band in range(0, b.shape[0]):
        cell = b[band][np.where(a == id)]
        line.append(cell.mean())
        # line.append(cell.min())
        # line.append(cell.max())
        # line.append(cell.std())
        line.append(cell.sum())
        line.append(np.median(cell))
    result.append(line)
You can try the code below
cal_sums = [[b[j, :, :][np.argwhere(a==i)[:,0], np.argwhere(a==i)[:,1]].sum()
             for i in np.unique(a)] for j in range(2)]
cal_mean = [[b[j, :, :][np.argwhere(a==i)[:,0], np.argwhere(a==i)[:,1]].mean()
             for i in np.unique(a)] for j in range(2)]
sums = np.zeros((np.unique(a).size, b.shape[0]+1))
means = np.zeros((np.unique(a).size, b.shape[0]+1))
sums[:, 0], sums[:, 1:] = np.unique(a), np.asarray(cal_sums).T
means[:, 0], means[:, 1:] = np.unique(a), np.asarray(cal_mean).T
print(sums)
[[ 1. 10. 4.]
[ 2. 32. 10.]
[ 3. 26. 8.]
[ 4. 6. 15.]]
print(means)
[[1. 2.5 1. ]
[2. 8. 2.5 ]
[3. 6.5 2. ]
[4. 1.5 3.75]]
I tested it with quite a large array size and it is fast:
n = 1000
a = np.random.randint(1, 5, size=(n, n))
b = np.random.randint(1, 10, size=(2, n, n))
speed:
377 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
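A hedged alternative sketch (not from the answer above): np.bincount gives the per-ID sums without any Python loop over the IDs, assuming the IDs in a are small non-negative integers. Using the question's original a and b:

import numpy as np

ids = a.ravel()
counts = np.bincount(ids)
present = np.flatnonzero(counts)          # the IDs that actually occur: [1, 2, 3, 4]
layer_sums = np.array([np.bincount(ids, weights=layer.ravel(), minlength=counts.size)
                       for layer in b])   # shape (n_layers, max_id + 1)
sums = np.column_stack([present, layer_sums[:, present].T])
means = np.column_stack([present, (layer_sums[:, present] / counts[present]).T])
print(sums)     # matches the sums printed above
print(means)    # matches the means printed above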

Efficient numpy submatrix view

I wish to apply the Hungarian algorithm to many subsets of a numpy matrix C, indexed by cross products of the lists row_ind and col_ind. Currently, I see the following options to do so:
Double slicing:
linear_sum_assignment(C[row_ind,:][:,col_ind])
Problem: two copies per subset operation.
Advanced slicing via np.ix_:
linear_sum_assignment(C[np.ix_(row_ind, col_ind)])
Problem: one copy per subset, np.ix_ is inefficient (allocates n x n matrix).
UPDATE: as noted by @hpaulj, np.ix_ doesn't in fact allocate an n x n matrix, but it is somehow still slower than option 1.
Masked array.
Problem: doesn't work with linear_sum_assignment.
So, no option is satisfying.
What is ideally desired is an ability to specify a submatrix view using the matrix C and a couple of unidimensional masks for rows and cols respectively, so such a view could be passed to linear_sum_assignment. For another linear_sum_assignment call, I would quickly adjust masks but never modify or copy/subset the full matrix.
Is there something similar already available in numpy?
What is the most efficient way (as little copies/memory allocations as possible) to process multiple submatrices of the same big matrix?
The different ways of indexing an array with lists/arrays all time about the same. They all produce copies, not views.
For example
In [99]: arr = np.ones((1000,1000),int)
In [100]: id1=np.arange(0,1000,10)
In [101]: id2=np.arange(0,1000,20)
In [105]: timeit arr[id1,:][:,id2].shape
52.5 µs ± 243 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [106]: timeit arr[np.ix_(id1,id2)].shape
66.5 µs ± 47.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In contrast if I use slices (in this case selecting the same elements), I get a view, which is much faster:
In [107]: timeit arr[::10,::20].shape
661 ns ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
ix_ doesn't create a (m,n) array; it returns a tuple of adjusted 1d arrays. It's the equivalent of
In [108]: timeit arr[id1[:,None], id2].shape
54.5 µs ± 1.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The timing difference is primarily due to an extra layer of function calls.
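A small illustration (not from the answer text): np.ix_ returns a tuple of reshaped index arrays that broadcast against each other, rather than an (m, n) index matrix:

import numpy as np

id1 = np.array([0, 2, 4])
id2 = np.array([1, 3])
rows, cols = np.ix_(id1, id2)
print(rows.shape, cols.shape)    # (3, 1) (1, 2) -> broadcasts to a (3, 2) selection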
Your scipy link has a [source] link:
https://github.com/scipy/scipy/blob/v0.19.1/scipy/optimize/_hungarian.py#L13-L107
This optimize.linear_sum_assignment function creates a _Hungary object with the cost_matrix. That makes a copy, and solves the problem by searching and manipulating its values.
Using the documentation example:
In [110]: optimize.linear_sum_assignment(cost)
Out[110]: (array([0, 1, 2], dtype=int32), array([1, 0, 2], dtype=int32))
What it does is create a state object:
In [111]: H=optimize._hungarian._Hungary(cost)
In [112]: vars(H)
Out[112]:
{'C': array([[4, 1, 3],
[2, 0, 5],
[3, 2, 2]]),
'Z0_c': 0,
'Z0_r': 0,
'col_uncovered': array([ True, True, True], dtype=bool),
'marked': array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0]]),
'path': array([[0, 0],
[0, 0],
[0, 0],
[0, 0],
[0, 0],
[0, 0]]),
'row_uncovered': array([ True, True, True], dtype=bool)}
It iterates,
In [113]: step=optimize._hungarian._step1
In [114]: while step is not None:
...: step = step(H)
...:
And the resulting state is:
In [115]: vars(H)
Out[115]:
{'C': array([[1, 0, 1],
[0, 0, 4],
[0, 1, 0]]),
'Z0_c': 0,
'Z0_r': 1,
'col_uncovered': array([False, False, False], dtype=bool),
'marked': array([[0, 1, 0],
[1, 0, 0],
[0, 0, 1]]),
'path': array([[1, 0],
[0, 0],
[0, 0],
[0, 0],
[0, 0],
[0, 0]]),
'row_uncovered': array([ True, True, True], dtype=bool)}
The solution is pulled from the marked array
In [116]: np.where(H.marked)
Out[116]: (array([0, 1, 2], dtype=int32), array([1, 0, 2], dtype=int32))
The total cost is the sum of these values:
In [122]: cost[np.where(H.marked)]
Out[122]: array([1, 2, 2])
But the cost from the C array in the final state is 0:
In [124]: H.C[np.where(H.marked)]
Out[124]: array([0, 0, 0])
So even if the submatrix that you give to optimize.linear_sum_assignment is a view, the search still involves a copy. The search space and time increases significantly with the size of this cost matrix.