Given two arrays, `a` and `b`, how to efficiently find all combinations of elements in `b` that share the same value in `a`? - numpy

Given two arrays, a and b, how can I efficiently find all combinations of elements in b that share the same value in a?
Here is an example:
Given
a = [0, 0, 0, 1, 1, 2, 2, 2, 2]
b = [1, 2, 4, 5, 9, 3, 7, 22, 10]
how would you calculate
c = [[1, 2],
     [1, 4],
     [2, 4],
     [5, 9],
     [3, 7],
     [3, 22],
     [3, 10],
     [7, 22],
     [7, 10],
     [22, 10]]
?
a can be assumed to be sorted.
I can do this with loops, à la:
import torch

a = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2])
b = torch.tensor([1, 2, 4, 5, 9, 3, 7, 22, 10])
jumps = torch.cat((torch.tensor([0]),
                   torch.where(a.diff() > 0)[0] + 1,
                   torch.tensor([len(a)])))
cs = []
for i in range(len(jumps) - 1):
    cs.append(torch.combinations(b[jumps[i]:jumps[i + 1]]))
c = torch.cat(cs)
Is there any efficient way to avoid the loop? The solution should work for CPU and CUDA.
Also, the solution should have runtime O(m * m), where m is the largest number of equal elements in a, and not O(n * n), where n is the length of a.
I prefer solutions for PyTorch, but I am curious about solutions for NumPy as well.

I think the overhead of using torch is only justified for bigger datasets, as there is basically no computational difficulty in the function; imho you can achieve the same results with:
from collections import Counter

def find_combinations1(a, b):
    count_a = Counter(a)
    combinations = []
    for x in set(b):
        if count_a[x] == b.count(x):
            combinations.append(x)
    return combinations
or even simpler:
def find_combinations2(a, b):
    return list(set(a) & set(b))
With pytorch I assume the simplest approach is:
import torch

def find_combinations3(a, b):
    a = torch.tensor(a)
    b = torch.tensor(b)
    eq = torch.eq(a, b.view(-1, 1))
    indices = torch.nonzero(eq)
    return indices[:, 1]
This option of course has a time complexity of O(n*m), where n is the size of a and m is the size of b; note that the intermediate eq tensor also takes O(n*m) memory, while the input tensors themselves only need O(n+m).
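For the grouped-combinations problem the question actually asks about, here is a loop-free sketch (my addition, not part of the original answer): pad every group to the size m of the largest group via torch.triu_indices and mask out the pairs that overrun their group, which costs O(G * m * m) for G groups rather than O(n * n). The helper name group_combinations is mine:

import torch

def group_combinations(a, b):
    # a is assumed sorted; find where each run of equal values starts
    zero = torch.zeros(1, dtype=torch.long, device=a.device)
    end = torch.full((1,), len(a), dtype=torch.long, device=a.device)
    jumps = torch.cat((zero, torch.where(a.diff() > 0)[0] + 1, end))
    counts = jumps.diff()    # size of each group
    starts = jumps[:-1]      # start index of each group
    m = int(counts.max())    # largest group size
    # all local pairs (p, q) with p < q inside a group of size m
    p, q = torch.triu_indices(m, m, offset=1, device=a.device)
    # broadcast to global indices: one row of candidate pairs per group
    gi = starts[:, None] + p[None, :]
    gj = starts[:, None] + q[None, :]
    valid = q[None, :] < counts[:, None]  # keep pairs that fit in their group
    return torch.stack((b[gi[valid]], b[gj[valid]]), dim=1)

a = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2])
b = torch.tensor([1, 2, 4, 5, 9, 3, 7, 22, 10])
print(group_combinations(a, b))  # matches c from the question

Since every tensor is created on a.device, the same sketch should run on CPU and CUDA.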

Related

Simplest way to turn a covariance table into a covariance matrix in numpy

Say you load in a table of correlations that looks like this:
pd.DataFrame([['A', 'B', 1], ['B', 'C', 2], ['A', 'C', 3], ['C', 'D', 100]])
1 is the covariance between A and B, 2 is the covariance between B and C, etc.
What's the most elegant (readable + efficient) way to convert this to:
np.array([[1, 1, 3, 0], [1, 1, 2, 0], [3, 2, 1, 100], [0, 0, 100, 1]])
The full covariance matrix for A, B, C, and D, assuming that unwritten relationships are 0 (true in my case). Variables can be in any order.
First, as I mention in a comment, an ndarray holding both strings and numbers will convert everything to strings, so you must convert the correlation values in the first table back to float.
Assuming you have the first table like this:
correlations = np.array([[0, 1, 1], [1, 2, 2], [0, 2, 3], [2, 3, 100]])
Where 'A' is 0, 'B' is 1, and so on...
you can create the matrix you need like this:
matrix = np.identity(var_counts)
for correlation in correlations:
    i, j, value = correlation
    i, j = int(i), int(j)
    matrix[i, j] = matrix[j, i] = value
Where var_counts is the number of variables you have in the first table.
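Putting both steps together, here is a minimal sketch that goes straight from the string DataFrame to the matrix (the label-to-code mapping and variable names are my own, not from the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 'B', 1], ['B', 'C', 2], ['A', 'C', 3], ['C', 'D', 100]])
# map each label to an integer code: 'A' -> 0, 'B' -> 1, ...
labels = sorted(set(df[0]) | set(df[1]))
codes = {label: k for k, label in enumerate(labels)}
var_counts = len(codes)
matrix = np.identity(var_counts)
for _, (x, y, value) in df.iterrows():
    i, j = codes[x], codes[y]
    matrix[i, j] = matrix[j, i] = float(value)
print(matrix)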

Transforming a sequence of integers into the binary representation of that sequence's strides [duplicate]

I'm looking for a way to select multiple slices from a numpy array at once. Say we have a 1D data array and want to extract three portions of it like below:
data_extractions = []
for start_index in range(0, 3):
    data_extractions.append(data[start_index: start_index + 5])
Afterwards data_extractions will be:
data_extractions = [
    data[0:5],
    data[1:6],
    data[2:7]
]
Is there any way to perform the above operation without the for loop? Some sort of indexing scheme in numpy that would let me select multiple slices from an array and return them as that many arrays, say in an n+1 dimensional array?
I thought maybe I could replicate my data and then select a span from each row, but the code below throws an IndexError:
replicated_data = np.vstack([data] * 3)
data_extractions = replicated_data[[range(3)], [slice(0, 5), slice(1, 6), slice(2, 7)]]
You can use an array of indexes to select the rows you want in the appropriate shape.
For example:
data = np.random.normal(size=(100,2,2,2))
# Creating an array of row-indexes
indexes = np.array([np.arange(0,5), np.arange(1,6), np.arange(2,7)])
# data[indexes] will return an element of shape (3,5,2,2,2). Converting
# to list happens along axis 0
data_extractions = list(data[indexes])
np.all(data_extractions[1] == data[1:6])
True
The final comparison is against the original data.
stride_tricks can do that
a = np.arange(10)
b = np.lib.stride_tricks.as_strided(a, (3, 5), 2 * a.strides)
b
# array([[0, 1, 2, 3, 4],
#        [1, 2, 3, 4, 5],
#        [2, 3, 4, 5, 6]])
Please note that b references the same memory as a, in fact multiple times (for example b[0, 1] and b[1, 0] are the same memory address). It is therefore safest to make a copy before working with the new structure.
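For example, a sketch of the safe pattern:

b = np.lib.stride_tricks.as_strided(a, (3, 5), 2 * a.strides).copy()
# b now owns its data; writing into b can no longer silently change a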
Higher dimensions can be done in a similar fashion, for example 2d -> 4d:
a = np.arange(16).reshape(4, 4)
b = np.lib.stride_tricks.as_strided(a, (3,3,2,2), 2*a.strides)
b.reshape(9, 2, 2)  # this forces a copy
# array([[[ 0,  1],
#         [ 4,  5]],
#        [[ 1,  2],
#         [ 5,  6]],
#        [[ 2,  3],
#         [ 6,  7]],
#        [[ 4,  5],
#         [ 8,  9]],
#        [[ 5,  6],
#         [ 9, 10]],
#        [[ 6,  7],
#         [10, 11]],
#        [[ 8,  9],
#         [12, 13]],
#        [[ 9, 10],
#         [13, 14]],
#        [[10, 11],
#         [14, 15]]])
Here's an approach with a strided-indexing scheme using np.lib.stride_tricks.as_strided that basically creates a view into the input array; as such it is pretty efficient to create and, being a view, occupies no more memory.
Also, this works for ndarrays with generic number of dimensions.
Here's the implementation -
def strided_axis0(a, L):
    # Store the shape and strides info
    shp = a.shape
    s = a.strides
    # Compute length of output array along the first axis
    nd0 = shp[0] - L + 1
    # Setup shape and strides for use with np.lib.stride_tricks.as_strided
    # and get (n+1) dim output array
    shp_in = (nd0, L) + shp[1:]
    strd_in = (s[0],) + s
    return np.lib.stride_tricks.as_strided(a, shape=shp_in, strides=strd_in)
Sample run for a 4D array case -
In [44]: a = np.random.randint(11,99,(10,4,2,3)) # Array
In [45]: L = 5 # Window length along the first axis
In [46]: out = strided_axis0(a, L)
In [47]: np.allclose(a[0:L], out[0]) # Verify outputs
Out[47]: True
In [48]: np.allclose(a[1:L+1], out[1])
Out[48]: True
In [49]: np.allclose(a[2:L+2], out[2])
Out[49]: True
You can slice your array with a prepared slicing array
a = np.array(list('abcdefg'))
b = np.array([
    [0, 1, 2, 3, 4],
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6]
])
a[b]
However, b doesn't have to be generated by hand in this way. It can be made more dynamic with
b = np.arange(5) + np.arange(3)[:, None]
In the general case you have to do some sort of iteration - and concatenation - either when constructing the indexes or when collecting the results. It's only when the slicing pattern is itself regular that you can use a generalized slicing via as_strided.
The accepted answer constructs an indexing array, one row per slice. So that is iterating over the slices, and arange itself is a (fast) iteration. And np.array concatenates them on a new axis (np.stack generalizes this).
In [264]: np.array([np.arange(0,5), np.arange(1,6), np.arange(2,7)])
Out[264]:
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6]])
There are index_tricks convenience methods to do the same thing:
In [265]: np.r_[0:5, 1:6, 2:7]
Out[265]: array([0, 1, 2, 3, 4, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6])
This takes the slicing notation, expands it with arange and concatenates. It even lets me expand and concatenate into 2d
In [269]: np.r_['0,2',0:5, 1:6, 2:7]
Out[269]:
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6]])
In [270]: data=np.array(list('abcdefghijk'))
In [272]: data[np.r_['0,2',0:5, 1:6, 2:7]]
Out[272]:
array([['a', 'b', 'c', 'd', 'e'],
       ['b', 'c', 'd', 'e', 'f'],
       ['c', 'd', 'e', 'f', 'g']],
      dtype='<U1')
In [273]: data[np.r_[0:5, 1:6, 2:7]]
Out[273]:
array(['a', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'f', 'c', 'd', 'e',
       'f', 'g'],
      dtype='<U1')
Concatenating results after indexing also works.
In [274]: np.stack([data[0:5],data[1:6],data[2:7]])
My memory from other SO questions is that relative timings are in the same order of magnitude. It may vary for example with the number of slices versus their length. Overall the number of values that have to be copied from source to target will be the same.
If the slices vary in length, you'd have to use the flat indexing.
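As a side note (my addition, assuming NumPy 1.20 or newer): for this regular, equal-length case, sliding_window_view wraps the strided view in a safe, read-only form:

import numpy as np

data = np.arange(10)
# read-only view of all length-5 windows; take the first three
windows = np.lib.stride_tricks.sliding_window_view(data, 5)[:3]
print(windows)
# [[0 1 2 3 4]
#  [1 2 3 4 5]
#  [2 3 4 5 6]]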
No matter which approach you choose, if two slices share an element, the resulting views do not support in-place mathematical operations correctly unless you use ufunc.at, which can be less efficient than a loop. For testing:
def as_strides(arr, window_size, stride, writeable=False):
    '''Get a strided sub-matrices view of a 4D ndarray.

    Args:
        arr (ndarray): input array with shape (batch_size, m1, n1, c).
        window_size (tuple): with shape (m2, n2).
        stride (tuple): stride of windows in (y_stride, x_stride).
        writeable (bool): it is recommended to keep it False unless needed.
    Returns:
        subs (view): strided window view, with shape
            (batch_size, y_nwindows, x_nwindows, m2, n2, c).

    See also numpy.lib.stride_tricks.sliding_window_view
    '''
    batch_size = arr.shape[0]
    m1, n1, c = arr.shape[1:]
    m2, n2 = window_size
    y_stride, x_stride = stride
    view_shape = (batch_size, 1 + (m1 - m2) // y_stride,
                  1 + (n1 - n2) // x_stride, m2, n2, c)
    strides = (arr.strides[0], y_stride * arr.strides[1],
               x_stride * arr.strides[2]) + arr.strides[1:]
    subs = np.lib.stride_tricks.as_strided(arr, view_shape, strides=strides,
                                           writeable=writeable)
    return subs
import numpy as np
np.random.seed(1)
Xs = as_strides(np.random.randn(1, 5, 5, 2), (3, 3), (2, 2), writeable=True)[0]
print('input\n0,0\n', Xs[0, 0])
np.add.at(Xs, np.s_[:], 5)
print('unbuffered sum output\n0,0\n', Xs[0,0])
np.add.at(Xs, np.s_[:], -5)
Xs = Xs + 5
print('normal sum output\n0,0\n', Xs[0, 0])
We can use a list comprehension for this:
data=np.array([1,2,3,4,5,6,7,8,9,10])
data_extractions=[data[b:b+5] for b in [1,2,3,4,5]]
data_extractions
Results
[array([2, 3, 4, 5, 6]), array([3, 4, 5, 6, 7]), array([4, 5, 6, 7, 8]), array([5, 6, 7, 8, 9]), array([ 6, 7, 8, 9, 10])]

Numpy Vectorization: add row above to current row on ndarray

I would like to add the values in the above row to the row below using vectorization. For example, if I had the ndarray,
[[0, 0, 0, 0],
 [1, 1, 1, 1],
 [2, 2, 2, 2],
 [3, 3, 3, 3]]
Then after one iteration through this method, it would result in
[[0, 0, 0, 0],
 [1, 1, 1, 1],
 [3, 3, 3, 3],
 [5, 5, 5, 5]]
One can simply do this with a for loop:
import numpy as np

def addAboveRow(arr):
    cpy = arr.copy()
    r, c = arr.shape
    for i in range(1, r):
        for j in range(c):
            cpy[i][j] += arr[i - 1][j]
    return cpy

ndarr = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]).reshape(4, 4)
print(addAboveRow(ndarr))
I'm not sure how to approach this using vectorization though. I think slices should be used? Also, I'm not really sure how to deal with the issue of the top border, because nothing should be added onto the first row. Any help would be appreciated. Thanks!
Note: I am really new to vectorization, so an explanation would be great!
You can use indexing directly:
b = np.zeros_like(a)
b[0] = a[0]
b[1:] = a[1:] + a[:-1]
>>> b
array([[0, 0, 0, 0],
       [1, 1, 1, 1],
       [3, 3, 3, 3],
       [5, 5, 5, 5]])
An alternative:
b = a.copy()
b[1:] += a[:-1]
Or:
b = a.copy()
np.add(b[1:], a[:-1], out=b[1:])
You could try the following
np.put(arr, np.arange(arr.shape[1], arr.size), arr[1:]+arr[:-1])
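A quick check of that one-liner (np.put scatters into the flattened array in place, so positions arr.shape[1] through arr.size - 1 cover every row but the first; the right-hand side is evaluated before arr is modified):

import numpy as np

arr = np.array([[0, 0, 0, 0],
                [1, 1, 1, 1],
                [2, 2, 2, 2],
                [3, 3, 3, 3]])
np.put(arr, np.arange(arr.shape[1], arr.size), arr[1:] + arr[:-1])
print(arr)
# [[0 0 0 0]
#  [1 1 1 1]
#  [3 3 3 3]
#  [5 5 5 5]]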

Elegantly generate result array in numpy

I have my X and Y numpy arrays:
X = np.array([0,1,2,3])
Y = np.array([0,1,2,3])
And my function which maps x,y values to Z points:
def z(x,y):
    return x+y
I wish to produce the obvious thing required for a 3D plot: the 2-dimensional numpy array for the corresponding Z-values. I believe it should look like:
Z = np.array([[0, 1, 2, 3],
              [1, 2, 3, 4],
              [2, 3, 4, 5],
              [3, 4, 5, 6]])
I can do this in several lines, but I'm looking for the briefest, most elegant piece of code.
For a function that is array aware it is more economical to use an open grid:
>>> import numpy as np
>>>
>>> X = np.array([0,1,2,3])
>>> Y = np.array([0,1,2,3])
>>>
>>> def z(x,y):
...     return x+y
...
>>> XX, YY = np.ix_(X, Y)
>>> XX, YY
(array([[0],
        [1],
        [2],
        [3]]), array([[0, 1, 2, 3]]))
>>> z(XX, YY)
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5],
       [3, 4, 5, 6]])
If your grid axes are ranges you can directly create the grid using np.ogrid
>>> XX, YY = np.ogrid[:4, :4]
>>> XX, YY
(array([[0],
        [1],
        [2],
        [3]]), array([[0, 1, 2, 3]]))
If the function is not array aware you can make it so using np.vectorize:
>>> def f(x, y):
...     if x > y:
...         return x
...     else:
...         return -x
...
>>> np.vectorize(f)(*np.ogrid[-3:4, -3:4])
array([[ 3,  3,  3,  3,  3,  3,  3],
       [-2,  2,  2,  2,  2,  2,  2],
       [-1, -1,  1,  1,  1,  1,  1],
       [ 0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1, -1, -1, -1],
       [ 2,  2,  2,  2,  2, -2, -2],
       [ 3,  3,  3,  3,  3,  3, -3]])
One very short way to achieve what you want is to produce a meshgrid from your coordinates:
X,Y = np.meshgrid(x,y)
z = X+Y
or more general:
z = f(X,Y)
or even in one line:
z = f(*np.meshgrid(x,y))
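Spelled out with the arrays from the question (note that, unlike the open grids above, meshgrid allocates two full (4, 4) coordinate arrays):

import numpy as np

x = np.array([0, 1, 2, 3])
y = np.array([0, 1, 2, 3])
X, Y = np.meshgrid(x, y)  # X varies along columns, Y along rows
Z = X + Y
# Z is the 4x4 array from the question:
# [[0 1 2 3]
#  [1 2 3 4]
#  [2 3 4 5]
#  [3 4 5 6]]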
EDIT:
If your function may also return a constant, you have to somehow infer the dimensions that the result should have. If you want to continue using meshgrids, one very simple way would be to re-write your function in this way:
def f(x,y):
    return x*0 + y*0 + a
where a would be your constant. numpy would then take care of the dimensions for you. This is of course a bit weird looking, so instead you could write
def f(x,y):
    return np.full(x.shape, a)
If you really want to go with functions that work both on scalars and arrays, it's probably best to go with np.vectorize as in @PaulPanzer's answer.

referencing rows in a matrix using index from another matrix

You have an original sparse matrix X:
>>print type(X)
>>print X.todense()
<class 'scipy.sparse.csr.csr_matrix'>
[[1,4,3]
 [3,4,1]
 [2,1,1]
 [3,6,3]]
You have a second sparse matrix Z, which is derived from some rows of X (say the values are doubled so we can see the difference between the two matrices). In pseudo-code:
>>Z = X[[0,2,3]]
>>print Z.todense()
[[1,4,3]
 [2,1,1]
 [3,6,3]]
>>Z = Z*2
>>print Z.todense()
[[2, 8, 6]
 [4, 2, 2]
 [6, 12, 6]]
What's the best way of retrieving the rows in Z using the ORIGINAL indices from X. So for instance, in pseudo-code:
>>print Z[[0,3]]
[[2,8,6]    # 0 from Z (what would be row 0 from X)
 [6,12,6]]  # 2 from Z (what would be row 3 from X)
That is, how can you retrieve rows from Z using indices that refer to the rows' original positions in the original matrix X? To do this, you can't modify X in any way (you can't add an index column to the matrix X), but there are no other limits.
If you have the original indices in an array i, and the values in i are in increasing order (as in your example), you can use numpy.searchsorted(i, [0, 3]) to find the indices in Z that correspond to indices [0, 3] in the original X. Here's a demonstration in an IPython session:
In [37]: from numpy import array, searchsorted

In [38]: from scipy.sparse import csr_matrix

In [39]: X = csr_matrix([[1,4,3],[3,4,1],[2,1,1],[3,6,3]])
In [40]: X.todense()
Out[40]:
matrix([[1, 4, 3],
        [3, 4, 1],
        [2, 1, 1],
        [3, 6, 3]])
In [41]: i = array([0, 2, 3])
In [42]: Z = 2 * X[i]
In [43]: Z.todense()
Out[43]:
matrix([[ 2,  8,  6],
        [ 4,  2,  2],
        [ 6, 12,  6]])
In [44]: Zsub = Z[searchsorted(i, [0, 3])]
In [45]: Zsub.todense()
Out[45]:
matrix([[ 2,  8,  6],
        [ 6, 12,  6]])
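If i were not sorted, searchsorted would no longer apply, but a plain lookup still works; a sketch of the general case (the lookup dict is my addition):

# map original row numbers in X to positions in Z
lookup = {int(orig): pos for pos, orig in enumerate(i)}
Zsub = Z[[lookup[k] for k in (0, 3)]]  # rows 0 and 3 of the original X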