Efficiently get first N numbers that satisfy a condition in each row in a pytorch/numpy tensor

Given a tensor b, I would like to extract N elements from each row that satisfy a specific condition. For example, suppose a is a matrix of the same shape as b that indicates whether each element of b satisfies the condition. I would like to extract N elements from each row whose corresponding value in a is 1.
There are two scenarios: (1) extract the first N such elements in each row, in order; (2) randomly sample N elements per row from all the elements that satisfy the condition.
Is there an efficient way to handle both cases in PyTorch or NumPy? Thanks!
The example below illustrates the first case.
import torch
# given
a = torch.tensor([[1, 0, 0, 1, 1, 1], [0, 1, 0, 1, 1, 1], [1,1,1,1,1,0]])
b = torch.arange(18).view(3,6)
# suppose N=3
# output:
c = torch.tensor([[0, 3,4],[7,9,10], [12,13,14]])
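
One possible way to approach both scenarios, sketched here in NumPy rather than PyTorch: a stable argsort of the negated mask moves the satisfying columns to the front while keeping their original order, and sorting random keys does the same in random order. The sketch assumes every row has at least N satisfying elements; the names mask, first_n, random_n and the seed are illustrative. The same idea should carry over to PyTorch with a stable sort and torch.gather, but that is not tested here.

```python
import numpy as np

a = np.array([[1, 0, 0, 1, 1, 1],
              [0, 1, 0, 1, 1, 1],
              [1, 1, 1, 1, 1, 0]])
b = np.arange(18).reshape(3, 6)
N = 3
mask = a.astype(bool)

# Case 1: first N satisfying elements per row, in order.
# ~mask is False where the condition holds, and False sorts before True;
# a stable sort keeps the original column order among the False entries.
first_idx = np.argsort(~mask, axis=1, kind="stable")[:, :N]
first_n = np.take_along_axis(b, first_idx, axis=1)

# Case 2: N random satisfying elements per row (without replacement).
# Give every element a random key, push non-satisfying elements to the end
# with an infinite key, and keep the N smallest keys in each row.
rng = np.random.default_rng(0)  # the seed is an arbitrary choice
keys = rng.random(a.shape)
keys[~mask] = np.inf
rand_idx = np.argsort(keys, axis=1)[:, :N]
random_n = np.take_along_axis(b, rand_idx, axis=1)

print(first_n)
print(random_n)
```

If a row can have fewer than N satisfying elements, the trailing columns of that row will contain elements that do not satisfy the condition, so they would need to be masked out separately.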

Related

select from multi-dimensional arrays based on a given condition

There is an ndarray data with shape, e.g., (5, 10, 2), and two other lists, x1 and x2, both of size 10. I want to select a subset of data based on the following condition,
Across the second dimension
If x1[i] <= data[j, i, 0] <= x2[i], then we will select data[j, i, :]
I tried selected = data[x1<=data[:,:,0]<=x2]. It does not work. I am not sure what the efficient (or vectorized) way to implement this condition-based selection is.
The code below selects all values in data where the index along the third dimension is 0 (i.e. each value has some index data[i, j, 0]) and where the value is <= the corresponding x2 and >= the corresponding x1:
idx = np.where(np.logical_and(data[:, :, 0] >= np.array(x1), data[:, :, 0] <= np.array(x2)))
# data[idx] contains the full rows of length 2 rather than just the 0th column, so we need to select the 0th column.
selected = data[idx][:, 0]
The code assumes that x1 and x2 are lists with lengths equal to the size of data's second dimension (in this case, 10). Note that the code only returns the values, not the indices of the values.
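
As a hedged illustration, here is a self-contained version on made-up data. The chained comparison in the question fails because Python evaluates x1 <= v <= x2 as (x1 <= v) and (v <= x2), and `and` cannot reduce a boolean array to a single truth value; combining the two comparisons with & avoids that. The random data, the bounds 0.2 and 0.7, and the names lo, hi, mask are assumptions for the example. Indexing with the 2D mask keeps the whole pair data[j, i, :], which is what the question asks for.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((5, 10, 2))   # made-up data with the shape from the question
x1 = [0.2] * 10                 # hypothetical lower bounds, one per column i
x2 = [0.7] * 10                 # hypothetical upper bounds, one per column i

lo = np.asarray(x1)
hi = np.asarray(x2)
first = data[:, :, 0]           # shape (5, 10)

# lo and hi have shape (10,), so they broadcast along the first axis.
mask = (first >= lo) & (first <= hi)

selected = data[mask]           # shape (k, 2): every selected data[j, i, :]
j_idx, i_idx = np.nonzero(mask) # the (j, i) positions of the selected pairs

print(selected.shape, j_idx[:5], i_idx[:5])
```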
Let me know if you have any questions.

Aligning words in 2 sentences using a numpy 2d array

Given 2 sentences, I need to align the words in those sentences based on the best similarity match between the words in these sentences.
For instance, consider 2 sentences:
sent1 = "John saw Mary" # 3 tokens
sent2 = "All the are grown by farmers" # 6 tokens
Here, for each token in sent1, I need to find the most similar token in sent2. Further, if a token in sent2 is already matched with a token in sent1, then it cannot be matched with another token in sent1.
For this purpose, I use a similarity matrix between the tokens of the two sentences, as given below:
cosMat = np.array([[0.1656948 , 0.16653526, 0.13380264, 0.09286133, 0.16262592, 0.14392284],
                   [0.40876892, 0.46331584, 0.28574535, 0.34924293, 0.2480594 , 0.25846344],
                   [0.15394737, 0.10269377, 0.12189645, 0.09426117, 0.09631223, 0.10549664]], dtype=np.float32)
cosMat is a 2D ndarray of shape (3, 6) that contains the cosine similarity scores between the tokens of the two sentences.
np.argmax gives the following array as output:
np.argmax(cosMat, axis=1)
array([1, 1, 0])
However, this is not a valid solution, because the first and second tokens of sent1 both align with the second token of sent2.
Instead I chose to do the following:
sortArr = np.dstack(np.unravel_index(np.argsort(-cosMat.ravel()), cosMat.shape))
rowSet = set()
colSet = set()
matches = list()
for item in sortArr[0]:
    if item[1] not in colSet:
        if item[0] not in rowSet:
            matches.append((item[0], item[1], cosMat[item[0], item[1]]))
            colSet.add(item[1])
            rowSet.add(item[0])
matches
This gives the following output, which is the desirable output:
[(1, 1, 0.46331584), (0, 0, 0.1656948), (2, 2, 0.121896446)]
My question is: is there a more efficient way to achieve what the code above does?
Here's an alternative that requires a copy of the initial similarity matrix. Every time you find the best match, you discard both tokens of the pair by setting the corresponding row and column of the copied matrix to 0. This ensures that no token appears in more than one pair. (It relies on the remaining valid similarities being positive, as they are here.)
res = []
mat = np.copy(cosMat)
for _ in range(mat.shape[0]):
    i, j = np.unravel_index(mat.argmax(), mat.shape)
    res.append((i, j, mat[i, j]))
    mat[i,:], mat[:,j] = 0, 0
Will return:
[(1, 1, 0.46331584), (0, 0, 0.1656948), (2, 2, 0.12189645)]
However, since your version calls np.argsort only once, it will most probably be faster.
Otherwise, I would rewrite your version, for conciseness, as:
sortArr = list(zip(*np.unravel_index(np.argsort(-cosMat.ravel()), cosMat.shape)))
matches = []
rows, cols = set(), set()
for x, y in sortArr:
    # x is a row index (token of sent1), y is a column index (token of sent2)
    if x not in rows and y not in cols:
        matches.append((x, y, cosMat[x, y]))
        rows.add(x)
        cols.add(y)
You could use a single set instead of two, by using some kind of prefix for the indices in order to distinguish rows from columns. Here again I'm not sure there's much gain in doing so:
matches = []
matched = set()
for x, y in sortArr:
    if 'i%i' % x not in matched and 'j%i' % y not in matched:
        matches.append((x, y, cosMat[x, y]))
        matched.update(['i%i' % x, 'j%i' % y])

What kind of x satisfies argsort(x) == argsort(argsort(x))?

For a 1-D array, what kind of x gives you argsort(x) == argsort(argsort(x))? A sorted array would be a trivial solution,
but you can also have unsorted arrays like [1, 0, 2] or [1, 0, 2, 3].
I'm really interested.
sorted_array = np.arange(10)
np.testing.assert_array_equal(np.argsort(sorted_array), np.argsort(np.argsort(sorted_array)))
# or
semi_sorted = [1, 0, 2]
np.testing.assert_array_equal(np.argsort(semi_sorted), np.argsort(np.argsort(semi_sorted)))
# or
semi_sorted = [1, 0, 2, 3]
np.testing.assert_array_equal(np.argsort(semi_sorted), np.argsort(np.argsort(semi_sorted)))
# or
semi_sorted = [2, 1, 3, 4, 5]
np.testing.assert_array_equal(np.argsort(semi_sorted), np.argsort(np.argsort(semi_sorted)))
What type of array fits the criterion?
To formalize @Alex Riley's intuition:
For any (zero-based) permutation p we have argsort(p) = p^-1, because by definition of argsort, p[argsort(p)] = [0,1,2,...], and [0,1,2,...] viewed as a permutation is the identity.
Now, no matter what x is, argsort(x) is a permutation, so writing p for it, the condition becomes p = p^-1 or, equivalently, p^2 = id.
What do permutations p that are self-inverse look like? If p is applied twice nothing changes, so if the first application of p moves x to y, the second application of p must move y to x. As y may equal x, p must therefore consist of flips of two elements and of elements that stay put. That is also sufficient.
We now know what argsort(x) looks like. What about x itself? Let us for simplicity assume x has only unique elements; otherwise the details of the sort algorithm used have to be considered. Let us write s for the sorted x. Then s = x[p]. Permuting both sides with p we get s[p] = x[p^2] = x. So x may be any sequence that is obtained from an ordered sequence by flipping the positions of some (possibly zero) nonoverlapping pairs.
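
The characterization above can be checked numerically with a small sketch. The helper names swap_pairs and is_fixed_point are made up for this illustration; the example builds x from a sorted sequence by swapping two disjoint pairs of positions, and contrasts it with a 3-cycle, which is not self-inverse.

```python
import numpy as np

def swap_pairs(sorted_values, pairs):
    """Copy a sorted sequence and swap the given non-overlapping index pairs."""
    x = np.array(sorted_values)
    for i, j in pairs:
        x[[i, j]] = x[[j, i]]
    return x

def is_fixed_point(x):
    """True if argsort(x) == argsort(argsort(x))."""
    p = np.argsort(x)
    return np.array_equal(p, np.argsort(p))

s = np.arange(10) * 10                # a sorted sequence with unique elements
x = swap_pairs(s, [(0, 3), (4, 9)])   # flip two disjoint pairs of positions

print(is_fixed_point(s))                    # sorted input: no flips at all
print(is_fixed_point(x))                    # disjoint flips only
print(is_fixed_point(np.array([1, 2, 0])))  # argsort is a 3-cycle here
```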

Get indices for values of one array in another array

I have two 1D-arrays containing the same set of values, but in a different (random) order. I want to find the list of indices, which reorders one array according to the other one. For example, my 2 arrays are:
ref = numpy.array([5,3,1,2,3,4])
new = numpy.array([3,2,4,5,3,1])
and I want the list order for which new[order] == ref.
My current idea is:
def find(val):
    return numpy.argmin(numpy.absolute(ref-val))
order = sorted(range(new.size), key=lambda x:find(new[x]))
However, this only works as long as no values are repeated. In my example 3 appears twice, and I get new[order] = [5 3 3 1 2 4]. The second 3 is placed directly after the first one, because my function find() does not track which 3 I am currently looking for.
So I could add something to deal with this, but I have a feeling there might be a better solution out there. Maybe in some library (NumPy or SciPy)?
Edit about the duplicate: This linked solution assumes that the arrays are ordered, or for the "unordered" solution, returns duplicate indices. I need each index to appear only once in order. Which of the duplicates comes first, however, is not important (nor can it be determined from the data provided).
What I get with sort_idx = A.argsort(); order = sort_idx[np.searchsorted(A,B,sorter = sort_idx)] is: [3, 0, 5, 1, 0, 2]. But what I am looking for is [3, 0, 5, 1, 4, 2].
Given ref, new which are shuffled versions of each other, we can get the unique indices that map ref to new using the sorted version of both arrays and the invertibility of np.argsort.
Start with:
i = np.argsort(ref)
j = np.argsort(new)
Now ref[i] and new[j] both give the sorted version of the arrays, which is the same for both. You can invert the first sort by doing:
k = np.argsort(i)
Now ref is just new[j][k], or new[j[k]]. Since all the operations are shuffles using unique indices, the final index j[k] is unique as well. j[k] can be computed in one step with
order = np.argsort(new)[np.argsort(np.argsort(ref))]
From your original example:
>>> ref = np.array([5, 3, 1, 2, 3, 4])
>>> new = np.array([3, 2, 4, 5, 3, 1])
>>> order = np.argsort(new)[np.argsort(np.argsort(ref))]
>>> order
array([3, 0, 5, 1, 4, 2])
>>> new[order] # Should give ref
array([5, 3, 1, 2, 3, 4])
This is probably not any faster than the more general solutions to the similar question on SO, but it does guarantee unique indices as you requested. A further optimization would be to replace np.argsort(i) with something like the argsort_unique function in this answer. I would go one step further and just compute the inverse of the sort directly:
def inverse_argsort(a):
    fwd = np.argsort(a)
    inv = np.empty_like(fwd)
    inv[fwd] = np.arange(fwd.size)
    return inv

order = np.argsort(new)[inverse_argsort(ref)]

How to perform matching between two sequences?

I have two mini-batches of sequences:
a = C.sequence.input_variable((10))
b = C.sequence.input_variable((10))
Both a and b have variable-length sequences.
I want to do matching between them, where matching is defined as follows: match (e.g. take the dot product of) the token at each time step of a with the token at every time step of b.
How can I do this?
I have mostly answered this on GitHub, but to be consistent with SO rules, I am including a response here. In the case of something simple like a dot product, you can take advantage of the fact that it factorizes nicely, so the following code works:
import cntk as C

axisa = C.Axis.new_unique_dynamic_axis('a')
axisb = C.Axis.new_unique_dynamic_axis('b')
a = C.sequence.input_variable(1, sequence_axis=axisa)
b = C.sequence.input_variable(1, sequence_axis=axisb)
c = C.sequence.broadcast_as(C.sequence.reduce_sum(a), b) * b
c.eval({a: [[1, 2, 3],[4, 5]], b: [[6, 7], [8]]})
[array([[ 36.],
        [ 42.]], dtype=float32), array([[ 72.]], dtype=float32)]
In the general case you need the following steps:
static_b, mask = C.sequence.unpack(b, neutral_value).outputs
scores = your_score(a, static_b)
The first line converts the b sequence into a static tensor with one more axis than b. Because of packing, some elements of this tensor will be invalid, and those are indicated by the mask. The neutral_value is placed as a dummy value in the static_b tensor wherever data was missing. Depending on your score, you might be able to arrange for the neutral_value not to affect the final score (e.g. if your score is a dot product, 0 would be a good choice; if it involves a softmax, -infinity or something close to that would be a good choice).
The second line now has access to each element of a and to all the elements of b along the first axis of static_b. For a dot product, static_b is a matrix and one element of a is a vector, so a matrix-vector multiplication results in a sequence whose elements are all the inner products between the corresponding element of a and all elements of b.
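
To make the role of the neutral value concrete, here is a plain NumPy analogy of the unpack-then-score idea. It is not CNTK code: the padded length max_len, the random data, and the names a_seq, b_seq, valid_scores are assumptions for illustration only. With a dot-product score and a neutral value of 0, the padded positions of static_b contribute exactly 0, and the mask recovers the valid scores.

```python
import numpy as np

rng = np.random.default_rng(0)
a_seq = rng.random((4, 10))   # one sequence a: 4 time steps of 10-dim tokens
b_seq = rng.random((3, 10))   # one sequence b: 3 time steps of 10-dim tokens
max_len = 5                   # hypothetical padded length within the mini-batch
neutral_value = 0.0           # neutral for a dot-product score

# Analogue of unpacking b: a static tensor padded with neutral_value, plus a mask.
static_b = np.full((max_len, 10), neutral_value)
static_b[:len(b_seq)] = b_seq
mask = np.arange(max_len) < len(b_seq)

# Each time step of a against every (padded) time step of b.
scores = a_seq @ static_b.T   # shape (4, 5); padded columns are exactly 0
valid_scores = scores[:, mask]  # shape (4, 3); only the real time steps of b

print(scores.shape, valid_scores.shape)
```

For a score that ends in a softmax over the steps of b, the padded columns would instead need to be set to a very negative value before the softmax, in line with the remark about -infinity above.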