Aligning words in 2 sentences using a numpy 2d array - numpy

Given 2 sentences, I need to align the words in those sentences based on the best similarity match between the words in these sentences.
For instance, consider 2 sentences:
sent1 = "John saw Mary" # 3 tokens
sent2 = "All the are grown by farmers" # 6 tokens
Here, for each token in sent1, I need to find the most similar token in sent2. Further, if a token in sent2 is already matched with a token in sent1, then it cannot be matched with another token in sent1.
For this purpose, I use a similarity matrix between the tokens of the two sentences, as given below:
cosMat = np.array([[0.1656948 , 0.16653526, 0.13380264, 0.09286133, 0.16262592, 0.14392284],
                   [0.40876892, 0.46331584, 0.28574535, 0.34924293, 0.2480594 , 0.25846344],
                   [0.15394737, 0.10269377, 0.12189645, 0.09426117, 0.09631223, 0.10549664]], dtype=np.float32)
cosMat is a 2d ndarray of shape (3, 6) which contains the cosine similarity scores between the tokens of the two sentences.
np.argmax would provide the following array as output:
np.argmax(cosMat, axis=1)
array([1, 1, 0])
However, this is not a valid solution, as the first and second tokens of sent1 both align with the second token of sent2.
Instead I chose to do the following:
sortArr = np.dstack(np.unravel_index(np.argsort(-cosMat.ravel()), cosMat.shape))
rowSet = set()
colSet = set()
matches = list()
for item in sortArr[0]:
    if item[1] not in colSet:
        if item[0] not in rowSet:
            matches.append((item[0],item[1],cosMat[item[0],item[1]]))
            colSet.add(item[1])
            rowSet.add(item[0])
matches
This gives the following output, which is the desirable output:
[(1, 1, 0.46331584), (0, 0, 0.1656948), (2, 2, 0.121896446)]
My question is: is there a more efficient way to achieve what the code above does?

Here's an alternative that requires copying the initial similarity matrix. Every time you find the best match, you discard the two tokens of the pair by setting the corresponding row and column of the copied matrix to 0. This ensures the same token does not appear in multiple pairs (assuming, as here, that all similarity scores are positive).
res = []
mat = np.copy(cosMat)
for _ in range(mat.shape[0]):
    i, j = np.unravel_index(mat.argmax(), mat.shape)
    res.append((i, j, mat[i, j]))
    mat[i,:], mat[:,j] = 0, 0
Will return:
[(1, 1, 0.46331584), (0, 0, 0.1656948), (2, 2, 0.12189645)]
However, since your version calls np.argsort only once, it will most probably be faster.
Otherwise, I would rewrite your version, for conciseness, as:
sortArr = list(zip(*np.unravel_index(np.argsort(-cosMat.ravel()), cosMat.shape)))
matches = []
rows, cols = set(), set()
for x, y in sortArr:  # x is the row (sent1) index, y the column (sent2) index
    if x not in rows and y not in cols:
        matches.append((x, y, cosMat[x, y]))
        rows.add(x)
        cols.add(y)
You could use a single set instead of two, by using some kind of prefix for the indices in order to distinguish rows from columns. Here again I'm not sure there's much gain in doing so:
matches = []
matched = set()
for x, y in sortArr:
    if 'i%i' % x not in matched and 'j%i' % y not in matched:
        matches.append((x, y, cosMat[x, y]))
        matched.update(['i%i' % x, 'j%i' % y])
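As a rough sketch of the same greedy idea, the loop can also stop as soon as every row (or every column) has been paired, which avoids scanning the rest of the sorted scores. The function name greedy_align is made up for this illustration, and the result is still a greedy pairing, not a globally optimal assignment.

```python
import numpy as np

def greedy_align(sim):
    """Greedily pair rows and columns of `sim`, best score first (hypothetical helper)."""
    n_pairs = min(sim.shape)                 # no more pairs than the shorter sentence allows
    order = np.argsort(-sim, axis=None)      # flat indices sorted by descending score
    rows, cols = np.unravel_index(order, sim.shape)
    used_rows, used_cols = set(), set()
    matches = []
    for r, c in zip(rows.tolist(), cols.tolist()):
        if r in used_rows or c in used_cols:
            continue                         # one of the two tokens is already matched
        matches.append((r, c, float(sim[r, c])))
        used_rows.add(r)
        used_cols.add(c)
        if len(matches) == n_pairs:          # everything that can be paired is paired
            break
    return matches

cosMat = np.array([[0.1656948, 0.16653526, 0.13380264, 0.09286133, 0.16262592, 0.14392284],
                   [0.40876892, 0.46331584, 0.28574535, 0.34924293, 0.2480594, 0.25846344],
                   [0.15394737, 0.10269377, 0.12189645, 0.09426117, 0.09631223, 0.10549664]],
                  dtype=np.float32)
matches = greedy_align(cosMat)
```

For the matrix in the question this yields the pairs (1, 1), (0, 0) and (2, 2), in that order, the same alignment as the versions above.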

Related

Efficiently get first N numbers that satisfy a condition in each row in a pytorch/numpy tensor

Given a tensor b, I would like to extract N elements from each row that satisfy a specific condition. For example, suppose a is a matrix that indicates whether each element in b satisfies the condition. I would like to extract N elements from each row whose corresponding value in a is 1.
There are two scenarios: (1) extract the first N such elements in each row, in order; (2) randomly sample N elements in each row from among all the elements that satisfy the condition.
Is there an efficient way to achieve these two cases in pytorch or numpy? Thanks!
Below I give an example that shows the first case.
import torch
# given
a = torch.tensor([[1, 0, 0, 1, 1, 1], [0, 1, 0, 1, 1, 1], [1,1,1,1,1,0]])
b = torch.arange(18).view(3,6)
# suppose N=3
# output:
c = torch.tensor([[0, 3,4],[7,9,10], [12,13,14]])
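One possible NumPy sketch covering both scenarios is given below. It assumes every row has at least N entries that satisfy the condition; the helper names first_n_where and sample_n_where are invented for this example. The first case relies on a stable sort, which moves the selected columns to the front of each row without changing their order; the second case sorts random keys instead.

```python
import numpy as np

def first_n_where(values, mask, n):
    """First `n` entries of each row of `values` where `mask` is true (hypothetical helper)."""
    mask = np.asarray(mask, dtype=bool)
    # False sorts before True, so sorting ~mask puts selected columns first;
    # kind="stable" keeps their original left-to-right order.
    order = np.argsort(~mask, axis=1, kind="stable")
    return np.take_along_axis(values, order[:, :n], axis=1)

def sample_n_where(values, mask, n, rng):
    """Random `n` entries of each row of `values` where `mask` is true (hypothetical helper)."""
    mask = np.asarray(mask, dtype=bool)
    # Selected entries get a random key in [0, 1); the rest get 2.0 and sort last.
    keys = np.where(mask, rng.random(mask.shape), 2.0)
    order = np.argsort(keys, axis=1)
    return np.take_along_axis(values, order[:, :n], axis=1)

a = np.array([[1, 0, 0, 1, 1, 1], [0, 1, 0, 1, 1, 1], [1, 1, 1, 1, 1, 0]])
b = np.arange(18).reshape(3, 6)
c = first_n_where(b, a, 3)
sampled = sample_n_where(b, a, 3, np.random.default_rng(0))
```

Here c matches the expected output from the question; sampled differs from run to run unless the generator is seeded.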

select from multi-dimensional arrays based on a given condition

There is an ndarray data with shape, e.g., (5, 10, 2), and two lists, x1 and x2, both of size 10. I want to select a subset of data based on the following condition, applied across the second dimension:
If x1[i] <= data[j, i, 0] <= x2[i], then we select data[j, i, :]
I tried selected = data[x1<=data[:,:,0]<=x2], but it does not work. I am not sure what the efficient (or vectorized) way to implement this condition-based selection is.
The code below selects all values in data whose index along the third dimension is 0 (i.e. each value has an index of the form data[i, j, 0]) and whose value is <= the corresponding x2 and >= the corresponding x1:
idx = np.where(np.logical_and(data[:, :, 0] >= np.array(x1), data[:, :, 0] <= np.array(x2)))
# data[idx] contains the full rows of length 2 rather than just the 0th column, so we need to select the 0th column.
selected = data[idx][:, 0]
The code assumes that x1 and x2 are lists with lengths equal to the size of data's second dimension (in this case, 10). Note that the code only returns the values, not the indices of the values.
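If the full length-2 entries data[j, i, :] are wanted instead of only the 0th column, a boolean mask can be used directly. The sketch below uses made-up random data and bounds purely for illustration. It also shows why the chained comparison from the question fails: Python evaluates x1 <= arr <= x2 as (x1 <= arr) and (arr <= x2), and the and needs a single truth value for a whole array, which NumPy refuses to give.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((5, 10, 2))   # illustrative data, same shape as in the question
x1 = [0.2] * 10                 # hypothetical lower bounds, one per column
x2 = [0.8] * 10                 # hypothetical upper bounds, one per column

first = data[:, :, 0]           # shape (5, 10)
# The bounds have shape (10,), so they broadcast along the second dimension.
mask = (first >= np.asarray(x1)) & (first <= np.asarray(x2))
selected = data[mask]           # shape (k, 2): one full entry per True in mask
```

np.argwhere(mask) gives the matching (j, i) index pairs if the positions are needed as well.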
Let me know if you have any questions.

How to find matrix common members of matrices in Numpy

I have a 2D matrix A and a vector B. I want to find all row indices of elements in A that are also contained in B.
A = np.array([[1,9,5], [8,4,9], [4,9,3], [6,7,5]], dtype=int)
B = np.array([2, 4, 8, 10, 12, 18], dtype=int)
My current solution compares A to one element of B at a time, but that is horribly slow:
res = np.array([], dtype=int)
for i in range(B.shape[0]):
    cres, _ = (B[i] == A).nonzero()
    res = np.append(res, cres)
res = np.unique(res)
The following Matlab statement would solve my issue:
find(any(reshape(any(reshape(A, prod(size(A)), 1) == B, 2),size(A, 1),size(A, 2)), 2))
However, comparing a row and a column vector in Numpy does not create a Boolean intersection matrix as it does in Matlab.
Is there a proper way to do this in Numpy?
We can use np.isin masking.
To get all the row numbers, it would be -
np.where(np.isin(A,B).T)[1]
If you need them split based on each element's occurrence -
[np.flatnonzero(i) for i in np.isin(A,B).T if i.any()]
Posted MATLAB code seems to be doing broadcasting. So, an equivalent one would be -
np.where(B[:,None,None]==A)[1]
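If only the unique row indices are needed, which is what the loop in the question ends up computing with np.unique, a short sketch is to reduce the np.isin mask along each row:

```python
import numpy as np

A = np.array([[1, 9, 5], [8, 4, 9], [4, 9, 3], [6, 7, 5]], dtype=int)
B = np.array([2, 4, 8, 10, 12, 18], dtype=int)

# True for every row of A that contains at least one value from B
row_has_match = np.isin(A, B).any(axis=1)
rows = np.flatnonzero(row_has_match)   # rows 1 and 2 for this example
```

The indices are 0-based, so they are one less than what the MATLAB find call would report.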

How to perform matching between two sequences?

I have two mini-batches of sequences:
a = C.sequence.input_variable((10))
b = C.sequence.input_variable((10))
Both a and b have variable-length sequences.
I want to do matching between them, where matching is defined as: match (e.g. dot product) the token at each time step of a with the token at every time step of b.
How can I do this?
I have mostly answered this on GitHub, but to be consistent with SO rules I am including a response here. In the case of something simple like a dot product, you can take advantage of the fact that it factorizes nicely, so the following code works
axisa = C.Axis.new_unique_dynamic_axis('a')
axisb = C.Axis.new_unique_dynamic_axis('b')
a = C.sequence.input_variable(1, sequence_axis=axisa)
b = C.sequence.input_variable(1, sequence_axis=axisb)
c = C.sequence.broadcast_as(C.sequence.reduce_sum(a), b) * b
c.eval({a: [[1, 2, 3],[4, 5]], b: [[6, 7], [8]]})
[array([[ 36.],
[ 42.]], dtype=float32), array([[ 72.]], dtype=float32)]
In the general case you need the following steps
static_b, mask = C.sequence.unpack(b, neutral_value).outputs
scores = your_score(a, static_b)
The first line converts the b sequence into a static tensor with one more axis than b. Because of packing, some elements of this tensor will be invalid, and those are indicated by the mask. The neutral_value is placed as a dummy value in the static_b tensor wherever data was missing. Depending on your score, you might be able to arrange for the neutral_value not to affect the final score (e.g. if your score is a dot product, 0 would be a good choice; if it involves a softmax, -infinity or something close to that would be a good choice). The second line now has access to each element of a and to all the elements of b along the first axis of static_b. For a dot product, static_b is a matrix and one element of a is a vector, so a matrix-vector multiplication results in a sequence whose elements are all the inner products between the corresponding element of a and all elements of b.
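To make the padding idea concrete, here is a plain NumPy illustration of what unpacking with a neutral value does conceptually. It is not CNTK code: the unpack function below is a hypothetical stand-in written for this example, and the score is assumed to be a dot product, so 0 is used as the neutral value.

```python
import numpy as np

def unpack(seqs, neutral_value=0.0):
    """Pad variable-length sequences of vectors into a static array plus a validity mask."""
    max_len = max(len(s) for s in seqs)
    dim = seqs[0].shape[1]
    static = np.full((len(seqs), max_len, dim), neutral_value, dtype=np.float32)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for k, s in enumerate(seqs):
        static[k, :len(s)] = s       # real data
        mask[k, :len(s)] = True      # everything beyond len(s) stays padding
    return static, mask

a_batch = [np.array([[1.], [2.], [3.]], dtype=np.float32),
           np.array([[4.], [5.]], dtype=np.float32)]
b_batch = [np.array([[6.], [7.]], dtype=np.float32),
           np.array([[8.]], dtype=np.float32)]

static_b, mask = unpack(b_batch)
# For each pair of sequences: every time step of a against every time step of b.
scores = [a_seq @ static_b[k].T for k, a_seq in enumerate(a_batch)]
```

Each scores[k] has one row per time step of a and one column per (padded) time step of b; columns where mask[k] is False hold only the contribution of the neutral value, which is 0 for a dot product.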

Numpy index array of unknown dimensions?

I need to compare a bunch of numpy arrays with different dimensions, say:
a = np.array([1,2,3])
b = np.array([1,2,3],[4,5,6])
assert(a == b[0])
How can I do this if I know nothing about the shapes of a and b, other than that
len(shape(a)) == len(shape(b)) - 1
and neither do I know which dimension to skip from b. I'd like to use np.index_exp, but that does not seem to help me ...
def compare_arrays(a,b,skip_row):
    u = np.index_exp[ ... ]
    assert(a[:] == b[u])
Edit
Or to put it otherwise, I want to construct the slicing if I know the shape of the array and the dimension I want to skip. How do I dynamically create the np.index_exp if I know the number of dimensions and the positions where to put ":" and where to put "0"?
I was just looking at the code for apply_along_axis and apply_over_axes, studying how they construct indexing objects.
Let's make a 4d array:
In [355]: b=np.ones((2,3,4,3),int)
Make a list of slices (using list * replicate)
In [356]: ind=[slice(None)]*b.ndim
In [357]: b[ind].shape # same as b[:,:,:,:]
Out[357]: (2, 3, 4, 3)
In [358]: ind[2]=2 # replace one slice with index
In [359]: b[ind].shape # a slice, indexing on the third dim
Out[359]: (2, 3, 3)
Or with your example
In [361]: b = np.array([1,2,3],[4,5,6]) # missing []
...
TypeError: data type not understood
In [362]: b = np.array([[1,2,3],[4,5,6]])
In [366]: ind=[slice(None)]*b.ndim
In [367]: ind[0]=0
In [368]: a==b[ind]
Out[368]: array([ True, True, True], dtype=bool)
This indexing is basically the same as np.take, but the same idea can be extended to other cases.
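A minimal sketch of the helper from the question built on this idea follows. The parameter names skip_axis and position are chosen here for illustration. One assumption worth stating: recent NumPy releases expect a multi-dimensional index to be a tuple rather than a list of slices, so the list is converted with tuple() before indexing.

```python
import numpy as np

def compare_arrays(a, b, skip_axis, position=0):
    """Compare `a` with the slice of `b` taken at `position` along `skip_axis`."""
    index = [slice(None)] * b.ndim     # equivalent to ":" on every axis
    index[skip_axis] = position        # replace one ":" with a fixed index
    return np.array_equal(a, b[tuple(index)])

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6]])
same = compare_arrays(a, b, skip_axis=0)

# The same selection expressed with np.take
taken = np.take(b, 0, axis=0)
```

np.array_equal returns a single bool, which avoids the ambiguity of asserting on an elementwise comparison.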
I don't quite follow your questions about the use of :. Note that when building an indexing list I use slice(None). The interpreter translates all indexing : into slice objects: [start:stop:step] => slice(start, stop, step).
Usually you don't need to use a[:]==b[0]; a==b[0] is sufficient. With lists alist[:] makes a copy, with arrays it does nothing (unless used on the RHS, a[:]=...).