select from multi-dimensional arrays based on a given condition - pandas

There is an ndarray data with shape, e.g., (5, 10, 2); and other two lists, x1 and x2. Both of size 10. I want select subset from data based on the following conditions,
Across the second dimension
If x1[i]<= data[j, i,0] <=x2[i], then we will select data[j, i,:]
I tried selected = data[x1<=data[:,:,0]<=x2]. It does not work. I am not clear what's the efficient (or vectorized) way to implement this condition-based selection.

The code below selects all values in data where the third dimension is 0 (i.e. each value has some index data[i, j, 0] and where the value is <= than the corresponding x2 and >= than the corresponding x1:
idx = np.where(np.logical_and(data[:, :, 0] >= np.array(x1), data[:, :, 0] <= np.array(x2)))
# data[idx] contains the full rows of length 2 rather than just the 0th column, so we need to select the 0th column.
selected = data[idx][:, 0]
The code assumes that x1 and x2 are lists with lengths equal to the size of data's second dimension (in this case, 10). Note that the code only returns the values, not the indices of the values.
Let me know if you have any questions.

Related

Merge masked selection of array with original array

I'm facing a problem with an assignment at the moment.
So I have an array which contains 400 2d Points. So an array of shape 400 X 2.
Then I have a mask that selects m points (rows) that I wanna compute some changes on.
As per the assignment I'm supposed to store the points that I want to change in an array of shape m X 2.
Then I do my changes on this resulting array. But now after the changes I want to insert these new computed values in my original array at the original indices. And I just have no clue how to do that.
So I basically have:
orig (400 X 2)
mask (400 X 1) (boolean mask selecting the rows to edit)
change (m X 2) (just the changes I want to add)
changed (m X 2) (the original values + the change (with a factor applied) added together
How do I transform my change or changed arrays with the mask so that I can add/insert the changes into my original array?
Look at this example with 4 rows.
The principle is that the mask that "extract" from orig can also return the sub-array to the original place.
import numpy as np
x = np.array([[1,2],[3,4],[5,6],[7,8]])
print(x)
mask_ix = np.array([True,False, True, False])
masked = x[mask_ix,:]
masked = masked * 10 # the change
print(masked)
x[mask_ix] = masked # return to the original x in the mask_ix mask
print(x)
x =[[1 2]
[3 4]
[5 6]
[7 8]]
masked = [[10 20]
[50 60]]
x = [[10 20]
[ 3 4]
[50 60]
[ 7 8]]

Aligning words in 2 sentences using a numpy 2d array

Given 2 sentences, I need to align the words in those sentences based on the best similarity match between the words in these sentences.
For instance, consider 2 sentences:
sent1 = "John saw Mary" # 3 tokens
sent2 = "All the are grown by farmers" # 6 tokens
Here, for each token in sent1, I need to find the most similar token in sent2. Further, if a token in sent2 is already matched with a token in sent1, then it cannot be matched with another token in sent1.
For the purpose, I use a similarity matrix between the tokens in a sentence, as given below
cosMat = (array([[0.1656948 , 0.16653526, 0.13380264, 0.09286133, 0.16262592,
0.14392284],
[0.40876892, 0.46331584, 0.28574535, 0.34924293, 0.2480594 ,
0.25846344],
[0.15394737, 0.10269377, 0.12189645, 0.09426117, 0.09631223,
0.10549664]], dtype=float32)
cosMat is a 2d ndarray of size (3,6) which contain the cosine similarity scores of the tokens in both the sentences.
np.argmax would provide the following array as output
np.argmax(cosMat,axis=1)
array([1, 1, 0]))
However this is not a valid solution, as the first and second tokens of sent1 aligns with second token of sent2.
Instead I chose to do the following:
sortArr = np.dstack(np.unravel_index(np.argsort(-cosMat.ravel()), cosMat.shape))
rowSet = set()
colSet = set()
matches = list()
for item in sortArr[0]:
if item[1] not in colSet:
if item[0] not in rowSet:
matches.append((item[0],item[1],cosMat[item[0],item[1]]))
colSet.add(item[1])
rowSet.add(item[0])
matches
This gives the following output, which is the desirable output:
[(1, 1, 0.46331584), (0, 0, 0.1656948), (2, 2, 0.121896446)]
My question is, is there a more efficient way to achieve, for what I have done using the code above?
Here's an alternative, it requires you to copy the initial similarity matrix. Everytime you find the best match, you discard the two tokens from the pair by replacing the correspond row and column in the copied matrix by 0. This ensures you do not find the same token in multiple pairs.
res = []
mat = np.copy(cosMat)
for _ in range(mat.shape[0]):
i, j = np.unravel_index(mat.argmax(), mat.shape)
res.append((i, j, mat[i, j]))
mat[i,:], mat[:,j] = 0, 0
Will return:
[(1, 1, 0.46331584), (0, 0, 0.1656948), (2, 2, 0.12189645)]
However, considering you only using np.argsort once. Yours will, most probably, be faster.
Otherwise, I would rewrite your version, for conciseness, as:
sortArr = zip(*np.unravel_index(np.argsort(-cosMat.ravel()), cosMat.shape))
matches = []
rows, cols = set(), set()
for x, y in sortArr:
if x not in cols and y not in rows:
matches.append((x, y, cosMat[x, y]))
cols.add(x)
rows.add(y)
You could use a single set instead of two, by using some kind of prefix for the indices in order to distinguish rows from columns. Here again I'm not sure there's much gain in doing so:
matches = []
matched = set()
for x, y in sortArr:
if 'i%i'%x not in matched and 'j%i'%y not in matched:
matches.append((x, y, cosMat[x, y]))
matched.update(['i%s'%x, 'j%s'%y])

Isn't this a row vector? [duplicate]

I know that numpy array has a method called shape that returns [No.of rows, No.of columns], and shape[0] gives you the number of rows, shape[1] gives you the number of columns.
a = numpy.array([[1,2,3,4], [2,3,4,5]])
a.shape
>> [2,4]
a.shape[0]
>> 2
a.shape[1]
>> 4
However, if my array only have one row, then it returns [No.of columns, ]. And shape[1] will be out of the index. For example
a = numpy.array([1,2,3,4])
a.shape
>> [4,]
a.shape[0]
>> 4 //this is the number of column
a.shape[1]
>> Error out of index
Now how do I get the number of rows of an numpy array if the array may have only one row?
Thank you
The concept of rows and columns applies when you have a 2D array. However, the array numpy.array([1,2,3,4]) is a 1D array and so has only one dimension, therefore shape rightly returns a single valued iterable.
For a 2D version of the same array, consider the following instead:
>>> a = numpy.array([[1,2,3,4]]) # notice the extra square braces
>>> a.shape
(1, 4)
Rather then converting this to a 2d array, which may not be an option every time - one could either check the len() of the tuple returned by shape or just check for an index error as such:
import numpy
a = numpy.array([1,2,3,4])
print(a.shape)
# (4,)
print(a.shape[0])
try:
print(a.shape[1])
except IndexError:
print("only 1 column")
Or you could just try and assign this to a variable for later use (or return or what have you) if you know you will only have 1 or 2 dimension shapes:
try:
shape = (a.shape[0], a.shape[1])
except IndexError:
shape = (1, a.shape[0])
print(shape)

Transform a numpy 3D ndarray to a symmetric form with respect to a specific index

In the case of a matrix mat n x n, i can do the following
sym = 0.5 * (mat + mat.T)
the operation gives the desired result sym[i,j] = sym[j,i]
Suppose we have a 3D array ndarr[i,j,k], where i,j,k 0,1,...n,
then ndarr is n x n x n. The idea is to obtain the following "symmetric" form
nsym[i,j,k] = nsym[j,i,k] using ndarr. I tried this:
import numpy as np
# Generate some random matrix, n = 5
ndarr = np.random.beta(0.1,1,(5,5,5))
# First attempt to symmetrize
sym1 = np.array([0.5*(ndarr[:,:,k]+ndarr[:,:,k].T) for k in range(5)])
The problem here is that sym1[i,j,k] != sym1[j,i,k] as it is required. In fact I obtain sym1[i,j,k] = sym1[i,k,j], symmetric under the exchange of the last two symbols!
# Second attempt
sym2 = 0.5*(ndarr+ndarr.T)
Same problem here and sym2 is symmetric with respect the second index sym2[i,j,k]=sym2[k,j,i].
To resume, the goal is to find a symmetric form for a 3D array with respect to the third index and to preserve the values in the diagonal for the original ndarr[i,i,i].
The problem here is that you're not using the correct transpose:
sym = 0.5 * (ndarr + np.transpose(ndarr, (1, 0, 2)))
By default, np.transpose and the .T property will reverse the order of the axes. In your case, we want to only flip the first two axes: (0,1,2) -> (1,0,2).
EDIT: The reason your first attempt failed is because you were concatenating each symmetrized matrix along the first axis, not the last. It's more clear if you make ndarr with shape (5, 5, 3):
In [16]: sym = np.array([0.5*(ndarr[:,:,k]+ndarr[:,:,k].T) for k in range(3)])
In [17]: sym.shape
Out[17]: (3L, 5L, 5L)
In any case, the version above with np.transpose is cleaner and more efficient.

combine 2-D arrays of unknown sizes to make one 3-D

I have a function, say peaksdetect(), that will generate a 2-D array of unknown number of rows; I will call it a few times, let's say 3 and I would like to make of these 3 arrays, one 3-D array. Here is my start but it is very complicated with a lot of if statements, so I want to make things simpler if possible:
import numpy as np
dim3 = 3 # the number of times peaksdetect() will be called
# it is named dim3 because this number will determine
# the size of the third dimension of the result 3-D array
for num in range(dim3):
data = peaksdetect(dataset[num]) # generates a 2-D array of unknown number of rows
if num == 0:
3Darray = np.zeros([dim3, data.shape]) # in fact the new dimension is in position 0
# so dimensions 0 and 1 of "data" will be
# 1 and 2 respectively
else:
if data.shape[0] > 3Darray.shape[1]:
"adjust 3Darray.shape[1] so that it equals data[0] by filling with zeroes"
3Darray[num] = data
else:
"adjust data[0] so that it equals 3Darray.shape[1] by filling with zeroes"
3Darray[num] = data
...
If you are counting on having to resize your array, there is very likely not going to be much to be gained by preallocating it. It will probably be simpler to store your arrays in a list, then figure out the size of the array to hold them all, and dump the data into it:
data = []
for num in range(dim3):
data.append(peaksdetect(dataset[num]))
shape = map(max, zip(*(j.shape for j in data)))
shape = (dim3,) + tuple(shape)
data_array = np.zeros(shape, dtype=data[0].dtype)
for j, d in enumerate(data):
data_array[j, :d.shape[0], :d.shape[1]] = d