Extract different rows from two numpy 2D arrays

I generated a new matrix B (50, 40) of random rows from a matrix A (100, 40):
B = A[np.random.randint(0, 100, size=50)]  # this works fine
Now I want to take the rows from A that aren't in matrix B:
C = A not in B  # pseudocode

This should do the job:
import numpy as np

A = np.random.randint(5, size=[100, 40])
l = np.random.choice(100, size=50, replace=False)   # 50 distinct row indices
B = A[l]                                            # the selected rows
C = A[np.setdiff1d(np.arange(0, 100), l)]           # the complementary rows
l stores the selected row indices, and for C you take the complement of l; C is then the required matrix.
Note that I used l = np.random.choice(100, size=50, replace=False) to avoid replacement. If you use np.random.randint(0, 100, size=50), you may get repeated rows, because the same index can be drawn more than once.
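As a quick sanity check, here is a small end-to-end sketch (shapes taken from the question) that performs the split and confirms the selected indices and their complement cover all 100 rows exactly once:
import numpy as np

A = np.random.randint(5, size=[100, 40])
l = np.random.choice(100, size=50, replace=False)   # 50 distinct row indices
rest = np.setdiff1d(np.arange(100), l)              # the other 50 indices
B = A[l]
C = A[rest]

assert B.shape == (50, 40) and C.shape == (50, 40)
assert np.array_equal(np.sort(np.concatenate([l, rest])), np.arange(100))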

Inspired by this question: Check whether each row of a matrix is in another matrix [Python]. First get the indices of the rows of A that exist in B, then take the difference from the full set of A's indices, and finally select rows using that difference.
index = np.argwhere((B[:, None, :] == A[:, :]).all(-1))[:, 1]   # A-indices of rows that appear in B
C = A[np.setdiff1d(np.arange(100), index)]                      # the remaining rows of A

The numpy_indexed package (Disclaimer: I am its author) has efficient vectorized functionality for all these kinds of operations.
import numpy_indexed as npi
C = npi.difference(A, B)


In numpy, is there a function to find the inverse of ix_?

With numpy, my goal is to select a square submatrix from a square matrix, and then also look at the collection of elements that are not in that first submatrix.
For the first submatrix, I'm using np.ix_:
import numpy as np
r = np.random.rand(3,3)
l = [1,2]
r[np.ix_(l, l)]
Then, r[np.ix_(l, l)] will pick out a 2x2 matrix, marked by **:
        col 0      col 1        col 2
row 0   r0,0       r0,1         r0,2
row 1   r1,0     **r1,1**     **r1,2**
row 2   r2,0     **r2,1**     **r2,2**
But now what is the best approach to select the difference between the submatrix and the parent matrix?
I have looked at:
~np.ix_ (by analogy with ~np.eye), but this doesn't seem to be supported;
np.subtract, but the problem is that I need to select the elements by their indices and not by their values.
Based on a comment by @hpaulj, I followed the approach with the numpy.ma submodule:
import numpy as np
r = np.random.rand(3,3)
l = [1,2]
r[np.ix_(l, l)]
import numpy.ma as ma
mask = ma.zeros(r.shape)    # 0 = keep the element, nonzero = mask it out
mask[np.ix_(l, l)] = 1      # mask the selected submatrix
Then, ma.compressed() gives the desired result:
ma.compressed(ma.array(r, mask=mask))   # the elements outside the submatrix, flattened in row-major order
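As a compact check of this masked-array route (a sketch with the 3x3 shape from the question), the compressed output should contain the 9 - 4 = 5 entries outside the l x l block, in row-major order:
import numpy as np
import numpy.ma as ma

r = np.random.rand(3, 3)
l = [1, 2]

mask = ma.zeros(r.shape)
mask[np.ix_(l, l)] = 1                          # mark the submatrix entries
rest = ma.compressed(ma.array(r, mask=mask))    # everything outside the block

expected = np.array([r[0, 0], r[0, 1], r[0, 2], r[1, 0], r[2, 0]])
assert np.allclose(rest, expected)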
Using np.ix_ here selects the same elements as the equivalent basic slice, but it triggers advanced indexing.
So it lets you fetch all the elements belonging to the 1st and 2nd rows and 1st and 2nd columns as a copy (basic indexing yields a view).
import numpy as np
r = np.random.rand(3,3)
l = [1,2]
r[np.ix_(l, l)]
array([[0.46899841, 0.49051596],
       [0.00256912, 0.86447371]])
Equivalent to np.ix_, using basic indexing (this is a view and not a copy!) -
r[1:3, 1:3]
array([[0.46899841, 0.49051596],
       [0.00256912, 0.86447371]])
If you, however, want to fetch only the (1,1) and (2,2) elements, then you can directly use advanced indexing as below -
r[l,l]
array([0.46899841, 0.86447371])
As you can see, this returns the diagonal elements you were after with the np.eye idea, for example.
Read more about how indexing (basic and advanced) works here, or check out a detailed answer where I explain this as well here.
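To make the view/copy distinction above concrete, here is a small demonstration (a sketch reusing the 3x3 example from the question): writing through the basic slice changes r, while writing through the np.ix_ result does not.
import numpy as np

r = np.random.rand(3, 3)
l = [1, 2]

view = r[1:3, 1:3]        # basic indexing -> view of r
copy = r[np.ix_(l, l)]    # advanced indexing via np.ix_ -> copy

view[0, 0] = -1.0
copy[0, 1] = -2.0

assert r[1, 1] == -1.0    # the write through the view is visible in r
assert r[1, 2] != -2.0    # the write through the copy is not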

How to obtain a matrix by adding a new row vector in each iteration?

I am generating arrays (technically they are row vectors) with a for-loop. a, b, c ... are the outputs.
Can I stack each new array onto the previous ones to form a matrix?
import numpy as np
# just for example:
a = np.asarray([2,5,8,10])
b = np.asarray([1,2,3,4])
c = np.asarray([2,3,4,5])
... ... ... ... ...
I have tried ab = np.stack((a, b)), and this works. But my idea is to keep adding a new row to the existing matrix in each loop iteration, and abc = np.stack((ab, c)) then raises ValueError: all input arrays must have the same shape.
Can anyone tell me how I could add another vector to an already existing matrix? I couldn't find a perfect answer in this forum.
np.stack won't work here; you can only stack arrays with the same shape.
One possible solution is to use np.concatenate((original, to_append), axis = 0) each time. Check the docs.
You can also try using np.append.
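Here is a sketch of the grow-by-concatenation idea above, using a, b, c from the question; note that a 1-D vector needs a leading axis before np.concatenate along axis 0 will accept it:
import numpy as np

a = np.asarray([2, 5, 8, 10])
b = np.asarray([1, 2, 3, 4])
c = np.asarray([2, 3, 4, 5])

mat = a[None, :]                                      # start as a 1 x 4 matrix
for row in (b, c):
    mat = np.concatenate((mat, row[None, :]), axis=0)

print(mat.shape)   # (3, 4)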
Thanks to everybody's ideas, the best solution to this problem is to append each array (or list) to a Python list during the iteration and convert that list to a matrix with np.asarray at the end.
a = np.asarray([2, 5, 8, 10])   # or a = [2, 5, 8, 10]
b = np.asarray([1, 2, 3, 4])    # b = [1, 2, 3, 4]
c = np.asarray([2, 3, 4, 5])    # c = [2, 3, 4, 5]
... ... ... ... ...
l1 = []
l1.append(a)
l1.append(b)
l1.append(c)
... ...
matrix = np.asarray(l1)         # convert the accumulated rows into a 2-D matrix at the end
l1 doesn't have to start empty; however, the elements it already contains should be of the same type and shape as a, b, c.
For example, the difference between l1 = [1,1,1,1] and l1 = [[1,1,1,1]] is huge in this case.
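To illustrate that last point (a sketch using the example rows from the question): a pre-existing row-like element works, while a flat list of scalars would not.
import numpy as np

a = np.asarray([2, 5, 8, 10])
b = np.asarray([1, 2, 3, 4])

l1 = [[1, 1, 1, 1]]           # an existing element shaped like a row
l1.append(a)
l1.append(b)
print(np.asarray(l1).shape)   # (3, 4)

# l1 = [1, 1, 1, 1] would instead mix four scalars with length-4 rows,
# and np.asarray could no longer build a regular 2-D matrix from it.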

Efficient way to calculate the pairwise matrix product between one tensor and all the rolling of another tensor

Suppose we have two tensors:
tensor A whose shape is (d,m,n)
tensor B whose shape is (d,n,l).
If we want the pairwise matrix product of the trailing matrices of A and B for every pair of leading indices, I think we can use np.einsum('dmn,...nl->d...ml', A, B), whose result has shape (d, d, m, l). However, I would like the pairwise product of only some of the pairs.
Given a parameter k, 1 <= k <= d, I want the following pairwise matrix products:
A(0,...) @ B(0,...), ..., A(0,...) @ B(k-1,...);
A(1,...) @ B(1,...), ..., A(1,...) @ B(k,...);
...
A(d-2,...) @ B(d-2,...), A(d-2,...) @ B(d-1,...), ..., A(d-2,...) @ B(k-3,...);
A(d-1,...) @ B(d-1,...), ..., A(d-1,...) @ B(k-2,...).
Note that we treat tensor B in a rolling fashion (like numpy.roll), so the B index wraps around modulo d.
Finally, we actually get a tensor whose shape is (d, k, m, l).
What's the most efficient way to do this?
I know several ways, for example:
First get np.einsum('dmn,...nl->d...ml', A, B), then use a mask to extract the (d, k) pairs.
Tile B first, then use einsum in some way.
But I think there exists a better way.
I doubt you can do much better than a for loop. Here is, for example, a vectorized version using einsum and stride_tricks compared to a double for loop:
Code:
from simple_benchmark import BenchmarkBuilder, MultiArgument
import numpy as np
from numpy.lib.stride_tricks import as_strided

B = BenchmarkBuilder()   # note: this B is the benchmark builder; the tensors A, B below are function arguments

@B.add_function()
def loopy(A, B, k):
    d, m, n = A.shape
    l = B.shape[-1]
    out = np.empty((d, k, m, l), int)
    for i in range(d):
        for j in range(k):
            out[i, j] = A[i] @ B[(i + j) % d]
    return out

@B.add_function()
def vectory(A, B, k):
    d, m, n = A.shape
    l = B.shape[-1]
    BB = np.concatenate([B, B[:k - 1]], 0)                               # unroll the wrap-around once
    BB = as_strided(BB, (d, k, n, l), np.repeat(BB.strides, (2, 1, 1)))  # rolling windows over the first axis
    return np.einsum("ikl,ijln->ijkn", A, BB)

@B.add_arguments('d x k x m x n x l')
def argument_provider():
    for exp in range(10):
        d, k, m, n, l = (np.r_[1.6, 1.5, 1.5, 1.5, 1.5]**exp * (4, 2, 2, 2, 2)).astype(int)
        print(d, k, m, n, l)
        A = np.random.randint(0, 10, (d, m, n))
        B = np.random.randint(0, 10, (d, n, l))
        yield k * d * m * n * l, MultiArgument([A, B, k])

r = B.run()
r.plot()

import pylab
pylab.savefig('diagwa.png')
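For readers who just want the rolled product without the benchmarking harness, here is a minimal, self-contained sketch (small shapes chosen arbitrarily) that runs both versions from above and checks they agree:
import numpy as np
from numpy.lib.stride_tricks import as_strided

d, k, m, n, l = 5, 3, 2, 4, 2
A = np.random.randint(0, 10, (d, m, n))
B = np.random.randint(0, 10, (d, n, l))

# explicit double loop over the (d, k) pairs
out = np.empty((d, k, m, l), int)
for i in range(d):
    for j in range(k):
        out[i, j] = A[i] @ B[(i + j) % d]

# strided einsum version from the answer above
BB = np.concatenate([B, B[:k - 1]], 0)
BB = as_strided(BB, (d, k, n, l), np.repeat(BB.strides, (2, 1, 1)))
vec = np.einsum("ikl,ijln->ijkn", A, BB)

assert np.array_equal(out, vec)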

Combine Sklearn TFIDF with Additional Data

I am trying to prepare data for supervised learning. I have my TF-IDF data, which was generated from the column kws_name_desc of my dataframe merged:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1, 2))
X = vect.fit_transform(merged['kws_name_desc'])
print(X.shape)
print(type(X))
(57629, 11947)
<class 'scipy.sparse.csr.csr_matrix'>
But I also need to add additional columns to this matrix. For each document in the TFIDF matrix, I have a list of additional numeric features. Each list is length 40 and it's comprised of floats.
So to clarify, I have 57,629 lists of length 40 which I'd like to append to my TF-IDF result.
Currently, I have this in a DataFrame column, merged["other_data"]. Below is an example row from merged["other_data"]:
0.4329597715,0.3637511039,0.4893141843,0.35840...
How can I append the 57,629 rows of my dataframe column with the TF-IDF matrix? I honestly don't know where to begin and would appreciate any pointers/guidance.
This will do the job:
df1 = pd.DataFrame(X.toarray())        # convert the sparse TF-IDF matrix to a dense array
df2 = YOUR_DF                          # your DataFrame of size 57k x 40
newDf = pd.concat([df1, df2], axis=1)  # newDf is the required dataframe
I figured it out:
First: iterate over my pandas column and create a list of lists
for_np = []
for x in merged['other_data']:
    row = x.split(",")
    row2 = [float(v) for v in row]   # parse the comma-separated string into 40 floats
    for_np.append(row2)
Then create a np array:
n = np.array(for_np)
Then use scipy.sparse.hstack on X (my original TF-IDF sparse matrix) and my new matrix. I'll probably end up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!
import scipy.sparse
X = scipy.sparse.hstack([X, n])   # sparse TF-IDF columns followed by the 40 numeric columns
You could have a look at the answer to this question:
use Featureunion in scikit-learn to combine two pandas columns for tfidf
Obviously, the answers given should work, but as soon as you want your classifier to make predictions, you definitely want to work with pipelines and feature unions.
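As a hedged sketch of that pipeline idea (column names follow the question; the 40 floats are assumed to be pre-split from merged["other_data"] into hypothetical columns f0..f39, and LogisticRegression is just a placeholder classifier):
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

numeric_cols = [f"f{i}" for i in range(40)]    # assumed names for the 40 pre-split floats
preprocess = ColumnTransformer([
    ("tfidf", TfidfVectorizer(stop_words="english", min_df=50, ngram_range=(1, 2)),
     "kws_name_desc"),                         # the text column -> sparse TF-IDF features
    ("numeric", "passthrough", numeric_cols),  # keep the 40 numeric columns as-is
])
model = Pipeline([("features", preprocess), ("clf", LogisticRegression(max_iter=1000))])
# model.fit(merged, y) and model.predict(new_df) then share one fitted feature pipeline.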

How can I replace the summing in numpy matrix multiplication with concatenation in a new dimension?

For each location in the result matrix, instead of storing the dot product of the corresponding row and column of the argument matrices, I would like to store their element-wise product, which will be a vector extending into a third dimension.
One idea would be to convert the argument matrices to vectors with vector entries, and then take their outer product, but I'm not sure how to do this either.
EDIT:
I figured it out before I saw there was a reply. Here is my solution:
def newdot(A, B):
    # give both 2-D operands a leading length-1 axis
    A = A.reshape((1,) + A.shape)
    B = B.reshape((1,) + B.shape)
    # move the shared axis to the front: A -> (n, m, 1), B -> (n, 1, p)
    A = A.transpose(2, 1, 0)
    B = B.transpose(1, 0, 2)
    return A * B   # broadcasts to (n, m, p); entry [k, i, j] is the original A[i, k] * B[k, j]
What I am doing is taking apart each column of A and row of B whose outer product is needed, stacking them along a new leading axis, and letting broadcasting compute all of those outer products in parallel.
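A minimal check of newdot (a sketch with small arbitrary shapes): summing over the first axis of its result should recover the ordinary matrix product.
import numpy as np

A = np.arange(6).reshape(2, 3)
B = np.arange(6).reshape(3, 2)

stacked = newdot(A, B)        # newdot as defined above; shape (3, 2, 2)
assert np.array_equal(stacked.sum(axis=0), A @ B)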
It's a little convoluted (and difficult to explain) but this function should get you what you're looking for:
def f(m1, m2):
    # m1, m2 are np.matrix objects; .A exposes the underlying ndarray
    return m2.A.T * m1.A.reshape(m1.shape[0], 1, m1.shape[1])

m3 = m1 * m2                       # ordinary matrix product
m3_el = f(m1, m2)                  # element-wise products kept along a third axis
m3[i, j] == sum(m3_el[i, j, :])    # holds for every valid i, j
m3 == m3_el.sum(2)                 # summing over the third axis recovers m3
The basic idea is to turn the matrices into arrays and do element-by-element multiplication. One of the arrays gets reshaped to have a size of one in its middle dimension, and array broadcasting rules expand this dimension out to match the height of the other array.
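And the same kind of quick check for f (a sketch with small np.matrix inputs): summing over the third axis reproduces the matrix product.
import numpy as np

m1 = np.matrix(np.arange(6).reshape(2, 3))
m2 = np.matrix(np.arange(6).reshape(3, 2))

m3_el = f(m1, m2)             # f as defined above; shape (2, 2, 3)
assert np.allclose(m3_el.sum(2), (m1 * m2).A)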