Indexing on axis by list in PyTorch

I have Variables lengths_X of size (10L,) and A of size (10L, 16L, 5L).
I want to use lengths_X to index along the second axis of A. In other words, I want a new tensor predicted_Y of size (10L, 5L) where predicted_Y[i] = A[i, lengths_X[i]].
What is the best way to do this in PyTorch?

What you are looking for is usually called batched_index_select. I looked for such functionality before but couldn't find a native PyTorch function that does the job. We can, however, simply use:
import numpy
import torch

A = torch.randn(10, 16, 5)
index = torch.from_numpy(numpy.random.randint(0, 16, size=10))  # one index per entry along axis 0
B = torch.stack([a[i] for a, i in zip(A, index)])  # B[j] = A[j, index[j]]; shape (10, 5)
You can see the discussion here. You can also check out the function batched_index_select provided in the AllenNLP library. I would be happy to know if there is a better solution.
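If you want to avoid the Python-level loop, a vectorized equivalent can be built on torch.gather. This is a minimal sketch, assuming index holds one position along axis 1 per entry along axis 0:
import torch

A = torch.randn(10, 16, 5)
index = torch.randint(0, 16, (10,))
# Expand index to shape (10, 1, 5) so it selects one full row of 5 features
# per entry along axis 0, then gather along axis 1 and drop the singleton axis.
B = torch.gather(A, 1, index.view(-1, 1, 1).expand(-1, 1, A.size(2))).squeeze(1)
# B has shape (10, 5) and B[j] == A[j, index[j]]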

Related

Applying LSA on a term-document matrix when the number of documents is very small

I have a term-document matrix (X) of shape (6, 25931). The first 5 documents are my source documents and the last document is my target document. The columns represent counts for the different words in the vocabulary. I want to get the cosine similarity of the last document with each of the other documents.
But since SVD produces an S of size (min(6, 25931),), if I use S to reduce my X, I get a 6 × 6 matrix. In this case, I feel that I will be losing too much information, since I am reducing a vector of size (25931,) to one of size (6,).
And when you think about it, the number of documents will usually be smaller than the number of vocabulary words. In that case, using SVD to reduce dimensionality will always produce vectors of size (n_documents,).
According to everything that I have read, when SVD is used like this on a term-document matrix, it's called LSA.
Am I implementing LSA correctly?
If this is correct, then is there any other way to reduce the dimensionality and get denser vectors where the size of the compressed vector is greater than (6,)?
P.S.: I also tried using fit_transform from sklearn.decomposition.TruncatedSVD, which expects input of shape (n_samples, n_features); this is why the shape of my term-document matrix is (6, 25931) and not (25931, 6). I kept getting a (6, 6) matrix, which initially confused me, but it makes sense once I remembered the math behind SVD.
If the objective of the exercise is to find the cosine similarity, then the following approach can help. This answer only attempts to solve for that objective; it does not comment on the definitions of Latent Semantic Analysis or Singular Value Decomposition mentioned by the questioner.
Let us first import all the required libraries. Please install them if they are not already on your machine.
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
Let us generate some sample data for this exercise.
df = {'sentence': ['one two three','two three four','four five','six seven eight nine ten']}
df = pd.DataFrame(df, columns = ['sentence'])
The first step is to get the exhaustive list of all the possible features, so collate all of the content in one place.
all_content = [' '.join(df['sentence'])]
Let us build a vectorizer and fit it now. Note that the vectorizer's arguments are not explained here, as the focus is on solving the problem.
vectorizer = TfidfVectorizer(encoding='latin-1', norm='l2', min_df=0.03, ngram_range=(1, 2), max_features=5000)
vectorizer.fit(all_content)
We can inspect the vocabulary to see if it makes sense. If needed, one could add stop words to the vectorizer above and check that they are indeed suppressed.
print(vectorizer.vocabulary_)
Let us vectorize the sentences so we can apply cosine similarity.
s1Tokens = vectorizer.transform(df.iloc[1])  # second sentence
s2Tokens = vectorizer.transform(df.iloc[2])  # third sentence
Finally, the cosine similarity can be computed as follows.
cosine_similarity(s1Tokens, s2Tokens)
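Since the original question asks for the similarity of the last document against each of the other documents, a compact variant (a sketch reusing the df and vectorizer built above) transforms all sentences at once and compares the last row with the rest.
doc_vectors = vectorizer.transform(df['sentence'])  # rows align with df's rows
sims = cosine_similarity(doc_vectors[-1], doc_vectors[:-1])  # shape (1, n_documents - 1)
print(sims)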

Faster solution for sampling indices by value from an ndarray

I have some pretty large arrays to deal with, on the scale of (514, 514, 374). I want to randomly get an index based on its pixel value. For example, I need the 3-d index of a pixel with value equal to 1. So, I list all the possibilities with
indices = np.asarray(np.where(img_arr == 1)).T
This works perfectly, except that it runs very slowly, to an intolerable extent, since the array is so big. So my question is: is there a better way to do this? It would be even nicer if I could input a list of pixel values and get back a list of corresponding indices. For example, I want to sample the indices of the pixel values [0, 1, 2] and get back a list of indices like [[1, 2, 3], [53, 215, 11], [223, 42, 113]].
Since I am working with medical images, solutions using SimpleITK are also welcome. So feel free to share your thoughts, thanks.
import numpy as np

value = 1
# value_list = [1, 3, 5]  # you can also use a list of values -> *
n_samples = 3
n_subset = 500
# Create an example array
img_arr = np.random.randint(low=0, high=5, size=(10, 30, 20))
# Choose random positions within the array
idx_subset = np.array([np.random.randint(low=0, high=s, size=n_subset) for s in img_arr.shape]).T
# Get the values at the sampled positions
values_subset = img_arr[tuple(idx_subset[:, i] for i in range(img_arr.ndim))]
# Check which values match
idx_subset_matching_temp = np.where(values_subset == value)[0]
# idx_subset_matching_temp = np.argwhere(np.isin(values_subset, value_list)).ravel() -> *
# Get all the indices of the subset with the correct value(s)
idx_subset_matching = idx_subset[idx_subset_matching_temp, :]
# Shuffle the array of indices
np.random.shuffle(idx_subset_matching)
# Only keep as many as you need
idx_subset_matching = idx_subset_matching[:n_samples, :]
This gives you the desired samples. The distribution of those samples should be the same as with your method of looking at all matches in the array: in both cases you get a uniform distribution over all positions with matching values.
You have to be careful when choosing the size of the subset and the number of samples you want. The subset must be large enough that it contains enough matches for the values, otherwise it won't work.
A similar problem occurs if the values you want to sample are very sparse; then the size of the subset needs to be very large (in the edge case, the whole array) and you gain nothing.
If you are sampling often from the same array, it may also be a good idea to store the indices for each value
indices_i = np.asarray(np.where(img_arr == i)).T
and use those for your further computations.
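A minimal sketch of that precomputation idea, assuming the values of interest are known up front:
import numpy as np

img_arr = np.random.randint(low=0, high=5, size=(10, 30, 20))
# Precompute the index list once per value of interest.
indices_by_value = {v: np.asarray(np.where(img_arr == v)).T for v in (0, 1, 2)}
# Afterwards, drawing n random indices for a value is cheap.
candidates = indices_by_value[1]
sample = candidates[np.random.choice(len(candidates), size=3, replace=False)]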

In tensorflow, how to gather with indices matching params' first dimensions?

If I have indices of shape (D_0,...,D_k) and params of shape (D_0,...,D_k,I,F) (with 0 ≤ indices[i_0,...,i_k] < I), what is the fastest/most elegant way to get the array output of shape (D_0,...,D_k,F) with
output[i_0,...,i_k,f]=params[i_0,...,i_k,indices[i_0,...,i_k],f]
If k=0, then we can use gather. So, in the past, I had a solution based on flattening. Is there a nicer solution now that tensorflow has matured?
Most of the time, when I want this type of gathering, indices is obtained by indices = tf.argmax(params[..., 0], axis=-1). For every (i_0,...,i_k), I have I vectors of size (F,), and I want to keep only those with the maximal value for one of the features. A solution which would only work for this special case (a kind of reduce_max that uses only one feature to decide how to reduce) would also satisfy me.
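For what it's worth, newer TensorFlow versions (1.14 and later, if I remember correctly) accept a batch_dims argument on tf.gather, which appears to cover exactly this pattern. A minimal sketch for k=1, with shapes chosen purely for illustration:
import tensorflow as tf

D0, D1, I, F = 4, 3, 5, 2
params = tf.random.normal((D0, D1, I, F))
# One index into the I axis per (i_0, i_1) position, e.g. the argmax of feature 0.
indices = tf.argmax(params[..., 0], axis=-1)  # shape (D0, D1)
# Gather along the I axis, treating the first k+1 = 2 dimensions as batch dimensions.
output = tf.gather(params, indices, axis=2, batch_dims=2)  # shape (D0, D1, F)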

Numpy - Deep Learning, Training Examples

Silly question: I am going through the third week of Andrew Ng's newest deep learning course and am getting stuck on a fairly simple Numpy task (I think?).
The exercise is to find how many training examples, m, we have.
Any idea what the Numpy function is to find the size of a preloaded training set?
Thanks!
shape_X = X.shape
shape_Y = Y.shape
m = ?
print ('The shape of X is: ' + str(shape_X))
print ('The shape of Y is: ' + str(shape_Y))
print ('I have m = %d training examples!' % (m))
It depends on what kind of storage approach you use.
Most python-based tools use the [n_samples, n_features] layout, where the first dimension is the sample dimension and the second is the feature dimension (as in scikit-learn and co.). Put differently: samples are rows and features are columns.
So:
#              feature: 1  2  3  4
x = np.array([[1, 2, 3, 4],   # first sample
              [2, 3, 4, 5],   # second sample
              [3, 4, 5, 6]])  # third sample
is a training set of 3 samples with 4 features each.
You can get the sizes M, N (again: the interpretation might differ for other tools) with:
M, N = x.shape
because numpy's first dimension is rows and its second dimension is columns, as in matrix algebra.
For the above example, the target array is of shape (M,) = (n_samples,).
Anytime you want the total size of an array, you can use
m = X.size
Be careful, though: this returns the total number of elements (the product of all the dimensions), so it equals the number of training examples only when each example is a single value. In this course, X is laid out as (n_features, n_samples), so the number of examples is the second dimension:
m = X.shape[1]
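As a concrete illustration, here is a sketch assuming the course's features-by-samples layout:
import numpy as np

# Course convention: X is (n_features, n_samples), Y is (1, n_samples).
X = np.random.randn(2, 400)
Y = np.random.randn(1, 400)
m = X.shape[1]
print('I have m = %d training examples!' % m)  # m = 400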

sklearn: get feature names after L1-based feature selection

This question and answer demonstrate that when feature selection is performed using one of scikit-learn's dedicated feature selection routines, then the names of the selected features can be retrieved as follows:
np.asarray(vectorizer.get_feature_names())[featureSelector.get_support()]
For example, in the above code, featureSelector might be an instance of sklearn.feature_selection.SelectKBest or sklearn.feature_selection.SelectPercentile, since these classes implement the get_support method which returns a boolean mask or integer indices of the selected features.
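For reference, a minimal sketch of that pattern with SelectKBest (toy data, purely illustrative):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ['one two three', 'two three four', 'four five']
y = [0, 1, 1]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
featureSelector = SelectKBest(chi2, k=3).fit(X, y)
# Boolean mask from get_support() selects the surviving feature names.
print(np.asarray(vectorizer.get_feature_names())[featureSelector.get_support()])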
When one performs feature selection via linear models penalized with the L1 norm, it's unclear how to accomplish this. sklearn.svm.LinearSVC has no get_support method and the documentation doesn't make clear how to retrieve the feature indices after using its transform method to eliminate features from a collection of samples. Am I missing something here?
For sparse estimators you can generally find the support by checking where the non-zero entries are in the coefficients vector (provided a coefficients vector exists, which is the case for e.g. linear models):
support = np.flatnonzero(estimator.coef_)
For your LinearSVC with l1 penalty it would accordingly be
import numpy as np
from sklearn.svm import LinearSVC

svc = LinearSVC(C=1., penalty='l1', dual=False)
svc.fit(X, y)  # X, y: your training data
selected_feature_names = np.asarray(vectorizer.get_feature_names())[np.flatnonzero(svc.coef_)]
I've been using sklearn 0.15.2, and according to the LinearSVC documentation, coef_ is an array of shape [n_features] if n_classes == 2, else [n_classes, n_features].
So first, np.flatnonzero doesn't work for the multi-class case: coef_ is 2-D there, so the flattened indices run past n_features and you'll get an index-out-of-range error when indexing the feature names. Second, it should be np.where(svc.coef_ != 0)[1] instead of np.where(svc.coef_ != 0)[0]; axis 0 indexes classes, not features. I ended up using np.asarray(vectorizer.get_feature_names())[list(set(np.where(svc.coef_ != 0)[1]))]
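A slightly tidier version of that last expression (a sketch assuming a fitted multi-class svc and a vectorizer as above) uses np.unique to deduplicate the feature columns:
import numpy as np

# coef_ has shape (n_classes, n_features) in the multi-class case,
# so axis 1 of the non-zero positions indexes features.
selected_idx = np.unique(np.where(svc.coef_ != 0)[1])
selected_feature_names = np.asarray(vectorizer.get_feature_names())[selected_idx]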