scikit-learn PCA with unknown feature values - pandas

I want to use sklearn for PCA analysis (then regression and k-means clustering). I have a dataset with 20k features and 2000k rows. However, for each row only a subset (typically 5 or so of the 20k) of features has been measured.
How should I pad my pandas DataFrame / set up sklearn so that sklearn does not use features for the instances where the value has not been measured? (e.g. if I set null feature values to 0.0, would this distort the outcome?)
eg:
from sklearn.decomposition import PCA

X = array[:, 0:n]   # measured feature columns
Y = array[:, n]     # target column
pca = PCA()
fit = pca.fit(X)
If the dataset is padded with zeros for most feature values - then will pca be valid?

I see 3 options, though none fully solves your problem:
1) Replace the null values with 0, but that will definitely worsen your results;
2) Replace the unknown values with the mean or median of each feature; this might be better, but it will still give you a distorted PCA (a minimal sketch of this option follows the list);
3) Don't use PCA at all and look instead for a dimensionality-reduction technique designed for sparse data.
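For option 2, a rough sketch (not from the original answer) using sklearn's SimpleImputer, assuming unmeasured values are encoded as NaN in the DataFrame:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

# toy frame where NaN marks an unmeasured feature
df = pd.DataFrame({'f1': [1.0, np.nan, 7.0],
                   'f2': [np.nan, 2.0, 8.0],
                   'f3': [3.0, np.nan, 9.0]})

X_imputed = SimpleImputer(strategy='mean').fit_transform(df)  # column-mean imputation
X_reduced = PCA(n_components=2).fit_transform(X_imputed)
print(X_reduced.shape)  # (3, 2)
This makes option 2 runnable, but it does not remove the distortion described above.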

Related

How to see the indices of the split on the data that GridSearchCV used when it made the split?

When using GridSearchCV() to perform a k-fold cross-validation analysis on some data, is there a way to know which data was used for each split?
For example, assume the goal is to build a binary classifier of your choosing, named 'model'. There are 100 data points (rows) with 5 features each and an associated 1 or 0 target. 20 of the 100 data points are held out for testing after training and hyperparameter tuning; GridSearchCV will never see those 20 data points. The other 80 data rows are put into the estimator as X and Y, so GridSearchCV will only see 80 rows of data. Various hyperparameters are tuned and laid out in the param_grid variable. For this case the cross-validation parameter cv is assigned a value of 3, as shown:
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_result = grid.fit(X, Y)
Is there a way to see which data was used as the training data and as the cross validation data for each fold? Maybe seeing which indices were used for the split?
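One way to inspect this (a sketch, not taken from an answer in the thread): GridSearchCV with an integer cv and a classifier defaults to a non-shuffled StratifiedKFold, so the same fold indices can be reproduced, or an explicit splitter can be passed in; here X and Y stand in for the 80 training rows:
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.randn(80, 5)         # placeholder for the 80 training rows
Y = np.random.randint(0, 2, 80)    # placeholder binary targets

cv = StratifiedKFold(n_splits=3)   # what cv=3 resolves to for a classifier
for fold, (train_idx, val_idx) in enumerate(cv.split(X, Y)):
    print(fold, train_idx[:10], val_idx[:10])   # row indices used in each split
Passing the same cv object to GridSearchCV(estimator=model, param_grid=param_grid, cv=cv) guarantees that the printed splits match the ones used during the search.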

Creating a matrix with certain conditions

I am trying to create a matrix of size 32x10x1 using PyTorch.
The conditions that I need to fulfill are that
torch.mean(a, dim=0) # size is 10x1 and should be almost 0
torch.mean(a, dim=1) # size is 32x1 and should be almost 0
This is a noise matrix for GANs and I am trying to sample it from a Normal distribution. I tried using torch.MultiVariateNormal() but it didn't give me a matrix of that shape.
Is there any other function in NumPy or scikit-learn (or elsewhere) to get this kind of matrix?
Use numpy.random.normal
import numpy.random as npr
mean = 0
std_dev = 0.1
size = (32, 10, 1)
mat = npr.normal(loc=mean, scale=std_dev, size=size)
and set the mean and standard deviation as desired to keep the values close to 0.
(Figure: the effect of changing the mean and standard deviation of a normal distribution on its density curve. Image by Inductiveload, Public Domain, https://commons.wikimedia.org/w/index.php?curid=3817954)
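If the noise tensor is needed directly in PyTorch rather than NumPy, a minimal equivalent sketch (not part of the original answer) is:
import torch

mean, std_dev = 0.0, 0.1
a = torch.randn(32, 10, 1) * std_dev + mean   # i.i.d. N(mean, std_dev^2) noise of shape 32x10x1

print(torch.mean(a, dim=0).shape)   # torch.Size([10, 1]), values close to 0
print(torch.mean(a, dim=1).shape)   # torch.Size([32, 1]), values close to 0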

Applying LSA on term document matrix when number of documents are very less

I have a term-document matrix (X) of shape (6, 25931). The first 5 documents are my source documents and the last document is my target document. The column represents counts for different words in the vocabulary set. I want to get the cosine similarity of the last document with each of the other documents.
But since SVD produces an S of size (min(6, 25931),), if I use S to reduce my X, I get a 6 x 6 matrix. In this case, I feel that I will be losing too much information, since I am reducing a vector of size (25931,) to (6,).
And when you think about it, the number of documents will usually be less than the number of vocabulary words. In that case, using SVD to reduce dimensionality will always produce vectors of size (number of documents,).
According to everything that I have read, when SVD is used like this on a term-document matrix, it's called LSA.
Am I implementing LSA correctly?
If this is correct, then is there any other way to reduce the dimensionality and get denser vectors where the size of the compressed vector is greater than (6,)?
P.S.: I also tried using fit_transform from sklearn.decomposition.TruncatedSVD, which expects the input to have shape (n_samples, n_features); that is why the shape of my term-document matrix is (6, 25931) and not (25931, 6). I kept getting a (6, 6) matrix, which initially confused me. But now it makes sense after I remembered the math behind SVD.
If the objective of the exercise is to find the cosine similarity, then the following approach can help. It only attempts to address that objective and does not comment on the definitions of Latent Semantic Analysis or Singular Value Decomposition mentioned by the questioner.
Let us first import all the required libraries. Please install them if they are not already available on your machine.
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
Let us generate some sample data for this exercise.
df = {'sentence': ['one two three','two three four','four five','six seven eight nine ten']}
df = pd.DataFrame(df, columns = ['sentence'])
The first step is to get the exhaustive list of all possible features, so collate all of the content in one place.
all_content = [' '.join(df['sentence'])]
Now let us build a vectorizer and fit it. The vectorizer's arguments are not explained here, as the focus is on solving the problem.
vectorizer = TfidfVectorizer(encoding = 'latin-1',norm = 'l2', min_df = 0.03, ngram_range = (1,2), max_features = 5000)
vectorizer.fit(all_content)
We can inspect the vocabulary to see if it makes sense. If needed, one could add stop words to the vectorizer above and check that they are indeed suppressed.
print(vectorizer.vocabulary_)
Let us vectorize two of the sentences so that we can compute the cosine similarity between them.
s1Tokens = vectorizer.transform(df.iloc[1])   # second sentence: 'two three four'
s2Tokens = vectorizer.transform(df.iloc[2])   # third sentence: 'four five'
Finally, the cosine similarity can be computed as follows.
cosine_similarity(s1Tokens , s2Tokens)
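To mirror the original question (the similarity of the last document with each of the others), the same pieces can be combined as follows; this extension is not part of the original answer:
target = vectorizer.transform([df['sentence'].iloc[-1]])   # the last sentence is the target document
for i, sent in enumerate(df['sentence'][:-1]):
    sim = cosine_similarity(vectorizer.transform([sent]), target)
    print(i, sim[0, 0])   # cosine similarity of sentence i with the target
Each call returns a 1x1 array, so sim[0, 0] is the scalar similarity.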

Numpy- Deep Learning, Training Examples

Silly question: I am going through the third week of Andrew Ng's newest deep learning course, and getting stuck on a fairly simple NumPy function (I think?).
The exercise is to find how many training examples, m, we have.
Any idea what the NumPy function is to find out the size of a preloaded training set?
Thanks!
shape_X = X.shape
shape_Y = Y.shape
m = ?
print ('The shape of X is: ' + str(shape_X))
print ('The shape of Y is: ' + str(shape_Y))
print ('I have m = %d training examples!' % (m))
It depends on what kind of storage-approach you use.
Most Python-based tools use the [n_samples, n_features] convention, where the first dimension is the sample dimension and the second is the feature dimension (as in scikit-learn and friends). Put differently: samples are rows and features are columns.
So:
import numpy as np

#              feature: 1  2  3  4
x = np.array([[1, 2, 3, 4],    # first sample
              [2, 3, 4, 5],    # second sample
              [3, 4, 5, 6]])   # third sample
is a training-set of 3 samples with 4 features each.
You can get the sizes M, N (again: the interpretation might differ elsewhere) with:
M, N = x.shape
because NumPy's first dimension is rows and its second dimension is columns, as in matrix algebra.
For the above example, the target array is of shape (M,), i.e. one target per sample.
Any time you want the total number of values in an array, you can use
m = X.size
Be aware that .size returns the total number of elements, so it equals the number of training examples only when X is one-dimensional; for a 2-D array of shape (n_features, m) it returns n_features * m, which is why it is not a reliable way to recover m.
A better way for this scenario is
m = X.shape[1]   # number of columns, i.e. the number of training examples when X has shape (n_features, m)
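A minimal worked sketch, assuming the course convention that X is stored as (n_features, m) and Y as (1, m); the shapes below are placeholders, not the actual assignment data:
import numpy as np

X = np.random.randn(5, 400)                  # 5 features, 400 examples
Y = np.random.randint(0, 2, size=(1, 400))   # one 0/1 label per example

m = X.shape[1]                               # 400 training examples
print('The shape of X is: ' + str(X.shape))
print('The shape of Y is: ' + str(Y.shape))
print('I have m = %d training examples!' % m)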

R-Squared of alternative model

In order to reduce the influence of outliers and obtain a more robust regression, I've applied a winsorization technique to modify the values of a series ('x'). I then regress these values against series 'y'.
The R-squared of this model is naturally much higher, but I'm not making the right comparison.
How do I use scipy or statsmodels to obtain the R-squared of the original data using the beta estimates from the winsorized model?
You need to calculate it yourself, essentially by replicating the formula for rsquared.
For example
>>> import numpy as np
>>> from statsmodels.api import OLS
>>> res_tmp = OLS(np.random.randn(100), np.column_stack((np.ones(100), np.random.randn(100, 2)))).fit()
>>> y_orig = res_tmp.model.endog
>>> res_tmp.rsquared
0.022009069788207714
>>> (1 - ((y_orig - res_tmp.fittedvalues)**2).sum() / ((y_orig - y_orig.mean())**2).sum())
0.022009069788207714
The last expression applies to your case if res_tmp.fittedvalues are the predicted or fitted values of your winsorized model and y_orig is your original, unchanged response variable. This definition of R-squared applies if there is a constant in the model.
Note: the most common notation for the linear model is y = X b, where y is the response variable and X contains the explanatory variables. If I understand correctly, you reversed the labeling in your question.
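A minimal sketch of applying this to the winsorization case, assuming (per the note above) that it is the response that was winsorized; the data, limits, and variable names here are placeholders:
import numpy as np
import statsmodels.api as sm
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_orig = 2.0 * x + rng.standard_t(df=3, size=200)   # heavy-tailed noise to create outliers

X = sm.add_constant(x)
y_wins = np.asarray(winsorize(y_orig, limits=[0.05, 0.05]))   # clip the extreme responses

res_wins = sm.OLS(y_wins, X).fit()       # model estimated on the winsorized response
fitted = res_wins.fittedvalues           # predictions from the winsorized betas

# R-squared of the original response under the winsorized fit
r2_orig = 1 - ((y_orig - fitted) ** 2).sum() / ((y_orig - y_orig.mean()) ** 2).sum()
print(res_wins.rsquared, r2_orig)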