Sklearn PCA: Correct Dimensionality of PCs - pandas

I have a dataframe, df, which contains a column called 'event' wherein there is a 24x24x40 numpy array. I want to:
extract this numpy array;
flatten it into a 1x23040 vector;
add this entry as a column in a new numpy array or dataframe;
perform PCA on the resulting matrix.
However, PCA produces eigenvectors with the dimension of the number of entries, not the number of dimensions in the data.
To illustrate my problem, here is first a minimal example that works perfectly well:
EXAMPLE 1
from sklearn import datasets, decomposition
digits = datasets.load_digits()
X = digits.data
pca = decomposition.PCA()
X_pca = pca.fit_transform(X)
print (X.shape)
Result: (1797, 64)
print (X_pca.shape)
Result: (1797, 64)
There are 1797 entries in each case, with eigenvectors of dimension 64.
Now onto my example:
EXAMPLE 2
from sklearn import datasets, decomposition
import numpy as np
import pandas as pd
hdf=pd.HDFStore('./afile.h5')
df=hdf.select('batch0')
print(df['event'][0].shape)
Result: (1, 24, 24, 40)
print(df['event'][0].flatten().shape)
Result: (23040,)
_list = []
for index, row in df.iterrows():
    # flatten each (1, 24, 24, 40) event into a 23040-long vector
    entry = df['event'][index].flatten()
    _list.append(entry)
X = np.asarray(_list)
pca = decomposition.PCA()
X_pca = pca.fit_transform(X)
print (X.shape)
Result: (201, 23040)
print (X_pca.shape)
Result:(201, 201)
This has the dimensionality of the number of entries, 201, not 23040!
I am unfamiliar with dataframes, so it could be that I am iterating through the dataframe incorrectly. However, I have checked that the rows of the resulting numpy array X in Example 2 can be reshaped and plotted as expected.
Any thoughts would be appreciated!
Kind regards!

Sklearn's documentation states that the number of components retained when you don't specify the n_components parameter is min(n_samples, n_features).
Now, turning to your examples:
In your first example, the number of samples (1797) is greater than the number of features (64), so min(n_samples, n_features) = 64 and the full feature dimensionality is kept (since you are not specifying the number of components). In your second example, however, the number of samples (201) is far smaller than the number of features (23040), so sklearn's PCA reduces the number of components to n_samples = 201.
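As a quick sanity check, here is a minimal sketch (with random data standing in for the flattened event vectors, so the numbers are illustrative) showing the same behaviour, and how to cap the number of components explicitly:
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(0)
X = rng.random((201, 23040))   # 201 samples standing in for the flattened events

pca = decomposition.PCA()      # default: n_components = min(n_samples, n_features) = 201
X_pca = pca.fit_transform(X)
print(X_pca.shape)             # (201, 201)
print(pca.components_.shape)   # (201, 23040) -- each component still lives in feature space

pca10 = decomposition.PCA(n_components=10)   # explicitly request fewer components
X_pca10 = pca10.fit_transform(X)
print(X_pca10.shape)            # (201, 10)
print(pca10.components_.shape)  # (10, 23040)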

Related

numpy.corrcoeff() MemoryError

I can't understand the MemoryError I get when using numpy.corrcoef() to find the correlation coefficient between two vectors, smin and smax, as follows:
import numpy as np
from numpy import random as rn

r = 0.01
sigma = 0.2
T = 1
K = 1
N = 252
h = T / N
M = 50000
Z = rn.randn(M, N)
S = np.ones((M, N + 1))
smax = np.ones((M, 1))
smin = np.ones((M, 1))
for i in range(0, N):
    S[:, i + 1] = S[:, i] * (np.exp((r - (sigma ** 2) / 2) * h + sigma * Z[:, i] * np.sqrt(h)))
for j in range(0, M):
    smax[j, :] = np.exp(-r * T) * (np.max(S[j, :]) > K) * (np.max(S[j, :]) - K)
    smin[j, :] = np.exp(-r * T) * (np.min(S[j, :]) < K) * (K - np.min(S[j, :]))
c = np.corrcoef(smax, smin)
print(c)
If there is another way to find the correlation coefficient, for example using pandas, that would also be fine.
The shape of your arrays here is the problem. The function documentation states that x is a "1-D or 2-D array containing multiple variables and observations. Each row of x represents a variable, and each column a single observation of all those variables." and that y is an additional set of variables and observations. With smax and smin both of shape (50000, 1), every row is treated as a separate variable, so this is trying to allocate an array of size (100000, 100000), which is huge.
If you just want to calculate the Pearson correlation coefficient between two one-dimensional vectors, you can use a much simpler formula than what is implemented here. This documentation has the formula I am referring to:
https://hydroerr.readthedocs.io/en/stable/api/HydroErr.HydroErr.pearson_r.html#HydroErr.HydroErr.pearson_r
But to still be able to use the numpy version, you need to pass both the observations and the predictions in the same parameter x, as 1-D arrays.
import numpy as np
simulated_array = np.random.rand(50000)
observed_array = np.random.rand(50000)
c = np.corrcoef([simulated_array, observed_array])[1, 0]
More explanation about this here.
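If you'd rather avoid np.corrcoef entirely, here is a minimal sketch of the direct formula for Pearson's r between two 1-D arrays (the helper name is just illustrative); it should match the off-diagonal entry above:
import numpy as np

def pearson_r(x, y):
    # Pearson correlation between two 1-D arrays, computed directly
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm = x - x.mean()
    ym = y - y.mean()
    return np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2))

simulated_array = np.random.rand(50000)
observed_array = np.random.rand(50000)
print(pearson_r(simulated_array, observed_array))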

Numpy: stack arrays whose internal dimensions differ

I have a situation similar to the following:
import numpy as np
a = np.random.rand(55, 1, 3)
b = np.random.rand(55, 626, 3)
Here the shapes represent the number of observations, then the number of time slices per observation, then the number of dimensions of the observation at the given time slice. So a holds the 3 dimensions for each of the 55 observations at one new time interval, while b holds the same for the full 626 time slices.
I'd like to stack a and b into an array with shape 55, 627, 3. How can one accomplish this in numpy? Any suggestions would be greatly appreciated!
To follow up on Divakar's answer above: the axis argument in numpy is the index of a given dimension within an array's shape. Here I want to stack a and b along their middle dimension, which is at index 1:
import numpy as np
a = np.random.rand(5, 1, 3)
b = np.random.rand(5, 100, 3)
# create the desired result shape: (5, 101, 3) here (or (55, 627, 3) with the original arrays)
stacked = np.concatenate((b, a), axis=1)
# validate that a was appended to the end of b
print(stacked[:, -1, :], '\n\n\n', a.squeeze())
This returns:
[[0.72598529 0.99395887 0.21811998]
[0.9833895 0.465955 0.29518207]
[0.38914048 0.61633291 0.0132326 ]
[0.05986115 0.81354865 0.43589306]
[0.17706517 0.94801426 0.4567973 ]]
[[0.72598529 0.99395887 0.21811998]
[0.9833895 0.465955 0.29518207]
[0.38914048 0.61633291 0.0132326 ]
[0.05986115 0.81354865 0.43589306]
[0.17706517 0.94801426 0.4567973 ]]
A purist might instead use np.all(stacked[:, -1, :] == a.squeeze()) to validate this equivalence. All glory to Divakar!
Strictly for the curious, the use case for this concatenation is a kind of wonky data preparation pipeline for a Long Short Term Memory Neural Network. In that kind of network, the training data shape should be number_of_observations, number_of_time_intervals, number_of_dimensions_per_observation. I am generating new predictions of each object at a new time interval, so those predictions have shape number_of_observations, 1, number_of_dimensions_per_observation. To visualize the sequence of observations' positions over time, I want to add the new positions to the array of previous positions, hence the question above.
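For that use case the same one-liner applies; here is a small sketch with the shapes named as in the description above (the variable names are just illustrative):
import numpy as np

n_obs, n_steps, n_dims = 55, 626, 3

history = np.random.rand(n_obs, n_steps, n_dims)     # positions observed so far
new_predictions = np.random.rand(n_obs, 1, n_dims)   # one new time interval per observation

updated = np.concatenate((history, new_predictions), axis=1)  # append along the time axis
print(updated.shape)  # (55, 627, 3)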

Sklearn and Sparse Matrices ValueError

I'm aware similar questions have been asked before, and I've tried everything suggested in them, but I'm still stumped. I have a dataset with 2 columns: the first contains word vectors, each stored as a 1x10000 sparse CSR matrix (so a matrix in each cell), and the second contains integer ratings which I will use for classification. When I run the following code
for index, row in data.iterrows():
    print(row)
    print(row[0].shape)
I get the correct output for all the rows:
Vector    (0, 0)\t1.0\n  (0, 1)\t1.0\n  (0, 2)\t1.0\n ...
Rating    5
Name: 0, dtype: object
(1, 10000)
Now when I try passing my data to any sklearn classifier, like so:
uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vectors"], data["Ratings"])
I get the following error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
What am I doing wrong? I've made sure all my sparse matrices are the same size and I've tried reshaping my data in various ways, but with no luck, and the Sklearn classifiers are supposed to be able to deal with csr matrices.
Update: Converting the entire "Vectors" column into one large 2-D matrix did the trick, but for completeness' sake the following is the code I used to generate my dataframe, in case anyone is curious and wants to try solving the original issue. Assume data is a pandas dataframe with rows that look like
"560 420 222" 5.0
"2345 2344 2344 5" 3.0
from scipy import sparse
import pandas as pd

def vectorize(feature, size):
    """Given a numeric string generated from a vocabulary table, return a binary
    vector representation of each feature."""
    vector = sparse.lil_matrix((1, size))
    for number in feature.split(' '):
        try:
            vector[0, int(number) - 1] = 1
        except ValueError:
            pass
    return vector

def vectorize_dataset(data, vectorize, size):
    """Given a dataset in the "num num num..." format, a vectorization function and a
    vector size, return the dataset in vectorized form."""
    result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
    for index, row in data.iterrows():
        # all the mixing up of decodings and encodings has made it so that pandas
        # incorrectly parses EOF chars, hence the type check
        if isinstance(row[0], str):
            result_data.iat[index, 0] = vectorize(row[0], size).tocsr()
            result_data.iat[index, 1] = data.loc[index][1]
    return result_data
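For reference, the "one large 2-D matrix" fix mentioned in the update can be done by stacking the per-row CSR matrices before fitting; a sketch, assuming the result_data dataframe produced by vectorize_dataset above:
import scipy.sparse
from sklearn.dummy import DummyClassifier

# stack the 1x10000 CSR matrices in the "Vector" column into one (n_rows, 10000) matrix
X = scipy.sparse.vstack(list(result_data["Vector"]))
y = result_data["Rating"].values

uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(X, y)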

PCA sklearn - Which dimension does it take

Does sklearn PCA consider the columns of the dataframe as the vectors to reduce, or the rows as the vectors to reduce?
Because when doing this:
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame([[1,-21,45,3,4],[4,5,89,-5,6],[7,-4,58,1,19],[10,11,74,20,12],[13,14,15,45,78]])  # 5 rows, 5 columns
pca = PCA(n_components=3)
pca.fit(df)
df_pcs = pd.DataFrame(data=pca.components_, index=df.index)
I get the following error:
ValueError: Shape of passed values is (5, 3), indices imply (5, 5)
Rows represent samples and columns represent features. PCA reduces the dimensionality of the data, i.e. the features. So columns.
So if you are talking about vectors, it considers each row a single feature vector and reduces its size.
If you have a dataframe of shape, say, [100, 6] and PCA's n_components is set to 3, your output will be [100, 3].
# You need this
df_pcs=pca.transform(df)
# This produces an error because the shapes don't match.
df_pcs=pd.DataFrame(data=pca.components_, index = df.index)
pca.components_ is an array of shape [3, 5], while your index parameter uses df.index, which has length 5. Hence the error. pca.components_ represents a completely different thing.
According to the documentation:
components_ : array, [n_components, n_features]
Principal axes in feature space, representing the
directions of maximum variance in the data.
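To make the shapes concrete, here is a small sketch of the corrected version, using the same 5x5 dataframe as above:
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame([[1, -21, 45, 3, 4], [4, 5, 89, -5, 6], [7, -4, 58, 1, 19],
                   [10, 11, 74, 20, 12], [13, 14, 15, 45, 78]])

pca = PCA(n_components=3)
scores = pca.fit_transform(df)                # (5, 3): one row of scores per sample
df_pcs = pd.DataFrame(scores, index=df.index)

print(df_pcs.shape)           # (5, 3)
print(pca.components_.shape)  # (3, 5): one principal axis per row, in feature space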

Combine Sklearn TFIDF with Additional Data

I am trying to prepare data for supervised learning. I have my TF-IDF data, which was generated from a column in my dataframe merged:
vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1,2))
X = vect.fit_transform(merged['kws_name_desc'])
print X.shape
print type(X)
(57629, 11947)
<class 'scipy.sparse.csr.csr_matrix'>
But I also need to add additional columns to this matrix. For each document in the TF-IDF matrix, I have a list of additional numeric features. Each list has length 40 and consists of floats.
So to clarify, I have 57,629 lists of length 40 which I'd like to append to my TF-IDF result.
Currently, I have this in a DataFrame, example data: merged["other_data"]. Below is an example row from the merged["other_data"]
0.4329597715,0.3637511039,0.4893141843,0.35840...
How can I append the 57,629 rows of my dataframe column with the TF-IDF matrix? I honestly don't know where to begin and would appreciate any pointers/guidance.
This will do the work:
df1 = pd.DataFrame(X.toarray())          # convert the sparse matrix to a dense array
df2 = YOUR_DF                            # your dataframe of size 57k x 40
newDf = pd.concat([df1, df2], axis=1)    # newDf is the required dataframe
I figured it out:
First: iterate over my pandas column and create a list of lists
for_np = []
for x in merged['other_data']:
    row = x.split(",")
    row2 = list(map(float, row))  # wrap in list() so this also works on Python 3
    for_np.append(row2)
Then create a np array:
n = np.array(for_np)
Then use scipy.sparse.hstack on X (my original TF-IDF sparse matrix) and my new matrix n. I'll probably end up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!
import scipy.sparse
X = scipy.sparse.hstack([X, n])
You could have a look at the answer to this question:
use Featureunion in scikit-learn to combine two pandas columns for tfidf
Obviously, the answers given should work, but as soon as you want your classifier to make predictions, you definitely want to work with pipelines and feature unions.
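For example, here is a sketch of such a pipeline using ColumnTransformer (a more recent alternative to FeatureUnion); the dataframe merged and the column names kws_name_desc and other_data are taken from the question, while the classifier choice and the generated feature names are just illustrative:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# parse the comma-separated strings in other_data into 40 numeric columns
numeric = merged['other_data'].str.split(',', expand=True).astype(float)
numeric.columns = ['f{}'.format(i) for i in range(numeric.shape[1])]
data = pd.concat([merged[['kws_name_desc']], numeric], axis=1)

preprocess = ColumnTransformer([
    ('tfidf', TfidfVectorizer(stop_words='english', use_idf=True,
                              min_df=50, ngram_range=(1, 2)), 'kws_name_desc'),
    ('numeric', 'passthrough', list(numeric.columns)),
])

model = Pipeline([
    ('features', preprocess),
    ('clf', LogisticRegression(max_iter=1000)),
])

# model.fit(data, y)  # y would be your labels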