What is the meaning of data_shape in CSVIter of MXNet

If I have an input matrix of 400 rows with 20 features, and 10 labels per row, what should my data_shape and label_shape tuples be?

Each row of your input CSV file is read as a flat vector and then reshaped into data_shape. So, if a row in an input file is 1,2,3,4,5,6 and data_shape is (3,2), that row will be reshaped, yielding the array [[1,2],[3,4],[5,6]] of shape (3,2).
See the CSVIter documentation for more details.
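For the matrix in the question (400 rows, 20 features, 10 labels per row), a sketch might look like this, assuming the features and labels live in hypothetical data.csv and label.csv files with one sample per row:

import mxnet as mx

# data.csv: 400 rows x 20 values; label.csv: 400 rows x 10 values (assumed layout).
data_iter = mx.io.CSVIter(data_csv='data.csv', data_shape=(20,),
                          label_csv='label.csv', label_shape=(10,),
                          batch_size=50)
for batch in data_iter:
    # Each batch carries 50 samples: features (50, 20), labels (50, 10).
    print(batch.data[0].shape, batch.label[0].shape)
    break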

Related

How to gather the last X indices embeddings before every grouping of False in a boolean mask for every row in a batch

The input mask has shape [batch_size, num_timesteps]. From it, I need to gather X embeddings (from a tensor of shape [batch_size, num_timesteps, embedding_size]) located just before each grouping of False values.
Say the mask for a batch of size 1 is T,T,T,F,F,F,F | T,T,F,F,F,F,F and X=2 (the '|' marks a break indicating a row split of length 7); then I should get the list of concatenated embeddings given by indices (1,2) and (8,9).
I need to replicate the above for a variable batch size without processing each sample individually, as my batch size is fairly large. The output should be [ [ (1,2), (.,.), ... for the other samples at the first row split (0:7) ], [ (8,9), (.,.), ... for the other samples at the second row split (7:14) ], ...]. A vectorized sketch follows below.
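A minimal NumPy sketch of one way to do this, assuming fixed-length row splits, a contiguous run of Trues at the start of each split, and at least X True steps per split (all names and sizes are illustrative, not from the question):

import numpy as np

batch, n_splits, split_len, emb, X = 2, 2, 7, 3, 2   # illustrative sizes
mask = np.array([[1,1,1,0,0,0,0, 1,1,0,0,0,0,0],
                 [1,1,1,1,0,0,0, 1,1,1,0,0,0,0]], dtype=bool)
embeddings = np.random.rand(batch, n_splits * split_len, emb)

# View each row split as its own axis and count the leading Trues per split.
m = mask.reshape(batch, n_splits, split_len)
lengths = m.sum(axis=-1)                            # [batch, n_splits]

# Local indices of the X steps just before the first False in each split.
idx = lengths[..., None] - X + np.arange(X)         # [batch, n_splits, X]

# Gather those embeddings for every sample and split at once.
e = embeddings.reshape(batch, n_splits, split_len, emb)
gathered = np.take_along_axis(e, idx[..., None], axis=2)
print(gathered.shape)   # (2, 2, 2, 3) = [batch, n_splits, X, emb]

The [batch, n_splits, X, emb] result can then be reshaped or concatenated along the X axis as needed.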

Sklearn PCA: Correct Dimensionality of PCs

I have a dataframe, df, which contains a column called 'event' wherein there is a 24x24x40 numpy array. I want to:
extract this numpy array;
flatten it into a 1x23040 vector;
add this entry as a column in a new numpy array or dataframe;
perform PCA on the resulting matrix.
However, PCA produces eigenvectors whose dimension is the number of entries, not the number of dimensions in the data.
To illustrate my problem, I demonstrate a minimal example that works perfectly well:
EXAMPLE 1
from sklearn import datasets, decomposition
digits = datasets.load_digits()
X = digits.data
pca = decomposition.PCA()
X_pca = pca.fit_transform(X)
print (X.shape)
Result: (1797, 64)
print (X_pca.shape)
Result: (1797, 64)
There are 1797 entries in each case, with eigenvectors of dimension 64.
Now onto my example:
EXAMPLE 2
from sklearn import datasets, decomposition
import pandas as pd
hdf=pd.HDFStore('./afile.h5')
df=hdf.select('batch0')
print(df['event'][0].shape)
Result: (1, 24, 24, 40)
print(df['event'][0].flatten().shape)
Result: (23040,)
import numpy as np
_list = []
for index, row in df.iterrows():
    entry = df['event'][index].flatten()
    _list.append(entry)
X = np.asarray(_list)
pca = decomposition.PCA()
X_pca=pca.fit_transform(X)
print (X.shape)
Result: (201, 23040)
print (X_pca.shape)
Result:(201, 201)
This has dimensions of the number of data, 201 entries!
I am unfamiliar with dataframes, so it could be that I am iterating through the dataframe incorrectly. However, I have checked that the rows of the resulting numpy array X in Example 2 can be reshaped and plotted as expected.
Any thoughts would be appreciated!
Kind regards!
Sklearn's documentation states that the number of components retained when you don't specify the n_components parameter is min(n_samples, n_features).
Now, heading to your example:
In your first example, the number of samples (1797) is greater than the number of features (64), so min(n_samples, n_features) = 64 and the whole dimensionality is kept (since you are not specifying the number of components). In your second example, however, the number of samples (201) is far less than the number of features (23040), so sklearn's PCA reduces the number of components to n_samples = 201.
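A quick illustration of that rule, with random data standing in for the HDF5 contents:

import numpy as np
from sklearn import decomposition

rng = np.random.RandomState(0)
X_wide = rng.rand(201, 23040)        # n_samples << n_features, as in Example 2
pca = decomposition.PCA()            # n_components defaults to min(201, 23040) = 201
X_pca = pca.fit_transform(X_wide)
print(X_pca.shape)                   # (201, 201)
print(pca.components_.shape)         # (201, 23040): axes still live in feature space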

What's the appropriate placeholder for my input

I have a dataframe of 1k rows and 14 columns containing numpy arrays, as shown below.
Here is a subset of 2 rows and 3 columns:
[5,4,74,-12] [ 78,1,2,-9] [5 ,1,1,2]
[10,4,4,-1] [ 8,15,21,-19] [1,1,0,0]
where each cell is a numpy array of shape (4,1).
I couldn't find the right placeholder to feed my whole dataframe, which needs to be processed in row batches.
I tried the following to define a placeholder for my dataframe, but it's not correct:
import tensorflow as tf

x = tf.placeholder(tf.int32, [None, 14], name='x')
with tf.Session() as sess:
    print(sess.run(x, feed_dict={x: Data}))
It gives: ValueError: setting an array element with a sequence.
Does anyone have an idea?
You did not specify in which format your data is available, so I assume it is a numpy array. In this case, you can do it like this:
n_columns = 14
n_elements_per_column = 4
x = tf.placeholder(tf.int32, [None, n_columns, n_elements_per_column], name='x')
with tf.Session() as sess:
    print(sess.run(x, feed_dict={x: Data}))
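If the data still lives in the dataframe of (4,1) arrays, one way to build that numpy array is sketched below (df stands for the question's 1000x14 dataframe, an assumed name):

import numpy as np

# Ravel each (4,1) cell to shape (4,), stack the 14 cells of a row,
# then stack all rows into one batch array.
Data = np.stack([np.stack([cell.ravel() for cell in row]) for row in df.values])
print(Data.shape)   # (1000, 14, 4) -- matches the [None, 14, 4] placeholder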

Sklearn and Sparse Matrices ValueError

I'm aware similar questions have been asked before, and I've tried everything suggested in them, but I'm still stumped. I have a dataset with two columns: the first contains word vectors stored as 1x10000 sparse csr matrices (so there is a matrix in each cell), and the second contains integer ratings which I will use for classification. When I run the following code:
for index, row in data.iterrows():
    print(row)
    print(row[0].shape)
I get the correct output for all the rows:
Vector    (0, 0)\t1.0\n (0, 1)\t1.0\n (0, 2)\t1.0\n ...
Rating    5
Name: 0, dtype: object
(1, 10000)
Now when I try passing my data in any SKlearn classifier like so:
uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vectors"], data["Ratings"])
I get the following error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
What am I doing wrong? I've made sure all my sparse matrices are the same size and I've tried reshaping my data in various ways, but with no luck; sklearn classifiers are supposed to be able to handle csr matrices.
Update: Converting the entire "Vectors" column into one large 2-D matrix did the trick. For completeness' sake, the following is the code I used to generate my dataframe, in case anyone is curious and wants to try solving the original issue. Assume data is a pandas dataframe with rows that look like
"560 420 222" 5.0
"2345 2344 2344 5" 3.0
from scipy import sparse
import pandas as pd

def vectorize(feature, size):
    """Given a numeric string generated from a vocabulary table, return a
    binary vector representation of each feature."""
    vector = sparse.lil_matrix((1, size))
    for number in feature.split(' '):
        try:
            vector[0, int(number) - 1] = 1
        except ValueError:
            pass
    return vector

def vectorize_dataset(data, vectorize, size):
    """Given a dataset in the appropriate "num num num..." format, a specific
    vectorization function, and a vector size, return the dataset in
    vectorized form."""
    result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
    for index, row in data.iterrows():
        # The mixing up of decodings and encodings means Pandas can
        # incorrectly parse EOF chars, hence the type check.
        if type(row[0]) == type('str'):
            result_data.iat[index, 0] = vectorize(row[0], size).tocsr()
            result_data.iat[index, 1] = data.loc[index][1]
    return result_data
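For reference, the column-to-matrix conversion described in the update might look like this (a sketch, assuming data holds the "Vector"/"Rating" columns produced by vectorize_dataset above):

from scipy import sparse

# Stack the 1x10000 csr rows into one (n_rows, 10000) sparse matrix,
# which sklearn estimators accept directly.
X = sparse.vstack(data["Vector"].tolist())
y = data["Rating"].astype(int)
uniform_random_classifier.fit(X, y)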

PCA sklearn - Which dimension does it take

Does sklearn's PCA consider the columns of the dataframe as the vectors to reduce, or the rows?
Because when doing this:
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame([[1,-21,45,3,4],[4,5,89,-5,6],[7,-4,58,1,19],[10,11,74,20,12],[13,14,15,45,78]])  # 5 rows, 5 columns
pca=PCA(n_components=3)
pca.fit(df)
df_pcs=pd.DataFrame(data=pca.components_, index = df.index)
I get the following error:
ValueError: Shape of passed values is (5, 3), indices imply (5, 5)
Rows represent samples and columns represent features. PCA reduces the dimensionality of the data, i.e., the features, so the columns.
In other words, it considers each row a single feature vector and reduces its size.
If you have a dataframe of shape [100, 6] and n_components is set to 3, your output will be [100, 3].
# You need this
df_pcs=pca.transform(df)
# This produces error because shapes dont match.
df_pcs=pd.DataFrame(data=pca.components_, index = df.index)
pca.components_ is an array of shape [3, 5], while the index parameter is given df.index, which has length 5, so the shapes don't match. Hence the error. pca.components_ represents a completely different thing.
According to documentation:-
components_ : array, [n_components, n_features]
Principal axes in feature space, representing the
directions of maximum variance in the data.
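Putting it together, a working version of the snippet from the question would be:

df_pcs = pd.DataFrame(data=pca.transform(df), index=df.index)
print(df_pcs.shape)            # (5, 3): 5 samples, 3 principal components
print(pca.components_.shape)   # (3, 5): n_components x n_features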