reshape list to (-1,1) and return float as datatype - numpy

I am trying to build a Logistic Regression model; data.Exam1 is the first column of my DataFrame.
reg = linear_model.LogisticRegression()
X = list(data.Exam1.values.reshape(-1,1))    .........(1)
I have performed this operation, and
type(X[0]) returns numpy.ndarray.
reg.fit expects all items in the list to be floats, so after getting the exception ValueError: Unknown label type: 'continuous' I did this:
newX = []
for item in X:
    type(float(item))
    newX.append(float(item))
So when I tried to do
reg.fit(newX, newY, A)
it throws this exception:
Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
But I already reshaped in (1), and when I try to reshape again I get an ndarray again. How can I reshape and convert the items to float at the same time?
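(Note: the reshape and the float cast can be done in one step on the NumPy array itself, without going through a Python list. A minimal sketch, assuming data.Exam1 holds numeric values:

X = data.Exam1.values.reshape(-1, 1).astype(float)   # 2-D float array of shape (n_samples, 1)
)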

Adapting our solution from chat
You are trying to understand Admission (type: bool) as a function of Exam scores (Exam1: float, Exam2: float). The crux of your issue is that sklearn.linear_model.LogisticRegression expects two inputs:
X: a vector/matrix of training data with the shape (number of observations, number of predictors) with type float
Y: a vector of categorical outcomes (in this case binary) with the shape (number of observations, 1) with type bool or int
The way you are calling it is trying to fit Exam2 (float) as a function of Exam1 (float). This is the fundamental issue. Further complicating matters is the way you are recasting your reshaped numpy array as a list. Assuming data is a pandas.DataFrame, you want something like:
import numpy as np

X = np.vstack((data.Exam1, data.Exam2)).T
print(X.shape)  # should be (100, 2)
reg.fit(X, data.Admitted)
Here, both data.Exam1 and data.Exam2 are vectors of length 100. Using np.vstack combines them into the shape (2, 100), so we take the transpose so that we have it oriented properly with observations along the first dimension (100, 2). No need to recast as list or even take data.Exam1.values as the pd.Series gets recast as np.array during np.vstack. Similarly, data.Admitted (with shape (100,)) plays nicely with reg.fit.
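For reference, a minimal end-to-end sketch under the same assumptions (100 observations, float columns Exam1 and Exam2, a boolean column Admitted); the synthetic DataFrame below is only a stand-in for the asker's data:

import numpy as np
import pandas as pd
from sklearn import linear_model

# Hypothetical stand-in for the asker's DataFrame: two float exam scores and a boolean outcome.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Exam1": rng.uniform(30, 100, size=100),
    "Exam2": rng.uniform(30, 100, size=100),
})
data["Admitted"] = (data.Exam1 + data.Exam2) > 130

X = np.column_stack((data.Exam1, data.Exam2))   # shape (100, 2); same as np.vstack((...)).T
y = data.Admitted                               # shape (100,); bool works as binary labels

reg = linear_model.LogisticRegression()
reg.fit(X, y)
print(reg.predict(X[:5]))                       # predicted admission for the first five rows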

Related

How can I reconstruct original matrix from SVD components with following shapes?

I am trying to reconstruct a matrix of shape (256, 256, 2) from its SVD components:
U.shape = (256, 256, 256)
s.shape = (256, 2)
vh.shape = (256, 2, 2)
I have already tried the methods from the numpy and scipy documentation to reconstruct the original matrix, but failed multiple times. I think a 3-D matrix may need a different way of reconstruction.
I am using numpy.linalg.svd for the decomposition.
From np.linalg.svd's documentation:
"... If a has more than two dimensions, then broadcasting rules apply, as explained in :ref:routines.linalg-broadcasting. This means that SVD is working in "stacked" mode: it iterates over all indices of the first a.ndim - 2 dimensions and for each combination SVD is applied to the last two indices."
This means that you only need to handle the s matrix (or tensor in the general case) to obtain the right tensor. More precisely, you need to pad s appropriately and then take only its first 2 columns (or, in general, as many columns as vh has rows, which should equal the number of columns of the returned s).
Here is working code with an example for your case:
import numpy as np

mat = np.random.randn(256, 256, 2)   # your matrix of shape 256 x 256 x 2
u, s, vh = np.linalg.svd(mat)        # get the stacked decomposition

# Pad the singular values, build diagonal matrices, and keep only the first 2 columns:
s_rep = np.apply_along_axis(
    lambda _s: np.diag(np.pad(_s, (0, u.shape[1] - _s.shape[0])))[:, :_s.shape[0]], 1, s)

mat_reconstructed = u @ s_rep @ vh
mat_reconstructed equals mat up to floating-point precision error.
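Continuing from the code above, an equivalent and shorter reconstruction is to keep only the first 2 columns of u and scale the rows of vh by s directly, which avoids building the padded diagonal matrices; a sketch:

# u[..., :2] has shape (256, 256, 2); s[..., None] * vh scales each row of vh by the
# corresponding singular value, so no padded diagonal matrices are needed.
mat_reconstructed_2 = u[..., :s.shape[-1]] @ (s[..., None] * vh)
print(np.allclose(mat, mat_reconstructed_2))   # True, up to floating-point error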

Sklearn and Sparse Matrices ValueError

I'm aware similar questions have been asked before, and I've tried everything suggested in them, but I'm still stumped. I have a dataset with 2 columns: the first holds word vectors stored as 1x10000 sparse csr matrices (so a matrix in each cell), and the second contains integer ratings which I will use for classification. When I run the following code
for index, row in data.iterrows():
    print(row)
    print(row[0].shape)
I get the correct output for all the rows
Vector    (0, 0)\t1.0\n  (0, 1)\t1.0\n  (0, 2)\t1.0\n ...
Rating    5
Name: 0, dtype: object
(1, 10000)
Now when I try passing my data to any sklearn classifier like so:
uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vectors"], data["Ratings"])
I get the following error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
What am I doing wrong? I've made sure all my sparse matrices are the same size, and I've tried reshaping my data in various ways with no luck; sklearn classifiers are supposed to be able to handle csr matrices.
Update: Converting the entire "Vectors" column into one large 2-D matrix did the trick (a sketch of that conversion follows the code below). For completeness' sake, the following is the code I used to generate my dataframe, if anyone is curious and wants to try solving the original issue. Assume data is a pandas dataframe with rows that look like
"560 420 222" 5.0
"2345 2344 2344 5" 3.0
import pandas as pd
from scipy import sparse

def vectorize(feature, size):
    """Given a numeric string generated from a vocabulary table, return a binary
    vector representation of the feature."""
    vector = sparse.lil_matrix((1, size))
    for number in feature.split(' '):
        try:
            vector[0, int(number) - 1] = 1
        except ValueError:
            pass
    return vector

def vectorize_dataset(data, vectorize, size):
    """Given a dataset in the "num num num ..." format, a vectorization function,
    and a vector size, return the dataset in vectorized form."""
    result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
    for index, row in data.iterrows():
        # All the mixing up of decodings and encodings has made it so that Pandas
        # incorrectly parses EOF chars
        if type(row[0]) == type('str'):
            result_data.iat[index, 0] = vectorize(row[0], size).tocsr()
            result_data.iat[index, 1] = data.loc[index][1]
    return result_data
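Regarding the fix from the update above, a rough sketch (assuming the helpers above and a vocabulary size of 10000): stacking the per-row sparse vectors into one 2-D sparse matrix with scipy.sparse.vstack gives a shape the classifiers can consume.

from scipy import sparse
from sklearn.dummy import DummyClassifier

# Hypothetical usage of the helpers above; 10000 is the assumed vocabulary size.
result_data = vectorize_dataset(data, vectorize, 10000)

# Stack the per-row (1, 10000) csr matrices into a single (n_rows, 10000) csr matrix.
X = sparse.vstack(result_data["Vector"].tolist()).tocsr()
y = result_data["Rating"].astype(int)   # assumes every row was vectorized (no NaNs left)

uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(X, y)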

What is the meaning of slim.metrics.streaming_sparse_average_precision_at_k?

This function refers to tf.contrib.metrics.streaming_sparse_average_precision_at_k, and the explanation in the source code is below. Can anyone explain it with a simple example? I also wonder whether this metric is the same as the average precision calculation used in the PASCAL VOC 2012 challenge. Thanks a lot.
def sparse_average_precision_at_k(labels,
                                  predictions,
                                  k,
                                  weights=None,
                                  metrics_collections=None,
                                  updates_collections=None,
                                  name=None):
  """Computes average precision@k of predictions with respect to sparse labels.

  `sparse_average_precision_at_k` creates two local variables,
  `average_precision_at_<k>/total` and `average_precision_at_<k>/max`, that
  are used to compute the frequency. This frequency is ultimately returned as
  `average_precision_at_<k>`: an idempotent operation that simply divides
  `average_precision_at_<k>/total` by `average_precision_at_<k>/max`.

  For estimation of the metric over a stream of data, the function creates an
  `update_op` operation that updates these variables and returns the
  `precision_at_<k>`. Internally, a `top_k` operation computes a `Tensor`
  indicating the top `k` `predictions`. Set operations applied to `top_k` and
  `labels` calculate the true positives and false positives weighted by
  `weights`. Then `update_op` increments `true_positive_at_<k>` and
  `false_positive_at_<k>` using these values.

  If `weights` is `None`, weights default to 1. Use weights of 0 to mask values.

  Args:
    labels: `int64` `Tensor` or `SparseTensor` with shape
      [D1, ... DN, num_labels] or [D1, ... DN], where the latter implies
      num_labels=1. N >= 1 and num_labels is the number of target classes for
      the associated prediction. Commonly, N=1 and `labels` has shape
      [batch_size, num_labels]. [D1, ... DN] must match `predictions`. Values
      should be in range [0, num_classes), where num_classes is the last
      dimension of `predictions`. Values outside this range are ignored.
    predictions: Float `Tensor` with shape [D1, ... DN, num_classes] where
      N >= 1. Commonly, N=1 and `predictions` has shape
      [batch size, num_classes]. The final dimension contains the logit values
      for each class. [D1, ... DN] must match `labels`.
    k: Integer, k for @k metric. This will calculate an average precision for
      range `[1,k]`, as documented above.
    weights: `Tensor` whose rank is either 0, or n-1, where n is the rank of
      `labels`. If the latter, it must be broadcastable to `labels` (i.e., all
      dimensions must be either `1`, or the same as the corresponding `labels`
      dimension).
    metrics_collections: An optional list of collections that values should
      be added to.
    updates_collections: An optional list of collections that updates should
      be added to.
    name: Name of new update operation, and namespace for other dependent ops.

  Returns:
    mean_average_precision: Scalar `float64` `Tensor` with the mean average
      precision values.
    update: `Operation` that increments variables appropriately, and whose
      value matches `metric`.

  Raises:
    ValueError: if k is invalid.
  """

PCA sklearn - Which dimension does it take

Does sklearn PCA consider the columns of the dataframe as the vectors to reduce or the rows as vectors to reduce ?
Because when doing this:
df = pd.DataFrame([[1,-21,45,3,4],[4,5,89,-5,6],[7,-4,58,1,19],[10,11,74,20,12],[13,14,15,45,78]])  # 5 rows, 5 columns
pca=PCA(n_components=3)
pca.fit(df)
df_pcs=pd.DataFrame(data=pca.components_, index = df.index)
I get the following error:
ValueError: Shape of passed values is (5, 3), indices imply (5, 5)
Rows represent samples and columns represent features. PCA reduces the dimensionality of the data, i.e., the features. So columns.
So if you are talking about vectors, it considers each row as a single feature vector and reduces its size.
If you have a dataframe of shape, say, [100, 6] and PCA's n_components is set to 3, then your output will be [100, 3].
# You need this
df_pcs = pca.transform(df)
# This produces an error because the shapes don't match.
df_pcs = pd.DataFrame(data=pca.components_, index=df.index)
pca.components_ is an array of shape [3, 5], and your index parameter is using df.index, which has shape [5,]. Hence the error. pca.components_ represents a completely different thing.
According to the documentation:
components_ : array, [n_components, n_features]
Principal axes in feature space, representing the
directions of maximum variance in the data.
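Putting it together, a small self-contained sketch (with made-up data) showing which shapes come out:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame(np.random.randn(100, 6))   # 100 samples (rows), 6 features (columns)

pca = PCA(n_components=3)
scores = pca.fit_transform(df)               # samples projected onto 3 components
print(scores.shape)                          # (100, 3)
print(pca.components_.shape)                 # (3, 6): principal axes in feature space

# Wrapping the transformed data in a DataFrame keeps one row per original sample:
df_pcs = pd.DataFrame(scores, index=df.index)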

How can I replace the summing in numpy matrix multiplication with concatenation in a new dimension?

For each location in the result matrix, instead of storing the dot product of the corresponding row and column of the argument matrices, I would like to store the element-wise product, which will be a vector extending into a third dimension.
One idea would be to convert the argument matrices to vectors with vector entries, and then take their outer product, but I'm not sure how to do this either.
EDIT:
I figured it out before I saw there was a reply. Here is my solution:
def newdot(A, B):
    A = A.reshape((1,) + A.shape)
    B = B.reshape((1,) + B.shape)
    A = A.transpose(2, 1, 0)
    B = B.transpose(1, 0, 2)
    return A * B
What I am doing is taking apart each row and column pair that will have their outer product taken, and forming two lists of them, which then get their contents matrix multiplied together in parallel.
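For example, with plain ndarrays (note that the 3-D reshape in newdot would not work on np.matrix objects, which are restricted to 2-D):

import numpy as np

A = np.random.randn(4, 3)
B = np.random.randn(3, 5)

prod = newdot(A, B)                            # shape (3, 4, 5): prod[k, i, j] = A[i, k] * B[k, j]
print(np.allclose(prod.sum(axis=0), A @ B))    # True: summing over the new axis recovers A @ B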
It's a little convoluted (and difficult to explain) but this function should get you what you're looking for:
def f(m1, m2):
    return (m2.A.T * m1.A.reshape(m1.shape[0], 1, m1.shape[1]))
m3 = m1 * m2
m3_el = f(m1, m2)
m3[i,j] == sum(m3_el[i,j,:])
m3 == m3_el.sum(2)
The basic idea is to turn the matrices into arrays and do element-by-element multiplication. One of the arrays gets reshaped to have a size of one in its middle dimension, and array broadcasting rules expand this dimension out to match the height of the other array.
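For what it's worth, the same thing can be written with np.einsum; a sketch using plain 2-D arrays rather than np.matrix:

import numpy as np

A = np.random.randn(4, 3)
B = np.random.randn(3, 5)

# elementwise[i, j, k] = A[i, k] * B[k, j]; no index is summed, so the products
# extend into a third dimension instead of being reduced.
elementwise = np.einsum('ik,kj->ijk', A, B)

print(np.allclose(elementwise.sum(axis=2), A @ B))   # True: summing over k recovers the matmul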