Calculate pearson correlation between a tensor and a numpy array - dataframe

I have managed to build a DataFrame of the predicted tensors (y_pred), which have shape (459, 1) after reshaping from (459, 1, 1), and I have the original y values in the other column, which are also float32.
I would like to measure the Pearson correlation between these two columns, but I am getting an error:
pearsonr(df_pred['y_pred'],df_pred['y'])
unsupported operand type(s) for +: 'float' and 'tuple'
So I am not sure whether I can convert the tensor to a NumPy array and add that to the DataFrame. I have tried
predicted= tf.reshape(predicted, [459, 1])
predicted.numpy()
But it does not work. Any ideas?

I think you have to evaluate each tensor in the column to get its value.
df['y_pred'] = df['y_pred'].apply(lambda x: x.eval())
How to get the value of a tensor?

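predicted = predicted.numpy()
The above code worked in the end. Because the values were appended inside a for loop, writing only
predicted.numpy()
did not work: numpy() returns a new array rather than converting the tensor in place, so the result has to be assigned back.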
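As a minimal sketch of the remaining step, assuming the predictions have already been converted with .numpy() and collected in a loop: the arrays are stacked, flattened to one dimension, and only then correlated. The data below is made up for illustration, and np.corrcoef stands in for pearsonr; scipy.stats.pearsonr accepts the same pair of 1-D arrays.

```python
import numpy as np

# Hypothetical data: 459 true values and one (1, 1) prediction per loop step,
# standing in for tensors that were already converted with .numpy().
rng = np.random.default_rng(0)
y = rng.normal(size=459).astype(np.float32)
batches = [np.array([[v * 2.0 + 1.0]], dtype=np.float32) for v in y]

predicted = np.concatenate(batches, axis=0)  # shape (459, 1)
y_pred = predicted.reshape(-1)               # shape (459,), plain floats

# Pearson correlation is the off-diagonal entry of the 2x2 correlation matrix.
r = np.corrcoef(y_pred, y)[0, 1]
print(y_pred.shape, r)
```

Assigning y_pred (the flattened array) to the DataFrame column gives a column of plain float32 values instead of a column of tensor or array objects.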

Related

Vectorizing text from data frame column using pandas

I have a DataFrame which looks like this:
I am trying to vectorize every row, but only from the text column. I wrote this code:
vectorizerCount = CountVectorizer(stop_words='english')
# tokenize and build vocab
allDataVectorized = allData.apply(vectorizerCount.fit_transform(allData.iloc[:]['headline_text']), axis=1)
The error says:
TypeError: ("'csr_matrix' object is not callable", 'occurred at index 0')
After some research and a few attempted changes, I found out that the fit_transform function returns a scipy.sparse.csr.csr_matrix, and that is not callable.
Is there another way to do this?
Thanks!
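There are a number of problems with your code. DataFrame.apply expects a function, but you are passing it the result of fit_transform, which is why the error says the csr_matrix is not callable. You probably need something like
allDataVectorized = pd.DataFrame(vectorizerCount.fit_transform(allData['headline_text']).toarray())
allData['headline_text'] (with single brackets) is a Series of strings, which is the iterable of documents that fit_transform expects. Iterating over allData[['headline_text']] (with double brackets, a DataFrame) would yield only the column name.
fit_transform returns a csr matrix.
.toarray() converts it to a dense 2D NumPy array, and pd.DataFrame(...) creates a DataFrame from that array.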
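A minimal sketch of vectorizing only the text column, assuming scikit-learn and pandas are installed. The two-row allData frame and its publish_date column are hypothetical stand-ins for the real data. The column is passed as a Series because CountVectorizer iterates over its input and expects one string per document.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for allData; only the text column matters here.
allData = pd.DataFrame({
    "publish_date": [20030219, 20030220],
    "headline_text": ["the rain floods farmers", "farmers protest the tax"],
})

vectorizerCount = CountVectorizer(stop_words="english")
# Tokenize, build the vocabulary, and count terms: sparse (n_rows, n_terms).
counts = vectorizerCount.fit_transform(allData["headline_text"])

# vocabulary_ maps each term to its column index in the count matrix.
terms = sorted(vectorizerCount.vocabulary_, key=vectorizerCount.vocabulary_.get)

# Dense DataFrame with one row per headline and one column per term.
allDataVectorized = pd.DataFrame(counts.toarray(), columns=terms, index=allData.index)
print(allDataVectorized)
```

Converting to a dense array is fine for small data; for a large corpus it is usually better to keep the sparse matrix and pass it to the model directly.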

Numpy Array Shape Issue

I have initialized this empty 2d np.array
inputs = np.empty((300, 2), int)
And I am attempting to append a 2d row to it as such
inputs = np.append(inputs, np.array([1,2]), axis=0)
But I'm getting
ValueError: all the input arrays must have same number of dimensions
And NumPy seems to treat it as a 2-row, 0-dimensional object (the transpose of a 2D row)
np.array([1, 2]).shape
(2,)
Where have I gone wrong?
To add a row to a (300,2) shape array, you need a (1,2) shape array. Note the matching 2nd dimension.
np.array([[1,2]]) works. So does np.array([1,2])[None, :] and np.atleast_2d([1,2]).
I encourage the use of np.concatenate. It forces you to think more carefully about the dimensions.
Do you really want to start with np.empty? Look at its values. They are arbitrary (whatever happened to be in memory), and probably large.
@Divakar suggests np.row_stack. That puzzled me a bit, until I checked and found that it is just another name for np.vstack. That function passes all inputs through np.atleast_2d before doing np.concatenate. So ultimately it is the same solution: turn the (2,) array into a (1,2) array.
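The options above can be sketched as follows. This is only an illustration: it starts from an array with zero rows (an assumption, so that no uninitialized rows from np.empty are kept) and appends three rows in three equivalent ways.

```python
import numpy as np

# Zero rows, two columns: nothing uninitialized ends up in the result.
inputs = np.empty((0, 2), dtype=int)
row = np.array([1, 2])                      # shape (2,), a 1-D array

# Add a leading axis so the row has shape (1, 2), then concatenate along axis 0.
inputs = np.concatenate([inputs, row[None, :]], axis=0)

# np.atleast_2d turns a 1-D sequence into a (1, 2) array.
inputs = np.concatenate([inputs, np.atleast_2d([3, 4])], axis=0)

# np.vstack applies atleast_2d to every input itself.
inputs = np.vstack([inputs, np.array([5, 6])])

print(inputs.shape)
print(inputs)
```

If many rows are added in a loop, collecting them in a Python list and stacking once at the end avoids copying the whole array on every append.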
NumPy needs double brackets to create a 2D array with a single row, so
np.array([1,2])
needs to be
np.array([[1,2]])
If you intend to append that as the last row into inputs, you can just simply use np.row_stack -
np.row_stack((inputs,np.array([1,2])))
Please note this np.array([1,2]) is a 1D array.
You can even pass it a 2D row version for the same result -
np.row_stack((inputs,np.array([[1,2]])))

assign certain entries of Tensor, like set_subtensor of Theano

Can I just assign values to certain entries in a tensor? I ran into this problem when computing the cross-correlation matrix of an NxP feature matrix feats, where N is the number of observations and P is the dimension. Some columns are constant, so their standard deviation is zero, and I don't want to divide by the std for those constant columns. Here is what I did:
fmean, fvar = tf.nn.moments(feats, axes = [0], keep_dims = False)
fstd = tf.sqrt(fvar)
feats = feats - fmean
sel = (fstd != 0)
feats[:, sel] = feats[:, sel]/ fstd[sel]
corr = tf.matmul(tf.transpose(feats), feats)
However, I got this error: TypeError: 'Tensor' object does not support item assignment. Is there any workaround for such an issue?
You can make your feats a tf.Variable and use tf.scatter_update to update locations selectively.
It's a bit awkward in that scatter_update needs a list of linear indices to update, so you'd need to convert your [:, sel] implicit 2D specification into an explicit list of 1D indices. There's an example of constructing 1D indices from 2D here
There's some work on simplifying this kind of use case in issue #206.
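For this particular computation, item assignment can also be avoided altogether. The sketch below is not the scatter_update approach described above; it is an alternative that replaces zero standard deviations with 1 before dividing, so constant columns are left as centered zeros. It is written in NumPy with made-up data so that it runs without TensorFlow; tf.where takes the same (condition, x, y) arguments as np.where, so the same idea should carry over.

```python
import numpy as np

# Hypothetical 3x3 feature matrix; the middle column is constant.
feats = np.array([[1.0, 5.0, 2.0],
                  [3.0, 5.0, 4.0],
                  [5.0, 5.0, 9.0]])

fmean = feats.mean(axis=0)
fstd = feats.std(axis=0)

# Where the std is zero, divide by 1 instead, so no masked assignment is needed.
safe_std = np.where(fstd == 0, 1.0, fstd)
normed = (feats - fmean) / safe_std

# Same product as in the question: transpose(feats) times feats.
corr = normed.T @ normed
print(corr)
```

Every operation here builds a new array instead of modifying one in place, which is why the pattern fits tensors that do not support item assignment.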

how to switch a tensor in theano to numpy.array

I have a tensor T (shape: 300) and an array A (shape: 300). What I want to do is combine them into a new array [T, A] with shape (600). I tried the solutions below:
1. Combine them directly with np.concatenate((T, A)). The result is: zero-dimensional arrays cannot be concatenated
2. Convert one type to the other. I tried converting T to a numpy.array with a = np.array(T), but when I print a.shape it is (), with nothing inside the brackets.
Besides, when I print T.shape and A.shape, T.shape is ([300]) and A.shape is (300,). What is the difference?
To get a numpy.array from a tensor T, T.eval() works; I found this after a lot of trial and error. But I haven't found a way to convert a numpy.array to a tensor T yet. Can anyone help?

Should a pandas dataframe column be converted in some way before passing it to a scikit learn regressor?

I have a pandas DataFrame and am passing df[list_of_columns] as X and df[[single_column]] as Y to a Random Forest regressor.
What does the following warning mean, and what should be done to resolve it?
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
probas = cfr.fit(trainset_X, trainset_Y).predict(testset_X)
Simply check the shape of your Y variable. It should be a one-dimensional object, and you are probably passing something with more (possibly trivial) dimensions. Reshape it into a list or 1D array.
You can use df.single_column.values or df['single_column'].values to get the underlying numpy array of your series (which, in this case, should also have the correct 1D-shape as mentioned by lejlot).
Actually, the warning tells you exactly what the problem is:
You passed a 2D array of shape (X, 1), but the method expects a 1D array of shape (X, ).
Moreover, the warning tells you how to get the shape you need: y.values.ravel().
Using Y = df[[single_column]].values.ravel() solved the DataConversionWarning for me.
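The shape difference behind the warning can be seen in a small sketch. The DataFrame and its column names ("feature", "target") are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "target" plays the role of single_column.
df = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "target": [10.0, 20.0, 30.0]})

y_2d = df[["target"]].values   # double brackets: DataFrame -> shape (3, 1)
y_1d = df["target"].values     # single brackets: Series -> shape (3,)
y_flat = y_2d.ravel()          # flattens the column vector to shape (3,)

print(y_2d.shape, y_1d.shape, y_flat.shape)
```

Either y_1d or y_flat has the (n_samples,) shape that the warning asks for.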