Should a pandas dataframe column be converted in some way before passing it to a scikit learn regressor? - pandas

I have a pandas dataframe and am passing df[list_of_columns] as X and df[[single_column]] as Y to a Random Forest regressor.
What does the following warning mean, and what should be done to resolve it?
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
probas = cfr.fit(trainset_X, trainset_Y).predict(testset_X)

Simply check the shape of your Y variable: it should be a one-dimensional object, and you are probably passing something with more (possibly trivial) dimensions. Reshape it into a list or 1D array.

You can use df.single_column.values or df['single_column'].values to get the underlying numpy array of your series (which, in this case, should also have the correct 1D-shape as mentioned by lejlot).

Actually, the warning tells you exactly what the problem is:
You passed a 2D array of shape (X, 1), but the method expects a 1D array of shape (X, ).
Moreover, the warning tells you how to get the shape you need: y.values.ravel().

Using Y = df[[single_column]].values.ravel() solves the DataConversionWarning for me.
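The shapes discussed in these answers can be checked directly. This is a minimal sketch with made-up data; the column names "a", "b" and "target" are hypothetical stand-ins for list_of_columns and single_column.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "target" stands in for the question's single_column.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 5.0, 6.0],
    "target": [0.5, 1.5, 2.5],
})

X = df[["a", "b"]]                      # 2D features, shape (3, 2)
y_column = df[["target"]]               # double brackets: DataFrame, shape (3, 1)
y_series = df["target"]                 # single brackets: Series, shape (3,)
y_flat = df[["target"]].values.ravel()  # column vector flattened to shape (3,)

print(y_column.shape, y_series.shape, y_flat.shape)
```

Either y_series or y_flat has the (n_samples, ) shape the warning asks for; y_column is the column vector that triggers it.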

Related

Construct NumPy matrix row by row

I'm trying to construct a 2D NumPy array from values in an extant 2D NumPy array using an iterative process. Using ordinary python lists the process I'm describing would look like so:
coords = #data from file contained in a 2D list
d = #integer
edges = []
for i in range(d+1):
    for j in range(i+1, d+1):
        edge = coords[j] - coords[i]
        edges.append(edge)
However, the NumPy array imposes restrictions that do not permit the process shown above. Below I try to do the same thing using NumPy arrays, and it should immediately be clear where the problems are:
coords = np.genfromtxt('Energies.txt', dtype=float, skip_header=1)
d = #integer
#how to initialize?
for i in range(d+1):
    for j in range(i+1, d+1):
        edge = coords[j] - coords[i]
        #how to append?
Because .append does not exist for NumPy arrays I need to rely on concatenate or stack instead. But these functions are designed to join existing arrays, and I don't have anything to concatenate or stack until after the first iteration of my loop. So I suppose I need to change my data flow, but I'm unsure how to go about this.
Any help would be greatly appreciated. Thanks in advance.
That function is numpy.meshgrid [1]; it does this by default.
[1] https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.meshgrid.html
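As an alternative to the suggestion above, here is a hedged sketch of two ways to build the edges array from the question. The coords array and the value of d are made-up stand-ins for the data read from the file. The first option keeps the original loop and converts the list once at the end; the second uses np.triu_indices to generate every (i, j) pair with i < j in the same order as the loop.

```python
import numpy as np

# Stand-in for the data loaded from file: d + 1 points with 3 coordinates each.
d = 3
coords = np.arange((d + 1) * 3, dtype=float).reshape(d + 1, 3) ** 2

# Option 1: keep the original loop, append to a list, convert once at the end.
edges = []
for i in range(d + 1):
    for j in range(i + 1, d + 1):
        edges.append(coords[j] - coords[i])
edges_loop = np.array(edges)

# Option 2: build all index pairs with i < j at once and subtract in one step.
i_idx, j_idx = np.triu_indices(d + 1, k=1)
edges_vec = coords[j_idx] - coords[i_idx]

print(edges_loop.shape, edges_vec.shape)
```

Appending to a Python list and calling np.array once avoids the repeated copying that np.concatenate inside a loop would cause.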

Vectorizing text from data frame column using pandas

I have a Data Frame which looks like this:
I am trying to vectorize every row, but only from the text column. I wrote this code:
vectorizerCount = CountVectorizer(stop_words='english')
# tokenize and build vocab
allDataVectorized = allData.apply(vectorizerCount.fit_transform(allData.iloc[:]['headline_text']), axis=1)
The error says:
TypeError: ("'csr_matrix' object is not callable", 'occurred at index 0')
Doing some research and trying changes I found out the fit_transform function returns a scipy.sparse.csr.csr_matrix and that is not callable.
Is there another way to do this?
Thanks!
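There are a number of problems with your code. You probably need something like
allDataVectorized = pd.DataFrame(vectorizerCount.fit_transform(allData['headline_text']).toarray())
allData['headline_text'] (single brackets) is a Series of strings, which is the iterable of documents that fit_transform expects. allData[['headline_text']] (with the double brackets) is a DataFrame, and iterating over it yields the column names rather than the rows.
fit_transform returns a csr matrix.
.toarray() converts it to a dense 2d array, and pd.DataFrame(...) creates a DataFrame from that.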

Numpy Array Shape Issue

I have initialized this empty 2d np.array
inputs = np.empty((300, 2), int)
And I am attempting to append a 2d row to it as such
inputs = np.append(inputs, np.array([1,2]), axis=0)
But I'm getting
ValueError: all the input arrays must have same number of dimensions
And Numpy thinks it's a 2 row 0 dimensional object (transpose of 2d)
np.array([1, 2]).shape
(2,)
Where have I gone wrong?
To add a row to a (300,2) shape array, you need a (1,2) shape array. Note the matching 2nd dimension.
np.array([[1,2]]) works. So does np.array([1,2])[None, :] and np.atleast_2d([1,2]).
I encourage the use of np.concatenate. It forces you to think more carefully about the dimensions.
Do you really want to start with np.empty? Look at its values. They are uninitialized (arbitrary), and probably large.
@Divakar suggests np.row_stack. That puzzled me a bit, until I checked and found that it is just another name for np.vstack. That function passes all inputs through np.atleast_2d before doing np.concatenate. So ultimately it is the same solution: turn the (2,) array into a (1,2) array.
To declare a 2D row (shape (1, 2)) the array literal needs double brackets, so
np.array([1,2])
needs to be
np.array([[1,2]])
If you intend to append that as the last row into inputs, you can just simply use np.row_stack -
np.row_stack((inputs,np.array([1,2])))
Please note this np.array([1,2]) is a 1D array.
You can even pass it a 2D row version for the same result -
np.row_stack((inputs,np.array([[1,2]])))
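The options from both answers can be compared side by side. This sketch uses np.zeros instead of np.empty so the starting contents are defined, and np.vstack in place of np.row_stack since the answer above notes they are the same function; both substitutions are choices made here for illustration.

```python
import numpy as np

inputs = np.zeros((300, 2), dtype=int)  # zeros rather than np.empty: defined contents
row = np.array([1, 2])                  # 1D array, shape (2,)

# Three equivalent ways to get a (1, 2) row for np.concatenate.
a = np.concatenate((inputs, np.array([[1, 2]])), axis=0)
b = np.concatenate((inputs, row[None, :]), axis=0)
c = np.concatenate((inputs, np.atleast_2d(row)), axis=0)

# np.vstack applies np.atleast_2d to its inputs, so the 1D row works directly.
v = np.vstack((inputs, row))

print(a.shape, v.shape)
```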

Evaluate several elements of numpy object array

I have an ndarray A that stores objects of the same type, in particular various LinearNDInterpolator objects. For example's sake assume it's just 2:
>>> A
array([ <scipy.interpolate.interpnd.LinearNDInterpolator object at 0x7fe122adc750>,
<scipy.interpolate.interpnd.LinearNDInterpolator object at 0x7fe11daee590>], dtype=object)
I want to be able to do two things. First, I'd like to evaluate all objects in A at a certain point and get back an ndarray of A.shape with all the values in it. Something like
>> A[[0,1]](1,1) =
array([ 1, 2])
However, I get
TypeError: 'numpy.ndarray' object is not callable
Is it possible to do that?
Second, I would like to change the interpolation values without constructing new LinearNDInterpolator objects (since the nodes stay the same). I.e., something like
A[[0,1]].values = B
where B is an ndarray containing the new values for every element of A.
Thank you for your suggestions.
The same issue, but with simpler functions:
In [221]: A=np.array([add,multiply])
In [222]: A[0](1,2) # individual elements can be called
Out[222]: 3
In [223]: A(1,2) # but not the array as a whole
---------------------------------------------------------------------------
TypeError: 'numpy.ndarray' object is not callable
We can iterate over a list of functions, or that array as well, calling each element on the parameters. Done right we can even zip a list of functions and a list of parameters.
In [224]: ll=[add,multiply]
In [225]: [x(1,2) for x in ll]
Out[225]: [3, 2]
In [226]: [x(1,2) for x in A]
Out[226]: [3, 2]
Another test, the callable function:
In [229]: callable(A)
Out[229]: False
In [230]: callable(A[0])
Out[230]: True
Can you change the interpolation values for individual Interpolators? If so, just iterate through the list and do that.
In general, dtype object arrays function like lists. They contain the same kind of object pointers. Most operations require the same sort of iteration. Unless you need to organize the elements in multiple dimensions, dtype object arrays have few, if any, advantages over lists.
Another thought - the normal array dtype is numeric or fixed length strings. These elements are not callable, so there's no need to implement a .__call__ method on these arrays. They could write something like that to operate on object dtype arrays, but the core action is a Python call. So such a function would just hide the kind of iteration that I outlined.
In another recent question I showed how to use np.char.upper to apply a string method to every element of an S dtype array. But my time tests showed that this did not speed anything up.
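The iteration described above can be wrapped so that the result comes back with the shape of A. In this sketch operator.add and operator.mul are stand-ins for the interpolator objects, and the evaluation point (1, 2) is arbitrary. np.frompyfunc is shown as one way to hide the loop; it still performs a Python call per element, so it is a convenience rather than a speedup.

```python
import operator
import numpy as np

# operator.add and operator.mul are stand-ins for the interpolator objects.
A = np.empty(2, dtype=object)
A[0], A[1] = operator.add, operator.mul

# Explicit iteration over the elements, then restore the shape of A.
values = np.array([f(1, 2) for f in A.flat]).reshape(A.shape)

# np.frompyfunc wraps the same Python-level call; the result has dtype object.
call_at = np.frompyfunc(lambda f: f(1, 2), 1, 1)
values_obj = call_at(A)

print(values, values_obj)
```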

Iterating over multidimensional arrays(images) with numpy array - python

Hi!
I have two images (same dimensions) as numpy arrays, imgA and imgB.
I would like to iterate over each row and column and compute something like this:
for i in range(0, h-1):
    for j in range(0, w-1):
        final[i][j] = imgA[i,j] - imgB[i-k[i],j]
where h and w are the height and the width of the image and k is an array with dimension [h*w].
I have seen this topic:
Iterating over a numpy array
but it doesn't work with images; I get the error: too many values to unpack
Is there any way to do that with numpy and Python 2.7?
Thanks
edit
I'll try to explain myself better.
I have 2 images in LAB color space.
These images have shape (288,384,3).
Now I would like to compute deltaE, so I could do it like this (splitting the 2 arrays):
imgLabL=np.dsplit(imgL,3)
imgLabR=np.dsplit(imgR,3)
imgLl=imgLabL[0]
imgLa=imgLabL[1]
imgLb=imgLabL[2]
imgRl=imgLabR[0]
imgRa=imgLabR[1]
imgRb=imgLabR[2]
delta=np.sqrt(((imgLl-imgRl)**2) + ((imgLa - imgRa)**2) + ((imgLb - imgRb)**2) )
So far everything is fine.
But now I have this array k of size (288,384).
So now I need a new delta, but shifted along the x axis: for example, for the pixel at imgRl(0,0) I want to use the pixel at imgLl(0+k,0).
Does that make my problem clearer?
I'm pretty sure that whatever it is you are trying to do can be vectorized and run without any loops in it. But the way your code is written, it is no surprise that it doesn't work...
If k is an array of shape (h, w), then k[i] is an array of shape (w,). When you do i-k[i], numpy will do its broadcasting magic, and you will get an array of shape (w,). So you are indexing imgB with an array of shape (w,) and a single integer. Because one of the items in the indexing is an array, fancy indexing kicks in. So assuming imgB has shape (h, w, 1), the return value of imgB[i-k[i], j] will not be an array of shape (1,), but an array of shape (w, 1). When you then try to subtract that from imgA[i, j], which is an array of shape (1,), broadcasting magic works again, and so you get an array of shape (w, 1).
We do not know what final is. But if it is an array of shape (h, w, 1), like imgA and imgB, then final[i][j] is an array of shape (1,), and you are trying to assign to it an array of shape (w, 1), which does not fit. Hence the "operand requires a reduction, but reduction is not enabled" error message.
EDIT
You don't really need to split your arrays to compute DeltaE. With imgL and imgR being the unsplit (288, 384, 3) arrays:
def deltaE(a, b):
    return np.sqrt(((a - b)**2).sum(axis=-1))
delta = deltaE(imgL, imgR)
I still don't understand what you want to do in the second case... If you want to compare the two images displaced along the x-axis, I would suggest using np.roll:
deltaE(imgL, np.roll(imgR, k, axis=0))
will have at position (r, c) the deltaE between the pixel (r, c) of imgL and the pixel (r - k, c) of imgR, for an integer shift k (np.roll wraps around at the edges). Is that what you want?
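If k holds a separate row offset for every pixel, as the question's edit suggests, a single np.roll call does not cover it, but fancy indexing with a full (h, w) index array can. The following is one possible sketch, not a confirmed reading of the question: the image data is random, and wrapping out-of-range rows with a modulo is an assumption (clipping would be another option).

```python
import numpy as np

h, w = 4, 5
rng = np.random.default_rng(0)
imgA = rng.random((h, w, 3))          # stand-ins for the two Lab images
imgB = rng.random((h, w, 3))
k = rng.integers(0, h, size=(h, w))   # hypothetical per-pixel row offsets

# Row index i - k[i, j] for every pixel, wrapped into [0, h) (assumption).
rows = (np.arange(h)[:, None] - k) % h
# Column index j for every pixel, broadcast to the same (h, w) shape.
cols = np.broadcast_to(np.arange(w), (h, w))

shifted = imgB[rows, cols]            # shape (h, w, 3)
delta = np.sqrt(((imgA - shifted) ** 2).sum(axis=-1))

print(delta.shape)
```

Both index arrays have shape (h, w), so the result keeps the image layout and the trailing color axis is carried along untouched.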
I usually use numpy.nditer; its documentation has many examples. Briefly:
import numpy as np
a = np.ones([4,4])
it = np.nditer(a)
for elem in it:
    pass  # do stuff
You can also use C-style iteration, i.e.
while not it.finished:
    # do stuff
    it.iternext()
which is useful if you need to access the indices of your arrays. In your situation, I would stack your two images together to create an array of shape [2,h,w] and then iterate over this, filling an empty array with the results of the computation.
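To make the index access concrete, here is a small sketch of C-style iteration with np.nditer and the multi_index flag. The array contents and the computation inside the loop are made up for illustration.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
out = np.empty_like(a)

# The multi_index flag makes the current (row, column) position available.
it = np.nditer(a, flags=["multi_index"])
while not it.finished:
    i, j = it.multi_index
    out[i, j] = it[0] * 10 + i   # it[0] is the current element of a
    it.iternext()

print(out)
```

For large images this element-by-element loop runs at Python speed, so a vectorized expression is usually preferable when one exists.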