Getting an error while performing undersampling for sklearn - pandas

I am trying to build a random forest classifier for binary classification. My data is imbalanced, hence I am performing undersampling.
from imblearn.under_sampling import RandomUnderSampler
from sklearn import metrics

train = data.drop(['Co_Name','Cust_ID','Phone','Shpr_ID','Resi_Cnt','Buz_Cnt','Nearby_Cnt','parseNumber','removeString','Qty','bins','Adj_Addr','Resi','Weight','Resi_Area','Lat','Lng'], axis=1)
Y = data['Resi']
rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_sample(train, Y)
I am getting the error below:
446 # make sure we actually converted to numeric:
447 if dtype_numeric and array.dtype.kind == "O":
--> 448 array = array.astype(np.float64)
449 if not allow_nd and array.ndim >= 3:
450 raise ValueError("Found array with dim %d. %s expected <= 2."
ValueError: setting an array element with a sequence.
How do I fix this?

Can you share the dataframe, or a sample of it?
This error can mean a lot of things. For example, if you try:
import numpy as np

np.asarray(
    [
        [1, 2],
        [2, 3, 4]
    ],
    dtype=float)
You will get:
ValueError: setting an array element with a sequence.
This is because the nested lists have inconsistent lengths: the second list is longer than the first, so NumPy can't build a rectangular numeric array from them. The column lengths have to match.
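If ragged data is intentional, newer NumPy versions require an explicit dtype=object, for example:
import numpy as np

np.asarray([[1, 2], [2, 3, 4]], dtype=object)
# -> array([list([1, 2]), list([2, 3, 4])], dtype=object)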
But your error is probably related to the shape of train vs Y, or to the dtypes inside train. The under-sampler's fit function performs a conversion internally that throws this error. Confirm that train contains only appropriate numeric types before calling RandomUnderSampler.
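As a quick sanity check (a sketch, assuming train is the DataFrame built above), inspect the dtypes and look for object columns that would break the numeric conversion:
print(train.dtypes)
# columns still stored as Python objects (strings, lists, ...) are the usual culprits
print(train.select_dtypes(include='object').columns)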

Related

numpy.VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences

Here's an example of behavior I cannot understand; maybe someone can share some insight into the logic behind it:
import numpy as np

ccn = np.ones(1)
bbb = 7
bbn = np.array(bbb)
bbn * ccn            # this is OK -> array([7.])
np.prod((bbn, ccn))  # but this is NOT
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2020.2.2\plugins\python-ce\helpers\pydev\_pydevd_bundle\pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "<__array_function__ internals>", line 5, in prod
File "C:\Users\...\venv\lib\site-packages\numpy\core\fromnumeric.py", line 2999, in prod
return _wrapreduction(a, np.multiply, 'prod', axis, dtype, out,
File "C:\Users\...\venv\lib\site-packages\numpy\core\fromnumeric.py", line 87, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
numpy.VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
Why? Why would a simple multiplication of two numbers be a problem? As far as formal algebra goes, there are no dimensional problems and no datatype problems. The result is invariably also a single number; there's no chance it "suddenly" turns into a vector or an object or anything alike. prod(a, b) for a and b being scalars or 1-by-1 "matrices" is something MATLAB or Octave would eat, no problem.
I know I can turn this error off and such, but why is it even an error?
In [346]: ccn = np.ones(1)
...: bbb = 7
...: bbn = np.array(bbb)
In [347]: ccn.shape
Out[347]: (1,)
In [348]: bbn.shape
Out[348]: ()
In [349]: np.array((bbn,ccn))
<ipython-input-349-997419ba7a2f>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
np.array((bbn,ccn))
Out[349]: array([array(7), array([1.])], dtype=object)
You have arrays with different dimensions that can't be combined into one numeric array.
That np.prod expression is actually:
np.multiply.reduce(np.array([bbn, ccn]))
as can be deduced from your traceback.
In Octave both objects have shape (1,1), 2d
>> ccn = ones(1)
ccn = 1
>> ccn = ones(1);
>> size(ccn)
ans =
1 1
>> bbn = 7;
>> size(bbn)
ans =
1 1
>> [bbn,ccn]
ans =
7 1
It doesn't have true scalars; everything is 2d (even 3d is a fudge on the last dimension).
And with 'raw' Python inputs:
In [350]: np.array([1,[1]])
<ipython-input-350-f17372e1b22d>:1: VisibleDeprecationWarning: ...
np.array([1,[1]])
Out[350]: array([1, list([1])], dtype=object)
The object dtype array preserves the type of the inputs.
Edit:
prod isn't a simple multiplication. It's a reduction operation, like the big Pi in math. Even in Octave it isn't:
>> prod([[2,3],[3;4]])
error: horizontal dimensions mismatch (1x2 vs 2x1)
>> [2,3]*[3;4]
ans = 18
>> [2,3].*[3;4]
ans =
6 9
8 12
The numpy equivalent:
In [97]: np.prod((np.array([2,3]),np.array([[3],[4]])))
/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py:87: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences...
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: could not broadcast input array from shape (2,1) into shape (2,)
In [98]: np.array([2,3]) @ np.array([[3],[4]])
Out[98]: array([18])
In [99]: np.array([2,3])*np.array([[3],[4]])
Out[99]:
array([[ 6, 9],
[ 8, 12]])
The warning, and here the error, is produced by trying to make ONE array from (np.array([2,3]), np.array([[3],[4]])).
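If the goal is just the product of the two values, keeping them out of a single ragged array avoids the problem entirely; a minimal sketch:
import numpy as np

ccn = np.ones(1)
bbn = np.array(7)
bbn * ccn                    # broadcasting handles the () and (1,) shapes: array([7.])
np.prod(bbn) * np.prod(ccn)  # reduce each array separately, then multiply: 7.0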

sklearn TimeSeriesSplit Error: KeyError: '[ 0 1 2 ...] not in index'

I want to use TimeSeriesSplit from sklearn on the following dataframe to predict sum:
So to prepare X and y I do the following:
from sklearn.model_selection import TimeSeriesSplit

X = df.drop(['sum'], axis=1)
y = df['sum']
and then feed these two to:
tscv = TimeSeriesSplit()  # splitter construction was omitted in the question
for train_index, test_index in tscv.split(X):
    X_train01, X_test01 = X[train_index], X[test_index]
    y_train01, y_test01 = y[train_index], y[test_index]
by doing so, I get the following error:
KeyError: '[ 0 1 2 ...] not in index'
Here X is a dataframe, and apparently this is what causes the error, because if I convert X to an array as follows:
X = X.values
then it works. However, for later evaluation of the model I need X as a dataframe. Is there any way I can keep X as a dataframe and feed it to tscv without converting it to an array?
As @Jarad rightly said, if you have an up-to-date version of pandas, it will not automatically fall back to integer-based indexing as was possible in previous versions. You need to explicitly use .iloc for integer-based slicing.
for train_index, test_index in tscv.split(X):
    X_train01, X_test01 = X.iloc[train_index], X.iloc[test_index]
    y_train01, y_test01 = y.iloc[train_index], y.iloc[test_index]
See https://pandas.pydata.org/pandas-docs/stable/indexing.html
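For context, here is a small illustration (with made-up data, not from the question) of why positional .iloc works where label-based lookup fails:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=[10, 20, 30])
df.iloc[[0, 1]]  # positional: returns the first two rows
df.loc[[0, 1]]   # label-based: KeyError, 0 and 1 are not in the index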

What's the appropriate placeholder for my input?

I have a dataframe of 1k rows and 14 columns containing numpy arrays, as shown below.
Here is a subset of 2 rows and 3 columns:
[5,4,74,-12] [ 78,1,2,-9] [5 ,1,1,2]
[10,4,4,-1] [ 8,15,21,-19] [1,1,0,0]
where each cell is a numpy array of shape (4,1).
I couldn't find the right placeholder to input my whole dataframe, as it needs to be processed in row batches. Does anyone have an idea?
I tried this to find the proper placeholder for my dataframe, but it's not correct:
import tensorflow as tf

x = tf.placeholder(tf.int32, [None, 14], name='x')
with tf.Session() as sess:
    print(sess.run(x, feed_dict={x: Data}))
It gives: ValueError: setting an array element with a sequence.
You did not specify in which format your data is available, so I assume it is a numpy array. In this case, you can do it like this:
import tensorflow as tf

n_columns = 14
n_elements_per_column = 4
x = tf.placeholder(tf.int32, [None, n_columns, n_elements_per_column], name='x')
with tf.Session() as sess:
    print(sess.run(x, feed_dict={x: Data}))
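Note that feeding the raw DataFrame of arrays would still raise the same ValueError; the cells have to be stacked into one numeric ndarray first. A sketch, assuming Data is the 1000x14 DataFrame of (4,1) arrays from the question:
import numpy as np

# flatten each (4,1) cell to length 4, stack a row's 14 cells, then stack all rows
batch = np.stack([np.stack([cell.ravel() for cell in row])
                  for _, row in Data.iterrows()])
# batch.shape == (1000, 14, 4), matching the placeholder above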

Sklearn and Sparse Matrices ValueError

I'm aware similar questions have been asked before, and I've tried everything suggested in them, but I'm still stumped. I have a dataset with 2 columns: the first contains vectors representing words, stored as 1x10000 sparse csr matrices (so a matrix in each cell), and the second contains integer ratings which I will use for classification. When I run the following code
for index, row in data.iterrows():
    print(row)
    print(row[0].shape)
I get the correct output for all the rows
Vector    (0, 0)\t1.0\n (0, 1)\t1.0\n (0, 2)\t1.0\n ...
Rating    5
Name: 0, dtype: object
(1, 10000)
Now when I try passing my data to any sklearn classifier like so:
from sklearn.dummy import DummyClassifier

uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vectors"], data["Ratings"])
I get the following error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
What am I doing wrong? I've made sure all my sparse matrices are the same size and I've tried reshaping my data in various ways, but with no luck, and the sklearn classifiers are supposed to be able to deal with csr matrices.
Update: Converting the entire "Vectors" column into one large 2-D sparse matrix did the trick (see the sketch after the code below). For completeness' sake, the following is the code I used to generate my dataframe, if anyone is curious and wants to try solving the original issue. Assume data is a pandas dataframe with rows that look like
"560 420 222" 5.0
"2345 2344 2344 5" 3.0
import pandas as pd
from scipy import sparse

def vectorize(feature, size):
    """Given a numeric string generated from a vocabulary table, return a binary
    vector representation of the feature."""
    vector = sparse.lil_matrix((1, size))
    for number in feature.split(' '):
        try:
            vector[0, int(number) - 1] = 1
        except ValueError:
            pass
    return vector

def vectorize_dataset(data, vectorize, size):
    """Given a dataset in the appropriate "num num num..." format, a specific
    vectorization function, and a vector size, return the dataset in vectorized form."""
    result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
    for index, row in data.iterrows():
        # All the mixing up of decodings and encodings has made it so that
        # pandas incorrectly parses EOF chars
        if isinstance(row[0], str):
            result_data.iat[index, 0] = vectorize(row[0], size).tocsr()
            result_data.iat[index, 1] = data.loc[index][1]
    return result_data
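For reference, the "one large 2-D matrix" conversion mentioned in the update can be done with scipy's vstack; a sketch, assuming result_data comes from vectorize_dataset above and every row was successfully vectorized:
from scipy import sparse

X = sparse.vstack(result_data["Vector"].values)  # stacks the 1x10000 rows into (n_rows, 10000)
y = result_data["Rating"].astype(int)
uniform_random_classifier.fit(X, y)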

Error when using KNeighborsClassifier with sklearn

I am doing KNN classification for a dataset of 28 features and 5000 samples:
from sklearn.neighbors import KNeighborsClassifier

trainingSet = []
testSet = []
imdb_score = range(1, 11)
print("Start splitting the dataset ...")
splitDataset(path + 'movies.csv', 0.60, trainingSet, testSet)
print("Start KNeighborsClassifier ... \n")
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(trainingSet, imdb_score)
However, I ran into this error:
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [3362, 10]
I think my code looks alright. Has anyone run into this issue before?
So you've got about 6000 samples and use 60% of these, resulting in 3362 samples (it seems; I don't see your exact calculations).
You call fit(X, Y), where the following is needed for y:
y : {array-like, sparse matrix}
Target values of shape = [n_samples] or [n_samples, n_outputs]
As your y=imdb_score is just a list of 10 values, neither of these rules applies: it needs to be either an array-like data structure (a list would be okay) with 3362 values, or an array of shape (3362, 1).
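A minimal sketch of the fix, assuming each row of trainingSet carries its label (the imdb score) in the last column; the exact layout produced by splitDataset isn't shown in the question:
X_train = [row[:-1] for row in trainingSet]  # 28 features per sample
y_train = [row[-1] for row in trainingSet]   # one label per sample -> 3362 values
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)                  # n_samples now matches between X and y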