ValueError Inconsistent number of samples error with MultinomialNB - pandas

I need to create a model that classifies records accurately based on a variable. For instance, if a record has predictor A or B, I want it to be classified as having predicted value X. The actual data is in this form:
Predicted Predictor
X A
X B
Y D
X A
For my solution, I did the following:
1. Used LabelEncoder to create numerical values for the Predicted column
2. The predictor variable has multiple categories, which I parsed into individual columns using get_dummies.
Here is a sub-section of the dataframe with the encoded Predicted column and a couple of the Predictor dummy columns (the unlabeled leftmost column is the dataframe index):
Predicted Predictor_A Predictor_B
9056 30 0 0
2482 74 1 0
3407 56 1 0
12882 15 0 0
7988 30 0 0
13032 12 0 0
9738 28 0 0
6739 40 0 0
373 131 0 0
3030 62 0 0
8964 30 0 0
691 125 0 0
6214 41 0 0
6438 41 1 0
5060 42 0 0
3703 49 0 0
12461 16 0 0
2235 75 0 0
5107 42 0 0
4464 46 0 0
7075 39 1 0
11891 16 0 0
9190 30 0 0
8312 30 0 0
10328 24 0 0
1602 97 0 0
8804 30 0 0
8286 30 0 0
6821 40 0 0
3953 46 1
After reshaping the data into the dataframe shown above, I try using MultinomialNB from sklearn. When doing so, the error I run into is:
ValueError: Found input variables with inconsistent numbers of samples: [1, 8158]
I run into the error even when trying it with a dataframe that has only 2 columns: Predicted and Predictor_A.
My questions are:
What do I need to do to resolve the error?
Is my approach correct?

To fit the MultinomialNB model, you need the training samples with their features and the corresponding labels (target values). The numbers in the error ([1, 8158]) mean that the X and y you passed in did not have the same number of rows: one of them ended up with only a single sample.
In your case, Predicted is the target variable and Predictor_A and Predictor_B are the features (predictors).
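A quick sanity check before fitting (a minimal sketch, assuming df holds the dataframe shown in the question):
# X must have shape (n_samples, n_features) and y shape (n_samples,)
X = df[['Predictor_A', 'Predictor_B']]
y = df['Predicted']
print(X.shape, y.shape)  # e.g. (8158, 2) (8158,) -- the first dimensions must match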
Example 1:
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("dt.csv", delim_whitespace=True)
# X is the features
X = df[['Predictor_A','Predictor_B']]
# y is the labels (targets or classes)
y = df['Predicted']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf = MultinomialNB()
clf.fit(X_train, y_train)
clf.predict(X_test)
#array([30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30])
# this result makes sense if you look at X_test: the samples are all very similar
print(X_test)
Predictor_A Predictor_B
8286 0 0
12461 0 0
6214 0 0
9190 0 0
373 0 0
3030 0 0
11891 0 0
9056 0 0
8804 0 0
6438 1 0
# get the predicted class probabilities
clf.predict_proba(X_test)
Note: The data that I used can be found here
EDIT
If you train the model using some documents that have, let's say, 4 tags (predictors), then any new document that you want to predict on should also have the same number of tags.
Example 2:
clf.fit(X, y)
here, X is a [29, 2] array, so we have 29 training samples (documents) and each has 2 tags (predictors)
clf.predict(X_new)
here, X_new could be [n, 2], so we can predict the classes of n new documents, but these new documents must also have exactly 2 tags (predictors).
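For instance, to predict on new documents you would build X_new with the same two predictor columns the model was trained on (a minimal sketch; the values below are made up for illustration):
import pandas as pd
# two hypothetical new samples with the same 2 predictor columns used in training
X_new = pd.DataFrame({'Predictor_A': [1, 0],
                      'Predictor_B': [0, 0]})
print(clf.predict(X_new))        # one predicted class per new sample
print(clf.predict_proba(X_new))  # shape (2, n_classes)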

Related

knn classifiers troubleshooting

I'm exploring knn classifiers using some stock data. The features I'm using to classify are the mean_return and volatility, and my class labels are 'green' and 'red', encoded as 0 and 1 respectively.
This is my code so far, including training and testing:
year_one.loc[year_one['labels'] == 'green', 'label_two'] = 0
year_one.loc[year_one['labels'] == 'red', 'label_two'] = 1
X = year_one.iloc[:, 2:4] # features
y = year_one.iloc[:, -1] # label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 20)
And this is what my dataframe looks like:
Year Week_Number mean_return volatility labels label_two
159 2020 1 1.57500 0.738242 green 0
160 2020 2 1.21760 0.672509 green 0
161 2020 3 -0.20475 3.040763 red 1
162 2020 4 -2.10100 3.879057 red 1
163 2020 5 0.35420 5.266582 green 0
164 2020 6 0.57760 1.611520 green 0
165 2020 7 -0.49050 3.277057 red 1
166 2020 8 -1.11040 3.086351 red 1
167 2020 9 -0.31020 4.117689 red 1
168 2020 10 -4.88960 12.424480 red 1
When I try to run the knn classifier with sklearn, I get an error that says 'ValueError: Unknown label type: 'unknown''
classifier = KNeighborsClassifier(n_neighbors = 3)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Any idea what the error is and what I'm doing wrong? Thanks.
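One likely cause (an assumption, since the dtypes are not shown): creating label_two via the .loc assignments leaves the column with a float/object dtype, and scikit-learn classifiers then report 'Unknown label type'. A common fix is to cast the labels to integers before fitting, sketched below:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# cast the label column to int so scikit-learn sees discrete class labels
year_one['label_two'] = year_one['label_two'].astype(int)

X = year_one[['mean_return', 'volatility']]  # features
y = year_one['label_two']                    # label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=20)

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)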

How to select row value from given columns based on comparison of other column values in Pandas data frame?

I have the following Pandas DataFrame:
true_y m1_labels m1_probs_0 m1_probs_1 m2_labels m2_probs_0 m2_probs_1
0 0 0.628205 0.371795 1 0.491648 0.508352
0 0 0.564113 0.435887 1 0.474973 0.525027
0 1 0.463897 0.536103 0 0.660307 0.339693
0 1 0.454559 0.545441 0 0.512349 0.487651
0 0 0.608345 0.391655 1 0.499531 0.500469
0 0 0.816127 0.183873 1 0.456669 0.543331
0 1 0.442693 0.557307 0 0.573354 0.426646
1 0 0.653497 0.346503 1 0.487212 0.512788
0 1 0.392380 0.607620 0 0.627419 0.372581
0 1 0.375816 0.624184 0 0.631532 0.368468
This is a collection of disagreeing ML model predictions with labels and label probabilities of two models (m1, m2) and the actual label (true_y).
Per row, I would like the hard label prediction (m1_labels or m2_labels) from whichever model assigns the higher probability to its own predicted class. So for row #1, I expect 0 (as the m1 model has a higher probability for its prediction 0 than the m2 model has for its prediction 1). Basically, this is intended to be a manual voting ensemble of the two models.
How can I get this vector with a Pandas query?
You can use the apply function for this:
df.apply(lambda x: x["m1_labels"] if max(x["m1_probs_0"], x["m1_probs_1"]) > max(x["m2_probs_0"], x["m2_probs_1"]) else x["m2_labels"], axis=1)
This selects the first model's label if the probability of its predicted class is higher than the probability of the second model's predicted class. Otherwise, it selects the label from the second model.
You can use:
# get max probability for m1
p1 = df.filter(like='m1_probs').max(axis=1)
# get max probability for m2
p2 = df.filter(like='m2_probs').max(axis=1)
# m1_label if it has a greater probability, else m2_label
df['best'] = df['m1_labels'].where(p1.gt(p2), df['m2_labels'])
output:
true_y m1_labels m1_probs_0 m1_probs_1 m2_labels m2_probs_0 m2_probs_1 best
0 0 0 0.628205 0.371795 1 0.491648 0.508352 0
1 0 0 0.564113 0.435887 1 0.474973 0.525027 0
2 0 1 0.463897 0.536103 0 0.660307 0.339693 0
3 0 1 0.454559 0.545441 0 0.512349 0.487651 1
4 0 0 0.608345 0.391655 1 0.499531 0.500469 0
5 0 0 0.816127 0.183873 1 0.456669 0.543331 0
6 0 1 0.442693 0.557307 0 0.573354 0.426646 0
7 1 0 0.653497 0.346503 1 0.487212 0.512788 0
8 0 1 0.392380 0.607620 0 0.627419 0.372581 0
9 0 1 0.375816 0.624184 0 0.631532 0.368468 0

Pandas linear regression: use normalisation (StandardScaler) only on non-categorical values

I have the following dataset which I'm reading into a Pandas dataframe:
age gender bmi smoker married region value
39 female 23.0 yes no us 136
28 male 22.0 no no us 143
23 male 34.0 no yes europe 153
17 male 29.0 no no asia 162
Gender, smoker and region are categorical values, so I convert them (using the replace function for gender and smoker, and one-hot encoding for region). The result is the following:
age sex bmi smoker married value r_asia r_europe r_us
39 1 23.0 1 0 136 0 0 1
28 0 22.0 0 0 143 0 0 1
23 0 34.0 0 1 153 0 1 0
17 0 29.0 0 0 162 1 0 0
Then I'm splitting into features and target
y = dataset['value'].values
X = dataset.drop('value',axis=1).values
Next I'm splitting into a training and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
As a next step I want to normalise. Normally I would do:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
However this also normalises the categorical values. I want to only normalise the non-categorical value (in this example the only non-categorical value is 'bmi').
How can I only normalise the 'bmi' column and insert these normalised values into X_train and X_test?
train_test_split here returns NumPy arrays (X_train is one of them), not dataframes, because X was built with .values, so X_train["bmi"] raises an exception; StandardScaler cannot select the column by name either.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
print(X_train)
# Output <class 'numpy.ndarray'>
[[28 'male' 22.0 'no' 'no' 'us']
[39 'female' 23.0 'yes' 'no' 'us']]
So, here is one way to do it:
# Back to Pandas
X_train = pd.DataFrame(X_train)
# Fit and transform the target column (2 == "bmi")
scaler = StandardScaler()
scaler.fit(X_train.loc[:, 2].to_numpy().reshape(-1, 1))
X_train[2] = scaler.transform(X_train.loc[:, 2].to_numpy().reshape(-1, 1))
# Revert to Numpy
X_train = X_train.to_numpy()
print(X_train)
# Output
[[28 'male' -1.0 'no' 'no' 'us']
[39 'female' 1.0 'yes' 'no' 'us']]
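An alternative (a sketch of a different approach from the answer above): keep X as a dataframe and let a ColumnTransformer scale only the 'bmi' column while passing the remaining columns through unchanged:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = dataset.drop('value', axis=1)   # keep it as a DataFrame (no .values)
y = dataset['value'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# scale only 'bmi'; all other columns are passed through untouched
ct = ColumnTransformer([('scale_bmi', StandardScaler(), ['bmi'])],
                       remainder='passthrough')
X_train = ct.fit_transform(X_train)  # fit the scaler on the training split only
X_test = ct.transform(X_test)        # reuse the fitted scaler on the test split
Note that ColumnTransformer reorders the output: the transformed 'bmi' column comes first, followed by the passthrough columns.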

Data standardization of feat having lt/gt values among absolute values

One of the datasets I am dealing with has a few features that contain less-than/greater-than values (e.g. '<10', '>90') alongside absolute values. Please refer to the example below:
>>> df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
>>> df
foo
0 <10
1 23
2 34
3 22
4 >90
5 42
Note: foo is a % value, i.e. 0 <= foo <= 100.
How is such data typically transformed before running regression models on it?
One thing you could do is, for the '<10' values, impute the midpoint of that range (5); similarly, for '>90', impute 95.
Then add two extra boolean columns:
df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
dummies = pd.get_dummies(df, columns=['foo'])[['foo_<10', 'foo_>90']]
df = df.replace('<10', 5).replace('>90', 95)
df = pd.concat([df, dummies], axis=1)
df
This will give you
foo foo_<10 foo_>90
0 5 1 0
1 23 0 0
2 34 0 0
3 22 0 0
4 95 0 1
5 42 0 0
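One caveat worth adding (not part of the original answer): after the replace, the foo column mixes strings ('23', '34', ...) with integers (5, 95), so cast it to a numeric dtype before feeding it to a regression model:
# ensure the imputed column is numeric rather than mixed str/int
df['foo'] = pd.to_numeric(df['foo'])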

In Torch how do I create a 1-hot tensor from a list of integer labels?

I have a byte tensor of integer class labels, e.g. from the MNIST data set.
1
7
5
[torch.ByteTensor of size 3]
How do I use it to create a tensor of 1-hot vectors?
1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 1 0 0 0 0 0
[torch.DoubleTensor of size 3x10]
I know I could do this with a loop, but I'm wondering if there's any clever Torch indexing that will get it for me in a single line.
indices = torch.LongTensor{1,7,5}:view(-1,1)
one_hot = torch.zeros(3, 10)
one_hot:scatter(2, indices, 1)
You can find the documentation for scatter in the torch/torch7 github readme (in the master branch).
An alternate method is to shuffle rows from an identity matrix:
indices = torch.LongTensor{1,7,5}
one_hot = torch.eye(10):index(1, indices)
This was not my idea, I found it in karpathy/char-rnn.
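For reference, the same idea in present-day PyTorch (Python) is sketched below; note that PyTorch class labels are 0-indexed, unlike the 1-indexed Lua example above:
import torch

labels = torch.tensor([1, 7, 5])  # 0-indexed class labels
# scatter 1s into a zero matrix along dimension 1
one_hot = torch.zeros(3, 10).scatter_(1, labels.view(-1, 1), 1.0)

# or, equivalently, index rows of an identity matrix
one_hot = torch.eye(10)[labels]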