KNN classifier troubleshooting - pandas

I'm exploring KNN classifiers using some stock data. The features I'm using to classify are mean_return and volatility; my class labels are 'green' and 'red', encoded as 0 and 1 respectively.
This is my code so far, including training and testing:
from sklearn.model_selection import train_test_split

# Encode the string labels as integers
year_one.loc[year_one['labels'] == 'green', 'label_two'] = 0
year_one.loc[year_one['labels'] == 'red', 'label_two'] = 1
X = year_one.iloc[:, 2:4]  # features: mean_return and volatility
y = year_one.iloc[:, -1]   # label: label_two
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=20)
And this is what my dataframe looks like:
Year Week_Number mean_return volatility labels label_two
159 2020 1 1.57500 0.738242 green 0
160 2020 2 1.21760 0.672509 green 0
161 2020 3 -0.20475 3.040763 red 1
162 2020 4 -2.10100 3.879057 red 1
163 2020 5 0.35420 5.266582 green 0
164 2020 6 0.57760 1.611520 green 0
165 2020 7 -0.49050 3.277057 red 1
166 2020 8 -1.11040 3.086351 red 1
167 2020 9 -0.31020 4.117689 red 1
168 2020 10 -4.88960 12.424480 red 1
When I try to run the KNN classifier from sklearn, I get an error that says ValueError: Unknown label type: 'unknown'
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Any idea what the error is and what I'm doing wrong? Thanks.
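For reference, a likely culprit (not confirmed in this thread) is the dtype of label_two: creating the column via partial .loc assignment can leave it as an object or float dtype, and scikit-learn then reports the target type as 'unknown'. A minimal sketch of the usual fix, casting the labels to integers before splitting:
# Assumption: label_two holds only 0/1 values with no NaNs.
# Casting to int lets scikit-learn recognise it as a classification target.
year_one['label_two'] = year_one['label_two'].astype(int)
y = year_one.iloc[:, -1]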

Related

Pandas linear regression: use normalisation (StandardScaler) only on non-categorical values

I have the following dataset which I'm reading into a Pandas dataframe:
age gender bmi smoker married region value
39 female 23.0 yes no us 136
28 male 22.0 no no us 143
23 male 34.0 no yes europe 153
17 male 29.0 no no asia 162
Gender, smoker and region are categorical values, so I convert them (using the replace function for gender and smoker, and one-hot encoding for region). The result is the following:
age sex bmi smoker married value r_asia r_europe r_us
39 1 23.0 1 0 136 0 0 1
28 0 22.0 0 0 143 0 0 1
23 0 34.0 0 1 153 0 1 0
17 0 29.0 0 0 162 1 0 0
Then I'm splitting into features and target
y = dataset['value'].values
X = dataset.drop('value',axis=1).values
Next I'm splitting into a training and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
As a next step I want to normalise. Normally I would do:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
However, this also normalises the categorical values. I want to normalise only the non-categorical column (in this example, 'bmi').
How can I only normalise the 'bmi' column and insert these normalised values into X_train and X_test?
train_test_split returns a list of NumPy arrays (X_train is one of them), not DataFrames, so X_train["bmi"] raises an exception; the same applies to what StandardScaler receives.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
print(type(X_train))
# <class 'numpy.ndarray'>
print(X_train)
# [[28 'male' 22.0 'no' 'no' 'us']
#  [39 'female' 23.0 'yes' 'no' 'us']]
So, here is one way to do it:
# Back to Pandas
X_train = pd.DataFrame(X_train)
# Fit and transform the target column (2 == "bmi")
scaler = StandardScaler()
scaler.fit(X_train.loc[:, 2].to_numpy().reshape(-1, 1))
X_train[2] = scaler.transform(X_train.loc[:, 2].to_numpy().reshape(-1, 1))
# Revert to Numpy
X_train = X_train.to_numpy()
print(X_train)
# Output
[[28 'male' -1.0 'no' 'no' 'us']
[39 'female' 1.0 'yes' 'no' 'us']]
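As an alternative sketch (not from the original answer), scikit-learn's ColumnTransformer can scale selected columns directly on the NumPy array, without the round-trip through pandas. Note that it moves the transformed columns to the front of the output:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Scale only column index 2 ("bmi"); pass the other columns through unchanged.
# Caveat: ColumnTransformer places the scaled column first in the output array.
ct = ColumnTransformer([("bmi", StandardScaler(), [2])], remainder="passthrough")
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)  # reuse the statistics fitted on the training set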

TensorFlow linear regression task - very high loss problem

I'm trying to build a linear model on my own:
import numpy as np
import tensorflow as tf

# Create features
X = np.array([-7.0, -4.0, -1.0, 2.0, 5.0, 8.0, 11.0, 14.0])
# Create labels
y = np.array([3.0, 6.0, 9.0, 12.0, 15.0, 18.0, 21.0, 24.0])

model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="elu", input_shape=[1]),
    tf.keras.layers.Dense(1)
])
model.compile(loss="mae",
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              metrics=["mae"])
model.fit(X, y, epochs=150)
When I train with the above X and y data, the loss value starts from a normal value.
experience salary
0 0 2250
1 1 2750
2 5 8000
3 8 9000
4 4 6900
5 15 20000
6 7 8500
7 3 6000
8 2 3500
9 12 15000
10 10 13000
11 14 18000
12 6 7500
13 11 14500
14 12 14900
15 3 5800
16 2 4000
But when I use this dataset instead, the initial loss value starts at around 800 (with the same model as above, by the way).
What could be the reason for this?
Your learning rate is too high. You should opt for a much lower initial learning rate, such as 0.0001 or 0.00001.
Otherwise, you are already using the 'linear' activation on the last layer (the default) and the correct loss function and metric. Also note that the default batch_size, when not explicitly set, is 32.
UPDATE: as determined by the author of the question, underfitting was also fundamental to the problem; adding more layers helped solve it.
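Putting the answer's two suggestions together, a minimal sketch might look like this (the extra hidden layer and the exact learning rate are illustrative choices, not from the original thread; X and y are the experience/salary data loaded as above):
# Sketch of the suggested fix: a lower learning rate plus extra capacity.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="elu", input_shape=[1]),
    tf.keras.layers.Dense(50, activation="elu"),  # extra layer to reduce underfitting
    tf.keras.layers.Dense(1)
])
model.compile(loss="mae",
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              metrics=["mae"])
model.fit(X, y, epochs=150)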

AttributeError: 'MultiIndex' object has no attribute 'labels'

Currently working on a movie recommendation system. I'm trying to set userId and movieId as indexes in order to make them the x and y axes of a sparse matrix.
But getting this error:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-34-45d9a0074103> in <module>
     11 ratings_df.set_index(['userId', 'movieId'], inplace=True)
     12 ratings_matrix = sps.csr_matrix((ratings_df.rating,
---> 13                                  (ratings_df.index.labels[0], ratings_df.index.labels[1])))
     14
     15 print('shape ratings_matrix:', ratings_matrix.todense())

AttributeError: 'MultiIndex' object has no attribute 'labels'
Here is the code snippet:
import scipy.sparse.linalg
from sklearn.metrics import mean_absolute_error
import scipy.sparse as sps
# set userId and movieId as indexes in order to make them as x and y axis of sparse matrix
ratings_df = ratings_df_raw.copy()
print('unique users:', len(set(ratings_df['userId'])))
print('unique movies:', len(set(ratings_df['movieId'])))
ratings_df.set_index(['userId', 'movieId'], inplace=True)
ratings_matrix = sps.csr_matrix((ratings_df.rating,
(ratings_df.index.labels[0], ratings_df.index.labels[1]))).todense()
print('shape ratings_matrix:', ratings_matrix.shape)
Edit: ratings_df_raw
userId movieId rating
0 0 30 2.5
1 6 30 3.0
2 30 30 4.0
3 31 30 4.0
4 35 30 3.0
... ... ... ...
99999 663 7103 2.5
100000 663 7359 3.5
100001 664 115 3.0
100002 664 3712 1.0
100003 667 4629 1.0
[100004 rows x 3 columns]
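For reference (this fix is not part of the original thread): MultiIndex.labels was deprecated in pandas 0.24 in favour of MultiIndex.codes and removed entirely in pandas 1.0, which is the usual cause of this error. Under that assumption, the equivalent call is:
# pandas >= 0.24: MultiIndex.labels was renamed to MultiIndex.codes
ratings_matrix = sps.csr_matrix(
    (ratings_df.rating,
     (ratings_df.index.codes[0], ratings_df.index.codes[1]))
).todense()
print('shape ratings_matrix:', ratings_matrix.shape)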

pandas: return multiple DataFrames from an apply function

EDIT: Based on comments, clarifying the examples further to depict a more realistic use case.
I want to call a function with df.apply. This function returns multiple DataFrames, and I want to join each of these DataFrames into logical groups. I am unable to do that without using a for loop (which defeats the purpose of calling apply).
I have tried calling the function once per row of the dataframe, and it is slower than apply. However, with apply, combining the results slows things down again.
Any tips?
# input data frame
data = {'Name':['Ani','Bob','Cal','Dom'], 'Age': [15,12,13,14], 'Score': [93,98,95,99]}
df_in=pd.DataFrame(data)
print(df_in)
Output>
Name Age Score
0 Ani 15 93
1 Bob 12 98
2 Cal 13 95
3 Dom 14 99
Function to be applied>
def func1(name, age):
    num_rows = np.random.randint(int(age/3))
    age_mul_1 = np.random.randint(low=1, high=age, size=num_rows)
    age_mul_2 = np.random.randint(low=1, high=age, size=num_rows)
    data = {'Name': [name]*num_rows, 'Age_Mul_1': age_mul_1, 'Age_Mul_2': age_mul_2}
    df_func1 = pd.DataFrame(data)
    return df_func1

def func2(name, age, score, other_params):
    num_rows = np.random.randint(int(score/10))
    score_mul_1 = np.random.randint(low=age, high=score, size=num_rows)
    data2 = {'Name': [name]*num_rows, 'score_Mul_1': score_mul_1}
    df_func2 = pd.DataFrame(data2)
    return df_func2

def ret_mul_df(row):
    df_A = func1(row['Name'], row['Age'])
    df_B = func2(row['Name'], row['Age'], row['Score'], 1)
    return df_A, df_B
What I want to do, essentially, is create two dataframes, df_A_combined and df_B_combined.
This is how I am currently combining them:
df_out = df_in.apply(lambda row: ret_mul_df(row), axis=1)
df_A_combined = pd.DataFrame()
df_B_combined = pd.DataFrame()
for ser in df_out:
    df_A_combined = df_A_combined.append(ser[0], ignore_index=True)
    df_B_combined = df_B_combined.append(ser[1], ignore_index=True)
print(df_A_combined)
Name Age_Mul_1 Age_Mul_2
0 Ani 7 8
1 Ani 1 4
2 Ani 1 8
3 Ani 12 6
4 Bob 9 8
5 Cal 8 7
6 Cal 8 1
7 Cal 4 8
print(df_B_combined)
Name score_Mul_1
0 Ani 28
1 Ani 29
2 Ani 50
3 Ani 35
4 Ani 84
5 Ani 24
6 Ani 51
7 Ani 28
8 Bob 32
9 Cal 26
10 Cal 70
11 Dom 56
12 Dom 53
How can I avoid the iteration?
func1 and func2 are calls to third-party libraries (which are very computation-intensive), and several such calls are made. Also, the dataframes df_A_combined and df_B_combined cannot be combined with each other.
Note: this is a much simplified example, and splitting the function would lead to a lot of redundancy.
If this isn't what you want, I'll update if you can post what the two dataframes should look like.
data = {'Name':['Ani','Bob','Cal','Dom'], 'Age': [15,12,13,14], 'Score': [93,98,95,99]}
df_in=pd.DataFrame(data)
print(df_in)
df_A = df_in[['Name','Age']].copy()  # .copy() avoids a SettingWithCopyWarning
df_A['Age_Multiplier'] = df_A['Age'] * 3
print(df_A)
Name Age Age_Multiplier
0 Ani 15 45
1 Bob 12 36
2 Cal 13 39
3 Dom 14 42
df_B = df_in[['Name','Score']].copy()  # .copy() avoids a SettingWithCopyWarning
df_B['Score_Multiplier'] = df_B['Score'] * 2
print(df_B)
Name Score Score_Multiplier
0 Ani 93 186
1 Bob 98 196
2 Cal 95 190
3 Dom 99 198
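As a hedged alternative (not from the original answer), the row-by-row append can be replaced with a single pd.concat per group, which avoids the quadratic cost of repeated appends while still using the original ret_mul_df:
# Collect the (df_A, df_B) pairs returned by apply, then concatenate
# each group once instead of appending inside a loop.
df_out = df_in.apply(ret_mul_df, axis=1)
df_A_combined = pd.concat([pair[0] for pair in df_out], ignore_index=True)
df_B_combined = pd.concat([pair[1] for pair in df_out], ignore_index=True)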

ValueError: Inconsistent number of samples error with MultinomialNB

I need to create a model that classifies records accurately based on a variable. For instance, if a record has predictor A or B, I want it to be classified as having predicted value X. The actual data is in this form:
Predicted Predictor
X A
X B
Y D
X A
For my solution, I did the following:
1. Used LabelEncoder to create numerical values for the Predicted column
2. The predictor variable has multiple categories, which I parsed into individual columns using get_dummies.
Here is a sub-section of the dataframe with the encoded Predicted column and a couple of the dummy predictor columns (pardon the misalignment):
Predicted Predictor_A Predictor_B
9056 30 0 0
2482 74 1 0
3407 56 1 0
12882 15 0 0
7988 30 0 0
13032 12 0 0
9738 28 0 0
6739 40 0 0
373 131 0 0
3030 62 0 0
8964 30 0 0
691 125 0 0
6214 41 0 0
6438 41 1 0
5060 42 0 0
3703 49 0 0
12461 16 0 0
2235 75 0 0
5107 42 0 0
4464 46 0 0
7075 39 1 0
11891 16 0 0
9190 30 0 0
8312 30 0 0
10328 24 0 0
1602 97 0 0
8804 30 0 0
8286 30 0 0
6821 40 0 0
3953 46 1
After reshaping the data into the dataframe shown above, I try using MultinomialNB from sklearn. When doing so, the error I run into is:
ValueError: Found input variables with inconsistent numbers of samples: [1, 8158]
I run into the error even when trying it with a dataframe that has only 2 columns: Predicted and Predictor_A.
My questions are:
What do I need to do to resolve the error?
Is my approach correct?
To fit the MultinomialNB model, you need the training samples with their features and their corresponding labels (target values).
In your case, Predicted is the target variable, and Predictor_A and Predictor_B are the features (predictors).
Example 1:
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("dt.csv", delim_whitespace=True)
# X is the features
X = df[['Predictor_A','Predictor_B']]
#y is the labels or targets or classes
y = df['Predicted']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf = MultinomialNB()
clf.fit(X_train, y_train)
clf.predict(X_test)
#array([30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30])
#this result makes sense if you look at X_test. all the samples are similar
print(X_test)
Predictor_A Predictor_B
8286 0 0
12461 0 0
6214 0 0
9190 0 0
373 0 0
3030 0 0
11891 0 0
9056 0 0
8804 0 0
6438 1 0
#get the probabilities
clf.predict_proba(X_test)
EDIT
If you train the model using documents that have, let's say, 4 tags (predictors), then any new document you want to predict on must also have the same number of tags.
Example 2:
clf.fit(X, y)
here, X is a [29, 2] array, so we have 29 training samples (documents), each with 2 tags (predictors)
clf.predict(X_new)
here, X_new could be [n, 2]: we can predict the classes of n new documents, but these new documents must also have exactly 2 tags (predictors).
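To make that concrete, here is a small hedged sketch (df_new and the column alignment via reindex are illustrative, not from the original answer): when one-hot encoding new records, align their columns to the training columns so the feature count matches.
# Hypothetical new records to classify; 'df_new' is illustrative.
df_new = pd.DataFrame({'Predictor': ['A', 'D']})

# One-hot encode, then align to the training feature columns:
# missing dummy columns are added as 0, extra ones are dropped.
X_new = pd.get_dummies(df_new['Predictor'], prefix='Predictor')
X_new = X_new.reindex(columns=X.columns, fill_value=0)

clf.predict(X_new)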