low accuracies with ML modules - numpy

I'm working with breast cancer dataset with 2 classes 0-1 and the training and accuracy was great, but I have changed the number of classes to 8 classes 0-7 and I'm getting low accuracy wit Ml algorithms but meanwhile the accuracy with ANN 97% maybe I made a mistake but I don't know where
y_pred :
[5 0 3 0 3 6 1 0 2 1 7 6 7 3 0 3 6 3 7 0 7 1 5 2 5 0 3 6 5 5 7 2 0 6 6 6 3
6 5 0 0 6 6 5 3 0 5 1 6 4 0 7 6 0 5 5 5 0 0 5 7 1 6 6 7 6 0 1 7 5 6 0 6 0
3 3 6 7 7 1 0 7 0 5 5 0 6 0 0 6 1 6 5 0 0 7 0 1 6 1 0 6 0 7 0 6 0 5 0 6 3
6 7 0 6 6 0 0 0 5 7 4 6 6 2 3 5 6 0 7 7 0 5 6 0 0 0 6 1 5 0 7 4 6 0 7 3 6
5 6 6 0 2 0 1 0 7 0 1 7 0 7 7 6 6 6 7 6 6 0 6 5 1 1 7 6 6 7 0 7 0 1 6 0]
y_test:
[1 0 1 6 4 6 1 0 1 3 0 2 6 3 0 1 0 7 0 0 6 6 5 6 2 6 3 6 5 6 7 6 5 7 0 2 3
6 5 0 7 2 6 4 0 0 2 6 3 7 7 1 3 6 5 0 2 7 0 7 6 0 1 7 6 6 0 4 7 0 0 0 6 0
3 5 0 0 7 6 0 0 7 0 6 7 7 2 7 1 1 5 5 3 7 4 7 2 2 4 0 0 0 7 0 2 0 6 0 6 1
7 6 0 6 0 0 1 0 6 6 7 6 6 7 0 6 1 0 0 7 0 5 7 0 0 7 7 6 5 0 0 1 6 0 7 6 6
5 2 6 0 2 0 6 0 5 0 2 7 0 7 7 6 7 6 6 6 0 6 6 0 1 1 7 6 2 7 6 0 0 6 5 0]
I have replaced multilabel_confusion_matrix with confusion_matrix but still I'm getting the same results the accuracy between 40% to 50%.
and I'm getting results with : cv_results.mean() *100
K-Nearest Neighbours: 39.62 %
Support Vector Machine: 48.09 %
Naive Bayes: 30.46 %
Decision Tree: 30.46 %
Randoom Forest: 52.32 %
Logistic Regression: 44.26 %
here is Ml part :
# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred = np.argmax(y_pred, axis=1)
cm = multilabel_confusion_matrix(y_test, y_pred)
models = []
models.append(('K-Nearest Neighbours', KNeighborsClassifier(n_neighbors = 5)))
models.append(('Support Vector Machine', SVC()))
models.append(('Naive Bayes', GaussianNB()))
models.append(('Decision Tree', DecisionTreeClassifier()))
models.append(('Randoom Forest', RandomForestClassifier(n_estimators=100)))
models.append(('Logistic Regression', LogisticRegression()))
results = []
names = []
for name, model in models:
kfold = model_selection.KFold(n_splits=10, random_state = 8)
cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')

Related

pandas dataframe enforce monotically per row

I have a dataframe:
df = 0 1 2 3 4
1 1 3 2 5
4 1 5 7 8
7 1 2 3 9
I want to enforce monotonically per row, to get:
df = 0 1 2 3 4
1 1 3 3 5
4 4 5 7 8
7 7 7 7 9
What is the best way to do so?
Try cummax
out = df.cummax(1)
Out[80]:
0 1 2 3 4
0 1 1 3 3 5
1 4 4 5 7 8
2 7 7 7 7 9

Dataframe within a Dataframe - to create new column_

For the following dataframe:
import pandas as pd
df=pd.DataFrame({'list_A':[3,3,3,3,3,\
2,2,2,2,2,2,2,4,4,4,4,4,4,4,4,4,4,4,4]})
How can 'list_A' be manipulated to give 'list_B'?
Desired output:
list_A
list_B
0
3
1
1
3
1
2
3
1
3
3
0
4
2
1
5
2
1
6
2
0
7
2
0
8
4
1
9
4
1
10
4
1
11
4
1
12
4
0
13
4
0
14
4
0
15
4
0
16
4
0
As you can see, if List_A has the number 3 - then the first 3 values of List_B are '1' and then the value of List_B changes to '0', until List_A changes value again.
GroupBy.cumcount
df['list_B'] = df['list_A'].gt(df.groupby('list_A').cumcount()).astype(int)
print(df)
Output
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 3 0
5 2 1
6 2 1
7 2 0
8 2 0
9 2 0
10 2 0
11 2 0
12 4 1
13 4 1
14 4 1
15 4 1
16 4 0
17 4 0
18 4 0
19 4 0
20 4 0
21 4 0
22 4 0
23 4 0
EDIT
blocks = df['list_A'].ne(df['list_A'].shift()).cumsum()
df['list_B'] = df['list_A'].gt(df.groupby(blocks).cumcount()).astype(int)

Backfill and Increment by one?

I have a column of a DataFrame that consists of 0's and NaN's:
Timestamp A B C
1 3 3 NaN
2 5 2 NaN
3 9 1 NaN
4 2 6 NaN
5 3 3 0
6 5 2 NaN
7 3 1 NaN
8 2 8 NaN
9 1 6 0
And I want to backfill it and increment the last value:
Timestamp A B C
1 3 3 4
2 5 2 3
3 9 1 2
4 2 6 1
5 3 3 0
6 5 2 3
7 3 1 2
8 2 8 1
9 1 6 0
YOu can use iloc[::-1] to reverse the data, and groupby().cumcount() to create the row counter:
s = df['C'].iloc[::-1].notnull()
df['C'] = df['C'].bfill() + s.groupby(s.cumsum()).cumcount()
Output
Timestamp A B C
0 1 3 3 4.0
1 2 5 2 3.0
2 3 9 1 2.0
3 4 2 6 1.0
4 5 3 3 0.0
5 6 5 2 3.0
6 7 3 1 2.0
7 8 2 8 1.0
8 9 1 6 0.0

How to find the average of multiple columns using a common column in pandas

How to calculate the mean value of all the columns with 'count' column.I have created a dataframe with random generated values in the below code.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,10)*100/10).astype(int)
df
output:
A B C D E F G H I J
0 4 3 2 8 5 0 9 9 0 5
1 1 5 8 0 5 9 8 3 9 1
2 9 5 1 1 3 2 6 3 8 3
3 4 0 8 1 7 3 4 2 8 8
4 9 4 8 2 7 9 7 8 9 7
5 1 0 7 3 8 6 1 7 2 0
6 3 6 8 9 6 6 5 0 8 4
7 8 9 9 5 3 9 0 7 5 5
8 5 5 8 7 8 4 3 0 9 9
9 2 4 2 3 0 5 2 0 3 0
I found mean value for a single column like this.How to find the mean for multiple columns with respect to count in pandas.
df['count'] = 1
print(df)
df.groupby('count').agg({'A':'mean'})
A B C D E F G H I J count
0 4 3 2 8 5 0 9 9 0 5 1
1 1 5 8 0 5 9 8 3 9 1 1
2 9 5 1 1 3 2 6 3 8 3 1
3 4 0 8 1 7 3 4 2 8 8 1
4 9 4 8 2 7 9 7 8 9 7 1
5 1 0 7 3 8 6 1 7 2 0 1
6 3 6 8 9 6 6 5 0 8 4 1
7 8 9 9 5 3 9 0 7 5 5 1
8 5 5 8 7 8 4 3 0 9 9 1
9 2 4 2 3 0 5 2 0 3 0 1
A
count
1 4.6
If need mean of all columns per groups by column count use:
df.groupby('count').mean()
If need mean by all rows (like grouping if same values in count) use:
df.mean().to_frame().T

replace some entries in a column of dataframe by a column of another dataframe

I have a dataframe about user-product-rating as below,
df1 =
USER_ID PRODUCT_ID RATING
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
another dataframe is the true ratings of some users and some products as below,
df2 =
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
I want to use the true ratings from df2 to replace the corresponding ratings in df1. So what I want to obtain is
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
Any operation to realize this?
rng = [i for i in range(0,10)]
df1 = pd.DataFrame({"USER_ID": rng,
"PRODUCT_ID": rng,
"RATING": rng})
rng_2 = [i for i in range(0,4)]
df2 = pd.DataFrame({'USER_ID' : rng_2,'PRODUCT_ID' : rng_2,
'RATING' : [10,10,10,10]})
Try to use update:
df1 = df1.set_index(['USER_ID', 'PRODUCT_ID'])
df2 = df2.set_index(['USER_ID', 'PRODUCT_ID'])
df1.update(df2)
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
print(df2)
USER_ID PRODUCT_ID RATING
0 0 0 10.0
1 1 1 10.0
2 2 2 10.0
3 3 3 10.0
4 4 4 4.0
5 5 5 5.0
6 6 6 6.0
7 7 7 7.0
8 8 8 8.0
9 9 9 9.0
You can use combine first:
df2.astype(object).combine_first(df1)
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9