Dataframe within a Dataframe - to create a new column - pandas

For the following dataframe:
import pandas as pd
df=pd.DataFrame({'list_A':[3,3,3,3,3,\
2,2,2,2,2,2,2,4,4,4,4,4,4,4,4,4,4,4,4]})
How can 'list_A' be manipulated to give 'list_B'?
Desired output:
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 2 1
5 2 1
6 2 0
7 2 0
8 4 1
9 4 1
10 4 1
11 4 1
12 4 0
13 4 0
14 4 0
15 4 0
16 4 0
As you can see, if list_A holds the number 3, then the first 3 values of list_B are 1 and the following values are 0, until list_A changes value again.

Use GroupBy.cumcount:
df['list_B'] = df['list_A'].gt(df.groupby('list_A').cumcount()).astype(int)
print(df)
Output
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 3 0
5 2 1
6 2 1
7 2 0
8 2 0
9 2 0
10 2 0
11 2 0
12 4 1
13 4 1
14 4 1
15 4 1
16 4 0
17 4 0
18 4 0
19 4 0
20 4 0
21 4 0
22 4 0
23 4 0
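EDIT: to restart the count for every run of consecutive equal values (rather than once per distinct value), group by blocks instead: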
blocks = df['list_A'].ne(df['list_A'].shift()).cumsum()
df['list_B'] = df['list_A'].gt(df.groupby(blocks).cumcount()).astype(int)
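The two versions differ only in what cumcount counts within. A minimal sketch, assuming a small hypothetical frame (not the one from the question) in which the value 3 appears in two separate runs:

```python
import pandas as pd

# Hypothetical data: the value 3 occurs in two separate runs.
df = pd.DataFrame({'list_A': [3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3]})

# Grouping by value: cumcount keeps counting across both runs of 3,
# so the second run of 3 gets only zeros.
by_value = df['list_A'].gt(df.groupby('list_A').cumcount()).astype(int)

# Grouping by run: a new block id starts whenever the value changes,
# so the count restarts at 0 for the second run of 3.
blocks = df['list_A'].ne(df['list_A'].shift()).cumsum()
by_block = df['list_A'].gt(df.groupby(blocks).cumcount()).astype(int)

df['by_value'] = by_value
df['by_block'] = by_block
print(df)
```

If every value in list_A occurs in a single run, as in the question's data, both versions should give the same column.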

Related

pandas dataframe enforce monotonically per row

I have a dataframe:
df = 0 1 2 3 4
1 1 3 2 5
4 1 5 7 8
7 1 2 3 9
I want to enforce monotonicity per row (each row non-decreasing from left to right), to get:
df = 0 1 2 3 4
1 1 3 3 5
4 4 5 7 8
7 7 7 7 9
What is the best way to do so?
Try cummax along axis 1:
out = df.cummax(1)
Out[80]:
0 1 2 3 4
0 1 1 3 3 5
1 4 4 5 7 8
2 7 7 7 7 9
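A self-contained sketch of the same call, assuming the frame from the question is rebuilt by hand with default integer column labels; the non-decreasing check at the end is an illustrative addition:

```python
import pandas as pd

# The frame from the question, rebuilt by hand.
df = pd.DataFrame([[1, 1, 3, 2, 5],
                   [4, 1, 5, 7, 8],
                   [7, 1, 2, 3, 9]])

# axis=1 takes the running maximum left to right within each row.
out = df.cummax(axis=1)
print(out)

# Check: no value is smaller than the one to its left.
is_monotonic = bool((out.diff(axis=1).iloc[:, 1:] >= 0).all().all())
print(is_monotonic)
```

Note that cummax raises smaller values up to the running maximum; it does not reorder the row.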

Reset 'Id' value of appended Dataframe

I have appended multiple dataframes to form a single dataframe. Each dataframe had multiple rows assigned to a specific ID. After appending, the big dataframe has multiple rows with the same ID. I would like to assign new IDs.
Current Dataframe:
Index name groupid
0 Abc 0
1 cvb 0
2 sdf 0
3 ksh 1
4 kjl 1
5 lmj 2
6 hyb 2
0 khf 0
1 uyt 0
2 tre 1
3 awe 1
4 uys 2
5 asq 2
6 lsx 2
Desired Output:
Index name groupid new_id
0 Abc 0 0
1 cvb 0 0
2 sdf 0 0
3 ksh 1 1
4 kjl 1 1
5 lmj 2 2
6 hyb 2 2
7 khf 0 3
8 uyt 0 3
9 tre 1 4
10 awe 1 4
11 uys 2 5
12 asq 2 5
13 lsx 2 5
You would have to use a slightly modified version of groupby, grouping on runs of consecutive equal groupid values:
df['new_id'] = (df.groupby(df['groupid'].ne(df['groupid'].shift()).cumsum(), sort=False)
                  .ngroup())
Output is:
Index name groupid new_id
0 0 Abc 0 0
1 1 cvb 0 0
2 2 sdf 0 0
3 3 ksh 1 1
4 4 kjl 1 1
5 5 lmj 2 2
6 6 hyb 2 2
7 0 khf 0 3
8 1 uyt 0 3
9 2 tre 1 4
10 3 awe 1 4
11 4 uys 2 5
12 5 asq 2 5
13 6 lsx 2 5
See previous answer for reference.
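The grouping key used above is already a running block counter, so the steps can be inspected one at a time. A minimal sketch, assuming the question's data is rebuilt by hand; the names changed and via_cumsum are illustrative, and subtracting 1 from the counter is an alternative to ngroup, not the answer's method:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Abc', 'cvb', 'sdf', 'ksh', 'kjl', 'lmj', 'hyb',
             'khf', 'uyt', 'tre', 'awe', 'uys', 'asq', 'lsx'],
    'groupid': [0, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2, 2],
})

# True on every row whose groupid differs from the previous row
# (the first row compares against NaN, so it is True as well).
changed = df['groupid'].ne(df['groupid'].shift())

# Running count of changes: 1, 1, 1, 2, 2, 3, ...
block = changed.cumsum()

# ngroup numbers the blocks 0, 1, 2, ... in order of appearance.
df['new_id'] = df.groupby(block, sort=False).ngroup()

# Alternative (assumption: ids only need to start at 0): shift the counter.
via_cumsum = block - 1
print(df)
```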

Using If-else to change values in Pandas

I have a pandas DataFrame consisting of three columns: ID, t, and ind1.
import pandas as pd
dat = {'ID': [1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,6,6,6],
't': [0,1,2,3,0,1,2,0,1,2,3,0,1,2,0,1,0,1,2],
'ind1' : [1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0]
}
df = pd.DataFrame(dat, columns = ['ID', 't', 'ind1'])
print (df)
What I need to do is create a new column (res) such that:
for every ID with ind1==0, res is zero;
for every ID with ind1==1, res = 1 where t==max(t) (grouped by ID), otherwise zero.
Here's the anticipated output
Use groupby with idxmax, then where with transform('all'):
df['res']=df.groupby('ID').t.transform('idxmax').where(df.groupby('ID').ind1.transform('all')).eq(df.index).astype(int)
df
Out[160]:
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
This works on the knowledge that the ID column is sorted:
import numpy as np
cond1 = df.ind1.eq(0)
cond2 = df.ind1.eq(1) & (df.t.eq(df.groupby("ID").t.transform("max")))
df["res"] = np.select([cond1, cond2], [0, 1], 0)
df
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
Use groupby.apply:
df['res'] = (df.groupby('ID').apply(lambda x: x['ind1'].eq(1)&x['t'].eq(x['t'].max()))
.astype(int).reset_index(drop=True))
print(df)
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
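The rule in the question can also be written as a single boolean mask. A minimal sketch, reusing the question's data; the name group_max is illustrative, and the mask is aligned on the index, so it should not depend on the rows being sorted by ID:

```python
import pandas as pd

dat = {'ID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6],
       't': [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 1, 2],
       'ind1': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]}
df = pd.DataFrame(dat, columns=['ID', 't', 'ind1'])

# Per-row copy of the largest t within the row's ID group.
group_max = df.groupby('ID')['t'].transform('max')

# res is 1 only where ind1 == 1 and t equals its group's maximum.
df['res'] = (df['ind1'].eq(1) & df['t'].eq(group_max)).astype(int)
print(df)
```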

Low accuracies with ML models

I'm working with a breast cancer dataset with 2 classes (0-1), and the training and accuracy were great. After changing the number of classes to 8 (0-7), I'm getting low accuracy with the ML algorithms, while the ANN reaches 97% accuracy. Maybe I made a mistake, but I don't know where.
y_pred :
[5 0 3 0 3 6 1 0 2 1 7 6 7 3 0 3 6 3 7 0 7 1 5 2 5 0 3 6 5 5 7 2 0 6 6 6 3
6 5 0 0 6 6 5 3 0 5 1 6 4 0 7 6 0 5 5 5 0 0 5 7 1 6 6 7 6 0 1 7 5 6 0 6 0
3 3 6 7 7 1 0 7 0 5 5 0 6 0 0 6 1 6 5 0 0 7 0 1 6 1 0 6 0 7 0 6 0 5 0 6 3
6 7 0 6 6 0 0 0 5 7 4 6 6 2 3 5 6 0 7 7 0 5 6 0 0 0 6 1 5 0 7 4 6 0 7 3 6
5 6 6 0 2 0 1 0 7 0 1 7 0 7 7 6 6 6 7 6 6 0 6 5 1 1 7 6 6 7 0 7 0 1 6 0]
y_test:
[1 0 1 6 4 6 1 0 1 3 0 2 6 3 0 1 0 7 0 0 6 6 5 6 2 6 3 6 5 6 7 6 5 7 0 2 3
6 5 0 7 2 6 4 0 0 2 6 3 7 7 1 3 6 5 0 2 7 0 7 6 0 1 7 6 6 0 4 7 0 0 0 6 0
3 5 0 0 7 6 0 0 7 0 6 7 7 2 7 1 1 5 5 3 7 4 7 2 2 4 0 0 0 7 0 2 0 6 0 6 1
7 6 0 6 0 0 1 0 6 6 7 6 6 7 0 6 1 0 0 7 0 5 7 0 0 7 7 6 5 0 0 1 6 0 7 6 6
5 2 6 0 2 0 6 0 5 0 2 7 0 7 7 6 7 6 6 6 0 6 6 0 1 1 7 6 2 7 6 0 0 6 5 0]
I have replaced multilabel_confusion_matrix with confusion_matrix, but I'm still getting the same results: accuracy between 40% and 50%.
These are the results from cv_results.mean() * 100:
K-Nearest Neighbours: 39.62 %
Support Vector Machine: 48.09 %
Naive Bayes: 30.46 %
Decision Tree: 30.46 %
Randoom Forest: 52.32 %
Logistic Regression: 44.26 %
Here is the ML part:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred = np.argmax(y_pred, axis=1)
cm = multilabel_confusion_matrix(y_test, y_pred)
models = []
models.append(('K-Nearest Neighbours', KNeighborsClassifier(n_neighbors = 5)))
models.append(('Support Vector Machine', SVC()))
models.append(('Naive Bayes', GaussianNB()))
models.append(('Decision Tree', DecisionTreeClassifier()))
models.append(('Randoom Forest', RandomForestClassifier(n_estimators=100)))
models.append(('Logistic Regression', LogisticRegression()))
results = []
names = []
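for name, model in models:
    # shuffle=True is required for random_state to take effect (newer scikit-learn raises an error otherwise)
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=8)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')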

pandas aggregate based on consecutive same rows

Suppose I have this data frame and I want to sum the values in column 'a' over each run of consecutive rows that share the same label.
a label
0 1 0
1 3 0
2 5 0
3 2 1
4 2 1
5 2 1
6 3 0
7 3 0
8 4 1
The desired result will be:
a label
0 9 0
1 6 1
2 6 0
3 4 1
and not this:
a label
0 15 0
1 10 1
IIUC
s=df.groupby(df.label.diff().ne(0).cumsum()).agg({'a':'sum','label':'first'})
s
Out[280]:
a label
label
1 9 0
2 6 1
3 6 0
4 4 1
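A self-contained sketch of the same idea, assuming the question's data is rebuilt by hand. It uses ne(shift()) instead of diff().ne(0) to mark where a run starts, which should behave the same for numeric labels; the name run_id is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 2, 2, 2, 3, 3, 4],
                   'label': [0, 0, 0, 1, 1, 1, 0, 0, 1]})

# A new run starts wherever label differs from the previous row;
# the cumulative sum turns those starts into run ids 1, 2, 3, ...
run_id = df['label'].ne(df['label'].shift()).cumsum().rename('run')

# Sum 'a' within each run and keep the run's label.
out = (df.groupby(run_id)
         .agg(a=('a', 'sum'), label=('label', 'first'))
         .reset_index(drop=True))
print(out)
```

Dropping the run index at the end gives the 0-based row numbers shown in the desired result.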