Find non-duplicated rows of a DataFrame for the same index with pandas? - pandas

Example:
   param_a  param_b  param_c
1  0  0  0
1  0  2  1
3  2  1  1
4  0  2  1
3  2  1  1
4  0  0  0
4  0  0  0
For the duplicated indices (1, 3, 4), I want to keep only those whose rows are not all identical. Take indices 1 and 4, for example: each of them has two different rows.
Output:
param_a param_b param_c
1 0 0 0
1 0 2 1
4 0 2 1
4 0 0 0
Note that it returns the unique rows for each duplicated index.
I referred to this post but could not get the answer.

IIUC, using tuple: after reset_index, take all values in each row as the group key, then filter the df with transform('nunique'), and finally drop_duplicates:
s = df.reset_index()
# each row (original index included) becomes a tuple; grouped by the index,
# keep only the groups whose rows are not all identical
mask = s.apply(tuple, axis=1).groupby(s['index']).transform('nunique') > 1
yourdf = s[mask].drop_duplicates().set_index('index')
yourdf
Out[207]:
param_a param_b param_c
index
1 0 0 0
1 0 2 1
4 0 2 1
4 0 0 0

First convert the index to a column and remove duplicates with DataFrame.drop_duplicates, then mark all duplicates in the index column with Series.duplicated(keep=False) and filter by boolean indexing:
df = df.reset_index().drop_duplicates()
print (df)
index param_a param_b param_c
0 1 0 0 0
1 1 0 2 1
2 3 2 1 1
3 4 0 2 1
6 4 0 0 0
print (df['index'].duplicated(keep=False))
0 True
1 True
2 False
3 True
6 True
Name: index, dtype: bool
df1 = df[df['index'].duplicated(keep=False)].set_index('index').rename_axis(None)
print (df1)
param_a param_b param_c
1 0 0 0
1 0 2 1
4 0 2 1
4 0 0 0

I tried this way with duplicated (there is also the keep parameter to keep the duplicates or not):
import numpy as np

df = df.reset_index()
mask = pd.DataFrame(np.sort(df[list(df)], axis=1), index=df.index).duplicated()
df1 = df[~mask]
df1 = df1.set_index('index')
The original df:
   param_a  param_b  param_c
1  0  0  0
1  0  2  1
3  2  1  1
4  0  2  1
3  2  1  1
4  0  0  0
4  0  0  0
After reset_index:
  index  param_a  param_b  param_c
0 1 0 0 0
1 1 0 2 1
2 3 2 1 1
3 4 0 2 1
4 3 2 1 1
5 4 0 0 0
6 4 0 0 0
And the resulting df1:
      param_a  param_b  param_c
index
1 0 0 0
1 0 2 1
3 2 1 1
4 0 2 1
4 0 0 0
If you try to keep the duplicates instead:
mask = pd.DataFrame(np.sort(df[list(df)], axis=1), index=df.index).duplicated(keep=False)
you will end up with this result:
      param_a  param_b  param_c
index
1 0 0 0
1 0 2 1
4 0 2 1
which is again close, but it does not take the duplicated row
4 0 0 0
into account, even though that row should be kept, because index 4 also has another, different row. So this gets close with a straightforward approach, but not quite there.
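A compact variant of the second answer, as a sketch not from the original post: deduplicate on the (index, row) combination through a numpy boolean mask, so the original index is preserved, then keep only the indices that still repeat:
import pandas as pd

df = pd.DataFrame({'param_a': [0, 0, 2, 0, 2, 0, 0],
                   'param_b': [0, 2, 1, 2, 1, 0, 0],
                   'param_c': [0, 1, 1, 1, 1, 0, 0]},
                  index=[1, 1, 3, 4, 3, 4, 4])

# True for the first occurrence of each (index, row) combination
keep = ~df.reset_index().duplicated().to_numpy()
u = df[keep]                             # unique rows per index
res = u[u.index.duplicated(keep=False)]  # keep indices that still repeat
print(res)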

Related

Adding missing rows with condition

I have data that looks like this:
X snp_id is_severe encoding_1 encoding_2 encoding_0
1 0 GL000191.1-37698 0 0 1 7
2 1 GL000191.1-37698 1 0 2 11
3 2 GL000191.1-37922 1 1 0 12
What I wish to do is: for every snp_id, if it has only is_severe == 0 or only is_severe == 1, add an extra row with the missing is_severe value and set the other columns to zero.
Example:
GL000191.1-37698 is OK because it has both is_severe values 0 and 1, but GL000191.1-37922 has only 1, so I would like to add:
X snp_id is_severe encoding_1 encoding_2 encoding_0
1 0 GL000191.1-37698 0 0 1 7
2 1 GL000191.1-37698 1 0 2 11
3 2 GL000191.1-37922 1 1 0 12
4 3 GL000191.1-37922 0 0 0 0
And if the data looked like this:
X snp_id is_severe encoding_1 encoding_2 encoding_0
1 0 GL000191.1-37698 0 0 1 7
2 1 GL000191.1-37698 1 0 2 11
3 2 GL000191.1-37922 0 1 0 12
the result would be:
X snp_id is_severe encoding_1 encoding_2 encoding_0
1 0 GL000191.1-37698 0 0 1 7
2 1 GL000191.1-37698 1 0 2 11
3 2 GL000191.1-37922 0 1 0 12
4 3 GL000191.1-37922 1 0 0 0
I read about indexing in some of the questions asked, but the problem is that I'm supposed to do it for the snp_id column, which is a string and not an integer.
I also thought about pivoting and then filling the NaNs created with values, but it didn't work well:
count_pivote=count.pivot(index='snp_id', columns=["is_severe","encoding_1","encoding_2"], values=["encoding_1","encoding_2"])
Is there any way to do this?
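No answer is recorded here, but one possible approach, as a minimal sketch (the frame below is a hypothetical reconstruction of the asker's data, with the running X column omitted): build the full (snp_id, is_severe) grid with MultiIndex.from_product, then reindex with fill_value=0 so the missing combinations appear as all-zero rows.
import pandas as pd

# hypothetical reconstruction of the asker's frame
count = pd.DataFrame({
    'snp_id': ['GL000191.1-37698', 'GL000191.1-37698', 'GL000191.1-37922'],
    'is_severe': [0, 1, 1],
    'encoding_1': [0, 0, 1],
    'encoding_2': [1, 2, 0],
    'encoding_0': [7, 11, 12],
})

# full grid of every snp_id paired with both is_severe values
full = pd.MultiIndex.from_product([count['snp_id'].unique(), [0, 1]],
                                  names=['snp_id', 'is_severe'])

# reindex inserts the missing combinations; fill_value=0 zeroes the encodings
out = (count.set_index(['snp_id', 'is_severe'])
            .reindex(full, fill_value=0)
            .reset_index())
print(out)
This assumes each (snp_id, is_severe) pair occurs at most once, as in the examples above.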

Pandas iloc and conditional sum

This is my dataframe:
0 1 0 1 1
1 0 1 0 1
I generate the sum for each column as below:
data.iloc[:,1:] = data.iloc[:,1:].sum(axis=0)
The result is:
0 1 1 1 2
1 1 1 1 2
But I only want to update values that are not zero:
0 1 0 1 2
1 0 1 0 2
As it is a large dataframe and I don't know which columns will contain zeros, I am having trouble getting the condition to work together with iloc.
Assuming the following input:
0 1 2 3 4
0 0 1 0 1 1
1 1 0 1 0 1
you can use the underlying numpy array and numpy.where:
import numpy as np
a = data.values[:, 1:]
data.iloc[:,1:] = np.where(a!=0, a.sum(0), a)
output:
0 1 2 3 4
0 0 1 0 1 2
1 1 0 1 0 2
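A pure-pandas alternative, as a sketch not taken from the original answer: DataFrame.mask with axis=1 aligns the Series of column sums against the column labels, so zero entries stay untouched.
import pandas as pd

data = pd.DataFrame([[0, 1, 0, 1, 1],
                     [1, 0, 1, 0, 1]])

sub = data.iloc[:, 1:]
# replace nonzero entries with their column sums; axis=1 aligns the
# sums (a Series indexed by column label) across the columns
data.iloc[:, 1:] = sub.mask(sub != 0, sub.sum(axis=0), axis=1)
print(data)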

Using If-else to change values in Pandas

I have a pandas DataFrame consisting of three columns: ID, t, and ind1.
import pandas as pd

dat = {'ID':   [1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,6,6,6],
       't':    [0,1,2,3,0,1,2,0,1,2,3,0,1,2,0,1,0,1,2],
       'ind1': [1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0]}
df = pd.DataFrame(dat, columns=['ID', 't', 'ind1'])
print(df)
What I need to do is create a new column (res) such that:
for all IDs with ind1 == 0, res is zero;
for all IDs with ind1 == 1, res = 1 if t == max(t) within that ID, otherwise zero.
The anticipated output is the res column shown in the answers below.
Check with groupby and idxmax, then where with transform('all'):
df['res'] = (df.groupby('ID').t.transform('idxmax')
               .where(df.groupby('ID').ind1.transform('all'))
               .eq(df.index)
               .astype(int))
df
Out[160]:
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
This relies on the ID column being sorted:
import numpy as np

cond1 = df.ind1.eq(0)
cond2 = df.ind1.eq(1) & df.t.eq(df.groupby("ID").t.transform("max"))
df["res"] = np.select([cond1, cond2], [0, 1], 0)
df
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
Use groupby.apply:
df['res'] = (df.groupby('ID')
               .apply(lambda x: x['ind1'].eq(1) & x['t'].eq(x['t'].max()))
               .astype(int)
               .reset_index(drop=True))
print(df)
print(df)
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
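A side note not in the original answer: reset_index(drop=True) relies on groupby.apply returning rows in the original order, which holds here because ID is sorted. A variant that aligns on the original index instead, as a sketch:
res = df.groupby('ID').apply(lambda x: x['ind1'].eq(1) & x['t'].eq(x['t'].max()))
# drop only the ID level so the result aligns with df's original index
df['res'] = res.reset_index(level=0, drop=True).astype(int)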

getting dummy values across all columns

The get_dummies method does not seem to work as expected when used with more than one column.
For e.g. if I have this dataframe...
shopping_list = [
    ["Apple", "Bread", "Fridge"],
    ["Rice", "Bread", "Milk"],
    ["Apple", "Rice", "Bread", "Milk"],
    ["Rice", "Milk"],
    ["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
If I use the get_dummies method, the items are repeated across columns like this:
pd.get_dummies(df)
0_Apple 0_Rice 1_Bread 1_Milk 1_Rice 2_Bread 2_Fridge 2_Milk 3_Milk
0 1 0 1 0 0 0 1 0 0
1 0 1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0 0 1
3 0 1 0 1 0 0 0 0 0
4 1 0 1 0 0 0 0 1 0
While the expected result is:
Apple Bread Fridge Milk Rice
0 1 1 1 0 0
1 0 1 0 1 1
2 1 1 0 1 1
3 0 0 0 1 1
4 1 1 0 1 0
Add the parameters prefix and prefix_sep to get_dummies, then use max to avoid duplicated column names (it aggregates by max):
df = pd.get_dummies(df, prefix='', prefix_sep='').max(axis=1, level=0)
print(df)
Apple Rice Bread Milk Fridge
0 1 0 1 0 1
1 0 1 1 1 0
2 1 1 1 1 0
3 0 1 0 1 0
4 1 0 1 1 0
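On newer pandas versions DataFrame.max no longer accepts the level argument, so the same aggregation can be expressed by grouping the transposed frame instead; a sketch, assuming pandas 2.x:
# duplicate column labels become the index after .T, so groupby(level=0) merges them
df = pd.get_dummies(df, prefix='', prefix_sep='').T.groupby(level=0).max().T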

Pandas: The best way to create new Frame by specific criteria

I have a DataFrame:
df = pd.DataFrame({'id':  [1,1,1,1,2,2,2,3,3,3,4,4],
                   'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})
id sex
0 1 0
1 1 0
2 1 0
3 1 1
4 2 0
5 2 0
6 2 0
7 3 1
8 3 1
9 3 0
10 4 1
11 4 1
I want to get a new DataFrame containing only the ids that have both sex values.
So I want to get something like this:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
4 3 1
5 3 1
6 3 0
Using groupby and filter with the required condition:
In [2952]: df.groupby('id').filter(lambda x: set(x.sex) == set([0,1]))
Out[2952]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Also,
In [2953]: df.groupby('id').filter(lambda x: all([any(x.sex == v) for v in [0,1]]))
Out[2953]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
First drop duplicates across both columns with drop_duplicates, then count the remaining occurrences per id with value_counts.
Then filter all values by boolean indexing with isin:
s = df.drop_duplicates()['id'].value_counts()
print (s)
3 2
1 2
4 1
2 1
Name: id, dtype: int64
df = df[df['id'].isin(s.index[s == 2])]
print (df)
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
One more:)
df.groupby('id').filter(lambda x: x['sex'].nunique()>1)
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
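As a side note beyond the posted answers, groupby.transform avoids calling a Python lambda per group and tends to scale better on large frames; a sketch:
# broadcast each id's nunique back to its rows, then filter
df[df.groupby('id')['sex'].transform('nunique') > 1]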
Use isin()
Something like this:
df = pd.DataFrame({'id':  [1,1,1,1,2,2,2,3,3,3,4,4],
                   'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})
male = df[df['sex'] == 0]
male = male['id']
female = df[df['sex'] == 1]
female = female['id']
df = df[(df['id'].isin(male)) & (df['id'].isin(female))]
print(df)
Output:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Or you can try this:
m=df.groupby('id')['sex'].nunique().eq(2)
df.loc[df.id.isin(m[m].index)]
Out[112]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0