Example:
index | param_a | param_b | param_c
1     | 0       | 0       | 0
1     | 0       | 2       | 1
3     | 2       | 1       | 1
4     | 0       | 2       | 1
3     | 2       | 1       | 1
4     | 0       | 0       | 0
4     | 0       | 0       | 0
For the duplicated indices (1, 3, 4), I want to keep only those whose rows differ. Take indices 1 and 4, for example: each has rows with different values, while index 3's rows are identical.
Output:
       param_a  param_b  param_c
index
1            0        0        0
1            0        2        1
4            0        2        1
4            0        0        0
Note: it returns only the unique rows for each duplicated index.
I referred to this post but could not get the answer.
IIUC, use tuple: after reset_index, take all values in each row as the group key, filter the df by transform('nunique'), then drop_duplicates:
s = df.reset_index()
yourdf = (s[s.apply(tuple, 1).groupby(s['index']).transform('nunique') > 1]
          .drop_duplicates()
          .set_index('index'))
yourdf
Out[207]:
param_a param_b param_c
index
1 0 0 0
1 0 2 1
4 0 2 1
4 0 0 0
First convert the index to a column and remove duplicate rows with DataFrame.drop_duplicates, then keep all rows whose index value is duplicated using Series.duplicated with keep=False and boolean indexing:
df = df.reset_index().drop_duplicates()
print (df)
index param_a param_b param_c
0 1 0 0 0
1 1 0 2 1
2 3 2 1 1
3 4 0 2 1
6 4 0 0 0
print (df['index'].duplicated(keep=False))
0 True
1 True
2 False
3 True
6 True
Name: index, dtype: bool
df1 = df[df['index'].duplicated(keep=False)].set_index('index').rename_axis(None)
print (df1)
param_a param_b param_c
1 0 0 0
1 0 2 1
4 0 2 1
4 0 0 0
I tried this way with duplicated (there is also the keep parameter to control which duplicates are flagged):
import numpy as np

df = df.reset_index()
mask = pd.DataFrame(np.sort(df[list(df)], axis=1), index=df.index).duplicated()
df1 = df[~mask]
df1 = df1.set_index('index')
The original df:
   param_a  param_b  param_c
1        0        0        0
1        0        2        1
3        2        1        1
4        0        2        1
3        2        1        1
4        0        0        0
4        0        0        0
After reset_index:
   index  param_a  param_b  param_c
0      1        0        0        0
1      1        0        2        1
2      3        2        1        1
3      4        0        2        1
4      3        2        1        1
5      4        0        0        0
6      4        0        0        0
The result df1:
       param_a  param_b  param_c
index
1            0        0        0
1            0        2        1
3            2        1        1
4            0        2        1
4            0        0        0
If you instead flag every copy of a duplicated row with keep=False:
mask = pd.DataFrame(np.sort(df[list(df)], axis=1), index=df.index).duplicated(keep=False)
you end up with:
       param_a  param_b  param_c
index
1            0        0        0
1            0        2        1
4            0        2        1
which is again close, but it drops the row 4 | 0 0 0 entirely, since keep=False removes every copy of a duplicated row. One copy should survive here, because index 4 also has a different row. So this came close, but the straightforward approach falls short.
I have data that looks like this:
X snp_id is_severe encoding_1 encoding_2 encoding_0
1 0 GL000191.1-37698 0 0 1 7
2 1 GL000191.1-37698 1 0 2 11
3 2 GL000191.1-37922 1 1 0 12
What I wish to do is: for every snp_id, if it has only is_severe == 0 rows or only is_severe == 1 rows, add an extra row with the missing is_severe value and all other columns equal to zero.
Example:
GL000191.1-37698 is OK because it has both is_severe values 0 and 1, but GL000191.1-37922 has only 1, so I would like to add:
X snp_id is_severe encoding_1 encoding_2 encoding_0
1 0 GL000191.1-37698 0 0 1 7
2 1 GL000191.1-37698 1 0 2 11
3 2 GL000191.1-37922 1 1 0 12
4 3 GL000191.1-37922 0 0 0 0
and if the data looked like this :
X snp_id is_severe encoding_1 encoding_2 encoding_0
1 0 GL000191.1-37698 0 0 1 7
2 1 GL000191.1-37698 1 0 2 11
3 2 GL000191.1-37922 0 1 0 12
the result would be :
X snp_id is_severe encoding_1 encoding_2 encoding_0
1 0 GL000191.1-37698 0 0 1 7
2 1 GL000191.1-37698 1 0 2 11
3 2 GL000191.1-37922 0 1 0 12
4 3 GL000191.1-37922 1 0 0 0
I read about indexing in some other questions, but the problem is that I'm supposed to do it for the snp_id column, which is a string and not an integer.
I also thought about pivoting and then filling the resulting NaNs with values, but it didn't work well:
count_pivote=count.pivot(index='snp_id', columns=["is_severe","encoding_1","encoding_2"], values=["encoding_1","encoding_2"])
Is there any way to do this?
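One way to approach this, as a minimal sketch (assuming each (snp_id, is_severe) pair appears at most once, as in the samples, and ignoring the X counter column for brevity): build the full grid of pairs with a MultiIndex and reindex against it, so missing combinations become rows of zeros.
import pandas as pd

df = pd.DataFrame({
    'snp_id': ['GL000191.1-37698', 'GL000191.1-37698', 'GL000191.1-37922'],
    'is_severe': [0, 1, 1],
    'encoding_1': [0, 0, 1],
    'encoding_2': [1, 2, 0],
    'encoding_0': [7, 11, 12],
})

# Every snp_id should have both is_severe values 0 and 1.
full = pd.MultiIndex.from_product(
    [df['snp_id'].unique(), [0, 1]], names=['snp_id', 'is_severe'])

# Missing (snp_id, is_severe) combinations are filled with zeros.
out = (df.set_index(['snp_id', 'is_severe'])
         .reindex(full, fill_value=0)
         .reset_index())
print(out)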
This is my dataframe:
0 1 0 1 1
1 0 1 0 1
I generate the sum for each column as below:
data.iloc[:,1:] = data.iloc[:,1:].sum(axis=0)
The result is:
0 1 1 1 2
1 1 1 1 2
But I only want to update values that are not zero:
0 1 0 1 2
1 0 1 0 2
As it is a large dataframe and I don't know which columns will contain zeros, I am having trouble getting the condition to work together with iloc.
Assuming the following input:
0 1 2 3 4
0 0 1 0 1 1
1 1 0 1 0 1
you can use the underlying numpy array and numpy.where:
import numpy as np
a = data.values[:, 1:]
data.iloc[:,1:] = np.where(a!=0, a.sum(0), a)
output:
0 1 2 3 4
0 0 1 0 1 2
1 1 0 1 0 2
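For comparison, a pandas-only sketch of the same idea using DataFrame.where: keep each value where it is zero, otherwise substitute that column's sum (axis=1 aligns the sums Series with the column labels):
d = data.iloc[:, 1:]
# where() keeps values where the condition holds (zeros) and substitutes
# the column sums, a Series aligned on the column labels, everywhere else
data.iloc[:, 1:] = d.where(d.eq(0), d.sum(), axis=1)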
I have a pandas DataFrame consisting of three columns: ID, t, and ind1.
import pandas as pd
dat = {'ID': [1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,6,6,6],
't': [0,1,2,3,0,1,2,0,1,2,3,0,1,2,0,1,0,1,2],
'ind1' : [1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0]
}
df = pd.DataFrame(dat, columns = ['ID', 't', 'ind1'])
print (df)
What I need to do is to create a new column (res) such that:
- for every ID with ind1 == 0, res is zero;
- for every ID with ind1 == 1, res = 1 where t == max(t) (grouped by ID), otherwise zero.
Here's the anticipated output:
    ID  t  ind1  res
0    1  0     1    0
1    1  1     1    0
2    1  2     1    0
3    1  3     1    1
4    2  0     0    0
5    2  1     0    0
6    2  2     0    0
7    3  0     0    0
8    3  1     0    0
9    3  2     0    0
10   3  3     0    0
11   4  0     1    0
12   4  1     1    0
13   4  2     1    1
14   5  0     1    0
15   5  1     1    1
16   6  0     0    0
17   6  1     0    0
18   6  2     0    0
Check with groupby + idxmax, then where with transform('all'):
df['res'] = (df.groupby('ID').t.transform('idxmax')            # row label of each ID's max t
               .where(df.groupby('ID').ind1.transform('all'))  # keep only IDs where ind1 is all 1
               .eq(df.index)                                   # flag the max-t row itself
               .astype(int))
df
Out[160]:
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
This builds both conditions explicitly; groupby.transform aligns the group maxima back to the original index, so it does not rely on the ID column being sorted:
import numpy as np

cond1 = df.ind1.eq(0)
cond2 = df.ind1.eq(1) & (df.t.eq(df.groupby("ID").t.transform("max")))
df["res"] = np.select([cond1, cond2], [0, 1], 0)
df
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
Use groupby.apply:
df['res'] = (df.groupby('ID').apply(lambda x: x['ind1'].eq(1)&x['t'].eq(x['t'].max()))
.astype(int).reset_index(drop=True))
print(df)
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
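A small variant, as a sketch: with group_keys=False, apply returns a Series indexed by the original row labels, so the assignment aligns by index rather than relying on reset_index(drop=True) matching the row order:
mask = df.groupby('ID', group_keys=False).apply(
    lambda x: x['ind1'].eq(1) & x['t'].eq(x['t'].max()))
df['res'] = mask.astype(int)  # aligns on the original index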
The get_dummies method does not seem to work as expected when used with more than one column.
For example, if I have this dataframe...
shopping_list = [
["Apple", "Bread", "Fridge"],
["Rice", "Bread", "Milk"],
["Apple", "Rice", "Bread", "Milk"],
["Rice", "Milk"],
["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
If I use the get_dummies method, the items are repeated across columns like this:
pd.get_dummies(df)
0_Apple 0_Rice 1_Bread 1_Milk 1_Rice 2_Bread 2_Fridge 2_Milk 3_Milk
0 1 0 1 0 0 0 1 0 0
1 0 1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0 0 1
3 0 1 0 1 0 0 0 0 0
4 1 0 1 0 0 0 0 1 0
While the expected result is:
Apple Bread Fridge Milk Rice
0 1 1 1 0 0
1 0 1 0 1 1
2 1 1 0 1 1
3 0 0 0 1 1
4 1 1 0 1 0
Add the prefix and prefix_sep parameters to get_dummies and then take max to avoid duplicated column names (it aggregates by max):
df = pd.get_dummies(df, prefix='', prefix_sep='').max(axis=1, level=0)
print(df)
Apple Rice Bread Milk Fridge
0 1 0 1 0 1
1 0 1 1 1 0
2 1 1 1 1 0
3 0 1 0 1 0
4 1 0 1 1 0
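Note that the level argument of DataFrame.max was removed in pandas 2.0, so this line fails on recent versions. An equivalent sketch for newer pandas is to stack the frame into one long Series before encoding:
# stack() drops the NaNs from the ragged lists and yields one item per
# (row, position) pair; grouping the dummies by the original row index
# (level 0) and taking max collapses them back to one row per list.
dummies = pd.get_dummies(df.stack()).groupby(level=0).max()
print(dummies)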
I have a DataFrame:
df = pd.DataFrame({'id':[1,1,1,1,2,2,2,3,3,3,4,4],
'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})
id sex
0 1 0
1 1 0
2 1 0
3 1 1
4 2 0
5 2 0
6 2 0
7 3 1
8 3 1
9 3 0
10 4 1
11 4 1
I want to get a new DataFrame containing only the ids that have both sex values.
So I want to get something like this:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
4 3 1
5 3 1
6 3 0
Using groupby and filter with the required condition:
In [2952]: df.groupby('id').filter(lambda x: set(x.sex) == set([0,1]))
Out[2952]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Also,
In [2953]: df.groupby('id').filter(lambda x: all([any(x.sex == v) for v in [0,1]]))
Out[2953]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Use drop_duplicates on both columns and then count occurrences per id with value_counts.
Then filter all values by boolean indexing with isin:
s = df.drop_duplicates()['id'].value_counts()
print (s)
3 2
1 2
4 1
2 1
Name: id, dtype: int64
df = df[df['id'].isin(s.index[s == 2])]
print (df)
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
One more:)
df.groupby('id').filter(lambda x: x['sex'].nunique()>1)
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Use isin(), something like this:
df = pd.DataFrame({'id':[1,1,1,1,2,2,2,3,3,3,4,4],
'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})
male = df[df['sex'] == 0]
male = male['id']
female = df[df['sex'] == 1]
female = female['id']
df = df[(df['id'].isin(male)) & (df['id'].isin(female))]
print(df)
Output:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Or you can try this:
m = df.groupby('id')['sex'].nunique().eq(2)  # True for ids with both sex values
df.loc[df.id.isin(m[m].index)]
Out[112]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
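One more variant as a sketch: transform('nunique') keeps everything aligned with the original index, which avoids the per-group Python calls that filter makes and is typically faster on large frames:
df[df.groupby('id')['sex'].transform('nunique').eq(2)]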