Pandas: compare df and add missing rows

I have a list of dataframes which have 1 column in common ('label'). However, in some of the dataframes some rows are missing.
Example:
df1 = pd.DataFrame([['sample1',2,3], ['sample4',7,8]],
                   columns=['label', 'B', 'E'], index=[1,2])
df2 = pd.DataFrame([['sample1',20,30], ['sample2',70,80], ['sample3',700,800]],
                   columns=['label', 'B', 'C'], index=[2,3,4])
I would like to add rows, so the length of the dfs are the same but preserving the right order. The desired output would be:
label B E
1 sample1 2 3
2 0 0 0
3 0 0 0
4 sample4 7 8
label B C
1 sample1 20 30
2 sample2 70 80
3 sample3 700 800
4 0 0 0
I was looking into "pandas three-way joining multiple dataframes on columns", but I don't want to merge my dataframes. "pandas align() function: illustrative example" doesn't give the desired output either. I was also thinking about comparing the 'label' column with a list and looping through to add the missing rows. If somebody could point me in the right direction, that would be great.

You can get the common indices in the desired order, then reindex:
# here the order matters to get the preference
# for a sorted order use:
# unique = sorted(pd.concat([df1['label'], df2['label']]).unique())
unique = pd.concat([df2['label'], df1['label']]).unique()
out1 = (df1.set_axis(df1['label'])
           .reindex(unique, fill_value=0)
           .reset_index(drop=True)
        )
out2 = (df2.set_axis(df2['label'])
           .reindex(unique, fill_value=0)
           .reset_index(drop=True)
        )
outputs:
# out1
label B E
0 sample1 2 3
1 0 0 0
2 0 0 0
3 sample4 7 8
# out2
label B C
0 sample1 20 30
1 sample2 70 80
2 sample3 700 800
3 0 0 0
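Since the question mentions a whole list of dataframes, here is a minimal sketch of the same approach generalized over a list (dfs is a hypothetical name for that list):
dfs = [df1, df2]  # stand-in for the asker's list of dataframes
# collect all labels across frames, keeping first-seen order
unique = pd.concat([d['label'] for d in dfs]).unique()
# reindex every frame on the shared label set, filling missing rows with 0
outs = [d.set_axis(d['label'])
         .reindex(unique, fill_value=0)
         .reset_index(drop=True)
        for d in dfs]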

Related

pandas finding duplicate rows with different label

I have a case where I want to sanity-check labeled data. I have hundreds of features and want to find points which have the same features but a different label. These found clusters of disagreeing labels should then be numbered and put into a new dataframe.
This isn't hard, but I am wondering what the most elegant solution for this is.
Here an example:
import pandas as pd
df = pd.DataFrame({
    "feature_1": [0,0,0,4,4,2],
    "feature_2": [0,5,5,1,1,3],
    "label": ["A","A","B","B","D","A"]
})
result_df = pd.DataFrame({
    "cluster_index": [0,0,1,1],
    "feature_1": [0,0,4,4],
    "feature_2": [5,5,1,1],
    "label": ["A","B","B","D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']

(df.assign(cluster_index=g.ngroup())   # get group number
   .loc[g.transform('size').gt(1)]     # filter out the non-duplicates
   # line below only to have a nice cluster_index range (0, 1, …)
   .assign(cluster_index=lambda d: d['cluster_index'].factorize()[0])
)
output:
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
First get all rows duplicated on the feature columns, then if necessary remove rows duplicated across all columns (not necessary here in the sample data); last, add GroupBy.ngroup for the group indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
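The drop_duplicates step matters once fully identical rows appear; here is a small hypothetical illustration (the duplicated row is invented, not part of the original sample):
# duplicate row 1 exactly; without drop_duplicates, both copies would pass
# the keep=False mask and appear twice in cluster 0
df2 = pd.concat([df, df.iloc[[1]]], ignore_index=True)
out = df2[df2.duplicated(['feature_1', 'feature_2'], keep=False)].drop_duplicates()
out['cluster_index'] = out.groupby(['feature_1', 'feature_2'])['label'].ngroup()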

python pandas sort_values with multiple custom keys

I have a dataframe with 2 columns. I would like to sort column A ascending, and column B ascending by absolute value. How do I do this? I tried df.sort_values(by=['A', 'B'], key=lambda x: abs(x)), but it takes the absolute value of both columns and sorts ascending.
df = pd.DataFrame({'A': [1,2,-3], 'B': [-1, 2, -3]})
output:
A B
0 1 -1
1 2 2
2 -3 -3
Expected output:
A B
0 -3 -1
1 1 2
2 2 -3
You can't use a different sort key per column, because sort_values reorders whole rows and the index can't be dissociated. The only way is to sort your columns independently and recreate a dataframe:
>>> df.agg({'A': lambda x: x.sort_values().values,
            'B': lambda x: x.sort_values(key=abs).values}) \
      .apply(pd.Series).T
A B
0 -3 -1
1 1 2
2 2 -3
Use numpy.sort to sort column A's values (in this sample, B already happens to be in ascending absolute-value order, so only A needs sorting):
df = df.assign(A=np.sort(df['A'].values))
A B
0 -3 -1
1 1 2
2 2 -3
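For the general case, a minimal sketch sorting each column independently with NumPy (argsort on the absolute values handles B):
import numpy as np
out = pd.DataFrame({
    'A': np.sort(df['A'].to_numpy()),
    'B': df['B'].to_numpy()[np.argsort(df['B'].abs().to_numpy())],
})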

Remove rows in pandas df with index values within a range

I would like to remove all rows in a pandas df that have an index value within 4 counts of the index value of the previous row.
In the pandas df below,
A B
0 1 1
5 5 5
8 9 9
9 10 10
Only the row with index value 0 should remain.
Thanks!
Get the differences between each index value and the next as a list and pass it to loc; using a list means the final output is still a dataframe.
ind = [a for a, b in zip(df.index, df.index[1:]) if b - a > 4]
df.loc[ind]
A B
0 1 1
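A vectorized sketch of the same pairwise logic (like the zip above, the last row is always dropped because its "next" difference is NaN):
mask = df.index.to_series().diff().shift(-1).gt(4)
df[mask]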
You can use reset_index, diff and shift:
In [1309]: df
Out[1309]:
A B
0 1 1
5 5 5
8 9 9
9 10 10
In [1310]: d = df.reset_index()
In [1313]: df = d[d['index'].diff(1).shift(-1) >= 4].drop(columns='index')
In [1314]: df
Out[1314]:
A B
0 1 1

How do I extract information from nested duplicates in pandas?

I am trying to extract information from duplicates.
data = np.array([[100,1,0,'GB'], [100,0,1,'IT'], [101,1,0,'CN'], [101,0,1,'CN'],
                 [102,1,0,'JP'], [102,0,1,'CN'], [103,0,1,'DE'],
                 [103,0,1,'DE'], [103,1,0,'VN'], [103,1,0,'VN']])
df = pd.DataFrame(data, columns=['wed_cert_id', 'spouse_1',
                                 'spouse_2', 'nationality'])
I would like to categorise each wedding as either cross-national or not.
In my actual data set there can be more than 2 spouses to a marriage.
My aim is to obtain a data frame flagging each row's wedding as cross-national or not (the desired layouts were originally shown as images).
I have tried to find a way to filter the data using .duplicated() and to negate .duplicated() with a not operator, but have not succeeded in working it out:
df = df.loc[df.wed_cert_id.duplicated(keep=False) ~df.nationality.duplicated(keep=False), :]
df = df.loc[df.wed_cert_id.duplicated(keep=False) not df.nationality.duplicated(keep=False), :]
Dropping the duplicates drops too many observations. My data set allows for >2 spouses per wedding, creating the potential for duplication:
df.drop_duplicates(subset=['wed_cert_id','nationality'], keep=False, inplace=True)
How do I do it?
Many thanks in advance.
I believe you need:
df['cross_national'] = (df.groupby('wed_cert_id')['nationality']
                          .transform('nunique').gt(1).view('i1'))
print(df)
Or:
df['cross_national'] = (df.groupby('wed_cert_id')['nationality']
                          .transform('nunique').gt(1).view('i1')
                          .mul(df[['spouse_1','spouse_2']].prod(1)))
print(df)
wed_cert_id spouse_1 spouse_2 nationality cross_national
0 100 1 0 GB 1
1 100 0 1 IT 1
2 101 1 0 CN 0
3 101 0 1 CN 0
4 102 1 0 JP 1
5 102 0 1 CN 1
6 103 0 1 DE 1
7 103 0 1 DE 1
8 103 1 0 VN 1
9 103 1 0 VN 1
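One caveat worth flagging (my observation about the sample data, not part of the original answer): because data is a single NumPy array of mixed types, every column ends up as a string, so the second variant's multiplication needs an integer cast first:
# spouse columns hold '0'/'1' strings when built from the mixed np.array;
# cast them before .prod(1)
df[['spouse_1', 'spouse_2']] = df[['spouse_1', 'spouse_2']].astype(int)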

Comparing and replacing column items pandas dataframe

I have three columns C1, C2, C3 in a pandas dataframe. My aim is to replace C3_i by C2_j whenever C3_i = C1_j. These are all strings. I was trying where but failed. What is a good way to do this avoiding a for loop?
If my data frame is
df=pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['d','e','f'], 'c3': ['c', 'z', 'b']})
Then I want c3 to be replaced by ['f','z','e']
I tried this, which takes very long time.
for i in range(0, len(df)):
    for j in range(0, len(df)):
        if df.iloc[i]['c1'] == df.iloc[j]['c3']:
            df.iloc[j]['c3'] = df.iloc[i]['c2']
Use map by Series created by set_index:
df['c3'] = df['c3'].map(df.set_index('c1')['c2']).fillna(df['c3'])
Alternative solution with update:
df['c3'].update(df['c3'].map(df.set_index('c1')['c2']))
print (df)
c1 c2 c3
0 a d f
1 b e z
2 c f e
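A caveat (my addition, not from the answer): if c1 contains duplicate values, set_index('c1')['c2'] has a non-unique index and .map will raise; a sketch that keeps only the first mapping per key:
mapping = df.drop_duplicates('c1').set_index('c1')['c2']
df['c3'] = df['c3'].map(mapping).fillna(df['c3'])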
Example data:
dataframe = pd.DataFrame({'a': ['10','4','3','40','5'],
                          'b': ['5','4','3','2','1'],
                          'c': ['s','d','f','g','h']})
Output:
a b c
0 10 5 s
1 4 4 d
2 3 3 f
3 40 2 g
4 5 1 h
Code:
def replace(df):
    # look for a row whose b equals this row's a; if found, take that row's c
    if len(dataframe[dataframe.b == df.a]) != 0:
        df['a'] = dataframe[dataframe.b == df.a].c.values[0]
    return df

dataframe = dataframe.apply(replace, axis=1)
Output:
a b c
0 10 5 s
1 d 4 d
2 f 3 f
3 40 2 g
4 s 1 h
Is it what you want?