I have the case where I want to sanity check labeled data. I have hundreds of features and want to find points which have the same features but different label. These found cluster of disagreeing labels should then be numbered and put into a new dataframe.
This isn't hard but I am wondering what the most elegant solution for this is.
Here an example:
import pandas as pd
df = pd.DataFrame({
"feature_1" : [0,0,0,4,4,2],
"feature_2" : [0,5,5,1,1,3],
"label" : ["A","A","B","B","D","A"]
})
result_df = pd.DataFrame({
"cluster_index" : [0,0,1,1],
"feature_1" : [0,0,4,4],
"feature_2" : [5,5,1,1],
"label" : ["A","B","B","D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup()) # get group name
.loc[g.transform('size').gt(1)] # filter the non-duplicates
# line below only to have a nice cluster_index range (0,1…)
.assign(cluster_index= lambda d: d['cluster_index'].factorize()[0])
)
output:
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
First get all duplicated values per feature columns and then if necessary remove duplciated by all columns (here in sample data not necessary), last add GroupBy.ngroup for groups indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
My problem is similar to split a dataframe into chunks of N rows problem, expect that the number of rows in each chunk will be different. I have a datafame as such:
A
B
C
1
2
0
1
2
1
1
2
2
1
2
0
1
2
1
1
2
2
1
2
3
1
2
4
1
2
0
A and B are just whatever don't pay attention. Column C though starts at 0 and increments with each row until it suddenly resets to 0. So in the dataframe included the first 3 rows are a new dataframe, then the next 5 are a second new dataframe, and this continues as my dataframe adds more and more rows.
To finish off the question,
df = [x for _, x in df.groupby(df['C'].eq(0).cumsum())]
allows me to group all the subgroups and then with this groupby I can select each subgroups as a separate dataframe.
I am using the split-apply-combine pattern in pandas to group my df by a custom aggregation function.
But this returns an undesired DataFrame with the grouped column existing twice: In an MultiIndex and the columns.
The following is a simplified example of my problem.
Say, I have this df
df = pd.DataFrame([[1,2],[3,4],[1,5]], columns=['A','B']))
A B
0 1 2
1 3 4
2 1 5
I want to group by column A and keep only those rows where B has an even value. Thus the desired df is this:
B
A
1 2
3 4
The custom function my_combine_func should do the filtering. But applying it after a groupby, leads to an MultiIndex with the former Index in the second level. And thus column A existing two times.
my_combine_func = group[group['B'] % 2 == 0]
df.groupby(['A']).apply(my_combine_func)
A B
A
1 0 1 2
3 1 3 4
How to apply a custom group function and have the desired df?
It's easier to use apply here so you get a boolean array back:
df[df.groupby('A')['B'].apply(lambda x: x % 2 == 0)]
A B
0 1 2
1 3 4
I have a dataset containing 3 columns, I’m trying to group them and print each group in sorted fashion (based on highest value in each group). The records in each group also have to be in sorted fashion.
Dataset looks like below.
key1,key2,val
b,y,21
c,y,25
c,z,10
b,x,20
b,z,5
c,x,17
a,x,15
a,y,18
a,z,100
df=pd.read_csv('/tmp/hello.csv')
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max', 'val'], ascending=False).drop('max', axis=1)
I'm applying transform as it works per group basis and then sorting the values.
Above code results in my desired dataframe:
a,z,100
a,y,18
a,x,15
c,y,25
c,x,17
c,z,10
b,y,21
b,x,20
b,z,5
But, the same code fails for below dataset.
key1,key2,val
b,y,10
c,y,10
c,z,10
b,x,2
b,z,2
c,x,2
a,x,2
a,y,2
a,z,2
Below is the desired output
key1,key2,val
c,y,10
c,z,10
c,x,2
b,y,10
b,x,2
b,z,2
a,x,2
a,y,2
a,z,2
Please help me in properly grouping and sorting the dataframe for my scenario.
Add column key1 to sort_values because in second DataFrame are multiple maximum values 10 per groups, so sorting cannot distingush groups:
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max','key1', 'val'], ascending=False).drop('max', axis=1)
print (dff)
key1 key2 val
8 a z 100
7 a y 18
6 a x 15
1 c y 25
5 c x 17
2 c z 10
0 b y 21
3 b x 20
4 b z 5
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max','key1', 'val'], ascending=False).drop('max', axis=1)
print (dff)
key1 key2 val
1 c y 10
2 c z 10
5 c x 2
0 b y 10
3 b x 2
4 b z 2
6 a x 2
7 a y 2
8 a z 2
I have a dataframe
A B C
1 2 3
2 3 4
3 8 7
I want to take only rows where there is a sequence of 3,4 in columns C (in this scenario - first two rows)
What will be the best way to do so?
You can use rolling for general solution working with any pattern:
pat = np.asarray([3,4])
N = len(pat)
mask= (df['C'].rolling(window=N , min_periods=N)
.apply(lambda x: (x==pat).all(), raw=True)
.mask(lambda x: x == 0)
.bfill(limit=N-1)
.fillna(0)
.astype(bool))
df = df[mask]
print (df)
A B C
0 1 2 3
1 2 3 4
Explanation:
use rolling.apply and test pattern
replace 0s to NaNs by mask
use bfill with limit for filling first NANs values by last previous one
fillna NaNs to 0
last cast to bool by astype
Use shift
In [1085]: s = df.eq(3).any(1) & df.shift(-1).eq(4).any(1)
In [1086]: df[s | s.shift()]
Out[1086]:
A B C
0 1 2 3
1 2 3 4