I have the case where I want to sanity check labeled data. I have hundreds of features and want to find points which have the same features but different label. These found cluster of disagreeing labels should then be numbered and put into a new dataframe.
This isn't hard but I am wondering what the most elegant solution for this is.
Here an example:
import pandas as pd
df = pd.DataFrame({
"feature_1" : [0,0,0,4,4,2],
"feature_2" : [0,5,5,1,1,3],
"label" : ["A","A","B","B","D","A"]
})
result_df = pd.DataFrame({
"cluster_index" : [0,0,1,1],
"feature_1" : [0,0,4,4],
"feature_2" : [5,5,1,1],
"label" : ["A","B","B","D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup()) # get group name
.loc[g.transform('size').gt(1)] # filter the non-duplicates
# line below only to have a nice cluster_index range (0,1…)
.assign(cluster_index= lambda d: d['cluster_index'].factorize()[0])
)
output:
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
First get all duplicated values per feature columns and then if necessary remove duplciated by all columns (here in sample data not necessary), last add GroupBy.ngroup for groups indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
I am using the split-apply-combine pattern in pandas to group my df by a custom aggregation function.
But this returns an undesired DataFrame with the grouped column existing twice: In an MultiIndex and the columns.
The following is a simplified example of my problem.
Say, I have this df
df = pd.DataFrame([[1,2],[3,4],[1,5]], columns=['A','B']))
A B
0 1 2
1 3 4
2 1 5
I want to group by column A and keep only those rows where B has an even value. Thus the desired df is this:
B
A
1 2
3 4
The custom function my_combine_func should do the filtering. But applying it after a groupby, leads to an MultiIndex with the former Index in the second level. And thus column A existing two times.
my_combine_func = group[group['B'] % 2 == 0]
df.groupby(['A']).apply(my_combine_func)
A B
A
1 0 1 2
3 1 3 4
How to apply a custom group function and have the desired df?
It's easier to use apply here so you get a boolean array back:
df[df.groupby('A')['B'].apply(lambda x: x % 2 == 0)]
A B
0 1 2
1 3 4
I have the following table:
and would like to convert the product column to something like:
How would you recomend I do this in pandas? Test df below
import numpy as np
import pandas as pd
test_dict = {'Acount': ['1', '2', '3', '4'], 'Product': [np.nan, 'A','A,B,C', 'C']}
df = pd.DataFrame.from_dict(test_dict)
For a single column you can use Series.str.get_dummies which allows you to specify the character that separates all categories. Set 'Acount' to the index so that appears in the output:
df.set_index('Acount')['Product'].str.get_dummies(sep=',')
A B C
Acount
1 0 0 0
2 1 0 0
3 1 1 1
4 0 0 1
Let's use .str.split, explode and pd.crosstab:
df_count = df.assign(Product=df['Product'].str.split(',')).explode('Product')
pd.crosstab(df_count['Acount'], df_count['Product']).reindex(df['Acount'].unique(), fill_value=0)
Output:
Product A B C
Acount
1 0 0 0
2 1 0 0
3 1 1 1
4 0 0 1
Details
Let's assign 'Product' as a list of elements using .str.split on commas.
Next, use explode to unnest the list in the 'Product' column.
Now, use pd.crosstab to count the occurrence for each value by 'Acount'.
Lastly, reindex to fill missing 'Acount' not present in crosstab.
I have 3 data frame:
df1
id,k,a,b,c
1,2,1,5,1
2,3,0,1,0
3,6,1,1,0
4,1,0,5,0
5,1,1,5,0
df2
name,a,b,c
p,4,6,8
q,1,2,3
df3
type,w_ave,vac,yak
n,3,5,6
v,2,1,4
from the multiplication, using pandas and numpy, I want to the output in df1:
id,k,a,b,c,w_ave,vac,yak
1,2,1,5,1,16,15,18
2,3,0,1,0,0,3,6
3,6,1,1,0,5,4,7
4,1,0,5,0,0,11,14
5,1,1,5,0,13,12,15
the conditions are:
The value of the new column will be =
#its not a code
df1["w_ave"][1] = df3["w_ave"]["v"]+ df1["a"][1]*df2["a"]["q"]+df1["b"][1]*df2["b"]["q"]+df1["c"][1]*df2["c"]["q"]
for output["w_ave"][1]= 2 +(1*1)+(5*2)+(1*3)
df3["w_ave"]["v"]=2
df1["a"][1]=1, df2["a"]["q"]=1 ;
df1["b"][1]=5, df2["b"]["q"]=2 ;
df1["c"][1]=1, df2["c"]["q"]=3 ;
Which means:
- a new column will be added in df1, from the name of the column from df3.
- for each row of the df1, the value of a, b, c will be multiplied with the same-named q value from df2. and summed together with the corresponding value of df3.
-the column name of df1 , matched will column name of df2 will be multiplied. The other not matched column will not be multiplied, like df1[k].
- However, if there is any 0 in df1["a"], the corresponding output will be zero.
I am struggling with this. It was tough to explain also. My attempts are very silly. I know this attempt will not work. However, I have added this:
import pandas as pd, numpy as np
data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 =pd.read_csv(folder + data1)
df2 =pd.read_csv(folder + data2)
df3 =pd.read_csv(folder + data3)
df1= df2 * df1
Ok, so this will in no way resemble your desired output, but vectorizing the formula you provided:
df2=df2.set_index("name")
df3=df3.set_index("type")
df1["w_ave"] = df3.loc["v", "w_ave"]+ df1["a"].mul(df2.loc["q", "a"])+df1["b"].mul(df2.loc["q", "b"])+df1["c"].mul(df2.loc["q", "c"])
Outputs:
id k a b c w_ave
0 1 2 1 5 1 16
1 2 3 0 1 0 4
2 3 6 1 1 0 5
3 4 1 0 5 0 12
4 5 1 1 5 0 13
I have a data frame df
0 0 1 1 2
0 0 1 2 2
How do I select rows where at least 2 columns have value > 1? So, only select the second row in the df above
Thanks!
Use gt with sum along axis=1 as:
df[df.gt(1).sum(axis=1).gt(1)]
Or:
df[(df>1).sum(axis=1)>1]