How can I group different rows based on their values? - pandas

I have a data frame in pandas like this:
Attributes1  Attributes value1  Attributes2  Attributes value2
a            1                  b            4
b            2                  a            5
Does anyone know how I can get a new data frame like the one below?
a  b
1  2
5  4
Thank you!

Try:
x = pd.DataFrame(
    df.apply(
        lambda x: dict(
            zip(x.filter(regex=r"Attributes\d+$"), x.filter(like="value"))
        ),
        axis=1,
    ).to_list()
)
print(x)
Prints:
a b
0 1 4
1 5 2

Use transpose() to turn the rows into columns, then use groupby() to group the columns that share the same name.
Also, in the future, please add what you've tried to do to solve the problem as well.
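A runnable sketch of that idea, pairing each Attributes column with its value column and grouping the values by attribute name (this uses concat/groupby rather than a literal transpose, and the pairs/out names are mine):

```python
import pandas as pd

df = pd.DataFrame({
    "Attributes1": ["a", "b"],
    "Attributes value1": [1, 2],
    "Attributes2": ["b", "a"],
    "Attributes value2": [4, 5],
})

# Pair each attribute column with its value column and stack them long-wise.
pairs = pd.concat([
    df[["Attributes1", "Attributes value1"]].set_axis(["attr", "value"], axis=1),
    df[["Attributes2", "Attributes value2"]].set_axis(["attr", "value"], axis=1),
])

# Group the values by attribute name and rebuild a frame, one column per attribute.
out = pd.DataFrame({name: g["value"].to_list() for name, g in pairs.groupby("attr")})
print(out)
```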

We can use wide_to_long, then pivot (keyword arguments to pivot are required on recent pandas):
s = pd.wide_to_long(df.reset_index(),
                    ['Attributes', 'Attributes value'],
                    i='index',
                    j='a').reset_index().drop(['a'], axis=1)
s = s.pivot(index='index', columns='Attributes', values='Attributes value')
Out[22]:
Attributes a b
index
0 1 4
1 5 2

pandas finding duplicate rows with different label

I have the case where I want to sanity check labeled data. I have hundreds of features and want to find points which have the same features but a different label. Each such cluster of disagreeing labels should then be numbered and put into a new dataframe.
This isn't hard but I am wondering what the most elegant solution for this is.
Here an example:
import pandas as pd
df = pd.DataFrame({
    "feature_1": [0, 0, 0, 4, 4, 2],
    "feature_2": [0, 5, 5, 1, 1, 3],
    "label": ["A", "A", "B", "B", "D", "A"]
})
result_df = pd.DataFrame({
    "cluster_index": [0, 0, 1, 1],
    "feature_1": [0, 0, 4, 4],
    "feature_2": [5, 5, 1, 1],
    "label": ["A", "B", "B", "D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup())    # get group number
   .loc[g.transform('size').gt(1)]      # keep only duplicated feature rows
   # line below only to have a nice cluster_index range (0, 1, …)
   .assign(cluster_index=lambda d: d['cluster_index'].factorize()[0])
)
output:
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
First get all rows duplicated by the feature columns, then, if necessary, remove rows duplicated across all columns (not needed for the sample data here), and finally add GroupBy.ngroup for the group indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
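For reference, the duplicated/ngroup approach runs end-to-end on the sample data like this (a self-contained sketch; the dup name is mine):

```python
import pandas as pd

df = pd.DataFrame({
    "feature_1": [0, 0, 0, 4, 4, 2],
    "feature_2": [0, 5, 5, 1, 1, 3],
    "label": ["A", "A", "B", "B", "D", "A"],
})

# Keep rows whose feature combination occurs more than once, then drop
# rows that are exact duplicates across all columns.
dup = df[df.duplicated(["feature_1", "feature_2"], keep=False)].drop_duplicates()

# Number each cluster of identical features.
dup = dup.assign(cluster_index=dup.groupby(["feature_1", "feature_2"]).ngroup())
print(dup)
```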

I want to remove specific rows and restart the values from 1

I have a dataframe that looks like this:
Time Value
1 5
2 3
3 3
4 2
5 1
I want to remove the first two rows and then restart time from 1. The dataframe should then look like:
Time Value
1 3
2 2
3 1
I attach the code:
file0 = pd.read_excel(r'C:......xlsx')
df = file0.loc[(file0['Time'] > 2) & (file0['Time'] < 11)]
df = df.reset_index()
Now what I get is:
index Time Value
0 3 3
1 4 2
2 5 1
Thank you!
You can use the .loc[] accessor and the reset_index() method:
df = df.loc[2:].reset_index(drop=True)
Finally, use a list comprehension:
df['Time'] = [x for x in range(1, len(df) + 1)]
Now if you print df you will get your desired output:
Time Value
0 1 3
1 2 2
2 3 1
You can use df.loc to extract the subset of the dataframe, reset the index, and then change the values of the Time column.
df = df.loc[2:].reset_index(drop=True)
df['Time'] = df.index + 1
print(df)
You have two ways to do that.
First:
df[2:].assign(Time=df.Time.values[:-2])
which returns your desired output:
   Time  Value
2     1      3
3     2      2
4     3      1
Second:
df = df.set_index('Time')
df['Value'] = df['Value'].shift(-2)
df = df.dropna()
This returns your output too, but turns the numbers into float64:
      Value
Time
1       3.0
2       2.0
3       1.0
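A self-contained version of the drop-and-renumber step, using positional iloc so it does not depend on the index labels (a sketch built from the sample data):

```python
import pandas as pd

df = pd.DataFrame({"Time": [1, 2, 3, 4, 5], "Value": [5, 3, 3, 2, 1]})

# Drop the first two rows positionally, reset the index, renumber Time from 1.
out = df.iloc[2:].reset_index(drop=True)
out["Time"] = out.index + 1
print(out)
```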

Group and count entries in a DataFrame

I'm new to programming and Pandas. I'd like to see an example of a grouping function that also applies some counters, reducing the following DataFrame:
child  groupName  state
name1  A          ok
name2  A          ko
name3  B          ok
to a new DataFrame like:
groupName  noOfChildren  noOfOk  noOfKo
A          2             1       1
B          1             1       0
Given the allChildren DataFrame, I can create a Series counting the entries by groupName:
childrenByGroupName = allChildren.groupby(['groupName'])['child'].count()
And also a Series counting only those in the 'ok' state:
okChildrenByGroupName = allChildren.where(allChildren['state'] == 'ok').groupby(['groupName'])['child'].count()
But I cannot build the merged DataFrame as per the expectation above.
Any help?
Try:
pd.crosstab(df['groupName'], df['state'], margins=True)
Output:
state ko ok All
groupName
A 1 1 2
B 0 1 1
All 1 2 3
and to (almost) match the expected output:
(pd.crosstab(df['groupName'], df['state'], margins=True, margins_name='Children')
   .drop('Children')
   .add_prefix('noOf')
   .reset_index()
)
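Run end-to-end on the sample data, that crosstab chain looks like this (a sketch; the sample frame is rebuilt here from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "child": ["name1", "name2", "name3"],
    "groupName": ["A", "A", "B"],
    "state": ["ok", "ko", "ok"],
})

# margins=True adds a totals row and column; naming the margin "Children"
# and dropping the totals row leaves a per-group child count column.
out = (pd.crosstab(df["groupName"], df["state"],
                   margins=True, margins_name="Children")
       .drop("Children")
       .add_prefix("noOf")
       .reset_index())
print(out)
```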
You can try it like this:
df1 = (df.groupby(['groupName'])
         .agg({'child': 'count', 'state': lambda x: x.value_counts().to_dict()})
         .add_prefix('noOf')
         .reset_index())
df2 = pd.concat([df1.drop('noOfstate', axis=1),
                 pd.DataFrame(df1['noOfstate'].tolist()).add_prefix('noOf')],
                axis=1).fillna(0)
df2:
groupName noOfchild noOfok noOfko
0 A 2 1 1.0
1 B 1 1 0.0

pandas split-apply-combine creates undesired MultiIndex

I am using the split-apply-combine pattern in pandas to group my df by a custom aggregation function.
But this returns an undesired DataFrame with the grouped column appearing twice: in a MultiIndex and in the columns.
The following is a simplified example of my problem.
Say, I have this df
df = pd.DataFrame([[1,2],[3,4],[1,5]], columns=['A','B'])
A B
0 1 2
1 3 4
2 1 5
I want to group by column A and keep only those rows where B has an even value. Thus the desired df is this:
B
A
1 2
3 4
The custom function my_combine_func should do the filtering. But applying it after a groupby leads to a MultiIndex with the former index in the second level, and thus column A exists two times.
def my_combine_func(group):
    return group[group['B'] % 2 == 0]
df.groupby(['A']).apply(my_combine_func)
A B
A
1 0 1 2
3 1 3 4
How to apply a custom group function and have the desired df?
It's easier to use apply here so you get a boolean array back:
df[df.groupby('A')['B'].apply(lambda x: x % 2 == 0)]
A B
0 1 2
1 3 4
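Another option is to pass group_keys=False to groupby, so apply does not prepend the group label as an extra index level (a sketch with the sample frame from the question):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [1, 5]], columns=["A", "B"])

# group_keys=False keeps each group's original index, so the result
# comes back without the extra "A" index level.
out = (df.groupby("A", group_keys=False)
         .apply(lambda g: g[g["B"] % 2 == 0]))
print(out)
```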

Pandas sort groupby groups by arbitrary condition on their contents

Ok, this is getting ridiculous ... I've spent way too much time on something that should be trivial.
I want to group a data frame by a column, then sort the groups (not within the group) by some condition (in my case maximum over some column B in the group).
I expected something along these lines:
df.groupby('A').sort_index(lambda group_content: group_content.B.max())
I also tried:
groups = df.groupby('A')
maxx = groups['B'].max()
groups.sort_index(...)
But, of course, there is no sort_index on a GroupBy object...
EDIT:
I ended up using (almost) the solution suggested by @jezrael:
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max', 'B'], ascending=True).drop('max', axis=1)
groups = df.groupby('A', sort=False)
I had to add ascending=True to sort_values, but more importantly sort=False to groupby; otherwise the groups would be sorted lexicographically (A contains strings).
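For reference, this approach runs like this with some sample data (a sketch; the order list is only there to show the resulting group order):

```python
import pandas as pd

df = pd.DataFrame({"A": list("aaabcc"), "B": [7, 8, 9, 100, 20, 30]})

# Sort rows by each group's max (ties broken by B), then group with
# sort=False so the groups keep that order instead of lexicographic order.
df["max"] = df.groupby("A")["B"].transform("max")
df = df.sort_values(["max", "B"], ascending=True).drop("max", axis=1)
groups = df.groupby("A", sort=False)

order = [name for name, _ in groups]
print(order)
```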
I think you need, if the same max is possible for several groups, GroupBy.transform with max for a new column, and then sort by DataFrame.sort_values:
df = pd.DataFrame({
    'A': list('aaabcc'),
    'B': [7, 8, 9, 100, 20, 30]
})
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max','A'])
print (df)
A B max
0 a 7 9
1 a 8 9
2 a 9 9
4 c 20 30
5 c 30 30
3 b 100 100
If the max values are always unique, use Series.argsort:
s = df.groupby('A')['B'].transform('max')
df = df.iloc[s.argsort()]
print (df)
A B
0 a 7
1 a 8
2 a 9
4 c 20
5 c 30
3 b 100