I'm doing a pandas groupby for a specific DataFrame which looks like this
Group  Value
A      1
A      2
B      2
B      3
And when I apply df.groupby('Group')['Value'].mean() I get
Group
A 1.5
B 2.5
Name: Value, dtype: float64
My end goal is to find the group that has the maximum aggregated value (i.e. Group B).
I understand that groups.keys() is an option to list the keys, but is there a way to get the group name corresponding to a specific aggregation result?
Thanks!
By default, groupby sets your grouping column as the index of the aggregation, so use idxmax:
df.groupby('Group')['Value'].mean().idxmax()
Output: 'B'
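For reference, a minimal end-to-end sketch that reproduces the example above:

import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B'], 'Value': [1, 2, 2, 3]})

means = df.groupby('Group')['Value'].mean()  # 'Group' becomes the index
print(means.idxmax())  # 'B'  -> index label of the largest mean
print(means.max())     # 2.5  -> the largest mean itself, if you also need it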
I have a data set with individuals participating multiple times. I would like to count the unique number of individuals by gender. I've done this with a pivot_table and groupby approach and get different values. I can't figure out why. Can you tell me the obvious element which I have overlooked?
Pivot table solution:
Groupby solution:
As you can see, both give the correct values for the specific "gender". Rather, it is the totals that are different. Groupby appears to provide the correct totals whereas pivot_table totals seem off. Why?
This could be your issue. If there are names that are shared between genders, then pivot_table does not count the duplicates, while groupby does count them, as shown in this small example where the name 'A' appears under both the 'M' and 'F' genders:
import pandas as pd
import sidetable

df = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'M', 'F', 'T', 'F', 'F'],
    'Name':   ['A', 'A', 'C', 'D', 'E', 'F', 'G', 'H'],
})

# pivot_table with margins: the 'All' row counts unique names over the whole column,
# so a name shared across genders is counted only once
piv_df = df.pivot_table(index='Gender', values='Name', aggfunc=pd.Series.nunique, margins=True)

# groupby + sidetable subtotal: the grand total simply sums the per-gender counts
gb_df = df.groupby('Gender').agg({'Name': 'nunique'}).stb.subtotal()

print(piv_df)
print(gb_df)
Output

        Name
Gender
F          4
M          3
T          1
All        7

             Name
Gender
F               4
M               3
T               1
grand_total     8
You can test this by running df = df.drop_duplicates('Name') before the pivot and the groupby; the counts should then match if this is the only reason for the differing totals.
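A quick sketch of that check, reusing the example frame from above:

dedup = df.drop_duplicates('Name')  # keep one row per name
piv_chk = dedup.pivot_table(index='Gender', values='Name', aggfunc=pd.Series.nunique, margins=True)
gb_chk = dedup.groupby('Gender').agg({'Name': 'nunique'}).stb.subtotal()
print(piv_chk)  # the 'All' total is now 7
print(gb_chk)   # the 'grand_total' is now 7 as well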
I would like to use pandas to group a dataframe by one column, and then run an expanding window calculation on the groups. Imagine the following dataframe:
G Val
A 0
A 1
A 2
B 3
B 4
C 5
C 6
C 7
What I am looking for is a way to group the data by column G (resulting in groups ['A', 'B', 'C']), and then apply a function first to the items in group A, then to the items in groups A and B, and finally to the items in groups A to C.
For example, if the function is sum, then the result would be
A 3
B 10
C 28
For my problem the function that is applied needs to be able to access all original items in the dataframe, not only the aggregates from the groupby.
For example when applying mean, the expected result would be
A 1
B 2
C 3.5
A: mean([0,1,2]), B: mean([0,1,2,3,4]), C: mean([0,1,2,3,4,5,6,7]).
A cummean does not exist, so one possible solution is to aggregate the counts and sums, take a cumulative sum of both, and divide to get the mean:
# cumulative counts and sums across the groups (keeps the original df intact for the next solution)
cum = df.groupby('G')['Val'].agg(['size', 'sum']).cumsum()
# expanding mean = cumulative sum / cumulative count
s = cum['sum'].div(cum['size'])
print(s)
A 1.0
B 2.0
C 3.5
dtype: float64
If you need a general solution, you can extract the expanding groups and then apply the function inside a dict comprehension like:
# expanding lists of group labels: ['A'], ['A', 'B'], ['A', 'B', 'C']
g = df['G'].drop_duplicates().apply(list).cumsum()
# apply the function to all original rows belonging to each expanding set of groups
s = pd.Series({x[-1]: df.loc[df['G'].isin(x), 'Val'].mean() for x in g})
print(s)
A 1.0
B 2.0
C 3.5
dtype: float64
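The same dict-comprehension pattern works with any reducer; for example, swapping in sum reproduces the first expected result from the question:

# reuse the expanding label lists in g; only the applied function changes
s_sum = pd.Series({x[-1]: df.loc[df['G'].isin(x), 'Val'].sum() for x in g})
print(s_sum)
A     3
B    10
C    28
dtype: int64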
I am using groupby to process many columns with different functions.
Below I have used only one column, but I can't select an element conditional on the values of other columns.
import pandas as pd
data = {'a':['A','C','E','J'],'b':[1,2,3,4]}
df = pd.DataFrame(data, index=[1,1,1,1])
df.groupby(level=0).agg({
    'b': 'sum',
    'b': "select element from b where a == 'C'",  # pseudocode for the desired, condition-based aggregation
})
The goal is to use agg to combine the results of these three separate expressions:
df.groupby(level=0).apply(lambda x: x.loc[x.a == 'C', 'b'])
df.groupby(level=0).b.first()
df.groupby(level=0).b.sum()
into a single result like this:
   f  first  sum
1  2      1   10
No, you cannot use agg across multiple columns. agg aggregates the values of a single column; if you need conditions based on a separate column, you have to use apply:
df.groupby(level=0).apply(lambda x: pd.Series([x.loc[x.a == 'C', 'b'].values[0],
                                               x.b.iloc[0],
                                               x.b.sum()],
                                              index=['f', 'first', 'sum']))
Output:
f first sum
1 2 1 10
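If you would rather stay with agg, one possible workaround (a sketch, not part of the original answer; the helper column name b_if_C is made up) is to precompute a column holding b only where a == 'C', then aggregate each column by name:

# b_if_C holds b where a == 'C' and NaN elsewhere; 'first' skips NaN within each group
out = (df.assign(b_if_C=df['b'].where(df['a'] == 'C'))
         .groupby(level=0)
         .agg(f=('b_if_C', 'first'),
              first=('b', 'first'),
              sum=('b', 'sum')))
print(out)
     f  first  sum
1  2.0      1   10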
An elegant expression like
df[~pandas.isnull(df.loc[:, 0])]
can check a pandas DataFrame column and return the entire DataFrame with all rows removed whose value in the selected column is NaN.
I am wondering if there is a similar way to check a column and return rows conditional on the dtype of their values, without using any loops.
I've looked at
.select_dtypes(include=[np.float])
but this only returns columns whose dtype is float64, not the individual rows whose value in a column is a float.
First let's set up a DataFrame with two columns. Only column b contains a float. We'll try to find this row:
import pandas

df = pandas.DataFrame({
    'a': ['qw', 'er'],
    'b': ['ty', 1.98],
})
When printed this looks like:
a b
0 qw ty
1 er 1.98
Then create a boolean mask to select the rows, using apply():

def check_if_float(row):
    # True when this row's value in column b is a float
    return isinstance(row['b'], float)

mask = df.apply(check_if_float, axis=1)

This gives a boolean mask of all the rows that have a float in column b:
0 False
1 True
You can then use this mask to select the rows you want:
filtered_rows = df[mask]
Which leaves you only the rows that contain a float in column b:
a b
1 er 1.98
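The same selection can also be written more compactly by building the mask from the single column, a minor variation on the approach above:

# keep rows whose value in column b is a float
filtered_rows = df[df['b'].apply(lambda v: isinstance(v, float))]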
I have a pandas DataFrame which I need to group by a text column to obtain the sum of duplicated values along that column. But when I run the groupby method it drops many columns mysteriously. Can anyone help me with this?
Check your column dtypes; sum only works on numeric values.
For example, say you have a df as below:
df=pd.DataFrame({'V1':[1,2,3],'V2':['A','B','C'],'KEY':[1,2,2]})
df.dtypes
Out[159]:
KEY int64
V1 int64
V2 object
dtype: object
Then when you groupby KEY and sum the whole DataFrame, it only returns results for the numeric columns:
df.groupby('KEY').sum()
Out[160]:
V1
KEY
1 1
2 5
If you need the string values joined together, you can do:
df.groupby('KEY', as_index=False).apply(lambda x: x.sum())
Out[164]:
KEY V1 V2
0 1 1 A
1 4 5 BC
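An alternative sketch that avoids summing the KEY column by aggregating each column explicitly:

# sum the numeric column, concatenate the strings, keep KEY as a regular column
out = df.groupby('KEY').agg({'V1': 'sum', 'V2': ''.join}).reset_index()
print(out)
   KEY  V1  V2
0    1   1   A
1    2   5  BC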