When using 'df.groupby(column).apply()' get the groupby column within the 'apply' context? - pandas

I want to get the groupby column i.e. column that is supplied to df.groupby as a by argument (i.e. df.groupby(by=column)), within the apply context that comes after groupby (i.e. df.groupby(by=column).apply(Here)).
For example,
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
'Parrot', 'Parrot'],
'Max Speed': [380., 370., 24., 26.]})
df.groupby(['Animal']).apply(Here I want to know that groupby column is 'Animal')
df
Animal Max Speed
0 Falcon 380.0
1 Falcon 370.0
2 Parrot 24.0
3 Parrot 26.0
Of course, I can have one more line of code or simply by supplying the groupby column to the apply context separately (e.g. .apply(lambda df_: some_function(df_,s='Animal')) ), but I am curious to see if this can be done in a single line e.g. possibly using pandas function built for doing this.

I just figured out a one-liner solution:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
'Parrot', 'Parrot'],
'Max Speed': [380., 370., 24., 26.]})
df.groupby(['Animal']).apply(lambda df_: df_.apply(lambda x: all(x==df_.name)).loc[lambda x: x].index.tolist())
returns groupby column within each groupby.apply context.
Animal
Falcon [Animal]
Parrot [Animal]
Since it is quite a long one-liner (uses 3 lambdas!), it is better to wrap it in a separate function, as shown below:
def get_groupby_column(df_): return df_.apply(lambda x: all(x==df_.name)).loc[lambda x: x].index.tolist()
df.groupby(['Animal']).apply(get_groupby_column)
Note of caution: this solution won't apply if other columns of the dataframe also contain the items from the groupby column e.g. if Max Speed column contained any of the items from the groupby column (i.e. Animal) there will be inaccurate results.

You could use grouper.names:
>>> df.groupby('Animal').grouper.names
['Animal']
>>>
With apply:
grouped = df.groupby('Animal')
grouped.apply(lambda x: grouped.grouper.names)

Related

why can't we use the argument expand of split() inside apply() -in pandas-

# Split single column into two columns use apply()
df[['First Name', 'Last Name']] = df["Student_details"].apply(lambda x: pd.Series(str(x).split(",")))
print(df)
1- why when i change the code to .apply(lambda x: str(x).split("," , expand=True)) i got an error which is "expand is invalid argument to split function"
2- why do i have to use pd.Series() although the default return value of str.split() is Series
3- how does pd.Series() return a series while it returns a DF -here-
i tried to write expand and use it normally but it didn't work
here is the DF
import pandas as pd
import numpy as np
technologies = {
'Student_details':["Pramodh_Roy", "Leena_Singh", "James_William", "Addem_Smith"],
'Courses':["Spark", "PySpark", "Pandas", "Hadoop"],
'Fee' :[25000, 20000, 22000, 25000]
}
df = pd.DataFrame(technologies)
print(df)
df[['First Name', 'Last Name']] = df["Student_details"].str.split("_", expand=True)
I don't get what you want... is it about the solution above? Or do you really wanna know why #1 throws an error?
EDIT 1:
The expand parameter does not exist for the split function of the str type you are referring to, as this is the str type of python. The expand parameter has been written for the split function for a Series in pandas.
EDIT 2:
Re your third question: As you can see in my suggestion, I'm not using even the pd.Series function, however of course the df["Student_details"] is a series. The key here in my answer is, that the "expand" parameter is here returning a DF with as many columns as required for the split results. So if one of the names were "a_b_c_d" I would get in total a df with four columns.
EDIT 3:
pd.Series will convert result to a series. The result of the split function will be a list, which in turn will be casted to a Series. All of which would not have been necessary in my eyes. That piece of code I wrote saves time, both in execution and understanding (I hope)
FYI, this does work:
technologies = {
'Student_details':["Pramodh_Roy", "Leena_Singh", "James_William", "Addem_Smith"],
'Courses':["Spark", "PySpark", "Pandas", "Hadoop"],
'Fee' :[25000, 20000, 22000, 25000]
}
df = pd.DataFrame(technologies)
df[['First Name', 'Last Name']] = df["Student_details"].apply(lambda x: pd.Series(str(x).split("_")))
print(df)
Output:
Student_details Courses Fee First Name Last Name
0 Pramodh_Roy Spark 25000 Pramodh Roy
1 Leena_Singh PySpark 20000 Leena Singh
2 James_William Pandas 22000 James William
3 Addem_Smith Hadoop 25000 Addem Smith

Group rows of 2nd Column by 1st Column using condition for length of string characters of text of the1st Column [duplicate]

The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:
In [563]: grouped['D'].agg({'result1' : np.sum,
.....: 'result2' : np.mean})
.....:
Out[563]:
result2 result1
A
bar -0.579846 -1.739537
foo -0.280588 -1.402938
However, this only works on a Series groupby object. And when a dict is similarly passed to a groupby DataFrame, it expects the keys to be the column names that the function will be applied to.
What I want to do is apply multiple functions to several columns (but certain columns will be operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column, and doing something like the code above, using lambdas for functions that depend on other rows. But this is taking a long time, (I think it takes a long time to iterate through a groupby object). I'll have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built in way in pandas to do this somewhat cleanly.
For example, I've tried something like
grouped.agg({'C_sum' : lambda x: x['C'].sum(),
'C_std': lambda x: x['C'].std(),
'D_sum' : lambda x: x['D'].sum()},
'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...)
but as expected I get a KeyError (since the keys have to be a column if agg is called from a DataFrame).
Is there any built in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?
The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.
If you desire to work with two separate columns at the same time I would suggest using the apply method which implicitly passes a DataFrame to the applied function. Let's use a similar dataframe as the one from above
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
a b c d group
0 0.418500 0.030955 0.874869 0.145641 0
1 0.446069 0.901153 0.095052 0.487040 0
2 0.843026 0.936169 0.926090 0.041722 1
3 0.635846 0.439175 0.828787 0.714123 1
A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.
df.groupby('group').agg({'a':['sum', 'max'],
'b':'mean',
'c':'sum',
'd': lambda x: x.max() - x.min()})
a b c d
sum max mean sum <lambda>
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:
def max_min(x):
return x.max() - x.min()
max_min.__name__ = 'Max minus Min'
df.groupby('group').agg({'a':['sum', 'max'],
'b':'mean',
'c':'sum',
'd': max_min})
a b c d
sum max mean sum Max minus Min
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
Using apply and returning a Series
Now, if you had multiple columns that needed to interact together then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply the entire group as a DataFrame gets passed into the function.
I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:
def f(x):
d = {}
d['a_sum'] = x['a'].sum()
d['a_max'] = x['a'].max()
d['b_mean'] = x['b'].mean()
d['c_d_prodsum'] = (x['c'] * x['d']).sum()
return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])
df.groupby('group').apply(f)
a_sum a_max b_mean c_d_prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
If you are in love with MultiIndexes, you can still return a Series with one like this:
def f_mi(x):
d = []
d.append(x['a'].sum())
d.append(x['a'].max())
d.append(x['b'].mean())
d.append((x['c'] * x['d']).sum())
return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
['sum', 'max', 'mean', 'prodsum']])
df.groupby('group').apply(f_mi)
a b c_d
sum max mean prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
For the first part you can pass a dict of column names for keys and a list of functions for the values:
In [28]: df
Out[28]:
A B C D E GRP
0 0.395670 0.219560 0.600644 0.613445 0.242893 0
1 0.323911 0.464584 0.107215 0.204072 0.927325 0
2 0.321358 0.076037 0.166946 0.439661 0.914612 1
3 0.133466 0.447946 0.014815 0.130781 0.268290 1
In [26]: f = {'A':['sum','mean'], 'B':['prod']}
In [27]: df.groupby('GRP').agg(f)
Out[27]:
A B
sum mean prod
GRP
0 0.719580 0.359790 0.102004
1 0.454824 0.227412 0.034060
UPDATE 1:
Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.
Here's a hacky workaround:
In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}
In [69]: df.groupby('GRP').agg(f)
Out[69]:
A B D
sum mean prod <lambda>
GRP
0 0.719580 0.359790 0.102004 1.170219
1 0.454824 0.227412 0.034060 1.182901
Here, the resultant 'D' column is made up of the summed 'E' values.
UPDATE 2:
Here's a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.ix[] selects the current group from df. I then test if column C is less than 0.5. The returned boolean series is passed to g[] which selects only those rows meeting the criteria.
In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()
In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}
In [97]: df.groupby('GRP').agg(f)
Out[97]:
A B D
sum mean prod my name
GRP
0 0.719580 0.359790 0.102004 0.204072
1 0.454824 0.227412 0.034060 0.570441
Pandas >= 0.25.0, named aggregations
Since pandas version 0.25.0 or higher, we are moving away from the dictionary based aggregation and renaming, and moving towards named aggregations which accepts a tuple. Now we can simultaneously aggregate + rename to a more informative column name:
Example:
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
a b c d group
0 0.521279 0.914988 0.054057 0.125668 0
1 0.426058 0.828890 0.784093 0.446211 0
2 0.363136 0.843751 0.184967 0.467351 1
3 0.241012 0.470053 0.358018 0.525032 1
Apply GroupBy.agg with named aggregation:
df.groupby('group').agg(
a_sum=('a', 'sum'),
a_mean=('a', 'mean'),
b_mean=('b', 'mean'),
c_sum=('c', 'sum'),
d_range=('d', lambda x: x.max() - x.min())
)
a_sum a_mean b_mean c_sum d_range
group
0 0.947337 0.473668 0.871939 0.838150 0.320543
1 0.604149 0.302074 0.656902 0.542985 0.057681
As an alternative (mostly on aesthetics) to Ted Petrou's answer, I found I preferred a slightly more compact listing. Please don't consider accepting it, it's just a much-more-detailed comment on Ted's answer, plus code/data. Python/pandas is not my first/best, but I found this to read well:
df.groupby('group') \
.apply(lambda x: pd.Series({
'a_sum' : x['a'].sum(),
'a_max' : x['a'].max(),
'b_mean' : x['b'].mean(),
'c_d_prodsum' : (x['c'] * x['d']).sum()
})
)
a_sum a_max b_mean c_d_prodsum
group
0 0.530559 0.374540 0.553354 0.488525
1 1.433558 0.832443 0.460206 0.053313
I find it more reminiscent of dplyr pipes and data.table chained commands. Not to say they're better, just more familiar to me. (I certainly recognize the power and, for many, the preference of using more formalized def functions for these types of operations. This is just an alternative, not necessarily better.)
I generated data in the same manner as Ted, I'll add a seed for reproducibility.
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
a b c d group
0 0.374540 0.950714 0.731994 0.598658 0
1 0.156019 0.155995 0.058084 0.866176 0
2 0.601115 0.708073 0.020584 0.969910 1
3 0.832443 0.212339 0.181825 0.183405 1
New in version 0.25.0.
To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where
The keywords are the output column names
The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
>>> animals = pd.DataFrame({
... 'kind': ['cat', 'dog', 'cat', 'dog'],
... 'height': [9.1, 6.0, 9.5, 34.0],
... 'weight': [7.9, 7.5, 9.9, 198.0]
... })
>>> print(animals)
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
>>> print(
... animals
... .groupby('kind')
... .agg(
... min_height=pd.NamedAgg(column='height', aggfunc='min'),
... max_height=pd.NamedAgg(column='height', aggfunc='max'),
... average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
... )
... )
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
pandas.NamedAgg is just a namedtuple. Plain tuples are allowed as well.
>>> print(
... animals
... .groupby('kind')
... .agg(
... min_height=('height', 'min'),
... max_height=('height', 'max'),
... average_weight=('weight', np.mean),
... )
... )
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation functions requires additional arguments, partially apply them with functools.partial().
Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.
>>> print(
... animals
... .groupby('kind')
... .height
... .agg(
... min_height='min',
... max_height='max',
... )
... )
min_height max_height
kind
cat 9.1 9.5
dog 6.0 34.0
This is a twist on 'exans' answer that uses Named Aggregations. It's the same but with argument unpacking which allows you to still pass in a dictionary to the agg function.
The named aggs are a nice feature, but at first glance might seem hard to write programmatically since they use keywords, but it's actually simple with argument/keyword unpacking.
animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
'height': [9.1, 6.0, 9.5, 34.0],
'weight': [7.9, 7.5, 9.9, 198.0]})
agg_dict = {
"min_height": pd.NamedAgg(column='height', aggfunc='min'),
"max_height": pd.NamedAgg(column='height', aggfunc='max'),
"average_weight": pd.NamedAgg(column='weight', aggfunc=np.mean)
}
animals.groupby("kind").agg(**agg_dict)
The Result
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
Ted's answer is amazing. I ended up using a smaller version of that in case anyone is interested. Useful when you are looking for one aggregation that depends on values from multiple columns:
create a dataframe
df = pd.DataFrame({
'a': [1, 2, 3, 4, 5, 6],
'b': [1, 1, 0, 1, 1, 0],
'c': ['x', 'x', 'y', 'y', 'z', 'z']
})
print(df)
a b c
0 1 1 x
1 2 1 x
2 3 0 y
3 4 1 y
4 5 1 z
5 6 0 z
grouping and aggregating with apply (using multiple columns)
print(
df
.groupby('c')
.apply(lambda x: x['a'][(x['a'] > 1) & (x['b'] == 1)]
.mean()
)
c
x 2.0
y 4.0
z 5.0
grouping and aggregating with aggregate (using multiple columns)
I like this approach since I can still use aggregate. Perhaps people will let me know why apply is needed for getting at multiple columns when doing aggregations on groups.
It seems obvious now, but as long as you don't select the column of interest directly after the groupby, you will have access to all the columns of the dataframe from within your aggregation function.
only access to the selected column
df.groupby('c')['a'].aggregate(lambda x: x[x > 1].mean())
access to all columns since selection is after all the magic
df.groupby('c').aggregate(lambda x: x[(x['a'] > 1) & (x['b'] == 1)].mean())['a']
or similarly
df.groupby('c').aggregate(lambda x: x['a'][(x['a'] > 1) & (x['b'] == 1)].mean())
I hope this helps.

Compare Rows of Strings in Dataframe [duplicate]

The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:
In [563]: grouped['D'].agg({'result1' : np.sum,
.....: 'result2' : np.mean})
.....:
Out[563]:
result2 result1
A
bar -0.579846 -1.739537
foo -0.280588 -1.402938
However, this only works on a Series groupby object. And when a dict is similarly passed to a groupby DataFrame, it expects the keys to be the column names that the function will be applied to.
What I want to do is apply multiple functions to several columns (but certain columns will be operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column, and doing something like the code above, using lambdas for functions that depend on other rows. But this is taking a long time, (I think it takes a long time to iterate through a groupby object). I'll have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built in way in pandas to do this somewhat cleanly.
For example, I've tried something like
grouped.agg({'C_sum' : lambda x: x['C'].sum(),
'C_std': lambda x: x['C'].std(),
'D_sum' : lambda x: x['D'].sum()},
'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...)
but as expected I get a KeyError (since the keys have to be a column if agg is called from a DataFrame).
Is there any built in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?
The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.
If you desire to work with two separate columns at the same time I would suggest using the apply method which implicitly passes a DataFrame to the applied function. Let's use a similar dataframe as the one from above
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
a b c d group
0 0.418500 0.030955 0.874869 0.145641 0
1 0.446069 0.901153 0.095052 0.487040 0
2 0.843026 0.936169 0.926090 0.041722 1
3 0.635846 0.439175 0.828787 0.714123 1
A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.
df.groupby('group').agg({'a':['sum', 'max'],
'b':'mean',
'c':'sum',
'd': lambda x: x.max() - x.min()})
a b c d
sum max mean sum <lambda>
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:
def max_min(x):
return x.max() - x.min()
max_min.__name__ = 'Max minus Min'
df.groupby('group').agg({'a':['sum', 'max'],
'b':'mean',
'c':'sum',
'd': max_min})
a b c d
sum max mean sum Max minus Min
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
Using apply and returning a Series
Now, if you had multiple columns that needed to interact together then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply the entire group as a DataFrame gets passed into the function.
I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:
def f(x):
d = {}
d['a_sum'] = x['a'].sum()
d['a_max'] = x['a'].max()
d['b_mean'] = x['b'].mean()
d['c_d_prodsum'] = (x['c'] * x['d']).sum()
return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])
df.groupby('group').apply(f)
a_sum a_max b_mean c_d_prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
If you are in love with MultiIndexes, you can still return a Series with one like this:
def f_mi(x):
d = []
d.append(x['a'].sum())
d.append(x['a'].max())
d.append(x['b'].mean())
d.append((x['c'] * x['d']).sum())
return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
['sum', 'max', 'mean', 'prodsum']])
df.groupby('group').apply(f_mi)
a b c_d
sum max mean prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
For the first part you can pass a dict of column names for keys and a list of functions for the values:
In [28]: df
Out[28]:
A B C D E GRP
0 0.395670 0.219560 0.600644 0.613445 0.242893 0
1 0.323911 0.464584 0.107215 0.204072 0.927325 0
2 0.321358 0.076037 0.166946 0.439661 0.914612 1
3 0.133466 0.447946 0.014815 0.130781 0.268290 1
In [26]: f = {'A':['sum','mean'], 'B':['prod']}
In [27]: df.groupby('GRP').agg(f)
Out[27]:
A B
sum mean prod
GRP
0 0.719580 0.359790 0.102004
1 0.454824 0.227412 0.034060
UPDATE 1:
Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.
Here's a hacky workaround:
In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}
In [69]: df.groupby('GRP').agg(f)
Out[69]:
A B D
sum mean prod <lambda>
GRP
0 0.719580 0.359790 0.102004 1.170219
1 0.454824 0.227412 0.034060 1.182901
Here, the resultant 'D' column is made up of the summed 'E' values.
UPDATE 2:
Here's a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.ix[] selects the current group from df. I then test if column C is less than 0.5. The returned boolean series is passed to g[] which selects only those rows meeting the criteria.
In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()
In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}
In [97]: df.groupby('GRP').agg(f)
Out[97]:
A B D
sum mean prod my name
GRP
0 0.719580 0.359790 0.102004 0.204072
1 0.454824 0.227412 0.034060 0.570441
Pandas >= 0.25.0, named aggregations
Since pandas version 0.25.0 or higher, we are moving away from the dictionary based aggregation and renaming, and moving towards named aggregations which accepts a tuple. Now we can simultaneously aggregate + rename to a more informative column name:
Example:
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
a b c d group
0 0.521279 0.914988 0.054057 0.125668 0
1 0.426058 0.828890 0.784093 0.446211 0
2 0.363136 0.843751 0.184967 0.467351 1
3 0.241012 0.470053 0.358018 0.525032 1
Apply GroupBy.agg with named aggregation:
df.groupby('group').agg(
a_sum=('a', 'sum'),
a_mean=('a', 'mean'),
b_mean=('b', 'mean'),
c_sum=('c', 'sum'),
d_range=('d', lambda x: x.max() - x.min())
)
a_sum a_mean b_mean c_sum d_range
group
0 0.947337 0.473668 0.871939 0.838150 0.320543
1 0.604149 0.302074 0.656902 0.542985 0.057681
As an alternative (mostly on aesthetics) to Ted Petrou's answer, I found I preferred a slightly more compact listing. Please don't consider accepting it, it's just a much-more-detailed comment on Ted's answer, plus code/data. Python/pandas is not my first/best, but I found this to read well:
df.groupby('group') \
.apply(lambda x: pd.Series({
'a_sum' : x['a'].sum(),
'a_max' : x['a'].max(),
'b_mean' : x['b'].mean(),
'c_d_prodsum' : (x['c'] * x['d']).sum()
})
)
a_sum a_max b_mean c_d_prodsum
group
0 0.530559 0.374540 0.553354 0.488525
1 1.433558 0.832443 0.460206 0.053313
I find it more reminiscent of dplyr pipes and data.table chained commands. Not to say they're better, just more familiar to me. (I certainly recognize the power and, for many, the preference of using more formalized def functions for these types of operations. This is just an alternative, not necessarily better.)
I generated data in the same manner as Ted, I'll add a seed for reproducibility.
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
a b c d group
0 0.374540 0.950714 0.731994 0.598658 0
1 0.156019 0.155995 0.058084 0.866176 0
2 0.601115 0.708073 0.020584 0.969910 1
3 0.832443 0.212339 0.181825 0.183405 1
New in version 0.25.0.
To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where
The keywords are the output column names
The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
>>> animals = pd.DataFrame({
... 'kind': ['cat', 'dog', 'cat', 'dog'],
... 'height': [9.1, 6.0, 9.5, 34.0],
... 'weight': [7.9, 7.5, 9.9, 198.0]
... })
>>> print(animals)
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
>>> print(
... animals
... .groupby('kind')
... .agg(
... min_height=pd.NamedAgg(column='height', aggfunc='min'),
... max_height=pd.NamedAgg(column='height', aggfunc='max'),
... average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
... )
... )
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
pandas.NamedAgg is just a namedtuple. Plain tuples are allowed as well.
>>> print(
... animals
... .groupby('kind')
... .agg(
... min_height=('height', 'min'),
... max_height=('height', 'max'),
... average_weight=('weight', np.mean),
... )
... )
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation functions requires additional arguments, partially apply them with functools.partial().
Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.
>>> print(
... animals
... .groupby('kind')
... .height
... .agg(
... min_height='min',
... max_height='max',
... )
... )
min_height max_height
kind
cat 9.1 9.5
dog 6.0 34.0
This is a twist on 'exans' answer that uses Named Aggregations. It's the same but with argument unpacking which allows you to still pass in a dictionary to the agg function.
The named aggs are a nice feature, but at first glance might seem hard to write programmatically since they use keywords, but it's actually simple with argument/keyword unpacking.
animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
'height': [9.1, 6.0, 9.5, 34.0],
'weight': [7.9, 7.5, 9.9, 198.0]})
agg_dict = {
"min_height": pd.NamedAgg(column='height', aggfunc='min'),
"max_height": pd.NamedAgg(column='height', aggfunc='max'),
"average_weight": pd.NamedAgg(column='weight', aggfunc=np.mean)
}
animals.groupby("kind").agg(**agg_dict)
The Result
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
Ted's answer is amazing. I ended up using a smaller version of that in case anyone is interested. Useful when you are looking for one aggregation that depends on values from multiple columns:
create a dataframe
df = pd.DataFrame({
'a': [1, 2, 3, 4, 5, 6],
'b': [1, 1, 0, 1, 1, 0],
'c': ['x', 'x', 'y', 'y', 'z', 'z']
})
print(df)
a b c
0 1 1 x
1 2 1 x
2 3 0 y
3 4 1 y
4 5 1 z
5 6 0 z
grouping and aggregating with apply (using multiple columns)
print(
df
.groupby('c')
.apply(lambda x: x['a'][(x['a'] > 1) & (x['b'] == 1)]
.mean()
)
c
x 2.0
y 4.0
z 5.0
grouping and aggregating with aggregate (using multiple columns)
I like this approach since I can still use aggregate. Perhaps people will let me know why apply is needed for getting at multiple columns when doing aggregations on groups.
It seems obvious now, but as long as you don't select the column of interest directly after the groupby, you will have access to all the columns of the dataframe from within your aggregation function.
only access to the selected column
df.groupby('c')['a'].aggregate(lambda x: x[x > 1].mean())
access to all columns since selection is after all the magic
df.groupby('c').aggregate(lambda x: x[(x['a'] > 1) & (x['b'] == 1)].mean())['a']
or similarly
df.groupby('c').aggregate(lambda x: x['a'][(x['a'] > 1) & (x['b'] == 1)].mean())
I hope this helps.

How to apply a function on a column of a pandas dataframe? [duplicate]

I have a pandas dataframe with two columns. I need to change the values of the first column without affecting the second one and get back the whole dataframe with just first column values changed. How can I do that using apply() in pandas?
Given a sample dataframe df as:
a b
0 1 2
1 2 3
2 3 4
3 4 5
what you want is:
df['a'] = df['a'].apply(lambda x: x + 1)
that returns:
a b
0 2 2
1 3 3
2 4 4
3 5 5
For a single column better to use map(), like this:
df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])
a b c
0 15 15 5
1 20 10 7
2 25 30 9
df['a'] = df['a'].map(lambda a: a / 2.)
a b c
0 7.5 15 5
1 10.0 10 7
2 12.5 30 9
Given the following dataframe df and the function complex_function,
import pandas as pd
def complex_function(x, y=0):
if x > 5 and x > y:
return 1
else:
return 2
df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})
col1 col2
0 1 6
1 4 7
2 6 1
3 2 2
4 7 8
there are several solutions to use apply() on only one column. In the following I will explain them in detail.
I. Simple solution
The straightforward solution is the one from #Fabio Lamanna:
df['col1'] = df['col1'].apply(complex_function)
Output:
col1 col2
0 2 6
1 2 7
2 1 1
3 2 2
4 1 8
Only the first column is modified, the second column is unchanged. The solution is beautiful. It is just one line of code and it reads almost like english: "Take 'col1' and apply the function complex_function to it."
However, if you need data from another column, e.g. 'col2', it won't work. If you want to pass the values of 'col2' to variable y of the complex_function, you need something else.
II. Solution using the whole dataframe
Alternatively, you could use the whole dataframe as described in this SO post or this one:
df['col1'] = df.apply(lambda x: complex_function(x['col1']), axis=1)
or if you prefer (like me) a solution without a lambda function:
def apply_complex_function(x):
return complex_function(x['col1'])
df['col1'] = df.apply(apply_complex_function, axis=1)
There is a lot going on in this solution that needs to be explained. The apply() function works on pd.Series and pd.DataFrame. But you cannot use df['col1'] = df.apply(complex_function).loc[:, 'col1'], because it would throw a ValueError.
Hence, you need to give the information which column to use. To complicate things, the apply() function does only accept callables. To solve this, you need to define a (lambda) function with the column x['col1'] as argument; i.e. we wrap the column information in another function.
Unfortunately, the default value of the axis parameter is zero (axis=0), which means it will try executing column-wise and not row-wise. This wasn't a problem in the first solution, because we gave apply() a pd.Series. But now the input is a dataframe and we must be explicit (axis=1). (I marvel how often I forget this.)
Whether you prefer the version with the lambda function or without is subjective. In my opinion the line of code is complicated enough to read even without a lambda function thrown in. You only need the (lambda) function as a wrapper. It is just boilerplate code. A reader should not be bothered with it.
Now, you can modify this solution easily to take the second column into account:
def apply_complex_function(x):
return complex_function(x['col1'], x['col2'])
df['col1'] = df.apply(apply_complex_function, axis=1)
Output:
col1 col2
0 2 6
1 2 7
2 1 1
3 2 2
4 2 8
At index 4 the value has changed from 1 to 2, because the first condition 7 > 5 is true but the second condition 7 > 8 is false.
Note that you only needed to change the first line of code (i.e. the function) and not the second line.
Side note
Never put the column information into your function.
def bad_idea(x):
return x['col1'] ** 2
By doing this, you make a general function dependent on a column name! This is a bad idea, because the next time you want to use this function, you cannot. Worse: Maybe you rename a column in a different dataframe just to make it work with your existing function. (Been there, done that. It is a slippery slope!)
III. Alternative solutions without using apply()
Although the OP specifically asked for a solution with apply(), alternative solutions were suggested. For example, the answer of #George Petrov suggested to use map(); the answer of #Thibaut Dubernet proposed assign().
I fully agree that apply() is seldom the best solution, because apply() is not vectorized. It is an element-wise operation with expensive function calling and overhead from pd.Series.
One reason to use apply() is that you want to use an existing function and performance is not an issue. Or your function is so complex that no vectorized version exists.
Another reason to use apply() is in combination with groupby(). Please note that DataFrame.apply() and GroupBy.apply() are different functions.
So it does make sense to consider some alternatives:
map() only works on pd.Series, but accepts dict and pd.Series as input. Using map() with a function is almost interchangeable with using apply(). It can be faster than apply(). See this SO post for more details.
df['col1'] = df['col1'].map(complex_function)
applymap() is almost identical for dataframes. It does not support pd.Series and it will always return a dataframe. However, it can be faster. The documentation states: "In the current implementation applymap calls func twice on the first column/row to decide whether it can take a fast or slow code path.". But if performance really counts you should seek an alternative route.
df['col1'] = df.applymap(complex_function).loc[:, 'col1']
assign() is not a feasible replacement for apply(). It has a similar behaviour in only the most basic use cases. It does not work with the complex_function. You still need apply() as you can see in the example below. The main use case for assign() is method chaining, because it gives back the dataframe without changing the original dataframe.
df['col1'] = df.assign(col1=df.col1.apply(complex_function))
Annex: How to speed up apply()?
I only mention it here because it was suggested by other answers, e.g. #durjoy. The list is not exhaustive:
Do not use apply(). This is no joke. For most numeric operations, a vectorized method exists in pandas. If/else blocks can often be refactored with a combination of boolean indexing and .loc. My example complex_function could be refactored in this way.
Refactor to Cython. If you have a complex equation and the parameters of the equation are in your dataframe, this might be a good idea. Check out the official pandas user guide for more information.
Use raw=True parameter. Theoretically, this should improve the performance of apply() if you are just applying a NumPy reduction function, because the overhead of pd.Series is removed. Of course, your function has to accept an ndarray. You have to refactor your function to NumPy. By doing this, you will have a huge performance boost.
Use 3rd party packages. The first thing you should try is Numba. I do not know swifter mentioned by #durjoy; and probably many other packages are worth mentioning here.
Try/Fail/Repeat. As mentioned above, map() and applymap() can be faster - depending on the use case. Just time the different versions and choose the fastest. This approach is the most tedious one with the least performance increase.
You don't need a function at all. You can work on a whole column directly.
Example data:
>>> df = pd.DataFrame({'a': [100, 1000], 'b': [200, 2000], 'c': [300, 3000]})
>>> df
a b c
0 100 200 300
1 1000 2000 3000
Half all the values in column a:
>>> df.a = df.a / 2
>>> df
a b c
0 50 200 300
1 500 2000 3000
Although the given responses are correct, they modify the initial data frame, which is not always desirable (and, given the OP asked for examples "using apply", it might be they wanted a version that returns a new data frame, as apply does).
This is possible using assign: it is valid to assign to existing columns, as the documentation states (emphasis is mine):
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
In short:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])
In [3]: df.assign(a=lambda df: df.a / 2)
Out[3]:
a b c
0 7.5 15 5
1 10.0 10 7
2 12.5 30 9
In [4]: df
Out[4]:
a b c
0 15 15 5
1 20 10 7
2 25 30 9
Note that the function will be passed the whole dataframe, not only the column you want to modify, so you will need to make sure you select the right column in your lambda.
If you are really concerned about the execution speed of your apply function and you have a huge dataset to work on, you could use swifter to make faster execution, here is an example for swifter on pandas dataframe:
import pandas as pd
import swifter
def fnc(m):
return m*3+4
df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
# apply a self created function to a single column in pandas
df["y"] = df.m.swifter.apply(fnc)
This will enable your all CPU cores to compute the result hence it will be much faster than normal apply functions. Try and let me know if it become useful for you.
Let me try a complex computation using datetime and considering nulls or empty spaces. I am reducing 30 years on a datetime column and using apply method as well as lambda and converting datetime format. Line if x != '' else x will take care of all empty spaces or nulls accordingly.
df['Date'] = df['Date'].fillna('')
df['Date'] = df['Date'].apply(lambda x : ((datetime.datetime.strptime(str(x), '%m/%d/%Y') - datetime.timedelta(days=30*365)).strftime('%Y%m%d')) if x != '' else x)
Make a copy of your dataframe first if you need to modify a column
Many answers here suggest modifying some column and assign the new values to the old column. It is common to get the SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. warning. This happens when your dataframe was created from another dataframe but is not a proper copy.
To silence this warning, make a copy and assign back.
df = df.copy()
df['a'] = df['a'].apply('add', other=1)
apply() only needs the name of the function
You can invoke a function by simply passing its name to apply() (no need for lambda). If your function needs additional arguments, you can pass them either as keyword arguments or pass the positional arguments as args=. For example, suppose you have file paths in your dataframe and you need to read files in these paths.
def read_data(path, sep=',', usecols=[0]):
return pd.read_csv(path, sep=sep, usecols=usecols)
df = pd.DataFrame({'paths': ['../x/yz.txt', '../u/vw.txt']})
df['paths'].apply(read_data) # you don't need lambda
df['paths'].apply(read_data, args=(',', [0, 1])) # pass the positional arguments to `args=`
df['paths'].apply(read_data, sep=',', usecols=[0, 1]) # pass as keyword arguments
Don't apply a function, call the appropriate method directly
It's almost never ideal to apply a custom function on a column via apply(). Because apply() is a syntactic sugar for a Python loop with a pandas overhead, it's often slower than calling the same function in a list comprehension, never mind, calling optimized pandas methods. Almost all numeric operators can be directly applied on the column and there are corresponding methods for all of them.
# add 1 to every element in column `a`
df['a'] += 1
# for every row, subtract column `a` value from column `b` value
df['c'] = df['b'] - df['a']
If you want to apply a function that has if-else blocks, then you should probably be using numpy.where() or numpy.select() instead. It is much, much faster. If you have anything larger than 10k rows of data, you'll notice the difference right away.
For example, if you have a custom function similar to func() below, then instead of applying it on the column, you could operate directly on the columns and return values using numpy.select().
def func(row):
if row == 'a':
return 1
elif row == 'b':
return 2
else:
return -999
# instead of applying a `func` to each row of a column, use `numpy.select` as below
import numpy as np
conditions = [df['col'] == 'a', df['col'] == 'b']
choices = [1, 2]
df['new'] = np.select(conditions, choices, default=-999)
As you can see, numpy.select() has very minimal syntax difference from an if-else ladder; only need to separate conditions and choices into separate lists. For other options, check out this answer.

Pandas specify resulting column names when using agg without group by

I am trying to write a function that would summarize my pandas dataframe. This function should be able to do summarization by group as well as without grouping (depending on whether i specify group by parameter).
In order to be able to perform arbitrary aggregation functions, i pass a dictionary as an argument to pandas agg. The dictionary specifies what aggregations should be performed - and ideally it would specify what would be the names of the columns that come out as a result of my aggregation.
Unfortunately, while i can easily name all columns while using df.groupby([[).agg({}), i can't specify the column names in the same way while aggregating the whole dataframe (i.e. df.agg({}))
Example
Let's have a dataframe like this:
df = pd.DataFrame(
{
'my_grouping_col':['A', 'A', 'B', 'B', 'C', 'C'],
'my_value_col_1': [1,2,3,4,5,6],
'my_value_col_2': [7,8,9,10,11,12]
}
)
If i want to aggregate it by group while using some custom functions i could do:
df.groupby(['my_grouping_col']).agg(
{'my_value_col_1': ['min', 'max', ('Q1', lambda x: x.quantile(0.25))],
'my_value_col_2':['min', 'max', ('Q3', lambda x: x.quantile(0.75))]
}
)
Therefore i can use a tuple to specify the aggregation function as well as it's resulting column name.
Now i would like to be able to use the same syntax even without doing groupby:
df.agg(
{'my_value_col_1': ['min', 'max', ('Q1', lambda x: x.quantile(0.25))],
'my_value_col_2':['min', 'max', ('Q3', lambda x: x.quantile(0.75))]
}
)
But this gives me: AttributeError: 'Q1' is not a valid function for 'Series' object.
I can imagine two workarounds that i would rather not use:
Option 1:
Add a column with same value in all rows and then groupby. Afterwards reset index with drop=True to remove this column.
df['my_temporary_grouping_column'] = 1
df.groupby(['my_temporary_grouping_column']).agg(
{'my_value_col_1': ['min', 'max', ('Q1', lambda x: x.quantile(0.25))],
'my_value_col_2':['min', 'max', ('Q3', lambda x: x.quantile(0.75))]
}
).reset_index(drop = True)
Which gives the desired results:
my_value_col_1 my_value_col_2
min max Q1 min max Q3
my_grouping_col
A 1 2 1.25 7 8 7.75
B 3 4 3.25 9 10 9.75
C 5 6 5.25 11 12 11.75
Option 2
aggregate without groupby as:
df.agg(
{'my_value_col_1': ['min', 'max', lambda x: x.quantile(0.25)],
'my_value_col_2':['min', 'max', lambda x: x.quantile(0.75)]
}
)
And rename ... But i think the second option is really impractical.
Can i somehow change the dictionary specifying the details of the aggregation (including specifying column names) so that it works in both cases - with groupby and without? If not what would be a good way to specify the column names in agg without groupby? Note that i do not wish to use anything like: agg(Q3 = lambda x: x.quantile(0.75)). I would like to use the code within a function which takes as a parameter the details of what aggregations should be performed, such as:
def summarise(data, columns, group_cols, functions):
tdict = {}
[tdict.update({i.functions}) for i in columns]
if group_cols is not None:
out = data.groupby(group_cols).agg(tdict)
else:
data['temporary_group_col'] = 1
out = data.groupby(['temporary_group_col']).agg(tdict).reset_index(drop = True)
return out