Predefining pandas aggregations and rename with new version

In previous versions of pandas, you could do:
aggregations = {
    'Col1': {
        'SUM_name': 'sum',
        'MEAN_name': 'mean',
        'MAX_name': 'max',
        'MIN_name': 'min'
    },
    'Other colname': {
        'MEAN_newname': 'mean',
        'MED_newname': 'median',
        'MAX_newname': 'max',
        'MIN_newname': 'min'
    },
}
agg_df = df[df['somecol'] <= 0].groupby(['gbcol']).agg(aggregations)
This dict-of-dicts form is deprecated as of 0.20. What is the equivalent of this form of aggregation in newer versions?

An alternative is named aggregation, available from pandas 0.25:
aggregations = {
    'SUM_name': ('Col1', 'sum'),
    'MEAN_name': ('Col1', 'mean'),
    'MEAN_newname': ('Other colname', 'mean')
}
agg_df = df[df['somecol'] <= 0].groupby(['gbcol']).agg(**aggregations)
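The same call can be spelled more explicitly with pd.NamedAgg; a minimal sketch of the equivalent, reusing the column and group names from the question:
# each output column is named by the keyword, built from (column, aggfunc)
agg_df = df[df['somecol'] <= 0].groupby(['gbcol']).agg(
    SUM_name=pd.NamedAgg(column='Col1', aggfunc='sum'),
    MEAN_name=pd.NamedAgg(column='Col1', aggfunc='mean'),
    MEAN_newname=pd.NamedAgg(column='Other colname', aggfunc='mean'),
)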


Filtering a specific column in a pandas DataFrame using the 'filter' keyword and a lambda function

How can I use the 'filter' keyword and a lambda function to filter a specific column in a pandas DataFrame?
import pandas as pd

data = [
    {'Language': 'Python', 'Percent grow': 56},
    {'Language': 'Java', 'Percent grow': 34},
    {'Language': 'C', 'Percent grow': 25},
    {'Language': 'C++', 'Percent grow': 12},
    {'Language': 'go', 'Percent grow': 5}
]
df = pd.DataFrame(data)

f = lambda x: x['Percent grow'] > 30
df.filter(f)
IIUC, you want pandas.DataFrame.loc with boolean indexing:
df.loc[df["Percent grow"].gt(30)]
Note that DataFrame.filter selects on index/column labels, not on cell values, so it does not accept a row predicate like your lambda. If you really want to go through pandas.DataFrame.filter, one way is:
(
    df
    .set_index("Percent grow")
    .pipe(lambda d: d.filter([v for v in d.index if v > 30], axis=0))
    .reset_index()
    .reindex(df.columns, axis=1)
)
Output:
  Language  Percent grow
0   Python            56
1     Java            34
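As an aside (not part of the original answer): filter is designed for label-based selection, where no workaround is needed, for example:
# select columns whose label matches a pattern, filter's intended use
df.filter(like='grow', axis=1)   # keeps the 'Percent grow' column
df.filter(regex='^L', axis=1)    # keeps the 'Language' column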

GroupBy Function Not Applying

I am trying to groupby for the following specializations but I am not getting the expected result (or any for that matter). The data stays ungrouped even after this step. Any idea what's wrong in my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
The culprit is str(x): it stringifies the whole group Series, so join iterates over its characters; you would need '; '.join(map(str, x)). Also, transform returns one value per input row, which is why the data still looks ungrouped. Please try using agg:
import pandas as pd

df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi']
    ],
    columns=['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in:
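           spec
id
john   eng;math
kevin      math
nick        sci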
If you need to preserve the original number of rows, use transform, which returns one value per row:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in:
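      id  spec   type spec_grouped
0   john   eng  build     eng;math
1   john  math  build     eng;math
2  kevin  math    asp         math
3   nick   sci    spi          sci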

Pandas groupby multi conditions and date difference calculation

I am stuck understanding the method to use. I have the following dataframe:
df = {
    'CODE': ['BBLGLC70M', 'BBLGLC70M', 'ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
    'DATE': ['16/05/2019', '25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
    'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
    'DESC': ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format='%d/%m/%Y')
df
I need to:
- group by the same 'CODE',
- check whether 'DESC' is not the same,
- check whether 'TYPE' is the same,
- calculate the month difference between the dates that satisfy the previous two conditions.
The expected output is the following:
The following code uses .duplicated() and .drop_duplicates() to keep or throw out rows of your dataframe that have duplicate values.
How would you calculate a difference in months? A month can be 28, 30 or 31 days. You could divide the end result by 30 to get an indication of the number of months of difference, so I kept it in days for now.
import pandas as pd

df = {
    'CODE': ['BBLGLC70M', 'BBLGLC70M', 'ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
    'DATE': ['16/05/2019', '25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
    'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
    'DESC': ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format='%d/%m/%Y')

# only keep rows that share code and type with another row
df = df[df.duplicated(subset=['CODE', 'TYPE'], keep=False)]

# throw out rows that have the same code and desc
df = df.drop_duplicates(subset=['CODE', 'DESC'], keep=False)

# find the previous date within each code
df = df.sort_values(by=['CODE', 'DATE'])
df['previous_date'] = df.groupby('CODE')['DATE'].transform('shift')

# drop rows that don't have a previous date
df = df.dropna()

# calculate the difference between the current date and the previous date
df['difference_in_dates'] = df['DATE'] - df['previous_date']
This results in the following df:
CODE       DATE        TYPE  DESC  previous_date  difference_in_dates
AACCBD     2020-07-21  PUB   OK    2020-07-16     5 days
BBLGLC70M  2019-09-25  PRI   OK    2019-05-16     132 days
BCCDN      2020-02-27  PUB   OK    2020-02-13     14 days
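If you do want an approximate month count anyway, one rough option (an approximation, since months vary between 28 and 31 days, as noted above) is to divide the day difference by 30:
# rough month count: day difference divided by 30, an indication only
df['approx_months'] = df['difference_in_dates'].dt.days / 30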

pandas: how to assign a single column conditionally on multiple other columns?

I'm confused about conditional assignment in Pandas.
I have this dataframe:
df = pd.DataFrame([
    {'stripe_subscription_id': 1, 'status': 'past_due'},
    {'stripe_subscription_id': 2, 'status': 'active'},
    {'stripe_subscription_id': None, 'status': 'active'},
    {'stripe_subscription_id': None, 'status': 'active'},
])
I'm trying to add a new column, conditionally based on the others:
def get_cancellation_type(row):
    if row.stripe_subscription_id:
        if row.status == 'past_due':
            return 'failed_to_pay'
        elif row.status == 'active':
            return 'cancelled_by_us'
    else:
        return 'cancelled_by_user'

df['cancellation_type'] = df.apply(get_cancellation_type, axis=1)
This is fairly readable, but is it the standard way to do things?
I've been looking at pd.assign, and am not sure if I should be using that instead.
This should work; you can change or add the conditions however you want. One caveat: comparing against np.nan with == or != never does what you want, because NaN compares unequal to everything including itself (so df[col] != np.nan is always True). Use .notna()/.isna() instead:
df.loc[df['stripe_subscription_id'].notna() & (df['status'] == 'past_due'), 'cancellation_type'] = 'failed_to_pay'
df.loc[df['stripe_subscription_id'].notna() & (df['status'] == 'active'), 'cancellation_type'] = 'cancelled_by_us'
df.loc[df['stripe_subscription_id'].isna(), 'cancellation_type'] = 'cancelled_by_user'
You might consider using np.select (with a .notna() guard added so that rows without a stripe_subscription_id fall through to the default, matching your function):
import pandas as pd
import numpy as np

condList = [
    df['stripe_subscription_id'].notna() & (df['status'] == 'past_due'),
    df['stripe_subscription_id'].notna() & (df['status'] == 'active')
]
choiceList = ['failed_to_pay', 'cancelled_by_us']
df['cancellation_type'] = np.select(condList, choiceList, default='cancelled_by_user')
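With the sample dataframe above, either approach should yield something like (column order may vary by pandas version):
     status  stripe_subscription_id  cancellation_type
0  past_due                     1.0      failed_to_pay
1    active                     2.0    cancelled_by_us
2    active                     NaN  cancelled_by_user
3    active                     NaN  cancelled_by_user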

pandas: how to check for nulls in a float column?

I am conditionally assigning a column based on whether another column is null:
df = pd.DataFrame([
    {'stripe_subscription_id': 1, 'status': 'past_due'},
    {'stripe_subscription_id': 2, 'status': 'active'},
    {'stripe_subscription_id': None, 'status': 'active'},
    {'stripe_subscription_id': None, 'status': 'active'},
])
def get_cancellation_type(row):
    if row.stripe_subscription_id:
        if row.status == 'past_due':
            return 'failed_to_pay'
        elif row.status == 'active':
            return 'cancelled_by_us'
    else:
        return 'cancelled_by_user'

df['cancellation_type'] = df.apply(get_cancellation_type, axis=1)
df
But I don't get the results I expect:
I would expect the final two rows to read cancelled_by_user, because the stripe_subscription_id column is null.
If I amend the function:
def get_cancellation_type(row):
    if row.stripe_subscription_id.isnull():
Then I get an error: AttributeError: ("'float' object has no attribute 'isnull'", 'occurred at index 0'). What am I doing wrong?
Two things are going on here: in a float column, None is stored as NaN, and NaN is truthy, so if row.stripe_subscription_id: never reaches the else branch; and a plain float has no .isnull() method, which is exactly what the AttributeError says (pd.isna(row.stripe_subscription_id) would work). More generally, with pandas and numpy we rarely have to write our own row-wise functions, since they are slow (not vectorized) and pandas + numpy provide a rich pool of vectorized methods.
In this case you are looking for np.select, since you want to create a column based on multiple conditions:
import numpy as np

conditions = [
    df['stripe_subscription_id'].notna() & df['status'].eq('past_due'),
    df['stripe_subscription_id'].notna() & df['status'].eq('active')
]
choices = ['failed_to_pay', 'cancelled_by_us']

df['cancellation_type'] = np.select(conditions, choices, default='cancelled_by_user')
     status  stripe_subscription_id  cancellation_type
0  past_due                     1.0      failed_to_pay
1    active                     2.0    cancelled_by_us
2    active                     NaN  cancelled_by_user
3    active                     NaN  cancelled_by_user