Filtering a specific column in a pandas DataFrame using the 'filter' keyword and a lambda function

How can I use the 'filter' keyword and a lambda function to filter a specific column in a pandas DataFrame?
import pandas as pd

data = [
    {'Language': 'Python', 'Percent grow': 56},
    {'Language': 'Java', 'Percent grow': 34},
    {'Language': 'C', 'Percent grow': 25},
    {'Language': 'C++', 'Percent grow': 12},
    {'Language': 'go', 'Percent grow': 5},
]
df = pd.DataFrame(data)

f = lambda x: x['Percent grow'] > 30
df.filter(f)

IIUC, you want pandas.DataFrame.loc with boolean indexing:
df.loc[df["Percent grow"].gt(30)]
To do it with pandas.DataFrame.filter, one way would be:
(
    df
    .set_index("Percent grow")
    .pipe(lambda d: d.filter([i for i in d.index if i > 30], axis=0))
    .reset_index()
    .reindex(df.columns, axis=1)
)
Output:
Language Percent grow
0 Python 56
1 Java 34
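Note that pandas.DataFrame.filter only subsets rows or columns by their labels (via the items, like, or regex arguments), never by cell values, which is why the snippet above first moves the values into the index. A minimal sketch of plain label-based filtering on the same df (column names as defined above):
# keep only the columns whose name contains "grow"
df.filter(like="grow", axis=1)
# keep only the columns listed explicitly
df.filter(items=["Language"])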

Related

GroupBy Function Not Applying

I am trying to group by the following specializations, but I am not getting the expected result (or any result, for that matter). The data stays ungrouped even after this step. Any idea what's wrong with my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg:
import pandas as pd

df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi']
    ],
    columns=['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in:
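           spec
id
john   eng;math
kevin      math
nick        sci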
If you need to preserve the original number of rows, use transform, which returns a single column aligned with the original index:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in:
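      id  spec   type spec_grouped
0   john   eng  build     eng;math
1   john  math  build     eng;math
2  kevin  math    asp         math
3   nick   sci    spi          sci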

Conditional mapping among columns of two data frames with Pandas Data frame

I need your advice on how to map columns between two data frames.
I have put it in a simple way so that it's easier to understand:
df = dataframe
EXAMPLE:
df1 = pd.DataFrame({
    "X": [],
    "Y": [],
    "Z": []
})
df2 = pd.DataFrame({
    "A": ['', '', 'A1'],
    "C": ['', '', 'C1'],
    "D": ['D1', 'Other', 'D3'],
    "F": ['', '', ''],
    "G": ['G1', '', 'G3'],
    "H": ['H1', 'H2', 'H3']
})
Requirement:
1st step:
Fill the X column of df1 from columns A, C, D of df2, in that order; the search stops at the first non-empty value and selects it.
2nd step:
If the selected value is "Other", then the X column of df1 should instead be filled from columns F, G, and H, in that order, until a non-empty value is found.
Result:
X
0 D1
1 H2
2 A1
Thank you so much in advance
Try this:
def first_non_empty(df, cols):
    """Return the first non-empty, non-null value among the specified columns, per row."""
    return df[cols].replace('', pd.NA).bfill(axis=1).iloc[:, 0]

col_x = first_non_empty(df2, ['A', 'C', 'D'])
col_x = col_x.mask(col_x == 'Other', first_non_empty(df2, ['F', 'G', 'H']))
df1['X'] = col_x
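With the df2 above, first_non_empty(df2, ['A', 'C', 'D']) yields D1, Other, A1; after the mask step col_x is D1, H2, A1, which matches the expected result.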

Pandas groupby multi conditions and date difference calculation

I am stuck on which method to use. I have the following dataframe:
df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')
df
I need to:
group by the same 'CODE',
check if the 'DESC' is not the same,
check if the 'TYPE' is the same,
calculate the month difference between the dates that satisfy the previous two conditions.
The expected output is below:
The following code uses .drop_duplicates() and .duplicated() to keep or throw out rows of your dataframe that have duplicate values.
How would you calculate a difference in months? A month can be 28, 30 or 31 days long. You could divide the end result by 30 to get an approximate number of months (a rough conversion is sketched after the output below), so I kept it in days for now.
import pandas as pd
df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')
# only keep rows that have the same code and type
df = df[df.duplicated(subset=['CODE', 'TYPE'], keep=False)]
# throw out rows that have the same code and desc
df = df.drop_duplicates(subset=['CODE', 'DESC'], keep=False)
# find previous date
df = df.sort_values(by=['CODE', 'DATE'])
df['previous_date'] = df.groupby('CODE')['DATE'].transform('shift')
# drop rows that don't have a previous date
df = df.dropna()
# calculate the difference between current date and previous date
df['difference_in_dates'] = (df['DATE'] - df['previous_date'])
This results in the following df:
CODE DATE TYPE DESC previous_date difference_in_dates
AACCBD 2020-07-21 PUB OK 2020-07-16 5 days
BBLGLC70M 2019-09-25 PRI OK 2019-05-16 132 days
BCCDN 2020-02-27 PUB OK 2020-02-13 14 days
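Following the note above about dividing by 30, a minimal sketch of the rough month conversion (assuming a 30-day month; the column name difference_in_dates comes from the code above):
# approximate number of months, assuming 30 days per month
df['approx_months'] = df['difference_in_dates'].dt.days / 30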

Predefining pandas aggregations and rename with new version

In previous versions of pandas, you could do:
aggregations = {
    'Col1': {
        'SUM_name': 'sum',
        'MEAN_name': 'mean',
        'MAX_name': 'max',
        'MIN_name': 'min'
    },
    'Other colname': {
        'MEAN_newname': 'mean',
        'MED_newname': 'median',
        'MAX_newname': 'max',
        'MIN_newname': 'min'
    },
}
agg_df = df[df['somecol'] <= 0].groupby(['gbcol']).agg(aggregations)
This is deprecated as of 0.20. What is the equivalent of this form of aggregation in the new version?
An alternative is named aggregation (available since pandas 0.25):
aggregations = {
    'SUM_name': ('Col1', 'sum'),
    'MEAN_name': ('Col1', 'mean'),
    'MEAN_newname': ('Other colname', 'mean')
}
agg_df = df.groupby(['gbcol']).agg(**aggregations)
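The same mapping can also be spelled with pd.NamedAgg (pandas 0.25+); the column names here are the ones from the question:
agg_df = df.groupby(['gbcol']).agg(
    SUM_name=pd.NamedAgg(column='Col1', aggfunc='sum'),
    MEAN_name=pd.NamedAgg(column='Col1', aggfunc='mean'),
    MEAN_newname=pd.NamedAgg(column='Other colname', aggfunc='mean'),
)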

pandas: how to check for nulls in a float column?

I am conditionally assigning a column based on whether another column is null:
df = pd.DataFrame([
    {'stripe_subscription_id': 1, 'status': 'past_due'},
    {'stripe_subscription_id': 2, 'status': 'active'},
    {'stripe_subscription_id': None, 'status': 'active'},
    {'stripe_subscription_id': None, 'status': 'active'},
])

def get_cancellation_type(row):
    if row.stripe_subscription_id:
        if row.status == 'past_due':
            return 'failed_to_pay'
        elif row.status == 'active':
            return 'cancelled_by_us'
    else:
        return 'cancelled_by_user'
df['cancellation_type'] = df.apply(get_cancellation_type, axis=1)
df
But I don't get the results I expect:
I would expect the final two rows to read cancelled_by_user, because the stripe_subscription_id column is null.
If I amend the function:
def get_cancellation_type(row):
    if row.stripe_subscription_id.isnull():
Then I get an error: AttributeError: ("'float' object has no attribute 'isnull'", 'occurred at index 0'). What am I doing wrong?
With pandas and numpy we rarely have to write our own row-wise functions, especially since such functions are slow: they are not vectorized, while pandas and numpy provide a rich pool of vectorized methods. (The reason the original function misbehaves is that a missing stripe_subscription_id becomes NaN, a float that is truthy in Python, so the outer if branch is still taken.)
In this case you are looking for np.select, since you want to create a column based on multiple conditions:
import numpy as np

conditions = [
    df['stripe_subscription_id'].notna() & df['status'].eq('past_due'),
    df['stripe_subscription_id'].notna() & df['status'].eq('active')
]
choices = ['failed_to_pay', 'cancelled_by_us']
df['cancellation_type'] = np.select(conditions, choices, default='cancelled_by_user')
status stripe_subscription_id cancellation_type
0 past_due 1.0 failed_to_pay
1 active 2.0 cancelled_by_us
2 active NaN cancelled_by_user
3 active NaN cancelled_by_user
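For just two non-default outcomes, chained Series.mask calls would also work, though np.select scales better as conditions are added. A minimal sketch of that alternative (not part of the original answer; same column names as above):
has_id = df['stripe_subscription_id'].notna()
df['cancellation_type'] = (
    pd.Series('cancelled_by_user', index=df.index)
    .mask(has_id & df['status'].eq('past_due'), 'failed_to_pay')
    .mask(has_id & df['status'].eq('active'), 'cancelled_by_us')
)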