Filtering a specific column in a pandas DataFrame using the 'filter' keyword and a lambda function

How can I use the 'filter' keyword and a lambda function to filter a specific column in a pandas DataFrame?
import pandas as pd

data = [
    {'Language': 'Python', 'Percent grow': 56},
    {'Language': 'Java', 'Percent grow': 34},
    {'Language': 'C', 'Percent grow': 25},
    {'Language': 'C++', 'Percent grow': 12},
    {'Language': 'go', 'Percent grow': 5},
]
df = pd.DataFrame(data)

f = lambda x: x['Percent grow'] > 30
df.filter(f)

IIUC, you want pandas.DataFrame.loc with boolean indexing:
df.loc[df["Percent grow"].gt(30)]
To do it with pandas.DataFrame.filter, one way would be:
(
    df
    .set_index("Percent grow")
    .pipe(lambda d: d.filter([i for i in d.index if i > 30], axis=0))
    .reset_index()
    .reindex(df.columns, axis=1)
)
Output:
Language Percent grow
0 Python 56
1 Java 34
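Note that pandas.DataFrame.filter only subsets rows or columns by their labels (via the items, like, or regex arguments), never by cell values, which is why the snippet above first moves the values into the index. A minimal sketch of plain label-based filtering on the same df (column names as defined above):
# keep only the columns whose name contains "grow"
df.filter(like="grow", axis=1)
# keep only the columns listed explicitly
df.filter(items=["Language"])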

Related

GroupBy Function Not Applying

I am trying to group by the following specializations, but I am not getting the expected result (or any result, for that matter). The data stays ungrouped even after this step. Any idea what's wrong with my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg:
import pandas as pd

df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi']
    ],
    columns=['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in:
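           spec
id
john   eng;math
kevin      math
nick        sci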
If you need to preserve the original number of rows, use transform, which returns a single column aligned with the original index:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in:
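      id  spec   type spec_grouped
0   john   eng  build     eng;math
1   john  math  build     eng;math
2  kevin  math    asp         math
3   nick   sci    spi          sci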

Conditional mapping among columns of two data frames with Pandas Data frame

I need your advice on how to map columns between two data frames.
I have put it in a simple way so that it's easier to understand:
df = dataframe
EXAMPLE:
df1 = pd.DataFrame({
    "X": [],
    "Y": [],
    "Z": []
})
df2 = pd.DataFrame({
    "A": ['', '', 'A1'],
    "C": ['', '', 'C1'],
    "D": ['D1', 'Other', 'D3'],
    "F": ['', '', ''],
    "G": ['G1', '', 'G3'],
    "H": ['H1', 'H2', 'H3']
})
Requirement:
1st step:
Fill the X column of df1 from columns A, C, D of df2, in that order; the search stops at the first non-empty value and selects it.
2nd step:
If the selected value is "Other", then the X column of df1 should instead be filled from columns F, G, and H, in that order, until a non-empty value is found.
Result:
X
0 D1
1 H2
2 A1
Thank you so much in advance
Try this:
def first_non_empty(df, cols):
    """Return the first non-empty, non-null value among the specified columns, per row."""
    return df[cols].replace('', pd.NA).bfill(axis=1).iloc[:, 0]

col_x = first_non_empty(df2, ['A', 'C', 'D'])
col_x = col_x.mask(col_x == 'Other', first_non_empty(df2, ['F', 'G', 'H']))
df1['X'] = col_x
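With the df2 above, first_non_empty(df2, ['A', 'C', 'D']) yields D1, Other, A1; after the mask step col_x is D1, H2, A1, which matches the expected result.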

Pandas groupby multi conditions and date difference calculation

I am stuck on which method to use. I have the following dataframe:
df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')
df
I need to:
group by the same 'CODE',
check if the 'DESC' is not the same,
check if the 'TYPE' is the same,
calculate the month difference between the dates that satisfy the previous two conditions.
The expected output is below:
The following code uses .drop_duplicates() and .duplicated() to keep or throw out rows of your dataframe that have duplicate values.
How would you calculate a difference in months? A month can be 28, 30 or 31 days long. You could divide the end result by 30 to get an approximate number of months (a rough conversion is sketched after the output below), so I kept it in days for now.
import pandas as pd
df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')
# only keep rows that have the same code and type
df = df[df.duplicated(subset=['CODE', 'TYPE'], keep=False)]
# throw out rows that have the same code and desc
df = df.drop_duplicates(subset=['CODE', 'DESC'], keep=False)
# find previous date
df = df.sort_values(by=['CODE', 'DATE'])
df['previous_date'] = df.groupby('CODE')['DATE'].transform('shift')
# drop rows that don't have a previous date
df = df.dropna()
# calculate the difference between current date and previous date
df['difference_in_dates'] = (df['DATE'] - df['previous_date'])
This results in the following df:
CODE DATE TYPE DESC previous_date difference_in_dates
AACCBD 2020-07-21 PUB OK 2020-07-16 5 days
BBLGLC70M 2019-09-25 PRI OK 2019-05-16 132 days
BCCDN 2020-02-27 PUB OK 2020-02-13 14 days
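Following the note above about dividing by 30, a minimal sketch of the rough month conversion (assuming a 30-day month; the column name difference_in_dates comes from the code above):
# approximate number of months, assuming 30 days per month
df['approx_months'] = df['difference_in_dates'].dt.days / 30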

Predefining pandas aggregations and rename with new version

In previous versions of pandas, you could do:
aggregations = {
    'Col1': {
        'SUM_name': 'sum',
        'MEAN_name': 'mean',
        'MAX_name': 'max',
        'MIN_name': 'min'
    },
    'Other colname': {
        'MEAN_newname': 'mean',
        'MED_newname': 'median',
        'MAX_newname': 'max',
        'MIN_newname': 'min'
    },
}
agg_df = df[df['somecol'] <= 0].groupby(['gbcol']).agg(aggregations)
This is deprecated as of 0.20. What is the equivalent of this form of aggregation in the new version?
An alternative is named aggregation (available since pandas 0.25):
aggregations = {
    'SUM_name': ('Col1', 'sum'),
    'MEAN_name': ('Col1', 'mean'),
    'MEAN_newname': ('Other colname', 'mean')
}
agg_df = df.groupby(['gbcol']).agg(**aggregations)
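The same mapping can also be spelled with pd.NamedAgg (pandas 0.25+); the column names here are the ones from the question:
agg_df = df.groupby(['gbcol']).agg(
    SUM_name=pd.NamedAgg(column='Col1', aggfunc='sum'),
    MEAN_name=pd.NamedAgg(column='Col1', aggfunc='mean'),
    MEAN_newname=pd.NamedAgg(column='Other colname', aggfunc='mean'),
)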

pandas: how to check for nulls in a float column?

I am conditionally assigning a column based on whether another column is null:
df = pd.DataFrame([
    {'stripe_subscription_id': 1, 'status': 'past_due'},
    {'stripe_subscription_id': 2, 'status': 'active'},
    {'stripe_subscription_id': None, 'status': 'active'},
    {'stripe_subscription_id': None, 'status': 'active'},
])

def get_cancellation_type(row):
    if row.stripe_subscription_id:
        if row.status == 'past_due':
            return 'failed_to_pay'
        elif row.status == 'active':
            return 'cancelled_by_us'
    else:
        return 'cancelled_by_user'
df['cancellation_type'] = df.apply(get_cancellation_type, axis=1)
df
But I don't get the results I expect:
I would expect the final two rows to read cancelled_by_user, because the stripe_subscription_id column is null.
If I amend the function:
def get_cancellation_type(row):
    if row.stripe_subscription_id.isnull():
Then I get an error: AttributeError: ("'float' object has no attribute 'isnull'", 'occurred at index 0'). What am I doing wrong?
With pandas and numpy we rarely have to write our own row-wise functions, especially since such functions are slow: they are not vectorized, while pandas and numpy provide a rich pool of vectorized methods. (The reason the original function misbehaves is that a missing stripe_subscription_id becomes NaN, a float that is truthy in Python, so the outer if branch is still taken.)
In this case you are looking for np.select, since you want to create a column based on multiple conditions:
import numpy as np

conditions = [
    df['stripe_subscription_id'].notna() & df['status'].eq('past_due'),
    df['stripe_subscription_id'].notna() & df['status'].eq('active')
]
choices = ['failed_to_pay', 'cancelled_by_us']
df['cancellation_type'] = np.select(conditions, choices, default='cancelled_by_user')
status stripe_subscription_id cancellation_type
0 past_due 1.0 failed_to_pay
1 active 2.0 cancelled_by_us
2 active NaN cancelled_by_user
3 active NaN cancelled_by_user
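For just two non-default outcomes, chained Series.mask calls would also work, though np.select scales better as conditions are added. A minimal sketch of that alternative (not part of the original answer; same column names as above):
has_id = df['stripe_subscription_id'].notna()
df['cancellation_type'] = (
    pd.Series('cancelled_by_user', index=df.index)
    .mask(has_id & df['status'].eq('past_due'), 'failed_to_pay')
    .mask(has_id & df['status'].eq('active'), 'cancelled_by_us')
)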