pandas: how to check for nulls in a float column?

I am conditionally assigning a column based on whether another column is null:
import pandas as pd

df = pd.DataFrame([
    { 'stripe_subscription_id': 1, 'status': 'past_due' },
    { 'stripe_subscription_id': 2, 'status': 'active' },
    { 'stripe_subscription_id': None, 'status': 'active' },
    { 'stripe_subscription_id': None, 'status': 'active' },
])
def get_cancellation_type(row):
    if row.stripe_subscription_id:
        if row.status == 'past_due':
            return 'failed_to_pay'
        elif row.status == 'active':
            return 'cancelled_by_us'
    else:
        return 'cancelled_by_user'
df['cancellation_type'] = df.apply(get_cancellation_type, axis=1)
df
But I don't get the results I expect:
I would expect the final two rows to read cancelled_by_user, because the stripe_subscription_id column is null.
If I amend the function:
def get_cancellation_type(row):
    if row.stripe_subscription_id.isnull():
Then I get an error: AttributeError: ("'float' object has no attribute 'isnull'", 'occurred at index 0'). What am I doing wrong?
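For reference, pd.isna() accepts scalar values, so a minimal fix that keeps the row-wise approach (a sketch for comparison only, reusing the same function and column names as above) would be:
import pandas as pd

def get_cancellation_type(row):
    # pd.isna() handles scalar None/NaN, unlike the Series method .isnull()
    if pd.isna(row.stripe_subscription_id):
        return 'cancelled_by_user'
    if row.status == 'past_due':
        return 'failed_to_pay'
    elif row.status == 'active':
        return 'cancelled_by_us'

df['cancellation_type'] = df.apply(get_cancellation_type, axis=1)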

With pandas and numpy we rarely have to write our own row-wise functions: they are slow because they are not vectorized, while pandas and numpy provide a rich pool of vectorized methods for us.
In this case you are looking for np.select, since you want to create a column based on multiple conditions:
import numpy as np

conditions = [
    df['stripe_subscription_id'].notna() & df['status'].eq('past_due'),
    df['stripe_subscription_id'].notna() & df['status'].eq('active')
]
choices = ['failed_to_pay', 'cancelled_by_us']
df['cancellation_type'] = np.select(conditions, choices, default='cancelled_by_user')
     status  stripe_subscription_id  cancellation_type
0  past_due                     1.0      failed_to_pay
1    active                     2.0    cancelled_by_us
2    active                     NaN  cancelled_by_user
3    active                     NaN  cancelled_by_user
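np.select evaluates the conditions in order and falls back to the default for any row where none of them match, which is what routes the rows with a null stripe_subscription_id to cancelled_by_user here.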

Related

concatenate values in dataframe if a column has specific values and None or Null values

I have a dataframe with name plus address or email information, depending on the type. Based on the type I want to concat name+address or name+email into a new column (concat_name) within the dataframe. Some of the types are null and are causing ambiguity errors; identifying the nulls correctly in place is where I'm having trouble.
NULL = None
data = {
    'Type': [NULL, 'MasterCard', 'Visa', 'Amex'],
    'Name': ['Chris', 'John', 'Jill', 'Mary'],
    'City': ['Tustin', 'Cleveland', NULL, NULL],
    'Email': [NULL, NULL, 'jdoe#yahoo.com', 'mdoe#aol.com']
}
df_data = pd.DataFrame(data)
# Expected resulting df column:
df_data['concat_name'] = ['ChrisTustin', 'JohnCleveland', 'Jilljdoe#yahoo.com', 'Marymdoe#aol.com']
Attempt one using booleans
if df_data['Type'].isnull() | (df_data['Type'] == 'Mastercard'):
    df_data['concat_name'] = df_data['Name'] + df_data['City']
if (df_data['Type'] == 'Visa') | (df_data['Type'] == 'Amex'):
    df_data['concat_name'] = df_data['Name'] + df_data['Email']
else:
    df_data['concat_name'] = 'Error'
Error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Attempt two using np.where
df_data['concat_name'] = np.where((df_data['Type'].isna()|(df_data['Type']=='MasterCard'),df_data['Name']+df_data['City'],
np.where((df_data['Type']=="Visa")|(df_data['Type]=="Amex"),df_data['Name']+df_data['Email'], 'Error'
Error
ValueError: Length of values(2) does not match length of index(12000)
Does the following code solve your use case?
# == Imports needed ===========================
import pandas as pd
import numpy as np
# == Example Dataframe =========================
df_data = pd.DataFrame(
{
"Type": [None, "MasterCard", "Visa", "Amex"],
"Name": ["Chris", "John", "Jill", "Mary"],
"City": ["Tustin", "Cleveland", None, None],
"Email": [None, None, "jdoe#yahoo.com", "mdoe#aol.com"],
# Expected output:
"concat_name": [
"ChrisTustin",
"JohnCleveland",
"Jilljdoe#yahoo.com",
"Marymdoe#aol.com",
],
}
)
# == Solution Implementation ====================
df_data["concat_name2"] = np.where(
(df_data["Type"].isin(["MasterCard", pd.NA, None])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["City"].astype(str).replace("None", ""),
np.where(
(df_data["Type"].isin(["Visa", "Amex"])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["Email"].astype(str).replace("None", ""),
"Error",
),
)
# == Expected Output ============================
print(df_data)
# Prints:
# Type Name City Email concat_name concat_name2
# 0 None Chris Tustin None ChrisTustin ChrisTustin
# 1 MasterCard John Cleveland None JohnCleveland JohnCleveland
# 2 Visa Jill None jdoe#yahoo.com Jilljdoe#yahoo.com Jilljdoe#yahoo.com
# 3 Amex Mary None mdoe#aol.com Marymdoe#aol.com Marymdoe#aol.com
Notes
You might also consider simplifying the problem by replacing the first condition (Type == 'MasterCard' or None) with the opposite of your second condition (Type == 'Visa' or 'Amex'):
df_data["concat_name2"] = np.where(
(~df_data["Type"].isin(["Visa", "Amex"])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["City"].astype(str).replace("None", ""),
df_data["Name"].astype(str).replace("None", "")
+ df_data["Email"].astype(str).replace("None", "")
)
Additionally, if you are dealing with messy data, you can also improve the implementation by converting the Type column to lowercase or uppercase. This makes your code account for values like "mastercard", "Mastercard", etc. as well:
df_data["concat_name2"] = np.where(
(df_data["Type"].astype(str).str.lower().isin(["mastercard", pd.NA, None, "none"])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["City"].astype(str).replace("None", ""),
np.where(
(df_data["Type"].astype(str).str.lower().isin(["visa", "amex"])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["Email"].astype(str).replace("None", ""),
"Error",
),
)

hypothesis - How to generate a pandas dataframe with variable number of columns

I am new to Hypothesis and I would like to know if there is a better way to use Hypothesis than what I have done here...
import pandas as pd
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import column, data_frames

# find_empty_columns is the function under test (defined elsewhere)
class TestFindEmptyColumns:
    def test_one_empty_column(self):
        input = pd.DataFrame({
            'quantity': [None],
        })
        expected_output = ['quantity']
        assert find_empty_columns(input) == expected_output

    def test_no_empty_column(self):
        input = pd.DataFrame({
            'item': ["Item1", ],
            'quantity': [10, ],
        })
        expected_output = []
        assert find_empty_columns(input) == expected_output

    @given(data_frames([
        column(name='col1', elements=st.none() | st.integers()),
        column(name='col2', elements=st.none() | st.integers()),
    ]))
    def test_dataframe_with_random_number_of_columns(self, df):
        df_with_no_empty_columns = df.dropna(how='all', axis=1)
        result = find_empty_columns(df)
        # None of the empty columns should be in the reference dataframe df_with_no_empty_columns
        assert set(result).isdisjoint(df_with_no_empty_columns.columns)
        # The above assert does not catch the condition if the result is a column name
        # that is not there in the data-frame at all, e.g. 'col3'
        assert set(result).issubset(df.columns)
Ideally, I want a dataframe which has a variable number of columns in each test run. The columns can contain any value, and some of the columns should contain all null values. Any help would be appreciated.
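One possible approach (a sketch, not from the original post; the strategy name variable_width_frames and the col0, col1, ... column names are made up for illustration) is to draw the number of columns first and then build a data_frames strategy of that width:
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import columns, data_frames

# Draw an integer n, then build a frame with n columns whose cells are None or
# integers, so every generated example can have a different number of columns
# and some columns can end up entirely null.
variable_width_frames = st.integers(min_value=1, max_value=5).flatmap(
    lambda n: data_frames(
        columns=columns([f'col{i}' for i in range(n)],
                        elements=st.none() | st.integers())
    )
)

@given(variable_width_frames)
def test_dataframe_with_variable_number_of_columns(df):
    result = find_empty_columns(df)  # function under test, from the question
    assert set(result).issubset(df.columns)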

Conditional mapping among columns of two data frames with Pandas Data frame

I need your advice on how to map columns between data frames.
I have put it in a simple way so that it's easier for you to understand:
df = dataframe
EXAMPLE:
df1 = pd.DataFrame({
    "X": [],
    "Y": [],
    "Z": []
})
df2 = pd.DataFrame({
    "A": ['', '', 'A1'],
    "C": ['', '', 'C1'],
    "D": ['D1', 'Other', 'D3'],
    "F": ['', '', ''],
    "G": ['G1', '', 'G3'],
    "H": ['H1', 'H2', 'H3']
})
Requirement:
1st step:
We need to fill the X column of df1 from columns A, C, D of df2, in that order; the search stops at the first non-empty value and selects it.
2nd step:
If the selected value is "Other", the X column of df1 should instead be filled from columns F, G, and H, in that order, until a value is found.
Result:
X
0 D1
1 H2
2 A1
Thank you so much in advance
Try this:
def first_non_empty(df, cols):
    """Return the first non-empty, non-null value among the specified columns per row."""
    return df[cols].replace('', pd.NA).bfill(axis=1).iloc[:, 0]

col_x = first_non_empty(df2, ['A', 'C', 'D'])
col_x = col_x.mask(col_x == 'Other', first_non_empty(df2, ['F', 'G', 'H']))
df1['X'] = col_x
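For the example df2 in the question, the intermediate col_x should come out with the requested values (a quick check, assuming the frames defined above):
print(col_x.tolist())
# ['D1', 'H2', 'A1']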

pandas: Calculate the rowwise max of categorical columns

I have a DataFrame containing 2 columns of ordered categorical data (of the same category). I want to construct another column that contains the categorical maximum of the first 2 columns. I set up the following.
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np

cats = CategoricalDtype(categories=['small', 'normal', 'large'], ordered=True)
data = {
    'A': ['normal', 'small', 'normal', 'large', np.nan],
    'B': ['small', 'normal', 'large', np.nan, 'small'],
    'desired max(A,B)': ['normal', 'normal', 'large', 'large', 'small']
}
df = pd.DataFrame(data).astype(cats)
The columns can be compared, although the np.nan items are problematic, as running the following code shows.
df['A'] > df['B']
The manual suggests that max() works on categorical data, so I try to define my new column as follows.
df[['A', 'B']].max(axis=1)
This yields a column of NaN. Why?
The following code constructs the desired column using the comparability of the categorical columns. I still don't know why max() fails here.
dfA = df['A']
dfB = df['B']
conditions = [dfA.isna(), (dfB.isna() | (dfA >= dfB)), True]
cases = [dfB, dfA, dfB]
df['maxAB'] = np.select(conditions, cases)
Columns A and B hold string labels, so you have to assign integer values to each of these categories first.
# size string -> integer value mapping
size2int_map = {
    'small': 0,
    'normal': 1,
    'large': 2
}
# integer value -> size string mapping
int2size_map = {
    0: 'small',
    1: 'normal',
    2: 'large'
}
# create columns containing the integer value for each size string
for c in df:
    df['%s_int' % c] = df[c].map(size2int_map)
# apply the int2size map back to get the string sizes back
print(df[['A_int', 'B_int']].max(axis=1).map(int2size_map))
and you should get
0 normal
1 normal
2 large
3 large
4 small
dtype: object
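As for why df[['A', 'B']].max(axis=1) returns NaN in the first place: the row-wise reduction appears to treat the ordered categorical columns as non-numeric and skips them, leaving nothing to take a maximum of. An alternative to the hand-written maps (a sketch, not part of the answer above; the column name maxAB2 is only illustrative) is to work with the underlying category codes, which already respect the order defined in cats:
# .cat.codes encodes the ordered categories as 0, 1, 2 and NaN as -1,
# so a missing value never wins the row-wise max
codes = df[['A', 'B']].apply(lambda s: s.cat.codes)
df['maxAB2'] = pd.Categorical.from_codes(codes.max(axis=1), dtype=cats)
print(df['maxAB2'])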

pandas: how to assign a single column conditionally on multiple other columns?

I'm confused about conditional assignment in Pandas.
I have this dataframe:
df = pd.DataFrame([
    { 'stripe_subscription_id': 1, 'status': 'past_due' },
    { 'stripe_subscription_id': 2, 'status': 'active' },
    { 'stripe_subscription_id': None, 'status': 'active' },
    { 'stripe_subscription_id': None, 'status': 'active' },
])
I'm trying to add a new column, conditionally based on the others:
def get_cancellation_type(row):
    if row.stripe_subscription_id:
        if row.status == 'past_due':
            return 'failed_to_pay'
        elif row.status == 'active':
            return 'cancelled_by_us'
    else:
        return 'cancelled_by_user'

df['cancellation_type'] = df.apply(get_cancellation_type, axis=1)
This is fairly readable, but is it the standard way to do things?
I've been looking at pd.assign, and am not sure if I should be using that instead.
This should work; you can change or add the conditions however you want.
# Note: comparisons like df[...] != np.nan are always True (and == np.nan always False),
# so the null checks have to go through notna()/isna().
df.loc[df['stripe_subscription_id'].notna() & (df['status'] == 'past_due'), 'cancellation_type'] = 'failed_to_pay'
df.loc[df['stripe_subscription_id'].notna() & (df['status'] == 'active'), 'cancellation_type'] = 'cancelled_by_us'
df.loc[df['stripe_subscription_id'].isna(), 'cancellation_type'] = 'cancelled_by_user'
You might consider using np.select:
import pandas as pd
import numpy as np
condList = [df["status"]=="past_due",
df["status"]=="active",
~df["status"].isin(["past_due",
"active"])]
choiceList = ["failed_to_pay", "cancelled_by_us", "cancelled_by_user"]
df['cancellation_type'] = np.select(condList, choiceList)