pandas: how to assign a single column conditionally on multiple other columns?

I'm confused about conditional assignment in Pandas.
I have this dataframe:
df = pd.DataFrame([
{ 'stripe_subscription_id': 1, 'status': 'past_due' },
{ 'stripe_subscription_id': 2, 'status': 'active' },
{ 'stripe_subscription_id': None, 'status': 'active' },
{ 'stripe_subscription_id': None, 'status': 'active' },
])
I'm trying to add a new column, conditionally based on the others:
def get_cancellation_type(row):
    if row.stripe_subscription_id:
        if row.status == 'past_due':
            return 'failed_to_pay'
        elif row.status == 'active':
            return 'cancelled_by_us'
    else:
        return 'cancelled_by_user'
df['cancellation_type'] = df.apply(get_cancellation_type, axis=1)
This is fairly readable, but is it the standard way to do things?
I've been looking at pd.assign, and am not sure if I should be using that instead.

This should work, and you can change or add conditions however you want. Note that comparing against np.nan with == or != never matches anything (NaN is unequal to everything, including itself), so use .notna() and .isna() for the null checks:
df.loc[df['stripe_subscription_id'].notna() & (df['status'] == 'past_due'), 'cancellation_type'] = 'failed_to_pay'
df.loc[df['stripe_subscription_id'].notna() & (df['status'] == 'active'), 'cancellation_type'] = 'cancelled_by_us'
df.loc[df['stripe_subscription_id'].isna(), 'cancellation_type'] = 'cancelled_by_user'
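For context, a quick standalone sketch of why equality comparisons against np.nan never work and the dedicated .isna()/.notna() checks are needed:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])

# NaN compares unequal to everything, including itself, so element-wise
# comparisons can never detect missing values:
print(np.nan == np.nan)        # False
print((s != np.nan).tolist())  # [True, True] -- every row "matches"

# Use the dedicated missing-value checks instead:
print(s.isna().tolist())       # [False, True]
print(s.notna().tolist())      # [True, False]
```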

You might consider using np.select. Note that checking status alone can't distinguish rows with a null stripe_subscription_id, so include that in the conditions:
import pandas as pd
import numpy as np

condList = [df["stripe_subscription_id"].notna() & (df["status"] == "past_due"),
            df["stripe_subscription_id"].notna() & (df["status"] == "active"),
            df["stripe_subscription_id"].isna()]
choiceList = ["failed_to_pay", "cancelled_by_us", "cancelled_by_user"]
df['cancellation_type'] = np.select(condList, choiceList)

Related

concatenate values in dataframe if a column has specific values and None or Null values

I have a dataframe with name+address/email information based on the type. Based on a type I want to concat name+address or name+email into a new column (concat_name) within the dataframe. Some of the types are null and are causing ambiguity errors. Identifying the nulls correctly in place is where I'm having trouble.
NULL = None
data = {
'Type': [NULL, 'MasterCard', 'Visa','Amex'],
'Name': ['Chris','John','Jill','Mary'],
'City': ['Tustin','Cleveland',NULL,NULL ],
'Email': [NULL,NULL,'jdoe#yahoo.com','mdoe#aol.com']
}
df_data = pd.DataFrame(data)
#Expected resulting df column:
df_data['concat_name'] = ['ChrisTustin', 'JohnCleveland', 'Jilljdoe#yahoo.com', 'Marymdoe#aol.com']
Attempt one using booleans
if df_data['Type'].isnull() | (df_data['Type'] == 'Mastercard'):
    df_data['concat_name'] = df_data['Name'] + df_data['City']
if (df_data['Type'] == 'Visa') | (df_data['Type'] == 'Amex'):
    df_data['concat_name'] = df_data['Name'] + df_data['Email']
else:
    df_data['concat_name'] = 'Error'
Error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Attempt two using np.where
df_data['concat_name'] = np.where(df_data['Type'].isna() | (df_data['Type'] == 'MasterCard'), df_data['Name'] + df_data['City'],
                         np.where((df_data['Type'] == "Visa") | (df_data['Type'] == "Amex"), df_data['Name'] + df_data['Email'], 'Error'))
Error
ValueError: Length of values(2) does not match length of index(12000)
Does the following code solve your use case?
# == Imports needed ===========================
import pandas as pd
import numpy as np
# == Example Dataframe =========================
df_data = pd.DataFrame(
{
"Type": [None, "MasterCard", "Visa", "Amex"],
"Name": ["Chris", "John", "Jill", "Mary"],
"City": ["Tustin", "Cleveland", None, None],
"Email": [None, None, "jdoe#yahoo.com", "mdoe#aol.com"],
# Expected output:
"concat_name": [
"ChrisTustin",
"JohnCleveland",
"Jilljdoe#yahoo.com",
"Marymdoe#aol.com",
],
}
)
# == Solution Implementation ====================
df_data["concat_name2"] = np.where(
(df_data["Type"].isin(["MasterCard", pd.NA, None])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["City"].astype(str).replace("None", ""),
np.where(
(df_data["Type"].isin(["Visa", "Amex"])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["Email"].astype(str).replace("None", ""),
"Error",
),
)
# == Expected Output ============================
print(df_data)
# Prints:
# Type Name City Email concat_name concat_name2
# 0 None Chris Tustin None ChrisTustin ChrisTustin
# 1 MasterCard John Cleveland None JohnCleveland JohnCleveland
# 2 Visa Jill None jdoe#yahoo.com Jilljdoe#yahoo.com Jilljdoe#yahoo.com
# 3 Amex Mary None mdoe#aol.com Marymdoe#aol.com Marymdoe#aol.com
Notes
You might also consider simplifying the problem, by replacing the first condition (Type == 'MasterCard' or None) with the opposite of your second condition (Type == 'Visa' or 'Amex'):
df_data["concat_name2"] = np.where(
(~df_data["Type"].isin(["Visa", "Amex"])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["City"].astype(str).replace("None", ""),
df_data["Name"].astype(str).replace("None", "")
+ df_data["Email"].astype(str).replace("None", "")
)
Additionally, if you are dealing with messy data, you can also improve the implementation by converting the Type column to lowercase, or uppercase. This makes your code also account for cases where you have values like "mastercard", or "Mastercard", etc.:
df_data["concat_name2"] = np.where(
(df_data["Type"].astype(str).str.lower().isin(["mastercard", pd.NA, None, "none"])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["City"].astype(str).replace("None", ""),
np.where(
(df_data["Type"].astype(str).str.lower().isin(["visa", "amex"])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["Email"].astype(str).replace("None", ""),
"Error",
),
)

How can I increase the efficiency of my code (for loop)?

def assignGroup(row):
    if row["E114"] == "Very good":
        return 1
    elif row['E114'] == "Fairly good":
        return 2
    elif row['E114'] == "Bad":
        return 3
    elif row['E114'] == "Very bad":
        return 4
    else:
        return np.nan

outcome["leader"] = outcome.apply(assignGroup, axis=1)
A simpler alternative is to map the values directly:
outcome["leader"] = outcome["E114"].map({
    "Very good": 1,
    "Fairly good": 2,
    "Bad": 3,
    "Very bad": 4
})
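One detail worth knowing about the map approach (a small standalone check): values missing from the dictionary come back as NaN, which matches the else: return np.nan branch of the apply version:

```python
import pandas as pd

outcome = pd.DataFrame({"E114": ["Very good", "Bad", "No answer"]})

outcome["leader"] = outcome["E114"].map({
    "Very good": 1,
    "Fairly good": 2,
    "Bad": 3,
    "Very bad": 4,
})

# Unmapped values ("No answer") become NaN, and the column is upcast to float:
print(outcome["leader"].tolist())  # [1.0, 3.0, nan]
```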
Use numpy's where:
import numpy as np
outcome["leader"] = np.where(outcome["E114"] == "Very good", 1, outcome["leader"])
outcome["leader"] = np.where(outcome["E114"] == "Fairly good", 2, outcome["leader"])
outcome["leader"] = np.where(outcome["E114"] == "Bad", 3, outcome["leader"])
outcome["leader"] = np.where(outcome["E114"] == "Very bad", 4, outcome["leader"])
In Python, loops should be a last resort.
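The same chain of np.where calls can also be collapsed into a single vectorized np.select call. A sketch of the idea, using a made-up outcome frame:

```python
import numpy as np
import pandas as pd

outcome = pd.DataFrame({"E114": ["Very good", "Fairly good", "Bad", "Very bad", "Other"]})

conditions = [
    outcome["E114"] == "Very good",
    outcome["E114"] == "Fairly good",
    outcome["E114"] == "Bad",
    outcome["E114"] == "Very bad",
]
choices = [1, 2, 3, 4]

# Rows matching none of the conditions fall through to the default
outcome["leader"] = np.select(conditions, choices, default=np.nan)
print(outcome["leader"].tolist())  # [1.0, 2.0, 3.0, 4.0, nan]
```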

youtube_dl video descriptions

I have a df containing a set of videoIDs from YT:
import pandas as pd
data = {'Order': ['1', '2', '3'],
'VideoID': ['jxwHmAoKte4', 'LsXM502SpiU','1I3f27iQ4pM']
}
df = pd.DataFrame (data, columns = ['Order','VideoID'])
print (df)
and want to download the video descriptions and save them in the same df in an extra column.
I tried to use youtube_dl in Jupyter this way:
import youtube_dl

def all_descriptions(URL):
    videoID = df['VideoId']
    URL = 'https://www.youtube.com/watch?v=' + videoID
    ydl_opts = {
        'forcedescription': True,
        'skip_download': True,
        'youtube-skip-dash-manifest': True,
        'no_warnings': True,
        'ignoreerrors': True
    }
    try:
        youtube_dl.YoutubeDL(ydl_opts).download(URL)
        return webpage
    except:
        pass

df['descriptions'] = all_descriptions(URL)
I see the output of the code as text, but in df only "None" as text of the column.
Obviously I can't transport the output of the function to df in the proper way.
Can you suggest how to get it right?
Thank you in advance for help.
@perl
I modify the df to include two URLs that are causing two types of error:
import pandas as pd
data = {'Order': ['1', '2', '3', '4', '5'],
'VideoId': ['jxwHmAoKte4', 'LsXM502SpiU','1I3f27iQ4pM', 'MGQOX2rK5s', 'wNayw_E7lIA']
}
df = pd.DataFrame (data, columns = ['Order','VideoId'])
print (df)
Then I test it in the way you suggested, including my definition of ydl_opts:
videoID = df['VideoId']
URL = 'https://www.youtube.com/watch?v=' + videoID
ydl_opts = {
    'forcedescription': True,
    'skip_download': True,
    'youtube-skip-dash-manifest': True,
    'no_warnings': True,
    'ignoreerrors': True
}
df['description'] = [
    youtube_dl.YoutubeDL(ydl_opts).extract_info(
        u, download=False)['description'] for u in URL]
df
Reaching to the first error I get the output:
TypeError: 'NoneType' object is not subscriptable
After that I replace 'forcedescription' in my code with 'extract_info':
def all_descriptions(URL):
    videoID = df['VideoId']
    URL = 'https://www.youtube.com/watch?v=' + videoID
    ydl_opts = {
        'forcedescription': True,
        'skip_download': True,
        'youtube-skip-dash-manifest': True,
        'no_warnings': True,
        'ignoreerrors': True
    }
    try:
        youtube_dl.YoutubeDL(ydl_opts).download(URL)
        return webpage
    except:
        pass
It skips all errors, but as the result there is nothing in the 'description'-column.
Any suggestions?
You can use extract_info method:
df['description'] = [
    youtube_dl.YoutubeDL().extract_info(
        u, download=False)['description'] for u in URL]
df
Output:
Order VideoID description
0 1 jxwHmAoKte4 Bundesweit gelten sie nun ab heute, die schärf...
1 2 LsXM502SpiU Wie sicher ist der Impfstoff? Wäre eine Impfpf...
2 3 1I3f27iQ4pM Impfen ja oder nein, diese Frage stellen sich ...
P.S. The forcedescription parameter only prints the description to standard output; it doesn't return it.
Update: extract_info returns None if it fails, so in case we have videos that may fail before getting the description from the info we can check that the info is not None:
ydl = youtube_dl.YoutubeDL(ydl_opts)
infos = [ydl.extract_info(u, download=False) for u in URL]
df['description'] = [
    info['description'] if info is not None else ''
    for info in infos]

GroupBy Function Not Applying

I am trying to groupby for the following specializations but I am not getting the expected result (or any for that matter). The data stays ungrouped even after this step. Any idea what's wrong in my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg:
import pandas as pd
df = pd.DataFrame(
[
['john', 'eng', 'build'],
['john', 'math', 'build'],
['kevin', 'math', 'asp'],
['nick', 'sci', 'spi']
],
columns = ['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in:
If you need to preserve the original number of rows, use transform, which returns one value per input row:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in:
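To make the difference between the two concrete, a self-contained run of both variants on the example frame:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi'],
    ],
    columns=['id', 'spec', 'type'],
)

# agg collapses each group to a single row:
agg_result = df.groupby('id')['spec'].agg(lambda x: ';'.join(x))
print(agg_result.tolist())  # ['eng;math', 'math', 'sci']

# transform keeps one value per original row, so it can be assigned back:
df['spec_grouped'] = df.groupby('id')['spec'].transform(lambda x: ';'.join(x))
print(df['spec_grouped'].tolist())  # ['eng;math', 'eng;math', 'math', 'sci']
```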

pandas: how to check for nulls in a float column?

I am conditionally assigning a column based on whether another column is null:
df = pd.DataFrame([
{ 'stripe_subscription_id': 1, 'status': 'past_due' },
{ 'stripe_subscription_id': 2, 'status': 'active' },
{ 'stripe_subscription_id': None, 'status': 'active' },
{ 'stripe_subscription_id': None, 'status': 'active' },
])
def get_cancellation_type(row):
    if row.stripe_subscription_id:
        if row.status == 'past_due':
            return 'failed_to_pay'
        elif row.status == 'active':
            return 'cancelled_by_us'
    else:
        return 'cancelled_by_user'
df['cancellation_type'] = df.apply(get_cancellation_type, axis=1)
df
But I don't get the results I expect:
I would expect the final two rows to read cancelled_by_user, because the stripe_subscription_id column is null.
If I amend the function:
def get_cancellation_type(row):
    if row.stripe_subscription_id.isnull():
Then I get an error: AttributeError: ("'float' object has no attribute 'isnull'", 'occurred at index 0'). What am I doing wrong?
With pandas and numpy we rarely have to write our own row-wise functions; they tend to be slow because they aren't vectorized, while pandas and numpy provide a rich pool of vectorized methods.
In this case you are looking for np.select, since you want to create a column based on multiple conditions:
conditions = [
df['stripe_subscription_id'].notna() & df['status'].eq('past_due'),
df['stripe_subscription_id'].notna() & df['status'].eq('active')
]
choices = ['failed_to_pay', 'cancelled_by_us']
df['cancellation_type'] = np.select(conditions, choices, default='cancelled_by_user')
status stripe_subscription_id cancellation_type
0 past_due 1.0 failed_to_pay
1 active 2.0 cancelled_by_us
2 active NaN cancelled_by_user
3 active NaN cancelled_by_user
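As an aside, if you'd rather keep the row-wise apply version from the question: the scalar-level null check is pd.isna. A float scalar has no .isnull() method (that exists on Series, not on individual values), which is what raised the AttributeError. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([
    {'stripe_subscription_id': 1, 'status': 'past_due'},
    {'stripe_subscription_id': 2, 'status': 'active'},
    {'stripe_subscription_id': None, 'status': 'active'},
])

def get_cancellation_type(row):
    # pd.isna works on scalars, unlike the Series-only .isnull()
    if pd.isna(row.stripe_subscription_id):
        return 'cancelled_by_user'
    if row.status == 'past_due':
        return 'failed_to_pay'
    elif row.status == 'active':
        return 'cancelled_by_us'

df['cancellation_type'] = df.apply(get_cancellation_type, axis=1)
print(df['cancellation_type'].tolist())
# ['failed_to_pay', 'cancelled_by_us', 'cancelled_by_user']
```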