concatenate values in dataframe if a column has specific values and None or Null values - pandas

I have a dataframe with name+address/email information based on the type. Based on a type I want to concat name+address or name+email into a new column (concat_name) within the dataframe. Some of the types are null and are causing ambiguity errors. Identifying the nulls correctly in place is where I'm having trouble.
import pandas as pd
NULL = None
data = {
    'Type': [NULL, 'MasterCard', 'Visa', 'Amex'],
    'Name': ['Chris', 'John', 'Jill', 'Mary'],
    'City': ['Tustin', 'Cleveland', NULL, NULL],
    'Email': [NULL, NULL, '', ''],
}
df_data = pd.DataFrame(data)
#Expected resulting df column:
df_data['concat_name'] = ['ChrisTustin', 'JohnCleveland', 'Jill', 'Mary']  # Name + Email, with the Email values as shown above
Attempt one using booleans
if df_data['Type'].isnull() | (df_data['Type'] == 'Mastercard'):
    df_data['concat_name'] = df_data['Name']+df_data['City']
if (df_data['Type'] == 'Visa') | (df_data['Type'] == 'Amex'):
    df_data['concat_name'] = df_data['Name']+df_data['Email']
else:
    df_data['concat_name'] = 'Error'
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Attempt two using np.where
df_data['concat_name'] = np.where((df_data['Type'].isna()|(df_data['Type']=='MasterCard'),df_data['Name']+df_data['City'],
np.where((df_data['Type']=="Visa")|(df_data['Type]=="Amex"),df_data['Name']+df_data['Email'], 'Error'
ValueError: Length of values(2) does not match length of index(12000)

Does the following code solve your use case?
# == Imports needed ===========================
import pandas as pd
import numpy as np
# == Example Dataframe =========================
df_data = pd.DataFrame(
    {
        "Type": [None, "MasterCard", "Visa", "Amex"],
        "Name": ["Chris", "John", "Jill", "Mary"],
        "City": ["Tustin", "Cleveland", None, None],
        "Email": [None, None, "", ""],
        # Expected output (Name + Email, with the Email values as shown above):
        "concat_name": ["ChrisTustin", "JohnCleveland", "Jill", "Mary"],
    }
)
# == Solution Implementation ====================
df_data["concat_name2"] = np.where(
    (df_data["Type"].isin(["MasterCard", pd.NA, None])),
    df_data["Name"].astype(str).replace("None", "")
    + df_data["City"].astype(str).replace("None", ""),
    np.where(
        (df_data["Type"].isin(["Visa", "Amex"])),
        df_data["Name"].astype(str).replace("None", "")
        + df_data["Email"].astype(str).replace("None", ""),
        "Error",
    ),
)
print(df_data)
# == Expected Output ============================
# Prints:
#          Type   Name       City  Email    concat_name   concat_name2
# 0        None  Chris     Tustin   None    ChrisTustin    ChrisTustin
# 1  MasterCard   John  Cleveland   None  JohnCleveland  JohnCleveland
# 2        Visa   Jill       None                  Jill           Jill
# 3        Amex   Mary       None                  Mary           Mary
You might also consider simplifying the problem, by replacing the first condition (Type == 'MasterCard' or None) with the opposite of your second condition (Type == 'Visa' or 'Amex'):
df_data["concat_name2"] = np.where(
    (~df_data["Type"].isin(["Visa", "Amex"])),
    df_data["Name"].astype(str).replace("None", "")
    + df_data["City"].astype(str).replace("None", ""),
    df_data["Name"].astype(str).replace("None", "")
    + df_data["Email"].astype(str).replace("None", "")
)
Additionally, if you are dealing with messy data, you can also improve the implementation by converting the Type column to lowercase, or uppercase. This makes your code also account for cases where you have values like "mastercard", or "Mastercard", etc.:
df_data["concat_name2"] = np.where(
    (df_data["Type"].astype(str).str.lower().isin(["mastercard", pd.NA, None, "none"])),
    df_data["Name"].astype(str).replace("None", "")
    + df_data["City"].astype(str).replace("None", ""),
    np.where(
        (df_data["Type"].astype(str).str.lower().isin(["visa", "amex"])),
        df_data["Name"].astype(str).replace("None", "")
        + df_data["Email"].astype(str).replace("None", ""),
        "Error",
    ),
)
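As a further variation on the same idea, here is a minimal sketch that uses np.select instead of nested np.where calls, and fillna("") instead of the astype(str).replace("None", "") round trip. It assumes the four-row example frame from the question. The "Error" fallback mirrors the question's attempts, and the helper names (name, city, email, conditions, choices) are only illustrative:

```python
import numpy as np
import pandas as pd

df_data = pd.DataFrame(
    {
        "Type": [None, "MasterCard", "Visa", "Amex"],
        "Name": ["Chris", "John", "Jill", "Mary"],
        "City": ["Tustin", "Cleveland", None, None],
        "Email": [None, None, "", ""],
    }
)

# Treat missing text as an empty string so concatenation never yields NaN.
name = df_data["Name"].fillna("")
city = df_data["City"].fillna("")
email = df_data["Email"].fillna("")

# Each condition is a boolean Series, evaluated element-wise (no `if`).
conditions = [
    df_data["Type"].isna() | df_data["Type"].eq("MasterCard"),
    df_data["Type"].isin(["Visa", "Amex"]),
]
choices = [name + city, name + email]

# Rows matching neither condition fall back to the "Error" marker.
df_data["concat_name"] = np.select(conditions, choices, default="Error")
```

Because the masks are combined with | and passed to np.select as whole columns, nothing ever asks for the truth value of a Series, which is what raised the ValueError in the first attempt.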


alternative way to define a function inside a class method [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
I have the following class:
import numpy as np
import pandas as pd
class Analysis():
    def __init__(self, file_dir):
        self.path = file_dir  # file path directory
    def getData(self):
        return pd.read_csv(self.path)  # create a pandas dataframe
    def getStd(self):
        # calculate the standard deviation of all columns
        return self.getData().loc['1':'5'].apply(lambda x: x.std())
    def getHighlight(self):
        # a function to highlight df based on the given condition
        def highlight(x):
            c1 = 'background-color:red'
            c2 = 'background-color:yellow'
            c3 = 'background-color:green'
            # rows over which the highlighting function should apply
            r = ['1', '2', '3', '4', '5']
            # first boolean mask for selecting the df elements
            m1 = (x.loc[r]>x.loc['USL']) | (x.loc[r]<x.loc['LSL'])
            # second boolean mask for selecting the df elements
            m2 = (x.loc[r]==x.loc['USL']) | (x.loc[r]==x.loc['LSL'])
            # DataFrame with same index and columns names as original filled empty strings
            df1 = pd.DataFrame('', index=x.index, columns=x.columns)
            # modify values of df1 columns by boolean mask
            df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
            return df1
        # apply the highlight function on the df to get highlighted
        return self.getData().style.apply(highlight, axis=None)
getData method returns the df like this:
my_analysis = Analysis(path_to_file)
Tg 0.37 10.24 5.02 0.63 20.30
USL 0.39 10.26 5.04 0.65 20.32
LSL 0.35 10.22 5.00 0.63 20.28
1 0.35 10.23 5.05 0.65 20.45
2 0.36 10.19 5.07 0.67 20.25
3 0.34 10.25 5.03 0.66 20.33
4 0.35 10.20 5.08 0.69 20.22
5 0.33 10.17 5.05 0.62 20.40
Max 0.36 10.25 5.08 0.69 20.45
Min 0.33 10.17 5.03 0.62 20.22
The getHighlight method has an inner function that is applied to the df to highlight its elements based on the given masks, and it outputs the styled df.
My question is what is the pythonic or elegant way of defining the inner function inside the class method?
Disclaimer: the following remarks represent my opinion about the topic of pythonic code.
Avoid Inner Functions
Avoid inner functions where you can. Sometimes they're necessary, but most of the time they're a sign that the code should be refactored.
Avoid re-reading multiple times
I would also avoid calling pd.read_csv every time I want to perform some operation on the data. Unless there's a good reason to read the file over and over again, it's more performant to read it once and store it in a class attribute or property.
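One way to get the read-once behaviour is functools.cached_property (Python 3.8+). This is a minimal sketch, assuming the first column of the file holds the row labels (Tg, USL, LSL, 1 ... 5). The source parameter name, the index_col=0 choice and the inline CSV text are my own illustration, not part of the question:

```python
from functools import cached_property
from io import StringIO

import pandas as pd


class Analysis:
    def __init__(self, source):
        # `source` is anything pd.read_csv accepts (a path or a file-like object).
        self.source = source

    @cached_property
    def data(self) -> pd.DataFrame:
        # Runs once per instance; later accesses reuse the stored frame.
        return pd.read_csv(self.source, index_col=0)

    def get_std(self) -> pd.Series:
        # Standard deviation of every column over the rows labelled "1" to "5".
        return self.data.loc["1":"5"].std()


# Illustrative one-column input, using the A-A values from the question.
csv_text = "idx,A-A\nUSL,0.39\nLSL,0.35\n1,0.35\n2,0.36\n3,0.34\n4,0.35\n5,0.33\n"
analysis = Analysis(StringIO(csv_text))
first = analysis.data
second = analysis.data  # no second read: the same object is returned
std = analysis.get_std()
```

A StringIO buffer can only be consumed once, so the second access above would fail if the file were actually read again.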
PEP-8 Naming Conventions
Another important thing to consider, if you're trying to make your code more pythonic, is to try to follow the PEP8 naming conventions, unless you're working on legacy code that does not follow PEP-8.
Class Overkill
Finally, I think that creating a class for what you're doing is a little overkill. Most of your methods are simple transformations that could easily be converted to functions. Aside from making your code less complex, it would improve its reusability.
How I would write the Analysis class
from __future__ import absolute_import, annotations
from pathlib import Path
from typing import Any, Collection, Iterable, Type, Union
import numpy as np
import pandas as pd
from pandas.core.dtypes.dtypes import ExtensionDtype # type: ignore
# Custom types for type hinting
Axes = Collection[Any]
NpDtype = Union[
    str, np.dtype, Type[Union[str, float, int, complex, bool, object]]
]
Dtype = Union["ExtensionDtype", NpDtype]
# Auxiliary functions
def is_iterable_not_string(iterable: Any) -> bool:
    """Return True, if `iterable` is an iterable object, and not a string.

    Parameters
    ----------
    iterable: Any
        The object to check whether it's an iterable except for strings,
        or not.

    Returns
    -------
    bool
        True, if object is iterable, but not a string.
        Otherwise, if object isn't an iterable, or if it's a string, return
        False.

    Examples
    --------
    >>> import numpy as np
    >>> import pandas as pd
    >>> class FakeIterable(int):
    ...     def __iter__(self): pass
    >>> print(is_iterable_not_string('abcde'))
    False
    >>> print(is_iterable_not_string(bytes(12345)))
    False
    >>> print(is_iterable_not_string(12345))
    False
    >>> print(is_iterable_not_string(123.45))
    False
    >>> print(is_iterable_not_string(type))
    False
    >>> print(is_iterable_not_string(list))  # Type list isn't iterable
    False
    >>> print(is_iterable_not_string(object))
    False
    >>> print(is_iterable_not_string(None))
    False
    >>> print(is_iterable_not_string(list()))  # Empty list is still iterable
    True
    >>> # `FakeIterable` has a method `__iter__`, therefore it's considered
    >>> # iterable, even though it isn't.
    >>> print(is_iterable_not_string(FakeIterable(10)))
    True
    >>> print(is_iterable_not_string(list('abcde')))
    True
    >>> print(is_iterable_not_string(tuple('abcde')))
    True
    >>> print(is_iterable_not_string(set('abcde')))
    True
    >>> print(is_iterable_not_string(np.array(list('abcdef'))))
    True
    >>> print(is_iterable_not_string({col: [1, 2, 3, 4] for col in 'abcde'}))
    True
    >>> print(is_iterable_not_string(
    ...     pd.DataFrame({col: [1, 2, 3, 4] for col in 'abcde'}))
    ... )
    True
    >>> print(is_iterable_not_string(pd.DataFrame()))
    True

    Notes
    -----
    In python, any object that contains a method called `__iter__` is
    considered an “iterable”. This means that you can, in theory, fake an
    “iterable” object, by creating a method called `__iter__` that doesn't
    contain any real implementation. For a concrete case, see the examples
    section.

    Python common iterable objects:
    - strings
    - bytes
    - lists
    - tuples
    - sets
    - dictionaries

    Python common non-iterable objects:
    - integers
    - floats
    - None
    - types
    - objects
    """
    return (not isinstance(iterable, (bytes, str))
            and isinstance(iterable, Iterable))
def prepare_dict(data: dict) -> dict:
    """Transform non-iterable dictionary values into lists.

    Parameters
    ----------
    data : dict
        The dictionary to convert non-iterable values into lists.

    Returns
    -------
    dict
        Dictionary with non-iterable values converted to lists.

    Examples
    --------
    >>> import pandas as pd
    >>> d = {'a': '1', 'b': 2}
    >>> prepare_dict(d)
    {'a': ['1'], 'b': [2]}
    >>> pd.DataFrame(d)  # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError: If using all scalar values, you must pass an index
    >>> pd.DataFrame(prepare_dict(d))
       a  b
    0  1  2

    Notes
    -----
    Use this function to prepare dictionaries, before calling
    `pandas.DataFrame`, to make sure all values have the correct format.
    """
    return {
        key: value if is_iterable_not_string(value) else [value]
        for key, value in data.items()
    }
def check_dict_value_lens(data: dict) -> bool:
    """Check whether all values from dictionary have the same length.

    Parameters
    ----------
    data : dict
        The dictionary to check the values lengths.

    Returns
    -------
    bool
        True, if all `data` values have the same length. False otherwise.
    """
    min_len = min(map(len, data.values()))
    return all(len(value) == min_len for value in data.values())
def read_file(path: Path | str, **kwargs: Any) -> pd.DataFrame:
    """
    Read a DataFrame from a file.

    Supported file types are:
    - `.csv`
    - `.xlsx`, `.xls`, `.xlsm`, `.xlsb` (Excel files)
    - `.json`
    - `.parquet`
    - `.feather`
    - `.html`

    Parameters
    ----------
    path : Path | str
        The path to the file.
    kwargs : Any
        Keyword arguments to pass to pandas io functions.

    Returns
    -------
    pd.DataFrame
        The DataFrame read from the file.

    Raises
    ------
    ValueError
        If the file type is not supported.
    FileNotFoundError
        If the file doesn't exist.
    """
    _path = Path(path)
    path = str(path)
    if not _path.is_file():
        raise FileNotFoundError(f"File {path} does not exist.")
    if _path.suffix in [".csv", ".txt"]:
        return pd.read_csv(path, **kwargs)
    if ".xls" in _path.suffix:
        return pd.read_excel(path, **kwargs)
    if _path.suffix == ".json":
        return pd.read_json(path, **kwargs)
    if _path.suffix == ".pickle":
        return pd.read_pickle(path, **kwargs)
    if _path.suffix == ".html":
        return pd.read_html(path, **kwargs)
    if _path.suffix == ".feather":
        return pd.read_feather(path, **kwargs)
    if _path.suffix in [".parquet", ".pq"]:
        return pd.read_parquet(path, **kwargs)
    raise ValueError(f"File {path} has an unknown extension.")
def highlight(df: pd.DataFrame) -> pd.DataFrame:
    """Highlight a DataFrame.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to highlight. Required indexes:
        - ["USL", "LSL", "1", "2", "3", "4", "5"]

    Returns
    -------
    pd.DataFrame
        The DataFrame with highlighted rows.
    """
    # The dataframe cells background colors.
    c1: str = "background-color:red"
    c2: str = "background-color:yellow"
    c3: str = "background-color:green"
    # Rows over which the highlighting function should apply
    rows: list[str] = ["1", "2", "3", "4", "5"]
    # First boolean mask for selecting the df elements
    m1 = (df.loc[rows] > df.loc["USL"]) | (df.loc[rows] < df.loc["LSL"])
    # Second boolean mask for selecting the df elements
    m2 = (df.loc[rows] == df.loc["USL"]) | (df.loc[rows] == df.loc["LSL"])
    # DataFrame with same index, and column names as the original,
    # but with filled empty strings.
    df_highlight = pd.DataFrame("", index=df.index, columns=df.columns)
    # Change values of df_highlight columns by boolean mask
    df_highlight.loc[rows, :] = np.select(
        [m1, m2], [c1, c2], default=c3
    )
    return df_highlight
class Analysis:
    """
    Read a dataframe, and help perform some analysis on the data.

    Parameters
    ----------
    path_or_data : str | Path | pd.DataFrame
        The path to a file, or a dataframe to analyze.

    Attributes
    ----------
    _data : pd.DataFrame
        The data read from the file.
    _path : str | Path
        The path to the file.

    Examples
    --------
    >>> data = {
    ...     'A-A': [
    ...         0.37, 0.39, 0.35, 0.35, 0.36, 0.34, 0.35, 0.33, 0.36, 0.33,
    ...     ],
    ...     'A-B': [
    ...         10.24, 10.26, 10.22, 10.23, 10.19, 10.25, 10.2, 10.17, 10.25,
    ...         10.17,
    ...     ],
    ...     'A-C': [
    ...         5.02, 5.04, 5.0, 5.05, 5.07, 5.03, 5.08, 5.05, 5.08, 5.03,
    ...     ],
    ...     'A-D': [
    ...         0.63, 0.65, 0.63, 0.65, 0.67, 0.66, 0.69, 0.62, 0.69, 0.62,
    ...     ],
    ...     'A-E': [
    ...         20.3, 20.32, 20.28, 20.45, 20.25, 20.33, 20.22, 20.4,
    ...         20.45, 20.22,
    ...     ],
    ... }
    >>> index = ['Tg', 'USL', 'LSL', '1', '2', '3', '4', '5', 'Max', 'Min']
    >>> analysis = Analysis.from_dict(data, index=index)
    >>> analysis.get_std()
    A-A    0.011402
    A-B    0.031937
    A-C    0.019494
    A-D    0.025884
    A-E    0.097211
    dtype: float64
    """
    _path: Path | str | None = None
    _data: pd.DataFrame | None = None
    @property
    def path(self) -> str | Path:
        """Get the path to the file.

        Returns
        -------
        str | Path
            The path to the file.

        Raises
        ------
        ValueError
            If `_path` is `None`.
        """
        if self._path is None:
            raise ValueError("Path not set.")
        return str(self._path)
    @path.setter
    def path(self, path: str | Path):
        """Set the path of the file to analyze.

        Parameters
        ----------
        path : str | Path
            The path of the file to analyze.
            Path should point to a `.csv` file.

        Raises
        ------
        FileNotFoundError
            If the path is not found.
        """
        _path = Path(path)
        if _path.is_file():
            self._path = str(path)
        else:
            raise FileNotFoundError(f"Path {path} does not exist.")
    @property
    def data(self) -> pd.DataFrame:
        """Dataframe read from `path`.

        Returns
        -------
        pd.DataFrame
            The dataframe read from `path`.
        """
        if self._data is None:
            self._data = self.get_data()
        return self._data
    @data.setter
    def data(self, data: pd.DataFrame):
        """Set the data to analyze.

        Parameters
        ----------
        data : pd.DataFrame
            The data to analyze.
        """
        self._data = data
    def __init__(self, path_or_data: str | Path | pd.DataFrame):
        """Initialize the Analyzer.

        Parameters
        ----------
        path_or_data : str | Path | pd.DataFrame
            The path to a file, or a dataframe to analyze.

        Raises
        ------
        ValueError
            If `path_or_data` is not a `str`, `Path`, or `pd.DataFrame`.
        """
        if isinstance(path_or_data, (str, Path)):
            self.path = path_or_data
        elif isinstance(path_or_data, pd.DataFrame):
            self.data = path_or_data
        else:
            raise ValueError(f"Invalid type {type(path_or_data)}.")
    def get_data(self) -> pd.DataFrame:
        """Read the data from the file.

        Returns
        -------
        pd.DataFrame
            The dataframe read from the `path` property.
        """
        return read_file(self.path)
    def get_std(self) -> pd.Series:
        """Calculate the standard deviation of every column.

        Returns
        -------
        pd.Series
            The standard deviation of every column.
        """
        return self.data.loc["1":"5"].apply(lambda x: x.std())  # type: ignore
    def highlight_frame(
        self, round_values: int | None = None
    ) -> pd.io.formats.style.Styler:  # type: ignore
        """Highlight dataframe, based on some condition.

        Parameters
        ----------
        round_values: int | None
            If defined, sets the precision of the Styler object with the
            highlighted dataframe.

        Returns
        -------
        pd.io.formats.style.Styler
            The Styler object with the highlighted dataframe.
        """
        highlight_df = self.data.style.apply(highlight, axis=None)
        if isinstance(round_values, int) and round_values >= 0:
            return highlight_df.format(precision=round_values)
        return highlight_df
    @classmethod
    def from_dict(
        cls,
        data: dict,
        index: Axes | None = None,
        columns: Axes | None = None,
        dtype: Dtype | None = None,
    ) -> Analysis:
        """Create an Analysis object from a dictionary.

        Parameters
        ----------
        data : dict
            The dictionary to create the Analysis object from.
        index : Index or array-like
            Index to use for resulting frame. Defaults to RangeIndex, if
            no indexing information part of input data and no index provided.
        columns : Index or array-like
            Column labels to use for resulting frame when data doesn't have
            them, defaulting to RangeIndex(0, 1, 2, ..., n).
            If data contains column labels, will perform column selection
            instead.
        dtype : dtype, default None
            Data type to force. Only a single dtype allowed. If None, infer.

        Returns
        -------
        Analysis
            An instance of the `Analysis` class.

        Raises
        ------
        ValueError
            If dictionary values have different lengths.
        """
        data = prepare_dict(data)
        if check_dict_value_lens(data):
            return cls(
                pd.DataFrame(data, index=index, columns=columns, dtype=dtype)
            )
        raise ValueError(
            f"Dictionary values don't have the same lengths.\nData: {data}"
        )
if __name__ == "__main__":
    import doctest
    doctest.testmod()

Creating new column in pandas using existing column values as filter using pandas - .isin() fails as Attribute Error

Error: AttributeError: 'int' object has no attribute 'isin'
Question: There are no null values, and the code works in an individual code block. I tried changing the dtype of series R to object, and the error becomes: 'str' object has no attribute 'isin'
What am I missing?
X = [1, 2, 3, 4]
    if dg['RFM_Segment'] == '111':
        return 'Core'
    elif (dg['R'].isin(X) & dg['F'].isin([1]) & dg['M'].isin(X) & (dg['RFM_Segment'] != '111')).any():
        return 'Loyal'
    elif (dg['R'].isin(X) & dg['F'].isin(X) & dg['M'].isin([1]) & (dg['RFM_Segment'] != '111')).any():
        return 'Whales'
    elif (dg['R'].isin(X) & dg['F'].isin([1]) & dg['M'].isin([3,4])).any():
        return 'Promising'
    elif (dg['R'].isin([1]) & dg['F'].isin([4]) & dg['M'].isin(X)).any():
        return 'Rookies'
    elif (dg['R'].isin([4]) & dg['F'].isin([4]) & dg['M'].isin(X)).any():
        return 'Slipping'
    return 'NA'
dg['user_segment']= dg.apply(user_segment, axis= 1)
I will assume that you accidentally cut off the top of your code snippet, in which you define user_segment.
The issue lies in the way you use apply. With axis=1, apply passes each row to your function as a Series, not the whole DataFrame. Indexing into that row does not give you a Series (as selecting a column from a DataFrame would), but a scalar of the column's type (int, str, etc.), and scalars have no isin method. An example:
import pandas as pd
X = ['a', 'c']
df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['e', 'f']], columns=['col1', 'col2'])
df['col1'].isin(X)  # this works, because I'm applying `isin` on the entire column.
def test_apply(x):
    return x['col1'].isin(X)
df.apply(test_apply, axis=1)  # this doesn't work,
                              # because I'm applying `isin` on a non-pandas object, in
                              # this example `str`
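If you want to keep the isin calls, one option is to drop apply and build the masks on whole columns instead. Below is a minimal sketch of that idea using np.select. The sample dg frame is hypothetical (the real data is not shown in the question), and I am assuming the conditions are meant to be checked in the same order as your if/elif chain:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real `dg` is not shown in the question.
dg = pd.DataFrame(
    {
        "R": [1, 2, 1, 4, 2, 5],
        "F": [1, 1, 4, 4, 3, 2],
        "M": [1, 2, 2, 3, 1, 2],
        "RFM_Segment": ["111", "212", "142", "443", "231", "522"],
    }
)
X = [1, 2, 3, 4]

# Whole-column masks, so `.isin` is always called on a Series.
not_core = dg["RFM_Segment"] != "111"
conditions = [
    dg["RFM_Segment"] == "111",
    dg["R"].isin(X) & dg["F"].isin([1]) & dg["M"].isin(X) & not_core,
    dg["R"].isin(X) & dg["F"].isin(X) & dg["M"].isin([1]) & not_core,
    dg["R"].isin(X) & dg["F"].isin([1]) & dg["M"].isin([3, 4]),
    dg["R"].isin([1]) & dg["F"].isin([4]) & dg["M"].isin(X),
    dg["R"].isin([4]) & dg["F"].isin([4]) & dg["M"].isin(X),
]
choices = ["Core", "Loyal", "Whales", "Promising", "Rookies", "Slipping"]

# np.select keeps the first matching condition, like an if/elif chain.
dg["user_segment"] = np.select(conditions, choices, default="NA")
```

Note that no .any() is needed here: each mask already holds one boolean per row.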

hypothesis - How to generate a pandas dataframe with variable number of columns

I am new to Hypothesis and I would like to know if there is a better way to use Hypothesis than what I have done here...
class TestFindEmptyColumns:
    def test_one_empty_column(self):
        input = pd.DataFrame({
            'quantity': [None],
        })
        expected_output = ['quantity']
        assert find_empty_columns(input) == expected_output
    def test_no_empty_column(self):
        input = pd.DataFrame({
            'item': ["Item1", ],
            'quantity': [10, ],
        })
        expected_output = []
        assert find_empty_columns(input) == expected_output
    @given(data_frames(columns=[
        column(name='col1', elements=st.none() | st.integers()),
        column(name='col2', elements=st.none() | st.integers()),
    ]))
    def test_dataframe_with_random_number_of_columns(self, df):
        df_with_no_empty_columns = df.dropna(how='all', axis=1)
        result = find_empty_columns(df)
        # None of the empty columns should be in the reference dataframe df_with_no_empty_columns
        assert set(result).isdisjoint(df_with_no_empty_columns.columns)
        # The above assert does not catch the condition if the result is a column name
        # that is not there in the data-frame at all e.g. 'col3'
        assert set(result).issubset(df.columns)
Ideally, I want a dataframe which has a variable number of columns in each test run. The columns can contain any value, and some of the columns should contain only null values. Any help would be appreciated.

pandas: how to assign a single column conditionally on multiple other columns?

I'm confused about conditional assignment in Pandas.
I have this dataframe:
df = pd.DataFrame([
    { 'stripe_subscription_id': 1, 'status': 'past_due' },
    { 'stripe_subscription_id': 2, 'status': 'active' },
    { 'stripe_subscription_id': None, 'status': 'active' },
    { 'stripe_subscription_id': None, 'status': 'active' },
])
I'm trying to add a new column, conditionally based on the others:
def get_cancellation_type(row):
    if row.stripe_subscription_id:
        if row.status == 'past_due':
            return 'failed_to_pay'
        elif row.status == 'active':
            return 'cancelled_by_us'
    return 'cancelled_by_user'
df['cancellation_type'] = df.apply(get_cancellation_type, axis=1)
This is fairly readable, but is it the standard way to do things?
I've been looking at pd.assign, and am not sure if I should be using that instead.
This should work, you can change or add the conditions however you want.
# NaN never compares equal to anything (not even itself), so test for it with notna()/isna() rather than != / == np.nan
df.loc[df['stripe_subscription_id'].notna() & (df['status'] == 'past_due'), 'cancellation_type'] = 'failed_to_pay'
df.loc[df['stripe_subscription_id'].notna() & (df['status'] == 'active'), 'cancellation_type'] = 'cancelled_by_us'
df.loc[df['stripe_subscription_id'].isna(), 'cancellation_type'] = 'cancelled_by_user'
You might consider using np.select:
import pandas as pd
import numpy as np
condList = [df["status"]=="past_due",
choiceList = ["failed_to_pay", "cancelled_by_us", "cancelled_by_user"]
df['cancellation_type'] = np.select(condList, choiceList)

pandas: how to check for nulls in a float column?

I am conditionally assigning a column based on whether another column is null:
df = pd.DataFrame([
    { 'stripe_subscription_id': 1, 'status': 'past_due' },
    { 'stripe_subscription_id': 2, 'status': 'active' },
    { 'stripe_subscription_id': None, 'status': 'active' },
    { 'stripe_subscription_id': None, 'status': 'active' },
])
def get_cancellation_type(row):
    if row.stripe_subscription_id:
        if row.status == 'past_due':
            return 'failed_to_pay'
        elif row.status == 'active':
            return 'cancelled_by_us'
    return 'cancelled_by_user'
df['cancellation_type'] = df.apply(get_cancellation_type, axis=1)
But I don't get the results I expect:
I would expect the final two rows to read cancelled_by_user, because the stripe_subscription_id column is null.
If I amend the function:
def get_cancellation_type(row):
    if row.stripe_subscription_id.isnull():
Then I get an error: AttributeError: ("'float' object has no attribute 'isnull'", 'occurred at index 0'). What am I doing wrong?
With pandas and numpy you rarely have to write your own row-wise functions: they are slow because they are not vectorized, and pandas and numpy already provide a rich pool of vectorized methods.
In this case you are looking for np.select, since you want to create a column based on multiple conditions:
conditions = [
    df['stripe_subscription_id'].notna() & df['status'].eq('past_due'),
    df['stripe_subscription_id'].notna() & df['status'].eq('active')
]
choices = ['failed_to_pay', 'cancelled_by_us']
df['cancellation_type'] = np.select(conditions, choices, default='cancelled_by_user')
     status  stripe_subscription_id  cancellation_type
0  past_due                     1.0      failed_to_pay
1    active                     2.0    cancelled_by_us
2    active                     NaN  cancelled_by_user
3    active                     NaN  cancelled_by_user
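As for why the original function misbehaved: None in a numeric column is stored as NaN, and NaN is a float that evaluates as truthy, so `if row.stripe_subscription_id:` passes for the missing rows too. The .isnull() method exists on Series and DataFrame, not on the float you get from a row, which explains the AttributeError. If you do want to keep a row-wise function, here is a minimal sketch using pd.isna, which accepts scalars. The final `return None` for other statuses is my own assumption, since the question only covers two:

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"stripe_subscription_id": 1, "status": "past_due"},
        {"stripe_subscription_id": 2, "status": "active"},
        {"stripe_subscription_id": None, "status": "active"},
        {"stripe_subscription_id": None, "status": "active"},
    ]
)

# NaN is a non-zero float, so a plain truthiness test treats it as "present".
nan_is_truthy = bool(float("nan"))


def get_cancellation_type(row):
    # pd.isna accepts scalars; .isnull() exists only on Series/DataFrame.
    if pd.isna(row["stripe_subscription_id"]):
        return "cancelled_by_user"
    if row["status"] == "past_due":
        return "failed_to_pay"
    if row["status"] == "active":
        return "cancelled_by_us"
    return None  # assumption: statuses not covered by the question


df["cancellation_type"] = df.apply(get_cancellation_type, axis=1)
```

The vectorized version is still preferable on large frames; this only shows how to check for nulls on a single value.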