Pandas isin boolean operator giving error

I am running into an error while using the 'isin' Boolean operator:
def rowcheck(row):
    return row['CUST_NAME'].isin(['John','Alan'])
My dataframe has a column CUST_NAME, so I use:
df['CUSTNAME_CHK'] = df.apply(lambda row: rowcheck(row), axis=1)
I get:
'str' object has no attribute 'isin'
What did I do wrong?

You are calling it inside a function passed to apply, so row['CUST_NAME'] holds the value of a single cell, and that value is a string. Strings have no isin method; isin belongs to pd.Series, not to str.
If you really want to use apply, you can use np.isin in this case:
import numpy as np

def rowcheck(row):
    return np.isin(row['CUST_NAME'], ['John', 'Alan'])
As @juanpa.arrivilaga noticed, isin won't be efficient in this case, so it's advisable to use the in operator directly:
    return row['CUST_NAME'] in ['John', 'Alan']
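If you do stick with apply, you can also apply over the single column instead of whole rows, which avoids building a row object per record (a sketch, assuming the question's df):
df['CUSTNAME_CHK'] = df['CUST_NAME'].apply(lambda v: v in ['John', 'Alan'])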
Notice that you probably don't need apply at all. You can just use pd.Series.isin directly. For example,
df = pd.DataFrame({'col1': ['abc', 'dfe']})
  col1
0  abc
1  dfe
Such that you can do
df.col1.isin(['abc', 'xyz'])
0     True
1    False
Name: col1, dtype: bool
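Applied to the question's frame (assuming the same df and CUST_NAME column), the whole thing collapses to one vectorized line:
df['CUSTNAME_CHK'] = df['CUST_NAME'].isin(['John', 'Alan'])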


Aggregating DataFrame string columns expected to be the same

I am calling DataFrame.agg on a dataframe with various numeric and string columns. For string columns, I want the result of the aggregation to be (a) the value of an arbitrary row if every row has that same string value or (b) an error otherwise.
I could write a custom aggregation function to do this, but is there a canonical way to approach this?
You can test whether each column is numeric and apply an aggregate like sum; for string columns, return the first value if all values are equal, otherwise raise an error:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['s', 's3'], 'b': [5, 6]})

def f(x):
    # numeric columns: aggregate with sum
    if np.issubdtype(x.dtype, np.number):
        return x.sum()
    # string columns: all values must be identical
    if x.eq(x.iat[0]).all():
        return x.iat[0]
    raise ValueError('not same strings values')

s = df.agg(f)  # column 'a' mixes 's' and 's3', so this raises ValueError('not same strings values')
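With matching strings the aggregation goes through. A quick check (df_ok is a made-up name, reusing f from above):
df_ok = pd.DataFrame({'a': ['s', 's'], 'b': [5, 6]})
print(df_ok.agg(f))
# a     s
# b    11
# dtype: object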

How to create a true-for-all index on a pandas dataframe?

I am using pandas and have run into a few occasions where I have a programmatically generated list of conditionals, like so
conditionals = [
    df['someColumn'] == 'someValue',
    df['someOtherCol'] == 'someOtherValue',
    df['someThirdCol'].isin(['foo', 'bar', 'baz']),
]
and I want to select rows where ALL of these conditions are true. I figure I'd do something like this.
bigConditional = IHaveNoIdeaOfWhatToPutHere
for conditional in conditionals:
    bigConditional = bigConditional && conditional
filteredDf = df[bigConditional]
I know that I WANT to use the identity property: bigConditional should start as a Series of True for every index in my dataframe, so that a row is excluded from the filtered dataframe as soon as any condition in my conditionals list evaluates to False, but initially every row is considered.
I don't know how to do that, or at least not the most succinct way that makes the intent obvious.
Also, I've run into the inverse scenario, where I only need one of the conditionals to match to include a row in the new dataframe, so I would need bigConditional to start as False for every index in the dataframe.
What about summing the conditions and checking whether the total equals the number of conditions?
filteredDf = df.loc[sum(conditionals)==len(conditionals)]
Or, even simpler, with np.all:
filteredDf = df.loc[np.all(conditionals, axis=0)]
Otherwise, for your original question, you can create a Series of True indexed like df, and your for loop will work (using & instead of &&):
bigConditional = pd.Series(True, index=df.index)
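Spelled out, the identity-element version of the loop, plus the inverse "any of them" case from the question, as a sketch:
# ALL conditions must hold: start from all-True and AND them together
bigConditional = pd.Series(True, index=df.index)
for conditional in conditionals:
    bigConditional = bigConditional & conditional
filteredDf = df[bigConditional]

# ANY condition may hold: start from all-False and OR them together
anyConditional = pd.Series(False, index=df.index)
for conditional in conditionals:
    anyConditional = anyConditional | conditional
# or simply: df.loc[np.any(conditionals, axis=0)]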
Maybe you can use query and generate your conditions like this:
conditionals = [
    "someColumn == 'someValue'",
    "someOtherCol == 'someOtherValue'",
    "someThirdCol.isin(['foo', 'bar', 'baz'])",
]
qs = ' & '.join(conditionals)
out = df.query(qs)
Or use eval to create a boolean mask instead of filtering your dataframe:
mask = df.eval(qs)
Demo
Suppose this dataframe:
>>> df
     someColumn    someOtherCol  someThirdCol
0     someValue  someOtherValue           foo
1     someValue  someOtherValue           baz
2     someValue    anotherValue  anotherValue
3  anotherValue    anotherValue  anotherValue
>>> df.query(qs)
  someColumn    someOtherCol someThirdCol
0  someValue  someOtherValue          foo
1  someValue  someOtherValue          baz
>>> df.eval(qs)
0     True
1     True
2    False
3    False
dtype: bool
You can even use f-strings or another template language to pass variables to your condition list.
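For example, a sketch with a made-up wanted list, either baked into the expression with an f-string or referenced as a local variable via query's @ syntax:
wanted = ['foo', 'bar', 'baz']
out = df.query(f"someThirdCol in {wanted}")   # f-string bakes the list literal into the expression
out = df.query("someThirdCol in @wanted")     # @ references the local variable directly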

Converting true/false to 0/1 boolean in a mixed dataframe

I have a dataframe with mixed data types. I want to create a function that goes through all the columns and converts any columns containing True/False to int32 type 0/1. I tried a lambda function below, where d is my dataframe:
f = lambda x: 1 if x==True else 0
d.applymap(f)
This doesn't work: it converts all my non-boolean columns to 0/1 as well. Is there a good way to go through the dataframe, leave everything untouched except the boolean columns, and convert those to 0s and 1s? Any help is appreciated!
Let's modify your lambda to use an isinstance check:
df.applymap(lambda x: int(x) if isinstance(x, bool) else x)
Only values of type bool will be converted to int, everything else remains the same.
As a better solution, if each column has a single dtype (rather than truly "mixed" values within a column, as I originally assumed from your question), you can instead use
u = df.select_dtypes(bool)
df[u.columns] = u.astype(int)
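For instance, on a small made-up frame with a string, an int and two bool columns:
df = pd.DataFrame({'name': ['a', 'b'], 'n': [1, 2],
                   'flag1': [True, False], 'flag2': [False, True]})
u = df.select_dtypes(bool)
df[u.columns] = u.astype(int)
df.dtypes
# name     object
# n         int64
# flag1     int64
# flag2     int64
# dtype: object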
You can select the columns using loc and change data type.
df = pd.DataFrame({'col1':np.random.randn(2), 'col2':[True, False], 'col3':[False, True]})
df.loc[:, df.dtypes == bool] = df.astype(int)
col1 col2 col3
0 0.999358 1 0
1 0.795179 0 1
If you have a dataframe df, try:
df_1 = df.applymap(lambda x: int(x) if type(x) == bool else x)
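Note that in newer pandas versions (2.1+), DataFrame.applymap is deprecated in favour of the equivalent elementwise DataFrame.map, so the same idea reads:
df_1 = df.map(lambda x: int(x) if isinstance(x, bool) else x)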

pandas apply() with and without lambda

What is the rule/process when a function is called with pandas apply() through a lambda vs. not? Examples below. Without a lambda, apparently the entire Series (df[column name]) is passed to the test function, which throws an error trying to do a boolean operation on a Series.
If the same function is called via a lambda, it works: apply iterates over the rows, each row is passed as x, and x[column name] returns the single value for that column in the current row.
It's as if the lambda removes a dimension. Does anyone have an explanation, or a pointer to the specific doc on this? Thanks.
Example 1 with lambda, works OK
print("probPredDF columns:", probPredDF.columns)
def test( x, y):
if x==y:
r = 'equal'
else:
r = 'not equal'
return r
probPredDF.apply( lambda x: test( x['yTest'], x[ 'yPred']), axis=1 ).head()
Example 1 output
probPredDF columns: Index([0, 1, 'yPred', 'yTest'], dtype='object')
Out[215]:
0 equal
1 equal
2 equal
3 equal
4 equal
dtype: object
Example 2 without lambda, throws boolean operation on series error
print("probPredDF columns:", probPredDF.columns)
def test( x, y):
if x==y:
r = 'equal'
else:
r = 'not equal'
return r
probPredDF.apply( test( probPredDF['yTest'], probPredDF[ 'yPred']), axis=1 ).head()
Example 2 output
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
There is nothing magic about a lambda: it is just a function that can be defined inline and has no name. You can use a named function wherever a lambda is expected, but here it also needs to take a single parameter (the row). In your second example, test(probPredDF['yTest'], probPredDF['yPred']) is evaluated immediately, before apply is even called, so the two whole Series are compared at once and you get the ambiguous-truth-value error. You need to do something like...
Define it as:
def wrapper(x):
    return test(x['yTest'], x['yPred'])
Use it as:
probPredDF.apply(wrapper, axis=1)
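A vectorized alternative that skips apply entirely (a sketch using numpy.where on the two columns):
import numpy as np

# elementwise comparison of the two columns, then pick a label per row
result = np.where(probPredDF['yTest'] == probPredDF['yPred'], 'equal', 'not equal')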

Pandas text matching like SQL's LIKE?

Is there a way to do something similar to SQL's LIKE syntax on a pandas text DataFrame column, such that it returns a list of indices, or a list of booleans that can be used for indexing the dataframe? For example, I would like to be able to match all rows where the column starts with 'prefix_', similar to WHERE <col> LIKE 'prefix_%' in SQL.
You can use the Series string method str.startswith (which does not take a regex):
In [11]: s = pd.Series(['aa', 'ab', 'ca', np.nan])
In [12]: s.str.startswith('a', na=False)
Out[12]:
0 True
1 True
2 False
3 False
dtype: bool
You can also do the same with str.contains (using a regex):
In [13]: s.str.contains('^a', na=False)
Out[13]:
0 True
1 True
2 False
3 False
dtype: bool
So you can do df[col].str.startswith...
See also the SQL comparison section of the docs.
Note (as pointed out by the OP): by default NaNs will propagate (and hence cause an indexing error if you want to use the result as a boolean mask); the na=False flag says that NaN should map to False.
In [14]: s.str.startswith('a') # can't use as boolean mask
Out[14]:
0 True
1 True
2 False
3 NaN
dtype: object
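To actually index a DataFrame with the result, as the question asks (assuming a made-up df with a text column col):
matched = df[df['col'].str.startswith('prefix_', na=False)]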
To find all the values in the series that start with the pattern "s":
SQL    - WHERE column_name LIKE 's%'
Python - column_name.str.startswith('s')
To find all the values in the series that end with the pattern "s":
SQL    - WHERE column_name LIKE '%s'
Python - column_name.str.endswith('s')
To find all the values in the series that contain the pattern "s":
SQL    - WHERE column_name LIKE '%s%'
Python - column_name.str.contains('s')
For more options, check: https://pandas.pydata.org/pandas-docs/stable/reference/series.html
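A quick check of the three mappings on a toy Series:
names = pd.Series(['sam', 'james', 'rose'])
names.str.startswith('s')   # LIKE 's%'  -> True, False, False
names.str.endswith('s')     # LIKE '%s'  -> False, True, False
names.str.contains('s')     # LIKE '%s%' -> True, True, True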
For case-insensitive matching, you can use:
s.str.contains('a', case=False)