Conditional lambda apply across dataframe based on list equality - pandas

I have a dataframe df whose columns contain lists of strings:
df = A B
['-1'] , ['0','1','2']
['2','4','3'], ['2']
['3','8'] , ['-1']
I want to get the length of every list, except that any list equal to ['-1'] should become -1.
Expected output:
df = A B
-1, 3
3, 1
2, -1
I've tried
df.apply(lambda x: x.str.len() if not x == ['-1'] else -1)
and got the error ('Lengths must match to compare', (132,), (1,))
I have also tried
data_copy[colBeliefs] = data_copy[colBeliefs].apply(lambda x: x.str.len() if '-1' not in x else -1)
but this produces the wrong output where ['-1'] becomes 1 rather than -1
I'm not sure how to apply a function to a dataframe conditionally, based on whether an entry is equal to a given list or whether an item is in a list.
EDIT: Output of df.head().to_dict()
{'A': {0: ['-1'],
1: ['2','4','3'],
2: ['3','8']},
'B': {0: ['0','1','2'],
1: ['2'],
2: ['-1']}}

You could do:
df.applymap(lambda x: -1 if (ln:=len(x)) == 1 and x[0] == '-1' else ln)
A B
0 -1 3
1 3 1
2 2 -1
Edit:
If you are using Python < 3.8, where the := operator is not available, use the following:
df.applymap(lambda x: -1 if len(x) == 1 and x[0] == '-1' else len(x))
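For context, the first attempt in the question fails because df.apply hands each whole column to the lambda, so x == ['-1'] compares an entire Series with a one-element list. The sketch below rebuilds the sample frame from the to_dict() output in the question and applies the same element-wise rule through Series.map on each column, which is one way to avoid relying on applymap (deprecated since pandas 2.1 in favor of DataFrame.map). The helper name list_len_or_sentinel is made up for illustration.

```python
import pandas as pd

# Sample data taken from the to_dict() output in the question
df = pd.DataFrame({
    'A': [['-1'], ['2', '4', '3'], ['3', '8']],
    'B': [['0', '1', '2'], ['2'], ['-1']],
})

def list_len_or_sentinel(cell):
    # Hypothetical helper: -1 for the exact list ['-1'], otherwise the list length
    return -1 if cell == ['-1'] else len(cell)

# Series.map works element by element, so each cell is a plain Python list here
result = df.apply(lambda col: col.map(list_len_or_sentinel))
print(result)
```

Comparing the whole cell with ['-1'] is equivalent to the len(x) == 1 and x[0] == '-1' check above, since each cell is an ordinary list.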

Related

Pandas create new column base on groupby and apply lambda if statement

I have an issue with groupby and apply:
df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'b'], 'B': np.r_[1:8]})
I want to create a column C that, within each group, takes the value 1 if the absolute z-score of B is greater than 2 and 0 otherwise. The code:
from scipy import stats
df['C'] = df.groupby('A').apply(lambda x: 1 if np.abs(stats.zscore(x['B'], nan_policy='omit')) > 2 else 0, axis=1)
However, this code does not work and I cannot figure out the issue.
Use GroupBy.transform with a lambda function, then compare the result with the threshold and convert True/False to 1/0 by casting to integers:
from scipy import stats
s = df.groupby('A')['B'].transform(lambda x: np.abs(stats.zscore(x, nan_policy='omit')))
df['C'] = (s > 2).astype(int)
Or use numpy.where:
df['C'] = np.where(s > 2, 1, 0)
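The error in your solution arises because the lambda receives a whole group at a time, so the condition is an array rather than a single value: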
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: 1 if np.abs(stats.zscore(x, nan_policy='omit')) > 2 else 0)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This is a known gotcha described in the pandas docs:
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if-statement or when using the boolean operations: and, or, and not.
So if you use the vectorized comparison from the solutions above instead of if-else:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: (np.abs(stats.zscore(x, nan_policy='omit')) > 2).astype(int))
print (df)
A
a [0, 0, 0]
b [0, 0, 0, 0]
Name: B, dtype: object
but then the result still needs to be converted back into a column. GroupBy.transform is used to avoid this problem.
You can use groupby + apply with a function that finds the z-score of each item in each group, explode the resulting lists, use gt to create a boolean Series, and convert it to dtype int:
df['C'] = df.groupby('A')['B'].apply(lambda x: stats.zscore(x, nan_policy='omit')).explode(ignore_index=True).abs().gt(2).astype(int)
Output:
A B C
0 a 1 0
1 a 2 0
2 a 3 0
3 b 4 0
4 b 5 0
5 b 6 0
6 b 7 0
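With the sample data every z-score stays below 2, so C is all zeros. A minimal sketch on made-up data with one outlier shows the transform approach producing a 1. To keep the example free of the scipy dependency it computes the z-score by hand with ddof=0, which matches the default of scipy.stats.zscore; the data and the helper name abs_zscore are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: group 'a' contains one clear outlier, group 'b' does not
df = pd.DataFrame({
    'A': ['a'] * 7 + ['b'] * 3,
    'B': [1, 1, 1, 1, 1, 1, 100, 4, 5, 6],
})

def abs_zscore(values):
    # Population z-score (ddof=0), the same default as scipy.stats.zscore
    return ((values - values.mean()) / values.std(ddof=0)).abs()

# transform returns a Series aligned with df's index, so it can be assigned directly
z = df.groupby('A')['B'].transform(abs_zscore)
df['C'] = (z > 2).astype(int)
print(df)
```

Only the row with B == 100 is flagged, because its z-score is computed within group 'a' rather than over the whole column.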

How can I flatten the output dataframe of pandas crosstab from two series x and y into a series?

I have the following series x and y:
x = pd.Series(['a', 'b', 'a', 'c', 'c'], name='x')
y = pd.Series([1, 0, 1, 0, 0], name='y')
I call pd.crosstab to get the following dataframe as output:
pd.crosstab(x, y)
Output:
y 0 1
x
a 0 2
b 1 0
c 2 0
I want to transform this into a single series as follows:
x_a_y_0 0
x_a_y_1 2
x_b_y_0 1
x_b_y_1 0
x_c_y_0 2
x_c_y_1 0
For a specific dataframe like this one, I can construct this by visual inspection:
pd.Series(
    dict(
        x_a_y_0=0,
        x_a_y_1=2,
        x_b_y_0=1,
        x_b_y_1=0,
        x_c_y_0=2,
        x_c_y_1=0
    )
)
But given arbitrary series x and y, how do I generate the corresponding final output?
Use DataFrame.stack, then flatten the resulting MultiIndex with map:
s = pd.crosstab(x, y).stack()
s.index = s.index.map(lambda x: f'x_{x[0]}_y_{x[1]}')
print (s)
x_a_y_0 0
x_a_y_1 2
x_b_y_0 1
x_b_y_1 0
x_c_y_0 2
x_c_y_1 0
dtype: int64
It is also possible to take the prefixes from s.index.names (thank you, @SeaBean):
s.index = s.index.map(lambda x: f'{s.index.names[0]}_{x[0]}_{s.index.names[1]}_{x[1]}')
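The stack-and-rename steps can be wrapped in a small function that works for any pair of named Series. This is only a sketch; the function name flatten_crosstab is an assumption, and it relies on both inputs having a name so that the stacked index levels are labelled.

```python
import pandas as pd

def flatten_crosstab(x, y):
    # Hypothetical helper: crosstab two named Series and flatten to one Series
    s = pd.crosstab(x, y).stack()
    row_name, col_name = s.index.names
    # Each MultiIndex entry is a (row_label, column_label) tuple
    s.index = [f'{row_name}_{r}_{col_name}_{c}' for r, c in s.index]
    return s

x = pd.Series(['a', 'b', 'a', 'c', 'c'], name='x')
y = pd.Series([1, 0, 1, 0, 0], name='y')

flat = flatten_crosstab(x, y)
print(flat)
```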

Update categories in two Series / Columns for comparison

If I try to compare two Series with different categories I get an error:
a = pd.Categorical([1, 2, 3])
b = pd.Categorical([4, 5, 3])
df = pd.DataFrame({'a': a, 'b': b})
a b
0 1 4
1 2 5
2 3 3
df.a == df.b
# TypeError: Categoricals can only be compared if 'categories' are the same.
What is the best way to update categories in both Series? Thank you!
My solution:
df['b'] = df.b.cat.add_categories(df.a.cat.categories.difference(df.b.cat.categories))
df['a'] = df.a.cat.add_categories(df.b.cat.categories.difference(df.a.cat.categories))
df.a == df.b
Output:
0 False
1 False
2 True
dtype: bool
One idea with union_categoricals:
from pandas.api.types import union_categoricals
union = union_categoricals([df.a, df.b]).categories
df['a'] = df.a.cat.set_categories(union)
df['b'] = df.b.cat.set_categories(union)
print (df.a == df.b)
0 False
1 False
2 True
dtype: bool
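If this comes up for several column pairs, the union_categoricals idea can be wrapped in a helper. The sketch below assumes both Series are unordered categoricals, as in the question; the name align_categories is introduced here for illustration.

```python
import pandas as pd
from pandas.api.types import union_categoricals

def align_categories(left, right):
    # Hypothetical helper: give both categorical Series the same category set
    shared = union_categoricals([left, right]).categories
    return left.cat.set_categories(shared), right.cat.set_categories(shared)

df = pd.DataFrame({
    'a': pd.Categorical([1, 2, 3]),
    'b': pd.Categorical([4, 5, 3]),
})

df['a'], df['b'] = align_categories(df['a'], df['b'])
equal = df['a'] == df['b']
print(equal)
```

Because both columns end up with identical categories in the same order, the element-wise comparison no longer raises the TypeError.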

How to apply a multiplier to particular searched values in a dataframe

I have a table of values with two columns, say x and y. If a value in the y column equals 0, I need to apply a multiplier to the value in the x column, and vice versa. How would I go about doing this?
Thanks in advance.
I would use slicing on rows with .loc to modify each column:
import pandas as pd
df = pd.DataFrame({'x':[1,0,2,0], 'y':[1,3,0,4]})
df.loc[df['x'] == 0, 'y'] = df.loc[df['x'] == 0, 'y'] * 2
df.loc[df['y'] == 0, 'x'] = df.loc[df['y'] == 0, 'x'] * 2
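A variant of the same idea, as a sketch: compute both boolean masks up front and keep the multiplier in a variable. The value of factor is an assumption, since the question does not state one.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 0, 2, 0], 'y': [1, 3, 0, 4]})
factor = 2  # assumed multiplier; the question does not give a value

# Compute both masks before modifying anything, so the second update
# does not depend on the result of the first
x_is_zero = df['x'] == 0
y_is_zero = df['y'] == 0

df.loc[x_is_zero, 'y'] *= factor  # scale y where x is 0
df.loc[y_is_zero, 'x'] *= factor  # scale x where y is 0
print(df)
```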

Pandas apply function on multiple columns

I am trying to apply a function to every column in a dataframe. It works when I apply it to a single, fixed column name, but when I loop over the columns and pass the column name as an argument to the function, I get an error.
How do you properly pass arguments to apply a function on a data frame?
def result(row,c):
    if row[c] >=0 and row[c] <=1:
        return 'c'
    elif row[c] >1 and row[c] <=2:
        return 'b'
    else:
        return 'a'
cols = list(df.columns.values)
for c in cols:
    df[c] = df.apply(result, args = (c), axis=1)
TypeError: ('result() takes exactly 2 arguments (21 given)', u'occurred at index 0')
Input data frame format:
d = {'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]}
df = pd.DataFrame(data=d)
df
c1 c2
0 1 3
1 2 0
2 1 1
3 0 2
You don't need to pass the column name to apply. Since you only want to check whether each value falls in a certain range and return a, b or c accordingly, you can apply the function to each column directly with the following changes:
def result(val):
    if 0<=val<=1:
        return 'c'
    elif 1<val<=2:
        return 'b'
    return 'a'
cols = list(df.columns.values)
for c in cols:
    df[c] = df[c].apply(result)
Note that this will replace your column values.
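As a side note on the TypeError in the question: args = (c) is just the string c in parentheses, not a tuple, so the column name is unpacked character by character into extra arguments, which is consistent with the "21 given" in the message. A sketch of the row-wise version with a one-element tuple, using the sample frame from the question and writing into a new frame called out (a name introduced here):

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]})

def result(row, c):
    if 0 <= row[c] <= 1:
        return 'c'
    elif 1 < row[c] <= 2:
        return 'b'
    return 'a'

out = pd.DataFrame(index=df.index)
for c in df.columns:
    # (c,) with a trailing comma is a one-element tuple; (c) is only the string c
    out[c] = df.apply(result, args=(c,), axis=1)
print(out)
```

This keeps the original signature working, although applying the function to each column directly avoids the row-wise pass altogether.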
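A faster way is np.select:
import numpy as np
values = ['c', 'b']
for col in df.columns:
    # Chained comparisons such as 0 <= s <= 1 raise ValueError on a Series, so build each mask explicitly
    conditions = [df[col].between(0, 1), (df[col] > 1) & (df[col] <= 2)]
    df[col] = np.select(conditions, values, default='a')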
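One detail worth checking when using np.select on pandas columns: a chained comparison such as 0 <= s <= 1 does not work on a Series (it raises "The truth value of a Series is ambiguous"), so each condition needs Series.between or two comparisons joined with &. A minimal sketch on the sample frame from the question, writing the labels into a separate frame named labels (an assumed name) so the numeric input is preserved:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]})

labels = pd.DataFrame(index=df.index)
for col in df.columns:
    conditions = [
        df[col].between(0, 1),           # 0 <= value <= 1, inclusive on both ends
        (df[col] > 1) & (df[col] <= 2),  # 1 < value <= 2
    ]
    # The first matching condition wins; rows matching none get the default
    labels[col] = np.select(conditions, ['c', 'b'], default='a')
print(labels)
```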