Pandas: create new column based on groupby and apply lambda if statement

I have an issue with groupby and apply:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'b'], 'B': np.r_[1:8]})
I want to create a column C that, within each group, takes the value 1 if the z-score of B exceeds 2 and 0 otherwise. The code:
from scipy import stats
df['C'] = df.groupby('A').apply(lambda x: 1 if np.abs(stats.zscore(x['B'], nan_policy='omit')) > 2 else 0, axis=1)
However, the code fails and I cannot figure out the issue.

Use GroupBy.transform with a lambda function, then compare against the threshold, and convert the True/False values to 1/0 by casting to integers:
from scipy import stats
s = df.groupby('A')['B'].transform(lambda x: np.abs(stats.zscore(x, nan_policy='omit')))
df['C'] = (s > 2).astype(int)
Or use numpy.where:
df['C'] = np.where(s > 2, 1, 0)
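For reference, running the transform solution end to end on the sample frame gives all zeros in C, since every within-group z-score here stays below 2:

print(df)
   A  B  C
0  a  1  0
1  a  2  0
2  a  3  0
3  b  4  0
4  b  5  0
5  b  6  0
6  b  7  0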
The error in your solution comes from evaluating the if statement per group, on a whole array of z-scores:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: 1 if np.abs(stats.zscore(x, nan_policy='omit')) > 2 else 0)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This gotcha is covered in the pandas docs:
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if-statement or when using the boolean operations: and, or, and not.
So use a vectorized comparison instead of the if-else:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: (np.abs(stats.zscore(x, nan_policy='omit')) > 2).astype(int))
print (df)
A
a [0, 0, 0]
b [0, 0, 0, 0]
Name: B, dtype: object
but then the result still needs to be converted back into a column; groupby.transform is used to avoid this problem.
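To see why transform fits here: it returns a Series aligned to the original index, so the per-group z-scores can be compared and assigned without any reshaping. On the sample frame the intermediate values are (a quick check, not part of the original answer):

print(s)
0    1.224745
1    0.000000
2    1.224745
3    1.341641
4    0.447214
5    0.447214
6    1.341641
Name: B, dtype: float64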

You can use groupby + apply with a function that computes the z-scores of each item in each group, explode the resulting lists, then use gt to create a boolean series and convert it to dtype int:
df['C'] = df.groupby('A')['B'].apply(lambda x: stats.zscore(x, nan_policy='omit')).explode(ignore_index=True).abs().gt(2).astype(int)
Output:
A B C
0 a 1 0
1 a 2 0
2 a 3 0
3 b 4 0
4 b 5 0
5 b 6 0
6 b 7 0
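Both answers should agree on the sample frame; a quick sanity check (a sketch, assuming the same pandas explode behavior as in the answer above):

s1 = df.groupby('A')['B'].transform(lambda x: np.abs(stats.zscore(x, nan_policy='omit'))).gt(2).astype(int)
s2 = df.groupby('A')['B'].apply(lambda x: stats.zscore(x, nan_policy='omit')).explode(ignore_index=True).abs().gt(2).astype(int)
assert s1.eq(s2).all()  # both mark no outliers in this sample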

Related

How to add new columns to vaex dataframe?
When I try to assign a list object to the dataframe, as is done in pandas, I receive the following error:
ValueError: [1, 1, 1, 1, 1, 1, 1] is not of string or Expression type, but <class 'list'>
Simple: convert the list object to a numpy array; I guess that's what they define as an Expression type:
import numpy as np

a = [1] * 7
a = np.array(a)
sub["new"] = a  # sub is the existing vaex dataframe
sub
Let us first create a data frame using the Vaex package:
import vaex
import numpy as np
x = np.arange(4)
y = x*2
df = vaex.from_arrays(x=x, y=y)
df
output:
# x y
0 0 0
1 1 2
2 2 4
3 3 6
Now, if you would like to add a new column called greeting:
df['greeting'] = ['hi', 'أهلا', 'hola', 'bonjour']
you will get this error:
ValueError: ['hi', 'أهلا', 'hola', 'bonjour'] is not of string or Expression type, but <class 'list'>
To handle this problem, please use the following code:
df['text'] = np.asanyarray(['hi', 'أهلا', 'hola', 'bonjour'])
df
output:
# x y text
0 0 0 hi
1 1 2 أهلا
2 2 4 hola
3 3 6 bonjour
Enjoy!
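If you prefer an explicit method over item assignment, vaex also has add_column; a minimal sketch, assuming your vaex version provides it:

df.add_column('text', np.asanyarray(['hi', 'أهلا', 'hola', 'bonjour']))  # same effect as the assignment above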

How can I flatten the output dataframe of pandas crosstab from two series x and y into a series?

I have the following series x and y:
x = pd.Series(['a', 'b', 'a', 'c', 'c'], name='x')
y = pd.Series([1, 0, 1, 0, 0], name='y')
I call pd.crosstab to get the following dataframe as output:
pd.crosstab(x, y)
Output:
y 0 1
x
a 0 2
b 1 0
c 2 0
I want to transform this into a single series as follows:
x_a_y_0 0
x_a_y_1 2
x_b_y_0 1
x_b_y_1 0
x_c_y_0 2
x_c_y_1 0
For a specific dataframe like this one, I can construct this by visual inspection:
pd.Series(
    dict(
        x_a_y_0=0,
        x_a_y_1=2,
        x_b_y_0=1,
        x_b_y_1=0,
        x_c_y_0=2,
        x_c_y_1=0
    )
)
But given arbitrary series x and y, how do I generate the corresponding final output?
Use DataFrame.stack, then change the MultiIndex with map:
s = pd.crosstab(x, y).stack()
s.index = s.index.map(lambda x: f'x_{x[0]}_y_{x[1]}')
print (s)
x_a_y_0 0
x_a_y_1 2
x_b_y_0 1
x_b_y_1 0
x_c_y_0 2
x_c_y_1 0
dtype: int64
It is also possible to pass s.index.names, thanks to @SeaBean:
s.index = s.index.map(lambda x: f'{s.index.names[0]}_{x[0]}_{s.index.names[1]}_{x[1]}')
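Wrapped up as a small helper (flatten_crosstab is a hypothetical name, not from the answer), the same idea works for arbitrary series, since pd.crosstab names the index levels after its inputs:

def flatten_crosstab(x, y):
    # crosstab, stack to long form, then join level names and values into flat labels
    s = pd.crosstab(x, y).stack()
    names = s.index.names
    s.index = s.index.map(lambda t: f'{names[0]}_{t[0]}_{names[1]}_{t[1]}')
    return s

print(flatten_crosstab(x, y))  # reproduces the output above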

How to show rows with data which are not equal?

I have two tables:
import pandas as pd
import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df1 = pd.DataFrame(np.array([[1, 2, 4], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
print(df1.equals(df2))
I want to compare them and get the same result as df1.compare(df2), or at least something close to it. I can't use that function because my environment reports that 'DataFrame' object has no attribute 'compare'.
First approach:
Let's compare value by value:
In [1183]: eq_df = df1.eq(df2)
In [1196]: eq_df
Out[1200]:
a b c
0 True True False
1 True True True
2 True True True
Then let's reduce it down to see which rows are equal for all columns
from functools import reduce
In [1285]: eq_ser = reduce(np.logical_and, (eq_df[c] for c in eq_df.columns))
In [1288]: eq_ser
Out[1293]:
0 False
1 True
2 True
dtype: bool
Now we can print out the rows which are not equal
In [1310]: df1[~eq_ser]
Out[1315]:
a b c
0 1 2 4
In [1316]: df2[~eq_ser]
Out[1316]:
a b c
0 1 2 3
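The reduce step can also be written more directly with DataFrame.all, which ands each row across the columns:

eq_ser = eq_df.all(axis=1)  # True where every column matches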
Second approach:
from collections import namedtuple
from typing import Tuple

def diff_dataframes(
    df1, df2, compare_cols=None
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Given two dataframes and column(s) to compare, return three dataframes with rows:
    - common between the two dataframes
    - found only in the left dataframe
    - found only in the right dataframe
    """
    df1 = df1.fillna(pd.NA)
    df = df1.merge(df2.fillna(pd.NA), how="outer", on=compare_cols, indicator=True)
    df_both = df.loc[df["_merge"] == "both"].drop(columns="_merge")
    df_left = df.loc[df["_merge"] == "left_only"].drop(columns="_merge")
    df_right = df.loc[df["_merge"] == "right_only"].drop(columns="_merge")
    tup = namedtuple("df_diff", ["common", "left", "right"])
    return tup(df_both, df_left, df_right)
Usage:
In [1366]: b, l, r = diff_dataframes(df1, df2)
In [1371]: l
Out[1371]:
a b c
0 1 2 4
In [1372]: r
Out[1372]:
a b c
3 1 2 3
Third approach:
In [1440]: eq_ser = df1.eq(df2).sum(axis=1).eq(len(df1.columns))
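This counts the matching cells per row and marks rows where the count equals the number of columns, so it yields the same mask as the first approach:

df1[~eq_ser]  # same differing row as before: a=1, b=2, c=4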

Pandas apply function on multiple columns

I am trying to apply a function to every column in a dataframe. When I apply it to a single fixed column name it works, but when I try passing the column name as an argument to the function I get an error.
How do you properly pass arguments to apply a function on a data frame?
def result(row, c):
    if row[c] >= 0 and row[c] <= 1:
        return 'c'
    elif row[c] > 1 and row[c] <= 2:
        return 'b'
    else:
        return 'a'

cols = list(df.columns.values)
for c in cols:
    df[c] = df.apply(result, args=(c), axis=1)
TypeError: ('result() takes exactly 2 arguments (21 given)', u'occurred at index 0')
Input data frame format:
d = {'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]}
df = pd.DataFrame(data=d)
df
c1 c2
0 1 3
1 2 0
2 1 1
3 0 2
You don't need to pass the column name to apply. Since you only want to check whether the values of each column fall in certain ranges and return 'a', 'b', or 'c', you can make the following changes.
def result(val):
    if 0 <= val <= 1:
        return 'c'
    elif 1 < val <= 2:
        return 'b'
    return 'a'

cols = list(df.columns.values)
for c in cols:
    df[c] = df[c].apply(result)
Note that this will replace your column values.
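Since the function is elementwise, you can also skip the loop and apply it to the whole frame at once; a sketch using applymap (spelled DataFrame.map in pandas 2.1+):

df = df.applymap(result)  # applies result to every cell in every column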
A faster way is np.select:
import numpy as np
values = ['c', 'b']
for col in df.columns:
    # chained comparisons like 0 <= s <= 1 are ambiguous on a Series, so use between/boolean masks
    df[col] = np.select([df[col].between(0, 1), (df[col] > 1) & (df[col] <= 2)], values, default='a')
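With the sample input, either version produces:

  c1 c2
0  c  a
1  b  c
2  c  c
3  c  b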

How to use the values of one column to access values in another column?

How can I use the values of one column to access values in another column?
import numpy
import pandas
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=[['Value']])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
So how do I access, for each row, the 'Value' at the position given by 'bleh'?
df.Value.iloc[df['bleh']]
Edit:
Thanks to @ScottBoston. My DF constructor had one layer of [] too many.
The correct answer is:
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
df['idx_int'] = range(df.shape[0])
df['haa'] = df['idx_int'] - df.bleh.values
df['newcol'] = df.Value.iloc[df['haa'].values].values
Try:
df['Value'].tolist()
Output:
[-1.0856306033005612,
0.9973454465835858,
0.28297849805199204,
-1.506294713918092,
-0.5786002519685364,
1.651436537097151,
-2.426679243393074,
-0.42891262885617726,
1.265936258705534,
-0.8667404022651017]
Your dataframe constructor still needs to be fixed.
Are you looking for:
df.set_index('bleh')
output:
Value
bleh
0 -1.085631
1 0.997345
2 0.282978
1 -1.506295
4 -0.578600
0 1.651437
0 -2.426679
4 -0.428913
1 1.265936
7 -0.866740
If so, your dataframe constructor has an extra set of [] in it.
np.random.seed(123)
df = pd.DataFrame((np.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: np.random.randint(0, x + 1, 1)[0])
The columns parameter in DataFrame takes a list, not a list of lists.
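With the constructor fixed, the row-wise lookup the question asks for can be done positionally; a minimal sketch, assuming bleh holds valid positional indices (bleh_value is a hypothetical column name):

df['bleh_value'] = df['Value'].to_numpy()[df['bleh'].to_numpy()]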