A 'contains' masking operation - pandas

Suppose we have a Series like this:
In [8]: arr = pd.Series(['testing', 'the', 'masking'])

In [9]: arr
Out[9]:
0    testing
1        the
2    masking
dtype: object
Masking is handy:
In [10]: arr == 'testing'
Out[10]:
0     True
1    False
2    False
dtype: bool
To check whether 't' occurs in the individual strings, we have to iterate:
In [11]: [u for u in arr if 't' in u]
Out[11]: ['testing', 'the']
Is it possible to get this done with something like
arr contains 't'

It is possible with str.contains:
arr[arr.str.contains('t')]
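For reference, str.contains itself returns a boolean mask, just like the == comparison above (a minimal check using the arr defined earlier):
In [12]: arr.str.contains('t')
Out[12]:
0     True
1     True
2    False
dtype: bool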

Pandas: Fast way to get cols/rows containing na

In pandas we can drop cols/rows with .dropna(how=..., axis=...), but is there a way to get an array-like of True/False indicators for each col/row, indicating whether that col/row contains NA according to the how and axis arguments?
I.e. is there a way to convert .dropna(how=..., axis=...) into a method that, instead of actually removing anything, just tells us which cols/rows would be removed if we called .dropna(...) with the specific how and axis?
Thank you for your time!
You can use isna() to replicate the behaviour of dropna without actually removing data. To mimic the how and axis parameters, chain any() or all() and set the axis accordingly.
Here is a simple example:
import pandas as pd
df = pd.DataFrame([[pd.NA, pd.NA, 1], [pd.NA, pd.NA, pd.NA]])
df.isna()
Output:
      0     1      2
0  True  True  False
1  True  True   True
Eq. to dropna(how='any', axis=1), i.e. which columns contain at least one NA:
df.isna().any(axis=0)
Output:
0    True
1    True
2    True
dtype: bool
Eq. to dropna(how='any', axis=0), i.e. which rows contain at least one NA:
df.isna().any(axis=1)
Output:
0    True
1    True
dtype: bool
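To mimic how='all', swap in all(); a minimal sketch with the same df, equivalent to dropna(how='all', axis=0):
df.isna().all(axis=1)
Output:
0    False
1     True
dtype: bool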

Pandas create new column based on groupby and apply lambda if statement

I have an issue with groupby and apply:
df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'b'], 'B': np.r_[1:8]})
I want to create a column C that, per group, takes the value 1 if the z-score of B is greater than 2 and 0 otherwise. The code:
from scipy import stats
df['C'] = df.groupby('A').apply(lambda x: 1 if np.abs(stats.zscore(x['B'], nan_policy='omit')) > 2 else 0, axis=1)
However, this code fails and I cannot figure out the issue.
Use GroupBy.transform with a lambda function, then compare, and to convert True/False to 1/0 cast to integers:
from scipy import stats
s = df.groupby('A')['B'].transform(lambda x: np.abs(stats.zscore(x, nan_policy='omit')))
df['C'] = (s > 2).astype(int)
Or use numpy.where:
df['C'] = np.where(s > 2, 1, 0)
The error in your solution comes from evaluating per group:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: 1 if np.abs(stats.zscore(x, nan_policy='omit')) > 2 else 0)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Check the gotcha in the pandas docs:
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if-statement or when using the boolean operations: and, or, and not.
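For reference, the same error can be reproduced outside of groupby (a minimal sketch):
import numpy as np
mask = np.array([True, False, False])
# An if-statement needs a single bool, but mask has 3 elements:
if mask:
    pass
# ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()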
So use one of the following solutions instead of the if-else:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: (np.abs(stats.zscore(x, nan_policy='omit')) > 2).astype(int))
print (df)
A
a       [0, 0, 0]
b    [0, 0, 0, 0]
Name: B, dtype: object
but then you need to convert it back to a column; to avoid this problem, groupby.transform is used.
You can use groupby + apply with a function that finds the z-scores of the items in each group, explode the resulting lists, use gt to create a boolean Series, and convert it to dtype int:
df['C'] = df.groupby('A')['B'].apply(lambda x: stats.zscore(x, nan_policy='omit')).explode(ignore_index=True).abs().gt(2).astype(int)
Output:
   A  B  C
0  a  1  0
1  a  2  0
2  a  3  0
3  b  4  0
4  b  5  0
5  b  6  0
6  b  7  0

Update categories in two Series / Columns for comparison

If I try to compare two Series with different categories I get an error:
a = pd.Categorical([1, 2, 3])
b = pd.Categorical([4, 5, 3])
df = pd.DataFrame({'a': a, 'b': b})
   a  b
0  1  4
1  2  5
2  3  3
df.a == df.b
# TypeError: Categoricals can only be compared if 'categories' are the same.
What is the best way to update categories in both Series? Thank you!
My solution:
df['b'] = df.b.cat.add_categories(df.a.cat.categories.difference(df.b.cat.categories))
df['a'] = df.a.cat.add_categories(df.b.cat.categories.difference(df.a.cat.categories))
df.a == df.b
Output:
0    False
1    False
2     True
dtype: bool
One idea with union_categoricals:
from pandas.api.types import union_categoricals
union = union_categoricals([df.a, df.b]).categories
df['a'] = df.a.cat.set_categories(union)
df['b'] = df.b.cat.set_categories(union)
print (df.a == df.b)
0    False
1    False
2     True
dtype: bool
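As a quick sanity check, after set_categories both columns share the same categories (in order of appearance), which is why the comparison no longer raises:
print (list(df.a.cat.categories))
[1, 2, 3, 4, 5]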

Pandas equals gives False result even though it should be True

Let me generate a DataFrame:
df = pd.DataFrame([[1, 2], [2, 1]])
then I compare
df[0].equals(df[1].sort_values())
this gives False.
However, both df[0] and df[1].sort_values() give the same output:
0    1
1    2
Name: 0, dtype: int64
Why does equals give False? What is wrong?
There is a different order of index values, so if you create the same index, e.g. here by Series.reset_index with drop=True, it works like you expected:
a = df[0].equals(df[1].sort_values().reset_index(drop=True))
print (a)
True
Details:
print (df[0])
0    1
1    2
Name: 0, dtype: int64
print (df[1].sort_values())
1    1
0    2
Name: 1, dtype: int64
print (df[1].sort_values().reset_index(drop=True))
0    1
1    2
Name: 1, dtype: int64
You can also directly access Series values:
np.equal(df[0].values, df[1].sort_values().values)
array([ True, True])
np.equal(df[0].values, df[1].sort_values().values).all()
True
np.array_equal(df[0], df[1].sort_values())
True
As far as time performance is concerned, the second and third approaches are equivalent, while df[0].equals(df[1].sort_values().reset_index(drop=True)) is about 1.5x slower.

Recode a pandas.Series containing 0, 1, and NaN to False, True, and NaN

Suppose I have a Series with NaNs:
pd.Series([0, 1, None, 1])
I want to transform this to be equal to:
pd.Series([False, True, None, True])
You'd think x == 1 would suffice, but instead, this returns:
pd.Series([False, True, False, True])
where the null value has become False. This is because np.nan == 1 returns False, rather than None or np.nan as in R.
Is there a nice, vectorized way to get what I want?
Maybe map can do it:
import pandas as pd
x = pd.Series([0, 1, None, 1])
print(x.map({1: True, 0: False}))
0    False
1     True
2      NaN
3     True
dtype: object
You can use where:
In [11]: (x == 1).where(x.notnull(), np.nan)
Out[11]:
0    0.0
1    1.0
2    NaN
3    1.0
dtype: float64
Note: the True and False have been cast to float as 0 and 1.
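On newer pandas (1.0+), the nullable boolean dtype can hold the missing value directly; a minimal sketch, assuming pandas >= 1.0:
import pandas as pd

x = pd.Series([0, 1, None, 1])
# map to True/False, then cast to the nullable "boolean" dtype,
# which keeps the missing entry as <NA> instead of coercing it
print(x.map({1: True, 0: False}).astype('boolean'))
0    False
1     True
2     <NA>
3     True
dtype: boolean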