Don't compare missing / NaN values - pandas

How can I compare two Series while keeping NaN values in the result? For example:
s1 = pd.Series([np.nan, 1, 3])
s2 = pd.Series([0, 2, 3])
s1.eq(s2).astype(int)
Output:
0 0
1 0
2 1
dtype: int64
Desired result:
0 NaN
1 0.0
2 1.0
dtype: float64

Try this if you do not need to keep a boolean dtype:
s1.eq(s2).mask(s1.isna() | s2.isna())
or this if you want a nullable boolean dtype:
s1.eq(s2).mask(s1.isna() | s2.isna()).astype("boolean")
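For reference, a quick check on the sample data above. Adding .astype(int) before the mask (my addition, not part of the answer) makes the masked result upcast to float64 and match the desired output exactly; the nullable-boolean variant needs pandas >= 1.0:
s1.eq(s2).astype(int).mask(s1.isna() | s2.isna())
0    NaN
1    0.0
2    1.0
dtype: float64
s1.eq(s2).mask(s1.isna() | s2.isna()).astype("boolean")
0     <NA>
1    False
2     True
dtype: boolean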

Related

Why do I get warning with this function?

I am trying to generate a new column containing boolean values of whether a value of each row is Null or not. I wrote the following function,
def not_null(row):
    null_list = []
    for value in row:
        null_list.append(pd.isna(value))
    return null_list
df['not_null'] = df.apply(not_null, axis=1)
But I get the following warning message,
A value is trying to be set on a copy of a slice from a DataFrame.
Is there a better way to write this function?
Note: I want to be able to apply this function to each row regardless of knowing the header row name or not
Final output ->
Column1 | Column2 | Column3 | null_idx
NaN | NaN | NaN | [0, 1, 2]
1 | 23 | 34 | []
test1 | NaN | NaN | [1, 2]
First, the warning means there is some filtering earlier in your code, so you need DataFrame.copy:
df = df[df['col'].gt(100)].copy()
Then, your solution can be improved:
df = pd.DataFrame({'a': [np.nan, 1, np.nan],
                   'b': [np.nan, 4, 6],
                   'c': [4, 5, 3]})
df['list_boolean_for_missing'] = [x[x].tolist() for x in df.isna().to_numpy()]
print (df)
a b c list_boolean_for_missing
0 NaN NaN 4 [True, True]
1 1.0 4.0 5 []
2 NaN 6.0 3 [True]
Your function is equivalent to this lambda, applied row-wise:
dd = lambda x: [pd.isna(value) for value in x]
df['list_boolean_for_missing'] = df.apply(dd, axis=1)
If you need, as stated in the question, "a new column containing boolean values of whether a value of each row is Null or not":
df['not_null'] = df.notna().all(axis=1)
print (df)
a b c not_null
0 NaN NaN 4 False
1 1.0 4.0 5 True
2 NaN 6.0 3 False
EDIT: For a list of positions, create a helper array with np.arange and filter it:
arr = np.arange(len(df.columns))
df['null_idx'] = [arr[x].tolist() for x in df.isna().to_numpy()]
print (df)
a b c null_idx
0 NaN NaN 4 [0, 1]
1 1.0 4.0 5 []
2 NaN 6.0 3 [0]
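For completeness, the same idea applied to a frame shaped like the question's final output (column names and values are taken from the question; this is only an illustration, not part of the original answer):
df = pd.DataFrame({'Column1': [np.nan, 1, 'test1'],
                   'Column2': [np.nan, 23, np.nan],
                   'Column3': [np.nan, 34, np.nan]})
arr = np.arange(len(df.columns))
df['null_idx'] = [arr[x].tolist() for x in df.isna().to_numpy()]
print (df)
  Column1  Column2  Column3   null_idx
0     NaN      NaN      NaN  [0, 1, 2]
1       1     23.0     34.0         []
2   test1      NaN      NaN     [1, 2]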

How to build a window through number positive and negative ranges in dataframe column?

I would like to have the average value and the max value of every positive and negative range.
From sample data below:
import pandas as pd
test_list = [-1, -2, -3, -2, -1, 1, 2, 3, 2, 1, -1, -4, -5, 2 ,4 ,7 ]
df_test = pd.DataFrame(test_list, columns=['value'])
Which gives me a dataframe like this:
value
0 -1
1 -2
2 -3
3 -2
4 -1
5 1
6 2
7 3
8 2
9 1
10 -1
11 -4
12 -5
13 2
14 4
15 7
I would like to get something like this:
AVG1 = sum([-1, -2, -3, -2, -1]) / 5 = -1.8
Max1 = -3
AVG2 = sum([1, 2, 3, 2, 1]) / 5 = 1.8
Max2 = 3
AVG3 = sum([2, 4, 7]) / 3 = 4.3
Max3 = 7
If the solution needs a new column or a new dataframe, that is OK for me.
I know that I can use .mean like here:
pandas get column average/mean with round value
But that solution gives me the average of all positive and all negative values.
How can I build some kind of window so that I can calculate the average of the first negative group, then of the next positive group, and so on?
Regards
You can create a Series with np.sign to distinguish the positive and negative groups, compare it with its shifted values and take the cumulative sum to label consecutive runs, and then aggregate with mean and max:
s = np.sign(df_test['value'])
g = s.ne(s.shift()).cumsum()
df = df_test.groupby(g)['value'].agg(['mean','max'])
print (df)
mean max
value
1 -1.800000 -1
2 1.800000 3
3 -3.333333 -1
4 4.333333 7
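For reference, this is what the helper grouping Series g looks like on the sample data (just an illustration; s and g are the same variables as above). Each consecutive run of one sign gets its own label, so groupby(g) aggregates per run:
print (g.tolist())
[1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4]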
EDIT:
To find the local extremes, the solution from this answer is used:
test_list = [-1, -2, -3, -2, -1, 1, 2, 3, 2, 1, -1, -4, -5, 2 ,4 ,7 ]
df_test = pd.DataFrame(test_list, columns=['value'])
from scipy.signal import argrelextrema
#https://stackoverflow.com/a/50836425
n=2 # number of points to be checked before and after
# Find local peaks
df_test['min'] = df_test.iloc[argrelextrema(df_test.value.values, np.less_equal, order=n)[0]]['value']
df_test['max'] = df_test.iloc[argrelextrema(df_test.value.values, np.greater_equal, order=n)[0]]['value']
Then values after the extremes are replaced with missing values, separately for the negative and positive groups:
s = np.sign(df_test['value'])
g = s.ne(s.shift()).cumsum()
df_test[['min1','max1']] = df_test[['min','max']].notna().astype(int).iloc[::-1].groupby(g[::-1]).cumsum()
df_test['min1'] = df_test['min1'].where(s.eq(-1) & df_test['min1'].ne(0))
df_test['max1'] = df_test['max1'].where(s.eq(1) & df_test['max1'].ne(0))
df_test['g'] = g
print (df_test)
value min max min1 max1 g
0 -1 NaN -1.0 1.0 NaN 1
1 -2 NaN NaN 1.0 NaN 1
2 -3 -3.0 NaN 1.0 NaN 1
3 -2 NaN NaN NaN NaN 1
4 -1 NaN NaN NaN NaN 1
5 1 NaN NaN NaN 1.0 2
6 2 NaN NaN NaN 1.0 2
7 3 NaN 3.0 NaN 1.0 2
8 2 NaN NaN NaN NaN 2
9 1 NaN NaN NaN NaN 2
10 -1 NaN NaN 1.0 NaN 3
11 -4 NaN NaN 1.0 NaN 3
12 -5 -5.0 NaN 1.0 NaN 3
13 2 NaN NaN NaN 1.0 4
14 4 NaN NaN NaN 1.0 4
15 7 NaN 7.0 NaN 1.0 4
So it is possible to separately aggregate the last 3 values per group with a lambda function and mean; rows with missing values in min1 or max1 are removed by default by groupby:
df1 = df_test.groupby(['g','min1'])['value'].agg(lambda x: x.tail(3).mean())
print (df1)
g min1
1 1.0 -2.000000
3 1.0 -3.333333
Name: value, dtype: float64
df2 = df_test.groupby(['g','max1'])['value'].agg(lambda x: x.tail(3).mean())
print (df2)
g max1
2 1.0 2.000000
4 1.0 4.333333
Name: value, dtype: float64
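If both results are wanted in a single Series indexed by group, they can be combined (a small follow-up, not part of the original answer):
out = pd.concat([df1.droplevel(1), df2.droplevel(1)]).sort_index()
print (out)
g
1   -2.000000
2    2.000000
3   -3.333333
4    4.333333
Name: value, dtype: float64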

filter dataframe rows based on length of column values

I have a pandas dataframe as follows:
df = pd.DataFrame([ [1,2], [np.NaN,1], ['test string1', 5]], columns=['A','B'] )
df
A B
0 1 2
1 NaN 1
2 test string1 5
I am using pandas 0.20. What is the most efficient way to remove any rows where 'any' of its column values has length > 10?
len('test string1')
12
So for the above e.g., I am expecting an output as follows:
df
A B
0 1 2
1 NaN 1
If based on column A
In [865]: df[~(df.A.str.len() > 10)]
Out[865]:
A B
0 1 2
1 NaN 1
If based on all columns
In [866]: df[~df.applymap(lambda x: len(str(x)) > 10).any(axis=1)]
Out[866]:
A B
0 1 2
1 NaN 1
I had to cast to a string for Diego's answer to work:
df = df[df['A'].apply(lambda x: len(str(x)) <= 10)]
In [42]: df
Out[42]:
A B C D
0 1 2 2 2017-01-01
1 NaN 1 NaN 2017-01-02
2 test string1 5 test string1test string1 2017-01-03
In [43]: df.dtypes
Out[43]:
A object
B int64
C object
D datetime64[ns]
dtype: object
In [44]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(1)]
Out[44]:
A B C D
0 1 2 2 2017-01-01
1 NaN 1 NaN 2017-01-02
Explanation:
df.select_dtypes(['object']) selects only columns of object (str) dtype:
In [45]: df.select_dtypes(['object'])
Out[45]:
A C
0 1 2
1 NaN NaN
2 test string1 test string1test string1
In [46]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10))
Out[46]:
A C
0 False False
1 False False
2 True True
now we can "aggregate" it as follows:
In [47]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)
Out[47]:
0 False
1 False
2 True
dtype: bool
finally we can select only those rows where value is False:
In [48]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[48]:
A B C D
0 1 2 2 2017-01-01
1 NaN 1 NaN 2017-01-02
Use the apply function of the Series in order to keep the rows you want:
df = df[df['A'].apply(lambda x: len(x) <= 10)]
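For context on why the string cast mentioned above is needed (a quick illustration, not from the original answers): len() raises a TypeError on the float NaN in column A, so the bare len(x) version fails on row 1, while len(str(x)) works because str(np.nan) is the 3-character string 'nan':
df['A'].apply(lambda x: len(str(x)) <= 10)
0     True
1     True
2    False
Name: A, dtype: bool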

Using scalar values in series as variables in user defined function

I want to define a function that is applied element-wise for each row in a dataframe, comparing each element to a scalar value in a separate series. I started with the function below.
def greater_than(array, value):
    g = array[array >= value].count(axis=1)
    return g
But it is applying the mask along axis 0 and I need it to apply it along axis 1. What can I do?
e.g.
In [3]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [4]: df
Out[4]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [26]: s
Out[26]: array([ 1, 1000, 1000, 1000])
In [25]: greater_than(df,s)
Out[25]:
0 0
1 1
2 1
3 1
dtype: int64
In [27]: g = df[df >= s]
In [28]: g
Out[28]:
0 1 2 3
0 NaN NaN NaN NaN
1 4.0 NaN NaN NaN
2 8.0 NaN NaN NaN
3 12.0 NaN NaN NaN
The result should look like:
In [29]: greater_than(df,s)
Out[29]:
0 3
1 0
2 0
3 0
dtype: int64
as 1, 2, and 3 are all >= 1 and none of the remaining values are greater than or equal to 1000.
Your best bet may be to do some transposes (no copies are made, if that's a concern)
In [164]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [165]: s = np.array([ 1, 1000, 1000, 1000])
In [171]: df.T[(df.T>=s)].T
Out[171]:
0 1 2 3
0 NaN 1.0 2.0 3.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In [172]: df.T[(df.T>=s)].T.count(axis=1)
Out[172]:
0 3
1 0
2 0
3 0
dtype: int64
You can also just sum the mask directly, if the count is all you're after.
In [173]: (df.T>=s).sum(axis=0)
Out[173]:
0 3
1 0
2 0
3 0
dtype: int64
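If you prefer to avoid the transposes, the same row-wise comparison can be written with the axis argument of DataFrame.ge, which broadcasts s along the index (an alternative sketch, not from the original answer):
df.ge(s, axis=0).sum(axis=1)
0    3
1    0
2    0
3    0
dtype: int64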

Contingency table of pandas Series with null values

Suppose you have:
import pandas as pd
x = pd.Series(["A", "B", "A", "A", None, "B", "A", None], dtype = "category")
y = pd.Series([ 1, 2, 3, None, 1, 2, 3, 2])
If you do pd.crosstab(x, y, dropna = False), you get:
col_0 1.0 2.0 3.0
row_0
A 1 0 2
B 0 2 0
which omits the three (x, y) pairs for which one of the values is null. (The parameter dropna is misleadingly named.) How can I create a contingency table that includes these values, like the table below?
col_0 1.0 2.0 3.0 NaN
row_0
A 1 0 2 1
B 0 2 0 0
NaN 1 1 0 0
Would converting the NaN to a string work?
pd.crosstab(x.replace(np.nan, 'NaN'), y.replace(np.nan, 'NaN'), dropna=False)
Result:
col_0 1.0 2.0 3.0 NaN
row_0
A 1 0 2 1
B 0 2 0 0
NaN 1 1 0 0
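A variation on the same idea that avoids calling replace on the categorical column: cast it to object first and use fillna (just another spelling, not from the original answer; it should produce the same table as above):
pd.crosstab(x.astype(object).fillna('NaN'), y.fillna('NaN'), dropna=False)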