Find how many common missing (NaN) values are in DataFrame columns - pandas

I have a DataFrame with about 20 columns and 50,000 rows. A small part of this pandas table is below:
I am looking for some way to count how many missing values are in the same positions (rows) in a few columns.
When the number of columns is known, simple code like this:
(df['HomeRemote'].isnull() & df['CompanySize'].isnull()).sum()
is probably the answer, but unfortunately the number of columns to compare could be more than 2. I don't know it in advance, because it depends on the situation, and that's why I am looking for something like a "universal" solution (working for any number of columns).
My idea is to find a way to "push" every df[col].isnull() into a for loop (where col is the name of a column), but I have a problem with putting '&' between every df[col].isnull().
Maybe someone here has some other possibility to consider?
If something is not clear enough, please let me know.

Try:
Sample input:
>>> df
A B C D E
0 NaN 1.0 1.0 1.0 1.0
1 NaN NaN NaN 1.0 1.0
2 1.0 1.0 1.0 NaN 1.0
3 1.0 1.0 1.0 1.0 NaN
4 1.0 1.0 NaN 1.0 1.0
5 NaN 1.0 1.0 1.0 1.0
6 1.0 1.0 1.0 NaN 1.0
7 1.0 NaN 1.0 NaN 1.0
8 1.0 1.0 1.0 1.0 1.0
9 1.0 1.0 1.0 1.0 1.0
How many missing values are in the same position in columns A, B, C:
>>> df[['A', 'B', 'C']].isnull().all(axis=1).sum()
1
Step by step:
# Find missing values
>>> df[['A', 'B', 'C']].isnull()
A B C
0 True False False
1 True True True # <- HERE
2 False False False
3 False False False
4 False False True
5 True False False
6 False False False
7 False True False
8 False False False
9 False False False
# Reduce
>>> df[['A', 'B', 'C']].isnull().all(axis=1)
0 False
1 True # <- HERE
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
dtype: bool
# Reduce again
>>> df[['A', 'B', 'C']].isnull().all(axis=1).sum()
1
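To realize the question's idea of chaining '&' over an unknown number of columns, functools.reduce works as well. A minimal sketch that rebuilds columns A-C of the sample above and reproduces the count of 1:
from functools import reduce

import numpy as np
import pandas as pd

# Rebuild columns A-C of the sample frame (NaN in the positions shown above)
df = pd.DataFrame({
    'A': [np.nan, np.nan, 1, 1, 1, np.nan, 1, 1, 1, 1],
    'B': [1, np.nan, 1, 1, 1, 1, 1, np.nan, 1, 1],
    'C': [1, np.nan, 1, 1, np.nan, 1, 1, 1, 1, 1],
})

cols = ['A', 'B', 'C']          # any number of columns works

# The question's idea: put '&' between every df[col].isnull()
mask = reduce(lambda left, right: left & right, (df[c].isnull() for c in cols))
print(mask.sum())               # 1

# Same count as the one-liner shown above
print(df[cols].isnull().all(axis=1).sum())  # 1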

Related

Pandas drop duplicates only for main index

I have a MultiIndex and I want to perform drop_duplicates on a per-level basis; I don't want to look at the entire DataFrame, but only drop a row if there is a duplicate with the same main index.
Example:
entry subentry    A    B
1 0 1.0 1.0
1 1.0 1.0
2 2.0 2.0
2 0 1.0 1.0
1 2.0 2.0
2 2.0 2.0
should return:
entry subentry    A    B
1 0 1.0 1.0
1 2.0 2.0
2 0 1.0 1.0
1 2.0 2.0
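For reference, the sample input can be rebuilt as a MultiIndexed DataFrame like this (a minimal sketch, so the snippets below can be run):
import pandas as pd

df = pd.DataFrame(
    {'A': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
     'B': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]},
    index=pd.MultiIndex.from_product([[1, 2], [0, 1, 2]],
                                     names=['entry', 'subentry']),
)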
Use MultiIndex.get_level_values with Index.duplicated to filter out the last row per entry with boolean indexing:
df1 = df[df.index.get_level_values('entry').duplicated(keep='last')]
print (df1)
A B
entry subentry
1 0 1.0 1.0
1 1.0 1.0
2 0 1.0 1.0
1 2.0 2.0
Or, if you need to remove duplicates per first level and columns, convert the first level to a column with DataFrame.reset_index; for filtering, invert the boolean mask with ~ and convert the Series to a NumPy array, because the indices of the mask and the original DataFrame do not match:
df2 = df[~df.reset_index(level=0).duplicated(keep='last').to_numpy()]
print (df2)
A B
entry subentry
1 1 1.0 1.0
2 2.0 2.0
2 0 1.0 1.0
2 2.0 2.0
Or create a helper column from the first level of the MultiIndex:
df2 = df[~df.assign(new=df.index.get_level_values('entry')).duplicated(keep='last')]
print (df2)
A B
entry subentry
1 1 1.0 1.0
2 2.0 2.0
2 0 1.0 1.0
2 2.0 2.0
Details:
print (df.reset_index(level=0))
entry A B
subentry
0 1 1.0 1.0
1 1 1.0 1.0
2 1 2.0 2.0
0 2 1.0 1.0
1 2 2.0 2.0
2 2 2.0 2.0
print (~df.reset_index(level=0).duplicated(keep='last'))
0 False
1 True
2 True
0 True
1 False
2 True
dtype: bool
print (df.assign(new=df.index.get_level_values('entry')))
A B new
entry subentry
1 0 1.0 1.0 1
1 1.0 1.0 1
2 2.0 2.0 1
2 0 1.0 1.0 2
1 2.0 2.0 2
2 2.0 2.0 2
print (~df.assign(new=df.index.get_level_values('entry')).duplicated(keep='last'))
entry subentry
1 0 False
1 True
2 True
2 0 True
1 False
2 True
dtype: bool
It looks like you want to drop_duplicates per group:
out = df.groupby(level=0, group_keys=False).apply(lambda d: d.drop_duplicates())
Or, a maybe more efficient variant using a temporary reset_index with duplicated and boolean indexing:
out = df[~df.reset_index('entry').duplicated().values]
Output:
A B
entry subentry
1 0 1.0 1.0
2 2.0 2.0
2 0 1.0 1.0
1 2.0 2.0

Slicing on an entire pandas DataFrame instead of a Series results in a change of data type and assignment of values to NaN, what is happening?

I was trying to do some cleaning on a dataset where, instead of providing a condition on a pandas Series:
head_only[head_only.BasePay > 70000]
I applied the condition to the DataFrame:
head_only[head_only > 70000]
I attached images of my observation; could anyone help me understand what is happening?
Your second solution raises an error if numeric columns are mixed with string columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2.0, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
print (df[df > 5])
TypeError: '>' not supported between instances of 'str' and 'int'
If we compare only the numeric columns, it keeps values greater than 4 and converts all other numbers to missing values:
df1 = df.select_dtypes(np.number)
print (df1[df1 > 4])
B C D E
0 NaN 7.0 NaN 5.0
1 5.0 8.0 NaN NaN
2 NaN 9.0 5.0 6.0
3 5.0 NaN 7.0 9.0
4 5.0 NaN NaN NaN
5 NaN NaN NaN NaN
Here at least one value in each column is replaced, so integer columns are converted to floats (because NaN is a float):
print (df1[df1 > 4].dtypes)
B float64
C float64
D float64
E float64
dtype: object
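This masking is how indexing a DataFrame with a boolean DataFrame works: it behaves like DataFrame.where, replacing values that fail the condition with NaN, which is why the original question saw NaNs appear and dtypes change. A short check, using df1 from above:
print(df1[df1 > 4].equals(df1.where(df1 > 4)))
True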
If you need to compare all numeric columns and keep a row if at least one of them matches the condition, use DataFrame.any to test whether at least one value is True:
#returned boolean DataFrame
print ((df1 > 7))
B C D E
0 False False False False
1 False True False False
2 False True False False
3 False False False True
4 False False False False
5 False False False False
print ((df1 > 7).any(axis=1))
0 False
1 True
2 True
3 True
4 False
5 False
dtype: bool
print (df1[(df1 > 7).any(axis=1)])
B C D E
1 5 8.0 3 3
2 4 9.0 5 6
3 5 4.0 7 9
Or, if you need to filter the original DataFrame with all of its columns, it is possible to build the mask from only the numeric columns with DataFrame.select_dtypes:
print (df[(df.select_dtypes(np.number) > 7).any(axis=1)])
A B C D E F
1 b 5 8.0 3 3 a
2 c 4 9.0 5 6 a
3 d 5 4.0 7 9 b

Reading a pandas DataFrame if a condition occurs in any row/column

I am attempting to read a DataFrame that has values in a random row/column order, and I would like to get a new column where all the values containing 'that' are collected.
Input:
0 1 2 3 4
0 this=1 that=2 who=2 was=3 where=5
1 that=4 who=5 this=1 was=3 where=2
2 was=2 who=7 this=7 that=3 where=7
3 was=3 who=4 this=7 that=1 where=8
4 that=1 who=3 this=4 was=1 where=3
Output:
0
0 that=2
1 that=4
2 that=3
3 that=1
4 that=1
I have been able to get the correct result with the following code, but with larger DataFrames it takes a long time to complete:
import pandas as pd

df1 = pd.DataFrame([['this=1', 'that=2', 'who=2', 'was=3', 'where=5'],
                    ['that=4', 'who=5', 'this=1', 'was=3', 'where=2'],
                    ['was=2', 'who=7', 'this=7', 'that=3', 'where=7'],
                    ['was=3', 'who=4', 'this=7', 'that=1', 'where=8'],
                    ['that=1', 'who=3', 'this=4', 'was=1', 'where=3']],
                   columns=[0, 1, 2, 3, 4])
df2 = pd.DataFrame()
for i in df1.index:
    data = [name for name in df1[i] if name[0:4] == 'that']
    df2 = df2.append(pd.DataFrame(data))
df1[df1.apply(lambda x: x.str.contains('that'))].stack()
Let's break this down:
df1.apply(lambda x: x.str.contains('that')) applies our lambda function to the entire DataFrame. In English it reads: if 'that' is in the value, return True.
0 1 2 3 4
0 False True False False False
1 True False False False False
2 False False False True False
3 False False False True False
4 True False False False False
Putting df1[...] around that will return the values instead of True/False:
0 1 2 3 4
0 NaN that=2 NaN NaN NaN
1 that=4 NaN NaN NaN NaN
2 NaN NaN NaN that=3 NaN
3 NaN NaN NaN that=1 NaN
4 that=1 NaN NaN NaN NaN
stack() will stack all the values into one Series. stack() gets rid of NA by default, which is what you needed.
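For illustration, the stacked result keeps a (row, column) MultiIndex; this is the extra index referred to next:
df1[df1.apply(lambda x: x.str.contains('that'))].stack()
0  1    that=2
1  0    that=4
2  3    that=3
3  3    that=1
4  0    that=1
dtype: object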
If the extra index is tripping you up, you can also reset the index of the resulting Series:
df1[df1.apply(lambda x: x.str.contains('that'))].stack().reset_index(drop=True)
0 that=2
1 that=4
2 that=3
3 that=1
4 that=1
dtype: object
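If a one-column DataFrame (like the desired output) is preferred over a Series, Series.to_frame can wrap the result; a small sketch:
out = df1[df1.apply(lambda x: x.str.contains('that'))].stack().reset_index(drop=True).to_frame(0)
print(out)
        0
0  that=2
1  that=4
2  that=3
3  that=1
4  that=1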

how to get the correct row data with certain restrictions in pandas?

I want to extract the correct rows based on certain conditions.
The DataFrame contains a column entry with the entry signals.
A valid entry occurs only when there is no order in the market. Therefore, only the first of two consecutive signals is valid.
A valid exit is 5 bars after the entry.
Here is my code and dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame({'entry':[0,1,0,1,0,0,0,1,0,0,0,0,0,0]})
df['exit'] = df['entry'].shift(5)
df['state'] = np.select([df['entry'] == 1, df['exit'] == 1], [1, 0], default=np.nan)
df['state'].ffill(inplace=True)
df['state'].fillna(value=0, inplace=True)
df['change'] = df['state'].diff()
print(df)
entrysig = df[df['change'].eq(1)]
exitsig = df[df['change'].eq(-1)]
tradelist = pd.DataFrame({'entry': entrysig.index, 'exit': exitsig.index})
tradelist['wantedexit'] = [6, 12]
print(tradelist)
The output is:
entry exit state change
0 0 NaN 0.0 NaN
1 1 NaN 1.0 1.0
2 0 NaN 1.0 0.0
3 1 NaN 1.0 0.0
4 0 NaN 1.0 0.0
5 0 0.0 1.0 0.0
6 0 1.0 0.0 -1.0
7 1 0.0 1.0 1.0
8 0 1.0 0.0 -1.0
9 0 0.0 0.0 0.0
10 0 0.0 0.0 0.0
11 0 0.0 0.0 0.0
12 0 1.0 0.0 0.0
13 0 0.0 0.0 0.0
entry exit wantedexit
0 1 6 6
1 7 8 12
In this example, the first trade (entered at bar 1, exited at bar 6) is correct: it enters at bar 1 and exits after 5 bars, which is bar 6.
The entry at bar 3 is ignored because there is already an order in the market, entered at bar 1.
The second trade (entered at bar 7, exited at bar 8) is not correct, because the trade only lasts 1 bar while my condition is to exit after 5 bars.
The exit at bar 8 is there because there is an invalid signal at bar 3.
The 'wantedexit' column should be the correct exit bar index.
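Since the stated rules are simple (enter only when flat, exit 5 bars later), a plain loop over the signals is one way to build the wanted trade list. A minimal sketch, not a vectorized solution, that reproduces the 'wantedexit' values for this sample:
import pandas as pd

df = pd.DataFrame({'entry': [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]})

entries, exits = [], []
exit_bar = -1                              # bar where the open trade exits; -1 means flat
for bar, signal in df['entry'].items():
    if signal == 1 and bar >= exit_bar:    # valid entry only when no order is in the market
        entries.append(bar)
        exit_bar = bar + 5                 # valid exit is 5 bars after the entry
        exits.append(exit_bar)

tradelist = pd.DataFrame({'entry': entries, 'exit': exits})
print(tradelist)
#    entry  exit
# 0      1     6
# 1      7    12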

how to calculate how many times a value is changed in a column

How can I calculate, in the easiest way, how many value changes there are in a specific DataFrame column? For example, I have the following DataFrame:
a b
0 1
1 1
2 1
3 2
4 1
5 2
6 2
7 3
8 3
9 3
In this DataFrame the values in column b changed 4 times (in rows 4, 5, 6 and 8).
My very simple solution is:
a = 0
for i in range(df.shape[0] - 1):
    if df['b'].iloc[i] != df['b'].iloc[i+1]:
        a += 1
I think you need boolean indexing with the index:
idx = df.index[df['b'].diff().shift().fillna(0).ne(0)]
print (idx)
Int64Index([4, 5, 6, 8], dtype='int64')
For a more general solution, it is possible to index with arange:
a = np.arange(len(df))[df['b'].diff().shift().bfill().ne(0)].tolist()
print (a)
[4, 5, 6, 8]
Explanation:
First get difference by Series.diff:
print (df['b'].diff())
0 NaN
1 0.0
2 0.0
3 1.0
4 -1.0
5 1.0
6 0.0
7 1.0
8 0.0
9 0.0
Name: b, dtype: float64
Then shift by one value:
print (df['b'].diff().shift())
0 NaN
1 NaN
2 0.0
3 0.0
4 1.0
5 -1.0
6 1.0
7 0.0
8 1.0
9 0.0
Name: b, dtype: float64
Replace first NaNs by fillna:
print (df['b'].diff().shift().fillna(0))
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 -1.0
6 1.0
7 0.0
8 1.0
9 0.0
Name: b, dtype: float64
And compare for not equal to 0:
print (df['b'].diff().shift().fillna(0).ne(0))
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
9 False
Name: b, dtype: bool
If a is a column and not the index:
idx = df['a'].loc[df['b'].diff().shift().fillna(0) != 0]
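If only the number of changes is needed, rather than the row labels, the same boolean mask can be summed directly; a small sketch:
print(df['b'].diff().shift().fillna(0).ne(0).sum())
4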