Pandas easy API to find out all inf or nan cells? - pandas

I searched Stack Overflow for this, but all the answers I found are quite complex.
I want to output the row and column of every cell that is inf or NaN.

You can replace np.inf with missing values and test for them with DataFrame.isna. Then use DataFrame.any to find the rows and columns that contain at least one True, and pass both masks to DataFrame.loc to get the sub-DataFrame:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,np.inf],
'C':[7,np.nan,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[np.inf,3,6,9,2,np.nan],
'F':list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7.0 1 inf a
1 b 5.0 NaN 3 3.0 a
2 c 4.0 9.0 5 6.0 a
3 d 5.0 4.0 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f inf 3.0 0 NaN b
m = df.replace(np.inf, np.nan).isna()
print (m)
A B C D E F
0 False False False False True False
1 False False True False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
5 False True False False True False
df = df.loc[m.any(axis=1), m.any()]
print (df)
B C E
0 4.0 7.0 inf
1 5.0 NaN 3.0
5 inf 3.0 NaN
Or, if you need the index and column names in a DataFrame, use DataFrame.stack with Index.to_frame:
s = df.replace(np.inf, np.nan).stack(dropna=False)
df1 = s[s.isna()].index.to_frame(index=False)
print (df1)
0 1
0 0 E
1 1 C
2 5 B
3 5 E
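Note that replace(np.inf, np.nan) only catches positive infinity. Here is a minimal sketch, assuming you also want -inf flagged and only the numeric columns matter. The sample data is the frame above with one -inf swapped in for illustration, and the names num, bad and locations are just placeholders:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, np.inf],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [-np.inf, 3, 6, 9, 2, np.nan],  # -inf added for illustration
    'F': list('aaabbb'),
})

# Restrict to numeric columns so np.isfinite is well defined.
num = df.select_dtypes(include='number')

# True where a cell is NaN, +inf or -inf.
bad = ~np.isfinite(num.to_numpy(dtype=float))

# Positions of the True cells, in row-major order.
rows, cols = np.where(bad)
locations = list(zip(num.index[rows].tolist(), num.columns[cols].tolist()))
print(locations)
```

This returns plain (row label, column label) pairs, which may be easier to loop over than a DataFrame.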

Related

How to get count for the non duplicates in column

This is my code to count the duplicates. How do I negate it, i.e. count the non-duplicates?
df.duplicated(subset='col', keep='last').sum()
len(df['col'])-len(df['col'].drop_duplicates())
I think you need DataFrame.duplicated with keep=False to mark all duplicates, then invert the mask and use sum to count the True values:
df = pd.DataFrame({'col':[1,2,2,3,3,3,4,5,5]})
print (df.duplicated(subset='col', keep=False))
0 False
1 True
2 True
3 True
4 True
5 True
6 False
7 True
8 True
dtype: bool
print (~df.duplicated(subset='col', keep=False))
0 True
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
dtype: bool
print ((~df.duplicated(subset='col', keep=False)).sum())
2
Another solution uses Series.drop_duplicates with keep=False and takes the length of the resulting Series:
print (df['col'].drop_duplicates(keep=False))
0 1
6 4
Name: col, dtype: int64
print (len(df['col'].drop_duplicates(keep=False)))
2
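The same count can also be sketched with Series.value_counts, assuming "non duplicates" means values that occur exactly once. The variable name n_unique is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 2, 3, 3, 3, 4, 5, 5]})

# Count occurrences of each value, then count the values seen exactly once.
n_unique = int((df['col'].value_counts() == 1).sum())
print(n_unique)
```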

Replace a string value with NaN in pandas data frame - Python

I need to replace the value ? with NaN so that I can call the .isnull() method. I have found several solutions, but they always return errors. Suppose:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
and if I try:
pd.data.replace('?', np.nan)
I have:
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
but data.isnull() returns:
0 1 2
0 False False False
1 False False False
2 False False False
Why?
I think you forgot to assign the result back:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
data = data.replace('?', np.nan)
#alternative
#data.replace('?', np.nan, inplace=True)
print (data)
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
print (data.isnull())
0 1 2
0 False True False
1 True True False
2 True False False
# a dataframe with string values
dat = pd.DataFrame({'a':[1,'FG', 2, 4], 'b':[2, 5, 'NA', 7]})
Removing non numerical elements from the dataframe:
# Method 1 - with regex
dat2 = dat.replace(r'^([A-Za-z]|[0-9]|_)+$', np.nan, regex=True)
dat2
# Method 2 - with pd.to_numeric
dat3 = pd.DataFrame()
for col in dat.columns:
    dat3[col] = pd.to_numeric(dat[col], errors='coerce')
dat3
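The column loop in Method 2 can probably be shortened with DataFrame.apply, which forwards extra keyword arguments to the function. A small sketch on the same frame (dat4 is just an illustrative name):

```python
import pandas as pd

dat = pd.DataFrame({'a': [1, 'FG', 2, 4], 'b': [2, 5, 'NA', 7]})

# Apply pd.to_numeric to every column; unparseable values become NaN.
dat4 = dat.apply(pd.to_numeric, errors='coerce')
print(dat4)
```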
? is not null, so you should expect False from the isnull test:
>>> data = pandas.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
>>> data.isnull()
0 1 2
0 False False False
1 False False False
2 False False False
After you replace ? with NaN, the test gives a very different result:
>>> data = data.replace('?', np.nan)
>>> data.isnull()
0 1 2
0 False True False
1 True True False
2 True False False
I believe that pd.data.replace('?', np.nan) is not performed in place, so you need to assign the result back:
data = data.replace('?', np.nan)
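If the data comes from a file, the placeholder can be turned into NaN at load time instead. This is only a sketch under the assumption that the source is CSV; the column names a, b, c and the inline text are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = "a,b,c\n1,?,5\n?,?,4\n?,32.1,1\n"

# na_values tells the parser to read '?' as a missing value.
data = pd.read_csv(io.StringIO(csv_text), na_values='?')
print(data.isnull())
```

With this approach the columns are parsed as numeric from the start, so no replace step is needed afterwards.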

Count How Many Columns in Dataframe before NaN

I want to count how many columns of data each row of a pd.DataFrame has before the first NaN. My data:
df
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Id
A 1 1 2 3 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
B 6 6 7 7 8 9 10 NaN NaN NaN NaN NaN NaN NaN
C 1 2 3 3 4 5 6 6 7 7 8 9 10 NaN
My desired output:
df_result
count
Id
A 5
B 7
C 13
Thank you in advance for the answer.
Use:
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13
A 1 1 2 3 3 NaN NaN NaN NaN NaN NaN NaN NaN 54.0
B 6 6 7 7 8 9.0 10.0 NaN NaN NaN NaN NaN 5.0 NaN
C 1 2 3 3 4 5.0 6.0 6.0 7.0 7.0 8.0 9.0 10.0 NaN
out = df.isnull().cumsum(axis=1).eq(0).sum(axis=1)
print (out)
A 5
B 7
C 13
dtype: int64
Detail:
First check NaNs:
print (df.isnull())
0 1 2 3 4 5 6 7 8 9 \
A False False False False False True True True True True
B False False False False False False False True True True
C False False False False False False False False False False
10 11 12 13
A True True True False
B True True False True
C False False False True
Take the cumulative sum along each row; True is treated as 1 and False as 0:
print (df.isnull().cumsum(axis=1))
0 1 2 3 4 5 6 7 8 9 10 11 12 13
A 0 0 0 0 0 1 2 3 4 5 6 7 8 8
B 0 0 0 0 0 0 0 1 2 3 4 5 5 6
C 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Compare by 0:
print (df.isnull().cumsum(axis=1).eq(0))
0 1 2 3 4 5 6 7 8 9 10 \
A True True True True True False False False False False False
B True True True True True True True False False False False
C True True True True True True True True True True True
11 12 13
A False False False
B False False False
C True True False
Sum the boolean mask along each row; each True counts as 1:
print (df.isnull().cumsum(axis=1).eq(0).sum(axis=1))
A 5
B 7
C 13
dtype: int64
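An alternative sketch with NumPy, assuming the position of the first NaN is all you need: argmax on the boolean mask gives the index of the first True in each row. Rows without any NaN need special handling, because argmax returns 0 for them. Row D below is a hypothetical addition to show that case:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [
        [1, 1, 2, 3, 3] + [nan] * 8 + [54],
        [6, 6, 7, 7, 8, 9, 10] + [nan] * 5 + [5, nan],
        [1, 2, 3, 3, 4, 5, 6, 6, 7, 7, 8, 9, 10, nan],
        list(range(14)),  # hypothetical extra row with no NaN at all
    ],
    index=list('ABCD'),
)

isna = df.isnull().to_numpy()

# Index of the first NaN per row; rows without NaN get the full column count.
first_nan = np.where(isna.any(axis=1), isna.argmax(axis=1), isna.shape[1])
counts = pd.Series(first_nan, index=df.index, name='count')
print(counts)
```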

pandas most efficient way to compare dataframe and series

I have a dataframe of shape (n, p) and a series of length n
I can compare them with:
for i in df.keys():
    df[i] > ts
Is there a way to do it in one line, something like df > ts?
If so, is it more efficient?
I think you need DataFrame.gt:
print (df.gt(s, axis=0))
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
s = pd.Series([1,2,3])
print (s)
0 1
1 2
2 3
dtype: int64
print (df.gt(s, axis=0))
A B C D E F
0 False True True False True True
1 False True True True True True
2 False True True True True False
If you need other comparison functions:
lt
gt
le
ge
ne
eq
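To check that the one-liner matches the column loop from the question, here is a small sketch on a cut-down version of the sample. The names vectorized and looped are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
s = pd.Series([1, 2, 3])

# axis=0 aligns the Series with the row index, so each column is compared to s.
vectorized = df.gt(s, axis=0)

# The explicit per-column loop, collected into a DataFrame for comparison.
looped = pd.DataFrame({col: df[col] > s for col in df.columns})

print(vectorized.equals(looped))
```

The axis=0 argument matters here: a plain df > s would align the Series index with the column labels instead of the rows.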

Unable to remove NaN from panda Series

I know this question has been asked many times before, but all the solutions I have found don't seem to be working for me. I am unable to remove the NaN values from my pandas Series or DataFrame.
First, I tried removing directly from the DataFrame like in I/O 7 and 8 in the documentation (http://pandas.pydata.org/pandas-docs/stable/missing_data.html)
In[1]:
df['salary'][:5]
Out[1]:
0 365788
1 267102
2 170941
3 NaN
4 243293
In [2]:
pd.isnull(df['salary'][:5])
Out[2]:
0 False
1 False
2 False
3 False
4 False
I was expecting line 3 to show up as True, but it didn't. I removed the Series from the DataFrame to try it again.
sal = df['salary'][:5]
In [100]:
type(sal)
Out[100]:
pandas.core.series.Series
In [101]:
sal.isnull()
Out[101]:
0 False
1 False
2 False
3 False
4 False
Name: salary, dtype: bool
In [102]:
sal.dropna()
Out[102]:
0 365788
1 267102
2 170941
3 NaN
4 243293
Name: salary, dtype: object
Can someone tell me what I'm doing wrong? I am using IPython Notebook 2.2.0.
The datatype of your column is object, which tells me it probably contains strings rather than numerical values. Try converting to float:
>>> sa1 = pd.Series(["365788", "267102", "170941", "NaN", "243293"])
>>> sa1
0 365788
1 267102
2 170941
3 NaN
4 243293
dtype: object
>>> sa1.isnull()
0 False
1 False
2 False
3 False
4 False
dtype: bool
>>> sa1 = sa1.astype(float)
>>> sa1.isnull()
0 False
1 False
2 False
3 True
4 False
dtype: bool
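If the column may also contain strings that cannot be parsed as numbers at all, astype(float) will raise an error. A hedged sketch with pd.to_numeric and errors='coerce' follows; the "n/a" entry is a made-up example of such a value:

```python
import pandas as pd

# String data as in the answer, plus one hypothetical unparseable entry.
sal = pd.Series(["365788", "267102", "170941", "NaN", "n/a"])

# Unparseable strings become NaN instead of raising an error.
numeric = pd.to_numeric(sal, errors='coerce')

# Now dropna removes the missing entries as expected.
clean = numeric.dropna()
print(clean)
```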