I know this question has been asked many times before, but all the solutions I have found don't seem to be working for me. I am unable to remove the NaN values from my pandas Series or DataFrame.
First, I tried removing the NaN values directly from the DataFrame, as in In/Out [7] and [8] in the documentation (http://pandas.pydata.org/pandas-docs/stable/missing_data.html):
In [1]:
df['salary'][:5]
Out[1]:
0 365788
1 267102
2 170941
3 NaN
4 243293
In [2]:
pd.isnull(df['salary'][:5])
Out[2]:
0 False
1 False
2 False
3 False
4 False
I was expecting the value at index 3 to show up as True, but it didn't. I extracted the Series from the DataFrame to try again.
sal = df['salary'][:5]
In [100]:
type(sal)
Out[100]:
pandas.core.series.Series
In [101]:
sal.isnull()
Out[101]:
0 False
1 False
2 False
3 False
4 False
Name: salary, dtype: bool
In [102]:
sal.dropna()
Out[102]:
0 365788
1 267102
2 170941
3 NaN
4 243293
Name: salary, dtype: object
Can someone tell me what I'm doing wrong? I am using IPython Notebook 2.2.0.
The datatype of your column is object, which tells me it probably contains strings rather than numerical values. Try converting to float:
>>> sa1 = pd.Series(["365788", "267102", "170941", "NaN", "243293"])
>>> sa1
0 365788
1 267102
2 170941
3 NaN
4 243293
dtype: object
>>> sa1.isnull()
0 False
1 False
2 False
3 False
4 False
dtype: bool
>>> sa1 = sa1.astype(float)
>>> sa1.isnull()
0 False
1 False
2 False
3 True
4 False
dtype: bool
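If the column may also contain strings that aren't valid numbers, astype(float) will raise a ValueError; a hedged alternative is pd.to_numeric with errors='coerce', which turns anything unparseable into NaN (a sketch on a fresh copy of the example series):
>>> sa2 = pd.Series(["365788", "267102", "170941", "NaN", "243293"])
>>> pd.to_numeric(sa2, errors='coerce').isnull()
0 False
1 False
2 False
3 True
4 False
dtype: bool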
Related
import pandas as pd
numbers = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
]
df = pd.DataFrame(numbers)
condition = df.loc[:, 1:2] < 4
df[condition]
0 1 2
0 NaN 2.0 3.0
1 NaN NaN NaN
2 NaN NaN NaN
Why am I getting these wrong results, and what can I do to get the correct results?
The boolean condition has to be a Series, but here selecting multiple columns returns a DataFrame:
print (condition)
1 2
0 True True
1 False False
2 False False
So to convert the boolean DataFrame to a mask, use DataFrame.all to test whether all values in a row are True, or DataFrame.any to test whether at least one value in a row is True:
print (condition.any(axis=1))
print (condition.all(axis=1))
Both calls return the same mask for this data:
0 True
1 False
2 False
dtype: bool
Or select only one column for the condition:
print (df.loc[:, 1] < 4)
0 True
1 False
2 False
Name: 1, dtype: bool
print (df[condition.any(axis=1)])
0 1 2
0 1 2 3
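For context, a minimal sketch (rebuilding the setup above): indexing a DataFrame with a boolean DataFrame masks cell-wise rather than filtering rows, which is why the original df[condition] produced NaN. It behaves like DataFrame.where:
import pandas as pd

numbers = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
df = pd.DataFrame(numbers)
condition = df.loc[:, 1:2] < 4

# column 0 is missing from the mask, so it is aligned as all-False
# and masked to NaN, just like df[condition]
print(df.where(condition))
0 1 2
0 NaN 2.0 3.0
1 NaN NaN NaN
2 NaN NaN NaN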
I have a dataframe named train_data.
This is the datatype of each column.
The columns workclass, occupation, and native-country are of "Object" datatype and some of the rows contain values of "?".
In this example, you can see row index 5 has some values with "?".
I want to delete all rows with any cell that has any "?".
I tried the following code, but it didn't work.
train_data = train_data[~(train_data.values == '?').any(1)]
train_data
Use .loc with a boolean mask:
import pandas as pd
df1 = pd.DataFrame({'A' : [0,1,2,3,'?'],
                    'B' : [2,4,5,'?',9],
                    'C' : [0,'?',2,3,4]})
print(df1)
A B C
0 0 2 0
1 1 4 ?
2 2 5 2
3 3 ? 3
4 ? 9 4
print(df1.loc[~df1.eq('?').any(axis=1)])
A B C
0 0 2 0
2 2 5 2
If you only want to check the object columns, use DataFrame.select_dtypes:
df1.select_dtypes('object').eq('?').any(axis=1)
0 False
1 True
2 False
3 True
4 True
dtype: bool
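If you go that route, the same mask can be inverted and passed to .loc to drop the rows (a small sketch on the frame above):
mask = df1.select_dtypes('object').eq('?').any(axis=1)
print(df1.loc[~mask])
A B C
0 0 2 0
2 2 5 2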
Edit: one method for handling leading or trailing whitespace.
df1 = pd.DataFrame({'A' : [0,1,2,3,'?'],
                    'B' : [2,4,5,' ?',9],
                    'C' : [0,'? ',2,3,4]})
df1.eq('?').any(axis=1)
0 False
1 False
2 False
3 False
4 True
dtype: bool
df1.replace(r'(\s+\?)|(\?\s+)', '?', regex=True).eq('?').any(axis=1)
0 False
1 True
2 False
3 True
4 True
dtype: bool
Or strip the values with a lambda:
str_cols = df1.select_dtypes('object').columns
# strip only the string values; plain .str.strip() would turn the
# non-string entries in these mixed object columns into NaN
df1[str_cols] = df1[str_cols].apply(
    lambda x: x.map(lambda v: v.strip() if isinstance(v, str) else v))
df1.eq('?').any(axis=1)
0 False
1 True
2 False
3 True
4 True
dtype: bool
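Putting it together (a sketch): once the values are stripped, the same inverted mask drops the offending rows:
print(df1.loc[~df1.eq('?').any(axis=1)])
A B C
0 0 2 0
2 2 5 2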
I have searched Stack Overflow about this, but all the answers are too complex.
I want to output the row and column info for every cell that is inf or NaN.
You can replace np.inf with missing values, test them with DataFrame.isna, and then test for at least one True per row and per column with DataFrame.any, passing the masks to DataFrame.loc to get the sub-DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,5,4,5,5,np.inf],
    'C':[7,np.nan,9,4,2,3],
    'D':[1,3,5,7,1,0],
    'E':[np.inf,3,6,9,2,np.nan],
    'F':list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7.0 1 inf a
1 b 5.0 NaN 3 3.0 a
2 c 4.0 9.0 5 6.0 a
3 d 5.0 4.0 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f inf 3.0 0 NaN b
m = df.replace(np.inf, np.nan).isna()
print (m)
A B C D E F
0 False False False False True False
1 False False True False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
5 False True False False True False
df = df.loc[m.any(axis=1), m.any()]
print (df)
B C E
0 4.0 7.0 inf
1 5.0 NaN 3.0
5 inf 3.0 NaN
Or if you need the index and column names in a DataFrame, use DataFrame.stack with Index.to_frame:
s = df.replace(np.inf, np.nan).stack(dropna=False)
df1 = s[s.isna()].index.to_frame(index=False)
print (df1)
0 1
0 0 E
1 1 C
2 5 B
3 5 E
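An alternative sketch that avoids modifying the data even temporarily: build the mask with DataFrame.isna combined with DataFrame.isin for the infinities:
m = df.isna() | df.isin([np.inf, -np.inf])
print(df.loc[m.any(axis=1), m.any()])
B C E
0 4.0 7.0 inf
1 5.0 NaN 3.0
5 inf 3.0 NaN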
This is my code to get the duplicates; how do I negate it to count the non-duplicated values instead?
df.duplicated(subset='col', keep='last').sum()
len(df['col'])-len(df['col'].drop_duplicates())
I think you need DataFrame.duplicated with keep=False to mark all duplicates, then invert the mask and sum to count the Trues:
df = pd.DataFrame({'col':[1,2,2,3,3,3,4,5,5]})
print (df.duplicated(subset='col', keep=False))
0 False
1 True
2 True
3 True
4 True
5 True
6 False
7 True
8 True
dtype: bool
print (~df.duplicated(subset='col', keep=False))
0 True
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
dtype: bool
print ((~df.duplicated(subset='col', keep=False)).sum())
2
Another solution: Series.drop_duplicates with keep=False, then take the length of the resulting Series:
print (df['col'].drop_duplicates(keep=False))
0 1
6 4
Name: col, dtype: int64
print (len(df['col'].drop_duplicates(keep=False)))
2
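Yet another hedged sketch: Series.value_counts can count the values that appear exactly once:
print((df['col'].value_counts() == 1).sum())
2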
Do I have to replace the value '?' with NaN so I can invoke the .isnull() method? I have found several solutions, but some errors are always returned. Suppose:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
and if I try:
data.replace('?', np.nan)
I have:
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
but data.isnull() returns:
0 1 2
0 False False False
1 False False False
2 False False False
Why?
I think you forgot to assign it back:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
data = data.replace('?', np.nan)
#alternative
#data.replace('?', np.nan, inplace=True)
print (data)
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
print (data.isnull())
0 1 2
0 False True False
1 True True False
2 True False False
import numpy as np
import pandas as pd

# a dataframe with string values
dat = pd.DataFrame({'a':[1,'FG', 2, 4], 'b':[2, 5, 'NA', 7]})
Removing the non-numerical elements from the dataframe:
"Method 1 - with regex"
dat2 = dat.replace(r'^([A-Za-z]|[0-9]|_)+$', np.nan, regex=True)
dat2
"Method 2 - with pd.to_numeric"
dat3 = pd.DataFrame()
for col in dat.columns:
    dat3[col] = pd.to_numeric(dat[col], errors='coerce')
dat3
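If I'm not mistaken, Method 2 collapses into a single apply, since DataFrame.apply forwards keyword arguments to the applied function (a sketch with the same result on this frame):
dat4 = dat.apply(pd.to_numeric, errors='coerce')
dat4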
'?' is not null, so you should expect to get False from the isnull test:
>>> data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
>>> data.isnull()
0 1 2
0 False False False
1 False False False
2 False False False
After you replace '?' with NaN, the isnull test will look much different:
>>> data = data.replace('?', np.nan)
>>> data.isnull()
0 1 2
0 False True False
1 True True False
2 True False False
I believe that when you are doing data.replace('?', np.nan), the action is not done in place, so you must try:
data = data.replace('?', np.nan)