Python Pandas: How to delete rows with a certain value in 'object' dtype columns?

I have a dataframe named train_data.
The columns workclass, occupation, and native-country are of object dtype, and some rows contain the value "?"; row index 5, for example, has several such cells.
I want to delete all rows in which any cell contains "?".
I tried the following code, but it didn't work:
train_data = train_data[~(train_data.values == '?').any(1)]
train_data

Use .loc with a boolean mask for index slicing:
import pandas as pd
df1 = pd.DataFrame({'A': [0, 1, 2, 3, '?'],
                    'B': [2, 4, 5, '?', 9],
                    'C': [0, '?', 2, 3, 4]})
print(df1)
   A  B  C
0  0  2  0
1  1  4  ?
2  2  5  2
3  3  ?  3
4  ?  9  4
print(df1.loc[~df1.eq('?').any(axis=1)])
   A  B  C
0  0  2  0
2  2  5  2
If you only want to check the object columns, use DataFrame.select_dtypes:
df1.select_dtypes('object').eq('?').any(axis=1)
0    False
1     True
2    False
3     True
4     True
dtype: bool
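If the goal is to drop those rows rather than just flag them, the same mask can be inverted and passed to .loc (a small sketch reusing df1 from above):
df1.loc[~df1.select_dtypes('object').eq('?').any(axis=1)]
   A  B  C
0  0  2  0
2  2  5  2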
Edit: one method for handling leading or trailing white space.
df1 = pd.DataFrame({'A': [0, 1, 2, 3, '?'],
                    'B': [2, 4, 5, ' ?', 9],
                    'C': [0, '? ', 2, 3, 4]})
df1.eq('?').any(axis=1)
0    False
1    False
2    False
3    False
4     True
dtype: bool
df1.replace(r'(\s+\?)|(\?\s+)', '?', regex=True).eq('?').any(axis=1)
0    False
1     True
2    False
3     True
4     True
dtype: bool
Or use str.strip() with a lambda (note that .str.strip() returns NaN for non-string elements, so only apply it to columns that genuinely hold strings):
str_cols = df1.select_dtypes('object').columns
df1[str_cols] = df1[str_cols].apply(lambda x: x.str.strip())
df1.eq('?').any(axis=1)
0    False
1     True
2    False
3     True
4     True
dtype: bool
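Putting it all together for the original train_data question, one sketch (using the same strip-then-filter idea as above, and assuming the object columns hold strings that may carry stray whitespace around the "?") would be:
str_cols = train_data.select_dtypes('object').columns
train_data[str_cols] = train_data[str_cols].apply(lambda x: x.str.strip())
train_data = train_data.loc[~train_data.eq('?').any(axis=1)]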

Related

Return last occurrence of True in series of dtype bool

If using the following series:
sodf = pd.Series([9,10,10,9,10,10])
sodf
0     9
1    10
2    10
3     9
4    10
5    10
dtype: int64
And then returning a bool series:
sodf == 9
0     True
1    False
2    False
3     True
4    False
5    False
dtype: bool
How do I return a bool series but only with the last (highest row index number) occurrence of 9?
Desired output:
0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool
Try with:
mask = sodf == 9
out = mask[::-1].cumsum().eq(1) & mask
Output:
0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool
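An equivalent sketch uses idxmax on the reversed mask to locate the last True directly; the mask.any() guard is needed because idxmax on an all-False boolean Series would return the first index:
mask = sodf == 9
out = pd.Series(False, index=sodf.index)
if mask.any():
    out[mask[::-1].idxmax()] = True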

How to get the count of non-duplicates in a column

Here is my code to get the duplicate count; how do I negate it to count the non-duplicates instead?
df.duplicated(subset='col', keep='last').sum()
len(df['col'])-len(df['col'].drop_duplicates())
I think you need DataFrame.duplicated with keep=False to mark all duplicates, then invert the mask and sum it to count the True values:
df = pd.DataFrame({'col':[1,2,2,3,3,3,4,5,5]})
print (df.duplicated(subset='col', keep=False))
0    False
1     True
2     True
3     True
4     True
5     True
6    False
7     True
8     True
dtype: bool
print (~df.duplicated(subset='col', keep=False))
0     True
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False
dtype: bool
print ((~df.duplicated(subset='col', keep=False)).sum())
2
Another solution with Series.drop_duplicates and keep=False with length of Series:
print (df['col'].drop_duplicates(keep=False))
0 1
6 4
Name: col, dtype: int64
print (len(df['col'].drop_duplicates(keep=False)))
2
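A third option, equivalent under the assumption that you want the count of values appearing exactly once, tallies each value's frequency with value_counts:
print((df['col'].value_counts() == 1).sum())
2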

update pandas dataframe column with a sequence for matching rows

I have a pandas dataframe with 3 columns.
Event_occur: Boolean
Event_predict: Boolean
Incorrect_pred: Number, default 0
Please refer to the screenshot in the original post. I am trying to update Incorrect_pred based on certain conditions.
Whenever Event_occur is True and Event_predict is False, Incorrect_pred should be updated with an increasing number sequence: the first such occurrence gets 1, the second gets 2, and so on.
Whenever both events are True, Incorrect_pred should be updated with the previous non-zero number (see index rows 5 and 9 in the example).
Whenever Event_occur is False, the value stays at the default of 0.
If this were SQL, I could have used a window function. Something like:
case
    when Event_occur = 'FALSE' then 0
    else sum(case when Event_occur = Event_predict then 0 else 1 end)
         over (order by <some column>)
end
Is there a way I can do this in pandas?
(expected dataframe shown as an image in the original post)
Let's try (this assumes import numpy as np):
df['pred'] = np.where(df.Event_occur == False, 0,
                      np.where(df.Event_occur != df.Event_predict, 1, 0)).cumsum()
df['Incorrect_pred'] = df.pred.where(df.Event_occur == True).fillna(0)
print(df)
Output:
   Event_occur  Event_predict  Incorrect_pred  pred
0        False           True             0.0     0
1         True          False             1.0     1
2         True          False             2.0     2
3        False          False             0.0     2
4         True          False             3.0     3
5         True           True             3.0     3
6         True           True             3.0     3
7         True          False             4.0     4
8        False           True             0.0     4
9         True           True             4.0     4
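The question's input only exists as a screenshot, so here is a reconstruction of the starting dataframe taken from the printed output above (the default Incorrect_pred of 0 comes from the question's column description); running the two assignment lines above on it reproduces the result:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Event_occur':    [False, True, True, False, True, True, True, True, False, True],
                   'Event_predict':  [True, False, False, False, False, True, True, False, True, True],
                   'Incorrect_pred': [0] * 10})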

Unable to remove NaN from panda Series

I know this question has been asked many times before, but all the solutions I have found don't seem to be working for me. I am unable to remove the NaN values from my pandas Series or DataFrame.
First, I tried removing directly from the DataFrame like in I/O 7 and 8 in the documentation (http://pandas.pydata.org/pandas-docs/stable/missing_data.html)
In[1]:
df['salary'][:5]
Out[1]:
0    365788
1    267102
2    170941
3       NaN
4    243293
In [2]:
pd.isnull(df['salary'][:5])
Out[2]:
0    False
1    False
2    False
3    False
4    False
I was expecting index 3 to show up as True, but it didn't. I extracted the Series from the DataFrame to try again.
sal = df['salary'][:5]
In [100]:
type(sal)
Out[100]:
pandas.core.series.Series
In [101]:
sal.isnull()
Out[101]:
0    False
1    False
2    False
3    False
4    False
Name: salary, dtype: bool
In [102]:
sal.dropna()
Out[102]:
0    365788
1    267102
2    170941
3       NaN
4    243293
Name: salary, dtype: object
Can someone tell me what I'm doing wrong? I am using IPython Notebook 2.2.0.
The datatype of your column is object, which tells me it probably contains strings rather than numerical values. Try converting to float:
>>> sa1 = pd.Series(["365788", "267102", "170941", "NaN", "243293"])
>>> sa1
0    365788
1    267102
2    170941
3       NaN
4    243293
dtype: object
>>> sa1.isnull()
0    False
1    False
2    False
3    False
4    False
dtype: bool
>>> sa1 = sa1.astype(float)
>>> sa1.isnull()
0    False
1    False
2    False
3     True
4    False
dtype: bool
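If the column can contain strings that float conversion would reject (for example a hypothetical 'n/a' marker), pd.to_numeric with errors='coerce' is a more forgiving sketch; anything unparseable becomes NaN:
>>> sa1 = pd.Series(["365788", "267102", "170941", "n/a", "243293"])
>>> pd.to_numeric(sa1, errors='coerce').isnull()
0    False
1    False
2    False
3     True
4    False
dtype: bool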

Drop contradicting duplicates from a pandas dataframe

I have a Pandas dataframe as follows:
df = pd.DataFrame({'id': [0, 1, 1, 2, 2], 'married': [True, True, False, False, False]})
   id  married
0   0     True
1   1     True
2   1    False
3   2    False
4   2    False
I would like to group this dataframe by the column id, but also remove all of an id's rows when its married values disagree, rather than just keeping the first row as the drop_duplicates method does:
df.drop_duplicates(subset=["id"])
   id  married
0   0     True
1   1     True
3   2    False
Instead, I want this as my result:
   id  married
0   0     True
3   2    False
You can use .groupby on id followed by .filter and then .drop_duplicates:
>>> pred = lambda obj: obj['married'].nunique() == 1
>>> df.groupby('id').filter(pred).drop_duplicates('id')
   id  married
0   0     True
3   2    False
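An equivalent sketch without the lambda builds the boolean mask with groupby plus transform('nunique'):
>>> df[df.groupby('id')['married'].transform('nunique').eq(1)].drop_duplicates('id')
   id  married
0   0     True
3   2    False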