How to check for NaN values in a pandas dataframe?

nan_rows = []
for index, row in df.iterrows():
    topic = row['topic']
    if topic != np.nan:
        nan_rows.append(row)
I want to split my dataframe into two: if the 'topic' value is NaN, then extract that row. But the code above doesn't work. Why is that?

The comparison fails because NaN is never equal to anything, not even itself, so topic != np.nan is always True; you need isnull()/isna() instead. Here is a minimal example of how to group by isnull() and assign the two groups to a dataframe each.
df = pd.DataFrame({'topic': [np.nan, 1, np.nan, 1]})
>>> df
   topic
0    NaN
1    1.0
2    NaN
3    1.0
grouper = df.groupby(df.topic.isnull())
>>> grouper.groups
{False: [1, 3], True: [0, 2]}
df_True = grouper.get_group(True)
>>> df_True
   topic
0    NaN
2    NaN
df_False = grouper.get_group(False)
>>> df_False
   topic
1    1.0
3    1.0
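If you only need the split and not the groups object, a boolean mask gives it to you directly; a minimal sketch, where df_nan and df_not_nan are just illustrative names:
df_nan = df[df['topic'].isna()]        # rows where topic is NaN
df_not_nan = df[df['topic'].notna()]   # rows where topic is not NaN
isna() (alias isnull()) and notna() test for NaN correctly, which the != comparison above cannot do.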

Related

Replace all column values of ? and n.a with NaN

In a dataframe, how do I replace all occurrences of the values ? and n.a with NaN?
I tried
df.fillna(0, inplace=True)
but '?' was not replaced.
To replace all non-NaN values, you can try
df = df.where(df.isna(), "?")
(df.isna() is the same condition as ~df.notna(): where() keeps values where the condition holds and substitutes "?" everywhere else), and to replace all NaN values,
df.fillna(0, inplace=True)
IIUC, try with replace:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, "?"],
                   ['n.a', 5, 6],
                   [7, '?', 9]],
                  columns=list('ABC'))
df = df.replace({'?': np.nan, 'n.a': np.nan})
df before replace:
     A    B  C
0    1    2  ?
1  n.a    5  6
2    7    ?  9
df after replace:
     A    B    C
0  1.0  2.0  NaN
1  NaN  5.0  6.0
2  7.0  NaN  9.0
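Since both strings map to the same replacement here, replace also accepts a list of values and a single replacement; a minimal equivalent sketch:
df = df.replace(['?', 'n.a'], np.nan)  # every '?' and 'n.a' becomes NaN
The dict form from the answer is the one to reach for when each value needs a different replacement.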

Condensing Wide Data Based on Column Name

Is there an elegant way to do what I'm trying to do in Pandas? My data looks something like:
df = pd.DataFrame({
    'alpha': [1, np.nan, np.nan, np.nan],
    'bravo': [np.nan, np.nan, np.nan, -1],
    'charlie': [np.nan, np.nan, np.nan, np.nan],
    'delta': [np.nan, 1, np.nan, np.nan],
})
print(df)
   alpha  bravo  charlie  delta
0    1.0    NaN      NaN    NaN
1    NaN    NaN      NaN    1.0
2    NaN    NaN      NaN    NaN
3    NaN   -1.0      NaN    NaN
and I want to transform that into something like:
  position  value
0    alpha      1
1    delta      1
2      NaN    NaN
3    bravo     -1
So for each row in the original data I want to find the non-NaN value and retrieve the name of the column it was found in. Then I'll store the column and value in new columns called 'position' and 'value'.
I can guarantee that each row in the original data contains exactly zero or one non-NaN values.
My only idea is to iterate over each row, but I know that idea is bad and there must be a more pandorable way to do it. I'm not exactly sure how to word my problem, so I'm having trouble Googling for ideas. Thanks for any advice!
We can use DataFrame.melt to unpivot your data, then use sort_values and drop_duplicates:
df = (
    df.melt(var_name='position')
      .sort_values('value')
      .drop_duplicates('position', ignore_index=True)
)
  position  value
0    bravo   -1.0
1    alpha    1.0
2    delta    1.0
3  charlie    NaN
Another option would be to use DataFrame.bfill over the column axis, since you noted that you:
can guarantee that each row in the original data contains exactly zero or one non-NaN values
values = df.bfill(axis=1).iloc[:, 0]
# idxmax on the notna() mask yields the first (here: only) non-NaN column per
# row; where() blanks it out again for rows that are entirely NaN
positions = df.notna().idxmax(axis=1).where(values.notna())
dfn = pd.DataFrame({'position': positions, 'value': values})
  position  value
0    alpha    1.0
1    delta    1.0
2      NaN    NaN
3    bravo   -1.0
Another way to do this. Actually, I just noticed that it is quite similar to Erfan's first proposal:
# get the index as a column
df2 = df.reset_index(drop=False)
# melt the columns, keeping index as the id column,
# and sort the result so NaNs appear at the end
df3 = df2.melt(id_vars=['index'])
df3.sort_values('value', ascending=True, inplace=True)
# now take the values of the first row per index
df3.groupby('index')[['variable', 'value']].agg('first')
Or shorter:
(
    df.reset_index(drop=False)
      .melt(id_vars=['index'])
      .sort_values('value')
      .groupby('index')[['variable', 'value']].agg('first')
)
The result is:
      variable  value
index
0        alpha    1.0
1        delta    1.0
2        alpha    NaN
3        bravo   -1.0
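A stack-based variant reproduces the asked-for output exactly, including the all-NaN row; this is a sketch of my own, not from the answers above:
s = df.stack().dropna()  # MultiIndex Series holding only the non-NaN cells
# level 0 is the original row label, level 1 the column the value came from
out = pd.DataFrame({'position': s.index.get_level_values(1),
                    'value': s.to_numpy()},
                   index=s.index.get_level_values(0)).reindex(df.index)
  position  value
0    alpha    1.0
1    delta    1.0
2      NaN    NaN
3    bravo   -1.0
The explicit dropna() keeps the sketch correct across pandas versions; newer pandas no longer drops NaN cells in stack() by default.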

How to find column names in a pandas dataframe that contain all unique values except NaN?

I want to find the columns of a pandas dataframe whose non-NaN values contain no duplicates.
     x    y    z
a    1    2    A
b    2    2    B
c  NaN    3    D
d    4  NaN  NaN
e  NaN  NaN  NaN
The columns "x" and "z" have non-duplicate values except NaN, so I want to pick them out and create a new data frame.
Let us use nunique: since nunique() ignores NaN, a column has no duplicate values exactly when its unique count equals its non-null count.
m = df.nunique() == df.notnull().sum()
subdf = df.loc[:, m]
     x    z
a  1.0    A
b  2.0    B
c  NaN    D
d  4.0  NaN
e  NaN  NaN
m.index[m].tolist()
['x', 'z']
Compare the number of unique values with the number of values left after applying dropna().
Try this code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"x": [1, 2, np.nan, 4, np.nan],
                   "y": [2, 2, 3, np.nan, np.nan],
                   "z": ["A", "B", "D", np.nan, np.nan]})

for col in df.columns:
    if len(df[col].dropna()) == len(df[col].dropna().unique()):
        print(col)
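Series.is_unique expresses the same check more compactly; a one-line sketch (unique_cols is just an illustrative name):
unique_cols = [col for col in df.columns if df[col].dropna().is_unique]
print(unique_cols)  # ['x', 'z']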

Pandas: Number of rows with missing data

How do I find out the total number of rows that have missing data in a Pandas DataFrame?
I have tried this:
df.isnull().sum().sum()
But this counts the total number of missing fields.
I need to know how many rows are affected.
You can use .any(axis=1): it returns True if any element in the row is True, and False otherwise.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, np.nan, 1], 'b': [np.nan, np.nan, 'c']})
print(df)
outputs
     a    b
0  0.0  NaN
1  NaN  NaN
2  1.0    c
and
df.isnull().any(axis=1).sum() # returns 2
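The same boolean mask also selects the affected rows themselves, in case you need them and not just the count; a quick sketch:
rows_with_missing = df[df.isnull().any(axis=1)]
len(rows_with_missing)  # 2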

pandas set_index with NA and None values seems not to be working

I am trying to index a pandas DataFrame using columns with occasional NA and None values in them. This seems to be failing. In the example below, df0 has the (None, e) combination at index 3, but df1 has (NaN, e). Any suggestions?
import pandas as pd
import numpy as np
df0 = pd.DataFrame({'k1': ['4', np.nan, '6', None, np.nan],
                    'k2': ['a', 'd', np.nan, 'e', np.nan],
                    'v': [1, 2, 3, 4, 5]})
df1 = df0.copy().set_index(['k1', 'k2'])
>>> df0
Out[3]:
     k1   k2  v
0     4    a  1
1   NaN    d  2
2     6  NaN  3
3  None    e  4
4   NaN  NaN  5
>>> df1
Out[4]:
          v
k1   k2
4    a    1
NaN  d    2
6    NaN  3
NaN  e    4
     NaN  5
Edit: I see the point now; so this is the expected behavior.
This is expected behaviour: the None value is being converted to NaN, and since the value is duplicated it isn't being shown:
In [31]:
df1.index
Out[31]:
MultiIndex(levels=[['4', '6'], ['a', 'd', 'e']],
           labels=[[0, -1, 1, -1, -1], [0, 1, -1, 2, -1]],
           names=['k1', 'k2'])
From the above you can see that -1 is used to mark NaN values. With respect to the output, if your df were like the following, then the output would show the same behaviour:
In [34]:
df0 = pd.DataFrame({'k1': ['4', np.nan, '6', 1, 1],
                    'k2': ['a', 'd', np.nan, 'e', np.nan],
                    'v': [1, 2, 3, 4, 5]})
df1 = df0.copy().set_index(['k1', 'k2'])
df1
Out[34]:
          v
k1   k2
4    a    1
NaN  d    2
6    NaN  3
1    e    4
     NaN  5
You can see that 1 is repeated for the last two rows but suppressed in the display for the second of them, just as the repeated NaN values were.
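To confirm that the NaN values are really stored in the index and only hidden by the sparse display, you can pull a whole level out; a quick check against the df1 from the question (output abbreviated):
df1.index.get_level_values('k1')
# Index(['4', nan, '6', nan, nan], dtype='object', name='k1')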