Pandas: Number of rows with missing data

How do I find out the total number of rows that have missing data in a Pandas DataFrame?
I have tried this:
df.isnull().sum().sum()
But that counts the total number of missing fields.
I need to know how many rows are affected.

You can use .any(axis=1) on the result of isnull(): for each row it returns True if any element in that row is True, and False otherwise. Summing the resulting boolean Series gives the number of affected rows.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, np.nan, 1], 'b': [np.nan, np.nan, 'c']})
print(df)
outputs
     a    b
0  0.0  NaN
1  NaN  NaN
2  1.0    c
and
df.isnull().any(axis=1).sum() # returns 2
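For a quick cross-check, the same count can be obtained by subtracting the fully complete rows; a minimal sketch relying on dropna's default of dropping any row that contains a NaN:
# rows with at least one missing value
print(len(df) - len(df.dropna()))  # 2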

Related

Fill in NA column values with values from another row based on condition

I want to replace the missing values of one row with column values of another row based on a condition. The real problem has many more columns with NA values. In this example, I want to fill na values for row 4 with values from row 0 for columns A and B, as the value 'e' maps to 'a' for column C.
df = pd.DataFrame({'A': [0, 1, np.nan, 3, np.nan],
                   'B': [5, 6, np.nan, 8, np.nan],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df
Out[21]:
     A    B  C
0  0.0  5.0  a
1  1.0  6.0  b
2  NaN  NaN  c
3  3.0  8.0  d
4  NaN  NaN  e
I have tried this:
df.loc[df.C == 'e', ['A', 'B']] = df.loc[df.C == 'a', ['A', 'B']]
Is it possible to use a nested np.where statement instead?
Your code fails due to index alignment. As the indices differ (0 vs 4), NaN values are assigned.
Use the underlying numpy array to bypass index alignment:
df.loc[df.C == 'e', ['A', 'B']] = df.loc[df.C == 'a', ['A', 'B']].values
NB. Both sides of the assignment must have the same shape.
Output:
     A    B  C
0  0.0  5.0  a
1  1.0  6.0  b
2  NaN  NaN  c
3  3.0  8.0  d
4  0.0  5.0  e
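If several rows need filling from different source rows, the same .values trick generalizes to a lookup keyed on column C. A minimal sketch; the mapping dict (here {'e': 'a'}) is a hypothetical example, not part of the original question:
mapping = {'e': 'a'}  # hypothetical: fill the 'e' row from the 'a' row
src = df.set_index('C')[['A', 'B']]
for target, source in mapping.items():
    df.loc[df.C == target, ['A', 'B']] = src.loc[source].values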

Replace the values '?' and 'n.a' in all columns with NaN

In a dataframe, how do I replace all occurrences of the values '?' and 'n.a' with NaN?
I tried
df.fillna(0, inplace=True)
but it did not replace '?'.
Note that fillna only affects actual NaN values, and the string '?' is not NaN. To replace all non-NaN values, you can try
df = df.where(df.isna(), "?")
and to replace all NaN values,
df.fillna(0, inplace=True)
IIUC, try with replace:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 2, "?"],
                   ['n.a', 5, 6],
                   [7, '?', 9]],
                  columns=list('ABC'))
df = df.replace({'?': np.nan, 'n.a': np.nan})
df before replace:
     A  B  C
0    1  2  ?
1  n.a  5  6
2    7  ?  9
df after replace:
     A    B    C
0  1.0  2.0  NaN
1  NaN  5.0  6.0
2  7.0  NaN  9.0
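If the data is being loaded from a file, the same cleanup can happen at read time via the na_values parameter of read_csv. A minimal sketch, assuming a hypothetical data.csv:
import pandas as pd
# '?' and 'n.a' are parsed directly as NaN
df = pd.read_csv('data.csv', na_values=['?', 'n.a'])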

How to check for NaN values in a pandas dataframe?

nan_rows = []
for index, row in df.iterrows():
    topic = row['topic']
    if topic != np.nan:
        nan_rows.append(row)
I want to split my dataframe into two: if the 'topic' value is nan, then extract it out. But the code above doesn't work. Why is that?
The comparison topic != np.nan is always True, because NaN compares unequal to everything, including itself; use pd.isna(topic) for that check instead. Here is a minimal example of how to group by isnull() and assign the two groups to a dataframe each.
df = pd.DataFrame({'topic':[np.nan, 1, np.nan, 1]})
>>> df
   topic
0    NaN
1    1.0
2    NaN
3    1.0
grouper = df.groupby(df.topic.isnull())
>>> grouper.groups
{False: [1, 3], True: [0, 2]}
df_True = grouper.get_group(True)
>>> df_True
   topic
0    NaN
2    NaN
df_False = grouper.get_group(False)
>>> df_False
   topic
1    1.0
3    1.0
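A plain boolean mask gives the same split without a groupby; a minimal sketch using isna/notna:
df_nan = df[df.topic.isna()]      # rows where topic is NaN
df_notnan = df[df.topic.notna()]  # rows where topic has a value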

Condensing Wide Data Based on Column Name

Is there an elegant way to do what I'm trying to do in Pandas? My data looks something like:
df = pd.DataFrame({
    'alpha': [1, np.nan, np.nan, np.nan],
    'bravo': [np.nan, np.nan, np.nan, -1],
    'charlie': [np.nan, np.nan, np.nan, np.nan],
    'delta': [np.nan, 1, np.nan, np.nan],
})
print(df)
   alpha  bravo  charlie  delta
0    1.0    NaN      NaN    NaN
1    NaN    NaN      NaN    1.0
2    NaN    NaN      NaN    NaN
3    NaN   -1.0      NaN    NaN
and I want to transform that into something like:
  position  value
0    alpha      1
1    delta      1
2      NaN    NaN
3    bravo     -1
So for each row in the original data I want to find the non-NaN value and retrieve the name of the column it was found in. Then I'll store the column and value in new columns called 'position' and 'value'.
I can guarantee that each row in the original data contains exactly zero or one non-NaN values.
My only idea is to iterate over each row but I know that idea is bad and there must be a more pandorable way to do it. I'm not exactly sure how to word my problem so I'm having trouble Googling for ideas. Thanks for any advice!
We can use DataFrame.melt to unpivot your data, then use sort_values and drop_duplicates:
df = (
    df.melt(var_name='position')
      .sort_values('value')
      .drop_duplicates('position', ignore_index=True)
)
  position  value
0    bravo   -1.0
1    alpha    1.0
2    delta    1.0
3  charlie    NaN
Another option would be to use DataFrame.bfill over the column axis, since you noted that you "can guarantee that each row in the original data contains exactly zero or one non-NaN values":
values = df.bfill(axis=1).iloc[:, 0]
# column label of the first non-NaN value per row; all-NaN rows stay NaN
positions = df.notna().idxmax(axis=1).where(values.notna())
dfn = pd.DataFrame({'position': positions, 'value': values})
  position  value
0    alpha    1.0
1    delta    1.0
2      NaN    NaN
3    bravo   -1.0
Another way to do this (actually, I just noticed that it is quite similar to Erfan's first proposal):
# get the index as a column
df2 = df.reset_index(drop=False)
# melt the columns, keeping index as the id column,
# and sort the result so NaNs appear at the end
df3 = df2.melt(id_vars=['index'])
df3.sort_values('value', ascending=True, inplace=True)
# now take the values of the first row per index
df3.groupby('index')[['variable', 'value']].agg('first')
Or shorter:
(
    df.reset_index(drop=False)
      .melt(id_vars=['index'])
      .sort_values('value')
      .groupby('index')[['variable', 'value']].agg('first')
)
The result is:
      variable  value
index
0        alpha    1.0
1        delta    1.0
2        alpha    NaN
3        bravo   -1.0
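One more variant, offered as a minimal sketch (not from the answers above): DataFrame.stack drops NaN entries by default, so with the at-most-one-value guarantee it yields exactly one (row, column) pair per non-empty row, and reindex restores the all-NaN rows:
s = df.stack()  # MultiIndex (row, column) -> value, NaNs dropped
out = pd.DataFrame({
    'position': s.index.get_level_values(1),
    'value': s.to_numpy(),
}, index=s.index.get_level_values(0)).reindex(df.index)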

How to find column names in pandas dataframe that contain all unique values except NaN?

I want to find the columns of a pandas data frame whose non-NaN values are all unique (no duplicates).
     x    y    z
a    1    2    A
b    2    2    B
c  NaN    3    D
d    4  NaN  NaN
e  NaN  NaN  NaN
The columns "x" and "z" have non-duplicate values except NaN, so I want to pick them out and create a new data frame.
Let us use nunique:
m = df.nunique() == df.notnull().sum()
subdf = df.loc[:, m]
     x    z
a  1.0    A
b  2.0    B
c  NaN    D
d  4.0  NaN
e  NaN  NaN
m.index[m].tolist()
['x', 'z']
Compare the length of the unique values with the length of the values after applying dropna(). Try this code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"x": [1, 2, np.nan, 4, np.nan],
                   "y": [2, 2, 3, np.nan, np.nan],
                   "z": ["A", "B", "D", np.nan, np.nan]})

for col in df.columns:
    if len(df[col].dropna()) == len(df[col].dropna().unique()):
        print(col)
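To build the new data frame the question asks for from the same comparison, a short sketch:
# collect the matching column names, then select those columns
unique_cols = [col for col in df.columns
               if len(df[col].dropna()) == len(df[col].dropna().unique())]
subdf = df[unique_cols]
print(subdf.columns.tolist())  # ['x', 'z']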