How can I check for NaN values in a DataFrame? - pandas

I want to check whether specific columns in a DataFrame contain NaN, and then remove the rows in which those columns contain NaN.
Here is my code, which does not work:
import numpy as np
import pandas as pd
from numpy import nan
df = pd.DataFrame(np.array([[nan, 2, 3], [nan, nan, 6], [nan, 8, 9]]),
                  columns=['a', 'b', 'c'])
for i in range(len(df.index)):
    print(type(df["b"].loc[i]))
    if df["b"].loc[i] is np.float64(nan):
        df = df.drop([i])
print(df)
But df["b"].loc[i] is np.float64(nan) evaluates to False, and the result is:
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
     a    b    c
0  NaN  2.0  3.0
1  NaN  NaN  6.0
2  NaN  8.0  9.0
I can do it with different code, but I want to know why the code above does not work.
The working code is:
df1 = pd.DataFrame(np.array([[nan, 2, 3], [nan, nan, 6], [nan, 8, 9]]),
                   columns=['a', 'b', 'c'])
for i in range(len(df1.index)):
    if df1.isna()["b"].loc[i]:
        df1 = df1.drop([i])
print(df1)

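The reason is that the is operator tests object identity, not equality. Each call to np.float64(nan) creates a new object, so the comparison is always False. Switching to == would not help either, because NaN compares unequal to itself; use pd.isna() or Series.isna() to test for NaN.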
Here is a post which discusses the topic in more detail.
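To make the difference concrete, here is a minimal sketch that compares the three checks on a single value and then drops the rows with DataFrame.dropna instead of a loop. The variable names are only for illustration, and dropna(subset=...) is an alternative to the loop in the question, not something the asker used.

```python
import numpy as np
import pandas as pd

value = np.float64(np.nan)

# `is` compares identity: np.float64(np.nan) builds a new object every time
identity_check = value is np.float64(np.nan)
# `==` does not work either, because NaN is unequal to itself
equality_check = bool(value == value)
# pd.isna is the reliable way to test for NaN
isna_check = bool(pd.isna(value))

print(identity_check, equality_check, isna_check)  # False False True

df = pd.DataFrame([[np.nan, 2, 3], [np.nan, np.nan, 6], [np.nan, 8, 9]],
                  columns=['a', 'b', 'c'])

# Drop every row whose 'b' value is NaN, without looping over the index
cleaned = df.dropna(subset=['b'])
print(cleaned)
```

Only the row with index 1 is removed, because column a is not part of subset.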

Related

How to merge same-name columns from two different dataframes?

I have four different datasets and have merged three of the dataframes correctly. The 3rd and 4th datasets share a column name. When I merge the result with the 4th dataset, the values of that shared column do not line up properly: the user_id is repeated, which I don't want. Where the del_keys column shows NaN, I want to see the actual value, but instead that value appears at the end of the table. In short, I want to merge the values of the same-name column on the basis of their user_id.
In the above image you can see what kind of problem I am getting.
In my expected output, no user_id should be repeated.
Use merge on the user_id column:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'del_keys': [1.0, np.nan, np.nan, np.nan]
})
df2 = pd.DataFrame({
    'user_id': [3, 4, 5],
    'del_keys': [1.0, 2.0, 3.0]
})
# Both frames have a del_keys column, so merge suffixes them with _x and _y
final = df1.merge(df2, on="user_id", how="outer")
Then use combine_first to fill the NaN values, and drop the duplicates:
final["del_keys"] = final['del_keys_y'].combine_first(final['del_keys_x'])
final.drop(columns=["del_keys_x", "del_keys_y"], inplace=True)
# drop_duplicates returns a new frame, so assign the result
final = final.drop_duplicates(subset="user_id")
I'm guessing that you use pd.concat to merge the dataframes.
Some dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'user_id': [1, 2, 3],
    'del_keys': [1.0, np.nan, np.nan]
})
df2 = pd.DataFrame({
    'user_id': [3, 4, 5],
    'del_keys': [1.0, 2.0, 3.0]
})
Merge using pd.concat:
df = pd.concat([df1, df2])
   user_id  del_keys
0        1       1.0
1        2       NaN
2        3       NaN
0        3       1.0
1        4       2.0
2        5       3.0
Remove duplicates using DataFrame.drop_duplicates:
(
    df
    .sort_values('del_keys')
    .drop_duplicates('user_id', keep='first')
    .sort_values('user_id')
)
   user_id  del_keys
0        1       1.0
1        2       NaN
0        3       1.0
1        4       2.0
2        5       3.0
First, we sort the values by del_keys so that all NaNs end up at the bottom of the dataframe. Then we drop the duplicates and keep the first occurrence for each user_id, which is a non-NaN value whenever one exists. Lastly, we sort by user_id again to restore the original order.
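The three steps above can be wrapped into one small function. This is only a sketch: the function name concat_prefer_non_nan and its parameters are made up for illustration, and it relies on sort_values placing NaN last, which is the default (na_position='last').

```python
import numpy as np
import pandas as pd


def concat_prefer_non_nan(frames, key, value):
    """Stack the frames, then keep one row per key, preferring non-NaN values."""
    stacked = pd.concat(frames, ignore_index=True)
    return (
        stacked
        .sort_values(value)                  # NaN rows are sorted to the end
        .drop_duplicates(key, keep='first')  # first row per key is non-NaN if any exists
        .sort_values(key)                    # back to key order
        .reset_index(drop=True)
    )


df1 = pd.DataFrame({'user_id': [1, 2, 3], 'del_keys': [1.0, np.nan, np.nan]})
df2 = pd.DataFrame({'user_id': [3, 4, 5], 'del_keys': [1.0, 2.0, 3.0]})

result = concat_prefer_non_nan([df1, df2], key='user_id', value='del_keys')
print(result)
```

If a user_id has several different non-NaN values, this keeps the smallest one, so it is only suitable when the frames do not disagree.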

How to remove all types of NaN from a dataframe?

I have a data frame, which is shown below. I want to merge the column values into one column, excluding NaN values.
Image 1:
When I use the code
df3["Generation"] = df3[df3.columns[5:]].apply(lambda x: ','.join(x.dropna()), axis=1)
I am getting results like this.
Image 2:
I suspect that these columns are of type string; thus, they are not affected by x.dropna().
One example that I made is this, which gives similar results as yours.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [np.nan, np.nan, 1, 2], 'b': [1, 1, np.nan, None]}).astype(str)
df.apply(lambda x: ','.join(x.dropna()), axis=1)
0 nan,1.0
1 nan,1.0
2 1.0,nan
3 2.0,nan
dtype: object
-----------------
# using simple string comparing solves the problem
df.apply(lambda x: ','.join(x[x!='nan']), axis=1)
0 1.0
1 1.0
2 1.0
3 2.0
dtype: object
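Another option, if the columns really do hold the literal string 'nan', is to turn those strings back into real missing values first, so that dropna() behaves as expected. The sketch below assumes the placeholder is exactly 'nan'; the frame is built from strings directly so that it does not depend on how astype(str) treats missing values.

```python
import pandas as pd

# String columns in which 'nan' is text, not a real missing value
df = pd.DataFrame({'a': ['nan', 'nan', '1.0', '2.0'],
                   'b': ['1.0', '1.0', 'nan', 'nan']})

# mask() replaces every cell where the condition is True with a real NaN
cleaned = df.mask(df == 'nan')

# Now dropna() removes those cells before joining
joined = cleaned.apply(lambda row: ','.join(row.dropna()), axis=1)
print(joined.tolist())  # ['1.0', '1.0', '1.0', '2.0']
```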

Replace NaN values of pandas.DataFrame based on values of other columns (according to formula)

Demo dataframe:
import pandas as pd
df = pd.DataFrame({'a': [1,None,3], 'b': [5,10,15]})
I want to replace all NaN values in a with the corresponding values in b**2, and make b NaN (shift NaN values and make some operations on them).
Desired result:
1 5
100 NaN
3 15
How is it possible with pandas?
You can get the rows you want to change using df['a'].isnull(). Then you can use that to update the columns with loc.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, None, 3], 'b': [5, 10, 15]})
change = df['a'].isnull()
# Fill the missing a values with b squared
df.loc[change, 'a'] = df.loc[change, 'b'] ** 2
# Cast b to float first so that it can hold NaN
df['b'] = df['b'].astype(float)
df.loc[change, 'b'] = np.nan
print(df)
Note that the change variable only keeps df['a'].isnull() from being repeated in each assignment. You could substitute that expression wherever change appears, but I think that looks cluttered.
Result:
       a     b
0    1.0   5.0
1  100.0   NaN
2    3.0  15.0

Python pandas DataFrame operations with NaN

On a pandas DataFrame, I'm trying to compute the change between two features. For example:
df = pd.DataFrame({'A': [100, 100, 100], 'B': [105, 110, 93], 'C': ['NaN', 102, 'NaN']})
I am attempting to compute the change df['A'] - df['C'], but on the rows where C is 'NaN' I want to use the value from the 'B' column instead.
Expected result: [-5, -2, 7]
Since df['C'].loc[0] is NaN, the first value is 100 - 105 (using 'B'), but the second value is 100 - 102.
I think the simplest way is to replace the missing values with values from another column using Series.fillna:
# if needed, convert the 'NaN' strings to real missing values (np.nan)
df['C'] = pd.to_numeric(df.C, errors='coerce')
s = df['A'] - df['C'].fillna(df.B)
print (s)
0 -5.0
1 -2.0
2 7.0
dtype: float64
Another idea is to use numpy.where and test for missing values with Series.isna:
import numpy as np
a = np.where(df.C.isna(), df['A'] - df['B'], df['A'] - df['C'])
print (a)
[-5. -2. 7.]
s = df['A'] - np.where(df.C.isna(), df['B'], df['C'])
print (s)
0 -5.0
1 -2.0
2 7.0
Name: A, dtype: float64
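For reference, here is the fillna approach as one self-contained script, starting from the frame in the question. The intermediate names c_numeric and fallback are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [100, 100, 100],
                   'B': [105, 110, 93],
                   'C': ['NaN', 102, 'NaN']})

# Column C mixes strings and numbers; coerce it to floats with real NaN
c_numeric = pd.to_numeric(df['C'], errors='coerce')

# Where C is missing, fall back to the value in B (aligned on the index)
fallback = c_numeric.fillna(df['B'])

change = df['A'] - fallback
print(change.tolist())  # [-5.0, -2.0, 7.0]
```

Passing a Series to fillna fills each missing entry from the row with the same index label, which is why no explicit loop or np.where is needed.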

Pandas replace empty with value based on column using dictionary

I have a dataframe with a few dozen columns. I'd like to replace NaN or empty values with a specific number or string, depending on the column. Is there a dictionary approach that would work? Dictionary example below, not sure how to apply it to a dataframe. Using Python 2.7
mydict ={'ColA': -999, 'ColB': -888, 'ColC': 'TBD'}
Just use pandas.DataFrame.fillna:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ColA': [1, np.nan, 3], 'ColB':[10, np.nan, 30], 'ColC':[100, np.nan, 300]})
mydict ={'ColA': -999, 'ColB': -888, 'ColC': 'TBD'}
new_df = df.fillna(mydict)
print(new_df)
Output:
     ColA    ColB ColC
0     1.0    10.0  100
1  -999.0  -888.0  TBD
2     3.0    30.0  300
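fillna only touches real missing values, so empty strings are left alone. If "empty" in the question also means '', one possible approach is to convert those to NaN first and then apply the same dictionary. The sample data below is an assumption made for illustration; in particular, ColC holds strings here so that the 'TBD' fill value matches the column type.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ColA has a NaN, ColC has an empty string and a None
df = pd.DataFrame({
    'ColA': [1.0, np.nan, 3.0],
    'ColC': ['x', '', None],
})
fill_values = {'ColA': -999, 'ColC': 'TBD'}

# Turn empty strings into NaN, then fill each column with its own default
filled = df.replace('', np.nan).fillna(fill_values)
print(filled)
```

Columns that are not listed in the dictionary are left unchanged by fillna.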