pandas valid null values

I am looking for the list of valid null values that pandas fillna() method will replace, e.g. 'NaN', 'NA', 'NULL', 'NaT'. I could not find it in the documentation.

The fillna method will only replace actual missing values, represented as NaN, NaT or None, but not strings ('NaN' or any other string).
Before using fillna you can check what will be replaced in a column COL of your dataframe df using isnull():
df.loc[df['COL'].isnull()]
will show you the subset of your dataframe for which the column 'COL' is NaN/NaT/None.
You can convert such strings to NaN using replace. Say you have strings like "NAN":
from numpy import nan
df = df.replace('NAN', nan)
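A minimal sketch of the difference, using a hypothetical single-column DataFrame that mixes a real NaN, a None and the literal string 'NaN':
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL': [1.0, np.nan, None, 'NaN']})
print(df['COL'].isnull())        # True only for the np.nan and None rows, not the string
print(df.fillna(0))              # the string 'NaN' is left untouched
df = df.replace('NaN', np.nan)   # convert the string first...
print(df.fillna(0))              # ...and now fillna catches it too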

Related

Need to fill non-null values with NaN based on column name

I need to update the non-null values of certain columns: where a cell contains specific text, I want to set it to NaN, based on the column name. I have attached an image for reference.
Consider the following dataset
Replace cells matching the column name with NaN. This won't replace any other string, e.g. where X0 contains the value oth.
df.apply(lambda s: s.replace({s.name: np.nan}))
Replace all strings with NaN
df.apply(lambda s: pd.to_numeric(s, errors='coerce'))
Replace all strings with NaN on a subset of columns
COLS = ['X0', 'X2']
df.apply(lambda s: pd.to_numeric(s, errors='coerce') if s.name in COLS else s)
Note: I have used the pandas apply function, but the same result can be achieved with a for loop.
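A minimal runnable sketch of the approaches above, assuming a small hypothetical DataFrame whose X0 and X2 columns mix numbers, the column name itself and another string:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X0': [1, 'X0', 'oth'],   # one cell repeats the column name, one holds another string
    'X1': ['a', 'b', 'c'],
    'X2': [10, 'X2', 30],
})
# replace only cells equal to their own column name
print(df.apply(lambda s: s.replace({s.name: np.nan})))
# replace every non-numeric value, but only in the chosen columns
COLS = ['X0', 'X2']
print(df.apply(lambda s: pd.to_numeric(s, errors='coerce') if s.name in COLS else s))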

How to change a string to NaN when applying astype?

I have a column in a dataframe that holds integers, like: [1,2,3,4,5,6, etc.]
My problem: one of the fields in this column contains a string, like this: [1,2,3,2,3,'hello form France',1,2,3]
The dtype of this column is object.
I want to cast it to float with column.astype(float), but I get an error because of that string.
The column has over 10,000 records and only this one record contains a string. How can I cast to float and change this string to NaN, for example?
You can use pd.to_numeric with errors='coerce'
import pandas as pd
df = pd.DataFrame({
'all_nums':range(5),
'mixed':[1,2,'woo',4,5],
})
df['mixed'] = pd.to_numeric(df['mixed'], errors='coerce')  # 'woo' cannot be parsed, so it becomes NaN
df.head()
Before:
   all_nums mixed
0         0     1
1         1     2
2         2   woo
3         3     4
4         4     5
After:
   all_nums  mixed
0         0    1.0
1         1    2.0
2         2    NaN
3         3    4.0
4         4    5.0

Change NaN to None in Pandas dataframe

I am trying to replace NaN with None in a pandas DataFrame. It used to work with df.where(df.notnull(), None).
Here is the thread for this method: Use None instead of np.nan for null values in pandas DataFrame.
When I try to use the same method on another dataframe, it fails.
The new dataframe holds the values A, NaN, B, C, D, E; the printout of the dataframe looks like this:
  Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
0          A        NaN          B          C          D          E
Even when I run the previously working code against the new dataframe, it fails.
I am just wondering whether it is because, in Excel, the cell format has to be a certain type.
Any suggestions on this?
This always works for me
df = df.replace({np.nan:None})
The problem was that I did not assign the result back to the dataframe.
The code that caused the problem was
df.where(df.notnull(), None)
If I write the code like this, there is no problem
df = df.where(df.notnull(), None)
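A minimal sketch of the assignment point, using a small hypothetical DataFrame; where() returns a new object, so without assigning the result back the original dataframe is unchanged:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', np.nan]})
df.where(df.notnull(), None)        # result is discarded, df still holds NaN
df = df.where(df.notnull(), None)   # assigned back: the NaN cells are now None
print(df)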
To do it for just one column:
df.col_name.replace({np.nan: None}, inplace=True)
This is not as easy as it looks.
1. NaN is the value set for any cell that is empty when reading a file with pandas.read_csv().
2. None is the value set for any cell that is NULL when reading with pandas.read_sql() or from a database.
import pandas as pd
import numpy as np

df = pd.read_csv('file.csv')
df = df.replace({np.nan: None})      # empty cells read as NaN become None
df['prog'] = df['prog'].astype(str)  # then cast the column to the required dtype
print(df)
If there is a datatype compatibility issue, it is because replacing np.nan with None turns the dataframe column into object dtype.
So in this case, first replace np.nan with None and then cast the column to the required datatype.
file.csv
column names: batch, prog, name
the 'prog' column is empty

Pandas DataFrame: sort_values by an index with empty strings

I have a pandas DataFrame with a multi-level index. I want to sort by one of the index levels. It has float values, but occasionally a few empty strings too, which I want to be treated as NaN.
df = pd.DataFrame(dict(x=[1,2,3,4]), index=[1,2,3,''])
df.index.name = 'i'
df.sort_values('i')
TypeError: '<' not supported between instances of 'str' and 'int'
One way to solve the problem is to replace the empty strings with nan, do the sort, and then replace nan with empty strings again.
I am wondering if there is any way to tweak sort_values to consider empty strings as NaN.
Why are there empty strings in the first place?
In my application the data actually has missing values, which are read as np.nan. But np.nan values cause problems with groupby, so they are replaced with empty strings. I wish we had a constant like nan that is treated like an empty string by groupby and like nan for numeric operations.
I am wondering if there is any way to tweak sort_values to consider empty strings as NaN.
In pandas, missing values are not empty strings; only when you save a DataFrame with missing values are they written out as empty strings.
By the way, the main problem is mixed values, numeric together with strings (the empty values); it is best to convert all the strings to numeric to avoid it.
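One way to apply that advice here is to convert the index itself to numeric, so the empty strings become NaN; a sketch based on the example above:
import pandas as pd

df = pd.DataFrame(dict(x=[1, 2, 3, 4]), index=[1, 2, 3, ''])
df.index.name = 'i'
df.index = pd.to_numeric(df.index, errors='coerce')  # '' becomes NaN, index dtype becomes float
df.index.name = 'i'           # keep the index name so sort_values('i') still works
print(df.sort_values('i'))    # NaN is placed last by default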
Alternatively, you can replace the empty values with missing values using rename:
df = pd.DataFrame(dict(x=[1,2,3,4]), index=[1,2,3,''])
df.index.name = 'i'
df = df.rename({'':np.nan})
df = df.sort_values('i')
print (df)
x
i
1.0 1
2.0 2
3.0 3
NaN 4
A possible solution, if the original data cannot be changed, is to get the positions of the sorted values with Index.argsort and reorder with DataFrame.iloc:
df = df.iloc[df.rename({'':np.nan}).index.argsort()]
print (df)
x
i
1 1
2 2
3 3
4

How do I replace all NaNs in a pandas dataframe with the string "None"

I have a dataframe and some of its cells are empty. I want to turn them into the string 'None' so I can parse it more easily than a NaN value.
df = df.replace(np.nan, 'None', regex=True)
Use the code above.
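A minimal sketch on a small hypothetical DataFrame with a couple of NaN cells:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', np.nan]})
df = df.replace(np.nan, 'None', regex=True)
print(df)   # the NaN cells now hold the string 'None'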