Why NaN appears in part of pd data output? - pandas

After reading in my data, df.head() shows everything is OK.
But df1 shows some NaNs. '1222', '1224', '1223' are the X variable names.
What should I do to get the X values back? Thanks!
input
df = pd.read_excel(r'C:\Users\Data\input.xlsx',sheetname='InputData')
df.head()
df1 = df.ix[:, ['1222','1224','1223','Y']]
df1.head()
output
1222 1224 1223 Y
NaN NaN NaN 0.6785
NaN NaN NaN 0.6801
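A likely cause (an assumption, since the Excel file isn't shown): pandas read the headers 1222, 1224, 1223 as integers, so selecting them with string labels reindexes against labels that don't exist and returns all-NaN columns. Normalising the labels (or selecting with integer labels) recovers the values; note also that .ix has been removed from pandas, so .loc is used below:

```python
import pandas as pd

# Toy frame standing in for the Excel sheet (assumption: the headers were
# read as the integers 1222, 1224, 1223, not as strings)
df = pd.DataFrame({1222: [1.0, 2.0], 1224: [3.0, 4.0],
                   1223: [5.0, 6.0], 'Y': [0.6785, 0.6801]})

# String labels don't match integer column labels, which is why the
# selection came back all-NaN.  Cast the labels to str, then select:
df.columns = df.columns.astype(str)
df1 = df.loc[:, ['1222', '1224', '1223', 'Y']]
```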

Related

At each NaN value, drop the row and column it's located in from pandas DataFrame

I have some unknown DataFrame that can be of any size and shape, for example:
first1 first2 first3 first4
a NaN 22 56.0 65
c 380.0 40 NaN 66
b 390.0 50 80.0 64
My objective is to delete all columns and rows at which there is a NaN value.
In this specific case, the output should be:
first2 first4
b 50 64
Also, I need to preserve the option to use "all" as in pandas.DataFrame.dropna: when "all" is passed, a column or a row must be dropped only if all of its values are missing.
When I tried the following code:
def dropna_mta_style(df, how='any'):
    new_df = df.dropna(axis=0, how=how).dropna(axis=1, how=how)
    return new_df
It obviously didn't work, because it drops the rows first, and by the time it searches for columns with NaNs, those NaNs have already been dropped along with their rows.
Thanks in advance!
P.S: for and while loops, python built-in functions that act on iterables (all, any, map, ...), list and dictionary comprehensions shouldn't be used.
Solution intended for readability:
rows = df.dropna(axis=0).index
cols = df.dropna(axis=1).columns
df = df.loc[rows, cols]
Would something like this work?
df.dropna(axis=1, how='any').loc[df.dropna(axis=0, how='any').index]
(Meaning: we take the index of all rows that contain no NaN, df.dropna(axis=0, how='any').index, and use it to select those rows from the frame whose NaN-containing columns have all been dropped.)
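A quick check of that one-liner on the question's sample frame (rebuilt here from the values shown above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'first1': [np.nan, 380.0, 390.0],
                   'first2': [22, 40, 50],
                   'first3': [56.0, np.nan, 80.0],
                   'first4': [65, 66, 64]},
                  index=['a', 'c', 'b'])

# drop every column containing a NaN, then keep only the rows that
# were NaN-free in the original frame
out = df.dropna(axis=1, how='any').loc[df.dropna(axis=0, how='any').index]
```

This reproduces the expected output: only row b with columns first2 and first4.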
This should remove all rows and columns dynamically:
df['Check'] = df.isna().any(axis=1)      # flag rows that contain any NaN
df = df.dropna(axis=1)                   # drop columns that contain any NaN
df = df.loc[~df['Check']]                # keep only the rows flagged clean
df = df.drop('Check', axis=1)
df
def dropna_mta_style(df, how='any'):
    if how == 'all':
        null_col = df.isna().all(axis=0).to_frame(name='col')
        col_names = null_col[null_col['col']].index
        null_row = df.isna().all(axis=1).to_frame(name='row')
        row_index = null_row[null_row['row']].index
        new_df = df
        if len(row_index) > 0:
            new_df = df.drop(axis=1, columns=col_names)
    else:
        new_df = df.dropna(axis=0, how=how).dropna(axis=1, how=how)
    return new_df
Here is a breakdown of the change made to the function.
BEFORE:
first1 first2 first3 first4 first5
a NaN 22.0 NaN 65.0 NaN
c 380.0 40.0 NaN 66.0 NaN
b 390.0 50.0 NaN 64.0 NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN
Find the all-null columns:
null_col = df.isna().all(axis=0).to_frame(name='col')
col_names = null_col[null_col['col']].index
col_names
Index(['first3', 'first5'], dtype='object')
Find the all-null rows:
null_row = df.isna().all(axis=1).to_frame(name='row')
row_index = null_row[null_row['row']].index
row_index
Index([3, 4, 5, 6], dtype='object')
if len(row_index) > 0:
    df2 = df.drop(axis=1, columns=col_names)
df2
AFTER:
first1 first2 first4
a NaN 22.0 65.0
c 380.0 40.0 66.0
b 390.0 50.0 64.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
Incorporating this into your method:
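Folding the breakdown back into the function might look like this (a sketch: how='all' drops the all-NaN columns and rows found above, anything else falls back to the index-intersection approach):

```python
import numpy as np
import pandas as pd

def dropna_mta_style(df, how='any'):
    if how == 'all':
        # labels of the all-NaN columns and all-NaN rows
        col_names = df.columns[df.isna().all(axis=0)]
        row_index = df.index[df.isna().all(axis=1)]
        return df.drop(columns=col_names).drop(index=row_index)
    # 'any': keep only rows and columns of the original frame with no NaN
    rows = df.dropna(axis=0, how=how).index
    cols = df.dropna(axis=1, how=how).columns
    return df.loc[rows, cols]

# small frame in the spirit of the BEFORE example above
df = pd.DataFrame({'first1': [np.nan, 380.0, 390.0, np.nan],
                   'first3': [56.0, np.nan, 80.0, np.nan],
                   'first5': [np.nan] * 4},
                  index=['a', 'c', 'b', 3])
```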

How do I turn a categorical column into the level 0 of a column multi-index

If I have a data-frame like so:
generated with:
import pandas as pd
import numpy as np
df = pd.DataFrame({'dataset': ['dataset1']*2 + ['dataset2']*2 + ['dataset3']*2,
'frame': [1,2] * 3,
'result1': np.random.randn(6),
'result2': np.random.randn(6),
'result3': np.random.randn(6),
'method': ['A']*3 + ['B']*3
})
df = df.set_index(['dataset','frame'])
df
How can I transform it so that I get multi-indexed columns, where the values of the 'method' column form level 0 of the MultiIndex?
Missing values should be filled in, e.g. like so:
The final goal is that I want to be able to easily compare corresponding values in the columns 'result1', 'result2', 'result3' between method 'A' and 'B'.
You can add method to the MultiIndex with DataFrame.set_index, reshape with DataFrame.unstack, and finally apply DataFrame.swaplevel with DataFrame.sort_index:
df = df.set_index('method', append=True).unstack().swaplevel(1,0, axis=1).sort_index(axis=1)
print (df)
method A B
result1 result2 result3 result1 result2 result3
dataset frame
dataset1 1 1.488609 1.130858 0.409016 NaN NaN NaN
2 0.676011 0.645002 0.102751 NaN NaN NaN
dataset2 1 -0.418451 0.106414 -1.907722 NaN NaN NaN
2 NaN NaN NaN -0.806521 0.422155 1.100224
dataset3 1 NaN NaN NaN 0.555876 0.124207 -1.402325
2 NaN NaN NaN -0.705504 -0.837953 -0.225081
# if you need to remove the second index level
df = df.reset_index(level=1, drop=True)
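With method as the top column level, the stated goal of comparing the two methods becomes a plain subtraction of the level-0 slices (a sketch with the question's sample frame; the values are random):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'dataset': ['dataset1']*2 + ['dataset2']*2 + ['dataset3']*2,
                   'frame': [1, 2] * 3,
                   'result1': np.random.randn(6),
                   'result2': np.random.randn(6),
                   'result3': np.random.randn(6),
                   'method': ['A']*3 + ['B']*3})
df = df.set_index(['dataset', 'frame'])
df = (df.set_index('method', append=True)
        .unstack()
        .swaplevel(1, 0, axis=1)
        .sort_index(axis=1))

# rowwise A-vs-B comparison, one column per result
diff = df['A'] - df['B']
```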

Pandas DataFrame .merge not displaying Nan

I have a data frame df containing only dates from 2007-01-01 to 2018-04-30 (not as the index).
I have a second data frame sub containing dates and values from 2007-01-01 to 2018-04-20.
I want a result data frame res with ALL dates from df and the values from sub in the right places. I am using
res = pd.merge(df, sub, on='date', how='outer')
I expect to have NaNs from 2018-04-21 to 2018-04-30 in the res data frame.
Instead, res only has values up to 2018-04-20 (it truncated the missing dates).
Why?
IIUC, setting indexes and then joining will be useful here:
## create sample data
df = pd.DataFrame({'mdates': pd.date_range('12/13/1989', periods=100, freq='D')})
df['val'] = np.random.randint(10, 500, 100)
df1 = pd.DataFrame({'mdates': pd.date_range('12/01/1989', periods=50, freq='D')})
## join data
df1 = df1.set_index('mdates').join(df.set_index('mdates'))
print(df1.head(20))
val
mdates
1989-12-01 NaN
1989-12-02 NaN
1989-12-03 NaN
1989-12-04 NaN
1989-12-05 NaN
1989-12-06 NaN
1989-12-07 NaN
1989-12-08 NaN
1989-12-09 NaN
1989-12-10 NaN
1989-12-11 NaN
1989-12-12 NaN
1989-12-13 215.0
1989-12-14 189.0
1989-12-15 97.0
1989-12-16 264.0
1989-12-17 419.0
1989-12-18 57.0
1989-12-19 376.0
1989-12-20 448.0
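For reference, an outer merge on the same toy data does keep every date from both frames, so if rows go missing the usual suspect is the 'date' column differing in dtype between the two frames (e.g. string vs. datetime64; an assumption, since the original data isn't shown):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'mdates': pd.date_range('12/13/1989', periods=100, freq='D')})
df['val'] = np.random.randint(10, 500, 100)
df1 = pd.DataFrame({'mdates': pd.date_range('12/01/1989', periods=50, freq='D')})

# outer merge: the union of all dates survives; dates present only in
# df1 carry NaN in 'val'
res = pd.merge(df1, df, on='mdates', how='outer')
```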

pandas update specific row with nan values

I've been browsing around but I cannot find the answer to my particular question.
I have a DataFrame with hundreds of columns and hundreds of rows. I want to replace the NaN values occurring in the first row only with an empty string. This has been answered for changing a column or an entire dataframe, but not for a particular row. I also don't want to modify the NaNs occurring in other rows.
I've tried the following:
dataframe.loc[0].replace(np.nan, '', regex=True)
and I also tried with:
dataframe.update(dataframe.loc[0].fillna(''))
but when I call the dataframe, it is not modified. Any help would be greatly appreciated!
Consider the data frame df
np.random.seed([3, 1415])
df = pd.DataFrame(
np.random.choice([1, np.nan], size=(4, 4)),
list('WXYZ'), list('ABCD')
)
df
A B C D
W 1.0 NaN 1.0 NaN
X 1.0 1.0 NaN 1.0
Y NaN NaN NaN NaN
Z 1.0 NaN NaN 1.0
If we use a non-scalar, namely an array-like, to select the first row, we get a pd.DataFrame object back and can conveniently fillna it and pass the result to pd.DataFrame.update:
df.update(df.iloc[[0]].fillna(''))
df
     A    B    C    D
W  1.0       1.0
X  1.0    1  NaN    1
Y  NaN  NaN  NaN  NaN
Z  1.0  NaN  NaN    1
Notice that I use [0] instead of 0 within the iloc.
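For completeness: the asker's first attempt fails because fillna/replace on df.loc[0] returns a new object that is then discarded. Assigning the filled row back also works (a sketch on the sample frame, using row label 'W'; note it may upcast the touched columns to object):

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.choice([1, np.nan], size=(4, 4)),
                  list('WXYZ'), list('ABCD'))

# fillna returns a filled copy of the row; assign it back to take effect
df.loc['W'] = df.loc['W'].fillna('')
```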

pandas df after fillna() is still NaN

A 40,000-row, one-column dataset saved as Excel. There are hundreds of null values in it, for example around row 361...
When I run df.fillna(method='bfill'), the NaN values are still NaN.
If I slice a df fragment containing null values, it is processed as expected.
I tried, but still could not fill the NaN cells.
So what's wrong?
The df file is here:
excel file click here
df = pd.read_excel('npp.xlsx')
df.fillna(method='bfill')
print( df.iloc[360:370,] )
Out[122]:
0
t360 NaN
t361 NaN
t362 NaN
t363 NaN
t364 220.50
t365 228.59
t366 NaN
t367 NaN
t368 NaN
t369 NaN
When fillna() is applied to a sliced df, the NaN values are replaced:
print( df.iloc[360:370,].fillna(method='bfill') )
0
t360 220.50
t361 220.50
t362 220.50
t363 220.50
t364 220.50
t365 228.59
t366 NaN
t367 NaN
t368 NaN
t369 NaN
You need to assign the output:
df = pd.read_excel('npp.xlsx')
df = df.fillna(method='bfill')
df = df[df[0].isnull()]
print (df)
Empty DataFrame
Columns: [0]
Index: []
Or use the inplace=True parameter:
df = pd.read_excel('npp.xlsx')
df.fillna(method='bfill', inplace=True)
df = df[df[0].isnull()]
print (df)
Empty DataFrame
Columns: [0]
Index: []
Or shorter:
df = df.bfill()
# or, in place:
df.bfill(inplace=True)
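The pitfall is easy to reproduce without the Excel file (a toy column standing in for the data; the values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [np.nan, np.nan, 220.50, 228.59, np.nan, 231.00]})

df.bfill()                      # returns a new, filled frame; df is unchanged
still_missing = df[0].isna().any()

df = df.bfill()                 # assigning the result back actually updates df
```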