I have 40,000 rows by 1 column of data saved as an Excel file, with around a hundred null values in it, for example around row 361.
When I carried out df.fillna(method='bfill'), the NaN values were still NaN.
If I slice a df fragment that contains null values, it is processed as expected.
On the full df, however, I still could not fill the NaN cells.
So what's wrong with it?
df = pd.read_excel('npp.xlsx')
df.fillna(method='bfill')
print( df.iloc[360:370,] )
Out[122]:
0
t360 NaN
t361 NaN
t362 NaN
t363 NaN
t364 220.50
t365 228.59
t366 NaN
t367 NaN
t368 NaN
t369 NaN
When I apply fillna() to a sliced df, the NaN values are replaced:
print( df.iloc[360:370,].fillna(method='bfill') )
0
t360 220.50
t361 220.50
t362 220.50
t363 220.50
t364 220.50
t365 228.59
t366 NaN
t367 NaN
t368 NaN
t369 NaN
You need to assign the output:
df = pd.read_excel('npp.xlsx')
df = df.fillna(method='bfill')
df = df[df[0].isnull()]
print (df)
Empty DataFrame
Columns: [0]
Index: []
Or use the inplace=True parameter:
df = pd.read_excel('npp.xlsx')
df.fillna(method='bfill', inplace=True)
df = df[df[0].isnull()]
print (df)
Empty DataFrame
Columns: [0]
Index: []
Or shorter:
df = df.bfill()
df.bfill(inplace=True)
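For reference, recent pandas releases (2.1 and later) deprecate the fillna(method=...) form, so the .bfill() spelling above is the one to prefer. A minimal end-to-end sketch of the fix, assuming the same npp.xlsx file from the question:
import pandas as pd

df = pd.read_excel('npp.xlsx')  # the questioner's file
df = df.bfill()                 # back-fill and assign the result back
print(df.iloc[360:370])         # the gap at t360..t363 is now filled from t364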
Related
I have some unknown DataFrame that can be of any size and shape, for example:
first1 first2 first3 first4
a NaN 22 56.0 65
c 380.0 40 NaN 66
b 390.0 50 80.0 64
My objective is to delete all columns and rows at which there is a NaN value.
In this specific case, the output should be:
first2 first4
b 50 64
Also, I need to preserve the option to use "all" as in pandas.DataFrame.dropna, meaning that when "all" is passed, a column or a row must be dropped only if all of its values are missing.
When I tried the following code:
def dropna_mta_style(df, how='any'):
    new_df = df.dropna(axis=0, how=how).dropna(axis=1, how=how)
    return new_df
It obviously didn't work: it drops the rows first, and then searches for columns with NaNs, but those NaNs were already removed along with the rows, so columns that should be dropped survive.
Thanks in advance!
P.S.: for and while loops, Python built-in functions that act on iterables (all, any, map, ...), and list and dictionary comprehensions shouldn't be used.
Solution intended for readability:
rows = df.dropna(axis=0).index
cols = df.dropna(axis=1).columns
df = df.loc[rows, cols]
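Applied to the sample frame from the question, this keeps exactly the NaN-free rows and columns. A quick sketch that rebuilds that data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'first1': [np.nan, 380.0, 390.0],
                   'first2': [22, 40, 50],
                   'first3': [56.0, np.nan, 80.0],
                   'first4': [65, 66, 64]},
                  index=['a', 'c', 'b'])

rows = df.dropna(axis=0).index    # labels of rows without NaNs
cols = df.dropna(axis=1).columns  # names of columns without NaNs
print(df.loc[rows, cols])
#    first2  first4
# b      50      64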
Would something like this work ?
df.dropna(axis=1,how='any').loc[df.dropna(axis=0,how='any').index]
(Meaning we take the indexes of all rows for which we don't have NaNs in any column, df.dropna(axis=0, how='any').index, and then use that to locate the rows we want in the original df after dropping every column that has at least one NaN.)
This should remove all rows and columns dynamically:
# Flag rows that contain at least one NaN before any columns are dropped
df['Check'] = df.isna().any(axis=1)
# Drop every column that contains a NaN ('Check' is all-boolean, so it survives)
df = df.dropna(axis=1)
# Keep only the rows that were NaN-free, then remove the helper column
df = df.loc[df['Check'] == False]
df.drop('Check', axis=1, inplace=True)
df
def dropna_mta_style(df, how='any'):
    if how == 'all':
        # columns where every value is missing
        null_col = df.isna().all(axis=0).to_frame(name='col')
        col_names = null_col[null_col['col'] == True].index
        # rows where every value is missing
        null_row = df.isna().all(axis=1).to_frame(name='row')
        row_index = null_row[null_row['row'] == True].index
        # drop the all-NaN columns, then the all-NaN rows
        new_df = df.drop(columns=col_names).drop(index=row_index)
    else:
        new_df = df.dropna(axis=0, how=how).dropna(axis=1, how=how)
    return new_df
Here is a breakdown of the changes made to the function.
BEFORE:
first1 first2 first3 first4 first5
a NaN 22.0 NaN 65.0 NaN
c 380.0 40.0 NaN 66.0 NaN
b 390.0 50.0 NaN 64.0 NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN
Find the all-null columns:
null_col = df.isna().all(axis=0).to_frame(name='col')
col_names = null_col[null_col['col'] == True].index
col_names
Index(['first3', 'first5'], dtype='object')
Find the rows where all values are null:
null_row = df.isna().all(axis=1).to_frame(name='row')
row_index = null_row[null_row['row'] == True].index
row_index
Index([3, 4, 5, 6], dtype='object')
Drop the all-null columns:
df2 = df.drop(columns=col_names)
df2
AFTER:
first1 first2 first4
a NaN 22.0 65.0
c 380.0 40.0 66.0
b 390.0 50.0 64.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
Incorporating this into your method gives the fixed function above.
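As a quick check, calling the fixed function with how='all' on the BEFORE frame above drops both the all-NaN columns and the all-NaN rows:
result = dropna_mta_style(df, how='all')
print(result)
#    first1  first2  first4
# a     NaN    22.0    65.0
# c   380.0    40.0    66.0
# b   390.0    50.0    64.0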
If I have a DataFrame like the one generated with:
import pandas as pd
import numpy as np
df = pd.DataFrame({'dataset': ['dataset1']*2 + ['dataset2']*2 + ['dataset3']*2,
                   'frame': [1, 2] * 3,
                   'result1': np.random.randn(6),
                   'result2': np.random.randn(6),
                   'result3': np.random.randn(6),
                   'method': ['A']*3 + ['B']*3
                   })
df = df.set_index(['dataset','frame'])
df
How can I transform it so that I have multi-indexed columns, where the values in the column 'method' form level 0 of the MultiIndex? Missing values should be filled with NaN, as in the expected output below.
The final goal is that I want to be able to easily compare corresponding values in the columns 'result1', 'result2', 'result3' between method 'A' and 'B'.
You can add method to the MultiIndex with DataFrame.set_index, reshape with DataFrame.unstack, and finally swap the column levels with DataFrame.swaplevel plus DataFrame.sort_index:
df = df.set_index('method', append=True).unstack().swaplevel(1,0, axis=1).sort_index(axis=1)
print (df)
method A B
result1 result2 result3 result1 result2 result3
dataset frame
dataset1 1 1.488609 1.130858 0.409016 NaN NaN NaN
2 0.676011 0.645002 0.102751 NaN NaN NaN
dataset2 1 -0.418451 0.106414 -1.907722 NaN NaN NaN
2 NaN NaN NaN -0.806521 0.422155 1.100224
dataset3 1 NaN NaN NaN 0.555876 0.124207 -1.402325
2 NaN NaN NaN -0.705504 -0.837953 -0.225081
# if you need to remove the second index level ('frame')
df = df.reset_index(level=1, drop=True)
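With method as the top column level, the question's final goal, comparing the corresponding result columns between methods 'A' and 'B', reduces to a plain subtraction: selecting a level-0 label returns its block of result columns, and the two blocks align automatically. A small sketch using df from above:
diff = df['A'] - df['B']  # result1/result2/result3 aligned by column name
print(diff)               # NaN wherever only one of the two methods has a value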
I'm looking for an efficient idiom for creating a new Pandas DataFrame with the same columns and types as an existing DataFrame, but with no rows. The following works, but is presumably much less efficient than it could be, because it has to create a long indexing structure and then evaluate it for each row. I'm assuming that's O(n) in the number of rows, and I would like to find an O(1) solution (that's not too bad to look at).
out = df.loc[np.repeat(False, df.shape[0])].copy()
I have the copy() in there because I honestly have no idea under what circumstances I'm getting a copy or getting a view into the original.
For comparison in R, a nice idiom is to do df[0,], because there's no zeroth row. df[NULL,] also works.
I think the equivalent in pandas would be slicing using iloc:
df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [4, 5, 6, 7]})
print(df)
A B
0 0 4
1 1 5
2 2 6
3 3 7
df1 = df.iloc[:0].copy()
print(df1)
Empty DataFrame
Columns: [A, B]
Index: []
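Unlike building an empty frame from scratch, the iloc[:0] slice also keeps the original dtypes, which can be verified directly on df1 from above:
print(df1.dtypes)
# A    int64
# B    int64
# dtype: object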
df1, the existing DataFrame:
df1 = pd.DataFrame({'x1':[1,2,3], 'x2':[4,5,6]})
df2, the new one, based on the columns of df1:
df2 = pd.DataFrame({}, columns=df1.columns)
For setting the dtypes of the different columns:
for x in df1.columns:
    df2[x] = df2[x].astype(df1[x].dtype)
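If you'd rather avoid the loop, the same dtype copy can be written as a one-liner; this sketch assumes astype is handed a plain dict mapping column names to dtypes:
df2 = pd.DataFrame(columns=df1.columns).astype(df1.dtypes.to_dict())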
Update: for no rows, use reindex:
dfcopy = pd.DataFrame().reindex(columns=df.columns)
print(dfcopy)
Output:
Empty DataFrame
Columns: [a, b, c, d, e]
Index: []
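One caveat: the columns of an empty pd.DataFrame() are object dtype, so this variant does not preserve the original dtypes. If that matters, a possible follow-up is to cast them back from the source frame:
dfcopy = pd.DataFrame().reindex(columns=df.columns).astype(df.dtypes.to_dict())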
We can use reindex_like.
dfcopy = pd.DataFrame().reindex_like(df)
MCVE:
# Create a dummy source DataFrame
df = pd.DataFrame(np.arange(25).reshape(5,-1), index=[*'ABCDE'], columns=[*'abcde'])
dfcopy = pd.DataFrame().reindex_like(df)
print(dfcopy)
Output:
a b c d e
A NaN NaN NaN NaN NaN
B NaN NaN NaN NaN NaN
C NaN NaN NaN NaN NaN
D NaN NaN NaN NaN NaN
E NaN NaN NaN NaN NaN
Deep copy the original df and drop the index:
# df1 = df.copy(deep=True).drop(df.index)  # if df is small
df1 = df.drop(df.index).copy()  # if df is large and you don't want to copy everything just to discard it
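A quick sanity check that this approach ends up with zero rows while keeping both the columns and their dtypes, sketched on a small made-up frame:
import pandas as pd

df = pd.DataFrame({'A': [0, 1], 'B': [4.0, 5.0]})
df1 = df.drop(df.index).copy()
print(df1.dtypes)  # A int64, B float64 -- dtypes preserved
print(len(df1))    # 0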
I have a data frame df containing only dates from 2007-01-01 to 2018-04-30 (not as the index).
I have a second data frame sub containing dates and values from 2007-01-01 to 2018-04-20.
I want to have a result data frame res with ALL dates from df and the values from sub at the right place. I am using
res = pd.merge(df, sub, on='date', how='outer')
I expect to have NaNs from 2018-04-21 to 2018-04-30 in the res data frame.
Instead, res only has values up to 2018-04-20 (it truncated the missing dates).
Why?
IIUC, setting indexes and then joining will be useful here:
## create sample data
df = pd.DataFrame({'mdates': pd.date_range('12/13/1989', periods=100, freq='D')})
df['val'] = np.random.randint(10, 500, 100)
df1 = pd.DataFrame({'mdates': pd.date_range('12/01/1989', periods=50, freq='D')})
## join data
df1 = df1.set_index('mdates').join(df.set_index('mdates'))
print(df1.head(20))
val
mdates
1989-12-01 NaN
1989-12-02 NaN
1989-12-03 NaN
1989-12-04 NaN
1989-12-05 NaN
1989-12-06 NaN
1989-12-07 NaN
1989-12-08 NaN
1989-12-09 NaN
1989-12-10 NaN
1989-12-11 NaN
1989-12-12 NaN
1989-12-13 215.0
1989-12-14 189.0
1989-12-15 97.0
1989-12-16 264.0
1989-12-17 419.0
1989-12-18 57.0
1989-12-19 376.0
1989-12-20 448.0
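For comparison, the outer merge from the question produces the same union of dates on these sample frames; the sketch below rebuilds them, with the date-only frame named sub as in the question. If an outer merge still appears to truncate rows, it is worth checking that the date columns of the two frames have the same dtype:
import numpy as np
import pandas as pd

df = pd.DataFrame({'mdates': pd.date_range('12/13/1989', periods=100, freq='D')})
df['val'] = np.random.randint(10, 500, 100)
sub = pd.DataFrame({'mdates': pd.date_range('12/01/1989', periods=50, freq='D')})

res = pd.merge(sub, df, on='mdates', how='outer')  # union of the two date sets
print(res.head(15))  # the first 12 dates have NaN in val, as expected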
After my data is read in, df.head() shows everything OK.
But df1 shows some NaNs. '1222', '1224', and '1223' are the X variable names.
What should I do to get the X values back? Thanks!
Input:
df = pd.read_excel(r'C:\Users\Data\input.xlsx',sheetname='InputData')
df.head()
df1 = df.ix[:, ['1222','1224','1223','Y']]
df1.head()
Output:
1222 1224 1223 Y
NaN NaN NaN 0.6785
NaN NaN NaN 0.6801