Explain curious behavior of pandas.Series.interpolate

import numpy as np
import pandas as pd

s = pd.Series([0, 2, np.nan, 8])
print(s)
interp = s.interpolate(method='polynomial', order=2)
print(interp)
This prints:
0 0.0
1 2.0
2 NaN
3 8.0
dtype: float64
0 0.000000
1 2.000000
2 4.666667
3 8.000000
dtype: float64
Now if I add one more np.nan to the series,
s = pd.Series([0, 2, np.nan, np.nan, 8])
print(s)
interp = s.interpolate(method='polynomial', order=2)
print(interp)
I get much more accurate results:
0 0.0
1 2.0
2 NaN
3 NaN
4 8.0
dtype: float64
0 0.0
1 2.0
2 4.0
3 6.0
4 8.0
dtype: float64
Is Series.interpolate recursive in that it uses interpolated values for further interpolated values, which then can affect previously interpolated values?

You are actually interpolating two different functions!
In the first case you look for a function that goes through the following points:
(0,0), (1,2), (3,8)
But in the second case you look for a function that goes through the following points:
(0,0), (1,2), (4,8)
The index of a pd.Series supplies the points on the x-axis, and the data of a pd.Series supplies the points on the y-axis.
So try the following change in your first example: replace
s = pd.Series([0, 2, np.nan, 8])
with
s = pd.Series([0, 2, np.nan, 8], index=[0, 1, 2, 4])
s.interpolate(method='polynomial', order=2)
You should get the output:
0 0.0
1 2.0
2 4.0
4 8.0
dtype: float64
As an alternative you could also do:
s = pd.Series([0, 2, np.nan, 8], index=[0, 1, 3, 4])
and the output:
0 0.0
1 2.0
3 6.0
4 8.0
dtype: float64
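As a cross-check (not part of the original answer, just a NumPy illustration): fitting a degree-2 polynomial through each set of known points reproduces the values pandas fills in.
import numpy as np

# First example: the missing value sits at index 2, so the quadratic is
# fitted through (0, 0), (1, 2), (3, 8).
coeffs = np.polyfit([0, 1, 3], [0, 2, 8], deg=2)
print(np.polyval(coeffs, 2))       # 4.666..., matching the value above

# Second example: the known points (0, 0), (1, 2), (4, 8) lie on the
# straight line y = 2x, so the quadratic term vanishes.
coeffs = np.polyfit([0, 1, 4], [0, 2, 8], deg=2)
print(np.polyval(coeffs, [2, 3]))  # approx. [4. 6.], matching the values above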
Hope this helps.


Dataframe forward-fill till column-specific last valid index

How do I go from:
[In]: df = pd.DataFrame({
    'col1': [100, np.nan, np.nan, 100, np.nan, np.nan],
    'col2': [np.nan, 100, np.nan, np.nan, 100, np.nan]
})
df
[Out]: col1 col2
0 100 NaN
1 NaN 100
2 NaN NaN
3 100 NaN
4 NaN 100
5 NaN NaN
To:
[Out]: col1 col2
0 100 NaN
1 100 100
2 100 100
3 100 100
4 NaN 100
5 NaN NaN
My current approach is to apply a custom method that works on one column at a time:
[In]: def ffill_last_valid(s):
    last_valid = s.last_valid_index()
    s = s.ffill()
    s[s.index > last_valid] = np.nan
    return s

df.apply(ffill_last_valid)
But it seems like overkill to me. Is there a one-liner that works on the dataframe directly?
Note on accepted answer:
See the accepted answer from mozway below.
You can ffill, then keep only the values before the last stretch of NaN with a combination of where and notna/reversed-cummax:
out = df.ffill().where(df[::-1].notna().cummax())
variant:
out = df.ffill().mask(df[::-1].isna().cummin())
Output:
col1 col2
0 100.0 NaN
1 100.0 100.0
2 100.0 100.0
3 100.0 100.0
4 NaN 100.0
5 NaN NaN
interpolate:
In theory, df.interpolate(method='ffill', limit_area='inside') should work, but while both options work as expected separately, for some reason the combination does not (pandas 1.5.2). It does work with df.interpolate(method='zero', limit_area='inside'), though.
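For completeness, a self-contained sketch of the approach above (assuming the example frame from the question): the reversed notna()/cummax() marks, per column, every row at or above the last valid value, so the forward fill is kept only inside that region.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [100, np.nan, np.nan, 100, np.nan, np.nan],
    'col2': [np.nan, 100, np.nan, np.nan, 100, np.nan]
})

# Reverse the frame, mark non-NaN cells and take the cumulative max:
# after aligning back on the index, a cell is True iff a valid value
# exists at or below it in the original row order.
keep = df[::-1].notna().cummax()

out = df.ffill().where(keep)
print(out)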

Replace outliers in Pandas dataframe by NaN

I'd like to replace outliers by np.nan. I have a dataframe containing floats, int and NaNs such as:
df_ex = pd.DataFrame({
    'a': [np.nan, np.nan, 2.0, -0.5, 6, 120],
    'b': [1, 3, 4, 2, 40, 11],
    'c': [np.nan, 2, 3, 4, 2, 2],
    'd': [6, 2.2, np.nan, 0, 3, 3],
    'e': [12, 4, np.nan, -5, 5, 5],
    'f': [2, 3, 8, 2, 12, 8],
    'g': [3, 3, 9.0, 11, np.nan, 2]})
with this function:
def outliers(s, replace=np.nan):
    Q1, Q3 = np.percentile(s, [25, 75])
    IQR = Q3 - Q1
    return s.where((s >= (Q1 - 1.5 * IQR)) & (s <= (Q3 + 1.5 * IQR)), replace)

df_ex_o = df_ex.apply(outliers, axis=1)
but I get:
Any idea on what's going on? I'd like the outliers to be calculated column wise.
Thanks as always for your help.
Don't use apply. Here is the annotated code for an optimized version:
def mask_outliers(df, replace):
    # Calculate the Q1 and Q3 quantiles
    q = df.agg('quantile', q=[.25, .75])
    # Calculate IQR = Q3 - Q1
    iqr = q.loc[.75] - q.loc[.25]
    # Calculate the lower and upper limits used to decide outliers
    lower = q.loc[.25] - 1.5 * iqr
    upper = q.loc[.75] + 1.5 * iqr
    # Replace the values that do not lie between [lower, upper]
    return df.where(df.ge(lower) & df.le(upper), replace)
Result
mask_outliers(df_ex, np.nan)
a b c d e f g
0 NaN 1.0 NaN NaN NaN 2 3.0
1 NaN 3.0 2.0 2.2 4.0 3 3.0
2 2.0 4.0 3.0 NaN NaN 8 9.0
3 -0.5 2.0 4.0 NaN NaN 2 11.0
4 6.0 NaN 2.0 3.0 5.0 12 NaN
5 NaN 11.0 2.0 3.0 5.0 8 2.0
This answer addresses the actual question:
Any idea on what's going on? I'd like the outliers to be calculated column wise.
whereas the other (accepted) answer only provides a better solution to what you want to achieve.
There are two issues to fix in order to make your code do what it should:
The NaN values have to be removed from the column before calling np.percentile(); otherwise both Q1 and Q3 come out as NaN.
This is one of the reasons for the many NaN values you see when applying your code to the DataFrame: np.percentile() behaves differently here from pandas' .agg('quantile', ...), which implicitly skips NaN values when calculating the Q1 and Q3 thresholds (see the small comparison right after these two points).
The axis value has to be changed from 1 to 0 (i.e. to .apply(outliers, axis=0)) in order to apply outliers column wise.
This is the other reason for the many NaN values in your result: the only row not entirely set to NaN is the one that contains no NaN itself; otherwise that row too would have every value set to NaN, for the reason explained above.
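A small illustration of that difference (not from the original answer, made-up values):
import numpy as np
import pandas as pd

col = pd.Series([1.0, np.nan, 3.0, 5.0])

# np.percentile propagates the NaN, so both thresholds become NaN ...
print(np.percentile(col, [25, 75]))         # [nan nan]

# ... while the pandas quantile machinery skips NaN values.
print(col.quantile([0.25, 0.75]).tolist())  # [2.0, 4.0]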
The following changes to your code:
colmn_noNaN = colmn.dropna()
Q1, Q3 = np.percentile(colmn_noNaN, [25 ,75])
and
df_ex_o = df_ex.apply(outliers, axis=0)
will solve the issues. Below is the entire code and its output:
import pandas as pd
import numpy as np

df_ex = pd.DataFrame({
    'a': [np.nan, np.nan, 2.0, -0.5, 6, 120],
    'b': [1, 3, 4, 2, 40, 11],
    'c': [np.nan, 2, 3, 4, 2, 2],
    'd': [6, 2.2, np.nan, 0, 3, 3],
    'e': [12, 4, np.nan, -5, 5, 5],
    'f': [2, 3, 8, 2, 12, 8],
    'g': [3, 3, 9.0, 11, np.nan, 2]})
# print(df_ex)

def outliers(colmn, replace=np.nan):
    colmn_noNaN = colmn.dropna()
    Q1, Q3 = np.percentile(colmn_noNaN, [25, 75])
    IQR = Q3 - Q1
    return colmn.where((colmn >= (Q1 - 1.5 * IQR)) & (colmn <= (Q3 + 1.5 * IQR)), replace)

df_ex_o = df_ex.apply(outliers, axis=0)
print(df_ex_o)
gives:
a b c d e f g
0 NaN 1.0 NaN NaN NaN 2 3.0
1 NaN 3.0 2.0 2.2 4.0 3 3.0
2 2.0 4.0 3.0 NaN NaN 8 9.0
3 -0.5 2.0 4.0 NaN NaN 2 11.0
4 6.0 NaN 2.0 3.0 5.0 12 NaN
5 NaN 11.0 2.0 3.0 5.0 8 2.0

Calculating the "distance average" in the KNN imputation method for replacing NaN values in a particular column

I ran into this problem while implementing the KNN imputation method for handling missing data from scratch. I created a dummy dataset and found the nearest neighbors for the rows that contain missing values. Here is my dataset:
A B C D E
0 NaN 2.0 4.0 10.0 100.0
1 NaN 3.0 9.0 12.0 NaN
2 5.0 2.0 20.0 50.0 75.0
3 3.0 5.0 7.0 NaN 150.0
4 2.0 9.0 7.0 30.0 90.0
For row 0 the nearest neighbors are rows 1 and 2, and to replace the NaN value at (0, A) we compute the distance-weighted average of the neighbors' values in the same column. But what if one of the neighbors' values is also NaN?
Example:
Let's suppose the nearest neighbors for row 3 are rows 2 and 4. Row 3 is missing the value in column D, and to replace it we compute the distance-weighted average of the neighbors' values in column D, like this:
distance average = [(1/D1) * 50.0 + (1/D2) * 30.0] / 2
(where D1 and D2 are the corresponding nan-euclidean distances), and replace the NaN value at (3, D) with this average. But in the case of row 0, the nearest neighbors are rows 1 and 2, and to replace the NaN value at (0, A) we need their values in column A. The value at (2, A) is 5.0, fine, but the value at (1, A) is NaN, so we can't compute
distance average = [(1/D3) * NaN + (1/D4) * 5.0] / 2
So how do we replace the NaN value at (0, A)? And how does sklearn's KNNImputer handle this kind of scenario?
sklearn's KNNImputer uses the nan_euclidean_distances metric by default. According to its user guide:
If a sample has more than one feature missing, then the neighbors for
that sample can be different depending on the particular feature being
imputed.
The algorithm might use different sets of neighborhoods to impute the single missing value in column D and the two missing values in column A.
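To make the distance part concrete, here is a small sketch (not from the original answer) that applies sklearn's nan_euclidean_distances directly to the question's data:
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([
    [np.nan, 2, 4, 10, 100],
    [np.nan, 3, 9, 12, np.nan],
    [5, 2, 20, 50, 75],
    [3, 5, 7, np.nan, 150],
    [2, 9, 7, 30, 90],
])

# Each pairwise distance is computed only over the coordinates present
# in both rows and rescaled by the fraction of usable coordinates, so
# rows with missing entries can still be ranked as neighbors.
print(nan_euclidean_distances(X))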
Here is a simple demonstration of KNNImputer on your data:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

A = [np.nan, np.nan, 5, 3, 2]
B = [2, 3, 2, 5, 9]
C = [4, 9, 20, 7, 7]
D = [10, 12, 50, np.nan, 30]
E = [100, np.nan, 75, 150, 90]
columns = ['A', 'B', 'C', 'D', 'E']

data = pd.DataFrame(list(zip(A, B, C, D, E)), columns=columns)

imputer = KNNImputer(n_neighbors=2)
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=columns)
This is the output:
A B C D E
0 3.5 2.0 4.0 10.0 100.0
1 2.5 3.0 9.0 12.0 125.0
2 5.0 2.0 20.0 50.0 75.0
3 3.0 5.0 7.0 11.0 150.0
4 2.0 9.0 7.0 30.0 90.0

Why does replacing multiple values change the dtype

Why does replacing one value give object dtype but replacing two values give float64 dtype?
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
In [3]: df
Out[3]:
a b
0 1 4
1 2 5
2 3 6
In [6]: df.replace({1: None})
Out[6]:
a b
0 None 4
1 2 5
2 3 6
In [7]: df.replace({1: None, 5: None})
Out[7]:
a b
0 NaN 4.0
1 2.0 NaN
2 3.0 6.0
In [8]: df.replace({1: None}).dtypes
Out[8]:
a object
b object
dtype: object
In [9]: df.replace({1: None, 5: None}).dtypes
Out[9]:
a float64
b float64
dtype: object
Just the code:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
df.replace({1: None})
df.replace({1: None, 5: None})
df.replace({1: None}).dtypes
df.replace({1: None, 5: None}).dtypes
This seems to be buried in the replace if/then logic, which does something slightly different depending on the length of the mapping.
If you want consistent behavior, don't use None. Use np.nan:
import numpy as np

df.replace({1: np.nan})
a b
0 NaN 4.0
1 2.0 5.0
2 3.0 6.0
Or
df.replace({1: np.nan, 5: np.nan})
a b
0 NaN 4.0
1 2.0 NaN
2 3.0 6.0
If you want to replace one column and leave the others alone, pass a nested dictionary that specifies what to do for which column:
df.replace({'a': {1: np.nan}})
a b
0 NaN 4
1 2.0 5
2 3.0 6
Or
df.replace({'a': {1: np.nan}, 'b': {5: None}})
a b
0 NaN 4
1 2.0 None
2 3.0 6
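Not part of the answer above, but if the goal is to introduce missing values without the upcast to float, pandas' nullable integer dtype is one option; a minimal sketch, assuming a reasonably recent pandas version:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).astype('Int64')
out = df.mask(df.isin([1, 5]))  # masked cells become <NA>, dtype stays Int64
print(out)
print(out.dtypes)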

How to fill in pandas column with previous column value using apply [duplicate]

Suppose I have a DataFrame with some NaNs:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
0 1 2
0 1 2 3
1 4 NaN NaN
2 NaN NaN 9
What I need to do is replace every NaN with the first non-NaN value in the same column above it. It is assumed that the first row will never contain a NaN. So for the previous example the result would be
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
I can just loop through the whole DataFrame column-by-column, element-by-element and set the values directly, but is there an easy (optimally a loop-free) way of achieving this?
You could use the fillna method on the DataFrame and specify the method as ffill (forward fill):
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
This method...
propagate[s] last valid observation forward to next valid
To go the opposite way, there's also a bfill method.
This method doesn't modify the DataFrame inplace - you'll need to rebind the returned DataFrame to a variable or else specify inplace=True:
df.fillna(method='ffill', inplace=True)
The accepted answer is perfect. I had a related but slightly different situation where I had to fill in forward but only within groups. In case someone has the same need, know that fillna works on a DataFrameGroupBy object.
>>> import numpy as np
>>> example = pd.DataFrame({'number': [0, 1, 2, np.nan, 4, np.nan, 6, 7, 8, 9], 'name': list('aaabbbcccc')})
>>> example
name number
0 a 0.0
1 a 1.0
2 a 2.0
3 b NaN
4 b 4.0
5 b NaN
6 c 6.0
7 c 7.0
8 c 8.0
9 c 9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0 0.0
1 1.0
2 2.0
3 NaN
4 4.0
5 4.0
6 6.0
7 7.0
8 8.0
9 9.0
Name: number, dtype: float64
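With recent pandas versions the same thing can also be written with the groupby-level ffill method (a small aside, same data as above):
>>> example.groupby('name')['number'].ffill()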
You can use pandas.DataFrame.fillna with the method='ffill' option. 'ffill' stands for 'forward fill' and will propagate last valid observation forward. The alternative is 'bfill' which works the same way, but backwards.
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')
print(df)
# 0 1 2
#0 1 2 3
#1 4 2 3
#2 4 2 9
There is also a direct synonym function for this, pandas.DataFrame.ffill, to make things simpler.
One thing that I noticed when trying this solution is that if you have N/A at the start or the end of the array, ffill and bfill don't quite work. You need both.
In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])
In [225]: df.ffill()
Out[225]:
0
0 NaN
1 1.0
...
7 6.0
8 6.0
In [226]: df.bfill()
Out[226]:
0
0 1.0
1 1.0
...
7 6.0
8 NaN
In [227]: df.bfill().ffill()
Out[227]:
0
0 1.0
1 1.0
...
7 6.0
8 6.0
Only one column version
Fill NAN with last valid value
df[column_name].fillna(method='ffill', inplace=True)
Fill NAN with next valid value
df[column_name].fillna(method='backfill', inplace=True)
Just agreeing with the ffill method, but one extra piece of information is that you can limit the forward fill with the keyword argument limit.
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])
>>> df
0 1 2
0 1.0 2.0 3
1 NaN NaN 6
2 NaN NaN 9
>>> df[1].fillna(method='ffill', inplace=True)
>>> df
0 1 2
0 1.0 2.0 3
1 NaN 2.0 6
2 NaN 2.0 9
Now with limit keyword argument
>>> df[0].fillna(method='ffill', limit=1, inplace=True)
>>> df
0 1 2
0 1.0 2.0 3
1 1.0 2.0 6
2 NaN 2.0 9
ffill now has its own method, pd.DataFrame.ffill:
df.ffill()
0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 3.0
2 4.0 2.0 9.0
You can use fillna to fill NaN values forward or to replace them with a fixed value.
Forward fill
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df.fillna(method='ffill')
0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 3.0
2 4.0 2.0 9.0
Replace with a fixed value
df.fillna(0)  # replace every NaN with 0 (or whatever value you want)
0 1 2
0 1.0 2.0 3.0
1 4.0 0.0 0.0
2 0.0 0.0 9.0
Reference pandas.DataFrame.fillna
There's also pandas.DataFrame.interpolate, which I think gives you more control:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.interpolate(method="pad", limit=None, downcast="infer")  # downcast keeps the dtype as int
print(df)
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
In my case, we have time series from different devices, but some devices could not send any value during some periods. So we should create NA values for every device and time period and only then do the fillna.
df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')
Result:
0 1 value
0 device1 1 first val of device1
1 device1 2 first val of device1
2 device1 3 first val of device1
3 device2 1 None
4 device2 2 first val of device2
5 device2 3 first val of device2
6 device3 1 None
7 device3 2 None
8 device3 3 first val of device3