Interpolate function in Pandas DataFrame

Which are the equations used to interpolate a DataFrame in Pandas?
Reading the following link, I couldn't find anything related to them.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html
I need exactly this:
But I'm not sure the interpolate() function does the same thing. If it doesn't, is there any way I can change it to work like that?
EDIT: Example of dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 10, np.nan, 20, 17, np.nan, np.nan, 14, np.nan, 10, np.nan],
                   [5, np.nan, 0, np.nan, np.nan, np.nan, 5, np.nan, 10, np.nan, np.nan],
                   [3, np.nan, np.nan, np.nan, np.nan, np.nan, 2, np.nan, np.nan, np.nan, np.nan],
                   [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, 7, 8, 9, np.nan]],
                  columns=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'])

Unfortunately, the interpolate method is NOT doing exactly that. However, it is still possible to achieve what you want.
Short answer
df.interpolate(limit=1).mul(~(df.shift(-1).isna() & df.isna())).fillna(0)
Step by step explanation
By default, the interpolate method treats the values as equally spaced. So if you input [0,NaN,10,NaN,NaN,16] for instance, you'll get [0,5,10,12,14,16]. This behavior is controlled by the method parameter of the interpolate function. You don't have to change it in your case.
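As a quick check of that claim (a minimal sketch with a plain Series):
>>> pd.Series([0, np.nan, 10, np.nan, np.nan, 16]).interpolate()
0     0.0
1     5.0
2    10.0
3    12.0
4    14.0
5    16.0
dtype: float64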
>>> df = pd.DataFrame([np.nan, 10, np.nan, 20, 17, np.nan, np.nan, 14, np.nan, 10, np.nan], columns=["value"])
>>> df
value
0 NaN
1 10.0
2 NaN
3 20.0
4 17.0
5 NaN
6 NaN
7 14.0
8 NaN
9 10.0
10 NaN
>>> df.interpolate()
value
0 NaN
1 10.0
2 15.0
3 20.0
4 17.0
5 16.0
6 15.0
7 14.0
8 12.0
9 10.0
10 10.0
Now, the default behavior replaces every NaN, but you don't want consecutive NaNs to be replaced, so you need to use the limit parameter.
This parameter limits the number of consecutive NaNs that will be replaced, but crucially, if you set the limit to 1, the first NaN of a run of consecutive NaNs will still be replaced; you don't want that!
>>> df.interpolate(limit=1)
value
0 NaN
1 10.0
2 15.0
3 20.0
4 17.0
5 16.0
6 NaN
7 14.0
8 12.0
9 10.0
10 10.0
To get rid of those first values, you need to know which values are NaN and directly followed by another NaN. Use this:
>>> df.shift(-1).isna() & df.isna()
value
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 True
You can then multiply your dataframe by the negation (~) of this expression. (Note that n*False = 0 and n*True = n.)
>>> df.interpolate(limit=1).mul(~(df.shift(-1).isna() & df.isna()))
value
0 NaN
1 10.0
2 15.0
3 20.0
4 17.0
5 0.0
6 NaN
7 14.0
8 12.0
9 10.0
10 0.0
Finally, replace the remaining NaN values with 0, using fillna:
>>> df.interpolate(limit=1).mul(~(df.shift(-1).isna() & df.isna())).fillna(0)
value
0 0.0
1 10.0
2 15.0
3 20.0
4 17.0
5 0.0
6 0.0
7 14.0
8 12.0
9 10.0
10 0.0
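Note that the walkthrough above operates on a single column (the default axis=0). To apply the same recipe row-wise to the DataFrame from the EDIT, the interpolation and the shift both have to run along the columns; a minimal sketch, assuming axis=1 is what you want:
# assuming df is the 4x11 frame from the question's EDIT
mask = df.shift(-1, axis=1).isna() & df.isna()   # NaN directly followed by NaN, within each row
df.interpolate(axis=1, limit=1).mul(~mask).fillna(0)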

Related

Pandas Shift: Looking for better alternative [duplicate]

This question was closed as a duplicate of Make Multiple Shifted (Lagged) Columns in Pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 0, 0], [4, 5, 0], [7, 7, 7], [7, 4, 5], [4, 5, 0],
                            [7, 8, 9], [3, 2, 9], [9, 3, 6], [6, 8, 5]]),
                  columns=['a', 'b', 'c'],
                  index=['1/1/2000', '1/1/2001', '1/1/2002', '1/1/2003', '1/1/2004',
                         '1/1/2005', '1/1/2006', '1/1/2007', '1/1/2008'])
df['a_1'] = df['a'].shift(1)
df['a_3'] = df['a'].shift(3)
df['a_5'] = df['a'].shift(5)
df['a_7'] = df['a'].shift(7)
Above is a dummy example of how I am shifting.
Issues:
1. I need an extra line for each shift period; can it be done in one go?
2. The df above is small; on a massive DataFrame this operation is slow. I checked other questions: most attribute this to shift not being Cython-optimized. Is there a faster way (apart from Numba, which a few answers do discuss)?
You can build all of the shifted columns in one go with a list comprehension and a single concat:
nums = [1, 3, 5, 7]
pd.concat([df] + [df['a'].shift(i).to_frame(f'a_{i}') for i in nums], axis=1)
result:
a b c a_1 a_3 a_5 a_7
1/1/2000 1 0 0 NaN NaN NaN NaN
1/1/2001 4 5 0 1.0 NaN NaN NaN
1/1/2002 7 7 7 4.0 NaN NaN NaN
1/1/2003 7 4 5 7.0 1.0 NaN NaN
1/1/2004 4 5 0 7.0 4.0 NaN NaN
1/1/2005 7 8 9 4.0 7.0 1.0 NaN
1/1/2006 3 2 9 7.0 7.0 4.0 NaN
1/1/2007 9 3 6 3.0 4.0 7.0 1.0
1/1/2008 6 8 5 9.0 7.0 7.0 4.0
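As an aside (my own sketch, not part of the original answer): the same result can be written as a single assign call, one keyword argument per shift period:
nums = [1, 3, 5, 7]
# one new column per period, built in a single assign call
df = df.assign(**{f'a_{i}': df['a'].shift(i) for i in nums})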

Replace outliers in Pandas dataframe by NaN

I'd like to replace outliers by np.nan. I have a dataframe containing floats, ints and NaNs such as:
import numpy as np
import pandas as pd

df_ex = pd.DataFrame({
    'a': [np.nan, np.nan, 2.0, -0.5, 6, 120],
    'b': [1, 3, 4, 2, 40, 11],
    'c': [np.nan, 2, 3, 4, 2, 2],
    'd': [6, 2.2, np.nan, 0, 3, 3],
    'e': [12, 4, np.nan, -5, 5, 5],
    'f': [2, 3, 8, 2, 12, 8],
    'g': [3, 3, 9.0, 11, np.nan, 2]})
with this function:
def outliers(s, replace=np.nan):
    Q1, Q3 = np.percentile(s, [25, 75])
    IQR = Q3 - Q1
    return s.where((s >= (Q1 - 1.5 * IQR)) & (s <= (Q3 + 1.5 * IQR)), replace)

df_ex_o = df_ex.apply(outliers, axis=1)
but I get a result with far more NaNs than expected.
Any idea what's going on? I'd like the outliers to be calculated column-wise.
Thanks as always for your help.
Don't use apply. Here is annotated code for an optimized version:
def mask_outliers(df, replace):
    # Calculate the Q1 and Q3 quantiles
    q = df.agg('quantile', q=[.25, .75])
    # Calculate IQR = Q3 - Q1
    iqr = q.loc[.75] - q.loc[.25]
    # Calculate the lower and upper limits used to decide outliers
    lower = q.loc[.25] - 1.5 * iqr
    upper = q.loc[.75] + 1.5 * iqr
    # Replace the values that do not lie within [lower, upper]
    return df.where(df.ge(lower) & df.le(upper), replace)
Result
mask_outliers(df_ex, np.nan)
a b c d e f g
0 NaN 1.0 NaN NaN NaN 2 3.0
1 NaN 3.0 2.0 2.2 4.0 3 3.0
2 2.0 4.0 3.0 NaN NaN 8 9.0
3 -0.5 2.0 4.0 NaN NaN 2 11.0
4 6.0 NaN 2.0 3.0 5.0 12 NaN
5 NaN 11.0 2.0 3.0 5.0 8 2.0
This answer addresses the question as actually asked:
Any idea what's going on? I'd like the outliers to be calculated column-wise.
whereas the other (accepted) answer only provides a better solution to what you want to achieve.
There are two issues to fix in order to make your code do what it should:
1. The NaN values have to be removed from a column before calling np.percentile(), otherwise both Q1 and Q3 come out as NaN. This is one reason for the many NaN values you see when applying your code to the DataFrame: np.percentile() behaves differently here from Pandas' .agg('quantile', ...), which implicitly skips NaN values when calculating the Q1 and Q3 thresholds.
2. The axis has to be changed from 1 to 0 (i.e. to .apply(outliers, axis=0)) in order to apply outliers column-wise. This is the other reason for the many NaN values in your result: the only row not entirely set to NaN is the one without any NaN of its own; in every row that contains a NaN, all values get set to NaN for the reason explained above.
The following changes to your code:
colmn_noNaN = colmn.dropna()
Q1, Q3 = np.percentile(colmn_noNaN, [25 ,75])
and
df_ex_o = df_ex.apply(outliers, axis=0)
will solve the issues. Below is the entire code and its output:
import pandas as pd
import numpy as np

df_ex = pd.DataFrame({
    'a': [np.nan, np.nan, 2.0, -0.5, 6, 120],
    'b': [1, 3, 4, 2, 40, 11],
    'c': [np.nan, 2, 3, 4, 2, 2],
    'd': [6, 2.2, np.nan, 0, 3, 3],
    'e': [12, 4, np.nan, -5, 5, 5],
    'f': [2, 3, 8, 2, 12, 8],
    'g': [3, 3, 9.0, 11, np.nan, 2]})
# print(df_ex)

def outliers(colmn, replace=np.nan):
    colmn_noNaN = colmn.dropna()
    Q1, Q3 = np.percentile(colmn_noNaN, [25, 75])
    IQR = Q3 - Q1
    return colmn.where((colmn >= (Q1 - 1.5 * IQR)) & (colmn <= (Q3 + 1.5 * IQR)), replace)

df_ex_o = df_ex.apply(outliers, axis=0)
print(df_ex_o)
gives:
a b c d e f g
0 NaN 1.0 NaN NaN NaN 2 3.0
1 NaN 3.0 2.0 2.2 4.0 3 3.0
2 2.0 4.0 3.0 NaN NaN 8 9.0
3 -0.5 2.0 4.0 NaN NaN 2 11.0
4 6.0 NaN 2.0 3.0 5.0 12 NaN
5 NaN 11.0 2.0 3.0 5.0 8 2.0
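As a side note (my own variant, not from either answer): np.nanpercentile ignores NaNs by itself, so the explicit dropna step can be folded away:
def outliers(colmn, replace=np.nan):
    # np.nanpercentile skips NaN values, so no dropna is needed
    Q1, Q3 = np.nanpercentile(colmn, [25, 75])
    IQR = Q3 - Q1
    return colmn.where((colmn >= Q1 - 1.5 * IQR) & (colmn <= Q3 + 1.5 * IQR), replace)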

How can I fill NaN values with the mean of the adjacent columns in a Pandas DataFrame?

I have a large data set with some missing values. I want to fill each NaN with the mean of the values in the column before and the column after it; in certain cases the NaN values are consecutive, and in those cases I want to replace the whole run with the first non-NaN value that can be found.
For example:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
NaN NaN NaN NaN 29.0 30.0 NaN 16.0 15.0 16.0 17.0 NaN 28.0 30.0 NaN 28.0 18.0
The goal is for the data to look like this:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
29.0 29.0 29.0 29.0 29.0 30.0 23.0 16.0 15.0 16.0 17.0 NaN 28.0 30.0 NaN 28.0 18.0
Updated Proposal based on your feedback:
import pandas as pd
import numpy as np

# Build the DataFrame (create list of dict)
items = [{0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: 29.0, 5: 30.0,
          6: np.nan, 7: 16.0, 8: 15.0, 9: 16.0, 10: 17.0, 11: np.nan, 12: 28.0,
          13: 30.0, 14: np.nan, 15: 28.0, 16: 18.0}]
your_example = pd.DataFrame(data=items, index=[1])
your_example  # (this is what you have in your question above, as I understand it)
Desired Endstate Outcomes for NaN Values (a sketch implementing all three follows the list):
Situation A: 2 float/ints on either side of the NaN --> calculate the average of the 2 values and replace NaN
Old: 16.0, NaN, 18.0
Modified: 16.0, 17.0, 18.0
Situation B: two or more consecutive NaN values: replace the whole run of NaNs with the first non-zero float/int in the column.
Old: NaN, NaN, NaN, 29.0
Modified: 29.0, 29.0, 29.0, 29.0
Situation C: If there is an NaN in the last value of a given column, write the last non-zero float/int to replace the NaN value.
Old: 19.0, 11.0, 0, NaN
Modified: 19.0, 11.0, 0, 11.0
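The answer doesn't show code for this updated proposal, so here is a minimal sketch of the three situations as described above. This is my own reading of the intended logic (fill_row is a hypothetical helper, applied row-wise), not the original author's code:
import numpy as np
import pandas as pd

def fill_row(s):
    s = s.copy()
    isna = s.isna()
    # Situation A: a single NaN with numbers on both sides -> mean of its neighbours
    single = isna & ~isna.shift(1, fill_value=False) & ~isna.shift(-1, fill_value=False)
    s[single] = (s.shift(1) + s.shift(-1)) / 2
    # Situation B: runs of consecutive NaNs -> backfill with the first value that
    # follows (the text says "non-zero"; this sketch uses the first value regardless)
    s = s.bfill()
    # Situation C: trailing NaNs -> last non-zero value seen earlier in the row
    s = s.fillna(s[s != 0].ffill())
    return s

filled = your_example.apply(fill_row, axis=1)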
Initial Proposal based on interpolate function
Based on the data in your example, and my assumption that taking the mean of the values before and after a NaN is the same as a linear function, I would use the pandas.Series.interpolate function to achieve this.
your_example.interpolate(method='linear', axis=1, limit_direction='backward')
The axis=1 is critical to running it horizontally across your columns; otherwise it defaults to 0. The backward limit_direction ensures your leading NaN values are accounted for. This can be changed and modified based on the documentation (link above). Hope this helps!

Explain curious behavior of Pandas.Series.interpolate

s = pd.Series([0, 2, np.nan, 8])
print(s)
interp = s.interpolate(method='polynomial', order=2)
print(interp)
This prints:
0 0.0
1 2.0
2 NaN
3 8.0
dtype: float64
0 0.000000
1 2.000000
2 4.666667
3 8.000000
dtype: float64
Now if I add one more np.nan to the series,
s = pd.Series([0, 2, np.nan, np.nan, 8])
print(s)
interp = s.interpolate(method='polynomial', order=2)
print(interp)
I get much more accurate results:
0 0.0
1 2.0
2 NaN
3 NaN
4 8.0
dtype: float64
0 0.0
1 2.0
2 4.0
3 6.0
4 8.0
dtype: float64
Is Series.interpolate recursive in that it uses interpolated values for further interpolated values, which then can affect previously interpolated values?
You are actually interpolating two different functions!
In the first case you look for a function that goes through the following points:
(0,0), (1,2), (3,8)
But in the second case you look for a function that goes through the following points:
(0,0), (1,2), (4,8)
The indices of a pd.Series are the points on the x-axis, and the data of a pd.Series are the points on the y-axis.
So try the following change in your first example:
s = pd.Series([0, 2, np.nan, 8])
s = pd.Series([0, 2, np.nan, 8], [0,1,2,4])
s.interpolate(method='polynomial', order=2)
You should get the output:
0 0.0
1 2.0
2 4.0
4 8.0
dtype: float64
As an alternative you could also do:
s = pd.Series([0, 2, np.nan, 8], [0,1,3,4])
and the output:
0 0.0
1 2.0
3 6.0
4 8.0
dtype: float64
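One caveat worth adding (my note, not the original answer's): this index-awareness holds for methods such as 'polynomial' and 'index'; the default method='linear' ignores the index and treats the values as equally spaced.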
Hope this helps.

Getting the last n elements of a series by group?

d:
import numpy as np
import pandas as pd

d = pd.DataFrame({'tic': ['B', 'C', 'A', 'A', 'C', 'A', 'A', 'B', 'B', 'C', 'A', 'A'],
                  'em': [10, 5, np.nan, 5, np.nan, np.nan, 12, np.nan, 12, 7,
                         5, np.nan],
                  'C': [1, 4, np.nan, 2, 7, np.nan, 7, 9, 7, np.nan, 7, 9]})
d.set_index(['tic'], inplace=True, drop=False)
d.sort_index(level=0, inplace=True)
If d['em'][-3:] does get me the last 3 elements of column em, why doesn't d['em'][-3:].groupby(level=0) get me the last 3, by group?
Also, why does d['em'][-3:].groupby('tic') give KeyError: 'tic'?
I thought level=0 and 'tic' could both be used in this case, based on:
In[40]: d.index.names
Out[40]: FrozenList(['tic', 'None'])
I think you need groupby with GroupBy.tail; last, for a DataFrame, use reset_index and rename the column level_1:
print (d.groupby(level='tic')['em'].tail(3))
tic
A 1971-09-30 12.0
1972-09-30 5.0
1972-12-31 NaN
B 1970-03-31 10.0
1971-12-31 NaN
1972-03-31 12.0
C 1970-06-30 5.0
1971-03-31 NaN
1972-06-30 7.0
Name: em, dtype: float64
d1 = d.groupby(level='tic')['em'].tail(3).reset_index().rename(columns={'level_1':'date'})
print (d1)
tic date em
0 A 1971-09-30 12.0
1 A 1972-09-30 5.0
2 A 1972-12-31 NaN
3 B 1970-03-31 10.0
4 B 1971-12-31 NaN
5 B 1972-03-31 12.0
6 C 1970-06-30 5.0
7 C 1971-03-31 NaN
8 C 1972-06-30 7.0
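As for the why (my own note, not part of the original answer): d['em'][-3:] slices first, so the subsequent groupby only ever sees the last three rows of the whole Series and can never return three rows per group. A minimal sketch with the toy frame above:
# The slice runs before the groupby, leaving at most three rows in total:
print(d['em'][-3:].groupby(level=0).size())
# Grouping first, then taking the tail, yields the last three rows per group:
print(d.groupby(level=0)['em'].tail(3))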