Replace outliers in Pandas dataframe by NaN - pandas

I'd like to replace outliers by np.nan. I have a dataframe containing floats, int and NaNs such as:
df_ex = pd.DataFrame({
'a': [np.nan,np.nan,2.0,-0.5,6,120],
'b': [1, 3, 4, 2,40,11],
'c': [np.nan, 2, 3, 4,2,2],
'd': [6, 2.2, np.nan, 0,3,3],
'e': [12, 4, np.nan, -5,5,5],
'f': [2, 3, 8, 2,12,8],
'g': [3, 3, 9.0, 11, np.nan,2]})
with this function:
def outliers(s, replace=np.nan):
    Q1, Q3 = np.percentile(s, [25 ,75])
    IQR = Q3-Q1
    return s.where((s >= (Q1 - 1.5 * IQR)) & (s <= (Q3 + 1.5 * IQR)), replace)
df_ex_o = df_ex.apply(outliers, axis=1)
but I get:
Any idea on what's going on? I'd like the outliers to be calculated column wise.
Thanks as always for your help.

Don't use apply here. Below is annotated code for a vectorized version:
def mask_outliers(df, replace):
    # Calculate the Q1 (25%) and Q3 (75%) quantiles of each column
    q = df.agg('quantile', q=[.25, .75])
    # Calculate IQR = Q3 - Q1
    iqr = q.loc[.75] - q.loc[.25]
    # Calculate lower and upper limits to decide outliers
    lower = q.loc[.25] - 1.5 * iqr
    upper = q.loc[.75] + 1.5 * iqr
    # Replace the values that do not lie within [lower, upper]
    return df.where(df.ge(lower) & df.le(upper), replace)
Result
mask_outliers(df_ex, np.nan)
a b c d e f g
0 NaN 1.0 NaN NaN NaN 2 3.0
1 NaN 3.0 2.0 2.2 4.0 3 3.0
2 2.0 4.0 3.0 NaN NaN 8 9.0
3 -0.5 2.0 4.0 NaN NaN 2 11.0
4 6.0 NaN 2.0 3.0 5.0 12 NaN
5 NaN 11.0 2.0 3.0 5.0 8 2.0

This answer addresses the question itself:
Any idea on what's going on? I'd like the outliers to be calculated column wise.
whereas the other (accepted) answer only provides a better solution to what you want to achieve.
There are two issues to fix to make your code do what it should:
First, the NaN values have to be removed from the column before calling np.percentile(); otherwise both Q1 and Q3 come out as NaN.
This is one of the reasons for the many NaN values in the result of applying your code to the DataFrame. np.percentile() behaves differently here from pandas' .agg('quantile', ...), which implicitly skips NaN values when calculating the Q1 and Q3 thresholds.
Second, the value for axis has to be changed from 1 to 0 (i.e. to .apply(outliers, axis=0)) in order to apply outliers column-wise.
This is the other reason for the many NaN values in the result you got. With axis=1, the only row whose values are not all set to NaN is the one that contains no NaN itself; in every other row, all values are set to NaN for the reason explained above.
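The difference in NaN handling described above can be checked directly on column a from the question. This is only a small sketch; np.nanpercentile is shown as one further NaN-skipping option and is not part of the original answers.

```python
import numpy as np
import pandas as pd

# Column 'a' from the question, including its two NaN entries
col = pd.Series([np.nan, np.nan, 2.0, -0.5, 6, 120])

# np.percentile propagates NaN: both thresholds come out as NaN
q_np = np.percentile(col.to_numpy(), [25, 75])

# np.nanpercentile and Series.quantile both ignore the NaN entries
q_nan = np.nanpercentile(col.to_numpy(), [25, 75])
q_pd = col.quantile([0.25, 0.75])

print(q_np)             # both entries are nan
print(q_nan)            # Q1 and Q3 computed from the four non-NaN values
print(q_pd.to_numpy())  # same thresholds as np.nanpercentile
```

With NaN thresholds every comparison in s.where(...) is False, which is why the whole row or column is replaced.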
The following changes to your code:
colmn_noNaN = colmn.dropna()
Q1, Q3 = np.percentile(colmn_noNaN, [25 ,75])
and
df_ex_o = df_ex.apply(outliers, axis=0)
will solve the issues. Below is the entire code and its output:
import pandas as pd
import numpy as np
df_ex = pd.DataFrame({
'a': [np.nan,np.nan,2.0,-0.5,6,120],
'b': [1, 3, 4, 2,40,11],
'c': [np.nan, 2, 3, 4,2,2],
'd': [6, 2.2, np.nan, 0,3,3],
'e': [12, 4, np.nan, -5,5,5],
'f': [2, 3, 8, 2,12,8],
'g': [3, 3, 9.0, 11, np.nan,2]})
# print(df_ex)
def outliers(colmn, replace=np.nan):
    colmn_noNaN = colmn.dropna()
    Q1, Q3 = np.percentile(colmn_noNaN, [25 ,75])
    IQR = Q3-Q1
    return colmn.where((colmn >= (Q1 - 1.5 * IQR)) & (colmn <= (Q3 + 1.5 * IQR)), replace)
df_ex_o = df_ex.apply(outliers, axis = 0)
print(df_ex_o)
gives:
a b c d e f g
0 NaN 1.0 NaN NaN NaN 2 3.0
1 NaN 3.0 2.0 2.2 4.0 3 3.0
2 2.0 4.0 3.0 NaN NaN 8 9.0
3 -0.5 2.0 4.0 NaN NaN 2 11.0
4 6.0 NaN 2.0 3.0 5.0 12 NaN
5 NaN 11.0 2.0 3.0 5.0 8 2.0

Related

During calculation of "distance average" in knn imputation method for replacing NaN value in particular column

I encountered this problem while implementing the KNN imputation method for handling missing data from scratch. I created a dummy dataset and found the nearest neighbors for rows that contain missing values. Here is my dataset:
A B C D E
0 NaN 2.0 4.0 10.0 100.0
1 NaN 3.0 9.0 12.0 NaN
2 5.0 2.0 20.0 50.0 75.0
3 3.0 5.0 7.0 NaN 150.0
4 2.0 9.0 7.0 30.0 90.0
For row 0 the nearest neighbors are rows 1 and 2. To replace the NaN value at (0, A) we compute the distance average of the nearest neighbors' values in the same column. But what if one of the nearest neighbors' values is also NaN?
Example:
Suppose the nearest neighbors for row 3 are rows 2 and 4. Row 3 has a missing value in column D, and to replace it we compute the distance average of the nearest neighbors' values in column D, like this:
distance average = [(1/D1) * 50.0 + (1/D2) * 30.0]/2
and replace the NaN value at (3, D) with this average (where D1 and D2 are the corresponding NaN-Euclidean distances). But in the case of row 0 the nearest neighbors are rows 1 and 2, and to replace the NaN value at (0, A) we need the distance average of the values of rows 1 and 2 in column A. The value at (2, A) is 5.0, which is fine, but the value at (1, A) is NaN, so we can't compute it like this:
distance average = [(1/D3) * NaN + (1/D4) * 5.0]/2
So how do we replace the NaN value at (0, A)? And how does sklearn's KNNImputer handle this kind of scenario?
The sklearn KNNImputer uses the nan_euclidean_distances metric by default. According to its user guide:
If a sample has more than one feature missing, then the neighbors for
that sample can be different depending on the particular feature being
imputed.
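In other words, each missing feature is imputed from the n_neighbors nearest rows that have a value for that feature, so the algorithm might use different sets of neighbors to impute the single missing value in column D and the two missing values in column A.
This is a simple usage example of KNNImputer: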
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
A = [np.nan, np.nan, 5, 3, 2]
B = [2, 3, 2, 5, 9]
C = [4, 9, 20, 7, 7]
D = [10, 12, 50, np.nan, 30]
E = [100, np.nan, 75, 150, 90]
columns=['A', 'B', 'C', 'D', 'E']
data = pd.DataFrame(list(zip(A, B, C, D, E)),
columns=columns)
imputer = KNNImputer(n_neighbors=2)
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=columns)
This is the output:
A B C D E
0 3.5 2.0 4.0 10.0 100.0
1 2.5 3.0 9.0 12.0 125.0
2 5.0 2.0 20.0 50.0 75.0
3 3.0 5.0 7.0 11.0 150.0
4 2.0 9.0 7.0 30.0 90.0
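For a from-scratch implementation, the behaviour described above can be sketched with NumPy alone. This is a minimal sketch, not sklearn's actual code: nan_euclidean and knn_impute are hypothetical helper names, and it assumes uniform weights (a plain mean of the neighbors) rather than the inverse-distance weighting from the question. The key step is that the candidate neighbors for a cell are restricted to rows that have a value in that column, so a NaN never enters the average.

```python
import numpy as np
import pandas as pd

def nan_euclidean(x, y):
    """Euclidean distance over the coordinates present in both rows,
    rescaled by (total coordinates / present coordinates)."""
    both = ~np.isnan(x) & ~np.isnan(y)
    if not both.any():
        return np.nan
    return np.sqrt(x.size / both.sum() * np.sum((x[both] - y[both]) ** 2))

def knn_impute(X, k=2):
    """Hypothetical helper: fill each NaN with the unweighted mean of the
    k nearest rows that have a value in that column."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        # Only rows with a value in column j can donate (row i is excluded,
        # because its own value in column j is NaN)
        donors = np.flatnonzero(~np.isnan(X[:, j]))
        dist = np.array([nan_euclidean(X[i], X[d]) for d in donors])
        nearest = donors[np.argsort(dist)[:k]]
        out[i, j] = X[nearest, j].mean()
    return out

data = pd.DataFrame({
    'A': [np.nan, np.nan, 5, 3, 2],
    'B': [2, 3, 2, 5, 9],
    'C': [4, 9, 20, 7, 7],
    'D': [10, 12, 50, np.nan, 30],
    'E': [100, np.nan, 75, 150, 90],
})
imputed = pd.DataFrame(knn_impute(data.to_numpy(), k=2), columns=data.columns)
print(imputed)
```

For (0, A) the donors are rows 2, 3 and 4; the two closest are rows 4 and 2, so row 1 is never considered even though it is the closest row overall.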

Explain curious behavior of Pandas.Series.interpolate

s = pd.Series([0, 2, np.nan, 8])
print(s)
interp = s.interpolate(method='polynomial', order=2)
print(interp)
This prints:
0 0.0
1 2.0
2 NaN
3 8.0
dtype: float64
0 0.000000
1 2.000000
2 4.666667
3 8.000000
dtype: float64
Now if I add one more np.nan to the series,
s = pd.Series([0, 2, np.nan, np.nan, 8])
print(s)
interp = s.interpolate(method='polynomial', order=2)
print(interp)
I get much more accurate results:
0 0.0
1 2.0
2 NaN
3 NaN
4 8.0
dtype: float64
0 0.0
1 2.0
2 4.0
3 6.0
4 8.0
dtype: float64
Is Series.interpolate recursive in that it uses interpolated values for further interpolated values, which then can affect previously interpolated values?
You are actually interpolating two different functions!
In the first case you look for a function that goes through the following points:
(0,0), (1,2), (3,8)
But in the second case you look for a function that goes through the following points:
(0,0), (1,2), (4,8)
The index of a pd.Series provides the x-coordinates of the points and the data of the pd.Series provides the y-coordinates.
So try the following change in your first example, passing an explicit index:
# before: s = pd.Series([0, 2, np.nan, 8])
s = pd.Series([0, 2, np.nan, 8], [0,1,2,4])
s.interpolate(method='polynomial', order=2)
You should get the output:
0 0.0
1 2.0
2 4.0
4 8.0
dtype: float64
As an alternative you could also do:
s = pd.Series([0, 2, np.nan, 8], [0,1,3,4])
and the output:
0 0.0
1 2.0
3 6.0
4 8.0
dtype: float64
Hope this helps.
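The two fits can also be checked without pandas. The sketch below assumes that an order-2 interpolation through exactly three known points is the unique quadratic through them, and uses np.polyfit as a stand-in for whatever pandas does internally; quad_through is a hypothetical helper name.

```python
import numpy as np

def quad_through(xs, ys):
    """Quadratic passing through three (x, y) points."""
    return np.poly1d(np.polyfit(xs, ys, 2))

# First series: known points at index 0, 1 and 3
p1 = quad_through([0, 1, 3], [0, 2, 8])
# Second series: known points at index 0, 1 and 4
p2 = quad_through([0, 1, 4], [0, 2, 8])

print(p1(2))       # value filled in at index 2 of the first series
print(p2([2, 3]))  # values filled in at indices 2 and 3 of the second series
```

The points of the second series happen to lie on the straight line y = 2x, which is why its interpolated values look "more accurate".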

Fill missing values with the previous value within each group

I have three columns of data: the first column is a category id, and the second and third columns have some missing values. I want to group by the id in the first column and then, within each group, fill the missing values in the third column using the 'ffill' method.
I found a good idea here: "Pandas: filling missing values by weighted average in each group", but it didn't solve my problem because the output it produced was not what I wanted.
The following code creates an example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['A','A', 'B','B','B','B', 'C','C','C'],'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
'sss':[1, np.nan, 3, np.nan, np.nan, np.nan, 2, np.nan, np.nan]})
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN NaN
2 B NaN 3.0
3 B 2.0 NaN
4 B 3.0 NaN
5 B 1.0 NaN
6 C 3.0 2.0
7 C NaN NaN
8 C 3.0 NaN
Fill in missing values with a previous value after grouping
Then I ran the following code, but it outputs strange results:
df["sss"] = df.groupby("name").transform(lambda x: x.fillna(axis = 0,method = 'ffill'))
df
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN NaN
3 B 2.0 2.0
4 B 3.0 3.0
5 B 1.0 1.0
6 C 3.0 3.0
7 C NaN 3.0
8 C 3.0 3.0
The result I want is this:
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN 3.0
3 B 2.0 3.0
4 B 3.0 3.0
5 B 1.0 3.0
6 C 3.0 2.0
7 C NaN 2.0
8 C 3.0 2.0
Can someone point out where I am wrong?
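One possible fix, sketched under the assumption that only the sss column should be filled: select that column from the grouped object before forward-filling. In the attempt above, transform is applied to every non-key column, so its result contains both value and sss, and in the output shown the filled value column appears to have ended up in sss.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    'sss': [1, np.nan, 3, np.nan, np.nan, np.nan, 2, np.nan, np.nan],
})

# Forward-fill 'sss' only, and only within each 'name' group
df['sss'] = df.groupby('name')['sss'].ffill()
print(df)
```

The value column is left untouched, so its NaN entries remain as in the desired output.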

Pandas Dataframe multiply with only the right dataframe taking fill_value

The fill_value argument of pandas.DataFrame.multiply() fills missing values in both dataframes. However, I'd like to have only missing values filled in the 2nd DataFrame. What would be a good way beyond my hacky solution below?
>>> df1 = pd.DataFrame({'a':[1, np.nan, 2], 'b':[np.nan, 3, 4]}, index = [1, 2, 3])
>>> df1
a b
1 1.0 NaN
2 NaN 3.0
3 2.0 4.0
>>> df2 = pd.DataFrame({'a':[2, np.nan], 'b':[3, np.nan], 'c':[1, 1]}, index = [1, 2])
>>> df2
a b c
1 2.0 3.0 1.0
2 NaN NaN 1.0
I would like to multiply the two DataFrames element-wise, by keeping df1 as the dominant one so that the resulting shape and NaN entries should match df1, while filling NaNs in df2 by value 1, to get
a b
1 2.0 NaN
2 NaN 3.0
3 2.0 4.0
The naive solution doesn't work:
>>> df1.multiply(df2, fill_value=1)
a b c
1 2.0 3.0 1.0
2 NaN 3.0 1.0
3 2.0 4.0 NaN
My hacky solution is to create a matrix with 1s where df1 has value, and update by df2
>>> df3 = df1/df1
>>> df3.update(df2)
>>> df3
a b
1 2.0 3.0
2 NaN 1.0
3 1.0 1.0
>>> df1.multiply(df3)
a b
1 2.0 NaN
2 NaN 3.0
3 2.0 4.0
It just doesn't feel very elegant. Any cool idea on direct manipulation with df1 and df2, hopefully a one-liner?
You can use reindex and fillna on df2:
df1.multiply(df2.reindex(df1.index).fillna(1))
a b
1 2.0 NaN
2 NaN 3.0
3 2.0 4.0
You don't need to explicitly call multiply in this case, and can just use * for multiplication:
df1 * df2.reindex(df1.index).fillna(1)
a b
1 2.0 NaN
2 NaN 3.0
3 2.0 4.0
Because df2 has an extra column c, the result of the expressions above also includes an all-NaN column c. To restrict df2 to the columns of df1 as well, use the columns parameter of reindex:
df1 * df2.reindex(index=df1.index, columns=df1.columns).fillna(1)
One alternative would be to filter the result based on the nulls in df1:
df1.multiply(df2, fill_value=1)[df1.notnull()]
Out:
a b
1 2.0 NaN
2 NaN 3.0
3 2.0 4.0
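As a self-contained check of the reindex approach, the sketch below rebuilds the two frames from the question, aligns df2 to both the index and the columns of df1, and fills only df2's gaps with 1. The name df2_aligned is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, np.nan, 2], 'b': [np.nan, 3, 4]}, index=[1, 2, 3])
df2 = pd.DataFrame({'a': [2, np.nan], 'b': [3, np.nan], 'c': [1, 1]}, index=[1, 2])

# Give df2 the shape of df1 (dropping column c, adding row 3),
# then fill the gaps of df2 only
df2_aligned = df2.reindex(index=df1.index, columns=df1.columns).fillna(1)

# NaN entries of df1 stay NaN, because NaN * 1 is still NaN
result = df1 * df2_aligned
print(result)
```

Since fillna is applied to df2 before the multiplication, df1 never has its own missing values replaced.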

Pandas dataframe creating multiple rows at once via .loc

I can create a new row in a dataframe using .loc():
>>> df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
>>> df
a b
1 10 100
2 20 200
>>> df.loc[3, 'a'] = 30
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
But how can I create more than one row using the same method?
>>> df.loc[[4, 5], 'a'] = [40, 50]
...
KeyError: '[4 5] not in index'
I'm familiar with .append() but am looking for a way that does NOT require constructing a new row into a Series before having it appended to df.
Desired input:
>>> df.loc[[4, 5], 'a'] = [40, 50]
Desired output
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
Where last 2 rows are newly added.
Admittedly, this is a very late answer, but I have had to deal with a similar problem and think my solution might be helpful to others as well.
After recreating your data, it is basically a two-step approach:
Recreate data:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
df.loc[3, 'a'] = 30
Extend the df.index using .reindex:
idx = list(df.index)
new_rows = list(map(str, range(4, 6))) # easier to extend than new_rows = ["4", "5"]
idx.extend(new_rows)
df = df.reindex(index=idx)
Set the values using .loc:
df.loc[new_rows, "a"] = [40, 50]
giving you
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
Example data
>>> data = pd.DataFrame({
'a': [10, 6, -3, -2, 4, 12, 3, 3],
'b': [6, -3, 6, 12, 8, 11, -5, -5],
'id': [1, 1, 1, 1, 6, 2, 2, 4]})
Case 1: Note that the range can be changed to whatever you need.
>>> for i in range(10):
... data.loc[i, 'a'] = 30
...
>>> data
a b id
0 30.0 6.0 1.0
1 30.0 -3.0 1.0
2 30.0 6.0 1.0
3 30.0 12.0 1.0
4 30.0 8.0 6.0
5 30.0 11.0 2.0
6 30.0 -5.0 2.0
7 30.0 -5.0 4.0
8 30.0 NaN NaN
9 30.0 NaN NaN
Case 2: Here we are adding a new column to a data frame that had 8 rows to begin with. As we extend our new column c to be of length 10, the other columns are extended with NaN.
>>> for i in range(10):
... data.loc[i, 'c'] = 30
...
>>> data
a b id c
0 10.0 6.0 1.0 30.0
1 6.0 -3.0 1.0 30.0
2 -3.0 6.0 1.0 30.0
3 -2.0 12.0 1.0 30.0
4 4.0 8.0 6.0 30.0
5 12.0 11.0 2.0 30.0
6 3.0 -5.0 2.0 30.0
7 3.0 -5.0 4.0 30.0
8 NaN NaN NaN 30.0
9 NaN NaN NaN 30.0
Also somewhat late, but my solution was similar to the accepted one:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index=[1,2])
# single index assignment always works
df.loc[3, 'a'] = 30
# multiple indices
new_rows = [4,5]
# there should be a nicer way to add more than one index/row at once,
# but at least this is just one extra line:
df = df.reindex(index=df.index.append(pd.Index(new_rows))) # note: Index.append() doesn't accept non-Index iterables?
# multiple new rows now works:
df.loc[new_rows, "a"] = [40, 50]
print(df)
... which yields:
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
This also works now (useful when performance on aggregating dataframes matters):
# inserting whole rows:
df.loc[new_rows] = [[41, 51], [61,71]]
print(df)
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 41.0 51.0
5 61.0 71.0
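The reindex step can be wrapped in a small helper. This is only a sketch: ensure_rows is a hypothetical name, and it uses Index.union instead of Index.append so that labels already present in the index are not added a second time.

```python
import pandas as pd

def ensure_rows(df, labels):
    """Hypothetical helper: return df reindexed so that every label in
    `labels` exists as a row; rows that are new are filled with NaN."""
    return df.reindex(df.index.union(pd.Index(labels), sort=False))

df = pd.DataFrame({'a': [10, 20], 'b': [100, 200]}, index=[1, 2])
df.loc[3, 'a'] = 30

new_rows = [4, 5]
df = ensure_rows(df, new_rows)     # adds rows 4 and 5
df.loc[new_rows, 'a'] = [40, 50]   # multi-row assignment now works
print(df)
```

Calling ensure_rows again with the same labels leaves the frame unchanged, which makes it safe to use before every multi-row assignment.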