Pandas: replace outliers in all columns with NaN

I have a DataFrame with 3 columns, for example:
c1,c2,c3
10000,1,2
1,3,4
2,5,6
3,1,122
4,3,4
5,5,6
6,155,6
I want to replace the outliers in all the columns, i.e. the values that lie outside 2 sigma. Using the code below, I can create a DataFrame without the outlier rows:
df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 2).all(axis=1)]
c1,c2,c3
1,3,4
2,5,6
4,3,4
5,5,6
I can find the outliers for each column separately and replace them with NaN, but that would not be the best way, as the number of lines of code increases with the number of columns. There must be a better way of doing this. Maybe the boolean output from the above command could be used to replace True values with NaN.
Any suggestions, many thanks.

pandas
Use pd.DataFrame.mask
df.mask(df.sub(df.mean()).div(df.std()).abs().gt(2))
c1 c2 c3
0 NaN 1.0 2.0
1 1.0 3.0 4.0
2 2.0 5.0 6.0
3 3.0 1.0 NaN
4 4.0 3.0 4.0
5 5.0 5.0 6.0
6 6.0 NaN 6.0
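The chained expression can be unpacked step by step; here is a self-contained sketch using the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'c1': [10000, 1, 2, 3, 4, 5, 6],
                   'c2': [1, 3, 5, 1, 3, 5, 155],
                   'c3': [2, 4, 6, 122, 4, 6, 6]})

# Column-wise z-scores: subtract each column's mean, divide by its std
z = df.sub(df.mean()).div(df.std()).abs()

# mask() replaces values where the condition is True with NaN
out = df.mask(z.gt(2))
print(out)
```

Each column keeps its own mean and std, so a value is masked only when it is more than 2 standard deviations from its own column's mean.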
numpy
v = df.values
mask = np.abs((v - v.mean(0)) / v.std(0)) > 2
pd.DataFrame(np.where(mask, np.nan, v), df.index, df.columns)
c1 c2 c3
0 NaN 1.0 2.0
1 1.0 3.0 4.0
2 2.0 5.0 6.0
3 3.0 1.0 NaN
4 4.0 3.0 4.0
5 5.0 5.0 6.0
6 6.0 NaN 6.0

lb = df.quantile(0.01)
ub = df.quantile(0.99)
df_new = df[(df < ub) & (df > lb)]
df_new
I am using a quantile-based method to detect outliers. First it calculates a lower bound and an upper bound for each column using the quantile function (here the 1st and 99th percentiles). Then, keeping only the values that lie between the bounds, it returns a new df with outlier values replaced by NaN.
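Note that quantile(0.01) and quantile(0.99) are percentile caps rather than the classic IQR fences (Q1 − 1.5·IQR, Q3 + 1.5·IQR). A self-contained sketch on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'c1': [10000, 1, 2, 3, 4, 5, 6],
                   'c2': [1, 3, 5, 1, 3, 5, 155],
                   'c3': [2, 4, 6, 122, 4, 6, 6]})

lb = df.quantile(0.01)              # per-column 1st-percentile lower bound
ub = df.quantile(0.99)              # per-column 99th-percentile upper bound
df_new = df[(df < ub) & (df > lb)]  # out-of-bound values become NaN
print(df_new)
```

Because the comparisons are strict, values sitting at or below the interpolated bounds (e.g. the column minimum on a small sample) are dropped as well, so this is more aggressive than it may look on short columns.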


How to transform summary statistics count into integers in pandas

I'm running into the following issue.
I have pulled some summary statistics from a DataFrame using df.describe(). Now, I'm trying to convert the number of observations (the count) into an integer. I've used the following but it does not work:
summary_stats = df.describe()
summary_stats = summary_stats.round(2)
summary_stats.iloc[0] = summary_stats.iloc[0].astype(int)
Then, when I print out the summary statistics table, the number of observations is not an integer. Thanks a lot for your insights!
The problem is that floats and integers would share the same column, so the integers are upcast to floats.
A possible solution is to transpose, so that the counts form their own column with integer dtype:
d = {'A':[1,2,3,4,5], 'B':[2,2,2,2,2], 'C':[3,3,3,3,3]}
df = pd.DataFrame(data=d)
summary_stats = df.describe().T
summary_stats = summary_stats.round(2)
summary_stats['count'] = summary_stats['count'].astype(int)
print (summary_stats)
count mean std min 25% 50% 75% max
A 5 3.0 1.58 1.0 2.0 3.0 4.0 5.0
B 5 2.0 0.00 2.0 2.0 2.0 2.0 2.0
C 5 3.0 0.00 3.0 3.0 3.0 3.0 3.0
If you only need to display the values, here is a hack: convert the values to object dtype:
summary_stats = df.describe()
summary_stats = summary_stats.round(2).astype(object)
summary_stats.iloc[0] = summary_stats.iloc[0].astype(int)
print (summary_stats)
A B C
count 5 5 5
mean 3.0 2.0 3.0
std 1.58 0.0 0.0
min 1.0 2.0 3.0
25% 2.0 2.0 3.0
50% 3.0 2.0 3.0
75% 4.0 2.0 3.0
max 5.0 2.0 3.0

How to get the column with the max NaN values in a pandas df?

I can show the counts with df.isnull().sum() and get the max value with df.isnull().sum().max(),
but can someone tell me how to get the name of the column with the most NaNs?
Thank you all!
Use Series.idxmax with DataFrame.loc to select the column with the most missing values:
df.loc[:, df.isnull().sum().idxmax()]
If you need to select multiple columns that are tied for the maximum, compare the Series with its max value:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, np.nan, 5, np.nan, 4],
    'C': [7, 8, 9, np.nan, 2, np.nan],
    'D': [1, np.nan, 5, 7, 1, 0]
})
print (df)
A B C D
0 a 4.0 7.0 1.0
1 b 5.0 8.0 NaN
2 c NaN 9.0 5.0
3 d 5.0 NaN 7.0
4 e NaN 2.0 1.0
5 f 4.0 NaN 0.0
s = df.isnull().sum()
df = df.loc[:, s.eq(s.max())]
print (df)
B C
0 4.0 7.0
1 5.0 8.0
2 NaN 9.0
3 5.0 NaN
4 NaN 2.0
5 4.0 NaN
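Putting it together, a minimal runnable sketch with the sample data above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, np.nan, 5, np.nan, 4],
                   'C': [7, 8, 9, np.nan, 2, np.nan],
                   'D': [1, np.nan, 5, 7, 1, 0]})

counts = df.isnull().sum()         # NaN count per column: A=0, B=2, C=2, D=1
print(counts.idxmax())             # 'B' - the first column reaching the maximum
col = df.loc[:, counts.idxmax()]   # the column itself
```

Note that idxmax returns only the first winner on ties (here B and C both have 2 NaNs), which is why the answer's s.eq(s.max()) variant is needed to get all of them.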

Sum of NaNs to equal NaN (not zero)

I can add a TOTAL column to this DF using df['TOTAL'] = df.sum(axis=1), and it adds the row elements like this:
col1 col2 TOTAL
0 1.0 5.0 6.0
1 2.0 6.0 8.0
2 0.0 NaN 0.0
3 NaN NaN 0.0
However, I would like the total of the bottom row to be NaN, not zero, like this:
col1 col2 TOTAL
0 1.0 5.0 6.0
1 2.0 6.0 8.0
2 0.0 NaN 0.0
3 NaN NaN NaN
Is there a way I can achieve this in a performant way?
Add parameter min_count=1 to DataFrame.sum:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
df['TOTAL'] = df.sum(axis=1, min_count=1)
print (df)
col1 col2 TOTAL
0 1.0 5.0 6.0
1 2.0 6.0 8.0
2 0.0 NaN 0.0
3 NaN NaN NaN
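A self-contained sketch reproducing this with the data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 2.0, 0.0, np.nan],
                   'col2': [5.0, 6.0, np.nan, np.nan]})

# min_count=1 requires at least one non-NA value per row,
# so the all-NaN row sums to NaN instead of 0
df['TOTAL'] = df.sum(axis=1, min_count=1)
print(df)
```

The row with a single valid value (0.0, NaN) still sums to 0.0, since one non-NA value satisfies min_count=1.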

How to plot values from the DataFrame? Python 3.0

I'm trying to plot the values from the A column against the index (of the DataFrame table), but it doesn't allow me to. How do I do it?
INDEX is the index from the DataFrame and not the declared variable.
You need to plot column A only; the index is used for x and the values for y by default in Series.plot:
#line is default method, so omitted
Test['A'].plot(style='o')
Another solution is reset_index, to turn the index into a column, and then DataFrame.plot:
Test.reset_index().plot(x='index', y='A', style='o')
Sample:
Test=pd.DataFrame({'A':[3.0,4,5,10], 'B':[3.0,4,5,9]})
print (Test)
A B
0 3.0 3.0
1 4.0 4.0
2 5.0 5.0
3 10.0 9.0
Test['A'].plot(style='o')
print (Test.reset_index())
index A B
0 0 3.0 3.0
1 1 4.0 4.0
2 2 5.0 5.0
3 3 10.0 9.0
Test.reset_index().plot(x='index', y='A', style='o')

Pandas add new second level column to column multiindex based on other columns

I have a DataFrame with column multi-index:
System A B
Trial Exp1 Exp2 Exp1 Exp2
1 NaN 1 2 3
2 4 5 NaN NaN
3 6 NaN 7 8
It turns out that for each system (A, B) and each measurement (1, 2, 3 in the index), results from Exp1 are always superior to those from Exp2. So I want to generate a third column for each system, call it Final, that should take Exp1 whenever available and default to Exp2 otherwise. The desired result is
System A B
Trial Exp1 Exp2 Final Exp1 Exp2 Final
1 NaN 1 1 2 3 2
2 4 5 4 NaN NaN NaN
3 6 NaN 6 7 8 7
What is the best way to do this?
I've tried to use groupby on the columns:
grp = df.groupby(level=0, axis=1)
And I was thinking of using either transform or apply combined with assign to achieve it, but I am not able to find either a working or an efficient way of doing it. Specifically, I am avoiding native Python for loops for efficiency reasons (otherwise the problem is trivial).
Use stack to reshape, add the column with fillna, and then reshape back with unstack plus swaplevel and sort_index:
df = df.stack(level=0)
df['Final'] = df['Exp1'].fillna(df['Exp2'])
df = df.unstack().swaplevel(0,1,axis=1).sort_index(axis=1)
print (df)
System A B
Trial Exp1 Exp2 Final Exp1 Exp2 Final
1 NaN 1.0 1.0 2.0 3.0 2.0
2 4.0 5.0 4.0 NaN NaN NaN
3 6.0 NaN 6.0 7.0 8.0 7.0
Another solution uses xs to select sub-DataFrames and creates a new DataFrame with combine_first; the missing second column level is added with MultiIndex.from_product, and finally concat joins both DataFrames together:
a = df.xs('Exp1', axis=1, level=1)
b = df.xs('Exp2', axis=1, level=1)
df1 = a.combine_first(b)
df1.columns = pd.MultiIndex.from_product([df1.columns, ['Final']])
df = pd.concat([df, df1], axis=1).sort_index(axis=1)
print (df)
System A B
Trial Exp1 Exp2 Final Exp1 Exp2 Final
1 NaN 1.0 1.0 2.0 3.0 2.0
2 4.0 5.0 4.0 NaN NaN NaN
3 6.0 NaN 6.0 7.0 8.0 7.0
Similar solution with rename:
a = df.xs('Exp1', axis=1, level=1, drop_level=False)
b = df.xs('Exp2', axis=1, level=1, drop_level=False)
df1 = a.rename(columns={'Exp1':'Final'}).combine_first(b.rename(columns={'Exp2':'Final'}))
df = pd.concat([df, df1], axis=1).sort_index(axis=1)
print (df)
System A B
Trial Exp1 Exp2 Final Exp1 Exp2 Final
1 NaN 1.0 1.0 2.0 3.0 2.0
2 4.0 5.0 4.0 NaN NaN NaN
3 6.0 NaN 6.0 7.0 8.0 7.0
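A runnable end-to-end version of the xs/combine_first approach, with the question's sample data reconstructed (the column structure is assumed from the example):

```python
import numpy as np
import pandas as pd

# Reconstructed sample data from the question
cols = pd.MultiIndex.from_product([['A', 'B'], ['Exp1', 'Exp2']])
df = pd.DataFrame([[np.nan, 1, 2, 3],
                   [4, 5, np.nan, np.nan],
                   [6, np.nan, 7, 8]],
                  index=[1, 2, 3], columns=cols)

a = df.xs('Exp1', axis=1, level=1)   # Exp1 values per system
b = df.xs('Exp2', axis=1, level=1)   # Exp2 values as the fallback
df1 = a.combine_first(b)             # prefer Exp1, fall back to Exp2
df1.columns = pd.MultiIndex.from_product([df1.columns, ['Final']])
out = pd.concat([df, df1], axis=1).sort_index(axis=1)
print(out)
```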
stack the first level of the column index with stack(0), leaving ['Exp1', 'Exp2'] in the column index. Then use a lambda function that gets applied to the whole DataFrame within an assign call. Finally, unstack, swaplevel, and sort_index to clean it up and put everything where it belongs.
f = lambda x: x.Exp1.fillna(x.Exp2)
df.stack(0).assign(Final=f).unstack() \
  .swaplevel(0, 1, 1).sort_index(1)
A B
Exp1 Exp2 Final Exp1 Exp2 Final
1 NaN 1.0 1.0 2.0 3.0 2.0
2 4.0 5.0 4.0 NaN NaN NaN
3 6.0 NaN 6.0 7.0 8.0 7.0
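The stack/assign pipeline can be run end-to-end like this (sample data reconstructed from the question):

```python
import numpy as np
import pandas as pd

# Reconstructed sample data from the question
cols = pd.MultiIndex.from_product([['A', 'B'], ['Exp1', 'Exp2']])
df = pd.DataFrame([[np.nan, 1, 2, 3],
                   [4, 5, np.nan, np.nan],
                   [6, np.nan, 7, 8]],
                  index=[1, 2, 3], columns=cols)

# Move systems into the row index, compute Final, then reshape back
out = (df.stack(0)
         .assign(Final=lambda x: x['Exp1'].fillna(x['Exp2']))
         .unstack()
         .swaplevel(0, 1, axis=1)
         .sort_index(axis=1))
print(out)
```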
Another concept using xs
d1 = df.xs('Exp1', 1, 1).fillna(df.xs('Exp2', 1, 1))
d1.columns = [d1.columns, ['Final'] * len(d1.columns)]
pd.concat([df, d1], axis=1).sort_index(1)
A B
Exp1 Exp2 Final Exp1 Exp2 Final
1 NaN 1.0 1.0 2.0 3.0 2.0
2 4.0 5.0 4.0 NaN NaN NaN
3 6.0 NaN 6.0 7.0 8.0 7.0
Doesn't feel super optimal, but try this:
for system in df.columns.levels[0]:
    df[(system, 'Final')] = df[(system, 'Exp1')].fillna(df[(system, 'Exp2')])