Using IQR method to remove outliers does not change shape of data frame - pandas

I'm trying to remove outliers using the IQR method. However, the shape of my df remains the same.
Here is the code:
def IQR_outliers(df):
    Q1=df.quantile(0.25)
    Q3=df.quantile(0.75)
    IQR=Q3-Q1
    df=df[~((df<(Q1-1.5*IQR)) | (df>(Q3+1.5*IQR)))]
    return df
IQR_outliers(df['Distance'])
IQR_outliers(df['Price'])

Your function considers the whole object that is passed, but you're only passing a single series each time you use it. You're also not capturing the output. These issues stack on top of each other and make the problem look more complex than it is.
So here's what I would do:
add a column argument to your function
modify the function to only consider that column when selecting rows from the entire dataframe
pipe the dataframe to that function a couple of times
So that's:
def IQR_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    df = df.loc[lambda df: ~((df[column] < (Q1 - 1.5 * IQR)) | (df[column] > (Q3 + 1.5 * IQR)))]
    return df
revised_df = df.pipe(IQR_outliers, 'Distance').pipe(IQR_outliers, 'Price')
Note that the way you've demonstrated this, you'll very likely drop rows where Distance is an outlier even if Price is not. If you don't want to do that, you'll need to stack your dataframe, apply this function in a groupby operation, and then optionally unstack the dataframe.
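A minimal sketch of the behaviour described above, using a small made-up dataframe. The values, the helper name iqr_filter and the k parameter are illustrative assumptions, not part of the question. It shows that the filtered frame has to be captured from the return value, and that the original df is left untouched:

```python
import pandas as pd

def iqr_filter(df, column, k=1.5):
    """Return only the rows whose `column` value lies within the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    is_outlier = (df[column] < q1 - k * iqr) | (df[column] > q3 + k * iqr)
    return df.loc[~is_outlier]

# Hypothetical data: one Distance outlier (100) and one Price outlier (500)
df = pd.DataFrame({
    "Distance": [1, 2, 3, 4, 100],
    "Price": [10, 11, 12, 500, 13],
})

# The result must be assigned; calling the function alone changes nothing
filtered = df.pipe(iqr_filter, "Distance").pipe(iqr_filter, "Price")

print(df.shape)        # original frame keeps all rows
print(filtered.shape)  # rows with an outlier in either column are gone
```

Because the second call runs on the already-filtered frame, the Price quartiles are computed without the rows removed for Distance.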

Related

How to define a round function like pandas round that executes in one line of code

Goal
Only one line to execute.
I took the round function from this post, but I want to use it like df.round(2), which changes only the affected columns and keeps the order of the data without requiring me to select float or int columns first.
df.applymap(myfunction) raises TypeError: must be real number, not str, which means I have to select the numeric columns first.
Try
I looked at the round source code, but I could not understand how to change my function.
First, get the columns whose values are floats:
cols=df.select_dtypes('float').columns
Finally:
df[cols]=df[cols].agg(round,ndigits=2)
If you want to change the function itself, add an if/else condition:
from numpy import ceil, floor
def float_round(num, places=2, direction=ceil):
    if isinstance(num,float):
        return direction(num * (10 ** places)) / float(10 ** places)
    else:
        return num
out=df.applymap(float_round)
Given the error message you mention, it's likely the column holds strings and needs to be converted to a numeric type.
Assuming the column is numeric, there are a few ways to implement a custom rounding function without reimplementing the .round() method of a dataframe object.
With the requirements you laid out above, we want a way to round a dataframe that:
fits on one line
doesn't require selecting numeric type
There are two ways we could do this that are functionally equivalent. One is to treat the dataframe as an argument to a function that is safe for numpy arrays.
Another is to use the apply method (explanation here) which applies a function to a row or a column.
import pandas as pd
import numpy as np
from numpy import ceil
# generate a 100x10 dataframe with a null value
data = np.random.random(1000) * 10
data = data.reshape(100,10)
data[0, 0] = np.nan
df = pd.DataFrame(data)
# changing data type of the second column
df[1] = df[1].astype(int)
# verify dtypes are different
print(df.dtypes)
# taken from other stack post
def float_round(num, places=2, direction=ceil):
    return direction(num * (10 ** places)) / float(10 ** places)
# method 1 - use the dataframe as an argument
result1 = float_round(df)
print(result1.head())
# method 2 - apply
result2 = df.apply(float_round)
print(result2)
Because apply is applied row-wise or column-wise, you can add logic to your round function to ignore non-numeric columns. For instance:
# taken from other stack post
def float_round(num, places=2, direction=ceil):
    # check type of a specific column
    if num.dtype == 'O':
        return num
    return direction(num * (10 ** places)) / float(10 ** places)
# this will work, method 1 will fail
result2 = df.apply(float_round)
print(result2)
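As a rough sketch of the same idea with a stricter check, the function below only touches float columns and passes everything else (strings, integers) through unchanged. The name ceil_round and the sample data are hypothetical; pd.api.types.is_float_dtype is the pandas helper used for the dtype test:

```python
import numpy as np
import pandas as pd

def ceil_round(col, places=2):
    """Round a float column up to `places` decimals; return other columns as-is."""
    if not pd.api.types.is_float_dtype(col):
        return col
    factor = 10 ** places
    return np.ceil(col * factor) / factor

# Hypothetical mixed-type frame: float, string and integer columns
df = pd.DataFrame({"a": [1.234, 2.5], "b": ["x", "y"], "c": [1, 2]})

# One line, no manual dtype selection, column order preserved
out = df.apply(ceil_round)
print(out)
```

Checking for float dtype rather than for the object dtype also keeps integer columns from being converted to floats.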

Rolling apply lambda function based on condition

I have a dataframe with normalised (to 100) returns for 18 products (columns). I want to apply a lambda function which multiplies the next row by the previous row.
I can do :
df= df.rolling(2).apply(lambda x: (x[0]*x[1]),raw=True)
But some of my columns don't have values on row 1 (they go live on row 4). So I need to either:
Have a lambda function that starts only on row 4 yet applies to the entire df. I can create the first 4 rows manually.
As my values are 100 until "live", have the lambda function apply only when the value does not equal 100.
I have tried both:
1.
df.iloc[3:,:] = df.iloc[3:,:].rolling(2).apply(lambda x: (x[0]*x[1]),raw=True)
2.
df= df.rolling(2).apply(lambda x: (x[0]*x[1]) if x[0] != 100 else x,raw=True)
But both failed completely.
Any advice welcomed. I've spent hours looking through the site and have yet to find an approach that works for this situation.
Given the lack of responses, I came up with a solution where I split my df into 2 parts and appended them back together.
My lambda function was also wrong. I needed something like:
df2 = df.copy()
for i in range(df2.index.size):
    if not i:
        continue
    df2.iloc[i] = (df2.iloc[i - 1] * (df.iloc[i]))
df2
to actually achieve what I was after.
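The loop above multiplies each row by the running result of all previous rows, which appears to be equivalent to a cumulative product. A minimal sketch on made-up numbers (the data is an assumption, and this does not handle the "go live on row 4" case from the question), comparing the loop with DataFrame.cumprod:

```python
import pandas as pd

# Hypothetical two-column frame of multipliers
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 0.5, 4.0]})

# Loop version, as in the answer above
looped = df.copy()
for i in range(1, len(looped)):
    looped.iloc[i] = looped.iloc[i - 1] * df.iloc[i]

# Vectorised version: running product down each column
vectorised = df.cumprod()

print(looped)
print(vectorised)
```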

Find dates and difference between extreme observations

The function passed to apply must take a dataframe as its first argument and return a DataFrame, Series or scalar. apply will then take care of combining the results back together into a single dataframe or series. apply is therefore a highly flexible grouping method.
While apply is a very flexible method, its downside is that using it can be quite a bit slower than using more specific methods like agg or transform. Pandas offers a wide range of methods that will be much faster than apply for their specific purposes, so try to use them before reaching for apply.
The easiest approach is an aggregation with groupby followed by a selection:
# make index a column
df = df.reset_index()
# get min of holdings for each ticker
lowest = df[['ticker','holdings']].groupby('ticker').min()
print(lowest)
# select the lowest rows by performing a left join with the original
# this gives only the matching rows of df in return
lowest_dates = lowest.merge(df, on=['ticker','holdings'], how='left')
print(lowest_dates)
If you just want a series of Date values, you can use this function:
def getLowest(df):
    df = df.reset_index()
    lowest = df[['ticker','holdings']].groupby('ticker').min()
    lowest_dates = lowest.merge(df, on=['ticker','holdings'], how='left')
    return lowest_dates['Date']
In my view it would be better to return the entire dataframe, so you know which ticker was lowest when. In that case, end the function with:
    return lowest_dates
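An alternative sketch that skips the merge, assuming the same column names (Date, ticker, holdings) and a made-up dataset: groupby(...).idxmin() returns the index label of the minimum in each group, and .loc then pulls those full rows. Unlike the merge above, this keeps only the first row per ticker when the minimum is tied.

```python
import pandas as pd

# Hypothetical holdings data
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"]),
    "ticker": ["A", "A", "B", "B"],
    "holdings": [5, 3, 7, 9],
})

# Index label of the smallest holdings value within each ticker
lowest_idx = df.groupby("ticker")["holdings"].idxmin()

# Full rows (including Date) for those minima
lowest_rows = df.loc[lowest_idx]
print(lowest_rows)
```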

How to use Pandas vector methods based on rolling custom function that involves entire row and prior data

While it's easy to use the pandas rolling method to apply standard formulas, I find it hard when the calculation involves multiple columns and a limited number of past rows. The following code illustrates the problem:
import numpy as np
import pandas as pd
#create dummy pandas
df=pd.DataFrame({'col1':np.arange(0,25),'col2':np.arange(100,125),'col3':np.nan})
def func1(shortdf):
    #dummy formula
    #add the last row of col1 to the sum of col2, then multiply by 3.14
    return (shortdf.col1.tail(1).values[0]+shortdf.col2.sum())*3.14
for idx, i in df.iterrows():
    if idx>3:
        #only interested in the 3 rows before the current position in the dataframe
        df.loc[idx,'col3']=func1(df.iloc[idx-3:idx])
I currently use this iterrows approach, which needless to say is extremely slow. Does anyone have a better suggestion?
Option 1
So shift is the solution here. You do have to use rolling for the summation, and then shift that series after the addition and multiplication.
df = pd.DataFrame({'col1':np.arange(0,25),'col2':np.arange(100,125),'col3':np.nan})
ans = ((df['col1'] + df['col2'].rolling(3).sum()) * 3.14).shift(1)
You can check to see that ans is the same as df['col3'] by using ans.eq(df['col3']). Once you see that all but the first few are the same, just change ans to df['col3'] and you should be all set.
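The check described above could look roughly like this. The loop is a compact restatement of the question's iterrows code (an assumption about its intent, with func1 inlined), and np.isclose is used instead of exact equality to stay safe with floats:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": np.arange(0, 25),
                   "col2": np.arange(100, 125),
                   "col3": np.nan})

# Reference result: the question's loop over the three preceding rows
for idx in range(4, len(df)):
    window = df.iloc[idx - 3:idx]
    df.loc[idx, "col3"] = (window["col1"].iloc[-1] + window["col2"].sum()) * 3.14

# Vectorised result from Option 1
ans = ((df["col1"] + df["col2"].rolling(3).sum()) * 3.14).shift(1)

# The loop starts at index 4, so compare from there onwards
matches = np.isclose(ans.iloc[4:], df["col3"].iloc[4:]).all()
print(matches)
```

The vectorised version already produces a value at index 3, one row earlier than the loop, which is why only the first few rows differ.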
Option 2
Without additional information about the customized weight function, it is hard to help. However, this option may be a solution as it separates the rolling calculation at the cost of using more memory.
# df['col3'] = ((df['col1'] + df['col2'].rolling(3).sum()) * 3.14).shift(1)
s = df['col2']
stride = pd.DataFrame([s.shift(x).values[::-1][:3] for x in range(len(s))[::-1]])
res = pd.concat([df, stride], axis=1)
# here you can perform your custom weight function
res['final'] = ((res[0] + res[1] + res[2] + res['col1']) * 3.14).shift(1)
stride is adapted from this question, and its columns are concatenated alongside the original dataframe (axis=1). In this way each row has the values needed to compute whatever it is you may need.
res['final'] is identical to option 1's ans

Sample Pandas dataframe based on values in column

I have a large dataframe that I want to sample based on values on the target column value, which is binary : 0/1
I want to extract equal number of rows that have 0's and 1's in the "target" column. I was thinking of using the pandas sampling function but not sure how to declare the equal number of samples I want from both classes for the dataframe based on the target column.
I was thinking of using something like this:
df.sample(n=10000, weights='target', random_state=1)
Not sure how to edit it to get 10k records with 5k 1's and 5k 0's in the target column. Any help is appreciated!
You can group the data by target and then sample:
df = pd.DataFrame({'col':np.random.randn(12000), 'target':np.random.randint(low = 0, high = 2, size=12000)})
new_df = df.groupby('target').apply(lambda x: x.sample(n=5000)).reset_index(drop = True)
new_df.target.value_counts()
1 5000
0 5000
Edit: Use DataFrameGroupBy.sample
You get similar results by calling sample directly on the groupby object (available from pandas 1.1):
new_df = df.groupby('target').sample(n=5000)
You can use the DataFrameGroupBy.sample method as follows:
sample_df = df.groupby("target").sample(n=5000, random_state=1)
Also found this to be a good method:
df['weights'] = np.where(df['target'] == 1, .5, .5)
sample_df = df.sample(frac=.1, random_state=111, weights='weights')
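Change the value of frac depending on the percent of data you want back from the original dataframe. Note that giving both classes the same weight is equivalent to unweighted sampling, so this only returns roughly balanced classes if the original data is already balanced.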
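If the classes are imbalanced, one possible variation is to weight each row by the inverse of its class frequency. The sketch below uses a made-up 9000/3000 split; the balance it produces is only approximate, because the weights steer the draw in expectation rather than fixing exact counts:

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced data: 9000 zeros and 3000 ones
df = pd.DataFrame({
    "col": np.arange(12000),
    "target": np.repeat([0, 1], [9000, 3000]),
})

# Each row gets weight 1 / (size of its class), so both classes
# carry the same total weight
class_counts = df["target"].map(df["target"].value_counts())
weights = 1.0 / class_counts

sample_df = df.sample(frac=0.1, random_state=111, weights=weights)
share_of_ones = sample_df["target"].mean()
print(len(sample_df), share_of_ones)
```

For exact 5000/5000 counts, sampling within each group is the more direct route.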
You will have to run df0.sample(n=5000) and df1.sample(n=5000) and then combine df0 and df1 into a dfsample dataframe. You can create df0 and df1 by filtering df on the target column, for example with a boolean mask (df.filter() selects by label, not by value, so it is not the right tool here). If you provide sample data I can help you construct that logic.
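A minimal sketch of that manual approach, assuming a binary target column and made-up data. The names df0, df1 and dfsample follow the answer above; the final shuffle is an optional extra:

```python
import numpy as np
import pandas as pd

# Hypothetical data with more zeros than ones
df = pd.DataFrame({
    "col": np.arange(12000),
    "target": np.repeat([0, 1], [7000, 5000]),
})

# Split by class with boolean masks, then sample each part separately
df0 = df[df["target"] == 0].sample(n=5000, random_state=1)
df1 = df[df["target"] == 1].sample(n=5000, random_state=1)

# Combine and shuffle so the classes are interleaved
dfsample = (
    pd.concat([df0, df1])
    .sample(frac=1, random_state=1)
    .reset_index(drop=True)
)
print(dfsample["target"].value_counts())
```

Each class needs at least 5000 rows for this to work without replace=True.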