Insert data from one function into another - pandas

I feel like I am almost done with my project, but I don't know how to pass data from one function to another.
I tried calling the function first, but until the tkinter window appears and the user clicks on the drop-down menu, the function has no argument to work with.
I'll paste in my code and answer questions later :)
import pandas as pd
import matplotlib.pyplot as plt
from tkinter import *
#read data
excel = 'new_export.xlsx'
data = pd.read_excel(excel, parse_dates=['Closed Date Time'])
df = pd.DataFrame(data)
#Format / delete time from date column
data['Closed Date Time'] = pd.to_datetime(data['Closed Date Time'])
df['Close_Date'] = data['Closed Date Time'].dt.date
df['Close_Date'] = pd.to_datetime(df['Close_Date'])
df['Year_Month'] = df['Close_Date'].dt.strftime('%Y-%m')
#--------------------GUI ------------------
root = Tk()
root.title("Graph by Team")
root.geometry('400x200')
# --------------------FUNCTIONS---------------------------- #
#----------GRAPH--------------
def average(clicked):
    # check what team to look for
    choice = df.groupby(clicked)
    #choice = [tm for tm in df['Owned By Team'] if df['Owned By Team'] == clicked]
    # count number of tickets in month
    months = choice.groupby('Year_Month').size()
    # check if owner already exist in choice
    worked_on = set(df['Owned By'])
    count = len(worked_on)
    #calculate average
    calculate = months / count
    return calculate
def graph(average):
    x = average()
    plt.style.use("fivethirtyeight")
    # cant plot a function
    plt.hist(x)
    plt.title('Average by users')
    plt.xlabel('Year & Month')
    # plt.ylabel('Average of tickets')
    plt.show()
# Drop Down Box
team = set(df['Owned By Team'])
clicked = StringVar()
drop = OptionMenu(root, clicked, *team)
drop.pack()
newGraph = Button(root, text='Show graph', command=graph)
newGraph.pack()
root.mainloop()
EDIT
Sooo... I went step by step through the code and found one significant problem at the moment.
I have 'choice', 'months' and 'worked_on':
'choice' --> this works. It filters from the Excel file all rows that contain e.g. "IT Service Desk".
'months' --> this shows tickets done per month within choice.
'worked_on' --> now this is the problem. I need to count, for each month, how many users were working on the tickets that were filtered by the keyword, e.g. "IT Service Desk".
It needs to be able to differentiate which month it was and then compute the average for each month in the last step.
Any idea ???
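For reference, a minimal sketch of that per-month user count, assuming the column names from the code above ('Owned By Team', 'Owned By', 'Year_Month') and a hard-coded team just for illustration:
# filter rows for the selected team, then count distinct users per month
team_df = df[df['Owned By Team'] == 'IT Service Desk']
tickets_per_month = team_df.groupby('Year_Month').size()
users_per_month = team_df.groupby('Year_Month')['Owned By'].nunique()
average_per_month = tickets_per_month / users_per_month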

Since you are declaring df and your GUI elements outside of the scope of the functions, you don't even need to pass arguments. (Note: it would be cleaner to wrap the whole thing in a class, though; a rough sketch of that follows the code below.)
# beginning of your code...
#--------------------GUI ------------------
root = Tk()
root.title("Graph by Team")
root.geometry('400x200')
team = set(df['Owned By Team'])
clicked = StringVar()
drop = OptionMenu(root, clicked, *team)
drop.pack()
# --------------------FUNCTIONS---------------------------- #
#----------GRAPH--------------
def average():
    # filter rows for the team selected in the drop-down
    choice = df[df['Owned By Team'] == clicked.get()]
    # count number of tickets in each month
    months = choice.groupby('Year_Month').size()
    # count the distinct owners within the filtered team
    worked_on = set(choice['Owned By'])
    count = len(worked_on)
    # calculate average
    calculate = months / count
    return calculate
def graph():
    x = average()
    plt.style.use("fivethirtyeight")
    plt.hist(x)
    plt.title('Average by users')
    plt.xlabel('Year & Month')
    # plt.ylabel('Average of tickets')
    plt.show()
newGraph = Button(root, text='Show graph', command=graph)
newGraph.pack()
root.mainloop()
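For completeness, a rough sketch of the class-based layout mentioned above - same assumptions about the column names, not a drop-in replacement:
class GraphApp:
    def __init__(self, df):
        self.df = df
        self.root = Tk()
        self.root.title("Graph by Team")
        self.root.geometry('400x200')
        self.clicked = StringVar()
        OptionMenu(self.root, self.clicked, *set(df['Owned By Team'])).pack()
        Button(self.root, text='Show graph', command=self.graph).pack()
    def average(self):
        # average tickets per user, per month, for the selected team
        choice = self.df[self.df['Owned By Team'] == self.clicked.get()]
        return choice.groupby('Year_Month').size() / choice['Owned By'].nunique()
    def graph(self):
        plt.hist(self.average())
        plt.title('Average by users')
        plt.show()
app = GraphApp(df)
app.root.mainloop()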

Related

Using openpyxl and scipy to solve non-linear system related to resistor network

I'm trying to do the right thing for my career by learning how to really use the data science tools and automate Excel.
There is a pretty basic resistor circuit called a Pi attenuator.
On a particular component, two of the terminals are shorted together by design.
Because two of the elements are common (Terminal 3 shorts R1 and R2 - see spreadsheet screenshot), it isn't possible to just measure them with a meter and get the real value of each element; what is measured is a series/parallel combination.
I have measurement data from both before and after exposing the parts to a high-temperature oven.
The left side of the spreadsheet has the before-oven measurements (measured and converted).
I was able to use scipy fsolve to get the correct element values (as if R1 and R2 weren't common). This code absolutely works and gives the correct info.
The F[] equations came from solving for the series/parallel value seen when an ohmmeter is put across pins (1,3), (2,3), and (1,2). I know those are correct. As I understand it, the system is non-linear because the variables appear in the denominators.
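For reference, here is my reading of that system in equation form, with R1, R2, R3 the unknown elements and M1, M2, M3 the three meter readings (the constants 148.884, 148.853, 16.506 in the working example below):
$$\frac{1}{1/R_1 + 1/(R_2+R_3)} = M_1, \qquad \frac{1}{1/R_2 + 1/(R_1+R_3)} = M_2, \qquad \frac{1}{1/R_3 + 1/(R_1+R_2)} = M_3$$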
I used this function to manually enter the starting measurements and convert them to their real values.
The problem is that I need to do this hundreds of times for many test groups, and I will need to test other attenuator values. I don't want to have to re-run the script after typing in successive measurements if I can help it.
I'm able to get the data out of Excel and into numpy arrays, but when I pass the data to the fsolve function I get printed values that have gained a bunch of decimal places, along with warnings:
RuntimeWarning: overflow encountered in double_scalars
RuntimeWarning: divide by zero encountered in double_scalars
To me these are scary, insurmountable issues. The lines the warnings occur on might have changed, but they happen during the F[] = equation assignments.
I wanted to be able to provide the spreadsheet, but I'm told the link won't work.
The code I'm working with that uses openpyxl has been limited to 1 row of data for my sanity. If I can get the correct values, I'll tackle the next part after that.
I tried to lay out the basis of what I'm doing with screenshots and by providing as much detail as possible. I'm hoping to receive feedback about how to do what I've explained using numpy arrays, fsolve, and openpyxl.
In general (in my personal programming "experience") I have issues with making sure I've cast to the correct type. I wouldn't be surprised if that was part of my problem. I also have issues with "scope". I overthink everything, and the majority of my EE schooling used C or assembly.
It's embarrassing to say how long it took me to even get this far. I'm so ignorant that I don't even know what I don't know. I've put it down and picked it back up enough times that I'm starting to get emotional. I'm trying to fail forward here, but I need another set of eyes.
I tried:
-enforcing float64 (or float32) dtype to the np arrays
-creating other np arrays in the convert_measured function
-casting the parameters passed to convert_measured to separate variables within the function
-rounding the values from the cells because they seem to expand. I don't need more than 2 decimals of precision for this
import numpy as np
from scipy.optimize import fsolve
import openpyxl
wb = openpyxl.load_workbook("datafile.xlsx")
ws = wb['Group A']
""" Excel Data Cell Locations
Measured R1,R2,R3 Effective Starting Values
Group A,B,C,D worksheet -Cell Range I15 to K34
I = column 9
K = colum 11
"""
#Data limited to just one row here to debug
MeasColBegin = int(9)
MeasColEnd = int(11)
MeasRowBegin = int(15) # Measured values start in row 15
MeasRowEnd = int(15) # Row 34 is the real end. If this is set to 15 then it's only looking at one row
rows_input = ws.iter_rows(min_row=MeasRowBegin,max_row=MeasRowEnd, min_col=MeasColBegin,max_col=MeasColEnd)
"""
Calculated R1,R2,R3 Actual Finished Values
Group A,B,C,D worksheet -Cell Range I39 to K58
I = column 9
K = colum 11
"""
#These aren't being used yet
CalcColBegin = int(9)
CalcColEnd = int(11)
CalcRowBegin = int(39)
CalcRowEnd = int(58)
#row iterator for actual/calculated values
#This isn't being used yet later
rows_output = ws.iter_rows(min_row=CalcRowBegin,max_row=CalcRowEnd, min_col=CalcColBegin,max_col=CalcColEnd)
#This is called by fsolve
#This works when I don't use data passed from the excel spreadsheet
def convert_measured(z): #Maybe I need to cast z to a different type/size...?
    F = np.empty((3))
    F.astype('float64')
    #print("z datatypes ", z[0].dtype, z[1].dtype, z[2].dtype)
    # IDK why this prints so many decimal places, (or why it prints them so many times)
    # I assume it has to do with how the optimizer works
    #print("z[]= ", z[0], z[1], z[2])
    # Same assignments as simpler version of code that provides correct answers
    #x = z[0]
    #y = z[1]
    #w = z[2]
    x = z[0]
    y = z[1]
    w = z[2]
    #print("x=",x," y=",y," w=",w)
    #This is certainly wrong
    F[0] = 1/(1/z[0]+1/(z[1]+z[2]))-z[0]
    F[1] = 1/(1/z[1]+1/(z[0]+z[2]))-z[1]
    F[2] = 1/(1/z[2]+1/(z[0]+z[1]))-z[2]
    #I tried thinking that rounding might help. Nope.
    #F[0] = 1/(1/x+1/(y+w))-np.around(z[0],4)
    #F[1] = 1/(1/y+1/(x+w))-np.around(z[1],4)
    #F[2] = 1/(1/w+1/(x+y))-np.around(z[2],4)
    #Original code from example that works when I enter the numbers
    #F[0] = 1/(1/x+1/(y+w))-148.884
    #F[1] = 1/(1/y+1/(x+w))-148.853
    #F[2] = 1/(1/w+1/(x+y))-16.506
    #Bottom line is that one of my problems is that I don't know how to represent F[0,1,2] in terms of z
    return F
def main():
    # numpy array where measured values to be converted will go
    zGuess = np.array([1,1,1])
    zGuess = zGuess.astype('float64')
    # numpy array where converted solution/values will go
    zSol = np.array([1,1,1])
    zSol = zSol.astype('float64')
    # For loop used to iterate through rows and extract measurements
    # These are passed to the convert_measured function
    for row in rows_input:
        print("row[]= ", row[0].value, row[1].value, row[2].value) #print out values to check it's the right data
        #store values into np array that will be sent to convert_measured
        zGuess[0]=row[0].value
        zGuess[1]=row[1].value
        zGuess[2]=row[2].value
        print("zGuess[]=", zGuess[0], zGuess[1], zGuess[2]) #print again to make sure because I had problems with dtype
        # Solve for measurements / Convert to actual values
        zSol = fsolve(convert_measured, zGuess)
        #This should print the true values of the elements as if no shunt between R1 and R2 exists
        print("zSol= ",zSol[0], zSol[1], zSol[2])
        #Todo: store correct solutions into array and write to spreadsheet
if __name__ == "__main__":
    main()
I've made some changes to your code and ran it on your manually calculated data. The result looks to be the same apart from a few rounding differences.
Note that 'rows_input' currently points to the range C15:E34 in the code sample below (the other rows_input line is commented out).
The main change was to the line that calls the convert function,
zSol = fsolve(convert_measured, zGuess)
so that it passes the measured values through fsolve's args parameter to convert_measured (with zSol as the initial guess) and rounds the array values to 2 decimal places:
zSol = np.round(fsolve(convert_measured, zSol, zGuess), 2)
The convert_measured function was also changed to accept the inputs for the conversions.
Changed code sample below (I commented out the print statements except for the zSol values):
import numpy as np
from scipy.optimize import fsolve
import openpyxl
wb = openpyxl.load_workbook("datafile.xlsx")
ws = wb['Group A']
""" Excel Data Cell Locations
Measured R1,R2,R3 Effective Starting Values
Group A,B,C,D worksheet -Cell Range I15 to K34
I = column 9
K = column 11
"""
# Data limited to just one row here to debug
MeasColBegin = int(9)
MeasColEnd = int(11)
MeasRowBegin = int(15) # Measured values start in row 15
MeasRowEnd = int(15) # Row 34 is the real end. If this is set to 15 then it's only looking at one row
# rows_input = ws.iter_rows(min_row=MeasRowBegin, max_row=MeasRowEnd, min_col=MeasColBegin, max_col=MeasColEnd)
rows_input = ws.iter_rows(min_row=15, max_row=34, min_col=3, max_col=5)
"""
Calculated R1,R2,R3 Actual Finished Values
Group A,B,C,D worksheet -Cell Range I39 to K58
I = column 9
K = column 11
"""
# These aren't being used yet
CalcColBegin = int(9)
CalcColEnd = int(11)
CalcRowBegin = int(39)
CalcRowEnd = int(58)
# row iterator for actual/calculated values
# This isn't being used yet later
rows_output = ws.iter_rows(min_row=CalcRowBegin, max_row=CalcRowEnd, min_col=CalcColBegin, max_col=CalcColEnd)
def convert_measured(z, xl):
    F = np.empty((3))
    # F.astype('float64')
    x = z[0]
    y = z[1]
    w = z[2]
    # Original code from example that works when I enter the numbers
    F[0] = 1/(1/x+1/(y+w))-xl[0]
    F[1] = 1/(1/y+1/(x+w))-xl[1]
    F[2] = 1/(1/w+1/(x+y))-xl[2]
    return F
def main():
    # numpy array where measured values to be converted will go
    zGuess = np.array([1, 1, 1])
    zGuess = zGuess.astype('float64')
    # numpy array where converted solution/values will go
    zSol = np.array([1, 1, 1])
    zSol = zSol.astype('float64')
    # For loop used to iterate through rows and extract measurements
    # These are passed to the convert_measured function
    for row in rows_input:
        #print("row[]= ", row[0].value, row[1].value, row[2].value) # print out values to check it's the right data
        # store values into np array that will be sent to convert_measured
        zGuess[0] = row[0].value
        zGuess[1] = row[1].value
        zGuess[2] = row[2].value
        #print("zGuess[]=", zGuess[0], zGuess[1], zGuess[2]) # print again to make sure because I had problems with dtype
        # Solve for measurements / Convert to actual values
        # zSol = fsolve(convert_measured(zSol), zGuess)
        zSol = np.round(fsolve(convert_measured, zSol, zGuess), 2)
        # This should print the true values of the elements as if no shunt between R1 and R2 exists
        print("zSol= ", zSol[0], zSol[1], zSol[2])
        # Todo: store correct solutions into array and write to spreadsheet
if __name__ == "__main__":
    main()
Output looks like this
zSol= 290.03 288.94 16.99
zSol= 283.68 294.84 16.97
zSol= 280.83 297.43 17.07
zSol= 292.67 286.63 16.99
zSol= 277.51 301.04 16.93
zSol= 294.98 284.66 16.95
...
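To finish the Todo (writing the solutions back into the spreadsheet), a possible sketch, assuming the output block lines up row-for-row with the input block; note that fresh iterators are created here because iter_rows generators are exhausted after one pass:
# solve each input row and write the rounded solution into the output block
rows_in = ws.iter_rows(min_row=15, max_row=34, min_col=3, max_col=5)
rows_out = ws.iter_rows(min_row=39, max_row=58, min_col=9, max_col=11)
for row_in, row_out in zip(rows_in, rows_out):
    measured = np.array([c.value for c in row_in], dtype='float64')
    solution = np.round(fsolve(convert_measured, np.ones(3), args=(measured,)), 2)
    for cell, value in zip(row_out, solution):
        cell.value = float(value)
wb.save("datafile.xlsx")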

Unsure of Control flow in Pandas

I've been working on a Pandas project in Python and am a bit confused about how to accomplish a condition in Pandas.
The code below shows how I propose to calculate business minutes (Bus_Mins) and calendar minutes (Cal_Mins) between a close_date and an open_date. It works great except when close_date has not yet been recorded, i.e. when it is null.
I'm thinking I can use control logic something like the following, except I know the logic is not sound. Is there a way to do what I'd like to do, but correctly?
if close_date:
    df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], x['Close_Date']), axis=1)
    df_incident['Cal_Mins'] = (df_incident['Close_Date'] - df_incident['Open_Date']).dt.total_seconds()/60
elif:
    now = dt.now(timezone.utc)
    df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], now), axis=1)
    df_incident['Cal_Mins'] = (now - df_incident['Open_Date']).dt.total_seconds()/60
# get current utc time
now = dt.now(timezone.utc)
# set start and stop times of business day
#Specify Business Working hours (7am - 5pm)
start_time = dt.time(7,00,0)
end_time = dt.time(17,0,0)
us_holidays = pyholidays.US()
unit='min'
# Create a partial function as a shortcut
bduration = partial(bd.businessDuration, starttime=start_time, endtime=end_time, holidaylist=us_holidays, unit=unit)
df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], x['Close_Date']), axis=1)
df_incident['Cal_Mins'] = (df_incident['Close_Date'] - df_incident['Open_Date']).dt.total_seconds()/60
Have I presented my need clearly? Is it possible to do?
Thanks,
Jeff
As written, your code will not work because you call bduration() before defining it. Also, you assign to Bus_Mins and Cal_Mins twice: once inside the condition and again unconditionally at the end, and that second assignment will fail when close date is null. Finally, it is a syntax error to have an elif without a condition, so else: should be used instead. Something like the following might work:
# set start and stop times of business day
#Specify Business Working hours (7am - 5pm)
start_time = dt.time(7,00,0)
end_time = dt.time(17,0,0)
us_holidays = pyholidays.US()
unit='min'
# Create a partial function as a shortcut
bduration = partial(bd.businessDuration, starttime=start_time, endtime=end_time, holidaylist=us_holidays, unit=unit)
if close_date:
    df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], x['Close_Date']), axis=1)
    df_incident['Cal_Mins'] = (df_incident['Close_Date'] - df_incident['Open_Date']).dt.total_seconds()/60
else:
    # get current utc time
    now = dt.now(timezone.utc)
    df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], now), axis=1)
    df_incident['Cal_Mins'] = (now - df_incident['Open_Date']).dt.total_seconds()/60
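If the dataframe mixes closed and still-open incidents, branching on a single close_date won't cover every row; a per-row sketch (my suggestion, with explicit imports, assuming Open_Date/Close_Date are tz-aware datetimes):
from datetime import datetime, timezone
import pandas as pd

# fill null close dates with the current time, then compute both columns in one pass
now = datetime.now(timezone.utc)
close_filled = df_incident['Close_Date'].fillna(now)
df_incident['Bus_Mins'] = df_incident.apply(
    lambda x: bduration(x['Open_Date'], x['Close_Date'] if pd.notnull(x['Close_Date']) else now),
    axis=1)
df_incident['Cal_Mins'] = (close_filled - df_incident['Open_Date']).dt.total_seconds()/60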

Working on multiple data frames with data for NBA players during the season, how can I modify all the dataframes at the same time?

I have a list of 16 dataframes that contain stats for each player in the NBA during the respective season. My end goal is to run unsupervised learning algorithms on the data frames. For example, I want to see if I can determine a player's position by their stats or if I can determine their total points during the season based on their stats.
What I would like to do is modify the list (df_list) of these dataframes, unless there's a better solution, instead of modifying each dataframe individually, to:
Change the datatype of the MP (minutes played) column from str to int.
Filter the dataframe so that there are only players with 1000 or more MP and no duplicate players (Rk).
(For instance, a player (Rk) can play for three teams in a season and have 200 MP, 300 MP, and 400 MP with each team. He'll have a row for each team plus a row called TOT which renders his MP as 900 (200+300+400), for a total of four rows in the dataframe. I only need the TOT row.)
Use simple algebra with various individual columns, for example: totaling the MP column and the PTS column and then dividing the sum of the PTS column by the sum of the MP column.
Or dividing the total of the PTS column by the len of the PTS column (see the sketch below).
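As a sketch of those last two items, assuming numeric MP and PTS columns in one season's dataframe:
# points per minute across the whole season
pts_per_minute = df['PTS'].sum() / df['MP'].sum()
# average points per player (same as df['PTS'].mean())
avg_pts = df['PTS'].sum() / len(df['PTS'])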
What I've done so far is this:
Import my libraries and create 16 dataframes using pd.read_html(url).
The first dataframes were created using two lines of code:
url = "https://www.basketball-reference.com/leagues/NBA_1997_totals.html"
ninetysix = pd.read_html(url)[0]
HOWEVER, the next four data frames had to be created using a few additional lines of code (I received an error that said "html5lib not found, please install it", so I installed both html5lib and requests). I say that to say... this distinction in how the DFs are created may have to be considered in a solution.
The code I used:
import requests
import uuid
url = 'https://www.basketball-reference.com/leagues/NBA_1998_totals.html'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
ninetyseven = pd.read_html(html)[0]
I tried this but it didn't do anything:
df_list = [
    eightyfour, eightyfive, eightysix, eightyseven,
    eightyeight, eightynine, ninety, ninetyone,
    ninetytwo, ninetyfour, ninetyfive,
    ninetysix, ninetyseven, ninetyeight, owe_one, owe_two
]
for df in df_list:
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
owe_two
============================UPDATE===================================
This code solves a portion of problem #2:
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
dd = pd.read_html(url)[0]
dd = dd[dd['Rk'].ne('Rk')]
dd['MP'] = dd['MP'].astype(int)
players_1000_rk_list = list(dd[dd['MP'] >= 1000]['Rk'])
players_dd = dd[dd['Rk'].isin(players_1000_rk_list)]
But it doesn't remove the duplicates.
==================== UPDATE 10/11/22 ================================
Let's say I take the rows with value "TOT" in the "Tm" column and create a new DF with them, removing those rows from the original data frame...
could I then compare the new DF with the original data frame and remove the names from the original data IF they match the names in the new data frame?
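A sketch of that idea, assuming the tables have a 'Player' column as on basketball-reference (keep each traded player's TOT row and drop his per-team rows):
# rows holding season totals for players who changed teams
tot_df = dd[dd['Tm'] == 'TOT']
# keep TOT rows, plus rows for players who have no TOT row at all
deduped = dd[(dd['Tm'] == 'TOT') | ~dd['Player'].isin(tot_df['Player'])]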
The problem is that the df you are working on in the loop is not the same df that is in df_list. You can solve this by saving the new df back to the list, overwriting the old df:
for i,df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
    df_list[i] = df
These two lines are probably wrong as well:
df = list(df[df['MP'] >= 1000]['Rk'])
df = df[df['Rk'].isin(df)]
Perhaps you want this:
for i,df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    #df = list(df[df['MP'] >= 1000]['Rk'])
    #df = df[df['Rk'].isin(df)]
    # just the rows where MP > 1000
    df_list[i] = df[df['MP'] >= 1000]

Need help getting the difference between a min and max date grouped by patient id using only Python

This is a script I am working on for homework for a Big Data class. I have the statistics I need worked out except for this last piece: I need to find the average, min, and max days between a given patient's first appointment and last appointment, using only Python. The libraries available to me are Numpy, Time, and Pandas, and I can import datetime and dateutil in the environment I am working in.
I have gotten as far as getting an output of Patient_id, timestamp amin, timestamp amax using:
alvRl = events.groupby(['patient_id']).agg({'timestamp' : [np.min, np.max]})
I have tried simply subtracting the timestamp amin output from timestamp amax, but I get an error. I have also tried relativedelta, but it also generates an error. This is what I have so far:
import time
import pandas as pd
import numpy as np
import datetime as dt
from dateutil import relativedelta as r
'''Given Data'''
events = pd.read_csv('../data/train/events.csv')
mortality = pd.read_csv('../train/mortality_events.csv')
'''Join both dataframes'''
events = events.join(mortality.set_index('patient_id'), on = 'patient_id', rsuffix = '_mortality')
'''use mortality dataframe to list all deceased patients and events dataframe to list all living patients'''
mortality = events.loc[events['label']==1]
events = events.loc[events['label']!=1]
'''changing data type from object to datetime'''
mortality['timestamp'] = pd.to_datetime(mortality['timestamp'], infer_datetime_format = True)
events['timestamp'] = pd.to_datetime(events['timestamp'], infer_datetime_format = True)
mortality['timestamp_mortality'] = pd.to_datetime(mortality['timestamp_mortality'], infer_datetime_format = True)
events['timestamp_mortality'] = pd.to_datetime(events['timestamp_mortality'], infer_datetime_format = True)
'''group by patient ids and find minimum and maximum event dates'''
alvRl = events.groupby(['patient_id']).agg({'timestamp' : [np.min, np.max]})
If it helps, I am able to get what I need in SQL with the following code, but this homework requires me to do it in Python.
SELECT e.patient_id,
       MIN(e.event_timestamp) as 'min date',
       MAX(e.event_timestamp) as 'max date',
       DATEDIFF(day, MIN(e.event_timestamp), MAX(e.event_timestamp)) as Delta
FROM Big_Data_Health_HW1.dbo.events e
LEFT JOIN Big_Data_Health_HW1.dbo.mortality_events m ON m.patient_id = e.patient_id
WHERE m.label is not null
GROUP BY e.patient_id
I get a "DataFrame object has no attribute 'relativedelta'" error when using:
alvRl['RecLen'] = alvRl.relativedelta(alvRl['(timestamp, amin)'],alvRl['(timestamp, amin)'])
Same error for date_range when I use:
alvRl['RecLen'] = alvRl.date_range(alvRl['(timestamp, amin'],alvRl['(timestamp, amin'])
I get a key error when using:
alvRl['RecLen'] = alvRl['(timestamp, amin)'] - alvRl['(timestamp, amin)']
I'm just not sure if there is a better way of getting that value.
You can subtract amin from amax but alvRl's columns are a MultiIndex. You have to access them like this:
alvRl[('timestamp', 'RecLen')] = (alvRl[('timestamp', 'amax')] - alvRl[('timestamp', 'amin')]) / pd.Timedelta(days=1)
Or simply drop the first level of the MultiIndex:
alvRl = alvRl.droplevel(0, axis=1)
alvRl['RecLen'] = (alvRl['amax'] - alvRl['amin']) / pd.Timedelta(days=1)
The error you have is because you renamed relativedelta to r in this line:
from dateutil import relativedelta as r
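From there, the summary the homework asks for (average, min, and max days across all patients) could be one more line; a sketch, assuming the droplevel variant above:
# mean/min/max record length in days across patients
print(alvRl['RecLen'].agg(['mean', 'min', 'max']))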

Filtering out outliers in Pandas dataframe with rolling median

I am trying to filter out some outliers from a scatter plot of GPS elevation displacements against dates.
I'm trying to use df.rolling to compute a median and standard deviation for each window, and then remove a point if it is more than 3 standard deviations from the rolling median.
However, I can't figure out a way to loop through the column and compare each value to the rolling median.
Here is the code I have so far
import pandas as pd
import numpy as np
def median_filter(df, window):
    cnt = 0
    median = df['b'].rolling(window).median()
    std = df['b'].rolling(window).std()
    for row in df.b:
        #compare each value to its median
df = pd.DataFrame(np.random.randint(0,100,size=(100,2)), columns = ['a', 'b'])
median_filter(df, 10)
How can I loop through and compare each point and remove it?
Just filter the dataframe:
df['median']= df['b'].rolling(window).median()
df['std'] = df['b'].rolling(window).std()
#filter setup
df = df[(df.b <= df['median']+3*df['std']) & (df.b >= df['median']-3*df['std'])]
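One caveat (my note, not part of the answer above): the first window-1 rows have NaN rolling median/std, so the comparison is False and those rows get dropped along with the outliers. If they should be kept, a sketch:
# treat rows without a full window as non-outliers
mask = (df.b <= df['median']+3*df['std']) & (df.b >= df['median']-3*df['std'])
df = df[mask | df['median'].isna()]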
There might well be a more pandastic way to do this - this is a bit of a hack, relying on a somewhat manual way of mapping the original df's index to each rolling window (I picked size 6). The records up to and including row 6 are associated with the first window; row 7 is the second window, and so on.
n = 100
df = pd.DataFrame(np.random.randint(0,n,size=(n,2)), columns = ['a','b'])
## set window size
window=6
std = 1 # I set it at just 1; with real data and larger windows, can be larger
## create df with rolling stats, upper and lower bounds
bounds = pd.DataFrame({'median': df['b'].rolling(window).median(),
                       'std': df['b'].rolling(window).std()})
bounds['upper'] = bounds['median'] + bounds['std']*std
bounds['lower'] = bounds['median'] - bounds['std']*std
## here, we set an identifier for each window which maps to the original df
## the first six rows are the first window; then each additional row is a new window
bounds['window_id']=np.append(np.zeros(window),np.arange(1,n-window+1))
## then we can assign the original 'b' value back to the bounds df
bounds['b']=df['b']
## and finally, keep only rows where b falls within the desired bounds
bounds.loc[bounds.eval("lower<b<upper")]
This is my take on creating a median filter:
def median_filter(num_std=3):
    def _median_filter(x):
        _median = np.median(x)
        _std = np.std(x)
        s = x[-1]
        return s if s >= _median - num_std * _std and s <= _median + num_std * _std else np.nan
    return _median_filter
df.y.rolling(window).apply(median_filter(num_std=3), raw=True)
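A possible usage sketch on the question's own dataframe (column 'b', window 10): apply the filter, then keep only the rows that survived. Note the first window-1 warm-up rows come back as NaN and are dropped too:
# outliers (and the warm-up rows) become NaN, then get filtered out
filtered = df['b'].rolling(10).apply(median_filter(num_std=3), raw=True)
df_clean = df[filtered.notna()]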