Unsure of Control flow in Pandas - pandas

I've been working on a Pandas project in Python and am a bit confused about how to apply a condition in Pandas.
The code at the bottom shows how I propose to calculate business_minutes and calendar_minutes between an open_date and a close_date. It works well except when close_date has not yet been recorded, i.e. it is null.
I'm thinking I can use control logic something like the following, although I know the logic is not sound. Is there a correct way to do what I'd like to do?
if close_date:
    df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], x['Close_Date']), axis=1)
    df_incident['Cal_Mins'] = (df_incident['Close_Date'] - df_incident['Open_Date']).dt.total_seconds()/60
elif:
    now = dt.now(timezone.utc)
    df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], now), axis=1)
    df_incident['Cal_Mins'] = (now - df_incident['Open_Date']).dt.total_seconds()/60
# get current utc time
now = dt.now(timezone.utc)
# set start and stop times of business day
#Specify Business Working hours (7am - 5pm)
start_time = dt.time(7,00,0)
end_time = dt.time(17,0,0)
us_holidays = pyholidays.US()
unit='min'
# Create a partial function as a shortcut
bduration = partial(bd.businessDuration, starttime=start_time, endtime=end_time, holidaylist=us_holidays, unit=unit)
df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], x['Close_Date']), axis=1)
df_incident['Cal_Mins'] = (df_incident['Close_Date'] - df_incident['Open_Date']).dt.total_seconds()/60
Have I presented my need clearly? Is it possible to do?
Thanks,
Jeff

As written, your code will not work because you call bduration() before defining it. Also, you assign to Bus_Mins and Cal_Mins twice in the body of the else condition. The second assignment will probably not work because close date is null. It is a syntax error to have an elif without a condition, so else: should be used instead. Something like the following might work:
# set start and stop times of business day
#Specify Business Working hours (7am - 5pm)
start_time = dt.time(7,00,0)
end_time = dt.time(17,0,0)
us_holidays = pyholidays.US()
unit='min'
# Create a partial function as a shortcut
bduration = partial(bd.businessDuration, starttime=start_time, endtime=end_time, holidaylist=us_holidays, unit=unit)
if close_date:
    df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], x['Close_Date']), axis=1)
    df_incident['Cal_Mins'] = (df_incident['Close_Date'] - df_incident['Open_Date']).dt.total_seconds()/60
else:
    # get current utc time
    now = dt.now(timezone.utc)
    df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], now), axis=1)
    df_incident['Cal_Mins'] = (now - df_incident['Open_Date']).dt.total_seconds()/60
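One caveat: a plain if/else tests a single value, whereas Close_Date is a column in which some rows may be null and others not. A minimal sketch of a per-row alternative, assuming Open_Date and Close_Date are timezone-aware datetime columns: fill the missing close dates with the current time first, then compute the durations once for the whole column. The sample rows and the End_Date helper column are made up for illustration.

```python
import pandas as pd

# Current UTC time, used as a stand-in close date for incidents still open
now = pd.Timestamp.now(tz='UTC')

# Hypothetical sample data: the second incident has no close date yet
df_incident = pd.DataFrame({
    'Open_Date': pd.to_datetime(['2024-01-02 08:00', '2024-01-03 09:30'], utc=True),
    'Close_Date': pd.to_datetime(['2024-01-02 10:00', None], utc=True),
})

# Rows with a close date keep it; rows without one fall back to `now`
df_incident['End_Date'] = df_incident['Close_Date'].fillna(now)

# Calendar minutes, computed for every row in one vectorized step
df_incident['Cal_Mins'] = (
    df_incident['End_Date'] - df_incident['Open_Date']
).dt.total_seconds() / 60

# With bduration defined as in the question, business minutes could use
# the same helper column (not run here, since it needs that package):
# df_incident['Bus_Mins'] = df_incident.apply(
#     lambda x: bduration(x['Open_Date'], x['End_Date']), axis=1)
```

The original Close_Date column is left untouched, so open incidents can still be identified with df_incident['Close_Date'].isna().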

Related

insert data from another function into another

I feel like I am almost done with my project, but I don't know how to pass data from one function to another.
I tried calling the function first, but until the tkinter window appears and the user clicks on the drop-down menu, the function has no argument to work with.
I'll paste in my code and answer questions later :)
import pandas as pd
import matplotlib.pyplot as plt
from tkinter import *
#read data
excel = 'new_export.xlsx'
data = pd.read_excel(excel, parse_dates=['Closed Date Time'])
df = pd.DataFrame(data)
#Format / delete time from date column
data['Closed Date Time'] = pd.to_datetime(data['Closed Date Time'])
df['Close_Date'] = data['Closed Date Time'].dt.date
df['Close_Date'] = pd.to_datetime(data['Close_Date'])
df['Year_Month'] = df['Close_Date'].dt.strftime('%Y-%m')
#--------------------GUI ------------------
root = Tk()
root.title("Graph by Team")
root.geometry('400x200')
# --------------------FUNCTIONS---------------------------- #
#----------GRAPH--------------
def average(clicked):
    # check what team to look for
    choice = df.groupby(clicked)
    #choice = [tm for tm in df['Owned By Team'] if df['Owned By Team'] == clicked]
    # count number of tickets in month
    months = choice.groupby('Year_Month').size()
    # check if owner already exist in choice
    worked_on = set(df['Owned By'])
    count = len(worked_on)
    #calculate average
    calculate = months / count
    return calculate
def graph(average):
    x = average()
    plt.style.use("featherweight")
    # cant plot a function
    plt.hist(x)
    plt.title('Average by users ')
    plt.xlabel('Year & Month')
    # plt.ylabel('Average of tickets')
    plt.show()
# Drop Down Box
team = set(df['Owned By Team'])
clicked = StringVar()
drop = OptionMenu(root, clicked, *team)
drop.pack()
newGraph = Button(root, text='Show graph', command=graph)
newGraph.pack()
root.mainloop()
EDIT
Sooo... I went through the code step by step and found one significant problem at the moment.
I have 'choice', 'month' and 'worked_on'.
'choice' --> this works. It filters from the Excel file all rows that contain e.g. "IT Service Desk".
'month' --> this shows tickets done per month within choice.
'worked_on' --> now this is the problem. I need to count, for each month, how many users were working on the tickets that were filtered out by the keyword, e.g. "IT Service Desk".
It needs to differentiate by month and then calculate the average for each month in the last step.
Any ideas?
Since you are declaring df and your GUI elements outside the scope of the functions, you don't even need to pass arguments. (Note: it would be more correct to create a class for the whole thing, though.)
# beginning of your code...
#--------------------GUI ------------------
root = Tk()
root.title("Graph by Team")
root.geometry('400x200')
team = set(df['Owned By Team'])
clicked = StringVar()
drop = OptionMenu(root, clicked, *team)
drop.pack()
# --------------------FUNCTIONS---------------------------- #
#----------GRAPH--------------
def average():
    # check what team to look for
    choice = df.groupby(clicked.get())
    #choice = [tm for tm in df['Owned By Team'] if df['Owned By Team'] == clicked]
    # count number of tickets in month
    months = choice.groupby('Year_Month').size()
    # check if owner already exist in choice
    worked_on = set(df['Owned By'])
    count = len(worked_on)
    #calculate average
    calculate = months / count
    return calculate
def graph():
    x = average()
    plt.style.use("featherweight")
    # cant plot a function
    plt.hist(x)
    plt.title('Average by users ')
    plt.xlabel('Year & Month')
    # plt.ylabel('Average of tickets')
    plt.show()
newGraph = Button(root, text='Show graph', command=graph)
newGraph.pack()
root.mainloop()
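As for the EDIT (counting the users per month rather than over the whole file), here is a minimal sketch that leaves out the GUI. It assumes the column names from the question ('Owned By Team', 'Owned By', 'Year_Month'); the sample rows and the function name average_per_month are made up for illustration. The idea is to filter to the chosen team, then compute the ticket count and the number of distinct owners within each month in one groupby.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question
df = pd.DataFrame({
    'Owned By Team': ['IT Service Desk'] * 5 + ['Network'],
    'Owned By': ['ann', 'bob', 'ann', 'ann', 'cat', 'dan'],
    'Year_Month': ['2021-01', '2021-01', '2021-01',
                   '2021-02', '2021-02', '2021-01'],
})

def average_per_month(df, team):
    # Keep only the rows that belong to the selected team
    choice = df[df['Owned By Team'] == team]
    # Per month: number of tickets and number of distinct owners
    per_month = choice.groupby('Year_Month').agg(
        tickets=('Owned By', 'size'),
        owners=('Owned By', 'nunique'),
    )
    # Average tickets per owner, month by month
    return per_month['tickets'] / per_month['owners']

avg = average_per_month(df, 'IT Service Desk')
print(avg)
```

In the tkinter version the team argument would come from clicked.get(). Since the result is one value per month, a bar chart (for example avg.plot(kind='bar')) may suit it better than a histogram.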

How do I do an OLS or GLS time series regression in Python?

I am attempting to port my team's EViews code to Python, and I got stuck on the following line in EViews:
equation eq_LSTrend.ls(cov=hac) log({Price})=c(1) * #trend + c(2).
Here, a time-trend regression over a certain time window is performed on log(price), and the slope c(1) as well as the intercept c(2) have to be determined.
Let's say I have the following df:
import pandas as pd
Range = pd.date_range('1990-01-01', periods=8, freq='D')
log_price = [5.0835, 5.0906, 5.0946, 5.0916, 5.0825, 5.0833, 5.0782, 5.0709]
df = pd.DataFrame({ 'Date': Range, 'Log Price': log_price })
df.set_index('Date', inplace=True)
And the df looks like this:
Date Log Price
1990-01-01 5.0835
1990-01-02 5.0906
1990-01-03 5.0946
1990-01-04 5.0916
1990-01-05 5.0825
1990-01-06 5.0833
1990-01-07 5.0782
1990-01-08 5.0709
How could I, for example, take a rolling 5-period window, run an OLS or GLS regression on it, and get the wanted parameters (the slope and the intercept)?
Also, which library would be appropriate for it (statsmodels or maybe some other library)?
Ideally, the code would look something like this:
df_window = df.rolling(window = 5)
slope_output = sm.GLS(df_window).slope
or, if separate columns have to be provided as input (in this case I would leave "Date" as a separate column in df):
df_window = df.rolling(window = 5)
slope_output = sm.GLS(depend_var = df_window["Log Price"], independ_var = df_window["Date"]).slope
I am quite new to python so please pardon my bad coding!
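A minimal sketch of one way to get a rolling slope and intercept, using only NumPy and pandas. It assumes the regressor is a simple trend counter 0, 1, ..., 4 restarted in every window, which is my reading of the trend term and not something the EViews line confirms. Each window is fitted by ordinary least squares with np.polyfit; the names window, trend and rolling_fit are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1990-01-01', periods=8, freq='D')
log_price = [5.0835, 5.0906, 5.0946, 5.0916, 5.0825, 5.0833, 5.0782, 5.0709]
df = pd.DataFrame({'Log Price': log_price}, index=dates)

window = 5
# Assumed regressor: a trend counter 0..window-1, restarted per window
trend = np.arange(window)

rows = []
for end in range(window, len(df) + 1):
    # Dependent variable for the window that ends at row `end - 1`
    y = df['Log Price'].iloc[end - window:end].to_numpy()
    # A degree-1 least-squares fit returns (slope, intercept)
    slope, intercept = np.polyfit(trend, y, deg=1)
    rows.append({'Date': df.index[end - 1],
                 'slope': slope,
                 'intercept': intercept})

# One row per complete window, labelled by the window's last date
rolling_fit = pd.DataFrame(rows).set_index('Date')
print(rolling_fit)
```

On the library question: statsmodels also provides RollingOLS in statsmodels.regression.rolling, and its OLS fit accepts cov_type='HAC' (with a maxlags entry in cov_kwds). A HAC covariance changes the standard errors only, not the estimated slope and intercept, so the coefficients above should not depend on that option.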

Create and Merge Pandas Dataframes in loop

I need to read in a bunch of input dataframes based on some conditions, merge them, and finally create dataframes named 'merge_m0', 'merge_m1', 'merge_m2' and so on.
In the actual code, I need to read about 20 dataframes. But, for simplicity and ease of understanding, I'm creating 3 dataframes and using a for loop to read them and merge.
#INPUT: Sample input dataframes df0, df1 &df2
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
To do this, I'm using globals() to create dataframes in a loop and to merge them, but it's not working and throws a "'DataFrame' object has no attribute 'globals'" error.
#Code:
def comb_mths(x,y):
    globals()[f"m{x}"] = globals()[f'df{x}'][globals()[f'df{x}'].globals()[f'm{x}_val_mthd'].isin([1,25])]
    globals()[f"m{y}"] = globals()[f'df{y}'][(globals()[f'df{y}'].globals()[f'm{y}_val_mthd'].isin([8,10,11,12])) & (globals()[f'df{y}'].globals()[f'm{y}_orig_val_mthd'].isin([2,3,4,5]))]
    globals()[f"merge_m{x}"]=pd.merge(globals()[f"m{x}"],globals()[f"m{y}"], how='inner',on=['id'])
for i in range(0,3):
    comb_mths(i, i+1)
I've also tried the following in place of the first line of the above function:
#globals()[f"m{x}"] = globals()[f'df{x}'][globals()[f'df{x}'].m{x}_val_mthd.isin([1,25])]
#globals()[f"m{x}"] = globals()[f'df{x}']["[f'm{x}_val_mthd']"].isin([1,25])
I think there must be a better and easier alternative, and I'd appreciate it if anyone can help. Thanks!
Edit#
my updated post:
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
df_list=[]
for i in range(0,3):
    df_list.append(globals()[f'df{i}']) #I'm appending all the i/p dataframes which are created already by other step in the code and hope this works
def comb_mths(i):
    dfa = df_list[i]
    dfb = df_list[i+1]
    dfma = dfa[dfa.iloc[:, 1].isin([1,25])]
    dfmb = dfb[(dfb.iloc[:, 1].isin([8,10,11,12])) & (dfb.iloc[:, 3].isin([2,3,4,5]))]
    print(dfma)
    print(dfmb)
    print('\n'*3)
    globals()[f"merge_m{i}"] = dfma.merge(dfmb, how='inner', on=['id'])
    return globals()[f"merge_m{i}"]
for i in range(0,2):
    comb_mths(i)
print(merge_m0)
print(merge_m1)
In the above function, after creating the "merge_m{i}" dataframe, I need to check one more if-else condition and calculate a variable, say 'mths'.
The logic goes like this:
when i=0, I need to check "m1_orig_val_mthd"; when i=1, "m2_orig_val_mthd"; when i=2, "m3_orig_val_mthd"; and so on.
The pseudocode for that if-else condition is below. Can you please show me how to add this condition to the above function?
when i=0 1st iteration
if m1_orig_val_mthd isin (2,4,6):
    diff = (mydate - m1_appr_rcvd_dt)//(np.timedelta64(1,'M'))
    mths = diff - (i-1)
elif m1_orig_val_mthd isin (1,3,5):
    diff = (mydate - m1_bpo_rcvd_dt)//(np.timedelta64(1,'M'))
    mths = diff - (i-1)
when i=1 2nd iteration
if m2_orig_val_mthd isin (2,4,6):
    diff = (mydate - m2_appr_rcvd_dt)//(np.timedelta64(1,'M'))
    mths = diff - (i-1)
elif m2_orig_val_mthd isin (1,3,5):
    diff = (mydate - m2_bpo_rcvd_dt)//(np.timedelta64(1,'M'))
    mths = diff - (i-1)
and so on...
I took a different approach assuming you can create all the input dataframes first. If you can create your dataframes and put them in a list, it makes handling them easier and code easier to read.
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
# add your inputs to the list
df_list = [df0, df1, df2]
# only pass in i, then call dfa, dfb by position in the list
def comb_mths(i):
    dfa = df_list[i]
    dfb = df_list[i+1]
    # print(dfa)
    # print(dfb)
    # print('\n'*3)
    # I wasn't exactly sure what you wanted here, but I think the original issue was you were calling your new dataframe before it was created.
    dfma = dfa[dfa.iloc[:, 1].isin([1,25])] # as long as columns are in the same position, you don't need to call them by name, just position
    dfmb = dfb[(dfb.iloc[:, 1].isin([8,10,11,12])) & (dfb.iloc[:, 3].isin([2,3,4,5]))]
    print(dfma)
    print(dfmb)
    print('\n'*3)
    # creating new merged dataframes. cleaned this up too
    globals()[f"merge_m{i}"] = dfma.merge(dfmb, how='inner', on=['id'])
    return globals()[f"merge_m{i}"] #added return statement
for i in range(0,2): # watch range end or you'll get an error
    comb_mths(i)
print(merge_m0)
print(merge_m1)
Additional code:
# to populate the df_list, do this
# you aren't actually naming them, I only did that in example above due to your Example
# when you call them, you are calling the position in the list
df_list = []
for i in range(0,20):
    df = 'do your code here'
    df_list.append(df)
# print the df to verify they are created
for df in df_list:
    print(df)
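If you would rather avoid globals() altogether, here is a minimal sketch of a variant that keeps the merged results in a dictionary and builds the column names with f-strings, so it does not rely on column positions. It reuses the three sample dataframes from the question; the names frames and merged are illustrative, and comb_mths here takes the list as an explicit argument.

```python
import pandas as pd

df0 = pd.DataFrame({'id': [100, 101, 102, 103], 'm0_val_mthd': [1, 8, 25, 41],
                    'name': ['AAA', 'BBB', 'CCC', 'DDD'], 'm0_orig_val_mthd': [2, 3, 4, 5]})
df1 = pd.DataFrame({'id': [100, 104, 102, 103], 'm1_val_mthd': [1, 8, 10, 25],
                    'name': ['EEE', 'FFF', 'GGG', 'HHH'], 'm1_orig_val_mthd': [2, 3, 4, 5]})
df2 = pd.DataFrame({'id': [100, 104, 102, 103], 'm2_val_mthd': [1, 8, 10, 25],
                    'name': ['III', 'JJJ', 'KKK', 'LLL'], 'm2_orig_val_mthd': [2, 3, 4, 5]})
frames = [df0, df1, df2]

def comb_mths(frames, i):
    dfa, dfb = frames[i], frames[i + 1]
    # Column names follow the m{i}_... pattern, so build them from i
    left = dfa[dfa[f'm{i}_val_mthd'].isin([1, 25])]
    right = dfb[dfb[f'm{i + 1}_val_mthd'].isin([8, 10, 11, 12])
                & dfb[f'm{i + 1}_orig_val_mthd'].isin([2, 3, 4, 5])]
    # Both inputs have a 'name' column; pandas suffixes them _x and _y
    return left.merge(right, how='inner', on='id')

# Keys play the role of the variable names merge_m0, merge_m1, ...
merged = {f'merge_m{i}': comb_mths(frames, i) for i in range(len(frames) - 1)}
print(merged['merge_m0'])
```

A result is then looked up as merged['merge_m0'] instead of a global variable, which also makes it easy to loop over all merged frames.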

Fill Forward cupy / cudf

Is it possible to perform a forward fill with cupy/cudf? The idea is to implement a Schmitt trigger function, something like:
# pandas version
df = some_random_vector
on_off = (df>.3)*1 + (df<.3)*-1
on_off[on_off==0] = np.nan
on_off = on_off.fillna(method='ffill').fillna(0)
I was trying this one, but cupy doesn't have the accumulate ufunc method:
def schmitt_trigger(x, th_lo, th_hi, initial = False):
    on_off = ((x >= th_hi)*1 + (x <= th_lo)*-1).astype(cp.int8)
    mask = (on_off==0)
    idx = cp.where(~mask, cp.arange(start=0, stop=mask.shape[0], step=1), 0)
    cp.maximum.accumulate(idx,axis=1, out=idx)
    out = on_off[cp.arange(idx.shape[0])[:,None], idx]
    return out
any idea?
thanks!
Sadly, RAPIDS currently doesn't have that feature in cudf, and it may not land in 0.16 either. There is a feature request on GitHub for it: https://github.com/rapidsai/cudf/issues/1361
I would love for you to chime in on the request so that the devs know it's highly desired.
As for the Schmitt trigger, I'll look into it and your code, and edit this post if I make any progress.
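For reference, here is a minimal CPU-only sketch in NumPy of the index-based forward fill that the question's function is aiming for, written for a 1-D input. It is not a cupy/cudf solution: whether the same calls are available on the GPU is exactly the open point above, so nothing here assumes they are. The handling of samples before the first threshold crossing (the initial argument) is my own assumption.

```python
import numpy as np

def schmitt_trigger(x, th_lo, th_hi, initial=0):
    x = np.asarray(x)
    # +1 at or above the high threshold, -1 at or below the low one,
    # 0 inside the hysteresis band (state to be carried forward)
    state = np.where(x >= th_hi, 1, np.where(x <= th_lo, -1, 0)).astype(np.int8)
    # Index of each decisive sample, 0 where the sample is inside the band
    idx = np.where(state != 0, np.arange(x.shape[0]), 0)
    # Running maximum gives the index of the last decisive sample so far
    np.maximum.accumulate(idx, out=idx)
    out = state[idx]
    # Samples before the first decisive one have no state to inherit
    out[out == 0] = initial
    return out

signal = [0.5, 0.2, 0.1, 0.4, 0.9, 0.5, 0.05]
result = schmitt_trigger(signal, th_lo=0.1, th_hi=0.8)
print(result)
```

The running maximum over indices is what replaces fillna(method='ffill') from the pandas version: every in-band sample picks up the state of the most recent sample that crossed a threshold.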

I have a dataframe and I want to find the standard deviation for some specific cells

I'm trying to use pandas to find the standard deviation of the entries in some specific cells.
I have tried using NumPy's std like so:
numpy.std(df[columnName][j:i])
I have also tried using this:
df.std(axis=0)[columnName][j:i]
This is just pseudocode, because my actual code is more confusing than necessary for this question:
df = loadIris()
for feat in df.columns:
    i = 0
    j = 0
    flower = df['flower'][i]
    while i < df.index.max():
        if df['flower'][i] == flower:
            i+=1
        else:
            j = i
            stand = df.std(axis=0)[feat][j:i]
            flower = df['flower'][i]
I ended up just appending all of the values to a list and then calculating the standard deviation with statistics.stdev, which is available after importing the statistics module.
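For comparison, a minimal sketch of how this could be done in pandas directly, assuming the goal is one standard deviation per flower type. The tiny sample frame and the column name petal_length are made up for illustration. Slicing first and then calling std gives the deviation of just those cells, and groupby does the same for every flower at once. One thing to watch: numpy.std defaults to the population formula (ddof=0), while pandas' std and statistics.stdev use the sample formula (ddof=1), so the numbers differ unless ddof is set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the iris data
df = pd.DataFrame({
    'flower': ['setosa', 'setosa', 'setosa',
               'virginica', 'virginica', 'virginica'],
    'petal_length': [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
})

# Standard deviation of specific cells: slice the rows first, then call std
slice_std = df['petal_length'].iloc[0:3].std()

# Standard deviation per flower type, without manual index bookkeeping
per_flower_std = df.groupby('flower')['petal_length'].std()

# numpy.std matches pandas only when ddof=1 is passed explicitly
numpy_sample_std = np.std(df['petal_length'].iloc[0:3], ddof=1)

print(per_flower_std)
```

By contrast, df.std(axis=0)[columnName] is already a single number for the whole column, so there is nothing left to slice with [j:i].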