Comparing timedelta fields - pandas

I am looking at file delivery times and can't work out how to compare two timedelta fields using an if statement inside a for loop.
time_diff is the difference between cob_date and last_update_time
average_diff is based on the average for a particular file
I want to find the delay for each row.
I have been able to produce a delay column using average_diff - time_diff.
However, when average_diff - time_diff < 0 I just want to return delay = 0, as this is not a delay.
I have made a for loop but this isn't working and I don't know why. I'm sure the answer is very simple but I can't get there.
test_pv_import_v2['delay2'] = pd.to_timedelta('0')

for index, row in test_pv_import_v2.iterrows():
    if test_pv_import_v2['time_diff'] > test_pv_import_v2['average_diff']:
        test_pv_import_v2['delay2'] = test_pv_import_v2['time_diff'] - test_pv_import_v2['average_diff']

Use Series.where to set a 0 Timedelta where the condition is not met:
mask = test_pv_import_v2['time_diff'] > test_pv_import_v2['average_diff']
s = (test_pv_import_v2['time_diff'] - test_pv_import_v2['average_diff'])
test_pv_import_v2['delay2'] = s.where(mask, pd.to_timedelta('0'))
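For reference, here is a minimal, self-contained sketch of the same idea on made-up data (the column values below are assumptions, not taken from the question):

import pandas as pd

# Hypothetical example: two timedelta columns to compare.
demo = pd.DataFrame({
    'time_diff': pd.to_timedelta(['2h', '30min', '90min']),
    'average_diff': pd.to_timedelta(['1h', '1h', '1h']),
})

mask = demo['time_diff'] > demo['average_diff']
demo['delay2'] = (demo['time_diff'] - demo['average_diff']).where(mask, pd.to_timedelta('0'))
print(demo)
# delay2 is 1 hour, 0 and 30 minutes respectively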

Related

Pandas rolling window on an offset between 4 and 2 weeks in the past

I have a datafile with quality scores from different suppliers over a time range of 3 years. The end goal is to use machine learning to predict the quality label (good or bad) of a shipment based on supplier information.
I want to use the mean historic quality data over a specific period of time as an input feature in this model, using a pandas rolling window. The problem with this method is that pandas only lets you create a window that runs from t = -x until t = 0, as presented below:
df['average_score t-2w'] = df['score'].rolling(window='14d',closed='left').mean()
And this is where the problem comes in. For my feature I want to use quality data from a period of 2 weeks, but these are not the 2 weeks immediately before the corresponding shipment: the window should start at t = -4 weeks and end at t = -2 weeks.
You would imagine that this could be solved with the same line of code but a different window, as presented below:
df['average_score t-2w'] = df['score'].rolling(window='28d' - '14d',closed='left').mean()
This, or any other way of denoting this specific window, does not seem to work.
It seems like pandas does not offer a solution to this problem, so we made a workaround with the following solution:
import numpy as np
import pandas as pd

def time_shift_week(df):
    def _avg_score_interval_func(series):
        current_time = series.index[-1]
        result = series[(series.index > (current_time - pd.Timedelta(value=4, unit='w')))
                        & (series.index < (current_time - pd.Timedelta(value=2, unit='w')))]
        return result.mean() if len(result) > 0 else 0.0

    temp_df = df.groupby(by=["supplier", "timestamp"], as_index=False).aggregate({"score": np.mean}).set_index('timestamp')
    temp_df["w-42"] = (
        temp_df
        .groupby(["supplier"])["score"]
        .apply(lambda x:
               x
               .rolling(window='30D', closed='both')
               .apply(_avg_score_interval_func)
        ))
    return temp_df.reset_index()
This results in a new df in which we find the average score per supplier per timestamp, which we can subsequently merge with the original data frame to obtain the new feature.
Doing it this way seems really cumbersome and overly complicated for the task I am trying to perform. Even though we have found a workaround, I am wondering if there is an easier method of doing this.
Is anyone aware of a less complicated way of performing this rolling window feature extraction?
While pandas does not have the custom date offset you need, calculating the mean is pretty simple: it's just sum divided by count. You can subtract the 14-day rolling window from the 28-day rolling window:
import numpy as np
import pandas as pd

# Some sample data. All scores are sequential for easy verification.
idx = pd.MultiIndex.from_product(
    [list("ABC"), pd.date_range("2020-01-01", "2022-12-31")],
    names=["supplier", "timestamp"],
)
df = pd.DataFrame({"score": np.arange(len(idx))}, index=idx).reset_index()

# Rolling average of score over the custom window.
# closed="left" means the current row is excluded from the window.
score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")
avg_score = (r28.sum() - r14.sum()) / (r28.count() - r14.count())
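If you then need that average back on the original rows, one way (a sketch that continues the example above; the merged column name is just an illustration) is to merge on the supplier/timestamp keys:

# avg_score is indexed by (supplier, timestamp); turn it into columns and merge back.
feature = avg_score.rename("avg_score_4w_to_2w").reset_index()
df = df.merge(feature, on=["supplier", "timestamp"], how="left")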

Trigger Point of Moving average crossover

I am trying to define the trigger point when wt1 (moving average 1) crosses over wt2 (moving average 2) and record it in the column ['side'].
So basically, set side to 1 at the moment wt1 crosses above wt2.
This is the current code I am using, but it doesn't seem to be working.
for i in range(len(df)):
    if df.wt1.iloc[i] > df.wt2.iloc[i] and df.wt1.iloc[i-1] < df.wt2.iloc[i-1]:
        df.side.iloc[1]
If I do the following:
long_signals = (df.wt1 > df.wt2)
df.loc[long_signals, 'side'] = 1
it returns the value 1 the entire time wt1 is above wt2, which is not what I am trying to do.
Expected outcome is when wt1 crosses above wt2 side should be labeled as 1.
Help would be appreciated!
Use shift in your condition:
long_signals = (df.wt1 > df.wt2) & (df.wt1.shift() <= df.wt2.shift())
df.loc[long_signals, 'side'] = 1
df
If you do not like NaNs in 'side', use df.fillna(0) at the end.
Your first piece of code also works with the following small modification:
for i in range(len(df)):
    if df.wt1.iloc[i] > df.wt2.iloc[i] and df.wt1.iloc[i-1] <= df.wt2.iloc[i-1]:
        df.loc[i, 'side'] = 1
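A quick sketch on made-up numbers (the wt1/wt2 values are assumptions, not from the question) showing that only the crossover bars get flagged by the vectorised version:

import pandas as pd

df = pd.DataFrame({
    'wt1': [1.0, 1.2, 1.5, 1.4, 1.1, 1.3],
    'wt2': [1.3, 1.3, 1.3, 1.3, 1.2, 1.2],
})
df['side'] = 0
long_signals = (df.wt1 > df.wt2) & (df.wt1.shift() <= df.wt2.shift())
df.loc[long_signals, 'side'] = 1
print(df)
# 'side' is 1 only on rows 2 and 5, where wt1 first moves above wt2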

How can I optimize my for loop so that it can run on a 320,000-row DataFrame?

I think I have a problem with computation time.
I want to run this code on a DataFrame of 320,000 rows and 6 columns:
index_data = data["clubid"].index.tolist()

for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine the number of wins and the number of losses for a given match, taking into account the previous matches, and to do this for every clubid.
The problem is, I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame (data[0:1000]) I got a result in 13 seconds, which is why I think it's a computation time problem.
I also tried to first use a groupby("clubid") and then run my for loop inside every group, but I got lost.
Something else that bothers me: I have at least 2 rows with the exact same date/hour, because there are at least two identical dates per match. Because of this I can't use the date as the index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd

dat = pd.DataFrame({
    "win_bool": [0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0],
    "clubid":   [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "date":     [1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 6],
    "othercol": ["a", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"],
})

temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
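If what you actually need is a running total per match (wins and losses accumulated up to each row) rather than one total per club, a sorted cumulative sum still avoids the nested loop. A sketch reusing the sample frame above, assuming one row per match:

# running totals per club, in date order
dat = dat.sort_values(["clubid", "date"])
grouped = dat.groupby("clubid")["win_bool"]
dat["NW_tot"] = grouped.cumsum()                         # wins up to and including this match
dat["NL_tot"] = grouped.cumcount() + 1 - dat["NW_tot"]   # matches so far minus wins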

Why did my original DataFrame change values even though I sliced it and assigned the slice to another variable?

I am trying to calculate a portfolio's daily total price, by multiplying weights of each asset with the daily price of the assets.
Currently I have a DataFrame tw which is all zeros except for the dates on which I want to re-balance, which hold my asset weights. What I would like to do is, for each month, populate the zeros with the weights I am re-balancing with, until the next re-balancing date, and so on.
My code:
df_of_weights = tw.loc[dates_to_rebalance[13]:]
temp_date = dates_to_rebalance[13]
counter = 0

for date in df_of_weights.index:
    if date.year == temp_date.year and date.month == temp_date.month:
        if date.day == temp_date.day:
            pass
        else:
            df_of_weights.loc[date] = df_of_weights.loc[temp_date].values
            counter += 1
            temp_date = dates_to_rebalance[13 + counter]
I understand that if you slice your DataFrame and assign the slice to a variable (df_of_weights), changing the values of that variable should not affect the original DataFrame. However, the values in tw changed. I have been searching for an answer online for a while now and am really confused.
You should use copy to fix the problem:
df_of_weights = tw.loc[dates_to_rebalance[13]:].copy()
The problem is that slicing provides a view instead of a copy. The issue is still open:
https://github.com/pandas-dev/pandas/issues/15631
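A minimal illustration of the difference on a made-up frame (whether a given slice actually shares memory is an implementation detail, which is exactly why the explicit .copy() is the safe pattern):

import pandas as pd

tw = pd.DataFrame({'w': [0.0, 0.0, 0.0]},
                  index=pd.date_range('2020-01-01', periods=3))

view = tw.loc['2020-01-02':]         # may be a view onto tw's data
view.loc['2020-01-02'] = 0.5         # can write through to tw (pandas warns with SettingWithCopyWarning)

safe = tw.loc['2020-01-02':].copy()  # independent copy
safe.loc['2020-01-03'] = 0.9         # tw is left untouched
print(tw)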

What is the keyword to get time in milliseconds in Robot Framework?

Currently I am getting the time with the keyword Get time epoch, which returns the time in seconds. But I need the time in milliseconds, so that I can get the time span for a particular event.
Or is there any other way to get the time span for a particular event or test scenario?
Check the new test library DateTime, which contains the keyword Get Current Date, which also returns milliseconds. It also has the keyword Subtract Date From Date to calculate the difference between two timestamps.
One of the more powerful features of Robot Framework is that you can directly call Python code from a test script using the Evaluate keyword. For example, you can call the time.time() function and do a little math:
*** Test Cases ***
| Example getting the time in milliseconds
| | ${ms}= | Evaluate | int(round(time.time() * 1000)) | time
| | log | time in ms: ${ms}
Note that even though time.time returns a floating point value, not all systems will return a value more precise than one second.
Using the DateTime library, as suggested by janne:
*** Settings ***
Library    DateTime

*** Test Cases ***
Performance Test
    ${timeAvgMs} =    Test wall clock time    100    MyKeywordToPerformanceTest    and optional arguments
    Should be true    ${timeAvgMs} < 50

*** Keywords ***
MyKeywordToPerformanceTest
    # Do something here

Test wall clock time
    [Arguments]    ${iterations}    @{commandAndArgs}
    ${timeBefore} =    Get Current Date
    :FOR    ${it}    IN RANGE    ${iterations}
    \    @{commandAndArgs}
    ${timeAfter} =    Get Current Date
    ${timeTotalMs} =    Subtract Date From Date    ${timeAfter}    ${timeBefore}    result_format=number
    ${timeAvgMs} =    Evaluate    int(${timeTotalMs} / ${iterations} * 1000)
    Return from keyword    ${timeAvgMs}
In the report, for each suite, test and keyword, you get the start time, end time and elapsed time with millisecond precision. Something like:
Start / End / Elapsed: 20140602 10:57:15.948 / 20140602 10:57:16.985 / 00:00:01.037
I don't see a way to do it using BuiltIn; look at its implementation:
def get_time(format='timestamp', time_=None):
    """Return the given or current time in requested format.

    If time is not given, current time is used. How time is returned is
    determined based on the given 'format' string as follows. Note that all
    checks are case insensitive.

    - If 'format' contains the word 'epoch', the time is returned in seconds
      after the unix epoch.
    - If 'format' contains any of the words 'year', 'month', 'day', 'hour',
      'min' or 'sec', only the selected parts are returned. The order of the
      returned parts is always the one in the previous sentence and the order
      of words in 'format' is not significant. Parts are returned as zero
      padded strings (e.g. May -> '05').
    - Otherwise (and by default) the time is returned as a timestamp string in
      the format '2006-02-24 15:08:31'.
    """
    time_ = int(time_ or time.time())
    format = format.lower()
    # 1) Return time in seconds since epoch
    if 'epoch' in format:
        return time_
    timetuple = time.localtime(time_)
    parts = []
    for i, match in enumerate('year month day hour min sec'.split()):
        if match in format:
            parts.append('%.2d' % timetuple[i])
    # 2) Return time as timestamp
    if not parts:
        return format_time(timetuple, daysep='-')
    # Return requested parts of the time
    elif len(parts) == 1:
        return parts[0]
    else:
        return parts
You have to write your own module; you need something like:
import time

def get_time_in_millies():
    # current epoch time in milliseconds
    return int(round(time.time() * 1000))
Then import this library in RIDE for the suite, and you can use the method name as a keyword; in my case it would be Get Time In Millies. More info here.
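As a quick sanity check outside Robot, the module above can also be exercised directly from Python; a time span is then just the difference of two calls (the module file name and the sleep are illustrative assumptions):

import time
from GetTimeInMillies import get_time_in_millies  # assuming the module is saved as GetTimeInMillies.py

start = get_time_in_millies()
time.sleep(0.25)                      # stand-in for the event being timed
elapsed_ms = get_time_in_millies() - start
print(elapsed_ms)                     # roughly 250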