How do I create an Apdex score in Librato?

Say I have a series of request timings that I want to score against two thresholds, 4 s and 12 s: a request completed in 4 s or less scores +1, between 4 s and 12 s scores 0, and over 12 s scores -1. I want to sum the scores and divide by the total count of timings. How can I do this in Librato?

Currently Librato doesn't offer an Apdex score, but there are ways to get an overview of the request timings.
One simple method is to build a composite metric using the map() function. This breaks out the request-time metric by host and visualizes the request rate per minute:
map({source:"*"},
  rate(s("query-api.rails.request.time", "&", { period: "60", function: "sum" }))
)
If you want a Big Number chart that gives a general idea of the overall request times (as an Apdex score does), you could use the following composite metric to display the percentage of hosts reporting under 400 ms:
scale(
  divide([
    sum(
      map({source:"*"},
        divide([
          filter(s("query-api.rails.request.time", "&", { period:"60", function:"mean" }), {lt: "400", function: "mean"}),
          filter(s("query-api.rails.request.time", "&", { period:"60", function:"mean" }), {lt: "400", function: "mean"})
        ])
      )
    ),
    sum(
      map({source:"*"},
        divide([
          s("query-api.rails.request.time", "&", { period:"60", function:"mean" }),
          s("query-api.rails.request.time", "&", { period:"60", function:"mean" })
        ])
      )
    )
  ]),
  {factor:"100"}
)
Both of these examples use the metric query-api.rails.request.time, which comes from librato-rails, but you could substitute any metric that reports request time (e.g. from the front-end collector librato-client).
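As a point of reference, the scoring scheme described in the question is simple to compute outside Librato if you have access to the raw timings. Here is a minimal Python sketch of that scheme (the sample timings are invented):

# Score each request +1 (<= 4 s), 0 (4-12 s), or -1 (> 12 s), then average
def request_score(timings_s, low=4.0, high=12.0):
    scores = [1 if t <= low else (0 if t <= high else -1) for t in timings_s]
    return sum(scores) / len(scores)

print(request_score([1.2, 3.9, 5.0, 13.7]))  # (1 + 1 + 0 - 1) / 4 = 0.25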

Related

Pandas rolling window on an offset between 4 and 2 weeks in the past

I have a datafile with quality scores from different suppliers over a time range of 3 years. The end goal is to use machine learning to predict the quality label (good or bad) of a shipment based on supplier information.
I want to use the mean historic quality data over a specific period of time as an input feature in this model, using a pandas rolling window. The problem with this method is that pandas only lets you create a window running from t = -x up to t = 0, as presented below:
df['average_score t-2w'] = df['score'].rolling(window='14d',closed='left').mean()
And this is where the problem comes in. For my feature I want to use quality data from a period of 2 weeks, but not the 2 weeks directly before the corresponding shipment: the window should start at t = -4 weeks and end at t = -2 weeks.
You would imagine that this could be solved with the same line of code, just changing the window, as presented below:
df['average_score t-2w'] = df['score'].rolling(window='28d' - '14d',closed='left').mean()
Neither this nor any other notation for this specific window seems to work.
Since pandas does not appear to offer a solution to this problem, we built the following workaround:
import numpy as np
import pandas as pd

def time_shift_week(df):
    def _avg_score_interval_func(series):
        current_time = series.index[-1]
        result = series[(series.index > (current_time - pd.Timedelta(value=4, unit='w')))
                        & (series.index < (current_time - pd.Timedelta(value=2, unit='w')))]
        return result.mean() if len(result) > 0 else 0.0
    temp_df = (df.groupby(by=["supplier", "timestamp"], as_index=False)
                 .aggregate({"score": np.mean})
                 .set_index('timestamp'))
    temp_df["w-42"] = (
        temp_df
        .groupby(["supplier"])["score"]
        .apply(lambda x:
               x
               .rolling(window='30D', closed='both')
               .apply(_avg_score_interval_func)
               )
    )
    return temp_df.reset_index()
This results in a new df in which we find the average score per supplier per timestamp, which we can subsequently merge with the original data frame to obtain the new feature.
Doing it this way seems really cumbersome and overly complicated for the task I am trying to perform. Even though we have found a workaround, I am wondering if there is an easier method.
Is anyone aware of a less complicated way of performing this rolling-window feature extraction?
While pandas does not have the custom date offset you need, calculating the mean is pretty simple: it's just sum divided by count. You can subtract the 14-day rolling window from the 28-day rolling window:
import numpy as np
import pandas as pd

# Some sample data. All scores are sequential for easy verification
idx = pd.MultiIndex.from_product(
    [list("ABC"), pd.date_range("2020-01-01", "2022-12-31")],
    names=["supplier", "timestamp"],
)
df = pd.DataFrame({"score": np.arange(len(idx))}, index=idx).reset_index()

# Rolling average of score with the custom window.
# closed="left" means the current row is excluded from its own window.
score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")
avg_score = (r28.sum() - r14.sum()) / (r28.count() - r14.count())
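To turn avg_score into the feature column, you can merge it back onto the original frame. A short sketch, assuming the supplier/timestamp columns from the sample above (the feature name is made up):

# Attach the custom-window average as a new feature column
feature = avg_score.rename("average_score t-4w:t-2w").reset_index()
df = df.merge(feature, on=["supplier", "timestamp"], how="left")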

Moving Average of time series using a sliding window over an array

I need to write a function that computes the moving average of a time series using a sliding window over an array. The function should take an array of date strings (say arr_date), an array of numbers (say arr_record), and a sliding window size (default value 50). It should:
Return a list of dictionaries for all windows.
Each dictionary should include the date, average value, min, max, and standard deviation for each window.
Be able to handle missing data in the time series by replacing missing values with the most recent available data.
(b) Download SPY daily data (Dec. 31, 2017 to Dec. 31, 2018) from Yahoo! as your test data in a .csv file. Read a .csv-file-reading example and write a test program that calls your function.
Does anyone have any thoughts? I'm extremely new to Python and struggling.
Something following this logic should be a good starting point. Hope this is a helpful start, and welcome to the CS community.
def sliding_window(dates, numbers, sliding_window_value=50):
    # list of dictionaries, one per window
    return_dicts = []
    # if the window size is at least the length of dates, there's only one window
    if sliding_window_value >= len(dates):
        return_dicts += [create_window(dates, numbers)]
        return return_dicts
    # gather all our windows into one list
    for i in range(0, len(dates) - sliding_window_value + 1):
        # get our window subsets
        dates_subset = dates[i:i + sliding_window_value]
        numbers_subset = numbers[i:i + sliding_window_value]
        # get our window stats dictionary
        window_stats = create_window(dates_subset, numbers_subset)
        # add these stats to our return list
        return_dicts += [window_stats]
    return return_dicts

def create_window(dates_subset, numbers_subset):
    window_min = 1000000   # some high minimum to start
    window_max = -1000000  # some low maximum to start
    window_total = 0
    for i in range(0, len(dates_subset)):
        # accumulate the total
        window_total += numbers_subset[i]
        # track the max
        if numbers_subset[i] > window_max:
            window_max = numbers_subset[i]
        # track the min
        if numbers_subset[i] < window_min:
            window_min = numbers_subset[i]
    # other calculations....
    return_dict = {
        "min": window_min,
        "max": window_max,
        "average": window_total / len(dates_subset),
        # other calculations....
    }
    return return_dict
Good luck bud, the work is worth it.
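For comparison, pandas can produce the same per-window statistics much more compactly. This is only a sketch, assuming arr_date and arr_record are the inputs described in the question:

import pandas as pd

# Build a date-indexed series; ffill replaces missing data with the
# most recent available value, as the assignment asks
s = pd.Series(arr_record, index=pd.to_datetime(arr_date)).sort_index().ffill()
roll = s.rolling(window=50)
stats = pd.DataFrame({
    "average": roll.mean(),
    "min": roll.min(),
    "max": roll.max(),
    "std": roll.std(),
}).dropna()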

What algorithm can I use to compute the number of, say, positive or negative postings seen up to a certain time point?

I wish to check if my understanding and proposed algorithm below would be correct.
To calculate the number of positive postings I have seen up to time point ti, I am proposing a loop like the one below:
sumofPi = 0
for x = 0 until x = ti
    sumofPi = sumofPi + Pi-1
I am not sure if this will work, but the idea is to be able to sum up the positive postings that come in up to a certain time point in a data stream.
Thanks
The sequence seems fine as long as the events are indexed in order and you are comfortable losing events that happened at the same time but were indexed differently as a result of that limitation. You may also want to address filtering by posting type.
Your algorithm in Python:
# Sample data
postingevents = [1, 0, 1, 1, 0, 1]

# Algorithm:
sumofPi = 0
ti = 4
for i in range(0, ti):
    sumofPi += postingevents[i]
print(sumofPi)  # 3
Looks like you are dealing with time series.
For time series, I would suggest a rolling sum or rolling weighted averages; there's an example here.
Below are some Python code samples using loops and recursion with a data sample (Event indicator & epoch time stamp)
# Data sample:
postingevents=[1,0,1,1,0,1]
postingti=[1497634668,1497634669,1497634697,1497634697,1497634714,1497634718]
postings=([postingevents,postingti])
# All events preceeding time stamp T. Events do not need to be ordered by time.
def sumpi_notordered(X,t):
return sum([xv if yv<=t else 0 for (xv,yv) in zip(X[0],X[1])])
# Sum ordered events indexed by T, using recursion.
def sumpi_ordered(X,t):
if t>=1:
return X[t]+sumpi_ordered(X,t-1)
else:
return(X[t])
print(sumpi_notordered(postings,1497634697))
3
print(sumpi_ordered(postingevents,3))
3

Pandas groupby for k-fold cross-validation with aggregation

Say I have a data frame, df, with columns: id | site | time | clicks | impressions.
I want to use the machine learning technique of k-fold cross-validation (split the data randomly into k=10 equal-sized partitions, based on e.g. the column id). I think of this as a mapping from id to {0, 1, ..., 9} (so a new column 'fold' going from 0-9),
then iteratively take 9/10 partitions as training data and 1/10 partition as validation data
(so first fold==0 is validation and the rest is training, then fold==1 is validation, and so on).
[So I am thinking of this as a generator based on grouping by the fold column.]
Finally, I want to group all the training data by site and time (and similarly for the validation data): in other words, sum over the fold index while keeping the site and time indices.
What is the right way of doing this in pandas?
The way I am currently thinking of doing it is:
df_sum = df.groupby(['fold', 'site', 'time']).sum()
# so df_sum has indices fold, site, time
# create a new Series object, dat, name='cross', by mapping fold indices
# to 'training'/'validation'
df_train_val = df_sum.groupby([dat, 'site', 'time']).sum()
df_train_val.xs('validation', level='cross')
Now the direct problem I run into is that groupby on columns will handle an introduced Series object, but groupby on a MultiIndex doesn't [the df_train_val assignment above doesn't work]. Obviously I could use reset_index, but given that I want to group over site and time [to aggregate over folds 1 to 9, say], this seems wrong. (I assume grouping is much faster on indices than on 'raw' columns.)
So question 1: is this the right way to do cross-validation followed by aggregation in pandas? More generally, how should one group and then regroup based on MultiIndex values?
Question 2: is there a way of mixing arbitrary mappings with multilevel indices?
This generator seems to do what I want. You pass in the grouped data, with one index level corresponding to the fold (0 to n_folds).
def split_fold2(fold_data, n_folds, new_fold_col='fold'):
    i_fold = 0
    indices = list(fold_data.index.names)
    slicers = [slice(None)] * len(fold_data.index.names)
    fold_index = fold_data.index.names.index(new_fold_col)
    indices.remove(new_fold_col)
    while i_fold < n_folds:
        # select every fold except the current one for training
        slicers[fold_index] = [i for i in range(n_folds) if i != i_fold]
        slicers_tuple = tuple(slicers)
        train_data = fold_data.loc[slicers_tuple, :].groupby(level=indices).sum()
        val_data = fold_data.xs(i_fold, level=new_fold_col)
        yield train_data, val_data
        i_fold += 1
On my data set this takes:
CPU times: user 812 ms, sys: 180 ms, total: 992 ms; Wall time: 991 ms
(to retrieve one fold)
Replacing the train_data assignment with
train_data = fold_data.select(lambda x: x[fold_index] != i_fold).groupby(level=indices).sum()
takes
CPU times: user 2.59 s, sys: 263 ms, total: 2.85 s; Wall time: 2.83 s
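Hypothetical usage of the generator, assuming df already has the fold column described in the question:

# Aggregate once, then iterate over the folds
fold_data = df.groupby(['fold', 'site', 'time']).sum()
for train_data, val_data in split_fold2(fold_data, n_folds=10):
    # fit on train_data, evaluate on val_data
    pass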

What is the keyword to get time in milliseconds in robot framework?

Currently I am getting the time with the Get Time keyword using the epoch format, which returns the time in seconds, but I need the time in milliseconds so that I can get the time span for a particular event.
Or is there any other way to get the time span for a particular event or a test scenario?
Check the newer test library DateTime, which contains the keyword Get Current Date that also returns milliseconds. It also has the keyword Subtract Date From Date to calculate the difference between two timestamps.
One of the more powerful features of Robot Framework is that you can call Python code directly from a test script using the Evaluate keyword. For example, you can call the time.time() function and do a little math:
*** Test Cases ***
| Example getting the time in milliseconds
| | ${ms}= | Evaluate | int(round(time.time() * 1000)) | time
| | log | time in ms: ${ms}
Note that even though time.time returns a floating point value, not all systems will return a value more precise than one second.
Using the DateTime library, as suggested by janne:
*** Settings ***
Library    DateTime

*** Test Cases ***
Performance Test
    ${timeAvgMs} =    Test wall clock time    100    MyKeywordToPerformanceTest    and optional arguments
    Should Be True    ${timeAvgMs} < 50

*** Keywords ***
MyKeywordToPerformanceTest
    # Do something here
    No Operation

Test wall clock time
    [Arguments]    ${iterations}    @{commandAndArgs}
    ${timeBefore} =    Get Current Date
    :FOR    ${it}    IN RANGE    ${iterations}
    \    Run Keyword    @{commandAndArgs}
    ${timeAfter} =    Get Current Date
    ${timeTotalMs} =    Subtract Date From Date    ${timeAfter}    ${timeBefore}    result_format=number
    ${timeAvgMs} =    Evaluate    int(${timeTotalMs} / ${iterations} * 1000)
    Return From Keyword    ${timeAvgMs}
In the report, for each suite, test, and keyword, you have information about the start, end, and elapsed time with millisecond detail. Something like:
Start / End / Elapsed: 20140602 10:57:15.948 / 20140602 10:57:16.985 / 00:00:01.037
I don't see a way to do it using the BuiltIn library; look at its implementation of Get Time:
def get_time(format='timestamp', time_=None):
    """Return the given or current time in requested format.

    If time is not given, current time is used. How time is returned is
    determined based on the given 'format' string as follows. Note that all
    checks are case insensitive.

    - If 'format' contains the word 'epoch', the time is returned in seconds
      after the unix epoch.
    - If 'format' contains any of the words 'year', 'month', 'day', 'hour',
      'min' or 'sec', only the selected parts are returned. The order of the
      returned parts is always the one in the previous sentence and the order
      of words in 'format' is not significant. Parts are returned as zero
      padded strings (e.g. May -> '05').
    - Otherwise (and by default) the time is returned as a timestamp string
      in the format '2006-02-24 15:08:31'.
    """
    time_ = int(time_ or time.time())
    format = format.lower()
    # 1) Return time in seconds since epoch
    if 'epoch' in format:
        return time_
    timetuple = time.localtime(time_)
    parts = []
    for i, match in enumerate('year month day hour min sec'.split()):
        if match in format:
            parts.append('%.2d' % timetuple[i])
    # 2) Return time as timestamp
    if not parts:
        return format_time(timetuple, daysep='-')
    # 3) Return requested parts of the time
    elif len(parts) == 1:
        return parts[0]
    else:
        return parts
You have to write your own module, you need something like:
import time

def get_time_in_millies():
    return int(round(time.time() * 1000))
Then import this library in RIDE for the suite, and you can use the function name as a keyword; in my case it would be Get Time In Millies. More info here.