AWS CloudWatch alarm triggering autoscaling using multiple metrics - amazon-cloudwatch

I want to create a CloudWatch alarm which triggers autoscaling based on more than one metric. This does not seem to be natively supported by CloudWatch (correct me if I am wrong), so I was wondering how to overcome this.
Could I pull the data from different metrics, say CPUUtilization, NetworkIn and NetworkOut, combine them into a custom metric published with mon-put-data, and then trigger autoscaling from an alarm on that new metric?

You can now make use of CloudWatch Metric Math.
Metric math enables you to query multiple CloudWatch metrics and use
math expressions to create new time series based on these metrics. You
can visualize the resulting time series in the CloudWatch console and
add them to dashboards.
More information regarding Metric Math Syntax and Functions available here:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html#metric-math-syntax
However, note that there are no logical operators, so you have to work around that with arithmetic expressions.
To help anyone landing here, an example:
Let's say you want to trigger an alarm if CPUUtilization < 20% and MemoryUtilization < 30%. Define:
m1 = average CPU utilization % over 5 minutes
m2 = average memory utilization % over 5 minutes
Then the condition
m1 < 20 AND m2 < 30 ... (1)
is equivalent to
(m1 - 20)/ABS(m1 - 20) + (m2 - 30)/ABS(m2 - 30) < 0 ... (2)
because each term is -1 when its condition holds and +1 otherwise, so the sum is below zero only when both conditions hold.
So define your two metrics and build a metric math expression that looks like the left-hand side of equation (2). Set the threshold to 0 and the comparison operator to LessThanThreshold.
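If you prefer to set this up programmatically, here is a hedged boto3 sketch of such an alarm. It is a sketch only: the memory metric (mem_used_percent in the CWAgent namespace), the AutoScalingGroupName dimension, and the scaling policy ARN are assumptions - substitute whatever your agent actually publishes and your own policy.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="scale-in-when-cpu-and-memory-low",
    ComparisonOperator="LessThanThreshold",
    Threshold=0,
    EvaluationPeriods=1,
    AlarmActions=["arn:aws:autoscaling:..."],   # your scale-in policy ARN (placeholder)
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "my-asg"}],
                },
                "Period": 300,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        {
            "Id": "m2",
            "MetricStat": {
                "Metric": {
                    "Namespace": "CWAgent",               # assumption: CloudWatch agent memory metric
                    "MetricName": "mem_used_percent",
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "my-asg"}],
                },
                "Period": 300,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        {
            "Id": "e1",
            "Expression": "(m1 - 20)/ABS(m1 - 20) + (m2 - 30)/ABS(m2 - 30)",
            "Label": "cpu-and-memory-low",
            "ReturnData": True,                            # the alarm evaluates only this expression
        },
    ],
)
Note that the expression divides by zero when a metric sits exactly on its threshold; CloudWatch then treats that data point as missing.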

Yes - a CloudWatch alarm can only watch a single CloudWatch metric, so you would need to publish your own 'aggregate' custom metric and alarm on that, as you suggest yourself.
Here is a blog post describing the use of custom metrics to trigger autoscaling:
http://www.thatsgeeky.com/2012/01/autoscaling-with-custom-metrics/
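For completeness, a minimal hedged sketch of publishing such an aggregate custom metric with boto3 (put_metric_data is the API behind the old mon-put-data tool). The namespace, metric name, dimension, and aggregation formula below are made-up placeholders:
import boto3

cloudwatch = boto3.client("cloudwatch")

# Values you collected yourself (e.g. from the instance or from EC2 metrics).
cpu, net_in, net_out = 35.0, 1.2e6, 0.8e6

# Combine them however makes sense for your workload, e.g. a weighted score.
aggregate = 0.7 * cpu + 0.3 * ((net_in + net_out) / 1e7)

cloudwatch.put_metric_data(
    Namespace="MyApp/Scaling",
    MetricData=[{
        "MetricName": "AggregateLoad",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "my-asg"}],
        "Value": aggregate,
        "Unit": "None",
    }],
)
You would then create a normal single-metric alarm on this custom metric and attach your scaling policy to it.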

This is now supported; see
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html for details.
As an example, you can use an expression along the lines of (CPU utilization > 80) OR (memory consumed > 55).

Optimize batch transform inference on SageMaker

With the current batch transform inference setup I see a lot of bottlenecks:
Each input file can only have close to 1000 records.
Currently it is processing 2000 records/min on 1 instance of ml.g4dn.12xlarge.
GPU instances are not necessarily giving any advantage over CPU instances.
I wonder if this is an existing limitation of the currently available TensorFlow Serving container v2.8. If that's the case, which config should I play with to increase the performance?
I tried changing max_concurrent_transforms but it doesn't seem to really help.
My current config:
transformer = tensorflow_serving_model.transformer(
    instance_count=1,
    instance_type="ml.g4dn.12xlarge",
    max_concurrent_transforms=0,
    output_path=output_data_path,
)
transformer.transform(
    data=input_data_path,
    split_type="Line",
    content_type="text/csv",
    job_name=job_name + datetime.now().strftime("%m-%d-%Y-%H-%M-%S"),
)
Generally speaking, you should first have a performant model (steps 1+2 below) yielding a satisfactory TPS before you move on to the batch transform parallelization knobs to push your overall TPS higher.
Steps:
1. GPU enabling - Run a manual test to check that your model can utilize GPU instances to begin with (this isn't related to batch transform).
2. Picking an instance - Use SageMaker Inference Recommender to find the most cost-effective instance type to run inference on.
3. Batch transform inputs - It sounds like you have multiple input files, which is needed if you want to speed up the job by adding more instances.
4. Batch transform single-instance knobs - If you are using the CreateTransformJob API, you can reduce the time it takes to complete batch transform jobs by using optimal values for parameters such as MaxPayloadInMB, MaxConcurrentTransforms, or BatchStrategy. The ideal value for MaxConcurrentTransforms is equal to the number of compute workers in the batch transform job. If you are using the SageMaker console, you can specify these optimal parameter values in the Additional configuration section of the Batch transform job configuration page. SageMaker automatically finds the optimal parameter settings for built-in algorithms. For custom algorithms, provide these values through an execution-parameters endpoint.
5. Batch transform cluster size - Increase instance_count to more than 1, using the cost-effective instance you found in (1)+(2) (see the sketch after this list).
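Putting (4) and (5) together in the SageMaker Python SDK, a sketch along the lines of the config in the question might look like this; the specific values (strategy, payload size, concurrency, instance count) are illustrative assumptions, not tuned recommendations:
transformer = tensorflow_serving_model.transformer(
    instance_count=2,                  # step 5: scale out across your many input files
    instance_type="ml.g4dn.12xlarge",  # or whatever Inference Recommender suggests in step 2
    strategy="MultiRecord",            # BatchStrategy: batch several CSV lines per request
    max_payload=6,                     # MaxPayloadInMB per request
    max_concurrent_transforms=4,       # roughly the number of model workers per instance
    output_path=output_data_path,
)

transformer.transform(
    data=input_data_path,
    split_type="Line",
    content_type="text/csv",
)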

AWS CloudWatch Math Expressions: draw null / empty

I am trying to create a dashboard widget that says "if a metric's sample count is less than a certain number, don't draw the graph".
The only math function that seems promising is IF; however, the returned value can only be a metric or a scalar. I'm trying to find a way to draw a null / no data point / empty value instead.
Is there any way to do this?
Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html#using-IF-expressions
CloudWatch drops data points that are not numbers (NaN, +Infinity, -Infinity) when graphing the data, and metric math evaluates basic arithmetic in the expression, so you can divide by zero to produce a non-number value.
You can therefore trick it into dropping the values you don't want (a sketch follows the steps):
1. Have your metric in the graph as m1.
2. Have the sample count of your metric in the graph as m2.
3. Add an IF function to drop data points whenever the sample count is lower than some number (10 in this example): IF(m2 < 10, 1/0, m1)
4. Disable m1 and m2 on the graph and only show the expression.
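As a concrete illustration, here is a hedged sketch of the same widget defined through boto3's put_dashboard; the instance id, region, and dashboard name are placeholders:
import json
import boto3

widget = {
    "type": "metric",
    "width": 12,
    "height": 6,
    "properties": {
        "region": "us-east-1",
        "stat": "Average",
        "period": 300,
        "metrics": [
            ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0",
             {"id": "m1", "visible": False}],
            ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0",
             {"id": "m2", "stat": "SampleCount", "visible": False}],
            # 1/0 evaluates to a non-number, so those data points are simply not drawn
            [{"expression": "IF(m2 < 10, 1/0, m1)", "id": "e1",
              "label": "CPUUtilization (>= 10 samples only)"}],
        ],
    },
}

boto3.client("cloudwatch").put_dashboard(
    DashboardName="sparse-metric-demo",
    DashboardBody=json.dumps({"widgets": [widget]}),
)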

How do I obtain Monte Carlo error in R2OpenBUGS?

Has anyone managed to obtain a Monte Carlo error for a parameter when running a Bayesian model in R2OpenBUGS?
It is provided in the standard output of OpenBUGS, but when run under R2OpenBUGS the log file doesn't include the MC error. Is there a way to ask R2OpenBUGS to calculate the MC error, or is there a way to calculate it manually? Please let me know if you have heard of any way to do that. Thank you!
Here is the standard log output of R2OpenBUGS:
$stats
              mean      sd  val2.5pc    median  val97.5pc  sample
beta0      1.04700 0.13250    0.8130   1.03800    1.30500    1500
beta1     -0.31440 0.18850   -0.6776  -0.31890    0.03473    1500
beta2     -0.05437 0.05369   -0.1648  -0.05408    0.04838    1500
deviance 588.70000 7.87600  575.3000 587.50000  606.90000    1500
$DIC
       Dbar  Dhat   DIC    pD
t     588.7 570.9 606.5 17.78
total 588.7 570.9 606.5 17.78
A simple way to calculate Monte Carlo standard error (MCSE) is to divide the standard deviation of the chain by the square root of the effective number of samples. The standard deviation is provided in your output, but the effective sample size should be given as n.eff (the rightmost column) when you print the model output - or at least that is the impression I get from:
https://cran.r-project.org/web/packages/R2OpenBUGS/vignettes/R2OpenBUGS.pdf
I don't use OpenBUGS any more so I can't easily check for you, but there should be something there that indicates the effective sample size (this is NOT the same as the number of iterations you have sampled, as it also takes into account the loss of information due to correlation within the chains).
Otherwise you can obtain it yourself by extracting the raw MCMC chains and then either computing the effective sample size with the coda package (?coda::effectiveSize) or using LaplacesDemon::MCSE to calculate the Monte Carlo standard error directly. For more information see:
https://rdrr.io/cran/LaplacesDemon/man/MCSE.html
Note that some people (including me!) would suggest focusing on the effective sample size directly rather than looking at the MCSE, as the old "rule of thumb" that MCSE should be less than 5% of the sample standard deviation is equivalent to saying that the effective sample size should be at least 400 (1/0.05^2). But opinions do vary :)
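The arithmetic itself is trivial; purely as an illustration (shown in Python, with a made-up effective sample size - yours has to come from the chains, e.g. via coda::effectiveSize):
import math

sd_beta0 = 0.1325   # posterior sd of beta0 from the output above
n_eff = 900         # assumed effective sample size, for illustration only
mcse_beta0 = sd_beta0 / math.sqrt(n_eff)
print(round(mcse_beta0, 5))   # ~0.00442, i.e. roughly 3% of the posterior sd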
The MCMC error is named "Time-series SE" and can be found in the statistics section of the summary of the coda object:
library(R2OpenBUGS)
library(coda)
# run the model, asking R2OpenBUGS to save the CODA chain files
my_result <- bugs(..., codaPkg = TRUE)
# read the chains back in as an mcmc.list and summarise them
my_coda <- read.bugs(my_result)
summary(my_coda)$statistics

Local Model performance in Tensorflow Federated

I am implementing federated learning with TensorFlow Federated. The tutorial and all other material available compare the accuracy of the federated (global) model after each communication round. Is there a way I can compute the accuracy of each local model to compare against the federated (global) model?
Summary:
Total number of clients: 15
For each communication round: Local vs Federated Model performance
References:
(https://colab.research.google.com/github/tensorflow/federated/blob/main/docs/tutorials/federated_learning_for_image_classification.ipynb#scrollTo=blQGiTQFS9_r)
I don't know how you can achieve this with tff.learning.build_federated_averaging_process, but I recommend you take a look at this simple fedavg implementation. There you can use test_data - the same evaluation dataset you use for the server model - for each client, e.g. client_test_datasets = [test_data for x in sampled_train_ids], and then pass it as iterative_process.next(server_state, sampled_train_data, client_test_datasets).
You need to change the signatures of run_one_round and client_update_fn in simple_fedavg_tff.py; in each case the signature for the test datasets should be the same as the one for the training datasets, and don't forget to actually pass the appropriate test datasets as input to each.
Then move on to simple_fedavg_tf.py and change client_update: there you basically need to write an evaluation very similar to the one done for the server model. Afterwards, print the evaluation results if you wish, or change the outputs at each level (tf.function, tff.tf_computation, and tff.federated_computation) and return the eval results as output. If you go this way, don't forget to update the output of iterative_process.next.
edit: I assumed you wanted the accuracy of the clients when the test dataset is the same as the server test dataset.
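If modifying the simple_fedavg code feels too heavy, a rough alternative is to do the comparison outside TFF in plain Keras: rebuild each client's "local model" from the current server weights, train it for one local pass, and evaluate it on the same test set as the global model. A minimal sketch, assuming create_keras_model(), client_datasets (a list of tf.data.Dataset objects yielding (x, y) batches), test_data, and server_weights (a list of NumPy arrays extracted from the server state) already exist:
import tensorflow as tf

def local_vs_global_accuracy(server_weights, client_datasets, test_data, create_keras_model):
    # Global model accuracy on the shared test set.
    global_model = create_keras_model()
    global_model.set_weights(server_weights)
    global_model.compile(loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    _, global_acc = global_model.evaluate(test_data, verbose=0)

    results = []
    for i, ds in enumerate(client_datasets):
        # Each "local model" starts from the current global weights, then does
        # one pass of local training (a crude stand-in for client_update).
        local_model = create_keras_model()
        local_model.set_weights(server_weights)
        local_model.compile(optimizer="sgd",
                            loss="sparse_categorical_crossentropy",
                            metrics=["accuracy"])
        local_model.fit(ds, epochs=1, verbose=0)
        _, local_acc = local_model.evaluate(test_data, verbose=0)
        results.append((i, local_acc, global_acc))
    return results
This is only an approximation of what the clients compute inside the iterative process (the optimizer and number of local epochs will differ), but it lets you compare local vs. global accuracy per round without touching the TFF internals.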

In distributed TensorFlow, how to write summaries from workers as well

I am using the Google Cloud ML distributed sample for training a model on a cluster of computers. Input and output (i.e. TFRecords, checkpoints, tfevents) are all on gs:// (Google Cloud Storage).
Similarly to the distributed sample, I use an evaluation step that is called at the end, and the result is written as a summary, in order to use parameter hypertuning, either within Cloud ML or using my own stack of tools.
But rather than performing a single evaluation on a large batch of data, I am running several evaluation steps, in order to retrieve statistics on the performance criteria, because I don't want to be limited to a single value. I want information regarding the performance interval. In particular, the variance of performance is important to me: I'd rather select a model with lower average performance but with better worst cases.
I therefore run several evaluation steps. What I would like to do is to parallelize these evaluation steps, because right now only the master is evaluating. When using large clusters this is a source of inefficiency, and I would like the task workers to evaluate as well.
Basically, the supervisor is created as:
self.sv = tf.train.Supervisor(
    graph,
    is_chief=self.is_master,
    logdir=train_dir(self.args.output_path),
    init_op=init_op,
    saver=self.saver,
    # Write summary_ops by hand.
    summary_op=None,
    global_step=self.tensors.global_step,
    # No saving; we do it manually in order to easily evaluate immediately
    # afterwards.
    save_model_secs=0)
At the end of training I call the summary writer:
# only on master, this is what I want to remove
if self.is_master and not self.should_stop:
    # I want to have an idea of statistics of accuracy
    # not just the mean, hence I run on 10 batches
    for i in range(10):
        self.global_step += 1
        # I call an evaluator, and extract the accuracy
        evaluation_values = self.evaluator.evaluate()
        accuracy_value = self.model.accuracy_value(evaluation_values)
        # now I dump the accuracy, ready to use within hptune
        eval_summary = tf.Summary(value=[
            tf.Summary.Value(
                tag='training/hptuning/metric', simple_value=accuracy_value)
        ])
        self.sv.summary_computed(session, eval_summary, self.global_step)
I tried to write summaries from workers as well, but I got an error: basically, summaries can be written from the master only. Is there any easy workaround? The error is: "Writing a summary requires a summary writer."
My guess is you'd create a separate summary writer on each worker yourself, and write out summaries directly instead.
I suspect you wouldn't use a Supervisor for the eval processing either. Just load a session on each worker for doing eval with the latest checkpoint, and write out independent summaries.
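A minimal TF 1.x-style sketch of that idea, reusing the evaluation snippet above; eval_dir and self.task_index are assumptions about your setup:
import os
import tensorflow as tf

# Each worker gets its own event directory and FileWriter, so summaries no
# longer have to go through the Supervisor (which only the chief may use).
eval_dir = os.path.join(self.args.output_path, "eval_worker_%d" % self.task_index)
summary_writer = tf.summary.FileWriter(eval_dir)

for i in range(10):
    self.global_step += 1
    evaluation_values = self.evaluator.evaluate()
    accuracy_value = self.model.accuracy_value(evaluation_values)
    eval_summary = tf.Summary(value=[
        tf.Summary.Value(tag="training/hptuning/metric",
                         simple_value=accuracy_value)
    ])
    summary_writer.add_summary(eval_summary, self.global_step)

summary_writer.flush()
Keep in mind that if every worker writes the same training/hptuning/metric tag, whatever consumes the summaries downstream (Cloud ML hypertuning or your own tooling) will see several series, so separate subdirectories or distinct tags per worker may be clearer.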