Using AWS Elastic Inference without making changes to the client code - TensorFlow

I have an endpoint deployed in SageMaker with a TensorFlow model, and I make calls to it using the Scala SDK like this:
val runtime = AmazonSageMakerRuntimeClientBuilder
  .standard()
  .withCredentials(credentialsProvider)
  .build()
...
val invokeEndpointResult = runtime.invokeEndpoint(request)
Can I use SageMaker's Elastic Inference with this code as it is and gain the performance enhancement of EI?
I have tried running an endpoint configured with 8 ml.m5d.xlarge instances against a configuration of 8 ml.m5d.xlarge instances with an ml.eia2.xlarge EI accelerator attached, but looking at the CloudWatch metrics I get the same number of invocations per minute, and the total run time (on the same input) is the same.
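For context, here is a minimal sketch (Python/boto3; the model and config names are hypothetical) of how an EI accelerator is attached: it lives in the endpoint configuration on the server side, so the client code that calls invokeEndpoint stays unchanged either way.
import boto3

sm = boto3.client("sagemaker")

# The accelerator is declared per production variant; nothing changes
# in the client that invokes the endpoint.
sm.create_endpoint_config(
    EndpointConfigName="my-tf-config-ei",    # hypothetical
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-tf-model",          # hypothetical
        "InstanceType": "ml.m5d.xlarge",
        "InitialInstanceCount": 8,
        "AcceleratorType": "ml.eia2.xlarge", # attaches Elastic Inference
    }],
)
One thing worth checking: the accelerator is only exercised if the model is served by an EI-enabled container image; if it isn't, identical CloudWatch numbers with and without the accelerator are one plausible outcome.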

Related

How to create alerts using CloudWatch for Spring Boot metrics?

I have a Spring Boot microservice (Spring Cloud Gateway) deployed on AWS EKS. It sends metrics to CloudWatch, but I can't figure out how to define an alert on an aggregate metric.
I added a custom tag ("instanceId") which provides the instance identifier of the pod:
@Bean
public MeterRegistryCustomizer<CloudWatchMeterRegistry> cloudWatchMetricsCommonTags() {
    return (registry) -> registry.config().commonTags("instanceId", EC2MetadataUtils.getInstanceId());
}
This query provides the slowest request:
SELECT MAX("spring.cloud.gateway.requests.max") FROM "tas-gpp-gateway"
But it seems to be impossible to define an alert on it. I found this issue, but I can't provide all possible metric dimensions, because outcome, httpMethod, and instanceId have many different values. In fact, according to the documentation, you should:
provide a set of metrics
define a math expression using the metrics
create an alert based on the math expression
But this is not an option, because I can't list all the possible dimensions of the metric spring.cloud.gateway.requests.max.
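One possible workaround, assuming CloudWatch Metrics Insights alarms are available in your account and region: create the alarm directly on the query with boto3, so the individual dimension values never have to be enumerated (the alarm name and threshold below are hypothetical).
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on a Metrics Insights query; the query itself aggregates across
# all dimension values (outcome, httpMethod, instanceId, ...).
cloudwatch.put_metric_alarm(
    AlarmName="gateway-slowest-request",  # hypothetical
    Metrics=[{
        "Id": "q1",
        "Expression": 'SELECT MAX("spring.cloud.gateway.requests.max") FROM "tas-gpp-gateway"',
        "Period": 300,
        "ReturnData": True,
    }],
    ComparisonOperator="GreaterThanThreshold",
    Threshold=5.0,  # hypothetical threshold, in the metric's unit
    EvaluationPeriods=3,
)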

Boto3 put object results in botocore.exceptions.ReadTimeoutError

I have set up two Ceph clusters, each with a RADOS gateway on one of its nodes.
What I'm trying to achieve is to transfer all objects from a bucket "A", reachable through an endpoint in my cluster "1", to a bucket "B", reachable through another endpoint in my cluster "2". It doesn't really matter for my issue, but at least you understand the context.
I created a script in Python using the boto3 module.
The script is really simple; I just wanted to put an object in a bucket.
The relevant part is shown below:
import boto3

s3 = boto3.resource('s3',
                    endpoint_url=credentials['endpoint_url'],
                    aws_access_key_id=credentials['access_key'],
                    aws_secret_access_key=credentials['secret_key'],
                    use_ssl=False)
s3.Object('my-bucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
(hello.txt just contains a word)
Let's say this script runs from a node (which is the radosgw endpoint node) in my cluster "1". It works well when endpoint_url points to the node I'm running the script from, but it does not work when I try to reach my other endpoint (the radosgw located on another node, in my cluster "2").
I got this error:
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL
The weird thing is that I can create a bucket without any error:
s3_src.create_bucket(Bucket=bucket_name)
s3_dest.create_bucket(Bucket=bucket_name)
I can even list the buckets of my two endpoints.
Do you have any idea why I can do pretty much everything except put a single object through my second endpoint?
I hope this makes sense.
Ultimately, I found that the issue was not related to boto but to the Ceph pool that contains my data.
The bucket pool was healthy, which is why I could create my buckets, whereas the data pool was unhealthy, hence the issue when I tried to put an object into a bucket.
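As a side note, during this kind of debugging it can help to raise the client-side read timeout and disable retries, to tell a slow-but-healthy backend apart from one that never answers; a sketch (the timeout values are arbitrary):
import boto3
from botocore.config import Config

# A longer read timeout and no retries make the failure mode clearer
# when the backend (e.g. an unhealthy data pool) is the suspect.
s3 = boto3.resource('s3',
                    endpoint_url=credentials['endpoint_url'],
                    aws_access_key_id=credentials['access_key'],
                    aws_secret_access_key=credentials['secret_key'],
                    use_ssl=False,
                    config=Config(read_timeout=120, retries={'max_attempts': 0}))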

Selenium Grid: Node API?

The problem:
I want to run Selenium Grid on AWS and would like to use their dynamic scaling. On scale-down, it will just terminate an instance, which means a node can disappear just like that. Not the behaviour I would like, but using scripts or lifecycle hooks I can try to make sure that no sessions on the node are active before it is terminated.
Seems like I can hit this API to disconnect the node from the hub: http://NODE-IP:5555/selenium-server/driver/?cmd=shutDownSeleniumServer
Ideally, I need an API on the node itself to gather data about session activity.
Alternatives? Session logs?
Note:
This answer is valid only for the Selenium 3.x series (3.14.1, which as of today is the last of the builds in the Selenium 3 series). The Selenium 4 grid architecture is a completely different one, and as such this answer will not necessarily be relevant for the Selenium 4 grid (it's yet to be released).
A couple of things. What you are asking for sounds like you need a sort of self-healing mechanism. This is not available in the plain vanilla Selenium Grid.
A Selenium node doesn't have the capability to track the sessions that are running within it.
You need to build all of this at the Selenium Hub (which is where all this information resides).
On a high level, you would need to do the following
Build a custom proxy by extending org.openqa.grid.selenium.proxy.DefaultRemoteProxy, which would have the following capabilities:
Add an API which, when used, would mark the proxy as quiesced (meaning the node has been marked for maintenance and will no longer accept any new session requests).
Override getNewSession(Map<String, Object> requestedCapability) such that it first checks whether a node is quiesced and facilitates a new session only if it is not.
Build a custom servlet which, when invoked, can do the following:
Given a node, use the API built in 1.1 to mark the node as quiesced.
Return the list of nodes that don't have any sessions running on them. If you build your servlet by extending org.openqa.grid.web.servlet.RegistryBasedServlet, within your servlet you should be able to get the list of free node URLs by doing something like the below:
List<RemoteProxy> freeProxies =
    StreamSupport.stream(getRegistry().getAllProxies().spliterator(), false)
        .filter(remoteProxy -> !remoteProxy.isBusy())
        .collect(Collectors.toList());
List<URL> urls =
    freeProxies.stream().map(RemoteProxy::getRemoteHost).collect(Collectors.toList());
Now that we have a custom hub enabled with this cleanup functionality, you can first invoke the 2.1 endpoint to mark nodes for shutdown, then keep polling the 2.2 endpoint to retrieve the IP-and-port combinations of the nodes that no longer have any test sessions, and then invoke http://NODE-IP:5555/selenium-server/driver/?cmd=shutDownSeleniumServer on them.
That, on a high level, can do what you are looking for.
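To make that concrete, here is an illustrative sketch in Python; the two hub servlet paths are hypothetical placeholders for the custom servlets described above, while the node shutdown URL is the standard Selenium 3 one.
import time
import requests

HUB = "http://hub-host:4444"   # hypothetical hub address
NODE = "http://NODE-IP:5555"   # node scheduled for termination

# 1. Mark the node as quiesced via the custom servlet (hypothetical path).
requests.get(f"{HUB}/grid/admin/QuiesceServlet", params={"node": NODE})

# 2. Poll the custom servlet (hypothetical path) until the node is free.
while NODE not in requests.get(f"{HUB}/grid/admin/FreeNodesServlet").json():
    time.sleep(10)

# 3. Shut the node down via the standard Selenium 3 endpoint.
requests.get(f"{NODE}/selenium-server/driver/",
             params={"cmd": "shutDownSeleniumServer"})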
Some useful links which can help you get oriented on this (all of the provided links are blog posts that I wrote at various points in time):
Self healing grid - https://rationaleemotions.wordpress.com/2013/01/28/building-a-self-maintaining-grid-environment/
Building a custom proxy - https://rationaleemotions.github.io/gridopadesham/CUSTOM_PROXY.html
Building a custom servlet for the hub - https://rationaleemotions.github.io/gridopadesham/CUSTOM_SERVLETS.html

Integration and Unit testing Nifi process groups

I have a few NiFi process groups which I want to run integration tests on before promoting to production. The issue is that I can't seem to find any documentation on how to do so.
Data Provenance seems like a promising tool to accomplish what I want; however, over the course of the flowfile's lifecycle, data is published to/from Kafka or the file system. As a result, the flowfile UUID changes, so I cannot query for it using the nifi-api.
Additionally, I know that NiFi offers a TestRunner library for running tests; however, this seems to apply only to processors/process groups constructed via code, not via the UI.
Does anyone know of a tool, framework, or pattern for integration and unit testing of NiFi process groups? Ideally it would be a solution where you can programmatically compare the input/output of a processor or process group without modifying the existing workflow.
With the introduction of the Apache NiFi Registry, we have seen users promote flows from a development/sandbox environment to a test/QE environment where existing "test harness" flows surround the "flow under test", so that they can send repeatable and deterministic data (or an anonymized sample of real production data) through the flow and compare the results to an expected value.
As you point out, there is a TestRunner class and a whole testing framework provided for unit tests. While it can be difficult to manually translate a UI-constructed flow to the programmatic construction, you could also create something like a translator to accept a flow template or flow.xml.gz file and convert it into something processable by the test framework.
Maybe plumber will help you with flow testing.
We also wanted to test whole NiFi flows, not just single processors, so we created this library and decided to open-source it.
Simple example in Scala:
// read flow previously exported from NiFi
val template = TemplateDeserializer.deserialize(this.getClass.getClassLoader.getResourceAsStream("exported-flow.xml"))
val flow = NifiTemplateFlowFactory(template).create()
// enqueue some data to any processor
flow.enqueueByName("csv row,12,another value,true", "CsvParserProcessor")
// run entire flow once
flow.run(1)
// get the results from any processor
val records = flow.resultsFromProcessorRelation("LastProcessorInFlow", "successRelation")
records should have size 1
This library is still under development, so improvements and ideas are welcome! :)

Recommended way of profiling distributed TensorFlow

Currently, I am using the TensorFlow Estimator API to train my model. I am using distributed training with roughly 20-50 workers and 5-30 parameter servers, depending on the training data size. Since I do not have access to the session, I cannot use RunMetadata with a full trace to look at the Chrome trace. I see there are two other approaches:
1) tf.profiler.profile
2) tf.train.ProfilerHook
I am specifically using
tf.estimator.train_and_evaluate(estimator, train_spec, test_spec)
where my estimator is a prebuilt estimator.
Can someone give me some guidance on the recommended way to profile an estimator (concrete code samples and code pointers would be really helpful, since I am very new to TensorFlow)? Do the two approaches provide different information, or do they serve the same purpose? Also, is one recommended over the other?
There are two things you can try:
ProfilerContext
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/profiler/profile_context.py
Example usage:
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir') as pctx:
    train_loop()
ProfilerService
https://www.tensorflow.org/tensorboard/r2/tensorboard_profiling_keras
You can start a ProfilerServer via tf.python.eager.profiler.start_profiler_server(port) on all workers and parameter servers, and use TensorBoard to capture a profile.
Note that this is a very new feature; you may want to use tf-nightly.
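A minimal sketch of the worker-side part (module path as exposed in tf-nightly builds of that era; the port is arbitrary):
# Run this on every worker and parameter server.
from tensorflow.python.eager import profiler

# Starts a gRPC server that TensorBoard's "Capture Profile" dialog
# can connect to.
profiler.start_profiler_server(6009)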
TensorFlow has recently added a way to sample multiple workers.
Please have a look at the API:
https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/client/trace?version=nightly
The parameter of the above API that is important in this context is:
service_addr: A comma delimited string of gRPC addresses of the
workers to profile. e.g. service_addr='grpc://localhost:6009'
service_addr='grpc://10.0.0.2:8466,grpc://10.0.0.3:8466'
service_addr='grpc://localhost:12345,grpc://localhost:23456'
Also, please look at the API,
https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/ProfilerOptions?version=nightly
The parameter of the above API that is important in this context is:
delay_ms: Requests for all hosts to start profiling at a timestamp that is delay_ms away from the current time. delay_ms is in milliseconds. If zero, each host will start profiling immediately upon receiving the request. Default value is None, allowing the profiler to guess the best value.
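Putting the two APIs together, a sketch (assuming TF 2.3+ or tf-nightly; the worker addresses and log directory are placeholders):
import tensorflow as tf

# Each worker must already be running a profiler server, e.g. via
# tf.profiler.experimental.server.start(6009) at startup.
options = tf.profiler.experimental.ProfilerOptions(delay_ms=5000)

# Sample all listed workers at (roughly) the same timestamp and write
# the trace where TensorBoard can read it.
tf.profiler.experimental.client.trace(
    service_addr="grpc://10.0.0.2:6009,grpc://10.0.0.3:6009",  # placeholders
    logdir="/tmp/profile-logs",                                # placeholder
    duration_ms=2000,
    options=options,
)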