The right metrics to monitor AWS Glue Jobs - amazon-cloudwatch

What are the right Glue metrics for monitoring the status of Glue jobs?
I want to know the status of the last run (OK or KO) and the number of successful/failed executions.
PS: To explore the metrics, I'm using Datadog.
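As a hedged aside (not an answer about which CloudWatch/Datadog metrics to use), the last run state and the success/failure counts can also be read directly with boto3; the job name below is a placeholder.
    import boto3
    glue = boto3.client("glue")
    runs = glue.get_job_runs(JobName="my-glue-job")["JobRuns"]   # a page of recent job runs
    last_state = runs[0]["JobRunState"] if runs else None        # e.g. SUCCEEDED or FAILED
    succeeded = sum(1 for r in runs if r["JobRunState"] == "SUCCEEDED")
    failed = sum(1 for r in runs if r["JobRunState"] == "FAILED")
    print(last_state, succeeded, failed)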

Related

How to push AWS ECS Fargate CloudWatch logs to a UI so users can see the real-time logs of their long-running tasks

I am creating an app where long-running tasks are executed in ECS Fargate and logs are pushed to CloudWatch. Now I am looking for a way to give users the ability to see those logs live in the UI while their tasks are running.
I am thinking of the approach below:
Save logs temporarily in DynamoDB.
A DynamoDB stream with batching triggers a Lambda.
The Lambda triggers an AWS AppSync mutation with a None data source (see the sketch after the link below).
In the UI, the client subscribes to that mutation to get real-time updates (the number of updates depends on the batch size; for example, a batch of 5 means 5 log lines).
https://aws.amazon.com/premiumsupport/knowledge-center/appsync-notify-subscribers-real-time/
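A minimal sketch of steps 2-3 under stated assumptions, not a definitive implementation: the AppSync endpoint, API-key auth, the item shape (taskId/line attributes), and the publishLogLine mutation backed by a None data source are all placeholders.
    import json, os, urllib.request
    APPSYNC_URL = os.environ["APPSYNC_URL"]          # e.g. https://<id>.appsync-api.<region>.amazonaws.com/graphql
    APPSYNC_API_KEY = os.environ["APPSYNC_API_KEY"]  # assuming API-key auth for brevity
    MUTATION = """
    mutation PublishLogLine($taskId: ID!, $line: String!) {
      publishLogLine(taskId: $taskId, line: $line) { taskId line }
    }
    """
    def handler(event, context):
        # One invocation receives a batch of DynamoDB stream records (a batch of 5 = 5 log lines).
        for record in event.get("Records", []):
            image = record["dynamodb"]["NewImage"]   # hypothetical item shape: taskId and line attributes
            body = json.dumps({
                "query": MUTATION,
                "variables": {"taskId": image["taskId"]["S"], "line": image["line"]["S"]},
            }).encode("utf-8")
            req = urllib.request.Request(
                APPSYNC_URL, data=body,
                headers={"Content-Type": "application/json", "x-api-key": APPSYNC_API_KEY},
            )
            urllib.request.urlopen(req)              # subscribers to publishLogLine receive the line
        return {"forwarded": len(event.get("Records", []))}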
Are there any other techniques or methods that I could consider?
Why not use CloudWatch's built-in ability to export logs to an S3 bucket and add SNS so clients can choose which topic they want to follow for the log? That removes the extra DynamoDB.

Are there CloudWatch alert and notification metrics from Performance Insights?

I need to know if there are alert metrics in CloudWatch for RDS Performance Insights.
That is, can I trigger an alarm whenever there is high load or waits in SQL Server?
You may need to read Overview of Monitoring Amazon RDS
Amazon RDS automatically sends metrics to CloudWatch every minute for each active database. You are not charged additionally for Amazon RDS metrics in CloudWatch.
You can watch a single Amazon RDS metric over a specific time period, and perform one or more actions based on the value of the metric relative to a threshold you set.
You can create an alarm in the RDS console and select the metric that is of interest to you.
Amazon RDS Performance Insights recently released a feature that sends key performance metrics from Performance Insights to Amazon CloudWatch. Using this feature, you can set alerts on these metrics.
When Performance Insights is enabled, it automatically sends the following three metrics to CloudWatch:
DBLoad
DBLoadCPU
DBLoadNonCPU
https://aws.amazon.com/blogs/database/set-alarms-on-performance-insights-metrics-using-amazon-cloudwatch/
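For example, a hedged boto3 sketch that raises an alarm on DBLoad; the instance identifier, threshold, and SNS topic ARN are placeholders.
    import boto3
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        AlarmName="rds-high-db-load",
        Namespace="AWS/RDS",
        MetricName="DBLoad",                # average active sessions, published by Performance Insights
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-sqlserver-instance"}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=5,
        Threshold=4.0,                      # e.g. alarm when load stays above the instance's vCPU count
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:dba-alerts"],
    )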

Resource management for Spark jobs on YARN and spark-shell jobs

Our company has a 9-node cluster on Cloudera.
We have 41 long-running Spark Streaming jobs [YARN + cluster mode] and some regular spark-shell jobs scheduled to run daily at 1pm.
All jobs are currently submitted under the user A role [with root permission].
The issue I encountered is that while all 41 Spark Streaming jobs are running, my scheduled jobs are not able to obtain the resources to run.
I have tried the YARN fair scheduler, but the scheduled jobs still do not run.
We expect the Spark Streaming jobs to always be running, but to reduce the resources they occupy whenever other scheduled jobs start.
Please feel free to share your suggestions or possible solutions.
Your Spark Streaming jobs are consuming too many resources for your scheduled jobs to get started. This is either because they are always scaled to a point where there aren't enough resources left for the scheduled jobs, or because they aren't scaling back.
For the case where the streaming jobs aren't scaling back, you could check whether you have dynamic resource allocation enabled for your streaming jobs. One way of checking is via the spark shell using spark.sparkContext.getConf.get("spark.streaming.dynamicAllocation.enabled"). If dynamic allocation is enabled, then you could look at reducing the minimum resources for those jobs.
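A minimal PySpark sketch of that direction, assuming Spark Streaming (DStreams) and an external shuffle service running on the NodeManagers; the queue name and executor bounds are illustrative only.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    conf = (
        SparkConf()
        .setAppName("streaming-job")
        .set("spark.streaming.dynamicAllocation.enabled", "true")    # scale streaming executors up and down
        .set("spark.streaming.dynamicAllocation.minExecutors", "1")
        .set("spark.streaming.dynamicAllocation.maxExecutors", "4")  # cap so the daily jobs can get containers
        .set("spark.shuffle.service.enabled", "true")                # typically needed so executors can be released safely
        .set("spark.yarn.queue", "streaming")                        # dedicated fair-scheduler queue for streaming jobs
    )
    spark = SparkSession.builder.config(conf=conf).getOrCreate()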

Can CloudWatch send metrics in less than 1 min?

AWS documentation states that CloudWatch shares metrics every minute. Is it possible to get the metrics checked every 10 seconds, or at least more often than once a minute? If an instance goes down, do I have to wait a full minute to know that it is down and to spin up a new one in its place?
I presume that you are referring to Amazon EC2 metrics that are collected by Amazon CloudWatch.
No, you cannot configure these metrics to be collected more often. By default, Amazon EC2 metrics are collected every five minutes. You can activate detailed monitoring to obtain the metrics every one minute.
However, Elastic Load Balancing health checks can check the health of an instance more often, and it will only send traffic to instances that are responding correctly to health checks.
Amazon EC2 Auto Scaling can be configured to use Elastic Load Balancing health checks to determine the health of instances. If an instance is identified as unhealthy, Auto Scaling will automatically replace the instance. However, this can take several minutes to be identified and have a new instance operational. Thus, it is recommended to always be running a minimum of two instances.
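As a small sketch, the detailed monitoring mentioned above can be switched on per instance with boto3; the instance ID is a placeholder.
    import boto3
    ec2 = boto3.client("ec2")
    ec2.monitor_instances(InstanceIds=["i-0123456789abcdef0"])      # enable 1-minute detailed monitoring (extra cost)
    # ec2.unmonitor_instances(InstanceIds=["i-0123456789abcdef0"])  # revert to 5-minute basic monitoring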

Running multiple jobs at a time on GCML

I tried running multiple jobs on Google Cloud ML some time back, and it failed with an error along the lines of "Allowed number of instances exceeded". However, when I tried it again, I was able to run multiple training jobs at once.
How is the price for this calculated?
Is there a way (or a need) to queue an ML training/retraining job if another is already running, considering the same project is used for both?
Cloud ML's pricing is described here. Quotas are described here. Every Cloud ML job consumes a certain number of ML units, depending on the job's tier. There's a limit to how many ML units can be consumed concurrently for a project. You can increase this quota if you need to be able to run more jobs simultaneously.
Server-side queuing of jobs in Cloud ML does not exist as of now.
If your jobs need more ML units than your quota allows, you either need to increase your quota or implement queuing on your side.
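A rough sketch of such client-side queuing, assuming the google-api-python-client Cloud ML Engine v1 API; the project ID, concurrency limit, and polling interval are placeholders, and pagination of the job list is omitted.
    import time
    from googleapiclient import discovery
    PROJECT = "projects/my-project"
    MAX_CONCURRENT = 2                        # keep below what your ML-unit quota supports
    ml = discovery.build("ml", "v1")
    def running_job_count():
        resp = ml.projects().jobs().list(parent=PROJECT).execute()
        return sum(1 for job in resp.get("jobs", []) if job.get("state") == "RUNNING")
    def submit_when_slot_free(job_spec):
        # Poll until a slot is free, then submit the training job.
        while running_job_count() >= MAX_CONCURRENT:
            time.sleep(60)
        return ml.projects().jobs().create(parent=PROJECT, body=job_spec).execute()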