Databricks with CloudWatch metrics without InstanceId dimension

I have jobs running on job clusters, and I want to send metrics to CloudWatch.
I set up the CloudWatch agent following this guide.
But the issue is that I can't create useful metric dashboards and alarms, because there is always an InstanceId dimension, and the InstanceId is different on every job run.
If you check the link above, you will find an init script; the part of the JSON that configures the CloudWatch agent is:
{
  ...
  "append_dimensions": {
    "InstanceId": "${aws:InstanceId}"
  }
}
The documentation says that if I remove append_dimensions, then the hostname becomes the dimension, and again, the hostname is derived from the instance's IP address and changes every run, so it is not much more useful.
Does anyone have experience with Databricks on AWS and monitoring/alerting with CloudWatch? If so, how did you resolve this issue?
I would like to set a dimension that is specific to each executor and stays the same across runs.
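Something along these lines is what I have in mind, as a sketch only; the DB_CLUSTER_NAME environment variable, the "JobCluster" dimension name, the namespace, and the mem measurement are just illustrative assumptions, not taken from the guide:

# Hypothetical init-script fragment: leave out the top-level append_dimensions
# block from the guide (so InstanceId is not attached) and attach a static,
# job-level dimension on the collected metric instead.
cat > /tmp/cw-agent-config.json <<EOF
{
  "metrics": {
    "namespace": "DatabricksJobs",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "append_dimensions": {
          "JobCluster": "${DB_CLUSTER_NAME}"
        }
      }
    }
  }
}
EOF

# Point the agent at the file (standard agent-ctl invocation).
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -c file:/tmp/cw-agent-config.json -s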

Related

Is it possible to get the log stream into a CloudWatch metric filter?

I want to create a CloudWatch metric filter so that I can count the number of log entries containing the error line
Connection State changed to LOST
I have a CloudWatch log group called "nifi-app.log" with 3 log streams (one for each EC2 instance, named `i-xxxxxxxxxxx`, `i-yyyyyyyyyy`, etc.).
Ideally, I would like to extract a metric nifi_connection_state_lost_count with an InstanceId dimension whose value is the log stream name.
From what I gather from the documentation, it is possible to extract dimensions from the log contents themselves, but I do not see any way to refer to, for example, the log stream name.
The log entries look like this:
2022-03-15 09:44:47,811 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener#3fe60bf7 Connection State changed to LOST
I know that I can extract fields from those log entries with [date,level,xxx,yy,zz], but what I need is not in the log entry itself; it's part of the log entry metadata (the log stream name).
The log files are NiFi log files and do NOT have the instance name, hostname, or anything like that printed in each line, and I would rather not change the log format, as that would require a restart of the NiFi cluster and I'm not even sure how to do it.
So, is it possible to get the log stream name as a dimension for a CloudWatch metric filter in some other way?
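For reference, the plain count (without the per-stream dimension I'm after) is easy enough to set up; the filter name and namespace below are just placeholders:

# Count-only metric filter on the log group from above; no dimensions.
aws logs put-metric-filter \
  --log-group-name "nifi-app.log" \
  --filter-name "nifi-connection-state-lost" \
  --filter-pattern '"Connection State changed to LOST"' \
  --metric-transformations \
      metricName=nifi_connection_state_lost_count,metricNamespace=NiFi,metricValue=1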

How to monitor Databricks jobs using the CLI or the Databricks API to get information about all jobs

I want to monitor the status of the jobs to see whether they are running over time or have failed. If you have a script or any reference, please help me with this. Thanks.
You can use the databricks runs list command to list all the job runs. This will list all runs and their current status: RUNNING/FAILED/SUCCESS/TERMINATED.
If you want to see whether a job is running over, you would then use the databricks runs get --run-id command to get the metadata for the run. This returns JSON from which you can parse out the start_time and end_time.
# Lists job runs.
databricks runs list
# Gets the metadata about a run in json form
databricks runs get --run-id 1234
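As a minimal sketch of the parsing step (jq and the run ID 1234 are just illustrative assumptions):
# Extract the timestamps mentioned above from the run metadata with jq.
databricks runs get --run-id 1234 | jq '{start_time: .start_time, end_time: .end_time}'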
Hope this helps get you on track!

Getting the JOB_ID variable in Pentaho Data Integration

When you log a job in Pentaho Data Integration, one of the fields is ID_JOB, described as "the batch id- a unique number increased by one for each run of a job."
Can I get this ID? I can see it in my logging tables, but I want to set up a transformation to get it. I think there might be a runtime variable that holds an ID for the running job.
I've tried using the Get Variables and Get System Info transformation steps to no avail. I am a new Kettle user.
The batch IDs of the current transformation and of the parent job are available in the Get System Info step. In PDI 5.0 they come before the "command line arguments" entries, but the order changes between versions, so you may have to look them up.
You need to create the variable yourself to house the parent job batch ID. The way to do this is to add another transformation as the first step in your job that sets the variable and makes it available to all the other subsequent transformations and job steps that you'll call from the job. Steps:
1) As you have probably already done, enable logging on the job
JOB SETTINGS -> SETTINGS -> CHECK: PASS BATCH ID
JOB SETTINGS -> LOG -> ENABLE LOGGING, DEFINE DATABASE LOG TABLE, ENABLE: ID_JOB FIELD
2) Add a new transformation, call it "Set Variables", as the first step after the start of your job
3) Create a variable that will be accessible to all your other transformations and that contains the value of the current job's batch ID
3a) ADD A GET SYSTEM INFO STEP. GIVE A NAME TO YOUR FIELD - "parentJobBatchID" AND TYPE OF "parent job batch ID"
3b) ADD A SET VARIABLES STEP AFTER THE GET SYSTEM INFO STEP. DRAW A HOP FROM THE GET SYSTEM INFO STEP TO THE SET VARIABLES STEP AS ITS MAIN OUTPUT
3c) IN THE SET VARIABLES STEP SET FIELDNAME: "parentJobBatchID", SET A VARIABLE NAME - "myJobBatchID", VARIABLE SCOPE TYPE "Valid in the Java Virtual Machine", LEAVE DEFAULT VALUE EMPTY
And that's it. After that, you can go back to your job and add subsequent transformations and steps and they will all be able to access the variable you defined by substituting ${myJobBatchID} or whatever you chose to name it.
IT IS IMPORTANT THAT THE SET VARIABLES STEP IS THE ONLY THING THAT HAPPENS IN THE "Set Variables" TRANSFORMATION, AND THAT ANYTHING ELSE THAT NEEDS TO ACCESS THAT VARIABLE IS ADDED ONLY TO OTHER TRANSFORMATIONS CALLED BY THE JOB. This is because transformations in Pentaho are multi-threaded, and you cannot guarantee that the Set Variables step will happen before other activities in that transformation. The parent job, however, executes sequentially, so once you establish the variable containing the parent job batch ID in the first transformation of the job, all other transformations and job steps will be able to use that variable.
You can test that it worked before you add other functionality by adding a "Write To Log" step after the Set Variables transformation that writes the variable ${myJobBatchID} to the log for you to view and confirm it is working.

Pig step execution details

I am a newbie to Pig.
I have written a small script in Pig in which I first load data from two different tables and then right outer join the two tables; later I also join tables for two different sets of data. It works fine. But I want to see the steps of execution, like in which step my data is loaded, so that I can note the time needed for loading, and later the details of the join step, like how much time it takes for that many records to be joined.
Basically, I want to know which part of my Pig script is taking longer to run, so that I can further optimize the script.
Is there any way we could println within the script and find out which steps have executed or have started to execute?
Through the jobtracker details link I could not get much info; I could just see that a mapper and a reducer are running, but ideally I would like to know which part of the script the mapper is running for, and I could not find that.
For example, for a Hive job run we can see in the jobtracker details link which step is currently being executed.
Any information will be really helpful.
Thanks in advance.
I'd suggest you have a look at the following:
Pig's Progress Notification Listener
Penny: this is a monitoring tool, but I'm afraid it hasn't been updated recently (e.g. it won't compile for Pig 0.12.0 unless you make some code changes)
Twitter's Ambrose project. https://github.com/twitter/ambrose
On the other hand, after executing the script you can see detailed statistics about the execution time of each alias (see: Job Stats (time in seconds)).
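A simple way to keep those statistics around for comparison between runs, as a rough sketch (the script and log file names are illustrative):

# Run the script and keep the console output, which ends with the
# "Job Stats (time in seconds)" table.
pig -x mapreduce myscript.pig 2>&1 | tee pig_run.log

# Pull the per-alias timing table back out later.
grep -A 20 "Job Stats" pig_run.log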
Have a look at the EXPLAIN operator. This doesn't give you real-time stats as your code is executing, but it should give you enough information about the MapReduce plan your script generates that you'll be able to match up the MR jobs with the steps in your script.
Also, while your script is running you can inspect the configuration of the Hadoop job. Look at the variables "pig.alias" and "pig.job.feature". These tell you, respectively, which of your aliases (tables/relations) is involved in that job and what Pig operations are being used (e.g., HASH_JOIN for a JOIN step, SAMPLER or ORDER BY for an ORDER BY step, and so on). This information is also available in the job stats that are output to the console upon completion.

How to see output in Amazon EMR/S3?

I am new to Amazon services and tried to run an application in Amazon EMR.
For that I have followed the steps as:
1) Created the Hive script, which contains a CREATE TABLE statement, a LOAD DATA statement that loads some file into the table, and a SELECT * FROM command.
2) Created the S3 bucket and uploaded the objects into it: the Hive script and the file to load into the table.
3) Then created the Job Flow (using the Sample Hive Program), giving the input, output, and script paths (like s3n://bucketname/script.q, s3n://bucketname/input.txt, s3n://bucketname/out/). I didn't create the out directory; I think it gets created automatically.
4) The Job Flow then started to run, and after some time I saw the states STARTING, BOOTSTRAPPING, RUNNING, and SHUT DOWN.
5) During the SHUT DOWN state, it terminated automatically, showing a FAILED status for the shutdown.
Then on S3, I didn't see the out directory. How do I see the output? I saw directories like daemons, nodes, etc.
And also, how can I see the data from HDFS in Amazon EMR?
The output path that you specified in step 3 should contain your results (from your description, that is s3n://bucketname/out/).
If it doesn't, something went wrong with your Hive script. If your Hive job failed, you will find information about the failure/exception in the jobtracker log. The jobtracker log exists under <s3 log location>/daemons/<master instance name>/hadoop-hadoop-jobtracker-<some Amazon internal IP>.log
Only one file in your logs directory will have its S3 key in the above format. This file will contain any exceptions that may have happened. You probably want to concentrate on the bottom end of the file.
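If you just want to check from the command line, a rough sketch (it assumes the AWS CLI is configured; the bucket name is the one from your question and <s3 log location> still needs to be replaced with your actual log path):

# List whatever the job flow wrote under the output prefix.
aws s3 ls s3://bucketname/out/

# Browse the log location to find the jobtracker log described above.
aws s3 ls "s3://<s3 log location>/daemons/" --recursive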