Is it possible to get the log stream into a CloudWatch metric filter? - amazon-cloudwatch

I want to create a CloudWatch metric filter so that I can count the number of log entries containing the error line
Connection State changed to LOST
I have a CloudWatch Log Group called "nifi-app.log" with 3 log streams (one for each EC2 instance, named 'i-xxxxxxxxxxx', 'i-yyyyyyyyyy', etc.).
Ideally I would want to extract a metric nifi_connection_state_lost_count with a dimension InstanceId whose value is the log stream name.
From what I gather from the documentation, it is possible to extract dimensions from the log contents themselves, but I do not see any way to refer to, for example, the log stream name.
The log entries look like this
2022-03-15 09:44:47,811 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener#3fe60bf7 Connection State changed to LOST
I know that I can extract fields from those log entries with [date,level,xxx,yy,zz], but what I need is not in the log entry itself; it is part of the log entry metadata (the log stream name).
The log files are NiFi log files and do NOT have the instance name, hostname, or anything like that printed in each log line. I would rather not try to change the log format, as it would require a restart of the NiFi cluster, and I'm not even sure how to change it.
So, is it possible to get the log stream name as a dimension for a CloudWatch metric filter in some other way?
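For context, the count-only part of this (without the InstanceId dimension) can be set up roughly as in the sketch below, using the AWS SDK for Java v2; the metric namespace and filter name here are just illustrative placeholders, not values from my setup.

import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;
import software.amazon.awssdk.services.cloudwatchlogs.model.MetricTransformation;
import software.amazon.awssdk.services.cloudwatchlogs.model.PutMetricFilterRequest;

public class CreateNifiLostFilter {
    public static void main(String[] args) {
        try (CloudWatchLogsClient logs = CloudWatchLogsClient.create()) {
            // Emit a value of 1 for every log event that matches the filter pattern.
            MetricTransformation transformation = MetricTransformation.builder()
                    .metricName("nifi_connection_state_lost_count")
                    .metricNamespace("NiFi")          // placeholder namespace
                    .metricValue("1")
                    .build();

            logs.putMetricFilter(PutMetricFilterRequest.builder()
                    .logGroupName("nifi-app.log")
                    .filterName("nifi-connection-state-lost")   // placeholder filter name
                    // Match log events containing the literal error phrase.
                    .filterPattern("\"Connection State changed to LOST\"")
                    .metricTransformations(transformation)
                    .build());
        }
    }
}

This only produces a total count across the whole log group; the open question is how to split it per log stream.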

Related

How to generate logs in Pentaho to capture whether each file was put to SFTP or not, for a job which loops to put dynamic files into SFTP

I have a Pentaho job which has 10 IDs as input in the first transformation. In the next job I have a SQL query which needs to loop through each ID from the input.
So, I am using Copy rows to result in the first transformation and Get rows from result in the next job, and I selected 'Execute for every input row' in the job properties of the second job to loop through every ID from the first transformation. In the next steps, the data from the SQL query in the second job is stored as files on the local machine dynamically, in different folders based on a specific field from the telesis query.
Next, I need to send these files from the different folders of the local machine to the SFTP server through the 'SFTP Put' step.
Now I want to track the logs with the following columns:
1. Number of files loaded onto the local machine for each ID from the input.
2. Number of files loaded to SFTP successfully from the local machine, to check whether all files loaded onto the local machine were sent successfully to SFTP or not.
3. If a file is not sent to SFTP for any reason, I need the name of the file which failed to load into SFTP.
Thanks in advance..

How to check whether a consumer group already exists in Redis?

Currently I am looking for an elegant solution to check whether a consumer group on a Redis stream already exists.
I have a few modules which connect to the same stream and read data from it. But they can start in a different order, and if the consumer group has not been created yet, each module tries to create it.
If the first module has already created the group, the others get an error, according to the documentation.
From the documentation:
If the specified consumer group already exists, the command returns a -BUSYGROUP error.
I would like to avoid this error.
I use Jedis client for work with Redis.
I know there is the XINFO command (which can return the list of groups), but it doesn't work when Redis is started in cluster mode (which can be one of my configurations).
There is no other way; as you covered in your question, there are two options:
1. XGROUP CREATE and catch the error in case the group is already there.
2. XINFO STREAM and look for the group, but that won't be atomic, and a parallel group create might happen right after you get the info back.
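For option 1, a minimal sketch with Jedis could look like the following; the stream key and group name are placeholders, and the last argument (makeStream = true) also creates the stream via MKSTREAM if it does not exist yet.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.StreamEntryID;
import redis.clients.jedis.exceptions.JedisDataException;

public class ConsumerGroupSetup {

    // Creates the consumer group if needed and ignores the BUSYGROUP error
    // that XGROUP CREATE returns when the group already exists.
    public static void ensureGroup(Jedis jedis, String streamKey, String groupName) {
        try {
            jedis.xgroupCreate(streamKey, groupName, StreamEntryID.LAST_ENTRY, true);
        } catch (JedisDataException e) {
            if (e.getMessage() == null || !e.getMessage().startsWith("BUSYGROUP")) {
                throw e; // some other error; do not swallow it
            }
            // The group was already created by another module; nothing to do.
        }
    }
}

Each module can then call ensureGroup(...) at startup, regardless of the order in which the modules come up.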

Azure PowerShell command to get the count of records in an Azure Data Lake file

I have a set of files in an Azure Data Lake Store folder. Is there any simple PowerShell command to get the count of records in a file? I would like to do this without using the Get-AzureRmDataLakeStoreItemContent command on the file item, as the files are gigabytes in size. Using this command on big files gives the error below.
Error:
Get-AzureRmDataLakeStoreItemContent : The remaining data to preview is greater than 1048576 bytes. Please specify a
length or use the Force parameter to preview the entire file. The length of the file that would have been previewed:
749319688
Azure Data Lake operates at the file/folder level. The concept of a record really depends on how an application interprets it. For instance, in one case the file may contain CSV lines, in another a set of JSON objects. In some cases files contain binary data. Therefore, there is no way at the file system level to get the count of records.
The best way to get this information is to submit a job, such as a U-SQL job in Azure Data Lake Analytics. The script will be really simple: an EXTRACT statement followed by a COUNT aggregation and an OUTPUT statement.
If you prefer Spark or Hadoop, here is a Stack Overflow question that discusses that: Finding total number of lines in hdfs distributed file using command line
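As a rough illustration of the Spark route, a line count over a single file could look like the sketch below; the adl:// path is a made-up placeholder, the cluster is assumed to already have credentials for the store, and counting lines matches "records" only for line-delimited formats such as CSV.

import org.apache.spark.sql.SparkSession;

public class RecordCount {
    public static void main(String[] args) {
        // Placeholder path; replace with the real Data Lake Store URI.
        String path = "adl://yourstore.azuredatalakestore.net/folder/yourfile.csv";

        SparkSession spark = SparkSession.builder()
                .appName("record-count")
                .getOrCreate();

        // Read the file as plain text and count the lines.
        long records = spark.read().textFile(path).count();
        System.out.println("Records in " + path + ": " + records);

        spark.stop();
    }
}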

Getting the JOB_ID variable in Pentaho Data Integration

When you log a job in Pentaho Data Integration, one of the fields is ID_JOB, described as "the batch id - a unique number increased by one for each run of a job."
Can I get this ID? I can see it in my logging tables, but I want to set up a transformation to get it. I think there might be a runtime variable that holds an ID for the running job.
I've tried using the Get Variables and Get System Info transformation steps to no avail. I am a new Kettle user.
You have the batch IDs of the current transformation and of the parent job available in the Get System Info step. On PDI 5.0 they come before the "command line arguments", but the order changes with each version, so you may have to look it up.
You need to create the variable yourself to house the parent job batch ID. The way to do this is to add another transformation as the first step in your job that sets the variable and makes it available to all the other subsequent transformations and job steps that you'll call from the job. Steps:
1) As you have probably already done, enable logging on the job
JOB SETTINGS -> SETTINGS -> CHECK: PASS BATCH ID
JOB SETTINGS -> LOG -> ENABLE LOGGING, DEFINE DATABASE LOG TABLE, ENABLE: ID_JOB FIELD
2) Add a new transformation, call it "Set Variables", as the first step after the start of your job
3) Create a variable that will be accessible to all your other transformations and that contains the value of the current job's batch ID
3a) ADD A GET SYSTEM INFO STEP. GIVE A NAME TO YOUR FIELD - "parentJobBatchID" AND TYPE OF "parent job batch ID"
3b) ADD A SET VARIABLES STEP AFTER THE GET SYSTEM INFO STEP. DRAW A HOP FROM THE GET SYSTEM INFO STEP TO THE SET VARIABLES STEP AS ITS MAIN OUTPUT
3c) IN THE SET VARIABLES STEP SET FIELDNAME: "parentJobBatchID", SET A VARIABLE NAME - "myJobBatchID", VARIABLE SCOPE TYPE "Valid in the Java Virtual Machine", LEAVE DEFAULT VALUE EMPTY
And that's it. After that, you can go back to your job and add subsequent transformations and steps and they will all be able to access the variable you defined by substituting ${myJobBatchID} or whatever you chose to name it.
IT IS IMPORTANT THAT THE SET VARIABLES STEP IS THE ONLY THING THAT HAPPENS IN THE "Set Variables" TRANSFORMATION AND ANYTHING ELSE YOU WANT TO ACCESS THAT VARIABLE IS ADDED ONLY TO OTHER TRANSFORMATIONS CALLED BY THE JOB. This is because transformations in Pentaho are multi-threaded and you cannot guarantee that the Set Variables step will happen before other activities in that transformation. The parent job, however, executes sequentially, so you can be assured that once you establish the variable containing the parent job batch ID in the first transformation of the job, all other transformations and job steps will be able to use that variable.
You can test that it worked before you add other functionality by adding a "Write To Log" step after the Set Variables transformation that writes the variable ${myJobBatchID} to the log for you to view and confirm it is working.

How to see output in Amazon EMR/S3?

I am new to Amazon Web Services and tried to run an application in Amazon EMR.
For that I have followed the steps as:
1) Created the Hive script, which contains a create table statement, a load data statement in Hive with some file, and a select * from command.
2) Created the S3 bucket and loaded the objects into it: the Hive script and the file to load into the table.
3) Then created the job flow (using the Sample Hive Program). Gave the input, output, and script paths (like s3n://bucketname/script.q, s3n://bucketname/input.txt, s3n://bucketname/out/). Didn't create the out directory; I think it will get created automatically.
4) Then the job flow started to run, and after some time I saw the states STARTING, BOOTSTRAPPING, RUNNING, and SHUT DOWN.
5) During the SHUT DOWN state, it got terminated automatically, showing a FAILED status for SHUT DOWN.
Then on S3, I didn't see the out directory. How do I see the output? I saw directories like daemons, nodes, etc.
And also, how do I see the data from HDFS in Amazon EMR?
The output path that you specified in step 3 should contain your results (from your description, it is s3n://bucketname/out/).
If it doesn't, something went wrong with your Hive script. If your Hive job failed, you will find information about the failure/exception in the jobtracker log. The jobtracker log exists under <s3 log location>/daemons/<master instance name>/hadoop-hadoop-jobtracker-<some Amazon internal IP>.log
Only one file in your logs directory would have its S3 key in the above format. This file will contain any exceptions that may have happened. You probably want to concentrate on the bottom end of the file.