Getting the JOB_ID variable in Pentaho Data Integration

When you log a job in Pentaho Data Integration, one of the fields is ID_JOB, described as "the batch id - a unique number increased by one for each run of a job."
Can I get this ID? I can see it in my logging tables, but I want to set up a transformation to get it. I think there might be a runtime variable that holds an ID for the running job.
I've tried using the Get Variables and Get System Info transformation steps to no avail. I am a new Kettle user.

The batch IDs of the current transformation and of the parent job are available in the Get System Info step. In PDI 5.0 they come just before the "command line arguments" entries, but the order changes between versions, so you may have to look them up.

You need to create the variable yourself to house the parent job batch ID. The way to do this is to add another transformation as the first step in your job; it sets the variable and makes it available to all subsequent transformations and job steps you call from the job. Steps:
1) As you have probably already done, enable logging on the job
JOB SETTINGS -> SETTINGS -> CHECK: PASS BATCH ID
JOB SETTINGS -> LOG -> ENABLE LOGGING, DEFINE DATABASE LOG TABLE, ENABLE: ID_JOB FIELD
2) Add a new transformation, call it "Set Variables", as the first step after the Start of your job
3) Create a variable, accessible to all your other transformations, that contains the value of the current job's batch ID
3a) ADD A GET SYSTEM INFO STEP. GIVE YOUR FIELD A NAME - "parentJobBatchID" - AND SET ITS TYPE TO "parent job batch ID"
3b) ADD A SET VARIABLES STEP AFTER THE GET SYSTEM INFO STEP. DRAW A HOP FROM THE GET SYSTEM INFO STEP TO THE SET VARIABLES STEP AS ITS MAIN OUTPUT
3c) IN THE SET VARIABLES STEP SET FIELDNAME: "parentJobBatchID", SET A VARIABLE NAME - "myJobBatchID", VARIABLE SCOPE TYPE "Valid in the Java Virtual Machine", LEAVE DEFAULT VALUE EMPTY
And that's it. After that, you can go back to your job and add subsequent transformations and steps, and they will all be able to access the variable you defined by referencing ${myJobBatchID} (or whatever you chose to name it).
IT IS IMPORTANT THAT THE SET VARIABLES STEP IS THE ONLY THING THAT HAPPENS IN THE "Set Variables" TRANSFORMATION, AND THAT ANYTHING ELSE THAT NEEDS THE VARIABLE IS ADDED ONLY TO OTHER TRANSFORMATIONS CALLED BY THE JOB. This is because transformations in Pentaho are multi-threaded and you cannot guarantee that the Set Variables step will run before other activities in the same transformation. The parent job, however, executes sequentially, so you can be assured that once the first transformation of the job establishes the variable containing the parent job batch ID, all other transformations and job steps will be able to use it.
You can test that it worked before you add other functionality by adding a "Write To Log" step after the Set Variables transformation that writes the variable ${myJobBatchID} to the log for you to view and confirm it is working.
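For example (the table and column names here are purely illustrative), a Table Input step in one of the later transformations could filter on the batch ID, provided "Replace variables in script?" is ticked in that step:
SELECT * FROM stg_orders WHERE batch_id = ${myJobBatchID}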

Related

How can I pass the result of a shell script to a variable in a job

I have a Pentaho job that uses a shell script to process some data.
But I found that if I want to use the result of the script, I have to write it to a file and then read the file to assign variables.
Is there an easier way to use the result of a script step in the following steps?
In Pentaho you cannot create a variable and use it in the same transformation.
Basically you just need to create one ktr and one job:
the first one (the ktr) is in charge of performing the task and saving the variable with a Set Variables step (using the root-job scope option)
variables created in the first ktr are also available at job level
If you want to use the variable in another ktr:
the second ktr should use a Get Variables step at the beginning to retrieve the variable created in the previous transformation
transformations should be executed sequentially using a job
In your case, you should run the shell script in the first ktr, turn the result into a variable, and save it with Set Variables. The job that invokes the ktrs will then be able to use the variable created in the first ktr, roughly as in the sketch below.
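A rough sketch of that layout (file, step, and variable names are only illustrative; the "Execute a process" step is one way to capture a command's output as a field):
job.kjb:      Start -> set_var.ktr -> use_var.ktr
set_var.ktr:  Generate Rows (one row holding the command, e.g. sh /path/myscript.sh) -> Execute a process (output captured in field script_out) -> Set Variables (script_out -> SCRIPT_OUT, scope: valid in the root job)
use_var.ktr:  Get Variables (${SCRIPT_OUT}) -> the rest of the processing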

Kettle Change connection used at runtime

I need, at runtime, to change which connection is used by a table input step.
I have 3 connections defined: STG, DWH, DM.
I want to choose at runtime between them.
I can't create a new connection with parameters for server name, database name, etc. I must use the existing connections.
I wish I could enter a variable such as ${my_connection} in the connection box, but the field cannot be edited.
Any suggestions?
Instead of trying to use a variable in the step's connection selector, use variables for the Host Name and Database Name in the connection configuration.
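For example (the parameter names are purely illustrative), a single connection defined with ${DB_HOST} and ${DB_NAME} in its Host Name and Database Name fields, with those names declared in the transformation's Parameters section, can be pointed at STG, DWH or DM at runtime:
.../pan.sh -file:"/home/user/sample.ktr" -param:DB_HOST=dwh-server -param:DB_NAME=dwh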
EDIT:
You can pass a variable into the KTR and test it with a Switch / Case step that calls a Transformation Executor. The executed KTR contains your Table Input and a Copy rows to result step, and the result rows are captured after the Transformation Executor. You'll need 3 different KTRs, each with the Table Input step that will run for the row routed to it by the Switch / Case step.
If I'm not clear or you need further explanation, I can perhaps produce an example.
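A rough sketch of that layout (file and step names are invented for illustration):
main.ktr:      Get Variables (${my_connection}) -> Switch / Case on the connection name -> one Transformation Executor per case (read_stg.ktr, read_dwh.ktr, read_dm.ktr) -> steps that pick up the captured result rows
read_stg.ktr / read_dwh.ktr / read_dm.ktr:  Table Input (on STG / DWH / DM respectively) -> Copy rows to result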

Control-M job parameters are not being passed while running the job

I created a Control-M job with all the required parameters (named PARM1, PARM2, PARM3). I am able to order the job, but under the Monitoring section, when I try to run the job, the parameters I set while creating it are not passed to the script.
May I know why the parameters are not being passed to the script?
Could you specify what kind of Job you are designing?
The job's type could be determining this behavior.
It is important to consider this aspect. For example, if the job type is OS Script, Control-M will send the values of the 3 parameters at the indicated positions. However, if the job is of type OS Command, you must specify those parameters yourself in the command line to execute.
For example, in the What specification of the command you should write:
thecommand %%PARM1 %%PARM2 %%PARM3
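For the OS Script case, since the values arrive at the indicated positions, the script itself can read them as ordinary positional arguments. A minimal sketch (the echo is purely illustrative):
#!/bin/sh
# PARM1, PARM2 and PARM3 arrive as the script's positional arguments
echo "PARM1=$1 PARM2=$2 PARM3=$3"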

How to execute a Job Executor step X times

Introduction
To keep it simple, let's imagine a simple transformation.
This transformation gets an input of 4 rows from a Data Grid step.
The stream passes through a Job Executor step, referencing a simple job that contains a Write To Log entry.
Expectations
I would like the simple job to execute 4 times, which means 4 log messages.
Results
It turns out that the Job Executor step launches the simple job only once, instead of 4 times: I only get one log message.
Hints
The documentation of the Job Executor component specifies the following :
By default the specified job will be executed once for each input row.
This is parametrized in the "Row grouping" tab, with the following field:
The number of rows to send to the job: after every X rows the job will be executed and these X rows will be passed to the job.
Answer
The step actually works well: an input of X rows will execute the "Job Executor" step X times. The fact is I wasn't able to see it in the logs.
To verify it, I added a simple transformation inside the job called by the "Job Executor" step, which writes into a text file. After checking this file, it appeared that the "Job Executor" had indeed executed X times.
Research
Trying to understand why I didn't get X log messages from the X executions of the "Job Executor", I added a "Wait for" entry inside the initial simple job. Waiting two seconds allowed me to see the X log messages appear during the execution.
Hope this helps because it's pretty tricky. Please feel free to provide further details.
A little late to the party, as a side note:
Pentaho is a set of programs (Spoon, Kettle, Chef, Pan, Kitchen). The engine is Kettle, and everything inside a transformation is started in parallel. This makes log retrieval a challenging task for Spoon (the UI). You don't actually need a Wait for entry; try outputting the logs to a file (by specifying a log file in the Job Executor entry's properties) and you'll see everything in place.
Sometimes we need to give Spoon a little bit of time to get everything in place; personally, that's why I recommend not relying on Spoon's Execution Results logging tab. It is better to output the logs to a DB or to files.
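The same idea applies when the job runs outside Spoon; for instance (the paths and level are illustrative), Kitchen can write the full log to a file:
.../kitchen.sh -file:"/home/user/parent_job.kjb" -level:Basic -logfile:"/tmp/parent_job.log"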

Table output name from command line in pentaho kettle

There is a case in my ETL where I am trying to take the "table output" name from the command line. The table name does not correspond to any streaming field's name. Is there any way to get this done in Pentaho Kettle?
Pentaho DI is a metadata-based tool. I assume you are trying to pass the output table name from the command line like below:
.../pan.sh -file:"/home/user/sample.ktr" -param:table_output=SOMETABLE
Assuming the command above is what you are trying to do:
First, change the transformation settings of sample.ktr (just an example) and add the parameter name "table_output" to the Parameters section.
Next, in the Table Output step, use this parameter in the format ${table_output} in place of the table name. This should solve your query.
In case you are passing the parameters to a job: as mentioned above, the first part (adding the parameters) remains the same.
Next, for the separate transformation (.ktr) file used inside the job, double-click the ktr entry (from the job file) and you will find a Parameters section; add the parameters there.
Thirdly, inside the .ktr file, repeat the step from above (the first section) and use either a Set Variables step or the Table Output step directly. A Set Variables step will ensure that you have the parameter available across the entire job; it mostly depends on your requirement.
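For the job case, the call from the command line is analogous, for example (file name and value are illustrative):
.../kitchen.sh -file:"/home/user/sample.kjb" -param:table_output=SOMETABLE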
Hope it helps :)
This should give you an idea of how to do it: since transformations are just XML, you can read the metadata from them. Basically, you find the Table Output step and set its table name to a variable, in this case "TABLE".
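As a rough illustration (the file path is just an example, and this assumes xmllint is available), the table name of that step can be read straight out of the .ktr; the <table> element is also where a variable such as ${TABLE} would be placed:
# print the <table> element of the Table Output step from the transformation XML
xmllint --xpath '//step[type="TableOutput"]/table/text()' /home/user/sample.ktr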