I am performing a batch operation using a batch scope in Mule 4. I am using an Sfdc connector inside my batch step to query the Ids. The batch runs in sets of, say, 200 records for a total of 1000 inputs. I need to get all 1000 Ids returned by the Sfdc query outside my batch scope. I tried a few ways to access the payload coming out of the batch step but failed to get the payload outside the scope. I see that the payload is accessible only inside the process stage. Is there any way to achieve this?
My requirement is to loop over the same set of files using multiple pipelines,
e.g.:
pipeline 1 consumes file 1 and a certain output is generated;
now pipeline 2 has to be triggered based on the output of pipeline 1, otherwise we should skip the run;
if pipeline 2 runs, then pipeline 3 has to be triggered based on the output of pipeline 2, otherwise we should skip the run.
Is there any way to do this in ADF without nesting if-else?
You can simply chain multiple pipelines using a combination of the Execute Pipeline activity and the If Condition activity. The Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline. For the false condition, execute a different pipeline (or nothing, which skips the run). Optionally, inside the child pipeline you can add an Execute Pipeline activity that refers back to the previous caller, which lets you loop.
Caution: this can turn into a runaway loop if the conditions are not configured correctly.
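For illustration only, here is a rough sketch of that layout built with the azure-mgmt-datafactory Python SDK instead of the ADF designer. The pipeline names, the "pipeline1Result" variable, and the condition expression are assumptions you would replace with whatever signal your pipeline 1 actually produces; pipeline 3 would be chained the same way inside the True branch of a second If Condition.

```python
# Sketch: chain pipelines with Execute Pipeline + If Condition activities.
# All names and the condition expression below are illustrative assumptions.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    ExecutePipelineActivity,
    Expression,
    IfConditionActivity,
    PipelineReference,
    PipelineResource,
    VariableSpecification,
)

# Run pipeline1 and wait for it so its result can drive the condition.
run_p1 = ExecutePipelineActivity(
    name="RunPipeline1",
    pipeline=PipelineReference(reference_name="pipeline1"),
    wait_on_completion=True,
)

# Only trigger pipeline2 when pipeline1 signalled that a run is needed.
# The expression is a placeholder: in practice you would set this variable
# (or a flag/lookup value) based on pipeline1's output before evaluating it.
run_p2_if_needed = IfConditionActivity(
    name="CheckPipeline1Output",
    depends_on=[ActivityDependency(activity="RunPipeline1",
                                   dependency_conditions=["Succeeded"])],
    expression=Expression(value="@equals(variables('pipeline1Result'), 'run')"),
    if_true_activities=[
        ExecutePipelineActivity(
            name="RunPipeline2",
            pipeline=PipelineReference(reference_name="pipeline2"),
            wait_on_completion=True,
        )
    ],
    # No if_false_activities: the false branch simply skips the run.
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
client.pipelines.create_or_update(
    "<resource-group>",
    "<factory-name>",
    "MasterPipeline",
    PipelineResource(
        activities=[run_p1, run_p2_if_needed],
        variables={"pipeline1Result": VariableSpecification(type="String")},
    ),
)
```

The same structure can be built directly in the ADF UI by dropping an If Condition activity after the Execute Pipeline activity and placing the next Execute Pipeline activity inside its True branch, which avoids nesting if-else in a single pipeline.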
Introduction
To keep it simple, let's imagine a simple transformation.
This transformation gets an input of 4 rows, from a Data Grid step.
The stream passes through a Job Executor step, referencing a simple job that contains a Write Log entry.
Expectations
I would expect the simple job to execute 4 times, which means 4 log messages.
Results
It turns out that the Job Executor step launches the simple job only once, instead of 4 times: I only get one log message.
Hints
The documentation of the Job Executor component specifies the following:
By default the specified job will be executed once for each input row.
This is parametrized in the "Row grouping" tab, with the following field:
The number of rows to send to the job: after every X rows the job will be executed and these X rows will be passed to the job.
Answer
The step actually works well: an input of X rows will execute the "Job Executor" step X times. The fact is that I wasn't able to see it in the logs.
To verify it, I added a simple transformation inside the "Job Executor" step which writes into a text file. After checking this file, it was clear that the "Job Executor" step was indeed executed X times.
Research
Trying to understand why I didn't get X log messages after the X executions of the "Job Executor" step, I added a "Wait for" entry inside the initial simple job. Adding a two-second wait finally allowed me to see the X log messages appear during the execution.
Hope this helps because it's pretty tricky. Please feel free to provide further details.
A little late to the party, as a side note:
Pentaho is a set of programs (Spoon, Kettle, Chef, Pan, Kitchen); the engine is Kettle, and everything inside a transformation is started in parallel. This makes log retrieval a challenging task for Spoon (the UI). You don't actually need a "Wait for" entry: try outputting the logs to a file (by specifying a log file in the Job Executor step properties) and you will see everything in place.
Sometimes we need to give Spoon a little bit of time to get everything in place; that's why I personally recommend not relying on Spoon's Execution Results logging tab. It is better to output the logs to a database or to files.
I have an inbound database endpoint that selects records with a condition and returns 500 rows as a result set. Now I want to insert those columns into another DB. I used batch processing and have two batch steps: one selecting the data and one inserting it.
Now, if any error occurs while selecting the data I have to send a mail, and if inserting fails I need to log it in a different place. So how can I create two different exception handlers, one for each step? I am not able to use a catch exception strategy inside a batch process. For now I am using a flow reference inside the batch step and handling the exception there. Please suggest a better way. I am using: batch execute -> batch -> batch step -> flow reference -> exception handling.
<batch:job name="BOMTable_DataLoader">
    <batch:process-records>
        <batch:step name="SelectData">
            <flow-ref name="InputDatabase" doc:name="InputDatabase"/>
        </batch:step>
        <batch:step name="InsertData">
            <batch:commit size="1000" doc:name="Batch Commit">
                <flow-ref name="InserDatabase" doc:name="UpdateDataBase"/>
            </batch:commit>
        </batch:step>
    </batch:process-records>
    <batch:on-complete>
    </batch:on-complete>
</batch:job>
For only 500 rows I would not use batch processing; you could simply build, with an iteration, a single SQL script containing all the inserts and execute it in one shot via the database connector.
If you want to know which insert fails, you could use a foreach processor to iterate over the rows and insert them one by one so that you can control the output. It will be slower, but it all depends on your needs.
Just use the right type of query depending on your needs, and remember that you can use MEL in your queries. Pay attention to SQL injection if your input comes from untrusted sources (if you use a parameterized query you are on the safe side, because parameters are escaped).
More on the db connector
https://docs.mulesoft.com/mule-user-guide/v/3.7/database-connector
After your question update, I can confirm that the only way to handle exceptions is the way you are already using. The fact that the module in Mule is called "Batch processing" does not mean that every time you have something that looks like a batch job you need to use that component. When you have a more complex case, just don't use it: use standard Mule components such as a VM endpoint for asynchronous execution and normal tools like foreach, or even better a collection splitter, and get all the freedom and control over exception handling.
I am copying 50 million records to Amazon DynamoDB using a Hive script. The script failed after running for 2 days with an item-size-exceeded exception.
Now if I restart the script, it will start the insertions again from the first record. Is there a way to say "insert only those records which are not already in DynamoDB"?
You can use conditional writes to write the item only if the specified attributes are not equal to the values you provide. This is done by using a ConditionExpression in the PutItem request. However, a failed conditional write still consumes write capacity (emphasis mine), so this may not even be the best option for you:
If a ConditionExpression fails during a conditional write, DynamoDB will still consume one write capacity unit from the table. A failed conditional write will return a ConditionalCheckFailedException instead of the expected response from the write operation. For this reason, you will not receive any information about the write capacity unit that was consumed. However, you can view the ConsumedWriteCapacityUnits metric for the table in Amazon CloudWatch to determine the provisioned write capacity that was consumed from the table.
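As a minimal illustration (not part of the original answer), this is roughly how the conditional write looks with boto3; the table name, the "id" key attribute, and the item fields are assumptions, so adapt them to your schema before wiring this into your load or backfill script.

```python
# Sketch: skip items that already exist, using a PutItem conditional write.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("my-target-table")  # assumed table name

def put_if_absent(item):
    """Write the item only if no item with the same partition key exists yet."""
    try:
        table.put_item(
            Item=item,
            # Condition on the partition key attribute ("id" is an assumption).
            ConditionExpression="attribute_not_exists(id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            # Already loaded by the previous run; note the failed write
            # still consumed a write capacity unit, as quoted above.
            return False
        raise

put_if_absent({"id": "12345", "payload": "example"})
```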
When you log a job in Pentaho Data Integration, one of the fields is ID_JOB, described as "the batch id- a unique number increased by one for each run of a job."
Can I get this ID? I can see it in my logging tables, but I want to set up a transformation to get it. I think there might be a runtime variable that holds an ID for the running job.
I've tried using the Get Variables and Get System Info transformation steps to no avail. I am a new Kettle user.
The batch IDs of the current transformation and of the parent job are available in the Get System Info step. In PDI 5.0 they come just before the "command line arguments" entries, but the order changes with each version, so you may have to look them up.
You need to create the variable yourself to house the parent job batch ID. The way to do this is to add another transformation as the first step in your job that sets the variable and makes it available to all the other subsequent transformations and job steps that you'll call from the job. Steps:
1) As you have probably already done, enable logging on the job
JOB SETTINGS -> SETTINGS -> CHECK: PASS BATCH ID
JOB SETTINGS -> LOG -> ENABLE LOGGING, DEFINE DATABASE LOG TABLE, ENABLE: ID_JOB FIELD
2) Add a new transformation, call it "Set Variables", as the first step after the Start entry of your job
3) Create a variable that will be accessible to all your other transformations and that contains the value of the current job's batch ID
3a) ADD A GET SYSTEM INFO STEP. GIVE A NAME TO YOUR FIELD - "parentJobBatchID" AND TYPE OF "parent job batch ID"
3b) ADD A SET VARIABLES STEP AFTER THE GET SYSTEM INFO STEP. DRAW A HOP FROM THE GET SYSTEM INFO STEP TO THE SET VARIABLES STEP AS ITS MAIN OUTPUT
3c) IN THE SET VARIABLES STEP SET FIELDNAME: "parentJobBatchID", SET A VARIABLE NAME - "myJobBatchID", VARIABLE SCOPE TYPE "Valid in the Java Virtual Machine", LEAVE DEFAULT VALUE EMPTY
And that's it. After that, you can go back to your job and add subsequent transformations and steps and they will all be able to access the variable you defined by substituting ${myJobBatchID} or whatever you chose to name it.
IT IS IMPORTANT THAT THE SET VARIABLES STEP IS THE ONLY THING THAT HAPPENS IN THE "Set Variables" TRANSFORMATION, AND THAT ANYTHING ELSE THAT NEEDS TO ACCESS THAT VARIABLE IS ADDED ONLY TO OTHER TRANSFORMATIONS CALLED BY THE JOB. This is because transformations in Pentaho are multi-threaded and you cannot guarantee that the Set Variables step will happen before other activities in that transformation. The parent job, however, executes sequentially, so you can be assured that once you establish the variable containing the parent job batch ID in the first transformation of the job, all other transformations and job steps will be able to use that variable.
You can test that it worked before you add other functionality by adding a "Write To Log" step after the Set Variables transformation that writes the variable ${myJobBatchID} to the log for you to view and confirm it is working.